Designing Databases for Historical Research
C. Fundamentals of database design
C4. Database ‘layers’
Databases often involve several stages of work before they can be fully utilised for analysis. This is because well designed databases arrange data into several layers. The ‘Three Layer’ model of database design serves to illustrate how the organisation of different types of data within a database can dramatically improve the analytical potential of that database. The Standardisation Layer in particular is one that historians should invest time and effort into developing, and practical methods of doing this will be addressed in Section F.
C4i –The ‘Three Layer’ model of database design
The basic premise of the Three Layer model is to create different kinds of data. The first kind is data that is derived from the source and which is entered into the database in the form in which it appears originally; the second kind is data that is adapted to obey more closely the strictures imposed by the database to enable the retrieval, querying and analysis processes to be performed more efficiently and accurately. This second type of data is standardised in one manner or another, where for example variations of spelling of a particular entity are codified into a single form to make it easier to find them in the database.
Before we discuss the specific layers, it is important to make the point that you will need to keep the different layers separate: that is, you should always be able to tell whether a piece of data is from the source or whether it has been standardised in some way by you. In terms of database structure, every field will always belong to one layer only, although tables can contain fields that belong to more than one layer.
The Source layer
This layer comprises the tables and fields (see Section C5) that contain information taken from the source and the forms of the data which have been entered verbatim from the original. No adaptation of the original information has taken place in this layer. This allows us to retrieve information that shows what the source actually said.
The Standardisation layer
This layer comprises the tables and fields which contain data that you have adapted to make analysis easier, and will include data where spelling has been standardised, where abbreviations have been expanded, where a single dating system is used and so on (see Section F). There are two opportunities for creating this layer, either at the database design stage, or after all the data entry has been completed. If the former approach has been chosen, then during data entry you must be rigorously consistent in the way that you enter your standardised forms (e.g. always spelling ‘John’ in the same way), and you must document how you have standardised. If done after data entry as post-processing, you can create your standardised values globally across the whole body of data, but this can be time consuming when dealing with lots of information that needs to be standardised. If possible, the former approach is almost always the better option to take.
The Interpretation (also known as Enrichment) layer
This layer is in many ways an optional one, whereas the other two are not. This layer comprises data and materials which have been drawn into the database from elsewhere other than your sources, in order to make the database and the data it contains a more useful and powerful resource. This can consist of classification, interpolation, interpretation and derived variables. It can also include making links between data: two pieces of information when linked make a third new piece of information. It can include reshaping the data in the database to increase usability (for example drawing together dozens of records about an individual in order to create a single encyclopaedic record which contains everything there is to know about that person), and it can include incorporating contextual material drawn from entirely separate (perhaps secondary) resources. Many databases used for personal research do not include an Interpretation layer.
Describing these layers in this way might suggest that they each exist in entirely separate ‘areas’ of the database as discrete collections of data. In most cases this is not true, however. In most cases data belonging to each field will co-exist within tables, but within separate fields within the tables (see Section C5): for example you might create two fields for ‘occupation’ in the same table that records information about people, in one field (belonging to the Source layer) you can record how the occupation is presented in the source, in the second field (belonging to the Standardisation layer) you can record a standardised version of the occupation. The standardised version will be used for querying and analysis, because it will be easier to find (by virtue of the fact it is always represented in the same way).
When moving between layers it is important you are always aware of what belongs to the Source layer and what belongs to the other layers. This will usually be obvious (at least to you, if not to others), but there should be explicit rules defined (and recorded) explaining the demarcation between the layers. In other words the layers should be managed so that it is always possible to backtrack or undo a piece of interpretation that appears in the database in the event of you changing your mind about something. Similarly you should always be able to get from the result of a query which draws upon standardised data back to the original source data. Much of this layer management will occur through the structure of the database and will not require active intervention on your part, but it is worth remembering at all times that any given piece of data belongs to only one or other of your layers.
These are the principles of database layers, but as with everything else about database design, there is a degree of flexibility about how steadfastly you hold to them. For example, employing a source layer and a standardisation layer can mean entering lots of data twice in two different forms – the original version of an occupation, and the standardised version, for example – which will clearly slow down the data entry process. There is a judgement to be made by every historian creating a database as to how far these different layers should be used, and specifically, whether or not some pieces of information need both a source layer and a standardised layer. If you are unlikely to use a piece of information for analysis, then perhaps it does not need to have a standardised version; if you do not need to have a piece of information in its original form, then perhaps its source layer version is not required. These decisions need to be made carefully, as they impact directly upon how long your data entry will take and how easily you will be able to perform your analysis.