Designing Databases for Historical Research
Historical Research Handbook: Designing Databases for Historical Research
E. Entity relationship modelling
E2. Entity Relationship Modelling step-by-step
Entity Relationship Modelling (ERM) is where you model the information from your sources to fit into a structure that will form the basis of your database design. The product of ERM is an Entity Relational Diagram (ERD), which will depict three components: entities (tables), attributes (fields) and the relationships between them.
Stage 1: Determine the purpose of the database
This stage is always the starting point of the ERM process, and is especially important if you are adopting the Method-oriented approach to your design, but it is a vital step in designing any database. No database exists without a purpose: with a Method-oriented database it is crucial that you know exactly what you want to do with your data before you design the database. At this stage you will need to make your decisions about what information you want to keep and what you want to discard, and you will need to be prepared to abide by the consequences of this decision throughout the lifecycle of your database project. This point deserves to be made emphatically: although it is always possible to retrofit the design of your database to include information that you had initially decided to discard, it is not always a trivial matter to do so, particularly if you have to enter another phase of data entry to collect the new information.
Stage 2: List entities
Once you know what you want your database to do, you need to divide your anticipated information into discrete subjects: each subject, or entity, will evolve into a separate table.[1] We split the entities into separate tables for the purposes of efficiency, for avoiding ambiguity, and because this allows us the maximum flexibility for querying once the data are in.
This stage of the ERM process sounds deceptively simple, but is in fact probably the most difficult step of the whole process, and it is certainly worth reading up on the subject (see the Further Reading section).
For example, consider a research project that was investigating political elections in eighteenth-century Bristol, and the sources consisted of a collection of poll books which recorded the votes cast by the local electorates in the communities where elections took place (image E2i):
E2i –An example page of an 18th century poll book[2]
With sources such as these we might pursue a research question which was something like: ‘Analyse the geographic and economic determinants of voting in early 18th century Bristol’. With a question like this we would be interested in geography, economic status and voting patterns in relation to a group of individuals. In terms of entities, we might conclude that there is only one: people, actually, more precisely, we would be considering voters, which would lead us to the position of deciding that we would need a table into which we would enter our information about voters.
Alternatively if we were using probate materials for our research, and wanted to create a database into which we could enter information extracted from wills, we would need to consider the entities from this source. We might conceive of our entities – our discrete subjects – as breaking down into ‘document’, with each will being treated as a different document; ‘person’, containing information about different types of people and their roles – testator, recipient, executor etc.; ‘bequest’ with a range of types of associated information; and ‘object’, being the object of a bequest. If our research was interested in the material culture of a period or place, this latter entity would be particularly important, whereas if the project was concerned only with networks of people and social interrelations, the ‘object’ entity might not be necessary.
In both these examples the possibility of these entities being identified has been stressed, because the definition of entities is as much a product of the purpose of the database as it is of the information contained in the sources. As mentioned at the beginning of this Handbook, no two databases will be designed in the same way, as no two databases will be built for the same purposes – different designers may well identify different entities based upon their unique appreciation of their research and their sources. And this is why to a large extent this step of the ERM process is the most difficult!
However it is perhaps worth considering the inclusion of three commonly chosen entities in the design of your database:
- People - with a related entity of ‘role’ (being the reason why they are present in the source)
- Document – where archival and bibliographical material can be entered (and thus enabling the tracking of every piece of data in the database to its source)
- Event – a slightly more abstract entity, one which describes an instance of whatever it is your source records (a trial, a taxation assessment, an election etc.) and where information about dates can be recorded.
Exercise
Assume you have a research project that is examining the demographic profile of families in a particular parish, and your principal source will be ecclesiastical registers of baptisms, marriages and burials. List the entities from this source that would need to be included in your database design.[3]
Stage 3: Identify the relationships between the entities
Bearing in mind the nature of relationships between different elements of information (see Section D), this step of the ERM process requires you to identify which of your entities are related, and what type of relationship exists between them. This is an exercise in abstract logic, and will take considerable practice: in addition, quite often, this stage will require revisiting stage 2 and redefining the entities you originally chose.
If we return to the database of wills mentioned in stage 2 with the entities ‘document’, ‘person’, ‘bequest’ and ‘object’, we would need to unpick the nature of the relationships between these entities. We might logically decide that the relationships would look something like this (the arrowheads depict the ‘many’ side of a one-to-many relationship (see Section D):
E2ii –Example of relationships identified between entities (wills)
A single document (will) can contain information about more than one person, and also about more than one bequest, whilst a bequest can include information about more than one object, so all of these relationships are one-to-many.
Exercise
Identify the relationships between the entities chosen for the parish register database in Stage 2.
Stage 4: Investigate and rectify problems
This stage is fairly self-explanatory. It is possible to spot problems with the incipient design even at this relatively early point in the ERM process, and if they exist it is better to do so here than after investing work in the later stages.
Look out in particular for:
- Relationships which do not appear to be one-to-many: remember you cannot have entities related by a many-to-many relationship, and whilst you can have them related through a one-to-one relationship, it may be worth rethinking the two entities involved (see Section D)
- Redundant relationships: if entities can be linked in more than one way, you should work out which link should be kept and which should be discarded – if Table A is related to Table B, and Table B is related to Table C, then Tables A and C are already by definition related, and do not need a ‘direct’ relationship to exist between the two
Stage 5: List all attributes associated with each entity, and identify keys
This stage involves listing the precise attributes of each entity that has been identified in the previous stages of the ERM process, by deciding on the fields that should occur in each table. Each field is to contain one piece of information about the subject of the entity, and that one piece will be a single aspect of information that can be known about the entity (see Section C5). Once you have listed the attributes for each entity, you must then identify which fields will act as the primary and foreign keys in each table, remembering that relationships between tables do not exist between tables, but between a specific field in one table and a specific field in the related table (see Section D).
E2iii –Example of attributes and keys identified within entities (wills)[4]
You should always as a matter of course add a generic ‘Notes’ field into each table, with the memo datatype, as it is likely to prove invaluable at moments of crisis when entering data (see Section F)!
Stage 6: Construct the Entity Relationship Diagram (ERD)
Once you have completed stage 5, you are in a position to create the ERD for your database design, which will resemble the heavily simplified example above (image E2iii).
Exercise
Construct an ERD for the parish register database, ensuring that you have identified your entities and their attributes, the relationships between the entities and the fields which act as keys.
[1] In fact they may evolve into separate groups of related tables, but for the moment this does not need to be considered.
[2] An Exact List of the Votes of the Freeholders and Freemen of the City and County of Bristol taken at the Election of Members of Parliament (Bristol, 1722) p.19.
[3] If you are not familiar with parish registers as sources, it is possible to find examples of transcripts (of varying quality) online. One such set of transcripts, for the county of Hampshire, can be found at http://www.knightroots.co.uk/parishes.htm (accessed 23/03/2011).
[4] Terms in black are entity (table names), those in red are primary keys, those in green are foreign keys and those in grey are the remaining attributes/fields, some of which pertain to the database’s Source layer, and some to its Standardisation layer. It is very important to note that this diagram is entirely illustrative, and is not intended to be prescriptive or definitive in any way!