Designing Databases for Historical Research

Historical Research Handbook: Designing Databases for Historical Research

F. Problems facing the historian

F2. Problematic information

There are certain categories of historical information which are habitually problematic, and unfortunately these tend to be those subjects that often constitute analytical units, namely geography, chronology and orthography.


Geographical information

The problem with geographical information as it occurs in historical sources is that the boundaries of administrative units overlap and change over time, so that the same physical location can belong to different counties, parishes, wards, precincts and so on depending upon the date of the source being consulted. If your sources cover a long period of time, you will need to be aware of the implications that any boundary changes in that period may have for your data. This is especially true if you are recording data in a hierarchical fashion: for example, if you have a field in a table for ‘Parish’ and another for ‘County’, and every record is given a value in each field. If the parish of St Harry Potter is situated in the county of Hogwartshire at the beginning of the 17th century, then records connected with this parish would have these two values entered into the respective fields in the table. If, however, administrative changes in the 18th century alter the county boundaries so that St Harry Potter now belongs to the county of Elsewhereshire, then later records will have St Harry Potter in the parish field and Elsewhereshire in the county field. Whilst this is accurate, it causes a problem for the database: you will have a number of records with the same string in the ‘Parish’ field, which the database will treat as meaning exactly the same thing, but which historically speaking have different meanings at different points in time.

In this instance there are two ways of dealing with the issue. Firstly, you can simply stay aware of the problem, and when running queries on parishes take the ‘County’ field into account as well as the ‘Parish’ field. This will enable you to specify which version of the parish of St Harry Potter you are analysing. Secondly, you could modify the Parish value to specify which version it is, so instead of entering St Harry Potter, you could enter St Harry Potter: Hogwartshire or St Harry Potter: Elsewhereshire into the Parish field. This would make queries simpler to run in this situation, but it would technically break the database rule about ‘atomic values’ (see Section C5, Rule no.9).
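The first approach can be illustrated with a minimal sketch using Python's built-in sqlite3 module. The table name, field names, years and the helper function count_records are all hypothetical, chosen only to mirror the St Harry Potter example; they are not part of any particular database design.

```python
import sqlite3

# In-memory database holding a few invented records for the example parish.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE records ("
    "id INTEGER PRIMARY KEY, parish TEXT, county TEXT, year INTEGER)"
)
conn.executemany(
    "INSERT INTO records (parish, county, year) VALUES (?, ?, ?)",
    [
        ("St Harry Potter", "Hogwartshire", 1605),
        ("St Harry Potter", "Hogwartshire", 1642),
        ("St Harry Potter", "Elsewhereshire", 1731),
    ],
)


def count_records(parish, county=None):
    """Count records for a parish, optionally restricted to one county."""
    if county is None:
        sql = "SELECT COUNT(*) FROM records WHERE parish = ?"
        params = (parish,)
    else:
        sql = "SELECT COUNT(*) FROM records WHERE parish = ? AND county = ?"
        params = (parish, county)
    return conn.execute(sql, params).fetchone()[0]


# Querying on 'Parish' alone merges both historical versions of the parish.
print(count_records("St Harry Potter"))                    # 3
# Adding 'County' separates the two versions.
print(count_records("St Harry Potter", "Hogwartshire"))    # 2
print(count_records("St Harry Potter", "Elsewhereshire"))  # 1
```

Both fields stay atomic here, at the cost of having to remember the extra condition in every query on parishes.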

This particular problem is even more significant when it is not just the geographical boundaries that change, but the actual entities themselves. For example, 17th-century London had over 100 parishes in the early part of the century, many of them absolutely tiny in terms of area and population. After the Great Fire, the opportunity was taken to rationalise the parishes, with the result that many were merged or united, often with the newly created entity retaining the name of one of the pre-Fire parishes, whilst each parish still maintained its own existence for some administrative purposes (e.g. St Martin Ironmonger Lane and St Olave Jewry). Here the problem is not one of changing hierarchy (which parish belongs to which county), but one of meaning (what area or population is the source referring to at this date when it refers to ‘St Martin Ironmonger Lane’?). Various approaches to solving this are used, including those described for the preceding example, but what is most important is that the data makes clear at all times precisely what is meant by the geographical terms you enter into the database.


Chronological/dating information

All of the possible problems created by shifting geographical terminology apply equally to the identification of dates in historical data. This is clearly a more serious issue the further back in history your sources were generated, when calendars and dating systems were more varied and plentiful, and record-keepers had more choice over which dating system to use. The important thing to remember here, as with geography (and indeed everything else entered into the database), is that the database does not recognise meaning. The database will have no concept of when the ‘Friday after the Feast of the Decollation of St John the Baptist in the thirty-first year of Henry III’ was,[1] which means that this date, as a value, cannot be treated chronologically by the database (that is, sorted or queried by date). Regnal years, mayoral years, feast days, the days of fairs and markets and so on, when used to date information in the sources, will need to be converted into a value that uses an actual modern date format. Alongside this there is of course the issue of the shift from the Julian to the Gregorian calendar, so that if your data spans 1752 you will need to convert dates consistently into either the Old Style or the New Style system.[2]
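As a rough illustration of what such a conversion involves, the sketch below works through the example date given above. It rests on several assumptions that are not stated in the text: that the feast of the Decollation falls on 29 August, that Henry III's regnal years are counted from 28 October 1216, and that the source reckons by the Julian calendar. All function names are hypothetical, and results of this kind would normally be checked against a published handbook of dates.

```python
# Sketch only: converts "the Friday after the Feast of the Decollation of
# St John the Baptist, 31 Henry III" into a Julian-calendar year, month, day.

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

# Assumptions supplied for this example (not taken from the handbook text).
ACCESSION = (1216, 10, 28)  # start of Henry III's first regnal year
FEAST = (8, 29)             # Decollation of St John the Baptist (month, day)


def julian_to_jdn(year, month, day):
    """Julian Day Number of a date in the Julian calendar."""
    a = (14 - month) // 12
    y = year + 4800 - a
    m = month + 12 * a - 3
    return day + (153 * m + 2) // 5 + 365 * y + y // 4 - 32083


def jdn_to_julian(jdn):
    """Inverse of julian_to_jdn: returns (year, month, day)."""
    c = jdn + 32082
    d = (4 * c + 3) // 1461
    e = c - (1461 * d) // 4
    m = (5 * e + 2) // 153
    day = e - (153 * m + 2) // 5 + 1
    month = m + 3 - 12 * (m // 10)
    year = d - 4800 + m // 10
    return year, month, day


def year_of_feast(regnal_year, feast=FEAST, accession=ACCESSION):
    """Calendar year in which the feast falls within the given regnal year."""
    start_year = accession[0] + regnal_year - 1
    # A feast earlier in the calendar than the accession day falls in the
    # calendar year after the one in which the regnal year began.
    if feast >= accession[1:]:
        return start_year
    return start_year + 1


def weekday_after(year, month, day, weekday):
    """First occurrence of `weekday` strictly after the given Julian date."""
    jdn = julian_to_jdn(year, month, day)
    target = WEEKDAYS.index(weekday)  # a JDN modulo 7 of 0 is a Monday
    offset = (target - jdn % 7 - 1) % 7 + 1  # between 1 and 7 days ahead
    return jdn_to_julian(jdn + offset)


feast_year = year_of_feast(31)
modern = weekday_after(feast_year, FEAST[0], FEAST[1], "Friday")
print(modern)  # (1247, 8, 30)
```

Under these assumptions the feast itself falls on a Thursday in 1247, so the following Friday is 30 August 1247.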

Do not forget the datatype of the field into which dating information will be entered (see Section C5), bearing in mind that ‘Text’ datatype fields will sort dates alphabetically whereas ‘Date/Time’ datatype fields will sort them chronologically.
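The difference between the two kinds of sorting can be seen in a short Python sketch. The three dates are invented, the day/month/year layout is an assumption about how they were typed in, and to_date is a hypothetical helper name.

```python
from datetime import date

# Invented DD/MM/YYYY values as they might be typed into a 'Text' field.
entries = ["02/03/1650", "15/01/1649", "01/12/1651"]


def to_date(text):
    """Parse a DD/MM/YYYY string into a date object."""
    day, month, year = (int(part) for part in text.split("/"))
    return date(year, month, day)


as_text = sorted(entries)                # compared character by character
as_dates = sorted(entries, key=to_date)  # compared as actual dates

print(as_text)   # ['01/12/1651', '02/03/1650', '15/01/1649']
print(as_dates)  # ['15/01/1649', '02/03/1650', '01/12/1651']
```

The text sort places the latest date first simply because its string begins with ‘01’, whereas the date sort returns the entries in chronological order.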


Orthography/variant forms

This is the really big area in which historical sources provide information that is problematic for the database: how do you deal with information that appears with many different spellings or in entirely different forms when in reality it means the same thing (or at least you wish to treat it as the same thing)? How will you deal with contractions and abbreviations, particularly when they are not consistent in the source? How will you accommodate information that is incomplete, or that is difficult to read or understand, so that you are uncertain about its meaning? All of these issues are practically certain to crop up at some point in the data entry process, and all of them will need to be addressed to some extent to prevent problems and inaccuracies arising during the analysis of your data (for the impact that these issues have upon querying, for example, join one of our face-to-face Database courses).
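One possible way of treating variant spellings as the same thing, offered here as an assumption of this sketch rather than as a method set out above, is to keep the form found in the source and record a standardised form alongside it. The surname variants, the lookup table and the function name are all invented for illustration.

```python
# Hypothetical lookup table: each variant spelling points to one standard form.
STANDARD_FORMS = {
    "smith": "Smith",
    "smyth": "Smith",
    "smythe": "Smith",
    "smithe": "Smith",
}


def standardise(source_form):
    """Return (verbatim, standard); standard is None for an unknown variant."""
    key = source_form.strip().lower()
    return source_form, STANDARD_FORMS.get(key)


# The verbatim spelling is kept untouched; only the second value is used
# when records need to be grouped or queried together.
for verbatim, standard in map(standardise, ["Smyth", "Smithe", "Smeeth"]):
    print(repr(verbatim), "->", standard)
```

Unknown variants such as ‘Smeeth’ are left without a standard form rather than guessed at, so that uncertain readings remain visible for later review.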

[1] The 30th of August 1247, approximately.

[2] Note that this does not necessarily mean ‘convert’ in the literal sense: it would be entirely reasonable, if your research required it, to have two fields for date information, one containing the date verbatim from the source and a second into which the modern rendering is entered. Querying and sorting could then take place using the latter field.