Designing Databases for Historical Research

Historical Research Handbook: Designing Databases for Historical Research

F. Problems facing the historian

F1. Introduction

This section of the Handbook is in some ways the most crucial, as it addresses a range of issues that affect the historian in particular during the course of designing their database. Most of these issues arise through the variability and ambiguity that inevitably accompanies the information present in historical sources, and which is not something which non-historical database creators encounter: usually these latter will know precisely what kind of information they will have, and can therefore comfortably predict how the database will need to be structured to make best use of it. Historians will not be able to predict the nature of their information so confidently, even if they know their sources intimately, as there will almost always be instances where the sources confound them – be it through extra information cropping up in the material which appears to be beyond the scope of the source, or through missing or illegible text, or through marginalia, deletions and so on.

This variability or ambiguity in the type and shape of information within the sources, (whilst one of the more exciting aspects of historical research), is not the only occasion where the materials and practices of the historian create difficulties when a database is brought into the equation. Chronology, topography, geography, orthography, and a range of other historical contexts introduce an element of ‘fuzziness’ into the data, and fuzziness is anathema to the strictures of the relational database model. The irregularity of historical information must be managed through the design of the database in such a way as to maintain a balance between keeping the detail and richness of the source to the extent required by the project, whilst at the same time applying enough standardising and reshaping of the information to allow the database tools to operate with maximum efficiency and flexibility.

At the root of most of these ‘interfacing’ problems is the fact that databases do not ascribe meaning to any piece of information you enter into it. The database does not recognise words or numbers or dates: when you enter data into a table, all the database understands is that a field contains a value which consists of a collection of characters organised in a particular order (something that is known as a ‘string’). When two strings are identical, such as might occur with the values appearing in the Surname field of a ‘Person’ table across two records, then the database understands that these two pieces of data are the same and have the same meaning. If two strings are not identical, then the database does not recognise them as meaning the same – even if to the historian it is clear that ‘Jno’, ‘John’ and ‘Johan~’ are all ways of representing the same piece of information.

For example (image F1i), when running a query in a database containing the names of members of a London Livery Company from the sixteenth to twentieth centuries, one would be able to search for everyone named John Smith:[1]

F1i – Query results of a search for individuals called John Smith (using standardised form of name)

As we see, there are 275 individuals called John Smith in the database. We only know this, however, because we chose to run the query using criteria in Sname field – the fields which contain the surname of individuals – and the StdFname field – the field which contains the standardised version of the forename. If we ran the same query using the criteria “John” and “Smith” using the non-standardised version of the forename (in the field Fname), we would retrieve a smaller number of records:

F1ii – Query results of a search for individuals called John Smith (using non-standardised form of name)

This time only 169 records are retrieved. The difference between the two sets of results arises from the variant spelling of the forename ‘John’ (which can be seen in the first image). When searched for ‘John’ in the standardised field, everyone called John had the same value; when we searched in the non-standardised field where the original spelling of the name had been entered, the record for anyone whose name was not spelt exactly as “JOHN” was missed by the query. Whilst the historian knows that all the variant spellings in effect mean the same thing, the database cannot recognise this so long as the strings are not identical. Thus, in order for our queries and analyses to be performed accurately by the database, we have to make sure that data which means the same thing can be identified by the database as meaning the same thing.[2]

[1] For detailed guidance on how to run queries in databases, please sign up to our face to face Database courses.

[2] The problem of non-identical strings not being recognised as meaning the same thing acting as an impediment to the running of queries can be mitigated to some extent through the use of various querying tools, but even then the problem can still lead to inaccurate query results.