Designing Databases for Historical Research

B. Sources, information and data

B2. Information and data

The tricky part of the process of using databases in historical research lies in the ‘shape’ of the information that is found in our sources. Databases have very strict rules about what type of information goes where, how it is represented and what can be done with it (see Section C), and if the information from our sources can be made to obey these rules then it has become data. Of course the problem facing historians is that information can take many forms in our sources, even if the research is only considering a single type of source and that source is a relatively simple one. Sources that are irregular in ‘shape’, such as textual sources with long narrative accounts written in paragraphs and chapters and so on, or databases of image/sound/video collections, are particularly problematic when it comes to converting their information into data; but the problem will also arise in the more structured sources (such as census listings or taxation assessments), which are never quite as simple as they might appear.

The concept of ‘shape’ here is one that is fundamental to the understanding of how databases work and the efforts needed to enter our sources into them. One of the database rules alluded to above is that all data in the database sit in tables, regardless of what kind of data they are. This means that information taken from our sources will need to fit into a tabular structure – that is, arranged by rows and columns – by the time it has been entered into the database. Often this is not the ‘shape’ the information is in when we open the pages of our sources, and usually we will have to mould it into a more compliant shape. As we shall see, this will cause us to accept something of a compromise between maintaining the full richness and integrity of our sources’ information on the one hand, and maximising the analytical potential of the data we create on the other.
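
The compromise described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the free-form record, the column names and all the values are invented for the example, and `is_tabular` is a hypothetical helper rather than part of any database package.

```python
# A free-form record of what a source says: no fixed shape at all.
# (Invented text, for illustration only.)
free_form = [
    "Entry 1: a person named A, aged about 40, described as a lawyer.",
    "Entry 2: household of B; no age given, but a long remark on property.",
]

# The same information moulded into a tabular shape:
# a fixed set of columns, and one row per person.
COLUMNS = ("name", "age", "occupation")
rows = [
    ("A", 40, "lawyer"),
    ("B", None, None),  # the remark on property has no column, so it is dropped
]


def is_tabular(rows, columns):
    """Return True if every row supplies exactly one value per column."""
    return all(len(row) == len(columns) for row in rows)
```

The second row shows both sides of the compromise: the missing age has to be recorded as an explicit empty value, and the remark on property is lost unless a column is designed to hold it.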

Information from our sources is what we are interested in. It is what we will use to perform our historical analyses, and it is the raw material of our research. Away from the database, we extract information from our sources as a methodological necessity and record it as notes (sometimes as transcripts) in a variety of forms. Recording information in this way gives us access to what we need without having to consult the original source again, but the form of our notes also allows us to accommodate the vagaries in the types of information that we can obtain from the source. In making notes we assimilate the variations in the type and scope of the information being recorded without concern for the shape of that information, something that is no longer possible in a database environment.

For example, image B2i depicts an interesting historical source, eminently useful for researching a variety of social, economic, cultural or political subjects in the context of mid-nineteenth-century Chicago. The text of this pamphlet provides the historian with the bulk of the source’s information – information about places, dates, themes, events and so on – but from the point of view of database design it is important to note that not all of the information is contained in the source’s text. It is important to identify these non-textual types of information (such as page dimensions, layout, font types and sizes, language, archival stamps, colours used etc.) because if they matter to your research then they will need to be accommodated within the database design, and in some cases this will involve extra conversion processes. Descriptions of the source can be every bit as useful as what the source actually says.

When considering this pamphlet as a candidate for inclusion in a database, the most obvious aspect of this particular source is that it does not look much like a table. It is not ‘rectangular’ in terms of its shape – the text is not organised into columns and rows. This makes it difficult to ascertain the scope of the information (what there is information about) without actually reading the whole source, which is something you might be able to do with a source arranged by rows and columns.

B2i – Example of an historical source [1]

Immediately therefore it becomes apparent that if we wanted to include this information in our database, we would need to think carefully about how to enter the information we want into the tabular structure required. How would it be possible to reorder the information into columns and rows – what would our columns be, how could the information be divided into instances of something (rows)? Our sources, whilst they may be wonderfully useful things, are not often actually suited for use in databases.

On the other hand, there are sources which are more promising at first glance in terms of their suitability for inclusion in a database. Take for example the returns of the census enumerators (such as that for the 1850 US Census, image B2ii), a source which is as ‘rectangular’ in shape as it is possible to be. Here the information is conveniently arranged into columns and rows – each column pertains to one particular type of information (name, age, occupation and so on), and each row corresponds to information about a single individual. This is a source which will ‘fit’ into the database structure without the need for too much conversion, as its inherent shape approximates that required by the database quite closely.

B2ii – Example of a ‘rectangular’ historical source [2]
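
To illustrate how closely a ‘rectangular’ source of this kind maps onto a database table, the following is a minimal sketch using Python’s built-in `sqlite3` module. The table name `individual`, its column names and the three rows are assumptions invented for the example; they are not transcribed from the census page in image B2ii.

```python
import sqlite3

# An in-memory database, so the sketch leaves no file behind.
conn = sqlite3.connect(":memory:")

# One column per type of information, one row per individual.
conn.execute("""
    CREATE TABLE individual (
        line_no    INTEGER PRIMARY KEY,  -- position on the page
        name       TEXT,
        age        INTEGER,
        occupation TEXT
    )
""")

# Invented placeholder rows, not taken from the real return.
people = [
    (1, "Person A", 40, "Lawyer"),
    (2, "Person B", 31, None),  # no occupation recorded
    (3, "Person C", 6, None),
]
conn.executemany("INSERT INTO individual VALUES (?, ?, ?, ?)", people)

# Once the information is data, it can be queried by column.
adults = conn.execute(
    "SELECT name FROM individual WHERE age >= 18 ORDER BY line_no"
).fetchall()
```

Because every row has the same columns, a question such as ‘who on this page is an adult?’ becomes a single query rather than a reading of the whole page.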

However, it is worth noticing that even here the translation process between source information and database data will not necessarily be entirely problem-free. Whilst the bulk of the information is contained within the tabular structure of the source, not all of it is. The information at the top of the page, for example (vital information about the place and date of the listing, as well as the identity of the enumerator), is not contained within the table of the individual returns. In the database of this page, this information would need to be accommodated within a table somewhere, giving us some thinking to do about how this should be managed. Similarly, there are a number of pieces of information which might be useful to our research but which do not exist in the table of individual listings: the arrow pointing to the Lincoln household, for example, or the various ticks and crosses, emendations and marginalia, some of which are not original to the source but which still constitute information, might be desirable for inclusion in the database. As we shall see in Section C, not all the information from a source need necessarily be included in the database, and significant decisions about this will need to be made; but the information that is required, no matter what its shape or where it is located in the source, will need to be appropriately converted before it can be used in the database.
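
One possible way of accommodating the page-level information and the marginal marks is sketched below, again with Python’s built-in `sqlite3` module. It is only one design among several, offered under stated assumptions: the tables `page` and `individual`, their columns, and every value in them are hypothetical and are not drawn from the real census return.

```python
import sqlite3

db = sqlite3.connect(":memory:")

# Information from the top of the page goes in its own table;
# each individual row then points back to the page it came from.
db.executescript("""
    CREATE TABLE page (
        page_id    INTEGER PRIMARY KEY,
        place      TEXT,
        list_date  TEXT,
        enumerator TEXT
    );
    CREATE TABLE individual (
        individual_id INTEGER PRIMARY KEY,
        page_id       INTEGER NOT NULL REFERENCES page(page_id),
        name          TEXT,
        annotation    TEXT  -- ticks, arrows, marginalia described in words
    );
""")

# Placeholder values only.
db.execute("INSERT INTO page VALUES (1, 'Place X', 'Date Y', 'Enumerator Z')")
db.executemany(
    "INSERT INTO individual (page_id, name, annotation) VALUES (?, ?, ?)",
    [
        (1, "Person A", "later arrow in margin"),
        (1, "Person B", None),
    ],
)

# Joining the two tables reunites each person with the page heading.
joined = db.execute("""
    SELECT i.name, p.place, p.enumerator
    FROM individual AS i
    JOIN page AS p ON p.page_id = i.page_id
    ORDER BY i.individual_id
""").fetchall()
```

Keeping the heading in a separate table means the place, date and enumerator are recorded once per page rather than repeated on every row, while the free-text `annotation` column is a deliberately simple stand-in for marks that have no tabular shape of their own.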

Manuscript Exercise

The need to understand the differences between the shape that information takes in our sources and the shape that data has to adopt within databases is something that this Handbook will return to repeatedly, from a variety of angles. Squeezing information into the right shape for use in a database is not the only form of conversion that is required, as we shall see in Section F, but it is the most fundamental stage of the process and, as we shall see in Section E, the most important step in the design stages.





[1] Pamphlet calling for a strike at the McCormick Reaper Works on the Haymarket Square in Chicago, 1866.  Available at Wikimedia Commons (accessed 25/03/2011).

[2] The household of Abraham Lincoln as described in the returns of the 1850 US Census.  Available at Wikimedia Commons (accessed 25/03/2011).