Designing Databases for Historical Research

F. Problems facing the historian

F3. Standardisation, classification and coding

The principal way forward for accommodating data containing these kinds of problems is to build (often quite liberally) a standardisation layer into the design of the database (see Section C4) through the use of standardisation, classification and coding. These three activities are a step removed from simply gathering and entering information derived from the sources: this is where we add (or possibly change) information in order to make retrieving information and performing analysis easier. We use these techniques to overcome the problem of information that means the same thing appearing differently in the database, which prevents the database from connecting like with like (the fundamental prerequisite for analysing data). For historians this is a more important step than for other kinds of database users, because the variety of forms and ambiguity of meaning in our sources do not sit well with the exactitude required by the database (as with the example of trying to find all of our records about John Smith, Section F1), so that more of a standardisation layer needs to be implemented.

Standardisation, classification and coding are three distinct techniques which overlap, and most databases will use a combination of the three when adding a standardisation layer into the design:

 

Standardisation

This is the process of deciding upon one way of representing a piece of information that appears in the source in a number of different ways (e.g. one way of spelling place/personal names; one way of recording dates and so on) and then entering that standardised version into the table. Consider using standardisation when dealing with values that appear slightly different, but mean the same thing - ‘Ag Lab’ and ‘Agricultural Labour’ as values would be treated very differently by the database, so if you wanted them to be considered as the same thing, you would signal this to the database by giving each record with a variant of this occupation the same standardised value.
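
As a minimal sketch of this idea (not a description of any particular project's method), the following Python fragment gives every variant of an occupation the same standardised value. The field names OccupationSource and OccupationStd and the contents of the lookup are assumptions made purely for illustration.

```python
# Hypothetical lookup: each known variant (lower-cased, single-spaced)
# maps to the one standardised form chosen by the historian.
OCCUPATION_STANDARD = {
    "ag lab": "Agricultural Labour",
    "agricultural labour": "Agricultural Labour",
}


def standardise_occupation(source_value):
    """Return the standardised form, or None if the variant is not yet known."""
    key = " ".join(source_value.lower().split())  # ignore case and stray spaces
    return OCCUPATION_STANDARD.get(key)


# Illustrative records: the source wording is kept, the standard form is added.
records = [
    {"OccupationSource": "Ag Lab"},
    {"OccupationSource": "Agricultural Labour"},
    {"OccupationSource": "ag  lab"},
]
for record in records:
    record["OccupationStd"] = standardise_occupation(record["OccupationSource"])

# All three variants now share a single value that the database can match on.
standard_values = {record["OccupationStd"] for record in records}
```

Returning None for an unknown variant, rather than guessing, leaves the decision about new variants with the historian.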

 

Classification

This is the process of grouping together information (‘strings’) according to some theoretical, empirical or entirely arbitrary scheme, often using a hierarchical system in order to improve analytical potential. Classification is about allocating groups, and then placing your data in those groups. These groups can be hierarchical, and the hierarchy will let you perform your analysis at a variety of levels. Classification is less about capturing the information in your sources and is much more about serving your research needs.

When using a classification system it is very important to remember two things: firstly, since it is an arbitrary component of your database’s Standardisation layer designed to improve your research analysis, the system does actually need to meet the requirements you have for it. Secondly, therefore, the system needs to have been devised before data entry begins, it needs to be intellectually convincing (at least as far as your historical methodologies are concerned) and it needs to be applied within your data consistently.

It is also worth being aware of how other historians have classified their information. There have been many classification systems created by the good and the great of the historical profession,[1] many of which have been used subsequently by others for two reasons: they allow comparability between the findings of different projects, and they allow historians to turn different sources into continuous series of information. That is, two projects investigating the same thing at different periods may have to rely on different sources: by classifying their (probably slightly different) information into similar classification systems, a case can be made (convincingly or otherwise) that the research is comparable. This is not to say that you should necessarily try to adopt an existing scheme rather than develop one that suits your research better, but it is worth keeping in mind if you are interested in comparing your analysis with that of another historian. In addition, given that classification systems in practice really only entail adding an extra field in a table into which the classified value is added, there is nothing stopping you (other than perhaps time) from employing more than one classification system for the same information in the database.

A detailed example of a classification system can be seen in an ongoing project which is investigating the material aspects of early modern households, and which uses a database to record minutely detailed information about material objects. One of the many ways it treats the information about objects is to classify objects by type, in order to be able to compare like objects despite the often substantial differences in the ways they are referred to in the sources. This works by adding a field in the table where item type data is recorded into which an ItemClass code value can be added:

F3i – Data about material objects that have been classified and coded

The ItemClass field here is populated with codes, and these codes record precisely what type of item the record is about (you can see what the source calls the item in the ItemDescr field).[2] The fact that the code is a numeric value, and the fact that the same numeric code is applied to the same type of object regardless of how it is described in the source, means that the ItemClass field acts as a standardised value.

Additionally, however, the ItemClass field enables the use of a hierarchical classification system (to examine a partial sample of the classification system, download the Microsoft Excel file Material Object Type Classification.xls). The hierarchy operates by describing objects at three increasingly detailed levels:

  • Code I: the broadest level (for example, Linen (household); Storage; Tools; Clothing – Outer; Lighting etc.)
  • Code II: the middle level, offering sub-groups of Code I (for example Tools > Domestic implements; Clothing – Outer > Footwear)
  • Code III: the most detailed level of description (for example Clothing – Outer > Footwear > Boots)

To illustrate this we can take the example of how the database classifies objects that are used for seating:


F3ii – Classification system for objects in the category of ‘Seating’

You will notice from the Microsoft Excel spreadsheet that each code level has a two- or three-digit numeric code, so Code I: Seating has the numeric code 05, that for Code II: Chair is 02, and that for Code III: Wicker Chair is 006. These individual codes are combined into a single numeric code (in the case of the wicker chair – 0502006) which is the value that gets entered into the relevant single field (ItemClass) in the record for the wicker chair in the database.
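
The way the three level codes combine can be sketched in a few lines of Python. This is an illustration only: the function names are hypothetical, and the fixed widths (two, two and three digits) are assumed from the wicker chair example rather than taken from the project's documentation. The codes are handled as text so that the leading zero in 05 is not lost, as it would be if the value were stored as a number.

```python
def build_item_class(code_i, code_ii, code_iii):
    """Join the three level codes into a single ItemClass string.

    Assumes widths of 2, 2 and 3 digits, as in the wicker chair example.
    """
    parts = (code_i, code_ii, code_iii)
    widths = (2, 2, 3)
    for part, width in zip(parts, widths):
        if len(part) != width or not (part.isascii() and part.isdigit()):
            raise ValueError(f"expected a {width}-digit code, got {part!r}")
    return "".join(parts)


def split_item_class(item_class):
    """Recover the Code I, Code II and Code III parts of an ItemClass value."""
    return item_class[:2], item_class[2:4], item_class[4:]


# Seating (05) > Chair (02) > Wicker Chair (006)
wicker_chair = build_item_class("05", "02", "006")
```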

This may sound complicated and slow to implement, but the benefit of doing so is considerable. Firstly, the database can be created so that the codes can be automatically selected rather than memorised by the database creator, so that they do not have to stop to remember or look up what code needs to be entered for any given object. Secondly, and here is the principal reason for employing a hierarchical system, once the data have been coded, they can be analysed at three different semantic levels. The historian could, if they wished, analyse all instances of wicker chairs in the database by running queries on all records which had the ItemClass value “0502006”. Alternatively, if they were interested in analysing the properties of all the chairs in the database, they could do so by running queries on all records with an ItemClass value that begins “0502***”. Lastly, if the point of the research was to look at all objects used for seating, a query could be designed to retrieve all records with an ItemClass value that began “05*****”. This is an incredibly powerful analytical tool, and one that would be impossible to achieve without the use of a hierarchical classification system: to run a query to find all objects used for seating without a classification system would require looking for each qualifying object that the historian can anticipate or remember by name, taking into account the variant spellings that might apply.[3]
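
The three levels of query described above might look something like the following sketch, which uses Python's built-in sqlite3 module as a stand-in for whatever database software is actually in use. The table name Item, the sample rows and every code other than 0502006 are invented for illustration. In SQL the wildcard % plays the role of the asterisks in the patterns quoted above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Item (ItemDescr TEXT, ItemClass TEXT)")
conn.executemany(
    "INSERT INTO Item (ItemDescr, ItemClass) VALUES (?, ?)",
    [
        ("wicker chaire", "0502006"),   # code taken from the text
        ("a chair", "0502001"),         # hypothetical code
        ("buffet stool", "0501002"),    # hypothetical code
        ("pair of boots", "0904001"),   # hypothetical code
    ],
)


def count_by_prefix(prefix):
    """Count the records whose ItemClass begins with the given digits."""
    row = conn.execute(
        "SELECT COUNT(*) FROM Item WHERE ItemClass LIKE ?",
        (prefix + "%",),
    ).fetchone()
    return row[0]


wicker_chairs = count_by_prefix("0502006")  # Code III level
chairs = count_by_prefix("0502")            # Code II level
seating = count_by_prefix("05")             # Code I level
```

With these sample rows the three counts are 1, 2 and 3 respectively: the boots are never retrieved, and the oddly spelled ‘wicker chaire’ is found at every level because the query never looks at the ItemDescr field.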

Hierarchical classification systems are very flexible things as well. They can include as many levels as you require to analyse your data, and they do not need to employ numeric codes when simple standardised text would be easier to implement.[4]

 

Coding

Coding is the process of substituting (not necessarily literally) one value for another, for the purpose of recording a complex and variable piece of information through a short and consistent value. Coding is often closely associated with classification, and in addition to saving time in data entry (it is much quicker to type a short code word than it is to type five or six words) codes also act as a form of standardisation (that is, the same form [code] is entered for the same information no matter how the latter appears in the source).
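
A minimal sketch of coding, with an entirely hypothetical code list, is given below. Short codes stand in for longer descriptions, and a simple check rejects any code that is not on the list, much as a ‘look-up list’ would at data entry.

```python
# Hypothetical code list: short codes standing in for longer descriptions.
OCCUPATION_CODES = {
    "AGL": "Agricultural labourer",
    "DSV": "Domestic servant",
    "BSM": "Blacksmith",
}


def check_code(code):
    """Accept only codes that appear on the list."""
    if code not in OCCUPATION_CODES:
        raise ValueError(f"unknown occupation code: {code!r}")
    return code


def describe(code):
    """Translate a code back into its full description."""
    return OCCUPATION_CODES[check_code(code)]


# Three records entered with short codes rather than full descriptions.
entered = [check_code(code) for code in ["AGL", "AGL", "DSV"]]
```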

 

These techniques are implemented to make the data more readily useable by the database: the codes, classifications and standardised forms which are used are simple and often easier to incorporate into a query design than the complicated and incomplete original text strings that appear in the source; but more importantly, they are consistent, making them much easier to find. However, there are a number of things to bear in mind when using them, the most important of which is that there are two ways of applying these techniques:

  • By replacing original values in the table with standardised/coded/classified forms
  • By adding standardised/coded/classified forms into the table alongside the original values

Both of these approaches present a trade-off between maintaining the integrity of the source and improving the efficiency of the potential analysis, in much the same way as the choices offered as part of the design process when selecting the Source- or Method-oriented approach to the database (see Section C3). The first approach to standardising, to replace the original version of source information in any chosen field(s) with standardised forms of data, enables the speeding up of data entry at the expense of losing what the source says. It also serves as a type of quality control, as entering standardised data (especially if controlled with a ‘look-up list’) is less prone to data entry errors than the original forms that appear in the source.

The second approach, to enter standardised values in addition to the original forms, allows for the best of both worlds: you achieve the accuracy and efficiency benefits of standardisation without losing the information as it is presented in the source. Of course, this happens at the cost of extra data inputting time, as you enter material twice.
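
The second approach can be sketched as follows, again using Python's sqlite3 module purely for illustration. The table and field names (Person, OccupationSource, OccupationStd) and the sample rows are assumptions. The point is simply that the query matches on the standardised field while the original wording remains available in the results.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Person ("
    "Surname TEXT, "
    "OccupationSource TEXT, "  # the value exactly as it appears in the source
    "OccupationStd TEXT)"      # the standardised form added by the historian
)
conn.executemany(
    "INSERT INTO Person VALUES (?, ?, ?)",
    [
        ("Smith", "Ag Lab", "Agricultural Labour"),
        ("Jones", "Agricultural Labour", "Agricultural Labour"),
        ("Brown", "Blacksmyth", "Blacksmith"),
    ],
)

# Query on the standardised field, but return the source wording as well.
matches = conn.execute(
    "SELECT Surname, OccupationSource FROM Person "
    "WHERE OccupationStd = ? ORDER BY Surname",
    ("Agricultural Labour",),
).fetchall()
```

Under the first approach the OccupationSource field would simply not exist, and the wording ‘Ag Lab’ would be lost once the record had been entered.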

When considering both approaches, bear in mind that you will only need to standardise some of the fields in your tables, not every field in every table. The candidates for standardising, classifying and coding are those fields that are likely to be heavily used in your record-linkage or querying, where being able to identify like with like values is important. Creators of databases built around the Source-oriented principle should exercise particular caution when employing these techniques.

 

 



[1] See for example that developed for household types in The population history of England, 1541-1871: a reconstruction (1981) by E.A. Wrigley and R.S. Schofield; or the ongoing HISCO project to develop an international classification system for occupations, available at http://historyofwork.iisg.nl/ (accessed 23/03/2011).

[2] Note in passing that many of the other fields in this example contain codes as well – this table contributes substantially to the database’s Standardisation layer.

[3] It would, for example, need to look for all stools, buffet stools, wicker chairs, forms, settles, benches etc., leading to extremely complicated queries with possibly more criteria than the database can handle. For criteria in queries please sign up to one of our face-to-face Database courses.

[4] Indeed, numeric codes are somewhat old-fashioned in modern database usage, although they are no less efficient for being outmoded.