Historical Research Handbook: Designing Databases for Historical Research

A. Introduction

The purpose of this Historical Research Handbook is to provide an introduction to designing databases for use in historical research. It offers an overview of important concepts – both historical in nature and in terms of databases – that the historian will need to consider before embarking upon designing a database, and a number of starting points for overcoming certain design problems that specifically affect historians when they come to wrestle their sources into a database. This Handbook does not cover the actual construction of databases for historical research in any practical manner.

A1. Database design concepts

After working through this Handbook it is hoped that you will have a good understanding of the complex relationship between historical sources, information and data, and will be aware of the translation processes that are required when moving from one to the others. The informational contents of historical sources need to be converted – often in multiple ways – before they can be used as data, and a number of methodological decisions will need to be taken as this is done. Unlike the more mechanical aspects of using databases in historical research, such as building tables, linking records or running aggregate analyses, this translation process is not only difficult to learn other than through experience, it is also likely to be a substantially different process for every historian doing it. Each historian has different materials, different projects and different research aims, and so the databases they build will (or should) address these in the way that best fits their specific purpose. This ‘modelling’ of historical data is a difficult process, but happily the difficulties that arise are those that by and large are faced by historians in their everyday, non-database, work, meaning that you will be well equipped to deal with them. The modelling of data is also possibly the most interesting and enjoyable aspect of using databases in historical research, although perhaps this is only a relative response to the long hours of data entry that follow the initial design of the database!

This Handbook does not require the use of database software, although it will show the occasional screenshot of a database for the purposes of illustration. Instead it will spend most of its sections on discussing sources and research questions, and how these need to be recast when interacting with a database with its strictures and rules. The exercises that are offered will not have right or wrong ‘answers’, just as there is no right or wrong way to design a database of historical sources. Or rather, it may be more accurate to suggest that while there is no right way to design a database, there are a number of (if not wrong exactly) unhelpful ways to design a database, and this Handbook will focus as much on the latter as on the former.


B. Sources, information and data

B1. Introduction

In this section we are going to address some of the issues that historians face when it comes to thinking about building and using a database for their research. Quite what ‘using a database for their research’ actually means is a subject that we will return to in Section C of this Handbook, as it is a subject that encompasses a range of issues which are likely to impact upon the design of a historical database. Essentially what this section will focus on is the difference between ‘information’ and ‘data’ – the former being what sources provide, the latter being what databases need – and it will begin the process of considering how to move from one to the other.

Unfortunately, the historian is faced with particular kinds of problem when it comes to converting sources into a useful database resource, problems which are not shared by most other database users. This (as we shall see) boils down to two separate inescapable realities of historical research:

  • The historian often does not know precisely what kinds of analyses they want to conduct when starting out on their research
  • The extent and scope of the information contained within the historian’s particular sources cannot usually be anticipated fully

In other words, the sheer unpredictability of many historical research projects – the various tangents and new lines of inquiry that open up as soon as you get to grips with the sources, as well as the constant promise of unearthing a type of information that you were not expecting – makes designing databases a difficult proposition for historians. Indeed, in many ways, these two factors provide conditions which are entirely contrary to the environment required by the structures and functions of a database. The difficulty for the historian is that what is required is to take information that is informal and unstructured, translate it, and make it fit into a rigidly formal and structured medium. Reconciling the two – the milieu of the historian and the rules of databases – is the principal aim of this Handbook.

Much of what is discussed here is about good practice and ensuring that the most common and critical mistakes are avoided at the most important stage of database creation. Errors at this juncture will have an effect on how useful the database will be: they will make data entry more laborious and more difficult; and more seriously, they will have a significant impact upon the database’s ability to retrieve data for analysis. It is very important therefore to design the database as ‘correctly’ as possible, initially, to minimise the need for retrospective restructuring further down the line (although some of this will inevitably be necessary).


B2. Information and data

The tricky part of the process of using databases in historical research lies in the ‘shape’ of the information that is found in our sources. Databases have very strict rules about what type of information goes where, how it is represented and what can be done with it (see Section C), and if the information from our sources can be made to obey these rules then it has become data. Of course the problem facing historians is that information can take many forms in our sources, even if the research is only considering a single type of source and that source is a relatively simple one. Sources that are irregular in ‘shape’, such as textual sources with long narrative accounts written in paragraphs and chapters, or collections of images, sound or video, are particularly problematic when it comes to converting their information into data; but the problem will also arise in more structured sources (such as census listings or taxation assessments), which are never quite as simple as they might appear.

The concept of ‘shape’ here is one that is fundamental to the understanding of how databases work and the efforts needed to enter our sources into them. One of the database rules alluded to above is that all data in the database sit in tables, regardless of what kind of data they are. This means that information taken from our sources will need to fit into a tabular structure – that is, arranged by rows and columns – by the time it has been entered into the database. Often this is not the ‘shape’ the information is in when we open the pages of our sources, and usually we will have to mould it into a more compliant shape. As we shall see, this will cause us to accept something of a compromise between maintaining the full richness and integrity of our sources’ information on the one hand, and maximising the analytical potential of the data we create on the other.

Information from our sources is what we are interested in. It is what we will use to perform our historical analyses, and it is the raw material of our research. Away from the database, we extract information from our sources as a methodological necessity and record that information as notes (sometimes as transcripts) in a variety of forms. The recording of information in this way allows us access to what we need without having to consult the original source in the future, but the form of our notes also allows us to accommodate the vagaries in the types of information that we can obtain from the source. In making notes we assimilate the variations in the type and scope of the information being recorded without concern for the shape of that information, something that is no longer possible in a database environment.

For example, image B2i depicts an interesting historical source, eminently useful for researching a variety of social, economic, cultural or political subjects in the context of mid-nineteenth century Chicago. The text of this pamphlet provides the historian with the bulk of the source’s information – information about places, dates, themes, events and so on – but from the point of view of database design it is important to note that not all of the information is contained in the source’s text. It is important to identify these non-textual types of information (such as page dimensions, layout, font types and sizes, language, archival stamps, colours used, etc.) because if they are important to your research then they will need to be accommodated within the database design, and in some cases this will involve extra conversion processes. Descriptions of the source can be useful information every bit as much as what the source actually says.

When considering this pamphlet as a candidate for inclusion in a database, the most obvious aspect of this particular source is that it does not look much like a table. It is not ‘rectangular’ in terms of its shape – the text is not organised into columns and rows. This makes it difficult to ascertain the scope of the information (what there is information about) without actually reading the whole source, in the way that you might with a source already arranged into rows and columns.

B2i – Example of an historical source [1]

Immediately therefore it becomes apparent that if we wanted to include this information in our database, we would need to think carefully about how to enter the information we want into the tabular structure required. How would it be possible to reorder the information into columns and rows – what would our columns be, how could the information be divided into instances of something (rows)? Our sources, whilst they may be wonderfully useful things, are not often actually suited for use in databases.

On the other hand there are sources which are more promising at first glance in terms of their suitability for inclusion in a database. Take for example the returns of the census enumerators (such as that for the 1850 US Census, image B2ii), a source which is as ‘rectangular’ in shape as it is possible to be. Here the information is conveniently arranged into columns and rows – each columns pertains to one particular type of information (name, age, occupation and so on), and each row corresponds to information about a single individual. This is a source which will ‘fit’ into the database structure without the need for too much conversion, as its inherent shape approximates that required by the database quite closely.

B2ii – Example of a ‘rectangular’ historical source [2]

However it is worth noticing that even here the translation process between source information and database data will not necessarily be an entirely problem-free one. Whilst the bulk of the information is contained within the tabular structure of the source, not all of it is. The information at the top of the page for example, vital information about the place and date of the listing, as well as the identity of the enumerator, is not contained within the table of the individual returns. In the database of this page, this information would need to be accommodated within a table somewhere, giving us some thinking to do about how this should be managed. Similarly, there are a number of pieces of information which might be useful to our research which do not exist in the table of individual listings: the arrow pointing to the Lincoln household, for example, or the various ticks and crosses, emendations and marginalia, some of which are not original to the source but which still constitute information, might be desirable for inclusion in the database. As we shall see in Section C not all the information from a source need necessarily be included in the database and significant decisions about this will need to be made, but the information that is required, no matter what its shape or where it is located in the source, will need to be appropriately converted before it can be used in the database.


The need to understand the differences between the shape that information takes in our sources and the shape that data has to adopt within databases, is something that this Handbook will return to repeatedly, from a variety of angles. Squeezing information into the right shape for use in a database is not the only form of conversion that is required, however, as we shall see in Section F, but it is the most fundamental stage of the process, and is the most important step in the design stages, as we shall see in Section E.


[1] Pamphlet calling for a strike at the McCormick Reaper Works on the Haymarket Square in Chicago, 1866.  Available at Wikimedia Commons (accessed 25/03/2011).

[2] The household of Abraham Lincoln as described in the returns of the 1850 US Census.  Available at Wikimedia Commons (accessed 25/03/2011).


C1. Introduction

This section of the Handbook provides an introduction to the basic concepts that underpin the design and use of databases. It will focus on aspects of the database which will need to be considered at the beginning of the database design process, alongside the identification of information in the sources (see Sections B and E). The ideas and approaches discussed here are independent of the sources being employed by the historian, and as such will apply to every database used in a historical research project (and indeed to most databases in general). This section will not address technology or software; rather it will examine conceptual approaches to designing databases in the abstract, and specifically how these concepts will affect databases to be created and used by historians.


C2. The purpose of the database

As we shall see in Section E, the very first step in the formal process for designing a database is to decide what purpose(s) the database is to serve. This is something that is perhaps not as obvious or as straightforward as one might expect, given that databases in the abstract can indeed serve one or more of a number of different kinds of function. In essence, however, there are three types of function that the historian is likely to be interested in:

  • Data management
  • Record linkage
  • Pattern elucidation/aggregate analysis

Each of these functions is a goal that can be achieved through shaping of the database in the design process, and each will require some elements of the database design to be conducted in specific ways, although they are by no means mutually exclusive. And this latter point is an important one, given that most historians will want to have access to the full range of functionality offered by the database, and will likely engage in research that will require all three of the listed types of activity. Or, to put it another way, many historians are unlikely to know precisely what it is they want to do with their database at the very beginning of the design process, which is when these decisions should be taken. This is why, as we shall see later in this section, many historians are inclined to design databases which maximise flexibility in what they can use them for later on in the project (a goal which will come at the price of design simplicity).

The data management aspect of the database is in many cases almost a by-product of how the database works, and yet it is also one of its most powerful and useful functions. Simply being able to hold vast quantities of information from different sources as data all in one place, in a form that makes it possible to find any given piece of information and see it in relation to other pieces of information, is a very important tool for the historian. Many historians use a database for bibliographical organisation, allowing them to connect notes from secondary reading to information taken from primary sources and to trace either back to its source. The simpler tools of database software can be used to find information quickly and easily, making the database a robust mechanism for holding information for retrieval.

Record-linkage is where the database, and particularly where the relational database (see Sections D and E), comes into its own. Connecting people, places, dates, events and themes across sources, periods and geographical or administrative boundaries is clearly an incredibly useful task to perform, and whilst the database can do this, the efficiency and accuracy of the linkages will be dictated by both the design of the database structure and the nature of the data model (see Section E).

Finally, once the information from your sources has been converted into data, the database software can be employed to group information together. Once records can be aggregated, it becomes possible to count them, meaning that statistical analyses can be performed and structural patterns can be identified within the information. Again, however, the efficiency and accuracy of this kind of function will depend on the design of the database and the manner in which the information has been converted. In particular, this kind of functionality will depend a great deal upon the latter, and if the historian aims to perform this kind of analysis extensively, then considerable effort will need to be put into applying a ‘standardisation layer’ to the data (see Section C4).


C3. Conceptual models of database design

Whilst it is true that every database ever built has been designed specifically for a particular conjunction of purpose and data, and is therefore to a greater or lesser extent distinctive, it is also true that there are two principal overarching approaches to designing databases. The two conceptual models are known as:

The Source-oriented approach (sometimes called the Object-oriented approach)

and

The Method-oriented approach (also known as the Model-oriented approach)

These two models should be viewed as polar opposites at the ends of a sliding scale, where the design of a database is based on an approach somewhere between the two extremes. Every database design will be something of a compromise, and no database will ever constitute the ‘perfect source-oriented database’, nor will there ever be the ‘perfect method-oriented database’.


C3i – The two conceptual approaches to database design

The Source-oriented model of database design dictates that everything about the design of the historical database is geared towards recording every last piece of information from the sources, omitting nothing, and in effect becoming a digital surrogate for the original. The information contained within the sources, and the shape of that information, completely ordains how the database must be built.

The lifecycle of an ideal source-oriented database can be represented thus:

C3ii – Lifecycle of the Source-oriented database

This approach to database design is very attractive to the historian as it places the sources at the centre of the database project. Entering data into a database is a very time-consuming activity, however, and it becomes much more so if you are taking pains to record all of the information that exists in your sources. Ultimately you will need to make choices about which information to exclude from the database, contrary to the principles of the Source-oriented model; this will undermine the database’s role as a digital surrogate for your sources, but it will at least allow you to perform your research within a reasonable period.

The Source-oriented approach, if rigidly applied, can lead to a design that quickly becomes unwieldy as you try to accommodate every last piece of information from your source, some of which may only occur once. But it does allow for wider analytical approaches to be taken later, so that potential queries are not reliant on the initial research agenda, meaning that the database does not restrict the directions your research might take. It also allows you the reassurance of not having to anticipate all of your research questions in advance, as the Method-oriented model requires. The Source-oriented model transfers the source (with all its peculiarities and irregularities) in a reasonably reliable way into the database with little loss of information – ‘everything’ is recorded (or at least anything excluded is excluded by your conscious choice), and if something later becomes interesting, you will not have to go back to the source to enter information that you did not deem interesting enough to begin with. The Source-oriented model also enables you to record information from the source ‘as is’, and lets you take decisions about meaning later – so ‘merc.’ can be recorded as ‘merc.’, and not expanded to ‘merchant’ or ‘mercer’ at the point of entry into the database. [1]

At the other end of the scale, the lifecycle of the Method-oriented model database could be represented in a different way:

C3iii – Lifecycle of the Method-oriented database

This approach to database design is based on what the database is intended to do, rather than the nature of the information it is intended to contain. Consequently, if adopting this model for designing your database, it is absolutely vital that you know before you begin precisely what you will want to be able to do with the database – including what queries you will want to run. The level of precision needed here should not be underestimated either, given that the database requires a high degree of granularity to perform analysis – the database will not be able to ‘analyse the demographic characteristics of the population’, for example, whereas it will be able to ‘aggregate, count and link the variables of age, gender, marital status, occupation, taxation assessment, place of residence’ and so on. When designing any database it will be necessary to think at this latter level of detail, but if you are designing a Method-oriented database then it becomes much more important.

Method-oriented databases are quicker to design, build and enter data into, but it is very hard to deviate from the designed function of the database in order to (for example) pursue newly discovered lines of enquiry.

Ultimately, historians will need to steer a middle course between the two extreme models, perhaps with a slight tendency to lean towards the Source-oriented approach. When making decisions about what information you need from your sources to go into the database, it is important to take into account that your needs may change over the course of a project that might take a number of years. If you want to be able to maintain the maximum flexibility in your research agenda, then you will need to accommodate more information in the database design than if you are very clear on what it is you need to do (and what that is will never change). If you do not know whether your research needs will change, err on the side of accommodating more information – do not exclude information about servants unless you are absolutely sure that you will never want to treat ‘households with servants’ as a unit of analysis, because if you have not entered that information, then it will not be there to query later on.

However you should not dismiss the Method-oriented model out of hand when considering the approach to your database design. If you know your source(s) very well in advance, and you have definite pre-determined research needs, and you know you will not be attempting to recover all the information from the source, and you know in advance exactly how you will treat your data and what questions you will ask of it – if all this is true, you can use the Method-oriented approach. Alternatively, if you are creating a database which is not actually for historical research, but is designed to be a resource with pre-defined functionality and a limited set of tools that a user can use,[2] then a Method-oriented design is also appropriate.



[1] Leaving this kind of ‘normalisation’ until later in the project is beneficial as it allows you to defer decisions about the meaning of data until you have the full body of data to act as context.

[2] Such as an online database with fixed search and retrieval functionality, for example Old Bailey Online (http://www.oldbaileyonline.org/, accessed 23/03/2011).


C4. Database ‘layers’

Databases often involve several stages of work before they can be fully utilised for analysis. This is because well designed databases arrange data into several layers. The ‘Three Layer’ model of database design serves to illustrate how the organisation of different types of data within a database can dramatically improve the analytical potential of that database. The Standardisation Layer in particular is one that historians should invest time and effort into developing, and practical methods of doing this will be addressed in Section F.

C4i – The ‘Three Layer’ model of database design

The basic premise of the Three Layer model is to create different kinds of data. The first kind is data that is derived from the source and which is entered into the database in the form in which it appears originally; the second kind is data that is adapted to obey more closely the strictures imposed by the database to enable the retrieval, querying and analysis processes to be performed more efficiently and accurately. This second type of data is standardised in one manner or another, where for example variations of spelling of a particular entity are codified into a single form to make it easier to find them in the database.

Before we discuss the specific layers, it is important to make the point that you will need to keep the different layers separate: that is, you should always be able to tell whether a piece of data is from the source or whether it has been standardised in some way by you. In terms of database structure, every field will always belong to one layer only, although tables can contain fields that belong to more than one layer.


The Source layer

This layer comprises the tables and fields (see Section C5) that contain information taken from the source and the forms of the data which have been entered verbatim from the original. No adaptation of the original information has taken place in this layer. This allows us to retrieve information that shows what the source actually said.


The Standardisation layer

This layer comprises the tables and fields which contain data that you have adapted to make analysis easier, and will include data where spelling has been standardised, where abbreviations have been expanded, where a single dating system is used and so on (see Section F). There are two opportunities for creating this layer, either at the database design stage, or after all the data entry has been completed. If the former approach has been chosen, then during data entry you must be rigorously consistent in the way that you enter your standardised forms (e.g. always spelling ‘John’ in the same way), and you must document how you have standardised. If done after data entry as post-processing, you can create your standardised values globally across the whole body of data, but this can be time consuming when dealing with lots of information that needs to be standardised. If possible, the former approach is almost always the better option to take.
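For illustration, the short Python sketch below takes the post-processing route: it applies a lookup table of standard forms across a body of already-entered values. The names and spelling variants are invented for the purpose; note that the lookup table itself doubles as documentation of how you have standardised:

    # A minimal sketch of post-processing standardisation (invented examples).
    standard_forms = {
        "Jon": "John",
        "Johne": "John",
        "Iohn": "John",
    }

    source_values = ["Jon", "Johne", "Iohn", "Mary"]

    # Values without an entry in the lookup table are left as they appear.
    standardised = [standard_forms.get(value, value) for value in source_values]
    print(standardised)  # ['John', 'John', 'John', 'Mary']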


The Interpretation (also known as Enrichment) layer

This layer is in many ways an optional one, whereas the other two are not. It comprises data and materials drawn into the database from elsewhere than your sources, in order to make the database and the data it contains a more useful and powerful resource. This can consist of classification, interpolation, interpretation and derived variables. It can also include making links between data: two pieces of information, when linked, make a third new piece of information. It can include reshaping the data in the database to increase usability (for example drawing together dozens of records about an individual in order to create a single encyclopaedic record which contains everything there is to know about that person), and it can include incorporating contextual material drawn from entirely separate (perhaps secondary) resources. Many databases used for personal research do not include an Interpretation layer.

Describing these layers in this way might suggest that they each exist in entirely separate ‘areas’ of the database as discrete collections of data. In most cases this is not true, however. In most cases data belonging to each layer will co-exist within tables, but within separate fields (see Section C5): for example, you might create two fields for ‘occupation’ in the same table that records information about people; in one field (belonging to the Source layer) you record the occupation as it is presented in the source, and in the second field (belonging to the Standardisation layer) you record a standardised version of it. The standardised version will be used for querying and analysis, because it will be easier to find (by virtue of the fact that it is always represented in the same way).
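A minimal sketch of this arrangement, using Python’s built-in sqlite3 module (the table, field names and occupational terms here are illustrative, not prescribed):

    import sqlite3

    con = sqlite3.connect(":memory:")
    # One table, two layers: Occupation holds the source spelling verbatim
    # (Source layer); Occupation_Std holds the form used for querying
    # (Standardisation layer).
    con.execute("""CREATE TABLE Person (
        PersonID INTEGER PRIMARY KEY,
        Surname TEXT,
        Occupation TEXT,
        Occupation_Std TEXT)""")
    con.execute("INSERT INTO Person VALUES (1, 'Smith', 'husbandman', 'Farmer')")
    con.execute("INSERT INTO Person VALUES (2, 'Jones', 'husb.', 'Farmer')")
    # Query on the standardised field, but display what the source actually said.
    for surname, occupation in con.execute(
            "SELECT Surname, Occupation FROM Person WHERE Occupation_Std = 'Farmer'"):
        print(surname, occupation)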

When moving between layers it is important you are always aware of what belongs to the Source layer and what belongs to the other layers. This will usually be obvious (at least to you, if not to others), but there should be explicit rules defined (and recorded) explaining the demarcation between the layers. In other words the layers should be managed so that it is always possible to backtrack or undo a piece of interpretation that appears in the database in the event of you changing your mind about something. Similarly you should always be able to get from the result of a query which draws upon standardised data back to the original source data. Much of this layer management will occur through the structure of the database and will not require active intervention on your part, but it is worth remembering at all times that any given piece of data belongs to only one or other of your layers.

These are the principles of database layers, but as with everything else about database design, there is a degree of flexibility about how steadfastly you hold to them. For example, employing a source layer and a standardisation layer can mean entering lots of data twice in two different forms – the original version of an occupation, and the standardised version, for example – which will clearly slow down the data entry process. There is a judgement to be made by every historian creating a database as to how far these different layers should be used, and specifically, whether or not some pieces of information need both a source layer and a standardised layer. If you are unlikely to use a piece of information for analysis, then perhaps it does not need to have a standardised version; if you do not need to have a piece of information in its original form, then perhaps its source layer version is not required. These decisions need to be made carefully, as they impact directly upon how long your data entry will take and how easily you will be able to perform your analysis.


C5. Database definitions – tables, fields, records, values, rules and datatypes

This section of the Handbook will introduce a few definitions and database terms which will act as a shorthand for various concepts and processes that will be discussed in other sections.  Again, the issues discussed here will apply generally to all databases designed for historical research, as well as to most databases in general.

Harvey and Press provide a definition of a database:


“A database is a collection of inter-related data organised in a pre-determined manner according to a set of logical rules, and is structured to reflect the natural relationships of the data and the uses to which they will be put, rather than reflecting the demands of the hardware and software.”

For the full description see: Further Reading

which is a useful if perhaps formal way of describing both the content and environment of a database. Within the database itself, however, are a number of different ‘things’, called Objects,[1] which serve a variety of functions (these include tables where the actual data is stored, queries, user interfaces and so on). For the moment we will concentrate on only the first of these objects, the tables, and we will look at some of the terms connected with them.

There are four main elements to any table in a database, and each of these has (somewhat inevitably) a number of alternative names:

  • Table (also known as Entity)
  • Field (also variously known as Column, Variable, Attribute)
  • Record (also known as Row)
  • Field name (also known as Attribute name)

C5i – The database table, showing fields and records

C5ii – The database table, showing fields and records (containing data from 17th century parish registers) [2]

In each database, data are stored in tables, and most databases will have more than one table. These tables will have relationships between them which connect the information they contain and will thus be ‘relational’ (see Section D). Tables are made up of fields (columns) and records (rows). Each field contains one type of information, and for each record the information in that field will be of the same kind.


Database Rules

Returning to the ‘logical rules’ referred to by Harvey and Press, these can be summarised as follows:

  1. The ‘information rule’: all information in a relational database is represented in one way – as a value in a field in a table
  2. Each table in a relational database will have a name which uniquely identifies it and distinguishes it from the other tables; the table should contain information about only one entity
  3. Each field within a table is given a unique name within that table
  4. The values in each field must have the same logical properties, that is, they must be of the same datatype: numerical or text (we will be looking at datatypes shortly)
  5. Records will contain closely related information about a single instance of the entity (that is, the subject of the table – for example, the forename and surname of a single individual in a table about individuals)
  6. The order of records in a table must not be significant
  7. The order of fields in a table must not be significant
  8. Each complete record must be unique
  9. Each field should contain ‘atomic values’: that is, indivisible data (e.g. the first and last names of an individual should always be held in separate fields)

In many cases the database software will prevent you from breaking these rules, but in some cases it is possible to circumvent them, whether intentionally or otherwise. With those rules that can be broken, it is almost always unwise to do so, as it will generally lead to confusion (for the database if not its user) when it comes to performing analysis.
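To see why a rule such as number 9 matters in practice, consider this small Python illustration (the names are invented): a combined name field can only ever be sorted by its first word, whereas atomic fields can be sorted on either part:

    # Rule 9 in practice: one combined field sorts by forename only...
    combined = ["John Smith", "Ann Brown", "Peter Smith"]
    print(sorted(combined))  # ['Ann Brown', 'John Smith', 'Peter Smith']

    # ...whereas atomic (forename, surname) fields can be sorted by surname:
    atomic = [("John", "Smith"), ("Ann", "Brown"), ("Peter", "Smith")]
    print(sorted(atomic, key=lambda record: (record[1], record[0])))
    # [('Ann', 'Brown'), ('John', 'Smith'), ('Peter', 'Smith')]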

Datatypes

We have seen that databases are made up of tables, and that tables follow certain rules. One of the rules to be considered is that each field within a database must be of a certain datatype. There are a number of different datatypes, and although the names by which these datatypes are known will vary slightly from database software to database software, what they do and how they operate will be essentially the same.

In essence what datatypes do is to control what type of data is entered into a field. Each field in each table is assigned a datatype, usually ‘text’ or ‘numeric’, and this in turn dictates what kind of data can be entered into that field. The purpose of datatypes is partly to act as a kind of quality control mechanism, to help prevent data being entered incorrectly; and partly they are used to help the database understand the meaning of what is typed in. For example, the piece of data ‘23/03/2011’ will not be recognised as a date by the database if it is entered into a field that has the datatype ‘text’, whereas it will understand the meaning of this value if it is entered in a field with the ‘date/time’ datatype.
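A rough analogy in Python, using its datetime module as a stand-in for the database’s date handling: held as text, ‘23/03/2011’ is merely a string of characters, whereas parsed into a date type its meaning becomes available:

    from datetime import datetime

    as_text = "23/03/2011"
    as_date = datetime.strptime(as_text, "%d/%m/%Y")

    # The text value supports only string operations; the date value
    # knows its own meaning (year, month, day of the week and so on).
    print(as_text[6:])   # '2011', obtained by mere character slicing
    print(as_date.year)  # 2011, a genuine date component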

A brief description of the principal datatypes follows.

The basic datatypes: text

  • This is the default datatype for your fields, and will be assigned by the database whenever you add a new field to a table
  • This datatype will allow the entry of textual and numerical values, although it will treat the latter differently from numerical values entered into a ‘number’ datatype field
  • Fields with this datatype will generally allow a maximum of 255 characters to be entered


The basic datatypes: memo

  • Fields with this datatype are used for lengthy texts and combinations of text and numbers
  • Up to 65,000 characters can be entered (the precise number may change depending on the database software being used)
  • Data in these types of field cannot be sorted
  • Data in these types of field are difficult, although not impossible, to query


The basic datatypes: number

  • This datatype allows the entry of numerical data that can be used in arithmetical calculations
  • There are a number of variations of this datatype, which control aspects of the numbers that can be entered, such as the size of the numbers, the number of decimal places and so on:
    • Byte: stores numbers from 0 to 255 (no fractions)
    • Decimal: stores positive and negative numbers down to 28 decimal places
    • Integer: stores numbers from –32,768 to 32,767 (no fractions)
    • Long integer (default): stores numbers from –2,147,483,648 to 2,147,483,647 (no fractions)
    • Single: stores very large positive and negative numbers to 7 decimal places
    • Double: stores very large positive and negative numbers to 15 decimal places

The basic datatypes: date/time

  • This datatype enables the entry of dates covering the years 100 through to 9999
  • This datatype can be customised in order to control the format of dates that are entered
  • Warning: in Microsoft Access, the Office autocorrect feature may well change some of your dates if it is active (e.g. “02/12/04” will autocorrect to “02/12/2004” unless you enter the year in full)
  • This datatype will allow the sorting of records chronologically, which the same values entered into a text datatype field would not (the latter would sort the records alphabetically – alphabetically “01/01/2010” would come before “31/12/1245”)
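The last point is easy to demonstrate; a brief Python sketch using the same two values:

    from datetime import datetime

    dates = ["01/01/2010", "31/12/1245"]
    # Sorted as text, '01/01/2010' comes first although it is centuries later:
    print(sorted(dates))
    # Parsed into real dates, chronological order is restored:
    print(sorted(datetime.strptime(d, "%d/%m/%Y") for d in dates))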


The basic datatypes: currency

  • This datatype allows the entry of numerical data used in mathematical calculations involving values with one to four decimal places, with the inclusion of currency symbols


The basic datatypes: autonumber

  • This datatype automatically generates a unique sequential or random value whenever a new record is added to a table
  • AutoNumber fields cannot be updated, that is, you cannot enter data into them manually


The basic datatypes: yes/no

  • A field with this type of datatype will only contain one of two values (Yes/No, True/False, or On/Off)
  • Quite often database software will represent this type of field as a checkbox in the table
  • The underlying stored value of a yes/no field will be -1 (for yes) or 0 (for no)


The basic datatypes: OLE

  • A field with this datatype is one in which another file is embedded, such as a Microsoft Excel spreadsheet, a Microsoft Word document, an image, a sound or video file, an HTML link, or indeed any other type of file

Every field in every table will need to have one of these datatypes assigned, and the decision as to which type is chosen should be factored into the database design process (see Section E). For most fields the datatype chosen will be either ‘text’ or ‘number’. Keep in mind how these two datatypes treat numerical data differently, particularly in terms of how they cause data to be sorted:

  • 1,10,11,2,3,4,5,6,7,8,9 is how data will be sorted if the datatype is ‘text’ (that is, alphabetically)
  • 1,2,3,4,5,6,7,8,9,10,11 is how data will be sorted if the datatype is ‘number’ (that is, numerically)
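The same contrast can be reproduced in a few lines of Python:

    values = [1, 10, 11, 2, 3, 4, 5, 6, 7, 8, 9]
    # As text the sort is alphabetical: '10' and '11' follow '1' directly.
    print(sorted(str(v) for v in values))
    # As numbers the sort is numerical.
    print(sorted(values))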


[1] For more on Objects please sign up to one of our face-to-face Database courses.

[2] Note in this table the Forename, Surname, Occupation, Residence and Comments fields contain values taken directly from the source, and are part of this database’s Source layer. The Stated_Sex field contains a standardised value (in fact a code), and is part of the Standardisation layer.


C6. Conclusion

All of the foregoing concepts and aspects of database organisation need to be taken into consideration when designing the database that will be used in your historical research, as each of them will either have an implication for how some of your historical information will need to be converted into suitably shaped data (see Section E), or else they will have an impact upon the types of linking, counting and analysing it will be possible to perform on your data. Designing the database will not only involve modelling the information from your sources, but it will also involve making decisions about the conceptual approach of your database, the functions you want it to perform, and how you want the tables and fields of the database to operate at the most minute of levels.


D1. Introduction

As mentioned in Section C, most databases consist of data held in more than one table, and this is especially true for databases where the data is derived from historical sources. Relationships are created between the tables to connect the data in one to the data in the other: more precisely, relationships are used to connect specific records in one table to specific records in another. In many ways relationships, and the whole relational data model, comprise the most difficult aspect of designing a database, and not necessarily because they are difficult to actually create. What is difficult about relationships is why we need them – the reasons for using related data can seem obscure and unnecessary at the start of a database project, especially if you have limited experience of using databases. They are, however, extremely important. In essence what relationships allow us to do is two-fold: firstly they allow us to simplify very significantly the process of data entry (and incidentally at the same time enable us to protect the quality of the data we enter by limiting data entry errors); and secondly they serve to ensure that the results of our queries are accurate by making it clear precisely what it is that is being queried.


D2. Functions of relationships

These dual functions of relationships are best illustrated with an example.

Imagine a database which contained data about people and the hats that they owned, comprising personal information about name, gender, age and so on, as well as information about hat type and the materials used in their manufacture. There are two ways that this database could be designed:

  • A single (flat file) table data model, where all of the information about people and hats was entered into the same table
  • A relational data model where two tables are created, one to contain information about people, and one to contain information about hats

Both scenarios are feasible and will allow you to conduct a detailed analysis of people and their hats, but each brings with it some very significant consequences if chosen.


Scenario A: all information in one table

Scenario B: two related tables
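As a minimal sketch of Scenario B, here are the two related tables expressed in Python’s built-in sqlite3 module (the field names follow the People and Hats examples used later in this section; the key fields that anchor the relationship are explained in Section D4). The essential point is that a person’s details are entered once, however many hats they own:

    import sqlite3

    con = sqlite3.connect(":memory:")
    # Scenario B: two tables, related through the PersonID field.
    con.execute("""CREATE TABLE Person (
        PersonID INTEGER PRIMARY KEY,
        Forename TEXT,
        Surname TEXT)""")
    con.execute("""CREATE TABLE Hat (
        HatID INTEGER PRIMARY KEY,
        PersonID INTEGER REFERENCES Person(PersonID),
        HatType TEXT)""")
    con.execute("INSERT INTO Person VALUES (2, 'Philip', 'Marlowe')")
    # Two hats for the same person: only the PersonID is repeated, not the
    # name, so the duplication problem of Scenario A never arises.
    con.execute("INSERT INTO Hat VALUES (2, 2, 'Bowler')")
    con.execute("INSERT INTO Hat VALUES (3, 2, 'Crash helmet')")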


D3. Types of relationship

Once the need for related tables arises, it is important to understand that there are different kinds of relationship that can exist between two tables. These differences are not technical, in the sense of being dictated by or a product of the mechanisms of the database; they are instead a function of the logical, semantic connection between the information in the two tables.

D3i – The three types of relationships

There are three types of relationship that can exist between two tables in a database, not all of which are useful or desirable.


One-to-one relationships:

  • This relationship exists where a record in Table A can only have one related record in Table B, and a record in Table B can only have a single matching record in Table A
  • For example, an MP can have only one constituency, and a constituency can have only one MP

This type of relationship is unusual in a database, as in many cases where a one-to-one relationship exists the information in the two tables could exist in a single table. This is not to say that if, as a part of your Entity Relationship Modelling process (see Section E), you discover a one-to-one relationship between two of your designed tables you should redesign the tables into a single one, only that you can if you want to. For example, if you wanted to, you could create a single table to enter information about Members of Parliament and about their constituency, as when it came to entering this data you would not encounter the duplication of information that was problematic in the example of people and hats in Section D2. On the other hand, tables are supposed to be discrete entities, and so logically speaking you might prefer to conceive of ‘MPs’ and ‘Constituencies’ as two different entities, and thus two different tables. The important thing to remember with one-to-one relationships is that the database software that you use to build your database will allow you to create this kind of relationship, and that it will not create any problems when it comes to running queries.


One-to-many relationships:

  • This relationship exists where a record in Table A can have none, one or more matching records in Table B, but a record in Table B can have only one matching record in Table A
  • For example, a mother can have more than one child, but a child can have only one biological mother

This is the most common type of relationship found in databases, and is usually the type that you want to build into your designs. As illustrated in the people and hats scenario (Section D2), this type of relationship is used to overcome the kinds of problems that arise within the database when the information drawn from the sources would require the duplication of data if entered into a single table.


Many-to-many relationships:

  • This relationship exists where a record in Table A can have none, one or many matching records in Table B, and a record in Table B can have none, one or more than one matching record in Table A
  • For example, an author can write more than one book and a book can be written by more than one author

If you discover this kind of relationship operating within your database design at the end of the Entity Relationship Modelling process, then you have a problem which will need to be addressed before you can proceed to actually building the database. Many-to-many relationships will not work in databases, as they will essentially break any query you try to run on the related tables, throwing the query into a ‘loop’ which will generate gibberish as results.

Given how problematic this type of relationship is, it is somewhat disheartening to see how frequently they crop up when modelling historical information! Dealing with a many-to-many relationship requires something of a conceptual leap, as it requires the creation of a table, sometimes called a Junction Table, to sit between the two related tables. This Junction Table will act in an abstract fashion – the data it contains will not be information as such, but will serve to split the many-to-many relationship into two one-to-many relationships.

D3ii – Many-to-many relationship between Author and Book tables

Take the database which contains a table about Authors and a table about Books, which might be designed according to the Entity Relationship Diagram depicted in image D3ii (for Entity Relationship Diagrams see Section E2). The arrowheads indicate the ‘many’ side of a relationship, here indicating that both tables are on the ‘many’ side, clearly highlighting a problem. To overcome the many-to-many relationship, we would insert a Junction Table to split the relationship into two one-to-many relationships, as indicated in image D3iii.

D3iii – Many-to-many relationship between Author and Book tables split with a Junction Table

Note that each record in the Junction Table contains three fields: a unique ID for each record (Junction ID), and then a field for each of the Author IDs and Book IDs. Each record therefore becomes a unique combination of Author and Book IDs, which indicates which books were written by which authors:

D3iv – Many-to-many relationship between Author and Book tables split with a Junction Table – showing data

The Junction Table here is effectively circumventing the many-to-many relationship between books and authors, and each record it contains acts as a statement linking one or more authors with one or more books. The first two records in the Junction Table, for example, indicate that Author ID 1 was the writer of Book IDs 1 and 2, whilst the last two records indicate that Book ID 9 was co-authored by Author IDs 2 and 5. The relationship between books and authors is managed by the Junction Table, whilst the details about books and authors are kept in their respective tables.

This arrangement, whilst somewhat convoluted, will enable the database to run queries that draw on information in both the Book and Author tables when it would otherwise not be able to due to the many-to-many relationship. It is therefore a very valuable technique to bear in mind when identifying relationships between tables as part of the database design process.
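A sketch of the Author–Book arrangement in Python’s sqlite3 module (the names, titles and record IDs are invented for illustration, echoing the co-authored Book ID 9 above); the final query shows that, once the Junction Table is in place, authors and books can be queried together without difficulty:

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE Author (AuthorID INTEGER PRIMARY KEY, Name TEXT);
        CREATE TABLE Book (BookID INTEGER PRIMARY KEY, Title TEXT);
        -- The Junction Table: each record pairs one author with one book.
        CREATE TABLE AuthorBook (
            JunctionID INTEGER PRIMARY KEY,
            AuthorID INTEGER REFERENCES Author(AuthorID),
            BookID INTEGER REFERENCES Book(BookID));
        INSERT INTO Author VALUES (2, 'Author Two'), (5, 'Author Five');
        INSERT INTO Book VALUES (9, 'A Co-authored Book');
        -- Book ID 9 has two authors: two junction records, one per author.
        INSERT INTO AuthorBook VALUES (1, 2, 9), (2, 5, 9);
    """)
    # Two one-to-many joins, via the Junction Table, replace the broken
    # many-to-many relationship.
    for name, title in con.execute("""
            SELECT Author.Name, Book.Title
            FROM Author
            JOIN AuthorBook ON Author.AuthorID = AuthorBook.AuthorID
            JOIN Book ON AuthorBook.BookID = Book.BookID"""):
        print(name, title)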


D4. Primary and Foreign Keys

Primary and Foreign Keys are absolutely crucial to the running of a relational database because they serve as the ‘anchors’ that keep relationships connected to the relevant tables. The keys are fields that serve a particular purpose within a table: they are not used to capture information drawn from the sources, but instead to keep track of the information the database needs in order to know which records in one table are connected to which records in the related table.

One of the ‘rules’ of database design (Section C5) states that ‘each complete record must be unique’, meaning that when taken as a whole, each record in a table must be unique. With historical sources, this can sometimes be problematic, as quite often the same information can crop up repeatedly over a period of time – such as, for example, when an individual is named as a witness to a number of wills across a decade. To make sure that the records of a table obey this particular rule of databases, it is necessary to guarantee that each record will be unique, and if the nature of our historical information prevents us from being able to guarantee this, we are forced to cheat. In doing so, we actually achieve a number of useful effects in the design of our tables.

The way that we guarantee that each record in a table will be unique is to add a field into which we can enter (or have the database enter automatically) a unique identifier, a value which will be different for every record that is added to the table. Normally this is a sequential number, such as the values in the various ID fields in the People and Hats database (Section D2). A sequential number, different for every record, will by itself guarantee that each record as a whole is unique, because the ID value is never duplicated from one record to the next.

However, this field which provides us with a unique value for every record also serves as the Primary Key field for a table, this being the field that acts as the anchor for the ‘one’ side of a one-to-many relationship. The field that anchors the other side of the relationship, sitting in the table on the ‘many’ side, is known as the Foreign Key. The Foreign Key field will not contain a unique value for every record: because it is on the ‘many’ side of the relationship, the same ID value is likely to occur in more than one record. Both the Primary and Foreign Key fields contain the same kind of information: the ID values that tie the related records together.

Consider the People and Hats database again:

Person table:

PersonID  Forename  Surname  Occupation  Residence  Age  Gender  HairColour  Income
1         Samuel    Spade    Builder     Peckham    23   M       Blonde      £25,000
2         Philip    Marlowe  Accountant  Dulwich    35   M       Dappled     £34,533
3         Hercule   Poirot   Plumber     Chelsea    26   M       Bald        £250,000
4         Miss      Marple   Doctor      Surbiton   58   F       Grey        £1.50

Hat table:

PersonID  HatID  HatType        HatColour      HatMaterial
1         1      Baseball cap   Black/red      Polyester
2         2      Bowler         Pink           Satin/fur
2         3      Crash helmet   Green          Kevlar/fur
3         4      Stetson        Brown          Leather
3         5      Captain’s hat  White/blue     Silk/braid
4         6      Nightcap       Grubby yellow  Rags

The relationship is one-to-many – a person can have many hats – and the relationship is anchored on the PersonID field which is present in both tables. The PersonID field contains the unique ID number that is assigned to every person entered in the Person table; because every person is entered only once in the Person table, the PersonID field in the Person table is unique, and it is the Primary Key. In the Hat table, because a person can own more than one hat, the PersonID value will not be unique – the same ID number will crop up every time a record contains a hat owned by that person (as is the case for persons 2 and 3). The PersonID field in the Hat table is therefore the Foreign Key.
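A sketch of the keys at work, again using Python’s sqlite3 module and a few of the records from the tables above; the join matches each Hat record to its owner through the PersonID fields:

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE Person (PersonID INTEGER PRIMARY KEY, Surname TEXT);
        CREATE TABLE Hat (
            HatID INTEGER PRIMARY KEY,
            PersonID INTEGER REFERENCES Person(PersonID),
            HatType TEXT);
        INSERT INTO Person VALUES (2, 'Marlowe'), (3, 'Poirot');
        INSERT INTO Hat VALUES (2, 2, 'Bowler'), (3, 2, 'Crash helmet'),
                               (4, 3, 'Stetson'), (5, 3, 'Captain''s hat');
    """)
    # PersonID is the Primary Key in Person and the Foreign Key in Hat.
    for surname, hat in con.execute("""
            SELECT Person.Surname, Hat.HatType
            FROM Person JOIN Hat ON Person.PersonID = Hat.PersonID
            ORDER BY Person.Surname, Hat.HatType"""):
        print(surname, hat)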

When designing the database, there is a stage in the process where you will need to think about what fields will appear in your tables (see Section E2), and it is very important when doing so that you remember to identify the Primary and Foreign Keys for your tables. Every table you design should have a Primary Key field – a field with the datatype ‘autonumber’ which will generate a unique value for every new record you add. Not every table will have a Foreign Key field, only those that are on the ‘many’ side of a one-to-many relationship. When you do add a Foreign Key field to a table, remember that it will contain the same information (that is, the ID numbers) drawn from the field that is the Primary Key for that relationship. Without the Key fields, the database software will not be able to correctly manage the relationship, with the result that it will not be able to identify which records from the one table are connected to which from the other, a situation that will make performing analysis or even simple retrieval of information all but impossible.


E1. Introduction

Throughout this Handbook so far, reference has been made to the translation and conversion processes involved in taking information from sources and turning it into data within the database. This section describes precisely the tasks involved in performing these processes, which are collectively known as Entity Relationship Modelling (ERM). The mechanics of ERM are in fact a lot less intimidating than the name implies, but it is nevertheless a complex activity, and one that is likely to prove challenging at the first few attempts. Luckily, however, the various stages of ERM draw very heavily upon the skills and experience that the historian utilises as a matter of course during their research anyway, which, unlike most aspects of database use, places the historical researcher at something of an advantage. The difficulty of the ERM process is directly proportional to the complexity of the source(s) being used in the research, with some types of sources being (relatively) simpler to model than others. Highly structured sources like census returns, lists of inhabitants, poll books and so on will be easier to model than ‘semi-structured’ sources such as probate inventories, which in turn will present fewer problems than completely unstructured material such as narrative texts and interviews. However, all will have their own particular features and problems to complicate the modelling.

The process of ERM serves a number of purposes. Firstly, it makes the historian decide upon what it is the database is to achieve in terms of its functions. Secondly, it identifies the types of information that can be obtained from the sources, and in conjunction with the database’s chosen aims, aids the historian in deciding upon which information from the sources should be entered into the database, and which can be excluded. Thirdly, ERM makes the historian think in detail about the components of the database – its tables, fields, relationships, datatypes, and so on – decisions on all of which are crucial to a successful database design. Finally, it encourages the consideration of the layers of the database: what information needs to be entered into both the Source layer and the Standardisation layer, what can be entered only into the latter, and how extensive the latter needs to be. Once these tasks have been conducted, the historian is left with a very precise idea of what the database will look like, and, on a more practical note, will be left with the design of their database on paper (an Entity Relationship Diagram [ERD]).

 

 

E2. Entity Relationship Modelling step-by-step

Entity Relationship Modelling (ERM) is where you model the information from your sources to fit into a structure that will form the basis of your database design. The product of ERM is an Entity Relationship Diagram (ERD), which will depict three components: entities (tables), attributes (fields) and the relationships between them.

 

Stage 1: Determine the purpose of the database

This stage is always the starting point of the ERM process, and is especially important if you are adopting the Method-oriented approach to your design, but it is a vital step in designing any database. No database exists without a purpose: with a Method-oriented database it is crucial that you know exactly what you want to do with your data before you design the database. At this stage you will need to make your decisions about what information you want to keep and what you want to discard, and you will need to be prepared to abide by the consequences of this decision throughout the lifecycle of your database project. This point deserves to be made emphatically: although it is always possible to retrofit the design of your database to include information that you had initially decided to discard, it is not always a trivial matter to do so, particularly if you have to enter another phase of data entry to collect the new information.

 

Stage 2: List entities

Once you know what you want your database to do, you need to divide your anticipated information into discrete subjects: each subject, or entity, will evolve into a separate table.[1] We split the entities into separate tables for the purposes of efficiency, for avoiding ambiguity, and because this allows us the maximum flexibility for querying once the data are in.

This stage of the ERM process sounds deceptively simple, but is in fact probably the most difficult step of the whole process, and it is certainly worth reading up on the subject (see the Further Reading section).

For example, consider a research project investigating political elections in eighteenth-century Bristol, where the sources consist of a collection of poll books recording the votes cast by the local electorate in the communities where elections took place (image E2i):

E2i – An example page of an 18th-century poll book[2]

With sources such as these we might pursue a research question along the lines of: ‘Analyse the geographic and economic determinants of voting in early 18th-century Bristol’. With a question like this we would be interested in geography, economic status and voting patterns in relation to a group of individuals. In terms of entities, we might conclude that there is only one: people – or, more precisely, voters – which would lead us to decide that we need a table into which to enter our information about voters.

Alternatively if we were using probate materials for our research, and wanted to create a database into which we could enter information extracted from wills, we would need to consider the entities from this source. We might conceive of our entities – our discrete subjects – as breaking down into ‘document’, with each will being treated as a different document; ‘person’, containing information about different types of people and their roles – testator, recipient, executor etc.; ‘bequest’ with a range of types of associated information; and ‘object’, being the object of a bequest. If our research was interested in the material culture of a period or place, this latter entity would be particularly important, whereas if the project was concerned only with networks of people and social interrelations, the ‘object’ entity might not be necessary.

In both of these examples the entities have been presented as possibilities, because the definition of entities is as much a product of the purpose of the database as it is of the information contained in the sources. As mentioned at the beginning of this Handbook, no two databases will be designed in the same way, as no two databases will be built for the same purposes – different designers may well identify different entities based upon their unique appreciation of their research and their sources. And this is why, to a large extent, this step of the ERM process is the most difficult!

However it is perhaps worth considering the inclusion of three commonly chosen entities in the design of your database:

  • People – with a related entity of ‘role’ (being the reason why they are present in the source)
  • Document – where archival and bibliographical material can be entered (thus enabling the tracking of every piece of data in the database back to its source)
  • Event – a slightly more abstract entity, one which describes an instance of whatever it is your source records (a trial, a taxation assessment, an election etc.) and where information about dates can be recorded.

 

Exercise

Assume you have a research project that is examining the demographic profile of families in a particular parish, and your principal source will be ecclesiastical registers of baptisms, marriages and burials. List the entities from this source that would need to be included in your database design.[3]

 

Stage 3: Identify the relationships between the entities

Bearing in mind the nature of relationships between different elements of information (see Section D), this step of the ERM process requires you to identify which of your entities are related, and what type of relationship exists between them. This is an exercise in abstract logic, and it will take considerable practice; quite often this stage will also require revisiting Stage 2 and redefining the entities you originally chose.

If we return to the database of wills mentioned in Stage 2, with the entities ‘document’, ‘person’, ‘bequest’ and ‘object’, we would need to unpick the nature of the relationships between these entities. We might logically decide that the relationships would look something like this (the arrowheads depict the ‘many’ side of a one-to-many relationship; see Section D):

E2ii – Example of relationships identified between entities (wills)

A single document (will) can contain information about more than one person, and also about more than one bequest, whilst a bequest can include information about more than one object, so all of these relationships are one-to-many.
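To see how the diagram might translate into table structures, here is a hedged sketch in SQLite (via Python’s sqlite3 module) with hypothetical field names: each arrowhead in image E2ii becomes a Foreign Key field on the ‘many’ side of its relationship.

import sqlite3

cur = sqlite3.connect(":memory:").cursor()
cur.executescript("""
    CREATE TABLE Document (
        DocumentID INTEGER PRIMARY KEY                       -- one record per will
    );
    CREATE TABLE Person (
        PersonID   INTEGER PRIMARY KEY,
        DocumentID INTEGER REFERENCES Document (DocumentID)  -- many persons per document
    );
    CREATE TABLE Bequest (
        BequestID  INTEGER PRIMARY KEY,
        DocumentID INTEGER REFERENCES Document (DocumentID)  -- many bequests per document
    );
    CREATE TABLE Object (
        ObjectID   INTEGER PRIMARY KEY,
        BequestID  INTEGER REFERENCES Bequest (BequestID)    -- many objects per bequest
    );
""")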

 

Exercise

Identify the relationships between the entities chosen for the parish register database in Stage 2.

 

Stage 4: Investigate and rectify problems

This stage is fairly self-explanatory. It is possible to spot problems with the incipient design even at this relatively early point in the ERM process, and if they exist it is better to do so here than after investing work in the later stages.

Look out in particular for:

  • Relationships which do not appear to be one-to-many: remember you cannot have entities related by a many-to-many relationship, and whilst you can have them related through a one-to-one relationship, it may be worth rethinking the two entities involved (see Section D)
  • Redundant relationships: if entities can be linked in more than one way, you should work out which link should be kept and which should be discarded – if Table A is related to Table B, and Table B is related to Table C, then Tables A and C are already by definition related, and do not need a ‘direct’ relationship to exist between the two

 

Stage 5: List all attributes associated with each entity, and identify keys

This stage involves listing the precise attributes of each entity that has been identified in the previous stages of the ERM process, by deciding on the fields that should occur in each table. Each field is to contain one piece of information about the subject of the entity, and that one piece will be a single aspect of information that can be known about the entity (see Section C5). Once you have listed the attributes for each entity, you must then identify which fields will act as the Primary and Foreign Keys in each table, remembering that relationships do not exist between tables as wholes, but between a specific field in one table and a specific field in the related table (see Section D).

E2iii – Example of attributes and keys identified within entities (wills)[4]

You should, as a matter of course, add a generic ‘Notes’ field to each table, with the memo datatype, as it is likely to prove invaluable at moments of crisis when entering data (see Section F)!

Stage 6: Construct the Entity Relationship Diagram (ERD)

Once you have completed stage 5, you are in a position to create the ERD for your database design, which will resemble the heavily simplified example above (image E2iii).

 

Exercise

Construct an ERD for the parish register database, ensuring that you have identified your entities and their attributes, the relationships between the entities and the fields which act as keys.

 



[1] In fact they may evolve into separate groups of related tables, but for the moment this does not need to be considered.

[2] An Exact List of the Votes of the Freeholders and Freemen of the City and County of Bristol taken at the Election of Members of Parliament (Bristol, 1722) p.19.

[3] If you are not familiar with parish registers as sources, it is possible to find examples of transcripts (of varying quality) online. One such set of transcripts, for the county of Hampshire, can be found at http://www.knightroots.co.uk/parishes.htm (accessed 23/03/2011).

[4] Terms in black are entity (table names), those in red are primary keys, those in green are foreign keys and those in grey are the remaining attributes/fields, some of which pertain to the database’s Source layer, and some to its Standardisation layer. It is very important to note that this diagram is entirely illustrative, and is not intended to be prescriptive or definitive in any way!

 

 

E3. Conclusion

The process of Entity Relationship Modelling (ERM) is difficult, and rapidly becomes more difficult if you are blessed with a number of different kinds of sources, each of which contains rich information about a variety of subjects. If you are using multiple sources, it is a good idea to avoid creating entities that are source specific: for example, if you are using census returns and taxation lists, both of which contain information about people, do not create two tables for people (one containing the information from one source, the other from the second). Stick to the abstract logic of the information – what is important to your research is people, so accommodate all of the information about people in the same place. Not only does this make sense from the point of view of logic, but it will also make it much easier to find data about specific individuals later on (either manually or via queries): looking for a person is easier to do if everyone is located in one table rather than several.[1]
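As a minimal sketch of this point (SQLite via Python, with invented data and illustrative field names – a fuller design would record the source in a related Document table rather than in a plain text field):

import sqlite3

cur = sqlite3.connect(":memory:").cursor()
cur.execute("""CREATE TABLE Person (
                   PersonID INTEGER PRIMARY KEY,
                   Sname    TEXT,
                   Source   TEXT)""")
cur.executemany("INSERT INTO Person (Sname, Source) VALUES (?, ?)",
                [('Smith', 'Census return 1881'),
                 ('Smith', 'Taxation list 1873')])

# Because everyone is in the one table, a single query finds every instance
# of the surname, whichever source it came from.
print(cur.execute("SELECT * FROM Person WHERE Sname = 'Smith'").fetchall())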

No Entity Relationship Diagram (ERD) will ever be perfect; as with so much else involved in database design, it will be a matter of compromise. The success of an ERD can be determined in only one way – by the database performing the tasks it was intended and designed to do – and this will not become evident until after you have begun entering data and using the database for analysis. This is why the creation of the ERD is (or should be) swiftly followed by a period of intense testing of the database ‘in action’, in order to identify quickly where the design is impeding the database’s purpose (see Section G).



[1] Ultimately of course this is a matter of personal judgement: you may decide that your entity is not ‘people’, but is in fact two separate entities comprising ‘census return’ and ‘tax payer’, in which case you would be able to argue for two separate tables. You would still face the problem of having to look for individuals in more than one table, however, should the need ever arise.

 

 

F1. Introduction

This section of the Handbook is in some ways the most crucial, as it addresses a range of issues that affect the historian in particular during the course of designing their database. Most of these issues arise through the variability and ambiguity that inevitably accompany the information present in historical sources, and which is not something that non-historical database creators encounter: usually the latter will know precisely what kind of information they will have, and can therefore comfortably predict how the database will need to be structured to make best use of it. Historians will not be able to predict the nature of their information so confidently, even if they know their sources intimately, as there will almost always be instances where the sources confound them – be it through extra information cropping up in the material which appears to be beyond the scope of the source, or through missing or illegible text, or through marginalia, deletions and so on.

This variability or ambiguity in the type and shape of information within the sources (whilst one of the more exciting aspects of historical research) is not the only occasion where the materials and practices of the historian create difficulties when a database is brought into the equation. Chronology, topography, geography, orthography and a range of other historical contexts introduce an element of ‘fuzziness’ into the data, and fuzziness is anathema to the strictures of the relational database model. The irregularity of historical information must be managed through the design of the database in such a way as to maintain a balance between keeping the detail and richness of the source to the extent required by the project, whilst at the same time applying enough standardising and reshaping of the information to allow the database tools to operate with maximum efficiency and flexibility.

At the root of most of these ‘interfacing’ problems is the fact that databases do not ascribe meaning to any piece of information you enter into them. The database does not recognise words or numbers or dates: when you enter data into a table, all the database understands is that a field contains a value consisting of a collection of characters organised in a particular order (known as a ‘string’). When two strings are identical, such as might occur with the values appearing in the Surname field of a ‘Person’ table across two records, the database understands that these two pieces of data are the same and have the same meaning. If two strings are not identical, then the database does not recognise them as meaning the same – even if to the historian it is clear that ‘Jno’, ‘John’ and ‘Johan~’ are all ways of representing the same piece of information.

For example (image F1i), when running a query in a database containing the names of members of a London Livery Company from the sixteenth to twentieth centuries, one would be able to search for everyone named John Smith:[1]

F1i – Query results of a search for individuals called John Smith (using standardised form of name)

As we see, there are 275 individuals called John Smith in the database. We only know this, however, because we chose to run the query using criteria in the Sname field – the field which contains the surname of individuals – and the StdFname field – the field which contains the standardised version of the forename. If we ran the same query using the criteria “John” and “Smith” on the non-standardised version of the forename (in the field Fname), we would retrieve a smaller number of records:

F1ii – Query results of a search for individuals called John Smith (using non-standardised form of name)

This time only 169 records are retrieved. The difference between the two sets of results arises from the variant spellings of the forename ‘John’ (which can be seen in the first image). When we searched for ‘John’ in the standardised field, everyone called John had the same value; when we searched in the non-standardised field, where the original spelling of the name had been entered, the record of anyone whose name was not spelt exactly as “JOHN” was missed by the query. Whilst the historian knows that all the variant spellings in effect mean the same thing, the database cannot recognise this so long as the strings are not identical. Thus, in order for our queries and analyses to be performed accurately by the database, we have to make sure that data which means the same thing can be identified by the database as meaning the same thing.[2]
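The effect can be reproduced in miniature (SQLite via Python’s sqlite3 module, with three invented records standing in for the Livery Company data used above):

import sqlite3

cur = sqlite3.connect(":memory:").cursor()
cur.execute("CREATE TABLE Person (Fname TEXT, StdFname TEXT, Sname TEXT)")
cur.executemany("INSERT INTO Person VALUES (?, ?, ?)",
                [('John',   'JOHN', 'SMITH'),
                 ('Jno',    'JOHN', 'SMITH'),
                 ('Johan~', 'JOHN', 'SMITH')])

# Querying the standardised field matches all three records...
print(cur.execute("""SELECT COUNT(*) FROM Person
                     WHERE StdFname = 'JOHN' AND Sname = 'SMITH'""").fetchone()[0])  # 3

# ...but querying the original spellings matches only the identical string.
print(cur.execute("""SELECT COUNT(*) FROM Person
                     WHERE Fname = 'John' AND Sname = 'SMITH'""").fetchone()[0])     # 1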



[1] For detailed guidance on how to run queries in databases, please sign up to our face-to-face Database courses.

[2] The problem of non-identical strings not being recognised as meaning the same thing can be mitigated to some extent through the use of various querying tools, but even then it can still lead to inaccurate query results.

F2. Problematic information

There are certain categories of historical information which are habitually problematic, and unfortunately these tend to be those subjects that often constitute analytical units, namely geography, chronology and orthography.

 

Geographical information

The problem with geographical information as it occurs in historical sources is that the boundaries of administrative units overlap and change over time, so that the same physical location can belong to different counties/parishes/wards/precincts and so on depending upon the date of the source being consulted. Obviously this means that if your sources cover a long period of time, you will need to be aware of what implications any boundary changes in that period may have for your data. This is especially true if you are recording data in a hierarchical fashion: for example, if you have a field in a table for ‘Parish’, and another for ‘County’, and every record will be given a value in each field. If the parish of St Harry Potter is situated in the county of Hogwartshire at the beginning of the 17th century, then records connected with this parish would have these two values entered into the respective fields in the table. If, however, administrative changes in the 18th century alter the county boundaries so that St Harry Potter suddenly belongs to the county of Elsewhereshire, then the later records will have the values of St Harry Potter in the parish field and Elsewhereshire in the county field. Whilst this is accurate, it causes a problem for the database, in that you will have a number of records with the same string in the ‘Parish’ field – which will therefore be recognised by the database as meaning exactly the same thing – but which historically speaking have different meanings at different points in time.

In this instance there are two ways of dealing with the issue. Firstly, you simply stay aware of the problem, and when running queries on parishes you take the ‘County’ field into account as well as the ‘Parish’ field. This will enable you to specify which version of the parish of St Harry Potter you are analysing. Secondly, you could modify the Parish value to specify which version it is: instead of entering St Harry Potter, you could enter St Harry Potter: Hogwartshire or St Harry Potter: Elsewhereshire into the Parish field. This would simplify the running of queries in this situation, but it would technically break the database rule about ‘atomic values’ (see Section C5, Rule no.9).
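A small sketch of the first approach (SQLite via Python, invented data continuing the fictitious parish above): the ‘County’ field supplies the disambiguating criterion that the ‘Parish’ field alone cannot.

import sqlite3

cur = sqlite3.connect(":memory:").cursor()
cur.execute("CREATE TABLE Record (Parish TEXT, County TEXT, Year INTEGER)")
cur.executemany("INSERT INTO Record VALUES (?, ?, ?)",
                [('St Harry Potter', 'Hogwartshire',   1605),
                 ('St Harry Potter', 'Elsewhereshire', 1760)])

# A query on the Parish field alone conflates the two historical meanings;
# adding the County field isolates the version of the parish we want.
print(cur.execute("""SELECT * FROM Record
                     WHERE Parish = 'St Harry Potter'
                     AND County = 'Hogwartshire'""").fetchall())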

This particular problem is even more significant when it is not just the geographical boundaries that change, but the actual entities themselves. For example, London had over 100 parishes in the early part of the 17th century, many of them absolutely tiny in terms of area and population. After the Great Fire, the opportunity was taken to rationalise the parishes, with the result that many were merged or united, often with the newly created entity retaining the name of one of the pre-Fire parishes, whilst each parish still maintained its own existence for some administrative purposes (e.g. St Martin Ironmonger Lane and St Olave Jewry). Here the problem is not one of changing hierarchy (which parish belongs to which county), but one of meaning (what area/population is the source referring to at this date when it refers to ‘St Martin Ironmonger’?). Various approaches to solving this are used, including those described for the preceding example, but what is most important is to be clear in the data at all times precisely what is meant by the geographical terms you enter into the database.

 

Chronological/dating information

All of the possible problems created by shifting geographical terminology apply equally to the identification of dates in historical data. This is clearly a more serious issue the further back in history your sources were generated, when calendars and dating systems were more varied and plentiful, and record-keepers had more choice of dating system. The important thing to remember here, as with geography (and indeed everything else entered into the database), is that the database does not recognise meaning. The database will have no concept of when the ‘Friday after the Feast of the Decollation of St John the Baptist in the thirty-first year of Henry III’ was,[1] which means that this date, as a value, cannot be treated chronologically by the database (that is, sorted or queried by date). Regnal years, mayoral years, feast days, the days of fairs and markets etc., when used to date information in the sources, will need to be converted into a value that uses a modern date format. Alongside this there is of course the issue of the shift from the Julian to the Gregorian calendar, so that if your data spans 1752 you will need to convert years into either the Old Style or the New Style system.[2]

Do not forget the datatype of the field into which dating information will be entered (see Section C5), bearing in mind that ‘Text’ datatype fields will sort dates alphabetically whereas ‘Date/Time’ datatype fields will sort them chronologically.
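The two-field approach mentioned in note [2] below can be sketched as follows (plain Python; the records are illustrative, and the standardised field uses the year-month-day form so that a text sort is also a chronological sort):

# Invented records: the date verbatim from the source in one field, and a
# standardised rendering (year-month-day) in a second field used for sorting.
records = [
    {'SourceDate': 'Friday after the Feast of the Decollation of St John '
                   'the Baptist, 31 Henry III',
     'StdDate': '1247-08-30'},
    {'SourceDate': 'Michaelmas, 30 Henry III',
     'StdDate': '1246-09-29'},
]

# An alphabetical sort on the verbatim field would put the 1247 date first
# ('F' before 'M'); sorting on the standardised field is chronological.
for record in sorted(records, key=lambda r: r['StdDate']):
    print(record['StdDate'], '-', record['SourceDate'])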

 

Orthography/variant forms

This is the really big area in which historical sources provide information that is problematic for the database: how do you deal with information that appears with many different spellings or in entirely different forms when in reality it means the same thing (or at least you wish to treat it as the same thing)? How will you deal with contractions and abbreviations, particularly when they are not consistent in the source? How will you accommodate information that is incomplete, or that is difficult to read or understand, where you are uncertain about its meaning? All of these issues are practically certain to crop up at some point in the data entry process, and all of them will need to be addressed to some extent to prevent problems and inaccuracies arising during the analysis of your data (for the impact that these issues have upon querying, for example, join one of our face-to-face Database courses).



[1] The 30th of August 1247, approximately.

[2] Note that this does not necessarily literally mean ‘convert’: it would be entirely reasonable, if your research required it, to have two fields for date information, one containing the date verbatim from the source, and a second into which the modern rendering could be entered. Querying and sorting could then take place using the latter field.

 

F3. Standardisation, classification and coding

The principal way forward for accommodating data containing these kinds of problems is to build (often quite liberally) a Standardisation layer into the design of the database (see Section C4) through the use of standardisation, classification and coding. These three activities are a step removed from simply gathering and entering information derived from the sources: this is where we add (or possibly change) information in order to make retrieving information and performing analysis easier. We use these techniques to overcome the problem of information that means the same thing appearing differently in the database, which prevents the database from connecting like with like (the fundamental prerequisite for analysing data). For historians this is a more important step than for other kinds of database users, because the variety of forms and ambiguity of meaning in our sources do not sit well with the exactitude required by the database (as with the example of trying to find all of our records about John Smith, Section F1), so that more of a Standardisation layer needs to be implemented.

Standardisation, classification and coding are three distinct techniques which overlap, and most databases will use a combination of the three when adding a standardisation layer into the design:

 

Standardisation

This is the process of deciding upon one way of representing a piece of information that appears in the source in a number of different ways (e.g. one way of spelling place/personal names; one way of recording dates and so on) and then entering that standardised version into the table. Consider using standardisation when dealing with values that appear slightly different, but mean the same thing - ‘Ag Lab’ and ‘Agricultural Labour’ as values would be treated very differently by the database, so if you wanted them to be considered as the same thing, you would signal this to the database by giving each record with a variant of this occupation the same standardised value.
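In practice this can be as simple as a look-up from variant forms to one agreed form, sketched here in Python with invented variant spellings (the agreed form and the look-up itself are illustrative, not prescriptive):

# Invented variant spellings mapped to a single agreed (standardised) form;
# the standardised value is what gets entered into the standardised field.
STANDARD_FORMS = {
    'Ag Lab': 'AGRICULTURAL LABOURER',
    'Agricultural Labour': 'AGRICULTURAL LABOURER',
    'agric. labourer': 'AGRICULTURAL LABOURER',
}

def standardise(source_value):
    # Unmapped values are returned unchanged, so that they can be spotted
    # and added to the look-up later.
    return STANDARD_FORMS.get(source_value, source_value)

print(standardise('Ag Lab'))               # AGRICULTURAL LABOURER
print(standardise('Agricultural Labour'))  # AGRICULTURAL LABOURER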

 

Classification

This is the process of grouping together information (‘strings’) according to some theoretical, empirical or entirely arbitrary scheme, often using a hierarchical system in order to improve analytical potential. Classification is about allocating groups, and then placing your data in those groups. These groups can be hierarchical, and the hierarchy will let you perform your analysis at a variety of levels. Classification is less about capturing the information in your sources and is much more about serving your research needs.

When using a classification system it is very important to remember two things: firstly, since it is an arbitrary component of your database’s Standardisation layer designed to improve your research analysis, the system does actually need to meet the requirements you have for it. Secondly, therefore, the system needs to have been devised before data entry begins, it needs to be intellectually convincing (at least as far as your historical methodologies are concerned) and it needs to be applied to your data consistently.

It is also worth being aware of how other historians have classified their information. There have been many classification systems created by the good and the great of the historical profession,[1] many of which have been used subsequently by others for two reasons: they allow comparability between the findings of different projects; and they allow historians to turn different sources into continuous series of information. That is, two projects investigating the same thing at different periods may have to rely on different sources: by classifying their (probably slightly different) information into similar classification systems, a case can be made (convincingly or otherwise) that the research is comparable. This is not to say that you should necessarily try to adopt an existing scheme rather than develop one that suits your research better, but it is worth keeping in mind if you are interested in comparing your analysis with that of another historian. In addition, given that classification systems in practice really only entail adding an extra field in a table into which the classified value is added, there is nothing stopping you (other than perhaps time) from employing more than one classification system for the same information in the database.

A detailed example of a classification system can be seen in an ongoing project which is investigating the material aspects of early modern households, and which uses a database to record minutely detailed information about material objects. One of the many ways it treats the information about objects is to classify objects by type, in order to be able to compare like objects despite the often substantial differences in the ways they are referred to in the sources. This works by adding a field to the table where item type data is recorded, into which an ItemClass code value can be added:

F3i – Data about material objects that have been classified and coded

The ItemClass field here is populated with codes, and these codes record precisely what type of item the record is about (you can see what the source calls the item in the ItemDescr field).[2] The fact that the code is a numeric value, and the fact that the same numeric code is applied to the same type of object regardless of how it is described in the source, means that the ItemClass field acts as a standardised value.

Additionally, however, the ItemClass field enables the use of a hierarchical classification system (to examine a partial sample of the classification system, download the Microsoft Excel file Material Object Type Classification.xls). The hierarchy operates by describing objects at three increasingly detailed levels:

  • Code I: the broadest level (for example, Linen (household); Storage; Tools; Clothing – Outer; Lighting etc.)
  • Code II: the middle level, offering sub-groups of Code I (for example Tools > Domestic implements; Clothing – Outer > Footwear)
  • Code III: the most detailed level of description (for example Clothing – Outer > Footwear > Boots)

To illustrate this we can take the example of how the database classifies objects that are used for seating:


F3ii – Classification system for objects in the category of ‘Seating’

You will notice from the Microsoft Excel spreadsheet that each code level has a two- or three-digit numeric code, so Code I: Seating has the numeric code 05, that for Code II: Chair is 02, and that for Code III: Wicker Chair is 006. These individual codes are then elided into a single numeric code (in the case of the wicker chair, 0502006), which is the value that gets entered into the relevant single field (ItemClass) in the record for the wicker chair in the database.

This may sound complicated and slow to implement, but the benefit of doing so is considerable. Firstly, the database can be created so that the codes can be selected automatically rather than memorised by the database creator, so that they do not have to stop to remember or look up what code needs to be entered for any given object. Secondly, and this is the principal reason for employing a hierarchical system, once the data have been coded they can be analysed at three different semantic levels. The historian could, if they wished, analyse all instances of wicker chairs in the database by running queries on all records which had the ItemClass value “0502006”. Alternatively, if they were interested in analysing the properties of all the chairs in the database, they could do so by running queries on all records with an ItemClass value that begins “0502***”. Lastly, if the point of the research was to look at all objects used for seating, a query could be designed to retrieve all records with an ItemClass value that begins “05*****”. This is an incredibly powerful analytical tool, and one that would be impossible to achieve without the use of a hierarchical classification system: to run a query to find all objects used for seating without a classification system would require searching by name for every qualifying object that the historian could anticipate or remember, taking into account the variant spellings that might apply.[3]
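The three levels of query can be sketched like this (SQLite via Python; the wicker chair code comes from the example above, while the other codes and descriptions are hypothetical; note that the code is stored as text so that the leading zero survives, and that SQLite’s wildcard is % where Microsoft Access uses *):

import sqlite3

cur = sqlite3.connect(":memory:").cursor()
cur.execute("CREATE TABLE Item (ItemDescr TEXT, ItemClass TEXT)")
cur.executemany("INSERT INTO Item VALUES (?, ?)",
                [('wycker cheyre',  '0502006'),
                 ('joyned chaire',  '0502001'),   # hypothetical Code III value
                 ('buffett stoole', '0501003')])  # hypothetical Code III value

# Codes I+II+III: wicker chairs only
print(cur.execute("SELECT COUNT(*) FROM Item WHERE ItemClass = '0502006'").fetchone()[0])   # 1
# Codes I+II: all chairs
print(cur.execute("SELECT COUNT(*) FROM Item WHERE ItemClass LIKE '0502%'").fetchone()[0])  # 2
# Code I alone: everything used for seating
print(cur.execute("SELECT COUNT(*) FROM Item WHERE ItemClass LIKE '05%'").fetchone()[0])    # 3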

Hierarchical classification systems are very flexible things as well. They can include as many levels as you require to analyse your data, and they do not need to employ numeric codes when simple standardised text would be easier to implement.[4]

 

Coding

Coding is the process of substituting (not necessarily literally) one value for another, for the purpose of recording a complex and variable piece of information through a short and consistent value. Coding is often closely associated with classification, and in addition to saving time in data entry (it is much quicker to type a short code word than it is to type five or six words) codes additionally act as standardisation (that is, the same form [code] is entered for the same information no matter how the latter appears in the source).

 

These techniques are implemented to make the data more readily usable by the database: the codes, classifications and standardised forms which are used are simple, and often easier to incorporate into a query design than the complicated and incomplete original text strings that appear in the source; but more importantly, they are consistent, making them much easier to find. However, there are a number of things to bear in mind when using them, the most important of which is that there are two ways of applying these techniques:

  • By replacing original values in the table with standardised/coded/classified forms
  • By adding standardised/coded/classified forms into the table alongside the original values

Both of these approaches present a trade-off between maintaining the integrity of the source and improving the efficiency of the potential analysis, in much the same way as the choices offered as part of the design process when selecting the Source- or Method-oriented approach to the database (see Section C3). The first approach, replacing the original version of source information in any chosen field(s) with standardised forms of data, speeds up data entry at the expense of losing what the source says. It also serves as a type of quality control, as entering standardised data (especially if controlled with a ‘look-up list’) is less prone to data entry errors than entering the original forms that appear in the source.

The second approach, to enter standardised values in addition to the original forms, allows for the best of both worlds: you achieve the accuracy and efficiency benefits of standardisation without losing the information as it is presented in the source. Of course, this happens at the cost of extra data inputting time, as you enter material twice.

When considering both approaches, bear in mind that you will only need to standardise some of the fields in your tables, not every field in every table. The candidates for standardising, classifying and coding are those fields that are likely to be heavily used in your record-linkage or querying, where being able to identify like with like values is important. Creators of databases built around the Source-oriented principle should exercise particular caution when employing these techniques.

 

 



[1] See for example that developed for household types in The population history of England, 1541-1871: a reconstruction (1981) by E.A. Wrigley and R.S. Schofield; or the ongoing HISCO project to develop an international classification system for occupations, available at http://historyofwork.iisg.nl/ (accessed 23/03/2011).

[2] Note in passing that many of the other fields in this example contain codes as well – this table contributes substantially to the database’s Standardisation layer.

[3] It would, for example, need to look for all stools, buffet stools, wicker chairs, forms, settles, benches etc., leading to extremely complicated queries with possibly more criteria than the database can handle. For criteria in queries, please sign up to one of our face-to-face Database courses.

[4] Indeed numeric codes are somewhat old fashioned in modern database usage, although they are no less efficient for being outmoded.

 

 

G. Conclusion

If you apply the principles and techniques discussed here to the design of your database, you may well find that you spend a considerable amount of time in the process. Unfortunately there is no getting around this: designing databases simply is a time-consuming business, especially if you have adopted a Source-oriented approach and are working with a range of different, rich and complex sources. However, the time you spend working on the design will be more than repaid when it comes to the data entry and data analysis stages of the database project – and this cannot be overstated. Historical sources will give rise to all manner of complications and problems, both intellectual and in terms of the mechanics of databases, and the more you can anticipate these and accommodate them in the design of the database, the more efficient and less frustrating the subsequent use of the database will be.

Before you begin the process of designing your database and producing your Entity Relationship Diagram, it is worth spending a little time seeing how other historians have designed their databases (see the resources listed in the Further Reading section). You should also read through the other HRH Handbooks on Databases by Mark Merry, as these describe in detail the processes of building databases and performing analysis respectively; it is absolutely necessary to see what is required for these processes to work smoothly, so that the design can take these requirements into account from the very beginning.

Finally, it is worth reiterating that designing databases is difficult, and there is no substitute for practice. No database is ever perfect, and the only indicator of quality, or success, when it comes to database design is whether or not it serves the various functions that you intended. If you can manage the information from your sources in the way that you need, and if you can perform the analysis that you require, and if you can be as flexible as you need in both of these areas, then your design is successful. But you do not have to wait until the latter stages of your database use to find out how successful you have been in the design – you can and should test the design of the database very early on. After producing your Entity Relationship Diagram, build a structural prototype of your database (that is, with only the tables and relationships, without worrying too much about the other tools that go into creating the database application) and spend a week entering data. If you are using multiple sources, enter material from each of the sources. As soon as you start entering data you will very quickly begin to see where any deficiencies in the design might be – look out for:

  • Information that you would like to analyse which appears repeatedly, but for which you have nowhere specific to put it (i.e. you will need to add new fields)
  • Information that you find yourself repeating from record to record, which means you will need to think about re-ordering your relationships (see Section D)
  • Datatypes that prove unhelpful and need changing
  • Data that could be standardised or classified
  • Information that you had not anticipated when designing the database

It is likely that you will find examples of all of these in a very short space of time. Once you have spent some time entering data, design and run some queries to test whether or not the research questions you know you will want answers to can actually be answered by the current design. Running queries is the ultimate test of whether the database design works or not, and it is likely that you will find yourself rearranging fields in the light of what you learn. The queries will also highlight (often starkly) how much standardising of information you will need to engage in.

Once you have finished this testing, and moved on to design and rebuild ‘Version 2’ of the database, you will be well on the way to creating one of the most powerful research tools available to the historian. It will be a struggle to begin with, but it will be worth it in the end!

 

 

Creative Commons Licence
Databases for Historians (HRH) by Mark Merry is licensed under a Creative Commons Attribution 2.0 UK: England & Wales License.