Designing Databases for Historical Research

C. Fundamentals of database design

C5. Database definitions – tables, fields, records, values, rules and datatypes

This section of the Handbook will introduce a few definitions and database terms which will act as a shorthand for various concepts and processes that will be discussed in other sections.  Again, the issues discussed here will apply generally to all databases designed for historical research, as well as to most databases in general.

Harvey and Press provide a definition of a database:

 

“A database is a collection of inter-related data organised in a pre-determined manner according to a set of logical rules, and is structured to reflect the natural relationships of the data and the uses to which they will be put, rather than reflecting the demands of the hardware and software.”

For the full description see: Further Reading

which is a useful if perhaps formal way of describing both the content and environment of a database. Within the database itself, however, are a number of different ‘things’, called Objects,[1] which serve a variety of functions (these include tables where the actual data is stored, queries, user interfaces and so on). For the moment we will concentrate on only the first of these objects, the tables, and we will look at some of the terms connected with them.

There are four main elements to any table in a database, and each of these have (somewhat inevitably) a number of names:

§         Table (also known as Entities)

§         Field (also variously known as Column, Variable, Attribute)

§         Record (also known as Row)

§         Field name (also known as the Attribute name)

C5i –The database table, showing fields and records

C5ii –The database table, showing fields and records (containing data from 17th century parish registers) [2]

In each database, data are stored in tables, and most databases will have more than one table. These tables will have relationships between them which connect the information they contain and will thus be ‘relational’ (see Section D). Tables are made up of fields (columns) and records (rows). Each field contains one type of information, and for each record the information in that field will be of the same kind.

 

Database Rules

Returning to the ‘logical rules’ referred to by Harvey and Press, these can be summarised as follows:

  1. The ‘information rule’: all information in a relational database is represented in one way – as a value in a field in a table
  2. Each table in a relational database will have a name which uniquely identifies it and which will distinguish it from the other tables; the table should contain information about only one entity. Further Information
  3. Each field within a table is given a unique name within the table. Further Information
  4. The values in each field must have the same logical properties, that is, must be of the same datatype: numerical or text (we will be looking at datatypes shortly)
  5. Records will contain closely related information about a single instance of the entity (that is, the subject of the table – for example, the forename and surname of a single individual in a table about individuals)
  6. The order of records in a table must not be significant Further Information
  7. The order of fields in a table must not be significant
  8. Each complete record must be unique
  9. Each field should contain ‘atomic values’: this is, indivisible data (e.g. first and last names of an individual should always be held in separate fields)

In many cases the database software will prevent you from breaking these rules, but in some cases it is possible to circumvent them, whether intentionally or otherwise. With those rules that can be broken, it is almost always unwise to do so, as it will generally lead to confusion (for the database if not its user) when it comes to performing analysis.

Datatypes

We have seen that databases are made up of tables, and that tables follow certain rules. One of the rules to be considered is that each field within a database must be of a certain datatype. There are a number of different datatypes, and although the names by which these datatypes are known will vary slightly from database software to database software, what they do and how they operate will be essentially the same.

In essence what datatypes do is to control what type of data is entered into a field. Each field in each table is assigned a datatype, usually ‘text’ or ‘numeric’, and this in turn dictates what kind of data can be entered into that field. The purpose of datatypes is partly to act as a kind of quality control mechanism, to help prevent data being entered incorrectly; and partly they are used to help the database understand the meaning of what is typed in. For example, the piece of data ‘23/03/2011’ will not be recognised as a date by the database if it is entered into a field that has the datatype ‘text’, whereas it will understand the meaning of this value if it is entered in a field with the ‘date/time’ datatype.

A brief description of the principal datatypes follows.

The basic datatypes: text

  • This is the default datatype for your fields which will be assigned by the database whenever you add a new field to a table
  • This datatype will allow the entry of textual and numerical values, although it will treat that latter differently from numerical values entered into a ‘number’ datatype field
  • Fields with this datatype will generally allow a maximum of 255 characters to be entered Further Information

 

 

The basic datatypes: memo

  • Fields with this datatype are used for lengthy texts and combinations of text and numbers
  • Up to 65,000 characters can be entered (the precise number may change depending on the database software being used)
  • Data in these types of field cannot be sorted
  • Data in these types of field are difficult, although not impossible, to query

 

 

The basic datatypes: number

  • This datatype allows the entry of numerical data that can be used in arithmetical calculations
  • There are a number of variations of this datatype, which control aspects of the numbers that can be entered, such as the size of the numbers, the number of decimal places and so on: Further Information
    • Byte: Stores numbers from 0 to 255 (no fractions)
    • Decimal: stores positive and negative numbers down to 28 decimal places
    • Integer: Stores numbers from –32,768 to 32,767 (no fractions)
    • Long integer: (default) stores numbers from –2,147,483,648 to 2,147,483,647 (no fractions)
    • Single: stores very large positive and negative numbers to 7 decimal places
    • Double: stores very large positive and negative numbers to 15 decimal places

The basic datatypes: date/time

  • This datatype enables the entry of dates covering the period 100 through to 9999
  • This datatype can be customised in order to control the format of dates that are entered
  • Warning: in Microsoft Access, the Office autocorrect feature may well change some of your dates if it is active (e.g. “02/12/04” will autocorrect to “02/12/2004” unless you enter the year in full)
  • This datatype will allow the sorting of records chronologically, which the same values entered into a text datatype field would not (the latter would sort the records alphabetically – alphabetically “01/01/2010” would come before “31/12/1245”)

 

 

The basic datatypes: currency

  • This datatype allows the entry of numerical values data used in mathematical calculations involving data with one to four decimal places, with the inclusion of currency symbols

 

 

The basic datatypes: autonumber

  • This datatype automatically generates a unique sequential or random value whenever a new record is added to a table. Further Information
  • AutoNumber fields cannot be updated, that is, you cannot enter data into them manually

 

 

The basic datatypes: yes/no

  • A field with this type of datatype will only contain one of two values (Yes/No, True/False, or On/Off)
  • Quite often database software will represent this type of field as a checkbox in the table
  • The underlying text value of a yes/no field will be -1 (for yes) or 0 (for no)

 

 

The basic datatypes: OLE

  • A field with this datatype is one in which another file is embedded, as a Microsoft Excel spreadsheet, a Microsoft Word document, an image, a sound or video file, an html link, or indeed any other type of file

Every field in every table will need to have one of these datatypes assigned, and the decision as to which type is chosen should be factored into the database design process (see Section E). For most fields the datatype chosen will be either ‘text’ or ‘number’. Keep in mind how these two datatypes treat numerical data differently, particularly in terms of how they cause data to be sorted:

  • 1,10,11,2,3,4,5,6,7,8,9 is how data will be sorted if the datatype is ‘text’ (that is, alphabetically)
  • 1,2,3,4,5,6,7,8,9,10,11 is how data will be sorted if the datatype is ‘number’ (that is, numerically)


[1] For more on Objects please sign up to one of our face-to-face Database courses.

[2] Note in this table the Forename, Surname, Occupation, Residence and  Comments fields contain values taken directly from the source, and are part of this database’s Source layer. The Stated_Sex field contains a standardised value (in fact a code), and is part of the Standardisation layer.