Organising and Designing Quantitative Data

Site: Postgraduate online research training
Course: Module 2: Planning the research
Book: Organising and Designing Quantitative Data
Date: Monday, 16 May 2022, 1:56 PM

Description

Quantitative data

1. Introduction

The organisation and storage of quantitative data are often challenging and complex tasks that require specialist software and tools as well as careful conceptualisation and documentation. Taking some time over choosing the right tools and thinking about data structures from the outset can go a long way. Unfortunately it will not automatically make the task a simple one, but it can help to make better data, with better documentation, to ensure future preservation and access and easier re-use (including by yourself!).

2. Modelling Data Structures

Most historians do not have the luxury of working only with pre-existing, carefully prepared and documented statistical data.

Sources that are irregular in 'shape', such as textual sources with long narrative accounts written in paragraphs and chapters and so on, or databases of image/sound/video collections, are particularly problematic when it comes to converting their information into data; but the problem will also arise in the more structured sources (such as census listings or taxation assessments), which are never quite as simple as they might appear.

- Mark Merry, Designing Databases

Analysing or 'modelling' data structures, designing databases, making decisions about categorisation, normalisation, and so on, are as important for effective management of quantitative data as file naming and organising folders. However, this is in most cases considerably more complex than the simple tree structure of files and directories, and historians undertaking quantitative analysis will often need to learn to use specialist tools and techniques. Therefore, it must be emphasised that the discussion here is merely intended to provide some basic guidance and resources.

Every project has to be tailored to the challenges of the historical sources it uses and the research questions being asked. Unfortunately, many books and online resources for quantitative data management are aimed at social scientists or scientists, and while these can be useful for learning about concepts, techniques and general issues involved in quantitative analysis, they do not address the particular issues historians face in attempting to transform typically variable and messy historical sources into regularly structured data that can be used for quantitative and statistical analysis.

Even if you don't plan to use databases (but see further on for reasons why you might!), discussions of database design usually have wider application for thinking about how to model data - ie, how to analyse your historical sources and 'unpack' their underlying structures, categories and relationships.

A further consideration for a historian at the planning stage is whether you will need to be able to return easily to the full original source. If you are collecting aggregate data (or data for aggregation) for statistical analysis this may not be an issue, but it will be important if your methodology uses quantification primarily as an entry-point for deeper, qualitative work, or will move frequently between the two modes of analysis.
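One way to keep a route back to the original source open is to store, alongside each extracted value, a reference to where it came from. The following is a minimal Python sketch of that idea, not a recommended schema: the class, the field names (`source_ref`, `transcript`) and the sample entries are all invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    """One quantified observation plus a pointer back to its source."""
    source_ref: str    # hypothetical reference, e.g. volume and folio
    transcript: str    # original wording, kept verbatim
    year: int          # value extracted for counting
    amount_pence: int  # value extracted for counting


# Invented sample entries, for illustration only.
entries = [
    Entry("Vol. 1, f. 12r", "paid to the carpenter iiij s.", 1601, 48),
    Entry("Vol. 1, f. 12v", "paid for nails vj d.", 1601, 6),
    Entry("Vol. 2, f. 3r", "paid to the mason ij s.", 1602, 24),
]


def total_for_year(rows, year):
    """Aggregate view: total payments (in pence) for one year."""
    return sum(r.amount_pence for r in rows if r.year == year)


def sources_for_year(rows, year):
    """Qualitative view: where to look in the original for that year."""
    return [r.source_ref for r in rows if r.year == year]
```

Here `total_for_year(entries, 1601)` gives the aggregate figure, while `sources_for_year(entries, 1601)` lists the places in the source to return to for close reading.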

2.1 Useful resources

An introduction to the evolution of databases and data models since the 1960s:

This free online course on database design for historical research, written by a historian, is useful for thinking about designing structured data even if you don't plan to use relational databases:

A helpful introductory online workshop:

2.2 Further reading

Jean Bauer, 'Fielding History', in Writing History in the Digital Age, edited by Jack Dougherty and Kristen Nawrotzki (2011).

Pat Hudson, History by Numbers: An Introduction to Quantitative Approaches (2000), especially chapter 9 on software and computerizing quantitative history.

Stephen Ramsay, 'Databases', in A Companion to Digital Humanities, edited by Susan Schreibman, Ray Siemens, and John Unsworth (2004).

3. Choosing the right tools

Unfortunately, certain general rules tend to apply to software for the management and analysis of quantitative data. The more powerful the software...

  • the more time and effort needed to master its use
  • the more expensive it is likely to be (or if it is not, it is likely to be even more challenging to learn to use)
  • the more complex the data and documentation it generates
  • the more complicated preserving and sharing its data is likely to be

So, sometimes it may be necessary to balance the desired functionality of a tool against the investment of money, time or both that its effective use and data management will require - especially for small projects. If you are making a major investment of your personal resources, you need to look for tools that can be re-used and learn skills that will have applications beyond the immediate project!

There are a number of distinct types of software that may be used by historians for organising, storing and analysing quantitative data, and they have different strengths. For storing data, the choice most historians will face is between spreadsheets and database management systems. However, there are other options that you may want to consider in certain cases.

3.1 Spreadsheets

Examples: Microsoft Excel, OpenOffice Calc, Numbers (Mac), Google Docs Spreadsheets

Advantages: the real strengths of spreadsheet software lie in analysing data rather than in storing and managing it. Spreadsheets are easy to set up, and it is simple to enter data, run calculations, create graphs and so on. So, if your data is suitable for storing in a spreadsheet, you may not need any further tools.

Disadvantages: spreadsheets are simple 'flat' tables and they soon become unwieldy for large or complex datasets. They are better at handling numbers than words. By default, they often make assumptions about data formats (eg for dates) that are less appropriate for historical data than for modern financial data (for which they were primarily intended).
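The point about date formats can be illustrated with a small sketch. One possible workaround, assumed here rather than prescribed by this course, is to record dates as separate year, month and day values, so that incomplete dates stay incomplete, and to write them out as plain text so that the software is not asked to guess a date type. The class name `PartialDate` and the sample dates are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PartialDate:
    """A date whose month and/or day may be unknown."""
    year: int
    month: Optional[int] = None
    day: Optional[int] = None

    def as_text(self) -> str:
        """Return year, year-month or year-month-day as plain text."""
        parts = [f"{self.year:04d}"]
        if self.month is not None:
            parts.append(f"{self.month:02d}")
            if self.day is not None:
                parts.append(f"{self.day:02d}")
        return "-".join(parts)


# Invented examples: a full date, a month-only date and a year-only date.
full = PartialDate(1674, 4, 29)
month_only = PartialDate(1710, 11)
year_only = PartialDate(1540)
```

Text values in this form also sort chronologically as ordinary strings, which can be convenient in a flat table.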

Most of these spreadsheet tools are proprietary software (OpenOffice Calc is an open source exception), and you will usually need to convert the data to an open format such as CSV for preservation purposes.

3.2 Databases

Examples:

  • proprietary: Microsoft Access, FileMaker (Mac);
  • open source: MySQL, SQLite;
  • XML databases: eXist, BaseX

Advantages:

  • databases (or 'database management systems') can store, retrieve and manipulate much larger and more complex datasets than spreadsheets
  • most database systems are 'relational': this means that multiple tables can be linked together using "key" fields.
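As a rough illustration of tables linked by "key" fields, the sketch below uses SQLite through Python's built-in `sqlite3` module. The table names, column names and rows are invented for the example; a real project would design these around its own sources.

```python
import sqlite3

# An in-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE parish (
    parish_id INTEGER PRIMARY KEY,   -- the key field
    name      TEXT NOT NULL
);
CREATE TABLE burial (
    burial_id INTEGER PRIMARY KEY,
    parish_id INTEGER NOT NULL REFERENCES parish(parish_id),  -- link to parish
    surname   TEXT NOT NULL,
    year      INTEGER NOT NULL
);
""")

# Invented sample rows: each parish name is stored once only.
conn.executemany("INSERT INTO parish VALUES (?, ?)",
                 [(1, "Parish A"), (2, "Parish B")])
conn.executemany("INSERT INTO burial VALUES (?, ?, ?, ?)", [
    (1, 1, "Smith", 1665),
    (2, 1, "Jones", 1665),
    (3, 2, "Brown", 1666),
])

# The join follows the key field to bring the two tables back together.
rows = conn.execute("""
    SELECT burial.surname, parish.name
    FROM burial
    JOIN parish ON burial.parish_id = parish.parish_id
    ORDER BY burial.burial_id
""").fetchall()
```

Because the parish name lives in one place, correcting it means changing a single row rather than every burial record that mentions it.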

Disadvantages:

  • database software is not as easy to learn to use; open source databases in particular tend not to have friendly user interfaces; MySQL requires a server for installation
  • you need to spend much more time planning and designing your database before you can get started. (But from the data management point of view this may not be a disadvantage at all!)
  • databases are primarily storage rather than analytical tools; for most analysis and visualisations you will usually need additional software.

3.3 Spreadsheet or Database?

If you are only creating a small amount of quantitative data, or its structure is very simple (especially if it is primarily numeric) and will fit comfortably into a single table, then a spreadsheet will probably serve your needs, and it will not be as time-consuming to learn or set up as a database.

When to Use a Spreadsheet

Characteristics of your sources or research that might make a spreadsheet a more appropriate choice:

  • your sources already resemble spreadsheets - ie, they are in a regular tabular or list format
  • your sources consist of mainly numerical information
  • your sources are already aggregated information (even if not yet digital), suitable for statistical analysis without significant intermediate processing
  • you do not need to link together different sources
  • you are not creating large amounts of data

Examples

European State Finance Database - "an international collaborative research project for the collection, archiving and dissemination of data on European fiscal history across the medieval, early modern and modern periods." The database contains a range of aggregated tabular data deposited by researchers, which can be downloaded in CSV text format as well as viewed in graphical forms on the website.

1831 Census Data - downloadable datasets with accompanying documentation, made available by the Staffordshire University Victorian Censuses project. Again, these are aggregate data ideal for a spreadsheet.

When You Need a Database

On the other hand, if several of the following apply, you probably need a database:

  • you are compiling data from varied sources that you will have to aggregate for statistical analysis yourself
  • you are collecting data from related sources that you will want to be able to cross-reference and link together
  • your sources are mainly text rather than numbers
  • your sources are too complex to fit into a simple flat table
  • you will be creating a lot of data

In any case, if you think your research is outgrowing the spreadsheet format, spreadsheet software should normally have convenient facilities to export your data at any time; conversely, once you have data in a database, you will have options to compile it into subsets of aggregate data for analysis in a spreadsheet.
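Compiling a subset of aggregate data from a database for use in a spreadsheet might look something like the following sketch, which counts records with an SQL `GROUP BY` query and writes the result as CSV text. It uses Python's standard `sqlite3` and `csv` modules; the `trial` table and its rows are invented for illustration.

```python
import csv
import io
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE trial (trial_id INTEGER PRIMARY KEY, year INTEGER, offence TEXT)"
)
# Invented sample rows.
conn.executemany("INSERT INTO trial (year, offence) VALUES (?, ?)", [
    (1750, "theft"),
    (1750, "theft"),
    (1750, "assault"),
    (1751, "theft"),
])

# Aggregate the individual records into counts per year and offence.
counts = conn.execute("""
    SELECT year, offence, COUNT(*) AS n
    FROM trial
    GROUP BY year, offence
    ORDER BY year, offence
""").fetchall()

# Write the aggregate table as CSV text; a real script would write to a file.
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["year", "offence", "n"])
writer.writerows(counts)
csv_text = buffer.getvalue()
```

The resulting `csv_text` can be saved with a `.csv` extension and opened directly in any of the spreadsheet programs listed earlier.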

Examples

The Old Bailey Online - a database of reports of nearly 200,000 trials held at the Old Bailey in London between 1674 and 1913. Not only is this a very large dataset, but trials are complex sources for quantification. They may contain multiple defendants, charges, verdicts and punishments - "many to many" relationships. Subsets of the data, however, can be generated in tabular format for spreadsheet analysis using the site's Statistical Search.

Family Reconstitution Data, from Cheapside parish registers, c.1540-1710 - a relational database created by the People in Place project. Family reconstitution is a technique used by demographic historians using parish register data between the 16th and 19th centuries, which involves "linking series of births, marriages and burials in the same family and comparing the results across thousands of families" to generate data on long-term demographic trends.
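The "many to many" relationships mentioned above for trial records are usually handled in a relational database with a third, linking table. The sketch below shows the general technique using Python's `sqlite3` module. It is not the actual structure of either project's database: all table names, column names and rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE trial     (trial_id INTEGER PRIMARY KEY, year INTEGER NOT NULL);
CREATE TABLE defendant (defendant_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
-- Linking table: one row per appearance of a defendant in a trial.
CREATE TABLE trial_defendant (
    trial_id     INTEGER NOT NULL REFERENCES trial(trial_id),
    defendant_id INTEGER NOT NULL REFERENCES defendant(defendant_id),
    verdict      TEXT,
    PRIMARY KEY (trial_id, defendant_id)
);
""")

# Invented sample rows: trial 1 has two defendants, defendant 1 has two trials.
conn.executemany("INSERT INTO trial VALUES (?, ?)", [(1, 1750), (2, 1752)])
conn.executemany("INSERT INTO defendant VALUES (?, ?)",
                 [(1, "Defendant A"), (2, "Defendant B")])
conn.executemany("INSERT INTO trial_defendant VALUES (?, ?, ?)", [
    (1, 1, "guilty"),
    (1, 2, "not guilty"),
    (2, 1, "guilty"),
])


def defendants_in(trial_id):
    """Names of all defendants who appeared in one trial."""
    rows = conn.execute("""
        SELECT d.name
        FROM trial_defendant AS td
        JOIN defendant AS d ON d.defendant_id = td.defendant_id
        WHERE td.trial_id = ?
        ORDER BY d.defendant_id
    """, (trial_id,))
    return [name for (name,) in rows]


def trials_of(defendant_id):
    """Identifiers of all trials in which one defendant appeared."""
    rows = conn.execute("""
        SELECT trial_id FROM trial_defendant
        WHERE defendant_id = ?
        ORDER BY trial_id
    """, (defendant_id,))
    return [trial_id for (trial_id,) in rows]
```

A single flat spreadsheet row cannot represent this comfortably, because the number of defendants per trial (and trials per defendant) varies from case to case.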

Already using a spreadsheet?

If you answer yes to several of these questions, you probably should consider switching to a database.

  • Are you duplicating a lot of data in spreadsheets?
  • Are you having to make changes across multiple spreadsheets when you change one of them?
  • Are your spreadsheets becoming unwieldy from trying to manage too much information?
  • Are you finding it difficult to locate specific data because of the size of your spreadsheets?

3.4 Statistical software packages

Examples: SPSS, SAS, Stata, R (open source)

Specialised statistical software packages are more commonly used by social scientists and scientists than historians. But they may be worth considering if your research requires advanced statistical techniques or visualisations.

Advantages:

  • Much more powerful for statistical analysis than either databases or spreadsheets.
  • Whereas spreadsheets are strong at data analysis and databases excel at data storage, statistical software usually provides a more complete integrated package.
  • Many online resources and tutorials for the popular packages.

Disadvantages:

  • Usually a very steep learning curve; you may need to invest in proper training to get the most out of them.
  • Often very expensive, while open source options tend to be even more intimidating to learn than the commercial packages.

3.5 Qualitative Data Analysis software

Examples: NVivo, ATLAS.ti

Historical research is very rarely purely quantitative, and many historians working with textual sources are likely to want to use quantification in some way sooner or later. Computer-Assisted Qualitative Data Analysis software (CAQDAS) has been most commonly used by ethnographers and qualitative sociologists, but it can represent a useful alternative to conventional databases for historical research: rather than entering text or numbers extracted from a source document into a table, sections of text in transcribed documents are coded or tagged for categories, meanings, networks, relationships and so on.
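The idea of coding sections of a transcribed text, rather than copying values out into a table, can be sketched in a few lines of Python. This is only a toy model of the approach and does not reflect how NVivo or ATLAS.ti store their data; the sample sentence, the code labels and the helper names are invented.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class CodedSegment:
    """A span of the transcript (by character offsets) with a code attached."""
    start: int
    end: int
    code: str


# Invented transcript text.
document = "The prisoner stole a silver watch. He was sentenced to transportation."


def tag(text, phrase, code):
    """Attach a code to the first occurrence of a phrase in the text."""
    start = text.index(phrase)  # raises ValueError if the phrase is absent
    return CodedSegment(start, start + len(phrase), code)


def excerpt(text, segment):
    """Recover the original wording covered by a coded segment."""
    return text[segment.start:segment.end]


# Segments may overlap: the text itself is never cut up or altered.
segments = [
    tag(document, "stole a silver watch", "offence:theft"),
    tag(document, "silver watch", "object:valuables"),
    tag(document, "sentenced to transportation", "punishment:transportation"),
]

# Quantification: count how often each broad category was applied.
code_counts = Counter(s.code.split(":")[0] for s in segments)
```

The counts provide the quantitative view, while `excerpt(document, segment)` leads straight back to the passage that was coded.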

Advantages:

  • transcribed documents and structured data are not separated, and no data is lost.
  • it is easier to move between close reading and quantitative analysis
  • the encoded text is an adaptable resource: it can subsequently be used for various kinds of analysis, including export to more sophisticated quantitative analysis tools or databases; or it can be re-coded for later projects with different research questions.

Disadvantages:

  • marking up texts accurately and consistently may be more time-consuming than conventional data entry, particularly if you are working with already highly-structured documents.
  • there isn't much point if you are not creating/working with transcriptions or all your sources are highly tabular rather than narrative in form!

4. The 'Next Generation': Open Source and Web Tools

Historians have been using proprietary desktop spreadsheet and database tools for quantitative research for many years. Their advantages include that they are easy to use (at least relatively speaking!), and usually come with extensive documentation and technical support. Spreadsheet software in particular is popular because it is both simple and powerful. For a long time, these have been the best options for the majority of historians working with quantitative data.

However, their disadvantages include that they are often expensive (and have to be updated at regular intervals, at further cost). Data is stored in proprietary formats that limit sharing, are not designed for online publication, and create problems for preservation. Although they have become increasingly sophisticated and usable for complex historical sources, these programs were not created for historical research (or even for academic use).

In recent years, open source content management software, often built as browser-based applications and intended for easy online publication, has evolved and expanded rapidly. (Famous examples include WordPress and the MediaWiki software underpinning Wikipedia, both using MySQL databases.) In addition, techniques and tools like XML markup have facilitated the creation of structured datasets and databases from large text corpuses without the manual process of entering data into a spreadsheet or database.
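To give a sense of how XML markup can turn text into structured data, the sketch below reads a small marked-up passage with Python's standard `xml.etree.ElementTree` module and extracts one row per tagged person. The element and attribute names (`register`, `entry`, `person`, `role`) are made up for this example and do not follow any particular markup standard.

```python
import xml.etree.ElementTree as ET

# Invented sample: running text with people and years marked up.
marked_up = """<register>
  <entry year="1601"><person role="bride">Agnes</person> married <person role="groom">Thomas</person></entry>
  <entry year="1602"><person role="deceased">William</person> was buried</entry>
</register>"""

root = ET.fromstring(marked_up)

# One (year, role, name) row per tagged person, in document order.
rows = [
    (entry.get("year"), person.get("role"), person.text)
    for entry in root.iter("entry")
    for person in entry.iter("person")
]
```

The rows come straight from the markup, so the text only has to be tagged once and can then be queried in different ways without re-entering the data.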

5. Conclusion

Increasingly, these developments mean that the tools historians need to collect, manage and analyse quantitative data are separable from the data itself. Their data is more portable, and therefore easier both to share and to preserve in the longer term. A dataset shared under an open licence may be converted into different formats, re-used, enhanced and re-published for different purposes. It is also easier for historians to design and build data tools of their own. This requires more advanced computer skills (for example, learning MySQL and how to use the command line), but it enables them to build custom tools that meet their source-based research needs, and ultimately to share not simply their data but also their tools with other historians.
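As a small illustration of how portable data can be converted between formats, the following sketch reads a CSV table and re-expresses it as JSON using only Python's standard library. The column names and figures are invented for the example.

```python
import csv
import io
import json

# Invented sample dataset in CSV form.
csv_text = "year,parish,burials\n1665,Parish A,120\n1666,Parish A,45\n"

# Read each CSV row into a dictionary keyed by the column headings.
records = list(csv.DictReader(io.StringIO(csv_text)))

# CSV stores everything as text, so convert the numeric columns explicitly.
for record in records:
    record["year"] = int(record["year"])
    record["burials"] = int(record["burials"])

# The same data, now as JSON text.
json_text = json.dumps(records, indent=2)
```

Neither format ties the data to a particular program, which is what makes this kind of conversion straightforward.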