Regular expressions and scripting

1. What are regular expressions?

Text mining for the purposes of historical research involves a number of different tasks:

  • Gathering and/or creating textual datasets
  • Preparing the format and structure of textual data for analysis
  • Making decisions about the ‘meaning of texts’ and defining semantic structures and schemas to allow you to find instances of themes and subjects that interest you
  • Marking up the texts to enable searches
  • Performing the analysis using suitable tools
  • Interpreting the results of the analysis, and putting them into historical context

In many cases the bulk of the work is likely to fall into the preparatory stages (although making sense of the results can also take time and energy!), where you make sure that the texts are consistent in format and structure (and mark-up, if it is being used) before you start using text mining tools on them. This often involves making changes to the text: most commonly to how punctuation is used (notably line and paragraph breaks), but sometimes to the text itself, especially where errors need correcting.

There will be times when many changes are needed across a text for the sake of consistency, and quite often the kind of changes that need to be made are repetitious. For example, imagine a scenario where you have a collection of textual datasets that are a mixture of materials that you have gathered from online sources, historical transcripts made during previous historical research projects, and texts that you have created yourself during your own research. All the datasets share a common subject – say about the history of disease in early modern London – but due to the variety of sources that they have been transcribed from, and the variety of transcription (or other digitisation) methods used to create them, there is huge variety in the way that place names have been rendered.

Our texts leave us in a position where, to take the example of London parish names (a likely unit of analysis in our project), our computers would struggle to make sense of variant forms of the same place name. The parish of St George, for instance, may have been recorded in the sources in a number of different forms:

  • St George
  • St George Botolph Lane
  • St George Eastcheap
  • S. George [etc.]
  • St. George [etc.]
  • Saint George [etc.]

and this is before we take early modern spellings into account. Incidentally, this particular problem, while still significant, is less serious for the parish of St George, as there was only one parish of that name in London in the period – it would have been exacerbated if we were using one of the many St Marys as our example!
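To see why regular expressions help here, consider a single pattern that matches all of the variant forms listed above. The sketch below uses Python's `re` module; the pattern itself is a hypothetical illustration (it covers only the variants shown, not early modern spellings):

```python
import re

# One pattern for the variants above: "S.", "St", "St." or "Saint",
# followed by "George" (with or without a trailing qualifier such as
# "Botolph Lane" or "Eastcheap", which falls outside the match).
pattern = re.compile(r"\b(?:S\.|St\.?|Saint)\s+George\b")

variants = [
    "St George",
    "St George Botolph Lane",
    "St George Eastcheap",
    "S. George",
    "St. George",
    "Saint George",
]

# Every variant is found by the single pattern.
matched = all(pattern.search(v) for v in variants)
```

A search with this one expression finds every rendering of the parish name at once, where a plain-text search would need a separate pass for each variant.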

The parish names in our list all refer to the same geographical area of London, but as they all comprise different strings, it is likely that their sheer variety will complicate searches, record linkage and other types of analysis. It is important, in other words, that information that means the same in our texts is presented in the same form. This can be achieved in a number of different ways (through mark-up, or the use of taxonomies and ‘synonym rings’, for example), the easiest of which is changing the actual text into a standardised format (this may not be appropriate for some research projects, of course).

To do this you could perform a search to find every instance of every variant form of the term you wanted to regularise, and then change each one manually. A much better approach is to employ regular expressions.

Regular expressions allow you to simultaneously perform a number of tasks:

  • Find instances of a particular term …
  • … even when there are variant (or fuzzy) forms
  • Convert the term (or a part of it) into something else of your choosing
  • Convert whole phrases (or even passages of text)
  • Make changes to the whole text all at once, no matter how large or how many instances of your term there are
  • Apply changes consistently throughout the text

Regular expressions allow you to make changes to text quickly and (relatively) easily, while minimising the potential for manual error to creep in, and are thus extremely useful tools to consider, especially when used in combination with other text mining tools.
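The tasks listed above come together in a single find-and-replace operation. The sketch below (again using Python's `re` module, with a hypothetical pattern and sample sentence) converts every variant of the parish name into one standardised form in a single pass over the text:

```python
import re

# Hypothetical pattern covering the variant forms of the parish name.
pattern = re.compile(r"\b(?:S\.|St\.?|Saint)\s+George\b")

# A sample sentence containing two different renderings.
text = "Buried at S. George; christened at Saint George Eastcheap."

# re.sub replaces every match at once, however many instances there are,
# applying the same standard form consistently throughout.
standardised = pattern.sub("St George", text)
# → "Buried at St George; christened at St George Eastcheap."
```

However large the text and however many variants it contains, the substitution is applied everywhere in one operation, which is exactly what makes regular expressions suited to the regularisation task described in this section.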