Case Study: Data Mining with Criminal Intent
Project Title: Data Mining with Criminal Intent
Type: Text Mining
Address: http://criminalintent.org/
Introduction and definition
Project definition: text mining is the derivation of structured, meaningful data from a large body of unstructured data, using automated, analytical methods.
Some writers, such as Hearst (2003, http://people.ischool.berkeley.edu/~hearst/text-mining.html), distinguish text mining from data mining: the latter refers to structured data (such as databases) and the former to unstructured text files. The Data Mining With Criminal Intent project team does not seem to observe such a distinction. As there is no clear consensus on this matter of terminology we will not pursue it further.
Text mining involves a variety of techniques for analysing data. Once a corpus has been assembled (such assembly may in itself be considered part of text mining) the texts are normally prepared. This may involve tokenisation, which divides the text into the required elements, such as words (for example, determining whether book-end is one token, or two: book and end). It may also involve stemming or lemmatisation, which treats the different forms of a word as one entity (for example the verb forms fight, fought and fighting).
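The two preparation steps described above can be illustrated with a minimal Python sketch. It is not taken from the project: the function names, the sample sentence and the small lookup table (which stands in for a real stemmer or lemmatiser) are all invented for illustration.

```python
import re

def tokenise(text, split_hyphens=False):
    """Lower-case the text and return its word tokens."""
    # By default a hyphenated word such as "book-end" stays one token;
    # with split_hyphens=True it becomes two tokens, "book" and "end".
    pattern = r"[a-z]+" if split_hyphens else r"[a-z]+(?:-[a-z]+)*"
    return re.findall(pattern, text.lower())

# Hypothetical hand-written table standing in for a real stemmer or lemmatiser.
BASE_FORMS = {"fought": "fight", "fighting": "fight", "fights": "fight"}

def normalise(token):
    """Map a token to its base form if it is in the table, else leave it alone."""
    return BASE_FORMS.get(token, token)

sample = "They fought over the book-end; the fighting went on."
one_token = tokenise(sample)                       # keeps "book-end" whole
two_tokens = tokenise(sample, split_hyphens=True)  # splits it into "book", "end"
normalised = [normalise(t) for t in one_token]     # "fought", "fighting" -> "fight"
```

The choice between the two tokenisation modes matters for later counts: with hyphens kept, "book-end" is a word in its own right, while with hyphens split it adds to the totals for "book" and "end".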
Statistical techniques are used to extract grammatical information (for example, by automatic parsing to determine parts of speech) or semantic information (such as sentiment analysis, or spam filtering). Text mining can also discover associations between texts, or the entities within texts such as people and places.
The UK’s National Centre for Text Mining, NaCTeM, based at the University of Manchester, is a good resource for information about text mining and also provides text mining tools freely to UK Higher Education institutions: http://www.nactem.ac.uk/.
Project description
Data Mining With Criminal Intent is a project to produce a research environment for the Old Bailey Online (http://www.oldbaileyonline.org/), an online edition of the texts of criminal trials held at the central London court between 1674 and 1913. Since the Old Bailey Online contains records of almost 200,000 trials (amounting to 127 million words) it is a good candidate for automated procedures to interrogate the corpus of texts it contains and to obtain more nuanced results than would be possible simply from using the available search facilities.
Data Mining With Criminal Intent is an international project in which collaborators from different institutions have worked on different aspects of the project. The project was funded by JISC, NEH and the SSHRC Programme (higher education funding bodies for the UK, US and Canada, respectively) and involved team members from each of those countries.
The declared aims of the research environment are that it will allow a user to:
- Query the records of the Old Bailey Online.
- Save the results of the queries to a Zotero account.
- Send selected results and texts to Voyant Tools for analysis.
A further key theme of the project is that it should allow what is referred to as the ‘ordinary working historian’ to incorporate text mining into their work. The project considers that, if digital research tools are to transform the humanities, it is essential that the ordinary working historian is empowered to use them.
The project did not develop any new tools, but worked on existing ones, with the focus on making them work seamlessly together.
Use of tool
The heart of the text mining part of the project is the API that was developed to increase the flexibility of queries and allow for subsequent processing to be applied to the results. The OBAPI (http://www.oldbaileyonline.org/obapi/) is available as a demonstrator webpage, where users can construct advanced searches; the advantage over a conventional advanced search is that the demonstrator facilitates export of data to Zotero and Voyant.
The Web API can be used in the conventional way, by passing arguments in a URL, and returns data in the JSON (JavaScript Object Notation) format. For example (using a slightly modified example given at http://www.oldbaileyonline.org/static/DocAPI.jsp), this URL:
http://www.oldbaileyonline.org/obapi/ob?term0=trialtext_sheffield&term1=offcat_deception&breakdown=offsubcat&count=10&start=3
returns the results 4-13 of trial texts containing the word ‘Sheffield’ and the offence category ‘deception’, as well as any subcategories of offence. The URL arguments begin after the ? and are separated by &, so we can break this URL query down as follows:
part of URL | meaning | notes
term0=trialtext_sheffield | the first term, ‘Sheffield’, should appear anywhere in a trial text | term indexing begins at zero, so the first term is term 0
term1=offcat_deception | the second term, ‘deception’, should appear in the offence category field | clearly it is not possible to guess the naming conventions of these categories; they have to be derived from the API documentation
breakdown=offsubcat | include a breakdown of the results by offence subcategory |
count=10 | give 10 results | the default is to return all results if this argument is omitted
start=3 | start from the fourth result | defaults to 0 if this argument is omitted
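As a rough illustration of how such a query could be assembled programmatically, the following Python sketch rebuilds the example URL from its parts using only the standard library. The helper name build_query and its parameters are hypothetical. The field names (trialtext, offcat, offsubcat) are the ones used in the example above, and any others would have to be taken from the API documentation.

```python
from urllib.parse import urlencode, urlsplit, parse_qsl

BASE_URL = "http://www.oldbaileyonline.org/obapi/ob"

def build_query(terms, breakdown=None, count=None, start=None):
    """Return a query URL for a list of (field, value) search terms."""
    # Terms are numbered from zero: term0, term1, ...
    params = [(f"term{i}", f"{field}_{value}")
              for i, (field, value) in enumerate(terms)]
    # Optional arguments are only added when the caller supplies them.
    if breakdown is not None:
        params.append(("breakdown", breakdown))
    if count is not None:
        params.append(("count", count))
    if start is not None:
        params.append(("start", start))
    return f"{BASE_URL}?{urlencode(params)}"

url = build_query([("trialtext", "sheffield"), ("offcat", "deception")],
                  breakdown="offsubcat", count=10, start=3)

# Going the other way: split an existing URL back into its arguments.
arguments = dict(parse_qsl(urlsplit(url).query))
```

Here url is the same string as the example query, and arguments maps each argument name (term0, term1, breakdown, count, start) to its value as text.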
The results from this query look like this (although the use of JSON format may sound off-putting, the results are actually very easy to read):
{ "total" : 84,
"breakdown" :
[
{ "term" : "fraud", "total" : 44 },
{ "term" : "forgery", "total" : 31 },
{ "term" : "bankrupcy", "total" : 7 },
{ "term" : "perjury", "total" : 4 },
{ "term" : "receiving", "total" : 2 },
{ "term" : "simpleLarceny", "total" : 1 },
{ "term" : "stealingFromMaster", "total" : 1 }
]
,
"highlight" :
[
"sheffield"
]
,
"hits" :
[
"t18110529-18",
"t18171029-84",
"t18300527-110",
"t18310106-4",
"t18381022-2382",
"t18390617-1733",
"t18440101-380",
"t18470614-1435",
"t18510407-820",
"t18511027-1888"
]
}
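A response of this kind is also straightforward to process in code. The sketch below, which is illustrative rather than part of the project, parses an abridged copy of the result above with Python's standard json module. The helper names are hypothetical, and the reading of a trial ID as the letter t followed by a date in YYYYMMDD form is an assumption based on the appearance of the IDs.

```python
import json

# Abridged copy of the response shown above; a live query would fetch it over HTTP.
response_text = """
{ "total": 84,
  "breakdown": [
    {"term": "fraud", "total": 44},
    {"term": "forgery", "total": 31}
  ],
  "hits": ["t18110529-18", "t18171029-84", "t18300527-110"]
}
"""

def summarise(text):
    """Return (total, {subcategory: count}, list of trial IDs)."""
    data = json.loads(text)
    counts = {row["term"]: row["total"] for row in data.get("breakdown", [])}
    return data["total"], counts, data.get("hits", [])

def trial_date(trial_id):
    """Read a date out of an ID such as t18110529-18 (ID layout is assumed)."""
    digits = trial_id[1:9]  # the eight characters after the leading "t"
    return f"{digits[:4]}-{digits[4:6]}-{digits[6:8]}"

total, counts, hits = summarise(response_text)
dates = [trial_date(h) for h in hits]
```

With the abridged data, counts holds the totals for fraud and forgery, and dates holds one date string for each trial ID in hits.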
Results can then be exported to Zotero (http://www.zotero.org/), a free reference manager that is used in conjunction with the Firefox browser. Zotero can be used for collecting citations from web pages (it detects when page content is Zotero compatible), as well as storing many other kinds of information. A Zotero account allows users to sync their Zotero data across multiple devices or machines.
The final stage of the pipeline is that users can analyse their results using Voyant Tools: http://voyant-tools.org/. Here users can upload text in a variety of formats, or point the tool at URLs, and perform a variety of types of simple linguistic analysis, such as word counts, frequency, and relative frequency across documents. Stop words can be specified and the results can be compared to another corpus or exported in a variety of formats.
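To make the kind of analysis described here concrete, the following Python sketch computes word counts and relative frequencies for two short texts after removing stop words. It is a simplified illustration and not Voyant Tools' own method. The stop word list, the function name and both sample sentences are invented for the example and are not drawn from the Old Bailey Online.

```python
import re
from collections import Counter

# A deliberately tiny, hypothetical stop word list.
STOP_WORDS = {"the", "a", "of", "and", "was", "to", "in"}

def word_frequencies(text, stop_words=STOP_WORDS):
    """Return (raw counts, relative frequencies) for the words in text."""
    tokens = [t for t in re.findall(r"[a-z]+", text.lower())
              if t not in stop_words]
    counts = Counter(tokens)
    total = len(tokens)
    # Relative frequency: share of all remaining tokens taken up by each word.
    relative = {word: n / total for word, n in counts.items()} if total else {}
    return counts, relative

# Invented sample documents.
doc_a = "The prisoner was charged with forgery of a bank note."
doc_b = "The prisoner stole a watch and a bank note in Sheffield."

counts_a, relative_a = word_frequencies(doc_a)
counts_b, relative_b = word_frequencies(doc_b)

# Words that appear in both documents once stop words are removed.
shared = sorted(set(counts_a) & set(counts_b))
```

Relative frequencies allow documents of different lengths to be compared, which raw counts alone do not.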
Further possibilities
There seem to be two potential developments of Data Mining With Criminal Intent. The first would be to integrate more tools, allowing users to, for example, use more advanced natural language processing techniques on the data (which at present Voyant Tools cannot do). The limitation with this approach is that the project’s aim of serving the ordinary working historian would be compromised if techniques were introduced that require a steep learning curve.
A second possibility is the more general one that other projects could adopt this pipeline-like approach. Only a minority of historians will want to access the Old Bailey Online for research purposes, but in principle it does not seem too difficult to apply the approach to other large datasets, now that this project has provided a proof of concept. However, a limitation to this may be that the Old Bailey Online is somewhat unusual as a historical dataset, being both extremely regularly structured and available in a high-quality transcription.
Conclusion
The project’s aim of allowing the ordinary working historian to use digital research techniques with a minimum of effort is a laudable one. It also stands out in a field in which digital projects are often led by enthusiastic power users who tend to lose sight of the technical demands that the project outcomes may make on interested academics. A side benefit of the project is that it may convince some researchers that they can use digital tools effectively in their work, and perhaps even enthuse them sufficiently to learn more digital research skills.