1. Creating new data
Once you have identified a corpus of text that you would like to mark up semantically, the first question is whether it already exists in electronic form. If it already exists, and you have permission to reuse it, then in some respects your life is easier. In this section we’ll talk about ways of getting text into electronic format, and in the next we’ll discuss reusing existing electronic texts.
You could just sit down with the print or manuscript text you’re interested in and type it up. For small amounts of text this is a good approach: you’ll get to know the text well and be well placed to decide how to mark it up; it’s probably not a good idea to do markup as you go, unless you’re very sure that you’re not going to change your mind halfway through.
For large amounts of text there are two basic approaches for print material: OCR and rekeying.
Optical Character Recognition (OCR) is software that ‘reads’ an image of a text and tries to transcribe the image into characters. Results vary widely. A piece of text in a common font, printed off with a good printer on clean white paper, will usually OCR very accurately; an old, broken typeface (not to mention handwriting) will usually OCR very inaccurately. OCR technology continues to improve, but if you opt for OCR you should be aware of the problems of transcription that may come with it – you may, for example, want to build a human correction phase into your text capture.
For a nice illustration of the pitfalls of OCR, in this case in searching the Eighteenth Century Collections Online, see the post on the IHR Digital blog. The post links to an article in Eighteenth Century Studies which investigates the implications of OCR capture for searching.
A popular piece of OCR software is ABBYY FineReader, which is not very expensive. Although OCR is improving all the time, rekeying currently gets more accurate results.
Rekeying (having a text retyped) is a more reliable but more expensive way of capturing text. There are a number of companies around the world, particularly in India, that can produce highly accurate rekeying at a reasonable cost. The client will usually send image files of the text they want keyed, by FTP, along with keying instructions. An accuracy rate is agreed in the contract, and if the client demonstrates on receipt that the rate has not been met, the files are reworked at no additional cost to the client.
When provided with a DTD or a Schema, keying companies will produce valid XML. However, note that price is normally determined on a per-byte basis from the delivered files, so the more tagging you ask for the more you will have to pay (and the more checking you will have to do). In practice it is often a good idea to limit the tagging done by keying companies to visual elements of the text, such as paragraphs, tables, or font changes.
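To get a feel for how tagging affects a per-byte price, you can compare the size of the same passage with different amounts of markup. The following is only a rough sketch: the sample strings, the element names in the heavily tagged version, and the assumption that files are delivered as UTF-8 are all ours, not any keying company’s actual pricing method.

```python
# Hypothetical samples: one sentence keyed with no, light, and heavy tagging.
# The element names below are illustrative only.
plain = "No, sure, my lord, my mother cried."
light = "<p>No, sure, my lord, my mother cried.</p>"
heavy = (
    '<p><said who="Beatrice">No, sure, my <roleName>lord</roleName>, '
    "my mother cried.</said></p>"
)


def byte_size(text: str) -> int:
    """Size of the text in bytes, assuming it is delivered as UTF-8."""
    return len(text.encode("utf-8"))


def markup_overhead(tagged: str, untagged: str) -> float:
    """Fraction by which tagging increases the byte count of the text."""
    return byte_size(tagged) / byte_size(untagged) - 1.0


if __name__ == "__main__":
    for label, sample in [("light", light), ("heavy", heavy)]:
        extra = markup_overhead(sample, plain)
        print(f"{label}: {byte_size(sample)} bytes, {extra:.0%} more than plain")
```

Running the same comparison on a representative page of your own text, tagged to the level you intend to request, gives a rough idea of how much the markup adds to the bill.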
For handwriting there are keying companies who offer transcription and markup, but this is naturally a more demanding service. A project which used this method recently is Queen Victoria’s Journals. Apart from having everything done by the project team, a final possibility is crowdsourcing. This involves putting images of the manuscripts on the web and inviting members of the public to contribute transcriptions. This can work – see the Bentham Project for an example of successful crowdsourcing of manuscript transcription. More generally, the popular Zooniverse suite of crowdsourced projects has produced a PDF outlining its approach.
Three issues to bear in mind with crowdsourcing are:
- Motivation – how will you keep your volunteers engaged with the project?
- Credit – how will you credit them, and who owns the transcriptions?
- Quality control – who is checking for accuracy and how long will this take?
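One possible way to focus the checking effort is to have each page transcribed independently by two volunteers and to look only at the places where they disagree. The sketch below assumes that double-transcription workflow, which is our assumption and not necessarily how any particular project works; it uses Python’s standard difflib module, and the function name is our own.

```python
import difflib


def disagreements(first: str, second: str) -> list[tuple[str, str]]:
    """Return pairs of word runs where two transcriptions differ.

    Each pair holds the words from the first transcription and the
    corresponding words from the second; either side may be empty if
    one volunteer has words the other lacks.
    """
    a, b = first.split(), second.split()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    differences = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            differences.append((" ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return differences


if __name__ == "__main__":
    volunteer_one = "but then there was a star danced"
    volunteer_two = "but then there was a star damced"
    for left, right in disagreements(volunteer_one, volunteer_two):
        print(f"check: {left!r} vs {right!r}")
```

Agreement between two volunteers does not guarantee that a reading is correct, so this only narrows down where a human checker should look first.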
Whichever method you choose, you will need to decide upon an appropriate level of accuracy. Beware here, because accuracy figures can often be flattering. For example, the following piece of text is 95% accurate:
No, fure, mylord, my mother ried, but then there was a star damced, and under that was I bord.
There are five errors here in about 100 characters and the effect is clearly significant. At this reasonable-sounding level of accuracy, word searching is severely compromised and even ordinary reading is difficult. For rekeying, 99.9% character accuracy (one error in 1000 characters) is normally considered the absolute minimum acceptable for academic purposes.
In case you’re interested, the text should have read:
No, sure, my lord, my mother cried, but then there was a star danced, and under that was I born.
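If you want to check an accuracy figure yourself, one common approach is to count the minimum number of single-character insertions, deletions and substitutions needed to turn the transcription into the correct text (the edit distance) and divide by the length of the correct text. The sketch below does this for the example above. The function names are our own, and a real keying contract may define its accuracy rate differently.

```python
from typing import Sequence


def edit_distance(a: Sequence, b: Sequence) -> int:
    """Levenshtein distance: fewest insertions, deletions and
    substitutions needed to turn sequence a into sequence b."""
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(
                previous[j] + 1,         # delete item_a
                current[j - 1] + 1,      # insert item_b
                previous[j - 1] + cost,  # substitute or keep
            ))
        previous = current
    return previous[-1]


def accuracy(transcribed: Sequence, reference: Sequence) -> float:
    """Share of the reference that was captured correctly."""
    if len(reference) == 0:
        raise ValueError("reference must not be empty")
    return 1 - edit_distance(transcribed, reference) / len(reference)


captured = ("No, fure, mylord, my mother ried, but then there was "
            "a star damced, and under that was I bord.")
correct = ("No, sure, my lord, my mother cried, but then there was "
           "a star danced, and under that was I born.")

if __name__ == "__main__":
    # Character-level accuracy compares the strings directly.
    print(f"character accuracy: {accuracy(captured, correct):.1%}")
    # Word-level accuracy compares lists of whitespace-separated words.
    print(f"word accuracy: {accuracy(captured.split(), correct.split()):.1%}")
```

For this example the character accuracy comes out at just under 95% (5 errors in 96 characters), but measured by words it is only 70%, because a single wrong character spoils a whole word. That gap is why a flattering character figure can still leave word searching badly compromised.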