An introduction to text mining
2. Case study: Ngram Viewer
2.1 What does the Google Ngram Viewer do?
Before you try the Ngram Viewer there are a few essentials you must know. First, for any given year the graph will only plot a term if there are more than 40 occurrences, to avoid overwhelming the diagram. Second, to deal with the growth in the number of books published over time, the results are normalised by the number of books published in each year. However, results between 1500 and 1800 are less reliable, as too few books were published in that period to give accurate results for many queries.
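To see why normalisation matters, here is a minimal sketch in Python. It is not Google's code: the function name `relative_frequency`, the variable names and all the figures are invented for illustration, and `yearly_totals` simply stands for whatever per-year total is used as the denominator.

```python
def relative_frequency(term_counts, yearly_totals):
    """Divide each year's raw count for a term by that year's total."""
    return {
        year: term_counts.get(year, 0) / total
        for year, total in yearly_totals.items()
        if total > 0  # skip years with no data rather than divide by zero
    }


# Hypothetical figures: the raw count grows tenfold between the two years,
# but so does the yearly total.
term_counts = {1800: 50, 1900: 500}
yearly_totals = {1800: 1_000_000, 1900: 10_000_000}

frequencies = relative_frequency(term_counts, yearly_totals)
for year in sorted(frequencies):
    print(year, frequencies[year])
```

With these made-up numbers the raw counts suggest a tenfold rise in use, while the normalised values are identical for both years. This is the kind of distortion that normalisation is meant to remove.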
This brings us to another important point. As already stated, Google Books represents only 4% of publications, and these are not necessarily representative of what was published in any given year. While the Ngram Viewer draws on the largest collection available, it is not a complete collection. It is little more than an arbitrary selection shaped by contingencies such as copyright, availability, publisher agreements and language.
The most confusing elements of the Ngram Viewer are, however, the term ‘ngram’ itself and the smoothing factor that you can choose (on a range from 0 to 50). N-gram is a term commonly used in science and mathematics, but in recent years it has also become popular in natural language processing (see section 4). A 1-gram is a string of characters with no spaces, usually, but not always, an entire word. An n-gram is a sequence of n such 1-grams. Thus ‘History’ is a 1-gram, while ‘Old Bailey’ is a 2-gram. A term such as ‘the United Kingdom’ is a 3-gram.
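The idea can be illustrated with a few lines of Python. This is a simplified sketch that splits text on spaces only. The function name `ngrams` is our own, and the tokenisation Google actually uses (for example its handling of punctuation) is more elaborate than this.

```python
def ngrams(text, n):
    """Return every run of n consecutive space-separated tokens in text."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


phrase = "the United Kingdom"
print(ngrams(phrase, 1))  # the 1-grams in the phrase
print(ngrams(phrase, 2))  # the 2-grams
print(ngrams(phrase, 3))  # the single 3-gram
```

The phrase contains three 1-grams (‘the’, ‘United’, ‘Kingdom’), two 2-grams (‘the United’, ‘United Kingdom’) and one 3-gram, which is the whole phrase.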
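The smoothing ‘factor’ is simpler than it sounds. Basically, smoothing helps to make the graph more legible and thus easier to analyse. As the term suggests, ‘smoothing’ averages out values over a range of years. A smoothing factor of 0 shows the raw value for each year. A smoothing factor of 1 averages each year's value with the value for one year on either side (three values in all), and a smoothing factor of 3 averages over seven values: the year itself plus three years on either side. The higher the factor, the smoother the graph.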
Comparison of Atlantis with El Dorado with no smoothing
Example of Atlantis and El Dorado with smoothing of 3
Example of Atlantis and El Dorado with smoothing of 50 (maximum)
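The effect of smoothing can be reproduced with a simple moving average. The sketch below is not Google's implementation. It assumes that a smoothing of n averages each year with the n years on either side, as Google's documentation describes, and that fewer values are used at the edges of the date range. The function name `smooth` and the data are invented for illustration.

```python
def smooth(values, smoothing):
    """Average each value with up to `smoothing` neighbours on either side."""
    smoothed = []
    for i in range(len(values)):
        start = max(0, i - smoothing)                # clip at the first year
        end = min(len(values), i + smoothing + 1)    # clip at the last year
        window = values[start:end]
        smoothed.append(sum(window) / len(window))
    return smoothed


# Hypothetical yearly values with a single sharp spike in the middle year.
raw = [0, 0, 9, 0, 0]

print(smooth(raw, 0))  # no smoothing: the spike is unchanged
print(smooth(raw, 1))  # the spike is spread over the neighbouring years
```

With a smoothing of 1 the spike of 9 becomes a plateau of 3.0 across three years. The graph is easier to read, but a one-year event now looks like a three-year trend, which is why it is worth comparing several smoothing levels.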
Alongside the options to restrict the date range and to choose which language to search, the smoothing factor is well worth testing at different levels as an aid to interpreting results. It is, however, only when you compare those results to the texts themselves (usefully made available by Google in shorter date ranges) that their true meaning can be identified.
Other issues to consider include:
OCR (Optical Character Recognition, i.e. the technology that scans an image and automatically recognises the words in it) is not perfect. The greatest problem is the ‘long s’ (the archaic form of the letter s), which often appears as the letter f in the OCR conversion. Thus ‘Congress’ in earlier books often appears as ‘Congrefs’ in Google Books (a more unfortunate example was mistakenly unearthed by the Wall Street Journal because of the use of the long s in the spelling of ‘suck’!). The researcher therefore needs to take errors of this sort into account by looking deeper into their results.
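One practical response is to search for the likely misreading as well as the correct spelling. The sketch below generates such a variant. It rests on a simplifying assumption of our own: that every lowercase s except the last letter of a word was printed as a long s and misread as f. Real printing conventions and OCR behaviour are less regular than this, and the function name `long_s_variant` is invented for illustration.

```python
def long_s_variant(word):
    """Replace every lowercase 's' except a word-final one with 'f'."""
    if len(word) < 2:
        return word
    # The long s was not normally used at the end of a word,
    # so the final letter is left untouched.
    return word[:-1].replace("s", "f") + word[-1]


for word in ["Congress", "possession", "House"]:
    print(word, "->", long_s_variant(word))
```

For ‘Congress’ this produces ‘Congrefs’, the form mentioned above. Searching for both forms and comparing the two lines gives a rough sense of how much a term is being undercounted in earlier books.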
Spelling and terminology
Variations over time in spelling and terminology may skew your results. For example, ‘The Great War’ has since become more commonly called ‘World War I’ or the ‘First World War’. The word ‘colour’ is ‘color’ in American English, and spelling was not as strictly standardised at the start of the nineteenth century as it was in the twentieth. Remember, too, that many words in modern British English have legitimate variant spellings, such as ‘fetus’ and ‘foetus’, ‘organisation’ and ‘organization’, or ‘judgement’ and ‘judgment’.
For variations in terminology you may find the Historical Thesaurus of English useful. For example, if you are looking for documents about jealousy in the nineteenth century, the thesaurus will tell you that for that date range you should also be searching for ‘jealous-hood’. The Thesaurus has recently been embedded in the Oxford English Dictionary and can be used at entry-level, but the Glasgow site offers more search facilities.
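For names and places, the use of capitals or lower case has a direct effect on the results: by default the Ngram Viewer treats capitalised and lower-case forms of a word as different terms.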
To explain some of these issues in more detail, let’s look at an actual example: