You may have heard about Google Books Ngram Viewer or perhaps even dabbled with it at some point in the recent past, but I will dive a bit deeper into using the tool for the purpose of historical textual analysis.
An Overview of Ngrams
In the field of computational linguistics, an n-gram is a contiguous sequence of n items drawn from a sample of speech or text. N-grams are extracted from a corpus of speech or text, and the order of the items within each n-gram is preserved. An n-gram of size 1 is a unigram (“binders”), size 2 is a bigram (“many binders”), size 3 is a trigram (“binders of women”), and greater sizes are referred to as four-grams (“binders full of women”), five-grams (“many binders full of women”), and so on.
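To make the definition concrete, here is a minimal Python sketch that pulls n-grams out of a short phrase. The function name `ngrams` and the whitespace tokenization are my own simplifications for illustration; this is not how Google builds its corpora.

```python
def ngrams(tokens, n):
    """Return every contiguous run of n tokens, in order of appearance."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Naive whitespace tokenization; real corpora use more careful rules.
tokens = "many binders full of women".split()

unigrams = ngrams(tokens, 1)   # five one-word tuples
bigrams = ngrams(tokens, 2)    # ("many", "binders"), ("binders", "full"), ...
five_grams = ngrams(tokens, 5) # the whole phrase as a single five-gram
```

A five-word phrase yields five unigrams, four bigrams, three trigrams, and so on down to a single five-gram.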
The corpora accessible via Google’s Ngram Viewer include American English, British English, Chinese, French, Hebrew, Spanish, Russian, and Italian, processed between 2009 and 2012. The text within these corpora is derived from Google’s massive Google Books digitization endeavor, which is still ongoing. Google notes on its website that it has included only those books with sufficiently high optical character recognition (OCR) quality, and that serials were excluded from the corpora.[1] If you are at all curious, you can download the dataset here.
The Google Books Ngram Viewer is optimized for quick inquiries into the usage of small sets of phrases (or n-grams as described above). The following embedded queries are to help us get more familiar with what is possible using this tool.
First of all, you will want to read this post on a desktop or laptop computer for the best experience. The n-gram embeds are not responsive on mobile phones. :'(
Next, load the viewer by visiting books.google.com/ngrams. Once the website has loaded, you should see a sample query with its results: a graph of “Albert Einstein”, “Sherlock Holmes”, and “Frankenstein”. If you are on a desktop or laptop computer, you can hover your mouse over the lines to see the values per year for a particular term. Clicking on a line will isolate it within the graph, and double-clicking will reset the view so that you can select another.
Note the start and end years for the query and know that you can adjust them. The lower and upper limits of the corpora’s time period are 1500 (the data becomes a bit unreliable that far back) and 2012. As far as I’ve read, more recent books (2016 and beyond) will continue to be added to the corpora. Also note the dropdown box for selecting your language corpus. The smoothing dropdown averages neighboring years to reduce the jaggedness of the line graphs. Google provides this rationale for smoothing, which I will quote since it’s quite specific:
Often trends become more apparent when data is viewed as a moving average. A smoothing of 1 means that the data shown for 1950 will be an average of the raw count for 1950 plus 1 value on either side: (“count for 1949” + “count for 1950” + “count for 1951”), divided by 3. So a smoothing of 10 means that 21 values will be averaged: 10 on either side, plus the target value in the center of them.
At the left and right edges of the graph, fewer values are averaged. With a smoothing of 3, the leftmost value (pretend it’s the year 1950) will be calculated as (“count for 1950” + “count for 1951” + “count for 1952” + “count for 1953”), divided by 4.
A smoothing of 0 means no smoothing at all: just raw data.[2]
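The moving average Google describes can be sketched in a few lines of Python. This is my own reading of the quoted explanation, not Google’s actual code, and the yearly values below are made up for illustration.

```python
def smooth(values, window):
    """Centered moving average; near the edges, average only the values that exist."""
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - window)                # clip at the left edge
        hi = min(len(values), i + window + 1)  # clip at the right edge
        neighbours = values[lo:hi]
        smoothed.append(sum(neighbours) / len(neighbours))
    return smoothed


# Hypothetical values for the years 1950 through 1954.
counts = [4, 8, 6, 2, 10]

smoothing_1 = smooth(counts, 1)  # each year averaged with one neighbour per side
smoothing_3 = smooth(counts, 3)  # the leftmost year averages 1950 through 1953
raw = smooth(counts, 0)          # smoothing of 0 leaves the data unchanged
```

With a smoothing of 3, the leftmost value here is (4 + 8 + 6 + 2) / 4 = 5.0, which matches the edge behavior in the quote above.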
I will cover the case-insensitive box in a subsequent example. Now that we’ve reviewed the interface, let’s dive in!
Mormon, Mormons, and Mormonism
If we try a quick comparison of three similar terms (all unigrams in this case: “Mormon”, “Mormons”, “Mormonism”), we get the following n-gram chart:
Like the default query Google supplies when you visit the website, this kind of comparison is fairly standard and showcases the value of this kind of data visualization. Next, we will look at case-sensitivity and case-insensitivity.
“United Order” vs. “united order”
The following query (“United Order,united order”) demonstrates how case-sensitivity can affect the returned results:
Google treats each bigram as distinct, which can be useful in determining what form of a title or organization was preferred during various periods. “United Order” was definitely more common, but “united order” seems to have been more prevalent in the year 1829. Any ideas why that might be the case?
You can enable case-insensitivity via the checkbox next to the search box. This lumps the variously cased n-grams together, which is useful if you are not interested in the capitalization of your search term or if you want to surface other capitalizations of the n-gram. One more example with case-insensitivity enabled demonstrates how the query “mormon studies” produces an unexpected set of usages:
Although it’s not a surprise that both forms are present in the extant literature, it is surprising to see the emergence of “Mormon Studies” as the preferred form. We love our capitalization it seems!
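For the curious, the lumping that case-insensitivity performs can be sketched roughly as follows. I am assuming the viewer simply adds the yearly series of every capitalization variant together; the function name `merge_case_variants` and the numbers are invented for illustration.

```python
def merge_case_variants(series_by_ngram):
    """Sum the yearly series of n-grams that differ only in capitalization."""
    merged = {}
    for ngram, series in series_by_ngram.items():
        key = ngram.lower()  # all case variants share one lowercase key
        if key in merged:
            merged[key] = [a + b for a, b in zip(merged[key], series)]
        else:
            merged[key] = list(series)
    return merged


# Hypothetical yearly values for three capitalizations of one bigram.
toy_series = {
    "Mormon Studies": [0, 2, 5],
    "Mormon studies": [1, 1, 2],
    "mormon studies": [0, 1, 0],
}

combined = merge_case_variants(toy_series)
```

Here `combined` holds a single entry, “mormon studies”, whose values are the year-by-year sums of the three variants.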
Another useful search parameter is the * wildcard. When you include it in a search, the viewer replaces the * with the top ten unigrams that fill that position for the prescribed time period. For example, “Mormon *” produces the following:
And “mormon *” with case-insensitivity produces:
These are great examples of how wildcards and case-insensitivity can produce some fascinating research leads. You should also note that the Ngram Viewer only supports one * wildcard per n-gram at this point in time.
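To illustrate what the wildcard is doing, here is a rough Python sketch that expands a pattern containing a single * against a tiny table of totals. The totals are invented, and ranking matches by their total over the period is my assumption about how the “top ten” is chosen; `expand_wildcard` is a hypothetical helper, not part of any Google tool.

```python
def expand_wildcard(pattern, totals, limit=10):
    """Return the `limit` most frequent n-grams matching a pattern with one *."""
    parts = pattern.split()
    if parts.count("*") != 1:
        raise ValueError("exactly one * wildcard is supported")
    matches = {}
    for ngram, total in totals.items():
        words = ngram.split()
        # A match has the same length and agrees on every non-wildcard word.
        if len(words) == len(parts) and all(
            p == "*" or p == w for p, w in zip(parts, words)
        ):
            matches[ngram] = total
    return sorted(matches, key=matches.get, reverse=True)[:limit]


# Hypothetical totals for a chosen time period.
toy_totals = {
    "Mormon Church": 90,
    "Mormon Battalion": 40,
    "Mormon people": 25,
    "mormon church": 5,
    "Book of Mormon": 120,
}

top_two = expand_wildcard("Mormon *", toy_totals, limit=2)
```

Note that this sketch is case-sensitive, so “mormon church” is not counted under “Mormon *”, which mirrors the difference between the two charts above.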
Deseret, Kingdom of Deseret, State of Deseret
One more query for the sake of fun and interest:
Isn’t this great? This has been a short dip, but I’d like to delve deeper if you’re interested. Leave some research questions below and I’d love to respond to them as best as I’m able in a follow-up post.
- “Google Ngram Viewer.” Google Ngram Viewer. Accessed May 07, 2016. https://books.google.com/ngrams/info.