The Google Books project, underway since 2004, has scanned or converted approximately 30,000,000 books, making enormous volumes of information available digitally. Erez Aiden and Jean-Baptiste Michel were among the first to figure out that all that knowledge is searchable. But there were challenges, including the fact that knowledge in books is encoded in strings of words, not, as in a spreadsheet, in individual cells. In “Uncharted,” a remarkable and entertaining new book, Aiden and Michel tell the detailed story of how they set out to mine this vast dataset. They developed a new discipline, the practice of quantifying historical change, which they have called “culturomics.” Along the way they worked out a new search application, the Google N-gram Viewer, which “charts the frequency of words and ideas over time.” (An n-gram is a word or phrase: a single word is a 1-gram, a two-word phrase like “New York” is a 2-gram, and so on.)
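For readers who like to see the idea in code, here is a minimal Python sketch of what counting n-grams means. It is my own illustration, not the authors' or Google's code: the function name `ngrams`, the sample sentence, and the crude whitespace tokenization are all assumptions made for the example.

```python
from collections import Counter

def ngrams(text, n):
    """Return every run of n consecutive words in text as a tuple."""
    # Crude tokenization (an assumption): lowercase, split on whitespace.
    words = text.lower().split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

# Hypothetical sample text, standing in for a whole library of books.
sample = "New York is big and New York is busy"

# Count every 2-gram, then ask what share of them is "new york".
counts = Counter(ngrams(sample, 2))
total = sum(counts.values())
share = counts[("new", "york")] / total

print(counts[("new", "york")], "of", total, "2-grams; share =", share)
```

Do the same count separately for the books of each year and plot the share against the year, and you have roughly the kind of chart the Viewer draws.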
General readers – even users of Google Books – may be unaware of this clever and fiendishly addictive application. The N-gram Viewer allows the user to search the text of all the books Google has included in its Google Books project (texts are sortable by language – not all of the books are in English). Users can see how often a particular word, or set of words, appears in the corpus (the y-axis is the frequency). Think Brooklyn is hot now? Unfortunately, I can’t load the illustration, but if you experiment with the N-gram Viewer you can see how much more often the word was used between 1900 and 1940.
Aiden and Michel started out by studying how words change over time – specifically, how irregular verbs become regularized, and why some do not. (One example – why do we still say drove and not drived?) Word frequencies, they learned, follow a power-law distribution, much like earthquake magnitudes on the Richter scale (there are big earthquakes, but not many of them). We use a lot of rare words, though we may not use any one of them very often. The authors say:
“In Ulysses, only ten words are used more than 2,653 times. But there are a hundred words used more than 265 times, and a thousand words used more than 26 times, and so on . . .”
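The arithmetic in that quotation is worth a quick check. Each time the number of words grows tenfold, the usage threshold shrinks about tenfold, so multiplying the two gives nearly the same number every time. The short Python sketch below is my own illustration using only the three figures quoted above; the variable names are hypothetical.

```python
import math

# (number of words, times-used threshold) pairs taken from the quotation.
points = [(10, 2653), (100, 265), (1000, 26)]

# Under a power law with exponent near -1, words x threshold stays
# roughly constant. Here the products are 26530, 26500 and 26000.
products = [words * times for words, times in points]

# Slope of the line through the first and last points on log-log axes.
(w1, t1), (w2, t2) = points[0], points[-1]
slope = (math.log10(t2) - math.log10(t1)) / (math.log10(w2) - math.log10(w1))

print(products)
print(round(slope, 2))
```

A slope close to -1 on log-log axes is the signature of the pattern usually called Zipf's law. Three points from one novel are only a hint, of course, not a proof.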
This insight turned into research that led to a fascinating article in Nature. How the young computer scientists persuaded Google to let them into the database of books and start exploring the data makes up a good portion of the book. The rest reports some of the authors’ explorations of the development of language, what the authors call the half-life of fame (it’s decreasing), the clear effects of censorship visible in the literature, and a culture’s collective memory. It can take a long time to learn about an invention, and often not quite as long to forget it. Take the almost obsolete fax machine. It pops up as a 2-gram in the 1980s and usage rises steeply for perhaps two decades. You’d think the fax machine was a new invention. In fact, it was invented in the 1840s (yes, you read that right). As the authors conclude: “Big news travels fast–but big ideas don’t.”
“Uncharted” is a fascinating and eminently readable peek at new ways of thinking. You can read the authors’ more rigorous but equally accessible original report in Nature, but read the book for the color and expansion of the authors’ insights and wonderful retelling of how they got there. What’s the most fascinating N-gram you’ve come up with? Share your finds in the comments.
Have a book you want me to know about? Email me at email@example.com. I also blog about metrics at asbowie.blogspot.com.