The term “Big Data” has taken on a rather menacing aspect of late, thanks to the disclosures of snooping by the National Security Agency. But there’s more to Big Data than analyzing billions of our phone calls. One of the most remarkable data troves of all can be found in great heaps at any public library.
Since humans first put pen to paper, we’ve created about 130 million books. But according to the authors of this amusing, enlightening new book, we’re only just learning how to read them. Erez Aiden and Jean-Baptiste Michel show that our books are crammed with revelations about history, culture, economics, and politics that would surprise even their authors.
Aiden, a professor of genetics at Baylor College of Medicine, and Michel, an associate scientist at Harvard University, got to wondering what lessons they could learn by studying books apart from the ostensible meaning of each text.
They worked with Google Inc. and its digitized library of about 30 million books to build the Ngram Viewer, a geekily named analytical tool that tracks specific words and phrases, measuring how often they occur within Google’s vast library. The resulting insights may shift our thinking about matters great and small.
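For readers curious about the mechanics, the kind of counting described above can be sketched in a few lines of Python. This is a minimal illustration, not the Ngram Viewer's actual code: the function names `ngrams` and `phrase_frequency`, the whitespace-only word splitting, and the three-sentence toy corpus are all assumptions made up for the example.

```python
from collections import Counter


def ngrams(text, n):
    """Yield every run of n consecutive lowercased words in text."""
    words = text.lower().split()  # assumption: naive whitespace tokenizing
    for i in range(len(words) - n + 1):
        yield " ".join(words[i:i + n])


def phrase_frequency(corpus, phrase):
    """Return {year: share of that year's n-grams that match phrase}.

    corpus is an iterable of (year, text) pairs; n is taken from the
    number of words in phrase.
    """
    n = len(phrase.split())
    target = phrase.lower()
    hits = Counter()    # matches per year
    totals = Counter()  # all n-grams per year
    for year, text in corpus:
        for gram in ngrams(text, n):
            totals[year] += 1
            if gram == target:
                hits[year] += 1
    return {year: hits[year] / totals[year] for year in sorted(totals)}


# Toy corpus: invented sentences for illustration, not real book data.
corpus = [
    (1850, "the united states are a young republic"),
    (1850, "the united states are growing"),
    (1900, "the united states is a great power"),
]

singular = phrase_frequency(corpus, "the United States is")
plural = phrase_frequency(corpus, "the United States are")
```

Dividing by each year's total matters because far more books were published in some years than in others; comparing raw counts would mostly measure the size of the library rather than the popularity of a phrase.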
Consider the familiar observation that Americans refer to the United States as a single political entity. But early in our history, people thought of the country as a collection of separate states and said “the United States are.” Historian James McPherson has written that the shift to the singular happened after the Union’s Civil War victory, a view that enjoys wide acceptance. But he may be a bit off. After scanning thousands of digitized books, the researchers found that most writers didn’t begin using “the United States is” until 1880, well after the war’s end.
Or consider a larger, more troubling question: Can a repressive government stamp out unwelcome ideas? Yes and no. Aiden and Michel found that the Nazi crusade against “degenerate” artists like Marc Chagall and Paul Klee was quite successful in purging their names from books published under the Third Reich. But the men’s fame rebounded after Hitler’s fall.
The effect of Soviet censorship is more discouraging. Men who’d once been major political leaders were wiped from the country’s history books after Joseph Stalin purged them. These men remained invisible to Soviet history for decades after Stalin’s death, until the late 1980s, when Communism was near collapse.
Ngram analysis can reveal at a glance our changing attitudes and concerns. From 1800 to the mid-20th century, frequent appearances of the word “fever” reflected our dread of infectious diseases. But by the 1970s, “fever” had been overtaken by a more pressing worry: “cancer.”
Aiden and Michel are well aware of the limitations of their technique. In a clever section about fame, they note that a number of celebrities were born in 1936, among them Hollywood star Robert Redford. But, according to an Ngram sweep, who is the best-known person born in 1936? Carol Gilligan. Who? Actually, Gilligan is a renowned psychologist, whose work is cited in many books. But outside of intellectual circles, hardly anyone has heard of her. Her exaggerated Ngram fame demonstrates the danger of relying too much on book learning.
The Ngram Viewer is also hampered by its limited scope. While it can presently analyze about 8 million books, that’s not even 10 percent of the planet’s library. And as Aiden and Michel remind us, books make up a tiny fragment of all human knowledge. As our computers, sensors and data storage systems improve, Big Data researchers may someday gain access to all books, newspapers, phone calls, text messages, and e-mails.
At the farthest extreme, they imagine a future in which pretty much everything we do is recorded via video, audio, and biometric sensors. Take one entire lifetime of that, multiply by 7 billion, and analyze the lot. We’d learn some astonishing things. Aiden and Michel recoil at the prospect, but having tasted the first fruits of Big Data social research, they’re plainly hungry for more. Somehow, it always seems to work that way. Ask the NSA.
Hiawatha Bray can be reached at firstname.lastname@example.org.