Big data: Mind the gaps
These days, we can measure a lot about the world – but only the fraction we know about.
We live in a world of Big Data. Our lives are now being mapped in countless ways; sensors on our phones, our cars, and our Web browsers record the people with whom we interact, the places we go, and even what we’re thinking about.
With Big Data comes the promise of equally big knowledge, or so we’re told by scientists and companies. Excited at the prospects, Massachusetts has even launched a “Big Data Initiative” to ensure that the state isn’t left out of the coming boom. The dream is that these streams of data are bound to unlock all kinds of insights about the world.
In many ways, this promise seems likely to be borne out: The data that are being generated are rich, detailed, and eminently crunchable. But it’s not quite correct to think of these datasets as all-encompassing. The advent of Big Data also has a paradoxical risk: that by sending us down the narrow paths of the data we have available, it may cause us to mistake those paths for the whole world.
This might sound like a minor concern, but it’s actually a recurring problem with human knowledge and with how science works. Throughout history, in one field after another, science has made huge progress in precisely the areas where we can measure things, and lagged where we can’t.
The result, over time, has been that we know a lot about the things that are closer to our size, our altitude, our spot in the universe, and less about things that are hard to reach, hard to dig up, and hard to quantify. What we know, in other words, is biased in favor of what we can measure.
Of course, there are ways to remedy this: Explore and venture into the unknown, or otherwise stretch our ability to measure everything, not just what’s immediately around us. But until we achieve this nearly impossible task, scientists recognize that the huge gaps in our knowledge will remain a problem, or at least something for which we must make allowances. In population studies, there is even an established term for this problem: the convenience sample. It’s not necessarily a good or representative sample, but it sure is convenient. And this problem has shaped large swaths of what we know, even on very familiar topics, in ways many people don’t appreciate.
The bias toward measurable information affects some of the most popular arenas of science as well as obscure ones. In paleontology, the big-name dinosaurs are the tyrannosaurus rex, the stegosaurus, and the triceratops. Admittedly, as any 5-year-old can tell you, they’re pretty cool, as dinosaurs go. And it’s tempting to think they’re so well known because they were the most important dinosaurs, or the most successful.
But it turns out that the story is a bit more complex. When paleontologists in the United States began to hunt for dinosaurs, back in the late 19th century and early 20th century, they focused their work on two dig sites: Hell Creek in Montana and Como Bluff in Wyoming. These sites were chosen, among other reasons, for their sheer abundance of fossils and the ease with which they could be worked. And what were the dinosaurs that were easiest to find in these two places? The tyrannosaurus rex, the stegosaurus, and the triceratops.
These dinosaurs have since taken on an outsize influence in popular culture, but not because they are necessarily the most important creatures of the Mesozoic; they were simply the most available. Since then, we’ve found what appear to be larger and, frankly, more frightening dinosaurs, from spinosaurus to giganotosaurus. While we have many complete tyrannosaurus rex skeletons, which can provide important insights into dinosaurs, the tyrannosaurus lived in only a single region, and for only about 2 million years, a little more than 1 percent of the length of the Mesozoic Era.
The low-hanging fruit, in other words, is what we know the most about. In biology, the term is taxonomic bias: when certain species or groups of organisms are studied more than you would expect based on, say, their abundance. For example, insects are far more numerous than vertebrates, but many more research papers have been published on vertebrates. Vertebrates, being much easier to spot (and perhaps because they’re more like us, or at least more familiar), have garnered more attention.
In some cases, where this bias shades into something more sinister, what you might call species-ism, scientists have come up with an even harsher term: taxonomic chauvinism. For example, reptiles and amphibians get short shrift when it comes to research attention compared to birds and mammals, because they’re slimy, creepy, and generally less popular.
Whatever the reasons for our bias, what we like, see, and measure dictates what we know the most about. When it comes to planets outside the solar system, for example, the bigger exoplanets are easier to discover. Therefore, they are the ones we know the most about.
This can sometimes contribute to misleading results. One well-known phenomenon in psychology is that experimental subjects are generally WEIRD. What does this acronym mean? When it comes to performing studies, psychologists overwhelmingly end up relying on Western, Educated, Industrialized, Rich, and Democratic subjects (especially, given that many psychologists work at universities full of students looking to pick up a few dollars, WEIRD undergraduates). Research has found that such traits as visual perception and fairness are far from universal, and there are numerous differences between WEIRD and non-WEIRD populations. As a result, decades’ worth of experimental psychology results may not turn out to be nearly as universal as once thought.
Of course, many clever experiments and studies can be done even with a relative paucity of data. And when we want to learn more, we push to collect more data. We create moon shots, launch space probes, build massive particle accelerators, conduct global marine surveys, and much more. We conduct large efforts that are designed to stretch what we know, and to avoid our biases. However, we’re inevitably hampered by the unknown unknowns.
Which brings us to Big Data. The huge pools of data being generated today aren’t evenly distributed. Rather, Big Data is a series of deep wells, each one plumbing the depths of certain topics. We have a lot of mobile phone data, and Facebook is throwing off huge amounts of information. We can even mine credit card purchase data. But that doesn’t mean that we know everything.
Just because we know how people with iPhones interact with their phones doesn’t mean we know how everyone interacts, in all situations. And knowing how information spreads on Facebook doesn’t necessarily tell us how ideas are adopted in general. These insights can be useful, and they’re certainly much better than simply trying an experiment on a small group of Ivy League undergraduates. But we need to be cautious about how we generalize from them and what kinds of conclusions we draw.
As personal data accumulate, there are big blind spots that researchers already know about. Datasets about how companies grow and develop are spotty and rare. Data about elusive topics like creativity (for example, how new ideas are formed) are far from robust, despite all the business books written about the subject. For all the published data on successful science, there is an almost complete lack of data from unsuccessful scientific experiments, which could be just as useful in the aggregate, if not more so. We lack large datasets that detail how infectious disease works its way from person to person, a problem that would be enormously beneficial to tackle.
Big Data might be deep, in other words, but it’s not wide. We’re certainly getting there. But until we have lots more, or distribute it more evenly, we always have to be aware that we might be dealing with some informational bias. As you hear the latest claim about Big Data and its promise, it’s OK to be excited, but keep in mind that we may be finding the triceratops, and thinking we understand everything there is to know about dinosaurs.
Samuel Arbesman is a senior scholar at the Kauffman Foundation and a fellow at the Institute for Quantitative Social Science at Harvard University. His first book, “The Half-Life of Facts” (Current/Penguin), from which this article is adapted, was published last week.