Big data is a vague term for a massive phenomenon that has rapidly become an obsession with entrepreneurs, scientists, governments and the media.
Five years ago, a team of researchers from Google announced a remarkable achievement in one of the world’s top scientific journals, Nature. Without needing the results of a single medical check-up, they were nevertheless able to track the spread of influenza across the US. What’s more, they could do it more quickly than the Centers for Disease Control and Prevention (CDC). Google’s tracking had only a day’s delay, compared with the week or more it took for the CDC to assemble a picture based on reports from doctors’ surgeries. Google was faster because it was tracking the outbreak by finding a correlation between what people searched for online and whether they had flu symptoms.
Not only was “Google Flu Trends” quick, accurate and cheap, it was theory-free. Google’s engineers didn’t bother to develop a hypothesis about what search terms – “flu symptoms” or “pharmacies near me” – might be correlated with the spread of the disease itself. The Google team just took their top 50 million search terms and let the algorithms do the work.
The success of Google Flu Trends became emblematic of the hot new trend in business, technology and science: “Big Data”. What, excited journalists asked, can science learn from Google?
As with so many buzzwords, “big data” is a vague term, often thrown around by people with something to sell. Some emphasise the sheer scale of the data sets that now exist – the Large Hadron Collider’s computers, for example, store 15 petabytes a year of data, equivalent to about 15,000 years’ worth of your favourite music.
But the “big data” that interests many companies is what we might call “found data”, the digital exhaust of web searches, credit card payments and mobiles pinging the nearest phone mast. Google Flu Trends was built on found data and it’s this sort of data that interests me here. Such data sets can be even bigger than the LHC data – Facebook’s is – but just as noteworthy is the fact that they are cheap to collect relative to their size, they are a messy collage of datapoints collected for disparate purposes and they can be updated in real time. As our communication, leisure and commerce have moved to the internet and the internet has moved into our phones, our cars and even our glasses, life can be recorded and quantified in a way that would have been hard to imagine just a decade ago.
Cheerleaders for big data have made four exciting claims, each one reflected in the success of Google Flu Trends: that data analysis produces uncannily accurate results; that every single data point can be captured, making old statistical sampling techniques obsolete; that it is passé to fret about what causes what, because statistical correlation tells us what we need to know; and that scientific or statistical models aren’t needed because, to quote “The End of Theory”, a provocative essay published in Wired in 2008, “with enough data, the numbers speak for themselves”.
Unfortunately, these four articles of faith are at best optimistic oversimplifications. At worst, according to David Spiegelhalter, Winton Professor of the Public Understanding of Risk at Cambridge University, they can be “complete bollocks. Absolute nonsense.”
Found data underpin the new internet economy as companies such as Google, Facebook and Amazon seek new ways to understand our lives through our data exhaust. Since Edward Snowden’s leaks about the scale and scope of US electronic surveillance, it has become apparent that security services are just as fascinated with what they might learn from our data exhaust.
Consultants urge the data-naive to wise up to the potential of big data. A recent report from the McKinsey Global Institute reckoned that the US healthcare system could save $300bn a year – $1,000 per American – through better integration and analysis of the data produced by everything from clinical trials to health insurance transactions to smart running shoes.
But while big data promise much to scientists, entrepreneurs and governments, they are doomed to disappoint us if we ignore some very familiar statistical lessons.
“There are a lot of small data problems that occur in big data,” says Spiegelhalter. “They don’t disappear because you’ve got lots of the stuff. They get worse.”
. . .
Four years after the original Nature paper was published, Nature News had sad tidings to convey: the latest flu outbreak had claimed an unexpected victim, Google Flu Trends. After reliably providing a swift and accurate account of flu outbreaks for several winters, the theory-free, data-rich model had lost its nose for where flu was going. Google’s model pointed to a severe outbreak but when the slow-and-steady data from the CDC arrived, they showed that Google’s estimates of the spread of flu-like illnesses were overstated by almost a factor of two.
The problem was that Google did not know – could not begin to know – what linked the search terms with the spread of flu. Google’s engineers weren’t trying to figure out what caused what. They were merely finding statistical patterns in the data. They cared about correlation rather than causation. This is common in big data analysis. Figuring out what causes what is hard (impossible, some say). Figuring out what is correlated with what is much cheaper and easier. That is why, according to Viktor Mayer-Schönberger and Kenneth Cukier’s book, Big Data, “causality won’t be discarded, but it is being knocked off its pedestal as the primary fountain of meaning”.
But a theory-free analysis of mere correlations is inevitably fragile. If you have no idea what is behind a correlation, you have no idea what might cause that correlation to break down. One explanation of the Flu Trends failure is that the news was full of scary stories about flu in December 2012 and that these stories provoked internet searches by people who were healthy. Another possible explanation is that Google’s own search algorithm moved the goalposts when it began automatically suggesting diagnoses when people entered medical symptoms.
Statisticians have spent the past 200 years figuring out what traps lie in wait when we try to understand the world through data. The data are bigger, faster and cheaper these days – but we must not pretend that the traps have all been made safe. They have not.
In 1936, the Republican Alfred Landon stood for election against President Franklin Delano Roosevelt. The respected magazine, The Literary Digest, shouldered the responsibility of forecasting the result. It conducted a postal opinion poll of astonishing ambition, with the aim of reaching 10 million people, a quarter of the electorate. The deluge of mailed-in replies can hardly be imagined but the Digest seemed to be relishing the scale of the task. In late August it reported, “Next week, the first answers from these ten million will begin the incoming tide of marked ballots, to be triple-checked, verified, five-times cross-classified and totalled.”
After tabulating an astonishing 2.4 million returns as they flowed in over two months, The Literary Digest announced its conclusions: Landon would win by a convincing 55 per cent to 41 per cent, with a few voters favouring a third candidate.
The election delivered a very different result: Roosevelt crushed Landon by 61 per cent to 37 per cent. To add to The Literary Digest’s agony, a far smaller survey conducted by the opinion poll pioneer George Gallup came much closer to the final vote, forecasting a comfortable victory for Roosevelt. Mr Gallup understood something that The Literary Digest did not. When it comes to data, size isn’t everything.
Opinion polls are based on samples of the voting population at large. This means that opinion pollsters need to deal with two issues: sampling error and sampling bias.
Sampling error reflects the risk that, purely by chance, a randomly chosen sample of opinions does not reflect the true views of the population. The “margin of error” reported in opinion polls reflects this risk and the larger the sample, the smaller the margin of error. A thousand interviews is a large enough sample for many purposes and Mr Gallup is reported to have conducted 3,000 interviews.
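To make the link between sample size and margin of error concrete, here is a minimal sketch using the standard textbook approximation for a simple random sample: a 95 per cent margin of roughly 1.96 × √(p(1 − p)/n). The function name `margin_of_error` and the worst-case choice p = 0.5 are illustrative assumptions, not anything taken from Gallup’s or the Digest’s actual methods.

```python
import math

def margin_of_error(n, p=0.5, z=1.96):
    """Approximate 95% margin of error for a proportion estimated
    from a simple random sample of size n (assumes random sampling)."""
    return z * math.sqrt(p * (1 - p) / n)

# Sample sizes mentioned in the article: a typical poll, Gallup's
# reported interviews, and the Digest's tabulated returns.
for n in (1_000, 3_000, 2_400_000):
    moe = 100 * margin_of_error(n)
    print(f"n = {n:>9,}: about +/- {moe:.2f} percentage points")
```

Under these assumptions the margin shrinks from about three points at 1,000 interviews to a small fraction of a point at 2.4 million. The formula only describes chance error in a randomly chosen sample; it says nothing about a sample that was not chosen at random.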
But if 3,000 interviews were good, why weren’t 2.4 million far better? The answer is that sampling error has a far more dangerous friend: sampling bias. Sampling error is when a randomly chosen sample doesn’t reflect the underlying population purely by chance; sampling bias is when the sample isn’t randomly chosen at all. George Gallup took pains to find an unbiased sample because he knew that was far more important than finding a big one.
The Literary Digest, in its quest for a bigger data set, fumbled the question of a biased sample. It mailed out forms to people on a list it had compiled from automobile registrations and telephone directories – a sample that, at least in 1936, was disproportionately prosperous. To compound the problem, Landon supporters turned out to be more likely to mail back their answers. The combination of those two biases was enough to doom The Literary Digest’s poll. For each person George Gallup’s pollsters interviewed, The Literary Digest received 800 responses. All that gave them for their pains was a very precise estimate of the wrong answer.
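A toy simulation can illustrate how a huge biased sample yields “a very precise estimate of the wrong answer”. Everything in this sketch is a made-up assumption chosen for illustration: a population split into a “prosperous” 30 per cent who back Roosevelt at 40 per cent and everyone else who backs him at 70 per cent (which averages out to 61 per cent overall), a small random poll, and a large poll that reaches only the prosperous group. The large poll is scaled down to 240,000 responses to keep the run quick.

```python
import random

# Hypothetical population, not historical data.
PROSPEROUS_SHARE = 0.30
ROOSEVELT_SUPPORT = {"prosperous": 0.40, "other": 0.70}

def draw_vote(rng, only_prosperous=False):
    """Return True if one sampled voter backs Roosevelt."""
    if only_prosperous or rng.random() < PROSPEROUS_SHARE:
        group = "prosperous"
    else:
        group = "other"
    return rng.random() < ROOSEVELT_SUPPORT[group]

def poll(rng, n, only_prosperous=False):
    """Estimated Roosevelt share from n sampled voters."""
    return sum(draw_vote(rng, only_prosperous) for _ in range(n)) / n

rng = random.Random(1936)
truth = (PROSPEROUS_SHARE * ROOSEVELT_SUPPORT["prosperous"]
         + (1 - PROSPEROUS_SHARE) * ROOSEVELT_SUPPORT["other"])
small_random = poll(rng, 3_000)                         # unbiased, small
large_biased = poll(rng, 240_000, only_prosperous=True)  # biased, huge

print(f"true support:        {truth:.3f}")
print(f"small random sample: {small_random:.3f}")
print(f"large biased sample: {large_biased:.3f}")
```

In this toy setting the small random poll lands near the true figure, while the large biased poll settles tightly around the prosperous group’s 40 per cent. Adding more responses narrows the estimate but does not move it towards the truth.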
The big data craze threatens to be The Literary Digest all over again. Because found data sets are so messy, it can be hard to figure out what biases lurk inside them – and because they are so large, some analysts seem to have decided the sampling problem isn’t worth worrying about. It is.
Professor Viktor Mayer-Schönberger of Oxford’s Internet Institute, co-author of Big Data, told me that his favoured definition of a big data set is one where “N = All” – where we no longer have to sample, but we have the entire background population. Returning officers do not estimate an election result with a representative tally: they count the votes – all the votes. And when “N = All” there is indeed no issue of sampling bias because the sample includes everyone.
But is “N = All” really a good description of most of the found data sets we are considering? Probably not. “I would challenge the notion that one could ever have all the data,” says Patrick Wolfe, a computer scientist and professor of statistics at University College London.
An example is Twitter. It is in principle possible to record and analyse every message on Twitter and use it to draw conclusions about the public mood. (In practice, most researchers use a subset of that vast “fire hose” of data.) But while we can look at all the tweets, Twitter users are not representative of the population as a whole. (According to the Pew Research Internet Project, in 2013, US-based Twitter users were disproportionately young, urban or suburban, and black.)
There must always be a question about who and what is missing, especially with a messy pile of found data. Kaiser Fung, a data analyst and author of Numbersense, warns against simply assuming we have everything that matters. “N = All is often an assumption rather than a fact about the data,” he says.
Consider Boston’s Street Bump smartphone app, which uses a phone’s accelerometer to detect potholes without the need for city workers to patrol the streets. As citizens of Boston download the app and drive around, their phones automatically notify City Hall of the need to repair the road surface. Solving the technical challenges involved has produced, rather beautifully, an informative data exhaust that addresses a problem in a way that would have been inconceivable a few years ago. The City of Boston proudly proclaims that the “data provides the City with real-time information it uses to fix problems and plan long term investments.”
Yet what Street Bump really produces, left to its own devices, is a map of potholes that systematically favours young, affluent areas where more people own smartphones. Street Bump offers us “N = All” in the sense that every bump from every enabled phone can be recorded. That is not the same thing as recording every pothole. As Microsoft researcher Kate Crawford points out, found data contain systematic biases and it takes careful thought to spot and correct for those biases. Big data sets can seem comprehensive but the “N = All” is often a seductive illusion.
Who cares about causation or sampling bias, though, when there is money to be made? Corporations around the world must be salivating as they contemplate the uncanny success of the US discount department store Target, as famously reported by Charles Duhigg in The New York Times in 2012. Duhigg explained that Target has collected so much data on its customers, and is so skilled at analysing that data, that its insight into consumers can seem like magic.
Duhigg’s killer anecdote was of the man who stormed into a Target near Minneapolis and complained to the manager that the company was sending coupons for baby clothes and maternity wear to his teenage daughter. The manager apologised profusely and later called to apologise again – only to be told that the teenager was indeed pregnant. Her father hadn’t realised. Target, after analysing her purchases of unscented wipes and magnesium supplements, had.
Statistical sorcery? There is a more mundane explanation.
“There’s a huge false positive issue,” says Kaiser Fung, who has spent years developing similar approaches for retailers and advertisers. What Fung means is that we didn’t get to hear the countless stories about all the women who received coupons for babywear but who weren’t pregnant.
In Charles Duhigg’s account, Target mixes in random offers, such as coupons for wine glasses, because pregnant customers would feel spooked if they realised how intimately the company’s computers understood them.
Fung has another explanation: Target mixes up its offers not because it would be weird to send an all-baby coupon-book to a woman who was pregnant but because the company knows that many of those coupon books will be sent to women who aren’t pregnant after all.
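The false positive issue is easiest to see with some base-rate arithmetic. The numbers below are invented purely for illustration and are not Target’s: suppose 2 per cent of shoppers are pregnant, the model flags 80 per cent of those who are, and it wrongly flags 5 per cent of those who are not. The helper `share_of_flagged_who_are_pregnant` is a hypothetical name.

```python
def share_of_flagged_who_are_pregnant(base_rate, sensitivity,
                                      false_positive_rate):
    """Of all shoppers the model flags, what fraction really are pregnant?"""
    true_positives = base_rate * sensitivity
    false_positives = (1 - base_rate) * false_positive_rate
    return true_positives / (true_positives + false_positives)

# Hypothetical inputs, chosen only to illustrate the arithmetic.
precision = share_of_flagged_who_are_pregnant(
    base_rate=0.02, sensitivity=0.80, false_positive_rate=0.05)
print(f"share of flagged shoppers who are pregnant: {precision:.1%}")
```

With these assumed rates, roughly three out of four baby-coupon books would go to shoppers who are not pregnant, even though the model catches most of those who are. A predictor of this kind can still be profitable without being anything like omniscient.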
None of this suggests that such data analysis is worthless: it may be highly profitable. Even a modest increase in the accuracy of targeted special offers would be a prize worth winning. But profitability should not be conflated with omniscience.
In 2005, John Ioannidis, an epidemiologist, published a research paper with the self-explanatory title, “Why Most Published Research Findings Are False”. The paper became famous as a provocative diagnosis of a serious issue. One of the key ideas behind Ioannidis’s work is what statisticians call the “multiple-comparisons problem”.
It is routine, when examining a pattern in data, to ask whether such a pattern might have emerged by chance. If it is unlikely that the observed pattern could have emerged at random, we call that pattern “statistically significant”.
The multiple-comparisons problem arises when a researcher looks at many possible patterns. Consider a randomised trial in which vitamins are given to some primary schoolchildren and placebos are given to others. Do the vitamins work? That all depends on what we mean by “work”. The researchers could look at the children’s height, weight, prevalence of tooth decay, classroom behaviour, test scores, even (after waiting) prison record or earnings at the age of 25. Then there are combinations to check: do the vitamins have an effect on the poorer kids, the richer kids, the boys, the girls? Test enough different correlations and fluke results will drown out the real discoveries.
There are various ways to deal with this but the problem is more serious in large data sets, because there are vastly more possible comparisons than there are data points to compare. Without careful analysis, the ratio of genuine patterns to spurious patterns – of signal to noise – quickly tends to zero.
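A short simulation shows how fluke results pile up when many outcomes are tested. This is a sketch under simplifying assumptions, not a model of any real trial: the “vitamin” has no effect at all, every outcome is an independent standard normal measurement, each comparison is a simple two-sided z-test at the conventional 5 per cent level, and the function names are hypothetical.

```python
import math
import random

def chance_of_any_fluke(num_tests, alpha=0.05):
    """Probability that at least one of num_tests independent tests of a
    true null hypothesis comes out 'significant' at level alpha."""
    return 1 - (1 - alpha) ** num_tests

def count_fluke_findings(num_outcomes, group_size, rng):
    """Test num_outcomes outcomes on which the treatment truly has no
    effect, and count how many still look statistically significant."""
    flukes = 0
    for _ in range(num_outcomes):
        treated = [rng.gauss(0, 1) for _ in range(group_size)]
        placebo = [rng.gauss(0, 1) for _ in range(group_size)]
        diff = sum(treated) / group_size - sum(placebo) / group_size
        z = diff / math.sqrt(2 / group_size)  # known unit variance
        if abs(z) > 1.96:
            flukes += 1
    return flukes

rng = random.Random(2005)
print(f"chance of at least one fluke in 20 tests: {chance_of_any_fluke(20):.0%}")
print("fluke findings among 1,000 null outcomes:",
      count_fluke_findings(1_000, 50, rng))
```

With 20 independent comparisons, the chance of at least one spurious “significant” result is already close to two in three, and among 1,000 null outcomes around 5 per cent would be expected to pass the test by chance alone. Corrections such as dividing the significance threshold by the number of tests (the Bonferroni adjustment) are one of the “various ways” of dealing with this.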
Worse still, one of the antidotes to the multiple-comparisons problem is transparency, allowing other researchers to figure out how many hypotheses were tested and how many contrary results are languishing in desk drawers because they just didn’t seem interesting enough to publish. Yet found data sets are rarely transparent. Amazon and Google, Facebook and Twitter, Target and Tesco – these companies aren’t about to share their data with you or anyone else.
New, large, cheap data sets and powerful analytical tools will pay dividends – nobody doubts that. And there are a few cases in which analysis of very large data sets has worked miracles. David Spiegelhalter of Cambridge points to Google Translate, which operates by statistically analysing hundreds of millions of documents that have been translated by humans and looking for patterns it can copy. This is an example of what computer scientists call “machine learning”, and it can deliver astonishing results with no preprogrammed grammatical rules. Google Translate is as close to a theory-free, data-driven algorithmic black box as we have – and it is, says Spiegelhalter, “an amazing achievement”. That achievement is built on the clever processing of enormous data sets.
But big data do not solve the problem that has obsessed statisticians and scientists for centuries: the problem of insight, of inferring what is going on, and figuring out how we might intervene to change a system for the better.
“We have a new resource here,” says Professor David Hand of Imperial College London. “But nobody wants ‘data’. What they want are the answers.”
To use big data to produce such answers will require large strides in statistical methods.
“It’s the wild west right now,” says Patrick Wolfe of UCL. “People who are clever and driven will twist and turn and use every tool to get sense out of these data sets, and that’s cool. But we’re flying a little bit blind at the moment.”
Statisticians are scrambling to develop new methods to seize the opportunity of big data. Such new methods are essential but they will work by building on the old statistical lessons, not by ignoring them.
Recall big data’s four articles of faith. Uncanny accuracy is easy to overrate if we simply ignore false positives, as with Target’s pregnancy predictor. The claim that causation has been “knocked off its pedestal” is fine if we are making predictions in a stable environment but not if the world is changing (as with Flu Trends) or if we ourselves hope to change it. The promise that “N = All”, and therefore that sampling bias does not matter, is simply not true in most cases that count. As for the idea that “with enough data, the numbers speak for themselves” – that seems hopelessly naive in data sets where spurious patterns vastly outnumber genuine discoveries.
“Big data” has arrived, but big insights have not. The challenge now is to solve new problems and gain new answers – without making the same old statistical mistakes on a grander scale than ever.
Tim Harford’s latest book is “The Undercover Economist Strikes Back”.
Originally posted via “Big data: are we making a big mistake?”