First chink in big data’s armor: Google flu predictions way too high

For anyone who is skeptical about the power of big data to save us all – and solve all of healthcare’s cost/access/quality problems – this article will make you smile. According to a new analysis published in Science, Google Flu Trends (GFT) is not so accurate after all. Apparently, drawing conclusions from the 500 million Google searches conducted every day is less accurate than relying on data from the CDC. TIME reported on the paper, “The Parable of Google Flu: Traps in Big Data Analysis”:

Google Flu Trends overestimated the prevalence of flu in the 2012-2013 and 2011-2012 seasons by more than 50%. From August 2011 to September 2013, GFT over-predicted the prevalence of the flu in 100 out of 108 weeks. During the peak flu season last winter, GFT would have had us believe that 11% of the U.S. had influenza, nearly double the CDC numbers of 6%. If you wanted to project current flu prevalence, you would have done much better basing your models off of 3-week-old data on cases from the CDC than you would have been using GFT’s sophisticated big data methods. “It’s a Dewey beats Truman moment for big data,” says David Lazer, a professor of computer science and politics at Northeastern University and one of the authors of the Science article.
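To put the quoted figures in perspective, here is a quick back-of-the-envelope check in Python. It uses only the numbers quoted above; the variable and function names are my own, chosen for illustration, and this is just arithmetic on the reported figures, not a reconstruction of how the Science authors measured error.

```python
# Figures as quoted above from TIME's summary of the Science paper.
gft_peak = 0.11    # GFT's peak-season estimate of U.S. flu prevalence
cdc_peak = 0.06    # CDC's figure for the same period
weeks_over = 100   # weeks in which GFT over-predicted
weeks_total = 108  # weeks examined, August 2011 to September 2013


def relative_overestimate(estimate, reference):
    """Return how far `estimate` exceeds `reference`, as a fraction of `reference`."""
    return (estimate - reference) / reference


peak_ratio = gft_peak / cdc_peak                          # how many times the CDC figure
peak_excess = relative_overestimate(gft_peak, cdc_peak)   # excess relative to the CDC figure
share_over = weeks_over / weeks_total                     # fraction of weeks over-predicted

print(f"GFT peak is {peak_ratio:.2f}x the CDC figure")
print(f"Relative overestimate at peak: {peak_excess:.0%}")
print(f"Weeks over-predicted: {share_over:.1%}")
```

By this arithmetic, the peak estimate is about 1.83 times the CDC number (roughly an 83% overestimate), and GFT ran high in about 92.6% of the weeks examined.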

Lazer told TIME that a number of associations in the model were really problematic and that the Google analysis was doomed to fail.

Nor did it help that GFT was dependent on Google’s top-secret and always changing search algorithm. Google modifies its search algorithm to provide more accurate results, but also to increase advertising revenue. Recommended searches, based on what other users have searched, can throw off the results for flu trends. While GFT assumes that the relative search volume for different flu terms is based in reality — the more of us are sick, the more of us will search for info about flu as we sniffle above our keyboards — in fact Google itself alters search behavior through that ever-shifting algorithm. If the data isn’t reflecting the world, how can it predict what will happen?

GFT and other big data methods can be useful, but only if they’re paired with what the Science researchers call “small data” — traditional forms of information collection. Put the two together, and you can get an excellent model of the world as it actually is. Of course, if big data is really just one tool of many, not an all-purpose path to omniscience, that would puncture the hype just a bit. You won’t get a SXSW panel with that kind of modesty.

Kara Swisher is right to be leery of Google having access to her genetic data:

A bigger concern, though, is that much of the data being gathered in “big data”— and the formulas used to analyze it — is controlled by private companies that can be positively opaque. Google has never made the search terms used in GFT public, and there’s no way for researchers to replicate how GFT works.