Jun 01, 17: #AnalyticsClub #Newsletter (Events, Tips, News & more..)


Ethics  Source

[ AnalyticsWeek BYTES]

>> Big Data has Big Implications for Customer Experience Management by bobehayes

>> Optimizing your customer relationship survey by bobehayes

>> Big Data Insights in Healthcare, Part II. A Perspective on Challenges to Adoption by froliol

Wanna write? Click Here


 Red Hat launches OpenShift on Google Cloud – ZDNet Under  Cloud

 Justifying the Hybrid Cloud: It’s All About the Application – IT Business Edge (blog) Under  Hybrid Cloud

 MEF, PNDA Roll Out Initiative Focusing on LSO Analytics – SDxCentral Under  Analytics

More NEWS ? Click Here


Machine Learning


6.867 is an introductory course on machine learning which gives an overview of many concepts, techniques, and algorithms in machine learning, beginning with topics such as classification and linear regression and ending … more


On Intelligence


Jeff Hawkins, the man who created the PalmPilot, Treo smart phone, and other handheld devices, has reshaped our relationship to computers. Now he stands ready to revolutionize both neuroscience and computing in one strok… more


Data aids, not replace judgement
Data is a tool and means to help build a consensus to facilitate human decision-making but not replace it. Analysis converts data into information, information via context leads to insight. Insights lead to decision making which ultimately leads to outcomes that brings value. So, data is just the start, context and intuition plays a role.


Q:Examples of NoSQL architecture?
A: * Key-value: in a key-value NoSQL database, all of the data within consists of an indexed key and a value. Cassandra, DynamoDB
* Column-based: designed for storing data tables as sections of columns of data rather than as rows of data. HBase, SAP HANA
* Document Database: map a key to some document that contains structured information. The key is used to retrieve the document. MongoDB, CouchDB
* Graph Database: designed for data whose relations are well-represented as a graph and has elements which are interconnected, with an undetermined number of relations between them. Polyglot Neo4J



Reimagining the role of data in government

 Reimagining the role of data in government

Subscribe to  Youtube


Data that is loved tends to survive. – Kurt Bollacker, Data Scientist, Freebase/Infochimps


#BigData @AnalyticsWeek #FutureOfData #Podcast with  John Young, @Epsilonmktg

 #BigData @AnalyticsWeek #FutureOfData #Podcast with John Young, @Epsilonmktg


iTunes  GooglePlay


By 2020, we will have over 6.1 billion smartphone users globally (overtaking basic fixed phone subscriptions).

Sourced from: Analytics.CLUB #WEB Newsletter

Are U.S. Hospitals Delivering a Better Patient Experience?

The Centers for Medicare & Medicaid Services (CMS) use patient feedback about their care as part of their reimbursement plan for acute care hospitals. Under the Hospital Value-Based Purchasing Program, CMS makes value-based incentive payments to acute care hospitals, based either on how well the hospitals perform on certain quality measures or how much the hospitals’ performance improves on certain quality measures from their performance during a baseline period. This program began in FY 2013 for discharges occurring on or after October 1, 2012.

A standard patient satisfaction survey, known as HCAHPS (Hospital Consumer Assessment of Healthcare Providers and Systems), is the source of the patient feedback for the reimbursement program. I have previously used these publicly available HCAHPS data to understand the state of affairs for US hospitals in 2011 (see Big Data Provides Big Insights for U.S. Hospitals). Now that the Value-Based Purchasing program has been in effect since October 2012, I wanted to revisit the HCAHPS patient survey data to determine if US hospitals have improved. First, let’s review the HCAHPS survey.

The HCAHPS Survey

The survey asks a random sample of recently discharged patients about important aspects of their hospital experience. The data set includes patient survey results for US hospitals on ten measures of patients’ perspectives of care. The 10 measures are:

  1. Nurses communicate well
  2. Doctors communicate well
  3. Received help as soon as they wanted (Responsive)
  4. Pain well controlled
  5. Staff explain medicines before giving to patients
  6. Room and bathroom are clean
  7. Area around room is quiet at night
  8. Given information about what to do during recovery at home
  9. Overall hospital rating
  10. Recommend hospital to friends and family (Recommend)

For questions 1 through 7, respondents were asked to provide frequency ratings about the occurrence of each attribute (Never, Sometimes, Usually, Always). For question 8, respondents were provided a Y/N option. For question 9, respondents were asked to provide an overall rating of the hospital on a scale from 0 (Worst hospital possible) to 10 (Best hospital possible). For question 10, respondents were asked to provide their likelihood of recommending the hospital (Definitely no, Probably no, Probably yes, Definitely yes).

The Metrics

The HCAHPS data sets report metrics for each hospital as percentages of responses. Because the data sets have already been somewhat aggregated (e.g., percentages reported for group of response options), I was unable to calculate average scores for each hospital. Instead, I used top box scores as the metric of patient experience. I found that top box scores are highly correlated with average scores across groups of companies, suggesting that these two metrics tell us the same thing about the companies (in our case, hospitals).

Top box scores for the respective rating scales are defined as: 1) Percent of patients who reported “Always”; 2) Percent of patients who reported “Yes”; 3) Percent of patients who gave a rating of 9 or 10; 4) Percent of patients who said “Definitely yes.”

Top box scores provide an easy-to-understand way of communicating the survey results for different types of scales. Even though there are four different rating scales for the survey questions, using a top box reporting method puts all metrics on the same numeric scale. Across all 10 metrics, hospital scores can range from 0 (bad) to 100 (good).

I examined PX ratings of acute care hospitals across two time periods. The two time periods were 2011 (Q3 2010 through Q2 2011) and 2013 (Q4 2012 through Q3 2013). The data from the 2013 time-frame are the latest publicly available patient survey data as of this writing.

Results: Patient Satisfaction with US Hospitals Increasing

Patient Advocacy Trends for Acute Care Hospitals in US
Figure 1. Patient advocacy has increased for US hospitals

Figure 1 contains the comparisons for patient advocacy ratings for US hospitals across the two time periods. Paired T-tests comparing the three loyalty metrics across the two time periods were statistically significant, showing that patients are reporting higher levels of loyalty toward hospitals in 2013 compared to 2011. This increase in patient loyalty, while small, is still real.

Greater gains in patient loyalty have been seen for Overall Hospital Rating (increase of 2.26) compared to Recommend (increase of 1.09).

Figure 2. Patient Experience Trends
Figure 2. Patient satisfaction with their in-patient experience has increased for US hospitals

Figure 2 contains the comparisons for patient experience ratings for US hospitals across the two time periods. Again, paired T-tests comparing the seven PX metrics across the two time periods were statistically significant, showing that patients are reporting higher levels of satisfaction with their in-patient experience in 2013 compared to 2011.

The biggest increases in satisfaction were seen in “Given information about recovery,” “Staff explained meds” and “Responsive.” The smallest increases in satisfaction were seen for “Doctor communication” and “Pain well controlled.”


Hospital reimbursements are based, in part, on their patient satisfaction ratings. Consequently, hospital executives are focusing their efforts at improving the patient experience.

Comparing HCAHPS patient survey results from 2011 to 2013, it appears that hospitals have improved how they deliver patient care. Patient loyalty and PX metrics show significant improvements from 2011 to 2013.

Originally Posted at: Are U.S. Hospitals Delivering a Better Patient Experience? by bobehayes

May 25, 17: #AnalyticsClub #Newsletter (Events, Tips, News & more..)


Big Data knows everything  Source


More WEB events? Click Here

[ AnalyticsWeek BYTES]

>> Your Agile Data Warehousing Architect: Excel by v1shal

>> The future of marketing automation depends on data analytics at scale by anum

>> Google loses data as lightning strikes by analyticsweekpick

Wanna write? Click Here


 Big Data Startup Tamr Wins Financial Investment From GE Ventures – CRN Under  Big Data

 Securonix Unveils Big Data Security Analytics Platform With Unprecedented Threat Prediction, Detection and … – Broadway World Under  Big Data Security

 Installing Ubuntu On Windows 10 — On vSphere – Virtualization Review Under  Virtualization

More NEWS ? Click Here


Hadoop Starter Kit


Hadoop learning made easy and fun. Learn HDFS, MapReduce and introduction to Pig and Hive with FREE cluster access…. more


Machine Learning With Random Forests And Decision Trees: A Visual Guide For Beginners


If you are looking for a book to help you understand how the machine learning algorithms “Random Forest” and “Decision Trees” work behind the scenes, then this is a good book for you. Those two algorithms are commonly u… more


Save yourself from zombie apocalypse from unscalable models
One living and breathing zombie in today’s analytical models is the pulsating absence of error bars. Not every model is scalable or holds ground with increasing data. Error bars that is tagged to almost every models should be duly calibrated. As business models rake in more data the error bars keep it sensible and in check. If error bars are not accounted for, we will make our models susceptible to failure leading us to halloween that we never wants to see.


Q:What is: collaborative filtering, n-grams, cosine distance?
A: Collaborative filtering:
– Technique used by some recommender systems
– Filtering for information or patterns using techniques involving collaboration of multiple agents: viewpoints, data sources.
1. A user expresses his/her preferences by rating items (movies, CDs.)
2. The system matches this user’s ratings against other users’ and finds people with most similar tastes
3. With similar users, the system recommends items that the similar users have rated highly but not yet being rated by this user

– Contiguous sequence of n items from a given sequence of text or speech
– ‘Andrew is a talented data scientist”
– Bi-gram: ‘Andrew is”, ‘is a”, ‘a talented”.
– Tri-grams: ‘Andrew is a”, ‘is a talented”, ‘a talented data”.
– An n-gram model models sequences using statistical properties of n-grams; see: Shannon Game
– More concisely, n-gram model: P(Xi|Xi?(n?1)…Xi?1): Markov model
– N-gram model: each word depends only on the n?1 last words

– when facing infrequent n-grams
– solution: smooth the probability distributions by assigning non-zero probabilities to unseen words or n-grams
– Methods: Good-Turing, Backoff, Kneser-Kney smoothing

Cosine distance:
– How similar are two documents?
– Perfect similarity/agreement: 1
– No agreement : 0 (orthogonality)
– Measures the orientation, not magnitude

Given two vectors A and B representing word frequencies:



@AnalyticsWeek: Big Data Health Informatics for the 21st Century: Gil Alterovitz

 @AnalyticsWeek: Big Data Health Informatics for the 21st Century: Gil Alterovitz

Subscribe to  Youtube


Torture the data, and it will confess to anything. – Ronald Coase


#FutureOfData Podcast: Conversation With Sean Naismith, Enova Decisions

 #FutureOfData Podcast: Conversation With Sean Naismith, Enova Decisions


iTunes  GooglePlay


The data volumes are exploding, more data has been created in the past two years than in the entire previous history of the human race.

Sourced from: Analytics.CLUB #WEB Newsletter

RSPB Conservation Efforts Take Flight Thanks To Data Analytics


Big data may be helping to change the way we interact with the world around us, but how much can it do to help the wildlife that shares our planet?

With hundreds of species to track across the UK, ornithological charity the RSPB accrues huge amounts of data every year as it tries to ensure its efforts help as many birds as possible.

And in order to ensure they stay on top of this mountain of data, the charity has teamed up with analytics specialists SAS to develop and create more in-depth research and conservation efforts which should benefit birds around the country.

image: http://www.techweekeurope.co.uk/wp-content/uploads/2015/05/rspb-logo.jpg

rspb logoFlying high

“We need to make sense of a variety of large and complex data sets. For example, tracking the movements of kittiwakes and gannets as they forage at sea produces millions of data points,” said Dr. Will Peach, head of research delivery at RSPB.

“Conservation informed by statistical evidence is always more likely to succeed than that based solely on guesswork or anecdote. SAS allows us to explore the data to provide the evidence needed to confidently implement our initiatives.”

So far, the RSPB has implemented SAS’ advanced analytics solutions to combine datasets on yellowhammer and skylark nesting success with pesticide use and agriculture cropping patterns to judge the consequences for the birds.

RSPB also turned to SAS to explore how albatross forage across the Southern Ocean.

With large-scale commercial longline fishing killing tens of thousands of albatrosses a year, the goal was to cut down on the death rate and protect the 17 albatross species currently at risk.

The society took data from tags worn by the birds, merging it with external data sets like sea-surface temperatures and the location of fishing grounds.

“Scientific research is extremely fast-moving and there are now huge volumes of data to analyse,” said Andy Cutler, director of strategy at SAS UK & Ireland.

“SAS is able to provide a means of managing all the data and then apply cutting-edge analytical techniques that deliver valuable insights almost immediately. For example, through analysing previously non-informative data, RSPB is now able to intervene and correct the breeding problems faced by various bird species during treacherous migration journeys.”
Read more at http://www.techweekeurope.co.uk/data-storage/business-intelligence/rspb-conservation-sas-data-analytics-167988#Dzdo3ud6Ej3vt6ZC.99

RSPB Conservation Efforts Take Flight Thanks To Data Analytics
Read more at http://www.techweekeurope.co.uk/data-storage/business-intelligence/rspb-conservation-sas-data-analytics-167988#Dzdo3ud6Ej3vt6ZC.99

Source: RSPB Conservation Efforts Take Flight Thanks To Data Analytics by analyticsweekpick

Betting the Enterprise on Data with Cloud-Based Disaster Recovery and Backups

One of the more pressing consequences of truly transitioning to a data-driven company culture is a renewed esteem for the data—valued as an asset—that gives the enterprise its worth. Unlike other organizational assets, protecting data requires more than mere security measures. It necessitates reliable, test-worthy backup and disaster recovery plans that can automate these vital processes to account for virtually any scenario, especially some of the more immediate ones involving:
  • Ransomware: Ransomware attacks are increasing in incidence and severity. They occur when external entities deploy malware to encrypt organizational data using similar, if not more effective, encryption measures that those same organizations do and only release the data after being paid to do so. “Ransomware was not something that many people worried about a couple years ago,” Unitrends VP of Product Marketing Dave LeClair acknowledged. “Now it’s something that almost every company that I’ve talked to has been hit. The numbers are getting truly staggering how frequently ransomware attacks are hitting IT, encrypting their data, and demanding payments to unencrypt it from these criminal organizations.”
  • Downtime: External threats are not the only factors that engender IT downtime. Conventional maintenance and updating measures for various systems also result in situations in which organizations cannot access or leverage their data. In essential time-sensitive applications, cloud-based disaster recovery and backup solutions ensure business continuity.
  •  Contemporary IT Environments: Today’s IT environments are much more heterogeneous than they once were. It is not uncommon for organizations to utilize existing legacy systems alongside cloud-based applications and those involving virtualization. Cloud disaster recovery and data backup platforms preserve connected continuity in a singular manner to reduce costs and increase the efficiency of backup systems.
  • Acts of Nature: The increasing reliance on technology is still susceptible to unforseen acts based on weather conditions, natural disasters, and even man-made ones—in which case cloud options for recovery and backups are the most desirable because they store valued data offsite.

Additionally, when one considers that the primary benefits of the cloud are its low cost storage—at scale—and ubiquity of access regardless of location or time, cloud disaster recovery and backup solutions are a logical extension of enterprise infrastructure. “The new technologies, because of the ability of doing things in the cloud, kind of democratizes it so that anybody can afford to have a DR environment, particularly for their critical applications,” LeClair remarked.

Recovery and Backup Basics
There are a multitude of ways that organizations can leverage cloud recovery and data backup options to readily restore production capabilities in the event of system failure:

  • Replication: Replication is the means by which data is copied elsewhere—in this case, to the cloud for storage. Data can also be replicated to other forms of storage (i.e. disk or tape) and be transmitted to a cloud service provider that way.
  • Archives/Checkpoints: Archives or checkpoints are states of data at particular points in time for a data set which are preserved within a system. Therefore, organizations can always revert their system data to an archive to restore it to a time before some sort of failure occurred. According to LeClair, this capability is an integral way of mitigating the effects of ransomware: “You can simply rollback the clock, to the point before you got encrypted, and you can restore your system so you’re good to go”.
  • Instant Recovery Solutions: These solutions not only restore systems to a point in time prior to events of failure, but even facilitate workload management based on the backup appliance itself. This capability is critical in instances in which on-premise systems are still down. In such an event, the appliance’s compute power and storage replace those of the primary solution, which “allows you to spin off that workload in less than five minutes so you can get back up and running,” Le Clair said.
  • Incremental Forevers: This recovery and backup technique is particularly useful because it involves a full backup of a particular data set or application, and subsequently only backs up changes to that initial backup. Such utility is pivotal to massive quantities of big data.

Cloud Replication
There are many crucial considerations when leveraging the cloud as a means of recovery and data backup. Foremost of these is the replication process of copying data from on premises to the cloud. “It absolutely is an issue, particularly if you have terabytes of data,” LeClair mentioned. “If you’re a decent sized enterprise and you have 50 or 100 terabytes of data that you need to move from your production environment to the cloud, that can take weeks.” Smaller cloud providers such as Unitrends can issue storage to organizations via disk, which is then overnighted and uploaded to the cloud so that, on an ongoing basis, organizations only need to replicate the changes of their data.

Machine Transformation
Another consideration pertains to actually utilizing that data in the cloud due to networking concerns. “Networking in cloud generally works very differently than what happens on premise,” LeClair observed. Most large public cloud providers (such as Amazon Web Services) have networking constraints regarding interconnections that require significant IT involvement to configure. However, competitive disaster recovery and backup vendors have dedicated substantial resources to automating various facets of recovery, including all of the machine transformation (transmogrification) required to provision a production environment in the cloud.

Merely replicating data into the cloud is just the first step. The larger concern for actually utilizing it there in cases of emergency requires provisioning the network, which certain cloud platforms can do automatically so that, “You have a DR environment without having to actually dedicate any compute resources yet,” LeClair said. “You basically have your data that’s replicated into Amazon, and you have all the configuration data necessary to spin off that data if you need to. It’s a very cost-effective way to keep yourself protected.”
Recovery Insurance
The automation capabilities of cloud data recovery and back-up solutions also include testing, which is a vital prerequisite for actually ensuring that such systems function properly on demand. Traditionally, organizations tested their recovery environments sparingly, if at all. “There’s now technology that essentially automates your DR environment, so you don’t have to pull up human resources and time into it,” LeClair said. In many instances, those automation capabilities hinge upon the cloud, which has had a considerable impact on the capabilities for disaster recovery and backup. The overarching effect is that it renders data recovery and backup more consistent, cheaper, and easier to facilitate in an increasingly complicated and preeminent IT world.

Source: Betting the Enterprise on Data with Cloud-Based Disaster Recovery and Backups by jelaniharper

May 18, 17: #AnalyticsClub #Newsletter (Events, Tips, News & more..)


Productivity  Source


More WEB events? Click Here

[ AnalyticsWeek BYTES]

>> Solving the Healthcare Crisis with Mobile Big Data Analytics by jelaniharper

>> November 7, 2016 Health and Biotech analytics news roundup by pstein

>> Big Data Advances in Customer Experience Management by bobehayes

Wanna write? Click Here


 Data science is creating a tidal wave of opportunity for women to get into executive leadership – Recode Under  Data Science

 Net Politics China’s Big Data Push Runs Into Orwell and Red Tape … – Council on Foreign Relations (blog) Under  Big Data

 7 Questions On How To Use AI Technology — Without A Data Scientist – MediaPost Communications Under  Data Scientist

More NEWS ? Click Here


Lean Analytics Workshop – Alistair Croll and Ben Yoskovitz


Use data to build a better startup faster in partnership with Geckoboard… more


The Future of the Professions: How Technology Will Transform the Work of Human Experts


This book predicts the decline of today’s professions and describes the people and systems that will replace them. In an Internet society, according to Richard Susskind and Daniel Susskind, we will neither need nor want … more


Data Analytics Success Starts with Empowerment
Being Data Driven is not as much of a tech challenge as it is an adoption challenge. Adoption has it’s root in cultural DNA of any organization. Great data driven organizations rungs the data driven culture into the corporate DNA. A culture of connection, interactions, sharing and collaboration is what it takes to be data driven. Its about being empowered more than its about being educated.


Q:Do we always need the intercept term in a regression model?
A: * It guarantees that the residuals have a zero mean
* It guarantees the least squares slopes estimates are unbiased
* the regression line floats up and down, by adjusting the constant, to a point where the mean of the residuals is zero



Reimagining the role of data in government

 Reimagining the role of data in government

Subscribe to  Youtube


If you can’t explain it simply, you don’t understand it well enough. – Albert Einstein


#BigData @AnalyticsWeek #FutureOfData #Podcast with Joe DeCosmo, @Enova

 #BigData @AnalyticsWeek #FutureOfData #Podcast with Joe DeCosmo, @Enova


iTunes  GooglePlay


Data production will be 44 times greater in 2020 than it was in 2009.

Sourced from: Analytics.CLUB #WEB Newsletter

SAS enlarges its palette for big data analysis

SAS offers new tools for training, as well as for banking and network security.

SAS Institute did big data decades before big data was the buzz, and now the company is expanding on the ways large-scale computerized analysis can help organizations.

As part of its annual SAS Global Forum, being held in Dallas this week, the company has released new software customized for banking and cybersecurity, for training more people to understand SAS analytics, and for helping non-data scientists do predictive analysis with visual tools.

Founded in 1976, SAS was one of the first companies to offer analytics software for businesses. A private company that generated US$3 billion in revenue in 2014, SAS has devoted considerable research and development funds to enhance its core Statistical Analysis System (SAS) platform over the years. The new releases are the latest fruits of these labors.

With the aim of getting more people trained in the SAS ways, the company has posted its training software, SAS University Edition, on the Amazon Web Services Marketplace. Using AWS eliminates the work of setting up the software on a personal computer, and first-time users of AWS can use the 12-month free tier program, to train on the software at no cost.

SAS launched the University Edition a year ago, and it has since been downloaded over 245,000 times, according to the company.

With the release, SAS is taking aim at one of the chief problems organizations face today when it comes to data analysis, that of finding qualified talent. By 2018, the U.S. alone will face a shortage of anywhere from 140,000 to 190,000 people with analytical expertise, The McKinsey Global Institute consultancy has estimated.

Predictive analytics is becoming necessary even in fields where it hasn’t been heavily used heretofore. One example is information technology security. Security managers for large organizations are growing increasingly frustrated at learning of breaches only after they happen. SAS is betting that applying predictive and behavioral analytics to operational IT data, such as server logs, can help identify and deter break-ins and other malicious activity, as they unfold.

Last week, SAS announced that it’s building a new software package, called SAS Cybersecurity, which will process large of amounts of real-time data from network operations. The software, which will be generally available by the end of the year, will build a model of routine activity, which it then can use to identify and flag suspicious behavior.

SAS is also customizing its software for the banking industry. A new package, called SAS Model Risk Management, provides a detailed model of a how a bank operates so that the bank can better understand its financial risks, as well as convey these risks to regulators.

SAS also plans to broaden its user base by making its software more appealing beyond computer statisticians and data scientists. To this end, the company has paired its data exploration software, called SAS Visual Analytics, with its software for developing predictive models, called SAS Visual Statistics. The pairing can allow non-data scientists, such as line of business analysts and risk managers, to predict future trends based on current data.

The combined products can also be tied in with SAS In-Memory Analytics, software designed to allow large amounts of data to be held entirely in the server’s memory, speeding analysis. It can also work with data on Hadoop clusters, relational database systems or SAS servers.

QVC, the TV and online retailer, has already paired the two products. At its Italian operations, QVC streamlined its supply chain operations by allowing its sales staff to spot buying trends more easily, and spend less time building reports, according to SAS.

The combined package of SAS Visual Analytics and SAS Visual Statistics will be available in May.

Originally posted via “SAS enlarges its palette for big data analysis”

Source: SAS enlarges its palette for big data analysis

Big data: are we making a big mistake?

Big data is a vague term for a massive phenomenon that has rapidly become an obsession with entrepreneurs, scientists, governments and the media.

High quality global journalism requires investment. Please share this article with others using the link below, do not cut & paste the article. See our Ts&Cs and Copyright Policy for more detail.

Five years ago, a team of researchers from Google announced a remarkable achievement in one of the world’s top scientific journals, Nature. Without needing the results of a single medical check-up, they were nevertheless able to track the spread of influenza across the US. What’s more, they could do it more quickly than the Centers for Disease Control and Prevention (CDC). Google’s tracking had only a day’s delay, compared with the week or more it took for the CDC to assemble a picture based on reports from doctors’ surgeries. Google was faster because it was tracking the outbreak by finding a correlation between what people searched for online and whether they had flu symptoms.

Not only was “Google Flu Trends” quick, accurate and cheap, it was theory-free. Google’s engineers didn’t bother to develop a hypothesis about what search terms – “flu symptoms” or “pharmacies near me” – might be correlated with the spread of the disease itself. The Google team just took their top 50 million search terms and let the algorithms do the work.

FirstFT is our new essential daily email briefing of the best stories from across the web

The success of Google Flu Trends became emblematic of the hot new trend in business, technology and science: “Big Data”. What, excited journalists asked, can science learn from Google?

As with so many buzzwords, “big data” is a vague term, often thrown around by people with something to sell. Some emphasise the sheer scale of the data sets that now exist – the Large Hadron Collider’s computers, for example, store 15 petabytes a year of data, equivalent to about 15,000 years’ worth of your favourite music.

But the “big data” that interests many companies is what we might call “found data”, the digital exhaust of web searches, credit card payments and mobiles pinging the nearest phone mast. Google Flu Trends was built on found data and it’s this sort of data that ­interests me here. Such data sets can be even bigger than the LHC data – Facebook’s is – but just as noteworthy is the fact that they are cheap to collect relative to their size, they are a messy collage of datapoints collected for disparate purposes and they can be updated in real time. As our communication, leisure and commerce have moved to the internet and the internet has moved into our phones, our cars and even our glasses, life can be recorded and quantified in a way that would have been hard to imagine just a decade ago.

Cheerleaders for big data have made four exciting claims, each one reflected in the success of Google Flu Trends: that data analysis produces uncannily accurate results; that every single data point can be captured, making old statistical sampling techniques obsolete; that it is passé to fret about what causes what, because statistical correlation tells us what we need to know; and that scientific or statistical models aren’t needed because, to quote “The End of Theory”, a provocative essay published in Wired in 2008, “with enough data, the numbers speak for themselves”.

Unfortunately, these four articles of faith are at best optimistic oversimplifications. At worst, according to David Spiegelhalter, Winton Professor of the Public Understanding of Risk at Cambridge university, they can be “complete bollocks. Absolute nonsense.”

Found data underpin the new internet economy as companies such as Google, Facebook and Amazon seek new ways to understand our lives through our data exhaust. Since Edward Snowden’s leaks about the scale and scope of US electronic surveillance it has become apparent that security services are just as fascinated with what they might learn from our data exhaust, too.

Consultants urge the data-naive to wise up to the potential of big data. A recent report from the McKinsey Global Institute reckoned that the US healthcare system could save $300bn a year – $1,000 per American – through better integration and analysis of the data produced by everything from clinical trials to health insurance transactions to smart running shoes.

But while big data promise much to scientists, entrepreneurs and governments, they are doomed to disappoint us if we ignore some very familiar statistical lessons.

“There are a lot of small data problems that occur in big data,” says Spiegelhalter. “They don’t disappear because you’ve got lots of the stuff. They get worse.”

. . .

Four years after the original Nature paper was published, Nature News had sad tidings to convey: the latest flu outbreak had claimed an unexpected victim: Google Flu Trends. After reliably providing a swift and accurate account of flu outbreaks for several winters, the theory-free, data-rich model had lost its nose for where flu was going. Google’s model pointed to a severe outbreak but when the slow-and-steady data from the CDC arrived, they showed that Google’s estimates of the spread of flu-like illnesses were overstated by almost a factor of two.

The problem was that Google did not know – could not begin to know – what linked the search terms with the spread of flu. Google’s engineers weren’t trying to figure out what caused what. They were merely finding statistical patterns in the data. They cared about ­correlation rather than causation. This is common in big data analysis. Figuring out what causes what is hard (impossible, some say). Figuring out what is correlated with what is much cheaper and easier. That is why, according to Viktor Mayer-Schönberger and Kenneth Cukier’s book, Big Data, “causality won’t be discarded, but it is being knocked off its pedestal as the primary fountain of meaning”.

But a theory-free analysis of mere correlations is inevitably fragile. If you have no idea what is behind a correlation, you have no idea what might cause that correlation to break down. One explanation of the Flu Trends failure is that the news was full of scary stories about flu in December 2012 and that these stories provoked internet searches by people who were healthy. Another possible explanation is that Google’s own search algorithm moved the goalposts when it began automatically suggesting diagnoses when people entered medical symptoms.

Google Flu Trends will bounce back, recalibrated with fresh data – and rightly so. There are many reasons to be excited about the broader opportunities offered to us by the ease with which we can gather and analyse vast data sets. But unless we learn the lessons of this episode, we will find ourselves repeating it.

Statisticians have spent the past 200 years figuring out what traps lie in wait when we try to understand the world through data. The data are bigger, faster and cheaper these days – but we must not pretend that the traps have all been made safe. They have not.

In 1936, the Republican Alfred Landon stood for election against President Franklin Delano Roosevelt. The respected magazine, The Literary Digest, shouldered the responsibility of forecasting the result. It conducted a postal opinion poll of astonishing ambition, with the aim of reaching 10 million people, a quarter of the electorate. The deluge of mailed-in replies can hardly be imagined but the Digest seemed to be relishing the scale of the task. In late August it reported, “Next week, the first answers from these ten million will begin the incoming tide of marked ballots, to be triple-checked, verified, five-times cross-classified and totalled.”

After tabulating an astonishing 2.4 million returns as they flowed in over two months, The Literary Digest announced its conclusions: Landon would win by a convincing 55 per cent to 41 per cent, with a few voters favouring a third candidate.

The election delivered a very different result: Roosevelt crushed Landon by 61 per cent to 37 per cent. To add to The Literary Digest’s agony, a far smaller survey conducted by the opinion poll pioneer George Gallup came much closer to the final vote, forecasting a comfortable victory for Roosevelt. Mr Gallup understood something that The Literary Digest did not. When it comes to data, size isn’t everything.

Opinion polls are based on samples of the voting population at large. This means that opinion pollsters need to deal with two issues: sample error and sample bias.

Sample error reflects the risk that, purely by chance, a randomly chosen sample of opinions does not reflect the true views of the population. The “margin of error” reported in opinion polls reflects this risk and the larger the sample, the smaller the margin of error. A thousand interviews is a large enough sample for many purposes and Mr Gallup is reported to have conducted 3,000 interviews.

But if 3,000 interviews were good, why weren’t 2.4 million far better? The answer is that sampling error has a far more dangerous friend: sampling bias. Sampling error is when a randomly chosen sample doesn’t reflect the underlying population purely by chance; sampling bias is when the sample isn’t randomly chosen at all. George Gallup took pains to find an unbiased sample because he knew that was far more important than finding a big one.

The Literary Digest, in its quest for a bigger data set, fumbled the question of a biased sample. It mailed out forms to people on a list it had compiled from automobile registrations and telephone directories – a sample that, at least in 1936, was disproportionately prosperous. To compound the problem, Landon supporters turned out to be more likely to mail back their answers. The combination of those two biases was enough to doom The Literary Digest’s poll. For each person George Gallup’s pollsters interviewed, The Literary Digest received 800 responses. All that gave them for their pains was a very precise estimate of the wrong answer.

The big data craze threatens to be The Literary Digest all over again. Because found data sets are so messy, it can be hard to figure out what biases lurk inside them – and because they are so large, some analysts seem to have decided the sampling problem isn’t worth worrying about. It is.

Professor Viktor Mayer-Schönberger of Oxford’s Internet Institute, co-author of Big Data, told me that his favoured definition of a big data set is one where “N = All” – where we no longer have to sample, but we have the entire background population. Returning officers do not estimate an election result with a representative tally: they count the votes – all the votes. And when “N = All” there is indeed no issue of sampling bias because the sample includes everyone.

But is “N = All” really a good description of most of the found data sets we are considering? Probably not. “I would challenge the notion that one could ever have all the data,” says Patrick Wolfe, a computer scientist and professor of statistics at University College London.

An example is Twitter. It is in principle possible to record and analyse every message on Twitter and use it to draw conclusions about the public mood. (In practice, most researchers use a subset of that vast “fire hose” of data.) But while we can look at all the tweets, Twitter users are not representative of the population as a whole. (According to the Pew Research Internet Project, in 2013, US-based Twitter users were disproportionately young, urban or suburban, and black.)

There must always be a question about who and what is missing, especially with a messy pile of found data. Kaiser Fung, a data analyst and author of Numbersense, warns against simply assuming we have everything that matters. “N = All is often an assumption rather than a fact about the data,” he says.

Consider Boston’s Street Bump smartphone app, which uses a phone’s accelerometer to detect potholes without the need for city workers to patrol the streets. As citizens of Boston download the app and drive around, their phones automatically notify City Hall of the need to repair the road surface. Solving the technical challenges involved has produced, rather beautifully, an informative data exhaust that addresses a problem in a way that would have been inconceivable a few years ago. The City of Boston proudly proclaims that the “data provides the City with real-time in­formation it uses to fix problems and plan long term investments.”

Yet what Street Bump really produces, left to its own devices, is a map of potholes that systematically favours young, affluent areas where more people own smartphones. Street Bump offers us “N = All” in the sense that every bump from every enabled phone can be recorded. That is not the same thing as recording every pothole. As Microsoft researcher Kate Crawford points out, found data contain systematic biases and it takes careful thought to spot and correct for those biases. Big data sets can seem comprehensive but the “N = All” is often a seductive illusion.

Who cares about causation or sampling bias, though, when there is money to be made? Corporations around the world must be salivating as they contemplate the uncanny success of the US discount department store Target, as famously reported by Charles Duhigg in The New York Times in 2012. Duhigg explained that Target has collected so much data on its customers, and is so skilled at analysing that data, that its insight into consumers can seem like magic.

Duhigg’s killer anecdote was of the man who stormed into a Target near Minneapolis and complained to the manager that the company was sending coupons for baby clothes and maternity wear to his teenage daughter. The manager apologised profusely and later called to apologise again – only to be told that the teenager was indeed pregnant. Her father hadn’t realised. Target, after analysing her purchases of unscented wipes and magnesium supplements, had.

Statistical sorcery? There is a more mundane explanation.

“There’s a huge false positive issue,” says Kaiser Fung, who has spent years developing similar approaches for retailers and advertisers. What Fung means is that we didn’t get to hear the countless stories about all the women who received coupons for babywear but who weren’t pregnant.

Hearing the anecdote, it’s easy to assume that Target’s algorithms are infallible – that everybody receiving coupons for onesies and wet wipes is pregnant. This is vanishingly unlikely. Indeed, it could be that pregnant women receive such offers merely because everybody on Target’s mailing list receives such offers. We should not buy the idea that Target employs mind-readers before considering how many misses attend each hit.

In Charles Duhigg’s account, Target mixes in random offers, such as coupons for wine glasses, because pregnant customers would feel spooked if they realised how intimately the company’s computers understood them.

Fung has another explanation: Target mixes up its offers not because it would be weird to send an all-baby coupon-book to a woman who was pregnant but because the company knows that many of those coupon books will be sent to women who aren’t pregnant after all.

None of this suggests that such data analysis is worthless: it may be highly profitable. Even a modest increase in the accuracy of targeted special offers would be a prize worth winning. But profitability should not be conflated with omniscience.

In 2005, John Ioannidis, an epidemiologist, published a research paper with the self-explanatory title, “Why Most Published Research Findings Are False”. The paper became famous as a provocative diagnosis of a serious issue. One of the key ideas behind Ioannidis’s work is what statisticians call the “multiple-comparisons problem”.

It is routine, when examining a pattern in data, to ask whether such a pattern might have emerged by chance. If it is unlikely that the observed pattern could have emerged at random, we call that pattern “statistically significant”.

The multiple-comparisons problem arises when a researcher looks at many possible patterns. Consider a randomised trial in which vitamins are given to some primary schoolchildren and placebos are given to others. Do the vitamins work? That all depends on what we mean by “work”. The researchers could look at the children’s height, weight, prevalence of tooth decay, classroom behaviour, test scores, even (after waiting) prison record or earnings at the age of 25. Then there are combinations to check: do the vitamins have an effect on the poorer kids, the richer kids, the boys, the girls? Test enough different correlations and fluke results will drown out the real discoveries.

There are various ways to deal with this but the problem is more serious in large data sets, because there are vastly more possible comparisons than there are data points to compare. Without careful analysis, the ratio of genuine patterns to spurious patterns – of signal to noise – quickly tends to zero.

Worse still, one of the antidotes to the ­multiple-comparisons problem is transparency, allowing other researchers to figure out how many hypotheses were tested and how many contrary results are languishing in desk drawers because they just didn’t seem interesting enough to publish. Yet found data sets are rarely transparent. Amazon and Google, Facebook and Twitter, Target and Tesco – these companies aren’t about to share their data with you or anyone else.

New, large, cheap data sets and powerful ­analytical tools will pay dividends – nobody doubts that. And there are a few cases in which analysis of very large data sets has worked miracles. David Spiegelhalter of Cambridge points to Google Translate, which operates by statistically analysing hundreds of millions of documents that have been translated by humans and looking for patterns it can copy. This is an example of what computer scientists call “machine learning”, and it can deliver astonishing results with no preprogrammed grammatical rules. Google Translate is as close to theory-free, data-driven algorithmic black box as we have – and it is, says Spiegelhalter, “an amazing achievement”. That achievement is built on the clever processing of enormous data sets.

But big data do not solve the problem that has obsessed statisticians and scientists for centuries: the problem of insight, of inferring what is going on, and figuring out how we might intervene to change a system for the better.

“We have a new resource here,” says Professor David Hand of Imperial College London. “But nobody wants ‘data’. What they want are the answers.”

To use big data to produce such answers will require large strides in statistical methods.

“It’s the wild west right now,” says Patrick Wolfe of UCL. “People who are clever and driven will twist and turn and use every tool to get sense out of these data sets, and that’s cool. But we’re flying a little bit blind at the moment.”

Statisticians are scrambling to develop new methods to seize the opportunity of big data. Such new methods are essential but they will work by building on the old statistical lessons, not by ignoring them.

Recall big data’s four articles of faith. Uncanny accuracy is easy to overrate if we simply ignore false positives, as with Target’s pregnancy predictor. The claim that causation has been “knocked off its pedestal” is fine if we are making predictions in a stable environment but not if the world is changing (as with Flu Trends) or if we ourselves hope to change it. The promise that “N = All”, and therefore that sampling bias does not matter, is simply not true in most cases that count. As for the idea that “with enough data, the numbers speak for themselves” – that seems hopelessly naive in data sets where spurious patterns vastly outnumber genuine discoveries.

“Big data” has arrived, but big insights have not. The challenge now is to solve new problems and gain new answers – without making the same old statistical mistakes on a grander scale than ever.


Tim Harford’s latest book is ‘The Undercover Economist Strikes Back’. To comment on this article please post below, or email magazineletters@ft.com

Originally posted via “Big data: are we making a big mistake?”


Originally Posted at: Big data: are we making a big mistake? by anum