Are U.S. Hospitals Delivering a Better Patient Experience?

The Centers for Medicare & Medicaid Services (CMS) uses patient feedback about hospital care as part of its reimbursement plan for acute care hospitals. Under the Hospital Value-Based Purchasing Program, CMS makes value-based incentive payments to acute care hospitals, based either on how well the hospitals perform on certain quality measures or on how much their performance on those measures improves relative to a baseline period. This program began in FY 2013 for discharges occurring on or after October 1, 2012.

A standard patient satisfaction survey, known as HCAHPS (Hospital Consumer Assessment of Healthcare Providers and Systems), is the source of the patient feedback for the reimbursement program. I have previously used these publicly available HCAHPS data to understand the state of affairs for US hospitals in 2011 (see Big Data Provides Big Insights for U.S. Hospitals). Now that the Value-Based Purchasing program has been in effect since October 2012, I wanted to revisit the HCAHPS patient survey data to determine if US hospitals have improved. First, let’s review the HCAHPS survey.

The HCAHPS Survey

The survey asks a random sample of recently discharged patients about important aspects of their hospital experience. The data set includes patient survey results for US hospitals on ten measures of patients’ perspectives of care. The 10 measures are:

  1. Nurses communicate well
  2. Doctors communicate well
  3. Received help as soon as they wanted (Responsive)
  4. Pain well controlled
  5. Staff explain medicines before giving to patients
  6. Room and bathroom are clean
  7. Area around room is quiet at night
  8. Given information about what to do during recovery at home
  9. Overall hospital rating
  10. Recommend hospital to friends and family (Recommend)

For questions 1 through 7, respondents were asked to provide frequency ratings about the occurrence of each attribute (Never, Sometimes, Usually, Always). For question 8, respondents were provided a Y/N option. For question 9, respondents were asked to provide an overall rating of the hospital on a scale from 0 (Worst hospital possible) to 10 (Best hospital possible). For question 10, respondents were asked to provide their likelihood of recommending the hospital (Definitely no, Probably no, Probably yes, Definitely yes).

The Metrics

The HCAHPS data sets report metrics for each hospital as percentages of responses. Because the data sets have already been partly aggregated (e.g., percentages are reported for groups of response options), I was unable to calculate average scores for each hospital. Instead, I used top box scores as the metric of patient experience. I have found that top box scores are highly correlated with average scores across groups of companies, suggesting that these two metrics tell us the same thing about the companies (in our case, hospitals).

Top box scores for the respective rating scales are defined as: 1) Percent of patients who reported “Always”; 2) Percent of patients who reported “Yes”; 3) Percent of patients who gave a rating of 9 or 10; 4) Percent of patients who said “Definitely yes.”

Top box scores provide an easy-to-understand way of communicating the survey results for different types of scales. Even though there are four different rating scales for the survey questions, using a top box reporting method puts all metrics on the same numeric scale. Across all 10 metrics, hospital scores can range from 0 (bad) to 100 (good).
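As a rough illustration of the top box idea, here is a minimal Python sketch. The scale labels ("frequency", "yes_no", and so on) and the sample responses are made up for the example; they are not HCAHPS field names or real survey data.

```python
# Top-box response(s) for each of the four rating scales described above.
# The dictionary keys are illustrative labels, not official HCAHPS names.
TOP_BOX = {
    "frequency": {"Always"},
    "yes_no": {"Yes"},
    "overall_0_10": {9, 10},
    "recommend": {"Definitely yes"},
}


def top_box_score(responses, scale):
    """Return the percentage (0-100) of responses that fall in the top box."""
    if not responses:
        raise ValueError("at least one response is required")
    top = TOP_BOX[scale]
    hits = sum(1 for r in responses if r in top)
    return 100.0 * hits / len(responses)


# Hypothetical responses for one hospital.
nurse_comm = ["Always", "Usually", "Always", "Sometimes", "Always"]
overall = [10, 9, 7, 8, 10, 6, 9, 3]

print(top_box_score(nurse_comm, "frequency"))   # 60.0
print(top_box_score(overall, "overall_0_10"))   # 50.0
```

Whatever the underlying scale, the result is a percentage, which is what lets all 10 measures be compared on the same 0 to 100 range.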

I examined patient experience (PX) ratings of acute care hospitals across two time periods. The two time periods were 2011 (Q3 2010 through Q2 2011) and 2013 (Q4 2012 through Q3 2013). The data from the 2013 time frame are the latest publicly available patient survey data as of this writing.

Results: Patient Satisfaction with US Hospitals Increasing

Figure 1. Patient advocacy has increased for US hospitals

Figure 1 contains the comparisons for patient advocacy ratings for US hospitals across the two time periods. Paired t-tests comparing the three loyalty metrics across the two time periods were statistically significant, showing that patients are reporting higher levels of loyalty toward hospitals in 2013 compared to 2011. This increase in patient loyalty, while small, is still real.

Greater gains in patient loyalty have been seen for Overall Hospital Rating (increase of 2.26) compared to Recommend (increase of 1.09).
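For readers unfamiliar with the paired t-test used in these comparisons, the sketch below shows the calculation with the standard library only. The five pairs of hospital scores are invented for illustration and are not taken from the HCAHPS data; the function assumes the differences are not all identical (otherwise the standard deviation is zero).

```python
import math
from statistics import mean, stdev


def paired_t(before, after):
    """Return (mean difference, t statistic, degrees of freedom) for paired samples."""
    if len(before) != len(after) or len(before) < 2:
        raise ValueError("need two equal-length samples with at least two pairs")
    # Each hospital is compared with itself, so we work on per-hospital differences.
    diffs = [a - b for a, b in zip(after, before)]
    n = len(diffs)
    standard_error = stdev(diffs) / math.sqrt(n)
    return mean(diffs), mean(diffs) / standard_error, n - 1


# Hypothetical top box scores for the same five hospitals in each period.
scores_2011 = [68.0, 72.0, 65.0, 70.0, 74.0]
scores_2013 = [70.0, 73.0, 68.0, 72.0, 76.0]

mean_diff, t_stat, df = paired_t(scores_2011, scores_2013)
print(f"mean difference = {mean_diff:.2f}, t = {t_stat:.2f}, df = {df}")
```

With thousands of hospitals in the real data, even a small mean difference can produce a large t statistic, which is why a modest gain can still be statistically significant.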

Figure 2. Patient satisfaction with their in-patient experience has increased for US hospitals

Figure 2 contains the comparisons for patient experience ratings for US hospitals across the two time periods. Again, paired t-tests comparing the seven PX metrics across the two time periods were statistically significant, showing that patients are reporting higher levels of satisfaction with their in-patient experience in 2013 compared to 2011.

The biggest increases in satisfaction were seen in “Given information about recovery,” “Staff explained meds” and “Responsive.” The smallest increases in satisfaction were seen for “Doctor communication” and “Pain well controlled.”


Hospital reimbursements are based, in part, on patient satisfaction ratings. Consequently, hospital executives are focusing their efforts on improving the patient experience.

Comparing HCAHPS patient survey results from 2011 to 2013, it appears that hospitals have improved how they deliver patient care. Patient loyalty and PX metrics show significant improvements from 2011 to 2013.

Originally Posted at: Are U.S. Hospitals Delivering a Better Patient Experience? by bobehayes

RSPB Conservation Efforts Take Flight Thanks To Data Analytics


Big data may be helping to change the way we interact with the world around us, but how much can it do to help the wildlife that shares our planet?

With hundreds of species to track across the UK, ornithological charity the RSPB accrues huge amounts of data every year as it tries to ensure its efforts help as many birds as possible.

And in order to ensure they stay on top of this mountain of data, the charity has teamed up with analytics specialists SAS to develop and create more in-depth research and conservation efforts which should benefit birds around the country.


Flying high

“We need to make sense of a variety of large and complex data sets. For example, tracking the movements of kittiwakes and gannets as they forage at sea produces millions of data points,” said Dr. Will Peach, head of research delivery at RSPB.

“Conservation informed by statistical evidence is always more likely to succeed than that based solely on guesswork or anecdote. SAS allows us to explore the data to provide the evidence needed to confidently implement our initiatives.”

So far, the RSPB has implemented SAS’ advanced analytics solutions to combine datasets on yellowhammer and skylark nesting success with pesticide use and agriculture cropping patterns to judge the consequences for the birds.

RSPB also turned to SAS to explore how albatross forage across the Southern Ocean.

With large-scale commercial longline fishing killing tens of thousands of albatrosses a year, the goal was to cut down on the death rate and protect the 17 albatross species currently at risk.

The society took data from tags worn by the birds, merging it with external data sets like sea-surface temperatures and the location of fishing grounds.

“Scientific research is extremely fast-moving and there are now huge volumes of data to analyse,” said Andy Cutler, director of strategy at SAS UK & Ireland.

“SAS is able to provide a means of managing all the data and then apply cutting-edge analytical techniques that deliver valuable insights almost immediately. For example, through analysing previously non-informative data, RSPB is now able to intervene and correct the breeding problems faced by various bird species during treacherous migration journeys.”

Source: RSPB Conservation Efforts Take Flight Thanks To Data Analytics by analyticsweekpick

Betting the Enterprise on Data with Cloud-Based Disaster Recovery and Backups

One of the more pressing consequences of truly transitioning to a data-driven company culture is a renewed esteem for the data—valued as an asset—that gives the enterprise its worth. Unlike other organizational assets, protecting data requires more than mere security measures. It necessitates reliable, test-worthy backup and disaster recovery plans that can automate these vital processes to account for virtually any scenario, especially some of the more immediate ones involving:
  • Ransomware: Ransomware attacks are increasing in incidence and severity. They occur when external entities deploy malware to encrypt organizational data, using encryption measures similar to, if not more effective than, those the organizations themselves use, and release the data only after being paid to do so. “Ransomware was not something that many people worried about a couple years ago,” Unitrends VP of Product Marketing Dave LeClair acknowledged. “Now it’s something that almost every company that I’ve talked to has been hit. The numbers are getting truly staggering how frequently ransomware attacks are hitting IT, encrypting their data, and demanding payments to unencrypt it from these criminal organizations.”
  • Downtime: External threats are not the only factors that engender IT downtime. Conventional maintenance and updating measures for various systems also result in situations in which organizations cannot access or leverage their data. In essential time-sensitive applications, cloud-based disaster recovery and backup solutions ensure business continuity.
  • Contemporary IT Environments: Today’s IT environments are much more heterogeneous than they once were. It is not uncommon for organizations to utilize existing legacy systems alongside cloud-based applications and those involving virtualization. Cloud disaster recovery and data backup platforms preserve connected continuity in a singular manner to reduce costs and increase the efficiency of backup systems.
  • Acts of Nature: Technology-reliant operations remain susceptible to unforeseen events such as severe weather, natural disasters, and even man-made ones. In those cases, cloud options for recovery and backups are the most desirable because they store valued data offsite.

Additionally, when one considers that the primary benefits of the cloud are its low cost storage—at scale—and ubiquity of access regardless of location or time, cloud disaster recovery and backup solutions are a logical extension of enterprise infrastructure. “The new technologies, because of the ability of doing things in the cloud, kind of democratizes it so that anybody can afford to have a DR environment, particularly for their critical applications,” LeClair remarked.

Recovery and Backup Basics
There are a multitude of ways that organizations can leverage cloud recovery and data backup options to readily restore production capabilities in the event of system failure:

  • Replication: Replication is the means by which data is copied elsewhere—in this case, to the cloud for storage. Data can also be replicated to other forms of storage (i.e. disk or tape) and be transmitted to a cloud service provider that way.
  • Archives/Checkpoints: Archives or checkpoints are states of data at particular points in time for a data set which are preserved within a system. Therefore, organizations can always revert their system data to an archive to restore it to a time before some sort of failure occurred. According to LeClair, this capability is an integral way of mitigating the effects of ransomware: “You can simply rollback the clock, to the point before you got encrypted, and you can restore your system so you’re good to go”.
  • Instant Recovery Solutions: These solutions not only restore systems to a point in time prior to events of failure, but even facilitate workload management based on the backup appliance itself. This capability is critical in instances in which on-premise systems are still down. In such an event, the appliance’s compute power and storage replace those of the primary solution, which “allows you to spin off that workload in less than five minutes so you can get back up and running,” LeClair said.
  • Incremental Forevers: This recovery and backup technique is particularly useful because it involves a full backup of a particular data set or application, and subsequently only backs up changes to that initial backup. Such utility is pivotal to massive quantities of big data.
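
The "incremental forever" idea in the list above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions (files held in memory as a dictionary, change detection by SHA-256 digest); it is not how Unitrends or any particular vendor implements it, and the function and variable names are hypothetical.

```python
import hashlib


def digest(data: bytes) -> str:
    """Content fingerprint used to detect changes between backups."""
    return hashlib.sha256(data).hexdigest()


def incremental_backup(current, index):
    """Return (changed files, deleted paths) since the last run and update the index.

    current: dict mapping path -> file contents (bytes)
    index:   dict mapping path -> digest recorded by the previous backup
    """
    changed = {}
    for path, data in current.items():
        fingerprint = digest(data)
        if index.get(path) != fingerprint:   # new or modified file
            changed[path] = data
            index[path] = fingerprint
    deleted = [path for path in index if path not in current]
    for path in deleted:
        del index[path]
    return changed, deleted


index = {}
files = {"a.txt": b"alpha", "b.txt": b"beta"}

# First run: the index is empty, so every file is copied (the one full backup).
full, _ = incremental_backup(files, index)

# Later runs copy only what changed since the previous backup.
files["b.txt"] = b"beta v2"
files["c.txt"] = b"gamma"
delta, removed = incremental_backup(files, index)

print(sorted(full))    # ['a.txt', 'b.txt']
print(sorted(delta))   # ['b.txt', 'c.txt']
```

The unchanged file a.txt is never copied again, which is where the savings come from when the data set is large.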

Cloud Replication
There are many crucial considerations when leveraging the cloud as a means of recovery and data backup. Foremost of these is the replication process of copying data from on premises to the cloud. “It absolutely is an issue, particularly if you have terabytes of data,” LeClair mentioned. “If you’re a decent sized enterprise and you have 50 or 100 terabytes of data that you need to move from your production environment to the cloud, that can take weeks.” Smaller cloud providers such as Unitrends can issue storage to organizations via disk, which is then overnighted and uploaded to the cloud so that, on an ongoing basis, organizations only need to replicate the changes of their data.

Machine Transformation
Another consideration pertains to actually utilizing that data in the cloud due to networking concerns. “Networking in cloud generally works very differently than what happens on premise,” LeClair observed. Most large public cloud providers (such as Amazon Web Services) have networking constraints regarding interconnections that require significant IT involvement to configure. However, competitive disaster recovery and backup vendors have dedicated substantial resources to automating various facets of recovery, including all of the machine transformation (transmogrification) required to provision a production environment in the cloud.

Merely replicating data into the cloud is just the first step. Actually using that data in an emergency requires provisioning the network, which certain cloud platforms can do automatically so that, “You have a DR environment without having to actually dedicate any compute resources yet,” LeClair said. “You basically have your data that’s replicated into Amazon, and you have all the configuration data necessary to spin off that data if you need to. It’s a very cost-effective way to keep yourself protected.”

Recovery Insurance
The automation capabilities of cloud data recovery and back-up solutions also include testing, which is a vital prerequisite for actually ensuring that such systems function properly on demand. Traditionally, organizations tested their recovery environments sparingly, if at all. “There’s now technology that essentially automates your DR environment, so you don’t have to pull up human resources and time into it,” LeClair said. In many instances, those automation capabilities hinge upon the cloud, which has had a considerable impact on the capabilities for disaster recovery and backup. The overarching effect is that it renders data recovery and backup more consistent, cheaper, and easier to facilitate in an increasingly complicated and preeminent IT world.

Source: Betting the Enterprise on Data with Cloud-Based Disaster Recovery and Backups by jelaniharper

SAS enlarges its palette for big data analysis

SAS offers new tools for training, as well as for banking and network security.

SAS Institute did big data decades before big data was the buzz, and now the company is expanding on the ways large-scale computerized analysis can help organizations.

As part of its annual SAS Global Forum, being held in Dallas this week, the company has released new software customized for banking and cybersecurity, for training more people to understand SAS analytics, and for helping non-data scientists do predictive analysis with visual tools.

Founded in 1976, SAS was one of the first companies to offer analytics software for businesses. A private company that generated US$3 billion in revenue in 2014, SAS has devoted considerable research and development funds to enhance its core Statistical Analysis System (SAS) platform over the years. The new releases are the latest fruits of these labors.

With the aim of getting more people trained in the SAS ways, the company has posted its training software, SAS University Edition, on the Amazon Web Services Marketplace. Using AWS eliminates the work of setting up the software on a personal computer, and first-time users of AWS can use the 12-month free tier program to train on the software at no cost.

SAS launched the University Edition a year ago, and it has since been downloaded over 245,000 times, according to the company.

With the release, SAS is taking aim at one of the chief problems organizations face today when it comes to data analysis, that of finding qualified talent. By 2018, the U.S. alone will face a shortage of anywhere from 140,000 to 190,000 people with analytical expertise, The McKinsey Global Institute consultancy has estimated.

Predictive analytics is becoming necessary even in fields where it hasn’t been heavily used heretofore. One example is information technology security. Security managers for large organizations are growing increasingly frustrated at learning of breaches only after they happen. SAS is betting that applying predictive and behavioral analytics to operational IT data, such as server logs, can help identify and deter break-ins and other malicious activity, as they unfold.

Last week, SAS announced that it’s building a new software package, called SAS Cybersecurity, which will process large amounts of real-time data from network operations. The software, which will be generally available by the end of the year, will build a model of routine activity, which it then can use to identify and flag suspicious behavior.
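
The article does not describe how SAS Cybersecurity models routine activity, so the following is only a generic sketch of the idea: learn a baseline from past observations, then flag values that fall far outside it. The z-score rule, the threshold of 3 and the login counts are all assumptions made for the example.

```python
from statistics import mean, stdev


def build_baseline(history):
    """Summarize routine activity as (mean, standard deviation)."""
    return mean(history), stdev(history)


def is_suspicious(value, baseline, threshold=3.0):
    """Flag a value more than `threshold` standard deviations from the baseline mean."""
    mu, sigma = baseline
    if sigma == 0:
        return value != mu
    return abs(value - mu) / sigma > threshold


# Hypothetical count of logins per hour on a server during normal operation.
logins_per_hour = [12, 15, 11, 14, 13, 12, 16, 14]
baseline = build_baseline(logins_per_hour)

print(is_suspicious(15, baseline))   # False
print(is_suspicious(90, baseline))   # True
```

A production system would use far richer behavioral features than a single count, but the shape is the same: model what is normal first, then score new events against that model.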

SAS is also customizing its software for the banking industry. A new package, called SAS Model Risk Management, provides a detailed model of how a bank operates so that the bank can better understand its financial risks, as well as convey these risks to regulators.

SAS also plans to broaden its user base by making its software more appealing beyond computer statisticians and data scientists. To this end, the company has paired its data exploration software, called SAS Visual Analytics, with its software for developing predictive models, called SAS Visual Statistics. The pairing can allow non-data scientists, such as line of business analysts and risk managers, to predict future trends based on current data.

The combined products can also be tied in with SAS In-Memory Analytics, software designed to allow large amounts of data to be held entirely in the server’s memory, speeding analysis. It can also work with data on Hadoop clusters, relational database systems or SAS servers.

QVC, the TV and online retailer, has already paired the two products. At its Italian operations, QVC streamlined its supply chain operations by allowing its sales staff to spot buying trends more easily, and spend less time building reports, according to SAS.

The combined package of SAS Visual Analytics and SAS Visual Statistics will be available in May.

Originally posted via “SAS enlarges its palette for big data analysis”

Source: SAS enlarges its palette for big data analysis

Big data: are we making a big mistake?

Big data is a vague term for a massive phenomenon that has rapidly become an obsession with entrepreneurs, scientists, governments and the media.

Five years ago, a team of researchers from Google announced a remarkable achievement in one of the world’s top scientific journals, Nature. Without needing the results of a single medical check-up, they were nevertheless able to track the spread of influenza across the US. What’s more, they could do it more quickly than the Centers for Disease Control and Prevention (CDC). Google’s tracking had only a day’s delay, compared with the week or more it took for the CDC to assemble a picture based on reports from doctors’ surgeries. Google was faster because it was tracking the outbreak by finding a correlation between what people searched for online and whether they had flu symptoms.

Not only was “Google Flu Trends” quick, accurate and cheap, it was theory-free. Google’s engineers didn’t bother to develop a hypothesis about what search terms – “flu symptoms” or “pharmacies near me” – might be correlated with the spread of the disease itself. The Google team just took their top 50 million search terms and let the algorithms do the work.

The success of Google Flu Trends became emblematic of the hot new trend in business, technology and science: “Big Data”. What, excited journalists asked, can science learn from Google?

As with so many buzzwords, “big data” is a vague term, often thrown around by people with something to sell. Some emphasise the sheer scale of the data sets that now exist – the Large Hadron Collider’s computers, for example, store 15 petabytes a year of data, equivalent to about 15,000 years’ worth of your favourite music.

But the “big data” that interests many companies is what we might call “found data”, the digital exhaust of web searches, credit card payments and mobiles pinging the nearest phone mast. Google Flu Trends was built on found data and it’s this sort of data that interests me here. Such data sets can be even bigger than the LHC data – Facebook’s is – but just as noteworthy is the fact that they are cheap to collect relative to their size, they are a messy collage of datapoints collected for disparate purposes and they can be updated in real time. As our communication, leisure and commerce have moved to the internet and the internet has moved into our phones, our cars and even our glasses, life can be recorded and quantified in a way that would have been hard to imagine just a decade ago.

Cheerleaders for big data have made four exciting claims, each one reflected in the success of Google Flu Trends: that data analysis produces uncannily accurate results; that every single data point can be captured, making old statistical sampling techniques obsolete; that it is passé to fret about what causes what, because statistical correlation tells us what we need to know; and that scientific or statistical models aren’t needed because, to quote “The End of Theory”, a provocative essay published in Wired in 2008, “with enough data, the numbers speak for themselves”.

Unfortunately, these four articles of faith are at best optimistic oversimplifications. At worst, according to David Spiegelhalter, Winton Professor of the Public Understanding of Risk at Cambridge University, they can be “complete bollocks. Absolute nonsense.”

Found data underpin the new internet economy as companies such as Google, Facebook and Amazon seek new ways to understand our lives through our data exhaust. Since Edward Snowden’s leaks about the scale and scope of US electronic surveillance it has become apparent that security services are just as fascinated with what they might learn from our data exhaust, too.

Consultants urge the data-naive to wise up to the potential of big data. A recent report from the McKinsey Global Institute reckoned that the US healthcare system could save $300bn a year – $1,000 per American – through better integration and analysis of the data produced by everything from clinical trials to health insurance transactions to smart running shoes.

But while big data promise much to scientists, entrepreneurs and governments, they are doomed to disappoint us if we ignore some very familiar statistical lessons.

“There are a lot of small data problems that occur in big data,” says Spiegelhalter. “They don’t disappear because you’ve got lots of the stuff. They get worse.”

. . .

Four years after the original Nature paper was published, Nature News had sad tidings to convey: the latest flu outbreak had claimed an unexpected victim: Google Flu Trends. After reliably providing a swift and accurate account of flu outbreaks for several winters, the theory-free, data-rich model had lost its nose for where flu was going. Google’s model pointed to a severe outbreak but when the slow-and-steady data from the CDC arrived, they showed that Google’s estimates of the spread of flu-like illnesses were overstated by almost a factor of two.

The problem was that Google did not know – could not begin to know – what linked the search terms with the spread of flu. Google’s engineers weren’t trying to figure out what caused what. They were merely finding statistical patterns in the data. They cared about correlation rather than causation. This is common in big data analysis. Figuring out what causes what is hard (impossible, some say). Figuring out what is correlated with what is much cheaper and easier. That is why, according to Viktor Mayer-Schönberger and Kenneth Cukier’s book, Big Data, “causality won’t be discarded, but it is being knocked off its pedestal as the primary fountain of meaning”.

But a theory-free analysis of mere correlations is inevitably fragile. If you have no idea what is behind a correlation, you have no idea what might cause that correlation to break down. One explanation of the Flu Trends failure is that the news was full of scary stories about flu in December 2012 and that these stories provoked internet searches by people who were healthy. Another possible explanation is that Google’s own search algorithm moved the goalposts when it began automatically suggesting diagnoses when people entered medical symptoms.

Google Flu Trends will bounce back, recalibrated with fresh data – and rightly so. There are many reasons to be excited about the broader opportunities offered to us by the ease with which we can gather and analyse vast data sets. But unless we learn the lessons of this episode, we will find ourselves repeating it.

Statisticians have spent the past 200 years figuring out what traps lie in wait when we try to understand the world through data. The data are bigger, faster and cheaper these days – but we must not pretend that the traps have all been made safe. They have not.

In 1936, the Republican Alfred Landon stood for election against President Franklin Delano Roosevelt. The respected magazine, The Literary Digest, shouldered the responsibility of forecasting the result. It conducted a postal opinion poll of astonishing ambition, with the aim of reaching 10 million people, a quarter of the electorate. The deluge of mailed-in replies can hardly be imagined but the Digest seemed to be relishing the scale of the task. In late August it reported, “Next week, the first answers from these ten million will begin the incoming tide of marked ballots, to be triple-checked, verified, five-times cross-classified and totalled.”

After tabulating an astonishing 2.4 million returns as they flowed in over two months, The Literary Digest announced its conclusions: Landon would win by a convincing 55 per cent to 41 per cent, with a few voters favouring a third candidate.

The election delivered a very different result: Roosevelt crushed Landon by 61 per cent to 37 per cent. To add to The Literary Digest’s agony, a far smaller survey conducted by the opinion poll pioneer George Gallup came much closer to the final vote, forecasting a comfortable victory for Roosevelt. Mr Gallup understood something that The Literary Digest did not. When it comes to data, size isn’t everything.

Opinion polls are based on samples of the voting population at large. This means that opinion pollsters need to deal with two issues: sampling error and sampling bias.

Sampling error reflects the risk that, purely by chance, a randomly chosen sample of opinions does not reflect the true views of the population. The “margin of error” reported in opinion polls reflects this risk and the larger the sample, the smaller the margin of error. A thousand interviews is a large enough sample for many purposes and Mr Gallup is reported to have conducted 3,000 interviews.
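
To put rough numbers on that relationship, the sketch below uses the standard textbook approximation for the margin of error of a proportion from a simple random sample, z * sqrt(p(1 - p) / n), with the worst case p = 0.5 and z = 1.96 for a 95 per cent confidence level. Those choices are assumptions for the illustration; the pollsters of 1936 did not necessarily compute it this way.

```python
import math


def margin_of_error(n, p=0.5, z=1.96):
    """Approximate margin of error for a proportion from a simple random sample of size n."""
    return z * math.sqrt(p * (1 - p) / n)


for n in (1_000, 3_000, 2_400_000):
    print(f"n = {n:,}: about +/- {100 * margin_of_error(n):.2f} percentage points")
```

A thousand interviews already give roughly three points either way, and 3,000 give under two. At 2.4 million the sampling error all but vanishes, but the formula only holds when the sample is random.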

But if 3,000 interviews were good, why weren’t 2.4 million far better? The answer is that sampling error has a far more dangerous friend: sampling bias. Sampling error is when a randomly chosen sample doesn’t reflect the underlying population purely by chance; sampling bias is when the sample isn’t randomly chosen at all. George Gallup took pains to find an unbiased sample because he knew that was far more important than finding a big one.

The Literary Digest, in its quest for a bigger data set, fumbled the question of a biased sample. It mailed out forms to people on a list it had compiled from automobile registrations and telephone directories – a sample that, at least in 1936, was disproportionately prosperous. To compound the problem, Landon supporters turned out to be more likely to mail back their answers. The combination of those two biases was enough to doom The Literary Digest’s poll. For each person George Gallup’s pollsters interviewed, The Literary Digest received 800 responses. All that gave them for their pains was a very precise estimate of the wrong answer.
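
A small simulation makes the point concrete. Everything in it is invented for illustration: the population size, the share of "prosperous" voters, their voting preferences and the chance of landing on the mailing list are assumptions chosen so that true support for Roosevelt is about 61 per cent. It models only the biased mailing list, not the separate tendency of Landon supporters to reply more often.

```python
import random

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical electorate: 30% prosperous voters who lean Landon,
# 70% others who lean Roosevelt. Each entry is (prosperous, votes_roosevelt).
population = []
for _ in range(200_000):
    prosperous = rng.random() < 0.30
    p_roosevelt = 0.35 if prosperous else 0.72
    population.append((prosperous, rng.random() < p_roosevelt))

truth = sum(vote for _, vote in population) / len(population)

# Small but random sample, in the spirit of Gallup's 3,000 interviews.
small_sample = rng.sample(population, 3_000)
est_random = sum(vote for _, vote in small_sample) / len(small_sample)

# Huge but biased sample: prosperous voters are six times as likely
# to be on the mailing list (car and telephone owners).
biased_sample = [
    vote for prosperous, vote in population
    if rng.random() < (0.60 if prosperous else 0.10)
]
est_biased = sum(biased_sample) / len(biased_sample)

print(f"true Roosevelt share:        {truth:.3f}")
print(f"random sample of 3,000:      {est_random:.3f}")
print(f"biased sample of {len(biased_sample):,}:     {est_biased:.3f}")
```

Under these assumptions the small random sample lands close to the truth, while the biased sample, more than ten times larger, confidently puts Roosevelt below 50 per cent.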

The big data craze threatens to be The Literary Digest all over again. Because found data sets are so messy, it can be hard to figure out what biases lurk inside them – and because they are so large, some analysts seem to have decided the sampling problem isn’t worth worrying about. It is.

Professor Viktor Mayer-Schönberger of Oxford’s Internet Institute, co-author of Big Data, told me that his favoured definition of a big data set is one where “N = All” – where we no longer have to sample, but we have the entire background population. Returning officers do not estimate an election result with a representative tally: they count the votes – all the votes. And when “N = All” there is indeed no issue of sampling bias because the sample includes everyone.

But is “N = All” really a good description of most of the found data sets we are considering? Probably not. “I would challenge the notion that one could ever have all the data,” says Patrick Wolfe, a computer scientist and professor of statistics at University College London.

An example is Twitter. It is in principle possible to record and analyse every message on Twitter and use it to draw conclusions about the public mood. (In practice, most researchers use a subset of that vast “fire hose” of data.) But while we can look at all the tweets, Twitter users are not representative of the population as a whole. (According to the Pew Research Internet Project, in 2013, US-based Twitter users were disproportionately young, urban or suburban, and black.)

There must always be a question about who and what is missing, especially with a messy pile of found data. Kaiser Fung, a data analyst and author of Numbersense, warns against simply assuming we have everything that matters. “N = All is often an assumption rather than a fact about the data,” he says.

Consider Boston’s Street Bump smartphone app, which uses a phone’s accelerometer to detect potholes without the need for city workers to patrol the streets. As citizens of Boston download the app and drive around, their phones automatically notify City Hall of the need to repair the road surface. Solving the technical challenges involved has produced, rather beautifully, an informative data exhaust that addresses a problem in a way that would have been inconceivable a few years ago. The City of Boston proudly proclaims that the “data provides the City with real-time information it uses to fix problems and plan long term investments.”

Yet what Street Bump really produces, left to its own devices, is a map of potholes that systematically favours young, affluent areas where more people own smartphones. Street Bump offers us “N = All” in the sense that every bump from every enabled phone can be recorded. That is not the same thing as recording every pothole. As Microsoft researcher Kate Crawford points out, found data contain systematic biases and it takes careful thought to spot and correct for those biases. Big data sets can seem comprehensive but the “N = All” is often a seductive illusion.
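A toy calculation makes the Street Bump bias concrete. The sketch below is purely illustrative: the two neighbourhoods, their pothole counts and their app-usage rates are invented, and it assumes that reports are simply proportional to potholes multiplied by the share of drivers running the app.

```python
# Hypothetical neighbourhoods: (true number of potholes, share of drivers running the app).
# Both areas have the same number of potholes; only app usage differs.
AREAS = {
    "affluent": (100, 0.60),
    "poorer": (100, 0.15),
}


def reported_share(areas):
    """Share of all reported bumps that comes from each area.

    Simplifying assumption: the number of reports is proportional to
    potholes multiplied by the share of drivers running the app.
    """
    reports = {
        name: potholes * app_share
        for name, (potholes, app_share) in areas.items()
    }
    total = sum(reports.values())
    return {name: count / total for name, count in reports.items()}


shares = reported_share(AREAS)
for name, share in shares.items():
    print(f"{name}: {share:.0%} of reports, 50% of actual potholes")
```

With these made-up inputs, the affluent area generates 80 per cent of the reports despite having only half of the potholes. Every bump from every enabled phone is recorded, yet the map is still skewed.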

Who cares about causation or sampling bias, though, when there is money to be made? Corporations around the world must be salivating as they contemplate the uncanny success of the US discount department store Target, as famously reported by Charles Duhigg in The New York Times in 2012. Duhigg explained that Target has collected so much data on its customers, and is so skilled at analysing that data, that its insight into consumers can seem like magic.

Duhigg’s killer anecdote was of the man who stormed into a Target near Minneapolis and complained to the manager that the company was sending coupons for baby clothes and maternity wear to his teenage daughter. The manager apologised profusely and later called to apologise again – only to be told that the teenager was indeed pregnant. Her father hadn’t realised. Target, after analysing her purchases of unscented wipes and magnesium supplements, had.

Statistical sorcery? There is a more mundane explanation.

“There’s a huge false positive issue,” says Kaiser Fung, who has spent years developing similar approaches for retailers and advertisers. What Fung means is that we didn’t get to hear the countless stories about all the women who received coupons for babywear but who weren’t pregnant.

Hearing the anecdote, it’s easy to assume that Target’s algorithms are infallible – that everybody receiving coupons for onesies and wet wipes is pregnant. This is vanishingly unlikely. Indeed, it could be that pregnant women receive such offers merely because everybody on Target’s mailing list receives such offers. We should not buy the idea that Target employs mind-readers before considering how many misses attend each hit.
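How many misses might attend each hit? A rough sketch using Bayes' rule gives a feel for it. The three inputs below (the share of customers who are pregnant, how often the model flags a pregnant customer, and how often it wrongly flags someone who is not) are invented for illustration and are not Target's figures.

```python
def share_of_hits(base_rate, sensitivity, false_positive_rate):
    """Fraction of flagged customers who really are pregnant (Bayes' rule)."""
    true_hits = base_rate * sensitivity
    false_hits = (1 - base_rate) * false_positive_rate
    return true_hits / (true_hits + false_hits)


# All three values are hypothetical, chosen only to illustrate the arithmetic.
precision = share_of_hits(
    base_rate=0.02,            # 2% of customers are pregnant
    sensitivity=0.80,          # the model flags 80% of them
    false_positive_rate=0.05,  # and wrongly flags 5% of everyone else
)
print(f"{precision:.0%} of flagged customers are pregnant")
```

Under these assumptions only about a quarter of the flagged customers are pregnant, even though the model looks accurate on paper. The rarer the condition, the more the false positives dominate.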

In Charles Duhigg’s account, Target mixes in random offers, such as coupons for wine glasses, because pregnant customers would feel spooked if they realised how intimately the company’s computers understood them.

Fung has another explanation: Target mixes up its offers not because it would be weird to send an all-baby coupon-book to a woman who was pregnant but because the company knows that many of those coupon books will be sent to women who aren’t pregnant after all.

None of this suggests that such data analysis is worthless: it may be highly profitable. Even a modest increase in the accuracy of targeted special offers would be a prize worth winning. But profitability should not be conflated with omniscience.

In 2005, John Ioannidis, an epidemiologist, published a research paper with the self-explanatory title, “Why Most Published Research Findings Are False”. The paper became famous as a provocative diagnosis of a serious issue. One of the key ideas behind Ioannidis’s work is what statisticians call the “multiple-comparisons problem”.

It is routine, when examining a pattern in data, to ask whether such a pattern might have emerged by chance. If it is unlikely that the observed pattern could have emerged at random, we call that pattern “statistically significant”.

The multiple-comparisons problem arises when a researcher looks at many possible patterns. Consider a randomised trial in which vitamins are given to some primary schoolchildren and placebos are given to others. Do the vitamins work? That all depends on what we mean by “work”. The researchers could look at the children’s height, weight, prevalence of tooth decay, classroom behaviour, test scores, even (after waiting) prison record or earnings at the age of 25. Then there are combinations to check: do the vitamins have an effect on the poorer kids, the richer kids, the boys, the girls? Test enough different correlations and fluke results will drown out the real discoveries.

There are various ways to deal with this but the problem is more serious in large data sets, because there are vastly more possible comparisons than there are data points to compare. Without careful analysis, the ratio of genuine patterns to spurious patterns – of signal to noise – quickly tends to zero.
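The arithmetic behind the multiple-comparisons problem can be sketched in a few lines. The example assumes the tests are independent and that there is no real effect at all, so every "significant" result is a fluke; the 5 per cent threshold is the conventional choice, and the function names are hypothetical.

```python
import random


def chance_of_fluke(num_tests, alpha=0.05):
    """Probability that at least one of num_tests independent null tests looks significant."""
    return 1 - (1 - alpha) ** num_tests


def count_flukes(num_tests, alpha=0.05, seed=0):
    """Simulate num_tests comparisons in which no real effect exists.

    Under the null hypothesis a p-value is uniform on [0, 1], so each
    comparison comes out 'significant' with probability alpha.
    """
    rng = random.Random(seed)
    return sum(rng.random() < alpha for _ in range(num_tests))


for m in (1, 10, 100):
    print(f"{m:>3} tests: {chance_of_fluke(m):.1%} chance of at least one fluke")

print(count_flukes(1000), "spurious 'discoveries' in 1,000 comparisons of pure noise")
```

The chance of at least one fluke is 5 per cent for a single test, roughly 40 per cent for ten tests and above 99 per cent for a hundred. The simulation makes the same point from the other direction: pure noise reliably yields a steady supply of "significant" patterns.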

Worse still, one of the antidotes to the multiple-comparisons problem is transparency, allowing other researchers to figure out how many hypotheses were tested and how many contrary results are languishing in desk drawers because they just didn’t seem interesting enough to publish. Yet found data sets are rarely transparent. Amazon and Google, Facebook and Twitter, Target and Tesco – these companies aren’t about to share their data with you or anyone else.

New, large, cheap data sets and powerful analytical tools will pay dividends – nobody doubts that. And there are a few cases in which analysis of very large data sets has worked miracles. David Spiegelhalter of Cambridge points to Google Translate, which operates by statistically analysing hundreds of millions of documents that have been translated by humans and looking for patterns it can copy. This is an example of what computer scientists call “machine learning”, and it can deliver astonishing results with no preprogrammed grammatical rules. Google Translate is as close to a theory-free, data-driven algorithmic black box as we have – and it is, says Spiegelhalter, “an amazing achievement”. That achievement is built on the clever processing of enormous data sets.

But big data do not solve the problem that has obsessed statisticians and scientists for centuries: the problem of insight, of inferring what is going on, and figuring out how we might intervene to change a system for the better.

“We have a new resource here,” says Professor David Hand of Imperial College London. “But nobody wants ‘data’. What they want are the answers.”

To use big data to produce such answers will require large strides in statistical methods.

“It’s the wild west right now,” says Patrick Wolfe of UCL. “People who are clever and driven will twist and turn and use every tool to get sense out of these data sets, and that’s cool. But we’re flying a little bit blind at the moment.”

Statisticians are scrambling to develop new methods to seize the opportunity of big data. Such new methods are essential but they will work by building on the old statistical lessons, not by ignoring them.

Recall big data’s four articles of faith. Uncanny accuracy is easy to overrate if we simply ignore false positives, as with Target’s pregnancy predictor. The claim that causation has been “knocked off its pedestal” is fine if we are making predictions in a stable environment but not if the world is changing (as with Flu Trends) or if we ourselves hope to change it. The promise that “N = All”, and therefore that sampling bias does not matter, is simply not true in most cases that count. As for the idea that “with enough data, the numbers speak for themselves” – that seems hopelessly naive in data sets where spurious patterns vastly outnumber genuine discoveries.

“Big data” has arrived, but big insights have not. The challenge now is to solve new problems and gain new answers – without making the same old statistical mistakes on a grander scale than ever.


Tim Harford’s latest book is ‘The Undercover Economist Strikes Back’.

Originally posted via “Big data: are we making a big mistake?”


Originally Posted at: Big data: are we making a big mistake? by anum

How Google does Rapid Prototyping? Tom Chi’s Perspective [video]

In this TEDEducation video, Tom Chi of the Google Glass team explains how rapid prototyping is done. The video is a good, snappy tutorial for entrepreneurs looking for ways to prototype rapidly. Rapid prototyping not only helps surface product problems early in the product lifecycle, so they can be fixed quickly, but also addresses one of a startup’s key problems: having a prototype for validation.

Let us know if you have thoughts on how to do it effectively.


Free Research Report on the State of Patient Experience in US Hospitals

Download Free Report from TCELab: Improving the Patient Experience

The Centers for Medicare & Medicaid Services (CMS) will be using patient feedback about their care as part of their reimbursement plan for acute care hospitals (see Hospital Value-Based Purchasing (VBP) program). The purpose of the VBP program is to promote better clinical outcomes for patients and improve their experience of care during hospital stays. Not surprisingly, hospitals are focusing on improving the patient experience (PX) to ensure they receive the maximum of their incentive payments.

Free Download of Research Report on the Patient Experience

I spent the past few months conducting research on and writing about the importance of patient experience (PX) in US hospitals. My partners at TCELab have helped me summarize these studies into a single research report, Improving the Patient Experience. As far as I am aware, this series of studies is the first to integrate these disparate US hospital data sources (e.g., Patient Experience, Health Outcomes, Process of Care, and Medicare spending per patient) and apply predictive analytics to identify the reasons behind a loyal patient base.

While this research is really about the entirety of US hospitals, hospitals still need to dig deeper into their own specific patient experience data to understand what they need to do to improve the patient experience. This report is a good starting point for hospitals to learn what they need to do to improve the patient experience and increase patient loyalty. Read the entire press release about the research report, Improving the Patient Experience.

Get the free report from TCELab by clicking the image or link below:

Download Free Report from TCELab: Improving the Patient Experience



Source by bobehayes

How Big Data Analytics Can Help Track Money Laundering

Criminal and terrorist organizations are increasingly relying on international trade to hide the flow of illicit funds across borders. Big data analytics may be the key to tracking these financial flows.

For the past decade, governments around the world have established international anti-money laundering (AML) and counter-terrorist financing programs in an effort to shut down the cross-border flow of funds to criminal and terrorist organizations. Their success has encouraged criminals to move their cash smuggling away from the financial system to the byzantine world of global trade. According to PwC US, big data analytics are becoming essential to tracking these activities.

It’s easy to understand why criminal and terrorist organizations would turn to the global merchandise export trade to hide the movement of their funds. It’s a classic needle in a haystack — an $18.3 trillion business formed of a “web of complexity that involves finance, shipping and insurance interests operating across multiple legal systems, multiple customs procedures, and multiple languages, using a set of traditional practices and procedures that in some instances have changed little for centuries,” PwC says.

Watching the Money Flow

There’s no real way to quantify how much money criminals are invisibly exchanging using this system. PwC notes that the Global Financial Integrity (GFI) research and advocacy organization estimates that 80 percent of illicit financial flows from developing countries are accomplished through trade-based money laundering (TBML), from more than $200 billion in 2002 to more than $600 billion in 2011. GFI believes more than $101 billion was illicitly smuggled into China in 2012 via over-invoicing, which is only one of the common TBML techniques.

“At its core, trade finance is an old-fashioned business,” the report says. “As other industries have adopted more technology- and data-driven infrastructures, trade finance has remained extremely document-intensive and paper-based, moored on a framework of instruments, systems, and practices that have proven their effectiveness and earned global trust over the generations.”

But they are also opaque, PwC says, making it extremely difficult for AML efforts to see what’s going on.

“For example, trade finance’s legacy procedures affect the relationship management aspect of AML, which includes know-your-customer (KYC) procedures and examination of customer documentation prior to transaction approval,” the report says. “In this paper-intensive environment, AML remains a largely manual procedure and thus prone to human error. It remains reliant upon established ‘red flag’ checklists provided by regulators, in which transactions are manually reviewed by analysts, escalated should any concerns be raised, and then subjected to further manual review if wrongdoing is suspected.”

The Need to Share Data

This state of affairs is exacerbated by a number of factors, especially the lack of data sharing between customs, tax and legal authorities and a tendency to rely on AML procedures designed to target cash smuggling and financial system misuse. Instead, PwC says, authorities need to develop targeted TBML responses that focus on data sharing and text and data analytics.

So what exactly does TBML look like? Common TBML techniques include the following:

Under-invoicing. The exporter invoices trade goods at a price below the fair market price. This allows the exporter to effectively transfer value to the importer, as the payment for the trade goods will be lower than the value the importer receives when reselling the goods on the open market.
Over-invoicing. This technique is much the same as the first, except in reverse. The exporter invoices trade goods at a price above the fair market value, allowing the importer to transfer value to the exporter.
Multiple invoicing. With this technique, a money launderer or terrorist financier issues multiple invoices for the same international trade transaction, justifying multiple payments for the same shipment. “Payments can originate from different financial institutions, adding to the complexity of detection, and legitimate explanations can be offered if the scheme is uncovered (e.g., amendment of payment terms, payment of late fees, etc.),” the report explains.
Over- and under-shipment. In some cases, the parties simply overstate or understate the quantities of goods shipped relative to the payments sent or received. PwC calls out an extreme example of this, known as “phantom shipping,” in which no goods are exchanged at all, but shipping and customs documents are processed as normal.
False description of trade goods. With this technique, money launderers misrepresent the quality or type of trade goods. For instance, they might replace an expensive item listed on the invoice and customs documents with an inexpensive item.
Informal money transfer systems (IMTS). These networks have, in many cases, been co-opted by criminals and terrorists. PwC points to Colombia’s Black Market Peso Exchange (BMPE) as a prime example. Established by Colombian businesses trying to get around Colombia’s restrictive currency exchange policies, the BMPE allows users to sell dollars to a broker, who then trades them for pesos to a legitimate Colombian business that needs hard U.S. currency to purchase goods for shipment to South America. It’s not just Colombian drug traffickers repatriating their profits either; PwC notes that similar systems exist around the world, including the hawala/hundi system on the Indian sub-continent and others in Venezuela, Argentina, Brazil and Paraguay.
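The value transfer behind under- and over-invoicing is simple arithmetic, which the sketch below works through. The prices and quantities are invented for illustration, and the function name is hypothetical rather than anything taken from the PwC report.

```python
def value_transferred(invoice_price, fair_market_price, units):
    """Value shifted by mispricing a shipment, in the invoice currency.

    Positive: value moves from exporter to importer (under-invoicing).
    Negative: value moves from importer to exporter (over-invoicing).
    """
    return (fair_market_price - invoice_price) * units


# Hypothetical shipment: 10,000 units with a fair market price of $50 each.
under = value_transferred(invoice_price=30, fair_market_price=50, units=10_000)
over = value_transferred(invoice_price=80, fair_market_price=50, units=10_000)

print(f"Under-invoiced at $30: ${under:,} moves to the importer")
print(f"Over-invoiced at $80: ${-over:,} moves to the exporter")
```

In this example, invoicing at $30 hands the importer $200,000 of value when the goods are resold at the fair price, while invoicing at $80 sends an extra $300,000 to the exporter. In both cases the paperwork describes an ordinary trade.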

What Can Big Data Do?

So how can big data analytics help organizations find these illicit transactions in an $18.3 trillion haystack? Well, for one, the sea of documents generated by this activity — the commercial invoices, bills of lading, insurance certificates, inspection certificates, certificates of origin and more — that make it so difficult to see what’s truly happening may also be the point of vulnerability.

“A global, one-stop solution to TBML is highly unlikely,” PwC says. “The most effective solution would involve the imposition of bank-like compliance requirements on all organizations that trade internationally. But while this would create transparency across transactions, it would also create a massive layer of red tape that would adversely impact the preponderance of traders and related parties who are engaged in legitimate activity. The largely unquantifiable nature of the TBML problem makes it difficult to justify such an intrusive, expensive and vastly complicated solution. Short of global regulation, we have global analytics.”

In other words, automating anti-TBML monitoring — extracting and analyzing in-house and external data, both structured and unstructured — is of critical importance.

PwC believes such a program must properly align across key business areas and incorporate automated processes using a variety of advanced techniques, including:

Text analytics. The capability to extract data from text files in an automated fashion can unlock a massive amount of data that can be used for transaction monitoring.
Web analytics and Web-crawling. These tools can systematically scan the web to review shipment and customs details and compare them against corresponding documentation.
Unit price analysis. This statistic-driven approach uses publicly available data and algorithms to detect if unit prices exceed or fall far below global and regional established thresholds.
Unit weight analysis. This technique involves searching for instances where money launderers are attempting to transfer value by overstating or understating the quantity of goods shipped relative to payments.
Network (relationship) analysis of trade partners and ports. Enterprise analytics software tools can identify hidden relationships in data between trade partners and ports, and between other participants in the trade lifecycle. They can also identify potential shell companies or outlier activity.
International trade and country profiling analysis. An analysis of publicly available data may establish profiles of the types of goods that specific countries import and export, flagging outliers that might indicate TBML activity.
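Of the techniques listed above, unit price analysis is the easiest to sketch. The example below is a minimal illustration, not PwC's method: the reference prices, the 50 percent tolerance band, the shipment records and all names are assumptions made up for the example. A real system would derive its thresholds from public trade statistics.

```python
from dataclasses import dataclass


@dataclass
class Shipment:
    shipment_id: str
    commodity: str
    declared_value: float  # total invoice value in USD
    quantity: float        # number of units shipped


# Hypothetical reference unit prices in USD (assumed, not real market data).
REFERENCE_PRICE = {"steel_coil": 700.0, "cotton_shirts": 4.0}


def flag_unit_price(shipment, tolerance=0.5):
    """Flag a shipment whose unit price strays too far from the reference price.

    Returns a short label, or None when the price is inside the band
    reference * (1 - tolerance) .. reference * (1 + tolerance).
    """
    reference = REFERENCE_PRICE[shipment.commodity]
    unit_price = shipment.declared_value / shipment.quantity
    if unit_price > reference * (1 + tolerance):
        return "possible over-invoicing"
    if unit_price < reference * (1 - tolerance):
        return "possible under-invoicing"
    return None


shipments = [
    Shipment("A1", "steel_coil", 70_000.0, 100),        # $700 per unit
    Shipment("A2", "cotton_shirts", 120_000.0, 10_000),  # $12 per unit
    Shipment("A3", "steel_coil", 20_000.0, 100),        # $200 per unit
]

flags = {s.shipment_id: flag_unit_price(s) for s in shipments}
for shipment_id, flag in flags.items():
    print(shipment_id, flag or "within expected range")
```

Here shipment A1 passes, A2 is flagged as possible over-invoicing and A3 as possible under-invoicing. A flag is only a prompt for review by an analyst; legitimate trades can fall outside any fixed band.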

Thor Olavsrud

Originally posted via “How Big Data Analytics Can Help Track Money Laundering”

Source: How Big Data Analytics Can Help Track Money Laundering by anum