Data Storytelling: What’s Easy and What’s Hard

Putting data on a screen is easy. Making it meaningful is so much harder. Gathering a collection of visualizations and calling it a data story is easy (and inaccurate). Making a data-driven narrative that influences people… hard.

Here are 25 more lessons we’ve learned (the hard way) about what’s easy and what’s hard when it comes to telling data stories:

Easy: Picking a good visualization to answer a data question
Hard: Discovering the core message of your data story that will move your audience to action

Easy: Knowing who your target audience is
Hard: Knowing what motivates your target audience at a personal level by understanding their everyday frustrations and career goals

Easy: Collecting questions your audience wants to answer
Hard: Delivering answers your audience can act on

Easy: Providing flexibility to slice and dice data
Hard: Balancing flexibility with prescriptive guidance to help focus on the most important things

Easy: Labeling visualizations
Hard: Explaining the intent and meaning of visualizations

Easy: Choosing dimensions to show
Hard: Choosing the right metrics to show

Easy: Getting an export of the data you need
Hard: Restructuring data for high-performance analytical queries

Easy: Discovering inconsistencies in your data
Hard: Fixing those inconsistencies

Easy: Designing a data story with a fixed data set
Hard: Designing a data story where the data changes

Easy: Categorical dimensions
Hard: Dates

Easy: Showing data values within expected ranges
Hard: Dealing with null values

Easy: Determining formats for data fields
Hard: Writing a human-readable definition of data fields

Easy: Getting people interested in analytics and visualization
Hard: Getting people to use data regularly in their job

Easy: Picking theme colors
Hard: Using colors judiciously and with meaning

Easy: Setting the context for your story
Hard: Creating intrigue and suspense to move people past the introduction

Easy: Showing selections in a visualization
Hard: Carrying those selections through the duration of the story

Easy: Creating a long, shaggy data story
Hard: Creating a concise, meaningful data story

Easy: Adding more data
Hard: Cutting out unnecessary data

Easy: Serving one audience
Hard: Serving multiple audiences to enable new kinds of discussions

Easy: Helping people find insights
Hard: Explaining what to do about those insights

Easy: Explaining data to experts
Hard: Explaining data to novices

Easy: Building a predictive model
Hard: Convincing people they should trust your predictive model

Easy: Visual mock-ups with stubbed-in data
Hard: Visual mock-ups that support real-world data

Easy: Building a visualization tool
Hard: Building a data storytelling tool

Source by analyticsweek

“What if the data tells you something you don’t like?” Three potential big data pitfalls

Big data is likely to quickly become big business. The ability to isolate the nuggets of insight inside the huge volumes of structured and unstructured data hoarded by most businesses could improve customer service, make processes more efficient and cut costs.

According to analysts Gartner, adoption of big data is still at a very early stage: just eight percent of companies have initiatives up and running, 20 percent are piloting and experimenting, 18 percent are ‘developing a strategy’, 19 percent are ‘knowledge gathering’, while the remainder have no plans or don’t know. But that could change rapidly: the analyst firm is predicting 4.4 million people will be working on such projects within two years, while a recent survey by Harvey Nash found that four out of ten CIOs planned to increase their investments in the next year.

However, because big data uses untested technologies and skills that are in short supply inside most organisations, there are a number of hurdles for organisations seeking to exploit it:

1. Letting politics derail your big data project before it gets moving

Getting a big data initiative up and running might be one of the hardest parts of the project because the tech team and the rest of the business may have different ideas about what the goals should be, warn tech chiefs consulted by ZDNet: a big data project run solely by IT may fail because it’s unconnected to the needs of the business, for example, while a badly articulated request from the marketing department may leave IT confused about what to deliver.

As Rohit Killam, CTO at Masan Group points out: “The real bottleneck is conceptualising a value-driven big data programme with [the] right stakeholders,” while Duncan James, infrastructure manager at Clarion Solicitors notes: “Understanding what the business requires is the hardest part, especially if the business can’t articulate what it wants in the first place.”

In many organisations, whenever you want to do any project there has to be a business case before there can be any budget, says Frank Buytendijk, research vice president at Gartner.

“That is how organisations work and think, which is great for anything established — but for anything innovative that is really hard because the whole point of playing around with the technology is trying to figure out what it does for you. This is not unique to big data, but big data suffers from it as well.”

According to Buytendijk, big data projects don’t have to cost a lot, thanks to the availability of open-source tools. As a result, these projects can be used as a low-risk way to explore an organisation’s big data strategy. “The business case should not be the starting point; the business case should be the outcome, and it’s realising this that creates the right conversation within businesses,” he told ZDNet.

2. The big data skills crisis

According to the Harvey Nash CIO Survey carried out earlier this year, one in four CIOs reported difficulty in finding staff for big data projects. This is compounded by the complex array of skills needed for these projects, which are often outside of the standard skillset offered by the in-house tech team, according to tech chiefs canvassed by ZDNet.

“A shortage of big data skills doesn’t hold back big data projects, but it does have implications for the success factors and execution of the projects. There is certainly growth in demand for this area of skillset,” says Clarion’s Duncan James. Brian Wells, associate VP health technology and academic computing at Penn Medicine, adds that this is an issue in areas related to interpreting results and developing analytical hypotheses.

“Skills has been an issue from the beginning, and this will remain an issue for the foreseeable future,” says Gartner’s Buytendijk. “How do you find people who have a background in econometrics, statistics and mathematics, and who know how to programme in modern environments and have business sense, because big data analytics is all about interpreting context, why something is happening in a certain context. This skillset is really, really hard to find.”

One problem is that big data requires inductive rather than deductive thinking, whereas most IT organisations are good at deductive thinking: inductive thinking — using data to create likely connections — is a little outside their usual way of working.

Another problem is that big data technologies are very programming-intensive: while the typical ratio between software and implementation on a project is one to five, in big data that has leapt to one to 25, because these tools are not very user-friendly and they don’t integrate with other tools, and won’t for a number of years.

Not all tech chiefs agree on this, though: “I think the complexity of big data is way overrated,” maintains John Gracyalny, VP IT at SafeAmerica Credit Union. “We just kicked off a project to build a data warehouse/analytics tool internally. We only have a four-person IT department. I’m providing the ‘vision thing’ and database design, my newest guy is writing the code to handle external data extracts and imports, and my right hand will integrate an off-the-shelf reporting tool.”

3. The looming governance headache

When organisations start dredging through their digital detritus, they risk discovering information they might wish had remained buried. Consequently, they need to have some governance in place before they start delving into the huge piles of customer transactions and other data they’ve been storing.

For example, last year a New York Times story revealed how a retailer could use shopping patterns to spot when a customer was pregnant and offer them money-off vouchers — and how to do it without making them feel they were being watched. Thus organisations must beware of using their own data and other third-party data that together may lead them to discover information about customers that customers might not wish to have known.

As Gartner’s Buytendijk puts it: “If you start to work inductively, you let the data talk: what if the data tells you something you don’t like?”

“Big data answers questions that weren’t even asked, and that can be quite embarrassing — so how do you create a governance situation with a sandbox with big walls where you contain things you don’t want the organisation to know?”

According to Buytendijk, organisations need some kind of governance that shields them from over-using (and oversharing) the fruits of big data: “In lots of countries there have been reputational issues around big data being too clever for its own good. With great power comes great responsibility,” he warns.

Originally posted via “What if the data tells you something you don’t like?” Three potential big data pitfalls.

Source: “What if the data tells you something you don’t like?” Three potential big data pitfalls by analyticsweekpick

Let’s Meet Up at the Nashville Analytics Summit


The Nashville Analytics Summit will be on us before we know it. This special gathering of data and analytics professionals is scheduled for August 20th and 21st, and should be bigger and better than ever. From my first experience with the Summit in 2014, it has consistently been a highlight of my year. My first Summit took place at the Lipscomb Spark Center meeting space with about a hundred attendees. Just a few years later, we’d grown to more than 450 attendees and moved into the Omni Hotel.

Mark it on your calendar. I’ll give you five reasons why it is a can’t-miss event if you work with data:

  1. We’ve invited world-renowned keynote speakers like Stephen Few and Thomas Davenport. You won’t believe who we are planning to bring in this year.
  2. There isn’t a better networking event for analytics professionals in our region. Whether you’re looking for talent or looking for the next step in your career, you’ll meet kindred spirits, data lovers, and innovative businesses. For two years in a row, we have hired Juice interns directly from conversations at the Summit. 
  3. It’s for everyone who works with data. Analyst, Chief Data Officer, or Data Scientist… we’ve got you covered. There are technical workshops and presentations for the hands-on practitioner and case studies and management strategies for the executive. We’re committed to bringing you quality and diverse content.
  4. It’s a “Goldilocks” conference. Some conferences go on for days. Some conferences are a sea of people, or too small to expand your horizons. The Analytics Summit is two days, 500-something people, and conveniently located in the cosy confines of the Omni Hotel. It is easy to meet new people and connect with people you know.
  5. See what’s happening. Nashville has a core of companies committed to building a special and innovative analytics community. We have innovators like Digital Reasoning, Stratasan, and Juice Analytics. We have larger companies making a deep commitment to analytics like Asurion, HCA, and Nissan. The Summit is the best chance to see the state of our thriving analytics community.

Now that you’re convinced you can’t miss out, you may wonder what to do next. First, block out your calendar (August 20 and 21). Next, find a colleague you’d like to go with. Want to be even more involved? We invited dozens of local professionals to speak at the Summit. You can submit a proposal to present.

Finally, if you don’t want your company to miss out on the opportunity to reach our entire analytics community, there are still slots for sponsors.

I hope to see you there.

Originally Posted at: Let’s Meet Up at the Nashville Analytics Summit by analyticsweek

The Future Of Big Data Looks Like Streaming

Big data is big news, but it’s still in its infancy. While most enterprises at least talk about launching Big Data projects, the reality is that very few do in any significant way. In fact, according to new survey data from Dimensional, while 91% of corporate data professionals have considered investment in Big Data, only 5% actually put any investment into a deployment, and only 11% even had a pilot in place.

Real Time Gets Real

ReadWrite: Hadoop has been all about batch processing, but the new world of streaming analytics is all about real time and involves a different stack of technologies.

Langseth: Yes, however I would not entangle the concepts of real-time and streaming. Real-time data is obviously best handled as a stream. But it’s possible to stream historical data as well, just as your DVR can stream Gone with the Wind or last week’s American Idol to your TV.

This distinction is important, as we at Zoomdata believe that analyzing data as a stream adds huge scalability and flexibility benefits, regardless of whether the data is real-time or historical.

RW: So what are the components of this new stack? And how is this new big data stack impacting enterprise plans?

JL: The new stack is in some ways an extension of the old stack, and in some ways really new.

Data has always started its life as a stream. A stream of transactions in a point of sale system. A stream of stocks being bought and sold. A stream of agricultural goods being traded for valuable metals in Mesopotamia.

Traditional ETL processes would batch that data up and kill its stream nature. They did so because the data could not be transported as a stream; it needed to be loaded onto removable disks and tapes to be transported from place to place.

But now it is possible to take streams from their sources, through any enrichment or transformation processes, through analytical systems, and into the data’s “final resting place”, all as a stream. There is no real need to batch up data given today’s modern architectures such as Kafka and Kinesis, modern data stores such as MongoDB, Cassandra, HBase, and DynamoDB (which can accept and store data as a stream), and modern business intelligence tools like the ones we make at Zoomdata that are able to process and visualize these streams as well as historical data, in a very seamless way.

Just like your home DVR can play live TV, rewind a few minutes or hours, or play movies from last century, the same is possible with data analysis tools like Zoomdata that treat time as a fluid.
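The DVR analogy can be made concrete with a small sketch. This is only an illustration under stated assumptions, not Zoomdata’s implementation: the function names (`running_total`, `replay`, `live_feed`) and the event fields (`ts`, `amount`) are hypothetical, and the “live” source is a hard-coded stand-in for what would normally be a message queue. The point is that one consumer handles replayed history and live events through the same code path.

```python
import itertools
from typing import Dict, Iterable, Iterator, List


def running_total(events: Iterable[Dict]) -> Iterator[Dict]:
    """Consume events one at a time and yield a running sales total."""
    total = 0.0
    for event in events:
        total += event["amount"]
        yield {"ts": event["ts"], "total": total}


def replay(stored: List[Dict]) -> Iterator[Dict]:
    """Stream stored records in timestamp order, like DVR playback."""
    yield from sorted(stored, key=lambda e: e["ts"])


def live_feed() -> Iterator[Dict]:
    """Hypothetical live source; a real one would read from a queue."""
    for ts, amount in [(4, 5.0), (5, 2.5)]:
        yield {"ts": ts, "amount": amount}


history = [
    {"ts": 2, "amount": 20.0},
    {"ts": 1, "amount": 10.0},
    {"ts": 3, "amount": 7.5},
]

# The same consumer sees history first, then live events, with no
# separate batch path.
results = list(running_total(itertools.chain(replay(history), live_feed())))
print(results[-1])
```

Because `running_total` only ever asks for the next event, it does not need to know whether that event came from storage or from a live source.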

Throw That Batch In The Stream

Also, we believe that those who have proposed a “Lambda Architecture,” effectively separating paths for real-time and batched data, are espousing an unnecessary trade-off, optimized for legacy tooling that simply wasn’t engineered to handle streams of data, be they historical or real-time.

At Zoomdata we believe that it is not necessary to separate-track real-time and historical, as there is now end-to-end tooling that can handle both from sourcing, to transport, to storage, to analysis and visualization.

RW: So this shift toward streaming data is real, and not hype?

JL: It’s real. It’s affecting modern deployments right now, as architects realize that it isn’t necessary to ever batch up data, at all, if it can be handled as a stream end-to-end. This massively simplifies Big Data architectures if you don’t need to worry about batch windows, recovering from batch process failures, etc.

So again, even if you don’t need to analyze data from five seconds or even five minutes ago to make business decisions, it still may be simplest and easiest to handle the data as a stream. This is a radical departure from the way things in big data have been done before, as Hadoop encouraged batch thinking.

But it is much easier to just handle data as a stream, even if you don’t care at all—or perhaps not yet—about real-time analysis.

RW: So is streaming analytics what Big Data really means?

JL: Yes. Data is just like water, or electricity. You can put water in bottles, or electricity in batteries, and ship them around the world by planes, trains, and automobiles. For some liquids, such as Dom Perignon, this makes sense. For other liquids, and for electricity, it makes sense to deliver them as a stream through wires or pipes. It’s simply more efficient if you don’t need to worry about batching it up and dealing with it in batches.

Data is very similar. It’s easier to stream big data end-to-end than it is to bottle it up.

Article originally appeared HERE.

Source by analyticsweekpick

Statistics: Is This Big Data’s Biggest Hurdle?

Big Data is less about the data itself and more about what you do with the data. The application of statistics and statistical principles on the data helps you extract the information it contains. According to Wikipedia, statistics is the study of the collection, analysis, interpretation, presentation, and organization of data. The American Statistical Association defines statistics as “the science of learning from data, and of measuring, controlling, and communicating uncertainty.”

Statistics is considered to be one of the three primary pillars of the field of data science (the other two are content domain knowledge and computer science skills). Content domain expertise provides the context through which you identify the relevant questions to ask, computer science skills help you get access to the relevant data and prepare them for analysis, and statistics helps you interrogate that data to provide answers to your questions.

The Rise of Statistics

We have a lot of data and are generating a lot more of it. IDC says that we created 2.8 zettabytes in 2012. They estimate that number will grow to 40 zettabytes by 2020. It’s not surprising that Hal Varian, chief economist at Google, in 2009, said that “the sexy job in the next 10 years will be statisticians.” Statistics, after all, helps make sense of and get insight from data. The importance of statistics and statistical thinking in our datafied world can also be found in this excellent slideshare by Diego Kuonen, a statistician.

Figure 1. The Hottest Skill on LinkedIn in 2014: Statistical Analysis and Data Mining

Statistical skills are receiving increasing attention in the world of business and education. LinkedIn found that statistical analysis and data mining was the hottest skill in 2014 (see Figure 1).

Many companies are pursuing statistics-savvy people to help them make sense of their quickly expanding, complex data. Job postings on Indeed show that the number of data science jobs continues to grow (see Figure 2).

Figure 2. Growth rate for Data Science jobs continues to increase.

University students are flocking to the field of statistics. Of the STEM fields, statistics has been the fastest growing undergraduate degree over the past four years (see Figure 3).

Figure 3. Of the STEM fields, statistics has the highest growth rate.

The Fall of Statistics

The value of statistics is evident from the increase in the number of statistics degrees and of Big Data jobs requiring statistical skills. These are encouraging headlines, no doubt, as more businesses are adopting what scientists have been using to solve problems for decades. But here are a few troubling trends that need to be considered in our world of Big Data.

McKinsey estimates that the US faces a shortage of up to 190,000 people with analytics expertise to fill these data science jobs as well as a shortage of 1.5 million people to fill managerial and analyst jobs who can understand and make decisions based on the data. Where will we find these statistics-savvy people to fill the jobs of tomorrow? We may have to look outside the US.

Figure 4. USA Ranks 27th in the world on math literacy of 15-year-old students.

In a worldwide study of 15-year-old students’ reading, mathematics, and science literacy (the Program for International Student Assessment; PISA), researchers found that US teenagers ranked 27th (out of 34 countries) in math literacy (see Figure 4), with many countries scoring significantly higher than the US. According to the NY Times, while 13% of students across industrialized nations reached the top two levels of proficiency in math, just 9% of US students did. In comparison, 55% of students from Shanghai reached that level of proficiency. In Singapore, 40% did.

Even the general US public is showing a decreased interest in statistics. Using Google Trends, I looked at the popularity of the term “statistics” among the general US public, comparing it with “analytics” and “big data.” While the number of searches for “big data” and “analytics” has increased, the number of searches for “statistics” has decreased steadily since 2004.

Summary and Major Trends

Statistics is the science of learning from data. Statistics and statistical thinking help people understand the importance of data collection, analysis, interpretation and reporting of results.

In our Big Data world, statistical skills are becoming increasingly important for businesses. Companies are creating analytics-intensive jobs for statistics-savvy people, and universities are churning out more graduates with statistics degrees. On the other hand, there is expected to be a huge talent gap in the analytics industry. Additionally, the math literacy of US students is very low compared to the rest of the world. Finally, the US general public’s interest in statistics has been decreasing steadily for about a decade.

Knowledge of Statistics Needed for Both Analysts and Consumers

Statistics and statistical knowledge are not just for people who analyze data. They are also for people who consume, interpret and make decisions based on the analysis of those data. Think of the data from wearable devices, home monitoring systems and health records and how they are turned into reports for fitness buffs, homeowners and patients. Think of CRM systems, customer surveys, social media posts and review sites and how dashboards are created to help front-line employees make better decisions to improve the customer experience.

The better the grasp of statistics people have, the more insight/value/use they will get from the data. In a recent study, I found that customer experience professionals had difficulty estimating the size of customer segments based on customer survey metrics. Even though these customer experience professionals commonly use customer survey metrics to understand their customers, they showed extreme bias when solving this relatively simple problem. I assert that they would likely benefit (make fewer errors) if they understood statistics.

To get value from the data, you need to make sense of it, do something with it. How you do that is through statistics and applying statistical thinking to your data. Statistics is a way of helping people get value from their data. As the number of things that get quantified (e.g., data) continues to grow, so will the value of statistics.

The Most Important Thing People Need to Know about Statistics

Statistics is the language of data. Just as knowledge of your native language helps you maneuver in the world of words, statistics helps you maneuver in the world of data. As the world around us becomes more quantified, statistical skills will become more and more essential in our daily lives. If you want to make sense of our data-intensive world, you will need to understand statistics.

I’m not saying that everyone needs an in-depth knowledge of statistics, but I do believe that everybody would benefit from knowing basic statistical concepts and principles. What is the most important thing you think people need to know about statistics and why? I would love to hear your answers in the comments section. Here is my take on this question.


Source by bobehayes

Big data and predictive analytics likely to dominate recruitment, TeamLease report says

CHENNAI: New areas such as big data and predictive analytics are emerging as the most coveted in the Indian recruitment space and also accelerating the need for a highly sophisticated workforce, says TeamLease Services’ employment outlook report for the half year period from April to September 2015.

The report expects that demand for analytic skills is likely to far outstrip supply. The skills required for a data analytics function typically relate to mathematics and statistics, besides programming skills and an ability to go through lines of data generated by businesses to unearth valuable patterns.

According to TeamLease, other key trends that are likely to dominate the recruitment industry over the next six months include increasing demand for Information Technology (IT), engineering and other blue collar jobs, the emergence of startups as key hirers and the increased adoption of Recruitment Process Outsourcing (RPO).

The report states that both business and employment outlook are likely to dip marginally in the April-September period, indicating a consolidation by India Inc. While business sentiment is expected to witness a one point drop, employment outlook is likely to be down by two points.

Photo courtesy of The Times of India

“However, the cautiousness seems to have not dampened job growth and it remains strong at 11.3%, although lower than last half year, it is significantly better than the previous year,” TeamLease said in a release.

Industry-wise analysis showed that retail (led largely by online) and manufacturing/engineering, which clocked a three and four point increase in business and employment outlook respectively, are pushing the overall sentiment upwards while telecom seems to be the laggard.

Geographically, Mumbai and Delhi continue to showcase robust business and hiring activity, while Chennai’s growth is restricted only to business.

Pune, with a three point increase in hiring, is on a hiring spree, the report says. It expects that hiring from tier 2 towns will fall by two points in the current half year.

Functionally, the report says sales & marketing seems to have lost its sheen in the employment market with the focus landing on IT and engineering roles.

“As the economic fundamentals that drive business and hiring sentiments last half year continue to exist, the current dip is more a course correction than a downturn. The pro-industry announcements and easing of norms coupled by the resurgence in the GDP growth will definitely pull the hiring back onto the growth trajectory,” Kunal Sen, senior vice-president of TeamLease Services, said in a statement.

The report stresses the growing requirement for talent in the fields of delivery/logistics, facility management, mobile applications and data science, and lists content curator, dental hygienist and valuation and market risk analyst as a few sought-after skills.

To read the original article from The Times of India, click here.

Originally Posted at: Big data and predictive analytics likely to dominate recruitment, TeamLease report says by analyticsweekpick

The Blueprint for Becoming Data Driven: Data Quality

Data quality is the foundation upon which data-driven culture rests. Before upper level management can embrace data-centric processes, before analytics and reliance on big data become pervasive, and before data stewardship can truly extend outside IT’s boundaries to become embraced by the business, organizations must trust their data.

Trustworthy data meets data quality measures for…

  • Parsing
  • Cleansing
  • Profiling
  • De-duplicating
  • Modeling

…and is the reliable, consistent basis for accurate analytics and the value data provide to optimize business processes.

By ensuring data quality, organizations are laying the foundation for becoming data driven both explicitly and implicitly. Explicit manifestations of this shift include an increased reliance on data, greater valuation of data as an asset, and an entrenchment of data as a means of optimizing business. Implicitly, the daily upkeep of data-driven processes becomes second nature as aspects of data stewardship, provenance, integration, and even modeling simply become due diligence for everyone’s job.

According to Tamr head of product and strategy Nidhi Aggarwal, the quintessential manifestation of a data-centered culture may be reflected in another way that delivers even greater pecuniary benefits—the utilization of all enterprise data. “People talk about the democratization of analytics and about being truly data driven,” Aggarwal commented. “You cannot do that if you’re only using 20 percent of your data.”

Cognitive Data Science Automates Data Quality Measures
Data quality is largely conceived of as the output of assiduous data preparation. It is effectively ensured via the deployment of the machine learning technologies that are responsible for automating critical facets of data science: particularly the data cleansing and preparation that can otherwise become too laborious. Contemporary platforms for data quality establish machine learning models that map data of all types to specific measures for quality—and which can also include additional facets of data preparation, such as transformation. “Our machine learning models are able to do that really fast, really cheap, and improve it over time as we see more and more data,” Aggarwal noted.

With machine learning, quality measures begin by mapping relevant data sources to one another to determine how their attributes relate. The cognitive prowess of these models is demonstrated in their ability to sift through individual records for these data sources (which may be bountiful). In doing so, they identify points of redundancy, relationships between data, names and terms, how recent data are, and many other facets of data quality. “It’s very difficult for a human to do that,” Aggarwal said. “With machine learning, by doing statistical analysis, by looking at all of the attributes, by looking at these rules that some domain experts provide to the models, by looking at how the humans answered the questions that we presented as samples to them, it makes decisions about how these things should be de-duplicated.”
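To give a rough feel for attribute-based de-duplication, here is a minimal sketch. It is not Tamr’s approach and involves no machine learning: it swaps in plain string similarity from Python’s standard `difflib` module, and the record fields, the `find_duplicates` helper, and the 0.85 threshold are all assumptions chosen for illustration. A production system would learn such thresholds from expert feedback rather than hard-code them.

```python
from difflib import SequenceMatcher
from typing import Dict, List, Tuple


def similarity(a: str, b: str) -> float:
    """String similarity in [0, 1], ignoring case and outer whitespace."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def record_score(r1: Dict[str, str], r2: Dict[str, str], fields: List[str]) -> float:
    """Average the per-attribute similarity across the chosen fields."""
    return sum(similarity(r1[f], r2[f]) for f in fields) / len(fields)


def find_duplicates(
    records: List[Dict[str, str]], fields: List[str], threshold: float = 0.85
) -> List[Tuple[int, int]]:
    """Return index pairs whose average similarity meets the threshold.

    The threshold is an illustrative assumption, not a recommended value.
    """
    pairs = []
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            if record_score(records[i], records[j], fields) >= threshold:
                pairs.append((i, j))
    return pairs


# Hypothetical supplier records with inconsistent casing and spacing.
suppliers = [
    {"name": "Acme Corporation", "city": "Nashville"},
    {"name": "ACME Corporation ", "city": "nashville"},
    {"name": "Globex Inc", "city": "Chennai"},
]

duplicate_pairs = find_duplicates(suppliers, ["name", "city"])
print(duplicate_pairs)
```

The pairwise loop is quadratic in the number of records, which is fine for a demonstration but is one reason real platforms rely on smarter candidate selection.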

Reinforcing Trust with Provenance and Natural Language Processing
Competitive preparation platforms that facilitate data quality temper the quality measures of cognitive computing with human involvement. The result is extremely detailed data provenance which reinforces trust in data quality, and which is easily traced for the purposes of assurance. The decisions that domain experts make about how sources are unified and relate to each other for specific data types—which is critical to establishing data quality—are recorded and stored in platforms for traceability. Thus, there is little ambiguity about who made a decision, when, and what effect it had on the underlying machine learning model for how data was unified and defined to establish data quality. Natural Language Processing is involved in the data quality process (especially with unstructured text) by helping to reconcile definitions, different terms, and commonalities between terms and how they are phrased. The pivotal trust required for becoming data-driven is therefore facilitated with both machine learning and human expertise.

Metadata and Evolving Models
The granular nature of a machine learning, human-tempered approach to data quality naturally lends itself to metadata and to incorporating new data sources into quality measures. Metadata is identified and compared between sources to ensure unification for specific use cases and requisite data quality. The true value of this cognitive approach to data quality is evinced when additional data sources are included. According to Aggarwal: “People can do this manual mapping if they only wanted to do it manually once. But the trouble is when they have to add a new data source, it’s almost as much effort as doing it the first time.” However, the semantic technologies that form the crux of machine learning are able to incorporate new sources into models so that “the model can actually look at the new data set, profile it really quickly, and figure out where it maps to all the things that it previously knows about,” Aggarwal said.

More significantly, the underlying machine learning model can evolve alongside data sets that are radically dissimilar from its initial ones. “Then the model updates itself to include this new data,” Aggarwal mentioned. “So when a new data set comes in further down the line, the chances are that it will be completely new and that the models don’t align with it go lower and lower every time.” The time saved by expediting the model updates required for data quality underscores the agility needed to further trust data when transitioning to becoming data-driven.

Using All Data
When organizations are able to trust their data because of the aforementioned rigorous standards for data quality, they are able to incorporate more data into business processes. The mapping procedures previously outlined help organizations to bring all of their data together and determine which of it relates to a specific use case. The monetary boons of incorporating all enterprise data into business processes are exemplified by a use case from the procurement vertical. Were a company attempting to determine how many suppliers it had and whether it was getting the best payment terms from them, those that were not data savvy could use only a finite amount of their overall data (limited to particular business units) to determine this answer. Those that were truly data-driven and able to incorporate all of their data for this undertaking could incorporate the input of a greater number of business units. According to Aggarwal, who encountered this situation with a Tamr customer:

“There were wildly different payment terms for the same supplies. When we dug into what parts they were buying from the suppliers and at what prices across the different business units, there were sometimes 300X differences in the price of the same part.” Unifying one’s data for uniform quality measures is integral to identifying these variances, which translates into quantifiable financial advantages. “An individual decision might save them a few hundred dollars here and there,” Aggarwal remarked. “Collectively, optimizing their decisions every single day has saved them millions and millions of dollars over time. That’s the power of bringing all data together.”

Citizen Stewardship and Business Engagement
The pervasiveness of data reliance and the value it creates for decision-making and business processes are intrinsically engendered through the trust gained from a firm foundation in data quality. By utilizing timely, reliable data that is consistent in terms of metadata, attributes, and records management, organizations can transition to a data-centric culture. The products of such a culture are the foregoing cost advantages that businesses attribute to improved decision-making. The by-products are streamlined data preparation, improved provenance, upper-level management support, aligned metadata, and an appreciation of data’s value and upkeep on the part of the business users who depend on it most.

Aggarwal commented that increased data quality processes facilitated by machine learning and human oversight result in: “A broader dialogue about data in terms of stewardship. Today stewardship is in the hands of IT people basically who don’t have business context. What [we do] is take that stewardship and engage the business people who actually know something about the data much sooner in the process of data quality. That’s how they get to higher data quality, faster.”

And that’s also how they become data-driven, faster.

Originally Posted at: The Blueprint for Becoming Data Driven: Data Quality

3 S for Building Big Data Analytics Tool of the Future


There is a huge debate about what constitutes the Big Data Analytics tool of the future, and many have jumped into the race with their own flavor of solutions or problem-solving techniques for the critical use cases at play in Big Data-laden businesses. While new businesses work at it, what fundamental theory of product design strategy could help create something for the future: solutions with the ability to stay competitive and relevant?

While searching for ideas, I stumbled upon a video of Christopher Lynch from Atlas Venture (@AnalyticsWeek Boston’s First Unconference Finance/Insurance track keynote). He made some interesting points about which focus areas hold new opportunities. You can see the video attached below (click it to watch the specific bit; I would also recommend watching the whole talk, as it has lots of great points on the current big data ecosystem). He touched on three S’s (Simplicity, Scalability and Security) as three fundamental areas for big data analytics companies. He certainly has an interesting perspective and provides good coverage of the current disruptive opportunity areas. I have some coinciding thoughts, briefly mentioned in the ebook Data Driven Innovation – A Primer (download free here), where I touched on the three S’s that we should build into our products to help cause much-needed disruption in the big data space. My laundry list was Small, Simple and Scale. So, it’s great to see that Chris also named two of my three areas.

3S’s that I think will shape the future of Big Data Analytics and Why:

1. Small: Yes, Big Data is big, but the solution should be small. Reduce the scope of the product to the one magical thing that could solve a potential use case. Wearing a system architect’s hat, one could easily vote for it, as small solutions tend to scale well and are more often than not simple to understand. Making a solution small is where the toughest part of the planning goes. If you’ve heard that 80% is planning and 20% is execution, it is safe to say that the 80% is (or should be) spent on making the solution smaller: a quick, bite-sized piece for easy adaptability.

2. Simple: This is a no-brainer in the world of software engineering. Simplicity always triumphs over a ginormous, complicated product. Sure, complexity sells, but as a service and not as a product. Who has not heard the quote “if you can’t explain it simply, you don’t understand it well enough”? This applies to good system design and hence to good big data product design. Simple solutions are often understood quickly and therefore adopted easily, hence better sales. In fact, it is safe to say that simplicity is the most important of the three aspects listed here.

3. Scale: This is surely a freebie if you get the first two right, but there have been times when a simple and small solution failed to scale. Scalability is another good area of focus for disruption. A simple tool of a good unit size can be replicated over and over, which brings scalability to the tool. A good system should be able to grow with the company it is helping to grow. A tool that cannot go along for the long ride with a company will often see diminishing adoption right from the beginning. A hopeful thought is that this point is the easiest to achieve if the two points above are taken into consideration. Scalability is important for adoption among big businesses that deal with big blobs of data.

I would certainly agree with Chris’s point about the importance of security in current tools for easy adoption in the enterprise world, and I should probably add it as my fourth S as well. So, yes, all power to you, and congratulations on your disruptive platform if you’ve built your science around those four S’s. The world needs your product. The Big Data Analytics world is craving disruption: today’s tools serve only the 1%, while the other 99% wait and watch for tools to reach their capability levels, or else have to up their game and buy into an ocean of tools that are complicated and super sticky, and where failure is expensive.

Till that day arrives, all we have to do is write in front of us: Simple, Small and Scalable, and keep them in mind as we build a solution.

Here’s the video (if you don’t have time, skip to 4m 10sec):

Originally Posted at: 3 S for Building Big Data Analytics Tool of the Future

The One Number You Need to Grow (A Replication)

The one number you need to grow.

That was the title of the 2003 HBR article by Fred Reichheld that introduced the Net Promoter Score as a way to measure customer loyalty.

It’s a strong claim that a single attitudinal item can portend company success. And strong claims need strong evidence (or at least corroborating evidence).

In an earlier article, I examined the original evidence put forth by Reichheld and looked for any other published evidence; I also discussed the findings at the event How Harmful is the Net Promoter Score?

To establish the validity and make the claim that the NPS predicts growth, Fred Reichheld reported that the NPS was the best or second-best predictor of growth in 11 of 14 industries (p. 28).

The data he provided in the appendix of his 2006 book The Ultimate Question to support the relationship shows data from 35 companies in six industries (computers, life insurance, Korean auto insurance, U.S. airlines, Internet Service Providers, and UK supermarkets). His 2003 HBR article contained five more companies and one additional industry (rental cars) for a total of 40 companies and 7 industries.

Close examination of the data reveals that Reichheld used historical, not future growth. He showed the three-year average growth rates (1999–2002) correlated with the two-year average Net Promoter Scores (2001–2002). In other words, the NPS correlated with past growth rates (as opposed to future growth rates). This does establish validity (a sort of concurrent validity) but not predictive validity.

To assess the predictive ability of the NPS, I looked at the U.S. airline industry in 2013 and found a strong correlation between future growth and NPS (but only after accounting for a major merger in the industry).

The published literature on the topic in the last 15 years isn’t terribly helpful either. I found eight other studies that examined the NPS’s predictive ability (Figure 1). I was, however, a bit disappointed in the quality of many of the studies given the ubiquity of the Net Promoter Score.

As Figure 1 shows, three of the eight studies found medium to strong correlations but used historical or current revenue (not future). Of the five remaining studies that used future metrics, two were authored by a competitor of Satmetrix (a possible competitive bias) and one was from a book with connections to Satmetrix and not peer reviewed (with an agenda to promote the NPS).

Figure 1: Summary of papers examining the NPS and growth (many used historical revenue or had methodological flaws—like not actually using the 11-point LTR item).

Surprisingly, two of the three studies that looked at future metrics didn’t use the 11-point Likelihood to Recommend question (Keiningham et al., 2007b; Morgan and Rego, 2006). One study that used a 10-point version that found no correlation with business growth also found no correlation with any metrics at the firm level for three Norwegian industries it examined (Keiningham et al., 2007a), which was an unusual finding given all other studies found some correlation with metrics.

Only the study by de Haan et al. (2015) actually used the 11-point Likelihood to Recommend item and found the Net Promoter Score did have a small correlation with future intent (collected in a longitudinal study). It wasn’t the best predictor, but it did correlate with future metrics (which was similar to the finding from the study by Keiningham et al., 2007b using a 5-point LTR).

I think there are at least two reasons for the dearth of published data examining the NPS and growth:

  1. Little upside: There’s little upside for Satmetrix and Reichheld to fund and publish more research to establish the predictive validity of the NPS. If it’s already in wide usage (most Fortune 500 companies use it), then there’s little to gain. That Reichheld didn’t include more data in his 2nd edition of The Ultimate Question likely supports this. (He even excluded the appendix that was in the 1st.)
  2. It’s difficult: Predicting revenue at the customer or company level requires data from two points in time. Longitudinal data takes time to collect (by definition years in this case). It’s also hard to associate attitudinal data to financial performance. Companies have little reason to expose their own data and third-party firms have trouble getting access.

Predicting Future Growth with the Original Data

A few papers I cited above pointed out the problem with Reichheld using historical revenue to show future growth, but none I found actually looked to see whether the published NPS data predicted future growth for the same industries. Keiningham et al. (2007a) did use some of Reichheld’s data to show that the American Customer Satisfaction Index was an equal or better predictor of historical revenue, but didn’t look at future growth.

So, I revisited the very data used to establish the NPS validity—the 1999–2002 Net Promoter Score data Reichheld published in his 2006 book appendix and 2003 HBR article.

With the help of research assistants, I dug through old annual reports, press releases, articles, and the Internet Archive to match the financial metrics collected more than 15 years ago. It wasn’t easy, as many companies merged or went out of business, and whole industries morphed (AOL anyone?). We had to piece together numbers from many different sources and make some assumptions (noted below).

After several weeks of digging we had good results and were able to find data for the same six industries used in the 2006 book plus the one industry included in the HBR article for the years 2002–2006. Table 1 shows the industry, the metric we used, the year the NPS data was reported in Reichheld’s book, the current/historical years Reichheld used, and then the years we found data for to predict future growth.

Industry | Metric | NPS Data | Reichheld Years | Our Future Years
--- | --- | --- | --- | ---
U.S. PC market | PC Shipments | 2001-2001 | 1999-2002 | 2002-2005
U.S. Life Insurance market | Life premiums | 2001-2002 | 1999-2003 | 2002-2005
U.S. airlines market | Sales | 2001-2002 | 1999-2002 | 2002-2005
U.S. Internet Service Providers | Sales | 2002 | 1999-2002 | 2002-2005
U.S. car rental market | Revenue | 2002 | 1999-2002 | 2002-2005
UK supermarkets | Sales | 2003 | 1999-2003 | 2003-2006
Korean auto insurance | Sales | 2003 | 2001-2003 | 2003-2006

Table 1: Industries used to establish the predictive ability of the Net Promoter Score from The Ultimate Question and the 2003 HBR article.


We used two future growth periods to assess the predictive validity of the NPS. The first is the two years immediately following the NPS data (and graphed below). For the U.S. industries this was 2002–2003; for the international industries this was 2003–2004 (which matches the years of NPS data Reichheld used). The second includes a longer period of three to four years of growth (2002–2005 for U.S. industries and 2003–2006 for international). We computed Pearson correlations for each industry, then averaged the correlations using the Fisher Z transformation to account for the non-normality in correlations. Finally, we converted the correlations to R2 values to match the fit statistic reported in The Ultimate Question.
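The averaging step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the correlations are made-up values rather than the study's data, and the average is unweighted (the article does not say whether industries were weighted by the number of companies).

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def average_r_fisher(rs):
    """Average correlations with Fisher's r-to-z transform.

    Each r is converted to z = atanh(r), the z values are averaged
    (unweighted, an assumption), and the mean is converted back with tanh.
    """
    zs = [math.atanh(r) for r in rs]
    return math.tanh(sum(zs) / len(zs))

# Hypothetical per-industry correlations between NPS and growth
# (illustrative values only, not taken from the article).
industry_rs = [0.30, 0.60, 0.85]

mean_r = average_r_fisher(industry_rs)
mean_r_squared = mean_r ** 2  # share of variance explained, reported as R2
```

Because the transform stretches values near 1, the Fisher average of positive correlations sits somewhat above their plain arithmetic mean, which is why the averaged R2 in a table like this need not equal the simple mean of the industry R2 values.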

Reichheld notes that they found the log of the change in NPS would boost the explanatory power (R2) of NPS but they reported only raw NPS numbers in the appendix. With only one year of NPS data, we didn’t have changes in the NPS so we replicated the approach in the appendix using only the data from the single Net Promoter Scores.

Table 2 shows the results for Reichheld’s originally reported R2 values using current or historical revenue and our R2 values for the subsequent two and four years.

A bit to my surprise (given the many vocal critics and lack of published data), we found evidence that the Net Promoter Score predicted growth in both the subsequent two- and four-year periods. On average we found the Net Promoter Scores reported by Reichheld explained 38% of the changes in growth for the seven industries examined for the immediate two years (low of 8% to a high of 76%). The explanatory power decreased some when the future period increased (which is not too surprising given what can change in four years). For the four-year period, the average explanatory power of the NPS is still 30% (low of 4% to a high of 79%).

To put these R2 values into perspective, the SAT can explain (predict) around 25% of first year college grades, which means these R2 values are impressively large.

Industry | Reichheld Historical R2 | 2-Year Future Growth R2 | 4-Year Future Growth R2
--- | --- | --- | ---
U.S. PC market | 68% | 27% | 75%
U.S. Insurance market | 86% | 39% | 4%
U.S. airlines market | 68% | 8% | 22%
U.S. Internet Service Providers | 93% | 20% | 2%
U.S. car rentals | 28% | 8% | 8%
UK supermarkets | 84% | 76% | 79%
Korean auto insurance | 68%* | 48% | 12%
Avg R2 (Fisher transformed) | 76% | 38% | 30%

Table 2: R2 values of seven industries from Reichheld’s NPS data compared to historically reported revenue and two-year and four-year growth rates by industry. The Fisher R to Z transformation was used to average the correlations before converting to R2 averages. *Reichheld reported an R2 of 68% for Korean auto but our replication from the scatterplots generated a value of ~30%. See other notes below by industry.

Below we have re-created the bubble scatterplots from Reichheld and compared them with our two-year future data. We estimated the regression lines, R2 values and bubble size using an approach similar to the one described in Keiningham et al. (2007a).

PC Shipments

Historical R2 = 74% Future (2 Years): R2 = 27%

Note: Compaq was purchased by HP so is not included in future years. IBM sold its PC business to Lenovo in 2005, so the calculation only includes growth rates between 2002–2004 instead of 2002–2005. Gateway merged with eMachines in 2004; growth rates are also only 2002–2004 and only include Gateway numbers.

US Life Insurance

Historical R2 = 86% Future (2 Years): R2 = 39%

Note: For Prudential we used growth rates in British pounds, but bubble size on the chart is determined by converted number of life premiums in U.S. dollars.

US Airlines

Historical R2 = 66% Future (2 Years): R2 = 8%

Note: TWA stopped operations in 2001 and wasn’t included in the calculation for future years. America West Airlines’ four-year growth period covers only 2002–2004, as it merged with US Airways Group in 2005.

Internet Service Providers (ISPs)

Historical R2 = 89% Future (2 Years): R2 = 20%


UK Grocery Stores

Historical R2 = 81% Future (2 Years): R2 = 76%

Note: For ASDA we used growth rates in USD, but the bubble size on the chart is determined by converted number of sales in British pounds.

Korean Auto Insurance

Historical R2 = 68%/30%* Future (2 Years): R2 = 48%

Note: Reichheld reports an R2 of 68% but we calculated a much lower R2 of 30% from the same data.

U.S. Rental Cars

Historical R2 = 28% Future (2 Years): R2 = 17%

Note: In 2003 Vanguard Group purchased National and Alamo brands and didn’t separate the revenue so they are excluded in the future analysis.



A re-examination of the original NPS data using future (rather than historical revenue growth) found:

The NPS explains immediate firm growth in selected industries. On average we found NPS data can explain 38% of the variability in company growth metrics in seven industries at the company/firm level. This is less than half the explanatory power of historical growth reported by Reichheld (76%) but still represents a substantial amount relative to other behavioral science measures. While not as impressive, it still suggests the NPS is a leading indicator of future growth rates, at least in some selected industries for some time periods at the company level.

The NPS is still predictive of more distant growth. The explanatory power of the NPS still remained at a solid 30% for a four-year future growth period. This suggests that established company policies and growth patterns can remain in effect for years (but not always) and the NPS may still portend the more distant future (again in these selected industries and years).

Industry changes are hard to predict with few data points. Companies merge, industries morph, and unexpected changes can happen that affect a company’s growth and consequently the predictive ability of any measure, including the NPS. This was seen in the car rental industry (National merged), the PC industry (IBM sold to Lenovo), and the airline industry (TWA was acquired after bankruptcy). When an industry has few data points (e.g. ISPs with only three), only the strongest relationships are detectable, and small changes in one year can completely remove any evidence for a relationship between NPS and growth.

Prediction is imprecise. The NPS may be a victim of its own success, with its hype leading many to dismiss it unless it’s a perfect predictor of growth. (After all, the headline indicated it’s the ONE number you need to grow!) Making predictions is difficult and imprecise, but this analysis suggests the NPS does have reasonable predictive ability, at least as high as other high-stakes measures like college entrance exams. It’s unlikely to be the superior measure in every industry, given our earlier analyses on satisfaction, but this data again suggests it may be an adequate proxy measure of future growth for many industries.

There is a possible selection bias. We limited our analysis to the industries, companies, and metrics reported by Reichheld. It’s likely that these are the best illustrations of the NPS’s predictive (or post-dictive) ability and may not be representative of all industries. Reichheld himself reported that the NPS wasn’t always the best predictor of growth (only in 11/14 industries). A future analysis will look at a broader range of industries than the seven shown here, as well as examinations at the customer level.



Below are the sources where we found growth rates to match those reported in Reichheld so you can check our work and assumptions (let us know if you see a discrepancy).

US PC market (All Firms)

US Life insurance market

US Airlines

US Internet Service Providers

UK supermarkets

Korean auto insurance

US Car rental


Source: The One Number You Need to Grow (A Replication) by analyticsweek

Nate-Silvering Small Data Leads to Internet Service Provider (ISP) industry insights

There is much talk of Big Data and how it is changing/impacting how businesses improve the customer experience. In this week’s post, I want to illustrate the value of Small Data.

Internet Service Providers (ISPs) receive the lowest customer satisfaction ratings among the industry sectors measured by the American Customer Satisfaction Index (ACSI). As an industry, then, the ISP industry has much room for improvement, some more than others. This week, I will use several data sets to help determine ISP intra-industry rankings and how to improve their inter-industry ranking.

Table 1. Internet Service Provider Ratings

I took to the Web to find several publicly available and relevant data sets regarding ISPs. In all, I found 12 metrics from seven different sources for 27 ISPs. I combined the data sets by ISP. By merging the different data sources, we will be able to uncover greater insights about these different ISPs and what they need to do to increase customer loyalty. The final data set appears in Table 1. The description of each metric appears below:

  • Broadband type: The types of broadband were taken from a PCMag article.
  • Actual ISP Speed: Average speed for Netflix streams from November 2012, measured in megabits per second (Mbps).
  • American Customer Satisfaction Index (ACSI): an overall measure of customer satisfaction from 2013. Ratings can vary from 0 to 100.
  • Temkin Loyalty Ratings: Based on three likelihood questions (repurchase, switch and recommend) from 2012. Questions are combined and reported as a “net score,” similar to the NPS methodology. Net scores can range from -100 to 100.
  • JD Power: A 5-star rating system for overall satisfaction from 2012. 5 Star = Among the best; 4 Star = Better than most; 3 Star = About average; 2 Star = The rest.
  • PCMag Ratings (6 metrics: Recommend to Fees): Ratings based on customer survey that measured different CX areas in 2012. Ratings are based on a 10-point scale.
  • DSL Reports: The average customer rating across five areas. These five areas are: 1) Pre-Sales Information, 2) Install Coordination,  3) Connection reliability, 4) Tech Support and 5) Value for money. Data were pulled from the site on 6/30/2013. Ratings are based on a 5-point scale.

As you can see in Table 1, there is much missing data for some of the 27 ISPs. The missing data do not necessarily reflect the quality of the data that appear in the table. These sources simply did not collect data to provide reliable ratings for each ISP or simply did not attempt to collect data for each ISP. The descriptive statistics for and correlations among the study variables appear in Table 2.

Table 2. Descriptive Statistics of and Correlations among Study Variables

It’s all about Speed

Customer experience management research tells us that one way of improving satisfaction is to improve the customer experience. We see that actual speed of the ISP is positively related to most customer ratings, suggesting that ISPs that have faster speed also have customers who are more satisfied with them compared to ISPs that have slower speeds. The only exception to this is satisfaction with Fees; ISPs with faster actual speed tend to have customers who are less satisfied with Fees compared to ISPs with slower actual speed.

Nate-Silvering the Data

Table 3. Rescaled Values of Customer Loyalty Metrics for Internet Service Providers

Recall that Nate Silver aggregated several polls to make accurate predictions about the results of the 2012 presidential elections. Even though different polls, due to sampling error, had different outcomes (sometimes Obama won, sometimes Romney won), the aggregation of different polls resulted in a clearer picture of who was really likely to win.

In the current study, we have five different survey vendors (ACSI, Temkin, JD Power, PCMag and DSL Reports) assessing customer satisfaction with ISPs. Depending on what survey vendor you use, the ranking of ISPs differs. We can get a clearer picture of the ranking by combining the different data sources because a single study is less reliable than the combination of many different studies. While the outcome of aggregating customer surveys may not be as interesting as aggregating presidential polls, the general approach that Silver used to aggregate different results can be applied to the current data (I call it Nate-Silvering the data).

Given that the average correlations among the loyalty-related metrics in Table 2 are rather high (average r = .77; median r = .87), aggregating each metric to form an Overall Advocacy Loyalty metric makes mathematical sense. This overall score would be a much more reliable indicator of the quality of an ISP than any single rating by itself.

To facilitate the aggregation process, I first transformed the customer ratings to a common scale, a 100-point scale, using the following methods. I transformed the Temkin Ratings (a net score) into mean scores based on a mathematical model developed for this purpose (see: The Best Likelihood to Recommend Metric: Mean Score or Net Promoter Score?). This value was then multiplied by 10. The remaining metrics were transformed into a 100-point scale by using a multiplicative function of 20 (JD Power, DSL Reports) and 10 (PCMag Sat, PCMag Rec). These rescaled values are located in Table 3. While the transformation altered the average of each metric, these transformations did not appreciably alter the correlations among the metrics (average r = .75, median r = .82).
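As a rough illustration of the rescaling just described, here is a minimal Python sketch. The multipliers (20 and 10) come from the text; the metric keys, the function name, and the example ratings are hypothetical. The Temkin net-score-to-mean conversion is left out because the model behind it is not given here, and ACSI is assumed to pass through unchanged since it is already reported on a 0 to 100 scale.

```python
# Multipliers taken from the text; the metric keys are hypothetical labels.
SCALE_FACTORS = {
    "jd_power": 20,     # 5-star rating -> 100-point scale
    "dsl_reports": 20,  # 5-point rating -> 100-point scale
    "pcmag_sat": 10,    # 10-point rating -> 100-point scale
    "pcmag_rec": 10,    # 10-point rating -> 100-point scale
}

def rescale(ratings):
    """Return the ratings on a 100-point scale.

    Missing ratings (None) stay None. Metrics without a multiplier,
    such as ACSI, are assumed to be on a 0-100 scale already.
    """
    rescaled = {}
    for metric, value in ratings.items():
        if value is not None and metric in SCALE_FACTORS:
            rescaled[metric] = value * SCALE_FACTORS[metric]
        else:
            rescaled[metric] = value
    return rescaled

# Made-up ratings for a single ISP, for illustration only.
example = rescale({
    "acsi": 65,
    "jd_power": 3,
    "dsl_reports": None,
    "pcmag_sat": 7.5,
    "pcmag_rec": 6.8,
})
```

A linear rescaling like this changes each metric's mean but not its correlation with the other metrics; the small shift in correlations reported above would come from the non-linear Temkin conversion.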

Table 4. Rankings of Internet Service Providers based on the average loyalty ratings.

The transformed values were averaged for each of the ISPs. These results appear in Table 4. As seen in this table, the top 5 rated ISPs (overall advocacy ratings) are:

  1. WOW!
  2. Verizon FiOS
  3. Cablevision
  4. Earthlink
  5. Bright House

The bottom 5 rated ISPs (overall advocacy ratings) are:

  1. Windstream
  2. CenturyLink
  3. Frontier
  4. WildBlue
  5. HughesNet
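The averaging and ranking behind lists like these can be sketched as follows. This is a sketch under assumptions, not the method actually used: the ISP names and ratings are hypothetical, and missing ratings are simply skipped when averaging, since the article does not say how missing data were handled.

```python
def overall_score(rescaled):
    """Mean of the available 100-point ratings; None if every source is missing."""
    values = [v for v in rescaled.values() if v is not None]
    return sum(values) / len(values) if values else None

def rank_isps(scores_by_isp):
    """Sort ISP names from highest to lowest overall score, skipping unrated ISPs."""
    rated = {isp: s for isp, s in scores_by_isp.items() if s is not None}
    return sorted(rated, key=rated.get, reverse=True)

# Hypothetical ISPs with ratings already on a 100-point scale.
isp_ratings = {
    "ISP A": {"acsi": 70, "jd_power": 80, "dsl_reports": None},
    "ISP B": {"acsi": 60, "jd_power": 40, "dsl_reports": 50},
    "ISP C": {"acsi": None, "jd_power": None, "dsl_reports": None},
}

scores = {isp: overall_score(r) for isp, r in isp_ratings.items()}
ranking = rank_isps(scores)
```

One consequence of skipping missing values is that ISPs rated by only one or two sources get an overall score based on less evidence, so their position in the ranking is less reliable than that of ISPs covered by all five vendors.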


Small Data, like its big brother, can provide good insight (with the help of the right analytics, of course) about a given topic. By combining small data sets about ISPs, I was able to show that:

  1. Actual ISP speed is related to customer satisfaction with speed of ISP. ISPs that have objectively faster speed receive higher ratings on satisfaction with speed.
  2. Different survey vendors provide reliable and valid results about customer satisfaction with ISPs (there was a high correlation among different survey vendors).
  3. Improving customer loyalty with ISPs is a function of actual ISP speed.

The bottom line is that you shouldn’t forget the value of small data.

Source: Nate-Silvering Small Data Leads to Internet Service Provider (ISP) industry insights