May 30, 2019: #AnalyticsClub #Newsletter (Events, Tips, News & more…)

[ COVER OF THE WEEK]

Cover image: “Weak data” (Source)

[ LOCAL EVENTS & SESSIONS]

More WEB events? Click Here

[ FEATURED COURSE]

Statistical Thinking and Data Analysis

This course is an introduction to statistical data analysis. Topics are chosen from applied probability, sampling, estimation, hypothesis testing, linear regression, analysis of variance, categorical data analysis, and n… more

[ FEATURED READ]

Storytelling with Data: A Data Visualization Guide for Business Professionals

Storytelling with Data teaches you the fundamentals of data visualization and how to communicate effectively with data. You’ll discover the power of storytelling and the way to make data a pivotal point in your story. Th… more

[ TIPS & TRICKS OF THE WEEK]

Winter is coming, warm your Analytics Club
Yes and yes! As we head into winter, what better time to talk about our increasing dependence on data analytics to support decision making. Data- and analytics-driven decision making is rapidly working its way into our core corporate DNA, yet we are not building practice grounds to test those models fast enough. Snug-looking models can hide nails that cause uncharted pain if they go unchecked. Now is the right time to think about setting up an Analytics Club [Data Analytics CoE] in your workplace to lab out best practices and provide a test environment for those models.

[ DATA SCIENCE Q&A]

Q: What is an outlier? Explain how you might screen for outliers and what you would do if you found them in your dataset. Also, explain what an inlier is, how you might screen for them, and what you would do if you found them in your dataset.
A: Outliers:
– An observation point that is distant from other observations
– Can occur by chance in any distribution
– Often, they indicate measurement error or a heavy-tailed distribution
– Measurement error: discard them or use robust statistics
– Heavy-tailed distribution: high skewness, can’t use tools assuming a normal distribution
– Three-sigma rule (normally distributed data): about 1 in 22 observations will differ from the mean by more than twice the standard deviation
– Three-sigma rule: about 1 in 370 observations will differ from the mean by more than three times the standard deviation

Three-sigma rule example: in a sample of 1,000 observations, seeing up to 5 observations that deviate from the mean by more than three times the standard deviation is within the range of what can be expected: it is less than twice the expected count and hence within about one standard deviation of it (treating the count as Poisson).

If the nature of the distribution is known a priori, it is possible to check whether the number of outliers deviates significantly from what can be expected. For a given cutoff (samples fall beyond the cutoff with probability p), the number of outliers among n samples can be approximated by a Poisson distribution with lambda = pn. Example: with a normal distribution and a cutoff three standard deviations from the mean, p ≈ 0.3%, so for n = 1000 the number of samples whose deviation exceeds three sigma is approximately Poisson with lambda ≈ 3 (see the sketch below).
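A quick check of these numbers in Python (scipy), purely as an illustration of the reasoning above:

```python
from scipy import stats

n = 1000                                  # sample size from the example above

# Two-sided tail probability beyond k standard deviations for a normal distribution
p2 = 2 * stats.norm.sf(2)                 # ~0.0455, i.e. roughly 1 in 22
p3 = 2 * stats.norm.sf(3)                 # ~0.0027, i.e. roughly 1 in 370

lam = p3 * n                              # Poisson rate for 3-sigma outliers, ~2.7
prob_up_to_5 = stats.poisson.cdf(5, lam)  # chance of seeing 5 or fewer such points

print(f"P(>2 sigma) = {p2:.4f}  (1 in {1 / p2:.0f})")
print(f"P(>3 sigma) = {p3:.4f}  (1 in {1 / p3:.0f})")
print(f"lambda = {lam:.2f},  P(count <= 5) = {prob_up_to_5:.3f}")
```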

Identifying outliers:
– No rigid mathematical method
– Subjective exercise: be careful
– Boxplots
– QQ plots (sample quantiles vs. theoretical quantiles); see the sketch below
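A hedged sketch of those two visual screens in Python (matplotlib and scipy, with synthetic data and planted outliers, purely illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(50, 5, 500), [95, 102]])  # two planted outliers

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 3))
ax1.boxplot(x, vert=False)                 # points beyond the whiskers are candidates
ax1.set_title("Boxplot screen")
stats.probplot(x, dist="norm", plot=ax2)   # sample quantiles vs. theoretical quantiles
ax2.set_title("QQ plot screen")
plt.tight_layout()
plt.show()
```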

Handling outliers:
– Depends on the cause
– Retention: when the underlying model is confidently known
– Regression problems: only exclude points that exert a large influence on the estimated coefficients (Cook’s distance; see the sketch below)
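For the regression case, Cook's distance is available out of the box in statsmodels; a hedged sketch with made-up data (the 4/n threshold is a common rule of thumb, not a universal rule):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
x = rng.normal(size=100)
y = 2.0 * x + rng.normal(scale=0.5, size=100)
y[0] += 10                                   # plant one influential observation

model = sm.OLS(y, sm.add_constant(x)).fit()
cooks_d, _ = model.get_influence().cooks_distance

threshold = 4 / len(y)                       # common rule-of-thumb cutoff
influential = np.where(cooks_d > threshold)[0]
print("Observations with large influence:", influential)
```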

Inlier:
– Observation lying within the general distribution of other observed values
– Doesn’t perturb the results but is non-conforming and unusual
– Simple example: observation recorded in the wrong unit (°F instead of °C)

Identifying inliers:
– Mahalanobis distance (see the sketch below)
– Used to calculate the distance between two random vectors
– Difference from Euclidean distance: accounts for correlations
– Discard them
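A hedged sketch of the Mahalanobis screen (numpy/scipy); the planted point looks ordinary in each coordinate but violates the correlation structure, which is exactly the kind of inlier this distance can catch:

```python
import numpy as np
from scipy.spatial.distance import mahalanobis
from scipy import stats

rng = np.random.default_rng(2)
X = rng.multivariate_normal([0, 0], [[1, 0.8], [0.8, 1]], size=500)
X = np.vstack([X, [2.5, -2.5]])      # planted inlier: normal marginals, wrong correlation

mean = X.mean(axis=0)
inv_cov = np.linalg.inv(np.cov(X, rowvar=False))
d = np.array([mahalanobis(row, mean, inv_cov) for row in X])

# For p-dimensional normal data, squared distances are roughly chi-square with p dof
cutoff = np.sqrt(stats.chi2.ppf(0.999, df=X.shape[1]))
print("Flagged observations:", np.where(d > cutoff)[0])
```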

Source

[ VIDEO OF THE WEEK]

@AnalyticsWeek Panel Discussion: Big Data Analytics

Subscribe on YouTube

[ QUOTE OF THE WEEK]

Data is not information, information is not knowledge, knowledge is not understanding, understanding is not wisdom. – Clifford Stoll

[ PODCAST OF THE WEEK]

#BigData @AnalyticsWeek #FutureOfData #Podcast with Eloy Sasot, News Corp

Subscribe: iTunes | GooglePlay

[ FACT OF THE WEEK]

As of April 2011, the U.S. Library of Congress had collected 235 terabytes of data.

Sourced from: Analytics.CLUB #WEB Newsletter

Analytic Exploration: Where Data Discovery Meets Self-Service Big Data Analytics

Traditionally, the data discovery process was a critical prerequisite to, yet a distinct aspect of, formal analytics. This fact was particularly true for big data analytics, which involved extremely diverse sets of data types, structures, and sources.

However, a number of crucial developments have recently occurred within the data management landscape that resulted in increasingly blurred lines between the analytics and data discovery processes. The prominence of semantic graph technologies, combined with the burgeoning self-service movement and increased capabilities of visualization and dashboard tools, has resulted in a new conception of analytics in which users can dynamically explore their data while simultaneously gleaning analytic insight.

Such analytic exploration means several things: decreased time to insight and action, a democratization of big data and analytics for the users who need these technologies most, and an increased reliance on data that makes a data-centric culture more pervasive.

According to Ben Szekely, Vice President of Solutions and Pre-sales at Cambridge Semantics, it also means much more–a new understanding of the potential of analytics, which necessitates that users adopt:

“A willingness to explore their data and be a little bit daring. It is sort of a mind-bending thing to say, ‘let me just follow any relationship through my data as I’m just asking questions and doing analytics’. Most of our users, as they get in to it, they’re expanding their horizons a little bit in terms of realizing what this capability really is in front of them.”

Expanding Data Discovery to Include Analytics
In many ways, the data discovery process was widely viewed as part of the data preparation required to perform analytics. Data discovery was used to discern which data were relevant to a particular query and to solving a specific business problem. Discovery tools provided this information, which was then cleansed, transformed, and loaded into business intelligence or analytics tools to deliver insight, in a process that was typically facilitated by IT departments and exceedingly time-consuming.

However, as the self-service movement has continued to gain credence throughout the data sphere, these tools have evolved to become more dynamic and responsive. Today, any number of vendors offer tools that regularly publish the results of analytics in interactive dashboards and visualizations. These platforms enable users to manipulate those results, display them in the ways most meaningful for their objectives, and actually use those results to answer additional questions. As Szekely observed, oftentimes users are simply: “Approaching a web browser asking questions, or even using a BI or analytics tool they’re already familiar with.”

The Impact of Semantic Graphs for Exploration
The true potential for analytic exploration is realized when combining data discovery tools and visualizations with the relationship-based, semantic graph technologies that are highly effective on widespread sets of big data. By placing these data discovery platforms atop stacks predicated on an RDF graph, users are able to initiate analytics with the tools that they previously used to merely refine the results of analytics.

Szekely mentioned that: “It’s the responsibility of the toolset to make that exploration as easy as possible. It will allow them to navigate the ontology without them really knowing they’re using RDF or OWL at all…The system is just presenting it to them in a very natural and intuitive way. That’s the responsibility of the software; it’s not the responsibility of the user to try to come down to the level of RDF or OWL in any way.”

The underlying semantic components of RDF, OWL, and vocabularies and taxonomies that can link disparate sets of big data are able to contextualize that data and give it relevance for specific questions. Additionally, semantic graphs and semantic models are responsible for the upfront data integration that occurs prior to analyzing different data sets, structures, and sources. By combining data discovery tools with semantic graph technologies, users can achieve a degree of analytic depth that previously either took too long to achieve or was not possible at all.
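The article doesn't tie this to a specific toolset, but a minimal sketch of following relationships across an RDF graph might look like this in Python with the rdflib library (the vocabulary, entities, and query are invented for illustration):

```python
from rdflib import Graph, Namespace, Literal

EX = Namespace("http://example.org/")        # hypothetical vocabulary
g = Graph()

# Facts that might come from two different source systems, linked by a shared entity
g.add((EX.order42, EX.placedBy, EX.alice))
g.add((EX.order42, EX.totalAmount, Literal(129.95)))
g.add((EX.alice, EX.locatedIn, EX.nashville))

# One SPARQL query follows the relationship chain without an upfront join step
results = g.query("""
    PREFIX ex: <http://example.org/>
    SELECT ?city ?amount WHERE {
        ?order ex:placedBy ?customer ;
               ex:totalAmount ?amount .
        ?customer ex:locatedIn ?city .
    }
""")
for city, amount in results:
    print(city, amount)
```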

The Nature of Analytic Exploration
On the one hand, that degree of analytic depth is best described as the ability of lay business end users to ask many more questions of their data, in far shorter time frames, than they are accustomed to. On the other hand, the true utility of analytic exploration is realized in the types of questions those users can ask. These questions are frequently ad hoc, include time-sensitive and real-time data, and are often based on the results of previous questions and the conclusions one can draw from them.

As Szekely previously stated, the sheer freedom and depth of analytic exploration lends itself to so many possibilities on different sorts of data that it may require a period of adjustment to conceptualize and fully exploit. The possibilities enabled by analytic exploration are largely based on the visual nature of semantic graphs, particularly when combined with competitive visualization mechanisms that capitalize on the relationships they illustrate for users. According to Craig Norvell, Franz Vice President of Global Sales and Marketing, such visualizations are an integral “part of the exploration process that facilitates the meaning of the research” for which an end user might be conducting analytics.

Emphasizing the End User
Overall, analytic exploration is reliant upon the relationship-savvy, encompassing nature of semantic technologies. Additionally, it depends upon contemporary visualizations to fuse data discovery and analytics. Its trump card, however, lies in its self-service nature which is tailored for end users to gain more comfort and familiarity with the analytics process. Ultimately, that familiarity can contribute to a significantly expanded usage of analytics, which in turn results in more meaningful data driven processes from which greater amounts of value are derived.

Originally Posted at: Analytic Exploration: Where Data Discovery Meets Self-Service Big Data Analytics by jelaniharper

Data Storytelling: What’s Easy and What’s Hard

Putting data on a screen is easy. Making it meaningful is so much harder. Gathering a collection of visualizations and calling it a data story is easy (and inaccurate). Making a data-driven narrative that influences people… hard.

Here are 25 more lessons we’ve learned (the hard way) about what’s easy and what’s hard when it comes to telling data stories:

Easy: Picking a good visualization to answer a data question
Hard: Discovering the core message of your data story that will move your audience to action

Easy: Knowing who is your target audience
Hard: Knowing what motivates your target audience at a personal level by understanding their everyday frustrations and career goals

Easy: Collecting questions your audience wants to answer
Hard: Delivering answers your audience can act on

Easy: Providing flexibility to slice and dice data
Hard: Balancing flexibility with prescriptive guidance to help focus on the most important things

Easy: Labeling visualizations
Hard: Explaining the intent and meaning of visualizations

Easy: Choosing dimensions to show
Hard: Choosing the right metrics to show

Easy: Getting an export of the data you need
Hard: Restructuring data for high-performance analytical queries

Easy: Discovering inconsistencies in your data
Hard: Fixing those inconsistencies

Easy: Designing a data story with a fixed data set
Hard: Designing a data story where the data changes

Easy: Categorical dimensions
Hard: Dates

Easy: Showing data values within expected ranges
Hard: Dealing with null values

Easy: Determining formats for data fields
Hard: Writing a human-readable definition of data fields

Easy: Getting people interested in analytics and visualization
Hard: Getting people to use data regularly in their job

Easy: Picking theme colors
Hard: Using colors judiciously and with meaning

Easy: Setting the context for your story
Hard: Creating intrigue and suspense to move people past the introduction

Easy: Showing selections in a visualization
Hard: Carrying those selections through the duration of the story

Easy: Creating a long, shaggy data story
Hard: Creating a concise, meaningful data story
 
Easy: Adding more data
Hard: Cutting out unnecessary data

Easy: Serving one audience
Hard: Serving multiple audiences to enable new kinds of discussions

Easy: Helping people find insights
Hard: Explaining what to do about those insights

Easy: Explaining data to experts
Hard: Explaining data to novices

Easy: Building a predictive model
Hard: Convincing people they should trust your predictive model

Easy: Visual mock-ups with stubbed-in data
Hard: Visual mock-ups that support real-world data

Easy: Building a visualization tool
Hard: Building a data storytelling tool

Schedule a demo

Source by analyticsweek

“What if the data tells you something you don’t like?” Three potential big data pitfalls

Big data is likely to quickly become big business. The ability to isolate the nuggets of insight inside the huge volumes of structured and unstructured data hoarded by most businesses could improve customer service, make processes more efficient and cut costs.

According to analyst firm Gartner, adoption of big data is still at a very early stage: just eight percent of companies have initiatives up and running, 20 percent are piloting and experimenting, 18 percent are ‘developing a strategy’, 19 percent are ‘knowledge gathering’, while the remainder have no plans or don’t know. But that could change rapidly: the firm is predicting 4.4 million people will be working on such projects within two years, while a recent survey by Harvey Nash found that four out of ten CIOs planned to increase their investments in the next year.

However, because big data uses untested technologies and skills that are in short supply inside most organisations, there are a number of hurdles for organisations seeking to exploit it:

1. Letting politics derail your big data project before it gets moving

Getting a big data initiative up and running might be one of the hardest parts of the project because the tech team and the rest of the business may have different ideas about what the goals should be, warn tech chiefs consulted by ZDNet: a big data project run solely by IT may fail because it’s unconnected to the needs of the business, for example, while a badly articulated request from the marketing department may leave IT confused about what to deliver.

As Rohit Killam, CTO at Masan Group points out: “The real bottleneck is conceptualising a value-driven big data programme with [the] right stakeholders,” while Duncan James, infrastructure manager at Clarion Solicitors notes: “Understanding what the business requires is the hardest part, especially if the business can’t articulate what it wants in the first place.”

In many organisations, whenever you want to do any project there has to be a business case before there can be any budget, says Frank Buytendijk, research vice president at Gartner.

“That is how organisations work and think, which is great for anything established — but for anything innovative that is really hard because the whole point of playing around with the technology is trying to figure out what it does for you. This is not unique to big data, but big data suffers from it as well.”

According to Buytendijk, big data projects don’t have to cost a lot, thanks to the availability of open-source tools. As a result, these projects can be used as a low-risk way to explore an organisation’s big data strategy. “The business case should not be the starting point; the business case should be the outcome, and it’s realising this that creates the right conversation within businesses,” he told ZDNet.

2. The big data skills crisis

According to the Harvey Nash CIO Survey carried out earlier this year, one in four CIOs reported difficulty in finding staff for big data projects. This is compounded by the complex array of skills needed for these projects, which are often outside of the standard skillset offered by the in-house tech team, according to tech chiefs canvassed by ZDNet.

“A shortage of big data skills doesn’t hold back big data projects, but it does have implications for the success factors and execution of the projects. There is certainly growth in demand for this area of skillset,” says Clarion’s Duncan James. Brian Wells, associate VP health technology and academic computing at Penn Medicine, adds that this is an issue in areas related to interpreting results and developing analytical hypotheses.

“Skills has been an issue from the beginning, and this will remain an issue for the foreseeable future,” says Gartner’s Buytendijk. “How do you find people who have a background in econometrics, statistics and mathematics, and who know how to programme in modern environments and have business sense, because big data analytics is all about interpreting context, why something is happening in a certain context. This skillset is really, really hard to find.”

One problem is that big data requires inductive rather than deductive thinking, whereas most IT organisations are good at deductive thinking: inductive thinking — using data to create likely connections — is a little outside their usual way of working.

Another problem is that big data technologies are very programming-intensive: while the typical ratio of software to implementation effort on a project is 1 to 5, in big data that has leapt to 1 to 25, as these tools are not very user-friendly, don’t integrate with other tools, and won’t for a number of years.

Not all tech chiefs agree on this, though: “I think the complexity of big data is way overrated,” maintains John Gracyalny, VP IT at SafeAmerica Credit Union. “We just kicked off a project to build a data warehouse/analytics tool internally. We only have a four-person IT department. I’m providing the ‘vision thing’ and database design, my newest guy is writing the code to handle external data extracts and imports, and my right hand will integrate an off-the-shelf reporting tool.”

3. The looming governance headache

When organisations start dredging through their digital detritus, they risk discovering information they might wish had remained buried. Consequently, they need to have some governance in place before they start delving into the huge piles of customer transactions and other data they’ve been storing.

For example, last year a New York Times story revealed how a retailer could use shopping patterns to spot when a customer was pregnant and offer them money-off vouchers — and how to do it without making them feel they were being watched. Organisations must therefore beware of combining their own data with third-party data in ways that may lead them to discover things about customers that those customers would not wish them to know.

As Gartner’s Buytendijk puts it: “If you start to work inductively, you let the data talk: what if the data tells you something you don’t like?”.

“Big data answers questions that weren’t even asked, and that can be quite embarrassing — so how do you create a governance situation with a sandbox with big walls where you contain things you don’t want the organisation to know?”.

According to Buytendijk, organisations need some kind of governance that shields them from over-using (and oversharing) the fruits of big data: “In lots of countries there have been reputational issues around big data being too clever for its own good. With great power comes great responsibility,” he warns.

Originally posted via “What if the data tells you something you don’t like?” Three potential big data pitfalls.

Source: “What if the data tells you something you don’t like?” Three potential big data pitfalls by analyticsweekpick

Let’s Meet Up at the Nashville Analytics Summit

The Nashville Analytics Summit will be upon us before we know it. This special gathering of data and analytics professionals is scheduled for August 20th and 21st, and should be bigger and better than ever. From my first experience with the Summit in 2014, it has consistently been a highlight of my year. My first Summit took place at the Lipscomb Spark Center meeting space with about a hundred attendees. Just a few years later, we’d grown to more than 450 attendees and moved into the Omni Hotel.

Mark it on your calendar. I’ll give you five reasons why it is a can’t-miss event if you work with data:

  1. We’ve invited world-renowned keynote speakers like Stephen Few and Thomas Davenport. You won’t believe who we are planning to bring in this year.
  2. There isn’t a better networking event for analytics professionals in our region. Whether you’re looking for talent or looking for the next step in your career, you’ll meet kindred spirits, data lovers, and innovative businesses. For two years in a row, we have hired Juice interns directly from conversations at the Summit. 
  3. It’s for everyone who works with data. Analyst, Chief Data Officer, or Data Scientist… we’ve got you covered. There are technical workshops and presentations for the hands-on practitioner and case studies and management strategies for the executive. We’re committed to bringing you quality and diverse content.
  4. It’s a “Goldilocks” conference. Some conferences go on for days. Some conferences are a sea of people, or too small to expand your horizons. The Analytics Summit is two days, 500-something people, and conveniently located in the cosy confines of the Omni Hotel. It is easy to meet new people and connect with people you know.
  5. See what’s happening. Nashville has a core of companies committed to building a special and innovative analytics community. We have innovators like Digital Reasoning, Stratasan, and Juice Analytics. We have larger companies making a deep commitment to analytics like Asurion, HCA, and Nissan. The Summit is the best chance to see the state of our thriving analytics community.

Now that you’re convinced you can’t miss out, you may wonder what to do next. First, block out your calendar (August 20 and 21). Next, find a colleague you’d like to go with. Want to be even more involved? We invited dozens of local professionals to speak at the Summit. You can submit a proposal to present.

Finally, if you don’t want your company to miss out on the opportunity to reach our entire analytics community, there are still slots for sponsors.

I hope to see you there.

learn more and register

Originally Posted at: Let’s Meet Up at the Nashville Analytics Summit by analyticsweek

The Future Of Big Data Looks Like Streaming

Big data is big news, but it’s still in its infancy. While most enterprises at least talk about launching Big Data projects, the reality is that very few do in any significant way. In fact, according to new survey data from Dimensional, while 91% of corporate data professionals have considered investment in Big Data, only 5% actually put any investment into a deployment, and only 11% even had a pilot in place.

Real Time Gets Real

ReadWrite: Hadoop has been all about batch processing, but the new world of streaming analytics is all about real time and involves a different stack of technologies.

Langseth: Yes, however I would not entangle the concepts of real-time and streaming. Real-time data is obviously best handled as a stream. But it’s possible to stream historical data as well, just as your DVR can stream Gone with the Wind or last week’s American Idol to your TV.

This distinction is important, as we at Zoomdata believe that analyzing data as a stream adds huge scalability and flexibility benefits, regardless of whether the data is real-time or historical.

RW: So what are the components of this new stack? And how is this new big data stack impacting enterprise plans?

JL: The new stack is in some ways an extension of the old stack, and in some ways really new.

Data has always started its life as a stream. A stream of transactions in a point-of-sale system. A stream of stocks being bought and sold. A stream of agricultural goods being traded for valuable metals in Mesopotamia.

Traditional ETL processes would batch that data up and kill its stream nature. They did so because the data could not be transported as a stream; it needed to be loaded onto removable disks and tapes to be carried from place to place.

But now it is possible to take streams from their sources, through any enrichment or transformation processes, through analytical systems, and into the data’s “final resting place”—all as a stream. There is no real need to batch up data given today’s modern architectures such as Kafka and Kinesis, modern data stores such as MongoDB, Cassandra, Hbase, and DynamoDB (which can accept and store data as a stream), and modern business intelligence tools like the ones we make at Zoomdata that are able to process and visualize these streams as well as historical data, in a very seamless way.

Just like your home DVR can play live TV, rewind a few minutes or hours, or play movies from last century, the same is possible with data analysis tools like Zoomdata that treat time as a fluid.
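As a rough illustration of that end-to-end streaming idea (not a description of Zoomdata's product), a minimal producer and consumer in Python with the kafka-python package might look like this; the broker address and topic name are placeholders:

```python
import json
from kafka import KafkaProducer, KafkaConsumer

# Producer: events enter the pipeline as a stream, never batched to disk first
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("pos-transactions", {"sku": "A123", "amount": 19.99})
producer.flush()

# Consumer: a downstream analytics service reads the same stream, live or replayed
consumer = KafkaConsumer(
    "pos-transactions",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",   # replaying history is just reading from offset 0
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
for message in consumer:
    print(message.value)            # feed a dashboard, a model, or a data store
```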

Throw That Batch In The Stream

Also, we believe that those who have proposed a “Lambda Architecture,” effectively separating paths for real-time and batched data, are espousing an unnecessary trade-off, optimized for legacy tooling that simply wasn’t engineered to handle streams of data, be they historical or real-time.

At Zoomdata we believe that it is not necessary to separate-track real-time and historical, as there is now end-to-end tooling that can handle both from sourcing, to transport, to storage, to analysis and visualization.

RW: So this shift toward streaming data is real, and not hype?

JL: It’s real. It’s affecting modern deployments right now, as architects realize that it isn’t necessary to ever batch up data, at all, if it can be handled as a stream end-to-end. This massively simplifies Big Data architectures if you don’t need to worry about batch windows, recovering from batch process failures, etc.

So again, even if you don’t need to analyze data from five seconds or even five minutes ago to make business decisions, it still may be simplest and easiest to handle the data as a stream. This is a radical departure from the way things in big data have been done before, as Hadoop encouraged batch thinking.

But it is much easier to just handle data as a stream, even if you don’t care at all—or perhaps not yet—about real-time analysis.

RW: So is streaming analytics what Big Data really means?

JL: Yes. Data is just like water, or electricity. You can put water in bottles, or electricity in batteries, and ship them around the world by planes, trains, and automobiles. For some liquids, such as Dom Perignon, this makes sense. For other liquids, and for electricity, it makes sense to deliver them as a stream through wires or pipes. It’s simply more efficient if you don’t need to worry about batching it up and dealing with it in batches.

Data is very similar. It’s easier to stream big data end-to-end than it is to bottle it up.

Article originally appeared HERE.

Source by analyticsweekpick

Statistics: Is This Big Data’s Biggest Hurdle?

Big Data is less about the data itself and more about what you do with the data. The application of statistics and statistical principles on the data helps you extract the information it contains. According to Wikipedia, statistics is the study of the collection, analysis, interpretation, presentation, and organization of data. The American Statistical Association defines statistics as “the science of learning from data, and of measuring, controlling, and communicating uncertainty.”

Statistics is considered to be one of the three primary pillars of the field of data science (the other two are content domain knowledge and computer science skills). While content domain expertise provides the context through which you identify the relevant questions to ask, and computer science skills help you access the relevant data and prepare it for analysis, statistics helps you interrogate that data to answer your questions.

The Rise of Statistics

We have a lot of data and are generating a lot more of it. IDC says that we created 2.8 zettabytes in 2012 and estimates that the number will grow to 40 zettabytes by 2020. It’s not surprising that in 2009 Hal Varian, chief economist at Google, said that “the sexy job in the next 10 years will be statisticians.” Statistics, after all, helps make sense of and get insight from data. The importance of statistics and statistical thinking in our datafied world can also be found in this excellent slideshare by Diego Kuonen, a statistician.

Figure 1. The Hottest Skill on LinkedIn in 2014: Statistical Analysis and Data Mining

Statistical skills are receiving increasing attention in the world of business and education. LinkedIn found that statistical analysis and data mining was the hottest skill in 2014 (see Figure 1).

Many companies are pursuing statistics-savvy people to help them make sense of their quickly expanding, ever-growing, complex data. Job postings on Indeed show that the number of data science jobs continues to grow (see Figure 2).

Figure 2. Indeed job trends: growth rate for data science jobs continues to increase.

University students are flocking to the field of statistics. Of the STEM Professions, statistics has been the fastest growing undergraduate degree over the past four years (see Figure 3).

Figure 3. Of the STEM fields, statistics has the highest growth rate in undergraduate degrees.

The Fall of Statistics

The value of statistics is evident in the growing number of statistics degrees and in the Big Data jobs requiring statistical skills. These are encouraging headlines, no doubt, as more businesses are adopting what scientists have been using to solve problems for decades. But here are a few troubling trends that need to be considered in our world of Big Data.

McKinsey estimates that the US faces a shortage of up to 190,000 people with analytics expertise to fill these data science jobs as well as a shortage of 1.5 million people to fill managerial and analyst jobs who can understand and make decisions based on the data. Where will we find these statistics-savvy people to fill the jobs of tomorrow? We may have to look outside the US.

Figure 4. The USA ranks 27th in the world in math literacy of 15-year-old students.

In a worldwide study of 15-year-old students’ reading, mathematics, and science literacy (the Program for International Student Assessment, PISA), researchers found that US teenagers ranked 27th out of 34 countries in math literacy (see Figure 4), with many countries scoring significantly higher than the US. According to the NY Times, while 13% of students across industrialized nations reached the top two levels of proficiency in math, just 9% of US students did. In comparison, 55% of students from Shanghai reached that level of proficiency. In Singapore, 40% did.

Even the general US public is showing a decreased interest in statistics. Using Google Trends, I looked at the popularity of the term “statistics” among the general US public, comparing it with “analytics” and “big data.” While the number of searches for “big data” and “analytics” has increased, the number of searches for “statistics” has decreased steadily since 2004.

Summary and Major Trends

Statistics is the science of learning from data. Statistics and statistical thinking helps people understand the importance of data collection, analysis, interpretation and reporting of results.

In our Big Data world, statistical skills are becoming increasingly important for businesses. Companies are creating analytics-intensive jobs for statistics-savvy people, and universities are churning out more graduates with statistics degrees. On the other hand, there is expected to be a huge talent gap in the analytics industry. Additionally, the math literacy of US students is very low compared to the rest of the world. Finally, the US general public’s interest in statistics has been decreasing steadily for about a decade.

Knowledge of Statistics Needed for Both Analysts and Consumers

Statistics and statistical knowledge are not just for people who analyze data. They are also for people who consume, interpret, and make decisions based on the analysis of those data. Think of the data from wearable devices, home monitoring systems, and health records and how they are turned into reports for fitness buffs, homeowners, and patients. Think of CRM systems, customer surveys, social media posts, and review sites and how dashboards are created to help front-line employees make better decisions to improve the customer experience.

The better the grasp of statistics people have, the more insight, value, and use they will get from the data. In a recent study, I found that customer experience professionals had difficulty estimating the size of customer segments based on customer survey metrics. Even though these professionals commonly use customer survey metrics to understand their customers, they showed extreme bias when solving this relatively simple problem. I assert that they would likely make fewer errors if they understood statistics.
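The study's exact task isn't reproduced here, but the kind of estimate involved, the share of customers in a segment implied by a survey metric along with its uncertainty, takes only a few lines of basic statistics (the scores below are simulated):

```python
import numpy as np
from scipy import stats

# Hypothetical survey: 0-10 "likelihood to recommend" scores from 400 customers
rng = np.random.default_rng(3)
scores = np.clip(np.round(rng.normal(7.5, 2.0, 400)), 0, 10)

# Size of the segment scoring 9 or 10, with a normal-approximation 95% interval
p_hat = np.mean(scores >= 9)
se = np.sqrt(p_hat * (1 - p_hat) / len(scores))
z = stats.norm.ppf(0.975)
print(f"Segment share: {p_hat:.1%} (95% CI {p_hat - z * se:.1%} to {p_hat + z * se:.1%})")
```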

To get value from the data, you need to make sense of it, do something with it. How you do that is through statistics and applying statistical thinking to your data. Statistics is a way of helping people get value from their data. As the number of things that get quantified (e.g., data) continues to grow, so will the value of statistics.

The Most Important Thing People Need to Know about Statistics

Statistics is the language of data. Like knowledge of your native language helps you maneuver in the world of words, statistics will help you maneuver in the world of data. As the world around us becomes more quantified, statistical skills will become more and more essential in our daily lives. If you want to make sense of our data-intensive world, you will need to understand statistics.

I’m not saying that everyone needs an in-depth knowledge of statistics, but I do believe that everybody would benefit from knowing basic statistical concepts and principles. What is the most important thing you think people need to know about statistics and why? I would love to hear your answers in the comments section. Here is my take on this question.

 

Source by bobehayes