### [ COVER OF THE WEEK ]

**Trust the data** Source

### [ AnalyticsWeek BYTES]

**>>** April 3, 2017 Health and Biotech analytics news roundup by pstein

**>>** CEOs to Employees – Vote for Romney else Face Layoffs. A Good Strategy? by v1shal

**>>** 8 big trends in big data analytics by analyticsweekpick

### [ NEWS BYTES]

**>>** The Hybrid Cloud Depends on Solid Networking – EnterpriseNetworkingPlanet (blog) Under Hybrid Cloud

**>>** Hoteliers witness revenue surge by four pct with DJUBO adoption – Yahoo News Under Sales Analytics

**>>** FX Volatility Focused on Weak USD As JPY Firms; EUR Pushes To 1.2000 – DailyFX Under Sentiment Analysis

### [ FEATURED COURSE]

### [ FEATURED READ]

**Introduction to Graph Theory (Dover Books on Mathematics)**

### [ TIPS & TRICKS OF THE WEEK]

**Data aids judgement; it does not replace it**

Data is a tool: a means of building consensus and facilitating human decision-making, not a replacement for it. Analysis converts data into information, and information placed in context leads to insight. Insights lead to decisions, which ultimately lead to outcomes that bring value. So data is just the start; context and intuition also play a role.

### [ DATA SCIENCE Q&A]

**Q: Explain what a long-tailed distribution is and provide three examples of relevant phenomena that have long tails. Why are they important in classification and regression problems?**

A:

* In long-tailed distributions, a high-frequency population is followed by a low-frequency population, which gradually tails off asymptotically

* Rule of thumb: the majority of occurrences (more than half, and 80% when the Pareto principle applies) are accounted for by the first 20% of items in the distribution

* The least frequently occurring 80% of items are more important as a proportion of the total population

* Related concepts: Zipf’s law, Pareto distribution, power laws
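The 80/20 rule of thumb can be checked in closed form. The sketch below assumes a Pareto (Type I) distribution with shape `alpha > 1`, for which the share of the total held by the top fraction `p` of items is `p ** (1 - 1/alpha)`. The function name `top_share` and the chosen shape values are illustrative only.

```python
import math

def top_share(p, alpha):
    """Share of the total held by the top fraction p of items
    under a Pareto (Type I) distribution with shape alpha > 1."""
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")
    if alpha <= 1:
        raise ValueError("alpha must be > 1 for a finite mean")
    return p ** (1 - 1 / alpha)

# Shape for which the top 20% hold 80% of the total: alpha = log(5) / log(4)
alpha_80_20 = math.log(5) / math.log(4)

# Compare the 80/20 shape with a few lighter-tailed (larger alpha) cases
for alpha in (alpha_80_20, 1.5, 2.0, 3.0):
    print(f"alpha={alpha:.3f}: top 20% hold {top_share(0.2, alpha):.1%}")
```

Smaller values of `alpha` mean a heavier tail and a larger share concentrated in the top items.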

Examples:

1. Natural language

– Given some corpus of natural language, the frequency of any word is inversely proportional to its rank in the frequency table

– The most frequent word will occur twice as often as the second most frequent, three times as often as the third most frequent, and so on

– ‘The’ accounts for 7% of all word occurrences (70,000 out of 1 million)

– ‘Of’ accounts for 3.5%, followed by ‘and’

– Only 135 vocabulary items are needed to account for half the English corpus!

2. Allocation of wealth among individuals: the larger portion of the wealth of any society is controlled by a smaller percentage of the people

3. File size distribution of Internet traffic

Additional: Hard disk error rates, values of oil reserves in a field (a few large fields, many small ones), sizes of sand particles, sizes of meteorites
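The natural-language example follows Zipf’s law, which is easy to explore numerically. As a minimal sketch, assume an idealized Zipf distribution with exponent 1 over a hypothetical vocabulary of `vocab_size` words (both are assumptions; real corpora follow the law only approximately). The code computes relative frequencies by rank and counts how many top-ranked words cover half of all occurrences.

```python
def zipf_frequencies(vocab_size, s=1.0):
    """Relative frequency of each rank 1..vocab_size under Zipf's law."""
    weights = [1 / rank ** s for rank in range(1, vocab_size + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def ranks_to_cover(freqs, target=0.5):
    """Smallest number of top-ranked items whose cumulative share reaches target."""
    cumulative = 0.0
    for count, f in enumerate(freqs, start=1):
        cumulative += f
        if cumulative >= target:
            return count
    return len(freqs)

# vocab_size is an arbitrary illustrative choice, not a corpus statistic
freqs = zipf_frequencies(vocab_size=50_000)
print(f"rank 1 / rank 2 = {freqs[0] / freqs[1]:.2f}")
print(f"rank 1 / rank 3 = {freqs[0] / freqs[2]:.2f}")
print(f"share of rank 1 = {freqs[0]:.1%}")
print(f"ranks needed for half of all occurrences: {ranks_to_cover(freqs)}")
```

The exact numbers depend on the assumed vocabulary size and exponent, so they will not reproduce the corpus figures quoted above; the point is that a tiny fraction of the ranks covers half the mass.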

Importance in classification and regression problems:

– Skewed distribution

– Which metrics to use? Accuracy paradox (classification), F-score, AUC

– Issue when using models that assume linearity (linear regression): need to apply a monotone transformation to the data (logarithm, square root, sigmoid function)

– Issue when sampling: your data becomes even more unbalanced! Use stratified sampling instead of random sampling, SMOTE (‘Synthetic Minority Over-sampling Technique’, N. V. Chawla), or an anomaly detection approach
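Two of these points, the accuracy paradox and stratified sampling, can be illustrated with a small standard-library sketch. The data is hypothetical (990 negatives, 10 positives), and the helper names `accuracy`, `f1_score` and `stratified_sample` are defined here for illustration rather than taken from any library.

```python
import random
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1_score(y_true, y_pred, positive=1):
    """F1 score for the positive class; 0.0 when there are no true positives."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def stratified_sample(labels, fraction, seed=0):
    """Sample indices separately within each class to preserve class proportions."""
    rng = random.Random(seed)
    by_class = {}
    for i, label in enumerate(labels):
        by_class.setdefault(label, []).append(i)
    chosen = []
    for indices in by_class.values():
        k = max(1, round(len(indices) * fraction))  # keep at least one per class
        chosen.extend(rng.sample(indices, k))
    return sorted(chosen)

# Hypothetical imbalanced data: 990 negatives (0) and 10 positives (1)
y_true = [0] * 990 + [1] * 10
always_negative = [0] * len(y_true)  # a "model" that never predicts the rare class

print("accuracy:", accuracy(y_true, always_negative))
print("F1 (positive class):", f1_score(y_true, always_negative))

sample = stratified_sample(y_true, fraction=0.1)
print("class counts in stratified sample:", Counter(y_true[i] for i in sample))
```

The trivial classifier scores high on accuracy while never detecting the rare class, which is why F-score or AUC is preferred on skewed data. Sampling within each class guarantees the rare class still appears in the subsample.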

**Source**

### [ VIDEO OF THE WEEK]

@AnalyticsWeek Panel Discussion: Marketing Analytics

### [ QUOTE OF THE WEEK]

It’s easy to lie with statistics. It’s hard to tell the truth without statistics. – Andrejs Dunkels

### [ PODCAST OF THE WEEK]

#BigData @AnalyticsWeek #FutureOfData #Podcast with John Young, @Epsilonmktg

### [ FACT OF THE WEEK]

14.9 percent of marketers polled in Crain’s BtoB Magazine are still wondering ‘What is Big Data?’