[ COVER OF THE WEEK ]
Weak data
[ LOCAL EVENTS & SESSIONS]
- Jun 11, 2019 #WEB Webinar – HR Analytics made easy with Tableau
- Jun 06, 2019 #WEB Data Engineering & Analytics Engineering -Python, DB and App.Devs-Global Webinar
- Jun 11, 2019 #WEB Intro to Git & GitHub: Your Coding Safety Net
[ FEATURED COURSE]
Statistical Thinking and Data Analysis
[ FEATURED READ]
Storytelling with Data: A Data Visualization Guide for Business Professionals
[ TIPS & TRICKS OF THE WEEK]
Winter is coming, warm your Analytics Club
Yes and yes! As we head into winter, what better time to talk about our increasing dependence on data analytics for decision making? Data- and analytics-driven decision making is rapidly working its way into our core corporate DNA, and we are not building practice grounds to test those models fast enough. Such snug-looking models have hidden nails that could cause uncharted pain if they go unchecked. This is the right time to start thinking about setting up an Analytics Club [Data Analytics CoE] in your workplace to lab out best practices and provide a test environment for those models.
[ DATA SCIENCE Q&A]
Q: What is an outlier? Explain how you might screen for outliers and what you would do if you found them in your dataset. Also, explain what an inlier is, how you might screen for inliers, and what you would do if you found them in your dataset.
A: Outliers:
– An observation point that is distant from other observations
– Can occur by chance in any distribution
– Often, they indicate measurement error or a heavy-tailed distribution
– Measurement error: discard them or use robust statistics
– Heavy-tailed distribution: high skewness, can't use tools that assume a normal distribution
– Three-sigma rule (normally distributed data): about 1 in 22 observations will differ from the mean by twice the standard deviation or more
– Three-sigma rule: about 1 in 370 observations will differ from the mean by three times the standard deviation or more
Three-sigma rule example: in a sample of 1000 observations, the presence of up to 5 observations deviating from the mean by more than three times the standard deviation is within the range of what can be expected, being less than twice the expected number of about 3 (Poisson distribution).
If the nature of the distribution is known a priori, it is possible to check whether the number of outliers deviates significantly from what can be expected. For a given cutoff (samples fall beyond the cutoff with probability p), the number of outliers in n samples can be approximated by a Poisson distribution with lambda = pn. Example: for a normal distribution with a cutoff 3 standard deviations from the mean, p is about 0.3%, so in a sample of n = 1000 the number of observations deviating by more than 3 sigmas is approximately Poisson with lambda = 3.
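The Poisson approximation above can be sketched in a few lines of Python. This is only an illustration under the stated normality assumption; the function names (two_sided_tail_prob, poisson_pmf, prob_more_than) are hypothetical helpers, and n = 1000 is taken from the example. Note that the exact tail probability gives lambda closer to 2.7 than the rounded value of 3.

```python
import math

def two_sided_tail_prob(k: float) -> float:
    """P(|Z| > k) for a standard normal variable Z."""
    return math.erfc(k / math.sqrt(2))

def poisson_pmf(count: int, lam: float) -> float:
    """P(X = count) for X ~ Poisson(lam)."""
    return math.exp(-lam) * lam ** count / math.factorial(count)

def prob_more_than(count: int, lam: float) -> float:
    """P(X > count) for X ~ Poisson(lam)."""
    return 1.0 - sum(poisson_pmf(i, lam) for i in range(count + 1))

n = 1000                              # sample size from the example
p = two_sided_tail_prob(3.0)          # chance of falling beyond 3 sigma
lam = n * p                           # expected number of such observations
tail = prob_more_than(5, lam)         # chance of seeing more than 5 of them

print(f"p = {p:.4f} (about 1 in {1 / p:.0f})")
print(f"lambda = {lam:.2f}")
print(f"P(more than 5 beyond 3 sigma) = {tail:.3f}")
```

If the observed count lands far in the tail of this Poisson distribution, that suggests the data are not as normal as assumed, or that something went wrong in measurement.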
Identifying outliers:
– No rigid mathematical method
– Subjective exercise: be careful
– Boxplots
– QQ plots (sample quantiles Vs theoretical quantiles)
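A boxplot flags points using quartile-based fences, and that rule can be applied without drawing anything. The sketch below assumes the common 1.5 x IQR convention (a convention, not a rigid rule, in keeping with the caveat above); the function name iqr_outliers and the sample readings are made up for illustration.

```python
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Return the values lying outside the boxplot fences Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = quantiles(values, n=4)   # quartiles of the sample
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

# Hypothetical readings with one suspicious value.
readings = [10, 12, 11, 13, 12, 11, 14, 12, 13, 40]
flagged = iqr_outliers(readings)
print(flagged)
```

Flagged points are candidates for inspection, not automatic deletions; what to do with them depends on the cause.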
Handling outliers:
– Depends on the cause
– Retention: when the underlying model is confidently known
– Regression problems: only exclude points that exhibit a large degree of influence on the estimated coefficients (Cook's distance)
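As a rough illustration of the regression case, Cook's distance can be computed by hand for a one-predictor least-squares fit. This is a minimal sketch, not a replacement for a statistics package; the function name cooks_distance_simple and the data are hypothetical, and the last point is deliberately placed far from the others and off the trend.

```python
from statistics import mean

def cooks_distance_simple(x, y):
    """Cook's distance of each point for the fit y = intercept + slope * x."""
    n = len(x)
    p = 2  # number of fitted parameters: intercept and slope
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    mse = sum(r * r for r in residuals) / (n - p)
    # Leverage of each point in simple linear regression.
    leverage = [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]
    return [
        (r * r / (p * mse)) * (h / (1 - h) ** 2)
        for r, h in zip(residuals, leverage)
    ]

# Hypothetical data: eight points near y = 2x + 1, plus one distant point off the trend.
x = [1, 2, 3, 4, 5, 6, 7, 8, 20]
y = [3.1, 4.9, 7.2, 8.8, 11.1, 12.9, 15.2, 16.8, 20.0]
distances = cooks_distance_simple(x, y)
most_influential = max(range(len(x)), key=lambda i: distances[i])
print(most_influential, round(distances[most_influential], 2))
```

A point with a much larger Cook's distance than the rest is the kind of observation worth examining before deciding whether to exclude it.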
Inlier:
– Observation lying within the general distribution of other observed values
– Doesn't perturb the results but is non-conforming and unusual
– Simple example: observation recorded in the wrong unit (°F instead of °C)
Identifying inliers:
– Mahalanobis distance
– Used to calculate the distance between two random vectors
– Difference with Euclidean distance: accounts for correlations
– Once identified, discard them
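To show why accounting for correlations matters, here is a small two-variable sketch of the Mahalanobis distance from the sample mean. The function name mahalanobis_2d and all data are hypothetical; the candidate point (3, 7) sits inside the observed range of each variable on its own but breaks the strong positive relationship between them.

```python
from statistics import mean

def mahalanobis_2d(point, xs, ys):
    """Mahalanobis distance of a 2-D point from the mean of the sample (xs, ys)."""
    n = len(xs)
    mx, my = mean(xs), mean(ys)
    # Sample variances and covariance.
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    det = sxx * syy - sxy * sxy          # determinant of the covariance matrix
    dx, dy = point[0] - mx, point[1] - my
    # Quadratic form d' * inverse(covariance) * d, written out for the 2x2 case.
    d2 = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
    return d2 ** 0.5

# Hypothetical, strongly correlated reference data.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [1.2, 1.9, 3.1, 4.2, 4.8, 6.1, 7.0, 8.1, 8.9]

d_typical = mahalanobis_2d((5, 5.1), xs, ys)   # follows the pattern
d_inlier = mahalanobis_2d((3, 7), xs, ys)      # within both ranges, off the pattern
print(round(d_typical, 2), round(d_inlier, 2))
```

A per-variable range check or a plain Euclidean distance would not single out (3, 7); the Mahalanobis distance does because it measures deviation relative to the joint spread of the data.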
[ VIDEO OF THE WEEK]
@AnalyticsWeek Panel Discussion: Big Data Analytics
[ QUOTE OF THE WEEK]
"Data is not information, information is not knowledge, knowledge is not understanding, understanding is not wisdom." - Clifford Stoll
[ PODCAST OF THE WEEK]
#BigData @AnalyticsWeek #FutureOfData #Podcast with Eloy Sasot, News Corp
[ FACT OF THE WEEK]
The U.S. Library of Congress had collected 235 terabytes of data by April 2011.