[ COVER OF THE WEEK ]
SQL Database Source
[ AnalyticsWeek BYTES]
>> Getting a 360° View of the Customer – Interview with Mark Myers of IBM by bobehayes
>> The Blueprint for Becoming Data Driven: Data Quality by jelaniharper
>> May 04, 17: #AnalyticsClub #Newsletter (Events, Tips, News & more..) by admin
[ NEWS BYTES]
>> Why Google’s Artificial Intelligence Confused a Turtle for a Rifle – Fortune Under Artificial Intelligence
>> Microsoft Workplace Analytics helps managers understand worker … – TechCrunch Under Analytics
>> Storytelling - Two Essentials for Customer Experience Professionals – Customer Think Under Customer Experience
[ FEATURED COURSE]
[ FEATURED READ]
Data Science for Business: What You Need to Know about Data Mining and Data-Analytic Thinking
[ TIPS & TRICKS OF THE WEEK]
Strong business case could save your project
As with most things in corporate culture, a project is often more about the business than the technology. The same holds for data analysis: it is not always about the technical details but about the business implications. Success criteria for a data science project should therefore include project management success criteria as well. This helps ensure smooth adoption, easy buy-in, room for wins, and cooperative stakeholders. A good data scientist should thus also have some of the qualities of a good project manager.
[ DATA SCIENCE Q&A]
Q: How to clean data?
A: 1. First: detect anomalies and contradictions
Common issues:
* Tidy data (see Hadley Wickham's paper). Typical violations:
  * column names are values, not variable names, e.g. 26-45
  * multiple variables are stored in one column, e.g. m1534 (male, 15-34 years old)
  * variables are stored in both rows and columns, e.g. tmax and tmin in the same column
  * multiple types of observational units are stored in the same table, e.g. a song dataset and a rank dataset in the same table
  * a single observational unit is stored in multiple tables (these can be combined)
* Data-type constraints: values in a particular column must be of a particular type: integer, numeric, factor, boolean
* Range constraints: numbers or dates must fall within a certain range, with minimum and maximum permissible values
* Mandatory constraints: certain columns can't be empty
* Unique constraints: a field must be unique across a dataset, e.g. no two people can share the same SS number
* Set-membership constraints: the values in a column must come from a set of discrete values or codes, e.g. gender must be female or male
* Regular expression patterns: for example, a phone number may be required to have the pattern (999)999-9999
* Misspellings
* Missing values
* Outliers
* Cross-field validation: certain conditions that involve multiple fields must hold. For instance, in laboratory medicine, the components of the differential white blood cell count must sum to 100 (they are all percentages). In a hospital database, a patient's date of discharge can't be earlier than the admission date
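Several of the checks listed above can be expressed as boolean masks over a table. The following is only a minimal sketch using pandas: the `records` table, its column names, the 0-120 age bounds, and the `find_violations` helper are all made up for illustration and are not part of the original answer.

```python
import pandas as pd

# Hypothetical patient records; every column name and value is invented.
records = pd.DataFrame({
    "ssn": ["111-11-1111", "222-22-2222", "222-22-2222", None],
    "gender": ["female", "male", "unknown", "male"],
    "age": [34, 151, 28, 45],
    "phone": ["(555)123-4567", "555-123-4567", "(555)987-6543", "(555)000-1111"],
    "admitted": pd.to_datetime(
        ["2017-05-01", "2017-05-03", "2017-05-04", "2017-05-06"]),
    "discharged": pd.to_datetime(
        ["2017-05-02", "2017-05-01", "2017-05-08", "2017-05-07"]),
})


def find_violations(df: pd.DataFrame) -> pd.DataFrame:
    """Return one boolean column per check; True marks a row that fails it."""
    checks = pd.DataFrame(index=df.index)
    # Mandatory constraint: ssn can't be empty.
    checks["ssn_missing"] = df["ssn"].isna()
    # Unique constraint: flag every row sharing a non-null ssn with another row.
    checks["ssn_duplicated"] = df["ssn"].notna() & df["ssn"].duplicated(keep=False)
    # Range constraint: assumed permissible ages are 0 to 120 inclusive.
    checks["age_out_of_range"] = ~df["age"].between(0, 120)
    # Set-membership constraint: only the two codes named in the text.
    checks["gender_invalid"] = ~df["gender"].isin(["female", "male"])
    # Regular expression pattern: (999)999-9999.
    checks["phone_malformed"] = ~df["phone"].str.fullmatch(
        r"\(\d{3}\)\d{3}-\d{4}", na=False)
    # Cross-field validation: discharge can't precede admission.
    checks["discharge_before_admission"] = df["discharged"] < df["admitted"]
    return checks


violations = find_violations(records)
print(violations.sum())  # number of failing rows per check
```

Keeping each check as its own column makes it easy to count failures per rule or to pull out the offending rows with `records[violations.any(axis=1)]` before deciding how to clean them.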
2. Clean the data using:
* Regular expressions: misspellings, regular expression patterns
* KNN imputation and other missing-value imputation methods
* Coercing: data-type constraints
* Melting: tidy data issues
* Date/time parsing
* Removing observations
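As a rough illustration of how these cleaning steps might fit together, here is a small pandas sketch. The `raw` table and its column names are hypothetical, and a simple median fill stands in for KNN imputation to keep the example dependency-free.

```python
import pandas as pd

# Hypothetical wide table: the m1534/f1534 column names encode sex and age band.
raw = pd.DataFrame({
    "country": ["AA", "BB", "CC"],
    "m1534": ["12", "7", "n/a"],
    "f1534": ["9", None, "4"],
    "reported": ["2017-05-04", "2017-05-05", "not recorded"],
})

# Melting: turn the value-bearing column names into rows.
tidy = raw.melt(id_vars=["country", "reported"],
                var_name="group", value_name="cases")

# Regular expressions: split "m1534" into sex, lower age, and upper age.
parts = tidy["group"].str.extract(
    r"^(?P<sex>[mf])(?P<age_lo>\d{2})(?P<age_hi>\d{2})$")
tidy = pd.concat([tidy.drop(columns="group"), parts], axis=1)

# Coercing: force counts to numeric; unparseable entries become NaN.
tidy["cases"] = pd.to_numeric(tidy["cases"], errors="coerce")

# Date/time parsing: unparseable dates become NaT.
tidy["reported"] = pd.to_datetime(
    tidy["reported"], format="%Y-%m-%d", errors="coerce")

# Removing observations: drop rows that have no usable report date.
tidy = tidy.dropna(subset=["reported"])

# Imputation: median fill, used here as a simple stand-in for KNN imputation.
tidy["cases"] = tidy["cases"].fillna(tidy["cases"].median())

print(tidy)
```

The order of the steps matters in this sketch: rows are removed before imputing, so the median is computed only from observations that are kept.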
Source
[ VIDEO OF THE WEEK]
@AngelaZutavern & @JoshDSullivan @BoozAllen discussed Mathematical Corporation #FutureOfData
[ QUOTE OF THE WEEK]
What we have is a data glut. Vernor Vinge
[ PODCAST OF THE WEEK]
#BigData @AnalyticsWeek #FutureOfData #Podcast with Eloy Sasot, News Corp
[ FACT OF THE WEEK]
140,000 to 190,000: the shortfall of people with deep analytical skills needed to fill the demand for Big Data jobs in the U.S. by 2018.