Mar 26, 20: #AnalyticsClub #Newsletter (Events, Tips, News & more..)


[  COVER OF THE WEEK ]

[Cover image: Ethics]

[ AnalyticsWeek BYTES]

>> Artificial Intelligence – Weekly Round-up by administrator

>> CNN Model Architectures and Applications by administrator

>> Two Things Everyone Needs to Know About Your CEM Program by bobehayes

Wanna write? Click Here

[ FEATURED COURSE]

Lean Analytics Workshop – Alistair Croll and Ben Yoskovitz

Use data to build a better startup faster in partnership with Geckoboard… more

[ FEATURED READ]

The Future of the Professions: How Technology Will Transform the Work of Human Experts

This book predicts the decline of today’s professions and describes the people and systems that will replace them. In an Internet society, according to Richard Susskind and Daniel Susskind, we will neither need nor want … more

[ TIPS & TRICKS OF THE WEEK]

Data Analytics Success Starts with Empowerment
Being data driven is not so much a technology challenge as an adoption challenge, and adoption has its roots in the cultural DNA of an organization. Great data-driven organizations weave the data-driven culture into their corporate DNA. A culture of connection, interaction, sharing and collaboration is what it takes to be data driven. It's about being empowered more than it is about being educated.

[ DATA SCIENCE Q&A]

Q:How do you test whether a new credit risk scoring model works?
A: * Test on a holdout set
* Kolmogorov-Smirnov test

Kolmogorov-Smirnov test:
– Non-parametric test
– Compare a sample with a reference probability distribution or compare two samples
– Quantifies a distance between the empirical distribution function of the sample and the cumulative distribution function of the reference distribution
– Or between the empirical distribution functions of two samples
– Null hypothesis (two-samples test): samples are drawn from the same distribution
– Can be modified as a goodness of fit test
– In our case: cumulative percentages of good, cumulative percentages of bad
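As a concrete sketch, the two-sample KS statistic is just the maximum gap between the two empirical CDFs. The score samples below are made up for illustration, not real credit data:

```python
import bisect

def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: the maximum vertical
    distance between the two empirical distribution functions."""
    a, b = sorted(sample_a), sorted(sample_b)
    d = 0.0
    for x in set(a) | set(b):
        cdf_a = bisect.bisect_right(a, x) / len(a)
        cdf_b = bisect.bisect_right(b, x) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d

# Hypothetical model scores on a holdout set: "goods" (repaid) should
# score systematically higher than "bads" (defaulted).
goods = [0.55, 0.60, 0.62, 0.70, 0.74, 0.80, 0.85, 0.90]
bads = [0.20, 0.30, 0.35, 0.40, 0.45, 0.58, 0.65]
print(ks_statistic(goods, bads))  # closer to 1 = better separation
```

A KS near 0 would mean the model cannot tell good and bad accounts apart; a KS near 1 means near-perfect separation of the two cumulative percentage curves.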

Source

[ VIDEO OF THE WEEK]

Understanding #BigData #BigOpportunity in Big HR by @MarcRind #FutureOfData #Podcast

Subscribe on YouTube

[ QUOTE OF THE WEEK]

The data fabric is the next middleware. – Todd Papaioannou

[ PODCAST OF THE WEEK]

Jeff Palmucci @TripAdvisor discusses managing a #MachineLearning #AI Team

Subscribe: iTunes | Google Play

[ FACT OF THE WEEK]

A quarter of decision-makers surveyed predict that data volumes in their companies will rise by more than 60 per cent by the end of 2014, with the average of all respondents anticipating a growth of no less than 42 per cent.

Sourced from: Analytics.CLUB #WEB Newsletter

Inside CXM: New Global Thought Leader Hub for Customer Experience Professionals

Inside CXM, a new online global thought-leadership hub for customer experience management (CXM/CEM) professionals, officially launched yesterday. Inside CXM is focused on bringing you the latest insights from the field of customer experience management. According to Inside CXM, their goal with this program is to provide valuable content via experts who have their finger on the pulse of the global customer experience marketplace.

I am happy to announce my partnership (disclosure – I am a paid contributor) with Inside CXM by providing unique content to their site. I join a host of other industry experts who cover topics like aligning your organization to deliver a unified experience, creating contextual customer journeys and using customer insights to build a smarter experience. These experts include Flavio Martins, Andy Reid and Molly Boyer. Check out the other contributors.

My first article for Inside CXM, “Why Customer Experience Management? To Leave the World a Better Place,” focuses on how businesses need to consider that the impact they have on customers goes well beyond their company walls and financial ledger. Business leaders need to remember that customers’ interactions with their company not only impact how the customers feel about the company, but the quality of those interactions can impact, positively or negatively, their personal lives. Customer experiences are, after all, a subset of all of life’s experiences.

 

Originally Posted at: Inside CXM: New Global Thought Leader Hub for Customer Experience Professionals by bobehayes

Mar 19, 20: #AnalyticsClub #Newsletter (Events, Tips, News & more..)


[  COVER OF THE WEEK ]

[Cover image: Statistically Significant]

[ AnalyticsWeek BYTES]

>> 10 groups of machine learning algorithms by administrator

>> Towards Better Visualizations: Part 1 – The Visual Frontier by analyticsweek

>> How can you reap the advantages of Big Data in your enterprise? Services you can expect from a Remote DBA Expert by thomassujain

Wanna write? Click Here

[ FEATURED COURSE]

A Course in Machine Learning

Machine learning is the study of algorithms that learn from data and experience. It is applied in a vast variety of application areas, from medicine to advertising, from military to pedestrian. Any area in which you need… more

[ FEATURED READ]

Research Design: Qualitative, Quantitative, and Mixed Methods Approaches, 4th Edition

The eagerly anticipated Fourth Edition of the title that pioneered the comparison of qualitative, quantitative, and mixed methods research design is here! For all three approaches, Creswell includes a preliminary conside… more

[ TIPS & TRICKS OF THE WEEK]

Data aids, not replaces, judgement
Data is a tool and a means to help build consensus and facilitate human decision-making, not replace it. Analysis converts data into information; information, placed in context, leads to insight. Insights lead to decisions, which ultimately lead to outcomes that bring value. So data is just the start; context and intuition also play a role.

[ DATA SCIENCE Q&A]

Q:What are the drawbacks of linear model? Are you familiar with alternatives (Lasso, ridge regression)?
A: * Assumes a linear relationship and well-behaved (e.g., normally distributed) errors
* Can't be used directly for count outcomes or binary outcomes
* Can't vary model flexibility: overfitting problems
* Alternatives: regularized models such as Lasso and ridge regression (see question 4 about regularization)
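A minimal sketch of one such alternative, ridge regression, fit in closed form on synthetic data (in practice you would reach for a library implementation such as scikit-learn's `Ridge`):

```python
import numpy as np

def ridge_fit(X, y, alpha=1.0):
    """Closed-form ridge regression: w = (X'X + alpha*I)^-1 X'y.
    The L2 penalty shrinks coefficients toward zero, trading a little
    bias for lower variance -- one remedy for overfitting."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

# Synthetic linear data with known coefficients
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + rng.normal(scale=0.1, size=200)

w_ols = ridge_fit(X, y, alpha=0.0)     # alpha=0 reduces to plain least squares
w_ridge = ridge_fit(X, y, alpha=10.0)  # penalized fit, shrunk toward zero
```

Lasso differs only in using an L1 penalty, which has no closed form but drives some coefficients exactly to zero, giving variable selection for free.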

Source

[ VIDEO OF THE WEEK]

#GlobalBusiness at the speed of The #BigAnalytics

Subscribe on YouTube

[ QUOTE OF THE WEEK]

You can use all the quantitative data you can get, but you still have to distrust it and use your own intelligence and judgment. – Alvin Toffler

[ PODCAST OF THE WEEK]

Understanding Data Analytics in Information Security with @JayJarome, @BitSight

Subscribe: iTunes | Google Play

[ FACT OF THE WEEK]

This year, over 1.4 billion smart phones will be shipped – all packed with sensors capable of collecting all kinds of data, not to mention the data the users create themselves.

Sourced from: Analytics.CLUB #WEB Newsletter

Measuring Customer Loyalty is Essential for a Successful CEM Program

Customers can exhibit many different types of loyalty behaviors toward a company (e.g., recommend, purchase same, purchase different products, stay/leave), each responsible for different types of business growth. Furthermore, when asked about their loyalty behaviors via relationship surveys, customers’ ratings of loyalty questions show that customer loyalty essentially boils down to three different types of customer loyalty:

  • Retention Loyalty: Degree to which customers will remain as customers or not leave to competitors. This type of loyalty impacts overall customer growth.
  • Advocacy Loyalty: Degree to which customers feel positively toward/will advocate your product/service/brand. This type of loyalty impacts new customer growth.
  • Purchasing Loyalty: Degree to which customers will increase their purchasing behavior. This type of loyalty impacts average revenue per customer.

These three distinct types of customer loyalty form the foundation of the RAPID loyalty approach. This multi-faceted approach helps companies understand how improving the customer experience can improve business growth in different ways. If interested, you can read my recent article on the development of the RAPID loyalty approach.

Product and Service Experience

Customer experience management (CEM) is the process of understanding and managing customers’ interactions with and perceptions about your company/brand. The ultimate goal of this process is to improve the customer experience and, consequently, increase customer loyalty. Two primary customer experience areas that are commonly assessed are the customers’ perception of their 1) product experience and 2) service experience. These two areas are shown to be among the top drivers of customer loyalty; customers who have a good experience in these two areas report higher levels of customer loyalty than customers who have a poor experience.

How does Product and Service Experience Impact Each Type of Customer Loyalty?

To understand the impact of the product and service experience on different facets of customer loyalty, I used existing survey data. Last year, Mob4Hire, a global crowd-sourced testing and market research community, and I conducted a worldwide survey, asking respondents about their experience with and loyalty towards their current wireless service provider. To measure the product and service experiences, respondents were asked to indicate their agreement with statements that describe their provider (1 to 5; higher scores indicate agreement and a better customer experience). As a measure of the product experience, we averaged respondents' ratings across two questions: 1) good coverage in my area and 2) reliable service (few dropped calls). As a measure of the service experience, we averaged respondents' ratings about their provider's representatives across five areas: 1) responds to needs, 2) has knowledge to answer questions, 3) was courteous, 4) understands my needs and 5) always there when I need them. The survey also asked about the respondents' loyalty toward their wireless service provider across the three types of loyalty: 1) retention, 2) advocacy and 3) purchasing.

To index the degree of impact that each customer experience dimension has on customer loyalty, I simply correlated the ratings of each customer experience dimension (Coverage/Reliability; Customer Service) with each of the three loyalty measures (Retention, Advocacy, Purchasing). I did this analysis for the entire dataset and then for each of the wireless service providers who had more than 100 respondents. Figure 1 contains the results for the impact of Coverage/Reliability on customer loyalty.
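The index described above is simply a Pearson correlation between two rating columns. A toy sketch with simulated 1-to-5 ratings (not the Mob4Hire data) looks like this:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 300  # hypothetical respondents

# Simulated 1-5 experience ratings, plus a loyalty rating that is
# partly driven by the experience rating plus noise.
product_exp = rng.integers(1, 6, n).astype(float)
advocacy = np.clip(product_exp + rng.normal(0, 1.2, n), 1, 5)

# Pearson correlation between an experience dimension and a loyalty measure
r = np.corrcoef(product_exp, advocacy)[0, 1]
print(f"r = {r:.2f}")
```

Repeating this for each experience dimension against each of the three loyalty measures, per provider, reproduces the grid of impact indices shown in Figures 1 and 2.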

Figure 1. Impact of Product Experience on Retention, Advocacy and Purchasing Loyalty. Click image to enlarge.

As you can see in Figure 1, using the entire sample (far left bars), the product experience has the largest impact on advocacy loyalty (r = .49), followed by retention (r = .34) and purchasing loyalty (r = .31). Similarly, in Figure 2, using the entire sample (far left bars), the service experience has the largest impact on advocacy loyalty (r = .48), followed by purchasing (r = .34) and retention loyalty (r = .32). Generally speaking, while improving the product and service experience will have the greatest impact on advocacy loyalty, improvement in these areas will have an impact, albeit a smaller one, on purchasing and retention loyalty. I find this pattern of results in other industries as well.

Looking at individual wireless service providers in Figures 1 and 2, however, we see exceptions to this rule (Providers were ordered by their Advocacy Loyalty scores.). For example, we see that improving the product experience will have a comparable impact on different types of loyalty for specific companies (Figure 1 – T-Mobile, Safaricom). Additionally, we see that improving the service experience will have a comparable impact on different types of loyalty for specific companies (Figure 2 – Safaricom, MTN, Orange, Warid Telecom, Telenor, and Ufone). The value of improving the service experience is different across companies depending on the types of customer loyalty it impacts. For example, improving the service experience is much more valuable for Safaricom than it is for T-Mobile. Improving the service experience will greatly impact all three types of customer loyalty for Safaricom and only one for T-Mobile.  I suspect the reasons for variability across providers in what drives their customer loyalty could be due to company maturity, the experience delivery process, market pressures and customer type. Deeper analyses (e.g., stepwise regression, path analysis) of these data for specific providers could help shed light on the reasons.

Figure 2. Impact of Service Experience on Retention, Advocacy and Purchasing Loyalty. Click image to enlarge.

Benefits of Measuring Different Types of Customer Loyalty

Improving the customer experience impacts different types of customer loyalty and this pattern varies across specific companies. For some companies, improving the customer experience will primarily drive new customer growth (advocacy loyalty). For other companies, improving the customer experience will also significantly drive existing customer growth (retention and purchasing loyalty).

Companies who measure and understand different types of customer loyalty and how they are impacted by the customer experience have an advantage over companies who measure only one type of loyalty (typically advocacy):

  • Companies can target solutions to optimize different types of customer loyalty to improve business growth. For example, including retention loyalty questions (e.g., "likelihood to quit") and purchasing loyalty questions (e.g., "likelihood to buy different") can help companies understand why customers are leaving and identify ways to increase customers' purchasing behavior, respectively.
  • Key performance indicators (KPIs) can be identified for each type of customer loyalty. Identification of different KPIs (key drivers of customer loyalty) helps companies ensure they are monitoring all important customer experience areas. Identifying and monitoring all KPIs helps ensure the entire company is focused on matters that are important to the customer and his/her loyalty.
  • Companies are better equipped to quantify the value of their CEM program and obtain more accurate estimates of the Return on Investment (ROI) of the program. The ROI of a specific improvement opportunity will depend on how the company measures customer loyalty. If only advocacy loyalty is measured, the estimate of ROI is based on new customer growth. When companies measure advocacy, purchasing and retention loyalty, the estimate of ROI is based on new and existing customer growth.

Final Thoughts

The primary goal of CEM is to improve customer loyalty. Companies that narrowly define customer loyalty are missing out on opportunities to fully understand the impact that their CEM program has on the company’s bottom line. Companies need to ensure they are comprehensively measuring all facets of customer loyalty. A poor customer loyalty measurement approach can lead to sub-optimal business decisions, missed opportunities for business growth and an incomplete picture of the health of the customer relationship.

Originally Posted at: Measuring Customer Loyalty is Essential for a Successful CEM Program

Mar 12, 20: #AnalyticsClub #Newsletter (Events, Tips, News & more..)


[  COVER OF THE WEEK ]

[Cover image: Data security]

[ AnalyticsWeek BYTES]

>> How to Solve 10 Healthcare Challenges with One Predictive Analytics Model by analyticsweek

>> 6 Big Data Analytics Use Cases for Healthcare IT by analyticsweekpick

>> The Usability of Dashboards (Part 1): Does Anyone Actually Use These Things? [Guest Post] by analyticsweek

Wanna write? Click Here

[ FEATURED COURSE]

R Basics – R Programming Language Introduction

Learn the essentials of R Programming – R Beginner Level!… more

[ FEATURED READ]

On Intelligence

Jeff Hawkins, the man who created the PalmPilot, Treo smart phone, and other handheld devices, has reshaped our relationship to computers. Now he stands ready to revolutionize both neuroscience and computing in one strok… more

[ TIPS & TRICKS OF THE WEEK]

Data Analytics Success Starts with Empowerment
Being data driven is not so much a technology challenge as an adoption challenge, and adoption has its roots in the cultural DNA of an organization. Great data-driven organizations weave the data-driven culture into their corporate DNA. A culture of connection, interaction, sharing and collaboration is what it takes to be data driven. It's about being empowered more than it is about being educated.

[ DATA SCIENCE Q&A]

Q:Is it better to have 100 small hash tables or one big hash table, in memory, in terms of access speed (assuming both fit within RAM)? What do you think about in-database analytics?
A: Hash tables:
– Average-case O(1) lookup time
– Lookup time doesn't depend on size

Even in terms of memory:
– O(n) memory
– Space scales linearly with the number of elements
– Lots of small dictionaries won't take up significantly less space than one larger one

In-database analytics:
– Integration of data analytics into data warehousing functionality
– Much faster, and corporate information is more secure since it doesn't leave the enterprise data warehouse
– Good for real-time analytics: fraud detection, credit scoring, transaction processing, pricing and margin analysis, behavioral ad targeting and recommendation engines
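A quick sketch of the trade-off: one big dict versus the same items sharded across 100 small dicts. Lookups are O(1) in both layouts, but the sharded one pays for an extra hash-and-index step (the timings printed are illustrative and machine-dependent):

```python
import timeit

N, SHARDS = 100_000, 100
items = {f"key{i}": i for i in range(N)}

# One big hash table
big = dict(items)

# 100 small hash tables, sharded by hash of the key
small = [dict() for _ in range(SHARDS)]
for k, v in items.items():
    small[hash(k) % SHARDS][k] = v

def lookup_big(k):
    return big[k]

def lookup_small(k):
    # Extra work: pick the shard, then probe it
    return small[hash(k) % SHARDS][k]

t_big = timeit.timeit(lambda: lookup_big("key54321"), number=100_000)
t_small = timeit.timeit(lambda: lookup_small("key54321"), number=100_000)
print(f"big: {t_big:.3f}s  sharded: {t_small:.3f}s")
```

Since per-lookup cost is flat in both cases and total memory scales with the number of elements either way, splitting one table into many buys nothing on access speed or space.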

Source

[ VIDEO OF THE WEEK]

Solving #FutureOfWork with #Detonate mindset (by @steven_goldbach & @geofftuff) #JobsOfFuture #Podcast

Subscribe on YouTube

[ QUOTE OF THE WEEK]

I keep saying that the sexy job in the next 10 years will be statisticians. And I’m not kidding. – Hal Varian

[ PODCAST OF THE WEEK]

#BigData @AnalyticsWeek #FutureOfData #Podcast with Eloy Sasot, News Corp

Subscribe: iTunes | Google Play

[ FACT OF THE WEEK]

The Hadoop (open-source software for distributed computing) market is forecast to grow at a compound annual growth rate of 58%, surpassing $1 billion by 2020.

Sourced from: Analytics.CLUB #WEB Newsletter

Reinforcing Data Governance with Data Discovery

Historically, data discovery has existed at the nexus point between data preparation and analytics. The discovery process was frequently viewed as the means of gathering the requisite data for analytics while illustrating relationships between data elements which might inform them.

Today, data discovery’s utility has considerably broadened. Aided by machine learning and data cataloging techniques, data discovery is playing an increasingly pivotal role in enabling—and solidifying—data governance for today’s highly regulated data environments.

“We now have the automated capability to see where data elements are showing up and what are the new instances of them that are being introduced [throughout the enterprise],” Io-Tahoe CEO Oksana Sokolovsky revealed. “Now, users can govern that as data owners and actually have this visibility into their changing data landscapes.”

The additional governance repercussions of data discovery (encompassing aspects of data quality, data stewardship, and data disambiguation), coupled with its traditional importance for enhancing analytics, makes this facet of data management more valuable than ever.

Data Cataloging
The expansion of data discovery into facets of data governance is rooted in the fundamental need to identify where data are for what specific purposes. Data cataloging immensely enriches this process by providing a means of detailing critical information about data assets that provide a blueprint for data governance. Moreover, discovery and cataloging systems which deploy machine learning are targeted towards business users, allowing them to “create business rules, maintain them, search for elements, define policies, and start providing the governance workflow for the data elements,” Sokolovsky said. The plethora of attributes imputed to data within catalogs is vast, including details about metadata, sensitivity, and access or security concerns. Another crucial advantage is that all of this information is stored in a centralized location. “The catalog enhances the metadata and enhances the business description of the data elements,” Sokolovsky explained. “It enables other business users to leverage that information. The catalog function now makes data discovery an actionable output for users.”

Exceeding Metadata Relationships
A number of data discovery tools are almost entirely based on metadata—providing circumscribed value in situations in which there is limited metadata. The most common of these involve data lakes, in which data elements “might not have any metadata associated with them, but we still need to tie them back to the same element which appears in your original sources,” Sokolovsky commented. Other metadata limitations involve scenarios in which there is not enough metadata, or metadata that applies to a specific use case. In these instances and others, discovery techniques informed by machine learning are superior because they can identify relationships among the actual data, as well as among any existent metadata.

According to Sokolovsky, this approach empowers organizations to “now pick up 30 to 40 percent more [information about data elements], which used to be input manually by subject matter experts.” The disambiguation capability of this approach supports basic aspects of data quality. For example, when determining if data referencing ‘Washington’ applies to names, locations, or businesses, machine learning “algorithms can narrow that down and say we found 700 Washington instances; out of that, X number is going to be last names, X number is going to be first names, X number is going to be streets, and X number is going to be cities,” Sokolovsky said.
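As a purely illustrative stand-in (Io-Tahoe's actual learned models are not public), context-based disambiguation of a token like "Washington" can be sketched with hand-written rules over the surrounding words:

```python
from collections import Counter

def classify_washington(prev_word: str, next_word: str) -> str:
    """Crude context rules standing in for learned disambiguation."""
    if next_word in {"Street", "St", "Ave", "Avenue", "Blvd"}:
        return "street"
    if prev_word in {"Mr.", "Ms.", "Dr.", "President", "General"}:
        return "last name"
    if prev_word in {"in", "to", "near", "from"}:
        return "city"
    return "unknown"

# (prev_word, next_word) pairs around each "Washington" mention
mentions = [("President", "said"), ("on", "Street"),
            ("in", "State"), ("Mr.", "arrived")]
counts = Counter(classify_washington(p, n) for p, n in mentions)
print(counts)  # tallies mentions per entity type
```

A machine learning approach replaces the hand-written rules with features learned from labeled examples, but the output is the same kind of per-type tally Sokolovsky describes.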

Data Stewardship
The automation capabilities of machine learning for data discovery also support governance by democratizing the notion of data stewardship. It does so in two ways. Not only do those tools provide much needed visibility for employees in dedicated stewardship roles, but they also enable business users to add citizen stewardship responsibilities to their positions. The expansion of stewardship capabilities is useful for increasing data quality for data owners in particular, who “become more like stewards,” Sokolovsky maintained. “They can now say okay, out of 75 instances 74 seem to be accurate and one is bad. That’s going to continue to enhance the machine learning capability.”

The capacity for disambiguating data, reinforcing data quality and assisting data stewardship that this approach facilitates results in higher levels of accuracy for data in any variety of use cases. Although a lot of this work is engineered by machine learning, the human oversight of data stewardship is instrumental for its ultimate success. “The user should interact with the system to go and do the validation and say I accept or I reject [the machine learning results],” Sokolovsky said. “Because of that not only are they in control of the governance, but also the system becomes smarter and smarter in the client’s environment.”

Working for Business Users
The deployment of data discovery and data cataloging for data governance purposes indicates both the increasing importance of governance and machine learning. Machine learning is the intermediary that improves the data discovery process to make it suitable for the prominent data governance and regulatory compliance concerns contemporary enterprises face. It is further proof that these learning capabilities are not only ideal for analytics, but also for automating other processes that give those analytics value (such as data quality), which involves “working directly with the business user,” Sokolovsky said.

Source by jelaniharper

Data Matching with Different Regional Data Sets

When it comes to Data Matching, there is no ‘one size fits all menu’. Different matching routines, different algorithms and different tuning parameters will all apply to different datasets. You generally can’t take one matching setup used to match data from one distinct data set and apply it to another. This proves especially true when matching datasets from different regions or countries. Let me explain.

Data Matching for Attributes that are Unlikely to Change

Data matching is all about identifying unique attributes that a person or object has, and then using those attributes to match individual members within that set. These attributes should be things that are unlikely to change over time. For a person, these would be things like "Name" and "Date of Birth". Attributes like "Address" are much more likely to change and are therefore less important, although this does not mean you should not use them; they are simply less unique, and therefore lend less weight to the matching process. For objects, the attributes would be whatever uniquely identifies that object. In the case of, say, a cup (if you manufactured cups), those attributes would be things like "Size", "Volume", "Shape" and "Color". The attributes themselves are not too important; what matters is that they are things that are unlikely to change over time.

So, back to data relating to people, which is generally the main use case for data matching. Here comes the challenge: can't we take the data matching routines used for one 'person database' and simply reuse them for another dataset? Unfortunately, the answer is no. There are always going to be differences in the data that manifest themselves during matching, and never more so than with datasets from different geographical regions such as different countries. Data matching routines are always tuned for a specific dataset, and while there are always differences from dataset to dataset, the differences become much more distinct when you choose data from different geographical regions. Let us explore this some more.

Data Matching for Regional Data Sets

First, I must mention a caveat. I am going to assume that matching is done in western character sets, using Romanized names, not in languages or character sets such as Japanese or Chinese. This does not mean the data must contain only English or western names, far from it, it just means the matching routines are those which we can use for names that we can write using western, or Romanized characters. I will not consider matching using non-western characters here. 

Now, let us consider the matching of names. For the name itself, we use matching routines that do things like phoneticize the names and then look for differences between the results. But first, the methodology involves blocking on names: sorting the data into piles that share similar attributes. It’s the age-old ‘matching the socks’ problem. You wouldn’t match socks in a great pile of fresh laundry by picking one sock at a time from the whole pile and then trying to find its duplicate; that would be very inefficient and take ages to complete. You instinctively know what to do: you sort them first into similar piles, or ‘blocks’, of similar socks, say a pile of black socks, a pile of white socks, a pile of colored socks, and then you look for matches within those smaller piles. It’s the same principle here. We sort the data into blocks of similar attributes, then match within those blocks. Ideally, these blocks should be of a manageable and similar size. Now, here comes the main point.
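The blocking step can be sketched in a few lines. The phonetic key below is a deliberately simplified stand-in for real phonetic algorithms such as Soundex or Metaphone (it just keeps the first letter, drops vowel-like letters and collapses repeats); the names and the grouping logic are assumptions for illustration.

```python
from collections import defaultdict
from itertools import combinations

def phonetic_key(name: str) -> str:
    """Simplified phonetic key: keep the first letter, drop vowel-like
    letters, collapse repeated letters. A toy stand-in for Soundex."""
    name = name.lower()
    if not name:
        return ""
    out = [name[0]]
    for ch in name[1:]:
        if ch in "aeiouyhw":
            continue
        if ch != out[-1]:
            out.append(ch)
    return "".join(out)

def block_by_key(names):
    """Sort names into 'piles' (blocks) sharing a phonetic key."""
    blocks = defaultdict(list)
    for name in names:
        blocks[phonetic_key(name)].append(name)
    return blocks

names = ["Smith", "Smyth", "Smythe", "Jones", "Johns", "Brown"]
blocks = block_by_key(names)
for key, members in blocks.items():
    # Candidate pairs are generated only within a block, never across.
    print(key, members, list(combinations(members, 2)))
```

Smith, Smyth and Smythe all land in one block and are compared with each other; Brown is never compared against any of them, which is exactly the saving the socks analogy describes.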

Different geographic regions will produce different distributions of block sizes and types, which changes the matching that must be done within those blocks, and this manifests itself in the performance, efficiency, accuracy and overall quality of the matching. The distribution of names can vary widely from region to region, and therefore so can the results obtained.

Let’s look specifically at surnames for a moment. In the UK, according to the Office for National Statistics, around 270,000 surnames cover roughly 95% of the population. Obviously, some surnames are much more common than others: Jones, Brown and Patel, for example, are amongst the most common. The important thing is that these names follow a specific distribution if we choose to plot them: a big cluster of common names at one end, tailing off towards the other, with a curve whose shape is specific to the UK and the UK alone. Different countries or regions would have differently shaped distributions. This is an important point. In some regions the distribution is much narrower, with names far more likely to be shared; in others it is broader, with names far less common. This affects the results of any matching we do within datasets emanating from those regions. A narrower distribution of names results in bigger block sizes and therefore more data to match within each block, which can take longer, be less efficient, and can even affect the accuracy of the matches. A broader distribution results in many more blocks of a smaller size, each of which must be processed.

Data Matching Variances Across the Globe

Let’s take a look at how this varies across the globe. A good example of regional differences comes from Taiwan, where roughly forty percent of the population share just six surnames (in their Romanized form). Matching within Taiwanese datasets will therefore produce some very large blocks. Thailand, on the other hand, presents the opposite challenge. In Thailand there are no common surnames: a law known as the ‘Surname Act’ states that surnames cannot be duplicated and that families should have unique surnames, so it is incredibly rare for any two people to share the same name. In our blocking exercise, this results in a huge number of very small blocks.

The two examples above may be extreme, but they perfectly illustrate the challenge. Datasets containing names vary from region to region, and so the blocking and matching strategy can vary widely from place to place. You cannot simply reuse the same routines and algorithms across datasets; each dataset is unique and must be treated as such. A different matching strategy must be adopted for each set, and each matching exercise must be ‘tuned’ to its specific dataset to find the most effective approach. It doesn’t matter which toolset you choose; the same principle applies to all, because the issue lives in the data itself and cannot be changed or ignored.

To summarize, the general point is that regional, geographic, cultural and language variations make big differences to how you go about matching personal data. Each dataset must be treated differently: you must understand the data it contains, select the best blocking and matching strategy for it, and tune and optimize your matching routines accordingly. You cannot simply reuse the exact same approach from dataset to dataset, as it can vary widely from region to region. Until next time!

The post Data Matching with Different Regional Data Sets appeared first on Talend Real-Time Open Source Data Integration Software.
