Investigating Data Scientists, their Skills and Team Makeup

A new survey of 490 data professionals from small to large companies, conducted by AnalyticsWeek in partnership with Business Over Broadway, provides a look into the field of data science. Download the free Executive Summary of the report, Optimizing your Data Science Teams.

Our world of Big Data requires that businesses, to outpace their competitors, optimize the use of their data. Understanding data is about extracting insights from the data to answer questions that will help executives drive their business forward. Do we invest in products or services to improve customer loyalty? Would we get greater ROI by hiring more staff or by investing in new equipment?

Getting insights from data is no simple task, often requiring data science experts with a variety of different skills. Many pundits have offered their take on what it takes to be a successful data scientist. Required skills include expertise in business, technology and statistics. In an interesting study published by O’Reilly, researchers (Harlan D. Harris, Sean Patrick Murphy and Marck Vaisman) surveyed several hundred practitioners, asking them about their proficiency in 22 different data skills. Confirming the pundits’ list of skills, these researchers found that data skills fell into five broad areas: Business, ML / Big Data, Math / OR, Programming and Statistics.

Data Skills Survey and Sample

We invited data professionals from a variety of sources, including AnalyticsWeek community members and social media (e.g., Twitter and LinkedIn), to complete a short survey asking them about their proficiency across different data skills, education, job roles, team members, satisfaction with their work outcomes and more. We received 490 completed survey responses. Most of the respondents were from North America (68%) and worked for B2B companies (79%) with fewer than 1,000 employees (53%) in the IT, Financial Services, Education/Science, Consulting and Healthcare & Medicine industries (68%). Males accounted for 75% of the sample. A majority of the respondents held 4-year (30%), Master’s (49%) or PhD (18%) degrees.

Data Science Skills

Figure 1. Proficiency levels across 25 data skills. Click image to enlarge.

Data science is an umbrella term, under which different skills fall. We identified 25 data skills that make up the field of data science. They fall into five broad areas: 1) Business, 2) Technology, 3) Programming, 4) Math & Modeling and 5) Statistics. Respondents were asked to indicate their level of proficiency for each of the 25 skills, using a scale ranging from 0 (Don’t know) to Expert.

Proficiency levels varied widely across the different skills (see Figure 1). The respondents reported a high degree of competency in such areas as Communications, Structured data, Data mining, Science/Scientific Method and Math. The respondents indicated a low degree of competency in such areas as Systems Administration, Front- and Back-end programming, NLP, Big and distributed data and Cloud Management.
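As a hypothetical illustration of how such self-ratings can be summarized, skills can be ranked by their mean proficiency score. The ratings below are invented for a handful of the 25 skills; they are not the survey data.

```python
from statistics import mean

# Invented self-ratings for a few of the 25 skills (not the survey data).
ratings = {
    "Communications":   [80, 75, 90, 70],
    "Structured data":  [85, 70, 75, 80],
    "NLP":              [20, 10, 35, 25],
    "Cloud Management": [15, 30, 20, 10],
}

# Rank the skills by mean self-rated proficiency, highest first.
ranked = sorted(ratings, key=lambda s: mean(ratings[s]), reverse=True)
print(ranked)
```

The same one-liner scales to the full 25-skill survey once the ratings are loaded from the raw responses.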

Job Roles

Figure 2. Job roles of data professionals.

Respondents were asked to indicate which of four options best described themselves and the work they do (i.e., their job role). Over half indicated their primary job role was a Researcher, followed by Business Management, Creative and Developer (see Figure 2).

About half of the respondents (49%) identified with only one primary job role. About 32% indicated they had two job roles, 14% indicated they had three and 4% indicated they had all four.

Figure 3. Satisfaction with outcome of analytics projects by job role

Looking at data professionals who selected only one job role, we examined their satisfaction with the outcomes of their analytics projects (see Figure 3.). The results showed that data professionals who identify as Researchers reported significantly higher levels of satisfaction with the work they do compared to data professionals who are Business Management or Developers.

Data Scientists are not Created Equal

Figure 4. Proficiency in Data Science Skills by Job Roles. Click image to enlarge.

What does it mean to be a data scientist? After all, there are many different skills that fall under the umbrella of data science. The professionals’ job role was logically related to their proficiency in different skills (see Figure 4). I examined differences among data professionals who indicated they had only one primary job role. Data professionals in Business Management roles had the strongest business skills of all data professionals; Developers were the strongest in Technology and Programming skills; Researchers were the strongest in Statistics and Math & Modeling skills. The Creative types didn’t excel at any one skill but appeared to have a decent level of proficiency across all skill areas.

Data Science is a Team Sport: The Need for Complementary Skills

Figure 5. Effect of teammate’s expertise on satisfaction with analytics work outcomes. Click image to enlarge.

The results of the survey showed that data professionals tend to work together to solve problems. Seventy-six percent of the respondents said they work with at least one other person on projects that involve analytics.

To better understand how teams work together, we looked at how a data professional’s expertise impacts their teammate. We asked respondents how satisfied they were with the outcomes of their analytics projects. Additionally, we asked data professionals if their teammates were experts in any of the five data skill areas.

Results showed that Business Management professionals were more satisfied with the outcome of their work when they had quantitative-minded experts on their team (i.e., in Math & Modeling and Statistics) compared to when they did not (see Figure 5). Additionally, Researchers were more satisfied with their work outcomes when they were paired with experts in Business and Math & Modeling, and Developers were more satisfied when paired with an expert in Business. Creatives’ satisfaction with their work product was not impacted by the presence of other experts.

Summary and Implications

Solving problems with data requires expertise across different skill areas: 1) Business, 2) Technology, 3) Programming, 4) Math & Modeling and 5) Statistics.

Different types of data professionals (as defined by their role) are proficient in different areas. Not surprisingly, data professionals in Business Management roles are the most proficient in business skills. Researchers are the most proficient in Math & Modeling and Statistics skills. Developers are the most proficient in Technology and Programming. The Creative types have some proficiency in all skill areas but are not the best in any single one.

It appears that a team approach is an effective way of approaching your data science projects. Solving problems using data (i.e., a data-driven approach) involves three major tasks: 1) identifying the right questions, 2) getting access to the right data and 3) analyzing the data to provide the answers. Each major task requires expertise in different skills, often calling for a team approach. Different data professionals bring their unique and complementary skills to bear on each of the three phases of data-intensive projects.

Finally, these preliminary findings are interesting and have important implications for business in helping chief data officers and hiring managers better understand their data science capabilities. Chief data/analytics officers need to focus on both data skills of their professionals as well as team composition. Additionally, recruiters need to effectively market to and recruit data professionals who have the right skills to fill specific roles. Getting feedback from data professionals can help organizations identify and close any talent gaps and improve how they manage their data science teams.

Download the free Executive Summary of the report, Optimizing your Data Science Teams.

To learn more about the DS3 Enterprise version, click here.


Social Media Analytics – What to Measure for Success?

If you are using social media for business or promotional purposes, then you should know how to measure it. Measuring social media is not about collecting metrics for their own sake, but about gauging the effectiveness of your social media campaigns and streamlining them for better results. With proper analytics, you will be able to understand what was successful, what wasn’t, what your target audience expects, and how you can improve.

Two types of measurements

There are two major types of social media measurement to consider:

  • Ongoing analytics – Tracking activities over time and monitoring performance.
  • Campaign-centric analytics – Getting analytics related to each event or campaign to assess its success.

Ongoing analytics help you track the pulse of your social media communications regarding your business and brand in general. Once you set up the elements of brand tracking, you can let it run by default and periodically fetch the data to see how things are going.

On the other hand, campaign-specific metrics help you understand the actual impact of your targeted content, which may differ from campaign to campaign. An ideal social media analytics program keeps both kinds of measurement in balance.

Monitoring social media analytics

Monitoring your social media analytics can make the difference between the success and failure of your social media activities. This section outlines what you need to monitor on social media and which tools to use for proper analysis and reporting of social media campaigns.

Social media analytics compass

It is almost impossible to monitor and measure everything related to social media on every channel at once. So, you need to determine what is essential for your business and how to monitor it well. To help beginners, we will discuss the most critical areas of the so-called social media compass, i.e., the most important analytical measures.

  1. Size of the target audience

Many wonder whether the size of your audience really matters. Of course it does if you are trying to promote a brand or service and build a relevant audience. It is essential to continuously build a relevant audience if you want to take your message to the right people at the right time.

Your audience will grow gradually through organic methods as well as through investment in paid ads. In fact, there is nothing wrong with investing in audience-building tactics if you have scope to convert that audience into business over time. You should compare your rate of audience growth over a week or month with that of your competitors. Along with audience growth, keep track of unfollowers too.

  2. Audience profile

As you slowly grow your audience, it is also essential to ensure that you are building the right type of audience, especially when you are paying for it, so you can decide whether you are making a worthwhile investment. For example, if you build your audience on Twitter, the platform lets you access reports stating what types of profiles (marketers, entrepreneurs, musicians, etc.) make up your audience.

You can do the same on Facebook by setting up an ad targeted at a specific category of your Facebook audience, and the same is possible on Instagram. For example, you can filter for a specific interest and see how many of your followers fall into that category in order to plan relevant campaigns for them. With a smart approach, you can perform this profile analysis across all social media platforms, using anything from traditional approaches like surveys to the advanced premium tools offered by the platforms themselves.

  3. Reach and engagement

You should also monitor the social reach of your content and see how many people actually pay attention to it, even if they are not responding. A lack of responses does not necessarily indicate a lack of interest. Engagement is another key aspect to monitor, as those with keen interest may engage with your content. If you find no engagement at all, you may have either the wrong content or the wrong audience. You can typically split your audience into the following categories.

  • Lurkers – those who simply watch your content without interacting.
  • Influencers – those who are connected to a large audience and can influence them.
  • Engagers – people who are highly active in your target community, whose names people will start recognizing.

  4. Traffic

The primary objective of your social media campaigns is to bring traffic back to your website or product pages. For some promoters, traffic alone is enough; for example, a site may get paid for ads based on the volume of traffic. For the rest, the traffic needs to be converted into sales to meet their objective.

  5. Content analysis

As we have seen, creating content and sharing it through social media is an expensive and labor-intensive affair. So, you need to do content analysis on a regular basis to see whether your efforts are being recognized. You should check:

  • Whether your images, videos, or text updates work best.
  • Whether the content you share strikes the right mix or is too focused on promotion.
  • Whether you get enough engagement on your questions.
  • What is changing on the social media platforms, and what those changes demand from you.
  6. Sentiment analysis

Sentiment analysis covers the negative, positive, and neutral mentions of your brand. The latest social media tools focus on measuring the sentiment of the target audience toward your brand through their social mentions of you. Even though these tools are not 100% accurate, they can be a good indicator of where you are going wrong and what to correct.
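As a minimal illustration of the idea, a lexicon-based sentiment tally might look like the sketch below. The word lists are invented; real tools use far larger lexicons and machine learning.

```python
# Tiny illustrative lexicons; production tools use thousands of terms.
POSITIVE = {"love", "great", "excellent", "recommend"}
NEGATIVE = {"hate", "terrible", "broken", "refund"}

def classify(mention: str) -> str:
    """Label a mention positive, negative, or neutral by word counts."""
    words = set(mention.lower().split())
    score = len(words & POSITIVE) - len(words & NEGATIVE)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

mentions = [
    "I love this brand and its great support",
    "terrible product, want a refund",
    "just bought the new model",
]
print([classify(m) for m in mentions])
```

Tallying these labels over time gives the positive/negative/neutral mention breakdown described above.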

These are some practical pointers for social media analytics, which is a far wider specialty. Mastering them is a good starting point for streamlining your social media campaigns to meet your online objectives.

Source: Social Media Analytics – What to Measure for Success? by thomassujain

Inside CXM: New Global Thought Leader Hub for Customer Experience Professionals

Inside CXM, a new online global thought leadership hub for customer experience management (CXM/CEM) professionals, officially launched yesterday. Inside CXM is focused on bringing you the latest insights from the field of customer experience management. According to Inside CXM, their goal with this program is to provide valuable content via experts who have their finger on the pulse of the global customer experience marketplace.

I am happy to announce my partnership (disclosure – I am a paid contributor) with Inside CXM by providing unique content to their site. I join a host of other industry experts who cover topics like aligning your organization to deliver a unified experience, creating contextual customer journeys and using customer insights to build a smarter experience. These experts include Flavio Martins, Andy Reid and Molly Boyer. Check out the other contributors.

My first article for Inside CXM, “Why Customer Experience Management? To Leave the World a Better Place,” focuses on how businesses need to consider that the impact they have on customers goes well beyond their company walls and financial ledger. Business leaders need to remember that customers’ interactions with their company not only impact how the customers feel about the company, but the quality of those interactions can impact, positively or negatively, their personal lives. Customer experiences are, after all, a subset of all of life’s experiences.


Originally Posted at: Inside CXM: New Global Thought Leader Hub for Customer Experience Professionals by bobehayes

Measuring Customer Loyalty is Essential for a Successful CEM Program

Customers can exhibit many different types of loyalty behaviors toward a company (e.g., recommend, purchase same, purchase different products, stay/leave), each responsible for different types of business growth. Furthermore, when asked about their loyalty behaviors via relationship surveys, customers’ ratings of loyalty questions show that customer loyalty essentially boils down to three different types of customer loyalty:

  • Retention Loyalty: Degree to which customers will remain as customers or not leave to competitors. This type of loyalty impacts overall customer growth.
  • Advocacy Loyalty: Degree to which customers feel positively toward/will advocate your product/service/brand. This type of loyalty impacts new customer growth.
  • Purchasing Loyalty: Degree to which customers will increase their purchasing behavior. This type of loyalty impacts average revenue per customer.

These three distinct types of customer loyalty form the foundation of the RAPID loyalty approach. Using the RAPID loyalty’s multi-faceted approach helps companies understand how improving the customer experience can improve business growth in different ways. If interested, you can read my recent article on the development of the RAPID loyalty approach.

Product and Service Experience

Customer experience management (CEM) is the process of understanding and managing customers’ interactions with and perceptions about your company/brand. The ultimate goal of this process is to improve the customer experience and, consequently, increase customer loyalty. Two primary customer experience areas that are commonly assessed are the customers’ perception of their 1) product experience and 2) service experience. These two areas are shown to be among the top drivers of customer loyalty; customers who have a good experience in these two areas report higher levels of customer loyalty than customers who have a poor experience.

How does Product and Service Experience Impact Each Type of Customer Loyalty?

To understand the impact of the product and service experience on different facets of customer loyalty, I used existing survey data. Last year, Mob4Hire, a global crowd-sourced testing and market research community, and I conducted a worldwide survey, asking respondents about their experience with and loyalty toward their current wireless service provider. To measure the product and service experiences, respondents were asked to indicate their agreement with statements that describe their provider (1 to 5 – higher scores indicate agreement and a better customer experience). As a measure of the product experience, we averaged respondents’ ratings across two questions: 1) good coverage in my area and 2) reliable service (few dropped calls). As a measure of the service experience, we averaged respondents’ ratings about their provider’s representatives across five areas: 1) responds to needs, 2) has knowledge to answer questions, 3) was courteous, 4) understands my needs and 5) always there when I need them. The survey also asked about the respondents’ loyalty toward their wireless service provider across the three types of loyalty: 1) retention, 2) advocacy and 3) purchasing.

To index the degree of impact that each customer experience dimension has on customer loyalty, I simply correlated the ratings of each customer experience dimension (Coverage/Reliability; Customer Service) with each of the three loyalty measures (Retention, Advocacy, Purchasing). I did this analysis for the entire dataset and then for each of the wireless service providers who had more than 100 respondents. Figure 1 contains the results for the impact of Coverage/Reliability on customer loyalty.
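The correlation step can be sketched in plain Python as computing Pearson's r between two rating columns. The ratings below are invented for illustration; they are not the Mob4Hire data.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of ratings."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Invented 1-5 ratings for eight hypothetical respondents:
product_experience = [4, 5, 3, 2, 5, 1, 4, 3]
advocacy_loyalty   = [4, 5, 3, 3, 4, 2, 5, 3]
print(round(pearson_r(product_experience, advocacy_loyalty), 2))
```

Repeating this for each experience dimension against each of the three loyalty measures produces the grid of correlations shown in Figures 1 and 2.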

Figure 1. Impact of Product Experience on Retention, Advocacy and Purchasing Loyalty. Click image to enlarge.

As you can see in Figure 1, using the entire sample (far left bars), the product experience has the largest impact on advocacy loyalty (r = .49), followed by retention (r = .34) and purchasing loyalty (r = .31). Similarly, in Figure 2, using the entire sample (far left bars), the service experience has the largest impact on advocacy loyalty (r = .48), followed by purchasing (r = .34) and retention loyalty (r = .32). Generally speaking, while improving the product and service experience will have the greatest impact on advocacy loyalty, improvement in these areas will have an impact, albeit a smaller one, on purchasing and retention loyalty. I find this pattern of results in other industries as well.

Looking at individual wireless service providers in Figures 1 and 2, however, we see exceptions to this rule (Providers were ordered by their Advocacy Loyalty scores.). For example, we see that improving the product experience will have a comparable impact on different types of loyalty for specific companies (Figure 1 – T-Mobile, Safaricom). Additionally, we see that improving the service experience will have a comparable impact on different types of loyalty for specific companies (Figure 2 – Safaricom, MTN, Orange, Warid Telecom, Telenor, and Ufone). The value of improving the service experience is different across companies depending on the types of customer loyalty it impacts. For example, improving the service experience is much more valuable for Safaricom than it is for T-Mobile. Improving the service experience will greatly impact all three types of customer loyalty for Safaricom and only one for T-Mobile.  I suspect the reasons for variability across providers in what drives their customer loyalty could be due to company maturity, the experience delivery process, market pressures and customer type. Deeper analyses (e.g., stepwise regression, path analysis) of these data for specific providers could help shed light on the reasons.

Figure 2. Impact of Service Experience on Retention, Advocacy and Purchasing Loyalty. Click image to enlarge.

Benefits of Measuring Different Types of Customer Loyalty

Improving the customer experience impacts different types of customer loyalty and this pattern varies across specific companies. For some companies, improving the customer experience will primarily drive new customer growth (advocacy loyalty). For other companies, improving the customer experience will also significantly drive existing customer growth (retention and purchasing loyalty).

Companies who measure and understand different types of customer loyalty and how they are impacted by the customer experience have an advantage over companies who measure only one type of loyalty (typically advocacy):

  • Companies can target solutions to optimize different types of customer loyalty to improve business growth. For example, including retention loyalty questions (e.g., “likelihood to quit”) and purchasing loyalty questions (e.g., “likelihood to buy different”) can help companies understand why customers are leaving and identify ways to increase customers’ purchasing behavior, respectively.
  • Key performance indicators (KPIs) can be identified for each type of customer loyalty. Identification of different KPIs (key drivers of customer loyalty) helps companies ensure they are monitoring all important customer experience areas. Identifying and monitoring all KPIs helps ensure the entire company is focused on matters that are important to the customer and his/her loyalty.
  • Companies are better equipped to quantify the value of their CEM program and obtain more accurate estimates of the Return on Investment (ROI) of the program. The ROI of a specific improvement opportunity will depend on how the company measures customer loyalty. If only advocacy loyalty is measured, the estimate of ROI is based on new customer growth. When companies measure advocacy, purchasing and retention loyalty, the estimate of ROI is based on new and existing customer growth.

Final Thoughts

The primary goal of CEM is to improve customer loyalty. Companies that narrowly define customer loyalty are missing out on opportunities to fully understand the impact that their CEM program has on the company’s bottom line. Companies need to ensure they are comprehensively measuring all facets of customer loyalty. A poor customer loyalty measurement approach can lead to sub-optimal business decisions, missed opportunities for business growth and an incomplete picture of the health of the customer relationship.

Originally Posted at: Measuring Customer Loyalty is Essential for a Successful CEM Program

Reinforcing Data Governance with Data Discovery

Historically, data discovery has existed at the nexus point between data preparation and analytics. The discovery process was frequently viewed as the means of gathering the requisite data for analytics while illustrating relationships between data elements which might inform them.

Today, data discovery’s utility has considerably broadened. Aided by machine learning and data cataloging techniques, data discovery is playing an increasingly pivotal role in enabling—and solidifying—data governance for today’s highly regulated data environments.

“We now have the automated capability to see where data elements are showing up and what are the new instances of them that are being introduced [throughout the enterprise],” Io-Tahoe CEO Oksana Sokolovsky revealed. “Now, users can govern that as data owners and actually have this visibility into their changing data landscapes.”

The additional governance repercussions of data discovery (encompassing aspects of data quality, data stewardship, and data disambiguation), coupled with its traditional importance for enhancing analytics, makes this facet of data management more valuable than ever.

Data Cataloging
The expansion of data discovery into facets of data governance is rooted in the fundamental need to identify where data are for what specific purposes. Data cataloging immensely enriches this process by providing a means of detailing critical information about data assets that provide a blueprint for data governance. Moreover, discovery and cataloging systems which deploy machine learning are targeted towards business users, allowing them to “create business rules, maintain them, search for elements, define policies, and start providing the governance workflow for the data elements,” Sokolovsky said. The plethora of attributes imputed to data within catalogs is vast, including details about metadata, sensitivity, and access or security concerns. Another crucial advantage is that all of this information is stored in a centralized location. “The catalog enhances the metadata and enhances the business description of the data elements,” Sokolovsky explained. “It enables other business users to leverage that information. The catalog function now makes data discovery an actionable output for users.”

Exceeding Metadata Relationships
A number of data discovery tools are almost entirely based on metadata—providing circumscribed value in situations in which there is limited metadata. The most common of these involve data lakes, in which data elements “might not have any metadata associated with them, but we still need to tie them back to the same element which appears in your original sources,” Sokolovsky commented. Other metadata limitations involve scenarios in which there is not enough metadata, or metadata that applies to a specific use case. In these instances and others, discovery techniques informed by machine learning are superior because they can identify relationships among the actual data, as well as among any existent metadata.

According to Sokolovsky, this approach empowers organizations to “now pick up 30 to 40 percent more [information about data elements], which used to be input manually by subject matter experts.” The disambiguation capability of this approach supports basic aspects of data quality. For example, when determining if data referencing ‘Washington’ applies to names, locations, or businesses, machine learning “algorithms can narrow that down and say we found 700 Washington instances; out of that, X number is going to be last names, X number is going to be first names, X number is going to be streets, and X number is going to be cities,” Sokolovsky said.
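The disambiguation step can be sketched as tallying predicted categories over the found instances. The hard-coded rules below stand in for what a machine learning model would actually infer; the contexts and categories are invented for illustration.

```python
from collections import Counter

def classify_washington(context: str) -> str:
    """Illustrative rule-based guess at what a 'Washington' instance refers to."""
    ctx = context.lower()
    tokens = ctx.split()
    if {"st", "street", "ave", "blvd"} & set(tokens):
        return "street"
    if {"mr.", "ms.", "dr."} & set(tokens):
        return "last name"
    if "moved to" in ctx or "lives in" in ctx or "city of" in ctx:
        return "city"
    return "unknown"

instances = [
    "1600 Washington St",
    "Mr. Washington attended",
    "moved to Washington last year",
]
print(Counter(classify_washington(c) for c in instances))
```

A learning-based system would produce the same kind of per-category tally, but derive the rules from the data rather than from a hand-written list.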

Data Stewardship
The automation capabilities of machine learning for data discovery also support governance by democratizing the notion of data stewardship. It does so in two ways. Not only do those tools provide much needed visibility for employees in dedicated stewardship roles, but they also enable business users to add citizen stewardship responsibilities to their positions. The expansion of stewardship capabilities is useful for increasing data quality for data owners in particular, who “become more like stewards,” Sokolovsky maintained. “They can now say okay, out of 75 instances 74 seem to be accurate and one is bad. That’s going to continue to enhance the machine learning capability.”

The capacity for disambiguating data, reinforcing data quality and assisting data stewardship that this approach facilitates results in higher levels of accuracy for data in any variety of use cases. Although a lot of this work is engineered by machine learning, the human oversight of data stewardship is instrumental for its ultimate success. “The user should interact with the system to go and do the validation and say I accept or I reject [the machine learning results],” Sokolovsky said. “Because of that not only are they in control of the governance, but also the system becomes smarter and smarter in the client’s environment.”

Working for Business Users
The deployment of data discovery and data cataloging for data governance purposes indicates both the increasing importance of governance and machine learning. Machine learning is the intermediary that improves the data discovery process to make it suitable for the prominent data governance and regulatory compliance concerns contemporary enterprises face. It is further proof that these learning capabilities are not only ideal for analytics, but also for automating other processes that give those analytics value (such as data quality), which involves “working directly with the business user,” Sokolovsky said.

Source by jelaniharper

Data Matching with Different Regional Data Sets

When it comes to Data Matching, there is no ‘one size fits all menu’. Different matching routines, different algorithms and different tuning parameters will all apply to different datasets. You generally can’t take one matching setup used to match data from one distinct data set and apply it to another. This proves especially true when matching datasets from different regions or countries. Let me explain.

Data Matching for Attributes that are Unlikely to Change

Data Matching is all about identifying unique attributes that a person or object has, and then using those attributes to match individual members within that set. These attributes should be things that are ‘unlikely to change’ over time. For a person, these would be things like “Name” and “Date of Birth”. Attributes like “Address” are much more likely to change and are therefore of less importance, although this does not mean you should not use them. It is just that they are less unique and therefore lend less weight to the matching process. In the case of objects, they would be attributes that uniquely identify that object; for, say, a cup (if you manufactured cups), those attributes would be things like “Size”, “Volume”, “Shape” and “Color”. The attributes themselves are not too important; the point is that they should be things that are unlikely to change over time.
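The idea that stable attributes lend more weight can be sketched as a weighted agreement score. The weights and field names below are invented for illustration, not taken from any particular tool, and real matchers use fuzzy rather than exact comparisons.

```python
# Hypothetical per-attribute weights: stable attributes (name, date of
# birth) count more toward a match than volatile ones (address).
WEIGHTS = {"name": 0.5, "birth_date": 0.4, "address": 0.1}

def match_score(rec_a: dict, rec_b: dict) -> float:
    """Sum the weights of the attributes on which two records agree exactly."""
    return sum(w for attr, w in WEIGHTS.items()
               if rec_a.get(attr) == rec_b.get(attr))

a = {"name": "Jane Doe", "birth_date": "1985-02-11", "address": "12 Elm St"}
b = {"name": "Jane Doe", "birth_date": "1985-02-11", "address": "9 Oak Ave"}
print(round(match_score(a, b), 1))  # name and birth date agree; address differs
```

A score near 1.0 suggests the same person despite the changed address, which is exactly why the volatile attribute carries the smallest weight.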

So, back to data relating to people, which is the main use case for data matching. Here comes the challenge: can’t we take one set of data matching routines built for a ‘person database’ and simply reuse them for another dataset? Unfortunately, the answer is no. There are always going to be differences in the data that will manifest themselves during matching, and never more so than when using datasets from different geographical regions such as different countries. Data matching routines are always tuned for a specific dataset, and while there are always differences from dataset to dataset, the differences become much more distinct when you choose data from different geographical regions. Let us explore this some more.

Data Matching for Regional Data Sets

First, I must mention a caveat. I am going to assume that matching is done in Western character sets, using Romanized names, not in languages or character sets such as Japanese or Chinese. This does not mean the data must contain only English or Western names, far from it; it just means the matching routines are those we can use for names written in Western, or Romanized, characters. I will not consider matching with non-Western characters here.

Now, let us consider the matching of names. For the name itself, we use routines that do things like phoneticize the names and then look for differences between the results. But first, the methodology involves blocking on names: sorting the data into piles that share similar attributes. It's the age-old 'matching the socks' problem. You wouldn't match socks in a great pile of fresh laundry by picking one sock at a time and searching the whole pile for its twin; that would be very inefficient and take ages. You instinctively know what to do: you sort them first into similar piles, or 'blocks', of similar socks. Say, a pile of black socks, a pile of white socks, a pile of colored socks, and then you search through those smaller piles for matches. It's the same principle here. We sort the data into blocks of similar attributes, then match within those blocks. Ideally, these blocks should be of a manageable and similar size. Now, here comes the main point.
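The blocking-then-matching idea can be sketched in a few lines. The snippet below uses a simplified Soundex-style phonetic key as the blocking attribute; it is a minimal illustration of the principle, not a production phonetic algorithm (real Soundex has extra rules, e.g. for 'h' and 'w', that are omitted here):

```python
# Phonetic blocking sketch: names are grouped into "piles" (blocks) that
# share the same simplified Soundex-style key, so pairwise matching only
# needs to happen within each block, never across the whole dataset.

from collections import defaultdict

# Classic Soundex digit groups: bfpv=1, cgjkqsxz=2, dt=3, l=4, mn=5, r=6.
CODES = {c: d for d, letters in
         enumerate(["bfpv", "cgjkqsxz", "dt", "l", "mn", "r"], start=1)
         for c in letters}

def soundex(name):
    """Return a simplified 4-character Soundex-style key."""
    name = name.lower()
    key = name[0].upper()
    prev = CODES.get(name[0])
    for c in name[1:]:
        code = CODES.get(c)
        if code and code != prev:  # skip vowels and repeated codes
            key += str(code)
        prev = code
    return (key + "000")[:4]       # pad/truncate to 4 characters

def block(names):
    """Sort names into blocks sharing the same phonetic key."""
    blocks = defaultdict(list)
    for name in names:
        blocks[soundex(name)].append(name)
    return blocks

for key, members in block(["Smith", "Smyth", "Jones", "Joanes"]).items():
    print(key, members)
```

"Smith" and "Smyth" land in one block and "Jones" and "Joanes" in another, so each candidate pair is only ever compared against a handful of phonetically similar names.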

Different geographic regions produce different distributions of block sizes and types, which changes the matching that needs to be done within those blocks. This manifests itself in the performance, efficiency, accuracy and overall quality of the matching. The distribution of names varies widely between regions, and therefore between datasets, and this can cause big differences in the results obtained.

Let's look specifically at surnames for a moment. In the UK, according to the Office for National Statistics, around 270,000 surnames cover around 95% of the population. Obviously, some surnames are much more common than others. Surnames such as Jones, Brown and Patel, for example, are amongst the most common, but the important point is that these names follow a specific distribution if we choose to plot them: a big cluster of common names at one end, followed by a long tail of rarer names at the other, and the shape of that curve is specific to the UK and to the UK alone. Different countries or regions have differently shaped distributions. This is an important point. Some regions have a much narrower distribution, where names are much more commonly shared; in others, the distribution is broader and names are shared far less often. This affects the results of any matching we do within datasets emanating from those regions. A narrower distribution of names results in bigger block sizes, and therefore more data to match within each block; this can take longer, be less efficient, and even affect the accuracy of the matches. A broader distribution results in many more blocks of a smaller size, each of which must be processed.

Data Matching Variances Across the Globe

Let's take a look at how this varies across the globe. A good example of regional differences comes from Taiwan, where roughly forty percent of the population share just six surnames (in their Romanized forms). Matching names within Taiwanese datasets will therefore result in some very large blocks. Thailand, on the other hand, presents a completely different challenge. In Thailand, there are no common surnames: a law called the 'Surname Act' states that surnames cannot be duplicated and that families should have unique surnames. It is incredibly rare for any two people in Thailand to share the same name. In our blocking exercise, this would result in a huge number of very small blocks.
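The contrast between the two cases is easy to simulate. The snippet below builds two synthetic name lists, one Taiwan-like (40% of records drawn from six surnames) and one Thailand-like (every surname unique), and compares the resulting block counts and sizes; the datasets are invented for demonstration only:

```python
# Illustrative simulation of how a region's surname distribution drives
# block counts and block sizes when blocking on surname.

import random

random.seed(1)

# Taiwan-like: 40% of records drawn from just six common surnames.
common = ["Chen", "Lin", "Huang", "Chang", "Lee", "Wang"]
taiwan_like = [random.choice(common) if random.random() < 0.4
               else f"Surname{i}" for i in range(10_000)]

# Thailand-like: effectively no two people share a surname.
thailand_like = [f"Unique{i}" for i in range(10_000)]

def block_stats(names):
    """Return (number of blocks, size of the largest block)."""
    sizes = {}
    for n in names:
        sizes[n] = sizes.get(n, 0) + 1
    return len(sizes), max(sizes.values())

for label, data in [("Taiwan-like", taiwan_like),
                    ("Thailand-like", thailand_like)]:
    n_blocks, biggest = block_stats(data)
    print(f"{label}: {n_blocks} blocks, largest block = {biggest}")
```

The Taiwan-like data yields a handful of very large blocks (hundreds of records each) alongside many singletons, while the Thailand-like data yields ten thousand blocks of size one: two very different workloads for the same matching engine.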

The two examples above may be extreme, but they perfectly illustrate the challenge. Datasets containing names vary from region to region, and the blocking and matching strategy can therefore vary widely from place to place. You cannot simply reuse the same routines and algorithms across datasets: each dataset is unique and must be treated as such. A different matching strategy must be adopted for each set, each matching exercise must be 'tuned' to that specific dataset to find the most effective approach, and the results will vary. It doesn't matter which toolset you choose; the same principle applies to all, because the issue lies in the data itself and cannot be changed or ignored.

To summarize: regional, geographic, cultural and language variations make a big difference to how you match personal data across datasets. Each dataset must be treated individually. You must understand the data it contains, then tune and optimize your matching routines and strategy for that specific dataset. Blocking and matching strategies vary widely from region to region, so you cannot simply reuse the exact same approach and routines from dataset to dataset. Until next time!

The post Data Matching with Different Regional Data Sets appeared first on Talend Real-Time Open Source Data Integration Software.
How to solve the top three problems of vulnerability management

As the threat landscape around us grows increasingly sophisticated, enterprises need to find ways to ramp up security and break away from conventional cybersecurity practices, which have proven to be both ineffective and time-intensive.

A cybersecurity practice that has remained in the spotlight in recent years is vulnerability management. For most enterprises and individuals, vulnerability management starts and ends at scanning tools, a dangerous approach that causes more harm than good. Companies are often firm in their belief that investing in an "industry-favorite" vulnerability scanner will enable them to effectively assess and manage the threats facing them, which could not be further from reality.

The primary reason most organizations fail to hit the mark on effective vulnerability management is simple: companies usually have a skewed idea of what vulnerability management is in the first place. For most enterprises, the entire notion revolves around scanning the organization's network for threats.

The greatest flaw in this definition is that it overlooks crucial aspects of vulnerability management: high-level processes such as the discovery, reporting and prioritization of vulnerabilities, along with formulating effective responses to the discovered threats. Beyond these four key aspects, a strong vulnerability management framework focuses on the larger cybersecurity picture and works in a cyclic manner, where one sub-process flows naturally into the next, ultimately reducing business risk.

That said, a significant portion of companies, while setting up their vulnerability management tools, create problems that can sabotage the framework before it is even fully in place. To ease this tedious process for our readers, we've listed some of the most common problems that enterprises face below.

Problem #1- Poor prioritization of threats

As we've already discussed, one of the key components of effective vulnerability management is prioritization. Unfortunately, poor prioritization of threats is also one of the most common problems enterprises encounter when initiating their vulnerability management tools and scanners.

When it comes to prioritizing threats and vulnerabilities, security teams need to follow a set sequence of steps within their vulnerability management protocol. A standard framework that produces highly satisfactory cybersecurity results dictates that companies first 'discover' threats, then prioritize their assets, and then conduct an in-depth assessment of the discovered threats and vulnerabilities.

Unlike the first three steps, the latter half of the procedure revolves around the reporting, remediation and verification of the threats the enterprise faces. Most organizations, however, tend to skip straight from the 'discovery' step to remediation, foregoing crucial elements such as threat prioritization and reporting.

The complete or partial failure to prioritize assets can have devastating consequences for an enterprise, since cybersecurity teams end up devoting the same amount of time and labor to otherwise menial or routine tasks. Moreover, poor prioritization is bound to undermine the vulnerability management program: the same threats will arise month after month while security teams are preoccupied with remediating other issues.

Although insufficient prioritization can feel like being stuck in an endless cycle of failure, enterprises can solve the problems it creates by conducting a thorough analysis of the scan results. Once companies have a clear picture of the threats facing them, they can assess which assets require weekly scans and which require monthly scans.
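One simple way to analyze scan results is to rank findings by a risk score that combines vulnerability severity with asset criticality, so remediation effort goes to the highest-risk items first. The sketch below is a hedged illustration: the field names, example findings and scoring formula are assumptions for demonstration, not the output of any specific scanner:

```python
# Hypothetical threat prioritization: rank scan findings by
# severity (CVSS score) multiplied by an asset-criticality factor.

findings = [
    {"asset": "public web server", "cvss": 9.8, "criticality": 3},
    {"asset": "internal wiki",     "cvss": 9.8, "criticality": 1},
    {"asset": "payroll database",  "cvss": 6.5, "criticality": 3},
    {"asset": "test VM",           "cvss": 4.0, "criticality": 1},
]

def risk(finding):
    """Risk score: raw severity weighted by how critical the asset is."""
    return finding["cvss"] * finding["criticality"]

# Highest-risk findings first. Note that the same CVSS score ranks very
# differently depending on the criticality of the affected asset.
for f in sorted(findings, key=risk, reverse=True):
    print(f"{risk(f):5.1f}  {f['asset']}")
```

Even this crude weighting separates the two 9.8-severity findings: the one on a business-critical, internet-facing asset rises to the top of the queue, while the identical vulnerability on a low-value internal system drops below a moderate flaw on the payroll database.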

Once assets have been prioritized, enterprises can devote their cybersecurity resources and personnel to the assets that need them most. Not only does this enable security teams to formulate fixes more quickly, it also lets companies invest effectively, directing resources to the fixes that actually require them instead of mindlessly dedicating assets to threats that don't.

Problem #2- A lack of structure and organization within IT teams

A rather critical mistake that enterprises make with their vulnerability management programs, one that often slips under the radar, is a sheer lack of organization and structure within their IT teams.

As far as vulnerability management programs are concerned, each individual tasked with upholding an organization's security needs a clear set of instructions on the requirements and limitations of their role within the framework. Only once each member of an IT team has a clear-cut idea of the specifics of their role can an organization hope for effective vulnerability management.

Unfortunately, most teams tasked with vulnerability management fail to formulate and run effective programs because their vision is highly limited, usually focusing solely on the scanning element. To make matters worse, we've seen too many cybersecurity specialists forget how crucial teamwork is to the proper execution of projects.

When it comes to the effective execution of vulnerability management programs, the most important step organizations can take is setting long- and short-term goals, along with defining clear policies, particularly around ownership. While setting up the program, security teams need to dedicate time to answering important questions about each person's role and what the program hopes to accomplish.

Problem #3- Inattention to maintenance

The last mistake we've chosen to highlight occurs quite frequently: a lack of defined maintenance windows, which can prove fatal for remediation teams. The unavailability of maintenance windows, combined with the use of ad hoc windows to patch servers, can have devastating consequences for your vulnerability management program.

Unfortunately, many enterprises overlook the importance of a dedicated maintenance period and fail to allocate the resources required for patching and rebooting the systems in use. Enterprises also need a robust patching platform to ensure that the vulnerabilities they've fixed don't reappear.

To conclude

We can only hope that this article has brought our readers' attention to some of the most common problems encountered when setting up a robust vulnerability management program within an organization's cybersecurity infrastructure.

In today's ever-evolving threat landscape, with enterprises relying ever more heavily on cybersecurity tools, vulnerability management is no longer just another "IT expense"; it is a crucial aspect of survival in the current digital landscape.

The post How to solve the top three problems of vulnerability management appeared first on Big Data Made Simple.