Reinforcing Data Governance with Data Discovery

Historically, data discovery has existed at the nexus point between data preparation and analytics. The discovery process was frequently viewed as the means of gathering the requisite data for analytics while illustrating relationships between data elements which might inform them.

Today, data discovery’s utility has considerably broadened. Aided by machine learning and data cataloging techniques, data discovery is playing an increasingly pivotal role in enabling—and solidifying—data governance for today’s highly regulated data environments.

“We now have the automated capability to see where data elements are showing up and what are the new instances of them that are being introduced [throughout the enterprise],” Io-Tahoe CEO Oksana Sokolovsky revealed. “Now, users can govern that as data owners and actually have this visibility into their changing data landscapes.”

The additional governance repercussions of data discovery (encompassing aspects of data quality, data stewardship, and data disambiguation), coupled with its traditional importance for enhancing analytics, make this facet of data management more valuable than ever.

Data Cataloging
The expansion of data discovery into facets of data governance is rooted in the fundamental need to identify where data reside and for what specific purposes they are used. Data cataloging immensely enriches this process by detailing critical information about data assets, providing a blueprint for data governance. Moreover, discovery and cataloging systems which deploy machine learning are targeted towards business users, allowing them to “create business rules, maintain them, search for elements, define policies, and start providing the governance workflow for the data elements,” Sokolovsky said. The range of attributes ascribed to data within catalogs is vast, including details about metadata, sensitivity, and access or security concerns. Another crucial advantage is that all of this information is stored in a centralized location. “The catalog enhances the metadata and enhances the business description of the data elements,” Sokolovsky explained. “It enables other business users to leverage that information. The catalog function now makes data discovery an actionable output for users.”
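
As a rough sketch of the kind of information a catalog entry might hold, the snippet below models one element and a toy access rule. The field names and policy are hypothetical illustrations, not Io-Tahoe's actual data model.

from dataclasses import dataclass, field
from typing import List

@dataclass
class CatalogEntry:
    """Hypothetical data-catalog record; field names are illustrative only."""
    element_name: str                 # e.g. "customer_email"
    source_system: str                # where the element physically lives
    business_description: str         # plain-language meaning for business users
    sensitivity: str                  # e.g. "PII", "confidential", "public"
    data_owner: str                   # accountable steward
    allowed_roles: List[str] = field(default_factory=list)  # access policy
    tags: List[str] = field(default_factory=list)           # searchable metadata

def can_access(entry: CatalogEntry, role: str) -> bool:
    """A toy governance rule: only listed roles may query sensitive elements."""
    if entry.sensitivity == "public":
        return True
    return role in entry.allowed_roles

email = CatalogEntry(
    element_name="customer_email",
    source_system="crm_prod.customers",
    business_description="Primary contact email for an active customer",
    sensitivity="PII",
    data_owner="marketing_data_owner",
    allowed_roles=["marketing_analyst", "data_steward"],
)
print(can_access(email, "marketing_analyst"))  # True
print(can_access(email, "finance_intern"))     # False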

Exceeding Metadata Relationships
A number of data discovery tools are almost entirely based on metadata—providing circumscribed value in situations in which there is limited metadata. The most common of these involve data lakes, in which data elements “might not have any metadata associated with them, but we still need to tie them back to the same element which appears in your original sources,” Sokolovsky commented. Other metadata limitations involve scenarios in which there is not enough metadata, or in which the metadata applies only to a specific use case. In these instances and others, discovery techniques informed by machine learning are superior because they can identify relationships among the actual data, as well as among any existent metadata.

According to Sokolovsky, this approach empowers organizations to “now pick up 30 to 40 percent more [information about data elements], which used to be input manually by subject matter experts.” The disambiguation capability of this approach supports basic aspects of data quality. For example, when determining if data referencing ‘Washington’ applies to names, locations, or businesses, machine learning “algorithms can narrow that down and say we found 700 Washington instances; out of that, X number is going to be last names, X number is going to be first names, X number is going to be streets, and X number is going to be cities,” Sokolovsky said.
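
A toy, rule-based stand-in for that disambiguation step might look like the following. The reference lists, column names and records are invented, and a real system would learn these distinctions from the data rather than hard-code them.

# Rule-based stand-in for the ML-driven disambiguation described above.
from collections import Counter

KNOWN_FIRST_NAMES = {"george", "denzel", "washington"}   # toy lookup sets
KNOWN_CITIES = {"washington", "seattle", "spokane"}
STREET_SUFFIXES = ("st", "street", "ave", "avenue", "blvd")

def classify(value: str, column_name: str) -> str:
    """Guess what a 'Washington' instance refers to from its context."""
    v = value.strip().lower()
    col = column_name.lower()
    if any(v.endswith(" " + s) for s in STREET_SUFFIXES) or "street" in col:
        return "street"
    if "city" in col and v in KNOWN_CITIES:
        return "city"
    if "last" in col or "surname" in col:
        return "last name"
    if "first" in col and v in KNOWN_FIRST_NAMES:
        return "first name"
    return "unknown"

records = [
    ("Washington", "last_name"),
    ("Washington", "city"),
    ("1600 Washington Ave", "street_address"),
    ("Washington", "first_name"),
]
print(Counter(classify(v, c) for v, c in records))
# e.g. Counter({'last name': 1, 'city': 1, 'street': 1, 'first name': 1})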

Data Stewardship
The automation capabilities of machine learning for data discovery also support governance by democratizing the notion of data stewardship. It does so in two ways. Not only do those tools provide much needed visibility for employees in dedicated stewardship roles, but they also enable business users to add citizen stewardship responsibilities to their positions. The expansion of stewardship capabilities is useful for increasing data quality for data owners in particular, who “become more like stewards,” Sokolovsky maintained. “They can now say okay, out of 75 instances 74 seem to be accurate and one is bad. That’s going to continue to enhance the machine learning capability.”

The capacity that this approach provides for disambiguating data, reinforcing data quality and assisting data stewardship results in higher levels of accuracy for data across a variety of use cases. Although much of this work is engineered by machine learning, the human oversight of data stewardship is instrumental for its ultimate success. “The user should interact with the system to go and do the validation and say I accept or I reject [the machine learning results],” Sokolovsky said. “Because of that not only are they in control of the governance, but also the system becomes smarter and smarter in the client’s environment.”

Working for Business Users
The deployment of data discovery and data cataloging for data governance purposes indicates the increasing importance of both governance and machine learning. Machine learning is the intermediary that improves the data discovery process enough to make it suitable for the prominent data governance and regulatory compliance concerns contemporary enterprises face. It is further proof that these learning capabilities are not only ideal for analytics, but also for automating other processes that give those analytics value (such as data quality), which involves “working directly with the business user,” Sokolovsky said.

Source by jelaniharper

Data Matching with Different Regional Data Sets

When it comes to Data Matching, there is no ‘one size fits all menu’. Different matching routines, different algorithms and different tuning parameters will all apply to different datasets. You generally can’t take one matching setup used to match data from one distinct data set and apply it to another. This proves especially true when matching datasets from different regions or countries. Let me explain.

Data Matching for Attributes that are Unlikely to Change

Data Matching is all about identifying the unique attributes that a person, or object, has, and then using those attributes to match individual members within that set. These attributes should be things that are unlikely to change over time. For a person, these would be things like “Name” and “Date of Birth”. Attributes like “Address” are much more likely to change and are therefore of less importance, although this does not mean you should not use them. It’s just that they are less unique and therefore lend less weight to the matching process. In the case of objects, they would be attributes that uniquely identify that object; in the case of, say, a cup (if you manufactured cups), those attributes would be things like “Size”, “Volume”, “Shape”, “Color”, etc. The attributes themselves are not too important; what matters is that they should be things that are unlikely to change over time.

So, back to data relating to people, which is generally the main use case for data matching. Here comes the challenge: can’t we take the data matching routines used for one ‘person database’ and simply reuse them for another dataset? Unfortunately, the answer is no. There are always going to be differences in the data that manifest themselves during matching, and never more so than when the datasets come from different geographical regions, such as different countries. Data matching routines are always tuned for a specific dataset, and whilst there will always be differences from dataset to dataset, the differences become much more distinct when you choose data from different geographical regions. Let us explore this some more.

Data Matching for Regional Data Sets

First, I must mention a caveat. I am going to assume that matching is done in western character sets, using Romanized names, not in languages or character sets such as Japanese or Chinese. This does not mean the data must contain only English or western names, far from it; it just means the matching routines are those which we can use for names written in western, or Romanized, characters. I will not consider matching using non-western characters here.

Now, let us consider the matching of names. To do this for the name itself, we use matching routines that do things like phoneticize the names and then look for differences between the results. But first, the methodology involves blocking on names: sorting the data into different piles that have similar attributes. It’s the age-old ‘matching the socks’ problem. You wouldn’t match socks in a great pile of fresh laundry by picking one sock at a time from the whole pile and then trying to find its duplicate. That would be very inefficient and take ages to complete. You instinctively know what to do: you sort them out first into similar piles, or ‘blocks’, of similar socks. Say, a pile of black socks, a pile of white socks, a pile of colored socks, and so on, and then you sort through those smaller piles looking for matches. It’s the same principle here. We sort the data into blocks of similar attributes, then match within those blocks. Ideally, these blocks should be of a manageable and similar size. Now, here comes the main point.
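
A minimal sketch of that blocking step, using Soundex as the phonetic key (any phonetic or other blocking key could be substituted):

# Records are grouped by a phonetic key before any detailed pairwise matching.
from collections import defaultdict

def soundex(name: str) -> str:
    """Classic Soundex: first letter plus three digits for following consonants."""
    codes = {**dict.fromkeys("BFPV", "1"), **dict.fromkeys("CGJKQSXZ", "2"),
             **dict.fromkeys("DT", "3"), "L": "4",
             **dict.fromkeys("MN", "5"), "R": "6"}
    name = "".join(c for c in name.upper() if c.isalpha())
    if not name:
        return ""
    encoded, prev = name[0], codes.get(name[0], "")
    for ch in name[1:]:
        code = codes.get(ch, "")
        if code and code != prev:
            encoded += code
        if ch not in "HW":          # H and W do not reset the previous code
            prev = code
    return (encoded + "000")[:4]

def build_blocks(surnames):
    blocks = defaultdict(list)
    for s in surnames:
        blocks[soundex(s)].append(s)
    return blocks

names = ["Smith", "Smyth", "Schmidt", "Jones", "Johns", "Patel", "Patil"]
for key, members in build_blocks(names).items():
    print(key, members)
# Detailed comparison (edit distance, date of birth, etc.) then runs only
# within each block instead of across all possible pairs.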

Different geographic regions will produce different distributions of block sizes and types, which changes the matching that needs to be done within those blocks; this can manifest itself in terms of the performance, efficiency, accuracy and overall quality of the matching. The distribution of names within different geographical regions, and therefore within different groups of data, can vary widely and can therefore cause big differences in the results obtained.

Let’s look specifically at surnames for a moment. In the UK, according to the Office for National Statistics, there are around 270,000 surnames that cover around 95% of the population. Now obviously, some surnames are much more common than others. Surnames such as Jones, Brown and Patel, for example, are amongst the most common, but the important thing is that there is a distribution of these names that follows a specific shape if we choose to plot it. There will be a big cluster of common names at one end, followed by a long tail of rarer names at the other, and the shape of the curve will be specific to the UK and to the UK alone. Different countries or regions will have differently shaped distributions. This is an important point. Some regions will have a much narrower distribution, where names are much more similar or common, whilst in other regions the distribution will be broader and names will be much less common. The overall number of distinct names could be much larger or much smaller, and this will therefore affect the results of any matching we do within datasets emanating from those regions. A smaller distribution of names results in bigger block sizes and therefore more data to match on within those blocks. This could take longer, be less efficient and could even affect the accuracy of those matches. A larger distribution of names results in many more blocks of a smaller size, each of which needs to be processed.
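
To see why the shape of the distribution matters, a small back-of-the-envelope calculation helps: the number of candidate pairs inside a block of size n is n(n-1)/2, so a few very large blocks cost far more comparisons than many tiny ones. The frequencies below are invented purely for illustration.

from collections import Counter

def candidate_pairs(block_sizes):
    # Total pairwise comparisons needed across all blocks.
    return sum(n * (n - 1) // 2 for n in block_sizes)

# Two hypothetical datasets of 10,000 records each.
concentrated = Counter({"Chen": 2500, "Lin": 2000, "Huang": 1500,
                        "Wang": 1500, "Lee": 1500, "Chang": 1000})
dispersed = Counter({f"Surname{i}": 2 for i in range(5000)})  # nearly unique names

print(candidate_pairs(concentrated.values()))  # 8,995,000 candidate pairs
print(candidate_pairs(dispersed.values()))     # 5,000 candidate pairs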

Data Matching Variances Across the Globe

Let’s take a look at how this varies across the globe. A good example of regional differences comes from Taiwan, where roughly forty percent of the population share just six different surnames (when using the Romanised form). Matching within datasets using names from Taiwanese data will therefore result in some very large blocks. Thailand, on the other hand, presents a completely different challenge. In Thailand there are no common surnames: there is actually a law, the ‘Surname Act’, which states that surnames cannot be duplicated and that families should have unique surnames. It is therefore incredibly rare for two unrelated people to share the same surname, and in our blocking exercise this would result in a huge number of very small blocks.

The two examples above may be extreme, but they perfectly illustrate the challenge. Datasets containing names vary from region to region, and the blocking and matching strategy can therefore vary widely from place to place. You cannot simply use the same routines and algorithms for different datasets; each dataset is unique and must be treated as such. Different matching strategies must be adopted for each set, each matching exercise must be ‘tuned’ for that specific dataset in order to find the most effective strategy, and the results will vary. It doesn’t matter what toolset you choose to use; the same principle applies to all, as the issue lies in the data itself and cannot be changed or ignored.

To summarize, the general point is that regional, geographic, cultural and language variations can make big differences to how you go about matching personal data. Each dataset must be treated on its own terms: understand the data it contains, then tune and optimize your matching routines and strategy for that specific dataset. Blocking and matching strategies will vary from region to region, so you cannot simply reuse the exact same approach and routines from dataset to dataset. Until next time!

The post Data Matching with Different Regional Data Sets appeared first on Talend Real-Time Open Source Data Integration Software.

Source: Data Matching with Different Regional Data Sets

Mar 05, 20: #AnalyticsClub #Newsletter (Events, Tips, News & more..)


[  COVER OF THE WEEK ]

Data interpretation

[ AnalyticsWeek BYTES]

>> Jul 20, 17: #AnalyticsClub #Newsletter (Events, Tips, News & more..) by admin

>> Big Data Insights in Healthcare, Part II. A Perspective on Challenges to Adoption by froliol

>> Jul 05, 18: #AnalyticsClub #Newsletter (Events, Tips, News & more..) by admin


[ FEATURED COURSE]

A Course in Machine Learning


Machine learning is the study of algorithms that learn from data and experience. It is applied in a vast variety of application areas, from medicine to advertising, from military to pedestrian. Any area in which you need… more

[ FEATURED READ]

The Black Swan: The Impact of the Highly Improbable


A black swan is an event, positive or negative, that is deemed improbable yet causes massive consequences. In this groundbreaking and prophetic book, Taleb shows in a playful way that Black Swan events explain almost eve… more

[ TIPS & TRICKS OF THE WEEK]

Finding success in your data science career? Find a mentor
Yes, most of us don’t feel the need, but most of us really could use one. As most data science professionals work in isolation, getting an unbiased perspective is not easy. Many times, it is also not easy to understand what the data science progression should look like. A network of mentors addresses these issues easily: it gives data professionals an outside perspective and an unbiased ally. It’s extremely important for successful data science professionals to build a mentor network and use it throughout their careers.

[ DATA SCIENCE Q&A]

Q: Explain selection bias (with regard to a dataset, not variable selection). Why is it important? How can data management procedures such as missing data handling make it worse?
A: * Selection bias: the selection of individuals, groups or data for analysis in such a way that proper randomization is not achieved
Types:
– Sampling bias: systematic error due to a non-random sample of a population, causing some members to be less likely to be included than others
– Time interval: a trial may be terminated early at an extreme value (often for ethical reasons), but the extreme value is likely to be reached by the variable with the largest variance, even if all the variables have similar means
– Data: “cherry picking”, when specific subsets of the data are chosen to support a conclusion (citing examples of plane crashes as evidence that airline flight is unsafe, while ignoring the far more common flights that complete safely)
– Studies: performing experiments and reporting only the most favorable results
– Selection bias can lead to inaccurate or even erroneous conclusions
– Statistical methods can generally not overcome it

Why does missing data handling make it worse?
– Example: individuals who know or suspect that they are HIV positive are less likely to participate in HIV surveys
– Missing data handling amplifies this effect, since imputation is based on the mostly HIV-negative respondents
– Prevalence estimates will therefore be inaccurate
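
A quick simulation of the HIV-survey example makes the effect concrete; all numbers below are invented.

import random

random.seed(0)
TRUE_PREVALENCE = 0.10
N = 100_000

population = [random.random() < TRUE_PREVALENCE for _ in range(N)]
# Response probability depends on status: positives participate less often.
responds = [random.random() < (0.30 if positive else 0.70) for positive in population]

respondents = [pos for pos, r in zip(population, responds) if r]
naive_estimate = sum(respondents) / len(respondents)

print(f"true prevalence:      {TRUE_PREVALENCE:.3f}")
print(f"estimate from sample: {naive_estimate:.3f}")   # roughly 0.045, not 0.10
# Imputing the non-respondents from the (mostly negative) respondents would
# push the estimate even further from the truth.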

Source

[ VIDEO OF THE WEEK]

#FutureOfData with @CharlieDataMine, @Oracle discussing running analytics in an enterprise


[ QUOTE OF THE WEEK]

If you can’t explain it simply, you don’t understand it well enough. – Albert Einstein

[ PODCAST OF THE WEEK]

Using Analytics to build A #BigData #Workforce


[ FACT OF THE WEEK]

According to estimates, the volume of business data worldwide, across all companies, doubles every 1.2 years.

Sourced from: Analytics.CLUB #WEB Newsletter

How to solve the top three problems of vulnerability management

As the threat landscape around us continues to grow increasingly sophisticated, enterprises need to look for ways to ramp up security and break through the barriers laid out by the conventional practices of cybersecurity, which have proven to be both ineffective and time-intensive.

A cybersecurity practice that has remained in the spotlight over recent years is vulnerability management. For most enterprises and individuals, vulnerability management starts and ends at scanning tools, a dangerous approach that actually causes more harm than good. Companies are usually firm in their belief that investing in an “industry-favorite” vulnerability scanner will effectively enable them to better assess and manage the threats facing them, which could not be further from the reality of the situation.

The primary reason most organizations fail to hit the mark on effective vulnerability management is simple: companies usually have a pretty skewed idea of what vulnerability management is in the first place. For most enterprises, the entire notion of vulnerability management revolves around scanning an organization’s network for threats.

The greatest flaw with this definition of vulnerability management is that it overlooks crucial aspects of the discipline, which include high-level processes such as the discovery, reporting, and prioritization of vulnerabilities, along with formulating effective responses to the discovered threats. In addition to these four key aspects, a strong vulnerability management framework tends to focus on the larger cybersecurity picture and works in a cyclic manner, where one sub-process flows naturally into the next, ultimately resulting in the reduction of business risk.

Having said that, a significant portion of companies, in setting up their vulnerability management tools, create problems that can sabotage their vulnerability management framework before it has even finished being set up. To ease the tedious process of setting up a vulnerability management program for our readers, we’ve listed some of the most common problems that enterprises face below.

Problem #1: Poor prioritization of threats

As we’ve already discussed above, one of the key components of effective vulnerability management revolves around prioritization. Unfortunately, however, the poor prioritization of threats is one of the most common problems encountered by enterprises while initiating their vulnerability management tools and scanners.

When it comes to prioritizing threats and vulnerabilities, security teams need to follow a certain set of steps within their vulnerability management protocol. To prevent certain vulnerabilities, security teams can also consider using a VPN, since it encrypts internet traffic and provides protection from snooping eyes. A standard framework that produces highly satisfactory cybersecurity results dictates that companies first ‘discover’ threats, then prioritize assets, and then perform an in-depth assessment of the discovered threats and vulnerabilities.

Unlike the first three steps of the vulnerability management procedure, the latter half of the steps revolves around the reporting, remediation, and verification of the threats faced by the enterprise. Most organizations, however, tend to skip straight from the ‘discovery’ part of the framework to remediation, forgoing crucial elements such as threat prioritization and reporting.

The complete or partial failure to prioritize assets can often have devastating consequences for an enterprise, since cybersecurity teams end up devoting the same amount of time and labor to menial or routine tasks as to critical ones. Moreover, poor prioritization of threats and vulnerabilities is bound to result in the failure of the vulnerability management program, because the same threats will arise month after month while security teams are preoccupied with the remediation of other issues.

Although insufficient prioritization of assets might feel like being stuck in an endless cycle of failure, enterprises can actually solve the problems resulting from poor prioritization by conducting a thorough analysis of the scan results. Once companies have a clear idea of the threats facing them, they need to assess which of the assets require weekly scans and which require monthly scans.

After the prioritization of assets has been completed, enterprises can devote their cybersecurity resources and personnel to the assets that require them most. Not only does this enable security teams to formulate fixes more quickly, but it also enables companies to invest effectively, spending on fixes that actually require the resources instead of mindlessly dedicating assets to threats that don’t.
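
As a hedged sketch of what risk-based prioritization can look like in practice, the snippet below weights scan findings by asset criticality; the findings, scores and weighting are hypothetical, not a prescription.

# Hypothetical scan findings: severity (e.g. a CVSS base score) alone is not
# enough; weighting by asset criticality is what makes prioritization useful.
findings = [
    {"host": "payroll-db",  "vuln": "CVE-A", "cvss": 7.5, "asset_criticality": 5},
    {"host": "test-server", "vuln": "CVE-B", "cvss": 9.8, "asset_criticality": 1},
    {"host": "public-web",  "vuln": "CVE-C", "cvss": 6.1, "asset_criticality": 4},
]

def risk(f):
    return f["cvss"] * f["asset_criticality"]

for f in sorted(findings, key=risk, reverse=True):
    print(f"{risk(f):5.1f}  {f['host']:<12} {f['vuln']}")
# The critical-but-low-value test server drops below the moderate flaw on the
# payroll database, so remediation effort goes where business risk is highest.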

Problem #2: A lack of structure and organization within IT teams

A rather critical mistake that enterprises make with their vulnerability management programs, and one that often slips under the radar, is a sheer lack of organization and structure within their IT teams.

As far as vulnerability management programs are concerned, each individual concerned with the tedious task of upholding an organization’s security needs to have a clear set of instructions as to the requirements and limitations of their role within the vulnerability management framework. Only after each individual in an IT team has a clear-cut idea of the specifics of their role, can an organization hope for effective vulnerability management.

Unfortunately, however, most teams tasked with vulnerability management fail to formulate and run effective programs because their vision is highly limited and usually focuses solely on the scanning element of vulnerability management. To top it off, we’ve seen too many cybersecurity specialists who tend to forget how crucial teamwork is to the proper execution of projects.

When it comes to the effective execution of vulnerability management programs, the most important step that organizations can take is setting long and short-term goals, along with defining clear policies, particularly as far as ownership is concerned. While setting up the program, security teams need to dedicate time to answering important questions concerning each person’s role, and what the implementation of a vulnerability management program hopes to accomplish.

Problem #3: Inattention to maintenance

The last mistake that we’ve chosen to highlight is one that occurs quite frequently: the lack of defined maintenance windows, which can prove fatal for remediation teams. The unavailability of maintenance windows, combined with the use of ad hoc maintenance windows to patch servers, can result in devastating consequences for your vulnerability management program.

Unfortunately, many enterprises tend to overlook the importance of a dedicated maintenance period and fail to allocate the required resources to the patching and rebooting of the systems in use. Additionally, enterprises also need to rely on a robust patching platform to ensure that the vulnerabilities they’ve fixed don’t repeat themselves.

To conclude

At the end of this article, we can only hope that we’ve brought our readers’ attention to some of the most common problems encountered in setting up a robust vulnerability management program within an organization’s cybersecurity infrastructure.

In the ever-evolving threat landscape of today, and with the growing reliance of enterprises on cybersecurity tools, vulnerability management is no longer just another “IT expense”; rather, it is a crucial aspect of survival in the current digital landscape.


The post How to solve the top three problems of vulnerability management appeared first on Big Data Made Simple.

Originally Posted at: How to solve the top three problems of vulnerability management by administrator

Remote DBA Experts: Improve Business Intelligence with the Perfect Analytical Experts

Database management with the integration of Google Analytics is a salient and vital part of your business organization. The company needs to export information and data that is crucial for the development and progress of your business. Like everything else in your business, database management needs to be handled and controlled well. Being a business owner means you might not be equipped with these skills, and it is here that you need skilled and trained specialists in the field of database management to help you.

Database management for the present and future needs of your business

Many firms cannot afford full-time IT support for database management and Google Analytics. They outsource the task to remote database management specialists known as remote DBAs. These experts are trained and skilled at giving your business the attention it needs when it comes to database management and the integration of Google Analytics. These specialists help you to save time and money. The quality of services is improved, and you have trained professionals focused solely on the field of database management. They give you the opportunity to concentrate on the other core functions of your business.

Business Intelligence and Google Analytics

With the aid of remote DBAs, you can improve the business analytics of your company without hassles at all. In fact, with a team of experts, you can expand the business intelligence platform with success. The team helps you to understand and analyze the data for your needs. It is imperative for you to choose the right database for the analytics of your business, and it is important for you as a business owner to get both your Google Analytics and database in perfect sync. Experts will support this lifecycle, covering planning, design, establishment, deployment, and support across all the stages of business intelligence in your organization.

Google Analytics is an outstanding way for you to collect data and information. It helps you collect all the information about web traffic and provides you with information on additional data that you would be looking into. You will find Google Analytics data in your database. This SQL database contains the valuable marketing data and information your business needs for attracting customers and tracking its promotions. This is the same database where the sales team of your company keeps track of all the attributes of the product. You will also find Google Analytics data in the spreadsheet reports of your business and in the web applications that customers use. The esteemed company RemoteDBA.com says that when it comes to database management and Google Analytics, it ensures that all of its professionals are trained in the latest technology and support services. They help you get the tools and the workforce to ensure your business is not interrupted at all. This paves the way for incredible progress and gives you a competitive edge in the market.

Putting this analytical data into a single platform and providing you the information you need

Remote DBA analysts ensure they understand the data integration services in your business and the cloud. They write and manage data warehousing queries for archiving data and information. They are also experts in the field of multidimensional online analytical processing (OLAP) as well as cube design. They also spearhead interactive data processes and ensure that you get data to leverage in a fast and efficient manner. They make it a point to effectively integrate Google Analytics into your other systems as well. They help you export data from SQL Server, MS Access, Oracle, and MySQL.

Ensuring your servers and systems run smoothly for exporting the data you need

Remote DBAs ensure that all your servers and systems operate smoothly without hassles at all. Their sole motive is to remove any technical glitches that might lead to downtime in the future. It is important for you as a business to be aware of the Google Analytics data that is present in your organizational database and reports. Another advantage of remote DBAs is that they allow you to broaden your resources when it comes to expertise. Unlike in-house DBAs, they are not limited in technical knowledge and resources. They are aware of the latest trends and technology, and they are experts when it comes to Google Analytics and database management.

Getting proactive attention on time

In addition to the above, when remote DBAs ensure systems and servers are working properly, you hardly notice the need to protect them while the DBAs are around. These experts will collect database metrics and analyze trends. This helps you in capacity planning, and issues are resolved before they even take place. In this way, they protect your business and help you get the best for your needs without hassles at all. This alleviates much of your stress and tension, as you can focus better on business relationships and customer/client satisfaction.

Author Bio:

Sujain Thomas is an experienced IT support specialist and expert with RemoteDBA.com who helps small to large business houses with database and Google Analytics management for better business growth in the market.

Source by thomassujain

Feb 27, 20: #AnalyticsClub #Newsletter (Events, Tips, News & more..)


[  COVER OF THE WEEK ]

statistical anomaly

[ AnalyticsWeek BYTES]

>> Voices in AI – Episode 101: A Conversation with Cindi Howsen by analyticsweekpick

>> Machine Learning and Information Security: Impact and Trends by administrator

>> Autoscaling of Cloud-Native Apps Lowers TCO and Improves Availability by analyticsweek


[ FEATURED COURSE]

The Analytics Edge


This is an Archived Course
EdX keeps courses open for enrollment after they end to allow learners to explore content and continue learning. All features and materials may not be available, and course content will not be… more

[ FEATURED READ]

Hypothesis Testing: A Visual Introduction To Statistical Significance


Statistical significance is a way of determining if an outcome occurred by random chance, or did something cause that outcome to be different than the expected baseline. Statistical significance calculations find their … more

[ TIPS & TRICKS OF THE WEEK]

Fix the Culture, spread awareness to get awareness
Adoption of analytics tools and capabilities has not yet caught up to industry standards. Talent has always been the bottleneck to achieving comparable enterprise adoption. One of the primary reasons is a lack of understanding and knowledge among stakeholders. To facilitate wider adoption, data analytics leaders, users, and community members need to step up to create awareness within the organization. An aware organization goes a long way in helping get quick buy-ins and better funding, which ultimately leads to faster adoption. So be the voice that you want to hear from leadership.

[ DATA SCIENCE Q&A]

Q: What is an outlier? Explain how you might screen for outliers and what you would do if you found them in your dataset. Also, explain what an inlier is, how you might screen for them, and what you would do if you found them in your dataset.
A: Outliers:
– An observation point that is distant from other observations
– Can occur by chance in any distribution
– Often, they indicate measurement error or a heavy-tailed distribution
– Measurement error: discard them or use robust statistics
– Heavy-tailed distribution: high skewness, can’t use tools assuming a normal distribution
– Three-sigma rule (normally distributed data): 1 in 22 observations will differ from the mean by twice the standard deviation or more
– Three-sigma rule: 1 in 370 observations will differ from the mean by three times the standard deviation or more

Three-sigma rules example: in a sample of 1000 observations, the presence of up to 5 observations deviating from the mean by more than three times the standard deviation is within the range of what can be expected, being less than twice the expected number and hence within 1 standard deviation of the expected number (Poisson distribution).

If the nature of the distribution is known a priori, it is possible to see if the number of outliers deviates significantly from what can be expected. For a given cutoff (samples fall beyond the cutoff with probability p), the number of outliers can be approximated with a Poisson distribution with lambda = pn. Example: if one takes a normal distribution with a cutoff 3 standard deviations from the mean, p = 0.3%, and thus for a sample of 1,000 observations we can approximate the number of samples whose deviation exceeds 3 sigmas by a Poisson distribution with lambda = 3.

Identifying outliers:
– No rigid mathematical method
– Subjective exercise: be careful
– Boxplots
– QQ plots (sample quantiles vs. theoretical quantiles)

Handling outliers:
– Depends on the cause
– Retention: when the underlying model is confidently known
– Regression problems: only exclude points which exhibit a large degree of influence on the estimated coefficients (Cook’s distance)

Inlier:
– Observation lying within the general distribution of other observed values
– Doesn’t perturb the results but is non-conforming and unusual
– Simple example: observation recorded in the wrong unit (°F instead of °C)

Identifying inliers:
– Mahalanobis distance
– Used to calculate the distance between two random vectors
– Difference with Euclidean distance: accounts for correlations
– Discard them
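
A short screening sketch in Python, using the common IQR rule for univariate outliers and the Mahalanobis distance for multivariate inliers (thresholds are conventions, not rigid rules):

import numpy as np

rng = np.random.default_rng(42)
x = np.concatenate([rng.normal(50, 5, 500), [95.0]])     # one obvious outlier

q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
outliers = x[(x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)]
print("IQR-rule outliers:", outliers)

# Multivariate case: the distance accounts for correlation between variables.
X = rng.multivariate_normal([0, 0], [[1, 0.9], [0.9, 1]], size=500)
mu = X.mean(axis=0)
inv_cov = np.linalg.inv(np.cov(X, rowvar=False))
d = np.sqrt(np.einsum("ij,jk,ik->i", X - mu, inv_cov, X - mu))
print("points with Mahalanobis distance > 3:", int((d > 3).sum()))
# A point like (2, -2) sits inside each marginal range (an "inlier" by
# univariate checks) yet has a large Mahalanobis distance here.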

Source

[ VIDEO OF THE WEEK]

@AnalyticsWeek Panel Discussion: Big Data Analytics


[ QUOTE OF THE WEEK]

He uses statistics as a drunken man uses lamp posts—for support rather than for illumination. – Andrew Lang

[ PODCAST OF THE WEEK]

#FutureOfData Podcast: Peter Morgan, CEO, Deep Learning Partnership


[ FACT OF THE WEEK]

And one of my favourite facts: At the moment less than 0.5% of all data is ever analysed and used, just imagine the potential here.

Sourced from: Analytics.CLUB #WEB Newsletter

AI can help prevent mass shootings

Awesome, not awesome.

#Awesome

“Millions of people communicate using sign language, but so far projects to capture its complex gestures and translate them to verbal speech have had limited success. A new advance in real-time hand tracking from Google’s AI labs, however, could be the breakthrough some have been waiting for. The new technique uses a few clever shortcuts and, of course, the increasing general efficiency of machine learning systems to produce, in real time, a highly accurate map of the hand and all its fingers, using nothing but a smartphone and its camera.” — Devin Coldewey, Writer and Photographer Learn More from TechCrunch >

#Not Awesome

“…[A]rtificial intelligences, in seeking to please humanity, are likely to be highly emotional. By this definition, if you encoded an artificial intelligence with the need to please humanity sexually, their urgency to follow their programming constitutes sexual feelings. Feelings as real and valid as our own. Feelings that lead to the thing that feelings, probably, evolved to lead to: sex. One gets the sense that, for some digisexual people, removing the squishiness of the in-between stuff — the jealousy and hurt and betrayal and exploitation — improves their sexual enjoyment. No complications. The robot as ultimate partner. An outcome of evolution.” — Emma Grey Ellis, Writer Learn More from WIRED >

What we’re reading.

Originally Posted at: AI can help prevent mass shootings

The Importance of Your Relative Performance

Customer Experience Management (CEM) is the process of understanding and managing customers’ interaction with and perceptions about the company/brand. In these programs, customer experience metrics are tracked and used to identify improvement opportunities in order to increase customer loyalty. These customer experience metrics, used to track performance against oneself, may not be adequate for understanding why customers spend more with a company.  Keiningham et al. (2011) found that a company’s ranking (against the competition) was strongly related to share of wallet of their customers. In their two-year longitudinal study, they found that top-ranked companies received greater share of wallet of their customers compared to bottom-ranked companies.

Relative Performance Assessment (RPA): A Competitive Analytics Approach

I developed the Relative Performance Assessment (RPA), a competitive analytics solution that helps companies understand their relative ranking against their competition and identify ways to increase their ranking, and consequently, increase purchasing loyalty. The purpose of this post is to present some data behind the method.

This method is appropriate for companies who have customers who use a variety of competitors. In its basic form, the RPA method requires two additional questions in your customer relationship survey:

  • RPA Question 1: What best describes our performance compared to the competitors you use?  This question allows you to gauge each customer’s perception of where they think you stand relative to other companies/brands in their portfolio of competitors they use.  The key to RPA is the rating scale. The rating scale allows customers to tell you where your company ranks against all others in your space. The 5-point scale for the RPA is:
    1. <your company name> is the worst
    2. <your company name> is better than some
    3. <your company name> is average (about the same as others)
    4. <your company name> is better than most
    5. <your company name> is the best
  • RPA Question 2: Please tell us why you think that “insert answer to question above”. This question allows each customer to indicate the reasons behind his/her ranking of your performance. The content of the customers’ comments can be aggregated to identify underlying themes to help diagnose the reasons for high rankings (e.g., ranked the best / better than most) or low rankings (ranked the worst / better than some).

RPA in Practice

Figure 1. Percent of responses regarding relative performance

I have applied the RPA method in a few customer relationship surveys. I will present the results of a relationship survey for a B2B software company. This particular company had customers that used several competitors, so the RPA method was appropriate. The results in Figure 1 show that, on average, customers think the company is a typical supplier in the space, with a few customers indicating extreme ratings.

Additionally, similar to the findings in the Keiningham study, I found that the RPA was related to loyalty measures (see Figure 2). That is, customers who rank a company high also report high levels of customer loyalty toward that company. Conversely, customers who rank a company low also report low levels of customer loyalty toward that company. This relationship is especially strong for Advocacy and Purchasing loyalty.

Figure 2. Relative performance (RPA) is related to different types of customer loyalty.

Relative Performance, Customer Experience and Customer Loyalty

To understand the importance of relative performance, I wanted to determine how well the RPA explained customer loyalty after accounting for the effects of the customer experience. Along with the RPA, this relationship survey also included seven (7) general customer experience questions (e.g., product quality, support quality, communications from the company) that allowed the customer to rate their experience across different customer touchpoints, and five (5) customer loyalty questions measuring the three types of customer loyalty: retention, advocacy and purchasing.

Understanding the causes of customer loyalty is essential to any Customer Experience Management (CEM) program. To be of value, the RPA needs to explain differences in customer loyalty beyond traditional customer experience measures. I ran a stepwise regression analysis for each loyalty question to see if the Relative Performance Assessment helped us explain customer loyalty differences beyond what can be explained by general experience questions.

Figure 3. Relative performance (RPA) helps explain purchasing loyalty behavior. Improving relative performance will increase purchasing loyalty and share of wallet.

For each customer loyalty question, I plotted the percent of variance in loyalty that is explained by the general questions and the one RPA question. As you can see in Figure 3, the 7 general experience questions explain advocacy loyalty better than they do purchasing and retention loyalty. Next, looking at the RPA question, we see that it has a significant impact on purchasing loyalty behaviors. In fact, the RPA improves the prediction of purchasing loyalty by almost 50%. This finding shows us that 1) there is value in asking your customers about your relative performance and 2) improving the company’s ranking will increase purchasing loyalty and share of wallet.
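
A sketch of that incremental-variance analysis on synthetic data (no real survey results are used; the weights are invented so that the RPA item adds signal beyond the experience items):

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 1000
experience = rng.normal(size=(n, 7))            # 7 general CX ratings
rpa = rng.normal(size=n)                        # relative-performance rating
purchasing = experience @ rng.uniform(0.1, 0.3, 7) + 0.8 * rpa + rng.normal(size=n)

base = LinearRegression().fit(experience, purchasing)
full = LinearRegression().fit(np.column_stack([experience, rpa]), purchasing)

r2_base = base.score(experience, purchasing)
r2_full = full.score(np.column_stack([experience, rpa]), purchasing)
print(f"R2 (experience only):  {r2_base:.2f}")
print(f"R2 (experience + RPA): {r2_full:.2f}")
print(f"relative improvement:  {(r2_full - r2_base) / r2_base:.0%}")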

Understanding your Ranking

Further analysis of the data can help you understand your competitive (dis)advantage and the reasons behind your ranking. First, you can correlate the experience ratings with the RPA to see which customer experience area has the biggest impact on your relative performance.  Second, content analysis of the second RPA question (e.g., why customers gave that ranking) can reveal the reasons behind your ranking.  Applying both of these methods on the current data, I found a common product-related theme that might be responsible for their ranking. Specifically, results showed that the biggest customer experience driver of relative performance (RPA) was product quality. Additionally, the open-ended comments by customers who gave low RPA rankings were primarily focused on product-related issues (e.g., making the product easier to use, adding more customizability).

Summary

Companies that have higher industry rankings receive a greater share of wallet than companies with lower industry rankings. The Relative Performance Assessment helps companies measure their performance relative to their competitors and helps them identify ways to improve their competitive advantage.

Source by bobehayes

Feb 20, 20: #AnalyticsClub #Newsletter (Events, Tips, News & more..)


[  COVER OF THE WEEK ]

Trust the data

[ AnalyticsWeek BYTES]

>> AI systems claiming to ‘read’ emotions pose discrimination risks by administrator

>> How savvy execs make the most of data analytics by analyticsweekpick

>> Bias: Breaking the Chain that Holds Us Back by analyticsweek


[ FEATURED COURSE]

Process Mining: Data science in Action


Process mining is the missing link between model-based process analysis and data-oriented analysis techniques. Through concrete data sets and easy to use software the course provides data science knowledge that can be ap… more

[ FEATURED READ]

How to Create a Mind: The Secret of Human Thought Revealed


Ray Kurzweil is arguably today’s most influential—and often controversial—futurist. In How to Create a Mind, Kurzweil presents a provocative exploration of the most important project in human-machine civilization—reverse… more

[ TIPS & TRICKS OF THE WEEK]

Data Analytics Success Starts with Empowerment
Being data driven is not as much a tech challenge as it is an adoption challenge. Adoption has its roots in the cultural DNA of any organization. Great data-driven organizations weave the data-driven culture into their corporate DNA. A culture of connection, interaction, sharing and collaboration is what it takes to be data driven. It’s about being empowered more than it’s about being educated.

[ DATA SCIENCE Q&A]

Q: How would you define and measure the predictive power of a metric?
A: * Predictive power of a metric: the accuracy with which the metric predicts the empirical outcome of interest
* These measures are all domain specific
* Example: in fields like manufacturing, failure rates of tools are easily observable. A metric can be trained, and its success can be measured as the deviation over time from the observed failure rates
* In information security: if the metric says that an attack is coming and one should do X, did the recommendation stop the attack, or would the attack never have happened anyway?
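
One simple, hedged illustration: when the outcome is binary (an event happened or it didn't), the area under the ROC curve quantifies how well the metric ranks events. The data below is fabricated purely to show the calculation.

import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)
outcome = rng.integers(0, 2, size=200)                    # observed events (0/1)
metric = outcome * 0.8 + rng.normal(0, 0.5, size=200)     # a noisy predictive signal

print(f"AUC of the metric: {roc_auc_score(outcome, metric):.2f}")
# 0.5 means no predictive power; 1.0 means the metric perfectly ranks events.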

Source

[ VIDEO OF THE WEEK]

#BigData @AnalyticsWeek #FutureOfData #Podcast with Juan Gorricho, @disney


[ QUOTE OF THE WEEK]

For every two degrees the temperature goes up, check-ins at ice cream shops go up by 2%. – Andrew Hogue, Foursquare

[ PODCAST OF THE WEEK]

Andrea Gallego(@risenthink) / @BCG on Managing Analytics Practice #FutureOfData #Podcast


[ FACT OF THE WEEK]

In late 2011, IDC Digital Universe published a report indicating that some 1.8 zettabytes of data will be created that year.

Sourced from: Analytics.CLUB #WEB Newsletter

17 equations that changed the world

Ian Stewart compiled an interesting summary of 17 equations that practically changed the world.

Here are the 17 equations:
Pythagoras’s Theorem
In mathematics, the Pythagorean theorem, also known as Pythagoras’s theorem, is a fundamental relation in Euclidean geometry among the three sides of a right triangle. It states that the square of the hypotenuse (the side opposite the right angle) is equal to the sum of the squares of the other two sides.

 
Logarithms
a quantity representing the power to which a fixed number (the base) must be raised to produce a given number.

 
Calculus
the branch of mathematics that deals with the finding and properties of derivatives and integrals of functions, by methods originally based on the summation of infinitesimal differences. The two main types are differential calculus and integral calculus.

 
Law of Gravity
Newton’s law of universal gravitation states that every particle attracts every other particle in the universe with a force that is directly proportional to the product of their masses and inversely proportional to the square of the distance between their centers.

 
The Square Root of Minus One
The “unit” Imaginary Number (the equivalent of 1 for Real Numbers) is √(−1) (the square root of minus one). In mathematics we use i (for imaginary) but in electronics they use j (because “i” already means current, and the next letter after i is j).

 
Euler’s Formula for Polyhedra
This theorem involves Euler’s polyhedral formula (sometimes called Euler’s formula). Today we would state this result as: The number of vertices V, faces F, and edges E in a convex 3-dimensional polyhedron, satisfy V + F – E = 2.

 
Normal Distribution
In probability theory, the normal (or Gaussian) distribution is a very common continuous probability distribution. Normal distributions are important in statistics and are often used in the natural and social sciences to represent real-valued random variables whose distributions are not known.

 
Wave Equation
The wave equation is an important second-order linear hyperbolic partial differential equation for the description of waves—as they occur in physics—such as sound waves, light waves and water waves. It arises in fields like acoustics, electromagnetics, and fluid dynamics.

 
Fourier Transform
a function derived from a given function and representing it by a series of sinusoidal functions.

 
Navier-Stokes Equation
In physics, the Navier–Stokes equations /nævˈjeɪ stoʊks/, named after Claude-Louis Navier and George Gabriel Stokes, describe the motion of viscous fluid substances.

 
Maxwell’s Equation
Maxwell’s equations are a set of partial differential equations that, together with the Lorentz force law, form the foundation of classical electromagnetism, classical optics, and electric circuits.

 
Second Law of Thermodynamics
the principle that the total entropy of an isolated system can never decrease over time; equivalently, heat flows spontaneously from hotter to colder bodies, and never the reverse, unless external work is done.

 
Relativity
the dependence of various physical phenomena on relative motion of the observer and the observed objects, especially regarding the nature and behavior of light, space, time, and gravity.

 
Schrödinger’s Equation
After much debate, the wavefunction is now accepted to be a probability distribution. The Schrödinger equation is used to find the allowed energy levels of quantum mechanical systems (such as atoms, or transistors). The associated wavefunction gives the probability of finding the particle at a certain position.

 
Information Theory
the mathematical study of the coding of information in the form of sequences of symbols, impulses, etc., and of how rapidly such information can be transmitted, e.g., through computer circuits or telecommunications channels.

 
Chaos Theory
Chaos theory is a branch of mathematics focused on the behavior of dynamical systems that are highly sensitive to initial conditions.

 
Black-Scholes Equation
In mathematical finance, the Black–Scholes equation is a partial differential equation (PDE) governing the price evolution of a European call or European put under the Black–Scholes model. Broadly speaking, the term may refer to a similar PDE that can be derived for a variety of options, or more generally, derivatives.
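
For reference, here are the canonical forms in which these equations are usually written (in LaTeX); these are the standard textbook versions rather than quotations from Stewart's book:

\begin{align*}
\text{Pythagoras's theorem:}\quad & a^2 + b^2 = c^2 \\
\text{Logarithms:}\quad & \log xy = \log x + \log y \\
\text{Calculus:}\quad & \frac{df}{dt} = \lim_{h \to 0}\frac{f(t+h)-f(t)}{h} \\
\text{Law of gravity:}\quad & F = G\,\frac{m_1 m_2}{r^2} \\
\text{Square root of minus one:}\quad & i^2 = -1 \\
\text{Euler's formula for polyhedra:}\quad & V - E + F = 2 \\
\text{Normal distribution:}\quad & \Phi(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}} \\
\text{Wave equation:}\quad & \frac{\partial^2 u}{\partial t^2} = c^2\,\frac{\partial^2 u}{\partial x^2} \\
\text{Fourier transform:}\quad & \hat{f}(\omega) = \int_{-\infty}^{\infty} f(x)\, e^{-2\pi i x \omega}\, dx \\
\text{Navier--Stokes:}\quad & \rho\left(\frac{\partial \mathbf{v}}{\partial t} + \mathbf{v}\cdot\nabla\mathbf{v}\right) = -\nabla p + \nabla\cdot\mathbf{T} + \mathbf{f} \\
\text{Maxwell's equations (free space):}\quad & \nabla\cdot\mathbf{E} = 0,\ \ \nabla\times\mathbf{E} = -\frac{1}{c}\frac{\partial\mathbf{H}}{\partial t},\ \ \nabla\cdot\mathbf{H} = 0,\ \ \nabla\times\mathbf{H} = \frac{1}{c}\frac{\partial\mathbf{E}}{\partial t} \\
\text{Second law of thermodynamics:}\quad & dS \ge 0 \\
\text{Relativity:}\quad & E = mc^2 \\
\text{Schr\"odinger's equation:}\quad & i\hbar\,\frac{\partial\psi}{\partial t} = \hat{H}\psi \\
\text{Information theory:}\quad & H = -\sum_x p(x)\log_2 p(x) \\
\text{Chaos (logistic map):}\quad & x_{t+1} = k\,x_t\,(1 - x_t) \\
\text{Black--Scholes:}\quad & \tfrac{1}{2}\sigma^2 S^2\,\frac{\partial^2 V}{\partial S^2} + rS\,\frac{\partial V}{\partial S} + \frac{\partial V}{\partial t} - rV = 0
\end{align*}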
 


Source