Executives at a European financial services firm had a clear vision.
The company would create a data analytics application for all the markets
it served. It would then collect data about its customers’ behaviours and
preferences and, through analysis of the data, could identify opportunities
that would enable the firm to present the right offer to the right customer
at the right time. The company, thereby, would become more central to the
financial lives of its customers. Rising revenues, of course, would follow.
Partway through the work of building the application, however, cost pressures at the firm whittled away at the
scope of the project. Instead of an application that would address all its markets, the firm decided to prioritise
one market and launch the application there. But the company had neglected to establish frameworks for
defining and categorising the data assets being collected, making it difficult (if not impossible) for the application
to recognise how data points related to each other.
Note: This article originally appeared in FTI Consulting.
By Dr. Jans Aasman, Ph.D., CEO of Franz Inc.
Effective data governance consists of protocols, practices, and the people necessary for implementation to ensure trustworthy, consistent data. Its yields include regulatory compliance, improved data quality, and data’s increased valuation as a monetary asset that organizations can bank on.
Nonetheless, these aspects of governance would be impossible without what is arguably its most important component: the common terminologies and definitions that are sustainable throughout an entire organization, and which comprise the foundation for the aforementioned policy and governance outcomes.
When intrinsically related to the technologies used to implement governance protocols, terminology systems (containing vocabularies and taxonomies) can unify terms and definitions at a granular level. The result is a greatly increased ability to tackle the most pervasive challenges associated with big data governance including recurring issues with unstructured and semi-structured data, integration efforts (such as mergers and acquisitions), and regulatory compliance.
A Realistic Approach
Designating the common terms and definitions that are the rudiments of governance varies according to organization, business units, and specific objectives for data management. Creating policy from them and embedding them in technology that can achieve governance goals is perhaps most expediently and sustainably facilitated by semantic technologies, which are playing an increasingly pivotal role in the overall implementation of data governance in the wake of big data’s emergence.
Once organizations adopt a glossary of terminology and definitions, they can then determine rules about terms based on their relationships to one another via taxonomies. Taxonomies are useful for disambiguation purposes and can clarify preferred labels (among any number of synonyms) for different terms in accordance with governance conventions. These definitions and taxonomies form the basis for automated terminology systems that label data according to governance standards via inputs and outputs. Ingested data adheres to terminology conventions and is stored according to preferred labels. Data captured prior to the implementation of such a system can still be queried according to the system’s standards.
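As a rough illustration of how a terminology system might map synonyms to preferred labels at ingestion time, here is a minimal Python sketch. The glossary entries, the function names and the record layout are all hypothetical and are not taken from any particular product; a real system would load its vocabulary from a governed taxonomy rather than a hard-coded dictionary.

```python
from typing import Dict, List, Optional

# Hypothetical glossary: each preferred label maps to its accepted synonyms.
GLOSSARY: Dict[str, List[str]] = {
    "myocardial infarction": ["heart attack", "MI", "cardiac infarction"],
    "hypertension": ["high blood pressure", "HTN"],
}


def build_synonym_index(glossary: Dict[str, List[str]]) -> Dict[str, str]:
    """Map every known term (case-insensitively) to its preferred label."""
    index: Dict[str, str] = {}
    for preferred, synonyms in glossary.items():
        for term in [preferred, *synonyms]:
            index[term.casefold()] = preferred
    return index


SYNONYM_INDEX = build_synonym_index(GLOSSARY)


def preferred_label(term: str) -> Optional[str]:
    """Return the preferred label for a term, or None if it is not in the glossary."""
    return SYNONYM_INDEX.get(term.strip().casefold())


def normalize_record(record: Dict[str, str], field: str) -> Dict[str, str]:
    """Return a copy of the record with the given field stored under its preferred label."""
    label = preferred_label(record[field])
    normalized = dict(record)
    if label is not None:
        normalized[field] = label
    return normalized
```

The same index can be applied at query time, which is one way data captured before the system existed could still be searched by preferred label without rewriting the stored values.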
Linking Terminology Systems: Endless Possibilities
The possibilities that such terminology systems produce (especially for unstructured and semi-structured big data) are virtually limitless, particularly with the linking capabilities of semantic technologies. In the medical field, a handwritten note hastily scribbled by a doctor can be readily transcribed by the terminology system in accordance with governance policy with preferred terms, effectively giving structure to unstructured data. Moreover, it can be linked to billing coding systems per business functions. That structured data can then be stored in a knowledge repository and queried along with other data, adding to the comprehensive integration and accumulation of data that gives big data its value.
Focusing on common definitions and linking terminology systems enables organizations to leverage business intelligence and analytics on different databases across business units. This method is also critical for customer disambiguation, a frequently occurring problem across vertical industries. In finance, institutions with numerous subsidiaries and acquisitions (such as Citigroup, Citibank, Citi Bike, etc.) can use a common repository to determine which subsidiary spent how much money with the parent company, and to resolve other internal, data-sensitive problems. Linking the different terminology repositories for these distinct yet related entities can achieve the same objective.
The primary way in which semantics addresses linking between terminology systems is by ensuring that those systems are utilizing the same words and definitions for the commonality of meaning required for successful linking. Vocabularies and taxonomies can provide such commonality of meaning, which can be implemented with ontologies to provide a standards-based approach to disparate systems and databases.
Subsequently, all systems that utilize those vocabularies and ontologies can be linked. In finance, the Financial Industry Business Ontology (FIBO) is being developed to grant “data harmonization and…the unambiguous sharing of meaning across different repositories.” The life sciences industry is similarly working on industry-wide standards so that numerous databases can be made available to all within this industry, while still restricting access to internal drug discovery processes according to organization.
Regulatory Compliance and Ontologies
In terms of regulatory compliance, organizations can account for new requirements far more flexibly and quickly when data throughout disparate systems and databases are linked and commonly shared, requiring just a single update as opposed to numerous time-consuming updates in multiple places. Issues of regulatory compliance are also eased in a semantic environment through the use of ontological models, which provide the schema that can create a model specifically in adherence to regulatory requirements.
Organizations can use ontologies to describe such requirements, then write rules for them that both restrict and permit access and usage according to regulations. Although ontological models can also be created for any other sort of requirements pertaining to governance (metadata, reference data, etc.) it is somewhat idealistic to attempt to account for all facets of governance implementation via such models. The more thorough approach is to do so with terminology systems and supplement them accordingly with ontological models.
The true value in utilizing a semantic approach to big data governance that focuses on terminology systems, their requisite taxonomies, and vocabularies pertains to the fact that this method is effective for governing unstructured data. Regardless of what particular schema (or lack thereof) is available, organizations can get their data to adhere to governance protocols by focusing on the terms, definitions, and relationships between them. Conversely, ontological models have a demonstrated efficacy with structured data. Given the fact that the majority of new data created is unstructured, the best means of wrapping effective governance policies and practices around them is through leveraging these terminology systems and semantic approaches that consistently achieve governance outcomes.
About the Author: Dr. Jans Aasman, Ph.D., is the CEO of Franz Inc., an early innovator in Artificial Intelligence and leading supplier of Semantic Graph Database technology. Dr. Aasman’s previous experience and educational background include:
- Experimental and cognitive psychology at the University of Groningen, specialization: Psychophysiology, Cognitive Psychology.
- Tenured Professor in Industrial Design at the Technical University of Delft. Title of the chair: Informational Ergonomics of Telematics and Intelligent Products
- KPN Research, the research lab of the major Dutch telecommunication company
- Carnegie Mellon University. Visiting Scientist at the Computer Science Department of Prof. Dr. Allen Newell
Behaviour analytics technology is being developed or acquired by a growing number of information security suppliers. In July 2015 alone, European security technology firm Balabit released a real-time user behaviour analytics monitoring tool called Blindspotter and security intelligence firm Splunk acquired behaviour analytics and machine learning firm Caspida. But what is driving this trend?
Like most trends, there is no single driver, but several key factors that come together at the same time.
In this case, storage technology has improved and become cheaper, enabling companies to store more network activity data; distributed computing capacity is enabling real-time data gathering and analysis; and at the same time, traditional signature-based security technologies or technologies designed to detect specific types of attack are failing to block increasingly sophisticated attackers.
As companies have deployed security controls, attackers have shifted focus from malware to individuals in organisations, either stealing their usernames and passwords to access and navigate corporate networks without being detected or getting their co-operation through blackmail and other forms of coercion.
Stealing legitimate user credentials for both on-premise and cloud-based services is becoming increasingly popular with attackers as a way into an organisation that enables them to carry out reconnaissance, and it is easily done, according to Matthias Maier, European product marketing manager for Splunk.
“For example, we are seeing highly plausible emails that appear to be from a company’s IT support team telling a targeted employee their email inbox is full and their account has been locked. All they need to do is type in their username and password to access the account and delete messages, but in doing so, the attackers are able to capture legitimate credentials without using any malware and access corporate IT systems undetected,” he said.
The growing use of such techniques by attackers is driving demand from organisations for technologies such as behaviour analytics that enable them to build an accurate profile of normal business activities for all employees. This means that if credentials are stolen or people are being coerced into helping attackers, these systems are able to flag unusual patterns of behaviour.
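To make the idea of profiling normal activity concrete, the following Python sketch learns which hosts and hours of the day each user has been seen with, then flags events that fall outside that history. It is a deliberately simplified illustration: the event layout, the class name and the example host names are assumptions, and commercial behaviour analytics tools rely on far richer features and statistical models.

```python
from collections import defaultdict
from typing import DefaultDict, Iterable, Set, Tuple

# Assumed event layout: (user, host accessed, hour of day 0-23).
Event = Tuple[str, str, int]


class BehaviourProfile:
    """Remembers which hosts and hours each user has been observed with."""

    def __init__(self) -> None:
        self._hosts: DefaultDict[str, Set[str]] = defaultdict(set)
        self._hours: DefaultDict[str, Set[int]] = defaultdict(set)

    def learn(self, events: Iterable[Event]) -> None:
        """Add historical events to the per-user profile."""
        for user, host, hour in events:
            self._hosts[user].add(host)
            self._hours[user].add(hour)

    def is_unusual(self, event: Event) -> bool:
        """Flag events from unknown users, unseen hosts or unseen hours."""
        user, host, hour = event
        if user not in self._hosts:
            return True
        return host not in self._hosts[user] or hour not in self._hours[user]


profile = BehaviourProfile()
profile.learn([
    ("alice", "mail01", 9),
    ("alice", "mail01", 10),
    ("alice", "crm01", 14),
])
```

With this history, a login by the same account to a finance database at 3 a.m. would be flagged even though the credentials themselves are valid, which is the scenario the stolen-credential attacks described above depend on going unnoticed.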
Read complete article at: http://www.computerweekly.com/news/4500251006/Why-the-time-is-ripe-for-security-behaviour-analytics
Traditionally, data modeling has been one of the most time-consuming facets of leveraging data-driven processes. This reality has become significantly aggravated by the variety of big data options, their time-sensitive needs, and the ever growing complexity of the data ecosystem which readily meshes disparate data types and IT systems for an assortment of use cases.
Attempting to design schema for such broad varieties of data in accordance with the time constraints required to act on those data and extract value from them is difficult enough in relational environments. Incorporating such pre-conceived schema with semi-structured, machine-generated data (and integrating them with structured data) complicates the process, especially when requirements dynamically change over time.
Consequently, one of the most significant trends to impact data modeling is the emerging capability to produce schema on-the-fly based on the data themselves, which considerably accelerates the modeling process while simplifying the means of using data-centric options.
According to Loom Systems VP of Product Dror Mann, “We’ve been able to build algorithms that break the data and structure it. We break it for instance to lift the key values. We understand that this is the constant, that’s the host, that’s the celerity, that’s the message, and all the rest are just properties to explain what’s going on there.”
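The kind of structuring Mann describes can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Loom Systems’ algorithm: it assumes a log line laid out as timestamp, host, upper-case severity and free-text message, with any key=value pairs in the message treated as properties. The field names and patterns are hypothetical.

```python
import re
from typing import Dict, Optional

# Assumed layout: "<timestamp> <host> <SEVERITY> <free-text message>".
LINE_PATTERN = re.compile(
    r"^(?P<timestamp>\S+)\s+(?P<host>\S+)\s+(?P<severity>[A-Z]+)\s+(?P<message>.*)$"
)
# key=value pairs inside the message are lifted out as properties.
PAIR_PATTERN = re.compile(r"(\w+)=(\S+)")


def parse_line(line: str) -> Optional[Dict[str, object]]:
    """Split one log line into fixed fields plus a dictionary of properties."""
    match = LINE_PATTERN.match(line.strip())
    if match is None:
        return None  # the line does not follow the assumed layout
    record: Dict[str, object] = dict(match.groupdict())
    record["properties"] = dict(PAIR_PATTERN.findall(match.group("message")))
    return record
```

Here the schema is fixed in advance by the regular expression; the systems discussed in this article are described as inferring such structure from the data themselves, which this sketch does not attempt.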
Algorithmic Data Modeling
The expanding reliance on algorithms to facilitate data modeling is one of the critical deployments of Artificial Intelligence technologies such as machine learning and deep learning. These cognitive computing capabilities are underpinned by semantic technologies which prove influential in on-the-fly data modeling at scale. The foregoing algorithms are effectual in such time-sensitive use cases partly because of classification technologies which “measure every type of metric in a single one,” Mann explained. The automation potential of the use of classifications with AI algorithms is an integral part of hastening the data modeling process in these circumstances. As Mann observed, “For our usual customers, even if it’s a medium-sized enterprise, their data will probably create more than tens of thousands of metrics that will be measured by our software.” The classification enabled by semantic technologies allows for the underlying IT system to understand how to link the various data elements in a way which is communicable and sustainable according to the ensuing schema.
The result is that organizations are able to model data of various types in a way in which they are not constrained by schema, but rather mutate schema to include new data types and requirements. This ability to create schema as needed is vital to avoiding vendor lock-in and enabling various IT systems to communicate with one another. In such environments, the system “creates the schema and allows the user to manipulate the change accordingly,” Mann reflected. “It understands the schema from the data, and does some of the work of an engineer that would look at the data.” In fact, one of the primary use cases for such modeling is the real-time monitoring of IT systems which has become increasingly germane to both operations and analytics. Crucial to this process are the real-time capabilities involved, which are necessary for big data quantities and velocities. “The system ingests the data in real time and does the computing in real time,” Mann revealed. “Through the data we build a data set where we learn the pattern. From the first several minutes of viewing samples it will build a pattern of these samples and build the baseline of these metrics.”
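One simple way to picture building a baseline from the first samples of a metric is a mean and standard deviation, with later values flagged when they stray too far from that baseline. The Python sketch below assumes this z-score approach purely for illustration; the function names and the threshold of three standard deviations are assumptions, and the article does not say which statistics the vendor actually uses.

```python
from statistics import mean, stdev
from typing import Sequence, Tuple

Baseline = Tuple[float, float]  # (mean, sample standard deviation)


def build_baseline(samples: Sequence[float]) -> Baseline:
    """Summarise the first observed samples of a metric."""
    if len(samples) < 2:
        raise ValueError("need at least two samples to estimate spread")
    return mean(samples), stdev(samples)


def deviates(value: float, baseline: Baseline, threshold: float = 3.0) -> bool:
    """Return True when a value lies more than `threshold` deviations from the mean."""
    centre, spread = baseline
    if spread == 0:
        # A perfectly flat baseline: any change at all counts as a deviation.
        return value != centre
    return abs(value - centre) / spread > threshold


# Example: response times (ms) seen during the first minutes of monitoring.
response_baseline = build_baseline([100, 102, 98, 101, 99])
```

In practice a baseline would be updated continuously and would need to account for seasonality, so this should be read as the smallest possible version of the idea.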
From Predictive to Preventive
Another pivotal facet of automated data modeling fueled by AI is the predictive functionality which can prevent undesirable outcomes. These capabilities are of paramount importance in real-time monitoring of information systems for operations, and are applicable to various aspects of the Internet of Things and the Industrial Internet as well. Monitoring solutions employing AI-based data modeling are able to determine such events before they transpire due to the sheer amounts of data they are able to parse through almost instantaneously. When monitoring log data, for instance, these solutions can analyze such data and their connotations in a way which vastly exceeds that of conventional manual monitoring of IT systems. In these situations “the logs are being scanned in real time, all the time,” Mann noted. “Usually logs tell you a much richer story. If you are able to scan your logs at the information level, not just at the error level…you would be able to predict issues before they happen because the logs tell you when something is about to be broken.”
Data modeling is arguably the foundation of nearly every subsequent data-focused activity from integration to real-time application monitoring. AI technologies are currently able to accelerate the modeling phase in a way that enables these activities to be determined even more by the actual data themselves, as opposed to relying upon predetermined schema. This flexibility has manifold utility for the enterprise, decreases time to value, and increases employee and IT system efficiency. Its predictive potential only compounds the aforementioned boons, and could very well prove a harbinger of the future for data modeling. According to Mann:
“When you look at statistics, sometimes you can detect deviations and abnormalities, but in many cases you’re also able to detect things before they happen because you can see the trend. So when you’re detecting a trend you see a sequence of events and it’s trending up or down. You’re able to detect what we refer to as predictions which tells you that something is about to take place. Why not fix it now before it breaks?”
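A trend-based prediction of the sort Mann describes can be approximated with an ordinary least-squares line fitted to recent readings and extrapolated to a limit. The Python sketch below is a hedged illustration under that assumption; the function names and the example disk-usage figures are hypothetical, and real monitoring products are likely to use more robust forecasting.

```python
from typing import Optional, Sequence, Tuple


def fit_line(values: Sequence[float]) -> Tuple[float, float]:
    """Least-squares slope and intercept for values sampled at steps 0, 1, 2, ..."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values to fit a trend")
    x_mean = (n - 1) / 2
    y_mean = sum(values) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(values))
    sxx = sum((x - x_mean) ** 2 for x in range(n))
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean


def steps_until(values: Sequence[float], limit: float) -> Optional[float]:
    """Estimate how many further steps pass before the fitted trend reaches `limit`.

    Returns 0.0 if the trend is already at or past the limit, and None if the
    trend is flat or falling and so never reaches it.
    """
    slope, intercept = fit_line(values)
    latest = intercept + slope * (len(values) - 1)
    if latest >= limit:
        return 0.0
    if slope <= 0:
        return None
    return (limit - latest) / slope


# Example: disk usage (%) sampled at regular intervals, with an alert limit of 90%.
remaining = steps_until([70, 72, 74, 76, 78], 90)
```

For the example readings the usage grows by two points per step, so the fitted line reaches the 90% limit six steps after the last sample, leaving time to act before anything breaks.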
I was recently involved in a couple of panel discussions on what it means to be a data scientist and to practice data science. These discussions/debates took place at IBM Insight in Las Vegas in late October. I attended the event as IBM’s guest. The panels, moderated by Brian Fanzo, included me and these data experts:
- Andrew C. Oliver, President of Mammoth Data
- Lillian Pierson of Data-Mania
- Matt Ridings, CEO of SideraWorks
- Mike Tamir, Chief Science Officer of Galvanize
I enjoyed our discussions and their take on the topic of data science. Our discussion was opened by the question “What is the role of a data scientist in the insight economy?” You can read each of our answers to this question on IBM’s Big Data Hub. While we come from different backgrounds, there was a common theme across our answers. We all think that data science is about finding insights in data to help make better decisions. I offered a more complete answer to that question in a prior post. Today, I want to share some more thoughts about other areas of the field of data science that we talked about in our discussions. The content below reflects my opinion.
What is a Data Scientist?
As more data professionals are now calling themselves data scientists, it’s important to clarify exactly what a data scientist is. One way to understand data scientists is to understand what kind of skills they bring to bear on analytics projects. It’s generally agreed that a successful data scientist is one who possesses skills across three areas: subject matter expertise in a particular field, programming/technology and statistics/math (see DJ Patil and Hilary Mason’s take, Drew Conway’s Data Science Venn Diagram (see Figure 1) and a review of many experts’ opinions on this topic).
AnalyticsWeek and I recently took an empirical approach to understanding the skills of data scientists by asking over 500 data professionals about their job roles and their proficiency across 25 data skills in five areas (i.e., business, technology, programming, math/modeling and statistics). A factor analysis of their proficiency ratings revealed three factors: business acumen, technology/programming skills and statistics/math knowledge.
A data scientist who possesses expertise in all data skills is rare. In our survey, none of the respondents were experts in all five skill areas. Instead, our results identified four different types of data scientists, each with varying levels of proficiency in data skills; as expected, different data professionals possessed role-specific skills (see Figure 2). Business Management professionals were the most proficient in business skills. Developers were the most proficient in technology and programming skills. Researchers were most proficient in math/modeling and statistics. Creatives did not possess great proficiency in any one skill.
The Practice of Data Science: Getting Insights from Data
Gil Press offers a great summary of the field of data science. He traces the literary history of the term (the term first appeared in use in 1974) and settles on the idea that data science is a way of extracting insights from data using the powers of computer science and statistics applied to data from a specific field of study.
But how do you get insights from data? Bernard Marr offers his 5-step SMART approach to extract information. SMART stands for:
- S = Start with Strategy
- M = Measure Metrics and Data
- A = Apply Analytics
- R = Report Results
- T = Transform your Business
Another approach is the 6-step CRISP-DM (Cross Industry Standard Process for Data Mining) method (see Figure 3). In a KDNuggets Poll in 2014, the CRISP-DM method was the most popular methodology (43%) used by data professionals for analytics, data mining, and data science projects.
These two approaches have a lot in common with each other and both share a lot with a method that has been around for about 1000 years: the scientific method (see Alhazen, a forerunner of the scientific method). The scientific method follows these general steps (see Figure 4):
- Formulate a question or problem statement
- Generate a hypothesis that is testable
- Gather/Generate data to understand the phenomenon in question. Data can be generated through experimentation; when we can’t conduct true experiments, data are obtained through observations and measurements.
- Analyze data to test the hypotheses / Draw conclusions
- Communicate results to interested parties or take action (e.g., change processes) based on the conclusions. Additionally, the outcome of the scientific method can help us refine our hypotheses for further testing.
The value of data is measured by what you do with it. Whether you’re investigating phenomena, acquiring new knowledge, or correcting and integrating previous knowledge, the scientific method is an effective way to systematically interrogate your data. Scientists may differ with respect to the variables they use and the problems they study (e.g., medicine, education and business), but they all use the scientific method to advance bodies of knowledge.
Data is, has been and forever will be at the heart of science. The scientific method necessarily involves the collection of empirical evidence, subject to specific principles of reasoning. That is the practice of science, a way of extracting knowledge from data. Data science is science.
The Democratization of Data Science
Taking a scientific approach to analyzing data is not only valuable to data workers; it is also valuable for people who consume, interpret and make decisions based on the analysis of those data. In business, data users need to think critically about sales reports, social media metrics and quarterly reports. Application vendors are marketing their tools and platforms as a way of making everybody a data scientist, enabling end users (i.e., data users) to get advanced statistical and visualization capabilities to find insights (see Prelert’s take on this, Tableau’s ideas and Umbel’s call).
I believe that the democratization of data science is not only a software problem but also an education problem. Companies need to provide their employees training on statistics and statistical concepts. This type of training gives the employees the ability to think critically about the data (e.g., data source, measurement properties and relevance of the metrics). The better the grasp of statistics employees have, the more insight/value/use they will get from the software they use to analyze/visualize that data.
Statistics is the language of data. Just as knowledge of your native language helps you maneuver in the world of words, statistics will help you maneuver in the world of data. As the world around us becomes more quantified, statistical skills will become more and more essential in our daily lives. If you want to make sense of our data-intensive world, you will need to understand statistics.
Conclusions and Final Thoughts
Businesses are relying on data professionals with unique skills to make sense of their data. These data professionals apply their skills to improve decision-making in humans or algorithms. To get from data to insights, data professionals can adopt a systematic approach to optimize the use of their skills. Following are some conclusions about data scientists and the practice of data science.
- The practice of data science requires three skills: subject matter expertise, computing skills and statistical knowledge.
- The general term, ‘data scientist,’ is ambiguous. Our research identified four different types of data scientists: Business Management, Developer, Creative and Researcher. Each role possessed different strengths.
- Science is a way of thinking, a way of testing ideas using data. An effective practice of data science includes the scientific method. I think that the term, ‘data science,’ is redundant. It’s just science. Science requires the use of data, data to help you understand your business and how the world really works.
- Offer employees training on statistics. Giving people analytics software and expecting them to excel at data science is like giving them a stethoscope and expecting them to excel at medicine. The better they understand the language of data, the more value they will get from the analytics software they use.
I’ll leave you with some thoughts on data science I shared with Nick Dimeo at IBM Insight.
I would love to hear your thoughts on data scientists and the practice of data science. What do those terms mean to you?