To AI or Not To AI

We have all heard about AI by now: it is a stack of capabilities, put together to achieve a business objective with a certain capacity for autonomy, ranging from expert systems to deep learning algorithms. In my many conversations with businesses, I have found a myriad of uninformed expectations about what AI is and what it can achieve. The primary reason, IMHO, is limited understanding combined with an exploding technology landscape. Such a radical shift has left businesses with an imperfect understanding of the capabilities of AI. While it is tempting to point out the challenges that businesses are facing today, it is important to understand the core problem. One company executive (let’s call him Steve) put it best: “In today’s times AI is pushed right into the 2nd sentence of almost every product pitch, and almost every vendor is trying to sell with rarely anyone trying to tell. But most are unclear about what they are doing with their AI and how it would affect us.” This hits the nail on the head.

Thanks to the buzz in the market and the push from top software companies to get their AI assistants in front of consumers, the market is exploding. This widespread investment and media buzz is doing a great job of keeping business anxiety high. While this is tiring for businesses and could potentially challenge their core strengths (if not understood properly), businesses need to respond to this buzzword like innovation mavericks. Hopefully we’ll talk about that below. Still not convinced that AI adoption is worth investigating, and need a reason? Reportedly, 35.6 million voice-activated assistant devices are set to make their way into American homes. With a total of 125.82 million households, that pretty much means that roughly 1 in 4 households will have an AI assistant. This massive adoption is fueling the signal that everyone should consider AI in their corporate strategy: AI is slowly sliding into our lives, and our work will be next. After all, you don’t want to lose your meal to an AI.

So, hopefully you are on the verge of being convinced. Now, what are some of the considerations that businesses should (almost always) keep in mind and use to build ground rules before venturing into a high dose of AI planning, execution and adoption?


AI is no silver bullet

While AI is good for lots of things, it’s not good for everything. AI solutions are great at clustering (likely events), predicting the future (based on the past) and finding anomalies (in large datasets), but they are certainly not great at bringing intuition to the table or at quantifying and qualifying culture. They still struggle to provide trustworthy results when they rely on underfitted or overfitted models. AI solutions are amazing at normalizing data to predict an outcome, which often leaves the corner cases unseen. AI also has the bias problem that humans have been mastering for ages. So, go with AI, but leave critical decisions to the best intuitive algorithms around: yes, humans.


eAI (Enterprise AI) is in its infancy, so don’t give the launch codes to the kid yet

I am South Asian, and sometimes when I am in my Indian mode, my Indian accent jumps out and my interactions with Siri, Alexa and Google Home turn into an ego fest between what the AI thinks I am saying and what I am actually saying. The only difference is that the AI holds more power in those interactions. That is not yet scary, but I am sure it could be. If you have interacted with your AI assistant toys, you can relate to the experience of an AI responding, reacting or executing on a misinterpretation. Now, if consumer toys are programmed to react to misinformation, enterprise solutions could sometimes suggest fun and bizarre recommendations too. Don’t believe me? Read my previous blog: Want to understand AI bounds? Learn from its failures to learn more. So, it’s important for businesses to understand and set the boundaries of AI and keep it walled off from critical decision-making.


Focus on the journey and not the destination

Yes, I know you have heard this before almost every challenging streak you are about to take on. We have also heard the same quote with “journey” and “destination” reversed, but I like the first version. It puts the emphasis on learning from the project and prepares decision makers not to rely on these technologies without robust, fail-safe qualifying criteria. Every step, every learning, every use case (intended and unintended) must be captured for analysis. The most successful deployment stories I have heard are the ones where AI drove the ROI hockey stick from unexpected corners. So, businesses should always make sure someone is listening to those unexpected corners. One of the most challenging conversations I find myself in involves clearly defined, rigid goals with no room for change: we need to achieve X by Y. While this is music to corporate ears, it is a headache in new, untested waters. Imagine jumping into a viscous slurry to get to the other side in 10 minutes. You may or may not get there, but you’ll be too focused on getting to the other side and not focused enough on finding hacks to get through the slurry faster next time.


Building up on the foundation is critical

Yes, let’s not talk about changing the laws of physics. Wait, let us change the laws of physics, but give respect to the fundamental ones. It is important to see fundamentals REALLY fail before we try to ignore them. Avoid going against gravity, but allow yourself to experiment with it. Businesses exist because of their secret sauce: part culture, part process and part product. While it is very tempting to break with the legacy and build afresh, it is extremely painful and time-consuming to make the different aspects of a business work in harmony. Ignoring the old in favor of the new is one of the most underestimated and freakishly devastating oversights that innovation can put businesses through. Imagine a newly launched rockstar product and how everyone jumps to work on it. While it is cool to work on new challenges, it is critical to establish that they comply with the fundamental foundation of the business. There is no silver bullet for deciding what constitutes the foundation, but it’s a deep conversation that a business needs to have before venturing into self-learning, self-analyzing and self-reacting solutions.


Don’t rush to the road

I have a young child at home, and I remember getting into a conversation with an executive about young ones and AI (wait, this could be a great title). The idea is that you wouldn’t trust your young ones with the steering wheel of your car on the road without plenty of practice and confidence in their abilities. You would rather have them perfect their craft in the controlled backyard than put them on highways early. Once they are mature and sane, then it is time to let them drive on controlled roads and, once you are confident, go all in. Current AI capabilities are no different. They require a lot of hand-holding, and their every outcome hits you with amusement. So, understand the play area for AI deployment and build strong ground rules. You don’t want anyone hurt by the overconfidence of a minor. So is the case with these expert systems.


Be fallible

Time to set up some meetings with your innovation team (if you have one yet); otherwise, time to create one. We are currently working on a tech stack where most of the technologies are undergoing disruption. The time is excitingly scary for tech folks, as tech is now the substitute spine of any business and yet is itself undergoing disruption. So, tech folks should be given enough ammo to fail, and they should be encouraged to fail. The scariest thing that could happen today would be a team executing a scenario and giving it a pass out of fear of failure. There need to be responsive checks and balances to understand and appreciate failures. This will help businesses work with an IT function that is agile and yet robust enough to undertake any future disruptive change.


Understand the adoption challenges

If you are hurt that we spent so little time talking about adoption, my apologies; I hope to hear from you in the comment section below. Adoption holds the key to AI implementation. While you are undergoing digital transformation (you soon will be, if you are not already), you are driving your consumers, employees and executives crazy with this new paradigm, so adopting yet another autonomy layer brings its own challenges. Some of the adoption challenges can be attributed to how well capabilities are understood. From poor understanding of corporate fundamentals to the inability to deploy a fitted model that can be recalibrated once the ecosystem shifts, the adoption challenges are everywhere.

So, while it is great to jump on the AI bandwagon, it is important now (more than ever before) to understand the business. While IT could be a superhero amidst all this, understand that with more power comes more responsibility. So, help prepare your tech stack to be responsible and to keep its ears open.

If you have more to add, I welcome your thoughts in the comments below. I appreciate your interest.

Source by v1shal

The Beginner’s Guide to Predictive Workforce Analytics

Greta Roberts, CEO
Talent Analytics, Corp.

Human Resources Feels Pressure to Begin Using Predictive Analytics
Today’s business executives are increasingly applying pressure to their Human Resources departments to “use predictive analytics”.  This pressure isn’t unique to Human Resources as these same business leaders are similarly pressuring Sales, Customer Service, IT, Finance and every other line of business (LOB) leader, to do something predictive or analytical.

Every line of business (LOB) is clear on their focus. They need to uncover predictive analytics projects that somehow affect their bottom line. (Increase sales, increase customer service, decrease mistakes, increase calls per day and the like).

Human Resources Departments have a Different, and Somewhat Unique, Challenge not Faced By Most Other Lines of Business
When Human Resources analysts begin a predictive analytics initiative, what we see mirrors what every other line of business does. Somehow for HR, instead of having a great outcome it can be potentially devastating.

Unless the unique challenge HR faces is understood, it can trip up an HR organization for a long time, cause them to lose analytics project resources and funding, and continue to perplex HR as they have no idea how they missed the goal of the predictive initiative so badly.

Human Resources’ Traditional Approach to Predictive Projects
Talent Analytics’ experience has been that (like all other lines of business) when Human Resources focuses on predictive analytics projects, they look around for interesting HR problems to solve; that is, problems inside of the Human Resources departments. They’d like to know if employee engagement predicts anything, or if they can use predictive work somehow with their diversity challenges, or predict a flight risk score that is tied to how much training or promotions someone has, or see if the kind of onboarding someone has relates to how long they last in a role. Though these projects have tentative ties to other lines of business, these projects are driven from an HR need or curiosity.

HR (and everyone else) Needs to Avoid the “Wikipedia Approach” to Predictive Analytics
Our firm is often asked if we can “explore the data in the HR systems” to see if we can find anything useful. We recommend avoiding this approach as it is exactly the same as beginning to read Wikipedia from the beginning (like a book) hoping to find something useful.

When exploring HR data (or any data) without a question, what you’ll find are factoids that will be “interesting but not actionable”. They will make people say “really, I never knew that”, but nothing will result.  You’ll pay an external consultant a lot of money to do this, or have a precious internal resource do this – only to gain little value without any strategic impact.  Avoid using the Wikipedia Approach – at least at first.  Start with a question to solve.  Don’t start with a dataset.

Human Resources Predictive Project Results are Often Met with Little Enthusiasm
Like all other Lines of Business, HR is excited to show results of their HR focused predictive projects.

The important disconnect. HR shows results that are meaningful to HR only.

Perhaps there is a prediction that ties the number of training classes to attrition, or correlates performance review ratings with how long someone would last in their role. This is interesting information to HR but not to the business.

Here’s what’s going on.

Business Outcomes Matter to the Business.  HR Outcomes Don’t.
Human Resources departments can learn from the Marketing Department who came before them on the predictive analytics journey. Today’s Marketing Departments, that are using predictive analytics successfully, are arguably one of the strongest and most strategic departments of the entire company.

Today’s Marketing leaders predict customers who will generate the most revenue (have high customer lifetime value). Marketing Departments did not gain any traction with predictive analytics when they were predicting how many prospects would “click”. They needed to predict how many customers would buy.

Early predictive efforts in the Marketing Department used predictive analytics to predict how many webinars they would need to conduct to get 1,000 new prospects into their prospect database, or how much they would need to spend on marketing campaigns to get prospects to click on a coupon. (Adding new prospect names to a prospect database is a marketing goal, not a business goal. Clicking on a coupon is a marketing goal, not a business goal.) Or they could predict that customer engagement would go up if they gave a discount on a Friday (again, this is a marketing goal, not a business goal). The business doesn’t care about any of these “middle measures” unless they can be proved and tracked to the end business outcome.

Marketing Cracked the Code
Business wants to reliably predict how many people would buy (not click) using this coupon vs. that one.  When marketing predicted real business outcomes, resources, visibility and funding quickly became available.

When Marketing was able to show a predictive project that could identify what offer to make so that a customer bought and sales went up – business executives took notice. They took such close notice that they highlighted what Marketing was able to do, they gave Marketing more resources and funding and visibility. Important careers were made out of marketing folks who were / are part of strategic predictive analytics projects that delivered real revenue and / or real cost savings to the business’s bottom line.

Marketing stopped being “aligned” with the business; Marketing was the business.

Human Resources needs to do the same thing.

Best Approach for Successful and Noteworthy Predictive Workforce Projects
Many people get tangled up in definitions. Is it people analytics, workforce analytics, talent analytics or something else? It doesn’t matter what you call it – the point is that predictive workforce projects need to address and predict business outcomes not HR outcomes.

Like Marketing learned over time, when Human Resources begins predictive analytics projects, they need to approach the business units they support and ask them what kinds of challenges they are having that might be affected by the workforce.

There are 2 critical categories for strategic predictive workforce projects:

  • Measurably reducing employee turnover / attrition in a certain department or role

  • Measurably increasing specific employee performance (real performance, not performance review scores) in one role or department or another (e.g., more sales, fewer mistakes, higher customer service scores, fewer accidents).

I say “measurably” because to be credible, the predictive workforce initiative needs to measure and show business results both before and after the predictive model.
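
As a rough illustration of what “measurably” could look like for the turnover case, the sketch below compares a turnover rate before and after a predictive model is put to use. The function names and all of the numbers are hypothetical, chosen only to show the arithmetic; they are not results from any real project.

```python
def turnover_rate(separations, average_headcount):
    """Share of a role's average headcount that left during the period."""
    if average_headcount <= 0:
        raise ValueError("average_headcount must be positive")
    return separations / average_headcount


def relative_reduction(before, after):
    """Fractional drop from the 'before' rate to the 'after' rate."""
    if before <= 0:
        raise ValueError("before rate must be positive")
    return (before - after) / before


# Hypothetical figures for one role over two comparable 12-month periods.
rate_before = turnover_rate(separations=45, average_headcount=150)
rate_after = turnover_rate(separations=30, average_headcount=150)

print(f"Turnover before: {rate_before:.0%}")  # 30%
print(f"Turnover after:  {rate_after:.0%}")   # 20%
print(f"Relative reduction: {relative_reduction(rate_before, rate_after):.0%}")  # 33%
```

A before/after comparison on its own does not show that the model caused the change, so in practice it helps to compare against a similar group that was hired without the model.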

For Greatest ROI: Businesses Must Predict Performance or Flight Risk Pre-Hire
Once an employee is hired, the business begins pouring significant cost into the employee, typically made up of a) their salary and benefits and b) training time while they ramp up to speed and deliver little to no value. Our analytics work measuring true replacement costs shows us that, even for very entry-level roles (Call Center Rep, Bank Teller and the like), a conservative replacement estimate for a single employee will be over $6,000.

A great example is the credit industry. Imagine a creditor extending credit to someone for a mortgage, and then applying analytics after the mortgage has been extended to predict which mortgage holders are a good credit risk. It’s preposterous.

The only thing the creditor can do after the relationship has begun is try to coach, train, encourage, change the payment plan and the like. By then it is too late.

Predicting credit risk (who will pay their bills) – is predicting human behavior.  Predicting who will make their sales quota, who will make happy customers, who will make mistakes, who will drive their truck efficiently – also is predicting human behavior.

HR needs to realize that predicting human behavior is a mature domain with decades of experience and time to hone approaches, algorithms and sensitivity to private data.

What is Human Resources’ Role in Predictive Analytics Projects?
The great news is that typically the Human Resources Department will already be aware of both of these business challenges. They just hadn’t considered that Human Resources could be a part of helping to solve these challenges using predictive analytics.

Many articles discuss how Human Resources needs to be an analytics culture, and that all Human Resources employees need to learn analytics. Though I appreciate the realization that analytics is here to stay, Human Resources of all people should know that there are some people with the natural mindset to “get” and love analytics and there are some that don’t and won’t.

As I speak around the world and talk to folks in HR, I can feel the fear felt today by people in HR who have little interest in this space. My recommendation would be to breathe, take a step back and realize that not everyone needs to know how to perform predictive analytics.  Realize there are many traditional HR functions that need to be accomplished. We recommend a best practice approach of identifying who does have the mindset and interest in the analytics space and let them partner with someone who is a true predictive analyst.

For those who know they are not cut out to be the person doing the predictive analytics there are still many roles where they can be incredibly useful in the predictive process. Perhaps they could identify problem areas that predictive analytics can solve, or perhaps they could be the person doing more of the traditional Human Resources work. I find this “analytics fear” paralyzes and demoralizes employees and people in general.

Loosely Identified, but Important Roles on a Predictive Workforce Analytics Project

  1. Someone to identify high turnover roles in the lines of business, or identify where there are a lot of employees not performing very well in their jobs

  2. A liaison: Someone to introduce the HR predictive analytics team to the lines of business with turnover or business performance challenges

  3. Someone to help find and access the data to support the predictive project

  4. Someone to actually “do” the predictive analytics work (the workforce analyst or data scientist)

  5. Someone who creates a final business report to show the results of the work (both positive and negative)

  6. Someone who presents the final business report

  7. A high level project manager to help keep the project moving along

  8. The business and HR experts that understand how things work and need to be consulted all along the way

These roles can sometimes all be the same person, and sometimes they can be many different people depending on the complexity of the project, the size of the predictive workforce organization, the number of lines of business that are involved in the project and / or the multiple areas where data needs to be accessed.

The important thing to realize is there are several non analytics roles inside of predictive projects. Not every role in a predictive project requires a predictive specialist or even an analytics savvy person.

High Value Predictive Projects Don’t Deliver HR Answers
Should HR’s first predictive projects answer HR questions? We recommend not, at least not to begin with. We started by describing how business leaders are pressuring Human Resources to do predictive analytics projects. There is often little or no guidance given to HR about what predictive projects to do.

Here is my prediction and you can take it to the bank. I’ve seen it happen over and over again.

When HR departments use predictive analytics to solve real, Line of Business challenges that are driven by the workforce, HR becomes an instant hero. These Human Resources Departments are given more resources, their projects are funded, they receive more headcount for their analytics projects – and like Marketing, they will turn into one of the most strategic departments of the entire company.

Feeling Pressure to Get Started with Predictive?
If you’re feeling pressure from your executives to start using predictive analytics strategically and have a high volume role like sales or customer service you’d like to optimize, get in touch.

Want to see more examples of “real” predictive workforce business outcomes? Attend Predictive Analytics World for Workforce in San Francisco, April 3-6, 2016.

Greta Roberts is the CEO & Co-founder of Talent Analytics, Corp. She is the Program Chair of Predictive Analytics World for Workforce and a Faculty member of the International Institute for Analytics. Follow her on twitter @gretaroberts.

Source: The Beginner’s Guide to Predictive Workforce Analytics

April 10, 2017 Health and Biotech analytics news roundup

A DNA-testing company is offering patients a chance to help find a cure for their conditions: Invitae is launching the Patient Insights Network, where people can input their own genome data and help link it to other health data.

Congratulations, you’re a parasite!  Erick Turner and Kun-Hsing Yu won the first ‘Research Parasite’ award, given to highlight reanalysis of data. The name is a tongue-in-cheek reference to an infamous article decrying the practice.

IMI chief: ‘We need to learn how to share data in a safe and ethical manner’: Pierre Meulien discusses the EU’s Innovative Medicines Initiative, where public and private institutions collaborate.

5 Tips for Making Use of Big Data in Healthcare Production: Two pharmaceutical executives offer their opinions on using data in pharmaceutical manufacturing.

Originally Posted at: April 10, 2017 Health and Biotech analytics news roundup

Skill-Based Approach to Improve the Practice of Data Science

Our Big Data world requires the application of data science principles by data professionals. I’ve recently taken a look at what it means to practice data science as a data scientist. Our survey results of over 500 data professionals revealed that different types of data scientists possess proficiency in different types of data skills. In today’s post, I take another look at that data to identify the data skills that are essential for successful analytics projects. Additionally, I will present the Data Science Driver Matrix, a skill-based approach to identify how to improve the practice of data science.

Substandard Proficiency in Data Skills

In this ongoing study with AnalyticsWeek, we asked data professionals a variety of questions about their skills, job role, education level and more.

Data professionals were asked to rate their proficiency across 25 data skills in five skill areas (i.e., business, technology, programming, math & modeling and statistics) using the following scale:

  • Don’t know (0)
  • Fundamental Knowledge (20)
  • Novice (40)
  • Intermediate (60)
  • Advanced (80)
  • Expert (100)

Figure 1. Proficiency in Data Science Skills by Job Role.

The different levels of proficiency are defined around the data professional’s ability to give help or need to receive it. In the instructions to the data professionals, the “Intermediate” level of proficiency was defined as the ability “to successfully complete tasks as requested.” We used that proficiency level (i.e., Intermediate) as the minimum acceptable level of proficiency for each data skill. The proficiency levels below the Intermediate level (i.e., Novice, Fundamental Knowledge, Don’t know) were defined by an increasing need for help on the part of the data professional. Proficiency levels above the Intermediate level (i.e., Advanced, Expert) were defined by the data professional’s increasing ability to give help or be known by others as “a person to ask.”
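
To make the scale concrete, here is a minimal sketch of how ratings on it could be turned into an average score and checked against the Intermediate threshold. The scale values come from the list above; the function names and the sample ratings are hypothetical.

```python
# Scale values as listed in the survey; the helper names are hypothetical.
PROFICIENCY_SCALE = {
    "Don't know": 0,
    "Fundamental Knowledge": 20,
    "Novice": 40,
    "Intermediate": 60,
    "Advanced": 80,
    "Expert": 100,
}

# "Intermediate" is treated as the minimum acceptable level of proficiency.
MINIMUM_ACCEPTABLE = PROFICIENCY_SCALE["Intermediate"]


def average_proficiency(ratings):
    """Mean numeric score for a list of rating labels."""
    if not ratings:
        raise ValueError("ratings must not be empty")
    scores = [PROFICIENCY_SCALE[label] for label in ratings]
    return sum(scores) / len(scores)


def meets_minimum(ratings):
    """True when the average rating is at or above the Intermediate level."""
    return average_proficiency(ratings) >= MINIMUM_ACCEPTABLE


# Made-up ratings from four respondents for a single skill.
example = ["Novice", "Intermediate", "Advanced", "Expert"]
print(average_proficiency(example))  # 70.0
print(meets_minimum(example))        # True
```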

We looked at the level of proficiency for the 25 different data skills across four different job roles. As is seen in Figure 1, data professionals tend to be skilled in areas that are appropriate for their job role (see green-shaded areas in Figure 1 where average proficiency ratings are 60 or above). Specifically, Business Management data professionals show the most proficiency in Business Skills. Researchers, on the other hand, show the lowest level of proficiency in Business Skills and the highest in Statistics Skills.

For many of the data skills, however, the typical data professional does not have the minimum level of proficiency to be successful at work, no matter their role (see yellow- and red-shaded areas in Figure 1 where average proficiency ratings are below 60). Specifically, there are 10 data skills in which the typical data professional does not have the minimum level of proficiency: Unstructured data, NLP, Machine Learning, Big and distributed data, Cloud management, Front-end programming, Optimization, Graphical models, Algorithms and Bayesian statistics. Furthermore, there are nine data skills in which only one type of data professional has the minimum level of proficiency to be successful at work: Product design, Business Development, Budgeting, Database Administration, Back-end Programming, Data Management, Math, Statistics/Statistical Modeling and Science/Scientific Method.

Not all Data Skills are Equally Important

Given that data professionals lack proficiency in many skill areas, where do they begin to improve their overall set of data skills? Are some data skills more critical to project success than others? Should data professionals focus on learning/developing certain skills instead of other, less important skills?

Table 1. Correlations of Proficiency of Different Data Skills with Satisfaction with Outcomes of Analytics Projects

In our study, data professionals were asked to rate their satisfaction with the outcomes of analytics projects on which they work. They provided their rating on a scale from 0 (Extremely Dissatisfied) to 10 (Extremely Satisfied). I used this score as a measure of project success.

For each data skill, I correlated data professionals’ proficiency ratings with the data professional’s satisfaction with outcomes to understand the link between a specific skill and the outcome of analytics projects. This exercise was done for each of the four job roles (See Table 1). Skills that show a high correlation with satisfaction with outcomes indicate that those skills are closely linked to project success (as defined by the satisfaction ratings). Skills listed in the top half of Table 1 are more essential to project outcomes compared to skills listed in the bottom half of Table 1.
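
To make the correlation step concrete, here is a minimal sketch of a Pearson correlation between proficiency ratings and satisfaction ratings for one skill. It assumes the correlations in Table 1 are Pearson coefficients, which the post does not state, and the ratings below are invented for illustration.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return covariance / math.sqrt(spread_x * spread_y)


# Invented data: one skill, five respondents in the same job role.
proficiency = [20, 40, 60, 80, 100]   # 0-100 proficiency scale
satisfaction = [3, 5, 4, 8, 9]        # 0-10 satisfaction with outcomes

r = pearson_r(proficiency, satisfaction)
print(f"r = {r:.2f}")
```

Repeating this for each of the 25 skills within a job role would give one column of a table like Table 1.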

On average, we see that data skills are more closely linked to satisfaction with work outcomes for data professionals who are Business Managers (average r = .30) and Researchers (average r = .30) compared to data professionals who are Developers (average r = .18) and Creatives (average r = .18).

The ranking of data skills with respect to their impact on satisfaction also varies significantly by job role. The average correlation among the rankings of data skills across the four job roles is r = .01, suggesting that data skills that are essential to project outcomes for one type of data scientist are not essential for other types of data scientists.
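
One way to compare skill rankings between job roles is a rank correlation. The sketch below uses Spearman’s formula for rankings without ties and averages it over every pair of roles. This is an assumption about the method, since the post does not say which statistic was used, and the rankings shown are made up.

```python
from itertools import combinations


def spearman_rho(ranks_a, ranks_b):
    """Spearman rank correlation for two rankings of the same items (no ties)."""
    n = len(ranks_a)
    if n != len(ranks_b) or n < 2:
        raise ValueError("need two rankings of equal length >= 2")
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks_a, ranks_b))
    return 1 - (6 * d_squared) / (n * (n ** 2 - 1))


def average_pairwise_rho(rankings_by_role):
    """Mean Spearman correlation over every pair of roles."""
    pairs = list(combinations(rankings_by_role, 2))
    total = sum(
        spearman_rho(rankings_by_role[a], rankings_by_role[b]) for a, b in pairs
    )
    return total / len(pairs)


# Hypothetical rankings of the same five skills for three roles.
rankings = {
    "Business Manager": [1, 2, 3, 4, 5],
    "Developer": [3, 5, 1, 4, 2],
    "Creative": [5, 4, 3, 2, 1],
}

average_rho = average_pairwise_rho(rankings)
print(f"average rank correlation = {average_rho:.2f}")
```

An average near zero, like the r = .01 reported above, would mean that knowing which skills matter most for one role says little about which matter most for another.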

The Data Science Driver Matrix: Graphing the Results

Figure 2. Data Science Driver Matrix: Skill-based approach to improve the practice of data science.

So, we now have two pieces of information for each of the 25 data skills: 1) average proficiency rating (in Figure 1) and 2) correlation with work outcome (in Table 1). For each job role, I plotted both pieces of information for the 25 data skills in a 2×2 grid (see Figure 2). I call this diagram the Data Science Driver Matrix (DSDM). In the DSDM, the x-axis represents the average level of proficiency in each data skill. The y-axis represents how essential the skill is to project outcome.

The midpoints of the x- and y-axes are 60 (the minimum level of proficiency needed to be successful at work) and .30 (approximately the average correlation of skills with satisfaction), respectively.

Interpreting the Results: Improving the Practice of Data Science

Each of the data skills will fall into one of the four quadrants of the DSDM. In Table 1, I list the quadrant number for each data skill for the separate job roles. The decisions you make about a specific data skill (e.g., whether to learn it or not) depend on the quadrant in which it falls:

  1. Quadrant 1 (upper left): Quadrant 1 houses skills that are essential to the outcome of the project and in which the proficiency is below the minimum requirement. These data skills reflect good areas for potential improvement efforts because we have ample room for improvement. Improvements in proficiency could come in the form of investments in hiring data professionals with these skills, investments in training your current data professionals to acquire these skills or creation of teams with members that have complementary skills.
  2. Quadrant 2 (upper right): Quadrant 2 houses skills that are essential to the outcome of the project and in which the proficiency is above the minimum requirement. These skills reflect data professionals’ strength that we know improves the success in analytics projects. You’ll likely want to stay the course on these data skills.
  3. Quadrant 3 (lower right): Quadrant 3 houses skills in which the proficiency is above the minimum requirement but are not very essential to the outcome of the project. Be careful not to over-invest in improving these skills as they are not necessarily essential for the success of analytics projects.
  4. Quadrant 4 (lower left): Quadrant 4 houses skills in which the proficiency is below the minimum requirement but are not very essential to the outcome of the project. Consider divesting resources from these skills and re-direct them to skills falling in Quadrant 1. These skills are of low priority because, despite the fact that proficiency is low for these skills, they do not have a substantial impact on the outcome of the analytics projects.
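
The quadrant rules above can be sketched as a small function. The midpoints of 60 and .30 come from the post; the function name, the treatment of values that sit exactly on a midpoint, and the example numbers are assumptions made for illustration.

```python
PROFICIENCY_MIDPOINT = 60     # minimum proficiency needed to be successful at work
CORRELATION_MIDPOINT = 0.30   # approximate average correlation with satisfaction

# Short paraphrases of the guidance given for each quadrant.
GUIDANCE = {
    1: "essential, low proficiency: invest in hiring, training or teaming",
    2: "essential, high proficiency: stay the course",
    3: "less essential, high proficiency: avoid over-investing",
    4: "less essential, low proficiency: low priority",
}


def dsdm_quadrant(avg_proficiency, correlation):
    """Return the DSDM quadrant (1-4) for one skill.

    Assumption: values exactly on a midpoint count as "high"; the post does
    not say how such ties are handled.
    """
    essential = correlation >= CORRELATION_MIDPOINT
    proficient = avg_proficiency >= PROFICIENCY_MIDPOINT
    if essential and not proficient:
        return 1  # upper left
    if essential and proficient:
        return 2  # upper right
    if proficient:
        return 3  # lower right
    return 4      # lower left


# Hypothetical skill with average proficiency 45 and correlation .42.
quadrant = dsdm_quadrant(avg_proficiency=45, correlation=0.42)
print(quadrant, "-", GUIDANCE[quadrant])
```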

Data Science Driver Matrices for Different Data Roles

I created a DSDM for each of the four job roles: Business Manager, Developer, Creative and Researcher. For this exercise, I will focus primarily on data skills that fall into Quadrant 1 (i.e., low proficiency in highly essential data skills).

1. Business Managers

For data professionals who self-identify as Business Managers (see Figure 3), we see that none of the skills fall into Quadrant 2 (high proficiency in highly essential skills), while 12 skills fall into Quadrant 1 (low proficiency in highly essential skills). Skills in quadrant 1 include:

Figure 3. Data Science Driver Matrix for Business Managers.
  • Statistics / Statistical Modeling
  • Data Mining
  • Science / Scientific Method
  • Big and distributed data
  • Machine Learning
  • Bayesian Statistics
  • Optimization
  • Unstructured data
  • Structured data
  • Algorithms
Figure 4. Data Science Driver Matrix for Developers.

2. Developers

For data professionals who identify as Developers (see Figure 4), most of the skills fall into Quadrant 4 (low proficiency in non-essential skills). Only two skills fall into Quadrant 1:

  • Systems Administration
  • Data Mining
Figure 5. Data Science Driver Matrix for Creatives.

3. Creatives

For data professionals who identify as Creatives (see Figure 5), most of the skills fall in Quadrant 4 (low proficiency in non-essential skills). Five skills fall into Quadrant 1:

  • Math
  • Data Mining
  • Business Development
  • Graphical Models
  • Optimization

4. Researchers

For data professionals who identify as Researchers (see Figure 6), six skills fall into Quadrant 1 (low proficiency in essential skills):

Figure 6. Data Science Driver Matrix for Researchers.
  • Algorithms
  • Big and distributed data
  • Data Management
  • Product Design
  • Machine Learning
  • Bayesian Statistics

Researchers appear to lack proficiency in areas that are critical to the success of analytics projects.


Applying the right data skills to analytics projects is key to successful project outcomes. I proposed a skill-based approach to improve the practice of data science to help identify the essential data skills for different types of data professionals. Businesses can use these results to ensure they bring the right data professionals with the right skills to bear on their Big Data analytics projects.

There are a few conclusions we can draw from the current analyses.

  1. Data Mining was the only data skill that was one of the top 4 data skills that was essential to the project outcome. No matter your role as a data professional, a key ingredient to project success is your ability to mine insights from data.
  2. Proficiency in data skills appears to be more important for data professionals who are in the roles of Business Management and Researcher compared to data professionals who are in the roles of Developer and Creative. Improving proficiency in data skills to increase satisfaction with work appears to be a more realistic approach for Business Management and Researcher type data professionals.
  3. Data professionals could likely be happier about the outcomes of their projects if they possessed specific data skills. Surprisingly, for Business Managers, business-related data skills are not critical to the outcome of their analytics work. Instead, what drives their work satisfaction is the extent to which they are proficient in statistical and technological skills. Unfortunately, these Business Management workers typically do not possess adequate proficiency in these types of skills.

Improving the practice of data science can be accomplished in a variety of ways.  While the current analysis suggests that you can improve analytics project outcomes by improving skills for specific data professionals, another approach is to build data science teams with data professionals who have complementary skills. As I’ve found before, Business Managers are more satisfied with the outcomes of analytics projects when they are paired with data professionals with strong statistics skills compared to Business Managers who work alone. Likewise, Researchers are more satisfied with the outcomes of analytics projects when they are paired with data professionals with strong business acumen. Using either approach, organizations can leverage the practice of data science to address their analytics projects.

Source: Skill-Based Approach to Improve the Practice of Data Science

Rethinking classical approaches to analysis and predictive modeling



The speaker will address the need to rethink classical approaches to analysis and predictive modeling. He will examine “iterative analytics” and extremely fine-grained segmentation down to a single customer, ultimately building one model per customer, or millions of predictive models, delivering on the promise of a “segment of one.” The speaker will also address the speed at which all this has to work to maintain a competitive advantage for innovative businesses.


Afshin Goodarzi, Chief Analyst, 1010data

A veteran of analytics, Goodarzi has led several teams in designing, building and delivering predictive analytics and business analytical products to a diverse set of industries. Prior to joining 1010data, Goodarzi was the Managing Director of Mortgage at Equifax, responsible for the creation of new data products and supporting analytics to the financial industry. Previously, he led the development of various classes of predictive models aimed at the mortgage industry during his tenure at Loan Performance (Core Logic). Earlier on he had worked at BlackRock, the research center for NYNEX (present day Verizon) and Norkom Technologies. Goodarzi’s publications span the fields of data mining, data visualization, optimization and artificial intelligence.


Source by v1shal

Energy companies have more data than they know what to do with

Energy enterprises (specifically, oil and natural gas companies) are witnessing a monumental shift in the global economy. North America is ramping up production, which is raising a number of health, safety and environmental concerns among United States and Canadian citizens alike.

It’s easy to view big data analytics as a cure-all for the challenges faced by the energy industry, but using the technology doesn’t automatically solve those problems. As I’ve repeatedly said, data visualization merely provides finished intelligence to its users – people are responsible for finding out how to apply this newfound knowledge to their operations.

“The ultimate goal of the modern energy company is to optimize production efficiency.”

What’s the end? Affordability
If energy companies can find efficient methods of extracting and refining larger amounts of fossil fuels without increasing the amount of resources they use, economics would suggest the price of the oil and natural gas would decrease. Ultimately, affordability is dictated by supply and demand, but I digress.

From the perspectives of McKinsey & Company’s Stefano Martinotti, Jim Nolten, and Jens Arne Steinsbø, the ultimate goal of the modern energy company is to optimize production efficiency without sacrificing residential health, worker safety and the environment. Based on McKinsey’s research, which specifically scrutinized oil drilling operations in the North Sea (the water body located between Great Britain, Scandinavia and the Netherlands), the authors discovered that oil companies with high production efficiencies did not incur high costs. Instead, these enterprises made systematic changes to existing operations by:

  • Eliminating equipment malfunctions
  • Choosing assets based on quality and historic performance data
  • Aligning personnel and properties with the market to plan and implement shutdowns

Analytics as an enabler of automation
The McKinsey authors maintained that automating operations was a key component to further improving existing oil drilling operations. This is where you get into the analytics applications and use cases associated with network-connected devices. Many of the North Sea’s offshore oil extraction facilities are equipped with comprehensive data infrastructures composed of network assets, sensors and software.

Data flow is a huge part of the automation process.

The authors noted such platforms can possess as many as 40,000 data tags, not all of which are connected or used. The argument stands that if unused sensors and other technologies can be integrated into central operations to create a smart drilling facility, such a property could save between $220 million and $260 million annually. The possibilities and benefits go beyond the bottom line:

  • Automation could extend the lifecycle of equipment that is slowly becoming antiquated
  • New uses for under-allocated assets could be recognized
  • Equipment assessments could be conducted by applications receiving data from radio-frequency identification tags, enabling predictive maintenance

“A smart drilling facility could save between $220 million and $260 million annually.”

Resolving industry challenges
From a holistic standpoint, the oil and natural gas sector will use data analytics to effectively handle a number of industry challenges, some of which are opposed by internal or external forces.

One of the obvious challenges is the low tolerance people have for health, safety and environmental accidents. Think of how the BP oil spill of 2010 impacted consumer sentiments toward the energy industry. Technologies and processes associated with data analytics can resolve this issue by monitoring asset integrity, accurately anticipating when failures are about to occur and regularly scrutinizing how operations are affecting certain areas.

Generally, use cases expand as data scientists, operators and other professionals flex their creative muscles. There’s no telling how analytics will be applied in the near future.

Originally posted via “Energy companies have more data than they know what to do with”


How To Calculate Average Sales

No matter what industry you’re in, any business that deals with customers has to keep track of its sales. When you need a quick way to monitor your company’s success in meeting objectives, sales provide one of the easiest metrics, as they are a direct display of efficiency related to profits. Even so, raw sales data can be overwhelming and may not always paint the clearest picture.

Using average sales across different periods can give you a better idea of how well your sales strategies and marketing campaigns are performing, what tactics are connecting with consumers, and how successful your sales team is at converting leads. More importantly, it gives you a straightforward way to establish a standard for measuring success and failure. Calculating average sales is an uncomplicated process and can help steer your business decisions for greater success.

Why Measure Average Sales?

More than just an eagle’s eye view of your sales operations, average sales can also give you a granular view of the results of every sale. Measuring average sales by customer can deliver useful insights such as how many dollars customers are spending at the point of sale, and how that compares to historical data.

On a broader level, you can compare the efficiency of different teams, stores, and branches by measuring their monthly and daily sales against historic averages and each other. This is important when choosing how to allocate budgets, deciding where to trim resources, and providing greater support. By understanding the historic patterns and combining them with more real-time data, you can make smarter decisions regarding your sales pipeline.

Looking for other ways to measure your sales numbers? Explore our interactive sales dashboards!

How to Calculate Average Sales

Calculating your average sales depends on two factors: the period or frequency you want to analyze and the total sales value for that period. Average sales can be measured on a smaller scale, such as daily or weekly, or on a larger scale, such as monthly or even annually. To calculate the average sales over your chosen period, find the total value of all sales orders in the chosen timeframe and divide by the number of intervals in it. For example, you can calculate average sales per month by taking the value of sales over a year and dividing by 12 (the number of months in the year). If the total sales for the year were $1,000,000, average monthly sales would be calculated as follows:

Average monthly sales = Total annual sales / 12 = $1,000,000 / 12 ≈ $83,333

Average sales per month, in this case, would be roughly $83,333. Daily average sales are also a common calculation, and they can vary based on the broader timeframe being measured. For example, you could measure daily average sales over a single month to compare year-over-year data, or calculate daily average sales over a full year to see how stores and sales teams performed throughout a 12-month period. In either case, the calculation stays the same, except that the top number is the total sales for the period being measured and the divisor is the total number of work days in that period.
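
As a minimal sketch of the calculation described above, the short function below divides a period's total sales by the number of intervals in it. The function name average_sales and the figure of 250 work days are assumptions made for illustration; only the $1,000,000 annual total and the 12 months come from the example in the text.

```python
def average_sales(total_sales: float, intervals: int) -> float:
    """Average sales per interval: total sales divided by the number of intervals."""
    if intervals <= 0:
        raise ValueError("intervals must be a positive number")
    return total_sales / intervals


# Monthly average for the $1,000,000 year used in the text.
monthly = average_sales(1_000_000, 12)
print(f"Average monthly sales: ${monthly:,.2f}")  # Average monthly sales: $83,333.33

# Daily average over the same year, assuming a hypothetical 250 work days.
daily = average_sales(1_000_000, 250)
print(f"Average daily sales: ${daily:,.2f}")  # Average daily sales: $4,000.00
```

Only the divisor changes between the monthly and daily versions, which is why the same function serves both.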

A Variant Average Sales Calculation

Another useful way to track the average value of a sale is to measure how effective your sales team is on a per-customer basis. While overall visitors and the number of sales may be on the rise, if the value of sales per customer is declining, your overall revenues may actually fall. In this case, the division is similar to average sales, but instead of a time frame, you divide the total sales value by the number of transactions completed during the period you are analyzing. For instance, if your total sales for the day were $15,000, and you completed 35 unique transactions, the average value of sales would be approximately $429 per transaction. The formula to calculate average sales value is as follows:

Average sale value = Total sales value / Number of transactions
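
The per-transaction variant can be sketched the same way. The function name average_sale_value is an assumption for illustration; the inputs are the $15,000 in daily sales and the 35 transactions from the example above.

```python
def average_sale_value(total_sales: float, transactions: int) -> float:
    """Average value of a sale: total sales divided by completed transactions."""
    if transactions <= 0:
        raise ValueError("transactions must be a positive number")
    return total_sales / transactions


# $15,000 in sales across 35 transactions.
value = average_sale_value(15_000, 35)
print(f"Average sale value: ${value:,.2f}")  # Average sale value: $428.57
```

Tracking this figure alongside the transaction count shows whether growth comes from more sales or from larger ones.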

Other KPIs You Can Include

Average sales are a great place to start tracking your sales effort, but to gain more actionable insights, your dashboard should also include other KPIs that can provide useful context. These are just a few of the useful sales dashboard examples of KPIs you can include when building your BI platform.

  • Average Revenue Per Unit (ARPU) – This metric is like average sale value but measures how much revenue a single customer or user will generate. This number is found by measuring revenue against the total number of units.
  • Sales per Rep – Average sales don’t give you a look into how individual salespeople may be performing. Adding sales per rep will provide a more granular look at your sales operations.
  • Opex to Sales – Raw sales data provides insight, but little context. Understanding how operating expenses relate to sales helps clarify the real value of a sale. If the Opex is too high, even large sales offer little real value.
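
The three KPIs listed above reduce to simple ratios and a grouped sum. The sketch below is one way they might be computed. All function names, rep names and figures are hypothetical, and the definitions follow the descriptions in the list (revenue divided by units, sales totaled per rep, operating expenses divided by sales) rather than any particular BI product.

```python
from collections import defaultdict


def arpu(revenue: float, units: int) -> float:
    """Average revenue per unit: revenue divided by the number of units."""
    return revenue / units


def sales_per_rep(sales):
    """Total sales value per rep from an iterable of (rep, amount) pairs."""
    totals = defaultdict(float)
    for rep, amount in sales:
        totals[rep] += amount
    return dict(totals)


def opex_to_sales(opex: float, sales: float) -> float:
    """Operating expenses as a fraction of sales."""
    return opex / sales


# Hypothetical figures, for illustration only.
print(arpu(50_000, 400))  # 125.0
print(sales_per_rep([("Ana", 100.0), ("Ben", 50.0), ("Ana", 25.0)]))  # {'Ana': 125.0, 'Ben': 50.0}
print(opex_to_sales(30_000, 120_000))  # 0.25
```

A higher Opex-to-sales ratio means less of each sale is left over after operating costs.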


Why “Big Data” Is a Big Deal

DATA NOW STREAM from daily life: from phones and credit cards and televisions and computers; from the infrastructure of cities; from sensor-equipped buildings, trains, buses, planes, bridges, and factories. The data flow so fast that the total accumulation of the past two years—a zettabyte—dwarfs the prior record of human civilization. “There is a big data revolution,” says Weatherhead University Professor Gary King. But it is not the quantity of data that is revolutionary. “The big data revolution is that now we can do something with the data.”

The revolution lies in improved statistical and computational methods, not in the exponential growth of storage or even computational capacity, King explains. The doubling of computing power every 18 months (Moore’s Law) “is nothing compared to a big algorithm”—a set of rules that can be used to solve a problem a thousand times faster than conventional computational methods could. One colleague, faced with a mountain of data, figured out that he would need a $2-million computer to analyze it. Instead, King and his graduate students came up with an algorithm within two hours that would do the same thing in 20 minutes—on a laptop: a simple example, but illustrative.

New ways of linking datasets have played a large role in generating new insights. And creative approaches to visualizing data—humans are far better than computers at seeing patterns—frequently prove integral to the process of creating knowledge. Many of the tools now being developed can be used across disciplines as seemingly disparate as astronomy and medicine. Among students, there is a huge appetite for the new field. A Harvard course in data science last fall attracted 400 students, from the schools of law, business, government, design, and medicine, as well as from the College, the School of Engineering and Applied Sciences (SEAS), and even MIT. Faculty members have taken note: the Harvard School of Public Health (HSPH) will introduce a new master’s program in computational biology and quantitative genetics next year, likely a precursor to a Ph.D. program. In SEAS, there is talk of organizing a master’s in data science.

“There is a movement of quantification rumbling across fields in academia and science, industry and government and nonprofits,” says King, who directs Harvard’s Institute for Quantitative Social Science (IQSS), a hub of expertise for interdisciplinary projects aimed at solving problems in human society. Among faculty colleagues, he reports, “Half the members of the government department are doing some type of data analysis, along with much of the sociology department and a good fraction of economics, more than half of the School of Public Health, and a lot in the Medical School.” Even law has been seized by the movement to empirical research—“which is social science,” he says. “It is hard to find an area that hasn’t been affected.”

The story follows a similar pattern in every field, King asserts. The leaders are qualitative experts in their field. Then a statistical researcher who doesn’t know the details of the field comes in and, using modern data analysis, adds tremendous insight and value. As an example, he describes how Kevin Quinn, formerly an assistant professor of government at Harvard, ran a contest comparing his statistical model to the qualitative judgments of 87 law professors to see which could best predict the outcome of all the Supreme Court cases in a year. “The law professors knew the jurisprudence and what each of the justices had decided in previous cases, they knew the case law and all the arguments,” King recalls. “Quinn and his collaborator, Andrew Martin [then an associate professor of political science at Washington University], collected six crude variables on a whole lot of previous cases and did an analysis.” King pauses a moment. “I think you know how this is going to end. It was no contest.” Whenever sufficient information can be quantified, modern statistical methods will outperform an individual or small group of people every time.

In marketing, familiar uses of big data include “recommendation engines” like those used by companies such as Netflix and Amazon to make purchase suggestions based on the prior interests of one customer as compared to millions of others. Target famously (or infamously) used an algorithm to detect when women were pregnant by tracking purchases of items such as unscented lotions—and offered special discounts and coupons to those valuable patrons. Credit-card companies have found unusual associations in the course of mining data to evaluate the risk of default: people who buy anti-scuff pads for their furniture, for example, are highly likely to make their payments.

In the public realm, there are all kinds of applications: allocating police resources by predicting where and when crimes are most likely to occur; finding associations between air quality and health; or using genomic analysis to speed the breeding of crops like rice for drought resistance. In more specialized research, to take one example, creating tools to analyze huge datasets in the biological sciences enabled associate professor of organismic and evolutionary biology Pardis Sabeti, studying the human genome’s billions of base pairs, to identify genes that rose to prominence quickly in the course of human evolution, determining traits such as the ability to digest cow’s milk, or resistance to diseases like malaria.

King himself recently developed a tool for analyzing social media texts. “There are now a billion social-media posts every two days…which represent the largest increase in the capacity of the human race to express itself at any time in the history of the world,” he says. No single person can make sense of what a billion other people are saying. But statistical methods developed by King and his students, who tested his tool on Chinese-language posts, now make that possible. (To learn what he accidentally uncovered about Chinese government censorship practices, see “Reverse-engineering Chinese Censorship.”)

King also designed and implemented “what has been called the largest single experimental design to evaluate a social program in the world, ever,” reports Julio Frenk, dean of HSPH. “My entire career has been guided by the fundamental belief that scientifically derived evidence is the most powerful instrument we have to design enlightened policy and produce a positive social transformation,” says Frenk, who was at the time minister of health for Mexico. When he took office in 2000, more than half that nation’s health expenditures were being paid out of pocket—and each year, four million families were being ruined by catastrophic healthcare expenses. Frenk led a healthcare reform that created, implemented, and then evaluated a new public insurance scheme, Seguro Popular. A requirement to evaluate the program (which he says was projected to cost 1 percent of the GDP of the twelfth-largest economy in the world) was built into the law. So Frenk (with no inkling he would ever come to Harvard), hired “the top person in the world” to conduct the evaluation, Gary King.

Given the complications of running an experiment while the program was in progress, King had to invent new methods for analyzing it. Frenk calls it “great academic work. Seguro Popular has been studied and emulated in dozens of countries around the world thanks to a large extent to the fact that it had this very rigorous research with big data behind it.” King crafted “an incredibly original design,” Frenk explains. Because King compared communities that received public insurance in the first stage (the rollout lasted seven years) to demographically similar communities that hadn’t, the results were “very strong,” Frenk says: any observed effect would be attributable to the program. After just 10 months, King’s study showed that Seguro Popular successfully protected families from catastrophic expenditures due to serious illness, and his work provided guidance for needed improvements, such as public outreach to promote the use of preventive care.

King himself says big data’s potential benefits to society go far beyond what has been accomplished so far. Google has analyzed clusters of search terms by region in the United States to predict flu outbreaks faster than was possible using hospital admission records. “That was a nice demonstration project,” says King, “but it is a tiny fraction of what could be done” if it were possible for academic researchers to access the information held by companies. (Businesses now possess more social-science data than academics do, he notes—a shift from the recent past, when just the opposite was true.) If social scientists could use that material, he says, “We could solve all kinds of problems.” But even in academia, King reports, data are not being shared in many fields. “There are even studies at this university in which you can’t analyze the data unless you make the original collectors of the data co-authors.”

The potential for doing good is perhaps nowhere greater than in public health and medicine, fields in which, King says, “People are literally dying every day” simply because data are not being shared.

Bridges to Business

NATHAN EAGLE, an adjunct assistant professor at HSPH, was one of the first people to mine unstructured data from businesses with an eye to improving public health in the world’s poorest nations. A self-described engineer and “not much of an academic” (despite having held professorships at numerous institutions including MIT), much of his work has focused on innovative uses of cell-phone data. Drawn by the explosive growth of the mobile market in Africa, he moved in 2007 to a rural village on the Kenyan coast and began searching for ways to improve the lives of the people there. Within months, realizing that he would be more effective sharing his skills with others, he began teaching mobile-application development to students in the University of Nairobi’s computer-science department.

While there, he began working with the Kenyan ministry of health on a blood-bank monitoring system. The plan was to recruit nurses across the country to text the current blood-supply levels in their local hospitals to a central database. “We built this beautiful visualization to let the guys at the centralized blood banks in Kenya see in real time what the blood levels were in these rural hospitals,” he explains, “and more importantly, where the blood was needed.” In the first week, it was a giant success, as the nurses texted in the data and central monitors logged in every hour to see where they should replenish the blood supply. “But in the second week, half the nurses stopped texting in the data, and within about a month virtually no nurses were participating anymore.”

Eagle shares this tale of failure because the episode was a valuable learning experience. “The technical implementation was bulletproof,” he says. “It failed because of a fundamental lack of insight on my part…that had to do with the price of a text message. What I failed to appreciate was that an SMS represents a fairly substantial fraction of a rural nurse’s day wage. By asking them to send that text message we were asking them to essentially take a pay cut.”

Fortunately, Eagle was in a position to save the program. Because he was already working with most of the mobile operators in East Africa, he had access to their billing systems. The addition of a simple script let him credit the rural nurses with a small denomination of prepaid air time, about 10 cents’ worth—enough to cover the cost of the SMS “plus about a penny to say thank you in exchange for a properly formatted text message. Virtually every rural nurse reengaged,” he reports, and the program became a “relatively successful endeavor”—leading him to believe that cell phones could “really make an impact” on public health in developing nations, where there is a dearth of data and almost no capacity for disease surveillance.

Eagle’s next project, based in Rwanda, was more ambitious, and it also provided a lesson in one of the pitfalls of working with big data: that it is possible to find correlations in very large linked datasets without understanding causation. Working with mobile-phone records (which include the time and location of every call), he began creating models of people’s daily and weekly commuting patterns, termed their “radius of gyration.” He began to notice patterns. Abruptly, people in a particular village would stop moving as much; he hypothesized that these patterns might indicate the onset of a communicable disease like the flu. Working with the Rwandan ministry of health, he compared the data on cholera outbreaks to his radius of gyration data. Once linked, the two datasets proved startlingly powerful; the radius of gyration in a village dropped two full weeks before a cholera outbreak. “We could even predict the magnitude of the outbreak based on the amount of decrease in the radius of gyration,” he recalls. “I had built something that was performing in this unbelievable way.”

And in fact it was unbelievable. He tells this story as a “good example of why engineers like myself shouldn’t be doing epidemiology in isolation—and why I ended up joining the School of Public Health rather than staying within a physical-science department.” The model was not predicting cholera outbreaks, but pinpointing floods. “When a village floods and roads wash away, suddenly the radius of gyration decreases,” he explains. “And it also makes the village more susceptible in the short term to a cholera outbreak. Ultimately, all this analysis with supercomputers was identifying where there was flooding—data that, frankly, you can get in a lot of other ways.”

Despite this setback, Eagle saw what was missing. If he could couple the data he had from the ministry of health and the mobile operators with on-the-ground reports of what was happening, then he would have a powerful tool for remote disease surveillance. “It opened my eyes to the fact that big data alone can’t solve this type of problem. We had petabytes* of data and yet we were building models that were fundamentally flawed because we didn’t have real insight about what was happening” in remote villages. Eagle has now built a platform that enables him to survey individuals in such countries by paying them small denominations of airtime (as with the Kenyan nurses) in exchange for answering questions: are they experiencing flu-like symptoms, sleeping under bednets, or taking anti-malarials? This ability to gather and link self-reported information to larger datasets has proven a powerful tool—and the survey technology has become a successful commercial entity named Jana, of which Eagle is co-founder and CEO.

New Paradigms—and Pitfalls

WILLY SHIH, Cizik professor of management practice at Harvard Business School, says that one of the most important changes wrought by big data is that their use involves a “fundamentally different way of doing experimental design.” Historically, social scientists would plan an experiment, decide what data to collect, and analyze the data. Now the low cost of storage (“The price of storing a bit of information has dropped 60 percent a year for six decades,” says Shih) has caused a rethinking, as people “collect everything and then search for significant patterns in the data.”

“This approach has risks,” Shih points out. One of the most prominent is data dredging, which involves searching for patterns in huge datasets. A traditional social-science study might assert that the results are significant with 95 percent confidence. That means, Shih points out, “that in one out of 20 instances” when dredging for results, “you will get results that are statistically significant purely by chance. So you have to remember that.” Although this is true for any statistical finding, the enormous number of potential correlations in very large datasets substantially magnifies the risk of finding spurious correlations.
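
A rough sketch of why dredging magnifies the risk: if every test is run at the 5 percent significance level and none of the tested relationships is real, the chance of at least one spurious "significant" result grows quickly with the number of tests. The code below assumes the tests are independent and that every null hypothesis is true, which is a simplification. The function names are hypothetical, and the simulation relies on the standard fact that a p-value is uniformly distributed between 0 and 1 when the null hypothesis holds.

```python
import random


def familywise_error(num_tests: int, alpha: float = 0.05) -> float:
    """Chance of at least one false positive across independent tests of true nulls."""
    return 1 - (1 - alpha) ** num_tests


def simulate_spurious_hits(num_tests: int, alpha: float = 0.05, seed: int = 0) -> int:
    """Count 'significant' results when no real effect exists.

    Each draw stands in for the p-value of one test under a true null hypothesis.
    """
    rng = random.Random(seed)
    return sum(rng.random() < alpha for _ in range(num_tests))


for n in (1, 20, 100):
    print(f"{n} tests: {familywise_error(n):.2f} chance of at least one false positive")

print("Spurious hits in 10,000 null tests:", simulate_spurious_hits(10_000))
```

With 20 tests the chance of at least one false positive is already about 64 percent, and 10,000 null tests would be expected to yield roughly 500 "significant" correlations by chance alone.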

Eagle agrees that “you don’t get good scientific output from throwing everything against the wall and seeing what sticks.” No matter how much data exists, researchers still need to ask the right questions to create a hypothesis, design a test, and use the data to determine whether that hypothesis is true. He sees two looming challenges in data science. First, there aren’t enough people comfortable dealing with petabytes of data. “These skill sets need to get out of the computer-science departments and into public health, social science, and public policy,” he says. “Big data is having a transformative impact across virtually all academic disciplines—it is time for data science to be integrated into the foundational courses for all undergraduates.”

Safeguarding data is his other major concern, because “the privacy implications are profound.” Typically, the owners of huge datasets are very nervous about sharing even anonymized, population-level information like the call records Eagle uses. For the companies that hold it, he says, “There is a lot of downside to making this data open to researchers. We need to figure out ways to mitigate that concern and craft data-usage policies in ways that make these large organizations more comfortable with sharing these data, which ultimately could improve the lives of the millions of people who are generating it—and the societies in which they are living.”

John Quackenbush, an HSPH professor of computational biology and bioinformatics, shares Eagle’s twin concerns. But in some realms of biomedical big data, he says, the privacy problem is not easily addressed. “As soon as you touch genomic data, that information is fundamentally identifiable,” he explains. “I can erase your address and Social Security number and every other identifier, but I can’t anonymize your genome without wiping out the information that I need to analyze.” Privacy in such cases is achieved not through anonymity but by consent paired with data security: granting access only to authorized researchers. Quackenbush is currently collaborating with a dozen investigators—from HSPH, the Dana-Farber Cancer Institute, and a group from MIT’s Lincoln Labs expert in security—to develop methods to address a wide range of biomedical research problems using big data, including privacy.

He is also leading the development of HSPH’s new master’s program in computational biology and quantitative genetics, which is designed to address the extraordinary complexity of biomedical data. As Quackenbush puts it, “You are not just you. You have all this associated health and exposure information that I need in order to interpret your genomic information.”

A primary goal, therefore, is to give students practical skills in the collection, management, analysis, and interpretation of genomic data in the context of all this other health information: electronic medical records, public-health records, Medicare information, and comprehensive-disease data. The program is a joint venture between biostatistics and the department of epidemiology.

Really Big Data

LIKE EAGLE, Quackenbush came to public health from another discipline—in his case, theoretical and high-energy experimental physics. He first began working outside his doctoral field in 1992, when biologists for the Human Genome Project realized they needed people accustomed to collecting, analyzing, managing, and interpreting huge datasets. Physicists have been good at that for a long time.

The first full human genome sequence took five to 15 years to complete, and cost $1 billion to $3 billion (“Depending on whom you ask,” notes Quackenbush). By 2009, eight years later, the cost had dropped to $100,000 and took a year. At that point, says Quackenbush, “if my wife had a rare, difficult cancer, I would have mortgaged our house to sequence her genome.” Now a genome sequence takes a little more than 24 hours and costs about $1,000—the point at which it can be paid for “on a credit card. That simple statement alone,” he says, “underscores why the biomedical sciences have become so data-driven.

“We each carry two copies of the human genome—one from our mother and one from our father—that together comprise 6 billion base pairs,” Quackenbush continues, “a number equivalent to all the seconds in 190 years.” But knowledge of what all the genes encoded in the genome do and how they interact to influence health and disease remains woefully incomplete. To discover that, scientists will have to take genomic data and “put it in the context of your health. And we’ll have to take you and put you in the context of the population in which you live, the environmental factors you are exposed to, and the people you come in contact with—so as we look at the vast amount of data we can generate on you, the only way we can effectively interpret it is to put it in the context of the vast amount of data we can generate on almost everything related to you, your environment, and your health. We are moving from a big data problem to a really big data problem.”

Curtis Huttenhower, an HSPH associate professor of computational biology and bioinformatics, is one of Quackenbush’s really big data collaborators. He studies the function of the human microbiome, the bacteria that live in and on humans, principally in the gut, helping people extract energy from food and maintaining health. “There are 100 times more genes in the bugs than in a human’s genome,” he reports, and “it’s not unusual for someone to share 50 percent or less of their microbes with other people. Because no one has precisely the same combination of gut bacteria, researchers are still learning how those bacteria distinguish us from each other; meanwhile both human and microbial genetic privacy must be maintained.” Not only do microbiome studies confront 100 times more information per human subject than genome studies, that 100 is different from person to person and changes slowly over time with age, and rapidly as well in response to factors like diet or antibiotics. Deep sequencing of 100 people during the Human Microbiome Project, Huttenhower reports, yielded a thousand human genomes’ worth of sequencing data, “and we could have gotten more. But there is still no comprehensive catalog of what affects the microbiome,” says Huttenhower. “We are still learning.”

Recently, he has been studying microbes in the built environment: from the hangstraps of Boston’s transit system to touchscreen machines and human skin. The Sloan Foundation, which funded the project, wants to know what microbes are there and how they got there. Huttenhower is interested in the dynamics of how entire communities of bugs are transferred from one person to another and at what speed. “Everyone tends to have a slightly different version of Helicobacter pylori, a bacterium that can cause gastric cancer and is transmitted vertically from parents to children,” he says. “But what other portions of the microbiome are mostly inherited, rather than acquired from our surroundings? We don’t know yet.” As researchers learn more about how the human genome and the microbiome interact, it might become possible to administer probiotics or more targeted antibiotics to treat or prevent disease. That would represent a tremendous advance in clinical practice because right now, when someone takes a broad-spectrum antibiotic, it is “like setting off a nuke,” says Huttenhower. “They instantly change the shape of the microbiome for a few weeks to months.” Exactly how the microbiome recovers is not known.

A major question in microbiome studies involves the dynamics of coevolution: how the bugs evolved in humans over hundreds of thousands of years, and whether changes in the microbiome might be linked to ailments that have become more prevalent recently, such as irritable bowel disease, allergies, and metabolic syndrome (a precursor to diabetes). Because of the timescale of the change in the patterns of these ailments, the causes can’t be genomic, says Huttenhower. “They could be environmental, but the timescale is also right for the kinds of ecological changes that would be needed in microbial communities,” which can change on scales ranging from days to decades.

“Just think about the number of things that have changed in the past 50 years that affect microbes,” he continues. “Commercial antibiotics didn’t exist until about 50 years ago; our locations have changed; and over a longer period, we have gone from 75 percent of the population working in agriculture to 2 percent; our exposure to animals has changed; our exposure to the environment; our use of agricultural antibiotics has changed; what we eat has changed; the availability of drugs has changed. There are so many things that are different over that timescale that would specifically affect microbes. That is why there is some weight given to the microbiome link to the hygiene hypothesis” (the theory that lack of early childhood exposure to a diverse microbiota has led to widespread problems in the establishment of healthy immune systems).

Understanding the links between all these effects will involve data analysis that will dwarf the Human Genome Project and become the work of decades. Like Gary King, Huttenhower favors a good algorithm over a big computer when tackling such problems. “We prefer to build models or methods that are efficient enough to run on a[n entry-level] server. But even when you are efficient, when you scale up to populations of hundreds, thousands, or tens of thousands of people,” massive computational capability is needed.

Recently, having realized that large populations of people will need to be studied to advance microbiome science, Huttenhower has begun exploring how to deploy and run his models on Amazon’s cloud, thousands of linked computers running in parallel. Amazon has teamed with the National Institutes of Health to donate server time for such studies. Says Huttenhower, “It’s an important way for getting manageable big data democratized throughout the research community.”

Discerning Patterns in Complexity

MAKING SENSE of the relationships between distinct kinds of information is another challenge facing researchers. What insights can be gleaned from connecting gene sequences, health records, and environmental influences? And how can humans understand the results?

One of the most powerful tools for facilitating understanding of vast datasets is visualization. Hanspeter Pfister, Wang professor of computer science and director of the Institute for Applied Computational Science, works with scientists in genomics and systems biology to help them visualize what are called high-dimensional data sets (with multiple categories of data being compared). For example, members of his group have created a visualization for use by oncologists that connects gene sequence and activation data with cancer types and stages, treatments, and clinical outcomes. That allows the data to be viewed in a way that shows which particular gene expression pattern is associated with high mortality regardless of cancer type, for example, giving an important, actionable insight for how to devise new treatments.

Pfister teaches students how to turn data into visualizations in Computer Science 109, “Data Science,” which he co-teaches with Joseph K. Blitzstein, professor of the practice in statistics. “It is very important to make sure that what we will be presenting to the user is understandable, which means we cannot show it all,” says Pfister. “Aggregation, filtering, and clustering are hugely important for us to reduce the data so that it makes sense for a person to look at.” This is a different method of scientific inquiry that ultimately aims to create systems that let humans combine what they are good at—asking the right questions and interpreting the results—with what machines are good at: computation, analysis, and statistics using large datasets. Student projects have run the gamut from the evolution of the American presidency and the distribution of tweets for competitive product analysis, to predicting the stock market and analyzing the performance of NHL hockey teams.

Pfister’s advanced students and postdoctoral fellows work with scientists who lack the data science skills they now need to conduct their research. “Every collaboration pushes us into some new, unknown territory in computer science,” he says.

The flip side of Pfister’s work in creating visualizations is the automated analysis of images. For example, he works with Knowles professor of molecular and cellular biology Jeff Lichtman, who is also Ramon y Cajal professor of arts and sciences, to reconstruct and visualize neural connections in the brain. Lichtman and his team slice brain tissue very thinly, providing Pfister’s group with stacks of high-resolution images. “Our system then automatically identifies individual cells and labels them consistently,” such that each neuron can be traced through a three-dimensional stack of images, Pfister reports. Even working with only a few hundred neurons involves tens of thousands of connections. One cubic millimeter of mouse brain represents a thousand terabytes (a petabyte) of image data.

Pfister has also worked with radioastronomers. The head teaching fellow in his data science course, astronomer Chris Beaumont, has developed software (Glue) for linking and visualizing large telescope datasets. Beaumont’s former doctoral adviser (for whom he now works as senior software developer on Glue), professor of astronomy Alyssa Goodman, teaches her own course in visualization (Empirical and Mathematical Reasoning 19, “The Art of Numbers”). Goodman uses visualization as an exploratory technique in her efforts to understand interstellar gas—the stuff of which stars are born. “The data volume is not a concern,” she says; even though a big telescope can capture a petabyte of data in a night, astronomers have a long history of dealing with large quantities of data. The trick, she says, is making sense of it all. Data visualizations can lead to new insights, she says, because “humans are much better at pattern recognition” than computers. In a recent presentation, she showed how a three-dimensional visualization of a cloud of gas in interstellar space had led to the discovery of a previously unknown cloud structure. She will often work by moving from a visualization back to math, and then back to another visualization.

Many of the visualization tools that have been created for medical imaging and analysis can be adapted for use in astronomy, she says. A former undergraduate advisee of Goodman’s, Michelle Borkin ’06, now a doctoral candidate in SEAS (Goodman and Pfister are her co-advisers), has explored cross-disciplinary uses of data-visualization techniques, and conducted usability studies of these visualizations. In a particularly successful example, she showed how different ways of displaying blood-flow could dramatically change a cardiac physician’s ability to diagnose heart disease. Collaborating with doctors and simulators in a project to model blood flow called “Multiscale Hemodynamics,” Borkin first tested a color-coded visual representation of blood flow in branched arteries built from billions of blood cells and millions of fluid points. Physicians were able to locate and successfully diagnose arterial blockages only 39 percent of the time. Using Borkin’s novel visualization—akin to a linear side-view of the patient’s arteries—improved the rate of successful diagnosis to 62 percent. Then, simply by changing the colors based on an understanding of the way the human visual cortex works, Borkin found she could raise the rate of successful diagnosis to 91 percent.

Visualization tools even have application in the study of collections, says Pfister. Professor of Romance languages and literatures Jeffrey Schnapp, faculty director of Harvard’s metaLAB, is currently at work on a system for translating collections metadata into readily comprehensible, information-rich visualizations. Starting with a dataset of 17,000 photographs, trivial by big data standards, from the missing paintings of the Italian Renaissance collection assembled by Bernard Berenson (works that were photographed but have subsequently disappeared), Schnapp and colleagues have created a way to explore the collection by means of the existing descriptions of objects, classifications, provenance data, media, materials, and subject tags.

The traditional use of such inventory data was to locate and track individual objects, he continues. “We are instead creating a platform that you can use to make arguments, and to study collections as aggregates from multiple angles. I can’t look at everything in the Fogg Museum’s collections even if I am Tom Lentz [Cabot director of the Harvard Art Museums], because there are 250,000 objects. Even if I could assemble them all in a single room,” Schnapp says, “I couldn’t possibly see them all.” But with a well-structured dataset, “We can tell stories: about place, time, distribution of media, shifting themes through history and on and on.” In the case of the Berenson photo collection, one might ask, “What sorts of stories does the collection tell us about the market for Renaissance paintings during Berenson’s lifetime? Where are the originals now? Do they still exist? Who took the photographs and why? How did the photo formats evolve with progress in photographic techniques?”

This type of little “big data” project makes the incomprehensible navigable and potentially understandable. “Finding imaginative, innovative solutions for creating qualitative experiences of collections is the key to making them count,” Schnapp says. Millions of photographs in the collections of institutions such as the Smithsonian, for example, will probably never be catalogued, even though they represent the richest, most complete record of life in America. It might take an archivist half a day just to research a single one, Schnapp points out. But the photographs are being digitized, and as they come on line, ordinary citizens with local information and experience can contribute to making them intelligible in ways that add value to the collection as an aggregate. The Berenson photographs are mostly of secondary works of art, and therefore not necessarily as interesting individually as they are as a collection. They perhaps tell stories about how works were produced in studios, or how they circulated. Visualizations of the collection grouped by subject are telling, if not surprising. Jesus represents the largest portion, then Mary, and so on down to tiny outliers, such as a portrait of a woman holding a book, that raise rich questions for the humanities, even though a computer scientist might regard them as problems to fix. “We’re on the culture side of the divide,” Schnapp says, “so we sometimes view big data from a slightly different angle, in that we are interested in the ability to zoom between the micro level of analysis (an individual object), the macro level (a collection), and the massively macro (multiple collections) to see what new knowledge allows you to expose, and the stories it lets you tell.”

• • •

DATA, IN THE FINAL ANALYSIS, are evidence. The forward edge of science, whether it drives a business or marketing decision, provides an insight into Renaissance painting, or leads to a medical breakthrough, is increasingly being driven by quantities of information that humans can understand only with the help of math and machines. Those who possess the skills to parse this ever-growing trove of information sense that they are making history in many realms of inquiry. “The data themselves, unless they are actionable, aren’t relevant or interesting,” is Nathan Eagle’s view. “What is interesting,” he says, “is what we can now do with them to make people’s lives better.” John Quackenbush says simply: “From Kepler using Tycho Brahe’s data to build a heliocentric model of the solar system, to the birth of statistical quantum mechanics, to Darwin’s theory of evolution, to the modern theory of the gene, every major scientific revolution has been driven by one thing, and that is data.”


Key Considerations for Converting Legacy ETL to Modern ETL

Recently, there has been a surge in customers who want to move away from legacy data integration platforms and adopt Talend as their one-stop shop for all their integration needs. Some of these organizations have thousands of legacy ETL jobs to convert to Talend before they are fully operational. The big question on everyone’s mind is how to get past this hurdle.

Defining Your Conversion Strategy

To begin with, every organization undergoing such a change needs to focus on three key aspects:

  1. Will the source and/or target systems change? Is this just an ETL conversion from their legacy system to modern ETL like Talend?
  2. Is the goal to re-platform as well? Will the target system change?
  3. Will the new platform reside on the cloud or continue to be on-premise?

This is where Talend’s Strategic Services can help carve a successful conversion strategy and implementation roadmap for our customers. In the first part of this three-blog series, I will focus on the first aspect of conversion.

Before we dig into it, it’s worthwhile to note a very important point – the architecture of the product itself. Talend is a Java code generator, and unlike its competitors (where the code is migrated from one environment to the other), Talend actually builds the code and migrates the built binaries from one environment to the other. In many organizations, it takes a few sprints to fully absorb this fact, as the architects and developers are used to the old way of thinking about code migration.

The upside of this architecture is that it helps in enabling a continuous integration environment that was not possible with legacy tools. A complete architecture of Talend’s platform includes not only the product itself but also third-party products such as Jenkins, Nexus (an artifact repository) and a source control repository like Git. Compare this to a Java programming environment and you can clearly see the similarities. In short, it is extremely important to understand that Talend works differently, and that’s what sets it apart from the rest of the crowd.

Where Should You Get Started?

Let’s focus on the first aspect, conversion. Assuming that nothing changes except the ETL jobs that integrate, cleanse, transform and load the data, this is a good opportunity to leverage a conversion tool: something that ingests legacy code and generates Talend code. It is not a good idea to replicate the entire business logic of all ETL jobs manually, as there is a great risk of introducing errors, leading to prolonged QA cycles. However, as anyone with a sound technical background will recognize, it is also not a good idea to rely completely on the automated conversion process, since the comparison may not always be apples to apples. The right approach is to use automated conversion as an accelerator, with some manual intervention.

Bright minds bring in success. Keeping that mantra in mind, first build your team:

  • Core Team – Identify architects, senior developers and SMEs (data analysts, business analysts, people who live and breathe data in your organization)
  • Talend Experts – Bring in experts of the tool so that they can guide you and provide best practices and solutions for all your conversion-related efforts. They will also participate in performance tuning activities
  • Conversion Team – A team that leverages a conversion tool to automate the conversion process. A solid team with a solid tool and open to enhancing the tool along the way to automate new designs and specifications
  • QA Team – Seasoned QA professionals that help you breeze through your QA testing activities

Now comes the approach – Follow this approach for each sprint:


  1. Analyze the ETL jobs and categorize them by complexity, based on functionality and components used. Some good conversion tools provide analyzers that can help you determine the complexity of the jobs to be converted. Spread a healthy mix of jobs of varying complexity across each sprint.
  2. Leverage a conversion tool to automate the conversion of the jobs. Certain functionalities, such as an “unconnected lookup”, can be achieved through an innovative method in Talend. Seasoned conversion tools will help automate such functionalities.
  3. Focus on job design and performance tuning. This is your chance to revisit design, if required, either to leverage better component(s) or to go for a complete redesign. Also focus on performance optimization. For high-volume jobs, you could increase throughput and performance by leveraging Talend’s big data components. It is not uncommon for us to end up completely redesigning a converted Talend Data Integration job as a Talend Big Data job to drastically improve performance. Another feather in our cap: you can seamlessly execute standard data integration jobs alongside big data jobs.
  4. Unit test and ensure all functionality and performance acceptance criteria are satisfied before handing over the job to QA.
  5. Use an automated QA approach to compare the result sets produced by the old set of ETL jobs and the new ETL jobs. At the least, focus on:

  • Compare row counts from the old process to that of the new one
  • Compare each data element loaded by the old process to that of the new one
  • Verify “upsert” and “delete” logic work as expected
  • Introduce an element of regression testing to ensure fixes are not breaking other functionalities
  • Performance testing to ensure SLAs are met
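
The first two checks in the list above (row counts and element-by-element comparison) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: both result sets have already been extracted into lists of dictionaries, each row carries a unique key column (here a hypothetical "id"), and the function name compare_result_sets and the sample rows are invented for the example. It is not part of Talend or of any conversion tool.

```python
from typing import Any, Dict, List


def compare_result_sets(
    legacy_rows: List[Dict[str, Any]],
    converted_rows: List[Dict[str, Any]],
    key: str,
) -> Dict[str, Any]:
    """Compare the output of a legacy job with that of its converted job.

    Assumes `key` uniquely identifies a row in each result set.
    """
    legacy_by_key = {row[key]: row for row in legacy_rows}
    converted_by_key = {row[key]: row for row in converted_rows}

    report: Dict[str, Any] = {
        # Check 1: row counts from the old process versus the new one.
        "row_count_legacy": len(legacy_rows),
        "row_count_converted": len(converted_rows),
        # Rows present on one side only.
        "missing_in_converted": sorted(legacy_by_key.keys() - converted_by_key.keys()),
        "unexpected_in_converted": sorted(converted_by_key.keys() - legacy_by_key.keys()),
        "mismatched_fields": [],
    }

    # Check 2: compare every data element for rows present on both sides.
    for row_key in sorted(legacy_by_key.keys() & converted_by_key.keys()):
        old, new = legacy_by_key[row_key], converted_by_key[row_key]
        for column in sorted(set(old) | set(new)):
            if old.get(column) != new.get(column):
                report["mismatched_fields"].append(
                    (row_key, column, old.get(column), new.get(column))
                )
    return report


# Hypothetical sample data standing in for the two load results.
legacy = [
    {"id": 1, "name": "Ada", "balance": 100},
    {"id": 2, "name": "Grace", "balance": 250},
    {"id": 3, "name": "Alan", "balance": 75},
]
converted = [
    {"id": 1, "name": "Ada", "balance": 100},
    {"id": 2, "name": "Grace", "balance": 205},
    {"id": 4, "name": "Edsger", "balance": 10},
]

report = compare_result_sets(legacy, converted, key="id")
for check, result in report.items():
    print(f"{check}: {result}")
```

Note that in this sample the row counts match even though the data does not, which is why a row-count check alone is not enough. In practice the comparison would more likely run in the database or in a dedicated test framework, but the shape of the checks stays the same.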

Now, for several reasons, there can be instances where one would need to design a completely new ETL process for a certain functionality in order to continue processing data in the same way as before. In such situations, lean on the “Talend Experts” team, which liaises with the team doing the automated conversion and works closely with the core team to propose the best solution. That solution is then converted to a template and handed to the conversion team, who can automate the new design into the affected jobs.

As you can see, these activities can be part of the “Categorize” and “Convert” phases of the approach.

Finally, I would suggest chunking the conversion effort into logical waves. Do not go for a big bang approach since the conversion effort could be a lengthy one depending on the number of legacy ETL jobs in an organization.
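
One simple way to combine the advice on waves with the earlier advice to mix complexities is to deal the categorized jobs out round-robin. The sketch below is a hypothetical illustration, not a Talend feature: the job names, the three complexity labels and the plan_waves function are all assumptions made for the example.

```python
from typing import Dict, List

# Assumed complexity labels, ordered so the hardest jobs are dealt out first.
COMPLEXITY_ORDER = {"high": 0, "medium": 1, "low": 2}


def plan_waves(jobs: Dict[str, str], num_waves: int) -> List[List[str]]:
    """Distribute jobs (name -> complexity label) across `num_waves` waves
    so that each wave receives a mix of complexities."""
    if num_waves < 1:
        raise ValueError("num_waves must be at least 1")

    # Sort by complexity, then by name, so the plan is deterministic.
    ordered = sorted(jobs, key=lambda name: (COMPLEXITY_ORDER[jobs[name]], name))

    waves: List[List[str]] = [[] for _ in range(num_waves)]
    for index, name in enumerate(ordered):
        # Round-robin: consecutive jobs of similar complexity land in different waves.
        waves[index % num_waves].append(name)
    return waves


# Hypothetical legacy jobs, already categorized by an analyzer or by hand.
legacy_jobs = {
    "load_customers": "high",
    "load_orders": "high",
    "dedupe_addresses": "medium",
    "sync_products": "medium",
    "archive_logs": "low",
    "refresh_dim_date": "low",
}

waves = plan_waves(legacy_jobs, num_waves=2)
for number, wave in enumerate(waves, start=1):
    print(f"Wave {number}: {wave}")
```

A real plan would also have to respect dependencies between jobs and business priorities, which this sketch deliberately ignores.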


This brings me to the end of the first part of the three-blog series. Below are the five key takeaways of this blog:

  1. Define scope and spread the conversion effort across multiple waves
  2. Identify core team, Talend experts, a solid conversion team leveraging a solid conversion tool and seasoned QA professionals
  3. Follow an iterative approach for the conversion effort
  4. Explore Talend’s big data capabilities to enhance performance
  5. Innovate new functionalities, create templates and automate the conversion of these functionalities

Stay tuned for the next two!!

The post Key Considerations for Converting Legacy ETL to Modern ETL appeared first on Talend Real-Time Open Source Data Integration Software.
