What Are the 3 Critical Keys to Healthcare Big Data Analytics?


Healthcare big data analytics isn’t just a “use it or lose it” proposition for the provider community – it’s quickly becoming a “use it if you want to hold on to anything at all” situation for organizations that must invest in population health management, clinical analytics, and risk stratification if they are to succeed in a value-based reimbursement world.

Maintaining market share during this shift away from the simpler cash transactions of a fee-for-service environment requires organizations to take a proactive dive into their financial and clinical data, yet developing the technological and organizational competencies to take advantage of big data tools is just as complex as it sounds.


Despite the vital importance of using big data to describe, predict, and prevent costly events in large patient populations, providers of all types and sizes are struggling to collect, categorize, store, retrieve, and analyze their data assets.

In a recent industry poll, Stoltenberg Consulting found that big data confuses half of providers, and six percent of participants were too intimidated by the process to even consider starting a healthcare big data analytics program.

Why is big data analytics such a difficult topic to tackle for healthcare organizations?  How do successful providers begin the process?  In this article, healthcare stakeholders weigh in on the three most important foundational steps for beginning a big data analytics and population health management program.

Define a direction and outline specific goals

Big data may be nearly infinite in scope, but having data for the sake of having data will not help achieve measurable organizational objectives.  Healthcare providers must start off their big data journey by defining clear, bite-sized problems that need solving.  Often, these problems are the “low hanging fruit” of healthcare operations: preventable readmissions, emergency department overuse, chronic disease management, patient engagement, and primary care screening rates.

“I would highly encourage people to start around specific high-value use cases,” suggested Marc Perlman, Global Vice President of Healthcare and Life Sciences at Oracle during a 2014 interview.  “I think the average hospital or health system has over 400 different interfaces and integration points, but it may take seven of them working together to give you some value out of your data.  Hospitals and health systems need to think about what they’re trying to fix, and then based upon that, what data sources they need.”

“We think it makes a lot of sense to think about what they are going to try to accomplish, what the methods are that they’re trying to drive, and how they are going to focus on sustainability,” he continued.  “I would say the most important thing is have a vision of your solution, what you’re trying to fix, and know where you’re going.  And then the technology will follow.”

Successful healthcare organizations often flag financial pain points as the gateway into big data analytics.  A March 2015 survey found that 59 percent of hospitals identified higher-than-necessary costs of care as one of their top motivating factors for implementing big data analytics.  Organizations are also seeking the clinical and financial insights necessary to engage in pay-for-performance reimbursement structures and combat the pressures of accountable care.

To do this, organizations implemented clinical analytics technologies that integrate EHR data and patient outcomes into a rich portrait of a patient’s journey through the care continuum.  More than 60 percent of organizations that took this approach have been able to improve their 30-day preventable readmissions rates and cut their mortality rates.

“Start out slow and have realistic goals,” says Dr. Robert M. Fishman, DO, FACP, who has helped to lead Valley Health Partners to the highest level of patient-centered medical home (PCMH) recognition thanks to investments in health IT and a fundamental attitude shift within the physician hospital organization (PHO).  “And when you achieve those goals, set out new goals.”

“We set up a very modest program where we would pick a couple of diagnoses in internal medicine and a couple of diagnoses in pediatrics and we would begin to set up policies and think in a more patient-centric way to reach certain goals,” he said.

“We concentrated on congestive heart failure and COPD, because those two conditions produce a lot of readmissions and emergency department visits.  There’s a lot of expense.  If we could really get a handle on those conditions, we could improve care, improve outcomes and decrease costs.  Three things that we’re all very interested in.”

By focusing on specific use cases for big data analytics, these organizations have shown that small initial investments in big data can justify larger ones in the future.

Measure twice, cut once to plan a health IT infrastructure

After outlining a strategy, healthcare organizations can start making investments in advanced health IT products to support their efforts.  While the vast majority of providers already have a basic EHR infrastructure in place – and EHRs are increasingly coming packaged with population health management tools that meet many big data needs – sometimes more sophisticated products are required to get the ball rolling.

But crafting an interoperable health IT infrastructure with a high degree of flexibility and usability is a difficult ask.  Stakeholders are only focusing on the desperate need for interoperability between data systems because it is so rare to find an ecosystem of well-integrated vendor products that communicate freely with one another.

Many organizations continue to be challenged by their convoluted legacy systems, and most providers cannot afford to rip out and replace an entire infrastructure.  For those lucky few starting from scratch, 2015 is a good time to get into the big data analytics game.

As vendors focus on developing plug-and-play technologies and health information exchanges take on more mature roles in the local provider community, crafting an interoperable health IT system is easier than ever – if the right rules are on the table from the beginning.

“When it comes to doing big data analytics, you have to have two things right off the bat,” stated Richard C. Howe, PhD, FHIMSS, Executive Director of the North Texas Regional Extension Center, which also functions as the Dallas-Fort Worth region’s primary HIE.

“The first is strong governance. You have to get all the participants in the same room so they can determine what data they are going to contribute.  Get the absolute governance structure outlined at the beginning so you know what direction you’re trying to go.”

“The second thing is to start simple.  If you start off with thousands of different data elements, you’re just going to drown in the data before you see any results.  We started with claims data, and we found that there is a lot of really good information there that has been valuable to our hospital members even before we started adding more clinical information.  I would say good governance has to start simple.”

An organization’s data governance plan can make or break a big data analytics project: the old “garbage in, garbage out” rule will always apply.  Even the most robust tools require their users to understand their potential and their limitations, both of which rely on the quality of data moving through the system.

Understanding the scope of health IT tools, the data standards they are built upon, and the data integrity requirements of leveraging health IT for actionable insights will help organizations pick the best products for their needs in the long term, not just in the next six months to a year.

“When it comes to tools, providers should consider products that can give them the functionality that they need to answer the questions they’re asking today, but can also grow with them to continue answering those questions three years and five years down the road,” said Shane Pilcher, Vice President at Stoltenberg Consulting.

“They have to be thinking three, five and maybe even ten years down the road in terms of what they anticipate their questions are going to be, so that they know they’re on the right track with collecting the data that’s going to answer them.

“Even from the EHR perspective, this is where that long-term plan comes into play.  They know the type of questions that they’re looking for today,” he added.  “They need to anticipate the type of questions they’re going to be asking in the future.  But in most cases, you don’t know what you don’t know, so you’ve got to be as creative, as imaginative as you can today when you’re setting up your roadmap.  That’s going to give you the information that you need to start defining what needs to be collected today in the EHR and what you need to grow.”

A far-sighted approach to big data analytics may help organizations avoid or mitigate some of the interoperability problems that have plagued the industry for so long.  Investing in products that encourage health information exchange through standardized data elements will make analytics easier and ensure that organizations are set up for meeting ongoing mandates such as meaningful use.

Ensure support from executives and buy-in from clinical staff

No healthcare big data analytics plan can succeed without enthusiasm and support from all levels of the organization.  The board room must provide the funding and the direction; the clinical end-users must understand and embrace new technologies and new workflows.  Big data analytics isn’t just an IT project, but an organizational transformation from top to bottom.

Starting small, measuring results, and demonstrating improvement is often the key to securing executive buy-in, Pilcher says.  “Once you start picking up traction, you can start to identify that low-hanging fruit that can lead to cost savings and improve patient care.  These are cost savings that go directly to the bottom line of the organization and also show return on investment.  By being able to show ROI, the administration may be more inclined to invest more time, more labor, and more capital to further enhance the program and go after bigger and greater fruits.”

And executive leaders may not take too much convincing these days.  Despite the fact that more than a third of providers feel that a lack of leadership is a major barrier to big data analytics success, executives largely recognize the critical role that data competency will play in the immediate future.

Eighty-nine percent of hospital executives participating in a recent PwC poll are taking action to become more innovative and nimble through big data analytics adoption, and 95 percent are seeking to harness the potential of analytics technologies to extract actionable insights from their big data.

During the HIMSS15 Leadership Survey, three-quarters of organizations agreed with the notion that health IT is vital for achieving strategic goals and improvements in patient care.  Over half believe that health IT has helped them improve their population health management programs.   More than four in ten respondents think their executive leaders have a “fairly sophisticated understanding” of big data analytics technologies and the need to leverage them.

C-suite leaders are among the most likely to express an intention to purchase data analytics tools, with Chief Information Officers and Chief Medical Information Officers being the most eager to invest in new health IT products.  Even Chief Financial Officers are recognizing the fundamental need for analytics infrastructure to cut costs, raise revenues, and utilize resources more appropriately.

Getting clinicians to understand why their workflows are suddenly changing can be even more complicated than securing funds to purchase new tools, however.  It is important for providers to develop a multi-disciplinary team for big data analytics: one that includes representatives from all areas of the organization.

Clinical champions can help to explain to their peers why certain tasks are changing, why new metrics may be pointing out flaws in the patient care process, and why it is important to adapt to an evolving health IT landscape.  Above all, both executive leaders and staff-level super users must be able to point to clear and immediate benefits when introducing a new tool, or risk rebellion among dissatisfied clinicians.

“When we are talking about big data, I think there needs to be a clear purpose,” says Tina Esposito, Vice President of the Center for Health Information Services at Advocate Health Care.  “There has to be a core need or a well-defined problem that you are trying to solve.”

“Big data is a means to an end for solving problems.  So you got to be very clear that you are not pulling this data together just to put it together. There has got to be a focused effort from the right people to leverage that information so that ultimately you are supporting the business and your population health goals.”

“You need to be sure that what you are creating is usable in the most efficient and easiest way, and that it makes a positive impact on clinicians,” she said. “Is the clinician leveraging that intelligence that you are providing as part of their workflow in the EHR?  Are they seeing a benefit from it?  That’s going to be the most important piece of any big data project.”

Note: This article originally appeared in Health IT Analytics.


April 17, 2017 Health and Biotech analytics news roundup

Introducing Verily Study Watch: The device has multiple sensors, a long battery life, and a large capacity for storage. It is not available to the public but is currently being used in clinical studies.

Sansoro Health raises $5.2 million led by Bain Capital Ventures: The company’s product allows a link between customers and electronic health records.

Hospital cuts costly falls by 39% due to predictive analytics: The system, at El Camino Hospital in California, flags high-risk patients at admission, then continually updates the risk throughout their stay.

Winning with analytics in the pharmaceutical industry: The industry can use analytics to improve efficiency and reduce costs across the business.

Why HIT tools can help organizations navigate the challenges of growth: Most health systems have implemented electronic health records. Along with improving care, these records can help distribute patients throughout a large system and better administer transfers.


Movie Recommendations? How Does Netflix Do It? A 9 Step Coding & Intuitive Guide Into Collaborative Filtering

‘Movies recommended for you’ – Netflix
‘Videos recommended for you’ – YouTube
‘Restaurants recommended for you’ – Some smart restaurant finder app

Notice a trend? Your favorite apps ‘know’ you (or at least they think they do). They gradually learn your preferences over time (or in a matter of hours) and suggest new products which they think you’ll love.

How is this done? I can’t speak for how Netflix actually makes movie recommendations, but the fundamentals are largely intuitive, actually.

If you keep ‘five-starring’ Stoner Comedy movies like the whole ‘Harold and Kumar’ series on Netflix, it makes sense for Netflix to assume that you may also enjoy ‘Ted’, or any other Stoner Comedy film on Netflix.

To make recommendations in a real world application, let’s take our intuition and apply it to a machine learning algorithm called Collaborative Filtering.

The following guide will be done in the ‘Octave’ programming language, so we can properly understand what is going on under the hood of collaborative filtering. Let’s get started.

Step 1 – Initialize The Movie Ratings

Simple but scalable scenario

  • 10 movies
  • 5 users
  • 3 features (we’ll discuss this in Step 3)

Here is an example diagram of movie ratings. Our rating system is from 1-10:


Let’s initialize a 10 X 5 matrix called ‘ratings’; this matrix holds all the ratings given by all users, for all movies. Note: Not all users may have rated all movies, and this is okay.

Note 2: I simply made up some data for ‘ratings’. The point of this step is to simply start off with a dataset that we can work with.

This matrix below contains the same ratings data you saw in the picture above. Here is how we declare it in Octave:

ratings = [
8 4 0 0 4;
0 0 8 10 4;
8 10 0 0 6;
10 10 8 10 10;
0 0 0 0 0;
2 0 4 0 6;
8 6 4 0 0;
0 0 6 4 0;
0 6 0 4 10;
0 4 6 8 8];

Learner’s check:

  • Each column represents all the movies rated by a single user
  • Each row represents all the ratings (from different users) received by a single movie


Recall that our rating system is from 1-10. Notice how there are 0’s to denote that no rating has been given.

Step 2 – Determine Whether a User Rated a Movie

To make our life easier, let’s also declare a binary matrix (0’s and 1’s) to denote whether a user rated a movie.

1 = the user rated the movie.
0 = the user did not rate the movie.

Let’s call this matrix ‘did_rate’. Note it has the same dimensions as ‘ratings’, 10 X 5:

did_rate = ratings ~= 0;

The above command should give you the following binary matrix:

did_rate = 
1 1 0 0 1
0 0 1 1 1
1 1 0 0 1
1 1 1 1 1
0 0 0 0 0
1 0 1 0 1
1 1 1 0 0
0 0 1 1 0
0 1 0 1 1
0 1 1 1 1

Learner’s check:

  • did_rate(2, 3) = 1: This means the 3rd user did rate the 2nd movie
  • did_rate(6, 4) = 0: This means the 4th user did not rate the 6th movie
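
Because ‘did_rate’ only holds 0’s and 1’s, summing it is a quick sanity check on the data. The sketch below rebuilds the Step 1 matrices and counts ratings per movie and per user. The names ‘ratings_per_movie’ and ‘ratings_per_user’ are my own additions for illustration; they are not used elsewhere in this guide.

```octave
% Same made-up data as Step 1 (10 movies x 5 users, 0 = not rated)
ratings = [
8 4 0 0 4;
0 0 8 10 4;
8 10 0 0 6;
10 10 8 10 10;
0 0 0 0 0;
2 0 4 0 6;
8 6 4 0 0;
0 0 6 4 0;
0 6 0 4 10;
0 4 6 8 8];
did_rate = ratings ~= 0;

% Sum across each row (dimension 2): ratings received by each movie
ratings_per_movie = sum(did_rate, 2);
% Sum down each column (dimension 1): ratings given by each user
ratings_per_user = sum(did_rate, 1);

disp(ratings_per_movie');
disp(ratings_per_user);
```

The fifth movie comes out with a count of 0: nobody has rated it yet, which is worth keeping in mind when we start averaging ratings in Step 5.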

Step 3 – User Preferences and Movie Features/Characteristics

This is where it gets interesting. In order for us to build a robust recommendation engine, we need to know user preferences and movie features (characteristics). After all, a good recommendation is based off of knowing this key user and movie information.

For example, a user preference could be how much the user likes comedy movies, on a scale of 1-5. A movie characteristic could be to what degree is the movie considered a comedy, on a scale of 0-1.

Example 1: User preferences -> Sample preferences for a single user Chelsea


Example 2: Movie features -> Sample features for a single movie Bad Boys


Note: The user preferences are the exact same as the movie features; in other words, we can map each user preference to a movie feature. This makes sense; if a user has a huge preference for a comedy, we’d like to recommend a movie with a high degree of comedy. If we add a new preference for the user, for ‘romantic-comedy’, we should also add this as a new feature for a movie, so that our recommendation algorithm can fully use this feature/preference when making a prediction.

Note 2: We can use these numbers that I purposely came up with to ‘predict’ ratings for movies. For example, let’s predict what Chelsea would rate Bad Boys, below:

Chelsea's (C) rating (R) of Bad Boys (BB): R_{C,BB} = comedy feature product + action feature product + romance feature product
R_{C,BB} = (4.5 * 0.8) + (4.9 * 0.5) + (3.6 * 0.4)
R_{C,BB} = 7.49
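
The calculation above is a dot product between a preference vector and a feature vector, which Octave can do in one line. This is only a sketch: the variable names ‘chelsea_prefs’ and ‘bad_boys_features’ are mine, and the feature order (comedy, action, romance) is assumed from the formula.

```octave
% Feature order: comedy, action, romance (values from the worked example)
chelsea_prefs = [4.5 4.9 3.6];        % 1 x 3 row vector of user preferences
bad_boys_features = [0.8; 0.5; 0.4];  % 3 x 1 column vector of movie features

% Row vector times column vector = sum of the pairwise products
predicted_rating = chelsea_prefs * bad_boys_features;
fprintf('Predicted rating: %.2f\n', predicted_rating);
```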

5 big problems: This seems great, but:

  1. Who has time to sit down and come up with a list of features for users and movies?
  2. It would be very time consuming to come up with a value for each feature, for each and every user and movie.
  3. Why did I pick 1-5 as the range for user preferences and 0-1 as the range for movie features? It seems a bit forced.
  4. How does the product (multiplication) of user_prefs and movie_features magically give us a predicted rating?
  5. Why did I pick ‘comedy’, ‘romance’ and ‘action’ as the features? This seems manual and forced. There must be a better way to generate features

The solution: 

Before we dive deep into the collaborative filtering solution to answer our 5 big problems, let’s quickly introduce some key matrices that we’ll be needing.

The user features (preferences) can be represented by a matrix ‘user_prefs’. In our example, we have 5 users and 3 features. So, ‘user_prefs’ is a 5 X 3 matrix.

Here is an example diagram to help visualize the data ‘user_prefs’ contains:


The movie features can also be represented by a matrix ‘movie_features’. In our example, we have 10 movies and 3 features. So, ‘movie_features’ is a 10 X 3 matrix.

Here is an example diagram to help visualize the data ‘movie_features’ contains:


Step 4: Let’s Rate Some Movies

I have a list of 10 movies here, in a text file:

1 Harold and Kumar Escape From Guantanamo Bay (2008)
2 Ted (2012)
3 Straight Outta Compton (2015)
4 A Very Harold and Kumar Christmas (2011)
5 Notorious (2009)
6 Get Rich Or Die Tryin' (2005)
7 Frozen (2013)
8 Tangled (2010)
9 Cinderella (2015)
10 Toy Story 3 (2010)

Now, let’s rate some movies. Our ratings can be represented by a 10 X 1 column vector my_ratings. Let’s initialize it to 0’s and make some ratings:

my_ratings = zeros(10, 1);
my_ratings(1) = 7;
my_ratings(5) = 8;
my_ratings(8)= 3;

Learner’s check:

  • I gave Harold and Kumar Escape From Guantanamo Bay a 7
  • I gave Notorious an 8
  • I gave Tangled a 3

Let’s update ratings and did_rate with our ratings my_ratings:

ratings = [my_ratings ratings];
did_rate = [(my_ratings ~= 0) did_rate];

Learner’s check:

  • ‘ratings’ is now a 10 X 6 matrix
  • ‘did_rate’ is now a 10 X 6 matrix

Step 5: Mean Normalize All The Ratings

Once we get to Step 7: Minimize The Cost Function,  you may see why mean normalizing the ‘ratings‘ matrix is necessary.

What is mean normalization?

It is much easier to understand the ‘what’ if we understand the why. Why normalize the ‘ratings’ matrix?

Consider the following scenario:

A user (Christie) rated 0 movies. Our collaborative filtering algorithm that we are about to build will then go on to predict that Christie will rate all movies as 0. You may see why in the further steps when we cover the cost function and gradient descent. Don’t worry about it for now.

This is no good, because then we won’t be able to suggest Christie anything.  After all, a recommendation is simply based off of what movie(s) we predict the user to rate the highest.

So how do we recommend a movie to a user who has never placed a rating?

We simply suggest the highest average rated movie. That’s the best we can do, since we know nothing about the user. This is made possible because of mean normalization.
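
As a rough sketch of that fallback, we can average each movie’s ratings over only the users who rated it and pick the largest average. The matrix below is the 10 X 6 ‘ratings’ matrix from the end of Step 4. The names ‘num_ratings’, ‘movie_means’ and ‘best_movie’ are assumptions of mine, and max(num_ratings, 1) is just a guard I added against dividing by zero for a movie nobody has rated.

```octave
% Ratings after Step 4: column 1 is my_ratings, columns 2-6 are the other users
ratings = [
7 8 4 0 0 4;
0 0 0 8 10 4;
0 8 10 0 0 6;
0 10 10 8 10 10;
8 0 0 0 0 0;
0 2 0 4 0 6;
0 8 6 4 0 0;
3 0 0 6 4 0;
0 0 6 0 4 10;
0 0 4 6 8 8];
did_rate = ratings ~= 0;

% Average over rated entries only (unrated entries are 0, so they add nothing to the sum)
num_ratings = sum(did_rate, 2);
movie_means = sum(ratings, 2) ./ max(num_ratings, 1);

% The movie with the highest average is the fallback recommendation
[best_mean, best_movie] = max(movie_means);
fprintf('Fallback recommendation: movie %d (average %.2f)\n', best_movie, best_mean);
```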

What is mean normalization?

Mean normalization, in our case, is the process of making the average rating received by each movie equal to 0.

Take another look at the ‘ratings’ matrix from our Step 1 example:


Each row represents all the ratings received by one movie. Here’s how to normalize a matrix:

  1. Find the average of the 1st row, counting only the entries that were actually rated. In other words, find the average rating received by the first movie ‘Harold and Kumar Escape From Guantanamo Bay’
  2. Subtract this average from each rating (entry) in the 1st row, leaving the unrated entries at 0
  3. The first row has now been normalized. Its rated entries now have an average of 0.
  4. Repeat steps 1 & 2 for all rows.
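
The steps above can be sketched on a single row before looking at a full function. Here I use the first movie’s row from the 10 X 6 ‘ratings’ matrix built in Step 4; the names ‘row’, ‘rated’, ‘row_mean’ and ‘row_norm’ are only for illustration.

```octave
row = [7 8 4 0 0 4];          % ratings for movie 1 after Step 4 (0 = not rated)
rated = row ~= 0;             % logical mask of the entries that were rated

row_mean = mean(row(rated));  % average over the 4 rated entries only
row_norm = zeros(size(row));  % unrated entries stay at 0
row_norm(rated) = row(rated) - row_mean;

fprintf('mean = %.2f\n', row_mean);
disp(row_norm);
```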

Here is the implementation for mean normalization in Octave:

function [ratings_norm, ratings_mean] = normalizeRatings(ratings, did_rate)
  [m, n] = size(ratings);
  ratings_mean = zeros(m, 1);
  ratings_norm = zeros(size(ratings));
  for i = 1:m
    % indexes of the users who rated movie i
    idx = find(did_rate(i, :) == 1);

    % a movie with no ratings keeps a mean of 0 (mean of an empty set would be NaN)
    if isempty(idx)
      continue;
    end

    % only average over the ratings that were actually given
    ratings_mean(i) = mean(ratings(i, idx));
    ratings_norm(i, idx) = ratings(i, idx) - ratings_mean(i);
  end
end


We can call this function and capture its two outputs, overwriting ‘ratings’ with the normalized matrix:

[ratings, ratings_mean] = normalizeRatings(ratings, did_rate);

Learner’s check:

‘ratings’ contains the normalized ‘ratings’ matrix. Of course, it’s still a 10 X 6 matrix. Here it is below:

ratings =
1.25000 2.25000 -1.75000 0.00000 0.00000 -1.75000
0.00000 0.00000 0.00000 0.66667 2.66667 -3.33333
0.00000 0.00000 2.00000 0.00000 0.00000 -2.00000
0.00000 0.40000 0.40000 -1.60000 0.40000 0.40000
0.00000 0.00000 0.00000 0.00000 0.00000 0.00000
0.00000 -2.00000 0.00000 0.00000 0.00000 2.00000
0.00000 2.00000 0.00000 -2.00000 0.00000 0.00000
-1.33333 0.00000 0.00000 1.66667 -0.33333 0.00000
0.00000 0.00000 -0.66667 0.00000 -2.66667 3.33333
0.00000 0.00000 -2.50000 -0.50000 1.50000 1.50000

‘ratings_mean’ is a 10 X 1 column vector whose ith row contains the average rating of the ith movie. Here it is below:

ratings_mean =
5.75000
7.33333
8.00000
9.60000
8.00000
4.00000
6.00000
4.33333
6.66667
6.50000

Step 6: Collaborative Filtering via Linear Regression

If you are unfamiliar with how a linear regression works, these links should be helpful.

The simplest way to think about it is that we are simply fitting a line to (i.e., learning from) a scatter plot (in the case of a uni-linear regression):


In our case, we face a multi-linear regression problem. But don’t worry, we’ll briefly cover the intuition in a few seconds.

Helpful intuition: A user’s big preference for comedy movies (e.g. 4.5/5) paired with a high movie’s ‘level of comedy’ (e.g. 0.8/1) tends to be positively correlated with the user’s rating for that movie. For the most part, this correlation is continuous.

Conversely, a user’s hate for comedy (1/5), still paired with a high movie’s ‘level of comedy’ (i.e 0.8/1) tends to be negatively correlated with the user’s rating for that movie. This is another reason for mean normalization. If you notice in the ‘ratings’ matrix above, there are some negative ratings. These ratings are negative because they have been rated below average.

If you are familiar with a linear regression, you may know that the goal of a linear regression is to minimize the sum of squared errors (the squared differences between our predicted values and observed values, added together), in order to come up with the best learning algorithm for predicting new outputs, or in the case of a uni-linear regression, the best ‘line of best fit’.

Note: In our case, we face a multi-linear regression problem, since we have many more than 1 feature.

A linear regression is associated with some cost function; our goal is to minimize this cost function (Step 7), and thus minimize the sum of squared errors.
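
To make ‘minimizing the sum of squared errors’ concrete, here is a minimal uni-linear sketch on four made-up points. It uses Octave’s backslash operator, which solves the least-squares problem directly; that is a different route from the iterative minimization used later in this guide, and the data and variable names are assumptions of mine.

```octave
% Four made-up (x, y) points that lie close to the line y = 2x + 1
x = [1; 2; 3; 4];
y = [3.1; 4.9; 7.2; 8.8];

% Add a column of ones so theta(1) is the intercept and theta(2) the slope
X = [ones(size(x)) x];

% Backslash returns the theta that minimizes sum((X * theta - y) .^ 2)
theta = X \ y;
sse_best = sum((X * theta - y) .^ 2);

% Any other line, e.g. intercept 1 and slope 2, has a larger sum of squared errors
theta_other = [1; 2];
sse_other = sum((X * theta_other - y) .^ 2);

fprintf('best fit: intercept %.2f, slope %.2f, SSE %.3f\n', theta(1), theta(2), sse_best);
fprintf('other line: SSE %.3f\n', sse_other);
```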

A vectorized implementation of a linear regression is as follows:

Y = X * θ^T

Learner’s check:

  • θ is our parameter (user preferences, in our case) vector
  • X is our vector of features (movie features, in our case)

To fit our example, we can rename the variables as such:

ratings = movie_features * user_prefs^T

We want to simultaneously find optimal values of movie_features and user_prefs such that the sum of squared errors (cost function) is minimized. How can we do this?
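
Before answering that, it may help to check the shapes involved. The sketch below fills both matrices with random placeholder numbers (not learned values) using the sizes from Step 1, and confirms that each entry of the product is the dot product of one movie’s features with one user’s preferences. The names ‘predicted’, ‘m’, ‘u’ and ‘single’ are mine.

```octave
num_movies = 10;
num_users = 5;
num_features = 3;

% Random placeholders; the real values are learned in Step 7
movie_features = randn(num_movies, num_features);   % 10 x 3
user_prefs = randn(num_users, num_features);        % 5 x 3

% (10 x 3) * (3 x 5) gives one predicted rating per movie/user pair
predicted = movie_features * user_prefs';
disp(size(predicted));

% Entry (m, u) is the dot product of movie m's features and user u's preferences
m = 2;
u = 3;
single = movie_features(m, :) * user_prefs(u, :)';
fprintf('difference: %g\n', abs(predicted(m, u) - single));
```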

Step 7: Minimize The Cost Function

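We will allow our collaborative filtering algorithm to simultaneously come up with the appropriate values of ‘movie_features’ and ‘user_prefs’, by minimizing the sum of squared errors through an iterative, gradient-based process in the spirit of gradient descent. For our case, the function we’ll be using in Octave is fmincg, a conjugate-gradient minimizer. Note that fmincg is not built into Octave; it is a separate fmincg.m file, commonly distributed with the exercises for Andrew Ng’s machine learning course, and it must be on your path.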

Note: If you are unfamiliar with gradient descent, worry not. All you need to understand is that gradient descent is an iterative algorithm that helps us minimize, in our specific case, the sum of squared errors. Consequently, we will have ‘learned’ the appropriate values of ‘user_prefs’ and ‘movie_features’ to make accurate predictions on movie ratings for every user.

We need to provide our fmincg function with 2 things: A cost function and its gradients (the slopes/partial derivatives of the cost function).

Here is my cost function, with regularization (to prevent overfitting, i.e high variance):

predictions = X * Theta';
difference = predictions - Y;
J = sum(difference(R==1) .^ 2) / 2;
thetaReg = sum(sum(Theta .^ 2)) * (lambda / 2);
xReg = sum(sum(X .^ 2)) * (lambda / 2);
J = J + thetaReg + xReg;

Learner’s check:

  • Remember, this code is inside of a function. So when this function is executed and it returns its matrix(s), the X matrix will hold the learned data (numbers) for ‘movie_features’ and the Theta matrix will hold the learned data (numbers) for ‘user_prefs’. You will see this shortly, if you are confused
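
To see what the cost code computes, here is a tiny hand-checkable sketch with 2 movies, 2 users and 2 features. All the numbers are made up. Only the three rated entries (where R is 1) count toward the squared error, while every parameter counts toward the regularization terms.

```octave
X = [1 0; 0 1];        % movie_features (2 movies x 2 features)
Theta = [1 1; 2 0];    % user_prefs (2 users x 2 features)
Y = [1 2; 3 0];        % ratings (2 movies x 2 users)
R = [1 1; 1 0];        % user 2 did not rate movie 2
lambda = 1;

predictions = X * Theta';            % [1 2; 1 0]
difference = predictions - Y;        % [0 0; -2 0]
J = sum(difference(R==1) .^ 2) / 2;  % only rated entries: (0 + 4 + 0) / 2 = 2

thetaReg = sum(sum(Theta .^ 2)) * (lambda / 2);  % (1 + 1 + 4 + 0) / 2 = 3
xReg = sum(sum(X .^ 2)) * (lambda / 2);          % (1 + 0 + 0 + 1) / 2 = 1
J = J + thetaReg + xReg;

fprintf('J = %g\n', J);
```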


Here is my gradient code:

% gradient of the cost with respect to each movie's features
for i = 1 : num_movies
  withoutReg = ((X(i, :) * Theta' - Y(i, :)) .* R(i, :) * Theta);
  reg = lambda * X(i, :);
  X_grad(i, :) = withoutReg + reg;
end
% gradient of the cost with respect to each user's preferences
for j = 1 : num_users
  withoutReg = ((X * Theta(j, :)' - Y(:, j)) .* R(:, j))' * X;
  reg = lambda * Theta(j, :);
  Theta_grad(j, :) = withoutReg + reg;
end
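
Gradient code is easy to get subtly wrong, so a common sanity check is to compare it against finite differences of the cost. The sketch below is not part of the original walkthrough: it uses a tiny made-up problem, writes the two gradient loops in an equivalent vectorized form, and nudges each parameter by a small step h to estimate the slope numerically. The helper names (‘sq_err’, ‘penalty’, ‘cost’, ‘bump’) are my own.

```octave
% Tiny made-up problem: 3 movies, 2 users, 2 features
X = [0.5 -0.2; 0.1 0.4; -0.3 0.8];   % movie_features
Theta = [0.3 0.7; -0.6 0.2];         % user_prefs
Y = [5 0; 3 4; 0 1];                 % ratings (0 = not rated)
R = (Y ~= 0);
lambda = 1.5;

% Regularized cost, written as anonymous functions of X and Theta
sq_err = @(X, Theta) sum(sum(((X * Theta' - Y) .* R) .^ 2)) / 2;
penalty = @(X, Theta) (lambda / 2) * (sum(sum(X .^ 2)) + sum(sum(Theta .^ 2)));
cost = @(X, Theta) sq_err(X, Theta) + penalty(X, Theta);

% Vectorized analytic gradients (same math as the two loops)
err = (X * Theta' - Y) .* R;
X_grad = err * Theta + lambda * X;
Theta_grad = err' * X + lambda * Theta;

% Central finite differences, one parameter at a time
h = 1e-4;
X_num = zeros(size(X));
for k = 1:numel(X)
  bump = zeros(size(X));
  bump(k) = h;
  X_num(k) = (cost(X + bump, Theta) - cost(X - bump, Theta)) / (2 * h);
end
Theta_num = zeros(size(Theta));
for k = 1:numel(Theta)
  bump = zeros(size(Theta));
  bump(k) = h;
  Theta_num(k) = (cost(X, Theta + bump) - cost(X, Theta - bump)) / (2 * h);
end

max_diff = max(abs([X_grad(:) - X_num(:); Theta_grad(:) - Theta_num(:)]));
fprintf('Largest analytic vs numerical difference: %g\n', max_diff);
```

If the analytic gradients are right, the printed difference should be tiny. A large value usually points to a sign error or a missing regularization term.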

And here is the full implementation of the entire function (calculating cost and its gradients):

function [J, grad] = costFunc(params, Y, R, num_users, num_movies, ...
                              num_features, lambda)
  % Unroll the parameter vector back into X (movie_features) and Theta (user_prefs)
  X = reshape(params(1:num_movies*num_features), num_movies, num_features);
  Theta = reshape(params(num_movies*num_features+1:end), ...
                  num_users, num_features);
  J = 0; % cost (sum of squared differences)
  X_grad = zeros(size(X)); % partial derivatives of J with respect to X (movie_features)
  Theta_grad = zeros(size(Theta)); % partial derivatives of J with respect to Theta (user_prefs)

  % Cost function with regularization
  predictions = X * Theta';
  difference = predictions - Y;
  J = sum(difference(R==1) .^ 2) / 2;

  thetaReg = sum(sum(Theta .^ 2)) * (lambda / 2);
  xReg = sum(sum(X .^ 2)) * (lambda / 2);

  J = J + thetaReg + xReg;

  % gradients
  for i = 1 : num_movies
    withoutReg = ((X(i, :) * Theta' - Y(i, :)) .* R(i, :) * Theta);
    reg = lambda * X(i, :);
    X_grad(i, :) = withoutReg + reg;
  end

  for j = 1 : num_users
    withoutReg = ((X * Theta(j, :)' - Y(:, j)) .* R(:, j))' * X;
    reg = lambda * Theta(j, :);
    Theta_grad(j, :) = withoutReg + reg;
  end

  % Roll both gradient matrices into one column vector, matching params
  grad = [X_grad(:); Theta_grad(:)];
end


Before we actually execute this function, we need to initialize our parameters user_prefs (Theta) and movie_features (X) to random small numbers. To do this in Octave, I have used the randn function. This function returns a matrix of random elements that are normally distributed, with a mean of 0 and a variance of 1:

num_users = size(ratings, 2);
num_movies = size(ratings, 1);
num_features = 3; % matches the 3 features introduced in Step 1
% Initialize Parameters Theta (user_prefs), X (movie_features)
movie_features = randn(num_movies, num_features);
user_prefs = randn(num_users, num_features);
initial_parameters = [movie_features(:); user_prefs(:)];
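
The minimizer works on a single column vector, which is why both matrices are flattened with (:) and stacked. A quick round-trip sketch, using the same sizes as our example and names of my own (‘params’, ‘X_back’, ‘Theta_back’), shows that reshape recovers the two matrices exactly:

```octave
num_movies = 10;
num_users = 6;
num_features = 3;

movie_features = randn(num_movies, num_features);
user_prefs = randn(num_users, num_features);

% Flatten column by column and stack: movie_features first, then user_prefs
params = [movie_features(:); user_prefs(:)];

% Split the vector at num_movies * num_features and reshape each half
split = num_movies * num_features;
X_back = reshape(params(1:split), num_movies, num_features);
Theta_back = reshape(params(split+1:end), num_users, num_features);

fprintf('params has %d entries\n', numel(params));
fprintf('round trip ok: %d\n', isequal(X_back, movie_features) && isequal(Theta_back, user_prefs));
```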

Now, let’s set some options for our cost minimizing fmincg function:

options = optimset('GradObj', 'on', 'MaxIter', 100);

Finally, let’s run fmincg, which will call costFunc repeatedly for up to 100 iterations. Notice, fmincg takes our costFunc function as an argument. This is what fmincg needs to minimize our cost function and calculate the best learning algorithm for predicting movie ratings:

lambda = 10; % regularization weight/parameter
optimal_prefs_and_features = fmincg (@(t)(costFunc(t, ratings, did_rate, num_users, num_movies, ...
num_features, lambda)), ...
initial_parameters, options);

Learner’s check:

  • If you are unfamiliar with regularization, you don’t need to worry about what lambda means.
  • optimal_prefs_and_features is the column vector returned from fmincg. It contains optimal values for user preferences and movie features that minimize our cost function
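
If you do not have fmincg.m to hand, the same idea can be illustrated with a plain gradient descent loop. To be clear, this is a stand-in technique and not what fmincg does internally. The sketch uses a tiny made-up set of mean-normalized ratings, fixed starting values so the run is repeatable, and a learning rate ‘alpha’ that I picked by assumption and that would need tuning on real data.

```octave
% Tiny made-up mean-normalized ratings: 4 movies x 3 users
Y = [ 1.5 -1.5  0.0;
     -2.0  0.0  2.0;
      0.0  1.0 -1.0;
      0.5  0.0 -0.5];
R = [1 1 0; 1 0 1; 0 1 1; 1 0 1];   % 1 = rated, 0 = not rated
lambda = 1;
alpha = 0.01;       % learning rate (assumed)
num_iters = 500;

% Small fixed starting values instead of randn, so results are repeatable
X = 0.1 * [1 -1; 2 1; -1 2; 1 1];      % movie_features (4 x 2)
Theta = 0.1 * [1 2; -1 1; 2 -1];       % user_prefs (3 x 2)

J_history = zeros(num_iters, 1);
for iter = 1:num_iters
  err = (X * Theta' - Y) .* R;
  % Record the regularized cost before taking the step
  J_history(iter) = sum(sum(err .^ 2)) / 2 + ...
                    (lambda / 2) * (sum(sum(X .^ 2)) + sum(sum(Theta .^ 2)));
  % Same gradients as costFunc, in vectorized form
  X_grad = err * Theta + lambda * X;
  Theta_grad = err' * X + lambda * Theta;
  % Step both parameter matrices downhill at the same time
  X = X - alpha * X_grad;
  Theta = Theta - alpha * Theta_grad;
end

fprintf('cost at start: %.4f, cost at end: %.4f\n', J_history(1), J_history(end));
```

The cost should shrink over the iterations. fmincg aims for the same outcome but chooses its own step sizes and search directions, so it typically needs far fewer iterations.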

We need to extract ‘user_prefs’ and ‘movie_features’ from optimal_prefs_and_features, so we can start making some predictions:

movie_features = reshape(optimal_prefs_and_features(1:num_movies*num_features), num_movies, num_features);
user_prefs = reshape(optimal_prefs_and_features(num_movies*num_features+1:end), ...
num_users, num_features);

Step 8: Make Movie Predictions!…Finally

Recall Step 4: Let’s Rate Some Movies. We rated some movies there, and those ratings (the ‘my_ratings’ column vector) became the first column of ‘ratings’. Now, let’s use the learning algorithm we just built to predict the ratings that we would give to every movie:

all_predictions = movie_features * user_prefs';
my_predictions = all_predictions(:,1) + ratings_mean;

‘my_predictions’ is a 10 X 1 column vector holding one predicted rating per movie.

Learner’s check:

  • Recall in Step 5 where we mean normalized all the ‘ratings’. Since we subtracted the mean of the movie’s ratings from each rating for that movie, we added back ‘ratings_mean’ to our predicted ratings.
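
As a rough illustration of the two Octave lines above, the following plain Python sketch computes the predictions for a single user: one dot product per movie, plus that movie's mean rating. The function name predict_for_user and all the numbers are invented for the example; they are not the values from this tutorial.

```python
def predict_for_user(movie_features, user_pref, ratings_mean):
    """Return one predicted rating per movie for a single user."""
    predictions = []
    for features, mean in zip(movie_features, ratings_mean):
        # Dot product of the movie's feature row with the user's preference row
        normalized = sum(f * p for f, p in zip(features, user_pref))
        # Undo mean normalization by adding the movie's mean rating back
        predictions.append(normalized + mean)
    return predictions


movie_features = [[1.0, 0.0], [0.5, 0.5], [0.0, 2.0]]  # 3 movies, 2 features
first_user = [2.0, -1.0]                               # preferences of user 1
ratings_mean = [5.0, 6.0, 4.0]                         # per-movie mean rating

my_predictions = predict_for_user(movie_features, first_user, ratings_mean)
```

Taking the first column of movie_features * user_prefs' in Octave corresponds to using the first user's preference row here.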

Let’s display our predictions:

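[r, ix] = sort(my_predictions, 'descend');
fprintf('\nTop recommendations for you:\n');
for i = 1:10
    j = ix(i);
    % movie_names is an assumed cell array of movie titles; the original
    % argument is cut off in this listing, so substitute your own variable
    fprintf('Predicting rating %.1f for movie %s\n', my_predictions(j), ...
            movie_names{j});
end
fprintf('\n\nOriginal ratings provided:\n');
for i = 1:length(my_ratings)
    if my_ratings(i) > 0
        fprintf('Rated %d for %s\n', my_ratings(i), ...
                movie_names{i});
    end
end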

The result looks as follows:

Top recommendations for you:
Predicting rating 9.6 for movie Straight Outta Compton (2015)
Predicting rating 8.0 for movie A Very Harold and Kumar Christmas (2011)
Predicting rating 8.0 for movie Notorious (2009)
Predicting rating 7.3 for movie Ted (2012)
Predicting rating 6.7 for movie Cinderella (2015)
Predicting rating 6.5 for movie Toy Story 3 (2010)
Predicting rating 6.0 for movie Frozen (2013)
Predicting rating 5.8 for movie Harold and Kumar Escape From Guantanamo Bay (2008)
Predicting rating 4.3 for movie Tangled (2010)
Predicting rating 4.0 for movie Get Rich Or Die Tryin' (2005)
Original ratings provided:
Rated 7 for Harold and Kumar Escape From Guantanamo Bay (2008)
Rated 8 for Notorious (2009)
Rated 3 for Tangled (2010)

Step 9: Take It Further

You should try to build your own recommendation engine. Perhaps not just for movies, but for anything else you can think of. We can’t always find what we are looking for by ourselves. Sometimes a good recommendation is all we need.

Perhaps you can implement a clustering algorithm such as k-means or DBSCAN to group users with similar features together, and thereby recommend the same movies to users belonging to the same cluster.
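
If you want to experiment with the clustering idea, here is a minimal k-means sketch in plain Python. It is only a starting point under simplifying assumptions: the starting centroids are simply the first k points (so the result is deterministic), and the user_prefs rows are invented rather than learned by the code above.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, k, iterations=10):
    """Tiny k-means. Returns (centroids, labels).

    Simplification: the first k points are used as starting centroids.
    """
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids, labels


# Hypothetical preference rows, one per user, two features each
user_prefs = [
    [0.9, 0.1], [0.1, 1.0], [1.0, 0.0],
    [0.2, 0.9], [0.0, 1.1], [0.8, 0.2],
]
centroids, labels = k_means(user_prefs, k=2)
```

Users that end up with the same label could then be shown the same recommendations.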

In our example, the more you rate movie movies, the more ‘personalized’ (and possibly accurate) your recommendations will be. This is because you are giving the recommendation engine (learning algorithm) more of your data to observe and learn from.

So, maybe if you actually ‘Netflix and chill’ed more often, Netflix will know you better and make better movie recommendations for you 😉

Nikhil Bhaskar

*Original post here*

Source: Movie Recommendations? How Does Netflix Do It? A 9 Step Coding & Intuitive Guide Into Collaborative Filtering by nbhaskar

August 28, 2017 Health and Biotech analytics news roundup

Genome sequencing method can detect clinically relevant mutations using 5 CTCs: Researchers showed that a technique that can sequence very long stretches of the genome can accurately quantify mutations using only 5 ‘circulating tumor cells’ (although they used 34 in this study).

Artificial intelligence predicts dementia before onset of symptoms: Using only one scan of the brain per patient, McGill scientists were able to accurately predict Alzheimer’s 2 years before its onset.

Using machine learning to improve patient care: Two papers from MIT made strides in the field, one that used ICU data to predict necessary treatments and another that trained models of mortality and length of stay based on electronic health record data.

How CROs Are Helping With Healthcare’s Data Problem: Clinical trial costs are a major cause of rising health care costs. To help streamline this, pharmaceutical companies are increasingly using ‘contract research organizations’ to conduct trials, as they can use their expertise and specialized business intelligence tools to cut costs.

I was worried about artificial intelligence—until it saved my life: Krista Jones had a rare form of cancer that could be correctly treated only with the help of machine learning technology.

Genomic Medicine Has Entered the Building: Some types of genome sequences now cost as much as an MRI, which has allowed organizations to undertake large-scale studies in personalized medicine.

Source: August 28, 2017 Health and Biotech analytics news roundup

The 37 best tools for data visualization

Creating charts and infographics can be time-consuming. But these tools make it easier.

It’s often said that data is the new world currency, and the web is the exchange bureau through which it’s traded. As consumers, we’re positively swimming in data; it’s everywhere from labels on food packaging design to World Health Organisation reports. As a result, for the designer it’s becoming increasingly difficult to present data in a way that stands out from the mass of competing data streams.

One of the best ways to get your message across is to use a visualization to quickly draw attention to the key messages, and by presenting data visually it’s also possible to uncover surprising patterns and observations that wouldn’t be apparent from looking at stats alone.


Not a web designer or developer? You may prefer free tools for creating infographics.

As author, data journalist and information designer David McCandless said in his TED talk: “By visualizing information, we turn it into a landscape that you can explore with your eyes, a sort of information map. And when you’re lost in information, an information map is kind of useful.”

There are many different ways of telling a story, but everything starts with an idea. So to help you get started we’ve rounded up some of the most awesome data visualization tools available on the web.

01. Dygraphs

Help visitors explore dense data sets with JavaScript library Dygraphs

Dygraphs is a fast, flexible open source JavaScript charting library that allows users to explore and interpret dense data sets. It’s highly customizable, works in all major browsers, and you can even pinch to zoom on mobile and tablet devices.

02. ZingChart

ZingChart lets you create HTML5 Canvas charts and more

ZingChart is a JavaScript charting library and feature-rich API set that lets you build interactive Flash or HTML5 charts. It offers over 100 chart types to fit your data.

03. InstantAtlas

InstantAtlas enables you to create highly engaging visualisations around map data

If you’re looking for a data viz tool with mapping, InstantAtlas is worth checking out. This tool enables you to create highly-interactive dynamic and profile reports that combine statistics and map data to create engaging data visualizations.

04. Timeline

Timeline creates beautiful interactive visualizations

Timeline is a fantastic widget which renders a beautiful interactive timeline that responds to the user’s mouse, making it easy to create advanced timelines that convey a lot of information in a compressed space.

Each element can be clicked to reveal more in-depth information, making this a great way to give a big-picture view while still providing full detail.

05. Exhibit

Exhibit makes data visualization a doddle

Developed by MIT, and fully open-source, Exhibit makes it easy to create interactive maps, and other data-based visualizations that are orientated towards teaching or static/historical based data sets, such as flags pinned to countries, or birth-places of famous people.

06. Modest Maps

Integrate and develop interactive maps within your site with this cool tool

Modest Maps is a lightweight, simple mapping tool for web designers that makes it easy to integrate and develop interactive maps within your site, using them as a data visualization tool.

The API is easy to get to grips with, and offers a useful number of hooks for adding your own interaction code, making it a good choice for designers looking to fully customise their user’s experience to match their website or web app. The basic library can also be extended with additional plugins, adding to its core functionality and offering some very useful data integration options.

07. Leaflet

Use OpenStreetMap data and integrate data visualisation in an HTML5/CSS3 wrapper

Another mapping tool, Leaflet makes it easy to use OpenStreetMap data and integrate fully interactive data visualisation in an HTML5/CSS3 wrapper.

The core library itself is very small, but there is a wide range of plugins available that extend it with specialist functionality such as animated markers, masks and heatmaps. Perfect for any project where you need to show data overlaid on a geographical projection (including unusual projections!).

08. WolframAlpha

Wolfram Alpha is excellent at creating charts

Billed as a “computational knowledge engine”, the Google rival WolframAlpha is really good at intelligently displaying charts in response to data queries without the need for any configuration. If you’re using publicly available data, it offers a simple widget builder that makes it easy to get visualizations on your site.

09. Visual.ly

Visual.ly makes data visualization as simple as it can be

Visual.ly is a combined gallery and infographic generation tool. It offers a simple toolset for building stunning data representations, as well as a platform to share your creations. This goes beyond pure data visualisation, but if you want to create something that stands on its own, it’s a fantastic resource and an info-junkie’s dream come true!

10. Visualize Free

Make visualizations for free!

Visualize Free is a hosted tool that allows you to use publicly available datasets, or upload your own, and build interactive visualizations to illustrate the data. The visualizations go well beyond simple charts. The service is completely free, and while development work requires Flash, output can be delivered through HTML5.

11. Better World Flux

Making the ugly beautiful – that’s Better World Flux

Orientated towards making positive change to the world, Better World Flux has some lovely visualizations of some pretty depressing data. It would be very useful, for example, if you were writing an article about world poverty, child undernourishment or access to clean water. This tool doesn’t allow you to upload your own data, but does offer a rich interactive output.

12. FusionCharts

A comprehensive JavaScript/HTML5 charting solution for your data visualization needs

FusionCharts Suite XT brings you 90+ charts and gauges, 965 data-driven maps, and ready-made business dashboards and demos. FusionCharts comes with extensive JavaScript API that makes it easy to integrate it with any AJAX application or JavaScript framework. These charts, maps and dashboards are highly interactive, customizable and work across all devices and platforms. They also have a comparison of the top JavaScript charting libraries which is worth checking out.

13. jqPlot

jqPlot is a nice solution for line and point charts

Another jQuery plugin, jqPlot is a nice solution for line and point charts. It comes with a few nice additional features such as the ability to generate trend lines automatically, and interactive points that can be adjusted by the website visitor, updating the dataset accordingly.

14. Dipity

Dipity has free and premium versions to suit your needs

Dipity allows you to create rich interactive timelines and embed them on your website. It offers a free version and a premium product, with the usual restrictions and limitations present. The timelines it outputs are beautiful and fully customisable, and are very easy to embed directly into your page.

15. Many Eyes

Many Eyes was developed by IBM

Developed by IBM, Many Eyes allows you to quickly build visualizations from publicly available or uploaded data sets, and features a wide range of analysis types including the ability to scan text for keyword density and saturation. This is another great example of a big company supporting research and sharing the results openly.

16. D3.js

You can render some amazing diagrams with D3

D3.js is a JavaScript library that uses HTML, SVG, and CSS to render some amazing diagrams and charts from a variety of data sources. This library, more than most, is capable of some seriously advanced visualizations with complex data sets. It’s open source, and uses web standards so is very accessible. It also includes some fantastic user interaction support.

17. JavaScript InfoVis Toolkit

JavaScript InfoVis Toolkit includes a handy modular structure

A fantastic library written by Nicolas Belmonte, the JavaScript InfoVis Toolkit includes a modular structure, allowing you to only force visitors to download what’s absolutely necessary to display your chosen data visualizations. This library has a number of unique styles and swish animation effects, and is free to use (although donations are encouraged).

18. jpGraph

jpGraph is a PHP-based data visualization tool

If you need to generate charts and graphs server-side, jpGraph offers a PHP-based solution with a wide range of chart types. It’s free for non-commercial use, and features extensive documentation. By rendering on the server, this is guaranteed to provide a consistent visual output, albeit at the expense of interactivity and accessibility.

19. Highcharts

Highcharts has a huge range of options available

Highcharts is a JavaScript charting library with a huge range of chart options available. The output is rendered using SVG in modern browsers and VML in Internet Explorer. The charts are beautifully animated into view automatically, and the framework also supports live data streams. It’s free to download and use non-commercially (and licensable for commercial use). You can also play with the extensive demos using JSFiddle.

20. Google Charts

Google Charts has an excellent selection of tools available

The seminal charting solution for much of the web, Google Charts is highly flexible and has an excellent set of developer tools behind it. It’s an especially useful tool for specialist visualizations such as geocharts and gauges, and it also includes built-in animation and user interaction controls.

21. Excel

It isn’t graphically flexible, but Excel is a good way to explore data: for example, by creating ‘heat maps’ like this one

You can actually do some pretty complex things with Excel, from ‘heat maps’ of cells to scatter plots. As an entry-level tool, it can be a good way of quickly exploring data, or creating visualizations for internal use, but the limited default set of colours, lines and styles makes it difficult to create graphics that would be usable in a professional publication or website. Nevertheless, as a means of rapidly communicating ideas, Excel should be part of your toolbox.

Excel comes as part of the commercial Microsoft Office suite, so if you don’t have access to it, Google’s spreadsheets – part of Google Docs and Google Drive – can do many of the same things. Google ‘eats its own dog food’, so the spreadsheet can generate the same charts as the Google Chart API. This will get you familiar with what is possible before stepping off and using the API directly for your own projects.


22. CSV/JSON

CSV (Comma-Separated Values) and JSON (JavaScript Object Notation) aren’t actual visualization tools, but they are common formats for data. You’ll need to understand their structures and how to get data in or out of them.
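
As a small illustration of moving data between the two formats, here is a sketch that uses only Python's standard csv and json modules. The column names and figures are made-up sample data, not taken from any real dataset.

```python
import csv
import io
import json

# Made-up sample data: a header line followed by one record per line
csv_text = "country,year,value\nKenya,2015,4.2\nPeru,2015,3.1\n"

# CSV in: each row becomes a dict keyed by the header line
rows = list(csv.DictReader(io.StringIO(csv_text)))

# CSV carries no type information, so convert numeric columns explicitly
for row in rows:
    row["year"] = int(row["year"])
    row["value"] = float(row["value"])

# JSON out: the same records as an array of objects
json_text = json.dumps(rows, indent=2)

# JSON in: parse the text back into lists and dicts
records = json.loads(json_text)
```

The main structural difference shows up in the conversion step: every CSV field arrives as a string, while JSON keeps numbers and strings distinct.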

23. Crossfilter

Crossfilter in action: by restricting the input range on any one chart, data is affected everywhere. This is a great tool for dashboards or other interactive tools with large volumes of data behind them

As we build more complex tools to enable clients to wade through their data, we are starting to create graphs and charts that double as interactive GUI widgets. JavaScript library Crossfilter can be both of these. It displays data, but at the same time, you can restrict the range of that data and see other linked charts react.

24. Tangle

Tangle creates complex interactive graphics. Pulling on any one of the knobs affects data throughout all of the linked charts. This creates a real-time feedback loop, enabling you to understand complex equations in a more intuitive way

The line between content and control blurs even further with Tangle. When you are trying to describe a complex interaction or equation, letting the reader tweak the input values and see the outcome for themselves provides both a sense of control and a powerful way to explore data. JavaScript library Tangle is a set of tools to do just this.

Dragging on variables enables you to increase or decrease their values and see an accompanying chart update automatically. The results are only just short of magical.

25. Polymaps

Aimed more at specialist data visualisers, the Polymaps library creates image and vector-tiled maps using SVG

Polymaps is a mapping library that is aimed squarely at a data visualization audience. Offering a unique approach to styling the maps it creates, analogous to CSS selectors, it’s a great resource to know about.

26. OpenLayers

It isn’t easy to master, but OpenLayers is arguably the most complete, robust mapping solution discussed here

OpenLayers is probably the most robust of these mapping libraries. The documentation isn’t great and the learning curve is steep, but for certain tasks nothing else can compete. When you need a very specific tool no other library provides, OpenLayers is always there.

27. Kartograph

Kartograph’s projections breathe new life into our standard slippy maps

Kartograph’s tag line is ‘rethink mapping’ and that is exactly what its developers are doing. We’re all used to the Mercator projection, but Kartograph brings far more choices to the table. If you aren’t working with worldwide data, and can place your map in a defined box, Kartograph has the options you need to stand out from the crowd.

28. CartoDB

CartoDB provides an unparalleled way to combine maps and tabular data to create visualisations

CartoDB is a must-know site. The ease with which you can combine tabular data with maps is second to none. For example, you can feed in a CSV file of address strings and it will convert them to latitudes and longitudes and plot them on a map, but there are many other uses. It’s free for up to five tables; after that, there are monthly pricing plans.

29. Processing

Processing provides a cross-platform environment for creating images, animations, and interactions

Processing has become the poster child for interactive visualizations. It enables you to write much simpler code which is in turn compiled into Java.

There is also a Processing.js project to make it easier for websites to use Processing without Java applets, plus a port to Objective-C so you can use it on iOS. It is a desktop application, but can be run on all platforms, and given that it is now several years old, there are plenty of examples and code from the community.

30. NodeBox

NodeBox is a quick, easy way for Python-savvy developers to create 2D visualisations

NodeBox is an OS X application for creating 2D graphics and visualizations. You need to know and understand Python code, but beyond that it’s a quick and easy way to tweak variables and see results instantly. It’s similar to Processing, but without all the interactivity.

31. R

A powerful free software environment for statistical computing and graphics, R is the most complex of the tools listed here

How many other pieces of software have an entire search engine dedicated to them? A statistical package used to parse large data sets, R is a very complex tool, and one that takes a while to understand, but has a strong community and package library, with more and more being produced.

The learning curve is one of the steepest of any of these tools listed here, but you must be comfortable using it if you want to get to this level.

32. Weka

A collection of machine-learning algorithms for data-mining tasks, Weka is a powerful way to explore data

When you get deeper into being a data scientist, you will need to expand your capabilities from just creating visualizations to data mining. Weka is a good tool for classifying and clustering data based on various attributes – both powerful ways to explore data – but it also has the ability to generate simple plots.

33. Gephi

Gephi in action. Coloured regions represent clusters of data that the system is guessing are similar

When people talk about relatedness, social graphs and co-relations, they are really talking about how two nodes are related to one another relative to the other nodes in a network. The nodes in question could be people in a company, words in a document or passes in a football game, but the maths is the same.

Gephi, a graph-based visualiser and data explorer, can not only crunch large data sets and produce beautiful visualizations, but also allows you to clean and sort the data. It’s a very niche use case and a complex piece of software, but it puts you ahead of anyone else in the field who doesn’t know about this gem.

34. iCharts

iCharts can have interactive elements, and you can pull in data from Google Docs

The iCharts service provides a hosted solution for creating and presenting compelling charts for inclusion on your website. There are many different chart types available, and each is fully customisable to suit the subject matter and colour scheme of your site.

Charts can have interactive elements, and can pull data from Google Docs, Excel spreadsheets and other sources. The free account lets you create basic charts, while you can pay to upgrade for additional features and branding-free options.

35. Flot

Create animated visualisations with this jQuery plugin

Flot is a specialised plotting library for jQuery, but it has many handy features and crucially works across all common browsers including Internet Explorer 6. Data can be animated and, because it’s a jQuery plugin, you can fully control all the aspects of animation, presentation and user interaction. This does mean that you need to be familiar with (and comfortable with) jQuery, but if that’s the case, this makes a great option for including interactive charts on your website.

36. Raphaël

This handy JavaScript library offers a range of data visualisation options

This handy JavaScript library offers a wide range of data visualization options which are rendered using SVG. This makes for a flexible approach that can easily be integrated within your own web site/app code, and is limited only by your own imagination.

That said, it’s a bit more hands-on than some of the other tools featured here (a victim of being so flexible), so unless you’re a hardcore coder, you might want to check out some of the more point-and-click orientated options first!

37. jQuery Visualize

jQuery Visualize Plugin is an open source charting plugin

Written by the team behind jQuery’s ThemeRoller and jQuery UI websites, jQuery Visualize Plugin is an open source charting plugin for jQuery that uses HTML Canvas to draw a number of different chart types. One of the key features of this plugin is its focus on achieving ARIA support, making it friendly to screen-readers. It’s free to download from this page on GitHub.

Further reading

  • A great Tumblr blog for visualization examples and inspiration: vizualize.tumblr.com
  • Nicholas Felton’s annual reports are now infamous, but he also has a Tumblr blog of great things he finds.
  • From the guy who helped bring Processing into the world: benfry.com/writing
  • Stamen Design is always creating interesting projects: stamen.com
  • Eyeo Festival brings some of the greatest minds in data visualization together in one place, and you can watch the videos online.

Brian Suda is a master informatician and author of Designing with Data, a practical guide to data visualisation.

Originally posted via “The 37 best tools for data visualization”



Best Practices for Using Context Variables with Talend – Part 1

A question I was regularly asked when working on different customer sites and answering questions on forums was “What is the best practice when using context variables?”

My years of working with Talend have led me to work with context variables in a way that minimizes the effort I need to put into ongoing maintenance and moving them between environments. This blog series is intended to give you an insight into the best practices I use as well as highlight the potential pitfalls that can arise from using the Talend context variable functionality without fully understanding it.

Contexts, Context Variables and Context Groups

To start, I want to ensure that we are all on the same page with regard to terminology. There are 3 ways “Context” is used in Talend:

  • Context variable: A variable which can be set either at compile time or runtime. It can be changed and allows variables which would otherwise be hardcoded to be more dynamic.
  • Context: The environment or category of the value held by the context variable. Most of the time Contexts are DEV, TEST, PROD, UAT, etc. This allows you to set up one context variable and assign a different value per environment.
  • Context Group: A group of context variables which are packaged together for ease of use. Context Groups can be dragged and dropped into jobs so that you do not have to set up the same context variables in different jobs. They can also be updated (added to) in one location and then the changes can be distributed to the jobs that use those Context Groups.
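
One way to picture how the three terms relate is the small Python model below. It is only an illustration of the definitions above, not Talend's API or generated code; the variable names, host names and the resolve helper are all hypothetical.

```python
# Hypothetical model of a Context Group: the same context variables
# (db_host, db_port) with one value per Context (environment).
db_context_group = {
    "DEV":  {"db_host": "dev-db.example.com",  "db_port": 5432},
    "TEST": {"db_host": "test-db.example.com", "db_port": 5432},
    "PROD": {"db_host": "prod-db.example.com", "db_port": 5432},
}


def resolve(context_group, context_name, variable):
    """Look up the value a context variable takes under a given Context."""
    return context_group[context_name][variable]


test_host = resolve(db_context_group, "TEST", "db_host")
```

Here db_host is a context variable, TEST is a Context, and the whole dictionary plays the role of a Context Group.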

I’ve found that many people refer to “context variables” as “contexts”. When the terms are mixed up like this, particularly in online discussions, it really can confuse the issue. So, now that we have a common set of definitions, let’s move forward.

Potential Pitfalls with Contexts

While context variables are incredibly useful when working with Talend, they can also introduce some unforeseen problems if not fully understood. The biggest cause of problems, in my experience, is the Contexts. Quite simply, I do not use anything but a Default Context.

At the beginning of your Talend journey, Contexts come across as a genius idea that allows developers to build against one environment, using that environment’s context variable values, and then, when the code is ready to test, change the Context at the flick of a switch. That is true (kind of), but mainly for smaller data integration jobs. More often than not, however, they expose developers and testers to horrible, time-consuming and unexpected behavior. Below is just one scenario demonstrating this.


Let’s say a developer has built a job which uses a Context Group configured to supply database connection parameters. She has set up 4 Contexts (DEV1, DEV2, TEST and PROD) and has configured the different Context Variable values for each Context. In her main job, she reads from the database and then passes some of the data to Child Jobs using tRunJob components. Some of these Child Jobs have their own Child Jobs and all Child Jobs make use of the database. Thus, all jobs make use of the Context Group holding the database credentials. While she is developing, she sets the Context within the tRunJobs to DEV1. This is great. She can debug her Job until she is happy that it is working. However, she needs to test on DEV2 because it has a slightly cleaner environment. When she runs the Parent Job she changes the default Context from DEV1 to DEV2 and runs the Job. It seems to work, but she cannot see the database updates in her DEV2 database. Why? She then realizes that her Child Jobs are all defaulted to use DEV1 and not DEV2.
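
The mismatch in this scenario can be mimicked with a few lines of plain Python. This is a simplified model and not how Talend executes jobs; the database names and the parent_job and child_job functions are invented for the illustration.

```python
# Invented Context Group: one database name per Context
CONTEXT_GROUP = {
    "DEV1": {"db_name": "dev1_db"},
    "DEV2": {"db_name": "dev2_db"},
}


def child_job(context_name="DEV1"):
    """Stands in for a Child Job whose tRunJob Context was left at DEV1."""
    return CONTEXT_GROUP[context_name]["db_name"]


def parent_job(context_name, pass_context_to_child=False):
    """Stands in for the Parent Job; reports which database each job used."""
    parent_db = CONTEXT_GROUP[context_name]["db_name"]
    if pass_context_to_child:
        child_db = child_job(context_name)
    else:
        child_db = child_job()  # the child silently keeps its own default
    return {"parent": parent_db, "child": child_db}


mismatch = parent_job("DEV2")
aligned = parent_job("DEV2", pass_context_to_child=True)
```

In the first call the parent works against DEV2 while the child still works against DEV1, which is exactly the kind of quiet split that is hard to spot once there are dozens of Child Jobs.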

Now, there are ways around this. She could ensure that all of her tRunJobs are set with the correct Context. But what if she has dozens of them? How long will that take? She could ensure that “Transmit whole context” is set in each tRunJob. But what happens if a Child Job is using a Context variable or Context Group that is not used by any of the Parent Jobs? We are back to the same problem of having to change all of the tRunJob Contexts. But this doesn’t affect us outside of the Talend Studio, right? Wrong.

If the developer compiled that job to use on the command-line, even if she sets “Apply Context to children jobs” on the Build Job page, all this does is hardcode all of the Child Jobs’ Contexts to that selected in the Context scripts drop down. When you run it, if you change the Context that the Job needs to run for, the Child Jobs stick with the one that has been compiled. The same thing happens in the Talend Administration Center (TAC) as well.

Now, this does have some uses. Maybe your Contexts are not for environments and you want to be able to use different Contexts within the same environment? That is a legitimate (if slightly unusual) scenario. There are other examples of these sorts of problems, but I think you get the idea.

In the early days of Talend, Contexts were brilliant. But these days (unless you have a particular use case where multiple Contexts are used within a single environment), there are better ways of handling Context variables for multiple environments. I’ll cover all of those ways and best practices in parts two and three of our blog series coming out next week. Until next time!

The post Best Practices for Using Context Variables with Talend – Part 1 appeared first on Talend Real-Time Open Source Data Integration Software.

Originally Posted at: Best Practices for Using Context Variables with Talend – Part 1 by analyticsweekpick

Data skills – Many hands make light work in a world awash with data

While the transformation to a data-driven culture needs to come from the top of the organization, data skills must permeate through all areas of the business.

Rather than being the responsibility of one person or department, assuring data availability and integrity must be a team sport in modern data-centric businesses. Everyone must be involved and made accountable throughout the process.  

The challenge for enterprises is to effectively enable greater data access among the workforce while maintaining oversight and quality.

The Evolution of the Data Team

Businesses are recognizing the value and opportunities that data creates. There is an understanding that data needs to be handled and processed efficiently. For some companies, this has led to the formation of a new department of data analysts and scientists.

The data team is led by a Chief Data Officer (CDO), a role that is set to become key to business success in the digital era, according to recent research from Gartner. While earlier iterations of roles within the data team centered on data governance, data quality and regulatory issues, the focus is shifting. Data analysts and scientists are now expected to contribute and deliver a data-driven culture across the company, while also driving business value. According to the Gartner survey, the skills required for roles within the data team have expanded to span data management, analytics, data science, ethics, and digital transformation.

Businesses are clearly recognizing the importance of the data team’s functions and are making significant investments in it. Office budgets for the data team increased by an impressive 23% between 2016 and 2017 according to Gartner. What’s more, some 15% of the CDOs that took part in the study revealed that their budgets were more than $20 million for their departments, compared with just 7% who said the same in 2016. The increasing popularity and evolution of these new data roles has largely been driven by GDPR in Europe and by new data protection regulations in the US. And the evidence suggests that the position will be essential for ensuring the successful transfer of data skills throughout businesses of all sizes.

The Data Skills Shortage

Data is an incredibly valuable resource, but businesses can only unlock its full potential if they have the talent to analyze that data and produce actionable insights that help them to better understand their customers’ needs. However, companies are already struggling to cope with the big data ecosystem due to a skills shortage and the problem shows little sign of improving. In fact, Europe could see a shortage of up to 500,000 IT professionals by 2020, according to the latest research from consultancy firm Empirica.

The rapidly evolving digital landscape is partly to blame, as the skills required have changed radically in recent years. The data science skills needed at today’s data-driven companies are more wide-ranging than ever before. The modern workforce is now required to have a firm grasp of computer science, including everything from databases to the cloud, according to strategic advisor and best-selling author Bernard Marr. In addition, analytical skills are essential to make sense of the ever-increasing data gathered by enterprises, while mathematical skills are also vital because much of the data captured is numerical, largely as a result of IoT and sensor data. These skills must also sit alongside more traditional business and communication skills, as well as the ability to be creative and adapt to developing technologies.

The need for these skills is set to increase, with IBM predicting that the number of jobs for data professionals will rise by a massive 28% by 2020. The good news is that businesses already recognize the importance of digital skills in the workforce: the role of Data Scientist has taken the number one spot in Glassdoor’s Best Jobs in America for the past three years, with 4,524 positions available in 2018.

Data Training Employees

Data quality management is a task that extends across all functional areas of a company. It therefore makes sense to give employees in the specialist departments self-service tools for ensuring data quality. Cloud-based tools that can be rolled out quickly and easily across departments are essential. This way, companies can gradually improve their data quality while also increasing the value of their data.

As the number of data workers triples, and in order to stay competitive under GDPR, businesses must think of good data management as a team sport. Investing in the Chief Data Officer role and data skills now will enable forward-thinking businesses to reap the rewards, both in the short term and further into the future.

The post Data skills – Many hands make light work in a world awash with data appeared first on Talend Real-Time Open Source Data Integration Software.

Originally Posted at: Data skills – Many hands make light work in a world awash with data by analyticsweekpick

How big data can improve manufacturing

Manufacturers taking advantage of advanced analytics can reduce process flaws, saving time and money.

In the past 20 years or so, manufacturers have been able to reduce waste and variability in their production processes and dramatically improve product quality and yield (the amount of output per unit of input) by implementing lean and Six Sigma programs. However, in certain processing environments—pharmaceuticals, chemicals, and mining, for instance—extreme swings in variability are a fact of life, sometimes even after lean techniques have been applied. Given the sheer number and complexity of production activities that influence yield in these and other industries, manufacturers need a more granular approach to diagnosing and correcting process flaws. Advanced analytics provides just such an approach.

Advanced analytics refers to the application of statistics and other mathematical tools to business data in order to assess and improve practices (exhibit). In manufacturing, operations managers can use advanced analytics to take a deep dive into historical process data, identify patterns and relationships among discrete process steps and inputs, and then optimize the factors that prove to have the greatest effect on yield. Many global manufacturers in a range of industries and geographies now have an abundance of real-time shop-floor data and the capability to conduct such sophisticated statistical assessments. They are taking previously isolated data sets, aggregating them, and analyzing them to reveal important insights.


Consider the production of biopharmaceuticals, a category of healthcare products that includes vaccines, hormones, and blood components. They are manufactured using live, genetically engineered cells, and production teams must often monitor more than 200 variables within the production flow to ensure the purity of the ingredients as well as the substances being made. Two batches of a particular substance, produced using an identical process, can still exhibit a variation in yield of between 50 and 100 percent. This huge unexplained variability can create issues with capacity and product quality and can draw increased regulatory scrutiny.

One top-five biopharmaceuticals maker used advanced analytics to significantly increase its yield in vaccine production while incurring no additional capital expenditures. The company segmented its entire process into clusters of closely related production activities; for each cluster, it took far-flung data about process steps and the materials used and gathered them in a central database.

A project team then applied various forms of statistical analysis to the data to determine interdependencies among the different process parameters (upstream and downstream) and their impact on yield. Nine parameters proved to be most influential, especially time to inoculate cells and conductivity measures associated with one of the chromatography steps. The manufacturer made targeted process changes to account for these nine parameters and was able to increase its vaccine yield by more than 50 percent—worth between $5 million and $10 million in yearly savings for a single substance, one of hundreds it produces.
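
The article does not say which statistical methods the project team used. As a rough illustration only, the sketch below ranks process parameters by the strength of their linear correlation with yield, which is one simple way to shortlist influential parameters. The batch records and the parameter names (`inoculation_hours`, `conductivity`, `ambient_temp`) are invented for the example and are not taken from the case study.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / math.sqrt(var_x * var_y)

def rank_parameters(batches, target="yield"):
    """Rank every non-target column by |correlation| with the target."""
    ys = [b[target] for b in batches]
    names = [k for k in batches[0] if k != target]
    scores = {name: pearson([b[name] for b in batches], ys) for name in names}
    return sorted(scores.items(), key=lambda kv: abs(kv[1]), reverse=True)

# Hypothetical batch records: all names and values are made up.
batches = [
    {"inoculation_hours": 10, "conductivity": 5.0, "ambient_temp": 21, "yield": 80},
    {"inoculation_hours": 12, "conductivity": 5.5, "ambient_temp": 22, "yield": 74},
    {"inoculation_hours": 14, "conductivity": 6.1, "ambient_temp": 20, "yield": 69},
    {"inoculation_hours": 16, "conductivity": 6.4, "ambient_temp": 22, "yield": 62},
    {"inoculation_hours": 18, "conductivity": 7.0, "ambient_temp": 21, "yield": 55},
]

ranking = rank_parameters(batches)
for name, r in ranking:
    print(f"{name}: r = {r:+.2f}")
```

A ranking like this only flags candidates. Correlation does not capture interactions between upstream and downstream steps, so a real analysis would likely need multivariate methods and process knowledge before any parameter is changed.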

Developing unexpected insights

Even within manufacturing operations that are considered best in class, the use of advanced analytics may reveal further opportunities to increase yield. This was the case at one established European maker of functional and specialty chemicals for a number of industries, including paper, detergents, and metalworking. It boasted a strong history of process improvements since the 1960s, and its average yield was consistently higher than industry benchmarks. In fact, staffers were skeptical that there was much room for improvement. “This is the plant that everybody uses as a reference,” one engineer pointed out.

However, several unexpected insights emerged when the company used neural-network techniques (a form of advanced analytics based on the way the human brain processes information) to measure and compare the relative impact of different production inputs on yield. Among the factors it examined were coolant pressures, temperatures, quantity, and carbon dioxide flow. The analysis revealed a number of previously unseen sensitivities—for instance, levels of variability in carbon dioxide flow prompted significant reductions in yield. By resetting its parameters accordingly, the chemical company was able to reduce its waste of raw materials by 20 percent and its energy costs by around 15 percent, thereby improving overall yield. It is now implementing advanced process controls to complement its basic systems and steer production automatically.

Meanwhile, a precious-metals mine was able to increase its yield and profitability by rigorously assessing production data that were less than complete. The mine was going through a period in which the grade of its ore was declining; one of the only ways it could maintain production levels was to try to speed up or otherwise optimize its extraction and refining processes. The recovery of precious metals from ore is incredibly complex, typically involving between 10 and 15 variables and more than 15 pieces of machinery; extraction treatments may include cyanidation, oxidation, grinding, and leaching.

The production and process data that the operations team at the mine were working with were extremely fragmented, so the first step for the analytics team was to clean them up, using mathematical approaches to reconcile inconsistencies and account for information gaps. The team then examined the data on a number of process parameters (reagents, flow rates, density, and so on) before recognizing that variability in levels of dissolved oxygen (a key parameter in the leaching process) seemed to have the biggest impact on yield. Specifically, the team spotted fluctuations in oxygen concentration, which indicated that there were challenges in process control. The analysis also showed that the best demonstrated performance at the mine occurred on days in which oxygen levels were highest.
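
The article does not describe the mathematical approaches the team used to account for information gaps. One common and simple option is linear interpolation between known readings, followed by a per-day summary of level and variability. The sketch below assumes that approach; the function name `fill_gaps` and the dissolved-oxygen readings are hypothetical.

```python
import statistics

def fill_gaps(readings):
    """Linearly interpolate None values between known neighbours.

    Leading or trailing gaps are filled with the nearest known value.
    """
    filled = list(readings)
    known = [i for i, v in enumerate(filled) if v is not None]
    if not known:
        raise ValueError("no known readings to interpolate from")
    for i in range(len(filled)):
        if filled[i] is not None:
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:
            filled[i] = readings[right]
        elif right is None:
            filled[i] = readings[left]
        else:
            frac = (i - left) / (right - left)
            filled[i] = readings[left] + frac * (readings[right] - readings[left])
    return filled

# Hypothetical dissolved-oxygen readings per day; None marks a missing sample.
days = {
    "day_1": [6.0, None, 7.0, 6.8],
    "day_2": [3.0, 8.0, None, 2.5],
}

daily_stats = {}
for day, raw in days.items():
    clean = fill_gaps(raw)
    daily_stats[day] = {
        "mean": statistics.mean(clean),
        "spread": statistics.pstdev(clean),  # population standard deviation
    }
    print(day, daily_stats[day])
```

With summaries like these, days can be compared on both average oxygen level and fluctuation, which are the two signals the mine's team related to yield.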

As a result of these findings, the mine made minor changes to its leach-recovery processes and increased its average yield by 3.7 percent within three months—a significant gain in a period during which ore grade had declined by some 20 percent. The increase in yield translated into a sustainable $10 million to $20 million annual profit impact for the mine, without it having to make additional capital investments or implement major change initiatives.

Capitalizing on big data

The critical first step for manufacturers that want to use advanced analytics to improve yield is to consider how much data the company has at its disposal. Most companies collect vast troves of process data but typically use them only for tracking purposes, not as a basis for improving operations. For these players, the challenge is to invest in the systems and skill sets that will allow them to optimize their use of existing process information—for instance, centralizing or indexing data from multiple sources so they can be analyzed more easily and hiring data analysts who are trained in spotting patterns and drawing actionable insights from information.

Some companies, particularly those with months- and sometimes years-long production cycles, have too little data to be statistically meaningful when put under an analyst’s lens. The challenge for senior leaders at these companies will be taking a long-term focus and investing in systems and practices to collect more data. They can invest incrementally—for instance, gathering information about one particularly important or particularly complex process step within the larger chain of activities, and then applying sophisticated analysis to that part of the process.

The big data era has only just emerged, but the practice of advanced analytics is grounded in years of mathematical research and scientific application. It can be a critical tool for realizing improvements in yield, particularly in any manufacturing environment in which process complexity, process variability, and capacity restraints are present. Indeed, companies that successfully build up their capabilities in conducting quantitative assessments can set themselves far apart from competitors.

About the authors

Eric Auschitzky is a consultant in McKinsey’s Lyon office, Markus Hammer is a senior expert in the Lisbon office, and Agesan Rajagopaul is an associate principal in the Johannesburg office.

The authors would like to thank Stewart Goodman, Jean-Baptiste Pelletier, Paul Rutten, Alberto Santagostino, Christoph Schmitz, and Ken Somers for their contributions to this article.

Originally posted via “How big data can improve manufacturing”


Source by analyticsweekpick

2016 Trends for the Internet of Things: Expanding Expedient Analytics Beyond the Industrial Internet

The current perception of the Internet of Things relies heavily on its traditional Industrial Internet applications, in which monitoring equipment assets via real-time and predictive analytics greatly reduces the cost and time spent on maintenance and repairs.


Although the Industrial Internet will likely remain a vital component of the IoT, the potential of this vast connectivity and automated action greatly exceeds any one particular use case. With liberal estimates of as many as 50 billion connected devices by 2020, there are a number of trends pertaining to the IoT that will begin in earnest in 2016 to help it reach its full potential:


  • Structure: One of the most significant trends impacting the IoT will be the continued use of JSON-based document stores, which can impose schema and a loose structure on otherwise unstructured data on the fly, making IoT data more manageable and viable for the enterprise.
  • Decentralized Clouds: The trend towards fog computing—based on pushing cloud resources and computations to the edge of the cloud, closer to end users and their devices—will continue to gain credence and reduce bandwidth, decrease time to action, lower costs, and improve access to data.
  • Mobile: The IoT will impact the future of mobile technologies via the cyberforaging phenomenon, in which mobile devices share resources with one another more efficiently than most contemporary decentralized models do. Additionally, the increase of wearables in healthcare and other industries will help to broaden the IoT’s presence throughout the coming year.
  • Revenue Streams: 2016 will see an increase in revenue streams associated with the IoT (outside of the Industrial Internet), including the creation of services and applications structured around it. The most prominent example is provided by connected cars.
  • Security: Security concerns are one of the few remaining inhibitors to the IoT. IDC predicts that more than 50 percent of networks utilizing the IoT will have a security breach by 2018.

Structure

Of all the practical necessities that must be addressed for the IoT to transform enterprise data management practices, contending with continuously streaming unstructured data from a wide variety of sources remains one of the most daunting. While semantic technologies continue to gain traction for the unstructured and semi-structured data created by the IoT and other forms of big data, one of the most profound developments related to structuring this data involves the JSON document format.


MapR Chief Marketing Officer Jack Norris noted that this format “typically is a complex document that has embedded hierarchies in it to exchange record to record depending on what was happening to produce it.” JSON documents are regularly used to exchange data in web applications, and they carry much-needed schema that is invaluable for unruly IoT data, particularly when combined with SQL solutions that can derive such schema. “A lot of the machine generated content, whether it’s log files or sensor data, is kind of a JSON format,” Norris remarked. “It’s definitely one of the fastest growing formats.”
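
To make the idea of deriving schema from hierarchical JSON concrete, here is a minimal sketch in plain Python. It is not how any particular SQL engine works; it only shows that nested records whose fields vary from one record to the next can be flattened into dotted column paths with observed types. The device names and fields are invented for the example.

```python
import json

def flatten(doc, prefix=""):
    """Flatten nested dicts into {'a.b.c': value} using dotted paths."""
    flat = {}
    for key, value in doc.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat

def infer_schema(raw_records):
    """Collect every dotted path seen and the Python type names observed for it."""
    schema = {}
    for raw in raw_records:
        for path, value in flatten(json.loads(raw)).items():
            schema.setdefault(path, set()).add(type(value).__name__)
    return schema

# Hypothetical sensor messages; note that the fields differ between records.
raw_records = [
    '{"device": {"id": "pump-7", "site": "A"}, "reading": {"temp_c": 71.5}}',
    '{"device": {"id": "pump-9"}, "reading": {"temp_c": 68, "vibration": 0.02}}',
]

schema = infer_schema(raw_records)
for path in sorted(schema):
    print(path, sorted(schema[path]))
```

The inferred paths can then serve as column names for querying, and a path with more than one observed type (here `reading.temp_c`) points to values that need normalizing first.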

Decentralized Clouds

Another major trend affecting the IoT pertains to the architecture required to account for the constant generation of data and requisite analytics between machines. Although there are certainly cases in which traditionally centralized approaches to data management have continuing relevance, decentralized methods of managing big data utilizing cloud infrastructure are emerging as a way to provision the real-time analytics required to exploit the IoT. Edge computing (also known as fog computing) facilitates the expedience required for such analytics while helping to preserve computing networks from the undue bandwidth, cost, and latency issues that uninterrupted transmission of data from the IoT creates with centralized models.


Instead, analytics and computations are performed at the edge of the cloud while only the most necessary results are transmitted to centralized data centers, which greatly decreases costs and network issues while improving the viability of leveraging machine-to-machine communication. A recent Forbes post indicates that: “By 2018, at least half of IT spending will be cloud-based, reaching 60 % of all IT infrastructure and 60-70% of all software, services, and technology spending by 2020.”
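
The pattern of computing at the edge and forwarding only the necessary results can be sketched as follows. This is an illustrative assumption about what an edge node might do, not a description of any specific fog computing product: it reduces a window of raw sensor readings to a small summary plus any out-of-range values. The function name, threshold, and readings are hypothetical.

```python
import json

def summarize_at_edge(readings, alert_threshold):
    """Reduce a window of raw readings to a compact summary for upload."""
    if not readings:
        raise ValueError("cannot summarize an empty window")
    return {
        "count": len(readings),
        "min": min(readings),
        "max": max(readings),
        "mean": sum(readings) / len(readings),
        # Only readings above the threshold are forwarded individually.
        "alerts": [r for r in readings if r > alert_threshold],
    }

# Hypothetical window of 1,000 temperature readings with one spike.
readings = [70.0 + (i % 5) * 0.1 for i in range(1000)]
readings[400] = 95.2

summary = summarize_at_edge(readings, alert_threshold=90.0)
raw_bytes = len(json.dumps(readings))
summary_bytes = len(json.dumps(summary))
print(summary["count"], summary["alerts"])
print(f"raw payload: {raw_bytes} bytes, summary payload: {summary_bytes} bytes")
```

The saving depends on the window size and on how many readings trip the threshold, so the window length and alert rules are the main design choices in a setup like this.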

Mobile

The renewed emphasis on cloud resources is indicative of the growing trend towards mobile computing that will continue to influence the IoT this year. Prevalent mobile computing concerns have transcended issues of governance and the incorporation of BYOD and CYOD policies, and have come to include the natural progression of fog computing via the cyberforaging phenomenon. Cyberforaging is the ability of mobile devices to facilitate their analytics and transformation needs (either in the form of ETL or ELT) by utilizing internet-based resources nearest to them. Similar to fog computing, cyberforaging again focuses on the edge of the cloud, and can involve the use of cloudlets. Cloudlets are various devices equipped to provision the cloud to mobile devices (which detect them via their sensors) without the need for a centralized location or data center.


More advanced capabilities include the increasing usage of smart agent technologies. According to Gartner, these technologies “in the form of virtual personal assistants (VPAs) and other agents, will monitor user content and behavior in conjunction with cloud-hosted neural networks to build and maintain data models from which the technology will draw inferences about people, content and contexts.” Although these technologies will gain prominence towards the end of the decade, the combination of them and cyberforaging will expand expectations for mobile computing and the remote computation power the IoT can facilitate.


Revenue Streams

There are a multitude of examples of the way that the IoT will increase revenue streams outside of the Industrial Internet in the coming months. One will be in providing support to machines instead of end users. Gartner noted that: “By 2018, six billion connected things will be requesting support…Strategies will also need to be developed for responding to them that are distinctly different from traditional human-customer communication and problem-solving. Responding to service requests from things will spawn entire service industries…”.


Connected cars will also produce a host of revenue streams, most prominently relating to marketing and advertising. Companies can tailor promotional efforts to offer the most relevant marketing for everything from gas prices to dining or entertainment. Similarly, organizations can pay to include their digital products and services as part of package offerings with manufacturers of smart vehicles. The health care industry will continue to further the IoT by designing and implementing a variety of remote monitoring gadgets and systems for patients, while some of the aforementioned marketing opportunities apply to the influx of wearables and smart clothing that are gaining popularity.

Security

The most substantial trend affecting IoT security pertains to the cloud, which has become the preferred infrastructure for numerous big data applications. With the majority of security breaches attributable to negligence on the part of the enterprise (as opposed to cloud providers), 2016 will see greater adoption of cloud security tools to counteract these issues. Cloud control tools include solutions specifically designed for access security and management of different forms of SOA. Indeed, security-as-a-service may well turn out to be one of the most important facets of SOA, providing a degree of stability and dependability for valued big data applications in the cloud.


Source: 2016 Trends for the Internet of Things: Expanding Expedient Analytics Beyond the Industrial Internet by jelaniharper