Analytics Strategy that is Startup Compliant
With the right tools, capturing data is easy, but being unable to handle that data can lead to chaos. One of the most reliable startup strategies for adopting data analytics is TUM, The Ultimate Metric: the metric that matters most to your startup. Among the advantages of TUM: it answers your most important business question, it clarifies your goals, it inspires innovation, and it helps you understand the business as a whole in quantified terms.
[ DATA SCIENCE Q&A]
Q: Given two fair dice, what is the probability of getting scores that sum to 4? To 8?
A: * Total: 36 equally likely combinations
* Of these, 3 sum to 4: (1,3), (3,1), (2,2)
* So: 3/36 = 1/12
* For a sum of 8, there are 5 combinations: (2,6), (6,2), (3,5), (5,3), (4,4)
* So: 5/36
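The counting argument above can be checked by brute force; a quick Python sketch:

```python
from fractions import Fraction
from itertools import product

# Enumerate all 36 equally likely outcomes of rolling two fair dice
outcomes = list(product(range(1, 7), repeat=2))

def p_sum(target):
    """Probability that the two dice sum to `target`."""
    hits = sum(1 for a, b in outcomes if a + b == target)
    return Fraction(hits, len(outcomes))

print(p_sum(4))  # 1/12
print(p_sum(8))  # 5/36
```

Using `Fraction` keeps the results exact, so 3/36 reduces to 1/12 automatically.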
Using machine learning to improve patient care: Two papers from MIT made strides in the field, one that used ICU data to predict necessary treatments and another that trained models of mortality and length of stay based on electronic health record data.
How CROs Are Helping With Healthcare’s Data Problem: Clinical trial costs are a major cause of rising health care costs. To help streamline this, pharmaceutical companies are increasingly using “contract research organizations” to conduct trials, as they can use their expertise and specialized business intelligence tools to cut costs.
Genomic Medicine Has Entered the Building: Some types of genome sequences now cost as much as an MRI, which has allowed organizations to undertake large-scale studies in personalized medicine.
Creating charts and infographics can be time-consuming. But these tools make it easier.
It’s often said that data is the new world currency, and the web is the exchange bureau through which it’s traded. As consumers, we’re positively swimming in data; it’s everywhere, from labels on food packaging to World Health Organisation reports. As a result, it’s becoming increasingly difficult for designers to present data in a way that stands out from the mass of competing data streams.
One of the best ways to get your message across is to use a visualization to quickly draw attention to the key messages, and by presenting data visually it’s also possible to uncover surprising patterns and observations that wouldn’t be apparent from looking at stats alone.
As author, data journalist and information designer David McCandless said in his TED talk: “By visualizing information, we turn it into a landscape that you can explore with your eyes, a sort of information map. And when you’re lost in information, an information map is kind of useful.”
There are many different ways of telling a story, but everything starts with an idea. So to help you get started we’ve rounded up some of the most awesome data visualization tools available on the web.
Help visitors explore dense data sets with JavaScript library Dygraphs
Dygraphs is a fast, flexible open source JavaScript charting library that allows users to explore and interpret dense data sets. It’s highly customizable, works in all major browsers, and you can even pinch to zoom on mobile and tablet devices.
ZingChart lets you create HTML5 Canvas charts and more
ZingChart is a JavaScript charting library and feature-rich API set that lets you build interactive Flash or HTML5 charts. It offers over 100 chart types to fit your data.
InstantAtlas enables you to create highly engaging visualisations around map data
If you’re looking for a data viz tool with mapping, InstantAtlas is worth checking out. This tool enables you to create highly-interactive dynamic and profile reports that combine statistics and map data to create engaging data visualizations.
Timeline is a fantastic widget which renders a beautiful interactive timeline that responds to the user’s mouse, making it easy to create advanced timelines that convey a lot of information in a compressed space.
Each element can be clicked to reveal more in-depth information, making this a great way to give a big-picture view while still providing full detail.
Developed by MIT, and fully open source, Exhibit makes it easy to create interactive maps and other data-based visualizations that are oriented towards teaching or static/historical data sets, such as flags pinned to countries, or birthplaces of famous people.
Integrate and develop interactive maps within your site with this cool tool
Modest Maps is a lightweight, simple mapping tool for web designers that makes it easy to integrate and develop interactive maps within your site, using them as a data visualization tool.
The API is easy to get to grips with, and offers a useful number of hooks for adding your own interaction code, making it a good choice for designers looking to fully customise their user’s experience to match their website or web app. The basic library can also be extended with additional plugins, adding to its core functionality and offering some very useful data integration options.
Use OpenStreetMap data and integrate data visualisation in an HTML5/CSS3 wrapper
Another mapping tool, Leaflet makes it easy to use OpenStreetMap data and integrate fully interactive data visualisation in an HTML5/CSS3 wrapper.
The core library itself is very small, but there are a wide range of plugins available that extend the functionality with specialist functionality such as animated markers, masks and heatmaps. Perfect for any project where you need to show data overlaid on a geographical projection (including unusual projections!).
Billed as a “computational knowledge engine”, the Google rival WolframAlpha is really good at intelligently displaying charts in response to data queries without the need for any configuration. If you’re using publicly available data, it offers a simple widget builder that makes it easy to get visualizations onto your site.
Visual.ly makes data visualization as simple as it can be
Visual.ly is a combined gallery and infographic generation tool. It offers a simple toolset for building stunning data representations, as well as a platform to share your creations. This goes beyond pure data visualisation, but if you want to create something that stands on its own, it’s a fantastic resource and an info-junkie’s dream come true!
Visualize Free is a hosted tool that allows you to use publicly available datasets, or upload your own, and build interactive visualizations to illustrate the data. The visualizations go well beyond simple charts, and the service is completely free; while development work requires Flash, output can be rendered in HTML5.
Making the ugly beautiful – that’s Better World Flux
Oriented towards making positive change in the world, Better World Flux has some lovely visualizations of some pretty depressing data. It would be very useful, for example, if you were writing an article about world poverty, child undernourishment or access to clean water. This tool doesn’t allow you to upload your own data, but does offer a rich interactive output.
A comprehensive JavaScript/HTML5 charting solution for your data visualization needs
FusionCharts Suite XT brings you 90+ charts and gauges, 965 data-driven maps, and ready-made business dashboards and demos. FusionCharts comes with an extensive JavaScript API that makes it easy to integrate with any AJAX application or JavaScript framework. These charts, maps and dashboards are highly interactive, customizable and work across all devices and platforms. They also have a comparison of the top JavaScript charting libraries which is worth checking out.
jqPlot is a nice solution for line and point charts
Another jQuery plugin, jqPlot is a nice solution for line and point charts. It comes with a few nice additional features such as the ability to generate trend lines automatically, and interactive points that can be adjusted by the website visitor, updating the dataset accordingly.
Dipity has free and premium versions to suit your needs
Dipity allows you to create rich interactive timelines and embed them on your website. It offers a free version and a premium product, with the usual restrictions and limitations present. The timelines it outputs are beautiful and fully customisable, and are very easy to embed directly into your page.
Developed by IBM, Many Eyes allows you to quickly build visualizations from publicly available or uploaded data sets, and features a wide range of analysis types including the ability to scan text for keyword density and saturation. This is another great example of a big company supporting research and sharing the results openly.
D3.js is a JavaScript library that uses HTML, SVG, and CSS to render some amazing diagrams and charts from a variety of data sources. This library, more than most, is capable of some seriously advanced visualizations with complex data sets. It’s open source, and uses web standards so is very accessible. It also includes some fantastic user interaction support.
JavaScript InfoVis Toolkit includes a handy modular structure
A fantastic library written by Nicolas Belmonte, the JavaScript InfoVis Toolkit includes a modular structure, allowing you to only force visitors to download what’s absolutely necessary to display your chosen data visualizations. This library has a number of unique styles and swish animation effects, and is free to use (although donations are encouraged).
If you need to generate charts and graphs server-side, jpGraph offers a PHP-based solution with a wide range of chart types. It’s free for non-commercial use, and features extensive documentation. By rendering on the server, this is guaranteed to provide a consistent visual output, albeit at the expense of interactivity and accessibility.
Highcharts is a JavaScript charting library with a huge range of chart options available. The output is rendered using SVG in modern browsers and VML in Internet Explorer. The charts are beautifully animated into view automatically, and the framework also supports live data streams. It’s free to download and use non-commercially (and licensable for commercial use). You can also play with the extensive demos using JSFiddle.
Google Charts has an excellent selection of tools available
The seminal charting solution for much of the web, Google Charts is highly flexible and has an excellent set of developer tools behind it. It’s an especially useful tool for specialist visualizations such as geocharts and gauges, and it also includes built-in animation and user interaction controls.
It isn’t graphically flexible, but Excel is a good way to explore data: for example, by creating ‘heat maps’ like this one
You can actually do some pretty complex things with Excel, from ‘heat maps’ of cells to scatter plots. As an entry-level tool, it can be a good way of quickly exploring data, or creating visualizations for internal use, but the limited default set of colours, lines and styles make it difficult to create graphics that would be usable in a professional publication or website. Nevertheless, as a means of rapidly communicating ideas, Excel should be part of your toolbox.
Excel comes as part of the commercial Microsoft Office suite, so if you don’t have access to it, Google’s spreadsheets – part of Google Docs and Google Drive – can do many of the same things. Google ‘eats its own dog food’, so the spreadsheet can generate the same charts as the Google Chart API. This will get you familiar with what is possible before stepping off and using the API directly for your own projects.
CSV (Comma-Separated Values) and JSON (JavaScript Object Notation) aren’t actual visualization tools, but they are common formats for data. You’ll need to understand their structures and how to get data in or out of them.
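As a minimal illustration of moving between the two formats with Python’s standard library (the field names and values here are invented for the example):

```python
import csv
import io
import json

# A small CSV snippet — the columns are just an illustration
raw = """country,population
Iceland,372000
Malta,519000
"""

# CSV in: DictReader yields one dict per row, keyed by the header line
rows = list(csv.DictReader(io.StringIO(raw)))

# JSON out: the same records as a JSON array, ready for a charting library
payload = json.dumps(rows, indent=2)
print(payload)
```

Note that `csv` reads every value as a string; converting `population` to an integer is up to you, whereas JSON preserves numeric types natively.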
Crossfilter in action: by restricting the input range on any one chart, data is affected everywhere. This is a great tool for dashboards or other interactive tools with large volumes of data behind them
As we build more complex tools to enable clients to wade through their data, we are starting to create graphs and charts that double as interactive GUI widgets. JavaScript library Crossfilter can be both of these. It displays data, but at the same time, you can restrict the range of that data and see other linked charts react.
Tangle creates complex interactive graphics. Pulling on any one of the knobs affects data throughout all of the linked charts. This creates a real-time feedback loop, enabling you to understand complex equations in a more intuitive way
The line between content and control blurs even further with Tangle. When you are trying to describe a complex interaction or equation, letting the reader tweak the input values and see the outcome for themselves provides both a sense of control and a powerful way to explore data. JavaScript library Tangle is a set of tools to do just this.
Dragging on variables enables you to increase or decrease their values and see an accompanying chart update automatically. The results are only just short of magical.
Aimed more at specialist data visualisers, the Polymaps library creates image and vector-tiled maps using SVG
Polymaps is a mapping library that is aimed squarely at a data visualization audience. Offering a unique approach to styling the maps it creates, analogous to CSS selectors, it’s a great resource to know about.
It isn’t easy to master, but OpenLayers is arguably the most complete, robust mapping solution discussed here
OpenLayers is probably the most robust of these mapping libraries. The documentation isn’t great and the learning curve is steep, but for certain tasks nothing else can compete. When you need a very specific tool no other library provides, OpenLayers is always there.
Kartograph’s projections breathe new life into our standard slippy maps
Kartograph’s tag line is ‘rethink mapping’ and that is exactly what its developers are doing. We’re all used to the Mercator projection, but Kartograph brings far more choices to the table. If you aren’t working with worldwide data, and can place your map in a defined box, Kartograph has the options you need to stand out from the crowd.
CartoDB provides an unparalleled way to combine maps and tabular data to create visualisations
CartoDB is a must-know site. The ease with which you can combine tabular data with maps is second to none. For example, you can feed in a CSV file of address strings and it will convert them to latitudes and longitudes and plot them on a map, but there are many other uses. It’s free for up to five tables; after that, there are monthly pricing plans.
Processing provides a cross-platform environment for creating images, animations, and interactions
Processing has become the poster child for interactive visualizations. It enables you to write much simpler code which is in turn compiled into Java.
There is also a Processing.js project to make it easier for websites to use Processing without Java applets, plus a port to Objective-C so you can use it on iOS. It is a desktop application, but can be run on all platforms, and given that it is now several years old, there are plenty of examples and code from the community.
NodeBox is a quick, easy way for Python-savvy developers to create 2D visualisations
NodeBox is an OS X application for creating 2D graphics and visualizations. You need to know and understand Python code, but beyond that it’s a quick and easy way to tweak variables and see results instantly. It’s similar to Processing, but without all the interactivity.
A powerful free software environment for statistical computing and graphics, R is the most complex of the tools listed here
How many other pieces of software have an entire search engine dedicated to them? A statistical package used to parse large data sets, R is a very complex tool, and one that takes a while to understand, but it has a strong community and package library, with more and more being produced.
The learning curve is one of the steepest of any of the tools listed here, but you must be comfortable using R if you want to work at this level.
A collection of machine-learning algorithms for data-mining tasks, Weka is a powerful way to explore data
When you get deeper into being a data scientist, you will need to expand your capabilities from just creating visualizations to data mining. Weka is a good tool for classifying and clustering data based on various attributes – both powerful ways to explore data – but it also has the ability to generate simple plots.
Gephi in action. Coloured regions represent clusters of data that the system is guessing are similar
When people talk about relatedness, social graphs and co-relations, they are really talking about how two nodes are related to one another relative to the other nodes in a network. The nodes in question could be people in a company, words in a document or passes in a football game, but the maths is the same.
Gephi, a graph-based visualiser and data explorer, can not only crunch large data sets and produce beautiful visualizations, but also allows you to clean and sort the data. It’s a very niche use case and a complex piece of software, but it puts you ahead of anyone else in the field who doesn’t know about this gem.
iCharts can have interactive elements, and you can pull in data from Google Docs
The iCharts service provides a hosted solution for creating and presenting compelling charts for inclusion on your website. There are many different chart types available, and each is fully customisable to suit the subject matter and colour scheme of your site.
Charts can have interactive elements, and can pull data from Google Docs, Excel spreadsheets and other sources. The free account lets you create basic charts, while you can pay to upgrade for additional features and branding-free options.
Create animated visualisations with this jQuery plugin
Flot is a specialised plotting library for jQuery, but it has many handy features and crucially works across all common browsers including Internet Explorer 6. Data can be animated and, because it’s a jQuery plugin, you can fully control all the aspects of animation, presentation and user interaction. This does mean that you need to be familiar with (and comfortable with) jQuery, but if that’s the case, this makes a great option for including interactive charts on your website.
This handy JavaScript library offers a range of data visualisation options
This handy JavaScript library offers a wide range of data visualization options which are rendered using SVG. This makes for a flexible approach that can easily be integrated within your own web site/app code, and is limited only by your own imagination.
That said, it’s a bit more hands-on than some of the other tools featured here (a victim of being so flexible), so unless you’re a hardcore coder, you might want to check out some of the more point-and-click orientated options first!
jQuery Visualize Plugin is an open source charting plugin
Written by the team behind jQuery’s ThemeRoller and jQuery UI websites, jQuery Visualize Plugin is an open source charting plugin for jQuery that uses HTML Canvas to draw a number of different chart types. One of the key features of this plugin is its focus on achieving ARIA support, making it friendly to screen-readers. It’s free to download from this page on GitHub.
Further reading
A great Tumblr blog for visualization examples and inspiration: vizualize.tumblr.com
Nicholas Felton’s annual reports are now infamous, but he also has a Tumblr blog of great things he finds.
From the guy who helped bring Processing into the world: benfry.com/writing
Stamen Design is always creating interesting projects: stamen.com
Eyeo Festival brings some of the greatest minds in data visualization together in one place, and you can watch the videos online.
Brian Suda is a master informatician and author of Designing with Data, a practical guide to data visualisation.
Save yourself from the zombie apocalypse of unscalable models
One living, breathing zombie in today’s analytical models is the absence of error bars. Not every model is scalable or holds up as data grows. The error bars attached to almost every model should be duly calibrated: as business models rake in more data, error bars keep them sensible and in check. If error bars are not accounted for, our models become susceptible to failure, leading to a Halloween we never want to see.
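As a sketch of why error bars need recalibrating as data volume grows: the standard error of a mean shrinks roughly as 1/√n. A minimal Python illustration (the simulated metric and its parameters are invented for the example):

```python
import math
import random
import statistics

random.seed(42)

def mean_with_error_bar(sample):
    """Return (mean, standard error of the mean) for a sample."""
    n = len(sample)
    se = statistics.stdev(sample) / math.sqrt(n)
    return statistics.mean(sample), se

# Same underlying process, increasing amounts of data
small = [random.gauss(100, 15) for _ in range(50)]
large = [random.gauss(100, 15) for _ in range(5000)]

for name, sample in [("n=50", small), ("n=5000", large)]:
    m, se = mean_with_error_bar(sample)
    print(f"{name}: mean={m:.1f} ± {se:.2f}")
```

The point estimate barely moves, but the error bar tightens by roughly a factor of ten; a model whose error bars are never recomputed against the growing data set is the zombie in question.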
[ DATA SCIENCE Q&A]
Q: Explain what a false positive and a false negative are. Why is it important to distinguish them? Provide examples of when false positives are more important than false negatives, when false negatives are more important than false positives, and when the two types of error are equally important.
A: * False positive
Improperly reporting the presence of a condition when it is absent in reality. Example: an HIV-positive test result when the patient is actually HIV negative.
* False negative
Improperly reporting the absence of a condition when it is actually present. Example: not detecting a disease when the patient has it.
When false positives are more important than false negatives:
– In a non-contagious disease, where a treatment delay doesn’t have any long-term consequences but the treatment itself is grueling
– HIV test: psychological impact
When false negatives are more important than false positives:
– If early treatment is important for good outcomes
– In quality control: a defective item passes through the cracks!
– Software testing: a test to catch a virus has failed
A question I was regularly asked when working on different customer sites and answering questions on forums was “What is the best practice when using context variables?”
My years of working with Talend have led me to work with context variables in a way that minimizes the effort I need to put into ongoing maintenance and moving them between environments. This blog series is intended to give you an insight into the best practices I use as well as highlight the potential pitfalls that can arise from using the Talend context variable functionality without fully understanding it.
Contexts, Context Variables and Context Groups
To start, I want to ensure that we are all on the same page with regard to terminology. There are 3 ways “Context” is used in Talend:
Context variable: A variable which can be set either at compile time or runtime. It can be changed and allows variables which would otherwise be hardcoded to be more dynamic.
Context: The environment or category of the value held by the context variable. Most of the time Contexts are DEV, TEST, PROD, UAT, etc. This allows you to set up one context variable and assign a different value per environment.
Context Group: A group of context variables which are packaged together for ease of use. Context Groups can be dragged and dropped into jobs so that you do not have to set up the same context variables in different jobs. They can also be updated (added to) in one location and then the changes can be distributed to the jobs that use those Context Groups.
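As an analogy only (this is Python, not Talend, and all the names are illustrative), the relationship between a Context Group and a Context can be sketched as an environment-keyed lookup:

```python
# Illustrative analogy, not Talend code: a "context group" is a named set of
# variables; a "context" selects which environment's values are used.
CONTEXT_GROUP_DB = {
    "DEV":  {"db_host": "dev-db.internal",  "db_port": 5432},
    "TEST": {"db_host": "test-db.internal", "db_port": 5432},
    "PROD": {"db_host": "db.example.com",   "db_port": 5432},
}

def resolve(group, context):
    """Return the variable values for the chosen context (environment)."""
    try:
        return group[context]
    except KeyError:
        raise ValueError(f"Unknown context: {context!r}")

print(resolve(CONTEXT_GROUP_DB, "DEV")["db_host"])  # dev-db.internal
```

The same variable name (`db_host`) resolves to a different value per environment, which is exactly the convenience — and, as we will see, the trap — that Talend Contexts provide.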
I’ve found that many people will refer to “context variables” as “contexts”. This leads to confusion in discussions, so if these terms are used incorrectly online it really can confuse the issue. So, now that we have a common set of definitions, let’s move forward.
Potential Pitfalls with Contexts
While context variables are incredibly useful when working with Talend, they can also introduce some unforeseen problems if not fully understood. In my experience, the biggest cause of problems is Contexts themselves. Quite simply, I do not use anything but a default Context.
At the beginning of your Talend journey, Contexts come across as a genius idea: developers can build against one environment, using that environment’s context variable values, then, when the code is ready to test, change the Context at the flick of a switch. That is true (kind of), but mainly for smaller data integration jobs. More often than not, however, they open up developers and testers to horrible and time-consuming unexpected behavior. Below is just one scenario demonstrating this.
Let’s say a developer has built a job which uses a Context Group configured to supply database connection parameters. She has set up 4 Contexts (DEV1, DEV2, TEST and PROD) and has configured the different Context Variable values for each Context. In her main job, she reads from the database and then passes some of the data to Child Jobs using tRunJob components. Some of these Child Jobs have their own Child Jobs and all Child Jobs make use of the database. Thus, all jobs make use of the Context Group holding the database credentials. While she is developing, she sets the Context within the tRunJobs to DEV1. This is great. She can debug her Job until she is happy that it is working. However, she needs to test on DEV2 because it has a slightly cleaner environment. When she runs the Parent Job she changes the default Context from DEV1 to DEV2 and runs the Job. It seems to work, but she cannot see the database updates in her DEV1 database. Why? She then realizes that her Child Jobs are all defaulted to use DEV1 and not DEV2.
Now there are ways around this, she could ensure that all of her tRunJobs are set with the correct Context. But what if she has dozens of them? How long will that take? She could ensure that “Transmit whole context” is set in each tRunJob. But what happens if a Child Job is using a Context variable or Context Group that is not used by any of the Parent Jobs? We are back to the same problem of having to change all of the tRunJob Contexts. But this doesn’t affect us outside of the Talend Studio, right? Wrong.
If the developer compiled that job to use on the command-line, even if she sets “Apply Context to children jobs” on the Build Job page, all this does is hardcode all of the Child Jobs’ Contexts to that selected in the Context scripts drop down. When you run it, if you change the Context that the Job needs to run for, the Child Jobs stick with the one that has been compiled. The same thing happens in the Talend Administration Center (TAC) as well.
Now, this does have some uses. Maybe your Contexts are not for environments and you want to be able to use different Contexts within the same environment? That is a legitimate (if not slightly unusual) scenario. There are other examples of these sorts of problems, but I think you get the idea.
In the early days of Talend, Contexts were brilliant. But these days (unless you have a particular use case where multiple Contexts are used within a single environment), there are better ways of handling Context variables for multiple environments. I’ll cover all of those ways and best practices in part two and three of our blog series coming out next week. Until next time!
Data Analytics Success Starts with Empowerment
Being data driven is not so much a tech challenge as an adoption challenge, and adoption has its roots in an organization’s cultural DNA. Great data-driven organizations work the data-driven culture into their corporate DNA. A culture of connection, interaction, sharing and collaboration is what it takes to be data driven. It is about being empowered more than it is about being educated.
[ DATA SCIENCE Q&A]
Q:Do you think 50 small decision trees are better than a large one? Why?
A: * Yes!
* A more robust model (an ensemble of weak learners combines into a strong learner)
* It is better to improve a model by taking many small steps than a few large ones
* If one tree is erroneous, it can be corrected by the others
* Less prone to overfitting
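The intuition behind the ensemble answer can be made concrete: if each of n independent weak trees is correct with probability p > 0.5, a majority vote is correct far more often. A sketch of the binomial calculation (assuming independence, which real trees only approximate):

```python
import math

def majority_vote_accuracy(n_models, p):
    """Probability that a majority of n independent models, each correct
    with probability p, gives the right answer (odd n avoids ties)."""
    k_needed = n_models // 2 + 1
    return sum(
        math.comb(n_models, k) * p**k * (1 - p) ** (n_models - k)
        for k in range(k_needed, n_models + 1)
    )

single = 0.6                                 # one weak tree: 60% accurate
ensemble = majority_vote_accuracy(51, 0.6)   # 51 independent weak trees
print(f"single tree: {single:.2f}, ensemble: {ensemble:.3f}")
```

In practice bagging and random feature selection are what push trees toward the independence this calculation assumes.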
Walmart handles more than 1 million customer transactions every hour, which are imported into databases estimated to contain more than 2.5 petabytes of data.
While the transformation to a data-driven culture needs to come from the top of the organization, data skills must permeate through all areas of the business.
Rather than being the responsibility of one person or department, assuring data availability and integrity must be a team sport in modern data-centric businesses. Everyone must be involved and made accountable throughout the process.
The challenge for enterprises is to effectively enable greater data access among the workforce while maintaining oversight and quality.
The Evolution of the Data Team
Businesses are recognizing the value and opportunities that data creates. There is an understanding that data needs to be handled and processed efficiently. For some companies, this has led to the formation of a new department of data analysts and scientists.
The data team is led by a Chief Data Officer (CDO), a role that is set to become key to business success in the digital era, according to recent research from Gartner. While earlier iterations of roles within the data team centered on data governance, data quality and regulatory issues, the focus is shifting. Data analysts and scientists are now expected to contribute and deliver a data-driven culture across the company, while also driving business value. According to the Gartner survey, the skills required for roles within the data team have expanded to span data management, analytics, data science, ethics, and digital transformation.
Businesses are clearly recognizing the importance of the data team’s functions and are making significant investments in it. Office budgets for the data team increased by an impressive 23% between 2016 and 2017 according to Gartner. What’s more, some 15% of the CDOs that took part in the study revealed that their budgets were more than $20 million for their departments, compared with just 7% who said the same in 2016. The increasing popularity and evolution of these new data roles has largely been driven by GDPR in Europe and by new data protection regulations in the US. And the evidence suggests that the position will be essential for ensuring the successful transfer of data skills throughout businesses of all sizes.
The Data Skills Shortage
Data is an incredibly valuable resource, but businesses can only unlock its full potential if they have the talent to analyze that data and produce actionable insights that help them to better understand their customers’ needs. However, companies are already struggling to cope with the big data ecosystem due to a skills shortage and the problem shows little sign of improving. In fact, Europe could see a shortage of up to 500,000 IT professionals by 2020, according to the latest research from consultancy firm Empirica.
The rapidly evolving digital landscape is partly to blame, as the skills required have changed radically in recent years. The data science skills needed at today's data-driven companies are more wide-ranging than ever before. The modern workforce is now required to have a firm grasp of computer science, covering everything from databases to the cloud, according to strategic advisor and best-selling author Bernard Marr. In addition, analytical skills are essential to make sense of the ever-increasing data gathered by enterprises, while mathematical skills are also vital, since much of the captured data, particularly from IoT devices and sensors, is numerical. These skills must also sit alongside more traditional business and communication skills, as well as the ability to be creative and adapt to developing technologies.
The need for these skills is set to increase, with IBM predicting that the number of jobs for data professionals will rise by a massive 28% by 2020. The good news is that businesses are already recognizing the importance of digital skills in the workforce, with the role of Data Scientist taking the number one spot in Glassdoor's Best Jobs in America for the past three years, with a staggering 4,524 positions available in 2018.
Data Training Employees
Data quality management is a task that extends across all functional areas of a company. It therefore makes sense to give the employees in the specialist departments tools to ensure data quality in self-service. Cloud-based tools that can be rolled out quickly and easily in the departments are essential. This way, companies can gradually improve their data quality while also increasing the value of their data.
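Such self-service quality checks boil down to a small set of rules that each record must pass. The following is a minimal sketch of that idea; the rule names, fields, and sample records are invented for illustration and do not refer to any specific vendor tool.

```python
# Hypothetical self-service data-quality check: each rule is a predicate
# over a record; the report maps rule name -> indices of failing records.

def check_quality(records, rules):
    """Return a dict mapping rule name -> list of failing record indices."""
    failures = {name: [] for name in rules}
    for i, rec in enumerate(records):
        for name, rule in rules.items():
            if not rule(rec):
                failures[name].append(i)
    return failures

# Sample records (invented) with two deliberate quality problems.
customers = [
    {"id": 1, "email": "a@example.com", "age": 34},
    {"id": 2, "email": "", "age": 29},          # missing email
    {"id": 3, "email": "c@example.com", "age": -5},  # invalid age
]

rules = {
    "email_present": lambda r: bool(r["email"]),
    "age_valid": lambda r: 0 <= r["age"] <= 120,
}

print(check_quality(customers, rules))
# {'email_present': [1], 'age_valid': [2]}
```

Because the rules live in a plain dictionary, a specialist department can add or adjust its own checks without touching the reporting logic.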
As the number of data workers triples and GDPR raises the stakes of staying compliant, businesses must think of good data management as a team sport. Investing in the Chief Data Officer role and data skills now will enable forward-thinking businesses to reap the rewards, both in the short term and further into the future.
Data Have Meaning
We live in a Big Data world in which everything is quantified. While the emphasis of Big Data has been on distinguishing its three defining characteristics (the infamous three Vs), we need to be cognizant of the fact that data have meaning. That is, the numbers in your data represent something of interest, an outcome that is important to your business. The meaning of those numbers is a matter of the veracity of your data.
[ DATA SCIENCE Q&A]
Q: How frequently must an algorithm be updated?
A: You want to update an algorithm when:
– You want the model to evolve as data streams through infrastructure
– The underlying data source is changing
– Example: a retail store model that remains accurate as the business grows
– Dealing with non-stationarity
Some options:
– Incremental algorithms: the model is updated every time it sees a new training example
Note: simple, and you always have an up-to-date model; however, you can't incorporate data to different degrees.
Sometimes mandatory: when data must be discarded once seen (privacy)
– Periodic re-training in batch mode: simply buffer the relevant data and update the model every so often
Note: more decisions and more complex implementations
How frequently?
– Is the sacrifice worth it?
– Data horizon: how quickly do you need the most recent training example to be part of your model?
– Data obsolescence: how long does it take before data is irrelevant to the model? Are some older instances more relevant than the newer ones?
Economics: generally, newer instances are more relevant than older ones. However, seasonality matters: data from the same month or quarter of the previous year can be more relevant than more recent data from a different period. Similarly, in a recession, data from previous recessions can be more relevant than newer data from a different economic cycle.
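The two update strategies in the answer above can be contrasted with a toy model. This is an illustrative sketch, not any particular library's API: the "model" is just a running mean of the target, updated either per example (incremental) or every few examples from a buffer (periodic batch re-training).

```python
# Incremental vs. periodic batch updates, using a running mean as a
# stand-in for a real model (illustrative only).

class IncrementalMean:
    """Updated every time a new training example arrives."""
    def __init__(self):
        self.n = 0
        self.mean = 0.0
    def update(self, y):
        self.n += 1
        self.mean += (y - self.mean) / self.n  # online mean update

class BatchMean:
    """Buffers observations and re-trains only every `period` examples."""
    def __init__(self, period=3):
        self.period = period
        self.buffer = []
        self.mean = 0.0
    def update(self, y):
        self.buffer.append(y)
        if len(self.buffer) % self.period == 0:
            self.mean = sum(self.buffer) / len(self.buffer)  # re-train

stream = [10.0, 12.0, 11.0, 20.0]  # the last value marks a shift
inc, batch = IncrementalMean(), BatchMean(period=3)
for y in stream:
    inc.update(y)
    batch.update(y)

print(inc.mean)    # 13.25 -- always up to date
print(batch.mean)  # 11.0  -- stale until the next re-training at example 6
```

The gap between the two outputs after the shift at the fourth example is exactly the data-horizon trade-off the Q&A describes: the batch model stays stale until its next scheduled re-training.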
Manufacturers taking advantage of advanced analytics can reduce process flaws, saving time and money.
In the past 20 years or so, manufacturers have been able to reduce waste and variability in their production processes and dramatically improve product quality and yield (the amount of output per unit of input) by implementing lean and Six Sigma programs. However, in certain processing environments, such as pharmaceuticals, chemicals, and mining, extreme swings in variability are a fact of life, sometimes even after lean techniques have been applied. Given the sheer number and complexity of production activities that influence yield in these and other industries, manufacturers need a more granular approach to diagnosing and correcting process flaws. Advanced analytics provides just such an approach.
Advanced analytics refers to the application of statistics and other mathematical tools to business data in order to assess and improve practices (exhibit). In manufacturing, operations managers can use advanced analytics to take a deep dive into historical process data, identify patterns and relationships among discrete process steps and inputs, and then optimize the factors that prove to have the greatest effect on yield. Many global manufacturers in a range of industries and geographies now have an abundance of real-time shop-floor data and the capability to conduct such sophisticated statistical assessments. They are taking previously isolated data sets, aggregating them, and analyzing them to reveal important insights.
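The core ranking step described above, finding which process inputs move most closely with yield, can be sketched in a few lines. The parameter names and batch readings below are invented for illustration; a real analysis would use far more batches and more robust statistics than a simple Pearson correlation.

```python
# Toy sketch: rank process parameters by the absolute Pearson correlation
# of their historical readings with batch yield (all data invented).
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# One list per parameter; each position is one production batch.
batches = {
    "temperature": [70, 72, 68, 75, 71],
    "pressure":    [1.1, 1.0, 1.2, 0.9, 1.1],
    "flow_rate":   [5.0, 5.2, 4.9, 5.1, 5.0],
}
yield_pct = [81, 84, 78, 88, 82]

ranked = sorted(
    ((abs(pearson(vals, yield_pct)), name) for name, vals in batches.items()),
    reverse=True,
)
for r, name in ranked:
    print(f"{name}: |r| = {r:.2f}")
```

Parameters at the top of the ranking are the candidates to investigate and optimize first, which is essentially what the case studies below did at much larger scale.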
Exhibit
Consider the production of biopharmaceuticals, a category of healthcare products that includes vaccines, hormones, and blood components. They are manufactured using live, genetically engineered cells, and production teams must often monitor more than 200 variables within the production flow to ensure the purity of the ingredients as well as the substances being made. Two batches of a particular substance, produced using an identical process, can still exhibit a variation in yield of between 50 and 100 percent. This huge unexplained variability can create issues with capacity and product quality and can draw increased regulatory scrutiny.
One top-five biopharmaceuticals maker used advanced analytics to significantly increase its yield in vaccine production while incurring no additional capital expenditures. The company segmented its entire process into clusters of closely related production activities; for each cluster, it took far-flung data about process steps and the materials used and gathered them in a central database.
A project team then applied various forms of statistical analysis to the data to determine interdependencies among the different process parameters (upstream and downstream) and their impact on yield. Nine parameters proved to be most influential, especially time to inoculate cells and conductivity measures associated with one of the chromatography steps. The manufacturer made targeted process changes to account for these nine parameters and was able to increase its vaccine yield by more than 50 percent, worth between $5 million and $10 million in yearly savings for a single substance, one of hundreds it produces.
Developing unexpected insights
Even within manufacturing operations that are considered best in class, the use of advanced analytics may reveal further opportunities to increase yield. This was the case at one established European maker of functional and specialty chemicals for a number of industries, including paper, detergents, and metalworking. It boasted a strong history of process improvements since the 1960s, and its average yield was consistently higher than industry benchmarks. In fact, staffers were skeptical that there was much room for improvement. "This is the plant that everybody uses as a reference," one engineer pointed out.
However, several unexpected insights emerged when the company used neural-network techniques (a form of advanced analytics based on the way the human brain processes information) to measure and compare the relative impact of different production inputs on yield. Among the factors it examined were coolant pressures, temperatures, quantity, and carbon dioxide flow. The analysis revealed a number of previously unseen sensitivities: for instance, levels of variability in carbon dioxide flow prompted significant reductions in yield. By resetting its parameters accordingly, the chemical company was able to reduce its waste of raw materials by 20 percent and its energy costs by around 15 percent, thereby improving overall yield. It is now implementing advanced process controls to complement its basic systems and steer production automatically.
Meanwhile, a precious-metals mine was able to increase its yield and profitability by rigorously assessing production data that were less than complete. The mine was going through a period in which the grade of its ore was declining; one of the only ways it could maintain production levels was to try to speed up or otherwise optimize its extraction and refining processes. The recovery of precious metals from ore is incredibly complex, typically involving between 10 and 15 variables and more than 15 pieces of machinery; extraction treatments may include cyanidation, oxidation, grinding, and leaching.
The production and process data that the operations team at the mine were working with were extremely fragmented, so the first step for the analytics team was to clean them up, using mathematical approaches to reconcile inconsistencies and account for information gaps. The team then examined the data on a number of process parameters (reagents, flow rates, density, and so on) before recognizing that variability in levels of dissolved oxygen (a key parameter in the leaching process) seemed to have the biggest impact on yield. Specifically, the team spotted fluctuations in oxygen concentration, which indicated that there were challenges in process control. The analysis also showed that the best demonstrated performance at the mine occurred on days in which oxygen levels were highest.
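The clean-up step, accounting for information gaps before any analysis, can be as simple as interpolating between known readings. The sketch below shows one such mathematical approach; the dissolved-oxygen values are invented, and a real reconciliation effort would combine several techniques, not just linear interpolation.

```python
# Fill gaps (None values) in a fragmented sensor series by linear
# interpolation between the nearest known readings (illustrative only;
# assumes the first and last readings are present).

def interpolate_gaps(series):
    """Return a copy of `series` with None runs linearly interpolated."""
    out = list(series)
    i = 0
    while i < len(out):
        if out[i] is None:
            j = i
            while j < len(out) and out[j] is None:
                j += 1  # find the next known reading
            prev, nxt = out[i - 1], out[j]
            span = j - (i - 1)
            for k in range(i, j):
                out[k] = prev + (nxt - prev) * (k - (i - 1)) / span
            i = j
        else:
            i += 1
    return out

dissolved_o2 = [6.0, None, None, 7.5, 7.2, None, 6.9]
print(interpolate_gaps(dissolved_o2))
```

For the sample series this yields readings of 6.5 and 7.0 inside the first gap and roughly 7.05 in the second, giving the analytics team a complete series to correlate against yield.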
As a result of these findings, the mine made minor changes to its leach-recovery processes and increased its average yield by 3.7 percent within three months, a significant gain in a period during which ore grade had declined by some 20 percent. The increase in yield translated into a sustainable $10 million to $20 million annual profit impact for the mine, without it having to make additional capital investments or implement major change initiatives.
Capitalizing on big data
The critical first step for manufacturers that want to use advanced analytics to improve yield is to consider how much data the company has at its disposal. Most companies collect vast troves of process data but typically use them only for tracking purposes, not as a basis for improving operations. For these players, the challenge is to invest in the systems and skill sets that will allow them to optimize their use of existing process information: for instance, centralizing or indexing data from multiple sources so they can be analyzed more easily, and hiring data analysts who are trained in spotting patterns and drawing actionable insights from information.
Some companies, particularly those with months- and sometimes years-long production cycles, have too little data to be statistically meaningful when put under an analyst's lens. The challenge for senior leaders at these companies will be taking a long-term focus and investing in systems and practices to collect more data. They can invest incrementally, for instance, gathering information about one particularly important or particularly complex process step within the larger chain of activities, and then applying sophisticated analysis to that part of the process.
The big data era has only just emerged, but the practice of advanced analytics is grounded in years of mathematical research and scientific application. It can be a critical tool for realizing improvements in yield, particularly in any manufacturing environment in which process complexity, process variability, and capacity constraints are present. Indeed, companies that successfully build up their capabilities in conducting quantitative assessments can set themselves far apart from competitors.
About the authors
Eric Auschitzky is a consultant in McKinsey's Lyon office, Markus Hammer is a senior expert in the Lisbon office, and Agesan Rajagopaul is an associate principal in the Johannesburg office.
The authors would like to thank Stewart Goodman, Jean-Baptiste Pelletier, Paul Rutten, Alberto Santagostino, Christoph Schmitz, and Ken Somers for their contributions to this article.