The 3 Step Guide CIO’s Need to Build a Data-Driven Culture

Today’s CIO has more data available than ever before. This presents an opportunity for big improvements in decision-making outcomes, but it also carries huge complexity and responsibility in getting it right.

Many have already got it wrong, and this is largely down to organisational culture. At the centre of creating a successful analytics strategy is building a data-driven culture.

According to a report by Gartner, more than 35% of the top 5,000 global companies will fail to make use of the insight derived from their data. In another report, by Eckerson, just 36% of respondents gave their BI program a grade of ‘Excellent’ or ’Good’.

With the wealth of data already available in the world, and the promise that it will continue to grow at an exponential rate, it seems inevitable that organisations will attempt to leverage this resource to its fullest to improve their decision-making capabilities.

Before we move forward, it’s important to state that the success of these steps depends on ensuring that all employees who have a direct involvement with the data, or with the insight generated from it, are able to contribute. This point is highlighted in a case study of Warby Parker, which illustrates the importance of self-service technologies that help all users meet their own data needs; according to Carl Anderson, the company’s Director of Data Science, this is essential to realising a data-driven culture.

Set Realistic Goals

I suppose this step is generic and best practice across all aspects of an organisation. However, I felt it needed mentioning because there are numerous examples of decision-makers becoming disillusioned with their analytics program because it did not deliver what they expected.

Therefore, CIOs should take the time to research their organisation in depth; I recommend they look at the current and future challenges facing the organisation and tailor their analytics strategy around solving these.

During this process, it is important to have a full understanding of the data sources the organisation currently uses for analysis and reporting, as well as the external data sources available to it that are not yet utilised.

Performing extensive research into, and gaining an understanding of, the data sources available to the organisation makes it easier for CIOs to set realistic, clear goals that address the challenges facing the business. There is still work to be done in deciding how the analytics strategy will achieve these goals, and it’s at this point that CIOs need to get creative with the data available to them.

For example, big data has brought with it a wealth of unstructured data, and many analysts believe that tapping into this unstructured data is paramount to obtaining a competitive advantage in the years to come. However, it appears to be something most will not realise any time soon: recent studies estimate that only around 0.5% of the world’s unstructured data is ever analysed.

Build the Right Infrastructure

Once the plan has been formulated, the next step for CIOs is to ensure that their organisation’s IT infrastructure is aligned with the strategy so that the set goals can be achieved.

There is no universal “one size fits all” solution for building the right infrastructure; the most important factor to consider is whether the IT infrastructure can support the devised strategy.

A key requirement of all good, modern infrastructures is the capability to integrate all of the organisation’s data sources into one central repository. The benefit is that combining these sources gives users a holistic view of the entire organisation.

For example, in an environment where the organisation’s data is stored in silos, analysts may identify a trend or correlation in one data source but lack the fuller perspective that unified data would afford, i.e. what can our other data sources tell us about what has contributed to this correlation?

Legacy technologies that are now obsolete should be replaced in favour of more modern approaches to processing, storing and analysing data – one example is technologies built on search-engine technology, as cited by Gartner.

Enable Front-Line Employees and Other Business Users

Succeeding now depends on ensuring that front-line employees (those whose job roles can directly benefit from having access to data) and other business users (managers, key business executives, etc.) are capable of self-serving their own data needs.

CIOs should look to acquire a solution built specifically for self-service analysis over large volumes of data and capable of seamless integration with their IT infrastructure.

A full analysis of employee skill sets and mindsets should be undertaken to determine whether certain employees need training in particular areas to bolster their knowledge or simply need to adopt a more analytical mindset.

Whilst it is essential that front-line employees and other business users are given access to self-service analysis, they will often be less technical users. Ensuring they have access to the right training and other learning tools is therefore vital so that they don’t become frustrated or disheartened.

By investing in employee development in these areas now, organisations will save time and money further down the line and remove an over-reliance on both internal and external IT experts.

Source: The 3 Step Guide CIO’s Need to Build a Data-Driven Culture

Using sparklyr with Microsoft R Server

The sparklyr package (by RStudio) provides a high-level interface between R and Apache Spark. Among many other things, it allows you to filter and aggregate data in Spark using the dplyr syntax. In Microsoft R Server 9.1, you can now connect to a Spark session using the sparklyr package as the interface, allowing you to combine the data-preparation capabilities of sparklyr and the data-analysis capabilities of Microsoft R Server in the same environment.

In a presentation at the Spark Summit (embedded below; you can find the slides here), Ali Zaidi shows how to connect to a Spark session from Microsoft R Server and use the sparklyr package to extract a data set. He then shows how to build predictive models on this data (specifically, a deep neural network and a boosted trees classifier). He also shows how to build general ensemble models, cross-validate hyperparameters in parallel, and even gives a preview of forthcoming streaming analysis capabilities.

[youtube https://www.youtube.com/watch?v=8-xvKlz26vg?rel=0&w=500&h=281]

An easy way to try out these capabilities is with Azure HDInsight 3.6, which provides a managed Spark 2.1 instance with Microsoft R Server 9.1.

Spark Summit: Extending the R API for Spark with sparklyr and Microsoft R Server

Originally Posted at: Using sparklyr with Microsoft R Server

@DrJasonBrooks talked about the Fabric and Future of Leadership #JobsOfFuture #Podcast

[youtube https://www.youtube.com/watch?v=SB29nSaCppU]

In this podcast Jason talked about the fabric of great transformative leadership. He shared some tactical steps that current leaders could follow to ensure their relevance and their association with transformative teams. Jason emphasized the role of the team, the leader and the organization in creating a healthy, future-proof culture. It is a good session for the leadership of tomorrow.

Jason’s Recommended Read:
Reset: Reformatting Your Purpose for Tomorrow’s World by Jason Brooks https://amzn.to/2rAuywh
Essentialism: The Disciplined Pursuit of Less by Greg McKeown https://amzn.to/2jOX8Xi

Podcast Link:
iTunes: http://math.im/itunes
GooglePlay: http://math.im/gplay

Jason’s BIO:
Dr. Jason Brooks is an executive, entrepreneur, consulting and leadership psychologist, bestselling author, and speaker with over 24 years of demonstrated results in the design, implementation and evaluation of leadership and organizational development, organizational effectiveness, and human capital management solutions. He works to grow leaders and enhance workforce performance and overall individual and company success. He is a results-oriented, high-impact executive leader with experience in start-up, high-growth, and operationally mature multi-million and multi-billion dollar companies in multiple industries.

About #Podcast:
#JobsOfFuture podcast is a conversation starter that brings leaders, influencers and lead practitioners onto the show to discuss their journey in creating the data-driven future.

Wanna Join?
If you or anyone you know wants to join in,
Register your interest @ info@analyticsweek.com

Want to sponsor?
Email us @ info@analyticsweek.com

Keywords:
#JobsOfFuture #Leadership #Podcast #Future of #Work #Worker & #Workplace

Source: @DrJasonBrooks talked about the Fabric and Future of Leadership #JobsOfFuture #Podcast by v1shal

Making sense of unstructured data by turning strings into things


We all know about the promise of Big Data Analytics to transform our understanding of the world. The analysis of structured data, such as inventory, transactions, close rates, and even clicks, likes and shares, is clearly valuable, but the curious fact about the immense volume of data being produced is that the vast majority of it is unstructured text. Content such as news articles, blog posts, product reviews, and yes, even the dreaded 140-character novella, contains tremendous value, if only it could be connected to things in the real world: people, places and things. In this talk, we’ll discuss the challenges and opportunities that result when you extract entities from Big Text.
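
The talk itself used Basis Technology's tooling; as a rough, generic illustration of what "turning strings into things" can look like in code, here is a minimal named-entity extraction sketch using the open-source spaCy library (our own example, not from the talk):

import spacy

# load a small English pipeline (install it first with:
#   python -m spacy download en_core_web_sm)
nlp = spacy.load("en_core_web_sm")

text = ("Basis Technology, based in Cambridge, will discuss entity "
        "extraction at Microsoft NERD on Thursday.")

# each entity is a "thing" with a type: person, place, organization, date, ...
for ent in nlp(text).ents:
    print(ent.text, ent.label_)   # e.g. 'Basis Technology' ORG, 'Cambridge' GPE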

Speaker:
Gregor Stewart – Director of Product Management for Text Analytics at Basis Technology

As Director of Product Management, Mr. Stewart helps to ensure that Basis Technology’s offerings stay ahead of the curve. Previously Mr. Stewart was the CTO of a storage services startup and a strategy consultant. He holds a Masters in Natural Language Processing from the University of Edinburgh, a BA in PPE from the University of Oxford, and a Masters from the London School of Economics.

Thanks to our amazing sponsors:

Microsoft NERD for Venue

Basis Technology for Food and Kindle Raffle

Video:

Slideshare:

Originally Posted at: Making sense of unstructured data by turning strings into things by v1shal

Announcing dplyrXdf 1.0

I’m delighted to announce the release of version 1.0.0 of the dplyrXdf package. dplyrXdf began as a simple (relatively speaking) backend to dplyr for Microsoft Machine Learning Server/Microsoft R Server’s Xdf file format, but has now become a broader suite of tools to ease working with Xdf files.

This update to dplyrXdf brings the following new features:

  • Support for the new tidyeval framework that powers the current release of dplyr
  • Support for Spark and Hadoop clusters, including integration with the sparklyr package to process Hive tables in Spark
  • Integration with dplyr to process SQL Server tables in-database
  • Simplified handling of parallel processing for grouped data
  • Several utility functions for Xdf and file management
  • Workarounds for various glitches and unexpected behaviour in MRS and dplyr

Spark, Hadoop and HDFS

New in version 1.0.0 of dplyrXdf is support for Xdf files and datasets stored in HDFS in a Hadoop or Spark cluster. Most verbs and pipelines behave the same way, whether the computations are taking place in your R session itself, or in-cluster (except that they should be much more scalable in the latter case). Similarly, dplyrXdf can handle both the scenarios where your R session is taking place on the cluster edge node, or on a remote client.

For example, here is some sample code where we extract a table from Hive, then create a pipeline to process it in the cluster:

rxSparkConnect()
sampleHiv <- RxHiveData(table="hivesampletable")

# this will create the composite Xdf 'samplehivetable'
sampleXdf <- as_xdf(sampleHiv)

sampleXdf %>%
    filter(deviceplatform == "Android") %>%
    group_by(devicemake) %>%
    summarise(n=n()) %>%
    arrange(desc(n)) %>%
    head()
#>     devicemake     n
#> 1      Samsung 16244
#> 2           LG  7950
#> 3          HTC  2242
#> 4      Unknown  2133
#> 5     Motorola  1524

If you are logged into the edge node, dplyrXdf also has the ability to call sparklyr to process Hive tables in Spark. This can be more efficient than converting the data to Xdf format, since less I/O is involved. To run the above pipeline with sparklyr, we simply omit the step of creating an Xdf file:

sampleHiv %>%
    filter(deviceplatform == "Android") %>%
    group_by(devicemake) %>%
    summarise(n=n()) %>%
    arrange(desc(n))
#> # Source:     lazy query [?? x 2]
#> # Database:   spark_connection
#> # Ordered by: desc(n)
#>     devicemake     n
#>        <chr> <dbl>
#> 1      Samsung 16244
#> 2           LG  7950
#> 3          HTC  2242
#> 4      Unknown  2133
#> 5     Motorola  1524
#> # ... with more rows

For more information about Spark and Hadoop support, see the HDFS vignette and the sparklyr website.

SQL database support

One of the key strengths of dplyr is its ability to interoperate with SQL databases. Given a database table as input, dplyr can translate the verbs in a pipeline into a SQL query, which is then executed in the database. For large tables, this can often be much more efficient than importing the data and processing it locally. dplyrXdf can take advantage of this with an MRS data source that is a table in a SQL database, including (but not limited to) Microsoft SQL Server: rather than importing the data to Xdf, the data source is converted to a dplyr tbl and passed to the database for processing.

# copy the flights dataset to SQL Server
flightsSql <- RxSqlServerData("flights", connectionString=connStr)
flightsHd <- copy_to(flightsSql, nycflights13::flights)

# this is run inside SQL Server by dplyr
flightsQry <- flightsSql %>%
    filter(month > 6) %>%
    group_by(carrier) %>%
    summarise(avg_delay=mean(arr_delay))

flightsQry
#> # Source:   lazy query [?? x 2]
#> # Database: Microsoft SQL Server
#> #   13.00.4202[dbo@DESKTOP-TBHQGUH/sqlDemoLocal]
#>   carrier avg_delay
#>   <chr>       <dbl>
#> 1 "9E"        5.37
#> 2 AA         -0.743
#> 3 AS        -16.9
#> 4 B6          8.53
#> 5 DL          1.55
#> # ... with more rows

For more information about working with SQL databases including SQL Server, see the dplyrXdf SQL vignette and the dplyr database vignette.

Parallel processing and grouped data

Even without a Hadoop or Spark cluster, dplyrXdf makes it easy to parallelise the handling of groups. To do this, it takes advantage of Microsoft R Server’s distributed compute contexts: for example, if you set the compute context to “localpar”, grouped transformations will be done in parallel on a local cluster of R processes. The cluster will be shut down automatically when the transformation is complete.

More broadly, you can create a custom backend and tell dplyrXdf to use it by setting the compute context to “dopar”. This allows you a great deal of flexibility and scalability, for example by creating a cluster of multiple machines (as opposed to multiple cores on a single machine). Even if you do not have the physical machines, packages like AzureDSVM and doAzureParallel allow you to deploy clusters of VMs in the cloud, and then shut them down again. For more information, see the “Parallel processing of grouped data” section of the Using dplyrXdf vignette.

Data and file management

New in dplyrXdf 1.0.0 is a suite of functions to simplify managing Xdf files and data sources:

  • HDFS file management: upload and download files with hdfs_file_upload and hdfs_file_download; copy/move/delete files with hdfs_file_copy, hdfs_file_move, hdfs_file_remove; list files with hdfs_dir; and more
  • Xdf data management: upload and download datasets with copy_to, collect and compute; import/convert to Xdf with as_xdf; copy/move/delete Xdf data sources with copy_xdf, move_xdf and delete_xdf; and more
  • Other utilities: run a block of code in the local compute context with local_exec; convert an Xdf file to a data frame with as.data.frame; extract columns from an Xdf file with methods for [, [[ and pull

Obtaining dplyr and dplyrXdf

dplyrXdf 1.0.0 is available from GitHub. It requires Microsoft R Server 8.0 or higher, and dplyr 0.7 or higher. Note that dplyr 0.7 will not be in the MRAN snapshot that is your default repo, unless you are using the recently-released MRS 9.2; you can install it, and its dependencies, from CRAN. If you want to use the SQL Server and sparklyr integration facility, you should install the odbc, dbplyr and sparklyr packages as well.

install.packages(c("dplyr", "dbplyr", "odbc", "sparklyr"),
                 repos="https://cloud.r-project.org")
devtools::install_github("RevolutionAnalytics/dplyrXdf")

If you run into any bugs, or if you have any feedback, you can email me or log an issue at the Github repo.

Source by analyticsweekpick

Periodic Table Personified [image]

Have you ever tried memorizing the periodic table? It is a daunting task, as it has a lot of elements, all coded with one- or two-letter symbols. So, what is the solution? There are various methods for doing this. For one, check out Wonderful Life with the Elements: The Periodic Table Personified by Bunpei Yorifuji. In this effort, Bunpei personified all the elements. It is a fun way to identify each element and make it easily recognizable.

In his book, Yorifuji makes the many elements seem a little more individual by illustrating each one as an anthropomorphic cartoon character, with distinctive hairstyles and clothes to help readers tell them apart. For example, the nitrogens have mohawks because they “hate normal,” while the noble gases have afros because they are “too cool” to react to extreme heat or cold. Man-made elements are depicted in robot suits, while elements used in industrial applications wear business attire.



Image by Wired

Source: Periodic Table Personified [image] by v1shal

Exploring the Structure of High-Dimensional Data with HyperTools in Kaggle Kernels

The datasets we encounter as scientists, analysts, and data nerds are increasingly complex. Much of machine learning is focused on extracting meaning from complex data. However, there is still a place for us lowly humans: the human visual system is phenomenal at detecting complex structure and discovering subtle patterns hidden in massive amounts of data. Every second that our eyes are open, countless data points (in the form of light patterns hitting our retinas) are pouring into visual areas of our brain. And yet, remarkably, we have no problem at all recognizing a neat looking shell on a beach, or our friend’s face in a large crowd. Our brains are “unsupervised pattern discovery aficionados.”

On the other hand, there is at least one major drawback to relying on our visual systems to extract meaning from the world around us: we are essentially capped at perceiving just 3 dimensions at a time, and many datasets we encounter today are higher dimensional.

So, the question of the hour is: how can we harness the incredible pattern-recognition superpowers of our brains to visualize complex and high-dimensional datasets?

Dimensionality Reduction

In comes dimensionality reduction, stage right. Dimensionality reduction is just what it sounds like: transforming a high-dimensional dataset into a lower-dimensional dataset. For example, take this UCI ML dataset on Kaggle comprising observations about mushrooms, organized as a big matrix. Each row comprises a bunch of features of the mushroom, like cap size, cap shape, cap color, odor, etc. The simplest way to do dimensionality reduction might be to simply ignore some of the features (e.g. pick your favorite three, say size, shape, and color, and ignore everything else). However, this is problematic if the features you drop contain valuable diagnostic information (e.g. whether the mushrooms are poisonous).

A more sophisticated approach is to reduce the dimensionality of the dataset by only considering its principal components, or the combinations of features that explain the most variance in the dataset. Using a technique called principal components analysis (or PCA), we can reduce the dimensionality of a dataset while preserving as much of its precious variance as possible. The key intuition is that we can create a new set of (a smaller number of) features, where each of the new features is some combination of the old features. For example, one of these new features might reflect a mix of shape and color, and another might reflect a mix of size and poisonousness. In general, each new feature will be constructed from a weighted mix of the original features.

Below is a figure to help with the intuition. Imagine that you had a 3 dimensional dataset (left), and you wanted to reduce it to a 2 dimensional dataset (right). PCA finds the principal axes in the original 3D space where the variance between points is the highest. Once we identify the two axes that explain the most variance (the black lines in the left panel), we can re-plot the data along just those axes, as shown on the right. Our 3D dataset is now 2D. Here we have chosen a low-dimensional example so we could visualize what is happening. However, this technique can be applied in the same way to higher-dimensional datasets.
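
To make the PCA step concrete, here is a minimal scikit-learn sketch (our own example, not from the original post) that projects a synthetic 3D point cloud down to 2D, mirroring the figure described above:

import numpy as np
from sklearn.decomposition import PCA

# synthetic 3D data whose variance lies mostly along two directions
rng = np.random.RandomState(0)
points_3d = rng.normal(size=(500, 3)) * [5.0, 2.0, 0.3]

# find the two axes of greatest variance and re-express the data along them
pca = PCA(n_components=2)
points_2d = pca.fit_transform(points_3d)

print(points_2d.shape)                 # (500, 2)
print(pca.explained_variance_ratio_)   # share of variance retained by each axis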

We created the HyperTools package to facilitate these sorts of dimensionality reduction-based visual explorations of high-dimensional data. The basic pipeline is to feed in a high-dimensional dataset (or a series of high-dimensional datasets) and, in a single function call, reduce the dimensionality of the dataset(s) and create a plot. The package is built atop many familiar friends, including matplotlib, scikit-learn and seaborn. HyperTools is designed with ease of use as a primary objective. We highlight two example use cases below.

Mushroom foraging with HyperTools: Visualizing static ‘point clouds’

First, let’s explore the mushrooms dataset we referenced above. We start by importing the relevant libraries:

import pandas as pd
import hypertools as hyp

and then we read in our data into a pandas DataFrame:

data = pd.read_csv('../input/mushrooms.csv')
data.head()
index  class  cap-shape  cap-surface  cap-color  bruises  odor  gill-attachment
0      p      x          s            n          t        p     f
1      e      x          s            y          t        a     f
2      e      b          s            w          t        l     f
3      p      x          y            w          t        p     f
4      e      x          s            g          f        n     f
5      e      x          y            y          t        a     f

Each row of the DataFrame corresponds to a mushroom observation, and each column reflects a descriptive feature of the mushroom (only some of the rows and columns are shown above). Now let’s plot the high-dimensional data in a low-dimensional space by passing it to HyperTools. To handle text columns, HyperTools will first convert each text column into a series of binary ‘dummy’ variables before performing the dimensionality reduction. For example, if the ‘cap size’ column contained ‘big’ and ‘small’ labels, this single column would be turned into two binary columns: one for ‘big’ and one for ‘small’, where 1s represent the presence of that feature and 0s represent its absence (for more on this, see the documentation for the get_dummies function in pandas).
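
As a quick aside (our own sketch, not from the original post), this is roughly what that dummy-coding step does, using the hypothetical 'cap size' column from the example above:

import pandas as pd

# a toy column containing the 'big'/'small' labels described above
toy = pd.DataFrame({'cap size': ['big', 'small', 'big']})
dummies = pd.get_dummies(toy)
print(dummies.columns.tolist())   # ['cap size_big', 'cap size_small']

With that conversion handled automatically by HyperTools, the plot itself is a one-line call: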

hyp.plot(data, 'o')

In plotting the DataFrame, we are effectively creating a three-dimensional “mushroom space,” where mushrooms that exhibit similar features appear as nearby dots, and mushrooms that exhibit different features appear as more distant dots. By visualizing the DataFrame in this way, it becomes immediately clear that there are multiple clusters in the data. In other words, all combinations of mushroom features are not equally likely, but rather certain combinations of features tend to go together. To better understand this space, we can color each point according to some feature in the data that we are interested in knowing more about. For example, let’s color the points according to whether the mushrooms are (p)oisonous or (e)dible (the class_labels feature):

hyp.plot(data,'o', group=class_labels, legend=list(set(class_labels)))

Visualizing the data in this way highlights that mushrooms’ poisonousness appears stable within each cluster (e.g. mushrooms that have similar features), but varies across clusters. In addition, it looks like there are a number of distinct clusters that are poisonous/edible. We can explore this further by using the ‘cluster’ feature of HyperTools, which colors the observations using k-means clustering. In the description of the dataset, it was noted that there were 23 different types of mushrooms represented in this dataset, so we’ll set the n_clusters parameter to 23:

hyp.plot(data, 'o', n_clusters=23)

To gain access to the cluster labels, the clustering tool may be called directly using hyp.tools.cluster, and the resulting labels may then be passed to hyp.plot:

cluster_labels = hyp.tools.cluster(data, n_clusters=23)
hyp.plot(data, group=cluster_labels)

By default, HyperTools uses PCA to do dimensionality reduction, but with a few additional lines of code we can use other dimensionality reduction methods by directly calling the relevant functions from sklearn. For example, we can use t-SNE to reduce the dimensionality of the data using:

from sklearn.manifold import TSNE
TSNE_model = TSNE(n_components=3)
reduced_data_TSNE = TSNE_model.fit_transform(hyp.tools.df2mat(data))
hyp.plot(reduced_data_TSNE,'o', group=class_labels, legend=list(set(class_labels)))

Different dimensionality reduction methods highlight or preserve different aspects of the data. A repository containing additional examples (including different dimensionality reduction methods) may be found here.

The data expedition above provides one example of how the geometric structure of data may be revealed through dimensionality reduction and visualization. The observations in the mushrooms dataset formed distinct clusters, which we identified using HyperTools. Explorations and visualizations like this could help guide analysis decisions (e.g. whether to use a particular type of classifier to discriminate poisonous vs. edible mushrooms). If you’d like to play around with HyperTools and the mushrooms dataset, check out and fork this Kaggle Kernel!

Climate science with HyperTools: Visualizing dynamic data

Whereas the mushrooms dataset comprises static observations, here we will take a look at some global temperature data, which will showcase how HyperTools may be used to visualize timeseries data using dynamic trajectories.

This next dataset is made up of monthly temperature recordings from a sample of 20 global cities over the 138 year interval ranging from 1875–2013. To prepare this dataset for analysis with HyperTools, we created a time by cities matrix, where each row is a temperature recording for subsequent months, and each column is the temperature value for a different city. You can replicate this demo by using the Berkeley Earth Climate Change dataset on Kaggle or by cloning this GitHub repo. To visualize temperature changes over time, we will use HyperTools to reduce the dimensionality of the data, and then plot the temperature changes over time as a line:

hyp.plot(temps)

Well, that just looks like a hot mess, now doesn’t it? However, we promise there is structure in there, so let’s find it! Because each city is in a different location, the mean and variance of its temperature timeseries may be higher or lower than those of the other cities. This will in turn affect how much that city is weighted when dimensionality reduction is performed. To normalize the contribution of each city to the plot, we can set the normalize flag (default value: False). Setting normalize='across' will normalize (z-score) each column of the data. HyperTools incorporates a number of useful normalization options, which you can read more about here.
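
Under the hood, z-scoring just subtracts each column's mean and divides by its standard deviation. A minimal pandas/NumPy equivalent of the 'across' normalization (our own sketch, assuming temps is held as a months-by-cities table) would be:

# z-score each city's column: subtract its mean, divide by its standard deviation
temps_normalized = (temps - temps.mean(axis=0)) / temps.std(axis=0)

HyperTools applies the same idea for you when you pass the normalize flag: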

hyp.plot(temps, normalize='across')

Now we’re getting somewhere! Rotating the plot with the mouse reveals an interesting shape to this dataset. To help highlight the structure and understand how it changes over time, we can color the lines by year, where redder lines indicate earlier timepoints and bluer lines indicate later ones:

hyp.plot(temps, normalize='across', group=years.flatten(), palette='RdBu_r')

Coloring the lines has now revealed two key structural aspects of the data. First, there is a systematic shift from blue to red, indicating a systematic change in the pattern of global temperatures over the years reflected in the dataset. Second, within each year (color), there is a cyclical pattern, reflecting seasonal changes in the temperature patterns. We can also visualize these two phenomena using a two dimensional plot:

hyp.plot(temps, normalize='across', group=years.flatten(), palette='RdBu_r', ndims=2)

Now, for the grand finale. In addition to creating static plots, HyperTools can also create animated plots, which can sometimes reveal additional patterns in the data. To create an animated plot, simply pass animate=True to hyp.plot when visualizing timeseries data. If you also pass chemtrails=True, a low-opacity trace of the data will remain in the plot:

hyp.plot(temps, normalize='across', animate=True, chemtrails=True)

That pleasant feeling you get from looking at the animation is called “global warming.”

This concludes our exploration of climate and mushroom data with HyperTools. For more, please visit the project’s GitHub repository, readthedocs site, a paper we wrote, or our demo notebooks.

Bio

Andrew is a Cognitive Neuroscientist in the Contextual Dynamics Laboratory. His postdoctoral work integrates ideas from basic learning and memory research with computational techniques used in data science to optimize learning in natural educational settings, like the classroom or online. Additionally, he develops open-source software for data visualization, research and education.

The Contextual Dynamics Lab at Dartmouth College uses computational models and brain recordings to understand how we extract information from the world around us. You can learn more about us at http://www.context-lab.com.

Source: Exploring the Structure of High-Dimensional Data with HyperTools in Kaggle Kernels

The intersection of analytics, social media and cricket in the cognitive era of computing

Photo Credit: Getty Images.

Since 1975, every fourth year, the world’s top cricketing nations have come together for a month-long extravaganza. Since February 14th, 2015, 14 teams have been battling it out in a six-week-long Cricket World Cup tournament at 14 venues across Australia and New Zealand. During these six weeks, millions of cricket fans alter their schedules to relish the game of cricket worldwide in unison. It is a cricket carnival of sorts.

The Cricket World Cup is at its peak. All eyes are glued to the sport and not much is going unobserved, on the field or off it. Whether it is Shikhar Dhawan scoring a century or Virat Kohli’s anger, cricket enthusiasts are having a ball. There is, however, another segment that is thriving as cricket fever reaches a high: the dedicated cricket follower who follows each ball and takes stock of each miss, for whom each run is a statistic and each delivery an opportunity. I remember years ago we all used to be hooked to a radio while on the move, keeping track of ball-by-ball updates; then came television; and now, with social media, stakeholder engagement has become phenomenally addictive.

Such a big fan base is bound to open opportunities for sports tourism, brand endorsements and global partnerships. The CWC is an event that many business enterprises tap in order to make their presence felt. With adequate assistance from technology and in-depth insights, the possibilities for stakeholder engagement and for scaling up ventures are huge like never before.

The sports industry is perhaps one of the biggest enterprises to have willingly adopted technology to change the game for players, viewers, organizers as well as broadcasters. Pathbreaking advances in technology have ensured that the experience of following the game has become finer and more nuanced. It is no longer just about what is happening on the field but about what happened on similar occasions in the past, and what could possibly happen given the past records of the team and the players. This ever-growing back and forth between information and analysis makes for a cricket lover’s paradise.

Cognitive analysis of such large volumes of data is no longer just a dream. Machine learning algorithms are getting smarter day by day using cloud computing on clusters, and that is about to change the whole landscape of human experience and involvement. To understand what the CWC means to different people from various backgrounds, it is important to understand their psychology and perception of the game. A deeper look can bring us closer to understanding how technology, analytics and big data are in fact changing the dynamics of cricket.

A common man’s perspective

To the common man, the Cricket World Cup is about sneaking a win from close encounters, high-scoring run fests, electric crowds, blind faith in their teams and something to chew on, spicing their opinions after the game is over. With small boundaries, better willows, fit athletes and pressure situations to overcome to be victorious, every contest is an absolute delight to watch and closely follow. Cricket fans are so deeply attached to the game and the players that every bit of juicy information about the game enthralls them.

A geek’s perspective

In the last forty years the use of technology has changed the game of cricket on the field. Years ago, the snickometer was considered revolutionary; then came the pathbreaking Hawk-Eye, followed by PitchVision, DRS (the Decision Review System) and now flashing cricket bails. For cricketers this has meant a better reviewing process. Now they understand their game better, correct their mistakes, prepare against their weaknesses and also plan specific strategies against individual players of the opposing team. For cricket followers and business houses this has meant better engagement with the audience, a deeper personalised experience and a detailed understanding of what works and what does not.

This increase in the viewer-engagement quotient has been boosted with Matchups covering past records on player performance, match stats, etc. Wisden captures data from each match and provides the basis for comparatives around player potential, strike rate, runs in the middle overs, important players in the death overs, and so on.

While Wisden India provides all the data points, IBM’s analytics engine processes the information into meaningful assets of historical data, making it possible to predict future outcomes. For CWC 2015, IBM has partnered with Wisden to provide viewers with live match analysis and player performance insights, which are frequently used by commentators and coaches to keep viewers glued to the match proceedings.

Just as it makes insightful observations from a vast trove of cricket data, IBM’s analytics engine equips organizations to take advantage of all their data to make better business decisions. Using analytics, instead of instinct, can help organizations provide personalized experiences to their customers, spot new opportunities and risks, and create new business models.

Similarly, with social media outreach, the overall engagement of viewers has become crucial, whether in boosting the confidence of a team or in adding to the pressure from the masses.

Aggregating shared opinion on social sites is key to highlighting expectations and generating perceived predictions about the teams, potential wins, the most popular players and so on.

To give an idea of the numbers and technology involved: as part of the Twitterati analysis, IBM processed about 3 million tweets on average on a two-match day, analysed at 10-minute intervals.

IBM Cloudant was used to store tweets crawled from Twitter carrying match- or tournament-specific hashtags. As needed, IBM fetched the tweets from Cloudant and generated the events specific to every match. IBM Bluemix automates the process of getting tweets from Twitter and generating the events corresponding to every match, given the schedule of the Cricket World Cup tournament. The application is hosted in Bluemix. Apart from these technologies, IBM developed the core engine that identifies events from the Twitter feed.

The Social Sentiment Index analyzed around 3.5 million tweets, tracking about 700 match-specific events daily on Twitter. IBM Data Curation and Integration capabilities were used on BigInsights and the Social Data Accelerator (SDA) to extract social insights from the streaming Twitter feed in real time.

Moreover, IBM Text Analytics and Natural Language Processing perform fine-grained temporal analytics around events that have a short lifespan but are important, such as boundaries, sixes and wickets.

IBM Social Media Analytics also examines the quantum of discussion around teams, players and events. It examines sentiment across different entities, identifies topics that are trending, and explores ways in which advertisers can use the discussion to appropriately position their products and services.

IBM Content Analytics examines the large body of social content more deeply and tries to mimic human cognition and learning behavior to answer complex questions, such as the impact of a certain player or the attributes that determine the outcome of a game.

An enterprise perspective

What is most interesting to businesses, however, is that observing these campaigns helps in understanding consumer sentiment to drive sales initiatives. With the right business insights delivered in the nick of time and in line with social trends, several brands have come up with lucrative offers one can’t refuse. In earlier days, this kind of marketing required pumping in a lot of money and waiting several weeks before one could analyse and confirm the commercial success of a business idea. With tools like IBM Analytics at hand, one can not only grab the data needed and assess it so that it makes business sense, but also anticipate the market response.

Imagine how, in the right hands, especially in the data sensitive industry, the facility of analyzing large scale structured and unstructured data combined with cloud computing and cognitive machine learning can lead to capable and interesting solutions with weighted recommendations at your disposal.

The potential of the idea already sounds like a game-changer to me. When I look around, every second person is tweeting and posting about everything around them. There are volumes of data waiting to be analyzed. With the power to process the virality of events in real time across devices, sensors and applications, I can vouch that, with data mining and business intelligence capabilities, cloud computing can significantly improve and empower businesses to run focused campaigns.

With engines like the Social Data Accelerator, Cloudant and Social Data Curation at your service, social data analysis can be democratized with fairly accurate results, opening new channels of business that have not been identified so far. The CWC 2015 insight is just the beginning. Howzzat?

Originally posted via “The intersection of analytics, social media and cricket in the cognitive era of computing”

Source: The intersection of analytics, social media and cricket in the cognitive era of computing by analyticsweekpick

Seven ways predictive analytics can improve healthcare


Everyone is a patient at some time or another, and we all want good medical care. We assume that doctors are all medical experts and that there is good research behind all their decisions.

Physicians are smart, well trained and do their best to stay up to date with the latest research. But they can’t possibly commit to memory all the knowledge they need for every situation, and they probably don’t have it all at their fingertips. Even if they did have access to the massive amounts of data needed to compare treatment outcomes for all the diseases they encounter, they would still need time and expertise to analyze that information and integrate it with the patient’s own medical profile. But this kind of in-depth research and statistical analysis is beyond the scope of a physician’s work.

That’s why more and more physicians – as well as insurance companies – are using predictive analytics.

Predictive analytics (PA) uses technology and statistical methods to search through massive amounts of information, analyzing it to predict outcomes for individual patients. That information can include data from past treatment outcomes as well as the latest medical research published in peer-reviewed journals and databases.

Not only can PA help with predictions, but it can also reveal surprising associations in data that our human brains would never suspect.

In medicine, predictions can range from responses to medications to hospital readmission rates. Examples are predicting infections from methods of suturing, determining the likelihood of disease, helping a physician with a diagnosis, and even predicting future wellness.

The statistical methods are called learning models because they can grow in precision with additional cases. There are two major ways in which PA differs from traditional statistics (and from evidence-based medicine):

  • First, predictions are made for individuals and not for groups.
  • Second, PA does not rely upon a normal (bell-shaped) curve.

Prediction modelling uses techniques such as artificial intelligence to create a prediction profile (algorithm) from past individuals. The model is then “deployed” so that a new individual can get a prediction instantly for whatever the need is, whether a bank loan or an accurate diagnosis.
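
As a purely illustrative sketch of that "build a profile from past individuals, then deploy it to score a new individual" workflow, here is a generic scikit-learn example (our own, not tied to any specific clinical system; the features and values are invented):

import pandas as pd
from sklearn.linear_model import LogisticRegression

# past individuals: a few features plus the outcome we want to predict
past = pd.DataFrame({
    'age':            [54, 61, 47, 70, 58],
    'systolic_bp':    [130, 160, 118, 155, 142],
    'prior_admits':   [0, 2, 0, 3, 1],
    'readmitted_30d': [0, 1, 0, 1, 0],   # outcome: readmitted within 30 days
})

# build the prediction profile (model) from past individuals
model = LogisticRegression()
model.fit(past[['age', 'systolic_bp', 'prior_admits']], past['readmitted_30d'])

# "deployment": a new individual gets a prediction instantly
new_patient = pd.DataFrame({'age': [66], 'systolic_bp': [150], 'prior_admits': [2]})
print(model.predict_proba(new_patient)[0, 1])   # predicted probability of readmission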

In this post, I discuss the top seven benefits of PA to medicine – or at least how they will be beneficial once PA techniques are widely known and used. In the United States, many physicians are just beginning to hear about predictive analytics and are realizing that they have to make changes as government regulations and demands have changed. For example, under the Affordable Care Act, one of the first mandates within Meaningful Use demands that patients not be readmitted within 30 days of being discharged from the hospital. Hospitals will need predictive models to accurately assess when a patient can safely be released.

1. Predictive analytics increase the accuracy of diagnoses.

Physicians can use predictive algorithms to help them make more accurate diagnoses. For example, when patients come to the ER with chest pain, it is often difficult to know whether the patient should be hospitalized. If doctors were able to feed the answers to questions about the patient and his condition into a system with a tested and accurate predictive algorithm that would assess the likelihood that the patient could be sent home safely, then their own clinical judgments would be aided. The prediction would not replace their judgments but rather would assist them.

In a visit to one’s primary care physician, the following might occur: The doctor has been following the patient for many years. The patient’s genome includes a gene marker for early-onset Alzheimer’s disease, identified by researchers using predictive analytics. This gene is rare and runs in one side of the patient’s family. Several years ago, when it was first discovered, the patient agreed to have his blood taken to see if he had the gene. He did. There was no gene treatment available, but evidence-based research indicated to the PCP measures that may be helpful for many early Alzheimer’s patients.

Ever since, the physician has had the patient engaging in exercise, good nutrition, and brain-game apps that the patient downloaded on his smartphone and which automatically upload to the patient’s portal. Memory tests are given on a regular basis and are entered into the electronic medical record (EMR), which also links to the patient portal. The patient himself adds data weekly to his patient portal to keep track of the time and kinds of exercise, what he is eating, how he has slept, and any other variable that his doctor wishes to track.

Because the PCP has a number of Alzheimer’s patients, the PCP has initiated an ongoing predictive study, with the hope of developing a predictive model of an individual’s likelihood of memory maintenance, and uses, with permission, the data entered through the patients’ portals. At this visit, the physician shares the good news that a gene therapy has been discovered for the patient’s specific gene and recommends that the patient receive such therapy.

2. Predictive analytics will help preventive medicine and public health.

With early intervention, many diseases can be prevented or ameliorated. Predictive analytics, particularly within the realm of genomics, will allow primary care physicians to identify at-risk patients within their practice. With that knowledge, patients can make lifestyle changes to avoid risks (An interview with Dr. Tim Armstrong on this WHO podcast explores the question: Do lifestyle changes improve health?)

As lifestyles change, population disease patterns may dramatically change, with resulting savings in medical costs. As Dr. Daniel Kraft, Medicine and Neuroscience Chair at Stanford University, points out in his video Medicine 2064:

During the history of medicine, we have not been involved in healthcare; no, we’ve been consumed by sick care. We wait until someone is sick and then try to treat that person. Instead, we need to learn how to avoid illness and learn what will make us healthy. Genomics will play a huge part in the shift toward well-living.

As Dr. Kraft mentions, our future medications might be designed just for us because predictive analytics methods will be able to sort out what works for people with “similar subtypes and molecular pathways.”

3. Predictive analytics provides physicians with answers they are seeking for individual patients.

Evidence-based medicine (EBM) is a step in the right direction and provides more help than simple hunches for physicians. However, what works best for the middle of a normal distribution of people may not work best for an individual patient seeking treatment. PA can help doctors decide the exact treatments for those individuals. It is wasteful and potentially dangerous to give treatments that are not needed or that won’t work specifically for an individual. (This topic is covered in a paper by the Personalized Medicine Coalition.) Better diagnoses and more targeted treatments will naturally lead to increases in good outcomes and fewer resources used, including the doctor’s time.

4. Predictive analytics can provide employers and hospitals with predictions concerning insurance product costs.

Employers providing healthcare benefits for employees can input characteristics of their workforce into a predictive analytic algorithm to obtain predictions of future medical costs. Predictions can be based upon the company’s own data or the company may work with insurance providers who also have their own databases in order to generate the prediction algorithms. Companies and hospitals, working with insurance providers, can synchronize databases and actuarial tables to build models and subsequent health plans. Employers might also use predictive analytics to determine which providers may give them the most effective products for their particular needs. Built into the models would be the specific business characteristics. For example, if it is discovered that the average employee visits a primary care physician six times a year, those metrics can be included in the model.

Hospitals will also work with insurance providers as they seek to increase optimum outcomes and quality assurance for accreditation. In tailoring treatments that produce better outcomes, accreditation standards are both documented and increasingly met. Likewise, predictive analytics can support the Accountable Care Organization (ACO) model, in that the primary goal of an ACO is the reduction of costs by treating specific patient populations successfully. Supply chain management (SCM) for hospitals and insurance providers will change as needs for resources change; in fact, when using PA, those organizations may see otherwise hidden opportunities for savings and increased efficiency. PA has a way of bringing our attention to that which may not have been seen before.

5. Predictive analytics allow researchers to develop prediction models that do not require thousands of cases and that can become more accurate over time.

In huge population studies, even very small differences can be “statistically significant.” Researchers understand that randomly assigned case control studies are superior to observational studies, but often it is simply not feasible to carry out such a design. From huge observational studies, the small but statistically significant differences are often not clinically significant. The media, ignorant of research nuances, may then focus on those small but statistically significant findings, convincing and sometimes frightening the public. Researchers also are to blame as sometimes they themselves do not understand the difference between statistical significance and clinical significance.

For example, in a TEDxColumbiaEngineering talk, Dr. David H. Newman spoke about recent media claims that small to moderate alcohol consumption by women can result in higher rates of certain cancers. Many news programs and newspapers loudly and erroneously warned women not to drink even one alcoholic drink per day.

In contrast, with predictive analytics, initial models can be generated with smaller numbers of cases, and their accuracy may be improved over time with additional cases. The models are alive, learning, and adapting with added information and with changes that occur in the population over time.

In order to make use of data across practices, electronic data record systems will need to be compatible with one another; interoperability, or this very coordination, is important and has been mandated by the US government. Governance around the systems will require transparency and accountability. One program suite, STATISTICA, is familiar with governance as it has worked with banks, pharmaceutical industries and government agencies. Using such a program will be crucial in order to offer “transparent” models, meaning they work smoothly with other programs, such as Microsoft and Visual Basic. In addition, STATISTICA can provide predictive models using double-blind elements and random assignment, satisfying the continued need for controlled studies.

On the other hand, some programs are proprietary, and users often have to pay the statistical company to use their own data. In addition, they may find that the system is not compatible with other systems if they need to make changes. When dealing with human life, the risks of making mistakes are increased, and the models used must lend themselves to making the systems valid, shareable and reliable.

6. Pharmaceutical companies can use predictive analytics to best meet the needs of the public for medications.

There will be incentives for the pharmaceutical industry to develop medications for ever smaller groups. Old medications, dropped because they were not used by the masses, may be brought back because drug companies will find it economically feasible to do so. In other words, previous big-bulk medications are certain to be used less if they are found not to help many of those who were prescribed them. Less-used medications will be economically lucrative to revive and develop as research is able to predict who might benefit from them. For example, if 25,000 people need to be treated with a medication “shotgun-style” in order to save 10 people, then much waste has occurred. All medications have unwanted side effects. The shotgun-style delivery method can expose patients to those risks unnecessarily if the medication is not needed for them. Dr. Newman (above) discussed the probable overuse of statins as one example.

7. Patients have the potential benefit of better outcomes due to predictive analytics.

There will be many benefits in quality of life for patients as the use of predictive analytics increases. Potentially, individuals will receive treatments that will work for them, be prescribed medications that work for them, and not be given unnecessary medications just because a medication works for the majority of people. The patient role will change as patients become more informed consumers who work with their physicians collaboratively to achieve better outcomes. Patients will become aware of possible personal health risks sooner due to alerts from their genome analysis, predictive models relayed by their physicians, the increasing use of apps and medical devices (e.g., wearable devices and monitoring systems), and better accuracy about what information is needed for accurate predictions. They will then have decisions to make about their lifestyles and future well-being.

 

Conclusion:  Changes are coming in medicine worldwide.

In developed nations such as the United States, predictive analytics is the next big idea in medicine, the next evolution in statistics, and roles will change as a result.

  • Patients will have to become better informed and will have to assume more responsibility for their own care, if they are to make use of the information derived.
  • Physicians’ roles will likely change to be more consultant than decision maker; they will advise, warn and help individual patients. Physicians may find more joy in practice as positive outcomes increase and negative outcomes decrease. Perhaps time with individual patients will increase, and physicians can once again have the time to form positive and lasting relationships with their patients. Time to think, to interact, to really help people: relationship formation is one of the reasons physicians say they went into medicine, and when these diminish, so does their satisfaction with their profession.
  • Hospitals, pharmaceutical companies and insurance providers will see changes as well. For example, there may be fewer unnecessary hospitalizations, resulting initially in less revenue. Over time, however, admissions will be more meaningful, the market will adjust, and accomplishment will rise. Initially, revenues may also be lost by pharmaceutical and device companies, but then more specialized and individualized offerings will increase profits. They may be forced to find newer and better solutions for individuals, ultimately providing them with fresh sources of revenue. There may be increased governmental funds offered for those who are innovative in approach.

All in all, changes are coming. The genie is out of the box and, in fact, is building boxes for the rest of us. Smart industries will anticipate and prepare.

These changes can literally revolutionize the way medicine is practiced, leading to better health and disease reduction.

I think about the Bayer TV commercial in which a woman gets a note that says, “Your heart attack will arrive in two days.” The voiceover proclaims, “Laura’s heart attack didn’t come with a warning.” Not so with predictive analytics. That very message could be sent to Laura by her doctor using predictive analytics. Better yet, in our bright future, Laura might get a note from her doctor that says, “Your heart attack will occur eight years from now, unless …” – giving Laura the chance to restructure her life and change the outcome.

Note: This article originally appeared in Elsevier. Click for link here.

Source: Seven ways predictive analytics can improve healthcare

Piwik PRO Introduces New Pricing Packages For Small and Medium Enterprises

We’re happy to announce that new pricing plans are available for Piwik PRO Marketing Suite. The changes will allow small and medium enterprises to take advantage of affordable privacy-compliant marketing tools (including Consent Manager) to meet the requirements of the GDPR.

In recent weeks, one of the most restrictive data privacy regulations the world has ever seen came into force – we’re obviously talking about GDPR.

Now every company that processes the personal data of EU citizens has to make sure that their internal processes, services and products are in line with the provisions of the new law (we wrote about it in numerous articles on our blog, be sure to check them out). Otherwise, they risk severe fines.

Among many other things, they have to collect active consents from visitors before they start processing their data.

The new rules apply not only to large corporations, but also to small and medium-sized enterprises.

When the market standard is not enough

The reason for worry for many of them is that the most popular freemium analytics software provider has decided to limit its support in this matter to the bare minimum.

Although Google introduced some product updates that aim to help their clients comply with the new regulation (like data retention control and a user deletion tool), they decided that their clients (data controllers) are the ones who have to develop their own mechanism for collecting, managing, and storing consents (via opt-in) from visitors (for both Google Analytics and Google Tag Manager).

Following all these rules can be a hassle for website owners, especially small to medium enterprises with often limited resources of time and workforce.

Important note! Recent events indicate that Google could be an unreliable partner in the face of the new EU regulations. On the first day after the regulation came into force, Google was sued for violating provisions of the GDPR by Max Schrems, an Austrian lawyer and privacy activist. You can read more about it in this article by The Verge.

How Piwik PRO can help you with the task

Luckily, there are many vendors who decided to create a tool to mediate between visitors and analytics software. Depending on the provider, it’s called Cookie Consent Manager, Cookie Widget, GDPR Consent Manager, etc.

These tools are a kind of gatekeeper that passes information about consents between individual visitors and your analytics system. That way, you make sure that the data you’re operating on has been collected in compliance with the new law.

One of the companies developing this type of product is Piwik PRO. You can read more about our solution here.

New pricing plan for small and medium enterprises

Due to the growing interest in our GDPR Consent Manager among small and medium enterprises, we decided to prepare a special offer tailored to their needs.

All companies wanting to collect data about the behavior of their website’s visitors in a privacy-compliant manner will be able to take advantage of the new “Business Plan” pricing package. The offer is intended for businesses with up to 2 million monthly actions on their websites.

It includes the following products:

The combined forces of these products will help you collect all the relevant information about visitors without violating the provisions of the new law (and also other data privacy laws including Chinese Internet Law and Russian law 526-FZ).

Additionally, your data will be stored in a highly secure environment:

  • ISO 27001 Certified private cloud data center
  • Fully-redundant infrastructure with 99% SLA
  • Microsoft Azure GDPR-compliant cloud infrastructure, hosted in the location of your choice: Germany, Netherlands, USA

What’s more, you can count on professional customer support, including:

  • Email support
  • Live chat
  • User training
  • Professional Onboarding

Sound interesting? Then give it a (free) spin! All you have to do is register for a 30-day free trial. Our sales representatives will contact you within 24 hours!

You can also read more about the offer on our pricing page.

REGISTER FOR A FREE TRIAL

The post Piwik PRO Introduces New Pricing Packages For Small and Medium Enterprises appeared first on Piwik PRO.

Source: Piwik PRO Introduces New Pricing Packages For Small and Medium Enterprises by analyticsweek