Oct 25, 18: #AnalyticsClub #Newsletter (Events, Tips, News & more..)

[  COVER OF THE WEEK ]

image
Conditional Risk  Source

[ LOCAL EVENTS & SESSIONS]

More WEB events? Click Here

[ AnalyticsWeek BYTES]

>> Avoiding a Data Science Hype Bubble by analyticsweek

>> The User Experience of University Websites by analyticsweek

>> Landscape of Big Data by v1shal

Wanna write? Click Here

[ NEWS BYTES]

>> Beckage PLLC focuses on data security – Buffalo Business First Under Data Security

>> The data center is dead: Here’s what comes next | ZDNet – ZDNet Under Data Center

>> Global Automotive HVAC Sensors Market Outlook, Size, Status, and Forecast to 2025 – City Councilor Under Financial Analytics

More NEWS ? Click Here

[ FEATURED COURSE]

Lean Analytics Workshop – Alistair Croll and Ben Yoskovitz

image

Use data to build a better startup faster in partnership with Geckoboard… more

[ FEATURED READ]

Data Science for Business: What You Need to Know about Data Mining and Data-Analytic Thinking

image

Written by renowned data science experts Foster Provost and Tom Fawcett, Data Science for Business introduces the fundamental principles of data science, and walks you through the “data-analytic thinking” necessary for e… more

[ TIPS & TRICKS OF THE WEEK]

Analytics Strategy that is Startup Compliant
With the right tools, capturing data is easy, but not being able to handle that data can lead to chaos. One of the most reliable startup strategies for adopting data analytics is TUM, or The Ultimate Metric: the metric that matters most to your startup. Some advantages of TUM: it answers the most important business question, it cleans up your goals, it inspires innovation, and it helps you understand the entire quantified business.

[ DATA SCIENCE Q&A]

Q:What do you think about the idea of injecting noise in your data set to test the sensitivity of your models?
A: * Effect would be similar to regularization: avoid overfitting
* Used to increase robustness

Source
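
As a concrete illustration of the idea above (not part of the original Q&A), one way to test sensitivity is to add Gaussian noise of increasing scale to a held-out set and watch how a fitted model's score degrades; fragile models fall off quickly:

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LogisticRegression(max_iter=5000).fit(X_train, y_train)

rng = np.random.default_rng(0)
for scale in [0.0, 0.1, 0.5, 1.0]:
    # perturb each feature in proportion to its own spread
    noisy = X_test + rng.normal(0, scale * X_test.std(axis=0), X_test.shape)
    print(scale, round(model.score(noisy, y_test), 3))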

[ VIDEO OF THE WEEK]

#FutureOfData Podcast: Peter Morgan, CEO, Deep Learning Partnership

 #FutureOfData Podcast: Peter Morgan, CEO, Deep Learning Partnership

Subscribe to  Youtube

[ QUOTE OF THE WEEK]

Without big data, you are blind and deaf and in the middle of a freeway. – Geoffrey Moore

[ PODCAST OF THE WEEK]

#BigData #BigOpportunity in Big #HR by @MarcRind #JobsOfFuture #Podcast

 #BigData #BigOpportunity in Big #HR by @MarcRind #JobsOfFuture #Podcast

Subscribe 

iTunes  GooglePlay

[ FACT OF THE WEEK]

A quarter of decision-makers surveyed predict that data volumes in their companies will rise by more than 60 per cent by the end of 2014, with the average of all respondents anticipating a growth of no less than 42 per cent.

Sourced from: Analytics.CLUB #WEB Newsletter

Making sense of unstructured data by turning strings into things

We all know about the promise of Big Data Analytics to transform our understanding of the world. The analysis of structured data, such as inventory, transactions, close rates, and even clicks, likes and shares is clearly valuable, but the curious fact about the immense volume of data being produced is that a vast majority of it is unstructured text. Content such as news articles, blog posts, product reviews, and yes, even the dreaded 140-character novella contains tremendous value, if only it could be connected to things in the real world – people, places and things. In this talk, we’ll discuss the challenges and opportunities that result when you extract entities from Big Text.

Speaker:
Gregor Stewart – Director of Product Management for Text Analytics at Basis Technology

As Director of Product Management, Mr. Stewart helps to ensure that Basis Technology’s offerings stay ahead of the curve. Previously Mr. Stewart was the CTO of a storage services startup and a strategy consultant. He holds a Masters in Natural Language Processing from the University of Edinburgh, a BA in PPE from the University of Oxford, and a Masters from the London School of Economics.

Thanks to our amazing sponsors:

MicrosoftNERD for Venue

Basis Technology for Food and Kindle Raffle

Video:

Slideshare:

Originally Posted at: Making sense of unstructured data by turning strings into things by v1shal

Announcing dplyrXdf 1.0

I’m delighted to announce the release of version 1.0.0 of the dplyrXdf package. dplyrXdf began as a simple (relatively speaking) backend to dplyr for Microsoft Machine Learning Server/Microsoft R Server’s Xdf file format, but has now become a broader suite of tools to ease working with Xdf files.

This update to dplyrXdf brings the following new features:

  • Support for the new tidyeval framework that powers the current release of dplyr
  • Support for Spark and Hadoop clusters, including integration with the sparklyr package to process Hive tables in Spark
  • Integration with dplyr to process SQL Server tables in-database
  • Simplified handling of parallel processing for grouped data
  • Several utility functions for Xdf and file management
  • Workarounds for various glitches and unexpected behaviour in MRS and dplyr

Spark, Hadoop and HDFS

New in version 1.0.0 of dplyrXdf is support for Xdf files and datasets stored in HDFS in a Hadoop or Spark cluster. Most verbs and pipelines behave the same way, whether the computations are taking place in your R session itself or in-cluster (except that they should be much more scalable in the latter case). Similarly, dplyrXdf can handle both scenarios: your R session running on the cluster edge node, or on a remote client.

For example, here is some sample code where we extract a table from Hive, then create a pipeline to process it in the cluster:

rxSparkConnect()
sampleHiv <- RxHiveData(table="hivesampletable")

# this will create the composite Xdf 'samplehivetable'
sampleXdf <- as_xdf(sampleHiv)

sampleXdf %>%
    filter(deviceplatform == "Android") %>%
    group_by(devicemake) %>%
    summarise(n=n()) %>%
    arrange(desc(n)) %>%
    head()
#>     devicemake     n
#> 1      Samsung 16244
#> 2           LG  7950
#> 3          HTC  2242
#> 4      Unknown  2133
#> 5     Motorola  1524

If you are logged into the edge node, dplyrXdf also has the ability to call sparklyr to process Hive tables in Spark. This can be more efficient than converting the data to Xdf format, since less I/O is involved. To run the above pipeline with sparklyr, we simply omit the step of creating an Xdf file:

sampleHiv %>%
    filter(deviceplatform == "Android") %>%
    group_by(devicemake) %>%
    summarise(n=n()) %>%
    arrange(desc(n))
#> # Source:     lazy query [?? x 2]
#> # Database:   spark_connection
#> # Ordered by: desc(n)
#>     devicemake     n
#>          <chr> <dbl>
#> 1      Samsung 16244
#> 2           LG  7950
#> 3          HTC  2242
#> 4      Unknown  2133
#> 5     Motorola  1524
#> # ... with more rows

For more information about Spark and Hadoop support, see the HDFS vignette and the Sparklyr website.

SQL database support

One of the key strengths of dplyr is its ability to interoperate with SQL databases. Given a database table as input, dplyr can translate the verbs in a pipeline into a SQL query which is then executed in the database. For large tables, this can often be much more efficient than importing the data and processing it locally. dplyrXdf can take advantage of this with an MRS data source that is a table in a SQL database, including (but not limited to) Microsoft SQL Server: rather than importing the data to Xdf, the data source is converted to a dplyr tbl and passed to the database for processing.

# copy the flights dataset to SQL Server
flightsSql <- RxSqlServerData("flights", connectionString=connStr)
flightsHd <- copy_to(flightsSql, nycflights13::flights)

# this is run inside SQL Server by dplyr
flightsQry <- flightsSql %>%
    filter(month > 6) %>%
    group_by(carrier) %>%
    summarise(avg_delay=mean(arr_delay))

flightsQry
#> # Source:   lazy query [?? x 2]
#> # Database: Microsoft SQL Server
#> #   13.00.4202[dbo@DESKTOP-TBHQGUH/sqlDemoLocal]
#>   carrier avg_delay
#>     <chr>     <dbl>
#> 1 "9E"        5.37 
#> 2 AA         -0.743
#> 3 AS        -16.9  
#> 4 B6          8.53 
#> 5 DL          1.55 
#> # ... with more rows

For more information about working with SQL databases including SQL Server, see the dplyrXdf SQL vignette and the dplyr database vignette.

Parallel processing and grouped data

Even without a Hadoop or Spark cluster, dplyrXdf makes it easy to parallelise the handling of groups. To do this, it takes advantage of Microsoft R Server’s distributed compute contexts: for example, if you set the compute context to “localpar”, grouped transformations will be done in parallel on a local cluster of R processes. The cluster will be shut down automatically when the transformation is complete.

More broadly, you can create a custom backend and tell dplyrXdf to use it by setting the compute context to “dopar”. This allows you a great deal of flexibility and scalability, for example by creating a cluster of multiple machines (as opposed to multiple cores on a single machine). Even if you do not have the physical machines, packages like AzureDSVM and doAzureParallel allow you to deploy clusters of VMs in the cloud, and then shut them down again. For more information, see the “Parallel processing of grouped data” section of the Using dplyrXdf vignette.

Data and file management

New in dplyrXdf 1.0.0 is a suite of functions to simplify managing Xdf files and data sources:

  • HDFS file management: upload and download files with hdfs_file_upload and hdfs_file_download; copy/move/delete files with hdfs_file_copy, hdfs_file_move, hdfs_file_remove; list files with hdfs_dir; and more
  • Xdf data management: upload and download datasets with copy_to, collect and compute; import/convert to Xdf with as_xdf; copy/move/delete Xdf data sources with copy_xdf, move_xdf and delete_xdf; and more
  • Other utilities: run a block of code in the local compute context with local_exec; convert an Xdf file to a data frame with as.data.frame; extract columns from an Xdf file with methods for [, [[ and pull

Obtaining dplyr and dplyrXdf

dplyrXdf 1.0.0 is available from GitHub. It requires Microsoft R Server 8.0 or higher, and dplyr 0.7 or higher. Note that dplyr 0.7 will not be in the MRAN snapshot that is your default repo, unless you are using the recently-released MRS 9.2; you can install it, and its dependencies, from CRAN. If you want to use the SQL Server and sparklyr integration facility, you should install the odbc, dbplyr and sparklyr packages as well.

install.packages(c("dplyr", "dbplyr", "odbc", "sparklyr"),
                 repos="https://cloud.r-project.org")
devtools::install_github("RevolutionAnalytics/dplyrXdf")

If you run into any bugs, or if you have any feedback, you can email me or log an issue at the Github repo.

Source by analyticsweekpick

Oct 18, 18: #AnalyticsClub #Newsletter (Events, Tips, News & more..)

[  COVER OF THE WEEK ]

image
Ethics  Source

[ LOCAL EVENTS & SESSIONS]

More WEB events? Click Here

[ NEWS BYTES]

>> Cray Inc (NASDAQ:CRAY) Institutional Investor Sentiment Analysis – Thorold News Under Sentiment Analysis

>> Crimson Hexagon’s Plight In Five Words: Facebook Doesn’t Want … – AdExchanger Under Social Analytics

>> Unisys Unveils TrustCheck™, the First Subscription-Based Service … – APN News Under Risk Analytics

More NEWS ? Click Here

[ FEATURED COURSE]

Process Mining: Data science in Action

image

Process mining is the missing link between model-based process analysis and data-oriented analysis techniques. Through concrete data sets and easy to use software the course provides data science knowledge that can be ap… more

[ FEATURED READ]

The Industries of the Future

image

The New York Times bestseller, from leading innovation expert Alec Ross, a “fascinating vision” (Forbes) of what’s next for the world and how to navigate the changes the future will bring…. more

[ TIPS & TRICKS OF THE WEEK]

Data Analytics Success Starts with Empowerment
Being data driven is not so much a tech challenge as an adoption challenge. Adoption has its roots in the cultural DNA of any organization. Great data-driven organizations run the data-driven culture into their corporate DNA. A culture of connection, interaction, sharing and collaboration is what it takes to be data driven. It's about being empowered more than it is about being educated.

[ DATA SCIENCE Q&A]

Q:How would you define and measure the predictive power of a metric?
A: * Predictive power of a metric: the accuracy of the metric at predicting the empirical outcome
* They are all domain specific
* Example: in a field like manufacturing, failure rates of tools are easily observable. A metric can be trained, and its success can be measured as the deviation over time from the observed rates
* In information security: if the metric says that an attack is coming and one should do X, did the recommendation stop the attack, or would the attack never have happened anyway?

Source
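
To make the manufacturing example concrete (an illustrative sketch, not from the original answer), a candidate metric can be scored by how well it separates observed failures from non-failures, for instance with ROC AUC and a rank correlation:

import numpy as np
from scipy.stats import spearmanr
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)
metric = rng.normal(size=500)                        # candidate metric values
failed = (metric + rng.normal(0, 1.5, 500)) > 1.0    # noisy observed outcome

rho, p = spearmanr(metric, failed)
print("AUC:", round(roc_auc_score(failed, metric), 3))
print("Spearman:", round(rho, 3))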

[ VIDEO OF THE WEEK]

Understanding #BigData #BigOpportunity in Big HR by @MarcRind #FutureOfData #Podcast

 Understanding #BigData #BigOpportunity in Big HR by @MarcRind #FutureOfData #Podcast

Subscribe to  Youtube

[ QUOTE OF THE WEEK]

You can use all the quantitative data you can get, but you still have to distrust it and use your own intelligence and judgment. – Alvin Toffler

[ PODCAST OF THE WEEK]

@Schmarzo @DellEMC on Ingredients of healthy #DataScience practice #FutureOfData #Podcast

 @Schmarzo @DellEMC on Ingredients of healthy #DataScience practice #FutureOfData #Podcast

Subscribe 

iTunes  GooglePlay

[ FACT OF THE WEEK]

IDC estimates that by 2020, business transactions on the internet (business-to-business and business-to-consumer) will reach 450 billion per day.

Sourced from: Analytics.CLUB #WEB Newsletter

Periodic Table Personified [image]

Have you ever tried memorizing the periodic table? It is a daunting task, as it has a lot of elements, all coded with one- or two-letter symbols. So, what is the solution? There are various methods used to do that. For one, check out Wonderful Life with the Elements: The Periodic Table Personified by Bunpei Yorifuji. In his effort, Bunpei personified all the elements. It is a fun way to identify each element and make it easily recognizable.

In his book, Yorifuji makes the many elements seem a little more individual by illustrating each one as an anthropomorphic cartoon character, with distinctive hairstyles and clothes to help readers tell them apart. For example, nitrogen atoms have mohawks because they “hate normal,” while noble gases have afros because they are “too cool” to react to extreme heat or cold. Man-made elements are depicted in robot suits, while elements used in industrial applications wear business attire.



Image by Wired

Source: Periodic Table Personified [image] by v1shal

Exploring the Structure of High-Dimensional Data with HyperTools in Kaggle Kernels

The datasets we encounter as scientists, analysts, and data nerds are increasingly complex. Much of machine learning is focused on extracting meaning from complex data. However, there is still a place for us lowly humans: the human visual system is phenomenal at detecting complex structure and discovering subtle patterns hidden in massive amounts of data. Every second that our eyes are open, countless data points (in the form of light patterns hitting our retinas) are pouring into visual areas of our brain. And yet, remarkably, we have no problem at all recognizing a neat looking shell on a beach, or our friend’s face in a large crowd. Our brains are “unsupervised pattern discovery aficionados.”

On the other hand, there is at least one major drawback to relying on our visual systems to extract meaning from the world around us: we are essentially capped at perceiving just 3 dimensions at a time, and many datasets we encounter today are higher dimensional.

So, the question of the hour is: how can we harness the incredible pattern-recognition superpowers of our brains to visualize complex and high-dimensional datasets?

Dimensionality Reduction

In comes dimensionality reduction, stage right. Dimensionality reduction is just what it sounds like: transforming a high-dimensional dataset into a lower-dimensional dataset. For example, take this UCI ML dataset on Kaggle comprising observations about mushrooms, organized as a big matrix. Each row comprises a bunch of features of the mushroom, like cap size, cap shape, cap color, odor etc. The simplest way to do dimensionality reduction might be to simply ignore some of the features (e.g. pick your favorite three—say size, shape, and color—and ignore everything else). However, this is problematic if the features you drop contain valuable diagnostic information (e.g. whether the mushrooms are poisonous).

A more sophisticated approach is to reduce the dimensionality of the dataset by only considering its principal components, or the combinations of features that explain the most variance in the dataset. Using a technique called principal components analysis (or PCA), we can reduce the dimensionality of a dataset, while preserving as much of its precious variance as possible. The key intuition is that we can create a new set of (a smaller number of) features, where each of the new features is some combination of the old features. For example, one of these new features might reflect a mix of shape and color, and another might reflect a mix of size and poisonousness. In general, each new feature will be constructed from a weighted mix of the original features.

Below is a figure to help with the intuition. Imagine that you had a 3 dimensional dataset (left), and you wanted to reduce it to a 2 dimensional dataset (right). PCA finds the principal axes in the original 3D space where the variance between points is the highest. Once we identify the two axes that explain the most variance (the black lines in the left panel), we can re-plot the data along just those axes, as shown on the right. Our 3D dataset is now 2D. Here we have chosen a low-dimensional example so we could visualize what is happening. However, this technique can be applied in the same way to higher-dimensional datasets.
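
As a concrete reference point (not part of the original post), here is a minimal scikit-learn sketch of the same 3D-to-2D reduction, using a synthetic point cloud whose variance mostly lies in a plane:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# synthetic 3D point cloud that is nearly flat: most variance sits in a 2D plane
latent = rng.normal(size=(500, 2))
points_3d = latent @ np.array([[1.0, 0.5, 0.2],
                               [0.0, 1.0, 0.7]]) + rng.normal(0, 0.1, (500, 3))

pca = PCA(n_components=2).fit(points_3d)
points_2d = pca.transform(points_3d)
print(points_2d.shape)                        # (500, 2)
print(pca.explained_variance_ratio_.sum())    # close to 1: little variance lost

HyperTools wraps this kind of reduction, plus the plotting, into a single call, as described next.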

We created the HyperTools package to facilitate these sorts of dimensionality reduction-based visual explorations of high-dimensional data. The basic pipeline is to feed in a high-dimensional dataset (or a series of high-dimensional datasets) and, in a single function call, reduce the dimensionality of the dataset(s) and create a plot. The package is built atop many familiar friends, including matplotlib, scikit-learn and seaborn. HyperTools is designed with ease of use as a primary objective. We highlight two example use cases below.

Mushroom foraging with HyperTools: Visualizing static ‘point clouds’

First, let’s explore the mushrooms dataset we referenced above. We start by importing the relevant libraries:

import pandas as pd
import hypertools as hyp

and then we read our data into a pandas DataFrame:

data = pd.read_csv('../input/mushrooms.csv')
data.head()
index  class  cap-shape  cap-surface  cap-color  bruises  odor  gill-attachment
0      p      x          s            n          t        p     f
1      e      x          s            y          t        a     f
2      e      b          s            w          t        l     f
3      p      x          y            w          t        p     f
4      e      x          s            g          f        n     f
5      e      x          y            y          t        a     f

Each row of the DataFrame corresponds to a mushroom observation, and each column reflects a descriptive feature of the mushroom (only some of the rows and columns are shown above). Now let’s plot the high-dimensional data in a low dimensional space by passing it to HyperTools. To handle text columns, HyperTools will first convert each text column into a series of binary ‘dummy’ variables before performing the dimensionality reduction. For example, if the ‘cap size’ column contained ‘big’ and ‘small’ labels, this single column would be turned into two binary columns: one for ‘big’ and one for ‘small’, where 1s represents the presence of that feature and 0s represents the absence (for more on this, see the documentation for the get_dummies function in pandas).
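
For intuition, the dummy-coding step looks roughly like this; the column names here are illustrative, not the exact HyperTools internals:

import pandas as pd

# two hypothetical categorical columns become one 0/1 indicator column per level
toy = pd.DataFrame({'cap-size':  ['big', 'small', 'big'],
                    'cap-color': ['n', 'y', 'w']})
print(pd.get_dummies(toy))

Back to the mushrooms DataFrame itself: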

hyp.plot(data, 'o')

In plotting the DataFrame, we are effectively creating a three-dimensional “mushroom space,” where mushrooms that exhibit similar features appear as nearby dots, and mushrooms that exhibit different features appear as more distant dots. By visualizing the DataFrame in this way, it becomes immediately clear that there are multiple clusters in the data. In other words, all combinations of mushroom features are not equally likely, but rather certain combinations of features tend to go together. To better understand this space, we can color each point according to some feature in the data that we are interested in knowing more about. For example, let’s color the points according to whether the mushrooms are (p)oisonous or (e)dible (the class_labels feature):

hyp.plot(data,'o', group=class_labels, legend=list(set(class_labels)))

Visualizing the data in this way highlights that mushrooms’ poisonousness appears stable within each cluster (e.g. mushrooms that have similar features), but varies across clusters. In addition, it looks like there are a number of distinct clusters that are poisonous/edible. We can explore this further by using the ‘cluster’ feature of HyperTools, which colors the observations using k-means clustering. In the description of the dataset, it was noted that there were 23 different types of mushrooms represented in this dataset, so we’ll set the n_clusters parameter to 23:

hyp.plot(data, 'o', n_clusters=23)

To gain access to the cluster labels, the clustering tool may be called directly using hyp.tools.cluster, and the resulting labels may then be passed to hyp.plot:

cluster_labels = hyp.tools.cluster(data, n_clusters=23)
hyp.plot(data, group=cluster_labels)

By default, HyperTools uses PCA to do dimensionality reduction, but with a few additional lines of code we can use other dimensionality reduction methods by directly calling the relevant functions from sklearn. For example, we can use t-SNE to reduce the dimensionality of the data using:

from sklearn.manifold import TSNE
TSNE_model = TSNE(n_components=3)
reduced_data_TSNE = TSNE_model.fit_transform(hyp.tools.df2mat(data))
hyp.plot(reduced_data_TSNE,'o', group=class_labels, legend=list(set(class_labels)))

Different dimensionality reduction methods highlight or preserve different aspects of the data. A repository containing additional examples (including different dimensionality reduction methods) may be found here.

The data expedition above provides one example of how the geometric structure of data may be revealed through dimensionality reduction and visualization. The observations in the mushrooms dataset formed distinct clusters, which we identified using HyperTools. Explorations and visualizations like this could help guide analysis decisions (e.g. whether to use a particular type of classifier to discriminate poisonous vs. edible mushrooms). If you’d like to play around with HyperTools and the mushrooms dataset, check out and fork this Kaggle Kernel!

Climate science with HyperTools: Visualizing dynamic data

Whereas the mushrooms dataset comprises static observations, here we will take a look at some global temperature data, which will showcase how HyperTools may be used to visualize timeseries data using dynamic trajectories.

This next dataset is made up of monthly temperature recordings from a sample of 20 global cities over the 138-year interval from 1875 to 2013. To prepare this dataset for analysis with HyperTools, we created a time-by-cities matrix, where each row is a temperature recording for a subsequent month, and each column is the temperature value for a different city. You can replicate this demo by using the Berkeley Earth Climate Change dataset on Kaggle or by cloning this GitHub repo. To visualize temperature changes over time, we will use HyperTools to reduce the dimensionality of the data and then plot the temperature changes over time as a line.
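
Here is a rough sketch of how such a time-by-cities matrix might be assembled with pandas. It is not code from the original post, and it assumes the Kaggle file GlobalLandTemperaturesByMajorCity.csv with its usual dt, City and AverageTemperature columns:

import pandas as pd

# assumed file name and column layout from the Kaggle Berkeley Earth dataset
raw = pd.read_csv('GlobalLandTemperaturesByMajorCity.csv', parse_dates=['dt'])
raw = raw[(raw.dt.dt.year >= 1875) & (raw.dt.dt.year <= 2013)]

cities = raw['City'].unique()[:20]                     # a sample of 20 cities
temps = (raw[raw['City'].isin(cities)]
         .pivot_table(index='dt', columns='City', values='AverageTemperature')
         .dropna())                                    # rows: months, columns: cities
years = temps.index.year.values                        # year labels, used below for coloring

With the temps matrix in hand, the HyperTools call is a one-liner: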

hyp.plot(temps)

Well, that just looks like a hot mess, now doesn’t it? However, we promise there is structure in there, so let’s find it! Because each city is in a different location, the mean and variance of its temperature timeseries may be higher or lower than those of the other cities. This will in turn affect how much that city is weighted when dimensionality reduction is performed. To normalize the contribution of each city to the plot, we can set the normalize flag (default value: False). Setting normalize='across' will normalize (z-score) each column of the data. HyperTools incorporates a number of useful normalization options, which you can read more about here.
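
As a point of reference (not from the original post), normalize='across' amounts to the column-wise z-scoring you could do by hand in pandas before plotting:

# z-score each column (city) of the assumed temps DataFrame from above
temps_z = (temps - temps.mean()) / temps.std()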

hyp.plot(temps, normalize='across')

Now we’re getting somewhere! Rotating the plot with the mouse reveals an interesting shape to this dataset. To help highlight the structure and understand how it changes over time, we can color the lines by year, where redder lines indicate earlier and bluer lines indicate later timepoints:

hyp.plot(temps, normalize='across', group=years.flatten(), palette='RdBu_r')

Coloring the lines has now revealed two key structural aspects of the data. First, there is a systematic shift from blue to red, indicating a systematic change in the pattern of global temperatures over the years reflected in the dataset. Second, within each year (color), there is a cyclical pattern, reflecting seasonal changes in the temperature patterns. We can also visualize these two phenomena using a two dimensional plot:

hyp.plot(temps, normalize='across', group=years.flatten(), palette='RdBu_r', ndims=2)

Now, for the grand finale. In addition to creating static plots, HyperTools can also create animated plots, which can sometimes reveal additional patterns in the data. To create an animated plot, simply pass animate=True to hyp.plot when visualizing timeseries data. If you also pass chemtrails=True, a low-opacity trace of the data will remain in the plot:

hyp.plot(temps, normalize='across', animate=True, chemtrails=True)

That pleasant feeling you get from looking at the animation is called “global warming.”

This concludes our exploration of climate and mushroom data with HyperTools. For more, please visit the project’s GitHub repository, readthedocs site, a paper we wrote, or our demo notebooks.

Bio

Andrew is a Cognitive Neuroscientist in the Contextual Dynamics Laboratory. His postdoctoral work integrates ideas from basic learning and memory research with computational techniques used in data science to optimize learning in natural educational settings, like the classroom or online. Additionally, he develops open-source software for data visualization, research and education.

The Contextual Dynamics Lab at Dartmouth College uses computational models and brain recordings to understand how we extract information from the world around us. You can learn more about us at http://www.context-lab.com.

Source: Exploring the Structure of High-Dimensional Data with HyperTools in Kaggle Kernels

Oct 11, 18: #AnalyticsClub #Newsletter (Events, Tips, News & more..)

[  COVER OF THE WEEK ]

image
statistical anomaly  Source

[ LOCAL EVENTS & SESSIONS]

More WEB events? Click Here

[ AnalyticsWeek BYTES]

>> How to Successfully Incorporate Analytics Into Your Growth Marketing Process by analyticsweek

>> How the lack of the right data affects the promise of big data in India by analyticsweekpick

>> SDN and network function virtualization market worth $ 45.13 billion by 2020 by analyticsweekpick

Wanna write? Click Here

[ NEWS BYTES]

>> Marketing Analytics Software Market Effect and Growth Factors Research and Projection – Coherent News (press release) (blog) Under Marketing Analytics

>> Streaming Analytics Market Research Study including Growth Factors, Types and Application by regions from 2017 to … – managementjournal24.com Under Streaming Analytics

>> State Street: Latest investor sentiment towards Brexit – Asset Servicing Times Under Risk Analytics

More NEWS ? Click Here

[ FEATURED COURSE]

CPSC 540 Machine Learning

image

Machine learning (ML) is one of the fastest growing areas of science. It is largely responsible for the rise of giant data companies such as Google, and it has been central to the development of lucrative products, such … more

[ FEATURED READ]

The Future of the Professions: How Technology Will Transform the Work of Human Experts

image

This book predicts the decline of today’s professions and describes the people and systems that will replace them. In an Internet society, according to Richard Susskind and Daniel Susskind, we will neither need nor want … more

[ TIPS & TRICKS OF THE WEEK]

Fix the Culture, spread awareness to get awareness
Adoption of analytics tools and capabilities has not yet caught up to industry standards. Talent has always been the bottleneck to achieving comparable enterprise adoption. One of the primary reasons is a lack of understanding and knowledge among stakeholders. To facilitate wider adoption, data analytics leaders, users, and community members need to step up to create awareness within the organization. An aware organization goes a long way in helping get quick buy-ins and better funding, which ultimately leads to faster adoption. So be the voice that you want to hear from leadership.

[ DATA SCIENCE Q&A]

Q:What are the drawbacks of linear model? Are you familiar with alternatives (Lasso, ridge regression)?
A: * Assumption of linearity of the errors
* Can’t be used for count outcomes, binary outcomes
* Can’t vary model flexibility: overfitting problems
* Alternatives: see question 4 about regularization

Source
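
A quick illustrative comparison of the alternatives mentioned above (not from the original answer): both ridge and lasso shrink coefficients, and lasso drives the uninformative ones to exactly zero.

import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(0, 1, 200)   # only 2 informative features

for model in (LinearRegression(), Ridge(alpha=1.0), Lasso(alpha=0.1)):
    coefs = model.fit(X, y).coef_
    print(type(model).__name__, np.round(coefs, 2))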

[ VIDEO OF THE WEEK]

Understanding #FutureOfData in #Health & #Medicine - @thedataguru / @InovaHealth #FutureOfData #Podcast

 Understanding #FutureOfData in #Health & #Medicine – @thedataguru / @InovaHealth #FutureOfData #Podcast

Subscribe to  Youtube

[ QUOTE OF THE WEEK]

Data matures like wine, applications like fish. – James Governor

[ PODCAST OF THE WEEK]

Discussing Forecasting with Brett McLaughlin (@akabret), @Akamai

 Discussing Forecasting with Brett McLaughlin (@akabret), @Akamai

Subscribe 

iTunes  GooglePlay

[ FACT OF THE WEEK]

94% of Hadoop users perform analytics on large volumes of data not possible before; 88% analyze data in greater detail; while 82% can now retain more of their data.

Sourced from: Analytics.CLUB #WEB Newsletter

The intersection of analytics, social media and cricket in the cognitive era of computing

Photo Credit: Getty Images.

Since 1975, every fourth year, world class cricketing nations come together for a month long extravaganza. Since February 14th, 2015, 14 teams are battling it out in a 6 week long Cricket World Cup tournament in 14 venues across Australia and New Zealand. During these six weeks, millions of cricket fans alter their schedule to relish the game of cricket world-wide in unison. It is a cricket carnival of sorts.

The Cricket World Cup is at its peak. All eyes are glued to the sport and not much goes unobserved – on the field or off it. Whether it is Shikhar Dhawan scoring a century or Virat Kohli’s anger, cricket enthusiasts are having a ball. There is, however, another segment that thrives as cricket fever reaches a high: the dedicated cricket follower who follows each ball and takes stock of each miss; for him, each run is a statistic and each delivery an opportunity. I remember years ago, we all used to be hooked to a radio while on the move, keeping track of ball-by-ball updates; then came television; and now, with social media, stakeholder engagement has become phenomenally addictive.

Such a big fan base is bound to open opportunities for sports tourism, brand endorsements and global partnerships. The CWC is an event that many business enterprises tap in order to make their presence felt. With adequate assistance from technology and in-depth insights, the possibilities for stakeholder engagement and scaling up ventures are greater than ever before.

The sports industry is perhaps one of the biggest enterprises to have willingly adopted technology to change the game for players, viewers, organizers and broadcasters alike. Pathbreaking advances in technology have made the experience of following the game finer and more nuanced. It is no longer just about what is happening on the field but about what happened on similar occasions in the past, and what could possibly happen given the past records of the team and the players. This ever-growing back and forth between information and analysis makes for a cricket lover’s paradise.

Cognitive analysis of such large data is no longer just a dream. Machine learning algorithms are getting smarter day by day using cloud computing on clusters, and that is about to change the whole landscape of human experience and involvement. To understand what the CWC means to different people from various backgrounds, it is important to understand their psychology and perception of the game. A deeper look can bring us closer to understanding how technology, analytics and big data are in fact changing the dynamics of cricket.

A common man’s perspective

The Cricket World Cup, to a common man, is about sneaking a win from close encounters, high-scoring run fests, electric crowds, blind faith in his team and something to chew on, spicing his opinions after the game is over. With small boundaries, better willows, fit athletes and pressure situations to overcome to be victorious, every contest is an absolute delight to watch and closely follow. Cricket fans are so deeply attached to the game and the players that every bit of juicy information about the game enthralls them.

A geek’s perspective

In the last forty years, the use of technology has changed the game of cricket on the field. Years ago, the snickometer was considered revolutionary; then came the pathbreaking Hawk eye, followed by Pitchvision, DRS (Decision Review System) and now flashing cricket bails. For cricketers this has meant a better reviewing process: they now understand their game better, correct their mistakes, prepare against their weaknesses and plan specific strategies against individual players of the opposing team. For cricket followers and business houses it has meant better engagement with the audience, a deeper personalised experience and a detailed understanding of what works and what does not.

This increase in the viewer-engagement quotient has been boosted with Matchups covering past records on player performance, match stats etc. Wisden captures data from each match and provides the basis of comparatives around player potential, strike rate, runs in the middle overs, important players in the death overs etc.

While Wisden India provides all the data points, IBM’s Analytics engine processes the information into meaningful assets of historical data, making it possible to predict future outcomes. For CWC 2015, IBM has partnered with Wisden to provide viewers with live match analysis and player-performance insights, which are frequently used by commentators and coaches to keep viewers glued to the match proceedings.

Just like it makes insightful observations from a vast trove of data in cricket, IBM’s Analytics Engine equips organizations to take advantage of all data to make better business decisions. Using analytics, instead of instinct, can help organizations provide personalized experiences to their customers, spot new opportunities and risks, and create new business models.

Similarly, with social media outreach, the overall engagement of viewers in the game has become crucial in boosting confidence of a team or succumbing to the pressure of the masses.

Aggregating shared opinion on social sites is a key to highlighting the expectations, generating perceived predictions about the teams, potential wins, most popular players etc.

To give an idea of the numbers and technology involved: as part of Twitterati, IBM processed about 3 million tweets on average on a two-match day, analysed at 10-minute intervals.

IBM Cloudant was used to store tweets crawled from Twitter with match- or tournament-specific hashtags. As needed, IBM fetched the tweets from Cloudant and generated the events specific to every match. IBM Bluemix automates the process of getting tweets from Twitter and generating the events corresponding to every match, given the schedule of the Cricket World Cup tournament. The application is hosted in Bluemix. Apart from these technologies, IBM developed the core engine that identifies events from the Twitter feed.

The Social Sentiment Index analyzed around 3.5 million tweets, by tracking about 700 match-specific events daily in Twitter. IBM Data Curation and Integration capabilities were used on BigInsights and Social Data Accelerator (SDA) to extract social insights from streaming twitter feed in real time.

Moreover, IBM Text Analytics and Natural Language Processing perform fine-grained temporal analytics around events that have a short lifespan but are important — events like boundaries, sixes and wickets.

IBM Social Media Analytics also examines the volume of discussion around teams, players and events. It examines sentiment across different entities, identifies topics that are trending, and understands ways in which advertisers can use the discussion to appropriately position their products and services.

IBM Content Analytics examines the large social content more deeply and tries to mimic human cognition and learning behavior to answer complex questions, such as the impact of a certain player or which attributes determine the outcome of the game.

An enterprise perspective

What is most interesting to businesses, however, is that observing these campaigns helps in understanding consumer sentiment to drive sales initiatives. With the right business insights delivered in the nick of time, in line with social trends, several brands have come up with lucrative offers one can’t refuse. In earlier days, this kind of marketing required pumping in a lot of money and waiting several weeks before one could analyse and assess the commercial success of a business idea. With tools like IBM Analytics at hand, one can not only grab the data needed and assess it so it makes business sense, but also anticipate the market response.

Imagine how, in the right hands, especially in data-sensitive industries, the ability to analyze large-scale structured and unstructured data, combined with cloud computing and cognitive machine learning, can lead to capable and interesting solutions with weighted recommendations at your disposal.

The potential of the idea already sounds like a game-changer to me. When I look around, every second person is tweeting and posting about everything around them. There are volumes of data waiting to be analyzed. With the power to process the virality of events in real time across devices, sensors and applications, I can vouch that, with data mining and business intelligence capabilities, cloud computing can significantly improve and empower businesses to run focused campaigns.

With engines like Social Data Accelerator, Cloudant and Social Data Curation at your service, social data analysis can be democratized with fair accuracy, opening new channels of business that have not been identified so far. CWC 2015 insight is just the beginning. Howzzat?

Originally posted via “The intersection of analytics, social media and cricket in the cognitive era of computing”

Source: The intersection of analytics, social media and cricket in the cognitive era of computing by analyticsweekpick

Oct 04, 18: #AnalyticsClub #Newsletter (Events, Tips, News & more..)

[  COVER OF THE WEEK ]

image
Data interpretation  Source

[ LOCAL EVENTS & SESSIONS]

More WEB events? Click Here

[ AnalyticsWeek BYTES]

>> The End of Transformation: Expediting Data Preparation and Analytics with Edge Computing by jelaniharper

>> Movie Recommendations? How Does Netflix Do It? A 9 Step Coding & Intuitive Guide Into Collaborative Filtering by nbhaskar

>> Your Firm’s Culture Need to Catch Up with its Business Analytics? by analyticsweekpick

Wanna write? Click Here

[ FEATURED COURSE]

Artificial Intelligence

image

This course includes interactive demonstrations which are intended to stimulate interest and to help students gain intuition about how artificial intelligence methods work under a variety of circumstances…. more

[ FEATURED READ]

Big Data: A Revolution That Will Transform How We Live, Work, and Think

image

“Illuminating and very timely . . . a fascinating — and sometimes alarming — survey of big data’s growing effect on just about everything: business, government, science and medicine, privacy, and even on the way we think… more

[ TIPS & TRICKS OF THE WEEK]

Save yourself from zombie apocalypse from unscalable models
One living and breathing zombie in today’s analytical models is the pulsating absence of error bars. Not every model is scalable or holds ground with increasing data. Error bars that are tagged to almost every model should be duly calibrated. As business models rake in more data, the error bars keep them sensible and in check. If error bars are not accounted for, we make our models susceptible to failure, leading us to a Halloween that we never want to see.
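
A minimal sketch of the idea (illustrative only): report a model metric with a bootstrap confidence interval, i.e. an error bar, rather than a single point estimate.

import numpy as np

rng = np.random.default_rng(0)
errors = rng.normal(0, 1, 1000)            # stand-in for a model's per-row errors
# bootstrap the mean absolute error to get a 95% interval
boot = [np.abs(rng.choice(errors, errors.size)).mean() for _ in range(2000)]
low, high = np.percentile(boot, [2.5, 97.5])
print(f"MAE = {np.abs(errors).mean():.3f}  (95% CI {low:.3f} to {high:.3f})")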

[ DATA SCIENCE Q&A]

Q:Give examples of bad and good visualizations?
A: Bad visualization:
– Pie charts: difficult to make comparisons between items when area is used, especially when there are lots of items
– Color choice for classes: abundant use of red, orange and blue. Readers can think that the colors could mean good (blue) versus bad (orange and red) whereas these are just associated with a specific segment
– 3D charts: can distort perception and therefore skew data
– Dashed and dotted lines in a line chart: they can be distracting; a single solid line is usually cleaner

Good visualization:
– Heat map with a single color: some colors stand out more than others, giving more weight to that data. A single color with varying shades shows the intensity better
– Adding a trend line (regression line) to a scatter plot helps the reader see the trend

Source
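
A sketch of the two “good” examples above, using matplotlib and numpy only (illustrative, not from the original answer):

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 4))

# single-hue heat map: intensity alone carries the information
ax1.imshow(rng.random((10, 10)), cmap='Blues')
ax1.set_title('Heat map, single hue')

# scatter plot with a fitted trend (regression) line
x = rng.normal(size=100)
y = 2 * x + rng.normal(0, 1, 100)
slope, intercept = np.polyfit(x, y, 1)
ax2.scatter(x, y, s=10)
ax2.plot(np.sort(x), slope * np.sort(x) + intercept, color='black')
ax2.set_title('Scatter with trend line')
plt.show()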

[ VIDEO OF THE WEEK]

Solving #FutureOfOrgs with #Detonate mindset (by @steven_goldbach & @geofftuff) #FutureOfData #Podcast

 Solving #FutureOfOrgs with #Detonate mindset (by @steven_goldbach & @geofftuff) #FutureOfData #Podcast

Subscribe to  Youtube

[ QUOTE OF THE WEEK]

The most valuable commodity I know of is information. – Gordon Gekko

[ PODCAST OF THE WEEK]

#BigData @AnalyticsWeek #FutureOfData with Jon Gibs(@jonathangibs) @L2_Digital

 #BigData @AnalyticsWeek #FutureOfData with Jon Gibs(@jonathangibs) @L2_Digital

Subscribe 

iTunes  GooglePlay

[ FACT OF THE WEEK]

The Hadoop (open source software for distributed computing) market is forecast to grow at a compound annual growth rate of 58%, surpassing $1 billion by 2020.

Sourced from: Analytics.CLUB #WEB Newsletter

Seven ways predictive analytics can improve healthcare

shutterstock_93383260

Everyone is a patient at some time or another, and we all want good medical care. We assume that doctors are all medical experts and that there is good research behind all their decisions.

Physicians are smart, well trained and do their best to stay up to date with the latest research. But they can’t possibly commit to memory all the knowledge they need for every situation, and they probably don’t have it all at their fingertips. Even if they did have access to the massive amounts of data needed to compare treatment outcomes for all the diseases they encounter, they would still need time and expertise to analyze that information and integrate it with the patient’s own medical profile. But this kind of in-depth research and statistical analysis is beyond the scope of a physician’s work.

That’s why more and more physicians – as well as insurance companies – are using predictive analytics.

Predictive analytics (PA) uses technology and statistical methods to search through massive amounts of information, analyzing it to predict outcomes for individual patients. That information can include data from past treatment outcomes as well as the latest medical research published in peer-reviewed journals and databases.

Not only can PA help with predictions, but it can also reveal surprising associations in data that our human brains would never suspect.

In medicine, predictions can range from responses to medications to hospital readmission rates. Examples are predicting infections from methods of suturing, determining the likelihood of disease, helping a physician with a diagnosis, and even predicting future wellness.

The statistical methods are called learning models because they can grow in precision with additional cases. There are two major ways in which PA differs from traditional statistics (and from evidence-based medicine):

  • First, predictions are made for individuals and not for groups
  • Second, PA does not rely upon a normal (bell-shaped) curve.

Prediction modelling uses techniques such as artificial intelligence to create a prediction profile (algorithm) from past individuals. The model is then “deployed” so that a new individual can get a prediction instantly for whatever the need is, whether a bank loan or an accurate diagnosis.
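
As a toy illustration of that build-then-deploy loop (synthetic data and made-up feature names, not a clinical model):

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# past individuals: age and number of prior admissions -> readmitted within 30 days?
X_past = np.column_stack([rng.normal(65, 10, 500), rng.poisson(1.5, 500)])
y_past = (0.04 * X_past[:, 0] + 0.6 * X_past[:, 1] + rng.normal(0, 1, 500)) > 3.5

profile = LogisticRegression(max_iter=1000).fit(X_past, y_past)   # the "prediction profile"

new_patient = np.array([[72, 3]])                    # model deployed on one new case
print("predicted risk:", round(profile.predict_proba(new_patient)[0, 1], 2))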

In this post, I discuss the top seven benefits of PA to medicine – or at least how they will be beneficial once PA techniques are known and widely used. In the United States, many physicians are just beginning to hear about predictive analytics and are realizing that they have to make changes as the government regulations and demands have changed. For example, under the Affordable Care Act, one of the first mandates within Meaningful Use demands that patients not be readmitted before 30 days of being dismissed from the hospital. Hospitals will need predictive models to accurately assess when a patient can safely be released.

1. Predictive analytics increase the accuracy of diagnoses.

Physicians can use predictive algorithms to help them make more accurate diagnoses. For example, when patients come to the ER with chest pain, it is often difficult to know whether the patient should be hospitalized. If the doctors were able to enter answers to questions about the patient and his condition into a system with a tested and accurate predictive algorithm that would assess the likelihood that the patient could be sent home safely, then their own clinical judgments would be aided. The prediction would not replace their judgment but rather assist it.

In a visit to one’s primary care physician, the following might occur: The doctor has been following the patient for many years. The patient’s genome includes a gene marker for early onset Alzheimer’s disease, determined by researchers using predictive analytics. This gene is rare and runs in the patient’s family on one side. Several years ago, when it was first discovered, the patient agreed to have his blood taken to see if he had the gene. He did. There was no gene treatment available, but evidence based research indicated to the PCP conditions that may be helpful for many early Alzheimer’s patients.

Ever since, the physician has had the patient engaging in exercise, good nutrition, and brain games apps that the patient downloaded on his smart phone and which automatically upload to the patient’s portal. Memory tests are given on a regular basis and are entered into the electronic medical record (EMR), which also links to the patient portal. The patient himself adds data weekly onto his patient portal to keep track of time and kinds of exercises, what he is eating, how he has slept, and any other variable that his doctor wishes to keep track of.

Because the PCP has a number of Alzheimer’s patients, the PCP has initiated an ongoing predictive study with the hope of developing a predictive model for individual likelihood of memory maintenance and uses, with permission, the data thus entered through the patients’ portals. At this visit, the physician shares the good news that a gene therapy has been discovered for the patient’s specific gene and recommends that the patient receive such therapy.

2. Predictive analytics will help preventive medicine and public health.

With early intervention, many diseases can be prevented or ameliorated. Predictive analytics, particularly within the realm of genomics, will allow primary care physicians to identify at-risk patients within their practice. With that knowledge, patients can make lifestyle changes to avoid risks (An interview with Dr. Tim Armstrong on this WHO podcast explores the question: Do lifestyle changes improve health?)

As lifestyles change, population disease patterns may dramatically change with resulting savings in medical costs. As Dr. Daniel Kraft, Medicine and Neuroscience Chair at Stanford University, points out in his video Medicine 2064:

During the history of medicine, we have not been involved in healthcare; no, we’ve been consumed by sick care. We wait until someone is sick and then try to treat that person. Instead, we need to learn how to avoid illness and learn what will make us healthy. Genomics will play a huge part in the shift toward well-living.

As Dr. Kraft mentions, our future medications might be designed just for us because predictive analytics methods will be able to sort out what works for people with “similar subtypes and molecular pathways.”

3. Predictive analytics provides physicians with answers they are seeking for individual patients.

Evidence-based medicine (EBM) is a step in the right direction and provides more help than simple hunches for physicians. However, what works best for the middle of a normal distribution of people may not work best for an individual patient seeking treatment. PA can help doctors decide the exact treatments for those individuals. It is wasteful and potentially dangerous to give treatments that are not needed or that won’t work specifically for an individual. (This topic is covered in a paper by the Personalized Medicine Coalition.) Better diagnoses and more targeted treatments will naturally lead to increases in good outcomes and fewer resources used, including the doctor’s time.

4. Predictive analytics can provide employers and hospitals with predictions concerning insurance product costs.

Employers providing healthcare benefits for employees can input characteristics of their workforce into a predictive analytic algorithm to obtain predictions of future medical costs. Predictions can be based upon the company’s own data or the company may work with insurance providers who also have their own databases in order to generate the prediction algorithms. Companies and hospitals, working with insurance providers, can synchronize databases and actuarial tables to build models and subsequent health plans. Employers might also use predictive analytics to determine which providers may give them the most effective products for their particular needs. Built into the models would be the specific business characteristics. For example, if it is discovered that the average employee visits a primary care physician six times a year, those metrics can be included in the model.

Hospitals will also work with insurance providers as they seek to increase optimum outcomes and quality assurance for accreditation. In tailoring treatments that produce better outcomes, accreditation standards are both documented and increasingly met. (Likewise, predictive analytics can support the Accountable Care Organization (ACO) model, in that the primary goal of an ACO is the reduction of costs by treating specific patient populations successfully.) Supply chain management (SCM) for model hospitals and insurance providers will change as needs for resources change; in fact, when using PA, those organizations may see otherwise hidden opportunities for savings and increased efficiency. PA has a way of bringing our attention to that which may not have been seen before.

5. Predictive analytics allow researchers to develop prediction models that do not require thousands of cases and that can become more accurate over time.

In huge population studies, even very small differences can be “statistically significant.” Researchers understand that randomly assigned case control studies are superior to observational studies, but often it is simply not feasible to carry out such a design. From huge observational studies, the small but statistically significant differences are often not clinically significant. The media, ignorant of research nuances, may then focus on those small but statistically significant findings, convincing and sometimes frightening the public. Researchers also are to blame as sometimes they themselves do not understand the difference between statistical significance and clinical significance.

For example, in a TEDxColumbiaEngineering talk, Dr. David H. Newman spoke about the recent claim in the media that small to moderate alcohol consumption by women can result in higher rates of certain cancers. Many news programs and newspapers loudly and erroneously warned women not to drink even one alcoholic drink per day.

In contrast, with predictive analytics, initial models can be generated with smaller numbers of cases, and their accuracy may then be improved over time as cases are added. The models are alive, learning, and adapting with added information and with changes that occur in the population over time.

In order to make use of data across practices, electronic data record systems will need to be compatible with one another; interoperability, or this very coordination, is important and has been mandated by the US government. Governance around the systems will require transparency and accountability. One program suite, STATISTICA, is familiar with governance as it has worked with banks, pharmaceutical industries and government agencies. Using such a program will be crucial in order to offer “transparent” models, meaning they work smoothly with other programs, such as Microsoft and Visual Basic. In addition, STATISTICA can provide predictive models using double-blind elements and random assignment, satisfying the continued need for controlled studies.

On the other hand, some programs are proprietary, and users often have to pay the statistical company to use their own data. In addition, they may find that the system is not compatible with other systems if they need to make changes. When dealing with human life, the risks of making mistakes are increased, and the models used must lend themselves to making the systems valid, sharable and reliable.

6. Pharmaceutical companies can use predictive analytics to best meet the needs of the public for medications.

There will be incentives for the pharmaceutical industry to develop medications for ever smaller groups. Old medications, dropped because they were not used by the masses, may be brought back because drug companies will find it economically feasible to do so. In other words, previous big bulk medications are certain to be used less if they are found not to help many of those who were prescribed them. Less-used medications will be economically lucrative to revive and develop as research is able to predict those who might benefit from them. For example, if 25,000 people need to be treated with a medication “shotgun-style” in order to save 10 people, then much waste has occurred. All medications have unwanted side effects. The shotgun-style delivery method can expose patients to those risks unnecessarily if the medication is not needed for them. Dr. Newman (above) discussed the probable overuse of statins as one example.

7. Patients have the potential benefit of better outcomes due to predictive analytics.

There will be many benefits in quality of life to patients as the use of predictive analytics increase. Potentially individuals will receive treatments that will work for them, be prescribed medications that work for them and not be given unnecessary medications just because that medication works for the majority of people. The patient role will change as patients become more informed consumers who work with their physicians collaboratively to achieve better outcomes. Patients will become aware of possible personal health risks sooner due to alerts from their genome analysis, from predictive models relayed by their physicians, from the increasing use of apps and medical devices (i.e., wearable devices and monitoring systems), and due to better accuracy of what information is needed for accurate predictions. They then will have decisions to make about life styles and their future well being.

 

Conclusion:  Changes are coming in medicine worldwide.

In developed nations, such as the United States, predictive analytics are the next big idea in medicine – the next evolution in statistics – and roles will change as a result.

  • Patients will have to become better informed and will have to assume more responsibility for their own care, if they are to make use of the information derived.
  • Physicians’ roles will likely shift to that of a consultant rather than a decision maker, advising, warning and helping individual patients. Physicians may find more joy in practice as positive outcomes increase and negative outcomes decrease. Perhaps time with individual patients will increase, and physicians can once again have the time to form positive and lasting relationships with their patients. Time to think, to interact, to really help people; relationship formation is one of the reasons physicians say they went into medicine, and when it diminishes, so does their satisfaction with their profession.
  • Hospitals, pharmaceutical companies and insurance providers will see changes as well. For example, there may be fewer unnecessary hospitalizations, resulting initially in less revenue. Over time, however, admissions will be more meaningful, the market will adjust, and accomplishment will rise. Initially, revenues may also be lost by pharmaceutical and device companies, but then more specialized and individualized offerings will increase profits. They may be forced to find newer and better solutions for individuals, ultimately providing them with fresh sources of revenue. There may be increased governmental funds offered for those who are innovative in approach.

All in all, changes are coming. The genie is out of the box and, in fact, is building boxes for the rest of us. Smart industries will anticipate and prepare.

These changes can literally revolutionize the way medicine is practiced, for better health and disease reduction.

I think about the Bayer TV commercial in which a woman gets a note that says, “Your heart attack will arrive in two days.” The voiceover proclaims, “Laura’s heart attack didn’t come with a warning.” Not so with predictive analytics. That very message could be sent to Laura from her doctor who uses predictive analytics. Better yet, in our bright future, Laura might get the note from her doctor that says, “Your heart attack will occur eight years from now, unless …” – giving Laura the chance to restructure her life and change the outcome.

Note: This article originally appeared in Elsevier. Click for link here.

Source: Seven ways predictive analytics can improve healthcare