Fix the Culture: spread awareness to get adoption
Adoption of analytics tools and capabilities has not yet caught up to industry standards. Talent has long been the bottleneck to achieving comparable enterprise adoption, and one of the primary reasons is a lack of understanding and knowledge among stakeholders. To facilitate wider adoption, data analytics leaders, users, and community members need to step up and create awareness within the organization. An aware organization goes a long way toward quick buy-ins and better funding, which ultimately leads to faster adoption. So be the voice that you want to hear from leadership.
[ DATA SCIENCE Q&A]
Q: What is a decision tree?
A: 1. Take the entire data set as input
2. Search for a split that maximizes the separation of the classes. A split is any test that divides the data in two (e.g. if variable2 > 10)
3. Apply the split to the input data (divide step)
4. Re-apply steps 1 to 2 to the divided data
5. Stop when you meet some stopping criteria
6. (Optional) Clean up the tree when you went too far doing splits (called pruning)
Finding a split: methods vary, from greedy search (e.g. C4.5) to randomly selecting attributes and split points (random forests)
Purity measure: information gain, Gini coefficient, Chi Squared values
Stopping criteria: methods vary from minimum size, particular confidence in prediction, purity criteria threshold
Pruning: reduced error pruning, out of bag error pruning (ensemble methods)
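The steps above can be sketched in a few lines. This is an illustrative toy implementation, not any standard library's API: the function names are my own, Gini impurity is used as the purity measure, greedy search finds the split, and minimum node size is the stopping criterion.

```python
def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(rows, labels):
    """Greedy search over all (feature, threshold) splits; returns the one
    that minimizes weighted Gini impurity, or None if no split improves."""
    best, best_score = None, gini(labels)
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows}):
            left = [i for i, r in enumerate(rows) if r[f] <= t]
            right = [i for i in range(len(rows)) if i not in left]
            if not left or not right:
                continue
            score = (len(left) * gini([labels[i] for i in left]) +
                     len(right) * gini([labels[i] for i in right])) / len(rows)
            if score < best_score:
                best, best_score = (f, t), score
    return best

def build_tree(rows, labels, min_size=2):
    """Steps 1-5: split recursively until nothing helps or the node is small."""
    split = best_split(rows, labels)
    if split is None or len(rows) <= min_size:
        return max(set(labels), key=labels.count)  # leaf: majority class
    f, t = split
    left = [i for i, r in enumerate(rows) if r[f] <= t]
    right = [i for i in range(len(rows)) if i not in left]
    return (f, t,
            build_tree([rows[i] for i in left], [labels[i] for i in left], min_size),
            build_tree([rows[i] for i in right], [labels[i] for i in right], min_size))

def predict(tree, row):
    """Walk the tree from the root to a leaf."""
    while isinstance(tree, tuple):
        f, t, left, right = tree
        tree = left if row[f] <= t else right
    return tree

# Toy data: one feature, two classes separable around 5
tree = build_tree([[1], [2], [10], [11]], ["a", "a", "b", "b"])
```

Pruning (step 6) is omitted here to keep the sketch short.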
I saw the results of a recent opinion poll about the US presidential election that amazed me. While many recent polls of US voters reveal a virtual tie in the presidential race between Barack Obama and Mitt Romney, a BBC poll surveying citizens of other countries about the US president found overwhelming support for Barack Obama over Mitt Romney. In this late summer/early fall study by GlobeScan and PIPA of over 20,000 people across 21 countries, 50% favored Obama and 9% favored Romney.
Global Businesses Need Global Feedback
Companies conducting international business regularly poll their customers and prospects across the different countries they serve in hopes of gaining better insight into how to run their business. They use this feedback to help them understand where to enter new markets, guide product development, and improve service quality, just to name a few. The end goal is to create a loyal customer base (e.g., customers who come back, recommend, and expand the relationship).
The US government’s policies impact international relations on many levels (e.g., economically, financially and socially). Could there be some value from this international poll for the candidates themselves and their constituencies?
Looking at the results of the poll, a few implications stand out to me:
The Romney brand has little international support. Mitt Romney has touted that his business experience has prepared him to be an effective president. How can he use these results to improve his image abroad?
Many international citizens do not care about the US presidency (in about half of the countries, fewer than 50% of respondents expressed an opinion for either Obama or Romney).
After four years of an Obama presidency, the international community continues to support the re-election of Obama. Obama received comparable results in 2008.
I like to use data whenever possible to help guide my decisions. However, I will be the first to admit that I am no expert on international relations. So, I am seeking help from my readers. Here are three questions:
Are these survey results useful in guiding US constituencies’ voting decisions?
Are international citizens’ survey results about the US presidential candidates analogous to international customer survey results about US companies?
If you owned a company and were selling the Obama and Romney brands, how would you use these survey results (barring simply ignoring them) to improve international customer satisfaction?
Data aids, not replaces, judgment
Data is a tool, a means to help build consensus and facilitate human decision-making, not replace it. Analysis converts data into information; information, via context, leads to insight. Insights lead to decisions, which ultimately lead to outcomes that bring value. So data is just the start; context and intuition also play a role.
[ DATA SCIENCE Q&A]
Q: What are examples of NoSQL architecture?
A: * Key-value: in a key-value NoSQL database, all of the data consists of an indexed key and a value. Examples: Cassandra, DynamoDB
* Column-based: designed for storing data tables as sections of columns of data rather than as rows of data. Examples: HBase, SAP HANA
* Document database: maps a key to some document that contains structured information. The key is used to retrieve the document. Examples: MongoDB, CouchDB
* Graph database: designed for data whose relations are well represented as a graph, with interconnected elements and an undetermined number of relations between them. Example: Neo4j
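As a rough illustration of the difference between the first and third styles, here is how the same user record might be modeled in each, using plain Python dicts and lists as stand-ins (this is not real database client code; the record fields are hypothetical):

```python
import json

# Key-value style: one opaque value per key; the store never looks inside
# the value, so you can only fetch by exact key.
kv_store = {}
kv_store["user:42"] = json.dumps({"name": "Ada", "city": "London"})

record = json.loads(kv_store["user:42"])  # the client must decode the blob

# Document style: the store understands the document's structure, so
# individual fields can be queried (as MongoDB does with find()).
doc_store = [
    {"_id": 42, "name": "Ada", "city": "London"},
    {"_id": 43, "name": "Alan", "city": "Manchester"},
]
londoners = [d["name"] for d in doc_store if d["city"] == "London"]
```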
A standard patient satisfaction survey, known as HCAHPS (Hospital Consumer Assessment of Healthcare Providers and Systems), is the source of the patient feedback for the reimbursement program. I have previously used these publicly available HCAHPS data to understand the state of affairs for US hospitals in 2011 (see Big Data Provides Big Insights for U.S. Hospitals). Now that the Value-Based Purchasing program has been in effect since October 2012, I wanted to revisit the HCAHPS patient survey data to determine if US hospitals have improved. First, let’s review the HCAHPS survey.
The HCAHPS Survey
The survey asks a random sample of recently discharged patients about important aspects of their hospital experience. The data set includes patient survey results for US hospitals on ten measures of patients’ perspectives of care. The 10 measures are:
Nurses communicate well
Doctors communicate well
Received help as soon as they wanted (Responsive)
Pain well controlled
Staff explain medicines before giving to patients
Room and bathroom are clean
Area around room is quiet at night
Given information about what to do during recovery at home
Overall hospital rating
Recommend hospital to friends and family (Recommend)
For questions 1 through 7, respondents were asked to provide frequency ratings about the occurrence of each attribute (Never, Sometimes, Usually, Always). For question 8, respondents were provided a Y/N option. For question 9, respondents were asked to provide an overall rating of the hospital on a scale from 0 (Worst hospital possible) to 10 (Best hospital possible). For question 10, respondents were asked to provide their likelihood of recommending the hospital (Definitely no, Probably no, Probably yes, Definitely yes).
The HCAHPS data sets report metrics for each hospital as percentages of responses. Because the data sets have already been somewhat aggregated (e.g., percentages reported for groups of response options), I was unable to calculate average scores for each hospital. Instead, I used top box scores as the metric of patient experience. I found that top box scores are highly correlated with average scores across groups of companies, suggesting that these two metrics tell us the same thing about the companies (in our case, hospitals).
Top box scores for the respective rating scales are defined as: 1) Percent of patients who reported “Always”; 2) Percent of patients who reported “Yes”; 3) Percent of patients who gave a rating of 9 or 10; 4) Percent of patients who said “Definitely yes.”
Top box scores provide an easy-to-understand way of communicating the survey results for different types of scales. Even though there are four different rating scales for the survey questions, using a top box reporting method puts all metrics on the same numeric scale. Across all 10 metrics, hospital scores can range from 0 (bad) to 100 (good).
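The top box computation described above can be sketched in a few lines (the response data here is hypothetical, not taken from the actual HCAHPS files):

```python
def top_box_score(responses, top_responses):
    """Percent of responses (0-100) falling in the 'top box' categories."""
    hits = sum(1 for r in responses if r in top_responses)
    return 100.0 * hits / len(responses)

# Frequency-scale item (questions 1-7): top box is "Always"
freq = ["Always", "Usually", "Always", "Sometimes", "Always"]
freq_score = top_box_score(freq, {"Always"})          # 60.0

# 0-10 overall rating (question 9): top box is a 9 or 10
overall = [10, 9, 8, 7, 10]
overall_score = top_box_score(overall, {9, 10})       # 60.0
```

Whatever the underlying scale, the result lands on the same 0-100 range, which is exactly what makes the metrics comparable across questions.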
I examined PX ratings of acute care hospitals across two time periods. The two time periods were 2011 (Q3 2010 through Q2 2011) and 2013 (Q4 2012 through Q3 2013). The data from the 2013 time-frame are the latest publicly available patient survey data as of this writing.
Results: Patient Satisfaction with US Hospitals Increasing
Figure 1 contains the comparisons for patient advocacy ratings for US hospitals across the two time periods. Paired T-tests comparing the three loyalty metrics across the two time periods were statistically significant, showing that patients are reporting higher levels of loyalty toward hospitals in 2013 compared to 2011. This increase in patient loyalty, while small, is still real.
Greater gains in patient loyalty have been seen for Overall Hospital Rating (increase of 2.26) compared to Recommend (increase of 1.09).
Figure 2 contains the comparisons for patient experience ratings for US hospitals across the two time periods. Again, paired T-tests comparing the seven PX metrics across the two time periods were statistically significant, showing that patients are reporting higher levels of satisfaction with their in-patient experience in 2013 compared to 2011.
The biggest increases in satisfaction were seen in “Given information about recovery,” “Staff explained meds” and “Responsive.” The smallest increases in satisfaction were seen for “Doctor communication” and “Pain well controlled.”
Hospital reimbursements are based, in part, on their patient satisfaction ratings. Consequently, hospital executives are focusing their efforts at improving the patient experience.
Comparing HCAHPS patient survey results from 2011 to 2013, it appearsÂ that hospitals have improved how they deliver patient care. Patient loyalty and PX metrics show significant improvements from 2011 to 2013.
Save yourself from a zombie apocalypse of unscalable models
One zombie living and breathing in today’s analytical models is the absence of error bars. Not every model is scalable or holds up as data grows. The error bars attached to almost every model should be duly calibrated: as business models rake in more data, error bars keep them sensible and in check. If error bars are not accounted for, our models become susceptible to failure, leading to a Halloween we never want to see.
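One simple, general way to attach an error bar to a model estimate is the percentile bootstrap. A minimal sketch (the sample data and the choice of statistic are illustrative, not from any real model):

```python
import random

def bootstrap_ci(data, stat=lambda xs: sum(xs) / len(xs),
                 n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for any statistic:
    resample with replacement, recompute, take the middle 1-alpha mass."""
    rng = random.Random(seed)
    stats = sorted(
        stat([rng.choice(data) for _ in data]) for _ in range(n_boot)
    )
    lo = stats[int(alpha / 2 * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

sample = [4.8, 5.1, 4.9, 5.3, 5.0, 4.7, 5.2, 5.1]
lo, hi = bootstrap_ci(sample)  # error bar around the sample mean
```

As more data arrives, rerunning the same procedure shows the interval tightening (or failing to), which is exactly the calibration check the paragraph above argues for.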
[ DATA SCIENCE Q&A]
Q: What are collaborative filtering, n-grams, and cosine distance?
A: Collaborative filtering:
– Technique used by some recommender systems
– Filtering for information or patterns using techniques involving collaboration of multiple agents: viewpoints, data sources.
1. A user expresses his/her preferences by rating items (movies, CDs.)
2. The system matches this user’s ratings against other users’ and finds the people with the most similar tastes
3. For those similar users, the system recommends items that they have rated highly but that this user has not yet rated
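The three steps above can be sketched as a toy user-based recommender. The ratings are hypothetical, and `similarity` uses a simple inverse Euclidean distance over co-rated items rather than any particular production metric:

```python
# Step 1: users express preferences by rating items (user -> {item: rating}).
ratings = {
    "alice": {"m1": 5, "m2": 4, "m3": 1},
    "bob":   {"m1": 5, "m2": 5, "m4": 4},
    "carol": {"m3": 5, "m4": 2, "m5": 4},
}

def similarity(u, v):
    """Step 2: compare two users on the items they have both rated."""
    common = set(ratings[u]) & set(ratings[v])
    if not common:
        return 0.0
    dist = sum((ratings[u][i] - ratings[v][i]) ** 2 for i in common) ** 0.5
    return 1.0 / (1.0 + dist)

def recommend(user):
    """Step 3: rank items the most similar user rated but `user` hasn't."""
    nearest = max((u for u in ratings if u != user),
                  key=lambda v: similarity(user, v))
    unseen = {i: r for i, r in ratings[nearest].items()
              if i not in ratings[user]}
    return sorted(unseen, key=unseen.get, reverse=True)

recommend("alice")  # → ['m4'] (bob is most similar and rated m4 highly)
```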
N-grams:
– Contiguous sequence of n items from a given sequence of text or speech
– Example: ‘Andrew is a talented data scientist’
– Bi-grams: ‘Andrew is’, ‘is a’, ‘a talented’, …
– Tri-grams: ‘Andrew is a’, ‘is a talented’, ‘a talented data’, …
– An n-gram model models sequences using statistical properties of n-grams; see: Shannon Game
– More concisely, the n-gram model: P(x_i | x_{i-(n-1)}, …, x_{i-1}): a Markov model
– N-gram model: each word depends only on the n-1 last words
– Issue: infrequent n-grams receive zero or unreliable probability estimates
– Solution: smooth the probability distributions by assigning non-zero probabilities to unseen words or n-grams
– Methods: Good-Turing, Backoff, Kneser-Ney smoothing
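A minimal sketch of n-gram extraction for the example sentence above, with add-one (Laplace) smoothing used as a simpler stand-in for the Good-Turing and Kneser-Ney methods named in the text:

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-item windows of a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "Andrew is a talented data scientist".split()
bigrams = ngrams(tokens, 2)   # ('Andrew', 'is'), ('is', 'a'), ...
trigrams = ngrams(tokens, 3)  # ('Andrew', 'is', 'a'), ...

# Add-one smoothing: every bigram, seen or unseen, gets a non-zero
# probability, which is the core idea behind the fancier methods too.
bigram_counts = Counter(bigrams)
unigram_counts = Counter(tokens)
vocab = len(unigram_counts)

def p_add_one(word, prev):
    """Smoothed estimate of P(word | prev)."""
    return (bigram_counts[(prev, word)] + 1) / (unigram_counts[prev] + vocab)
```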
Cosine distance:
– How similar are two documents?
– Perfect similarity/agreement: 1
– No agreement: 0 (orthogonality)
– Measures the orientation, not the magnitude
– Given two vectors A and B representing word frequencies: cosine similarity = (A · B) / (‖A‖ ‖B‖)
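The cosine similarity of two word-frequency vectors follows directly from that definition; a minimal sketch with made-up frequency vectors:

```python
import math

def cosine_similarity(a, b):
    """cos(theta) = (A . B) / (||A|| ||B||): orientation, not magnitude."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Word-frequency vectors over a shared four-term vocabulary:
doc_a = [2, 1, 0, 3]
doc_b = [4, 2, 0, 6]   # same direction as doc_a, twice the magnitude
doc_c = [0, 0, 5, 0]   # shares no terms with doc_a

cosine_similarity(doc_a, doc_b)  # ≈ 1.0: identical orientation
cosine_similarity(doc_a, doc_c)  # 0.0: orthogonal, no agreement
```

Note that doubling every count in a document leaves the similarity unchanged, which is what "measures the orientation, not the magnitude" means in practice.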
With hundreds of species to track across the UK, ornithological charity the RSPB accrues huge amounts of data every year as it tries to ensure its efforts help as many birds as possible.
And in order to ensure they stay on top of this mountain of data, the charity has teamed up with analytics specialist SAS to develop more in-depth research and conservation efforts which should benefit birds around the country.
“We need to make sense of a variety of large and complex data sets. For example, tracking the movements of kittiwakes and gannets as they forage at sea produces millions of data points,” said Dr. Will Peach, head of research delivery at RSPB.
“Conservation informed by statistical evidence is always more likely to succeed than that based solely on guesswork or anecdote. SAS allows us to explore the data to provide the evidence needed to confidently implement our initiatives.”
So far, the RSPB has implemented SAS’ advanced analytics solutions to combine datasets on yellowhammer and skylark nesting success with pesticide use and agricultural cropping patterns to judge the consequences for the birds.
RSPB also turned to SAS to explore how albatross forage across the Southern Ocean.
With large-scale commercial longline fishing killing tens of thousands of albatrosses a year, the goal was to cut down on the death rate and protect the 17 albatross species currently at risk.
The society took data from tags worn by the birds, merging it with external data sets like sea-surface temperatures and the location of fishing grounds.
“Scientific research is extremely fast-moving and there are now huge volumes of data to analyse,” said Andy Cutler, director of strategy at SAS UK & Ireland.
“SAS is able to provide a means of managing all the data and then apply cutting-edge analytical techniques that deliver valuable insights almost immediately. For example, through analysing previously non-informative data, RSPB is now able to intervene and correct the breeding problems faced by various bird species during treacherous migration journeys.”
Read more at http://www.techweekeurope.co.uk/data-storage/business-intelligence/rspb-conservation-sas-data-analytics-167988#Dzdo3ud6Ej3vt6ZC.99
RSPB Conservation Efforts Take Flight Thanks To Data Analytics
One of the more pressing consequences of truly transitioning to a data-driven company culture is a renewed esteem for the data, valued as an asset, that gives the enterprise its worth. Unlike other organizational assets, protecting data requires more than mere security measures. It necessitates reliable, test-worthy backup and disaster recovery plans that can automate these vital processes to account for virtually any scenario, especially some of the more immediate ones involving:
Ransomware: Ransomware attacks are increasing in incidence and severity. They occur when external entities deploy malware to encrypt organizational data, using encryption measures similar to (if not more effective than) those the organizations themselves use, and only release the data after being paid to do so. “Ransomware was not something that many people worried about a couple years ago,” Unitrends VP of Product Marketing Dave LeClair acknowledged. “Now it’s something that almost every company that I’ve talked to has been hit by. The numbers are getting truly staggering: how frequently ransomware attacks are hitting IT, encrypting their data, and demanding payments to unencrypt it from these criminal organizations.”
Downtime: External threats are not the only factors that engender IT downtime. Conventional maintenance and updating measures for various systems also result in situations in which organizations cannot access or leverage their data. In essential time-sensitive applications, cloud-based disaster recovery and backup solutions ensure business continuity.
Contemporary IT Environments: Today’s IT environments are much more heterogeneous than they once were. It is not uncommon for organizations to utilize existing legacy systems alongside cloud-based applications and those involving virtualization. Cloud disaster recovery and data backup platforms preserve connected continuity in a singular manner to reduce costs and increase the efficiency of backup systems.
Acts of Nature: The increasing reliance on technology is still susceptible to unforeseen acts of weather, natural disasters, and even man-made ones, in which case cloud options for recovery and backups are the most desirable because they store valued data offsite.
Additionally, when one considers that the primary benefits of the cloud are its low-cost storage, at scale, and ubiquity of access regardless of location or time, cloud disaster recovery and backup solutions are a logical extension of enterprise infrastructure. “The new technologies, because of the ability of doing things in the cloud, kind of democratizes it so that anybody can afford to have a DR environment, particularly for their critical applications,” LeClair remarked.
Recovery and Backup Basics
There are a multitude of ways that organizations can leverage cloud recovery and data backup options to readily restore production capabilities in the event of system failure:
Replication: Replication is the means by which data is copied elsewhere, in this case to the cloud for storage. Data can also be replicated to other forms of storage (i.e. disk or tape) and be transmitted to a cloud service provider that way.
Archives/Checkpoints: Archives or checkpoints are states of data at particular points in time for a data set which are preserved within a system. Therefore, organizations can always revert their system data to an archive to restore it to a time before some sort of failure occurred. According to LeClair, this capability is an integral way of mitigating the effects of ransomware: “You can simply roll back the clock, to the point before you got encrypted, and you can restore your system so you’re good to go”.
Instant Recovery Solutions: These solutions not only restore systems to a point in time prior to events of failure, but even facilitate workload management based on the backup appliance itself. This capability is critical in instances in which on-premise systems are still down. In such an event, the appliance’s compute power and storage replace those of the primary solution, which “allows you to spin off that workload in less than five minutes so you can get back up and running,” LeClair said.
Incremental Forevers: This recovery and backup technique is particularly useful because it involves a full backup of a particular data set or application, and subsequently only backs up changes to that initial backup. Such utility is pivotal to massive quantities of big data.
There are many crucial considerations when leveraging the cloud as a means of recovery and data backup. Foremost of these is the replication process of copying data from on premises to the cloud. “It absolutely is an issue, particularly if you have terabytes of data,” LeClair mentioned. “If you’re a decent sized enterprise and you have 50 or 100 terabytes of data that you need to move from your production environment to the cloud, that can take weeks.” Smaller cloud providers such as Unitrends can issue storage to organizations via disk, which is then overnighted and uploaded to the cloud so that, on an ongoing basis, organizations only need to replicate the changes to their data.
Another consideration pertains to actually utilizing that data in the cloud due to networking concerns. “Networking in cloud generally works very differently than what happens on premise,” LeClair observed. Most large public cloud providers (such as Amazon Web Services) have networking constraints regarding interconnections that require significant IT involvement to configure. However, competitive disaster recovery and backup vendors have dedicated substantial resources to automating various facets of recovery, including all of the machine transformation (transmogrification) required to provision a production environment in the cloud.
Merely replicating data into the cloud is just the first step. The larger concern of actually utilizing it there in cases of emergency requires provisioning the network, which certain cloud platforms can do automatically so that “you have a DR environment without having to actually dedicate any compute resources yet,” LeClair said. “You basically have your data that’s replicated into Amazon, and you have all the configuration data necessary to spin off that data if you need to. It’s a very cost-effective way to keep yourself protected.”
The automation capabilities of cloud data recovery and backup solutions also include testing, which is a vital prerequisite for actually ensuring that such systems function properly on demand. Traditionally, organizations tested their recovery environments sparingly, if at all. “There’s now technology that essentially automates your DR environment, so you don’t have to pour human resources and time into it,” LeClair said. In many instances, those automation capabilities hinge upon the cloud, which has had a considerable impact on the capabilities for disaster recovery and backup. The overarching effect is that it renders data recovery and backup more consistent, cheaper, and easier to facilitate in an increasingly complicated and preeminent IT world.
Data Analytics Success Starts with Empowerment
Being data driven is not as much a tech challenge as an adoption challenge. Adoption has its roots in the cultural DNA of any organization. Great data-driven organizations weave the data-driven culture into their corporate DNA. A culture of connection, interaction, sharing, and collaboration is what it takes to be data driven. It’s about being empowered more than it’s about being educated.
[ DATA SCIENCE Q&A]
Q: Do we always need the intercept term in a regression model?
A: * It guarantees that the residuals have a zero mean
* It guarantees the least squares slopes estimates are unbiased
* The regression line floats up and down, by adjusting the constant, to the point where the mean of the residuals is zero
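A quick numerical check of these points on simulated data (numpy assumed available): fitting with and without the column of ones shows the residual mean is zero only when the intercept is included.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100)
y = 3.0 + 2.0 * x + rng.normal(0, 1, 100)  # true intercept 3, slope 2

# With an intercept: design matrix [1, x]; residuals average to zero.
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid_with = y - X @ beta

# Without an intercept: the line is forced through the origin, the slope
# estimate is biased, and the residual mean is generally non-zero.
slope, *_ = np.linalg.lstsq(x[:, None], y, rcond=None)
resid_without = y - x * slope[0]

abs(resid_with.mean())     # ~0 (machine precision)
abs(resid_without.mean())  # noticeably non-zero
```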