Aug 15, 19: #AnalyticsClub #Newsletter (Events, Tips, News & more..)

[  COVER OF THE WEEK ]

Image: Complex data (Source)

[ LOCAL EVENTS & SESSIONS]

More WEB events? Click Here

[ AnalyticsWeek BYTES]

>> Jul 26, 18: #AnalyticsClub #Newsletter (Events, Tips, News & more..) by admin

>> Tutorial: Azure Data Lake analytics with R by analyticsweek

>> Underpinning the Internet of Things with GPUs by jelaniharper

Wanna write? Click Here

[ FEATURED COURSE]

R, ggplot, and Simple Linear Regression

Begin to use R and ggplot while learning the basics of linear regression… more

[ FEATURED READ]

The Black Swan: The Impact of the Highly Improbable

A black swan is an event, positive or negative, that is deemed improbable yet causes massive consequences. In this groundbreaking and prophetic book, Taleb shows in a playful way that Black Swan events explain almost eve… more

[ TIPS & TRICKS OF THE WEEK]

Data Analytics Success Starts with Empowerment
Being data driven is not as much a technology challenge as it is an adoption challenge. Adoption has its roots in the cultural DNA of any organization. Great data-driven organizations weave the data-driven culture into their corporate DNA. A culture of connection, interaction, sharing, and collaboration is what it takes to be data driven. It is about being empowered more than it is about being educated.

[ DATA SCIENCE Q&A]

Q:Do you think 50 small decision trees are better than a large one? Why?
A: * Yes!
* More robust model (an ensemble of weak learners combines into a strong learner)
* Better to improve a model by taking many small steps than a few large ones
* If one tree makes an error, the others can compensate for it
* Less prone to overfitting

Source
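
To make the comparison concrete, here is a minimal scikit-learn sketch (assuming Python with scikit-learn installed; the synthetic dataset and model settings are purely illustrative) that pits one large, fully grown tree against an ensemble of 50 shallow trees:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for any tabular classification problem
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# One large, fully grown tree: low bias, high variance
big_tree = DecisionTreeClassifier(max_depth=None, random_state=0)

# 50 small trees combined by bagging and feature subsampling
forest = RandomForestClassifier(n_estimators=50, max_depth=5, random_state=0)

print("single tree   :", cross_val_score(big_tree, X, y, cv=5).mean())
print("50 small trees:", cross_val_score(forest, X, y, cv=5).mean())

On most runs the ensemble's cross-validated accuracy is higher and more stable, illustrating the reduced variance you get from averaging many weak learners.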

[ VIDEO OF THE WEEK]

#DataScience Approach to Reducing #Employee #Attrition

Subscribe to YouTube

[ QUOTE OF THE WEEK]

If you can’t explain it simply, you don’t understand it well enough. – Albert Einstein

[ PODCAST OF THE WEEK]

Solving #FutureOfOrgs with #Detonate mindset (by @steven_goldbach & @geofftuff) #FutureOfData #Podcast

Subscribe: iTunes | GooglePlay

[ FACT OF THE WEEK]

Walmart handles more than 1 million customer transactions every hour, which is imported into databases estimated to contain more than 2.5 petabytes of data.

Sourced from: Analytics.CLUB #WEB Newsletter

Data skills – Many hands make light work in a world awash with data

While the transformation to a data-driven culture needs to come from the top of the organization, data skills must permeate through all areas of the business.

Rather than being the responsibility of one person or department, assuring data availability and integrity must be a team sport in modern data-centric businesses. Everyone must be involved and made accountable throughout the process.  

The challenge for enterprises is to effectively enable greater data access among the workforce while maintaining oversight and quality.

The Evolution of the Data Team

Businesses are recognizing the value and opportunities that data creates. There is an understanding that data needs to be handled and processed efficiently. For some companies, this has led to the formation of a new department of data analysts and scientists.

The data team is led by a Chief Data Officer (CDO), a role that is set to become key to business success in the digital era, according to recent research from Gartner. While earlier iterations of roles within the data team centered on data governance, data quality and regulatory issues, the focus is shifting. Data analysts and scientists are now expected to contribute and deliver a data-driven culture across the company, while also driving business value. According to the Gartner survey, the skills required for roles within the data team have expanded to span data management, analytics, data science, ethics, and digital transformation.

Businesses are clearly recognizing the importance of the data team’s functions and are making significant investments in it. Office budgets for the data team increased by an impressive 23% between 2016 and 2017 according to Gartner. What’s more, some 15% of the CDOs that took part in the study revealed that their budgets were more than $20 million for their departments, compared with just 7% who said the same in 2016. The increasing popularity and evolution of these new data roles has largely been driven by GDPR in Europe and by new data protection regulations in the US. And the evidence suggests that the position will be essential for ensuring the successful transfer of data skills throughout businesses of all sizes.

The Data Skills Shortage

Data is an incredibly valuable resource, but businesses can only unlock its full potential if they have the talent to analyze that data and produce actionable insights that help them to better understand their customers’ needs. However, companies are already struggling to cope with the big data ecosystem due to a skills shortage and the problem shows little sign of improving. In fact, Europe could see a shortage of up to 500,000 IT professionals by 2020, according to the latest research from consultancy firm Empirica.

The rapidly evolving digital landscape is partly to blame, as the skills required have changed radically in recent years. The data science skills needed at today’s data-driven companies are more wide-ranging than ever before. The modern workforce is now required to have a firm grasp of computer science, including everything from databases to the cloud, according to strategic advisor and best-selling author Bernard Marr. In addition, analytical skills are essential to make sense of the ever-increasing data gathered by enterprises, while mathematical skills are also vital because much of the data captured, largely from IoT devices and sensors, is numerical. These skills must sit alongside more traditional business and communication skills, as well as the ability to be creative and adapt to developing technologies.

The need for these skills is set to increase, with IBM predicting that the number of jobs for data professionals will rise by a massive 28% by 2020. The good news is that businesses are already recognizing the importance of digital skills in the workforce, with the role of Data Scientist taking the number one spot in Glassdoor’s Best Jobs in America for the past three years, with a staggering 4,524 positions available in 2018. 

Data Training Employees

Data quality management is a task that extends across all functional areas of a company. It, therefore, makes sense to provide the employees in the specialist departments with tools to ensure data quality in self-service. Cloud-based tools that can be rolled out quickly and easily in the departments are essential. This way, companies can gradually improve their data quality whilst also increasing the value of their data.

As the number of data workers triples and GDPR raises the stakes, businesses must think of good data management as a team sport. Investing in the Chief Data Officer role and data skills now will enable forward-thinking businesses to reap the rewards, both in the short term and further into the future.

The post Data skills – Many hands make light work in a world awash with data appeared first on Talend Real-Time Open Source Data Integration Software.

Originally Posted at: Data skills – Many hands make light work in a world awash with data by analyticsweekpick

Aug 08, 19: #AnalyticsClub #Newsletter (Events, Tips, News & more..)

[  COVER OF THE WEEK ]

Image: Trust the data (Source)

[ LOCAL EVENTS & SESSIONS]

More WEB events? Click Here

[ AnalyticsWeek BYTES]

>> The 4 Common Challenges of Predictive Analytics by analyticsweek

>> Inside CXM: New Global Thought Leader Hub for Customer Experience Professionals by bobehayes

>> What Motivates People to Take Free Surveys? by analyticsweek

Wanna write? Click Here

[ FEATURED COURSE]

Deep Learning Prerequisites: The Numpy Stack in Python

The Numpy, Scipy, Pandas, and Matplotlib stack: prep for deep learning, machine learning, and artificial intelligence… more

[ FEATURED READ]

Research Design: Qualitative, Quantitative, and Mixed Methods Approaches, 4th Edition

The eagerly anticipated Fourth Edition of the title that pioneered the comparison of qualitative, quantitative, and mixed methods research design is here! For all three approaches, Creswell includes a preliminary conside… more

[ TIPS & TRICKS OF THE WEEK]

Data Have Meaning
We live in a Big Data world in which everything is quantified. While the emphasis of Big Data has been focused on distinguishing the three characteristics of data (the infamous three Vs), we need to be cognizant of the fact that data have meaning. That is, the numbers in your data represent something of interest, an outcome that is important to your business. The meaning of those numbers is about the veracity of your data.

[ DATA SCIENCE Q&A]

Q: How frequently must an algorithm be updated?
A: You want to update an algorithm when:
– You want the model to evolve as data streams through infrastructure
– The underlying data source is changing
– Example: a retail store model that remains accurate as the business grows
– Dealing with non-stationarity

Some options:
– Incremental algorithms: the model is updated every time it sees a new training example
Note: simple, and you always have an up-to-date model, but you cannot weight older and newer data differently
Sometimes mandatory: when data must be discarded once seen (privacy)
– Periodic re-training in “batch” mode: simply buffer the relevant data and update the model every so often
Note: more decisions and a more complex implementation

How frequently?
– Is the sacrifice worth it?
– Data horizon: how quickly do you need the most recent training example to be part of your model?
– Data obsolescence: how long does it take before data become irrelevant to the model? Are some older instances more relevant than the newer ones?
Economics: generally, newer instances are more relevant than older ones. However, because of seasonality, data from the same month, quarter, or year of the previous year can be more relevant than more recent data from a different period of the current year. In a recession, data from previous recessions can be more relevant than newer data from a different economic cycle.

Source
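
As a concrete illustration of the two options above, here is a minimal scikit-learn sketch (assuming Python with scikit-learn; the synthetic "stream" of batches is a stand-in for real streaming data) contrasting an incremental learner with periodic batch re-training on a buffer:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import SGDClassifier

X, y = make_classification(n_samples=6000, n_features=20, random_state=0)
batches = np.array_split(np.arange(len(y)), 6)   # toy "stream" of 6 batches

# Option 1: incremental algorithm, updated every time a new batch arrives
incremental = SGDClassifier(loss="log_loss", random_state=0)  # use loss="log" on older scikit-learn
for idx in batches:
    incremental.partial_fit(X[idx], y[idx], classes=np.unique(y))

# Option 2: periodic re-training in "batch" mode on a buffer of recent data
buffer_idx = np.concatenate(batches[-3:])        # data horizon: keep only the last 3 batches
batch_model = RandomForestClassifier(n_estimators=50, random_state=0)
batch_model.fit(X[buffer_idx], y[buffer_idx])

The buffer size in the second option is where the data-horizon and data-obsolescence questions above get answered in practice.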

[ VIDEO OF THE WEEK]

Want to fix #DataScience ? fix #governance by @StephenGatchell @Dell #FutureOfData #Podcast

Subscribe to YouTube

[ QUOTE OF THE WEEK]

Torture the data, and it will confess to anything. – Ronald Coase

[ PODCAST OF THE WEEK]

#FutureOfData Podcast: Peter Morgan, CEO, Deep Learning Partnership

Subscribe: iTunes | GooglePlay

[ FACT OF THE WEEK]

Bad data or poor data quality costs US businesses $600 billion annually.

Sourced from: Analytics.CLUB #WEB Newsletter

How big data can improve manufacturing

Manufacturers taking advantage of advanced analytics can reduce process flaws, saving time and money.

In the past 20 years or so, manufacturers have been able to reduce waste and variability in their production processes and dramatically improve product quality and yield (the amount of output per unit of input) by implementing lean and Six Sigma programs. However, in certain processing environments—pharmaceuticals, chemicals, and mining, for instance—extreme swings in variability are a fact of life, sometimes even after lean techniques have been applied. Given the sheer number and complexity of production activities that influence yield in these and other industries, manufacturers need a more granular approach to diagnosing and correcting process flaws. Advanced analytics provides just such an approach.

Advanced analytics refers to the application of statistics and other mathematical tools to business data in order to assess and improve practices (exhibit). In manufacturing, operations managers can use advanced analytics to take a deep dive into historical process data, identify patterns and relationships among discrete process steps and inputs, and then optimize the factors that prove to have the greatest effect on yield. Many global manufacturers in a range of industries and geographies now have an abundance of real-time shop-floor data and the capability to conduct such sophisticated statistical assessments. They are taking previously isolated data sets, aggregating them, and analyzing them to reveal important insights.

Exhibit

Consider the production of biopharmaceuticals, a category of healthcare products that includes vaccines, hormones, and blood components. They are manufactured using live, genetically engineered cells, and production teams must often monitor more than 200 variables within the production flow to ensure the purity of the ingredients as well as the substances being made. Two batches of a particular substance, produced using an identical process, can still exhibit a variation in yield of between 50 and 100 percent. This huge unexplained variability can create issues with capacity and product quality and can draw increased regulatory scrutiny.

One top-five biopharmaceuticals maker used advanced analytics to significantly increase its yield in vaccine production while incurring no additional capital expenditures. The company segmented its entire process into clusters of closely related production activities; for each cluster, it took far-flung data about process steps and the materials used and gathered them in a central database.

A project team then applied various forms of statistical analysis to the data to determine interdependencies among the different process parameters (upstream and downstream) and their impact on yield. Nine parameters proved to be most influential, especially time to inoculate cells and conductivity measures associated with one of the chromatography steps. The manufacturer made targeted process changes to account for these nine parameters and was able to increase its vaccine yield by more than 50 percent—worth between $5 million and $10 million in yearly savings for a single substance, one of hundreds it produces.

Developing unexpected insights

Even within manufacturing operations that are considered best in class, the use of advanced analytics may reveal further opportunities to increase yield. This was the case at one established European maker of functional and specialty chemicals for a number of industries, including paper, detergents, and metalworking. It boasted a strong history of process improvements since the 1960s, and its average yield was consistently higher than industry benchmarks. In fact, staffers were skeptical that there was much room for improvement. “This is the plant that everybody uses as a reference,” one engineer pointed out.

However, several unexpected insights emerged when the company used neural-network techniques (a form of advanced analytics based on the way the human brain processes information) to measure and compare the relative impact of different production inputs on yield. Among the factors it examined were coolant pressures, temperatures, quantity, and carbon dioxide flow. The analysis revealed a number of previously unseen sensitivities—for instance, levels of variability in carbon dioxide flow prompted significant reductions in yield. By resetting its parameters accordingly, the chemical company was able to reduce its waste of raw materials by 20 percent and its energy costs by around 15 percent, thereby improving overall yield. It is now implementing advanced process controls to complement its basic systems and steer production automatically.
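
The article does not publish the chemical maker's model, but the general pattern it describes (fit a nonlinear model to process data, then rank inputs by how strongly they move predicted yield) can be sketched in a few lines (Python with scikit-learn; the synthetic data and feature count are purely illustrative):

from sklearn.datasets import make_regression
from sklearn.inspection import permutation_importance
from sklearn.neural_network import MLPRegressor

# Synthetic stand-in for process inputs (coolant pressure, temperature, CO2 flow, ...)
X, y = make_regression(n_samples=500, n_features=6, n_informative=3,
                       noise=10.0, random_state=0)

# Small neural network relating process inputs to yield
model = MLPRegressor(hidden_layer_sizes=(32, 16), max_iter=2000,
                     random_state=0).fit(X, y)

# Permutation importance ranks which inputs move predicted yield the most
result = permutation_importance(model, X, y, n_repeats=10, random_state=0)
for i in result.importances_mean.argsort()[::-1]:
    print(f"input {i}: importance {result.importances_mean[i]:.3f}")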

Meanwhile, a precious-metals mine was able to increase its yield and profitability by rigorously assessing production data that were less than complete. The mine was going through a period in which the grade of its ore was declining; one of the only ways it could maintain production levels was to try to speed up or otherwise optimize its extraction and refining processes. The recovery of precious metals from ore is incredibly complex, typically involving between 10 and 15 variables and more than 15 pieces of machinery; extraction treatments may include cyanidation, oxidation, grinding, and leaching.

The production and process data that the operations team at the mine were working with were extremely fragmented, so the first step for the analytics team was to clean them up, using mathematical approaches to reconcile inconsistencies and account for information gaps. The team then examined the data on a number of process parameters—reagents, flow rates, density, and so on—before recognizing that variability in levels of dissolved oxygen (a key parameter in the leaching process) seemed to have the biggest impact on yield. Specifically, the team spotted fluctuations in oxygen concentration, which indicated that there were challenges in process control. The analysis also showed that the best demonstrated performance at the mine occurred on days in which oxygen levels were highest.

As a result of these findings, the mine made minor changes to its leach-recovery processes and increased its average yield by 3.7 percent within three months—a significant gain in a period during which ore grade had declined by some 20 percent. The increase in yield translated into a sustainable $10 million to $20 million annual profit impact for the mine, without it having to make additional capital investments or implement major change initiatives.

Capitalizing on big data

The critical first step for manufacturers that want to use advanced analytics to improve yield is to consider how much data the company has at its disposal. Most companies collect vast troves of process data but typically use them only for tracking purposes, not as a basis for improving operations. For these players, the challenge is to invest in the systems and skill sets that will allow them to optimize their use of existing process information—for instance, centralizing or indexing data from multiple sources so they can be analyzed more easily and hiring data analysts who are trained in spotting patterns and drawing actionable insights from information.

Some companies, particularly those with months- and sometimes years-long production cycles, have too little data to be statistically meaningful when put under an analyst’s lens. The challenge for senior leaders at these companies will be taking a long-term focus and investing in systems and practices to collect more data. They can invest incrementally—for instance, gathering information about one particularly important or particularly complex process step within the larger chain of activities, and then applying sophisticated analysis to that part of the process.

The big data era has only just emerged, but the practice of advanced analytics is grounded in years of mathematical research and scientific application. It can be a critical tool for realizing improvements in yield, particularly in any manufacturing environment in which process complexity, process variability, and capacity constraints are present. Indeed, companies that successfully build up their capabilities in conducting quantitative assessments can set themselves far apart from competitors.

About the authors

Eric Auschitzky is a consultant in McKinsey’s Lyon office, Markus Hammer is a senior expert in the Lisbon office, and Agesan Rajagopaul is an associate principal in the Johannesburg office.

The authors would like to thank Stewart Goodman, Jean-Baptiste Pelletier, Paul Rutten, Alberto Santagostino, Christoph Schmitz, and Ken Somers for their contributions to this article.

Originally posted via “How big data can improve manufacturing”

Source by analyticsweekpick

2016 Trends for the Internet of Things: Expanding Expedient Analytics Beyond the Industrial Internet

The current perception of the Internet of Things is greatly reliant upon its traditional applications of the Industrial Internet, in which monitoring of equipment assets via real-time and predictive analytics grossly reduces costs and time spent on maintenance and repairs.

 

Although the Industrial Internet will likely remain a vital component of the IoT, the potential of this vast connectivity and automated action greatly exceeds any one particular use case. With liberal estimates of as many as 50 billion connected devices by 2020, there are a number of trends pertaining to the IoT that will begin in earnest in 2016 to help it reach its full potential:

 

  • Structure: One of the most significant trends impacting the IoT will be the continued usage of JSON-based document stores which can provide schema and a loose means of structure on otherwise unstructured data on-the-fly, making IoT data more manageable and viable to the enterprise.
  • Decentralized Clouds: The trend towards fog computing—based on pushing cloud resources and computations to the edge of the cloud, closer to end users and their devices—will continue to gain credence and reduce bandwidth, decrease time to action, lower costs, and improve access to data.
  • Mobile: The IoT will impact the future of mobile technologies via the cyberforaging phenomena, in which mobile devices share resources with one another in a more efficient manner than most contemporary decentralized models do. Additionally, the increase of wearables in industries such as healthcare and others will help to broaden the IoT’s presence throughout the coming year.
  • Revenue Streams: 2016 will see an increase in revenue streams associated with the IoT (outside of the Industrial Internet), including the creation of services and applications structured around it. The most eminent example is provided by connected cars.
  • Security: Security concerns are one of the few remaining inhibitors to the IoT. IDC predicts that more than 50 percent of networks utilizing the IoT will have a security breach by 2018.

 

Structure

Of all the pragmatic necessities that must be addressed for the IoT to transform enterprise data management practices, contending with continuously streaming unstructured data from any variety of sources remains one of the most daunting. While the utilization of semantic technologies and their relevance to the unstructured and semi-structured data created by the IoT and other forms of big data continues to gain traction, one of the most profound developments related to structuring this data involves the JSON document format.

 

MapR Chief Marketing Officer Jack Norris noted that this format “typically is a complex document that has embedded hierarchies in it to exchange record to record depending on what was happening to produce it.” JSON documents are regularly used for the exchanges of data in web applications and contain much needed schema that is invaluable to unruly IoT data—particularly when combined with SQL solutions that can derive such schema. “A lot of the machine generated content, whether it’s log files or sensor data, is kind of a JSON format,” Norris remarked. “It’s definitely one of the fastest growing formats.”
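
As a minimal illustration of the point (the sensor record below is hypothetical and not tied to any particular product), a JSON document with embedded hierarchies can be parsed and flattened on the fly into schema-like fields that SQL-on-JSON tools can work with:

import json

# Hypothetical machine-generated reading with an embedded hierarchy
record = json.loads("""
{
  "device_id": "pump-017",
  "timestamp": "2016-01-12T08:30:00Z",
  "readings": {"temperature_c": 71.4, "vibration_mm_s": 2.9},
  "tags": ["line-3", "critical"]
}
""")

# Derive a flat, tabular view of the nested fields
flat = {
    "device_id": record["device_id"],
    "timestamp": record["timestamp"],
    **{f"readings.{k}": v for k, v in record["readings"].items()},
}
print(flat)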

 

Decentralization

Another major trend affecting the IoT pertains to the architecture required to account for the constant generation of data and requisite analytics between machines. Although there are certainly cases in which traditionally centralized approaches to data management have continuing relevance, decentralized methods of managing big data utilizing cloud infrastructure are emerging as a way to provision the real-time analytics required to exploit the IoT. Edge computing (also known as fog computing) facilitates the expedience required for such analytics while helping to preserve computing networks from the undue bandwidth, cost, and latency issues that uninterrupted transmission of data from the IoT creates with centralized models.

 

Instead, analytics and computations are performed at the edge of the cloud while only the most necessary results are transmitted to centralized data centers, which greatly decreases costs and network issues while improving the viability of leveraging machine-to-machine communication. A recent Forbes post indicates that: “By 2018, at least half of IT spending will be cloud-based, reaching 60 % of all IT infrastructure and 60-70% of all software, services, and technology spending by 2020.”

 

Mobile

The renewed emphasis on cloud resources is indicative of the growing trend towards mobile computing that will continue to influence the IoT this year. Prevalent mobile computing concerns have transcended issues of governance and the incorporation of BYOD and CYOD policies, and have come to include the natural progression of fog computing via the cyberforaging phenomenon. Cyberforaging is the ability of mobile devices to facilitate their analytics and transformation needs (either in the form of ETL or ELT) by utilizing internet-based resources nearest to them. Similar to fog computing, cyberforaging again focuses on the edge of the cloud, and can involve the use of cloudlets. Cloudlets are various devices equipped to provision the cloud to mobile devices (which detect them via their sensors) without the need for a centralized location or data center.

 

More advanced capabilities include the increasing usage of smart agent technologies. According to Gartner, these technologies “in the form of virtual personal assistants (VPAs) and other agents, will monitor user content and behavior in conjunction with cloud-hosted neural networks to build and maintain data models from which the technology will draw inferences about people, content and contexts.” Although these technologies will gain prominence towards the end of the decade, the combination of them and cyberforaging will expand expectations for mobile computing and the remote computation power the IoT can facilitate.

 

Revenue Streams

There are a multitude of examples of the way that the IoT will increase revenue streams outside of the Industrial Internet in the coming months. One will be in providing support to machines instead of end users. Gartner noted that: “By 2018, six billion connected things will be requesting support…Strategies will also need to be developed for responding to them that are distinctly different from traditional human-customer communication and problem-solving. Responding to service requests from things will spawn entire service industries…”.

 

Connected cars will also produce a host of revenue streams, most eminently relating to marketing and advertising. Companies can tailor promotional efforts to offer the most salient marketing attempts for everything from gas prices to dining or entertainment. Similarly, organizations can pay to include their digital products and services as part of package offerings with auto manufacturers of smart vehicles. The health care industry will continue to further the IoT by designing and implementing any variety of remote monitoring gadgets and systems for patients, while some of the aforementioned marketing opportunities apply to the influx of wearables and smart clothing that is gaining popularity.

 

Security

The most substantial trend impacting IoT security pertains to the cloud, which has become the preferred infrastructure for numerous big data applications. With the majority of security breaches occurring due to remissness on the part of the enterprise (as opposed to cloud providers), 2016 will see greater adoption of cloud security tools to counteract these issues. Cloud control tools include solutions specifically designed for access security and management of different forms of SOA. Indeed, security-as-a-service may well turn out to be one of the most important facets of SOA, providing a degree of stability and dependability for valued big data applications in the cloud.

 

Source: 2016 Trends for the Internet of Things: Expanding Expedient Analytics Beyond the Industrial Internet by jelaniharper

Aug 01, 19: #AnalyticsClub #Newsletter (Events, Tips, News & more..)

[  COVER OF THE WEEK ]

Image: Data security (Source)

[ LOCAL EVENTS & SESSIONS]

More WEB events? Click Here

[ AnalyticsWeek BYTES]

>> Using Task Ease (SEQ) to Predict Completion Rates and Times by analyticsweek

>> October 24, 2016 Health and Biotech analytics news roundup by pstein

>> Up Your Game With Interactive Data Visualizations by analyticsweek

Wanna write? Click Here

[ FEATURED COURSE]

Introduction to Apache Spark

Learn the fundamentals and architecture of Apache Spark, the leading cluster-computing framework among professionals…. more

[ FEATURED READ]

Storytelling with Data: A Data Visualization Guide for Business Professionals

Storytelling with Data teaches you the fundamentals of data visualization and how to communicate effectively with data. You’ll discover the power of storytelling and the way to make data a pivotal point in your story. Th… more

[ TIPS & TRICKS OF THE WEEK]

Data Analytics Success Starts with Empowerment
Being data driven is not as much a technology challenge as it is an adoption challenge. Adoption has its roots in the cultural DNA of any organization. Great data-driven organizations weave the data-driven culture into their corporate DNA. A culture of connection, interaction, sharing, and collaboration is what it takes to be data driven. It is about being empowered more than it is about being educated.

[ DATA SCIENCE Q&A]

Q: How do you know if one algorithm is better than another?
A: * In terms of performance on a given data set?
* In terms of performance on several data sets?
* In terms of efficiency?
In terms of performance on several data sets:

– “Does learning algorithm A have a higher chance of producing a better predictor than learning algorithm B in the given context?”
– “Bayesian Comparison of Machine Learning Algorithms on Single and Multiple Datasets”, A. Lacoste and F. Laviolette
– “Statistical Comparisons of Classifiers over Multiple Data Sets”, Janez Demšar

In terms of performance on a given data set:
– One wants to choose between two learning algorithms
– Need to compare their performances and assess the statistical significance

One approach (Not preferred in the literature):
– Multiple k-fold cross validation: run CV multiple times and take the mean and sd
– You have: algorithm A (mean and sd) and algorithm B (mean and sd)
– Is the difference meaningful? (Paired t-test)

Sign test (classification context):
Simply counts the number of times A has a better metric than B and assumes this count comes from a binomial distribution. Then we can obtain a p-value for the null hypothesis H0: A and B are equal in terms of performance.

Wilcoxon signed-rank test (classification context):
Like the sign test, but the wins (A is better than B) are weighted and assumed to come from a symmetric distribution around a common median. Again, we obtain a p-value for the null hypothesis H0.

Other (without hypothesis testing):
– AUC
– F-Score

Source
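
For the "performance on a given data set" case above, here is a minimal SciPy sketch (Python with scikit-learn and SciPy; the synthetic data and the two candidate models are only placeholders) that collects matched fold scores from repeated cross-validation and applies a paired t-test and a Wilcoxon signed-rank test:

from scipy.stats import ttest_rel, wilcoxon
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedKFold, cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
cv = RepeatedKFold(n_splits=5, n_repeats=3, random_state=0)

# Matched per-fold accuracy scores for algorithms A and B
scores_a = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
scores_b = cross_val_score(RandomForestClassifier(n_estimators=100, random_state=0), X, y, cv=cv)

# Paired t-test on the matched fold scores (the multiple k-fold CV approach above)
print("paired t-test:", ttest_rel(scores_a, scores_b))
# Non-parametric alternative: Wilcoxon signed-rank test on the same pairs
print("wilcoxon     :", wilcoxon(scores_a, scores_b))

As the answer notes, this approach is not preferred in the literature: CV folds are not independent, so the resulting p-values should be read as rough guidance rather than exact significance levels.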

[ VIDEO OF THE WEEK]

Data-As-A-Service (#DAAS) to enable compliance reporting

Subscribe to YouTube

[ QUOTE OF THE WEEK]

Data really powers everything that we do. – Jeff Weiner

[ PODCAST OF THE WEEK]

@EdwardBoudrot / @Optum on #DesignThinking & #DataDriven Products #FutureOfData #Podcast

Subscribe: iTunes | GooglePlay

[ FACT OF THE WEEK]

Facebook users send on average 31.25 million messages and view 2.77 million videos every minute.

Sourced from: Analytics.CLUB #WEB Newsletter

With big data invading campus, universities risk unfairly profiling their students

Obama’s proposed Student Digital Privacy Act aims to limit what schools can do with data collected from apps used in K-12 classrooms. But college students are just as vulnerable to privacy violations.

Privacy advocates have long been pushing for laws governing how schools and companies treat data gathered from students using technology in the classroom. Most now applaud President Obama’s newly announced Student Digital Privacy Act to ensure “data collected in the educational context is used only for educational purposes.”

But while young students are vulnerable to privacy harms, things are tricky for college students, too. This is especially true as many universities and colleges gather and analyze more data about students’ academic — and personal — lives than ever before.

Jeffrey Alan Johnson, assistant director of institutional effectiveness and planning at Utah Valley University, has written about some of the main issues for universities and college students in the era of big data. I spoke with him about the ethical and privacy implications of universities using more data analytics techniques.

Selinger: Privacy advocates worry about companies creating profiles of us. Is there an analog in the academic space? Are profiles being created that can have troubling experiential effects?

Johnson: Absolutely. We’ve got an early warning system [called Stoplight] in place on our campus that allows instructors to see what a student’s risk level is for completing a class. You don’t come in and start demonstrating what kind of a student you are. The instructor already knows that. The profile shows a red light, a green light, or a yellow light based on things like have you attempted to take the class before, what’s your overall level of performance, and do you fit any of the demographic categories related to risk. These profiles tend to follow students around, even after folks change how they approach school. The profile says they took three attempts to pass a basic math course and that suggests they’re going to be pretty shaky in advanced calculus.

Selinger: Is this transparent to students? Do they actually know what information the professor sees?

Johnson: No, not unless the professor tells them. I don’t think students are being told about Stoplight at all. I don’t think students are being told about many of the systems in place. To my knowledge, they aren’t told about the basis of the advising system that Austin Peay put in place where they’re recommending courses to students based, in part, on their likelihood of success. They’re as unaware of these things as the general public is about how Facebook determines what users should see.

Evan Selinger is an associate professor of philosophy at Rochester Institute of Technology. Follow him on Twitter @EvanSelinger.

Originally posted via “With big data invading campus, universities risk unfairly profiling their students”.

 

Originally Posted at: With big data invading campus, universities risk unfairly profiling their students by analyticsweekpick

Jul 25, 19: #AnalyticsClub #Newsletter (Events, Tips, News & more..)

[  COVER OF THE WEEK ]

Image: Fake data (Source)

[ LOCAL EVENTS & SESSIONS]

More WEB events? Click Here

[ AnalyticsWeek BYTES]

>> Are You Evolving Your Analytics? by analyticsweek

>> Meet Us in DC for the 2019 Logi Conference by analyticsweek

>> Big Data Is No Longer Confined to the Big Business Playbook by analyticsweekpick

Wanna write? Click Here

[ FEATURED COURSE]

Applied Data Science: An Introduction

As the world’s data grow exponentially, organizations across all sectors, including government and not-for-profit, need to understand, manage and use big, complex data sets—known as big data…. more

[ FEATURED READ]

Thinking, Fast and Slow

Drawing on decades of research in psychology that resulted in a Nobel Prize in Economic Sciences, Daniel Kahneman takes readers on an exploration of what influences thought example by example, sometimes with unlikely wor… more

[ TIPS & TRICKS OF THE WEEK]

Fix the Culture, spread awareness to get awareness
Adoption of analytics tools and capabilities has not yet caught up to industry standards. Talent has always been the bottleneck to achieving comparable enterprise adoption. One of the primary reasons is a lack of understanding and knowledge among stakeholders. To facilitate wider adoption, data analytics leaders, users, and community members need to step up and create awareness within the organization. An aware organization goes a long way in helping get quick buy-ins and better funding, which ultimately leads to faster adoption. So be the voice that you want to hear from leadership.

[ DATA SCIENCE Q&A]

Q: What is the difference between supervised learning and unsupervised learning? Give concrete examples.

A: * Supervised learning: inferring a function from labeled training data
* Supervised learning: predictor measurements associated with a response measurement; we wish to fit a model that relates the two, either to better understand the relation between them (inference) or to accurately predict the response for future observations (prediction)
* Supervised learning methods: support vector machines, neural networks, linear regression, logistic regression, extreme gradient boosting
* Supervised learning examples: predict the price of a house based on its area and size; churn prediction; predict the relevance of search engine results
* Unsupervised learning: inferring a function to describe the hidden structure of unlabeled data
* Unsupervised learning: we lack a response variable that can supervise our analysis
* Unsupervised learning methods: clustering, principal component analysis, singular value decomposition
* Unsupervised learning examples: find customer segments; image segmentation; group US senators by their voting records

Source
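
The contrast is easy to show in a few lines of scikit-learn (a minimal sketch; the Iris dataset simply stands in for any labeled table): the supervised model learns from the labels, while the unsupervised model only looks for structure in the features.

from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Supervised: the labels y supervise the fit; the goal is to predict y
clf = LogisticRegression(max_iter=1000).fit(X, y)
print("predicted class:", clf.predict(X[:1]))

# Unsupervised: no labels; the goal is to describe hidden structure in X
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("cluster label  :", km.labels_[:1])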

[ VIDEO OF THE WEEK]

#BigData @AnalyticsWeek #FutureOfData #Podcast with @Beena_Ammanath, @GE

Subscribe to YouTube

[ QUOTE OF THE WEEK]

The world is one big data problem. – Andrew McAfee

[ PODCAST OF THE WEEK]

#FutureOfData Podcast: Conversation With Sean Naismith, Enova Decisions

Subscribe: iTunes | GooglePlay

[ FACT OF THE WEEK]

For a typical Fortune 1000 company, just a 10% increase in data accessibility will result in more than $65 million additional net income.

Sourced from: Analytics.CLUB #WEB Newsletter

Logi Tutorial: Step-by-Step Guide for Setting Up a Logi Application on AWS

This post originally appeared on dbSeer, a business analytics consulting firm and Logi Analytics partner.

Last year, we wrote a blog about how to use AWS Auto Scaling with Logi Analytics Applications. In that blog, we promised to release this step-by-step guide outlining the technical details of how a Logi Application can be configured to harness the scalability and elasticity features of AWS.

Enabling a multi-web server Logi application on AWS Windows instances requires the right configuration for some of the shared Logi files (cache files, secure key, bookmarks, etc.). To support these shared files, we need a shared network drive that can be accessed by the different Logi web servers. Currently, EFS (Elastic File System) is not supported on Windows on AWS. Below we have described how EFS can be mounted on Windows servers and set up so that you can utilize the scalability features of Logi.

Setting Up the File Server

Overview

In order for our distributed Logi application to function properly, it needs access to a shared file location. This can be easily implemented with Amazon’s Elastic File System (EFS). However, if you’re using a Windows server to run your Logi application, extra steps are necessary, as Windows does not currently support EFS drives. In order to get around this constraint, it is necessary to create Linux based EC2 instances to serve as an in-between file server. The EFS volumes will be mounted on these locations and then our Windows servers will access the files via the Samba (SMB) protocol.

Steps

  • Create EC2:
    • Follow the steps as outlined in this AWS Get Started guide and choose: Image: “Ubuntu Server 16.04 LTS (HVM), SSD Volume Type”
    • Create an Instance with desired type e.g.: “t2.micro”
  • Create AWS EFS volume:
    • Follow the steps listed here and use same VPC and availability zone as used above
  • Setup AWS EFS inside the EC2:
    • Connect to the EC2 instance we created in Step 1 using SSH
  • Mount the EFS to the EC2 using the following commands:
    • sudo apt-get install -y nfs-common
    • sudo mkdir /mnt/efs
    • sudo mount -t nfs4 -o nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2 EFS_IP_ADDRESS_HERE:/ /mnt/efs
  • Re-export NFS share to be used in Windows:
    • Give our Windows user access to its files. Let’s do this using Samba. Paste the following into your shell to install SMB (Samba) services on your Ubuntu EC2 instance
  • Run the following commands:
    • sudo apt-get install -y samba samba-common python-glade2 system-config-samba
    • sudo cp -pf /etc/samba/smb.conf /etc/samba/smb.conf.bak
    • sudo bash -c 'cat /dev/null > /etc/samba/smb.conf'
    • sudo nano /etc/samba/smb.conf
  • And then, paste the text below inside the smb.conf file:

[global]
workgroup = WORKGROUP
server string = AWS-EFS-Windows
netbios name = ubuntu
dns proxy = no
socket options = TCP_NODELAY

[efs]
path = /mnt/efs
read only = no
browseable = yes
guest ok = yes
writeable = yes

  • Create a Samba user/password. Use the same credentials as your EC2 user:
    • sudo smbpasswd -a ubuntu
  • Give Ubuntu user access to the mounted folder:
    • sudo chown ubuntu:ubuntu /mnt/efs/
  • And finally, restart the samba service:
    • sudo /etc/init.d/smbd restart

Setting Up the Application Server

Overview

Logi applications require setup in the form of settings, files, licenses, and more. In order to accommodate the elastic auto-scaling, we’ll set up one server – from creation to connecting to our shared drive to installing and configuring Logi – and then make an Amazon Machine Image (AMI) for use later.

Steps

  • Create EC2:
    • Follow the steps as outlined in this AWS Get Started guide and choose: Image: “Microsoft Windows Server 2016 Base”
    • Instance type: “t2.micro” or whatever type your application requires
  • Deploy code:
    • Clone your project repository and deploy the code in IIS
  • Set Up User Access:
    • Allow your application in IIS to access the shared folder (EFS) that we created inside the File server
    • From the control panel, choose users accounts → manage another account → add a user account
    • Use same username and password we created for the samba user in Ubuntu file server
    • In IIS, add the new Windows user we created above to the application connection pool, IIS → Application Pools → right click on your project application pool → identity → custom account → fill in the new username and password we created earlier.
  • Test EFS (shared folder) connection:
    • To test the connection between the Windows application server and the Ubuntu file server, go to:
    • This PC → Computer tab → Map network drive → in the folder textbox type in “\\FILE_SERVER_IP_ADDRESS\efs” → if a credentials window appears, just use the new username and password we created earlier.

Configuring the Logi Application

Sticky and Non-Sticky Sessions

In a standard environment with one server, a session is established with the first HTTP request and all subsequent requests, for the life of the session, will be handled by that same server. However, in a load-balanced or clustered environment, there are two possibilities for handling requests: “sticky” sessions (sometimes called session affinity) and “non-sticky” sessions.

Use a sticky session to handle HTTP requests by centralizing the location of any shared resources and managing session state. You must create a centralized, shared location for cached data (rdDataCache folder), saved Bookmark files, _metaData folder, and saved Dashboard files because they must be accessible to all servers in the cluster.

Managing Session State

IIS is configured by default to manage session information using the “InProc” option. For both standalone and load-balanced, sticky environments, this option allows a single server to manage the session information for the life of the session.

Centralization of Application Resources

In a load-balanced environment, each web server must have Logi Server installed and properly licensed, and must have its own copy of the Logi application with its folder structure, system files, etc. This includes everything in the _SupportFiles folder such as images, style sheets, XML data files, etc., any custom themes, and any HTML or script files. We will achieve this by creating one instance with all the proper configurations, and then using an AMI.

Some application files should be centralized, which also allows for easier configuration management. These files include:

  • Definitions: Copies of report, process, widget, template, and any other necessary definitions (except _Settings) can be installed on each web server as part of the application, or centralized definitions may be used for easier maintenance (if desired).
    The location of definitions is configured in _Settings definition, using the Path element’s Alternative Definition Folder attribute, as shown above. This should be set to the UNC path to a shared network location accessible by all web servers, and the attribute value should include the _Definitions folder. Physically, within that folder, you should create the folders _Reports, _Processes, _Widgets, and _Templates as necessary. Do not include the _Settings definition in any alternate location; it must remain in the application folder on the web server as usual.
  • “Saved” Files: Many super-elements, such as the Dashboard and Analysis Grid, allow the user to save the current configuration to a file for later reuse. The locations of these files are specified in attributes of the elements.
    As shown in the example above, the Save File attribute value should be the UNC path to a shared network location (with file name, if applicable) accessible by all web servers.
  • Bookmarks: If used in an application, the location of these files should also be centralized:
    As shown above, in the _Settings definition, configure the General element’s Bookmark Folder Location attribute, with a UNC path to a shared network folder accessible by all web servers.

Using SecureKey Security

If you’re using Logi SecureKey security in a load-balanced environment, you need to configure security to share requests.

In the _Settings definition, set the Security element’s SecureKey Shared Folder attribute to a network path, as shown above. Files in the SecureKey folder are automatically deleted over time, so do not use this folder to store other files. It’s required to create the folder rdSecureKey under myProject shared folder, since it’s not auto created by Logi.

Note: “Authentication Client Addresses” must be replaced later with subnet IP addresses ranges of the load balancer VPC after completing the setup for load balancer below.

You can specify ranges of IP addresses with wildcards. To use wildcards, specify an IP address, the space character, then the wildcard mask. For example, to allow all addresses in the range 172.16.*.*, specify:

172.16.0.0 0.0.255.255
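
For reference, the wildcard form above covers the same addresses as the CIDR block 172.16.0.0/16, which a quick check with Python's standard library confirms:

import ipaddress

# 172.16.0.0 with wildcard mask 0.0.255.255 is the same range as 172.16.0.0/16
allowed = ipaddress.ip_network("172.16.0.0/16")
print(allowed.network_address, "-", allowed.broadcast_address)   # 172.16.0.0 - 172.16.255.255
print(ipaddress.ip_address("172.16.42.7") in allowed)            # True
print(ipaddress.ip_address("172.17.0.1") in allowed)             # False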

Centralizing the Data Cache

The data cache repository is, by default, the rdDataCache folder in a Logi application’s root folder. In a standalone environment, where all the requests are processed by the same server, this default cache configuration is sufficient.

In a load-balanced environment, centralizing the data cache repository is required.

This is accomplished in Studio by editing a Logi application’s _Settings definition, as shown above. The General element’s Data Cache Location attribute value should be set to the UNC path of a shared network location accessible by all web servers. This change should be made in the _Settings definition for each instance of the Logi application (i.e. on each web server).

Note: “mySharedFileServer” IP/DNS address should be replaced later with file servers load balancer dns after completing the setup for load balancer below.

Creating and Configuring Your Load-Balancer

Overview

You’ll need to set up load balancers for both the Linux file server and the Windows application/web server. This process is relatively simple and is outlined below, and in the Getting Started guide here.

Steps

  • Windows application/web servers load balancer:
    • Use classic load balancers.
    • Use the same VPC that our Ubuntu file server’s uses.
    • Listener configuration: Keep defaults.
    • Health check configuration: Keep defaults and make sure the ping path exists, e.g. “/myProject/rdlogon.aspx”
  • Add Instances: Add all Windows web/application servers to the load balancer, and check the status. All servers should give “InService” in 20-30 seconds.
    • To enable stickiness, select ELB > port configuration > edit stickiness > choose “enable load balancer generated cookie stickiness”, set expiration period for the same as well.
  • Linux file servers load balancer:
    • Use classic load balancers.
    • Use the same VPC that the EFS volume uses.
    • Listener configuration:
    • Health check configuration: Keep defaults and make sure the ping path exists, e.g. “/index.html”
    • Note: A simple web application must be deployed to the Linux file servers, in order to set the health check. It should be running inside a web container like tomcat, then modify the ping path for the health checker to the deployed application path.
  • Add Instances: Add all Ubuntu file servers to the load balancer, and check the status, all servers should give “InService” in 20-30 seconds.
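
If you prefer to script the same setup rather than click through the console, the sketch below is one way to do it with boto3 (Python; the region, subnet, security group, instance IDs, and names are placeholder assumptions you would replace with your own). It creates the classic load balancer for the Windows web servers, configures the health check, registers the instances, and enables load-balancer-generated cookie stickiness; the file-server load balancer is analogous, minus the stickiness policy.

import boto3

elb = boto3.client("elb", region_name="us-east-1")   # classic ELB API; region is an assumption

# Classic load balancer for the Windows web/application servers
elb.create_load_balancer(
    LoadBalancerName="logi-web-elb",
    Listeners=[{"Protocol": "HTTP", "LoadBalancerPort": 80,
                "InstanceProtocol": "HTTP", "InstancePort": 80}],
    Subnets=["subnet-0123456789abcdef0"],             # same VPC as the file servers
    SecurityGroups=["sg-0123456789abcdef0"],
)

# Health check against an existing Logi page
elb.configure_health_check(
    LoadBalancerName="logi-web-elb",
    HealthCheck={"Target": "HTTP:80/myProject/rdlogon.aspx",
                 "Interval": 30, "Timeout": 5,
                 "UnhealthyThreshold": 2, "HealthyThreshold": 2},
)

# Register the web servers, then enable cookie stickiness on the listener
elb.register_instances_with_load_balancer(
    LoadBalancerName="logi-web-elb",
    Instances=[{"InstanceId": "i-0123456789abcdef0"}],
)
elb.create_lb_cookie_stickiness_policy(
    LoadBalancerName="logi-web-elb",
    PolicyName="logi-sticky",
    CookieExpirationPeriod=3600,
)
elb.set_load_balancer_policies_of_listener(
    LoadBalancerName="logi-web-elb",
    LoadBalancerPort=80,
    PolicyNames=["logi-sticky"],
)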

Using Auto-Scaling

Overview

In order to achieve auto-scaling, you need to set up a Launch Configuration (or Launch Template) and an Auto Scaling Group. You can follow the steps in the link here, or the ones outlined below.

Steps

  • Create Launch Configuration:
    • Search and select the AMI that you created above.
    • Use same security group you used in your app server EC2 instance. (Windows)
  • Create an Auto Scaling Group
    • Make sure to select the launch configuration that we created above.
    • Make sure to set the group size, i.e. how many EC2 instances you want to have in the auto scaling group at all times.
    • Make sure to use same VPC we used for the Windows application server EC2s.
  • Set the Auto scaling policies:
    • Set min/max size of the group:
    • Min: minimum number of instances that will be launched at all times.
    • Max: maximum number of instances that will be launched once a metric condition is met.
  • Click on “Scale the Auto Scaling group using step or simple scaling policies”
  • Set the required values for:
    • Increase group size
      • Make sure that you create a new alarm that will notify your auto scaling group when the CPU utilization exceeds certain limits.
      • Make sure that you specify the action “add” and the number of instances that we want to add when the above alarm triggered.
    • Decrease group size
      • Make sure that you create a new alarm that will notify your auto scaling group when the CPU utilization is below certain limits.
      • Make sure that you specify the action “remove” and the number of instances that we want to remove when the above alarm is triggered.
      • You can set the warm up time for the EC2, if necessary. This will depend on whether you have any initialization tasks that run after launching the EC2 instance, and if you want to wait for them to finish before starting to use the newly created instance.
      • You can also add a notification service to know when any instance is launched, terminated, failed to launch or failed to terminate by the auto scaling process.
  • Add tags to the auto scaling group. You can optionally choose to apply these tags to the instances in the group when they launch.
  • Review your settings and then click on Create Auto Scaling Group.
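
For teams that script their infrastructure, the same flow can be sketched with boto3 (Python; the AMI ID, names, subnet, security group, and thresholds are placeholder assumptions): create a launch configuration from the AMI, create the Auto Scaling group, and wire a simple scale-out policy to a CloudWatch CPU alarm.

import boto3

autoscaling = boto3.client("autoscaling", region_name="us-east-1")
cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

# Launch configuration built from the Logi application server AMI created earlier
autoscaling.create_launch_configuration(
    LaunchConfigurationName="logi-web-lc",
    ImageId="ami-0123456789abcdef0",
    InstanceType="t2.micro",
    SecurityGroups=["sg-0123456789abcdef0"],      # same group as the app server EC2s
)

# Auto Scaling group in the same VPC subnets as the Windows web servers
autoscaling.create_auto_scaling_group(
    AutoScalingGroupName="logi-web-asg",
    LaunchConfigurationName="logi-web-lc",
    MinSize=2, MaxSize=6, DesiredCapacity=2,
    VPCZoneIdentifier="subnet-0123456789abcdef0",
    LoadBalancerNames=["logi-web-elb"],
)

# Simple scaling policy: add one instance when the alarm fires
policy = autoscaling.put_scaling_policy(
    AutoScalingGroupName="logi-web-asg",
    PolicyName="logi-scale-out",
    AdjustmentType="ChangeInCapacity",
    ScalingAdjustment=1,
    Cooldown=300,
)

# CloudWatch alarm: average CPU above 70% for 10 minutes triggers the policy
cloudwatch.put_metric_alarm(
    AlarmName="logi-web-high-cpu",
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "AutoScalingGroupName", "Value": "logi-web-asg"}],
    Statistic="Average",
    Period=300,
    EvaluationPeriods=2,
    Threshold=70,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=[policy["PolicyARN"]],
)

A mirror-image policy with ScalingAdjustment=-1 and a LessThanThreshold alarm handles the "decrease group size" step.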

We hope this detailed how-to guide was helpful in helping you set up your Logi Application on AWS.

Please contact dbSeer if you have any questions or have any other how-to guide requests. We’re always happy to hear from you!

Originally Posted at: Logi Tutorial: Step-by-Step Guide for Setting Up a Logi Application on AWS