Nov 15, 18: #AnalyticsClub #Newsletter (Events, Tips, News & more..)

[  COVER OF THE WEEK ]

Tour of Accounting  Source

[ AnalyticsWeek BYTES]

>> 5 Advantages of Using a Redshift Data Warehouse by analyticsweek

>> January 23, 2017 Health and Biotech analytics news roundup by pstein

>> May 25, 17: #AnalyticsClub #Newsletter (Events, Tips, News & more..) by admin

[ NEWS BYTES]

>> ND vital statistics hold steady in 2017 – Bismarck Tribune (under Statistics)

>> The Use of Ramped Rep Equivalents (RREs) in Sales Analytics and Modeling – Enterprise Irregulars (blog) (under Sales Analytics)

>> State Street: Latest investor sentiment towards Brexit – Asset Servicing Times (under Risk Analytics)

[ FEATURED COURSE]

Probability & Statistics

This course introduces students to the basic concepts and logic of statistical reasoning and gives the students introductory-level practical ability to choose, generate, and properly interpret appropriate descriptive and… more

[ FEATURED READ]

On Intelligence

Jeff Hawkins, the man who created the PalmPilot, Treo smart phone, and other handheld devices, has reshaped our relationship to computers. Now he stands ready to revolutionize both neuroscience and computing in one strok… more

[ TIPS & TRICKS OF THE WEEK]

Save yourself from the zombie apocalypse of unscalable models
One zombie that stalks today’s analytical models is the absence of error bars. Not every model scales or holds its ground as data grows. The error bars attached to a model’s estimates should be calibrated, and recalibrated as the business takes in more data, because they keep the model’s claims sensible and in check. If error bars are ignored, our models become prone to failure, and that leads to a Halloween we never want to see.
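
As a rough illustration of an error bar on a simple estimate, the sketch below computes a mean with an approximate 95% interval and shows the interval narrowing as more data arrives. The simulated data, the function name `mean_with_error_bar` and the 1.96 normal-approximation multiplier are assumptions made for this example, not part of the tip itself.

```python
import math
import random
import statistics

def mean_with_error_bar(values, z=1.96):
    """Return (mean, half_width) of an approximate 95% interval on the mean."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    mean = statistics.fmean(values)
    # Standard error of the mean: sample standard deviation / sqrt(n).
    std_err = statistics.stdev(values) / math.sqrt(n)
    return mean, z * std_err

random.seed(0)
# Made-up data: 10,000 draws from a normal distribution.
population = [random.gauss(100.0, 15.0) for _ in range(10_000)]
for n in (20, 200, 2000):
    mean, half_width = mean_with_error_bar(population[:n])
    print(f"n={n:5d}  mean={mean:7.2f}  +/- {half_width:.2f}")
```

The half-width shrinks roughly with the square root of the sample size, which is one simple way to check whether an estimate is holding its ground as data grows.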

[ DATA SCIENCE Q&A]

Q: How do you handle missing data? What imputation techniques do you recommend?
A: * If data are missing at random: deletion introduces no bias, but it reduces the power of the analysis by shrinking the effective sample size
* Recommended: kNN imputation, Gaussian mixture imputation
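
To make the kNN idea concrete, here is a minimal pure-Python sketch. It assumes numeric rows with `None` marking missing values, and it uses a simple distance over the columns two rows share. The function name `knn_impute` and the toy data are hypothetical, and a production library may weight neighbours or scale features differently.

```python
import math

def knn_impute(rows, k=2):
    """Fill None entries with the mean of that column over the k nearest rows."""
    def distance(a, b):
        # Compare only columns observed in both rows, averaged so that rows
        # with different amounts of missing data stay comparable.
        shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        if not shared:
            return math.inf
        return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

    imputed = [list(row) for row in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Candidate donors: other rows that observed column j.
            donors = [(distance(row, other), other[j])
                      for m, other in enumerate(rows)
                      if m != i and other[j] is not None]
            donors = [d for d in donors if d[0] != math.inf]
            donors.sort(key=lambda d: d[0])
            nearest = donors[:k]
            if nearest:
                imputed[i][j] = sum(v for _, v in nearest) / len(nearest)
    return imputed

data = [
    [1.0, 2.0],
    [1.1, 2.2],
    [5.0, 10.0],
    [1.05, None],  # borrows from the two rows whose first column is near 1
]
print(knn_impute(data, k=2))
```

Here the missing value is filled from the two rows closest in the first column, so the distant third row has no influence on it.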

[ VIDEO OF THE WEEK]

#FutureOfData with Rob(@telerob) / @ConnellyAgency on running innovation in agency

[ QUOTE OF THE WEEK]

Torture the data, and it will confess to anything. – Ronald Coase

[ PODCAST OF THE WEEK]

Solving #FutureOfWork with #Detonate mindset (by @steven_goldbach & @geofftuff) #JobsOfFuture #Podcast

[ FACT OF THE WEEK]

More than 5 billion people are calling, texting, tweeting and browsing on mobile phones worldwide.

Sourced from: Analytics.CLUB #WEB Newsletter

Challenges for Data Driven Organization

Every new invention brings side effects and new challenges, and capturing and harnessing data is no exception. Data is a holy grail for data scientists and organizations because it can help them reach new heights of productivity, innovation and growth, but it comes with great responsibility. To leverage the potential of data, organizations have to prepare proactively in several areas: data policies, data security, legal issues, technology, organizational change and talent, access to data, and industry structure.

Data policies: As organizations capture and analyze larger amounts of data, they need to set up policies that address the cross-national flow of data, intellectual property, and liability. Data flows easily across international borders, so the country of origin can differ from the country of analysis. Such transfers need to be moderated, and policies already restrict them for specific types of data such as health information. Policies are also needed on who may analyze sensitive data about individuals; restrictions on the use of data such as credit scores and Social Security numbers (SSNs) matter both for privacy and for preventing misuse. Growing concern about the privacy of consumer data has been driven by firms that used consumers’ data for their own benefit, and it needs to be addressed by policies that protect the use of consumer data, especially health and financial data. There is thus a tradeoff between utility and privacy that needs to be resolved.

Data security: Once policies define who has access to data and how much, we need to make sure those policies are adhered to. In the recent past there have been increasing instances of consumer data being breached by hackers and ill-intentioned organizations, which has led to panic and concern about the security of data. As more consumer, organizational and national data is digitized, it will become more important to protect that data with better technology and policies.

Legal issues: Questions about the use of data, the ownership of data and the liability arising from its use are new and will need to be understood and resolved. Data differs from other assets in that it can be easily transferred, copied and manipulated. This can lead to ownership disputes that become very important in competitive situations, both within and across organizations. Liability can also arise from the use and analysis of data, especially from incorrect analysis or implementation. This could have a severe impact on an organization and will need clarification, probably over time, before the full potential of data can be captured.

Technology and techniques: The need for data capture and analysis has brought organizations to a point where they must merge and use various data systems and data marts to harness the complete value of their data, and new techniques and technologies need to be employed to achieve this. Organizations need to develop the basic infrastructure and capability to support data capture, data integration, data analysis and reporting. That means investing in new technology, upgrading legacy systems and managing change, including training personnel. There is also a need for new technologies that make data easier to manipulate and consume.

Organizational change and talent: This is a difficult issue with many aspects. On one side, leadership may lack the understanding of big data and its potential benefits needed to promote and approve initiatives that build capabilities. On the other side, the organization may lack the talent to handle and analyze data effectively. Companies that do have this talent can turn it into a big competitive advantage in the market. Another issue is the lack of organizational structures and incentives that optimize the use of data for better-informed decisions. Organizations therefore have to act on three fronts: educate the leadership on the importance of big data and win their support; develop in-house capability or hire people who can handle big data; and create organizational structures that promote and optimize the use of data.

Access to data: The power of data multiplies when it is integrated with other data sources to bring interesting insights to light. In most organizations, different departments use different systems with little scope for data integration. Also, as already noted, data ownership can give some people in the organization a feeling of power and competitive advantage, which makes them reluctant to share data and optimize its use. We therefore need to make sure that economic incentives within an organization are aligned toward sharing and integrating data. To transform an organization, you may also need data from third-party sources, which might not be easy to access and use. New business models are evolving, and different organizations are considering them, to make such transactions easier.

Industry structure: Some industry structures have not evolved to absorb the basic principles of efficiency and productivity. These industries face little competitive pressure and adopt data at a different rate. Government and health care, for example, are industries where performance transparency is low and where data has not made many inroads. They need to improve their productivity by using data more intensively to make better-informed decisions. Organization leaders will have to determine how to evolve the structure of these organizations in an increasingly integrated and competitive world, and how to use data to achieve and optimize that change.

Thus, data as a business driver can be transformative for organizations if the challenges listed above are tackled and the power of data is realized and used. All stakeholders, from leadership to data scientists to policy makers, need to understand how the challenges grow as data evolves and counter them proactively, so that we can create a culture that promotes and appreciates the use of data for everyone’s benefit.

Source by d3eksha

The 3 Step Guide CIO’s Need to Build a Data-Driven Culture

Today’s CIO has more data available than ever before. This creates an opportunity for big improvements in decision-making outcomes, but getting it right carries huge complexity and responsibility.

Many have already got it wrong, in large part because of organisational culture. At the centre of a successful analytics strategy is a data-driven culture.

According to a report by Gartner, more than 35% of the top 5,000 global companies will fail to make use of the insight derived from their data. In another report, by Eckerson, just 36% of respondents gave their BI program a grade of ‘Excellent’ or ‘Good’.

With the wealth of data already available in the world and the promise that it will continue to grow at an exponential rate, it seems inevitable that organisations attempt to leverage this resource to its fullest to improve their decision-making capabilities.

Before we move forward, it’s important to state that the success of these steps depends on ensuring that all employees who are directly involved with the data, or with the insight generated from it, are able to contribute. A case study of Warby Parker highlights this point: it illustrates the importance of self-service technologies that help all users meet their own data needs, which, according to Carl Anderson, the company’s Director of Data Science, is essential to realising a data-driven culture.

Set Realistic Goals

I suppose this step is generic best practice across every part of an organisation. However, I felt it needed to be mentioned because there are a number of examples of decision-makers becoming disillusioned with an analytics program that did not deliver what they had expected.

CIOs should therefore take the time to research their organisation in depth; I recommend they look at the current and future challenges it faces and tailor their analytics strategy around solving these.

During this process, it is important to fully understand the data sources the organisation currently uses for analysis and reporting, and to consider the external data sources available to it that are not yet utilised.

By researching and understanding the data sources available to the organisation, CIOs will find it easier to set realistic and clear goals that address the challenges facing the business. There is still work to be done on how the analytics strategy will achieve these goals, and it is at this point that CIOs need to get creative with the data available to them.

For example, big data has brought with it a wealth of unstructured data, and many analysts believe that tapping into it is paramount to obtaining a competitive advantage in the years to come. However, most organisations appear unlikely to realise this any time soon: recent studies estimate that only around 0.5% of the world’s unstructured data is analysed.

Build the Right Infrastructure

Once the plan has been formulated, the next step for CIOs is to ensure that their organisation’s IT infrastructure is aligned with the strategy so that the set goals can be achieved.

There is no universal “one way works for all” solution for building the right infrastructure; the most important factor to consider is whether the IT infrastructure can work according to the devised strategy.

A key requirement underpinning all good, modern infrastructures is the capability to integrate all of the organisation’s data sources into one central repository. Combining the data sources gives users a holistic view of the entire organisation.

For example, in an environment where all of the organisation’s data is stored in silos, analysts may identify a trend or correlation in one data source but lack the full perspective that unified data would afford, i.e. what can our other data sources tell us about what has contributed to this correlation?

Legacy technologies that are now obsolete should be replaced with more modern approaches to processing, storing and analysing data; one example, cited by Gartner, is technologies built on search-engine technology.

Enable Front-Line Employees and Other Business Users

It is now imperative to ensure that front-line employees (those whose job roles benefit directly from access to data) and other business users (managers, key business executives, etc.) can self-serve their own data needs.

CIOs should look to acquire a solution built specifically for self-service analysis over large volumes of data and capable of seamless integration with their IT infrastructure.

A full analysis of employee skill-set and mind-set should be undertaken to determine whether certain employees need training in particular areas to bolster their knowledge or simply need to adapt their mind-set to a more analytical one.

Whilst it is essential that front-line employees and other business users are given access to self-service analysis, they will likely be “less-technical users”. Ensuring they have access to the right training and other learning tools is therefore vital to keep them from becoming frustrated or disheartened.

Investing in employee development in these areas now will save time and money further down the line by removing an over-reliance on both internal and external IT experts.

Source: The 3 Step Guide CIO’s Need to Build a Data-Driven Culture

Nov 08, 18: #AnalyticsClub #Newsletter (Events, Tips, News & more..)

[  COVER OF THE WEEK ]

Data shortage  Source

[ FEATURED COURSE]

Deep Learning Prerequisites: The Numpy Stack in Python

The Numpy, Scipy, Pandas, and Matplotlib stack: prep for deep learning, machine learning, and artificial intelligence… more

[ FEATURED READ]

Data Science for Business: What You Need to Know about Data Mining and Data-Analytic Thinking

Written by renowned data science experts Foster Provost and Tom Fawcett, Data Science for Business introduces the fundamental principles of data science, and walks you through the “data-analytic thinking” necessary for e… more

[ TIPS & TRICKS OF THE WEEK]

Fix the culture: spread awareness to get awareness
Adoption of analytics tools and capabilities has not yet caught up with industry standards. Talent has always been the bottleneck to achieving comparable enterprise adoption. One of the primary reasons is a lack of understanding and knowledge among stakeholders. To facilitate wider adoption, data analytics leaders, users, and community members need to step up and create awareness within the organization. An aware organization goes a long way toward getting quick buy-in and better funding, which ultimately leads to faster adoption. So be the voice that you want to hear from leadership.

[ DATA SCIENCE Q&A]

Q: What is your definition of big data?
A: Big data is high volume, high velocity and/or high variety information assets that require new forms of processing
– Volume: big data doesn’t sample, just observes and tracks what happens
– Velocity: big data is often available in real-time
– Variety: big data comes from texts, images, audio, video…

Difference between big data and business intelligence:
– Business intelligence uses descriptive statistics with data with high density information to measure things, detect trends etc.
– Big data uses inductive statistics (statistical inference) and concepts from non-linear system identification to infer laws (regression, classification, clustering) from large data sets with low density information to reveal relationships and dependencies or to perform prediction of outcomes or behaviors

[ VIDEO OF THE WEEK]

@BrianHaugli @The_Hanover on Building a #Leadership #Security #Mindset #FutureOfData #Podcast

[ QUOTE OF THE WEEK]

If you can’t explain it simply, you don’t understand it well enough. – Albert Einstein

[ PODCAST OF THE WEEK]

#BigData @AnalyticsWeek #FutureOfData #Podcast with @MPFlowersNYC, @enigma_data

[ FACT OF THE WEEK]

The U.S. faces a projected shortfall of 140,000 to 190,000 people with deep analytical skills to fill Big Data jobs by 2018.

Sourced from: Analytics.CLUB #WEB Newsletter

Using sparklyr with Microsoft R Server

The sparklyr package (by RStudio) provides a high-level interface between R and Apache Spark. Among many other things, it allows you to filter and aggregate data in Spark using the dplyr syntax. In Microsoft R Server 9.1, you can now connect to a Spark session using the sparklyr package as the interface, allowing you to combine the data-preparation capabilities of sparklyr and the data-analysis capabilities of Microsoft R Server in the same environment.

In a presentation at the Spark Summit (embedded below, and you can find the slides here), Ali Zaidi shows how to connect to a Spark session from Microsoft R Server and use the sparklyr package to extract a data set. He then shows how to build predictive models on this data (specifically, a deep neural network and a boosted trees classifier). He also shows how to build general ensemble models and cross-validate hyper-parameters in parallel, and even gives a preview of forthcoming streaming analysis capabilities.

[youtube https://www.youtube.com/watch?v=8-xvKlz26vg?rel=0&w=500&h=281]

An easy way to try out these capabilities is with Azure HDInsight 3.6, which provides a managed Spark 2.1 instance with Microsoft R Server 9.1.

Spark Summit: Extending the R API for Spark with sparklyr and Microsoft R Server

Originally Posted at: Using sparklyr with Microsoft R Server

Nov 01, 18: #AnalyticsClub #Newsletter (Events, Tips, News & more..)

[  COVER OF THE WEEK ]

Accuracy check  Source

[ AnalyticsWeek BYTES]

>> The User Experience of State Government Websites by analyticsweek

>> Marginal gains: the rise of data analytics in sport by analyticsweekpick

>> The Pitfalls of Using Predictive Models by bobehayes

[ NEWS BYTES]

>> How to Avoid the Trap of Fragmented Security Analytics – Security Intelligence (blog) (under Analytics)

>> Are You Spending Too Much (or Too Little) on Cybersecurity? – Data Center Knowledge (under Data Center)

>> Most UK businesses are not insured against security breaches and data loss, says study – Information Age (under Data Security)

[ FEATURED COURSE]

Python for Beginners with Examples

A practical Python course for beginners with examples and exercises…. more

[ FEATURED READ]

Superintelligence: Paths, Dangers, Strategies

The human brain has some capabilities that the brains of other animals lack. It is to these distinctive capabilities that our species owes its dominant position. Other animals have stronger muscles or sharper claws, but … more

[ TIPS & TRICKS OF THE WEEK]

Keeping biases in check during the last mile of decision making
Today a data-driven leader, data scientist or data-driven expert is constantly put to the test, helping the team solve problems with his or her skills and expertise. Believe it or not, part of that decision tree is derived from intuition, which adds a bias to our judgement and taints the suggestions. Most skilled professionals understand and handle these biases well, but in a few cases we give in to tiny traps and find ourselves caught in biases that impair our judgement. So it is important to keep intuition bias in check when working on a data problem.

[ DATA SCIENCE Q&A]

Q: You have data on the durations of calls to a call center. Generate a plan for how you would code and analyze these data. Explain a plausible scenario for what the distribution of these durations might look like. How could you test, even graphically, whether your expectations are borne out?
A: 1. Exploratory data analysis
* Histogram of durations
* Histogram of durations per service type, per day of week, per hour of day (durations can be systematically longer from 10am to 1pm, for instance), per employee…
2. Distribution: lognormal?
3. Test graphically with a QQ plot: sample quantiles of log(durations) vs. normal quantiles
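
The QQ check in step 3 can be sketched without a plotting library by computing the points such a plot would show. The example below assumes simulated durations with made-up lognormal parameters; `qq_points` and `pearson` are hypothetical helper names. If log(durations) is close to normal, the points fall near a straight line, so their correlation is close to 1.

```python
import math
import random
from statistics import NormalDist

def qq_points(durations):
    """Pair sorted log-durations with standard-normal quantiles."""
    logs = sorted(math.log(d) for d in durations)
    n = len(logs)
    normal = NormalDist()
    # Plotting positions (i - 0.5) / n keep probabilities strictly inside (0, 1).
    theoretical = [normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return theoretical, logs

def pearson(xs, ys):
    """Pearson correlation, used here as a numeric summary of QQ linearity."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

random.seed(1)
# Simulated call durations in seconds; the parameters are illustrative only.
durations = [random.lognormvariate(5.0, 0.6) for _ in range(500)]
theoretical, observed = qq_points(durations)
r = pearson(theoretical, observed)
print(f"QQ correlation for log(durations): {r:.4f}")
```

To inspect the result graphically, the two returned lists can be passed to any scatter-plot tool, with `theoretical` on the x-axis and `observed` on the y-axis.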

[ VIDEO OF THE WEEK]

@AnalyticsWeek #FutureOfData with Robin Thottungal(@rathottungal), Chief Data Scientist at @EPA

[ QUOTE OF THE WEEK]

Everybody gets so much information all day long that they lose their common sense. – Gertrude Stein

[ PODCAST OF THE WEEK]

Solving #FutureOfOrgs with #Detonate mindset (by @steven_goldbach & @geofftuff) #FutureOfData #Podcast

[ FACT OF THE WEEK]

Brands and organizations on Facebook receive 34,722 Likes every minute of the day.

Sourced from: Analytics.CLUB #WEB Newsletter

@DrJasonBrooks talked about the Fabric and Future of Leadership #JobsOfFuture #Podcast

[youtube https://www.youtube.com/watch?v=SB29nSaCppU]

In this podcast, Jason talked about the fabric of great transformative leadership. He shared some tactical steps that current leaders can follow to ensure their relevance and their association with transformative teams. Jason emphasized the role of the team, the leader and the organization in creating a healthy, future-proof culture. It is a good session for the leadership of tomorrow.

Jason’s Recommended Read:
Reset: Reformatting Your Purpose for Tomorrow’s World by Jason Brooks https://amzn.to/2rAuywh
Essentialism: The Disciplined Pursuit of Less by Greg McKeown https://amzn.to/2jOX8Xi

Podcast Link:
iTunes: http://math.im/itunes
GooglePlay: http://math.im/gplay

Jason’s BIO:
Dr. Jason Brooks is an executive, entrepreneur, consulting and leadership psychologist, bestselling author, and speaker with over 24 years of demonstrated results in the design, implementation and evaluation of leadership and organizational development, organizational effectiveness, and human capital management solutions. He works to grow leaders and enhance workforce performance and overall individual and company success. He is a results-oriented, high-impact executive leader with experience in start-up, high-growth, and operationally mature multi-million and multi-billion dollar companies in multiple industries.

About #Podcast:
#JobsOfFuture podcast is a conversation starter that brings leaders, influencers and lead practitioners onto the show to discuss their journey in creating the data-driven future.

Wanna Join?
If you or anyone you know wants to join in,
Register your interest @ info@analyticsweek.com

Want to sponsor?
Email us @ info@analyticsweek.com

Keywords:
#JobsOfFuture #Leadership #Podcast #Future of #Work #Worker & #Workplace

Source: @DrJasonBrooks talked about the Fabric and Future of Leadership #JobsOfFuture #Podcast by v1shal

Oct 25, 18: #AnalyticsClub #Newsletter (Events, Tips, News & more..)

[  COVER OF THE WEEK ]

Conditional Risk  Source

[ AnalyticsWeek BYTES]

>> Avoiding a Data Science Hype Bubble by analyticsweek

>> The User Experience of University Websites by analyticsweek

>> Landscape of Big Data by v1shal

[ NEWS BYTES]

>> Beckage PLLC focuses on data security – Buffalo Business First (under Data Security)

>> The data center is dead: Here’s what comes next | ZDNet – ZDNet (under Data Center)

>> Global Automotive HVAC Sensors Market Outlook, Size, Status, and Forecast to 2025 – City Councilor (under Financial Analytics)

[ FEATURED COURSE]

Lean Analytics Workshop – Alistair Croll and Ben Yoskovitz

Use data to build a better startup faster in partnership with Geckoboard… more

[ FEATURED READ]

Data Science for Business: What You Need to Know about Data Mining and Data-Analytic Thinking

Written by renowned data science experts Foster Provost and Tom Fawcett, Data Science for Business introduces the fundamental principles of data science, and walks you through the “data-analytic thinking” necessary for e… more

[ TIPS & TRICKS OF THE WEEK]

Analytics Strategy that is Startup Compliant
With the right tools, capturing data is easy, but not being able to handle that data can lead to chaos. One of the most reliable startup strategies for adopting data analytics is TUM, or The Ultimate Metric: the metric that matters most to your startup. Some advantages of TUM: it answers the most important business question, it cleans up your goals, it inspires innovation, and it helps you understand the entire quantified business.

[ DATA SCIENCE Q&A]

Q: What do you think about the idea of injecting noise into your data set to test the sensitivity of your models?
A: * The effect would be similar to regularization: it helps avoid overfitting
* It is used to increase robustness
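
A small sketch of the idea, assuming a one-variable least-squares model and Gaussian noise added to the inputs. The data, noise levels and function names (`fit_line`, `slope_under_noise`) are made up for illustration. Refitting on noisy copies shows how much the fitted slope moves.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope

def slope_under_noise(xs, ys, sigma, trials=200, seed=0):
    """Refit after adding N(0, sigma) noise to x; return the fitted slopes."""
    rng = random.Random(seed)
    slopes = []
    for _ in range(trials):
        noisy = [x + rng.gauss(0.0, sigma) for x in xs]
        slopes.append(fit_line(noisy, ys)[1])
    return slopes

xs = [float(i) for i in range(50)]
ys = [3.0 + 2.0 * x for x in xs]  # made-up, noise-free relationship
for sigma in (0.0, 1.0, 5.0):
    slopes = slope_under_noise(xs, ys, sigma)
    mean_slope = sum(slopes) / len(slopes)
    spread = max(slopes) - min(slopes)
    print(f"sigma={sigma:3.1f}  mean slope={mean_slope:.3f}  spread={spread:.3f}")
```

In this setup, input noise pulls the average slope toward zero as sigma grows, which is the shrinkage behaviour the answer compares to regularization. The spread across trials gives a rough read on how sensitive the fit is.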

[ VIDEO OF THE WEEK]

#FutureOfData Podcast: Peter Morgan, CEO, Deep Learning Partnership

[ QUOTE OF THE WEEK]

Without big data, you are blind and deaf and in the middle of a freeway. – Geoffrey Moore

[ PODCAST OF THE WEEK]

#BigData #BigOpportunity in Big #HR by @MarcRind #JobsOfFuture #Podcast

[ FACT OF THE WEEK]

A quarter of decision-makers surveyed predict that data volumes in their companies will rise by more than 60 per cent by the end of 2014, with the average of all respondents anticipating a growth of no less than 42 per cent.

Sourced from: Analytics.CLUB #WEB Newsletter

Making sense of unstructured data by turning strings into things

We all know about the promise of Big Data Analytics to transform our understanding of the world. The analysis of structured data, such as inventory, transactions, close rates, and even clicks, likes and shares, is clearly valuable, but the curious fact about the immense volume of data being produced is that the vast majority of it is unstructured text. Content such as news articles, blog posts, product reviews, and yes, even the dreaded 140-character novella contains tremendous value, if only it could be connected to things in the real world: people, places and things. In this talk, we’ll discuss the challenges and opportunities that result when you extract entities from Big Text.

Speaker:
Gregor Stewart – Director of Product Management for Text Analytics at Basis Technology

As Director of Product Management, Mr. Stewart helps to ensure that Basis Technology’s offerings stay ahead of the curve. Previously Mr. Stewart was the CTO of a storage services startup and a strategy consultant. He holds a Masters in Natural Language Processing from the University of Edinburgh, a BA in PPE from the University of Oxford, and a Masters from the London School of Economics.

Thanks to our amazing sponsors:

Microsoft NERD for Venue

Basis Technology for Food and Kindle Raffle

Originally Posted at: Making sense of unstructured data by turning strings into things by v1shal

Announcing dplyrXdf 1.0

I’m delighted to announce the release of version 1.0.0 of the dplyrXdf package. dplyrXdf began as a simple (relatively speaking) backend to dplyr for Microsoft Machine Learning Server/Microsoft R Server’s Xdf file format, but has now become a broader suite of tools to ease working with Xdf files.

This update to dplyrXdf brings the following new features:

  • Support for the new tidyeval framework that powers the current release of dplyr
  • Support for Spark and Hadoop clusters, including integration with the sparklyr package to process Hive tables in Spark
  • Integration with dplyr to process SQL Server tables in-database
  • Simplified handling of parallel processing for grouped data
  • Several utility functions for Xdf and file management
  • Workarounds for various glitches and unexpected behaviour in MRS and dplyr

Spark, Hadoop and HDFS

New in version 1.0.0 of dplyrXdf is support for Xdf files and datasets stored in HDFS in a Hadoop or Spark cluster. Most verbs and pipelines behave the same way, whether the computations are taking place in your R session itself, or in-cluster (except that they should be much more scalable in the latter case). Similarly, dplyrXdf can handle both the scenarios where your R session is taking place on the cluster edge node, or on a remote client.

For example, here is some sample code where we extract a table from Hive, then create a pipeline to process it in the cluster:

rxSparkConnect()
sampleHiv <- RxHiveData(table="hivesampletable")

# this will create the composite Xdf 'hivesampletable'
sampleXdf <- as_xdf(sampleHiv)

sampleXdf %>%
    filter(deviceplatform == "Android") %>%
    group_by(devicemake) %>%
    summarise(n=n()) %>%
    arrange(desc(n)) %>%
    head()
#>     devicemake     n
#> 1      Samsung 16244
#> 2           LG  7950
#> 3          HTC  2242
#> 4      Unknown  2133
#> 5     Motorola  1524

If you are logged into the edge node, dplyrXdf also has the ability to call sparklyr to process Hive tables in Spark. This can be more efficient than converting the data to Xdf format, since less I/O is involved. To run the above pipeline with sparklyr, we simply omit the step of creating an Xdf file:

sampleHiv %>%
    filter(deviceplatform == "Android") %>%
    group_by(devicemake) %>%
    summarise(n=n()) %>%
    arrange(desc(n))
#> # Source:     lazy query [?? x 2]
#> # Database:   spark_connection
#> # Ordered by: desc(n)
#>     devicemake     n
#>           
#> 1      Samsung 16244
#> 2           LG  7950
#> 3          HTC  2242
#> 4      Unknown  2133
#> 5     Motorola  1524
#> # ... with more rows

For more information about Spark and Hadoop support, see the HDFS vignette and the sparklyr website.

SQL database support

One of the key strengths of dplyr is its ability to interoperate with SQL databases. Given a database table as input, dplyr can translate the verbs in a pipeline into a SQL query, which is then executed in the database. For large tables, this can often be much more efficient than importing the data and running the pipeline locally. dplyrXdf can take advantage of this with an MRS data source that is a table in a SQL database, including (but not limited to) Microsoft SQL Server: rather than importing the data to Xdf, the data source is converted to a dplyr tbl and passed to the database for processing.

# copy the flights dataset to SQL Server
# (connStr is assumed to hold a connection string defined earlier in the session)
flightsSql <- RxSqlServerData("flights", connectionString=connStr)
flightsHd <- copy_to(flightsSql, nycflights13::flights)

# this is run inside SQL Server by dplyr
flightsQry <- flightsSql %>%
    filter(month > 6) %>%
    group_by(carrier) %>%
    summarise(avg_delay=mean(arr_delay))

flightsQry
#> # Source:   lazy query [?? x 2]
#> # Database: Microsoft SQL Server
#> #   13.00.4202[dbo@DESKTOP-TBHQGUH/sqlDemoLocal]
#>   carrier avg_delay
#>          
#> 1 "9E"        5.37 
#> 2 AA        - 0.743
#> 3 AS        -16.9  
#> 4 B6          8.53 
#> 5 DL          1.55 
#> # ... with more rows

For more information about working with SQL databases including SQL Server, see the dplyrXdf SQL vignette and the dplyr database vignette.

Parallel processing and grouped data

Even without a Hadoop or Spark cluster, dplyrXdf makes it easy to parallelise the handling of groups. To do this, it takes advantage of Microsoft R Server’s distributed compute contexts: for example, if you set the compute context to “localpar”, grouped transformations will be done in parallel on a local cluster of R processes. The cluster will be shut down automatically when the transformation is complete.

More broadly, you can create a custom backend and tell dplyrXdf to use it by setting the compute context to “dopar”. This allows you a great deal of flexibility and scalability, for example by creating a cluster of multiple machines (as opposed to multiple cores on a single machine). Even if you do not have the physical machines, packages like AzureDSVM and doAzureParallel allow you to deploy clusters of VMs in the cloud, and then shut them down again. For more information, see the “Parallel processing of grouped data” section of the Using dplyrXdf vignette.

Data and file management

New in dplyrXdf 1.0.0 is a suite of functions to simplify managing Xdf files and data sources:

  • HDFS file management: upload and download files with hdfs_file_upload and hdfs_file_download; copy/move/delete files with hdfs_file_copy, hdfs_file_move, hdfs_file_remove; list files with hdfs_dir; and more
  • Xdf data management: upload and download datasets with copy_to, collect and compute; import/convert to Xdf with as_xdf; copy/move/delete Xdf data sources with copy_xdf, move_xdf and delete_xdf; and more
  • Other utilities: run a block of code in the local compute context with local_exec; convert an Xdf file to a data frame with as.data.frame; extract columns from an Xdf file with methods for [, [[ and pull

Obtaining dplyr and dplyrXdf

dplyrXdf 1.0.0 is available from GitHub. It requires Microsoft R Server 8.0 or higher, and dplyr 0.7 or higher. Note that dplyr 0.7 will not be in the MRAN snapshot that is your default repo, unless you are using the recently-released MRS 9.2; you can install it, and its dependencies, from CRAN. If you want to use the SQL Server and sparklyr integration facility, you should install the odbc, dbplyr and sparklyr packages as well.

install.packages(c("dplyr", "dbplyr", "odbc", "sparklyr"),
                 repos="https://cloud.r-project.org")
devtools::install_github("RevolutionAnalytics/dplyrXdf")

If you run into any bugs, or if you have any feedback, you can email me or log an issue at the GitHub repo.

Source by analyticsweekpick