The document has moved here.
Statistically Significant Source
Python for Data Analysis is concerned with the nuts and bolts of manipulating, processing, cleaning, and crunching data in Python. It is also a practical, modern introduction to scientific computing in Python, tailored f… more
Winter is coming, warm your Analytics Club
Yes and yes! As we are heading into winter what better way but to talk about our increasing dependence on data analytics to help with our decision making. Data and analytics driven decision making is rapidly sneaking its way into our core corporate DNA and we are not churning practice ground to test those models fast enough. Such snugly looking models have hidden nails which could induce unchartered pain if go unchecked. This is the right time to start thinking about putting Analytics Club[Data Analytics CoE] in your work place to help Lab out the best practices and provide test environment for those models.
Q:Do you know / used data reduction techniques other than PCA? What do you think of step-wise regression? What kind of step-wise techniques are you familiar with?
A: data reduction techniques other than PCA?:
Partial least squares: like PCR (principal component regression) but chooses the principal components in a supervised way. Gives higher weights to variables that are most strongly related to the response
– the choice of predictive variables are carried out using a systematic procedure
– Usually, it takes the form of a sequence of F-tests, t-tests, adjusted R-squared, AIC, BIC
– at any given step, the model is fit using unconstrained least squares
– can get stuck in local optima
– Better: Lasso
– Forward-selection: begin with no variables, adding them when they improve a chosen model comparison criterion
– Backward-selection: begin with all the variables, removing them when it improves a chosen model comparison criterion
Better than reduced data:
Example 1: If all the components have a high variance: which components to discard with a guarantee that there will be no significant loss of the information?
Example 2 (classification):
– One has 2 classes; the within class variance is very high as compared to between class variance
– PCA might discard the very information that separates the two classes
Better than a sample:
– When number of variables is high relative to the number of observations
Subscribe to Youtube
If we have data, let’s look at data. If all we have are opinions, let’s go with mine. â Jim Barksdale
More than 200bn HD movies â which would take a person 47m years to watch.
Human resource Source
This is an Archived Course
Antifragile is a standalone book in Nassim Nicholas Talebâs landmark Incerto series, an investigation of opacity, luck, uncertainty, probability, human error, risk, and decision-making in a world we donât understand…. more
Fix the Culture, spread awareness to get awareness
Adoption of analytics tools and capabilities has not yet caught up to industry standards. Talent has always been the bottleneck towards achieving the comparative enterprise adoption. One of the primal reason is lack of understanding and knowledge within the stakeholders. To facilitate wider adoption, data analytics leaders, users, and community members needs to step up to create awareness within the organization. An aware organization goes a long way in helping get quick buy-ins and better funding which ultimately leads to faster adoption. So be the voice that you want to hear from leadership.
Q:Examples of NoSQL architecture?
A: * Key-value: in a key-value NoSQL database, all of the data within consists of an indexed key and a value. Cassandra, DynamoDB
* Column-based: designed for storing data tables as sections of columns of data rather than as rows of data. HBase, SAP HANA
* Document Database: map a key to some document that contains structured information. The key is used to retrieve the document. MongoDB, CouchDB
* Graph Database: designed for data whose relations are well-represented as a graph and has elements which are interconnected, with an undetermined number of relations between them. Polyglot Neo4J
Subscribe to Youtube
War is 90% information. – Napoleon Bonaparte
According to Twitter’s own research in early 2012, it sees roughly 175 million tweets every day, and has more than 465 million accounts.
The human brain has some capabilities that the brains of other animals lack. It is to these distinctive capabilities that our species owes its dominant position. Other animals have stronger muscles or sharper claws, but … more
Data Analytics Success Starts with Empowerment
Being Data Driven is not as much of a tech challenge as it is an adoption challenge. Adoption has it’s root in cultural DNA of any organization. Great data driven organizations rungs the data driven culture into the corporate DNA. A culture of connection, interactions, sharing and collaboration is what it takes to be data driven. Its about being empowered more than its about being educated.
Q:How to optimize algorithms? (parallel processing and/or faster algorithms). Provide examples for both?
A: Premature optimization is the root of all evil – Donald Knuth
Parallel processing: for instance in R with a single machine.
– doParallel and foreach package
– doParallel: parallel backend, will select n-cores of the machine
– for each: assign tasks for each core
– using Hadoop on a single node
– using Hadoop on multi-node
– In computer science: Pareto principle; 90% of the execution time is spent executing 10% of the code
– Data structure: affect performance
– Caching: avoid unnecessary work
– Improve source code level
For instance: on early C compilers, WHILE(something) was slower than FOR(;;), because WHILE evaluated âsomethingâ and then had a conditional jump which tested if it was true while FOR had unconditional jump.
Subscribe to Youtube
He uses statistics as a drunken man uses lamp postsâfor support rather than for illumination. â Andrew Lang
In 2015, a staggering 1 trillion photos will be taken and billions of them will be shared online. By 2017, nearly 80% of photos will be taken on smart phones.