
Understand various data science concepts in a few minutes

QUICKBOOK FOR DATA SCIENCE


ALSO HELPFUL FOR EMCDSA
(E20-007) CERTIFICATION
By HadoopExam Learning Resources in Partnership with www.QuickTechie.com



Project Sponsor: Responsible for the genesis of the project. Provides the impetus and requirements for the project and defines the core business problem. Generally provides the funding and gauges the degree of value from the final outputs of the working team. This person sets the priorities for the project and clarifies the desired outputs. Because the presentation is often circulated within an organization, it is critical to articulate the results properly and position the findings in a way that is appropriate for the audience.

Presentation for project sponsors: This contains high-level takeaways for executive-level stakeholders, with a few key messages to aid their decision-making process. Focus on clean, easy visuals for the presenter to explain and for the viewer to grasp.

Classification
Build models to classify data into different categories. Algorithms: support vector machine (SVM), boosted and bagged decision trees, k-nearest neighbor, Naive Bayes, discriminant analysis, neural networks. Typical use cases: credit scoring, tumor detection, image recognition.

Regression
Build models to predict continuous data. Algorithms: linear model, nonlinear model, regularization, stepwise regression, boosted and bagged decision trees, neural networks, adaptive neuro-fuzzy learning. Typical use cases: electricity load forecasting, algorithmic trading, drug discovery.

Clustering
Find natural groupings and patterns in data. Algorithms: k-means, hierarchical clustering, Gaussian mixture models, hidden Markov models, self-organizing maps, fuzzy c-means clustering, subtractive clustering.

An Example of Big Data


An example of big data might be petabytes (1,024 terabytes) or exabytes (1,024 petabytes) of data consisting of billions to trillions of records of millions of people, all from different sources (e.g. web, sales, customer contact center, social media, mobile data and so on). The data is typically loosely structured data that is often incomplete and inaccessible. Another example is the daily log files from a web server that receives 100,000 hits per minute.
A naive Bayes classifier assumes that the presence or absence of a particular feature of a class is unrelated to the presence or absence of other features. For example, an object can be classified based on its attributes such as shape, color, and weight. A reasonable classification for an object that is spherical, yellow, and less than 60 grams in weight may be a tennis ball. Even if these features depend on each other or upon the existence of the other features, a naive Bayes classifier considers all these properties to contribute independently to the probability that the object is a tennis ball.

Naive Bayes is a simple technique for constructing classifiers: models that assign class labels to problem instances,
represented as vectors of feature values, where the class labels are drawn from some finite set. It is not a single

algorithm for training such classifiers, but a family of algorithms based on a common principle: all naive Bayes

classifiers assume that the value of a particular feature is independent of the value of any other feature, given the
class variable. For example, a fruit may be considered to be an apple if it is red, round, and about 3" in diameter. A



naive Bayes classifier considers each of these features to contribute independently to the probability that this fruit is
an apple, regardless of any possible correlations between the color, roundness and diameter features.

For some types of probability models, naive Bayes classifiers can be trained very efficiently in a supervised learning

setting. In many practical applications, parameter estimation for naive Bayes models uses the method of maximum
likelihood; in other words, one can work with the naive Bayes model without accepting Bayesian probability or using

any Bayesian methods.

Despite their naive design and apparently oversimplified assumptions, naive Bayes classifiers have worked quite well

in many complex real-world situations. In 2004, an analysis of the Bayesian classification problem showed that there
are sound theoretical reasons for the apparently implausible efficacy of naive Bayes classifiers.[5] Still, a
comprehensive comparison with other classification algorithms in 2006 showed that Bayes classification is
outperformed by other approaches, such as boosted trees or random forests.[6]

An advantage of naive Bayes is that it only requires a small amount of training data to estimate the parameters necessary for classification.
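
A minimal sketch of this idea in Python, assuming scikit-learn is available; the fruit-style features and the tiny training set below are invented purely for illustration:

    # Naive Bayes with scikit-learn (assumed available); toy, invented data.
    from sklearn.naive_bayes import GaussianNB

    # columns: diameter_cm, roundness (0-1), redness (0-1) -- hypothetical features
    X_train = [[7.5, 0.9, 0.8],   # apple-like
               [7.0, 0.8, 0.9],   # apple-like
               [20.0, 0.3, 0.1],  # banana-like
               [19.0, 0.2, 0.2]]  # banana-like
    y_train = ["apple", "apple", "banana", "banana"]

    model = GaussianNB()  # treats each feature as conditionally independent, given the class
    model.fit(X_train, y_train)
    print(model.predict([[7.2, 0.85, 0.75]]))        # -> ['apple']
    print(model.predict_proba([[7.2, 0.85, 0.75]]))  # per-class probabilities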

Data Analytics Lifecycle:


Phase 1 - Discovery: In Phase 1, the team learns the business domain, including relevant history such as whether
the organization or business unit has attempted similar projects in the past from which they can learn. The team
assesses the resources available to support the project in terms of people, technology, time, and data. Important

activities in this phase include framing the business problem as an analytics challenge that can be addressed in
subsequent phases and formulating initial hypotheses (IHs) to test and begin learning the data.
Phase 2 - Data preparation: Phase 2 requires the presence of an analytic sandbox, in which the team can work with
data and perform analytics for the duration of the project. The team needs to execute extract, load, and transform
(ELT) or extract, transform and load (ETL) to get data into the sandbox. The ELT and ETL are sometimes

abbreviated as ETLT. Data should be transformed in the ETLT process so the team can work with it and analyze it. In

this phase, the team also needs to familiarize itself with the data thoroughly and take steps to condition the data.
Phase 3 - Model planning: Phase 3 is model planning, where the team determines the methods, techniques, and
workflow it intends to follow for the subsequent model building phase. The team explores the data to learn about the
relationships between variables and subsequently selects key variables and the most suitable models.
Phase 4 - Model building: In Phase 4, the team develops datasets for testing, training, and production purposes. In
addition, in this phase the team builds and executes models based on the work done in the model planning phase.
The team also considers whether its existing tools will suffice for running the models, or if it will need a more robust
environment for executing models and workflows (for example, fast hardware and parallel processing, if applicable).



Phase 5 - Communicate results: In Phase 5, the team, in collaboration with major stakeholders, determines if the
results of the project are a success or a failure based on the criteria developed in Phase 1. The team should identify

key findings, quantify the business value, and develop a narrative to summarize and convey findings to stakeholders.
Phase 6 - Operationalize: In Phase 6, the team delivers final reports, briefings, code, and technical documents. In
addition, the team may run a pilot project to implement the models in a production environment.

Linear Regression: Also called Ordinary Least Squares Regression, it models the linear relationship between a dependent variable and one or more independent variables. It is also used to assess a model's goodness of fit.

Y = b0 + b1x1 + b2x2 + ... + bnxn

In the linear model, the bi's represent the unknown p parameters. The estimates for these unknown parameters are chosen so that, on average, the model provides a reasonable estimate of a person's income based on age and education. In other words, the fitted model should minimize the overall error between the linear model and the actual observations. Ordinary Least Squares (OLS) is a common technique to estimate the parameters.
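
A small NumPy sketch of OLS estimation for a hypothetical income ~ age + education model; the numbers are invented, and np.linalg.lstsq stands in for the closed-form least-squares solution:

    import numpy as np

    age = np.array([25, 32, 47, 51, 62], dtype=float)
    edu = np.array([12, 16, 16, 18, 14], dtype=float)     # years of education
    income = np.array([30, 52, 68, 80, 60], dtype=float)  # in thousands

    # design matrix: intercept column plus the two predictors
    X = np.column_stack([np.ones_like(age), age, edu])
    b, residuals, rank, _ = np.linalg.lstsq(X, income, rcond=None)
    print("b0, b1, b2 =", b)   # least-squares parameter estimates
    print("fitted:", X @ b)    # model's estimate of income for each observation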

The p-value for each term tests the null hypothesis that the coefficient is equal to zero (no effect). A low p-value
(< 0.05) indicates that you can reject the null hypothesis. In other words, a predictor that has a low p-value is
likely to be a meaningful addition to your model because changes in the predictor's value are related to changes in
the response variable. Conversely, a larger (insignificant) p-value suggests that changes in the predictor are not

associated with changes in the response. Significance of the estimated coefficients: are the t-statistics greater than 2 in magnitude, corresponding to p-values less than 0.05? If they are not, you should probably try to refit the model with the least significant variable excluded, which is the "backward stepwise" approach to model refinement.

Remember that the t-statistic is just the estimated coefficient divided by its own standard error. Thus, it measures
"how many standard deviations from zero" the estimated coefficient is, and it is used to test the hypothesis that the

true value of the coefficient is non-zero, in order to confirm that the independent variable really belongs in the model.

The p-value is the probability of observing a t-statistic that large or larger in magnitude given the null hypothesis that the true coefficient value is zero. If the p-value is greater than 0.05, which occurs roughly when the t-statistic is less than 2 in absolute value, this means that the coefficient may be only "accidentally" significant.

There's nothing magical about the 0.05 criterion, but in practice it usually turns out that a variable whose estimated coefficient has a p-value of greater than 0.05 can be dropped from the model without affecting the error measures very much; try it and see.
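
To see these t-statistics and p-values in practice, here is a hedged sketch using the statsmodels package (assumed available) on synthetic data in which x2 is deliberately unrelated to the response:

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    x1 = rng.normal(size=100)
    x2 = rng.normal(size=100)                  # irrelevant predictor
    y = 2.0 + 1.5 * x1 + rng.normal(size=100)

    X = sm.add_constant(np.column_stack([x1, x2]))
    fit = sm.OLS(y, X).fit()
    print(fit.tvalues)  # each coefficient divided by its standard error
    print(fit.pvalues)  # expect a large (insignificant) p-value for x2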



Logistic regression extends the ideas of linear regression to the situation where the dependent variable, Y, is
categorical. We can think of a categorical variable as dividing the observations into classes. For example, if Y

denotes a recommendation on holding/selling/buying a stock, we have a categorical variable with three categories.
We can think of each of the stocks in the dataset (the observations) as belonging to one of three classes: the hold

class, the sell class, and the buy class. Logistic regression can be used for classifying a new observation, where the
class is unknown, into one of the classes, based on the values of its predictor variables (called classification). It can

also be used in data (where the class is known) to find similarities between observations within each class in terms of
the predictor variables (called profiling). For example, a logistic regression model can be built to determine if a person

will or will not purchase a new automobile in the next 12 months. The training set could include input variables for a

person's age, income, and gender as well as the age of an existing automobile. The training set would also include
the outcome variable on whether the person purchased a new automobile over a 12-month period. The logistic
regression model provides the likelihood or probability of a person making a purchase in the next 12 months. After
examining a few more use cases for logistic regression, the remaining portion of this chapter examines how to build
and evaluate a logistic regression model. Logistic regression attempts to predict outcomes based on a set of
independent variables, but if researchers include the wrong independent variables, the model will have little to no
predictive value. For example, if college admissions decisions depend more on letters of recommendation than test
scores, and researchers don't include a measure for letters of recommendation in their data set, then the logit model
will not provide useful or accurate predictions. This means that logistic regression is not a useful tool unless

researchers have already identified all the relevant independent variables.

Logistic regression can in many ways be seen to be similar to ordinary regression. It models the relationship between

a dependent and one or more independent variables, and allows us to look at the fit of the model as well as at the
significance of the relationships (between dependent and independent variables) that we are modelling. However, the
underlying principle of binomial logistic regression, and its statistical calculation, are quite different to ordinary linear

regression. While ordinary regression uses ordinary least squares to find a best fitting line, and comes up with
coefficients that predict the change in the dependent variable for one unit change in the independent variable, logistic

regression estimates the probability of an event occurring (e.g. the probability of a pupil continuing in education post

16). What we want to predict from a knowledge of relevant independent variables is not a precise numerical value of
a dependent variable, but rather the probability (p) that it is 1 (event occurring) rather than 0 (event not occurring).

This means that, while in linear regression, the relationship between the dependent and the independent variables is
linear, this assumption is not made in logistic regression. Instead, the logistic regression function is used.

Example 1: Suppose that we are interested in the factors that influence whether a political candidate wins an
election. The outcome (response) variable is binary (0/1); win or lose. The predictor variables of interest are the



amount of money spent on the campaign, the amount of time spent campaigning negatively, and whether the
candidate is an incumbent.

Example 2: A researcher is interested in how variables such as GRE (Graduate Record Exam scores), GPA (grade point average), and prestige of the undergraduate institution affect admission into graduate school. The outcome variable, admit/don't admit, is binary.
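
A hedged scikit-learn sketch of the automobile-purchase example described above; the training records and the 0.5 threshold are illustrative assumptions, not real data:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # columns: age, income (thousands), age of existing automobile (years)
    X = np.array([[25, 35, 2], [40, 80, 9], [55, 60, 12], [30, 45, 3],
                  [48, 95, 10], [35, 50, 4], [60, 70, 15], [28, 38, 1]])
    y = np.array([0, 1, 1, 0, 1, 0, 1, 0])  # 1 = purchased a new car within 12 months

    clf = LogisticRegression(max_iter=1000).fit(X, y)
    p = clf.predict_proba([[42, 75, 8]])[0, 1]   # predicted purchase probability
    print(f"purchase probability: {p:.2f}")
    print("label at 0.5 threshold:", clf.predict([[42, 75, 8]])[0])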

Analytics project and stakeholders:


- Business User typically tries to determine the benefits and implications of the findings to the business.
- Project Sponsor typically asks questions related to the business impact of the project, the risks and return on investment (ROI), and the way the project can be evangelized within the organization (and beyond).
- Project Manager needs to determine if the project was completed on time and within budget and how well the goals were met.
- Business Intelligence Analyst needs to know if the reports and dashboards he manages will be impacted and need to change.
- Data Engineer and Database Administrator (DBA) typically need to share their code from the analytics project and create a technical document on how to implement it.
- Data Scientist needs to share the code and explain the model to her peers, managers, and other stakeholders.

Key outputs from a successful analytics project:


- Presentation for project sponsors: contains high-level takeaways for executive-level stakeholders, with a few key messages to aid their decision-making process. Focus on clean, easy visuals for the presenter to explain and for the viewer to grasp.
- Presentation for analysts: describes business process changes and reporting changes. Fellow data scientists will want the details and are comfortable with technical graphs (such as Receiver Operating Characteristic [ROC] curves, density plots, and histograms).
- Code for technical people.
- Technical specifications for implementing the code.

Data exploration is an informative search used by data consumers to form a true analysis from the information gathered. Often, data is gathered in a non-rigid or controlled manner in large bulk. For true analysis, this unorganized bulk of data needs to be narrowed down. This is where data exploration is used to analyze the data and the information from the data to form further analysis.



In the data preparation phase of the Data Analytics Lifecycle, the data range and distribution can be obtained. If the
data is skewed, viewing the logarithm of the data (if it's all positive) can help detect structures that might otherwise be

overlooked in a graph with a regular, non-logarithmic scale.


When preparing the data, one should look for signs of dirty data, as explained in the previous section. Examining if

the data is unimodal or multimodal will give an idea of how many distinct populations with different behavior patterns
might be mixed into the overall population. Many modeling techniques assume that the data follows a normal

distribution. Therefore, it is important to know if the available dataset can match that assumption before applying any
of those modeling techniques.

Final Deliverables: When presenting to a technical audience such as data scientists and analysts, focus on how the work was done.

One-way ANOVA: What is this test for?


The one-way analysis of variance (ANOVA) is used to determine whether there are any significant differences between the means of three or more independent (unrelated) groups. This section gives a brief introduction to the one-way ANOVA, including the assumptions of the test and when you should use it.
What does this test do?
The one-way ANOVA compares the means between the groups you are interested in and determines whether any of those means are significantly different from each other. Specifically, it tests the null hypothesis:

H0: µ1 = µ2 = µ3 = ... = µk

where µ = group mean and k = number of groups. If the one-way ANOVA returns a significant result, we accept the alternative hypothesis (HA), which is that at least two group means are significantly different from each other.

At this point, it is important to realize that the one-way ANOVA is an omnibus test statistic and cannot tell you which

specific groups were significantly different from each other, only that at least two groups were. To determine which

specific groups differed from each other, you need to use a post hoc test. Post hoc tests are described later in this
guide.
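
A minimal one-way ANOVA sketch with scipy.stats (assumed available); the three groups are invented for illustration:

    from scipy import stats

    group_a = [23, 25, 27, 22, 26]
    group_b = [30, 31, 29, 32, 30]
    group_c = [24, 26, 25, 27, 23]

    f_stat, p_value = stats.f_oneway(group_a, group_b, group_c)
    print(f"F = {f_stat:.2f}, p = {p_value:.4f}")
    # A small p-value says at least two group means differ, but not which ones;
    # a post hoc test (e.g. Tukey's HSD) is needed to locate the difference.
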
K-means clustering uses an iterative algorithm that minimizes the sum of distances from each object to its cluster
centroid, over all clusters. This algorithm moves objects between clusters until the sum cannot be decreased further.

The result is a set of clusters that are as compact and well-separated as possible. You can control the details of the
minimization using several optional input parameters to kmeans, including ones for the initial values of the cluster
centroids, and for the maximum number of iterations.



Clustering is primarily an exploratory technique to discover hidden structures of the data, possibly as a prelude to
more focused analysis or decision processes. Some specific applications of k-means are image processing, medical,

and customer segmentation. Clustering is often used as a lead-in to classification. Once the clusters are identified,
labels can be applied to each cluster to classify each group based on its characteristics. Marketing and sales groups

use k-means to better identify customers who have similar behaviors and spending patterns.
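
A short customer-segmentation sketch with scikit-learn's KMeans (assumed available); the spending data is fabricated for illustration:

    import numpy as np
    from sklearn.cluster import KMeans

    # columns: average basket value, visits per month (invented)
    X = np.array([[20, 2], [22, 3], [25, 2],   # low spenders
                  [80, 8], [85, 9], [78, 7],   # frequent high spenders
                  [50, 1], [55, 2]])           # occasional mid spenders

    km = KMeans(n_clusters=3, n_init=10, random_state=0)  # k must be chosen up front
    labels = km.fit_predict(X)
    print("cluster labels:", labels)
    print("centroids:\n", km.cluster_centers_)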

What are the weaknesses of K-means clustering?

Like other algorithms, K-means clustering has several weaknesses:

- When the number of data points is small, the initial grouping determines the clusters significantly.

- The number of clusters, K, must be determined beforehand.

- We never know the real clusters; using the same data, input in a different order, may produce different clusters when the number of data points is small.

- We never know which attribute contributes more to the grouping process, since we assume that each attribute has the same weight.

One way to overcome these weaknesses is to use K-means clustering only when plenty of data is available.

A window function enables aggregation to occur but still provides the entire dataset with the summary results. For
example, the RANK() function can be used to order a set of rows based on some attribute.

A window function performs a calculation across a set of table rows that are somehow related to the current row. This
is comparable to the type of calculation that can be done with an aggregate function. But unlike regular aggregate
functions, use of a window function does not cause rows to become grouped into a single output row - the rows retain

their separate identities. Behind the scenes, the window function is able to access more than just the current row of
the query result.

A window function call always contains an OVER clause following the window function's name and argument(s). This
is what syntactically distinguishes it from a regular function or aggregate function. The OVER clause determines

exactly how the rows of the query are split up for processing by the window function. The PARTITION BY list within

OVER specifies dividing the rows into groups, or partitions that share the same values of the PARTITION BY
expression. For each row, the window function is computed across the rows that fall into the same partition as the

current row.

Although AVG() will produce the same result no matter what order it processes the partition's rows in, this is not true of all window functions. When needed, you can control that order using ORDER BY within OVER.
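
A sketch of RANK() and a windowed AVG() run through Python's built-in sqlite3 module; it assumes an SQLite build with window-function support (3.25 or later), and the table and values are invented:

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE emp (name TEXT, dept TEXT, salary INT)")
    con.executemany("INSERT INTO emp VALUES (?, ?, ?)",
                    [("ann", "sales", 90), ("bob", "sales", 70),
                     ("cid", "ops", 80), ("dee", "ops", 60)])

    rows = con.execute("""
        SELECT name, dept, salary,
               RANK() OVER (PARTITION BY dept ORDER BY salary DESC) AS dept_rank,
               AVG(salary) OVER (PARTITION BY dept)                 AS dept_avg
        FROM emp""").fetchall()
    for r in rows:
        print(r)  # every row is kept, with its rank and partition average attached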



Text Parsing: Parsing is the process that takes unstructured text and imposes a structure for further analysis. The
unstructured text could be a plain text file, a weblog, an Extensible Markup Language (XML) file, a HyperText Markup

Language (HTML) file, or a Word document. Parsing deconstructs the provided text and renders it in a more
structured way for the subsequent steps.

Confidence measures the chance that X and Y appear together in relation to the chance that X appears. Confidence can be used to identify the interestingness of the rules.

MapReduce is a framework for processing parallelizable problems across huge datasets using a large number of
computers (nodes), collectively referred to as a cluster (if all nodes are on the same local network and use similar
hardware) or a grid (if the nodes are shared across geographically and administratively distributed systems, and use
more heterogeneous hardware). Processing can occur on data stored either in a filesystem (unstructured) or in a
database (structured). MapReduce can take advantage of locality of data, processing it on or near the storage assets
in order to reduce the distance over which it must be transmitted.

"Map" step: Each worker node applies the "map()" function to the local data, and writes the output to a temporary
storage. A master node orchestrates that for redundant copies of input data, only one is processed.
"Shuffle" step: Worker nodes redistribute data based on the output keys (produced by the "map()" function), such that

all data belonging to one key is located on the same worker node.
"Reduce" step: Worker nodes now process each group of output data, per key, in parallel.
MapReduce allows for distributed processing of the map and reduction operations. Provided that each mapping

operation is independent of the others, all maps can be performed in parallel - though in practice this is limited by the
number of independent data sources and/or the number of CPUs near each source. Similarly, a set of 'reducers' can
perform the reduction phase, provided that all outputs of the map operation that share the same key are presented to

the same reducer at the same time, or that the reduction function is associative. While this process can often appear
inefficient compared to algorithms that are more sequential, MapReduce can be applied to significantly larger

datasets than "commodity" servers can handle - a large server farm can use MapReduce to sort a petabyte of data in

only a few hours.[10] The parallelism also offers some possibility of recovering from partial failure of servers or
storage during the operation: if one mapper or reducer fails, the work can be rescheduled - assuming the input data is

still available.
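
A toy, single-process Python sketch of the map, shuffle, and reduce steps for a word count; a real framework such as Hadoop distributes exactly this logic across many worker nodes:

    from collections import defaultdict

    documents = ["big data is big", "data science uses big data"]

    # Map: each input record independently emits (key, value) pairs.
    mapped = [(word, 1) for doc in documents for word in doc.split()]

    # Shuffle: group all values sharing a key, as if routed to one reducer node.
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)

    # Reduce: combine each key's values; reducers could run in parallel, one per key.
    counts = {key: sum(values) for key, values in groups.items()}
    print(counts)  # {'big': 3, 'data': 3, 'is': 1, 'science': 1, 'uses': 1}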

Apache Mahout is a suite of machine learning libraries designed to be scalable and robust. k-Means is a simple but
well-known algorithm for grouping objects, clustering. All objects need to be represented as a set of numerical
features. In addition, the user has to specify the number of groups (referred to as k) she wishes to identify.



Each object can be thought of as being represented by some feature vector in an n dimensional space, n being the
number of all features used to describe the objects to cluster. The algorithm then randomly chooses k points in that

vector space; these points serve as the initial centers of the clusters. Afterwards, all objects are each assigned to the
center they are closest to. Usually the distance measure is chosen by the user and determined by the learning task.

After that, for each cluster a new center is computed by averaging the feature vectors of all objects assigned to it. The
process of assigning objects and recomputing centers is repeated until the process converges. The algorithm can be

proven to converge after a finite number of iterations.
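
A bare-bones NumPy version of the loop just described (random initial centers, assign each object to the nearest center, recompute centers, repeat); it is meant only to illustrate the algorithm, not to replace a library implementation such as Mahout's:

    import numpy as np

    def kmeans(X, k, iters=100, seed=0):
        rng = np.random.default_rng(seed)
        centers = X[rng.choice(len(X), size=k, replace=False)]  # random initial centers
        for _ in range(iters):
            # assign every object to its closest center (Euclidean distance here)
            dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # recompute each center as the mean of the objects assigned to it
            new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                    else centers[j] for j in range(k)])
            if np.allclose(new_centers, centers):  # converged
                break
            centers = new_centers
        return labels, centers

    X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8]])
    print(kmeans(X, k=2))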

MADlib is an open-source library for scalable in-database analytics. It offers data-parallel implementations of mathematical, statistical, and machine learning methods for structured and unstructured data. Because MADlib is designed and built to accommodate massive parallel processing of data, it is ideal for Big Data in-database analytics. MADlib supports the open-source database PostgreSQL as well as the Pivotal Greenplum Database and Pivotal HAWQ. HAWQ is a SQL query engine for data stored in the Hadoop Distributed File System (HDFS).

An effective Data Scientist.


1. Diverse Technologies - a good Data Scientist is handy with a collection of open-source tools - Hadoop, Java,
Python, among others. Knowing when to use those tools, and how to code, are prerequisites. To be a Data Scientist,

you should have your hands on a number of tools and technologies, especially open-source ones, such as Hadoop, Java, Python, C++, ECL, etc. In addition, a good understanding of database technologies, such as NoSQL databases like HBase and CouchDB, is an advantage.

2. Mathematics - The second skill, as you might expect, is a base in statistics, algorithms, machine learning, and
mathematics. Conventional computer science degrees no longer satisfy the quest of a data scientist. The job requires

someone who on the one hand understands large-scale machine learning algorithms and programming and on the
other is a statistician. So, the profile is better suited for experts in other scientific and mathematical disciplines, apart

from computer science.

3. Business Skills - As data scientists wear multiple hats, they need to have strong business skills. A data scientist
has to communicate with diverse people in an organization; this includes communicating with and understanding business requirements and application requirements, and interpreting the patterns and relationships mined from data for people in marketing groups, product development teams, and corporate executives. All of this requires good business skills to get things done right.

4. Visualization - The fourth set of skills focuses on making products real and making data available to users. In other words, this one is a combination of coding skills, an ability to see where data can add value, and collaborating with teams to make these products a reality. You may be able to mine and model data, but are you able to visualize it? If not, you should be able to work with at least a few of the data visualization tools. Some of these include Tableau, Flare, D3.js, Processing, Google Visualization API, and Raphael.js.

5. Innovation - You don't just have to look at data and work with it; you have to think creatively and innovate. A data scientist should be eager to learn more, be curious to find new things, and think out of the box. They should be focused on making products real and making well-prepared data available to users. They should be able to see where data can add value, and how it can bring better results.

6. Problem-Solving Skills - This may seem obvious, of course, because data science is all about solving problems.
But a good data scientist must take the time to learn what problem needs to be solved, how the solution will deliver
value, and how it'll be used and by whom.

7. Communications Skills - Communication is the key to work with various cross-functional team members and
present analytics in a compelling and effective manner to the leadership and customers. In other words, you may be
brilliant in your rarefied field, but you're not going to be a really good data scientist if you can't communicate with the

common folk.

The SQL UNION clause/operator is used to combine the results of two or more SELECT statements without returning any duplicate rows. To use UNION, each SELECT must select the same number of columns, with the same data types, in the same order; the columns do not have to be the same length.

Wilcoxon Rank-Sum test: used when you cannot make an assumption about the distribution of the populations.
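
A minimal sketch with scipy.stats (assumed available); the two samples are invented, and no normality assumption is needed because the test works on ranks:

    from scipy import stats

    sample_a = [1.8, 2.1, 2.4, 1.9, 2.0, 2.3]
    sample_b = [2.6, 2.9, 2.7, 3.1, 2.8, 3.0]

    stat, p_value = stats.ranksums(sample_a, sample_b)
    print(f"statistic = {stat:.2f}, p = {p_value:.4f}")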

Highly processed data: Prior to conducting data analysis, the required data must be collected and processed to
extract the useful information. The degree of initial processing and data preparation depends on the volume of data,
as well as how straightforward it is to understand the structure of the data. Highly processed data may lose some

important information.

Emphasis color is to standard color as Main message is to context

Hive looks very much like traditional database code with SQL access. However, because Hive is based on Hadoop
and MapReduce operations, there are several key differences. The first is that Hadoop is intended for long sequential

scans, and because Hive is based on Hadoop, you can expect queries to have a very high latency (many minutes).
This means that Hive would not be appropriate for applications that need very fast response times, as you would

expect with a database such as DB2. Finally, Hive is read-based and therefore not appropriate for transaction
processing that typically involves a high percentage of write operations.

Box-and-whisker plots show the distribution of a continuous variable for each value of a discrete variable. To
create a box-and-whisker plot, you start by ordering your data (putting the values in numerical order), if they aren't

ordered already. Then you find the median of your data. The median divides the data into two halves. To divide the
data into quarters, you then find the medians of these two halves. Note: If you have an even number of values, so the
first median was the average of the two middle values, then you include the middle values in your sub-median
computations. If you have an odd number of values, so the first median was an actual data point, then you do not
include that value in your sub-median computations. That is, to find the sub-medians, you're only looking at the
values that haven't yet been used.

A line chart or line graph is a type of chart which displays information as a series of data points called 'markers'
connected by straight line segments. It is a basic type of chart common in many fields. It is similar to a scatter plot

except that the measurement points are ordered (typically by their x-axis value) and joined with straight line
segments. A line chart is often used to visualize a trend in data over intervals of time - a time series - thus the line is
often drawn chronologically. In these cases, they are known as run charts.

Scatter plots A scatter plot, scatterplot, or scattergraph is a type of mathematical diagram using Cartesian
coordinates to display values for two variables for a set of data. The data is displayed as a collection of points, each

having the value of one variable determining the position on the horizontal axis and the value of the other variable
determining the position on the vertical axis. This kind of plot is also called a scatter chart, scattergram, scatter

diagram, or scatter graph.


To visualize correlations between two variables, a scatter plot is typically the best choice. By plotting the data on a scatter plot, you can easily see any trends in the correlation, such as a linear relationship, a log normal relationship, or a polynomial relationship. A heat map uses three dimensions and so would be a poor choice for this purpose. Box plots, bar charts, and tree maps do not provide the kind of uniform spatial mapping of the data onto the graph that is required to see trends.

Box Plots: In descriptive statistics, a box plot or boxplot is a convenient way of graphically depicting groups of
numerical data through their quartiles. Box plots may also have lines extending vertically from the boxes (whiskers)
indicating variability outside the upper and lower quartiles, hence the terms box-and-whisker plot and box-and-
whisker diagram. Outliers may be plotted as individual points.
Box plots display differences between populations without making any assumptions of the underlying statistical
distribution: they are non-parametric. The spacings between the different parts of the box help indicate the degree of
dispersion (spread) and skewness in the data, and identify outliers. In addition to the points themselves, they allow
one to visually estimate various L-estimators, notably the interquartile range, midhinge, range, mid-range, and
trimean. Boxplots can be drawn either horizontally or vertically.

A heat map is a two-dimensional representation of data in which values are represented by colors. A simple heat map
provides an immediate visual summary of information. More elaborate heat maps allow the viewer to understand
complex data sets.

In the United States, many people are familiar with heat maps from viewing television news programs. During a
presidential election, for instance, a geographic heat map with the colors red and blue will quickly inform the viewer
which states each candidate has won.

Another type of heat map, which is often used in business, is sometimes referred to as a tree map. This type of heat
map uses rectangles to represent components of a data set. The largest rectangle represents the dominant logical

division of data and smaller rectangles illustrate other sub-divisions within the data set. The color and size of the

rectangles on this type of heat map can correspond to two different values, allowing the viewer to perceive two
variables at once. Tree maps are often used for budget proposals, stock market analysis, risk management, project

portfolio analysis, market share analysis, website design and network management.

Autocorrelation is the linear dependence of a variable with itself at two points in time. For stationary processes, the autocorrelation between any two observations depends only on the time lag h between them. Define Cov(yt, yt-h) = γh, the autocovariance at lag h. The autocorrelation function tells us the time interval over which a correlation in the noise exists. If the noise is made entirely of waves, and the waves move through the plasma (or other medium) without decaying as they travel, the autocorrelation will be large for all time. Why, then, does the autocorrelation decay in time? Because the data does not go on forever. Autocorrelation, also known as serial correlation, is the cross-correlation of a signal with itself at different points in time. Informally, it is the similarity between observations as a function of the time lag between them. It is a mathematical tool for finding repeating patterns, such as the presence of a periodic signal obscured by noise, or identifying the missing fundamental frequency in a signal implied by its harmonic frequencies. It is often used in signal processing for analyzing functions or series of values, such as time domain signals.
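
A small NumPy sketch of an empirical autocorrelation function; the noisy sine series is synthetic, chosen so the ACF shows the decaying, periodic pattern described above:

    import numpy as np

    rng = np.random.default_rng(1)
    t = np.arange(200)
    y = np.sin(2 * np.pi * t / 20) + 0.5 * rng.normal(size=t.size)

    def acf(x, max_lag):
        x = x - x.mean()
        denom = np.dot(x, x)
        return np.array([np.dot(x[:len(x) - h], x[h:]) / denom
                         for h in range(max_lag + 1)])

    print(np.round(acf(y, 10), 2))  # lag-0 autocorrelation is 1 by construction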

A regular expression is a method used in programming for pattern matching. Regular expressions provide a
flexible and concise means to match strings of text. For example, a regular expression could be used to search
through large volumes of text and change all occurrences of "cat" to "dog".
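
The cat-to-dog example, written with Python's built-in re module:

    import re

    text = "the cat sat; another cat ran; the category stayed"
    # \b word boundaries keep "category" from becoming "dogegory"
    print(re.sub(r"\bcat\b", "dog", text))
    # -> the dog sat; another dog ran; the category stayed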

Ordinal Data: The next level of data classification above nominal data. Numerical data where a number is assigned to represent a qualitative description, similar to nominal data; however, these numbers can be arranged to represent worst to best or vice versa. Ordinal data is a form of discrete data, and nonparametric tests should be applied to analyze it. An example is the ratings provided on an FMEA for Severity, Occurrence, and Detection:

DETECTION
1 = detectable every time
5 = detectable about 50% of the time
10 = not detectable at all
(All whole numbers from 1 to 10 represent levels of detection capability that are provided by the team, customer, standards, or law.)

Another example is classifying households as low-income, middle-income, and high-income. Nominal and ordinal data come from imprecise measurements and are referred to as non-metric data, sometimes called qualitative data. Ordinal data is also found when ranking sports teams, ranking the best cities to live in, the most popular beaches, and in survey questionnaires.

Interval Data:
The next higher level of data classification. Numerical data where the data can be arranged in an order and the differences between the values are meaningful, but there is not necessarily a true zero point. Interval data can be both continuous and discrete. Zero degrees Fahrenheit does not mean it is the lowest point on the scale; it is just another point on the scale. The lowest appropriate level for the mean is interval data. Parametric and nonparametric statistical techniques can be used to analyze interval data. Examples include temperature readings, the percentage change in performance of a machine, and the dollar change in the price of oil per gallon.

Ratio Data:
Similar to interval data EXCEPT has a defined absolute zero point and is the highest level of data measurement.
Ratio data can be both continuous and discrete. Ratio level data has the highest level of usage and can be analyzed

in more ways than the other three types of data. Interval data and ratio data are considered metric data, also called
quantitative data.

Apache Pig consists of a data flow language, Pig Latin, and an environment to execute the Pig code. The main
benefit of using Pig is to utilize the power of MapReduce in a distributed system, while simplifying the tasks of
developing and executing a MapReduce job. In most cases, it is transparent to the user that a MapReduce job is

running in the background when Pig commands are executed. This abstraction layer on top of Hadoop simplifies the

development of code against data in HDFS and makes MapReduce more accessible to a larger audience. With
Apache Hadoop and Pig already installed, the basics of using Pig include entering the Pig execution environment by
typing pig at the command prompt and then entering a sequence of Pig instruction lines at the grunt prompt.

Apache Sqoop is a tool designed to transfer data between Hadoop and relational databases. You can use Sqoop to
import data from a relational database management system (RDBMS) such as MySQL or Oracle into the Hadoop
Distributed File System (HDFS), transform the data in Hadoop MapReduce, and then export the data back into an
RDBMS. Sqoop automates most of this process, relying on the database to describe the schema for the data to be
imported. Sqoop uses MapReduce to import and export the data, which provides parallel operation as well as fault

tolerance.

Using visualization for data exploration is different from presenting results to stakeholders. Not every type of plot is
suitable for all audiences. Most of the plots presented earlier try to detail the data as clearly as possible for data
scientists to identify structures and relationships. These graphs are more technical in nature and are better suited to
technical audiences such as data scientists. Nontechnical stakeholders, however, generally prefer simple, clear

graphics that focus on the message rather than the data.

When presenting to a technical audience such as data scientists and analysts, focus on how the work was done.

Discuss how the team accomplished the goals and the choices it made in selecting models or analyzing the data.
Share analytical methods and decision-making processes so other analysts can learn from them for future projects.

Describe methods, techniques, and technologies used, as this technical audience will be interested in learning about
these details and considering whether the approach makes sense in this case and whether it can be extended to

other, similar projects. Plan to provide specifics related to model accuracy and speed, such as how well the model will
perform in a production environment.

A confusion matrix (Kohavi and Provost, 1998) contains information about actual and predicted classifications
done by a classification system. Performance of such systems is commonly evaluated using the data in the matrix.

The following table shows the confusion matrix for a two-class classifier. The entries in the confusion matrix have the following meaning in the context of our study:

                     Predicted Negative    Predicted Positive
  Actual Negative            a                     b
  Actual Positive            c                     d

- a is the number of correct predictions that an instance is negative,
- b is the number of incorrect predictions that an instance is positive,
- c is the number of incorrect predictions that an instance is negative, and
- d is the number of correct predictions that an instance is positive.

The accuracy (AC) is the proportion of the total number of predictions that were correct: AC = (a + d) / (a + b + c + d)
The recall or true positive rate (TP) is the proportion of positive cases that were correctly identified: TP = d / (c + d)
The false positive rate (FP) is the proportion of negative cases that were incorrectly classified as positive: FP = b / (a + b)
The true negative rate (TN) is the proportion of negative cases that were classified correctly: TN = a / (a + b)
The false negative rate (FN) is the proportion of positive cases that were incorrectly classified as negative: FN = c / (c + d)
Finally, precision (P) is the proportion of the predicted positive cases that were correct: P = d / (b + d)
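
A small Python sketch computing these rates for hypothetical counts of a, b, c, and d (the numbers are invented):

    a, b, c, d = 50, 10, 5, 35  # TN, FP, FN, TP counts, as defined above

    accuracy       = (a + d) / (a + b + c + d)
    recall_tpr     = d / (c + d)
    false_pos_rate = b / (a + b)
    true_neg_rate  = a / (a + b)
    false_neg_rate = c / (c + d)
    precision      = d / (b + d)

    print(f"accuracy={accuracy:.2f}, recall={recall_tpr:.2f}, "
          f"FPR={false_pos_rate:.2f}, precision={precision:.2f}")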

Apriori is an algorithm for frequent item set mining and association rule learning over transactional databases. It
proceeds by identifying the frequent individual items in the database and extending them to larger and larger item
sets as long as those item sets appear sufficiently often in the database. The frequent item sets determined by Apriori

can be used to determine association rules which highlight general trends in the database: this has applications in
domains such as market basket analysis. The whole point of the algorithm (and data mining, in general) is to extract

useful information from large amounts of data. For example, the information that a customer who purchases a

keyboard also tends to buy a mouse at the same time is acquired from the association rule below:

Support: The percentage of task-relevant data transactions for which the pattern is true.
Support (Keyboard -> Mouse) = No. of Transactions containing both Keyboards and Mouse/No. of total transactions
Confidence: The measure of certainty or trustworthiness associated with each discovered pattern.
Confidence (Keyboard -> Mouse) = No. of Transactions containing both Keyboards and Mouse/No. of transactions
containing (Keyboard)

The algorithm aims to find the rules which satisfy both a minimum support threshold and a minimum confidence
threshold (Strong Rules).
Item: article in the basket.
Itemset: a group of items purchased together in a single transaction.
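
A short Python sketch computing support and confidence for the Keyboard -> Mouse rule over a tiny, invented list of transactions (this shows the measure calculations only, not a full Apriori implementation):

    transactions = [
        {"keyboard", "mouse"},
        {"keyboard", "mouse", "monitor"},
        {"keyboard"},
        {"mouse", "monitor"},
        {"keyboard", "mouse"},
    ]

    both = sum(1 for t in transactions if {"keyboard", "mouse"} <= t)
    keyboard = sum(1 for t in transactions if "keyboard" in t)

    support = both / len(transactions)  # fraction of all transactions with both items
    confidence = both / keyboard        # fraction of keyboard transactions that also contain a mouse
    print(f"support = {support:.2f}, confidence = {confidence:.2f}")  # 0.60, 0.75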

In a ROC curve, the true positive rate (Sensitivity) is plotted as a function of the false positive rate (100-Specificity) for
different cut-off points of a parameter. Each point on the ROC curve represents a sensitivity/specificity pair
corresponding to a particular decision threshold. The area under the ROC curve (AUC) is a measure of how well a

parameter can distinguish between two diagnostic groups (diseased/normal). Logistic regression is often used as a

classifier to assign class labels to a person, item, or transaction based on the predicted probability provided by the
model. In the Churn example, a customer can be classified with the label called Churn if the logistic model predicts a
high probability that the customer will churn. Otherwise, a Remain label is assigned to the customer. Commonly, 0.5
is used as the default probability threshold to distinguish between any two class labels. However, any threshold value
can be used depending on the preference to avoid false positives (for example, to predict Churn when actually the
customer will Remain) or false negatives (for example, to predict Remain when the customer will actually Churn).
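
A hedged scikit-learn sketch of an ROC/AUC computation for the churn setting; the labels and predicted probabilities are invented:

    from sklearn.metrics import roc_auc_score, roc_curve

    y_true = [0, 0, 1, 1, 0, 1, 0, 1]                    # 1 = customer churned
    y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.3, 0.9]  # model probabilities

    fpr, tpr, thresholds = roc_curve(y_true, y_score)
    print("AUC =", roc_auc_score(y_true, y_score))
    # Applying a threshold (0.5 here) converts probabilities into class labels:
    print("labels at 0.5:", [1 if p >= 0.5 else 0 for p in y_score])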

Autoregressive-moving-average (ARMA): In the statistical analysis of time series, autoregressive-moving-


average (ARMA) models provide a parsimonious description of a (weakly) stationary stochastic process in terms of

two polynomials, one for the auto-regression and the second for the moving average. Given a time series of data Xt,
the ARMA model is a tool for understanding and, perhaps, predicting future values in this series. The model consists
of two parts, an autoregressive (AR) part and a moving average (MA) part. The model is usually then referred to as

the ARMA(p,q) model, where p is the order of the autoregressive part and q is the order of the moving average part.
There are a number of modelling options to account for a non-constant variance, for example ARCH (and GARCH,
and their many extensions) or stochastic volatility models.

An ARCH model extends ARMA models with an additional time series equation for the squared error term. They tend to be pretty easy to estimate (the fGarch R package, for example).

SV models extend ARMA models with an additional time series equation (usually an AR(1)) for the log of the time-dependent variance. I have found these models are best estimated using Bayesian methods (OpenBUGS has worked well for me in the past). You can fit an ARIMA model, but first you need to stabilize the variance by applying a suitable transformation, such as a Box-Cox transformation; this is the approach taken in Time Series Analysis: With Applications in R (page 99) and in Box-Jenkins modelling. Another reference is page 169 of Introduction to Time Series and Forecasting by Brockwell and Davis: "Once the data have been transformed (e.g., by some combination of Box-Cox and differencing transformations or by removal of trend and seasonal components) to the point where the transformed series X_t can potentially be fitted by a zero-mean ARMA model, we are faced with the problem of selecting appropriate values for the orders p and q." Therefore, you need to stabilize the variance prior to fitting the ARIMA model.
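
A hedged sketch of that workflow on synthetic data, assuming a recent statsmodels release: log-transform to stabilize the variance, then fit an ARIMA(p, d, q) model:

    import numpy as np
    from statsmodels.tsa.arima.model import ARIMA

    rng = np.random.default_rng(2)
    y = 50 + np.cumsum(rng.normal(loc=0.5, scale=1.0, size=200))  # synthetic series
    y_log = np.log(y)                                             # variance-stabilizing transform

    model = ARIMA(y_log, order=(1, 1, 1))  # p=1 AR term, d=1 difference, q=1 MA term
    result = model.fit()
    print(result.forecast(steps=5))        # forecasts on the transformed (log) scale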

Hypothesis testing requires constructing a statistical model of what the world would look like given that chance or
random processes alone were responsible for the results. The hypothesis that chance alone is responsible for the

results is called the null hypothesis. The model of the result of the random process is called the distribution under the
null hypothesis. The obtained results are then compared with the distribution under the null hypothesis, and the

likelihood of finding the obtained results is thereby determined.

Hypothesis testing works by collecting data and measuring how likely the particular set of data is, assuming the null
hypothesis is true, when the study is on a randomly-selected representative sample. The null hypothesis assumes no
relationship between variables in the population from which the sample is selected.

If the data-set of a randomly-selected representative sample is very unlikely relative to the null hypothesis (defined as
being part of a class of sets of data that only rarely will be observed), the experimenter rejects the null hypothesis
concluding it (probably) is false. This class of data-sets is usually specified via a test statistic which is designed to
measure the extent of apparent departure from the null hypothesis. The procedure works by assessing whether the

observed departure measured by the test statistic is larger than a value defined so that the probability of occurrence
of a more extreme value is small under the null hypothesis (usually in less than either 5% or 1% of similar data-sets in
which the null hypothesis does hold).

If the data do not contradict the null hypothesis, then only a weak conclusion can be made: namely, that the observed
data set provides no strong evidence against the null hypothesis. In this case, because the null hypothesis could be

true or false, in some contexts this is interpreted as meaning that the data give insufficient evidence to make any
conclusion; in other contexts it is interpreted as meaning that there is no evidence to support changing from a

currently useful regime to a different one.

For instance, a certain drug may reduce the chance of having a heart attack. Possible null hypotheses are "this drug

does not reduce the chances of having a heart attack" or "this drug has no effect on the chances of having a heart
attack". The test of the hypothesis consists of administering the drug to half of the people in a study group as a

controlled experiment. If the data show a statistically significant change in the people receiving the drug, the null
hypothesis is rejected.
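
A sketch of the drug example as a chi-square test on a 2x2 table using scipy.stats (assumed available); the counts are invented for illustration:

    from scipy.stats import chi2_contingency

    #        heart attack, no heart attack
    table = [[12, 188],   # treatment group
             [30, 170]]   # control group

    chi2, p_value, dof, expected = chi2_contingency(table)
    print(f"chi2 = {chi2:.2f}, p = {p_value:.4f}")
    if p_value < 0.05:
        print("Reject the null hypothesis: the difference is unlikely under chance alone.")
    else:
        print("No strong evidence against the null hypothesis.")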

The FULL OUTER JOIN keyword returns all rows from the left table (table1) and from the right table (table2).
The ROLLUP, CUBE, and GROUPING SETS operators are extensions of the GROUP BY clause. The ROLLUP,
CUBE, or GROUPING SETS operators can generate the same result set as when you use UNION ALL to combine
single grouping queries; however, using one of the GROUP BY operators is usually more efficient.

The GROUPING SETS operator can generate the same result set as that generated by using a simple GROUP BY,
ROLLUP, or CUBE operator. When all the groupings that are generated by using a full ROLLUP or CUBE operator

are not required, you can use GROUPING SETS to specify only the groupings that you want. The GROUPING SETS
list can contain duplicate groupings; and, when GROUPING SETS is used with ROLLUP and CUBE, it might

generate duplicate groupings. Duplicate groupings are retained as they would be by using UNION ALL. Queries that

use the ROLLUP and CUBE operators generate some of the same result sets and perform some of the same
calculations as OLAP applications. The CUBE operator generates a result set that can be used for cross tabulation
reports. A ROLLUP operation can calculate the equivalent of an OLAP dimension or hierarchy. A query with a
GROUP BY ROLLUP clause returns the same aggregated data as an equivalent query with a GROUP BY clause. It
also returns multiple levels of subtotal rows. You can include up to three fields in a comma-separated list in a GROUP
BY ROLLUP clause.

Area Under the Receiver Operating Characteristic Curve (AUC): There are no universal rules of thumb with the AUC, ever. What the AUC is is the probability that a randomly sampled positive (or case) will have a higher marker

value than a negative (or control) because the AUC is mathematically equivalent to the U statistic. What the AUC is
not is a standardized measure of predictive accuracy. Highly deterministic events can have single predictor AUCs of
95% or higher (such as in controlled mechatronics, robotics, or optics), some complex multivariable logistic risk

prediction models have AUCs of 64% or lower such as breast cancer risk prediction, and those are respectably high
levels of predictive accuracy.

A sensible AUC value, as with a power analysis, is prespecified by gathering knowledge of the background and aims of a study a priori. The doctor/engineer describes what they want, and you, the statistician, resolve on a target AUC

value for your predictive model. Then begins the investigation.

It is indeed possible to overfit a logistic regression model. Aside from linear dependence (if the model matrix is of deficient rank), you can also have perfect concordance, that is, the plot of fitted values against Y perfectly discriminates cases and controls. In that case, your parameters have not converged but simply reside somewhere on the boundary space that gives a likelihood of ∞. Sometimes, however, the AUC is 1 by random chance alone.

There's another type of bias that arises from adding too many predictors to the model, and that's small sample bias. In general, the log odds ratios of a logistic regression model tend toward a biased factor of 2β because of non-collapsibility of the odds ratio and zero cell counts. In inference, this is handled using conditional logistic regression to control for confounding and precision variables in stratified analyses. However, in prediction, you are out of luck. There is no generalizable prediction when you have p >> np(1-p) (where p = Prob(Y=1)), because you are guaranteed to have modeled the "data" and not the "trend" at that point. High-dimensional (large p) prediction of binary outcomes is better done with machine learning methods. Understanding linear discriminant analysis, partial least squares, nearest neighbor prediction, boosting, and random forests would be a very good place to start.

A decision tree is a flowchart-like structure in which each internal node represents a "test" on an attribute (e.g.
whether a coin flip comes up heads or tails), each branch represents the outcome of the test and each leaf node

represents a class label (a decision taken after computing all attributes). The paths from root to leaf represent classification rules.
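For a concrete illustration, the following sketch (assuming scikit-learn and its bundled iris data) fits a shallow classification tree and prints its root-to-leaf rules:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Fit a shallow tree and print the rules it learned from root to leaf.
iris = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(iris.data, iris.target)

print(export_text(tree, feature_names=list(iris.feature_names)))
print(tree.predict([[5.0, 3.5, 1.4, 0.2]]))  # class label for one new flower
```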

In decision analysis a decision tree and the closely related influence diagram are used as a visual and analytical
decision support tool, where the expected values (or expected utility) of competing alternatives are calculated.
A decision tree consists of 3 types of nodes:
Decision nodes - commonly represented by squares
Chance nodes - represented by circles
End nodes - represented by triangles
Decision trees are commonly used in operations research, specifically in decision analysis, to help identify a strategy
most likely to reach a goal. If in practice decisions have to be taken online with no recall under incomplete knowledge,
a decision tree should be paralleled by a probability model as a best choice model or online selection model

algorithm. Another use of decision trees is as a descriptive means for calculating conditional probabilities.

Decision trees, influence diagrams, utility functions, and other decision analysis tools and methods are taught to

undergraduate students in schools of business, health economics, and public health, and are examples of operations
research or management science methods.

The loess() function in R can be used to fit a nonlinear line to the data. LOESS and LOWESS (locally weighted
scatterplot smoothing) are two strongly related non-parametric regression methods that combine multiple regression

models in a k-nearest-neighbor-based meta-model. "LOESS" is a later generalization of LOWESS; although it is not a


true initialism, it may be understood as standing for "LOcal regrESSion". LOESS and LOWESS thus build on

"classical" methods, such as linear and nonlinear least squares regression. They address situations in which the
classical procedures do not perform well or cannot be effectively applied without undue labor. LOESS combines
much of the simplicity of linear least squares regression with the flexibility of nonlinear regression. It does this by
fitting simple models to localized subsets of the data to build up a function that describes the deterministic part of the
variation in the data, point by point. In fact, one of the chief attractions of this method is that the data analyst is not
required to specify a global function of any form to fit a model to the data, only to fit segments of the data.
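Outside R, a similar smoother is available in Python; the sketch below assumes statsmodels is installed and applies its lowess() function to hypothetical noisy data. The frac parameter controls the size of the local subsets and is the method's main tuning knob:

```python
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

# Noisy nonlinear data (hypothetical).
x = np.linspace(0, 10, 200)
y = np.sin(x) + np.random.default_rng(1).normal(scale=0.3, size=x.size)

# frac is the fraction of the data used for each local fit; smaller values
# follow the data more closely, larger values give a smoother curve.
smoothed = lowess(y, x, frac=0.25)   # returns sorted (x, fitted y) pairs

print(smoothed[:5])
```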

Online Analytical Processing (OLAP) databases facilitate business-intelligence queries. OLAP is a database
technology that has been optimized for querying and reporting, instead of processing transactions. The source data
for OLAP is Online Transactional Processing (OLTP) databases that are commonly stored in data warehouses. OLAP

data is derived from this historical data, and aggregated into structures that permit sophisticated analysis. OLAP data
is also organized hierarchically and stored in cubes instead of tables. It is a sophisticated technology that uses

multidimensional structures to provide rapid access to data for analysis. This organization makes it easy for a

PivotTable report or PivotChart report to display high-level summaries, such as sales totals across an entire country
or region, and also display the details for sites where sales are particularly strong or weak.

OLAP databases are designed to speed up the retrieval of data. Because the OLAP server, rather than Microsoft
Office Excel, computes the summarized values, less data needs to be sent to Excel when you create or change a
report. This approach enables you to work with much larger amounts of source data than you could if the data were
organized in a traditional database, where Excel retrieves all of the individual records and then calculates the
summarized values.

OLAP databases contain two basic types of data: measures, which are numeric data, the quantities and averages
that you use to make informed business decisions, and dimensions, which are the categories that you use to
organize these measures. OLAP databases help organize data by many levels of detail, using the same categories

that you are familiar with to analyze the data.

The following sections describe each of these components in more detail:

Cube: A data structure that aggregates the measures by the levels and hierarchies of each of the dimensions that you want to analyze. Cubes combine several dimensions, such as time, geography, and product lines, with summarized data, such as sales or inventory figures. Cubes are not "cubes" in the strictly mathematical sense because they do not necessarily have equal sides. However, they are an apt metaphor for a complex concept.
Measure: A set of values in a cube that are based on a column in the cube's fact table and that are usually numeric values. Measures are the central values in the cube that are preprocessed, aggregated, and analyzed. Common examples include sales, profits, revenues, and costs.
Member: An item in a hierarchy representing one or more occurrences of data. A member can be either unique or nonunique. For example, 2007 and 2008 represent unique members in the year level of a time dimension, whereas January represents a nonunique member in the month level because there can be more than one January in the time dimension if it contains data for more than one year.
Calculated member: A member of a dimension whose value is calculated at run time by using an expression. Calculated member values may be derived from other members' values. For example, a calculated member, Profit, can be determined by subtracting the value of the member, Costs, from the value of the member, Sales.
Dimension: A set of one or more organized hierarchies of levels in a cube that a user understands and uses as the base for data analysis. For example, a geography dimension might include levels for Country/Region, State/Province, and City. Or, a time dimension might include a hierarchy with levels for year, quarter, month, and day. In a PivotTable report or PivotChart report, each hierarchy becomes a set of fields that you can expand and collapse to reveal lower or higher levels.
Hierarchy: A logical tree structure that organizes the members of a dimension such that each member has one parent member and zero or more child members. A child is a member in the next lower level in a hierarchy that is directly related to the current member. For example, in a Time hierarchy containing the levels Quarter, Month, and Day, January is a child of Qtr1. A parent is a member in the next higher level in a hierarchy that is directly related to the current member. The parent value is usually a consolidation of the values of all of its children. For example, in a Time hierarchy that contains the levels Quarter, Month, and Day, Qtr1 is the parent of January.
Level: Within a hierarchy, data can be organized into lower and higher levels of detail, such as Year, Quarter, Month, and Day levels in a Time hierarchy.

A cube can be considered a multi-dimensional generalization of a two- or three-dimensional spreadsheet. For


example, a company might wish to summarize financial data by product, by time-period, and by city to compare
actual and budget expenses. Product, time, city and scenario (actual and budget) are the data's dimensions.

Cube is a shortcut for multidimensional dataset, given that data can have an arbitrary number of dimensions. The
term hypercube is sometimes used, especially for data with more than three dimensions. Slicer is a term for a

dimension which is held constant for all cells so that multidimensional information can be shown in a two dimensional

physical space of a spreadsheet or pivot table. Each cell of the cube holds a number that represents some measure
of the business, such as sales, profits, expenses, budget and forecast.

OLAP data is typically stored in a star schema or snowflake schema in a relational data warehouse or in a special-

purpose data management system. Measures are derived from the records in the fact table and dimensions are
derived from the dimension tables.
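A pivot table over a small fact table gives the flavor of slicing such a cube. The sketch below assumes pandas and uses a hypothetical fact table in which amount is the measure and region, product, and quarter are the dimensions:

```python
import pandas as pd

# A tiny fact table (hypothetical): "amount" is the measure; region,
# product and quarter play the role of dimensions.
facts = pd.DataFrame({
    "region":  ["East", "East", "West", "West", "East", "West"],
    "product": ["A", "B", "A", "B", "A", "B"],
    "quarter": ["Q1", "Q1", "Q1", "Q2", "Q2", "Q2"],
    "amount":  [100, 80, 120, 90, 110, 95],
})

# A pivot table aggregates the measure over two dimensions, much like
# viewing one face of an OLAP cube; margins=True adds the roll-up totals.
cube_slice = facts.pivot_table(values="amount", index="region",
                               columns="quarter", aggfunc="sum", margins=True)
print(cube_slice)
```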

Stemming is the term used in linguistic morphology and information retrieval to describe the process of reducing inflected (or sometimes derived) words to their word stem, base, or root form, generally a written word form. The stem need not be identical to the morphological root of the word; it is usually sufficient that related words map to the same stem, even if this stem is not in itself a valid root. Algorithms for stemming have been studied in computer science since the 1960s. Many search engines treat words with the same stem as synonyms, a kind of query expansion, in a process called conflation.

Stemming programs are commonly referred to as stemming algorithms or stemmers. A stemmer for English, for

example, should identify the string "cats" (and possibly "catlike", "catty" etc.) as based on the root "cat", and

"stemmer", "stemming", "stemmed" as based on "stem". A stemming algorithm reduces the words "fishing", "fished",
and "fisher" to the root word, "fish". On the other hand, "argue", "argued", "argues", "arguing", and "argus" reduce to
the stem "argu" (illustrating the case where the stem is not itself a word or root) but "argument" and "arguments"
reduce to the stem "argument".
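As a small illustration, assuming the NLTK library is installed, the Porter stemmer reproduces several of the reductions described above:

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ["fishing", "fished", "cats", "argue", "argued", "arguing"]

# The Porter stemmer strips common suffixes: "fishing" and "fished" map to
# "fish", while "argue"/"argued"/"arguing" map to "argu", a stem that is
# not itself a dictionary word.
print({w: stemmer.stem(w) for w in words})
```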

A clickstream is the recording of the parts of the screen a computer user clicks on while web browsing or using
another software application. As the user clicks anywhere in the webpage or application, the action is logged on a
client or inside the web server, as well as possibly the web browser, router, proxy server or ad server. Clickstream
analysis is useful for web activity analysis,[1] software testing, market research, and for analyzing employee

productivity. Initial clickstream or click path data had to be gleaned from server log files. Because human and
machine traffic were not differentiated, the study of human clicks took a substantial effort. Subsequently, JavaScript technologies were developed which use a tracking cookie to generate a series of signals from browsers. In other words, information is then collected only from "real humans" clicking on sites through browsers.

A clickstream is a series of page requests, every page requested generates a signal. These signals can be

graphically represented for clickstream reporting. The main point of clickstream tracking is to give webmasters insight
into what visitors on their site are doing. This data itself is "neutral" in the sense that any dataset is neutral. The data

can be used in various scenarios, one of which is marketing. Additionally, any webmaster, researcher, blogger or

person with a website can learn about how to improve their site. Use of clickstream data can raise privacy concerns,
especially since some Internet service providers have resorted to selling users' clickstream data as a way to enhance

revenue. There are 10-12 companies that purchase this data, typically for about $0.40/month per user.[3] While this
practice may not directly identify individual users, it is often possible to indirectly identify specific users, an example

being the AOL search data scandal. Most consumers are unaware of this practice, and its potential for compromising
their privacy. In addition, few ISPs publicly admit to this practice.

Moving average: In statistics, a moving average (rolling average or running average) is a calculation to analyze data points by creating a series of averages of different subsets of the full data set. It is also called a moving mean (MM)[1] or rolling mean and is a type of finite impulse response filter. Variations include simple, cumulative, and weighted forms.

Given a series of numbers and a fixed subset size, the first element of the moving average is obtained by taking the

average of the initial fixed subset of the number series. Then the subset is modified by "shifting forward"; that is,
excluding the first number of the series and including the next number following the original subset in the series. This

creates a new subset of numbers, which is averaged. This process is repeated over the entire data series. The plot

line connecting all the (fixed) averages is the moving average. A moving average is a set of numbers, each of which is the average of the corresponding subset of a larger set of data points. A moving average may also use unequal weights for each value in the subset to emphasize particular values.
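This "shift forward and average" procedure is equivalent to convolving the series with an equal-weight kernel, as in the following sketch (assuming NumPy, with hypothetical prices):

```python
import numpy as np

def simple_moving_average(values, window):
    """Average each consecutive `window`-sized subset of `values`."""
    values = np.asarray(values, dtype=float)
    kernel = np.ones(window) / window          # equal weights
    return np.convolve(values, kernel, mode="valid")

prices = [10, 12, 13, 12, 15, 16, 18, 17]
print(simple_moving_average(prices, window=3))
# approximately [11.67 12.33 13.33 14.33 16.33 17.]
```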

A moving average is commonly used with time series data to smooth out short-term fluctuations and highlight longer-
term trends or cycles. The threshold between short-term and long-term depends on the application, and the
parameters of the moving average will be set accordingly. For example, it is often used in technical analysis of
financial data, like stock prices, returns or trading volumes. It is also used in economics to examine gross domestic
product, employment or other macroeconomic time series. Mathematically, a moving average is a type of convolution

and so it can be viewed as an example of a low-pass filter used in signal processing. When used with non-time series
data, a moving average filters higher frequency components without any specific connection to time, although
typically some kind of ordering is implied. Viewed simplistically it can be regarded as smoothing the data.

Euclidean distance: In mathematics, the Euclidean distance or Euclidean metric is the "ordinary" (i.e., straight-line) distance between two points in Euclidean space. With this distance, Euclidean space becomes a metric space. The associated norm is called the Euclidean norm. Older literature refers to the metric as the Pythagorean metric. Very often, especially when measuring distance in the plane, we use the formula for the Euclidean distance.

According to the Euclidean distance formula, the distance between two points in the plane with coordinates (x, y) and

(a, b) is given by

dist((x, y), (a, b)) = sqrt((x - a)^2 + (y - b)^2)
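A direct translation of this formula into code, generalized to any number of coordinates (plain Python, no external libraries):

```python
import math

def euclidean_distance(p, q):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

print(euclidean_distance((1, 2), (4, 6)))   # 5.0 (a 3-4-5 right triangle)
# math.dist((1, 2), (4, 6)) gives the same result on Python 3.8+
```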

Clustering is an example of unsupervised learning. The clustering algorithm finds groups within the data without
being told what to look for upfront. This contrasts with classification, an example of supervised machine learning,
which is the process of determining to which class an observation belongs. A common application of classification is
spam filtering. With spam filtering we use labeled data to train the classifier: e-mails marked as spam or ham.
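A minimal sketch of unsupervised grouping, assuming scikit-learn, where k-means is given unlabeled points and discovers the two natural clusters on its own:

```python
import numpy as np
from sklearn.cluster import KMeans

# Two obvious blobs of points; no labels are supplied (hypothetical data).
points = np.array([[1.0, 1.1], [0.9, 1.0], [1.2, 0.8],
                   [8.0, 8.2], [7.9, 8.1], [8.3, 7.9]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(kmeans.labels_)            # group membership discovered from the data
print(kmeans.cluster_centers_)   # the two natural group centres
```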
Logistic regression
Logistic regression is a model used for predicting the probability of occurrence of an event. It makes use of several predictor variables that may be either numerical or categorical.
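A minimal sketch, assuming scikit-learn and hypothetical study-hours data, in which the fitted model returns the probability of the event rather than just a class label:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: hours of study (numeric predictor) vs. pass/fail outcome.
hours  = np.array([[0.5], [1.0], [1.5], [2.0], [2.5], [3.0], [3.5], [4.0]])
passed = np.array([0, 0, 0, 0, 1, 1, 1, 1])

model = LogisticRegression().fit(hours, passed)
print(model.predict_proba([[2.2]])[:, 1])   # estimated probability of passing
```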

Support Vector Machines


As with naive Bayes, Support Vector Machines (or SVMs) can be used to solve the task of assigning objects to classes. But the way this task is solved is completely different from the approach taken in naive Bayes.

Neural Network: Neural Networks are a means for classifying multidimensional objects.

Hidden Markov Models: Hidden Markov Models are used in multiple areas of machine learning, such as speech
recognition, handwritten letter recognition, or natural language processing.

Bayes' theorem finds the actual probability of an event from the results of your tests. For example, you can:
Correct for measurement errors - if you know the real probabilities and the chance of a false positive and a false negative, you can correct for measurement errors.
Relate the actual probability to the measured test probability.
Bayes' theorem lets you relate Pr(A|X), the chance that an event A happened given the indicator X, and Pr(X|A), the

chance the indicator X happened given that event A occurred. Given mammogram test results and known error rates,
you can predict the actual chance of having cancer.
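A worked version of this mammogram-style calculation, using illustrative (not real clinical) rates, shows how the measured test result maps back to the actual probability:

```python
# Worked example with hypothetical rates (not real clinical figures):
# prevalence 1%, sensitivity 80%, false-positive rate 9.6%.
p_cancer      = 0.01
p_pos_cancer  = 0.80                 # Pr(X|A): positive test given cancer
p_pos_healthy = 0.096                # false-positive rate

p_pos = p_pos_cancer * p_cancer + p_pos_healthy * (1 - p_cancer)
p_cancer_pos = p_pos_cancer * p_cancer / p_pos    # Pr(A|X) by Bayes' theorem

print(round(p_cancer_pos, 3))        # about 0.078 -- roughly 8%, not 80%
```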

Regression is a tool that companies may use for things such as sales forecasts or forecasting manufacturing defects. Another creative example is predicting the probability of celebrity divorce.

Classification is the process of using several inputs to produce one or more outputs. For example the input might be
the income, education and current debt of a customer. The output might be a risk class, such as "good", "acceptable",

"average", or "unacceptable". Contrast this to regression where the output is a number, not a class.

Developing a machine learning application typically involves the following steps:


1 Collect data. You could collect the samples by scraping a website and extracting data, or you could get
information from an RSS feed or an API. You could have a device collect wind speed measurements and send them

to you, or blood glucose levels, or anything you can measure. The number of options is endless. To save some time
and effort, you could use publicly available data.
2 Prepare the input data. Once you have this data, you need to make sure it's in a usable format, for example a Python list or array. The benefit of having a standard format is that you can mix and match algorithms and data sources. You may need to do some algorithm-specific formatting here. Some algorithms need features in a special format, some algorithms can deal with target variables and features as strings, and some need them to be integers. The algorithm-specific formatting is usually trivial compared with collecting the data.
3 Analyze the input data. This is looking at the data from the previous task. This could be as simple as looking at
the data you've parsed in a text editor to make sure steps 1 and 2 are actually working and you don't have a bunch of

empty values. You can also look at the data to see if you can recognize any patterns or if there's anything obvious,
such as a few data points that are vastly different from the rest of the set. Plotting data in one, two, or three

dimensions can also help. But most of the time you'll have more than three features, and you can't easily plot the data across all features at one time. You could, however, use dimensionality-reduction methods to distill multiple dimensions down to two or three so you can visualize the data.
4 Review the prepared data. If you're working with a production system and you know what the data should look like, or you trust its source, you can skip this step. This step takes human involvement, and for an automated system you don't want human involvement. The value of this step is that it confirms you don't have garbage coming in.
5 Train the algorithm. This is where the machine learning takes place. This step and the next are where the "core" algorithms lie, depending on the algorithm. You feed the algorithm good, clean data from the first two steps and extract knowledge or information. This knowledge is often stored in a format that's readily usable by a machine for the next two steps. In the case of unsupervised learning, there's no training step because you don't have a target value; everything is used in the next step.
6 Test the algorithm. This is where the information learned in the previous step is put to use. When you're evaluating an algorithm, you'll test it to see how well it does. In the case of supervised learning, you have some known values you can use to evaluate the algorithm. In unsupervised learning, you may have to use some other metrics to evaluate the success. In either case, if you're not satisfied, you can go back to step 5, change some things, and try testing again. Often the collection or preparation of the data may have been the problem, and you'll have to go back to step 1.
7 Use it. Here you make a real program to do some task, and once again you see if all the previous steps worked as you expected. You might encounter some new data and have to revisit the earlier steps. A minimal end-to-end sketch of the train, test, and use steps is shown after this list.
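The following sketch walks through the train, test, and use steps, assuming scikit-learn and its bundled iris data in place of data you would collect and prepare yourself:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Steps 1-2: collect and prepare (a bundled dataset stands in for scraping/APIs).
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Step 5: train the algorithm on the prepared data.
model = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)

# Step 6: test it on data held back from training.
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))

# Step 7: use it on a new, unlabeled measurement.
print("prediction:", model.predict([[5.9, 3.0, 5.1, 1.8]]))
```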

k-Nearest Neighbors :
Pros: High accuracy, insensitive to outliers, no assumptions about data

Cons: Computationally expensive, requires a lot of memory


Works with: Numeric values, nominal values

Naive Bayes
Pros: Works with a small amount of data, handles multiple classes
Cons: Sensitive to how the input data is prepared

Works with: Nominal values

Collaborative filtering: One approach to the design of recommender systems that has seen wide use is
collaborative filtering. Collaborative filtering methods are based on collecting and analyzing a large amount of

information on users' behaviors, activities or preferences and predicting what users will like based on their similarity to
other users. A key advantage of the collaborative filtering approach is that it does not rely on machine analyzable

content and therefore it is capable of accurately recommending complex items such as movies without requiring an

"understanding" of the item itself. Many algorithms have been used in measuring user similarity or item similarity in
recommender systems. For example, the k-nearest neighbor (k-NN) approach and the Pearson Correlation
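A minimal sketch of user similarity via the Pearson correlation, assuming NumPy and hypothetical movie ratings; in a real recommender, items liked by the most similar users would then be suggested:

```python
import numpy as np

def pearson_similarity(a, b):
    """Pearson correlation between two users' rating vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.corrcoef(a, b)[0, 1])

# Hypothetical ratings for the same five movies by three users.
alice = [5, 4, 1, 1, 3]
bob   = [4, 5, 2, 1, 3]
carol = [1, 2, 5, 5, 2]

print(pearson_similarity(alice, bob))    # close to +1: similar tastes
print(pearson_similarity(alice, carol))  # strongly negative: opposite tastes
```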

Supervised learning is fairly common in classification problems because the goal is often to get the computer to
learn a classification system that we have created. Digit recognition, once again, is a common example of
classification learning. More generally, classification learning is appropriate for any problem where deducing a
classification is useful and the classification is easy to determine. In some cases, it might not even be necessary to give predetermined classifications to every instance of a problem if the agent can work out the classifications for itself. This would be an example of unsupervised learning in a classification context.

Please check other Material provided by www.HadoopExam.com

Data Science certification requires good, in-depth knowledge of statistics as well as Big Data and Hadoop. It also requires good knowledge of the main phases of the Data Analytics Lifecycle, analyzing and exploring data with R, statistics for model building and evaluation, the theory and methods of advanced analytics and statistical modeling, the technology and tools that can be used for advanced analytics, operationalizing an analytics project, and data visualization techniques. Successful candidates will achieve the EMC Proven Professional Data Science Associate credential. Hence, clearing the real exam requires very thorough preparation. HadoopExam Learning Resources therefore brings you a Data Science Certification Simulator with 234 practice questions, which can help you prepare for this exam in less time. Practice, practice, practice! The EMC:DS E20-007 Exam Simulator offers you the

opportunity to take 4 sample Exams before heading out for the real thing. Be ready to succeed on
exam day!

Upcoming Releases

1. Apache Spark Training


2. Apache Spark Certification
3. MongoDB Certification Material

Email: admin@hadoopexam.com, hadoopexam@gmail.com

Phone: 022-42669636, Mobile: +91-8879712614 (HadoopExam Learning Resources)

