UNIT6-Latent Variable Model, Decision Tree & ARIMA MODEL

Time Series Analysis


Time series analysis is a statistical technique that deals with time series data, or trend
analysis. Time series data means that data is in a series of particular time periods or
intervals. The data is generally considered to be of three types:

 Time series data: A set of observations on the values that a variable takes at different
times.
 Cross-sectional data: Data of one or more variables, collected at the same point in time.
 Pooled data: A combination of time series data and cross-sectional data.
Terms and concepts:
 Dependence: Dependence refers to the association of two observations with the same
variable, at prior time points.
 Stationarity: Shows the mean value of the series that remains constant over a time
period; if past effects accumulate and the values increase toward infinity, then
stationarity is not met.
 Differencing: Used to make the series stationary, to De-trend, and to control the auto-
correlations; however, some time series analyses do not require differencing and over-
differenced series can produce inaccurate estimates.
 Specification: May involve the testing of the linear or non-linear relationships of
dependent variables by using models such as ARIMA, ARCH, GARCH, VAR, Co-
integration, etc.
 Exponential smoothing in time series analysis: This method predicts the next period's value based on the past and current values. It involves averaging the data such that the nonsystematic components of the individual observations cancel each other out. The exponential smoothing method is used for short-term prediction. Alpha, Gamma, Phi, and Delta are the parameters that estimate the effect of the time series data: Alpha is used when seasonality is not present in the data, Gamma is used when the series has a trend, and Delta is used when seasonality cycles are present. A model is applied according to the pattern of the data (a sketch is given after this list).
 Curve fitting in time series analysis: Curve fitting regression is used when the data follow a non-linear relationship, with the dependent variable modeled as a non-linear function of the sequential case number.
 Curve fitting can be performed by selecting "Regression" from the analysis menu and then selecting "Curve Estimation" from the regression options. Then select the wanted curve: linear, power, quadratic, cubic, inverse, logistic, exponential, or other.
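
A minimal sketch of how such a smoothing model might be fitted in Python, assuming the statsmodels library is available (argument names such as smoothing_trend can differ slightly between statsmodels versions) and using hypothetical Alpha/Gamma/Delta values:

import pandas as pd
from statsmodels.tsa.holtwinters import ExponentialSmoothing

# hypothetical monthly series with a trend and yearly seasonality
data = pd.Series(range(1, 49),
                 index=pd.date_range("2020-01-01", periods=48, freq="MS"))

model = ExponentialSmoothing(data, trend="add", seasonal="add", seasonal_periods=12)
fit = model.fit(smoothing_level=0.4,       # Alpha: level
                smoothing_trend=0.2,       # Gamma: trend
                smoothing_seasonal=0.3)    # Delta: seasonality

print(fit.forecast(3))                     # predict the next few periods
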
ARIMA:
ARIMA stands for autoregressive integrated moving average. This method is also known as the
Box-Jenkins method.
Identification of ARIMA parameters:

 Autoregressive component: AR stands for autoregressive. The autoregressive parameter is denoted by p. When p = 0, it means that there is no auto-correlation in the series. When p = 1, it means that the series has auto-correlation up to one lag.
 Integrated: In ARIMA time series analysis, integrated is denoted by d. Integration is the
inverse of differencing. When d=0, it means the series is stationary and we do not need
to take the difference of it. When d=1, it means that the series is not stationary and to
make it stationary, we need to take the first difference. When d=2, it means that the
series has been differenced twice. Usually, differencing more than twice is not reliable.
 Moving average component: MA stands for moving average, which is denoted by q. In ARIMA, a moving average with q = 1 means that the model includes one lagged error term, i.e. there is auto-correlation in the errors at one lag.
 In order to test whether or not the series and its error term are auto-correlated, we usually use the Durbin-Watson (W-D) test, the ACF, and the PACF.
 Decomposition: Refers to separating a time series into trend, seasonal effects, and remaining variability.

Assumptions:
 Stationarity: The first assumption is that the series are stationary. Essentially, this
means that the series are normally distributed and the mean and variance are constant
over a long time period.
 Uncorrelated random error: We assume that the error term is randomly distributed and that its mean and variance are constant over a time period. The Durbin-Watson test is the standard test for correlated errors (a sketch of these checks is given after this list).
 No outliers: We assume that there is no outlier in the series. Outliers may affect
conclusions strongly and can be misleading.
 Random shocks (a random error component): If shocks are present, they are assumed
to be randomly distributed with a mean of 0 and a constant variance.
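
As an illustration (a minimal sketch, assuming the numpy and statsmodels packages), the stationarity and uncorrelated-error assumptions above are commonly checked with the Augmented Dickey-Fuller (ADF) test and the Durbin-Watson statistic:

import numpy as np
from statsmodels.tsa.stattools import adfuller
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(0)
series = np.cumsum(rng.normal(size=200))      # a random walk: non-stationary

adf_stat, p_value, *_ = adfuller(series)
print(f"ADF p-value: {p_value:.3f}")          # large p-value -> cannot reject non-stationarity

residuals = rng.normal(size=200)              # hypothetical model residuals
print(f"Durbin-Watson: {durbin_watson(residuals):.2f}")  # close to 2 means no autocorrelation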

ARIMA MODEL
ARIMA stands for Autoregressive Integrated Moving Average models. Univariate (single vector)
ARIMA is a forecasting technique that projects the future values of a series based entirely on its
own inertia. Its main application is in the area of short term forecasting requiring at least 40
historical data points. It works best when your data exhibits a stable or consistent pattern
over time with a minimum amount of outliers. Sometimes called Box-Jenkins (after the original
authors), ARIMA is usually superior to exponential smoothing techniques when the data is
reasonably long and the correlation between past observations is stable. If the data is short or
highly volatile, then some smoothing method may perform better. If you do not have at least
38 data points, you should consider some other method than ARIMA.

Autoregressive Integrated Moving Average Model

 An ARIMA model is a class of statistical models for analyzing and forecasting time series
data.
 It explicitly caters to a suite of standard structures in time series data, and as such
provides a simple yet powerful method for making skillful time series forecasts.

 ARIMA is an acronym that stands for AutoRegressive Integrated Moving Average. It is a generalization of the simpler AutoRegressive Moving Average model and adds the notion of integration.

This acronym is descriptive, capturing the key aspects of the model itself. Briefly, they are:

 AR: Autoregression. A model that uses the dependent relationship between an observation and some number of lagged observations.
 I: Integrated. The use of differencing of raw observations (e.g. subtracting an
observation from an observation at the previous time step) in order to make the time
series stationary.
 MA: Moving Average. A model that uses the dependency between an observation and a
residual error from a moving average model applied to lagged observations.
 Each of these components are explicitly specified in the model as a parameter. A
standard notation is used of ARIMA(p,d,q) where the parameters are substituted with
integer values to quickly indicate the specific ARIMA model being used.

 The parameters of the ARIMA model are defined as follows:

 p: The number of lag observations included in the model, also called the lag order.
 d: The number of times that the raw observations are differenced, also called the
degree of differencing.
 q: The size of the moving average window, also called the order of moving average.
 A linear regression model is constructed including the specified number and type of
terms, and the data is prepared by a degree of differencing in order to make it
stationary, i.e. to remove trend and seasonal structures that negatively affect the
regression model.

A value of 0 can be used for a parameter, which indicates to not use that element of the model.
This way, the ARIMA model can be configured to perform the function of an ARMA model, and
even a simple AR, I, or MA model.
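
The following sketch (assuming statsmodels 0.12 or later and a hypothetical simulated series) shows how the (p, d, q) order tuple selects the model, with zeros switching individual components off:

import numpy as np
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(1)
y = np.cumsum(rng.normal(size=300))        # hypothetical non-stationary series

ar1   = ARIMA(y, order=(1, 0, 0)).fit()    # pure AR(1) model
ma1   = ARIMA(y, order=(0, 0, 1)).fit()    # pure MA(1) model
arma  = ARIMA(y, order=(1, 0, 1)).fit()    # ARMA(1,1): no differencing
arima = ARIMA(y, order=(1, 1, 1)).fit()    # ARIMA(1,1,1): first-differenced

print(arima.summary())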

Adopting an ARIMA model for a time series assumes that the underlying process that generated
the observations is an ARIMA process. This may seem obvious, but helps to motivate the need
to confirm the assumptions of the model in the raw observations and in the residual errors of
forecasts from the model.

Basic Concepts:

The first step in applying ARIMA methodology is to check for stationarity. "Stationarity" implies
that the series remains at a fairly constant level over time. If a trend exists, as in most economic
or business applications, then your data is NOT stationary. The data should also show a
constant variance in its fluctuations over time. This is easily seen with a series that is heavily
seasonal and growing at a faster rate. In such a case, the ups and downs in the seasonality will
become more dramatic over time. Without these stationarity conditions being met, many of
the calculations associated with the process cannot be computed.

Differencing:

If a graphical plot of the data indicates nonstationarity, then you should "difference" the series.
Differencing is an excellent way of transforming a nonstationary series to a stationary one. This
is done by subtracting the previous period's observation from the current one. If this transformation is done only once to a series, you say that the data has been "first differenced".
This process essentially eliminates the trend if your series is growing at a fairly constant rate. If
it is growing at an increasing rate, you can apply the same procedure and difference the data
again. Your data would then be "second differenced".
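
A minimal sketch of first and second differencing, assuming pandas and a small hypothetical series:

import pandas as pd

y = pd.Series([100, 104, 109, 115, 122, 130])   # hypothetical trending series

first_diff = y.diff().dropna()           # y(t) - y(t-1): removes a constant-rate trend
second_diff = y.diff().diff().dropna()   # difference again if growth is accelerating

print(first_diff.tolist())    # [4.0, 5.0, 6.0, 7.0, 8.0]
print(second_diff.tolist())   # [1.0, 1.0, 1.0, 1.0]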

Autocorrelations:

"Autocorrelations" are numerical values that indicate how a data series is related to itself over
time. More precisely, it measures how strongly data values at a specified number of periods
apart are correlated to each other over time. The number of periods apart is usually called the
"lag". For example, an autocorrelation at lag 1 measures how values 1 period apart are
correlated to one another throughout the series. An autocorrelation at lag 2 measures how the
data two periods apart are correlated throughout the series. Autocorrelations may range from
+1 to -1. A value close to +1 indicates a high positive correlation while a value close to -1 implies
a high negative correlation. These measures are most often evaluated through graphical plots called "correlograms". A correlogram plots the autocorrelation values for a given series at different lags. This is referred to as the "autocorrelation function" and is very important in the ARIMA method.
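
A minimal sketch of computing and plotting the autocorrelation function, assuming numpy, statsmodels, and matplotlib, on a hypothetical autocorrelated series:

import numpy as np
import matplotlib.pyplot as plt
from statsmodels.tsa.stattools import acf
from statsmodels.graphics.tsaplots import plot_acf

rng = np.random.default_rng(2)
y = np.zeros(200)
for t in range(1, 200):                 # hypothetical series related to its previous value
    y[t] = 0.7 * y[t - 1] + rng.normal()

print(acf(y, nlags=5))                  # numerical autocorrelations at lags 0..5
plot_acf(y, lags=20)                    # graphical correlogram
plt.show()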

Autoregressive Models:

ARIMA methodology attempts to describe the movements in a stationary time series as a function of what are called "autoregressive" and "moving average" parameters. These are referred to as AR parameters (autoregressive) and MA parameters (moving averages). An AR model with only one parameter may be written as...

X(t) = A(1) * X(t-1) + E(t)

where X(t) = time series under investigation

A(1) = the autoregressive parameter of order 1

X(t-1) = the time series lagged 1 period

E(t) = the error term of the model

This simply means that any given value X(t) can be explained by some function of its previous
value, X(t-1), plus some unexplainable random error, E(t). If the estimated value of A(1) was .30,
then the current value of the series would be related to 30% of its value 1 period ago. Of
course, the series could be related to more than just one past value. For example,

X(t) = A(1) * X(t-1) + A(2) * X(t-2) + E(t)

This indicates that the current value of the series is a combination of the two immediately
preceding values, X(t-1) and X(t-2), plus some random error E(t). Our model is now an
autoregressive model of order 2.
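
A minimal sketch, assuming numpy, that simulates the first-order autoregressive equation above with a hypothetical A(1) = 0.30:

import numpy as np

rng = np.random.default_rng(3)
a1 = 0.30
x = np.zeros(100)
for t in range(1, 100):
    x[t] = a1 * x[t - 1] + rng.normal()    # X(t) = A(1) * X(t-1) + E(t)

# the lag-1 sample correlation should be close to A(1)
print(np.corrcoef(x[:-1], x[1:])[0, 1])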

Moving Average Models:

A second type of Box-Jenkins model is called a "moving average" model. Although these models
look very similar to the AR model, the concept behind them is quite different. Moving average
parameters relate what happens in period t only to the random errors that occurred in past
time periods, i.e. E(t-1), E(t-2), etc. rather than to X(t-1), X(t-2), (Xt-3) as in the autoregressive
approaches. A moving average model with one MA term may be written as follows...

X(t) = -B(1) * E(t-1) + E(t)

The term B(1) is called an MA of order 1. The negative sign in front of the parameter is used by convention only and is usually printed out automatically by most computer programs. The
above model simply says that any given value of X(t) is directly related only to the random error
in the previous period, E(t-1), and to the current error term, E(t). As in the case of
autoregressive models, the moving average models can be extended to higher order structures
covering different combinations and moving average lengths.
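
A minimal sketch, assuming numpy, of the MA(1) equation above with a hypothetical B(1) = 0.5; note that the series is related only to the current and previous errors:

import numpy as np

rng = np.random.default_rng(4)
b1 = 0.5
e = rng.normal(size=100)            # random errors E(t)
x = np.empty(100)
x[0] = e[0]
for t in range(1, 100):
    x[t] = -b1 * e[t - 1] + e[t]    # X(t) = -B(1) * E(t-1) + E(t)

print(np.corrcoef(x[:-1], x[1:])[0, 1])   # lag-1 correlation: clearly non-zero
print(np.corrcoef(x[:-2], x[2:])[0, 1])   # lag-2 correlation: near zero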

Mixed Models:

ARIMA methodology also allows models to be built that incorporate both autoregressive and
moving average parameters together. These models are often referred to as "mixed models".
Although this makes for a more complicated forecasting tool, the structure may indeed
simulate the series better and produce a more accurate forecast. Pure models imply that the
structure consists only of AR or MA parameters - not both.

The models developed by this approach are usually called ARIMA models because they use a
combination of autoregressive (AR), integration (I) - referring to the reverse process of
differencing to produce the forecast, and moving average (MA) operations. An ARIMA model is
usually stated as ARIMA(p,d,q). This represents the order of the autoregressive components (p),
the number of differencing operators (d), and the highest order of the moving average term (q).
For example, ARIMA(2,1,1) means that you have a second order autoregressive model with a
first order moving average component whose series has been differenced once to induce
stationarity.
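
A minimal sketch, assuming numpy, pandas, and statsmodels, of fitting the ARIMA(2,1,1) model described above to a hypothetical simulated series and producing a short-term forecast:

import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(5)
# hypothetical trending series of 120 monthly observations
y = pd.Series(np.cumsum(rng.normal(loc=0.5, size=120)),
              index=pd.date_range("2015-01-01", periods=120, freq="MS"))

model = ARIMA(y, order=(2, 1, 1))    # AR order 2, differenced once, MA order 1
result = model.fit()

print(result.params)                 # estimated AR, MA, and noise parameters
print(result.forecast(steps=6))      # forecast the next six periods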

The first time series is taken from [8] and represents the weekly exchange rate between the British pound and the US dollar from 1980 to 1993. The second one is a seasonal time series, considered in [3, 6, 11], and shows the number of international airline passengers (in thousands) from Jan. 1949 to Dec. 1960 on a monthly basis.

Decision Trees (DTs) are a non-parametric supervised learning method used for classification and regression. The goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features.

For instance, in the example below, decision trees learn from data to approximate a sine curve
with a set of if-then-else decision rules. The deeper the tree, the more complex the decision
rules and the fitter the model.
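
A minimal sketch of this idea, assuming scikit-learn, numpy, and matplotlib: a shallow and a deeper regression tree are fitted to noisy samples of a sine curve, and the deeper tree produces finer if-then-else rules:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(6)
X = np.sort(5 * rng.random((80, 1)), axis=0)
y = np.sin(X).ravel()
y[::5] += 0.5 * (rng.random(16) - 0.5)                   # add some noise

shallow = DecisionTreeRegressor(max_depth=2).fit(X, y)   # few, coarse rules
deep = DecisionTreeRegressor(max_depth=5).fit(X, y)      # more, finer rules

X_test = np.arange(0.0, 5.0, 0.01).reshape(-1, 1)
plt.scatter(X, y, s=15, label="data")
plt.plot(X_test, shallow.predict(X_test), label="max_depth=2")
plt.plot(X_test, deep.predict(X_test), label="max_depth=5")
plt.legend()
plt.show()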

Important Terminology related to Decision Trees

Let’s look at the basic terminology used with Decision trees:

1. Root Node: It represents the entire population or sample, which further gets divided into two or more homogeneous sets.
2. Splitting: The process of dividing a node into two or more sub-nodes.
3. Decision Node: When a sub-node splits into further sub-nodes, it is called a decision node.
4. Leaf/Terminal Node: Nodes that do not split are called leaf or terminal nodes.
5. Pruning: When we remove sub-nodes of a decision node, the process is called pruning. It can be seen as the opposite of splitting.
6. Branch/Sub-Tree: A subsection of the entire tree is called a branch or sub-tree.
7. Parent and Child Node: A node that is divided into sub-nodes is called the parent node of those sub-nodes, whereas the sub-nodes are the children of the parent node.

These are the terms commonly used for decision trees. Since every algorithm has advantages and disadvantages, the important ones to know are listed below.

Advantages

1. Easy to Understand: Decision tree output is very easy to understand, even for people from a non-analytical background. It does not require any statistical knowledge to read and interpret. Its graphical representation is very intuitive, and users can easily relate it to their hypotheses.
2. Useful in Data exploration: A decision tree is one of the fastest ways to identify the most significant variables and the relations between two or more variables. With the help of decision trees, we can create new variables/features that have better power to predict the target variable. It can also be used in the data exploration stage. For example, if we are working on a problem where information is available in hundreds of variables, a decision tree will help to identify the most significant ones.
3. Less data cleaning required: It requires less data cleaning compared to some other modeling techniques, being fairly robust to outliers and missing values.
4. Data type is not a constraint: It can handle both numerical and categorical variables.
5. Non Parametric Method: A decision tree is considered a non-parametric method. This means that decision trees make no assumptions about the space distribution or the classifier structure.

Disadvantages

1. Over fitting: Over fitting is one of the most practical difficulties for decision tree models. This problem is addressed by setting constraints on model parameters and by pruning (discussed in detail below).
2. Not fit for continuous variables: While working with continuous numerical variables, a decision tree loses information when it categorizes them into discrete categories.

3. Decision-tree learners can create over-complex trees that do not generalise the data well. This is called overfitting. Mechanisms such as pruning, setting the minimum number of samples required at a leaf node, or setting the maximum depth of the tree are necessary to avoid this problem (see the sketch after this list).
4. Decision trees can be unstable because small variations in the data might result in a completely different tree being generated. This problem is mitigated by using decision trees within an ensemble.
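
A minimal sketch, assuming scikit-learn, of constraining tree growth (maximum depth and minimum samples per leaf) to limit overfitting as described in point 3 above:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

unconstrained = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
constrained = DecisionTreeClassifier(max_depth=3,          # cap the tree depth
                                     min_samples_leaf=5,   # require 5 samples per leaf
                                     random_state=0).fit(X_train, y_train)

print("unconstrained test accuracy:", unconstrained.score(X_test, y_test))
print("constrained test accuracy:  ", constrained.score(X_test, y_test))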

Representation of Decision Tree

There are several ways in which a decision tree can be represented. The Decision Tree Analysis
is commonly represented by lines, squares and circles. The squares represent decisions, the
lines represent consequences and the circles represent uncertain outcomes. By keeping the
lines as far apart as possible, there will be plenty of space to add new considerations and ideas.

The representation of the decision tree can be created in four steps:

1. Describe the decision that needs to be made in the square.
2. Draw various lines from the square and write possible solutions on each of the lines.
3. Put the outcome of the solution at the end of the line. Uncertain or unclear decisions are put in a circle. When a solution leads to a new decision, the latter can be put in a new square.
4. Each of the squares and circles is reviewed critically so that a final choice can be made.

Practical example

Suppose a commercial company wishes to increase its sales and the associated profits in the
next year.

The different alternatives can then be mapped out by using a decision tree. There are two choices for increasing both sales and profits: (1) expansion of advertising expenditure and (2) expansion of sales activities. This creates two branches. Two new choices arise from choice 1, namely 1-1, hiring a new advertising agency, and 1-2, using the services of the existing advertising agency. Choice 2 presents two follow-up choices in turn: 2-1, working with agents, or 2-2, using its own sales force.

The branching continues.



The following alternatives from 1-1 are:


1-1-1 The budget will increase by 10% -> end result: sales up 6%, profits up 2%
1-1-2 The budget will increase by 5% -> end result: sales up 4%, profits up 1.5 %

The alternatives that arise from 1-2 are:


1-2-1 The budget increases by 10% -> end result: sales up 5%, profits up 2.5%
1-2-2 The budget increases by 5% -> end result: sales up 4%, profits up 12%

From 2-1 possibly follows:


2-1-1 Set up with own dealers -> end result: sales up 20%, profits up 5%
2-1-2 Working with existing dealers -> end result: sales up 12.5%, profit up 8%

From 2-2 possibly follows:


2-2-1 Hiring of new sales staff -> end result: sales up 15%, profits up 5%
2-2-2 Motivating of existing sales staff -> end result: sales up 4%, profits up 2%.

The above example illustrates that, in all likelihood, the company will opt for 1-2-2, because the
forecast of this decision is that profits will increase by 12%.
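
A minimal sketch in plain Python that reproduces this comparison by selecting the alternative with the highest forecast profit increase (figures taken from the example above):

forecasts = {
    "1-1-1": {"sales": 6.0, "profit": 2.0},
    "1-1-2": {"sales": 4.0, "profit": 1.5},
    "1-2-1": {"sales": 5.0, "profit": 2.5},
    "1-2-2": {"sales": 4.0, "profit": 12.0},
    "2-1-1": {"sales": 20.0, "profit": 5.0},
    "2-1-2": {"sales": 12.5, "profit": 8.0},
    "2-2-1": {"sales": 15.0, "profit": 5.0},
    "2-2-2": {"sales": 4.0, "profit": 2.0},
}

best = max(forecasts, key=lambda k: forecasts[k]["profit"])
print(best, forecasts[best])   # "1-2-2", profits up 12%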

The Decision Tree Analysis is particularly useful in situations in which it is considered desirable to develop various decision alternatives in a structured manner, as this provides clear substantiation. The method is increasingly used by medical practitioners to make diagnoses and by technicians, for example, to determine car problems.

Structural Equation Modeling/Path Analysis

Path Analysis is the statistical technique used to examine causal relationships between two or
more variables. It is based upon a linear equation system and was first developed by Sewall
Wright in the 1930s for use in phylogenetic studies. Path Analysis was adopted by the social
sciences in the 1960s and has been used with increasing frequency in the ecological literature
since the 1970s. In ecological studies, path analysis is used mainly in the attempt to understand
comparative strengths of direct and indirect relationships among a set of variables. In this way,
path analysis differs from other linear equation models: in path analysis, mediated pathways (those acting through a mediating variable, i.e., “Y,” in the pathway X → Y → Z) can be examined.
Pathways in path models represent hypotheses of researchers, and can never be statistically
tested for directionality. Numerous articles deal with the use of path analysis in ecological
studies (See Shipley 1997, 1999 and Everitt and Dunn 1991 [Section 14.8] for discussion of
ecological applications and misuses of the technique).

Path analysis is a subset of Structural Equation Modeling (SEM), the multivariate procedure
that, as defined by Ullman (1996), “allows examination of a set of relationships between one or
more independent variables, either continuous or discrete, and one or more dependent
variables, either continuous or discrete.” SEM deals with measured and latent variables.

A measured variable is a variable that can be observed directly and is measurable. Measured
variables are also known as observed variables, indicators or manifest variables. A latent
variable is a variable that cannot be observed directly and must be inferred from measured
variables. Latent variables are implied by the covariances among two or more measured
variables. They are also known as factors (i.e., factor analysis), constructs or unobserved
variables. SEM is a combination of multiple regression and factor analysis. Path analysis deals
only with measured variables.

Components of a Structural Equation Model: Structural Equation Models are divided into two
parts: a measurement model and a structural model. The measurement model deals with the
relationships between measured variables and latent variables. The structural model deals with
the relationships between latent variables only. One of the advantages of SEM is that latent variables are free of random error, because the error has been estimated and removed, leaving only a common variance.

The parameters of a SEM are the variances, regression coefficients and covariances among
variables. A variance can be indicated by a two-headed arrow, both ends of which point at the
same variable, or, more simply by a number within the variable’s drawn box or circle.
Regression coefficients are represented along single-headed arrows that indicate a
hypothesized pathway between two variables (These are the weights applied to variables in
linear regression equations). Covariances are associated with double-headed, curved arrows
between two variables or error terms and indicate no directionality. The data for a SEM are the
sample variances and covariances taken from a population (held in S, the observed sample
variance and covariance matrix). A study by Bart and Earnst (1999) provides an example of the
ecological application of path analysis. In this example, the relative importance of male traits
and territory quality are examined in reference to number of females paired with each male.
The application uses territory quality, male quality, and male pairing success (# of
females/male) as the variables in question. Male pairing success is the dependent variable. An
indirect effect of male quality on territory quality (male quality affecting a male’s ability to
obtain territory of high quality) is seen to affect male
pairing success. A direct effect of male quality on pairing success is seen as well. The analysis is
not conducted in an SEM framework; however, it would be possible to envision male quality
and territory quality as latent variables affecting an observed variable, male pairing success. In
the study, comb size, tarsus length and wing chord are used as measures of male quality. These
variables could be viewed as the observed variables indicating the latent variable male quality.
Proportion of territory covered by dunes, willow density and territory size are used as measures
of territory quality. These could be viewed as the observed variables indicating the latent
variable territory quality.

Structural Equation Model Construction: The goal in building a path diagram or other structural equation model is to find a model that fits the data (S) well enough to serve as a useful representation of reality and a parsimonious explanation of the data. There are five steps involved in SEM construction:

1. Model Specification
2. Model Identification (some authors include this step under specification or estimation)
3. Model Estimation
4. Testing Model Fit
5. Model Manipulation

Model Specification is the exercise of formally stating a model. It is the step in which parameters are determined to be fixed or free. Fixed parameters
are not estimated from the data and are typically fixed at zero (indicating no relationship
between variables). The paths of fixed parameters are labeled numerically (unless assigned a
value of zero, in which case no path is drawn) in a SEM diagram. Free parameters are estimated
from the observed data and are believed by the investigator to be non-zero. Asterisks in the
SEM diagram label the paths of free parameters. Determining which parameters are fixed and
which are free in a SEM is extremely important because it determines which parameters will be
used to compare the hypothesized diagram with the sample population variance and
covariance matrix in testing the fit of the model (Step 4). The choice of which parameters are
free and which are fixed in a model is up to the researcher. This choice represents the
researcher’s a priori hypothesis about which pathways in a system are important in the
generation of the observed system’s relational structure (e.g., the observed sample variance
and covariance matrix).

Model Identification concerns whether a unique value for each and
every free parameter can be obtained from the observed data. It depends on the model choice
and the specification of fixed, constrained and free parameters. A parameter is constrained
when it is set equal to another parameter. Models need to be overidentified in order to be
estimated (Step 3 in SEM construction) and in order to test hypotheses about relationships
among variables (See Ullman 1996 for a more detailed explanation of the levels of model
identification). A necessary condition for overidentification is that the number of data points (the number of observed variances and covariances) is greater than the number of free parameters to be estimated in the model. A flow-chart schematic (extracted from Ullman 1996) summarizes the procedures of model identification.

Estimation: In this step, start values of the free parameters are chosen in order to generate an estimated population covariance matrix, Σ(θ), from the model. Start values can be chosen by the researcher from prior information, by the computer programs used to build SEMs, or from multiple regression analysis (see Ullman 1996 and Hoyle 1995 for more start value choices and further discussion). The goal of estimation is to produce a Σ(θ) that converges upon the observed population covariance matrix, S, with the residual matrix (the difference between Σ(θ) and S) being minimized. Various methods can be used to generate Σ(θ). Choice of method is guided by characteristics of the data, including sample size and distribution. Most processes used are iterative. The general form of the minimization function is:
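
A commonly quoted general form, sketched here following standard SEM treatments such as Ullman (1996) (the exact expression depends on the estimation method chosen), is

Q = (s − σ(θ))′ W (s − σ(θ))

where s is the vector of observed sample variances and covariances, σ(θ) is the vector of the corresponding elements of the estimated (model-implied) covariance matrix Σ(θ), and W is a weight matrix; different choices of W correspond to estimation methods such as unweighted least squares, generalized least squares, and maximum likelihood.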

Model Modification: If the covariance/variance matrix estimated by the model does not
adequately reproduce the sample covariance/variance matrix, hypotheses can be adjusted and
the model retested. To adjust a model, new pathways are added or original ones are removed.
In other words, parameters are changed from fixed to free or from free to fixed. It is important
to remember, as in other statistical procedures, that adjusting a model after initial testing
increases the chance of making a Type I error. The common procedures used for model
modification are the Lagrange Multiplier Index (LM) and the Wald test. Both of these tests
report the change in the χ² value when pathways are adjusted. The LM asks whether the addition of free parameters increases model fitness. This test uses the same logic as forward stepwise regression. The Wald test asks whether the deletion of free parameters increases model fitness. The Wald test follows the logic of backward stepwise regression. To adjust for the increased Type I error rate, Ullman (1996) recommends using a low probability value.

“Big Data Management”

What is Big Data?


The term Big Data refers to all the data that is being generated across the globe at an
unprecedented rate. This data could be either structured or unstructured. Today’s business
enterprises owe a huge part of their success to an economy that is firmly knowledge-oriented.
Data drives the modern organizations of the world and hence making sense of this data and
unravelling the various patterns and revealing unseen connections within the vast sea of data
becomes critical and a hugely rewarding endeavour indeed. There is a need to convert Big Data
into Business Intelligence that enterprises can readily deploy. Better data leads to better
decision making and an improved way to strategize for organizations regardless of their size,
geography, market share, customer segmentation and such other categorizations. Hadoop is
the platform of choice for working with extremely large volumes of data. The most successful
enterprises of tomorrow will be the ones that can make sense of all that data at extremely high
volumes and speeds in order to capture newer markets and customer base.
Big Data has certain characteristics and is therefore often defined using the four Vs, namely:

Volume: the amount of data that businesses can collect is really enormous and hence the
volume of the data becomes a critical factor in Big Data analytics.

Velocity: the rate at which new data is being generated, thanks to our dependence on the internet, sensors, and machine-to-machine data; this also makes it important to parse Big Data in a timely manner.
Variety: the data that is generated is completely heterogeneous in the sense that it could be in various formats like video, text, database, numeric, sensor data and so on, and hence understanding the type of Big Data is a key factor in unlocking its value.
Veracity: knowing whether the data that is available is coming from a credible source is of utmost importance before deciphering and implementing Big Data for business needs.
Here is a brief explanation of how exactly businesses are utilizing Big Data:
Once Big Data is converted into nuggets of information, things become pretty straightforward for most business enterprises, in the sense that they now know what their customers want, which products are fast moving, what users expect from customer service, how to speed up time to market, ways to reduce costs, and methods to build economies of scale in a highly efficient manner. Thus Big Data distinctly leads to big benefits for organizations, and hence there is naturally a huge amount of interest in it from all around the world.

Why Are Big Data Systems Different?

The basic requirements for working with big data are the same as the requirements for working
with datasets of any size. However, the massive scale, the speed of ingesting and processing,
and the characteristics of the data that must be dealt with at each stage of the process present
significant new challenges when designing solutions. The goal of most big data systems is to
surface insights and connections from large volumes of heterogeneous data that would not be
possible using conventional methods.

Volume

The sheer scale of the information processed helps define big data systems. These datasets can
be orders of magnitude larger than traditional datasets, which demands more thought at each
stage of the processing and storage life cycle.

Often, because the work requirements exceed the capabilities of a single computer, this
becomes a challenge of pooling, allocating, and coordinating resources from groups of
computers. Cluster management and algorithms capable of breaking tasks into smaller pieces
become increasingly important.

Velocity

Another way in which big data differs significantly from other data systems is the speed at which information moves through the system. Data is frequently flowing into the system from
multiple sources and is often expected to be processed in real time to gain insights and update
the current understanding of the system.

This focus on near instant feedback has driven many big data practitioners away from a batch-
oriented approach and closer to a real-time streaming system. Data is constantly being added,
massaged, processed, and analyzed in order to keep up with the influx of new information and
to surface valuable information early when it is most relevant. These ideas require robust
systems with highly available components to guard against failures along the data pipeline.

Variety

Big data problems are often unique because of the wide range of both the sources being
processed and their relative quality.

Data can be ingested from internal systems like application and server logs, from social media
feeds and other external APIs, from physical device sensors, and from other providers. Big data
seeks to handle potentially useful data regardless of where it's coming from by consolidating all
information into a single system.

The formats and types of media can vary significantly as well. Rich media like images, video
files, and audio recordings are ingested alongside text files, structured logs, etc. While more
traditional data processing systems might expect data to enter the pipeline already labeled,
formatted, and organized, big data systems usually accept and store data closer to its raw state.
Ideally, any transformations or changes to the raw data will happen in memory at the time of
processing.

Other Characteristics

Various individuals and organizations have suggested expanding the original three Vs, though
these proposals have tended to describe challenges rather than qualities of big data. Some
common additions are:

 Veracity: The variety of sources and the complexity of the processing can lead to
challenges in evaluating the quality of the data (and consequently, the quality of the
resulting analysis).
 Variability: Variation in the data leads to wide variation in quality. Additional resources
may be needed to identify, process, or filter low quality data to make it more useful.
 Value: The ultimate challenge of big data is delivering value. Sometimes, the systems
and processes in place are complex enough that using the data and extracting actual
value can become difficult.

What Does a Big Data Life Cycle Look Like?

So how is data actually processed when dealing with a big data system? While approaches to
implementation differ, there are some commonalities in the strategies and software that we
can talk about generally. While the steps presented below might not be true in all cases, they
are widely used.

The general categories of activities involved with big data processing are:

 Ingesting data into the system
 Persisting the data in storage
 Computing and analyzing data
 Visualizing the results

Before we look at these four workflow categories in detail, we will take a moment to talk
about clustered computing, an important strategy employed by most big data solutions. Setting
up a computing cluster is often the foundation for technology used in each of the life cycle
stages.

Clustered Computing

Because of the qualities of big data, individual computers are often inadequate for handling the
data at most stages. To better address the high storage and computational needs of big data,
computer clusters are a better fit.

Big data clustering software combines the resources of many smaller machines, seeking to
provide a number of benefits:

 Resource Pooling: Combining the available storage space to hold data is a clear benefit,
but CPU and memory pooling is also extremely important. Processing large datasets
requires large amounts of all three of these resources.
 High Availability: Clusters can provide varying levels of fault tolerance and availability
guarantees to prevent hardware or software failures from affecting access to data and processing. This becomes increasingly important as we continue to emphasize the importance of real-time analytics.
 Easy Scalability: Clusters make it easy to scale horizontally by adding additional
machines to the group. This means the system can react to changes in resource
requirements without expanding the physical resources on a machine.

Using clusters requires a solution for managing cluster membership, coordinating resource
sharing, and scheduling actual work on individual nodes. Cluster membership and resource
allocation can be handled by software like Hadoop's YARN (which stands for Yet Another
Resource Negotiator) or Apache Mesos.

The assembled computing cluster often acts as a foundation which other software interfaces
with to process the data. The machines involved in the computing cluster are also typically
involved with the management of a distributed storage system, which we will talk about when
we discuss data persistence.

Ingesting Data into the System

Data ingestion is the process of taking raw data and adding it to the system. The complexity of
this operation depends heavily on the format and quality of the data sources and how far the
data is from the desired state prior to processing.

One way that data can be added to a big data system is through dedicated ingestion tools.
Technologies like Apache Sqoop can take existing data from relational databases and add it to a
big data system. Similarly, Apache Flume and Apache Chukwa are projects designed to
aggregate and import application and server logs. Queuing systems like Apache Kafka can also
be used as an interface between various data generators and a big data system. Ingestion
frameworks like Gobblin can help to aggregate and normalize the output of these tools at the
end of the ingestion pipeline.

During the ingestion process, some level of analysis, sorting, and labelling usually takes place.
This process is sometimes called ETL, which stands for extract, transform, and load. While this
term conventionally refers to legacy data warehousing processes, some of the same concepts
apply to data entering the big data system. Typical operations might include modifying the
incoming data to format it, categorizing and labelling data, filtering out unneeded or bad data,
or potentially validating that it adheres to certain requirements.

With those capabilities in mind, ideally, the captured data should be kept as raw as possible for
greater flexibility further on down the pipeline.

Persisting the Data in Storage

The ingestion processes typically hand the data off to the components that manage storage, so
that it can be reliably persisted to disk. While this seems like it would be a simple operation, the
volume of incoming data, the requirements for availability, and the distributed computing layer
make more complex storage systems necessary.

This usually means leveraging a distributed file system for raw data storage. Solutions
like Apache Hadoop's HDFS filesystem allow large quantities of data to be written across
multiple nodes in the cluster. This ensures that the data can be accessed by compute resources,
can be loaded into the cluster's RAM for in-memory operations, and can gracefully handle
component failures. Other distributed filesystems can be used in place of HDFS
including Ceph and GlusterFS.

Data can also be imported into other distributed systems for more structured access.
Distributed databases, especially NoSQL databases, are well-suited for this role because they
are often designed with the same fault tolerant considerations and can handle heterogeneous
data. There are many different types of distributed databases to choose from depending on how you want to organize and present the data.

Computing and Analyzing Data

Once the data is available, the system can begin processing the data to surface actual
information. The computation layer is perhaps the most diverse part of the system as the
requirements and best approach can vary significantly depending on what type of insights are desired. Data is often processed repeatedly, either iteratively by a single tool or by using a
number of tools to surface different types of insights.

Batch processing is one method of computing over a large dataset. The process involves
breaking work up into smaller pieces, scheduling each piece on an individual machine,
reshuffling the data based on the intermediate results, and then calculating and assembling the
final result. These steps are often referred to individually as splitting, mapping, shuffling,
reducing, and assembling, or collectively as a distributed map reduce algorithm. This is the
strategy used by Apache Hadoop's MapReduce. Batch processing is most useful when dealing
with very large datasets that require quite a bit of computation.
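
A minimal sketch in plain Python (not an actual distributed implementation) of the split/map/shuffle/reduce steps described above, using a word count as the task:

from collections import defaultdict
from itertools import chain

documents = ["big data systems surface insights",
             "batch processing splits big data into pieces"]

# map: each piece emits (word, 1) pairs
mapped = chain.from_iterable(((word, 1) for word in doc.split()) for doc in documents)

# shuffle: group the intermediate pairs by key
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# reduce: combine the values for each key into the final result
word_counts = {word: sum(counts) for word, counts in groups.items()}
print(word_counts)   # e.g. {'big': 2, 'data': 2, ...}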

While batch processing is a good fit for certain types of data and computation, other workloads
require more real-time processing. Real-time processing demands that information be
processed and made ready immediately and requires the system to react as new information
becomes available. One way of achieving this is stream processing, which operates on a
continuous stream of data composed of individual items. Another common characteristic of
real-time processors is in-memory computing, which works with representations of the data in
the cluster's memory to avoid having to write back to disk.

Apache Storm, Apache Flink, and Apache Spark provide different ways of achieving real-time
or near real-time processing. There are trade-offs with each of these technologies, which can
affect which approach is best for any individual problem. In general, real-time processing is best
suited for analyzing smaller chunks of data that are changing or being added to the system
rapidly.

The above examples represent computational frameworks. However, there are many other
ways of computing over or analyzing data within a big data system. These tools frequently plug
into the above frameworks and provide additional interfaces for interacting with the underlying
layers. For instance, Apache Hive provides a data warehouse interface for Hadoop, Apache
Pig provides a high level querying interface, while SQL-like interactions with data can be
achieved with projects like Apache Drill, Apache Impala, Apache Spark SQL, and Presto. For
machine learning, projects like Apache SystemML, Apache Mahout, and Apache Spark's
MLlib can be useful. For straight analytics programming that has wide support in the big data
ecosystem, both R and Python are popular choices.

Visualizing the Results

Due to the type of information being processed in big data systems, recognizing trends or
changes in data over time is often more important than the values themselves. Visualizing data
is one of the most useful ways to spot trends and make sense of a large number of data points.

Real-time processing is frequently used to visualize application and server metrics. The data
changes frequently and large deltas in the metrics typically indicate significant impacts on the
health of the systems or organization. In these cases, projects like Prometheus can be useful for
processing the data streams as a time-series database and visualizing that information.

One popular way of visualizing data is with the Elastic Stack, formerly known as the ELK stack.
Composed of Logstash for data collection, Elasticsearch for indexing data, and Kibana for
visualization, the Elastic stack can be used with big data systems to visually interface with the
results of calculations or raw metrics. A similar stack can be achieved using Apache Solr for
indexing and a Kibana fork called Banana for visualization. The stack created by these is
called Silk.

Another visualization technology typically used for interactive data science work is a data
"notebook". These projects allow for interactive exploration and visualization of the data in a
format conducive to sharing, presenting, or collaborating. Popular examples of this type of
visualization interface are Jupyter Notebook and Apache Zeppelin.

Challenges with Big Data


A few challenges which come along with Big Data:

1. Data Quality – The problem here is the 4th V, i.e. Veracity. The data here is very messy, inconsistent and incomplete. Dirty data costs companies in the United States an estimated $600 billion every year.

2. Discovery – Finding insights in Big Data is like finding a needle in a haystack. Analyzing petabytes of data using extremely powerful algorithms to find patterns and insights is very difficult.

3. Storage – The more data an organization has, the more complex the problems of
managing it can become. The question that arises here is “Where to store it?”. We need
a storage system which can easily scale up or down on-demand.

4. Analytics – In the case of Big Data, most of the time we are unaware of the kind of data
we are dealing with, so analyzing that data is even more difficult.

5. Security – Since the data is huge in size, keeping it secure is another challenge. It
includes user authentication, restricting access based on a user, recording data access
histories, proper use of data encryption etc.

6. Lack of Talent – There are a lot of Big Data projects in major organizations, but assembling a sophisticated team of developers, data scientists and analysts who also have a sufficient amount of domain knowledge is still a challenge.

Conclusion

Big data is a broad, rapidly evolving topic. While it is not well-suited for all types of computing, many organizations are turning to big data for certain types of workloads and using it to supplement their existing analysis and business tools. Big data systems are uniquely suited for surfacing difficult-to-detect patterns and providing insight into behaviors that are impossible to find through conventional means. By correctly implementing systems that deal with big data, organizations can gain incredible value from data that is already available.
