REGRESSION
Regression is a technique used for the modeling and analysis of numerical data. It exploits the relationship between two or more variables so that we can gain information about one of them from known values of the others. Regression can be used for prediction, estimation, hypothesis testing, and modeling causal relationships.
LINEAR REGRESSION
Simple linear regression is used for three main purposes: to describe the linear dependence of one variable on another; to predict values of one variable from values of another, for which more data are available; and to correct for the linear dependence of one variable on another, in order to clarify other features of its variability. Linear regression determines the best-fit line through a scatterplot of data, such that the sum of squared residuals is minimized; equivalently, it minimizes the error variance. The fit is "best" in precisely that sense: the sum of squared errors is as small as possible. That is why it is also termed "ordinary least squares" (OLS) regression.
ASSUMPTIONS IN REGRESSION
The relationship between the response y and the regressors is linear. The error term has zero mean. The error term has constant variance σ². The errors are uncorrelated. The errors are normally distributed.
CALCULATION OF PARAMETERS
Y = β₀ + β₁x
β₀ = intercept
β₁ = slope = Sxy / Sxx
β₀ = ȳ − β₁x̄
n = number of observations
Sxy = Σxy − (1/n)(Σx)(Σy)
Sxx = Σx² − (1/n)(Σx)²
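The formulas above can be sketched directly in code. This is a minimal illustration; the x and y data are made up purely for demonstration.

```python
# Least-squares slope and intercept via the Sxy/Sxx formulas above.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]   # illustrative data
n = len(x)

sum_x, sum_y = sum(x), sum(y)
Sxy = sum(xi * yi for xi, yi in zip(x, y)) - sum_x * sum_y / n
Sxx = sum(xi**2 for xi in x) - sum_x**2 / n

b1 = Sxy / Sxx                   # slope estimate
b0 = sum_y / n - b1 * sum_x / n  # intercept estimate: ybar - b1 * xbar
print(b1, b0)
```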
SST = SSR + SSE
where SST is the total sum of squares, SSR the regression sum of squares, and SSE the error sum of squares.
SST = Σ(Y − Ȳ)²
SSR = Σ(Ŷ − Ȳ)²
SSE = Σ(Y − Ŷ)²
where
Ȳ = average value of the dependent variable
Ŷ = predicted value of Y for a given X
Y = observed value of the dependent variable
Source of variation   Sum of squares   Degrees of freedom
Regression            SSR              1
Error                 SSE              n − 2
Total                 SST              n − 1
COEFFICIENT OF DETERMINATION
Coefficient of determination: R² = SSR / SST. The coefficient of determination is the proportion of the variation in Y explained by the regression model relative to the total variation; the higher it is, the stronger the degree of relationship between Y and X. The range of R² is from 0 to 1. The statistic R² should be used with caution, since it is always possible to make R² large by adding enough terms to the model. The magnitude of R² also depends on the range of variability of the regressor variable: generally R² will increase as the spread of the x's increases and decrease as the spread of the x's decreases, provided the assumed model form is correct. Since E(R²) ≈ β₁²Sxx / (β₁²Sxx + σ²), the expected value of R² increases as Sxx increases. Thus a large value of R² may result simply because x has been varied over an unrealistically large range. R² does not measure the appropriateness of the linear model; R² will often be large even though y and x are nonlinearly related. Coefficient of correlation = [SSR / SST]^(1/2).
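As a sketch, R² can be computed from the sums of squares defined earlier; the data below are illustrative only.

```python
# R² = SSR / SST for a fitted simple linear regression (illustrative data).
x = [1, 2, 3, 4, 5]
y = [2.0, 4.1, 5.9, 8.2, 9.8]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n

# Least-squares fit
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) \
     / sum((xi - xbar) ** 2 for xi in x)
b0 = ybar - b1 * xbar
yhat = [b0 + b1 * xi for xi in x]

SST = sum((yi - ybar) ** 2 for yi in y)            # total sum of squares
SSR = sum((yh - ybar) ** 2 for yh in yhat)         # regression sum of squares
SSE = sum((yi - yh) ** 2 for yi, yh in zip(y, yhat))  # error sum of squares

r2 = SSR / SST
assert abs(SST - (SSR + SSE)) < 1e-9               # decomposition SST = SSR + SSE
print(round(r2, 4))
```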
COEFFICIENT OF CORRELATION
Coefficient of correlation: r = [SSR / SST]^(1/2). The sign of r is the same as that of the slope in the regression model. A zero correlation indicates that there is no linear relationship between the variables. A correlation of −1 indicates a perfect negative relationship; a correlation of +1 indicates a perfect positive relationship. It is very dangerous to conclude that there is no association between x and y just because r is close to zero, as shown in the figure below. The correlation coefficient is of value only when the relation between x and y is linear.
(Figure: the calculated value of the coefficient of correlation is zero, but the data follow a smooth curve.)
Failure to reject H₀: β₁ = 0 suggests that there is no linear relationship between y and x. Figure 11.8 illustrates the implications of this result: it may mean either that x is of little value in explaining the variation in y (a), or that the true relationship between x and y is not linear (b).
If H₀: β₁ = 0 is rejected, this implies that x is of value in explaining the variability in y. However, rejecting H₀: β₁ = 0 could mean either that the straight-line model is adequate (fig. a), or that even though there is a linear effect of x, better results could be obtained by adding higher-order polynomial terms (fig. b).
For an individual observation, the prediction interval is:
ŷ₀ − t_{α/2, n−2} [MS_E (1 + 1/n + (x₀ − x̄)²/Sxx)]^(1/2) ≤ y₀ ≤ ŷ₀ + t_{α/2, n−2} [MS_E (1 + 1/n + (x₀ − x̄)²/Sxx)]^(1/2)
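A minimal sketch of this prediction interval follows. The data and x₀ are illustrative, and the critical value t_{0.025, 3} = 3.182 is taken from a t-table (assuming α = 0.05) rather than computed.

```python
# Prediction interval for a new observation y0 at x0 (illustrative data).
x = [1, 2, 3, 4, 5]
y = [2.0, 4.1, 5.9, 8.2, 9.8]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
Sxx = sum((xi - xbar) ** 2 for xi in x)
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / Sxx
b0 = ybar - b1 * xbar

SSE = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
MSE = SSE / (n - 2)              # estimate of sigma^2

x0 = 3.5
y0_hat = b0 + b1 * x0
t_crit = 3.182                   # t_{0.025, n-2} for n-2 = 3, from a table
half_width = t_crit * (MSE * (1 + 1 / n + (x0 - xbar) ** 2 / Sxx)) ** 0.5
lower, upper = y0_hat - half_width, y0_hat + half_width
print(lower, upper)
```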
Residual Analysis
Definition of residuals: eᵢ = yᵢ − ŷᵢ, i = 1, …, n
The residuals are the deviations between the data and the fit; a measure of the variability in the response variable not explained by the regression model; and the realized or observed values of the model errors.
Residual Plot
Graphical analysis is a very effective way to investigate the adequacy of the fit of a regression model and to check the underlying assumptions. Normal probability plot: if the errors come from a distribution with thicker or heavier tails than the normal, the least-squares fit may be sensitive to a small subset of the data; heavy-tailed error distributions often generate outliers that pull the least-squares fit too far in their direction. The normal probability plot is a simple way to check the normality assumption. Rank the residuals, e₍₁₎ ≤ … ≤ e₍ₙ₎, and plot e₍ᵢ₎ against Pᵢ = (i − 1/2)/n, or sometimes against Φ⁻¹[(i − 1/2)/n]. For a large sample (n > 32) the points lie nearly on a straight line if the e₍ᵢ₎ are normal; for a small sample (n ≤ 16) they may deviate from a straight line even when the e₍ᵢ₎ are normal. Usually at least 20 points are required for a useful normal probability plot.
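The plotting coordinates described above can be sketched as follows; the residuals here are made up for illustration, and the standard-normal inverse CDF comes from Python's statistics module.

```python
# Coordinates for a normal probability plot: ranked residuals e_(i)
# against Phi^{-1}((i - 1/2)/n). Residuals are illustrative.
from statistics import NormalDist

residuals = [0.4, -1.2, 0.1, 0.9, -0.3, -0.6, 1.1, 0.2, -0.8, 0.5]
n = len(residuals)

ranked = sorted(residuals)                        # e_(1) <= ... <= e_(n)
probs = [(i - 0.5) / n for i in range(1, n + 1)]  # cumulative probabilities P_i
quantiles = [NormalDist().inv_cdf(p) for p in probs]

# If the residuals are normal, the points (quantiles[i], ranked[i])
# should lie near a straight line.
for q, e in zip(quantiles, ranked):
    print(f"{q:6.3f}  {e:6.2f}")
```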
Fitting the parameters tends to destroy the evidence of nonnormality in the residuals, so we cannot always rely on the normal probability plot to detect departures from normality. A common defect is the occurrence of one or two large residuals; sometimes this is an indication that the corresponding observations are outliers.
Plot of Residuals against the Fitted Values
(a) Satisfactory. (b) Variance is an increasing function of ŷ. (c) Often occurs when y is a proportion between 0 and 1. (d) Indicates nonlinearity.
Plot of Residuals in Time Sequence: The time sequence plot of residuals may indicate that the errors at one time period are correlated with those at other time periods. Autocorrelation: The correlation between model errors at different time periods. (a) positive autocorrelation (b) negative autocorrelation
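A quick numeric check of the behaviour described above is the lag-1 sample autocorrelation of the residuals; the residual series here is illustrative.

```python
# Lag-1 sample autocorrelation of residuals in time order.
# A value near +1 suggests positive autocorrelation, near -1 negative.
residuals = [0.5, 0.3, 0.4, -0.2, -0.5, -0.3, 0.1, 0.4, 0.2, -0.1]
n = len(residuals)
mean = sum(residuals) / n

num = sum((residuals[i] - mean) * (residuals[i + 1] - mean)
          for i in range(n - 1))
den = sum((e - mean) ** 2 for e in residuals)
r1 = num / den
print(round(r1, 3))
```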
The slope depends heavily on points A and B: if A and B are deleted, the remaining data would give a very different estimate of the slope. Situations like this often require corrective action, such as further analysis or estimating the model parameters with some other technique that is less influenced by these points.
Just because a regression analysis has indicated a strong relationship between two variables, this does not imply that the variables are related in a causal sense. Causality implies a necessary correlation; regression analysis cannot address the issue of necessity.
In some applications, the value of the regressor variable x required to predict y is itself unknown; for example, to forecast the load on an electric power generation system, we first need to forecast the temperature.
In general, we can model the expected value of y as an nth-order polynomial, yielding the general polynomial regression model y = β₀ + β₁x + β₂x² + … + βₙxⁿ + ε.
Hierarchy: the regression model y = β₀ + β₁x + β₂x² + β₃x³ + ε is said to be hierarchical because it contains all terms of order 3 and lower. Model-building strategy: various strategies for choosing the order of an approximating polynomial have been suggested. The two main strategies are forward selection and backward elimination.
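A minimal polynomial-fit sketch, assuming NumPy is available: we fit the hierarchical cubic model above to toy data generated from a known cubic, so the recovered coefficients are checkable.

```python
# Fit y = b0 + b1 x + b2 x^2 + b3 x^3 by least squares (toy, noiseless data).
import numpy as np

x = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0])
y = x**3 - 2 * x**2 + x + 1       # true coefficients: 1, -2, 1, 1

coeffs = np.polyfit(x, y, deg=3)  # returned highest power first
print(np.round(coeffs, 3))        # approximately [ 1. -2.  1.  1.]
```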
Since an explanatory variable may or may not be relevant for making predictions about the response variable, it is useful to be able to reduce the model to contain only the variables which provide important information about the response variable. First of all, we need to define the maximum model, that is, the model containing all explanatory variables which could possibly be present in the final model. Let k denote the maximum number of feasible explanatory variables; then the maximum model is given by Yᵢ = β₀ + β₁xᵢ,₁ + β₂xᵢ,₂ + … + βₖxᵢ,ₖ + εᵢ, where x₁, x₂, …, xₖ are the explanatory variables, and the εᵢ are independent, normally distributed random error terms with zero mean and common variance.
SELECTION CRITERIA
When the maximum model has been defined, the next point to consider is how to determine whether one model is 'better' than the rest: which criterion should we use to compare the possible models? A selection criterion is a criterion which orders all possible models from 'best' to 'worst'. Many different criteria have been suggested over time; some are better than others, but there is no single criterion which is universally preferred.
Note that the adjusted coefficient of determination, R²ₐ, does not necessarily increase when the number of explanatory variables increases. According to the R²ₐ criterion, one should choose the model which has the largest R²ₐ.
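As a sketch, the adjusted R² can be computed as R²ₐ = 1 − (1 − R²)(n − 1)/(n − p − 1), where p is the number of explanatory variables; the numbers below are illustrative only.

```python
# Adjusted R^2: penalizes R^2 for the number of explanatory variables p.
def adjusted_r2(r2, n, p):
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Adding a weak variable raises R^2 slightly but can lower adjusted R^2:
print(adjusted_r2(0.90, n=20, p=3))   # ~0.881
print(adjusted_r2(0.905, n=20, p=4))  # ~0.880
```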
If H₀ is not rejected, the reduced model provides as good a fit to the data as the maximum model, so we can use the reduced model instead of the maximum model.
SELECTION PROCEDURES
All-possible-models procedure: the most thorough selection procedure is the all-possible-models procedure, in which all possible models are fitted to the data and the selection criterion is applied to every one of them in order to find the model which is preferable to all others.
THANK YOU