
Linear Regression

Hypothesis Testing

Introduction
• A hypothesis is a claim (assumption) about a population parameter based on the
statistics of the sample

Types of errors
• Type I error: rejecting the null hypothesis when it is true; its probability is the
level of significance (α)
• Type II error: failing to reject the null hypothesis when it is false; its probability
is β (power = 1 − β)

Types of tests
• Two-tailed test: the alternative hypothesis allows deviations in both directions
(H0: θ = θ0 vs. Ha: θ ≠ θ0)
• One-tailed test: the alternative hypothesis allows deviations in one direction only
(e.g. Ha: θ > θ0 or Ha: θ < θ0)

Methods of rejecting the null hypothesis
• Critical value approach: reject H0 if the test statistic is more extreme than the
critical value at significance level α
• p-value approach: reject H0 if the p-value is smaller than α
• Confidence interval approach: reject H0 if the hypothesized value falls outside the
computed interval


Correlation Analysis (2 + 3)

Introduction
• Correlation analysis expresses the relationship between two data series in a single
number (the correlation coefficient)
• It measures the extent of a linear relationship between two variables.
o If all the points on a scatter plot fall on a line with a positive/negative
slope, then the correlation coefficient would be +1/−1
o A scatter plot with no linear relationship would have a correlation
coefficient of 0.

Limitation
• Correlation analysis only measures the linear relationship between two variables.
Two variables with a strong nonlinear relationship may still have a low correlation
• Correlation can also be unreliable if outliers are present, which may dramatically
affect the calculation of the correlation coefficient
• Correlation does not imply causation. Even if two variables are highly correlated,
it does not mean that one causes the other
• Spurious correlation may arise from chance relationships in the data, from a
calculation that mixes each of two variables with a third, or from two variables
that are both related to a third variable

Covariance
• The covariance of X with itself equals the variance of X

• Covariance measures the nature of the relationship between two variables
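A standard sample-covariance formulation, consistent with the bullets above:

$$\operatorname{Cov}(X,Y)=\frac{\sum_{i=1}^{n}(X_i-\bar{X})(Y_i-\bar{Y})}{n-1},\qquad \operatorname{Cov}(X,X)=\operatorname{Var}(X)$$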

Correlation Coefficient
• The numerical value of variance is not as meaningful as its square-root format –
standard deviation (variance is expressed in squared units)
• The numerical value of covariance is not as meaningful as its standardized format
– correlation coefficient
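The standardization divides covariance by the product of the two standard deviations:

$$r=\frac{\operatorname{Cov}(X,Y)}{s_X\,s_Y},\qquad -1\le r\le +1$$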
Hypothesis Testing for the Correlation Coefficient
• The null hypothesis states the correlation in the population is 0 (H0: ρ = 0); the
alternative hypothesis states the correlation in the population is different from
0 (Ha: ρ ≠ 0)

• The t-test can be used on the sample correlation, r, assuming both variables are
normally distributed.
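The standard test statistic for this t-test on the sample correlation:

$$t=\frac{r\sqrt{n-2}}{\sqrt{1-r^{2}}},\qquad \text{df}=n-2$$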

• The decision rule is that the null hypothesis will be rejected if |t-stat| > t-critical

• When the sample size increases, the t-stat increases (larger numerator) and t-crit
decreases (higher degrees of freedom), so the required sample correlation, r, can
be relatively small and a false null hypothesis is more likely to be rejected
• When the sample correlation increases, the t-stat increases (larger numerator), so
the required sample size, n, can be relatively small and a false null hypothesis is
more likely to be rejected
Single Regression Analysis (3 + 3 + 3)

Introduction
• The linear regression model describes the relationship between the dependent and
the independent variables

• b0 and b1 are regression coefficients/parameters (b0 is the y-intercept term and b1
is the slope coefficient)
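In the usual notation, the population regression model and the fitted (estimated) line are:

$$Y_i=b_0+b_1X_i+\varepsilon_i,\qquad \hat{Y}_i=\hat{b}_0+\hat{b}_1X_i$$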
• Linear regression computes the best-fit line through the data

• Symbols with a hat are estimates of the regression parameters, used for testing or
making predictions about the dependent variable
• The regression equation implies the value of the dependent variable will increase
by b1 units when the value of the independent variable increases by one unit
• The linear regression line minimizes the sum of squared errors (SSE), aka the sum of
squared regression residuals (the differences between actual Y and predicted Y)

• The optimal values for the slope and intercept using a least squares approach can
be calculated with the following formulas:
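The standard least-squares estimators:

$$\hat{b}_1=\frac{\sum_{i=1}^{n}(X_i-\bar{X})(Y_i-\bar{Y})}{\sum_{i=1}^{n}(X_i-\bar{X})^{2}}=\frac{\operatorname{Cov}(X,Y)}{\operatorname{Var}(X)},\qquad \hat{b}_0=\bar{Y}-\hat{b}_1\bar{X}$$

A minimal numerical sketch of these formulas in Python; the data values and variable names are hypothetical, chosen only to illustrate the calculation:

```python
import numpy as np

# Hypothetical example data (for illustration only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # independent variable
y = np.array([2.1, 4.3, 5.9, 8.2, 9.8])   # dependent variable
n = len(x)

# Slope: sample covariance of X and Y divided by sample variance of X
b1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
# Intercept: forces the fitted line through (mean of X, mean of Y)
b0 = y.mean() - b1 * x.mean()

y_hat = b0 + b1 * x                        # predicted values
sse = np.sum((y - y_hat) ** 2)             # sum of squared errors
see = np.sqrt(sse / (n - 2))               # standard error of estimate
sst = np.sum((y - y.mean()) ** 2)          # total variation
r_squared = 1 - sse / sst                  # coefficient of determination

print(f"b0 = {b0:.4f}, b1 = {b1:.4f}, SEE = {see:.4f}, R^2 = {r_squared:.4f}")
```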
Limitations
• Regression relations can change over time (parameter instability)
• A regression relation may lose its usefulness once it becomes widely known and
acted upon by market participants
• Regression results are not valid if the underlying assumptions (below) are violated

Assumptions
• The relationship between the dependent and the independent variable is linear
• The independent variable is uncorrelated with the error term
• The expected value of the error term is 0
• The variance of the error term is constant for all observations (homoskedasticity)
• The error terms are uncorrelated (independent) across observations
• The error terms are normally distributed

Standard Error of Estimate (SEE) & Coefficient of Determination

• SEE measures how well a regression model captures the relationship between two
variables (the quality of the fit of a given linear regression model)
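A standard formulation of SEE, whose denominator is discussed in the next bullet:

$$\text{SEE}=\sqrt{\frac{\text{SSE}}{n-2}}=\sqrt{\frac{\sum_{i=1}^{n}(Y_i-\hat{Y}_i)^{2}}{n-2}}$$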

• The denominator has n − 2 degrees of freedom because there are n observations
and two estimated parameters (b̂0, b̂1)
• SEE measures the standard deviation of the residual terms; the smaller the SEE,
the more accurate the predictions based on the model

• SEE tells how certain we can be about the prediction of Y, but it does not tell how
well the independent variable explains the variation in the dependent variable
• The coefficient of determination measures the fraction of the total variation in
the dependent variable that is explained by the independent variable.
• For a linear regression with only one independent variable:
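In the one-variable case, R² reduces to the squared correlation:

$$R^{2}=r^{2}$$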

• For a linear regression with one or more independent variables:
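The general definition, in terms of the sums of squares:

$$R^{2}=\frac{\text{RSS}}{\text{SST}}=\frac{\text{explained variation}}{\text{total variation}}=1-\frac{\text{SSE}}{\text{SST}}$$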


Confidence Intervals for Regression Coefficients
• A confidence interval is a range of values within which we believe the true
population parameter lies, with a certain degree of confidence (1 − α)
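The standard form of this interval for the slope coefficient, where t_c is the critical
t-value with n − 2 degrees of freedom:

$$\hat{b}_1\pm t_c\,s_{\hat{b}_1}$$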

• In a confidence interval, we examine whether the hypothesized value of the
population parameter, the slope coefficient (b1), lies within a computed interval
with a certain degree of confidence (1 − α)
• In hypothesis testing, we examine whether the estimated value of the parameter
from the sample data (b̂1) lies within a rejection region at a certain level of
significance (α)

Hypothesis Testing for Individual Regression Coefficients (t-Test)


• In linear regression, t-tests are often used to test hypotheses about the population
values of the intercept and slope coefficients.
• Null and alternate hypotheses
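In the usual notation, with B1 the hypothesized value (most often 0):

$$H_0:\ b_1=B_1\qquad H_a:\ b_1\neq B_1$$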

• Test statistic
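The standard test statistic for an individual coefficient:

$$t=\frac{\hat{b}_1-B_1}{s_{\hat{b}_1}},\qquad \text{df}=n-2$$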

• The decision rule is that the null hypothesis will be rejected if |t-stat| > t-critical

• A lower level of significance (α) increases the absolute value of t-crit and widens
the confidence interval, giving a lower likelihood of rejecting the null hypothesis
→ decreases Type I error and increases Type II error. Vice versa for a higher α
• A smaller standard error of the estimate narrows the confidence interval, giving a
higher likelihood of rejecting the null hypothesis
→ decreases Type II error (greater power) while Type I error stays fixed at α
• The p-value is the lowest level of significance at which the null hypothesis can be
rejected. If the p-value is smaller than α, the null hypothesis can be rejected
Analysis of Variance (the k + 1 parameters include the y-intercept and the k slope coefficients)
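A standard ANOVA table layout consistent with the F-test below, where k is the number
of slope coefficients and n is the number of observations:

Source of variation    | df        | Sum of squares | Mean sum of squares
Regression (explained) | k         | RSS            | MSR = RSS / k
Residual (unexplained) | n − k − 1 | SSE            | MSE = SSE / (n − k − 1)
Total                  | n − 1     | SST            |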

Hypothesis Testing for All Regression Coefficients in the Model (F-Test)


• A one-tailed F-test is used to determine whether all slope coefficients are
simultaneously 0. The F-test is not commonly used when there is only one
independent variable, as it duplicates the t-test (F = t²)
• Null hypothesis and alternate hypothesis
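In the usual notation:

$$H_0:\ b_1=b_2=\dots=b_k=0\qquad H_a:\ \text{at least one }b_j\neq 0$$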

• F-statistic
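The standard F-statistic, a one-tailed test with df = (k, n − k − 1):

$$F=\frac{\text{MSR}}{\text{MSE}}=\frac{\text{RSS}/k}{\text{SSE}/(n-k-1)}$$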

• The decision rule is that the null hypothesis will be rejected if F-stat > F-critical
(one-tailed)

Prediction Intervals
• A confidence interval is used for the regression parameters (e.g. the slope coefficient)
• A prediction interval is the corresponding interval for the dependent variable
• Predicted value of the dependent variable
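For the one-variable case, the point prediction at a given value X of the independent
variable:

$$\hat{Y}=\hat{b}_0+\hat{b}_1X$$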

• Variance of the prediction errors (s_x² is the variance of the independent variable)
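A standard formulation of the prediction-error variance for simple linear regression:

$$s_f^{2}=\text{SEE}^{2}\left[1+\frac{1}{n}+\frac{(X-\bar{X})^{2}}{(n-1)s_x^{2}}\right]$$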

• The (1 − α) percent prediction interval
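In the usual form, with t_c the critical t-value with n − 2 degrees of freedom:

$$\hat{Y}\pm t_c\,s_f$$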
