Hypothesis Testing
Introduction
• A hypothesis is a claim (assumption) about a population parameter, based on the
statistics of the sample
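The test of such a claim can be sketched as a two-tailed one-sample t-test; the sample data, the hypothesized mean, and the 5% critical value are illustrative assumptions:

```python
import math
import statistics

# Illustrative sample; H0: population mean equals mu_0
sample = [10.2, 9.8, 10.5, 10.1, 9.9, 10.3, 10.0, 9.7, 10.4, 10.6]
mu_0 = 10.0

n = len(sample)
x_bar = statistics.mean(sample)
s = statistics.stdev(sample)          # sample standard deviation

# t-statistic: distance of the sample mean from mu_0 in standard errors
t_stat = (x_bar - mu_0) / (s / math.sqrt(n))
t_crit = 2.262                        # two-tailed 5% critical value, df = n - 1 = 9

reject = abs(t_stat) > t_crit
print(f"t-stat = {t_stat:.3f}, reject H0: {reject}")
```

Here the t-stat does not exceed the critical value, so the null hypothesis is not rejected at the 5% level.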
Types of errors
Types of tests
Limitation
• Correlation analysis only measures the linear relationship between two variables.
Two variables with a strong nonlinear relationship will have a low correlation
• Correlation can also be unreliable if outliers are present, which may dramatically
affect the calculation of the correlation coefficient
• Correlation does not imply causation. Even if two variables are highly correlated,
it does not mean that one causes the other
• Spurious correlations can arise from chance relationships, from mixing two
variables with a third, or from analyzing two variables that are each related to a
third variable
Covariance
• The covariance of X with itself equals the variance of X
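This identity can be checked numerically with the sample covariance (n − 1 denominator); the data below are illustrative:

```python
def mean(xs):
    return sum(xs) / len(xs)

def covariance(xs, ys):
    # sample covariance with the n - 1 denominator
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

x = [2.0, 4.0, 6.0, 8.0]
print(covariance(x, x))   # equals the sample variance of x
```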
Correlation Coefficient
• The numerical value of variance is not as meaningful because it is expressed in
squared units; its square root, the standard deviation, is in the original units
• The t-test can be used on the sample correlation, r, assuming both variables are
normally distributed.
• When the sample size increases, the t-stat increases (larger numerator) and t-crit
decreases (more degrees of freedom), so even a relatively small sample
correlation, r, can lead to rejecting a false null hypothesis
• When the sample correlation increases, the t-stat increases (larger numerator), so
even a relatively small sample size, n, can lead to rejecting a false null
hypothesis
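Both effects can be seen in the test statistic for H0: population correlation = 0, t = r·√(n − 2)/√(1 − r²); the r and n values below are illustrative:

```python
import math

def t_stat(r, n):
    # t-statistic for the sample correlation r with n - 2 degrees of freedom
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)

# Larger n raises the t-stat for the same r ...
print(t_stat(0.3, 10), t_stat(0.3, 100))
# ... and larger r raises the t-stat for the same n
print(t_stat(0.3, 30), t_stat(0.8, 30))
```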
Single Regression Analysis (3 + 3 + 3)
Introduction
• The linear regression model describes the relationship between the dependent and
the independent variables
• Symbols with a hat are estimates of the regression parameters, used for testing or
for making predictions of the dependent variable
• The regression equation implies the value of a dependent variable will increase
by b1 units when the value of the independent variable increases by one unit.
• Linear regression line minimizes the sum of squared errors (SSE), aka the sum of
squared regression residuals (differences between actual Y and predicted Y)
• The optimal values for the slope and intercept under the least squares approach
are b̂1 = Cov(X, Y) / Var(X) and b̂0 = Ȳ − b̂1 · X̄
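The least-squares slope and intercept can be computed directly from the closed-form formulas b1 = Cov(X, Y)/Var(X) and b0 = mean(Y) − b1·mean(X); the data below are illustrative:

```python
def mean(xs):
    return sum(xs) / len(xs)

def ols(x, y):
    # closed-form least-squares estimates: b1 = Cov(X, Y) / Var(X),
    # b0 = mean(Y) - b1 * mean(X)
    mx, my = mean(x), mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = my - b1 * mx
    return b0, b1

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.0, 9.8]   # roughly y = 2x
b0, b1 = ols(x, y)
print(b0, b1)
```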
Limitations
Assumptions
• SSE tells how certain we can be about the prediction of Y, but it does not tell how
well the independent variable explains the variation in the dependent variable.
• The coefficient of determination measures the fraction of the total variation in
the dependent variable that is explained by the independent variable.
• For a linear regression with only one independent variable, the coefficient of
determination equals the squared sample correlation: R² = r²
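The single-variable identity R² = r² can be verified numerically; the data below are illustrative:

```python
import math

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.1, 5.9, 8.2, 9.8]

n = len(x)
mx, my = sum(x) / n, sum(y) / n
sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
sxx = sum((a - mx) ** 2 for a in x)
syy = sum((b - my) ** 2 for b in y)

r = sxy / math.sqrt(sxx * syy)        # sample correlation
b1 = sxy / sxx                        # least-squares slope
b0 = my - b1 * mx                     # least-squares intercept
sse = sum((b - (b0 + b1 * a)) ** 2 for a, b in zip(x, y))
r_squared = 1 - sse / syy             # coefficient of determination
print(r * r, r_squared)               # the two agree
```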
• Test statistics
• A lower level of significance (α) increases the absolute value of t-crit, widens
the confidence interval, and lowers the likelihood of rejecting the null hypothesis
→ decreases Type I error and increases Type II error. Vice versa for a higher α
• A smaller standard error of the estimate narrows the confidence interval, giving a
higher likelihood of rejecting the null hypothesis
→ decreases Type II error (Type I error is set by the chosen α)
• The p-value is the lowest level of significance at which the null hypothesis can be
rejected. If the p-value is smaller than α, the null hypothesis is rejected
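The "reject when p-value < α" rule can be sketched with a z-statistic and the standard normal CDF (via math.erf); the α and z values, and the use of the normal rather than the t distribution, are illustrative assumptions:

```python
import math

def norm_cdf(z):
    # standard normal CDF expressed through the error function
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def two_sided_p(z):
    # two-tailed p-value for a z-statistic
    return 2 * (1 - norm_cdf(abs(z)))

alpha = 0.05
z = 2.4
p = two_sided_p(z)
print(f"p = {p:.4f}, reject H0: {p < alpha}")
```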
Analysis of variance (the k + 1 estimated parameters include the y-intercept and the
k slope coefficients)
• F statistics
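The ANOVA decomposition SST = SSR + SSE and the F statistic F = (SSR/k) / (SSE/(n − k − 1)) can be sketched for a one-variable regression; the data are illustrative:

```python
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.1, 5.9, 8.2, 9.8]
n, k = len(x), 1                      # k = number of slope coefficients

mx, my = sum(x) / n, sum(y) / n
b1 = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
b0 = my - b1 * mx

y_hat = [b0 + b1 * a for a in x]
sst = sum((b - my) ** 2 for b in y)                   # total variation
ssr = sum((yh - my) ** 2 for yh in y_hat)             # explained variation
sse = sum((b - yh) ** 2 for b, yh in zip(y, y_hat))   # unexplained variation

f_stat = (ssr / k) / (sse / (n - k - 1))
print(sst, ssr + sse, f_stat)         # SST equals SSR + SSE
```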
Prediction Intervals
• A confidence interval is used for the regression coefficients (e.g. the slope
coefficient)
• A prediction interval is the corresponding interval for a predicted value of the
dependent variable
• Predicted value of the dependent variable: Ŷ = b̂0 + b̂1 · X
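A prediction interval for a new observation x0 can be sketched with the standard error of forecast, s_f = s_e · √(1 + 1/n + (x0 − x̄)² / Σ(xᵢ − x̄)²); the data, x0, and critical value are illustrative assumptions:

```python
import math

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.1, 5.9, 8.2, 9.8]
n = len(x)

mx, my = sum(x) / n, sum(y) / n
sxx = sum((a - mx) ** 2 for a in x)
b1 = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
b0 = my - b1 * mx

sse = sum((b - (b0 + b1 * a)) ** 2 for a, b in zip(x, y))
se = math.sqrt(sse / (n - 2))         # standard error of the estimate

x0 = 6.0                              # new observation of X
y_hat = b0 + b1 * x0                  # predicted value of Y
# forecast error widens as x0 moves away from the mean of X
sf = se * math.sqrt(1 + 1 / n + (x0 - mx) ** 2 / sxx)

t_crit = 3.182                        # two-tailed 5% critical value, df = n - 2 = 3
print(f"prediction: {y_hat:.2f} +/- {t_crit * sf:.2f}")
```

Note that sf is always larger than se: the forecast carries both the scatter of Y around the line and the uncertainty in the fitted line itself.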