Big Picture
Study Guides
Bivariate data is often represented with scatterplots and line plots. The main reason to display bivariate data in such a way is to find a relationship between the two variables. The relationship is described through the correlation coefficient. Often, transformations on the data must be done so that the correlation coefficient can be used.
Key Terms
Bivariate: Two variables.
Scatterplot: A graph where each point represents a pair of measurements (two variables).
Correlation: The relationship between bivariate data.
Correlation Coefficient: A number that describes the correlation (relation) between bivariate data.
Linear Regression: Using data to calculate a line that best fits that data. The line can be used to make predictions.
Residual: The distance between the observed value and the expected value.
Correlation
Three important characteristics of bivariate data:
We are usually most interested in finding whether there is any correlation in the data. The correlation describes the direction of the relationship. One way to visualize the correlation is with a scatterplot. We can describe the correlation as:
positive correlation: positive slope
negative correlation: negative slope
perfect correlation: points on a scatterplot lie on a straight line (can be positive or negative)
Disclaimer: this study guide was not created to replace your textbook and is for classroom or individual use only.
The more linear the data is, the stronger the linear correlation. Another way to view the strength of this correlation is to draw an ellipse (oval) around all of the data. The narrower or skinnier the ellipse is, the stronger the linear correlation.
This guide was created by Lizhi Fan and Jin Yu. To learn more about the student authors, visit http://www.ck12.org/about/about-us/team/interns.
v1.1.9.2012
Bivariate Data
Correlation Coefficient
One statistic that measures the strength and direction of a linear correlation is the Pearson product-moment correlation coefficient. To calculate the correlation r of two variables X and Y, use the formula:
$r = \frac{\sum z_X z_Y}{n}$
where z is the z-score (computed with the standard deviation that divides by n) and n is the sample size. If we have the raw scores and not the standardized scores, we can use this formula:
$r = \frac{n \sum xy - \sum x \sum y}{\sqrt{n \sum x^2 - (\sum x)^2} \, \sqrt{n \sum y^2 - (\sum y)^2}}$
The correlation coefficient r:
Can have values between -1 and +1.
Signs indicate negative (-) and positive (+) correlations.
The closer the absolute value of the coefficient (|r|) is to 1, the stronger the relationship.
Correlation only describes linearity. It does not tell us if one variable caused the other.
Transformations to Achieve Linearity
Curvilinear relationships are nonlinear relationships. Just because they are nonlinear does not mean they don't have a strong correlation. However, the correlation coefficient r by itself cannot tell us about the strength of a nonlinear relationship. We can transform the data points to make a nonlinear relationship linear, and then use the correlation coefficient to describe the strength of the relationship. For example, suppose we were dealing with a power relationship:
$y = ax^b$
By taking the log of both sides, we can change the data to become a linear relationship. After doing this, we can describe the relationship with a correlation coefficient.
$\log y = \log (ax^b)$
$\log y = \log a + \log x^b$
$\log y = \log a + b \log x$
Letting $Y = \log y$ and $X = \log x$, the new relationship is $Y = \log a + bX$. Because $\log a$ is a constant, we have transformed the power relationship into a linear one with slope b and intercept $\log a$.
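As an illustration, the correlation coefficient and the log transformation can be sketched in Python. This is a minimal sketch under stated assumptions: the function names `pearson_r_z` and `pearson_r_raw` are hypothetical, the z-scores use the standard deviation that divides by n, and the data are made up to follow y = 2x^3 exactly.

```python
import math

def pearson_r_z(xs, ys):
    """Pearson r as the average product of z-scores (SD divides by n)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    products = [((x - mean_x) / sd_x) * ((y - mean_y) / sd_y)
                for x, y in zip(xs, ys)]
    return sum(products) / n

def pearson_r_raw(xs, ys):
    """Pearson r from raw scores (no standardizing needed)."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = (math.sqrt(n * sum_x2 - sum_x ** 2)
                   * math.sqrt(n * sum_y2 - sum_y ** 2))
    return numerator / denominator

# Made-up data that follow a curved (power) relationship: y = 2 * x^3
xs = [1, 2, 3, 4, 5, 6]
ys = [2 * x ** 3 for x in xs]

# r on the original data understates the strength of the relationship
r_before = pearson_r_raw(xs, ys)

# Take the log of both variables, then compute r on the transformed data
log_xs = [math.log10(x) for x in xs]
log_ys = [math.log10(y) for y in ys]
r_after = pearson_r_raw(log_xs, log_ys)

print("r before transformation:", round(r_before, 3))
print("r after transformation: ", round(r_after, 3))
```

The two functions should agree on any data set. On the transformed data the points lie on a straight line, so r is 1 up to rounding, while r on the original data is noticeably lower even though the relationship is exact.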
Linear Regression
A linear regression line is a straight line that represents the change in one variable associated with the change in the other. It is often used to predict values of future data points. This is done simply by substituting a value of the predictor variable (X) into the equation to find the outcome variable Y. (The predictor variable predicts the outcome.)
$Y = bX + a$
Y is what we are trying to predict
b is the slope of the line (regression coefficient)
a is the value of Y when X = 0 (regression constant)
X is the predictor variable
To calculate the line, we need to find b and a.
$b = \frac{n \sum xy - \sum x \sum y}{n \sum x^2 - (\sum x)^2}$ or $b = r \frac{s_Y}{s_X}$
$a = \bar{Y} - b\bar{X}$
r is the correlation between X and Y
$s_Y$ is the standard deviation of Y
$s_X$ is the standard deviation of X
$\bar{X}$ and $\bar{Y}$ are the means of X and Y
The least-squares regression line (also known as a linear regression line) is the line that minimizes the sum of the squared vertical distances from the data points to the line. Each of these vertical distances is known as a residual.
Residual = Observed - Expected
Generally, the smaller the residuals, the better the least-squares regression line fits the data. If all the residuals were added together, the sum would be zero.
Plotting Residuals and Testing for Linearity
We can plot the residuals by plotting the x-value for each data pair on the x-axis and the residual on the y-axis.
If the relationship is linear and there are no outliers, the residual plot appears to have no correlation (no visible pattern). If the residual plot has an obvious pattern, you may want to try other models of the data, such as power or exponential functions, to see if they are a better fit.
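The regression line and its residuals can be sketched in Python as follows. This is only an illustration: the function name `least_squares_line` and the data are made up, and the sketch assumes the standard relations b = r(sY/sX) and a = (mean of Y) - b(mean of X).

```python
import math

def least_squares_line(xs, ys):
    """Return (a, b) for the least-squares line Y = bX + a."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    r = sxy / math.sqrt(sxx * syy)       # correlation between X and Y
    s_x = math.sqrt(sxx / (n - 1))       # sample standard deviation of X
    s_y = math.sqrt(syy / (n - 1))       # sample standard deviation of Y
    b = r * s_y / s_x                    # slope (regression coefficient)
    a = mean_y - b * mean_x              # value of Y when X = 0
    return a, b

# Made-up data for illustration
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

a, b = least_squares_line(xs, ys)

# Residual = Observed - Expected, one residual per data pair
expected = [a + b * x for x in xs]
residuals = [y - e for y, e in zip(ys, expected)]

print("Y = %.1fX + %.1f" % (b, a))
for x, res in zip(xs, residuals):
    print("x = %d, residual = %+.1f" % (x, res))
print("sum of residuals:", round(sum(residuals), 10))
```

Plotting each x-value against its residual gives the residual plot described above. For these data the residuals add up to zero, apart from floating-point rounding.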
Inferences
The least-squares line $y = a + bx$ is for samples. To describe the line for the entire population, we use $y = \alpha + \beta x$, where $\beta$ is the population regression coefficient. CAREFUL: Here $\alpha$ and $\beta$ are not the level of significance and the probability of a Type II error used in hypothesis testing.
Hypothesis Testing
Make sure that the set of data is for a random sample.
Make sure the y values have a normal distribution.
If these are true, we can use hypothesis testing.
Null hypothesis $H_0$: the regression coefficient $\beta$ equals some given number $\beta_0$.
Alternative hypothesis $H_a$: $\beta$ does NOT equal the given number ($\beta \neq \beta_0$, or $\beta > \beta_0$, or $\beta < \beta_0$).
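Use the test statistic $t = \frac{b - \beta_0}{s_b}$ with n - 2 degrees of freedom, where $\beta_0$ is the hypothesized value of the regression coefficient, $s_b = \frac{s}{\sqrt{\sum (x - \bar{x})^2}}$ is the standard error of the slope b, and $s = \sqrt{\frac{\sum (y - \hat{y})^2}{n - 2}}$ is the standard deviation of the residuals.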
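A test of the regression coefficient can be sketched in Python. This sketch assumes the usual t statistic for a slope, t = (b - hypothesized value) / (standard error of b), with n - 2 degrees of freedom. The function name `slope_t_statistic`, the parameter `beta0`, and the data are hypothetical and only for illustration.

```python
import math

def slope_t_statistic(xs, ys, beta0=0.0):
    """Return (t, degrees_of_freedom) for testing H0: slope = beta0."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx                        # sample slope
    a = mean_y - b * mean_x              # sample intercept
    # Sum of squared residuals around the fitted line
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    s = math.sqrt(sse / (n - 2))         # standard deviation of the residuals
    s_b = s / math.sqrt(sxx)             # standard error of the slope
    t = (b - beta0) / s_b
    return t, n - 2

# Made-up data for illustration; H0: the slope equals 0
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

t, df = slope_t_statistic(xs, ys, beta0=0.0)
print("t = %.3f with %d degrees of freedom" % (t, df))
```

The resulting t value is then compared with a t table (or software) at the chosen level of significance to decide whether to reject the null hypothesis.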