
Correlation and regression provide us with different, but complementary, information on the relationship between two quantitative variables.

12/29/12

Slide 1

CreditCardData.sav has five variables for 8 cases. The data for the 8 cases are shown in the Data View to the left. The names and labels for each of the variables are shown below in the Variable View.

The goal of this analysis is to study the relationship between family size and number of credit cards. Finding the relationship will help us predict the number of credit cards a family typically has relative to the number of family members. If a family had fewer cards than expected, it would be a good candidate for us to extend another credit card offer.

Slide 2

Creating a histogram of the dependent variable, ncards, shows a distribution that is about as normal as we could expect for only 8 cases. I have superimposed the red normal curve and blue mean line on the histogram.

For any quantitative variable, our best estimate of the values for cases in the distribution is the mean, because it minimizes the errors or differences between the estimated value and the actual score represented by each of the bars in the histogram.

Slide 3

To demonstrate that the mean is the best value to estimate, I created a worksheet in Excel that compares the error associated with three different estimates of values for each case: the mean of 7, an estimate lower than the mean: 6, and an estimate higher than the mean: 8.

Error is calculated as the sum of the squared deviations from the value used as the estimate. Columns C, F, and I contain the deviations from each of the estimates (7, 6, and 8). Columns D, G, and J contain the squared deviations, with the summed total at the base of each column. Using the mean of 7 as the estimate, there are 22 units of error. Using either 6 or 8 results in 30 units of error. This measure of error is called the Total Sum of Squares.

Slide 4
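The worksheet comparison above can be sketched in a few lines of Python. The ncards values below are reconstructed to match the statistics quoted in the slides (mean of 7, Total Sum of Squares of 22); they are assumptions, not taken from CreditCardData.sav directly.

```python
# ncards values reconstructed to reproduce the slides' statistics
# (mean = 7, Total Sum of Squares = 22); assumed, not from the .sav file.
ncards = [4, 6, 6, 7, 8, 7, 8, 10]

def sum_sq_error(estimate, values):
    """Sum of squared deviations from a single estimated value."""
    return sum((y - estimate) ** 2 for y in values)

# The mean (7) minimizes the error; any other estimate does worse.
errors = {est: sum_sq_error(est, ncards) for est in (6, 7, 8)}
print(errors)  # {6: 30, 7: 22, 8: 30}
```

As in the Excel worksheet, the mean yields 22 units of error while 6 and 8 each yield 30.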

The graph for the relationship between two quantitative variables is the scatterplot, with the independent variable Family Size on the horizontal x-axis, and the dependent variable Number of Credit Cards on the vertical y-axis.

Each dot represents the combination of scores for one case. For example, this dot represents a family of 5 that had 8 credit cards.

I have superimposed the blue dotted mean line for Number of Credit Cards on the scatterplot. We see that the scores for two cases actually fall on the mean line, while the other six are at varying distances from the mean line.

Slide 5

The purple lines are the deviations, the differences between individual scores and the mean of the dependent variable.

If we square the deviations and sum the squares, we have the Total Sum of Squares.

The differences are often phrased as distances, i.e. the vertical distance between the mean line and the score for this case on the dependent variable is 3.

Slide 6

I have added the green vertical dotted line at the mean family size, 4.25.

The regression line will pass through the intersection of the means of both variables, and will minimize the total sum of the squared differences between the individual scores and the regression line.

Slide 7

One way to think about linear regression is that we are rotating a line through the intersection of the means of the two variables.

Each time we rotate the line, we would compute the sum of squared deviations from the line. We stop when we have found the line with the smallest sum of squares.

There is a direct method for finding the regression line that does not require this trial-and-error strategy.

Slide 8
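The direct method is ordinary least squares: the slope is the sum of cross-products of the deviations divided by the sum of squared deviations of the predictor, and the intercept follows from the means. A minimal sketch, using famsize and ncards values reconstructed to match the slides' statistics (assumed, not read from the .sav file):

```python
# Data reconstructed to match the slides (means 4.25 and 7); assumed values.
famsize = [2, 2, 4, 4, 5, 5, 6, 6]
ncards = [4, 6, 6, 7, 8, 7, 8, 10]

n = len(famsize)
mean_x = sum(famsize) / n   # 4.25
mean_y = sum(ncards) / n    # 7.0

# Least-squares slope: cross-products over squared deviations of x.
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(famsize, ncards))
sxx = sum((x - mean_x) ** 2 for x in famsize)
slope = sxy / sxx                    # ≈ 0.971
intercept = mean_y - slope * mean_x  # ≈ 2.871
```

These coefficients match the SPSS output quoted later in the slides (ncards = 2.871 + .971 × famsize), and the line passes through the intersection of the two means by construction.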

If there is no relationship, the blue regression line will be on top of (or very close to) the dotted blue mean line for the dependent variable.

No relationship means that we cannot reduce the error or total sum of squares of the dependent variable by using the relationship to the independent variable.

Slide 9

The points along the regression line represent the estimated values for all possible values of the independent variable.

For example, if we wanted to estimate the number of cards for a family of 4, we would draw a vertical line from the 4 on the horizontal axis up to the regression line, and from the regression line left to the vertical axis. The location on the vertical axis is the estimated number of cards that a family of 4 would have, i.e. about 6.8 cards.

Slide 10

The differences between the estimated value and the actual value for the cases are deviations that are called residuals (the light blue lines). They represent errors in predicting the values of the dependent variable based on the value of the independent variable. We had two cases with a family size of 4. Our estimated value was overstated for one of the cases, and understated for the other case.

Slide 11

The formula for the regression line can be extracted from the SPSS output. For this example, the regression equation is: ncards = 2.871 + .971 x famsize
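Plugging the extracted equation into a small helper reproduces the earlier graphical estimate for a family of 4 (read off the regression line as "about 6.8 cards"). The function name is mine, chosen for illustration:

```python
def estimate_ncards(famsize):
    # Coefficients as printed in the SPSS output on the slide.
    return 2.871 + 0.971 * famsize

print(estimate_ncards(4))  # ≈ 6.755, the "about 6.8 cards" read off the graph
```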

Slide 12

We can plug the regression equation into Excel and estimate the number of cards for each case.

To compute the residuals, we subtract the actual value for ncards from the estimated value for the case.

If we square the residuals, and sum the squares, we have the amount of error associated with using the regression line to estimate each case, 5.485758.
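The residual computation can be sketched as follows, fitting the line at full precision (the slides' 5.485758 differs in the last digits only because the worksheet used rounded coefficients). The data values are reconstructed from the slides' statistics and are assumptions:

```python
# Data reconstructed to match the slides' statistics; assumed values.
famsize = [2, 2, 4, 4, 5, 5, 6, 6]
ncards = [4, 6, 6, 7, 8, 7, 8, 10]

n = len(famsize)
mean_x, mean_y = sum(famsize) / n, sum(ncards) / n
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(famsize, ncards))
         / sum((x - mean_x) ** 2 for x in famsize))
intercept = mean_y - slope * mean_x

predicted = [intercept + slope * x for x in famsize]
residuals = [y - yhat for y, yhat in zip(ncards, predicted)]
sum_sq_residuals = sum(r ** 2 for r in residuals)  # ≈ 5.486
```

Note the residuals here are computed as actual minus estimated; since they are squared before summing, the sign convention does not affect the total.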

Slide 13

If we plug the total sum of squares and the sum of squared residuals into an Excel spreadsheet, we can compute the reduction in the total sum of squared errors associated with using the information in the independent variable, as represented by the regression equation.

When we compute the percentage of total error reduced by the regression equation, we end up with the value of R², the proportion of variance explained by the regression relationship. Our calculation agrees with the value of R Square in the SPSS output.
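The proportion-of-error-reduced calculation can be sketched directly, again using the reconstructed (assumed) data values:

```python
# Data reconstructed to match the slides' statistics; assumed values.
famsize = [2, 2, 4, 4, 5, 5, 6, 6]
ncards = [4, 6, 6, 7, 8, 7, 8, 10]

n = len(famsize)
mean_x, mean_y = sum(famsize) / n, sum(ncards) / n
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(famsize, ncards))
         / sum((x - mean_x) ** 2 for x in famsize))
intercept = mean_y - slope * mean_x

total_ss = sum((y - mean_y) ** 2 for y in ncards)        # 22
residual_ss = sum((y - (intercept + slope * x)) ** 2
                  for x, y in zip(famsize, ncards))       # ≈ 5.486
r_squared = (total_ss - residual_ss) / total_ss           # ≈ 0.7506
```

The error drops from 22 units to about 5.486, a reduction of about 75%, matching the R Square of 0.750647 in the SPSS output.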

Slide 14

R² is often interpreted as the percentage of variance explained. We can convert our Sum of Squares column to variance by dividing by the number of cases in the sample minus one (8 − 1).

If we compute the percentages using variances instead of sums of squares, we end up with exactly the same value for R², 0.750647.

R² is also interpreted as the proportional reduction in error (a PRE statistic), which we can also phrase as an increase in accuracy. We should remember that no matter whether we interpret R² as explaining variance or reducing error, the statistic applies to the total error in the distribution, not to the error in individual cases.

Slide 15

We can also think of regression and correlation as based on the pattern of deviations for the two variables across the cases in the distribution. To present this, we will first compute the standard scores for each variable. As standard scores, the value for each case is its deviation from 0, the mean of the distribution of standard scores.

Slide 16

Plotting the z-scores for both variables produces the same pattern in the scatterplot that we found with the raw data.

As we would expect for standard scores, the green dotted line for the mean z-score for family size is at zero, as is the dotted blue line for the standard scores for number of credit cards.

Slide 17

We add lines for the deviations from the means for both variables.

The green deviation lines represent differences from the mean z-score for family size.

The blue deviation lines represent differences from the mean z-score for number of credit cards.

Slide 18

The strength of the relationship will depend on the agreement of the deviations for each case, i.e. the extent to which the green line deviation for a case agrees with the blue line deviation.

For some points, the length of the green deviation line is similar to the length of the blue deviation line.

For other points, the length of the green deviation line is shorter than the length of the blue deviation line.

Slide 19

Overall, the pattern of the deviations is similar. Green deviations above the mean are paired with blue deviations above the mean. Green deviations below the mean are paired with blue deviations below the mean. Though the length of the deviations for individual cases varies, the overall pattern suggests a strong relationship.

Slide 20

To compute the correlation coefficient, we multiply each case's two z-scores together, and sum the products across all the cases.

To compute Pearson's r, we divide the sum of the z-score products by the number of cases minus one.

The value for Pearson's r that we computed agrees with the value supplied by SPSS.

Finally, if we square the value of Pearson's r, we have the same value as R Square in the SPSS regression output.

Slide 21
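The z-score route to Pearson's r can be sketched with the standard library's statistics module, again with the reconstructed (assumed) data values:

```python
import statistics

# Data reconstructed to match the slides' statistics; assumed values.
famsize = [2, 2, 4, 4, 5, 5, 6, 6]
ncards = [4, 6, 6, 7, 8, 7, 8, 10]

def z_scores(values):
    """Standardize using the sample standard deviation (n - 1 divisor)."""
    mean, sd = statistics.mean(values), statistics.stdev(values)
    return [(v - mean) / sd for v in values]

zx, zy = z_scores(famsize), z_scores(ncards)

# Pearson's r: sum of z-score products over (n - 1).
r = sum(a * b for a, b in zip(zx, zy)) / (len(famsize) - 1)  # ≈ 0.866
r_squared = r ** 2                                           # ≈ 0.7506
```

Squaring the r of about 0.866 recovers the R Square of about 0.7506 from the regression output, as the slide states.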

If we return to the regression results for the raw data instead of the standard scores, we can show the link between Pearson's r and the slope in the regression equation.

Think of the standard deviation as a measure of the average difference from the mean across all of the cases for each variable. The standard deviation for number of cards is 1.773 and the standard deviation for family size is 1.581.

Recall that the slope of the regression line represents the change in the dependent variable associated with a one-unit change in the independent variable. Thus, when a family had one more member, we would predict that they had .971 more credit cards.

Slide 22

If the relationship between the two variables were perfect (one predicted the other without error), we could compute the slope of the line using the average amount of difference in each of the distributions: the standard deviations.

On average, the number of cards would go up 1.773 cards for a difference of 1.581 members in a family. We can simplify this by dividing the standard deviation for number of cards by the standard deviation for family size: 1.773 / 1.581 = 1.121. Thus, if the relationship were perfect, we would increase our estimate of the number of cards in a family by 1.121 for every additional member of a family.

Slide 23

If the relationship between the two variables were perfect, Pearson's r would be 1.0 (or -1.0 if the relationship were inverse). However, we know that Pearson's r is less than that: it is actually 0.866. If the slope of the regression line would be 1.121 when the relationship is perfect, then we might expect the slope to be 0.866 x 1.121 when the relationship is less than perfect. And in fact, that turns out to be true, since 0.866 x 1.121 = 0.971. The slope of the regression line is the ratio of the standard deviations multiplied by the correlation coefficient.

Slide 24

