
Hello everyone! Hope you had a good day and are getting ready for a good seminar.

Please try to finish all of the DQ assignments and the projects that you have not done yet
as soon as you can. You will be working on the final project and won't have much time to
work on your other assignments at the end of the term. The final project and the Statistical
Review are due at the end of Unit 9. I am just trying to help by reminding you what is ahead
of us so you can plan accordingly.

I am going over the major concepts in chapters 13 and 15 of the book tonight. These
chapters are very involved and can be overwhelming, so I will try to give you the "big
picture" of what is being discussed so you can understand the concepts faster and better
when you are reading the book. I will wait a few minutes in case you want to have your
book ready.

It is a lot more efficient to go over the book's graphs and output than to post them on the
whiteboard and go back and forth between the chat session and the whiteboard.

CHI-SQUARED Test of a contingency table

This test analyzes the effect of the levels of one nominal variable on the levels of another
nominal variable. Basically, we want to see if these two variables are independent. In the
example on page 564, we want to see if the driver's behavior at a stop sign is independent
of the type of vehicle he or she is driving.

To test this, 500 vehicles were observed at the stop sign and the results were recorded and
displayed in Part A of Table 13.3 on page 565. To study the independence, we need to
calculate X2 (the calculated Chi-Squared). To calculate X2 we need to do the following.

Do not let the formula on page 564 scare you. Here is what we need to do.

Find the total for each row and column and write them next to each row and column (it is
already done for us in this problem).

Find the expected frequencies (the frequencies are the numbers displayed in the Part A table).
Each expected frequency, E, is calculated by multiplying its row total and its column total
and then dividing the answer by the sample size (the total count in the table).

The book has done the calculations and shown the expected value for each frequency of
Part A in the table shown in Part B. As you see, the calculation for the bottom left number
in the Part A table (the intersection of Pickup and Stopped - 14 cases) is shown at the end
of Table 13.3.

Just multiply 50 by 251 and divide by the total of the frequencies in the Part A table,
which is 500. So, the expected frequency for the 14 in the Part A table is 25.1, and it is
shown at the same location in the Part B table on page 565.
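That one-cell calculation can be checked in a couple of lines of Python. The totals below are the ones quoted from Table 13.3; everything else is just the arithmetic described above:

```python
# Expected frequency for one cell of the Part A table:
# E = (row total) * (column total) / sample size
row_total = 50     # total for the Pickup row (from the book's table)
col_total = 251    # total for the Stopped column
n = 500            # total vehicles observed
expected = row_total * col_total / n
print(expected)    # 25.1
```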

We take the difference between each frequency and its expected value (or expected
frequency) and square it. Then we divide this value by the expected frequency. Finally,
we add all of these values to get the calculated X2. The work is displayed on page 566.

Calculated X2 = sum of (the squared differences between the frequencies and their
expected values, each divided by its expected value). The formula is on page 564, but
don't let it scare you. Just follow the worked answer and you will see what you need to do.

The value of the table X2 (or critical X2) is written as X2 (alpha, df). For this problem it
is: X2 (0.05, 4) = 9.488. Note that the degrees of freedom are (number of rows - 1) times
(number of columns - 1), which is (3-1)(3-1) = 2 * 2 = 4 for this problem.
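If you have Python handy, you can confirm the degrees of freedom and look up the critical value without a printed table. This is a sketch that assumes SciPy is installed; `chi2.ppf` is the inverse of the chi-squared CDF, so it returns the table value directly:

```python
from scipy.stats import chi2

rows, cols = 3, 3
df = (rows - 1) * (cols - 1)        # (3-1)(3-1) = 2 * 2 = 4
critical = chi2.ppf(1 - 0.05, df)   # right-tail critical value for alpha = 0.05
print(df, round(critical, 3))       # 4 9.488
```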

So, the calculated X2 = 12.431 and the table X2 = 9.488. Since calculated X2 > table X2, we
reject the H0 that driver behavior and type of vehicle are independent.

We conclude that there is evidence of a relationship between these two variables and
therefore the variables are not independent.

You can follow the instructions on Computer Solutions on page 568 to be able to use
Excel to give you the computer output for this problem.

Once you have the output you can either compare the P-value (in the output) with alpha
or you can compare the “Chi-Squared Stat” in the table (which is calculated Chi-Squared)
with Chi-Squared Critical to reject or not reject the Null that says the two variables are
independent (no relationship between them).
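Here is what the whole decision rule looks like in pure Python, on a small made-up 2x2 table (the counts are mine for illustration, not the book's data). It builds the expected frequencies, accumulates (O - E)^2 / E, and compares the result to the table value:

```python
# Hypothetical 2x2 table: rows = vehicle type, columns = behavior
observed = [[30, 20],
            [10, 40]]
row_totals = [sum(r) for r in observed]
col_totals = [sum(c) for c in zip(*observed)]
n = sum(row_totals)

chi_sq = 0.0
for i, row in enumerate(observed):
    for j, f in enumerate(row):
        e = row_totals[i] * col_totals[j] / n   # expected frequency
        chi_sq += (f - e) ** 2 / e              # add (O - E)^2 / E

df = (len(observed) - 1) * (len(observed[0]) - 1)
critical = 3.841   # chi-squared table value for alpha = 0.05, df = 1
print(round(chi_sq, 3), "reject H0" if chi_sq > critical else "do not reject H0")
```

For this made-up table the statistic comes out well above the critical value, so we would reject independence.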

One of my students a few terms ago was a police officer. He enthusiastically ran this test
on a few months of data on the tickets he had issued, to see whether red cars get more
tickets. He could not reject the null hypothesis that ticket counts and car color are
independent, so he concluded that red cars do not get more tickets. He was so thrilled
about his test.

Chapter 15 (page 639) is about the Simple Regression (or prediction) line. Most
companies (and research organizations) use some sort of regression line to predict their
future sales, costs, production, revenue, etc., by entering their historical data into a
statistical software package and generating the regression line.
Using the regression line, they can then predict any intended near-future value.

The formula y = b0 + b1 X (page 643) follows the form of a linear equation, usually
known to us as y = a X + b, in which a is the SLOPE of the line and b is the constant, or
Y-intercept. Here, in y = b0 + b1 X, b1 is the slope and b0 is the Y-intercept.

To understand this chapter and the next, as I told you early in the term, you need to
understand the formula, graph, and concept of a linear equation of the form y = a X + b.

If you still need a refresher on this topic you can go to www.purplemath.com and review
various lessons on linear equations. A linear equation is also called a first-order linear
model because the highest exponent of the variable X is ONE. You really need to know
the slope and y-intercept of a straight line, and be able to understand its graph, in order to
follow the discussions in this chapter.

I also suggest that you go over the applets that are discussed in the book. The animation
helps us understand the regression line better. We see how the line changes as values are
entered. There are some websites that are really great for this. As you enter new values,
you see the regression line change accordingly.

Basically, you need to calculate (preferably, let Excel do it for you) the b0 (Y-intercept)
and b1 (slope) of the prediction line y = b0 + b1 X to get the equation of the line. Again,
the formulas for calculating b0 and b1 are on page 645.
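If you want to see the b0 and b1 formulas in action outside Excel, here is a minimal sketch in Python on a tiny made-up data set (the data is mine, not the book's; the points were chosen to lie exactly on y = 1 + 2x so the answer is easy to check by eye):

```python
# made-up data set: 5 (x, y) pairs that lie on the line y = 1 + 2x
x = [1, 2, 3, 4, 5]
y = [3, 5, 7, 9, 11]
n = len(x)

# the summation pieces used by the textbook-style formulas
sum_x, sum_y = sum(x), sum(y)
sum_xy = sum(a * b for a, b in zip(x, y))
sum_x2 = sum(a * a for a in x)

b1 = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)   # slope
b0 = (sum_y - b1 * sum_x) / n                                   # y-intercept
print(b0, b1)   # 1.0 2.0
```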

If you have any difficulty working out the summation notations in this chapter, go back to
the -summation notation calculation- message that I sent you a few weeks ago. If you
cannot find it, let me know so I can send it to you again.

The regression line is also called the Least Squares method (page 644), since the way we
find b0 and b1 guarantees that the sum of the squares of the differences between the actual
values of y and their predicted values (using the regression line) is a MINIMUM. You don't
need to worry about the minimum comment. Just know that there is a process by which we
came up with these formulas for b0 and b1.

The "Against All Odds" online video series, whose web address I gave you last week, has a
very good animation of the regression concept, which lets you understand the underlying
idea a lot better and more easily.

If you want to calculate b1 and b0 manually, I suggest that you use a table like Table 15.1 on
page 646 to find the sum of the x values, the sum of the y values, and so forth. Just make two
columns in an Excel document, one for x and one for y (like the first two columns of that
table), and enter their values.
This way it is easier to find the values of the summation notations if you are not very
algebra savvy. Page 646 shows the work of the b0 and b1 formulas and how we find the
regression line for the given values.

When you find a regression line equation, you can use the formula to find the predicted
value of y for any x value. Look at the top of page 647. After calculating b0 and b1 on page
646, the regression formula is Y hat = 19.2 + 3.0 X. So, b0 = 19.2 and b1 = 3.

This formula predicts the number of units that a person can assemble on the assembly
line per hour, given his or her score on a manual dexterity test. Y hat is the dependent
variable, the number of units assembled per hour, and X represents the independent
variable, the score on the manual dexterity test.

Now, if we want to know the number of units assembled by a person who scores 15 on the
test, we just plug in 15 for X in the formula. So, X = 15. If you plug 15 into the regression
line for X, you get the PREDICTED value of Y (which is called Y hat). So, Y hat = 19.2 + 3
(15) = 64.2.
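In code, the prediction step is one line. The b0 and b1 below are the book's values from page 646; the function name is just my own label:

```python
b0, b1 = 19.2, 3.0   # intercept and slope from the book's example (page 646)

def predict(x):
    """Predicted units assembled per hour (y hat) for a dexterity score x."""
    return b0 + b1 * x

print(round(predict(15), 1))   # 64.2
```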

So the recruiters can predict a person's performance on the assembly line from his or her
score on the test. In this example, we expect that a person with a score of 15 on the test
will be able to assemble about 64 units per hour. Any questions?

Now, we know the ACTUAL Y value for X=16 is 70 (from the table on page 645), but its
predicted value is 67.2. Here is the work: Y hat = 19.2 + 3 (16) = 67.2. This predicted
value of Y is on the line of Figure 15.3 (page 646) at X=16.

Look at Figure 15.3 on page 646. You can find the predicted value of Y when X=16 by
simply going up from X=16 on the X-axis until you hit the line, and then going left until
you hit the Y-axis. If you could read this value accurately, it would be 67.2. As you
know from the table on page 645, the actual value is 70.

Now, the RESIDUAL, or random error (written with the Greek letter epsilon, which looks
like a fancy E), is the difference between each ACTUAL Y value and its PREDICTED Y
value for each given X value. As you saw, the actual Y value when X=16 was 70 but the
predicted value was 67.2, so Y - Y hat = 70 - 67.2 = 2.8.
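The same residual can be checked in a few lines of Python (the 70 and the 16 are from the book's table, and the line is the book's fitted equation):

```python
actual_y = 70                 # actual units/hour at x = 16 (from the book's table)
y_hat = 19.2 + 3.0 * 16       # predicted value: 67.2
residual = actual_y - y_hat   # residual = actual - predicted
print(round(residual, 1))     # 2.8
```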

To find the Standard Error of Estimate, we need to work out a chart similar to the one on
page 651. The Standard Error of Estimate is the standard deviation of the residuals in the
problem. The example on page 653 gives us the 95% prediction interval for the number of
units a person with a test score of 15 can assemble.

We call this standard deviation of the residuals the *Standard Error of Estimate*, or Sy,x.
Its formula is on page 651. If you are using Excel, you are already given its value as
"Standard Error" in the output. If you are not using Excel, then you need to calculate it
using the formula on page 651.
Going back to the example on page 653: in this case, we can expect that a person with a
test score of 15 can assemble between about 60 and 68 units, with 95% confidence. The
top of page 656 gives you the instructions for using Excel for the calculations.
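If you want to compute the Standard Error of Estimate by hand, here is a sketch in Python on a small made-up data set (not the book's numbers). It fits the line with the least-squares formulas and then takes the standard deviation of the residuals, with n - 2 in the denominator:

```python
import math

# made-up data (not the book's numbers)
x = [1, 2, 3, 4]
y = [3, 5, 6, 10]
n = len(x)

# least-squares fit, same formulas as the b0/b1 calculation
b1 = (n * sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y)) / \
     (n * sum(a * a for a in x) - sum(x) ** 2)
b0 = (sum(y) - b1 * sum(x)) / n

# standard error of estimate: sqrt( sum of squared residuals / (n - 2) )
sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
s_yx = math.sqrt(sse / (n - 2))
print(round(s_yx, 3))   # 0.949
```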

In order to have a valid regression model for the data, four requirements should be met
(two of them are listed on page 670).

First, the residual values should be normally distributed. The distribution should look
similar to the one on page 643.

Second, the mean of this distribution should be zero (which indicates that the error values
balance out). This is not discussed in the book but is good to know.

Third, the standard deviation of the residuals should be constant across all X values.
Remember that you may have many ACTUAL y values for every X value in a real data
set, and the residuals for each X value should have the same standard deviation.

You can look at page 413 to see the graphical presentation of a constant standard deviation
of the error variable across all X values. As you see, all of the normal distributions are the
same size, so they all have the same standard deviation.

And last, the residuals should be independent of each other; there shouldn't be any sign of
autocorrelation among them.

Imagine that your data is scattered like a horseshoe. If we know this, then we know that if
we use a linear equation for the regression line, we won't get a normal distribution of the
error values, and the assumption of equal standard deviations will also be violated. In
that case, the graph of residuals vs. x values will look like part C on page 672.

So, if we try to fit a linear regression to data that is not linear (and we don't know it is not
linear), then the graph of residuals vs. x values will look like part C, and that is how we
know we are not dealing with linear data (i.e., data whose scatter plot looks like a straight
line). In that case, we want to fit other regression models (they are called nonlinear).

Remember that Excel can apply different regression lines (linear, nonlinear, etc.) to a
data set. It is up to us to instruct Excel which one to use. How do we know? Well, if we
do the scatter plot before we do any tests, it tells us whether the data is roughly linear or
nonlinear. But we just use linear regression in this class. Use pages 671 and 673 for the
Excel commands.

You can use a t test to check the hypothesis that the slope of the regression line is zero
(page 664). If you fail to reject the H0 that says the slope is zero, you are saying that there
is no LINEAR relationship between the two variables X and Y. That means there might be
another kind of relationship, like a nonlinear one, between the two. The t formula is given
on page 664.
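Here is a sketch of that t test in Python, on the same kind of made-up data set as before (not the book's numbers). The statistic is the slope divided by its standard error, and 4.303 is the two-tailed t table value for alpha = 0.05 with n - 2 = 2 degrees of freedom:

```python
import math

# made-up data (not the book's)
x = [1, 2, 3, 4]
y = [3, 5, 6, 10]
n = len(x)

# least-squares fit
b1 = (n * sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y)) / \
     (n * sum(a * a for a in x) - sum(x) ** 2)
b0 = (sum(y) - b1 * sum(x)) / n

# standard error of estimate, then standard error of the slope
sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
s_yx = math.sqrt(sse / (n - 2))
x_bar = sum(x) / n
s_b1 = s_yx / math.sqrt(sum((xi - x_bar) ** 2 for xi in x))

t = b1 / s_b1     # test statistic for H0: slope = 0
critical = 4.303  # two-tailed t table value, alpha = 0.05, df = n - 2 = 2
print(round(t, 3), "reject H0" if abs(t) > critical else "do not reject H0")
```

For this data the t statistic exceeds the critical value, so we would reject the hypothesis of zero slope.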
Basically, the explained variation is SSR (Sum of Squares for Regression), the amount
of variation in the y values that is explained by the variation in the X values through the
regression line. SSE is the variation in the y values that we cannot explain. Figure 15.6
attempts to display these variations.

R^2 is the proportion of the variation in the y values that is explained by the variation in
the x values. The R^2 value is between 0 and 1. The closer R^2 is to 1, the closer
(overall) the y values are to the prediction line, which means the regression line
represents the data better.
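R^2 is easy to verify by hand on a small example. This sketch uses a made-up data set (not the book's) and the identity R^2 = 1 - SSE/SST, where SST is the total variation of the y values around their mean:

```python
# made-up data (not the book's)
x = [1, 2, 3, 4]
y = [3, 5, 6, 10]
n = len(x)

# least-squares fit
b1 = (n * sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y)) / \
     (n * sum(a * a for a in x) - sum(x) ** 2)
b0 = (sum(y) - b1 * sum(x)) / n

y_bar = sum(y) / n
sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))  # unexplained variation
sst = sum((yi - y_bar) ** 2 for yi in y)                       # total variation
r_squared = 1 - sse / sst
print(round(r_squared, 3))   # 0.931
```

An R^2 this close to 1 says the line explains most of the variation in y, which matches how tightly these points hug a straight line.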

If the normality requirement for the errors is not satisfied, or one or more of the variables
X and Y are ordinal, you can use the Spearman Rank Correlation Coefficient, rs. This is
not discussed very much in your book, though. It is just extra information.

Heteroscedasticity happens when the requirement that the variance of the residuals be
constant (the third requirement discussed earlier) is violated. Homoscedasticity is when
the requirement is met. So, part b on page 639 is a case of heteroscedasticity, because the
gap (variation) between the residuals gets larger as the X values change.

About the Statistical Review: remember to send me a copy of the report you have
selected before you start working on it. And remember that an A-grade report should
include some statistical discussion, output, and graphs.

You can even include what you think would be a good analysis (suggesting regression,
correlation, etc.) that was not done in the report you selected.

Of course, you need to be a little specific about which variables should be involved in the
analyses you suggest. For example, for correlation, you need to say which variable should
be considered with which, and why you think that would improve the report or give more
information about it.
