
THE NORMAL CURVE

As discussed in the previous chapter, the normal curve is one of a number of possible models of probability distributions. Because it is widely used and an important theoretical tool, it is given special status as a separate chapter. The normal curve is not a single curve; rather, it is an infinite number of possible curves, all described by the same algebraic expression:

f(X) = [1 / (σ√(2π))] · e^(−(X − μ)² / (2σ²))

Upon viewing this expression for the first time the initial reaction of the student is usually to panic. Don't. In general it is not necessary to "know" this formula to appreciate and use the normal curve. It is, however, useful to examine this expression for an understanding of how the normal curve operates.

First, some symbols in the expression are simply numbers. These symbols include "2", "π", and "e". The latter two are irrational numbers with unending decimal expansions, π equaling 3.14159... and e equaling 2.71828.... As discussed in the chapter on the review of algebra, it is possible to raise a "funny number", in this case "e", to a "funny power". The second symbol of interest is "X", a variable corresponding to the score value. The height of the curve at any point is a function of X. Thirdly, the final two symbols in the equation, "μ" and "σ", are called PARAMETERS, or values which, when set to particular numbers, define which of the infinite number of possible normal curves one is dealing with. The concept of parameters is very important and considerable attention will be given to them in the rest of this chapter.

A FAMILY OF DISTRIBUTIONS

The normal curve is called a family of distributions. Each member of the family is determined by setting the parameters (μ and σ) of the model to particular values (numbers). Because the μ parameter can take on any value, positive or negative, and the σ parameter can take on any positive value, the family of normal curves is quite large, consisting of an infinite number of members. This makes the normal curve a general-purpose model, able to describe a large number of naturally occurring phenomena, from test scores to the size of the stars.

Similarity of Members of the Family of Normal Curves

All the members of the family of normal curves, although different, have a number of properties in common.
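For readers who want to see the expression in action, here is a small sketch in Python (a language the text does not use; the function name `normal_pdf` is my own) that evaluates the height of the curve at a score X for given parameters:

```python
import math

def normal_pdf(x, mu, sigma):
    """Height of the normal curve with parameters mu and sigma at score x."""
    coeff = 1.0 / (sigma * math.sqrt(2 * math.pi))
    return coeff * math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))

# The curve peaks at X = mu; for the standard normal (mu=0, sigma=1)
# the peak height is 1/sqrt(2*pi), about 0.3989.
print(normal_pdf(0, 0, 1))
```

Note that the symmetry property discussed below falls out of the formula directly: scores equally far above and below μ give the same height.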
These properties include: shape, symmetry, tails approaching but never touching the X-axis, and area under the curve.

All members of the family of normal curves share the same bell shape, given that the X-axis is scaled properly. Most of the area under the curve falls in the middle. The tails of the distribution (ends) approach the X-axis but never touch it, with very little of the area under them.

All members of the family of normal curves are bilaterally symmetrical. That is, if any normal curve were drawn on a two-dimensional surface (a piece of paper), cut out, and folded through the third dimension, the two sides would be exactly alike. Human beings are approximately bilaterally symmetrical, with a right and left side.

All members of the family of normal curves have tails that approach, but never touch, the X-axis. The implication of this property is that no matter how far one travels along the number line, in either the positive or negative direction, there will still be some area under any normal curve. Thus, in order to draw the entire normal curve one would need an infinitely long line. Because most of the area under any normal curve falls within a limited range of the number line, only that part of the line segment is drawn for a particular normal curve.

All members of the family of normal curves have a total area of one (1.00) under the curve, as do all probability models or models of frequency distributions. This property, in addition to the property of symmetry, implies that the area in each half of the distribution is .50, or one half.

AREA UNDER A CURVE

Because area under a curve may seem like a strange concept to many introductory statistics students, a short intermission is proposed at this point to introduce the concept. Area is a familiar concept. For example, the area of a square is s², or side squared; the area of a rectangle is length times width; the area of a right triangle is one-half base times height; and the area of a circle is π·r². It is valuable to know these formulas if one is purchasing such things as carpeting, shingles, etc.
Areas may be added or subtracted from one another to find some resultant area. For example, suppose one had an L-shaped room and wished to purchase new carpet. One could find the area by taking the total area of the larger rectangle and subtracting the area of the rectangle that was not needed, or one could divide the area into two rectangles, find the area of each, and add the areas together. Both procedures are illustrated below:

Finding the area under a curve poses a slightly different problem. In some cases there are formulas which directly give the area between any two points; finding these formulas is what integral calculus is all about. In other cases the areas must be approximated. All of the above procedures share a common theoretical underpinning, however.

Suppose a curve was divided into equally spaced intervals on the X-axis and a rectangle drawn corresponding to the height of the curve at each of the intervals. The rectangles may be drawn either smaller than the curve, or larger, as in the two illustrations below:

In either case, if the areas of all the rectangles under the curve were added together, the sum of the areas would be an approximation of the total area under the curve. In the case of the smaller rectangles, the area would be too small; in the case of the larger ones, it would be too big. Taking the average would give a better approximation, but mathematical methods provide a better way. A better approximation may be achieved by making the intervals on the X-axis smaller. Such an approximation is illustrated below, more closely approximating the actual area under the curve.
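The rectangle idea can be sketched numerically. The code below (a Python illustration, not part of the original text; `riemann_area` is a name of my own) sums rectangle areas under the standard normal curve and shows the approximation tightening as the intervals shrink:

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Height of the normal curve at score x."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def riemann_area(lo, hi, n_rects):
    """Approximate the area under the standard normal curve between lo and hi
    by summing the areas of n_rects rectangles (left-endpoint heights)."""
    width = (hi - lo) / n_rects
    return sum(normal_pdf(lo + i * width) * width for i in range(n_rects))

# Shrinking the rectangles improves the approximation of the true area
# (about 1.0 over -4..4 for the standard normal).
for n in (10, 100, 1000):
    print(n, riemann_area(-4, 4, n))
```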

The actual area of the curve may be calculated by making the intervals infinitely small (no distance between the intervals) and then computing the area. If this last statement seems a bit bewildering, you share the bewilderment with millions of introductory calculus students. At this point the introductory statistics student must say "I believe" and trust the mathematician, or enroll in an introductory calculus course.

DRAWING A MEMBER OF THE FAMILY OF NORMAL CURVES

The standard procedure for drawing a normal curve is to draw a bell-shaped curve and an X-axis. A tick is placed on the X-axis corresponding to the highest point (middle) of the curve. Three ticks are then placed to both the right and left of the middle point. These ticks are equally spaced and include all but a very small portion of the area under the curve. The middle tick is labeled with the value of μ; sequential ticks to the right are labeled by adding the value of σ. Ticks to the left are labeled by subtracting the value of σ from μ for the three values. For example, if μ = 52 and σ = 12, then the middle value would be labeled with 52, points to the right would have the values of 64 (52 + 12), 76, and 88, and points to the left would have the values 40, 28, and 16. An example is presented below:
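The labeling procedure above is simple arithmetic, and can be sketched in a couple of lines (Python; the helper name `tick_labels` is my own):

```python
def tick_labels(mu, sigma):
    """X-axis tick values at mu and one, two, and three sigma units to either side."""
    return [mu + k * sigma for k in range(-3, 4)]

print(tick_labels(52, 12))  # [16, 28, 40, 52, 64, 76, 88]
```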

DIFFERENCES IN MEMBERS OF THE FAMILY OF NORMAL CURVES

Differences in members of the family of normal curves are a direct result of differences in values for parameters. The two parameters, μ and σ, each change the shape of the distribution in a different manner. The first, μ, determines where the midpoint of the distribution falls. Changes in μ, without changes in σ, result in moving the distribution to the right or left, depending upon whether the new value of μ is larger or smaller than the previous value, but do not change the shape of the distribution. An example of how changes in μ affect the normal curve is presented below:

Changes in the value of σ, on the other hand, change the shape of the distribution without affecting the midpoint, because σ affects the spread or the dispersion of scores. The larger the value of σ, the more dispersed the scores; the smaller the value, the less dispersed. Perhaps the easiest way to understand how σ affects the distribution is graphically. The distribution below demonstrates the effect of increasing the value of σ:

Since this distribution was drawn according to the procedure described earlier, it appears similar to the previous normal curve, except for the values on the X-axis. This procedure effectively changes the scale and hides the real effect of changes in σ. Suppose the second distribution was drawn on a rubber sheet instead of a sheet of paper and stretched to twice its original length in order to make the two scales similar. Drawing the two distributions on the same scale results in the following graphic:

Note that the shape of the second distribution has changed dramatically, being much flatter than the original distribution. It must not be as high as the original distribution because the total area under the curve must be constant, that is, 1.00. The second curve is still a normal curve; it is simply drawn on a different scale on the X-axis. A different effect on the distribution may be observed if the size of σ is decreased. Below, the new distribution is drawn according to the standard procedure for drawing normal curves:

Now both distributions are drawn on the same scale, as outlined immediately above, except that in this case the sheet is stretched before the distribution is drawn and then released, so that the two distributions appear on similar scales:

Note that the distribution is much higher in order to maintain the constant area of 1.00, and the scores are much more closely clustered around the value of μ, or the midpoint, than before.

An interactive exercise is provided to demonstrate how the normal curve changes as a function of changes in μ and σ. The exercise starts by presenting a curve with μ = 70 and σ = 10. The student may change the value of μ from 50 to 90 by moving the scroll bar on the bottom of the graph. In a similar manner, the value of σ can be adjusted from 5 to 15 by changing the scroll bar on the right side of the graph.

FINDING AREA UNDER NORMAL CURVES

Suppose that when ordering shoes to restock the shelves in the store one knew that female shoe sizes were normally distributed with μ = 7.0 and σ = 1.1. Don't worry about where these values came from at this point; there will be plenty about that later. If the area under this distribution between 7.75 and 8.25 could be found, then one would know the proportion of size eight shoes to order. The values of 7.75 and 8.25 are the real limits of the interval of size eight shoes.

Finding the areas on the curve above is easy; simply enter the values of μ, σ, and the score or scores into the correct boxes, click on a button on the display, and the area appears. The following is an example of the use of the Normal Curve Area program, and the reader should verify how the program works by entering the values in a separate screen. To find the area below 7.75 on a normal curve with μ = 7.0 and σ = 1.1, enter the following information and click on the button pointed to with the red arrow.

To find the area between scores, enter the low and high scores in the lower boxes and click on the button next to "Area Between."

The area above a given score could be found in the above program by subtracting the area below the score from 1.00, the total area under the curve, or by entering the value as a "Low Score" in the bottom boxes and a corresponding very large value as a "High Score." The following illustrates the latter method. The value of "12" is more than three σ units from the μ of 7.0, so the area will include all but the smallest fraction of the desired area.
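The Normal Curve Area program itself is not reproduced here, but the same three areas can be computed with the error function found in any standard math library. The sketch below (Python; the helper names are mine, and `math.erf` is the standard-library error function) uses the shoe-size curve with μ = 7.0 and σ = 1.1:

```python
import math

def normal_cdf(x, mu, sigma):
    """Area under the normal curve with parameters mu, sigma below score x."""
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))

def area_below(x, mu, sigma):
    return normal_cdf(x, mu, sigma)

def area_between(lo, hi, mu, sigma):
    return normal_cdf(hi, mu, sigma) - normal_cdf(lo, mu, sigma)

def area_above(x, mu, sigma):
    # Total area is 1.00, so the area above is the complement of the area below.
    return 1.0 - normal_cdf(x, mu, sigma)

# Shoe-size example: the proportion of size-8 shoes is the area
# between the real limits 7.75 and 8.25.
print(area_below(7.75, 7.0, 1.1))
print(area_between(7.75, 8.25, 7.0, 1.1))
print(area_above(7.75, 7.0, 1.1))
```

The "very large High Score" trick in the text corresponds to `area_between(7.75, 12, 7.0, 1.1)`, which differs from `area_above(7.75, 7.0, 1.1)` by only a tiny fraction.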

FINDING SCORES FROM AREA

In some applications of the normal curve, it will be necessary to find the scores that cut off some proportion or percentage of area of the normal distribution. For example, suppose one wished to know what two scores cut off the middle 75% of a normal distribution with μ = 123 and σ = 23. In order to answer questions of this nature, the Normal Curve Area program can be used as follows:

The results can be visualized as follows:

In a similar manner, the score value which cuts off the bottom proportion of a given normal curve can be found using the program. For example, a score of 138.52 cuts off .75 of a normal curve with μ = 123 and σ = 23. This score was found using the Normal Curve Area program in the following manner.

The results can be visualized as follows:
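Since the program's screens are not reproduced here, the score-from-area step can be checked numerically instead. This sketch (Python; `score_for_area` is a hypothetical helper that inverts the area function by bisection) reproduces both examples for μ = 123 and σ = 23:

```python
import math

def normal_cdf(x, mu, sigma):
    """Area under the normal curve below score x."""
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))

def score_for_area(p, mu, sigma):
    """Score below which proportion p of the curve lies, found by bisection."""
    lo, hi = mu - 10 * sigma, mu + 10 * sigma
    for _ in range(100):
        mid = (lo + hi) / 2
        if normal_cdf(mid, mu, sigma) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# Middle 75% of N(123, 23): cut off 12.5% in each tail.
low = score_for_area(0.125, 123, 23)
high = score_for_area(0.875, 123, 23)
print(round(low, 2), round(high, 2))
# Score cutting off the bottom .75 of the same curve (about 138.5):
print(round(score_for_area(0.75, 123, 23), 2))
```

By symmetry the two middle-75% cutoffs sit equally far above and below μ = 123, which is a quick sanity check on the output.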

THE STANDARD NORMAL CURVE

The standard normal curve is a member of the family of normal curves with μ = 0.0 and σ = 1.0. The value of 0.0 was selected because the normal curve is symmetrical around μ and the number system is symmetrical around 0.0. The value of 1.0 for σ is simply a unit value. The X-axis on a standard normal curve is often relabeled in terms of Z scores.

There are three areas on a standard normal curve that all introductory statistics students should know. The first is that the total area below 0.0 is .50, as the standard normal curve is symmetrical like all normal curves. This result generalizes to all normal curves in that the total area below the value of μ is .50 on any member of the family of normal curves.

The second area that should be memorized is between Z-scores of -1.00 and +1.00. It is .68 or 68%.

The total area between plus and minus one σ unit on any member of the family of normal curves is also .68. The third area is between Z-scores of -2.00 and +2.00 and is .95 or 95%.

This area (.95) also generalizes to plus and minus two σ units on any normal curve. Knowing these areas allows computation of additional areas. For example, the area between a Z-score of 0.0 and 1.0 may be found by taking half the area between Z-scores of -1.0 and 1.0, because the distribution is symmetrical between those two points. The answer in this case is .34 or 34%. A similar logic and answer apply to the area between 0.0 and -1.0, because the standard normal distribution is symmetrical around the value of 0.0. The area below a Z-score of 1.0 may be computed by adding .34 and .50 to get .84. The area above a Z-score of 1.0 may now be computed by subtracting the area just obtained from the total area under the distribution (1.00), giving a result of 1.00 - .84, or .16 or 16%. The area between -2.0 and -1.0 requires additional computation. First, the area between -2.0 and 0.0 is half of .95, or .475. Because the .475 includes too much area, the area between -1.0 and 0.0 (.34) must be subtracted in order to obtain the desired result. The correct answer is .475 - .34, or .135.
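These landmark areas are easy to verify numerically; keep in mind that .68 and .95 are rounded values (the fuller figures are .6827 and .9545). A small Python sketch, with a helper name of my own:

```python
import math

def area_between_z(lo, hi):
    """Area under the standard normal curve between two Z-scores."""
    cdf = lambda z: 0.5 * (1 + math.erf(z / math.sqrt(2)))
    return cdf(hi) - cdf(lo)

# The landmark areas (.68 and .95 in the text are rounded):
print(area_between_z(-1, 1))   # about .6827
print(area_between_z(-2, 2))   # about .9545
# Areas derived from them, as in the text:
print(area_between_z(0, 1))          # about .3413 (the text's .34)
print(area_between_z(-math.inf, 1))  # about .8413 (the text's .84)
print(area_between_z(-2, -1))        # about .1359 (the text's .135)
```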

Using a similar kind of logic to find the area between Z-scores of .5 and 1.0 will result in an incorrect answer, because the curve is not symmetrical around .5. The correct answer must be something less than .17, because the desired area is on the smaller side of the total divided area. Because of this difficulty, such areas can be found using the program included in this text. Entering the following information will produce the correct answer.

The result can be seen graphically in the following:

The following formula is used to transform a given normal distribution into the standard normal distribution:

Z = (X − μ) / σ

It was much more useful when the area between and below a score was only available in tables of the standard normal distribution. It is included here both for historical reasons and because it will appear in a different form later in this text.
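As a quick illustration of the transformation (Python; the helper name is my own):

```python
def z_score(x, mu, sigma):
    """Transform a score from any normal distribution to the standard normal."""
    return (x - mu) / sigma

# A score of 64 on the earlier example distribution (mu=52, sigma=12)
# sits exactly one sigma unit above the mean:
print(z_score(64, 52, 12))  # 1.0
```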

Skewness

The first thing you usually notice about a distribution's shape is whether it has one mode (peak) or more than one. If it's unimodal (has just one peak), like most data sets, the next thing you notice is whether it's symmetric or skewed to one side. If the bulk of the data is at the left and the right tail is longer, we say that the distribution is skewed right or positively skewed; if the peak is toward the right and the left tail is longer, we say that the distribution is skewed left or negatively skewed. Look at the two graphs below. They both have μ = 0.6923 and σ = 0.1685, but their shapes are different.

Beta(α=4.5, β=2), skewness = −0.5370

1.3846 − Beta(α=4.5, β=2), skewness = +0.5370

The first one is moderately skewed left: the left tail is longer and most of the distribution is at the right. By contrast, the second distribution is moderately skewed right: its right tail is longer and most of the distribution is at the left.

You can get a general impression of skewness by drawing a histogram (MATH200A part 1), but there are also some common numerical measures of skewness. Some authors favor one, some favor another. This Web page presents one of them. In fact, these are the same formulas that Excel uses in its Descriptive Statistics tool in Analysis Toolpak. You may remember that the mean and standard deviation have the same units as the original data, and the variance has the square of those units. However, the skewness has no units: it's a pure number, like a z-score.

Computing

The moment coefficient of skewness of a data set is

(1) skewness: g1 = m3 / m2^(3/2)

where m3 = Σ(x − x̄)³ / n and m2 = Σ(x − x̄)² / n. x̄ is the mean and n is the sample size, as usual. m3 is called the third moment of the data set. m2 is the variance, the square of the standard deviation. You'll remember that you have to choose one of two different measures of standard deviation, depending on whether you have data for the whole population or just a sample. The same is true of skewness. If you have the whole population, then g1 above is the measure of skewness. But if you have just a sample, you need the sample skewness:

(2) sample skewness: G1 = [√(n(n−1)) / (n−2)] · g1

Source: D. N. Joanes and C. A. Gill. "Comparing Measures of Sample Skewness and Kurtosis." The Statistician 47(1):183–189. Excel doesn't concern itself with whether you have a sample or a population: its measure of skewness is always G1.

Example 1: College Men's Heights

Height (inches)   Class Mark, x   Frequency, f
59.5–62.5              61               5
62.5–65.5              64              18
65.5–68.5              67              42
68.5–71.5              70              27
71.5–74.5              73               8

Here are grouped data for heights of 100 randomly selected male students, adapted from Spiegel & Stephens, Theory and Problems of Statistics 3/e (McGraw-Hill, 1999), page 68. A histogram shows that the data are skewed left, not symmetric.

But how highly skewed are they, compared to other data sets? To answer this question, you have to compute the skewness. Begin with the sample size and sample mean. (The sample size was given, but it never hurts to check.)

n = 5 + 18 + 42 + 27 + 8 = 100
x̄ = (61·5 + 64·18 + 67·42 + 70·27 + 73·8) / 100
x̄ = (305 + 1152 + 2814 + 1890 + 584) / 100
x̄ = 6745 / 100 = 67.45

Now, with the mean in hand, you can compute the skewness. (Of course in real life you'd probably use Excel or a statistics package, but it's good to know where the numbers come from.)

Class Mark, x       61        64       67       70       73   |    Sum    | Sum/n
Frequency, f         5        18       42       27        8   |           |
xf                 305      1152     2814     1890      584   |   6745    | x̄ = 67.45
(x−x̄)            −6.45     −3.45    −0.45     2.55     5.55   |    n/a    | n/a
(x−x̄)²f         208.01    214.25     8.51   175.57   246.42   |  852.75   | m2 = 8.5275
(x−x̄)³f       −1341.68   −739.15    −3.83   447.70  1367.63   | −269.33   | m3 = −2.6933

Finally, the skewness is

g1 = m3 / m2^(3/2) = −2.6933 / 8.5275^(3/2) = −0.1082

But wait, there's more! That would be the skewness if you had data for the whole population. But obviously there are more than 100 male students in the world, or even in almost any school, so what you have here is a sample, not the population. You must compute the sample skewness:

G1 = [√(100·99) / 98] · [−2.6933 / 8.5275^(3/2)] = −0.1098

Interpreting

If skewness is positive, the data are positively skewed or skewed right, meaning that the right tail of the distribution is longer than the left. If skewness is negative, the data are negatively skewed or skewed left, meaning that the left tail is longer. If skewness = 0, the data are perfectly symmetrical. But a skewness of exactly zero is quite unlikely for real-world data, so how can you interpret the skewness number? Bulmer, M. G., Principles of Statistics (Dover, 1979), a classic, suggests this rule of thumb:

If skewness is less than −1 or greater than +1, the distribution is highly skewed.
If skewness is between −1 and −½ or between +½ and +1, the distribution is moderately skewed.
If skewness is between −½ and +½, the distribution is approximately symmetric.

With a skewness of −0.1098, the sample data for student heights are approximately symmetric. Caution: This is an interpretation of the data you actually have. When you have data for the whole population, that's fine. But when you have a sample, the sample skewness doesn't necessarily

apply to the whole population. In that case the question is: from the sample skewness, can you conclude anything about the population skewness? To answer that question, see the next section.

Inferring

Your data set is just one sample drawn from a population. Maybe, from ordinary sample variability, your sample is skewed even though the population is symmetric. But if the sample is skewed too much for random chance to be the explanation, then you can conclude that there is skewness in the population. But what do I mean by "too much for random chance to be the explanation"? To answer that, you need to divide the sample skewness G1 by the standard error of skewness (SES) to get the test statistic, which measures how many standard errors separate the sample skewness from zero:

(3) test statistic: Zg1 = G1 / SES, where SES = √[ 6n(n−1) / ((n−2)(n+1)(n+3)) ]

This formula is adapted from page 85 of Cramer, Duncan, Basic Statistics for Social Research (Routledge, 1997). (Some authors suggest √(6/n), but for small samples that's a poor approximation. And anyway, we've all got calculators, so you may as well do it right.) The critical value of Zg1 is approximately 2. (This is a two-tailed test of skewness ≠ 0 at roughly the 0.05 significance level.)

If Zg1 < −2, the population is very likely skewed negatively (though you don't know by how much).
If Zg1 is between −2 and +2, you can't reach any conclusion about the skewness of the population: it might be symmetric, or it might be skewed in either direction.
If Zg1 > +2, the population is very likely skewed positively (though you don't know by how much).

Don't mix up the meanings of this test statistic and the amount of skewness. The amount of skewness tells you how highly skewed your sample is: the bigger the number, the bigger the skew. The test statistic tells you whether the whole population is probably skewed, but not by how much: the bigger the number, the higher the probability.

Estimating

GraphPad suggests a confidence interval for skewness:

(4) 95% confidence interval of population skewness = G1 ± 2 SES

I'm not so sure about that. Joanes and Gill point out that sample skewness is an unbiased estimator of population skewness for normal distributions, but not others. So I would say, compute that confidence interval, but take it with several grains of salt, and the further the sample skewness is from zero, the more skeptical you should be.

For the college men's heights, recall that the sample skewness was G1 = −0.1098. The sample size was n = 100, and therefore the standard error of skewness is

SES = √[ (6·100·99) / (98·101·103) ] = 0.2414

The test statistic is

Zg1 = G1 / SES = −0.1098 / 0.2414 = −0.45
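All of the skewness numbers in this example (g1, G1, SES, and Zg1) can be checked with a few lines of code. This is a Python sketch of formulas (1), (2), and (3) applied to the grouped heights data; the variable names are mine:

```python
import math

# Grouped heights data: class marks and frequencies from Example 1.
marks = [61, 64, 67, 70, 73]
freqs = [5, 18, 42, 27, 8]

n = sum(freqs)
mean = sum(x * f for x, f in zip(marks, freqs)) / n
m2 = sum((x - mean) ** 2 * f for x, f in zip(marks, freqs)) / n  # variance
m3 = sum((x - mean) ** 3 * f for x, f in zip(marks, freqs)) / n  # third moment

g1 = m3 / m2 ** 1.5                          # equation (1): population skewness
G1 = g1 * math.sqrt(n * (n - 1)) / (n - 2)   # equation (2): sample skewness
ses = math.sqrt(6.0 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))
z_g1 = G1 / ses                              # equation (3): test statistic

print(round(g1, 4), round(G1, 4))     # -0.1082 -0.1098
print(round(ses, 4), round(z_g1, 2))  # 0.2414 -0.45
```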

This is quite small, so it's impossible to say whether the population is symmetric or skewed. Since the sample skewness is small, a confidence interval is probably reasonable: G1 ± 2 SES = −.1098 ± 2(.2414) = −.1098 ± .4828 = −0.5926 to +0.3730. You can give a 95% confidence interval of skewness as about −0.59 to +0.37, more or less.

Kurtosis

If a distribution is symmetric, the next question is about the central peak: is it high and sharp, or short and broad? You can get some idea of this from the histogram, but a numerical measure is more precise. The height and sharpness of the peak relative to the rest of the data are measured by a number called kurtosis. Higher values indicate a higher, sharper peak; lower values indicate a lower, less distinct peak. This occurs because, as Wikipedia's article on kurtosis explains, higher kurtosis means more of the variability is due to a few extreme differences from the mean, rather than a lot of modest differences from the mean. Balanda and MacGillivray say the same thing in another way: increasing kurtosis is associated with the "movement of probability mass from the shoulders of a distribution into its center and tails." (Kevin P. Balanda and H. L. MacGillivray. "Kurtosis: A Critical Review." The American Statistician 42:2 [May 1988], pp 111–119, drawn to my attention by Karl Ove Hufthammer.)

You may remember that the mean and standard deviation have the same units as the original data, and the variance has the square of those units. However, the kurtosis has no units: it's a pure number, like a z-score. The reference standard is a normal distribution, which has a kurtosis of 3. In token of this, often the excess kurtosis is presented: excess kurtosis is simply kurtosis − 3. For example, the kurtosis reported by Excel is actually the excess kurtosis. A normal distribution has kurtosis exactly 3 (excess kurtosis exactly 0). Any distribution with kurtosis ≈3 (excess ≈0) is called mesokurtic.
A distribution with kurtosis <3 (excess kurtosis <0) is called platykurtic. Compared to a normal distribution, its central peak is lower and broader, and its tails are shorter and thinner. A distribution with kurtosis >3 (excess kurtosis >0) is called leptokurtic. Compared to a normal distribution, its central peak is higher and sharper, and its tails are longer and fatter.

Visualizing

Kurtosis is unfortunately harder to picture than skewness, but these illustrations, suggested by Wikipedia, should help. All three of these distributions have a mean of 0, a standard deviation of 1, and a skewness of 0, and all are plotted on the same horizontal and vertical scale. Look at the progression from left to right, as kurtosis increases.

Uniform(min=−√3, max=√3), kurtosis = 1.8, excess = −1.2

Normal(μ=0, σ=1), kurtosis = 3, excess = 0

Logistic(α=0, β=0.55153), kurtosis = 4.2, excess = 1.2

Moving from the illustrated uniform distribution to a normal distribution, you see that the shoulders have transferred some of their mass to the center and the tails. In other words, the intermediate values have become less likely and the central and extreme values have become more likely. The kurtosis increases while the standard deviation stays the same, because more of the variation is due to extreme values. Moving from the normal distribution to the illustrated logistic distribution, the trend continues. There is even less in the shoulders and even more in the tails, and the central peak is higher and narrower. How far can this go? What are the smallest and largest possible values of kurtosis? The smallest possible kurtosis is 1 (excess kurtosis −2), and the largest is ∞, as shown here:

Discrete: two equally likely values, kurtosis = 1, excess = −2

Student's t (df=4), kurtosis = ∞, excess = ∞

A discrete distribution with two equally likely outcomes, such as winning or losing on the flip of a coin, has the lowest possible kurtosis. It has no central peak and no real tails, and you could say that it's all shoulder: it's as platykurtic as a distribution can be. At the other extreme, Student's t distribution with four degrees of freedom has infinite kurtosis. A distribution can't be any more leptokurtic than this.

Computing

The moment coefficient of kurtosis of a data set is computed almost the same way as the coefficient of skewness: just change the exponent 3 to 4 in the formulas:

(5) kurtosis: a4 = m4 / m2²  and  excess kurtosis: g2 = a4 − 3

where m4 = Σ(x − x̄)⁴ / n and m2 = Σ(x − x̄)² / n

Again, the excess kurtosis is generally used, because the excess kurtosis of a normal distribution is 0. x̄ is the mean and n is the sample size, as usual. m4 is called the fourth moment of the data set. m2 is the variance, the square of the standard deviation. Just as with variance, standard deviation, and skewness, the above is the final computation if you have data for the whole population. But if you have data for only a sample, you have to compute the sample excess kurtosis using this formula, which comes from Joanes and Gill:

(6) sample excess kurtosis: G2 = [ (n−1) / ((n−2)(n−3)) ] · [ (n+1)·g2 + 6 ]

Excel doesn't concern itself with whether you have a sample or a population: its measure of kurtosis is always G2.

Example: Let's continue with the example of the college men's heights, and compute the kurtosis of the data set. n = 100, x̄ = 67.45 inches, and the variance m2 = 8.5275 in² were computed earlier.

Class Mark, x       61        64      67       70       73   |    Sum     | Sum/n
Frequency, f         5        18      42       27        8   |            |
(x−x̄)            −6.45     −3.45   −0.45     2.55     5.55   |    n/a     | n/a
(x−x̄)⁴f        8653.84   2550.05    1.72  1141.63  7590.35   | 19937.60   | m4 = 199.3760

Finally, the kurtosis is

a4 = m4 / m2² = 199.3760 / 8.5275² = 2.7418

and the excess kurtosis is

g2 = 2.7418 − 3 = −0.2582

But this is a sample, not the population, so you have to compute the sample excess kurtosis:

G2 = [ 99 / (98·97) ] · [ 101·(−0.2582) + 6 ] = −0.2091

This sample is slightly platykurtic: its peak is just a bit shallower than the peak of a normal distribution.
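The kurtosis figures for the heights data can be checked the same way as the skewness figures. This Python sketch (variable names mine) applies formulas (5) and (6), and also computes the standard errors used for inference:

```python
import math

# Grouped heights data: class marks and frequencies from Example 1.
marks = [61, 64, 67, 70, 73]
freqs = [5, 18, 42, 27, 8]

n = sum(freqs)
mean = sum(x * f for x, f in zip(marks, freqs)) / n
m2 = sum((x - mean) ** 2 * f for x, f in zip(marks, freqs)) / n  # variance
m4 = sum((x - mean) ** 4 * f for x, f in zip(marks, freqs)) / n  # fourth moment

a4 = m4 / m2 ** 2  # equation (5): kurtosis
g2 = a4 - 3        # excess kurtosis
G2 = (n - 1) / ((n - 2) * (n - 3)) * ((n + 1) * g2 + 6)  # equation (6)

# Standard errors for the inference step (SES, then SEK built from it):
ses = math.sqrt(6.0 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))
sek = 2 * ses * math.sqrt((n * n - 1) / ((n - 3) * (n + 5)))
z_g2 = G2 / sek

print(round(a4, 4), round(g2, 4), round(G2, 4))  # 2.7418 -0.2582 -0.2091
print(round(z_g2, 2))  # -0.44
```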

Inferring

Your data set is just one sample drawn from a population. How far must the excess kurtosis be from 0 before you can say that the population also has nonzero excess kurtosis? The answer comes in a similar way to the similar question about skewness. You divide the sample excess kurtosis by the standard error of kurtosis (SEK) to get the test statistic, which tells you how many standard errors the sample excess kurtosis is from zero:

(7) test statistic: Zg2 = G2 / SEK, where SEK = 2 · SES · √[ (n²−1) / ((n−3)(n+5)) ]

The formula is adapted from page 89 of Duncan Cramer's Basic Statistics for Social Research (Routledge, 1997). (Some authors suggest √(24/n), but for small samples that's a poor approximation. And anyway, we've all got calculators, so you may as well do it right.) The critical value of Zg2 is approximately 2. (This is a two-tailed test of excess kurtosis ≠ 0 at approximately the 0.05 significance level.)

If Zg2 < −2, the population very likely has negative excess kurtosis (kurtosis <3, platykurtic), though you don't know how much.
If Zg2 is between −2 and +2, you can't reach any conclusion about the kurtosis: excess kurtosis might be positive, negative, or zero.
If Zg2 > +2, the population very likely has positive excess kurtosis (kurtosis >3, leptokurtic), though you don't know how much.

For the sample of college men's heights (n=100), you found excess kurtosis of G2 = −0.2091. The sample is platykurtic, but is this enough to let you say that the whole population is platykurtic (has lower kurtosis than the bell curve)? First compute the standard error of kurtosis. n = 100, and the SES was previously computed as 0.2414:

SEK = 2 · 0.2414 · √[ (100²−1) / (97·105) ] = 0.4784

The test statistic is

Zg2 = G2 / SEK = −0.2091 / 0.4784 = −0.44

You can't say whether the kurtosis of the population is the same as or different from the kurtosis of a normal distribution.

Assessing Normality

There are many ways to assess normality, and unfortunately none of them are without problems.
Graphical methods are a good start, such as plotting a histogram and making a quantile plot. (You can find a TI-83 program to do those at MATH200A Program: Statistics Utilities for TI-83/84.) The University of Surrey has a good survey of problems with normality tests, at "How do I test the normality of a variable's distribution?" That page recommends using the test statistics individually. One test is the D'Agostino-Pearson omnibus test, so called because it uses the test statistics for both skewness and kurtosis to come up with a single p-value. The test statistic is

(8) DP = Zg1² + Zg2², which follows a χ² distribution with df=2

You can look up the p-value in a table, or use χ²cdf on a TI-83 or TI-84.
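The omnibus computation is easy to do without a table, because a χ² variable with 2 degrees of freedom has the closed-form tail probability e^(−x/2). A Python sketch (helper names mine):

```python
import math

def dp_statistic(z_skew, z_kurt):
    """D'Agostino-Pearson omnibus statistic: chi-squared with 2 df."""
    return z_skew ** 2 + z_kurt ** 2

def chi2_sf_2df(x):
    """P(chi-squared(2) > x); for 2 df this has the closed form exp(-x/2)."""
    return math.exp(-x / 2)

dp = dp_statistic(-0.45, -0.44)
print(round(dp, 4), round(chi2_sf_2df(dp), 4))  # 0.3961 0.8203
```

The large p-value here anticipates the conclusion drawn below for the heights data: no reason to reject normality.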

Caution: The D'Agostino-Pearson test has a tendency to err on the side of rejecting normality, particularly with small sample sizes. David Moriarty, in his StatCat utility, recommends that you don't use D'Agostino-Pearson for sample sizes below 20.

For college students' heights you had test statistics Zg1 = −0.45 for skewness and Zg2 = −0.44 for kurtosis. The omnibus test statistic is

DP = Zg1² + Zg2² = (−0.45)² + (−0.44)² = 0.3961

and the p-value for χ²(2 df) > 0.3961, from a table or a statistics calculator, is 0.8203. You cannot reject the assumption of normality. (Remember, you never accept the null hypothesis, so you can't say from this test that the distribution is normal.) The histogram suggests normality, and this test gives you no reason to reject that impression.

Example 2: Size of Rat Litters

For a second illustration of inferences about skewness and kurtosis of a population, I'll use an example from Bulmer's Principles of Statistics: frequency distribution of litter size in rats, n=815.

Litter size   1   2   3    4    5    6    7    8   9  10  11  12
Frequency     7  33  58  116  125  126  121  107  56  37  25   4

I'll spare you the detailed calculations, but you should be able to verify them by following equation (1) and equation (2):

n = 815, x̄ = 6.1252, m2 = 5.1721, m3 = 2.0316
skewness g1 = 0.1727 and sample skewness G1 = 0.1730

The sample is roughly symmetric but slightly skewed right, which looks about right from the histogram. The standard error of skewness is

SES = √[ (6·815·814) / (813·816·818) ] = 0.0856

Dividing the skewness by the SES, you get the test statistic

Zg1 = 0.1730 / 0.0856 = 2.02

Since this is greater than 2, you can say that there is some positive skewness in the population. Again, "some positive skewness" just means a figure greater than zero; it doesn't tell us anything more about the magnitude of the skewness. If you go on to compute a 95% confidence interval of skewness from equation (4), you get 0.1730 ± 2·0.0856 = 0.00 to 0.34.

What about the kurtosis? You should be able to follow equation (5) and compute a fourth moment of m4 = 67.3948. You already have m2 = 5.1721, and therefore

kurtosis a4 = m4 / m2² = 67.3948 / 5.1721² = 2.5194
excess kurtosis g2 = 2.5194 − 3 = −0.4806
sample excess kurtosis G2 = [ 814 / (813·812) ] · [ 816·(−0.4806) + 6 ] = −0.4762

So the sample is moderately less peaked than a normal distribution. Again, this matches the histogram, where you can see the higher shoulders.

What if anything can you say about the population? For this you need equation (7). Begin by computing the standard error of kurtosis, using n = 815 and the previously computed SES of 0.0856:

SEK = 2 · SES · √[ (n²−1) / ((n−3)(n+5)) ]
SEK = 2 · 0.0856 · √[ (815²−1) / (812·820) ] = 0.1711

and divide:

Zg2 = G2 / SEK = −0.4762 / 0.1711 = −2.78

Since Zg2 is comfortably below −2, you can say that the distribution of all litter sizes is platykurtic, less sharply peaked than the normal distribution. But be careful: you know that it is platykurtic, but you don't know by how much.

You already know the population is not normal, but let's apply the D'Agostino-Pearson test anyway:

DP = 2.02² + (−2.78)² = 11.8088
p-value = P( χ²(2) > 11.8088 ) = 0.0027

The test agrees with the separate tests of skewness and kurtosis: sizes of rat litters, for the entire population of rats, are not normally distributed.
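The rat-litter figures can be verified the same way as Example 1. This Python sketch (variable names mine) recomputes the moments directly from Bulmer's frequency table and reproduces the skewness and kurtosis inferences:

```python
import math

# Bulmer's rat-litter frequency table: litter sizes 1..12.
sizes = list(range(1, 13))
freqs = [7, 33, 58, 116, 125, 126, 121, 107, 56, 37, 25, 4]

n = sum(freqs)
mean = sum(x * f for x, f in zip(sizes, freqs)) / n
m2 = sum((x - mean) ** 2 * f for x, f in zip(sizes, freqs)) / n
m3 = sum((x - mean) ** 3 * f for x, f in zip(sizes, freqs)) / n
m4 = sum((x - mean) ** 4 * f for x, f in zip(sizes, freqs)) / n

g1 = m3 / m2 ** 1.5                          # population skewness
G1 = g1 * math.sqrt(n * (n - 1)) / (n - 2)   # sample skewness
g2 = m4 / m2 ** 2 - 3                        # excess kurtosis
G2 = (n - 1) / ((n - 2) * (n - 3)) * ((n + 1) * g2 + 6)  # sample excess kurtosis

ses = math.sqrt(6.0 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))
sek = 2 * ses * math.sqrt((n * n - 1) / ((n - 3) * (n + 5)))

print(n, round(mean, 4))                 # 815 6.1252
print(round(G1, 4), round(G1 / ses, 2))  # 0.173 2.02
print(round(G2, 4), round(G2 / sek, 2))  # -0.4762 -2.78
```

(If you square and add the two unrounded test statistics, you get a DP statistic very slightly different from the 11.8088 in the text, which was computed from the rounded values 2.02 and −2.78; the conclusion is the same.)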
