
GEOG-2210B

Descriptive Statistics

2014

EXERCISE TWO Descriptive Statistics


Descriptive statistics are fundamental tools in guiding inferences. They provide a simple summary of a complex set of data, allow a more straightforward and objective appreciation of the data, and allow objective comparisons of comparable data sets. This exercise introduces descriptive statistics and intermediate calculations using the data set from exercise one. It describes some of the basic steps that should be followed when first working with a data set in a research project.

Intermediate statistics are not used for interpretation, but are numbers incorporated into subsequent calculations of descriptive and inferential statistics. Typical intermediate statistics are the sum of the values ($\sum x$) and the sum of the values squared ($\sum x^2$). In examination questions, you will often be provided with intermediate statistics (whether you need them or not!) to limit the time spent processing large data sets, but you must also know how to calculate intermediate statistics.

Descriptive statistics provide a succinct representation of the properties of a data set. Simple, well-clustered data can be adequately described by a measure of central tendency and a measure of dispersion. A number of different statistics are available for each: these include the mode, the mean and the median as indicators of central tendency, and the range, interquartile range, standard deviation and coefficient of variation as measures of dispersion. Means of integer (whole number) data need not be integers, although they may be ridiculous by implication (e.g., consider the average size of families as 3.87 people).

Note that simple measures of central tendency become meaningless for multi-modal distributions. Often, such distributions indicate a mix of populations, each of which may be characterized with its own descriptive statistics, if they can only be teased apart. As a data set becomes more complex in its distribution, additional descriptive statistics such as skewness and kurtosis are required.
However, if the mean is substantially larger than the median, it suggests a positively skewed distribution, and vice versa. Basic descriptive statistics can be combined to provide additional insight. For example, the coefficient of variation (= 100 × standard deviation / mean) is a dimensionless number that allows the variability of two data sets with widely different means or different units to be compared. The mean divided by the standard deviation (= 1/CV, with CV expressed as a fraction rather than a percentage) is sometimes considered a signal to noise ratio: a measure of how strong the mean is compared to the error in a measurement. Similarly, a good way of indicating the spread of a data set, especially on a graph, is to compute error bars. The standard error bar is determined by $\pm s_x$ or by $\pm 2s_x$.
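As a rough illustration of how these combined statistics behave, the sketch below compares the female weight and male distance columns of Table 2.1 using Python's standard `statistics` module. The function name `summarize` and the dictionary keys are illustrative choices, not part of the exercise.

```python
import statistics

# Values copied from Table 2.1 (female weight in kg, male distance in km).
female_weight_kg = [54, 84, 49, 78, 69, 48, 73, 59, 73, 60, 49, 55, 52]
male_distance_km = [1, 573, 192, 198, 211, 544, 588, 4, 573, 189, 4298]

def summarize(values):
    """Return mean, median, sample SD, CV (%) and signal to noise ratio."""
    mean = statistics.mean(values)
    median = statistics.median(values)
    sd = statistics.stdev(values)  # sample standard deviation (n - 1 denominator)
    return {
        "mean": mean,
        "median": median,
        "sd": sd,
        "cv_percent": 100 * sd / mean,
        "signal_to_noise": mean / sd,
    }

weight = summarize(female_weight_kg)
distance = summarize(male_distance_km)

for name, stats in (("female weight", weight), ("male distance", distance)):
    print(name, {key: round(value, 2) for key, value in stats.items()})
```

The distance mean is far above its median, which points to a positive skew, and the CV lets the two variables be compared even though one is in kilograms and the other in kilometres.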

Graphical representation of Central Tendency and Dispersion


In Figure 2.1, the median (marked with a cross, +) shows less influence from the large outliers to the right of the graph than the mean does. The large cross centred on the mean is a common method of graphically representing central tendency and dispersion of bivariate data. The standard deviation bars act to prevent identification of differences when the variability (dispersion) of the data does not justify it. In this case most of the male group lie well within the one standard deviation range. One constraint with using a logarithmic axis is that error bars ranging into negative values cannot be plotted, as negative numbers cannot be shown on log scales.
[Plot legend: Raw Data; Mean +/- 1 Std. Dev.; Median]

Figure 2.1. Scattergram of male distance vs. weight


The histograms below have the median (diamond) and the mean (circle) ± two standard deviations (arrows) marked on them. These ranges provide a convenient (and theoretically sound) basis for identifying outliers from a designated group.

Figure 2.2. Histograms of weight by sex with mean, median and mean ± two standard deviations

Some of the calculations required below can be tedious without effective use of a calculator. Calculators with a $\Sigma$ key probably store intermediate statistics. It will be worthwhile learning such a short cut now using a simple synthetic data set. For example, a data set consisting of {2, 3, 4} has $n = 3$, $\sum x = 9$ and $\sum x^2 = 29$. Spreadsheets start to become a significant advantage now in taking over these tedious operations. However, tests and examinations may require some compilations of data, and spreadsheets are not allowed. Use the spreadsheet as a check on your manual calculations, to provide more professional exercise reports, and to build expertise in computer assisted analysis.

Table 2.1 contains the height, weight and distance from parental home data set used in the previous exercise. The objective is to demonstrate the derivation of descriptive statistics and to explore the use of descriptive statistics in the efficient analysis of research questions. Many of the terms are described in the text, and a summary listing of descriptive statistics follows.

Notation and Formulae


In a sample of observations $\{x_i\}$:

$i$ signifies any individual in the set $\{x\}$ of $n$ observations of variable $x$

$n$ is the number of individuals in a sample group

$\sum x$ is the sum of the sample observations ($x_1 + x_2 + x_3 + \dots + x_n$)

$\sum x^2$ is the sum of the sample observations squared, known as the uncorrected sum of squares: $\sum x^2 = (x_1^2 + x_2^2 + x_3^2 + \dots + x_n^2)$

$SS_x = \sum x^2 - (\sum x)^2 / n$ is the corrected sum of squares (Note that given the sums $\sum x^2$ and $\sum x$, the order of calculation is power, multiply and divide, add and subtract.)

$\sqrt{(x)}$ or $(x)^{0.5}$ indicates the square root of the enclosed terms
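The intermediate statistics defined above can be checked with a few lines of code. The following is a minimal sketch; the function name `intermediate_stats` is an illustrative choice, and the inputs are the synthetic set {2, 3, 4} and the female weights from Table 2.1.

```python
def intermediate_stats(values):
    """Return n, the sum of x, the uncorrected and the corrected sum of squares."""
    n = len(values)
    sum_x = sum(values)
    sum_x2 = sum(v * v for v in values)   # uncorrected sum of squares
    ss_x = sum_x2 - sum_x ** 2 / n        # corrected sum of squares
    return n, sum_x, sum_x2, ss_x

# Synthetic data set from the text: n = 3, sum = 9, sum of squares = 29.
print(intermediate_stats([2, 3, 4]))

# Female weights (kg) copied from Table 2.1.
female_weight_kg = [54, 84, 49, 78, 69, 48, 73, 59, 73, 60, 49, 55, 52]
print(intermediate_stats(female_weight_kg))
```

Note that the power is applied to the sum before dividing by n, matching the order of calculation given in the notation.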

The mean of the sample is: $\bar{x} = \sum x / n$

The median is obtained by ranking the data and taking the middle value. Where ranks are tied, either assign ranks in order of collection, or take the tied ranks (e.g., 2nd, 3rd) and assign the average rank to all tied values (e.g., 2.5, 2.5). If the sample size ($n$) is even, there is no central value; take the average of the two middle values.

The sample standard deviation is:

$$s_x = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}} = \sqrt{\frac{SS_x}{n - 1}} = \sqrt{\frac{\sum x^2 - (\sum x)^2 / n}{n - 1}}$$

The last of the definitions is the easiest way of calculating standard deviation.

The coefficient of variation (%) is $CV = 100 (s_x / \bar{x})$

The signal to noise ratio is $\bar{x} / s_x$

Notation

The notation used here and in much of the course is simplified. The $x$ should be subscripted with $i$ ($x_i$) to imply the individual members of the set. The $\sum$ should normally indicate the range over which $i$ varies in making the summation. Thus the mean should be defined as follows:

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$

The subscript ($i = 1$) and superscript ($n$) on the $\sum$ imply that all the members of $\{x\}$ should be summed. The equation can be read as "sum all members of the set $x$ from the first ($i = 1$) to the last ($i = n$)". In most cases, $\sum$ will imply summation over the entire set and the subscripts can be safely omitted.
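To show how the descriptive statistics follow from the intermediate statistics alone, here is a small sketch that starts from the worked male weight values in Table 2.2 (n = 11, sum of x = 848, sum of x squared = 67,336). The function name `describe` and its return keys are illustrative, and the sample (n - 1) form of the standard deviation is assumed, which is the form that reproduces the worked answers in Table 2.3.

```python
import math

def describe(n, sum_x, sum_x2):
    """Derive descriptive statistics from the intermediate statistics."""
    mean = sum_x / n
    ss_x = sum_x2 - sum_x ** 2 / n      # corrected sum of squares
    variance = ss_x / (n - 1)           # sample variance
    sd = math.sqrt(variance)            # sample standard deviation
    return {
        "mean": mean,
        "ss_x": ss_x,
        "variance": variance,
        "sd": sd,
        "cv_percent": 100 * sd / mean,
        "signal_to_noise": mean / sd,
    }

male_weight = describe(n=11, sum_x=848, sum_x2=67336)
for key, value in male_weight.items():
    print(f"{key}: {value:.3f}")
```

Rounded to three significant figures, these values agree with the male weight column of Table 2.3.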


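Table 2.1. Height, weight and distance-to-home data for students in a Geography 2210 Class

| Sex | Height (m) | Rank | Weight (kg) | Rank | Distance (km) | Rank |
|---|---|---|---|---|---|---|
| M | 1.80 | 8 | 104 | | 1 | 11 |
| M | 1.84 | 6 | 68 | | 573 | 3 |
| M | 1.89 | 3 | 75 | | 192 | 8 |
| M | 1.79 | 9 | 73 | | 198 | 7 |
| M | 1.85 | 4 | 77 | | 211 | 6 |
| M | 1.62 | 11 | 65 | | 544 | 5 |
| M | 1.85 | 5 | 68 | | 588 | 2 |
| M | 1.83 | 7 | 60 | | 4 | 10 |
| M | 1.70 | 10 | 73 | | 573 | 4 |
| M | 1.93 | 1 | 85 | | 189 | 9 |
| M | 1.91 | 2 | 99 | | 4298 | 1 |
| F | 1.61 | 9 | 54 | | 2 | |
| F | 1.60 | 10.5= | 84 | | 528 | |
| F | 1.60 | 10.5= | 49 | | 0.1 | |
| F | 1.85 | 1 | 78 | | 35 | |
| F | 1.65 | 7 | 69 | | 92 | |
| F | 1.45 | 13 | 48 | | 249 | |
| F | 1.70 | 5 | 73 | | 5 | |
| F | 1.75 | 4 | 59 | | 7 | |
| F | 1.84 | 2 | 73 | | 777 | |
| F | 1.68 | 6 | 60 | | 8 | |
| F | 1.63 | 8 | 49 | | 45 | |
| F | 1.80 | 3 | 55 | | 200 | |
| F | 1.57 | 12 | 52 | | 98 | |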
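The ranking and median steps can be sketched in code. The example below ranks the female heights from Table 2.1 from largest (rank 1) to smallest and gives tied values the average of their ranks, which is one of the two tie conventions described in the notation section. The function names `descending_ranks` and `median_from_sorted` are illustrative.

```python
def descending_ranks(values):
    """Rank 1 is the largest value; tied values share the average of their ranks."""
    ordered = sorted(values, reverse=True)
    ranks = []
    for v in values:
        first = ordered.index(v) + 1             # best rank held by this value
        tied = ordered.count(v)                  # number of observations sharing it
        ranks.append(first + (tied - 1) / 2)     # average of the tied ranks
    return ranks

def median_from_sorted(values):
    """Middle value, or the mean of the two middle values when n is even."""
    ordered = sorted(values)
    mid = len(ordered) // 2
    if len(ordered) % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

female_height_m = [1.61, 1.60, 1.60, 1.85, 1.65, 1.45, 1.70,
                   1.75, 1.84, 1.68, 1.63, 1.80, 1.57]

print(descending_ranks(female_height_m))
print(median_from_sorted(female_height_m))
```

The two observations of 1.60 m both receive rank 10.5, as in the table. The male distance ranks in Table 2.1 use the other convention, assigning tied values consecutive ranks in order of collection.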


Geography 2210B
EXERCISE 2 Descriptive Statistics
Instructions and Exercise Sheets

Name:_______________ Student #:____________ T.A.:_________________

1. Intermediate Statistics
(a) For each sample group, calculate sample size ($n$), the sum of data ($\sum x$) and the uncorrected ($\sum x^2$) and corrected sum of squares ($SS_x = \sum x^2 - (\sum x)^2 / n$). The symbols and terms that you will need are defined at the end of the exercise. To start with, use the worked examples in the table to check your technique. Place your answers in Table 2.2. (6)

Table 2.2. Intermediate statistics for Geography 2210 class data

| | Height (m) Male | Height (m) Female | Weight (kg) Male | Weight (kg) Female | Distance (km) Male | Distance (km) Female |
|---|---|---|---|---|---|---|
| $n$ | | | 11 | 13 | 11 | |
| $\sum x$ | | | 848 | 803 | 7,371 | |
| $\sum x^2$ | | | 67,336 | 51,391 | 1.9927E7* | |
| $SS_x$ | | | 1,962.91 | 1,790.308 | 1.4988E7* | |

* Note: E-notation is a computer shorthand form of exponential notation that is used to represent very large or very small numbers. Exponential notation splits a number into a mantissa, which gives the numerical value, and an exponent, which indicates the order of magnitude. Thus 19,927,000 is represented as $1.9927 \times 10^7$, or in the shorthand form, 1.9927E7. This so-called scientific notation or format is awkward to read, but allows precise application of the concept of significant figures and, more pragmatically, permits very different numbers to fit in the same space.
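For readers checking their work on a computer, the short sketch below shows one way E-notation can be produced and read back in Python. The variable names are illustrative; note that Python writes the exponent with a sign and two digits (E+07), which is equivalent to the E7 form used in Table 2.2.

```python
value = 19_927_000.0

# Format with four decimal places in the mantissa, using E-notation.
text = f"{value:.4E}"
print(text)

# Split the shorthand into its mantissa and exponent.
mantissa, exponent = text.split("E")
print(float(mantissa), int(exponent))

# Reading the shorthand back gives the original number.
print(float("1.9927E7") == value)
```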

2. Descriptive Statistics
(a) For each sample group, calculate the mean ($\bar{x}$), median ($Med_x$), standard deviation ($s_x$), variance ($s_x^2$) and coefficient of variation (CV). Calculate the mean plus 1 and 2 standard deviations, and the mean minus 1 and 2 standard deviations. The symbols and terms that you will need are defined at the end of the exercise. Place your answers in Table 2.3. Use the worked examples to establish and check your technique. In particular, note that these results are all presented to three significant figures. If you use these rounded numbers in subsequent calculations, errors can result. For example, the variance of Male Distance is not equal to 1220 squared; it is computed from the unrounded standard deviation (approximately 1224.264). Note that the median requires rankings to be assigned in Table 2.1. (6)

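Table 2.3. Descriptive statistics for Geography class data

| | Height (m) Male | Height (m) Female | Weight (kg) Male | Weight (kg) Female | Distance (km) Male | Distance (km) Female |
|---|---|---|---|---|---|---|
| Mean ($\bar{x}$) | | | 77.1 | 61.8 | 670 | |
| Median ($Med_x$) | | | 73.0 | 59.0 | 211 | |
| Standard deviation ($s_x$) | | | 14.0 | 12.2 | 1,220 | |
| Variance ($s_x^2$) | | | 196 | 149 | 1,500,000 | |
| Coefficient of variation (CV), (%) | | | 18.2 | 19.8 | 183 | |
| $\bar{x} + 2s_x$ | | | 105 | 86.2 | 3,120 | |
| $\bar{x} + s_x$ | | | 91.1 | 74.0 | 1,890 | |
| $\bar{x} - s_x$ | | | 63.1 | 49.6 | -554 | |
| $\bar{x} - 2s_x$ | | | 49.1 | 37.3 | -1,780 | |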
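The warning about reusing rounded values can be illustrated with the male distance data from Table 2.1. The sketch below uses the standard `statistics` module; the variable names are illustrative.

```python
import statistics

# Male distance-to-home values (km) copied from Table 2.1.
male_distance_km = [1, 573, 192, 198, 211, 544, 588, 4, 573, 189, 4298]

mean = statistics.mean(male_distance_km)
sd = statistics.stdev(male_distance_km)     # unrounded sample standard deviation

variance_from_raw = sd ** 2                 # computed from the unrounded value
variance_from_rounded = 1220 ** 2           # computed from the 3-figure value

print(f"sd = {sd:.3f}")
print(f"variance from unrounded sd = {variance_from_raw:,.0f}")
print(f"variance from rounded sd   = {variance_from_rounded:,}")

# Mean plus and minus one and two standard deviations.
limits = {k: mean + k * sd for k in (-2, -1, 1, 2)}
for k, limit in limits.items():
    print(f"mean {k:+d} sd = {limit:.1f}")
```

Squaring the rounded value understates the variance by roughly ten thousand, which is why intermediate results should be carried at full precision and rounded only for presentation.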

3. Graphical representation of central tendency and dispersion


For this exercise you will need the histograms of height and distance and the scattergram of weight vs. height from last week.

(a) Mark the median and the mean ± two standard deviations on the histograms of height and distance drafted last week. (8 histogram + 4)

(b) Mark the median and the mean ± one standard deviation on the two scattergrams of height vs. weight from last week. (4 scattergram + 4)

(c) Compare the mean and median. Which is the more robust measure of central tendency? Why? (Robust implies that the value remains representative of central tendency, regardless of peculiarities such as outliers in the distribution of the sample data.) (2)

(d) Which sex has the greatest weight? Explain your reasoning (i.e. use descriptive statistics to support your answer). (2)

(e) Which group has more variable weight, males or females? Explain your reasoning. (2)

(f) In males, is weight or height more variable? (Provide your reasons and make sure you use the right statistic.) (2)

(g) Approximately how many individuals lie outside the one and two standard deviation limits on your histogram or scattergram? Express your answer as a number and an approximate percentage (100 × number outside / total number in sample). (2)

Table 2.4. Outliers based on histograms of sample statistics

| | Height (m) Male | Height (m) Female | Weight (kg) Male | Weight (kg) Female | Distance (km) Male | Distance (km) Female |
|---|---|---|---|---|---|---|
| $\pm s_x$ | | | 3 (27%) | 5 (39%) | 1 (9%) | 2 (15%) |
| $\pm 2s_x$ | | | 0 (0%) | 0 (0%) | 1 (9%) | 2 (15%) |
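Counting the observations outside the standard deviation limits can also be checked in code. This sketch uses the female weights from Table 2.1; the function name `count_outside` is illustrative, and the limits are taken from the unrounded mean and sample standard deviation.

```python
import statistics

# Female weights (kg) copied from Table 2.1.
female_weight_kg = [54, 84, 49, 78, 69, 48, 73, 59, 73, 60, 49, 55, 52]

def count_outside(values, k):
    """Count values lying outside mean +/- k sample standard deviations."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    lower, upper = mean - k * sd, mean + k * sd
    return sum(1 for v in values if v < lower or v > upper)

for k in (1, 2):
    outside = count_outside(female_weight_kg, k)
    percent = 100 * outside / len(female_weight_kg)
    print(f"outside +/-{k} sd: {outside} ({percent:.1f}%)")
```

The counts agree with the worked female weight entries in Table 2.4; the percentage is approximate and depends on how it is rounded.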

4. Inference
In some cases, we know so little about a phenomenon that the primary need is for basic information, or exploratory analysis. Descriptive statistics are useful in drawing out information in such cases, and often raise new questions. Once armed with some knowledge, it is possible to develop more formal questions and to infer specific answers to these questions. Inference is the process of drawing conclusions from analysis of data. Drawing inferences from data requires formality so that your train of analysis is objective and clear to a reader. Effective inference depends upon the quality and quantity of the data, its relevance to the problem at hand, and the application of appropriate analytical methods.

Relevance is achieved by narrowing the research problem to questions and then to formal hypotheses. Hypotheses are simple testable statements that lead directly to the necessary data. Hypotheses are often disappointingly narrow in their focus, but this is more than compensated for by their overt testability. For example, the class data set might have been gathered because of curiosity about the gender of students, their morphology (shape) and the availability of home cooking. Within this broad research problem, we have decided to focus on a narrower group of questions about height, weight, sex and distance from home.

Formal hypotheses can be most readily expressed in simple terms of difference. There are three different ways of defining difference, and we can assign particular names to them.

Table 2.5. Identification and representation of tailedness of a research question

| Description of hypothesis | Language | Symbol | Words |
|---|---|---|---|
| 1. One-tailed test on the positive side | Significantly greater than | > | Expect one variable to exceed another |
| 2. One-tailed test on the negative side | Significantly less than | < | Expect one variable to be less than the other |
| 3. Two-tailed test | Significantly different from | ≠ | Expect one variable to be greater or less than another |

The histograms (Figure 1.1) indicate that males are often heavier than females. A related hypothesis, therefore, would be one-tailed positive (i.e. $Wt_m > Wt_f$). But some males are heavier and others are lighter than females, so the raw data alone are inconclusive. The sample mean provides an excellent way to characterize a data set. Indeed the mean weight of males (77.1 kg) is greater than the mean weight of females (61.8 kg). But what if the male mean weight were only slightly larger than the female mean weight? How different do the means have to be? An objective rule is needed to decide how big a difference must be to be considered significant. A simple rule for deciding significance of a difference might be that the male average weight should be more than one standard deviation greater than the female average weight. This can be presented graphically:


Figure 2.3. Sketch of hypothesis testing using means and standard deviation

The male mean weight is clearly more than one standard deviation above the female mean weight; therefore we conclude male weight is not only greater than female weight, but that it is significantly greater. It is also possible to tackle the same hypothesis by asking if female weight is significantly less than male mean weight. In this case the following graphic applies.

Figure 2.4. Sketch of hypothesis testing using mean and standard deviation

This result confirms the previous test. Female weight is significantly less than male mean weight. Overall, the two tests both confirm that males are significantly greater in weight than females.

As there are two ways of testing the hypothesis, these may or may not corroborate one another. How might we decide which route is better? First, it is possible that one sample is more reliable than the other. Assuming good sampling procedure, the sample size is the best indication of reliability. The female sample size is the larger in this case, so perhaps the first method of testing is preferable. Alternatively, we could pool the standard deviations, and get a composite confidence interval that way.

This test procedure is only useful as a general indication. However, this style of inference testing is an extremely powerful approach to analysis. The researcher retains the primary responsibility for resolving the problem into practical hypotheses, and uses formal techniques of descriptive statistics and logic to test the hypotheses. The questions are answered directly using the data, and the grounds for drawing conclusions are explicit to readers.

Table 2.6 below summarizes the example of the inferential process, and has a couple of examples to be completed. First, identify the variable of interest (e.g., Height in the partially worked example), then state a hypothesis (e.g., we expect males to be taller than females; $Ht_m > Ht_f$; a one-tailed test). Present a rule for identifying differences (e.g., the male mean height should be at least one standard deviation greater than the female mean height). This is the tolerance or threshold beyond which you will reject the hypothesis. Provide the appropriate statistics (e.g. $s_x$) and draw a conclusion as to whether the hypothesis is supported or rejected. Conclusions should be qualified (i.e., add some cautionary words) if you have any doubts about the reliability of the test (e.g. if the results are inconsistent). Work through the examples, then complete your analysis for the hypothesis about height, and tackle a two-tailed test for gender differences in distance-to-home. (10)

| Hypothesis | Rule for acceptance | Observed statistics: Male | Observed statistics: Female | Tolerance ($s_x$): Male | Tolerance ($s_x$): Female | Conclusion |
|---|---|---|---|---|---|---|
| Males are heavier than females; $Wt_m > Wt_f$; one-tailed +ve | $\bar{x}_m > (\bar{x}_f + s_{xf})$ or $\bar{x}_f < (\bar{x}_m - s_{xm})$ | $\bar{x}$ = 77.1, $s_x$ = 14.0 (n = 11) | $\bar{x}$ = 61.8, $s_x$ = 12.2 (n = 13) | 63.1 | 74.0 | $\bar{x}_m > (\bar{x}_f + s_{xf})$: 77.1 > (61.8 + 12.2); $\bar{x}_f < (\bar{x}_m - s_{xm})$: 61.8 < (77.1 - 14.0); hypothesis is supported |
| Male weight is different from female weight; $Wt_m \neq Wt_f$; two-tailed | $\bar{x}_m \neq (\bar{x}_f \pm s_{xf})$ or $\bar{x}_f \neq (\bar{x}_m \pm s_{xm})$ | $\bar{x}$ = 77.1, $s_x$ = 14.0 (n = 11) | $\bar{x}$ = 61.8, $s_x$ = 12.2 (n = 13) | 63.1 to 91.1 | 49.6 to 74.0 | $\bar{x}_m > (\bar{x}_f + s_{xf})$: 77.1 > (61.8 + 12.2); $\bar{x}_f < (\bar{x}_m - s_{xm})$: 61.8 < (77.1 - 14.0); hypothesis is supported |
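The one standard deviation rule used in the worked example can be written out as a short check. This is only a sketch of the informal rule described in this exercise, not a formal significance test; the dictionary layout and variable names are illustrative, and the inputs are the rounded weight statistics from Table 2.3.

```python
# Rounded weight statistics from Table 2.3.
male = {"mean": 77.1, "sd": 14.0, "n": 11}
female = {"mean": 61.8, "sd": 12.2, "n": 13}

# Route 1: is the male mean more than one female SD above the female mean?
female_upper = female["mean"] + female["sd"]
male_exceeds_female = male["mean"] > female_upper

# Route 2: is the female mean more than one male SD below the male mean?
male_lower = male["mean"] - male["sd"]
female_below_male = female["mean"] < male_lower

print(f"77.1 > {female_upper:.1f}: {male_exceeds_female}")
print(f"61.8 < {male_lower:.1f}: {female_below_male}")

if male_exceeds_female and female_below_male:
    print("Both routes support the hypothesis that males are heavier.")
else:
    print("The two routes disagree or fail; qualify the conclusion.")
```

Running both routes makes any disagreement between them explicit, which is the situation where the text recommends qualifying the conclusion.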

Table 2.6. Hypothesis testing for Exercise 2

| Variable | Hypothesis | Rule for acceptance | Observed statistics: Male | Observed statistics: Female | Tolerance ($s_x$): Male | Tolerance ($s_x$): Female | Conclusion |
|---|---|---|---|---|---|---|---|
| Height (m) | Males are taller than females; $Ht_m > Ht_f$; one-tailed +ve | $\bar{x}_m > (\bar{x}_f + s_{xf})$ or $\bar{x}_f < (\bar{x}_m - s_{xm})$ | | | | | |
| Distance (km) | Two-tailed | | $\bar{x}$ = 670, $s_x$ = 1220 | | | | |