
Week 1: Data Organization and Analysis - Lecture

Introduction to Business Statistics and Descriptive Statistics

Introduction

Welcome to Applied Managerial Statistics - GM533! In this week's lecture, we will provide an intuitive description
of the most important ideas in statistics, which we will be using throughout the course. We will then focus on
descriptive statistics, and we will use the Housing Prices Case #27 from Practical Data Analysis: Case Studies in
Business Statistics, to provide concrete examples, explanations, and applications of the key concepts.

To get the most out of this lecture, we will assume that you have: 1) viewed the tutorials, both from the home page
and Week 1, 2) loaded MegaStat (the statistical software we will be using in this course) onto your computer, with
Excel 2007 and the Excel 2007 version of MegaStat, 3) loaded the data set from Case #27 into MegaStat, and 4)
read Case #27 - Housing Prices. If you have not completed all of these steps, please do so now, before proceeding
with the lecture. To get you going, here's the first paragraph of the case:

Case #27 - Housing Prices


"You currently own a home in Eastville, Oregon, and want to put your house on the market.
You're not in any particular hurry to get rid of the house and would like to try selling it
yourself. One way to determine a reasonable asking price of a house is to call one or more real
estate agents and seek their advice. Another is to hire an appraiser - this approach would cost
several hundred dollars. You've been wondering if there might be an easier and cheaper way to
understand what determines selling prices in the area." - Peter G. Bryant and Marlene A. Smith,
McGraw-Hill, 1999.

Before we dive into the data from the case, we will set the stage by looking at the four most important ideas in
statistics.

The Most Important Ideas in Statistics

Statistics is defined in the American Heritage dictionary as, "the mathematics of the collection, organization, and
interpretation of numerical data, especially the analysis of population characteristics by inference from sampling."
There are four basic ideas in statistics, and getting an intuitive grasp of these ideas early on will be very useful to
you throughout the course, as well as in your use of statistics at work and as a citizen. Each of the four basic ideas
consists of a complementary pair of ideas.

Central Tendency and Dispersion


A set of numbers, like the 108 housing prices, by itself, tells us nothing. We have to organize the raw data and
summarize it to begin to make any meaning out of it.

The "central tendency" answers the question, "what is the average value?" For example, for the 108 housing prices,
one number which expresses the central tendency of the housing prices is the mean (add up all the values and divide
by the number of values), which has a value of 97.99226 (in thousands of dollars; that is, $97,992.26). When we get
into the housing data, using MegaStat, we will see how easy it is to express the central tendency, with a variety of
numbers, and also in pictures and tables.

The "dispersion" answers the question, "how much do the values vary?" For example, for the 108 housing prices,
one number that expresses the variability is the range (subtract the lowest price from the highest price), which has a
value of 136 ($136,000). As with central tendency, when we get into the housing data using MegaStat, we will see
how easy it is to express the dispersion with a variety of numbers, as well as in pictures and tables.
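
As a minimal sketch of these two ideas, the mean and the range can be computed in a few lines of Python. The six prices below are hypothetical values chosen for illustration; they are not the 108 prices from Case #27.

```python
# Hypothetical prices in thousands of dollars; NOT the Case #27 data.
prices = [59.0, 79.9, 92.5, 101.0, 112.2, 195.0]

# Central tendency: add up all the values and divide by the number of values.
mean_price = sum(prices) / len(prices)

# Dispersion: subtract the lowest price from the highest price.
price_range = max(prices) - min(prices)

print(f"mean  = {mean_price:.3f} (thousand dollars)")
print(f"range = {price_range:.1f} (thousand dollars)")
```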

Quantitative Variables and Qualitative Variables


Throughout our work, we will see basically two kinds of variables. "Quantitative Variables," such as the housing
prices, have values for which we can calculate a mean (i.e., adding them up and dividing by the number of values).
"Qualitative Variables," such as the presence or absence of a fireplace (represented in our housing data by 0 = no
fireplace and 1 = has a fireplace), lend themselves to finding a proportion, such as the proportion of the homes that
either have a fireplace or do not have a fireplace.
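
A proportion for a 0/1 variable can be sketched as follows. The eight values below are a hypothetical fireplace column used only for illustration, not the case data. Because the variable is coded 0 and 1, the proportion of homes with a fireplace is simply the sum of the column divided by the number of homes.

```python
# Hypothetical FIRE column (1 = has a fireplace, 0 = no fireplace); not the case data.
fire = [1, 1, 0, 1, 1, 1, 0, 1]

with_fireplace = sum(fire) / len(fire)   # share of homes coded 1
without_fireplace = 1 - with_fireplace   # share of homes coded 0

print(f"with fireplace:    {with_fireplace:.2%}")
print(f"without fireplace: {without_fireplace:.2%}")
```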

Descriptive Statistics versus Inferential Statistics


We use "descriptive statistics," including the central tendency and dispersion, among others, when our focus is on a
set of data immediately before us. For example, the 108 homes in the housing case could be in our community, and
our interest is solely focused on determining housing prices in our community. On the other hand, we use
"inferential statistics" when we are using the set of data immediately before us (we call it a "sample") to tell us
something about a larger set of data (which we call the "population"). If we are interested in understanding the
determinants of housing prices in the United States, and we have our 108 homes as a sample, then we have to deal
with the fact that there will be variability among samples. Our sample provides a certain mean and range, but
another sample will almost certainly have a somewhat different mean and range. One of the great achievements of
statistics (and of all of humankind) is to have worked out a solution to this problem of sampling variability.

One Variable versus Two or More Variables


In general, we will focus on one variable, as we focus in the housing data on housing prices, trying to understand
their central tendency and dispersion. But, in trying to understand the housing prices, we invariably introduce
additional variables - such as fireplace or not, square footage, and school district - to help us understand the
variability in prices.

In summary, we believe you will be well served to keep these intuitive but "Most Important Ideas in Statistics" close
at hand, as you proceed with your study and application of statistics, in your work and in your life as an involved
citizen. They will support you in keeping your focus on what is most essential and de-mystify some of the
intimidating formulas that all too often dominate statistics textbooks, even one as generally excellent as ours.

Descriptive Statistics for the Housing Prices, Using MegaStat

Let's generate the basic descriptive statistics for the housing prices. Then, we will walk through each of the
essential descriptive statistics, explaining them as we seek to understand the 108 housing prices, and, at least to a
limited extent, their relation to the other variables in the housing data set. While we will provide the MegaStat
output in the context of the lecture, we suggest you run the descriptive statistics for yourself, as we go along. This
will give you more confidence in your ability to use MegaStat, and it will enable you to experiment with the data as
you go along.

To begin, look at the column of home prices; run your eye down the column. From the definitions of the variables, we
know the numbers in the PRICE column represent thousands of dollars. But we can't get much meaning out of the
numbers just by looking at the raw data. As a first step, sort the data matrix by PRICE. Now look at the column of
home prices again. With the values sorted, we can see that the lowest priced home is 59 ($59,000) and the highest
priced home is 195 ($195,000). Subtracting the lowest price from the highest price gives us the range: 136
($136,000), which is one measure of the dispersion of the data. To get a beginning idea of the central tendency of
the values, look down halfway in the data, to rows 55 and 56. The home price in row 55 of the sorted data is 92.439
($92,439), and the home price in row 56 is 92.5 ($92,500). The rest of the home prices are either less than or
greater than these two values. The middle value in a ranked set of values is called the median, and the median and
the mean are the two most important measures of central tendency. If we have two values in the middle, as we do
here, we take the mean of the two, i.e., (92.439 + 92.5)/2 = 92.4695 ($92,469.50).

This is a good start, but we can do better by putting MegaStat to work for us. We will go to Descriptive Statistics,
and click on the following options:

• Mean
• Sample variance and standard deviation
• Minimum, maximum, and range
• Median, quartiles, mode, outliers
• Sum, sum of squares, SSX

Table 1.1 shows the resulting MegaStat output.

Descriptive Statistics
                                           PRICE
count                                      108
mean                                       97.99226
sample variance                            698.52002
sample standard deviation                  26.42953
minimum                                    59
maximum                                    195
range                                      136
sum                                        10,583.16400
sum of squares                             1,111,809.79218
deviation sum of squares (SSX)             74,741.64171
1st quartile                               79.30000
median                                     92.46950
3rd quartile                               112.22500
interquartile range                        32.92500
mode                                       79.90000
low extremes                               0
low outliers                               0
high outliers                              2
high extremes                              0

Table 1.1
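
The sort-and-find-the-middle procedure described above can be sketched in Python. The function name `median_of_ranked` and the five-value sample are our own illustrative assumptions; only the two middle prices (92.439 and 92.5) come from the lecture.

```python
import statistics

def median_of_ranked(values):
    """Sort the values and return the middle one (or the mean of the two middle ones)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

# The two middle prices quoted in the lecture (thousands of dollars).
print(round(median_of_ranked([92.439, 92.5]), 4))

# Hypothetical sample with an odd count; compare with the standard library.
sample = [79.9, 59.0, 195.0, 92.5, 101.0]
print(median_of_ranked(sample), statistics.median(sample))
```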

Now, looking at the MegaStat output, the count 108 is the number of homes. Jumping down a bit, we see
the minimum 59, the maximum 195, and the range 136, which are the values we found when we sorted
the home prices. Furthermore, notice that the median 92.4695 is the value we got from our sorted home
prices.

Next, we will add to our basic measures of central tendency and dispersion. The mean 97.99226 is another average,
or measure of central tendency, a very important one. As we said earlier, we calculate the mean by adding up all the
values and dividing by the number of values. The critical difference between the median and the mean is that the
median is not affected by extreme values, whereas the mean takes into account every value. The mode 79.90000 is a
third measure of central tendency, less often used than the median and the mean, and is simply the most frequently
occurring value.

Another name for the median is the 2nd quartile, which says, as we did above, that half of the home prices are less
than the median and half of the home prices are above the median. The quartiles provide additional break points in
the ranked home prices. The 1st quartile 79.30000 is the home price which has 25% of the homes having lower
prices, and the 3rd quartile 112.22500 is the home price which has 75% of the homes having lower prices. You
should note that there are a number of ways of calculating the quartiles (MegaStat uses the Excel procedure), which
give slightly different values. But the important thing is the basic idea of breaking the ranked values up into 4
quartiles, using the 1st quartile, the median, and the 3rd quartile as the break points. The interquartile range
32.92500 is the difference between the 3rd quartile and the 1st quartile, containing the middle 50% of the home
prices. The interquartile range (IQR) is a second measure of dispersion.
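
To illustrate how different quartile procedures give slightly different values, here is a small sketch using Python's standard `statistics.quantiles` function with its two built-in methods. The eight prices are hypothetical. To our knowledge the "inclusive" method corresponds to Excel's classic QUARTILE function, but we have not verified that it reproduces MegaStat's output for the case data.

```python
import statistics

# Hypothetical prices in thousands of dollars; not the Case #27 data.
prices = [60.0, 70.0, 80.0, 90.0, 100.0, 110.0, 120.0, 130.0]

# Two common conventions for locating the quartile break points.
q_inclusive = statistics.quantiles(prices, n=4, method="inclusive")
q_exclusive = statistics.quantiles(prices, n=4, method="exclusive")

for name, (q1, q2, q3) in (("inclusive", q_inclusive), ("exclusive", q_exclusive)):
    # q2 is the median; the interquartile range is q3 - q1.
    print(f"{name}: Q1={q1}, median={q2}, Q3={q3}, IQR={q3 - q1}")
```

Both methods agree on the median here, while the 1st and 3rd quartiles (and therefore the IQR) differ.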

The third, and most important measure of dispersion is actually a family of three closely related measures of
dispersion: namely, the standard deviation, the variance, and SSX (which stands for the deviation sum of squares).
MegaStat has provided the SSX, or deviation sum of squares, in the above output.

All three of these measures of dispersion are based on subtracting the mean of all home prices from each individual
home price, then squaring the difference, and finally adding up the resulting squared values. This is called the SSX,
or deviation sum of squares. With symbols, we can represent SSX as:

$$\mathrm{SSX} = \sum (x - \bar{x})^2 = 74{,}741.64171 \quad \text{(for the housing prices)}$$

"$x$" represents an individual home price, and "$\bar{x}$" represents the mean of all of the housing prices. The
"$\sum$" symbol says to add up all of the squared deviations from the mean. If we divide SSX by the number of housing
prices, "n" (actually we'll use n-1 = 108 - 1 = 107, for a technical reason, which we needn't worry about here), we
get what is called the "variance":

$$\text{Variance} = \frac{\sum (x - \bar{x})^2}{n - 1} = \frac{74{,}741.64171}{108 - 1} = 698.52002 \quad \text{(for the housing prices)}$$

$$\text{Standard Deviation} = s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}} = \sqrt{698.52002} = 26.42953$$

At the simplest, most intuitive level, the standard deviation can be read as the typical (roughly average) deviation
of individual home prices from the mean home price.
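
As a quick arithmetic check, the three related measures can be reproduced from the values in the MegaStat output. The sketch below also uses the standard shortcut identity SSX = (sum of squares) - (sum)^2 / n, which the lecture does not state but which ties together the "sum", "sum of squares", and "SSX" rows; the variable names are our own.

```python
import math

# Values copied from the MegaStat output for PRICE.
n = 108
total = 10583.164                 # sum
sum_of_squares = 1111809.79218    # sum of squares
ssx_reported = 74741.64171        # deviation sum of squares (SSX)

# Shortcut identity: SSX = sum of squares - (sum)^2 / n
ssx_from_sums = sum_of_squares - total ** 2 / n

# Variance divides SSX by n - 1; the standard deviation is its square root.
variance = ssx_reported / (n - 1)
std_dev = math.sqrt(variance)

print(f"SSX from sums      = {ssx_from_sums:.5f}")
print(f"sample variance    = {variance:.5f}")
print(f"standard deviation = {std_dev:.5f}")
```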

Finally, the outliers refer to home prices that are out at one or the other extreme of the distribution. In our set of
home prices, we see two outliers, namely, homes with prices of 192 and 195 (in thousands of dollars).
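
The lecture does not say how outliers are flagged. A common convention, which we assume here purely for illustration, is to call a value an outlier if it lies more than 1.5 interquartile ranges beyond a quartile, and an extreme if it lies more than 3 interquartile ranges beyond. Under that assumption, the quartiles from the MegaStat output classify 192 and 195 as high outliers but not as extremes, which is consistent with the counts in the output.

```python
# Quartiles taken from the MegaStat output (thousands of dollars).
q1, q3 = 79.3, 112.225
iqr = q3 - q1

# ASSUMPTION: 1.5 x IQR fences for outliers, 3 x IQR fences for extremes.
inner_low, inner_high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outer_low, outer_high = q1 - 3.0 * iqr, q3 + 3.0 * iqr

def classify(price):
    """Label a price relative to the assumed fences."""
    if price < outer_low:
        return "low extreme"
    if price < inner_low:
        return "low outlier"
    if price > outer_high:
        return "high extreme"
    if price > inner_high:
        return "high outlier"
    return "not an outlier"

for price in (59, 192, 195):
    print(price, classify(price))
```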

We know this is a lot of numbers to throw at you! While the numbers are useful in characterizing the central
tendency and dispersion of a set of values, like home prices, they can be very dry. To balance the numbers, and
enliven them, we turn to an alternative, and complementary way of capturing the central tendency and dispersion in
the home prices - namely, pictures and tables. Let's go back to MegaStat and ask for two pictures, the stem-and-leaf
and the histogram displays.
First, we get the stem-and-leaf display by going through MegaStat, shown in Table 1.2.

Stem and Leaf plot for PRICE

stem unit = 10
leaf unit = 1

Frequency   Stem   Leaf
    2         5    99
   13         6    1346666888899
   16         7    0012255666679999
   15         8    123455677889999
   23         9    00011122234456778888999
    7        10    2457779
   12        11    000122333445
    6        12    114558
    7        13    0122223
    1        14    4
    4        15    0356
    0        16
    0        17
    0        18
    2        19    25
  108

Table 1.2

Each home price is represented as a stem (which is a multiple of 10) and a leaf (which is a certain number of units).
Thus, the lowest housing price of 59 (thousand dollars) is shown as a 5 under the stem and a 9 under the leaves.
Notice that the various values in the stem-and-leaf display have dropped the numbers to the right of the decimal
point. The lowest value is 59 and the highest value is 195, which shows us the range. For comparison with other
displays, we recommend that you mentally rotate the stem-and-leaf 90° counterclockwise. What we then have
is a picture of the distribution of all housing prices, with a hump in the middle, tapering to the left and to the right,
but with a tail off to the right (we call this a distribution that is right skewed, and the two values in the tail are the
two outliers we identified earlier). The central tendency, at a gross level, is in the hump, corresponding to the stem
of 9, or the 90s; this visual central tendency corresponds to both the mean and the median, our two primary
numerical central tendencies.
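
To make the construction concrete, here is a minimal sketch of how a stem-and-leaf display like the one above could be built. The function `stem_and_leaf` and the seven sample prices are hypothetical; this is not how MegaStat itself is implemented. As in the display above, digits to the right of the decimal point are dropped.

```python
from collections import defaultdict

def stem_and_leaf(values, stem_unit=10):
    """Map each stem to a string of sorted leaves, keeping empty stems in between."""
    leaves = defaultdict(list)
    for value in values:
        whole = int(value)  # drop the digits to the right of the decimal point
        leaves[whole // stem_unit].append(whole % stem_unit)
    lowest, highest = min(leaves), max(leaves)
    return {
        stem: "".join(str(leaf) for leaf in sorted(leaves.get(stem, [])))
        for stem in range(lowest, highest + 1)
    }

# Hypothetical prices in thousands of dollars; not the Case #27 data.
sample = [59.0, 59.9, 61.5, 66.2, 79.9, 92.439, 92.5]
for stem, leaf_text in stem_and_leaf(sample).items():
    print(f"{len(leaf_text):>3} {stem:>3} {leaf_text}")
```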

Now let's look at our second picture, the histogram, which also captures the central tendency and dispersion. It has
an associated table, called a frequency distribution.

To have MegaStat create the histogram and the frequency distribution, use MegaStat: Frequency Distribution:
Quantitative, which generates the following:

Frequency Distribution - Quantitative

PRICE
                                                             cumulative
 lower       upper    midpoint   width   frequency  percent  frequency
  40.00  <   60.00      50.00    20.00        2       1.9         2
  60.00  <   80.00      70.00    20.00       29      26.9        31
  80.00  <  100.00      90.00    20.00       38      35.2        69
 100.00  <  120.00     110.00    20.00       19      17.6        88
 120.00  <  140.00     130.00    20.00       13      12.0       101
 140.00  <  160.00     150.00    20.00        5       4.6       106
 160.00  <  180.00     170.00    20.00        0       0.0       106
 180.00  <  200.00     190.00    20.00        2       1.9       108
                                            108     100.0
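
The percent and cumulative frequency columns follow directly from the frequency column, and the frequencies themselves come from counting the prices in each class interval. The sketch below shows both steps; the function `frequency_distribution` and the ten sample prices are hypothetical, while the list `table_frequencies` is copied from the frequency column above.

```python
from itertools import accumulate

def frequency_distribution(values, start, width, n_classes):
    """Count values in intervals [lower, upper) and return one row per class."""
    rows, cumulative = [], 0
    for k in range(n_classes):
        lower = start + k * width
        upper = lower + width
        freq = sum(1 for v in values if lower <= v < upper)
        cumulative += freq
        rows.append((lower, upper, freq, round(100 * freq / len(values), 1), cumulative))
    return rows

# Hypothetical prices in thousands of dollars; not the Case #27 data.
sample = [59, 61, 75, 79.9, 92, 95, 99, 110, 130, 195]
for row in frequency_distribution(sample, start=40, width=20, n_classes=8):
    print(row)

# Frequencies copied from the table above; recompute the derived columns.
table_frequencies = [2, 29, 38, 19, 13, 5, 0, 2]
total = sum(table_frequencies)
percents = [round(100 * f / total, 1) for f in table_frequencies]
cumulative = list(accumulate(table_frequencies))
print(total, percents, cumulative)
```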

Figure 1.1

The basic idea in both the histogram and the frequency distribution is the creation of class intervals, the lowest one
being greater than or equal to 40 and less than 60. For each class interval, the share of the housing prices falling
in that interval is shown: in the histogram, by the height of the bar above the class interval, and in the frequency
distribution, by the frequency and the percentage for the interval. Both the histogram and the frequency distribution
reveal the central tendency and the dispersion similarly to the stem-and-leaf display; that is, by showing the low and
high home prices, and the "hump" where the greatest concentration of home prices is. It should be noted that MegaStat
lets us change the class intervals, if we wish, and also insert titles.

Notice the similarity between the stem-and-leaf (turned 90° counterclockwise) and the histogram. Both resemble the
normal distribution, which has a hump in the middle and symmetrical, tapering ends on both sides. In our housing
prices data we do not have perfect symmetry, because of the two outliers, which make the data skewed to the right.

The normal distribution is shown in Figure 1.2.

Figure 1.2

The normal distribution will permeate all of our work in this course, and we will have much more to say about it in
next week's lecture and discussion. For now, we will simply introduce the idea of the Empirical Rule for a normally
distributed population. The Empirical Rule says that if a distribution is normally distributed, then 68.26% of the
measurements will be within plus or minus one standard deviation of the mean; 95.44% of the measurements will be
within plus or minus two standard deviations; and 99.73% of the measurements will be within plus or minus three
standard deviations. These probabilities can be seen on the normal distribution figure above.

MegaStat will also calculate the empirical probabilities and the actual values for our set of housing prices; we have
included these values in the following MegaStat output.

Descriptive statistics
                                           PRICE
count                                      108
empirical rule
  mean - 1s                                71.56273
  mean + 1s                                124.42179
  percent in interval (68.26%)             67.6%
  mean - 2s                                45.13320
  mean + 2s                                150.85132
  percent in interval (95.44%)             95.4%
  mean - 3s                                18.70367
  mean + 3s                                177.28085
  percent in interval (99.73%)             98.1%

Table 1.4

You can see from the output that our distribution of housing prices is close to what would be expected in a normal
distribution, though the fit is not perfect.
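
The interval endpoints in the output are simply the mean plus or minus one, two, and three standard deviations, which the following sketch reproduces from the reported mean and standard deviation. As an aside that goes slightly beyond the lecture, it also computes the theoretical normal percentages with `math.erf`; these agree with the 68.26%, 95.44%, and 99.73% quoted above to within rounding (the quoted figures appear to come from a rounded normal table).

```python
import math

# Mean and sample standard deviation from the MegaStat output.
mean, s = 97.99226, 26.42953

intervals = {}
normal_percent = {}
for k in (1, 2, 3):
    # Endpoints of the interval mean +/- k standard deviations.
    intervals[k] = (mean - k * s, mean + k * s)
    # Theoretical share of a normal population within k standard deviations.
    normal_percent[k] = 100 * math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    low, high = intervals[k]
    print(f"mean -/+ {k}s: {low:.5f} to {high:.5f}  (normal: {normal_percent[k]:.2f}%)")
```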

So far, we have been focused solely on the housing prices data, the primary variable in which we are interested. But
as we said in our introduction to "The Most Important Ideas in Statistics," we almost always turn to other variables
to help us in our understanding. In your discussion this week, one of the early questions has to do with the
difference in prices for homes with and without a fireplace. With the above tools, you now have the capabilities you
need to compare home prices for homes with and without a fireplace, or for any of the other variables.

We will focus briefly here on two-variable descriptive statistics: a scatter plot of housing prices against the
fireplace variable, and the associated regression line, which we will explain below. The scatter plot is shown in
Figure 1.3.

Figure 1.3

Notice first that the vertical, or y, axis is PRICE (our variable of interest), and it is called the dependent variable.
On the horizontal, or x, axis is FIRE (the presence or absence of a fireplace), which we think might have a
relationship with PRICE, and we call it the independent variable. In our original data set, we indicate the presence
of a fireplace with a "1" and the absence of a fireplace with a "0." Now look at the zero point on the horizontal axis;
the dots there represent the homes without fireplaces, and the dots above the 1 on the x (or FIRE) axis are the homes
with a fireplace.

The upward sloping line joining the two stacks of dots is called the regression line, and it joins the means of the two
sub-sets of housing prices. The equation of the regression line is provided on the scatter plot, and if we substitute 0
for x in the equation (indicating homes without a fireplace) we get:

y = 28.554x + 72.611
y = 72.611,

which is the mean of the housing prices for the homes without a fireplace. Similarly, if we substitute 1 for x in the
equation (indicating homes with a fireplace) we get:

y = 28.554x + 72.611
y = 28.554 + 72.611
y = 101.165

This result is the mean of the housing prices for the homes with a fireplace. You can verify these numbers by
running the MegaStat descriptive statistics separately for the housing prices of homes with and without a fireplace.
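
The fact that the regression line passes through the two group means when x takes only the values 0 and 1 can be checked with a small sketch. The function `least_squares` uses the usual least-squares formulas for slope and intercept; the six homes below are hypothetical and are not the case data.

```python
def least_squares(x, y):
    """Return (slope, intercept) of the least-squares line y = slope * x + intercept."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar

# Hypothetical homes: FIRE (0/1) and PRICE in thousands of dollars.
fire = [0, 0, 1, 1, 1, 1]
price = [70.0, 75.0, 90.0, 100.0, 105.0, 110.0]

slope, intercept = least_squares(fire, price)

# Group means computed directly, for comparison with the line.
mean_without = sum(p for f, p in zip(fire, price) if f == 0) / fire.count(0)
mean_with = sum(p for f, p in zip(fire, price) if f == 1) / fire.count(1)

print(f"line at x = 0: {intercept:.3f}   mean without fireplace: {mean_without:.3f}")
print(f"line at x = 1: {slope + intercept:.3f}   mean with fireplace: {mean_with:.3f}")
```

The slope is therefore the difference between the two group means, just as 28.554 is the difference between 101.165 and 72.611 in the lecture's equation.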

In the scatter plot, you can also see visually that for each sub-group (homes with or without a fireplace) the
individual prices vary around the group mean, which lies on the regression line. Finally, on the scatter plot, you
will see the equation: R² = 0.116.
This is called the simple coefficient of determination, or simply "R squared." This reflects the improvement we
can see in the scatter plot by using the FIRE variable over not using the FIRE variable. Another way of saying this is
that using the FIRE variable reduces the dispersion; that is, it results in a smaller standard deviation, around the
regression line, versus the dispersion around the overall mean of all housing prices. The "R squared" says that the
independent variable FIRE accounts for 0.116, or 11.6% of the variability in housing prices, which still leaves
88.4% of the variability unexplained (and presumably this will be helped by the other independent variables). While
there is some relationship between housing prices and the presence or absence of a fireplace, one caution is that we
are not able to say that the presence or absence of a fireplace (or any of the other housing variables we will be
looking at and discussing) "causes" the difference in prices.
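
One way to see what "R squared" measures is to compare the dispersion around the overall mean with the dispersion around the group means (which, for a 0/1 variable, are the points on the regression line). The sketch below does this for six hypothetical homes; the data and variable names are our own and are not the case data.

```python
# Hypothetical homes: FIRE (0/1) and PRICE in thousands of dollars.
fire = [0, 0, 1, 1, 1, 1]
price = [70.0, 75.0, 90.0, 100.0, 105.0, 110.0]

overall_mean = sum(price) / len(price)
group_mean = {
    g: sum(p for f, p in zip(fire, price) if f == g) / fire.count(g)
    for g in (0, 1)
}

# Dispersion around the overall mean (total sum of squares).
sst = sum((p - overall_mean) ** 2 for p in price)
# Dispersion around the regression line, i.e. around each home's group mean.
sse = sum((p - group_mean[f]) ** 2 for f, p in zip(fire, price))

# Share of the variability in price accounted for by the FIRE variable.
r_squared = 1 - sse / sst
print(f"SST = {sst:.3f}, SSE = {sse:.3f}, R squared = {r_squared:.4f}")
```

In this made-up sample the two groups are far apart, so R squared is much higher than the 0.116 found for the actual housing data.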

In Conclusion

We have covered a lot of ground this week, including learning how to use MegaStat to help us in making sense of a
variable, like PRICE, by describing its central tendency and dispersion, using numbers, pictures, and tables.

Next week, and in the weeks to come, we will expand upon this week's work to include: inference, qualitative
variables, and, towards the end of the course, how to handle multiple independent variables at the same time to
better understand and make predictions about our dependent variable of interest, the housing prices, or any other
dependent variable of our choosing.
