
Chapter 3

Be able to explain data using descriptive statistics, as well as display it using graphs and tables. Describe the characteristics of distributions according to their plots and/or graphs.

Be able to explain the properties of, and calculate, descriptive statistics (e.g., mean, median, mode, standard deviation, range, z-scores).
Identify and explain outliers. Understand bivariate descriptive statistics (we will cover more of this later in the semester, but it will be introduced and discussed in this chapter).

Summarize data: simple as that

Number of times something occurred
Can use a table and/or a graph

Center: typical or average observation

Since not everyone is typical, there will be variability: variation of scores/observations around the center or typical score

Relative frequency can be described using:

Proportions: number of occurrences/observations for the category divided by total observations

20 of you go on to work on a Ph.D. after your M.S. There are 30 of you in the class, so the proportion is _______________

Percentages: the proportion multiplied by 100

The percentage of you that go on to work on a Ph.D. is ________________

To move from proportion to percentage, do you have to physically multiply by 100 using a calculator or pencil and paper?
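A minimal sketch of the proportion-to-percentage step, using the class example above (20 of 30 students). The function names are made up for illustration.

```python
def proportion(count, total):
    # Occurrences in the category divided by total observations
    return count / total

def to_percentage(p):
    # Multiplying by 100 just moves the decimal point two places to the right
    return p * 100

p = proportion(20, 30)
pct = to_percentage(p)
print(f"proportion = {p:.3f}, percentage = {pct:.1f}%")
```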

Table 1: Visitors to XYZ

Origin of Visitors   Number    Proportion   Percentage
New York             100,000   0.40         ____
California           80,000    ____         32
Florida              ____      0.18         18
Texas                ____      0.10         ____
Total                250,000   1.00         100

Bar graph: uses rectangular bars to represent the relative frequency for each category.
Figure 1: Relative Frequency of Visitors to XYZ (bar graph; horizontal axis: New York, California, Florida, Texas; vertical axis: relative frequency, 0 to 0.45)

Often used to visually present categorical data. The origin of visitors example puts the origins in descending order of frequency, but

Some variables have categories that determine the order, such as household structure as presented in your text, where the bars are not descending.
Another example would be level of education (e.g., high school, college, etc.)

Another way to present categorical data is a pie chart, which is a circle with slices representing the categories.

What characteristics of categorical data allow this?

Frequency distribution

Summarize by creating sets of intervals (e.g., for age we could use: 18-29, 30-39, 40-49, etc.)
Keep in mind, we can collapse data to lower levels of measurement, but not vice versa
Intervals are typically equal in width
Intervals cover all possible values
Intervals are mutually exclusive: each observation fits into only one interval
Is 20-30, 30-40, 40-50 mutually exclusive?
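One way to check the interval rules is a short sketch like the following. The ages are invented for illustration only; the interval bounds are treated as inclusive, which is an assumption about how the slide's intervals are meant to be read.

```python
from collections import Counter

# Hypothetical ages, invented for illustration only
ages = [18, 22, 29, 30, 35, 39, 40, 41, 49]

# Inclusive (low, high) bounds, as in the 18-29, 30-39, 40-49 example
intervals = [(18, 29), (30, 39), (40, 49)]

def interval_for(age, bounds):
    # A valid set of intervals gives exactly one match per observation
    matches = [(lo, hi) for lo, hi in bounds if lo <= age <= hi]
    if len(matches) != 1:
        raise ValueError(f"{age} fits {len(matches)} intervals")
    return matches[0]

counts = Counter(interval_for(a, intervals) for a in ages)
relative = {f"{lo}-{hi}": counts[(lo, hi)] / len(ages) for lo, hi in intervals}
print(relative)

# With 20-30, 30-40, 40-50 an age of exactly 30 matches two intervals
overlapping = [(20, 30), (30, 40), (40, 50)]
matches_for_30 = sum(lo <= 30 <= hi for lo, hi in overlapping)
print(matches_for_30)
```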

Histogram: similar to a bar graph, but for a quantitative variable where the observations are grouped into intervals.
Choosing the number of intervals is based on common sense.
You might have only a few values and not need to create intervals (e.g., a discrete variable)

Figure: Histogram of Relative Frequencies for Number of Times Visiting XYZ in Past 5 Years (intervals: 0, 1-4, 5-9, 10+; vertical axis: relative frequency, 0 to 0.6)

Stem-and-leaf plot: the stem is the first digit and the leaf is the second
Similar to a histogram but displayed on its side
Presents information not displayed in a histogram. For example:

How many total observations are in the stem-and-leaf plot below?
How would we calculate the relative frequency for an interval of 20-29?

Stem | Leaf

Stem-and-leaf plot: put one group to the right of the stem and the other to the left (see example on page 37 of your book)
Bar graphs and histograms: use different color bars to represent different groups

Population distribution: represents the entire population
Sample data distribution: a (fuzzy) glimpse of the population

As the sample size increases, the sample data distribution becomes more reflective or representative of the population

(Page 37 of the book)

U-shaped: two peaks at the low and high ends of the scores
Bell-shaped: most observations fall near the middle of the values

These distributions are symmetric: the left and right sides are mirror images
How would you describe the shapes below in layman's terms?

(Page 38)

Skewed to the right: a number of observations in the tail on the right side
Skewed to the left: a number of observations in the tail on the left side

Right or left skew is determined by which tail is longer
How would you describe the shapes below in layman's terms?

(Page 38)

Typical observation: we cannot describe every observation
Mean = sum of observations / number of observations

Also known as _________________?
Frequently used and probably best known
Your GPA is an average of what?

Formulas:
$\bar{y} = \frac{y_1 + y_2 + y_3 + \cdots + y_n}{n}$
Where $y_1$, $y_2$, $y_3$, etc. are the observations and $n$ is the number of observations

$\sum y_i = y_1 + y_2 + y_3 + \cdots + y_n$
Where $\sum$ (sigma) is the process of summing, $y$ is the value(s), and $i$ is a typical value in the range of 1 to $n$

$\bar{y} = \frac{\sum y_i}{n}$

Properties

Requires numerical values

Cannot calculate for nominal level variables
Can calculate for ordinal level if we assign values (e.g., 1 = Strongly Disagree, 2 = Disagree, 3 = Neutral, 4 = Agree, 5 = Strongly Agree)

Outliers: observation(s) well above or below the majority of scores that influence the mean

Short example: calculate the mean for each of the small sets of observations of, say, agreement per the above scale:
Set 1: 2, 1, 2, 2, 1
Set 2: 2, 1, 5, 2, 1

So, if you have an outlier(s), in which tail (longer or shorter) of a skewed distribution are they found? Asked another way, which way do outliers pull the mean?
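To check the short example by machine, a small sketch (the helper name `mean_of` is made up here) computes the mean of each set. The two sets differ only in one observation, so the change in the mean comes from that single score.

```python
def mean_of(values):
    # Mean = sum of observations / number of observations
    return sum(values) / len(values)

set_1 = [2, 1, 2, 2, 1]
set_2 = [2, 1, 5, 2, 1]  # same as Set 1 except for the single score of 5

mean_1 = mean_of(set_1)
mean_2 = mean_of(set_2)
print(mean_1, mean_2)
```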

Properties (just a bit more):

Is it pulled in the direction of the longer or shorter tail of a distribution that is skewed?
Point of balance:
Separate observations into two groups: those above and those below the mean
Calculate the distance of each score from the mean
Add up the distances for observations above the mean and for those below the mean; they will be equal
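The point-of-balance property can be sketched as follows, reusing Set 2 from the agreement example (2, 1, 5, 2, 1). The variable names are illustrative only.

```python
scores = [2, 1, 5, 2, 1]
mean = sum(scores) / len(scores)

# Total distance from the mean, separately for scores above and below it
distance_above = sum(y - mean for y in scores if y > mean)
distance_below = sum(mean - y for y in scores if y < mean)

print(distance_above, distance_below)
```

The two totals agree up to floating-point rounding, which is why the mean acts as the balance point.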

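Suppose you have two sets of data, with means $\bar{y}_1$ and $\bar{y}_2$ and sample sizes $n_1$ and $n_2$. The overall (weighted) mean is:

$\bar{y} = \frac{n_1 \bar{y}_1 + n_2 \bar{y}_2}{n_1 + n_2}$
Conceptually, it is the sum of all observations divided by the total combined sample size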

Let's calculate the individual and weighted averages for the following sets of observations and see what happens:
Set 1 (n1): 2, 2, 1, 2
Set 2 (n2): 3, 4, 5, 4, 5, 5, 4, 4
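A sketch of the weighted-average calculation for these two sets. It also pools all observations into one list to illustrate the "sum of all observations divided by the total combined sample size" reading; the variable names are made up for this example.

```python
set_1 = [2, 2, 1, 2]
set_2 = [3, 4, 5, 4, 5, 5, 4, 4]

n1, n2 = len(set_1), len(set_2)
mean_1 = sum(set_1) / n1
mean_2 = sum(set_2) / n2

# Weighted mean: each set's mean counts in proportion to its sample size
weighted = (n1 * mean_1 + n2 * mean_2) / (n1 + n2)

# Same thing computed directly from the pooled observations
pooled = sum(set_1 + set_2) / (n1 + n2)

# Unweighted average of the two means, for comparison
simple_average = (mean_1 + mean_2) / 2

print(mean_1, mean_2, weighted, pooled, simple_average)
```

Because Set 2 has twice as many observations, the weighted mean sits closer to Set 2's mean than the simple average of the two means does.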

Median: simply put, the middle observation when observations are in ascending order

So, the median for the following set is?


3, 5, 4, 8, 13

If you have an even number of observations, it is the midpoint of the two middle observations
So, the median for the following set is? 3, 3, 4, 4, 6, 6, 7, 7

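Conceptually, the median sits at position (n + 1) / 2 in the ordered list

An equal number of scores fall above and below that position. The position is a whole number for an odd n and ends in .5 for an even n (i.e., halfway between the two middle observations)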
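The median rule can be sketched as below, using the two sets from the questions above. The function name `median_of` is made up; note that the first set has to be sorted before the middle observation is picked.

```python
def median_of(values):
    ordered = sorted(values)          # observations must be in ascending order
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        # Odd n: position (n + 1) / 2 is a whole number
        return ordered[middle]
    # Even n: midpoint of the two middle observations
    return (ordered[middle - 1] + ordered[middle]) / 2

odd_set = [3, 5, 4, 8, 13]
even_set = [3, 3, 4, 4, 6, 6, 7, 7]

median_odd = median_of(odd_set)
median_even = median_of(even_set)
print(median_odd, median_even)
```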

Median is useful to describe ordinal data, while the mean is not as meaningful

Simple example:
High school diploma (n = 2)
College degree (n = 4)
Master's degree (n = 3)

Properties:

Appropriate for quantitative and ordinal variables, but not nominal (e.g., gender cannot be put into a logical order)
In symmetrical distributions, the mean and median will be identical (why?)
In skewed distributions, will the mean or the median be pulled closer to the longer tail?
Not influenced by the distance of scores from the mean because it uses ordinal properties only

Properties, a bit more:

Uses ordinal characteristics:

Not influenced by distance from the middle like the mean
Nor is it affected by outliers

Calculate the mean and median for each set:

Set 1: 3, 3, 4, 5, 5
Set 2: 3, 3, 4, 5, 9

Which of the above would look more like a bell-shaped curve? Which might have a long tail? On which side of the curve?

Is the mean or median more affected by outliers? Which (mean or median) is more appropriate if your data are highly skewed?
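To compare the two sets side by side, here is a short sketch using Python's standard `statistics` module. The dictionary layout is just one convenient way to organize the output.

```python
import statistics

sets = {
    "Set 1": [3, 3, 4, 5, 5],
    "Set 2": [3, 3, 4, 5, 9],  # same as Set 1 except the largest score
}

summary = {}
for name, values in sets.items():
    summary[name] = (statistics.mean(values), statistics.median(values))
    print(name, "mean =", summary[name][0], "median =", summary[name][1])
```

Only the mean changes between the two sets; the median stays put because the middle observation is the same.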

(Page 43)

Mode: simply put, the most common observation

How would you identify the mode from a bar graph?

Used a lot for discrete (categorical) variables (e.g., religion)
Properties:

Can be used for all levels of measurement or data
If two values occur most often (two peaks), the variable is bimodal

What would a bar graph of this look like? How about the shape of a curve if it was drawn (think about a curve drawn over the bar graph)?

Properties, a bit more:

The mean, median, and mode are similar for unimodal, symmetrical data (e.g., bell-shaped)


1, 2, 2, 3, 3, 3, 4, 4, 5
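A quick check of the small data set above, sketched with the standard `statistics` module and a `Counter` to show the frequencies behind the mode.

```python
import statistics
from collections import Counter

data = [1, 2, 2, 3, 3, 3, 4, 4, 5]

frequencies = Counter(data)          # how many times each value occurs
data_mean = statistics.mean(data)
data_median = statistics.median(data)
data_mode = statistics.mode(data)

print(frequencies)
print(data_mean, data_median, data_mode)
```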

Keep in mind mean, median, and mode are complementary, but general guidelines:

Nominal: mode
Ordinal: median
Interval and ratio: mean if normally distributed and median if skewed
Keep in mind mode and median are possible for higher levels than indicated above (e.g., mode for nominal and higher, median for ordinal and higher). Could also calculate a mean for nominal data if given values, but is it meaningful?

Variability: how observations vary or spread around the typical observation or measure of the center

Range: distance between smallest and largest observation

Standard deviation: a typical (roughly average) distance of an observation from the mean

Range: simplest way to describe variability

Low score
High score
Distance between low and high score

What is the low, high, and range of the following set of observations?

8, 99, 55, 75, 22, 65, 24, 11
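The low, high, and range for this set can be checked with a few lines; the variable names are illustrative.

```python
observations = [8, 99, 55, 75, 22, 65, 24, 11]

low = min(observations)
high = max(observations)
value_range = high - low   # distance between low and high score

print(low, high, value_range)
```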

Two data sets can have the same mean but different variability characteristics (let's look at an example)

Deviation: distance of each observation from the center (mean)

Above the mean it will be positive
Below the mean it will be negative

Conceptually,

Sum of positive and negative deviations = 0, or $\sum (y_i - \bar{y}) = 0$


Draw an arrow to the mean on the curve to visually see why the sum of positive and negative deviations = 0.

Sum of squares (SS): we square EACH of the deviations, then add them up

Why not the sum of the deviations?

Standard deviation formula:

$s = \sqrt{\frac{\sum (y_i - \bar{y})^2}{n - 1}}$, or the square root of the sum of squared deviations divided by (sample size minus 1)

Think about it: we squared the deviations, so the square root brings the measurement back to the original scale

Practice: work through the calculation for standard deviation for the following sets of data we used earlier to look at means and medians:

Set 1: 3, 3, 4, 5, 5 Set 2: 3, 3, 4, 5, 9

Hint: Go step by step and be sure you can later identify how you did each step to review. Also, as we did with means and medians, compare the two data sets to conceptually understand what is happening.
Hint 2: On the next slide I tried to simplify this for you by outlining the steps to get to standard deviation

1. Calculate the deviation for each score (before this you need to calculate the mean)
2. Square each deviation
3. Sum the squared deviations
4. Divide the sum of squared deviations by $n - 1$ (now you have what is referred to as your variance, or $s^2$)
5. So, what is the final step to go from variance to standard deviation???
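The steps above can be sketched in code as follows, applied to the two practice sets. The function name `sample_sd` is made up for this example; the last step takes the square root, as the formula slide indicates.

```python
from math import sqrt

def sample_sd(values):
    n = len(values)
    mean = sum(values) / n
    deviations = [y - mean for y in values]   # step 1: deviation of each score
    squared = [d ** 2 for d in deviations]    # step 2: square each deviation
    sum_of_squares = sum(squared)             # step 3: sum of squares (SS)
    variance = sum_of_squares / (n - 1)       # step 4: variance (s squared)
    return sqrt(variance)                     # step 5: square root of variance

sd_1 = sample_sd([3, 3, 4, 5, 5])
sd_2 = sample_sd([3, 3, 4, 5, 9])
print(sd_1, round(sd_2, 2))
```

Changing the single score of 5 to 9 leaves most deviations small but makes one deviation large, and squaring magnifies it.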

Properties:

Greater than or equal to 0 (if 0, all observations are the same value)
The more the values vary around the mean, the greater your standard deviation (s)
(n - 1) is used rather than n as an adjustment to correct for sample bias. This results in a slightly larger s.

Standard deviation is the distance of a typical score from the mean, so it is reported along with the mean. But we also need to know about the distribution and think about what seems plausible vs. not plausible.

Percentage of data that lie above or below a certain position

Percentiles: p% of scores fall below or equal to the pth percentile (100 - p gives the percentage above)
The median would be what percentile?

Quartiles: the lower quartile is the 25th percentile and the upper quartile is the 75th

Given the median and the lower and upper quartiles, you now have four parts of a distribution
If we say Bob is in the 75th percentile, how would you simply explain what that means to a non-statistician?

Interquartile range (IQR): difference between upper and lower quartiles

Or the range of the middle half of scores

What happens to standard deviation as IQR increases?

Box plot: displays five pieces of information: median, lower and upper quartiles, maximum and minimum

Figure 1: Boxplot for Time by Tourist [Chicken (1), Beef (2) and Fish (3)] (N = 102) (vertical axis: Time; horizontal axis: chicken = 1, beef = 2, fish = 3)

Outlier: an observation more than 1.5(IQR) above the upper quartile or below the lower quartile

For the upper quartile the formula is:

upper quartile + 1.5(IQR)

For the lower quartile the formula is:

lower quartile - 1.5(IQR)

Box plots mark outliers separately, as they fall outside the whiskers
The cutoff is arbitrary: consider these observations as potential outliers

They might actually not be far removed from values in the long tail
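A minimal sketch of the 1.5(IQR) rule. The data are invented for illustration, and the quartiles come from `statistics.quantiles` with the "inclusive" method; textbooks and software use slightly different quartile conventions, so your book's values may differ a little.

```python
import statistics

# Hypothetical observations, invented for illustration only
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]

# Lower quartile, median, upper quartile
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

lower_cutoff = q1 - 1.5 * iqr   # lower quartile - 1.5(IQR)
upper_cutoff = q3 + 1.5 * iqr   # upper quartile + 1.5(IQR)

potential_outliers = [y for y in data if y < lower_cutoff or y > upper_cutoff]
print(q1, q3, iqr, lower_cutoff, upper_cutoff, potential_outliers)
```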

z-score: standardizes scores or values

$z = \frac{\text{observed score} - \text{mean}}{\text{standard deviation}}$

The z-score tells you how many standard deviations an observation is above or below the mean
If you are subtracting the mean from the observed score, what score is now the mean when standardizing observations?
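As a sketch, the z-score formula can be applied to Set 2 from the standard deviation practice (3, 3, 4, 5, 9), using the sample standard deviation from the standard `statistics` module.

```python
import statistics

scores = [3, 3, 4, 5, 9]
mean = statistics.mean(scores)
sd = statistics.stdev(scores)   # sample standard deviation (divides by n - 1)

# z = (observed score - mean) / standard deviation
z_scores = [(y - mean) / sd for y in scores]
z_mean = sum(z_scores) / len(z_scores)

for y, z in zip(scores, z_scores):
    print(y, round(z, 2))
```

Scores below the mean get negative z-scores and scores above it get positive ones; averaging the z-scores gives a value that is zero up to rounding.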

Association: relationship of values of one variable with values of another (so, you are now considering two variables).

Bivariate analysis: two variables
Explanatory variable: defines groups
Response variable: outcome variable
The analysis studies how the outcome on the response variable depends on or is explained by the value of the explanatory variable (p. 55).

Correlation: measures the strength of the association or straight-line trend

Correlation values range from -1 to +1

Negative means as one goes up the other goes down
Positive means as one goes up so does the other
Visualize and draw the lines
The further from 0, the stronger the association

Regression: allows prediction of the response variable for different values of the explanatory variable
Note: notice I did not use independent and dependent variables. That is because correlation and regression do not necessarily mean the relationship is causal
We will cover correlation and regression in more detail later in the semester
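As a preview, here is a rough sketch of a correlation calculation. It assumes the usual Pearson correlation, which the slides do not spell out, and the data and variable names are invented purely to show a perfectly positive and a perfectly negative straight-line trend.

```python
from math import sqrt

def pearson_r(x, y):
    # Pearson correlation: co-variation of x and y scaled by their spreads
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical explanatory and response values, invented for illustration
explanatory = [1, 2, 3, 4, 5]
response_up = [2, 4, 6, 8, 10]      # goes up as the explanatory variable goes up
response_down = [10, 8, 6, 4, 2]    # goes down as the explanatory variable goes up

r_positive = pearson_r(explanatory, response_up)
r_negative = pearson_r(explanatory, response_down)
print(r_positive, r_negative)
```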

The following problems in your book are due January 30, 2014 (next class):

3.1, 3.2, 3.7, 3.13, 3.23, 3.25
