Be able to explain data using descriptive statistics, as well as display it using graphs and tables. Describe the characteristics of distributions according to their plots and/or graphs.
Be able to explain the properties of, and calculate, descriptive statistics (e.g., mean, median, mode, standard deviation, range, z scores, etc.).
Identify and explain outliers. Understand bivariate descriptive statistics (we will cover more of this later in the semester, but it will be introduced and discussed in this chapter).
To move from proportion to percentage, do you have to physically multiply by 100 using a calculator or pencil and paper?
Table 1: Visitors to XYZ
Number     Proportion   Percentage
100,000    0.40         ____
80,000     ____         32
____       0.18         18
____       0.10         ____
250,000    1.00         100
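As a rough sketch of how the columns of Table 1 relate, the snippet below turns counts into proportions and percentages. The counts 100,000 and 80,000 and the total of 250,000 come from the table; the counts 45,000 and 25,000 are an assumption, derived here from the listed proportions 0.18 and 0.10.

```python
# Counts per origin. 100,000 and 80,000 appear in Table 1; 45,000 and 25,000
# are assumed values derived from the listed proportions 0.18 and 0.10.
counts = [100_000, 80_000, 45_000, 25_000]
total = sum(counts)  # 250,000, the table total

# Proportion = count / total; percentage = proportion * 100.
proportions = [c / total for c in counts]
percentages = [round(100 * c / total, 1) for c in counts]

for c, p, pct in zip(counts, proportions, percentages):
    print(f"{c:>7,}  proportion {p:.2f}  percentage {pct:.0f}")
```

Moving from a proportion to a percentage only shifts the decimal point two places to the right, so no calculator is needed.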
Bar graph uses rectangular bars to represent the relative frequency for each category.
Figure 1: Relative Frequency of Visitors to XYZ
[Bar graph: y-axis is relative frequency, from 0 to 0.45; categories are New York, California, Florida, Texas]
Often used to visually present categorical data.
The origin of visitors example put the origins in descending order, but we may have variables where the categories determine the order, such as household structure as presented in your text, where the bars are not descending.
Another example would be level of education (e.g., high
Another way to present categorical data is a pie chart, which is a circle with slices representing the categories.
Frequency distribution
Summarize by creating sets of intervals (e.g., for age we could use: 18-29, 30-39, 40-49, etc.)
Keep in mind, we can collapse data to lower levels of measurement, but not vice versa
Intervals are typically equal in width
Intervals cover all possible values
Intervals are mutually exclusive: each observation fits into only one interval
Is 20-30, 30-40, 40-50 mutually exclusive?
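A minimal sketch of building a frequency distribution from intervals follows. The intervals 18-29, 30-39, 40-49 are the ones from the slide; the ages and the helper name `matching_intervals` are hypothetical, chosen only for illustration.

```python
from collections import Counter

# Intervals from the slide (bounds inclusive); the ages are hypothetical.
intervals = [(18, 29), (30, 39), (40, 49)]
ages = [18, 22, 25, 29, 30, 31, 38, 40, 44, 49]

def matching_intervals(value, bounds):
    """Return every (low, high) interval that contains value, bounds inclusive."""
    return [(low, high) for low, high in bounds if low <= value <= high]

counts = Counter()
for age in ages:
    matches = matching_intervals(age, intervals)
    # Mutually exclusive and covering all values: exactly one interval fits.
    assert len(matches) == 1
    counts[matches[0]] += 1

relative = {interval: counts[interval] / len(ages) for interval in intervals}
for (low, high), share in relative.items():
    print(f"{low}-{high}: n = {counts[(low, high)]}, relative frequency = {share:.2f}")

# Written as 20-30, 30-40, 40-50, the value 30 fits two intervals at once.
overlapping = [(20, 30), (30, 40), (40, 50)]
print(matching_intervals(30, overlapping))
```

The last line shows why shared endpoints are a problem: an observation of exactly 30 would be counted twice unless one of the boundaries is excluded.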
Similar to a bar graph, but for a quantitative variable where the observations are grouped into intervals.
Choosing the number of intervals is based on common sense.
Might have only a few values and not need to create intervals (e.g., discrete variable)
Histogram of Relative Frequencies for Number of Times Visiting XYZ in Past 5 Years
[Histogram: y-axis is relative frequency, from 0 to 0.6; intervals are 0, 1-4, 5-9, 10+]
Stem is the first digit and leaf the second
Similar to a histogram but displayed on its side
Presents information not displayed in a histogram, for example.
How many total observations are in the stem-and-leaf plot below?
How would we calculate the relative frequency for an interval of 20-29?
[Stem-and-leaf plot: columns are Stem and Leaf]
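To make the two questions concrete, here is a small sketch that builds a stem-and-leaf display and reads a relative frequency off it. The observations are hypothetical two-digit values, not the data from the plot on the slide.

```python
from collections import defaultdict

# Hypothetical two-digit observations; not the data from the slide's plot.
observations = [12, 15, 21, 23, 23, 28, 30, 34, 41, 47]

stems = defaultdict(list)
for value in sorted(observations):
    stem, leaf = divmod(value, 10)  # first digit is the stem, second the leaf
    stems[stem].append(leaf)

for stem in sorted(stems):
    print(f"{stem} | {' '.join(str(leaf) for leaf in stems[stem])}")

# Total observations = total number of leaves.
total = sum(len(leaves) for leaves in stems.values())
# Relative frequency for 20-29 = leaves on stem 2 divided by the total.
relative_frequency_20s = len(stems[2]) / total
print(total, relative_frequency_20s)
```

Counting leaves gives the total number of observations, and the leaves on a single stem give the frequency for that interval.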
Stem-and-leaf plot: put one group to the right of the stem and the other to the left (see example on page 37 of your book)
Bar graphs and histograms: use different color bars to represent different groups
Population distribution: represents the entire population
Sample data distribution: a (fuzzy) glimpse of the population
U-shaped: two peaks at the low and high ends of the scores
Bell-shaped: most observations fall near the middle of the values
These distributions are symmetric: the left and right sides are mirror images
How would you describe the shapes below in layman's terms?
(Page 38)
Skewed to the right: a number of observations in the tail on the right side
Skewed to the left: a number of observations in the tail on the left side
Right or left skew is determined by which tail is longer
How would you describe the shapes below in layman's terms?
(Page 38)
Also known as _________________?
Frequently used and probably best known
Your GPA is an average of what?
Formulas:
$\bar{y} = \frac{y_1 + y_2 + y_3 + \cdots + y_n}{n}$
Where $y_1$, $y_2$, $y_3$, etc. are the observations and $n$ is the number of observations
$\sum y_i = y_1 + y_2 + y_3 + \cdots + y_n$
Where $\sum$ (sigma) is the process of summing, y is the
$\bar{y} = \frac{\sum y_i}{n}$
Properties
Cannot calculate for nominal-level variables
Can calculate for ordinal-level variables if we assign values (e.g., 1 =
Outliers: observation(s) well above or below the majority of scores that influence the mean
Short example: calculate the mean for each of the small sets of observations of, say, agreement per the above scale:
Set 1: 2, 1, 2, 2, 1
Set 2: 2, 1, 5, 2, 1
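A quick sketch of this calculation, using Set 1 (2, 1, 2, 2, 1) and Set 2 (2, 1, 5, 2, 1) from the slide; the function name `mean` is only illustrative.

```python
set_1 = [2, 1, 2, 2, 1]
set_2 = [2, 1, 5, 2, 1]  # same as Set 1 except for a single 5

def mean(values):
    """Sum of the observations divided by the number of observations."""
    return sum(values) / len(values)

mean_1 = mean(set_1)
mean_2 = mean(set_2)
print(mean_1, mean_2)
```

The two sets differ by one observation, yet the means differ noticeably, which is the influence of an outlier on the mean.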
skewed distribution are they found? Asked another way, which way do outliers pull the mean?
Is it pulled in the direction of the longer or the shorter tail of a skewed distribution?
Point of balance
Separate observations into two groups: those above and those below the mean
Calculate the distance of each score from the mean
Add up the distances for observations above the mean and for those below the mean; they will be equal
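The balance-point idea can be checked numerically. The sketch below uses the scores 2, 1, 5, 2, 1 as an illustrative data set (an assumption; any set of numbers would do).

```python
scores = [2, 1, 5, 2, 1]  # illustrative data
mean = sum(scores) / len(scores)

# Total distance from the mean for scores above it and for scores below it.
distance_above = sum(y - mean for y in scores if y > mean)
distance_below = sum(mean - y for y in scores if y < mean)

print(mean, distance_above, distance_below)
```

The two totals agree up to floating-point rounding, which is what it means for the mean to be the point of balance.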
Let's calculate the individual and weighted averages for the following sets of observations and see what happens:
Set 1 (n1): 2, 2, 1, 2
Set 2 (n2): 3, 4, 5, 4, 5, 5, 4, 4
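One way to work through this, sketched under the assumption that "weighted average" means weighting each set's mean by its number of observations:

```python
set_1 = [2, 2, 1, 2]                 # n1 = 4
set_2 = [3, 4, 5, 4, 5, 5, 4, 4]     # n2 = 8

n1, n2 = len(set_1), len(set_2)
mean_1 = sum(set_1) / n1
mean_2 = sum(set_2) / n2

# Unweighted: treats both sets as equally important.
simple_average = (mean_1 + mean_2) / 2
# Weighted: each mean counts in proportion to its number of observations.
weighted_average = (n1 * mean_1 + n2 * mean_2) / (n1 + n2)
# Mean of all twelve observations pooled together, for comparison.
pooled_mean = sum(set_1 + set_2) / (n1 + n2)

print(mean_1, mean_2, simple_average, weighted_average, pooled_mean)
```

The weighted average matches the mean of the pooled observations, while the simple average of the two means does not, because Set 2 has twice as many observations.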
Simply put, it is the middle observation when observations are in ascending order
If you have an even number of observations, it is the midpoint of the two middle observations
So, the median for the following set is? 3, 3, 4, 4, 6, 6, 7, 7
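Conceptually, the position of the median is (n + 1) / 2
This gives the position of the median in the ordered list: a whole number for an odd n, and a value ending in .5 for an even n (the median is then the midpoint of the two observations on either side of that position)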
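A small sketch of locating the median through its position, (n + 1) / 2, using the set 3, 3, 4, 4, 6, 6, 7, 7 from the slide. The helper name `median_by_position` is hypothetical; `statistics.median` from the Python standard library is used as a cross-check.

```python
import statistics

def median_by_position(values):
    """Return (position, median), where position is the 1-based (n + 1) / 2."""
    ordered = sorted(values)
    n = len(ordered)
    position = (n + 1) / 2
    if n % 2 == 1:
        # Odd n: position is a whole number, so take that observation.
        return position, ordered[int(position) - 1]
    # Even n: position ends in .5, so take the midpoint of the two middle values.
    lower = ordered[n // 2 - 1]
    upper = ordered[n // 2]
    return position, (lower + upper) / 2

observations = [3, 3, 4, 4, 6, 6, 7, 7]
position, median = median_by_position(observations)
print(position, median, statistics.median(observations))
```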
Median is useful to describe ordinal data, while the mean is not as meaningful
Simple example:
High school diploma (n = 2)
College degree (n = 4)
Master's degree (n = 3)
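The same position rule works for ordered categories. This sketch expands the counts from the slide into one entry per person, in category order, and picks the middle one; the variable names are illustrative.

```python
# Categories in their logical order, with the counts from the slide.
education_counts = [
    ("High school diploma", 2),
    ("College degree", 4),
    ("Master's degree", 3),
]

# One entry per person, already in ascending order of education.
ordered = [level for level, count in education_counts for _ in range(count)]

n = len(ordered)                   # 9 observations
position = (n + 1) // 2            # odd n, so the position is a whole number
median_level = ordered[position - 1]
print(n, position, median_level)
```

No arithmetic on the categories is needed, only their order, which is why the median suits ordinal data.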
Properties:
Appropriate for quantitative and ordinal variables, but not nominal (e.g., gender cannot be put into a logical order)
In symmetrical distributions, the mean and median will be identical (why?)
In skewed distributions, will the mean or the median be pulled closer to the longer tail?
Not influenced by distance of scores from the mean because it uses ordinal properties only
Set 1: 3, 3, 4, 5, 5
Set 2: 3, 3, 4, 5, 9
Which of the above would look more like a bell-shaped curve? Which might have a long tail?
Is the mean or median more affected by outliers?
Which (mean or median) is more appropriate if your data are highly skewed?
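To compare the two sets directly, a short sketch using the standard library's `statistics.mean` and `statistics.median` on Set 1 (3, 3, 4, 5, 5) and Set 2 (3, 3, 4, 5, 9):

```python
import statistics

set_1 = [3, 3, 4, 5, 5]
set_2 = [3, 3, 4, 5, 9]  # the last observation is changed from 5 to 9

mean_1, median_1 = statistics.mean(set_1), statistics.median(set_1)
mean_2, median_2 = statistics.mean(set_2), statistics.median(set_2)

print("Set 1:", mean_1, median_1)
print("Set 2:", mean_2, median_2)
```

Changing a single observation moves the mean but leaves the median where it was.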
(Page 43)
What would a bar graph of this look like? How about the shape of a curve if it was drawn (think about
Keep in mind that the mean, median, and mode are complementary, but as general guidelines:
Nominal: mode
Ordinal: median
Interval and ratio: mean if normally distributed, median if skewed
Keep in mind that the mode and median are possible for higher levels than indicated above (e.g., mode for nominal and higher, median for ordinal and higher). Could also calculate a mean for nominal data if given values, but is it meaningful?
Variability: how observations vary or spread around the typical observation or measure of the center
What is the low, high, and range of the following set of observations?
Two data sets can have the same mean but different variability characteristics (let's look at an example)
Above the mean, the deviation will be positive
Below the mean, the deviation will be negative
Conceptually,
Think about it: we squared the deviations, so the square root brings the measurement back to the same level
Practice: work through the calculation of the standard deviation for the following sets of data, which we used earlier to look at means and medians:
Set 1: 3, 3, 4, 5, 5
Set 2: 3, 3, 4, 5, 9
Hint: Go step by step and be sure you can later identify how you did each step to review. Also, as we did with means and medians, compare the two data sets to conceptually understand what is happening.
Hint 2: On the next slide I tried to simplify this for you by outlining the steps to get to the standard deviation
1. Calculate the deviation for each score (before this you need to calculate the mean)
2. Square each deviation
3. Sum the squared deviations
4. Divide the sum of squared deviations by n - 1 (now you have what is referred to as your variance, or s^2)
5. So, what is the final step to go from variance to standard deviation?
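The steps can be sketched in code as follows, applied to the two practice sets (3, 3, 4, 5, 5 and 3, 3, 4, 5, 9). The function name is hypothetical, and the last step takes the square root, as described earlier; `statistics.stdev` is used only as a cross-check.

```python
import math
import statistics

def sample_standard_deviation(values):
    """Return (variance, standard deviation) using the n - 1 divisor."""
    n = len(values)
    mean = sum(values) / n
    deviations = [y - mean for y in values]      # step 1: deviation of each score
    squared = [d ** 2 for d in deviations]       # step 2: square each deviation
    sum_of_squares = sum(squared)                # step 3: sum the squared deviations
    variance = sum_of_squares / (n - 1)          # step 4: divide by n - 1
    return variance, math.sqrt(variance)         # step 5: square root of the variance

set_1 = [3, 3, 4, 5, 5]
set_2 = [3, 3, 4, 5, 9]

variance_1, s_1 = sample_standard_deviation(set_1)
variance_2, s_2 = sample_standard_deviation(set_2)
print(variance_1, s_1)
print(variance_2, s_2)
print(statistics.stdev(set_1), statistics.stdev(set_2))
```

The sets share four of their five values, but the single larger observation in Set 2 increases its standard deviation considerably.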
Properties:
Greater than or equal to 0 (if 0, all observations are the same value)
The more the values vary around the mean, the greater your standard deviation (s)
(n - 1) is used rather than n as an adjustment to correct for sample bias; this results in a larger s
Standard deviation is the distance of a typical score from the mean, so reported along with the mean. But, we also need to know about the distribution and think about what seems plausible vs. not plausible.
Percentiles: p% of scores fall at or below the pth percentile (100 - p gives the percentage above)
The median would be what percentile?
You now have four parts of a distribution
If we say Bob is in the 75th percentile, how would you simply explain what that means to a non-statistician?
Display five pieces of information: median, lower and upper quartiles, maximum and minimum
Figure 1: Boxplot for Time by Tourist [Chicken (1), Beef (2) and Fish (3)] (N = 102)
[Boxplot: y-axis is Time, with ticks from 4.00 to 16.00; x-axis shows the groups, with ticks from 1.00 to 3.00]
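Plus or minus 1.5(IQR): an observation is a potential outlier if it falls more than 1.5(IQR) above the upper quartile or more than 1.5(IQR) below the lower quartile
Box plots mark these observations separately, as they fall outside the whiskers
The cutoff is arbitrary: consider them potential outliers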
Might actually not be far removed from values in the long tail
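A rough sketch of the five-number summary and the 1.5(IQR) check follows. The data are hypothetical, and the quartiles come from `statistics.quantiles` with `method="inclusive"` (Python 3.8+), which is only one of several conventions; your textbook or software may compute quartiles slightly differently.

```python
import statistics

# Hypothetical observations; not the data behind the boxplot on the slide.
times = [2, 4, 5, 6, 7, 8, 9, 10, 30]

# Lower quartile, median, upper quartile (one of several quartile conventions).
q1, median, q3 = statistics.quantiles(times, n=4, method="inclusive")
five_number_summary = (min(times), q1, median, q3, max(times))

iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
potential_outliers = [t for t in times if t < lower_fence or t > upper_fence]

print(five_number_summary)
print(iqr, lower_fence, upper_fence, potential_outliers)
```

An observation flagged this way is only a candidate for a closer look, not automatically an error.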
The z score tells you how many standard deviations an observation is above or below the mean
If you are subtracting the mean from the observed score, what score is now the mean when standardizing observations?
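As an illustration, the sketch below standardizes the scores 3, 3, 4, 5, 5 (an illustrative data set) by subtracting the mean and dividing by the sample standard deviation.

```python
import statistics

scores = [3, 3, 4, 5, 5]  # illustrative data
mean = statistics.mean(scores)
s = statistics.stdev(scores)  # sample standard deviation (n - 1 divisor)

# z = (observation - mean) / standard deviation
z_scores = [(y - mean) / s for y in scores]
mean_of_z = sum(z_scores) / len(z_scores)

print(z_scores)
print(mean_of_z)
```

A score equal to the mean gets z = 0, scores below the mean get negative z values, and scores above it get positive ones.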
Association: relationship of values of one variable with values of another (so, you are now considering two variables).
Bivariate analysis: two variables
Explanatory variable: defines groups
Response variable: outcome variable
The analysis studies how the outcome on the response variable depends on or is explained by the value of the explanatory variable (p. 55).
Negative means as one goes up, the other goes down
Positive means as one goes up, so does the other
Visualize and draw the lines
The further from 0, the stronger the association
Regression allows prediction of the response variable for different values of the explanatory variable
Note: notice I did not use independent and dependent variables. That is because correlation and regression do not necessarily mean the relationship is causal.
We will cover correlation and regression in more detail later in the semester
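As a preview only, here is a minimal sketch of a correlation calculation, assuming the measure of association meant here is the Pearson correlation. The data and the function name `pearson_r` are hypothetical, chosen so that the two relationships are perfectly linear.

```python
import math

def pearson_r(x, y):
    """Pearson correlation computed from deviations around each mean."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: the response rises, or falls, as the explanatory variable rises.
explanatory = [1, 2, 3, 4, 5]
response_up = [2, 4, 6, 8, 10]
response_down = [10, 8, 6, 4, 2]

r_positive = pearson_r(explanatory, response_up)
r_negative = pearson_r(explanatory, response_down)
print(r_positive, r_negative)
```

The sign gives the direction of the association, and the distance from 0 gives its strength.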
The following problems in your book are due January 30, 2014 (next class):