
Chapter 1: Exploring Data
I. Data Analysis: Making Sense of Data
  A. Individuals: the objects described by a set of data
    i. They may be people, animals, or things.
      1. Ex. In a high school's student database, the students are the individuals.
  B. Variable: any characteristic of an individual
    i. A variable can take different values for different individuals.
      1. Ex. Age, gender, GPA, grade level, and homeroom are all variables in a high school's student database.
    ii. Categorical Variable: places an individual into one of several groups or categories
    iii. Quantitative Variable: takes numerical values for which it makes sense to find an average
      1. Not every variable that takes number values is quantitative.
  C. Most data sets follow this format: each row is an individual, each column is a variable
  D. Categorical variables sometimes have similar counts in each category and sometimes don't
  E. Distribution: the distribution of a variable tells us what values the variable takes and how often it takes these values
  F. Exploring data
    i. Begin by examining each variable by itself
    ii. Move on to study relationships among the variables
    iii. Start with a graph or graphs
    iv. Add numerical summaries
  G. Inference: drawing conclusions that go beyond the data at hand
  H. Probability: the study of chance behavior
II. Analyzing Categorical Data
  A. The distribution of a categorical variable lists the categories and gives either the count or the percent of the individuals who fall in each category.
    i. Frequency Table: displays the counts
    ii. Relative Frequency Table: displays the percents (which add up to 100%, apart from round-off error)
  B. Round-off Error: when percents are added up, the total may not be exactly 100 because each percent was rounded (this is not a mistake, just the effect of rounding off results)
  C. Pie Chart: shows each category as a part of the whole
  D. Bar Graph: compares counts or percents across categories
    i. Bar graphs are more flexible than pie charts; they can compare any set of quantities that are measured in the same units
    ii. When making a bar graph, make the bars equally wide to avoid confusion. Do not confuse a bar's height with its area. The height is the only thing that matters.
    iii. The percents in a bar graph do not necessarily add up to 100.
    iv. A segmented bar graph shows the conditional distribution for each category.
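As a small illustration of frequency tables, relative frequency tables, and round-off error, the sketch below uses a made-up list of grade levels (the data and variable names are assumptions chosen for illustration, not part of the notes).

```python
from collections import Counter

# Hypothetical data: the grade level (a categorical variable) of 7 students.
grade_levels = ["9", "10", "10", "11", "12", "12", "12"]

# Frequency table: the count of individuals in each category.
counts = Counter(grade_levels)
n = sum(counts.values())

# Relative frequency table: the percent in each category, rounded to 1 decimal.
percents = {category: round(100 * count / n, 1) for category, count in counts.items()}

for category in counts:
    print(category, counts[category], percents[category])

# Because each percent was rounded, the total need not be exactly 100.
total_percent = round(sum(percents.values()), 1)
print("Total percent:", total_percent)
```

With these made-up counts the rounded percents total slightly more than 100, which is round-off error rather than a calculation mistake.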

  E. To understand the information in a two-way table, start by looking at the distribution of each variable separately.
    i. The marginal distribution of one of the categorical variables in a two-way table of counts is the distribution of values of that variable among all individuals described by the table (it does not tell you anything about the relationship between the two variables).
    ii. A conditional distribution of a variable describes the values of that variable among individuals who have a specific value of another variable. There is a separate conditional distribution for each value of the other variable.
      1. To find a conditional distribution, divide each cell count by the total of the row or column for the value being conditioned on.
  F. To organize a statistical problem:
    i. State: What's the question that you're trying to answer?
    ii. Plan: How will you go about answering the question? What statistical techniques does this problem call for?
    iii. Do: Create graphs and carry out necessary calculations.
    iv. Conclude: Give a practical conclusion in the setting of the real-world problem.
  G. Association: two variables are associated if specific values of one variable tend to occur in common with specific values of the other
    i. Caution: Even a strong association between two categorical variables can be influenced by variables lurking in the background
  H. Simpson's Paradox: an association between two variables that holds for each individual value of a third variable can be changed or even reversed when the data for all values of the third variable are combined
III. Displaying Quantitative Data with Graphs
  A. Dot Plot: a graph that shows each data value as a dot above its location on a number line
    i. The purpose of the graph is to help us understand the data.
  B. In any graph, always look for the overall pattern and for striking departures from that pattern.
  C. You can describe the overall pattern of any distribution by its shape, center, and spread.
    i. An outlier is an individual value that falls outside the overall pattern.
    ii. Shape: indicate the modes and skewness (skewness is the direction of the long tail)
      1. Symmetric: both sides of the distribution are approximately mirror images of each other
      2. Skewed Right: the right side of the distribution is much longer than the left
      3. Skewed Left: the left side of the distribution is much longer than the right
      4. Unimodal: the distribution has a single peak
      5. Bimodal: the distribution has two clear peaks
      6. Multimodal: the distribution has three or more peaks
IV. Describing Quantitative Data with Numbers
  A. Center: use the mean
  B. Spread: use the range
  C. Many distributions have irregular shapes that are neither symmetric nor skewed
  D. To compare distributions, compare shape, center, spread, and outliers, and use words like "greater than", "less than", and "about equal to"
  E. Stem Plot: a stem-and-leaf plot
    i. To construct a stem plot, split the stems

    ii. A back-to-back stem plot compares two sets of data by placing their leaves on opposite sides of a shared stem. The smaller values always go closest to the stem.
  F. Histograms
    i. To make a histogram:
      1. Divide the range into classes of equal width
      2. Find the count or percent of individuals in each class
      3. Label and scale your axes and draw the histogram
    ii. Do not confuse histograms with bar graphs
    iii. Do not use counts or percents as data
    iv. Use percents instead of counts on the vertical axis when comparing distributions with different numbers of observations
    v. The mean, $\bar{x}$, is the most common measure of the center
      1. $\bar{x} = \frac{\text{sum of observations}}{n} = \frac{\sum x_i}{n}$
      2. $\sum$ = summation
    vi. Use the median when a distribution is skewed and the mean when it is not.
  G. Box Plots
    i. First Quartile ($Q_1$): lies one quarter of the way up the ordered list
    ii. Third Quartile ($Q_3$): lies three quarters of the way up the ordered list
    iii. Interquartile Range (IQR): measures the range of the middle 50% of the data; $IQR = Q_3 - Q_1$
    iv. Outlier rule: call an observation an outlier if it lies more than $1.5 \times IQR$ above the third quartile or below the first quartile
    v. The five-number summary consists of the smallest observation, first quartile, median, third quartile, and largest observation (written in order from smallest to largest)
    vi. To make a box plot, draw a central box from the first to the third quartile. Then, draw a line in the box to mark the median. Draw extension lines from the box to the smallest and largest observations that are not outliers.
    vii. Standard Deviation: measures the typical distance of the observations from their mean
      1. Calculated by finding an average of the squared distances (dividing by $n - 1$) and then taking the square root.
      2. Variance is that average of the squared distances.
      3. Like the range and the IQR, the standard deviation is a measure of spread.
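The numerical summaries above can be sketched in a few lines of Python. The data set is made up, and the quartile convention used here (median of the lower and upper halves, leaving out the overall median when n is odd) is one common classroom convention; software packages may use slightly different rules.

```python
from math import sqrt
from statistics import mean, median, stdev

def five_number_summary(data):
    """Return (min, Q1, median, Q3, max) using the median-of-halves convention."""
    xs = sorted(data)
    n = len(xs)
    half = n // 2
    lower = xs[:half]             # observations below the median position
    upper = xs[half + (n % 2):]   # observations above the median position
    return xs[0], median(lower), median(xs), median(upper), xs[-1]

# Hypothetical data with one unusually large value.
data = [2, 4, 5, 7, 8, 9, 10, 12, 30]

minimum, q1, med, q3, maximum = five_number_summary(data)
iqr = q3 - q1

# 1.5 x IQR rule for outliers.
low_fence = q1 - 1.5 * iqr
high_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < low_fence or x > high_fence]

# Standard deviation: square root of the "average" squared distance (divide by n - 1).
x_bar = mean(data)
variance = sum((x - x_bar) ** 2 for x in data) / (len(data) - 1)
s = sqrt(variance)

print("Five-number summary:", minimum, q1, med, q3, maximum)
print("IQR:", iqr, "Outliers:", outliers)
print("Mean:", round(x_bar, 2), "Std dev:", round(s, 2))
```

In this made-up data set the single large value is flagged as an outlier and pulls the mean above the median, which is why the median is preferred for skewed distributions.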

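Simpson's paradox from the categorical-data section of Chapter 1 is easier to see with numbers. The sketch below uses hypothetical counts of patient deaths at two hospitals, split by the patients' condition on arrival (the third variable); every number and name is an assumption made for illustration.

```python
# Hypothetical two-way tables: (deaths, total patients) by hospital and condition.
counts = {
    "A": {"good": (6, 600), "poor": (57, 1500)},
    "B": {"good": (8, 600), "poor": (8, 200)},
}

def rate(deaths, total):
    """Conditional proportion of deaths among the given group of patients."""
    return deaths / total

# Death rate within each condition (conditional distributions).
by_condition = {
    hospital: {cond: rate(*cell) for cond, cell in table.items()}
    for hospital, table in counts.items()
}

# Death rate after combining both conditions.
overall = {
    hospital: rate(
        sum(deaths for deaths, _ in table.values()),
        sum(total for _, total in table.values()),
    )
    for hospital, table in counts.items()
}

for hospital in counts:
    print(hospital, by_condition[hospital], "overall:", round(overall[hospital], 4))
```

With these counts, Hospital A has the lower death rate within each condition yet the higher death rate overall, because it treats far more patients who arrive in poor condition.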
Chapter 2: Modeling Distributions of Data
I. Describing Location in a Distribution
  A. Percentile: the pth percentile of a distribution is the value with p percent of the observations less than it
  B. Relative Frequency: divide the count in each class by the sample size, then multiply by 100 to convert to a percent
  C. Cumulative Relative Frequency: divide the entries in the cumulative frequency column by the sample size, and multiply by 100 to convert to a percent
  D. Cumulative Relative Frequency Graph: plot a point corresponding to the cumulative relative frequency in each class at the smallest value of the next class.
    i. Used to describe the position of an individual within a distribution or to locate a specified percentile of the distribution.
  E. Standardizing: converting observations from original values to standard deviation units
    i. A standardized value is a z-score.
    ii. If x is an observation from a distribution that has known mean and standard deviation, the standardized value of x is
      1. z = (x - mean) / (standard deviation)
        a. The z-score tells us how many standard deviations away from the mean an observation falls and in what direction
        b. Positive z-scores mean that observations are larger than the mean
        c. Negative z-scores mean that observations are smaller than the mean
    iii. Standardized values are useful for comparing things such as height at different ages.
      1. Ex. To compare a brother's and a sister's heights relative to their ages, standardize each height using the distribution for that child's age group and then compare the z-scores.
  F. Data Transformations
    i. Effect of adding or subtracting a constant
      1. Adding the same number to each observation adds that number to the mean, median, quartiles, and percentiles
      2. It does not change the shape of the distribution or measures of spread such as the range, IQR, or standard deviation
    ii. Effect of multiplying or dividing by a constant
      1. Multiplying or dividing each observation by the same number will multiply or divide the mean, median, quartiles, and percentiles by that number.
      2. It will also multiply or divide the measures of spread such as range, IQR, and standard deviation by the absolute value of that number.
      3. It will not change the shape of the distribution.
  G. Exploring quantitative data
    i. Always plot the data with a graph: a dot plot, stem plot, histogram, etc.
    ii. Look for the overall pattern and striking departures from the pattern (shape, center, spread, and outliers).
    iii. Calculate a numerical summary to briefly describe center and spread.
    iv. When the overall pattern is regular enough, describe it with a smooth curve (a density curve).
  H. Density Curves
    i. Always on or above the horizontal axis
    ii. Has an area of exactly one underneath it

    iii. Describes the overall pattern of a distribution
    iv. The area under the curve and above any interval of values on the horizontal axis is the proportion of all observations that fall in that interval
    v. They come in many shapes, and outliers are not described by the curve.
    vi. Often a good description of the overall pattern of a distribution.
    vii. The median of a density curve is the equal-areas point; half of the area is to the left and half of the area is to the right of the point.
    viii. The mean of a density curve is the balancing point
    ix. In a skewed density curve, the mean is pulled farther toward the long tail than the median; in a symmetric curve, the mean and median are equal.
    x. $\mu$: the usual notation for the mean of a density curve
    xi. $\sigma$: the usual notation for the standard deviation of a density curve
II. Normal Distributions
  A. Normal Curves: one of the most important classes of density curves; a normal curve describes a normal distribution
    i. All have the same overall shape
    ii. These curves are symmetric, single peaked, and bell shaped.
    iii. Any specific normal curve is completely described by its mean and standard deviation.
    iv. The mean is located at the center and is the same as the median.
    v. Changing the mean without changing the standard deviation moves the normal curve along the horizontal axis without changing its spread.
    vi. Standard deviation is the natural measure of spread; it is the distance from the center to the change-of-curvature points on either side.
    vii. Examples of approximately normal distributions include scores on tests like the SAT and IQ tests, repeated measurements of the same quantity, and characteristics of biological populations.
    viii. They are good approximations of the results of many kinds of chance outcomes
    ix. Many sets of data still do not follow a normal distribution
  B. Empirical Rule
    i. It is also known as the 68-95-99.7 rule.
    ii. Approximately 68% of the observations fall within $1\sigma$ of the mean
    iii. Approximately 95% of the observations fall within $2\sigma$ of the mean
    iv. Approximately 99.7% of the observations fall within $3\sigma$ of the mean
  C. Chebyshev's Inequality: in any distribution, the proportion of observations falling within k standard deviations of the mean is at least $1 - \frac{1}{k^2}$
  D. Real data are never exactly normal
  E. Standard Normal Distribution: the normal distribution with a mean of 0 and standard deviation of 1
    i. To standardize an observation x from a normal distribution, use $z = \frac{x - \mu}{\sigma}$
    ii. Standard Normal Table: a table of areas under the standard normal curve; each entry is the area to the left of the z-score
      1. Pay attention to whether you need the area to the left or to the right of the z-value; for the area to the right, subtract the table entry from 1.
  F. Normal Probability Plot: provides a good assessment of whether a data set follows a normal distribution
    i. Arrange the data from smallest to largest
    ii. Find the z-score corresponding to each observation's percentile

    iii. Plot each observation x against its z-score
    iv. Look for shapes that show clear departures from normality
    v. If the plot is close to a straight line, then the data are close to normal
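The steps for a normal probability plot can be sketched without a plotting library by computing the (z, x) pairs that would be plotted. The heights below are made up, and assigning the i-th smallest of n observations the percentile (i - 0.5)/n is an assumption; textbooks and software use several slightly different conventions.

```python
from statistics import NormalDist

def normal_plot_points(data):
    """Return (z, x) pairs for a normal probability plot of the data."""
    xs = sorted(data)                      # step i: smallest to largest
    n = len(xs)
    std_normal = NormalDist(mu=0, sigma=1)
    points = []
    for i, x in enumerate(xs, start=1):
        percentile = (i - 0.5) / n         # assumed percentile convention
        z = std_normal.inv_cdf(percentile) # step ii: z-score for that percentile
        points.append((z, x))              # step iii: plot x against z
    return points

# Hypothetical heights in inches.
heights = [61, 64, 66, 67, 68, 69, 70, 71, 73, 76]
points = normal_plot_points(heights)

for z, x in points:
    print(f"{z:6.2f}  {x}")
```

If these pairs fall close to a straight line when graphed, the data are close to normal; strong curvature or isolated points far from the line suggest skewness or outliers.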

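To tie together the z-score formula, the 68-95-99.7 rule, and Chebyshev's inequality from Chapter 2, the sketch below uses Python's `statistics.NormalDist` in place of a printed standard normal table. The test-score example (mean 500, standard deviation 100) is hypothetical.

```python
from statistics import NormalDist

def z_score(x, mean, sd):
    """Standardize x: how many standard deviations it lies from the mean."""
    return (x - mean) / sd

std_normal = NormalDist(mu=0, sigma=1)

# Proportion of a normal distribution within k standard deviations of the mean,
# next to the Chebyshev lower bound that holds for any distribution.
within = {}
chebyshev = {}
for k in (1, 2, 3):
    within[k] = std_normal.cdf(k) - std_normal.cdf(-k)
    chebyshev[k] = 1 - 1 / k**2
    print(k, round(within[k], 4), round(chebyshev[k], 4))

# Hypothetical example: a score of 650 when scores have mean 500 and sd 100.
z = z_score(650, 500, 100)
area_left = std_normal.cdf(z)   # what a standard normal table reports
area_right = 1 - area_left      # area to the right of z
print(z, round(area_left, 4), round(area_right, 4))
```

The normal proportions come out near 68%, 95%, and 99.7%, each above the corresponding Chebyshev bound, as expected since Chebyshev's inequality makes no assumption about shape.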
Chapter 3: Describing Relationships
I. Scatterplots and Correlation
  A. Response Variable: measures an outcome of a study
  B. Explanatory Variable: helps explain or influence changes in a response variable
  C. In many studies, the goal is to show that changes in one or more explanatory variables cause changes in a response variable
  D. Scatter Plot: shows the relationship between two quantitative variables measured on the same individuals
    i. Both variables must be quantitative
    ii. The most useful graph for displaying the relationship between two quantitative variables
    iii. To read off the response for a given value of the explanatory variable, go to that value on the horizontal axis, move up until you find a dot, then move over horizontally to get the response value.
    iv. Each individual in the data appears as a point in the graph
    v. Always plot the explanatory variable on the x axis and the response variable on the y axis
    vi. To make a scatter plot:
      1. Decide which variable should go on each axis
      2. Label and scale the axes
      3. Plot individual data values
    vii. To interpret scatter plots:
      1. Direction: indicate which way the pattern moves
        a. Use words such as "negative association" or "positive association"
      2. Form: indicate the shape of the scatterplot (linear or curved) and any clusters of data
      3. Strength: how closely the points of the scatterplot follow a clear form (use words like "moderately strong")
      4. Indicate any outliers
    viii. You can describe the overall pattern of a scatterplot by the direction, form, and strength of the relationship.
  E. Associations and Relationships
    i. Positive Association: above-average values of one variable tend to accompany above-average values of the other, and below-average values also tend to occur together
    ii. Negative Association: above-average values of one variable tend to accompany below-average values of the other, and vice versa
    iii. Not all relationships have a clear direction that we can describe as a positive or negative association
    iv. Association does not imply causation.
      1. We cannot rule out that a lurking variable influenced the results.
    v. Our eyes are not good judges of how strong a linear relationship is
    vi. Correlation: the correlation, r, measures the strength and direction of a linear relationship between two quantitative variables
      1. It is always a number between -1 and 1.
      2. If r > 0, then the relationship has a positive association.
      3. If r < 0, then the relationship has a negative association.
      4. If r is close to 0, then the linear relationship is weak.
      5. If r = 1 or -1, then the relationship is exactly linear.

      6. Strength of the relationship increases as r moves away from 0 and toward 1 or -1
      7. The formula for r:
        a. $r = \frac{1}{n-1} \sum \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right)$

      8. A value of r close to 1 or -1 does not guarantee a linear relationship; it could be curved
      9. Correlation makes no distinction between explanatory and response variables
      10. r does not change if we change the units of x and/or y
      11. r itself has no unit of measurement; it is just a number
      12. Requires both variables to be quantitative
      13. Correlation only measures the strength of linear relationships; it does not describe curved relationships no matter how strong the curved relationship is
      14. Correlation is not resistant: outliers and influential points can strongly affect it
      15. Correlation is not a complete summary of two-variable data
        a. A summary should give the mean and standard deviation of both x and y [separately] along with the correlation.
        b. A scatterplot cannot be replaced by a numerical summary.
  F. Least-Squares Regression
    i. Regression Line: a line that describes how a response variable changes as an explanatory variable changes. It is used to predict y given a value of x.
    ii. Regression requires that you have an explanatory and a response variable
    iii. A regression line is a model for the data, similar to density curves.
    iv. Formula for a regression line
      1. $\hat{y} = a + bx$
      2. $\hat{y}$: the predicted value of the response variable y for a given value of the explanatory variable x
      3. $b$: the slope (the amount y is predicted to change when x increases by one unit)
      4. $a$: the y intercept (the predicted value of y when x = 0)
    v. A small slope does not mean that there is a weak relationship.
      1. The size of the slope of a regression line does not show how important a relationship is
    vi. Extrapolation: the use of a regression line for prediction far outside the interval of values of the explanatory variable x used to obtain the line
      1. It is often inaccurate. Do not make predictions using values of x that are much larger or much smaller than those that actually appear in the data.
    vii. A good regression line makes the vertical distances from the line to the points as small as possible.
    viii. Residual: the difference between an observed value of the response variable and the value predicted by the regression line (residual = $y - \hat{y}$)
    ix. Least-Squares Regression Line (LSRL): the line that makes the sum of the squared residuals as small as possible
      1. Equation of the least-squares regression line for explanatory variable x, response variable y, and n individuals:
        a. The slope is $b = r \left( \frac{s_y}{s_x} \right)$
        b. The y intercept is $a = \bar{y} - b\bar{x}$
    x. Residual Plot: a scatterplot of the residuals against the explanatory variable

      1. A residual plot with no pattern means that a linear model is appropriate for the data
      2. Residuals should be relatively small in size
      3. Examining residuals tells us how well the regression line fits the data
      4. Standard deviation of residuals:
        a. If using a LSRL, then the standard deviation of the residuals is: $s = \sqrt{\frac{\sum (y_i - \hat{y}_i)^2}{n - 2}}$
        b. Gives the approximate size of a typical prediction error (residual)
    xi. The coefficient of determination is $r^2$
      1. The formula is
        a. $SSE = \sum (y_i - \hat{y}_i)^2$
        b. $r^2 = 1 - \frac{SSE}{SST}$
        c. $SST = \sum (y_i - \bar{y})^2$
      2. If all points fall directly on the LSRL, then $SSE = 0$ and $r^2 = 1$.
    xii. Correlation and regression lines describe only linear relationships (Anscombe's data)
    xiii. Outlier: an observation that lies outside the overall pattern of the other observations
    xiv. Influential Point: an observation is influential if removing it would markedly change the result of the calculation
    xv. An association between an explanatory variable and a response variable, even if it is very strong, is not by itself good evidence that changes in the explanatory variable actually cause changes in the response variable.
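The correlation and least-squares formulas from this chapter can be checked on a small example. The sketch below computes r from standardized values, then the slope, intercept, residuals, standard deviation of the residuals, and the coefficient of determination; the five data points are made up for illustration.

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical data: explanatory variable x and response variable y.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(x)

x_bar, y_bar = mean(x), mean(y)
s_x, s_y = stdev(x), stdev(y)   # sample standard deviations (divide by n - 1)

# Correlation: average product of standardized values, dividing by n - 1.
r = sum(((xi - x_bar) / s_x) * ((yi - y_bar) / s_y) for xi, yi in zip(x, y)) / (n - 1)

# Least-squares regression line y_hat = a + b*x.
b = r * s_y / s_x
a = y_bar - b * x_bar

predicted = [a + b * xi for xi in x]
residuals = [yi - y_hat for yi, y_hat in zip(y, predicted)]  # observed - predicted

sse = sum(e ** 2 for e in residuals)          # sum of squared residuals
sst = sum((yi - y_bar) ** 2 for yi in y)      # total variation in y
r_squared = 1 - sse / sst                     # coefficient of determination
s = sqrt(sse / (n - 2))                       # standard deviation of the residuals

print("r =", round(r, 4), " slope =", round(b, 3), " intercept =", round(a, 3))
print("residuals:", [round(e, 3) for e in residuals])
print("r^2 =", round(r_squared, 4), " s =", round(s, 3))
```

Two useful checks follow from the definitions: the residuals from a least-squares line sum to zero (up to rounding), and 1 - SSE/SST equals the square of r.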

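To see why correlation and the regression line are not resistant, the sketch below fits a least-squares line to a small made-up data set with and without one extra point that is far out in the x direction. The helper name `lsrl` and all data values are assumptions for illustration.

```python
from math import sqrt
from statistics import mean

def lsrl(x, y):
    """Return (intercept a, slope b, correlation r) of the least-squares line."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b = sxy / sxx                 # algebraically the same as r * s_y / s_x
    a = y_bar - b * x_bar
    r = sxy / sqrt(sxx * syy)
    return a, b, r

# Hypothetical data with a clear positive linear pattern.
x = [1, 2, 3, 4, 5, 6]
y = [2, 3, 5, 4, 6, 7]

# The same data plus one point far to the right that does not follow the pattern.
x_with = x + [20]
y_with = y + [3]

a_without, b_without, r_without = lsrl(x, y)
a_added, b_added, r_added = lsrl(x_with, y_with)

print("without point: slope", round(b_without, 3), "r", round(r_without, 3))
print("with point:    slope", round(b_added, 3), "r", round(r_added, 3))
```

In this example, adding the single point changes the slope and correlation from clearly positive to negative, so the point is influential: removing it markedly changes the result of the calculation.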