
Statistics

Chapter 1

What is Statistics?

o The art and science of designing studies and analyzing the data that those studies produce
o Its ultimate goal is translating data into knowledge and understanding of the world around us
o Basically: the art and science of learning from data

Statistical methods
o Design: planning how to obtain data to answer the question of interest
o Description: summarizing the data obtained
o Inference: making decisions and predictions based on that data

Descriptive statistics
o Uses numerical and graphical methods to look for patterns, to summarize, and to present the information in sample data

Inferential statistics
o Uses sample data to make estimates, decisions, or other predictions about a population and to measure their reliability (needs probability)

Introduction to basic terms
o Population: the entire collection of subjects (persons or objects) whose properties are to be analyzed
o Sample: a subset of the population
o Variable: a characteristic of each subject in a population or sample
o Parameter: a numerical value summarizing the population data; a constant, the truth [remember by: p in population]
o Statistic: a numerical value summarizing the sample data; an estimate that varies from sample to sample [remember by: s in sample]

Randomness and variability
o Random sampling: each subject in the population has the same chance of being in the sample
   - This is desirable because such a sample tends to be a good reflection of the population
   - Randomization is also important to good experimental design
o Variability: the measurements we make of a variable vary from subject to subject
   - Likewise, results of descriptive and inferential statistics vary, depending on the sample chosen

What role do computers play?
o Computer applications perform the calculations for data analysis
o Data is organized in a data file: each subject is a row and each variable is a column

Chapter 2

(2-1) What are the types of data?
Variable: a characteristic of a subject about which information is sought
Two types:
o Categorical: each data value belongs to one of a set of categories (gender, major, color); almost always words
   - Numbers can still be categorical. Ex: the average social security # or zip code would make no sense
o Quantitative: each data value takes a numerical value that represents a different magnitude of the variable (age, GPA, # of classes); always numbers
   - Discrete (count): possible values form a set of separate numbers [# of classes]; countable; isolated
   - Continuous (measurement): possible values form an interval [height]; can assume any value along a line interval; uncountable
o Examples
   - How long until pain reliever works? Quantitative; continuous
   - # of chocolate chips in a cookie? Quantitative; discrete
   - Colors used in a book? Categorical
   - Brand of refrigerator? Categorical
   - Overall satisfaction in a car? Categorical
   - # of files on a computer? Quantitative; discrete
   - # of staples? Quantitative; discrete
   - pH of a pool? Quantitative; continuous
   - Zip codes in Athens? Categorical

Frequency
o n = sample size (# of observations in the sample)
o Frequency distribution: lists the number of observations that belong in each category of data
   - Note: Σf = n (the sum of the frequencies equals n)
o Relative frequency distribution: lists the proportion of observations that belong in each category
   - rf = f/n (can also be reported as a %)
   - Note: Σrf = 1
o Frequency tables
   - Mode: the category with the highest frequency
   - Frequency: count
   - Proportion: frequency/total

(2-2) Describing data using graphical summaries
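The frequency (f) and relative frequency (rf) defined under "Frequency" are the quantities that bar graphs, Pareto charts, and pie charts display. The following is a minimal Python sketch, assuming a made-up sample of car colors (the data and variable names are hypothetical, not from the notes); it computes f, rf = f/n, the mode, the descending Pareto order, and the pie sector angle rf x 360.

```python
from collections import Counter

# Hypothetical sample (made-up data): car color recorded for n = 10 subjects.
colors = ["red", "blue", "red", "white", "blue",
          "red", "black", "white", "red", "blue"]

n = len(colors)                                   # sample size
freq = Counter(colors)                            # f for each category
rel_freq = {c: f / n for c, f in freq.items()}    # rf = f / n

mode = max(freq, key=freq.get)                    # category with the highest frequency
pareto_order = sorted(freq, key=freq.get, reverse=True)     # descending f, as in a Pareto chart
pie_degrees = {c: rf * 360 for c, rf in rel_freq.items()}   # sector angle = rf x 360

for c in pareto_order:
    print(f"{c:6s} f={freq[c]}  rf={rel_freq[c]:.2f}  sector={pie_degrees[c]:.0f} degrees")
```

The two notes from the frequency section can be checked directly: the frequencies sum to n, and the relative frequencies sum to 1.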

Types for categorical
o Bar graph: bar heights show f or rf
o Pareto chart: similar to a bar graph, but with the categories in descending order of f or rf
o Pie chart: (rf x 360) = measure of the sector in degrees

Types for quantitative
o Dot plot: shows a dot for each observation, placed just above its value on the # line
o Stem-and-leaf plot: also portrays the individual observations (n = # of leaves)
   - Each observation is represented by a stem and a leaf. Usually the stem consists of all digits except for the final one, which is the leaf
   - Place the stems in a column, starting with the smallest. Place a vertical line to their right
   - On the right side of the vertical line, indicate each leaf that has a particular stem. List the leaves in increasing order (can truncate the last digit)
o Histogram (bar graph for quantitative data): a graph that uses bars to portray the frequencies or relative frequencies of the possible outcomes
   - Divide the range of the data into intervals of equal width. For a discrete variable with few values, use the actual possible values
   - Count the number of observations (the frequency) in each interval, forming a frequency table
   - On the horizontal axis, label the values or the endpoints of the intervals. Draw a bar over each value or interval with height equal to its frequency (or percentage), values of which are marked on the vertical axis
o Time plot: data collected over a period of time
   - Trend: a long-term tendency to rise or fall

Which graph should we use?
o The dot and stem-and-leaf plots are more useful for small data sets, since they portray the individual observations
o Histograms work better for large data sets because they are more compact and offer more flexibility

Shape of the distribution
o Mound/symmetrical
o Bimodal (2 modes/peaks)
o Uniform
o Skewed left (negatively skewed)
o Skewed right (positively skewed)

(2-3) How can we describe the center of quantitative data sets?
Measures of center
o Mean (average): the center of gravity
o Median (middle): separates the bottom 50% from the top 50% of the data
   - Arrange the data from low to high
   - If n is odd, it's the middle #
   - If n is even, it's the average of the 2 middle #s
o Mode (most): the most frequent observation, i.e. the value that occurs the most
   - If it has 2 modes: bimodal
   - More than 2 modes: no mode
o Symmetrical data: mean = median
o Skewed right: mean > median
o Skewed left: mean < median

Outlier: an observation that falls well above/below the overall bulk of the data
o The median is more resistant to an outlier than the mean is
o If a distribution is highly skewed, the median is usually more accurate
o If the data is symmetrical or slightly skewed, the mean is usually more accurate
o The mode is rarely used in quantitative analysis, but always in qualitative (categorical) analysis
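The claim that the median resists outliers while the mean does not can be seen with a small example. This is a minimal Python sketch using the standard `statistics` module; the data values are made up for illustration.

```python
import statistics

# Hypothetical data set (made-up values), n = 7.
data = [2, 3, 3, 4, 5, 6, 7]
with_outlier = data + [60]          # same data plus one extreme value, n = 8

mean_before = statistics.mean(data)
median_before = statistics.median(data)         # n is odd: the middle value
mode_before = statistics.mode(data)             # most frequent value

mean_after = statistics.mean(with_outlier)
median_after = statistics.median(with_outlier)  # n is even: average of the 2 middle values

print(f"without outlier: mean={mean_before:.2f}  median={median_before}  mode={mode_before}")
print(f"with outlier:    mean={mean_after:.2f}  median={median_after}")
```

Adding the single value 60 moves the mean from about 4.29 to 11.25, while the median only moves from 4 to 4.5. The result, mean > median, is the pattern listed above for right-skewed data.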

January 24th 2008

Chapter 3
(3-1) How can we explore the association between two categorical variables?
Frequently, more than one categorical variable is being studied at the same time.
o Contingency table (two-way table): a table that summarizes data from two categorical variables; the rows of the table identify the categories of one variable, and the columns identify the categories of the other variable
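As an illustration of the definition, a contingency table can be built by counting how often each (row category, column category) pair occurs. This is a minimal Python sketch; the two variables (class year and car ownership) and all observations are hypothetical, chosen only to show the layout.

```python
from collections import Counter

# Hypothetical paired observations (made-up): (class year, owns a car?) for 8 students.
observations = [
    ("freshman", "yes"), ("freshman", "no"), ("freshman", "no"),
    ("senior", "yes"), ("senior", "yes"), ("senior", "no"),
    ("senior", "yes"), ("freshman", "no"),
]

counts = Counter(observations)      # frequency of each (row, column) pair
row_categories = sorted({row for row, _ in observations})
col_categories = sorted({col for _, col in observations})

# Rows identify the categories of one variable, columns the categories of the other.
table = {row: {col: counts[(row, col)] for col in col_categories}
         for row in row_categories}

print(f"{'':10s}" + "".join(f"{col:>5s}" for col in col_categories))
for row in row_categories:
    print(f"{row:10s}" + "".join(f"{table[row][col]:>5d}" for col in col_categories))
```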

The first step in analyzing data in a contingency table is to look at the distribution of each variable separately. To determine the distribution of either the column variable or the row variable, we create a marginal distribution.

Marginal distribution: a frequency or relative frequency distribution of either the row variable or the column variable in a contingency table
o Frequency marginal distribution
   - To create a frequency marginal distribution for a row variable, we calculate the row totals for each category of the row variable
   - To create a frequency marginal distribution for a column variable, we calculate the column totals for each category of the column variable
o Relative frequency marginal distribution: rf = f/n
   - Ex: fathers who were farmers: rf = 42/500 = 0.084 (use table 1-3 in notes)

Conditional distribution (CD): lists the relative frequency of each category of a variable, given a specific category of the other variable in the contingency table
If we find the CD of son's occupation by father's occupation:
o Father = explanatory variable
   - What you restrict yourself to (the given category)
o Son = response variable
   - The result (%) within that restricted group
NOTE: in order to describe the association between two categorical variables, relative frequencies must be used, because there are different numbers of observations in the categories

Constructing a CD
o Compute the relative frequency for each category of the response variable, given the first category of the explanatory variable
o Compute the relative frequency for each category of the response variable, given the second category of the explanatory variable
o Continue this approach until all categories of the explanatory variable have been exhausted
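The marginal and conditional distributions described above can be sketched in Python as follows. Table 1-3 from the notes is not reproduced here, so the occupation categories and every cell count below are invented; they are chosen only so that the farmer row total is 42 out of n = 500, matching the rf = 42/500 = 0.084 example.

```python
# Hypothetical contingency table: table[father's occupation][son's occupation].
# All cell counts are made up; only "42 farmer fathers out of 500" comes from the notes.
table = {
    "farmer":       {"farmer": 20, "professional": 7,   "skilled": 15},
    "professional": {"farmer": 3,  "professional": 130, "skilled": 55},
    "skilled":      {"farmer": 5,  "professional": 85,  "skilled": 180},
}

# Frequency marginal distributions: row totals and column totals.
row_totals = {father: sum(sons.values()) for father, sons in table.items()}
col_totals = {}
for sons in table.values():
    for son, f in sons.items():
        col_totals[son] = col_totals.get(son, 0) + f

n = sum(row_totals.values())

# Relative frequency marginal distribution of the row variable: rf = f / n.
row_marginal = {father: f / n for father, f in row_totals.items()}

# Conditional distribution of son's occupation (response),
# given each category of father's occupation (explanatory).
conditional = {
    father: {son: f / row_totals[father] for son, f in sons.items()}
    for father, sons in table.items()
}

print("rf of farmer fathers:", row_marginal["farmer"])
for father, dist in conditional.items():
    print(father, {son: round(rf, 3) for son, rf in dist.items()})
```

Each conditional distribution divides by its own row total rather than by n, which is why the rows can be compared even though they contain different numbers of observations.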

Properties of the correlation
o The correlation is between negative one and one: -1 ≤ r ≤ +1
o r > 0 indicates a positive linear relationship
o r < 0 indicates a negative linear relationship
o r = 0 indicates no linear relationship
o r = -1 indicates a perfect negative linear relationship; r = +1 indicates a perfect positive linear relationship

Calculating the sample correlation
o Formula in notebook
o In StatCrunch: Stat > summary > stats > correlation

(3-3) How can we predict the outcome of a variable?
Least squares regression: a method of determining a line of best fit by minimizing the sum of the squared vertical distances between the observed values of y and those predicted by the line (ŷ)
o You want the line that comes closest to most of the points
o ŷ = a + bx

Interpreting the slope and the y-intercept
o Slope (b in formula): the estimated amount of change in y, on average, for each one-unit increase in x
o Y-intercept (a in formula): the estimated average value of y when x = 0
o Figure out just to see how far off we are
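The notes leave the correlation formula to the notebook, so the sketch below assumes the standard textbook forms: r is the sum of the products of the z-scores divided by n - 1, the slope is b = r(s_y/s_x), and the intercept is a = ȳ - b x̄. These may differ in appearance from the notebook's version. The data are made up, and `predict` is a hypothetical helper name.

```python
import statistics

# Hypothetical paired data (made-up values).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
x_bar, y_bar = statistics.mean(x), statistics.mean(y)
s_x, s_y = statistics.stdev(x), statistics.stdev(y)   # sample standard deviations

# Sample correlation: sum of products of z-scores, divided by n - 1.
r = sum(((xi - x_bar) / s_x) * ((yi - y_bar) / s_y)
        for xi, yi in zip(x, y)) / (n - 1)

# Least squares line: y-hat = a + b x.
b = r * s_y / s_x          # slope: average change in y per one-unit increase in x
a = y_bar - b * x_bar      # y-intercept: predicted y when x = 0

def predict(x_new):
    """Predicted value y-hat for a given x (hypothetical helper)."""
    return a + b * x_new

# Residuals: observed y minus predicted y for each point.
residuals = [yi - predict(xi) for xi, yi in zip(x, y)]

print(f"r = {r:.4f}")
print(f"y-hat = {a:.2f} + {b:.2f}x")
print("residuals:", [round(e, 2) for e in residuals])
```

For this data set r is about 0.77 (a positive linear relationship) and the fitted line is ŷ = 2.2 + 0.6x. The residuals are one way to see how far off the predictions are; for a least squares line they sum to zero.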
