Data Analysis Project
Math 1040: Introduction to Statistics
Spring 2014
Jeff Seamons

This project is divided into four parts:

Part 1:
Part 1 details categorical data for primary positions played in baseball. It includes a total of six charts, two each for the following: the total population of the data, a systematic sample drawn from that data, and a random sample drawn from that data. A reflection page follows the Part 1 graphs.

Part 2:
Part 2 details quantitative data for home runs hit in baseball. It also includes a total of six charts, two each for the following: the total population of the data, a systematic sample drawn from that data, and a random sample drawn from that data. A reflection page follows the Part 2 graphs.

Part 3:
Part 3 details the statistical methods and objectives that I have learned in this statistics class. Using those methods, I will create confidence intervals from the samples that I've obtained.

Part 4:
Part 4 is a final reflection on the entire project: how it applies to what I will continue to learn in my college education, and what I will take from it into my career.

Part 1

This project is about the primary positions played in baseball and the percentage of players who played each position. Some of this project was done as a group, and some of it was done individually. We wanted to see whether the proportions in the sample groups we pulled were similar to those in the entire population of the data. Our data set held statistics for 1,341 baseball players. Using two different sampling methods (systematic and random), we pulled 36 players from that population into each sample group, created the graphs, and checked whether the percentage of players at each position in each sample resembled the percentage in the entire population.

We chose to use Primary Playing position as our categorical variable because it has a limited number of values.
ENTIRE POPULATION: Positions played in baseball

[Figure: Pareto chart, "Entire Data Population for all players". Axes: Frequency (0 to 600) and Cumulative Percentage (0 to 100). Category labels: Outfield, Catcher, Shortstop, 2nd Base, 3rd Base, 1st Base, Designated Hitter. Data labels: 8, 139, 145, 148, 154, 254, 492.]

Part 1 (continued)

We used Microsoft Excel to determine our systematic sample with the MOD function. We started by assigning each player a unique number (in this case 1-1,340). We then built the formula by dividing each unique number by a fixed divisor and keeping the remainder. Because we wanted 36 players in each sample, we used a divisor of 37, since 1,340 / 37 is approximately 36. An example of this formula is: =MOD(AB2,37)

AB2 is the Excel cell containing a player's unique number. The formula assigned a number from 0 to 36 to each player in the entire population. We randomly chose the number 14 and took every player assigned that number as our systematic sample.
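
The selection rule can be sketched in Python as follows. This is only a minimal illustration under stated assumptions, not the spreadsheet we actually used: the names `player_ids`, `DIVISOR`, and `CHOSEN_REMAINDER` are hypothetical, and the IDs are assumed to run from 1 to 1,340 as described above.

```python
# Assumption: players are numbered 1..1340, mirroring the unique numbers in Excel.
player_ids = range(1, 1341)

DIVISOR = 37           # same divisor as =MOD(AB2,37)
CHOSEN_REMAINDER = 14  # the remainder value picked for the sample

# Keep every player whose ID leaves the chosen remainder,
# which is like filtering the MOD column for the value 14.
systematic_sample = [pid for pid in player_ids if pid % DIVISOR == CHOSEN_REMAINDER]

print(len(systematic_sample))  # 36
print(systematic_sample[:3])   # [14, 51, 88]
```

With these assumptions the rule picks players 14, 51, 88, and so on, which gives exactly 36 players.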

SAMPLING METHOD 1: SYSTEMATIC sampling of positions played in baseball

[Figure: Pareto chart, "Systematic Sampling of 36 players". Axes: Frequency (0 to 18) and Cumulative Percentage (0 to 100). Category labels recovered: Designated Hitter, 2nd Base, 3rd Base, Shortstop, 1st Base. Data labels: 0, 2, 4, 17, 5, 2, 6.]

Part 1 (continued)

For the random sample we also used Microsoft Excel; however, for this sample we simply and conveniently chose the first 36 players in the data list.

SAMPLING METHOD 2: RANDOM sampling of positions played in baseball

[Figure: Pareto chart, "Random Sampling of 36 players". Axes: Frequency (0 to 14) and Cumulative Percentage (0 to 100). Category labels: Outfielder, Shortstop, 2nd Base, Catcher, 1st Base, 3rd Base, Pitcher. Data labels: 0, 3, 3, 12, 4, 6, 8.]

Part 1 (continued)
Part 1 Reflection

Systematic Sampling Method: We used Microsoft Excel to determine our systematic sample with the MOD function. We started by assigning each player a unique number (in this case 1-1,340). We then built the formula by dividing each unique number by a fixed divisor and keeping the remainder. Because we wanted 36 players in each sample, we used a divisor of 37, since 1,340 / 37 is approximately 36. An example of this formula is: =MOD(AB2,37)

AB2 is the Excel cell containing a player's unique number. The formula assigned a number from 0 to 36 to each player in the entire population. We randomly chose the number 14 and took every player assigned that number as our systematic sample.

Random Sampling Method: For the random sample we also used Microsoft Excel; however, for this sample we simply and conveniently chose the first 36 players in the data list.

Comparison: Looking at both types of graphs (Pie and Pareto) for each data set, we noticed that the samples from both methods came somewhat close to resembling the entire population. For instance, in both sample groups, not a single Designated Hitter was selected from the 8 in the entire population. The main difference between the random and systematic samples was the percentage of outfielders.

Considerably more outfielders (more than double) were selected with the systematic method than with the random method. The systematic proportion of outfielders more closely mirrored the percentage in the entire population.

Part 2
For the quantitative variable from the baseball data set, our group selected home runs. This time, however, we entered every player's home run statistics into the StatCrunch software available on our classroom homework website. Here are the graphs for the entire population of that data set and for two samples (systematic and random) drawn from it.

ENTIRE POPULATION:
Population Mean: 85
Population Standard Deviation: 97.893
Five number summary:

1. The sample minimum (smallest observation): 0
2. The lower quartile or first quartile: 22
3. The median (middle value): 51
4. The upper quartile or third quartile: 107
5. The sample maximum (largest observation): 755
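
As a rough sketch of how a mean, standard deviation, and five number summary like the ones above could be computed outside StatCrunch, here is a short Python example. The helper `five_number_summary` and the list `toy_home_runs` are hypothetical and are not the project data. The "inclusive" quartile rule is also an assumption; StatCrunch may compute quartiles slightly differently.

```python
import statistics

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) for a list of numbers."""
    ordered = sorted(values)
    # Assumption: the "inclusive" quartile method; other tools may use another rule.
    q1, median, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    return ordered[0], q1, median, q3, ordered[-1]

# Hypothetical home run totals, NOT taken from the project data set.
toy_home_runs = [0, 2, 4, 6, 8, 10, 12, 14, 16]

print(five_number_summary(toy_home_runs))  # (0, 4.0, 8.0, 12.0, 16)
print(statistics.mean(toy_home_runs))      # mean of the toy values
print(statistics.pstdev(toy_home_runs))    # population standard deviation
```

For a sample rather than a full population, `statistics.stdev` would be the matching choice for the standard deviation.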

Part 2 (continued)
To select our systematic sample, we selected every 36th player from the entire data population and entered their home run statistics, still using the StatCrunch software available on our classroom homework website.

SAMPLE 2: SYSTEMATIC
Systematic sample mean: 72
Systematic Standard Deviation: 62.754
Five number summary:

1. The sample minimum (smallest observation): 6
2. The lower quartile or first quartile: 27
3. The median (middle value): 50
4. The upper quartile or third quartile: 90
5. The sample maximum (largest observation): 271

Part 2 (continued)
To select our simple random sample, we conveniently selected the very first 36 players from the entire data population and entered their home run statistics, still using the StatCrunch software available on our classroom homework website.

SAMPLE 1: SIMPLE RANDOM
Simple random sample mean: 100.861
Simple random Standard Deviation: 146.121
Five number summary:

1. The sample minimum (smallest observation): 8
2. The lower quartile or first quartile: 19
3. The median (middle value): 45
4. The upper quartile or third quartile: 90
5. The sample maximum (largest observation): 755

Part 2 (continued)
Part 2 Reflection

Systematic Sampling Method: Comparing the shape and distribution of the systematic sample with the entire population, I noticed some obvious differences in the frequency histograms, but not necessarily in the box plots. The histograms in both instances peaked on the left side of the graph and tapered down toward the lower right. An obvious difference between this sample and the total population was that some bins were taller than the bins before them even as the bars generally tapered downward, creating the uneven effect that is seen. The sample means and standard deviations aren't even close to one another, as expected, nor are the numbers in the 5-number summaries.

Random Sampling Method: Compared with the population, the random sample similarly had a patchier look rather than the clean, downward-sweeping line seen in the population graph. Some bins in the middle are empty altogether, and the bins rise and fall, as seen in the systematic sample.

The sample means and standard deviations aren't even close to one another, as expected, nor are the numbers in the 5-number summaries.

Part 3
Part 3 details the statistical methods and objectives that I have learned in this statistics class. Using those methods, I will create 95% confidence intervals from the samples that I've obtained, and will also explain their levels of significance in depth. I will complete hypothesis tests for both the population proportion and the population mean. A reflection page follows these learning objectives in Part 3.

Categorical Data: Outfield positions
The level of confidence that I have chosen for my categorical sample (outfield positions) is 95%. The confidence intervals for both of my samples (random and systematic) are detailed below:

Systematic Sampling of categorical data: Outfield positions
The confidence interval I created from this sample, including the computation for the margin of error, is below:

Confidence level: 95%
Alpha: .05
Critical value (z α/2): 1.96
Margin of error: .163
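
The margin of error for a sample proportion can be sketched in Python as below. This is a hedged illustration rather than my original calculation: the function name `proportion_margin_of_error` is hypothetical, and the count of 17 outfielders out of 36 is an assumption taken from the largest frequency in the systematic sampling chart. Under that assumption the result agrees with the .163 reported above.

```python
import math

def proportion_margin_of_error(successes, n, z=1.96):
    """Margin of error z * sqrt(p_hat * (1 - p_hat) / n) for a sample proportion."""
    p_hat = successes / n
    return z * math.sqrt(p_hat * (1 - p_hat) / n)

# Assumption: 17 outfielders among the 36 sampled players.
outfielders, sample_size = 17, 36
p_hat = outfielders / sample_size
margin = proportion_margin_of_error(outfielders, sample_size)

print(round(margin, 3))  # 0.163
# Lower and upper bounds of the 95% interval for the outfield proportion.
print(round(p_hat - margin, 3), round(p_hat + margin, 3))  # 0.309 0.635
```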

Random Sampling of categorical data: Outfield positions
The confidence interval I created from this sample, including the computation for the margin of error, is below:

Confidence level: 95%
Alpha: .05
Critical value (z α/2): 1.96
Margin of error: .165

Part 3 (continued)
Quantitative Data: Home Runs
The confidence intervals for the mean of each of my home run samples (systematic and random), including the computation for the margin of error, are listed below:

Systematic Sampling of quantitative data: Home Runs
The confidence interval I created from the systematic sample mean is detailed below. For the home runs data set, the observed sample mean is 72.

Sample size: 36
Observed mean: 72
Standard Deviation: 62.754
Confidence level: 95%
Margin of error: +/- 20.5
Range for the true population mean: 51.5 < μ < 92.5

Random Sampling of quantitative data: Home Runs
The confidence interval I created from the random sample mean is detailed below. For the home runs data set, the observed sample mean is 100.861.

Sample size: 36
Observed mean: 100.861
Standard Deviation: 146.121
Confidence level: 95%
Margin of error: +/- 47.73
Range for the true population mean: 53.13 < μ < 148.59
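
The two intervals above can be reproduced with a short Python sketch. The function name `mean_confidence_interval` is hypothetical. The sketch assumes the critical value z = 1.96, which matches the reported margins; a t critical value would give slightly wider intervals. It also checks each interval against the population mean of 85 reported in Part 2.

```python
import math

def mean_confidence_interval(mean, std_dev, n, z=1.96):
    """Return (margin, lower, upper) for mean +/- z * std_dev / sqrt(n)."""
    margin = z * std_dev / math.sqrt(n)
    return margin, mean - margin, mean + margin

POPULATION_MEAN = 85  # population mean reported in Part 2

# (observed mean, standard deviation, sample size) for each sample
samples = {
    "systematic": (72, 62.754, 36),
    "random": (100.861, 146.121, 36),
}

for name, (mean, std_dev, n) in samples.items():
    margin, lower, upper = mean_confidence_interval(mean, std_dev, n)
    captured = lower < POPULATION_MEAN < upper
    print(f"{name}: +/- {margin:.2f}, {lower:.2f} < mu < {upper:.2f}, captures 85: {captured}")
```

Both intervals contain the population mean of 85 under these assumptions.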

Reflection for Part 3

The meaning of these confidence intervals can be explained like this:

For the random sample of the quantitative data, one can say with 95% confidence that the true population mean lies between 53.13 and 148.59 home runs. This takes into account the sample size, confidence level, observed mean, and standard deviation. In all of these instances, the calculated intervals did capture the population parameter.

Part 4

Summary
In completing the project, I learned how drastically (and sometimes not so drastically) different sampling methods can differ from one another. I've learned how to conduct a hypothesis test for a given claim or assumption, and how to use the test statistic to decide whether to reject the null hypothesis or fail to reject it. I've learned to create graphs from provided data, and in turn to calculate the population and sample means, along with their standard deviations.

These skills will undoubtedly be useful in future college courses, as my Associate's Degree is in Sociology, and sociologists use statistics very frequently. It'll be beneficial to know how to look at a sample of data and judge, from the numbers and their significance levels, what the outcomes may or may not be, and even more so what they might prove.

The parts of this project that will have other applications in future classes include basically every type of calculation performed. I honestly don't think I can say that any part of this project was irrelevant or wouldn't be beneficial, even though it was very time consuming. From constructing graphs to computing means, standard deviations, and critical values, to forming a hypothesis, it's all very applicable to my chosen field.

As far as problem solving goes, and how I look at real-world math applications, this project really has shed an enormous amount of light. Learning to work with a plethora of data (in this case about 1,300 players), and knowing that it can be broken down into small samples so that it's easier to digest, was priceless. I can actually see myself using a lot of the skills that I learned, not only in this project but in this class. It was also beneficial to work in a group, as often occurs in the real world, and to be able to hypothesize with other students and test claims and assumptions.
