
Math 1040

Alyssa Winn
Skittles Statistical Analysis
Term Project
At the beginning of the semester each student in the class purchased a 2.17 ounce bag
of Skittles that we would use as a sample of the total population of Skittles produced.
Throughout the semester we worked individually and as a group to analyze the data
from our individual bag and from all bags of candies collected by the class. As we
learned statistical concepts we applied them at various stages of the project. As a group
we collaborated to analyze the data and apply confidence intervals and hypothesis
tests. As an individual I had to draw conclusions from the data and demonstrate my
understanding of the process.

The following is the combination of my individual and my group’s work during the five
stages of the project.

Part 1 - Data Collection


Each student was required to purchase a 2.17 oz bag of Skittles and then count the number
of candies of each color (red, orange, yellow, green and purple). In two instances a
student purchased the wrong bag size, which skewed the data set. Those two outlier
observations were excluded from our data set for the remaining parts of the project.

Part 2 - Organizing and Displaying Categorical Data


Individually we wrote the proportion of each color we expected to see in the sample
(total data for class) and compared it to the actual proportion we observed in our own
bag. We also had to determine if our method of collecting data qualified as a simple
random sample. Then as a group we created graphs to represent the sample
proportions and individually analyzed the information. Working in a group helped the
editing process so our graphs were less misleading to the audience.

Group Work for Part 2:


                      Red      Orange   Yellow   Green    Purple
Expected Proportion   20.0%    20.0%    20.0%    20.0%    20.0%
Observed Proportion   20.3%    19.9%    20.5%    20.2%    19.1%

The expected proportions (percentages) for Red, Orange, Yellow, Green and Purple are 20% each. This is
based on the assumption the colors have even chances of appearing. In reality, even though the Skittles
are distributed by standardized processes and machinery, variability will, to some extent, still be
introduced. Therefore, it is highly unlikely each color will account for exactly 20% in each 2.17oz bag.
*Inserting the graphs into this document caused distortion, so they may be more difficult to read. The program used to create the
graphs only allowed colors to be assigned randomly, so the colors shown on the pie chart don’t match the candy colors.

Yes, the data represents a random sampling of 2.17oz bags of Skittles, at least within the Salt
Lake City, Utah area. The population represented by this sample is all 2.17oz bags of Skittles
available for purchase. The bags were presumably purchased from various (and somewhat
unique) stores by each member of the class, though likely these stores were conveniently
accessible for each student. The results could perhaps be distorted if the production process,
delivery process, or availability of 2.17oz bags of Skittles were different for this geographic region
and, in particular, for the stores that were most convenient to the students. A better
representation of the population would likely come from purchasing 2.17oz bags of Skittles from
different geographic locations and different stores, varying the dates and times of purchase leading
up to the assignment. This would probably provide a better sampling of the population since it
increases the chances of purchasing bags of Skittles from different production groups.

Individual Work for Part 2:


                Red    Orange   Yellow   Green   Purple   Total
My Bag          16     6        13       14      11       60
Class Counts    893    874      900      889     838      4394

Before starting this project I had the expectation that each color would be represented rather equally
in each bag. However, in my bag that wasn’t the case. So I then thought that if what began as
equal amounts of the colors was then divided amongst the bags, each bag may have different
proportions of each color, but when all the samples were brought back together it would reflect
the equal proportions of colors in the total population. Our group’s Pareto chart reflects that
concept because the frequency of each color is relatively similar. In the initial PDF of data from
the class there were two sets of outliers where the frequency of the colors was significantly higher
than the other occurrences. This is probably due to the students purchasing the incorrect bag
size. If those numbers weren’t omitted in our Modified Data PDF, the graphs would be misleading
because the colors wouldn’t be shown from the same population size. It would also affect the
mean and median later in our project.
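
The comparison between my bag and the class totals can be reproduced directly from the counts in the table. The following is a minimal Python sketch; the variable and function names are illustrative and not part of the original project.

```python
# Color counts copied from the "My Bag" and class rows of the table above.
class_counts = {"Red": 893, "Orange": 874, "Yellow": 900, "Green": 889, "Purple": 838}
my_bag = {"Red": 16, "Orange": 6, "Yellow": 13, "Green": 14, "Purple": 11}


def proportions(counts):
    """Return each color's share of the total as a fraction between 0 and 1."""
    total = sum(counts.values())
    return {color: count / total for color, count in counts.items()}


class_props = proportions(class_counts)
bag_props = proportions(my_bag)

# Print the class-wide and single-bag proportions side by side.
for color in class_counts:
    print(f"{color:<7} class {class_props[color]:6.1%}   my bag {bag_props[color]:6.1%}")
```

A single bag of 60 candies swings much further from 20% per color than the pooled class sample of 4394 candies, which is the pattern described above.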

Part 3 - Organizing and Displaying Quantitative Data


At this point in the class we learned to find the mean and standard deviation and then
applied that knowledge to the sample data. Knowing the mean and standard deviation
helps us understand the shape of distribution of the data and if we can apply statistical
theories to it. As a group we created a histogram and boxplot of the data so we could
visually analyze the data. As individuals we had to describe the difference between
qualitative and quantitative data and how each category is best represented by certain
graphs.

Group Work for Part 3:


Mean number of candies per bag: 59.4
Std. deviation of number of candies per bag: 2.8
5-number summary for number of candies per bag:
Min: 52 Q1: 58 Med: 60 Q3: 61 Max: 65

Individual Work for Part 3:

I believe the shape of distribution of the candies per bag is relatively symmetrical. Based on part
2 of our project, most of the data was within 1.4% of each other so I expected the graphs to
reflect the cluster towards the center, or median of 60. I thought that was an accurate description
of the data because my own bag had 60 candies. One difference between my bag and the class
total is that the standard deviation of my bag is 3.81 but the class total is 2.8, so I had a much
larger variation within the data of my bag. Within the class total of candies in each bag the spread
is between 52 and 65, for a range of 13. On our group box plot it is easy to see the three data points
of 52, 53 and 53 that are below the lower fence of 53.5 (Q1 − 1.5 × IQR). Overall I do think the data from the class is
relatively close to the data from my own bag, therefore the graphs are a good representation.
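
The outlier fences and the standard deviation of my bag can be checked with a short Python sketch. It assumes the usual 1.5 × IQR rule for the fences, and it assumes that the 3.81 figure is the sample standard deviation of the five color counts in my bag; the report does not state either explicitly.

```python
import statistics

# Quartiles from the group's five-number summary.
q1, q3 = 58, 61
iqr = q3 - q1

# Fences under the 1.5 * IQR rule; points outside them are flagged as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
print(lower_fence, upper_fence)  # 53.5 65.5

# Assumption: 3.81 is the sample standard deviation of my bag's color counts.
my_bag_counts = [16, 6, 13, 14, 11]
bag_sd = statistics.stdev(my_bag_counts)
print(round(bag_sd, 2))
```

Under that rule the bags with 52, 53 and 53 candies fall below the lower fence, while the maximum of 65 stays inside the upper fence.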

Categorical or qualitative data allows for classification of individuals or items based on an
attribute or characteristic. Categorical data does not use numbers when classifying the
differences of traits. Quantitative data does use numerical measures when classifying the
differences. This means quantitative data can be discrete, with a countable number of possible
values, or continuous, with an infinite number of possible values between any two values.

Graphs that are describing categorical data need to help compare the data without attaching
numerical values. It is best to use a Pareto chart or bar graph when representing categorical
data because it is easy to compare the number of occurrences of the categorical data side by
side. Pie charts can also represent categorical data.

Graphs that are describing quantitative data can vary depending on how the values are being
interpreted. A histogram is best used with quantitative data because it shows the spread and
center and identifies the different classes within the data. A stem and leaf plot can only be used
with quantitative data because it seeks to organize the numerical values. The same can be said
about scatter plots and time-series graphs.

Part 4 - Confidence Interval Estimates


Since finding the actual mean of the population would be difficult and expensive, we use
sample data to approximate it. A confidence interval gives a range of values that the
population mean is likely to fall between. The confidence level describes how likely it is
that an interval built this way contains the population mean. As a group we had to verify
that our data met the requirements to use confidence intervals, then applied the process
to our sample data of Skittles. Individually we had to demonstrate our understanding of
the confidence interval.

Group Work for Part 4:


Construct a 99% confidence interval estimate for the population proportion of yellow candies:
Sample proportion of yellow candies ($\hat{p}$):
$$\hat{p} = \frac{x}{n}$$
where $x = 900$ and $n = 4394$, so $\hat{p} = \frac{900}{4394} = 0.205$ or 20.5%
Since we have the sample proportion, we will construct a confidence interval for a population
proportion ($p$).

The three requirements that must be met to construct a confidence interval for a population
proportion are:
1. The sample was obtained through a simple random sample since several students
obtained a 2.17oz bag of Skittles from various and (at least somewhat) unique locations.
2. $n\hat{p}(1 - \hat{p}) \ge 10$ where $n = 4394$, $\hat{p} = 0.205$ and $1 - \hat{p} = 0.795$
$4394 \times 0.205(1 - 0.205) = 716.112$
$716.112 \ge 10$ ✓ Verified
3. Skittles were sampled from 74 bags out of millions sold. It is therefore reasonable to
assume that the sample size is less than 5% of the population size ($n \le 0.05N$).
99% confidence interval is (0.189, 0.221)

Lower and upper bounds:
$$\hat{p} \pm z_{\alpha/2} \times \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}$$
where $\alpha = 0.01$ and $z_{0.01/2} = 2.5758$

Lower: $0.205 - 2.5758 \times \sqrt{\frac{0.205(1 - 0.205)}{4394}} = 0.189$

Upper: $0.205 + 2.5758 \times \sqrt{\frac{0.205(1 - 0.205)}{4394}} = 0.221$

The margin of error is equal to
$$\frac{\text{upper limit} - \text{lower limit}}{2} = \frac{0.221 - 0.189}{2} = 0.016 \text{ or } 1.6\%$$
The confidence interval, 0.205 ± 0.016 , indicates that if a large number of different samples is
obtained, we expect 99% of intervals will encapsulate the population proportion of yellow candies
out of all candies.
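
The proportion interval can be reproduced with Python's standard library. This is a minimal sketch rather than the group's actual workflow; the variable names are illustrative, and the critical value comes from `statistics.NormalDist` instead of a table.

```python
from math import sqrt
from statistics import NormalDist

x, n = 900, 4394            # yellow candies and total candies in the class sample
p_hat = x / n               # sample proportion, about 0.205
confidence = 0.99
alpha = 1 - confidence      # 0.01 for a 99% interval

# Requirement check: n * p_hat * (1 - p_hat) should be at least 10.
condition = n * p_hat * (1 - p_hat)

# Two-sided critical value z_{alpha/2} from the standard normal distribution.
z = NormalDist().inv_cdf(1 - alpha / 2)

margin = z * sqrt(p_hat * (1 - p_hat) / n)
lower, upper = p_hat - margin, p_hat + margin
print(f"z = {z:.4f}, margin = {margin:.3f}, CI = ({lower:.3f}, {upper:.3f})")
```

The check value here differs slightly from 716.112 because the sketch keeps the unrounded proportion instead of 0.205; the conclusion that the requirement holds is the same.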

Construct a 90% confidence interval estimate for the population mean number of candies per bag:
Sample mean number of candies per bag ($\bar{x}$):
$$\bar{x} = \frac{\sum \text{candies in each bag}}{\text{number of bags}} = \frac{4394}{74} = 59.4 \text{ candies per bag}$$
Since we have the sample mean, we will construct a confidence interval for a population mean
($\mu$):
The two requirements that must be met to construct a confidence interval for a population mean
are:
1. The sample was obtained through a simple random sample since several students
obtained a 2.17oz bag of Skittles from various and (at least somewhat) unique locations.
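2. $n = 74 \ge 30$ ✓ Verified
90% confidence interval is (58.8, 59.9)
Lower and upper bounds:
$$\bar{x} \pm t_{\alpha/2} \times \frac{s}{\sqrt{n}}$$
where $\alpha = 0.10$, $t_{0.10/2} = 1.6660$ and $s = 2.812412$
Lower: $59.4 - 1.6660 \times \frac{2.812412}{\sqrt{74}} = 58.8$
Upper: $59.4 + 1.6660 \times \frac{2.812412}{\sqrt{74}} = 59.9$
These bounds match the unrounded mean, 4394/74 ≈ 59.378; the mean is shown rounded to 59.4.
The margin of error is equal to:
$$\frac{\text{upper limit} - \text{lower limit}}{2} = \frac{59.9 - 58.8}{2} = 0.55 \text{ candies per bag}$$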
This confidence interval, 59.4 ± 0.55, indicates that if a large number of different samples is
obtained, we expect 90% of intervals will encapsulate the population mean number of candies per
bag.
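
The mean interval can be reproduced in a few lines of Python. This is a minimal sketch with illustrative variable names: the t critical value 1.6660 is copied from the group work rather than computed (the standard library has no t distribution), and 73 degrees of freedom follows from n − 1 with 74 bags. Because the sketch uses the unrounded mean 4394/74, the margin it computes directly is about 0.54, slightly below the 0.55 obtained from the rounded bounds.

```python
from math import sqrt

total_candies, n_bags = 4394, 74
x_bar = total_candies / n_bags   # unrounded sample mean, about 59.378
s = 2.812412                     # sample standard deviation from the group work
t_crit = 1.6660                  # t critical value from the group work (alpha = 0.10, df = 73)

# Margin of error for a one-sample t interval: t * s / sqrt(n).
margin = t_crit * s / sqrt(n_bags)
lower, upper = x_bar - margin, x_bar + margin
print(f"mean = {x_bar:.1f}, margin = {margin:.2f}, CI = ({lower:.1f}, {upper:.1f})")
```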
Individual Work for Part 4:
In a paragraph, explain in general the purpose and meaning of a confidence interval.

Confidence intervals provide a range of values that is likely to contain the population parameter. The
confidence level is the percentage of intervals, built from repeated samples, that would contain the
population parameter. A higher confidence level means the range of values that could include the unknown
parameter is wider, which results in an increased margin of error. If the sample size is increased, there is
more data and the confidence interval is narrower. Essentially the goal of constructing a confidence
interval is to better estimate the true population parameter and to gauge how well our sample reflects
the population.
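
The two effects described above, a wider interval at a higher confidence level and a narrower one with a larger sample, can be illustrated with a short Python sketch. It reuses the yellow-candy proportion from Part 4; the helper name `proportion_margin` and the alternative sample sizes of 1000 and 20000 are hypothetical and chosen only for illustration.

```python
from math import sqrt
from statistics import NormalDist


def proportion_margin(p_hat, n, confidence):
    """Margin of error for a one-proportion z interval."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return z * sqrt(p_hat * (1 - p_hat) / n)


p_hat = 900 / 4394  # sample proportion of yellow candies from Part 4

# Same sample size, higher confidence level: the margin grows.
for confidence in (0.90, 0.95, 0.99):
    print(f"{confidence:.0%} level, n = 4394: +/- {proportion_margin(p_hat, 4394, confidence):.4f}")

# Same confidence level, larger (hypothetical) sample size: the margin shrinks.
for n in (1000, 4394, 20000):
    print(f"99% level, n = {n}: +/- {proportion_margin(p_hat, n, 0.99):.4f}")
```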

Part 5 - Reflection
The final part of our course project was to complete a reflection on what we learned and
how it can be applied to other classes or our future career.

Individual Work for Part 5:


Before this class I had a difficult time interpreting graphs and understanding the
significance of statistics. I didn't push myself to learn more because I was
intimidated by numbers and honestly didn’t understand what a statistics class
really consisted of. The way the class progressed through each concept helped
me build up my knowledge and confidence. I appreciated that the group project
was also set up the same way. We were able to apply the concept we just
learned as we continued the project. I had a great group that was supportive but
also provided feedback when I misunderstood something. It deepened my
understanding because I could bounce it off of group members who were able to
explain the process with a different perspective than the one I didn’t understand
from the MyLab tutorials.

I was explaining to my husband that it is much easier to use a concept you just
learned on a problem that has a similar structure but so much harder when you
have to determine the right application to interpret the data and then be able to
conceptualize the statistics. This project was like a baby step for a real world
application of statistics.

The biggest takeaway from the project was that it’s not just about the data but
how I, as the author/researcher, interpreted and explained how the statistics
support my argument. I am a political science and sociology major so it was
incredibly helpful to learn how to identify misleading graphs and how I should
present data to support my conclusion. In my other classes I am already applying
the concepts I have learned. With statistics I’m able to analyze data to make
inferences on cultural changes in society. It allows me to look at raw data to
make my own claim instead of taking another person’s interpretation as truth.
