
Part I: Descriptive Statistics

Statistics is the study of the methods for describing and interpreting quantitative
information, including techniques for organizing and summarizing data and techniques for making
generalizations and inferences from data. The first of these two broad classes of methods is called
descriptive statistics, and the second is called inferential statistics.

Descriptive statistics refers to the procedures for organizing, summarizing, and describing
quantitative information, which is called data. For example, a basketball fan is accustomed to
checking over his favorite player’s shooting average; the sales manager relies on charts showing
the sales distribution of an enterprise.

The second class of statistics, inferential statistics, includes methods for making inferences
about a larger group of individuals on the basis of data collected on a much smaller group.

A knowledge of statistics is useful in understanding statistical information commonly
presented in the media and other aspects of everyday life, and it is essential in understanding and
conducting research.

Some Basic Vocabulary

1. Entity. When we make observations about persons, places, and things, we call that which
is being observed an entity, regardless of the type of unit involved.
2. Variable. A characteristic that assumes different values for different entities is called a
variable. By contrast, a characteristic that retains the same value from entity to entity is
called a constant. The different values that one observes (or measures) are called
observations.
3. Quantitative Variable. A quantitative variable is one whose values are expressible as
numerical quantities, such as measurements and counts. A measurement taken on a
quantitative variable conveys information regarding amount.
4. Qualitative Variable. A qualitative variable is one that is not measurable or countable;
its values can only be classified into categories. A measurement taken on a qualitative variable
conveys information regarding attributes.

5. Discrete Variable. A discrete variable is one that can assume only certain values within
an interval. A discrete variable is characterized by interruptions between values that the
variable can assume.
6. Continuous Variable. A continuous variable can assume any value along a continuum
within an interval: all whole numbers and all values in between.
7. Population. The largest collection of values of some variable in which there is interest
constitutes the population of these values.
8. Sample. A sample is a part of a population.

Summarizing Data

I. The Ordered Array

An ordered array is a list of the observations in order of magnitude. The order may be
from smallest value to the largest value or from the largest to the smallest.
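
As a small illustration, the sketch below builds an ordered array in both directions. The observations are made up for this example and do not come from any real data set.

```python
# Hypothetical observations, made up for illustration only.
observations = [27, 12, 45, 33, 12, 19, 38]

# Ordered array from the smallest value to the largest.
ascending = sorted(observations)

# Ordered array from the largest value to the smallest.
descending = sorted(observations, reverse=True)

print(ascending)
print(descending)
```

Once the data are in an ordered array, the smallest and largest values can be read off directly from the two ends.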

II. Frequency Distribution

A frequency distribution is any device, such as a graph or table, that displays the values that
a variable can assume along with the frequency of occurrence of these values, either individually
or as they are grouped into a set of mutually exclusive and exhaustive intervals.

Class intervals are contiguous, nonoverlapping intervals selected in such a way that they
are mutually exclusive and exhaustive. That is, each and every value in the set of data can be
placed in one and only one of the intervals.

Steps in Constructing a Frequency Distribution:

1. Calculate the range: R = largest value - smallest value.
2. Determine the number of class intervals (k). Usually between 6 and 15 class intervals are
required. As a guide, we use the formula $k = 1 + 3.322 \log_{10} n$, where n is the number of
observations. We should not regard the number of class intervals indicated by the formula
as final; the actual number of class intervals may be more or less than the k obtained from
the formula.
3. Decide on the class size: $i = R/k$. All class intervals should be of the same size, and we
should select a class size that is convenient to work with.
4. Organize the class intervals and proceed to construct the frequency distribution table.
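
The four steps above can be sketched in Python as follows. This is only one possible reading of the procedure: the function name `frequency_distribution`, the choice to round the class size up to a whole number, the decision to start the first class at the smallest value, and the sample `scores` are all assumptions made for illustration. The sketch also assumes integer-valued data, so each class has an inclusive upper limit.

```python
import math


def frequency_distribution(data, width=None):
    """Return a list of (lower_limit, upper_limit, frequency) for integer data."""
    n = len(data)
    smallest, largest = min(data), max(data)

    # Step 1: the range R.
    data_range = largest - smallest

    # Step 2: suggested number of class intervals, k = 1 + 3.322 log10(n).
    k = 1 + 3.322 * math.log10(n)

    # Step 3: class size i = R / k, rounded up to a convenient whole number
    # (an assumption of this sketch; any convenient size may be chosen).
    if width is None:
        width = max(1, math.ceil(data_range / k))

    # Step 4: build equal-width classes and count the observations in each.
    table = []
    lower = smallest
    while lower <= largest:
        upper = lower + width - 1  # inclusive upper limit for integer data
        freq = sum(1 for x in data if lower <= x <= upper)
        table.append((lower, upper, freq))
        lower += width
    return table


# Hypothetical scores, made up for illustration only.
scores = [12, 15, 17, 21, 22, 22, 25, 28, 30, 31,
          33, 34, 36, 38, 41, 42, 45, 47, 50, 53]

table = frequency_distribution(scores)
for lower, upper, freq in table:
    print(f"{lower}-{upper}: {freq}")
```

Because the classes are built to be mutually exclusive and exhaustive, the frequencies in the table add up to n.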

# Also Discuss: True class limits (class boundaries); lower/upper class limits, class marks.

Cumulative Frequency Distribution

Sometimes one wants a cumulative frequency distribution. The entries in the "less than"
cumulative frequency (cf <) column are obtained by adding the number of observations from the
first (smallest) interval through the interval in question, inclusive. A cf < indicates the number
of observations that fall below a specified upper boundary. Meanwhile, to obtain the entries in the
"greater than" cumulative frequency (cf >) column, we add the number of observations from the
largest interval down through the interval in question, inclusive. A cf > indicates the number of
observations that fall above a specified lower boundary.

A cumulative percentage frequency distribution allows us to determine what percentage of the
observations fall above or below a class boundary. Steps:

1. Cumulate the frequencies.
2. Divide each cumulative frequency by n and multiply by 100%. (Equivalently, cumulate the
relative frequencies and multiply by 100%.)
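
A minimal sketch of these computations is given below. It reads cf < for an interval as the count of observations up to that interval's upper boundary, and cf > as the count from that interval's lower boundary upward. The class frequencies are hypothetical and the variable names are chosen only for this example.

```python
# Hypothetical class frequencies, ordered from the smallest interval to the largest.
frequencies = [3, 4, 5, 4, 3, 1]
n = sum(frequencies)

# cf <: running total from the smallest interval up to and including each interval.
cf_less = []
running = 0
for f in frequencies:
    running += f
    cf_less.append(running)

# cf >: running total from the largest interval down to and including each interval.
cf_greater = []
running = 0
for f in reversed(frequencies):
    running += f
    cf_greater.append(running)
cf_greater.reverse()

# Cumulative percentages: each cumulative frequency divided by n, times 100.
cum_pct_less = [100 * c / n for c in cf_less]
cum_pct_greater = [100 * c / n for c in cf_greater]

print(cf_less)
print(cf_greater)
print(cum_pct_less)
```

The last cf < entry and the first cf > entry both equal n, which is a quick check on the arithmetic.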

III. Histogram and Frequency Polygon

A histogram is a special type of bar graph that represents a frequency distribution. In
a histogram, we plot the variable under consideration on the horizontal axis and the frequency on
the vertical axis. We locate the class intervals on the horizontal axis and above each we erect a
vertical bar. The height of a bar corresponds to the frequency of observations in the class interval
above which it is erected. We also make the adjacent cells of a histogram contiguous. We may
use the true class limits to label the horizontal axis of a histogram. However, we may find it
more meaningful to use the lower class limits, the upper class limits, or both.

A frequency polygon is a line graph that offers an alternative way to display a frequency
distribution. To construct this graph, we place a dot above the center (class mark) of each class
interval at a height corresponding to the frequency for that interval. We then connect the dots
with straight lines. We can make the frequency polygon touch the horizontal axis at both ends by
extending it to the center of an imaginary class interval at each end.
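
The points of a frequency polygon can be computed before any plotting is done. The sketch below assumes equal-width classes given as hypothetical (lower limit, upper limit) pairs, takes the class mark as the midpoint of the two limits, and adds one imaginary zero-frequency class at each end. The resulting list of (class mark, frequency) pairs could then be passed to any plotting tool.

```python
# Hypothetical class intervals (lower limit, upper limit) and their frequencies.
classes = [(12, 19), (20, 27), (28, 35), (36, 43), (44, 51), (52, 59)]
frequencies = [3, 4, 5, 4, 3, 1]

# Class size, taken as the distance between successive lower limits
# (this assumes all classes have the same size).
width = classes[1][0] - classes[0][0]

# Class mark: the center of each class interval.
class_marks = [(lower + upper) / 2 for lower, upper in classes]

# One dot per class, at a height equal to the class frequency.
points = list(zip(class_marks, frequencies))

# Extend to the center of an imaginary class at each end, with frequency 0,
# so that the polygon touches the horizontal axis.
points = [(class_marks[0] - width, 0)] + points + [(class_marks[-1] + width, 0)]

for mark, freq in points:
    print(mark, freq)
```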

IV. Descriptive Measures

In addition to tabular and graphical methods of summarizing data, it is also useful to
summarize data by methods that lead to numerical results, called descriptive measures. We will
discuss two types of descriptive measures: measures of central tendency and measures of
dispersion.

A descriptive measure computed from or used to describe a sample of data is called a
statistic. A descriptive measure computed from or used to describe a population is called
a parameter.

A. Measures of Central Tendency


It is impractical to keep in mind all the values in a data set. What we
need is some single value that we may consider typical of the set of data as a whole. The
need for such a single value is usually met by one of the three measures of central tendency:
the arithmetic mean (commonly known as the average), the median, and the mode.

The Arithmetic Mean is the most popular measure of central tendency. We find it by adding
all the values in a set of data and dividing the total by the number of values that were
summed.

Ungrouped data: $\bar{x} = \frac{\sum x_i}{n}$

where: xi = individual values; n = total number of values added.


Grouped data: $\bar{x} = \frac{\sum fM}{n}$

where: f= frequency; M= class mark; n = total number of values/observations
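
Both formulas for the mean can be checked with a short sketch. The values, class marks, and frequencies below are hypothetical and serve only to show the arithmetic.

```python
# Ungrouped data: hypothetical individual values.
values = [4, 8, 15, 16, 23, 42]
mean_ungrouped = sum(values) / len(values)  # sum of x_i divided by n

# Grouped data: hypothetical class marks (M) and frequencies (f).
class_marks = [15.5, 23.5, 31.5, 39.5, 47.5, 55.5]
frequencies = [3, 4, 5, 4, 3, 1]
n = sum(frequencies)  # total number of observations

# Sum of f * M over all classes, divided by n.
mean_grouped = sum(f * m for f, m in zip(frequencies, class_marks)) / n

print(mean_ungrouped)
print(mean_grouped)
```

The grouped mean treats every observation in a class as if it were equal to the class mark, so it generally only approximates the mean of the raw data.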


Properties of the Mean:
1. For a given data set, there is one and only one mean.
2. Its meaning is easily understood.
3. Since every value goes into its computation, it is affected by the magnitude of each
value. Because of this property the mean may not be the best measure of central
tendency when one or two extreme values are present in a data set.

4. The mean cannot be obtained by inspection; it is a computed value and therefore can
be manipulated algebraically.

# Include discussion of weighted mean [give as assignment]

The median is that value above which half the values lie and below which the other half
lie. If the number of items is odd, the median is the value of the middle item of an ordered
array, when the items are arranged in ascending (or descending) order of magnitude. If the
number of items is even, no single item has an equal number of values above and below
it. In this case, the median is equal to the mean of the two middle values.
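
The odd and even cases can be sketched as a small function. The name `median` and the test values are chosen for illustration only.

```python
def median(values):
    """Median of ungrouped data: middle item, or mean of the two middle items."""
    ordered = sorted(values)  # build the ordered array first
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd number of items: the middle item of the ordered array.
        return ordered[mid]
    # Even number of items: the mean of the two middle values.
    return (ordered[mid - 1] + ordered[mid]) / 2


print(median([7, 1, 5]))     # odd number of items
print(median([7, 1, 5, 3]))  # even number of items
```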

Grouped data: $\text{Median} = L + \frac{j}{f}\, i$

where:
L = the lower boundary of the class interval in which the median is located.
j = the number of values still needed to reach the median after the lower
boundary of the interval containing the median has been reached; that is,
j = n/2 - cf <, where cf < is the cumulative frequency of the intervals below L.
i = class size.
f = the frequency in the class interval containing the median.
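
The grouped-data formula can be sketched as follows. The function name `grouped_median`, the class boundaries, and the frequencies are hypothetical; the median class is taken to be the first class whose cumulative frequency reaches n/2, which is one common convention and an assumption of this sketch.

```python
def grouped_median(lower_boundaries, frequencies, class_size):
    """Median = L + (j / f) * i for a grouped frequency distribution."""
    n = sum(frequencies)
    half = n / 2
    cumulative = 0  # cf < of the classes below the current one
    for L, f in zip(lower_boundaries, frequencies):
        if f > 0 and cumulative + f >= half:
            # j: values still needed to reach the median after L is reached.
            j = half - cumulative
            return L + (j / f) * class_size
        cumulative += f
    raise ValueError("the frequency distribution has no observations")


# Hypothetical lower class boundaries and frequencies, class size 8.
lower_boundaries = [11.5, 19.5, 27.5, 35.5, 43.5, 51.5]
frequencies = [3, 4, 5, 4, 3, 1]

median_value = grouped_median(lower_boundaries, frequencies, 8)
print(median_value)
```

With these numbers, n/2 = 10 and cf < below the third class is 7, so j = 3, f = 5, and L = 27.5.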

Properties of the Median


1. The median always exists in a set of numerical data. For a given data set, there is only
one median.
2. The median is not often affected by extreme values, whereas the mean is. Because of
this property, the median is frequently the central tendency measure of choice for a data
set that is skewed.
3. The median can be used to characterize qualitative data, provided the categories can be
ranked (ordinal data).
4. The median is easy to calculate unless a large number of values are involved.
5. The median for a data set can be calculated even when the data are incomplete, provided
that the number and the general location of all measurements are known and exact
information regarding the magnitude of the measurements near the center of the data set is
available.

The median for a frequency distribution is that value or point on the horizontal axis of the
histogram of the distribution at which a perpendicular line divides the area of the histogram
into two equal parts.

The mode for ungrouped discrete data is the value that occurs most frequently. If all the
values in a set of data are different, there is no mode. When we want to find the mode of
a frequency distribution, we usually specify the modal class, which is defined as the class
interval containing the largest number of values.
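
A short sketch of both ideas is given below. The function name `modes` is hypothetical, and returning every tied value when several values share the highest count is an assumption of this sketch rather than something stated above. The classes and frequencies are made up for illustration.

```python
from collections import Counter


def modes(values):
    """Return the most frequent value(s); an empty list means there is no mode."""
    if not values:
        return []
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        # All values are different, so there is no mode.
        return []
    # Assumption: report all values tied for the highest count.
    return sorted(v for v, c in counts.items() if c == highest)


print(modes([2, 3, 3, 5, 7]))
print(modes([1, 2, 3]))

# Modal class of a frequency distribution: the class with the largest frequency.
classes = [(12, 19), (20, 27), (28, 35), (36, 43), (44, 51), (52, 59)]
frequencies = [3, 4, 5, 4, 3, 1]
modal_class = classes[frequencies.index(max(frequencies))]
print(modal_class)
```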

B. Measures of Dispersion

Once we have computed the mean of a data set, we want to know the extent to which the
values differ from this mean. We use the term dispersion to describe the degree to which
a set of values varies about its mean. When the values are close to the mean, they exhibit
less dispersion than when some of the values are much larger and/or much smaller than the
mean.

The range is the difference between the largest and the smallest values in a set of data. For
grouped data, the range is simply the difference between the exact upper limit of the largest
class interval and the exact lower limit of the smallest class interval.

The variance uses all the deviations of values from their mean. For a sample, it is the sum
of the squared deviations of the individual values from the mean of the data set, divided
by n - 1; it is therefore roughly the average squared deviation.

Ungrouped: $s^2 = \frac{\sum (x_i - \bar{x})^2}{n-1}$ or $s^2 = \frac{\sum x_i^2 - n\bar{x}^2}{n-1}$

Grouped: $s^2 = \frac{n \sum fM^2 - (\sum fM)^2}{n(n-1)}$

The standard deviation is simply the positive square root of the variance.
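
The variance formulas and the standard deviation can be sketched as follows. The function names and the data are hypothetical; the sketch uses the sample (n - 1) form and the class-mark (M) and frequency (f) notation used in this handout.

```python
import math


def sample_variance(values):
    """Definitional form: sum of squared deviations from the mean over n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)


def sample_variance_shortcut(values):
    """Computational form: (sum of x^2 - n * mean^2) over n - 1."""
    n = len(values)
    mean = sum(values) / n
    return (sum(x * x for x in values) - n * mean ** 2) / (n - 1)


def grouped_variance(class_marks, frequencies):
    """Grouped form: (n * sum(f M^2) - (sum(f M))^2) over n (n - 1)."""
    n = sum(frequencies)
    sum_fm = sum(f * m for f, m in zip(frequencies, class_marks))
    sum_fm2 = sum(f * m * m for f, m in zip(frequencies, class_marks))
    return (n * sum_fm2 - sum_fm ** 2) / (n * (n - 1))


# Hypothetical data, made up for illustration only.
values = [2, 4, 4, 4, 5, 5, 7, 9]
class_marks = [15.5, 23.5, 31.5, 39.5, 47.5, 55.5]
frequencies = [3, 4, 5, 4, 3, 1]

variance = sample_variance(values)
std_dev = math.sqrt(variance)  # positive square root of the variance

print(variance, std_dev)
print(grouped_variance(class_marks, frequencies))
```

The two ungrouped forms are algebraically equivalent, so they should agree up to rounding error.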

Sometimes the need arises to compare the variability present in two sets of data. This
usually can be done by comparing the two variances or standard deviations if the data sets
satisfy two conditions: 1) the same unit of measurement is employed in both data sets; 2)
the means of the two data sets are approximately equal. If either of these conditions is not
met, we need a relative measure of dispersion for use in comparing the variability of the
two data sets. One such relative measure of dispersion is the coefficient of variation. The
sample coefficient of variation (CV) is equal to the ratio of the standard deviation to the
mean. That is, $CV = \frac{s}{\bar{x}}$. The CV is frequently multiplied by 100 and expressed as a percent.
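
The sketch below compares the variability of two data sets measured in different units. The function name `coefficient_of_variation` and both data sets are hypothetical; `statistics.stdev` from the Python standard library computes the sample standard deviation with the n - 1 denominator.

```python
import statistics


def coefficient_of_variation(values):
    """Sample CV: standard deviation divided by the mean."""
    mean = statistics.mean(values)
    s = statistics.stdev(values)  # sample standard deviation (n - 1 denominator)
    return s / mean


# Hypothetical measurements in different units, made up for illustration only.
heights_cm = [150, 160, 165, 170, 180]
weights_kg = [45, 55, 60, 72, 90]

cv_heights = coefficient_of_variation(heights_cm)
cv_weights = coefficient_of_variation(weights_kg)

# Expressed as percents so that the two data sets can be compared directly.
print(f"Heights: CV = {100 * cv_heights:.1f}%")
print(f"Weights: CV = {100 * cv_weights:.1f}%")
```

Because the CV is a ratio of two quantities in the same unit, it is unit-free, which is what makes the comparison across centimeters and kilograms meaningful.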