
© Dr. Valerie P. Muehsam, 2006

Quantitative Techniques in Business

Introduction to Statistics

In the business world, and in fact, in practically every aspect of daily living, quantitative techniques are used to assist in

decision making. Why? Unlike the classroom, in the “real world” there is often not enough information available to guarantee

making a correct decision. For instance, if advertisers would like to know how many households in the United States with televisions

are tuned to a particular television show, at a particular date and time, it would be impossible to determine without the complete

cooperation of every household and an astonishing amount of time and money. If a consumer protection agency wanted to determine

the true proportion of prescription drug users who also use unregulated, over-the-counter herbal supplements, this information would

most likely not be available. As a result of the inability to determine characteristics of interest, the application of statistics and other

quantitative techniques has developed.

Statistics is defined as the process of collecting a sample and organizing, analyzing, and interpreting the data. The numeric values

which represent the characteristics analyzed in this process are also referred to as statistics. When information related to a particular

group is desired, and it is impossible or impractical to obtain this information, a sample or subset of the group is obtained and the

information of interest is determined for the subset. For instance, if someone is interested in the average annual income of all the

students with majors in the College of Business Administration at Sam Houston State University, the only way this information could

be obtained is if the annual income of every student in this population could be collected, recorded and analyzed without error. Since

this would take considerable time and money, and since the probability of collecting the data necessary to determine the true annual

salary of the students is small, a sample of this population will be taken. The sample mean annual salary of the sample of students will

be determined and used to estimate the true mean annual salary of all the students with majors in the College of Business

Administration at Sam Houston State University.

The study of statistics consists of two types: descriptive statistics and inferential statistics. Descriptive statistics are

characteristics, usually numeric, used to describe a particular data set. An example of a descriptive statistic would be the average final

exam grade of ten students in an elementary statistics class. This average test score is used to indicate a “typical value” for the exam

grades of the ten students. Inferential statistics, on the other hand, are similar to descriptive statistics in that each is calculated from a

sample, but the difference is the use of the statistic. In inferential statistics, the statistic is used to make inference, or make decisions,

about the entire population of interest. In other words, we take a sample and calculate a statistic and use that statistic to make

inference about the actual value of the characteristic in the entire population.

For instance, there are many descriptive characteristics of a firm’s customers that its management would like to know, but

this information may be difficult or impossible to determine. Measurement of each and every customer of a large retail firm is nearly

impossible. Even if the information were gathered, it would be unlikely that it would be timely.

Unfortunately, managers do not always know what mean (average) weekly demand for a product will be or what

proportion of television viewers will watch a particular show. Since these parameters of interest are not known, and usually

impossible or impractical to determine, the parameters will be estimated using partial information gathered from a sample.
For instance, if the desired parameter is the mean annual salary of the income earning residents of a particular county, a

sample of 200 of these residents could be obtained and the annual salary of each resident (element) in the sample could be determined

and the mean annual salary of the sample residents calculated. If the sample is drawn in a random fashion from a frame, or list, of the entire

population, and if we use correct statistical techniques, the sample mean annual salary (a statistic) may be a good estimate of the true

mean annual salary (a parameter) of all the residents of this county.

A population includes all the elements of interest. We use the term “element” to represent each individual unit of a group

in which we have interest. For instance, elements may refer to people (e.g., customers), records (e.g., all loan accounts at a particular bank), products (e.g., we are interested in the proportion defective), etc. The notation used in statistics to represent the population size

is “N”. In our example above, the population of interest would be all the income earning residents of the county. Each of these

residents is an element in our population. If the population of the income earning residents in the county was 50,000 then N = 50,000.

The size of the population, N, is often not known.

A sample is a subset of the population. The notation for the sample size is “n”. In our previous example, the sample

would be the 200 residents we sampled out of all the income earning residents in the county. In this case n = 200.

A parameter is a characteristic, usually numeric, of the population. Populations have many parameters but researchers are

often interested in only one or two of these characteristics. For instance, in our example above, the parameter of interest is the

population mean annual salary of all the income earning residents of the county. The mean annual salary is but one of many other

characteristics of this population that may be of interest and could also be estimated. The proportion of these residents who support a

particular school bond issue and the mean age of the residents are two examples of other parameters that may be of interest.

A statistic is a characteristic, usually numeric, of the sample. Samples, like populations, also have many statistics that may

be calculated. For each parameter of a population, there is a corresponding statistic that may be calculated from a sample. An

important item to remember is that a statistic is a random variable, which means that different samples may result in different

values for the statistic. For instance, in the example above, the statistic is the sample mean annual income of the 200 residents of the

county. This value is called the “sample mean” because it is calculated from the sample.

Although the sample mean is our “best guess” for the value of the population mean, it is one of many possible values that

could be calculated from different samples of size 200. In other words, there are many samples of 200 that could be collected from the

population of 50,000 residents. Unfortunately, even if we take a random sample of 200, we could end up with the most affluent 200

residents in the county. The sample mean calculated from this sample would not be representative of the population. The possibility

of collecting a sample like this cannot be ignored. We will, however, learn to use statistical techniques that allow us to estimate the

probability of getting a value for the sample statistic that is not a good estimate of the population parameter.

The use of statistics to estimate parameters of interest is not guaranteed to be successful. If the estimate is not “good” the

result could be a faulty decision that, in turn, could result in loss of time and/or revenue. We must not allow quantitative techniques to

make decisions for us; we must use these techniques only as tools to assist us in decision making.

Scale of Data Measurement

Before any statistical technique is employed, a researcher must determine the type of data that is to be collected. In a

general sense, there are two types of data: qualitative data and quantitative data.

Qualitative data categorizes an element by a non-numeric attribute. For instance, if we are interested in which political

party a resident belongs to, we are categorizing the resident using qualitative data: Democratic, Republican, Independent, etc.

Qualitative data is often the data we are interested in gathering in the social sciences and particularly in business. For instance, much

of what we want to know in business is related to attitudes or behavior of consumers. The data is not numeric and therefore more

difficult to analyze. We often calculate the proportion of elements with a particular characteristic (e.g., the proportion of residents who

own their own home) but many techniques cannot be used on this type of data.

There are two types of qualitative data: nominal data and ordinal data. Nominal data is, in terms of structure, the

lowest form of data. Nominal data is qualitative data that has no natural order. Examples of nominal data include: gender; political

affiliation; type of car owned; product model; etc. Data comprised of “numbers” can also be qualitative data. Zip codes, area codes,

and telephone numbers are examples of data that are qualitative. In math terms, these data are not “real” numbers because they do not

represent numeric measures. One way to determine whether “numbers” are numeric measures is to consider whether one might be

interested in an average of these “numbers”. If a number can be replaced with letters, words or symbols without losing any

information then this indicates that a “number” is NOT a numeric measure. Ordinal data is qualitative data that has a natural order.

Examples of ordinal data include: military rank; size of clothing using S, M, L, XL; place in which a race was finished; condition of a

used appliance using POOR, AVERAGE, GOOD, EXCELLENT; etc. While ordinal data has an order, the intervals between the

rankings are not equal intervals. Thus, while ordinal data has more structure than nominal data, math functions on the data, such as

differences, are not valid.

Quantitative data categorizes an element by a numeric measure. Quantitative data are true numbers and, as a result, more

quantitative techniques are available for use with this data. Quantitative data can be divided into two types of data: interval data and

ratio data. Interval data is quantitative data that has no natural starting point or zero level. Examples of interval data include

Fahrenheit temperature and scores on IQ tests. Each of these is a numeric measure, but neither has a natural starting point

or zero level. Zero degrees Fahrenheit is not the absence of temperature just as there is no zero level for a test of intelligence. Interval

data can be used for any technique that requires quantitative data; however, we must realize that ratios have no meaning with this type

of data since there is no natural zero level. For example, 50 degrees Fahrenheit is not twice as warm as 25 degrees Fahrenheit. Ratio

data is quantitative data that has a natural starting point or zero level. Most quantitative data falls into this scale of data measurement.

Examples of ratio scaled data include height, weight, rate of return, net income, etc. Since there is a natural zero level, ratios have

meaning.

Measures of Central Tendency

Once we have decided the type of data that we are going to collect, we must determine the type of techniques that are

appropriate for analyzing the data. The first organizational technique we will most likely perform is to order the data from smallest

value to largest value. We order the data to get an idea about the range of the values observed. Consider a particular example: if we have collected annual income figures from 1,000 households, what might we be interested in knowing about these data? Perhaps we

would be interested in a typical annual income value for the data set. Typical values are often referred to as Measures of Central

Tendency. Measures of central tendency are attempts to identify typical values which are representative of the 1,000 observations

collected. The three most common measures of central tendency are the mean, the median and the mode. All three of these

measures are referred to as “average” or “typical” values although they are each different measures of typical.

The first, and most popular, measure of central tendency is the arithmetic mean, hereafter referred to as simply the mean.

The mean is calculated as the sum of the observations divided by the number of observations. The sample mean is denoted $\bar{x}$ and the formula for calculating the sample mean is: $\bar{x} = \frac{\sum x}{n}$. The population or true mean is denoted $\mu$ (the Greek letter “mu”) and is calculated the same way as the sample mean except that all elements in the population are measured.
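As a quick illustration, here is a minimal sketch in Python; the income values are hypothetical.

```python
# A minimal sketch of the sample mean, using hypothetical salary data.
incomes = [28000, 31500, 24000, 45000, 30250]  # hypothetical sample, n = 5

n = len(incomes)
x_bar = sum(incomes) / n  # sum of the observations divided by their count
print(x_bar)              # 31750.0
```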

The mean requires at least interval scaled data, which means it is only valid for true numeric measures. The mean is often

referred to as the “gravitational center of the data set” which is similar to the balancing point of the data. If equal weights were

placed on a scale representing a number line for each observation in a data set, the mean would be the point at which the scale

balances. Since each observation has an equal weight, the magnitudes of the values influence the mean. The mean, while certainly the

most commonly used measure of central tendency, is not always a good measure of “typical.” For instance, data sets that include

extreme values relative to the rest of the data “pull” the mean in that direction. Extremely small values cause the mean to be “small”

and extremely large values cause the mean to be “large.” The result is that the mean is not a “good” measure of typical and in fact,

may be larger or smaller than all values except the extreme one. When extreme values occur in a data set, we often use another

measure of typical referred to as the median. For instance, a typical income is often best expressed as the median

income rather than the mean income since there is a lower limit (zero) but not an upper limit on income.

The median is the second most commonly used measure of central tendency and is referred to as the positional average.

The median is the center value in an ordered data set. If the data set has an odd number of observations then the median is the value

found in the center of the distribution of ordered values. If the sample set has an even number of values then the median is the mean

of the two values surrounding the center of the data set. The median is also P50, the fiftieth percentile. This means that 50% or half of

the values are smaller than the median and half of the values or 50% are greater than the median. The procedure for finding the

median is:

1. Order the data set from smallest to largest (or largest to smallest). NOTE: this requires that the data can be

ordered, so the median cannot be found for nominal data.

2. Find i, which is the location or position of the median. This position can be calculated by using the following formula: $i = \frac{n+1}{2}$, where n is the size of the sample.

3. If i is an integer then the median is the value found at the ith position in the ordered data set. If i is not an

integer, then the median is the mean of the two values surrounding the ith position.

The median is often denoted as M or $\tilde{x}$.
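The three-step procedure translates directly into code. A minimal sketch, using hypothetical data (positions in the comments are 1-based, as in the text):

```python
# A sketch of the median procedure described above (hypothetical data).
def median(data):
    ordered = sorted(data)          # step 1: order the data
    n = len(ordered)
    i = (n + 1) / 2                 # step 2: position of the median (1-based)
    if i.is_integer():
        return ordered[int(i) - 1]  # step 3: i is an integer
    lo = int(i)                     # step 3: i is not an integer, so take
    return (ordered[lo - 1] + ordered[lo]) / 2  # the mean of the two neighbors

print(median([7, 1, 5, 3, 9]))      # n = 5, i = 3, median = 5
print(median([7, 1, 5, 3, 9, 11]))  # n = 6, i = 3.5, mean of 5 and 7 = 6.0
```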
The last of the more common Measures of Central Tendency is called the mode. The mode is the most commonly

occurring value in a data set; in other words, the value that occurs with the greatest frequency. The mode, unlike either the mean or

the median, does not have to be unique. A data set can have more than one mode or no mode at all. A data set with one mode is referred to as unimodal; with two modes, bimodal; and with three or more modes, multimodal. There is no

universal notation for the mode and the mode is valid for any type of data.
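A sketch of finding the mode(s) with Python's collections.Counter, covering the unimodal, bimodal, and no-mode cases on hypothetical data:

```python
# A sketch of the mode; a data set may have one mode, several, or none.
from collections import Counter

def modes(data):
    counts = Counter(data)
    top = max(counts.values())  # the greatest frequency
    if top == 1:
        return []               # every value occurs once: no mode
    return [value for value, f in counts.items() if f == top]

print(modes([2, 4, 4, 7, 9]))  # [4]    -- unimodal
print(modes([2, 2, 4, 7, 7]))  # [2, 7] -- bimodal
print(modes([1, 2, 3]))        # []     -- no mode
```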

Measures of Data Variation

Besides a measure of “typical,” what else might we want to know about a data set? Do the measures of central tendency

tell us all we need to “know” about the observations we have collected? Certainly not; in fact, two data sets could have the same mean and be completely different in terms of dispersion. Consider that we “know” the mean depth of a lake where we plan our next office

picnic. Suppose the mean depth of the lake is 4 feet; is this all we need to know about the depth of this lake? No. We need to know how much the values (depths) vary around 4 feet. The depth of the lake could be 4 feet at every point and have a mean of 4 feet or

the depth of the lake could vary greatly around four feet and still have a mean of 4 feet. There could be places where the depth is a

few inches and other places where the depth is 10 feet. This information about how the data are dispersed is very important

(especially for those of us who cannot swim). The study of statistics could appropriately be referred to as the study of variability since

many of the techniques employ the comparison of the variability of typical values in different groups to determine whether or not

these values are the same or different between groups. Measures of Data Variation (variability, dispersion, or spread) are attempts

to describe how spread out the values in a particular data set are, or how much they vary. All measures of data variation or dispersion

require quantitative data to calculate and are nonnegative. The measures of data variation are zero (if all the values are equal) or

positive. A “large” measure of spread indicates a more dispersed data set while a “small” measure indicates a more tightly grouped

data set.

The easiest measure of spread to calculate is the range. The range is the difference between the largest or maximum value

and the smallest or minimum value. The notation and formula for the range is: $R = H - L$, where H is the largest or maximum value and L is the smallest or minimum value. The range, while simple to calculate, is only informative if it is “small.” “Small” and

“large” are relative terms and must be determined relative to the magnitude of the values measured. For instance, a range of $3 for

dinner could be characterized as “small” if we are eating at a five-star restaurant in a pricey hotel in New York City where the dinner

entrees range in price from $12.00 to $35.00 but may be characterized as “large” if we’re eating at a local fast-food restaurant. If the

range is “small” it means that the two extreme values are very close to each other, so the rest of the values must also be tightly

grouped. If the range is “large” we know that the extreme values are a long way from each other but we know nothing about the

distribution of the rest of the observations. Since the range only uses two values in its calculation, we are provided with limited

information.
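The range takes one line to compute. A sketch with hypothetical dinner prices:

```python
# The range R = H - L for a hypothetical set of dinner prices.
prices = [12.00, 18.50, 22.00, 35.00, 15.75]
R = max(prices) - min(prices)  # H = 35.00, L = 12.00
print(R)                       # 23.0
```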

Like our favorite measure of central tendency, the mean, we might like to come up with a measure of variability that

incorporates all the values in the data set as opposed to using only the two values needed to calculate the range. We might be

interested in finding out, on the average, how much the values vary around a “typical value.” In an effort to describe the variability of

a data set we could measure the distance each value is from the mean, our standard measure of “typical.” The distance a value is from

the mean is called the “deviation from the mean” and is found by subtracting the mean from a particular value. This deviation from

the mean can be negative (if the value is smaller than the mean), positive (if the value is bigger than the mean), or zero (if the value is

equal to the mean). To calculate the average deviation from the mean, we could sum the deviations from the mean for each value in

the data set and divide by the number of observations in our sample. Unfortunately, although a good idea intuitively, this value will

always be zero since the mean is the gravitational center of the data set; as a result, the deviations from the mean sum to zero and so the average deviation would be zero (0): $\frac{\sum(x - \bar{x})}{n} = 0$. This occurs because the deviations from the mean
that are negative offset the deviations from the mean that are positive. We can avoid this problem by using the absolute value or

square of the deviations from the mean.

The Mean Absolute Deviation (MAD) is the sum of the absolute deviations from the mean divided by the sample size: $\text{MAD} = \frac{\sum |x - \bar{x}|}{n}$. The MAD is used in financial analysis to determine the variability in stock prices from the expected
price. Unfortunately, while the MAD is the “best” measure of spread for descriptive purposes, it is not useful for inferential statistics

since the distribution of an absolute value function is not smooth.
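A sketch of the MAD on hypothetical observations; it also confirms that the raw deviations from the mean sum to zero:

```python
# Mean Absolute Deviation: the average absolute distance from the mean.
data = [4, 8, 6, 5, 7]  # hypothetical observations
n = len(data)
x_bar = sum(data) / n   # 6.0

raw_sum = sum(x - x_bar for x in data)       # always 0: deviations offset
mad = sum(abs(x - x_bar) for x in data) / n  # absolute values avoid that
print(raw_sum, mad)                          # 0.0 1.2
```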

The sample variance, denoted $s^2$, is the sum of the squared deviations from the mean divided by the sample size less one

(n-1). Continuing our effort to find an average deviation from the mean, we square the deviations from the mean to eliminate any

negative values so our numerator is not equal to zero, and then divide by the sample size less one. Our denominator is made smaller

(hence our variance is made larger) as an adjustment to our estimate for the true population variance, denoted $\sigma^2$ (sigma squared), since we calculate the sample variance, $s^2$, using the sample mean, $\bar{x}$, instead of the true population mean, $\mu$ (mu). The true measure of variability for the population should be calculated according to each value’s distance from $\mu$, the population mean. The adjustment in the denominator makes our estimate larger than without the adjustment to account for the estimate ($\bar{x}$) used in the numerator. Since we would prefer to have a “small” measure of variability because this indicates that the mean, $\bar{x}$, is a good

measure of “typical” since most of the values are “close to” the mean, adjusting our estimate for the variance to be larger is considered

to be conservative. We are unsure of the true value of the mean so we use the value of the sample mean to estimate the variability in

the data. The deviations from the mean are estimated using deviations from the sample mean. It is said that we lose one degree of

freedom (df) in the denominator for every estimate in the numerator. All variances are of the form: sum of squares divided by

degrees of freedom.

The problem with the variance is that the value is in squared units. For instance, if we are measuring the dollar amount

spent on lunch, the variance will be in dollars squared. Since squared units make interpretation difficult, we normally take the square

root of the variance to return to the original units of measurement. The positive square root of the sample variance, $s^2$, is the sample standard deviation, s. The sample standard deviation, s, is our estimate for the true population standard deviation, denoted $\sigma$ (sigma), which is the positive square root of the population variance, $\sigma^2$. The definitional formula for the sample variance, $s^2$, is

given below followed by an algebraic manipulation which we call the computational formula. The computational formula is easier and

faster to calculate but intuitively the definitional formula makes more sense as our estimate of the “average” (squared) deviation from

the mean.

$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1} = \frac{\sum x^2 - \frac{(\sum x)^2}{n}}{n - 1}$ = the sample variance

$s = \sqrt{s^2}$ = the sample standard deviation
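A sketch on hypothetical data confirming that the definitional and computational formulas agree:

```python
# Sample variance two ways, plus the sample standard deviation.
import math

data = [4, 8, 6, 5, 7]  # hypothetical observations
n = len(data)
x_bar = sum(data) / n

# Definitional formula: sum of squared deviations over n - 1.
s2_def = sum((x - x_bar) ** 2 for x in data) / (n - 1)

# Computational formula: algebraically identical, faster by hand.
s2_comp = (sum(x * x for x in data) - sum(data) ** 2 / n) / (n - 1)

s = math.sqrt(s2_def)      # the sample standard deviation
print(s2_def, s2_comp, s)  # 2.5 2.5 1.58...
```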

Although we rarely calculate parameters, the following formulae are given for the population variance and the population standard

deviation.

$\sigma^2 = \frac{\sum (x - \mu)^2}{N} = \frac{\sum x^2 - \frac{(\sum x)^2}{N}}{N}$ = the population variance

$\sigma = \sqrt{\sigma^2}$ = the population standard deviation.

Uses of the Standard Deviation

The standard deviation of a sample is an attempt to estimate the typical distance that values in the data set differ from the

mean. We use the standard deviation as the step-size to estimate the percentage of values that lie within one, two, or three steps

of the mean. For example, Chebyshev’s Theorem, which applies to any distribution regardless of its shape, states that at least $\left(1 - \frac{1}{k^2}\right)100\%$ of the values will fall within k standard deviations of the mean. Since Chebyshev’s Theorem applies to any distribution regardless of shape, the information learned is less specific than we might like. In other words, using the formula, we would discover

that at least 75% of the observations (in any distribution) lie within 2 standard deviations of the mean. This means that 75%-

100% of the values will fall within two standard deviations of the mean. While some information is better than none, we would like to

be more precise in our estimate of this percentage. For certain known distributions, we can more precisely estimate the percentage of

values that lie within one, two or three standard deviations of the mean.
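A sketch evaluating the Chebyshev bound for a few values of k:

```python
# Chebyshev's Theorem: at least (1 - 1/k^2) of the values lie within
# k standard deviations of the mean, for ANY distribution (k > 1).
for k in (1.5, 2, 3):
    bound = 1 - 1 / k ** 2
    print(f"k = {k}: at least {bound:.1%} of values within {k} sd")
# k = 2 gives at least 75.0%; k = 3 gives at least 88.9%
```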

The Empirical Rule, which only applies to a normal distribution, provides us with much more information about this

particular distribution than Chebyshev’s Theorem. The Empirical Rule states that for any normal distribution, approximately 68% of

the values will fall within one standard deviation of the mean, approximately 95% of the values will fall within two standard

deviations of the mean, and approximately 99.7% of the values will fall within three standard deviations of the mean. This much more

precise information is only true for data distributed normally. The normal distribution, sometimes referred to as the Gaussian

distribution after Carl Friedrich Gauss, who discovered that certain errors follow a normal distribution, is bell-shaped and symmetrical, and models

the behavior of many random variables. We will discuss the normal distribution as well as its probability distribution later in the

course.
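A simulation sketch checking the Empirical Rule against randomly generated normal data; the mean and standard deviation are hypothetical, and the exact percentages vary slightly from run to run:

```python
# Empirical Rule check: count how many normally distributed values fall
# within 1, 2, and 3 standard deviations of the mean.
import random

random.seed(1)
mu, sigma, n = 100, 15, 100_000  # hypothetical normal population
values = [random.gauss(mu, sigma) for _ in range(n)]

for k in (1, 2, 3):
    inside = sum(1 for v in values if abs(v - mu) <= k * sigma)
    print(f"within {k} sd: {inside / n:.1%}")  # about 68%, 95%, 99.7%
```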

Measures of Position or Location

Measures of central tendency and measures of data variation are single values that describe an entire data set. Measures of

position or location are measures of an individual value and indicate the relative position of that value to the other values in the data

set. A commonly used measure of position is a percentile. Aptitude tests often provide an individual’s percentile ranking to let them

know how they did relative to others who took the test. To determine what test score exceeds a certain percentage of test scores, we

first divide our data set into 100 equal parts and then count in to determine the location of the value that corresponds to the percentile

we are interested in.

The kth percentile, Pk, is that value which is equal to or greater than k% of the observations and is less than or equal to the

remaining (100-k)% of the observations.

The procedure for calculating the kth percentile is:

1. Order the data from smallest to largest value.

2. Find $\frac{nk}{100}$, where n is the sample size and k is the percentile you are calculating.

3. (a) If $\frac{nk}{100}$ is not an integer, then i, the position of the kth percentile, will be the next larger integer. For example, if $\frac{nk}{100} = 4.5$ then i = 5.

(b) If $\frac{nk}{100}$ is an integer, then i, the position of the kth percentile, will be $\frac{nk}{100} + 0.5$. For example, if $\frac{nk}{100} = 6$ then i = 6.5.

4. (a) If i is an integer (3a above) then the kth percentile is the value found at the ith position. For example, in 3a above, i =

5, so the kth percentile is the 5th value in the ordered data set.

(b) If i is not an integer (3b above) then the kth percentile is the mean of the two values surrounding the ith position.

For example, in 3b above, i = 6.5, so the kth percentile is the mean of the sixth and seventh values in the ordered

data set.
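The four steps above can be written out directly. A minimal sketch on a hypothetical data set; note that the 50th percentile reproduces the median:

```python
# A sketch of the kth-percentile procedure described above.
def percentile(data, k):
    ordered = sorted(data)     # step 1: order the data
    n = len(ordered)
    pos = n * k / 100          # step 2: nk/100
    if pos != int(pos):        # step 3a: not an integer, so i is the
        i = int(pos) + 1       #   next larger integer
        return ordered[i - 1]  # step 4a: the value at position i
    lo = int(pos)              # step 3b: integer, so i = nk/100 + 0.5
    return (ordered[lo - 1] + ordered[lo]) / 2  # step 4b: mean of neighbors

data = [2, 4, 4, 5, 6, 7, 8, 9]  # hypothetical, n = 8
print(percentile(data, 25))      # nk/100 = 2   -> mean of 4 and 4 = 4.0
print(percentile(data, 50))      # nk/100 = 4   -> mean of 5 and 6 = 5.5
print(percentile(data, 90))      # nk/100 = 7.2 -> i = 8 -> 9
```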

Sometimes, instead of being interested in what data point has a certain percentage above it or below it, researchers are

interested in determining the value that is “typical” for the “center” group of values. For example, suppose we are charged with the

responsibility of developing the curriculum for a kindergarten class. The students in a class of kindergarteners could differ

tremendously in terms of acquired knowledge. Suppose, in an effort to develop the curriculum, we give each student in the class an

aptitude test to measure his/her abilities in basic knowledge. The scores may vary greatly since some of the students may have

attended preschool since they were very young while others may not have attended at all. If we do not have the resources to have

a multi-level curriculum, then we would develop a curriculum that was targeted at those “in the middle” in terms of their aptitude

scores. Since we are interested in targeting the center of the distribution of aptitude scores, we will determine what constitutes the

“middle 50%” and gear our curriculum at those students.

Quartiles, which are just specific percentiles, allow us to divide our data into four equal groups. The first or lower

quartile, Q1, is equal to the 25th percentile, P25. The second or mid-quartile, Q2, is equal to the 50th percentile, P50, which is also the median, M. The third or upper quartile, Q3, is equal to the 75th percentile, P75. We use these quartiles to help us determine

characteristics of the middle 50% of our data. For example, the Interquartile Range (IQR), is the range of the middle 50% of the

data. Like the range, the IQR is a measure of data variation or dispersion but instead of indicating the range of all the data like the

range does, the IQR indicates the range of only the middle 50%. Like other Measures of Data Variation, the IQR requires quantitative

data to calculate. The formula for the IQR is: $IQR = Q_3 - Q_1$. To calculate the IQR, the first and third quartiles are

determined by finding the corresponding percentile, i.e., Q3=P75 and Q1=P25.

The Mid-Quartile Range, (MQR), is a statistic we calculate to determine a “typical” value in the middle group of

observations. The MQR is a Measure of Central Tendency and is the mean of the extreme values of the middle 50% of the

observations. It is not the mean of all observations in the middle 50%, but instead we find the mean of the first and third quartiles.

The formula for the MQR is: $MQR = \frac{Q_1 + Q_3}{2}$.
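Reusing the percentile() sketch from the previous section, the quartiles, IQR, and MQR follow in a few lines (data again hypothetical):

```python
# Quartiles, IQR, and MQR; assumes the percentile() sketch defined earlier.
data = [2, 4, 4, 5, 6, 7, 8, 9]  # hypothetical, n = 8

q1 = percentile(data, 25)  # first or lower quartile, P25
q3 = percentile(data, 75)  # third or upper quartile, P75

iqr = q3 - q1              # the range of the middle 50% of the data
mqr = (q1 + q3) / 2        # mean of Q1 and Q3: a "typical" middle value
print(q1, q3, iqr, mqr)    # 4.0 7.5 3.5 5.75
```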
Another measure of position or location is called the Z-score or Z value. The Z-score for a particular value in a data set

indicates the number of standard deviations that value is from the mean. Z-scores can be negative (if the value is less than the

mean), positive (if the value is larger than the mean), or equal to zero (if the value is equal to the mean). The Z-score for the mean is

always zero. For example, a value with a Z-score of 1.35 is 1.35 standard deviations above the mean. A value with a Z-score of –

2.12 is 2.12 standard deviations below the mean.

Z-values can be calculated, and a Standard Normal Table used, to determine approximately what proportion of the values,

for a normal distribution, are above or below a particular value, or between two values in a distribution.
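A sketch of the Z-score calculation; the mean and standard deviation are hypothetical, and NormalDist from Python's standard library stands in for a Standard Normal Table:

```python
# Z-score: how many standard deviations a value lies from the mean.
from statistics import NormalDist

mu, sigma = 100, 15   # hypothetical population mean and standard deviation
x = 130
z = (x - mu) / sigma  # 2.0 -> two standard deviations above the mean
print(z)

std_normal = NormalDist()     # mean 0, standard deviation 1
print(std_normal.cdf(z))      # ~0.977: proportion of values below x
print(1 - std_normal.cdf(z))  # ~0.023: proportion of values above x
```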

Frequency Distributions

Terminology:

Defn: The frequency, f, for a value or a class of values is the number of times that value or class of values
occurs in the data set.

We are simply counting how often a value or set of values occurs in the data set.
1. What is the minimum number of times a value or class of values occur(s) in a data set? The minimum number of times a
value or class of values can occur is zero (0). What is the maximum number of times a value or class of values can occur
in the data set? The maximum number of times a value or class of values can occur in the data set is n, or the total number
of values in the data set.

0 ≤ f ≤ n

2. If we add the frequencies for each value or set of values it will sum to n.

Σ f = n

Defn: The relative frequency, f/n, for a value or a class of values is the proportion of time that value or class of values occurs in the data set (the frequency divided by the total number of observations).

1. What is the minimum proportion of time a value or class of values occur(s) in a data set? The minimum proportion of time
a value or class of values can occur is zero (0). What is the maximum proportion of time a value or class of values can
occur in the data set? The maximum proportion of time a value or class of values can occur in the data set is one (1).

0 ≤ f/n ≤ 1

2. If we add the relative frequencies for each value or set of values it will sum to one (1).

Σ f/n = 1

Defn: The cumulative frequency, F, for a value or a class of values is the number of times that value or any
smaller value occurs in the data set.

We are simply keeping a running total.


1. Cumulative frequencies are non-decreasing (this means the values cannot decrease—they can level off but they can’t go
down).
2. The cumulative frequency for the last value or class of values is n.
3. We must have at least ordinal scaled data to find cumulative frequencies.

Defn: The cumulative relative frequency, F/n, for a value or a class of values is the proportion of time that value
or any smaller value occurs in the data set.

We are simply keeping a running total of relative frequencies or proportions.


1. Cumulative relative frequencies are non-decreasing.
2. The cumulative relative frequency for the last value or class of values is one (1).
3. We must have at least ordinal scaled data to find cumulative relative frequencies.
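A sketch that builds all four measures for a small hypothetical data set:

```python
# Frequency (f), relative frequency (f/n), cumulative frequency (F), and
# cumulative relative frequency (F/n) for a hypothetical data set.
from collections import Counter

data = [1, 2, 2, 3, 3, 3, 4]  # hypothetical (at least ordinal) data
n = len(data)
counts = Counter(data)

F = 0  # running total for the cumulative frequency
print("value   f    f/n    F    F/n")
for value in sorted(counts):  # cumulative measures need ordered values
    f = counts[value]
    F += f
    print(f"{value:5} {f:3} {f/n:6.3f} {F:4} {F/n:6.3f}")
# The f column sums to n = 7; the last F is n and the last F/n is 1.0.
```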

