

BUSINESS USING STATISTICAL METHODS, A CASE STUDY OF TALUKA MIR PUR BATHORO, THATTA.

THESIS/ CASE STUDY REPORT SUBMITTED IN THE


PARTIAL FULFILMENT OF THE REQUIREMENTS FOR
THE DEGREE OF BS (HONS) FINAL IN SOCIOLOGY,
DEPARTMENT OF SOCIOLOGY,
UNIVERSITY OF SINDH, JAMSHORO, SINDH,
PAKISTAN.

Submitted by:
MAJID ALI KHOWAJA
ROLL NO: 2K6/SOC/42
BS (Hons) Part-IV-2009

EXECUTIVE SUMMARY
The allotted topic is entitled "BUSINESS USING STATISTICAL METHODS, A CASE STUDY OF TALUKA MIR PUR BATHORO, THATTA." The thesis report therefore covers the basic concepts of statistics and business, followed by a case study of the footwear preferences of the males and females of Mir Pur Bathoro, which in turn helps the proprietor of a footwear house, Mr. Atta-ullah Khattri of Aashu Footwear House, taluka Mir Pur Bathoro. The report highlights the statistical method of collecting information about a concerned area of interest in order to enhance business policies, and the customer is also facilitated. Initially a form was distributed among 50 males and 50 females of the city; the collected data were then summarized in tabular form; and finally a statistical test (chi-square), a probability function widely used in testing a statistical hypothesis (for example, the likelihood that a given statistical distribution of results might be reached in an experiment), was applied and a result concluded, which will help the footwear house owner to manage his business accordingly. Hence the purpose of the thesis (business using statistical methods, along with a case study) is accomplished. The thesis ends with the conclusion and future scope of the case study.

CONTENTS

CHAPTER#1 STATISTICS
Probability
Estimation
Hypothesis testing
Bayesian methods
Experimental design
Time series and forecasting
Nonparametric methods
Statistical quality control
Sample survey methods
Decision analysis

CHAPTER#2 INTRODUCTION
Types of business
Manufacturing firms
Merchandisers
Service enterprises

CHAPTER#3 CASE STUDY (AASHU FOOTWEAR HOUSE) USING CHI-SQUARE
Overview
Bivariate tabular analysis
Generalizing from samples to populations
Chi-square requirements
Collapsing values
Computing chi-square
Interpreting chi-square values
Measures of association

CHAPTER#4 CONCLUSION & FUTURE SCOPE
Conclusion
Future scope

STATISTICS
Definition
Statistics is the science of collecting, analyzing, presenting, and interpreting
data. The branch of mathematics that deals with the relationships among
groups of measurements and with the relevance of similarities and
differences in those relationships is known as statistics.

Descriptive statistics
Descriptive statistics are tabular, graphical, and numerical summaries of
data. The purpose of descriptive statistics is to facilitate the presentation and
interpretation of data. Most of the statistical presentations appearing in
newspapers and magazines are descriptive in nature.

Descriptive statistics > Tabular methods


The most commonly used tabular summary of data for a single variable is a
frequency distribution. A frequency distribution shows the number of data
values in each of several non-overlapping classes. Another tabular summary,
called a relative frequency distribution, shows the fraction, or percentage, of
data values in each class.

Descriptive statistics > Graphical methods


A number of graphical methods are available for describing data. A bar
graph is a graphical device for depicting qualitative data that have been
summarized in a frequency distribution. Labels for the categories of the
qualitative variable are shown on the horizontal axis of the graph.

Descriptive statistics > Numerical measures
A variety of numerical measures are used to summarize data. The
proportion, or percentage, of data values in each category is the primary
numerical measure for qualitative data. The mean, median, mode,
percentiles, range, variance, and standard deviation are the most commonly
used numerical measures for quantitative data.
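For instance, these measures can be computed directly with Python's statistics module; this is a minimal sketch, and the data values are hypothetical:

    import statistics

    data = [4, 8, 6, 5, 3, 8, 7]

    print(statistics.mean(data))      # arithmetic mean
    print(statistics.median(data))    # middle value
    print(statistics.mode(data))      # most frequent value
    print(statistics.variance(data))  # sample variance
    print(statistics.stdev(data))     # sample standard deviation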

Descriptive statistics > Numerical measures > Outliers


Sometimes data for a variable will include one or more values that appear
unusually large or small and out of place when compared with the other data
values. These values are known as outliers and often have been erroneously
included in the data set.

Descriptive statistics > Numerical measures > Exploratory data analysis

Exploratory data analysis provides a variety of tools for quickly summarizing and gaining insight about a set of data. Two such methods are the five-number summary and the box plot. A five-number summary simply consists of the smallest data value, the first quartile, the median, the third quartile, and the largest data value.
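As an illustration, the five-number summary is easy to compute; a minimal Python sketch with hypothetical data:

    import statistics

    data = [12, 15, 7, 22, 18, 9, 30, 14, 11, 25]

    # quantiles(n=4) returns the three quartile cut points
    q1, median, q3 = statistics.quantiles(data, n=4)
    five_number = (min(data), q1, median, q3, max(data))
    print(five_number)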

Probability
Probability is a subject that deals with uncertainty. In everyday terminology,
probability can be thought of as a numerical measure of the likelihood that a
particular event will occur. Probability values are assigned on a scale from 0
to 1, with values near 0 indicating that an event is unlikely to occur and
those near 1 indicating that an event is likely to take place.

Probability > Events and their probabilities
Oftentimes probabilities need to be computed for related events. For
instance, advertisements are developed for the purpose of increasing sales of
a product. If seeing the advertisement increases the probability of a person
buying the product, the events “seeing the advertisement” and “buying the
product” are said to be dependent.

Probability > Random variables and probability distributions

A random variable is a numerical description of the outcome of a statistical experiment. A random variable that may assume only a finite number or an infinite sequence of values is said to be discrete; one that may assume any value in some interval on the real number line is said to be continuous.

Probability > Special probability distributions > The normal distribution

The most widely used continuous probability distribution in statistics is the normal probability distribution. The graph of a normal distribution is the familiar bell-shaped curve.

Estimation
It is often of interest to learn about the characteristics of a large group of
elements such as individuals, households, buildings, products, parts, customers, and so on. All the elements of interest in a particular study form
the population. Because of time, cost, and other considerations, data often
cannot be collected from every element of the population.

Estimation > Sampling and sampling distributions


Although sample survey methods will be discussed in more detail below in
the section sample survey methods, it should be noted here that the methods
of statistical inference, and estimation in particular, are based on the notion
that a probability sample has been taken.

Estimation > Estimation of a population mean


The most fundamental point and interval estimation process involves the estimation of a population mean. Suppose it is of interest to estimate the population mean, μ, for a quantitative variable. Data collected from a simple random sample can be used to compute the sample mean, x̄, where the value of x̄ provides a point estimate of μ.
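As a minimal sketch of point and interval estimation (the sample values are hypothetical, and the 1.96 multiplier assumes a normal approximation; a t value would be more appropriate for so small a sample):

    import statistics

    sample = [72, 68, 75, 71, 69, 74, 70, 73]
    n = len(sample)
    x_bar = statistics.mean(sample)  # point estimate of the population mean
    s = statistics.stdev(sample)     # sample standard deviation
    margin = 1.96 * s / n ** 0.5     # approximate 95% margin of error
    print(x_bar, x_bar - margin, x_bar + margin)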

Estimation > Estimation of other parameters


For qualitative variables, the population proportion is a parameter of interest.
A point estimate of the population proportion is given by the sample
proportion. With knowledge of the sampling distribution of the sample
proportion, an interval estimate of a population proportion is obtained in
much the same fashion as for a population mean.

Estimation > Estimation procedures for two populations
The estimation procedures can be extended to two populations for
comparative studies. For example, suppose a study is being conducted to
determine differences between the salaries paid to a population of men and a
population of women.

Hypothesis testing
Hypothesis testing is a form of statistical inference that uses data from a sample to draw conclusions about a population parameter or a population probability distribution. First, a tentative assumption is made about the parameter or distribution; this assumption is called the null hypothesis and is denoted by H0.

Bayesian methods
The methods of statistical inference previously described are often referred
to as classical methods. Bayesian methods (so called after the English
mathematician Thomas Bayes) provide alternatives that allow one to combine
prior information about a population parameter with information contained
in a sample to guide the statistical inference process.

Experimental design
Data for statistical studies are obtained by conducting either experiments or
surveys. Experimental design is the branch of statistics that deals with the
design and analysis of experiments. The methods of experimental design are
widely used in the fields of agriculture, medicine, biology, marketing
research, and industrial production.

Experimental design > Analysis of variance and significance testing
A computational procedure frequently used to analyze the data from an
experimental study employs a statistical procedure known as the analysis of
variance. For a single factor experiment, this procedure uses a hypothesis
test concerning equality of treatment means to determine if the factor has a
statistically significant effect on the response variable.

Experimental design > Regression and correlation analysis
Regression analysis involves identifying the relationship between a
dependent variable and one or more independent variables. A model of the
relationship is hypothesized, and estimates of the parameter values are used
to develop an estimated regression equation. Various tests are then
employed to determine if the model is satisfactory.

Experimental design > Regression and correlation analysis > Regression model

In simple linear regression, the model used to describe the relationship between a single dependent variable y and a single independent variable x is y = β0 + β1x + ε. β0 and β1 are referred to as the model parameters, and ε is a probabilistic error term that accounts for the variability in y that cannot be explained by the linear relationship with x.

Experimental design > Regression and correlation analysis > Least squares method

Either a simple or a multiple regression model is initially posed as a
hypothesis concerning the relationship among the dependent and
independent variables. The least squares method is the most widely used
procedure for developing estimates of the model parameters.
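For simple linear regression, the least squares estimates have a closed form; a minimal Python sketch with hypothetical data points:

    def least_squares(x, y):
        # Least squares estimates b0, b1 for the model y = b0 + b1*x.
        n = len(x)
        mean_x, mean_y = sum(x) / n, sum(y) / n
        b1 = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) \
            / sum((a - mean_x) ** 2 for a in x)
        b0 = mean_y - b1 * mean_x
        return b0, b1

    x = [1, 2, 3, 4, 5]
    y = [2.1, 3.9, 6.2, 7.8, 10.1]
    b0, b1 = least_squares(x, y)
    print(f"estimated equation: y = {b0:.2f} + {b1:.2f}x")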

Experimental design > Regression and correlation analysis > Analysis of variance and goodness of fit


A commonly used measure of the goodness of fit provided by the estimated
regression equation is the coefficient of determination. Computation of this
coefficient is based on the analysis of variance procedure that partitions the
total variation in the dependent variable.

Experimental design > Regression and correlation analysis > Significance testing


In a regression study, hypothesis tests are usually conducted to assess the
statistical significance of the overall relationship represented by the
regression model and to test for the statistical significance of the individual
parameters.

Experimental design > Regression and correlation analysis > Residual analysis


The analysis of residuals plays an important role in validating the regression model. If the error term in the regression model satisfies the four assumptions noted earlier, then the model is considered valid.

Experimental design > Regression and correlation analysis > Model building


In regression analysis, model building is the process of developing a probabilistic model that best describes the relationship between the dependent and independent variables. The major issues are finding the proper form (linear or curvilinear) of the relationship and selecting which independent variables to include.

Experimental design > Regression and correlation analysis > Correlation


Correlation and regression analysis are related in the sense that both deal
with relationships among variables. The correlation coefficient is a measure
of linear association between two variables. Values of the correlation
coefficient are always between -1 and +1.
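The correlation coefficient can be computed directly from its definition; a minimal Python sketch with hypothetical data:

    def pearson_r(x, y):
        # r = covariance(x, y) / (stdev(x) * stdev(y))
        n = len(x)
        mx, my = sum(x) / n, sum(y) / n
        cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
        sx = sum((a - mx) ** 2 for a in x) ** 0.5
        sy = sum((b - my) ** 2 for b in y) ** 0.5
        return cov / (sx * sy)

    print(pearson_r([1, 2, 3, 4, 5], [2, 4, 5, 4, 5]))  # ~0.775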

Time series and forecasting


A time series is a set of data collected at successive points in time or over successive periods of time. A sequence of monthly data on new housing starts and a sequence of weekly data on product sales are examples of time series. Usually the data in a time series are collected at equally spaced periods of time, such as an hour, day, week, month, or year.

Nonparametric methods
The statistical methods discussed above generally focus on the parameters of
populations or probability distributions and are referred to as parametric
methods. Nonparametric methods are statistical methods that require fewer
assumptions about a population or probability distribution and are applicable
in a wider range of situations.

Statistical quality control


Statistical quality control refers to the use of statistical methods in the
monitoring and maintaining of the quality of products and services. One
method, referred to as acceptance sampling, can be used when a decision
must be made to accept or reject a group of parts or items based on the
quality found in a sample.

Statistical quality control > Acceptance sampling


Assume that a consumer receives a shipment of parts called a lot from a
producer. A sample of parts will be taken and the number of defective items
counted. If the number of defective items is low, the entire lot will be
accepted. If the number of defective items is high, the entire lot will be
rejected.
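Under the assumption that each sampled part is independently defective with probability p, the acceptance probability follows a binomial model; a minimal Python sketch (the sample size, acceptance number, and defect rate are hypothetical):

    from math import comb

    def accept_probability(n, c, p):
        # P(at most c defectives in a sample of n) under a binomial(n, p) model
        return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(c + 1))

    # Sample 20 parts; accept the lot if at most 1 defective is found.
    print(accept_probability(n=20, c=1, p=0.05))  # ~0.74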

Statistical quality control > Statistical process control


Statistical process control uses sampling and statistical methods to monitor
the quality of an ongoing process such as a production operation. A
graphical display referred to as a control chart provides a basis for deciding
whether the variation in the output of a process is due to common causes (randomly occurring variations).

Sample survey methods


As noted above in the section Estimation, statistical inference is the process
of using data from a sample to make estimates or test hypotheses about a
population. The field of sample survey methods is concerned with effective
ways of obtaining sample data. The three most common types of sample
surveys are
• Mail surveys
• Telephone surveys
• Personal interviews

Decision analysis
Decision analysis, also called statistical decision theory, involves procedures
for choosing optimal decisions in the face of uncertainty. In the simplest
situation, a decision maker must choose the best decision from a finite set of
alternatives when there are two or more possible future events, called states
of nature, that might occur.

INTRODUCTION

Business is an organized approach to providing customers with the goods and services they want. The word business also refers to an organization that
provides these goods and services. Most businesses seek to make a profit; that is, they aim to achieve revenues that exceed the costs of operating the
business. Prominent examples of for-profit businesses include Mitsubishi
Group, General Motors Corporation, and Royal Dutch/Shell Group.
However, some businesses only seek to earn enough to cover their operating
costs. Commonly called nonprofits, these organizations are primarily
nongovernmental service providers. Examples of nonprofit businesses
include such organizations as social service agencies, foundations, advocacy
groups, and many hospitals.

Business plays a vital role in the life and culture of countries with industrial
and postindustrial (service and information based) free market economies
such as the United States. In free market systems, prices and wages are
primarily determined by competition, not by governments. In the United
States, for example, many people buy and sell goods and services as their
primary occupations. In 2001 American companies sold in excess of $10
trillion worth of goods and services. Businesses provide just about anything
consumers want or need, including basic necessities such as food and
housing, luxuries such as whirlpool baths and wide screen televisions, and
even personal services such as caring for children and finding
companionship.

TYPES OF BUSINESS
There are many types of businesses in a free market economy. The three
most common are
• Manufacturing firms
• Merchandisers
• Service enterprises

Manufacturing firms
Manufacturing firms produce a wide range of products. Large manufacturers
include producers of airplanes, cars, computers, and furniture. Many
manufacturing firms construct only parts rather than complete, finished
products. These suppliers are usually smaller manufacturing firms, which
supply parts and components to larger firms. The larger firms then assemble
final products for market to consumers. For example, suppliers provide
many of the components in personal computers, automobiles, and home
appliances to large firms that create the finished or end products. These
larger end product manufacturers are often also responsible for marketing
and distributing the products. The advantage that large businesses have in being able to efficiently and inexpensively control many parts of a production process is known as economies of scale. But small manufacturing firms may
work best for producing certain types of finished products. Smaller end-
product firms are common in the food industry and among artisan trades
such as custom cabinetry.

Merchandisers
Merchandisers are businesses that help move goods through a channel of distribution, that is, the route goods take in reaching the consumer.
Merchandisers may be involved in wholesaling or retailing, or sometimes
both.
A wholesaler is a merchandiser who purchases goods and then sells them to
buyers, typically retailers, for the purpose of resale. A retailer is a
merchandiser who sells goods to consumers. A wholesaler often purchases
products in large quantities and then sells smaller quantities of each product
to retailers who are unable to either buy or stock large amounts of the
product. Wholesalers operate somewhat like large, end product
manufacturing firms, benefiting from economies of scale. For example, a
wholesaler might purchase 5,000 pairs of work gloves and then sell 100
pairs to 50 different retailers. Some large American discount chains, such as Kmart Corporation and Wal-Mart Stores, Inc., serve as their own wholesalers; these companies go directly to factories and other
manufacturing outlets, buy in large amounts, and then warehouse and ship
the goods to their stores.
The division between retailing and wholesaling is now being blurred by new
technologies that allow retailing to become an economy of scale. Telephone
and computer communications allow retailers to serve far greater numbers of
customers in a given span of time than is possible in face to face interactions
between a consumer and a retail salesperson. Computer networks such as the
Internet, because they do not require any physical communication between
salespeople and customers, allow a nearly unlimited capacity for sales
interactions, known as 24/7; that is, the Internet site can be open for
transaction 24 hours a day, seven days a week and for as many transactions
as the network can handle. For example, a typical transaction to purchase a
pair of shoes at a shoe store may take a half-hour from browsing, to fitting, to the transaction with a cashier. But a customer can purchase a pair of shoes
through a computer interface with a retailer in a matter of seconds.
Computer technology also provides retailers with another economy of scale
through the ability to sell goods without opening any physical stores, often
referred to as electronic commerce or e-commerce. Retailers that provide
goods entirely through Internet transactions do not incur the expense of
building so-called brick-and-mortar stores or the expense of maintaining
them.

Service enterprises
Service enterprises include many kinds of businesses. Examples include dry
cleaners, shoe repair stores, barbershops, restaurants, ski resorts, hospitals,
and hotels. In many cases service enterprises are moderately small because
they do not have mechanized services and limit service to only as many
individuals as they can accommodate at one time. For example, a waiter may
be able to provide good service to four tables at once, but with five or more
tables, customer service will suffer.
In recent years the number of service enterprises in wealthier free market
economies has grown rapidly, and spending on services now accounts for a
significant percentage of all spending. By the late 1990s, private services
accounted for more than 21 percent of U.S. spending. Wealthier nations
have developed postindustrial economies, where entertainment and
recreation businesses have become more important than most raw material
extraction such as the mining of mineral ores and some manufacturing
industries in terms of creating jobs and stimulating economic growth. Many
of these industries have moved to developing nations, especially with the
rise of large multinational corporations. As postindustrial economies have accumulated wealth, they have come to support systems of leisure, in which
people are willing to pay others to do things for them. In the United States,
vast numbers of people work rigid schedules for long hours in indoor
offices, stores, and factories. Many employers pay high enough wages so
that employees can afford to balance their work schedules with purchased
recreation. People in the United States, for example, support thriving travel,
theme park, resort, and recreational sport businesses.

Overview
Chi-square is a nonparametric test of statistical significance for bivariate tabular analysis (also known as cross-break analysis). Any appropriately performed test of statistical significance lets you know the degree of confidence you can have in accepting or rejecting a hypothesis. Typically, the hypothesis tested with chi-square is whether or not two different samples (of people, texts, etc.) are different enough in some characteristic or aspect of their behavior that we can generalize from our samples that the populations from which our samples are drawn are also different in that behavior or characteristic. A nonparametric test like chi-square is a rough estimate of confidence; it accepts weaker, less accurate data as input than parametric tests (like t-tests and analysis of variance) and therefore has less status in the pantheon of statistical tests. Nonetheless, its limitations are also its strengths; because chi-square is more forgiving in the data it will accept, it can be used in a wide variety of research contexts.

Chi-square is used most frequently to test the statistical significance of results reported in bivariate tables, and interpreting bivariate tables is integral to interpreting the results of a chi-square test, so we'll take a look at bivariate tabular (cross-break) analysis.

Bivariate Tabular Analysis


Bivariate tabular (cross-break) analysis is used when you are trying to
summarize the intersections of independent and dependent variables and
understand the relationship (if any) between those variables. For instance, if
we wanted to know if there is any relationship between the biological sex of
people of Mir Pur Bathoro and their footwear preferences, we might select 50 males and 50 females as randomly as possible, and ask them, "On average, do you prefer to wear sandals, sneakers, leather shoes, boots, or something else?" using the following model form:

Name:
Sex:
Age:
Zodiac:
Occupation:
Footwear Choice:
Sandals
Sneakers
Leather shoes
Boots
Others

In this case study, our independent variable is biological sex. (In experimental research the independent variable is actively manipulated by the researcher: for example, whether or not a rat gets a food pellet when it pulls on a striped bar. In most sociological research, the independent variable is not actively manipulated in this way, but controlled by sampling for, e.g., males vs. females.) Put another way, the independent variable is the quality or characteristic that you hypothesize helps to predict or explain some other quality or characteristic (the dependent variable). We control the independent variable (and as much else as possible and natural) and elicit and measure the dependent variable to test our hypothesis that there is some relationship between them. Bivariate tabular analysis is good for asking questions of the following kinds:

1. Is there a relationship between any two variables IN THE DATA?
2. How strong is the relationship IN THE DATA?
3. What is the direction and shape of the relationship IN THE DATA?
4. Is the relationship due to some intervening variable(s) IN THE DATA?

To see any patterns or systematic relationship between the biological sex of people of Mir Pur Bathoro, Thatta and their reported footwear preferences, we could summarize our results in a table like this:

Table 1. Male & Female of Mir Pur Bathoro's Footwear Preferences

          Sandals   Sneakers   Leather Shoes   Boots   Others
Male
Female

Depending upon how our 50 male and 50 female subjects responded, we could make a definitive claim about the (reported) footwear preferences of those 100 people.

In constructing bivariate tables, values on the independent variable are typically arrayed on the vertical axis, while values on the dependent variable are arrayed on the horizontal axis. This allows us to read across from
hypothetically causal values on the independent variable to their effects, or
values on the dependent variable. How we arrange the values on each axis
should be guided by our research question/hypothesis. For
example, if values on an independent variable were arranged from lowest to
highest value on the variable and values on the dependent variable were
arranged left to right from lowest to highest, a positive relationship would
show up as a rising left to right line. (But remember, association does not
equal causation: an observed relationship between two variables is not
necessarily causal).
Each intersection/cell of a value on the independent variable and a value on the dependent variable reports the result of how many times that combination of values was chosen/observed in the sample being analyzed. (So we can see that cross tabs are structurally most suitable for analyzing relationships between nominal and ordinal variables. Interval and ratio variables will have to be grouped first before they can "fit" into a bivariate table.) Each cell reports, essentially, how many subjects/observations produced that combination of independent and dependent variable values. So, for example, the top left cell of the table above answers the question: "How many males in Mir Pur Bathoro prefer sandals?"

Table 2. Male & Female of Mir Pur Bathoro's Footwear Preferences

          Sandals   Sneakers   Leather Shoes   Boots   Others
Male         6         17            13           9       5
Female      13          5             7          16       9

Reporting and interpreting cross tabs is most easily done by converting raw frequencies (in each cell) into percentages within the values/categories of the independent variable. For example, in the footwear preferences table above, total each row, then divide each cell by its row total, and multiply that fraction by 100.
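A minimal Python sketch of this row-percentage conversion, using the observed counts from Table 2:

    rows = {"Male": [6, 17, 13, 9, 5], "Female": [13, 5, 7, 16, 9]}

    for sex, counts in rows.items():
        total = sum(counts)
        percentages = [round(100 * c / total) for c in counts]
        print(sex, percentages, "N =", total)
    # Male [12, 34, 26, 18, 10] N = 50
    # Female [26, 10, 14, 32, 18] N = 50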

Table 3. Male & Female of Mir Pur Bathoro's Footwear Preferences (Percentages)

          Sandals   Sneakers   Leather Shoes   Boots   Others    N
Male        12         34            26          18      10      50
Female      26         10            14          32      18      50

[Bar charts: male and female footwear preference percentages (ratio) for sandals, sneakers, leather shoes, boots, and others.]

Percentages basically standardize cell frequencies as if there were 100 subjects/observations in each category of the independent variable. This is useful for comparing across values on the independent variable, but that usefulness comes at the price of a generalization from the actual number of subjects/observations in that column in your data to a hypothetical 100 subjects/observations. If the raw row total was 93, then this generalization (on no statistical basis, i.e., with no knowledge of sample-population representativeness) is drastic. So we should provide the total N at the end of each row/independent variable category (for reliability and to enable the reader to assess our interpretation of the table's meaning).
With this limitation in mind, we can compare the patterns of distribution of
subjects/observations along the dependent variable between the values of the
independent variable: e.g., compare male and female of Mir Pur Bathoro footwear preferences. (For some data, plotting the results on a line graph can also help you interpret the results: i.e., whether there is a positive (/), negative (\), or curvilinear (∨, ∧) relationship between the variables.)
Table 3 shows that within our sample, roughly twice as many females preferred sandals and boots as males; and within our sample, about three times as many men preferred sneakers as women, and twice as many men preferred leather shoes. We might also infer from the "Other" category that
females within our sample had a broader range of footwear preferences than
did males.

Generalizing from samples to populations

Converting raw observed values or frequencies into percentages allows us to see patterns in the data more easily, but that is all we can see: what is in the data. Knowing with great certainty the footwear preferences of a particular group of 100 males and females of taluka Mir Pur Bathoro is of limited use to us; we usually want to measure a sample in order to know something about the larger population from which the sample was drawn. On the basis of raw observed frequencies (or percentages) of a sample's behavior or characteristics, we can make claims about the sample itself, but we cannot generalize to make claims about the population from which we drew our sample unless we submit our results to a test of statistical significance. A test of statistical significance tells us how confidently we can generalize to a larger (unmeasured) population from a (measured) sample of that population.

How does chi-square do this?


Basically, the chi-square test of statistical significance is a series of
mathematical formulae, which compare the actual observed frequencies of
some phenomenon (in our sample) with the frequencies we would expect if
there were no relationship at all between the two variables in the larger
(sampled) population. That is, chi-square tests our actual results against the
null hypothesis and assesses whether the actual results are different enough
to overcome a certain probability that they are due to sampling error. In a
sense, chi-square is a lot like percentages; it extrapolates a population
characteristic (a parameter) from the sampling characteristic (a statistic)
similarly to the way percentage standardizes a frequency to a total column N
of 100. But chi-square works within the frequencies provided by the sample
and does not inflate (or minimize) the column and row totals.

Chi-square requirements

As mentioned before, chi-square is a nonparametric test. It does not require the sample data to be more or less normally distributed (as parametric tests like t-tests do), although it relies on the assumption that the variable is normally distributed in the population from which the sample is drawn. But chi-square, while forgiving, does have some requirements:

1. The sample must be randomly drawn from the population;
2. Data must be reported in raw frequencies (not percentages);
3. Measured variables must be independent;
4. Values/categories on independent and dependent variables must be mutually exclusive and exhaustive;
5. Observed frequencies cannot be too small.

1. As with any test of statistical significance, our data must be from a random sample of the population to which we wish to generalize our claims.

2. We should only use chi-square when our data are in the form of raw frequency counts of things in two or more mutually exclusive and exhaustive categories. As discussed above, converting raw frequencies into percentages standardizes cell frequencies as if there were 100 subjects/observations in each category of the independent variable for comparability. Part of the chi-square mathematical procedure accomplishes this standardizing, so computing the chi-square of percentages would amount to standardizing an already standardized measurement.

3. Any observation must fall into only one category or value on each variable. In our footwear example, our data are counts of males versus females expressing a preference for five different categories of footwear. Each observation/subject is counted only once, as male or female (an exhaustive typology of biological sex) and as preferring sandals, sneakers, leather shoes, boots, or other kinds of footwear. For some variables, no "other" category may be needed, but often "other" ensures that the variable has been exhaustively categorized. (For some kinds of analysis, we may need to include an "un-codable" category.) In any case, we must include the results for the whole sample.
4. Furthermore, we should use chi-square only when observations are independent: i.e., no category or response is dependent upon or influenced by another. (In linguistics, this rule is often fudged a bit. For example, if we have one dependent variable/column for linguistic feature X and another column for number of words spoken or written (where the rows correspond to individual speakers/texts or groups of speakers/texts which are being compared), there is clearly some relation between the frequency of feature X in a text and the number of words in a text, but it is a distant, not immediate, dependency.)

5. Chi-square is an approximate test of the probability of getting the frequencies we've actually observed if the null hypothesis were true. It is based on the expectation that within any category, sample frequencies are normally distributed about the expected population value. Since (logically) frequencies cannot be negative, the distribution cannot be normal when expected population values are close to zero, since the sample frequencies cannot be much below the expected frequency while they can be much above it (an asymmetric/non-normal distribution). So, when expected frequencies are large, there is no problem with the assumption of normal distribution, but the smaller the expected frequencies, the less valid are the results of the chi-square test. We'll discuss below how expected frequencies are derived from observed frequencies. For now, note that if we have cells in our bivariate table which show very low raw observed frequencies (5 or below), our expected frequencies may also be too low for chi-square to be appropriately used. In addition, because some of the mathematical formulas used in chi-square use division, no cell in your table can have an observed raw frequency of 0.

The following minimum frequency thresholds should be obeyed:

• For a 1×2 or a 2×2 table, expected frequencies in each cell should be at least 5;
• For a 2×3 table, expected frequencies should be at least 2;
• For a 2×4 or 3×3 or larger table, if all expected frequencies but one are at least 5 and if the one small cell is at least 1, chi-square is still a good approximation. In general, the greater the degrees of freedom (i.e., the more values/categories on the independent and dependent variables), the more lenient the minimum expected frequency threshold.

Collapsing values
A brief word about collapsing values/categories on a variable is necessary. First, although categories on a variable, especially a dependent variable, may be collapsed, they cannot be excluded from a chi-square analysis. That is, we cannot arbitrarily exclude some subset of our data from our analysis. Second, a decision to collapse categories should be carefully motivated, with consideration for preserving the integrity of the data as they were originally collected. (For example, how could we collapse the footwear preference categories in our example and still preserve the integrity of the original question/data? We can't, since there's no way to know if combining, e.g., boots and leather shoes versus sandals and sneakers is true to our subjects' typology of footwear.) As a rule, we should perform a chi-square on the data in their un-collapsed form; if the chi-square value achieved is significant, then we may collapse categories to test subsequent refinements of our original hypothesis.

Computing chi-square

Let's walk through the process by which a chi-square value is computed, using Table 2 above.

The first step is to determine our threshold of tolerance for error. That is, what odds are we willing to accept that we are wrong in generalizing from the results in our sample to the population it represents? Are we willing to stake a claim on a 50% chance that we're wrong? Less? The answer depends largely on our research question and the consequences of being wrong. If people's lives depend on our interpretation of our results, we might want to take only 1 chance in 100,000 (or 1,000,000) that we're wrong. But if the stakes are smaller, for example, whether or not two texts use the same frequencies of some linguistic feature (assuming this is not a forensic issue in a capital murder case!), we might accept a greater probability, 1 in 100 or even 1 in 20, that our data do not represent the population we're generalizing about. The important thing is to explicitly motivate our threshold before we perform any test of statistical significance, to minimize any temptation for post hoc compromise of scientific standards. For our purposes, we'll set a probability of error threshold of 1 in 20, or p < .05, for our footwear study.

Table 4. Male & Female of Mir Pur Bathoro's Footwear Preferences, Observed Frequencies

          Sandals   Sneakers   Leather Shoes   Boots   Others   Total
Male         6         17            13           9       5       50
Female      13          5             7          16       9       50
Total       19         22            20          25      14      100

[Bar chart: observed male and female counts across the five footwear choices.]

Remember that chi-square operates by comparing the actual, or observed, frequencies in each cell in the table to the frequencies we would expect if there were no relationship at all between the two variables in the population from which the sample is drawn. In other words, chi-square compares what actually happened to what hypothetically would have happened if all other things were equal (the null hypothesis). If our actual results are sufficiently different from the predicted null hypothesis results, we can reject the null hypothesis and claim that a statistically significant relationship exists between our variables.
Chi-square derives a representation of the null hypothesis, the "all other things being equal" scenario, in the following way. The expected frequency in each cell is the product of that cell's row total multiplied by that cell's column total, divided by the sum total of all observations. So, to derive the expected frequency of the "Males who prefer Sandals" cell, we multiply the top row total (50) by the first column total (19) and divide that product by the sum total (100): (50 × 19)/100 = 9.5. The logic of this is that we are deriving the expected frequency of each cell from the union of the total frequencies of the relevant values on each variable (in this case, Male and Sandals), as a proportion of all observed frequencies (across all values of each variable). This calculation is performed to derive the expected frequency of each cell, as shown in Table 5 below (the computation for each cell is listed below Table 5).
Table 5. Observed and Expected Frequencies

                   Sandals   Sneakers   Leather Shoes   Boots   Others   Total
Male Observed         6         17            13           9       5       50
Male Expected        9.5        11            10          12.5     7
Female Observed      13          5             7          16       9       50
Female Expected      9.5        11            10          12.5     7
Total                19         22            20          25      14      100

Expected value = (cell's column total) × (cell's row total) / (sum total of all observations)

Male/Sandals: (19 × 50)/100 = 9.5
Male/Sneakers: (22 × 50)/100 = 11
Male/Leather Shoes: (20 × 50)/100 = 10
Male/Boots: (25 × 50)/100 = 12.5
Male/Other: (14 × 50)/100 = 7
Female/Sandals: (19 × 50)/100 = 9.5
Female/Sneakers: (22 × 50)/100 = 11
Female/Leather Shoes: (20 × 50)/100 = 10
Female/Boots: (25 × 50)/100 = 12.5
Female/Other: (14 × 50)/100 = 7
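The same derivation can be expressed programmatically; a minimal Python sketch using the observed counts:

    observed = [[6, 17, 13, 9, 5],   # Male
                [13, 5, 7, 16, 9]]   # Female

    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)

    # expected cell = (row total * column total) / grand total
    expected = [[r * c / grand_total for c in col_totals] for r in row_totals]
    print(expected)  # [[9.5, 11.0, 10.0, 12.5, 7.0], [9.5, 11.0, 10.0, 12.5, 7.0]]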

[Bar charts: expected male and female frequencies by footwear choice (1 = Sandals, 2 = Sneakers, 3 = Leather shoes, 4 = Boots, 5 = Others).]
As we originally obtained a balanced male/female sample, our male and female expected scores are the same. This usually will not be the case. We now have a comparison of the observed results versus the results we would expect if the null hypothesis were true. We can informally analyze this table, comparing observed and expected frequencies in each cell (males prefer sandals less than expected), across values on the independent variable (males prefer sneakers more than expected, females less than expected), or across values on the dependent variable (females prefer sandals and boots more than expected, but sneakers and leather shoes less than expected). But so far, the extra computation doesn't really add much more information than interpretation of the results in percentage form. We need some way to measure how different our observed results are from the null hypothesis. Or, to put it another way, we need some way to determine whether we can reject the null hypothesis, and if we can, with what degree of confidence that we're not making a mistake in generalizing from our sample results to the larger population.

Logically, we need to measure the size of the difference between the pair of observed and expected frequencies in each cell. More specifically, we calculate the difference between the observed and expected frequency in each cell, square that difference, and then divide the squared difference by the expected frequency. The formula can be expressed as (O - E)^2 / E,

where O is the observed frequency, E is the expected frequency, and ^ denotes exponentiation.

Squaring the difference ensures a positive number, so that we end up with an absolute value of differences. If we didn't work with absolute values, the positive and negative differences across the entire table would always add up to 0. (You really understand the logic of chi-square if you can figure out why this is true.) Dividing the squared difference by the expected frequency essentially removes the expected frequency from the equation, so that the remaining measures of observed/expected difference are comparable across all cells.
So, for example, the difference between observed and expected frequencies for the Male/Sandals preference is calculated as follows:

Observed (6) minus Expected (9.5) = -3.5
Difference (-3.5) squared = 12.25
Difference squared (12.25) divided by Expected (9.5) = 1.289

The sum of all products of this calculation on each cell is the total chi-square value for the table. The computation of chi-square for each cell is listed below Table 6.

Table 6. Male and Female of Mir Pur Bathoro's Footwear Preferences: Observed and Expected Frequencies & Chi-Square

                   Sandals   Sneakers   Leather Shoes   Boots   Others   Total
Male Observed         6         17            13           9       5       50
Male Expected        9.5        11            10          12.5     7
Female Observed      13          5             7          16       9       50
Female Expected      9.5        11            10          12.5     7
Total                19         22            20          25      14      100

Chi-square value per cell, (O - E)^2 / E:

Male/Sandals: (6 - 9.5)^2 / 9.5 = 1.289
Male/Sneakers: (17 - 11)^2 / 11 = 3.273
Male/Leather Shoes: (13 - 10)^2 / 10 = 0.900
Male/Boots: (9 - 12.5)^2 / 12.5 = 0.980
Male/Other: (5 - 7)^2 / 7 = 0.571
Female/Sandals: (13 - 9.5)^2 / 9.5 = 1.289
Female/Sneakers: (5 - 11)^2 / 11 = 3.273
Female/Leather Shoes: (7 - 10)^2 / 10 = 0.900
Female/Boots: (16 - 12.5)^2 / 12.5 = 0.980
Female/Other: (9 - 7)^2 / 7 = 0.571

Total chi-square value = sum of the cell values = 14.026

The total chi-square value for Table 4 is 14.026.
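The whole computation fits in a few lines; a minimal Python sketch. (Summing the exact, unrounded cell values gives 14.027; the 14.026 above comes from rounding each cell to three decimals before summing.)

    observed = [[6, 17, 13, 9, 5],
                [13, 5, 7, 16, 9]]
    expected = [[9.5, 11, 10, 12.5, 7],
                [9.5, 11, 10, 12.5, 7]]

    # chi-square = sum over all cells of (O - E)^2 / E
    chi_square = sum((o - e) ** 2 / e
                     for obs_row, exp_row in zip(observed, expected)
                     for o, e in zip(obs_row, exp_row))
    print(round(chi_square, 3))  # 14.027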

Interpreting the chi-square value

We now need some criterion or yardstick against which to measure the table's chi-square value, to know whether or not it is significant. What we need to know is the probability of getting a chi-square value of a minimum given size even if our variables are not related at all in the larger population from which our sample was drawn. That is, we need to know how much larger than 0 (the absolute chi-square value of the null hypothesis) our table's chi-square value must be before we can confidently reject the null hypothesis. The probability we seek depends in part on the degrees of freedom of the table from which our chi-square value is derived.

Degrees of freedom
Mechanically, a table's degrees of freedom (df) can be expressed by the following formula:

df = (r - 1)(c - 1)

That is, a table's degrees of freedom equals the number of rows in the table minus one, multiplied by the number of columns in the table minus one. (For 1×2 tables: df = k - 1, where k = number of values/categories on the variable.) Degrees of freedom are an issue because of the way in which expected values in each cell are computed from the row and column totals of each cell. All but one of the expected values in a given row or column are free to vary (within the total observed, and therefore expected, frequency of that row or column); once the free-to-vary expected cells are specified, the last one is fixed by virtue of the fact that the expected frequencies must add up to the observed row and column totals (from which they are derived).

          Sandals   Sneakers   Leather Shoes   Boots   Others
Male
Female

df = (rows - 1) × (columns - 1) = (2 - 1) × (5 - 1) = 1 × 4 = 4

The sampling distribution of chi-square (also known as the critical values of chi-square) is typically listed in an appendix of a statistics book. We read down the column representing our previously chosen probability of error threshold (e.g., p < .05) and across the row representing the degrees of freedom in our table. If our chi-square value is larger than the critical value in that cell, our data present a statistically significant relationship between the variables in our table.

Table 4's chi-square value of 14.026, with 4 degrees of freedom, handily clears the related critical value of 9.49, so we can reject the null hypothesis and affirm the claim that the males and females of Mir Pur Bathoro differ in their (self-reported) footwear preferences.
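Instead of a printed appendix, the critical value and the exact probability can be obtained from a statistics library; a minimal sketch assuming scipy is available:

    from scipy.stats import chi2

    chi_square, df = 14.026, 4
    critical = chi2.ppf(1 - 0.05, df)  # critical value for p < .05, df = 4: ~9.49
    p_value = chi2.sf(chi_square, df)  # ~0.0072, well under our .05 threshold
    print(chi_square > critical)       # True: reject the null hypothesis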
Statistical significance does not help us to interpret the nature or explanation of that relationship; that must be done by other means (including bivariate tabular analysis and qualitative analysis of the data). But a statistically significant chi-square value denotes the degree of confidence with which we may hold that the relationship between variables described in our results is systematic in the larger population and not attributable to random error. Statistical significance also does not ensure substantive significance. A large enough sample may demonstrate a statistically significant relationship between two variables, but that relationship may be a trivially weak one. Statistical significance means only that the pattern of distribution and relationship between variables which is found in the data from a sample can be confidently generalized to the larger population from which the sample was randomly drawn. By itself, it does not ensure that the relationship is theoretically or practically important, or even very large.

Measures of Association
While the theoretical or practical importance of a statistically significant result cannot be quantified, the relative magnitude of a statistically significant relationship can be measured. Chi-square allows us to make decisions about whether there is a relationship between two or more variables; if the null hypothesis is rejected, we conclude that there is a statistically significant relationship between the variables. But we frequently want a measure of the strength of that relationship, an index of degree of correlation, a measure of the degree of association between the variables represented in our table (and data). Luckily, several related measures of association can be derived from a table's chi-square value.

For tables larger than 2×2 (like our Table 4), a measure called Cramer's phi is derived by the following formula (where N = the total number of observations, and k = the smaller of the number of rows or columns):

Cramer's phi = the square root of (chi-square divided by (N × (k - 1)))

So, for our Table 4 (2×5), we would compute Cramer's phi as follows:

N(k - 1) = 100 × (2 - 1) = 100
chi-square/100 = 14.026/100 = 0.14
square root of 0.14 = 0.37

The result is interpreted as a Pearson r (that is, as a correlation coefficient).

For 2×2 tables, a measure called phi is derived by dividing the table's chi-square value by N (the total number of observations) and then taking the square root of the result. Phi is also interpreted as a Pearson r.
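A minimal Python sketch of this measure for our table:

    from math import sqrt

    def cramers_phi(chi_square, n, k):
        # k = the smaller of the number of rows or columns
        return sqrt(chi_square / (n * (k - 1)))

    phi_c = cramers_phi(14.026, n=100, k=2)
    print(round(phi_c, 2))       # 0.37
    print(round(phi_c ** 2, 2))  # 0.14, the shared variance discussed below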

A complete account of how to interpret correlation coefficients is unnecessary for present purposes. It will suffice to say that r² is a measure called shared variance. Shared variance is the portion of the total behavior (or distribution) of the variables measured in the sample data which is accounted for by the relationship we've already detected with our chi-square. For Table 4, r² = 0.14, so approximately 14% of the total footwear preference story is explained/predicted by biological sex.

Computing a measure of association like phi or Cramer's phi is rarely done in quantitative linguistic analyses, but it is an important benchmark of just 'how much' of the phenomenon under investigation has been explained. For example, Table 4's Cramer's phi of 0.37 (r² = 0.14) means that there are one or more variables still undetected which, cumulatively, account for and predict the remaining 86% of footwear preferences. This measure, of course, doesn't begin to address the nature of the relation(s) between these variables, which is a crucial part of any adequate explanation or theory.

Conclusion
Business can be well managed and enhanced using sociological statistical methods. The case study of male and female footwear preferences helps the owner of Aashu Footwear House, Mr. Atta-ullah Khattri of taluka Mir Pur Bathoro. By stocking according to the observed male and female footwear preferences he can act on only the roughly 14% of the preference story explained by biological sex; the remaining 86% of the business management problem is hidden in other variables, which can be found, so the study can be further extended as described in the future scope.

Future Scope
As the conclusion makes clear, only 14% of the total footwear preference story is explained/predicted by biological sex, and hence the business can be managed with this statistical approach only up to that 14%; the remaining 86% of what governs footwear choice at Aashu Footwear House is still unknown. The thesis can therefore be extended by exploring the undetected variables (the unused variables on the model form, such as age and occupation, may also help) which account for the remaining 86% of footwear preference.
