
CHAPTER 4

Lesson 4: DATA MANAGEMENT

Introduction
• Statistics is the branch of science that deals with the collection, presentation, organization, analysis and interpretation of data.
• The population is the collection of all elements under consideration in a statistical inquiry. The sample is a subset of the population.
• A variable is a characteristic or attribute of the elements in a collection that can assume different values for the different elements.
• A parameter is a summary measure describing a specific characteristic of the population. A statistic is a summary measure describing a specific characteristic of the sample.

Areas in Applied Statistics
1. Descriptive statistics includes all the techniques used in organizing, summarizing and presenting the data on hand.
2. Inferential statistics includes all the techniques used in analyzing the sample data that will lead to generalizations about the population from which the sample came.

Collection of Data
• Measurement is the process of determining the value or label of the variable based on what has been observed.

Levels of Measurement
1. The ratio level of measurement has all of the following properties:
a) The numbers in the system are used to classify a person/object into distinct and nonoverlapping categories.
b) The system arranges the categories according to magnitude.
c) The system has a fixed unit of measurement representing a set size throughout the scale.
d) The system has an absolute zero.
2. The interval level of measurement satisfies only the first three properties of the ratio level.
3. The ordinal level of measurement satisfies only the first two properties of the ratio level.
4. The nominal level of measurement satisfies only the first property of the ratio level.

Classification of data according to source
• Primary data are data documented by the primary source. The data collectors themselves documented this data.
• Secondary data are data documented by a secondary source. An individual/agency other than the data collectors documented this data.

Method of collecting data
• The survey is a method of collecting data on the variable of interest by asking people questions.
• The experiment is a method of collecting data where there is a direct human intervention on the conditions that may affect the values of the variables of interest.
• The observation method is a method of collecting data on the phenomenon of interest by recording the observations made about the phenomenon as it actually happens.

Sampling and Sampling Techniques
• The target population is the population we want to study. The sampled population is the population from which we actually select the sample.
• Probability sampling is a method of selecting a sample wherein each element in the population has a known, nonzero chance of being included in the sample; otherwise it is nonprobability sampling.

Methods of Probability Sampling
1. Simple random sampling
2. Stratified sampling
3. Systematic sampling
4. Cluster sampling

Methods of Nonprobability Sampling
1. Haphazard or convenience sampling
2. Judgment or purposive sampling
3. Quota sampling

Presentation of Data
• Textual presentation of data incorporates important figures in a paragraph of text.
• Tabular presentation of data arranges figures in a systematic manner in rows and columns.
• Graphical presentation of data portrays numerical figures or relationships among variables in pictorial form.
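Three of the probability sampling methods listed in this lesson (simple random, systematic and stratified sampling) can be sketched in a few lines of Python. This is only an illustration: the population of element IDs 1 to 100, the two strata and the function names are assumptions made for the example, not part of the lesson.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Every element has the same chance of selection (no replacement)."""
    rng = random.Random(seed)
    return rng.sample(population, n)

def systematic_sample(population, n, seed=None):
    """Pick a random start among the first k elements, then every k-th element."""
    rng = random.Random(seed)
    k = len(population) // n          # sampling interval
    start = rng.randrange(k)
    return [population[start + i * k] for i in range(n)]

def stratified_sample(strata, fraction, seed=None):
    """Draw a simple random sample of the same fraction from every stratum."""
    rng = random.Random(seed)
    sample = []
    for members in strata.values():
        size = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, size))
    return sample

# Hypothetical population: element IDs 1..100, split into two strata.
population = list(range(1, 101))
strata = {"first_half": population[:50], "second_half": population[50:]}

srs = simple_random_sample(population, 10, seed=1)
systematic = systematic_sample(population, 10, seed=1)
stratified = stratified_sample(strata, 0.10, seed=1)
print(srs, systematic, stratified, sep="\n")
```

In each function every element has a known, nonzero chance of selection, which is what makes these probability sampling methods.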
Organization of Data
• Raw data are data in their original form.
• The array is an ordered arrangement of data according to magnitude.
• The frequency distribution is a way of summarizing data by showing the number of observations that belong in the different categories or classes.

Measures of Central Tendency
Measures of central tendency are descriptive measures that are used to describe the center of a set of data, arranged numerically.
1. The arithmetic mean is the most common type of average. It is the sum of all the observed values divided by the number of observations.
2. The median is the value that divides the array into two equal parts.
3. The mode is the observed value that occurs with the greatest frequency in a data set.

Measures of Location
Measures of location, or position (fractiles), are used to specify the location of specific data in relation to the rest of the sample. They divide the distribution into an equal number of parts.
1. The percentiles divide the ordered observations into 100 equal parts.
2. The quartiles divide the ordered observations into 4 equal parts.
3. The deciles divide the ordered observations into 10 equal parts.

Consider the given sets of data:
Set A: 9, 12, 13, 15, 15, 17, 24
Set B: 7, 11, 15, 15, 17, 19, 21
Set C: 11, 11, 15, 15, 15, 18, 20

Measures of Dispersion
Measures of dispersion or variability describe the spread or scattering of the values around the mean.
1. The range is the distance between the maximum value and the minimum value.
2. The variance is the average squared difference of each observation from the mean.
3. The standard deviation is the positive square root of the variance.
4. The coefficient of variation is the ratio of the standard deviation to the mean, expressed as a percentage.

Measures of Skewness and Kurtosis
The measure of skewness measures the degree of symmetry of a distribution.
• $Sk = 0$, symmetric distribution
• $Sk > 0$, positively skewed distribution
• $Sk < 0$, negatively skewed distribution
• $Sk_1 = \dfrac{\bar{X} - Mo}{s}$, where $Mo$ is the mode
• $Sk_2 = \dfrac{3(\bar{X} - Md)}{s}$, where $Md$ is the median

Measures of Skewness
1. Symmetrical or Normal Distribution. In a symmetrical distribution the mean, median, and mode all fall at the same point.
2. Positively Skewed Distribution. In a positively skewed distribution, the extreme scores are larger, thus the mean is larger than the median.
3. Negatively Skewed Distribution. The order of the measures of central tendency would be the opposite of the positively skewed distribution, with the mean being smaller than the median, which is smaller than the mode.

The measure of kurtosis refers to the peakedness or flatness of the curve of the distribution.
$K = \dfrac{\sum (X - \bar{X})^4}{n s^4}$
i. when K > 3, the distribution is leptokurtic
ii. when K = 3, the distribution is mesokurtic
iii. when K < 3, the distribution is platykurtic

Measure of Kurtosis
• Leptokurtic. The curve is more peaked and the hump is narrower or sharper than the normal curve.
• Platykurtic. The curve is less peaked and the hump is flatter than the normal curve.
• Mesokurtic. The hump is the same as the normal curve. It is neither too flat nor too peaked.

Normal Distribution
• The normal distribution is a pattern for the distribution of a set of data which follows a bell-shaped curve.
• The graph of a normal distribution is called a normal curve.

Properties of Normal Distribution
1. The normal curve is bell-shaped.
2. The mean, median and mode are located at the center of the distribution, and it is unimodal.
3. It is symmetrical about the mean.
4. It is continuous and is asymptotic with respect to the x-axis.
5. The total area under the curve is 1.00 or 100%.
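The descriptive measures defined in this lesson (central tendency, dispersion, skewness and kurtosis) can be tried out on Sets A, B and C given earlier. The sketch below uses Python's standard `statistics` module. It assumes the sample standard deviation (divisor n - 1) for s and the fourth-moment form of K; both are assumptions of this example, and the function name `describe` is hypothetical.

```python
import statistics as st

def describe(data):
    """Return the descriptive measures named in this lesson for one data set."""
    n = len(data)
    mean = st.mean(data)
    median = st.median(data)
    mode = st.mode(data)          # most frequent value
    s = st.stdev(data)            # sample standard deviation (divisor n - 1)
    return {
        "mean": mean,
        "median": median,
        "mode": mode,
        "range": max(data) - min(data),
        "variance": st.variance(data),      # sample variance
        "sd": s,
        "cv_percent": 100 * s / mean,       # coefficient of variation
        "sk1": (mean - mode) / s,           # skewness based on the mode
        "sk2": 3 * (mean - median) / s,     # skewness based on the median
        # Assumed fourth-moment form of the kurtosis measure K.
        "kurtosis": sum((x - mean) ** 4 for x in data) / (n * s ** 4),
    }

set_a = [9, 12, 13, 15, 15, 17, 24]
set_b = [7, 11, 15, 15, 17, 19, 21]
set_c = [11, 11, 15, 15, 15, 18, 20]

results = {name: describe(data)
           for name, data in [("A", set_a), ("B", set_b), ("C", set_c)]}
for name, r in results.items():
    print(name, r["mean"], r["median"], r["mode"], r["range"], round(r["sd"], 2))
```

All three sets share the same mean, median and mode, so both skewness coefficients are zero; the sets differ only in how widely the values are spread, which is what the range, variance and standard deviation pick up.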
Many Normal Distributions
• There are an infinite number of normal curves.
• By varying the mean and standard deviation, we obtain different normal distributions.

The Standard Normal Distribution
• A normal distribution with a mean of 0 and a standard deviation of 1 is called the standard normal distribution.
• The z-score measures how many standard deviations an observed value is above or below the mean.
• The sample z-score is given by the formula $z = \dfrac{x - \bar{X}}{s}$
• The standard score is useful when we want to compare two or more observed values from different data sets.

Area under the Standard Normal Curve
Given | Steps
Between zero and any number | Look up the area in the table.
Between two positives, or between two negatives | Look up both areas in the table and subtract the smaller from the larger.
Between a negative and a positive | Look up both areas in the table and add them together.
Less than a negative, or greater than a positive | Look up the area in the table and subtract from 0.5000.
Greater than a negative, or less than a positive | Look up the area in the table and add to 0.5000.

Test of Hypothesis
• A one-tailed test of hypothesis is a test where the alternative hypothesis specifies a one-directional difference for the parameter of interest.
• A two-tailed test of hypothesis is a test where the alternative hypothesis does not specify a directional difference for the parameter of interest.
• A test statistic is a statistic whose value is calculated from sample data, which will be the basis for deciding whether to reject 𝐻0 or not in a test of hypothesis.
• The critical region is the set of values of the test statistic for which we reject the null hypothesis. The acceptance region is the set of values of the test statistic for which we do not reject the null hypothesis.
• These two regions are separated by the critical value of the test statistic.

Critical Value
• The critical value or tabular value for the hypothesis test is a threshold to which the value of the test statistic in a sample is compared to determine whether or not the null hypothesis is rejected.
• We reject the null hypothesis if the computed value is greater than or equal to the critical value.

Types of Error
• The Type I error is the error committed when we decide to reject the null hypothesis when in reality the null hypothesis is true.
• The Type II error is the error committed when we decide not to reject the null hypothesis when in reality the null hypothesis is false.

The Level of Significance
• The level of significance, denoted by 𝛼, is the maximum probability of committing a Type I error that the researcher is willing to accept.
• Very frequently used are the .05 and .01 levels of significance.
• Note: a 0.05 level of significance implies that we are willing to commit an error of 5%, which corresponds to a confidence level of 95%.

p-value
The p-value is the probability of selecting a sample whose computed value for the test statistic is equal to or more extreme than the realized value computed from the sample data, given that the null hypothesis is true. As a rule, if the p-value is greater than the level of significance, then we do not reject the null hypothesis. On the other hand, if the p-value is less than or equal to the level of significance, then we reject the null hypothesis.

Steps in Hypothesis Testing
1. State the null and alternative hypotheses.
2. Choose the level of significance.
3. Determine the appropriate statistical technique and corresponding test statistic to use.
4. Perform the computation. Compare the computed value with the critical value (others use the p-value instead).
5. Make the decision (reject the null hypothesis or fail to reject it).

Decision Rule
• Reject 𝐻0 if the value of the test statistic falls in the region of rejection (that is, the test statistic is greater than the critical value).
• Reject 𝐻0 if the p-value is less than or equal to the level of significance.

The parametric tests are tests applied to data that are normally distributed, the levels of measurement of which are expressed in interval and ratio.

t-test for Dependent Samples (paired)
• A parametric test applied to one group of samples.
• It can be used in the evaluation of a certain program or treatment.
• It is applied when the mean before and the mean after are being compared.

t-test for Independent Samples (unpaired)
• Used when we compare the means of two independent groups.
• Used when the sample size is less than 30.

z-test
• It is used to compare two means: the sample mean and the perceived population mean.
• It is also used to compare two sample means taken from the same population.
• It is used when the samples are equal to or greater than 30.
• It can be applied in two ways: the one-sample mean test and the two-sample mean test.

F-test
• It is another parametric test used to compare the means of two or more independent groups.
• It is also known as the analysis of variance (ANOVA).
• Kinds of ANOVA: one-way, two-way, three-way
• We use ANOVA to find out if there is a significant difference between and among the means of two or more independent groups.

The Pearson Product Moment Correlation Coefficient, r
• It is used to analyze if a relationship exists between two variables (measured in the interval or ratio scale), say variables x and y.
• It was developed by Karl Pearson, which is why the correlation coefficient is sometimes called "Pearson's r." The formula is defined by:
$r = \dfrac{n\sum xy - \sum x \sum y}{\sqrt{\left[n\sum x^2 - (\sum x)^2\right]\left[n\sum y^2 - (\sum y)^2\right]}}$

Basic properties of r
The correlation coefficient ranges from -1 to +1. A value of -1.00 represents a perfect negative correlation, while a value of +1.00 represents a perfect positive correlation. If the value is equal to 0.00, it means that there is no linear relation between the variables.

Simple Linear Regression Analysis
• The simple linear regression analysis predicts the value of y given the value of x.
• It is used when there is a relationship between the independent variable x and the dependent variable y.
• The formula for the simple linear regression is 𝑦 = 𝑎 + 𝑏𝑥, where y = dependent variable, x = independent variable, a = y-intercept, and b = slope.
- This statistical procedure is concerned with prediction or forecasting.
- It is a statistical method that allows us to summarize and study relationships between two continuous (quantitative) variables.
- In simple linear regression, we predict scores on one variable (dependent) from the scores on a second variable (independent).
- The variable we are predicting is called the criterion variable and is referred to as Y. The variable we are basing our predictions on is called the predictor variable and is referred to as X.

Nonparametric tests are tests that do not require a normal distribution. They utilize both nominal and ordinal data.

Chi-Square Test
• This is the test of difference between the observed and expected frequencies.
• The Test for Goodness of Fit determines if the sample under analysis was drawn from a population that follows some specified distribution.
• The Test for Homogeneity answers the proposition that several populations are homogeneous with respect to some characteristic.
• The Test for Independence (one of the most frequent uses of Chi-Square) is for testing the null hypothesis that two criteria of classification, when applied to a population of subjects, are independent.
If they are not independent then there is an association between them.

Lesson 4.1: Hypothesis Testing

What is Hypothesis?
• An assumption about the population parameter.
• An educated guess about the population parameter.

• Hypothesis Testing: This is the process of making an inference or generalization on population parameters based on the results of the study on samples.
• Statistical Hypothesis: It is a guess or prediction made by the researcher regarding the possible outcome of the study.

• Null Hypothesis (H0): is always hoped to be rejected. It always contains the "=" sign.
• Alternative Hypothesis (Ha):
o Challenges H0
o Never contains the "=" sign
o Uses "<", ">" or "≠"
o Generally represents the idea which the researcher wants to prove.

The null hypothesis (H0)
• The null hypothesis (H0) represents the current line of thought concerning population parameters, prior to any application of inferential statistics, while
• the alternative hypothesis (Ha) is accepted only after the validity of the null hypothesis is statistically inferred to be incorrect.
Example: "The defendant is assumed to be innocent until proven guilty beyond all reasonable doubt."

What are the steps in hypothesis testing?
1. State the null hypothesis and alternative hypothesis
• Begin with a very clear, precisely stated research question that will guide the way we conduct the study and ensure that we do not just end up with a jumble of information that does not create any real knowledge.
Example: Research Questions
1. Are men from America on the average taller than men from the Philippines?
2. Is the proportion of cigarette smokers who suffer from lung cancer higher than the proportion of non-smokers who suffer from lung cancer?
3. Do students from private schools have higher drop-out rates at university than students from state universities?
Crucial elements:
• They identify the population(s) we want to make a statement about;
• They identify variables for which we will gather data; and
• They identify the relevant descriptive statistic for describing the data.

2. Set the level of significance
3. Formulate the decision rule (DR)
4. Compute the test statistic
5. Make a decision

What are the two kinds of research questions?
1. The first is where a particular value is chosen for practical or policy reasons.
2. The other situation in which we will have a specified test value is where we want to compare the population under investigation with another population whose parameter value is known.

The null hypothesis of no difference (H0)
• The null hypothesis must be clearly capable of being rejected, that is, it can be shown to be false.
• The null hypothesis is also called the statistical hypothesis because it is stated for the purpose of either accepting or rejecting it after submitting the data to statistical analysis.
Examples:
Title 1: An evaluation of the effectiveness of online learning
Problem: The researcher wants to know if online learning has increased the average GPA of NEU students from 80%.
H0: μ = 80; Online learning has not increased the average GPA of NEU students.
Ha: μ > 80; Online learning has increased the average GPA of NEU students.
Explanation: This is because the researcher is interested in knowing if online learning has increased the average GPA of NEU students (>, because of the word "increased").
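The p-value rule stated earlier (reject H0 when the p-value is less than or equal to the level of significance) can be sketched for a test statistic that follows the standard normal curve. The example below is a minimal illustration using `statistics.NormalDist` from the Python standard library. The function names and the value z = 1.80 are hypothetical, since the online-learning example gives no sample data; it only shows how a right-tailed alternative such as Ha: μ > 80 would be handled.

```python
from statistics import NormalDist

def p_value(z, tail):
    """Area of the standard normal curve at least as extreme as z.

    tail: "left"  when Ha uses <,
          "right" when Ha uses >,
          "two"   when Ha uses the not-equal sign.
    """
    std_normal = NormalDist()            # mean 0, standard deviation 1
    if tail == "left":
        return std_normal.cdf(z)
    if tail == "right":
        return 1 - std_normal.cdf(z)
    if tail == "two":
        return 2 * (1 - std_normal.cdf(abs(z)))
    raise ValueError("tail must be 'left', 'right' or 'two'")

def decide(p, alpha=0.05):
    """Apply the rule: reject H0 if the p-value is <= the level of significance."""
    return "reject H0" if p <= alpha else "do not reject H0"

z = 1.80                                  # hypothetical computed test statistic
p_right = p_value(z, "right")             # right-tailed, as in Ha: mu > 80
p_two = p_value(z, "two")                 # same z under a two-tailed alternative
print(round(p_right, 4), decide(p_right))
print(round(p_two, 4), decide(p_two))
```

The same computed value can lead to different decisions depending on whether the alternative hypothesis is directional, which is why the hypotheses are stated before the test statistic is computed.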
Lesson 4.2: Types of Hypothesis Test

1. One-tailed left directional test
• This is used if Ha uses the < symbol.
Left directional test: It is used when the alternative hypothesis uses comparatives such as less than, smaller than, inferior to, below, etc. The figure above illustrates the acceptance and rejection regions. Left-tail tests are normally used when we want to test whether some minimum requirement has been met.

Example:
1. It is known in our school canteen that the average waiting time for a customer to receive and pay for his order is 20 minutes. Additional personnel have been assigned and now the management wants to know if the average waiting time has been reduced.
H0: The average waiting time has not been reduced, or
H0: The average waiting time is equal to 20 minutes.
Ha: The average waiting time has been reduced, or
Ha: The average waiting time is less than 20 minutes.

2. One-tailed right directional test
• This is used if Ha uses the > symbol.
Right directional test: It is used when the alternative hypothesis uses comparatives such as greater than, higher than, better than, superior to, exceeds, etc. The figure above illustrates the acceptance and rejection regions. Right-tail tests are normally used when we want to test whether some maximum limit or standard has not been exceeded.

3. Two-tailed test: Non-directional
Two-tailed test: It is used when the alternative hypothesis uses words such as not equal to, significantly different, etc.

Alternative hypothesis | Tail of sampling distribution
H1: μ ≠ X | Both
H1: μ < X | Left
H1: μ > X | Right

Therefore:
1. If Ha uses ≠, the test is two-tailed non-directional.
2. If Ha uses <, the test is one-tailed left directional.
3. If Ha uses >, the test is one-tailed right directional.

Level of Significance
The level of significance is the area of the rejection region, designated by the Greek letter α (alpha), while the area of the acceptance region is 1 - α. If α = 0.05, then 1 - α = 0.95. The typical values of α are 0.01 and 0.05, but you are not prevented from using 0.02, 0.03, … etc. In your research, however, you just have to use α = 0.05.

Decisions Made Regarding H0
(Reject H0 / Do not reject H0)
If you reject H0, you conclude that it is wrong!
If you do not reject H0, it doesn't mean it is correct! You simply don't have enough evidence to reject it!

What is Type I error?
• Type I error (α) is the error of rejecting a true null hypothesis (H0).
• Its probability is called the level of significance of a test.
What is Type II error?
• Type II error (β) is the error of accepting a false null hypothesis when the alternative hypothesis (H1) is true.

• An α of 0.01 (compared with 0.05) means the researcher is being relatively careful. He/she is only willing to risk being wrong once in 100 times in rejecting a null hypothesis which is true.
• If the null hypothesis is rejected at α = 0.05, the perceived difference is significant, but if it is rejected at α = 0.01, the difference is highly significant.

Testing the Significance of Difference Between Means
z-test: n ≥ 30 → σ is known
t-test: n < 30 → σ is unknown
F-test (ANOVA) → 3 or more μs

• To reiterate, the z-test is used when "n is large" or when "n ≥ 30" and σ (population standard deviation) is known. Three types of hypotheses can be tested by the z-test.

Testing the significance of difference between:
• Population or hypothesized mean, that is, Population mean vs Sample mean
• Two sample means and two sample standard deviations are known, that is, Sample mean 1 vs Sample mean 2
• Two sample means and the population standard deviation are known, that is, Sample mean 1 vs Sample mean 2

Testing the Significance of Difference Between Means "n is large or when n ≥ 30 and σ is known"

z-test: n ≥ 30 → σ is known
• Hypothesized/population mean vs sample mean, and the population standard deviation is known
$Z = \dfrac{(\bar{x} - \mu)\sqrt{n}}{\sigma}$
$\bar{x}$ - is the sample mean
μ - is the population mean
n - is the sample size
σ - is the population standard deviation

z-test: n ≥ 30 → σ is known
• Sample mean 1 vs Sample mean 2, and the two sample standard deviations are known.
$Z = \dfrac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}$
$\bar{x}_1$ - is the mean of sample 1
$\bar{x}_2$ - is the mean of sample 2
n1 & n2 - are the sample sizes
s1 & s2 - are the sample standard deviations

z-test: n ≥ 30 → σ is known
• Sample mean 1 vs Sample mean 2, and the population standard deviation is known.
$Z = \dfrac{\bar{x}_1 - \bar{x}_2}{\sigma\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}}$
$\bar{x}_1$ - is the mean of sample 1
$\bar{x}_2$ - is the mean of sample 2
n1 & n2 - are the sample sizes
σ - is the population standard deviation

• The first is the critical value approach, which compares the computed value of the test statistic and the critical value.
• The second compares the p-value (the area to the right of the computed value) and α.

Lesson 4.3: The Z-Test

• A table is constructed so that you don't have to go back to the areas under the normal curve table. You will always refer to this table whenever you use the z-test in hypothesis testing.

Test | α = 0.01 | α = 0.05
One-tailed | 2.33 | 1.65
Two-tailed | 2.58 | 1.96

Examples:
1. A supermarket owner believes that the mean income of its customers is P50,000 per month. One hundred customers are randomly selected and asked about their monthly income. The sample mean is P48,500 per month and the standard deviation is P3,200. Is there sufficient evidence to indicate that the mean income of the customers of the supermarket is P50,000 per month? Use α = 0.05.
Answer:
Since n = 100, and only one sample mean is given, use the z-test, and the test statistic is:
$Z = \dfrac{(\bar{x} - \mu)\sqrt{n}}{\sigma}$
Substituting μ = 50000, $\bar{x}$ = 48500, n = 100, σ = 3200 in the formula, you obtain
$Z = \dfrac{(48500 - 50000)\sqrt{100}}{3200} = \dfrac{-1500(10)}{3200} = \dfrac{-15000}{3200} = -4.69$

5-step solution
1. H0: μ = 50000; The mean income of the customers in the supermarket is 50000.
Ha: μ ≠ 50000; The mean income of the customers in the supermarket is not 50000.
2. α = 0.05; two-tailed test; Ztab = ±1.96
3. Decision rule: Reject H0 if |Zc| ≥ |Ztab|, that is, if 4.69 ≥ 1.96.
4. Decision: Reject H0, since |Zc| (4.69) > Ztab (1.96).
5. Conclusion: The mean income of the customers in the supermarket is not 50000.

• The steps in testing the significance of difference between means using the t-test are just the same as in the z-test.
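The supermarket example of Lesson 4.3 can be checked numerically. The sketch below recomputes the test statistic from the one-sample z formula and compares it with the two-tailed critical value, which is obtained here from `statistics.NormalDist` instead of the printed table. The function name `one_sample_z` is an assumption of this example.

```python
from math import sqrt
from statistics import NormalDist

def one_sample_z(sample_mean, mu0, sigma, n):
    """Z = (sample mean - hypothesized mean) * sqrt(n) / sigma."""
    return (sample_mean - mu0) * sqrt(n) / sigma

alpha = 0.05
z = one_sample_z(sample_mean=48500, mu0=50000, sigma=3200, n=100)

# Two-tailed test: alpha is split between the two tails.
z_crit = NormalDist().inv_cdf(1 - alpha / 2)
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

reject = abs(z) >= z_crit
print(round(z, 2), round(z_crit, 2), reject)
```

Both approaches agree here: the computed value exceeds the critical value in absolute terms, and the p-value is far below α, so H0 is rejected.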
Lesson 4.4: The t-Test

• The t-test is used if n is small, that is, if n < 30 and σ is unknown.
• If the sample size is small (n < 30), the values of the mean and standard deviation fluctuate from sample to sample.
• The sampling distribution of the sample mean and standard deviation is no longer a standard normal distribution; thus you call it the t-distribution.
• The t-distribution is similar to the z-distribution.
• They are both symmetrical about the mean. Both are bell-shaped.
• The t-distribution is more variable since t-values depend on the fluctuation of the mean and the standard deviation, whereas the z-values depend only on the fluctuation of the mean from sample to sample.
• The two distributions differ in standard deviation: z has a standard deviation of 1 while t has a standard deviation which is always greater than 1.
• The divisor n - 1 in the formula for the variance and standard deviation is called the degrees of freedom (df).
• If the mean and standard deviation are computed from samples of size n, the values of t are said to belong to a t-distribution with df = n - 1.
• Therefore, you have a different t curve for each possible sample size, such that the curve becomes more and more like the standard normal curve as n becomes larger or as n approaches infinity.

Degree of freedom: It is the number of variables which are free to vary.

t-distribution with different df
df = ∞, df = 10, df = 4
• The figure above shows the t-distribution with df = 4, 10 and ∞. The curve for df = 4 represents a sampling distribution of all t-values computed from repeated random samples of size n = 5 taken from a normal population. Similarly, the curve for df = 10 represents the sampling distribution of all t-values computed from samples of size n = 11.

• The difference between the t-test and the z-test lies in the use of the t-distribution with n - 1 degrees of freedom instead of the normal distribution.
• The t-distribution is also known as Student's distribution since it was developed by Gosset in 1908 under the name Student.
• Notice that as "n approaches infinity, the critical value of t approaches the critical value of z", that is, t becomes z.

How to use the t-table, "Student Distribution"
If α = 0.05; df = 9 and the test is two-tailed; t tab = 2.262.
If α = 0.01; df = 10 and the test is one-tailed; t tab = 2.764.

Testing the Significance of Difference Between Means "n is small or when n < 30 and σ is unknown."
t-test: n < 30 → σ is unknown
• Hypothesized/population mean vs sample mean, and the population standard deviation is unknown.
$t = \dfrac{(\bar{x} - \mu)\sqrt{n}}{s}$, df = n - 1
Given one sample mean and one standard deviation.
Example:
Ten randomly selected oil wells in a large field produced 21, 19, 20, 22, 24, 21, 19, 22, 22, and 20 barrels of crude oil per day. Is this enough evidence to conclude that the oil wells are not producing an average of 22.5 barrels of crude oil per day? Test at the 0.01 level of significance.
$\bar{x} = \dfrac{21 + 19 + 20 + 22 + 24 + 21 + 19 + 22 + 22 + 20}{10} = 21$
The given data are: μ = 22.5, α = 0.01. The sample mean and standard deviation are not given, therefore solve for them first using a calculator. Then $\bar{x}$ = 21, sx = 1.56. Substituting the values, you have the following solution:
$t = \dfrac{(\bar{x} - \mu)\sqrt{n}}{s} = \dfrac{(21 - 22.5)\sqrt{10}}{1.56} = -3.04$
5-step solution
1. H0: μ = 22.5 (The oil wells are producing 22.5 barrels of oil a day.)
Ha: μ ≠ 22.5 (The oil wells are not producing 22.5 barrels of oil a day.)
2. α = 0.01; df = 9; t tab = ±3.250 (from the table of Student distribution)
3. Decision rule: Reject H0 if |tc| ≥ |ttab|, that is, if 3.04 ≥ 3.250.
4. Decision: Do not reject H0, because 3.04 < 3.250.
5. Conclusion: There is not enough evidence to conclude that the oil wells are not producing an average of 22.5 barrels of oil a day.

• Two independent sample means, where the population standard deviations are unknown.
$t = \dfrac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}$; df = n1 + n2 - 2

Example:
A teacher wants to find out if the Team-based Instruction (TBI) method of teaching Statistics is more effective than the Individually-guided Instruction (IGI) method. Two classes of approximately equal intelligence were selected. From one class, he/she considered 15 students with whom he/she used the TBI method of teaching, and from the other class, he/she considered 14 students with whom he/she used the IGI method. After several sessions, a 30-item test was given. The scores are shown in the following table.

S | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
TBI | 30 28 29 20 18 19 16 27 22 24 26 28 30 29 18
IGI | 25 27 20 30 16 21 15 25 28 21 19 17 18 1

Method | n | $\bar{x}$ | sx
TBI | 15 | 24.27 | 4.98
IGI | 14 | 21.07 | 5.21

• Based on the result of the test, can you say that the TBI method of teaching is more effective than the IGI method? Use α = 0.05.

Solution: Since n1 = 15 and n2 = 14, and there are two independent samples, use the t-test for two independent sample means given above.
Substituting:
$\bar{x}_1$ = 24.27, $\bar{x}_2$ = 21.07
s1 = 4.98, s2 = 5.21
n1 = 15, n2 = 14
$t = \dfrac{24.27 - 21.07}{\sqrt{\dfrac{(15 - 1)(4.98)^2 + (14 - 1)(5.21)^2}{15 + 14 - 2}\left(\dfrac{1}{15} + \dfrac{1}{14}\right)}} = 1.69$
df = 15 + 14 - 2 = 27

5-step solution
1. H0: The TBI method of teaching Statistics is as effective as the IGI method. (H0: μTBI = μIGI)
Ha: The TBI method of teaching Statistics is more effective than the IGI method. (Ha: μTBI > μIGI)
2. α = 0.05; one-tailed; df = 27; t tab = 1.703
3. Decision rule: Reject H0 if |tc| ≥ |ttab|, that is, if 1.69 ≥ 1.703.
4. Decision: Do not reject H0, since tc (1.69) < ttab (1.703).
5. Conclusion: The sample evidence does not show that the Team-based Instruction method of teaching Statistics is more effective than the Individually-guided Instruction method.

• Dependent or Correlated Sample Means
$t = \dfrac{\bar{d}\sqrt{n}}{s_d}$, df = n - 1
• Dependent samples are drawn from the same population or the same set of samples subjected to different experimental conditions, like weight before and after attending aerobics, pretest and posttest, etc.
• Correlated samples are two different sets of samples which are related, like mother and daughter, brother and sister, old and new machine, etc.
Example:
The following are the weights in pounds of 15 students before and after six months of attending aerobics.

Weights Before | 243 179 201 165 183 153 170 180 212 169 178 209 158 192 144
Weights After | 231 173 199 162 179 152 164 177 207 170 171 196 159 190 140
Difference | 12 6 2 3 4 1 6 3 5 -1 7 13 -1 2 4

• Test at α = 0.05 if aerobics is effective in reducing weight.
Solution: Since the two sets of data are taken from the same set of samples, use the t-test for dependent sample means given above.
• First, get $\bar{d}$ and sd. By using the formula or a calculator for finding the mean and standard deviation, you find that the mean of the differences is 4.40 and the standard deviation is 4.05.
Substituting: $\bar{d}$ = 4.40, sd = 4.05, n = 15
$t = \dfrac{4.40\sqrt{15}}{4.05} = 4.21$; df = 15 - 1 = 14
If you let:
μB = the mean before attending aerobics
μA = the mean after attending aerobics

5-step solution
1. H0: Aerobics is not effective in reducing weight. (H0: μB = μA)
Ha: Aerobics is effective in reducing weight. (Ha: μB ≠ μA)
2. α = 0.05; two-tailed; df = 14; t tab = 2.145
3. Decision rule: Reject H0 if |tc| ≥ |ttab|, that is, if 4.21 ≥ 2.145.
4. Decision: Reject H0, since tc (4.21) > t tab (2.145).
5. Conclusion: Based on the sample evidence, aerobics is effective in reducing weight.

Variance of the boys' scores:
σ² = 1,120 / 5 = 224
s² = 1,120 / 4 = 280
STANDARD DEVIATION: The square root of the variance
BOYS
σ² = 224, s² = 280
σ = 14.97, s = 16.73
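The three t-test examples of Lesson 4.4 (oil wells, TBI vs IGI, and aerobics) can be recomputed with the Python standard library. The sketch below is one way to do it, with hypothetical function names. The oil-well and aerobics statistics are computed from the raw data, while the TBI vs IGI statistic uses the summary values from the table. Because the code does not round the standard deviation before dividing, the oil-well result comes out as about -3.03 rather than the -3.04 obtained with s = 1.56.

```python
from math import sqrt
from statistics import mean, stdev

def one_sample_t(data, mu0):
    """t = (sample mean - mu0) * sqrt(n) / s, with df = n - 1."""
    n = len(data)
    return (mean(data) - mu0) * sqrt(n) / stdev(data), n - 1

def pooled_t(m1, s1, n1, m2, s2, n2):
    """t for two independent sample means with a pooled variance, df = n1 + n2 - 2."""
    pooled_var = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)
    return (m1 - m2) / sqrt(pooled_var * (1 / n1 + 1 / n2)), n1 + n2 - 2

def paired_t(before, after):
    """t = mean difference * sqrt(n) / sd of the differences, with df = n - 1."""
    d = [b - a for b, a in zip(before, after)]
    n = len(d)
    return mean(d) * sqrt(n) / stdev(d), n - 1

oil = [21, 19, 20, 22, 24, 21, 19, 22, 22, 20]
before = [243, 179, 201, 165, 183, 153, 170, 180, 212, 169, 178, 209, 158, 192, 144]
after = [231, 173, 199, 162, 179, 152, 164, 177, 207, 170, 171, 196, 159, 190, 140]

t_oil, df_oil = one_sample_t(oil, 22.5)
t_tbi, df_tbi = pooled_t(24.27, 4.98, 15, 21.07, 5.21, 14)
t_aero, df_aero = paired_t(before, after)

# Tabular values quoted in the lesson: 3.250, 1.703 and 2.145.
print(round(t_oil, 2), df_oil, abs(t_oil) >= 3.250)
print(round(t_tbi, 2), df_tbi, abs(t_tbi) >= 1.703)
print(round(t_aero, 2), df_aero, abs(t_aero) >= 2.145)
```

The decisions match the 5-step solutions: H0 is not rejected in the first two examples and is rejected in the aerobics example.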

Lesson 4.5: Measures of Variability of Lesson 4.6: Correlation Analysis


Ungrouped Data
• A correlation is a relationship or association
Boys Girls between two variables.
Frederick 70 Grace 82 • A correlation coefficient is a numerical measure
Russel 95 Irish 80 of the linear relationship between two variables.
Murphy 60 Abigail 83
Jerome 80 Sherry 81 A direct or positive relationship between two
variables implies that an increase in value of one of
Tom 100 Kristine 79
the variables corresponds to an increase in value of
Mean: 81 Mean: 81
the other variable. r = 1
A direct or positive relationship between two
Measures of Variability or Dispersion
variables implies that an increase in value of some
RANGE: The difference between the highest and
of the variables corresponds to an increase in value
the lowest observation R = H – L
of the other variable. 0 < r < 1
Boys: R = 100 – 60 R = 40
• In reality, you could seldom find variables with
Girls: R = 83 – 79 R = 4
perfect positive correlation. Oftentimes, you will
come across variables with only some degree of
Mean Deviation: The average of the summation of
positive relationship.
the absolute deviation of each observation from the
• In a perfect positive correlation, all the points
| Xi− X́|
mean. MD= ∑ can be contained in a straight line whose
n movement is upward right. Now what do you notice
Boys Xi Xi-X in the “some positive correlation”? Can they
Frederick 70 11 contained in one straight line? If not describe the
Russel 95 14 general direction of the points.
Murphy 60 21
Jerome 80 1 What do you think is the relationship between the
Tom 100 19 number of absences in class (variable 1) and the
Mean: 81 ∑405 ∑66 grades received (variable 2)?
M.D = 66 / 5 = 13.2
An inverse or negative relationship between two
VARIANCE: The average of the squared deviation variables implies that an increase in value of one of
from the mean. the variables corresponds to a decrease in value of
Population Variance σ 2=∑ ¿ ¿ ¿ the other variable. r = -1
Sample Variance s2=∑ ¿ ¿¿ • Again, this type of relationship is not true for all.
In real life, you can only get some degree of
Boys Xi Xi-X ¿
negative relationship.
Frederick 70 -11 121 An inverse or negative relationship between two
Russel 95 14 196 variables implies that an increase in value of some
Murphy 60 -21 441 of the variables corresponds to a decrease in value
Jerome 80 -1 1 of the other variable. -1 < r < 0
Tom 100 19 361
Mean: 81 ∑405 ∑66 ∑ 1,120
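For the boys' scores in Lesson 4.5, the range, mean deviation, variance, and standard deviation can be recomputed with a short Python sketch. The variable names are illustrative only; the data are the five scores from the table.

```python
import math

# Boys' scores from the Lesson 4.5 table.
boys = {"Frederick": 70, "Russel": 95, "Murphy": 60, "Jerome": 80, "Tom": 100}
scores = list(boys.values())
n = len(scores)
mean = sum(scores) / n

# Range: highest minus lowest observation.
value_range = max(scores) - min(scores)

# Mean deviation: average absolute deviation from the mean.
mean_deviation = sum(abs(x - mean) for x in scores) / n

# Sum of squared deviations, shared by both variance formulas.
sum_sq = sum((x - mean) ** 2 for x in scores)
pop_var = sum_sq / n          # population variance, divides by N
samp_var = sum_sq / (n - 1)   # sample variance, divides by n - 1

# Standard deviation: square root of the variance.
pop_sd = math.sqrt(pop_var)
samp_sd = math.sqrt(samp_var)

print(f"mean = {mean}, range = {value_range}, MD = {mean_deviation}")
print(f"population: variance = {pop_var}, sd = {pop_sd:.2f}")
print(f"sample:     variance = {samp_var}, sd = {samp_sd:.2f}")
```

Replacing the scores with the girls' data (82, 80, 83, 81, 79) shows how two groups with the same mean can differ widely in spread.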
Yes! There are many variables which have no correlation at all. Thus, there exists a zero correlation.

A zero relationship exists between two variables if an increase in value of one of the variables is not accompanied by either an increase or a decrease in value of the other variable. r = 0

• To determine the degree of relationship between two variables, the Pearson product-moment correlation coefficient, or simply Pearson's “r”, will be used. The formula and the extent or degree of relationship are given in the boxes below.

The Pearson product-moment correlation coefficient or simply Pearson r
$r = \dfrac{n\sum xy - \sum x \sum y}{\sqrt{\left[n\sum x^2 - \left(\sum x\right)^2\right]\left[n\sum y^2 - \left(\sum y\right)^2\right]}}$

A correlation coefficient is the magnitude or the degree of relationship between two variables.

±0.80 to ±0.99    high correlation
±0.60 to ±0.79    moderately high correlation
±0.40 to ±0.59    moderate correlation
±0.20 to ±0.39    low correlation
±0.01 to ±0.19    negligible correlation

• For manual computation, you may refer to the formula. However, it will be easier if you have the required calculator with LR/stat1/stat2/statxy mode.
• Do the computation using the example on the number of hours spent in studying and the grades received.

Hours (x)   Grades (y)   x²     y²      xy
2           57            4     3249    114
2           63            4     3969    126
2           70            4     4900    140
3           72            9     5184    216
3           69            9     4761    207
4           75           16     5625    300
5           73           25     5329    365
5           84           25     7056    420
6           82           36     6724    492
6           89           36     7921    534
Σ = 38      Σ = 734      Σ = 168   Σ = 54718   Σ = 2914

$r = \dfrac{10(2914) - (38)(734)}{\sqrt{\left[10(168) - (38)^2\right]\left[10(54718) - (734)^2\right]}}$
r = 0.8851144396
r = 0.89

• To get the correlation coefficient on the calculator, press SHIFT or RCL then “r”; 0.8851144396 will be displayed. To two decimal places, rxy = 0.89, which is interpreted as high correlation.

Another important and interesting statistic which can be obtained from the correlation coefficient (r) is the coefficient of determination, r². This tells us how much of Y (grades) is due to or can be attributed to X (number of hours spent in studying). Thus, if you square “r”, that is 0.885114396², you will get 0.783427495.

• This value is interpreted as follows: “Seventy-eight percent (78%) of the variation in grades received (Y) is due to or can be attributed to the variation in the number of hours spent in studying (X), and the remaining 22% (100% − 78%) is due to other factors such as IQ, teacher, etc.”

Testing the significance of correlation
• After learning how to get and interpret the value of “r”, your next task is to determine whether the correlation which exists between the variables is significant and not just due to chance. This is testing the significance of correlation.
• There are several ways to test if “r” is significant. One can use the t-test for the correlation coefficient with the formula:
$t = \dfrac{r\sqrt{n-2}}{\sqrt{1-r^2}}$
$t = \dfrac{0.885114396\sqrt{10-2}}{\sqrt{1-0.885114396^2}}$
t = 5.379511443
t = 5.3795

5-step solution (let ρ be the population correlation)
1. H0: ρ = 0; There is no correlation between the number of hours spent in studying and the grades received. (rho is the symbol for the population r)
Ha: ρ ≠ 0; There is a correlation between the number of hours spent in studying and the grades received.
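The value of r, the coefficient of determination, and the t statistic for the hours/grades example can be reproduced with a short Python sketch. It follows the computational formula for Pearson's r and the t-test formula given in this lesson; the variable names are illustrative only.

```python
import math

# Data from the hours/grades table.
hours = [2, 2, 2, 3, 3, 4, 5, 5, 6, 6]
grades = [57, 63, 70, 72, 69, 75, 73, 84, 82, 89]
n = len(hours)

# Column totals used by the computational formula.
sum_x = sum(hours)
sum_y = sum(grades)
sum_x2 = sum(x * x for x in hours)
sum_y2 = sum(y * y for y in grades)
sum_xy = sum(x * y for x, y in zip(hours, grades))

# Pearson product-moment correlation coefficient.
numerator = n * sum_xy - sum_x * sum_y
denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
r = numerator / denominator

# Coefficient of determination and t statistic for testing H0: rho = 0.
r_squared = r ** 2
t = r * math.sqrt(n - 2) / math.sqrt(1 - r_squared)

print(f"r = {r:.4f}, r^2 = {r_squared:.4f}, t = {t:.4f}, df = {n - 2}")
```

The computed t is compared with the tabular value for df = n − 2 = 8, which the lesson gives as 2.306 at α = 0.05.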
2. α = 0.05; t comp = 5.3795 and cv = 2.306
3. Decision rule: Reject H0 if |5.3795| ≥ |2.306|
4. Decision: Reject H0, because 5.3795 > 2.306
5. Conclusion: There is a significant correlation between the number of hours spent in studying and the grades received. Hence, as the number of hours spent in studying increases, the grades received also increase.

Lesson 4.7: Spearman Rank-Order Correlation

Spearman Rank-Order Correlation Coefficient
• This is known as Spearman's rho. The basic logic underlying rho is that it tries to predict the ranking of pairs of cases on the dependent variable given their ranking on the independent variable. However, it makes use of the longer scale.
• This is a very simple and quick method when the paired data expressed in ranks number fewer than 30.
• Spearman's rho of the ranks of the two variables is used to determine the degree of relationship.
• Remember that Pearson's r is appropriate to use when data are on an interval or ratio scale. For ordinal data, Spearman's rho (ρ) is used. But Spearman's rho (ρ) is considered a very special type of Pearson's r.

Formula:
$\rho = 1 - \left[\dfrac{6\sum d^2}{n(n^2-1)}\right]$
Where:
1 and 6 = constants
d = the difference in ranks
n = the number of pairs

BSE Rank   BA Rank    d     d²
1          1          0     0
6          5          1     1
5          6         −1     1
4          3          1     1
9          9          0     0
3          2          1     1
2          4         −2     4
8          8          0     0
7          7          0     0
                      Σd² = 8

$\rho = 1 - \left[\dfrac{6(8)}{9(9^2-1)}\right] = 1 - \dfrac{48}{720} = 0.93$

5-step solution:
1. H0: ρ = 0; There is no correlation between the rankings of the characteristics of professors preferred most by BSE and BA students.
Ha: ρ ≠ 0; There is a correlation between the rankings of the characteristics of professors preferred most by BSE and BA students.
2. α = 0.05; ρ-computed = 0.93; ρ-tab = 0.683
3. Decision rule: Reject H0 if ρ-comp (0.93) ≥ ρ-tab (0.683).
4. Decision: Reject H0 because 0.93 > 0.683
5. Conclusion: There is a significant correlation between the rankings of the characteristics of professors preferred most by BSE and BA students.

Lesson 4.8: Regression Analysis

If two variables are correlated, that is, if the correlation coefficient (r) is significant, then it is possible to predict or estimate the value of one variable from the knowledge of the other variable.

Application of Regression Analysis
• Suppose the advertising cost (x) and sales (y) are correlated; then you can predict the future sales (y) in terms of advertising cost (x).
• Predicting the value of a certain variable several years hence or several years back when the values of that variable for corresponding years are given.

Definition:
• Regression Analysis deals with the estimation of one variable based on the changes or movements of the other variable.
• Regression Equation: Y = a + bx, where
$a = \dfrac{\sum y - b\sum x}{N}$ and $b = \dfrac{N\sum xy - \sum x \sum y}{N\sum x^2 - \left(\sum x\right)^2}$

Linear Regression of Y on X
• In a regression equation Y = a + bx, “a”, which is constant, is the value of the y-intercept while “b” is the slope of the regression line.
• The regression line is the line which traces the general direction of the points plotted in the scatter diagram. It is called the Trend Line or the Least Square Regression Line (LSRL) because it is the line which gives the minimum sum of the squared differences from the actual values.
• Thus, Y = a + bx is the linear regression of Y on X, and is used to predict the value of Y from the
knowledge of X.
• Two types of variables are involved in the regression equation:
1. The predictor (independent) variable, which is “x” in the regression equation (Y = a + bx).
2. The predictand (dependent) variable, which is “y” in the regression equation.
• Again, take as your example the hours spent in studying (x) and grades received (y) to predict the grades received (y) using the knowledge of the number of hours spent in studying (x).
• In getting Pearson's r, the same value can be obtained even if x and y are interchanged. Enter x then y, or y then x, and the same r (0.89) will be obtained.
• In regression, however, you have to enter the independent variable first, in this case x, followed by the dependent variable y. They cannot be interchanged!

Now predict the grade (Y) of students from the number of hours they spent studying:
• First set up the regression equation “Y = a + bx”.
• Get “a” by pressing SHIFT then “A” or its equivalent, and “53.30508475” will be displayed.
• Get “b” by pressing SHIFT then “B” or its equivalent, and “5.288135593” will be displayed.
• Putting them together in the regression equation “Y = a + bx”, you have “Y = 53.31 + 5.29x”, that is, rounding off A and B to two decimal places.
• To predict the grade received (Y’) when the number of hours spent in studying is:
a) 7 hours: substitute 7 for x in your equation, Y’ = 53.31 + 5.29(7) ≈ 90.34 or 90.
b) 1 hour: substitute 1 for x in your equation, Y’ = 53.31 + 5.29(1) ≈ 58.6 or 59.
c) 45 min: substitute 45/60 for x in your equation, Y’ = 53.31 + 5.29(45/60) ≈ 57.28 or 57.

Linear Regression of X on Y
• If you want to predict the number of hours spent in studying given the grades, can you use the same equation “Y = 53.31 + 5.29x”? The answer is NO!
• This time, the independent variable (x) is the grade while the dependent variable (y) is the number of hours spent in studying.
• This is now the linear regression of X on Y. The Least Square Regression Line (LSRL) this time is the line which gives the minimum sum of squares of the differences from the line, measured parallel to the x-axis.
• Thus, “X = a + by” is the linear regression of X on Y, and is used to predict the value of X from the knowledge of Y. The formulas are indicated in the box.

• Regression of X on Y: X = a + by
$b = \dfrac{N\sum xy - \sum x \sum y}{N\sum y^2 - \left(\sum y\right)^2}$
$b = \dfrac{10(2914) - (38)(734)}{10(54718) - (734)^2} = 0.1481$
$a = \dfrac{\sum x - b\sum y}{N}$
$a = \dfrac{38 - 0.1481(734)}{10} = -7.07$

• Thus, X = −7.07 + 0.15Y. This is the equation which will be used to predict the number of hours spent in studying when the grades received are:
a) 87: X’ = −7.07 + 0.15(87) = 5.98 ≈ 6 hours
b) 60: X’ = −7.07 + 0.15(60) = 1.93 ≈ 2 hours

• Now, can you use the calculator to get “a” and “b” for the regression equation of X on Y the way you did for the regression of Y on X? YES! But you have to interchange x and y; thus, the grade takes the role of X while the number of hours spent in studying becomes Y.
• Notice that you get exactly the same values for “a” and “b” as what you got using the formula. It gives you the same equation X = −7.07 + 0.15Y. Therefore, to make your job easier and to avoid using the formula, you can just interchange X and Y as you did above.
• Now since you interchanged X and Y, your equation becomes Y = −7.07 + 0.15X, where Y is now the number of hours and X is the grade.
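Both regression lines from Lesson 4.8 can be obtained from one small function by swapping the roles of the two variables, which mirrors the calculator procedure of interchanging x and y. This is a minimal Python sketch using the formulas for a and b given in the lesson; the function name `least_squares` is illustrative only.

```python
def least_squares(x, y):
    """Return (a, b) for the least-squares line y = a + b*x."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi * xi for xi in x)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = (sum_y - b * sum_x) / n
    return a, b

hours = [2, 2, 2, 3, 3, 4, 5, 5, 6, 6]
grades = [57, 63, 70, 72, 69, 75, 73, 84, 82, 89]

# Regression of Y (grades) on X (hours).
a_yx, b_yx = least_squares(hours, grades)

# Regression of X (hours) on Y (grades): interchange the two lists.
a_xy, b_xy = least_squares(grades, hours)

print(f"grades = {a_yx:.2f} + {b_yx:.2f} * hours")
print(f"hours  = {a_xy:.2f} + {b_xy:.2f} * grade")

# Predictions with the unrounded coefficients.
print(f"predicted grade for 7 hours:  {a_yx + b_yx * 7:.2f}")
print(f"predicted hours for grade 87: {a_xy + b_xy * 87:.2f}")
```

Because the sketch keeps the unrounded coefficients, its predictions can differ slightly from those obtained with the two-decimal coefficients used in the text.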
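The Spearman rho computation of Lesson 4.7 for the BSE and BA rankings can be sketched in the same way. This minimal Python sketch applies the formula ρ = 1 − 6Σd² / [n(n² − 1)] to the nine pairs of ranks from the table; note that this shortcut formula assumes there are no tied ranks, which holds for these data. The variable names are illustrative only.

```python
# Ranks of the nine characteristics as given by BSE and BA students.
bse_rank = [1, 6, 5, 4, 9, 3, 2, 8, 7]
ba_rank = [1, 5, 6, 3, 9, 2, 4, 8, 7]
n = len(bse_rank)

# Squared difference in ranks for each pair.
d_squared = [(bse - ba) ** 2 for bse, ba in zip(bse_rank, ba_rank)]
sum_d2 = sum(d_squared)

# Spearman rank-order correlation coefficient (no-ties formula).
rho = 1 - (6 * sum_d2) / (n * (n ** 2 - 1))

print(f"sum of d^2 = {sum_d2}, rho = {rho:.2f}")
```

The computed rho is then compared with the tabular value, as in step 3 of the 5-step solution.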