
Data Collection and Analysis:

An Introduction

Presented to:

Prof. Jiri Militky & Prof. Lubos Hes

By
Muhammad Mushtaq Ahmed Mangat
Textile Faculty
Technical University Liberec

Sep 09, 2010


Table of Contents, Tables and Figures

Part One: Statistics Definition and Functions
Descriptive statistics
Inferential statistics
Populations and samples
Types of samples
Number systems
Data
Variable
Independent variable
Dependent variable
Univariate data
Bivariate Data
Multivariate data
Discrete quantitative data
Continuous Quantitative data
Ordinal
Nominal
Time Series Data
Cross-sectional data
Primary
Secondary
Data Arranging and Presentation
List of data
Data Frequency
Part Two
Frequency Table
Pie Chart
Bar Chart
Area Charts
Line Charts
Dot Plot
Histogram
Histogram with normal curve
Radar Charts
Map chart
Stem and Leaf Plot
Box and Whisker Plot (or Boxplot)
Polygon charts
Range
Arithmetic mean
Geometric mean
Trimmed Mean
Median
Mode
Percentiles
Extremes and quartiles
Variance
Standard deviation (σ)
Variance sum law
Percentile summary
Standard Deviation
Normal Distribution
Skewed Distribution
Kurtosis
Sampling distribution
Standard error (standard deviation of sampling)
Normal Distribution and central limit theorem
Characteristics of Normal distributions
Binomial distribution
Bivariate and Multivariate analysis
Correlation
Hypotheses Testing
Null hypotheses
The research Hypotheses (alternative Hypotheses)
Two Tail and One Tail Test
Hypotheses Testing Methods
Z Score Test
Types of Z test
Z value calculation
t Test (William Sealy Gosset, 1908)
One sample t test
P value
Correlation
Regression Analysis
Example of regression analysis
Explanation of model
Multinomial logistic regression
Results of Multinomial Logistic Regression
Chi-Square Test
Crosstabs

Part One: Statistics Definition and Functions
Statistics is the art and science of collecting and understanding data. Its main functions are:
1. Gathering data
2. Arranging data
3. Analyzing data
4. Exploring the data
5. Estimating unknown quantities
6. Presenting results
7. Interpreting results
8. Making results available for decisions
9. Designing a plan for data collection
10. Testing hypotheses

Descriptive statistics
Descriptive statistics are used to describe the main features of a collection of data in quantitative
terms (en.wikipedia.org/wiki/Descriptive_statistics).

Inferential statistics
A statistical inference is a conclusion made on the basis of data that are subject to random variation
of some kind, such as observation errors or sampling variation
(en.wikipedia.org/wiki/Inferential_statistics).

Populations and samples


The population is the complete set of items from which a sample is drawn; a sample is a small subset of that larger set.

Types of samples
Random sample, stratified sample, quota sample, purposive sample, convenience sample

Number systems

Natural: 0, 1, 2, 3, 4, 5, 6, 7, ..., n
Integers: −n, ..., −5, −4, −3, −2, −1, 0, 1, 2, 3, 4, 5, ..., n
Positive integers: 1, 2, 3, 4, 5, ..., n
Rational: a/b where a and b are integers and b is not zero (e.g. 3/4)
Real: the limit of a convergent sequence of rational numbers (e.g. −1.23, 1.234)
Complex: a + bi where a and b are real numbers and i is the square root of −1
Prime numbers: natural numbers that have exactly two distinct natural number divisors, 1 and the number itself
(2, 3, 5, 7, 11)
Irrational numbers: precisely those real numbers whose infinite decimal expansions do not
repeat (e.g. π, for which 22/7 is only a rational approximation)

Data
Data refers to any kind of recorded information

Variable
A piece of information recorded for every item is called a variable

Independent variable
A variable that can be manipulated during an experiment

Dependent variable
A variable affected by the manipulation of the independent variable

Univariate data
A data set in which one piece of information is recorded for each item

Bivariate Data
Such data sets have exactly two pieces of information recorded for each item

Multivariate data
Such data sets have three or more pieces of information recorded for each item

Discrete quantitative data


A discrete variable can assume values only from a list of specific numbers, e.g. number of people,
number of classrooms.

Continuous Quantitative data
A continuous variable can take any value within a range, e.g. weight of students, air temperature

Ordinal
Categories with a meaningful order, e.g. a brightness rating from 1 to 5 where 1 is dull and 5 is fully bright

Nominal
Categories with no meaningful order, e.g. names of different departments

Time Series Data


Data recorded in a meaningful time sequence, e.g. daily stock exchange reports, weekly temperature of
a patient

Cross-sectional data
Data collected at a single point in time, e.g. grades of students in the first term

Primary
Data collected for a specific purpose

Secondary
Data previously collected for another purpose

Data Arranging and Presentation

List of data
A plain list of values is the simplest kind of data set; each entry records one piece of information.

Data Frequency
The frequency of data shows how often the various values occur in the data set. It is normally presented
as a histogram
(source: http://www.stats.gla.ac.uk/steps/glossary/presenting_data.html#freqtab and results of
a Google image search)

Part Two

Central Tendency and Data Spread

Frequency Table

Score   Frequency   Frequency (%)
0       4           13%
1       3           10%
2       5           17%
3       5           17%
4       6           20%
5       7           23%
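
A frequency table like the one above can be built by counting how often each value occurs. The sketch below is only an illustration: the raw scores are hypothetical, chosen so that the counts match the table, and the variable names are not from the original report.

```python
from collections import Counter

# Hypothetical raw scores chosen to reproduce the counts in the table above.
scores = [0] * 4 + [1] * 3 + [2] * 5 + [3] * 5 + [4] * 6 + [5] * 7

counts = Counter(scores)   # value -> number of occurrences
total = len(scores)

# Build (score, frequency, rounded percentage) rows in ascending score order.
table = [(s, counts[s], round(100 * counts[s] / total)) for s in sorted(counts)]

for score, freq, pct in table:
    print(f"{score:>5} {freq:>9} {pct:>5}%")
```

The same counts are what a histogram or bar chart would display as bar heights.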

Pie Chart

Bar Chart
Area Charts
Line Charts
Dot Plot
A dot plot is useful for identifying outliers; a simple line of the values is also useful for this purpose.

Histogram
Histogram with normal curve

Radar Charts

Map chart
Stem and Leaf Plot

Box and Whisker Plot (or Boxplot)

Polygon charts

Variability is the extent to which data values differ from each other.
Diversity, dispersion, spread and uncertainty are used with the same meaning.

Quantity                    Population (parameter)    Sample (statistic)
size                        N                         n
mean                        μ "mu"                    x̄ "x bar"
median                      n/a                       M or x̃ "x tilde"
proportion                  π "pi" (p in text)        p (p̂ in text)
spread: variance            σ² "sigma squared"        s² "s squared" = Σ(x − x̄)²/(n − 1)
spread: standard deviation  σ "sigma"                 s
z score = (x − mean)/sd     Z                         z
correlation coefficient     ρ "rho"                   r
slope                       β1 "beta 1"               b1
intercept                   β0 "beta naught"          b0

For simplicity, this report uses only the population symbols.

Range
Largest value minus smallest value

Arithmetic mean

Geometric mean

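Trimmed Mean
A fixed share of the smallest and largest values is removed before averaging, so that extreme values do not distort the mean

Median
The halfway point of the ordered data set: the value at position (n+1)/2 when n is odd, or the mean of the two
middle values when n is even

Mode
The most frequently occurring value or category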
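
The measures of central tendency listed above can be compared on one small data set. This is a minimal sketch using Python's standard `statistics` module; the data, the 15% trimming proportion and the helper name `trimmed_mean` are assumptions made for illustration.

```python
import statistics

def trimmed_mean(values, proportion):
    """Mean after dropping `proportion` of the values from each end."""
    ordered = sorted(values)
    k = int(len(ordered) * proportion)   # number of values cut from each tail
    kept = ordered[k:len(ordered) - k] if k else ordered
    return statistics.mean(kept)

data = [2, 4, 4, 5, 7, 8, 100]           # hypothetical data with one outlier

mean = statistics.mean(data)             # arithmetic mean, pulled up by 100
gmean = statistics.geometric_mean(data)  # n-th root of the product
tmean = trimmed_mean(data, 0.15)         # drops one value from each end
median = statistics.median(data)         # middle value of the sorted data
mode = statistics.mode(data)             # most frequent value
```

With an outlier present, the trimmed mean and the median stay close to the bulk of the data while the arithmetic mean does not.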

Percentiles
Percentiles are summary measures that express ranks as percentages from 0% to 100% rather than from 1 to n.
They are used:
To indicate the data value at a given percentage
To indicate the percentage ranking of a given data value

Extremes and quartiles


Extremes are the smallest and largest values.
Quartiles are the values at 25% and 75% of the ordered data.
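
As an illustration of range, extremes and quartiles, the sketch below uses `statistics.quantiles` from the Python standard library. The data are hypothetical, and the choice of the "inclusive" method (which treats the data as the whole population) is an assumption; other quartile conventions give slightly different values.

```python
import statistics

data = [1, 2, 3, 4, 5, 6, 7, 8, 9]        # hypothetical data

smallest, largest = min(data), max(data)  # the two extremes
data_range = largest - smallest           # range = largest - smallest

# Quartiles cut the ordered data at 25%, 50% and 75%.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
```

Here `q2` is the median, and `q1` and `q3` are the lower and upper quartiles.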

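Variance
For a population: σ² = Σ(x − μ)²/N

For a sample: s² = Σ(x − x̄)²/(n − 1)

Standard deviation (σ)


It is the square root of the variance and indicates the typical distance of the data values from the mean.

Variance sum law
For two independent variables X and Y, the variance of their sum or difference is the sum of their variances: σ²(X ± Y) = σ²(X) + σ²(Y)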
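
The population and sample variance differ only in the divisor (N versus n − 1), and the variance sum law can be checked numerically. The following is a small sketch with hypothetical data; pairing every x with every y is simply one way of making the two variables independent for the check.

```python
import statistics
from itertools import product

data = [2, 4, 4, 4, 5, 5, 7, 9]        # hypothetical data, mean = 5

pop_var = statistics.pvariance(data)   # divides by N
samp_var = statistics.variance(data)   # divides by n - 1
pop_sd = statistics.pstdev(data)       # square root of pop_var

# Variance sum law: for independent X and Y, Var(X + Y) = Var(X) + Var(Y).
# Pairing every x with every y makes the two variables independent.
xs, ys = [1, 2, 3], [10, 20]
sums = [x + y for x, y in product(xs, ys)]
var_of_sum = statistics.pvariance(sums)
sum_of_vars = statistics.pvariance(xs) + statistics.pvariance(ys)
```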

Percentile summary
The value reached at a given percentage of the data after the values have been ordered from smallest to largest.

Standard Deviation
It indicates how different the numbers are from one another.

Normal Distribution
It is an idealized, smooth, bell-shaped histogram with all of the randomness removed.
It represents an ideal set that has lots of numbers concentrated in the middle.
It is common for statistical procedures to assume that the data set is reasonably approximated by a
normal distribution. Example with 5 and standard deviation:

Skewed Distribution
It is neither symmetric nor normal, because the data values trail off more sharply on one side than on the
other. Pearson suggested the following equation to measure skewness [1]:

Skewness = 3(mean − median)/σ

Now more commonly used equation:

Skewness = Σ(x − μ)³/(N σ³)

(Figure: negatively and positively skewed distributions)

[1] Online Statistics: An Interactive Multimedia Course of Study

Kurtosis
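
Skewness and kurtosis can both be computed from standardized moments of the data. The sketch below assumes the standard moment-based definitions (third moment for skewness, fourth moment minus 3 for excess kurtosis) together with Pearson's mean-minus-median measure; the function names and data are illustrative only.

```python
import statistics

def pearson_skew(values):
    """Pearson's measure: 3 * (mean - median) / standard deviation."""
    mu = statistics.mean(values)
    return 3 * (mu - statistics.median(values)) / statistics.pstdev(values)

def moment(values, k):
    """k-th standardized central moment: mean of ((x - mu) / sigma) ** k."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)
    return sum(((x - mu) / sigma) ** k for x in values) / len(values)

right_tailed = [1, 2, 2, 3, 3, 3, 4, 10]    # hypothetical, long right tail
symmetric = [1, 2, 3, 4, 5]                 # hypothetical, symmetric

skew = moment(right_tailed, 3)              # > 0 for a positive (right) skew
excess_kurtosis = moment(symmetric, 4) - 3  # 0 for a normal distribution
```

A negative excess kurtosis, as for the flat symmetric data here, indicates thinner tails than a normal distribution.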

Sampling distribution
It is the distribution of a statistic over all possible samples of a given size from a population. It
depends strongly on the distribution of the population.

Standard error (standard deviation of sampling)

The mean of the sampling distribution of the mean is equal to the mean of the population:

μM = μ

The variance of the sampling distribution of the mean is the population variance divided by the sample size n:

σ²M = σ²/n

The standard deviation of the sampling distribution is referred to as the standard error of the quantity: σM = σ/√n
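
These properties of the sampling distribution of the mean can be verified exactly on a tiny example by listing every possible sample. This is a sketch under stated assumptions: the population is hypothetical and the samples are drawn with replacement.

```python
import statistics
from itertools import product

population = [1, 2, 3]   # hypothetical tiny population
n = 2                    # sample size

# Every possible sample of size n drawn with replacement (9 samples).
sample_means = [statistics.mean(s) for s in product(population, repeat=n)]

mu = statistics.mean(population)
sigma2 = statistics.pvariance(population)

mean_of_means = statistics.mean(sample_means)       # equals mu
var_of_means = statistics.pvariance(sample_means)   # equals sigma2 / n
standard_error = var_of_means ** 0.5                # equals sigma / sqrt(n)
```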

Normal Distribution and central limit theorem


The means of repeated samples drawn from a population will be approximately normally distributed, even when
the population itself is not. The larger the sample size, the closer the distribution of the sample means is to normal [2].

[2] Online Statistics: An Interactive Multimedia Course of Study

The following figures show normal distributions with different means and standard deviations.

Characteristics of Normal distributions


1. Symmetric around the mean
2. Mean, median and mode are at the same point
3. Area under the normal curve is 1.00
4. Dense in the center and thin at the tails
5. Fully described by the mean and SD
6. 68.27% of the data lie within 1 SD of the mean
7. 95.45% of the data lie within 2 SD of the mean
8. 99.73% of the data lie within 3 SD of the mean
9. 95% of the area lies within Z = ±1.96
10. 90% of the area lies within Z = ±1.645

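
The areas and Z values in the list above can be reproduced with `statistics.NormalDist` from the Python standard library. This is a small sketch; the helper name `area_within` is introduced here only for illustration.

```python
from statistics import NormalDist

z = NormalDist()  # standard normal: mean 0, standard deviation 1

def area_within(k):
    """Share of the distribution within k standard deviations of the mean."""
    return z.cdf(k) - z.cdf(-k)

within_1sd = area_within(1)   # about 0.6827
within_2sd = area_within(2)   # about 0.9545
within_3sd = area_within(3)   # about 0.9973

# Two-sided critical values: the central 95% and 90% of the area.
z_95 = z.inv_cdf(0.975)       # about 1.96
z_90 = z.inv_cdf(0.95)        # about 1.645
```
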
Binomial distribution
Each trial has exactly two mutually exclusive outcomes, for example the head and tail of a coin. The
distribution gives the probability of each possible number of successes in a fixed number of independent trials.
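
As a worked illustration of the coin example, the probability of each number of heads can be computed from the binomial formula C(n, k) p^k (1 − p)^(n − k). The number of tosses and the function name below are assumptions made for the example.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# Hypothetical example: number of heads in 10 tosses of a fair coin.
n, p = 10, 0.5
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]
prob_5_heads = pmf[5]   # 252 / 1024
```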

Bivariate and Multivariate analysis


Bivariate analysis deals with the association or relationship between two variables, whereas
multivariate analysis deals with more than two variables and their joint
effect. Both are used to test hypotheses and to identify the strength of the correlation between variables, or simply
the dependency of one variable on another.

Correlation
1. A causal, complementary, parallel, or reciprocal relationship, especially a structural,
functional, or qualitative correspondence between two comparable entities: a correlation
between drug abuse and crime.
2. Statistics. The simultaneous change in value of two numerically valued random
variables: the positive correlation between cigarette smoking and the incidence of lung
cancer; the negative correlation between age and normal vision.
3. An act of correlating or the condition of being correlated [3].

Hypotheses Testing
A statistical hypothesis test, or more briefly a hypothesis test, is an algorithm for choosing between alternatives
(for or against the hypothesis) in a way that minimizes certain risks.

Null hypotheses
It is denoted by Ho and represents the default possibility about the population, which you retain
unless you have convincing evidence to the contrary.

The research Hypotheses (alternative Hypotheses)


It is denoted by Ha and is accepted if there is convincing evidence that rules out the
null hypothesis as a reasonable possibility.

Example:
Ho: μa = μ0
Ha: μa ≠ μ0

[3] http://www.answers.com/topic/correlation

Two Tail and One Tail Test

One Tail Test
The researcher claims that the sample mean differs from the population mean in one stated direction only (greater, or less).
Two Tail Test
The researcher claims that the sample mean may differ from the population mean in either direction
(greater or less).

Hypotheses Testing Methods


Z test
t Test
p value

Z Score Test
Because of the central limit theorem, many statistical analyses are possible, since the distribution of sample means is approximately normal.
Z-tests work best when the sample size is not too small. The Z value expresses distance from
the mean of a data set in units of standard deviation.
The Z-test is a statistical test based on the normal distribution and is mainly used for
problems involving large samples, n ≥ 30 (http://www.experiment-resources.com/z-test.html#ixzz0zCnm9iX5)

Types of Z test
1. Z test for a single proportion, to test a hypothesis on a specific value of the proportion, Ho: P = Po.
2. Z test for two different groups of data, e.g. drinking habits of males and females.
3. Z test for a specific value of the population mean. It is used when the sample size is > 30 and the standard
deviation is known.
4. Test of the variance against a specific value of the population variance.
5. Test of equality of two sets of variables when the sample size is > 30 [4].

Z value calculation
Formula of Z value:

Z = (x̄ − μ)/(σ/√n)

The Z value is used to find the corresponding P value in the normal table, or is compared directly with the critical Z
value. If the P value is less than alpha, we reject the null hypothesis.

t Test (William Sealy Gosset, 1908)


It is used for:
1. Single sample t test
2. Two independent samples t test
3. Paired samples t test (before treatment and after treatment)
4. Testing whether the slope of a regression line is equal to zero or not
It is useful for small samples (n less than 30).
Assumptions:
The data should be normally distributed, which can be checked with a histogram, and the groups should have equal variances,
which can be checked with Levene's test.

One sample t test:
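
A one-sample t test compares a sample mean with a hypothesized mean using the sample standard deviation, t = (x̄ − μ0)/(s/√n) with n − 1 degrees of freedom. The sketch below uses hypothetical data; since the Python standard library has no t distribution, the statistic is compared with a critical value taken from a t table.

```python
from math import sqrt
import statistics

def one_sample_t(values, mu0):
    """Return (t statistic, degrees of freedom) for Ho: mean == mu0."""
    n = len(values)
    s = statistics.stdev(values)   # sample SD (divides by n - 1)
    t = (statistics.mean(values) - mu0) / (s / sqrt(n))
    return t, n - 1

sample = [12, 14, 15, 15, 16, 17, 18, 19, 20, 24]   # hypothetical data
t, df = one_sample_t(sample, mu0=15)

# Two-tailed critical value for df = 9 at alpha = 0.05, from a t table.
T_CRIT = 2.262
reject_h0 = abs(t) > T_CRIT
```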

P value
The p value is the probability of obtaining a test statistic at least as extreme as the one observed, assuming that the
null hypothesis is true. A smaller p value gives stronger grounds for rejecting the null hypothesis.
The most common significance level is 0.05 (95% confidence); however, 0.1 and 0.01 are also used.

[4] Choudhury, Amit (2009). Z-Test. Retrieved [Date of Retrieval] from Experiment Resources: http://www.experiment-resources.com/z-test.html

Correlation
1. For parametric statistics: Pearson's product-moment correlation
2. For nonparametric statistics: Spearman's rank correlation [5]

The following equation is used to measure the coefficient of correlation [6]:

Another equation to calculate correlation coefficient:
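
Both correlation coefficients named above can be computed directly from their definitions. This is a minimal sketch, assuming the usual product-moment formula for Pearson's r and Spearman's coefficient as Pearson's r applied to ranks; the ranking helper assumes no tied values, and the data are hypothetical.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson product-moment correlation coefficient."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def ranks(values):
    """Rank from 1 (smallest); this sketch assumes there are no ties."""
    order = sorted(values)
    return [order.index(v) + 1 for v in values]

def spearman_rho(xs, ys):
    """Spearman rank correlation: Pearson's r computed on the ranks."""
    return pearson_r(ranks(xs), ranks(ys))

x = [1, 2, 3, 4, 5]      # hypothetical paired data
y = [1, 4, 9, 16, 25]    # increases with x, but not linearly
r = pearson_r(x, y)
rho = spearman_rho(x, y)
```

Because y always increases with x, the rank correlation is exactly 1, while Pearson's r is slightly below 1 since the relationship is not a straight line.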

Regression Analysis
Regression analysis is a process to find the best fit line to explain the relationship between the independent and
dependent variable. It is written as:

Simple regression:
Y = b0 + b1X + є

Multiple regression:
Y = b0 + b1X1 + b2X2 + b3X3 + … + bnXn + є

Where:
b0 = intercept on the Y axis
Y = value of the dependent variable
b1…bn = coefficients of the independent variables
X = independent variable
є = noise, or the effect of unknown variables (it may be ignored)

[5] http://www.answers.com/topic/correlation-coefficient
[6] Online Statistics: An Interactive Multimedia Course of Study

Assumption for regression analysis:


1. The sample is truly representative of the population
2. Linearity in the data
3. Existence of homoscedasticity (constant variance of the errors)

Equation for regression


Slope of the line: b1 = Σ(x − x̄)(y − ȳ)/Σ(x − x̄)²
Intercept: b0 = ȳ − b1·x̄

Example taken from http://faculty.uncfsu.edu/dwallace/lesson%2018.pdf
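
The slope and intercept of a simple regression line can be estimated by least squares. The sketch below assumes the standard least-squares estimates for Y = b0 + b1X; the data and the function name `fit_line` are hypothetical and are not taken from the cited lesson.

```python
def fit_line(xs, ys):
    """Least-squares estimates (b0, b1) for Y = b0 + b1 * X."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    b1 = sxy / sxx       # slope
    b0 = my - b1 * mx    # intercept
    return b0, b1

x = [1, 2, 3, 4, 5]                  # hypothetical data
y = [2.1, 3.9, 6.2, 7.8, 10.0]
b0, b1 = fit_line(x, y)

# Residuals are the part of Y the line does not explain (the noise term).
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
```

The residuals of a least-squares line with an intercept always sum to zero, which is a quick check on the fit.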


Example of regression analysis

Model Summary

Model   R         R Square   Adjusted R Square   Std. Error of the Estimate
1       .733(a)   .537       .512                12.92502

Change Statistics
R Square Change   F Change   df1   df2   Sig. F Change
.537              21.659     3     56    .000

a. Predictors: (Constant), Thermal Conductivity at Dry State (W.m^-1.K^-1), Sample Thickness at Dry State (mm), Thermal Resistance at Dry State (K.m^2.W^-1)

ANOVA(b)

Model 1      Sum of Squares   df   Mean Square   F        Sig.
Regression   10854.967        3    3618.322      21.659   .000(a)
Residual     9355.145         56   167.056
Total        20210.112        59

a. Predictors: (Constant), Thermal Conductivity at Dry State (W.m^-1.K^-1), Sample Thickness at Dry State (mm), Thermal Resistance at Dry State (K.m^2.W^-1)
b. Dependent Variable: Thermal Absorptivity at Dry State (W.m^-2.s^1/2.K^-1)

Coefficients(a)

B and Std. Error are the unstandardized coefficients, Beta is the standardized coefficient, and Tolerance and VIF are the collinearity statistics.

Model 1                                        B           Std. Error   Beta     t        Sig.   Tolerance   VIF
(Constant)                                     -80.481     173.186               -.465    .644
Thermal Resistance at Dry State (K.m^2.W^-1)   12696.547   10202.604    1.786    1.244    .219   .004        249.080
Sample Thickness at Dry State (mm)             -334.270    199.663      -1.937   -1.674   .100   .006        161.903
Thermal Conductivity at Dry State (W.m^-1.K^-1) 6192.067   3260.818     .774     1.899    .063   .050        20.091

a. Dependent Variable: Thermal Absorptivity

Explanation of model:
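Adjusted R square = .512 (51.2%) means that about 51.2% of the variation in the dependent variable is explained by these
independent variables. The significant F change (Sig. = .000) shows that the model as a whole is significant. The standardized
coefficients (Beta) express the relative contribution of each independent variable, and the Sig. value of each coefficient shows
whether that variable is significant in the regression equation; a value below 0.05 indicates that the variable is
significant. Here none of the three predictors reaches that level individually (Sig. = .219, .100 and .063), and the large VIF
values (20 to 249) point to strong collinearity among them.

Multinomial logistic regression [7]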
It is used to:
1. Analyze the relationship between a non-metric dependent variable and metric or dichotomous independent
variables
2. Compare multiple groups through a combination of binary logistic regressions

For a dependent variable with three groups, it provides:

1. Coefficients for each of the two comparisons
2. Three equations, one for each group defined by the dependent variable
3. A comparison between predicted and actual group membership, which gives a measure of
classification accuracy

Requirements of multinomial logistic regression analysis:

1. The dependent variable should be non-metric; the independent variables should be metric or
dichotomous
2. Dichotomous, nominal and ordinal variables can satisfy the requirement for a non-metric dependent variable

Results of Multinomial Logistic Regression

1. The overall relationship between the independent variables and the groups defined by the
dependent variable
2. The difference in fit between models follows a chi-square distribution and is used for significance testing

Examples:
1. Influence of the father's profession and education on occupation preference
2. Effect of food and exercise on a certain disease
3. Selection of brands based on gender and age

[7] Source: www.utexas.edu/.../MultinomialLogisticRegression_BasicRelationships.ppt, SW388R7
Data Analysis & Computers II

Chi-Square Test
The chi-square test is used to find an association between two categorical variables whose counts are written in the form of a
matrix (two-way table) [8]:

X² = Σ (O − E)²/E

Where:
X² = chi-square value
O = observed frequency
E = expected frequency
Example:

         Short   Tall   Total
Male     24      20     44
Female   36      5      41
Total    60      25     85

Expected values are calculated by using the rules of probability:

Probability that a person is short: 60/85 = 0.706
Probability that a person is male: 44/85 = 0.518
Probability that a person is male and short, if gender and height are independent: 0.706 × 0.518 = 0.366
Expected frequency of persons who are male and short: 0.366 × 85 = 31.1
(All other expected values can be calculated in the same way.)

Cell            Observed values   Expected values   (O−E)²/E
Male, short     24                31.1              1.62
Female, short   36                28.9              1.74
Male, tall      20                12.9              3.91
Female, tall    5                 12.1              4.17
Total           85                85                X² = 11.4

Degrees of freedom = (rows − 1)(columns − 1) = (2 − 1)(2 − 1) = 1

Values from the chi-square distribution:
For p = 0.05 and 1 degree of freedom, the critical value of X² is 3.84, whereas our value is 11.4, which is much higher.
We therefore reject the null hypothesis of no association and conclude that there is an association between
gender and height.

[8] Source: http://science.jrank.org/pages/1401/Chi-Square-Test.html
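
The whole chi-square calculation for the gender and height table can be reproduced in a few lines. This is a sketch in plain Python; it computes each expected count as row total × column total / n, which is equivalent to the probability argument above. Without intermediate rounding the statistic comes out at about 11.3 rather than 11.4. The variable names are illustrative.

```python
observed = [[24, 20],   # male:   short, tall
            [36, 5]]    # female: short, tall

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected count under independence: row total * column total / n.
expected = [[r * c / n for c in col_totals] for r in row_totals]

chi2 = sum((o - e) ** 2 / e
           for o_row, e_row in zip(observed, expected)
           for o, e in zip(o_row, e_row))
df = (len(observed) - 1) * (len(observed[0]) - 1)

CRITICAL_005_DF1 = 3.84   # chi-square critical value for p = 0.05, df = 1
reject_h0 = chi2 > CRITICAL_005_DF1
```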

Crosstabs

It is a nonparametric technique used to measure the association between two categorical variables while
controlling for other categories.

Example:

People with high salaries are more likely to go on vacation than people with low
salaries.

The Pearson chi-square and the likelihood-ratio chi-square are most commonly used as tests of significance.

Results from SPSS:


Interpretation of the results: the difference is due to chance, and there is no difference in the services offered by
different stores.
