Dr. P. V. Sudeep
Dept. of Electronics and Communication Engineering
National Institute of Technology, Karnataka
Agenda
Review of Statistics
Basics
Statistics: The science of collecting, describing, and interpreting data.
Population: A collection, or set, of individuals or objects or events
whose properties are to be analyzed.
Two kinds of populations: finite or infinite.
Sample: A subset of the population.
Distribution (of a variable): tells us what values the variable takes
and how often it takes these values.
Unimodal - having a single peak
Bimodal - having two distinct peaks
Symmetric - left and right half are mirror images.
An example
A college dean is interested in learning about the average age of faculty. Identify the basic terms in
this situation.
The population is the age of all faculty members at the college.
A sample is any subset of that population. For example, we might select 10 faculty members and
determine their age.
The variable is the age of each faculty member.
One data value (singular) would be the age of a specific faculty member.
The data would be the set of values in the sample.
The experiment would be the method used to select the ages forming the sample and determining
the actual age of each faculty member in the sample.
The parameter of interest is the average age of all faculty at the college.
The statistic is the average age for all faculty in the sample.
Terms
Variable: A characteristic about each individual element of a population or
sample.
Data (singular): The value of the variable associated with one element of a
population or sample. This value may be a number, a word, or a symbol.
Data (plural): The set of values collected for the variable from each of the
elements belonging to the sample.
Experiment: A planned activity whose results yield a set of data.
Parameter: A numerical value summarizing all the data of an entire
population.
Statistic: A numerical value summarizing the sample data.
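The distinction between a parameter and a statistic can be sketched with the faculty-age example. The ages below are randomly generated, hypothetical values (the slides give no data), and the population size of 200 is likewise an assumption.

```python
import random
import statistics

# Hypothetical population: ages of all 200 faculty members (made-up values).
random.seed(0)
population_ages = [random.randint(28, 68) for _ in range(200)]

# Parameter: a numerical value summarizing the entire population.
parameter_mean = statistics.mean(population_ages)

# Sample: a subset of the population, here 10 members chosen at random.
sample_ages = random.sample(population_ages, 10)

# Statistic: a numerical value summarizing only the sample.
statistic_mean = statistics.mean(sample_ages)

print(f"parameter (population mean age): {parameter_mean:.2f}")
print(f"statistic (sample mean age):     {statistic_mean:.2f}")
```

The two printed values generally differ; a different sample would give a different statistic, while the parameter stays fixed.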
A Taxonomy of Statistics
Two areas of statistics:
Descriptive Statistics: collection, presentation,
and description of sample data.
Inferential Statistics: making decisions and
drawing conclusions about populations.
Types of Statistics
Techniques that summarize and describe characteristics of a group
or make comparisons of characteristics between groups are known
as descriptive statistics.
Univariate Analysis
Univariate analysis is the simplest form of analyzing data.
"Uni" means one; in other words, your data has only
one variable.
It doesn't deal with causes or relationships (unlike
regression). Its major purpose is to describe: it takes
data, summarizes that data, and finds patterns in the data.
Data Presentation
Two types of statistical presentation of data - graphical
and numerical.
Graphical Presentation: We look for the overall pattern
and for striking deviations from that pattern. The overall
pattern is usually described by the shape, center, and spread of
the data. An individual value that falls outside the overall
pattern is called an outlier.
Bar diagrams and pie charts are used for categorical
variables.
Histograms, stem-and-leaf plots, and box plots are used for
numerical variables.
Mean                  90.41666667
Standard Error        3.902649518
Median                84
Mode                  84
Standard Deviation    30.22979318
Sample Variance       913.8403955
Kurtosis              -1.183899591
Skewness              0.389872725
Range                 95
Minimum               48
Maximum               143
Sum                   5425
Count                 60
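Several figures in this summary output are tied together by simple identities: the mean is Sum/Count, the standard error is the standard deviation divided by the square root of the count, the sample variance is the squared standard deviation, and the range is Maximum minus Minimum. The sketch below recomputes them from the reported values only; the underlying 60 observations are not available, and the variable names are illustrative.

```python
import math

# Values reported in the summary output.
count = 60
total = 5425
std_dev = 30.22979318
minimum, maximum = 48, 143

mean = total / count                    # Sum / Count
std_error = std_dev / math.sqrt(count)  # SD / sqrt(n)
variance = std_dev ** 2                 # SD squared
data_range = maximum - minimum          # Maximum - Minimum

print(f"mean           = {mean:.8f}")
print(f"standard error = {std_error:.9f}")
print(f"variance       = {variance:.7f}")
print(f"range          = {data_range}")
```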
Numerical Presentation
A fundamental concept in summary statistics is that of a central
value for a set of observations and the extent to which the central
value characterizes the whole set of data. Measures of central
value such as the mean or median must be coupled with measures
of data dispersion (e.g., average distance from the mean) to
indicate how well the central value characterizes the data as a
whole.
To understand how well a central value characterizes a set of observations, let us consider
the following two sets of data:
A: 30, 50, 70
B: 40, 50, 60
The mean of both data sets is 50. However, the observations in data set A lie farther from
the mean than those in data set B. Thus, the mean of data set B is a better
representation of its data than the mean of data set A is of its data.
$$ \bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n} \sum_{i=1}^{n} x_i $$
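The comparison of data sets A and B can be reproduced in a few lines. The sketch uses the average absolute distance from the mean as the dispersion measure, which is one possible reading of "average distance from the mean"; the helper names are illustrative.

```python
def mean(values):
    """Arithmetic mean: sum of the observations divided by their count."""
    return sum(values) / len(values)


def mean_abs_deviation(values):
    """Average absolute distance of the observations from their mean."""
    m = mean(values)
    return sum(abs(v - m) for v in values) / len(values)


set_a = [30, 50, 70]
set_b = [40, 50, 60]

for name, data in (("A", set_a), ("B", set_b)):
    print(f"{name}: mean = {mean(data):.1f}, "
          f"average distance from mean = {mean_abs_deviation(data):.2f}")
```

Both sets have the same mean, but the smaller average distance for B shows that its mean describes the data more closely.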
Mean or Median
The median is less sensitive to outliers (extreme scores)
than the mean and is thus a better measure than the mean
for highly skewed distributions, e.g. family income. For
example, the mean of 20, 30, 40, and 990 is
(20 + 30 + 40 + 990)/4 = 270. The median of these four
observations is (30 + 40)/2 = 35. Here 3 observations out of
4 lie between 20 and 40, so the mean of 270 fails to give
a realistic picture of the major part of the data. It is
influenced by the extreme value 990.
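The same four observations can be checked with Python's standard `statistics` module. This is only a quick verification of the arithmetic above.

```python
import statistics

observations = [20, 30, 40, 990]

mean_value = statistics.mean(observations)      # pulled upward by 990
median_value = statistics.median(observations)  # average of the two middle values

print("mean   =", mean_value)
print("median =", median_value)

# Dropping the extreme value changes the mean sharply but the median only slightly.
without_outlier = [20, 30, 40]
print("mean without 990   =", statistics.mean(without_outlier))
print("median without 990 =", statistics.median(without_outlier))
```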
Methods of Variability Measurement
Variability (or dispersion) measures the amount of
scatter in a dataset.
Commonly used methods: range, variance, standard
deviation, interquartile range, coefficient of
variation etc.
Range: The difference between the largest and the
smallest observations. The range of 10, 5, 2, 100 is (100 - 2) = 98. It's a crude measure of variability.
Methods of Variability Measurement
Variance: The variance of a set of observations is the average of
the squares of the deviations of the observations from their
mean. In symbols, the variance of the n observations
$x_1, x_2, \ldots, x_n$ is
$$ S^2 = \frac{(x_1 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1} $$
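The range and the variance can be computed directly from their definitions. The sketch below uses the observations 10, 5, 2, 100 from the range example and assumes the n - 1 divisor for the variance; the standard deviation is taken as the square root of that variance.

```python
import math
import statistics

observations = [10, 5, 2, 100]

# Range: largest observation minus smallest observation.
data_range = max(observations) - min(observations)

# Variance: sum of squared deviations from the mean, divided by n - 1.
n = len(observations)
x_bar = sum(observations) / n
sample_variance = sum((x - x_bar) ** 2 for x in observations) / (n - 1)

# Standard deviation: square root of the variance.
sample_sd = math.sqrt(sample_variance)

print("range    =", data_range)
print("variance =", round(sample_variance, 4))
print("SD       =", round(sample_sd, 4))
```

`statistics.variance` and `statistics.stdev` in the standard library use the same n - 1 divisor and can serve as a cross-check.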
Normal distribution
(bell curve)
Quartiles: Data can be divided into four regions that cover the total range
of observed values. Cut points for these regions are known as quartiles.
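Quartile cut points can be obtained with `statistics.quantiles`. The data values below are hypothetical, and the function's default interpolation method is only one of several conventions in use, so other software may report slightly different quartiles for the same data.

```python
import statistics

# Hypothetical observations, sorted for readability.
data = [48, 60, 72, 84, 84, 96, 120, 143]

# n=4 splits the data into four regions and returns the three cut points.
q1, q2, q3 = statistics.quantiles(data, n=4)

# Interquartile range: spread of the middle half of the data.
iqr = q3 - q1

print("Q1 =", q1)
print("Q2 (median) =", q2)
print("Q3 =", q3)
print("IQR =", iqr)
```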
Shape of Data
Shape of data is measured by
Skewness
Kurtosis
Skewness
Measures asymmetry of data
Positive or right skewed: Longer right tail
Negative or left skewed: Longer left tail
Let $x_1, x_2, \ldots, x_n$ be n observations. Then,
$$ \text{Skewness} = \frac{\sqrt{n} \sum_{i=1}^{n} (x_i - \bar{x})^3}{\left( \sum_{i=1}^{n} (x_i - \bar{x})^2 \right)^{3/2}} $$
Kurtosis
Measures peakedness of the distribution of data. The
kurtosis of normal distribution is 0.
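$$ \text{Kurtosis} = \frac{n \sum_{i=1}^{n} (x_i - \bar{x})^4}{\left( \sum_{i=1}^{n} (x_i - \bar{x})^2 \right)^{2}} - 3 $$
Subtracting 3 makes the kurtosis of the normal distribution equal to 0.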
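A minimal sketch of both shape measures follows. It assumes the moment-based definitions, with 3 subtracted from the kurtosis so that a normal distribution scores 0; statistical packages often apply additional small-sample corrections, so their output may differ slightly. The function names are illustrative.

```python
import math
import random


def skewness(values):
    """Moment-based skewness: sqrt(n) * sum(d^3) / (sum(d^2))^(3/2)."""
    n = len(values)
    x_bar = sum(values) / n
    sum_sq = sum((x - x_bar) ** 2 for x in values)
    sum_cube = sum((x - x_bar) ** 3 for x in values)
    return math.sqrt(n) * sum_cube / sum_sq ** 1.5


def kurtosis(values):
    """Moment-based excess kurtosis: n * sum(d^4) / (sum(d^2))^2 - 3."""
    n = len(values)
    x_bar = sum(values) / n
    sum_sq = sum((x - x_bar) ** 2 for x in values)
    sum_fourth = sum((x - x_bar) ** 4 for x in values)
    return n * sum_fourth / sum_sq ** 2 - 3


symmetric = [30, 50, 70]          # mirror image around the mean
right_skewed = [20, 30, 40, 990]  # long right tail

# Simulated normal data (hypothetical), used to illustrate kurtosis near 0.
random.seed(1)
normal_sample = [random.gauss(0, 1) for _ in range(20000)]

print("skewness, symmetric data:   ", skewness(symmetric))
print("skewness, right-skewed data:", skewness(right_skewed))
print("kurtosis, normal sample:    ", kurtosis(normal_sample))
```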
Bivariate Analysis
Bivariate analysis involves two variables (often denoted
as X, Y) and aims to determine the empirical
relationship between them, i.e., it is the analysis of the
relationship between the two variables.
Bivariate analysis is a simple (two-variable) special case of
multivariate analysis (where multiple relations between
multiple variables are examined simultaneously).
Level of Measurement
Nominal measurement is a classification system; we use numbers
instead of names to identify things.
For example, if we wanted to code religion, we might say 1 =
Catholic, 2 = Protestant, 3 = Jewish, etc. That does not mean that a
Protestant is more religious than a Catholic and less religious than a
Jew. The numbers we use are arbitrary, and you can't perform
mathematical operations on them (e.g., add a Catholic and a Protestant and get
a Jew). Categories should be mutually exclusive and exhaustive. That
is, you should only be able to classify something one way, and you
should have a category for every possible value.
Level of Measurement
With ordinal measurement, categories are ranked in order of their
values on some property. Class ranks are an example (highest
score, second highest score, etc.). However, the distances between
ranks do not have to be the same. For example, the highest scoring
person may have scored one more point than the 2nd highest, she
may have scored 5 more points than the third highest, etc.
Level of Measurement
With interval level measurement, the distance between each number is the
same. For example, the distance between 1 and 2 is the same as the distance
between 15 and 16. With interval measurement, we can determine not only
that a person ranks higher but how much higher they rank. You can do
addition and subtraction with interval level measures, but not multiplication
and division.
With ratio level measures, you can do addition, subtraction, and
multiplication and division. With ratio measures, you have an absolute, fixed,
and nonarbitrary zero point.
Examples
Fahrenheit and centigrade scales of temperatures are interval-level measures.
They are not ratio-level because the zero point is arbitrary. For example, in the F
scale 32 degrees happens to be the point where water freezes. There is no reason
you couldn't shift everything down by 32 degrees, and have 0 be the point where
water freezes. Or, add 68, and have 100 be the freezing point. The zero point is
arbitrary. It is not correct to say that, if it is 70 degrees outside, it is
twice as warm as it would be if it were 35 degrees outside.
Such things as age and income, however, have nonarbitrary zero points. If you
have zero dollars, that literally means that you have no income. If you are 20 years
old, that literally means you have been around for 20 years. Further, $10,000 is
exactly twice as much as $5,000. If you are 20, you are half as old as someone who
is 40.
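The effect of an arbitrary zero point on ratios can be illustrated numerically. The sketch below compares 70 and 35 degrees Fahrenheit on three scales; the Kelvin scale does not appear in the slides and is brought in here as an assumption, as an example of a temperature scale with a nonarbitrary zero.

```python
def fahrenheit_to_kelvin(deg_f):
    """Convert Fahrenheit to Kelvin, a scale whose zero is absolute zero."""
    return (deg_f - 32) * 5 / 9 + 273.15


# Ratio taken directly on the Fahrenheit scale.
naive_ratio = 70 / 35

# Same two temperatures after shifting the zero to the freezing point of water.
shifted_ratio = (70 - 32) / (35 - 32)

# Same two temperatures on a scale with a nonarbitrary zero.
kelvin_ratio = fahrenheit_to_kelvin(70) / fahrenheit_to_kelvin(35)

# Income has a true zero, so its ratio is meaningful as it stands.
income_ratio = 10_000 / 5_000

print("Fahrenheit ratio:        ", naive_ratio)
print("Shifted-zero ratio:      ", round(shifted_ratio, 2))
print("Kelvin ratio:            ", round(kelvin_ratio, 3))
print("Income ratio ($10k/$5k): ", income_ratio)
```

The temperature "ratio" changes with every choice of zero, which is why ratios are not meaningful for interval-level measures, while differences (70 - 35) are unaffected by such shifts.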
Bivariate data
Be clear about the difference between bivariate data and
two-sample data. In two-sample data, the X and Y values are
not paired, and there aren't necessarily the same number of X
and Y values.
Marginal Density
The distribution of X and the distribution of Y can be
considered individually using univariate methods. That is,
we can analyse the Xi values or the Yi values separately
using CDFs, densities, quantile functions, etc. Any property
that describes the behavior of the Xi values alone or the Yi
values alone is called a marginal property.
Joint Property
The most interesting questions relating to bivariate data
deal with X and Y simultaneously. These questions are
investigated using properties that describe X and Y
simultaneously. Such properties are called joint
properties.
A complete summary of the statistical properties of (X, Y)
is given by the joint distribution.
Example
If the sample space is finite, the joint distribution is represented
in a table, where the X sample space corresponds to the rows,
and the Y sample space corresponds to the columns.
For example, if we flip two coins (assumed fair and independent), the joint distribution is

         Y = H    Y = T
X = H    1/4      1/4
X = T    1/4      1/4

The marginal distributions can always be obtained from the joint
distribution by summing the rows (to get the marginal X
distribution), or by summing the columns (to get the marginal Y
distribution). For this example, the marginal X and Y
distributions are both {H: 1/2, T: 1/2}.
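Summing a joint distribution to get its marginals can be sketched as follows. The example assumes two fair, independent coins, so each of the four outcomes has probability 1/4; that assumption is consistent with the marginals stated above but is not spelled out in the slides.

```python
from fractions import Fraction

outcomes = ["H", "T"]

# Joint distribution of (X, Y): first coin indexes rows, second coin columns.
# Assumption: fair, independent coins, so every cell is 1/4.
joint = {(x, y): Fraction(1, 4) for x in outcomes for y in outcomes}

# Marginal of X: sum each row over all values of Y.
marginal_x = {x: sum(joint[(x, y)] for y in outcomes) for x in outcomes}

# Marginal of Y: sum each column over all values of X.
marginal_y = {y: sum(joint[(x, y)] for x in outcomes) for y in outcomes}

print("marginal X:", {k: str(v) for k, v in marginal_x.items()})
print("marginal Y:", {k: str(v) for k, v in marginal_y.items()})
```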
Scatterplot
The most important graphical
summary of bivariate data is the
scatterplot. This is simply a plot of
the points ( Xi, Yi) in the plane.
Scatterplot
A key feature in a scatterplot is the association, or trend between X and Y.
Higher January temperatures tend to be paired with higher June
temperatures, so these two values have a positive association (As X
increases, Y increases).
Higher latitudes tend to be paired with lower January
temperatures, so these values have a negative association (as X increases, Y
decreases).
If higher X values are paired with low or with high Y values equally often,
there is no association (No consistent tendency for values on Y to increase or
decrease as X increases).
Correlation coefficient
Suppose we would like to numerically quantify the trend in
a bivariate scatterplot.
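The most common means of doing this is the correlation
coefficient (sometimes called Pearson's correlation
coefficient):
$$ r = \frac{\mathrm{cov}(X, Y)}{s_X s_Y}, \qquad \mathrm{cov}(X, Y) = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{N - 1} $$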
[Figure: scatterplots illustrating cov(X,Y) > 0, cov(X,Y) < 0, and cov(X,Y) = 0]
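The sign of the covariance reflects the direction of the association. The sketch below uses small hypothetical data sets, chosen only to produce a positive, a negative, and a zero covariance, and assumes the sample covariance with an N - 1 divisor.

```python
def covariance(xs, ys):
    """Sample covariance: sum of products of paired deviations over N - 1."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)


x = [1, 2, 3, 4, 5]
y_increasing = [2, 4, 5, 4, 6]  # Y tends to rise with X
y_decreasing = [6, 4, 5, 4, 2]  # Y tends to fall as X rises
y_no_trend = [3, 1, 4, 1, 3]    # no consistent rise or fall

print("positive association:", covariance(x, y_increasing))
print("negative association:", covariance(x, y_decreasing))
print("no association:      ", round(covariance(x, y_no_trend), 10))
```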
Linear Correlation
[Figures: scatterplots contrasting linear and curvilinear relationships, strong and weak relationships, and no relationship]
Example
DATA
Find the mean, SD, and covariance.
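$$ \mathrm{cov}_{\text{cig.\&CHD}} = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{N - 1} = \frac{222.44}{21 - 1} = 11.12 $$
$$ r = \frac{\mathrm{cov}_{XY}}{s_X s_Y} = \frac{11.12}{(2.33)(6.69)} = \frac{11.12}{15.59} = .713 $$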
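The arithmetic of this example can be retraced from the reported summary values. The raw data table is not available here, so the sketch starts from the sum of cross-products (222.44), N = 21, and the two standard deviations, and rounds the covariance to two decimals as the worked solution does. The general-purpose `pearson_r` helper is an illustrative addition, and the small data set passed to it is hypothetical.

```python
import math

# Summary values reported in the worked example.
sum_cross_products = 222.44
n = 21
s_x, s_y = 2.33, 6.69

cov_xy = round(sum_cross_products / (n - 1), 2)  # rounded as in the example
r = cov_xy / (s_x * s_y)

print("covariance =", cov_xy)
print("r          =", round(r, 3))


def pearson_r(xs, ys):
    """Pearson correlation computed from paired raw data."""
    m = len(xs)
    x_bar, y_bar = sum(xs) / m, sum(ys) / m
    cov = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (m - 1)
    sd_x = math.sqrt(sum((x - x_bar) ** 2 for x in xs) / (m - 1))
    sd_y = math.sqrt(sum((y - y_bar) ** 2 for y in ys) / (m - 1))
    return cov / (sd_x * sd_y)


# Hypothetical perfectly linear data: the correlation should be 1.
print("r for y = 2x + 1:", pearson_r([1, 2, 3, 4], [3, 5, 7, 9]))
```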
Statistical Software
There are many software packages for statistical analysis
and visualization of data. Some of them are SAS (Statistical
Analysis System), S-plus, R, Matlab, Minitab, BMDP,
Stata, SPSS, StatXact, Statistica, LISREL, JMP, GLIM, HIL,
MS Excel, etc.
http://www.R-project.org