
Preliminaries

Dr. P. V. Sudeep
Dept. of Electronics and Communication Engineering
National Institute of Technology, Karnataka

Agenda
Review of Statistics

Basics
Statistics: The science of collecting, describing, and interpreting data.
Population: A collection, or set, of individuals or objects or events
whose properties are to be analyzed.
Two kinds of populations: finite or infinite.
Sample: A subset of the population.
Distribution (of a variable): tells us what values the variable takes
and how often it takes these values.
Unimodal - having a single peak
Bimodal - having two distinct peaks
Symmetric - left and right half are mirror images.

Comparison of Probability and Statistics
Probability: Properties of the population are assumed
known. Answer questions about the sample based on
these properties.

Statistics: Use information in the sample to draw a
conclusion about the population.

An example
A college dean is interested in learning about the average age of faculty. Identify the basic terms in
this situation.
The population is the set of ages of all faculty members at the college.
A sample is any subset of that population. For example, we might select 10 faculty members and
determine their ages.
The variable is the age of each faculty member.
One datum (singular) would be the age of a specific faculty member.
The data would be the set of values in the sample.
The experiment would be the method used to select the faculty members forming the sample and to determine
the actual age of each faculty member in the sample.
The parameter of interest is the average age of all faculty at the college.
The statistic is the average age for all faculty in the sample.

Terms
Variable: A characteristic about each individual element of a population or
sample.
Datum (singular): The value of the variable associated with one element of a
population or sample. This value may be a number, a word, or a symbol.
Data (plural): The set of values collected for the variable from each of the
elements belonging to the sample.
Experiment: A planned activity whose results yield a set of data.
Parameter: A numerical value summarizing all the data of an entire
population.
Statistic: A numerical value summarizing the sample data.

A Taxonomy of Statistics
Two areas of statistics:
Descriptive Statistics: collection, presentation,
and description of sample data.
Inferential Statistics: making decisions and
drawing conclusions about populations.

Types of Statistics
Techniques that summarize and describe characteristics of a group
or make comparisons of characteristics between groups are known
as descriptive statistics.

Inferential statistics are used to make generalizations or
inferences about a population based on findings from a sample.

The choice of a type of analysis is based on the evaluation
questions, the type of data collected, and the audience who will
receive the results.

Univariate Analysis
Univariate analysis is the simplest form of analyzing data.
"Uni" means "one", so in other words your data has only
one variable.
It doesn't deal with causes or relationships (unlike
regression) and its major purpose is to describe; it takes
data, summarizes that data and finds patterns in the data.

Statistical Description of Data


Statistics describes a numeric set of data by its
Center
Variability
Shape
Statistics describes a categorical set of data by
Frequency, percentage or proportion of each category

Data Presentation
There are two types of statistical presentation of data: graphical
and numerical.
Graphical Presentation: We look for the overall pattern
and for striking deviations from that pattern. The overall
pattern is usually described by the shape, center, and spread of
the data. An individual value that falls outside the overall
pattern is called an outlier.
Bar diagrams and pie charts are used for categorical
variables.
Histograms, stem-and-leaf plots and box plots are used for
numerical variables.

Bar Diagram: Lists the categories and presents the percent or
count of individuals who fall in each category.
Pie Chart: Lists the categories and presents the percent or
count of individuals who fall in each category.

Histogram: Overall pattern can be described by its shape,
center, and spread. The following age distribution is right
skewed. The center lies between 80 and 100. There are no outliers.
Statistic            Value
Mean                 90.41666667
Standard Error       3.902649518
Median               84
Mode                 84
Standard Deviation   30.22979318
Sample Variance      913.8403955
Kurtosis             -1.183899591
Skewness             0.389872725
Range                95
Minimum              48
Maximum              143
Sum                  5425
Count                60

Numerical Presentation
A fundamental concept in summary statistics is that of a central
value for a set of observations and the extent to which the central
value characterizes the whole set of data. Measures of central
value such as the mean or median must be coupled with measures
of data dispersion (e.g., average distance from the mean) to
indicate how well the central value characterizes the data as a
whole.

To understand how well a central value characterizes a set of observations, let us consider
the following two sets of data:

A: 30, 50, 70
B: 40, 50, 60
The mean of both data sets is 50. But the distance of the observations from the mean
in data set A is larger than in data set B. Thus, the mean of data set B is a better
representation of its data set than is the case for set A.
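
The comparison of sets A and B can be checked numerically. The following is a minimal sketch, assuming the "average distance from the mean" is taken as the mean absolute deviation; the helper name `mean_abs_deviation` is introduced here for illustration only.

```python
from statistics import mean


def mean_abs_deviation(values):
    """Average absolute distance of the observations from their mean."""
    center = mean(values)
    return sum(abs(v - center) for v in values) / len(values)


A = [30, 50, 70]
B = [40, 50, 60]

for name, data in (("A", A), ("B", B)):
    # Both sets share the same mean, but A is more spread out than B.
    print(name, "mean =", mean(data),
          "mean absolute deviation =", round(mean_abs_deviation(data), 2))
```

Both sets report the same mean, while the dispersion measure is larger for A, which is why the mean summarizes B better than it summarizes A.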

Methods of Center Measurement


Center measurement is a summary measure of the overall level of a
dataset
Commonly used methods are mean, median, mode, geometric mean etc.
Mean: The sum of all the observations divided by the number of
observations. The mean of 20, 30, 40 is (20+30+40)/3 = 30.

Notation: Let x_1, x_2, ..., x_n be n observations of a variable x. Then the mean of this variable is

$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n} \sum_{i=1}^{n} x_i$$

Mode: The value that is observed most frequently. The mode is
undefined for sequences in which no observation is repeated.

Median: The middle value in an ordered sequence of
observations. That is, to find the median we need to order
the data set and then find the middle value. In case of an
even number of observations the average of the two
middlemost values is the median. For example, to find the
median of {9, 3, 6, 7, 5}, we first sort the data giving {3,
5, 6, 7, 9}, then choose the middle value 6. If the number
of observations is even, e.g., {9, 3, 6, 7, 5, 2}, then the
median is the average of the two middle values from the
sorted sequence, in this case, (5 + 6) / 2 = 5.5.
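
The three measures of center can be reproduced with Python's standard `statistics` module. This is only an illustrative sketch: the list passed to `multimode` is a made-up example, not data from these slides.

```python
from statistics import mean, median, multimode

# Mean of the example 20, 30, 40
example_mean = mean([20, 30, 40])

# Median with an odd and an even number of observations
median_odd = median([9, 3, 6, 7, 5])
median_even = median([9, 3, 6, 7, 5, 2])

# Mode: multimode returns a list of the most frequent values.
# The data below are hypothetical and only illustrate the call.
modes = multimode([3, 84, 84, 90])

print(example_mean, median_odd, median_even, modes)
```

Note that `multimode` returns every value when nothing repeats, so a result as long as the input corresponds to the "mode is undefined" case described above.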

Mean or Median
The median is less sensitive to outliers (extreme scores)
than the mean and is thus a better measure than the mean
for highly skewed distributions, e.g. family income. For
example, the mean of 20, 30, 40, and 990 is
(20+30+40+990)/4 = 270. The median of these four
observations is (30+40)/2 = 35. Here 3 observations out of
4 lie between 20 and 40. So the mean of 270 fails to give
a realistic picture of the major part of the data. It is
influenced by the extreme value 990.


Methods of Variability Measurement
Variability (or dispersion) measures the amount of
scatter in a dataset.
Commonly used methods: range, variance, standard
deviation, interquartile range, coefficient of
variation etc.
Range: The difference between the largest and the
smallest observations. The range of 10, 5, 2, 100 is (100 - 2) = 98. It is a crude measure of variability.

Variance: The variance of a set of observations is, roughly, the average of
the squares of the deviations of the observations from their
mean; the sample variance divides the sum of squared deviations by n - 1 rather than n.
In symbols, the variance of the n observations x_1, x_2, ..., x_n is

$$S^2 = \frac{(x_1 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1}$$

Variance of 5, 7, 3? The mean is (5+7+3)/3 = 5 and the variance is

$$\frac{(5 - 5)^2 + (3 - 5)^2 + (7 - 5)^2}{3 - 1} = 4$$

Standard Deviation: Square root of the variance. The standard
deviation of the above example is 2.
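
The worked example can be checked with the standard library. This is a small sketch using `statistics.variance` and `statistics.stdev`, which both use the n - 1 denominator shown above.

```python
from statistics import mean, stdev, variance

data = [5, 7, 3]

data_mean = mean(data)          # (5 + 7 + 3) / 3
sample_var = variance(data)     # sum of squared deviations / (n - 1)
sample_sd = stdev(data)         # square root of the sample variance

print("mean:", data_mean)
print("variance:", sample_var)
print("standard deviation:", sample_sd)
```

The population versions, which divide by n instead, are available as `statistics.pvariance` and `statistics.pstdev`.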

[Figure: Normal distribution (bell curve)]

Quartiles: Data can be divided into four regions that cover the total range
of observed values. Cut points for these regions are known as quartiles.

In notation, the q-th quartile of a data set is the ((n+1)/4)·q-th observation of the
ordered data, where q is the desired quartile (1, 2, or 3) and n is the number of
observations.
The first quartile (Q1) is the cut point below which the lowest 25% of the data lie. The second
quartile (Q2) is the median, the cut point below which 50% of the data lie. The third
quartile (Q3) is the cut point below which 75% of the data lie.
Q1 is the median of the first half of the ordered observations and Q3
is the median of the second half of the ordered observations.

An example with 15 numbers

3 6 7 11 13 22 30 40 44 50 52 61 68 80 94
(Q1 = 11, Q2 = 40, Q3 = 61)

In this example Q1 is the ((15+1)/4)·1 = 4th
observation of the data. The 4th observation is 11, so Q1
of this data is 11.
The first quartile is Q1 = 11. The second quartile is
Q2 = 40 (this is also the median). The third quartile is
Q3 = 61.
Inter-quartile Range: Difference between Q3 and Q1.
The inter-quartile range of the previous example is 61 - 11 = 50.
The middle half of the ordered data lie between 11 and 61.
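
The quartiles of the 15-number example can be reproduced in Python. This is a sketch that assumes `statistics.quantiles` with `method="exclusive"` (Python 3.8+), which places the cut points at the ((n+1)/4)·q-th ordered observation and interpolates when that position is not a whole number.

```python
from statistics import quantiles

data = [3, 6, 7, 11, 13, 22, 30, 40, 44, 50, 52, 61, 68, 80, 94]

# n=4 splits the data into four parts and returns the three cut points.
q1, q2, q3 = quantiles(data, n=4, method="exclusive")

# Inter-quartile range: difference between Q3 and Q1.
iqr = q3 - q1

print("Q1 =", q1, "Q2 =", q2, "Q3 =", q3, "IQR =", iqr)
```

Other software may use a different quartile convention and return slightly different cut points for the same data.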

Percentiles: If data is ordered and divided into 100 parts,
then the cut points are called percentiles. The 25th percentile is
Q1, the 50th is the median (Q2) and the 75th percentile of the
data is Q3.
In notation, the p-th percentile of a data set is the ((n+1)/100)·p-th
observation of the ordered data, where p is the desired percentile
and n is the number of observations.
Deciles: If data is ordered and divided into 10 parts, then
the cut points are called deciles.
Coefficient of Variation: The standard deviation of data
divided by its mean. It is usually expressed in percent.

$$\text{Coefficient of Variation} = \frac{S}{\bar{x}} \times 100$$
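
As a quick illustration, the coefficient of variation can be computed for the data 5, 7, 3 used in the variance example. The function name `coefficient_of_variation` is a hypothetical helper, and the sketch assumes the sample standard deviation (n - 1 denominator).

```python
from statistics import mean, stdev


def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean, in percent."""
    return stdev(values) / mean(values) * 100


cv = coefficient_of_variation([5, 7, 3])
print(f"coefficient of variation: {cv:.1f}%")
```

Because it is a ratio to the mean, the measure is only meaningful for data with a nonzero mean.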

Shape of Data
Shape of data is measured by
Skewness
Kurtosis

Skewness
Measures asymmetry of data
Positive or right skewed: Longer right tail
Negative or left skewed: Longer left tail
Let x_1, x_2, ..., x_n be n observations. Then,

$$\text{Skewness} = \frac{\sqrt{n} \sum_{i=1}^{n} (x_i - \bar{x})^3}{\left( \sum_{i=1}^{n} (x_i - \bar{x})^2 \right)^{3/2}}$$
Kurtosis
Measures peakedness of the distribution of data. The
kurtosis of normal distribution is 0.

Let x_1, x_2, ..., x_n be n observations. Then,

$$\text{Kurtosis} = \frac{n \sum_{i=1}^{n} (x_i - \bar{x})^4}{\left( \sum_{i=1}^{n} (x_i - \bar{x})^2 \right)^{2}} - 3$$
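
The two shape measures can be sketched in Python. This assumes the moment-based definitions: skewness as m3 / m2^(3/2) and kurtosis as m4 / m2^2 - 3, where mk is the k-th central moment, with 3 subtracted so that a normal distribution scores 0. The function names and sample lists are illustrative only; spreadsheet software typically applies small-sample corrections and may report different values.

```python
def central_moment(values, k):
    """k-th central moment: average of (x - mean)**k over the data."""
    n = len(values)
    center = sum(values) / n
    return sum((v - center) ** k for v in values) / n


def skewness(values):
    """Moment-based skewness: m3 / m2**1.5."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


def excess_kurtosis(values):
    """Moment-based kurtosis minus 3 (0 for a normal distribution)."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3


symmetric = [1, 2, 3, 4, 5]       # hypothetical symmetric data
right_skewed = [1, 2, 3, 10]      # hypothetical data with a long right tail

print(skewness(symmetric), excess_kurtosis(symmetric))
print(skewness(right_skewed), excess_kurtosis(right_skewed))
```

A symmetric sample gives a skewness of 0, and a sample with a long right tail gives a positive value.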

Bivariate Analysis
It involves the analysis of two variables (often denoted
as X, Y), for the purpose of determining the empirical
relationship between them, i.e., it is the analysis of the
relationship between the two variables.
Bivariate analysis is a simple (two variable) special case of
multivariate analysis (where multiple relations between
multiple variables are examined simultaneously).

Bivariate analysis allows us to:
Look at associations/relationships between two variables.
Look at measures of the strength of the relationship between
two variables.
Test hypotheses about relationships between two nominal or
ordinal level variables.

Level of Measurement
Nominal measurement is a classification system; we use numbers
instead of names to identify things.
For example, if we wanted to code religion, we might say 1 =
Catholic, 2 = Protestant, 3 = Jewish, etc. That does not mean that a
Protestant is more religious than a Catholic and less religious than a
Jew. The numbers we use are arbitrary, and you can't perform
mathematical operations on them (e.g. add a Catholic and a Protestant and get
a Jew). Categories should be mutually exclusive and exhaustive. That
is, you should only be able to classify something one way, and you
should have a category for every possible value.

With ordinal measurement, categories are ranked in order of their
values on some property. Class ranks are an example (highest
score, second highest score, etc.) However, the distances between
ranks do not have to be the same. For example, the highest scoring
person may have scored one more point than the 2nd highest, she
may have scored 5 more points than the third highest, etc.

With interval level measurement, the distance between each number is the
same. For example, the distance between 1 and 2 is the same as the distance
between 15 and 16. With interval measurement, we can determine not only
that a person ranks higher but how much higher they rank. You can do
addition and subtraction with interval level measures, but not multiplication
and division.
With ratio level measures, you can do addition, subtraction, and
multiplication and division. With ratio measures, you have an absolute, fixed,
and nonarbitrary zero point.

Examples
The Fahrenheit and centigrade scales of temperature are interval-level measures.
They are not ratio-level because the zero point is arbitrary. For example, on the Fahrenheit
scale 32 degrees happens to be the point where water freezes. There is no reason
you couldn't shift everything down by 32 degrees, and have 0 be the point where
water freezes. Or add 68, and have 100 be the freezing point. The zero point is
arbitrary. So it is not correct to say that 70 degrees outside is
twice as warm as 35 degrees outside.
Such things as age and income, however, have nonarbitrary zero points. If you
have zero dollars, that literally means that you have no income. If you are 20 years
old, that literally means you have been around for 20 years. Further, $10,000 is
exactly twice as much as $5,000. If you are 20, you are half as old as someone who
is 40.

Bivariate data
Be clear about the difference between bivariate data and
two-sample data. In two-sample data, the X and Y values are
not paired, and there aren't necessarily the same number of X
and Y values.

A bivariate simple random sample (SRS) can be written

(X1, Y1), (X2, Y2), ..., (Xn, Yn).

Each observation is a pair of values, for example (X3, Y3)
is the third observation.
In a bivariate SRS, the observations are independent of
each other, but the two measurements within an
observation may not be.

Marginal Density
The distribution of X and the distribution of Y can be
considered individually using univariate methods. That is,
we can analyse X1, ..., Xn or Y1, ..., Yn
using CDFs, densities, quantile functions, etc. Any property
that describes the behavior of the Xi values alone or the Yi
values alone is called a marginal property.

Joint Property
The most interesting questions relating to bivariate data
deal with X and Y simultaneously. These questions are
investigated using properties that describe X and Y
simultaneously. Such properties are called joint
properties.
A complete summary of the statistical properties of (X, Y)
is given by the joint distribution.

Example
If the sample space is finite, the joint distribution is represented
in a table, where the X sample space corresponds to the rows,
and the Y sample space corresponds to the columns.
For example, if we flip two fair coins, the joint distribution is

        Y = H   Y = T
X = H   1/4     1/4
X = T   1/4     1/4

The marginal distributions can always be obtained from the joint
distribution by summing the rows (to get the marginal X
distribution), or by summing the columns (to get the marginal Y
distribution). For this example, the marginal X and Y
distributions are both {H 1/2, T 1/2}.
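
Summing rows and columns of a joint table is easy to sketch in code. The example below assumes two fair, independent coins, so every (X, Y) outcome has probability 1/4; the dictionary layout is just one illustrative way to store the table.

```python
from fractions import Fraction

# Joint distribution of two fair, independent coin flips (assumed):
# keys are (X outcome, Y outcome), values are probabilities.
quarter = Fraction(1, 4)
joint = {
    ("H", "H"): quarter, ("H", "T"): quarter,
    ("T", "H"): quarter, ("T", "T"): quarter,
}

marginal_x = {}
marginal_y = {}
for (x, y), p in joint.items():
    # Summing over Y (across a row) gives the marginal of X ...
    marginal_x[x] = marginal_x.get(x, 0) + p
    # ... and summing over X (down a column) gives the marginal of Y.
    marginal_y[y] = marginal_y.get(y, 0) + p

print("marginal X:", marginal_x)
print("marginal Y:", marginal_y)
```

Using `Fraction` keeps the probabilities exact, so the marginals come out as exactly 1/2 rather than floating-point approximations.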

Scatterplot
The most important graphical
summary of bivariate data is the
scatterplot. This is simply a plot of
the points (Xi, Yi) in the plane.

The following figures show
scatterplots of June maximum
temperatures against January
maximum temperatures, and of
January maximum temperatures
against latitude.

A key feature in a scatterplot is the association, or trend, between X and Y.
Higher January temperatures tend to be paired with higher June
temperatures, so these two values have a positive association (as X
increases, Y increases).
Higher latitudes tend to be paired with lower January
temperatures, so these values have a negative association (as X increases, Y
decreases).
If higher X values are paired with low or with high Y values equally often,
there is no association (no consistent tendency for values of Y to increase or
decrease as X increases).

Correlation coefficient
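Suppose we would like to numerically quantify the trend in
a bivariate scatterplot.
The most common means of doing this is the correlation
coefficient (sometimes called Pearson's correlation
coefficient):

$$r = \frac{\operatorname{cov}(X, Y)}{s_X s_Y}, \qquad \operatorname{cov}(X, Y) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n - 1}$$

The numerator is called the covariance.

cov(X,Y) > 0: X and Y are positively correlated
cov(X,Y) < 0: X and Y are inversely correlated
cov(X,Y) = 0: X and Y are uncorrelated (independent variables have zero covariance, but zero covariance alone does not imply independence)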

Linear Correlation
[Figures: scatterplots illustrating linear vs. curvilinear relationships, strong vs. weak relationships, and no relationship]

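Example
[Data table for the variables cig. (X) and CHD (Y)]
Find the mean, SD and covariance.

$$\operatorname{cov}_{\text{cig.} \,\&\, \text{CHD}} = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{N - 1} = \frac{222.44}{21 - 1} = 11.12$$

$$r = \frac{\operatorname{cov}_{XY}}{s_X s_Y} = \frac{11.12}{(2.33)(6.69)} = \frac{11.12}{15.59} = .713$$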
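
The covariance and correlation calculations can be sketched as follows. The functions `covariance` and `pearson_r` are illustrative helpers using the N - 1 denominator, and the small `xs`/`ys` lists are made-up data, not the table from this example. The last lines only redo the arithmetic above from the summary numbers given (222.44, N = 21, s_X = 2.33, s_Y = 6.69).

```python
from math import sqrt


def covariance(xs, ys):
    """Sample covariance: sum of cross-deviations divided by N - 1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of the SDs."""
    sd_x = sqrt(covariance(xs, xs))  # covariance of x with itself is its variance
    sd_y = sqrt(covariance(ys, ys))
    return covariance(xs, ys) / (sd_x * sd_y)


# Hypothetical paired data with a perfect positive linear relationship.
xs = [1, 2, 3, 4]
ys = [2, 4, 6, 8]
print(covariance(xs, ys), pearson_r(xs, ys))

# Arithmetic of the slide's example, using its rounded summary values.
cov_example = round(222.44 / (21 - 1), 2)
r_example = cov_example / (2.33 * 6.69)
print(cov_example, round(r_example, 3))
```

Computing the standard deviations through `covariance(xs, xs)` keeps the same N - 1 convention in the numerator and denominator, which the correlation coefficient requires.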

Statistical Software
Many software packages are available for statistical analysis
and visualization of data. Some of them are SAS (Statistical
Analysis System), S-plus, R, Matlab, Minitab, BMDP,
Stata, SPSS, StatXact, Statistica, LISREL, JMP, GLIM, HIL,
MS Excel etc.
http://www.R-project.org
