
Exploratory Data Analysis

C.Cordeiro

Chapter 2: Exploratory Data Analysis

1 / 38

Motivation

Data Example

Statistics helps us to understand the impact of climate change on the environment.

Statistics helps us to understand the impact of climate change on the sea surface temperature (SST).

[Figure: time series of SST from Jan 1988 to Dec 2014; vertical axis from 14 to 22, horizontal axis from 1990 to 2015]

Aspects of the practice of statistics

Descriptive & Inferential Statistics

Descriptive statistics: describing data using graphs and numerical summaries, such as the mean, median, interquartile range, etc. In general, the purpose is to discover important patterns in the data and to display these patterns.

Statistical inference: using information about a sample, drawn at random from a larger population, to establish conclusions about the characteristics of the population.

Aspects of the practice of statistics

Most biology experiments will involve some kind of measurement, such as time, temperature, etc.

In some cases, a well-designed experiment should include a number of repeats (or replicates) of each measurement.

Once some measurements have been collected, the first task is usually to summarise them using descriptive statistics.

Most interesting data analyses address relationships between and among variables:

is weight related to height?
are levels of CO2 related to the increase of SST?

However, at the first stage of data analysis it is important to examine the descriptive measures, graphs and distributions of the variables individually.

2.1 Basic concepts in Statistics

Data, observations, variables

Data (dados) usually consist of a collection of observations or objects;

These observations are usually sampling units or experimental units (unidades amostrais), e.g. individual organisms;

A set of these observations represents a sample (amostra) from a clearly defined population (população);

Population: all possible observations in which we are interested;

Variable (variável): some measurement or characteristic of interest, e.g. temperature, number of individuals, pH, etc. A variable is denoted by Y, with y being any value of Y;

A random variable (variável aleatória) is simply a variable whose values are not known for certain before a sample is taken, i.e. the observed values of a random variable are the results of a random experiment.

Population & Sample
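The relation between a population, a sample and a random variable can be sketched in R. This is only an illustration under stated assumptions: the population is simulated, and the names `population` and `samp`, the seed and the numeric values are hypothetical, not taken from the slides.

```r
# Hypothetical population: 1000 simulated lengths (illustrative values only)
set.seed(1)
population <- rnorm(1000, mean = 50, sd = 5)

# A sample: 10 sampling units drawn at random, without replacement
samp <- sample(population, size = 10)

# The sample mean is a random variable: its value is not known
# before the sample is taken and changes from sample to sample
mean(population)
mean(samp)
```

Running the `sample()` line again draws a different sample, so `mean(samp)` changes while `mean(population)` stays fixed.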

Types of variables

Quantitative data (dados quantitativos)

Discrete (discretas): can only take certain, usually integer, values, e.g. the number of plants in a forest plot, the number of cells in a tissue section, the number of organisms in a sample of mud from a local estuary, etc.

Continuous (contínuas): can take any value within an interval, e.g. measurements like length, weight, salinity, etc.

Types of variables

Qualitative data (dados qualitativos)

Qualitative data are data that record categories, e.g. gender (male, female).

There are two levels of measurement:

nominal, such as gender, in which there is no order in the categories;
ordinal, such as education level, in which the categories have a natural order.

Types of variables

Example

Consider the following table:

  age weight gender
1  17     50   male
2  16     59   male
3  15     49 female
4  16     51   male

Two of the variables are quantitative: age and weight.

gender is a qualitative (categorical) variable.
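As a small sketch of how this classification shows up in R, the table from the example can be rebuilt as a data frame. The object name `d` is an arbitrary choice, and storing `gender` as a factor is one reasonable convention rather than something the slides prescribe.

```r
# Rebuild the table from the example as a data frame
d <- data.frame(
  age    = c(17, 16, 15, 16),
  weight = c(50, 59, 49, 51),
  gender = factor(c("male", "male", "female", "male"))
)

# Quantitative variables are stored as numeric, qualitative ones as factor
sapply(d, class)

# Categories (levels) of the qualitative variable
levels(d$gender)
```

The same `sapply(d, class)` call, or `str(d)`, gives a quick first look at the variable types of any data frame.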

Types of variables

Let's work!

1. Classify the following variables:

a) number of panda births in captivity, per year
b) type of enzyme (A, B, C)
c) time a whale takes to come to the surface to breathe
d) sea temperature
e) number of whale sightings off Pico Island in a month
f) age group (20-39, 40-59, over 60 years)
g) endangered species (giant panda, African elephant, polar bear, ...)
h) dosage of a drug in mL
i) type of habitat (tropical forest, tundra, desert, ...)
j) number of visits to the online tutoring per day
k) CO2 concentration (in p.p.m.)
l) dosage of a drug (20, 40, 60 mL)

2. Consider the esoph data.

a) How many variables are there?
b) Identify and classify the variables.

Types of variables

Let's work!

Consider the datasets:
lynx
chickwts
PlantGrowth
InsectSprays
faithful
ToothGrowth

Identify and classify the variables under study for each of the datasets.

2.2 Descriptive measures, frequency tables and graphs

Numeric summaries

Variables (or types of data) can be classified as qualitative (nominal or ordinal) or quantitative (discrete or continuous), and it is important to find the proper way to summarize them for easier understanding.

There are three main descriptive measures:

Center (medidas de tendência central)
Spread (medidas de dispersão)
Shape (medidas de forma)

Measures of Location

Center measures

The most common summary of a data set is its average, referred to as the mean (média):

$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$

The median (mediana) is the value for which as many other values are bigger as smaller, basically the middle value when they are sorted.

Other measures

Another summary is the most common value in the data set, the so-called mode (moda).

The $p$th quantile (quantil) is the value in the data for which $100 \cdot p$ percent of the data is less than the value and $100 \cdot (1 - p)$ percent is more. The median is the 0.5 quantile.
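The measures of location above can be sketched in R as follows. The data vector is illustrative and not from the slides. Base R has no built-in function for the statistical mode, so the tabulation approach below is one possible workaround, and `quantile()` uses its default interpolation rule (type 7), which is only one of several conventions.

```r
x <- c(2, 4, 4, 5, 7, 9, 11)   # illustrative data, not from the slides

x_mean   <- mean(x)            # sum(x) / length(x)
x_median <- median(x)          # middle value of the sorted data

# Mode: tabulate the values and take the most frequent one
# (if there are ties, only the first is returned)
freq   <- table(x)
x_mode <- as.numeric(names(freq)[which.max(freq)])

# Quantiles: the 0.5 quantile is the median
q <- quantile(x, probs = c(0.25, 0.5, 0.75))
```

For this vector the mean is 6, the median is 5 and the mode is 4.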

Let's work!

3. Compute the mean of the following data:

a) whale = c(74, 122, 235, 111, 292, 111, 211, 133, 156, 79)
b) penguin = c(17.1, 18.5, 19.7, 16.2, 21.3, 19.6, 16.2, 17.4, 17.3, 16.8, 19.5, 18.3)

4. Consider the faithful data.

a) Determine the mean waiting time between eruptions.
b) Determine the mean duration of the eruptions.
c) What are the minimum and maximum waiting times?
d) What are the minimum and maximum eruption durations?
e) Determine the median of the variables under study.
f) Determine the 10th, 25th and 75th percentiles of the variables under study.

Let's work!

5. Consider the PlantGrowth data.

a) What is the sample size? And per treatment?
b) In which treatment is the highest mean plant weight observed?
c) What is the name of the measure of central tendency used in the previous item?
d) In which treatment was the lowest weight observed? And the highest? Give their values.
e) What is the weight above which the heaviest 25% of the plants lie? Which treatment contains the "heaviest" plants?

Let's work!

6. The variable time in the nym.2002 data (UsingR) records the athletes' times in the New York marathon.

a) What was the race time of the winning athlete? And the time of the last place?
b) What is the race time for the fastest 10% of the athletes? And for the fastest 25%?
c) What is the time for the last 10% of the athletes?
d) Which gender has the larger number of participants, and what is its percentage?
e) Which gender has the youngest athlete? And the oldest? Give the respective ages.

7. Consider the chickwts data.

a) Which feed supplement is most used in feeding the chickens? Give the absolute count and the percentage.
b) What is the percentage of sunflower?
c) What is the sample size?
d) Which supplement proved most effective for the mean growth of the chickens?

Dispersion or variability of the data

The term dispersion refers to the variability in the data. The most common measures are presented in the next slides...

Sample Variance (variância amostral)

$S^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$

This is not quite the average of the squared distances, as we have divided by $n - 1$ and not $n$, but it clearly has the same interpretation.

Values far from the mean will have big deviations, which when squared will be even bigger.
So more spread-out data sets will have larger variances.

Dispersion or variability of the data

Sample Standard Deviation (desvio padrão amostral)

If our data has units, say kg, then the sample variance will be in squared units (kg²).

To avoid this, the sample standard deviation is defined as

$S = \sqrt{\text{variance}} = \sqrt{S^2}$

The interpretation is the same as for the variance: larger values mean more dispersion of the data, but now the scale is appropriate!

Dispersion or variability of the data

Coefficient of Variation (CV) (coeficiente de variação)

The standard deviation is sometimes normalized by the mean, giving:

$CV = \frac{S}{\bar{x}} \times 100\%$

Helps compare variation across variables with different units, that is, it is independent of the measurement units;

Is used to compare standard deviations between populations with different means;

A variable with higher CV (in general, > 50%) is more dispersed than one with lower CV.

The CV is only considered for variables with positive values.

Dispersion or variability of the data

Interquartile range (IQR) (amplitude interquartil)

$IQR = Q_{0.75} - Q_{0.25}$

The range is a measure of variability, but it suffers from being affected by just one large or small value. The standard deviation can also be very sensitive to a single large or small value $x_i$.

A measure that is not so affected is the IQR.

This is the range of the middle 50% of the data.
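The dispersion measures from the previous slides can be sketched in R as follows. The data vector is illustrative and not from the slides; `var()`, `sd()` and `IQR()` are base R functions, while the CV is computed by hand since base R has no dedicated function for it. `IQR()` relies on the default quantile rule of `quantile()`, so other software may give slightly different quartiles.

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)   # illustrative data, not from the slides
n <- length(x)

# Sample variance straight from the formula, dividing by n - 1
s2_manual <- sum((x - mean(x))^2) / (n - 1)

s2 <- var(x)   # built-in sample variance, also divides by n - 1
s  <- sd(x)    # sample standard deviation, sqrt(var(x))

cv  <- s / mean(x) * 100   # coefficient of variation, in percent
iqr <- IQR(x)              # Q0.75 - Q0.25
```

Here `s2_manual` and `s2` agree, which confirms that `var()` uses the $n - 1$ divisor.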

Let's work!

8. Consider the faithful data.

a) Determine the mean, standard deviation and CV.
b) Which measure of dispersion is the most appropriate to compare the variables? Justify.
c) Which data are less dispersed? Justify.

9. Consider the esoph data, where the variable ncases represents the number of cases of oesophageal cancer.

a) On average, which age group has the largest number of cancer cases? And the smallest?
b) Which measure of dispersion should be used in this case?

Shape of the data

The shape (forma) is more important to understand in the context of statistical inference.

When trying to understand a parent distribution from the data, we discuss some assumptions about the exact shape of a distribution.

In inferential statistics, a primary role is played by a particular shape: the normal shape.

Sample skewness (assimetria) and sample kurtosis (achatamento) are the two measures used.

Skewness of the data

Measures the asymmetry (assimetria) of the distribution (whether the mean is at the center of the distribution).

The skewness value of a normal distribution is 0. A negative value indicates a skew to the left (the left tail is longer than the right tail) and a positive value indicates a skew to the right (the right tail is longer than the left one).

From the book Estatística Descritiva, Elizabeth Reis.

Kurtosis

Measures the flatness (achatamento) of a distribution.

A normal distribution has a value of 3. A kurtosis > 3 indicates a sharper peak and heavier tails than the normal distribution (leptokurtic). A kurtosis < 3 indicates the opposite, a flat top (platykurtic).

We will come back to these measures when we explain the normal distribution...
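The slides do not give formulas for the two shape measures. A minimal sketch, assuming the moment-based definitions (third and fourth central moments scaled by powers of the second), could look like this in R. The function names `skewness` and `kurtosis` are defined here for illustration; packages that provide functions with the same names may apply different corrections and give slightly different values.

```r
# Moment-based sample skewness and kurtosis (one of several conventions)
skewness <- function(x) {
  m2 <- mean((x - mean(x))^2)
  m3 <- mean((x - mean(x))^3)
  m3 / m2^1.5
}

kurtosis <- function(x) {
  m2 <- mean((x - mean(x))^2)
  m4 <- mean((x - mean(x))^4)
  m4 / m2^2   # on this scale the normal distribution has value 3
}

sym   <- c(1, 2, 3, 4, 5)       # symmetric data: skewness 0
right <- c(1, 1, 2, 2, 3, 10)   # long right tail: positive skewness

skewness(sym)
skewness(right)
kurtosis(sym)
```

The symmetric vector has skewness 0 and a kurtosis below 3, while the vector with one large value has positive skewness.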

Methods for organizing the data

Summarizing the data

Qualitative/Categorical data: Not much can be done with the distribution of a qualitative variable beyond counting the number of individuals in each category of the variable and calculating the percentage distribution over the categories.
Example: With variable tobgp within esoph, obtain the tabulated data.

Quantitative

Discrete variable: table of frequencies (tabela de frequências)

Continuous variable: divide the range of the variable into classes of intervals of equal width (called bins).
How many classes/bins should there be? A simple rule that works pretty well for $n$ … 100 is given by $2^k$ not too much bigger than $n$, where $k$ is the number of classes.
Example: With variable wt, within the babyboom dataset, obtain the tabulated data.
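The tabulation steps above can be sketched in R. Two assumptions are made here: the rule for the number of classes is read as "the smallest k with 2^k at least n", and the built-in `faithful$waiting` variable stands in for the `wt` variable of `babyboom`, which requires the UsingR package. Note also that `esoph` is stored in grouped form, so `table()` counts rows of the data frame rather than individual patients.

```r
# Qualitative variable: counts and percentages per category
counts <- table(esoph$tobgp)
pct    <- round(100 * prop.table(counts), 1)

# Continuous variable: choose k as the smallest integer with 2^k >= n,
# then cut the range into k classes of equal width
x <- faithful$waiting
n <- length(x)
k <- ceiling(log2(n))

classes <- cut(x, breaks = k)   # factor with k interval labels
freq    <- table(classes)       # frequency table of the classes
```

Passing a single number to `breaks` asks `cut()` for that many equal-width intervals. Explicit break points can be supplied instead when rounder class limits are wanted.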

Methods for viewing the data

Plotting the data

Qualitative/Categorical data: bar graph (gráfico de barras) or pie chart (gráfico circular)
Example: With variable gender within babyboom, obtain the bar and the circular plots.

Quantitative

Discrete variable: bar chart (gráfico de barras)

Continuous variable: histogram (histograma) is a bar chart that shows the count or percentage of individuals falling in each of the classes/bins.
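A minimal sketch of these plots with base R graphics follows. Since `babyboom` needs the UsingR package, the built-in datasets `chickwts` (qualitative variable `feed`) and `faithful` (continuous variable `waiting`) are used here as stand-ins. The axis labels and title are illustrative choices.

```r
# Qualitative variable: bar graph and pie chart of the category counts
feed_counts <- table(chickwts$feed)
barplot(feed_counts, ylab = "Frequency")
pie(feed_counts)

# Continuous variable: histogram of counts per class
h <- hist(faithful$waiting, main = "Waiting time", xlab = "minutes")
h$counts   # number of observations falling in each bin
```

`hist()` chooses the bins itself by default. Its `breaks` argument accepts a suggested number of classes or explicit break points.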

Methods for viewing the data

Let's work!

10. Consider the babyboom data from the UsingR package.

a) Using the qualitative variable, build the frequency table, the pie chart and the bar graph.
b) For the variable running.time:
i) Determine the number of classes (k).
ii) Build the new variable defined by classes, runbin.
iii) Obtain the frequency table.
iv) Which class has the highest frequency?
v) Build the histogram.
vi) What can you say about its symmetry?

11. For the variable tobgp of the esoph data, between a pie chart and a bar graph, which seems more appropriate to you?

Graphical tools

Viewing the shape of the data

There are several graphical tools to view the shape of the data distribution:

Stem-and-leaf plot (caule-e-folhas): The data set can be represented in an organized, compact manner. A good option when analysing the data set by hand.
Example: 2, 3, 16, 23, 14, 12, 4, 13, 2, 0, 0, 0, 6, 28, 31, 14, 4, 8, 2, 5

Histogram (histograma): Represents the data with bars of a given area.

Boxplot (caixa de bigodes): A graphical device based on the quartiles ($Q_1$, $Q_2$, $Q_3$). It is easy to identify the median (center), the IQR (spread) and the skew (shape) of the data.
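The stem-and-leaf example data can be explored in R with base functions. This is a sketch: `stem()` prints a text display, `fivenum()` returns Tukey's five-number summary (minimum, lower hinge, median, upper hinge, maximum), and the hinges are close to, but not always identical to, the quartiles returned by `quantile()`.

```r
x <- c(2, 3, 16, 23, 14, 12, 4, 13, 2, 0, 0, 0, 6, 28, 31, 14, 4, 8, 2, 5)

stem(x)              # stem-and-leaf display printed to the console

five <- fivenum(x)   # minimum, lower hinge, median, upper hinge, maximum

# Boxplot: the box spans the hinges, with a line at the median
boxplot(x, horizontal = TRUE)
```

For these data the median (5.5) sits much closer to the lower hinge (2) than to the upper hinge (14), which points to a right-skewed distribution.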

Graphical tools

Robust measures

The most used measures of center and dispersion are the mean and the standard deviation, due to their relationship with the normal distribution, but... they suffer when the data has long tails or many outliers.

The median is a resistant measure of center.

The IQR is one of the dispersion measures that is robust to unusual observations.
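A small numerical sketch of this robustness, using illustrative data that are not from the slides: the same five values are summarised twice, once as they are and once with the largest value replaced by an extreme one.

```r
x     <- c(10, 12, 13, 15, 20)    # illustrative data
x_out <- c(10, 12, 13, 15, 200)   # same data with one extreme value

# Mean and standard deviation react strongly to the extreme value,
# while median and IQR are unchanged
c(mean = mean(x),     median = median(x),     sd = sd(x),     IQR = IQR(x))
c(mean = mean(x_out), median = median(x_out), sd = sd(x_out), IQR = IQR(x_out))
```

The mean moves from 14 to 50, while the median stays at 13 and the IQR stays at 3.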

Graphical tools

Let's work!

12. Consider the nym.2002 data.

a) Check whether the variable time has a symmetric distribution.
b) Examine the shape of the data using a histogram and a boxplot.
c) Considering the data split by gender for the variable time, represent them using side-by-side boxplots.
d) Using a measure of central tendency, which gender achieved the best race times? Justify.

13. Consider the rivers data.

a) Classify the variable under study.
b) Determine the coefficient of variation.
c) Is the mean a good summary statistic to represent the data? Justify.

Graphical tools

Let's work!

14. Consider the babyboom data from the UsingR package.

a) Classify the variables under study.
b) What is the percentage of female births?
c) Determine the skewness coefficient of the variable wt. Comment on the value obtained.
d) Examine the shape of the data using a histogram and a boxplot.
e) Which gender has the highest weight (wt)? And the lowest? Give the respective weights.
f) Group the variable wt into classes and represent it graphically.

Graphical tools

Bivariate data

Bivariate data involves two variables:

For qualitative data we look at (two-way) contingency tables (tabela de contingência) or some related graphics.
The table counts the occurrences of each possible pair of levels.
Example: Considering data set esoph, obtain the contingency table for variables agegp and tobgp.

A scatter plot (gráfico de dispersão) is useful with numeric data.
Example: Make the scatter plot of the data faithful.
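Both examples can be sketched with base R. One caveat: `esoph` is stored in grouped form (each row is a combination of age, alcohol and tobacco groups), so `table()` here counts rows of the data frame, not individual patients. The axis labels are illustrative choices.

```r
# Two qualitative variables: two-way contingency table
tab <- table(esoph$agegp, esoph$tobgp)
tab
addmargins(tab)   # same table with row and column totals added

# Two quantitative variables: scatter plot
plot(faithful$eruptions, faithful$waiting,
     xlab = "Eruption duration (min)", ylab = "Waiting time (min)")
```

`prop.table(tab, margin = 1)` would turn the counts into row proportions, which is often easier to compare across age groups.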

Let's work!

15. Consider the diamond data from the UsingR package.

a) Identify and classify the variables under study.
b) For each of the variables, determine the measures of location and dispersion.
c) Using the two variables, produce a graphical representation of the data. Comment.
d) Transform the variable price into an ordinal qualitative variable (price.cat).
e) Which category has the highest percentage? Identify it.
f) Represent the variable price.cat using an appropriate graph. What is this graphical representation called?

Let's work!

16. The normtemp data (UsingR) contain the body temperatures of 130 individuals (variable temperature).

a) Identify and classify the variables.
b) Make a histogram of the variable temperature.
c) Obtain the mean body temperature, and check whether this statistic is adequate for the data.
d) The variable gender is 1 for male and 2 for female. Make a boxplot by gender. Do you think the body temperatures are similar?
e) Using the mean, median, maximum and minimum, compare the genders with respect to the resting heart rate hr.
f) Which gender recorded the highest heart rate? And the lowest? Give the respective values.
g) Classify the skewness of the two quantitative variables using a graph and also give their values.
h) Group the variable temperature into classes and represent it graphically.
i) Present the contingency table for the two qualitative variables.

Let's work!

17. Using a mobile phone while driving increases the risk of an accident. The reaction.time data (UsingR) give the reaction times to an event. Group C is not using the phone and group T is using the phone.

a) Make a contingency table for the variables gender and age. Which gender is more numerous in the sample? And which age group?
b) Considering age and control, which age group makes the most use of the phone while driving?
c) Considering gender and control, which gender uses the phone more while driving?
d) Considering the variable time by age, obtain the size, mean, median, standard deviation, CV, IQR, minimum, maximum and quartiles. Comment.
e) Make side-by-side boxplots of the variable for the 2 groups.

Processing the data...

The statistical methods that are appropriate for analysing the data are partly dependent upon the nature of the variables.

Exploratory data analysis: describing data using numerical summaries, tables of frequencies and graphs.
The general purpose is to discover important patterns in data and to display these patterns.
