
Università degli Studi di Torino

Corso di Laurea in Matematica


Computational Statistics Laboratory
a.a. 2019/2020

1-Introduction to the
course and to software R

Maria Teresa Giraudo


Lecturer

Maria Teresa Giraudo


Department of Mathematics University of Torino
Via Carlo Alberto 10 1st floor
Tel. 0116702937
e-mail mariateresa.giraudo@unito.it

Office hours:
by appointment, on request by e-mail to the address above
Information on the course

Computational Statistics Laboratory in Computer Lab with software R

The aim of the course is to introduce students to applications of the basic
statistical principles and techniques they have acquired. This is done by
working on real problems and data sets from different fields, such as
Biology, Engineering, Finance and Epidemiology, and by introducing
the statistical software R (www.r-project.org) and its programming facilities.

Starting from basic knowledge of Statistics, the course shows how to use it in
real applications, while broadening computational and
computer science skills.
Programme
•  Introduction to the applications of Statistics and to the use of statistical software R.

•  Univariate descriptive statistics: main statistical indexes (sample mean, mode,
median, sample variance, coefficient of variation); graphical representations of
sample data.

•  Bivariate descriptive statistics: contingency tables and barplots.

•  Some methods for simulating random variables.

•  Hypothesis testing: parametric and nonparametric tests for one and for two
samples; chi-square test for independence.

•  Goodness of fit testing.

•  Linear correlation and regression.

•  Introduction to one and two way analysis of variance.


References

1) P. Dalgaard, Introductory Statistics with R, Springer 2008

2) Owen Jones, Robert Maillardet, Andrew Robinson, Introduction to Scientific
Programming and Simulation Using R, Second Edition, Chapman and Hall/CRC 2014

3) Teaching material used during the lessons and downloadable from the Moodle
site of the course

4) Websites quoted during the lessons


Further material

Some exercises, tests and further material will be made available from the
Pearson MyLab digital teaching platform project of Torino University.

Instructions and hints on how to use them will be provided throughout the course.
Examination rules

Students intending to take the exam in the first two sessions (January and
February) will be asked to complete two individual assignments during the course,
in which they analyze a given data set in detail and/or do some
statistical programming as learnt in the lessons.
The assignments account for 60% of the final mark.

The examination takes place in the computer lab, where students are asked
to work autonomously, guided by a set of questions (1 hour).
It accounts for 40% of the final mark in the first two sessions and for the
full mark in the other sessions, when the examination is longer (around
1.5 hours).
The maximum mark is 32, so that the score 30 e lode (30 with honours) can be awarded.
Summary of the lesson

•  Introduction to the course

•  Introduction to the software R

•  Basic instructions

•  Simple graphical output

•  Vectors, Factors, Lists, Frames


Why is it important to know Statistics and its practical use?
Why Statistics and History of Statistics

http://www.amstat.org/ASA/Why-Statistics.aspx

http://www.statslife.org.uk/images/pdf/timeline-of-statistics.pdf
History of Statistics
FROM «A CARTOON GUIDE TO STATISTICS» - 1
FROM «A CARTOON GUIDE TO STATISTICS» - 2
DOING STATISTICS: THE AID OF SOFTWARE

To practice Statistics by analyzing the available data (real or synthetic)
we will make use of the statistical software R,
and in particular of RStudio, an integrated development environment (IDE) for R.


THE SOFTWARE R

• R is a language and environment for statistical computing and graphics. It is a GNU
project which is similar to the S language and environment which was developed at Bell
Laboratories (formerly AT&T, now Lucent Technologies) by John Chambers and
colleagues.

• R provides a wide variety of statistical (linear and nonlinear modelling, classical
statistical tests, time-series analysis, classification, clustering, …) and graphical
techniques, and is highly extensible.

• One of R’s strengths is the ease with which well-designed publication-quality plots can
be produced, including mathematical symbols and formulae where needed. Great care
has been taken over the defaults for the minor design choices in graphics, but the user
retains full control.

• The home page of the R Project is http://www.r-project.org/. From there one can reach
the Comprehensive R Archive Network (CRAN), download the software and
find FAQs and tutorials.
THE SOFTWARE R

We will make use of the IDE RStudio, for which an essential tutorial can be found in
the file
Rstudio101.pdf

RStudio can be downloaded from the website
https://www.rstudio.com/products/rstudio/download/#download
and then easily installed.

The commands are those of the software R, so we will refer to
screenshots of the R program windows.
HOW R INTRODUCES ITSELF

[Picture: the main window of R as it appears once the software has been installed on
your PC.]
A SIMPLE GRAPHICAL EXAMPLE

To use data sets already stored in a package (for example ISwR), first install the
corresponding package and then load it with
>  library(ISwR)

Graphical example: simulation of 1000 standard normal values and plot of the
values against their index
>  plot(rnorm(1000))
R AS A CALCULATOR
R can be used as an interactive calculator:
>  exp(-2)
[1] 0.1353353
The [1] in front of the result is the index of the first element printed on that line of output.
>  rnorm(15)
 [1]  0.43770085 -0.24674916  0.19234551  0.89002202  0.02038122 -0.27390178
 [7] -0.92197645 -0.07029154 -2.48202759  0.82927383 -0.73189148  1.38242318
[13] -1.58462197 -0.47569911  1.34503247

One can introduce variables, i.e. assign numbers or structures to named objects:
>  x<-2
The assignment operator is <-.
>  x+x
[1] 4
>  x^3
[1] 8
>  y<-5
>  x*y
[1] 10

ATTENTION: R is case sensitive!


VECTORS

R makes it easy to handle a group of data values as a single object, a vector
(an array of numbers):
>  weight<-c(60,72,57,90,95,72)   # c means "concatenate"
>  weight
[1] 60 72 57 90 95 72
>  sum(weight)
[1] 446
>  sum(weight)/length(weight)
[1] 74.33333
>  height<-c(1.75,1.80,1.65,1.90,1.74,1.91)
>  height
[1] 1.75 1.80 1.65 1.90 1.74 1.91
>  bmi<-weight/height^2
>  bmi
[1] 19.59184 22.22222 20.93664 24.93075 31.37799 19.73630

Operations on vectors are executed elementwise.

Operations involving vectors of different lengths are executed all the same: the
shorter vector is recycled, i.e. repeated as many times as needed to reach the length of
the longer one.
If the longer length is not a multiple of the shorter one, a warning is issued.
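
The recycling rule can be tried on small vectors. The following is a minimal sketch; the vectors a and b are invented for illustration and are not part of the lesson's data.

```r
# Illustrative vectors (not from the lesson's data)
a <- c(1, 2, 3, 4, 5, 6)
b <- c(10, 20)

# b is recycled three times, as if it were 10 20 10 20 10 20
s <- a + b
s

# A length-3 vector recycled twice against a length-6 vector
p <- a * c(1, 0, -1)
p

# Lengths 6 and 4: the result is still computed, but R issues a warning.
# tryCatch is used here only to capture the text of the warning.
msg <- tryCatch(a + c(1, 2, 3, 4),
                warning = function(w) conditionMessage(w))
msg
```

Recycling is what makes expressions such as weight/height^2 work: the exponent 2 is a vector of length 1 that is reused for every element of height.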
VECTORS

Missing values are denoted as NA (not available).

In addition to the relational and logical operators, there are a series of functions
that return a logical value. A particularly important one is is.na(x), which is used
to find out which elements of x are recorded as missing (NA).

Notice that there is a real need for is.na because you cannot make comparisons of
the form x==NA. That simply gives NA as the result for any value of x. The
result of a comparison with an unknown value is unknown!
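
A short sketch of the behaviour described above, on an invented vector x with two missing values. The argument na.rm of mean is not mentioned in the slide; it is a standard option for skipping missing values.

```r
x <- c(1.5, NA, 3, NA)   # illustrative vector with two missing values

x == NA          # all NA: a comparison with NA is neither TRUE nor FALSE
is.na(x)         # FALSE TRUE FALSE TRUE
sum(is.na(x))    # number of missing values: 2

mean(x)                  # NA, because missing values propagate
mean(x, na.rm = TRUE)    # 2.25, the mean of the non-missing values only
```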
VECTOR ARITHMETIC

Let us compute step by step the variance of weight:

>  sum(weight)
[1] 446
>  sum(weight)/length(weight)
[1] 74.33333
>  xbar<-sum(weight)/length(weight)
>  weight-xbar
[1] -14.333333  -2.333333 -17.333333  15.666667  20.666667  -2.333333
>  (weight-xbar)^2
[1] 205.444444   5.444444 300.444444 245.444444 427.111111   5.444444
>  varianza<-sum((weight-xbar)^2)/(length(weight)-1)
>  varianza
[1] 237.8667
>  devst<-sqrt(varianza)
>  devst
[1] 15.42293

Of course there are built-in functions for the mean and the standard deviation:

>  media<-mean(weight)
>  media
[1] 74.33333
>  devstd<-sd(weight)
>  devstd
[1] 15.42293
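
The step-by-step computation corresponds to the sample variance with divisor n - 1. As a check, the sketch below compares it with the built-in function var, which is not shown in the slide; the names n and s2 are chosen here for illustration.

```r
weight <- c(60, 72, 57, 90, 95, 72)

n    <- length(weight)
xbar <- sum(weight) / n
s2   <- sum((weight - xbar)^2) / (n - 1)   # sample variance, divisor n - 1

var(weight)                       # built-in sample variance
all.equal(s2, var(weight))        # TRUE
all.equal(sqrt(s2), sd(weight))   # TRUE: sd is the square root of var
```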
BUILDING GRAPHICS

Graphical facilities in R range from the simplest ones, like scatterplots, to the most
advanced.
Let us consider a simple scatterplot:
>  plot(height,weight)
GRAPHICAL PARAMETERS (PAR)

One can choose among a huge variety of options, for example the plotting symbol (pch,
“plotting character”):
>  plot(height,weight,pch=2)
>  plot(height,weight,pch=5)
OBJECTS AND EXPRESSIONS

Expressions are the main way of interacting with R. They involve references to
variables, operators (+, …) and function calls.
OBJECT is an abstract term for anything that can be assigned to a variable. There
are VECTORS, MATRICES, DATA FRAMES, LISTS.
FUNCTIONS
Function calls are commands resembling mathematical functions of one or more
arguments:
log(x), plot(x,y), sqrt(x), …
Arguments can be formal (the names used in the definition of the function) or actual
(the values supplied in the current call and matched to the formal ones). Positional
matching is used: plot(height,weight) is
different from plot(weight,height).
Some functions can be called without arguments:
>  ls()
[1] "a" "b" "d" "devst" "devstd" "h" …
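
Besides positional matching, actual arguments can also be matched to formal ones by name. The sketch below uses the base function log, whose formal arguments are x and base; the values are chosen only for illustration.

```r
args(log)                  # shows the formal arguments of log

r1 <- log(8, 2)            # positional: x = 8, base = 2
r2 <- log(x = 8, base = 2) # the same call with named arguments
r3 <- log(base = 2, x = 8) # named arguments may be given in any order
r4 <- log(2, 8)            # positions swapped: logarithm of 2 in base 8

c(r1, r2, r3, r4)
```

The first three calls give the same result, while the fourth differs, which is the same effect as swapping height and weight in plot.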
VECTORS

Besides numerical vectors one can define

- Character or string vectors
>  c("Qui", "Quo", "Qua")
[1] "Qui" "Quo" "Qua"
or
>  c('Qui', 'Quo', 'Qua')
[1] "Qui" "Quo" "Qua"

- Logical vectors, whose elements are TRUE (or T) and FALSE (or F):
>  c(T,F,T,F)
[1]  TRUE FALSE  TRUE FALSE
They can also be obtained by means of relational expressions:
>  weight>60
[1] FALSE  TRUE FALSE  TRUE  TRUE  TRUE
>  weight==90
[1] FALSE FALSE FALSE  TRUE FALSE FALSE

To include a double quote inside a string delimited by " ", precede it with a backslash: \".


Function c means concatenate:
>  vector<-c(42,57,12,39,1,3,4)
>  vector
[1] 42 57 12 39  1  3  4
>  length(vector)
[1] 7

Different vectors or single elements can be concatenated:

>  x<-c(1,2,3)
>  y<-c(10,20)
>  c(x,y,5)
[1]  1  2  3 10 20  5

Function seq generates sequences of equally spaced elements from the first to the last, with
step 1 or with the step specified in the third argument:
>  seq(4,10)
[1]  4  5  6  7  8  9 10
>  seq(4,12,2)
[1]  4  6  8 10 12

For step 1 also:

>  4:10
[1]  4  5  6  7  8  9 10
Function rep generates repeated values (first argument, the values; second, the number of
replications):
>  vect<-c(7,9,13)
>  rep(vect,4)
 [1]  7  9 13  7  9 13  7  9 13  7  9 13
>  rep(vect,1:3)
[1]  7  9  9 13 13 13

In particular:
>  rep(1:2,c(10,12))
 [1] 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2

When every value is repeated the same number of times, use each:

>  rep(1:2,each=10)
 [1] 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2
MATRICES

Matrices and arrays are given as vectors + dimensions:

>  x<-1:12
>  dim(x)<-c(3,4)   # rows x columns, filling by columns
>  x
     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2    5    8   11
[3,]    3    6    9   12

Alternatively with the function matrix:

>  matrix(1:12,nrow=3,byrow=T)   # byrow: filling by rows
     [,1] [,2] [,3] [,4]
[1,]    1    2    3    4
[2,]    5    6    7    8
[3,]    9   10   11   12
>  matrix(1:12,nrow=3)
     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2    5    8   11
[3,]    3    6    9   12
Some functions that work with matrices:
>  x<-matrix(1:16,nrow=4,byrow=T)
>  rownames(x)<-LETTERS[1:4]   # gives names to rows
>  x
  [,1] [,2] [,3] [,4]
A    1    2    3    4
B    5    6    7    8
C    9   10   11   12
D   13   14   15   16

>  t(x)   # transposes
     A B  C  D
[1,] 1 5  9 13
[2,] 2 6 10 14
[3,] 3 7 11 15
[4,] 4 8 12 16
>  colnames(x)<-c(1:4)   # gives names to columns
>  x
   1  2  3  4
A  1  2  3  4
B  5  6  7  8
C  9 10 11 12
D 13 14 15 16
Some functions that work with matrices:
cbind and rbind «glue» vectors together to make matrices:

>  cbind(A=1:4,B=5:8,C=9:12,D=13:16)   # glues by columns
     A B  C  D
[1,] 1 5  9 13
[2,] 2 6 10 14
[3,] 3 7 11 15
[4,] 4 8 12 16

>  rbind("1"=1:4,"2"=5:8,"3"=9:12,"4"=13:16)   # glues by rows
  [,1] [,2] [,3] [,4]
1    1    2    3    4
2    5    6    7    8
3    9   10   11   12
4   13   14   15   16

Bivariate frequency tables will be in the form of matrices.
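
As an illustration of a bivariate frequency table stored as a matrix, here is a minimal sketch. The counts and the variable names smoker and sex are hypothetical, invented only for this example.

```r
# Hypothetical counts, invented for illustration only
counts <- matrix(c(12,  5,
                    8, 15),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(smoker = c("yes", "no"),
                                 sex    = c("F", "M")))
counts

rowSums(counts)       # marginal totals by row: yes 17, no 23
colSums(counts)       # marginal totals by column: F 20, M 20
counts["no", "M"]     # a single cell selected by row and column name: 15
prop.table(counts)    # relative frequencies (each cell divided by the total, 40)
```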


FACTORS

The factor data structure makes it possible to handle categorical variables expressed as levels
(colors, intensity, judgements, …).
For example, a factor with 4 levels consists of two components: a vector of integers
between 1 and 4 and a character vector of length 4 holding the names of the levels.

>  pain<-c(0,3,2,2,1)   # vector with the pain levels of 5 persons
>  fpain<-factor(pain,levels=0:3)
>  levels(fpain)<-c("none","mild","medium","severe")   # names of the levels
>  pain
[1] 0 3 2 2 1
>  fpain
[1] none   severe medium medium mild
Levels: none mild medium severe
>  as.numeric(fpain)   # extracts the numerical codes of the levels (they always start from 1)
[1] 1 4 3 3 2
>  levels(fpain)   # extracts the names of the levels
[1] "none"   "mild"   "medium" "severe"
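
Once a variable is stored as a factor, its levels can be counted and compared by name. The sketch below rebuilds fpain in one step with the labels argument of factor, which is an alternative to assigning levels afterwards and is not shown in the slide.

```r
pain  <- c(0, 3, 2, 2, 1)
fpain <- factor(pain, levels = 0:3,
                labels = c("none", "mild", "medium", "severe"))

table(fpain)         # frequency of each level: none 1, mild 1, medium 2, severe 1
nlevels(fpain)       # number of levels: 4
fpain == "medium"    # FALSE FALSE TRUE TRUE FALSE
```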
LISTS

Lists are used to combine objects of different types into a larger, composite object
(e.g. matrices and vectors, matrices and single numbers, etc.).
Example: energy consumption before and after a stress

>  intake.pre<-c(5260,5470,5640,6180,6390,6515,6805,7515,7515,8230,8770)
>  intake.post<-c(3910,4220,3885,5160,5645,4680,5265,5975,6790,6900,7335)
>  mylist<-list(before=intake.pre,after=intake.post)
>  mylist
$before
[1] 5260 5470 5640 6180 6390 6515 6805 7515 7515 8230 8770

$after
[1] 3910 4220 3885 5160 5645 4680 5265 5975 6790 6900 7335

To extract the elements in the list:

>  mylist$before
[1] 5260 5470 5640 6180 6390 6515 6805 7515 7515 8230 8770
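
Components of a list can also be extracted with double square brackets, by name or by position. A minimal sketch on the same data; the use of [[ ]], names and sapply goes beyond what the slide shows.

```r
intake.pre  <- c(5260, 5470, 5640, 6180, 6390, 6515, 6805, 7515, 7515, 8230, 8770)
intake.post <- c(3910, 4220, 3885, 5160, 5645, 4680, 5265, 5975, 6790, 6900, 7335)
mylist <- list(before = intake.pre, after = intake.post)

names(mylist)           # "before" "after"
mylist[["after"]]       # the same vector as mylist$after
mylist[[1]]             # first component, selected by position
length(mylist)          # 2: the number of components, not of data values
sapply(mylist, mean)    # mean of each component
```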
DATA FRAMES

Data frames are data matrices or data sets (like worksheets in Excel). They are lists of
vectors of the same length, arranged so that each column is a variable and each
row (case) collects the values for the same subject.
Existing variables can be arranged in data frames.
>  d<-data.frame(intake.pre,intake.post)
>  d
   intake.pre intake.post
1        5260        3910
2        5470        4220
3        5640        3885
4        6180        5160
5        6390        5645
6        6515        4680
7        6805        5265
8        7515        5975
9        7515        6790
10       8230        6900
11       8770        7335
The symbol $ is used to select a single variable in the dataframe:
>  d$intake.pre
[1] 5260 5470 5640 6180 6390 6515 6805 7515 7515 8230 8770

To refer to the variables of a data frame directly by name, without the $ notation,
use the command
>  attach(name_dataframe)

In this case you just have to type
>  intake.pre

At the end of your work session type
>  detach(name_dataframe)
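
The sketch below compares the $ notation, attach/detach and, as a further option not covered in the slides, the base function with, which evaluates an expression inside a data frame without attaching it. The names m_dollar, m_with and m_attached are chosen here for illustration.

```r
d <- data.frame(
  intake.pre  = c(5260, 5470, 5640, 6180, 6390, 6515, 6805, 7515, 7515, 8230, 8770),
  intake.post = c(3910, 4220, 3885, 5160, 5645, 4680, 5265, 5975, 6790, 6900, 7335)
)

# Mean difference between the two variables, computed in three ways
m_dollar <- mean(d$intake.pre - d$intake.post)       # explicit $ selection
m_with   <- with(d, mean(intake.pre - intake.post))  # names looked up inside d

attach(d)
m_attached <- mean(intake.pre - intake.post)         # names found via the attached d
detach(d)                                            # detach when finished

c(m_dollar, m_with, m_attached)                      # three identical values
```

With with there is nothing to detach afterwards, which can be convenient when several data frames have variables with the same names.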
EXERCISES

1.  How would you check whether two vectors are the same if they may contain
missing (NA) values? (Use of the identical function is considered cheating!)

2.  If x is a factor with n levels and y is a length n vector, what happens if you
compute y[x]?

3.  What happens if you change the levels of a factor (with levels) and give the
same value to two or more levels?

4.  Write the logical expression to use to extract girls between 7 and 14 years of
age in the juul data set of the library ISwR.
Remember that you can use either conditional selection with []
dataframe_name[conditions on the variables, separated by &, ]
or the subset function
subset(dataframe_name, conditions on the variables, separated by &)

[Suggestion: use real vectors that you can build in advance and try with them]
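
The two selection syntaxes can be tried on a small data frame built in advance. The data frame people below is a toy example invented for illustration, with conditions different from those of exercise 4.

```r
# Toy data frame, invented for illustration
people <- data.frame(
  age    = c(6, 9, 12, 15, 10),
  sex    = c("F", "M", "F", "F", "M"),
  height = c(115, 133, 150, 162, 140)
)

# Conditional selection with []: rows satisfying both conditions, all columns.
# The comma before ] is needed to select rows rather than columns.
r1 <- people[people$sex == "M" & people$height > 135, ]
r1

# The same selection with subset(): variable names can be used directly
r2 <- subset(people, sex == "M" & height > 135)
r2
```

The two forms differ when the condition evaluates to NA for some rows: [] returns such rows filled with NA, while subset drops them. This matters for data sets with many missing values.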
