
Lecture 1

Data Manipulation

Ionut Bebu

I. Bebu (DBBB) Data Manipulation 1 / 19


Objects
R is an implementation of the S language.
Everything in R is an object.
Every object in R has a class:
describes what the object contains and the functions that apply to it;
character, numeric, integer, logical, complex, list.
> year = 2008
> names(year)=c("Year")
> print(year) # or > year
Year
2008
> class(year)
[1] "numeric"

R is case sensitive.
> getwd()
[1] "C:/Program Files/R/R-2.5.0"
> ? setwd
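
As a small illustration of the classes listed above, the following sketch calls class() on one value of each kind. The literal values are arbitrary examples, not taken from the lecture.

```r
# Each value carries a class that determines how functions treat it.
class("2008")        # "character"
class(2008)          # "numeric"
class(2008L)         # "integer" (the L suffix requests an integer)
class(TRUE)          # "logical"
class(1 + 2i)        # "complex"
class(list(1, "a"))  # "list"

# Names are case sensitive: year and Year would be two different objects.
year <- 2008
class(year)          # "numeric"
```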



Vectors

Define vectors:
> x = c(2,5,1,7)
> x = rep(10,5)
> x = seq(from=2,to=10,length.out=4)

Operations
> length(x)
> sort(x)
> log(x)
> y = 1:4
> x%*%y
> crossprod(x,y)
> outer(x,y)
> sum(x); prod(x); cumsum(x)
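
The difference between the element-wise, inner and outer products is easy to miss, so here is a minimal sketch using the first definition of x from this slide together with y = 1:4.

```r
x <- c(2, 5, 1, 7)
y <- 1:4

x * y             # element-wise product: 2 10 3 28
x %*% y           # inner product, returned as a 1 x 1 matrix: 43
crossprod(x, y)   # the same inner product, computed as t(x) %*% y
dim(outer(x, y))  # the outer product is a 4 x 4 matrix with entries x[i] * y[j]
cumsum(x)         # running totals: 2 7 8 15
```

Note that %*% and crossprod() require x and y to have the same length, while outer() does not.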



Factors
How to define a factor:
> status=factor(c("cured","no improvement","cured","some improvement",
"marked improvement","cured"))
> status
[1] cured              no improvement     cured              some improvement
[5] marked improvement cured
Levels: cured marked improvement no improvement some improvement

The levels are ordered alphabetically.


The ordering can be specified by the user:
> status=factor(c("cured","no improvement","cured","some improvement",
"marked improvement","cured"), levels=c("cured","marked improvement",
"some improvement","no improvement"))
> status
[1] cured              no improvement     cured              some improvement
[5] marked improvement cured
Levels: cured marked improvement some improvement no improvement
> summary(status)
             cured marked improvement   some improvement     no improvement
                 3                  1                  1                  1
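
A factor stores integer codes that point into its levels, which is why the order of the levels matters. The sketch below reuses the status factor with user-specified levels and inspects those codes.

```r
status <- factor(c("cured", "no improvement", "cured", "some improvement",
                   "marked improvement", "cured"),
                 levels = c("cured", "marked improvement",
                            "some improvement", "no improvement"))

levels(status)      # the four labels, in the order given above
as.integer(status)  # internal codes: 1 4 1 3 2 1
table(status)       # counts per level, reported in level order: 3 1 1 1
```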



Data frames
Usually used to store a data matrix: rows are observations, columns are variables, and the columns may have different classes.
Define a data frame by collecting together several vectors:
> Age = c(65,58,73,59,68,70)
> w_dataframe = data.frame(Age,status)
> w_dataframe
Age status
1 65 cured
2 58 no improvement
3 73 cured
4 59 some improvement
5 68 marked improvement
6 70 cured
> summary(w_dataframe) # try it

Try
rownames(w_dataframe), colnames(w_dataframe), edit(w_dataframe)
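
The following sketch shows a few more ways to inspect and extend the data frame defined above. The AgeGroup column and its cut-off of 65 are hypothetical, introduced only for illustration.

```r
Age <- c(65, 58, 73, 59, 68, 70)
status <- factor(c("cured", "no improvement", "cured", "some improvement",
                   "marked improvement", "cured"))
w_dataframe <- data.frame(Age, status)

dim(w_dataframe)            # 6 rows and 2 columns
sapply(w_dataframe, class)  # Age is "numeric", status is "factor"
w_dataframe$Age             # extract a single column by name

# Add a derived column (hypothetical grouping, not part of the lecture data).
w_dataframe$AgeGroup <- ifelse(w_dataframe$Age >= 65, "65+", "<65")
```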



Matrices and arrays

Define a matrix:
> w_matrix = matrix(c(1,2,3,6,5,4),nrow=2,byrow=T)
> rownames(w_matrix) = 1:2
> colnames(w_matrix) = c("A","B","C")
> dim(w_matrix)
[1] 2 3
> A = matrix(0,nrow=2,ncol=4) # try it!

Define an array:
> w_array = array(1:12,c(2,2,3))
> dimnames(w_array) = list(letters[1:2],c("A","B"),
c("I","II","III"))
> w_array # try it!



Matrices

transpose, operations
> t(w_matrix); 3*w_matrix;
> u_matrix = matrix(1:4,ncol=2)
> u_matrix%*%w_matrix # multiplication
> det(u_matrix)
eigenvalues, inverse, trace
> A = matrix(1:9,ncol=3) + diag(rep(1,3))
> eigen(A)
> solve(A)
> sum(diag(A)) # trace
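
One way to check the results of solve() and eigen() is to verify the identities they should satisfy. This sketch uses the same matrix A as above and compares with all.equal(), which allows for rounding error.

```r
A <- matrix(1:9, ncol = 3) + diag(rep(1, 3))

# The inverse times the original matrix should be the identity.
A_inv <- solve(A)
all.equal(A %*% A_inv, diag(3))

# The trace equals the sum of the eigenvalues,
# and the determinant equals their product.
e <- eigen(A)
all.equal(sum(e$values), sum(diag(A)))
all.equal(prod(e$values), det(A))
```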



Indexing
The index can be:
a vector of positive integers:
> x = 1:10
> x[c(1,5)]
[1] 1 5

a vector of negative integers:


> x[-c(1,5)]
[1] 2 3 4 6 7 8 9 10

a logical vector:
> x[sqrt(x)==floor(sqrt(x))]
[1] 1 4 9

a vector of character strings:


> names(x) = letters[1:10]
> x[c("d","a","c","a")]
d a c a
4 1 3 1

Index a data frame

> w_dataframe[w_dataframe[,2]=="cured",]
> w_dataframe[w_dataframe[,c("Age")]>60,]
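
Logical indexing of a data frame extends naturally to several conditions. The sketch below rebuilds w_dataframe so that it runs on its own. The threshold values are arbitrary choices for illustration.

```r
Age <- c(65, 58, 73, 59, 68, 70)
status <- factor(c("cured", "no improvement", "cured", "some improvement",
                   "marked improvement", "cured"))
w_dataframe <- data.frame(Age, status)

# Combine two conditions with & (element-wise AND).
cured_over_65 <- w_dataframe[w_dataframe$status == "cured" &
                             w_dataframe$Age > 65, ]

# subset() evaluates the condition inside the data frame,
# so column names can be used directly.
over_60 <- subset(w_dataframe, Age > 60)

# which() converts a logical vector into row positions.
which(w_dataframe$Age > 60)   # 1 3 5 6
```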



Combining vectors, matrices and data frames
Combine vectors:
> a = 1:3; b = c(12,14,21,23)
> c(a,b)
[1] 1 2 3 12 14 21 23

Combine matrices and data frames:


> A = matrix(0,2,2)
> B = matrix(1:4,2,2)
> rbind(A,B)
[,1] [,2]
[1,] 0 0
[2,] 0 0
[3,] 1 3
[4,] 2 4
> cbind(A,B)
[,1] [,2] [,3] [,4]
[1,] 0 0 1 3
[2,] 0 0 2 4
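
The slide shows rbind() and cbind() on matrices only. The same functions also work on data frames, as sketched here with two small hypothetical data frames (df1 and df2 are made up for the example).

```r
# Two hypothetical data frames with identical column names.
df1 <- data.frame(id = 1:2, score = c(10, 20))
df2 <- data.frame(id = 3:4, score = c(30, 40))

# rbind() stacks rows; the columns are matched by name.
stacked <- rbind(df1, df2)

# cbind() adds columns; the new column must have one value per row.
widened <- cbind(stacked, passed = stacked$score >= 20)
widened
```

For data frames, rbind() expects the same column names in both inputs.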



Write in latex and html

> library(xtable)
> print(xtable(w_dataframe),type="latex")

Age status
1 65.00 cured
2 58.00 no improvement
3 73.00 cured
4 59.00 some improvement
5 68.00 marked improvement
6 70.00 cured

> write(print(xtable(w_dataframe),type="html"),"C://TEACHING/R
file:///C|/TEACHING/R/sterge.html

Age status
1 65.00 cured
2 58.00 no improvement
3 73.00 cured
4 59.00 some improvement
5 68.00 marked improvement
6 70.00 cured



Simple statistics
> mean(Age)
[1] 65.5
> sd(Age)
[1] 6.024948
> mean(Age[status=="cured"])
[1] 69.33333
> which(Age==max(Age))
[1] 3
> table(status)
> pie(table(status))
> barplot(table(status),beside=T,legend=T,
col=c("red","blue","green","yellow"))
> x1 = rnorm(52,mean=0,sd=1)
> x2 = rnorm(64,mean=1,sd=1)
> hist(x1);
> hist(x1,probability=T);lines(density(x1))
> plot(density(x1)); qqnorm(x1)
> smokes = c("Y","N","N","Y","N","Y","Y","Y","N","Y")
> amount = c(1,2,2,3,3,1,2,1,3,2)
> prop.table(table(smoking=smokes,amount),1) # row proportions
       amount
smoking         1         2         3
      N 0.0000000 0.5000000 0.5000000
      Y 0.5000000 0.3333333 0.1666667
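
Two of the steps above can be generalized: group means via tapply(), and the choice of margin when turning a two-way table into proportions. This is a minimal sketch that redefines the vectors from earlier slides so that it runs on its own.

```r
Age <- c(65, 58, 73, 59, 68, 70)
status <- factor(c("cured", "no improvement", "cured", "some improvement",
                   "marked improvement", "cured"))

# Mean age within every level of status, not only "cured".
group_means <- tapply(Age, status, mean)

smokes <- c("Y", "N", "N", "Y", "N", "Y", "Y", "Y", "N", "Y")
amount <- c(1, 2, 2, 3, 3, 1, 2, 1, 3, 2)
counts <- table(smokes, amount)

by_row <- prop.table(counts, 1)  # each row sums to 1
by_col <- prop.table(counts, 2)  # each column sums to 1
```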



Probability Distributions
Function names combine a prefix with the name of the distribution: "d" for the density,
"p" for the CDF, "q" for the quantile function and "r" for simulation

> dbinom(2,5,0.5) # probability of two heads in five tosses


> punif(4,3,5)
> qnorm(0.975,0,1)
> rnorm(20,mean=2,sd=1)

The log-likelihood is easy to write down, which is helpful for finding the
MLE. If x1, ..., x20 ~ N(1, 1), then the log-likelihood is obtained as

> x = rnorm(20,1,1)
> log_likelihood = sum(dnorm(x,1,1,log=T))
> log_likelihood
[1] -30.52741
> set.seed(23) # call before rnorm() to make the simulated values reproducible
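
To show how the log-likelihood leads to the MLE, the sketch below treats the mean as unknown (with the standard deviation fixed at 1) and maximizes numerically with optimize(). The search interval c(-5, 5) and the helper name log_lik are assumptions made for this example.

```r
set.seed(23)                       # fix the seed before simulating
x <- rnorm(20, mean = 1, sd = 1)

# Log-likelihood as a function of the unknown mean mu (sd assumed known).
log_lik <- function(mu) sum(dnorm(x, mean = mu, sd = 1, log = TRUE))

# Maximize over an interval assumed to contain the MLE.
fit <- optimize(log_lik, interval = c(-5, 5), maximum = TRUE)

fit$maximum   # numerical MLE of mu
mean(x)       # closed-form MLE for comparison

# The "p" and "q" functions are inverses of each other.
qnorm(pnorm(1.96))
```

For the normal model with known variance the MLE of the mean is the sample mean, so the two values should agree up to the tolerance of optimize().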



Tests for proportions

> prop.test(12,39,1/3) # z-test


> binom.test(12,39,1/3) # exact test
> prop.test(c(12,10,22),c(39,28,63)) # test equality of several
# proportions
> freq = c(20,21,24,27,22,36)
> probs = c(1,1,1,1,1,1)/6 # test the distribution
> chisq.test(freq,p=probs)
> yesbelt = c(12813,647,359,42)
> names(yesbelt)=c("None","minimal","minor","major")
> nobelt = c(65963,4000,2642,303);
> names(nobelt)=c("None","minimal","minor","major")
> chisq.test(data.frame(yesbelt,nobelt)) # test of independence
# compare two distributions
> die.fair = sample(1:6,200,p=c(1,1,1,1,1,1)/6,replace=T)
> die.bias = sample(1:6,100,p=c(.5,.5,1,1,1,2)/6,replace=T)
> res.fair = table(die.fair);res.bias = table(die.bias)
> chisq.test(rbind(res.fair,res.bias))
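
To see what chisq.test() computes in the goodness-of-fit case, this sketch rebuilds the statistic for the die frequencies above by hand and compares it with the function's result.

```r
freq <- c(20, 21, 24, 27, 22, 36)
probs <- rep(1 / 6, 6)

# Expected counts under the null hypothesis: 150 / 6 = 25 per face.
expected <- sum(freq) * probs

# Pearson statistic: sum of (observed - expected)^2 / expected.
x2 <- sum((freq - expected)^2 / expected)   # 176 / 25 = 7.04

# Upper-tail probability with (number of categories - 1) degrees of freedom.
p_value <- pchisq(x2, df = length(freq) - 1, lower.tail = FALSE)

res <- chisq.test(freq, p = probs)
all.equal(x2, unname(res$statistic))
all.equal(p_value, res$p.value)
```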



Simple statistical tests

> t_s = (mean(x1)-0)/(sd(x1)/sqrt(length(x1)))
> 2*pt(-abs(t_s),df=length(x1)-1) # one-sample t-test, two-sided p-value
> t.test(x1,mu=0)
> t.test(x1,x2) # two-sample t-test
> t.test(x,y,paired=TRUE) # paired t-test (x and y must have equal length)
> wilcox.test(x1,x2) # Wilcoxon two-sample test for the
# median
> library(MASS)
> x=mvrnorm(100,c(1,0),Sigma=matrix(c(2,1,1,4),ncol=2))
> cor.test(x[,1],x[,2])
# one-way ANOVA
> x1 = rnorm(20,0,1); x2 = rnorm(25,0.3,1); x3 = rnorm(22,-0.2,1)
> w = c(x1,x2,x3); ind = factor(c(rep(1,20),rep(2,25),rep(3,22)))
> oneway.test(w~ind,var.equal=T)
> anova(lm(w~ind))
# nonparametric ANOVA
> kruskal.test(w~ind)
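
The hand-computed one-sample t-test can be checked against t.test(). A minimal sketch follows, assuming a simulated sample (the seed is arbitrary). Using -abs(t_s) gives a valid two-sided p-value whatever the sign of the statistic.

```r
set.seed(1)                 # arbitrary seed, for reproducibility only
x1 <- rnorm(52, mean = 0, sd = 1)

# Test statistic for H0: mu = 0.
t_s <- (mean(x1) - 0) / (sd(x1) / sqrt(length(x1)))

# Two-sided p-value: twice the lower tail at -|t|.
p_manual <- 2 * pt(-abs(t_s), df = length(x1) - 1)

res <- t.test(x1, mu = 0)
all.equal(t_s, unname(res$statistic))
all.equal(p_manual, res$p.value)
```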



Linear models
> rubber_lm = lm(formula = loss ~ hard + tens, data = Rubber)
> summary(rubber_lm)
Call:
lm(formula = loss ~ hard + tens, data = Rubber)
Residuals:
Min 1Q Median 3Q Max
-79.385 -14.608 3.816 19.755 65.981
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 885.1611 61.7516 14.334 3.84e-14 ***
hard -6.5708 0.5832 -11.267 1.03e-11 ***
tens -1.3743 0.1943 -7.073 1.32e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 36.49 on 27 degrees of freedom
Multiple R-squared: 0.8402, Adjusted R-squared: 0.8284
F-statistic: 71 on 2 and 27 DF, p-value: 1.767e-11
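
Several numbers in the summary are functions of the others. The sketch below recomputes them from the rounded values printed above, so small rounding differences are expected. The sample size n = 30 is inferred from the 27 residual degrees of freedom and 3 estimated coefficients.

```r
# Values copied from the summary output (rounded as printed).
estimate_hard <- -6.5708
se_hard       <- 0.5832
r2            <- 0.8402
n             <- 30   # 27 residual df + 3 coefficients
p             <- 2    # number of predictors

# t value = estimate / standard error.
t_hard <- estimate_hard / se_hard                 # about -11.27

# Adjusted R-squared penalizes for the number of predictors.
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)    # about 0.828

# Overall F-statistic expressed through R-squared.
f_stat <- (r2 / p) / ((1 - r2) / (n - p - 1))     # about 71
```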



Linear models
> par(mfrow=c(2,2))
> plot(rubber_lm)

[Figure: the four diagnostic plots produced by plot(rubber_lm) in a 2 x 2 layout:
Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage (with
Cook's distance contours). Observations 14, 19 and 28 are labelled in the plots.]



Linear models: select a model + test linear hypothesis

> rubber_lm2 = lm(formula = loss ~ hard + tens, data = Rubber)
> rubber_lm1 = lm(formula = loss ~ hard , data = Rubber)
> anova(rubber_lm1,rubber_lm2)
Analysis of Variance Table
Model 1: loss ~ hard
Model 2: loss ~ hard + tens
  Res.Df    RSS Df Sum of Sq      F    Pr(>F)
1     28 102556
2     27  35950  1     66607 50.025 1.325e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

> library(car)
> linearHypothesis(rubber_lm,c(0,1,-1)) # called linear.hypothesis() in older versions of car
Linear hypothesis test
Hypothesis: hard - tens = 0
Model 1: loss ~ hard + tens
Model 2: restricted model
Res.Df RSS Df Sum of Sq F Pr(>F)
1 27 35950
2 28 151916 -1 -115966 87.096 6.118e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
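
The F statistic reported by anova(rubber_lm1, rubber_lm2) can be rebuilt from the two residual sums of squares. This sketch uses the rounded RSS values printed in that table, so the result agrees only up to rounding.

```r
# Residual sums of squares and degrees of freedom from the anova table.
rss_small <- 102556; df_small <- 28   # loss ~ hard
rss_large <- 35950;  df_large <- 27   # loss ~ hard + tens

# F = (drop in RSS per extra parameter) / (residual mean square of larger model).
f_stat <- ((rss_small - rss_large) / (df_small - df_large)) /
          (rss_large / df_large)      # about 50.02

# Upper-tail probability of the F distribution with (1, 27) degrees of freedom.
p_value <- pf(f_stat, df_small - df_large, df_large, lower.tail = FALSE)
```

A small p-value indicates that adding tens significantly reduces the residual sum of squares.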



Questions
1. Consider the "UScereal" data set:

> library("MASS")
> data("UScereal")

1 write "UScereal" to a .txt file and read it back;
2 write "UScereal" in "latex" and in "html";
3 present descriptive statistics for all the variables, both numerical and graphical;
4 is the "protein" content the same for all levels of "vitamins"? Check the conditions.
2. In an effort to increase student retention, many colleges have tried block programs.
Suppose 100 students are broken into two groups of 50 at random. One half are in a block
program, the other half not. The number of years in attendance is then measured. We wish
to test if the block program makes a difference in retention. The data is:

Program      1 yr   2 yr   3 yr   4 yr   5+ yrs
Non-Block     18     15      5      8      4
Block         10      5      7     18     10

Is there a difference between the two types of programs in terms of retention?



Questions
3. Consider the punting data punting.txt. Fit the three separate models

$E(y_i) = \beta_0 + \beta_1 x_{1i}$
$E(y_i) = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i}$
$E(y_i) = \beta_0 + \beta_1 x_{2i}$

where $x_{1i}$ is the right-leg strength and $x_{2i}$ is the left-leg strength.

1 Comment on which model seems to be more appropriate.
2 Test $H_0: \beta_2 = 0$ against $H_1: \beta_2 \neq 0$. Interpret the results.
4. Consider the epa.txt data set. The amount of magnesium uptake is measured at several
levels of time. It is anticipated that the two treatments used may result in different regression
equations.
1 A model is postulated in which calcium uptake is regressed against time in a quadratic
regression,
$E(y) = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 z$,
where z is an indicator variable accounting for the treatment. Fit this regression model.
2 We need to determine if the simple indicator variable is actually appropriate. Suppose we
rewrite the model

$E(y_i) = \beta_0 + \beta_1 x_i + \beta_2 x_i^2$ (treatment 1)
$E(y_i) = \gamma_0 + \gamma_1 x_i + \gamma_2 x_i^2$ (treatment 2)

Test $H_0: (\beta_1, \beta_2) = (\gamma_1, \gamma_2)$.


