Lecture 1: Data Manipulation

Ionut Bebu
I. Bebu (DBBB) Data Manipulation 1 / 19

Objects
R is an implementation of the S language.
Everything in R is an object.
Every object in R has a class:
describes what the object contains and the functions that apply to it;
character, numeric, integer, logical, complex, list.
> year = 2008
> names(year)=c("Year")
> print(year) # or > year
Year
2008
> class(year)
[1] "numeric"
R is case sensitive.
> getwd()
[1] "C:/Program Files/R/R-2.5.0"
> ?setwd # help page for changing the working directory
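To make the idea of a class concrete, the short sketch below applies class() to one object of each type listed above. The object names are illustrative and do not come from the lecture.

```r
# Illustrative objects (names are assumptions, not from the lecture)
a_number  <- 2008                     # numbers are stored as doubles by default
an_int    <- 2008L                    # the L suffix requests an integer
a_string  <- "Year"
a_flag    <- TRUE
a_complex <- 1 + 2i
a_list    <- list(a_number, a_string) # a list can mix types

classes <- c(class(a_number), class(an_int), class(a_string),
             class(a_flag), class(a_complex), class(a_list))
print(classes)
```

Note that 2008 and 2008L print identically but have different classes.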

Vectors
Define vectors:
> x = c(2,5,1,7)
> x = rep(10,5)
> x = seq(from=2,to=10,length.out=4)
Operations
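> length(x) # number of elements
> sort(x) # increasing order
> log(x) # elementwise natural logarithm
> y = 1:4
> x%*%y # inner product, returned as a 1 x 1 matrix
> crossprod(x,y) # t(x) %*% y, the same inner product
> outer(x,y) # 4 x 4 matrix with entries x[i]*y[j]
> sum(x); prod(x); cumsum(x)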
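The following sketch works through the vector operations above on the first vector defined on this slide, so the results can be checked by hand. The result names (inner, cp, op) are chosen here for illustration.

```r
x <- c(2, 5, 1, 7)
y <- 1:4

inner <- x %*% y        # 1 x 1 matrix: 2*1 + 5*2 + 1*3 + 7*4 = 43
cp    <- crossprod(x, y) # same value, computed as t(x) %*% y
op    <- outer(x, y)     # 4 x 4 matrix, op[i, j] = x[i] * y[j]

print(inner)
print(dim(op))
print(cumsum(x))         # running totals: 2 7 8 15
```

The inner product is a matrix rather than a scalar; use sum(x*y) or drop(x %*% y) when a plain number is needed.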

Factors
How to define a factor:
> status=factor(c("cured","no improvement","cured","some improvement",
"marked improvement","cured"))
> status
[1] cured              no improvement     cured              some improvement
[5] marked improvement cured
Levels: cured marked improvement no improvement some improvement
The levels are ordered alphabetically.

The ordering can be specified by the user:
> status=factor(c("cured","no improvement","cured","some improvement",
"marked improvement","cured"),
levels=c("cured","marked improvement","some improvement","no improvement"))
> status
[1] cured              no improvement     cured              some improvement
[5] marked improvement cured
Levels: cured marked improvement some improvement no improvement
> summary(status)
             cured marked improvement   some improvement     no improvement
                 3                  1                  1                  1
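A factor stores each value as an integer code pointing into its levels, which is why the order of the levels matters. The sketch below, using the same data as above, shows the codes that result from the user-specified ordering; the name codes is illustrative.

```r
status <- factor(c("cured", "no improvement", "cured", "some improvement",
                   "marked improvement", "cured"),
                 levels = c("cured", "marked improvement",
                            "some improvement", "no improvement"))

codes <- as.integer(status)  # position of each value within levels(status)
print(levels(status))
print(codes)

# Mapping the codes back through the levels recovers the original labels
recovered <- levels(status)[codes]
print(recovered)
```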

Data frames
Usually used to store a data matrix: rows are observations and columns are variables, which may have different types.
Define a data frame by collecting together several vectors:
> Age = c(65,58,73,59,68,70)
> w_dataframe = data.frame(Age,status)
> w_dataframe
  Age             status
1  65              cured
2  58     no improvement
3  73              cured
4  59   some improvement
5  68 marked improvement
6  70              cured
> summary(w_dataframe) # try it
Try
rownames(w_dataframe), colnames(w_dataframe), edit(w_dataframe)
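As a brief illustration of how a data frame keeps columns of different types together, the sketch below rebuilds w_dataframe and inspects it. The helper name col_classes is introduced here for illustration only.

```r
Age <- c(65, 58, 73, 59, 68, 70)
status <- factor(c("cured", "no improvement", "cured", "some improvement",
                   "marked improvement", "cured"))
w_dataframe <- data.frame(Age, status)

str(w_dataframe)                           # compact overview: one line per column
col_classes <- sapply(w_dataframe, class)  # class of each column
print(col_classes)

print(w_dataframe$Age)                     # extract one column with $
print(dim(w_dataframe))                    # rows, columns
```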

Matrices and arrays
Define a matrix:
> w_matrix = matrix(c(1,2,3,6,5,4),nrow=2,byrow=T)
> rownames(w_matrix) = 1:2
> colnames(w_matrix) = c("A","B","C")
> dim(w_matrix)
[1] 2 3
> A = matrix(0,nrow=2,ncol=4) # try it!
Define an array:
> w_array = array(1:12,c(2,2,3))
> dimnames(w_array) = list(letters[1:2],c("A","B"),
c("I","II","III"))
> w_array # try it!

Matrices
transpose, operations
> t(w_matrix); 3*w_matrix;
> u_matrix = matrix(1:4,ncol=2)
> u_matrix%*%w_matrix # multiplication
> det(u_matrix)
eigenvalues, inverse, trace
> A = matrix(1:9,ncol=3) + diag(rep(1,3))
> eigen(A)
> solve(A)
> sum(diag(A)) # trace
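Two standard identities give a quick sanity check on the functions above: the inverse times the matrix is the identity, and the trace and determinant equal the sum and product of the eigenvalues. The sketch below checks them numerically for the matrix A defined on this slide; the result names are illustrative.

```r
A <- matrix(1:9, ncol = 3) + diag(rep(1, 3))

A_inv <- solve(A)           # matrix inverse
ev    <- eigen(A)$values    # eigenvalues
tr    <- sum(diag(A))       # trace: 2 + 6 + 10 = 18

print(round(A_inv %*% A, 10))  # identity matrix up to rounding error
print(c(trace = tr, sum_eigen = sum(ev)))
print(c(det = det(A), prod_eigen = prod(ev)))
```

Comparisons of floating-point results are best made with all.equal() rather than ==.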

Indexing
The index can be:
a vector of positive integers:
> x = 1:10
> x[c(1,5)]
[1] 1 5
a vector of negative integers:

> x[-c(1,5)]
[1] 2 3 4 6 7 8 9 10
a logical vector:
> x[sqrt(x)==floor(sqrt(x))]
[1] 1 4 9
a vector of character strings:

> names(x) = letters[1:10]
> x[c("d","a","c","a")]
d a c a
4 1 3 1
Index a data frame
> w_dataframe[w_dataframe[,2]=="cured",]
> w_dataframe[w_dataframe[,c("Age")]>60,]
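Logical indices can be combined, which is often the most convenient way to select rows of a data frame. The sketch below is one possible illustration using the data frame from the earlier slide; the result names are hypothetical.

```r
Age <- c(65, 58, 73, 59, 68, 70)
status <- factor(c("cured", "no improvement", "cured", "some improvement",
                   "marked improvement", "cured"))
w_dataframe <- data.frame(Age, status)

# Combine conditions with & (and) or | (or); keep all columns
cured_over_60 <- w_dataframe[w_dataframe$Age > 60 &
                             w_dataframe$status == "cured", ]
print(cured_over_60)

# which() turns a logical vector into the positions that are TRUE
rows_under_60 <- which(w_dataframe$Age < 60)
print(rows_under_60)
```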

Combining vectors, matrices and data frames
Combine vectors:
> a = 1:3; b = c(12,14,21,23)
> c(a,b)
[1] 1 2 3 12 14 21 23
Combine matrices and data frames:

> A = matrix(0,2,2)
> B = matrix(1:4,2,2)
> rbind(A,B)
     [,1] [,2]
[1,]    0    0
[2,]    0    0
[3,]    1    3
[4,]    2    4
> cbind(A,B)
     [,1] [,2] [,3] [,4]
[1,]    0    0    1    3
[2,]    0    0    2    4

Write in latex and html
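> library(xtable) # xtable() is provided by the xtable package
> print(xtable(w_dataframe),type="latex") # prints LaTeX code; the compiled table is: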
Age status
1 65.00 cured
2 58.00 no improvement
3 73.00 cured
4 59.00 some improvement
5 68.00 marked improvement
6 70.00 cured
> write(print(xtable(w_dataframe),type="html"),"C://TEACHING/R
file:///C|/TEACHING/R/sterge.html
Age status
1 65.00 cured
2 58.00 no improvement
3 73.00 cured
4 59.00 some improvement
5 68.00 marked improvement
6 70.00 cured

Simple statistics
> mean(Age)
[1] 65.5
> sd(Age)
[1] 6.024948
> mean(Age[status=="cured"])
[1] 69.33333
> which(Age==max(Age))
[1] 3
> table(status)
> pie(table(status))
> barplot(table(status),beside=T,legend=T,
col=c("red","blue","green","yellow"))
> x1 = rnorm(52,mean=0,sd=1)
> x2 = rnorm(64,mean=1,sd=1)
> hist(x1);
> hist(x1,probability=T);lines(density(x1))
> plot(density(x1)); qqnorm(x1)
> smokes = c("Y","N","N","Y","N","Y","Y","Y","N","Y")
> amount = c(1,2,2,3,3,1,2,1,3,2)
> prop.table(table(smoking=smokes,amount=amount),1) # row proportions
       amount
smoking         1         2         3
      N 0.0000000 0.5000000 0.5000000
      Y 0.5000000 0.3333333 0.1666667

Probability Distributions
"d" for the density, "p" for the CDF, "q" for the quantile function and "r" for
simulation
> dbinom(2,5,0.5) # probability of two heads in five tosses

> punif(4,3,5)
> qnorm(0.975,0,1)
> rnorm(20,mean=2,sd=1)
Easy to write down the log-likelihood, which will be helpful for finding the
MLE. If x1, ..., x20 ~ N(1, 1), then the log-likelihood is obtained as
> x = rnorm(20,1,1)
> log_likelihood = sum(dnorm(x,1,1,log=T))
> log_likelihood # the value depends on the simulated sample
[1] -30.52741
> set.seed(23) # call before rnorm() to make a simulation reproducible
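The sketch below illustrates the d/p/q/r naming convention and one way the log-likelihood can be used to find the MLE numerically. It assumes the standard deviation is known to be 1 and maximizes over the mean only; the function name loglik and the search interval are choices made here, not part of the lecture.

```r
# d, p, q: density/mass, CDF and quantile function
p_two_heads <- dbinom(2, 5, 0.5)   # choose(5, 2) / 2^5 = 0.3125
p_unif      <- punif(4, 3, 5)      # (4 - 3) / (5 - 3) = 0.5
roundtrip   <- qnorm(pnorm(1.5))   # q inverts p, so this returns 1.5

# r: simulation, with a seed set first for reproducibility
set.seed(23)
x <- rnorm(20, mean = 1, sd = 1)

# Log-likelihood as a function of the unknown mean (sd assumed known, = 1)
loglik <- function(mu) sum(dnorm(x, mean = mu, sd = 1, log = TRUE))

# Numerical maximization over an assumed search interval
fit    <- optimize(loglik, interval = c(-5, 5), maximum = TRUE)
mu_hat <- fit$maximum

print(c(mu_hat = mu_hat, sample_mean = mean(x)))
```

For this model the MLE of the mean is the sample mean, so the two printed values should agree up to the optimizer's tolerance.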

Tests for proportions
> prop.test(12,39,1/3) # z-test

> binom.test(12,39,1/3) # exact test
> prop.test(c(12,10,22),c(39,28,63)) # test equality of several
# proportions
> freq = c(20,21,24,27,22,36)
> probs = c(1,1,1,1,1,1)/6 # test the distribution
> chisq.test(freq,p=probs)
> yesbelt = c(12813,647,359,42)
> names(yesbelt)=c("None","minimal","minor","major")
> nobelt = c(65963,4000,2642,303);
> names(nobelt)=c("None","minimal","minor","major")
> chisq.test(data.frame(yesbelt,nobelt)) # test of independence
# compare two distributions
> die.fair = sample(1:6,200,p=c(1,1,1,1,1,1)/6,replace=T)
> die.bias = sample(1:6,100,p=c(.5,.5,1,1,1,2)/6,replace=T)
> res.fair = table(die.fair);res.bias = table(die.bias)
> chisq.test(rbind(res.fair,res.bias))
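To show what chisq.test() computes in the goodness-of-fit case, the sketch below recomputes the statistic for the die frequencies above by hand and compares it with the built-in result. The names expected, stat_manual and p_manual are illustrative.

```r
freq  <- c(20, 21, 24, 27, 22, 36)
probs <- c(1, 1, 1, 1, 1, 1) / 6

expected    <- sum(freq) * probs                    # 150 / 6 = 25 per face
stat_manual <- sum((freq - expected)^2 / expected)  # 176 / 25 = 7.04
p_manual    <- pchisq(stat_manual, df = length(freq) - 1, lower.tail = FALSE)

test <- chisq.test(freq, p = probs)
print(c(manual = stat_manual, builtin = unname(test$statistic)))
print(c(manual = p_manual, builtin = test$p.value))
```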

Simple statistical tests
> t_s = (mean(x1)-0)/(sd(x1)/sqrt(length(x1)))

> 2*pt(-abs(t_s),df=length(x1)-1) # two-sided p-value, one-sample t-test
> t.test(x1,mu=0)
> t.test(x1,x2) # two-sample t-test
> t.test(x,y,paired=TRUE) # paired t-test; x and y must have the same length
> wilcox.test(x1,x2) # Wilcoxon two-sample test for the
# median
> library(MASS)
> x=mvrnorm(100,c(1,0),Sigma=matrix(c(2,1,1,4),ncol=2))
> cor.test(x[,1],x[,2])
# one-way ANOVA
> x1 = rnorm(20,0,1); x2 = rnorm(25,0.3,1); x3 = rnorm(22,-0.2,1)
> w = c(x1,x2,x3); ind = factor(c(rep(1,20),rep(2,25),rep(3,22)))
> oneway.test(w~ind,var.equal=T)
> anova(lm(w~ind))
# nonparametric ANOVA
> kruskal.test(w~ind)
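The one-sample t-test at the top of this slide can be checked against t.test() directly. The sketch below uses a freshly simulated sample with an arbitrary seed (an assumption made here for reproducibility) and computes the two-sided p-value from the lower tail at -|t|.

```r
set.seed(1)                       # arbitrary seed, chosen for this sketch
x1 <- rnorm(52, mean = 0, sd = 1)

# t statistic for H0: mu = 0
t_s <- (mean(x1) - 0) / (sd(x1) / sqrt(length(x1)))

# Two-sided p-value: twice the lower-tail probability at -|t|
p_manual  <- 2 * pt(-abs(t_s), df = length(x1) - 1)
p_builtin <- t.test(x1, mu = 0)$p.value

print(c(manual = p_manual, builtin = p_builtin))
```

Using -abs(t_s) matters: with a positive statistic, 2*pt(t_s, df) would exceed 1.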

Linear models
> rubber_lm = lm(formula = loss ~ hard + tens, data = Rubber)
> summary(rubber_lm)
Call:
lm(formula = loss ~ hard + tens, data = Rubber)
Residuals:
    Min      1Q  Median      3Q     Max
-79.385 -14.608   3.816  19.755  65.981
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 885.1611    61.7516  14.334 3.84e-14 ***
hard         -6.5708     0.5832 -11.267 1.03e-11 ***
tens         -1.3743     0.1943  -7.073 1.32e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 36.49 on 27 degrees of freedom
Multiple R-squared: 0.8402, Adjusted R-squared: 0.8284
F-statistic: 71 on 2 and 27 DF, p-value: 1.767e-11
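The coefficient estimates reported by lm() are the least-squares solution of the normal equations. As a minimal sketch, assuming the Rubber data come from the MASS package (loaded earlier in the lecture), they can be reproduced with matrix operations; the names X and beta_hat are illustrative.

```r
library(MASS)  # assumed source of the Rubber data set

# Design matrix: a column of ones for the intercept, then the two predictors
X <- cbind(1, Rubber$hard, Rubber$tens)
y <- Rubber$loss

# Solve the normal equations (X'X) beta = X'y
beta_hat <- solve(crossprod(X), crossprod(X, y))

rubber_lm <- lm(loss ~ hard + tens, data = Rubber)
print(cbind(normal_equations = as.vector(beta_hat),
            lm = unname(coef(rubber_lm))))
```

lm() itself uses a QR decomposition, which is numerically more stable than forming X'X; the normal equations are shown here only because they are easy to read.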

Linear models
> par(mfrow=c(2,2))
> plot(rubber_lm)
[Figure: four diagnostic plots for rubber_lm in a 2 x 2 layout: Residuals vs Fitted, Normal Q-Q (standardized residuals against theoretical quantiles), Scale-Location, and Residuals vs Leverage with Cook's distance contours. Observations 14, 19 and 28 are labelled in each panel.]

Linear models: select a model + test linear hypothesis
> rubber_lm2 = lm(formula = loss ~ hard + tens, data = Rubber)

> rubber_lm1 = lm(formula = loss ~ hard , data = Rubber)
> anova(rubber_lm1,rubber_lm2)
Analysis of Variance Table
Model 1: loss ~ hard
Model 2: loss ~ hard + tens
  Res.Df    RSS Df Sum of Sq      F    Pr(>F)
1     28 102556
2     27  35950  1     66607 50.025 1.325e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> library(car)
> linearHypothesis(rubber_lm,c(0,1,-1)) # called linear.hypothesis in old versions of car
Linear hypothesis test
Hypothesis: hard - tens = 0
Model 1: loss ~ hard + tens
Model 2: restricted model
  Res.Df    RSS Df Sum of Sq      F    Pr(>F)
1     27  35950
2     28 151916 -1   -115966 87.096 6.118e-10 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
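The F statistic that anova() reports when comparing loss ~ hard with loss ~ hard + tens can be recovered from the residual sums of squares in that table. The sketch below uses the rounded values printed on this slide, so the result agrees with the reported 50.025 only up to rounding; the variable names are illustrative.

```r
# Values read from the anova table (rounded as printed on the slide)
rss_small <- 102556; df_small <- 28   # Model 1: loss ~ hard
rss_big   <- 35950;  df_big   <- 27   # Model 2: loss ~ hard + tens

# F = (drop in RSS per extra parameter) / (residual mean square of the larger model)
F_stat  <- ((rss_small - rss_big) / (df_small - df_big)) / (rss_big / df_big)
p_value <- pf(F_stat, df_small - df_big, df_big, lower.tail = FALSE)

print(c(F = F_stat, p = p_value))
```

With a single added term, this F statistic is the square of the t value for tens in the summary of the larger model.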

Questions
1. Consider the "UScereal" data set:
> library('MASS')
> data('UScereal')
1 write "UScereal" to a .txt file and read it back;
2 write "UScereal" in "latex" and in "html";
3 present descriptive statistics for all the variables, both numerical and graphical;
4 is the "protein" content the same for all levels of "vitamins"? check the conditions.
2. In an effort to increase student retention, many colleges have tried block programs.
Suppose 100 students are broken into two groups of 50 at random. One half are in a block
program, the other half not. The number of years in attendance is then measured. We wish
to test if the block program makes a difference in retention. The data are:
Program     1 yr  2 yr  3 yr  4 yr  5+ yrs
Non-Block     18    15     5     8       4
Block         10     5     7    18      10
Is there a difference between the two types of programs in terms of retention?

Questions
3. Consider the punting data punting.txt. Fit the three separate models
$E(y_i) = \beta_0 + \beta_1 x_{1i}$
$E(y_i) = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i}$
$E(y_i) = \beta_0 + \beta_1 x_{2i}$
where $x_{1i}$ is the right-leg strength and $x_{2i}$ is the left-leg strength.
1 Comment on which model seems to be more appropriate.
2 Test $H_0: \beta_2 = 0$ against $H_1: \beta_2 \neq 0$. Interpret the results.
4. Consider the epa.txt data set. The amount of magnesium uptake is measured at several
levels of time. It is anticipated that the two treatments used may result in different regression
equations.
1 A model is postulated in which calcium uptake is regressed against time in a quadratic
regression,
$E(y) = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 z,$
where $z$ is an indicator variable accounting for the treatment. Fit this regression model.
2 We need to determine if the simple indicator variable is actually appropriate. Suppose we
rewrite the model
$E(y_i) = \beta_0 + \beta_1 x_i + \beta_2 x_i^2$ (treatment 1)
$E(y_i) = \gamma_0 + \gamma_1 x_i + \gamma_2 x_i^2$ (treatment 2)
Test $H_0: (\beta_1, \beta_2) = (\gamma_1, \gamma_2)$.

