
Basics of R – Exercises

Read the instructions closely!

Lines starting with ">" contain R code and should be typed without the ">" sign. Code and
R output are typeset in Courier font to separate them from normal text.

This exercise has been written so that you should test every command, and see what they do
yourself. If you need help, just ask!

The dataset consists of 17 bioinformatics students, who have given their height and shoe size
measurements for teaching purposes.

1. Reading the data into R

There is a folder named data on your desktop. Its subfolder students contains the file students.txt. Open it in
Excel and check which columns it contains. Note the column headers. You can leave the file open in
Excel in case you need to see it later; otherwise just close it.

Open R by double-clicking on its icon on the desktop. Go to the menu File, and select option
Change Dir. Change the directory to the directory where students.txt file is located.

Read the data into an R object named students (the data is in a tab-delimited text file with a title
for every column):

> students<-read.table("students.txt", header=T, sep="\t")

Check that R read the file correctly (objects can be printed just by typing their name):

> students
   height shoesize gender population
1     181       44   male     kuopio
2     160       38 female     kuopio
3     174       42 female     kuopio
4     170       43   male     kuopio
5     172       43   male     kuopio
6     165       39 female     kuopio
7     161       38 female     kuopio
8     167       38 female    tampere
9     164       39 female    tampere
10    166       38 female    tampere
11    162       37 female    tampere
12    158       36 female    tampere
13    175       42   male    tampere
14    181       44   male    tampere
15    180       43   male    tampere
16    177       43   male    tampere
17    173       41   male    tampere

You can also print the column headers only (sometimes the whole table does not fit on the screen,
and this might be more helpful):
> names(students)
[1] "height" "shoesize" "gender" "population"
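
To see what the header and sep arguments do without depending on the file on the desktop, the sketch below writes the first three rows of the table to a temporary file and reads them back. The temporary file and the object name demo are assumptions made for illustration; they are not part of the exercise.

```r
# Write a small tab-delimited file (first three students from the table above)
tmp <- tempfile(fileext = ".txt")
writeLines(c("height\tshoesize\tgender\tpopulation",
             "181\t44\tmale\tkuopio",
             "160\t38\tfemale\tkuopio",
             "174\t42\tfemale\tkuopio"), tmp)

# header=TRUE takes the column names from the first line; sep="\t" splits on tabs
demo <- read.table(tmp, header = TRUE, sep = "\t")
names(demo)
str(demo)   # shows the type R chose for each column
```

If the column names come out as V1, V2, ..., the header argument was probably left out.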

2. Prepare the data for analysis

Individual columns can be accessed using the following syntax: first comes the name of the object,
followed by a dollar sign, after which comes the name of the column:

> students$height
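
The dollar sign is one of several equivalent ways to pull out a column. A minimal sketch on a made-up three-row data frame (the name demo is an assumption, chosen so that your students object is not overwritten):

```r
# Hypothetical three-row data frame for illustration
demo <- data.frame(height = c(181, 160, 174),
                   shoesize = c(44, 38, 42))

a <- demo$height          # dollar-sign syntax
b <- demo[["height"]]     # double brackets with the column name
c3 <- demo[, "height"]    # row/column subscripts, all rows

# All three give the same numeric vector
identical(a, b) && identical(b, c3)
```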

3. Simple statistics

What are the mean height and shoe size? Type:

> mean(students$height)
[1] 169.7647
> mean(students$shoesize)
[1] 40.47059

What about standard deviations? Type:

> sd(students$height)
[1] 7.578996
> sd(students$shoesize)
[1] 2.695312
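
If you want to check what mean and sd actually compute, the sketch below recalculates both by hand from the 17 heights listed in the table above. Note that sd divides by n-1, not n. The variable names are only for illustration.

```r
# The 17 heights from the students table
height <- c(181, 160, 174, 170, 172, 165, 161, 167, 164,
            166, 162, 158, 175, 181, 180, 177, 173)

n <- length(height)
m <- sum(height) / n                        # same as mean(height)
s <- sqrt(sum((height - m)^2) / (n - 1))    # same as sd(height)
c(mean = m, sd = s)
```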

What are the gender and sampling site distributions (how many observations are in each group)?
Type:

> table(students$gender)
gender
female male
9 8
> table(students$population)
population
kuopio tampere
7 10

The command table can also be used for cross-tabulations:

> table(students$gender,students$population)
population
gender kuopio tampere
female 4 5
male 3 5
4. Useful plots

Graphical inspection usually makes interpretation easier. How are the heights distributed? To draw a
histogram, type:

> hist(students$height)

That's the distribution for the whole dataset. But is there a difference in heights between the
genders? That can be studied using a box plot. In this case the variable height is divided into two
groups using the variable gender, and a separate box is drawn for each group:

> boxplot(students$height~ students$gender)
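
The same formula can also be written with a data argument, which saves typing the object name twice. A minimal sketch on made-up data (the data frame demo is an assumption; pdf(NULL) only keeps the example from opening a window and can be left out in an interactive session):

```r
# Hypothetical data: six students with height and gender
demo <- data.frame(height = c(181, 160, 174, 170, 172, 165),
                   gender = factor(c("male", "female", "female",
                                     "male", "male", "female")))

pdf(NULL)                                   # null device: nothing is written to disk
b <- boxplot(height ~ gender, data = demo)  # response ~ grouping variable
dev.off()

b$names   # group labels, in the order the boxes are drawn
```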

So, there is a large difference in height between the genders. Does the same apply to the sampling
sites? Write the code for this yourself.

How are height and shoe size related? You can get a graphical view of this by making a scatter plot:

> plot(students$height, students$shoesize)

5. Recoding variables

What if we want to differentiate between males and females in the plot? Let’s use different plotting
symbols for males and females.

First, we need a vector of plotting symbols. Let's plot females with F and males with M. The new
vector can be produced by the command ifelse:

> sym<-ifelse(students$gender=="male", "M", "F")

Now plot the image again:

> plot(students$height, students$shoesize, pch=sym)

Check from the help file what are the arguments for ifelse command.
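
As a quick illustration of those arguments: ifelse(test, yes, no) goes through a logical vector element by element and picks the yes value where the test is TRUE and the no value where it is FALSE. The short vector below is made up for the example.

```r
gender <- c("male", "female", "female", "male")   # made-up input

# test = (gender == "male"), yes = "M", no = "F"
sym.demo <- ifelse(gender == "male", "M", "F")
sym.demo
```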

We can even represent different populations with colors. Let's recode the population variable with
color names (Kuopio=blue, Tampere=red):

> cols<-ifelse(students$population=="kuopio", "Blue", "Red")

Now plot the image again:

> plot(students$height, students$shoesize, pch=sym, col=cols)

There are only 16 symbols visible on the plot. Can you figure out where the missing one went?

6. Making a new dataset

Make a new dataset from the variables height, shoesize, sym and cols:

> students.new<-data.frame(students$height, students$shoesize,
  sym, cols)

Check that the new dataset is OK:

> students.new
   students.height students.shoesize sym cols
1              181                44   M Blue
2              160                38   F Blue
3              174                42   F Blue
4              170                43   M Blue
5              172                43   M Blue
6              165                39   F Blue
7              161                38   F Blue
8              167                38   F  Red
9              164                39   F  Red
10             166                38   F  Red
11             162                37   F  Red
12             158                36   F  Red
13             175                42   M  Red
14             181                44   M  Red
15             180                43   M  Red
16             177                43   M  Red
17             173                41   M  Red

> class(students.new)
[1] "data.frame"
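
The column names students.height and students.shoesize were generated automatically from the expressions given to data.frame. If you prefer shorter names, you can name the columns yourself when building the data frame. A minimal sketch with three made-up rows (demo.new and the values are assumptions for illustration):

```r
# Three illustrative rows
height   <- c(181, 160, 174)
shoesize <- c(44, 38, 42)
sym      <- c("M", "F", "F")
cols     <- c("Blue", "Blue", "Blue")

# name = value pairs set the column names explicitly
demo.new <- data.frame(height = height, shoesize = shoesize,
                       sym = sym, cols = cols)
names(demo.new)
```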

7. Extracting a subset from a dataset

Make two subsets of the dataset students. Split it in two according to gender.

First, check which individuals are males:

> which(students$gender=="male")
[1]  1  4  5 13 14 15 16 17

Based on that, use subscripts to select the correct subset (take only rows for which gender is male):

> students.male<-students[which(students$gender=="male"),]
> students.male
   height shoesize gender population
1     181       44   male     kuopio
4     170       43   male     kuopio
5     172       43   male     kuopio
13    175       42   male    tampere
14    181       44   male    tampere
15    180       43   male    tampere
16    177       43   male    tampere
17    173       41   male    tampere

Similarly, make a new dataset from females.
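
There are other ways to pick rows besides which. The sketch below compares three of them on a made-up data frame (demo is an assumption for illustration). When the data contain no missing values, all three select the same rows.

```r
# Hypothetical four-row data frame
demo <- data.frame(height = c(181, 160, 174, 170),
                   gender = c("male", "female", "female", "male"))

by.which   <- demo[which(demo$gender == "male"), ]   # row numbers
by.logical <- demo[demo$gender == "male", ]          # TRUE/FALSE vector directly
by.subset  <- subset(demo, gender == "male")         # subset() looks up gender inside demo

rownames(by.which)
rownames(by.logical)
rownames(by.subset)
```

The versions differ when the grouping variable has missing values: which and subset drop those rows, while plain logical indexing returns rows filled with NA for them.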

Sometimes we want to split the dataset using some continuous variable, such as height. Typically the
median of the variable is used. Make two new datasets that contain individuals below and above
the median height:

> median(students$height)
[1] 170
> students.short<- students[which(students$height<=
  median(students$height)),]
> students.short
   height shoesize gender population
2     160       38 female     kuopio
4     170       43   male     kuopio
6     165       39 female     kuopio
7     161       38 female     kuopio
8     167       38 female    tampere
9     164       39 female    tampere
10    166       38 female    tampere
11    162       37 female    tampere
12    158       36 female    tampere

Similarly, make a new dataset from the tall students.

8. Quit R

To quit R, type:

> q( )

R then asks whether you would like to save the workspace. This is generally a good idea,
so answer "yes". You can then get back to the same analysis just by double-clicking
on the .Rdata icon in your students folder.

If double-clicking does not work, you can start R and use the menu choices File->Load Workspace and
File->Load History to achieve the same result.

AFFYMETRIX PREPROCESSING EXERCISE

PRELIMINARY OPERATIONS
A. Start RGui;
B. Change the working directory, choosing the folder where the CEL files and the PHENODATA are
located;
C. Load the needed libraries:
library(affy)
library(affyQCReport)
library(hs133ahsentrezgcdf)
library(hgu133aEG1000)

In this exercise we will perform the following tasks:


1. Importing the CEL files and PHENODATA;
2. Performing basic QC;
3. Preprocessing the data.

1. IMPORTING THE CEL FILES AND PHENODATA

First, we create a new AffyBatch object (dat) where we import the CEL files.
dat <- ReadAffy()

The PHENODATA contains the information about the samples (i.e. the microarrays) of our dataset. The
PHENODATA are usually stored into a table (TAB-delimited .txt file) where each row represents a sample
and each column represents a variable.
We create a data.frame object to import the PHENODATA text file
pd <- read.table("phenod.txt", header=T, row.names=1, sep="\t")

Then, we assign the pd data.frame as the PHENODATA of the dat AffyBatch


pData(dat) <- pd
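
The row.names=1 argument tells read.table to use the first column of the file as row names, so each row of pd is labelled with its sample. A minimal sketch with a made-up two-sample phenodata table (the file names and the column name type are assumptions, not the contents of phenod.txt):

```r
# Hypothetical phenodata for two arrays, given as text instead of a file
txt <- "sample\ttype\narray1.CEL\tNORMAL\narray2.CEL\tRCC"

pd.demo <- read.table(text = txt, header = TRUE, row.names = 1, sep = "\t")
rownames(pd.demo)   # taken from the first column because row.names=1
pd.demo$type        # the remaining column describes each sample
```

Before assigning the table to the AffyBatch, it is worth checking that its rows are in the same order as the CEL files were read.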

2. PERFORMING BASIC QC

We use the affyQCReport library that we loaded before


QCReport(dat)

This creates a PDF file with some plots. Read the affyQCReport vignette for interpreting the different
plots. Note that the plots in page 2 can be also produced by the commands
boxplot(dat)
and
hist(dat)

Additionally, we can check the RNA quality by using the AffyRNAdeg function.
deg <- AffyRNAdeg(dat)
Individual probes in a probeset are ordered by location relative to the 5' end of the targeted RNA molecule.
Since RNA degradation typically starts from the 5' end of the molecule, we would expect probe intensities
to be systematically lowered at that end of a probeset when compared to the 3' end. On each chip, probe
intensities are averaged by location in probeset, with the average taken over probesets.
We can plot the results into a new PDF file:
a) we create a new PDF graphic device. This will direct everything we plot into the new PDF file
pdf("rnadeg.pdf")

b) we plot the RNAdeg results


plotAffyRNAdeg(deg, cols=1:17)

c) we add a legend to the plot


legend(1,60,legend=rownames(pd), text.col=1:17)

d) we close the graphic device


dev.off()
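
The same open, plot, add legend, close pattern works with any base R plot. The sketch below uses made-up curves instead of the RNA degradation results, and writes to a temporary folder (the file name demo.pdf is an assumption for illustration):

```r
out <- file.path(tempdir(), "demo.pdf")   # hypothetical output file

pdf(out)                                   # a) open the device: plots now go to the file
plot(1:10, (1:10)^2, type = "l", col = 1)  # b) draw the plot
lines(1:10, 2 * (1:10), col = 2)
legend("topleft", legend = c("x^2", "2x"), text.col = 1:2)   # c) add a legend
dev.off()                                  # d) close the device so the file is complete

file.exists(out)
```

If dev.off() is forgotten, the PDF stays open and later plots keep going into it instead of appearing on the screen.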

3. PREPROCESSING THE DATA

We want to use the re-annotation of the affymetrix probes of the hgu133a chipset according to the Entrez
Gene database.
For doing so, we instruct R to use the new CDF package with the re-annotated information
dat@cdfName <- "hs133ahsentrezgcdf"

We also instruct R to use the meta-annotation package


dat@annotation <- "hgu133aEG1000"

We preprocess the data using the RMA algorithm. The new datrma object is of class ExpressionSet
datrma <- rma(dat)

Finally, we save the normalized expression values into a TAB-delimited text file (.txt)
write.exprs(datrma, "datexprs.txt", sep="\t")
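
RMA combines background correction, quantile normalization and a median-polish summarization of the probes. To give an idea of the normalization step only, here is a minimal base-R sketch of quantile normalization on a small simulated matrix. It is an illustration under simplifying assumptions (no missing values, ties broken by order), not the code that rma runs, and the function name quantile_normalize is made up.

```r
# Quantile normalization: force every column (chip) to share one distribution
quantile_normalize <- function(x) {
  ranks  <- apply(x, 2, rank, ties.method = "first")  # rank of each value within its column
  sorted <- apply(x, 2, sort)                         # each column sorted in ascending order
  ref    <- rowMeans(sorted)                          # reference distribution: mean per rank
  out    <- apply(ranks, 2, function(r) ref[r])       # replace each value by the reference value of its rank
  dimnames(out) <- dimnames(x)
  out
}

set.seed(1)
x <- matrix(rexp(20), nrow = 5,
            dimnames = list(paste0("p", 1:5), paste0("chip", 1:4)))  # simulated intensities
xn <- quantile_normalize(x)
apply(xn, 2, sort)   # after normalization every column has the same sorted values
```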

FINAL OPERATIONS
A. Save the workspace
save.image("Affy_Preprocessing.Rdata")

B. Save the history


savehistory("Affy_Preprocessing.Rhistory")

Dario Greco
Institute of Biotechnology - University of Helsinki
Building Cultivator II, room 223b
P.O.Box 56 Viikinkaari 4
FIN-00014 Finland
Office: +358 9 191 58951
Fax: +358 9 191 58952
Mobile: +358 44 023 5780
Email: dario.greco@helsinki.fi

WORKING WITH LIMMA AND AFFYMETRIX DATA

PRELIMINARY OPERATIONS
A. Start RGui;
B. Change the working directory choosing the folder where the Affymetrix data is located;
C. Load the workspace you have saved from the AFFY exercise:
load("Affy_Preprocessing.Rdata")
D. Load the needed libraries:
library(affy)
library(hgu133aEG1000)
library(limma)

In this exercise we will perform the following tasks:


1. building the design matrix and defining the contrasts of interest;
2. fitting the linear model;
3. exploring and writing the results.

1. BUILDING THE DESIGN MATRIX AND DEFINING THE CONTRASTS OF INTEREST

design <- model.matrix(~0 + pd[,1])

colnames(design) <- c("NORMAL", "RCC")

contr <- makeContrasts(NORMAL-RCC, levels=design)
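
To see what these three commands build, the sketch below constructs a design matrix for four made-up samples using base R only. The factor type stands in for pd[,1], and its values are assumptions; your phenodata may use other labels.

```r
# Hypothetical sample annotation: two normal and two tumor (RCC) arrays
type <- factor(c("NORMAL", "RCC", "NORMAL", "RCC"))

# ~0 removes the intercept, so there is one indicator column per group
design.demo <- model.matrix(~0 + type)
colnames(design.demo)                   # "typeNORMAL" "typeRCC"

# Columns follow the order of the factor levels (alphabetical by default)
colnames(design.demo) <- levels(type)
design.demo

# The contrast NORMAL-RCC puts weight +1 on NORMAL and -1 on RCC
contrast.demo <- c(NORMAL = 1, RCC = -1)
```

Because the columns follow the factor levels, typing the names by hand as in the exercise only gives correct labels if NORMAL really is the first level. With the contrast NORMAL-RCC, a positive log fold change means higher expression in the normal samples.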

2. FITTING THE LINEAR MODEL

fit <- lmFit(datrma, design)

fit2 <- contrasts.fit(fit, contr)

fit2 <- eBayes(fit2)

3. EXPLORING AND WRITING THE RESULTS

First, we visualize the number of significant genes by Venn diagrams. Here we use p-value < 0.01 and
Benjamini-Hochberg p-value correction.
vennDiagram(decideTests(fit2, p.value = 0.01, adjust.method =
"BH"))

We extract the gene symbol and the Entrez Gene information from the annotation package
gs <- as.data.frame(unlist(as.list(hgu133aEG1000SYMBOL)))
eg <- as.data.frame(unlist(as.list(hgu133aEG1000ENTREZID)))
annot <- cbind(rownames(gs), gs, eg)
colnames(annot) <- c("ID", "Gene Symbol", "Entrez Gene ID")
We store the significant genes with their annotation in a data.frame object
results <- topTable(fit2, coef=1, n = 1838, genelist=annot)
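
The Benjamini-Hochberg correction used above when counting the significant genes is also available in base R through p.adjust. A minimal sketch on five made-up p-values shows how the adjusted values relate to the raw ones:

```r
p <- c(0.001, 0.008, 0.039, 0.041, 0.2)   # made-up raw p-values

p.bh <- p.adjust(p, method = "BH")        # Benjamini-Hochberg adjusted p-values
p.bh
sum(p.bh < 0.01)   # number of tests called significant at the 0.01 level
```

Two of the raw p-values are below 0.01, but only one remains below 0.01 after adjustment.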

Finally, we save the table of significant genes into a TAB-delimited text file (.txt)
write.table(results, "results.txt", sep="\t")

FINAL OPERATIONS
A. Save the workspace
save.image("Affy_Limma.Rdata")

B. Save the history


savehistory("Affy_Limma.Rhistory")

FINDING OVER-REPRESENTED GO FAMILIES

PRELIMINARY OPERATIONS
A. Start RGui;
B. Change the working directory choosing the folder where the Affymetrix data is located;
C. Load the workspace you have saved from the AFFY-LIMMA exercise:
load("Affy_Limma.Rdata")
D. Load the needed libraries:
library(GOstats)
library(limma)
library(affy)
library(hgu133aEG1000)

In this exercise we will perform the following tasks:


1. creating the parameters for the hypergeometric test;
2. performing the hypergeometric test;
3. exporting the results.

1. CREATING THE PARAMETERS FOR THE HYPERGEOMETRIC TEST


First, we need to find the column of the limma results data frame that contains the Entrez Gene IDs (it
should be column 3):
colnames(results)

Now, we can create the parameters for running the hypergeometric test (equivalent to a one-sided Fisher's exact test):
params <- new("GOHyperGParams", geneIds =
as.vector(results[,3]), annotation = "hgu133aEG1000", ontology
= "BP", pvalueCutoff = 0.05, conditional = FALSE, testDirection
= "over")

In this command, we specify the Entrez Gene Ids, the annotation package, the ontology that we want to
assay (BP, MF, or CC), the p-value cut off (here we chose 0.05), whether we want to run a conditional
test, and the test direction, for finding the over- or the under-represented families (here we want to find
the over-represented families).

2. PERFORMING THE HYPERGEOMETRIC TEST

BPover <- hyperGTest(params)
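
To make clear what is being tested for each GO family, the sketch below computes an over-representation p-value by hand for one category. All counts are made up, and the variable names are assumptions; it shows the principle only, not the internals of hyperGTest.

```r
# Made-up counts: 1000 genes in the universe, 50 of them in the GO category,
# 100 genes selected, 12 of the selected genes in the category
universe <- 1000; in.cat <- 50; selected <- 100; overlap <- 12

# P(X >= overlap) under the hypergeometric distribution
p.hyper <- phyper(overlap - 1, in.cat, universe - in.cat, selected,
                  lower.tail = FALSE)

# The same p-value from a one-sided Fisher's exact test on the 2x2 table
tab <- matrix(c(overlap, selected - overlap,
                in.cat - overlap, universe - in.cat - selected + overlap),
              nrow = 2)
p.fisher <- fisher.test(tab, alternative = "greater")$p.value

c(hypergeometric = p.hyper, fisher = p.fisher)
```

By chance alone we would expect about 5 of the 100 selected genes in this category (100 * 50 / 1000), so an overlap of 12 is more than expected.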

3. EXPORTING THE RESULTS

In order to save the results into a data.frame object:


BPresults <- summary(BPover)

Now we can export the results into a TAB delimited text file:
write.table(BPresults, "BP_over.txt", sep="\t")

FINAL OPERATIONS
A. Save the workspace
save.image("Affy_GOstats.Rdata")
B. Save the history
savehistory("Affy_GOstats.Rhistory")

Clustering – Exercises
This exercise introduces some clustering methods available in R and Bioconductor. It
uses the prenormalized yeast dataset.

1. Reading the prenormalized data

Read in the prenormalized Spellman’s yeast dataset:

> d<-read.table("combined.txt", sep="\t", header=T, row.names=1)

We want only the cdc15 data, so take only those columns from the data:

> names(d)
> da<-data.frame(d[26:49])

Remove missing values from the data:

> dat<-na.omit(da)

2. Filter the genelist by standard deviation

Select only the genes whose standard deviation is among the highest 0.3%.

> library(genefilter)
> # Row-wise SDs
> sds<-rowSds(dat)
> # SD value at the 99.7% quantile of the data
> sdt<-quantile(sds, 0.997)
> sel<-(sds>sdt)
> set<-dat[sel, ]

How many genes are left after filtering?
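
The same filter can be written with base R only, which is handy for checking how the quantile cutoff behaves. The sketch below uses a simulated matrix, not the yeast data, and the object name demo is an assumption.

```r
set.seed(42)
demo <- matrix(rnorm(1000 * 10), nrow = 1000)   # simulated: 1000 "genes", 10 arrays

sds <- apply(demo, 1, sd)       # row-wise standard deviations
sdt <- quantile(sds, 0.997)     # cutoff: 99.7% of the SDs lie below this value
sel <- sds > sdt                # TRUE for rows above the cutoff

sum(sel)                            # number of rows kept
nrow(demo[sel, , drop = FALSE])     # the filtered matrix has the same number of rows
```

With 1000 rows and a 99.7% cutoff, 3 rows are kept, that is, 0.3% of the data.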

3. Creating a heatmap using Euclidean distance and complete linkage

> heatmap(as.matrix(set))
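
The heatmap function uses dist and hclust with their default settings, which are Euclidean distance and complete linkage. The sketch below runs the two steps explicitly on a small simulated matrix, so you can see what happens behind the row dendrogram (the matrix m is made up for illustration).

```r
set.seed(1)
m <- matrix(rnorm(40), nrow = 8,
            dimnames = list(paste0("g", 1:8), NULL))   # simulated data: 8 genes, 5 arrays

d  <- dist(m, method = "euclidean")     # pairwise distances between rows
hc <- hclust(d, method = "complete")    # complete-linkage hierarchical clustering

cutree(hc, k = 2)   # cluster membership when the tree is cut into two groups
```

Other distance or linkage methods can be passed to heatmap through its distfun and hclustfun arguments.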

To get other colors in the heatmap, you first need to generate a sequence of colors, and then plot the
heatmap using these colors:

> library(RColorBrewer)
> heatcol<-colorRampPalette(c("Red", "Green"))(32)
> heatmap(as.matrix(set), col=heatcol)
4. Saving the heatmap into a file

For further modifications, the heatmap might need to be saved in a file. This is accomplished with:

> cwd=getwd()
> bmp(file.path(cwd, "heatmap.bmp"), width=1800, height=1800)
> heatmap(as.matrix(set), col=heatcol)
> dev.off()

This results in a print-quality bitmap image of about 6*6 inches in your data folder. Some journals might
want a PostScript image instead. For the postscript device, width and height are given in inches:

> cwd=getwd()
> postscript(file.path(cwd, "heatmap.ps"), width=6, height=6)
> heatmap(as.matrix(set), col=heatcol)
> dev.off()

5. K-means clustering of genes

In K-means clustering you need to choose the number of clusters (K) yourself before running the analysis.

To produce a K-means clustering with 5 clusters, type:

> k<-c(5)
> km<-kmeans(set, k, iter.max=1000)

Calculate the average within-cluster sum of squares of the result. This is a measure of how close together genes lie
inside the clusters.

> mean(km$withinss)
[1] 21.1838

Run the same K-means analysis several times (save the result into a new object every time). Select
the K-means clustering giving the smallest average within-cluster sum of squares as the best result.

You can do this by hand, or run the following code chunk:

> ss<-c(1000000)
> for(i in 1:10) {
    km<-kmeans(set, 5)
    if(mean(km$withinss)<=ss) {
      ss<-mean(km$withinss)
      km.best<-km
    }
  }
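
The kmeans function can also do the repeated runs itself: the nstart argument sets the number of random starts, and the run with the smallest total within-cluster sum of squares is returned. For a fixed K this picks the same winner as the average used above, because the average is just the total divided by K. A minimal sketch on simulated data (the matrix demo is made up for illustration):

```r
set.seed(123)
demo <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
              matrix(rnorm(100, mean = 4), ncol = 2))   # simulated: two groups of 50 points

# 10 random starts; the best run is kept
km.demo <- kmeans(demo, centers = 2, nstart = 10, iter.max = 1000)

km.demo$tot.withinss      # total within-cluster sum of squares
mean(km.demo$withinss)    # the score used in this exercise: total divided by K
```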
6. Visualizing the K-means clustering

Let’s produce a new K-means clustering result using four clusters:

> km<-kmeans(set, 4, iter.max=1000)

Next, set up a 2*2 plotting area, and draw the expression profiles. We need to use a for-loop here:

> par(mfrow=c(2,2))
> for(i in 1:4) {
    matplot(t(set[km$cluster==i,]), type="l",
      main=paste("cluster:", i), ylab="log expression", xlab="time")
  }
