What is R?
The R Environment
The R console
RStudio
Let's begin!
Getting help
Packages in R
Expressions
Objects
Sorting
Working directory
Coercion
Operations on Matrices
Logical Operators
Practice
Basics of R
Syllabus
Introduction
Installing R and RStudio: R Versions, Download and install R, Installing Packages, Loading
Packages, Updating R and its Packages.
User Interface, Packages and Help: The R Console, RStudio, Getting Help.
R Packages: Listing Packages in Local Libraries, Loading Packages, Package
Repositories, Finding and Installing Packages Inside R and other Repositories.
The R Language
Overview: Expressions, Objects, Symbols, Functions, Special Values.
R Objects: Vectors, Integers, Character, Logicals, Complex and Raw, Attributes of Objects,
Matrices, Arrays, Class, Dates and Time, Factors, Coercion, Lists, Data Frames, Changing
values, Logical Subsetting, Boolean Operators, Missing Information and removing NAs.
R Environment: Symbols, Working with Environments, The Global Environment.
Working with Data in R
Loading and Saving data in R: Entering Data within R, Data Editor (RStudio), Datasets in R,
Working Directory, The read Family, HTML data links, R Files, Saving R Files, Excel
Spreadsheets and R, Loading files from other programs.
Preparing Data: Combining, Transforming, Binning, Subsetting, Cleaning, Sorting and
Summarizing Data.
Graphics
Base Graphics: Scatter Plots, Time Series Plots, Bar Plots, Histogram, Box-Plots,
Customizing Charts.
What is R?
R is an open-source software environment for statistical computing and graphics.
R compiles and runs on Windows, Mac OS X, and numerous UNIX platforms (such as Linux).
R is free, open-source software. Other software, such as Stata, may cost on average about $1000 per
year in license fees.
The R Environment
Features of R
An effective data handling and storage facility.
A suite of operators for calculations on arrays, in particular matrices.
A collection of tools for data analysis.
Graphical facilities for data analysis.
A well-developed, simple, and effective programming language.
The R system is a software environment for statistical computing and graphics.
The term “environment” is intended to characterize R as a fully planned and coherent system,
rather than an incremental accretion of very specific and inflexible tools, as is frequently the case
with other data analysis software.
R is an implementation of the S language.
S is a language that was developed by John Chambers and others at Bell Labs.
S was initiated in 1976 as a statistical analysis environment.
R was created in New Zealand in the year 1991 by Ross Ihaka and Robert Gentleman.
In 1995 Ross and Robert agreed to make R free software under the GNU General Public License, and
the first version, R 1.0.0, was released in the year 2000.
The name “GNU” is a recursive acronym for “GNU’s Not Unix.” See https://gnu.org
Usually, there is an official release of R twice a year.
In S, statistical analysis is usually done as a series of steps, with intermediate results being
stored in objects. Thus, whereas SPSS and SAS will give copious output from a regression
analysis, R will give minimal output and store the results in a fit object for subsequent
interrogation by further R functions.
Design of the R system
The R system is divided into two conceptual parts:
The “base” R system that you download from CRAN, https://cran.r-project.org/
Everything else.
The “base” package contains the most fundamental functions used to run R.
Other packages can be downloaded and installed as required; for example, the AER
package is used for econometrics.
The R console
The R console is a tool that allows you to type commands into R and see how the R system responds.
The commands that you type into the console are called expressions. By default, R will display a
greater-than sign (“>”) in the console (at the beginning of a line, when nothing else is shown) when R is
waiting for you to enter a command into the console. R is prompting you to type something, so this is
called a prompt. For example, suppose that you typed 17 + 3 on the console, you would see something
similar to this:
Fig 2: Simple operations in the R console
RStudio
The RStudio Integrated Development Environment (IDE) is a powerful and productive user interface
for R.
Like R, it is free and multi-platform. It can be downloaded from https://www.rstudio.com/
RStudio is a separate open-source project that brings many powerful coding tools together into
an intuitive, easy-to-learn interface.
The RStudio program can be run on the desktop or through a web browser.
The desktop version is available for Windows, Mac OS X, and Linux platforms.
Fig 3: RStudio
Let's begin!
You can now uninstall any copies of R and RStudio already installed on your computer so that you
have a fresh start.
Getting help
help(mean) # opens the help page for mean()
# or
?mean
# or, to search all installed help pages for the term
??mean
For a feature specified by special characters, the argument must be enclosed in double or single
quotes, making it a character string. This is also necessary for a few words with syntactic meaning,
including if , for and function .
help("[["); help("if")
?"[["
Packages in R
A package is a related set of functions, help files, and data files that have been bundled together. For
example, the stats package contains functions for doing statistical analysis. Some packages are
included in R, other packages are available from public package repositories. You can also make your
own packages!
To get the list of packages loaded by default, use the following command:
getOption("defaultPackages")
# to list all the packages installed in your libraries
library()
To install and load a specific package, use the following commands. Remember that you have to load the
package every time you start a new R session.
install.packages("AER")
library(AER)
Some packages contain data. To access data from a specific package, use the following
command.
## data(package = "AER")
data(SwissLabor, package = "AER") ; head(SwissLabor)
remove.packages("AER")
Expressions
Examples of expressions in R include assignment statements, conditional statements and arithmetic
expressions.
x <- 1; x
## [1] 1
## [1] "no"
Expressions are composed of objects and functions. They may be separated by new lines or
semicolons.
## [1] 10
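The three kinds of expression named above can be sketched as follows. This is only an illustration: the names answer and y are hypothetical, and the conditional and arithmetic lines are assumptions chosen to be consistent with the outputs shown, not the original code.

```r
# Assignment expression (as in the example above)
x <- 1

# Conditional expression: evaluates to "yes" or "no" depending on x
answer <- if (x > 1) "yes" else "no"
answer

# Two expressions on one line, separated by a semicolon
y <- 2 * 5; y
```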
Objects
An object is a thing that is represented by the computer.
The entities that R creates and manipulates are known as ‘objects’ (more on this later).
During an R session objects are created and stored by name.
The R command objects() can be used to display the names of the objects that are currently
stored within R.
The collection of objects currently stored is called the ‘workspace’.
To remove objects, the function rm() is available.
objects()
rm(x); objects()
## [1] "SwissLabor"
character: "Hello"
numeric(real numbers/decimal numbers): c(1,2.36) .
integer: 10L
complex : 2+4i
logical: TRUE/FALSE .
The most basic object is a vector. A vector can contain objects of the same class only. However, a list,
created with list(), is an exception and can contain a mixture of objects.
1/Inf
## [1] 0
Inf/Inf
## [1] NaN
NaN (Not a Number) represents an undefined value. NaN is also treated as a missing value:
is.na(NaN) returns TRUE.
Objects in R have attributes like names() , dimnames() (for a matrix, array or data frame), class() ,
length() .
class(10L)
## [1] "integer"
class("Hello!")
## [1] "character"
length(12)
## [1] 1
names(12)
## NULL
## [1] "Amit" NA NA
The things that we type in R are called expressions. <- is the assignment
operator, which assigns a value to a symbol.
assign("num", 1200 )
num
## [1] 1200
## [1] "numeric"
nos
Sorting
sort(x) returns a vector of same size as x with the elements arranged in increasing order.
d <- c(10,50,12,63,74,1,89,5,3,6,4)
sort(d)
## [1] 1 3 4 5 6 10 12 50 63 74 89
## [1] 89 74 63 50 12 10 6 5 4 3 1
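The decreasing order shown above can be produced with the decreasing argument of sort(). The sketch below also shows order(), which returns the positions rather than the sorted values; the names desc and idx are hypothetical.

```r
d <- c(10, 50, 12, 63, 74, 1, 89, 5, 3, 6, 4)

# Sort from largest to smallest
desc <- sort(d, decreasing = TRUE)
desc

# order() gives the index of each element in sorted position,
# so d[idx] is the same as sort(d)
idx <- order(d)
d[idx]
```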
x <- 30:1; x
## [1] 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8
## [24] 7 6 5 4 3 2 1
x <- c(1,2,3,4,5,6,7,8,9); x
## [1] 1 2 3 4 5 6 7 8 9
x <- seq(1,30); x
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
## [24] 24 25 26 27 28 29 30
## [1] -5.0 -4.8 -4.6 -4.4 -4.2 -4.0 -3.8 -3.6 -3.4 -3.2 -3.0 -2.8 -2.6 -2.4
## [15] -2.2 -2.0 -1.8 -1.6 -1.4 -1.2 -1.0 -0.8 -0.6 -0.4 -0.2 0.0 0.2 0.4
## [29] 0.6 0.8 1.0 1.2 1.4 1.6 1.8 2.0 2.2 2.4 2.6 2.8 3.0 3.2
## [43] 3.4 3.6 3.8 4.0 4.2 4.4 4.6 4.8 5.0
## [1] 10 8 6 4 2
## [1] -5.0 -4.8 -4.6 -4.4 -4.2 -4.0 -3.8 -3.6 -3.4 -3.2 -3.0 -2.8 -2.6 -2.4
## [15] -2.2 -2.0 -1.8 -1.6 -1.4 -1.2 -1.0 -0.8 -0.6 -0.4 -0.2 0.0 0.2 0.4
## [29] 0.6 0.8 1.0 1.2 1.4 1.6 1.8 2.0 2.2 2.4 2.6 2.8 3.0 3.2
## [43] 3.4 3.6 3.8 4.0 4.2 4.4 4.6 4.8 5.0
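Sequences like the ones printed above can be generated with seq(), using either a step size (by) or a target length (length.out). The calls below are a sketch assumed to match the outputs shown; the names s1, s2 and s3 are hypothetical.

```r
# From -5 to 5 in steps of 0.2 (51 values)
s1 <- seq(-5, 5, by = 0.2)
length(s1)

# A decreasing sequence needs a negative step
s2 <- seq(10, 2, by = -2)
s2

# The same grid as s1, specified by its length instead of its step
s3 <- seq(-5, 5, length.out = 51)
```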
A related function is rep() which can be used for replicating objects in various ways.
x <- c("hi","hello"); x
## [1] 3 3 3 3 3 3 3 3 3 3
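A few ways of replicating with rep() are sketched below. The first call is assumed to be what produced the row of threes above; r1, r2 and r3 are hypothetical names.

```r
# Repeat a single value ten times
r1 <- rep(3, times = 10)
r1

x <- c("hi", "hello")

# Repeat the whole vector twice: "hi" "hello" "hi" "hello"
r2 <- rep(x, times = 2)

# Repeat each element twice: "hi" "hi" "hello" "hello"
r3 <- rep(x, each = 2)
```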
The paste() function takes an arbitrary number of arguments and concatenates them element by element into
character strings. Any numbers given among the arguments are coerced into character strings.
## [1] "a1" "a2" "a3" "a4" "a5" "a6" "a7" "a8" "a9" "a10"
To collapse the output into a single string pass the collapse argument.
## [1] "x1" "y2" "x3" "y4" "x5" "y6" "x7" "y8" "x9" "y10"
## [1] "x1-y2-x3-y4-x5-y6-x7-y8-x9-y10"
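The paste() outputs above can be reproduced with calls along these lines (a sketch; p1, p2 and p3 are hypothetical names). Note that the shorter vector c("x", "y") is recycled to match 1:10.

```r
# sep = "" joins the pieces without a space: "a1" "a2" ... "a10"
p1 <- paste("a", 1:10, sep = "")

# The two-element vector is recycled: "x1" "y2" "x3" ... "y10"
p2 <- paste(c("x", "y"), 1:10, sep = "")

# collapse joins the resulting elements into one string
p3 <- paste(c("x", "y"), 1:10, sep = "", collapse = "-")
p3
```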
Working directory
To check the location of your working directory, use getwd() and to set a specific folder as your
working directory use setwd("file path") .
x <- 1:10
x*2
## [1] 2 4 6 8 10 12 14 16 18 20
x
## [1] 1 2 3 4 5 6 7 8 9 10
# or consider
y <- x
x*3
## [1] 3 6 9 12 15 18 21 24 27 30
## [1] 1 2 3 4 5 6 7 8 9 10
If you want to use the pass-by-reference paradigm, have a look at the R.oo,
mutatr and proto packages.
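By default R behaves as if objects are copied on assignment and when passed to functions, which is why x is unchanged in the examples above. A minimal sketch of this behaviour, with a hypothetical helper double_it():

```r
x <- 1:10
y <- x          # y is an independent copy as far as the user can tell
y[1] <- 100L    # modifying y ...
x[1]            # ... leaves x untouched

# A function works on its own copy of the argument
double_it <- function(v) {
  v <- v * 2
  v
}
z <- double_it(x)
x               # still 1 to 10
```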
Coercion
1. Implicit Coercion
2. Explicit Coercion
Implicit coercion occurs when we try to combine objects of two or more different classes. The ordering
is roughly LINCL, i.e. logical < integer < numeric < complex < character < list.
## [1] "character"
## [1] "numeric"
## [1] "character"
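Implicit coercion can be seen by combining values of different classes with c(). The combinations below are assumptions chosen to be consistent with the three outputs above; the original calls are not shown.

```r
# numeric + character -> character
class(c(1, "a"))

# logical + numeric -> numeric (TRUE becomes 1)
class(c(TRUE, 2))
c(TRUE, 2)

# logical + character -> character
class(c(TRUE, "a"))
```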
x <- c(0,1,2,3,4,5,6)
class(x)
## [1] "numeric"
as.character(x)
as.complex(x)
as.logical(x)
## [1] NA NA NA NA
as.complex(x)
## [1] NA NA NA NA
as.logical(x)
## [1] NA NA NA NA
Operations on Matrices
Matrices are vectors with a dimension attribute, which can be read with dim() .
## [1] 2 5
length(m)
## [1] 10
class(m)
## [1] "matrix"
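A matrix with these dimensions could be created as sketched below; the values 1:10 are an assumption, since the original definition of m is not shown. In R 4.0.0 and later, class(m) returns both "matrix" and "array", so is.matrix() is the more robust check.

```r
# Fill a 2 x 5 matrix column by column with the numbers 1 to 10
m <- matrix(1:10, nrow = 2, ncol = 5)
m

dim(m)        # number of rows and columns
length(m)     # total number of elements
is.matrix(m)  # TRUE
```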
x <- 1:5
y <- 6:10
cbind(x,y)
## x y
## [1,] 1 6
## [2,] 2 7
## [3,] 3 8
## [4,] 4 9
## [5,] 5 10
rbind(x,y)
## [,1] [,2] [,3] [,4] [,5]
## x 1 2 3 4 5
## y 6 7 8 9 10
# or
cbind(1:5, 6:10)
## [,1] [,2]
## [1,] 1 6
## [2,] 2 7
## [3,] 3 8
## [4,] 4 9
## [5,] 5 10
Subsetting
Subsetting a matrix with [ will subset objects of the same class
x[1,2]
## [1] 3
x[2,5]
## [1] 10
x[,2]
## [1] 3 4
x[2,]
## [1] 2 4 6 8 10
## [,1] [,2]
## [1,] 3 7
## [2,] 4 8
## [1] 3 7
## [1] 1 4
When we subset a single row, column or element of a matrix, the result is a vector rather than a matrix.
To keep the matrix form, set the argument drop = FALSE.
x[1,3]
## [1] 5
x[1,]
## Kolkata Delhi Mumbai Chennai Bangalore
## 1 3 5 7 9
How does the output look different with the drop argument?
## Mumbai
## temp 5
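The effect of drop can be sketched with a small named matrix. The city names and the row name temp come from the outputs above; the second row name rain and the values 1:10 are assumptions for illustration.

```r
x <- matrix(1:10, nrow = 2,
            dimnames = list(c("temp", "rain"),
                            c("Kolkata", "Delhi", "Mumbai", "Chennai", "Bangalore")))

# Default behaviour: the result is simplified to a vector
v <- x[1, ]
dim(v)   # NULL, because v is no longer a matrix

# drop = FALSE keeps the result as a 1 x 1 matrix with its row and column names
m <- x[1, 3, drop = FALSE]
m
dim(m)
```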
x <- matrix(c(0.3,4.5,55.3,91,0.1,105.5,-4.2,8.2,27.9),nrow=3,ncol=3); x
x[,-2]
## [,1] [,2]
## [1,] 0.3 -4.2
## [2,] 4.5 8.2
## [3,] 55.3 27.9
## [,1] [,2]
## [1,] 8.2 0.1
## [2,] 27.9 105.5
## [,1] [,2]
## [1,] 4.5 8.2
## [2,] 55.3 27.9
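Negative indices omit rows or columns. The calls below are a sketch that reproduces the last two outputs above from the matrix x defined earlier; the names a and b are hypothetical.

```r
x <- matrix(c(0.3, 4.5, 55.3, 91, 0.1, 105.5, -4.2, 8.2, 27.9), nrow = 3, ncol = 3)

# Drop the first row, then take columns 3 and 2 in that order
a <- x[-1, 3:2]
a

# Drop the first row and the second column
b <- x[-1, -2]
b
```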
x[2,] <- 1:3
x[c(1,3),2] <- 900 # overwrites the first and third rows of the second column
x[c(1,3), c(1,3)] <- c(-7,7)
Replacing diagonals
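The diagonal of a matrix can be read and replaced with diag(). The sketch below uses a hypothetical 3 x 3 matrix rather than the one from the preceding example.

```r
x <- matrix(1:9, nrow = 3)

# Read the diagonal elements
diag(x)

# Replace every diagonal element with 0; off-diagonal values are unchanged
diag(x) <- 0
x
```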
Matrix transpose
t(x)
i <- diag(3); i
a <- 2; a; x
## [1] 2
b <- a*x; b
Matrix multiplication
dim(x); dim(b)
## [1] 3 3
## [1] 3 3
x%*%b
Inverse of a matrix
A <- matrix(data=c(3,4,1,2),nrow=2,ncol=2);A
## [,1] [,2]
## [1,] 3 1
## [2,] 4 2
solve(A)
## [,1] [,2]
## [1,] 1 -0.5
## [2,] -2 1.5
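One way to check the inverse is to multiply it by the original matrix, which should give the identity matrix up to floating-point rounding. A minimal sketch, with A_inv as a hypothetical name:

```r
A <- matrix(data = c(3, 4, 1, 2), nrow = 2, ncol = 2)
A_inv <- solve(A)

# A %*% A_inv should equal the 2 x 2 identity matrix;
# all.equal() compares with a small numerical tolerance
all.equal(A %*% A_inv, diag(2))
```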
## [1] 1 2 3 4 5 6 7 8 9 10
The functions rowSums() and colSums() calculate the total for each row and column of a matrix.
## x y
## 1 460.998 314.4
## 2 290.475 247.9
## 3 309.306 165.8
## 1 2 3
## 775.398 538.375 475.106
## x y
## 1060.779 728.100
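The totals above can be reproduced by rebuilding the small table from the printed values. This is a sketch; the original object is not shown, so the name m and the use of cbind() are assumptions.

```r
# Rebuild the 3 x 2 table from the values printed above
m <- cbind(x = c(460.998, 290.475, 309.306),
           y = c(314.4, 247.9, 165.8))
rownames(m) <- 1:3

rowSums(m)  # one total per row
colSums(m)  # one total per column
```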
date()
class(date())
## [1] "character"
The function Sys.Date() returns the current date in the current time zone.
Sys.Date()
## [1] "2019-08-08"
class(Sys.Date())
## [1] "Date"
Sys.time()
Formatting dates
The following codes are used when formatting dates.
%a abbreviated weekday
%A unabbreviated weekday
%m month as a number (01-12)
%b abbreviated month
%B unabbreviated month
d <- Sys.Date()
d
## [1] "2019-08-08"
## [1] "08-August-2019"
## [1] "character"
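The formatted date above can be obtained with format(). The sketch below uses a fixed date instead of Sys.Date() so that it is reproducible; the codes %d (day of month), %Y (four-digit year) and %y (two-digit year) are standard but are not listed in the table above. Month and weekday names depend on the locale.

```r
d <- as.Date("2019-08-08")

# Day-month name-year; gives "08-August-2019" in an English locale
s <- format(d, "%d-%B-%Y")
s

# format() always returns a character string, not a Date
class(s)

# A purely numeric layout does not depend on the locale
format(d, "%d/%m/%y")
```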
Creating dates
x <- c("1jan2016", "1feb2017", "1mar2018"); x
## [1] "Date"
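Character strings can be converted to dates with as.Date() and a format that describes their layout. The call below is an assumption about how the strings above were converted; parsing month abbreviations with %b depends on the locale, and the names dates and iso are hypothetical.

```r
x <- c("1jan2016", "1feb2017", "1mar2018")

# %d day, %b abbreviated month name, %Y four-digit year
dates <- as.Date(x, format = "%d%b%Y")
class(dates)

# Strings already in year-month-day form need no format argument
iso <- as.Date("2016-01-01")
iso
```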
d2 <- Sys.time()
d2
1. POSIXct: POSIX stands for Portable Operating System Interface and ct stands for calendar time. It stores
a date-time as the number of seconds since 1 January 1970 (UTC). Negative numbers represent seconds before this time.
2. POSIXlt: lt stands for local time. It is a named list of vectors representing seconds, minutes,
hours, day, month, year and time zone.
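The two representations can be compared on a fixed timestamp. This sketch uses a hypothetical time in the Asia/Kolkata zone, chosen to be consistent with the outputs in this section, instead of Sys.time().

```r
# POSIXct: a single number of seconds since 1970-01-01 00:00:00 UTC
t_ct <- as.POSIXct("2019-08-08 19:13:16", tz = "Asia/Kolkata")
secs <- as.numeric(t_ct)
secs                 # 1565271796

# POSIXlt: a list with one component per unit of time
t_lt <- as.POSIXlt(t_ct)
t_lt$hour            # 19
t_lt$mon + 1         # months are counted from 0, so add 1 to get 8
t_lt$year + 1900     # years are counted from 1900, so add 1900 to get 2019
```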
## [1] 1565271796
# as POSIXlt
unclass(as.POSIXlt(Sys.time()))
## $sec
## [1] 15.53448
##
## $min
## [1] 13
##
## $hour
## [1] 19
##
## $mday
## [1] 8
##
## $mon
## [1] 7
##
## $year
## [1] 119
##
## $wday
## [1] 4
##
## $yday
## [1] 219
##
## $isdst
## [1] 0
##
## $zone
## [1] "IST"
##
## $gmtoff
## [1] 19800
##
## attr(,"tzone")
## [1] "" "IST" "+0630"
## install.packages("lubridate")
library(lubridate)
##
## Attaching package: 'lubridate'
## [1] "2018-08-07"
## [1] "Date"
dmy(180807)
## [1] "2007-08-18"
ydm(180807)
## [1] "2018-07-08"
You can also add the hour, minute and second information.
dmy_hms("08-09-2019 08:30:00")
Time zones.
For example, suppose you have an admission interview over Skype with a university in London scheduled for 25
September 2019 at 09:45 am London time. What time will that be for you in Kolkata?
## [1] "Asia/Calcutta"
So your interview time is (a) 2019-09-25 09:45:00 if you are in London,
and (b) 2019-09-25 14:15:00 if you are in Kolkata.
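The conversion can be sketched in base R as follows. This is an alternative to the lubridate functions loaded above, not necessarily the code used originally; the name london is hypothetical, and the zone names Europe/London and Asia/Kolkata are standard time zone database identifiers.

```r
# The interview time, recorded in London local time
london <- as.POSIXct("2019-09-25 09:45:00", tz = "Europe/London")

# The same instant displayed in Kolkata local time
format(london, "%Y-%m-%d %H:%M:%S", tz = "Asia/Kolkata")   # "2019-09-25 14:15:00"
```

London is on British Summer Time (UTC+1) on that date and Kolkata is at UTC+5:30, which gives the difference of four and a half hours.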
Logical Operators
Operator Interpretation Results
Some examples
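a <- c(TRUE, FALSE, TRUE, FALSE); a
b <- c(FALSE, TRUE, TRUE, TRUE); b
# & and | work element by element; && and || compare single values only
# (operands longer than one element are an error since R 4.3.0)
a & b; a[1] && b[1]
## [1] FALSE
a | b; a[1] || b[1]
## [1] TRUE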
myvec<0
## [1] FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
To extract every second element from the vector, starting with the first:
## [1] 5 4 4 8 40221
## [1] 5 4 4 4 6 8 10
which(myvec < 0)
## [1] 2 10
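The logical subsetting steps above can be sketched as follows. The definition of myvec is not shown in the text, so the vector below is a hypothetical one chosen to be consistent with the printed outputs; the two negative values in particular are assumptions.

```r
# Hypothetical vector; only its sign pattern and positive values follow the outputs above
myvec <- c(5, -2.3, 4, 4, 4, 6, 8, 10, 40221, -8)

# A logical vector marking the negative elements
neg <- myvec < 0
neg

# c(TRUE, FALSE) is recycled along the vector: every second element, starting with the first
myvec[c(TRUE, FALSE)]

# Keep only the negative elements, or get their positions
myvec[neg]
which(neg)
```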
Let’s apply logical subsetting and extraction to matrices. In the following example we
shall select rows and columns with logical vectors.
A <- matrix(c(0.3,4.5,55.3,91,0.1,105.5,-4.2,8.2,27.9),nrow=3,ncol=3); A
# Extracts the second and the third elements of the first row:
# c(T,F,F) selects the first row, c(F,T,T) selects the second and third columns of that row.
A[c(T,F,F),c(F,T,T)]
## [1] 91.0 -4.2
A[c(F,T,F),c(F,T,T)] #extracts the second and the third elements of second row.
A[A<1] <- 100 # replaces every element of the matrix that is less than 1 with 100
A
Practice
1. Create a vector x and assign a numerical value to it.
2. Create another vector y and assign the numbers 1 to 5.
3. Create a longer vector that contains the numbers 1 to 5 repeated ten times.
4. Assign some meaningful names to the following vectors:
c(2, 4, 6, 8, 10, 12, 14, 16, 20)
0
3.141593
c(1, 10, 100, 1000, 10000, 100000)
5. Create vectors that correspond to the following variables names:
BMI
Age
daysPerMonth
firstFivePrimeNumbers
6. Create three vectors that each contain just 1 element with variable names p , q , and r , and
values 1, 2, and 3. Then, create a new vector that contains multiple elements, using the scalars
we just created i.e., create a vector u of length 3, with the subsequent elements of p , q and
r.
7. Create a new vector u with length 96 that contains the elements of u as follows: 1, 2, 3, 1, 2, 3, ….,
1, 2, 3.
8. Suppose the surface area of a circle equals 25, what is the radius?
9. What is the probability density at x=0 of a normally distributed random variable x with mean (mu)
equal to zero, and standard deviation (sigma) equal to one?
10. Sort the numbers 10,50,12,63,74,1,89,5,3,6,4 in ascending order.
11. Generate a sequence of 100 numbers between -60 to 45 with a width of 5, and find the mean
value.
12. Use the functions mean() and range() to find the mean and range of:
the numbers 1, 2, . . . , 21
a sample of 50 random normal values, which can be generated from a normal distribution
with mean 0 and variance 1 using the assignment y <- rnorm(50) .
the columns height and weight in the data frame women. [The datasets package that has
this data frame is by default attached when R is started.]
13. What are the respective effects of the arguments sep and collapse in the paste() function?
14. Create a matrix as depicted in the following table. The row names are the roll numbers of the
students in economics class, and the column names are the codes of the courses. The values
represent the marks obtained out of 50, and are random. You can assume and assign marks.
ECON01 35 25 36 40
ECON02 36 25 35 39
… … … … …
ECON40 40 32 45 25
15. Calculate the mean marks obtained by each student in Q14 and find out which students got the
top and last 3 positions (hint: you can sort the data).
16. What date and time shall your friend in Massachusetts Institute of Technology (MIT) be speaking
with you online if you call him/her now?
17. Create a 4 x 3 matrix with numbers between -30 and 30 (excluding 0). Extract the first and second
elements from the fourth row. Extract the second and third elements from the second row.
Replace all the negative elements in the matrix with 0.