
R - Introduction

MBA (Core) – II Year (Mktg. A)

Dec-18
Introduction

 R: open-source, object-oriented programming language


– Commonly used in statistical computing, data analytics, visualization
 Growth in popularity: powerful, easy-to-use syntax, environment
– Open source, free
– Runs on multiple platforms
– Get things done with sparse code
– Wide usage: Business, individuals, researchers
 Alternatives: SPSS, SAS, other software
Installation

 Official Site:
– https://www.r-project.org/
– Download from
http://cran.r-project.org/mirrors.html
– Base Distribution
https://cran.r-project.org/bin/windows/base/
– Current version: 3.5.1 (as of Dec 2018)
– 32-bit vs. 64-bit: as per individual choice
• Corporate environment: as per production environment
 RStudio Desktop (V 1.1.463): popular IDE to run R programs
– Interface to the R console
– Has a free license
– Also has a non-install version
– https://www.rstudio.com/products/rstudio/download/

Confidential |
Usage

 Startup: R runs the file named "Rprofile.site"


– Located in the directory "C:\Program Files\R\R-n.n.n\etc"
 Concept of a "Working Directory": used for file handling
getwd(); setwd()
– In Windows paths, change each "\" to "/" or escape it as "\\"
 R looks for a file named ".Rprofile" in the working directory
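The working-directory calls can be tried safely as in the sketch below. It assumes tempdir() as a stand-in for a real project folder, and the Windows folder names are hypothetical.

```r
old_wd <- getwd()            # remember the current working directory
setwd(tempdir())             # tempdir() stands in for a real project folder
new_wd <- getwd()            # now points at the temporary folder
setwd(old_wd)                # restore the original directory

# Two equivalent ways to write a Windows path (hypothetical folder)
path_forward <- "C:/Users/me/data"
path_escaped <- "C:\\Users\\me\\data"
nchar(path_escaped)          # each "\\" is stored as a single backslash
```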
 Packages: collections of R functions, data, and compiled code
– Library: directory where packages are stored
 Installed R: standard set of packages
– Others can be downloaded and installed
– Have to be loaded into the session to be used
.libPaths() # get library location
library() # see all packages installed
library(package) # load pkg into the current session
search() # see packages currently loaded
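A short sketch of the install-then-load cycle. The package name psych is only an example, and the install line is commented out because it needs internet access.

```r
# install.packages("psych")       # one-time download from CRAN
library(stats)                    # load an installed package into the session
attached <- "package:stats" %in% search()   # TRUE once the package is attached
lib_dirs <- .libPaths()           # directories where packages are stored
pkgs <- rownames(installed.packages())      # names of all installed packages
"stats" %in% pkgs                 # stats ships with every R installation
```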
First Program

n <- floor(rnorm(10000, 500, 100))
t <- table(n)
barplot(t)

– Line 1: generates a vector of 10000 random numbers drawn from a normal distribution with mean 500 and standard deviation 100
• floor: rounds each number down to the nearest integer, dropping the fractional part
– table: takes those 10000 numbers, counts the frequency of each
– barplot: takes the table of frequencies; creates the bar chart

barplot(table(floor(rnorm(10000, 500, 100))))
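A reproducible variant of the same program, as a sketch. The seed value 42 is arbitrary, the name freq is chosen here to avoid masking the built-in t() function, and pdf(NULL) is an assumption that lets the plot run without a screen; it can be omitted in RStudio.

```r
set.seed(42)                                   # arbitrary seed for repeatable draws
n <- floor(rnorm(10000, mean = 500, sd = 100))
freq <- table(n)                               # frequency of each distinct integer
length(n)                                      # 10000
sum(freq)                                      # 10000: every draw is counted once

pdf(NULL)                                      # off-screen device (assumption)
barplot(freq)
invisible(dev.off())
```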

Getting Started

 R console: has the default ">" prompt


– Type commands at the prompt: hit return to execute it

> n <- c(2, 3, 5, 10, 14)


> mean(n)

 Incomplete command
– Prompt changes to "+": continues to take input
– Until the command is syntactically complete
 Execute R commands stored in an external file:
– Escape "\" if full path is provided
source("example.R")
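A self-contained sketch of source(): it writes a two-line script to a temporary location (a hypothetical stand-in for example.R) and then runs it.

```r
script <- file.path(tempdir(), "example.R")    # hypothetical script location
writeLines(c("n <- c(2, 3, 5, 10, 14)",
             "avg <- mean(n)"), script)
source(script)       # runs every command stored in the file
avg                  # 6.8
```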

Getting Started

 help() function
> help.start() # general help
> help(Syntax)
> ?Syntax

> help.search("histograms")
> ??"histograms"

> apropos("foo") # list all functions containing string foo


> example(foo) # show an example of function foo

> help("bs", try.all.packages = TRUE)


> help("bs", package = "splines")

Reserved Words

 Reserved Words: Set of words which have special meaning


– Cannot be used as an identifier (names of variables, functions etc.)
• if • else • repeat • while • function

• for • in • next • break • TRUE

• FALSE • NULL • Inf • NaN • NA

• NA_integer_ • NA_real_ • NA_complex_ • NA_character_ • …


– Can be viewed:
> help(reserved)
> ?reserved
– Basic building blocks of programming in R
– R: case sensitive

> TRUE <- 1    # error: TRUE is a reserved word
> True <- 1    # works: R is case sensitive, so True is an ordinary name
> TRUE
> True
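The difference can be checked without stopping the session, as sketched here. The tryCatch() wrapper and the message text are illustrative choices.

```r
res <- tryCatch(eval(parse(text = "TRUE <- 1")),
                error = function(e) "not allowed")
res                  # "not allowed": a reserved word cannot be assigned to
True <- 1            # allowed: True is not the same name as TRUE
isTRUE(TRUE)         # TRUE keeps its built-in meaning
```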
Variables and Constants

 Variables: used to store data


– Value can be changed as required
 Identifier: Unique name given to variable (function, objects)
 Rules for writing Identifiers
– Identifiers can be a combination of letters, digits, period (.) and
underscore (_)
– It must start with a letter or a period. If it starts with a period, it cannot
be followed by a digit
– Reserved words cannot be used as identifiers
 Valid identifiers
total, Sum, .fine.with.dot, this_is_acceptable, Number5
 Invalid identifiers
tot@l, 5um, _fine, TRUE, .0ne
 Good Practice: CamelCase
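One way to test the rules is sketched below. It relies on base R's make.names(), which returns a name unchanged only when it is already a valid identifier; using it as a validity check is a convenience, not an official validator.

```r
candidates <- c("total", "Sum", ".fine.with.dot", "this_is_acceptable",
                "Number5", "tot@l", "5um", "_fine", "TRUE", ".0ne")
is_valid <- make.names(candidates) == candidates   # unchanged means valid
names(is_valid) <- candidates
is_valid             # first five TRUE, last five FALSE
```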

Constants

 Entities whose value cannot be altered
– Basic types: numeric, character
 Numeric Constants: All numbers
– Can be of type integer, double or complex
– Can be checked with typeof()
– Numeric constants followed by L are regarded as integers
• Those followed by i are regarded as complex
• Those preceded by 0x or 0X are interpreted as hexadecimal numbers
 Character Constants
– Can be represented using either single (') or double quotes (") as delimiters
> typeof(5)
> typeof(5L)
> typeof(5i)
> 0xff
> 0XF + 1
> 'example'
> typeof("5")
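For reference, the results these constants should give, written as a sketch with the expected values in comments.

```r
typeof(5)                          # "double": plain numbers are doubles
typeof(5L)                         # "integer"
typeof(5i)                         # "complex"
typeof("5")                        # "character"
0xff                               # 255
0XF + 1                            # 16
identical('example', "example")    # TRUE: both quote styles give the same string
```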

Built-in Constants

 Constants built into R


– Implemented as variables taking appropriate values
• LETTERS The 26 upper-case letters of the roman alphabet
• letters The 26 lower-case letters of the roman alphabet
• month.abb The three-letter abbreviations for the English month names
• month.name The English names for the months of the year
• pi The ratio of the circumference of a circle to its diameter

> LETTERS
> letters
> pi
> month.name
> month.abb
> pi <- 56    # allowed: built-in constants are ordinary variables and can be masked
> pi
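Because the built-in constants are ordinary variables, an accidental overwrite can be undone. A minimal sketch, with masked and restored as illustrative names:

```r
month.abb[1:3]       # "Jan" "Feb" "Mar"
pi <- 56             # creates a user variable that masks the built-in pi
masked <- pi         # 56
rm(pi)               # remove the user variable
restored <- pi       # the built-in value is visible again
```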

Operators

 Used to perform tasks: mathematical, logical operations


– Arithmetic, Relational, Logical, Assignment
 Arithmetic Operators
+, -, *, /, ^, %%, %/%
 Relational Operators
<, >, <=, >=, ==, !=
 Assignment Operators
<-, <<-, = (Leftwards)
->, ->> (Rightwards)
– <<- operator: assigns to a variable in an enclosing (parent) environment
– Good Practice:
• <- for assignment; = for named parameters
• Rightward assignments are rarely used
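A few of these operators in action, as a sketch. The counter and increment names are made up for the illustration.

```r
7 %% 3               # 1: remainder
7 %/% 3              # 2: integer division
2 ^ 3                # 8: exponentiation

counter <- 0
increment <- function() {
  counter <<- counter + 1    # <<- updates counter outside the function
  invisible(counter)
}
increment(); increment()
counter              # 2

5 -> y               # rightward assignment, same effect as y <- 5
```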

Logical Operators

 & and |: perform element-wise operation
– Result has length of the longer operand
 && and ||: examine only the first element of each operand
– Result: single length logical vector
– Since R 4.3.0, operands longer than one element raise an error
 Zero is considered FALSE
– Non-zero numbers are taken as TRUE

Op   Description
!    Logical NOT
&    Element-wise logical AND
&&   Logical AND
|    Element-wise logical OR
||   Logical OR

> x <- c(TRUE,FALSE,0,6)
> y <- c(FALSE,TRUE,FALSE,TRUE)
> !x
> x&y
> x&&y
> x|y
> x||y
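The sketch below shows the expected results for the same vectors. For && and || it passes single elements explicitly, which behaves the same in old and new versions of R.

```r
x <- c(TRUE, FALSE, 0, 6)         # stored as the numbers 1, 0, 0, 6
y <- c(FALSE, TRUE, FALSE, TRUE)
!x                                # FALSE TRUE TRUE FALSE
x & y                             # FALSE FALSE FALSE TRUE
x | y                             # TRUE TRUE FALSE TRUE
x[1] && y[1]                      # FALSE: single elements only
x[1] || y[1]                      # TRUE
```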

Data Types

 Data: most basic ingredients used in "data analysis"


 Basic data types:
– character, numeric (real or decimal), integer, logical, complex
• character: "a", "swc"
• numeric: 2, 15.5
• integer: 2L (the L tells R to store this as an integer)
• logical: TRUE, FALSE
• complex: 1+4i (complex numbers with real and imaginary parts)
– Functions to examine features
• class() - what kind of object is it?
• typeof() - what is the object’s data type?
• length() - how long is it?
• attributes() - does it have any metadata?
> x <- "dataset"
> typeof(x)
> attributes(x)
> y <- 1:10
> y
> typeof(y)
> length(y)
> z <- as.numeric(y)
> z
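class() and typeof() often agree for simple values but can differ, which the sketch below illustrates. The date value is an arbitrary example.

```r
y <- 1:10
typeof(y)                  # "integer"
z <- as.numeric(y)
typeof(z)                  # "double"
class(z)                   # "numeric"
attributes(y)              # NULL: a plain vector carries no metadata

d <- as.Date("2018-12-01") # arbitrary example date
class(d)                   # "Date": the kind of object
typeof(d)                  # "double": how it is stored internally
```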
Data Structures

 Vector: collection of elements: of the same type


– Most commonly of mode character, logical, integer or numeric
> vector()
> logical(0)
> vector("character", length = 5)
> character(5)
> numeric(5)
> logical(5)
> x <- c(1, 2, 3)
> x1 <- c(1L, 2L, 3L)
> y <- c(TRUE, TRUE, FALSE, FALSE)
> z <- c("Sarah", "Tracy", "Jon")
> typeof(z)
> length(z)
> class(z)
> str(z)
– Adding Elements: c() – for combine
> z <- c(z, "Annette")
> z <- c("Greg", z)
– Create vectors from a sequence of numbers
> series <- 1:10
> seq(10)
> seq(from = 1, to = 10, by = 0.1)
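A short sketch of growing a vector and building sequences, with the expected results in comments.

```r
z <- c("Sarah", "Tracy", "Jon")
z <- c(z, "Annette")         # append at the end
z <- c("Greg", z)            # prepend at the front
length(z)                    # 5
z[1]                         # "Greg"

identical(seq(10), 1:10)     # TRUE: both give the integers 1 to 10
steps <- seq(from = 1, to = 10, by = 0.1)
length(steps)                # 91 values: 1.0, 1.1, ..., 10.0
```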

Operations on Vectors

 Adding two vectors of equal length
– No difficulty
 Adding a single number to a vector:
– Expansion to the length of the vector
 Vectors of unequal length
– Expansion (recycling) of the shorter vector to the length of the longer one
– Warning issued if the longer length is not an integral multiple of the shorter length
> x <- c(3, 6, 8)
> y <- c(2, 9, 0)
> x+y
> x+1
> x+c(1,4)
> c(2,4,6,8)-c(1,4)
 Sum, mean and product of vector elements
x<-c(2,NA, 3,1,4); x
sum(x)
sum(x, na.rm=TRUE)
mean(x, na.rm=TRUE)
prod(x, na.rm=TRUE)
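The recycling rule and the na.rm argument can be checked with the sketch below. suppressWarnings() is used only so that the unequal-length case runs quietly.

```r
x <- c(3, 6, 8)
y <- c(2, 9, 0)
x + y                        # 5 15 8
x + 1                        # 4 7 9: the 1 is recycled three times
c(2, 4, 6, 8) - c(1, 4)      # 1 0 5 4: c(1, 4) is used twice
odd <- suppressWarnings(x + c(1, 4))   # 4 10 9, normally with a warning

v <- c(2, NA, 3, 1, 4)
sum(v)                       # NA: a missing value propagates
sum(v, na.rm = TRUE)         # 10
mean(v, na.rm = TRUE)        # 2.5
prod(v, na.rm = TRUE)        # 24
```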

Vectors

 Accessing Elements: using vector indexing


– Index can be integer, logical or character
– Integer: starts from 1
• Can use a vector of integers as index to access specific elements
• Can also use negative integers to return all elements except those specified
• Cannot mix positive and negative integers while indexing
– Real numbers, if used, are truncated to integers

> x<-c(0, 2, 4, 6, 8, 10)
> x[3]
> x[c(2, 4)]
> x[-1]
> x[c(2, -4)]    # error: positive and negative indices cannot be mixed
> x[c(2.4, 3.54)]    # same as x[c(2, 3)]
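Expected results for the integer-indexing rules, as a sketch. The tryCatch() wrapper is an illustrative way to show the mixed-sign error without stopping the script.

```r
x <- c(0, 2, 4, 6, 8, 10)
x[3]                     # 4: indexing starts at 1
x[c(2, 4)]               # 2 6
x[-1]                    # 2 4 6 8 10: everything except the first element
x[c(2.4, 3.54)]          # 2 4: indices are truncated to 2 and 3
mixed <- tryCatch(x[c(2, -4)], error = function(e) "error")
mixed                    # "error": signs cannot be mixed
```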

Vectors

 Using logical vector as index


– Elements at positions where the logical vector is TRUE are returned; a shorter logical vector is recycled
> x<-c(0, 2, 4, 6, 8, 10)
> x[c(TRUE, FALSE, FALSE, TRUE)]
> x[c(TRUE)]
> x[c(TRUE,FALSE)]
> x[c(TRUE,FALSE,FALSE)]
> x[c(TRUE,FALSE,FALSE,TRUE)]
 Using character vector as index
> x <- c("first"=3, "second"=0, "third"=9); names(x)
> x["second"]; x[c("first", "third")]

 Modify Vectors
> x<-c(-3, -2, -1, 0, 1, 2); x
> x[2] <- 0; x
> x[x<0] <- 5; x
> x <- x[1:4]; x
> x<-c(-3, -2, NA, 0, 1, 2); x
> x[is.na(x)] <- 5; x
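A sketch of logical and character indexing followed by in-place modification, with expected values in comments. The name s is illustrative.

```r
x <- c(0, 2, 4, 6, 8, 10)
x[c(TRUE, FALSE)]            # 0 4 8: the index is recycled along x
x[c(TRUE, FALSE, FALSE)]     # 0 6

s <- c("first" = 3, "second" = 0, "third" = 9)
s[["second"]]                # 0: look up by name

x <- c(-3, -2, -1, 0, 1, 2)
x[2] <- 0                    # replace one element
x[x < 0] <- 5                # replace every negative element
x                            # 5 0 5 0 1 2
x <- x[1:4]                  # keep only the first four elements
```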

Data Structures

 Missing Data: represented as NA (Not Available)


> x <- c(0.5, NA, 0.7)
> x <- c(TRUE, FALSE, NA)
> x <- c("a", NA, "c", "d", "e")
– is.na(): Indicates the elements that represent missing data
• anyNA(): returns TRUE if the vector contains any missing values
> x <- c("a", NA, "c", "d", NA)
> y <- c("a", "b", "c", "d", "e")
> is.na(x)
> is.na(y)
> anyNA(x)
> anyNA(y)
– Special Values: Inf (Infinity: -ve/+ve); NaN: not a number (0/0)
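– Mixing types inside a vector: elements are coerced to a common type
> xx <- c(1.7, "a")    # coerced to character
> xy <- c(TRUE, 2)    # coerced to numeric
> xz <- c("a", TRUE)    # coerced to character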
– Delete vector: rm(x)
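Missing values, special values and coercion can be inspected as in this sketch.

```r
x <- c("a", NA, "c", "d", NA)
is.na(x)                 # FALSE TRUE FALSE FALSE TRUE
anyNA(x)                 # TRUE
sum(is.na(x))            # 2: number of missing values

is.nan(0 / 0)            # TRUE
1 / 0                    # Inf
-1 / 0                   # -Inf

typeof(c(1.7, "a"))      # "character"
typeof(c(TRUE, 2))       # "double"
typeof(c("a", TRUE))     # "character"
```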
Data Structures

 Matrix: two dimensional data structure in R


– Extension of the numeric or character vectors: with dimensions (rows,
columns)
> m <- matrix(nrow = 2, ncol = 2); m; dim(m)
> m <- matrix(c(1:3)); class(m); typeof(m)
– Filled column-wise
> m <- matrix(1:6); m
> m <- matrix(1:6, nrow = 2, ncol = 3); m
– Other ways to construct a matrix
> m <- 1:10; typeof(m); class(m)
> dim(m) <- c(2, 5); typeof(m); class(m)
> mdat <- matrix(c(1:3, 11:13), nrow = 2, ncol = 3, byrow = TRUE)
> mdat
– Referencing Elements
mdat[2, 3]
> x <- 1:3
> y <- 10:12
> class (x); class(y)
> mdat <- cbind(x,y); mdat
> mdat <- rbind(x,y); mdat
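The fill order and the binding functions, sketched with expected values. The names by_col and by_row are illustrative.

```r
m <- matrix(1:6, nrow = 2, ncol = 3)     # filled column by column
m[1, 2]                                  # 3

mdat <- matrix(c(1:3, 11:13), nrow = 2, ncol = 3, byrow = TRUE)
mdat[2, 3]                               # 13: filled row by row

x <- 1:3
y <- 10:12
by_col <- cbind(x, y)                    # vectors become columns
by_row <- rbind(x, y)                    # vectors become rows
dim(by_col)                              # 3 2
dim(by_row)                              # 2 3
```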
Matrix

 Access elements: var[row, column]


– Can use –ve integers to specify rows or columns to be excluded

x<-matrix(1:9, nrow = 3, ncol = 3); x


x[,]; x[-1,]; x[,1]; x[,1:2]; x[,c(1,3)]

• Name the rows and columns


x <- matrix(1:9, nrow = 3, dimnames = list(c("X","Y","Z"),
c("A","B","C"))); x
x[,1]
x[,1:2]
x[,c(1,3)]

• Names can be accessed or changed


colnames(x); rownames(x)
colnames(x) <- c("C1","C2","C3")
rownames(x) <- c("R1","R2","R3")
x
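A sketch of the same access patterns with expected results. The names first_col and rest are illustrative.

```r
x <- matrix(1:9, nrow = 3,
            dimnames = list(c("X", "Y", "Z"), c("A", "B", "C")))
x["Y", "B"]              # 5: row Y, column B
first_col <- x[, 1]      # named vector X = 1, Y = 2, Z = 3
rest <- x[-1, ]          # all rows except the first
dim(rest)                # 2 3

colnames(x) <- c("C1", "C2", "C3")
rownames(x) <- c("R1", "R2", "R3")
x["R2", "C2"]            # 5: same cell under its new names
```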

Matrix

 If the result of indexing is a single row or a single column, it is returned as a vector
x[,1]; class(x[,1])
– Can be avoided by using the argument drop = FALSE
x[,1,drop=FALSE]; class(x[,1,drop=FALSE])
 Possible to index a matrix with a single vector
– The matrix is then treated as one long vector, taken column by column
x <- matrix(c(4,6,1,8,0,2,3,7,9), nrow = 3, dimnames = list(NULL, c("A","B","C")))
class(x[1:4]); x[c(3,5,7)]
 Logical vectors can be used to index a matrix
– Rows and columns where the value is TRUE are returned
– Can be mixed with integer vectors
x[c(TRUE,FALSE,TRUE),c(TRUE,TRUE,FALSE)]; x[c(TRUE,FALSE),c(2,3)]
– Can also use a single logical vector
x[c(TRUE, FALSE)]; x[x>5]; x[x%%2 == 0]

 Character Vectors as index
x[,"A"]; x[TRUE,c("A","C")]; x[2:3,c("A","C")]
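The indexing variants above can be verified with the sketch below. It assumes a 3 x 3 matrix with column names A, B, C, chosen for illustration.

```r
x <- matrix(c(4, 6, 1, 8, 0, 2, 3, 7, 9), nrow = 3,
            dimnames = list(NULL, c("A", "B", "C")))
is.matrix(x[, 1])                  # FALSE: a single column drops to a vector
is.matrix(x[, 1, drop = FALSE])    # TRUE: still a 3 x 1 matrix

x[c(3, 5, 7)]                      # 1 0 3: elements counted column by column
x[x > 5]                           # 6 8 7 9
x[x %% 2 == 0]                     # 4 6 8 0 2
x[c(TRUE, FALSE, TRUE), c("A", "C")]   # rows 1 and 3 of columns A and C
```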

Modify Matrix

 Use assignment operator with accessed elements


x<-1:9; dim(x) <- c(3,3)
x[2,2] <- 10; x
x[x<5] <- 0; x

 Transpose: t(x)
 Add row or column using rbind() and cbind()
– Remove through reassignment
x<-cbind(x, c(1, 2, 3)); x
x<-rbind(x, c(1, 2, 3, 4)); x    # the new row needs one value per column (now 4)
x <- x[1:2,]; x
 Modify dimension
x<-1:6; dim(x)<-c(2,3); x
dim(x) <- c(3,2); x
dim(x) <- c(1,6); x
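A sketch that follows the same modifications and records the shape after each step. The name y is illustrative.

```r
x <- 1:9; dim(x) <- c(3, 3)
x[2, 2] <- 10                    # change one cell
x[x < 5] <- 0                    # zero out every value below 5
sum(x)                           # 40
dim(t(x))                        # 3 3: transpose of a square matrix

x <- cbind(x, c(1, 2, 3))        # add a column: 3 x 4
x <- rbind(x, c(1, 2, 3, 4))     # add a row: 4 x 4
x <- x[1:2, ]                    # keep rows 1 and 2: 2 x 4
dim(x)                           # 2 4

y <- 1:6; dim(y) <- c(2, 3)
dim(y) <- c(3, 2)                # same six values, new shape
y[3, 2]                          # 6
```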

Data Structures

 List: May contain mixture of data types


> x <- list(1, "a", TRUE); x
> x <- vector("list", length = 5); length(x)

> x <- 1:10; class(x); length(x)


> x <- as.list(x); class(x); length(x)
– Content of list elements
• x[[1]]; class(x[[1]]); class(x[1])
– Naming the elements of a list
> xlist <- list(a = "Mumbai Central", b = 1:10, data =
head(iris)); xlist; names(xlist)
> length(xlist); str(xlist)
– Elements: indexed by double brackets
• Named elements can be referred by $ (xlist$data)
> xlist$b; xlist$b[1]; xlist$b[2]
> xlist[[1]]; xlist[[2]]; xlist[1]; xlist[2]
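The difference between single and double brackets is easiest to see from the classes returned, as in this sketch.

```r
xlist <- list(a = "Mumbai Central", b = 1:10, data = head(iris))
length(xlist)            # 3
names(xlist)             # "a" "b" "data"
class(xlist[1])          # "list": single brackets return a sublist
class(xlist[[1]])        # "character": double brackets return the content
xlist$b[2]               # 2
nrow(xlist$data)         # 6: head() keeps the first six rows of iris
```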
Lists

 List with tags (optional)


x <- list("a" = 2.5, "b" = TRUE, "c" = 1:3); x
x <- list(2.5,TRUE,1:3); x

 Accessing list components
x<-list("name"="john","age"=29, "speaks"=c("English","French")); x
– Indexing with [ gives a sublist
• Retrieve the content: use [[
x["age"]; typeof(x["age"])
x[["age"]]; typeof(x[["age"]])
x[c(1:2)]
x[-2]
x$speaks[1]
x[[3]][1]
x[c(T,F,F)]
x[c("age","speaks")]
• Partial Matching with $
x$a; x$ag; x$age
 Modify List: through reassignment
x[["name"]] <- "Claire"; x
 Add components to a list
– Assign values using new tags
x[["married"]] <- FALSE; x
 Delete a component:
x[["married"]] <- NULL; x
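A sketch tracing the list operations above with expected results. The names n_added and n_deleted are illustrative.

```r
x <- list("name" = "john", "age" = 29, "speaks" = c("English", "French"))
typeof(x["age"])             # "list": [ gives a sublist
typeof(x[["age"]])           # "double": [[ gives the content
x$a                          # 29: "a" partially matches only "age"
x[[3]][1]                    # "English"

x[["name"]] <- "Claire"      # modify
x[["married"]] <- FALSE      # add a component under a new tag
n_added <- length(x)         # 4
x[["married"]] <- NULL       # delete the component
n_deleted <- length(x)       # 3
```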
Data Structures

 Data Frame: special type of list - every list element has same length
– Usually created by read.csv() and read.table(): data import
– Can be converted to a matrix: data.matrix()
• If all columns are of the same type
> dat <- data.frame(id = letters[1:10], x = 1:10, y = 11:20); dat
> class(dat); length(dat); str(dat)
> dat[1, 3]; dat[["y"]]; dat$y
– Number of rows and columns: nrow(dat); ncol(dat)
– Other useful functions:
head(), tail(), dim(), nrow(), ncol(), str(), names()
> sapply(dat, class)
– Before R 4.0.0, data.frame() converts character vectors into factors by default
• Suppress: stringsAsFactors=FALSE (the default since R 4.0.0)
x <- data.frame("SN" = 1:2, "Age" = c(21,15), "Name" = c("John", "Dora"),
stringsAsFactors = FALSE)
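A sketch of the inspection functions on the two data frames from this slide. stringsAsFactors is set explicitly, so the result does not depend on the R version.

```r
dat <- data.frame(id = letters[1:10], x = 1:10, y = 11:20,
                  stringsAsFactors = FALSE)
dat[1, 3]                # 11: row 1, column 3
dim(dat)                 # 10 3
sapply(dat, class)       # id "character", x "integer", y "integer"

x <- data.frame("SN" = 1:2, "Age" = c(21, 15), "Name" = c("John", "Dora"),
                stringsAsFactors = FALSE)
is.character(x$Name)     # TRUE: names stay as text, not factors
```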
Data Frame

 Access Data Frame Components: like a list / matrix
– Use either [, [[ or $ operator to access columns of data frame
x["Name"]
x$Name[1]
x[["Name"]][2]
x[[3]][1]
– Matrix style
trees[2:3,]
trees[trees$Height > 82,]
trees[10:12,2]

 Modify Data Frame
x[1,"Age"] <- 20; x
– Adding Components
• Rows: rbind()
x<-rbind(x,list(1,16,"Paul"));x
• Columns: cbind() (one value per row; x now has 3 rows)
x<-cbind(x,State=c("NY","FL","WA"));x
• Like a list:
x$State <- c("FL","OH","NY"); x
– Deleting Components
• Columns
x$State <- NULL; x
• Rows
x <- x[-1,];x
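The add and delete steps can be followed in one run, as sketched here. The values are the slide's own, except that the State column is given three entries so that it matches the three rows.

```r
x <- data.frame("SN" = 1:2, "Age" = c(21, 15), "Name" = c("John", "Dora"),
                stringsAsFactors = FALSE)
x[1, "Age"] <- 20                          # modify one cell
x <- rbind(x, list(1, 16, "Paul"))         # add a row: now 3 rows
x <- cbind(x, State = c("NY", "FL", "WA")) # add a column, one value per row
dim(x)                                     # 3 4
x$State <- NULL                            # delete the column
x <- x[-1, ]                               # delete the first row
x$Name                                     # "Dora" "Paul"
```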
Factors

 Used to represent categorical data


– Can be ordered or unordered
– Once created, can only contain a pre-defined set of values (levels)
• (Default) levels sorted in alphabetical order
> gender <- factor(c("male", "female", "female", "male"))
> levels(gender); nlevels(gender)
> gender <- factor(gender, levels = c("male", "female"))
> levels(gender); nlevels(gender)

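> food <- factor(c("low", "high", "medium", "high", "low", "medium", "high"))
> food <- factor(food, levels = c("low", "medium", "high"), ordered = TRUE)
> levels(food); min(food)    # min() works only for ordered factors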
– Levels of a factor are inferred from the data if not provided
• Levels may be predefined even if not used
x <- factor(c("single", "married", "married", "single"), levels = c("single",
"married", "divorced")); x

Factors

 Factors are stored as integer vectors


x <- factor(c("single","married","married","single")); str(x)

– Levels are stored in a character vector


• Individual elements are stored as indices
 Factors are created when non-numerical columns are read into a data frame (by default before R 4.0.0; later versions need stringsAsFactors = TRUE)
 Accessing Factor Components
– Similar to that of vectors
x[3]
x[c(2,4)]
x[c(1,3)]
x[-1]
x[c(TRUE, FALSE, FALSE, TRUE)]

 Modification of Factors: assignments
x <- factor(c("single", "married", "married", "single"), levels = c("single", "married", "divorced")); x
x[2] <- "divorced"; x
x[3] <- "widowed"; x    # warning: "widowed" is not a level, so NA is stored
levels(x) <- c(levels(x), "widowed")    # add the new level first
x[3] <- "widowed"; x    # now works
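A sketch of how factors store values and levels, including an ordered factor. The names status and food and the sample values are illustrative; suppressWarnings() only hides the expected warning.

```r
status <- factor(c("single", "married", "married", "single"),
                 levels = c("single", "married", "divorced"))
as.integer(status)               # 1 2 2 1: indices into the levels
nlevels(status)                  # 3: "divorced" is a level even though unused

suppressWarnings(status[3] <- "widowed")   # not a level yet
is.na(status[3])                 # TRUE: the invalid value became NA
levels(status) <- c(levels(status), "widowed")
status[3] <- "widowed"           # works once the level exists

food <- factor(c("low", "high", "medium"),
               levels = c("low", "medium", "high"), ordered = TRUE)
as.character(min(food))          # "low"
food[2] > food[3]                # TRUE: high ranks above medium
```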
Getting Data

 Import: fairly simple – different packages for applications


– Package "foreign": Stata and Systat

library(foreign)
mydata <- read.dta("c:/mydata.dta")
mydata <- read.systat("c:/mydata.syd")

– Package "Hmisc": SPSS and SAS

library(Hmisc)
mydata <- spss.get("c:/mydata.por", use.value.labels=TRUE)
mydata <- sasxport.get("c:/mydata.xpt")

– Package "xlsx": MS Excel files

library(xlsx)
mydata <- read.xlsx("c:/myexcel.xlsx", sheetName = "mysheet")
Getting Data

 Reading from a CSV file:


– Set working directory: setwd(<dirName>)
carSpeeds <- read.csv(file = 'car-speeds.csv'); head(carSpeeds)

– Change delimiters (sep = ';')


– Default: (header = TRUE)
carSpeeds <- read.csv(file = 'car-speeds.csv')
carSpeeds$Color <- ifelse(carSpeeds$Color == 'Blue', 'Green', carSpeeds$Color)
carSpeeds$Color
str(carSpeeds)
– stringsAsFactors = FALSE
• Control over individual columns: (as.is = c(1,3))
– strip.white = TRUE
unique(carSpeeds$Color)
carSpeeds <- read.csv(file = 'car-speeds.csv',
stringsAsFactors = FALSE, strip.white = TRUE, sep = ',')
– na.strings = "-9999"
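The read options can be tried without the original file, as in this sketch. It writes a small made-up CSV to a temporary file; the column names and values are invented for illustration and are not the contents of car-speeds.csv.

```r
csv_path <- tempfile(fileext = ".csv")         # temporary stand-in file
writeLines(c("Color,Speed,State",
             "Blue,32,NewMexico",
             " Red ,-9999,Arizona",
             "Blue,35,Colorado"), csv_path)

carSpeeds <- read.csv(csv_path, stringsAsFactors = FALSE,
                      strip.white = TRUE, na.strings = "-9999")
unique(carSpeeds$Color)        # "Blue" "Red": surrounding spaces stripped
is.na(carSpeeds$Speed[2])      # TRUE: -9999 was read as missing

# Recoding works as expected because Color is character, not factor
carSpeeds$Color <- ifelse(carSpeeds$Color == "Blue", "Green", carSpeeds$Color)
carSpeeds$Color                # "Green" "Red" "Green"
```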

Writing Data

write.csv(carSpeeds, file = 'car-speeds-cleaned.csv')

write.csv(carSpeeds, file = 'car-speeds-cleaned.csv', row.names = FALSE)

carSpeeds$Speed[3] <- NA

write.csv(carSpeeds, file = 'car-speeds-cleaned.csv', row.names = FALSE, na = '-9999')
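The effect of row.names and na can be seen by reading the written file back, as in this sketch. The two-row data frame df and the temporary output path are made up for the illustration.

```r
df <- data.frame(Color = c("Blue", "Red"), Speed = c(32, NA),
                 stringsAsFactors = FALSE)
out_path <- tempfile(fileext = ".csv")         # temporary output file
write.csv(df, file = out_path, row.names = FALSE, na = "-9999")

written <- readLines(out_path)
written[1]                     # header line only, no row-number column
written[3]                     # the NA was written as -9999

back <- read.csv(out_path, na.strings = "-9999", stringsAsFactors = FALSE)
is.na(back$Speed[2])           # TRUE: the marker is read back as missing
```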

Descriptive Statistics

 Install the following packages


– rcompanion
– DescTools
– psych
 sapply() function with a specified summary statistic
– sapply(carSpeeds, mean, na.rm=TRUE)
– Possible functions used in sapply include
• mean, sd, var, min, max, median, range, and quantile
– Functions designed to provide a range of descriptive statistics:
• summary(carSpeeds); fivenum(carSpeeds$Speed)
• library(psych); describe(carSpeeds)
– describeBy(carSpeeds,group="Color")
– describeBy(carSpeeds,group="State")
– describeBy(carSpeeds,group=c("Color","State"))
• library(DescTools); Mode(carSpeeds$Speed)
• Several different methods to calculate percentiles (default: type=7)
– quantile(carSpeeds$Speed, 0.75, type=2)
 Summaries for data frames
• str(carSpeeds)
• summary(carSpeeds)
 Dealing with missing values
• na.rm = TRUE
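A base-R sketch of the same ideas on a small made-up data frame (the values are invented, not taken from car-speeds.csv). tapply() is used here as a plain substitute for grouped summaries; it is not the psych describeBy() function.

```r
carSpeeds <- data.frame(Color = c("Blue", "Red", "Blue", "Red", "Blue"),
                        Speed = c(10, 20, 30, 40, NA),
                        stringsAsFactors = FALSE)

mean(carSpeeds$Speed)                        # NA: missing value propagates
mean(carSpeeds$Speed, na.rm = TRUE)          # 25
fivenum(carSpeeds$Speed)                     # 10 15 25 35 40 (NA dropped)

# Group means by colour
by_color <- tapply(carSpeeds$Speed, carSpeeds$Color, mean, na.rm = TRUE)
by_color                                     # Blue 20, Red 30

# The percentile depends on the calculation method
q7 <- quantile(carSpeeds$Speed, 0.75, na.rm = TRUE)             # 32.5
q2 <- quantile(carSpeeds$Speed, 0.75, na.rm = TRUE, type = 2)   # 35
```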

Thank you
