
MACHINE LEARNING USING R
DAY1: 16-12-2018
Introduction to Machine Learning
Machine learning, put simply, involves teaching computers to learn from experience, typically for the
purpose of identifying and/or responding to patterns as well as making predictions about what may
happen in the future.

Types of Machine Learning


The two basic methods of machine learning are supervised and unsupervised machine learning.
In machine learning, we are usually dealing with a target variable and predictor variables.
The target variable is the object of the prediction or what we are trying to learn.

The predictor variables are the variables we put into our model to obtain information about the target
variable.
We want to learn how changes in our predictor variables affect the target variable.

Supervised and Unsupervised Machine Learning Algorithms


Supervised Machine Learning
The majority of practical machine learning uses supervised learning.

Supervised learning is where you have input variables (X) and an output variable (Y) and you use an
algorithm to learn the mapping function from the input to the output:
Y = f(X)

The goal is to approximate the mapping function so well that, when you have new input data (X), you
can predict the output variable (Y) for that data.

Supervised learning problems can be further grouped into regression and classification problems.
Classification: A classification problem is when the output variable is a category, such as “red” or “blue”, or
“disease” or “no disease”.

Regression: A regression problem is when the output variable is a real value, such as “dollars” or “weight”.

Some common types of problems built on top of classification and regression include recommendation
and time series prediction respectively.
Some popular examples of supervised machine learning algorithms are:

• Linear regression for regression problems.
• Random forest for classification and regression problems.
• Support vector machines for classification problems.
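As a minimal sketch of the Y = f(X) idea, the following fits a linear regression on the built-in mtcars data and predicts the target for new inputs. Treating mpg as the target and wt as the only predictor is an assumption made purely for illustration, and the new_cars values are made up.

```r
# Supervised learning sketch: learn Y = f(X) from labelled examples.
# Assumption: mpg is the target variable and wt is the predictor variable.
fit <- lm(mpg ~ wt, data = mtcars)

# The estimated mapping function: an intercept and a slope for wt.
print(coef(fit))

# Predict the target for new, unseen input data (weights in 1000 lbs).
new_cars <- data.frame(wt = c(2.5, 3.5))
pred <- predict(fit, newdata = new_cars)
print(pred)
```

Because the output variable is a real value, this is a regression problem; a classification problem would use a categorical target instead.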

Unsupervised Machine Learning


Unsupervised learning is where you only have input data (X) and no corresponding output variables.

The goal for unsupervised learning is to model the underlying structure or distribution in the data in
order to learn more about the data.

Unsupervised learning problems can be further grouped into clustering and association problems.

• Clustering: A clustering problem is where you want to discover the inherent groupings in the data, such
as grouping customers by purchasing behavior.
• Association: An association rule learning problem is where you want to discover rules that describe
large portions of your data, such as people that buy X also tend to buy Y.

Some popular examples of unsupervised learning algorithms are:
• k-means for clustering problems.
• Apriori algorithm for association rule learning problems.
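A small clustering sketch using k-means on the built-in iris measurements is shown below. Choosing three clusters and ignoring the Species column during fitting are assumptions for this illustration; the species labels are used afterwards only to inspect the groupings.

```r
# Unsupervised learning sketch: k-means sees only the inputs (X), no labels.
set.seed(42)                      # k-means starts from random centers
features <- iris[, 1:4]           # four numeric measurement columns
km <- kmeans(features, centers = 3, nstart = 20)

# Compare the discovered clusters with the known species (inspection only).
print(table(cluster = km$cluster, species = iris$Species))
```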

Semi-Supervised Machine Learning


Problems where you have a large amount of input data (X) and only some of the data is labeled (Y) are
called semi-supervised learning problems.

These problems sit in between both supervised and unsupervised learning.

A good example is a photo archive where only some of the images are labeled, (e.g. dog, cat, person) and
the majority are unlabeled.

Machine Learning Process


The process of a machine learning project may not be linear, but there are a number of well-known steps:

• Define Problem.
• Prepare Data.
• Evaluate Algorithms.
• Improve Results.
• Present Results.

Basic Statistics
Much of statistics is focused on analyzing existing data and drawing suitable conclusions using probability
models. Although probabilities are used throughout statistical modeling, it is important
to identify the different questions that probability and statistics help us answer.

Confidence Interval and Hypothesis Testing

# Hypothesis Testing
Researchers retain or reject hypotheses based on measurements of observed samples.

The decision is often based on a statistical mechanism called hypothesis testing.

Just like a judge’s conclusion, an investigator’s conclusion may be wrong. Sometimes, by chance alone, a
sample is not representative of the population. Thus the results in the sample do not reflect reality in the
population, and the random error leads to an erroneous inference.

A type I error (false-positive) occurs if an investigator rejects a null hypothesis that is actually true in the
population; a type II error (false-negative) occurs if the investigator fails to reject a null hypothesis that is
actually false in the population. Although type I and type II errors can never be avoided entirely, the
investigator can reduce their likelihood by increasing the sample size (the larger the sample, the lesser is
the likelihood that it will differ substantially from the population).

A type I error is the mishap of falsely rejecting a null hypothesis when the null hypothesis is true.
The probability of committing a type I error is called the significance level of the hypothesis test, and is
denoted by the Greek letter α.

To validate a hypothesis, a test uses random samples from a population. On the basis of the results from
the sample data, the test either rejects or fails to reject the null hypothesis.

#Null Hypothesis – Hypothesis tests are used to test the validity of a claim that is made about a
population. This claim that’s on trial, in essence, is called the null hypothesis. The null hypothesis is
denoted by H0.

#Alternative Hypothesis – The alternative hypothesis is the one you would believe if the null hypothesis is
concluded to be untrue. The evidence in the trial is your data and the statistics that go along with it. The
alternative hypothesis is denoted by H1 or Ha.

Hypothesis Testing in R
Use hypothesis testing to formally check whether a hypothesis should be rejected. Hypothesis
testing is conducted in the following manner:

• State the Hypotheses – involves stating the null and alternative hypotheses.
• Formulate an Analysis Plan – involves the construction of an analysis plan.
• Analyze Sample Data – involves the calculation and interpretation of the test statistic as described
in the analysis plan.
• Interpret Results – involves the application of the decision rule described in the analysis plan.

NULL HYPOTHESIS

The general idea of hypothesis testing involves:

• Making an initial assumption.
• Collecting evidence (data).
• Based on the available evidence (data), deciding whether to reject or not reject the initial
assumption.

• When you perform a hypothesis test, a p-value helps you determine the significance of your
results.
• Hypothesis tests are used to test the validity of a claim that is made about a population.
• This claim that’s on trial, in essence, is called the null hypothesis.

Alternative Hypothesis

• The alternative hypothesis is the one you would believe if the null hypothesis is concluded to be
untrue.
• The evidence in the trial is your data and the statistics that go along with it.

# P-VALUE
All hypothesis tests ultimately use a p-value to weigh the strength of the evidence (in other words,
what the data are telling you about the population). The p-value is a number between 0 and 1 and is interpreted in the
following way:
A small p-value (typically ≤0.05) indicates strong evidence against the null hypothesis, so you reject it.
A large p-value (> 0.05) indicates weak evidence against the null hypothesis, so you fail to reject it.

A p-value very close to the cutoff (0.05) is considered to be marginal and could go either way.

You use P values to determine statistical significance in a hypothesis test.

#Example

For example, suppose a pizza place claims their delivery times are 30 minutes or less on average but you
think it’s more than that. You conduct a hypothesis test because you believe the null hypothesis,
H0, that the mean delivery time is at most 30 minutes, is incorrect.

Your alternative hypothesis (Ha) is that the mean time is greater than 30 minutes.
You randomly sample some delivery times and run the data through the hypothesis test, and your p-value
turns out to be 0.001, which is much less than 0.05.

In real terms, if the pizza place’s claim were true, there would be only a 0.001 probability of observing
delivery times at least as extreme as those in your sample.

Since typically we are willing to reject the null hypothesis when the p-value is less than 0.05, you
conclude that the pizza place is wrong; the evidence indicates that their delivery times are more than 30 minutes on average,
and you want to know what they’re going to do about it!
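The pizza example can be sketched in R with a one-sided, one-sample t-test. The delivery times below are made-up values for illustration only, and using t.test() is one reasonable choice of test, not necessarily the one the example had in mind.

```r
# Hypothetical sample of delivery times in minutes (invented for illustration).
delivery_times <- c(33, 31, 36, 29, 34, 38, 32, 35, 30, 37)

# H0: mean delivery time <= 30 minutes; Ha: mean delivery time > 30 minutes.
result <- t.test(delivery_times, mu = 30, alternative = "greater")
print(result$p.value)

# Apply the usual decision rule at significance level alpha = 0.05.
alpha <- 0.05
if (result$p.value <= alpha) {
  print("Reject H0: the data suggest the mean exceeds 30 minutes")
} else {
  print("Fail to reject H0")
}
```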

# BELL CURVE
In many collections of data, the values are approximately normally distributed. This
means that, on plotting a graph with the value of the variable on the horizontal axis and the count of the values
on the vertical axis, we get a bell-shaped curve.

The center of the curve represents the mean of the data set.
In the graph, fifty percent of values lie to the left of the mean and the other fifty percent lie to the right of
the mean. This is referred to as the normal distribution in statistics.

# R has four built-in functions for working with the normal distribution. They are described below.

dnorm(x, mean, sd)


pnorm(x, mean, sd)
qnorm(p, mean, sd)
rnorm(n, mean, sd)

Following is the description of the parameters used in the above functions −

• x is a vector of numbers.
• p is a vector of probabilities.
• n is the number of observations (sample size).
• mean is the mean value of the distribution. Its default value is zero.
• sd is the standard deviation. Its default value is 1.

Ex:
# dnorm()
This function gives the height of the probability density at each point for a given mean and standard
deviation.

# Create a sequence of numbers between -10 and 10 incrementing by 0.1.


x <- seq(-10, 10, by = .1)

# Choose the mean as 2.5 and standard deviation as 0.5.


y <- dnorm(x, mean = 2.5, sd = 0.5)

# Give the chart file a name.


png(file = "dnorm.png")

plot(x,y)

# Save the file.


dev.off()

#pnorm()
This function gives the probability that a normally distributed random number is less than or equal to a
given value. It is also called the "Cumulative Distribution Function".

# Create a sequence of numbers between -10 and 10 incrementing by 0.2.


x <- seq(-10,10,by = .2)
# Choose the mean as 2.5 and standard deviation as 2.
y <- pnorm(x, mean = 2.5, sd = 2)
# Plot the graph.
plot(x,y)
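qnorm(), listed among the four functions above, is the inverse of pnorm(): it takes a probability and returns the value below which that fraction of the distribution lies. A minimal sketch, reusing the same mean and standard deviation as the pnorm() example (the probabilities chosen are arbitrary):

```r
# qnorm() maps probabilities to quantiles of the normal distribution.
p <- c(0.025, 0.5, 0.975)
q <- qnorm(p, mean = 2.5, sd = 2)
print(q)

# Round trip: feeding the quantiles back into pnorm() recovers p.
print(pnorm(q, mean = 2.5, sd = 2))
```

The middle quantile equals the mean because the normal distribution is symmetric.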

# rnorm()
This function is used to generate random numbers whose distribution is normal.
It takes the sample size as input and generates that many random numbers.
We draw a histogram to show the distribution of the generated numbers.

# Create a sample of 50 numbers which are normally distributed.


y <- rnorm(50)
# Plot the histogram for this sample.
hist(y, main = "Normal Distribution")
# CORRELATION
# Correlation and Covariance in R

Very often, when analyzing data, you want to know if two variables are correlated.
Correlation measures range between −1 and 1; 1 means that one variable is a (positive) linear function of
the other, 0 means the two variables have no linear relationship, and −1 means that one variable is a negative
linear function of the other (the two move in completely opposite directions).

When you have two continuous variables, you can look for a link between them. This link is called a
correlation.

The cor() command determines correlations between two vectors, all the columns of a data frame, or two
data frames.

The cov() command examines covariance.

The cor.test() command carries out a test of significance of the correlation.

The most commonly used correlation measurement is the Pearson correlation statistic.

The Pearson correlation statistic is rooted in properties of the normal distribution and works best
with normally distributed data.

Spearman correlation is a nonparametric statistic and doesn’t make any assumptions about the underlying
distribution.

Correlation test is used to evaluate the association between two or more variables.

Compute correlation in R

To compute correlations in R, you can use the function cor. This function can be used to compute each of
the correlation measures shown above:

cor(x, y = NULL, use = "everything",
    method = c("pearson", "kendall", "spearman"))

You can compute correlations on two vectors (assigned to arguments x and y), a data frame (assigned to x
with y=NULL), or a matrix (assigned to x with y=NULL).

If you specify a matrix or a data frame, then cor will compute the correlation between each pair of
variables and return a matrix of results.
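A short sketch of the data-frame form of cor(), using three columns of the built-in mtcars data. The choice of columns is arbitrary and only meant to show the matrix result and the method argument.

```r
# Pass a data frame to cor() to get a matrix of pairwise correlations.
vars <- mtcars[, c("mpg", "wt", "hp")]

pearson <- cor(vars, method = "pearson")
spearman <- cor(vars, method = "spearman")

print(round(pearson, 2))
print(round(spearman, 2))
```

Each matrix is symmetric with ones on the diagonal, since every variable is perfectly correlated with itself.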

# Simple Correlation in R
Simple correlations are between two continuous variables. Use the cor() command to obtain a
correlation coefficient, as shown in the following command:

count = c(9,25,15,2,14,25,24,47)
speed = c(2,3,5,9,14,24,29,34)
cor(count, speed)

Here, we’ll use the built-in R data set mtcars as an example.

# The R code below computes the correlation between mpg and wt variables in mtcars data set:
my_data <- mtcars
head(my_data, 6)

# We want to compute the correlation between mpg and wt variables.

# Visualize your data using scatter plots.
# Here, we’ll use the ggpubr R package.
install.packages("ggpubr")  # only needed once
library("ggpubr")
ggscatter(my_data, x = "mpg", y = "wt",
          add = "reg.line", conf.int = TRUE,
          cor.coef = TRUE, cor.method = "pearson",
          xlab = "Miles/(US) gallon", ylab = "Weight (1000 lbs)")

#Covariance in R

The cov() command uses syntax similar to the cor() command to examine covariance.
We can use the cov() command as:

cov(mtcars$mpg, mtcars$cyl)

# Significance Testing in Correlation Tests

cor.test(mtcars$mpg, mtcars$cyl)

Mean, Median and Mode

Statistical analysis in R is performed by using many in-built functions. Most of these functions are part of
the R base package. These functions take R vector as an input along with the arguments and give the
result.

Mean
It is calculated by taking the sum of the values and dividing by the number of values in a data series.
The function mean() is used to calculate this in R.

Example
# Create a vector.
x <- c(12,7,3,4.2,18,2,54,-21,8,-5)

# Find Mean.
result.mean <- mean(x)
print(result.mean)

[1] 8.22

Median
The middle most value in a data series is called the median. The median() function is used in R to calculate
this value.
x is the input vector.

#Example Median
# Create the vector.
x <- c(12,7,3,4.2,18,2,54,-21,8,-5)

# Find the median.


median.result <- median(x)
print(median.result)

[1] 5.6

Mode
The mode is the value that has the highest number of occurrences in a set of data. Unlike the mean and median,
the mode can be computed for both numeric and character data.

R does not have a standard in-built function to calculate mode. So we create a user function to calculate
mode of a data set in R.
This function takes the vector as input and gives the mode value as output.

Example Mode
# Create the function.
getmode <- function(v) {
  uniqv <- unique(v)                            # distinct values, in order of first appearance
  uniqv[which.max(tabulate(match(v, uniqv)))]   # value with the highest count (first one on ties)
}

# Create the vector with numbers.


v <- c(2,1,2,3,1,2,3,4,1,5,5,3,2,3)

# Calculate the mode using the user function.


result <- getmode(v)
print(result)

# Create the vector with characters.


charv <- c("o","it","the","it","it")

# Calculate the mode using the user function.


result <- getmode(charv)
print(result)
When we execute the above code, it produces the following result −

[1] 2
[1] "it"

ANOVA
We are often interested in determining whether the means from more than two populations or groups are
equal or not. To test whether the difference in means is statistically significant we can perform analysis of
variance (ANOVA) using the R function aov().
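A minimal one-way ANOVA sketch with aov(), using the built-in PlantGrowth data (plant weights for a control group and two treatment groups). The dataset is chosen here only as a convenient illustration.

```r
# One-way ANOVA: do the mean weights differ across the three groups?
fit <- aov(weight ~ group, data = PlantGrowth)
print(summary(fit))

# Extract the p-value for the group effect from the ANOVA table.
p_value <- summary(fit)[[1]][["Pr(>F)"]][1]
print(p_value)
```

A small p-value suggests that at least one group mean differs from the others; it does not say which one.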
Introduction to R
R provides a scripting language with an odd syntax. There are also hundreds of packages and thousands of
functions to choose from, providing multiple ways to do each task.

Features of R
• R supports procedural programming with functions and object-oriented programming with
generic functions. Procedural programming involves procedures, records, modules, and
procedure calls, while object-oriented programming involves classes, objects, and
functions.
• Packages are part of R programming. They are useful for collecting sets of R functions into a
single unit.
• R programming features include database input, exporting data, viewing data, variable labels, missing
data, etc.
• R is an interpreted language, so we can access it through a command-line interpreter.
• R supports matrix arithmetic.
• R has effective data handling and storage facilities.
• R supports a large pool of operators for performing operations on arrays and matrices.
• R has facilities to print the reports for the analysis performed in the form of graphs either on-screen
or on hardcopy.

R Scripts
An R script is a series of commands that you can execute at one time, which can save a lot of time. A script
is just a plain text file with R commands in it; when you create one, a window will open in which you can
type your script. The prominent editors available for the R programming language are:

• RGui (R graphical user interface)
• RStudio – RStudio offers a richer editing environment than RGui and makes some common tasks easier
and more fun.

R Graphical User Interface (RGUI)


Once you download R, RGUI is provided as the standard graphical user interface (GUI).

RStudio
RStudio is an integrated development environment (IDE) for R language. It is a code editor and
development environment, with some nice features that make code development in R easy and fun.

a. Features of RStudio
• Code highlighting that gives different colors to keywords and variables, making it easier to read
• Code completion, so as to reduce the effort of typing the commands in full
• Easy access to R Help, with additional features for exploring functions and parameters of functions
• Easy exploration of variables and values
• RStudio is available free of charge for Linux, Windows, and Mac devices, which makes it a good option to use
with R. To open RStudio, click the RStudio icon in the menu system or on the desktop.
b. Components of RStudio
• Source – The top left corner of the screen contains a text editor that lets the user work with source
script files. Multiple lines of code can also be entered here. Users can save R script files to disk and
perform other tasks on the script.
• Console – The bottom left corner is the R console window. The console in RStudio is identical to the
console in RGUI. All the interactive work of R programming is performed in this window.
• Workspace and History – The top right corner is the R workspace and history window. This
provides an overview of the workspace, where the variables created in the session along with their
values can be inspected. This is also the area where the user can see a history of the commands
issued in R.
• The bottom right corner gives access to the following tools:

Files – This is where the user can browse folders and files on a computer.
Plots – This is where R displays the user’s plots.
Packages – This is where the user can view a list of all the installed packages.
Help – This is where you can browse the built-in Help system of R.
Scripting in R
Let’s create a script to print “Hello world!” in R. Type the following command in the script window and
run it:

print("Hello world!")

We get the output as:

[1] "Hello world!"

Working with markdown files

R Markdown allows you to create documents that serve as a neat record of your analysis.
RStudio offers a handy editor for markdown files.
Start a new markdown file by choosing File → New File → R Markdown... .
The new file begins with a header like the following:

---
title: "Basic R calculations in markdown"
author: "Remko Duursma"
date: "16 September 2015"
output: word_document
---
After the header, you’ll see two kinds of text: chunks of R code and regular text.

R code in markdown
The first thing you will see under the header in your new markdown document is a grey box:
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
You can delete the rest of the example code that came with the file, but keep this box. It contains a
chunk of R code to set the options for the markdown document and the software that processes it
(a package known as knitr). Notice that the chunk starts and ends with three backtick characters, written ``` (found to
the left of the numeric keys on QWERTY keyboards).
This tells the software where a chunk of R code starts and ends.

Viewing the Supplied Documentation


Use the help.start function to see the documentation’s table of contents:

help.start()

Getting Help
A shortcut for the help command is to simply type ? followed by the function name:
?mean

A nice feature of R is that you can request that it execute the examples in a help page, giving you a little
demonstration of the function’s capabilities. The documentation for the mean function, for instance,
contains examples, which you can run with:

example(mean)

Printing Something

If you simply enter the variable name or expression at the command prompt, R will print its value. Use the
print function for generic printing of any object. Use the cat function for producing custom formatted
output. To print a value, just enter it at the command prompt:
pi
[1] 3.141593

sqrt(2)
[1] 1.414214

The beauty of print is that it knows how to format any R value for printing, including structured values such
as matrices and lists:
print(matrix(c(1,2,3,4), 2, 2))
     [,1] [,2]
[1,]    1    3
[2,]    2    4

print(list("a","b","c"))
[[1]]
[1] "a"

[[2]]
[1] "b"

[[3]]
[1] "c"
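The cat function mentioned above has no example in this section, so here is a minimal sketch. Unlike print, cat writes its arguments as plain text without the [1] index prefix, and it does not add a newline unless you ask for one. The variable x and the labels are illustrative.

```r
x <- 3.14159

# cat() concatenates its arguments, separated by a space by default.
cat("The value of x is", x, "\n")

# Round before printing for custom formatting.
cat("Rounded to two decimals:", round(x, 2), "\n")

# The sep argument controls the separator between items.
cat("a", "b", "c", sep = "-")
cat("\n")
```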

Comments: Single line comment is written with the starting symbol '#' in the beginning of the statement
as given below:

# My first R program is Hello - World


Data Types in R
 Numeric

Decimal values are referred to as the numeric data type in R. This is the default computational data type. If
you assign a decimal value to a variable g as shown below, g will be of numeric type.

g = 62.4 # assign a decimal value to g


g # print the variable's value - g

 Integer

If you want to create an integer variable in R, you can invoke the as.integer() function to define
integer type data. You can confirm that s is an integer by applying the is.integer()
function.

s = as.integer(3)
s # print the value of s

 Complex

A complex value in R is defined using the imaginary unit i.
k = 1 + 2i # creating a complex number
k # printing the value of k

 Logical

A logical value is often created by a comparison between variables. For example:
a = 4; b = 6 # sample values
g=a>b # is a larger than b?
g # print the logical value

 Character

A character object is used to represent string values in R. You can convert objects into
character values using the as.character() function like this:
g = as.character(62.48)

g # prints the character string
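To confirm which data type a value has, the class() function can be applied to it. The sketch below revisits the five types described above; the variable names are illustrative and differ from the earlier snippets to avoid overwriting g.

```r
num <- 62.4               # numeric (double)
int <- as.integer(3)      # integer
cpx <- 1 + 2i             # complex
lgl <- 4 > 6              # logical, result of a comparison
chr <- as.character(62.48)  # character

types <- c(class(num), class(int), class(cpx), class(lgl), class(chr))
print(types)

# A plain 3 is stored as numeric; the L suffix creates an integer literal.
print(is.integer(3))
print(is.integer(3L))
```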

R Variables
• Variables are used for storing data whose value can be altered according to your needs.
• The unique name given to a variable (and also to a function or object) is called an identifier.
• Identifier names are a combination of letters, digits, period (.) and underscore (_).
• An identifier must start with a letter or a period; if it starts with a period, the period cannot be
followed by a digit.
Here are some other points to remember when naming an identifier:
• Reserved words in R (for example, TRUE) cannot be used as identifiers.
• Valid identifiers in R are:
total, Sum, .work.with, this_is_accepted, Num6
• Invalid identifiers in R:
t0t@l, 5um, _ray, TRUE, .0n3
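One way to check these naming rules programmatically is with base R's make.names(), which rewrites any string into a syntactically valid name; a string is already valid exactly when make.names() leaves it unchanged. The helper is_valid_name below is a hypothetical convenience function written for this sketch, not part of R.

```r
# Hypothetical helper: a name is valid if make.names() does not alter it.
is_valid_name <- function(x) make.names(x) == x

valid <- c("total", "Sum", ".work.with", "this_is_accepted", "Num6")
invalid <- c("t0t@l", "5um", "_ray", "TRUE", ".0n3")

print(is_valid_name(valid))
print(is_valid_name(invalid))

# See how make.names() would repair the invalid ones.
print(make.names(invalid))
```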

Assignment Operator
Use the assignment operator (<-). There is no need to declare or explicitly create variables in R. Just
assign a value to the name and R will create the variable:
x <- 3
y <- 4
z <- sqrt(x^2 + y^2)
print(z)
[1] 5

Listing Variables
The ls function displays the names of objects in your workspace:
> x <- 10
> y <- 50
> z <- c("three", "blind", "mice")
> f <- function(n,p) sqrt(p*(1-p)/n)
> ls()
[1] "f" "x" "y" "z"

Deleting Variables
The rm function removes, permanently, one or more objects from the workspace:
x <- 2*pi
x
[1] 6.283185
rm(x)
x
Error: object 'x' not found

Creating Sequences
Use an n:m expression to create the simple sequence n, n+1, n+2, ..., m:
1:5
[1] 1 2 3 4 5

Use the seq function for sequences with an increment other than 1:
seq(from=1, to=5, by=2)
[1] 1 3 5
Use the rep function to create a series of repeated values:
rep(1, times=5)
[1] 1 1 1 1 1

Getting and Setting the Working Directory

Use getwd to report the working directory, and use setwd to change it:
getwd()
[1] "/home/paul/research"
setwd("Bayes")
getwd()
[1] "/home/paul/research/Bayes"

Windows
From the main menu, select File → Change dir... .

Saving Your Workspace


Call the save.image function:
save.image()

Viewing Your Command History


Scroll backward by pressing the up arrow or Ctrl-P. Or use the history function to view your most recent
input:
history()

Accessing Built-in Datasets


The standard datasets distributed with R are already available to you, since the datasets package is in your
search path.
To access datasets in other packages, use the data function while giving the dataset name and package
name:
data(dsname, package="pkgname")

For example, you can use the built-in dataset called pressure:
> head(pressure)

You can see a table of contents for datasets by calling the data function with no arguments:
data() # Bring up a list of datasets

The MASS package, for example, includes many interesting datasets. Use the data function to access a
dataset in a specific package by using the package argument.

MASS includes a dataset called Cars93, which you can access in this way:
data(Cars93, package="MASS")

Viewing the List of Installed Packages


The library function with no arguments prints a list of installed packages. The list can be quite long:
library()
Installing Packages from CRAN
Command line
Use the install.packages function, putting the name of the package in quotes:
install.packages("packagename")

Windows
You can also download and install via Packages → Install package(s)... from the main menu.

On all platforms, you will be asked to select a CRAN mirror.


On Windows and OS X, you will also be asked to select the packages for download.

Building Blocks
The building blocks of R are what make it one of the most sought-after programming languages among
statisticians, analysts, and scientists.

R is easy to learn and is an excellent tool for developing prototype models very quickly.

Calculations
As you would expect, R provides all the arithmetic operations you would find in a scientific calculator and
much more. All kinds of comparisons such as >, >=, <, and <=, and functions such as acos, asin, atan, ceiling,
floor, min, max, mean and median are readily available.

Packages
The strength of R lies with its community of contributors from various domains. Developers bundle
their code into a single unit called a package in R.

A simple package can contain a few functions for implementing an algorithm or it can be as big as the base
package itself, which comes with the R installers. We will use many packages throughout the course.
Packages are the fundamental units of reproducible R code. They include reusable R functions, the
documentation that describes how to use them, and sample data.

The directory where packages are stored is called the library. R comes with a standard set of packages.
Others are available for download and installation. Once installed, they have to be loaded into the session
to be used.
install.packages("ggplot2") # install a package from CRAN
.libPaths() # get library location
library() # see all packages installed
search() # see packages currently loaded

R – Operators
An operator is a symbol that tells the interpreter to perform specific mathematical or logical manipulations.
R language is rich in built-in operators and provides following types of operators.

Types of Operators
We have the following types of operators in R programming:

1. Arithmetic Operators

Addition
g <- c(4, 6.5, 6)
s <- c(8, 3, 5)
print(g + s)
Subtract
g <- c(2, 5.5, 6)
s <- c(8, 3, 4)
print(g - s)
Multiply
g <- c(2, 6.5, 8)
s <- c(6, 4, 3)
print(g * s)
Divide
g <- c(2, 4.6, 8)
s <- c(8, 4, 3)
print(g / s)

2. Relational Operators

Greater Than
g <- c (2,5.5,6,9)
s <- c (8,2.5,14,9)
print (g > s)

Less than

g <- c (2, 5.6, 6,9)


s <- c(8,2.5,14,9)
print (g < s)

Equals Operator

g <- c (2,5.5,6,9)
s <- c (8,2.5,14,9)
print (g == s)

Less than or equal

g <- c (2, 5.5, 6, 9)


s <- c (8, 2.5, 14, 9)
print (g <= s)

Greater than or equal

g <- c(2,5.5,6,9)
s <- c(8, 2.5, 14, 9)
print(g>=s)

Not equal

g <- c(2, 5.4, 8, 9)
s <- c(8, 2.5, 14, 8)
print(g != s)

3. Logical Operators

Element-wise Logical AND Operator

g <- c(3, 1, TRUE, 2+3i)


s <- c(4,1,FALSE, 2+3i)
print (g & s)

Element-wise Logical OR Operator

g <- c(3,0, TRUE, 2+2i)


s <- c(4,0, FALSE, 2+3i)
print (g | s)

Logical NOT Operator

k <- c (3,0, TRUE, 2+2i)


print (!k)

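Logical AND Operator

g <- c(3,0,TRUE,2+2i)
s <- c(1,3,TRUE,2+3i)
# && compares single values; recent versions of R raise an error if either side has length > 1,
# so only the first elements are compared here.
print (g[1] && s[1])

Logical OR Operator

g <- c (0,0,TRUE,2+2i)
s <- c (0,3,TRUE,2+3i)
# || also compares single values, so only the first elements are compared here.
print (g[1] || s[1])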
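The difference between the element-wise operators (&, |) and the single-value operators (&&, ||) is easiest to see with plain logical vectors. The following is a small sketch with made-up values; it also shows that && stops evaluating as soon as the result is known.

```r
a <- c(TRUE, FALSE, TRUE)
b <- c(TRUE, TRUE, FALSE)

# Element-wise: one result per position.
elementwise_and <- a & b
elementwise_or <- a | b
print(elementwise_and)
print(elementwise_or)

# Single-value: typically used in if () conditions.
x <- 5
in_range <- x > 0 && x < 10
print(in_range)

# Short-circuit: the right-hand side is never evaluated when the left is FALSE.
safe <- FALSE && stop("never reached")
print(safe)
```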

4. Assignment Operators

There are three types of operators used for assigning values to vectors.

g1 <- c (2,1,TRUE, 2+3i)


g2 <<- c (2,1,TRUE, 2+3i)
g3 = c (2,1, TRUE, 2+3i)
print (g1)
print (g2)
print (g3)
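At the top level the three assignment forms behave the same, but inside a function <<- differs: it assigns to a variable in an enclosing environment instead of creating a local one. A minimal sketch, with hypothetical function names chosen for illustration:

```r
counter <- 0

# <- inside a function creates a local copy; the outer counter is untouched.
increment_local <- function() {
  counter <- counter + 1
  counter
}

# <<- modifies the counter found in the enclosing environment.
increment_enclosing <- function() {
  counter <<- counter + 1
  counter
}

local_result <- increment_local()
after_local <- counter          # outer value is still 0

increment_enclosing()
after_enclosing <- counter      # outer value is now 1

print(c(local_result, after_local, after_enclosing))
```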
R – Decision Making - Conditional Statements

R programming provides three different types of conditional statements that allow programmers to control
the flow of execution within source code. These are:

1. if statement

The simplest form of decision controlling statement for conditional execution is the 'if' statement. The
'if' takes a logical value (more exactly, a logical vector of length one) and carries out the next
statement only when that value is TRUE. In other words, an 'if' statement consists of a Boolean
expression followed by single or multiple statements.

if (TRUE) print ("One line executed")
## [1] "One line executed"
if (FALSE) print ("Line not executed")
## (nothing is printed)
if (NA) print ("Don't know whether true or not!")
## Error: missing value where TRUE/FALSE needed

2. if….else statement

In this type of statement, the 'if' statement is followed by an optional 'else' statement that gets
executed when the Boolean expression is false. This statement is used when you have
two alternative blocks of statements and want exactly one of them to be executed.

if (TRUE)
{
print ("This will execute...")
} else
{
print ("but this will not.")
}
## [1] "This will execute..."

3. switch statement

A switch statement permits a variable to be tested for equality against a list of case values. The
value of the expression being switched on is checked against each case. This statement is
generally used to select one of several alternatives.

The basic syntax for programming a switch based conditional statements in R is:
Syntax:

switch (test_expression, case1, case2, case3 .... caseN)


Example:
gk <- switch (
2,
"First",
"Second",
"Third",
"Fourth"
)
print (gk)
## [1] "Second"
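Besides selecting by position, switch can also match a character string against named cases, with an optional unnamed final argument acting as the default. The function describe_stat and its labels are hypothetical, used only to illustrate this form.

```r
# switch() on a string: named arguments are the cases,
# the last unnamed argument is the default.
describe_stat <- function(type) {
  switch(type,
    "mean" = "arithmetic average",
    "median" = "middle value",
    "mode" = "most frequent value",
    "unknown statistic"
  )
}

print(describe_stat("median"))
print(describe_stat("variance"))
```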

R Date Time
Splitting Date and Time

• R provides several options for dealing with date and date/time data. The built-in as.Date function
handles dates (without times).
• The contributed package chron handles dates and times, but does not control for time zones.
• The POSIXct and POSIXlt classes allow for dates and times with control for time zones.

The general rule for date/time data in R is to use the simplest technique possible.
Thus, for date-only data, as.Date will usually be the best choice.
If you need to handle dates and times, without timezone information, the chron package is a good choice;
the POSIX classes are especially useful when timezone manipulation is important.

Example:
Humidity <- c(37.79, 42.34, 52.16, 44.57, 43.83, 44.59)
Rain <- c(0.971360441, 1.10969716, 1.064475853, 0.953183435, 0.98878849, 0.939676146)
Time <- c("27/01/2015 15:44", "23/02/2015 23:24", "31/03/2015 19:15", "20/01/2015 20:52",
          "23/02/2015 07:46", "31/01/2015 01:55")

weather <- data.frame(Humidity, Rain, Time)

Humidity Rain Time


1 37.79 0.9713604 27/01/2015 15:44
2 42.34 1.1096972 23/02/2015 23:24
3 52.16 1.0644759 31/03/2015 19:15
4 44.57 0.9531834 20/01/2015 20:52
5 43.83 0.9887885 23/02/2015 07:46
6 44.59 0.9396761 31/01/2015 01:55

Hours <- format(as.POSIXct(strptime(weather$Time,"%d/%m/%Y %H:%M",tz="")) ,format = "%H:%M")


#output
"15:44" "23:24" "19:15" "20:52" "07:46" "01:55"

Dates <- format(as.POSIXct(strptime(weather$Time,"%d/%m/%Y %H:%M",tz="")) ,format = "%d/%m/%Y")

#output
"27/01/2015" "23/02/2015" "31/03/2015" "20/01/2015" "23/02/2015" "31/01/2015"

weather$Dates <- Dates
weather$Hours <- Hours
weather # Print
#output

  Humidity      Rain             Time      Dates Hours
1    37.79 0.9713604 27/01/2015 15:44 27/01/2015 15:44
2    42.34 1.1096972 23/02/2015 23:24 23/02/2015 23:24
3    52.16 1.0644759 31/03/2015 19:15 31/03/2015 19:15
4    44.57 0.9531834 20/01/2015 20:52 20/01/2015 20:52
5    43.83 0.9887885 23/02/2015 07:46 23/02/2015 07:46
6    44.59 0.9396761 31/01/2015 01:55 31/01/2015 01:55

You can now drop the Time variable by doing:

weather <- subset(weather, select = c(1,2,4,5))

weather #Print
Humidity Rain Dates Hours
1 37.79 0.9713604 27/01/2015 15:44
2 42.34 1.1096972 23/02/2015 23:24
3 52.16 1.0644759 31/03/2015 19:15
4 44.57 0.9531834 20/01/2015 20:52
5 43.83 0.9887885 23/02/2015 07:46
6 44.59 0.9396761 31/01/2015 01:55
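Note that format() returns character strings, so the Dates column above sorts alphabetically rather than chronologically. A minimal sketch of keeping a real Date column instead is given here; the variable names and the use of tz = "UTC" are assumptions made for reproducibility.

```r
# Same layout as the Time column above; the two values are illustrative.
time_chr <- c("27/01/2015 15:44", "23/02/2015 23:24")

# Parse once into POSIXct; tz = "UTC" is an assumption to keep results reproducible.
stamp <- as.POSIXct(time_chr, format = "%d/%m/%Y %H:%M", tz = "UTC")

dates_chr  <- format(stamp, "%d/%m/%Y")   # character, as in the example above
dates_date <- as.Date(stamp)              # Date class, supports sorting and arithmetic

print(class(dates_chr))
print(class(dates_date))
print(diff(dates_date))                   # number of days between the two rows
```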

Extracting Month, Week and Day from Date

 The as.Date function allows a variety of input formats through the format= argument.
 The default format is a four digit year, followed by a month, then a day, separated by either dashes
or slashes.

The following examples show dates which as.Date will accept by default:
as.Date('1915-6-16')
[1] "1915-06-16"
as.Date('1990/02/17')
[1] "1990-02-17"

Code    Value
%d      Day of the month (decimal number)
%m      Month (decimal number)
%b      Month (abbreviated)
%B      Month (full name)
%y      Year (2 digit)
%Y      Year (4 digit)

If your input dates are not in the standard format, a format string can be composed
using the elements shown in the table above. The following examples show some ways that this can be used:

as.Date('1/15/2001',format='%m/%d/%Y')
[1] "2001-01-15"

as.Date('April 26, 2001',format='%B %d, %Y')
[1] "2001-04-26"

as.Date('22JUN01',format='%d%b%y') # %y is system-specific; use with caution
[1] "2001-06-22"

To extract a component such as the month from a date-time object, an easy way is to work with
the POSIXlt class, because this type of data is stored internally as a named list,
which enables you to extract elements by name. A Date or POSIXct value can be converted with as.POSIXlt.
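The same format codes can be used in the other direction, to pull the month, week or day out of a Date. The following is a small sketch using only base R; the variable names are chosen for illustration, and %U (week of the year, with weeks starting on Sunday) is one of several week conventions.

```r
d <- as.Date("1/15/2001", format = "%m/%d/%Y")

month_num <- as.integer(format(d, "%m"))  # month as a number
day_num   <- as.integer(format(d, "%d"))  # day of the month
week_num  <- as.integer(format(d, "%U"))  # week of the year, weeks starting on Sunday

print(c(month = month_num, day = day_num, week = week_num))
```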

The lubridate package provides another way to parse dates. Start with a character vector of dates:

install.packages("lubridate")
library(lubridate)
x <- c("06/23/2013", "06/30/2013", "07/12/2014")

At this point, R treats the vector x as characters.
To force R to interpret these as dates, use lubridate's mdy function.
mdy converts date strings whose elements are ordered as month, day and year.

x.date <- mdy(x)
class(x.date)
[1] "Date"
If you want to extract the day of the week from a date vector, use the wday function. By default, 1 is
Sunday and 7 is Saturday.

wday(x.date)
[1] 1 1 7

If you want the day of the week displayed as its three letter designation, add the label=TRUE parameter.

wday(x.date, label=TRUE)

If you need to specify the time zone, add the parameter tz=. For example, to specify Eastern Standard
Time, type:

x.date <- mdy(x, tz="EST")
x.date

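Sys.time() returns the current date and time as a POSIXct object, which is stored internally as a number:

t <- Sys.time()
typeof(t)
[1] "double"
t
[1] "2014-01-23 14:28:21 EST"

c <- as.POSIXct(t)
typeof(c)
[1] "double"
print(c)
[1] "2014-01-23 14:28:21 EST"

a <- as.POSIXlt(t)
a
a$mon + 1 # month number; the mon element counts from 0 (January)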
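The components of a POSIXlt object have lower-case names, and some of them are offset. A short sketch with a fixed timestamp follows; the timestamp is the one from the sample output above, and tz = "UTC" is an assumption so that the result does not depend on the local time zone.

```r
# A fixed timestamp; tz = "UTC" is an assumption to make the result reproducible.
stamp <- as.POSIXlt("2014-01-23 14:28:21", tz = "UTC")

print(stamp$mday)         # day of the month
print(stamp$mon + 1)      # mon counts from 0, so add 1 for the calendar month
print(stamp$year + 1900)  # year is stored as years since 1900
print(stamp$hour)         # hour of the day
print(stamp$wday)         # day of the week, 0 = Sunday
```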

R – Function
A function is a collection of statements grouped together to carry out a specific
task. R provides a large number of built-in functions, and users can also create their own user-defined
functions (UDFs).
In R, a function is treated as an object, so the R interpreter can pass control to the function, along
with any arguments the function needs to do its work. The function in turn performs its task and
returns control to the interpreter, together with any result, which may be stored in other
objects.
The keyword 'function' is used to create a function in R. The basic structure of a function is:

function_name <- function (argu_1, argu_2, .... argu_N)
{
#Function body
}

A function in R has several parts, each with its own role:

 Function Name: the name by which the function is called from other parts of the
program. The function is stored as an object under this name.
 Arguments: placeholders for the values the function works on. When the function is invoked, you pass
a value to each argument. Arguments are optional; a function may
take no arguments. Arguments can also have default values.
 Function Body: a set of statements that defines what the function does.
 Return Value: the value of the last expression evaluated in the function body, which is what the
function returns.

Built in Functions
Built-in functions are functions that are already defined in R itself or in its packages and libraries.
These predefined functions make the programmer's task easier.
Some common examples of built-in functions are:

seq(), max(), mean(), sum(), paste() etc.

They can be called directly in your programs.

Example:

print (seq (12,30))

This creates a sequence of numbers from 12 to 30 using the predefined function seq().

print (mean (4:26))

This calculates the mean of all the numbers from 4 to 26.

User Defined Function in R


Sometimes we need to write our own function because we have to accomplish a particular task and
no ready-made function exists. A user-defined function involves a name, arguments and a body.

function.name <- function(arguments)
{
computations on the arguments
some other code
}

One of the great strengths of R is the user's ability to add functions. In fact, many of the functions in R are
actually functions of functions.

The set.seed() function sets the starting point (seed) of R's pseudorandom number generator, so that
code which draws random numbers produces the same sequence each time it is run with the same seed.
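The effect of set.seed() can be checked directly by drawing random numbers twice from the same seed. This is a minimal sketch; the seed value 42 and the variable names are arbitrary choices for illustration.

```r
set.seed(42)            # 42 is an arbitrary seed chosen for illustration
first <- runif(3)       # three uniform random numbers

set.seed(42)            # resetting the same seed restarts the sequence
second <- runif(3)

print(identical(first, second))
```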

UDF Function Components

The different parts of a function are −

Function Name − This is the actual name of the function. It is stored in R environment as an object with
this name.
Arguments − An argument is a placeholder. When a function is invoked, you pass a value to the argument.
Arguments are optional; that is, a function may contain no arguments. Also arguments can have default
values.
Function Body − The function body contains a collection of statements that defines what the function
does.
Return Value − The return value of a function is the last expression in the function body to be evaluated.

# Create a function to print squares of numbers in sequence.
new.function <- function(a) {
  for(i in 1:a) {
    b <- i^2
    print(b)
  }
}

Calling a Function
# Call the function new.function supplying 6 as an argument.
new.function(6)

Calling a function with default argument


# Create a function with arguments.
new.function <- function(a = 3, b = 6) {
result <- a * b
print(result)
}
# Call the function without giving any argument.
new.function()
# Call the function with giving new values of the argument.
new.function(9,5)
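The functions above print their result; more often a function returns a value that the caller stores. The sketch below illustrates the return value and named arguments described earlier. The function rect_area is hypothetical and used only as an example.

```r
# Hypothetical function: area of a rectangle, with a default height.
rect_area <- function(width, height = 1) {
  width * height   # the last evaluated expression is the return value
}

area1 <- rect_area(4)                      # uses the default height = 1
area2 <- rect_area(height = 3, width = 4)  # named arguments can come in any order
print(c(area1, area2))
```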

Data Structures in R

VECTORS
Vectors are the most basic R data objects and there are six types of atomic vectors. They are logical,
integer, double, complex, character and raw.

Single Element Vector

# Atomic vector of type character.
print("abc")
# Atomic vector of type double.
print(12.5)
# Atomic vector of type integer.
print(63L)
# Atomic vector of type logical.
print(TRUE)
# Atomic vector of type complex.
print(2+3i)

Multiple Elements Vector

Using colon operator with numeric data


# Creating a sequence from 5 to 13.
v <- 5:13
print(v)

# Creating a sequence from 6.6 to 12.6.


v <- 6.6:12.6
print(v)

# Using the seq() function
print(seq(5, 9, by=0.4))

# Using the c() function
# The logical and numeric values are converted to characters.
s <- c('apple','red',5,TRUE)
print(s)

# Elements of a Vector are accessed using indexing. The [ ] brackets are used for indexing.

# Accessing vector elements using position.


t <- c("Sun","Mon","Tue","Wed","Thurs","Fri","Sat")
u <- t[c(2,3,6)]
print(u)

# Accessing vector elements using logical indexing.


v <- t[c(TRUE,FALSE,FALSE,FALSE,FALSE,TRUE,FALSE)]
print(v)

# Accessing vector elements using negative indexing.


x <- t[c(-2,-5)]
print(x)
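Besides fixed positions, a logical condition or element names can serve as the index. A brief sketch follows; the vector temps and its values are made up for illustration.

```r
temps <- c(Mon = 18.5, Tue = 21.0, Wed = 25.3, Thu = 19.8)  # illustrative values

warm     <- temps[temps > 20]        # a logical condition used as the index
warm_pos <- which(temps > 20)        # positions of the matching elements
by_name  <- temps[c("Mon", "Thu")]   # elements selected by name

print(warm)
print(warm_pos)
print(by_name)
```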

Vector Manipulation
# Vector Arithmetic
# Two vectors of same length can be added, subtracted, multiplied or divided giving the
# result as a vector output.

# Create two vectors.


v1 <- c(3,8,4,5,0,11)
v2 <- c(4,11,0,8,1,2)

# Vector addition.
add.result <- v1+v2
print(add.result)

# Vector subtraction.
sub.result <- v1-v2
print(sub.result)

# Vector multiplication.
multi.result <- v1*v2
print(multi.result)

# Vector division. Dividing by zero gives Inf rather than an error.
divi.result <- v1/v2
print(divi.result)
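When the two vectors differ in length, R reuses (recycles) the shorter one. The sketch below shows this behavior with made-up vectors; R gives a warning if the longer length is not a multiple of the shorter one.

```r
v_long  <- c(3, 8, 4, 5, 0, 11)
v_short <- c(10, 100)            # length 2 divides length 6 evenly

recycled <- v_long + v_short     # v_short is reused as 10, 100, 10, 100, 10, 100
print(recycled)
```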

# Elements in a vector can be sorted using the sort() function.


v <- c(3,8,4,5,0,11, -9, 304)

# Sort the elements of the vector.


sort.result <- sort(v)
print(sort.result)

# Sort the elements in the reverse order.


revsort.result <- sort(v, decreasing = TRUE)
print(revsort.result)

# Sorting character vectors.


v <- c("Red","Blue","yellow","violet")
sort.result <- sort(v)
print(sort.result)

# Sorting character vectors in reverse order.


revsort.result <- sort(v, decreasing = TRUE)
print(revsort.result)

# Numeric vector
x <- c(1,2,3,4,5,6)
print(x)

# To calculate mean for a vector, you can use mean function.


mean(x)

y <- c(1,2,NA,4,5,6)
# If the vector contains NA, the mean function returns NA.
mean(y)

# To calculate mean for a vector excluding NA values, you can include na.rm = TRUE parameter in mean function.
mean(y, na.rm = TRUE)

# You can use subscripts to refer elements of a vector.


x <- c(1,2,3,4,5,6)
sum(x[c(3,5)]) # Sum of the 3rd and 5th elements of the vector

# Character vector
State <- c("DL", "MU", "NY", "DL", "NY", "MU")

# To calculate frequency for State vector, you can use table function.
table(State)

FACTORS

Factors are the data objects which are used to categorize the data and store it as levels. They can be
created from both strings and integers. They are useful for columns which have a limited number of
unique values, such as "Male" and "Female" or TRUE and FALSE. They are useful in data analysis for
statistical modeling.
Factors are created using the factor() function by taking a vector as input.
# Create a vector as input.
data <- c("East","West","East","North","North","East","West","West","West","East","North")
print(data)
print(is.factor(data))

# Apply the factor function.


factor_data <- factor(data)

print(factor_data)
print(is.factor(factor_data))

Changing the Order of Levels in Factor


The order of the levels in a factor can be changed by applying the factor function again with new order of
the levels.
data <- c("East","West","East","North","North","East","West", "West","West","East","North")
# Create the factors
factor_data <- factor(data)
print(factor_data)
# Apply the factor function with required order of the level.
new_order_data <- factor(factor_data,levels = c("East","West","North"))
print(new_order_data)
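Internally a factor stores integer codes that point into its levels, which is why the order of the levels matters. A small sketch with an illustrative vector named region:

```r
region <- factor(c("East", "West", "East", "North"),
                 levels = c("East", "West", "North"))

print(levels(region))       # the level labels, in the order given
print(nlevels(region))      # number of distinct levels
print(as.integer(region))   # integer codes: position of each value in levels
print(table(region))        # counts per level
```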

ARRAYS

Arrays are the R data objects which can store data in more than two dimensions. For example, if we
create an array of dimension (2, 3, 4) then it creates 4 rectangular matrices each with 2 rows and 3
columns. Arrays can store only one data type.
An array is created using the array() function. It takes vectors as input and uses the values in
the dim parameter to create an array.

The following example creates an array of two matrices, each with 3 rows and 3 columns.
# Create two vectors of different lengths.
vector1 <- c(5,9,3)
vector2 <- c(10,11,12,13,14,15)
# Take these vectors as input to the array.
result <- array(c(vector1,vector2),dim=c(3,3,2))
print(result)

Naming Columns and Rows


We can give names to the rows, columns and matrices in the array by using the dimnames parameter.
# Create two vectors of different lengths.
vector1 <- c(5,9,3)
vector2 <- c(10,11,12,13,14,15)
column.names <- c("COL1","COL2","COL3")
row.names <- c("ROW1","ROW2","ROW3")
matrix.names <- c("Matrix1","Matrix2")

# Take these vectors as input to the array.
# The dimnames list is given in the order rows, columns, matrices.
result <- array(c(vector1,vector2),dim=c(3,3,2),dimnames =
list(row.names,column.names,matrix.names))
print(result)

Accessing Array Elements


# Create two vectors of different lengths.
vector1 <- c(5,9,3)
vector2 <- c(10,11,12,13,14,15)
column.names <- c("COL1","COL2","COL3")
row.names <- c("ROW1","ROW2","ROW3")
matrix.names <- c("Matrix1","Matrix2")

# Take these vectors as input to the array.


result <- array(c(vector1,vector2),dim = c(3,3,2),dimnames = list(row.names, column.names,
matrix.names))

# Print the third row of the second matrix of the array.


print(result[3,,2])

# Print the element in the 1st row and 3rd column of the 1st matrix.
print(result[1,3,1])

# Print the 2nd Matrix.


print(result[,,2])
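Once dimnames are set, the same elements can also be reached by name, which is less error-prone than counting positions. A minimal sketch that rebuilds the same array in one call:

```r
result <- array(c(5, 9, 3, 10:15), dim = c(3, 3, 2),
                dimnames = list(c("ROW1", "ROW2", "ROW3"),
                                c("COL1", "COL2", "COL3"),
                                c("Matrix1", "Matrix2")))

print(dim(result))                        # size of each dimension
print(result["ROW3", , "Matrix2"])        # third row of the second matrix, by name
print(result["ROW1", "COL3", "Matrix1"])  # a single element, by name
```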

R - MATRICES

Matrices are the R objects in which the elements are arranged in a two-dimensional rectangular layout.
They contain elements of the same atomic types. Though we can create a matrix containing only
characters or only logical values, they are not of much use. We use matrices containing numeric elements
to be used in mathematical calculations.
A Matrix is created using the matrix() function.

Syntax
The basic syntax for creating a matrix in R is −

matrix(data, nrow, ncol, byrow, dimnames)

Following is the description of the parameters used −

 data is the input vector which becomes the data elements of the matrix.
 nrow is the number of rows to be created.
 ncol is the number of columns to be created.
 byrow is a logical flag. If TRUE then the input vector elements are arranged by row.
 dimnames is the names assigned to the rows and columns.

Create a matrix taking a vector of numbers as input.


# Elements are arranged sequentially by row.
M <- matrix(c(3:14), nrow = 4, byrow = TRUE)
print(M)

# Elements are arranged sequentially by column.


N <- matrix(c(3:14), nrow = 4, byrow = FALSE)
print(N)

# Define the column and row names.


rownames = c("row1", "row2", "row3", "row4")
colnames = c("col1", "col2", "col3")

P <- matrix(c(3:14), nrow = 4, byrow = TRUE, dimnames = list(rownames, colnames))


print(P)

Accessing Elements of a Matrix


Elements of a matrix can be accessed by using the column and row index of the element. We consider the
matrix P above to find the specific elements below.
# Define the column and row names.
rownames = c("row1", "row2", "row3", "row4")
colnames = c("col1", "col2", "col3")

# Create the matrix.


P <- matrix(c(3:14), nrow = 4, byrow = TRUE, dimnames = list(rownames, colnames))

# Access the element at 3rd column and 1st row.


print(P[1,3])

# Access the element at 2nd column and 4th row.


print(P[4,2])

# Access only the 2nd row.


print(P[2,])

# Access only the 3rd column.


print(P[,3])
Matrix Addition & Subtraction
# Create two 2x3 matrices.
matrix1 <- matrix(c(3, 9, -1, 4, 2, 6), nrow = 2)
print(matrix1)

matrix2 <- matrix(c(5, 2, 0, 9, 3, 4), nrow = 2)


print(matrix2)

# Add the matrices.


result <- matrix1 + matrix2
cat("Result of addition","\n")
print(result)

# Subtract the matrices


result <- matrix1 - matrix2
cat("Result of subtraction","\n")
print(result)

R - LISTS

Lists are the R objects which contain elements of different types like − numbers, strings, vectors and
another list inside it. A list can also contain a matrix or a function as its elements. List is created
using list() function.

Creating a List
Following is an example to create a list containing strings, numbers, vectors and a logical value.
# Create a list containing strings, numbers, vectors and a logical value.
list_data <- list("Red", "Green", c(21,32,11), TRUE, 51.23, 119.1)
print(list_data)
Naming List Elements
The list elements can be given names and they can be accessed using these names.
# Create a list containing a vector, a matrix and a list.
list_data <- list(c("Jan","Feb","Mar"), matrix(c(3,9,5,1,-2,8), nrow = 2),
list("green",12.3))

# Give names to the elements in the list.


names(list_data) <- c("1st Quarter", "A_Matrix", "A Inner list")

# Show the list.


print(list_data)

Accessing List Elements


Elements of the list can be accessed by the index of the element in the list using [ ]. In case of named lists
it can also be accessed using the names.
We continue to use the list in the above example −
# Create a list containing a vector, a matrix and a list.
list_data <- list(c("Jan","Feb","Mar"), matrix(c(3,9,5,1,-2,8), nrow = 2),
list("green",12.3))

# Give names to the elements in the list.


names(list_data) <- c("1st Quarter", "A_Matrix", "A Inner list")

# Access the first element of the list.


print(list_data[1])

# Access the third element. As it is also a list, all its elements will be printed.
print(list_data[3])

# Access the list element using the name of the element.


print(list_data$A_Matrix)
Manipulating List Elements
We can add, delete and update list elements as shown below. The example adds and deletes an element at the
end of the list, but elements at any position can be added, deleted or updated.
# Create a list containing a vector, a matrix and a list.
list_data <- list(c("Jan","Feb","Mar"), matrix(c(3,9,5,1,-2,8), nrow = 2),
list("green",12.3))

# Give names to the elements in the list.


names(list_data) <- c("1st Quarter", "A_Matrix", "A Inner list")

# Add element at the end of the list.


list_data[4] <- "New element"
print(list_data[4])

# Remove the last element.


list_data[4] <- NULL

# Print the 4th Element.


print(list_data[4])

# Update the 3rd Element.


list_data[3] <- "updated element"
print(list_data[3])

Merging Lists
You can merge many lists into one list by placing all the lists inside one c() function.
# Create two lists.
list1 <- list(1,2,3)
list2 <- list("Sun","Mon","Tue")

# Merge the two lists.


merged.list <- c(list1,list2)

# Print the merged list.


print(merged.list)

Converting List to Vector


A list can be converted to a vector so that the elements of the vector can be used for further
manipulation. All the arithmetic operations on vectors can be applied after the list is converted into
vectors. To do this conversion, we use the unlist() function. It takes the list as input and produces a vector.
# Create lists.
list1 <- list(1:5)
print(list1)
list2 <-list(10:14)
print(list2)
# Convert the lists to vectors.
v1 <- unlist(list1)
v2 <- unlist(list2)
print(v1)
print(v2)
# Now add the vectors
result <- v1+v2
print(result)

DATA FRAMES

A data frame is a table or a two-dimensional array-like structure in which each column contains values of
one variable and each row contains one set of values from each column.
Following are the characteristics of a data frame.

 The column names should be non-empty.


 The row names should be unique.
 The data stored in a data frame can be of numeric, factor or character type.
 Each column should contain same number of data items.

Create Data Frame


# Create the data frame.
emp.data <- data.frame(
emp_id = c (1:5),
emp_name = c("Rick","Dan","Michelle","Ryan","Gary"),
salary = c(623.3,515.2,611.0,729.0,843.25),

start_date = as.Date(c("2012-01-01", "2013-09-23", "2014-11-15", "2014-05-11",


"2015-03-27")),
stringsAsFactors = FALSE
)
# Print the data frame.
print(emp.data)

Get the Structure of the Data Frame


The structure of the data frame can be seen by using str() function.
# Create the data frame.
emp.data <- data.frame(
emp_id = c (1:5),
emp_name = c("Rick","Dan","Michelle","Ryan","Gary"),
salary = c(623.3,515.2,611.0,729.0,843.25),

start_date = as.Date(c("2012-01-01", "2013-09-23", "2014-11-15", "2014-05-11",


"2015-03-27")),
stringsAsFactors = FALSE
)
# Get the structure of the data frame.
str(emp.data)

Summary of Data in Data Frame


The statistical summary and nature of the data can be obtained by applying summary() function.
# Create the data frame.
emp.data <- data.frame(
emp_id = c (1:5),
emp_name = c("Rick","Dan","Michelle","Ryan","Gary"),
salary = c(623.3,515.2,611.0,729.0,843.25),

start_date = as.Date(c("2012-01-01", "2013-09-23", "2014-11-15", "2014-05-11",


"2015-03-27")),
stringsAsFactors = FALSE
)
# Print the summary.
print(summary(emp.data))

Extract Data from Data Frame


Extract specific column from a data frame using column name.
# Create the data frame.
emp.data <- data.frame(
emp_id = c (1:5),
emp_name = c("Rick","Dan","Michelle","Ryan","Gary"),
salary = c(623.3,515.2,611.0,729.0,843.25),

start_date = as.Date(c("2012-01-01","2013-09-23","2014-11-15","2014-05-11",
"2015-03-27")),
stringsAsFactors = FALSE
)
# Extract Specific columns.
result <- data.frame(emp.data$emp_name,emp.data$salary)
print(result)

Extract the first two rows and then all columns


# Create the data frame.
emp.data <- data.frame(
emp_id = c (1:5),
emp_name = c("Rick","Dan","Michelle","Ryan","Gary"),
salary = c(623.3,515.2,611.0,729.0,843.25),

start_date = as.Date(c("2012-01-01", "2013-09-23", "2014-11-15", "2014-05-11",


"2015-03-27")),
stringsAsFactors = FALSE
)
# Extract first two rows.
result <- emp.data[1:2,]
print(result)
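Rows are often selected by a condition rather than by position. The sketch below uses a cut-down copy of the employee data; the name emp and the threshold of 620 are chosen only for illustration.

```r
emp <- data.frame(
  emp_name = c("Rick", "Dan", "Michelle", "Ryan", "Gary"),
  salary   = c(623.3, 515.2, 611.0, 729.0, 843.25),
  stringsAsFactors = FALSE
)

# Rows where salary exceeds 620, keeping two columns.
high <- emp[emp$salary > 620, c("emp_name", "salary")]
print(high)

# subset() expresses the same selection with column names used directly.
high2 <- subset(emp, salary > 620, select = c(emp_name, salary))
print(high2)
```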
Expand Data Frame
A data frame can be expanded by adding columns and rows.
Add Column
Just add the column vector using a new column name.
# Create the data frame.

emp.data <- data.frame(


emp_id = c (1:5),
emp_name = c("Rick","Dan","Michelle","Ryan","Gary"),
salary = c(623.3,515.2,611.0,729.0,843.25),

start_date = as.Date(c("2012-01-01", "2013-09-23", "2014-11-15", "2014-05-11",


"2015-03-27")),
stringsAsFactors = FALSE
)

# Add the "dept" column.


emp.data$dept <- c("IT","Operations","IT","HR","Finance")
v <- emp.data
print(v)

Add Row
To add more rows permanently to an existing data frame, we need to bring in the new rows in the same
structure as the existing data frame and use the rbind() function.
In the example below we create a data frame with new rows and merge it with the existing data frame to
create the final data frame.
# Create the first data frame.
emp.data <- data.frame(
emp_id = c (1:5),
emp_name = c("Rick","Dan","Michelle","Ryan","Gary"),
salary = c(623.3,515.2,611.0,729.0,843.25),

start_date = as.Date(c("2012-01-01", "2013-09-23", "2014-11-15", "2014-05-11",


"2015-03-27")),
dept = c("IT","Operations","IT","HR","Finance"),
stringsAsFactors = FALSE
)

# Create the second data frame


emp.newdata <- data.frame(
emp_id = c (6:8),
emp_name = c("Rasmi","Pranab","Tusar"),
salary = c(578.0,722.5,632.8),
start_date = as.Date(c("2013-05-21","2013-07-30","2014-06-17")),
dept = c("IT","Operations","Finance"),
stringsAsFactors = FALSE
)

# Bind the two data frames.


emp.finaldata <- rbind(emp.data,emp.newdata)
print(emp.finaldata)

Sub-setting Data
R has many powerful subset operators, and mastering them will allow you to easily perform complex
operations on any kind of dataset and to manipulate data very succinctly.

Vectors
x <- c(5.4, 6.2, 7.1, 4.8)
x[1]

# skip the first element


x[-1]

Lists
Subsetting a list works in exactly the same way as subsetting a vector. Subsetting a list with [ will always
return a list; [[ and $ extract the elements themselves.

x <- as.list(1:10)
x[1:5]

To extract individual elements inside a list, use the [[ operator


# to get element 5
x[[5]]
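The difference between [ and [[ is easiest to see by checking what each one returns. A short sketch with a made-up named list:

```r
person <- list(name = "Ana", scores = c(90, 85))  # illustrative values

sub_list <- person["scores"]     # [ keeps the list wrapper
element  <- person[["scores"]]   # [[ returns the element itself
same     <- person$scores        # $ is shorthand for [[ with a name

print(class(sub_list))
print(class(element))
print(mean(element))             # arithmetic works on the extracted vector
```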

Matrices
A matrix is subset with two arguments within single brackets, [], and separated by a comma. The first
argument specifies the rows, and the second the columns.

a <- matrix(1:9, nrow = 3)


colnames(a) <- c("A", "B", "C")
a[1:2, ]

Data Frames
df <- data.frame(x = 1:3, y = 3:1, z = letters[1:3])

df["x"]
df[, "x"]
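The two forms above do not return the same kind of object: df["x"] gives a one-column data frame, while df[, "x"] is simplified to a vector. A sketch that makes this visible follows; the variable names are for illustration.

```r
df <- data.frame(x = 1:3, y = 3:1, z = letters[1:3])

as_frame  <- df["x"]                   # one-column data frame
as_vector <- df[, "x"]                 # plain vector: the single column is simplified
kept      <- df[, "x", drop = FALSE]   # drop = FALSE keeps the data frame shape

print(class(as_frame))
print(class(as_vector))
print(class(kept))
```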

Required packages
Load the tidyverse packages, which include dplyr:
install.packages("tidyverse")
library(tidyverse)

We’ll use the R built-in iris data set, which we start by converting into a tibble data frame (tbl_df) for easier
data analysis.
my_data <- as_tibble(iris)
my_data

Extract rows based on logical criteria - One Column - Extract rows where Sepal.Length > 7:
my_data %>% filter(Sepal.Length > 7)
Multiple-column based criteria: Extract rows where Sepal.Length > 6.7 and Sepal.Width ≤ 3:
my_data %>% filter(Sepal.Length > 6.7, Sepal.Width <= 3)

Remove missing values

We start by creating a data frame with missing values. In R, NA (Not Available) is used to represent missing
values. The tibble() function replaces the deprecated data_frame():
# Create a data frame with missing data
friends_data <- tibble(
name = c("A", "B", "C", "D"),
age = c(27, 25, 29, 26),
height = c(180, NA, NA, 169),
married = c("yes", "yes", "no", "no") )
# Print friends_data
friends_data

Extract rows where height is NA:
friends_data %>% filter(is.na(height))

Exclude (drop) rows where height is NA:


friends_data %>% filter(!is.na(height))
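The same filtering can be done without dplyr. The following is a sketch of a base R equivalent, assuming a plain data frame with the same name and height columns; the object names are illustrative.

```r
friends <- data.frame(
  name   = c("A", "B", "C", "D"),
  height = c(180, NA, NA, 169),
  stringsAsFactors = FALSE
)

missing_rows  <- friends[is.na(friends$height), ]    # rows where height is NA
complete_rows <- friends[!is.na(friends$height), ]   # rows where height is present
also_complete <- na.omit(friends)                    # drops rows with an NA in any column

print(missing_rows)
print(complete_rows)
```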

Functions and Apply Family


The apply family belongs to the R base package, and is populated with functions to manipulate slices of
data from matrices, arrays, lists and data frames in a repetitive way. These functions allow traversing the
data in a number of ways and avoid explicit use of loop constructs.

The apply functions form the basis of more complex combinations and help to perform operations with
very few lines of code. The family comprises: apply, lapply, sapply, vapply, mapply, rapply, and tapply.

The apply functions can be fed with many functions to perform repeated application on a collection of
objects (data frame, list, vector, etc.). Their purpose is primarily to avoid explicit uses of loop
constructs. They take an input list, matrix or array and apply a function to it. Any function can be
passed in.

apply() function
We use apply() over a matrix. This function takes three main arguments:

apply(X, MARGIN, FUN)

Here:
-X: an array or matrix
-MARGIN: a value of 1, 2 or c(1,2) that defines where to apply the function:
-MARGIN=1: the manipulation is performed on rows
-MARGIN=2: the manipulation is performed on columns
-MARGIN=c(1,2): the manipulation is performed on rows and columns
-FUN: tells which function to apply. Built-in functions like mean, median, sum, min, max and even
user-defined functions can be applied.

The simplest example is to sum a matrix over all the columns. The code apply(m1, 2, sum) will apply the
sum function to the 5x6 matrix m1 and return the sum of each column.

m1 <- matrix(1:10, nrow=5, ncol=6)
m1
a_m1 <- apply(m1, 2, sum)
a_m1
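apply() works the same way over rows, and FUN can be a function written on the spot. A small sketch with an illustrative 2x3 matrix:

```r
m <- matrix(1:6, nrow = 2)   # rows are (1, 3, 5) and (2, 4, 6)

row_sums  <- apply(m, 1, sum)                                # MARGIN = 1: one result per row
col_range <- apply(m, 2, function(col) max(col) - min(col))  # user-defined function per column

print(row_sums)
print(col_range)
```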

lapply() function

lapply(X, FUN)
Arguments:
-X: A vector or an object
-FUN: Function applied to each element of X

The l in lapply() stands for list. The difference between lapply() and apply() lies in the output returned:
the output of lapply() is a list. lapply() can be used on other objects such as data frames and lists.

lapply() function does not need MARGIN.

A very easy example is to change the strings in a vector to lower case with the tolower function. We
construct a vector with the names of famous movies. The names are in upper case.

movies <- c("SPYDERMAN","BATMAN","VERTIGO","CHINATOWN")


movies_lower <-lapply(movies, tolower)
str(movies_lower)

sapply() function

sapply() function does the same job as lapply() function but simplifies the result to a vector or matrix
where possible.

sapply(X, FUN)

Arguments:
-X: A vector or an object
-FUN: Function applied to each element of X

We can measure the minimum speed and stopping distances of cars from the cars dataset.

dt <- cars
lmn_cars <- lapply(dt, min)
smn_cars <- sapply(dt, min)
lmn_cars
smn_cars
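The family also includes vapply(), which works like sapply() but checks each result against a declared template. The sketch below compares the three on a made-up list; the names are for illustration only.

```r
values <- list(a = 1:3, b = 4:6)   # illustrative input

as_list   <- lapply(values, mean)              # always a list
as_vector <- sapply(values, mean)              # simplified to a named vector here
checked   <- vapply(values, mean, numeric(1))  # each result must be one number

print(as_list)
print(as_vector)
print(checked)
```

vapply() is a reasonable default inside functions, since it stops with an error if a result does not match the declared type and length.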

tapply() function

The function tapply() computes a measure (mean, median, min, max, etc.) or a function for each level of a
factor variable in a vector.

tapply(X, INDEX, FUN = NULL)

Arguments:
-X: An object, usually a vector
-INDEX: A factor, or a list of factors, defining the groups
-FUN: Function applied to each group of X

We can compute the median of the sepal width for each species. tapply() is a quick way to perform this
computation.

data(iris)
tapply(iris$Sepal.Width, iris$Species, median)
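The grouping is easier to follow on a tiny input where the result can be checked by hand. A sketch with made-up sales figures:

```r
sales  <- c(10, 20, 30, 40, 50)        # illustrative values
region <- c("N", "S", "N", "S", "N")   # grouping variable, one label per value

totals <- tapply(sales, region, sum)   # one sum per region
print(totals)
```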

Machine Learning Process Flow

1. Installing the R platform.
2. Loading the dataset.
3. Summarizing the dataset.
4. Visualizing the dataset.
5. Evaluating some algorithms.
6. Making some predictions.
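Steps 2 to 6 can be sketched in a few lines of base R. This is only an illustration under stated assumptions: it uses the built-in iris data, a simple linear regression as the single candidate algorithm, an arbitrary 100/50 train/test split, and root mean squared error as the evaluation measure. None of these choices comes from the original material.

```r
data(iris)                                    # 2. load a built-in dataset
print(summary(iris$Petal.Width))              # 3. summarize
# plot(iris$Petal.Length, iris$Petal.Width)   # 4. visualize (left commented out here)

set.seed(1)                                   # arbitrary seed for a repeatable split
train_idx <- sample(nrow(iris), 100)          # 100 rows for training, 50 held out
train <- iris[train_idx, ]
test  <- iris[-train_idx, ]

fit  <- lm(Petal.Width ~ Petal.Length, data = train)   # 5. fit one candidate model
pred <- predict(fit, newdata = test)                   # 6. predict on unseen rows
rmse <- sqrt(mean((test$Petal.Width - pred)^2))        # error on the held-out rows
print(rmse)
```

In practice step 5 would compare several algorithms on the same split and keep the one with the lowest error.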
