
An Introduction to R

********************
********************
Price<-c(100,200) << Create an object Price
Price << View the object created
length(Price) << Gives the total number of elements in an object

Price<-c(100,200,NULL) << NULL added


Price
length(Price) << Length is the same as before; NULL doesn't
add an element

Adding a missing value:


**********************
Price<-c(100,200,NA)
Price
length(Price)

Checking for missing value:


*****************************
is.na(Price) << checks for NA in Price; result will be
FALSE, FALSE, TRUE
#which element
which(is.na(Price)) << will give the index of elements in
Price which are NA
which(Price == 100) << gives index of element which is 100
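The NA checks above can be run end to end as a small sketch (same toy Price vector as before):

```r
# Same small Price vector as above, with one missing value
Price <- c(100, 200, NA)

is.na(Price)              # FALSE FALSE TRUE
which(is.na(Price))       # 3: the index of the missing element
which(Price == 100)       # 1: the index of the element equal to 100
sum(Price, na.rm = TRUE)  # 300: na.rm drops the NA before summing
```

Note that most arithmetic functions return NA if any element is NA, so na.rm = TRUE is the usual companion to these checks.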

Character object:
****************
Names<-c("John","Robert","NA","Catherine") << Create a character object (note: "NA" in
quotes is the literal string, not a
missing value; use NA for missing)
length(Names) << Gives the total number of
elements in Names

To view the type of the object


class(Names) << Gives the type of the object; this returns
"character"
class(Price) << this returns "numeric"

Setting the working directory:


*****************************
setwd("D:\\Data Manipulation with R\\") << Set the Working directory
getwd() << Get the working directory

Sequence:
*********
Sequence<-seq(1970,2000) << Assigns to Sequence all the integers
from 1970 to 2000
Sequence_By<-seq(from=1,to=5,by=0.5) << Sequence of numbers from 1 to 5 in
steps of 0.5
Sequence_By << result will be:
1.0, 1.5, 2.0, ..., 4.5, 5.0

Repeat:
******
rep(1,5) << Repeat 1 for 5 times, result will be 1,1,1,1,1
rep(1:5,2) << Repeat 1 to 5 for 2 times, result will be
1,2,3,4,5,1,2,3,4,5
rep(1:5,each=2) << Repeat 1 to 5 each for 2 times
#Both numeric and character data
mixed<-c(1,2,3,"hi")
mixed << Result will be "1","2","3","hi"
class(mixed) << This returns "character", as the whole object is coerced
to character

#Take values from the user


**************************
x<-scan()

Vectors:
*******
>The simplest data structure in R.
>If the data has only one dimension, like a set of digits, a vector can be used to
represent it.

Matrices:
********
>Used when the data is a two-dimensional array.
>But contains only data of a single class, e.g. only character or only numeric.

Data Frames:
***********
>It is like a single table with rows and columns of data.
>Its columns (lists) can hold different data types.

Lists:
*****
>Used when data cannot be represented by data frames.
>Can contain all kinds of other objects, including other lists or data frames.
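The four structures above can be contrasted in a minimal sketch (object names and values are illustrative):

```r
v  <- c(10, 20, 30)                    # vector: one dimension, one class
m  <- matrix(1:6, nrow = 2)            # matrix: two dimensions, still one class
df <- data.frame(id = 1:2,
                 name = c("John", "Robert"))  # data frame: columns of different classes
l  <- list(prices = v, people = df)    # list: can hold any mix of objects

class(v)        # "numeric"
dim(m)          # 2 3
class(df)       # "data.frame"
class(l$people) # "data.frame": a list keeps its elements' types intact
```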

#Saving an object
save(Names,file="Names.rda")
#Saving the entire workspace
save.image("all_work.RData")
#How do i load my object back?
load("Names.rda")

names(iris) or colnames(iris) >>> to get the list of variables in any dataset

dim(iris) >> to get the number of rows and columns


nrow(iris) >> to get the number of rows
ncol(iris) >> to get the number of columns

# To look at top few or bottom few rows of a dataframe


head(iris)
tail(iris)

#Looking at the structure of the datasets


str(iris)

#if we want to see more about the dataset


?iris
#checking the type of an object
class(iris)

#To traverse through a dataset we refer to a column as, for example,


iris$Sepal.Length. One way to avoid this prefix is the attach option:
attach(iris) >>> attaches iris so that we can refer to each column of the
dataset without prefixing the dataset name before the column
name
detach(iris) >>> detaches iris after the attach above.

#importing a CSV file


iris<-read.csv("Iris.csv",header=T,sep=',')

#importing a text file


iris<-read.table("Iris.txt",header=T,sep="\t")

#summary of a dataset
The summary of a dataset gives us a summary of the dataset across each of its
columns
summary(iris)

#we can check if a variable is character, numeric or factor


# is.character() and is.factor()
is.character(iris$Species)
is.numeric(iris$Petal.Length)

# It is good to know the following information before starting work on any
dataset:
1. Presence of a header line.
2. Kind of value separator.
3. Representation of missing values.
4. Notation of comment characters or quotes.
5. Existence of any unfilled or blank lines.
6. Classes of the variables.
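Each point on that checklist maps onto a `read.csv`/`read.table` argument. As a sketch (the inline data here is made up; `text=` lets us parse a string instead of a file, which is handy for testing):

```r
d <- read.csv(text = "name,score\nJohn,10\nRobert,NA\nCatherine,",
              header = TRUE,            # 1. header line present
              sep = ",",                # 2. value separator
              na.strings = c("", "NA"), # 3. how missing values are written
              comment.char = "#",       # 4. comment character
              blank.lines.skip = TRUE,  # 5. skip unfilled or blank lines
              stringsAsFactors = FALSE) # 6. keep character columns as character
d
sum(is.na(d$score))  # 2: both "NA" and the empty field become NA
```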

# Importing and Exporting of Data in R

*Working with plain text files*

Data from plain text files:


1. scan function:
Reads data directly from the console.
Reads data from files.
Returns list or vector R objects.

2. read.table function:
Reads a file in table format and creates a data frame from it.

Working with large text files:


******************************
fread function >>> from the data.table package

read.table.ffdf and read.csv.ffdf functions (ff package)
read.csv.sql (sqldf package)
write.table
write.table.ffdf
System setup: Windows 8 64-bit with 4 GB RAM.

Package      Function          Runtime     Good with big files?
base R       read.table        ~6.37 sec   No
data.table   fread             ~1 sec      Yes
sqldf        read.csv.sql      ~11 sec     Yes
ff           read.table.ffdf   ~8 sec      Yes
base R       read.table        ~4 sec      Yes
(optimized)

Working with Excel files:


**********************
library(XLConnect)

#Loading Excel Workbook


wb<-loadWorkbook("customers.xlsx") >> first load the workbook
newyork<-readWorksheet(wb,"newyork",header=T) >> then read the worksheet

Data Exploration
****************
****************
Read a file and specify the type of NA in that file:
cr<-read.csv("Credit.csv",na.strings=c("",NA))

To avoid exponential forms in x and y axis:


*******************************************
options(scipen=999)
To get the column names:
***********************
names(cr)

To get the summary of each column in a dataframe:


************************************************
summary(<dataset name>)

To get the index of missing values and to remove it:


***************************************************
index<-which(is.na(cr$Good_Bad))
cr<-cr[-index,]

Look at individual summary:


**************************
summary(cr$Good_Bad) # to get the summary of particular column

Percentile breakup:
******************
quantile(cr$RevolvingUtilizationOfUnsecuredLines,p=c(1:100)/100)

#After discussion with the client, cap this variable at 2 and drop rows above the limit

cr%>%filter(RevolvingUtilizationOfUnsecuredLines<=2)%>%nrow()

cr%>%filter(RevolvingUtilizationOfUnsecuredLines<=2)->cr

To replace '0' with the missing value:


*************************************
#We find after discussions that '0' here means a missing value

cr$MonthlyIncome<-ifelse(cr$MonthlyIncome==0,NA,cr$MonthlyIncome)

#We use the mutate function to add a new column to a dataframe


contribution%>%mutate(Contribution=FY04Giving+FY03Giving+FY02Giving+FY01Giving+FY00Giving)->contribution

Group the data


**************
contribution%>%group_by(Gender)%>%summarise(Count=n(),Percentage_Count=n()/1230,Total_Contribution=sum(Contributions),Percentage_Contribution=Total_Contribution/1205454,Average=mean(Contributions))%>%ungroup()%>%arrange(-Total_Contribution)%>%data.frame()%>%head(10)

For a continuous variable we use the ntile function to divide it into deciles,
i.e. to convert a continuous variable into a categorical variable:
*******************************************************************
cr%>%mutate(quantile=ntile(MonthlyIncome,10))%>%group_by(Good_Bad,quantile)%>%summarize(N=n())%>%filter(Good_Bad=="Bad")->dat

Divide the data into Test and training samples randomly :


*******************************************************
##Partitioning data##
set.seed(100) >>> makes the sample() draw below reproducible
# here we take the row indices 1 to n and select 70% of them, with the
# replace parameter set to FALSE
# the sample function gives you the indices of the selected rows.
indexP<-sample(1:nrow(cr),0.70*nrow(cr),replace = F)
# the train_cr dataframe will have all the rows selected in the previous step.
train_cr<-cr[indexP,]
#test_cr dataframe will have all the remaining rows.
test_cr<-cr[-indexP,]

# to know the rows and columns of any dataframe we do:


dim(train_cr) >>> this will give us the number of rows and columns present in
this dataframe

The function below can also be used to get a sample/subset of the dataset:
**************************************************************************
library(caret)
indexPC<-createDataPartition(y=cr$Good_Bad,times = 1,p=0.70,list=F)
train_crC<-cr[indexPC,]
test_crC<-cr[-indexPC,]

In Excel there is a function to get Binomial Distribution:


*********************************************************
>BINOM.DIST

Cumulative probability TRUE in the BINOM.DIST formula gives the result for
success <= x (i.e. the sum of the probabilities from success = 0 up to success = x).
Cumulative probability FALSE in the BINOM.DIST formula gives the result for
success = x only.
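R has direct counterparts to Excel's BINOM.DIST (the numbers below are illustrative): `dbinom()` is the cumulative = FALSE case, `pbinom()` the cumulative = TRUE case.

```r
dbinom(3, size = 10, prob = 0.5)  # P(exactly 3 successes in 10 trials)
pbinom(3, size = 10, prob = 0.5)  # P(at most 3 successes in 10 trials)

# pbinom is the running sum of dbinom, exactly as described above
sum(dbinom(0:3, size = 10, prob = 0.5))  # equals pbinom(3, 10, 0.5)
```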

Hypergeometric Distribution:
***************************
If a selection has been made without replacing the items back into the
population, then the probability changes with each draw and the Binomial
distribution no longer applies. In this case we use the Hypergeometric
Distribution.
>Excel formula for the Hypergeometric Distribution:
hypgeom.dist()
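The R counterpart is `dhyper(x, m, n, k)`: the probability of x successes in k draws without replacement, from a population containing m successes and n failures (the numbers below are made up for illustration).

```r
dhyper(2, m = 5, n = 15, k = 4)  # P(exactly 2 of the 5 marked items in 4 draws)
phyper(2, m = 5, n = 15, k = 4)  # cumulative version: P(at most 2)
```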

Negative Binomial:
*****************
Used to find out the number of trials needed to get X successes.

Excel formula for this is =NEGBINOM.DIST()

Example of Negative Binomial :

What is the probability that the 30th purchase in my store will happen with the
100th customer, when the probability of purchase for any customer is 20%?
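The store example can be computed in R with `dnbinom(x, size, prob)`, which gives the probability of x failures before the size-th success; "the 30th purchase happens with the 100th customer" means 70 non-buyers before the 30th buyer.

```r
dnbinom(70, size = 30, prob = 0.2)

# Equivalent direct formula: the 100th trial is a success,
# and 29 of the first 99 trials are successes
choose(99, 29) * 0.2^30 * 0.8^70
```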

Geometric Distribution:
**********************
Used when we are interested in the probability of the first success in the rth
trial.

Example: Suppose there is a defect rate of 2% for some mechanical component
being produced. What is the probability that a QC inspector will need to review at
most 20 items before finding a defect?

The same NEGBINOM.DIST formula (with the number of successes set to 1) is used
for the Geometric Distribution, with Cumulative = TRUE for "at most" questions.
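The QC example above can be checked in R with `pgeom(q, prob)`, the probability of at most q failures before the first success; reviewing at most 20 items means at most 19 non-defective items before the first defect.

```r
pgeom(19, prob = 0.02)
1 - 0.98^20  # same value by direct reasoning: 1 - P(no defect in 20 items)
```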
Data Manipulation:
*****************
To Get the data used in the first row and third column we can use:
******************************************************************
oj[1,3]
oj[c(1,2,8,456),c(1,3,6)] << to check rows 1,2,8,456 corresponding to
the columns 1,3,6
#Selecting only those rows where brand bought is tropicana:
dat<-oj[oj$brand=='tropicana',]
We can also perform OR/AND operations while selecting rows and columns:
dat1<-oj[oj$brand=='tropicana'|oj$brand=='dominicks',]
head(dat1)
Difference between logical vectors vs. the which statement
#consider a vector sales with missing values
sales<-c(100,200,NA,300,400,NA,500,600,700,NA,1000,1500,NA,NA)
#subset data using a logical operator
sales[sales>600]
[1] NA NA 700 NA 1000 1500 NA NA <<< as you can see, NA is also included in the
results of the logical query above.
#subset data using which
>sales[which(sales>600)]
[1] 700 1000 1500
#Selecting Columns:
dat4<-oj[,c("week","brand")]
head(dat4)
#Adding new columns:
*******************
oj$logInc<-log(oj$INCOME) << this new column will have the value of Log of income

order() returns the element order that results in a sorted vector:


>students<-c("John","Tim","Alice","Zeus")
>students
[1] "John" "Tim" "Alice" "Zeus"
>order(students)
[1] 3 1 2 4
>students[order(students)]
[1] "Alice" "John" "Tim" "Zeus"

Ordering of numbers:
*******************
numbers<-c(10,100,5,8)
order(numbers) >>> returns the indices of the numbers ordered in ascending
order
order(-numbers) >>> returns the indices of the numbers ordered in
descending order.

GroupWise operations:
********************
aggregate(oj$price,by=list(oj$brand),mean) <<< group price by brand and take the
mean of each group

We can use the tapply function to perform the same task:


*******************************************************
tapply(oj$price,oj$brand,mean)

#Cross tabulations
#Units of different brands sold based on if feature advertisement was run or not
table(oj$brand,oj$feat)
xtabs can also be used for cross-tabulations:
xtabs(oj$INCOME~oj$brand+oj$feat) <<< sum of the incomes by brand and by whether
the feature advertisement was run (xtabs sums
the left-hand variable)

dplyr
*****
1. Works only with Data frames.

To get the rows corresponding to the brand name tropicana.


dat8<-filter(oj,brand=="tropicana")
To get the rows corresponding to the brand names tropicana and dominicks
dat9<-filter(oj,brand=="tropicana"|brand=="dominicks")

#Selecting columns
Suppose we have to select Columns brand, Income and feat, we can do that using
following command:
dat10<-select(oj,brand,INCOME,feat)

#we can drop the columns using the -sign before the columns name:
dat1<-select(oj,-brand,-INCOME,-feat)

#Creating a new column


dat12<-mutate(oj,logname=log(INCOME))

#Arranging data
dat13<-arrange(oj,INCOME) << Arrange the oj dataset in ascending order of
Income

#Arranging data in descending order


dat14<-arrange(oj,desc(INCOME))
or
dat14<-arrange(oj,-INCOME)

#Summarizing data
#group wise summaries
*********************
gr_brand<-group_by(oj,brand)
summarize(gr_brand,mean(INCOME),sd(INCOME))
#Find the mean price for all the people whose income is >=10.5.

#Base R code
mean(oj[oj$INCOME>=10.5,"price"])
#dplyr code
summarize(filter(oj,INCOME>=10.5),mean(price))

Pipe operator:
*************
oj%>%filter(INCOME>=10.5)%>%summarize(mean(price))

Subset the data based on price>=2.5, create a column logIncome, compute the
mean,standard deviation and median of column logIncome
oj%>%filter(price>=2.5)%>%mutate(logIncome=log(INCOME))%>%summarize(mean(logIncome),sd(logIncome),median(logIncome))

To Convert character string into date:


*************************************
fd$FlightDate<-as.Date(fd$FlightDate,"%d-%b-%y") <<< this command converts dates
in DD-MMM-YY string format to Date format
(printed as e.g. "2018-10-01")

Code | Value
%d | Day of month(decimal number)
%m | Month(decimal number)
%b | Month(abbreviated)
%B | Month(full name)
%y | Year(2 digits)
%Y | Year(4 digits)

25/Aug/04: "%d/%b/%y"
25-August-2004: "%d-%B-%Y"
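The two format strings above can be tried directly with as.Date (the month-name codes %b and %B assume an English locale):

```r
as.Date("25/Aug/04", "%d/%b/%y")       # parses to the Date 2004-08-25
as.Date("25-August-2004", "%d-%B-%Y")  # same date, different input format

# format() goes the other way: from a Date back to a string
format(as.Date("2004-08-25"), "%d/%m/%Y")  # "25/08/2004"
```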

The months function will get you the month of a Date value:
months(fd$FlightDate) >>> gives the month of each date
unique(months(fd$FlightDate)) >>> gives the unique months present in the
date column

#Finding time Interval


fd$FlightDate[60]-fd$FlightDate[900]

#the difftime function can be used to get the time interval in weeks, days or
hours; it takes two dates
difftime(fd$FlightDate[3000],fd$FlightDate[90])

difftime(fd$FlightDate[3000],fd$FlightDate[90],units="weeks")
difftime(fd$FlightDate[3000],fd$FlightDate[90],units="days")
difftime(fd$FlightDate[3000],fd$FlightDate[90],units="hours")

Sub-setting data: All rows when day is Sunday


*********************************************
nrow() function will get you the number of rows in a particular dataset.
library(dplyr)
fd_s<-fd%>%filter(weekdays(FlightDate)=="Sunday") >>> All rows when day is
Sunday

fd_s1<-fd%>%filter(weekdays(FlightDate)=="Sunday" & city=="Atlanta")%>%nrow() >>>
finds the number of flights on Sundays for destination Atlanta

#Whenever data has time information along with the date, R uses the POSIXct and
POSIXlt classes to deal with dates
date1<-Sys.time()
date1
[1] "2015-03-02 17:35:47 IST"
class(date1)
[1] "POSIXct" "POSIXt"
to use the weekdays() and months() functions, the date passed needs to be in
POSIXct/POSIXt format
weekdays(date1)
[1] "Monday"
months(date1)
[1] "March"

lubridate is a package that provides a wrapper around the POSIXct class


fd$FlightDate<-ymd(fd$FlightDate)
[1] "2014-01-01","2014-01-01".....

Function Date
dmy() 26/11/2008
ymd() 2008/11/26
mdy() 11/26/2008
dmy_hm() 26/11/2008 20:15
dmy_hms() 26/11/2008 20:15:30

Joining Dataframes:
******************
Inner join: Joining two tables based on a key column,such that rows matching in
both tables are selected.

df1:                      df2:
  CustomerId Product        CustomerId State
1 1          Toaster      1 2          Alabama
2 2          Toaster      2 4          Alabama
3 3          Toaster      3 6          Ohio
4 4          Radio
5 5          Radio
6 6          Radio

>merge(x=df1,y=df2,by="CustomerId") #Inner Join/Intersection of both tables


  CustomerId Product State
1 2          Toaster Alabama
2 4          Radio   Alabama
3 6          Radio   Ohio

#Full outer join: rows from both tables are kept, whether or not they match:

df1:                      df2:
  CustomerId Product        CustomerId State
1 1          Toaster      1 2          Alabama
2 2          Toaster      2 4          Alabama
3 3          Toaster      3 6          Ohio
4 4          Radio
5 5          Radio
6 6          Radio

>merge(x = df1,y = df2, by = "CustomerId",all = TRUE) #Outer Join:


CustomerID Product State
1 1 Toaster <NA>
2 2 Toaster Alabama
3 3 Toaster <NA>
4 4 Radio Alabama
5 5 Radio <NA>
6 6 Radio Ohio

#Left Outer Join: All the rows of left table are retained while matching rows of
right table are displayed.
df1:                      df2:
  CustomerID Product        CustomerID State
1 1          Toaster      1 2          Alabama
2 2          Toaster      2 4          Alabama
3 3          Toaster      3 6          Ohio
4 4          Radio
5 5          Radio
6 6          Radio

>merge(x = df1, y = df2, by = "CustomerID",all.x=TRUE) # Left Join


CustomerID Product State
1 1 Toaster <NA>
2 2 Toaster Alabama
3 3 Toaster <NA>
4 4 Radio Alabama
5 5 Radio <NA>
6 6 Radio Ohio

Right Outer Join : All the rows of right table are retained while matching rows of
left table are displayed
>merge(x = df1, y = df2, by = "CustomerID",all.y=TRUE) # Right Join
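The two toy tables above can be rebuilt so that all four merge() calls run end to end:

```r
df1 <- data.frame(CustomerId = 1:6,
                  Product = c("Toaster", "Toaster", "Toaster",
                              "Radio", "Radio", "Radio"))
df2 <- data.frame(CustomerId = c(2, 4, 6),
                  State = c("Alabama", "Alabama", "Ohio"))

merge(df1, df2, by = "CustomerId")                # inner join: 3 matching rows
merge(df1, df2, by = "CustomerId", all = TRUE)    # full outer: 6 rows, <NA> States
merge(df1, df2, by = "CustomerId", all.x = TRUE)  # left join: all of df1 kept
merge(df1, df2, by = "CustomerId", all.y = TRUE)  # right join: all of df2 kept
```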

Finding Missing values:


**********************
is.na() helps find missing values.
>a<-c(1,2,NA,9)
>is.na(a)
[1] FALSE FALSE TRUE FALSE
>sum(is.na(a)) >> gives the total number of missing values
[1] 1
Another option to get the sum of missing values is :
summary() command
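For a whole data frame, `colSums(is.na(...))` gives the missing-value count per column in one call (the toy data below just mimics the airquality columns used next):

```r
air <- data.frame(Ozone = c(41, NA, 12, NA),
                  Solar.R = c(190, 118, NA, 313))
colSums(is.na(air))  # Ozone: 2, Solar.R: 1
```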

#Imputing Missing values


air$Ozone[is.na(air$Ozone)]<-45

#Imputing the mean of the column into the missing values:


air$Solar.R[is.na(air$Solar.R)]<-mean(air$Solar.R,na.rm=TRUE) >> here na.rm=TRUE
removes the NAs before the mean is calculated
summary(air)

The reshape2 package:
********************
It helps in converting data from wide to long format and long to wide format.

Data in Wide format:


Persons Age Weight
Sankar  26  70
Aiyar   24  60
Singh   25  65

Data in Long Format:

Persons Variable Value


******* ******** ******
Sankar Age 26
Sankar Weight 70
Aiyar Age 24
Aiyar Weight 60
Singh Age 25
Singh Weight 65

We can convert the data from one format to another using the melt and dcast
functions from the reshape2 library.

library(reshape2)
person<-c("Sankar","Aiyar","Singh")
age<-c(26,24,25)
weight<-c(70,60,65)
wide<-data.frame(person,age,weight)
the result of the above command will be :
>wide
person age weight
1 Sankar 26 70
2 Aiyar 24 60
3 Singh 25 65

melted<-melt(wide,id.vars="person",value.name="Demo_value")
  person variable Demo_value
1 Sankar age      26
2 Aiyar  age      24
3 Singh  age      25
4 Sankar weight   70
5 Aiyar  weight   60
6 Singh  weight   65
>dcast(melted,person~variable,value.var = "Demo_value")
  person age weight
1 Aiyar  24  60
2 Sankar 26  70
3 Singh  25  65

Working with Strings:


*********************
a<-"Batman"
substr(a,start=2,stop=6)
[1] "atman"

nchar(a) >>> Number of characters in a string


tolower(a) >>>> to convert a string into lowercase
toupper(a) >>> to convert a string into uppercase

b<-"Bat-Man"
strsplit(b,split="-") >>> [[1]] "Bat" "Man"

c<-"Bat/Man"
strsplit(c,split="/") >>> string split by specifying "/" as the splitting
character

paste(b,c) >>> concatenates the two strings one after the other

#sometimes we want to know which strings contain some pattern, so we use the grep
command:
c(b,c)
grep("-",c(b,c)) >>> this gives the indices of the elements containing "-"

Sometimes we want to find whether some pattern exists in a string or not:
>c(b,c)
"Bat-Man" "Bat/Man"
>grepl("/",c(c,b))
TRUE FALSE

#sometimes we want to substitute one pattern with another


for eg:
>b
"Bat-Man"
>sub("-","/",b)
"Bat/Man"

#Using SQL queries inside R


***************************
#Selecting columns from a dataframe :
>oj_s<-sqldf("select brand, income, feat from oj ")
#subsetting using the where statement:
oj_s<-sqldf("select brand, income, feat from oj where price<3.8 and income<10")

#order by statement
oj_s<-sqldf("Select store,brand,week,logmove,feat,price,income from oj order by income asc")

#Base Plotting

#Using plot() to study two continuous variables

ir<-iris

#To understand the relationship between two continuous variables we use a
scatter plot
plot(x=ir$Petal.Width,y=ir$Petal.Length) >> whatever we want on the x axis we
assign to x; whatever we want on the
y axis we assign to y
#Adding x labels, y labels and a title

plot(x=ir$Petal.Width,y=ir$Petal.Length,main=c("Petal Width Vs Petal Length"),xlab=c("Petal width"),ylab=c("Petal Length"))

#Adding colors
plot(x=ir$Petal.Width,y=ir$Petal.Length,main=c("Petal Width Vs Petal
Length"),xlab=c("Petal Width"),ylab=c("Petal Length"),col="red")

#Adding different plotting symbol


plot(x=ir$Petal.Width,y=ir$Petal.Length,main=c("Petal Width Vs Petal
Length"),xlab=c("Petal Width"),ylab=c("Petal Length"),col="red",pch=2)

#seeing relationship across different species

plot(x=ir$Petal.Width,y=ir$Petal.Length,main=c("Petal Width Vs Petal Length"),xlab=c("Petal Width"),ylab=c("Petal Length"),col=ir$Species)

#Adding a legend
plot(x=ir$Petal.Width,y=ir$Petal.Length,main=c("Petal Width Vs Petal
Length"),xlab=c("Petal Width"),ylab=c("Petal Length"),pch=as.numeric(ir$Species))

legend(0.2,7,c("Setosa","Versicolor","Virginica"),pch=1:3)

Studying univariate data:

#Box Plot:

boxplot(ir$Petal.Length)

Anatomy of a box plot, from top to bottom:

OUTLIER         point more than 3/2 times the interquartile range above the upper quartile
MAXIMUM         greatest value, excluding outliers
UPPER QUARTILE  25% of data is greater than this value
MEDIAN          50% of data is greater than this value; the middle of the dataset
LOWER QUARTILE  25% of data is less than this value
MINIMUM         least value, excluding outliers
OUTLIER         point more than 3/2 times the interquartile range below the lower quartile

Histograms:
**********

hist(ir$Sepal.Width,col="orange")

The labels=TRUE parameter is added to get the counts across the various bins.

Plotting more than one plot in single window:


********************************************
par()
mfrow()

par(mfrow=c(1,2)) >>>>>>>> one row and two columns


plot(x=ir$Species,y=ir$Sepal.Width,xlab="Species",main="Sepal Width across
species",col="red")

GGPLOT2:
*******

Based on the grammar of graphics: simple syntax; interfaces with ggmap and other
packages.

Grammar of Graphics:
*******************
A plot composed of : Aesthetic Mapping, Geoms, Statistical Transformation,
Coordinate Systems and Scales.

Aesthetic Mapping : What components of the data appear on the X axis and Y axis; how
the color, size, fill and position of elements relate to the data.

Geom (Geometrical Objects) : What geometrical objects appear on the plot: points,
lines, polygons, areas, boxplots, rectangles, tiles etc.

Statistical Transformations : Compute density, counts (a histogram needs to bin and
count the data).

Scales and Coordinate Systems:
Discrete or continuous scales; Cartesian or spherical coordinates.

p<-ggplot(ch,aes(x=temp,y=dewpoint,colour=season))

Downloading google maps:


***********************
map<-get_map("bangalore",maptype="hybrid")

ggmap(map)+geom_point(data=sh,aes(x=long,y=lat),colour="red") >>> geom_point is the
function to put points on the map.

Process of geospatial visualisation:


***********************************
1. Download the map: ggmap()
2. Get the long-lat data: from a text file, or from a geospatial (shape) file via rgdal
3. Overlay the data on the map: ggplot2


Extracting long-lat data from shape files using the rgdal package

Many times the data and the locational information are not in the same file.
Most geospatial data is stored in shape files.
Shapefile = Data + Location data

How to extract long-lat data from the SpatialPointsDataFrame:
************************************************************
shape2<-readOGR(dsn="Subway","DoITT_SUBWAY_ENTRACE_01_13SEPT2010") >>> Subway is
the folder name and "DoITT..." is the file name

To convert coordinates into long lat form:


shape2<-spTransform(shape2,CRS("+init=epsg:4326"))
fortify() is used to extract the location data.

How to use ggplot:


*****************
library(ggplot2)
library(dplyr)
#Technology
data%>%filter(Industry=="Technology")->data1
p<-ggplot(data1,aes(x=Company.Advertising,y=Brand.Revenue,size=Brand.Value,color=Brand))
q<-p+geom_point()
q+xlab("Company Advertising in Billions of $")+ylab("Brand Revenue in Billions of $")+
  scale_size(range=c(2,4),breaks=c(30,60,100),name="Brand Value $ (Billions)")+
  geom_text(aes(label=Brand),hjust=0.5,vjust=1)+guides(color=FALSE)+theme_light()+
  theme(legend.key=element_rect(fill="light blue",color="black"))

data%>%filter(Industry=="Luxury")->data2
p<-ggplot(data2,aes(x=Company.Advertising,y=Brand.Revenue,size=Brand.Value,color=Brand))
q<-p+geom_point()
q+xlab("Company Advertising in Billions of $")+ylab("Brand Revenue in Billions of $")+
  scale_size(range=c(2,4),breaks=c(10,28.1),name="Brand Value $ (Billions)")+
  geom_text(aes(label=Brand),hjust=0.7,vjust=1.7)+guides(color=FALSE)+theme_light()+
  theme(legend.key=element_rect(fill="light blue",color="black"))+
  scale_x_continuous(breaks=seq(0,6,0.1))
