Sunteți pe pagina 1din 34

R Compendium

Data Structures in R
Vector Data frame
An ordered collection of data of the same type It is a table with rows and columns; data within each column has
> a = c(1,2,3) the same type (e.g. number, text, logical), but different columns
> a*2 may have different types.
[1] 2 4 6
Example:
Matrix >a
A rectangular table of data of the same type Policy_no premium State
12345 100 IL
List
25486 400 NY
An ordered collection of data of arbitrary types.
63254 350 FL
> doe = list(name="john",age=28,married=F)
> doe$name
[1] "john“
> doe$age
[1] 28

Linear Rectangular
All Same Type Vector Matrix
Mixed List Data frame
Lapply, Sapply & Apply
Lapply Sapply

•When the same or similar tasks need to be performed multiple times for sapply( li, fct )
all elements of a list or for all columns of an array. Like apply, but tries to simplify the result, by converting it
•May be easier and faster than “for” loops into a vector or array of appropriate size
•lapply(li, function )
•To each element of the list li, the function function is applied. > li = list("klaus","martin","georg")
•The result is a list whose elements are the individual function > sapply(li, toupper)
results. [1] "KLAUS" "MARTIN" "GEORG"
> li = list("klaus","martin”)
> lapply(li, toupper) > fct = function(x) { return(c(x, x*x)) }
> [[1]] > sapply(1:5, fct)
> [1] "KLAUS" [,1] [,2] [,3] [,4] [,5]
> [[2]] [1,] 1 2 3 4 5
> [1] "MARTIN" [2,] 1 4 9 16 25

Apply
apply( arr, margin, fct )
Apply the function fct along some dimensions of the array arr, according to margin, and return a vector or array of the appropriate size.
>x
[,1] [,2] [,3] > apply(x, 1, sum)
[1,] 5 7 0 [1] 12 24 17 14
[2,] 7 9 8 > apply(x, 2, sum)
[3,] 4 6 7
[4,] 6 3 5
Regular expressions

Functions for detecting patterns Functions for replacing patterns

• grep(pattern, string, value = TRUE) • sub(pattern, replacement, string) - replace first match
• grepl(pattern , string) • gsub(pattern, replacement, string) - replace all
matches
Importing & Exporting data
Understanding data table
DT[i , j , by ]

Which rows ? What to do? Grouped by what?

- vector of row numbers - summarizing - one or more columns


- logical vector - updating variable(s) - on the fly grouping var(s)
- another data table - adding variable(s)

• Order of execution

DT[i , j , by ] DT[ 1 , 3 , 2 ]
Filtering rows & selecting columns data table
Operation Syntax
Subsetting rows by numbers # select first to tenth row
dt[1:10 , ]

Using column names to select rows # selecting rows with subln_grp as "PL"
based on a condition dt[ subln_grp == "PL" , ]

# Selecting rows which don't have NA in premium field


dt[ !is.na(premium) , ]

# Select rows with state as Florida and premium greater than zero
dt[ state == "FL" & premium > 0 , ]

Select columns # Select multiple columns


dt[ , .(policy_no , premium , state)]

Note - .() is an alias to list(). If .() is used, the returned value is a data.table. If .() is
not used, the result is a vector.
Subsetting rows and selecting # Select only rows where segment is "retail" and relevant columns
columns together dt[ segment == "retail" , .(policy_no , premium , state) ]
Summarizing data table
• Count and aggregation
Operation Syntax
Count # Counting number of policies with PL as their subln_grp
dt[ subln_grp == "PL" , .N ]

Count distinct # Count unique number of segments


dt[ , uniqueN(segment)]
uniqueN( dt , by = "segment")

# Count unique number of segments for Florida


dt[ state == "FL" , uniqueN(segment) ]
Simple aggregation # Taking average premium and count of policies
dt[ , .(count = .N , average = mean(premium , na.rm = TRUE)) ]

Aggregation including filtering # Taking average premium and count of policies for which deductible is greater
than $100
dt[ deductible > 100 , .(count = .N , average = mean(premium , na.rm = TRUE)) ]
Summarizing data table
• Group by
Operation Syntax
Simple group by # Taking average premium and count of policies by segment
dt[ , .(count = .N , average = mean(premium , na.rm = TRUE)) , by = segment ]

# Taking average premium and count of policies for which deductible is greater
than $100 by segment
dt[ deductible > 100 , .(count = .N , average = mean(premium , na.rm = TRUE)) ,
by = segment ]
OR
dt[ , .(count = .N , average = mean(premium , na.rm = TRUE)) ,
by = .(segment , ded = deductible > 100 ) ]

Special symbol - .SD & .SDcols

SD = Subset of Data .SDcols


- a data.table by itself - Specifies the columns of data table that are in .SD
- holds data of current group as defined in by - Allows to apply a function to all rows , but only to selected
columns
- when no by, .SD applies to whole data.table
- Comes in handy if there are too many columns and a
- allows for calculations on multiple columns
particular operation is to be performed on a subset of columns
Summarizing data table
Operation Syntax
.SD & .SDcols # Computing average premium using .SD
dt[ , lapply(.SD , function (x) mean( x , na.rm = TRUE) ) , by = segment , .SDcols
= "premium" ]

OR

dt[ , lapply(.SD , function (x) mean( x , na.rm = TRUE) ) , by = segment , .SDcols


= grep( "prem” , names(dt)) ]
Update, add, delete variables in data table
Special operator: :=
 Updates a data table in place (by reference)
 Can be used to:
• update existing columns
• add new columns
• delete columns
 No need to use <-

Operation Syntax
Updating variables # Doubling the premium amount
dt[ , premium := premium *2]

# Doubling both premium and deductible amount


dt[ , `:=` (premium = premium*2 ,
deductible = deductible * 2) ]
Updating variables by group # Modifying premium by segment
dt[ , premium := premium * uniqueN(deductible) / .N , by = segment ]
Update, add, delete variables in data table
Operation Syntax
Adding variables # Adding new premium variable with premium doubled
dt[ , new_premium := premium *2]

Adding variables by group # Computing state premium


dt[ , state_premium := sum(premium , na.rm = TRUE) , by = state ]

# Computing state premium and state_deductible


dt[ , `:=` (state_premium = sum(premium , na.rm = TRUE) ,
state_deductible = max(deductible) ) , by = state ]
Deleting variables # Deleing new_premium variable
dt[ , new_premium := NULL]

# Deleting state variables


dt[ , grep("state_" , names(dt)) := NULL ]

#Deleting first two columns


dt[ , (1:2) := NULL ]
Reshaping data table
From wide to long format – melt function
Syntax:
melt(data, id.vars, measure.vars,
variable.name = "variable",
value.name = "value")

Parameter Description
id.vars ID columns with IDs for multiple entries
measure.vars Columns containing values to fill into cells
variable.name and value.name Names of new columns for variables and values derived from old headers

From long to wide format – dcast function


Syntax:
dcast(dt, id ~ y, value.var = c("a", "b"))

Parameter Description
Id ~ y Formula with a LHS: ID columns containing IDs for multiple entries. And a RHS:
columns with values to spread in column headers

Value.var Columns containing values to fill into cells


Reshaping data table
Wide format Long format

Syntax:
wide_data <- as.data.table(fread("reshape_data.csv"))
long_data <- melt(wide_data, id.vars = c("policy_no" , "state" , "subln_grp"), measure.vars = c("written_premium", "endorsement_premium"),
variable.name = "premium_type" , value.name = "premium")
wide_data <- dcast(long_data , policy_no + state + subln_grp ~ premium_type , value.var = "premium")
Set() family in data table
Operation Syntax
Create or upadate column names by # Change the name of premium field to wrt_prem
reference setnames(dt, "premium", "wrt_prem")

# change names of multiple columns


setnames(dt, c("premium", "policy_no"), c("wrt_prem", "policy_id"))

Setting a key on data table # set policy_no as key


setkey(dt, policy_no)

Reorder columns by reference #Change the column order of the data table
setcolorder(dt, c("deductible", "coverage", "segment", "state", "subln_grp",
"policy_no", "premium"))
Binning variables in data table
Other operations in data table
Other operations in data table
Joins in data table
• Right join
Syntax:
right_join <- dt1[dt2, on = "policy_no“ ] # right join (dt1 is left, dt2 is right)
right_join <- merge(dt1, dt2 , by = "policy_no" , all.y = TRUE)

dt1 dt2

right_join

dt1 captures written premium and state dt2 captures coverage and location code
information

Right_join has all rows from dt2


Joins in data table
• Left join
Syntax:
left_join <- dt2[ dt1, on = "policy_no"] # left join (dt1 is left, dt2 is right)
left_join <- merge(dt1, dt2, by - "policy_no" , all.x = TRUE)

dt1 dt2

left_join

dt1 captures written premium and state dt2 captures coverage and location code
information

left_join has all rows from dt1


Joins in data table
• Inner join
Syntax:
inner_join <- dt1[ dt2, on = "policy_no", nomatch = 0] # inner join
inner_join <- merge(dt1, dt2, by = "policy_no")

dt1 dt2

inner_join

dt1 captures written premium and state dt2 captures coverage and location code
information

Inner_join has all rows where dt1’s key columns values match dt2’s key column values
Joins in data table
• Left anti join
Syntax:
left_anti_join <- dt1[ !dt2, on = "policy_no“ ]

• Right anti join


Syntax:
right_anti_join <- dt2[!dt1, on = "policy_no"]

• Full outer join


Syntax:
full_outer_join <- merge(dt1 , dt2 , by =
"policy_no" , all = TRUE)
Joins in data table
• Rolling join
• Rolling joins are used for analyzing data involving time
• Let’s say data table 1 (dt1) contains policy information like inception date, end date , written
premium and location
• Data table 2 (dt2) has policy endorsement details like endorsement effective date and associated
premium amount
dt1 dt2
Joins in data table
• Rolling join
• Rolling join can be used to merge both tables with the condition that endorsement effective date lies
within the policy period

Syntax:
# Converting character to date
dt1[ , `:=` (policy_start_date =
as.Date(policy_start_date, "%m/%d/%y") ,
policy_end_date =
as.Date(policy_end_date , "%m/%d/%y"))]
dt2[ , endorsement_eff_dt :=
as.Date(endorsement_eff_dt , "%m/%d/%y")]

# Creating the date variable to be used for joining


dt1[ , join_date := policy_start_date]
dt2[ , join_date := endorsement_eff_dt]

# setting keys on each table


setkey(dt1 , policy_no , join_date)
setkey(dt2 , policy_no , join_date)

# Rolling backward
rolling_join <- dt2[dt1 , roll = -365 ]
ggplot2 - Grammar of Graphics plot
 Data: in ggplot2, data must be stored as an R data frame

 Coordinate system: describes 2-D space that data is projected onto - for example, Cartesian
coordinates, polar coordinates, map projections

 Geoms: describe type of geometric objects that represent data - for example, points, lines, polygons

 Aesthetics: describe visual characteristics that represent data - for example, position, size, color,
shape, transparency, fill

 Scales: for each aesthetic, describe how visual characteristic is converted to display values - for
example, log scales, color scales, size scales, shape scales

 Stats : describe statistical transformations that typically summarize data - for example, counts,
means, medians, regression lines

 Facets: describe how data is split into subsets and displayed as multiple small graphs
Creating a plot object
 creates a plot object that can be assigned to a variable
 can specify data frame and aesthetic mappings (visual characteristics that represent data)

Syntax:
plot_data <- as.data.table(fread("ggplot_data.csv"))
p <- ggplot(data = plot_data , aes(x = subln_grp , y = premium ))
p
x‐axis position indicates subln_grp
y‐axis position indicates premium
Adding a layer
Syntax:
plot_data <- as.data.table(fread("ggplot_data.csv"))
p <- ggplot(data = plot_data , aes(x = subln_grp , y = premium , color = state))
p + geom_point(size = 2)
Layer
Purpose:
 Display the data – allows viewer to see patterns, overall structure, local structure, outliers

 Display statistical summaries of the data – allows viewer to see counts, means, medians, IQRs, model
predictions

 Data and aesthetics (mappings) may be inherited from ggplot() object or added, changed, or dropped
within individual layers

 Most layers contain a geom - the fundamental building block of ggplot2


full specification: geom_xxx(mapping, data, stat, position, ...)

 Each geom_xxx() has a default stat (statistical transformation) associated with it , but the default
statistical transformation can be changed using stat parameter
Adding a geom layer
Syntax:
plot_data <- as.data.table(fread("ggplot_data.csv"))
p <- ggplot(data = plot_data , aes(x = subln_grp , y = premium , color = state))
p + geom_blank() p + geom_point()

p + geom_jitter() p + geom_count()
Displaying Statistical Summary
Syntax:
plot_data <- as.data.table(fread("ggplot_data.csv"))
p <- ggplot(data = plot_data , aes(x = state))
p + geom_bar()
Already transformed data
Syntax:
plot_data <- as.data.table(fread("ggplot_data.csv"))
transfrmd_data <- plot_data[ , count := .N , by = state]
transfrmd_data <- unique(transfrmd_data[,.(count , state)])
p <- ggplot(data = transfrmd_data , aes(x = state , y = count))
p + geom_col()
# or
p + geom_bar(stat = "identity")

 geom_bar: height of bar proportional to


number of observations in each group.

 geom_col: leaves data as is. geom_bar


uses count stat by default. geom_col
uses identity stat
Displaying distributions
Syntax:
plot_data <- as.data.table(fread("ggplot_data.csv"))
p <- ggplot(data = plot_data , aes(x = premium))
p + geom_histogram() p + geom_freqpoly()

p + geom_freqpoly(aes(color = state))
Displaying Statistical Summaries
Syntax:
plot_data <- as.data.table(fread("ggplot_data.csv"))
p <- ggplot(data = plot_data , aes(x = state , y = premium))
p + geom_boxplot()
Position
Syntax:
plot_data <- as.data.table(fread("ggplot_data.csv"))
p <- ggplot(data = plot_data , aes(x = state , fill = deductible > 100))
p + geom_bar() p + geom_bar(position="stack")

p + geom_bar(position="dodge") p + geom_bar(position="fill")