Presentation 1

R Compendium
Data Structures in R
Vector Data frame
An ordered collection of data of the same type It is a table with rows and columns; data within each column has
> a = c(1,2,3) the same type (e.g. number, text, logical), but different columns
> a*2 may have different types.
[1] 2 4 6
Example:
Matrix >a
A rectangular table of data of the same type Policy_no premium State
12345 100 IL
List
25486 400 NY
An ordered collection of data of arbitrary types.
63254 350 FL
> doe = list(name="john",age=28,married=F)
> doe$name
[1] "john“
> doe$age
[1] 28
Linear Rectangular
All Same Type Vector Matrix
Mixed List Data frame
Lapply, Sapply & Apply
Lapply Sapply
•When the same or similar tasks need to be performed multiple times for sapply( li, fct )
all elements of a list or for all columns of an array. Like apply, but tries to simplify the result, by converting it
•May be easier and faster than “for” loops into a vector or array of appropriate size
•lapply(li, function )
•To each element of the list li, the function function is applied. > li = list("klaus","martin","georg")
•The result is a list whose elements are the individual function > sapply(li, toupper)
results. [1] "KLAUS" "MARTIN" "GEORG"
> li = list("klaus","martin”)
> lapply(li, toupper) > fct = function(x) { return(c(x, x*x)) }
> [[1]] > sapply(1:5, fct)
> [1] "KLAUS" [,1] [,2] [,3] [,4] [,5]
> [[2]] [1,] 1 2 3 4 5
> [1] "MARTIN" [2,] 1 4 9 16 25
Apply
apply( arr, margin, fct )
Apply the function fct along some dimensions of the array arr, according to margin, and return a vector or array of the appropriate size.
>x
[,1] [,2] [,3] > apply(x, 1, sum)
[1,] 5 7 0 [1] 12 24 17 14
[2,] 7 9 8 > apply(x, 2, sum)
[3,] 4 6 7
[4,] 6 3 5
Regular expressions
Functions for detecting patterns Functions for replacing patterns
• grep(pattern, string, value = TRUE) • sub(pattern, replacement, string) - replace first match
• grepl(pattern , string) • gsub(pattern, replacement, string) - replace all
matches
Importing & Exporting data
Understanding data table
DT[i , j , by ]
Which rows ? What to do? Grouped by what?
- vector of row numbers - summarizing - one or more columns

- logical vector - updating variable(s) - on the fly grouping var(s)
- another data table - adding variable(s)
• Order of execution
DT[i , j , by ] DT[ 1 , 3 , 2 ]
Filtering rows & selecting columns data table
Operation Syntax
Subsetting rows by numbers # select first to tenth row
dt[1:10 , ]
Using column names to select rows # selecting rows with subln_grp as "PL"
based on a condition dt[ subln_grp == "PL" , ]
# Selecting rows which don't have NA in premium field

dt[ !is.na(premium) , ]
# Select rows with state as Florida and premium greater than zero
dt[ state == "FL" & premium > 0 , ]
Select columns # Select multiple columns

dt[ , .(policy_no , premium , state)]
Note - .() is an alias to list(). If .() is used, the returned value is a data.table. If .() is
not used, the result is a vector.
Subsetting rows and selecting # Select only rows where segment is "retail" and relevant columns
columns together dt[ segment == "retail" , .(policy_no , premium , state) ]
Summarizing data table
• Count and aggregation
Operation Syntax
Count # Counting number of policies with PL as their subln_grp
dt[ subln_grp == "PL" , .N ]
Count distinct # Count unique number of segments

dt[ , uniqueN(segment)]
uniqueN( dt , by = "segment")
# Count unique number of segments for Florida

dt[ state == "FL" , uniqueN(segment) ]
Simple aggregation # Taking average premium and count of policies
dt[ , .(count = .N , average = mean(premium , na.rm = TRUE)) ]
Aggregation including filtering # Taking average premium and count of policies for which deductible is greater
than $100
dt[ deductible > 100 , .(count = .N , average = mean(premium , na.rm = TRUE)) ]
• Group by
Operation Syntax
Simple group by # Taking average premium and count of policies by segment
dt[ , .(count = .N , average = mean(premium , na.rm = TRUE)) , by = segment ]
# Taking average premium and count of policies for which deductible is greater
than $100 by segment
dt[ deductible > 100 , .(count = .N , average = mean(premium , na.rm = TRUE)) ,
by = segment ]
OR
dt[ , .(count = .N , average = mean(premium , na.rm = TRUE)) ,
by = .(segment , ded = deductible > 100 ) ]
Special symbol - .SD & .SDcols
SD = Subset of Data .SDcols

- a data.table by itself - Specifies the columns of data table that are in .SD
- holds data of current group as defined in by - Allows to apply a function to all rows , but only to selected
columns
- when no by, .SD applies to whole data.table
- Comes in handy if there are too many columns and a
- allows for calculations on multiple columns
particular operation is to be performed on a subset of columns
Operation Syntax
.SD & .SDcols # Computing average premium using .SD
dt[ , lapply(.SD , function (x) mean( x , na.rm = TRUE) ) , by = segment , .SDcols
= "premium" ]
OR
dt[ , lapply(.SD , function (x) mean( x , na.rm = TRUE) ) , by = segment , .SDcols

= grep( "prem” , names(dt)) ]
Update, add, delete variables in data table
Special operator: :=
 Updates a data table in place (by reference)
 Can be used to:
• update existing columns
• add new columns
• delete columns
 No need to use <-
Operation Syntax
Updating variables # Doubling the premium amount
dt[ , premium := premium *2]
# Doubling both premium and deductible amount

dt[ , `:=` (premium = premium*2 ,
deductible = deductible * 2) ]
Updating variables by group # Modifying premium by segment
dt[ , premium := premium * uniqueN(deductible) / .N , by = segment ]
Update, add, delete variables in data table
Operation Syntax
Adding variables # Adding new premium variable with premium doubled
dt[ , new_premium := premium *2]
Adding variables by group # Computing state premium

dt[ , state_premium := sum(premium , na.rm = TRUE) , by = state ]
# Computing state premium and state_deductible

dt[ , `:=` (state_premium = sum(premium , na.rm = TRUE) ,
state_deductible = max(deductible) ) , by = state ]
Deleting variables # Deleing new_premium variable
dt[ , new_premium := NULL]
# Deleting state variables

dt[ , grep("state_" , names(dt)) := NULL ]
#Deleting first two columns

dt[ , (1:2) := NULL ]
Reshaping data table
From wide to long format – melt function
Syntax:
melt(data, id.vars, measure.vars,
variable.name = "variable",
value.name = "value")
Parameter Description
id.vars ID columns with IDs for multiple entries
measure.vars Columns containing values to fill into cells
variable.name and value.name Names of new columns for variables and values derived from old headers
From long to wide format – dcast function

Syntax:
dcast(dt, id ~ y, value.var = c("a", "b"))
Parameter Description
Id ~ y Formula with a LHS: ID columns containing IDs for multiple entries. And a RHS:
columns with values to spread in column headers
Value.var Columns containing values to fill into cells

Reshaping data table
Wide format Long format
Syntax:
wide_data <- as.data.table(fread("reshape_data.csv"))
long_data <- melt(wide_data, id.vars = c("policy_no" , "state" , "subln_grp"), measure.vars = c("written_premium", "endorsement_premium"),
variable.name = "premium_type" , value.name = "premium")
wide_data <- dcast(long_data , policy_no + state + subln_grp ~ premium_type , value.var = "premium")
Set() family in data table
Operation Syntax
Create or upadate column names by # Change the name of premium field to wrt_prem
reference setnames(dt, "premium", "wrt_prem")
# change names of multiple columns

setnames(dt, c("premium", "policy_no"), c("wrt_prem", "policy_id"))
Setting a key on data table # set policy_no as key

setkey(dt, policy_no)
Reorder columns by reference #Change the column order of the data table
setcolorder(dt, c("deductible", "coverage", "segment", "state", "subln_grp",
"policy_no", "premium"))
Binning variables in data table
Other operations in data table
Other operations in data table
Joins in data table
• Right join
Syntax:
right_join <- dt1[dt2, on = "policy_no“ ] # right join (dt1 is left, dt2 is right)
right_join <- merge(dt1, dt2 , by = "policy_no" , all.y = TRUE)
dt1 dt2
right_join
dt1 captures written premium and state dt2 captures coverage and location code
information
Right_join has all rows from dt2

Joins in data table
• Left join
Syntax:
left_join <- dt2[ dt1, on = "policy_no"] # left join (dt1 is left, dt2 is right)
left_join <- merge(dt1, dt2, by - "policy_no" , all.x = TRUE)
dt1 dt2
left_join
information
left_join has all rows from dt1

Joins in data table
• Inner join
Syntax:
inner_join <- dt1[ dt2, on = "policy_no", nomatch = 0] # inner join
inner_join <- merge(dt1, dt2, by = "policy_no")
dt1 dt2
inner_join
information
Inner_join has all rows where dt1’s key columns values match dt2’s key column values
Joins in data table
• Left anti join
Syntax:
left_anti_join <- dt1[ !dt2, on = "policy_no“ ]
• Right anti join

Syntax:
right_anti_join <- dt2[!dt1, on = "policy_no"]
• Full outer join

Syntax:
full_outer_join <- merge(dt1 , dt2 , by =
"policy_no" , all = TRUE)
Joins in data table
• Rolling join
• Rolling joins are used for analyzing data involving time
• Let’s say data table 1 (dt1) contains policy information like inception date, end date , written
premium and location
• Data table 2 (dt2) has policy endorsement details like endorsement effective date and associated
premium amount
dt1 dt2
Joins in data table
• Rolling join
• Rolling join can be used to merge both tables with the condition that endorsement effective date lies
within the policy period
Syntax:
# Converting character to date
dt1[ , `:=` (policy_start_date =
as.Date(policy_start_date, "%m/%d/%y") ,
policy_end_date =
as.Date(policy_end_date , "%m/%d/%y"))]
dt2[ , endorsement_eff_dt :=
as.Date(endorsement_eff_dt , "%m/%d/%y")]
# Creating the date variable to be used for joining

dt1[ , join_date := policy_start_date]
dt2[ , join_date := endorsement_eff_dt]
# setting keys on each table

setkey(dt1 , policy_no , join_date)
setkey(dt2 , policy_no , join_date)
# Rolling backward
rolling_join <- dt2[dt1 , roll = -365 ]
ggplot2 - Grammar of Graphics plot
 Data: in ggplot2, data must be stored as an R data frame
 Coordinate system: describes 2-D space that data is projected onto - for example, Cartesian
coordinates, polar coordinates, map projections
 Geoms: describe type of geometric objects that represent data - for example, points, lines, polygons
 Aesthetics: describe visual characteristics that represent data - for example, position, size, color,
shape, transparency, fill
 Scales: for each aesthetic, describe how visual characteristic is converted to display values - for
example, log scales, color scales, size scales, shape scales
 Stats : describe statistical transformations that typically summarize data - for example, counts,
means, medians, regression lines
 Facets: describe how data is split into subsets and displayed as multiple small graphs
Creating a plot object
 creates a plot object that can be assigned to a variable
 can specify data frame and aesthetic mappings (visual characteristics that represent data)
Syntax:
plot_data <- as.data.table(fread("ggplot_data.csv"))
p <- ggplot(data = plot_data , aes(x = subln_grp , y = premium ))
p
x‐axis position indicates subln_grp
y‐axis position indicates premium
Adding a layer
Syntax:
p <- ggplot(data = plot_data , aes(x = subln_grp , y = premium , color = state))
p + geom_point(size = 2)
Layer
Purpose:
 Display the data – allows viewer to see patterns, overall structure, local structure, outliers
 Display statistical summaries of the data – allows viewer to see counts, means, medians, IQRs, model
predictions
 Data and aesthetics (mappings) may be inherited from ggplot() object or added, changed, or dropped
within individual layers
 Most layers contain a geom - the fundamental building block of ggplot2

full specification: geom_xxx(mapping, data, stat, position, ...)
 Each geom_xxx() has a default stat (statistical transformation) associated with it , but the default
statistical transformation can be changed using stat parameter
Adding a geom layer
Syntax:
p <- ggplot(data = plot_data , aes(x = subln_grp , y = premium , color = state))
p + geom_blank() p + geom_point()
p + geom_jitter() p + geom_count()
Displaying Statistical Summary
Syntax:
p <- ggplot(data = plot_data , aes(x = state))
p + geom_bar()
Already transformed data
Syntax:
transfrmd_data <- plot_data[ , count := .N , by = state]
transfrmd_data <- unique(transfrmd_data[,.(count , state)])
p <- ggplot(data = transfrmd_data , aes(x = state , y = count))
p + geom_col()
# or
p + geom_bar(stat = "identity")
 geom_bar: height of bar proportional to

number of observations in each group.
 geom_col: leaves data as is. geom_bar

uses count stat by default. geom_col
uses identity stat
Displaying distributions
Syntax:
p <- ggplot(data = plot_data , aes(x = premium))
p + geom_histogram() p + geom_freqpoly()
p + geom_freqpoly(aes(color = state))
Displaying Statistical Summaries
Syntax:
p <- ggplot(data = plot_data , aes(x = state , y = premium))
p + geom_boxplot()
Position
Syntax:
p <- ggplot(data = plot_data , aes(x = state , fill = deductible > 100))
p + geom_bar() p + geom_bar(position="stack")
p + geom_bar(position="dodge") p + geom_bar(position="fill")

Presentation 1

Încărcat de

Informații document

Drepturi de autor

Formate disponibile

Partajați acest document

Partajați sau inserați document

Opțiuni de partajare

Vi se pare util acest document?

Este necorespunzător acest conținut?

Drepturi de autor:

Formate disponibile

Presentation 1

Încărcat de

Drepturi de autor:

Formate disponibile

R Compendium

Functions for detecting patterns Functions for replacing patterns

Which rows ? What to do? Grouped by what?

- vector of row numbers - summarizing - one or more columns

# Selecting rows which don't have NA in premium field

Select columns # Select multiple columns

Count distinct # Count unique number of segments

# Count unique number of segments for Florida

Special symbol - .SD & .SDcols

SD = Subset of Data .SDcols

dt[ , lapply(.SD , function (x) mean( x , na.rm = TRUE) ) , by = segment , .SDcols

# Doubling both premium and deductible amount

Adding variables by group # Computing state premium

# Computing state premium and state_deductible

# Deleting state variables

#Deleting first two columns

From long to wide format – dcast function

Value.var Columns containing values to fill into cells

# change names of multiple columns

Setting a key on data table # set policy_no as key

Right_join has all rows from dt2

left_join has all rows from dt1

• Right anti join

• Full outer join

# Creating the date variable to be used for joining

# setting keys on each table

 Most layers contain a geom - the fundamental building block of ggplot2

 geom_bar: height of bar proportional to

 geom_col: leaves data as is. geom_bar

S-ar putea să vă placă și