DP - MATH2349 Semester 2, 2019

22/09/2019 MATH2349 Semester 2, 2019
Assignment 2
kuntachikkanahalli Srinivasa Reddy Harshitha Reddy (s3797186), Eswar Phani
Paruchuri (s3798488), Sujay Kamal Madisetty (s3794983)
Setup
Install and load the packages below to reproduce this report:
# Packages that must be loaded to generate the report.
library(dplyr)
library(knitr)
library(readr)
library(tidyr)
library(Hmisc)
library(car)
library(outliers)
Reading WHO Data:

We use the read_csv() function from the readr package to import the data, as it is faster than the base R function read.csv() on large data sets.
The initial data has 7240 observations of 60 variables.
# R chunk to import the WHO data:
WHO <- read_csv("WHO.csv")
Parsed with column specification:
cols(
  .default = col_double(),
  country = col_character(),
  iso2 = col_character(),
  iso3 = col_character()
)
See spec(...) for full column specifications.
Tidy Task 1:
We first look at the column names in the data set.
The names of columns 5 to 60 are values of a variable rather than variables themselves, and the counts are spread across these columns, so the data must be transformed from wide to long format. We use the gather() function for this.
After gathering, the first tidy task is complete and the resulting data set has 405,440 rows and 6 columns.
# This is the code used for Tidy Task 1:
# To check the column names:
names(WHO)
 [1] "country"       "iso2"          "iso3"          "year"          "new_sp_m014"   "new_sp_m1524"
 [7] "new_sp_m2534"  "new_sp_m3544"  "new_sp_m4554"  "new_sp_m5564"  "new_sp_m65"    "new_sp_f014"
[13] "new_sp_f1524"  "new_sp_f2534"  "new_sp_f3544"  "new_sp_f4554"  "new_sp_f5564"  "new_sp_f65"
[19] "new_sn_m014"   "new_sn_m1524"  "new_sn_m2534"  "new_sn_m3544"  "new_sn_m4554"  "new_sn_m5564"
[25] "new_sn_m65"    "new_sn_f014"   "new_sn_f1524"  "new_sn_f2534"  "new_sn_f3544"  "new_sn_f4554"
[31] "new_sn_f5564"  "new_sn_f65"    "new_ep_m014"   "new_ep_m1524"  "new_ep_m2534"  "new_ep_m3544"
[37] "new_ep_m4554"  "new_ep_m5564"  "new_ep_m65"    "new_ep_f014"   "new_ep_f1524"  "new_ep_f2534"
[43] "new_ep_f3544"  "new_ep_f4554"  "new_ep_f5564"  "new_ep_f65"    "new_rel_m014"  "new_rel_m1524"
[49] "new_rel_m2534" "new_rel_m3544" "new_rel_m4554" "new_rel_m5564" "new_rel_m65"   "new_rel_f014"
[55] "new_rel_f1524" "new_rel_f2534" "new_rel_f3544" "new_rel_f4554" "new_rel_f5564" "new_rel_f65"
# Using the gather() function to complete the task:
WHO_TIDY1 <- WHO %>% gather(5:60, key = "code", value = "value")
# Viewing the final data after Task 1:
WHO_TIDY1
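country      iso2   iso3   year   code         value
<chr>        <chr>  <chr>  <dbl>  <chr>        <dbl>
Afghanistan  AF     AFG    1980   new_sp_m014  NA
Afghanistan  AF     AFG    1981   new_sp_m014  NA
Afghanistan  AF     AFG    1982   new_sp_m014  NA
Afghanistan  AF     AFG    1983   new_sp_m014  NA
Afghanistan  AF     AFG    1984   new_sp_m014  NA
Afghanistan  AF     AFG    1985   new_sp_m014  NA
Afghanistan  AF     AFG    1986   new_sp_m014  NA
Afghanistan  AF     AFG    1987   new_sp_m014  NA
Afghanistan  AF     AFG    1988   new_sp_m014  NA
Afghanistan  AF     AFG    1989   new_sp_m014  NA
1-10 of 405,440 rows
# Viewing the dimensions of WHO_TIDY1: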
dim(WHO_TIDY1)
[1] 405440 6
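The row count can be checked by hand: gathering 56 columns turns each of the 7240 original rows into 56 rows, and 7240 x 56 = 405,440. A minimal sketch of the same wide-to-long step on a made-up table follows (the data frame `wide` and its values are hypothetical; only the column naming mirrors the WHO data):

```r
library(tidyr)

# Hypothetical toy table: one row per country-year, one column per code.
wide <- data.frame(
  country = c("A", "B"),
  year = c(2000, 2000),
  new_sp_m014 = c(1, 3),
  new_sp_f014 = c(2, 4),
  stringsAsFactors = FALSE
)

# Collect columns 3 to 4 into a key column ("code") and a value column ("value").
long <- gather(wide, key = "code", value = "value", 3:4)

# Rows after gathering = original rows x number of gathered columns.
dim(long)
```

Here two rows and two gathered columns give four rows, with the two identifier columns repeated alongside `code` and `value`.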
Tidy Task 2:
The code column still contains four different variables. To separate them, we use the separate() function, splitting on the underscore character "_".
Since sex and age are not separated by any special character, we use the mutate() function together with substr() to split sex from age.
The required columns are then picked with the select() function and stored in a new data set, WHO_TIDY2_FINAL.
The reshaped data has 405,440 rows and 9 columns.
# This is the code used for Tidy Task 2:
# separate() splits the code column on "_":
WHO_TIDY2 <- WHO_TIDY1 %>% separate(code, into = c("new", "var", "sex_age"), sep = "_")
# mutate() and substr() split sex_age into the sex and age columns:
WHO_TIDY2_AGE <- WHO_TIDY2 %>% mutate(age = substr(sex_age, 2, 7), sex = substr(sex_age, 1, 1))
# The final reshaped data set with 405,440 rows and 9 columns:
WHO_TIDY2_FINAL <- WHO_TIDY2_AGE %>% select(country, iso2, iso3, year, new, var, sex, age, value)
# Viewing the final data after Task 2:
WHO_TIDY2_FINAL
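country      iso2   iso3   year   new    var    sex    age    value
<chr>        <chr>  <chr>  <dbl>  <chr>  <chr>  <chr>  <chr>  <dbl>
Afghanistan  AF     AFG    1980   new    sp     m      014    NA
Afghanistan  AF     AFG    1981   new    sp     m      014    NA
Afghanistan  AF     AFG    1982   new    sp     m      014    NA
Afghanistan  AF     AFG    1983   new    sp     m      014    NA
Afghanistan  AF     AFG    1984   new    sp     m      014    NA
Afghanistan  AF     AFG    1985   new    sp     m      014    NA
Afghanistan  AF     AFG    1986   new    sp     m      014    NA
Afghanistan  AF     AFG    1987   new    sp     m      014    NA
Afghanistan  AF     AFG    1988   new    sp     m      014    NA
Afghanistan  AF     AFG    1989   new    sp     m      014    NA
1-10 of 405,440 rows
# Viewing the dimensions of WHO_TIDY2_FINAL: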
dim(WHO_TIDY2_FINAL)
[1] 405440 9
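As a small illustration of the two-step split described above, the sketch below applies separate() and substr() to two code strings taken from the column names. The data frame `codes` is a hypothetical stand-in for the gathered data, and using nchar() for the end position is a choice made here, not the report's.

```r
library(tidyr)
library(dplyr)

# Two code strings of the same shape as the WHO column names.
codes <- data.frame(code = c("new_sp_m014", "new_rel_f65"),
                    stringsAsFactors = FALSE)

parts <- codes %>%
  separate(code, into = c("new", "var", "sex_age"), sep = "_") %>%
  mutate(
    sex = substr(sex_age, 1, 1),              # first character: m or f
    age = substr(sex_age, 2, nchar(sex_age))  # remaining characters: age band
  ) %>%
  select(new, var, sex, age)

parts
```

Taking the end position from nchar() avoids hard-coding the longest age code; the fixed end position of 7 used in the report works equally well for these strings.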
Tidy Task 3:
Using the spread() function, we generate columns from rows. The rel, ep, sn, and sp keys each get their own column, as we treat each of them as a separate variable. The code is given below.
The final reshaped data has 101,360 rows and 11 columns.
# This is the code used for Tidy Task 3:
# spread() generates one column per distinct value in the 'var' column.
WHO_TIDY_TASK3 <- spread(WHO_TIDY2_FINAL, key = var, value = value)
# Viewing the final data after Task 3:
WHO_TIDY_TASK3
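country      iso2   iso3   year   new    sex    age    ep     rel    sn
<chr>        <chr>  <chr>  <dbl>  <chr>  <chr>  <chr>  <dbl>  <dbl>  <dbl>
Afghanistan  AF     AFG    1980   new    m      014    NA     NA     NA
Afghanistan  AF     AFG    1981   new    m      014    NA     NA     NA
Afghanistan  AF     AFG    1982   new    m      014    NA     NA     NA
Afghanistan  AF     AFG    1983   new    m      014    NA     NA     NA
Afghanistan  AF     AFG    1984   new    m      014    NA     NA     NA
Afghanistan  AF     AFG    1985   new    m      014    NA     NA     NA
Afghanistan  AF     AFG    1986   new    m      014    NA     NA     NA
Afghanistan  AF     AFG    1987   new    m      014    NA     NA     NA
Afghanistan  AF     AFG    1988   new    m      014    NA     NA     NA
Afghanistan  AF     AFG    1989   new    m      014    NA     NA     NA
1-10 of 101,360 rows | 1-10 of 11 columns
# Viewing the dimensions of WHO_TIDY_TASK3: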
dim(WHO_TIDY_TASK3)
[1] 101360 11
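The drop from 405,440 to 101,360 rows is consistent with spreading four keys into four columns: 405,440 / 4 = 101,360. A minimal sketch of the long-to-wide step on made-up data (the data frame `long` and its values are hypothetical):

```r
library(tidyr)

# Hypothetical long table: each (country, sex) pair has one row per key.
long <- data.frame(
  country = rep("A", 4),
  sex = c("m", "m", "f", "f"),
  var = c("sp", "sn", "sp", "sn"),
  value = c(1, 2, 3, 4),
  stringsAsFactors = FALSE
)

# One new column per distinct value of var; rows collapse to one per (country, sex).
wide <- spread(long, key = var, value = value)

dim(wide)
```

Four long rows with two distinct keys collapse to two wide rows, each holding an `sn` and an `sp` column.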
Tidy Task 4:
Using the mutate() function together with factor(), the categorical variables sex and age are converted into factors, as shown in the code below.
For the age variable, labels are created and ordered: <15, 15-24, 25-34, 35-44, 45-54, 55-64, 65>=.
The final data set in this task has 101,360 rows and 11 columns.
# Using mutate() to convert the categorical variables sex and age to factors:
WHO_TIDY_TASK4 <- WHO_TIDY_TASK3 %>% mutate(sex = factor(sex, levels = c("m", "f")), age = factor(age))
# Ordering and labelling the age factor as requested
# (named arguments with = avoid creating stray global variables):
WHO_TIDY_TASK4$age <- ordered(WHO_TIDY_TASK4$age,
                              levels = c("014", "1524", "2534", "3544", "4554", "5564", "65"),
                              labels = c("<15", "15-24", "25-34", "35-44", "45-54", "55-64", "65>="))
# Checking the variables sex and age:
str(WHO_TIDY_TASK4)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 101360 obs. of 11 variables:

$ country: chr "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
$ iso2 : chr "AF" "AF" "AF" "AF" ...
$ iso3 : chr "AFG" "AFG" "AFG" "AFG" ...
$ year : num 1980 1981 1982 1983 1984 ...
$ new : chr "new" "new" "new" "new" ...
$ sex : Factor w/ 2 levels "m","f": 1 1 1 1 1 1 1 1 1 1 ...
$ age : Ord.factor w/ 7 levels "<15"<"15-24"<..: 1 1 1 1 1 1 1 1 1 1 ...
$ ep : num NA NA NA NA NA NA NA NA NA NA ...
$ rel : num NA NA NA NA NA NA NA NA NA NA ...
$ sn : num NA NA NA NA NA NA NA NA NA NA ...
$ sp : num NA NA NA NA NA NA NA NA NA NA ...
# Viewing the final data set with 101,360 rows and 11 columns:
WHO_TIDY_TASK4
country      iso2   iso3   year   new    sex     age    ep     rel    sn
<chr>        <chr>  <chr>  <dbl>  <chr>  <fctr>  <ord>  <dbl>  <dbl>  <dbl>
Afghanistan  AF     AFG    1980   new    m       <15    NA     NA     NA
Afghanistan  AF     AFG    1981   new    m       <15    NA     NA     NA
# Viewing the dimensions of WHO_TIDY_TASK4:
dim(WHO_TIDY_TASK4)
[1] 101360 11
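To illustrate what the ordered factor provides, the sketch below builds one from a few raw age codes using base R only. The vector `age_codes` is made up for illustration; the levels and labels are the ones listed above.

```r
# Hypothetical raw age codes, deliberately out of order.
age_codes <- c("65", "014", "2534")

age <- factor(
  age_codes,
  levels = c("014", "1524", "2534", "3544", "4554", "5564", "65"),
  labels = c("<15", "15-24", "25-34", "35-44", "45-54", "55-64", "65>="),
  ordered = TRUE
)

as.character(age)  # the labels replace the raw codes
age[2] < age[3]    # ordered factors support comparisons between levels
```

Because the levels are listed from youngest to oldest, comparisons and sorting follow the age bands rather than alphabetical order.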
Task 5: Filter & Select

Using the select() function, only the required columns are kept and the result is stored in the country data frame, thereby dropping the redundant 'iso2' and 'new' columns.
From this tidy version of the WHO data set, three countries ('Afghanistan', 'Sri Lanka', and 'India') are filtered and stored in a subset named 'WHO_Subset'.
The dim() function then gives the dimensions of the subset, as shown in the code below.
# Checking the column names:
names(WHO_TIDY_TASK4)
 [1] "country" "iso2"    "iso3"    "year"    "new"     "sex"     "age"     "ep"      "rel"
[10] "sn"      "sp"
# Using select() to pick only the necessary columns:
country <- WHO_TIDY_TASK4 %>% select(country,iso3,year,sex,age,ep,rel,sn,sp)
# Creating the WHO subset with three countries:
WHO_Subset <- country %>% filter(country %in% c("Afghanistan", "Sri Lanka", "India"))
# Viewing the dimensions of WHO_Subset:
dim(WHO_Subset)
[1] 1428 9
# Viewing the final subset with 1,428 rows and 9 columns:
WHO_Subset
country iso3 year sex age ep rel sn sp

<chr> <chr> <dbl> <fctr> <ord> <dbl> <dbl> <dbl> <dbl>
Afghanistan AFG 1980 m <15 NA NA NA NA
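The select-then-filter pattern used in this task can be sketched on a small made-up table. The data frame `cases` and its values are hypothetical; only the column names echo the WHO data.

```r
library(dplyr)

# Hypothetical toy data with one redundant column (iso2).
cases <- data.frame(
  country = c("India", "Peru", "Sri Lanka", "India"),
  iso2 = c("IN", "PE", "LK", "IN"),
  year = c(2000, 2000, 2000, 2001),
  sp = c(10, 20, 30, 40),
  stringsAsFactors = FALSE
)

subset_cases <- cases %>%
  select(country, year, sp) %>%                  # keep only the needed columns
  filter(country %in% c("India", "Sri Lanka"))   # keep rows for the listed countries

dim(subset_cases)
```

The %in% operator tests membership in a vector, so extending the subset to more countries only requires adding names to the list.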
Read Species and Surveys data sets

The species and surveys data are read and stored in their respective data sets, as shown below.
# This is the code used to read the species and surveys data:
# Species data:
Species_data <- read_csv("species.csv")

cols(
  species_id = col_character(),
  genus = col_character(),
  species = col_character(),
  taxa = col_character()
)
# Surveys data:
surveys_data <- read_csv("surveys.csv")

cols(
  record_id = col_double(),
  month = col_double(),
  day = col_double(),
  year = col_double(),
  species_id = col_character(),
  sex = col_character(),
  hindfoot_length = col_double(),
  weight = col_double()
)
Task 6: Join
The 'surveys' and 'species' data are combined using the key variable 'species_id', which adds the species information to the surveys data and creates a new combined data frame, 'Surveys_combined', as shown below.
The dimensions of the new data frame are also shown below.
# This is the code used for Task 6:
# Adding the species information to the surveys data using a left join:
Surveys_combined <- surveys_data %>% left_join(Species_data, by = "species_id")
Surveys_combined
record_id  month  day    year   species_id  sex    hindfoot_length  weight  genus        species
<dbl>      <dbl>  <dbl>  <dbl>  <chr>       <chr>  <dbl>            <dbl>   <chr>        <chr>
1          7      16     1977   NL          M      32               NA      Neotoma      albigula
3          7      16     1977   DM          F      37               NA      Dipodomys    merriami
4          7      16     1977   DM          M      36               NA      Dipodomys    merriami
6          7      16     1977   PF          M      14               NA      Perognathus  flavus
7          7      16     1977   PE          F      NA               NA      Peromyscus   eremicus
10         7      16     1977   PF          F      20               NA      Perognathus  flavus
# Code to view the dimensions of the new data frame:

dim(Surveys_combined)
[1] 35549 11
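A left join keeps every row of the left-hand table and fills the added columns with NA where no match exists. The sketch below shows this on two tiny made-up tables (`surveys` and `species` here are hypothetical stand-ins, not the data sets read above):

```r
library(dplyr)

# Hypothetical survey records; the third has no species recorded.
surveys <- data.frame(
  record_id = 1:3,
  species_id = c("NL", "DM", NA),
  stringsAsFactors = FALSE
)

# Hypothetical lookup table keyed by species_id.
species <- data.frame(
  species_id = c("NL", "DM", "PF"),
  genus = c("Neotoma", "Dipodomys", "Perognathus"),
  stringsAsFactors = FALSE
)

# Every survey row is kept; unmatched keys get NA in the added columns.
combined <- left_join(surveys, species, by = "species_id")

combined
```

When the key is unique in the lookup table, as in this sketch, the joined result has exactly as many rows as the left-hand table.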
Task 7: Calculate
Here we filter the data for a single species (albigula).
We then calculate the average weight and hindfoot length of the selected species for each month.
The na.rm = TRUE argument excludes missing values from the calculation.
# Filtering for one particular species:
Surveys_combined_Task7 <- Surveys_combined %>% filter(species == "albigula")
# Calculating the average weight and hindfoot length of the selected species
# observed in each month, excluding the NA values:
Surveys_combined_avg <- Surveys_combined_Task7 %>% group_by(month) %>% summarise(avg_weight = mean(weight, na.rm = TRUE), avg_hindfoot_length = mean(hindfoot_length, na.rm = TRUE))
Surveys_combined_avg
month avg_weight avg_hindfoot_length

<dbl> <dbl> <dbl>
1 179.3443 32.54098
2 181.3818 32.82353
3 177.4516 32.75862
4 153.0690 32.02439
5 142.7536 31.60000
6 143.7879 32.18889
7 141.7415 32.35398
8 152.5100 32.07143
9 164.9920 32.50427
10 169.1364 32.43119
1-10 of 12 rows
# Viewing the dimensions of Surveys_combined_avg:

dim(Surveys_combined_avg)
[1] 12 3
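The effect of na.rm = TRUE in a grouped mean can be seen on a small made-up example (the data frame `obs` and its values are hypothetical):

```r
library(dplyr)

# Hypothetical observations: month 1 has one missing weight.
obs <- data.frame(
  month = c(1, 1, 2, 2),
  weight = c(100, NA, 150, 170)
)

monthly <- obs %>%
  group_by(month) %>%
  summarise(avg_weight = mean(weight, na.rm = TRUE))  # NAs are skipped per group

monthly

# Without na.rm = TRUE, a single NA makes the whole mean NA.
mean(c(100, NA))
```

Month 1 averages only its one observed weight, while month 2 averages both of its values.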
Task 8: Missing Values

A data set surveys_combined_year is created with 1977 as the selected year.
The data frame is grouped by species.
Missing values are counted using the is.na() function.
Missing weights are then replaced with a mean value and the imputed data is stored in a new data set. Note that impute() receives the weight column as a plain vector, so the grouping is not applied: every missing weight is replaced with the overall 1977 mean (46.65038) rather than a per-species mean, as the output below shows.
# Selecting one year and storing it in a new data frame:
surveys_combined_year <- Surveys_combined %>% filter(year == 1977)
surveys_combined_year

1-10 of 503 rows | 1-10 of 11 columns
# Grouping by species:
surveys_group_by <- surveys_combined_year %>% group_by(species)
surveys_group_by
# Counting the missing values in the weight column:

sum(is.na(surveys_group_by$weight))
[1] 237
# Replacing missing values in the weight column with the overall mean
# (the grouping is dropped when the column is extracted with $):
surveys_combined_year$weight <- impute(surveys_group_by$weight, fun = mean)
# Storing the data in surveys_weight_imputed:
surveys_weight_imputed <- surveys_combined_year
surveys_weight_imputed
record_id  month  day    year   species_id  sex    hindfoot_length  weight        genus        species
<dbl>      <dbl>  <dbl>  <dbl>  <chr>       <chr>  <dbl>            <S3: impute>  <chr>        <chr>
1          7      16     1977   NL          M      32               46.65038      Neotoma      albigula
2          7      16     1977   NL          M      33               46.65038      Neotoma      albigula
3          7      16     1977   DM          F      37               46.65038      Dipodomys    merriami
4          7      16     1977   DM          M      36               46.65038      Dipodomys    merriami
6          7      16     1977   PF          M      14               46.65038      Perognathus  flavus
7          7      16     1977   PE          F      NA               46.65038      Peromyscus   eremicus
9          7      16     1977   DM          F      34               46.65038      Dipodomys    merriami
10         7      16     1977   PF          F      20               46.65038      Perognathus  flavus
# Checking for remaining missing values in the imputed data set:
sum(is.na(surveys_weight_imputed$weight))
[1] 0
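If a per-species mean is wanted instead of a single overall mean, one possible approach is base R's ave(), which applies a function within each group and returns a vector aligned with the input. This is a sketch on made-up data, not the method used in the report; the data frame `obs` and the column `weight_imputed` are hypothetical.

```r
# Hypothetical observations with one missing weight per species.
obs <- data.frame(
  species = c("albigula", "albigula", "albigula", "merriami", "merriami"),
  weight = c(150, NA, 170, 40, NA),
  stringsAsFactors = FALSE
)

# Within each species, replace NA with that species' mean of the observed weights.
obs$weight_imputed <- ave(
  obs$weight, obs$species,
  FUN = function(w) ifelse(is.na(w), mean(w, na.rm = TRUE), w)
)

obs
```

A species whose weights are all missing has no observed mean, so its values would come out as NaN and would need separate handling.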
Task 9: Special Values

The code below checks for special values (NaN, Inf, -Inf). There are no special values in the selected data: all 503 weights are finite.
# Checking for special values such as NaN, Inf, -Inf:
# Code to count the finite values:

sum(is.finite(surveys_weight_imputed$weight))
[1] 503
# Code to check for infinite values:

sum(is.infinite(surveys_weight_imputed$weight))
[1] 0
# Code to check for NaN (undefined) values:

sum(is.nan(surveys_weight_imputed$weight))
[1] 0
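How the three checks relate can be seen on a short made-up vector containing every kind of special value (the vector `x` is hypothetical):

```r
# One ordinary value plus each kind of missing or special value.
x <- c(1, NA, NaN, Inf, -Inf)

n_missing  <- sum(is.na(x))        # counts NA and NaN
n_nan      <- sum(is.nan(x))       # counts NaN only
n_infinite <- sum(is.infinite(x))  # counts Inf and -Inf
n_finite   <- sum(is.finite(x))    # counts ordinary numbers only

c(missing = n_missing, nan = n_nan, infinite = n_infinite, finite = n_finite)
```

Since is.finite() is FALSE for NA, NaN, Inf and -Inf alike, a finite count equal to the number of rows (503 here) already implies that none of these values is present.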
Task 10: Outliers

A boxplot is drawn to check for univariate outliers.
The ids of the existing outliers are then identified.
We cannot tell whether the outliers are due to data entry or data pre-processing errors, as we do not have real-life knowledge of the species. Deleting the outlying observations altogether is therefore not advisable, as it may lead to misinterpretation. Capping is the more feasible approach: values below the lower outlier fence are replaced with the 5th percentile and values above the upper fence with the 95th percentile, as the code below shows.
After capping, the new boxplot shows all values in the range 0 to 60, with no outliers, as expected.
# Drawing the boxplot:
Surveys_outlier <- boxplot(Surveys_combined$hindfoot_length)
# Finding the ids of the outliers (Boxplot() from the car package labels them):
Boxplot(Surveys_combined$hindfoot_length, id = TRUE)
# Removing the NA values:
surveys_combined1 <- na.omit(Surveys_combined$hindfoot_length)
# Capping the outliers; defining the function:
cap <- function(x){
  # 5th, 25th, 75th and 95th percentiles of x
  quantiles <- quantile(x, c(.05, 0.25, 0.75, .95))
  # Values below Q1 - 1.5*IQR are set to the 5th percentile
  x[x < quantiles[2] - 1.5*IQR(x)] <- quantiles[1]
  # Values above Q3 + 1.5*IQR are set to the 95th percentile
  x[x > quantiles[3] + 1.5*IQR(x)] <- quantiles[4]
  x
}
# Capping the outliers using the cap function:
surveys_outliers <- surveys_combined1 %>% cap()
# Re-drawing the boxplot to check for remaining outliers:
boxplot(surveys_outliers)
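To see what capping does to a single extreme value, the sketch below restates the capping rule on a made-up vector. The vector `x` is hypothetical, and computing the fence width once up front is a small simplification relative to the function above.

```r
# Restatement of the capping rule: fences at Q1 - 1.5*IQR and Q3 + 1.5*IQR.
cap <- function(x) {
  q <- quantile(x, c(0.05, 0.25, 0.75, 0.95))
  fence <- 1.5 * IQR(x)
  x[x < q[2] - fence] <- q[1]  # low outliers -> 5th percentile
  x[x > q[3] + fence] <- q[4]  # high outliers -> 95th percentile
  x
}

# Hypothetical data: 1 to 19 plus one extreme value.
x <- c(1:19, 100)
capped <- cap(x)

range(x)
range(capped)
```

Only the extreme value changes: it is pulled down to the 95th percentile of the original vector, while the other nineteen values are left untouched.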
