Assignment 2
kuntachikkanahalli Srinivasa Reddy Harshitha Reddy (s3797186), Eswar Phani
Paruchuri (s3798488), Sujay Kamal Madisetty (s3794983)
Setup
Install and load the following packages to reproduce this report:
# Packages required to generate the report.
library(dplyr)
library(knitr)
library(readr)
library(tidyr)
library(Hmisc)
library(car)
library(outliers)
Tidy Task 1:
We first look at the column names in the dataset.
The data must then be transformed from wide to long format. The names of columns 5 to 60 are
values of a variable rather than variables themselves, and the measured values are spread across
these columns, so we use the gather() function.
file:///Users/harshithareddy/Documents/DataPreprocessing/DP_MATH2349_1950_Assignment_2_Template.nb.html 1/15
22/09/2019 MATH2349 Semester 2, 2019
After applying gather(), the first tidy task is complete and the data set contains
405,440 rows and 6 columns.
# To check the column names:
names(WHO)
[1] 405440 6
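The wide-to-long step described in this task can be sketched on a small toy data frame. This is only an illustration under stated assumptions: the frame `who_toy`, its values, and the key/value column names `code` and `value` are hypothetical and are not taken from the report; only the use of gather() on a range of column positions follows the text.

```r
library(tidyr)

# Hypothetical toy data in wide format: the last two column names encode
# values of a variable instead of being variables themselves.
who_toy <- data.frame(
  country = c("A", "B"),
  iso2 = c("AA", "BB"),
  iso3 = c("AAA", "BBB"),
  year = c(2000, 2000),
  new_sp_m014 = c(1, 3),
  new_sp_f014 = c(2, 4),
  stringsAsFactors = FALSE
)

# Gather columns 5 to 6 into a key column ("code") and a value column ("value").
# In the report the same idea is applied to columns 5 to 60.
who_toy_long <- gather(who_toy, key = "code", value = "value", 5:6)

# Two rows times two gathered columns gives 4 rows; 4 id columns + 2 = 6 columns.
dim(who_toy_long)
```

The number of columns drops to 6 for the same reason as in the report: every gathered column is replaced by one key column and one value column.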
Tidy Task 2:
The code column still contains four different variables. To separate them, we first use the separate()
function, splitting on the special character "_".
Because age and sex are not separated by any special character, we then use mutate() together
with substr() to split sex from age, as shown in the code below.
All the required columns are then selected with select() and stored in a new data set,
WHO_TIDY2_FINAL, which holds the requested variables in the correct format.
The reshaped data now has 405,440 rows and 9 columns.
# mutate() and substr() are used to split sex_age into sex (first character)
# and age (remaining characters):
WHO_TIDY2_AGE <- WHO_TIDY2 %>%
  mutate(age = substr(sex_age, 2, 7), sex = substr(sex_age, 1, 1))
# The final reshaped data set with 405,440 rows and 9 columns:
WHO_TIDY2_FINAL <- WHO_TIDY2_AGE %>%
  select(country, iso2, iso3, year, new, var, sex, age, value)
[1] 405440 9
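The two splitting steps described in this task can be sketched together on toy data. This is an assumption-laden illustration: the frame `codes_toy`, its code strings, and the column name `code` are hypothetical, and the separate() call is a guess at the step whose code is not shown in this report.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy codes of the form new_<var>_<sex><age>.
codes_toy <- data.frame(
  code = c("new_sp_m014", "new_rel_f65"),
  value = c(1, 2),
  stringsAsFactors = FALSE
)

codes_split <- codes_toy %>%
  # Split on "_" into three pieces; sex and age are still fused in sex_age.
  separate(code, into = c("new", "var", "sex_age"), sep = "_") %>%
  # The first character is the sex; the remaining characters are the age code.
  mutate(sex = substr(sex_age, 1, 1),
         age = substr(sex_age, 2, 7)) %>%
  select(new, var, sex, age, value)

codes_split
```

substr() stops at the end of the string, so an upper bound of 7 is safe even for short age codes such as "65".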
Tidy Task 3:
Using the spread() function, we generate columns from rows. The rel, ep, sn, and sp keys need to be in
their own columns, as we treat each of them as a separate variable. The code is given below.
The final reshaped data now has 101,360 rows and 11 columns.
# spread() generates one column per distinct value of the 'var' column.
WHO_TIDY_TASK3 <- spread(WHO_TIDY2_FINAL, key = var, value = value)
[1] 101360 11
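A small sketch of the long-to-wide step may help show why the row count shrinks. The frame `long_toy` and its values are hypothetical; only the spread() call with `key = var` and `value = value` mirrors the report.

```r
library(tidyr)

# Hypothetical long data: each (country, year) pair has one row per key.
long_toy <- data.frame(
  country = c("A", "A", "B", "B"),
  year = c(2000, 2000, 2000, 2000),
  var = c("rel", "sp", "rel", "sp"),
  value = c(5, 6, 7, 8),
  stringsAsFactors = FALSE
)

# Each distinct value of var becomes its own column, so rows that differ
# only in var collapse into a single row.
wide_toy <- spread(long_toy, key = var, value = value)

wide_toy
```

With four keys instead of two, as in the report, the number of rows falls by a factor of four, which matches the change from 405,440 to 101,360 rows.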
Tidy Task 4:
Using mutate() together with factor(), both categorical variables, sex and age, are converted
into factors, as shown in the code below.
For the age variable, labels are created and ordered: <15, 15-24, 25-34, 35-44, 45-54, 55-64,
65>=.
The final data set in this task has 101,360 rows and 11 columns.
# Using mutate() to convert the categorical variables sex and age to factors:
WHO_TIDY_TASK4 <- WHO_TIDY_TASK3 %>%
  mutate(sex = factor(sex, levels = c("m", "f")), age = factor(age))
# Ordering and labelling the age variable as requested.
# Arguments are passed with '=' so that no global 'levels'/'labels' objects are created:
WHO_TIDY_TASK4$age <- ordered(WHO_TIDY_TASK4$age,
  levels = c("014", "1524", "2534", "3544", "4554", "5564", "65"),
  labels = c("<15", "15-24", "25-34", "35-44", "45-54", "55-64", "65>="))
#Viewing the final data set with 101,360 rows and 11 columns:
WHO_TIDY_TASK4
[1] 101360 11
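The factor conversion in this task can be checked on a short vector. The vectors `age_codes` and `sex_codes` below are hypothetical; the levels and labels are the ones listed in the report. The sketch uses base R's factor() with `ordered = TRUE`, which is equivalent to calling ordered().

```r
# Hypothetical raw codes, deliberately out of order.
age_codes <- c("65", "014", "2534", "1524")
sex_codes <- c("f", "m", "m", "f")

# Ordered factor: levels give the sort order, labels give the display text.
age_fct <- factor(
  age_codes,
  levels = c("014", "1524", "2534", "3544", "4554", "5564", "65"),
  labels = c("<15", "15-24", "25-34", "35-44", "45-54", "55-64", "65>="),
  ordered = TRUE
)

# Unordered factor with "m" as the first level.
sex_fct <- factor(sex_codes, levels = c("m", "f"))

levels(age_fct)
sort(age_fct)
```

Because the factor is ordered, comparisons such as `age_fct[2] < age_fct[1]` are meaningful, and sorting follows the age bands rather than alphabetical order.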
[1] 1428 9
# Viewing the subset with 1,428 rows and 9 columns:
WHO_Subset
NA
#Species Data:
Species_data <- read_csv("species.csv")
#Surveys_Data:
surveys_data <- read_csv("surveys.csv")
Task 6: Join
We combine the ‘surveys’ and ‘species’ data using the key variable ‘species_id’, which adds the species
information to the surveys data and creates a new combined data frame, ‘Surveys_combined’, as shown
below.
The dimensions of the new data frame are also shown below.
# Adding the species information to the surveys data using a left join:
Surveys_combined <- surveys_data %>% left_join(Species_data, by = "species_id")
Surveys_combined
[1] 35549 11
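The behaviour of the left join can be sketched with two toy tables. All names and values here (`surveys_toy`, `species_toy`, the ids "AA", "BB", "ZZ", and the genus strings) are hypothetical; only the join on `species_id` follows the report.

```r
library(dplyr)

# Hypothetical survey records; "ZZ" has no match in the species table.
surveys_toy <- data.frame(
  record_id = 1:3,
  species_id = c("AA", "BB", "ZZ"),
  weight = c(40, 150, NA),
  stringsAsFactors = FALSE
)

# Hypothetical species lookup table.
species_toy <- data.frame(
  species_id = c("AA", "BB"),
  genus = c("GenusA", "GenusB"),
  stringsAsFactors = FALSE
)

# A left join keeps every survey row and fills unmatched species columns with NA.
combined_toy <- left_join(surveys_toy, species_toy, by = "species_id")

combined_toy
```

A left join never drops rows from the left table, so the row count of the result equals that of the surveys data as long as `species_id` is unique in the species table.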
Task 7: Calculate
Here we filter the data to a single species,
and then calculate the average weight and hindfoot length of the selected species for each month.
The argument na.rm = TRUE is supplied so that missing values are excluded from the
calculation.
# Calculating the average weight and hindfoot length of the selected species observed
# in each month, excluding the NA values:
Surveys_combined_avg <- Surveys_combined_Task7 %>%
  group_by(month) %>%
  summarise(avg_weight = mean(weight, na.rm = TRUE),
            avg_hindfoot_length = mean(hindfoot_length, na.rm = TRUE))
Surveys_combined_avg
month  avg_weight  avg_hindfoot_length
1      179.3443    32.54098
2      181.3818    32.82353
3      177.4516    32.75862
4      153.0690    32.02439
5      142.7536    31.60000
6      143.7879    32.18889
7      141.7415    32.35398
8      152.5100    32.07143
9      164.9920    32.50427
10     169.1364    32.43119
[1] 12 3
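The effect of na.rm = TRUE in the grouped summary can be seen on a toy frame. The frame `one_species_toy` and its values are hypothetical; the group_by() and summarise() calls mirror the ones in the report.

```r
library(dplyr)

# Hypothetical measurements for one species, with some missing values.
one_species_toy <- data.frame(
  month = c(1, 1, 2, 2),
  weight = c(100, NA, 150, 170),
  hindfoot_length = c(30, 32, NA, 34)
)

# One output row per month; NA values are dropped before averaging.
monthly_avg_toy <- one_species_toy %>%
  group_by(month) %>%
  summarise(avg_weight = mean(weight, na.rm = TRUE),
            avg_hindfoot_length = mean(hindfoot_length, na.rm = TRUE))

monthly_avg_toy
```

Without na.rm = TRUE, any month containing a single missing measurement would return NA for that average.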
# Grouping by species:
surveys_group_by <- surveys_combined_year %>% group_by(species)
surveys_group_by
[1] 237
[Table preview; columns: record_id <dbl>, m…, …, y…, species_id, …, hindfoot_length, weight <S3: impute>, genus <chr>, speci… <chr>]
[1] 0
[1] 503
[1] 0
[1] 0