22/09/2019 MATH2349 Semester 2, 2019

MATH2349 Semester 2, 2019


Assignment 2
Kuntachikkanahalli Srinivasa Reddy Harshitha Reddy (s3797186), Eswar Phani Paruchuri (s3798488), Sujay Kamal Madisetty (s3794983)

Setup
Installing and loading the packages below will reproduce this report:

# These packages must be loaded to generate the report.

library(dplyr)
library(knitr)
library(readr)
library(tidyr)
library(Hmisc)
library(car)
library(outliers)

Reading WHO Data:


We use the readr package's read_csv() function to import the data because it is faster than the base R function read.csv() on large data sets.
The initial data has 7,240 observations of 60 variables.

# R Chunk to import the WHO Data:

WHO <- read_csv("WHO.csv")

Parsed with column specification:
cols(
  .default = col_double(),
  country = col_character(),
  iso2 = col_character(),
  iso3 = col_character()
)
See spec(...) for full column specifications.

Tidy Task 1:
We first look at the column names in the data set.
Columns 5 to 60 hold values rather than variables, and the variable's values are spread across them, so we use the gather() function to transform the data from wide to long format.
file:///Users/harshithareddy/Documents/DataPreprocessing/DP_MATH2349_1950_Assignment_2_Template.nb.html 1/15
After gather(), the first tidy task is complete: 7,240 rows × 56 gathered columns give a data set of 405,440 rows and 6 columns.

# This is the code used for Tidy Task 1:

# Check the column names:
names(WHO)

 [1] "country"       "iso2"          "iso3"          "year"          "new_sp_m014"   "new_sp_m1524"
 [7] "new_sp_m2534"  "new_sp_m3544"  "new_sp_m4554"  "new_sp_m5564"  "new_sp_m65"    "new_sp_f014"
[13] "new_sp_f1524"  "new_sp_f2534"  "new_sp_f3544"  "new_sp_f4554"  "new_sp_f5564"  "new_sp_f65"
[19] "new_sn_m014"   "new_sn_m1524"  "new_sn_m2534"  "new_sn_m3544"  "new_sn_m4554"  "new_sn_m5564"
[25] "new_sn_m65"    "new_sn_f014"   "new_sn_f1524"  "new_sn_f2534"  "new_sn_f3544"  "new_sn_f4554"
[31] "new_sn_f5564"  "new_sn_f65"    "new_ep_m014"   "new_ep_m1524"  "new_ep_m2534"  "new_ep_m3544"
[37] "new_ep_m4554"  "new_ep_m5564"  "new_ep_m65"    "new_ep_f014"   "new_ep_f1524"  "new_ep_f2534"
[43] "new_ep_f3544"  "new_ep_f4554"  "new_ep_f5564"  "new_ep_f65"    "new_rel_m014"  "new_rel_m1524"
[49] "new_rel_m2534" "new_rel_m3544" "new_rel_m4554" "new_rel_m5564" "new_rel_m65"   "new_rel_f014"
[55] "new_rel_f1524" "new_rel_f2534" "new_rel_f3544" "new_rel_f4554" "new_rel_f5564" "new_rel_f65"

# Use gather() to reshape columns 5 to 60 into key-value pairs:
WHO_TIDY1 <- WHO %>% gather(key = "code", value = "value", 5:60)

# View the data after Task 1:
WHO_TIDY1

country      iso2   iso3   year   code         value
<chr>        <chr>  <chr>  <dbl>  <chr>        <dbl>
Afghanistan  AF     AFG    1980   new_sp_m014  NA
Afghanistan  AF     AFG    1981   new_sp_m014  NA
Afghanistan  AF     AFG    1982   new_sp_m014  NA
Afghanistan  AF     AFG    1983   new_sp_m014  NA
Afghanistan  AF     AFG    1984   new_sp_m014  NA
Afghanistan  AF     AFG    1985   new_sp_m014  NA
Afghanistan  AF     AFG    1986   new_sp_m014  NA
Afghanistan  AF     AFG    1987   new_sp_m014  NA
Afghanistan  AF     AFG    1988   new_sp_m014  NA
Afghanistan  AF     AFG    1989   new_sp_m014  NA

1-10 of 405,440 rows

# View the dimensions of WHO_TIDY1:
dim(WHO_TIDY1)

[1] 405440 6
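
The wide-to-long reshape can be tried on a toy table first. This is only a sketch, assuming the tidyr package is installed; the data frame `wide` and its values are made up for illustration and are not part of the WHO data. It shows that gathering k columns from a table of n rows yields n × k rows.

```r
library(tidyr)

# Hypothetical wide table: 2 rows, 3 count columns whose names encode values.
wide <- data.frame(
  country = c("A", "B"),
  year = c(2000, 2000),
  new_sp_m014 = c(1, NA),
  new_sp_f014 = c(2, 5),
  new_sp_m65 = c(3, 6),
  stringsAsFactors = FALSE
)

# Gather columns 3 to 5 into key-value pairs, mirroring the WHO chunk.
long <- gather(wide, key = "code", value = "value", 3:5)

# Rows multiply by the number of gathered columns: 2 * 3 = 6.
dim(long)
```

Missing counts stay in the long table as NA rows, which is why the long WHO table keeps every country-year-code combination.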

Tidy Task 2:
The code column still contains four different variables. To separate them, we first use the separate() function to split the column on the special character "_", which gives the columns new, var, and sex_age.
Because sex and age are not separated by any special character, we use mutate() with substr() to take the first character of sex_age as sex and the remaining characters as age.
The required columns are then selected with select(), in the requested order, and stored in another data set, WHO_TIDY2_FINAL.
The reshaped data is now in the correct format, with 405,440 rows and 9 columns.

# This is the code used for Tidy Task 2:

# separate() splits the code column on "_":
WHO_TIDY2 <- WHO_TIDY1 %>%
  separate(code, into = c("new", "var", "sex_age"), sep = "_")

# mutate() and substr() split sex_age into sex (first character) and age (the rest):
WHO_TIDY2_AGE <- WHO_TIDY2 %>%
  mutate(age = substr(sex_age, 2, 7), sex = substr(sex_age, 1, 1))

# The final reshaped data set with 405,440 rows and 9 columns:
WHO_TIDY2_FINAL <- WHO_TIDY2_AGE %>%
  select(country, iso2, iso3, year, new, var, sex, age, value)

# View the data after Task 2:
WHO_TIDY2_FINAL

country      iso2   iso3   year   new    var    sex    age    value
<chr>        <chr>  <chr>  <dbl>  <chr>  <chr>  <chr>  <chr>  <dbl>
Afghanistan  AF     AFG    1980   new    sp     m      014    NA
Afghanistan  AF     AFG    1981   new    sp     m      014    NA
Afghanistan  AF     AFG    1982   new    sp     m      014    NA
Afghanistan  AF     AFG    1983   new    sp     m      014    NA
Afghanistan  AF     AFG    1984   new    sp     m      014    NA
Afghanistan  AF     AFG    1985   new    sp     m      014    NA
Afghanistan  AF     AFG    1986   new    sp     m      014    NA
Afghanistan  AF     AFG    1987   new    sp     m      014    NA
Afghanistan  AF     AFG    1988   new    sp     m      014    NA
Afghanistan  AF     AFG    1989   new    sp     m      014    NA

1-10 of 405,440 rows

# View the dimensions of WHO_TIDY2_FINAL:
dim(WHO_TIDY2_FINAL)

[1] 405440 9
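
The splitting logic can be checked on a few codes in base R. This is a sketch rather than the notebook's own code: `codes` is a hypothetical sample in the format of the WHO column names, and strsplit() stands in for separate() so that no extra package is needed.

```r
# Hypothetical sample of codes in the format of the WHO column names.
codes <- c("new_sp_m014", "new_ep_f1524", "new_rel_m65")

# Split on "_" (the job separate() does), giving three pieces per code.
parts <- do.call(rbind, strsplit(codes, "_", fixed = TRUE))
colnames(parts) <- c("new", "var", "sex_age")

# The first character is the sex; the remaining characters are the age group.
sex <- substr(parts[, "sex_age"], 1, 1)
age <- substring(parts[, "sex_age"], 2)

data.frame(parts[, c("new", "var")], sex = sex, age = age)
```

Using substring() with only a start position avoids hard-coding an end position for the age code.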

Tidy Task 3:
Using the spread() function, we generate columns from rows. The rel, ep, sn, and sp keys need their own columns because we treat each of them as a separate variable. The code is given below.
Spreading four keys divides the row count by four (405,440 / 4 = 101,360) and replaces the var and value columns with four new ones, so the final reshaped data has 101,360 rows and 11 columns.

# This is the code used for Tidy Task 3:

# spread() creates one column for each distinct value in the 'var' column:
WHO_TIDY_TASK3 <- spread(WHO_TIDY2_FINAL, key = var, value = value)

# View the data after Task 3:
WHO_TIDY_TASK3

country      iso2   iso3   year   new    sex    age    ep     rel    sn
<chr>        <chr>  <chr>  <dbl>  <chr>  <chr>  <chr>  <dbl>  <dbl>  <dbl>
Afghanistan  AF     AFG    1980   new    m      014    NA     NA     NA
Afghanistan  AF     AFG    1981   new    m      014    NA     NA     NA
Afghanistan  AF     AFG    1982   new    m      014    NA     NA     NA
Afghanistan  AF     AFG    1983   new    m      014    NA     NA     NA
Afghanistan  AF     AFG    1984   new    m      014    NA     NA     NA
Afghanistan  AF     AFG    1985   new    m      014    NA     NA     NA
Afghanistan  AF     AFG    1986   new    m      014    NA     NA     NA
Afghanistan  AF     AFG    1987   new    m      014    NA     NA     NA
Afghanistan  AF     AFG    1988   new    m      014    NA     NA     NA
Afghanistan  AF     AFG    1989   new    m      014    NA     NA     NA

1-10 of 101,360 rows | 1-10 of 11 columns

# View the dimensions of WHO_TIDY_TASK3:
dim(WHO_TIDY_TASK3)

[1] 101360 11
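
A toy version of this long-to-wide step makes the change in shape easy to see. This is a sketch, assuming the tidyr package is installed; the data frame `long` and its values are hypothetical.

```r
library(tidyr)

# Hypothetical long table: one row per (id, var) pair.
long <- data.frame(
  id = rep(1:3, each = 4),
  var = rep(c("ep", "rel", "sn", "sp"), times = 3),
  value = 1:12,
  stringsAsFactors = FALSE
)

# spread() turns each distinct value of 'var' into its own column.
wide <- spread(long, key = var, value = value)

dim(long)  # 12 rows, 3 columns
dim(wide)  # 3 rows, 5 columns
```

The row count shrinks by the number of distinct keys only when every id has all four keys; a missing combination would appear as NA in the wide table instead.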

Tidy Task 4:
Using mutate() together with factor(), the categorical variables sex and age are converted to factors, as shown in the code below.
For the age variable, ordered labels are created: <15, 15-24, 25-34, 35-44, 45-54, 55-64, 65>=.
The final data set for this task has 101,360 rows and 11 columns.

# This is the code used for Tidy Task 4:

# Use mutate() to convert the categorical variables sex and age to factors:
WHO_TIDY_TASK4 <- WHO_TIDY_TASK3 %>%
  mutate(sex = factor(sex, levels = c("m", "f")), age = factor(age))

# Order and relabel the age groups as requested:
WHO_TIDY_TASK4$age <- ordered(WHO_TIDY_TASK4$age,
  levels = c("014", "1524", "2534", "3544", "4554", "5564", "65"),
  labels = c("<15", "15-24", "25-34", "35-44", "45-54", "55-64", "65>="))

# Check the variables sex and age:
str(WHO_TIDY_TASK4)

Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 101360 obs. of 11 variables:


$ country: chr "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
$ iso2 : chr "AF" "AF" "AF" "AF" ...
$ iso3 : chr "AFG" "AFG" "AFG" "AFG" ...
$ year : num 1980 1981 1982 1983 1984 ...
$ new : chr "new" "new" "new" "new" ...
$ sex : Factor w/ 2 levels "m","f": 1 1 1 1 1 1 1 1 1 1 ...
$ age : Ord.factor w/ 7 levels "<15"<"15-24"<..: 1 1 1 1 1 1 1 1 1 1 ...
$ ep : num NA NA NA NA NA NA NA NA NA NA ...
$ rel : num NA NA NA NA NA NA NA NA NA NA ...
$ sn : num NA NA NA NA NA NA NA NA NA NA ...
$ sp : num NA NA NA NA NA NA NA NA NA NA ...

#Viewing the final data set with 101,360 rows and 11 columns:
WHO_TIDY_TASK4

country      iso2   iso3   year   new    sex     age    ep     rel    sn
<chr>        <chr>  <chr>  <dbl>  <chr>  <fctr>  <ord>  <dbl>  <dbl>  <dbl>
Afghanistan  AF     AFG    1980   new    m       <15    NA     NA     NA
Afghanistan  AF     AFG    1981   new    m       <15    NA     NA     NA
Afghanistan  AF     AFG    1982   new    m       <15    NA     NA     NA
Afghanistan  AF     AFG    1983   new    m       <15    NA     NA     NA
Afghanistan  AF     AFG    1984   new    m       <15    NA     NA     NA
Afghanistan  AF     AFG    1985   new    m       <15    NA     NA     NA
Afghanistan  AF     AFG    1986   new    m       <15    NA     NA     NA
Afghanistan  AF     AFG    1987   new    m       <15    NA     NA     NA
Afghanistan  AF     AFG    1988   new    m       <15    NA     NA     NA
Afghanistan  AF     AFG    1989   new    m       <15    NA     NA     NA

1-10 of 101,360 rows | 1-10 of 11 columns

# View the dimensions of WHO_TIDY_TASK4:
dim(WHO_TIDY_TASK4)

[1] 101360 11
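
The effect of explicit levels and labels can be seen on a short vector in base R. This is a minimal sketch; `age_raw` is a hypothetical sample, deliberately out of order, and the level and label vectors are the ones used in the chunk above.

```r
# Hypothetical sample of raw age codes, deliberately out of order.
age_raw <- c("65", "014", "2534", "1524")

age <- factor(
  age_raw,
  levels = c("014", "1524", "2534", "3544", "4554", "5564", "65"),
  labels = c("<15", "15-24", "25-34", "35-44", "45-54", "55-64", "65>="),
  ordered = TRUE
)

as.character(age)  # "65>=" "<15" "25-34" "15-24"
age < "25-34"      # comparisons follow the level order
sort(age)          # sorted from youngest to oldest group
```

Passing `levels =` and `labels =` as named arguments keeps the mapping between raw codes and labels explicit, and levels that do not occur in the data (here 35-44 to 55-64) are still kept in the factor.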

Task 5: Filter & Select


Using the select() function, only the required columns are kept and stored in the country data set, as shown below, which drops the redundant 'iso2' and 'new' columns.
From this tidy version of the WHO data set, three countries ('Afghanistan', 'Sri Lanka', and 'India') are filtered and stored in a subset data frame named 'WHO_Subset'.
The dim() function then gives the dimensions of the subset: 1,428 rows and 9 columns.

# This is the code used for Task 5:

# Check the column names:
names(WHO_TIDY_TASK4)

 [1] "country" "iso2"    "iso3"    "year"    "new"     "sex"     "age"     "ep"      "rel"
[10] "sn"      "sp"

# Use select() to keep only the necessary columns:
country <- WHO_TIDY_TASK4 %>% select(country, iso3, year, sex, age, ep, rel, sn, sp)

# Create the WHO subset with the three countries:
WHO_Subset <- country %>% filter(country %in% c("Afghanistan", "Sri Lanka", "India"))

# View the dimensions of WHO_Subset:
dim(WHO_Subset)

[1] 1428 9
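
Because %in% silently ignores names that do not occur in the data, it can be worth confirming that every requested country was matched. The following base R sketch illustrates the check on a hypothetical miniature table `tidy`; it is not part of the original notebook.

```r
# Hypothetical miniature of the tidy data.
tidy <- data.frame(
  country = c("Afghanistan", "India", "Sri Lanka", "Nepal", "India"),
  year = c(1980, 1980, 1980, 1980, 1981),
  sp = c(NA, 10, 3, 2, 12),
  stringsAsFactors = FALSE
)

wanted <- c("Afghanistan", "Sri Lanka", "India")

# Base R analogue of filter(country %in% wanted).
subset_rows <- tidy[tidy$country %in% wanted, ]

# Names in 'wanted' that matched no row; empty when all are present.
setdiff(wanted, unique(subset_rows$country))
table(subset_rows$country)
```

A misspelled country name would show up in the setdiff() result instead of quietly shrinking the subset.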

# View the final subset with 1,428 rows and 9 columns:
WHO_Subset

country      iso3   year   sex     age    ep     rel    sn     sp
<chr>        <chr>  <dbl>  <fctr>  <ord>  <dbl>  <dbl>  <dbl>  <dbl>
Afghanistan  AFG    1980   m       <15    NA     NA     NA     NA
Afghanistan  AFG    1981   m       <15    NA     NA     NA     NA
Afghanistan  AFG    1982   m       <15    NA     NA     NA     NA
Afghanistan  AFG    1983   m       <15    NA     NA     NA     NA
Afghanistan  AFG    1984   m       <15    NA     NA     NA     NA
Afghanistan  AFG    1985   m       <15    NA     NA     NA     NA
Afghanistan  AFG    1986   m       <15    NA     NA     NA     NA
Afghanistan  AFG    1987   m       <15    NA     NA     NA     NA
Afghanistan  AFG    1988   m       <15    NA     NA     NA     NA
Afghanistan  AFG    1989   m       <15    NA     NA     NA     NA

1-10 of 1,428 rows

Read Species and Surveys data sets


The species and surveys data are read and stored in their respective data sets, as shown below.

# This is the code used to read species and surveys data:

#Species Data:
Species_data <- read_csv("species.csv")

Parsed with column specification:
cols(
  species_id = col_character(),
  genus = col_character(),
  species = col_character(),
  taxa = col_character()
)

#Surveys_Data:
surveys_data <- read_csv("surveys.csv")

Parsed with column specification:
cols(
  record_id = col_double(),
  month = col_double(),
  day = col_double(),
  year = col_double(),
  species_id = col_character(),
  sex = col_character(),
  hindfoot_length = col_double(),
  weight = col_double()
)

Task 6: Join
We combine the 'surveys' and 'species' data on the key variable 'species_id' using a left join, which adds the species information to the surveys data and creates a new combined data frame, 'Surveys_combined', as shown below.
The dimensions of the new data frame are also shown below.

# This is the code used for Task 6:

# Add the species information to the surveys data using a left join:
Surveys_combined <- surveys_data %>% left_join(Species_data, by = "species_id")
Surveys_combined

record_id  m…     …      y…     species_id  …      hindfoot_length  wei…   genus        species
<dbl>      <dbl>  <dbl>  <dbl>  <chr>       <chr>  <dbl>            <dbl>  <chr>        <chr>
1          7      16     1977   NL          M      32               NA     Neotoma      albigula
2          7      16     1977   NL          M      33               NA     Neotoma      albigula
3          7      16     1977   DM          F      37               NA     Dipodomys    merriami
4          7      16     1977   DM          M      36               NA     Dipodomys    merriami
5          7      16     1977   DM          M      35               NA     Dipodomys    merriami
6          7      16     1977   PF          M      14               NA     Perognathus  flavus
7          7      16     1977   PE          F      NA               NA     Peromyscus   eremicus
8          7      16     1977   DM          M      37               NA     Dipodomys    merriami
9          7      16     1977   DM          F      34               NA     Dipodomys    merriami
10         7      16     1977   PF          F      20               NA     Perognathus  flavus

1-10 of 35,549 rows | 1-10 of 11 columns

# View the dimensions of the new data frame:
dim(Surveys_combined)

[1] 35549 11
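
The behaviour of a left join can be illustrated with base R's merge(), used here as a package-free analogue of left_join(). This is a sketch on hypothetical miniature tables; the species code "ZZ" is made up to show what happens to a record with no match.

```r
# Hypothetical miniatures of the two tables.
surveys <- data.frame(
  record_id = 1:4,
  species_id = c("NL", "DM", "ZZ", "DM"),
  stringsAsFactors = FALSE
)
species <- data.frame(
  species_id = c("NL", "DM"),
  genus = c("Neotoma", "Dipodomys"),
  stringsAsFactors = FALSE
)

# all.x = TRUE keeps every survey row, like a left join; unmatched rows get NA.
combined <- merge(surveys, species, by = "species_id", all.x = TRUE)
combined <- combined[order(combined$record_id), ]

nrow(combined) == nrow(surveys)  # TRUE when the key is unique in 'species'
sum(is.na(combined$genus))       # records without species information
```

Comparing the row count before and after the join is a quick way to detect duplicate keys in the lookup table, which would otherwise multiply survey records.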

Task 7: Calculate
Here we filter the data to a single species (albigula).
We then calculate the average weight and hindfoot length of the selected species for each month.
The argument na.rm = TRUE excludes the missing values from the calculation.

# This is the code used for Task 7:

# Filter for one particular species:
Surveys_combined_Task7 <- Surveys_combined %>% filter(species == "albigula")

# Calculate the average weight and hindfoot length of the selected species
# observed in each month, excluding the NA values:
Surveys_combined_avg <- Surveys_combined_Task7 %>%
  group_by(month) %>%
  summarise(avg_weight = mean(weight, na.rm = TRUE),
            avg_hindfoot_length = mean(hindfoot_length, na.rm = TRUE))

Surveys_combined_avg

month  avg_weight  avg_hindfoot_length
<dbl>  <dbl>       <dbl>
1      179.3443    32.54098
2      181.3818    32.82353
3      177.4516    32.75862
4      153.0690    32.02439
5      142.7536    31.60000
6      143.7879    32.18889
7      141.7415    32.35398
8      152.5100    32.07143
9      164.9920    32.50427
10     169.1364    32.43119

1-10 of 12 rows

# View the dimensions of Surveys_combined_avg:
dim(Surveys_combined_avg)

[1] 12 3
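
The role of na.rm = TRUE in a grouped mean can be seen on a few rows. The sketch below uses base R's tapply() as a stand-in for group_by() and summarise(); the data frame `obs` and its values are hypothetical.

```r
# Hypothetical records for one species.
obs <- data.frame(
  month = c(1, 1, 2, 2, 2),
  weight = c(180, NA, 150, 170, NA),
  hindfoot_length = c(32, 34, NA, 31, 33)
)

mean(obs$weight)  # NA, because missing values propagate by default

# Mean per month, dropping missing values within each column separately.
avg_weight <- tapply(obs$weight, obs$month, mean, na.rm = TRUE)
avg_hindfoot <- tapply(obs$hindfoot_length, obs$month, mean, na.rm = TRUE)

avg <- data.frame(
  month = as.numeric(names(avg_weight)),
  avg_weight = as.vector(avg_weight),
  avg_hindfoot_length = as.vector(avg_hindfoot)
)
avg
```

Each average uses only the non-missing values of its own column, so a record with a missing weight still contributes its hindfoot length.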

Task 8: Missing Values


A data set surveys_combined_year is created with 1977 as the selected year.
The data frame is then grouped by species.
The missing values in weight are counted using the is.na() function.
The missing values are replaced with the mean weight, and the result is stored as the imputed data set. Note that impute() is applied to the whole weight column, so the grouping does not affect it: every missing value receives the overall 1977 mean (46.65038) rather than the mean of its own species, as the output confirms.

# This is the code used for Task 8:

# Select one year and store it in a new data frame:
surveys_combined_year <- Surveys_combined %>% filter(year == 1977)
surveys_combined_year

record_id  m…     …      y…     species_id  …      hindfoot_length  wei…   genus        species
<dbl>      <dbl>  <dbl>  <dbl>  <chr>       <chr>  <dbl>            <dbl>  <chr>        <chr>
1          7      16     1977   NL          M      32               NA     Neotoma      albigula
2          7      16     1977   NL          M      33               NA     Neotoma      albigula
3          7      16     1977   DM          F      37               NA     Dipodomys    merriami
4          7      16     1977   DM          M      36               NA     Dipodomys    merriami
5          7      16     1977   DM          M      35               NA     Dipodomys    merriami
6          7      16     1977   PF          M      14               NA     Perognathus  flavus
7          7      16     1977   PE          F      NA               NA     Peromyscus   eremicus
8          7      16     1977   DM          M      37               NA     Dipodomys    merriami
9          7      16     1977   DM          F      34               NA     Dipodomys    merriami
10         7      16     1977   PF          F      20               NA     Perognathus  flavus

1-10 of 503 rows | 1-10 of 11 columns

#grouping by species:
surveys_group_by <- surveys_combined_year %>% group_by(species)
surveys_group_by

record_id  m…     …      y…     species_id  …      hindfoot_length  wei…   genus        species
<dbl>      <dbl>  <dbl>  <dbl>  <chr>       <chr>  <dbl>            <dbl>  <chr>        <chr>
1          7      16     1977   NL          M      32               NA     Neotoma      albigula
2          7      16     1977   NL          M      33               NA     Neotoma      albigula
3          7      16     1977   DM          F      37               NA     Dipodomys    merriami
4          7      16     1977   DM          M      36               NA     Dipodomys    merriami
5          7      16     1977   DM          M      35               NA     Dipodomys    merriami
6          7      16     1977   PF          M      14               NA     Perognathus  flavus
7          7      16     1977   PE          F      NA               NA     Peromyscus   eremicus
8          7      16     1977   DM          M      37               NA     Dipodomys    merriami
9          7      16     1977   DM          F      34               NA     Dipodomys    merriami
10         7      16     1977   PF          F      20               NA     Perognathus  flavus

1-10 of 503 rows | 1-10 of 11 columns

# Count the missing values in the weight column:
sum(is.na(surveys_group_by$weight))

[1] 237

# Replace the missing values in the weight column with the mean.
# impute() works on the whole vector, so the grouping is not used here:
surveys_combined_year$weight <- impute(surveys_group_by$weight, fun = mean)

# Store the data in surveys_weight_imputed:
surveys_weight_imputed <- surveys_combined_year
surveys_weight_imputed

record_id  m…     …      y…     species_id  …      hindfoot_length  weight        genus        speci
<dbl>      <dbl>  <dbl>  <dbl>  <chr>       <chr>  <dbl>            <S3: impute>  <chr>        <chr>
1          7      16     1977   NL          M      32               46.65038      Neotoma      albigu
2          7      16     1977   NL          M      33               46.65038      Neotoma      albigu
3          7      16     1977   DM          F      37               46.65038      Dipodomys    merria
4          7      16     1977   DM          M      36               46.65038      Dipodomys    merria
5          7      16     1977   DM          M      35               46.65038      Dipodomys    merria
6          7      16     1977   PF          M      14               46.65038      Perognathus  flavus
7          7      16     1977   PE          F      NA               46.65038      Peromyscus   eremic
8          7      16     1977   DM          M      37               46.65038      Dipodomys    merria
9          7      16     1977   DM          F      34               46.65038      Dipodomys    merria
10         7      16     1977   PF          F      20               46.65038      Perognathus  flavus

1-10 of 503 rows | 1-10 of 11 columns

# Check that no missing values remain in the imputed data set:
sum(is.na(surveys_weight_imputed$weight))

[1] 0
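
The difference between imputing with the overall mean and imputing with a per-species mean can be sketched in base R. This is an illustration under stated assumptions, not the notebook's method: the data frame `obs` is hypothetical, and ave() is used as one possible way to apply a function within each species.

```r
# Hypothetical records: two species with very different weights.
obs <- data.frame(
  species = c("albigula", "albigula", "albigula", "flavus", "flavus", "flavus"),
  weight = c(150, NA, 170, 8, 10, NA),
  stringsAsFactors = FALSE
)

# Overall-mean imputation: every NA gets the same value.
overall <- obs$weight
overall[is.na(overall)] <- mean(obs$weight, na.rm = TRUE)

# Per-species imputation: ave() applies the function within each species.
by_species <- ave(obs$weight, obs$species, FUN = function(w) {
  w[is.na(w)] <- mean(w, na.rm = TRUE)
  w
})

cbind(obs, overall, by_species)
```

With the overall mean, the small species is assigned a weight far above any of its observed values. One caveat of the per-species approach: a species whose weights are all missing has no mean to use, so its values become NaN and need separate handling.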

Task 9: Special Values


The code below checks the imputed weight column for special values (NaN, Inf, -Inf). No special values are present in the selected data: all 503 values are finite.

# This is the code used for Task 9:

# Check for special values such as NaN, Inf, and -Inf.

# Count the finite values:
sum(is.finite(surveys_weight_imputed$weight))

[1] 503

# Count the infinite values:
sum(is.infinite(surveys_weight_imputed$weight))

[1] 0

# Count the NaN (undefined) values:
sum(is.nan(surveys_weight_imputed$weight))

[1] 0
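
How the three checks relate to each other is easiest to see on a vector that contains every kind of special value. The vector `x` below is hypothetical and only serves as a sketch.

```r
# Hypothetical vector containing each kind of special value.
x <- c(1.5, NA, NaN, Inf, -Inf, 42)

counts <- c(
  finite = sum(is.finite(x)),
  infinite = sum(is.infinite(x)),
  nan = sum(is.nan(x)),
  na = sum(is.na(x))  # is.na() is TRUE for both NA and NaN
)
counts
```

Since is.finite() is FALSE for NA, NaN, Inf, and -Inf alike, a finite count equal to the number of rows (503 in the output above) already implies that none of them is present.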

Task 10: Outliers


We draw a boxplot to check for univariate outliers in hindfoot_length and then identify the observations flagged as outliers.
We cannot tell whether these outliers come from data entry or data pre-processing errors, because we do not know the real-life species information. Deleting the outlying records altogether could therefore lead to misinterpretation, so capping is the most feasible approach: values below the lower fence (Q1 - 1.5 × IQR) are replaced with the 5th percentile and values above the upper fence (Q3 + 1.5 × IQR) with the 95th percentile, as the code shows.
After capping, the new boxplot shows all values between 0 and 60, with no outliers, as expected.

# This is the code used for Task 10:

# Draw the boxplot:
Surveys_outlier <- boxplot(Surveys_combined$hindfoot_length)

# Find the ids of the outliers (Boxplot() from the car package labels them):
Boxplot(Surveys_combined$hindfoot_length, id = TRUE)

# Remove the NA values:
surveys_combined1 <- na.omit(Surveys_combined$hindfoot_length)

# Define the capping function: values beyond the 1.5 * IQR fences are
# replaced with the 5th percentile (low side) or the 95th percentile (high side):
cap <- function(x) {
  quantiles <- quantile(x, c(0.05, 0.25, 0.75, 0.95))
  x[x < quantiles[2] - 1.5 * IQR(x)] <- quantiles[1]
  x[x > quantiles[3] + 1.5 * IQR(x)] <- quantiles[4]
  x
}

# Cap the outliers using the cap() function:
surveys_outliers <- surveys_combined1 %>% cap()

#Re-drawing the final boxplot to further check for outliers:


boxplot(surveys_outliers)
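
The capping step can also be verified numerically rather than only by eye. The sketch below restates the capping function and applies it to a hypothetical vector `x` with one low and one high outlier; boxplot.stats() is used to count the points a boxplot would flag.

```r
cap <- function(x) {
  q <- quantile(x, c(0.05, 0.25, 0.75, 0.95))
  iqr <- IQR(x)
  x[x < q[2] - 1.5 * iqr] <- q[1]  # below the lower fence -> 5th percentile
  x[x > q[3] + 1.5 * iqr] <- q[4]  # above the upper fence -> 95th percentile
  x
}

# Hypothetical lengths: 98 typical values plus one low and one high outlier.
x <- c(1, rep(30:36, each = 14), 70)
capped <- cap(x)

length(boxplot.stats(x)$out)       # outliers before capping: 2
length(boxplot.stats(capped)$out)  # outliers after capping: 0
range(capped)                      # 30 36
```

In small or strongly skewed samples the 5th or 95th percentile can itself fall outside the fences, in which case capped values are still flagged, so re-checking after capping is a sensible habit.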
