04 Data Cleaning in R

4- Data cleansing in R
Dr Akhter Raza
Review 1
Difference between parametric and non-parametric

statistics?
Business Statistics: Data Cleaning 2

Review 2
Difference between descriptive and inferential

statistics?

Review 3
Difference between Parameter and statistic?

Data Munging
Sometimes referred to as data wrangling is the

process of transforming and mapping raw data into
more appropriate and valuable form for the
purpose of analytics.

Steps in data preparation
• Check for sensitive data
• Check for missing columns
• Check variables names
• Check missing observations
• Check variable classification
• Check misspellings/extra spaces
• Check numeric data distribution
• Check duplicate rows
• Check statistical assumptions
Steps in data preparation

Missing cases
One of the big issue in data is
i) NA
ii) NaN
iii) Inf
NA’s are the missing casses

NaN are not a number
Inf are the division by zero

Function to be used in cleansing
head()
tail()
is.na()
any(is.na())
colSums(is.na())
na.omit()
complete.cases()
9
Business Statistics: Data Cleaning
Exploring and handling NA’s
The airquality data set is used for this
purpose. This set is found in Base R
df <- airquality
str(df)
this data contains 153 observations of
6 variables
is.na(df)
10
Now we are deliberately creating NA’s
in data.
Add new column and a row full of NA’s
df[,7] <- c(NA)
df[154,] <- c(NA)
any(is.na(df))
is.na(df) 11
Removing column number 7 because it
is full of NA's
df <- df[,-7]
str(df)
Removing last row
df <- df[-154,]
str(df) 12
any(is.na(df))
How many total NA's are there
sum(is.na(df))
Now we check each column for na's
sum(is.na(df$Solar.R))
13
instead of checking columns 1 by 1 for
NA’s we can use colSums function
colSums(is.na(df))
This shows that majority of NA’s are in
first column which is 37 and there are
7 missing cases in column 2 rest of the
columns are full and doesn’t have NA’s
14
na.omit function can be used to
remove all missing cases
df.clean <- na.omit(df)
Most na's are in first column which are
37 if this column does not plays any
important role in data analysis then we
can omit this column
15
we will remove na’s this will enhance
our sample size
df.clean2 <- na.omit(df[,-1])
nrow(df.clean2)
df.clean contains 111 rows

df.clean2 contain 146 rows 16
We can implement a rule of keeping all
those columns in which NA’s are less
than 10
df.clean3 <- df[, colSums(is.na(df))<10]
nrow(df.clean3)
17
mean, median and standard deviation
results in NA if variable having NA
mean(airquality$Solar.R)
median(airquality$Solar.R)
sd(airquality$Solar.R)
All three results are NA's
18
To find mean and sd of remaining
values we use following
mean(!is.na(airquality$Solar.R))
sd(!is.na(airquality$Solar.R))
19
Imputing NA’s
instead of deleting missing rows we
can impute them by mean or by
median
df.meanImputed <- df
df.medianImputed <- df
20
Imputing NA’s
All NA’s are replaced by mean of the
rest of data
df.meanImputed$Solar.R[is.na(df.mean
Imputed$Solar.R)] <-
mean(!is.na(df.meanImputed$Solar.R))

Imputing NA’s
All NA’s are replaced by median
df.medianImputed$Solar.R[is.na(df.me
dianImputed$Solar.R)] <-
median(!is.na(df.medianImputed$Solar
.R))
22
Imputing NA’s
now we check is there any na in solar.r
of the two data frames
any(is.na(df.meanImputed$Solar.R))
any(is.na(df.medianImputed$Solar.R))
23
Removing outliers
str(df.clean2)
boxplot(df.clean2$Temp)
No outlier in Temp variable
boxplot(df.clean2$Wind)
There are three outliers in the Wind
variable
summary(df.clean2$Wind) 24
Removing outliers
There are three outliers in the Wind
variable
summary(df.clean2$Wind)
Q1=quantile(df.clean2$Wind,0.25)
Q3=quantile(df.clean2$Wind,0.75)
IQR_wind=Q3-Q1
25
Removing outliers
# there is a direct function of IQR
# IQR(variablename)
upFenceWind <- Q3 + 1.5 * IQR_wind

df.clean4 <- subset(df.clean2,Wind <=
upFenceWind)
26
Removing outliers
Now we can check the box plot of
Wind variable in clean4
boxplot(df.clean4)
box plot of clean4 shows no outlier in

any of the variable
boxplot(df.clean4$Wind)
27
Checking for duplicates
str(df.clean4)
str(unique(df.clean4))
we duplicated row 130 at the 145
position
df.clean4[145,]<- df.clean4[130,]
str(df.clean4)
df.clean4[c(130,145),] 28
Checking for duplicates
Now using unique function we
eliminate this row
df.clean4Distinct <- unique(df.clean4)
str(df.clean4Distinct)
hist(df.meanImputed$Wind)
hist(df.meanImputed$Temp)
29
Transformations
Histogram is showing slightly left
skewed
we can use a transformation to make it
normal
Take log(), sin(), 1/x, sqrt() of original
data and regenerate Histogram
30
Training and Testing set
Splitting data 80% training and 20% testing
data
sample_data <sample(2,nrow(df.clean4),
replace = TRUE, prob = c(0.8,0.2))
test_data<- df.clean4[sample_data==1,]
31
Training and Testing set
train_data<-df.clean4[sample_data ==2,]
head(test_data)
head(train_data)
str(test_data)
str(train_data)
32
Sending output to file
#sink("myfile",append=FALSE, split=FALSE)
# use sink() again to stop output to file
sink("myfile", append=FALSE, split=FALSE)
str(test_data) # output to myfile
str(train_data) # output to myfile
sink() # return output to screen
33
Sending graphical outputs
# graphical output to any seperate file
# pdf("mygraph.pdf") pdf file
# png("mygraph.png") png file
# jpeg("mygraph.jpg") jpeg file
# bmp("mygraph.bmp") bmp file
# postscript("mygraph.ps") postscript file
# close the output use dev.off() function
34
Sending graphical outputs
# Saving output to pdf
pdf("myplot.pdf")
hist(df.meanImputed$Wind)
hist(df.meanImputed$Temp)
dev.off()
35
Questions?

04 Data Cleaning in R

Încărcat de

Informații document

Drepturi de autor

Formate disponibile

Partajați acest document

Partajați sau inserați document

Opțiuni de partajare

Vi se pare util acest document?

Este necorespunzător acest conținut?

Drepturi de autor:

Formate disponibile

04 Data Cleaning in R

Încărcat de

Drepturi de autor:

Formate disponibile

4- Data cleansing in R

Difference between parametric and non-parametric

Business Statistics: Data Cleaning 2

Difference between descriptive and inferential

Business Statistics: Data Cleaning 3

Difference between Parameter and statistic?

Business Statistics: Data Cleaning 4

Sometimes referred to as data wrangling is the

Business Statistics: Data Cleaning 5

Business Statistics: Data Cleaning 7

NA’s are the missing casses

Business Statistics: Data Cleaning 8

df.clean contains 111 rows

df.clean3 <- df[, colSums(is.na(df))<10]

Business Statistics: Data Cleaning 21

upFenceWind <- Q3 + 1.5 * IQR_wind

box plot of clean4 shows no outlier in

S-ar putea să vă placă și