Sunteți pe pagina 1din 36

4- Data cleansing in R

Dr Akhter Raza
Review 1

Difference between parametric and non-parametric


statistics?

Business Statistics: Data Cleaning 2


Review 2

Difference between descriptive and inferential


statistics?

Business Statistics: Data Cleaning 3


Review 3

Difference between Parameter and statistic?

Business Statistics: Data Cleaning 4


Data Munging

Sometimes referred to as data wrangling is the


process of transforming and mapping raw data into
more appropriate and valuable form for the
purpose of analytics.

Business Statistics: Data Cleaning 5


Steps in data preparation
• Check for sensitive data
• Check for missing columns
• Check variables names
• Check missing observations
• Check variable classification
• Check misspellings/extra spaces
• Check numeric data distribution
• Check duplicate rows
• Check statistical assumptions
Business Statistics: Data Cleaning 6
Steps in data preparation

Business Statistics: Data Cleaning 7


Missing cases
One of the big issue in data is
i) NA
ii) NaN
iii) Inf

NA’s are the missing casses


NaN are not a number
Inf are the division by zero

Business Statistics: Data Cleaning 8


Function to be used in cleansing
head()
tail()
is.na()
any(is.na())
colSums(is.na())
na.omit()
complete.cases()

9
Business Statistics: Data Cleaning
Exploring and handling NA’s
The airquality data set is used for this
purpose. This set is found in Base R
df <- airquality
str(df)
this data contains 153 observations of
6 variables
is.na(df)
10
Business Statistics: Data Cleaning
Exploring and handling NA’s
Now we are deliberately creating NA’s
in data.
Add new column and a row full of NA’s
df[,7] <- c(NA)
df[154,] <- c(NA)
any(is.na(df))
is.na(df) 11
Business Statistics: Data Cleaning
Exploring and handling NA’s
Removing column number 7 because it
is full of NA's
df <- df[,-7]
str(df)
Removing last row
df <- df[-154,]
str(df) 12
Business Statistics: Data Cleaning
Exploring and handling NA’s
any(is.na(df))
How many total NA's are there
sum(is.na(df))
Now we check each column for na's
sum(is.na(df$Solar.R))

13
Business Statistics: Data Cleaning
Exploring and handling NA’s
instead of checking columns 1 by 1 for
NA’s we can use colSums function
colSums(is.na(df))
This shows that majority of NA’s are in
first column which is 37 and there are
7 missing cases in column 2 rest of the
columns are full and doesn’t have NA’s
14
Business Statistics: Data Cleaning
Exploring and handling NA’s
na.omit function can be used to
remove all missing cases
df.clean <- na.omit(df)
Most na's are in first column which are
37 if this column does not plays any
important role in data analysis then we
can omit this column
15
Business Statistics: Data Cleaning
Exploring and handling NA’s
we will remove na’s this will enhance
our sample size
df.clean2 <- na.omit(df[,-1])
nrow(df.clean2)

df.clean contains 111 rows


df.clean2 contain 146 rows 16
Business Statistics: Data Cleaning
Exploring and handling NA’s
We can implement a rule of keeping all
those columns in which NA’s are less
than 10

df.clean3 <- df[, colSums(is.na(df))<10]

nrow(df.clean3)
17
Business Statistics: Data Cleaning
Exploring and handling NA’s
mean, median and standard deviation
results in NA if variable having NA
mean(airquality$Solar.R)
median(airquality$Solar.R)
sd(airquality$Solar.R)
All three results are NA's
18
Business Statistics: Data Cleaning
Exploring and handling NA’s
To find mean and sd of remaining
values we use following

mean(!is.na(airquality$Solar.R))
sd(!is.na(airquality$Solar.R))

19
Business Statistics: Data Cleaning
Imputing NA’s
instead of deleting missing rows we
can impute them by mean or by
median

df.meanImputed <- df
df.medianImputed <- df

20
Business Statistics: Data Cleaning
Imputing NA’s
All NA’s are replaced by mean of the
rest of data

df.meanImputed$Solar.R[is.na(df.mean
Imputed$Solar.R)] <-
mean(!is.na(df.meanImputed$Solar.R))

Business Statistics: Data Cleaning 21


Imputing NA’s
All NA’s are replaced by median

df.medianImputed$Solar.R[is.na(df.me
dianImputed$Solar.R)] <-
median(!is.na(df.medianImputed$Solar
.R))

22
Business Statistics: Data Cleaning
Imputing NA’s
now we check is there any na in solar.r
of the two data frames

any(is.na(df.meanImputed$Solar.R))
any(is.na(df.medianImputed$Solar.R))

23
Business Statistics: Data Cleaning
Removing outliers
str(df.clean2)
boxplot(df.clean2$Temp)
No outlier in Temp variable
boxplot(df.clean2$Wind)
There are three outliers in the Wind
variable
summary(df.clean2$Wind) 24
Business Statistics: Data Cleaning
Removing outliers
There are three outliers in the Wind
variable
summary(df.clean2$Wind)
Q1=quantile(df.clean2$Wind,0.25)
Q3=quantile(df.clean2$Wind,0.75)
IQR_wind=Q3-Q1
25
Business Statistics: Data Cleaning
Removing outliers
# there is a direct function of IQR
# IQR(variablename)

upFenceWind <- Q3 + 1.5 * IQR_wind


df.clean4 <- subset(df.clean2,Wind <=
upFenceWind)
26
Business Statistics: Data Cleaning
Removing outliers
Now we can check the box plot of
Wind variable in clean4
boxplot(df.clean4)

box plot of clean4 shows no outlier in


any of the variable
boxplot(df.clean4$Wind)
27
Business Statistics: Data Cleaning
Checking for duplicates
str(df.clean4)
str(unique(df.clean4))
we duplicated row 130 at the 145
position
df.clean4[145,]<- df.clean4[130,]
str(df.clean4)
df.clean4[c(130,145),] 28
Business Statistics: Data Cleaning
Checking for duplicates
Now using unique function we
eliminate this row
df.clean4Distinct <- unique(df.clean4)
str(df.clean4Distinct)
hist(df.meanImputed$Wind)
hist(df.meanImputed$Temp)
29
Business Statistics: Data Cleaning
Transformations
Histogram is showing slightly left
skewed
we can use a transformation to make it
normal
Take log(), sin(), 1/x, sqrt() of original
data and regenerate Histogram

30
Business Statistics: Data Cleaning
Training and Testing set
Splitting data 80% training and 20% testing
data
sample_data <sample(2,nrow(df.clean4),
replace = TRUE, prob = c(0.8,0.2))

test_data<- df.clean4[sample_data==1,]

31
Business Statistics: Data Cleaning
Training and Testing set
train_data<-df.clean4[sample_data ==2,]
head(test_data)
head(train_data)
str(test_data)
str(train_data)

32
Business Statistics: Data Cleaning
Sending output to file
#sink("myfile",append=FALSE, split=FALSE)
# use sink() again to stop output to file
sink("myfile", append=FALSE, split=FALSE)
str(test_data) # output to myfile
str(train_data) # output to myfile
sink() # return output to screen
33
Business Statistics: Data Cleaning
Sending graphical outputs
# graphical output to any seperate file
# pdf("mygraph.pdf") pdf file
# png("mygraph.png") png file
# jpeg("mygraph.jpg") jpeg file
# bmp("mygraph.bmp") bmp file
# postscript("mygraph.ps") postscript file
# close the output use dev.off() function
Business Statistics: Data Cleaning
34
Sending graphical outputs
# Saving output to pdf

pdf("myplot.pdf")
hist(df.meanImputed$Wind)
hist(df.meanImputed$Temp)
dev.off()
35
Business Statistics: Data Cleaning
Questions?

S-ar putea să vă placă și