Exploration in R
TAVISH SRIVASTAVA, APRIL 26, 2015
Introduction
Here are the operations I’ll cover in this article (refer to this article for similar operations in
SAS):
1.
Code:
All other Read commands are similar to the one mentioned above.
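As a minimal sketch of what such a read command can look like, the example below uses base R's read.csv and the more general read.table. The file and its contents are hypothetical and are written to a temporary location only so that the example runs on its own.

```r
# Write a small hypothetical CSV to a temporary file (illustration only)
csv_path <- tempfile(fileext = ".csv")
writeLines(c("id,score", "1,72", "2,85", "3,64"), csv_path)

# read.csv assumes a header row and comma-separated values by default
mydata <- read.csv(csv_path)
str(mydata)

# read.table is the general form; the header and separator are stated explicitly
mydata2 <- read.table(csv_path, header = TRUE, sep = ",")
str(mydata2)
```

For tab-delimited files, read.delim follows the same pattern with a tab as the default separator.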
Use is.xyz to test for data type xyz; it returns TRUE or FALSE.
Use as.xyz to explicitly convert a value to type xyz.
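A short sketch of the is.xyz / as.xyz pattern, using made-up values. The factor case is included because it is a common pitfall: as.numeric on a factor returns the internal level codes, not the labels.

```r
x <- c("1", "2", "3")        # numbers stored as text
is.numeric(x)                # FALSE
is.character(x)              # TRUE

x_num <- as.numeric(x)       # explicit conversion to numeric
is.numeric(x_num)            # TRUE

f <- factor(c("10", "20", "10"))
as.numeric(f)                # level codes: 1 2 1
as.numeric(as.character(f))  # the actual values: 10 20 10
```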
However, conversion of data structure is more critical than the format transformation. Here
is a grid which will guide you with format conversion:
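As an illustration of structure conversion (not a reproduction of the grid), the sketch below moves the same values between a vector, a matrix and a data frame using base R functions. The variable names are made up for the example.

```r
v <- 1:6

m  <- matrix(v, nrow = 2)   # vector -> matrix (filled column by column)
df <- as.data.frame(m)      # matrix -> data frame
m2 <- as.matrix(df)         # data frame -> matrix
v2 <- as.vector(m2)         # matrix -> vector (dimensions and names dropped)

class(df)
dim(m2)
v2
```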
Code
Code
# sort by var1
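Sorting a data frame by one variable can be sketched with order(), which returns the row positions in sorted order. The data frame below is hypothetical; only the name var1 comes from the comment above.

```r
# Hypothetical data frame for illustration
mydata <- data.frame(var1 = c(3, 1, 2), var2 = c("c", "a", "b"))

# Ascending sort by var1
sorted_asc <- mydata[order(mydata$var1), ]

# Descending sort by var1 (negate the numeric key)
sorted_desc <- mydata[order(-mydata$var1), ]

sorted_asc
sorted_desc
```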
hist(score)
Let’s try to find the assumptions R takes to plot this histogram, and then modify a few of
those assumptions.
histinfo<-hist(score)
histinfo
$breaks
$counts
$density
[1] 0.0002 0.0005 0.0019 0.0052 0.0084 0.0141 0.0195 0.0201 0.0152
[10] 0.0081 0.0039 0.0025 0.0003 0.0001
$mids
$xname
[1] "score"
$equidist
[1] TRUE
attr(,"class")
[1] "histogram"
As you can see, the breaks are applied at multiple points. We can restrict the number of break
points or vary the density. Over and above this, we can colour the bar plot and overlay a
normal distribution curve. Here is how you can do all this:
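A minimal sketch of these adjustments with base graphics follows. The score vector is simulated here as an assumption, since the original data is not available; the number of breaks and the colours are arbitrary choices.

```r
# Simulated stand-in for the score variable (assumption for illustration)
set.seed(42)
score <- rnorm(1000, mean = 70, sd = 10)

# breaks is treated by hist() as a suggestion for the number of cells;
# freq = FALSE plots densities instead of counts
histinfo <- hist(score, breaks = 20, freq = FALSE,
                 col = "lightblue", main = "Histogram of score", xlab = "score")

# Overlay a normal curve with the sample mean and standard deviation
curve(dnorm(x, mean = mean(score), sd = sd(score)),
      add = TRUE, col = "darkblue", lwd = 2)
```

Plotting densities rather than counts keeps the bars and the overlaid curve on the same scale.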
attach(iris)
table(iris$Species)
Here is code that produces a cross-tabulation of two categorical variables:
library(gmodels)
CrossTable(mydata$myrowvar, mydata$mycolvar)
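If gmodels is not installed, a similar cross-tabulation can be sketched with base R's table() and prop.table(). The example uses the built-in mtcars data as a stand-in for mydata, which is an assumption made only so that the code runs as written.

```r
# Counts of cars by number of cylinders (rows) and gears (columns)
xt <- table(mtcars$cyl, mtcars$gear, dnn = c("cyl", "gear"))
xt

# Row proportions: each row sums to 1
row_props <- prop.table(xt, margin = 1)
round(row_props, 2)
```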
This code will simply take out a random sample of 100 observations from the table mydata.
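One way to draw such a sample is to pick row indices with sample(). This is a minimal sketch that uses the built-in iris data as a stand-in for mydata (an assumption for illustration); the seed is set only to make the result reproducible.

```r
set.seed(1)
mydata <- iris   # stand-in data with 150 rows

# sample() draws 100 distinct row numbers (no replacement by default)
mysample <- mydata[sample(nrow(mydata), 100), ]

nrow(mysample)   # 100
```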
> x
[1] 2 10 6 8 9 11 14 12 11 6 10 0 10 7 7 20 11 17 12 -1
> unique(x)
[1] 2 10 6 8 9 11 14 12 0 7 20 17 -1
> tapply(iris$Sepal.Length,iris$Species,sum)
> tapply(iris$Sepal.Length,iris$Species,mean)
> is.na(y)
[1] 4 5 6 5
As you can see, the missing value has been imputed with the mean of the other numbers.
Similarly, we can impute missing values with any other suitable value, such as the median.
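The imputation step can be sketched as follows, assuming y held the values 4, 5, 6 and one missing value, which is consistent with the output shown above.

```r
y <- c(4, 5, 6, NA)   # assumed input with one missing value

is.na(y)              # FALSE FALSE FALSE TRUE

# Replace missing entries with the mean of the observed values
y[is.na(y)] <- mean(y, na.rm = TRUE)
y                     # 4 5 6 5
```

Replacing mean with median in the same line gives median imputation.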
To merge two data frames (datasets) horizontally, use the merge function. In most cases,
you join two data frames by one or more common key variables (i.e., an inner join).
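A minimal sketch of merge() on two hypothetical data frames that share an id column. By default only matching keys are kept (inner join); all.x = TRUE keeps every row of the first data frame (left join).

```r
# Hypothetical data frames for illustration
customers <- data.frame(id = c(1, 2, 3), name = c("A", "B", "C"))
orders    <- data.frame(id = c(2, 3, 4), amount = c(20, 30, 40))

# Inner join: only ids present in both data frames (2 and 3)
inner <- merge(customers, orders, by = "id")

# Left join: all customers, with NA where no order exists
left <- merge(customers, orders, by = "id", all.x = TRUE)

inner
left
```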
Appending datasets is another operation which is used very frequently. To join two data
frames (datasets) vertically, use the rbind function. The two data frames must have the same
variables, but they do not have to be in the same order.
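A short sketch of rbind() on two hypothetical data frames whose columns are in a different order. For data frames, rbind matches columns by name, and the result follows the column order of the first argument.

```r
# Hypothetical data frames with the same variables in a different order
a <- data.frame(id = 1:2, score = c(70, 80))
b <- data.frame(score = c(90, 60), id = 3:4)

combined <- rbind(a, b)
combined
```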
End Notes:
In this comprehensive guide, we looked at the R code for various steps in data exploration
and munging. This tutorial, along with the ones available for Python and SAS, will give you
comprehensive exposure to the most important languages of the analytics industry.
Did you find the article useful? Do let us know your thoughts about this guide in the comments
section below.