We examine the Census Income dataset available at the UC Irvine Machine Learning Repository. We
run some basic R code to compute the mean, the median, the length of the data, and
other similar statistics. We would try to predict the
> str(`USA Census Data`) # displays the internal structure of our object
To get a summary of the whole data set, the following command is used.
> summary(`USA Census Data`) # summary statistics for every column
The following command displays the first few rows.
> head(`USA Census Data`) # shows the first six rows by default
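The inspection steps above can be tried on a small stand-in table. This is only a sketch: the `census` data frame and its values are made up for illustration, the `Age` column name is an assumption, and `Work.Class` and `Hour.per.week` follow the column names used later in this report.

```r
# Hypothetical six-row stand-in for the census data (values are made up).
census <- data.frame(
  Age = c(39, 50, 38, 53, 28, 37),
  Work.Class = c("State-gov", "Self-emp", "Private",
                 "Private", "Private", "Self-emp"),
  Hour.per.week = c(40, 13, 40, 40, 40, 60),
  stringsAsFactors = FALSE
)

str(census)          # column types and a short preview of each column
summary(census)      # min, quartiles, mean and max for the numeric columns
head(census, n = 3)  # first three rows

n_rows <- nrow(census)             # length of the data (number of rows)
mean_age <- mean(census$Age)       # arithmetic mean of Age
median_age <- median(census$Age)   # middle value of Age
```

On the real data the same calls would be applied to the imported object instead of `census`.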
Next, we convert the data set into a data table by installing the data.table package, and we
write a print command to check that the package is working.
install.packages('data.table')
library('data.table')
DT <- as.data.table(`USA Census Data`) # DT is the name used in the commands that follow
print(DT) # allows you to view the first/last few rows of the data as a data table
Now we use some summary functions to compute statistics such as the mean, median, and standard
deviation, so that we can draw conclusions and extract essential information from the data.
> DT[, mean(Hour.per.week), by=Work.Class] # computes the mean hours per week for each work class
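The same grouping pattern extends to the median and the standard deviation mentioned above. The following is a minimal sketch, assuming the data.table package is installed; `toy` and its values are invented for illustration and are not taken from the census data.

```r
library(data.table)

# Hypothetical toy table (values are made up, not census figures).
toy <- data.table(
  Work.Class = c("Private", "Private", "Private", "Self-emp", "Self-emp"),
  Hour.per.week = c(40, 38, 42, 50, 46)
)

# One row per work class with several summary statistics at once.
stats <- toy[, .(
  mean_hours = mean(Hour.per.week),
  median_hours = median(Hour.per.week),
  sd_hours = sd(Hour.per.week),   # sample standard deviation
  n = .N                          # number of rows in the group
), by = Work.Class]

print(stats)
```

Naming the results inside `.( )` keeps the output columns readable, whereas an unnamed `mean(...)` is returned as `V1`.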
Finally, we subset the data table and run some additional commands.
> DT[, lapply(.SD, mean), by=Work.Class] # mean of each column, grouped by work class (non-numeric columns yield NA)
> DT[, lapply(.SD, max), by=Work.Class] # max of each column, grouped by work class
> iris <- as.data.table(iris) # the same conversion applied to R's built-in iris data set
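Because `mean` is only meaningful for numeric columns, one option is to restrict `.SD` to those columns with `.SDcols`. The sketch below assumes the data.table package is installed; the `toy` table, its values, and the `Education` and `Age` column names are hypothetical.

```r
library(data.table)

# Hypothetical toy table mixing text and numeric columns (values are made up).
toy <- data.table(
  Work.Class = c("Private", "Private", "Self-emp", "Self-emp"),
  Education = c("HS-grad", "Bachelors", "Masters", "HS-grad"),
  Age = c(30, 40, 45, 55),
  Hour.per.week = c(40, 44, 50, 46)
)

# Names of the numeric columns only.
num_cols <- names(toy)[sapply(toy, is.numeric)]

# Per-group mean and max, computed over the numeric columns only.
means <- toy[, lapply(.SD, mean), by = Work.Class, .SDcols = num_cols]
maxes <- toy[, lapply(.SD, max), by = Work.Class, .SDcols = num_cols]

print(means)
print(maxes)
```

Selecting the columns by name avoids the warnings that `mean` raises on text columns and keeps the result limited to interpretable values.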
The Census Income dataset has 48,842 entries. In the original dataset, 23.93% of the entries are
labeled >50K and 76.07% are labeled <=50K. The average age of the individuals in the data set is 39
and the median age is 37, so a typical participant is in their late thirties. Most of the
individuals work in the private sector. The self-employed work more hours per week than the other
groups, about 48 hours.
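The class shares quoted above can be turned into approximate counts, and shares of this kind can be computed from a label column with `table` and `prop.table`. This is a rough sketch: the counts are approximate because the reported percentages are rounded, and the `income` vector is a made-up stand-in for the real label column.

```r
n_total <- 48842                                 # entries reported in the text
share <- c(">50K" = 0.2393, "<=50K" = 0.7607)    # shares reported in the text

# Approximate counts implied by the rounded shares.
approx_counts <- round(n_total * share)
print(approx_counts)

# Computing shares from a label column; `income` is a made-up toy vector.
income <- c("<=50K", "<=50K", ">50K", "<=50K")
pct <- round(100 * prop.table(table(income)), 2)
print(pct)
```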