
I. Introduction (10 points).

We examine the Census Income dataset available at the UC Irvine Machine Learning Repository. We
run some basic R code to compute the mean, median, length, and similar summary statistics of the
data. We would try to predict the

II. Methods/R Code (30 points).


First, we load the data into R by clicking Import Dataset, choosing From Text (base), and selecting the
data file. To verify the data, we use the following command.

> str(`USA Census Data`) # displays the internal structure of our object (a name containing spaces must be wrapped in backticks)

To get a summary of the whole data set, we use the following command.

> summary(`USA Census Data`) # displays the summary of our object

The following command displays the first few rows.

> head(`USA Census Data`) # displays the first few rows
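The point-and-click import can also be scripted, which makes the steps repeatable. The following is a minimal sketch, assuming the file is comma-separated; the three inline rows, the object name `census`, and the choice of columns are hypothetical stand-ins for the real file.

```r
# Hypothetical three-row sample standing in for the real census file.
csv_text <- "Age,Work.Class,Hours.per.week
39,State-gov,40
50,Self-emp-not-inc,13
38,Private,40"

# read.csv() normally takes a file path; text= is used here only so that
# the sketch runs without the data file.
census <- read.csv(text = csv_text, stringsAsFactors = FALSE)

str(census)      # column types and a preview of the values
summary(census)  # per-column summary statistics
head(census, 2)  # first two rows
```

With the real file, the `text = csv_text` argument would be replaced by the path to the data file.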

Next, we convert the data set into a data table by installing and loading the data.table package, and
we write a print command to check that the package is working.

> install.packages('data.table')
> library('data.table')
> DT <- as.data.table(`USA Census Data`) # named DT because the later commands refer to DT
> print(DT) # allows you to view the first/last few rows of the data as a data table
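The conversion step can be tried on a small example before running it on the full data. This is a sketch under assumptions: the data frame `census` and its values are hypothetical, and the data.table package is assumed to be installed already.

```r
library(data.table)

# Hypothetical stand-in for the imported census data frame.
census <- data.frame(
  Work.Class     = c("Private", "Private", "State-gov"),
  Hours.per.week = c(40, 45, 38)
)

DT <- as.data.table(census)  # returns a copy; census itself stays a plain data frame

is.data.table(DT)      # check the converted object
is.data.table(census)  # check the original object
print(DT)
```

as.data.table() leaves the original object untouched, so the data frame remains available for comparison.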

Now we use some summary functions to compute statistics such as the mean, median, and SD, in order
to draw conclusions and extract essential information from the data.

> DT[, mean(Hours.per.week), by = Work.Class] # computes the mean hours per week for each work class

> DT[, mean(Hours.per.week), by = substring(Work.Class, 1, 1)] # determines the mean of groups, using the
# substring function to group by the first letter of the work class name
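The difference between the two groupings is easier to see on a few rows. The sketch below uses hypothetical toy values and assumes the column names `Work.Class` and `Hours.per.week`; the result names `mean_hours` and `first_letter` are introduced only for illustration.

```r
library(data.table)

# Hypothetical toy rows; the real data has more work classes.
DT <- data.table(
  Work.Class     = c("Private", "Private", "State-gov", "Self-emp-inc", "Self-emp-inc"),
  Hours.per.week = c(40, 44, 38, 50, 46)
)

# Mean hours for each work class.
by_class <- DT[, .(mean_hours = mean(Hours.per.week)), by = Work.Class]

# Mean hours for each first letter: "State-gov" and "Self-emp-inc"
# both fall into the group "S".
by_letter <- DT[, .(mean_hours = mean(Hours.per.week)),
                by = .(first_letter = substring(Work.Class, 1, 1))]

print(by_class)
print(by_letter)
```

Grouping by the first letter merges work classes that merely share an initial, so the per-class table is the safer basis for interpretation.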

> DT[, .(Education.num = median(Education.num),
+        Capital.Gain = median(Capital.Gain),
+        Capital.Loss = median(Capital.Loss),
+        Hours.per.week = median(Hours.per.week)),
+    by = Work.Class][order(-Work.Class)]
# This outputs the median of the four listed columns for each work class. order(-Work.Class) arranges
# the result in descending order by work class name.
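The median-and-sort chain above can be checked on a small table. This is a sketch with hypothetical values, reduced to two of the four columns for brevity.

```r
library(data.table)

# Hypothetical toy rows.
DT <- data.table(
  Work.Class     = c("Private", "Private", "Private", "State-gov", "State-gov"),
  Capital.Gain   = c(0, 0, 5000, 0, 2000),
  Hours.per.week = c(40, 45, 60, 38, 40)
)

# Median per work class, then sort by work class name in descending order.
# Inside a data.table, order(-x) also works when x is a character column.
med <- DT[, .(Capital.Gain   = median(Capital.Gain),
              Hours.per.week = median(Hours.per.week)),
          by = Work.Class][order(-Work.Class)]

print(med)
```

The median is less sensitive than the mean to a few large values, which may matter for columns such as capital gain if many entries are zero.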

Finally, we subset the data table and perform some additional commands.
> DT[, lapply(.SD, mean), by = Work.Class, .SDcols = is.numeric] # computes the mean of each numeric
# column for each work class; .SDcols restricts .SD to numeric columns, since mean() is not defined for text
> DT[, lapply(.SD, max), by = Work.Class, .SDcols = is.numeric] # computes the max of each numeric
# column for each work class
> iris = as.data.table(iris)
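The `.SD` idiom applies one function to several columns within each group. The following sketch uses hypothetical toy values and passes an explicit character vector `num_cols` (a name introduced here) to `.SDcols`, so that text columns are left out of the summary.

```r
library(data.table)

# Hypothetical toy rows with one text column besides the grouping column.
DT <- data.table(
  Work.Class     = c("Private", "Private", "State-gov"),
  Education      = c("Bachelors", "HS-grad", "Masters"),
  Age            = c(30, 50, 41),
  Hours.per.week = c(40, 44, 38)
)

num_cols <- c("Age", "Hours.per.week")  # columns to summarise

# .SD holds the columns named in .SDcols for the current group.
means <- DT[, lapply(.SD, mean), by = Work.Class, .SDcols = num_cols]
maxes <- DT[, lapply(.SD, max),  by = Work.Class, .SDcols = num_cols]

print(means)
print(maxes)
```

Naming the columns explicitly makes it clear which variables enter the summary, at the cost of updating the vector when the column set changes.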

III. Results/Output (25 points).


IV. Analysis of Results (25 points)

The Census Income dataset has 48,842 entries. In the original dataset, 23.93% of the entries are labeled
>50k and 76.07% are labeled <=50k. The average age of the individuals in the data set is 39 and the
median age is 37, which means that the typical participant is in their late thirties, close to 40. Most of
the individuals work in the private sector. Self-employed individuals work the most hours per week,
i.e. 48 hrs.
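Label shares such as the ones quoted above can be computed with table() and prop.table(). The sketch below is illustrative only: the vector `income` is a hypothetical four-element stand-in for the real label column, and the last step simply converts the reported percentages back into approximate entry counts.

```r
# Hypothetical income labels standing in for the real column.
income <- c("<=50k", "<=50k", ">50k", "<=50k")

counts <- table(income)                       # number of entries per label
shares <- round(100 * prop.table(counts), 2)  # percentage of entries per label
print(shares)

# Approximate counts implied by the reported percentages and total.
n_total <- 48842
implied <- round(n_total * c(`>50k` = 0.2393, `<=50k` = 0.7607))
print(implied)
```

Because the reported percentages are rounded to two decimals, the implied counts are approximations rather than exact tallies.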

V. Conclusion (10 points).


The data above tell us the average age, the working hours, and the capital gains and losses of
individuals in different work classes. We found that self-employed individuals work the most hours
per week.
