Preparing Data
A reasonable range of ages for high school students includes those who are at least 13 years old and not yet
20 years old. Any age outside this range is treated as missing data, because the reported value cannot be
trusted.
Hence, a new grouping dimension named Age_Measure was created, defined as:
IF [Age_Calc] == 0 THEN 'NULL'
ELSEIF [Age_Calc] >= 13 AND [Age_Calc] < 20 THEN 'PROPER AGE'
ELSE 'NON PROPER AGE'
END
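The same grouping rule can be sketched outside Tableau. The Python function below is only an illustration of the calculated field above; the function name `age_measure` is hypothetical, and it assumes, as the formula does, that a missing age arrives as 0 in Age_Calc.

```python
def age_measure(age_calc):
    """Mirror the Age_Measure calculated field for a single Age_Calc value.

    Assumption: a missing age is encoded as 0, as in the Tableau formula.
    """
    if age_calc == 0:
        return "NULL"
    if 13 <= age_calc < 20:
        return "PROPER AGE"
    return "NON PROPER AGE"


# The lower bound is inclusive and the upper bound is exclusive.
for value in (0, 12.9, 13, 19.9, 20):
    print(value, age_measure(value))
```

Note that 13 counts as a proper age while 20 does not, matching "at least 13 years old and not yet 20 years old".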
To handle the missing values, I decided to omit the records that fall into the 'NULL' group.
Gender Count:
In every year, females outnumbered males.
The largest number of NAs, 729, was found in 2006.
Age Count:
Among the 30,000 records, 24,477 have a proper age.
2007 has 6,219 graduates, and 2009 has the fewest, 5,965.
The median age for every year is as follows:
gradyear  age
2006      18.6
2007      17.7
2008      16.7
2009      15.8
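A per-year median of this kind could be computed roughly as follows. This is a minimal sketch, assuming the median is taken only over ages in the accepted range (at least 13, under 20) and that missing ages are represented as `None`. The function name `median_age_by_year` and the sample rows are invented for illustration and are not taken from snsdata.csv.

```python
from collections import defaultdict
from statistics import median


def median_age_by_year(rows):
    """Return {gradyear: median age}, ignoring missing or out-of-range ages.

    rows: iterable of (gradyear, age) pairs; age may be None (assumption).
    """
    ages = defaultdict(list)
    for gradyear, age in rows:
        if age is not None and 13 <= age < 20:
            ages[gradyear].append(age)
    return {year: median(values) for year, values in sorted(ages.items())}


# Illustrative rows only, not real records from the dataset.
sample = [
    (2006, 18.5), (2006, 18.6), (2006, 18.9), (2006, None),
    (2007, 17.7), (2007, 45.0),
]
print(median_age_by_year(sample))
```

In the sample, the `None` entry and the age of 45.0 are dropped before the median is taken, which is the effect of treating out-of-range ages as missing.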
Word frequency:
Across the entire dataset, "music" was the most frequently used word among students in every graduation
year.
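One way to arrive at such a ranking is to total each word's count over all students and take the largest. The sketch below assumes each student record holds one count per word; that layout, the function name `total_word_counts`, and the sample values are assumptions made for illustration.

```python
from collections import Counter


def total_word_counts(rows):
    """Sum per-student word counts into one overall tally.

    rows: iterable of dicts mapping a word to its count for one student
    (assumed layout).
    """
    totals = Counter()
    for row in rows:
        totals.update(row)  # adds each word's count to the running total
    return totals


# Illustrative counts only, not real records from the dataset.
sample_rows = [
    {"music": 2, "dance": 0, "church": 1},
    {"music": 1, "dance": 1, "church": 0},
]
print(total_word_counts(sample_rows).most_common(1))
```

To check the claim year by year, the same tally would be run separately on the rows of each graduation year.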
Cluster Analysis:
A cluster analysis was performed on snsdata.csv with five categories of interest in mind: extracurricular
activities, fashion, religion, romance, and antisocial behavior.
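The write-up does not name the clustering algorithm. As one common choice, k-means can be sketched as below; this is an assumption, not necessarily the method used here. The toy points, the function name `kmeans`, and its parameters are hypothetical. With five interest categories in mind, k would be set to 5 on the real feature columns.

```python
import random


def kmeans(points, k, iterations=20, seed=0):
    """Plain k-means (Lloyd's algorithm) on a list of numeric tuples.

    Returns (labels, centroids). Initial centroids are k distinct points
    drawn with a fixed seed so that runs are repeatable.
    """
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        for i, p in enumerate(points):
            labels[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Toy data: two well-separated groups, so k=2 here.
toy_points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
toy_labels, toy_centroids = kmeans(toy_points, k=2)
print(toy_labels)
```

On real data the features would usually be standardized first, since k-means relies on Euclidean distance and columns with larger ranges would otherwise dominate.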