
Tableau Project

Overview of snsdata.csv file

Abhisek Nayak 9/3/17 Tableau


Load Data
For this analysis, I will be using snsdata.csv, a dataset representing a random sample of 30,000 U.S. high school
students who had profiles on a social networking service (SNS). The assumption is that the profiles represent a
fairly wide cross section of American adolescents in 2006.
From the top 500 words appearing across all pages, 36 words were chosen to represent five categories of
interests, namely extracurricular activities, fashion, religion, romance, and antisocial behavior. The 36
words include terms such as football, sexy, kissed, bible, shopping, death, and drugs. The final dataset
indicates, for each person, how many times each word appeared in the person's SNS profile.
The data include 30,000 teenagers with four variables indicating personal characteristics (gradyear, gender,
age, and friends) and 36 words indicating interests (basketball, football, soccer, etc.).

Preparing Data
A reasonable range of ages for high school students includes those who are at least 13 years old and not yet
20 years old. Any age value falling outside this range will be treated the same as missing data because it is
not feasible to trust the age provided.
Hence, a new grouping dimension named Age_Measure has been created, defined by:
IF [Age_Calc] == 0 THEN 'NULL'
ELSEIF [Age_Calc] >= 13 AND [Age_Calc] < 20 THEN 'PROPER AGE'
ELSE 'NON PROPER AGE'
END
When handling the missing values, I have decided to omit the records that fall into the NULL group.
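The grouping rule above can also be sketched outside Tableau. The following Python snippet is only an illustration: it assumes that Age_Calc holds 0 wherever the original age was missing, and the sample ages are made up rather than taken from snsdata.csv. The function name age_measure is hypothetical.

```python
from typing import Optional


def age_measure(age: Optional[float]) -> str:
    """Mirror the Age_Measure grouping: NULL, PROPER AGE, or NON PROPER AGE."""
    # Assumption: Age_Calc is 0 when the original age was missing.
    age_calc = 0.0 if age is None else age
    if age_calc == 0:
        return "NULL"
    if 13 <= age_calc < 20:
        return "PROPER AGE"
    return "NON PROPER AGE"


# Hypothetical ages for illustration only (not real rows from snsdata.csv).
sample_ages = [18.98, None, 12.5, 19.99, 20.0, 106.9]
labels = [age_measure(a) for a in sample_ages]

# Omit the NULL group, as described in the text.
kept = [(a, label) for a, label in zip(sample_ages, labels) if label != "NULL"]

for a, label in kept:
    print(a, label)
```

Note that the upper bound is exclusive, so an age of exactly 20 falls into NON PROPER AGE, matching "not yet 20 years old".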
Gender Count:

For all the years, the count of females was higher than the count of males.
The largest number of NAs, 729, was found in the year 2006.
Age Count:

Among the 30,000 records, 24,477 have a proper age.
2007 has 6,219 graduates, and 2009 has the fewest graduates, i.e. 5,965.
The median age for each graduation year is as follows:
gradyear    age
2006        18.6
2007        17.7
2008        16.7
2009        15.8
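As a rough illustration of how a per-year median like the one above could be computed outside Tableau, here is a small Python sketch. The rows are hypothetical, the function name median_age_by_gradyear is made up, and restricting the median to ages in the proper range is an assumption; the report does not say whether out-of-range ages were excluded from this table.

```python
from collections import defaultdict
from statistics import median

# Hypothetical (gradyear, age) rows for illustration only; None marks a missing age.
rows = [
    (2006, 18.5), (2006, 18.9), (2006, None),
    (2007, 17.6), (2007, 17.8), (2007, 25.0),
    (2008, 16.7),
]


def median_age_by_gradyear(rows):
    """Return {gradyear: median age}, using only ages of at least 13 and under 20."""
    ages = defaultdict(list)
    for gradyear, age in rows:
        # Assumption: missing and out-of-range ages are skipped.
        if age is not None and 13 <= age < 20:
            ages[gradyear].append(age)
    return {year: median(values) for year, values in sorted(ages.items())}


result = median_age_by_gradyear(rows)
for year, med in result.items():
    print(year, round(med, 1))
```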
Word frequency:
Across the entire dataset, music was found to be the most frequently repeated word among all the students,
in every year.
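One possible way to find the most frequent word is to sum each word column over all students and take the largest total. The Python sketch below shows the idea on made-up counts. The three word columns are a small subset chosen for illustration, the numbers are not from snsdata.csv, and total_word_counts is a hypothetical helper.

```python
from collections import Counter

# Hypothetical per-student word counts (one dict per profile), for illustration only.
profiles = [
    {"music": 2, "football": 0, "shopping": 1},
    {"music": 1, "football": 3, "shopping": 0},
    {"music": 4, "football": 0, "shopping": 2},
]


def total_word_counts(profiles):
    """Sum the count of each word over all profiles."""
    totals = Counter()
    for profile in profiles:
        totals.update(profile)  # adds each word's count to the running total
    return totals


totals = total_word_counts(profiles)
top_word, top_count = totals.most_common(1)[0]
print(top_word, top_count)
```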

Cluster Analysis:
A cluster analysis was performed on snsdata.csv, keeping in mind the five categories of interests,
namely extracurricular activities, fashion, religion, romance, and antisocial behavior.
