Project Notes – 2
1 Exploratory Data Analysis:
A new variable called post size was created to examine the distribution of post
length. Posts with a length of 1000 or less were classified as small, those up to
5000 as medium, and those longer than 5000 as long.
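The bucketing described above can be sketched with pandas. This is an illustrative sketch, not the original R code: the data frame and the column names `post_length` and `post_size` are assumed, while the cut points (1000, 5000) come from the text.

```python
import pandas as pd

# Illustrative post lengths; the column name "post_length" is an assumption.
df = pd.DataFrame({"post_length": [120, 800, 1000, 2500, 5000, 7200]})

# Bins follow the text: <= 1000 -> small, (1000, 5000] -> medium, > 5000 -> long.
df["post_size"] = pd.cut(
    df["post_length"],
    bins=[0, 1000, 5000, float("inf")],
    labels=["small", "medium", "long"],
)
print(df["post_size"].tolist())
# -> ['small', 'small', 'small', 'medium', 'medium', 'long']
```

Note that `pd.cut` uses right-closed intervals by default, so a length of exactly 1000 falls into "small" and exactly 5000 into "medium", matching the "1000 and lesser" wording.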
The number of comments in the last 24 hours has been low in most cases.
The total comments between 24 and 48 hours have been moderate.
Total comments for the majority of posts fall between 1 and 200; for a few they
reach up to 350, and very few cross 500.
This suggests that the total number of comments follows a decreasing trend after
24 hours.
All the comments posted in the first 24 hours appear to come from posts with a
post length of less than 1000.
Correlation Plot:
There exists multicollinearity between the feature variables, which is expected:
page likes and page talking about are correlated, page talking about and share
count are correlated, and so on.
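One way to surface such pairwise correlations is a correlation matrix. The sketch below is a Python stand-in for the correlation plot: the column names (`page_likes`, `page_talking_about`, `share_count`) and the toy values are assumptions, not the actual dataset.

```python
import pandas as pd

# Toy feature frame; the near-linear columns mimic the multicollinearity noted above.
df = pd.DataFrame({
    "page_likes":         [100, 200, 300, 400, 500],
    "page_talking_about": [ 11,  19,  33,  41,  48],
    "share_count":        [  5,  10,  14,  22,  25],
})

corr = df.corr()  # Pearson correlation matrix

# Flag highly correlated feature pairs (|r| > 0.9) as multicollinearity candidates.
pairs = [
    (a, b)
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if abs(corr.loc[a, b]) > 0.9
]
print(pairs)
```

With the toy data above, all three feature pairs exceed the 0.9 threshold, which is the situation the correlation plot reveals.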
The variable post.promotion.status was removed, as it is zero for every available record.
Factor variables such as post published weekday, base date weekday, and the newly
formed variables were converted to factors to help with model building.
A variable called post size was added based on post length: posts shorter than 1500
were set as short, those between 1500 and 5000 were classified as medium, and those
longer were classified as long.
A variable called Pagelikedmost was added, set to 1 if the page likes were greater
than 1000 and to 0 otherwise.
A variable called weekendpost was added, set to 1 if the post was published on
Saturday or Sunday.
Outliers Detection:
Outliers are present in post length, as very few posts have a length above 5000,
as seen below:
x <- compfb1$Totalcomments  # extract the total-comments column for the outlier check
The summary of the data shows that there are missing values in a few of the
fields.
Hence we use the vis_miss function from the visdat package to check the percentage
of missing data.
Only 2.8% of the data is missing, which can be imputed using the mice function.
This was done as part of data cleaning, as the missing data followed an MCAR
(Missing Completely At Random) pattern.
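The missing-percentage check and the imputation step can be mimicked in Python. This is a rough analogue, not the workflow above: `mice` is an R package, and scikit-learn's `IterativeImputer` is used here as a stand-in for the same chained-equations idea; the toy data frame and its column names are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Toy frame with a few missing cells.
df = pd.DataFrame({
    "likes":    [100.0, 200.0, np.nan, 400.0, 500.0],
    "shares":   [10.0, np.nan, 30.0, 40.0, 50.0],
    "comments": [1.0, 2.0, 3.0, 4.0, np.nan],
})

# Percentage of missing cells, the figure vis_miss reports in R.
pct_missing = df.isna().sum().sum() / df.size * 100
print(f"{pct_missing:.1f}% missing")  # 20.0% for this toy frame

# MICE-style chained-equations imputation via IterativeImputer.
imputed = pd.DataFrame(
    IterativeImputer(random_state=0).fit_transform(df),
    columns=df.columns,
)
print(imputed.isna().sum().sum())  # no missing cells remain
```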
Classification techniques like CART can be used. Random Forest would also be an
option to consider, as it builds multiple decision trees on bootstrap samples
(sampling with replacement), which can help with this kind of data (Facebook
comment volume prediction).
Each model can be evaluated using a confusion matrix, the predict method, and
variable importance plots.
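The evaluation steps above can be sketched in Python. This is an illustrative stand-in, not the original R modeling: the synthetic two-feature data set and the binary "high comments" label are invented for the example, and scikit-learn's `feature_importances_` plays the role of `varImpPlot` in R.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the data: two features, binary "high comments" label
# driven mostly by feature 0.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
y = (X[:, 0] + 0.2 * rng.normal(size=300) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Random Forest: many trees fit on bootstrap samples (sampling with replacement).
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

pred = rf.predict(X_test)            # the predict method
cm = confusion_matrix(y_test, pred)  # the confusion matrix
print(cm)
print(rf.feature_importances_)       # variable importance (cf. varImpPlot in R)
```

Because the label depends almost entirely on feature 0, its importance should dominate, which is exactly the kind of signal a variable importance plot is used to surface.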