
Facebook Comments Volume Prediction

Project Notes – 2
1 Exploratory Data Analysis:
A new variable called post size was created to check on the distribution of post
length.

Posts with a length of 1000 or less were classified as small, those up to 5000 as
medium, and those longer than 5000 as long.

The distribution was as below:

The number of comments in the first 24 hours ranges from 1 to 500, and most of
the data lies between 1 and roughly 100 to 150.

The number of comments in the last 24 hours has been low in most cases.
The total number of comments between 24 and 48 hours has been moderate:

Total comments for the majority of posts have been between 1 and 200; for a few of
them the count has been up to 350, and very few have comments crossing 500.

This shows that the total number of comments has a decreasing trend after 24
hours.
It looks like all the comments posted in the first 24 hours belong to posts with a
length of less than 1000.

Correlation Plot:
Multicollinearity exists between the feature variables, which is expected.

There are other relationships as below:

Page likes and page talking about are correlated, page talking about and share
count are correlated etc.
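
A correlation check of this kind can be sketched in base R as follows. This is only an illustration on toy data: the column names (Page.likes, Page.talking.about, Share.count) and the 0.7 cut-off are assumptions, not values taken from the project.

```r
# Toy data standing in for the real data set; column names are assumptions.
set.seed(1)
page_likes <- runif(200, 100, 1e5)
toy <- data.frame(
  Page.likes = page_likes,
  Page.talking.about = page_likes * 0.05 + rnorm(200, sd = 500),
  Share.count = rpois(200, 20)
)

# Pairwise Pearson correlations between the numeric columns
num_cols <- sapply(toy, is.numeric)
cor_mat <- cor(toy[, num_cols], use = "pairwise.complete.obs")

# List each pair once (upper triangle) when |r| exceeds a hypothetical cut-off
high <- which(abs(cor_mat) > 0.7 & upper.tri(cor_mat), arr.ind = TRUE)
high_pairs <- data.frame(
  var1 = rownames(cor_mat)[high[, "row"]],
  var2 = colnames(cor_mat)[high[, "col"]],
  r = cor_mat[high]
)
print(high_pairs)
```

Pairs flagged this way are candidates for dropping or combining before fitting models that are sensitive to multicollinearity.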

Page talking about vs page likes:


More posts are published on Wednesday than on other days of the week.

2 Data – Pre Processing:

Removal of Unwanted Variables:

The variable post.promotion.status was removed because it is zero for every available record.

This was done by assigning NULL to the column.
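
A minimal sketch of this step, using a toy data frame. Only the names compfb1 and post.promotion.status come from these notes; the other column and all values are made up for illustration.

```r
# Toy data frame; the values and the Post.Length column are illustrative only.
compfb1 <- data.frame(
  Post.Length = c(120, 800, 5600),
  post.promotion.status = c(0, 0, 0)
)

# Confirm the column is constant before dropping it
stopifnot(length(unique(compfb1$post.promotion.status)) == 1)

# Assigning NULL removes the column from the data frame
compfb1$post.promotion.status <- NULL
names(compfb1)
```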


Variable Transformation:

Categorical variables such as post published weekday and base date weekday, along with the
newly formed variables, were converted to factors to help in model building.
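
The conversion could look roughly like the sketch below. The column names Post.published.weekday and Base.Date.weekday are assumed spellings, and the explicit level order is a design choice rather than something stated in the notes.

```r
# Toy data; column names are assumed spellings of the weekday variables.
toy <- data.frame(
  Post.published.weekday = c("Monday", "Wednesday", "Sunday", "Wednesday"),
  Base.Date.weekday = c("Tuesday", "Thursday", "Monday", "Friday"),
  stringsAsFactors = FALSE
)

# Fix the level order so that plots and model output follow the calendar week
days <- c("Monday", "Tuesday", "Wednesday", "Thursday",
          "Friday", "Saturday", "Sunday")

# Convert both weekday columns to factors in one step
weekday_cols <- c("Post.published.weekday", "Base.Date.weekday")
toy[weekday_cols] <- lapply(toy[weekday_cols], factor, levels = days)
str(toy)
```

Setting the levels explicitly keeps all seven days as levels even when a day does not occur in a given sample.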

New Variables Added:

A variable called post size was added based on post length. Posts with a length of less than
1500 were set as short, those between 1500 and 5000 were classified as medium, and those
longer than 5000 were classified as long.

A variable called Pagelikedmost was added, which was set to 1 if the page likes were greater
than 1000 and to 0 otherwise.

A variable called weekendpost was added, which is set to 1 if the post was published on
Saturday or Sunday.
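
The three derived variables can be sketched as below. Pagelikedmost and weekendpost are the names used in these notes; the column names postsize, Post.Length, Page.likes and Post.published.weekday are assumptions, as is the handling of the exact boundary values (1500 and 5000 both fall into medium here).

```r
# Toy data; input column names are assumptions made for this sketch.
toy <- data.frame(
  Post.Length = c(300, 1500, 4999, 5000, 7200),
  Page.likes = c(500, 1000, 1001, 25000, 80),
  Post.published.weekday = c("Monday", "Saturday", "Sunday",
                             "Wednesday", "Friday")
)

# Post size: below 1500 is short, 1500 to 5000 is medium, above 5000 is long
size <- ifelse(toy$Post.Length < 1500, "short",
               ifelse(toy$Post.Length <= 5000, "medium", "long"))
toy$postsize <- factor(size, levels = c("short", "medium", "long"))

# 1 when the page has more than 1000 likes, 0 otherwise
toy$Pagelikedmost <- as.integer(toy$Page.likes > 1000)

# 1 when the post was published on a Saturday or Sunday, 0 otherwise
toy$weekendpost <- as.integer(
  toy$Post.published.weekday %in% c("Saturday", "Sunday")
)

table(toy$postsize)
```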

Outliers Detection:

Outliers are present in post length, as very few posts have a length
above 5000:

Outliers are present for total comments across days of week:


For values that lie outside the 1.5 * IQR limits, we could cap them by replacing
observations below the lower limit with the 5th percentile value and those that
lie above the upper limit with the 95th percentile value.

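# Cap outliers in Totalcomments at the 5th and 95th percentiles
x <- compfb1$Totalcomments
qnt <- quantile(x, probs = c(.25, .75), na.rm = TRUE)   # first and third quartiles
caps <- quantile(x, probs = c(.05, .95), na.rm = TRUE)  # replacement values
H <- 1.5 * IQR(x, na.rm = TRUE)                         # distance of the limits from the quartiles
x[x < (qnt[1] - H)] <- caps[1]
x[x > (qnt[2] + H)] <- caps[2]
compfb1$Totalcomments <- x  # write the capped values back to the data frame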

# Remove records with post length greater than 5000 (keep lengths up to 5000)
compfb2 <- subset(compfb1, Post.Length <= 5000)

Box plot after removing these records:


Missing Value Treatment:

The summary of the data shows that there are missing values in a few of the
fields.

Hence let us use the vis_miss function from the visdat package to check the percentage of
missing data.

Only 2.8% of the data is missing, which can be imputed using the mice function
from the mice package.

This was done as a part of data cleaning as missing data followed the pattern of MCAR
(Missing Completely At Random).
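
The missing-data check and imputation can be sketched as follows, on toy data. The percentage is computed in base R; the mice call is guarded so the sketch still runs when the package is not installed. The imputation settings (m = 5, predictive mean matching, the seed) are assumptions, since the notes do not state which settings were used.

```r
# Toy data with a few values blanked out at random
set.seed(42)
toy <- data.frame(a = rnorm(100), b = runif(100), c = rpois(100, 3))
toy$a[sample(100, 5)] <- NA
toy$b[sample(100, 3)] <- NA

# Overall share of missing cells, in percent
pct_missing <- mean(is.na(toy)) * 100

# Share of missing values per column, in percent
col_missing <- colMeans(is.na(toy)) * 100
print(round(col_missing, 1))

# Impute only when the mice package is available (settings are assumptions)
if (requireNamespace("mice", quietly = TRUE)) {
  imp <- mice::mice(toy, m = 5, method = "pmm", seed = 123, printFlag = FALSE)
  toy_complete <- mice::complete(imp, 1)  # first of the five imputed data sets
  stopifnot(!anyNA(toy_complete))
}
```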

Also, the ID column can be removed as it is neither an independent variable nor a
dependent variable.

3 Analytical Model Building:


Since the target variable is continuous and the data is multivariate, we can
build multiple models and choose the best one by comparing model
performance.

Tree-based techniques like CART can be used (as regression trees, since the target
is continuous). Random Forest is also an option to consider, as multiple decision
trees are built and sampling with replacement (bootstrapping) can help with this
kind of data (Facebook comment volume prediction).

We can also go for a predictive modeling approach.

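Each of them can be evaluated using a confusion matrix, the predict method, and
variable importance plots. A confusion matrix applies only if the target is binned
into classes; for a continuous target, error measures such as RMSE are the usual choice.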

The models can also be run on multiple test sets to assess performance.
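
Evaluating a model on several test sets could look like the sketch below, which repeats a 70/30 split and reports the RMSE of each run. The data is simulated, the column names are assumptions, and the helper names rmse and fit_model are hypothetical. A regression tree from rpart is used when that package is installed; otherwise a linear model stands in so the sketch still runs.

```r
# Simulated data; column names and the relationship are assumptions.
set.seed(7)
n <- 300
toy <- data.frame(
  Page.likes = runif(n, 100, 1e5),
  Post.Length = runif(n, 10, 6000)
)
toy$Totalcomments <- 0.001 * toy$Page.likes + rnorm(n, sd = 5)

# Root mean squared error between actual and predicted values
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Regression tree (CART) when rpart is installed, linear model otherwise
fit_model <- function(train) {
  if (requireNamespace("rpart", quietly = TRUE)) {
    rpart::rpart(Totalcomments ~ ., data = train, method = "anova")
  } else {
    lm(Totalcomments ~ ., data = train)
  }
}

# Repeat a 70/30 train/test split to see how stable the test error is
scores <- sapply(1:5, function(i) {
  idx <- sample(n, size = round(0.7 * n))
  model <- fit_model(toy[idx, ])
  rmse(toy$Totalcomments[-idx], predict(model, newdata = toy[-idx, ]))
})
print(round(scores, 2))
```

A large spread across the runs would suggest that a single train/test split gives an unreliable picture of performance.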
