Priyanka Finalproject

Priyanka Verma
GTECH 301 - Final Project

Introduction
In recent years, Twitter has emerged as a popular social networking platform. It has made
connecting with friends, family and strangers on a global scale easier than ever before.
Creating a Twitter profile requires only a name, an email address and a password.
What sets Twitter apart from other social networks is that it lets users share brief,
quick updates. The three main numeric attributes of a Twitter account are followers,
following and tweets. The people who follow a particular account are its followers.
The other users that an account follows are counted as following. Each posted update,
or "tweet", is limited to 140 characters. The total number of a user's posts is
counted as the number of tweets. This brevity and the simple user interface have made
Twitter user-friendly. Because it is such a quick and easy method of communication, it is
also a suitable stage for broadcasting to a large group of people at once. Celebrities are
one group of active users, tweeting and interacting with their eager fans. Twitter has
defined itself as a platform where anyone with a single account can express themselves in
140 characters, regardless of social or geographical barriers. With millions of users,
Twitter lists its top accounts, ranked by number of followers. Interestingly, all of these
top accounts belong to people who are also famous outside of Twitter in some way. Although
Twitter is open and globally available, it is worth noting how much influence the users
with the most followers may have.

Objective
The goal of this study is to understand why some users have more followers than others,
and to examine how follower count relates to other factors that might influence it.
Data
The data used in this analysis was collected from the Twitter website itself. It covers
the top 100 Twitter accounts on the social networking service and records their Followers,
Following, Tweets, Description and Username. The Username column gives the name of the
profile, which is unique to the account, along with the actual name of the person who owns
it. The Description column specifies the profession of the account owner. The Tweets column
gives the number of tweets posted from the account and indicates how active it is. On its
own, this number says little about activity without knowing how long the account has been
in use. I therefore added a column, MonthsOnTwitter, giving the number of months the person
has held the account. To compute it, I manually looked up every account on the list and
noted the month and year it joined Twitter. I then calculated the number of months the
account had been active by subtracting the join date from May 2015.
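The month count described above can be sketched in R as follows. This is a minimal illustration rather than the original script: the helper name `months_since` and the two join dates are hypothetical, and only the May 2015 reference date comes from the text.

```r
# Hypothetical helper: whole months between a join date and a reference date.
# The default reference date (May 2015) is the one stated in the text.
months_since <- function(join_year, join_month, ref_year = 2015, ref_month = 5) {
  (ref_year - join_year) * 12 + (ref_month - join_month)
}

# Made-up join dates for illustration only (not taken from the dataset):
# March 2009 and October 2011
example_months <- months_since(join_year = c(2009, 2011), join_month = c(3, 10))
print(example_months)
```

Under this definition an account that joined in March 2009 is counted as 74 months old, and one that joined in October 2011 as 43 months old.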
Understanding and learning about the Twitter dataset: the data also provides the number of
followers of each account (Followers) and the number of accounts it follows (Following).

colnames(TF)
## [1] "Rank"            "Username"        "Description"     "Followers"
## [5] "Following"       "Tweets"          "MonthsOnTwitter"
summary(TF[,c(4:7)]) # Summary of the numerical data
##    Followers         Following          Tweets       MonthsOnTwitter
##  Min.   :1914071   Min.   :     0   Min.   :  117   Min.   : 40.00
##  1st Qu.:2076083   1st Qu.:   105   1st Qu.: 1229   1st Qu.: 73.00
##  Median :3028263   Median :   251   Median : 2652   Median : 75.00
##  Mean   :3494345   Mean   : 24057   Mean   : 5259   Mean   : 76.28
##  3rd Qu.:4086656   3rd Qu.:  1189   3rd Qu.: 5990   3rd Qu.: 80.00
##  Max.   :8914965   Max.   :701159   Max.   :60693   Max.   :100.00

The plot below shows the number of accounts in each profession category (Description).
According to the plot, musicians dominate the top Twitter accounts.
# rainbow() expects a count, so pass the number of categories (one color per bar)
barplot(table(TF$Description), col=rainbow(nlevels(factor(TF$Description))),
main="Visualizing the Categorical Variable - Description", las=2, cex.names=0.6,
cex.axis=0.2, ylab="Frequency")

Methodology
To begin the study, a correlation analysis was performed to understand the associations
between the variables. Models were then fitted to the data to assess the strength of
these relationships.
Correlating the response variable - Followers to other covariates (Following, Tweets, MonthsOnTwitter)
Each numeric variable in the data was correlated with the number of Followers. A general
correlation matrix of the bivariate relationships was generated by applying the cor
function to the numeric columns of the data frame. Each entry of the matrix measures the
association between two variables at a time, without taking the remaining variables into
account. The results show that Followers has its strongest correlation with Following
(r = 0.47, positive), while Tweets and MonthsOnTwitter show weak negative (r = -0.11) and
weak positive (r = 0.08) correlations, respectively.

cor(TF[,c(4:7)]) # Correlation matrix
##                   Followers   Following      Tweets MonthsOnTwitter
## Followers        1.00000000  0.46938919 -0.10949172      0.08265307
## Following        0.46938919  1.00000000 -0.07239048      0.26182482
## Tweets          -0.10949172 -0.07239048  1.00000000      0.33050445
## MonthsOnTwitter  0.08265307  0.26182482  0.33050445      1.00000000
pairs(TF[,c(4:7)])

To gain further confidence in the associations, each of the three numeric variables
(Following, Tweets and MonthsOnTwitter) was tested for association with Followers using
Pearson correlation through the cor.test function. cor.test tests for association between
paired samples and returns the significance level of the correlation.
Both axes of the scatter plot are drawn on a log scale.
cor.test(x=TF$Followers,y=TF$Following) # Followers vs. Following
##
## Pearson's product-moment correlation
##
## data: TF$Followers and TF$Following
## t = 4.3512, df = 67, p-value = 4.73e-05
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2617924 0.6354719
## sample estimates:
## cor
## 0.4693892
plot(TF$Followers~TF$Following, log="xy", col="blue", pch=19,
xlab="log(Following)", ylab="log(Followers)") # Correlation plot

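cor.test(x=TF$Followers,y=TF$Tweets) # Followers vs. Tweets
##
## Pearson's product-moment correlation
##
## data: TF$Followers and TF$Tweets
## t = -0.9016, df = 67, p-value = 0.3705
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.3374281 0.1305727
## sample estimates:
## cor
## -0.1094917
cor.test(x=TF$Followers,y=TF$MonthsOnTwitter) # Followers vs. Months on Twitter
##
## Pearson's product-moment correlation
##
## data: TF$Followers and TF$MonthsOnTwitter
## t = 0.6789, df = 67, p-value = 0.4996
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.1571009 0.3132067
## sample estimates:
## cor
## 0.08265307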
Based on the results from the cor.test function, the correlation between Followers and
Following was the strongest and was significant, with a p-value well below 0.05. In
comparison, Tweets (p = 0.37) and MonthsOnTwitter (p = 0.50) were not significantly
correlated with Followers.
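As a cross-check on the output above, the t statistic reported by cor.test can be recomputed from the correlation coefficient alone, using t = r * sqrt(n - 2) / sqrt(1 - r^2). The sketch below uses only the values printed above; the sample size n = 69 is inferred from the reported df = 67 (df = n - 2) and is not stated in the report.

```r
# Correlations of Followers with each covariate, copied from the output above
r <- c(Following = 0.4693892, Tweets = -0.1094917, MonthsOnTwitter = 0.08265307)
df <- 67   # degrees of freedom reported by cor.test (n - 2, so n = 69 is inferred)

# t statistic and two-sided p-value for a Pearson correlation
t_stat <- r * sqrt(df) / sqrt(1 - r^2)
p_val  <- 2 * pt(-abs(t_stat), df)

print(round(t_stat, 4))
print(signif(p_val, 3))
```

The recomputed t values agree with the printed ones (about 4.35, -0.90 and 0.68), and only the Following correlation falls below the 0.05 threshold.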
Plot
A boxplot was used to compare the distribution of Followers across the categories of the
Description variable. It can be noted that, on average, musicians have some of the highest
numbers of followers.
par(cex.axis = 0.5)
# rainbow() expects a count, so pass the number of categories (one color per box)
boxplot(log2(TF$Followers)~TF$Description, las=2,
col=rainbow(nlevels(factor(TF$Description))), main="Number of Followers by Description",
ylab="Number of Followers (log2 scale)") # Followers vs Description

Model Fitting
With a better understanding of how each variable correlates with Followers, the following
models were fitted. Several model summaries were compared to find the best-fitting model.

As seen previously, Musicians were the most common category and account for many of the
highest follower counts. When Followers is modeled on Description alone, only about 34% of
the variation is explained, and the adjusted R-squared is negative (-0.13). Thus,
Description alone does not account for much of the variability.
Mod1 = lm(Followers~Description,data = TF)
summary(Mod1)
##
## Call:
## lm(formula = Followers ~ Description, data = TF)
## Multiple R-squared: 0.3351, Adjusted R-squared: -0.1304
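The negative adjusted R-squared follows from how many parameters the Description factor consumes. The sketch below recomputes it from the reported R-squared using adjusted R^2 = 1 - (1 - R^2)(n - 1)/(n - p - 1). Two inputs are inferred rather than stated by the author: n = 69 comes from the df = 67 reported by cor.test, and p = 28 is the number of degrees of freedom listed for Description in the VIF table later in this report.

```r
# Adjusted R-squared from R-squared, sample size n and number of predictors p
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

n <- 69   # inferred: cor.test reports df = 67 = n - 2
p <- 28   # inferred: Df of Description in the VIF table

mod1_adj <- adj_r2(r2 = 0.3351, n = n, p = p)
print(round(mod1_adj, 3))
```

With 28 dummy-variable coefficients estimated from 69 accounts, the penalty factor (n - 1)/(n - p - 1) = 68/40 = 1.7 outweighs the 34% of variance explained, so the adjusted value drops below zero.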
The number of tweets posted from the account is then added as a covariate to see whether
it increases the variability explained. The R-squared goes up by very little, which
suggests that the covariate contributes little. This is consistent with the correlation
findings, which suggested that Tweets was not significantly associated with the number of
followers.
Mod2=lm(formula=Followers~Description+Tweets, data=TF)
summary(Mod2)
The correlation matrix also suggested a weak correlation between Followers and
MonthsOnTwitter. As expected, adding MonthsOnTwitter as a covariate does not increase the
R-squared appreciably, suggesting that the length of time a user has been on Twitter makes
little difference to the number of followers. Additionally, this might not be a good
variable to use, because a person who has had an account for the longest time may not
necessarily be as active as others who joined more recently. Information about the actual
number of active days on the account would have been more helpful.
Mod3=lm(formula=Followers~Description+MonthsOnTwitter, data=TF)
summary(Mod3)

Using the number of followers as the response variable, a model was fitted using all other
variables. This maximal model, with all 4 covariates against Followers, is then analyzed.
As seen in the correlation matrix, MonthsOnTwitter seems to be the least significant
variable, followed by Tweets. This suggests that these two variables explain little of the
variance in Followers, and that it would be justified to exclude them from further
analysis to obtain a well-fitted model. Even so, the R-squared indicates that the model
explains a fairly high share of the variability, about 54%.
Mod4=lm(formula=Followers~Following+Description+Tweets+MonthsOnTwitter, data=TF) # Maximal model (linear)
summary(Mod4)
## Multiple R-squared: 0.5371, Adjusted R-squared: 0.1493

Removing the insignificant covariates from the model gives the following model, whose
R-squared shows that it still explains about 53% of the variability. This is the best
model for the given data, as both the number of people the account follows and the
description of the account holder are significant in predicting the number of followers.
Mod5=lm(formula=Followers~Following+Description,data=TF)
summary(Mod5)
## Multiple R-squared: 0.5324, Adjusted R-squared: 0.1847
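One way to check whether dropping Tweets and MonthsOnTwitter costs any explanatory power is a partial F-test on the two nested models. The sketch below computes it from the reported R-squared values only; it is not part of the original analysis, and it assumes n = 69 (from df = 67 in cor.test) and 28 degrees of freedom for Description (from the VIF table). With the fitted objects available, anova(Mod5, Mod4) would perform the same comparison.

```r
r2_full    <- 0.5371       # Mod4: Following + Description + Tweets + MonthsOnTwitter
r2_reduced <- 0.5324       # Mod5: Following + Description

n      <- 69               # assumed sample size (df = 67 in cor.test implies n = 69)
p_full <- 1 + 28 + 1 + 1   # coefficients in Mod4, excluding the intercept
q      <- 2                # covariates dropped going from Mod4 to Mod5

# Partial F statistic for nested linear models, written in terms of R-squared
df_resid <- n - p_full - 1
f_stat <- ((r2_full - r2_reduced) / q) / ((1 - r2_full) / df_resid)
p_val  <- pf(f_stat, q, df_resid, lower.tail = FALSE)

print(c(F = f_stat, p = p_val))
```

The F statistic comes out well below 1 (roughly 0.19), so under these assumptions the two extra covariates do not detectably improve the fit, which is consistent with choosing the smaller model.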

Applying the variance inflation factor (VIF) test to the model gives the same generalized
VIF (about 3.57) for Following and Description, which indicates that the two covariates
share some of their variance, that is, they are partly collinear. This may suggest that a
particular group of users tends to have the highest number of followers given that they
follow a certain number of people.
library(car) # provides vif()
vif(Mod5)
##                 GVIF Df GVIF^(1/(2*Df))
## Following   3.566555  1        1.888532
## Description 3.566555 28        1.022967
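The last column of the vif output rescales GVIF so that terms with different degrees of freedom can be compared. As a quick check, the sketch below recomputes GVIF^(1/(2*Df)) from the first two columns of the output above; no values other than those printed are used.

```r
# GVIF and degrees of freedom, copied from the vif() output above
gvif <- c(Following = 3.566555, Description = 3.566555)
df   <- c(Following = 1, Description = 28)

# Rescaled GVIF, comparable across terms with different degrees of freedom
gvif_scaled <- gvif^(1 / (2 * df))
print(round(gvif_scaled, 6))
```

For the single-coefficient term Following this is simply the square root of GVIF (about 1.89), while for the 28-df Description factor it is close to 1.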
Conclusion
My initial hypothesis was that the number of followers is associated with the other
factors present in the data, with a bias towards the number of Tweets a person had sent
and the months they had been on Twitter. As deduced from the analysis, the best model for
Followers used the number of people the account follows and the Description of the
person. Tweets and MonthsOnTwitter were not significantly associated with the number of
followers. This suggests that the strength of a user's standing as a known figure and
their willingness to follow back lead to them being a widely followed account. However,
additional data could strengthen this claim. In addition to the given data, it may be
important to include factors such as day-to-day activity and the frequency of interaction
with other users on Twitter. These would help show how active a user is on Twitter on a
regular basis, and could strengthen the analysis by predicting followers from the level
of activity. More descriptive variables, such as a popularity index for each user, could
also help make the research more conclusive.
