Verma
GTECH 301 - Final Project
Introduction
In recent years, Twitter has emerged as a popular social networking platform. It has made connecting with friends, family and strangers on a global scale easier than ever before. A name, an email address and a password are all that is required to create a Twitter profile. What sets Twitter apart from other social networks is that it lets users share brief, quick updates.

The three main numeric attributes of a Twitter account are followers, following and tweets. The people who follow a particular account are its followers. The users that an account follows are counted as following. Each posted update, or "tweet", is limited to 140 characters. The total number of a user's posts is counted as the number of tweets.

This brevity and a simple user interface have made Twitter user-friendly. Being such a quick and easy method of communication also makes it a suitable stage for broadcasting to a large group of people at once. Celebrities are one such group of active users, tweeting and interacting with their eager fans. Twitter has defined itself as a platform where anyone with a single Twitter account can express themselves in 140 characters, regardless of social or geographical barriers.

With millions of users, Twitter now lists its top accounts, ranked by number of followers. Interestingly, all of these top accounts happen to be famous outside of Twitter in some way. Although Twitter is open and globally available, it is worth noting how powerful the users with the most followers may be.
Objective
The goal of this study is to understand why some users have more followers than others, and to explain how the number of followers relates to other factors that might influence it.
Data
The data used in this analysis was collected from the Twitter website itself. It covers the top 100 Twitter accounts on the service and contains their Followers, Following, Tweets, Description and Username.

The Username is the name of the profile, which is unique to the person who owns it. This column also includes the actual name of the account owner. The Description column specifies the profession of the account owner. The number of Tweets from the account is recorded and indicates how active the account is.

I figured the tweet count alone was not very descriptive of an account's activity without knowing how long the account had been in use. Thus, I added a column to the data that shows the number of months the person has held the account. To compute this, I manually looked up every account on the list and noted the month and year it joined Twitter. I then calculated the number of months the account had been active by subtracting the join date from May 2015.
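The month calculation described above can be sketched in a few lines of R. This is only an illustration: the helper `months_between`, the column names `JoinYear` and `JoinMonth`, and the three example accounts are hypothetical and are not taken from the dataset.

```r
# Hypothetical helper: whole months between a join month and a reference month.
# The default reference (May 2015) matches the cutoff described in the text.
months_between <- function(join_year, join_month, ref_year = 2015, ref_month = 5) {
  (ref_year - join_year) * 12 + (ref_month - join_month)
}

# Made-up join dates for illustration only (not real accounts).
joined <- data.frame(
  Username  = c("account_a", "account_b", "account_c"),
  JoinYear  = c(2009, 2011, 2015),
  JoinMonth = c(3, 11, 5)
)

# Vectorised over the whole column, so no manual subtraction is needed.
joined$MonthsOnTwitter <- months_between(joined$JoinYear, joined$JoinMonth)
print(joined)
```

Working from separate year and month columns avoids date parsing entirely, which seems sufficient when only month-level precision was recorded.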
Understanding and learning about the Twitter dataset:
The data provides the number of followers of each user and the number of accounts that user is following.
colnames(TF)
The plot below shows the number of followers of an account by the profession of the account owner. According to the plot, musicians dominate the top Twitter accounts.
Methodology
To begin the study, a correlation analysis was performed to understand the associations between the variables. Models were then fitted to the data to assess the strength of those relationships.
pairs(TF[,c(4:7)])
To gain further confidence in the associations, each of the three numeric variables (Following, Tweets and the number of months on Twitter) was tested for association with Followers using Pearson correlation through the cor.test function. cor.test tests for association between paired samples and returns the significance level of the correlation.
cor.test(x=TF$Followers,y=TF$Following) #Followers vs Following
##
##  Pearson's product-moment correlation
##
## data:  TF$Followers and TF$Following
## t = 4.3512, df = 67, p-value = 4.73e-05
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2617924 0.6354719
## sample estimates:
##       cor
## 0.4693892
plot(TF$Followers~TF$Following,log="xy",col="blue",pch=19,xlab="log(Following)",ylab="log(Followers)") # Correlation plot
cor.test(x=TF$Followers,y=TF$Tweets) #Followers vs Tweets
##
##  Pearson's product-moment correlation
##
## data:  TF$Followers and TF$Tweets
## t = -0.9016, df = 67, p-value = 0.3705
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.3374281  0.1305727
## sample estimates:
##        cor
## -0.1094917
cor.test(x=TF$Followers, TF$MonthsOnTwitter) #Followers vs. Months on Twitter
##
##  Pearson's product-moment correlation
##
## data:  TF$Followers and TF$MonthsOnTwitter
## t = 0.6789, df = 67, p-value = 0.4996
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.1571009  0.3132067
## sample estimates:
##        cor
## 0.08265307
Based on the cor.test results, the correlation between Followers and Following was the strongest (r = 0.47) and was statistically significant, with a p-value below 0.05. In comparison, Tweets (r = -0.11, p = 0.37) and MonthsOnTwitter (r = 0.08, p = 0.50) were not significantly correlated with Followers.
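The three tests above could also be run in one pass and collected into a single table, which makes the comparison easier to read. The sketch below uses simulated data in a data frame called `TF_demo` (an assumption for illustration; it is not the real `TF` dataset), so its numbers will not match the output reported above.

```r
set.seed(1)
n <- 69  # assumption: chosen so that df = n - 2 = 67, as in the reported output

# Simulated stand-in for the real data; only Following drives Followers here.
TF_demo <- data.frame(
  Following       = rlnorm(n, meanlog = 6, sdlog = 2),
  Tweets          = rlnorm(n, meanlog = 8, sdlog = 1),
  MonthsOnTwitter = sample(12:100, n, replace = TRUE)
)
TF_demo$Followers <- 1e7 + 500 * TF_demo$Following + rnorm(n, sd = 5e6)

# Run a Pearson test of Followers against each numeric predictor.
predictors <- c("Following", "Tweets", "MonthsOnTwitter")
results <- do.call(rbind, lapply(predictors, function(v) {
  ct <- cor.test(TF_demo$Followers, TF_demo[[v]], method = "pearson")
  data.frame(
    variable = v,
    r        = unname(ct$estimate),   # sample correlation
    t        = unname(ct$statistic),  # t statistic
    df       = unname(ct$parameter),  # degrees of freedom (n - 2)
    p_value  = ct$p.value
  )
}))
print(results)
```

For a Pearson test the t statistic follows from the correlation as t = r * sqrt(df) / sqrt(1 - r^2), so the `r`, `t` and `df` columns carry the same information in different forms.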
Plot
A boxplot was used to compare the Followers variable across the categories of the Description variable. On average, musicians are among the accounts with the highest numbers of followers.
par(cex.axis = 0.5)
# rainbow() expects a number of colors, so pass the count of Description levels
boxplot(log2(TF$Followers)~TF$Description,las=2,col=rainbow(nlevels(factor(TF$Description))),
main="Number of Followers by Description",ylab="Number of Followers (in log scale)") # Followers vs Description
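A boxplot draws the median of each group, so the visual reading can be backed up with a numeric summary per category. The following is a minimal sketch on invented numbers; `TF_demo` and its values are assumptions for illustration, not figures from the dataset.

```r
# Invented follower counts for three professions (illustration only).
TF_demo <- data.frame(
  Description = c("Musician", "Musician", "Athlete", "Athlete", "Politician"),
  Followers   = c(60e6, 40e6, 30e6, 10e6, 45e6)
)

# Median followers per Description category, highest first.
med <- tapply(TF_demo$Followers, TF_demo$Description, median)
med_sorted <- sort(med, decreasing = TRUE)
print(med_sorted)
```

Medians are used rather than means because follower counts are heavily skewed, which is also why the boxplot is drawn on a log scale.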
Model Fitting
After the correlation analysis, and with a better understanding of how each variable correlated with Followers, the following models were designed. Various statistical computations were used to find the best-fitting model.

As seen previously, musicians were the most popular category, accounting for the highest number of followers. When Followers is modeled on Description alone, only 34% of the variation is explained, and the adjusted R-squared is negative (-0.13). Thus, Description alone does not account for much of the variability.
##
## Call:
## lm(formula = Followers ~ Description, data = TF)
##
## Multiple R-squared:  0.3351, Adjusted R-squared:  -0.1304
The number of tweets posted from the account is added as a covariate to see whether it increases the variability explained. The R-squared goes up only slightly (from 0.3351 to 0.3387), which suggests that the covariate adds little. This is consistent with the correlation findings, which suggested that Tweets was not significantly associated with the number of followers.
Mod2=lm(formula=Followers~Description+Tweets, data=TF)
summary(Mod2)
## Multiple R-squared:  0.3387, Adjusted R-squared:  -0.1529
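Comparing R-squared values by eye is one way to judge an added covariate; a nested-model F-test via `anova()` is a more formal alternative. The sketch below is not the analysis from this report: it uses simulated data (`demo`, `m1`, `m2` are hypothetical names) in which Tweets is pure noise by construction.

```r
set.seed(42)
n <- 60

# Simulated data: Followers depend on Description only; Tweets is unrelated noise.
demo <- data.frame(
  Description = factor(sample(c("Musician", "Athlete", "Actor"), n, replace = TRUE)),
  Tweets      = rpois(n, lambda = 5000)
)
group_mean <- c(Actor = 20, Athlete = 25, Musician = 40)
demo$Followers <- unname(group_mean[as.character(demo$Description)]) + rnorm(n, sd = 5)

# Smaller model, and the same model with Tweets added.
m1 <- lm(Followers ~ Description, data = demo)
m2 <- lm(Followers ~ Description + Tweets, data = demo)

# R-squared can only stay the same or rise when a covariate is added ...
r2 <- c(m1 = summary(m1)$r.squared, m2 = summary(m2)$r.squared)
print(r2)

# ... so the F-test asks whether the rise is larger than chance would give.
print(anova(m1, m2))
```

Because R-squared never decreases when a term is added, a small increase on its own says little; the F-test p-value or the adjusted R-squared is usually the safer guide.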
The correlation analysis also suggested a weak correlation between Followers and MonthsOnTwitter. As expected, adding MonthsOnTwitter as a covariate does not increase the R-squared appreciably, suggesting that the length of time a user has been on Twitter makes little difference to the number of followers. Additionally, this might not be a good variable to use, because a person who has had an account for the longest time is not necessarily as active as others who joined more recently. Information about the actual number of active days on the account would have been helpful.
Mod3=lm(formula=Followers~Description+MonthsOnTwitter, data=TF)
summary(Mod3)
After removing the insignificant covariates, we end up with the following model, which uses Following and Description. This model explains 53% of the variability (R-squared = 0.5324). This is the best model for the given data, as both the number of people the account follows and the description of the account holder are significant in predicting the number of followers.
Mod5=lm(formula=Followers~Following+Description,data=TF)
summary(Mod5)
## Multiple R-squared:  0.5324, Adjusted R-squared:  0.1847
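The gap between the multiple and adjusted R-squared values reported for these models can be checked by hand with the standard formula, adjusted R^2 = 1 - (1 - R^2)(n - 1)/(n - p - 1). The sketch below rests on two assumptions that the report does not state: n = 69 observations (inferred from df = 67 in the cor.test output) and p = 28 dummy coefficients for Description (chosen because it reproduces the reported adjusted values).

```r
# Adjusted R-squared from R-squared, sample size n and number of predictors p.
adj_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

n <- 69  # assumption: cor.test reported df = n - 2 = 67

# R-squared values are copied from the summaries in the text;
# p is an assumption (28 Description dummies, plus 1 for each numeric covariate).
reported <- data.frame(
  model = c("Description", "Description + Tweets", "Following + Description"),
  r2    = c(0.3351, 0.3387, 0.5324),
  p     = c(28, 29, 29)
)
reported$adj_r2 <- adj_r2(reported$r2, n, reported$p)
print(reported)
```

If these assumptions hold, the large penalty comes from fitting many Description categories to relatively few accounts, which would explain why the adjusted values sit so far below the multiple R-squared.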
Applying the variance inflation factor (VIF) test to the model suggests that Following and Description are collinear and contribute about equally to the variance of the response variable. This may suggest that a particular group of users tends to have the highest number of followers given that they follow a certain number of people.
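library(car) # vif() is assumed to come from the car package
vif(Mod5) # R is case-sensitive: the model object is Mod5, not mod5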
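To make the idea behind the variance inflation factor concrete, it can be computed by hand in base R: regress each predictor on the others and take 1 / (1 - R^2). The sketch below uses simulated numeric predictors (`x1`, `x2`, `x3` are hypothetical) and is not the computation run on `Mod5`. For a multi-level factor such as Description, `car::vif` reports a generalized VIF instead, which this simple version does not cover.

```r
set.seed(7)
n <- 69

# Simulated predictors: x2 is built to be correlated with x1; x3 is independent.
x1 <- rnorm(n)
x2 <- 0.8 * x1 + rnorm(n, sd = 0.6)
x3 <- rnorm(n)
X <- data.frame(x1, x2, x3)

# VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing x_j on the rest.
manual_vif <- sapply(names(X), function(v) {
  others <- setdiff(names(X), v)
  fit <- lm(reformulate(others, response = v), data = X)
  1 / (1 - summary(fit)$r.squared)
})
print(round(manual_vif, 2))
```

A VIF of 1 means a predictor is uncorrelated with the others; values well above 1 indicate that its coefficient estimate is inflated by collinearity.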
I started with the hypothesis that the number of followers was associated with other factors present in the data, with a bias towards the number of tweets a person had sent and the months they had been on Twitter. As the analysis showed, the best model for Followers used the number of people the account followed and the Description of the account owner. Tweets and MonthsOnTwitter, by contrast, were not significantly associated with the number of followers. This suggests that the strength of a user's profile as a known figure, together with their willingness to follow back, leads to them being a popularly followed account.

However, additional data points could be added to the analysis to further strengthen the claim. In addition to the given data, it may be important to include data on factors such as day-to-day activity and the frequency of interaction with other users on Twitter. These would help further elucidate how active a user is on Twitter on a regular basis, and could strengthen the analysis by predicting followers from the level of activity. More descriptive variables, such as a popularity index for a user, could also help make the research more conclusive.