
The Sense of Declining Popularity of Baseball: A Time Series Analysis

Authors:

Jake Elter

Austin Kinion

Abstract:
A time series analysis of the Google search term "baseball" to determine whether or not
there is a true decline in the number of searches over the years. The series is detrended and
deseasonalized, and its residuals are analyzed, in order to fit a model that will best predict the
future values of the time series. A prediction of 36 weeks of data shows possible future trends.
The final part of the analysis includes a brief spectral analysis of the time series.

1. Introduction

Many people refer to baseball as America's pastime, a sport that has now spread to
other countries such as Japan, Cuba, and the Dominican Republic, among others. An
increasing number of international players are joining MLB, and excelling there.
However, there is a recent sense that the popularity of and love for the game in America have
been declining, possibly because baseball is more of a regional sport while sports like football
seem to be more nationally focused.
The time series analysis in this report aims to determine whether or not this
sense of declining popularity is actually true by looking at Google Trends data for the search
term "Baseball". Since the season runs from April to October, we believe that there should be
a seasonality effect in the time series. In this report, we will analyze the smooth and rough
components, as well as the residuals, to identify the nature of the phenomenon and
ultimately use a selected ARMA model to forecast the future values of the
variable.

2. Data Description

The data for our analysis is the search term "Baseball", which is
collected weekly from Google Trends from 2005 to the present. Looking at the time series plot,
it appears that the highest number of searches occurs between April and June every year,
which are the first three months of the MLB season. Meanwhile, the lowest number of searches
occurs in November and December, which are the first two months after the season ends.
While there clearly seems to be a seasonality component, there also seems to be a gradual
decline in the number of searches over the years. In fact, the three lowest search values
in the entire data set occurred in 2014. This may be a result of decreased popularity of the
game, but could also be because the proportion of the search term "baseball" among all
searches has declined, and not necessarily because the actual number of searches has gone
down.

Summary of Statistics
   Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
  16.00    33.00    46.00    48.69    64.00   100.00

Below is a time series plot for the original data found from Google Trends.

3. Data Analysis

3.1 Data Transformation

Before analyzing the trend and seasonality components, the first step is data
transformation. To check the original data, we ran a Ljung-Box test, which returned a
p-value < 2.2e-16, clearly less than 0.05. Strictly speaking, the Ljung-Box test assesses
serial correlation rather than variance, so this result indicates strong autocorrelation in the
raw series and does not by itself establish constant variance. We nevertheless judged the
variance to be stable enough that there was no need to use a log transformation or any other
form of data transformation in an effort to stabilize it. Having said that, there are still
outliers in the data; however, we will not be removing them for our analysis.
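
As a minimal sketch of what the Ljung-Box test responds to, the following R snippet applies `Box.test` to two simulated series (the data, seed, and variable names are illustrative assumptions, not the Google Trends series). It suggests that a very small p-value signals autocorrelation, which is a separate question from whether the variance is constant.

```r
# Illustrative only: simulated data, not the baseball series.
set.seed(1)
white  <- rnorm(300)                                       # independent noise
ar_ser <- as.numeric(arima.sim(list(ar = 0.8), n = 300))   # strongly autocorrelated

# Ljung-Box test of the null hypothesis "no autocorrelation up to lag 10"
lb_white <- Box.test(white,  lag = 10, type = "Ljung-Box")
lb_ar    <- Box.test(ar_ser, lag = 10, type = "Ljung-Box")

lb_white$p.value   # p-value for the independent series
lb_ar$p.value      # p-value for the autocorrelated series (very small)
```

Both simulated series have constant variance, yet only the autocorrelated one produces a tiny p-value.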

3.2 Detrending

Looking at the original data, we can see that the data is clearly not stationary and that
there is a gradual declining trend over the years. Of the three methods to detrend data, we
used Method 2: Smoothing with One-Sided Moving Averages. We chose one-sided
moving averages over two-sided moving averages because two-sided moving
averages drop the most recent data points, which is detrimental when it comes to
prediction. Because the data is weekly, the period is d = 52, and we therefore used q = 26 for
our moving averages. A plot of the detrended data follows.
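
The one-sided moving average described here can be sketched on simulated weekly data as follows. The series `x`, its trend and seasonal amplitudes, and the seed are assumptions made up for illustration; the weights follow the half-weight-at-the-ends form used in the appendix for an even period.

```r
# Illustrative only: a simulated weekly series with a declining trend and a yearly cycle.
set.seed(2)
d <- 52
n <- 6 * d
t_idx <- seq_len(n)
x <- 60 - 0.05 * t_idx + 15 * sin(2 * pi * t_idx / d) + rnorm(n, sd = 3)

# d is even, so the two end weights are halved; the d + 1 weights sum to 1
w <- c(0.5, rep(1, d - 1), 0.5) / d

# sides = 1 uses only current and past values, so no recent observations are lost
trend <- stats::filter(x, filter = w, sides = 1)
detrended <- x - trend

sum(is.na(trend))   # the first d values have no full window behind them
```

The price of a one-sided filter is that the first d values are undefined, while the most recent values are kept for forecasting.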


3.3 Deseasonalizing

After plotting the detrended data, there is still some clear evidence of seasonality,
which needs to be removed. To remove seasonality, we subtracted the column means of the
detrended matrix from the original detrended matrix. The plot of the deseasonalized data is
provided below, which shows a relatively constant variance hovering around zero.
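
A small sketch of this seasonal adjustment, assuming a detrended series laid out as one row per year and one column per week (the simulated series and all names below are hypothetical):

```r
# Illustrative only: a simulated detrended series with a pure yearly pattern plus noise.
set.seed(3)
d <- 52
n_years <- 5
season <- 10 * sin(2 * pi * seq_len(d) / d)
detr <- rep(season, n_years) + rnorm(d * n_years, sd = 1)

detr_mat <- matrix(detr, ncol = d, byrow = TRUE)   # rows = years, columns = week of year
mu_k <- colMeans(detr_mat, na.rm = TRUE)           # average value for each week of the year
s_k  <- mu_k - mean(mu_k)                          # seasonal effects, centered to sum to zero

# Subtract the same seasonal effect from every year
deseason <- detr_mat - matrix(rep(s_k, n_years), ncol = d, byrow = TRUE)
deseason_vec <- as.numeric(t(deseason))            # back to a single series in time order

c(var(detr), var(deseason_vec))                    # variance before and after
```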

3.4 Trend Re-estimation

The graph above does still have a few obvious outliers, and we would like to pull down
the variance even more to diminish the effects of the outliers. To do so, a trend
re-estimation was done on top of the deseasonalized data. The
time series plot below shows (in red) the trend re-estimation of the deseasonalized data. We
can see that the trend re-estimation worked well to pull down the variance compared to the
deseasonalized data (in black), because the red line hovers closer to zero than the black line.
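
To illustrate why a short moving average sits closer to zero than the series it smooths, here is a sketch with a 5-point one-sided filter, matching the window length used in the appendix. The input series is simulated and stands in for the deseasonalized data.

```r
# Illustrative only: a noisy simulated series standing in for the deseasonalized data.
set.seed(4)
n <- 400
t_idx <- seq_len(n)
y <- 2 * sin(2 * pi * t_idx / 100) + rnorm(n, sd = 3)

# 5-point one-sided moving average (the re-estimated trend)
smooth5 <- stats::filter(y, filter = rep(1, 5) / 5, sides = 1)
rough <- y - smooth5                      # what remains after removing the re-estimated trend

var(y)                                    # variance of the input
var(as.numeric(smooth5), na.rm = TRUE)    # variance of the smoothed line (smaller)
```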

3.5 Residual Analysis

The first plot displayed below is the graph of the residuals, while the following two
graphs are the ACF and PACF of the residuals, after removing trend and seasonality from the
data. The plots do not show clear evidence of a pure AR(p) or MA(q) process, since neither plot
cuts off cleanly at a particular lag, so fitting an ARMA(p,q) looks to be necessary. If one of the
plots had tailed off while the other displayed a clear cutoff at a certain lag, we would be able
to guess that the model is an AR(p) (PACF cuts off after lag p) or an MA(q) (ACF cuts off
after lag q).
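
For comparison, this sketch shows the textbook pattern on a simulated AR(2) series, where the ACF tails off and the PACF is large only at the first two lags. The coefficients and sample size are arbitrary choices for illustration.

```r
# Illustrative only: a simulated AR(2) process.
set.seed(5)
ar2 <- arima.sim(list(ar = c(0.6, 0.25)), n = 2000)

a <- acf(ar2,  lag.max = 20, plot = FALSE)   # a$acf[1] is lag 0
p <- pacf(ar2, lag.max = 20, plot = FALSE)   # p$acf[1] is lag 1

bound <- 1.96 / sqrt(length(ar2))            # approximate 95% white-noise bounds

which(abs(p$acf) > bound)                    # lags where the PACF is outside the bounds
round(a$acf[2:6], 2)                         # ACF at lags 1 to 5 decays gradually
```

When neither plot shows such a cutoff, a mixed ARMA(p,q) model is the usual next candidate.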

To check whether the residuals are independently and identically distributed (iid), we
ran a Ljung-Box test, which resulted in a p-value < 2.2e-16. At the 0.05 level this rejects the
hypothesis of independence, so the residuals still contain serial dependence that a model
needs to capture.

Additionally, we ran the Portmanteau test at lags 1 through 20 to test the hypothesis of
independent and identically distributed residuals. All p-values from the test were less than
0.05, so the hypothesis of i.i.d. residuals is rejected at level 0.05 at every lag. The residuals
are therefore not yet white noise: dependence remains, even though the trend and seasonal
components have been removed.

The final test we ran to analyze the residuals was the Rank test, which is very useful
for detecting linear trends. The test returns FALSE at alpha = 0.05, so the
assumption that the residuals are i.i.d. cannot be rejected by this test, which suggests that
there is no remaining linear trend in the data.
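
The rank test can be sketched as a small function. The name `rank_test` and its return structure are hypothetical, but the statistic follows the appendix: count the pairs i > j with x[i] > x[j], then compare the count with its mean and standard deviation under the i.i.d. hypothesis.

```r
# A minimal sketch of the rank test for a linear trend (hypothetical helper name).
rank_test <- function(x, alpha = 0.05) {
  n <- length(x)
  # P = number of pairs (i, j) with i > j and x[i] > x[j]
  P <- sum(vapply(seq_len(n - 1),
                  function(j) as.numeric(sum(x[(j + 1):n] > x[j])),
                  numeric(1)))
  mu    <- n * (n - 1) / 4                        # mean of P under i.i.d.
  sigma <- sqrt(n * (n - 1) * (2 * n + 5) / 72)   # standard deviation of P under i.i.d.
  stat  <- abs(P - mu) / sigma
  list(P = P, statistic = stat, reject = stat > qnorm(1 - alpha / 2))
}

# A strictly increasing series is the extreme case: every pair is counted
res_trend <- rank_test(1:50)
res_trend$P        # 50 * 49 / 2 = 1225
res_trend$reject   # TRUE: the i.i.d. hypothesis is rejected
```

A result of FALSE, as reported for the residuals, means the count of increasing pairs is close to what an i.i.d. sequence would produce.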

Taking a look at the QQ plot of the residuals, we can see that our data shows signs of
normality, with a squared correlation of 0.9243496 between the sample and theoretical
quantiles, which somewhat supports this.
However, it is difficult to conclude with full confidence that the residuals are normal, as we
can notice very heavy tails on each end.
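
The squared QQ correlation used above can be computed as in the appendix. The sketch below compares a normal sample with a heavy-tailed one; both samples are simulated, and the t distribution with 3 degrees of freedom is only an assumed example of heavy tails.

```r
# Illustrative only: simulated samples, not the residuals from the report.
set.seed(7)
normal_sample <- rnorm(500)
heavy_sample  <- rt(500, df = 3)     # heavy tails on both ends

# qqnorm() returns theoretical (x) and sample (y) quantiles; plot.it = FALSE skips the plot
q_norm  <- qqnorm(normal_sample, plot.it = FALSE)
q_heavy <- qqnorm(heavy_sample,  plot.it = FALSE)

r2_norm  <- cor(q_norm$x,  q_norm$y)^2
r2_heavy <- cor(q_heavy$x, q_heavy$y)^2
c(normal = r2_norm, heavy = r2_heavy)   # heavy tails pull the value below that of the normal sample
```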

3.6 Fitting an ARMA Model

After the analysis of the smooth component, we want to fit stationary ARMA(p,q)
models to the residuals, using the arima() function in R to determine the best p and the
best q. To select the best model, we used the AIC, which takes into account
goodness of fit as well as the complexity of the model. The five lowest
AIC values corresponded to ARMA(3,3), ARMA(4,3), ARMA(4,1), ARMA(5,1), and
ARMA(4,2). Since all of these models have AIC values that are extremely close to each
other, we decided to choose the model that seemed to be the least complex. Of the five
ARMA models, four have a total of 6 or 7 AR and MA parameters, while ARMA(4,1) has only 5.
Therefore, we selected ARMA(4,1) as the best-fitting model. The output is
shown below with the lowest AIC values in bold.

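AIC by model order (rows: p, columns: q); * marks the five lowest values (bold in the original)

p\q          1           2           3           4           5
1     2519.688    2519.039    2460.620    2459.292    2458.360
2     2461.334    2463.053    2461.058    2460.518    2459.357
3     2462.887    2465.286    2457.748*   2458.478    2458.537
4     2457.741*   2457.970*   2457.782*   2458.758    2460.421
5     2457.397*   2459.206    2458.345    2460.618    2462.757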
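
The AIC grid search and the "simplest model among near-ties" rule can be sketched as follows. The simulated series, the smaller 3 x 3 grid, and the tolerance of 1 AIC unit are assumptions chosen for illustration; the report does not state a numeric tolerance.

```r
# Illustrative only: a simulated ARMA(2,1) series and a small order grid.
set.seed(8)
x <- arima.sim(list(ar = c(0.5, -0.3), ma = 0.4), n = 300)

max_order <- 3
aic <- matrix(NA_real_, nrow = max_order, ncol = max_order)
for (p in 1:max_order) {
  for (q in 1:max_order) {
    # Some orders may fail to fit; leave those cells as NA
    fit <- tryCatch(arima(x, order = c(p, 0, q)), error = function(e) NULL)
    if (!is.null(fit)) aic[p, q] <- AIC(fit)
  }
}

best <- which(aic == min(aic, na.rm = TRUE), arr.ind = TRUE)    # lowest AIC

# Among models within 1 AIC unit of the minimum (assumed tolerance), take the smallest p + q
close    <- which(aic - min(aic, na.rm = TRUE) < 1, arr.ind = TRUE)
simplest <- close[which.min(rowSums(close)), ]
simplest   # first entry is p, second is q
```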

Using an ARMA(4,1), we can see the coefficients and intercept of the model displayed below:

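Call:
arima(x = resid, order = c(4, 0, 1))

Coefficients:
          ar1     ar2     ar3     ar4     ma1  intercept
       1.2202  0.3749  0.1137  0.1172  1.0000     0.0115
s.e.   0.0437  0.0690  0.0690  0.0437  0.0055     0.0049

sigma^2 estimated as 6.617:  log likelihood = -1221.87,  aic = 2457.74

(Coefficient signs were not preserved in this copy of the output; the values above are the
magnitudes as printed.)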
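
As a quick consistency check, the reported AIC can be reproduced from the log likelihood, assuming the log likelihood is -1221.87 and that seven parameters are counted (four AR coefficients, one MA coefficient, the intercept, and the innovation variance):

```r
# AIC = -2 * log likelihood + 2 * (number of estimated parameters)
loglik <- -1221.87
k <- 4 + 1 + 1 + 1            # ar1..ar4, ma1, intercept, sigma^2
aic <- -2 * loglik + 2 * k
aic                           # 2457.74, matching the ARMA(4,1) entry in the AIC table
```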

Now that we have selected our ARMA model, we can fit ARMA(4,1) to the
residuals and analyze the ACF and PACF plots of the fitted model's residuals. We can see from
the graphs below that these residuals do conform to white noise, since there are no points
outside of the bounds after lag zero, and thus there is no more dependence left in the data.

3.7 Predicting Future Values

We would like to know the forecasted values of this time series in order to see whether
the downtrend could continue. To do this, we had to
reintroduce the seasonal and trend components that we removed earlier in this
paper.
After doing this, we fit the ARMA(4,1) model to the data and used the predict function
in R to show how future values of this time series may look. The plot below shows the
prediction of the next 36 weeks of data (in blue), which includes the prediction of seasonality,
trend estimation, and residuals. There is also an approximate 95% prediction interval (in red)
on each side of the prediction, computed as the prediction plus or minus two standard errors.
We expect the next 36 weeks of data to fall within this interval with roughly 95% confidence.
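
A sketch of the forecasting step on simulated data is given below. The series, the ARMA(2,1) order, and the seed are assumptions for illustration; the interval uses the normal quantile 1.96, which is close to the factor of 2 used in the appendix.

```r
# Illustrative only: forecast a simulated ARMA series 36 steps ahead.
set.seed(9)
x <- arima.sim(list(ar = c(0.7, -0.2), ma = 0.3), n = 300)
fit <- arima(x, order = c(2, 0, 1))

fc <- predict(fit, n.ahead = 36)            # list with $pred and $se
lower <- fc$pred - qnorm(0.975) * fc$se     # approximate 95% prediction bounds
upper <- fc$pred + qnorm(0.975) * fc$se

range(fc$se)   # standard errors do not shrink as the horizon grows
```

The widening standard errors are the reason forecasts further into the future become less reliable.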

We would also like to predict future values of the rough component alone (the
residuals), which we analyzed earlier in this report. Below is a plot of the predicted time
series of the next 36 weeks of our residuals.
The plot shows that the pattern predicted for the future residuals is similar to that of the
past ones. It is important to keep in mind that this prediction is only an estimate and may not
be entirely accurate. Also, if we were to attempt to predict values
further into the future, our predictions would become less and less accurate.


3.8 Spectral Analysis

The periodogram is a sample estimate of a population function called the spectral
density, which is a frequency-domain characterization of a stationary time series.
Looking at the periodogram below, we can see that the peak value corresponds to a
frequency of around 0.025. Therefore, the period for this value is 1/0.025 = 40, which
indicates that it takes approximately 40 weeks for a complete cycle. If we look at the plot of
the original data, it appears that the first cycle bottoms out around the 40th week, so the
estimate from the periodogram seems reasonable.
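
Reading a period off a smoothed periodogram can be sketched as follows, using a simulated weekly series with a known 52-week cycle (an assumption for illustration, not the Google Trends data). For reference, an exactly yearly cycle in weekly data corresponds to a frequency of 1/52, or about 0.019, and smoothing with `spans` limits how precisely the peak can be located.

```r
# Illustrative only: a simulated weekly series with a 52-week cycle.
set.seed(10)
n <- 520
t_idx <- seq_len(n)
x <- 10 * sin(2 * pi * t_idx / 52) + rnorm(n)

# Smoothed periodogram with the same spans as in the appendix; plot = FALSE returns the estimate only
sp <- spectrum(x, spans = c(7, 7), plot = FALSE)

peak_freq <- sp$freq[which.max(sp$spec)]   # frequency (cycles per week) at the peak
period    <- 1 / peak_freq                 # weeks per cycle
c(peak_freq = peak_freq, period = period)

1 / 0.025   # 40: the period implied by the peak frequency reported in the text
```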

4. Discussion

We found that there seems to be a clear declining trend for the search term "baseball"
within the Google Search domain. While analyzing the residuals, we were not entirely sure
how to interpret the ACF and PACF plots of the residuals, since we had not seen plots like
these before. From our analyses, we decided to fit an ARMA(p,q) model to the data, since there
did not seem to be a clear AR(p) or MA(q) model that would fit the data correctly.
After analyzing the residuals and fitting an approximate model to the time series,
we found that the best-fitting model is an ARMA(4,1). That said, there are models with a
slightly lower AIC that we could have used, but we felt that this one was the best because it
used the fewest parameters among the top candidates while still having a relatively low AIC value.
We ran into some trouble with forecasting the future values of the time series, so we
felt it was pertinent to add a 95% confidence band on each side of the predictions, to show
that the future values may not exactly follow the prediction curve.
Lastly, we included a brief spectral analysis. This was the part that we struggled with most
in this report, since it required a lot more research and help from outside sources. The
brief analysis that we did, though, seems to fit our data quite well, since it takes
approximately 40 weeks to complete one cycle in our original time series graph.

5. Conclusions

In this report, we analyzed a time series from Google Trends for the search term
"baseball". We removed the trend and seasonality estimates in order to analyze
the residuals and fit a model to the data. We found that the best-fitting model for the data
was an ARMA(4,1). After finding the best model, we predicted possible future values of the
time series and also performed a brief spectral analysis. We found that there does seem to
be a declining trend for the search term "baseball" within the Google Search domain. This
may be due to a decreasing interest in baseball, but could also be a result of people
becoming more specific with their searches, and possibly of an increasing number of
alternative ways to search the internet.


6. References

Crawley, Michael J. "Spectral Analysis." Safari. John Wiley & Sons, n.d. Web. 03 Mar. 2015.

Keshvani, Abbas. "Using AIC to Test ARIMA Models." CoolStatsBlog. N.p., 14 Aug. 2013. Web.
05 Mar. 2015.

Patrick, Joshua. Help with the arima function in R, and with interpreting residuals.

Stoffer, D. S., and R. H. Shumway. Time Series Analysis and Its Applications. University
of Pittsburgh Statistics, 2010. Web. 05 Mar. 2015.

7. Appendix

data=read.csv("basball.csv")
baseball=ts(data)
plot.ts(baseball)

# This is weekly data. For weeks that overlap two years, we treat the week as belonging to the first year.
# We will use d = 52. For years with 53 weeks, we will average the values for the last two weeks.
tmp = data
# Year 2006-2007: merge weeks 156 and 157, then drop row 157
tmp[156, 2] = ceiling((tmp[156, 2] + tmp[157, 2]) / 2)
tmp = tmp[-157, ]
# Year 2012-2013: merge weeks 468 and 469, then drop row 469
tmp[468, 2] = ceiling((tmp[468, 2] + tmp[469, 2]) / 2)
tmp = tmp[-469, ]
rownames(tmp) = 1:nrow(tmp)   # Reindexing the rows
data = tmp
data = data[-c(573:578), ]    # Keep 572 rows = 11 full years of 52 weeks
plot.ts(data[, 2], xlab = 'Week', ylab = 'Count', main = 'Time Series Plot for the search term "Baseball"')

# Method 2: Moving average estimation
# d is even, so we need to slightly modify our Wt.
# With d = 52, we have q = 26, N = 11
# Step 1: Filtering
data.matrix = matrix(data[, 2], ncol = 52, byrow = T)
one.sided.filter = filter(data[, 2], filter = c(0.5, rep(1, 51), 0.5) / 52, sides = 1)
filter.matrix = matrix(one.sided.filter, ncol = 52, byrow = T)   # Put the filtered values into a matrix
mu.matrix = data.matrix - filter.matrix                          # This is the detrended matrix
plot(as.numeric(t(mu.matrix)), type = 'l', main = 'Method 2: Detrended (Filtered) Data', ylab = '', xlab = 'Weeks')

# Step 2: Seasonal estimation
mu.k = colMeans(mu.matrix, na.rm = T)   # Mean of each week of the year across years
sk = mu.k - mean(mu.k)                  # Centered seasonal effects
sk.matrix2 = matrix(rep(sk, 11), ncol = 52, byrow = T)
sk.matrix2
deseasonalized2 = mu.matrix - sk.matrix2
baseball = as.numeric(t(deseasonalized2))

plot(baseball, type = 'l', main = 'Method 2: Deseasonalized Data', ylab = '', xlab = 'Weeks')

# Step 3: Trend re-estimation
# I want to pull down the variance, so I will filter again with a small window (5 weeks, sides = 1)
deseasonalized2 = as.numeric(t(deseasonalized2))
filter.method2 = filter(deseasonalized2, filter = rep(1, 5) / 5, sides = 1)
lines(filter.method2, col = 'red')
resid = deseasonalized2 - filter.method2
plot.ts(resid, main = 'Prediction of Residuals (in Blue)', xlim = c(0, 600))

sum(resid, na.rm = TRUE)

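#########################
# Method 1: The sample ACF
#########################
# Recompute the residuals, then drop the first 56 values, which are NA
# (52 from the trend filter plus 4 from the re-estimation filter)
resid = deseasonalized2 - filter.method2
resid = resid[-(1:56)]

resid1.acf = acf(resid)
resid1.pcf = pacf(resid)
lag.plot(data[, 2], lags = 27, layout = c(3, 4), diag = F)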


# Check how many are within the bounds
resid1.acf = resid1.acf$acf[2:28]   # The first acf value is of lag 0, which we ignore here
n = length(data[, 2])
bounds = c(-1.96 / sqrt(n), 1.96 / sqrt(n))
sum(resid1.acf < bounds[2] & resid1.acf > bounds[1])

###############################
# Method 2: The Portmanteau test
###############################
# Using the built-in function
teststat.1 = numeric(20)
pvals.1 = numeric(20)
for (i in 1:20) {
  test1 = Box.test(resid, lag = i, type = 'Ljung')
  teststat.1[i] = test1$statistic   # The test statistic
  pvals.1[i] = test1$p.value        # The p-value
}
# The original loop referred to `resid1` and `resid2`, which are never defined in this script;
# the test is applied to `resid` here and the second, undefined series is omitted.
# Comparing p-values
pvals.1 < 0.05

######################
# Method 3: Rank Test
######################
n = length(resid)
mu.pi = 1/4 * n * (n - 1)
sigma.pi = sqrt(1/72 * n * (n - 1) * (2 * n + 5))
alpha = 0.05
z = qnorm(1 - alpha / 2)
# Find Pi for the residuals: the number of pairs i > j with resid[i] > resid[j]
Pi.1 = 0
for (j in 1:(n - 1)) {
  for (i in (j + 1):n) {
    if (resid[i] > resid[j]) Pi.1 = Pi.1 + 1
  }
}

P1 = abs(Pi.1 - mu.pi) / sigma.pi
P1 > z   # Comparing the test statistic with the critical value

######################
# Method 4: QQ Plot
######################
q1 = qqnorm(resid)
qqline(resid)
cor(q1$x, q1$y)^2   # Squared correlation between theoretical and sample quantiles

# PREDICTING
# Create a matrix of AIC values corresponding to ARMA(p,q)
# and find the lowest AIC values
aic = matrix(NA, nrow = 5, ncol = 5, byrow = T)
for (p in 1:5) {
  for (q in 1:5) {
    aic[p, q] = AIC(arima(resid, order = c(p, 0, q)))
  }
}
colnames(aic) = c(1:5)   # Columns index q
rownames(aic) = c(1:5)   # Rows index p
aic
# ARMA(4,1) is selected as the best model
ball.ar = arima(resid, order = c(4, 0, 1))
ball.ar$aic

ball = data[, 2]
fit.base = arima(resid, order = c(4, 0, 1))
names(fit.base)
pacf(fit.base$residuals)
resids = fit.base$residuals
resids = resids[-c(1:56)]
pacf(resids, main = "PACF of resids using ARMA(4,1)")
acf(resids, main = "ACF of resids using ARMA(4,1)")

# Predict the next 36 weeks for the smooth component
fit.base = arima(data[, 2], order = c(4, 0, 1))
Base.pred = predict(fit.base, n.ahead = 36)
baseball.data = data[, 2]
plot(baseball.data, xlim = c(0, 620), type = 'l', main = "Prediction (Blue) of smooth component with C.I. in Red")
Base.pred

lines(Base.pred$pred, col = "blue")
lines(Base.pred$pred + 2 * Base.pred$se, col = "red", lty = 3)   # Upper bound: prediction + 2 s.e.
lines(Base.pred$pred - 2 * Base.pred$se, col = "red", lty = 3)   # Lower bound: prediction - 2 s.e.

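# Predict the next 24 weeks for the rough component
# predict() needs a fitted model, so the ARMA(4,1) fit to the residuals is used here
fit.resid = arima(resid, order = c(4, 0, 1))
predict.24 = predict(fit.resid, n.ahead = 24)
plot.ts(ball, xlab = '', ylab = '', xlim = c(0, 600))
lines(predict.24$pred, col = 'red')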

# Spectral Analysis
spectrum(data[,2],spans=c(7,7))
