
2012 IEEE INTERNATIONAL WORKSHOP ON MACHINE LEARNING FOR SIGNAL PROCESSING, SEPT. 23-26, 2012, SANTANDER, SPAIN

QUANTIFYING SPATIOTEMPORAL DYNAMICS OF TWITTER REPLIES TO NEWS FEEDS

F Bießmann, J-M Papaioannou, A Harth, ML Jugel, K-R Müller, M Braun

ABSTRACT

Social network analysis can be used to assess the impact of information published on the web. The spatiotemporal impact of a certain web source on a social network can be of particular interest. We contribute a novel statistical learning algorithm for spatiotemporal impact analysis. To demonstrate our approach we analyze Twitter replies to individual news articles along with their geospatial and temporal information. We then compute the multivariate spatiotemporal response pattern of all Twitter replies to information published on a given web source. This quantitative result can be interpreted with respect to a) how much impact a certain web source has on the Twitter-sphere, b) where and c) when it reaches its maximal impact. We also show that the proposed approach predicts the dynamics of the social network activity better than classical trend detection methods.

Index Terms: Social network analysis, spatiotemporal dynamics, canonical trends, tkCCA

1. INTRODUCTION

If information is published on the web, the Twitter responses sometimes create waves of replies, much like a stone falling into water. These waves can be regarded as the impulse response function of the respective social network to information published on the web. Such an impulse response function can give valuable insights into both the future importance of a certain topic in a community and the response behavior of a community.

Here we propose a novel approach, canonical trend analysis, for assessing the spatiotemporal impact of a web source on a social network. Canonical trend analysis automatically learns a) the relevant features of a web source that give rise to retweets and b) the optimal impulse response function in a social network such as Twitter. This allows us to investigate which features give rise to a Twitter trend and to predict the spatiotemporal dynamics of the Twitter community. We evaluate our method on data obtained from influential news feeds and Twitter responses to news items published
Berlin Institute of Technology, Dept. Machine Learning, Franklinstr. 28, 10587 Berlin; Karlsruhe Institute of Technology, Institute AIFB, Englerstr. 11, 76131 Karlsruhe; Korea University, Department of Brain and Cognitive Engineering, Anam-dong, Seongbuk-gu, Seoul 136-713, Korea

on the respective feed. In comparison with classical alternatives to topic detection we find that our approach consistently predicts Twitter activity better than standard methods. We discuss practical problems of Twitter predictions and future directions of research on spatiotemporal dynamics in online social networks (OSN).

2. RELATED WORK

Canonical trend analysis was used in previous studies to extract music trends and corresponding trend setters and followers in user cliques on the social music website http://last.fm/ [1]. Another study used canonical trend analysis for trendsetter detection in a pool of news websites [2].

An alternative approach to spatiotemporal analysis of OSN is proposed in [3]. There the authors show that the popularity of topics is correlated with their geographical spread, e.g. highly popular topics tend to cross region boundaries. Topics were extracted by the OpenCalais service http://www.opencalais.com/. In contrast to [3] our approach does not require dedicated topic extraction: canonical trend analysis estimates the most influential topics automatically.

Another study on trend propagation on Twitter was conducted in [4], where several aspects of topic diffusion in Twitter were studied; the authors constructed retweet trees and studied their temporal and spatial characteristics. Our study is similar in that respect, but while the authors of [4] investigated spatial and temporal dependencies separately, we also analyze spatiotemporal dynamics that are not separable in space and time.

A focus on geographical dynamics has also been taken in [5]; the authors examine information spread along a social network and across geographic regions by analyzing tweets related to two specific events happening at two different geographic locations. Our approach is complementary to this work: while it can be very important to investigate manually defined topics or events, often it is not clear what information in a web source constitutes a trend; our approach does not require predefined events or topics but learns the relevant topic from the data.

3. CANONICAL TRENDS

The goal of canonical trend analysis in the present application setting is to predict the spatiotemporal evolution of retweets in

978-1-4673-1026-0/12/$31.00 © 2012 IEEE

response to information on a web source. For canonical trend analysis we extract from each web source $f \in \{1, 2, \dots, F\}$ in our collection of $F$ web sources Bag-of-Words (BoW) features $x_f(t) \in \mathbb{R}^W$ and retweet counts at $L$ locations $y_f(t) \in \mathbb{R}^L$, henceforth referred to as Bag-of-Locations (BoL) features, at time points $t \in \{1, \dots, T\}$. Both types of features are tf-idf normalized. For the sake of simplicity we here assume regularly sampled time points: we average the BoW and BoL features within consecutive non-overlapping temporal windows of one hour. Details about data collection can be found in section 5; we here extracted BoW features as relevant information of each web source, but our approach is readily applicable to other feature representations such as semantic entities or collections of hyperlinks. After feature extraction we store the multivariate feature time series in two sparse matrices

$$X_f = [x_f(t = 1), \dots, x_f(t = T)] \in \mathbb{R}^{W \times T}, \tag{1}$$

$$Y_f = [y_f(t = 1), \dots, y_f(t = T)] \in \mathbb{R}^{L \times T}. \tag{2}$$
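The feature construction described above (hourly averaging followed by tf-idf weighting) could be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the function names `hourly_average` and `tfidf` are hypothetical, and because the text does not say which tf-idf variant was used, the smoothed idf form below is an assumption, with hourly time bins playing the role of documents.

```python
import numpy as np

def hourly_average(timestamps_h, features, n_hours):
    """Average item feature vectors within consecutive one-hour windows.

    timestamps_h: (n_items,) publication times in hours since the start.
    features:     (n_items, D) feature vector per item (e.g. word counts).
    Returns a (D, n_hours) matrix; windows without items stay zero.
    """
    features = np.asarray(features, dtype=float)
    bins = np.floor(np.asarray(timestamps_h, dtype=float)).astype(int)
    out = np.zeros((features.shape[1], n_hours))
    counts = np.zeros(n_hours)
    for b, row in zip(bins, features):
        if 0 <= b < n_hours:          # ignore items outside the analysed period
            out[:, b] += row
            counts[b] += 1
    nonempty = counts > 0
    out[:, nonempty] /= counts[nonempty]
    return out

def tfidf(M):
    """Weight each feature (row) of a (D, T) matrix by a smoothed idf.

    Assumed variant: idf = log((1 + T) / (1 + df)) + 1, where df is the
    number of time bins in which the feature is non-zero.
    """
    M = np.asarray(M, dtype=float)
    n_bins = M.shape[1]
    df = np.count_nonzero(M, axis=1)
    idf = np.log((1.0 + n_bins) / (1.0 + df)) + 1.0
    return M * idf[:, None]
```

Applied to word counts this would give a matrix playing the role of $X_f$, and applied to per-location retweet counts a matrix playing the role of $Y_f$.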

We model a canonical trend (CT) in the BoW feature space as a weighted combination $w_x \in \mathbb{R}^W$ of features (i.e. a topic)

$$\hat{x}_f(t) = w_x^\top X_f(:, t). \tag{3}$$

The canonical trend in the retweet location space is modeled as a spatiotemporal convolution of retweet counts

$$\hat{y}_f(t) = \sum_{\tau} w_y(\tau)^\top Y_f(:, t + \tau), \quad \tau \in \{1, 2, \dots, N\}, \tag{4}$$

where $w_y(\tau) \in \mathbb{R}^{L \times N}$ is a space-time convolution with $N$ time lags (in hours). For the sake of simplicity we here only consider one-dimensional trends $\hat{x}_f(t) \in \mathbb{R}^1$, $\hat{y}_f(t) \in \mathbb{R}^1$, but multidimensional trend estimates are straightforward, see section 4. In general the dimensionality of $\hat{x}_f(t)$, $\hat{y}_f(t)$ is $\min(\mathrm{rank}(X_f), \mathrm{rank}(Y_f))$ [6]. For optimal prediction of the retweets $\hat{y}_f(t)$ from the information published on a web source $\hat{x}_f(t)$ we maximize the correlation between $\hat{x}_f(t)$ and $\hat{y}_f(t)$

$$\operatorname*{argmax}_{w_y(\tau),\, w_x} \mathrm{Corr}(\hat{x}_f(t), \hat{y}_f(t)). \tag{5}$$

The optimal $w_x$ and $w_y(\tau)$ can be computed simultaneously using canonical correlation analysis (CCA) [7]. The mathematical properties of CCA are as well understood [8] as its statistical convergence criteria [9, 10]. We here use an extension, temporal kernel CCA (tkCCA), that can deal with high dimensional data, small sample sizes and time delayed nonlinear dependencies between data [11]. The interpretation of $w_y(\tau)$ and $w_x$ is straightforward. In our application example they are the directions in the feature space that maximize the correlation between the information published by a web source and the future retweet activity of the information published on that web site. The correlation coefficient in eq. 5 is called canonical as it is invariant w.r.t. invertible linear transformations of the data. We thus refer to the time series $\hat{x}_f(t)$, $\hat{y}_f(t)$ as canonical trends (CT).

4. CANONICAL TREND ANALYSIS

In the following we show how eq. 5 can be optimized efficiently. The first step is a temporal embedding of the retweet data. This is done by creating for each retweet location matrix $Y_f$ a new representation $\tilde{Y}_f$ in which we add copies of the data in $Y_f$, shifted forward in time by a time lag of $\tau$ hours:

$$\tilde{Y}_f = \begin{bmatrix} Y_{f,\tau=1} \\ \vdots \\ Y_{f,\tau=N} \end{bmatrix} \in \mathbb{R}^{LN \times T}. \tag{6}$$

By temporally embedding the data we increase the dimensionality of the data by a factor of $N$, the number of time lags. Classical CCA in this setting requires the inversion of covariance matrices of size $(W + LN)^2$, where $W$ denotes the number of BoW features and $L$ the number of retweet locations. In contrast, kernel CCA involves a generalized eigenvalue problem with matrices of size $(2T)^2$, where $T$ denotes the number of samples. For the sake of simplicity we consider linear kernels here, but non-linear dependencies can be easily estimated by replacing the linear kernel with other kernel functions. For linear kernels the canonical trend solution is a linear expansion of data points

$$w_y(\tau) = Y_{f,\tau}\, \beta, \tag{7}$$

$$w_x = X_f\, \alpha. \tag{8}$$

The coefficients $\alpha$ and $\beta$ are the eigenvectors of the generalized eigenvalue problem

$$\begin{bmatrix} 0 & K_x K_y \\ K_y K_x & 0 \end{bmatrix} \begin{bmatrix} \alpha \\ \beta \end{bmatrix} = \rho \begin{bmatrix} L_x & 0 \\ 0 & L_y \end{bmatrix} \begin{bmatrix} \alpha \\ \beta \end{bmatrix}, \tag{9}$$

where $K_y = \tilde{Y}_f^\top \tilde{Y}_f \in \mathbb{R}^{T \times T}$ is the linear kernel matrix of $\tilde{Y}_f$ and $K_x = X_f^\top X_f \in \mathbb{R}^{T \times T}$ is the linear kernel matrix of $X_f$. The eigenvalue $\rho$ is the canonical correlation on the training data set, which yields the same result as eq. 5. The matrices on the right hand side are computed as $L_x = K_x^2 + \kappa I$ and $L_y = K_y^2 + \kappa I$, where $\kappa$ is the regularization constant controlling the complexity of the solution. For linear kernels we can recover the canonical projection $w_x$ according to eq. 8 and the canonical convolution $w_y(\tau)$ according to eq. 7. We could then compute the BoW trend $\hat{x}_f(t)$ according to eq. 3 and the retweet trend $\hat{y}_f(t)$ using eq. 4, but this can be computationally costly. Instead of recovering $w_x$, $w_y(\tau)$ and computing $\hat{y}_f(t)$, $\hat{x}_f(t)$, we can stay in kernel space to evaluate the models, which is much faster if the kernels are already computed. The complete canonical trend detection algorithm is summarized in algorithm 1.
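To make the temporal embedding (eq. 6) and the regularized kernel CCA problem (eq. 9) concrete, the following NumPy sketch may help. It is an illustration under stated assumptions rather than the authors' implementation: the names `temporal_embedding`, `inv_sqrt` and `linear_kcca` are hypothetical, shifted copies are zero-padded at the end of the series (the text does not say how boundaries are handled), and instead of calling a generalized eigensolver the sketch uses the equivalent singular value decomposition of $L_x^{-1/2} K_x K_y L_y^{-1/2}$, whose leading singular triplet satisfies eq. 9.

```python
import numpy as np

def temporal_embedding(Y, n_lags):
    """Stack copies of Y (L x T) shifted forward by 1..n_lags steps (eq. 6).

    Block tau holds Y[:, t + tau] in column t; columns that would reach
    past the end of the series are zero (an assumed boundary treatment).
    """
    Y = np.asarray(Y, dtype=float)
    T = Y.shape[1]
    if not 0 < n_lags < T:
        raise ValueError("n_lags must satisfy 0 < n_lags < T")
    blocks = []
    for tau in range(1, n_lags + 1):
        shifted = np.zeros_like(Y)
        shifted[:, :T - tau] = Y[:, tau:]
        blocks.append(shifted)
    return np.vstack(blocks)            # shape (L * n_lags, T)

def inv_sqrt(S):
    """Inverse matrix square root of a symmetric positive definite matrix."""
    w, V = np.linalg.eigh(S)
    return (V / np.sqrt(w)) @ V.T

def linear_kcca(Kx, Ky, kappa):
    """Leading solution (rho, alpha, beta) of eq. 9 for kernel matrices Kx, Ky.

    With L_x = Kx^2 + kappa*I and L_y = Ky^2 + kappa*I, the top singular
    triplet (u, s, v) of L_x^{-1/2} Kx Ky L_y^{-1/2} gives rho = s,
    alpha = L_x^{-1/2} u and beta = L_y^{-1/2} v.
    """
    n = Kx.shape[0]
    Lx_is = inv_sqrt(Kx @ Kx + kappa * np.eye(n))
    Ly_is = inv_sqrt(Ky @ Ky + kappa * np.eye(n))
    U, s, Vt = np.linalg.svd(Lx_is @ Kx @ Ky @ Ly_is)
    return s[0], Lx_is @ U[:, 0], Ly_is @ Vt[0]
```

With `Kx = X.T @ X` and `Ky = Yt.T @ Yt` for an embedded matrix `Yt`, the training trends can be evaluated in kernel space as `alpha @ Kx` and `beta @ Ky`, without forming $w_x$ or $w_y(\tau)$ explicitly.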

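Algorithm 1 Canonical Trend Algorithm
Require: Data $X_f \in \mathbb{R}^{W \times T}$, $Y_f \in \mathbb{R}^{L \times T}$, $N$, $\kappa$
  # Temporal embedding of the BoL features (eq. 6)
  $\tilde{Y}_f = [Y_f(:, t + 1)^\top, \dots, Y_f(:, t + N)^\top]^\top$
  # Compute linear kernels
  $K_y = \tilde{Y}_f^\top \tilde{Y}_f$
  $K_x = X_f^\top X_f$
  # Cross-validation loop
  for fold = 1 to 10 do
    # Pick training indices
    $\mathrm{Tr} \subset \{1, \dots, T\}$
    # Pick test indices
    $\mathrm{Te} \subset \{1, \dots, T\} \setminus \mathrm{Tr}$
    $\alpha, \beta = \mathrm{kCCA}(K_y(\mathrm{Tr}, \mathrm{Tr}), K_x(\mathrm{Tr}, \mathrm{Tr}), \kappa)$
    # Predict test data
    $c_{f,\mathrm{fold}} = \alpha^\top K_x(\mathrm{Tr}, \mathrm{Te})\, K_y(\mathrm{Te}, \mathrm{Tr})\, \beta$
  end for
  # Output: canonical correlation, BoW subspace $w_x$, BoL convolution $w_y(\tau)$
  $w_y(\tau) = \tilde{Y}_f\, \beta$
  $w_x = X_f\, \alpha$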
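The cross-validation loop of algorithm 1 could be organised roughly as below. This is a hedged sketch: `blocked_folds` and `fold_correlation` are hypothetical helper names, the folds are consecutive blocks rather than random samples, and training samples within `gap` positions on either side of the test block are dropped, which is a conservative reading of the requirement that temporally embedded training and test data must not overlap. The score is a Pearson correlation of the test trends computed in kernel space, in the spirit of eq. 5.

```python
import numpy as np

def blocked_folds(n_samples, n_folds=10, gap=0):
    """Yield (train_idx, test_idx) pairs for consecutive test blocks.

    Training indices closer than or equal to `gap` positions to the test
    block are removed, so that lagged copies of the data cannot leak.
    """
    edges = np.rint(np.linspace(0, n_samples, n_folds + 1)).astype(int)
    idx = np.arange(n_samples)
    for k in range(n_folds):
        start, stop = edges[k], edges[k + 1]
        keep = (idx < start - gap) | (idx >= stop + gap)
        yield idx[keep], idx[start:stop]

def fold_correlation(Kx, Ky, alpha, beta, train, test):
    """Correlation of BoW and retweet trends on the test block.

    alpha and beta are expansion coefficients over the training samples;
    the test trends are obtained from train/test kernel blocks only.
    """
    x_trend = alpha @ Kx[np.ix_(train, test)]
    y_trend = Ky[np.ix_(test, train)] @ beta
    return np.corrcoef(x_trend, y_trend)[0, 1]
```

In use, `alpha` and `beta` would be estimated on `Kx[np.ix_(train, train)]` and `Ky[np.ix_(train, train)]` for each fold, with `gap` set to the number of time lags $N$.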

Table 1. Comparison methods, see also section 4.2.

Method   Hypothesis
Mean     Overall BoW counts predict tweets best
PCA      BoW variance predicts tweet variance
CT       BoW-BoL co-variance predicts tweets best

4.1. Model evaluation for time series

For evaluation we split the data into consecutive blocks of training and test data, estimate $\alpha$ and $\beta$ on the training set and compute the prediction accuracy in eq. 5 on test data. In standard classification settings the training and test samples are often picked randomly. The temporal dependencies present in Twitter and news time series data do not allow for such random resampling. We split the time series into blocks of equal length $T / N_{\mathrm{folds}}$, where $N_{\mathrm{folds}} = 10$ is the number of cross-validation folds. Due to the temporal embedding (see eq. 6) consecutive blocks overlap by $N$ samples. We therefore discarded the first $N$ samples from the training block adjacent to the test data block. This ensured that no data point that we tested on was used for training the kCCA model. The regularization parameter $\kappa$ and the optimal number of time lags $N$ were estimated using 10-fold cross-validation (nested within the training data set) and a grid search over time lags $\tau \in \{1, 2, \dots, 200\}$ and $\kappa \in \{10^{-8}, 10^{-7}, \dots, 10^{-1}\}$. Optimal regularizers were in the range of $10^{-3}$ to $10^{-1}$; the optimal time lag was between $\tau = 80$ and $\tau = 100$ hours.

4.2. Comparison with other approaches

We compared the canonical trend prediction with two other approaches, each associated with another hypothesis. The simplest approach to find a trend in the BoW and BoL space is to weigh each dimension with the same coefficient, i.e. to compute the mean across features. If the mean trend in the BoW space predicts the spatiotemporal retweet activity better than the canonical trend approach, this means that there is no information in the features of either modality: the only relevant quantity is the overall publishing activity. This would imply that the more is published on a news feed, the more retweet activity should be expected.

Another method to extract trends is latent semantic analysis (LSA) [12], in which only a single matrix of BoW features is factorized. For instance, LSA finds the strongest topic $v_f \in \mathbb{R}^W$ as that subspace in the BoW space that captures most variance

$$\operatorname*{argmax}_{v_f} \left( v_f^\top X_f X_f^\top v_f \right), \quad \text{s.t. } v_f^\top v_f = 1. \tag{10}$$
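The two comparison trends of section 4.2 are simple to state in code. The sketch below is illustrative only: `mean_trend` and `strongest_topic` are hypothetical names, and the variance-maximizing direction of eq. 10 is obtained as the leading left singular vector of the feature matrix, without any additional centering, because eq. 10 is written for the matrix as it is.

```python
import numpy as np

def mean_trend(M):
    """Mean baseline: weigh every feature (row) of M equally."""
    return np.asarray(M, dtype=float).mean(axis=0)

def strongest_topic(X):
    """Solve eq. 10 for a (W x T) matrix X.

    The unit-norm v maximizing v^T X X^T v is the leading left singular
    vector of X; the associated trend is the time series v^T X.
    """
    X = np.asarray(X, dtype=float)
    U, _, _ = np.linalg.svd(X, full_matrices=False)
    v = U[:, 0]
    return v, v @ X
```

The sign of `v` is arbitrary, which is why a PCA trend may appear anti-correlated with another time series even when the two share the same temporal structure.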

Informally the relationship between LSA and CT is similar to the relationship between principal component analysis (PCA) [13] and CCA: PCA maximizes the variance of the retweets $Y_f$ and news BoW features $X_f$ separately, while CCA maximizes the co-variation between $X_f$ and $Y_f$. We compared the canonical trend predictions (eq. 5) with the correlation between $u_f^\top Y_f$ and $v_f^\top X_f$, where $v_f$ and $u_f$ were obtained by linear kernel PCA on $X_f$ and $Y_f$ separately according to eq. 10. Note that in order to ensure a fair comparison with canonical trend prediction we temporally embedded the retweet data $Y_f$ before computing PCA. The hypothesis associated with the PCA (or LSA) prediction is: if the prediction obtained with PCA is the same as that of the canonical trend prediction, then there is no useful co-variation between retweets and news content. Instead, all that is needed to predict the retweet activity optimally is the feature combination that accounts for as much variance as possible. The decisive difference to the canonical trend approach is that the PCA trends are computed on news item features and retweet count location features separately. A summary of the methods and underlying hypotheses is listed in table 1.

5. DATA COLLECTION

We collected data from six news feeds (footnote 1) during October of 2011. Bag-of-Words (BoW) features were extracted using standard natural language processing tools (footnote 2). After removal of stop words and stemming, our BoW dictionary contained $W \approx 10^5$ words. The time series of each word was tf-idf normalized. The BoW feature time series was then stored in sparse matrices $X_f \in \mathbb{R}^{W \times T}$ for every news feed separately, where $f \in \{1, \dots, F = 6\}$ denotes the news feed and $t \in \{1, \dots, T\}$ denotes the time in hours. Time stamps of all
news web sources were set to CET. BoW features were temporally averaged in consecutive non-overlapping windows of one hour. In total we analyzed a time period of $T = 670$ hours.

We also collected the retweets of the news items of each web source from the Twitter site over the same period. We took normalized (i.e. after resolving redirects) URIs of articles and searched for these URIs via the Twitter API (footnote 3) to collect identifiers of retweets that mention the article URI. The tweet status objects were then downloaded using the Twitter API. We extracted the location from the users' profiles and the date on which the message was tweeted. To make sure that the locations given by the users are valid we used a list of 800 cities (in different languages) with their coordinates as a reference. The reference list was obtained from http://www.openstreetmap.org/. This procedure resulted in sparse Bag-of-Locations (BoL) matrices. To overcome the problem of sparsity in our data we reduced the number of locations by spatial averaging. This was done using the http://gadm.geovocab.org/withinRegion service to map the valid locations to higher-level regions, e.g. Palo Alto and San Francisco were mapped to a gadm.geovocab link which represents the state of California. The time series of each location was then tf-idf normalized and stored in a sparse matrix $Y_f \in \mathbb{R}^{L \times T}$ for every web source. Figure 1 shows the average (over time) retweet activity plotted on a world map.

Fig. 1. Overall retweet activity plotted on world map.

Footnote 1: http://beta.wunderfacts.com/
Footnote 2: http://www.nltk.org/
Footnote 3: https://dev.twitter.com/

6. RESULTS

6.1. Trends in BoW and BoL space

Figure 2 shows the (strongest) trends extracted from http://arstechnica.com/ in the BoW feature space and the BoL feature space, respectively. Plotted is a period of 100 hours in October 2011. Note the daily oscillations of retweets and news publishing activity in all three panels: as expected for a 100 hour interval, there are four bumps (one for each day) in all trend time series. The top panel depicts the trends obtained as the mean of the retweet data matrix $Y_f$ and the BoW feature matrix $X_f$, respectively. The middle panel shows the first PCA component of the retweet and BoW matrices. The retweet trend $\hat{y}_f(t)$ of BoL features clearly reflects the daily oscillations, the strongest temporal component. The BoW feature trend $\hat{x}_f(t)$ exhibits more high-frequency content. It is important to note that while PCA does capture the strongest trend for each of the modalities, retweets and news items, it fails to capture trends that are similar in their temporal fine structure. This is different for the proposed canonical trend algorithm. By design, the CT approach aims to find those trends that exhibit the strongest correlation between a combination of BoW features and a spatiotemporal deconvolution of retweets. The bottom panel in fig. 2 shows the canonical trends. While the canonical trends, like the PCA trends, clearly reflect the daily tweet and news publishing cycle, they are also very similar in their temporal fine structure.

6.2. Canonical trends predict retweets best

We directly compared the prediction accuracy of retweets based on BoW features between all three trend prediction approaches. Table 2 shows the 25th/50th/75th percentiles of correlations between mean trends, strongest PCA trend and strongest canonical trends. Canonical trend prediction of the retweet time series is consistently better than both mean trends and PCA trends.

Table 2. Results of retweet prediction accuracy.

Method   25th/50th/95th Percentile
Mean     0.01/0.02/0.05
PCA      0.03/0.08/0.14
CT       0.08/0.17/0.31

A direct comparison for all 6 news feeds is shown in a scatter plot in fig. 3. The left panel shows the retweet prediction accuracy for the mean trends on the x-axis and the CT prediction accuracy on the y-axis. The results of all news feeds fall above the iso-performance line, indicating that the mean publishing activity of a news website is not a good predictor of the overall retweet activity. The right panel in fig. 3 shows along the x-axis the prediction performance obtained with the PCA approach. Although PCA captures more co-variation of news feeds and retweet activity than the simplest mean trend approach, the canonical trend prediction is consistently better. This means that there are correlations between BoW features and spatiotemporal retweet activity that are neglected by standard topic detection techniques such as LSA. Canonical trend analysis, however, can use this information for a better prediction of activity in

the Twitter social network.

[Figure 3: two scatter plots with one point per news feed (arstechnica.com, slashdot.org, cosmiclog.msnbc.msn.com, blogs.ft.com, allthingsd.com, abcnews.go.com). y-axes: Correlation CT; x-axes: Correlation Mean (left panel) and Correlation PCA (right panel); both axes range from 0.0 to 0.4.]

Fig. 3. Comparison of the proposed canonical trend (CT) method (plotted on y-axis) with other trend detection approaches (plotted on x-axis); shown are correlations between estimated retweet trends $\hat{y}_f(t)$ and news trends $\hat{x}_f(t)$. Left: CT prediction against mean trends. Right: CT prediction against PCA trends. Note that while the overall predictability of retweets is similar across trend detection methods, the correlations between predicted and measured retweet trends are consistently higher with the CT algorithm. This suggests that there is information in the covariation of news items and retweets that can be used for predicting the spatiotemporal evolution of tweets.

6.3. Spatiotemporal retweet dynamics

The proposed CT approach allows us to visualize the spatiotemporal dynamics of the Twitter community in response to information published on news web sites. In figure 4 we show the temporal dynamics learned at the three locations with the highest absolute weights averaged over time for the news feed http://slashdot.org/: http://gadm.geovocab.org/id/1_754 (Munich Area, Germany), http://gadm.geovocab.org/id/1_1487 (Okayama, Japan) and http://gadm.geovocab.org/id/1_152 (Canberra Area, Australia). Note that the temporal structure clearly reflects the daily publishing cycle and Twitter activity, respectively. As the spatiotemporal deconvolution $w_y(\tau)$ is non-separable in space and time, it is important to keep in mind that for a complete description of the spatiotemporal coupling dynamics all locations have to be taken into account at the same time.

[Figure 2: three stacked time series panels titled Mean, PCA and Canonical Trend; x-axis: Time [hrs], with ticks at 2011-10-04T02, 2011-10-05T18 and 2011-10-07T10; legend: Retweets, BoW.]

Fig. 2. Trend time series extracted from retweet locations (blue) and BoW features (green) with three different methods from news items on http://arstechnica.com/ and retweets to these. Top: Mean (across all locations) retweet frequency and mean (across all words in the BoW dictionary) BoW feature time series; note the periodic structure reflecting the daily oscillations of both news publishing activity and retweets. Middle: First PCA component time series of retweet locations and BoW features; the strongest PCA component of retweets is just the daily oscillation. As PCA is ignorant w.r.t. the sign of the principal components, the retweet trend is anti-correlated to the BoW time series. Bottom: Canonical trends for the same data; in contrast to PCA trends, canonical trends capture more than the daily oscillations and predict the temporal structure of retweets much more accurately.

[Figure 4: line plot of convolution weights against Time lag [hrs] (0 to 90) for the locations http://gadm.geovocab.org/id/1_754, http://gadm.geovocab.org/id/1_1487 and http://gadm.geovocab.org/id/1_152.]

Fig. 4. Spatiotemporal convolution $w_y(\tau)$ for the top three locations (ranked according to absolute weight) for http://slashdot.org/. The convolution of these locations captures the daily trend oscillations. Note that one location (http://gadm.geovocab.org/id/1_1487, Okayama, Japan) has an offset of 6 hrs w.r.t. e.g. the location http://gadm.geovocab.org/id/1_754 (Munich, Germany); the time zone difference between these two locations is 7 hrs.

7. CONCLUDING DISCUSSION AND OUTLOOK

We presented a novel technique for the analysis of spatiotemporal dependencies between web sources and social networks. Empirical comparisons show that the proposed canonical trend prediction approach has clear advantages compared to traditional trend prediction approaches (see fig. 3 and table 2). We emphasize that the retweet activity could not be predicted sufficiently well for all feeds. However, there was a clear trend towards better prediction accuracy with higher publishing and retweet activity. We conjecture that this is a statistical estimation effect and that with more data the retweet prediction accuracy will improve.

One generic qualitative aspect of the social network data we analyzed is an inherent uncertainty that is unavoidable and should be kept in mind when interpreting the results: the only source of geospatial information was the Twitter user profiles. Only few users state their location in their profile, and we can never exclude that some users have entered arbitrary locations. In this paper a working assumption was that we have enough data for these noise effects to cancel out.

We would also like to mention some caveats when analyzing high dimensional retweet data, in particular in combination with news data. The most important one is the nonstationarity of retweet behaviour. If we estimate some spatiotemporal dependency pattern at the beginning of a month, it is very unlikely that the very same pattern is present at the end of that month. Human Twitter activity and the corresponding spatiotemporal dynamics are highly dynamic and change quickly. This is reflected in the highly variable results across cross-validation folds. The analysis of these nonstationarities is an important topic of future research. Another direction of research is the exploration of multi-dimensional canonical subspaces in BoW and BoL spaces. Here we focused only on the strongest topics and locations, but the extension to further topic subspaces and locations is straightforward.

8. REFERENCES

[1] Felix Bießmann and A Harth, "Analysing dependency dynamics in web data," in Proceedings of AAAI Spring Symposium, Stanford, 2010.
[2] Felix Bießmann, Jens-Michalis Papaioannou, Mikio Braun, and Andreas Harth, "Canonical trends: Detecting trend setters in web data," in ICML, 2012 (to appear).
[3] Sebastien Ardon, Amitabha Bagchi, Anirban Mahanti, Amit Ruhela, Aaditeshwar Seth, Rudra M. Tripathy, and Sipat Triukose, "Spatio-temporal analysis of topic popularity in twitter," CoRR, vol. abs/1111.2904, 2011.
[4] Haewoon Kwak, Changhyun Lee, Hosung Park, and Sue Moon, "What is twitter, a social network or a news media?," in Proceedings of the 19th international conference on World wide web, New York, NY, USA, 2010, WWW '10, pp. 591-600, ACM.
[5] Sarita Yardi and Danah Boyd, "Tweeting from the town square: Measuring geographic local networks," in ICWSM, 2010.
[6] TW Anderson, An Introduction to Multivariate Statistical Analysis, Wiley, 1958.
[7] Harold Hotelling, "Relations between two sets of variates," Biometrika, vol. 28, no. 3, pp. 321-377, 1936.
[8] C. Jordan, "Essai sur la géométrie à n dimensions," Bull. Soc. Math. France, vol. 3, pp. 103-174, 1875, Tome III, Gauthier-Villars, Paris, 1962, 79-149.
[9] TW Anderson, "Asymptotic theory for canonical correlation analysis," Journal of Multivariate Analysis, 1999.
[10] Kenji Fukumizu, Francis R Bach, and Arthur Gretton, "Statistical consistency of kernel cca," Journal of Machine Learning Research, vol. 8, pp. 361-383, 2007.
[11] Felix Bießmann, Frank C Meinecke, Arthur Gretton, Alexander Rauch, Gregor Rainer, Nikos K Logothetis, and Klaus-Robert Müller, "Temporal kernel cca and its application in multimodal neuronal data analysis," Machine Learning Journal, vol. 79, no. 1-2, pp. 5-27, 2010.
[12] Scott C. Deerwester, Susan T. Dumais, Thomas K. Landauer, George W. Furnas, and Richard A. Harshman, "Indexing by latent semantic analysis," Journal of the American Society of Information Science, vol. 41, no. 6, pp. 391-407, 1990.
[13] K Pearson, "On lines and planes of closest fit to systems of points in space," Philosophical Magazine, vol. 2, pp. 559-572, 1901.
