
Feature Extraction of Geo-tagged Tweets

Sneha Saha
Student ID: 4600916
Email: S.SSaha@student.tudelft.nl

Abstract -- With the advancement and growth of web technology, a huge volume of data is available on the web and a great deal of new data is generated by internet users every day. Social networking sites such as Twitter and Facebook are rapidly gaining popularity because they allow people to share and express their views about topics, hold discussions with different communities, and post messages across the world. In this paper, we investigate linguistic features for detecting the sentiment of Twitter messages related to a geolocation. The tweets were downloaded via the Twitter Streaming API (a location-based approach) and filtered by geo-location (the coordinates of a city) and by location indicative words (LIWs) via feature selection. We describe a method for detecting geo-spatial tweets on social media streams: we monitor all posts on Twitter issued in a given geographic region and then analyse their sentiment. A novel approach is adopted for automatically classifying the sentiment of Twitter messages, and the focus of this paper is therefore sentiment polarity analysis. This is useful for consumers who want to research the sentiment of products before purchase, or for companies that want to monitor the public sentiment of their brands in a region. Only geo-location specific data has been collected from Twitter to predict the sentiment of people related to that geo-location.

Keywords: Twitter, sentiment analysis, social media analytics.

1. INTRODUCTION

With the ever-growing popularity of social media, massive volumes of user-generated data are produced every day, e.g. in the form of Twitter messages (tweets). A large part of these originate from private users who describe how they currently feel, what they are doing, or what is happening around them. We are only starting to understand how to leverage the potential of these real-time information streams. As cities are hubs in a global network, it can be assumed that their citizens, companies and others located in one of those cities produce a high amount of social media content. Furthermore, people from other places can mention those cities and "talk" about city-focused topics. City names are also popular in spam tweets and are often chained together to draw attention to messages that are not city related at all. This data provides many new opportunities and challenges for natural language processing. One such challenge is predicting tweets based on geolocation. We therefore analysed the tweets produced within a geographic region in terms of quantity, discussed topics and city-specific properties such as size or population. It is also interesting to investigate the temporal aspects of tweets from a geo-location.

Specifically, we aim at answering the following questions:
RQ1: To what extent do geo-located tweets help in sentiment analysis within a geographic region?
RQ2: To what extent is feature extraction possible from the tweets within the region?

There has been a large amount of research in the area of sentiment classification. Traditionally most of it has focused on classifying larger pieces of text, such as reviews [1]. Tweets (and microblogs in general) are different from reviews primarily because of their purpose: while reviews represent the summarized thoughts of their authors, tweets are more casual and limited to 140 characters of text. Generally, tweets are not as thoughtfully composed as reviews, yet they still offer companies an additional avenue to gather feedback. There has also been recent work on phrase-level and sentence-level sentiment classification [2], and previous research on analyzing blog posts includes [3]. Sentiment analysis is a relatively new area which deals with extracting user opinion automatically. An example of a positive sentiment is "natural language processing is fun"; a negative sentiment is "it's a horrible day, I am not going outside". Objective texts are deemed not to express any sentiment, such as news headlines, for example "company shelves wind sector plans". There are many ways in which social network data can be leveraged to give a better understanding of user opinion; such problems are at the heart of natural language processing (NLP) and data mining research.
1.1 Characteristics of Tweets

Twitter messages have several attributes that differentiate them from other reviews:

Length. The maximum length of a Twitter message is 140 characters. From our training set, we calculate that the average length of a tweet is 14 words or 78 characters. This is very different from previous sentiment classification research, which focused on classifying longer bodies of text such as movie reviews.

Data availability. Another difference is the magnitude of data available. With the Twitter API, it is very easy to collect millions of tweets for training. In past research, test collections often consisted of only thousands of training items.

Language model. Twitter users post messages from many different media, including their cell phones. The frequency of misspellings and slang in tweets is much higher than in other domains.

Domain. Twitter users post short messages about a variety of topics, unlike other sites which are tailored to a specific topic. This differs from a large percentage of past research, which focused on specific domains such as movie reviews.

1.2 Defining Sentiment

Sentiment analysis is the task of identifying positive and negative opinions, emotions, and evaluations [4]. Since its inception, sentiment analysis has been the subject of an intensive research effort and has been successfully applied, e.g., to assist users in their development by providing them with interesting and supportive content [5], and to predict the outcome of an election [6] or movie sales [7]. The spectrum of sentiment analysis techniques ranges from identifying polarity (positive or negative) to a complex computational treatment of subjectivity, opinion and sentiment [8]. In particular, research on sentiment polarity analysis has resulted in a number of mature and publicly available tools such as SentiStrength [9], Alchemy, the Stanford NLP sentiment analyzer [10] and NLTK [11].
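As an illustration of tool-based polarity analysis, the sketch below scores the example sentences from the introduction with NLTK's VADER analyzer. The choice of VADER is our assumption for illustration; the paper only names NLTK [11] as one of several available tools.

# Polarity scoring with NLTK's VADER analyzer (assumed here for illustration).
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download("vader_lexicon")  # one-time download of the sentiment lexicon
sia = SentimentIntensityAnalyzer()

for text in ["natural language processing is fun",
             "it's a horrible day, i am not going outside"]:
    scores = sia.polarity_scores(text)   # keys: 'neg', 'neu', 'pos', 'compound'
    print(text, "->", scores["compound"])
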


2. DATA COLLECTION AND PREPROCESSING

To examine the tweet characteristics, the Twitter API was used for the data extraction process. We extracted tweets carrying geo information within a 5 km radius of the city areas (Figure 1). All tweets that matched our criteria and were published on Twitter during December 2016 were collected. It is obvious that a short time span like a week is not sufficient to obtain fully robust data: it is very likely that there were special events in some cities during that week, while in other cities special events might have taken place a week earlier or later. Such events could influence the number of tweets produced in a day. Using the Streaming API, tweets were retrieved which had a valid geo-location. The radius of the circle is based on the official size of the city area; otherwise, coordinates from Google Maps are used to retrieve tweets for that particular city. The use of longitude and latitude to define locations is imposed by the Twitter API itself, which supports searching for tweets published within a valid geo-location.

Figure 1. Geographical plot of the locations from which the tweets were collected.

In the context of Twitter, only about 1% of tweets are geo-referenced. Watanabe et al. [12] therefore try to assign geo-coordinates to non-geotagged tweets to increase the chance of finding localized events. They then search for place names and count the number of key terms that co-occur with each place name. Nevertheless, their method fails to find localized events when no places are mentioned in the tweets.
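To make the 5 km radius criterion described above concrete, the sketch below keeps only tweets whose coordinates fall within that distance of a city centre, using the haversine formula. The Amsterdam centre coordinates and the (text, lat, lon) tweet tuples are assumptions for illustration; the paper only states the radius criterion.

# Filter already-downloaded, geo-tagged tweets to those within 5 km of a city
# centre. Centre coordinates and tweet tuples are assumed for illustration.
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points in kilometres."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))

CITY_CENTRE = (52.3702, 4.8952)   # assumed centre of Amsterdam
RADIUS_KM = 5.0

def within_city(lat, lon):
    return haversine_km(CITY_CENTRE[0], CITY_CENTRE[1], lat, lon) <= RADIUS_KM

tweets = [("hello from the canals", 52.37, 4.89), ("far away", 48.85, 2.35)]
local = [t for t in tweets if within_city(t[1], t[2])]
print(local)
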
With the large range of topics discussed on Twitter, it would be very difficult to manually collect enough data to train a sentiment classifier for tweets. Our solution is to use a model-based approach built on machine learning techniques, in which each tweet is represented by a TF-IDF vector. Before processing, data cleaning is necessary for the tweet data; the tokenization process from [13] and [14] was followed for the preprocessing task. The steps included the removal of any URLs and usernames (usernames follow the @ symbol) and the removal of characters that repeat more than twice, turning a phrase such as "OOMMMGGG" into "OOMMGG", which is applied with a regular expression. The tweets collected from a city can be in various languages, so all collected tweets are first translated to English. For tweet processing, all tweets are converted to lowercase and a filter is applied to remove basic common English words like "is", "the", etc. Furthermore, any word shorter than 3 characters or longer than 20 characters is removed. Finally, the Porter stemming algorithm is applied to reduce each word to its base form, e.g. "taken" and "took" to the base form "take".
3. APPROACHES FOR SENTIMENT ANALYSIS

The resulting term-document matrix is used to train a classifier (Naive Bayes) inside a validation operator. The classifier first has to be trained on a training dataset and can then be used to classify documents. If the set of training examples of positive and negative tweets is chosen correctly, the classifier predicts the class probabilities of the actual documents with an accuracy similar to that obtained on the training examples. After classification, each tweet is labelled along two dimensions: polarity and subjectivity. Polarity denotes whether the tweet carries neutral, positive or negative sentiment, while subjectivity classifies a tweet as objective or subjective by comparison with the training dataset. The objective/subjective comparison shows that interjections and personal pronouns are strong indicators of subjective texts, whilst common and proper nouns are indicators of objective texts. Subjective texts are often written in the first or second person in the past tense, whilst objective texts are often written in the third person.
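The sketch below reproduces this setup with scikit-learn in place of the RapidMiner validation operator: a TF-IDF term-document matrix feeds a Naive Bayes classifier that outputs class probabilities. The library choice and the tiny inline training set are assumptions for illustration.

# Term-document matrix + Naive Bayes classifier, sketched with scikit-learn
# as a stand-in for the RapidMiner workflow used in the paper.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny illustrative training set (assumed); the real training data would be
# the labelled tweets described in Section 2.
train_texts = ["i love this city", "great day at the park",
               "horrible traffic again", "i hate waiting in line"]
train_labels = ["positive", "positive", "negative", "negative"]

model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

probs = model.predict_proba(["what a great horrible day"])  # class probabilities
print(dict(zip(model.classes_, probs[0].round(3))))
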
3.1 Emoticons

Since the training process makes use of emoticons as noisy labels, it is crucial to discuss the role they play in classification. (We discuss our training and test sets in detail in the Results section.) We strip the emoticons out from our training data. If we leave the emoticons in, there is a negative impact on Naive Bayes; the difference lies in the mathematical models and feature weight selection of MaxEnt and SVM. Stripping out the emoticons causes the classifier to learn from the other features (e.g. unigrams and bigrams) present in the tweet, and the classifier uses these non-emoticon features to determine the sentiment. This is an interesting side-effect of our approach: if the test data contains an emoticon, it does not influence the classifier, because emoticon features are not part of its training data. This is a current limitation of our approach, because it would be useful to take emoticons into account when classifying test data. We consider emoticons noisy labels because they are not perfect at defining the correct sentiment of a tweet. This can be seen in the following tweet: "@BATMANNN :( i love chutney....". Without the emoticon, most people would probably consider this tweet to be positive. Tweets with these types of mismatched emoticons are nevertheless used to train our classifiers because they are difficult to filter out from our training data.
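A small sketch of the emoticon-stripping step is shown below; the emoticon patterns listed are an assumption, since the paper does not enumerate which emoticons were used as noisy labels.

# Strip a small set of emoticons from training tweets so the classifier must
# rely on non-emoticon features. The emoticon list is assumed for illustration.
import re

EMOTICON_RE = re.compile(r"[:;=][-']?[)(DPp]|<3")

def strip_emoticons(text: str) -> str:
    return EMOTICON_RE.sub("", text).strip()

print(strip_emoticons("@BATMANNN :( i love chutney...."))
# -> "@BATMANNN  i love chutney...." (emoticon removed)
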
3.2 Natural Language Processing

The first step in analyzing the tweets is natural language processing (NLP). The techniques used to classify the tweets all focus on a bag-of-words approach. As such, the NLP portion of the project focused on developing an efficient way to get the best bag of words from any given tweet. Initially, we eliminate obvious stop words that do not contribute to the content of a tweet; this list includes words such as "a", "I", "she", etc. Afterwards, we define words as strings of alphabetic characters with whitespace on both sides. Note that this ignores things such as numbers and emoticons. Once the set of words in each tweet of the training data has been computed, mutual information is used to determine the words that provide the most insight into the content of the tweets. For our purposes we used about 1000 words, and each tweet was then characterized by which of these 1000 words appeared. For example, for the tweet "I am not happy. He is not happy" and the mutual-information words "not" and "happy", the tweet would be characterized as [not, happy]. Note that the number of times a word appears is not taken into account. Once a tweet has been characterized by the above steps, it is passed along to the machine learning portion of the classification.
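The sketch below illustrates the mutual-information step with scikit-learn: binary presence features are scored against the labels and the top-k words are kept. The library, the small k, and the toy data are assumptions; the paper's actual cut-off is about 1000 words.

# Mutual-information selection of the most informative words, followed by a
# binary presence characterization of each tweet.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import mutual_info_classif

texts = ["i am not happy", "he is not happy", "such a great day", "great fun today"]
labels = [0, 0, 1, 1]   # 0 = negative, 1 = positive (toy labels)

vec = CountVectorizer(binary=True)          # word presence only, counts ignored
X = vec.fit_transform(texts)
mi = mutual_info_classif(X, labels, discrete_features=True, random_state=0)

top_k = 2                                    # the paper keeps roughly 1000 words
vocab = np.array(vec.get_feature_names_out())
keep = vocab[np.argsort(mi)[::-1][:top_k]]
print("selected words:", keep)

# Characterize a tweet by which of the selected words appear in it
tweet_words = set("i am not happy he is not happy".split())
print([w for w in keep if w in tweet_words])
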
3.3 Machine Learning Approach

The machine-learning-based approach uses classification techniques to classify text into classes. There are two main types of machine learning techniques:

3.3.1 Unsupervised learning: The data do not come with category labels and no correct targets are provided at all; such methods therefore rely on clustering.

3.3.2 Supervised learning: The method is based on a labelled dataset, so the labels are provided to the model during training. The labelled data are used to train the model to produce meaningful outputs when new instances are encountered during decision-making.

The success of both learning methods mainly depends on the selection and extraction of the specific set of features used to detect sentiment. The machine learning approaches applicable to sentiment analysis mainly belong to supervised classification, for which two sets of data are needed:
1. Training set
2. Test set

A number of machine learning techniques have been formulated to classify tweets into classes. Techniques such as Naive Bayes (NB), maximum entropy (ME), and support vector machines (SVM) have achieved great success in sentiment analysis. Machine learning starts with collecting a training dataset, on which a classifier is then trained. Once a supervised classification technique is selected, an important decision is the choice of features, which determine how documents are represented. The most commonly used features in sentiment classification are:
- Term presence and frequency
- Part-of-speech information
- Negations
- Opinion words and phrases

With respect to supervised techniques, support vector machines (SVM), Naive Bayes and maximum entropy are some of the most commonly used. For our classification we have used the Naive Bayes mechanism.

4. RESULTS

In this section we interpret the results of our analysis by dividing them into subsections that relate to the research questions asked in the introduction.

4.1 Sentiment Analysis of the Tweets

We used the RapidMiner data mining tool for our experiments. For the analysis, we created an independent test set by randomly selecting 20% of the labelled tweets collected; the remaining 80% was used to build classifiers with the Naive Bayes algorithm. Before processing, all unwanted data were cleaned using RapidMiner. A sample of the cleaned tweet data is shown in Table 1.

Original tweet | Cleaned tweet
@godavar what an anti-national. Why doesn't he use PayTM? | what anti-national. Why does not he use PayTM
@kayotickitchen get here so bitter | get here so bitter
@Schiphol That is a gross error in personnel. A lot of people stand in line and in danger of missing their flight. Scandalous! | gross error personnel. lot of people stand line and danger of missing their flight Scandalous
@RoelieBreidoos @POL_ASchut The values are a bit lost in some teenagers! | values bit lost some teenagers
@AliceAvizandum @actuallyalice I'm guessing it has something to do with breast cancer risk in women | I am guessing it has something to do with breast cancer risk in woman

Table 1. Examples of tweets before and after data cleaning.

Each tweet is assigned a polarity confidence value and a subjectivity value based on the probability calculated by the algorithm for each instance of each class. Accordingly, each tweet is assigned the polarity "neutral", "positive" or "negative". From our analysis of 3000 tweets, approximately 50% of the tweets are neutral in sentiment (Figure 2). This is consistent with Pang and Lee [15], who argue that the objective (neutral) sentences of a text are less informative; they therefore filter them out and focus only on the subjective statements in order to improve the binary classification.

Figure 2. Histogram of the sentiment distribution of the tweets.

The accuracy level also differs depending on how the data are split into training and test sets. Table 2 shows the accuracy obtained for different splits.

Classifier | 50-50 | 66-34 | 80-20
Naive Bayes | 96.80% | 96.85% | 98.30%

Table 2. Accuracy of the Naive Bayes classifier for different train-test split ratios.
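For concreteness, the sketch below shows how the split-dependent accuracy in Table 2 can be reproduced in principle: the labelled tweets are split at different ratios and a Naive Bayes model is scored on the held-out part. scikit-learn and the toy data are assumptions; the paper performed the equivalent steps in RapidMiner.

# Accuracy of a Naive Bayes classifier under different train/test splits,
# mirroring Table 2. The toy texts/labels are placeholders for the labelled
# tweets described in Section 4.1.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = ["i love this city", "great vacation", "what a fun event",
         "horrible delay at schiphol", "i hate this traffic", "so bitter today",
         "the station is open", "new timetable published", "meeting at noon",
         "weather update for tonight"]
labels = ["positive", "positive", "positive",
          "negative", "negative", "negative",
          "neutral", "neutral", "neutral", "neutral"]

for test_size in (0.50, 0.34, 0.20):   # 50-50, 66-34 and 80-20 splits
    X_tr, X_te, y_tr, y_te = train_test_split(
        texts, labels, test_size=test_size, random_state=42)
    model = make_pipeline(TfidfVectorizer(), MultinomialNB())
    model.fit(X_tr, y_tr)
    acc = accuracy_score(y_te, model.predict(X_te))
    print(f"{1 - test_size:.0%}-{test_size:.0%} split: accuracy = {acc:.2%}")
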
4.2 Topic Extraction of the Tweets

To determine the main topics of the tweets collected via the Streaming API, we investigated the hashtags in the tweets related to the city. We filtered out hashtags built from city names, since they were already used to find those tweets and would automatically be the most frequent hashtags in our dataset. As the data were collected at the end of December 2016, the time span of data collection has a strong effect on topic extraction: the most frequently used hashtags in our dataset are #vacation, #holiday, #2017 and #2016. There are also very common hashtags like #news, #job and #radio. Another trend can be detected: many hashtags are related to sports, especially sport clubs or sport events. The hashtag analysis also revealed that the tweets contain hashtags of other cities; our dataset contains hashtags of Paris and Frankfurt even though we collected tweets from Amsterdam. Further investigation revealed that this may be because some people use those hashtags when they plan to visit the city. There are also spam tweets that simply chain hashtags of different cities together to reach a greater audience with their advertisement.
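A minimal sketch of this hashtag analysis is given below: hashtags are extracted with a regular expression, city-name hashtags are discarded, and the remaining tags are counted. The city list and example tweets are assumptions for illustration.

# Count hashtags across the collected tweets, excluding hashtags built from
# the city names used for collection.
import re
from collections import Counter

CITY_TAGS = {"#amsterdam"}   # hashtags derived from the collection city (assumed)

def hashtags(text: str) -> list[str]:
    return [h.lower() for h in re.findall(r"#\w+", text)]

tweets = ["Last days of #2016, off on #vacation! #Amsterdam",
          "#news #job openings this week",
          "#holiday plans for #2017 #amsterdam"]

counts = Counter(h for t in tweets for h in hashtags(t) if h not in CITY_TAGS)
print(counts.most_common(5))
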
CONCLUSION AND FUTURE WORK

In this paper, we have presented a process for classifying the sentiment of tweets. We introduced the idea of using a list of sentiment words plus emoticons as features to represent tweets and to label them as training data, and we also include a neutral class in our corpus. Experiments were carried out on tweets collected from the geographic location of Amsterdam. Based on our approach and experimental results, we observe that integrating document filtering and document indexing techniques with our approach may provide one viable way towards effective systems for tweet analysis.

Our future work includes context-based tweet search. Of course, not every tweet containing more than one hashtag can be classified as a spam tweet; to achieve adequate results here, a machine-learning-based algorithm needs to be used. The size and population of a city could also become a subject of research based on the tweets collected from a geographic location. To reach adequate results in this field it is necessary to carry out a detailed analysis of more cities, including all influential factors. By this we hope to find distinct indicators of the activities, size and population of a region.

REFERENCES

[1] B. Pang, L. Lee, and S. Vaithyanathan. Thumbs up? Sentiment classification using machine learning techniques. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 79-86, 2002.

[2] T. Wilson, J. Wiebe, and P. Hoffmann. Recognizing contextual polarity in phrase-level sentiment analysis. In Proceedings of the Human Language Technologies Conference/Conference on Empirical Methods in Natural Language Processing (HLT/EMNLP 2005), Vancouver, CA, 2005.

[3] G. Mishne. Experiments with mood classification in blog posts. In 1st Workshop on Stylistic Analysis of Text for Information Access, 2005.

[4] T. Wilson, J. Wiebe, and P. Hoffmann. Recognizing contextual polarity in phrase-level sentiment analysis. In Human Language Technology and Empirical Methods in Natural Language Processing. Stroudsburg, PA, USA: Association for Computational Linguistics, 2005, pp. 347-354.

[5] T. Honkela, Z. Izzatdust, and K. Lagus. Text mining for wellbeing: Selecting stories using semantic and pragmatic features. In Artificial Neural Networks and Machine Learning, Part II, ser. LNCS, vol. 7553. Springer, 2012, pp. 467-474.

[6] A. Tumasjan, T. O. Sprenger, P. G. Sandner, and I. M. Welpe. Predicting elections with Twitter: What 140 characters reveal about political sentiment. In International AAAI Conference on Weblogs and Social Media, 2010, pp. 178-185.

[7] G. Mishne and N. S. Glance. Predicting movie sales from blogger sentiment. In AAAI Spring Symposium: Computational Approaches to Analyzing Weblogs, 2006, pp. 155-158.

[8] B. Pang and L. Lee. Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval, vol. 2, no. 1-2, pp. 1-135, 2008.

[9] M. Thelwall, K. Buckley, G. Paltoglou, D. Cai, and A. Kappas. Sentiment strength detection in short informal text. J. Am. Soc. Inf. Sci. Technol., vol. 61, no. 12, pp. 2544-2558, Dec. 2010.

[10] R. Socher, A. Perelygin, J. Wu, J. Chuang, C. D. Manning, A. Ng, and C. Potts. Recursive deep models for semantic compositionality over a sentiment treebank. In Empirical Methods in Natural Language Processing. Association for Computational Linguistics, October 2013, pp. 1631-1642.

[11] S. Bird, E. Loper, and E. Klein. Natural Language Processing with Python. O'Reilly Media Inc., 2009.

[12] K. Watanabe, M. Ochi, M. Okabe, and R. Onai. Jasmine: a real-time local-event detection system based on geolocation information propagated to microblogs. In CIKM '11, pp. 2541-2544, 2011.

[13] A. Go, R. Bhayani, and L. Huang. Twitter sentiment classification using distant supervision, 2009.

[14] A. Pak and P. Paroubek. Twitter as a corpus for sentiment analysis and opinion mining. In: Calzolari, N. (Conference Chair), Choukri, K., Maegaard, B., Mariani, J., Odijk, J., Piperidis, S., Rosner, M., Tapias, D. (eds.) Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10). European Language Resources Association (ELRA), Valletta, Malta, May 2010.

[15] B. Pang and L. Lee. A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts. In Proceedings of the ACL, 2004. www.cs.cornell.edu/home/llee/papers/cutsent.pd
