Sneha Saha
Student ID: 4600916
Email: S.SSaha@student.tudelft.nl
Abstract -- With the advancement and growth of web technology, a huge volume of data is available on the web to internet users, and a lot of data is generated as well. Social networking sites such as Twitter and Facebook are rapidly gaining popularity because they allow people to share and express their views about topics, hold discussions with different communities, or post messages across the world. In this paper, we investigate linguistic features for detecting the sentiment of Twitter messages related to a geo-location. The tweets were downloaded via the Twitter Streaming API (location-based approach) and filtered by geo-location (the coordinates of a city) and by location indicative words (LIWs) via feature selection. We describe a method for detecting geo-spatial tweets on social media streams: we monitor all posts on Twitter issued in a given geographic region and then analyze the sentiment of the tweets. A novel approach is adopted for automatically classifying the sentiment of Twitter messages, with a focus on sentiment polarity analysis. This is useful for consumers who want to research the sentiment of products before purchase, or companies that want to monitor the public sentiment of their brands in a region. Only geo-location-specific data was collected from Twitter to predict the sentiment of people related to that geo-location.

Keywords: Twitter, sentiment analysis, Social Media Analytics.

1. INTRODUCTION

With the ever-growing popularity of social media, massive volumes of user-generated data are produced every day, e.g. in the form of Twitter messages (tweets). A large part of these originate from private users who describe how they currently feel, what they are doing, or what is happening around them. We are only starting to understand how to leverage the potential of these real-time information streams. As cities are hubs in the global network, it can be assumed that their citizens, companies and others located in one of those cities produce a large amount of social media content. Furthermore, people from other places can mention those cities and "talk" about city-focused topics. City names are also popular in spam tweets, where they are often chained to draw attention to messages that are not city related at all. This data provides many new opportunities and challenges for natural language processing. One such challenge is predicting tweets based on geo-location. Therefore, we analyzed the tweets produced in a city within a geographic region in terms of quantity, discussed topics and their relation to city-specific properties like size or population. It is also interesting to investigate the temporal aspects of the tweets of a geo-location.

Specifically, we aim at answering the following questions:
RQ1: To what extent do geo-located tweets help in sentiment analysis within a geographic region?
RQ2: To what extent is feature extraction possible from the tweets within the region?

There has been a large amount of research in the area of sentiment classification. Traditionally, most of it has focused on classifying larger pieces of text, like reviews [1]. Tweets (and microblogs in general) are different from reviews primarily because of their purpose: while reviews represent summarized thoughts of authors, tweets are more casual and limited to 140 characters of text. Generally, tweets are not as thoughtfully composed as reviews. Yet, they still offer companies an additional avenue to gather feedback. There has been some recent work by researchers in the area of phrase-level and sentence-level sentiment classification [2]. Previous research on analyzing blog posts includes [3]. Sentiment analysis is a relatively new area, which deals with extracting user opinion automatically. An example of a positive sentiment is "natural language processing is fun"; alternatively, a negative sentiment is "it's a horrible day, I am not going outside". Objective texts, such as news headlines, are deemed not to express any sentiment, for example "company shelves wind sector plans". There are many ways in which social network data can be leveraged to give a better understanding of user opinion; such problems are at the heart of natural language processing (NLP) and data mining research.
1.1 Characteristics of Tweets

Twitter messages have many unique attributes, which differentiate them from other reviews:
Length: The maximum length of a Twitter message is 140 characters. From our training set, we calculate that the average length of a tweet is 14 words or 78 characters. This is very different from previous sentiment classification research, which focused on classifying longer bodies of work, such as movie reviews.
Data availability: Another difference is the magnitude of data available. With the Twitter API, it is very easy to collect millions of tweets for training. In past research, tests only consisted of thousands of training items.
Language model: Twitter users post messages from many different media, including their cell phones. The frequency of misspellings and slang in tweets is much higher than in other domains.
Domain: Twitter users post short messages about a variety of topics, unlike other sites, which are tailored to a specific topic. This differs from a large percentage of past research, which focused on specific domains such as movie reviews.

[...] obvious that a short time span such as a week is not sufficient to obtain fully robust data. For example, it is very likely that there were special events in some cities, while in other cities special events might have taken place a week earlier or later. Those events could influence the number of tweets produced in a day. Using the Streaming API, tweets with a valid geo-location were retrieved. The radius of the circle is based on the official size of the city area. Otherwise, coordinates from Google Maps are used to retrieve tweets for that particular city. The use of longitude and latitude to define locations is imposed by the Twitter API itself, which supports searching for tweets published within a valid geo-location.
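The data-collection passage in this section keeps only tweets whose geo-location falls inside a circle around a city's coordinates. A minimal sketch of such a filter is given below. It does not call the Twitter API; it assumes the tweets have already been downloaded and are represented as dictionaries with hypothetical "lat" and "lon" fields. The function names, the field names and the radius value are illustrative assumptions and are not taken from the original setup.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def tweets_within_radius(tweets, center_lat, center_lon, radius_km):
    """Keep tweets whose (hypothetical) 'lat'/'lon' fields lie inside the circle."""
    kept = []
    for tweet in tweets:
        lat, lon = tweet.get("lat"), tweet.get("lon")
        if lat is None or lon is None:
            continue  # no valid geo-location, so the tweet is skipped
        if haversine_km(lat, lon, center_lat, center_lon) <= radius_km:
            kept.append(tweet)
    return kept


if __name__ == "__main__":
    # Approximate coordinates of Amsterdam; the 15 km radius is an assumed value.
    sample = [
        {"text": "near the centre", "lat": 52.37, "lon": 4.90},
        {"text": "from Rotterdam", "lat": 51.9244, "lon": 4.4777},
        {"text": "no coordinates"},
    ]
    for tweet in tweets_within_radius(sample, 52.3676, 4.9041, 15.0):
        print(tweet["text"])
```

In practice the radius would be derived from the official size of the city area, as described in the text, rather than fixed by hand.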
4. RESULTS

4.1 Sentiment Analysis of the Tweets

We used the RapidMiner data mining tool for our experiments. For analysis, we create an independent testing set by randomly selecting 20% of the labeled tweets collected. The remaining 80% is used for creating classifiers using the Naive Bayes algorithm. Figure 2 shows the histogram of the sentiment distribution of the tweets.

The accuracy level also differs depending on how the dataset is split into training and test sets. Table 2 shows the accuracy level for the different splits.
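The experiments in this section were run in RapidMiner. Purely as an illustration of the procedure (a random training/test split followed by Naive Bayes classification and an accuracy measurement), the sketch below implements a small multinomial Naive Bayes with add-one smoothing in plain Python. It is not the RapidMiner configuration used for the reported numbers; the function names, the whitespace tokenization and the toy data are assumptions made for the example.

```python
import math
import random
from collections import Counter, defaultdict


def split_dataset(examples, train_fraction, seed=0):
    """Randomly split (text, label) pairs into a training and a test list."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def train_naive_bayes(train):
    """Collect the document and word counts needed by multinomial Naive Bayes."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in train:
        label_counts[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocabulary.add(word)
    return label_counts, word_counts, vocabulary


def predict(model, text):
    """Return the label with the highest log-probability for the given text."""
    label_counts, word_counts, vocabulary = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, doc_count in label_counts.items():
        score = math.log(doc_count / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocabulary:
                continue  # words never seen in training are ignored
            count = word_counts[label][word]
            # add-one (Laplace) smoothing
            score += math.log((count + 1) / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def accuracy(model, test):
    """Fraction of test examples whose predicted label matches the true label."""
    correct = sum(1 for text, label in test if predict(model, text) == label)
    return correct / len(test) if test else 0.0


if __name__ == "__main__":
    # Hypothetical toy data, only to show the calls; not the paper's corpus.
    toy = [
        ("great fun day", "positive"), ("love this city", "positive"),
        ("nice holiday fun", "positive"), ("horrible bad day", "negative"),
        ("hate this weather", "negative"), ("awful bad traffic", "negative"),
    ]
    for fraction in (0.5, 0.66, 0.8):
        train, test = split_dataset(toy, fraction)
        model = train_naive_bayes(train)
        print(fraction, accuracy(model, test))
```

Running the loop with different training fractions mirrors the comparison of split ratios; with a toy corpus this small the resulting accuracies are not meaningful.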
Accuracy       50-50     66-34     80-20
Naive Bayes    96.80%    96.85%    98.30%

Table 2: Accuracy level for each split ratio.

4.2 Topic Extraction of the Tweets

To determine the main topics of the tweets retrieved using the Streaming API, we investigated the hashtags in the tweets related to the city. We filtered out hashtags built from city names, since they were already used to find those tweets and would automatically be the most frequent hashtags in our dataset. As the data was collected at the end of December 2016, the time span of data collection has a huge effect on topic extraction. In our dataset, the most frequently used hashtags are #vacation, #holiday, #2017 and #2016. There are also very common hashtags like #news, #job and #radio. Another trend can be detected: many hashtags are related to sports, especially sports clubs or sports events. The hashtag analysis also revealed that the tweets contain hashtags of other cities. In our dataset we have hashtags of Paris and Frankfurt even though we collected tweets from Amsterdam. Further investigation revealed that this may be because some people use those hashtags when they plan to visit the city. There are also spam tweets that simply chain hashtags of different cities to reach a greater audience with their advertisement.

CONCLUSION AND FUTURE WORK

In this paper, we have presented the process of classifying the sentiment of tweets. We introduced the idea of using a list of sentiment words plus emoticons as features to represent and to label tweets for training data. We also include a neutral classification of tweets in our corpus. Experiments were conducted on tweets collected from the geographic location Amsterdam. Based on our approach and experimental results, we observe that the integration of document filtering and document indexing techniques with our approach may provide one viable way towards the development of effective systems for tweet analysis.

Our future work includes context-based tweet search. Of course, not every tweet containing more than one hashtag can be classified as a spam tweet. Therefore, to achieve an adequate result, an algorithm based on machine learning techniques needs to be used. Also, the size and population of the city could be a point of research for the tweets collected from a geographic location. To come to an adequate result in this field, it is mandatory to create a detailed analysis of more cities including all influential factors. By this we hope to find distinct indicators of the activities, size and population of a region.

REFERENCES

[1] B. Pang, L. Lee, and S. Vaithyanathan. Thumbs up? Sentiment classification using machine learning techniques. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 79-86, 2002.

[2] T. Wilson, J. Wiebe, and P. Hoffmann. Recognizing contextual polarity in phrase-level sentiment analysis. In Proceedings of Human Language Technologies Conference/Conference on Empirical Methods in Natural Language Processing (HLT/EMNLP 2005), Vancouver, Canada, 2005.

[3] G. Mishne. Experiments with mood classification in blog posts. In 1st Workshop on Stylistic Analysis of Text for Information Access, 2005.

[4] T. Wilson, J. Wiebe, and P. Hoffmann. Recognizing contextual polarity in phrase-level sentiment analysis. In Human Language Technology and Empirical Methods in Natural Language Processing. Stroudsburg, PA, USA: Association for Computational Linguistics, 2005, pp. 347-354.

[5] T. Honkela, Z. Izzatdust, and K. Lagus. Text mining for wellbeing: Selecting stories using semantic and pragmatic features. In Artificial Neural Networks and Machine Learning, Part II, ser. LNCS. Springer, 2012, vol. 7553, pp. 467-474.

[6] A. Tumasjan, T. O. Sprenger, P. G. Sandner, and I. M. Welpe. Predicting elections with Twitter: What 140 characters reveal about political sentiment. In International AAAI Conference on Weblogs and Social Media, 2010, pp. 178-185.

[7] G. Mishne and N. S. Glance. Predicting movie sales from blogger sentiment. In AAAI Spring Symposium: Computational Approaches to Analyzing Weblogs, 2006, pp. 155-158.

[8] B. Pang and L. Lee. Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval, vol. 2, no. 1-2, pp. 1-135, 2007.

[9] M. Thelwall, K. Buckley, G. Paltoglou, D. Cai, and A. Kappas. Sentiment strength detection in short informal text. J. Am. Soc. Inf. Sci. Technol., vol. 61, no. 12, pp. 2544-2558, Dec. 2010.

[10] R. Socher, A. Perelygin, J. Wu, J. Chuang, C. D. Manning, A. Ng, and C. Potts. Recursive deep models for semantic compositionality over a sentiment treebank. In Empirical Methods in Natural Language Processing. Association for Computational Linguistics, October 2013, pp. 1631-1642.