Introduction
Opinion mining, also called sentiment analysis, has become a very ac-
Methods
2.1
Sentiment analysis, also called opinion mining, is the field of study that
analyses people's opinions, sentiments, evaluations, appraisals, attitudes,
and emotions towards entities such as products, services, organizations,
individuals, issues, events, topics, and their attributes [2]. Sentiments can
be expressed at the entity level or at the aspect level. Entity-level sentiment
analysis looks at the target of an opinion as a whole. For example, "The new
iPhone is excellent" expresses an opinion on the product as a whole.
Aspect-level sentiment analysis looks at individual features of a product.
For instance, in "This car is very quiet but it uses a lot of petrol", the
opinion targets are two features of the car: the noise it produces and its
fuel consumption.
At the entity level, an opinion is a quadruple, (e, s, h, t), where e is the
entity, the opinion target, s is the sentiment, h is the sentiment holder and
t is the time when the sentiment was expressed.
At the aspect level, a sentiment is a quintuple, (e, a, s, h, t), where the
additional element a is the aspect. Here e and a are the opinion targets.
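The two tuple definitions above can be sketched as a small data structure. This is only an illustration: the class name `Opinion`, its field names and the sample values (holders, dates) are assumptions and do not come from the original study.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Opinion:
    """An opinion as the tuple (e, a, s, h, t); a is None at the entity level."""
    entity: str                   # e: the entity the opinion is about
    sentiment: str                # s: e.g. "positive" or "negative"
    holder: str                   # h: who holds the sentiment
    time: str                     # t: when the sentiment was expressed
    aspect: Optional[str] = None  # a: the aspect, only set at the aspect level

    def is_aspect_level(self) -> bool:
        return self.aspect is not None


# "The new iPhone is excellent" as an entity-level quadruple (hypothetical holder/time)
entity_level = Opinion("iPhone", "positive", "user_1", "2014-01-01")

# "This car is very quiet but it uses a lot of petrol" as two aspect-level quintuples
aspect_level = [
    Opinion("car", "positive", "user_2", "2014-01-02", aspect="noise"),
    Opinion("car", "negative", "user_2", "2014-01-02", aspect="fuel consumption"),
]
```

One sentence can therefore yield several quintuples, one per aspect it mentions.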
Regular opinions express sentiments on an entity or a feature directly,
whereas comparative opinions compare multiple entities based on some
of their aspects. For example, "BMW makes more ecologic cars than
Mercedes." Here the cars are compared based on their consumption of
resources.
Sentiment words such as "excellent", "good" and "poor" are the most important
indicators of sentiments. However, there are other ways of expressing
sentiments. Phrases or idioms without sentiment words can express opinions,
for instance "This car cost me an arm and a leg" or "This mattress had a
valley after one month".
Opinions can also be expressed using verbs, for example "This TV sucks" or
"This washer uses a lot of water". However, the sentence "This hoover really
sucks!" expresses a positive opinion.
Detecting the real meaning, whether a statement is meant positively or
negatively, is one of the challenges of opinion mining. Another is spam
detection. Spammers spread unsolicited or malicious messages. Social
spammers have become rampant and the volume of spam has increased
dramatically [12]. Social media (SM) can be accessed from anywhere in the
world and users can express their opinions without disclosing their identity
and without fear of consequences. While this is a highly desirable feature
in some cases, it also allows people with malicious intent or hidden agendas
to post fake opinions to promote or discredit products, companies or
individuals. Such individuals are called opinion spammers and their
activities are called opinion spamming [2]. Advances have been made in
spam detection, but these techniques are beyond the scope of this article.
Detecting sarcasm automatically is very difficult. Consider the sentence
"The government wants to legalize marijuana, oh great!": without further
context it is impossible to say whether the person is in favor of
legalization or is being sarcastic about it. Several studies have tried to
detect sarcasm. The most accurate of them detected it in only 57% of the
cases [2].
These are only some of the challenges faced in sentiment analysis, but they
illustrate that opinion mining is highly domain-specific and that, as with
most data mining tasks, it usually starts with understanding the domain.
The examples given here are all in English. Opinions are also expressed in
SM in other languages, which pose different challenges.
2.2
Preconditioning
multi-modal in nature, including content such as images, audio, and videos,
concept such as discussion topic, tag, and annotation, and context such as
links, profile, timestamp, and click-through [11]. The data first has to be
cleaned of irrelevant content. This cleaning process, or data
preconditioning, can involve stop-word removal. Stop-words are words like
"the" and "is" in sentences such as "The new Panasonic GM1 camera is
excellent!". They only have grammatical significance and are thus
eliminated. Special characters such as "!", brackets or smileys are usually
removed as well. In addition, words are converted into their canonical form
using stemming algorithms. Stemming algorithms such as unsupervised
morpheme segmentation find the word stem from inflected or derived
words. For instance, "sucks" becomes "suck". Lemmatization might
be performed for further analysis. It is the process of grouping together
the different inflected forms of a word, so they can be analyzed as a single
item [15].
An often used preconditioning task is creating a bag-of-words. Bag-of-words
based approaches model news articles by a vector space model, which
translates each news piece into a vector of word statistical measurements,
such as the number of occurrences [16]. A bag-of-words
2.3
Sentiment lexicons
As shown in the previous section, words that express positive or negative
sentiments are essential for opinion mining. The examples also showed that
these words are highly domain-specific. Words can bear different meanings
depending on whether the opinion is expressed about a car, a mobile
phone or a mattress. Sentiment words are not the only important words. The
sentiment strength can be altered using sentiment modifiers such as "very".
For instance, in "The new iPhone is very good", "good" is the sentiment
word and "very" is the modifier.
Sentiment words can be base types such as "good" or "bad" or comparative
types such as "better" or "worse".
Negators such as "not" are also important since they can change the
sentiment to the opposite, for instance "The new Chevrolet is not great."
They are called sentiment polarity shifters. They can also change the
opinion in a positive way, for instance "The new Mercedes doesn't suck."
A common way of performing opinion mining is creating a list of sentiment
words, a sentiment lexicon, and using it to analyse the opinion texts.
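A minimal sketch of lexicon-based scoring with modifiers and polarity shifters, under simplifying assumptions: the word lists, the weights and the rule that a modifier or negator applies to the next sentiment word are all chosen here for illustration and are not taken from a published lexicon.

```python
import re

# Hypothetical toy lexicon: sentiment word -> polarity weight
LEXICON = {"good": 1.0, "great": 1.0, "excellent": 2.0,
           "bad": -1.0, "poor": -1.0, "suck": -1.0}
MODIFIERS = {"very": 1.5, "really": 1.5}        # scale the next sentiment word
NEGATORS = {"not", "doesn't", "isn't", "never"}  # flip the next sentiment word


def score(text: str) -> float:
    """Sum the weights of sentiment words, applying modifiers and negators."""
    tokens = re.findall(r"[a-z']+", text.lower())
    total, weight, negate = 0.0, 1.0, False
    for tok in tokens:
        if tok in NEGATORS:
            negate = True
        elif tok in MODIFIERS:
            weight *= MODIFIERS[tok]
        elif tok in LEXICON:
            value = LEXICON[tok] * weight
            total += -value if negate else value
            weight, negate = 1.0, False  # reset after each sentiment word
    return total


print(score("The new iPhone is very good"))     # 1.5
print(score("The new Chevrolet is not great."))  # -1.0
print(score("The new Mercedes doesn't suck."))   # 1.0
```

A fixed lexicon like this one cannot capture domain-specific meanings, which is why the weights would have to be adapted per domain.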
Compiling sentiment lexicons can be done manually. However, this is
labour-intensive and usually an automatic approach is preferred.
Dictionaries such as WordNet (http://wordnet.princeton.edu/) or
Dictionary.com (http://dictionary.reference.com/) list synonyms and
antonyms of words. They can be used to automatically generate sentiment
lexicons. This approach works as follows. A small set of seed sentiment
words is compiled manually. Starting from the seed words, an algorithm
searches the online dictionary for synonyms and antonyms, which are added
to the word list. The search is repeated iteratively until no more sentiment
words can be found. Some sentiment lexicons also weight sentiment words.
For instance, "excellent" is stronger than "good". Weighted lexicons are
useful when posts are not just analysed for positive or negative opinions
but divided into multiclass sentiment categories such as "good reviews" or
"very good reviews".
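The iterative dictionary-based expansion described above can be sketched as a breadth-first search. The `SYNONYMS` and `ANTONYMS` tables below are a hypothetical toy thesaurus standing in for WordNet or an online dictionary; no real dictionary API is called. Synonyms inherit the polarity of the word they were reached from, antonyms receive the opposite polarity.

```python
from collections import deque

# Hypothetical toy thesaurus (assumption); a real system would query a dictionary
SYNONYMS = {"good": ["fine", "great"], "great": ["excellent"],
            "bad": ["poor"], "poor": ["awful"]}
ANTONYMS = {"good": ["bad"], "excellent": ["awful"]}


def expand_lexicon(seeds):
    """Grow a polarity lexicon from seed words until no new words are found."""
    polarity = dict(seeds)
    queue = deque(polarity)
    while queue:
        word = queue.popleft()
        related = [(w, polarity[word]) for w in SYNONYMS.get(word, [])]
        related += [(w, -polarity[word]) for w in ANTONYMS.get(word, [])]
        for candidate, pol in related:
            if candidate not in polarity:  # keep the first polarity assigned
                polarity[candidate] = pol
                queue.append(candidate)
    return polarity


lexicon = expand_lexicon({"good": 1})
print(lexicon)
```

Starting from the single seed "good", the search reaches seven words and stops once the queue is empty, which corresponds to "no more sentiment words can be found".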
2.4
$$\ldots\left(\frac{N}{df_t}\right) \tag{1}$$
unknown. Here the class label is a positive or a negative review.
Supervised machine learning techniques are a common way of performing text
classification. A set of data, here SM posts, is divided into a training set
and a test set. The model is trained using the training set. The test set is
used to determine how well the model performs by calculating the
classification accuracy or error. This process is repeated until the result
is acceptable. The trained model can then be applied to future, unseen SM
posts. There are many supervised machine learning algorithms. Popular ones
are the Naïve Bayes classifier, Support Vector Machines (SVM) and
k-Nearest Neighbor (k-NN). They take a feature vector as input. A feature
vector can contain unigrams (a bag-of-words) including the sentiment words
identified in sentiment lexicon generation, terms and their frequencies,
part-of-speech (POS) tags or sentiment shifters.
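As an illustration of the train/test procedure, the following sketch implements a small multinomial Naïve Bayes classifier over bag-of-words counts with add-one smoothing. The posts, labels and function names are made up for this example, and the data set is far too small to be meaningful; it only shows the mechanics.

```python
import math
from collections import Counter, defaultdict


def train(posts):
    """posts: list of (tokens, label). Collect class and per-class word counts."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in posts:
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab


def predict(model, tokens):
    """Return the label with the highest log-posterior under Naive Bayes."""
    class_counts, word_counts, vocab = model
    n_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        total_words = sum(word_counts[label].values())
        log_prob = math.log(count / n_docs)  # class prior
        for tok in tokens:
            if tok in vocab:  # ignore words never seen in training
                log_prob += math.log(
                    (word_counts[label][tok] + 1) / (total_words + len(vocab))
                )
        if log_prob > best_score:
            best_label, best_score = label, log_prob
    return best_label


# Hypothetical, already preconditioned posts
training_set = [
    (["excellent", "camera"], "positive"),
    (["good", "phone"], "positive"),
    (["poor", "battery"], "negative"),
    (["bad", "screen"], "negative"),
]
test_set = [(["good", "camera"], "positive"), (["poor", "screen"], "negative")]

model = train(training_set)
accuracy = sum(predict(model, toks) == label for toks, label in test_set) / len(test_set)
print(accuracy)
```

The accuracy on the held-out test set is the figure that would be monitored while features or parameters are adjusted.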
Since sentiment words are often the dominant factor for sentiment
classification, it is not hard to imagine that sentiment words and phrases
may be used for sentiment classification in an unsupervised manner [2].
In unsupervised machine learning, documents are clustered into similarity groups. SM posts are grouped together using a similarity, or distance
function. The most commonly used distance functions for numeric attributes are the Euclidean distance and Manhattan (city block) distance
Opinion mining in Big Social Data, p. 12, 2014.
[1]. However, others such as the Chebyshev and Minkowski distance functions
are also used. K-means clustering is probably the most popular
clustering algorithm.
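The distance functions named above can be written compactly. This is a plain-Python sketch for numeric feature vectors of equal length; the function names and the sample vectors are chosen here for illustration. Euclidean and Manhattan distance are the Minkowski distance with p = 2 and p = 1, and Chebyshev distance is its limit for large p.

```python
def minkowski(x, y, p):
    """Minkowski distance of order p between two equal-length vectors."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


def euclidean(x, y):
    return minkowski(x, y, 2)


def manhattan(x, y):
    return minkowski(x, y, 1)


def chebyshev(x, y):
    """Largest coordinate-wise difference."""
    return max(abs(a - b) for a, b in zip(x, y))


x, y = (0, 0), (3, 4)
print(euclidean(x, y), manhattan(x, y), chebyshev(x, y))  # 5.0 7.0 4
```

In k-means, each post's feature vector would be assigned to the centroid that minimizes one of these distances.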
2.5
In recent studies Latent Dirichlet Allocation (LDA) has been used for
sentiment analysis using SM [17,18,19,20,23]. LDA is based on Latent
Semantic Indexing and represents a probabilistic model that finds the
co-occurrence patterns of terms that correspond to semantic topics and has
been used in document classification. It is a generative probabilistic model
that is based on multinomial and

$$p(\theta \mid \alpha) = \frac{\Gamma\left(\sum_{i=1}^{k} \alpha_i\right)}{\prod_{i=1}^{k} \Gamma(\alpha_i)} \, \theta_1^{\alpha_1 - 1} \cdots \theta_k^{\alpha_k - 1}, \tag{2}$$
description of the posts. It thus creates a profile of, for instance, positive
or negative Tweets. It can then filter out the relevant Tweets from a corpus of Tweets.
LDA has also been used for relevance filtering [19], and extensions have
been proposed that consider the underlying sequential structure of the
document [20] or filter out background topics [23]. LDA has proven to
be useful in exploratory as well as predictive text analytics.
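Assuming that equation (2) is the standard Dirichlet density, which LDA places as a prior over the topic proportions of a document, it can be evaluated numerically as in the sketch below. The function name and the sample values for the proportions and the concentration parameters are illustrative assumptions.

```python
import math


def dirichlet_pdf(theta, alpha):
    """Density of a Dirichlet distribution with parameters alpha at theta.

    theta: topic proportions, positive and summing to 1
    alpha: positive concentration parameters, one per topic
    """
    if len(theta) != len(alpha):
        raise ValueError("theta and alpha must have the same length")
    if abs(sum(theta) - 1.0) > 1e-9:
        raise ValueError("theta must sum to 1")
    # Normalizing constant: Gamma(sum of alpha) / product of Gamma(alpha_i)
    norm = math.gamma(sum(alpha)) / math.prod(math.gamma(a) for a in alpha)
    return norm * math.prod(t ** (a - 1) for t, a in zip(theta, alpha))


# With all alpha_i = 1 the density is uniform over the simplex
print(dirichlet_pdf((0.2, 0.3, 0.5), (1, 1, 1)))  # 2.0
print(dirichlet_pdf((0.5, 0.5), (2, 2)))          # 1.5
```

Values of alpha below 1 favour documents dominated by a few topics, which is often a reasonable assumption for short posts.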
Discussion
Sentiment analysis remains a challenging area of research. Whereas
the classification algorithms, machine learning, LDA or others like statistical, are important, data pre-processing remains an equally important
task. SM data is typically noisy, and there is a lot of irrelevant data. The
classification algorithms wont perform well if the data is not properly
preconditioned and the accuracy of the results will suffer. Preconditioning encompasses relevance filtering, noise removal and feature vector
preparation. The feature vector can contain word frequencies of sentiment words, POS, but attributes might also be weighted. For instance,
not all sentiment words might have the same importance, and the feature
vector might also contain weighting. Feature vector creation is at least as
important as selecting the appropriate classification algorithm;
nevertheless, there seems to be much less research in this area than in the
area of classification.
Other challenges originate from the complexities of natural language
with linguistic constructs such as humour, sarcasm or innuendos which
are very difficult to detect by computers.
Microblogging SM sites such as Twitter and Sina Weibo usually have
character limits and posts are usually clear statements. Forums and blogs
often cover several topics and contain opinions on different subjects,
which makes them more difficult to mine. More complex sentences can
have sentiments on different targets, for instance "Microsoft is doing
well in this bad market." That is why some studies only considered
explicit statements [21]. Probably the most difficult posts to analyse are
political opinions since they are full of irony and sarcasm.
The author uses Tweets for his research. Tweets are limited to 140
characters, so they are typically straight to the point, which makes them
suitable targets for getting opinions. However, because Tweets are so short,
they are usually full of slang or emoticons, which poses a challenge. For
instance, Tweets have no subject line, so subject words are highlighted
using a hash tag, for example "#IBM #share is plummeting", or companies are
denoted using a $ sign, for example "$AAPL" for Apple Inc.
Slang such as "ATTA car" (that's a car), abbreviations or acronyms
typically found on SM such as IMHO (In My Humble Opinion) or LOL
(Laugh Out Loud), or texts such as "gooooood car" pose additional
challenges for sentiment analysis.
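Some of these Tweet-specific conventions can be handled with simple normalization rules before classification. The sketch below is one possible approach, not the author's pipeline: the regular expressions, the rule of collapsing three or more repeated characters to two, and the small acronym table are assumptions made for illustration.

```python
import re

# Hypothetical acronym table; a real one would be much larger
ACRONYMS = {"imho": "in my humble opinion", "lol": "laugh out loud"}


def extract_tags(tweet):
    """Return (hashtags, cashtags) found in a Tweet, without the # or $ sign."""
    hashtags = re.findall(r"#(\w+)", tweet)
    cashtags = re.findall(r"\$([A-Za-z]+)", tweet)
    return hashtags, cashtags


def squeeze_elongation(word):
    """Collapse runs of three or more identical characters to two."""
    return re.sub(r"(.)\1{2,}", r"\1\1", word)


def normalize(tweet):
    """Lower-case, squeeze elongated words and expand known acronyms."""
    words = [squeeze_elongation(w) for w in tweet.lower().split()]
    return " ".join(ACRONYMS.get(w, w) for w in words)


print(extract_tags("#IBM #share is plummeting"))  # (['IBM', 'share'], [])
print(normalize("IMHO a gooooood car"))           # in my humble opinion a good car
```

Collapsing to two characters rather than one keeps words such as "good" intact, at the cost of leaving forms like "sooo" as "soo".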
Conclusions
A fully automated and accurate solution for opinion mining using SM
opinion spammers find more sophisticated ways too. Domain-independent word
lexicons are still nowhere in sight, but automatic sentiment lexicon
creation for a specific domain would improve the whole opinion mining
process. Many studies have analysed the effectiveness of online marketing
campaigns, but how influential online opinions are is still not very well
understood. More research on the online influence of SM users on user
behaviour would be interesting and desirable.
Appendix
5.1
References
[6] Jones, and L. Huan, "Mining Social Media: Challenges and Opportunities." pp. 90-99.
[7] V. Hangya, and R. Farkas, "Target-oriented opinion mining from tweets." pp. 251-254.
[8] S. Asur, and B. A. Huberman, "Predicting the Future with Social Media." pp. 492-499.
[9] J. Bollen, H. Mao, and X.-J. Zeng, "Twitter mood predicts the stock market," Journal of Computational Science, vol. 2, pp. 8, 2010.
[10] Siganos, E. Vagenas-Nanos, and P. Verwijmeren, "Facebook's daily sentiment and international stock markets," Journal of Economic Behavior & Organization, no. 0, 2014.
[11] H. Shen, X.-S. Hua, J. Luo, and V. Oria, "Guest editorial: content, concept and context mining in social media," World Wide Web, vol. 15, no. 2, pp. 115-116, 2012/03/01, 2012.
[12] J. Tang, Y. Chang, and H. Liu, "Mining social media with social theories: a survey," SIGKDD Explor. Newsl., vol. 15, no. 2, pp. 20-29, 2014.
[13] Preeti, and BrahmaleenKaurSidhu, "Natural Language Processing," International Journal of Computer Technology and Applications, vol. 4, pp. 751-758, 09/01, 2013.
[14] E. Kalampokis, E. Tambouris, and K. Tarabanis, "Understanding the predictive power of social media," Internet Research, vol. 23, no. 5, pp. 544-559, 2013.
[15] S. Stieglitz, and L. Dang-Xuan, "Social media and political communication: a social media analytics framework," Social Network Analysis and Mining, vol. 3, no. 4, pp. 1277-1291, 2013/12/01, 2013.
[16] X. Li, H. Xie, L. Chen, J. Wang, and X. Deng, "News impact on stock price return via sentiment analysis," Knowledge-Based Systems, no. 0, 2014.
[17] Z. Daniel, Z. Daniel, S. Jn, J. Jozef, and C. Anton, "Text Categorization with Latent Dirichlet Allocation," Journal of electrical and electronics engineering, vol. 7, pp. 161-164, 05/01, 2014.
[18] T. Shulong, L. Yang, S. Huan, G. Ziyu, Y. Xifeng, B. Jiajun, C. Chun, and H. Xiaofei, "Interpreting the Public Sentiment Variations on Twitter," Knowledge and Data Engineering, IEEE Transactions on, vol. 26, no. 5, pp. 1158-1170, 2014.
[19] M. Arias, A. Arratia, and R. Xuriguera, "Forecasting with twitter data," ACM Trans. Intell. Syst. Technol., vol. 5, no. 1, pp. 1-24, 2014.
[20] L. Du, W. Buntine, H. Jin, and C. Chen, "Sequential latent Dirichlet allocation," Knowledge and Information Systems, vol. 31, no. 3, pp. 475-503, 2012/06/01, 2012.
[21] J. Bollen, H. Mao, and X.-J. Zeng, "Twitter mood predicts the stock market," Journal of Computational Science, vol. 2, pp. 8, 2010.
[22] A. Porshnev, I. Redkin, and A. Shevchenko, "Machine Learning in Prediction of Stock Market Indicators Based on Historical Data and Data from Twitter Sentiment Analysis." pp. 440-444.
[23] T. Shulong, L. Yang, S. Huan, G. Ziyu, Y. Xifeng, B. Jiajun, C. Chun, and H. Xiaofei, "Interpreting the Public Sentiment Variations on Twitter," Knowledge and Data Engineering, IEEE Transactions on, vol. 26, no. 5, pp. 1158-1170, 2014.