Opinion mining in Social Big Data

Corresponding author: Peter Wlodarczak
Affiliation: University of Southern Queensland
Corresponding address: wlodarczak@gmail.com

Dr. Mustafa Ally
Affiliation: University of Southern Queensland, Toowoomba, QLD 4350, AUSTRALIA

Prof. Dr. Jeffrey Soar
Affiliation: University of Southern Queensland, Toowoomba, QLD 4350, AUSTRALIA

Abstract. Opinion mining has rapidly gained importance due to the unprecedented amount of opinionated data on the Internet. People share their opinions on products and services, and they rate movies, restaurants or vacation destinations. Social Media such as Facebook or Twitter has made it easier than ever for users to share their views and make them accessible to anybody on the Web. The economic potential has been recognized by companies that want to improve their products and services, detect new trends and business opportunities, or find out how effective their online marketing efforts are.
However, opinion mining using social media faces many challenges due to the amount and the heterogeneity of the available data. Spam and fake opinions have also become a serious issue. There are also language-related challenges, such as the use of slang and jargon or of special characters like smileys, which are widely adopted on social media sites.
These challenges create many interesting research problems, such as determining the influence of social media on people's actions, understanding opinion dissemination or determining the online reputation of a company. Not surprisingly, opinion mining using social media has become a very active area of research, and a lot of progress has been made in recent years. This article describes the current state of research and the technologies that have been used in recent studies.
Keywords: Big Data; Social Media; opinion mining; sentiment analysis

Opinion mining in Social Big Data, p. 1, 2014.

Electronic copy available at: http://ssrn.com/abstract=2565426

1 Introduction
Opinion mining, also called sentiment analysis, has become a very active area of research. It analyses people's opinions, appraisals, attitudes, and emotions toward entities, individuals, issues, events, topics, and their attributes [1]. Opinion mining using Social Media (SM) is still in its infancy, but interest is growing for several reasons. First, opinions are important because they are key influencers of our behaviour [1]. Second, SM such as Facebook or Google+ has made it very easy for users to share opinions, views, interests and ideas on the internet and make them visible worldwide. This has yielded an unprecedented amount of user opinions on products, services, or political events. In the past, an organisation had to conduct polls or surveys to get user or voter opinions, whereas Social Media gives direct access to large amounts of user opinions. For the first time in human history, we now have a huge volume of opinionated data in the social media on the Web [2]. Third, opinion mining using SM data poses many challenging problems, which makes it a very interesting area of research.
Since opinions are subjective, one opinion is usually not enough for an application, and a collection of opinions needs to be analysed. SM provides a large collection of opinions, which makes it an invaluable source for opinion mining applications. Using Social Media Mining (SMM), all the data can be analysed, and there is no need to select a sample. SM data are typically in the form of textual content (e.g. in blogs, reviews and status updates), rating scores in Likert scales or stars (e.g. review ratings), like or dislike indications (e.g. reviews' helpful votes and Facebook's "like" or Google's "+1" buttons), web search queries (e.g. Google Trends), tags and profile information (e.g. social network graphs) [14]. But the amount of available data is a challenge in itself, and we need to create some sort of summary. The summary can be in the form of binary sentiment classifications, positive or negative opinions, or in the form of multiclass sentiments, for instance in the form of a Likert scale ("very bad", "bad", "neutral", "good", "very good") or rating scores such as 1 to 5 stars.
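
The two kinds of summary described above can be sketched in a few lines of Python. This is only an illustration: the mapping of stars to Likert labels, the decision to treat 3 stars as neutral and leave it out of the binary summary, and the sample ratings are all assumptions, not part of any cited study.

```python
from collections import Counter

# Assumed mapping from a 1-5 star rating to the Likert labels used in the text.
LIKERT_LABELS = {1: "very bad", 2: "bad", 3: "neutral", 4: "good", 5: "very good"}


def binary_label(stars):
    """Map a star rating to a binary class; 3 stars is treated as neutral (None)."""
    if stars >= 4:
        return "positive"
    if stars <= 2:
        return "negative"
    return None


def summarise(ratings):
    """Return (binary summary, multiclass summary) as counters."""
    multiclass = Counter(LIKERT_LABELS[r] for r in ratings)
    binary = Counter(b for b in map(binary_label, ratings) if b is not None)
    return binary, multiclass


ratings = [5, 4, 4, 1, 3, 2, 5]  # made-up sample ratings
binary, multiclass = summarise(ratings)
print(binary)
print(multiclass)
```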
Opinion mining using SM data has been used in many domains. It has been used to find out about a company's online reputation, to detect new trends, to analyse user intents and to gain knowledge, and there are many commercial applications. Several research attempts at sentiment analysis have been made in recent years, among others to analyse political opinions [3][4], to find influential participants and groups [5], to analyse why users move from one service to another [6], to detect mood polarity [7], and in predictive analytics [8,9,10,14,22].
Sentiment analysis is a Natural Language Processing problem. It is a cross-disciplinary research field with theoretical underpinnings including computer science, linguistics and psychology. Natural Language Processing (NLP) is an area of research and application that explores how computers can be used to understand and manipulate natural language text or speech to do useful things [13]. Sentiment analysis touches on many aspects of NLP, such as word sense disambiguation, coreference resolution and negation handling. However, linguistic issues will not be covered in this article. This article describes the state-of-the-art techniques adopted for sentiment analysis in current research.

2 Methods

2.1 Types of sentiment expressions

Sentiment analysis, also called opinion mining, is the field of study that analyses people's opinions, sentiments, evaluations, appraisals, attitudes, and emotions towards entities such as products, services, organizations, individuals, issues, events, topics, and their attributes [2]. Sentiments can be expressed at the entity level or at the aspect level. Entity level sentiment analysis looks directly at the entire target of an opinion. For example, "The new iPhone is excellent" expresses an opinion on the product as a whole. Aspect level sentiment analysis looks at features of a product. For instance, "This car is very quiet but it uses a lot of petrol." In this example the opinion targets are two features of the car: the noise it produces and its fuel consumption.
At the entity level, an opinion is a quadruple, (e, s, h, t), where e is the entity, the opinion target, s is the sentiment, h is the sentiment holder and t is the time when the sentiment was expressed.
At the aspect level, a sentiment is a quintuple, (e, a, s, h, t), where the additional element a is the aspect. Here e and a together are the opinion target.
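
The quadruple and quintuple can be represented as one small record type with an optional aspect. The following is a minimal sketch; the class name, the field names, the holder "user42" and the date are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Opinion:
    entity: str                   # e: the opinion target
    sentiment: str                # s: the sentiment
    holder: str                   # h: the sentiment holder
    time: date                    # t: when the sentiment was expressed
    aspect: Optional[str] = None  # a: only set for aspect level opinions

    def as_tuple(self):
        """Return (e, s, h, t) at the entity level, (e, a, s, h, t) at the aspect level."""
        if self.aspect is None:
            return (self.entity, self.sentiment, self.holder, self.time)
        return (self.entity, self.aspect, self.sentiment, self.holder, self.time)


day = date(2014, 1, 1)  # hypothetical posting date

# "The new iPhone is excellent": one entity level opinion.
entity_level = Opinion("iPhone", "positive", "user42", day)

# "This car is very quiet but it uses a lot of petrol": two aspect level opinions.
quiet = Opinion("car", "positive", "user42", day, aspect="noise")
petrol = Opinion("car", "negative", "user42", day, aspect="fuel consumption")

print(entity_level.as_tuple())
print(quiet.as_tuple())
print(petrol.as_tuple())
```

A single sentence can therefore yield several records, one per aspect, which is why aspect level analysis needs the extra element.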
Regular opinions express sentiments on an entity or a feature directly, whereas comparative opinions compare multiple entities based on some of their aspects. For example, "BMW makes more ecological cars than Mercedes." Here the cars are compared based on their consumption of resources.
Sentiment words such as "excellent", "good" and "poor" are the most important indicators of sentiments. However, there are other ways of expressing sentiments. Phrases or idioms without sentiment words can express opinions. For instance, "This car cost me an arm and a leg" or "This mattress had a valley after one month."
Opinions can also be expressed using verbs. For example, "This TV sucks" or "This washer uses a lot of water." However, the sentence "This hoover really sucks!" expresses a positive opinion.
Detecting the real meaning, whether a statement is meant positively or negatively, is one of the challenges of opinion mining. Another is spam detection. Spammers spread unsolicited or malicious messages. Social spammers have become rampant and the volume of spam has increased dramatically [12]. SM can be accessed from anywhere in the world and users can express their opinions without disclosing their identity and without the fear of consequences. While this is a highly desirable feature in some cases, it also allows people with malicious intent or hidden agendas to post fake opinions to promote or discredit products, companies or individuals. Such individuals are called opinion spammers and their activities are called opinion spamming [2]. Advances have been made in spam detection, but these techniques are beyond the scope of this article.
Detecting sarcasm automatically is very difficult. Considering the sentence "The government wants to legalize marijuana, oh great!", it is impossible to say whether the person is in favor of legalizing it or is being sarcastic about it. Several studies tried to detect sarcasm. The most accurate of them was only able to detect it in 57% of the cases [2].
These are only some of the challenges faced when doing sentiment analysis, but they exemplify that opinion mining is highly domain specific and, as with most data mining tasks, usually starts with understanding the domain.
The examples given here all use the English language. Opinions are also expressed on SM in other languages, which pose different challenges.

2.2 Preconditioning

Text categorisation is usually done using clustering or classification techniques. Text documents are represented in the vector space based on Vector Space Modelling (VSM). However, most SM posts are not in a form that is usable for text mining techniques. Social media data are multi-modal in nature, including content such as images, audio, and videos, concept such as discussion topic, tag, and annotation, and context such as links, profile, timestamp, and click-through [11]. The data first has to be cleaned of irrelevant data. The cleaning process, or data preconditioning, can involve stop-word removal. Stop-words are words like "the" and "is" in sentences such as "The new Panasonic GM1 camera is excellent!". They only have grammatical significance and are thus eliminated. Special characters such as "!", brackets or smileys are usually removed as well. In addition, words are converted into their canonical form using stemming algorithms. Stemming algorithms such as unsupervised morpheme segmentation find the word stem from inflected or derived words. For instance, "sucks" becomes "suck". Lemmatization might be performed for further analysis. It is the process of grouping together the different inflected forms of a word, so they can be analyzed as a single item [15].
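
The cleaning steps above can be sketched as a small pipeline. This is a minimal illustration under stated assumptions: the stop-word list is a tiny hand-picked sample, and the suffix stripping in `crude_stem` is a deliberately crude stand-in for a real stemmer, not the unsupervised morpheme segmentation mentioned in the text.

```python
import re

# Tiny illustrative stop-word list; real lists contain hundreds of entries.
STOP_WORDS = {"the", "this", "is", "a", "an", "and", "of", "to"}

# Crude suffix stripping used as a stand-in for a proper stemming algorithm.
SUFFIXES = ("ing", "ed", "es", "s")


def crude_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def precondition(post):
    """Lowercase, drop special characters and stop-words, then stem."""
    text = post.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)  # removes "!", brackets, smileys
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    return [crude_stem(t) for t in tokens]


tokens = precondition("The new Panasonic GM1 camera is excellent! :-)")
print(tokens)
print(precondition("This TV sucks"))
```

Removing smileys loses sentiment information, so some pipelines may prefer to map them to tokens instead; the sketch follows the removal described in the text.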
An often used preconditioning task is creating a bag-of-words. Bag-of-words based approaches model news articles with a vector space model, which translates each news piece into a vector of word statistical measurements, such as the number of occurrences [16]. A bag-of-words is a list of all the words in a text, disregarding grammar and word order. Bags-of-words are suitable inputs for machine learning methods.
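
A bag-of-words and the corresponding count vector can be built with the standard library alone. The sketch below assumes the posts have already been tokenised; the two sample posts are made up.

```python
from collections import Counter


def bag_of_words(tokens):
    """Count word occurrences, ignoring grammar and word order."""
    return Counter(tokens)


def to_vector(bag, vocabulary):
    """Turn a bag into a count vector over a fixed, ordered vocabulary."""
    return [bag[term] for term in vocabulary]


posts = [
    ["new", "camera", "excellent"],
    ["camera", "poor", "poor", "battery"],
]
vocabulary = sorted({term for post in posts for term in post})
vectors = [to_vector(bag_of_words(post), vocabulary) for post in posts]

print(vocabulary)
print(vectors)
```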

2.3 Sentiment lexicons

As shown in the previous section, words that express positive or negative sentiments are essential for opinion mining. The examples also showed that these words are highly domain specific. Words can bear different meanings depending on whether the opinion is expressed about a car, a mobile phone or a mattress. Sentiment words are not the only important words. The sentiment strength can be altered using sentiment modifiers such as "very". For instance, "The new iPhone is very good." Here "good" is the sentiment word and "very" is the modifier.
Sentiment words can be base types such as "good" or "bad", or comparative types such as "better" or "worse".
Negators such as "not" are also important since they can change the sentiment to the opposite. For instance, "The new Chevrolet is not great." They are called sentiment polarity shifters. They can also change the opinion in a positive way. For instance, "The new Mercedes doesn't suck."
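
The interplay of sentiment words, modifiers and negators can be illustrated with a toy scorer. The lexicon weights, the modifier factors and the rule that a modifier or negator applies to the next sentiment word are assumptions made for this sketch; real systems use richer scope rules.

```python
# Hypothetical weights: positive values are positive sentiment, negative the opposite.
LEXICON = {"good": 1.0, "great": 1.5, "excellent": 2.0,
           "bad": -1.0, "poor": -1.0, "suck": -1.5}
MODIFIERS = {"very": 1.5, "really": 1.5}  # strengthen the next sentiment word
NEGATORS = {"not", "doesnt", "isnt"}      # flip the next sentiment word


def score(tokens):
    """Sum lexicon weights, applying pending modifiers and negators."""
    total = 0.0
    weight, negate = 1.0, False
    for tok in tokens:
        if tok in MODIFIERS:
            weight *= MODIFIERS[tok]
        elif tok in NEGATORS:
            negate = not negate
        elif tok in LEXICON:
            value = LEXICON[tok] * weight
            total += -value if negate else value
            weight, negate = 1.0, False  # reset after each sentiment word
    return total


print(score("the new iphone is very good".split()))
print(score("the new chevrolet is not great".split()))
print(score("the new mercedes doesnt suck".split()))
```

With these assumed weights the first and third sentences score positive and the second negative, which matches the reading of the examples in the text.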

A common way of performing opinion mining is creating a list of sentiment words, a sentiment lexicon, and using it to analyse the opinion texts.
Compiling sentiment lexicons can be done manually. However, this is labour intensive and usually an automatic approach is preferred. Dictionaries such as WordNet (http://wordnet.princeton.edu/) or Dictionary.com (http://dictionary.reference.com/) list synonyms and antonyms of words. They can be used to automatically generate sentiment lexicons. This approach works as follows. A small set of seed sentiment words is compiled manually. Starting from the seed words, an algorithm searches the online dictionary for synonyms and antonyms, which are added to the word list. The search is repeated iteratively until no more sentiment words can be found. Some sentiment lexicons also weight sentiment words. For instance, "excellent" is stronger than "good". Weighted lexicons are useful when posts are not just analysed for positive or negative opinions but divided into multiclass sentiment categories such as "good reviews" or "very good reviews".
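
The iterative expansion from seed words can be sketched as follows. The small in-memory `TOY_THESAURUS` is a hypothetical stand-in for a dictionary such as WordNet, so no real dictionary API is used; synonyms inherit the polarity of the word they were found from, antonyms get the opposite polarity.

```python
# Hypothetical toy thesaurus standing in for an online dictionary.
TOY_THESAURUS = {
    "good": {"synonyms": {"fine", "great"}, "antonyms": {"bad"}},
    "great": {"synonyms": {"excellent"}, "antonyms": {"poor"}},
    "bad": {"synonyms": {"poor"}, "antonyms": {"good"}},
    "poor": {"synonyms": {"bad"}, "antonyms": set()},
}


def expand_lexicon(seeds, thesaurus):
    """Grow a {word: polarity} map from seed words until nothing new is found."""
    lexicon = dict(seeds)
    frontier = list(seeds)
    while frontier:
        word = frontier.pop()
        entry = thesaurus.get(word, {})
        relations = (
            (entry.get("synonyms", ()), 1),   # same polarity
            (entry.get("antonyms", ()), -1),  # opposite polarity
        )
        for related, sign in relations:
            for other in related:
                if other not in lexicon:
                    lexicon[other] = lexicon[word] * sign
                    frontier.append(other)
    return lexicon


lexicon = expand_lexicon({"good": 1}, TOY_THESAURUS)
print(sorted(lexicon.items()))
```

In practice the automatically generated list is usually reviewed, since dictionary synonyms are not always sentiment-preserving in a given domain.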

2.4 Supervised and unsupervised machine learning methods

The idea of text categorization is to assign semantically similar documents to the same group, while the created groups should be as dissimilar as possible to each other [17]. In opinion mining on SM, documents are classified into posts with positive or negative sentiments. Text categorization can be divided into classification and clustering problems. Texts are represented in the vector space and every word is given a specific weight. Term Frequency and Inverse Document Frequency (TF-IDF) is one of the best known term weighting methods [17]. It is defined as:

$$w_{t,d} = tf_{t,d} \cdot \log\left(\frac{N}{df_t}\right) \qquad (1)$$

where $tf_{t,d}$ is the number of occurrences of term t in the document d, N is the number of documents in the collection and $df_t$ is the number of documents in which term t appears [17]. Then the similarity of the documents is computed using a distance measure such as the Euclidean distance, Manhattan distance or Chebyshev distance.
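
Equation (1) and the three distance measures can be computed directly. The sketch below assumes the natural logarithm (the base is not specified in the text) and uses three made-up tokenised posts.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return (vocabulary, vectors) with weights w = tf * log(N / df), as in Eq. (1)."""
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in set(doc))  # document frequency
    vocabulary = sorted(df)
    vectors = []
    for doc in docs:
        tf = Counter(doc)  # term frequency within this document
        vectors.append([tf[t] * math.log(n_docs / df[t]) for t in vocabulary])
    return vocabulary, vectors


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def manhattan(u, v):
    return sum(abs(a - b) for a, b in zip(u, v))


def chebyshev(u, v):
    return max(abs(a - b) for a, b in zip(u, v))


docs = [["camera", "excellent"], ["camera", "poor", "poor"], ["battery", "poor"]]
vocabulary, vectors = tf_idf(docs)
for name, dist in (("euclidean", euclidean), ("manhattan", manhattan), ("chebyshev", chebyshev)):
    print(name, round(dist(vectors[0], vectors[1]), 3))
```

A term that occurs in every document gets weight zero, because log(N / N) = 0; this is how TF-IDF discounts words that do not discriminate between documents.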
Sentiment classification is usually formulated as a binary classification problem, positive or negative. Supervised techniques are used when the class label is known, unsupervised techniques when it is unknown. Here the class label is "positive" or "negative" reviews. Supervised machine learning techniques are a common way of text classification. A set of data, SM posts, is divided into a training set and a test set. The model is trained using the training set. The test set is used to determine how well the model performs by calculating the error or, conversely, the classification accuracy. This process is repeated until the result is acceptable. The trained model can then be applied to future, unseen SM posts. There are many supervised machine learning algorithms. Popular algorithms are the Naive Bayes classifier, Support Vector Machines (SVM) and k-Nearest Neighbor (k-NN). They take a feature vector as input. A feature vector can consist of unigrams (a bag-of-words) containing the sentiment words identified in sentiment lexicon generation, terms and their frequencies, part of speech (POS) tags or sentiment shifters.
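
The train/test workflow can be illustrated with a small multinomial Naive Bayes classifier written from scratch. This is a sketch, not the setup of any cited study: add-one (Laplace) smoothing, the class and function names, and the four training and two test posts are all assumptions.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial Naive Bayes over token lists with add-one smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for tokens, label in zip(docs, labels):
            self.word_counts[label].update(tokens)
        self.vocabulary = {t for counts in self.word_counts.values() for t in counts}
        return self

    def predict(self, tokens):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocabulary)
            log_prob = math.log(n_docs / total_docs)  # class prior
            for tok in tokens:
                if tok in self.vocabulary:  # ignore words never seen in training
                    log_prob += math.log((counts[tok] + 1) / denom)
            if log_prob > best_score:
                best_label, best_score = label, log_prob
        return best_label


def accuracy(model, docs, labels):
    hits = sum(model.predict(d) == y for d, y in zip(docs, labels))
    return hits / len(labels)


train_docs = [["excellent", "camera"], ["good", "battery"],
              ["poor", "screen"], ["bad", "battery", "poor"]]
train_labels = ["positive", "positive", "negative", "negative"]
test_docs = [["excellent", "battery"], ["poor", "camera"]]
test_labels = ["positive", "negative"]

model = NaiveBayes().fit(train_docs, train_labels)
print(accuracy(model, test_docs, test_labels))
```

With such a tiny data set the accuracy figure is meaningless as an evaluation; the point is only the separation between data used for training and data held back for testing.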
Since sentiment words are often the dominant factor for sentiment classification, it is not hard to imagine that sentiment words and phrases may be used for sentiment classification in an unsupervised manner [2]. In unsupervised machine learning, documents are clustered into similarity groups. SM posts are grouped together using a similarity, or distance, function. The most commonly used distance functions for numeric attributes are the Euclidean distance and Manhattan (city block) distance [1]. However, others such as the Chebyshev and Minkowski distance functions are also used. K-means clustering is probably the most popular clustering algorithm.
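
K-means with the Euclidean distance can be sketched in a few lines. The version below assumes the caller supplies the initial centroids and runs a fixed number of iterations, which keeps the example deterministic; the four two-dimensional points stand in for feature vectors of posts and are made up.

```python
import math


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def k_means(points, centroids, iterations=10):
    """Plain k-means with caller-supplied initial centroids."""
    centroids = [list(c) for c in centroids]
    assignment = []
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        assignment = [
            min(range(len(centroids)), key=lambda k: euclidean(p, centroids[k]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids


points = [[2, 0], [3, 0], [0, 2], [0, 3]]
assignment, centroids = k_means(points, centroids=[[1, 0], [0, 1]])
print(assignment)
print(centroids)
```

The resulting clusters carry no labels; deciding which cluster corresponds to positive or negative posts requires an extra step, for example inspecting the sentiment words that dominate each cluster.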

2.5 Latent Dirichlet Allocation

In recent studies Latent Dirichlet Allocation (LDA) has been used for sentiment analysis using SM [17,18,19,20,23]. LDA is based on Latent Semantic Indexing and represents a probabilistic model that finds the co-occurrence patterns of terms that correspond to semantic topics, and it has been used in document classification. It is a generative probabilistic model that is based on the multinomial and Dirichlet distributions [17]. The Dirichlet distribution is defined as:

$$p(\theta \mid \alpha) = \frac{\Gamma\left(\sum_{i=1}^{n} \alpha_i\right)}{\prod_{i=1}^{n} \Gamma(\alpha_i)} \, \theta_1^{\alpha_1 - 1} \cdots \theta_n^{\alpha_n - 1} \qquad (2)$$

where every document d is characterized by a Dirichlet distribution $\theta_d$ with parameter $\alpha$, n is the number of predefined topics, $\alpha_i > 0$, and $\sum_{i=1}^{n} \theta_i = 1$. $\Gamma(x)$ is the Gamma function. The parameters can be estimated by approximate inference methods such as variational inference or Gibbs sampling. As with machine learning, LDA is first trained with a set of SM posts and creates a latent description of the posts. It thus creates a profile of, for instance, positive or negative Tweets. It can then filter out the relevant Tweets from a corpus of Tweets.
LDA has also been used for relevance filtering [19], and extensions have
been proposed that consider the underlying sequential structure of the
document [20] or filter out background topics [23]. LDA has proven to
be useful in exploratory as well as predictive text analytics.
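
To make the training step more concrete, the following is a compact sketch of LDA fitted by collapsed Gibbs sampling. It is an illustration under several assumptions that are not taken from the cited studies: symmetric priors `alpha` and `beta` (the prior `beta` on the topic-word distributions is not discussed in the text), a fixed number of sweeps, a fixed random seed, and four made-up tokenised posts.

```python
import random


def lda_gibbs(docs, n_topics, alpha=0.1, beta=0.01, iterations=200, seed=0):
    """Collapsed Gibbs sampling for LDA; returns (theta, topic_word counts)."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    doc_topic = [[0] * n_topics for _ in docs]                       # n_{d,k}
    topic_word = [dict.fromkeys(vocab, 0) for _ in range(n_topics)]  # n_{k,w}
    topic_total = [0] * n_topics                                     # n_k

    def bump(d, w, k, delta):
        doc_topic[d][k] += delta
        topic_word[k][w] += delta
        topic_total[k] += delta

    # Random initial topic assignment for every token.
    z = []
    for d, doc in enumerate(docs):
        z.append([rng.randrange(n_topics) for _ in doc])
        for w, k in zip(doc, z[d]):
            bump(d, w, k, +1)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                bump(d, w, z[d][i], -1)  # remove the token's current assignment
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + len(vocab) * beta)
                    for t in range(n_topics)
                ]
                z[d][i] = rng.choices(range(n_topics), weights=weights)[0]
                bump(d, w, z[d][i], +1)

    # Per-document topic proportions (the theta_d of the text), smoothed by alpha.
    theta = [
        [(c + alpha) / (len(doc) + n_topics * alpha) for c in counts]
        for doc, counts in zip(docs, doc_topic)
    ]
    return theta, topic_word


docs = [
    ["excellent", "camera", "good", "battery"],
    ["good", "excellent", "screen"],
    ["poor", "battery", "bad", "screen"],
    ["bad", "poor", "camera"],
]
theta, topic_word = lda_gibbs(docs, n_topics=2, iterations=50)
for row in theta:
    print([round(p, 2) for p in row])
```

The topics found this way are unlabelled co-occurrence patterns; whether one of them lines up with positive or negative sentiment depends on the data and has to be checked, for example against a sentiment lexicon.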

3 Discussion
Sentiment analysis remains a challenging area of research. While the classification algorithms, whether machine learning, LDA or statistical methods, are important, data pre-processing remains an equally important task. SM data is typically noisy, and there is a lot of irrelevant data. The classification algorithms won't perform well if the data is not properly preconditioned, and the accuracy of the results will suffer. Preconditioning encompasses relevance filtering, noise removal and feature vector preparation. The feature vector can contain word frequencies of sentiment words or POS tags, and attributes might also be weighted, since not all sentiment words have the same importance. Feature vector creation is at least as important as selecting the appropriate classification algorithm; nevertheless, there seems to be much less research in this area than in the area of classification.
Other challenges originate from the complexities of natural language, with linguistic constructs such as humour, sarcasm or innuendo, which are very difficult for computers to detect.
Microblogging SM sites such as Twitter and Sina Weibo usually have character limits, and posts are usually clear statements. Forums and blogs often cover several topics and contain opinions on different subjects, which makes them more difficult to mine. More complex sentences can have sentiments on different targets. For instance, "Microsoft is doing well in this bad market." That's why some studies only considered explicit statements [21]. Probably the most difficult posts to analyse are political opinions, since they are full of irony and sarcasm.
The author uses Tweets for his research. Tweets are limited to 140 characters, so they are typically straight to the point, which makes them suitable targets for getting opinions. However, due to their shortness, Tweets are usually full of slang or emoticons, which poses a challenge. For instance, Tweets have no subject line, so subject words are highlighted using a hash tag, for example "#IBM #share is plummeting", or companies are denoted using a $ sign, for example "$AAPL" for Apple Inc. Slang such as "ATTA car" ("that's a car"), abbreviations or acronyms typically found on SM such as IMHO (In My Humble Opinion) or LOL (Laughing Out Loud), or texts such as "gooooood car" pose additional challenges for sentiment analysis.
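
Some of these Tweet-specific features can be handled with regular expressions before classification. The sketch below is a heuristic illustration: the slang table holds only two sample entries, and collapsing any run of three or more repeated characters to two is an assumption that works for "gooooood" but not for every elongated word.

```python
import re

# Illustrative slang/acronym table; a real one would be much larger.
SLANG = {"imho": "in my humble opinion", "lol": "laughing out loud"}


def parse_tweet(text):
    """Extract hashtags and $-prefixed company symbols, then normalise the tokens."""
    hashtags = re.findall(r"#(\w+)", text)
    cashtags = re.findall(r"\$([A-Za-z]+)", text)
    cleaned = re.sub(r"[#$]", "", text.lower())
    # Collapse runs of 3+ identical characters to two: "gooooood" -> "good".
    cleaned = re.sub(r"(.)\1{2,}", r"\1\1", cleaned)
    tokens = []
    for tok in re.findall(r"[a-z0-9']+", cleaned):
        tokens.extend(SLANG.get(tok, tok).split())  # expand known acronyms
    return {"hashtags": hashtags, "cashtags": cashtags, "tokens": tokens}


result = parse_tweet("#IBM #share is plummeting, $AAPL is a gooooood buy IMHO")
print(result)
```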

4 Conclusions
A fully automated and accurate solution for opinion mining using SM is nowhere in sight. The main issue is that opinion mining is a natural language processing problem, and there are many ambiguities, fuzziness and irregularities in natural languages. Other reasons are the limitations of the algorithms. Many opinion mining algorithms give satisfactory results, but manual steps are still necessary. Also, the algorithms usually don't produce human-readable results, which is why it is very often difficult to understand the whole process.
Future research should focus, among other things, on feature vector creation as a crucial step in opinion mining. Automatic opinion spam detection is an area where a lot of progress has been made, but spam is usually artfully created, and as spam filters detect new forms of spam, opinion spammers find more sophisticated ways too. Domain-independent sentiment lexicons are still nowhere in sight, but automatic sentiment lexicon creation for a specific domain would improve the whole opinion mining process. Many studies have analysed the effectiveness of online marketing campaigns, but how influential online opinions are is still not very well understood. More research on the online influence of SM users on user behaviour would be desirable.

5 Appendix

5.1 References

[1] B. Liu, Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data, 2 ed., Heidelberg: Springer, 2011.
[2] B. Liu, Sentiment Analysis and Opinion Mining: Morgan & Claypool, 2012.
[3] M. Kaschesky, P. Sobkowicz, and G. Bouchard, "Opinion mining in social media: modeling, simulating, and visualizing political opinion formation in the web," in Proceedings of the 12th Annual International Digital Government Research Conference: Digital Government Innovation in Challenging Times, College Park, Maryland, 2011, pp. 317-326.
[4] S. Stieglitz, and L. Dang-Xuan, "Social media and political communication: a social media analytics framework," Social Network Analysis and Mining, vol. 3, no. 4, pp. 1277-1291, 2013/12/01, 2013.
[5] D. King, "Introduction to Mining and Analyzing Social Media Minitrack," pp. 3108-3108.
[6] Jones, and L. Huan, "Mining Social Media: Challenges and Opportunities," pp. 90-99.
[7] V. Hangya, and R. Farkas, "Target-oriented opinion mining from tweets," pp. 251-254.
[8] S. Asur, and B. A. Huberman, "Predicting the Future with Social Media," pp. 492-499.
[9] J. Bollen, H. Mao, and X.-J. Zeng, "Twitter mood predicts the stock market," Journal of Computational Science, vol. 2, pp. 8, 2010.
[10] Siganos, E. Vagenas-Nanos, and P. Verwijmeren, "Facebook's daily sentiment and international stock markets," Journal of Economic Behavior & Organization, no. 0, 2014.
[11] H. Shen, X.-S. Hua, J. Luo, and V. Oria, "Guest editorial: content, concept and context mining in social media," World Wide Web, vol. 15, no. 2, pp. 115-116, 2012/03/01, 2012.
[12] J. Tang, Y. Chang, and H. Liu, "Mining social media with social theories: a survey," SIGKDD Explor. Newsl., vol. 15, no. 2, pp. 20-29, 2014.
[13] Preeti, and BrahmaleenKaurSidhu, "Natural Language Processing," International Journal of Computer Technology and Applications, vol. 4, pp. 751-758, 09/01, 2013.
[14] E. Kalampokis, E. Tambouris, and K. Tarabanis, "Understanding the predictive power of social media," Internet Research, vol. 23, no. 5, pp. 544-559, 2013.
[15] S. Stieglitz, and L. Dang-Xuan, "Social media and political communication: a social media analytics framework," Social Network Analysis and Mining, vol. 3, no. 4, pp. 1277-1291, 2013/12/01, 2013.
[16] X. Li, H. Xie, L. Chen, J. Wang, and X. Deng, "News impact on stock price return via sentiment analysis," Knowledge-Based Systems, no. 0, 2014.
[17] Z. Daniel, Z. Daniel, S. Ján, J. Jozef, and C. Anton, "Text Categorization with Latent Dirichlet Allocation," Journal of electrical and electronics engineering, vol. 7, pp. 161-164, 05/01, 2014.
[18] T. Shulong, L. Yang, S. Huan, G. Ziyu, Y. Xifeng, B. Jiajun, C. Chun, and H. Xiaofei, "Interpreting the Public Sentiment Variations on Twitter," Knowledge and Data Engineering, IEEE Transactions on, vol. 26, no. 5, pp. 1158-1170, 2014.
[19] M. Arias, A. Arratia, and R. Xuriguera, "Forecasting with twitter data," ACM Trans. Intell. Syst. Technol., vol. 5, no. 1, pp. 1-24, 2014.
[20] L. Du, W. Buntine, H. Jin, and C. Chen, "Sequential latent Dirichlet allocation," Knowledge and Information Systems, vol. 31, no. 3, pp. 475-503, 2012/06/01, 2012.
[21] J. Bollen, H. Mao, and X.-J. Zeng, "Twitter mood predicts the stock market," Journal of Computational Science, vol. 2, pp. 8, 2010.
[22] A. Porshnev, I. Redkin, and A. Shevchenko, "Machine Learning in Prediction of Stock Market Indicators Based on Historical Data and Data from Twitter Sentiment Analysis," pp. 440-444.
[23] T. Shulong, L. Yang, S. Huan, G. Ziyu, Y. Xifeng, B. Jiajun, C. Chun, and H. Xiaofei, "Interpreting the Public Sentiment Variations on Twitter," Knowledge and Data Engineering, IEEE Transactions on, vol. 26, no. 5, pp. 1158-1170, 2014.
