Abstract
The new era of social media has led to an inflow of large amounts of data. This unstructured
data, which may seem unwanted, has a special place in data mining: from it, information about
or insight into a person's mind can be mined. Today many social media platforms
(blogs, forums, Twitter, Facebook, Google+) allow users to write about their experiences and
thoughts on a variety of domains, including proprietary products, advertisements, movies,
news, stocks and many more. According to a survey, many companies see social media as a
fertile ground for opinion mining. Opinion mining, or sentiment analysis, is a type of natural
language processing that helps in understanding the sentiment or attitude of an individual.
We try to extract the belief (opinion) of a user by proposing a system that analyses a
movie review to determine the overall reaction (polarity) to the movie, i.e. whether the
movie was liked (positive) or hated (negative). A textual movie review is important because it
reveals the strong and weak points of the movie plot, and a deeper analysis of a movie
review can tell whether the movie meets the expectations of the reviewer. We combine various
concepts from natural language processing and machine learning. We extracted data from
the Pang and Lee corpus and applied supervised machine-learning algorithms such as Naive Bayes
to classify the data using unigram and bigram features.
Introduction
The Internet has become a huge database that hosts a gigantic amount
of data created and consumed by its users. This database has been growing at an
exponential rate, giving rise to a new industry built around it, in which users express their
opinions across channels such as Facebook, Twitter, Rotten Tomatoes and Foursquare.
Opinions expressed in the form of reviews provide an opportunity for new
explorations to find the collective likes and dislikes of the online community. One such domain
is movie reviews, which affect everyone from audiences and film critics to
the production company. The movie reviews posted on websites are not formal
reviews; they are informal and grammatically unstructured. Opinions
expressed in movie reviews give a faithful reflection of the emotion being conveyed.
The heavy use of sentiment words in these reviews inspired us to
devise an approach that classifies the polarity of a movie review using these sentiment words.
The evolution of web technology has led to a huge amount of user-generated content and has
significantly changed the way we manage, organize and interact with information. Given the
large volume of user opinions, reviews, comments, feedback and suggestions, it is essential
to explore, analyze and organize this content for efficient decision making. In recent years
sentiment analysis has emerged as one of the popular techniques for information retrieval and
web data analysis. Sentiment analysis, also known as opinion mining, is a subfield of Natural
Language Processing (NLP) and Computational Linguistics (CL) that
studies and analyzes people's opinions, reviews and sentiments.
Sentiment analysis is likely to grow in importance over the next few years. With
opinion mining, we can distinguish poor content from high-quality content. With the
technologies available, we can tell whether a movie has more good opinions than bad opinions
and find the reasons why those opinions are positive or negative. Much of the early research
in this field centered on product reviews, such as reviews of different products on
Amazon.com [1], classifying sentiments as positive, negative, or neutral. Most sentiment
analysis studies now focus on social media sources such as IMDb, Twitter [2] and
Facebook, which requires approaches tailored to the growing volume of opinions expressed in the
form of text. Furthermore, phrase-level analysis of movie reviews proves to
be a challenging task.
Related Work
Background
Pre-processing is the first step in the text mining process and plays a crucial role
in text mining techniques and applications. It is the process by which a new
document is incorporated into an Information Retrieval System (IRS). An efficient pre-processor
represents a document compactly, both for storage (space) and for processing
retrieval requests (time), while maintaining good retrieval performance on measures such as
precision and recall. The pre-processing techniques used in our proposed model are as
follows.
Stop Word Removal
In Information Retrieval (IR) and text mining, many frequently used English words
impart no sentiment to the text; they add bulk without adding
information for the analyst. These words are known as stop words, and removing them
reduces the dimensionality of the term space. Stop words are frequent words such as pronouns,
prepositions and conjunctions that carry little or no information.
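As an illustration of the idea (a minimal sketch, not the implementation used in this work), stop word removal can be expressed as a simple filter over tokens. The stop list `STOP_WORDS` and the helper `remove_stop_words` are hypothetical names, and the list is deliberately tiny; practical systems rely on longer, curated lists.

```python
# Hypothetical, deliberately small stop list (an assumption for illustration).
STOP_WORDS = {"a", "an", "the", "is", "it", "of", "and", "to", "in", "this", "was", "i"}

def remove_stop_words(text):
    """Lowercase the text, split on whitespace, strip punctuation, drop stop words."""
    tokens = [tok.strip(".,!?;:\"'()").lower() for tok in text.split()]
    return [tok for tok in tokens if tok and tok not in STOP_WORDS]

print(remove_stop_words("This movie was a triumph of style and wit."))
# ['movie', 'triumph', 'style', 'wit']
```

Nine tokens shrink to four, which shows how the term space is reduced while the sentiment-bearing words are kept.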
Stemming
Stemming techniques try to find the root of a word. Stemming converts words to their stems
using language-dependent linguistic knowledge. Because words with the same root
mostly have the same or closely related meanings, these words can be conflated. For example, the
words user, users, used and using can all be stemmed to the word 'use'.
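The conflation in this example can be sketched with a few suffix-rewriting rules. The rule table `SUFFIX_RULES` and the function `toy_stem` are assumptions made purely for illustration: they reproduce the example above but are not a real stemming algorithm, and established stemmers such as Porter's apply far more careful rules and may not conflate all four words.

```python
# Toy suffix rules (hypothetical): each pair is (suffix, replacement).
# The first matching rule wins, so longer suffixes are listed before shorter ones.
SUFFIX_RULES = [("ing", "e"), ("ers", "e"), ("er", "e"), ("ed", "e"), ("s", "")]

def toy_stem(word):
    """Rewrite the first matching suffix, keeping a stem of at least two letters."""
    word = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[: len(word) - len(suffix)] + replacement
    return word

print([toy_stem(w) for w in ["user", "users", "used", "using"]])
# ['use', 'use', 'use', 'use']
```

These rules over-generate on many other words, which is why real stemmers encode much more linguistic knowledge.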
Feature selection
A "feature" (attribute or variable) refers to the characteristic of the data. There are four basic
steps in a typical feature selection method (see Figure ):
a) A generation procedure generates the next candidate subset, which should retain enough
information for good model performance;
b) An evaluation function evaluates the candidate subset;
c) A stopping criterion decides when to terminate;
d) A validation procedure validates the subset.
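The four steps can be sketched as a greedy forward search. This is one possible instantiation under stated assumptions, not the procedure used in this work: `forward_select`, `toy_evaluate` and the weights in `WEIGHTS` are hypothetical, and the validation step d) is left to the caller, who would check the returned subset on held-out data.

```python
def forward_select(features, evaluate, max_features):
    """Greedy forward selection; evaluate(subset) returns a score, higher is better."""
    selected, best_score = [], float("-inf")
    remaining = list(features)
    while remaining and len(selected) < max_features:  # c) stopping criterion: size limit
        # a) generation: extend the current subset by one unused feature
        candidates = [selected + [f] for f in remaining]
        # b) evaluation: score every candidate subset and keep the best one
        score, subset = max(((evaluate(c), c) for c in candidates), key=lambda p: p[0])
        if score <= best_score:  # c) stopping criterion: no improvement
            break
        selected, best_score = subset, score
        remaining.remove(subset[-1])
    return selected, best_score

# Hypothetical usefulness weights; each added feature costs a fixed penalty of 0.6.
WEIGHTS = {"great": 3.0, "awful": 2.0, "the": 0.0, "movie": 0.5}

def toy_evaluate(subset):
    return sum(WEIGHTS[f] for f in subset) - 0.6 * len(subset)

selected, score = forward_select(list(WEIGHTS), toy_evaluate, max_features=3)
print(selected)  # ['great', 'awful']
```

The search stops after two features because adding either of the remaining words lowers the score.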
Several studies have evaluated feature selection approaches. Forman
[Forman 2003] gives an extensive evaluation of various schemes for the topical text
classification task. Zheng et al. [Zheng 2004] experiment with feature selection schemes
on imbalanced data, where one class has more documents than the other. The choice of
salient features for each category can affect classifier performance in the
subsequent step. Since we want to determine whether a sentence belongs to the opinionated or
factual category (and likewise to the positive or negative one), our aim is to choose terms or
features that are unique to, or most representative of, that category. Ideally, we would like a set of
non-overlapping features that represent the documents in each category. Realistically, this is not
possible. Thus, it is important to find a way to represent the importance of each feature in
both categories (or in all categories when faced with more than two classes).
There are three general classes of feature selection algorithms: filter methods, wrapper
methods and embedded methods. We have used filter methods in our methodology.
Filter Methods
Filter feature selection methods use a scoring function to assign a score to each feature,
which in turn helps to decide whether the feature is worth keeping, since the retained features
affect the performance of the model. These methods are often univariate and consider each
feature independently, or with regard to the dependent variable. Examples of filter methods
include the chi-squared test, information gain and mutual information.
Chi Square
The chi-square test is used in statistics to test the independence of two events.
To calculate chi square, we take the square of the difference between the observed
(o) and expected (e) values and divide it by the expected value. Chi square thus measures the
deviation of the observed counts (o) from the expected counts (e).
$$\chi^2 = \frac{(o - e)^2}{e}$$
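For feature scoring, the per-cell term described above is commonly summed over all cells of a contingency table that crosses term presence with class label. The sketch below assumes that standard summed form. The function name `chi_square` and the counts in `table` are made up for illustration and are not results from this work.

```python
def chi_square(observed):
    """Chi-square statistic for a 2-D contingency table of observed counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            # Expected count for this cell if term and class were independent.
            e = row_totals[i] * col_totals[j] / total
            stat += (o - e) ** 2 / e
    return stat

# Hypothetical counts: rows = term present / absent,
# columns = positive / negative reviews (100 reviews per class).
table = [[30, 10],
         [70, 90]]
print(chi_square(table))  # 12.5
```

A term that is spread evenly across both classes scores 0, so ranking terms by this statistic favours those most associated with one class.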
We used the corpus by Bo Pang and Lillian Lee, which contains a collection of 1000
positive and 1000 negative movie reviews.
We then apply various pre-processing techniques to prepare the corpus for
further processing.
We apply feature selection schemes to select the most important features, which helps
in constructing a better model.
Finally, we apply a classification algorithm and compute various performance metrics,
which aid in understanding the accuracy and overall performance of the resulting model.
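The classification step can be sketched as a multinomial Naive Bayes classifier over unigram and bigram counts. This is a minimal sketch under stated assumptions, not the system evaluated in this work: the function names, the add-one (Laplace) smoothing and the four-sentence training set are all hypothetical, and pre-processing and feature selection are omitted for brevity.

```python
import math
from collections import Counter, defaultdict

def ngram_features(text):
    """Return the unigrams and bigrams of a lowercased, whitespace-split text."""
    words = text.lower().split()
    bigrams = [" ".join(pair) for pair in zip(words, words[1:])]
    return words + bigrams

def train_naive_bayes(labelled_docs):
    """Count documents per class and feature occurrences per class."""
    class_docs, feature_counts, vocabulary = Counter(), defaultdict(Counter), set()
    for text, label in labelled_docs:
        class_docs[label] += 1
        for feat in ngram_features(text):
            feature_counts[label][feat] += 1
            vocabulary.add(feat)
    return class_docs, feature_counts, vocabulary

def predict(model, text):
    """Return the class with the highest log posterior under add-one smoothing."""
    class_docs, feature_counts, vocabulary = model
    total_docs = sum(class_docs.values())
    best_label, best_logp = None, float("-inf")
    for label in class_docs:
        logp = math.log(class_docs[label] / total_docs)  # class prior
        denom = sum(feature_counts[label].values()) + len(vocabulary)
        for feat in ngram_features(text):
            if feat in vocabulary:  # features unseen in training are ignored
                logp += math.log((feature_counts[label][feat] + 1) / denom)
        if logp > best_logp:
            best_label, best_logp = label, logp
    return best_label

# Made-up miniature training set, for illustration only.
train = [
    ("a wonderful and moving film", "pos"),
    ("great acting and a great plot", "pos"),
    ("a dull and boring film", "neg"),
    ("terrible plot and awful acting", "neg"),
]
model = train_naive_bayes(train)
print(predict(model, "a wonderful plot"))   # pos
print(predict(model, "boring and awful"))   # neg
```

Working in log space avoids numerical underflow when many feature probabilities are multiplied, which matters for reviews of realistic length.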