
Sentiment analysis using feature selection and classification algorithms

Abstract

The new era of social media has led to an inflow of large amounts of data. This unstructured
data, which may seem unwanted, has a special place in data mining: from it, information
about or insight into a person's mind can be mined. Today many social media platforms
(blogs, forums, Twitter, Facebook, Google+) allow users to write about their experiences and
thoughts on a variety of domains including proprietary products, advertisements, movies,
news, stocks and many more. According to a survey many companies see social media as a
fertile ground for opinion mining. Opinion mining, or sentiment analysis, is a type of natural
language processing which helps in understanding the sentiment or attitude of an individual.
We try to extract the belief (opinion) of a user by proposing a system which analyses a
movie review to understand the overall reaction (polarity) to the movie, i.e. whether the
movie was liked (positive) or hated (negative). A textual movie review is important as it
reveals the strong and weak points of the movie plot, and a deeper analysis of a movie
review can tell whether the movie met the expectations of the reviewer. We combine various
concepts of natural language processing and machine learning. We extracted data from the
Pang and Lee corpora and applied supervised machine-learning algorithms such as Naive Bayes
to classify the data using unigram and bigram features.

Introduction

In the present era the Internet has become a huge cyber database hosting a gigantic amount
of data which is created and consumed by its users. The database has been growing at an
exponential rate, giving rise to a new industry built around it, in which users express their
opinions across channels such as Facebook, Twitter, Rotten Tomatoes and Foursquare.
Opinions expressed in the form of reviews provide an opportunity for new
explorations to find the collective likes and dislikes of the cyber community. One such domain of
reviews is movie reviews, which affect everyone from audiences and film critics to
the production company. The movie reviews posted on websites are not formal
reviews; they are very informal and grammatically unstructured. Opinions
expressed in movie reviews give a true reflection of the emotion that is being conveyed.
The heavy use of sentiment words in these reviews inspired us to
devise an approach to classify the polarity of a movie review using these sentiment words.
The evolution of web technology has led to a huge amount of user-generated content and has
significantly changed the way we manage, organize and interact with information. Due to the
large amount of user opinions, reviews, comments, feedback and suggestions, it is essential
to explore, analyze and organize the content for efficient decision making. In recent years
sentiment analysis has emerged as one of the popular techniques for information retrieval and
web data analysis. Sentiment analysis, also known as opinion mining, is a subfield of Natural
Language Processing (NLP) and Computational Linguistics (CL) that
studies and analyzes people's opinions, reviews and sentiments.
Sentiment analysis is a technology that will be very important in the next few years. With
opinion mining, we can distinguish poor content from high-quality content. With the
technologies available we can know whether a movie has more good opinions than bad opinions
and find the reasons why those opinions are positive or negative. Much of the early research
in this field was centered on product reviews, such as reviews of different products on
Amazon.com [1], classifying sentiments as positive, negative, or neutral. Most sentiment
analysis studies are now focused on social media sources such as IMDb, Twitter [2] and
Facebook, which requires the approaches to be tailored to the rising demand for opinions in the
form of text. Furthermore, performing phrase-level analysis of movie reviews proves to
be a challenging task.

Sentiment analysis is the field of study that analyzes people's opinions, sentiments,
evaluations, attitudes, and emotions from written language. Sentiment analysis systems are
used in almost every domain because opinions are central to almost all human activities. They
are key influencers of our behaviours.
Sentiment analysis uses natural language processing and text analysis to identify and extract
information about a particular field of interest. Due to the popularity of social media such
as blogs and social networking sites such as Facebook and Twitter, interest in sentiment
analysis has increased considerably.
There are several challenges in sentiment analysis. The first is that an opinion word that is
considered positive in one situation may be considered negative in another situation.
The second challenge is that people don't always express opinions in the same way. Usual
text processing relies on the fact that small differences between two pieces of text don't
change the meaning very much; in sentiment analysis, however, a small change such as an
added negation can reverse the polarity. Sentiment analysis helps to find words that indicate
sentiment and helps to understand the relationship between textual reviews and the
consequences of those reviews. One such example is the effect of online movie reviews on box
office collection. In this project, data mining techniques are applied to online movie reviews
to predict the box office collection of a movie based on the reviews and to analyse how
much effect the reviews have on the box office collection. The box office collection for the next
day is predicted based on the online reviews of the present day. The collection is also
classified as high or low.
The project starts with extracting data from the website. Reviews are collected from IMDb
(http://www.imdb.com/). The second step is to apply sentiment classification using a TF-IDF
approach. This involves text preprocessing, text transformation, validating feature
effectiveness using clustering, and sentiment classification.
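
The TF-IDF weighting mentioned above can be illustrated with a minimal sketch. The text does not state which TF-IDF variant is used, so the formula here (relative term frequency multiplied by the natural log of N/df), the whitespace tokenisation, the function name tf_idf and the three toy reviews are all assumptions made for illustration.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: weight} dict per document (hypothetical helper)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            # tf = relative frequency in this document, idf = log(N / df)
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

# Toy reviews, invented for this example only.
reviews = ["the plot was great", "the plot was dull", "the acting was great"]
weights = tf_idf(reviews)
```

Under this variant a term that occurs in every review (such as "the") receives weight zero, while a term confined to one review (such as "dull") receives the highest weight in that review.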
The rest of the paper is organized as follows. Section II discusses the related work
done in this domain. Section III explains the proposed work in depth. Section IV presents
the results and analysis, and Section V gives the conclusion for the
proposed work.

Related Work

Background
Pre-processing is the first step in the text mining process and plays a crucial role
in text mining techniques and applications. It is the process which incorporates a new
document into an IRS (Information Retrieval System). An efficient pre-processor represents a
document effectively in terms of both the space needed to store the document and the time
needed to process retrieval requests, while maintaining good retrieval performance measures such as
precision and recall. The pre-processing techniques used in our proposed model are as
follows.
Stop Word Removal

In Information Retrieval (IR) and text mining, many frequently used English words are
of no use because they do not impart any sentiment to the text; they only add bulk and make
the text harder to analyse. These words are known as stop words, and removing them
reduces the dimensionality of the term space. Stop words are frequent words such as pronouns,
prepositions and conjunctions that carry no information.
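
A minimal sketch of stop word removal is given below. The stop word list is a small hand-picked set chosen for illustration, not the list used in this work, and the tokenisation is a plain whitespace split; both are assumptions.

```python
# Illustrative stop word list; a real system would use a much larger one.
STOP_WORDS = {"a", "an", "and", "the", "is", "it", "of", "to", "in", "was", "this", "i"}

def remove_stop_words(text):
    """Lower-case the text, split on whitespace and drop stop words."""
    tokens = text.lower().split()
    return [token for token in tokens if token not in STOP_WORDS]

filtered = remove_stop_words("The plot of this movie was a delight")
print(filtered)  # ['plot', 'movie', 'delight']
```

Eight input tokens shrink to three, which shows how the step reduces the term space.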
Stemming
Stemming techniques try to find the root of a word. Stemming converts words to their stems,
taking into account language-dependent linguistic knowledge. As words with the same root
mostly have the same or closely related meanings, these words can be conflated. For example, the
words user, users, used and using can all be stemmed to the word 'USE'.
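
The idea of conflation can be sketched with a toy suffix stripper. This is not the stemmer used in this work and is far cruder than an established algorithm such as Porter's; the suffix list and the function name naive_stem are assumptions made for illustration.

```python
# Suffixes are tried in order; longer ones come before their shorter forms.
SUFFIXES = ("ing", "ers", "er", "ed", "es", "s")

def naive_stem(word):
    """Strip the first matching suffix, keeping at least two characters."""
    w = word.lower()
    for suffix in SUFFIXES:
        if w.endswith(suffix) and len(w) - len(suffix) >= 2:
            return w[: -len(suffix)]
    return w

stems = {w: naive_stem(w) for w in ["user", "users", "used", "using"]}
```

All four words map to one shared stem. With this toy rule set the stem is the truncated form "us" rather than the dictionary word "use"; what matters for conflation is that the four variants collapse to a single term.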
Feature selection
A "feature" (attribute or variable) refers to the characteristic of the data. There are four basic
steps in a typical feature selection method (see Figure ):
a) A generation procedure generates the next candidate subset, which should retain enough
information for better performance of the model;
b) An evaluation function evaluates the candidate subset;
c) A stopping criterion decides when to terminate;
d) A validation procedure validates the subset.
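
The four steps can be sketched as a greedy forward search. This is only one possible generation strategy and is shown for illustration; the function forward_select, the toy evaluation function and the integer feature weights are hypothetical. Step (d), validation, would be performed on held-out data and is not shown.

```python
def forward_select(features, evaluate, max_size):
    """Greedy forward selection over a list of feature names."""
    selected = []
    best_score = evaluate(selected)
    while len(selected) < max_size:  # (c) stopping criterion: size limit
        # (a) generation: each candidate adds one unused feature
        candidates = [selected + [f] for f in features if f not in selected]
        if not candidates:
            break
        # (b) evaluation: score every candidate subset
        score, subset = max(((evaluate(c), c) for c in candidates),
                            key=lambda pair: pair[0])
        if score <= best_score:  # (c) stopping criterion: no improvement
            break
        selected, best_score = subset, score
    return selected, best_score

# Toy evaluation: total usefulness minus a cost of 10 per selected feature.
USEFULNESS = {"good": 40, "bad": 30, "the": 0, "plot": 5}

def toy_evaluate(subset):
    return sum(USEFULNESS[f] for f in subset) - 10 * len(subset)

chosen, score = forward_select(list(USEFULNESS), toy_evaluate, max_size=4)
```

In this toy run the search keeps "good" and "bad" and stops when adding "plot" or "the" would lower the score.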
Several studies have evaluated feature selection approaches. Forman
[Forman 2003] has given an extensive evaluation of various schemes in the topical text
classification task. Zheng et al. [Zheng 2004] experiment with feature selection schemes
on imbalanced data, where one class possesses more documents than the other. The choice of
salient features for each category can affect the classifier performance in the
subsequent step. Since we want to determine whether a sentence belongs to the opinionated or
the factual category (and likewise positive or negative), our aim is to choose terms or features that
are unique to or most representative of that category. Ideally, we would like to have a set of
non-overlapping features that represent the documents in each category. Realistically, this is not
possible. Thus, it is important to find a way to represent the importance of each feature in
both categories (or in all categories when faced with more than two classes).
There are three general classes of feature selection algorithms: filter methods, wrapper
methods and embedded methods. We have used filter methods in our methodology.

Filter Methods

Filter feature selection methods use a scoring function to assign a score to each feature,
which in turn helps to decide whether the feature is worth keeping, since the retained
features affect the performance of the model. The methods are often univariate and consider
each feature independently, or with regard to the dependent variable. Examples of filter
methods include the Chi squared test, information gain and mutual information.

CHI SQUARE

The Chi Square test is used in the field of statistics to test the independence of two events.
To calculate chi square, we take the square of the difference between the observed
(o) and expected (e) values and then divide it by the expected value. Chi Square measures the
deviation between the expected counts (e) and the observed counts (o).

$$\chi^2 = \frac{(o - e)^2}{e}$$
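
As a worked illustration of the chi square statistic, the sketch below scores a single term against the positive and negative classes. It assumes the usual 2x2 term/class contingency table, with the per-cell quantity (o - e)^2 / e summed over the four cells; the text gives only the per-cell formula, so the summation, the function name chi_square and the example counts are assumptions.

```python
def chi_square(n_term_pos, n_term_neg, n_noterm_pos, n_noterm_neg):
    """Chi square score of one term from document counts per class."""
    observed = [[n_term_pos, n_term_neg],
                [n_noterm_pos, n_noterm_neg]]
    total = sum(sum(row) for row in observed)
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    score = 0.0
    for i in range(2):
        for j in range(2):
            # Expected count if term and class were independent.
            e = row_totals[i] * col_totals[j] / total
            o = observed[i][j]
            if e > 0:  # skip empty rows or columns
                score += (o - e) ** 2 / e
    return score

# Hypothetical counts: the term occurs in 300 of 1000 positive reviews
# and in 100 of 1000 negative reviews.
skewed = chi_square(300, 100, 700, 900)
# A term spread evenly over both classes deviates nowhere from expectation.
even = chi_square(200, 200, 800, 800)
```

For the skewed term the expected counts are 200, 200, 800 and 800, giving 50 + 50 + 12.5 + 12.5 = 125, while the evenly spread term scores 0. A filter method would therefore rank the first term above the second.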

Problem statement and methodology


Research on sentiment analysis has been going on for a long time, and sentiment analysis
has become a major topic in research and technology. Due to the day-by-day
increase in the number of users on social networking websites, a huge amount of data
is produced in the form of text, audio, video and images. There is a need to build an automatic
sentiment analyser that determines whether a sentiment is negative or positive.
The main aim of this research is to compare the results obtained with the help
of supervised classifiers.

The methodology followed is:

a) We used the corpus by Bo Pang and Lillian Lee, which contains a collection of 1000
positive and 1000 negative movie reviews.
b) We then apply various pre-processing techniques to prepare the corpus for
further processing.
c) We apply feature selection schemes to select the most important features, which help
in constructing a better model.
d) Finally we apply a classification algorithm and compute various performance metrics,
which aid in understanding the accuracy and execution of the computed model.
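
The classification step can be sketched with a minimal multinomial Naive Bayes over unigram and bigram counts. The details here are assumptions: add-one smoothing, whitespace tokenisation, the names NaiveBayes and extract_features, and the four toy reviews, which stand in for the Pang and Lee corpus.

```python
import math
from collections import Counter, defaultdict

def extract_features(text):
    """Return the unigrams and bigrams of a whitespace-tokenised review."""
    words = text.lower().split()
    bigrams = [f"{a}_{b}" for a, b in zip(words, words[1:])]
    return words + bigrams

class NaiveBayes:
    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.feature_counts = defaultdict(Counter)
        self.vocab = set()
        for doc, label in zip(docs, labels):
            feats = extract_features(doc)
            self.feature_counts[label].update(feats)
            self.vocab.update(feats)
        return self

    def predict(self, doc):
        total_docs = sum(self.class_counts.values())
        best_label, best_logp = None, -math.inf
        for label, n_docs in self.class_counts.items():
            logp = math.log(n_docs / total_docs)  # class prior
            total = sum(self.feature_counts[label].values())
            for feat in extract_features(doc):
                if feat not in self.vocab:
                    continue  # ignore features never seen in training
                count = self.feature_counts[label][feat]
                # Add-one (Laplace) smoothing avoids zero probabilities.
                logp += math.log((count + 1) / (total + len(self.vocab)))
            if logp > best_logp:
                best_label, best_logp = label, logp
        return best_label

# Toy training data, invented for this example only.
train_docs = ["a great and moving film", "great acting and a great plot",
              "a dull and boring film", "boring plot and dull acting"]
train_labels = ["pos", "pos", "neg", "neg"]
model = NaiveBayes().fit(train_docs, train_labels)
predictions = [model.predict("great film"), model.predict("boring film")]
```

On this toy data the first test review is labelled positive and the second negative. Accuracy on a held-out set would then be the fraction of predictions that match the true labels.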

Results and discussions
