
International Journal of Computational Intelligence and Information Security, October 2014 Vol. 5, No. 7 ISSN: 1837-7823

Machine Learning Algorithms and their Significance in Sentiment Analysis for Context Based Mining

N. KARTHIKEYAN 1 and R. DHANAPAL 2

1 *Head (B.C.A Dept), Department of Computer Applications, Srimad Andavan Arts&Science College, Tiruchitrapalli, Tamil Nadu, India E-mail: karthi_badri@yahoomail.com

2* Principal K.C.S. Kasi Nadar College of Arts & Science, R.K.Nagar Chennai – 600 021 URL: www.kcskasinadarcollege.in E-mail: drdhanapal@gmail.com

Abstract

Sentiment analysis requires analysing various parts of a text in order to produce appropriate results. Since text is in general unstructured, it is difficult for an algorithm to determine the result. This paper uses machine learning algorithms (Neural Networks and SVM) and the J48 classification algorithm to determine the best approach for determining the polarity of a document for sentiment analysis. The results indicate that SVM performs better than the other techniques in determining document polarity.

Keywords: Context based mining, Sentiment analysis, SVM, ANN, J48

1. Introduction

In Content Based Image Retrieval (CBIR), we concentrate on retrieving images that correspond to a query image. In the usual text based image search, users provide keywords, based on which images are retrieved. In text based search, the ability of the user to provide an exact query is limited by several factors: colour, texture and similar intricate details cannot be represented in textual form in a consistent manner. The inability to provide proper input therefore introduces bias or error in the output. Current generation image search instead takes images as input, so that the match can be much better than with text as input. The drawback of the current approach is that the images are not searched in a single well defined context. The image could be anything and must be matched against all other images in the repository before the output is provided. Image based search and matching has been successful in many domains that are context specific. For example, matching iris scan images against a database containing only iris images has been very successful; similarly, facial recognition, fingerprint readers etc. are all very reliable because the images all come from a single well defined context. For a broad category of images, the drawback of providing an image as input and searching a repository for similar elements is that the user is handicapped because the context of the search is missing. For example, if the user provides the image of a dog and searches through the repository, the context could be any of the following: pets, breed based search, police/sniffer dogs, trained dogs, helper dogs, diseases suffered by dogs, food for dogs etc. By providing an image as input, the user is therefore unable to specify the context that he/she is looking for in the image result.
The human way of looking at an image must be studied from a psychological point of view rather than being treated as just reading all the pixels and trying to make sense of them. Human vision, or the perception of human vision to be precise, is based on the overall broad context; once we obtain the context, we ignore the local details. This is completely different from a computerized program. Here semantics and context sensitiveness play an important role. This brings out the need for filling the semantic gap in content based image retrieval. Concentrating on the low level features alone makes the search results biased and error prone. Also, changes in luminance, texture or colour do not change the context of an image, and it is the context that we are looking for here. The core concept in retrieving content from an image is currently based on pixel by pixel analysis of the image. But human vision doesn’t give the same importance to all pixels as a computer does. So in order to emulate human vision through computers, the key is semantics. To provide such a semantic based image retrieval system, the repository as well as the query image must be accompanied by some metadata. Metadata here provides the context. It could be keywords, descriptions and tags. Even sentiment polarity could be included to make the search much more effective and context sensitive. In this paper we try to bridge the semantic gap by including the sentiment polarity of the images in CBIR. The remainder of this paper is structured as follows; section II provides

2. Related works

A lot of research has gone into content based image retrieval. Thijs Westerveld in [1] used Latent Semantic Indexing to uncover hidden semantics. That work concentrates on including co-occurrence statistics to uncover the hidden semantic information. It tries to bring the best of both worlds, image features (content) and words (context), into one semantic space. Though the work showed better performance in mono and multilingual text retrieval, its application to multi-modal and cross modal image retrieval involves a lot of computational complexity, and its subjectivity complicates the process further.

In [2] David et al proposed several views regarding the importance of context sensitiveness in image retrieval. They have even quoted examples from newspapers that provide text as well as images in a biased manner favouring a particular political or religious faction. They have introduced a new platform and a diversity engine architecture for image retrieval based on opinion analysis, text analysis and content based information retrieval. Though they have stressed the importance of semantics and context sensitiveness in image retrieval, they have only provided an overview and have summarized the existing text, image and other multimedia based retrieval systems.

In [3] Liyan et al presented an approach that utilizes context information to learn adaptive rules for automatic and human in the loop clustering. The work is a bit more context aware as it considers the particular domain of face tagging and detection. The repository under consideration in their work consists only of human facial images, and hence context sensitiveness to a broader class is missing. Large scale context based retrieval of images requires analysis of millions or even billions of images and is hence computationally complex.

In [4] Thanh-Nghi Doan et al have proposed a parallel incremental methodology for power mean SVM based classification of large scale image datasets, and it is shown to handle thousands of visual classes effectively. Such a parallel approach towards context sensitive image retrieval could improve performance and accuracy as well. It also deals with imbalanced data.

In [5] David Ahlstrom et al have shown the effectiveness of simple and sophisticated tools for video exploration. The work provides insights from a real time video search competition for video exploration.

The next step in web search is based on including users’ sentiment/opinion effectively and hence providing context sensitive results. As suggested in [2], the importance of such sentiment analysis is on the rise as text mining systems are now being integrated with multimedia based information retrieval systems. So it is no longer just text or image based search, but a combination of them all, giving better results that are reliable in a wide variety of domains. Several machine learning based methods have been proposed for lexical analysis of text corpora and for inferring sentiment polarity from them.

In [6] Blinov et al have proposed a machine learning approach based on Support Vector Machines (SVM) and the maximum entropy method. Their approach includes information about the proportion of positive and negative words, their collocations and emoticons to better identify the context. But their approach is based on manual formation of emotional dictionaries specifically made for each domain. Since such context based emotional dictionaries are not widely available for all domains, it is not a scalable solution for general web based image retrieval systems.

Automated text classification has been done with machine learning approaches for a long time now. In [7] Ikonomakis et al have provided a detailed study of the state of the art in automated text classification using machine learning approaches.
In [8] Stefano et al presented SentiWordNet 3.0, which is the latest edition of a lexical resource specifically designed for opinion mining and sentiment classification applications. The differences between the various versions of SentiWordNet and its features are also clearly explained, along with the research applications of such a lexical resource in automated text classification and sentiment polarity analysis. They have also described the algorithm for automatic WordNet annotation and how it effectively classifies text into positive, negative and neutral elements.

Rudy et al in [9] proposed a hybrid approach for sentiment analysis based on rule based classification, supervised learning and machine learning. They applied it to movie reviews and product reviews and reported effective classification of sentiment polarity. Though the results are comparatively good, the hybridization greatly increases the computational complexity of the approach.

Bo Pang et al in [10] have considered sentiment analysis based on positive and negative polarity alone, independent of topic. They used Naive Bayes, maximum entropy classification and support vector machines for sentiment analysis, and reported that machine learning approaches are better than the human baseline when it comes to sentiment polarity.

3. System architecture

[Figure 1 diagram: Text → Content Analysis and Feature Vector Creation → Stop Word Elimination → Feature Matrix Creation (binary matrix) → Context Based Sentiment Analysis using Machine Learning]

Figure 1: System Architecture


Context based image retrieval uses the base information available with the images to retrieve the context in which they are being used. The system functions in four phases. The first phase analyzes the available data and creates a feature vector, which is a broken-down form of the available data. The second phase removes unnecessary words and shortlists the words needed for the later phases: stop words and symbols are removed from the feature vectors to refine them. After this refinement, the third phase creates the feature matrix from the reviews and the feature vectors. This matrix serves as the base for the fourth phase, context based sentiment analysis, in which machine learning is used to perform the analysis and find the classification. Figure 1 shows the overall system architecture of the sentiment analysis methodology.

4. Context Based Image Retrieval Using Machine Learning Approaches

The term context refers to perspective or situation. Content retrieval using context as the key has its own complexities, the first and foremost being sentiment retrieval from the data. In general, context directly refers to the sentiments with which a certain text has been rendered. Emotion analysis is the next level of sentiment analysis. While sentiment analysis refers to finding the polarity of the document (positive, negative or neutral), emotion analysis goes deeper and refers to the level of emotions. Our methodology classifies the images based on the polarity of the text, from which the context can be retrieved. The following four phases describe the working methodology of our system.

4.1. Content analysis and Feature Vector Creation

The content of an image can be derived directly from the structural elements of the image. But deriving the context from an image is complex and mostly inaccurate. Hence it is necessary to look for other data that depict the context. This information is mostly found in the metadata and in content that is in close proximity to the image. Metadata here refers to tags, descriptions or keywords corresponding to the image. Hence the initial process in sentiment mining is content analysis and feature vector creation. The content present in the available information is analyzed and tokenized, and the word vector is created. Here, the word vector is referred to as the feature vector. This vector contains each word and its frequency of occurrence. After the completion of this phase, all the data corresponding to the text that is to be analyzed will be listed.
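As a minimal sketch of this phase, the following Python snippet tokenizes a few metadata strings and counts word frequencies. The function name `build_feature_vector`, the regular expression used for tokenizing and the sample strings are illustrative assumptions, not the implementation used in this paper.

```python
import re
from collections import Counter

def build_feature_vector(texts):
    """Tokenize each text and count how often each word occurs overall."""
    counts = Counter()
    for text in texts:
        # Assumption: lowercase the text and treat runs of letters or
        # apostrophes as tokens; other tokenizers are equally possible.
        tokens = re.findall(r"[a-z']+", text.lower())
        counts.update(tokens)
    return counts

# Hypothetical metadata (tags/descriptions) attached to two images.
metadata = [
    "A loyal and friendly dog",
    "The dog was not friendly",
]
feature_vector = build_feature_vector(metadata)
print(feature_vector.most_common(3))
```

The resulting mapping from word to frequency plays the role of the feature vector described above.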

4.2. Stop word elimination

Stop words are words that do not contribute to the meaning of a sentence. In short, these are connectors, articles or pronouns. The major contributors in sentiment mining are the nouns, verbs, adverbs and adjectives that directly describe the activity taking place or determine the subject. All other words are mostly useless; they consume memory and reduce processing speed. Other types of stop words include punctuation such as the comma, full stop, colon, semicolon, question mark and exclamation mark. The text considered for mining includes user provided unstructured data, which means the data does not have a proper format like data from a database. Further, the text might not even be a proper English sentence. It is very likely to contain colloquial forms of a language, and it might even be multilingual. Even though our current methodology does not deal with multilingual data, this could be done in future. The process of stop word elimination uses the stop word collection of the storm project [12,13,14]. The feature vectors that were initially formed are filtered and the stop words occurring in them are eliminated. This removes a considerable amount of data from the main feature vector set, enabling faster computation.
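The filtering step can be sketched as follows. The small `STOP_WORDS` set below is a hypothetical stand-in chosen for illustration; it is not the stop word collection of the storm project referred to above.

```python
import string

# Hypothetical stop-word set for illustration only.
STOP_WORDS = {"a", "an", "and", "the", "was", "is", "it", "of", "to", "he", "she"}

def remove_stop_words(tokens):
    """Drop stop words and tokens that consist only of punctuation."""
    kept = []
    for token in tokens:
        word = token.lower()
        if word in STOP_WORDS:
            continue
        if all(ch in string.punctuation for ch in word):
            continue
        kept.append(word)
    return kept

tokens = ["The", "dog", "was", "friendly", "!", "and", "loyal", "."]
filtered = remove_stop_words(tokens)
print(filtered)
```

Note that negations such as "not" are deliberately absent from this illustrative set, since removing them could flip the polarity of a sentence.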

4.3. Feature matrix creation

The next phase is the creation of the feature matrix. This method maps the content with the already defined feature vectors and creates a feature matrix. This phase creates an n×m matrix, where n refers to the number of texts considered for evaluation, and m refers to the number of items in the feature vector.

\[
A = (a_{ij}) =
\begin{pmatrix}
a_{11} & \cdots & a_{1m} \\
\vdots & \ddots & \vdots \\
a_{n1} & \cdots & a_{nm}
\end{pmatrix}
\tag{1}
\]

\[
a_{ij} =
\begin{cases}
1 & \text{if } \mathrm{word}_j \in \mathrm{review}_i \\
0 & \text{otherwise}
\end{cases}
\tag{2}
\]

Equation (1) shows a sample feature matrix, while equation (2) shows the condition for populating it. From equation (1) it is clear that each row of the matrix refers to a text and each column refers to a word taken from the feature vector. The matrix is populated such that if the word occurs in the given text, an entry of 1 is added to the matrix, and if the word is not present in the given text, an entry of 0 is added. The feature matrix is generally large and is used as the base for the machine learning algorithms.
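The population rule of equation (2) can be illustrated with a short Python sketch. The helper name `build_feature_matrix`, the whitespace tokenization and the toy vocabulary are assumptions made for the example.

```python
def build_feature_matrix(reviews, vocabulary):
    """Return an n x m list of lists with a[i][j] = 1 if word j occurs in review i."""
    matrix = []
    for review in reviews:
        # Assumption: reviews are already cleaned, so splitting on whitespace suffices.
        words = set(review.lower().split())
        matrix.append([1 if word in words else 0 for word in vocabulary])
    return matrix

vocabulary = ["dog", "friendly", "loyal", "boring"]   # m = 4 words
reviews = ["friendly loyal dog", "boring dog"]        # n = 2 texts
feature_matrix = build_feature_matrix(reviews, vocabulary)
for row in feature_matrix:
    print(row)
```

Each row corresponds to one text and each column to one word, so the sketch produces a 2×4 binary matrix for this toy input.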

4.4. Context based Sentiment Analysis using Machine Learning Algorithms

After the preprocessing and data preparation phases, the data is ready for sentiment analysis. Given the nature of the problem, we determine that machine learning algorithms work best for sentiment analysis. For a supervised machine learning system to work well, it should be provided with appropriate training and test data. The discussion here is mainly based on supervised learning, because the nature of the problem demands labeling of terms so that they can be used during future classifications. Hence unsupervised methods might not work efficiently without any sort of training. Both the training and the test data are labeled with their corresponding classes and are provided to the machine learning system.

5. Results and discussion

The data set used is the movie review data from [15]. The base form of this data was used in [16] for polarity classification. This domain is experimentally convenient because reviews provide a large amount of text, and the review text as a whole describes the overall intention of the user, which makes it efficient data for the purpose of classification. The original source of this data was the Internet Movie Database (IMDb) archive of the ‘rec.arts.movies.reviews’ newsgroup at [17]. The reviews are categorized into positive and negative and are stored separately as training and test corpora. The comparison focuses on machine learning approaches (Neural Networks and SVM) and the J48 classification algorithm.


Figure 2: Result of J48

Figure 2 shows the result obtained from the J48 Classifier.

[ROC plot: TPR versus FPR]

Figure 3: ROC for J48 (Positive Sentiment)

Figure 3 shows the ROC plot for the positive sentiment. From the curve, it can be observed that the accuracy is approximately 50%. J48 being a primitive classifier, the result obtained is only average; hence we conclude that a machine learning approach would be a better option.
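For readers unfamiliar with ROC plots, the sketch below shows one common way to obtain the (FPR, TPR) points and the area under the curve from classifier scores. The scores and labels are made-up illustrative values, not results from this paper, and the function names are assumptions.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) points obtained by sweeping a threshold over the scores.

    labels use 1 for positive and 0 for negative reviews.
    Assumption: all scores are distinct (ties are not grouped).
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    # Sort by descending score so that each step lowers the threshold.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    for _, label in ranked:
        if label == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    """Area under the curve using the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

scores = [0.9, 0.8, 0.7, 0.6]   # hypothetical classifier scores
labels = [1, 0, 1, 0]           # hypothetical true polarities
points = roc_points(scores, labels)
print(points, auc(points))
```

A curve that stays close to the diagonal has an area close to 0.5, which corresponds to chance-level ranking of positive and negative examples.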


Figure 4: Result of ANN

Figure 4 shows the working of the neural network model. Due to the continual training approach and the very large data size, the training time of the neural network is very high. Further, the error rate is also high. It can be observed from Figure 4 that the error rate is 2.133 and that the error reduction rate is also very low. Hence the option of considering neural networks is eliminated. The ENCOG framework is used for implementing the neural network model. The neural network was constructed with three layers: the input and output layers have no bias neurons, and the processing layer has two bias neurons. The input layer was constructed according to the number of words obtained after pre-processing, which in our case is 3190. The ActivationLinear function was used in the input layer, and the ActivationTanH function in the processing and output layers. Resilient propagation was used to train the network. The network design is as follows (Table 1):

Table 1: Neural Network Setup

Parameter                                        Value
No. of layers                                    3
No. of neurons in input layer                    3190
No. of bias neurons in the input layer           0
No. of neurons in processing layer               3192
No. of bias neurons in the processing layer      2
No. of neurons in output layer                   1
No. of bias neurons in the output layer          0
Activation function used in input layer          ActivationLinear
Activation function used in processing layer     ActivationTanH
Activation function used in output layer         ActivationTanH
Neural network training function                 Resilient Propagation
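To make the layer arrangement of Table 1 concrete, the following plain Python sketch runs one forward pass through a scaled-down network of the same shape (linear input, TanH processing layer, TanH output). It does not use the ENCOG API; the layer sizes, the random weights, the helper names and the way bias terms are attached (one bias weight per neuron) are all assumptions made for illustration, and training is omitted.

```python
import math
import random

def make_layer(n_inputs, n_outputs, rng):
    """Random weights plus one bias weight per output neuron (an assumption)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(n_inputs + 1)]
            for _ in range(n_outputs)]

def forward(layer, inputs):
    """Weighted sum followed by tanh; the trailing 1.0 feeds the bias weight."""
    extended = list(inputs) + [1.0]
    return [math.tanh(sum(w * x for w, x in zip(weights, extended)))
            for weights in layer]

rng = random.Random(0)
n_words = 6  # stands in for the 3190 words reported in the paper
hidden_layer = make_layer(n_words, n_words + 2, rng)
output_layer = make_layer(n_words + 2, 1, rng)

# One binary feature-matrix row; a linear input activation passes it through unchanged.
x = [1, 0, 1, 1, 0, 0]
hidden = forward(hidden_layer, x)
polarity = forward(output_layer, hidden)[0]
print(polarity)
```

Because the output neuron uses tanh, the predicted value lies between -1 and 1, which can be read as a score between negative and positive polarity.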

The same data set is considered and analysis is performed using SVM. The RBF kernel function is used for classification.

\[
K(x_i, x_j) = \exp\left(-\gamma \lVert x_i - x_j \rVert^{2}\right), \quad \gamma > 0
\tag{3}
\]
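As a small illustration of equation (3), the RBF kernel value for two feature vectors can be computed as below. The function name `rbf_kernel` and the sample vectors are assumptions for the example; an SVM library would normally evaluate the kernel internally.

```python
import math

def rbf_kernel(x_i, x_j, gamma):
    """K(x_i, x_j) = exp(-gamma * ||x_i - x_j||^2), with gamma > 0."""
    if gamma <= 0:
        raise ValueError("gamma must be positive")
    squared_distance = sum((a - b) ** 2 for a, b in zip(x_i, x_j))
    return math.exp(-gamma * squared_distance)

# Identical vectors give the maximum similarity; distant vectors approach 0.
print(rbf_kernel([1, 0, 1], [1, 0, 1], gamma=0.5))
print(rbf_kernel([1, 0, 1], [0, 1, 0], gamma=0.5))
```

The parameter γ controls how quickly the similarity decays with distance, so larger values make the kernel more local.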

The SVM requires a special format for reading the data. The expected input format for the SVM is

[label] [index_1]:[value_1] [index_2]:[value_2] ... [index_n]:[value_n]

(4)

The values (value_1, value_2, …, value_n) in the given format are normalized to the range -1 to 1. To convert the data into the required format, Max-Min Normalization is used, which is of the form,

(5)
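A commonly used form of max-min normalization maps a value v from the observed range [min, max] to a target range [new_min, new_max] as new_min + (v - min)(new_max - new_min)/(max - min). The sketch below applies that form with the target range -1 to 1 and then writes one example in the format of (4). Both helper names, the handling of constant columns and the sample values are assumptions for illustration; the exact expression used in this paper may differ in presentation.

```python
def min_max_scale(values, new_min=-1.0, new_max=1.0):
    """Rescale values linearly so that min -> new_min and max -> new_max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant column is mapped to the lower bound.
        return [new_min for _ in values]
    span = new_max - new_min
    return [new_min + (v - lo) * span / (hi - lo) for v in values]

def to_svm_line(label, row):
    """Format one example as '[label] [index]:[value] ...' with 1-based indices."""
    pairs = " ".join(f"{i}:{value:g}" for i, value in enumerate(row, start=1))
    return f"{label} {pairs}"

scaled = min_max_scale([0, 5, 10])
line = to_svm_line("+1", scaled)
print(line)
```

In practice the scaling would be applied per feature (column) across all reviews before the rows are written out.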

Sample input data for the SVM is shown in Figure 5.


Figure 5: Sample input data for SVM

[ROC plot: TPR versus FPR]

Figure 6: ROC for SVM

Figure 6 shows the ROC plot, which provides a promising accuracy. Hence after analysis of the results, SVM is found to work efficiently for the process of context mining. Figure 7 shows the result obtained from SVM Classifier.


Figure 7: Result of SVM

6. Conclusion

This paper is an initial implementation that analyses the available data with the classification algorithms in order to select the appropriate technique for the next level of analysis. The implementation uses data obtained from the IMDb dataset, and from the results it is clear that SVM works best in the area of context mining. This process can be further improved by using one class classification techniques rather than multi-class classification. Further, our next research proposal will take this research forward into mining levels of polarity rather than providing a single polarity base. The level of polarity can be analyzed and used for performing emotion analysis, which is a deeper form of sentiment analysis.

7. References

[1] Thijs Westerveld, (2000), “Image Retrieval: Content versus Context”, University of Twente, Department of Computer Science, Parlevink Group, PO Box 217, 7500 AE Enschede, The Netherlands.

[2] David Paul Dupplaw, Michael Matthews, Richard Johansson, Giulia Boato, Andrea Costanzo, Marco Fontani, Enrico Minack, Elena Demidova, Roi Blanco, Thomas Griffiths, Paul Lewis, Jonathon Hare, Alessandro Moschitti, (2014), “Information extraction from multimedia web documents: an open-source platform and testbed”, Int J Multimed Info Retr 3:97–111.

[3] Liyan Zhang, Dmitri V. Kalashnikov, Sharad Mehrotra, (2014), “Context Assisted Face Clustering Framework with Human-in-the-Loop”, International Journal of Multimedia Information Retrieval, Volume 3, Issue 2, pp 69-88.

[4] Thanh-Nghi Doan, Thanh-Nghi Do, Francois Poulet, (2014), “Parallel Incremental Power Mean SVM for the Classification of Large Scale Image Datasets”, International Journal of Multimedia Information Retrieval, Volume 3, Issue 2, pp 89-96.

[5] Klaus Schoeffmann, David Ahlstrom, Werner Bailer, Claudiu Cobarzan, Frank Hopfgartner, Kevin McGuinness, Cathal Gurrin, Christian Frisson, Duy-Dinh Le, Manfred Del Fabro, Hongliang Bai, Wolfgang Weiss, (2014), “The Video Browser Showdown: A Live Evaluation of Interactive Video Search Tools”, International Journal of Multimedia Information Retrieval, Volume 3, Issue 2, pp 113-127.

[6] Blinov P. D., Klekovkina M. V., Kotelnikov E. V., Pestov O. A. (2013), “Research of lexical approach and machine learning methods for sentiment analysis”.

[7] M. Ikonomakis, S. Kotsiantis, V. Tampakas, (2005), “Text Classification Using Machine Learning Techniques”, Wseas Transactions On Computers, Issue 8, Volume 4, pp. 966-974.

[8] Stefano Baccianella, Andrea Esuli, Fabrizio Sebastiani, (2010), “SENTIWORDNET 3.0: An Enhanced Lexical Resource for Sentiment Analysis and Opinion Mining”, LREC. Vol. 10.

[9] Rudy Prabowo, Mike Thelwall, (2009), “Sentiment Analysis: A Combined Approach”, Journal of Informetrics 3.2: 143-157.

[10] Bo Pang, Lillian Lee, Shivakumar Vaithyanathan, (2002), “Thumbs up? Sentiment Classification using Machine Learning Techniques”, Proceedings of the ACL-02 conference on Empirical methods in natural language processing-Volume 10.

[11] Rajaraman, A.; Ullman, J. D. (2011). "Data Mining". Mining of Massive Datasets. pp. 1–

[12] http://storm-project.net, Referred on: 3 Oct 2014.

[13] https://github.com/nathanmarz/storm, Referred on: 3 Oct 2014.

[14]

[15]

[16] Pang, Bo, Lillian Lee, and Shivakumar Vaithyanathan. (2002), "Thumbs up? Sentiment classification using machine learning techniques." Proceedings of the ACL-02 conference on Empirical methods in natural language processing-Volume 10.

[17] http://reviews.imdb.com/Reviews, Referred on: 3 Oct 2014.
