
Less is More: Selecting Informative Unigrams for Sentiment Classification

Ben Christel
Stanford University
bxel@stanford.edu

Abstract
Data sparsity is often a significant problem for the performance of supervised classifiers, especially those that use bigrams, parsing, or tagging in addition to unigram features. Out of the millions of unique features that may occur in a training corpus, we would like to find the subset of features that is most informative for the problem at hand. In this paper, I take two approaches to the problem of removing uninformative tokens from text. First, I investigate several computationally inexpensive methods for evaluating sentiment relevance and polarity in context. Second, I develop a novel metric for directly evaluating the sentiment-relevance of features out of context. I conclude that the context-free approach is preferable.

1 Introduction

The expression of sentiment is a vital component of human communication. In some forms of writing, such as movie or product reviews, the primary purpose of a document may even be to convey sentiment towards a particular entity. However, not every word in a typical review is put towards the purpose of expressing the writer's sentiment; many movie reviews discuss the plot of the movie and the reactions of other critics. Part of the task of sentiment analysis, then, is to decide what subset of the words or sentences in a review is truly sentiment-relevant. In this paper, I evaluate several approaches for determining the sentiment-relevance of words, phrases, and sentences, and making relevance information available to a binary sentiment classifier. My initial experiments focus on finding words and phrases that are subjective, negated, or have low speaker commitment, and tagging or removing them. However, results indicate that these approaches are less useful than directly estimating the sentiment-relevance of individual unigrams taken out of context.

2 Approach

My system performs the well-researched task of binary sentiment classification of reviews. I opted to use a fairly large movie review corpus and a Maximum Entropy classifier, in order to be able to explore more features and tokenization schemes without worrying about data sparsity. This is in contrast to much of the foundational work in sentiment analysis, which has typically used smaller corpora (up to a few thousand documents) and Naive Bayes or SVM classifiers (Pang and Lee 2002, Turney 2002, Nasukawa and Yi 2003, Pang and Lee 2004, Mullen and Collier 2004).

2.1 Corpora

I used two publicly-available corpora in my experiments: the Stanford Large Movie Review Dataset (Maas, et al. 2011), and the corpus of subjective and objective sentences from (Pang and Lee 2004). I preferred to use the Large Movie Review Dataset over Pang and Lee's movie review corpus because it is larger and there are fewer problematic correlations between the training and test sets, as shown by (Maas, et al. 2011). In particular, the Large Movie Review Dataset ensures that the training and test sets contain reviews about disjoint sets of movies. This makes the names of movies and actors less useful features for classifying unseen data, as they should be if the classifier is to perform well on a wide variety of movie reviews.

2.2 Classifier

The MaxEnt classifier I used is the one included with the machine learning toolkit MALLET (McCallum 2002). I also used MALLET's automatic cross-validation feature for many of my experiments.

2.3 Baseline

As a strong baseline, I tokenized the raw review text on whitespace and input the resulting tokens into MALLET's MaxEnt classifier. This yielded 86.7% accuracy in 10-fold cross-validation. This is more useful than using a random classifier as the baseline; since MALLET tokenizes on whitespace by default, any new features or text processing techniques must do better than simple whitespace tokenization to be of any value at all.
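The baseline pipeline is simple enough to sketch. The paper uses MALLET's MaxEnt trainer; the sketch below substitutes scikit-learn's LogisticRegression as a rough stand-in for a MaxEnt classifier and assumes the dataset's usual train/pos and train/neg directory layout, so it illustrates the shape of the baseline rather than the exact tooling.

```python
# Minimal sketch of the whitespace-tokenized baseline (assumed file layout,
# scikit-learn LogisticRegression standing in for MALLET's MaxEnt).
import glob
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def load_reviews(pattern, label):
    docs, labels = [], []
    for path in glob.glob(pattern):
        with open(path, encoding="utf-8") as f:
            docs.append(f.read())
        labels.append(label)
    return docs, labels

pos_docs, pos_labels = load_reviews("train/pos/*.txt", 1)
neg_docs, neg_labels = load_reviews("train/neg/*.txt", 0)
docs, labels = pos_docs + neg_docs, pos_labels + neg_labels

# Whitespace tokenization only, mirroring the baseline condition.
vectorizer = CountVectorizer(tokenizer=str.split, token_pattern=None, lowercase=False)
X = vectorizer.fit_transform(docs)

clf = LogisticRegression(max_iter=1000)
scores = cross_val_score(clf, X, labels, cv=10)
print("mean 10-fold accuracy:", scores.mean())
```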

2.4 Tokenization

For all of the experiments described below, I used a slightly more sophisticated tokenization scheme than the baseline of tokenizing on whitespace. I broke punctuation away from word tokens, and assigned each character of punctuation to its own token, with a few exceptions. I kept hyphenated words and contractions as single tokens, on the assumption that at least a few hyphenated words would occur frequently enough to be useful features. I also kept ellipses as a single token, since they appear more frequently in negative reviews and are thus sentiment-relevant. I did not downcase the tokens, since uppercase is frequently used to express heightened emotion, and I thought the sentiment-relevance of words might vary depending on whether or not they began a sentence (this hypothesis turned out to be correct). Reviewers often used two consecutive hyphens to stand for an em dash; I therefore kept consecutive hyphens together and separated them from surrounding words and other punctuation. Looking at the data, I noticed that reviewers sometimes include a numeric rating in the text of their review. Often, this is a score out of ten points, expressed in the form X/10. Since many reviewers seemed to be using this rating scheme, I opted to retain any two numbers separated by a slash as a single token. This turned out to be a highly useful feature, although there were a few false positives (most notably dates of the form MM/DD/YY) and this strategy would probably not generalize well to other domains.
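The tokenizer itself is not published with the paper, but the rules above are concrete enough to sketch. The regular expression below is my approximation of them; the exact rule set and edge-case handling are assumptions.

```python
import re

# Approximate sketch of the tokenization rules described above.
TOKEN_RE = re.compile(r"""
    \d+/\d+            # numeric ratings such as 10/10 (also matches dates)
  | \.\.+              # ellipses of two or more periods, kept whole
  | --+                # runs of hyphens standing in for an em dash
  | \w+(?:['-]\w+)*    # words, keeping contractions and hyphenated words intact
  | [^\w\s]            # any other punctuation character as its own token
""", re.VERBOSE)

def tokenize(text):
    # Case is preserved: uppercase and sentence-initial forms may carry
    # sentiment information, as noted above.
    return TOKEN_RE.findall(text)

print(tokenize("A so-called masterpiece... I'd give it 3/10 -- avoid!"))
# ['A', 'so-called', 'masterpiece', '...', "I'd", 'give', 'it', '3/10', '--', 'avoid', '!']
```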

2.5 Experimental Conditions

I explored several text preprocessing strategies, with the aim of pinpointing words and phrases in the text that are sentiment-relevant in context. In particular, I noted that the intrinsic sentiment content of a word may differ dramatically from its contextual sentiment content where it appears in a review. There are many reasons why this might be the case: negation can reverse the sentiment polarity of a word; a sentiment-heavy phrase might express the opinion of someone other than the author of the review; a sentiment-related word might be used in an objective sentence, such as in a plot summary; or the word might appear in an idiomatic context that changes its relevance to sentiment, as in "awfully good". Since confusion over the sentiment-relevance of a particular word is a major concern in classifiers that use a bag-of-words model for a text, having ways to assess the contextual sentiment of a given word seems like it would be tremendously useful for classification.

The text preprocessing techniques I tried were: negation tagging, addition of bigram tokens, removal of clauses that appear before "but", removal of objective words using the corpus of subjective and objective sentences from (Pang and Lee 2004), and removal of objective sentences. Each of these techniques was tested alone in a separate experimental condition. The tokenization strategy described in the previous section was applied before any other text preprocessing. (A code sketch of the negation-tagging and n-gram conditions appears at the end of this section.)

The negation-tagging condition used the same method that Pang and Lee used in their 2002 paper. The tag NEG was prefixed to all tokens occurring between a negative word and a clause-ending punctuation mark. Although Pang and Lee reported no statistically significant difference in classifier accuracy from using negation tags, I hoped that the larger size of my corpus would counteract the data sparsity problems that might have resulted from using negation tagging in the Pang and Lee corpus.

In the bigram feature condition, I appended tokens of the form "A B" to the text for each pair of consecutive unigrams A and B that consisted of alphabetic characters. I chose not to include bigrams where one or both of the constituent unigrams was a punctuation mark or a number, since I thought that these bigrams would not convey better sentiment information than the unigrams alone, and I wanted to limit the number of unique tokens. The goal of adding bigrams was to distinguish between sentiment-relevant and sentiment-irrelevant uses of certain words. For example, while the phrase "I loved" is likely very sentiment-relevant, the phrase "he loved" is most likely part of a plot summary and thus not as sentiment-relevant. I also added trigrams whose middle element was one of the articles "a" or "the"; the reasoning behind this was that bigrams like "of the" and "to a" are unlikely to be informative unless followed by a noun, and I wanted to capture verb-noun combinations like "is the worst". This strategy turned out to work well; many of the most useful features turned out to be trigrams of this type, as I will discuss later.

Removing clauses before the word "but" was a simple technique I tried in order to reduce the number of words and phrases with low speaker commitment. Pang and Lee (2002) discuss the "frustrated expectations" structure of many movie reviews, and the difficulties it presents for sentiment analysis. Review writers will often discuss a movie's reputation early in the review if it contrasts with their own personal response to the movie, since this contrast intensifies the expressed sentiment for human readers. It wreaks havoc with sentiment-analysis systems, however, since clauses or entire sentences will express a sentiment opposite to that held by the reviewer. To partially address this problem, I removed tokens that appear between a full stop or a form of "be" and the word "but". This seems like a useful strategy since "but" is often used to denote a contrast between the speaker's opinion and someone else's, or between two sentiment-heavy phrases, the first of which is deemphasized.

For the conditions in which I removed objective words or sentences, I used Pang and Lee's (2004) subjectivity corpus, which consists of 10000 sentences, evenly divided between plot summary and expressions of opinion. In the objective-word removal condition, tokens were removed from the review corpus if they appeared more frequently in objective sentences than subjective ones. Surprisingly, this simpleminded approach resulted in better performance than the objective-sentence removal condition, in which I trained a Naive Bayes classifier on the Pang and Lee corpus and used it to identify and remove objective sentences from the review corpus.
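As a rough illustration of two of these conditions, the sketch below implements Pang and Lee (2002)-style negation tagging and the bigram/article-trigram token generation described above. The negation-word list, the clause-ending punctuation set, and the NEG_ tag spelling are assumptions; the paper does not specify them.

```python
# Sketch of the negation-tagging and n-gram preprocessing conditions.
NEGATION_WORDS = {"not", "no", "never", "isn't", "doesn't", "don't", "can't"}  # assumed list
CLAUSE_END = {".", ",", ";", ":", "!", "?"}                                    # assumed set
ARTICLES = {"a", "the"}

def negation_tag(tokens):
    """Prefix NEG_ to every token between a negation word and the next
    clause-ending punctuation mark."""
    tagged, in_scope = [], False
    for tok in tokens:
        if tok in CLAUSE_END:
            in_scope = False
            tagged.append(tok)
        elif in_scope:
            tagged.append("NEG_" + tok)
        else:
            tagged.append(tok)
            if tok.lower() in NEGATION_WORDS:
                in_scope = True
    return tagged

def add_ngram_tokens(tokens):
    """Append bigrams of alphabetic unigrams, plus trigrams whose middle
    element is an article (capturing combinations like 'is the worst')."""
    extra = []
    for a, b in zip(tokens, tokens[1:]):
        if a.isalpha() and b.isalpha():
            extra.append(a + " " + b)
    for a, b, c in zip(tokens, tokens[1:], tokens[2:]):
        if a.isalpha() and c.isalpha() and b.lower() in ARTICLES:
            extra.append(a + " " + b + " " + c)
    return tokens + extra

toks = "This is the worst film , I do not recommend it .".split()
print(negation_tag(toks))      # ... 'not', 'NEG_recommend', 'NEG_it', '.'
print(add_ngram_tokens(toks))  # adds 'is the worst' among the appended n-grams
```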

3 Initial Results

The experimental conditions described above were evaluated using mean accuracy in 10-fold cross-validation on the 25000-document training set. Since there are only two classes, and the training set is divided evenly between positive and negative reviews, accuracy is an adequate measure of performance. Standard error for the measurements of accuracy is between 0.13% and 0.27%.

Condition                        Accuracy
Baseline                         86.7%
Tokenization                     88.1%
Bigrams                          87.4%
Removing Clauses Before "But"    87.9%
Negation-Tagging                 88.1%
Removing Objective Words         88.0%
Removing Objective Sentences     87.6%

Table 1: Accuracy for Baseline and Experimental Conditions

3.1 Analysis

My tokenizer raised classifier accuracy significantly above the baseline achieved by tokenizing on whitespace. This can be attributed partly to reduction of data sparsity (since tokens consisting of one or more punctuation marks attached to a word are eliminated from the lexicon). Using the number of unique tokens in the corpus as a rough measure of data sparsity, it is clear that my tokenization scheme reduces sparsity; there are 250149 unique tokens in the whitespace-tokenized corpus, and 114322 unique tokens using my tokenizer. The tokenizer was also specifically designed to capture strings that I thought would be sentiment-relevant. For example, ellipses turned out to be relevant for sentiment; "...." appears 739 times in negative reviews, and 433 times in positive reviews. Reviewers often included numeric ratings like 10/10 in the text of their reviews; the tokenizer kept these ratings as a single token, so they turned out to be highly sentiment-relevant.

An informal look at the reviews that were classified incorrectly showed that these reviews were often highly ambiguous in sentiment. In some cases, reviews consisted mostly of plot summary with only one or two evaluative sentences. If the tone of a review was dry, the classifier had trouble discerning sentiment, since there were no words that stood out as particularly positive or negative. Some reviews seemed entirely ambiguous in sentiment, to the point that human readers would probably disagree on the correct classification. One review tagged as positive consisted mostly of minor criticisms and a single passage that could be interpreted as positive: "I want to say that I have enjoyed all 4 movies so far [...] Cant wait until the next movie. The way Clark talks will get you every time." [sic] Even this passage has a possible negative reading: the reviewer might mean that he wishes he could give all four movies a positive review, but didn't like the latest one, and he is holding out hope for the rest of the series. These failure cases point out the flaw in a binary sentiment classification scheme: sentiment is a continuous spectrum. However, given the task of binary classification, this classifier may be performing within a few percent of the maximum possible for a bag-of-words model.

Adding bigram features significantly decreased classifier accuracy; this is most likely attributable to data sparsity. If bigram tokens are added to the corpus, the number of unique tokens explodes to 1481462, an increase of almost 1300%. However, my hypothesis that there would be sentiment-relevant bigrams in the corpus turned out to be correct. Using my metric for the sentiment-relevance of a token (discussed in a later section), many bigrams had high scores. For example, the string "of the worst" appears 547 times in negative reviews and only 21 times in positive reviews. The phrase "your money" was also one of the most negative bigrams, appearing 171 times in negative reviews and 10 times in positive reviews, usually in the phrase "don't waste your money". This was a case of a bigram whose sentiment was not apparent from its constituent unigrams. Although these bigrams were highly sentiment-relevant and provided information that unigrams did not, the sheer number of bigrams introduced enough noise into the classifier to drown out the effects of the few that were actually useful features.

Removing clauses and phrases before "but" was another preprocessing technique that failed to improve accuracy. This is most likely due to the fact that "but" often expresses contrasts along dimensions other than sentiment polarity, so removing all phrases followed by "but" may have eliminated sentiment-relevant information. The concluding sentence of a positive review for the movie Zentropa is "Grim, but intriguing, and frightening." The word "grim" actually turns out to correlate with positive sentiment, so removing it can only hurt accuracy. Another reviewer (of a different film) wrote, "I had to paraphrase many of the subtitles for my daughter, but much of the film is visually self-explanatory." The key word here is "subtitles", which appears more than twice as often in positive reviews as in negative ones, but is lost if the clause before "but" is removed. The fact that sentiment can often be discerned from words that don't appear directly sentiment-relevant argues against removing entire clauses or sentences from reviews.

This observation can also be applied to the failure analysis for the objective-sentence removal condition. Perhaps the most surprising result to come out of this initial round of experiments was that removing sentences classified as objective significantly hurt accuracy, while removing single words classified as objective did not significantly affect accuracy. However, this makes sense in retrospect; removing entire sentences, even if they are merely factual, can hurt efforts at sentiment analysis because certain features of objective sentences actually correlate strongly with sentiment. In light of this, context-free approaches to evaluating the sentiment-relevance of a token may yield better results. This observation motivated my next round of experiments, which focused on eliminating from the corpus tokens with low sentiment-relevance. The sentiment-relevance of tokens was evaluated out of context using frequency counts in positive and negative reviews.


4 A Metric for Sentiment-Relevance

Intuitively, sentiment-relevant tokens have two properties that are easily observable in the corpus: they are significantly more frequent in one sentiment class than the other, and they occur relatively frequently in the corpus. These two properties are often at odds with each other, however. The most common tokens in the corpus are articles, prepositions, and common verbs with essentially no sentiment content, while many tokens that are heavily skewed toward positive or negative reviews are rare in the corpus in general. Since it is not clear a priori what weight should be given to each of these desirable properties, I used a metric that allows the weights to be adjusted. The relevance R_T of a token T is given as:

    R_T = |numpos_T - numneg_T| / (numpos_T + numneg_T + X)

where numpos_T is the number of occurrences of T in positive reviews in the training data, numneg_T is the number of occurrences in negative reviews, and X is the weight-adjusting constant. I determined the optimal value of X experimentally to be around 50 to 100; presumably this optimal value is dependent on both the number and distribution of words in the corpus. In general, higher values of X give higher scores to more common tokens, while lower values of X give higher scores to tokens that are heavily biased toward one class or the other, regardless of how common they are in the corpus.

The application of the sentiment-relevance metric to classification is straightforward: each unique token in the training data is given a relevance score based on its frequency counts in positive and negative reviews. The tokens are then sorted in descending order of relevance and the top N of them chosen for training and testing the classifier. In practice, this was done by creating copies of all documents in the training and evaluation folds with the tokens below the relevance threshold removed, then training and evaluating the classifier on those documents. All experiments were done with ten-fold cross-validation. Mean accuracies are reported.

The experimental variables for the tests of the sentiment-relevance metric were the weighting constant X and the number of tokens used for classification, N. As a baseline, I took N = ∞ (all tokens retained); this is effectively the same as the tokenization-only condition from the first round of experiments. The baseline accuracy was 88.1%.
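A minimal sketch of the metric and the top-N filtering step follows. The function and variable names are mine, and it is assumed that documents arrive already tokenized as in Section 2.4. The example value reproduces the Appendix A score for "of the worst" (21 positive and 547 negative occurrences).

```python
from collections import Counter

def relevance(numpos, numneg, X):
    """R_T = |numpos_T - numneg_T| / (numpos_T + numneg_T + X)."""
    return abs(numpos - numneg) / (numpos + numneg + X)

def select_top_tokens(pos_docs, neg_docs, X=50, N=3000):
    """Rank every token in the training data by relevance and keep the N
    highest-scoring tokens (pos_docs/neg_docs are lists of token lists)."""
    pos_counts = Counter(tok for doc in pos_docs for tok in doc)
    neg_counts = Counter(tok for doc in neg_docs for tok in doc)
    vocab = set(pos_counts) | set(neg_counts)
    ranked = sorted(vocab,
                    key=lambda t: relevance(pos_counts[t], neg_counts[t], X),
                    reverse=True)
    return set(ranked[:N])

def filter_tokens(doc, keep):
    """Copy a document with all tokens below the relevance cutoff removed,
    as done before training and evaluating the classifier."""
    return [tok for tok in doc if tok in keep]

print(round(relevance(21, 547, 50), 3))  # 0.851, matching "of the worst" in Appendix A
```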

5 Results

Table 2 summarizes the results for my experiments with N ∈ [500, 6000] and X ∈ [25, 200]. The classifier performs poorly with very few tokens, and peaks at around 3000 tokens. Thereafter, accuracy begins to converge to the baseline as more tokens are added. At N = 3000, accuracy was best when X was chosen to be 50. In fact, X = 50 seemed to perform well for a variety of values of N. However, this success was most likely due to overfitting, as I will discuss later. The standard error for all measurements of accuracy was below 0.2%; therefore, the classifier performed significantly better than baseline for X = 100, N = 3000 and X = 50, N ∈ {3000, 3500, 4000}.

Two classifiers were evaluated on an unseen test dataset of 25000 documents: the baseline of N = ∞ and the best-performing experimental classifier, which used the parameters X = 50, N = 3000. Results for this test are shown in Table 3. From these results, it is clear that the classifier using only 3000 tokens is overfitting the training data. Ways to mitigate this effect are discussed in the following section. Additionally, a classifier using both bigram features (as in the first round of experiments) and a limited number of sentiment-relevant tokens was tested, and achieved very high accuracy. Results for the experiments using bigrams and sentiment-relevance can be found in Table 4.

5.1 Analysis

The token ranking produced by the relevance metric largely agrees with human intuitions about sentiment relevance and polarity, although there are a few highly-ranked tokens that a hand-picked list would probably not include.

X      N       Accuracy
0      ∞       88.1%
50     3000    88.5%
50     3500    88.5%
50     4000    88.5%
100    500     86.2%
100    1500    87.9%
100    2500    88.2%
100    3000    88.5%
100    3500    88.2%
100    4000    88.1%
100    4500    88.2%
100    6000    88.2%

Table 2: Accuracy for Varying X and N in 10-fold Cross-Validation of the Sentiment-Relevance Condition.

X      N       Accuracy
0      ∞       86.1%
50     3000    84.3%

Table 3: Accuracy on Unseen Data.

X      N        Accuracy
50     3000     89.2%
50     4000     89.4%
50     8000     89.5%
50     12000    89.6%
50     16000    89.4%
100    8000     89.0%
100    12000    89.0%

Table 4: Accuracy for Varying X and N in 10-fold Cross-Validation, Sentiment-Relevance and Bigrams Condition.

An informal look at the token ranking thus indicates that the relevance metric is both identifying tokens that are intuitively sentiment-relevant, and finding novel and somewhat counterintuitive correlations. These are both desirable properties, so at first glance the metric seems to be performing well. A list of the highest-ranked tokens can be found in Appendix A.

It seems that most of the highest-ranked tokens are negative in sentiment. Understanding this effect may be of use in building future classifiers. It is possible that positive reviews are more creative and subtle in their word choice than negative reviews; negative reviews tend to reuse the same set of words and phrases.

If negative reviews do indeed tend to be less creative, it might be possible to build a classifier using bigrams that assigns a slight positive bias to any bigrams not observed in the training data. Thus, reviews containing many unusual bigrams would be more likely to be classified as positive.

Some tokens that are ranked as informative do not appear from their meanings to be sentiment-relevant; nevertheless these tokens are in fact good indicators of sentiment. This points to the usefulness of the sentiment-relevance metric over a semantic metric like WordNet synonymy distance, used by (Mullen and Collier 2004). An example of this is the token "sees", which appears 329 times in positive reviews, and only 166 times in negative reviews. This is doubly surprising because "sees" appears mostly in plot summaries. Apparently, reviews in which the reviewer describes what a character sees are significantly more likely to be positive than negative. Generalizing from this, we can hypothesize that plot summary contains other useful clues to sentiment that have been overlooked. This effect might also help explain why removing objective sentences from the reviews lowered classifier accuracy.

Many of the most useful features discovered using the sentiment-relevance metric are highly domain-specific, so a classifier trained with this method would definitely not generalize well to other domains like product reviews or evaluation of sentiment in dialogue. In particular, sentiment-relevant features that appear in plot summaries (like the words "sees" and "subtitles" discussed above) are specific to the domain of movies. Additionally, the sentiment-relevance metric chooses some unigrams that will not even generalize well to unseen movie review data, primarily the titles of movies and the names of actors. For example, the names "Victoria" and "Seagal" were among the top 25 tokens for X = 50. Removing named entities from the corpus before performing the token ranking might be a useful way to mitigate overfitting.

The problem of overfitting may also be partially addressed simply by changing the parameters X and N. The basic cause of overfitting is that tokens that are common and sentiment-correlated in the training data are either uncommon or not sentiment-correlated in the test data.

Increasing N, that is, making more tokens available to the classifier, would improve accuracy on unseen data because more tokens would be included that are common in English in general. These tokens are more likely to have the same sentiment distribution across different datasets in the same domain. Increasing X would have a similar effect by causing frequent tokens to be ranked higher. It might also be the case that using a different classifier model could improve results, since Naive Bayes and SVM classifiers seem to be less prone to overfitting than MaxEnt.

The tests using bigrams achieved the highest accuracy of any of my classifiers. Intuitively, limiting the number of tokens used for classification will work well with bigram features, since the main problem with using bigrams is data sparsity and the fact that the vast majority of bigrams are uninformative for sentiment. However, it is likely that these classifiers are also overfitting the training data, perhaps even more so than the classifiers using unigrams. There do appear to be bigrams whose sentiment is opposite to that of their constituent unigrams, which argues for the usefulness of bigram features. For example, "be funny" is a strongly negative bigram, appearing 192 times in negative reviews and 30 times in positive reviews.

6 Related Work

To my knowledge, there has been no research on directly evaluating sentiment-relevance from the distribution of tokens in text. However, there has been much research into the related problems of determining the probable sentiment-relevance of words based on context (for example, by classifying sentences as subjective or objective, as in Pang and Lee (2004)), and determining the sentiment polarity of individual words independent of their distribution in the data being classified (Mullen and Collier 2004). The present work deals with evaluating sentiment-relevance both in context and out of context, and so draws on both of these previous approaches.

The problem of determining the context-free sentiment polarity of a feature has been approached from several angles. The WordNet synonymy graph has been used (Mullen and Collier 2004) to measure the similarity between a word in the data being analyzed and the words "good" and "bad".

This gives some indication of context-independent sentiment polarity. Turney (2002) uses web search hit counts to estimate the polarity of phrases; this approach has the advantage that it is not limited to unigrams, and will discover sentiment-relevant phrases that are not semantically similar to "good" or "bad". These approaches both underperform my classifier (Mullen and Collier achieved 86% accuracy using an SVM, and Turney achieved 74% with an unsupervised classifier), but are likely to generalize better to different datasets and domains since the information about sentiment-relevance is not extracted from the training data.

Previous work on removing sentiment-irrelevant words and sentences from review text includes Pang and Lee (2004). Pang and Lee achieved significant accuracy improvements by classifying each sentence of a review as subjective or objective and performing sentiment classification on the subjective sentences. Additionally, they used the hypothesis that objective sentences are likely to be clustered together in a review to increase classifier accuracy. The subjectivity corpus I used in the first round of experiments is from Pang and Lee (2004). Many of the features I tested in the first round of experiments are also based on the earlier research of Pang and Lee (2002). They experimented with negation tagging and bigram features, also with limited success. My hypothesis was that a larger corpus would make these features more useful, but this turned out not to be the case.

Previous work using the Large Movie Review Dataset that I used for this research includes Maas, et al. (2011). Maas, et al. combine supervised and unsupervised approaches to determining the sentiment of words, achieving 88.89% classifier accuracy on the movie review corpus. Maas, et al. also show that their classification technique generalizes well to problems besides sentiment classification; it achieves 90% accuracy on the Pang and Lee (2004) subjectivity corpus.

7 Future Research

The present work leaves open many avenues for future research. In particular, the performance of my classifier on unseen data might be improved simply by changing the parameters X and N of the sentiment-relevance metric.

Combining sentiment-relevance with bigram features or negation tagging might also yield good results, since limiting the number of tokens would solve the data sparsity problems inherent in using bigrams and tagging. Preliminary investigations indicate that certain bigrams and negation-tagged words are among the most informative tokens.

Additionally, using other classifier models might improve results. My metric for sentiment-relevance is probably most appropriate for use with Naive Bayes, not MaxEnt. The reason is that MaxEnt does not assume that features are statistically independent, so it may find some useful correlation between features that are not informative individually. As a concrete example, consider the case where a reviewer concludes a review with a numeric score like "3 out of 4". Ratings of this type are frequent enough in the corpus that MaxEnt might be able to use the co-occurrence of the tokens "3" and "4" as an indication that a review is positive. However, the tokens "3" and "4" are not informative by themselves, since different reviewers use different rating systems: while "3 out of 4" is indicative of a good review, "3 out of 10" and "4 out of 10" express negative sentiment. Thus, my sentiment-relevance metric will rank "3" and "4" low, even though MaxEnt might find these tokens quite useful. Devising a relevance metric that takes advantage of this strength of MaxEnt might be a valuable topic for future research. However, due to the fact that MaxEnt tends to overfit the training data somewhat, it might be equally useful to simply apply the present relevance metric to a Naive Bayes classifier. Although Naive Bayes performs worse than MaxEnt in 10-fold cross-validation regardless of the feature set used, it might in some circumstances perform better on unseen data than MaxEnt would. Preliminary experiments using the relevance metric with Naive Bayes showed significant performance improvements above a baseline of using all the unigrams.

The relevance metric I presented may be equally applicable to other binary classification problems besides sentiment analysis. Similar metrics could be applied to n-class problems. Filtering tokens by relevance might cause less overfitting for other classification problems, especially those with a smaller set of commonly-used, highly-relevant tokens.

8 Conclusion

I presented a metric for determining the sentiment-relevance of a token based on frequency counts in positive and negative training reviews. The metric was effective in that it identified tokens that are intuitively sentiment-relevant and improved classifier performance. The ranking of sentiment-relevant tokens also led to the discovery of useful tokens that appear frequently in plot summaries, arguing against the elimination of summaries from reviews during sentiment classification. I also showed that negation tagging and bigram features do little to improve the accuracy of MaxEnt classifiers even for large corpora. However, the effectiveness of negation tagging and bigrams can be improved if only a limited number of sentiment-relevant tokens are used for classification.

References
Maas, Andrew L., Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. 2011. Learning Word Vectors for Sentiment Analysis. In The 49th Annual Meeting of the Association for Computational Linguistics (ACL 2011).

McCallum, Andrew Kachites. 2002. MALLET: A Machine Learning for Language Toolkit. http://mallet.cs.umass.edu/

Mullen, Tony and Nigel Collier. 2004. Sentiment Analysis Using Support Vector Machines with Diverse Information Sources. In Proc. of EMNLP, pages 412-418.

Nasukawa, Tetsuya and Jeonghee Yi. 2003. Capturing Favorability Using Natural Language Processing. In K-CAP.

Pang, Bo, Lillian Lee, and Shivakumar Vaithyanathan. 2002. Thumbs up? Sentiment Classification Using Machine Learning Techniques. In Proc. of EMNLP, pages 79-86.

Pang, Bo and Lillian Lee. 2004. A Sentimental Education: Sentiment Analysis Using Subjectivity Summarization Based on Minimum Cuts. In Proc. of the ACL, pages 271-278.

Turney, Peter D. 2002. Thumbs Up or Thumbs Down? Semantic Orientation Applied to Unsupervised Classification of Reviews. In Proc. of the ACL, pages 417-424.

Wilson, Theresa, Janyce Wiebe, and Paul Hoffmann. 2005. Recognizing Contextual Polarity in Phrase-Level Sentiment Analysis. In Proc. of HLT/EMNLP, pages 347-354.

Appendix A
Following is a list of the most sentiment-relevant features in the training corpus, using the parameter X = 50. Bigrams are included; note that their presence does not change the ranking of unigrams relative to each other.
Token                  Pos     Neg    R_T (X = 50)
of the worst            21     547    0.851
waste of                29     645    0.851
waste                   92    1330    0.841
Avoid                    2     251    0.822
waste your              11     331    0.816
worst movie             10     305    0.808
the worst              165    1741    0.806
worst                  238    2276    0.795
awful                  154    1412    0.778
worst movies             4     198    0.770
pointless               39     448    0.762
7/10                   201       6    0.759
worst film               4     185    0.757
poorly                  64     601    0.751
a waste                 26     331    0.749
8/10                   218      10    0.748
4/10                     4     174    0.746
3/10                     5     172    0.736
so bad                  70     598    0.735
unfunny                 16     243    0.735
well worth             183       7    0.733
how bad                 18     254    0.733
10/10                  252      19    0.726
laughable               40     375    0.720
this crap                2     139    0.717
redeeming               28     296    0.717
is awful                 6     159    0.712
2/10                     1     127    0.708
Boll                     1     124    0.703
lame                    85     602    0.701
Victoria               183      12    0.698
is the worst            11     177    0.697
Don't waste              3     132    0.697
1/10                     8     160    0.697
your money              10     171    0.697
your time               57     434    0.697
not funny               14     193    0.696
Seagal                   3     131    0.696
of the best            788     121    0.696
horrible               152     955    0.694
bad acting              27     262    0.693
this piece              21     227    0.691
Highly                 142       6    0.687
of crap                 18     205    0.685
crap                   137     836    0.683
a must                 202      18    0.681
MST3K                    4     128    0.681
worse than              32     274    0.680
Highly recommended     110       1    0.677
even worse              18     198    0.677
bad movie               39     305    0.675
save this                2     114    0.675
worse                  209    1176    0.674
Edie                   103       0    0.673
blah                    13     167    0.670
is terrible             14     172    0.669
not worth               14     170    0.667
is perfect             180      16    0.667
this mess                3     115    0.667
wasted                  76     478    0.666
terrible               240    1290    0.665
9/10                   148      10    0.663
wonderfully            276      36    0.663
lousy                   19     192    0.663
Avoid this               0      98    0.662
Excellent              125       6    0.657
all costs               11     149    0.657
atrocious               16     173    0.657
bad movies              13     158    0.656
worst films              1     100    0.656
pathetic                61     386    0.654
stupid                 261    1340    0.654
Paulie                  98       1    0.651
incoherent               7     126    0.650
Matthau                130       8    0.649
Uwe                      1      97    0.649
really bad              36     260    0.647
highly recommend       227      29    0.647
is excellent           273      39    0.646
was terrible            10     136    0.643
beautifully            346      56    0.642
superb                 549     101    0.640
badly                   96     523    0.638
mess                   100     535    0.635
turkey                  10     131    0.634
Felix                  104       4    0.633
wonderful             1276     270    0.630
remotely                19     169    0.630
sucks                   32     226    0.630
money on                15     151    0.630
stinker                  3      98    0.629
BAD                     11     133    0.629
captures               190      24    0.629
unwatchable              4     102    0.628
absolutely no           27     200    0.625
only good               12     135    0.624
garbage                 68     375    0.623
don't waste              7     112    0.621
flawless               116       8    0.621
is bad                  34     227    0.621
was awful                5     102    0.618
loved this             140      14    0.618
excuse for              24     180    0.614
no sense                41     251    0.614
fantastic              614     128    0.614
is amazing             162      20    0.612
dull                   136     640    0.610
is a must              111       8    0.609
Worst                    9     115    0.609
no plot                 15     139    0.608
insult                  26     184    0.608
than this               81     409    0.607
touching               355      68    0.607
Gundam                  85       2    0.606
Mildred                101       6    0.605
at best                 42     247    0.605
great job              186      27    0.605
sit through             45     258    0.603
ridiculous             171     767    0.603
boring                 327    1397    0.603
Rob Roy                 80       1    0.603
loved it               221      36    0.603
Antwone                 75       0    0.600
Din                     75       0    0.600
is wonderful           123      12    0.600
delightful             227      38    0.600
amateurish              27     183    0.600
refreshing             178      26    0.598
Powell                 166      23    0.598
insult to               11     118    0.598
must see               237      41    0.598
Uwe Boll                 1      78    0.597
this garbage             0      74    0.597
was bad                 20     153    0.596
amazing                991     232    0.596
wasting                 15     133    0.596
perfection             125      13    0.596
be funny                30     192    0.596
drivel                  10     113    0.595
uninteresting           25     172    0.595
wooden                  48     262    0.594
Save                     5      92    0.592
don't even              34     204    0.590
very bad                24     165    0.590
Gandhi                  99       7    0.590
bad it                  23     161    0.590
attempt at              38     219    0.590
excellent             1547     381    0.589
excellent as            94       6    0.587
Bourne                 155      22    0.586
bad                   1786    6907    0.586
definitely worth        86       4    0.586
just bad                 3      82    0.585
my time                 25     166    0.585
is superb              139      18    0.585
Stewart                332      69    0.583
an insult                9     104    0.583
The worst                8     100    0.582
pile of                 17     134    0.582
a wonderful            357      76    0.582
my favorite            480     109    0.581
tedious                 30     182    0.580
unconvincing            24     159    0.579
my money                11     110    0.579
is a great             376      82    0.579
of garbage               6      91    0.578
horrid                   9     102    0.578
is horrible             10     105    0.576
underrated             194      34    0.576
avoid this               9     101    0.575
noir                   275      56    0.575
excuse                  72     334    0.575
not even               143     592    0.572
no reason               36     199    0.572
Welles                 210      39    0.572
poorly written           7      92    0.570
Polanski                81       4    0.570
terrific               340      75    0.570
dreadful                35     193    0.568
isn't even              13     113    0.568
Astaire                102      10    0.568
superbly               105      11    0.566
love this              271      57    0.566
dreck                    3      76    0.566
is brilliant           112      13    0.566
was the worst            5      83    0.565
money back               3      75    0.562
an excellent           421     100    0.562
embarrassing            35     189    0.562
friendship             228      46    0.562
breathtaking           142      22    0.561
annoying               200     772    0.560
Vance                   74       3    0.559
highly recommended      81       5    0.559
this thing              22     141    0.559
to waste                16     119    0.557
pile                    30     168    0.556
stupid and              17     122    0.556
Lincoln                129      19    0.556
gem                    286      64    0.555
idiotic                 18     125    0.554
