
ANAND INSTITUTE OF HIGHER TECHNOLOGY

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

ACADEMIC YEAR: 2018-19

MINI PROJECT REPORT

INFORMATION RETRIEVAL

TWITTER TWEETS CLASSIFIER

Supervised by: Marks: 20/25

Mr.D.ANAND JOSEPH DANIEL AP/CSE

Submitted by:

NAME: KEERTHANA E

YEAR : II CSE

TABLE OF CONTENTS

1. ABSTRACT

2. INTRODUCTION

3. SOFTWARE REQUIREMENTS

4. HARDWARE REQUIREMENTS

5. PROGRAM

6. SCREENSHOTS

7. CONCLUSION
ABSTRACT:

This project addresses the problem of sentiment analysis in Twitter, that is, classifying tweets
according to the sentiment expressed in them: positive, negative or neutral. Twitter is an online
micro-blogging and social-networking platform which allows users to write short status updates
of maximum length 140 characters. It is a rapidly expanding service with over 200 million
registered users, of which 100 million are active users and half of them log on to Twitter on a
daily basis, generating nearly 250 million tweets per day. Given this large amount of usage, we
hope to obtain a reflection of public sentiment by analysing the sentiments expressed in the
tweets. Analysing public sentiment is important for many applications, such as firms trying to
find out the response to their products in the market, predicting political elections, and predicting
socioeconomic phenomena like stock-market movement. The aim of this project is to develop a
functional classifier for accurate and automatic sentiment classification of an unknown tweet
stream.

INTRODUCTION:

MOTIVATION:
We have chosen to work with Twitter since we feel it is a better approximation of public
sentiment than conventional internet articles and web blogs. The reason is that the amount of
relevant data is much larger for Twitter than for traditional blogging sites. Moreover, the
response on Twitter is more prompt and also more general (since the number of users who tweet
is substantially greater than the number who write web blogs on a daily basis). Sentiment
analysis of the public is highly valuable in macro-scale socioeconomic phenomena such as
predicting the stock-market performance of a particular firm.
This could be done by analysing overall public sentiment towards that firm over time and
using economic tools to find the correlation between public sentiment and the firm's stock-market
value. Firms can also estimate how well their product is being received in the market, and in
which areas of the market it is getting a favourable response and in which a negative one (since
Twitter allows us to download streams of geo-tagged tweets for particular locations). With this
information, firms can analyse the reasons behind geographically differentiated responses and
market their product in a more optimized manner, for example by creating suitable market
segments. Predicting the results of popular political elections and polls is also an emerging
application of sentiment analysis. One such study was conducted by Tumasjan et al. in Germany
for predicting the outcome of federal elections; it concluded that Twitter is a good reflection of
offline sentiment.
DOMAIN INTRODUCTION:
Analyzing the sentiment of tweets falls under the domains of “Pattern Classification”
and “Data Mining”. Both of these terms are very closely related and intertwined, and they can be
formally defined as the process of discovering “useful” patterns in large sets of data, either
automatically (unsupervised) or semi-automatically (supervised). The project would heavily rely
on techniques of “Natural Language Processing” in extracting significant patterns and features
from the large data set of tweets and on “Machine Learning” techniques for accurately
classifying individual unlabelled data samples (tweets) according to whichever pattern model
best describes them. The features that can be used for modeling patterns and classification can be
divided into two main groups: formal language based and informal blogging based.
Language based features are those that deal with formal linguistics and include prior
sentiment polarity of individual words and phrases, and parts of speech tagging of the sentence.
Prior sentiment polarity means that some words and phrases have a natural innate tendency for
expressing particular and specific sentiments in general. For example the word “excellent” has a
strong positive connotation while the word “evil” possesses a strong negative connotation. So
whenever a word with positive connotation is used in a sentence, chances are that the entire
sentence would be expressing a positive sentiment. Parts of Speech tagging, on the other hand, is
a syntactical approach to the problem. It means to automatically identify which part of speech
each individual word of a sentence belongs to: noun, pronoun, adverb, adjective, verb,
interjection, etc. Patterns can be extracted from analyzing the frequency distribution of these
parts of speech (either individually or collectively with some other part of speech) in a particular
class of labeled tweets.
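Prior sentiment polarity can be illustrated with a minimal lexicon-based scorer. The tiny lexicon below is hypothetical and for illustration only; a real system would use a full sentiment lexicon:

```python
# Minimal sketch of prior-polarity scoring with a toy lexicon (hypothetical words/weights)
POLARITY = {"excellent": 1.0, "good": 0.5, "evil": -1.0, "bad": -0.5}

def prior_polarity_score(words):
    """Sum the prior polarities of the words; the sign suggests the overall sentiment."""
    return sum(POLARITY.get(w.lower(), 0.0) for w in words)

print(prior_polarity_score(["an", "excellent", "movie"]))  # 1.0 (positive)
print(prior_polarity_score(["an", "evil", "plan"]))        # -1.0 (negative)
```

Here a sentence containing “excellent” scores positive overall, matching the intuition described above.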

SOFTWARE REQUIREMENTS:

 Python 3
 Web browser: Microsoft Internet Explorer, Mozilla Firefox, or Google Chrome
 MySQL Server (back-end)
 Operating System: Windows XP / Windows Vista / Windows 7

HARDWARE REQUIREMENTS:

Processor: Intel Core processor or better

Memory: 1 GB RAM or more

Hard disk: minimum 3 GB free space for future database usage

PROGRAM:

import twitter

# initialize api instance


twitter_api = twitter.Api(consumer_key='YOUR_CONSUMER_KEY',
                          consumer_secret='YOUR_CONSUMER_SECRET',
                          access_token_key='YOUR_ACCESS_TOKEN_KEY',
                          access_token_secret='YOUR_ACCESS_TOKEN_SECRET')

# test authentication
print(twitter_api.VerifyCredentials())

# ------------------------------------------------------------------------

def buildTestSet(search_keyword):
    try:
        tweets_fetched = twitter_api.GetSearch(search_keyword, count=100)
        print("Fetched " + str(len(tweets_fetched)) + " tweets for the term " + search_keyword)
        return [{"text": status.text, "label": None} for status in tweets_fetched]
    except Exception as e:
        print("Unfortunately, something went wrong: " + str(e))
        return None

# ------------------------------------------------------------------------

search_term = input("Enter a search keyword: ")

testDataSet = buildTestSet(search_term)

if testDataSet:
    print(testDataSet[0:4])

# ------------------------------------------------------------------------

def buildTrainingSet(corpusFile, tweetDataFile):
    import csv
    import time

    corpus = []

    with open(corpusFile, 'r', newline='') as csvfile:
        lineReader = csv.reader(csvfile, delimiter=',', quotechar='"')
        for row in lineReader:
            corpus.append({"tweet_id": row[2], "label": row[1], "topic": row[0]})

    # Twitter allows 180 requests per 15-minute window, so wait 5 seconds between requests
    rate_limit = 180
    sleep_time = 900 / rate_limit

    trainingDataSet = []

    for tweet in corpus:
        try:
            status = twitter_api.GetStatus(tweet["tweet_id"])
            print("Tweet fetched: " + status.text)
            tweet["text"] = status.text
            trainingDataSet.append(tweet)
            time.sleep(sleep_time)
        except Exception:
            continue

    # Now we write the fetched tweets to the empty CSV file
    with open(tweetDataFile, 'w', newline='') as csvfile:
        linewriter = csv.writer(csvfile, delimiter=',', quotechar='"')
        for tweet in trainingDataSet:
            try:
                linewriter.writerow([tweet["tweet_id"], tweet["text"], tweet["label"], tweet["topic"]])
            except Exception as e:
                print(e)
    return trainingDataSet

# ------------------------------------------------------------------------

corpusFile = "YOUR_FILE_PATH/corpus.csv"
tweetDataFile = "YOUR_FILE_PATH/tweetDataFile.csv"

trainingData = buildTrainingSet(corpusFile, tweetDataFile)

# ------------------------------------------------------------------------

import re
from nltk.tokenize import word_tokenize
from string import punctuation
from nltk.corpus import stopwords

class PreProcessTweets:
    def __init__(self):
        self._stopwords = set(stopwords.words('english') + list(punctuation) + ['AT_USER', 'URL'])

    def processTweets(self, list_of_tweets):
        processedTweets = []
        for tweet in list_of_tweets:
            processedTweets.append((self._processTweet(tweet["text"]), tweet["label"]))
        return processedTweets

    def _processTweet(self, tweet):
        tweet = tweet.lower()  # convert text to lower-case
        tweet = re.sub(r'((www\.[^\s]+)|(https?://[^\s]+))', 'URL', tweet)  # replace URLs with the token URL
        tweet = re.sub(r'@[^\s]+', 'AT_USER', tweet)  # replace usernames with the token AT_USER
        tweet = re.sub(r'#([^\s]+)', r'\1', tweet)  # remove the # in #hashtag
        tweet = word_tokenize(tweet)  # split the tweet into individual word tokens
        return [word for word in tweet if word not in self._stopwords]

tweetProcessor = PreProcessTweets()
preprocessedTrainingSet = tweetProcessor.processTweets(trainingData)
preprocessedTestSet = tweetProcessor.processTweets(testDataSet)
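The regex substitutions above can be checked in isolation. This standalone sketch repeats only the regex normalisation steps (without the NLTK tokenisation), using a made-up example tweet:

```python
import re

def clean_tweet_text(tweet):
    # Repeats only the regex normalisation steps from _processTweet
    tweet = tweet.lower()                                                # lower-case first
    tweet = re.sub(r'((www\.[^\s]+)|(https?://[^\s]+))', 'URL', tweet)   # URLs -> URL
    tweet = re.sub(r'@[^\s]+', 'AT_USER', tweet)                         # @user -> AT_USER
    tweet = re.sub(r'#([^\s]+)', r'\1', tweet)                           # strip # from hashtags
    return tweet

print(clean_tweet_text("Loving #Python! Thanks @alice https://example.com"))
# loving python! thanks AT_USER URL
```

Note that lower-casing happens first, so the placeholder tokens URL and AT_USER stay upper-case and can later be treated as stopwords, as the class above does.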

# ------------------------------------------------------------------------

import nltk

def buildVocabulary(preprocessedTrainingData):
    all_words = []

    for (words, sentiment) in preprocessedTrainingData:
        all_words.extend(words)

    wordlist = nltk.FreqDist(all_words)
    word_features = wordlist.keys()

    return word_features

# ------------------------------------------------------------------------

def extract_features(tweet):
    tweet_words = set(tweet)
    features = {}
    for word in word_features:
        features['contains(%s)' % word] = (word in tweet_words)
    return features
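The bag-of-words feature extraction can be illustrated on toy data. This sketch uses collections.Counter in place of nltk.FreqDist (both count token frequencies and expose the vocabulary via .keys()); the two training tweets are invented examples:

```python
from collections import Counter

# Toy preprocessed training data: (token list, label) pairs
training = [(["great", "phone"], "positive"), (["bad", "battery"], "negative")]

# Build the vocabulary from all training tokens (Counter stands in for nltk.FreqDist)
all_words = [w for words, label in training for w in words]
word_features = list(Counter(all_words).keys())

def extract_features(tweet):
    # One boolean feature per vocabulary word: does the tweet contain it?
    tweet_words = set(tweet)
    return {'contains(%s)' % word: (word in tweet_words) for word in word_features}

print(extract_features(["great", "battery"]))
# {'contains(great)': True, 'contains(phone)': False, 'contains(bad)': False, 'contains(battery)': True}
```

Each tweet thus becomes a fixed-length boolean vector over the training vocabulary, which is exactly the representation the Naive Bayes classifier below consumes.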

# ------------------------------------------------------------------------

# Now we can extract the features and train the classifier


word_features = buildVocabulary(preprocessedTrainingSet)
trainingFeatures=nltk.classify.apply_features(extract_features,preprocessedTrainingSet)

# ------------------------------------------------------------------------

NBayesClassifier=nltk.NaiveBayesClassifier.train(trainingFeatures)

# ------------------------------------------------------------------------

NBResultLabels = [NBayesClassifier.classify(extract_features(tweet[0]))
                  for tweet in preprocessedTestSet]

# ------------------------------------------------------------------------

# get the majority vote


if NBResultLabels.count('positive') > NBResultLabels.count('negative'):
    print("Overall Positive Sentiment")
    print("Positive Sentiment Percentage = " +
          str(100 * NBResultLabels.count('positive') / len(NBResultLabels)) + "%")
else:
    print("Overall Negative Sentiment")
    print("Negative Sentiment Percentage = " +
          str(100 * NBResultLabels.count('negative') / len(NBResultLabels)) + "%")
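The majority vote above can be sketched on a toy list of predicted labels (the list here is made up, standing in for NBResultLabels):

```python
# Hypothetical predicted labels standing in for NBResultLabels
labels = ['positive', 'positive', 'negative', 'positive']

pos = labels.count('positive')
neg = labels.count('negative')
overall = 'Positive' if pos > neg else 'Negative'
pct = 100 * max(pos, neg) / len(labels)

print(overall, str(pct) + "%")  # Positive 75.0%
```

With three positive labels out of four, the overall verdict is positive at 75%.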

SCREENSHOTS:
CONCLUSION:

Sentiment analysis is an interesting way to think about the applicability of Natural
Language Processing in making automated conclusions about text. It is being utilized in social
media trend analysis and, sometimes, for marketing purposes. Writing a sentiment analysis
program in Python is not a difficult task, thanks to modern, ready-to-use libraries. The program
presented here is a simple illustration of how this kind of application works. It is intended for
academic purposes only, as it is by no means production-level.
