INFORMATION RETRIEVAL
Submitted by:
NAME: KEERTHANA E
YEAR: II CSE
TWITTER SENTIMENT ANALYSIS:
TABLE OF CONTENTS
1. ABSTRACT
2. INTRODUCTION
3. SOFTWARE REQUIREMENTS
4. HARDWARE REQUIREMENTS
5. PROGRAM
6. SCREENSHOTS
7. CONCLUSION
ABSTRACT:
INTRODUCTION:
MOTIVATION:
We have chosen to work with Twitter because we feel it is a better approximation of public
sentiment than conventional internet articles and web blogs. The reason is that the amount of
relevant data is much larger for Twitter than for traditional blogging sites. Moreover, the
response on Twitter is more prompt and also more general (since the number of users who tweet
is substantially larger than the number who write web blogs on a daily basis). Sentiment analysis
of the public is highly valuable for macro-scale socioeconomic phenomena, such as predicting the
stock-market performance of a particular firm.
This could be done by analysing overall public sentiment towards that firm over time and
using economic tools to find the correlation between public sentiment and the firm's stock-market
value. Firms can also estimate how well their product is being received in the market: in which
areas of the market it is meeting a favourable response and in which a negative one (Twitter
allows us to download streams of geo-tagged tweets for particular locations). If firms can get this
information, they can analyze the reasons behind the geographically differentiated response and
market their product in a more optimized manner by looking for appropriate solutions, such as
creating suitable market segments. Predicting the results of popular political elections and polls
is also an emerging application of sentiment analysis. One such study was conducted by
Tumasjan et al. in Germany for predicting the outcome of the federal elections; it concluded that
Twitter is a good reflection of offline sentiment.
DOMAIN INTRODUCTION:
Analyzing the sentiment of tweets falls under the domains of “Pattern Classification”
and “Data Mining”. Both of these terms are closely related and intertwined, and they can be
formally defined as the process of discovering “useful” patterns in large sets of data, either
automatically (unsupervised) or semi-automatically (supervised). The project would rely heavily
on techniques of “Natural Language Processing” in extracting significant patterns and features
from the large data set of tweets and on “Machine Learning” techniques for accurately
classifying individual unlabelled data samples (tweets) according to whichever pattern model
best describes them. The features that can be used for modeling patterns and classification can be
divided into two main groups: formal language based and informal blogging based.
Language based features are those that deal with formal linguistics and include prior
sentiment polarity of individual words and phrases, and parts of speech tagging of the sentence.
Prior sentiment polarity means that some words and phrases have a natural innate tendency for
expressing particular sentiments in general. For example, the word “excellent” has a
strong positive connotation while the word “evil” possesses a strong negative connotation. So
whenever a word with positive connotation is used in a sentence, chances are that the entire
sentence would be expressing a positive sentiment. Parts of Speech tagging, on the other hand, is
a syntactical approach to the problem. It means to automatically identify which part of speech
each individual word of a sentence belongs to: noun, pronoun, adverb, adjective, verb,
interjection, etc. Patterns can be extracted from analyzing the frequency distribution of these
parts of speech (either individually or collectively with some other part of speech) in a particular
class of labeled tweets.
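The prior-polarity idea described above can be sketched with a toy lexicon scorer. The word list and scores below are illustrative only, not a real sentiment lexicon; in practice a resource such as SentiWordNet supplies the polarities.

```python
# Toy prior-polarity scorer: each word carries an innate sentiment score,
# and the sentence score is the sum of the scores of its words.
POLARITY = {"excellent": +1, "good": +1, "evil": -1, "terrible": -1}

def sentence_polarity(words):
    # Words absent from the lexicon contribute 0 (treated as neutral).
    return sum(POLARITY.get(w.lower(), 0) for w in words)

sentence_polarity("The plot was excellent".split())   # → 1
sentence_polarity("An evil , terrible film".split())  # → -2
```

A positive total suggests the sentence leans positive; real systems refine this with negation handling and part-of-speech information.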
SOFTWARE REQUIREMENTS:
Python
Web Browser: Microsoft Internet Explorer, Mozilla Firefox, or Google Chrome
MySQL Server (back-end)
Operating System: Windows XP / Windows Vista / Windows 7
HARDWARE REQUIREMENTS:
Memory: 1 GB RAM or more
Hard disk: minimum 3 GB of free space for future database usage
PROGRAM:
import twitter

# Authenticate with the python-twitter library. The credential strings below
# are placeholders -- substitute your own application keys.
twitter_api = twitter.Api(consumer_key="CONSUMER_KEY",
                          consumer_secret="CONSUMER_SECRET",
                          access_token_key="ACCESS_TOKEN",
                          access_token_secret="ACCESS_TOKEN_SECRET")

# test authentication
print(twitter_api.VerifyCredentials())
# ------------------------------------------------------------------------
def buildTestSet(search_keyword):
    try:
        tweets_fetched = twitter_api.GetSearch(search_keyword, count=100)
        print("Fetched " + str(len(tweets_fetched)) + " tweets for the term " + search_keyword)
        # keep only the text; labels are unknown for the test set
        return [{"text": status.text, "label": None} for status in tweets_fetched]
    except Exception:
        print("Something went wrong while fetching tweets.")
        return None
# ------------------------------------------------------------------------
# fetch a test set for a user-supplied keyword and inspect the first few tweets
search_term = input("Enter a search keyword: ")
testDataSet = buildTestSet(search_term)
print(testDataSet[0:4])
# ------------------------------------------------------------------------
corpus = []
rate_limit = 180      # Twitter's search API allows 180 requests...
sleep_time = 900/180  # ...per 15-minute (900 s) window, i.e. 5 s between requests
trainingDataSet = []
# ------------------------------------------------------------------------
corpusFile = "YOUR_FILE_PATH/corpus.csv"
tweetDataFile = "YOUR_FILE_PATH/tweetDataFile.csv"
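The listing does not show how the corpus file is read into the training set. A minimal sketch of that step, assuming each row of corpus.csv has the form `tweet_id,label` (the column layout and the `loadCorpus` helper are assumptions, not part of the original program):

```python
import csv
import io

def loadCorpus(csv_text):
    # Parse CSV rows of the assumed form: tweet_id, label
    rows = csv.reader(io.StringIO(csv_text))
    return [{"tweet_id": r[0], "label": r[1]} for r in rows]

sample = "1234,positive\n5678,negative"
loadCorpus(sample)
# [{'tweet_id': '1234', 'label': 'positive'}, {'tweet_id': '5678', 'label': 'negative'}]
```

In the full program, each tweet id would then be fetched through the API (pausing `sleep_time` seconds between requests to respect the rate limit) and saved to tweetDataFile.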
# ------------------------------------------------------------------------
import re
from nltk.tokenize import word_tokenize
from string import punctuation
from nltk.corpus import stopwords
class PreProcessTweets:
    def __init__(self):
        self._stopwords = set(stopwords.words('english') + list(punctuation) + ['AT_USER','URL'])

    def processTweets(self, list_of_tweets):
        # clean each tweet's text and keep its label alongside
        return [(self._processTweet(tweet["text"]), tweet["label"]) for tweet in list_of_tweets]

    def _processTweet(self, tweet):
        tweet = tweet.lower()
        tweet = re.sub(r'((www\.[^\s]+)|(https?://[^\s]+))', 'URL', tweet)  # replace links
        tweet = re.sub(r'@[^\s]+', 'AT_USER', tweet)                        # replace @mentions
        tweet = re.sub(r'#([^\s]+)', r'\1', tweet)                          # strip '#' from hashtags
        return [word for word in word_tokenize(tweet) if word not in self._stopwords]

tweetProcessor = PreProcessTweets()
preprocessedTrainingSet = tweetProcessor.processTweets(trainingData)
preprocessedTestSet = tweetProcessor.processTweets(testDataSet)
# ------------------------------------------------------------------------
import nltk

def buildVocabulary(preprocessedTrainingData):
    all_words = []
    for (words, sentiment) in preprocessedTrainingData:
        all_words.extend(words)          # collect every word in the training corpus
    wordlist = nltk.FreqDist(all_words)  # frequency distribution over the vocabulary
    word_features = wordlist.keys()
    return word_features
# ------------------------------------------------------------------------
def extract_features(tweet):
    tweet_words = set(tweet)
    features = {}
    for word in word_features:
        features['contains(%s)' % word] = (word in tweet_words)
    return features
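For illustration, the feature extractor turns a tweet into boolean "word present" indicators over the vocabulary. A standalone toy run (the three-word vocabulary below is made up, not the one built from the real corpus):

```python
# Toy vocabulary standing in for the one returned by buildVocabulary.
word_features = ["good", "bad", "excellent"]

def extract_features(tweet):
    # One boolean feature per vocabulary word: is it present in the tweet?
    tweet_words = set(tweet)
    return {'contains(%s)' % w: (w in tweet_words) for w in word_features}

print(extract_features(["the", "movie", "was", "excellent"]))
# {'contains(good)': False, 'contains(bad)': False, 'contains(excellent)': True}
```

These presence/absence dictionaries are exactly the input format NLTK's Naive Bayes classifier expects.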
# ------------------------------------------------------------------------
# ------------------------------------------------------------------------
# Build the vocabulary and the feature vectors, then train the classifier.
word_features = buildVocabulary(preprocessedTrainingSet)
trainingFeatures = nltk.classify.apply_features(extract_features, preprocessedTrainingSet)
NBayesClassifier = nltk.NaiveBayesClassifier.train(trainingFeatures)
# ------------------------------------------------------------------------
# Classify every test tweet and report a simple majority vote
# (assumes the corpus labels tweets as 'positive' / 'negative').
NBResultLabels = [NBayesClassifier.classify(extract_features(tweet[0])) for tweet in preprocessedTestSet]
if NBResultLabels.count('positive') > NBResultLabels.count('negative'):
    print("Overall Positive Sentiment")
else:
    print("Overall Negative Sentiment")
SCREENSHOTS:
CONCLUSION: