We, the students of the B.Sc. in Computer Science and Engineering program at International Islamic
University Chittagong, have submitted this thesis in partial fulfillment of the requirements for the
degree of Bachelor of Science in Computer Science and Engineering. We hereby declare that the report
titled “Real Time Sentiment Analysis and Opinion Mining on Refugee Crisis” was prepared and
completed entirely by us, that it is an original work, and that it has not been submitted elsewhere
for any other purpose.
_________________________
Sarjis Muhammad Abdullah
ID: C141098
Computer Science and Engineering, 38th batch.
International Islamic University Chittagong
_________________________
Didarul Karim Jewel
ID: C141113
Computer Science and Engineering, 38th batch.
International Islamic University Chittagong
_________________________
Abdullah Al Mahfuj
ID: C141093
Computer Science and Engineering, 38th batch.
International Islamic University Chittagong
TABLE OF CONTENTS
ACKNOWLEDGEMENT
1.2 History
1.6 Contributions
2.5 Sentiment Analysis (SA) & Natural Language processing (NLP)
References
LIST OF FIGURES
Figure 1 Methodology Diagram.
Figure 2 Process Diagram.
Figure 3 Data Collection Process Diagram.
Figure 4 Data pre-processing diagram.
Figure 5 Duplicate Tweets Removal diagram.
Figure 6 Symbol removal diagram.
Figure 7 Removal of stop words.
Figure 8 Sentiment classification operator’s diagram.
Figure 9 Sentiment classification diagram.
Figure 10 Visualization of polarity class.
Figure 11 Transformation of Attributes diagram.
Figure 12 Implementation diagram.
Figure 13 Overview of Polarity Class.
Figure 14 Overview of Subjectivity Class.
Figure 15 Subjectivity and Polarity Diagram.
Figure 16 English and Polarity Diagram.
Figure 17 Bangla and Polarity Diagram.
Figure 18 Turkish and Polarity Diagram.
Figure 19 Urdu and Polarity Diagram.
Figure 20 Chinese and Polarity Diagram.
Figure 21 Retweet Visualization Diagram.
Figure 22 Accuracy of predicting Polarity.
Figure 23 Recall, Precision and F-measure of Polarity Class.
LIST OF TABLES
ACKNOWLEDGEMENT
First of all, we are grateful to the almighty Allah, the merciful and the benevolent, who has
enabled us to complete this report.
The successful completion of this thesis and the preparation of this report would not have been
possible without the generous help of our supervisor Md. Mahiuddin, Assistant Professor, CSE, IIUC,
who gave us his valuable time, kind encouragement, guidance, opinions and views. We acknowledge his
contribution and express our heartiest gratitude to him.
Last but not least, we would like to thank our senior brothers Mazharul Islam and Md. Taufeeq Uddin,
ex-students of CSE, IIUC, for their kind instructions.
Finally, we would like to thank all the people who shared their views about our work, provided us
with necessary information and offered criticism. We express our heartiest gratitude to all of them.
ABSTRACT
Twitter is a microblogging social site where millions of people share their thoughts, views and
reactions and comment on particular subjects, as opinion or as fact, to draw the attention of
different groups of people. In this work we investigated public opinions, facts and sentiments on
the Refugee Crisis, which has recently been a widely discussed topic on social media platforms. To
analyze public sentiment on this crisis, we extracted around 35,000 relevant tweets in five
languages (English, Bangla, Turkish, Chinese and Urdu) and used them for sentiment analysis and
decision mining with data mining and data science techniques. On this basis we present a new
approach to real-time sentiment analysis on the Refugee Crisis, intended to provide predictions that
can inform political decisions. Through binomial classification into positive and negative classes,
this paper gives an end-level estimate of how many people comment in support of refugees and how
many comment against them. Using the supervised machine learning algorithms DT, RF, NB and KNN, we
were able to predict polarity and subjectivity. The KNN algorithm gave the best polarity accuracy,
95%, compared with the DT, RF and NB classifiers. The people responsible for refugee affairs can
gain better knowledge from our analysis.
Keywords— refugee crisis, sentiment analysis, data science, machine learning, opinion mining,
prediction.
Chapter 01: Introduction
1.1 Introduction
The Refugee Crisis has recently become one of the biggest issues all over the world. The problem
originated many years ago in countries such as Afghanistan, South Sudan and Somalia. More recently,
people from Myanmar and Syria have become the new victims, the so-called refugees. We are living in
the age of science, and science has given us many gifts. Social media is one of them. We therefore
want to combine science with social data by exploring the inner meaning of social media posts in a
particular knowledge domain, so that the results can be helpful for public service.
In the present world, social media is one of the biggest platforms for expressing people’s opinions,
thoughts, views, intentions, reactions and ideas. Among the many social media platforms, Twitter is
one of the most popular. In addition to its global coverage of issues, Twitter provides a media
platform that makes it easy to share opinions in various content forms including text, images and
links, with a character restriction unlike many other social media platforms. More than 300 million
people use Twitter all over the world. Statistics show that more than 6,000 tweets per second,
350,000 tweets per minute and 500 million tweets per day are posted on Twitter. These statistics
show how important an impact Twitter has all over the world [1].
Twitter sentiment mining can therefore be helpful in different situations, such as analyzing
people’s comments on different events, products, movies, songs etc. Natural language processing can
be helpful in these circumstances. A lot of research has already been done on Sentiment Analysis for
similar problems. We extended that analysis with some new steps for smoother results and better
accuracy. In our proposed system we present a new approach to real-time sentiment analysis on the
current Refugee Crisis, providing predictions of polarity types that can inform political decisions.
The extracted and preprocessed datasets were used with various opinion mining algorithms such as
Decision Tree (DT), Random Forest (RF), Naive Bayes (NB) and K Nearest Neighbor (KNN), implemented
with Python packages and programs.
This paper gives an end-level estimate of how many people comment in support of refugees and how
many comment against them, followed by binomial classification into positive and negative classes.
Using the supervised machine learning algorithms DT, RF, NB and KNN, we were able to predict
polarity. The KNN algorithm gave the best accuracy, 95%. The people responsible for refugee affairs
can gain better knowledge from our analysis.
1.2 History
A refugee crisis involves groups of displaced persons, who may be either internally displaced or
migrated. Internally Displaced Persons are people who were forced to leave their homes but failed to
reach a neighboring country, while Migrated People are those who left their homes and took shelter
in another country.
Internally Displaced Persons do not fulfill the definition of a refugee in the 1951 Refugee
Convention, because they have not left their home country. But because the nature of war has changed
in the last few decades, the number of Internally Displaced Persons has increased. The United
Nations estimated in 2018 that 68.5 million people worldwide had been forcibly displaced. Of them,
25.4 million are refugees and 40 million are internally displaced [2].
85% of refugees took shelter in developed countries like Australia, USA and France. Of them, 57%
come from Syria, Afghanistan and South Sudan. Almost 3.5 million refugees took shelter in Turkey.
Most recently, more than 0.7 million Rohingya took shelter in Bangladesh to avoid religious
oppression by the security forces of Myanmar [3].
Although millions of people have been suffering from this serious problem, we still cannot see any
effective decision or proper solution to this deep-rooted problem.
1.3 Motivation
In recent times, researchers have done a huge amount of work on Sentiment Analysis and Opinion
Mining for different subjects and events. As social media has become a great source of data,
Sentiment Analysis or Opinion Mining can have more impact on community service.
At present, the Refugee Crisis is one of the important matters in the international arena. The
problem evolved many years ago in countries such as Syria, Afghanistan, South Sudan and Somalia.
It has not ended: countries like Myanmar and Venezuela are its new victims. Millions of people are
suffering every day, but there is still no significant solution.
It is a kind of crisis in which people’s opinion is very important. We therefore decided to classify
the thoughts and opinions on this matter through NLP with Machine Learning Algorithms. From that we
can reach a satisfying result that can be helpful for decision making in the political arena.
1.4 Problem Statement
Nowadays people experiment on many kinds of data and gather new information and theories from them.
But analyzing the domain of scientific papers is hard, because several features affect the
evaluation of sentiment reviews.
Researchers have used different analysis protocols to differentiate sentiment classes. The actual
meaning and classification of a sentence is the most important part of decision making. There is
plenty of research on sentiment analysis but not much on the Refugee Crisis, and the existing works
on the Refugee Crisis focus more on comparing sentiment between people of different countries. The
drawback of the previous work is that it did not show any effective result. We therefore tried to
carry out that research more efficiently and precisely by predicting accuracy using MLA (Machine
Learning Algorithm).
1.5 Objectives
The objective of our paper is to understand people’s emotions and opinions about the Refugee Crisis
and to predict polarity. Polarity may be supporting refugees or against refugees; it is decided
from the tweets by extracting different features from them and analyzing those features through
natural language processing with the help of existing machine learning algorithms.
1.6 Contributions
This part sets out our contributions in completing this thesis, which are as follows:
1.7 Summary
In this chapter we described the objectives of our thesis, our contributions, a short statement of
the problem domain, the motivation behind our thesis, and a brief historical background of the
Refugee Crisis.
Chapter 02: Background of Sentiment Analysis
2.1 Introduction
Sentiment Analysis and Text Mining is the process of analyzing, experimenting on and testing the
opinions of people after collecting data in text form. In the following section we discuss it
briefly.
A subjective sentence expresses the person’s own opinion, whereas an objective sentence only shares
the facts of an incident. Subjective sentences are of three types:
• Private situations as references. For example: “He was talking with fear.”
• Expressing private situations as references to speech. For example: “The editors of the New
Age paper attacked the police officer.”
• Expressive subjective entities. For example: “That government is brilliant.”
2.3 Why Sentiment Analysis is Important?
There are billions of online users of Facebook, Twitter, WhatsApp, Google+, Skype, Sina Weibo,
Instagram etc., where they express their reactions, opinions, thoughts, ideas and views about
different events and topics. That is why daily online sentiment has become one of the most important
resources in decision making. A survey conducted by World Research found that online customer
reviews count as much as personal recommendations. According to one study, nearly 95% of shoppers
read online reviews before making a purchase (Spiegel Research Center, 2017); this research also
showed that displaying reviews can increase sales rates by 270%. According to a 2016 study (Fan and
Fuel), 94% of customers read online reviews and 97% of shoppers say reviews influence buying
decisions [5].
Although a dictionary of weak and strong opinion words can be built to suit the needs of an
application, computers still struggle when the strength of an opinion interacts with its position in
the text, which in many cases completely changes the polarity of the document.
2.5 Sentiment Analysis (SA) & Natural Language processing (NLP)
2.5.1 Overview
In the following section we describe some important definitions from sentiment analysis
terminology.
2.5.1.2 Token
Before any real processing can be done on the input text, it needs to be segmented into linguistic
units such as words, punctuation marks, numbers or alphanumeric strings. These units are known as
tokens.
2.5.1.3 Sentence
This refers to an ordered sequence of tokens.
2.5.1.4 Tokenization
Tokenization is defined as the operation of splitting a sentence into tokens. For English, which is
a segmented language, tokenization is relatively easy because words are separated by whitespace.
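As an illustration of the definition above, the following is a minimal tokenization sketch. It assumes a simple regular expression (the helper name `tokenize` and the pattern are our own illustrative choices, not the tool used in the thesis) that treats runs of letters or digits as one token and each punctuation mark as a separate token.

```python
import re

# Illustrative pattern (an assumption): a token is either a run of word
# characters (letters, digits, underscore) or a single punctuation mark.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def tokenize(sentence):
    """Split a sentence into word, number and punctuation tokens."""
    return TOKEN_PATTERN.findall(sentence)


tokens = tokenize("Crisis should be stopped.")
print(tokens)
```

Because whitespace already separates English words, the pattern only needs extra care for punctuation; languages without whitespace segmentation would need a different approach.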
2.5.1.5 Corpus
A corpus is usually a large collection of sentences, documents, blog data or website data, or simply
a body of text.
2.5.1.6 Part-of-speech (POS) Tag
A POS tag is a symbol that assigns a word to one or more classes such as NN (Noun), VB (Verb),
JJ (Adjective) or AT (Article). One of the oldest and most commonly used tag sets is the Brown
Corpus tag set.
2.5.2.2 Parsing
In the parsing task, a parser builds the parse tree for a given sentence. Some parsers assume the
existence of a set of grammar rules, but recent parsers are smart enough to deduce parse trees
directly from the given data using complex statistical models. Most parsers also operate in a
supervised setting and require the sentence to be POS-tagged before it can be parsed. Statistical
parsing is an area of active research in NLP.
2.5.2.4 Subjective Sentence
When a writer or user expresses their own reviews, feelings or sentiments regarding incidents,
entities or events, the resulting sentence is called a subjective sentence. For example: “I like to
give shelter for refugee”.
2.5.2.6 Opinion
An opinion is a belief or judgment based on knowledge of a specific topic. Sometimes an opinion is
explicit, as in “Refugees are facing a dangerous situation in their lives.” Sometimes it is hidden
in the sentiment of a sentence, for example: “The current refugee problem has no solution yet.” The
polarity class is determined by the positivity or negativity of an opinion. To determine the
polarity of a text in detail, determining the polarity of each subjective sentence is one of the
main subtasks of Sentiment Analysis and Opinion Mining.
Object: heroism
Explicit object- feature: show off heroism.
2.5.3 Classifier
A classifier helps us to separate our datasets into categories. When a classifier receives some
data, it decides which category the data belongs to. To classify something, it needs various
features that identify the category.
It is a function that classifies different objects and outputs their labels. In sentiment analysis,
classifiers are used to find the polarity of a subjective sentence in the given data. There are two
types of classification. In supervised classification, a classifier is learned from the training
set, and the classification algorithm can then predict the label (positive or negative) for any
pre-processed input data. In contrast, unsupervised classification infers the hidden structure of
raw data. Both classification types are widely used in sentiment analysis. The main task of
Sentiment Analysis is extracting suitable features and constructing an engineered feature vector as
an input for the classifier.
The Gaussian Naive Bayes classifier is an algorithm often used for text classification in different
domains. It is a simple probabilistic classifier that applies Bayes’ theorem to predict the class of
a new document.

$$P(X \mid \text{polarity}) = \frac{P(X) \times P(\text{polarity} \mid X)}{P(\text{polarity})}$$

$$P(X \mid \text{subjectivity}) = \frac{P(X) \times P(\text{subjectivity} \mid X)}{P(\text{subjectivity})}$$

We used the above two relations in our model. In contrast to other classifiers, a Naive Bayes model
is efficient because it can work with a small training dataset. GaussianNB implements the Gaussian
Naive Bayes algorithm for classification [8] [9].
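To make the idea concrete, here is a minimal pure-Python sketch of a Gaussian Naive Bayes classifier. It is not the GaussianNB implementation referenced above; the function names, the toy feature rows (two confidence-like numbers per tweet) and the labels are made-up assumptions for illustration only. Each class gets a prior plus a per-feature mean and variance, and prediction picks the class with the highest log posterior.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(rows, labels):
    """Estimate a prior and per-feature mean and variance for every class."""
    grouped = defaultdict(list)
    for row, label in zip(rows, labels):
        grouped[label].append(row)
    model = {}
    for label, members in grouped.items():
        columns = list(zip(*members))
        means = [sum(col) / len(col) for col in columns]
        # A small constant keeps every variance strictly positive.
        variances = [
            sum((x - m) ** 2 for x in col) / len(col) + 1e-9
            for col, m in zip(columns, means)
        ]
        model[label] = (len(members) / len(rows), means, variances)
    return model


def predict_gaussian_nb(model, row):
    """Return the class with the highest log posterior for one feature row."""
    def log_posterior(label):
        prior, means, variances = model[label]
        score = math.log(prior)
        for x, m, v in zip(row, means, variances):
            score += -0.5 * math.log(2 * math.pi * v) - (x - m) ** 2 / (2 * v)
        return score
    return max(model, key=log_posterior)


# Made-up toy data: [polarity confidence, subjectivity confidence] per tweet.
rows = [[0.9, 0.8], [0.8, 0.7], [0.2, 0.3], [0.1, 0.4]]
labels = ["support", "support", "against", "against"]
model = fit_gaussian_nb(rows, labels)
print(predict_gaussian_nb(model, [0.85, 0.75]))
```

Working in log space avoids numerical underflow when many feature likelihoods are multiplied together.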
$$\text{Information Gain, } I(p, n) = -\frac{p}{p+n}\log_2\frac{p}{p+n} - \frac{n}{p+n}\log_2\frac{n}{p+n}$$

$$\text{Entropy, } E(a) = \sum_{i=1}^{v} \frac{p_i + n_i}{p + n} \times I(p_i, n_i)$$
Decision Tree Classifier poses a series of carefully crafted questions about the attributes of the test
record. Each time it receives an answer, a follow-up question is asked until a conclusion about the
class label of the record is reached. In the decision tree, the root and child nodes contain attribute
test conditions to differentiate records that have different structures.
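The quantities I(p, n) and E(a) from the formulas above can be computed directly. The sketch below is an illustration under stated assumptions: the function names are ours, the counts are made-up toy values, and the final difference I(p, n) − E(a) is the score that ID3-style decision trees typically maximise when choosing an attribute test, which may differ from the exact criterion used in the thesis experiments.

```python
import math


def information(p, n):
    """I(p, n) for p positive and n negative records."""
    total = p + n
    result = 0.0
    for count in (p, n):
        if count:  # the term 0 * log2(0) is taken as 0
            fraction = count / total
            result -= fraction * math.log2(fraction)
    return result


def expected_information(partitions):
    """E(a): weighted sum of I(p_i, n_i) over the v partitions of attribute a."""
    p = sum(p_i for p_i, _ in partitions)
    n = sum(n_i for _, n_i in partitions)
    return sum(
        (p_i + n_i) / (p + n) * information(p_i, n_i) for p_i, n_i in partitions
    )


# Made-up counts: an attribute splits 9 positive and 5 negative records
# into v = 3 partitions given as (p_i, n_i) pairs.
partitions = [(2, 3), (4, 0), (3, 2)]
before_split = information(9, 5)
after_split = expected_information(partitions)
print(round(before_split, 3), round(after_split, 3), round(before_split - after_split, 3))
```

A partition that contains only one class contributes zero to E(a), so attributes that separate the classes cleanly give the largest difference.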
2.5.4 Text Mining
Text Mining (TM) is a very important term. It is the process of collecting important and necessary
information from unstructured text. TM identifies facts, relationships and statements that would
otherwise remain buried in the mass of textual big data. These facts, relationships and statements
are extracted and turned into structured data for analysis and visualization.
Typical TM tasks include sentiment analysis, text categorization, text clustering, document
summarization and concept/entity extraction. Text mining is a field of computer science with strong
connections to NLP, DM, ML, information retrieval and knowledge management [14]. TM is much more
effective than traditional keyword search. Traditional keyword search retrieves every document that
contains the specified keywords; the drawback is that we then need to read all those documents to
find out whether they actually contain the information we need. Text mining software is very
different, because it reads and analyzes the documents on your behalf. It works at the word level,
sentence level and document level.
For example, suppose we have the two documents below:
1) Rohingya crisis started from many years before.
2) Crisis should be stopped.
The dictionary constructed from them with the bag-of-words (BOW) model will be:
Dictionary = {(1: “Rohingya”), (2: “crisis”), (3: “started”), (4: “from”), (5: “many”), (6: “years”),
(7: “before”), (8: “should”), (9: “be”), (10: “stopped”)}
Hence, the feature vector of each document has 10 dimensions, one for each entry of the constructed
dictionary. As demonstrated in the sentiment analysis discussion, word frequency is very
informative. In NLP, a single word can sometimes express the author’s opinion more clearly than a
sequence of words. In the sentence below, for example, only the words “like” and “approach” carry
the polarity, yet the sentence as a whole reads as positive.
“I like the way of Myanmar government’s approach”.
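The bag-of-words dictionary and the 10-dimensional feature vectors for the two example documents can be reproduced with a short sketch. The helper names are our own, and lower-casing the words before indexing is an assumption so that “Crisis” and “crisis” share one dictionary entry.

```python
import re


def tokenize(text):
    """Lower-case the text and keep alphabetic words only."""
    return re.findall(r"[a-z]+", text.lower())


def build_dictionary(documents):
    """Map each distinct word to a 1-based index in order of first appearance."""
    dictionary = {}
    for document in documents:
        for word in tokenize(document):
            if word not in dictionary:
                dictionary[word] = len(dictionary) + 1
    return dictionary


def to_vector(document, dictionary):
    """Count how often each dictionary word occurs in the document."""
    vector = [0] * len(dictionary)
    for word in tokenize(document):
        if word in dictionary:
            vector[dictionary[word] - 1] += 1
    return vector


documents = [
    "Rohingya crisis started from many years before.",
    "Crisis should be stopped.",
]
dictionary = build_dictionary(documents)
vectors = [to_vector(document, dictionary) for document in documents]
print(dictionary)
print(vectors)
```

Each vector position holds the frequency of one dictionary word, so the two documents overlap only in the position for “crisis”.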
Chapter 03: Literature Review
In [16], the authors used supervised learning algorithms to find the polarity of student feedback
based on predefined features of teaching and learning. They gathered student feedback data from
survey results at Middle East College, Oman, and explained the implementation step by step using the
analytics tool RapidMiner. The paper also presents a comparative performance study of the SVM, NB,
KNN and NN classifiers, comparing results across features to find the best outcome for each
algorithm. Their results show that KNN achieved the best precision, 100%, while NB achieved the best
recall and accuracy, 97.07% and 99.11% respectively.
In [17], the authors used a text analysis tool to collect tweets in Hindi. They analyzed 42,235
tweets that referenced various political parties in India during the campaigning period of the 2016
elections. They used both supervised and unsupervised techniques, applying Dictionary Based, NB and
SVM algorithms to classify the data as positive, negative or neutral.
In their analysis, NB and SVM both pointed to the BJP, while the Dictionary Approach pointed to the
Indian National Congress. They predicted that the BJP had a 78.4% chance of winning the elections
because of the positive sentiment it received in tweets. When the election was held, the BJP won 60
of 126 constituencies and the NC won only 26. The NB algorithm gave them an accuracy of 62.1%.
In [18], the authors experimented with extracting sentiment from Twitter, a well-known website where
people post their views and opinions. They performed sentiment analysis on tweets related to movie
reviews, using Hadoop tools to process the data available on Twitter in the form of people’s
reviews, feedback and comments. They presented the results of their sentiment analysis in separate
sections for positive, negative and neutral sentiments.
In [19], the authors analyzed public opinions and sentiments towards the Syrian Refugee Crisis. They
collected tweets about the crisis in two languages, Turkish and English. They considered Turkish
tweets because Turkey has sheltered a huge number of refugees and Turkish tweets carry important
information reflecting public perception and views. They carried out a comparative SA of the
retrieved tweets. Their results showed a significant difference between the sentiment of Turkish
tweets and that of English tweets. The paper found more positive sentiment towards Syrians and
refugees in Turkish tweets, whereas the largest number of English tweets were neutral.
In [20], the authors focused on customer feedback on SaaS products, with the goal of predicting the
sentiment of SaaS customer reviews. They proposed five techniques based on five algorithms, SVM, NB,
NB (Kernel), KNN and DT, to predict the attitude of SaaS reviews. In their experiment the SVM
algorithm reached 92.37% accuracy, which showed that it gives better results on the sentiment of
online reviews than the other models. Their work gives important insight into online SaaS reviews
and can help in the design of SaaS review websites.
In [21], the authors note that most recent research on opinion mining uses online user data such as
tweets, reviews, blogs and comments. Using SentiWordNet, they worked on opinion mining for newspaper
headlines. They separated the adverb-adjective combinations present in the statements and analyzed
the part-of-speech tags of the news headlines. They used Python packages to classify words and
SentiWordNet 3.0 to find the polarity (positive or negative) of each word. In this way they
evaluated the impact of a news headline by measuring its total positive and negative polarity.
In [22], the proposed approach builds the system model from heterogeneous features, both machine
learning based and lexicon based, together with classification algorithms such as NB and LSVM. With
these heterogeneous features and the hybrid approach, the authors obtained better sentiment accuracy
than other approaches. These heterogeneous features can be used for building more advanced and more
accurate models using DL.
In their experiment they used 250 training samples and 100 testing samples, obtaining 89% for NB and
76% for SVM. With 300 training samples and 150 testing samples, they obtained 84% for NB and 79% for
SVM.
In [23], the authors worked on a dataset of tweets about 6 major US airlines and performed a
multi-class sentiment analysis. They started with pre-processing techniques to clean the tweets and
then represented the tweets as vectors using a deep learning concept for phrase-level analysis. They
used 7 different classification strategies: DT, RF, SVM, KNN, LR, NB and AdaBoost. They trained on
80% of their data and tested on the remaining 20%. Tweet sentiment was divided into 3 classes
(positive, negative and neutral). From the results they calculated the accuracies to compare the
classification approaches, and they visualized the overall sentiment count across all six airlines.
3.3 Strong and Weak Point
The strong points of the existing work on sentiment analysis are as follows:
• In many experiments, researchers built their own systems to work with Text Mining and
Sentiment Analysis.
• From their experience we learned that big data can be handled using different tools.
• Previous works used classifiers for Fraud Detection, Face Recognition and Election Prediction.
• Researchers pointed to RapidMiner, a data science environment and a suitable gateway for
working with data.
• Existing research can help to find the best approach to marketing and reviewing.
• Some studies worked with 5 classes of polarity.
Although we noticed many advantages in the existing research, we also found some drawbacks in the
existing environments and systems. Here are some of them:
• It is difficult for a machine to find the accurate sentiment of a text every time, so a
prediction may sometimes be wrong.
• Most existing research works with smaller datasets, which can be handled without difficulty.
• When performing Sentiment Analysis, some studies considered only the positivity of the text
and treated the rest of the data as negative.
• The classification results they obtained can be improved.
3.4 Summary
In this chapter we discussed different existing papers on sentiment analysis and tried to give an
idea of each. We also discussed the strong and weak points of the existing work.
Chapter 04: Environment Study
4.1.1 Motivation
We were motivated to work with classification algorithms and the RapidMiner tools because we needed
to handle a large amount of data with the help of classifiers. We could have completed our analysis
with Python and R programming alone, but that would have been a slow process for our experiment. A
single tool that lets us collect data, pre-process it, access a natural language processing platform
and apply classifiers was the better choice for reaching our desired result. Although we evaluated
algorithmic performance with Python programs, we first experimented on our dataset with the
RapidMiner tools.
Instead of relying on manually crafted rules, text classification with machine learning learns to
make classifications based on past observations. By using pre-labeled examples as training data, a
machine learning algorithm can learn the associations between pieces of tweets and the output
expected for a particular input.
The first step towards training a classifier with machine learning is feature extraction: a method
is used to transform each text into a numerical representation in the form of a vector. One of the
most frequently used approaches is bag of words, where each element of the vector represents the
frequency of a word from a predefined dictionary of words.
Then, the machine learning algorithm is fed with training data that consists of pairs of feature
sets and tags to produce a classification model.
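A minimal sketch of this training step, assuming a k-nearest-neighbour classifier written in plain Python: the stored (feature set, tag) pairs act as the model, and a new feature vector receives the majority tag of its k closest training vectors. The dictionary words, counts and tags below are made up for illustration, and `knn_predict` is a hypothetical helper rather than the implementation evaluated in the thesis.

```python
import math
from collections import Counter


def knn_predict(training_pairs, features, k=3):
    """Label `features` by majority vote among its k nearest training vectors."""
    by_distance = sorted(
        training_pairs, key=lambda pair: math.dist(pair[0], features)
    )
    votes = Counter(tag for _, tag in by_distance[:k])
    return votes.most_common(1)[0][0]


# Made-up bag-of-words counts over a hypothetical dictionary
# ["help", "welcome", "ban", "deport"], each paired with its tag.
training_pairs = [
    ([1, 1, 0, 0], "support"),
    ([2, 0, 0, 0], "support"),
    ([0, 1, 0, 0], "support"),
    ([0, 0, 1, 1], "against"),
    ([0, 0, 2, 0], "against"),
    ([0, 0, 0, 1], "against"),
]
print(knn_predict(training_pairs, [1, 0, 0, 0]))
```

KNN has no separate fitting phase, so "producing the model" here simply means keeping the labelled pairs; other classifiers such as NB or DT would instead estimate parameters from the same pairs.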
A lot of data pre-processing is needed in any data science project. A smaller dataset can easily be
handled with free and open source tools. We chose the RapidMiner tools to carry out our data
pre-processing steps, and we found that RM is one of the most reliable working tools for data
scientists.
We collected around 35k rows of Twitter data; extracting tweets while following the Twitter
community authorization guidelines is not an easy task.
4.2.3 Aylien Tap
We were looking for a tool that combines an NLP platform with all the things described in the
previous section. As we were working with the RapidMiner software, we searched its marketplace for a
text analysis extension. Sentiment analysis can be done in various ways, but we might not have
obtained a good result without the AYLIEN extension, because we were working with a large dataset
containing 35k rows of data, so it would not have been wise to work without it. It is not a free
extension. By showing our university identification and educational certificate, we were able to
evaluate one thousand texts per day.
"subjectivity confidence": 0.7
}
4.2.4 Spyder
Spyder is the Scientific Python Development Environment. It is an open source IDE included in the
Anaconda distribution, and it gave us the opportunity to code, test and debug our modules in
Python.
4.2.4.1 Python
Python is a high-level programming language in which we implemented all the machine learning
algorithms for our thesis in an object-oriented way. It provided the packages we needed to
produce our final results.
Chapter 05: Methodology and Experiment
5.1 Methodology Overview
Methodology is the process by which our work was carried out. It is explained here through a
detailed description of the strategy we followed, together with diagrams. It covers our strategy,
procedure, scheme and process.
5.2 Methodology
In the following part, we describe our methodology and the overall experimental procedure in
sequence.
1. First, we created a developer account on Twitter.
2. After the account was created, Twitter authorized us to collect tweets using the Twitter API.
3. Using the required queries, we received tweets in five languages: Bangla, Turkish, Urdu,
Chinese and English.
4. After that, we converted the Bangla, Turkish, Urdu and Chinese tweets to English text.
5. After collection and conversion, we performed data pre-processing on the collected tweets.
6. Then we converted all tweets to lower case.
7. We added attributes to the text.
8. Then we used the AYLIEN NLP analyzer to find the polarity of the tweets: positive, neutral or
negative.
9. We formed support and against classes for polarity, and opinion and fact classes for
subjectivity.
10. Then we added sentiment attributes to get better prediction results from the classification
algorithms.
11. After that, we encoded our string data into numeric form to obtain a numerical representation
of our dataset.
12. Then we split the dataset into training and testing data.
13. After that, we trained a model on the training data using the KNN algorithm.
14. Then we tested the trained model to find its accuracy.
15. We repeated steps 13 and 14 for the DT, NB, SVM and RF supervised ML algorithms.
16. Finally, we presented the results using Recall, Precision and F-measure.
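Steps 12 to 14 can be sketched in plain Python as follows. This is only an illustrative outline under stated assumptions: the rows are synthetic, the two numeric features are hypothetical, and the helper names `train_test_split`, `knn_predict` and `accuracy` are defined here rather than imported from any library.

```python
import random
from collections import Counter

def train_test_split(rows, test_ratio, seed=0):
    """Shuffle rows reproducibly and split them into train and test lists."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]

def knn_predict(train, features, k=3):
    """Predict the majority label among the k nearest training rows."""
    by_distance = sorted(
        train,
        key=lambda row: sum((a - b) ** 2 for a, b in zip(row[0], features)),
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

def accuracy(train, test, k=3):
    """Fraction of test rows whose predicted label matches the true label."""
    hits = sum(knn_predict(train, features, k) == label for features, label in test)
    return hits / len(test)

# Synthetic rows of (features, class) with 0 = support and 1 = against.
# The two clusters are well separated, so the example is easy by construction.
rows = [([0.0 + i * 0.01, 0.1], 0) for i in range(10)]
rows += [([1.0 + i * 0.01, 0.9], 1) for i in range(10)]

train, test = train_test_split(rows, test_ratio=0.25)
print(len(train), len(test), accuracy(train, test))
```

Step 15 would repeat the training and testing calls with a different classifier in place of `knn_predict`.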
The following diagram illustrates our strategy:
5.3 Process Strategy
In our strategy, the process is one part of our methodology. It combines data collection, data
pre-processing, classification of the data into different class levels, and the implementation
technique.
5.3.1.2 Motivation
We planned to work with real-time data. Data is widely available, but mostly in unstructured
form. We made sure that we were not collecting corpus data, because we needed to work with
document-formatted files. We also learned that formatted, structured data is more dependable and
workable for implementing algorithms. Many websites can provide data. We searched Kaggle.com for
data related to our topic, but we failed to find any.
At the beginning of our data collection process, we were able to extract tweets into a Google
Sheet for a limited period. After a few days, Twitter tightened the security around sharing its
data. To extract tweets through Google Sheets, Twitter required a university-provided email
address, which we did not have at that time.
In the meantime, we fortunately learned of a well-known software package called RapidMiner, which
helped us extract tweets from Twitter through an authorized gateway created by RapidMiner and the
Twitter developer community. Notably, RapidMiner can also handle data science projects and
modeling with SML algorithms in a proper and dependable way.
5.3.1.4 Collection Process
We planned to collect tweets using queries that match our search tags against tweets posted by
Twitter users. If a search tag is found in a user's tweet, the tweet is extracted into our
document file. For this purpose, we used two RapidMiner operators, “Search Twitter” and “Write
Excel”. In the Search Twitter operator we used the parameters “refugee”, “save syria”, “save
rohingya”, “help rohingya”, “help syria”, “help syria children”, “syria crisis”, “refugee
crisis”, “sudan refugee”, “afganisthan refugee”, “stop killing rohingya”, “stop killing
refugee”, “stop syria war” etc.
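The matching idea can be sketched in a few lines of Python. This is not how the RapidMiner operator works internally, which we did not inspect; it only illustrates keeping a tweet when any search phrase occurs in its text. The sample tweets and the function name `matches_query` are hypothetical.

```python
def matches_query(tweet_text, queries):
    """Return True if any search phrase occurs in the tweet, ignoring case."""
    lowered = tweet_text.lower()
    return any(query in lowered for query in queries)

# A few of the search phrases listed above.
queries = ["refugee", "save syria", "help rohingya", "syria crisis"]

# Hypothetical tweets, for illustration only.
tweets = [
    "Please HELP Rohingya families today",
    "Great football match last night",
    "The refugee crisis needs attention",
]
collected = [tweet for tweet in tweets if matches_query(tweet, queries)]
print(collected)
```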
The following figure shows the operators that we used in RapidMiner to collect tweets.
We collected around thirty-five thousand rows of tweets in five different languages from Twitter
as authorized Twitter developers. The data therefore needed to be converted from the various
languages into a common language, English. For this we used Google Translate. The data was not
noise free and was cluttered with removable content. It included duplicated tweets, many URL
links, special symbols, emojis and mentioned names, all of which needed cleaning.
5.3.2 Data Pre-Processing
5.3.2.1 Overview
Data extracted from Twitter or other social media websites contains various non-sentiment
content, such as duplicate tweets, website links, emoticons, mention symbols or usernames, white
space, retweet tags and hashtags. We removed these before processing our tweets so that the
sentiment could be determined accurately. Our pre-processing steps include the following:
Figure 5 Duplicate Tweets Removal diagram.
5.3.2.3 Removal of Website Link
Extracted Twitter data, that is, tweets, contains a type of information known as a URL. When a
Twitter user posts a link to a video, audio clip or article, that link is unnecessary for
sentiment analysis. Therefore, URLs should be removed from our tweets. Many of the URLs we found
were YouTube and Facebook video links.
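A minimal sketch of URL removal with a regular expression is given below. The pattern is an assumption chosen for illustration and covers only links that start with `http://`, `https://` or `www.`; the sample tweet is hypothetical.

```python
import re

# Matches links starting with http://, https:// or www. up to the next space.
URL_PATTERN = re.compile(r"(?:https?://|www\.)\S+")

def remove_urls(text):
    """Delete URLs and collapse the whitespace they leave behind."""
    without_urls = URL_PATTERN.sub("", text)
    return " ".join(without_urls.split())

cleaned = remove_urls("Watch this https://youtu.be/abc123 and help refugees")
print(cleaned)  # Watch this and help refugees
```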
5.3.2.5 Username Removing Part
Following Twitter's guidelines, each user has one unique username. A username appears in tweets
preceded by @ and acts as a proper noun. For example, @someones_username. We also removed these
from our dataset for effective analysis.
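Username removal can be sketched in the same way. The pattern below assumes that a username consists of letters, digits and underscores after the @ sign; the sample tweet is hypothetical.

```python
import re

# An @ sign followed by letters, digits or underscores.
MENTION_PATTERN = re.compile(r"@\w+")

def remove_mentions(text):
    """Delete @username mentions and collapse leftover whitespace."""
    without_mentions = MENTION_PATTERN.sub("", text)
    return " ".join(without_mentions.split())

cleaned = remove_mentions("@someones_username please help the refugees")
print(cleaned)  # please help the refugees
```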
off, over, under, again, further, then, once, here, there, when, where, why, how, all, any, both, each,
few, more, most, other, some, such, no, nor, not, only, own, same, so, than, too, very, can, will,
just, don, should, now.
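Filtering such words can be sketched as follows. The set below holds only a small subset of the stop words listed above, and the sample sentence and function name are hypothetical.

```python
# A small subset of the stop words listed above, for illustration only.
STOP_WORDS = {
    "then", "once", "here", "there", "when", "where", "why", "how",
    "all", "any", "so", "than", "too", "very", "can", "will", "just", "now",
}

def remove_stop_words(text):
    """Keep only the words that are not in the stop-word set."""
    kept = [word for word in text.split() if word.lower() not in STOP_WORDS]
    return " ".join(kept)

cleaned = remove_stop_words("why can refugees not return now")
print(cleaned)  # refugees not return
```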
5.3.3 Classification
After completing our pre-processing steps, we obtained 8788 structured tweets from around 35,000
unstructured tweets.
We then processed and analyzed these tweets to obtain their Polarity (positive, neutral and
negative) and Subjectivity (subjective and objective).
Figure 9 Sentiment classification diagram.
[Figure: Support and Against class. Support: 8330; Against: 458.]
5.3.3.4 Attributes
Our dataset has several attributes. The tweets in our dataset come from various sources: Twitter
Web Client, Twitter for Android, Twitter for iPhone, Twitter Lite, Twitter for iPad, Facebook,
Google and LinkedIn. The attributes are as follows:
• Id
• Name
• Tweet date
• Source of text
• Language
• Retweet
• Subjectivity
• Subjectivity confidence
• Polarity
• Polarity confidence.
However, Id, Name and Tweet date carry no sentiment value, so we removed them from our dataset.
Tweet date in particular was removed because we are working in real time, which makes it
redundant for this analysis.
5.3.3.5 Transformation
Many classification algorithms cannot deal with string data, which is the reason for our data
transformation. With the help of Python, we transformed our attributes into numerical values so
that the algorithms could process them. This changes only the form of presentation. We encoded
the source attribute of each tweet into numeric labels using the LabelEncoder class from
scikit-learn. We transformed our language attribute by dummy encoding with the OneHotEncoder
class from scikit-learn. In the subjectivity attribute, subjective, which we call opinion, is
represented by 1, and objective, which we call fact, is represented by 0. In the polarity class,
support is represented by 0 and against by 1.
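The two encodings can be illustrated with a plain-Python sketch. It does not call the scikit-learn classes named above; it only mimics the idea of assigning integer labels and 0/1 columns, using sorted category order as an assumption. The sample values and language codes are hypothetical.

```python
def label_encode(values):
    """Map each distinct string to an integer, using sorted category order."""
    classes = sorted(set(values))
    index = {value: i for i, value in enumerate(classes)}
    return [index[value] for value in values], classes

def one_hot_encode(values):
    """Return one 0/1 column per distinct value, using sorted category order."""
    classes = sorted(set(values))
    rows = [[1 if value == c else 0 for c in classes] for value in values]
    return rows, classes

# Hypothetical sample values, for illustration only.
sources = ["Twitter for Android", "Twitter Web Client", "Twitter for Android"]
languages = ["en", "tr", "ur"]

source_codes, source_classes = label_encode(sources)
language_rows, language_classes = one_hot_encode(languages)

# Fixed mappings described in the text.
SUBJECTIVITY = {"objective": 0, "subjective": 1}  # fact = 0, opinion = 1
POLARITY = {"support": 0, "against": 1}

print(source_codes)   # [1, 0, 1]
print(language_rows)  # [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
```

Label encoding implies an order between categories, which is one reason a one-hot form is often preferred for attributes such as language.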
5.3.4 Implementation
5.3.4.1 Overview
We implemented our SML algorithms in Python, following OOP concepts. In the following section, we
present the steps we followed to complete our project.
Step3: Then we performed label encoding on our categorical data using the LabelEncoder class from
scikit-learn.
Step4: Then we applied one-hot encoding, a binary representation, to the required attribute using
the OneHotEncoder class from scikit-learn.
Step6: Then, with the help of scikit-learn and cross validation, we split our dataset into
training and testing data.
Step7: Then we fitted and transformed the data using the StandardScaler class from scikit-learn.
Step9: Finally, we built the confusion matrix with scikit-learn and computed our desired results
from it.
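The fit and transform idea of Step7 can be sketched in plain Python. The sketch assumes standardization to zero mean and unit variance with the population standard deviation, learned from the training column only and then reused for the test column. The numbers and function names are hypothetical.

```python
from statistics import mean, pstdev

def fit_scaler(column):
    """Learn the mean and population standard deviation of a training column."""
    mu = mean(column)
    sigma = pstdev(column)
    # Guard against division by zero for constant columns.
    return mu, (sigma if sigma > 0 else 1.0)

def transform(column, params):
    """Standardize a column with previously learned parameters."""
    mu, sigma = params
    return [(x - mu) / sigma for x in column]

# Hypothetical feature values, for illustration only.
train_column = [2.0, 4.0, 6.0]
test_column = [8.0]

params = fit_scaler(train_column)            # fit on training data only
scaled_train = transform(train_column, params)
scaled_test = transform(test_column, params)  # reuse the training parameters
print(scaled_train, scaled_test)
```

Fitting on the training data alone keeps information from the test data out of the model.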
Figure 12 Implementation diagram.
5.3.4.2 Coding
In the following section, we show our label encoding in Python step by step.
5.4 Summary
In this methodology and experiment chapter we described our methodology and process strategy,
which included the data collection process, the data pre-processing process, the data
classification strategy based on our feature analysis, and finally the implementation of our
classification and machine learning algorithms in Python.
Chapter 06: Result
[Figure: Support and Against class. Support: 8330; Against: 458.]
[Figure: Subjectivity Class. Fact: 3659; Opinion: 5129.]
[Figure: Count of polarity. Against: fact 126, opinion 332. Support: fact 3533, opinion 4797.]
6.1.4 English and Polarity
[Figure: English and Polarity. Data labels: 3223, 133.]
[Figure: Data labels: 2229, 110.]
6.1.6 Turkish and Polarity
[Figure: Turkish and Polarity. Data labels: 909, 54.]
[Figure: Data labels: 835, 96.]
6.1.8 Chinese and Polarity
[Figure: Chinese and Polarity. Data labels: 1592, 65.]
6.1.9 Retweet Visualization
We also examined tweets versus retweets. A retweet occurs when other Twitter users re-share a
particular post.
[Figure: Retweet series; x-axis ticks run from 1 to 8498.]
Figure 21 Retweet Visualization Diagram.
6.2.2 Accuracy
Accuracy is defined as the ratio of the sum of TP and TN to the sum of TP, TN, FN and FP. We can
write this as follows:
\[ \text{Accuracy} = \frac{TP + TN}{TP + TN + FN + FP} \]
6.2.3 Recall
Recall is defined as the ratio of TP to the sum of TP and FN. We can write this as follows:
\[ \text{Recall} = \frac{TP}{TP + FN} \]
6.2.4 Precision
Precision is defined as the ratio of TP to the sum of TP and FP. We can write this as follows:
\[ \text{Precision} = \frac{TP}{TP + FP} \]
6.2.5 F-measure
To calculate the F-measure, we multiply 2, recall and precision together and divide the product
by the sum of recall and precision.
\[ F\text{-measure} = \frac{2 \times \text{Recall} \times \text{Precision}}{\text{Recall} + \text{Precision}} \]
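The four measures can be computed directly from the confusion matrix counts, as the following Python sketch shows. The counts passed to the function are hypothetical and are not results from our experiments.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, recall, precision and F-measure from raw counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f_measure = 2 * recall * precision / (recall + precision)
    return {
        "accuracy": accuracy,
        "recall": recall,
        "precision": precision,
        "f_measure": f_measure,
    }

# Hypothetical confusion matrix counts, for illustration only.
metrics = classification_metrics(tp=90, tn=5, fp=3, fn=2)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

With these counts the accuracy is 95 correct predictions out of 100, that is 0.95.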
6.3 Algorithmic Result Visualization
We ran our experiments on the structured data twice: once with seventy-five percent of the
dataset for training and twenty-five percent for testing, and once with eighty percent for
training and twenty percent for testing.
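For a sense of scale, the sketch below computes the row counts that these two splits give for the 8788 structured tweets reported in section 5.3.3. Rounding the training size down is an assumption of this sketch; a particular library may round differently.

```python
def split_sizes(n_rows, train_fraction):
    """Return (train, test) row counts, rounding the training size down."""
    n_train = int(n_rows * train_fraction)
    return n_train, n_rows - n_train

for fraction in (0.75, 0.80):
    n_train, n_test = split_sizes(8788, fraction)
    print(f"{fraction:.0%} training: {n_train} train rows, {n_test} test rows")
```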
6.3.1.1 Accuracy
We observed that the KNN machine learning algorithm gave the best accuracy, ninety-five percent,
an error of only five percent, while NB gave the lowest accuracy, eighty-three percent. The
following pie chart shows the accuracy of the KNN, NB, DT and RF classifiers we used.
[Figure: Polarity Accuracy (pie chart). Legend: RF, KNN, NB, DT. Data labels: 93%, 94%, 83%, 95%.]
6.3.1.2 Table
The following table presents Recall, Precision and F-measure for each classifier we used, with
the average of all in the fifth column.
Table 2 Recall, Precision and F-measure of Polarity Class.
            RF     KNN    NB     DT
Recall      99%    99%    86%    98%
Precision   95%    95%    95%    95%
F-measure   97%    98%    90%    96%
6.4 Comparison Table
We compared our accuracy results with related work. After calculating accuracy from the confusion
matrix, we obtained better prediction results.
6.5 Summary
In this result chapter we presented our algorithmic analysis by calculating accuracy, recall,
precision and F-measure, using graphs and tables to make our thesis outcome easier to understand.
Finally, we presented a comparison table with related work.
Chapter 07: Conclusion
7.1 Discussion
In this thesis we have shown a different approach to text mining and sentiment analysis. The
improvements can be seen in the results chapter, where our best classifier gave the most accurate
results in classifying the support and against classes. We wanted to work with big data, but we
could do so only partially, because handling big data is complex and a major issue in these
circumstances. We used some protocols that can deal with big datasets, but they were time
consuming. With a higher-configuration system, we could obtain better outcomes in less time.
After completing our analysis, we observed that text mining and sentiment analysis can be done
in many ways.
Our observation is that this thesis serves as proof that we can now partially handle a data
science project. Using theoretical concepts, we were able to analyze a real-life problem.
7.2 Limitation
As data and information have become more secure than before, it is becoming difficult to collect
data for experiments. It is also difficult for a machine to find the accurate sentiment of a text
every time.
We observed that working on real-life problems through data mining requires a powerful,
high-configuration system. It was a tough job for us to complete our work with the ordinary
systems we generally use.
Among our thirty-five thousand tweets, our tools found many incomplete words and sarcastic posts,
which were removed when we examined the neutral class. We avoided the neutral class altogether,
judging that it could not give us an appropriate decision for reaching our goal.
Our work also depends heavily on the AYLIEN natural language processing tools, although they are
provided through a dependable source, RapidMiner.
Using our experience from this project, we are willing to work with classification algorithms for
facial recognition, fraud detection and other real-life classification problems.
In future, we are willing to work with data from more social media platforms, such as Facebook
and YouTube.
We are also willing to work with big data on a high-configuration system and to gain more
training on real-life data science projects. Our future plan, therefore, is to work with big data
in text mining and text analysis.
52
References
[1] Team Internet Live Stats, "Internet Live Stats," InternetLiveStats.com, 20 October 2018.
[Online]. Available: http://www.internetlivestats.com/twitter-statistics/. [Accessed 20
October 2018].
[2] Wikipedia writers, "Wikipedia," Wikimedia Foundation Inc., 5 October 2018. [Online].
Available: https://en.wikipedia.org/wiki/Internally_displaced_person. [Accessed 10 October
2018].
[3] World Vision Staff, "World Vision," World Vision, Inc., 26 June 2018. [Online]. Available:
https://www.worldvision.org/refugees-news-stories/forced-to-flee-top-countries-refugees-
coming-from. [Accessed 1 September 2018].
[4] Haseena Rahmath P, "Opinion Mining and Sentiment Analysis Challenges and
Applications," International Journal of Application or Innovation in Engineering &
Management (IJAIEM), vol. 3, no. 5, 2014.
[5] Kristen McCabe, "Crowd Learning Hub," G2 Crowd, Inc, 16 May 2018. [Online]. Available:
https://learn.g2crowd.com/customer-reviews-statistics. [Accessed 20 October 2018].
[6] Doaa Mohey El-Din Mohamed Hussein, "Analyzing Scientific Papers Based on Sentiment
Analysis," Research Gate, Cairo University, 2016.
[7] Bird, S., Klein, E., and Loper, E., Natural Language Processing with Python, vol. 1, O’Reilly
Media publisher, 2009.
[8] Zengchang Qin, "Naive Bayes Classification Given Probability Estimation Trees," IEEE,
Orlando, FL, USA, 2006.
[9] Andrew Christian Flores ; Rogelyn I. Icoy ; Christine F. Peña ; Ken D. Gorro, "An Evaluation
of SVM and Naive Bayes with SMOTE on Sentiment Analysis Data Set," Phuket, Thailand,
2018.
[10] Bo Liu ; Zhi-Feng Hao ; Xiao-Wei Yang, "Nesting support vector machinte for muti-
classification [machinte read machine]," IEEE, Guangzhou, China, China, 2005.
[11] Zonghu Wang ; Zhijing Liu, "Graph-based KNN text classification," in IEEE, Yantai, China,
2010.
53
[12] Shiueng-Bien Yang ; Shen-I Yang, "New decision tree based on genetic algorithm," in
International Symposium on Computer, Communication, Control and Automation (3CA),
Tainan, Taiwan, 2010.
[13] Yashaswini Hegde ; S.K. Padma, "Sentiment Analysis Using Random Forest Ensemble for
Mobile Product Reviews in Kannada," in IEEE 7th International Advance Computing
Conference (IACC), Hyderabad, India, 2017.
[14] Ian, H.W., Eibe, F., and Mark A. H., Data Mining: Practical machine learning tools.,
Waikatio, New Zealand: Morgan Kaufmann Publishers, 2011.
[15] Yin, Z., Rong, J., and Zhi-Hua, Z., "Understanding Bag-of-Words Model: A Statistical
Framework," International Journal of Machine Learning and Cybernetics, 2010.
[16] Dhanalakshmi V., Dhivya Bino, Saravanan A. M., "Opinion mining from student feedback
data using supervised learning algorithms," in 3rd MEC International Conference on Big
Data and Smart City, Muscat, Oman, 2016.
[17] Parul Sharma, Teng-Sheng Moh., "Prediction of Indian Election Using Sentiment Analysis.,"
in IEEE International Conference on Big Data (Big Data), San Jose, CA, USA, 2016.
[18] Huma Parveen, Prof. Shikha Pandey., "Sentiment Analysis on Twitter Data-set using Naive,"
in 2nd International Conference on Applied and Theoretical Computing and Communication
Technology (iCATccT)., Bhilai, India, 2016.
[19] Nazan Ozturka, Serkan Ayvaz, "Sentiment analysis on Twitter: A text mining approach to
the Syrian refugee crisis," in Science Direct, Istanbul, Turkey, 2017.
[20] Asma Musabah Alkalbani, Ahmed Mohamed Ghamry, Farookh Khadeer Hussain, Omar
Khadeer Hussain., "Predicting the sentiment of SaaS online reviews using supervised
machine learning techniques.," in 2016 International Joint Conference on Neural Networks
(IJCNN)., Sydney, Australia., 2016.
[21] Apoorv Agarwal, Vivek Sharma, Geeta Sikka, Renu Dhir., "Opinion Mining of News
Headlines using SentiWordNet.," in 2016 Symposium on Colossal Data Analysis and
Networking (CDAN)., Punjab, India., 2016.
[22] Rachana Bandana, "Sentiment Analysis of Movie Reviews Using Heterogeneous Features,"
in IEEE, Nadiad, India. , 2018.
54
[23] Ankita Rane, Dr. Anand Kumar., "Sentiment Classification System of Twitter Data for US
Airline Service Analysis.," in 42nd IEEE International Conference on Computer Software
& Applications., Dubai, UAE., 2018.
[24] "RapidMiner," RapidMiner, 10 September 2018. [Online]. Available:
https://rapidminer.com/. [Accessed 100 September 2018].
[25] Ch. Nanda Krishna, Dr. P. Vidya Sagar, Dr. Nageswara Rao Moparthi, "Sentiment Analysis
of Top Colleges," in 4th International Conference on Advances in Electrical, Electronics,
Information, Communication and Bio-Informatics (AEEICB-18), Andhra Pradesh, India,
2018.
55