Web Blog Miner Licence Thesis

BLOG MINER
WEB BLOG MINING FOR CLASSIFICATION OF MOVIE REVIEWS
THESIS
Submitted By:
Onur ENEZ 120045072
Kadir ARDIÇ 120042663
Advisor:
Asst. Prof. Dr. Arzu Baloğlu
MARMARA UNIVERSITY
FACULTY OF ENGINEERING
Abstract
Blogs are among the most recent and most popular ways to express ideas, interests and emotions
to the world. With the increasing use of internet sources for all needs of life, and with people
choosing to spend more of their lives around computers, new structures on the web such as
online networks, forums and blogs have become the new meeting points for people. Blogs are also
an important source of information, but it is hard to reach that information
automatically. The difficulty comes from the personalized design and the size of the
blogosphere: every blog has a different structure, which makes it hard to find the information
or related data by following links from one blog to another. The approach of the project is to
create an analysis framework that uses web mining principles. The aim of the project is to build
an opinion mining application that captures people’s opinions and emotions about recent movies
from the contents of weblogs.
Abstract
1-Introduction
2-Literature Review
3-Approach
3.1 Overview
3.2 Problem Definition and Goals
3.2.1 Problem Definition
3.2.2 Goals
3.3 Solution
4-Project Development
4.1 Planning Phase
4.1.1 Project Identification
4.1.2. Feasibility Analysis
4.2 Analysis Phase
4.2.1 Requirements Analysis
4.2.2 Modeling process and data
4.3 Designing Phase-System Architecture
4.3.1 Blog Crawler
4.3.2 Sentiment Analyzer
4.3.3 Web User Interface
4.4 Implementation Phase
5. Experiments and Results
5.1 Data
5.2 Experimental Results
5.3 Discussion
5.4 Difficulties Encountered
6. Conclusion
7.References

Figure 1 State of the Blogosphere
Figure 2 Blog Miner Overall Process Model
Figure 3 Blog Crawler Data Flow
Figure 4 Sentiment Analyzer Data Flow
Figure 5 User Interface Data Flow
Figure 6 Crawler Architecture
Figure 7 Sentiment Analyzer Process Model
Figure 8 Blog Miner ER Diagram
Figure 9 Blog Miner Class Diagram
Figure 10 Words Matching Class Diagram
Figure 11 Main Page
Figure 12 Graphs Page
Figure 13 User Interface Process Model

Table 1 SentiWord Data Table
Table 2 Sample Graphs
Table 3 Word Tags
Table 4 Experiment Results
1-Introduction
“Every idea is valuable.” This was the motivation for developing a sentiment analysis
engine. The internet, the world’s biggest library, is fed by every user around the world.
People contribute their personal signatures, ideas, moments, knowledge and so on through the
internet. We live in the century of technology, and every simple step of life has moved onto
different virtual communication lines.
Sociologists have used many different ways to understand people’s nature, their interests,
community aims and preferences, and we believe the most realistic way to generalize is to
look for shared common points. The system has been designed in that manner: it uses the ideas
of individual people to characterize the views of web communities, by collecting those
ideas from what people share on the web. The most efficient sources for this are people’s own
diaries or books, known as web blogs. Blogs are the most popular way of sharing your world in
your own sentences or quotations, yet few studies have tried to use the valuable
information contained in them.
With the increasing usage of the internet, blogging and blog pages have grown rapidly, and blog
pages are now one of the most popular ways to express opinions and emotions. According to the
blog search engine Technorati [1], by the end of 2008 there were 133 million blogs on the
global Internet indexed by Technorati. Figure 1 shows the state of the
blogosphere in 2008.
Figure 1 State of the Blogosphere
The figure shows how rapidly the number of blogs is increasing and will continue to increase.
People are writing their opinions and emotions about almost every topic in blogs. Mining
opinions from reviews on web pages, however, is a complex process, which requires more than
just text mining techniques. The complexity is related to a couple of issues. First, review
data has to be crawled from websites, in which web spiders or search engines can play an
important role. Moreover, it is necessary to separate the data of reviews from non-reviews.
The sentiment classification process can then be conducted [11].
This thesis proposes a system that extracts movie reviews from blogs and classifies these
reviews into two groups, positive and negative, either per defined category or overall.
It has also been designed so that it can be extended to other topics in the future.
Every component needed for effective sentiment web mining is introduced in detail, together
with the reasons for choosing it. The application then summarizes the results for the user in
an effective visual way.
2-Literature Review
When we first started to research web blog mining, there was no clear idea of
what it really concerned. It was a new research topic, and the existing developments were
limited and done mostly by other academic students. We have chosen some of these papers to
point out the direction on which research and development should focus. Below
are short descriptions of the methodology and techniques used by previous researchers.
The authors of [4] built a sentiment classification application which uses phrase
patterns to classify opinions. In their method, they construct some phrase patterns and
calculate their sentiment orientation with an unsupervised learning algorithm. In the document
classification phase, they add special tags to some words in the text and then
match the tags within a sentence against the phrase patterns to get the sentiment
orientation of the sentence. Finally, they add up the sentiment orientations of all
sentences and classify the text according to this sum. This method achieves an
accuracy rate of 86% when used to evaluate sports reviews from some websites.
The authors of [5] built a reputation management application on the WebFountain
platform (WebFountain is a platform for very large-scale text analytics applications that
allows uniform access to a wide variety of sources). The application enables various analyses
of corporate and product reputation and tracking of market trends. A key component of their
reputation management system is the sentiment miner, which extracts the sentiment (or opinions)
people express about a subject, such as a company, brand, or product name. They designed
the sentiment miner with the following challenge in mind: not only the overall opinion
about a topic, but also the sentiment about individual aspects of the topic is essential
information, and document-level sentiment classification fails to detect sentiment about
individual aspects of the topic. The sentiment miner analyzes grammatical sentence
structures and phrases based on natural language processing (NLP) techniques. It detects,
for each occurrence of a known topic spot, the sentiment specifically about the topic. With
these characteristics, their NLP-based sentiment mining system achieved high-quality results
(∼90% accuracy) on various datasets, including online review articles, general web
pages and news articles. Their feature extraction algorithm successfully identified
topic-related feature terms from online review articles, enabling sentiment analysis at finer
granularity.
The authors of [6] built an application for sentiment classification with review
extraction. Their whole process can be illustrated logically in three phases:
1) Extract the review expressions on specific subjects and attach a sentiment tag and
weight to each expression;
2) Calculate the sentiment indicator of each tag by accumulating the weights of all the
expressions with the corresponding tag;
3) Given the indicators on different tags, use a classifier to predict the sentiment label of
the text.
They used some online documents to test the performance of their application. The
experimental documents cover two domains: politics and religion. The experiments within
those domains achieve accuracy between 85% and 95%.
The authors of [7] applied opinion mining to help e-learning systems
learn the users’ opinions on the courseware, the teachers, the charges or other aspects of
the e-learning system, and to help the developers improve the services. They developed an
opinion mining system for e-learning reviews. The goal of this system is to extract and
summarize the opinions and reviews, and to determine whether these reviews and opinions
are positive or negative and how strong they are. They divided the whole task into four
subtasks:
1) Expression identification
2) Opinion determination
3) Content-value pair identification
4) Sentiment analysis
The precision achieved on these subtasks is 94%, 84.2%, 80.9% and 92.6%, respectively.
The authors of [8] developed a unified collocation framework for opinion mining.
They propose a novel unified collocation-driven opinion mining method and compare
it with the attribute-driven method, the sentiment-driven method and the general
collocation-driven method. The unified collocation-driven method incorporates
attribute-sentiment collocations as well as their syntactical features to achieve reasonable
generalization ability. As shown by the experimental results, recall in opinion extraction
improves by 0.245 on average without obvious loss in opinion extraction precision or
sentiment analysis accuracy.
The authors of [9] built a sentiment mining and retrieval system called AMAZING.
They introduce a ranking mechanism which differs from that of a general web search engine,
since it utilizes the quality of each review rather than the link structure to generate
review authorities. The most important part of this system is that it incorporates temporal
dimension information into the ranking mechanism and makes use of temporal opinion
quality and relevance to rank review sentences. They monitor how customer reviews
change over time and visualize the changing trends of positive and negative opinion
respectively. They also generate a visual comparison between the positive and negative
evaluations of a particular feature in which potential customers are interested. They conducted
experiments with the sentiment mining and retrieval system using the customer reviews of
four kinds of electronic products: 20 digital cameras, 20 cell phones, 20 laptops and
20 MP3 players. They achieved a precision of approximately 85%.
In paper [12] a multi-knowledge based approach is proposed, which integrates WordNet
[13], statistical analysis and movie knowledge. WordNet® is a large lexical database of
English, developed under the direction of George A. Miller. Nouns, verbs, adjectives and
adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct
concept. They decompose the problem of review mining and summarization into the
following subtasks:
1) Identifying feature words and opinion words in a sentence;
2) Determining the class of feature word and the polarity of opinion word;
3) For each feature word, first identifying the relevant opinion word(s), and then
obtaining some valid feature-opinion pairs;
4) Producing a summary using the discovered information.
WordNet, movie casts and labeled training data were used to generate a keyword list for
finding features and opinions. Then grammatical rules between feature words and opinion
words were applied to identify the valid feature-opinion pairs. Finally, they re-organized the
sentences according to the extracted feature-opinion pairs to generate the summary. The
objective of their work is to automatically generate a feature class-based summary for
arbitrary online movie reviews. Experimental results show that their method works with
an average precision of approximately 65%. In addition, with their approach, it is easy to
generate a summary with movie-related people names as the sub-headlines, which probably
interests many movie fans.
The work done in this project is most similar to the work in [12]. One important
difference is that this project aims to calculate the sentiment orientation of movie reviews
from blogs. All of the works reviewed operate on a fixed dataset, whereas this
project’s review dataset is crawled from blogs and then processed
to calculate movie scores. The method is discussed in detail in the next section.
3-Approach
3.1 Overview
This section briefly defines the techniques and goals of the project, the intended
results, and the methods applied during project development. The project is
separated into three phases. The first phase is crawling, in which data is gathered from web
blogs or portals; the second phase is to parse, analyze and process that data into information;
the last phase is interfacing, or visualizing the analysis results: the results are presented,
compared with existing results, and the accuracy of the work is tested. More details of the
technical and architectural work are given in the system architecture part.
3.2 Problem Definition and Goals

3.2.1 Problem Definition
Web blogs and portals are full of un-indexed and unprocessed text that contains
much useful material for analysis. Such text gives direct access to a person’s ideas.
There is a need to collect and process that data and let people use it in their
decision-making processes. Many people act on the prevailing opinion about something,
for example buying the camera that most people claim is the best among the options. We
focused in the same manner on creating a blog mining system that takes movie comments from
blogs or portals and tells the user what most people think about the movie and its related
sub-units, from director to screenwriter.
3.2.2 Goals
Gather the right data from the right sources to process.
Process the data into information using well-defined word libraries and well-defined
procedures that analyze it and turn it into meaningful results.
Present the results in a clear way that the user can use for:
    Saving time,
    Learning about community agreement on a topic.
Produce a model that can serve as a base example for future research on this
topic.
3.3 Solution
The problem to address is parsing the data that exists in texts on blogs or web portals where
people discuss their ideas, make comments or criticize. Part of this problem is
to collect that data with an automated system as a human reader would: stop points must be
defined, and the factors that keep the crawl on subject must be specified. The second problem
is to analyze the data once it has been gathered. Again, it is necessary to define
subject-specific algorithms that pick out the subjective pieces to use in scoring or sketching
information. The last part of the work is to build a presentation environment in which the
end user can review the work and see the project’s accuracy; we handle this by building a web
site and graphical interfaces. Figure 2 shows the overall process model of our application.
Figure 2 Blog Miner Overall Process Model
The basic principles and working mechanism of our system are as follows.
Crawling the blogs for movie reviews: OpenWebSpider and Arachnode have been used for
crawling the blogs and collecting data for sentiment analysis. A Web crawler (also known as
a Web spider or Web robot) is a program or automated script that browses the World Wide
Web in a methodical, automated manner. Other less frequently used names for Web
crawlers are ants, automatic indexers, bots, and worms.
This process is called Web crawling or spidering. Many sites, in particular search engines,
use spidering as a means of providing up-to-date data. Web crawlers are mainly used to
create a copy of all the visited pages for later processing by a search engine that will index
the downloaded pages to provide fast searches. Crawlers can also be used for automating
maintenance tasks on a Web site, such as checking links or validating HTML code. Also,
crawlers can be used to gather specific types of information from Web pages, such as
harvesting e-mail addresses (usually for spam) or gathering text content, as we do.
A Web crawler is one type of robot, or software agent. In general, it starts with a list of
URLs to visit, called the seeds. As the crawler visits these URLs, it identifies all the hyperlinks
in the page and adds them to the list of URLs to visit, called the crawl frontier. URLs from the
frontier are recursively visited according to a set of policies.
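The seed and frontier mechanism described above can be sketched in a few lines. This is only an illustration, not the code of OpenWebSpider or Arachnode: the function names, the in-memory toy link graph and the visiting policy (first in, first out, with a page limit) are all assumptions made for the example, and no real network access is performed.

```python
from collections import deque

def crawl(seeds, get_links, max_pages=100):
    """Visit URLs breadth-first, starting from the seeds.

    get_links is a caller-supplied function (hypothetical here) that
    returns the hyperlinks found on a page.
    """
    frontier = deque(dict.fromkeys(seeds))  # crawl frontier: URLs still to visit
    seen = set(frontier)                    # URLs already queued, to avoid revisits
    visited = []                            # URLs in the order they were visited
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        visited.append(url)
        for link in get_links(url):
            if link not in seen:            # assumed policy: queue each URL only once
                seen.add(link)
                frontier.append(link)
    return visited

# Hypothetical link graph standing in for real blog pages.
TOY_WEB = {
    "seed": ["blog-a", "blog-b"],
    "blog-a": ["blog-b", "blog-c"],
    "blog-b": ["seed"],
    "blog-c": [],
}

def toy_links(url):
    return TOY_WEB.get(url, [])

order = crawl(["seed"], toy_links)
```

A real crawler would replace `toy_links` with a function that downloads the page and extracts its hyperlinks, and would add further policies such as politeness delays and topic filters.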
Sentiment analysis of blogs: Sentiment analysis has three main tasks: determining
subjectivity, determining sentiment orientation and determining the strength of the
sentiment orientation. Sentiment analysis can be done in two different ways:
Using an unsupervised approach.
Using a supervised machine learning approach.
This application uses the unsupervised approach. OpenNLP is used to find the types
(parts of speech) of words. A keyword database contains words specific to the movie domain.
The text under analysis is searched for keywords; when a keyword is found, its
score is calculated from the adjectives or adverbs that modify it. Below, this
algorithm is referred to as the keyword algorithm. A second algorithm, which looks at all words
in the related sentences and calculates a general score for a movie, has also been defined.
It is referred to as the all words algorithm.
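As a rough illustration of the all words algorithm, the sketch below averages the scores of the words in the related sentences. The tiny lexicon, its score values and the function names are hypothetical, and the sketch assumes that words without a lexicon entry are simply skipped; the text does not say how the project treats such words.

```python
import re

# Hypothetical word scores; real values would come from the word database.
TOY_SCORES = {"great": 0.75, "boring": -0.5, "good": 0.5, "weak": -0.25}

def sentence_words(sentence):
    """Split a sentence into lowercase word tokens."""
    return re.findall(r"[a-z']+", sentence.lower())

def all_words_score(sentences, scores):
    """Average the scores of every word that has a lexicon entry.

    Returns 0.0 when no scored word is found (an assumed convention).
    """
    found = [scores[w] for s in sentences for w in sentence_words(s) if w in scores]
    return sum(found) / len(found) if found else 0.0

movie_score = all_words_score(["A great film.", "The plot was boring."], TOY_SCORES)
```

Here `movie_score` is the mean of the two scored words, "great" and "boring"; every other word is ignored.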
Generating visual results: ZedGraph has been used to visualize the findings.
ZedGraph is a class library, Windows Forms User Control, and ASP web-accessible control for
creating 2D line, bar, and pie graphs of arbitrary datasets. ZedGraph is maintained as an
open-source development project. The results are presented on the project web site through a
shared database.
4-Project Development
4.1 Planning Phase

In the planning phase, a high-level view of the intended project was established and
its goals were determined. The waterfall methodology was selected for developing this project,
and the project is divided into four phases according to this methodology: planning,
analysis, design and implementation, carried out in that order.
4.1.1 Project Identification
User Need
The user needs analyzed information, produced by the system’s sentiment analysis,
to use for comparison or as help in decision making.
Business Requirements
Much of today’s business is carried out on web platforms, and every person is also a
virtual customer or at least a participant. Their ideas and what they share on the web should
be used to help firms or business owners understand their customers better. Firms can change
the direction of their production and portfolio when they know what people are looking for.
Business Value
Produce a model that can serve as a base example for future research on this topic.
4.1.2. Feasibility Analysis
Technical Feasibility
Web content mining and data mining are new terms. They are not as common as
subjects like e-commerce sites, for which plenty of material and examples can be found, but
the need for them and their popularity are increasing. The development was therefore done
with limited documentation and limited prior research by people working on the same topic.
The development environment is C#.NET and ASP.NET, with the work done in
Visual Studio 2008. SQL Server 2005 is chosen for database management. Ajax web
controls are also used in our web interface. The graph library used is ZedGraph, an open
source library which is also used by Wikipedia for most of its charts. It works
in both forms applications and web applications.
OpenNLP is used to help with natural language processing when the texts
are parsed. The SentiWordNet database is used as the word database. The Porter
Stemmer is used to find the stem of a word, and the NetSpell [18] library is used for
correcting misspelled words. For the rest, the project defines its own classes and algorithms
to operate on text.
Economic Feasibility
An economic feasibility analysis is not required for the project, because neither the
developers nor the development tools require any investment.
4.2 Analysis Phase

4.2.1 Requirements Analysis
Functional Requirements
The core functionality of the system is to process human opinion using web mining
techniques, so most of the functionality is performed by code in the background; the user
only browses data that has already been processed. On the admin side, the system
can be reviewed to decide which functions are necessary to accomplish the mission of the
project.
FR0: The user can see all analyzed movies one by one, with all topics included.
FR1: The user can choose movies to compare with each other on a specific topic or overall.
FR2: Users shall be able to give feedback through feedback screens to request an analysis of
their choices.
FR3: Developers can use our sentiment algorithms as packages for text processing
in their own analyses.
FR4: The user can test our system against results taken from IMDb to see the accuracy of the
system.
FR5: We can crawl any site to any depth we wish and specify our topics or web site limits
as we wish using the crawler interfaces.
FR6: We shall be able to modify the content of the sentiment analysis to analyze different
topics.
FR7: Developers can use the existing system as a template and modify the code base for
their own analysis interests.
Non-functional Requirements
NF0: It is important that graphical information is clear and easy for web site users to
understand.
NF1: The response times of user searches must be short.
NF2: The accuracy of the returned results must be high.
NF3: Comparison options must be logical.
NF4: The hardware that hosts the crawler should be high performance,
because the crawler performs fast transactions and stores large amounts of data.
NF5: Worker threads should be used for crawling to gain performance and parallelism.
Otherwise, crawling large and multiple sites for analysis is harder.
NF6: The platform on which the application is set up must have the .NET Framework 3.5 and
MS-SQL Server 2005 installed.
NF7: Users need an environment that can browse ASP.NET pages. Any basic web
browser provides this.
NF8: An implementation environment should be set up for developers. Everything listed below is
necessary during the implementation phase.
Visual Studio 2008
MS-SQL 2005
Windows XP
Computer
Internet Connection
NF9: The system should be highly maintainable, because it should be
customizable and easy to adapt to another topic.
4.2.2 Modeling process and data
In this phase, data flows between components of the system were determined. Then they
were modeled with Data Flow Diagrams. Figure 3 shows the data flow of the blog crawler.
Figure 4 shows the data flow of the sentiment analyzer and figure 5 shows the data flow in
the web user interface.
Figure 3 Blog Crawler Data Flow
Figure 4 Sentiment Analyzer Data Flow
Figure 5 User Interface Data Flow
4.3 Designing Phase-System Architecture
4.3.1 Blog Crawler
One of the most important parts of the application is the Blog Crawler. The crawler has a
heavy workload, because as much data as possible must be analyzed to reach good
accuracy. If the analysis is not done with enough data, the results show the opinions
of only a restricted group of people, whereas the goal is to calculate general opinion about
a movie. As many blogs as possible therefore have to be crawled to reach good results, but
there are hardware restrictions. The blogosphere contains a huge amount of data, but
storage capacity is limited, and the crawler would need a very fast computer with a large
memory to crawl the whole blogosphere, so only part of the blogosphere is crawled. Our
hypothesis is that when the hardware specifications are improved and the crawled part of the
blogosphere is increased, the application will produce better results.
Arachnode.net is used for crawling the blogs. Arachnode.net is an open source Web
crawler for downloading, indexing and storing Internet content including e-mail addresses,
files, hyperlinks, images, and Web pages. Arachnode.net is written in C# using SQL Server
2005. Arachnode.net uses the Lucene.Net library for indexing and searching. Arachnode.net
was selected because it is very customizable and well written; it is also written in C#, which
makes customization and integration easier. Arachnode.net was customized
for crawling blogs, and the crawler was started with seeds like www.blogpulse.com and
www.technorati.com, because these web sites contain a lot of links to blogs and this
improves the crawling performance. Figure 6 shows the main working process of the
crawler.
Figure 6 Crawler Architecture
4.3.2 Sentiment Analyzer
The sentiment analyzer is the core of the application. In this part,
scores for a movie are calculated from the comments about that movie. To calculate the scores,
the blogs that contain comments about a specific movie are first selected, and then the text
of each web page is parsed into sentences for sentence-level calculation. In the first
algorithm, every sentence is searched for the keywords that were created
for the movie domain; if a keyword is found in the sentence, the
adjectives modifying the keyword are looked up. SentiWordNet [14] is used for the sentiment
scores of the words. SentiWordNet is a lexical resource in which each WordNet [13] synset s is
associated with three numerical scores Obj(s), Pos(s) and Neg(s), describing how objective,
positive, and negative the terms contained in the synset are. Table 1 shows some adjectives
and their scores according to SentiWordNet. After the modifying adjectives are found,
the adverbs modifying these adjectives are looked up. These adverbs are separated
into two categories: degree adverbs and reversing adverbs. If a degree adverb like
“less” or “more” is found for the adjective, the adjective’s score is multiplied by the
degree adverb’s score and the result is used as the keyword’s score. If a reversing adverb like
“not” is found for the adjective, the adjective’s score is reversed and the result is used as
the keyword’s score. All keyword scores are calculated for every related blog page, and then
the average of these scores is calculated. The keywords are grouped into categories like
“Screen Play”, “Director” and “Producer”; there are 9 such categories for the movie domain.
After all keyword scores have been calculated, the scores of these 9 categories are
calculated from the scores of the keywords in each category. In the second algorithm, for every
related sentence, the score of every word is looked up in the SentiWord database and
the average of these word scores is calculated, which gives the general movie score.
Table 1 SentiWord Data Table
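The keyword algorithm described above can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the thesis implementation: the adjective scores are the two values that appear in the worked example of Section 5.2, the degree-adverb multipliers are assumed, and the names `keyword_score` and `category_scores` are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Illustrative values only; the application reads scores from the SentiWord table.
ADJECTIVE_SCORES = {"good": 0.844, "great": 0.344}
DEGREE_ADVERBS = {"more": 1.5, "less": 0.5}   # assumed multipliers
REVERSING_ADVERBS = {"not"}

def keyword_score(adjective, adverbs=()):
    """Score one keyword from its modifying adjective and that adjective's adverbs."""
    score = ADJECTIVE_SCORES[adjective]
    for adverb in adverbs:
        if adverb in REVERSING_ADVERBS:
            score = -score                     # a reversing adverb flips the sign
        elif adverb in DEGREE_ADVERBS:
            score *= DEGREE_ADVERBS[adverb]    # a degree adverb scales the score
    return score

def category_scores(findings):
    """Average the keyword scores per category.

    findings: iterable of (category, adjective, adverbs) tuples.
    """
    by_category = defaultdict(list)
    for category, adjective, adverbs in findings:
        by_category[category].append(keyword_score(adjective, adverbs))
    return {category: mean(scores) for category, scores in by_category.items()}
```

For example, a "good" director comment and a "not good" director comment cancel out to a category average of zero.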
People’s opinions and comments in their blogs may contain spelling errors, and these
errors decrease the accuracy of the application. To address this problem, NetSpell [18] is
used as a spell checker library in the score calculator method. Only the stems of words are
stored in the SentiWord table, so a word must be searched by its stem to find its sentiment
score. The Porter Stemmer [16] is used to obtain the stem of a word. There is also a string
similarity project called Words Matching, created for keywords that still cannot be found
after spell checking and stemming. This project calculates the similarity of two strings
and returns a value between 0 and 1; two strings are assumed to be equal if their
similarity score is greater than 0.8. These text and word modifications improve the
application’s accuracy. Figure 7 shows the process model of the sentiment analyzer.
Figure 7 Sentiment Analyzer Process Model
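The similarity fallback can be sketched in Python as below. The algorithm used by the Words Matching project is not described in the text, so this sketch substitutes the standard library's `difflib.SequenceMatcher` as a stand-in measure that also returns a value between 0 and 1; spell checking and stemming are left out, and `lookup_score` is a hypothetical name.

```python
from difflib import SequenceMatcher

SIMILARITY_THRESHOLD = 0.8   # threshold stated in the text

def similarity(a, b):
    """Return a similarity value between 0 and 1 for two strings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def lookup_score(word, lexicon):
    """Find a word's score, falling back to the most similar lexicon entry."""
    if word in lexicon:
        return lexicon[word]
    best = max(lexicon, key=lambda entry: similarity(word, entry), default=None)
    if best is not None and similarity(word, best) > SIMILARITY_THRESHOLD:
        return lexicon[best]
    return None   # no sufficiently similar entry was found
```

With a lexicon containing "excellent", the misspelling "excelent" is still resolved, while an unrelated word returns no score.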
Figure 8 Blog Miner ER Diagram
Figure 8 shows the ER diagram of the application, but this diagram does not include the
Arachnode.Net database, which the crawler uses to store the blog pages. The database
diagram of Arachnode.Net could not be added because it is very large, but the database
diagrams of Arachnode.Net can be found at [15]. The Movies table stores the score results
of each movie investigated. The People table stores information about the people related
to the movies, for example the names of actors and actresses, the director, etc. This
information is stored to improve the accuracy of the score calculator method by catching
all comments about a movie. The SentiWord table stores the sentiment dictionary, which
was obtained from SentiWordNet [14]. The Movie Elements table contains the 9 categories
for the movie domain, and the Element Alias table contains the keywords for these categories.
Figure 9 Blog Miner Class Diagram
Figure 9 shows the class diagram of the main project. Most of the work in the project is
done by the MovieScoreCalculator class, which uses the Porter Stemmer and Spell Checker
classes and the Words Matching project to improve efficiency. The Score Calculator class is
a test class that calculates the scores of 10 movies from IMDb comments. The crawler also
calls the MovieScoreCalculator class when a page is related to a movie; the
MovieScoreCalculator class then calculates the score and updates or creates the movie score.
Figure 10 Words Matching Class Diagram
4.3.3 Web User Interface
Most of the web blog mining process developed in this thesis lies behind the visual
interface: the work takes place mostly in database processes and in functions. After a
long process of gathering data, storing it, trimming the raw data down to more meaningful
data, and processing it with the defined parsing and sentiment analysis functions, the
results of our work come out as simple numbers, namely the scores of the movies in a few
data tables. These results are presented to the end user in the simplest and most useful
way, as graphical charts in which users can select what they want to display on simple
graphs.
A project web site has been developed both to present the project evaluation and to
give information about what has been gained throughout this process and, most importantly,
to publish the sentiment analysis results of the web blog mining with a basic mechanism.
The web site has five main pages. Three of them present the project and reference
materials. One is a comment page, and the last and most important one is the graphs page,
where the results of the project are presented.
Web Site Pages

Start Page
Figure 11 Main Page
The main page of the web site is shown in Figure 11. Users can browse between the main
pages from the menu at the top. On the right side there are some reference pages and quick
launch options. Users can reach all documentation from the paper and materials pages. The
most important page of the interface is the graphs page, which is explained in detail below.
Graphs Page
Figure 12 Graphs Page
The graphs page consists of two sections. The first section is the selection part, which
offers three selection options. The first option is to select the name of an analyzed movie
and then click the show all button. This creates a bar graph showing the 9 different
analysis results.
The second selection option is to choose a category from the combo box. Here users
specify a single selection. The third part is a grid that lists the movies that have been
analyzed. Here users can select the movies they want to plot in the graph, or they can
choose them all. The graph is then drawn according to the specified selection.
On the right side, next to the graph, users can see the imdb.com score of the movie as a
basis of comparison for judging the accuracy of our system. When a user chooses the show
score option, the overall score is displayed together with the IMDb score, taken as the
overall rating, for comparison. Note that IMDb derives its score from votes whereas we
analyze the comments, so if a person made a comment but did not cast a vote, this may cause
a deviation from the real result.
The second section of the graphs page contains the Zed Graphs, dynamic charts that are
created each time users specify a selection. The next section contains a short summary of
Zed Graphs and how it is used in this work.
Zed Graphs
Zed Graph [17] is a set of classes, written in C#, for creating 2D line and bar graphs of
arbitrary datasets. The classes provide a high degree of flexibility: almost every aspect
of the graph can be user-modified. At the same time, usage of the classes is kept simple by
providing default values for all of the graph attributes. The classes include code for
choosing appropriate scale ranges and step sizes based on the range of data values being
plotted.
Zed Graphs has two different libraries, one for Windows Forms applications and one for
web pages. There are also two different modes that can be used in both. The second option,
“image render mode”, has been chosen; it creates a graph, saves it to a folder as a
temporary image, and loads it from there if the user re-clicks the same graph, which
shortens the graph loading time. The chart creation process for our graphs page is shown in
Figure 13.
Figure 13 User Interface Process Model
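The caching idea behind the image render mode (render once, then reload the stored image when the same graph is requested again) can be sketched in Python as follows. This does not use the Zed Graph API; `cached_chart_path`, the `render` callback and the selection string format are hypothetical and serve only to show the idea.

```python
import hashlib
import os
import tempfile

def cached_chart_path(selection, render, cache_dir):
    """Return the image file for a selection, rendering it only on first use.

    selection: a string describing what the user picked (hypothetical format).
    render: caller-supplied function returning the image bytes for a selection.
    """
    key = hashlib.sha256(selection.encode("utf-8")).hexdigest()
    path = os.path.join(cache_dir, key + ".png")
    if not os.path.exists(path):           # first click: render and store
        with open(path, "wb") as image_file:
            image_file.write(render(selection))
    return path                             # later clicks: reuse the stored image

def demo():
    """Request the same chart twice and count how often it is rendered."""
    calls = []

    def fake_render(selection):
        calls.append(selection)
        return b"fake image bytes"

    with tempfile.TemporaryDirectory() as cache_dir:
        first = cached_chart_path("movie=Wall-E;view=all", fake_render, cache_dir)
        second = cached_chart_path("movie=Wall-E;view=all", fake_render, cache_dir)
        return first == second, len(calls)
```

In `demo`, the second request returns the same file without calling the renderer again, which is where the shorter loading time comes from.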
The process diagram shows how the data is fed to the data set of the graphs. The point to
take care of here is that the data sent must suit the chosen graph type so that the
resulting plot is meaningful. Many graph types can be created with very simple code.
Some sample graphs that can be made very easily are shown in Table 2.
Sample Bar Graph Pie Charts Line & Symbol Charts
Table 2 Sample Graphs
4.4 Implementation Phase

In the design phase the structure of the project was well defined and every step of the
implementation was determined. The development language of the project is C# and the
development environment is Microsoft Visual Studio 2008. Microsoft SQL Server 2005 was
selected for database development. All the classes and projects were implemented as defined
in the design phase. The sentiment analyzer was created as a WPF project and integrated
with the Blog Crawler. An ASP.NET web site project was created for publishing the results
of the project. To improve implementation performance, Visual Studio’s dataset and table
adapter structures are used for database interactions. Object-oriented design rules and
structures were followed while implementing these projects. Smart Draw 2009 and Visual
Paradigm for UML 7.0 Enterprise Edition were used for creating the diagrams.
5. Experiments and Results

5.1 Data
The user reviews of a few movies from IMDb have been used as the data set. These movies
were selected from recent releases. The selected movies should be familiar to most movie
fans, because this work aims to analyze as many user comments as possible and well-known
movies have enough comments for this purpose. According to these criteria, 10 movies from
IMDb have been selected: The Fast and Furious, Monsters vs. Aliens, State of Play, Knowing,
The Dark Knight, Wall-E, Slumdog Millionaire, No Country for Old Men, There Will Be Blood
and The Curious Case of Benjamin Button. For each movie, approximately 10 review pages were
crawled by the Blog Crawler. This makes approximately 1000 reviews in total. These reviews
are used in the experiments to calculate the accuracy of the application.
5.2 Experimental Results

The experiment is presented by following how the blog miner processes the raw data and
calculates its results. A sample review has been chosen from IMDb for the blog miner to
work on. The keyword algorithm has been used for this example review. The second algorithm
works in a similar way but looks up the score of every word rather than only the scores of
keywords.
Sample Review:
“I thought it wouldn't be as good as it was, because thousands of people and reviews said it
would suck! It was great, but what it missed was that it needed to be at-least an hour longer,
because it missed a-little bit, but it still rocked! I loved it! I thought it was funny, and as did
the person next to me, when John says: "I'll be back!””.
Step 1: Split the text into sentences
In this step the text is split into sentences so that the sentiment analysis can be done
at the sentence level. The text below shows the sample review after step 1.
~1~ I thought it wouldn’t be as good as it was, because thousands of people and reviews said
it would suck! ~1~
~2~ It was great, but what it missed was that it needed to be at-least an hour longer,
because it missed a-little bit, but it still rocked! ~2~
~3~ I loved it! ~3~
~4~ I thought it was funny, and as did the person next to me, when John says: "I'll be
back!””. ~4~
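Step 1 can be approximated with a short Python sketch. The splitter used in the application is not described, so the rule here (split after `.`, `!` or `?` when followed by whitespace) is an assumption; it would mis-split abbreviations such as "Mr." and is only meant to show the idea. The function names are hypothetical.

```python
import re

def split_sentences(text):
    """Split text after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]

def mark_sentences(text):
    """Wrap each sentence in ~n~ markers, as in the step 1 output above."""
    return [
        f"~{number}~ {sentence} ~{number}~"
        for number, sentence in enumerate(split_sentences(text), start=1)
    ]
```

Because the rule requires whitespace after the punctuation mark, an exclamation mark directly followed by a closing quote, as at the end of the sample review, does not start a new sentence.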
Step 2: Tag the words in each sentence by their type

In step 2, appropriate part-of-speech tags are added to the words so that their meanings
can be understood more accurately. Table 3 shows the tags that have been used and their
meanings. The text below shows the sample review after step 2.
I/PRP thought/VBD it/PRP would/MD not/RB be/VB as/RB good/JJ as/IN it/PRP was/VBD
,/, because/IN thousands/NNS of/IN people/NNS and/CC reviews/NNS said/VBD it/PRP
would/MD suck/VB !/.
It/PRP was/VBD great/JJ ,/, but/CC what/WP it/PRP missed/VBD was/VBD that/IN it/PRP
needed/VBD to/TO be/VB at-least/JJ an/DT hour/NN longer/RB ,/, because/IN it/PRP
missed/VBD a-little/JJ bit/NN ,/, but/CC it/PRP still/RB rocked/VBD !/.
I/PRP loved/VBD it/PRP !/.
I/PRP thought/VBD it/PRP was/VBD funny/JJ ,/, and/CC as/RB did/VBD the/DT person/NN
next/JJ to/TO me/PRP ,/, when/WRB John/NNP says/VBZ :/: "/`` I/PRP will/MD be/VB
back/RB !/. ”/NN ./.
Table 3 Word Tags
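The `word/TAG` output of step 2 is easy to process further. The following Python sketch reads such a tagged sentence back into pairs and picks out the adjectives (`JJ`) and adverbs (`RB`) that the scoring step needs. It performs no tagging itself, and the function names are hypothetical.

```python
def parse_tagged(sentence):
    """Turn 'word/TAG word/TAG ...' into a list of (word, tag) pairs."""
    pairs = []
    for token in sentence.split():
        word, _, tag = token.rpartition("/")   # split on the last slash only
        pairs.append((word, tag))
    return pairs

def words_with_tag(pairs, prefix):
    """Select the words whose tag starts with the given prefix, e.g. 'JJ'."""
    return [word for word, tag in pairs if tag.startswith(prefix)]
```

Splitting on the last slash keeps punctuation tokens such as `,/,` and `!/.` intact.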
Step 3: Score the text using the full text algorithm or the keyword based algorithm.
I/PRP thought/VBD it/PRP would/MD not/RB<-1> be/VB as/RB good/JJ<0.844> as/IN it/PRP
was/VBD ,/, because/IN thousands/NNS of/IN people/NNS and/CC reviews/NNS said/VBD it/PRP
would/MD suck/VB !/.
(sentence score = -0.844)
It/PRP was/VBD great/JJ<0.344> ,/, but/CC what/WP it/PRP missed/VBD was/VBD that/IN it/PRP
needed/VBD<-0.140625> to/TO be/VB at-least/JJ an/DT hour/NN longer/RB ,/, because/IN it/PRP
missed/VBD a-little/JJ bit/NN ,/, but/CC it/PRP still/RB<-0.171> rocked/VBD !/.
(sentence score = 0.0104)
I/PRP loved/VBD<0.375> it/PRP !/.
(sentence score = 0.375)
I/PRP thought/VBD it/PRP was/VBD funny/JJ<-0.515> ,/, and/CC as/RB did/VBD the/DT person/NN
next/JJ to/TO me/PRP ,/, when/WRB John/NNP says/VBZ :/: "/`` I/PRP will/MD be/VB
back/RB !/. ””/NN ./.
(sentence score = -0.515)
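The sentence scores above are consistent with a simple rule: a reversing adverb multiplies the next scored word by its own score of -1, and the remaining scores are averaged. The Python sketch below reproduces the four values under that reading. It is an interpretation of the example, not the application's code, and it assumes that the displayed 0.344 and -0.171 are shortened forms of 0.34375 and -0.171875; with those values the second sentence averages to 0.0104.

```python
from statistics import mean

REVERSING_ADVERBS = {"not"}

def sentence_score(scored_words):
    """Average the scores of a sentence's scored words.

    scored_words: (word, score) pairs in sentence order, containing only
    the words that carry a score annotation.
    """
    multiplier = 1.0
    values = []
    for word, score in scored_words:
        if word in REVERSING_ADVERBS:
            multiplier *= score        # "not" carries -1 and reverses the next word
            continue
        values.append(score * multiplier)
        multiplier = 1.0
    return mean(values) if values else 0.0

# Scored words of the four sample sentences (second sentence uses assumed
# unshortened values, see the note above).
SENTENCES = [
    [("not", -1.0), ("good", 0.844)],
    [("great", 0.34375), ("needed", -0.140625), ("still", -0.171875)],
    [("loved", 0.375)],
    [("funny", -0.515)],
]
```

Calling `sentence_score` on each entry of `SENTENCES` and rounding to four decimals gives -0.844, 0.0104, 0.375 and -0.515.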
The application has been tested with the data described above, and the scores of the
movies have been calculated with the two different techniques. The results of the
experiments are shown in Table 4. The keyword algorithm gives scores that lie closer to
the average for each movie’s general score, while the all-words algorithm gives results
closer to the IMDb scores. In the producer and screen writer columns some rows have a score
of 5.25; these scores are default values, used because no keywords were found for these
movies.
Table 4 Experiment Results
5.3 Discussion
The results of the experiment have been compared with each movie’s IMDb score. The IMDb
page of a movie gives only the movie’s general score, so only two comparisons can be made:
the IMDb score against the keyword algorithm’s score, and the IMDb score against the
all-words algorithm’s score. The scores of the other categories cannot be compared because
imdb.com offers no corresponding data.
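One way to put a number on such a comparison is the mean absolute error between an algorithm's general scores and the IMDb scores. The thesis does not state that this measure was used; the Python sketch below only shows how it could be computed, and the score lists are made-up placeholders, not the values of Table 4.

```python
def mean_absolute_error(predicted, reference):
    """Average absolute difference between two equally long score lists."""
    if len(predicted) != len(reference):
        raise ValueError("score lists must have the same length")
    if not predicted:
        raise ValueError("score lists must not be empty")
    return sum(abs(p - r) for p, r in zip(predicted, reference)) / len(predicted)

# Placeholder numbers for illustration only; they are NOT the thesis results.
imdb_scores = [8.0, 6.0, 7.0]
keyword_scores = [6.0, 5.0, 6.0]
all_words_scores = [7.5, 6.5, 7.0]
```

A lower value means that an algorithm's scores lie closer to the IMDb scores on average.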
5.4 Difficulties Encountered

The first difficulty to overcome was learning what web mining is, because it was a new
term for us and for many others in computer science as well. A lot of time was therefore
spent on internet research to learn what had been done and what methodology lies behind the
idea of web mining, especially how to reach the data on the web. Many useful open source
software projects were explored along the way, and they made the work easier. The open
source software created another difficulty, which was integrating those projects into our
work. This difficulty was overcome by reading the documentation of the software and
analyzing its code.
6. Conclusion
In conclusion, opinion mining in Web 2.0 is very important and the area is developing
day by day, because with Web 2.0 the amount of user-created content on the web has
increased enormously and collecting meaningful information from this data has become an
important task. In this work, an opinion mining application was created for calculating
movie scores from blog posts. The experimental results show that this task is not an easy
one. Some of the results are close to the real scores, but some are far from expectations.
From this work we learned that the unsupervised approach to sentiment analysis does not
give enough accuracy. We have searched and investigated many works on this subject, and we
believe that a supervised approach might produce more accurate results for sentiment
analysis.
7. References
[1] Technorati, Inc. http://technorati.com; Available at 20.05.2009
[2] A Content based Algorithm for Blog Ranking. Jie Shen, Yan Zhu, Hui Zhang, Chen
Chen, Rongshuang Sun, Fayan Xu. Yangzhou University, Jiangsu Province, China, p. 1, 2008
International Conference on Internet Computing in Science and Engineering.
[3] Blog Mining through Opinionated Words. Giuseppe Attardi (attardi@di.unipi.it) and
Maria Simi (simi@di.unipi.it). Dipartimento di Informatica, Università di Pisa, p. 1, 2006.
[4] Sentiment Classification Using Phrase Patterns. Zhongchao Fei, Jian Liu, and Gengfeng
Wu. Proceedings of the Fourth International Conference on Computer and Information
Technology (CIT’04).
[5] Sentiment Mining in WebFountain. Jeonghee Yi and Wayne Niblack. Proceedings of
the 21st International Conference on Data Engineering (ICDE 2005).
[6] Super Parsing: Sentiment Classification with Review Extraction. Jian Liu, JianXin Yao,
and GengFeng Wu. Proceedings of the Fifth International Conference on Computer and
Information Technology (CIT’05).
[7] Opinion Mining in e-Learning System. Dan Song, Hongfei Lin, and Zhihao Yang.
International Conference on Network and Parallel Computing (IFIP 2007).
[8] The Unified collocation Framework for Opinion Mining. Yun-Qing Xia, Rui-Feng Xu,
Kam-Fai Wong, and Fang Zheng. Proceedings of the Sixth International Conference on
Machine Learning and Cybernetics, Hong Kong, 19-22 August 2007.
[9] AMAZING: A sentiment mining and retrieval system. Qingliang Miao, Qiudan Li, and
Ruwei Dai. Expert Systems with Applications (2008) doi:10.1016/j.eswa.2008.09.035.
[10] Opinion Mining. Bing Liu. Department of Computer Science, University of Illinois at
Chicago, 851 S. Morgan Street, Chicago, IL 60607-0753.
[11] Sentiment classification of online reviews to travel destinations by supervised
machine learning approaches. Qiang Ye, Ziqiong Zhang, and Rob Law. Expert Systems
with Applications (2008) doi:10.1016/j.eswa.2008.07.035.
[12] Movie review mining and summarization. Li Zhuang, Feng Jing, Xiao-yan Zhu.
[13] WordNet. http://wordnet.princeton.edu; Available at 20.05.2009
[14] SENTIWORDNET: A Publicly Available Lexical Resource for Opinion Mining. Andrea
Esuli and Fabrizio Sebastiani.
[15] Arachnode.Net Database Diagrams: http://arachnode.net/media/g/database_diagrams/default.aspx;
Available at 20.05.2009
[16] Porter Stemmer: http://tartarus.org/~martin/PorterStemmer/; Available at 20.05.2009
[17] Zed Graphs: http://zedgraph.org/wiki/index.php?title=Main_Page; Available at 20.05.2009
[18] NetSpell: http://sourceforge.net/projects/netspell/; Available at 20.05.2009