
POLITECNICO DI MILANO

Master of Science in Computer Engineering
Department of Computer Engineering

REPUTATION ANALYSIS ON THE WEB 2.0

Supervisor: Prof. Lorenzo Cantoni
Master Thesis of: Sarp Erdag, matricola 722413

Academic Year: 2008 - 2009

Thanks to my family, especially my father, for supporting me through all aspects of my two-year graduate studies.

Special thanks to Davide Eynard and Alessandro Inversini, who patiently guided and advised me during the creation of this work.

Executive Summary
This report discusses and analyzes the technical side of the Web2rism project, which is being carried out by a team of researchers and developers at the Webatelier lab, part of the Università della Svizzera Italiana in Switzerland. The project is an e-tourism centric online reputation monitoring software whose aim is to measure the popularity of touristic destinations and services. This research within the project mainly focuses on the methodologies, tools, frameworks and custom applications used and designed within the context of the ORM software being developed. The report begins with an introduction to what ORM is and why it is becoming an important issue businesses should consider. Later on, existing solutions trying to solve common ORM problems are addressed and a comparison between them is made. Reaching a conclusion from the best alternatives available, the topic is then bound to how we approached building an ORM application for the tourism industry. Finally, the ingredients for building up the model are presented, and our decisions and details on how we proceeded with the implementation are explained. The system architecture behind the project relies heavily on data mining and web scraping techniques; therefore we have tested a broad range of tools and frameworks to ease the operation. The project is also built as an advanced web application that uses semantic technologies and knowledge bases. Up to the current phase of the development process, I have been able to create six different scrapers that gather data from different social media and UGC sites, set up an RDF store for storing the collected data, and build an API over our knowledge base that is able to do the necessary analysis and shaping of raw data before it is presented to the user. There is also a manager application that enables easy administration of the whole data gathering process. Continuing, the thesis report discusses my implementation strategies and methodologies and presents the evaluation of the chosen and applied technologies.

As a final conclusion reached by the research studies and a completed phase of software development, we were satisfied with the system architecture and the model designed for an e-tourism centric ORM application. The final chapter of the report discusses the possible future steps that should be taken to take the project to the next level.

Contents

Executive Summary

1 Introduction
1.1 Objectives and motivations
1.2 The Web2rism project
1.3 My Participation
1.4 Terms and Abbreviations
1.5 Citations

2 Background
2.1 Online Reputation Analysis & Management
2.1.1 Short History of ORM
2.1.2 Scope of ORM
2.1.3 A Business Approach to Online Reputation Management and Monitoring
2.2 Existing Solutions

3 My Approach
3.1 The Positioning of Web2rism
3.2 Reputation Analysis Methodology
3.2.1 Content
3.2.2 Sentiments
3.2.3 Authorship
3.2.4 Query expansion
3.2.5 Location Factor
3.3 Technical Methodology

4 Implementation
4.1 Tools and Frameworks Used
4.1.1 Data & Knowledgebase Tools
4.1.2 Programming Languages and Web Frameworks
4.1.3 APIs and Wrappers
4.1.4 User Interface Design
4.1.5 Sentiment Analysis
4.2 Scrapers, Scraping and Parsing Techniques
4.2.1 Web2rism Scrapers
4.2.2 A Custom Scraper Example
4.3 System Architecture
4.3.1 The MVC Architecture and its Usage
4.3.2 Layers and Components
4.3.3 Project Files Organization
4.3.4 System Workflow

5 Tests and Evaluations
5.0.5 Functional Scrapers
5.0.6 KnowledgeBase & Scraper Management
5.0.7 System Performance

6 Conclusion
6.1 Current Status of the Work
6.2 Future Work
6.2.1 Extending the Data Gathering Layer
6.2.2 Optimization & Scalability Issues
6.2.3 A Logging System
6.2.4 Easier Interaction with the RDF Store
6.2.5 More Organized Author and Location Classification
6.2.6 UI Enhancements

List of Figures

2.1 Reputation Analysis Results of Perspective Hotel Singapore on Brand Karma
2.2 Nielsen's approach for Advanced Buzz Management
2.3 Comparison of the best of existing ORM tools using Google Insights
3.1 Content Analysis for Reputation Measurement
4.1 System Architecture
4.2 ScraperManager application UML Diagram
4.3 Scraper Manager UI
4.4 An example of a SPARQL query used to get data from the KB
5.1 A view from Google BlogSearch Scraper's findings
5.2 A view from Technorati Scraper's findings
5.3 A view from Twitter Scraper's findings
5.4 A view from Flickr Scraper's findings
5.5 A view from YouTube Scraper's findings
5.6 A view from WikiTravel Scraper's findings
5.7 A view from a call to the KB API to get Tweets about Italy

Chapter 1

Introduction
This chapter acts as an introductory entry to the whole report. First, my motivation for being involved in the project related to my thesis is explained, and its connection with my objectives is presented. Next, general information about the whole project and my participation within it is expounded. Lastly, the structure of the report is given and several terms and acronyms that the reader may come across while reading the whole document are listed.

1.1 Objectives and motivations

During my academic years, in addition to computer and software engineering itself, I have always been interested in the human side of things. Software engineering and creating useful products is a team's work. Although one can code and develop applications on one's own, in order to perfect a product, market it and monetize it, a group of different people with different expertise is required. In this manner, I interpret the phrase "human side of things" as communication between the stakeholders of a product being engineered. I believe a good engineer has to have excellent communication skills and an entrepreneurial point of view to take himself to higher positions. On this basis, creating my own product, marketing it and managing the business around it has been my dream. Moreover, better and clearer communication is not only required in the development phase but also after the product is released. Especially with the emergence of Web 2.0 and social media, in the web applications sector, it is now much more crucial to continuously be in contact with the users, customers and fans who are interacting with a software application.


During the last year of my studies, I started to grow an interest in online reputation management and analysis tools that give us detailed feedback on what people are already saying about a specific brand in social media, blogs, comments, online communities and forums. I also had in-depth experience in web development and I was in search of a big project that I could be a part of, both in technical and product design terms. As a result, I decided to do my thesis within the Web2rism project that is currently being developed at the Webatelier lab in Università della Svizzera Italiana. As can be deduced from its name, Web2rism (Web 2.0 + Tourism) is a project that aims to study the reputation of a destination starting from user-generated content available online. The project requires deep research into how the data collected from the web can be turned into a reputation indicator. There is also the need to use sentiment analysis methods to actually understand how people are talking about a certain entity on the web. In addition, the project team was using knowledge bases instead of traditional, relational databases for Web2rism, so there is a strong connection to the Semantic Web within the project. All these aspects of the study have been extremely appealing to me and they suited my areas of interest well.

1.2 The Web2rism project

Developed by the Webatelier lab at the Università della Svizzera Italiana (USI, Lugano, Switzerland) and funded by the Swiss Confederation's innovation promotion agency CTI and PromAX, the Web2rism project aims to bring together the latest trends in tourism and touristic destination management on the world wide web. Webatelier.net is a laboratory of the Faculty of Communication Sciences of the Università della Svizzera Italiana, directed by Prof. Lorenzo Cantoni. The lab deals with a broad range of topics related to new media in communication and is specialized in research and development in the field of online communication and ICT in general, stressing the human side of it. The development process of Web2rism started in January 2009 and, as a two-year project, the planned finish date is January 2011.


1.3 My Participation

The development of Web2rism holds the characteristics and requirements of creating an enterprise-level project. Although it is being developed in an academic environment, because several companies and associations exist as stakeholders of the project, their needs and demands have to be analyzed really well and converted into a solution that will fill those needs. The Web2rism development team at Webatelier was divided into two groups: one which deals with the communication with the stakeholders and develops a model that has to be converted into a software application, and another that actually develops the software. In this manner, there has been a strong technical side to the whole project. My participation came on stage at this point. The technical background of Web2rism consists of four main layers, and my contribution has been in the creation and use of the software tools that were going to be used in the data gathering, storing and analysis parts. All the technical aspects of the project, including the layers that are mentioned here, will be explained in detail in the upcoming Implementation chapter.

1.4 Terms and Abbreviations

Before getting into deeper details about the project, it is important to clearly define some terms that will be used throughout this report.

Web scraping: Also known as Web harvesting or Web data extraction, scraping is a computer software technique of extracting information from websites. Usually, such software programs simulate human exploration of the Web by either implementing the low-level Hypertext Transfer Protocol (HTTP) or embedding certain full-fledged Web browsers. Web scraping is closely related to Web indexing, which indexes Web content using a bot and is a universal technique adopted by most search engines. In contrast, Web scraping focuses more on the transformation of unstructured Web content, typically in HTML format, into structured data that can be stored and analyzed in a central local database or spreadsheet.

Social Media: The media designed to be disseminated through social interaction, created using highly accessible and scalable publishing techniques.


Social media supports the human need for social interaction, using Internet and Web-based technologies to transform broadcast media monologues (one to many) into social media dialogues (many to many). It supports the democratization of knowledge and information, transforming people from content consumers into content producers. Businesses also refer to social media as user-generated content (UGC) or consumer-generated media (CGM).

Wiki: A wiki is a website that allows the easy creation and editing of any number of interlinked Web pages, using a simplified markup language or a WYSIWYG text editor, within the browser.

Sentiment Analysis: Also known as opinion mining, Sentiment Analysis refers to a broad (definitionally challenged) area of natural language processing, computational linguistics and text mining. It aims to determine the attitude of a speaker or a writer with respect to some topic. The attitude may be their judgment or evaluation (see appraisal theory), their affective state (that is to say, the emotional state of the author when writing) or the intended emotional communication (that is to say, the emotional effect the author wishes to have on the reader).

CGM: Abbreviation for Consumer Generated Media.

JSON: Abbreviation for JavaScript Object Notation. JSON is a lightweight data-interchange format. It is easy for humans to read and write, and easy for machines to parse and generate. It is based on a subset of the JavaScript Programming Language, Standard ECMA-262 3rd Edition (December 1999). JSON is a text format that is completely language independent but uses conventions that are familiar to programmers of the C family of languages, including C, Java, JavaScript, Perl, Python, and many others. These properties make JSON an ideal data-interchange language.
Official Website: http://www.json.org

XML: Abbreviation for Extensible Markup Language. XML is a simple, very flexible text format. Originally designed to meet the challenges of large-scale electronic publishing, XML is also playing an increasingly important role in the exchange of a wide variety of data on the Web and elsewhere. XML has a set of rules for encoding documents electronically.


Official Website: http://www.w3.org/XML/

AJAX: Abbreviation for Asynchronous JavaScript and XML. AJAX is a group of interrelated web development techniques used on the client side to create interactive web applications or rich Internet applications. With AJAX, web applications can retrieve data from the server asynchronously in the background without interfering with the display and behavior of the existing page.

RSS: Abbreviation for Really Simple Syndication. RSS is a family of web feed formats used to publish frequently updated works, such as blog entries, news headlines, audio, and video, in a standardized format.

RDF: Abbreviation for Resource Description Framework. It is a family of World Wide Web Consortium (W3C) specifications originally designed as a metadata data model. It has come to be used as a general method for the conceptual description or modeling of information that is implemented in web resources, using a variety of syntax formats.
Official Website: http://www.w3.org/RDF/

RDFS: Abbreviation for RDF Schema. RDFS (also abbreviated as RDF(S), RDF-S, or RDF/S) is an extensible knowledge representation language, providing basic elements for the description of ontologies, otherwise called Resource Description Framework vocabularies, intended to structure RDF resources.

OWL: Abbreviation for Web Ontology Language. OWL is a family of knowledge representation languages for authoring ontologies, and is endorsed by the World Wide Web Consortium.

DBMS: Abbreviation for Database Management System. A DBMS is a set of computer programs that controls the creation, maintenance, and use of the database of an organization and its end users.

RDBMS: Abbreviation for Relational Database Management System. An RDBMS is a DBMS in which data is stored in the form of tables and the relationships among the data are also stored in the form of tables.

SQL: Abbreviation for Structured Query Language. SQL is a database computer language designed for managing data in relational database management systems.


SPARQL: An RDF query language; its name is a recursive acronym that stands for SPARQL Protocol and RDF Query Language. It was standardized by the RDF Data Access Working Group (DAWG) of the World Wide Web Consortium, and is considered a key semantic web technology. SPARQL allows a query to consist of triple patterns, conjunctions, disjunctions, and optional patterns.

API: Abbreviation for Application Programming Interface. An interface in computer science that defines the ways by which an application program may request services from libraries and/or operating systems. An API determines the vocabulary and calling conventions the programmer should employ to use the services.

URL: Abbreviation for Uniform Resource Locator. A URL is a subset of the Uniform Resource Identifier (URI) that specifies where an identified resource is available and the mechanism for retrieving it. In popular language, a URI is also referred to as a Web address.

SDK: Abbreviation for Software Development Kit. An SDK or devkit is typically a set of development tools that allows a software engineer to create applications for a certain software package, software framework, hardware platform or similar platform.

grep: Abbreviation for Global Regular Expression Print. Grep is a command-line text search utility originally written for Unix.

LAMP: The acronym LAMP refers to a solution stack of software, usually free and open source software, used to run dynamic Web sites or servers. The original expansion is as follows: Linux, referring to the operating system; Apache, the Web server; MySQL, the database management system (or database server); plus one of several scripting languages: Perl, PHP or Python.

CRON: Cron is a time-based job scheduler in Unix-like computer operating systems. Cron is short for Chronograph and it enables users to schedule jobs (commands or shell scripts) to run automatically at a certain time or date. It is commonly used to perform system maintenance or administration.


CGI: Abbreviation for Common Gateway Interface, which is a standard protocol for interfacing external application software with an information server, commonly a web server.

MVC: Abbreviation for Model View Controller, which is an architectural pattern used in software engineering. The pattern isolates business logic from input and presentation, permitting independent development, testing and maintenance of each. Each model is associated with one or more views (projections) suitable for presentation (not necessarily visual presentation).

ORM (1): Abbreviation for Object-Relational Mapping. ORM in computer software is a programming technique for converting data between incompatible type systems in relational databases and object-oriented programming languages. This creates, in effect, a virtual object database that can be used from within the programming language.

ORM (2): Abbreviation for Online Reputation Management.

REST: Abbreviation for Representational State Transfer. REST is a style of software architecture for distributed hypermedia systems such as the World Wide Web. REST-style architectures consist of clients and servers.

GUI: Abbreviation for Graphical User Interface. It is a type of user interface which allows people to interact with electronic devices such as computers; hand-held devices such as MP3 players, portable media players or gaming devices; and household appliances and office equipment with images rather than text commands.

CSS: Abbreviation for Cascading Style Sheets, which is a style sheet language used to describe the presentation semantics of a document written in a markup language.

WYSIWYG: Abbreviation for What You See Is What You Get. The term is used in computing to describe a system in which content displayed during editing appears very similar to the final output, which might be a printed document, web page, slide presentation or even the lighting for a theatrical event.

UGC: Abbreviation for User-Generated Content. Also known as consumer-generated media (CGM) or user-created content (UCC), it refers to various kinds of media content, publicly available, that are produced by end-users.


Most of the popular media sites like YouTube, Wikipedia, Facebook or Twitter are examples of UGC sites where the members of the sites provide content and make the site grow.

W3C: Abbreviation for The World Wide Web Consortium. W3C is the main international standards organization for the World Wide Web.
Official Website: http://www.w3.org/

1.5 Citations

1. Liu B., Web Data Mining - Exploring Hyperlinks, Contents and Usage Data. Springer, December 2006. (http://www.cs.uic.edu/~liub/WebMiningBook.html)
2. Baumgartner R., Web Data Extraction System. (http://www.cs.washington.edu/homes/gatter/download/Web Data Extraction System.pdf)
3. Baraglia, R.; Silvestri, F., Dynamic personalization of web sites without user intervention, 2007.
4. Google. Using JSON with Google Data APIs. Retrieved July 3, 2009.
5. Schrenk, Michael (2007). Webbots, Spiders, and Screen Scrapers. No Starch Press. ISBN 978-1-59327-120-6.
6. GooSeeker, Semantic annotation based web scraping. (http://www.gooseeker.com/en/node/knowledgebase/freeformat)
7. Django FAQ. Lawrence Journal-World. Retrieved 2008-04-01.
8. Django Book 2.0. (http://www.djangobook.com/en/2.0/)
9. Django threading review. (http://code.djangoproject.com/wiki/DjangoSpecifications/Core/Threading)
10. What is Python Good For?. General Python FAQ. Python Software Foundation. Retrieved 2008-09-05.
11. General Python FAQ. python.org. Python Software Foundation. Retrieved 2009-06-27. (http://www.python.org/doc/faq/general/)


12. Boodhoo, Jean-Paul (August 2006). Design Patterns: Model View Presenter. Retrieved 2009-07-07.
13. World Wide Web Consortium (December 9, 2008). The Forms Working Group. Retrieved 2009-07-07.
14. Resource Description Framework (RDF) Model and Syntax Specification. (http://www.w3.org/TR/PR-rdf-syntax/)
15. Harth, Andreas; Decker, Stefan. Optimized Index Structures for Querying RDF from the Web. 3rd Latin American Web Congress, Buenos Aires, Argentina, October 31 to November 2, 2005, pp. 71-80. (http://sw.deri.org/2005/02/dexa/yars.pdf)
16. Grune, Dick; Jacobs, Ceriel J.H., Parsing Techniques - A Practical Guide. VU University Amsterdam, Amsterdam, The Netherlands.
17. Schawbel, D. (2009). Top 10 Reputation Tracking Tools Worth Paying For. (http://mashable.com/2008/12/29/brand-reputation-monitoring-tools/)
18. Fernando, A. (2004). Big Blogger is Watching You! Reputation Management in an Opinionated, Hyperlinked World. Communication World.
19. Thompson, N. (2003). More Companies Pay Heed to Their "Word of Mouse" Reputation. New York Times.
20. Kinzie, Susan; Nakashima, Ellen (July 2, 2007). Calling In Pros to Refine Your Google Image. The Washington Post.
21. W3C Semantic Web Activity News - SPARQL is a Recommendation. W3.org. 2008-01-15. Retrieved 2009-10-01. (http://www.w3.org/blog/SW/2008/01/15/sparql_is_a_recommendation)
22. API Documentation from Flickr. (http://www.flickr.com/services/api/)
23. API Documentation from Technorati. (http://technorati.com/developers/api/)
24. API Documentation from Twitter. (http://apiwiki.twitter.com/)
25. BeautifulSoup Documentation. (http://www.crummy.com/software/BeautifulSoup/documentation.html)
26. Google Data API Documentation. (http://code.google.com/apis/gdata/overview.html)


27. Joseki Server Documentation. (http://www.joseki.org/documentation.html)
28. W3C, SPARQL Query Language for RDF. W3C Recommendation, 15 January 2008. (http://www.w3.org/TR/rdf-sparql-query/)
29. Jena Documentation. (http://jena.sourceforge.net/documentation.html)
30. Google Visualization API Documentation. (http://code.google.com/intl/it-IT/apis/visualization/documentation/)

Chapter 2

Background
This chapter's aim is to provide the necessary background to better understand the characteristics of the project I worked on. It begins by explaining what Online Reputation Analysis and Management means and how it is used in several different industries. Later on, a short history and the scope of the practice are presented. This section is then connected to a part explaining how ORM systems are affecting companies and brands. Finally, existing solutions from around the world are analyzed in detail and my opinions about each of them are given. In addition, the last part contains information about how they inspired me and how I used them in the technical building blocks of the Web2rism project.

2.1 Online Reputation Analysis & Management

Online reputation management, or ORM, is the practice of consistent research and analysis of one's personal or professional, business or industry reputation as represented by the content across all types of online media. It is also sometimes referred to as online reputation monitoring, maintaining the same abbreviation. ORM is a relatively new industry but has been brought to the forefront of professionals' consciousness due to the overwhelming nature of both amateur UGC and professional journalistic content. The type of online content monitored in ORM spans professional journalism sponsored by traditional news and media giants as well as user-created and user-generated blogs, ratings, reviews, and comments, and all manner of specialized websites. These websites can be about any particular subject, such as an individual, group, company, business, product, event, concept, or trend.



ORM partly formed from a need to manage consumer-generated media (CGM). The appeal of reputation mechanisms is that, when they work, they facilitate cooperation without the need for costly enforcement institutions. They have, therefore, the potential of providing more economically efficient outcomes in a wide range of moral hazard settings where societies currently rely on the threat of litigation in order to induce cooperation. The rising importance of online reputation systems not only invites, but also necessitates rigorous research on their functioning and consequences. Marketing and social media experts see ORM as the future of branding, and it is an absolute necessity for any company striving to protect the positive image and brand equity they've worked so hard to achieve.

2.1.1 Short History of ORM


As CGM grew with the rise of social media and other similar user-based online content aggregators, it began to affect search results more and more, bringing with it increased attention to the matter of managing these results. eBay was one of the first web companies to harness the power of CGM feedback. By using user-generated feedback ratings, buyers and sellers were given reputations that helped other users make purchasing and selling decisions. ReputationDefender was one of the first companies that offered to proactively manage online reputations. ClaimID is another company that early on presented services designed to promote personal ORM. Other ORM tools include Trackur, SinoBuzz, BrandsEye and Google Alerts. The UK market for ORM was expected to grow by around 30% in 2008, to an estimated value of $60 million.

2.1.2 Scope of ORM

Specifically, the online media that is monitored in ORM includes:

- Social networks (e.g. Facebook, MySpace, FriendFeed)
- Social news/bookmarking sites (e.g. Delicious, Digg)
- Traditional/mainstream websites
- Consumer review sites (e.g. Yelp, Epinions)
- Sites like PersonRatings.com which allow reviews of individuals
- Collaborative research sites (e.g. Yahoo Answers, Rediff Q&A)
- Independent discussion forums
- User-generated content (UGC) / consumer-generated media (CGM)
- Blogs
- Blogging communities (e.g. Open Diary, LiveJournal, Xanga)
- Microblogs (e.g. Twitter, Identi.ca, Jaiku, Plurk)

2.1.3 A Business Approach to Online Reputation Management and Monitoring


This section lists how ORM is being used by companies and how it is useful for their brand identities. Before the development of the web, news was slow moving and organizations could take their time to develop structured responses to problems. Currently, rapid developments in CGM sites mean that the general public can quickly air their views. These views can make or break a brand. Consumers trust these published opinions and base their buying decisions on them. Any information available to potential clients affects a company's reputation and its customers' buying decisions. Similarly, ex-employees and brand activists can easily get their personal viewpoints out there, and competitors can spread malicious rumors and lies about a company or brand in the hopes of stealing its market share. These types of unsubstantiated reporting can affect a company's corporate image. Sites containing these kinds of information are indexed by search engines and appear in search results for brand names. More importantly, the information can spread to the traditional media, compounding the damage. The goals of Online Reputation Management are high rankings and indexing in the search engines for all positively associated web sites. The result is an increase in a brand's overall positive web presence, which will help the company own the top spots of the search engine rankings for its brand.


ORM enables companies to protect and manage their reputation by becoming actively involved in the outcome of search engine results through a three-step process. The three steps involved in Online Reputation Management are:

1. Monitoring and tracking what is being said online. Monitoring gives an immediate heads-up if adverse information is appearing, and it is an essential and useful tactic for controlling adverse information on the search engines and social media sites.

2. Analyzing how the visible information affects a brand. At this stage of the analysis, it is possible to assess the present position of a brand.

3. Influencing. This stage is when the company starts influencing the results by participating in the conversation and eliminating negative sites.

This report, as its name suggests, will focus mostly on Online Reputation Monitoring, the first stage of an Online Reputation Management effort. The letter M in the abbreviation ORM stands both for Management and Monitoring; therefore, I decided to use the word analysis and will continue with the phrase Reputation Analysis, which corresponds to the first stage of the ORM process.

2.2 Existing Solutions

This section describes other people's approaches to the reputation analysis issue. Each of them has specific pros and cons, targeting different market areas. For each tool, first a general description, usually extracted from the official website of the service, is presented; then my personal comments, how the tool inspired me and how I used it in the technical building blocks of the Web2rism project are explained.

Google Trends

Google Trends is a public web facility of Google Inc., about Google Search, that shows how often a particular search term is entered relative to the total search volume across various regions of the world, and in various languages.


Google Trends also allows the user to compare the volume of searches between two or more terms. An additional feature of Google Trends is its ability to show news related to the search term overlaid on the chart, showing how news events affect search popularity.

Personal Comment: Google Trends is a great tool that depends on the most popular and powerful search engine on the planet. However, it is not built from a semantic point of view. It just measures the density of searches connected to given keywords. In this manner, we cannot really speak of a reputation analysis tool but rather of a buzz density tracker.

Website: http://www.google.com/trends

Google Insights for Search

Google Insights for Search is a service by Google similar to Google Trends, providing insights into the search terms people have been entering into the Google search engine. Unlike Google Trends, Google Insights for Search provides a visual representation of regional interest on a map. It has been noted, however, that term order is important, and that different results will be found if the keywords are placed in a different order.

Personal Comment: In many aspects, Google Insights has a lot of similarities with Google Trends. There have been quite a lot of discussions about why Google released a second, similar product when it already had Google Trends, but it was then announced in the Google AdWords blog that Insights was slightly more targeted towards advertisers with Google. For instance, it contains categories (alternatively called verticals) to restrict your terms to. It also shows top searches and top rising searches in the neighborhood of the keywords you enter. Overall, this seems to be a huge extension to Google Trends, Google Ad Planner, and the tools available to advertisers within AdWords. For this tool, Google does not provide an API; it is only possible to get a CSV or Excel format report of the data output. In the Web2rism project, the data analytics provided by Google Insights were used heavily and we had to manually scrape and parse the CSV file. It would be easier for third-party developers to create tools that use Google Insights if the system had an API.

Website: http://www.google.com/insights


Image-1: Web Search Interest data from Google Insights - Comparing Microsoft vs Google

Google Alerts

Google Alerts is a service offered by the search engine company Google which notifies its users by email (or as a feed) about the latest web and news pages of their choice. Google currently offers six types of alert searches: News, Web, Blogs, Comprehensive, Video and Groups. A News alert is an email that lets the user know if new articles make it into the top ten results of his/her Google News search. A Web alert is an email that lets the user know if new web pages appear in the top twenty results for the user's Google Web search. A News & Web alert is an email that lets the user know when new articles related to his/her search term make it into the top ten results for a Google News search or the top twenty results for a Google Web search. A Groups alert is an email that lets the user know if new posts make it into the top fifty results of the Google Groups search.

Personal Comment: The tool is not an all-in-one reputation management tool. It is only an alert system that keeps the user informed over time about a keyword he/she has specified.


It can be of better use when integrated with Google's other reputation and trend analysis services. Generally speaking, rather than offering a whole solution, Google provides separate, distributed tools. This may actually be a part of their company vision, since Google's attitude has always been about letting people reach any data in an easier way. They may not aim to create a powerful ORM suite but rather to provide tools that let the world of developers create their own applications.

Website: http://www.google.com/alerts

Circos Brand Karma

Circos is a leading technology company that specializes in extracting brand sentiments from the actual text in consumer-written reviews and comments. The company's proprietary technologies apply semantic analysis to social media, surfacing rich insights about brands based on personal preferences. Circos specializes in social media for the hotel and tourism industries, and its Brand Karma product is helping leading hotel brands increase revenue, improve search engine optimization, and credibly brand themselves online.

Personal Comment: Circos Brand Karma has quite a lot of similarities with the Web2rism project, since it also focuses on the tourism industry. Also, in terms of UI design, it has been a good example and inspiration for how I can build up the look & feel of the reporting pages of our project. Another good point noted about Brand Karma is that it does not only give the popularity of a touristic destination or hotel's name but also provides real business results like increased revenue, customer satisfaction, and loyalty.

Website: https://brandkarma.circos.com/

Radian6

Radian6 offers a solution where you can set up certain keywords to monitor on a dashboard, automatically track the keywords on blogs, image sharing sites and microblogging sites, and then have it report back to you with an analysis of the results. Data is captured in real time as discovered and delivered to dashboard analysis widgets. The solution covers all forms of social media including blogs, top video and image sharing sites, forums, opinion sites, mainstream online media and emerging media like Twitter. Conversational dynamics are constantly tallied to track the viral nature of each post.


Figure 2.1: Reputation Analysis Results of Perspective Hotel Singapore on Brand Karma

Personal Comment: Radian6 monitors a considerably big amount of social media and trend sites. The latest news on its official site was that it has now integrated WebTrends web analytics and SalesForce.com CRM information. Radian6's analysis works over website addresses instead of keywords. It is more suited for web analytics experts who are trying to learn more about the status of the reputation of their websites.

Website: http://www.radian6.com/

TNS Cymfony

TNS Cymfony offers the Maestro Platform, which is built on a Natural Language Processing engine that automatically identifies, classifies, qualifies and benchmarks important people, places, companies and topics for you. The platform is able to distinguish between different media sources, such as traditional media and social media. Cymfony's differentiation is that their engine dissects articles, paragraphs and sentences to determine who and what is being talked about, whether something or someone is a key focus or a passing reference, and how the various entities mentioned relate to one another.

Personal Comment: Although I was not able to test the Maestro software (a demo is only available upon approved request), I liked TNS Cymfony's approach because they are not only doing keyword-based analysis. Their technology uses a more sophisticated form of information extraction based on detailed grammatical analysis of the text.


Grammar-based approaches eliminate irrelevant content and are far more intelligent than keyword searching. This provided us with examples of how we can connect Web2rism to a sentiment analysis / NLP system. See section 4.1.5 of this report for more details about the Sentiment Analysis work.

Website: http://www.cymfony.com/solutions/our-approach/orchestra-platform

Sentiment Metrics

Sentiment Metrics has a reputation management tool that, just like the other services mentioned, helps you monitor what is being said about you, your brand and your products across blogs, forums and news sites. The reports reviewed by using this software focus on sentiment, which tells you if a mention is positive, negative or neutral.

Personal Comment: The Sentiment Metrics analysis tool is also demoed only upon booking, which is why I was not able to do a hands-on test. As far as I can tell from the company website, they offer quite a powerful tool and they were able to get it used by big brands like Samsung, LG and Honda. They are also using sentiment analysis technologies and it is indicated that they are working with academics. In this way, it has similarities with Web2rism, since Web2rism is also being developed in a university's academic environment.

Website: http://www.sentimentmetrics.com/

Nielsen Buzzmetrics

Nielsen offers Buzzmetrics, which will supply you with key brand health metrics and consumer commentary from all consumer-generated media. They also have ThreatTracker, which alerts you to real-time online reputation threats and gives you a scorecard to show you how you are doing relative to the competition. Nielsen uncovers and integrates data-driven insights culled from nearly 100 million blogs, social networks, groups, boards and other consumer-generated media platforms.

Personal Comment: Nielsen's harvest, clean, analyze, find-relevancy approach to buzz management shows resemblances with Web2rism. As explained in section 3.2 (Reputation Analysis Methodology), Web2rism has a very similar workflow for gathering the data and converting it into a reputation measurement.


Moreover, Nielsen's approach looks more professional and enterprise-level when compared to the other tools.

Website: http://en-us.nielsen.com/

Figure 2.2: Nielsen's approach for Advanced Buzz Management.

Cision

Cision offers the Cision Social Media service, which claims to monitor over 100 million blogs, tens of thousands of online forums, and over 450 leading rich media sites. One of the main benefits, just like with Nielsen Buzzmetrics, is that these companies have been monitoring and measuring traditional media sites for decades, so they can provide a more comprehensive solution across the board. Cision's product is unique in that it offers 24/7 buzz reporting. Their service is powered by Radian6, which is mentioned above. They also have a dashboard and daily reports, just like the other services, where they tell you what's going on with your brand twice a day through email.

Personal Comment: Cision's product uses a mix of the best tools and they provide a more umbrella-like approach to the industry. They have several different packages of their social media intelligence software and none of them could be tested for free, so I was only able to deduce what kind of a service they provide by browsing the company / product website and examining the screenshots.


of a service they provide by browsing the company / product website and examine the screenshots. Website: http://www.cision.com/en/ BrandsEye BrandsEye was developed by Quirk, and has been used internally by the development team and their clients as its algorithm was tested and tweaked. It has been recognised as signifying a massive leap from the pre-existing ORM tools that merely track and monitor the brand buzz. BrandsEye not only traces and assesses your online presence but provides you with a real-time Reputation Score for both you and your competitors. This allows companies to monitor the sentiments and opinions of their own customers, while making educated judgements about how to respond to attacks on their online reputation. Personal Comment: Brands Eye oers reputation management packages for bloggers, small businesses and enterprises. The tool tracks every online mention of your brand, giving you a score that accurately reects the state of your reputation over time. Part of the dierentiation is that you can actually tag mentions of your brand and rank them in terms of a number of pre-determined criteria. Brands eyes target market spans a bigger portion of users in contrast to the other tools explained above and it has aordable solutions for even non-prot fun bloggers. Website: http://www.brandseye.com/

To reach a final conclusion about which existing solution is the most popular one in use, I did an interesting study and used Google Insights itself. I did not include market-specific tools like Circos Brand Karma or general keyword-based volume measurers like Google Trends or Insights, in order to see the best among enterprise-level buzz management suites. As a result, the graph shown below was generated.


Figure 2.3: Comparison of the best of existing ORM tools using Google Insights

Chapter 3

My Approach
In this chapter, the approach and methods that I intended to use to create the ORM software of the Web2rism project are explained. In the first section, the positioning of the whole project and how a tourism-focused ORM tool should function are discussed; next, the reputation model and how I decided to interpret the business requirements into pieces of code are presented.

3.1 The Positioning of Web2rism

This section basically explains how ORM should be used in the Web2rism project. Online Reputation Management is a relatively new industry and there are good opportunities both in re-defining the concept and in the creation of tools. In the business world, a strong reputation is a company's greatest asset. But in the web world, reputation no longer hangs on what's real. It hangs on a perception of reality created by CGM and UGC. Every day, for good or bad, someone, somewhere is talking about someone's business. While constructive criticism is always welcome, malicious attacks on company reputations can spread like a global wildfire without the company owners even knowing it. When we look at the already existing solutions, we see that most ORM solutions are dealing with broader ranges of audiences and are not focused on certain markets. Since these tools can, in theory, be used to track anything that is representable by at least one keyword on the web, the particular sector of a company using ORM software does not become really important.


In this manner, the specification of an e-tourism centric ORM solution is really vital. Questions like "How does the reputation of a hotel differ from the reputation of a newly released car?" or "How and where are people talking about a touristic experience they had?" were the starting points at Webatelier. We found that checking tourism blogs and hotel review sites is much more important than going directly to YouTube or Flickr to measure the buzz. The important thing here is that people can be talking a lot about something, but that does not indicate whether what they are talking about is a good quality product or service. The tracking and calculation of reputation should depend on something whose rating can be measured. A good way of guessing what potential customers will require and ask from an e-tourism ORM was to have them use our development versions. Since Lugano is already a highly touristic area, Webatelier was able to create the necessary contacts with local touristic businesses, let them test our software, and get frequent feedback from them.

3.2 Reputation Analysis Methodology

This section explains how we've built up the model to calculate the reputation of touristic destinations or services. Although the analysis model is still evolving, its current status is able to provide us with enough data to collect, filter and convert into a reputation indicator.

3.2.1 Content
We categorized the various types of objects that can affect the reputation of a touristic destination into four main kinds.

Reviews: These are the reviews about hotels and services on tourism-related portals such as TripAdvisor. The reviews are written by people who stayed at the hotel and used the services provided.

Photos: These are the photos taken during touristic experiences. They have connections with the destination that is being analyzed. The main sources of photo content for Web2rism were Flickr and Picasa.

Videos: These are the videos taken during touristic experiences. They have connections with the destination that is being analyzed. The main source of video content for Web2rism was YouTube.


Blog Posts: These include the posts submitted to blogging sites or personal webpages and little pieces of text (such as Tweets from Twitter) submitted to micro-blogging platforms. As there are different methods for someone to set up a blog site, we first decided to get links from blog indexes. Google's Blogspot, Technorati and Wordpress are in the range of our blog tracking portals.

After we categorized the content, the essential part was to decide how we were going to extract the indicators that affect the reputation of brands. Since we are dealing with UGC sites, like most social-media type systems, the user-generated content has ratings and comments. In addition to these, one more factor we took into consideration was the density of comments and ratings. The higher the number of ratings or comments a content object has, the more saturated it is in terms of people's opinions.
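As an illustration of how rating quality and opinion density could be combined, here is a minimal sketch in Python, the project's implementation language. The function name, the weighting scheme and the saturation constant are illustrative assumptions, not the actual Web2rism formula.

    import math

    # Minimal sketch of a per-object reputation indicator (Python 2.x).
    # Weights and the saturation constant are illustrative assumptions.
    def reputation_indicator(avg_rating, num_ratings, num_comments,
                             max_rating=5.0, saturation=50.0):
        """Return a score in [0, 1] combining rating quality and density."""
        if num_ratings == 0:
            quality = 0.5  # no ratings yet: fall back to a neutral prior
        else:
            quality = avg_rating / max_rating
        # Density grows with the number of opinions but saturates, so
        # heavily discussed objects do not dominate without bound.
        density = 1.0 - math.exp(-(num_ratings + num_comments) / saturation)
        return quality * density

    print reputation_indicator(4.2, 35, 120)  # well rated and much discussed
    print reputation_indicator(4.8, 2, 0)     # well rated, but little evidence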

Figure 3.1: Content Analysis for Reputation Measurement

3.2.2 Sentiments
The raw form of scraped content in text format does not really mean anything unless it has ratings or indicators showing that it has positive or negative values. Therefore, there is the requirement of a sentiment analysis layer built on top of the data gathering part. This way we are able to understand which language is being used in the comment texts and see if they are written in a positive manner or not.


Details on how we used sentiment analysis tools and technologies are presented in Chapter 4: Implementation.
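To make concrete what such a layer computes, the fragment below is a deliberately naive polarity scorer; the word lists and the scoring rule are placeholders for illustration only, not the tools actually used in the project.

    # Naive polarity scoring over scraped comment text (illustration only).
    POSITIVE = set(["great", "beautiful", "excellent", "friendly", "clean"])
    NEGATIVE = set(["dirty", "rude", "awful", "noisy", "overpriced"])

    def polarity(text):
        """Return a score > 0 for mostly positive text, < 0 for negative."""
        words = text.lower().split()
        return sum(1 for w in words if w in POSITIVE) - \
               sum(1 for w in words if w in NEGATIVE)

    print polarity("The staff was friendly and the rooms were clean")  # 2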

3.2.3 Authorship
A person can have many identities on the Web. One can be a member of many different social sites and might be using a different nickname on each. However, we needed to understand whether the person using the same nick on two different sites is the same person or not. To overcome this problem, author objects are attached to content objects, and they can be shared among different contents. Calculation and detection of the author object can be a big problem, though, and in the current phase we have been studying the issue in more depth.

3.2.4 Query expansion


When using popular search engines, there is a feature that gives you related search suggestions as you type letters into the search box. Inspired by this, we assumed people might be looking for topics related to what they have started searching for. When a keyword (representing a hotel, a destination or a brand) is about to be sent to the Web2rism system for analysis, it is first expanded, and suggestions about possible similar searches are presented to the user.
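A sketch of how such an expansion can be fetched programmatically is given below, using the public Google Suggest service (also listed among the project's APIs in section 4.1.3). The endpoint URL and the XML layout are assumptions based on how the service behaved at the time of writing and may change.

    import urllib, urllib2
    from xml.etree import ElementTree

    def expand_query(keyword):
        """Return a list of suggested related searches for a keyword."""
        url = ("http://google.com/complete/search?output=toolbar&q=" +
               urllib.quote(keyword))
        xml = urllib2.urlopen(url).read()
        tree = ElementTree.fromstring(xml)
        # Each CompleteSuggestion element carries a <suggestion data="..."/>.
        return [s.get("data") for s in tree.findall(".//suggestion")]

    print expand_query("lugano hotel")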

3.2.5 Location Factor


With the increasing number of mobile devices and applications connected to social media sites and UGC platforms, uploaded content now has the possibility of carrying geo-tagging properties. For example, most of the photos on Flickr now have latitude and longitude values representing geographical coordinates assigned to them. Although not clearly defined and converted to code yet, there is the idea of also adding the geo-location attributes of the content objects to the reputation calculations. More info about the location factor included in the project is given in the Future Work section under the conclusion chapter of this report.
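One simple direction such a location factor could take is to weight each geo-tagged content object by its distance from the destination, for example with an exponential decay. The sketch below is only a possible approach, not implemented project code; the 50 km scale is an arbitrary assumption.

    import math

    def haversine_km(lat1, lon1, lat2, lon2):
        """Great-circle distance between two coordinates, in kilometres."""
        lat1, lon1, lat2, lon2 = map(math.radians, (lat1, lon1, lat2, lon2))
        a = (math.sin((lat2 - lat1) / 2) ** 2 +
             math.cos(lat1) * math.cos(lat2) *
             math.sin((lon2 - lon1) / 2) ** 2)
        return 6371.0 * 2 * math.asin(math.sqrt(a))

    def location_weight(content, destination, scale_km=50.0):
        """1.0 for content tagged at the destination, decaying with distance."""
        d = haversine_km(content[0], content[1], destination[0], destination[1])
        return math.exp(-d / scale_km)

    # A Flickr photo geo-tagged in central Lugano, analyzed for "Lugano":
    print location_weight((46.005, 8.952), (46.003, 8.951))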


3.3 Technical Methodology

This section explains our projections of the best way to implement the idea of an e-tourism centric ORM system as a software application, based on the project's needs and its characteristics.

Needs and Decisions

- The application has to be reachable from anywhere in the world.
- It should use open source tools and not depend on commercial software.
- The project is being developed in an academic environment and it should give students and researchers the possibility to study related areas.
- The software should be easy to understand and keep the learning curve as short as possible for future developers.
- It should use the latest trends in software development so that the best results can be achieved.
- The software development should respect the DRY (Don't Repeat Yourself) principle so that existing code can be reused for other purposes.

Considering all these necessities, we decided to create a highly scalable, high-performance web application that is able to use semantic technologies. We chose completely open source tools and frameworks that use open standards to ease the interoperability of the different layers of our system with each other. We also saw the necessity of developing our own API, since it has been the trend in most web applications for the past few years. Sites having APIs are usually more trusted and much more open to extensions, which are usually built by third-party developers. Since Web2rism is an academic project where many people get involved from time to time, we thought newcomers should be able to understand how the already built system works and not lose time with design changes. The upcoming Implementation chapter presents which tools, frameworks and methods we chose and how we designed the system architecture.

Chapter 4

Implementation
In this chapter, the various tools and frameworks used throughout the software development process of the Web2rism project are presented. In addition, the results of using the mentioned tools and how they were tested and evaluated are analyzed. Later on, a detailed view of how I built the overall system and how each component interacts with the others is explained.

4.1 Tools and Frameworks Used

This section lists the various programming languages, frameworks, tools and applications used in the different layers of the development process. Trying, learning, experimenting and deciding on the tools to use was an important issue in the project, because we needed to work with several different UGC sites that provide their data specifically for third-party application development.

4.1.1 Data & Knowledgebase Tools


Joseki

Joseki is an HTTP engine that supports the SPARQL Protocol and the SPARQL RDF Query language. Joseki's features include RDF data generation from files and databases and an HTTP (GET and POST) implementation of the SPARQL protocol.

Official Website: http://www.joseki.org/
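Because Joseki implements the SPARQL protocol over plain HTTP, a query can be issued with nothing more than a GET request carrying a URL-encoded query parameter. The sketch below assumes a local installation on Joseki's default port and a /sparql/ service path; both depend on the server configuration.

    import urllib, urllib2

    ENDPOINT = "http://localhost:2020/sparql/"  # assumed local setup

    def sparql_select(query):
        """Send a SELECT query to the endpoint and return the raw response."""
        params = urllib.urlencode({"query": query})
        return urllib2.urlopen(ENDPOINT + "?" + params).read()

    query = """
    PREFIX dc: <http://purl.org/dc/elements/1.1/>
    SELECT ?item ?title WHERE { ?item dc:title ?title } LIMIT 10
    """
    print sparql_select(query)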


JENA

Jena is a Java framework for building Semantic Web applications. It provides a programmatic environment for RDF, RDFS, OWL and SPARQL and includes a rule-based inference engine. Jena is open source and grew out of work with the HP Labs Semantic Web Program. Jena is included in the Joseki distribution along with ARQ, a SPARQL query engine for Jena. We used Jena's API to extract data from and write to our knowledge base's RDF graphs. The graphs are represented as an abstract model; each graph can be sourced with data from files, databases, URLs or a combination of these. A model can also be queried through SPARQL and updated through SPARUL. Detailed explanations of how we used SPARQL and SPARUL queries to query our KB model are given in section 4.3.2 (Layers and Components), part: Data Storage.

Official Website: http://jena.sourceforge.net/

SQLite3

SQLite is a software library that implements a self-contained, serverless, zero-configuration, transactional SQL database engine. The code for SQLite is in the public domain and is thus free for use for any purpose, commercial or private. SQLite is an embedded SQL database engine. Unlike most other SQL databases, SQLite does not have a separate server process. SQLite reads and writes directly to ordinary disk files. A complete SQL database with multiple tables, indices, triggers, and views is contained in a single disk file. The reason why I chose to use SQLite instead of MySQL is that I needed to store the configurations of the different scrapers in the ScraperManager application in a fast and simple way. As explained in Section 4.3 (System Architecture), each scraper script is embedded dynamically into the system and the threads that are run by these scripts have to be logged permanently. Since the manager application will only be used by the administrators of the Web2rism project, there is no need to set up a full MySQL installation. SQLite's compact and easy to integrate structure filled this need quickly and adequately.

Official Website: http://www.sqlite.org/
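As an example of the kind of lightweight persistence this enables, the snippet below stores scraper configurations with Python's built-in sqlite3 module. The table layout and column names are illustrative, not the actual ScraperManager schema.

    import sqlite3

    conn = sqlite3.connect("scrapermanager.db")
    conn.execute("""CREATE TABLE IF NOT EXISTS scraper_config (
                        name     TEXT PRIMARY KEY,  -- e.g. 'flickr'
                        keyword  TEXT,              -- destination being tracked
                        interval INTEGER,           -- minutes between two runs
                        enabled  INTEGER DEFAULT 1
                    )""")
    conn.execute("INSERT OR REPLACE INTO scraper_config VALUES (?, ?, ?, ?)",
                 ("flickr", "lugano", 60, 1))
    conn.commit()

    for row in conn.execute("SELECT name, keyword FROM scraper_config"
                            " WHERE enabled = 1"):
        print row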


4.1.2 Programming Languages and Web Frameworks


Programming Language As the main programming language, I used Python (version 2.5.1). The reason why I chose to code in Python is that it is a popular high-level language used by the technology and trend leader of the world wide web, Google. In addition, Python lets you work more quickly and integrate your systems more effectively. It is also open source and free to use, even for commercial products, because of its OSI-approved license. Official Website: http://www.python.org/ Web Framework Because of the need for web user interfaces, especially for management consoles, I used Django (version 1.1), which is the most popular Python web framework available. It follows the model-view-controller architectural pattern. Django's primary goal is to ease the creation of complex, database-driven websites. The framework also emphasizes reusability and pluggability of components, rapid development, and the principle of DRY (Don't Repeat Yourself). Python is used throughout, even for settings, files, and data models. Official Website: http://www.djangoproject.com/ External Applications: Another advantage of Python is that it eases the plugging of external Python applications into your projects. This way, I was able to search for applications created by other programmers to solve common problems. For instance, to convert the models (business objects) which are filled by the data coming from the knowledge base into JSON objects, I used simplejson. Below is a full list of the pluggable Python apps I used in the server-side programming process. Simplejson Simplejson is a simple, fast, extensible JSON encoder/decoder for Python. It is compatible with Python 2.4 and later with no external dependencies. It covers the full JSON specification for both encoding and decoding, with unicode support.


Official Website: http://www.undefined.org/python/ BeautifulSoup Beautiful Soup is an HTML/XML parser for Python that can turn even invalid markup into a parse tree. It provides simple, idiomatic ways of navigating, searching, and modifying the parse tree. I used BeautifulSoup in the process of parsing HTML data after scraping sites that do not provide their own APIs to let us gather data easily. Official Website: http://www.crummy.com/software/BeautifulSoup/ WWW::Mechanize Mechanize is a handy Perl module behaving much like BeautifulSoup does for Python. Mechanize automatically stores and sends cookies, follows redirects, can follow links, and submits forms. Form fields can be populated and submitted. I used Mechanize while getting accustomed to scraper programming and for testing custom scrapers written in Perl. Official Website: http://search.cpan.org/dist/WWW-Mechanize/

4.1.3

APIs and Wrappers

Google BlogSearch Blog Search is Google search technology focused on blogs. Google is a strong believer in the self-publishing phenomenon represented by blogging, and its Blog Search helps users to explore the blogging universe more effectively. Its results include all blogs, not just those published through Blogger, so it is possible to get the most accurate and up-to-date results. The goal of Blog Search is to include every blog that publishes a site feed (either RSS or Atom). It is not restricted to Blogger blogs, or blogs from any other service. Web2rism required the crawling of travel blogs from all over the world; that is why blog search engines like Google's were crucial for the project. Google BlogSearch offers its raw data in ATOM or RSS formats, and it was easy to integrate into our data gathering layer. Official Website: http://blogsearch.google.com/ Google Suggest


As you type into the search box on Google Web Search, Google Suggest offers searches similar to the one you're typing. Start to type [ Como ], even just [ Co ], and you'll be able to pick searches for Como Lago, Como Italy, and Como te llamas (which is the Spanish way of asking someone's name). Type some more, and you may see a link straight to the site Google thinks you're looking for, all from the search box. This feature of Google was used in the query expansion of Web2rism, where the given keywords are inserted into Google's suggestion function and re-presented to the user to help him specify what he is really looking for. Official Website: http://www.googlelabs.com/ Yahoo Search Suggests Yahoo's search suggestion service is similar to Google's. However, our analysis showed that it provided quite different results for different search terms, which is why we also included this service in the query expansion part of Web2rism. Details about the Query Expansion part are explained in the upcoming section: 4.3 - System Architecture. Official Website: http://www.yahoo.com/ Google Charts API The Google Chart API lets you dynamically generate charts. It basically returns a PNG-format image in response to a URL. Several types of image can be generated, including line, bar, and pie charts. For each image type, you can specify attributes such as size, colors, and labels. The Charts API is used in the visualization of the data shaped by the reputation model for the end-user, which is in most cases the manager/owner of a brand or a touristic destination. Official Website: http://code.google.com/apis/chart/ Google Data API Google's YouTube APIs and Tools enable developers to integrate YouTube's video content and functionality into their websites, software applications, or devices. It is possible to search for videos, retrieve standard feeds, and see related content.
A program can also authenticate as a user to upload videos, modify user playlists, and more. I used the YouTube API in the Web2rism scrapers to fetch the texts of the comments written for a given video, count the density of videos uploaded about a given query, or find the top contributing authors. The API works over HTTP, returning XML. Though it is possible to use these services with a simple HTTP client, there are libraries that provide helpful tools to streamline the code and keep up with server-side changes. Because I had been coding in Python, I used the Gdata Python Client, which is a general library created by Google for using their APIs from Python code. Official Website: http://code.google.com/apis/gdata/ Google Data (gdata) Python Client Library The Google Data Python Client Library provides a library and source code that make it easy to access data through Google Data APIs. This library is able to communicate with many of the Google products that provide their own APIs, such as Google Blogspot, Google Calendar, Maps, Picasa Web Albums, etc. Official Website: http://code.google.com/p/gdata-python-client/ Google Insights Google Insights for Search is a service by Google similar to Google Trends, providing insights into the search terms people have been entering into the Google search engine. Unlike Google Trends, Google Insights for Search provides a visual representation of regional interest on a map. The consumption of Google's Insights data was rather a hard process because the site does not offer an API. There is only a CSV file of the results, generated automatically, and I had to parse this file manually to extract the data required. Official Website: http://www.google.com/insights/search/
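Before moving on to the other APIs, here is a hedged sketch of how a YouTube scraper can use the gdata client described above to fetch the comments for the top videos of a keyword; the method names follow the version 1 client as I recall it and may differ in later releases:

import gdata.youtube.service

def video_comments(keyword, max_videos=5):
    # Search YouTube for the keyword and collect the comment texts per video.
    yt_service = gdata.youtube.service.YouTubeService()
    query = gdata.youtube.service.YouTubeVideoQuery()
    query.vq = keyword                  # the search term
    query.max_results = str(max_videos)
    feed = yt_service.YouTubeQuery(query)
    comments = {}
    for entry in feed.entry:
        video_id = entry.id.text.split('/')[-1]
        comment_feed = yt_service.GetYouTubeVideoCommentFeed(video_id=video_id)
        comments[entry.media.title.text] = \
            [c.content.text for c in comment_feed.entry]
    return comments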


Twitter Search API The popular micro-blogging platform Twitter's data is exposed via its easy-to-use API. The API is divided into two parts: Search and REST. I used the Search API to browse through the tweets of Twitter users, calculate trending topics and find top contributing users. Official Website: http://apiwiki.twitter.com/ Technorati API Technorati, the famous blog indexing and search service, has a program called The Technorati Developer Program which helps power users and tool developers integrate Technorati data directly into their favorite applications. They provide an SDK, example scripts, a mailing list, and other helpful resources to assist the development process. Technorati exposes many of its data services via an application programming interface. Technorati's API returns results in its own proprietary XML as well as common feed formats such as RSS. The consumption of Technorati data played an important role in the Web2rism project since, as indicated in the Google BlogSearch explanation, the crawling of travel blogs from all over the world was crucial for the data gathering part. Official Website: http://technorati.com/developers/api/ Flickr API The Flickr API consists of a set of callable methods and some API endpoints. To perform an action using the Flickr API, you need to select a calling convention, send a request to its endpoint specifying a method and some arguments, and you will receive a formatted response. The function calls can return data in JSON or XML formats. I used the Flickr API in the project to fetch comments given to photos, receive meta-tags of uploaded media, and calculate trending topics, locations and top contributing users. An important feature of Flickr is that it provides geo-tagging functionality for uploaded photos, and this played an important role in the reputation model of the Web2rism project.


Official Website: http://www.flickr.com/services/api/
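To give an idea of how little code a call to the Twitter Search API described above requires, here is a minimal sketch against its JSON endpoint as documented at the time; error handling and rate limits are omitted:

import urllib
import urllib2
import simplejson

def search_tweets(keyword, pages=1):
    # The Search API returns JSON; 'rpp' is results per page (max 100).
    results = []
    for page in range(1, pages + 1):
        params = urllib.urlencode({'q': keyword, 'rpp': 100, 'page': page})
        url = 'http://search.twitter.com/search.json?' + params
        data = simplejson.load(urllib2.urlopen(url))
        results.extend(data['results'])
    return results

for tweet in search_tweets('Lugano')[:5]:
    print tweet['from_user'], '-', tweet['text']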

4.1.4 User Interface Design


jQuery jQuery is a fast and concise JavaScript library that simplifies HTML document traversing, event handling, animating, and Ajax interactions for rapid web development. Microsoft and Nokia have announced plans to bundle jQuery on their platforms, Microsoft adopting it initially within Visual Studio for use within Microsoft's ASP.NET AJAX framework and ASP.NET MVC Framework, whilst Nokia will integrate it into their Web Run-Time platform. Because it is becoming an industry standard and a must-have library for web development, I used jQuery for faster creation and better maintainability of the user interfaces instead of coding purely in JavaScript. Official Website: http://www.jquery.com

4.1.5 Sentiment Analysis


Before diving more into how sentiment analysis practices are used in the project, it is essential to give some definitions and explanations on the matter. Sentiment analysis, or opinion mining, refers to a broad area of natural language processing, computational linguistics and text mining. Generally speaking, it aims to determine the attitude of a speaker or a writer with respect to some topic. The attitude may be their judgement or evaluation, their affective state, or the intended emotional communication. After the Data Gathering part, where text-based data is collected from several different resources, the collection is sent through a sentiment analysis tool of our choice that is able to detect whether a piece of text is written in a positive manner or the opposite. For instance, let's think about a comment on a YouTube video about the Lake of Como. By basic analysis of the comment, it can be understood whether the given text or sentence is positive or negative. Basically, if the comment text contains words like beautiful, good or relaxing, and if the sentence is grammatically positive, it can be deduced that the comment is a positive one.
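Purely to illustrate this idea, a naive word-counting check could look like the toy sketch below; the word lists are illustrative only, and this is not the trained classifier the project actually relies on, which is described next:

POSITIVE = set(['beautiful', 'good', 'relaxing', 'amazing', 'lovely'])
NEGATIVE = set(['ugly', 'bad', 'boring', 'dirty', 'awful'])

def naive_polarity(text):
    # Count positive and negative cue words and compare the totals.
    words = text.lower().split()
    score = sum(1 for w in words if w in POSITIVE) - \
            sum(1 for w in words if w in NEGATIVE)
    if score > 0:
        return 'positive'
    elif score < 0:
        return 'negative'
    return 'neutral'

print naive_polarity('The Lake of Como is beautiful and relaxing')  # positive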


Mood / Sentiment detection The kind of analysis mentioned above is called mood / sentiment detection, and most content analyzer tools offer four different kinds of analysis:
Polarity Basic Analysis: basic analysis of whether the given text or sentence is positive or negative.
Subjectivity Basic: analyzes whether the given text expresses an opinion or states a fact.
Hierarchical Polarity Analysis: analyzes the polarity (e.g. positivity or negativity) of the given text using subjectivity results. It uses the subjectivity classifier to extract subjective sentences from reviews to be used for polarity classification.
Polarity Whole Analysis (cross-validation of hierarchical polarity analysis): analyzes the polarity of the given text or sentence using multiple checks and tests with different training and test sets. Cross-validation performs multiple divisions of the data into training and test sets and then averages the results, in order to bring down evaluation variance and tighten confidence intervals.
For sentiment analysis, a famous multi-lingual tool called LingPipe is used. When checked against movie review data, LingPipe gives 81% accuracy (average of the four kinds of test results). Natural Language Detection The web does not consist of a single language. And since the matter here is e-tourism, it is very likely to find information about a touristic destination in the language of the country the place belongs to. In these cases, before doing content format analysis, language detection is needed. Going back to the YouTube video comment example mentioned at the beginning of this section, using natural language detection tools it is possible to find out the language of the comment text, or of any kind of textual data that our system needs to analyze after the data gathering phase. For natural language detection in Web2rism, the Lextek Language Identifier is used. This tool is capable of identifying not only what language a text is written in, but also its character encoding and what other languages are most similar. It offers more language and encoding modules than any other language identifier; currently, there are 260 language and encoding modules available for use in analysis. Official website: http://www.lextek.com/langid/li
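Lextek is a commercial, closed-source tool, so its interface cannot be reproduced here; purely to illustrate the principle of language detection, a stopword-based guesser might look like the following sketch (word lists abridged):

# Illustrative only: guess a language by counting very frequent function words.
STOPWORDS = {
    'english': set(['the', 'and', 'is', 'of', 'to']),
    'italian': set(['il', 'la', 'e', 'di', 'che']),
    'german': set(['der', 'die', 'und', 'ist', 'von']),
}

def guess_language(text):
    words = set(text.lower().split())
    scores = [(len(words & stops), lang) for lang, stops in STOPWORDS.items()]
    return max(scores)[1]  # language with the most stopword hits

print guess_language('il lago di Como e la citta che amo')  # italian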

4.2

Scrapers, Scraping and Parsing Techniques

By definition, Web scraping (also called Web harvesting or Web data extraction) refers to a computer software technique for extracting information from websites. Usually, such software programs simulate human exploration of the Web by either implementing low-level Hypertext Transfer Protocol, or embedding certain full-fledged Web browsers. Web scraping is closely related to Web indexing, which indexes Web content using a bot and is a universal technique adopted by most search engines. Web scraping, on the other hand, focuses more on the transformation of unstructured Web content, typically in HTML format, into structured data that can be stored and analyzed in a central local database or spreadsheet. Web scraping is also related to Web automation, which simulates human Web browsing using computer software. The process of automatically collecting Web information shares a common goal with the Semantic Web vision, which is a more ambitious initiative that still requires breakthroughs in text processing, semantic understanding, artificial intelligence and human-computer interaction. Web scraping, instead, favors practical solutions based on existing technologies, even though some solutions are entirely ad hoc.

4.2.1 Web2rism Scrapers


There are different levels of automation that existing Web-scraping technologies currently provide. These include text grepping, regular expression matching, HTML parsing, DOM parsing and semantic annotation recognition. In the Web2rism project, however, my approach to the term scraper included one more technique: the consumption of the web services and APIs provided by the target sites. Since we were interested mostly in social-media and UGC sites, the big players of the sector were already providing their website content in organized formats (mostly JSON or XML) via their web services or APIs.


During the development phase of the Web2rism software, I created 7 different scraper scripts that use the APIs and/or HTML parsing tools mentioned in the Tools and Frameworks Used section of this chapter. The scrapers for Flickr, YouTube, Technorati and Twitter use the sites' respective APIs, while I built custom parsers for WikiTravel, Google Blogsearch and Google Insights.

4.2.2 A Custom Scraper Example


This section shows an example scraper script coded in Python using the BeautifulSoup library. The code is pretty simple and straightforward. Using the urllib2 library, an HTTP connection is opened to the remote site to be scraped and the page source is retrieved. After that, the BeautifulSoup library comes in and we sift through the HTML tags of the source until we reach the area we are interested in. Later on, we can fill a custom object and prepare it to be saved to any database / knowledge-base.


from BeautifulSoup import BeautifulSoup
import urllib2
import sys

reload(sys)
sys.setdefaultencoding('utf-8')


def WikiTravelURL(city_name):
    domain = 'http://wikitravel.org'
    city_url = domain + '/en/' + city_name
    return city_url


def ScrapeWikiTravel(city_name):
    places_to_see = []
    city_url = WikiTravelURL(city_name)
    city_html = urllib2.urlopen(city_url).read()
    city_soup = BeautifulSoup(city_html)
    # Section headings on WikiTravel city pages are anchors named "See" and "Do"
    see_section = city_soup.find(attrs={'name': 'See'})
    do_section = city_soup.find(attrs={'name': 'Do'})
    if see_section is not None:
        data = see_section.findNextSibling('ul')
        try:
            # walk the sibling <ul> lists until the "Do" section is reached
            while data is not None and data.findNextSibling('a') != do_section:
                if data.li is not None:
                    destination_name = None
                    if data.li.b is not None:
                        # the tag-free destination name,
                        # can be checked if it exists on the knowledgebase
                        if data.li.b.a is not None:
                            destination_name = data.li.b.a.next
                        else:
                            destination_name = data.li.b.next
                    elif data.li.span is not None:
                        if data.li.span.span is not None:
                            destination_name = data.li.span.span.next
                        else:
                            destination_name = data.li.span.next
                    if destination_name is not None:
                        places_to_see.append(destination_name)
                # go through the next list of li tags
                data = data.findNextSibling('ul')
        except AttributeError:
            # not every city page follows the same markup; stop gracefully
            pass
    return places_to_see
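For illustration, the scraper can be invoked directly; the city name must match WikiTravel's URL naming:

places = ScrapeWikiTravel('Rome')
for place in places:
    print place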


4.3

System Architecture

In this part, the general architecture of the system is presented, and the way the various components of the system communicate with each other is explained.

4.3.1 The MVC Architecture and its Usage


The main aim of the MVC architecture is to separate the business logic and application data from the presentation to the user. The reason why I decided to use the MVC design pattern is that it is reusable and expressive. When a problem recurs, there is no need to invent a new solution; we just have to follow the pattern and adapt it as necessary. 1) Model: The model object knows about all the data that needs to be displayed. It is the model that is aware of all the operations that can be applied to transform that data. It only represents the data of an application and is not aware of the presentation data or how that data will be displayed in the browser. 2) View: The view represents the presentation of the application. The view object refers to the model. It uses the query methods of the model to obtain the contents and renders them. The view is not dependent on the application logic; it remains the same if there is any modification in the business logic. In other words, it is the responsibility of the views to maintain consistency in the presentation when the model changes. 3) Controller: Whenever the user sends a request for something, it always goes through the controller. The controller is responsible for intercepting requests from the view and passing them to the model for the appropriate action. After the action has been taken on the data, the controller is responsible for directing the appropriate view to the user. In GUIs, the views and the controllers often work very closely together. However, Django, the web framework I used, interprets the term MVC in a different way. In Django, the Controller is called the view, and the View is called the template. The view describes the data that gets presented to the user. It's not necessarily how the data looks, but which data is presented. The view describes which data you see, not how you see it.
It's a subtle distinction. So, in Django's case, a view is the Python callback function for a particular URL, because that callback function describes which data is presented. Furthermore, it's sensible to separate content from presentation, which is where templates come in. In Django, a view describes which data is presented, but a view normally delegates to a template, which describes how the data is presented. As for the controller, it's probably the framework itself: the machinery that sends a request to the appropriate view, according to the Django URL configuration. As a result, it can be said that Django is an MTV framework - that is, model, template, and view.
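To make the MTV naming concrete, here is a hedged sketch of the three pieces in Django 1.1 syntax; the URL pattern, view name and template name are hypothetical, not the project's actual ones:

# urls.py -- the "controller": routes a URL to a view callback
from django.conf.urls.defaults import patterns, url

urlpatterns = patterns('',
    url(r'^reputation/(?P<keyword>\w+)/$', 'web2rism.views.reputation'),
)

# views.py -- the "view": decides WHICH data is presented
from django.shortcuts import render_to_response

def reputation(request, keyword):
    tweets = []  # e.g. fetched through the KB API for the given keyword
    # the template 'reputation.html' decides HOW the data is presented
    return render_to_response('reputation.html',
                              {'keyword': keyword, 'tweets': tweets})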

4.3.2 Layers and Components


To provide a better understanding of the whole system, it is essential to follow the architecture chart below (Figure 4.1). 4.3.2.1 Scraper As explained in section 4.2.1 - Web2rism Scrapers, a scraper in our project is a standalone code script that either uses an API over HTTP, or uses an HTML/XML parser together with an HTTP connection library. The two methods are explained in detail below: 1) Using APIs In this method, the script uses the API of the website to be scraped. Compared to scrapers that use parsers, this method is much easier, since most APIs already return data in JSON or XML formats. For YouTube, Twitter, Technorati and Flickr, I created API-based scraper scripts in Python with the related API wrappers. One problem that can be faced while using APIs is that the sites offering them may have technical difficulties from time to time. For example, Technorati's service was unstable during the weeks I was coding the Data Gathering layer's Technorati scraper, and they were making changes to their API frequently.


Figure 4.1: System Architecture

While designing apps that use 3rd party APIs, it is vital to keep up with the changes in the APIs used and to follow news and updates about them, so that the developed software can be adapted and unexpected errors can be avoided. 2) Using a Parser In this method, the scraper travels to the given URI and starts reading the HTML source line by line. Then, using the parser library, the necessary information is extracted and shaped into the format in which we would like to save it into the knowledge-base. The problem encountered with this method is that a specific kind of page on a website does not always have the same format. For instance, for WikiTravel we needed to scrape the city pages, where each city is reviewed with sections like See, Do, Eat, Stay etc., and not all the city pages included all of these sections. For a big city like Rome, the page was filled with all kinds of data for all the sections, but for a small town like Lecco, most of these sections were empty. Moreover, a page's HTML source can change greatly if the design of the site is changed, and our custom scraper scripts can never be aware of such situations. Custom scrapers run exactly the way we want, but they require much more attention and maintenance than API-based scrapers. A nice feature of the designed system is that the scraper scripts can work in a language-independent way. Since the scrapers are important building blocks of the system, and the project as a whole is an academic one where many students with different programming backgrounds can be involved, we decided to implement this feature with future extensibility of the application in mind. The scraper class (model in Django terms) has two attributes named command and script_file. Using the ScraperManager admin UI, administrators can install new scrapers into the system. Details on how the scrapers are managed are explained in the upcoming ScraperManager subsection. ScraperThread Crawling through big stacks of interconnected pages may take a long time, especially for scrapers that use parsers. To ease the management of scrapers and see their statuses, I created a multi-threaded system that enables running multiple scrapers at the same time. Each time a scraper runs, it creates a thread that runs in the background and saves information about the active thread to the database. It updates this information in the configuration database (sqlite3) when a change occurs in the status of the scraper (such as the completion of the scraping activity or the occurrence of an exception).
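A minimal sketch of how such a background run can look; the class and attribute names here are assumptions based on the behaviour described above, not the project's actual code:

import subprocess
import threading

class ScraperRunner(threading.Thread):
    # Runs one scraper script in the background and logs status changes
    # to its ScraperThread record in the sqlite3 configuration database.
    def __init__(self, scraper_thread):
        threading.Thread.__init__(self)
        self.scraper_thread = scraper_thread

    def run(self):
        self.scraper_thread.status = 'running'
        self.scraper_thread.save()
        try:
            # The command is language independent, e.g. "python flickr.py"
            # or "perl wikitravel.pl", as stored on the scraper model.
            subprocess.check_call(self.scraper_thread.scraper.command,
                                  shell=True)
            self.scraper_thread.status = 'idle'
        except subprocess.CalledProcessError:
            self.scraper_thread.status = 'error'
        self.scraper_thread.save()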

ScraperManager The scraper manager, called SkyScraper (as described in the upcoming Project Files Organization subsection), is the application that lets the administrators of Web2rism see the status of the scrapers attached to the system. It is possible to start/stop scrapers, add new scrapers to the system, or deactivate / remove existing ones.


Figure 4.2: ScraperManager application UML Diagram

SkyScraper has a CRON job that is scheduled to run all the active scrapers at certain time intervals. The CRON job checks the ScraperManager config database (sqlite3), reads the list of active scrapers and runs them automatically at the exact time they are scheduled to run. Apart from this, the scrapers can be run manually, without being bound to the CRON job. 4.3.2.2 Data Storage

KnowledgeBase As mentioned in previous sections, Web2rism uses a semantic RDF store. The data is saved in a subject-predicate-object model and queried with SPARQL and SPARUL queries over JENA. In my Django project, I created an interface to easily connect to the KB by posting get and update queries. These queries are simply sent over HTTP using a normal HTTP POST, and the server is able to return results in XML or JSON formats. Later on, our KnowledgeBase API gets the results and shapes them in the needed way for easier presentation on the UI.
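As an illustration of this HTTP interface, a helper of the kind used in the Django project might look as follows; the endpoint URL is a placeholder, and the output=json parameter follows the Joseki/ARQ convention:

import urllib
import urllib2
import simplejson

JOSEKI_ENDPOINT = 'http://localhost:2020/web2rism/sparql'  # placeholder

def query_kb(sparql):
    # POST a SPARQL query to the Joseki endpoint and decode the JSON results.
    params = urllib.urlencode({'query': sparql, 'output': 'json'})
    response = urllib2.urlopen(JOSEKI_ENDPOINT, params)
    return simplejson.load(response)['results']['bindings']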


Figure 4.3: Scraper Manager UI

The data mined by the Web2rism scraper scripts is saved directly into the RDF store in a raw format. For instance, tweets collected over Twitter's API have attributes like the author, date/time, text and geo-location of the tweet objects. Below it can be seen how these attributes are distributed as objects inside the RDF model.
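The figure showing this distribution was rendered as an image in the original document; purely as an illustration, with a hypothetical w2r vocabulary, a single tweet could map to triples along these lines:

<http://web2rism.example/tweet/1234> rdf:type w2r:Tweet .
<http://web2rism.example/tweet/1234> w2r:author "traveler_42" .
<http://web2rism.example/tweet/1234> w2r:created "2009-08-14T10:32:00" .
<http://web2rism.example/tweet/1234> w2r:text "Lugano lake at sunset!" .
<http://web2rism.example/tweet/1234> w2r:geoLocation "46.003,8.951" .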

For different needs, like asking the RDF store for data about specific subjects, it is possible to create complex queries. Below is a SPARQL query that asks the KB for anything related to the keyword Lugano.


Figure 4.4: An example of a SPARQL query used to get data from the KB
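The query in Figure 4.4 was likewise rendered as an image; a query of the kind described, again using the hypothetical w2r vocabulary from above, would look roughly like this:

PREFIX w2r: <http://web2rism.example/schema#>
SELECT ?subject ?predicate ?object
WHERE {
  ?subject ?predicate ?object .
  ?subject w2r:text ?text .
  FILTER regex(?text, "Lugano", "i")
}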

Configuration Database This is a simple sqlite3 database used for the ScraperManager app's ORM. I decided to use a separate DB instead of the KnowledgeBase to make things faster and to be able to use Django's ORM capability. Django natively supports ORM only for MySQL, SQLite and PostgreSQL; sqlite3 was the easiest to use among them, and it is the best choice for rather small databases with few tables. Django also offers an automatically generated admin page for easy management of the application models in the project. So if a problem occurs in the ScraperManager UI, it is also possible to manage everything with direct access to the database tables over this admin panel.
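A hedged sketch of the two Django models behind this configuration database; only command and script_file are documented above (plus the name and description collected by the upload form), so the remaining field names are assumptions:

from django.db import models

class Scraper(models.Model):
    name = models.CharField(max_length=100)
    description = models.TextField(blank=True)
    command = models.CharField(max_length=200)  # UNIX command that runs the script
    script_file = models.FileField(upload_to='scrapers/')
    active = models.BooleanField(default=True)

class ScraperThread(models.Model):
    scraper = models.ForeignKey(Scraper)
    status = models.CharField(max_length=20, default='idle')  # running / idle / error
    started_at = models.DateTimeField(auto_now_add=True)
    updated_at = models.DateTimeField(auto_now=True)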

4.3.2.3 Data Analysis

KnowledgeBase API One of the most interesting points of the Web2rism project was creating our own API that lets any developer use the data in the way they want. The API acts as a service connected to the Knowledge-Base, functioning as an analyzer and shaper of the raw data coming from it. These analyses include language detection, sentiment analysis, popularity analysis and the ordering of data according to several different filters. For example, video comments coming from YouTube are saved in the KB in raw format, as if they were coming directly from YouTube's databases.
When a reputation analysis is going to be made, the reputation calculator can directly use the KB-API and get the highest rated comments on the videos of the touristic place which the search keyword represents. The API's usage also eases the development process because it returns data in JSON format; using any programming language, it is possible to read and easily traverse the gathered data. To convert the data gathered from the KB into JSON, I first gathered the subject-predicate-object formatted data from the RDF store and then assigned the data it contains to objects like video, tweet, photo etc. Then I used the simplejson Python library to encode them into JSON strings and output them as an HTTP response. Details on how the API methods function can be seen in Chapter 5: Tests and Evaluations.
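A minimal sketch of that encoding step; the attribute names mirror the tweet example from the Data Storage section, while the actual view code differs:

import simplejson
from django.http import HttpResponse

def tweets_as_json(tweets):
    # tweets: objects built from the subject-predicate-object data of the KB
    payload = [{'author': t.author,
                'datetime': t.datetime,
                'text': t.text,
                'geo': t.geo_location} for t in tweets]
    return HttpResponse(simplejson.dumps(payload),
                        mimetype='application/json')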

4.3.2.4 Presentation

The UI for viewing scrapers makes dense use of the Google Charts API, because complex graphs need to be generated from numeric data. For the management parts, the client-side JavaScript library jQuery is used. I decided to keep the interface as simple as possible, since data visualization is already an important issue in usability matters. The whole presentation layer of the project runs over the web, powered by the HTML4 and CSS3 standards.

4.3.3

Project Files Organization

Python's most appealing feature is that it greatly eases the pluggability of Pythonic applications with each other. Related to this, while creating a Django or Python project, creating the files and folders and separating pieces of code logically to increase pluggability is very important. Explained below is how I created the different Django applications that, when attached together, make up the data gathering and analysis parts of the Web2rism project. Core Contains core functions that are used throughout the whole Django project. For instance, the post and get functions that enable interaction with the KB are defined here. In addition, there are some classes that provide custom template tags to be used in the views.


KnowledgeBase API This is the source of the KB API. The models file is empty here, since I have not used any ORM for mapping business models to the semantic knowledgebase. There are just basic views and class files that do the data analysis work and return JSON-formatted data to the caller. Scrapers This directory contains the custom scraper scripts that are attached to the system over the scraper manager interface. All the files are uploaded via a web interface, and the scripts can be in any language. SkyScraper I named the scraper-manager layer SkyScraper as it resembles an umbrella-like structure, acting on top of all the scrapers attached to the system. This directory contains the Scraper and ScraperThread models (classes) as well as the views and URL configuration files of the skyscraper Django application. The schema below shows how the files are organized in the project directory of my implementation.

4.3.4 System Workflow


This subsection describes in detail the workflow of a typical data gathering scenario, consisting of a sequence of connected steps. In each step, the components I designed have specific interactions with each other.
1. A scraper script is coded in any programming language.
2. The scraper script is fed into the scraper manager.
3. The scraper manager assigns a default schedule. This schedule can later be modified by the administrators.
4. According to the schedule, the manager creates a scraper-thread and runs the scraper script at certain time intervals.
5. The thread runs and starts collecting data from the web.


6. The collected data is filtered, analyzed (sentiment) and shaped for use.
7. The shaped data is saved to the Knowledge-Base.
8. The thread ends and updates its state in the database as idle, ready to re-run.
9. The data gathering and analysis part is over; the user can now query the system using a keyword that represents the place whose reputation he wants to learn.

10. The keyword is received and query expansion is applied.
11. The expanded query is sent to the KB API, which reads the Knowledge-Base and exposes its results in JSON format.
12. The UI gets the results and displays them to the user.

Chapter 5

Tests and Evaluations


In this chapter, the current status and an evaluation of my work within the Web2rism project are presented.

5.0.5 Functional Scrapers


Google Trends / Insights The scraper that runs over Google Insights for Search and Google Trends gets the regions, cities and languages where the hits for a certain keyword are coming from on Google's search engine. It is also able to get the related news from Google News. Currently, it is not connected to the RDF store; it works on the fly, just displaying what it can find in the CSV file generated by Google's service. An example of what the Trends CSV contains can be seen here: http://www.google.com/trends/viz?q=lugano&graph=all_csv&sa=N Google Blogsearch The Google BlogSearch scraper can get the titles and excerpts of posts containing the search keyword, as well as the authorship properties. The data is scraped from the ATOM feed generated by custom searches on Google's side. In addition, there is an on-the-fly calculation of which author has written the most blog posts about a keyword. For instance, for the keyword Lugano, it shows that the top blogger is Prof. Lorenzo Cantoni from USI. In addition, it can show a timeline graph of blog post densities related to a keyword, grouped by day. As a reference, it also lists the latest posts of the search. An example of the ATOM formatted blog search results can be seen here: feed://blogsearch.google.com/blogsearch_feeds?hl=en&q=lugano&output=atom
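Since Insights offers no API, its scraper parses the exported CSV by hand, as noted above. A hedged sketch of such a parsing step, assuming the export's usual layout of blank-line-separated sections:

import csv
import urllib2

def parse_trends_csv(url):
    # Collect the export's sections into {section_header: rows}.
    raw = urllib2.urlopen(url).read().splitlines()
    sections, current = {}, None
    for row in csv.reader(raw):
        if not row:
            current = None
        elif current is None:
            current = row[0]
            sections[current] = []
        else:
            sections[current].append(row)
    return sections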


Figure 5.1: A view from the Google BlogSearch Scraper's findings

Technorati The Technorati scraper runs using Technorati's native API. It can list the number of blog entries found and the post densities varying by date. As with Google BlogSearch, the graphs are generated using the Google Charts API. There is also a top-authors calculation.

Figure 5.2: A view from the Technorati Scraper's findings

Twitter Twitter's API is quite powerful, and I have tried to get as much as possible out of the 140-character texts. Currently, Web2rism's Twitter scraper calculates the ratio of tweets containing the search keyword out of 1000 tweets posted in the last 7 days. It also brings up the most active Twitter users (those who have written the most tweets about the queried keyword). The Twitter scraper works fully integrated with the Knowledge-Base, saving tweet object data to the RDF store. There are two different scrapers: an older version that works on the fly and makes the calculations and measurements explained above, and another one that works by saving data to the RDF store while the analysis is done over the API.

Flickr The Flickr scraper runs over Flickr's own API and is able to list the most interesting photos, the number of photos, top authors and a photos-by-date timeline graph. Although not fully complete, the analysis tool on the API also provides a function that lets users query a specific area, adjusting a radius around a target location to see photos posted by Flickr users inside that zone.

Figure 5.3: A view from the Twitter Scraper's findings

Figure 5.4: A view from the Flickr Scraper's findings

YouTube The YouTube scraper runs over Google's native GData API. It is able to fetch the top videos for the query keyword, get their ratings, calculate average ratings, get comments, calculate comment ratings, get favorited content and get the tags associated with the videos. There are also options like getting the top authors, the most used tags and so on. The GData API is really powerful and easy to use, therefore it can be extended greatly for future releases too. Currently, the YouTube scraper does not work over the RDF store but just displays data on the fly.

Figure 5.5: A view from the YouTube Scraper's findings

WikiTravel The WikiTravel scraper uses parser libraries and runs somewhat problematically because of WikiTravel's unstable and not fully uniform page formats. It is able to get the recommended places to see for the queried city or country and list the top contributing authors.

5.0.6 KnowledgeBase & Scraper Management


The Knowledge-Base API provides the Query Expansion functions that use the Google Suggest and Yahoo Search Suggestions interfaces. In addition, there are functions for querying the Knowledge-Base for tweets and YouTube videos. All the functions return results in JSON format. The scraper manager is also fully functional, letting the administrators of Web2rism attach new scraper scripts to the system. Linked to the scraper management screen, there is a script upload form on the manager UI. The form asks for the name, description and the UNIX command required to run the scraper script over the CRON job on a scheduled basis. It is possible to see the active scraper threads and their statuses, start/stop them, or delete and create new scrapers.

Figure 5.6: A view from the WikiTravel Scraper's findings

5.0.7 System Performance


In terms of system resources, it is foreseen that an online reputation analysis tool will consume a lot, but we have not yet been able to test the whole system by running all the scrapers continuously. This is because the project is still in a heavy development phase, and more time is required before the test phase comes. However, considering how resource-hungry the system is, an innovative approach like distributed computing on Hadoop clusters could be applied. Especially after the RDF store gets filled with huge amounts of data, the system will need really powerful server machines and CPU / memory / cache usage optimizations. To make the system scalable, I see the opportunity to try cloud computing on services like Google AppEngine or Amazon's S3, or to use specialized software frameworks for distributed processing of large data sets on compute clusters.


Figure 5.7: A view from a call to the KB API to get Tweets about Italy

Chapter 6

Conclusion
What characterizes this project the most is the constructive approach taken and the assembly of a set of tools that power the final big picture. As the Web2rism project spans two years, although the core is complete, the project as a whole is far from finished. This chapter has two sections: the first explains the current status of the work, and the second lists my opinions on possible future enhancements.

6.1

Current Status of the Work

As discussed in detail in the preceding chapters, Web2rism currently has a functional auto-scraping system that saves data into an RDF Knowledge-Base. During the development phase, I needed to demo my work to the other people involved from time to time; therefore I started by building scrapers that work on the fly (not saving any data anywhere, just showing what is scraped). Later on, I started converting these scrapers to work in conjunction with the Knowledge-Base. Currently, the Twitter and YouTube scrapers work in RDF mode. The Knowledge-Base is also functional for both of these scrapers, providing functions that are callable over HTTP to get data in JSON format from the store. In addition, the ScraperManager application is functional, usable both over the UI I created and through the native Django admin panel. As a final conclusion of the research and studies made, we were satisfied with the system architecture in general and the model developed for an e-tourism centric ORM application. The current status of the overall Web2rism application can be seen at
this URL: http://web2rism.ath.cx/. The latest version of my work, which we named Django2rism, is here: http://django2rism.ath.cx/

6.2

Future Work

This section analyzes the features and characteristics I find necessary for a marketable version of the product.

6.2.1 Extending the Data Gathering Layer


Web2rism is all about measuring the buzz density on the web about a specific brand, product or place. So, like in every ORM or ORA tool, the most essential components of the system are its scrapers. As future work, the existing scrapers should be extended to work more efficiently, and new scrapers should be coded. For example, Lonely Planet (www.lonelyplanet.com) was a quite good and trustworthy source for me during the time I travelled in Europe, and it contains hotel and touristic destination reviews. In addition, they are growing rapidly, increasing their content size. TripAdvisor is another example, as it is one of the most popular trip planning sites. However, TripAdvisor does not have an API of its own, and the site pages are extremely crowded and big in size, making parser-based scraping harder. In addition, I see the need to create a generic buzz data model that will help us organize and shape the collected data. Right now, most things happen on the fly, and the collected data is not being shaped and interlinked with other scraped data. For instance, the buzz object can have attributes like author, create date, text, geo location, sentiment positiveness, view estimation, site URL and so on. As you may have noticed, this buzz object can be used to shape data coming from any UGC site; a sketch follows below. Be it Twitter, YouTube, WikiTravel or TripAdvisor, every piece of content on a UGC site has an author, a creation date, a location where the posting was made, an indication of whether it is favorited or not, or how much rating it has. If the collected data is shaped according to such a generic object, I believe it will be much easier to sift through the whole buzz collection and calculate the overall reputation of the brand or product being analysed.
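A first cut of such a generic model might look like the following sketch; the attribute names are taken from the list above:

class Buzz(object):
    # Generic container for one piece of UGC buzz, whatever the source site.
    def __init__(self, author, create_date, text, site_url,
                 geo_location=None, sentiment=None, view_estimation=None):
        self.author = author
        self.create_date = create_date
        self.text = text
        self.site_url = site_url
        self.geo_location = geo_location
        self.sentiment = sentiment  # e.g. 'positive' / 'negative'
        self.view_estimation = view_estimation

# A tweet and a YouTube comment both fit the same shape:
tweet = Buzz('traveler_42', '2009-08-14', 'Lugano is stunning!',
             'http://twitter.com/traveler_42')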


6.2.2 Optimization & Scalability Issues


Web2rism's scrapers and management panels work fine when only a few clients are connected and querying the system, but I see a lot of work to be done to enable the system to function while serving a high number of users. First of all, constantly scraping remote sources and saving data to local databases consumes bandwidth and space. Here, distributed computing mechanisms can be used instead of working on only one computer. For example, the distributed computing scene has lately seen good examples of the use of Hadoop, from the Apache community. Hadoop is an open-source implementation of frameworks for reliable, scalable, distributed computing and data storage. It has its own file system and a high-performance coordination service for distributed applications. Considering Hadoop's popularity increasing day by day, and foreseeing Web2rism as a large-scale application being developed in an academic environment, I think future scalability problems can be solved in this way.

6.2.3 A Logging System


Currently, the system does not have strong exception handling capabilities. There are many open-source error logging and tracking frameworks available; one should be chosen and attached to the system for better management.

6.2.4 Easier Interaction with the RDF Store


Although I was not involved in the Data Storing Layer, I was not really comfortable using the JENA API over the Joseki RDF server, and I would have preferred an RDF engine compatible with the Django project I was developing. With this in mind, I wanted to use a pluggable engine named django-rdf. Django-RDF is an RDF engine implemented as a generic, reusable Django app, providing complete RDF support to Django projects without requiring any modifications to existing framework or app source code, or incurring any performance penalty on existing control flow paths. The biggest obstacle to adding a web-framework-reliant RDF engine to the system was obvious: Django-RDF works only with Django, and it is not usable from programming languages other than Python. Web2rism's web layer could be changed to PHP, ASP or any other similar web technology in the future, and the team did not want to use a language-dependent framework. However, in my opinion, with a stable Django RDF engine it is possible to do web development using Django just like you're used to, then turn the knob and - with no additional effort - expose the project on the semantic web.

6.2.5 More Organized Author and Location Classification


This is a harder issue and may require some research, but there is a vital need, especially for authorship recognition. By authorship recognition, we mean understanding that the same user is behind accounts with the same nickname on two different UGC sites. Similar to the author recognition work, a location classification should be done. Here, conversion of the collected buzz locations into GPS coordinates can be useful. Both of these issues are included in the Reputation Analysis method, but they have not been included in the software yet.

6.2.6 UI Enhancements
Currently, the project's user interfaces lack the attractiveness and usefulness of slick web apps. There is the need to create a basic yet useful UI based on usability tests and the needs of the customers. Of course, this is a step to be taken much later in the development process, since most of the required features rely on server-side coding.
