
A Scale for Crawler Effectiveness on the Client-Side Hidden Web

By: Benj Arriola


Article Report
July 8, 2012
MBA ISYS683W, University of Redlands
Information Systems Strategy Capstone
Prof. Mark Gruber

A Scale for Crawler Effectiveness on the Client-Side Hidden Web


This report is a review of the academic paper of the same title, A Scale for Crawler Effectiveness on the Client-Side Hidden Web. The research came from professors of the Communications and Information Technologies Department at the University of A Coruña in Spain and was published in 2012 in the Computer Science and Information Systems journal.1 The paper focuses on a comparison of technologies, mainly different free and commercial web crawler platforms, to test their effectiveness in crawling the hidden web. The paper is academic in nature and, like many science journal articles, it does not discuss the practical or business application of this research; it is written for an academic audience where the application of these technologies is assumed to be known by the readers. To get a better understanding of the paper, definitions are discussed first (what is a crawler, and what is the client-side hidden web), followed by the business application that was not tackled in the paper and the limitations that may give false notions to the average reader. Below is an outline of the flow of this report:

- Definitions
  - Web Crawlers
  - Client-Side Hidden Web
- Business Application Significance of this Study
- The Academic Paper by the Professors of the University of A Coruña
  - The Conducted Experiments
  - The Results of the Research Paper
  - Conclusions of the Research Paper
- Possible Wrong Deductions by Readers of the Paper
  - Inferior Crawlers are Inferior for a Reason
  - Not Crawling AJAX and Flash Links
  - Lack of Research of Crawling Technologies
  - Crawling and Information Retrieval are Two Different Things
  - Lack of Knowledge of Google, Bing and Yahoo
    - Search Engine Robots & IP Addresses
    - Redirection Handling
- Report Conclusion

1 Prieto, V. M., Álvarez, M., López-García, R., Cacheda, F., University of A Coruña. A Scale for Crawler Effectiveness on the Client-Side Hidden Web. Computer Science and Information Systems, Vol. 9, No. 2, 561-583. (2012) ComSIS Consortium

Definitions

What is a Crawler


Crawlers are simply software programs that visit pages through their URLs, search within those pages for other URLs, and keep crawling and analyzing until all reachable pages are exhaustively crawled. Some crawlers may be limited to crawling HTML pages alone, while others also crawl other page assets such as images, videos, CSS files, JavaScript files and more. Crawlers are also called spiders, robots, or simply bots.
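To make the idea concrete, the following is a minimal sketch of such a crawler in Python, using only the standard library; the seed URL and page limit are illustrative, and a production crawler would add politeness rules (robots.txt, rate limiting), deduplication and better error handling. Note that it follows plain anchor tags only, which is exactly the limitation discussed in the next section.

```python
# Minimal breadth-first crawler sketch: download a page, extract <a href> links,
# queue them, repeat. Only static HTML anchors are followed.
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags found in static HTML."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed_url, max_pages=50):
    """Crawl breadth-first from seed_url, following plain anchor links only."""
    frontier = deque([seed_url])
    visited = set()
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue
        visited.add(url)
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", errors="replace")
        except Exception:
            continue  # skip pages that fail to download or decode
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            absolute, _fragment = urldefrag(urljoin(url, link))
            if absolute not in visited:
                frontier.append(absolute)
    return visited


if __name__ == "__main__":
    print(crawl("http://example.com/"))  # placeholder seed URL
```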

What is The Client-Side Hidden Web


For every URL loaded in a web browser, a page can be created in real time on the server, which runs server-side technologies. Conversely, every URL loaded in a browser can also load elements that change the appearance or content of the webpage within the web browser itself; these are client-side technologies. Because of the number of technologies that build up a webpage, not all information is readily crawlable. Crawlers are not client devices or web browsers; they are software programs trying to decipher code, mainly HTML. Current web technologies such as JavaScript, Adobe Flash, Adobe Shockwave, Apple QuickTime, RealNetworks RealPlayer, AJAX, XML, and several other less popular client-side technologies make it difficult for crawlers to gather all the data a webpage may offer.
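As a small illustration (the markup below is an invented example, not taken from the paper), a crawler that only parses static HTML sees the plain anchor but never sees a link whose URL is assembled by JavaScript at runtime:

```python
# Why client-side code hides links from a static HTML parse: the second URL is
# built inside a <script> block, so it never appears as a literal href attribute.
from html.parser import HTMLParser

PAGE = """
<html><body>
  <a href="/visible-page.html">A normal anchor the crawler can follow</a>
  <script>
    // The link target is assembled at runtime, in the browser.
    var section = "hidden" + "-page";
    document.write('<a href="/' + section + '.html">Built by JavaScript</a>');
  </script>
</body></html>
"""


class HrefCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.hrefs.extend(value for name, value in attrs if name == "href")


collector = HrefCollector()
collector.feed(PAGE)
print(collector.hrefs)  # ['/visible-page.html'] -- the JavaScript-built link is missed
```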

Business Application Significance of this Study


In the information age, more and more information is shared online, either publicly or privately, using the Internet. Common software applications for business have been moving to the Internet, creating web-based applications built on cloud technologies where the base interface is a web browser and may run on a large number of devices such as computers, mobile phones, tablets and others. Other applications may also use the cloud but not necessarily through a web browser; instead, a custom application accesses the data from the cloud, which frees the application from the browser's limitations in terms of allowed Internet protocols and port numbers. With the greater utilization of the cloud, more data is stored on the Internet, most of it web-based, which makes it more important to make that data searchable. To be able to search data properly on the web, information must be appropriately saved and indexed, which is achieved by crawling the pages. Using crawlers with the best capabilities for crawling the hidden web decreases the limitations on the format or method of content creation. The better content is crawled, the more complete the content that is indexed and searchable, which can always help improve work efficiencies in a cloud environment.


The Academic Paper by the Professors of the University of A Coruña

The Conducted Experiment

Most crawlers run by following hyperlinks on a page, which in their simplest form are HTML anchor tags written in plain text. To test how crawlers handle the hidden web, several other types of links were also tested. The table below, taken from the paper, summarizes the types of links used in the experiment.

Figure 1: Image from the academic paper; the links used in the experiment to test the various crawlers' capability of crawling the hidden web.

To test these links on a variety of crawlers, aside from observing results from existing web search engines such as Google, Yahoo, Bing, Ask, Alexa, Gigablast and PicSearch, free and commercial crawlers were used, which are summarized in the following table.


Figure 2: Image from the academic paper; the crawlers used in the experiment, compared by crawling capabilities.

From this list alone, the paper omits several web crawlers offered as SaaS services, such as MajesticSEO (http://www.majesticseo.com/), OpenSiteExplorer (http://www.opensiteexplorer.org/), and Screaming Frog (http://www.screamingfrog.co.uk/seo-spider/), all of which run as both a commercial and a limited free service. Another great free, though not open-source, web crawler that runs as a Windows desktop application is Xenu Link Sleuth (http://home.snafu.de/tilman/xenulink.html). A free open-source crawler written in PHP that also did not make the list is Sphider (http://www.sphider.eu/). It is understandable if this study was on a limited budget and was not able to purchase more expensive commercial crawlers, but there was not even a single mention of the Google Search Appliance (http://www.google.com/enterprise/search/), a configurable hardware and software appliance that can be set to crawl different file types and formats in a variety of ways, even within a private network closed off from the Internet. That a paper published internationally in the Computer Science and Information Systems journal, written by authors with PhD degrees and strong academic backgrounds, did not do enough research to find some of the best crawlers on the Internet is appalling.


The Results of the Research Paper


The paper then compares all the crawlers using a scale, which is a form of scoring system. The next two tables show the results for the free and commercial crawlers as well as the results from the web-based search engines.

Figure 3: Results for the free and commercial crawlers on the different types of links in the experiment.

Figure 4: Results for the web search engine crawlers on the different types of links in the experiment.


Conclusion of the Research Paper


The paper uses various mathematical models, defined as a scale, to compare the performance of each crawler. The summary of the scale results is:

- For Simple Average, WebCopier gets the best results, followed by Heritrix and Google.
- For Maximum Level, Google places first since it processes level 8 links. It is followed by WebCopier, which obtains 7 points, and Heritrix, which obtains 6 points. As Google achieves the maximum level in this model but not in others, the authors conclude that it has the capacity to deal with every scenario, but it does not try some of them because of internal policies.
- For the Weighted Average measure, WebCopier leads, followed by Google, Heritrix and Nutch.
- For Eight Levels, the top places go to Heritrix, Web2Disk and WebCopier, with GoogleBot placing fourth. This means that the three top crawlers either dealt with a large number of levels in each group or went through links that were part of a group with few links, which makes each link more valuable.
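The paper defines the exact formulas for these measures; they are not reproduced in this report. Purely as a hedged illustration of the general idea (the per-level results and the weights below are invented), a scale of this kind scores each crawler over link-difficulty levels roughly as follows:

```python
# Hedged sketch of how a crawler-comparison scale of this kind might be computed.
# The per-level results and the weights are invented for illustration only; the
# paper defines its own Simple Average, Maximum Level, Weighted Average and
# Eight Levels formulas, which are not reproduced here.

# 1 = the crawler processed links of that difficulty level, 0 = it did not.
results = {
    "CrawlerA": [1, 1, 1, 1, 0, 0, 0, 0],
    "CrawlerB": [1, 1, 1, 1, 1, 1, 1, 0],
}

# Assumption: deeper (harder) levels count more in the weighted measure.
weights = [1, 2, 3, 4, 5, 6, 7, 8]


def simple_average(levels):
    return sum(levels) / len(levels)


def maximum_level(levels):
    return max((i + 1 for i, ok in enumerate(levels) if ok), default=0)


def weighted_average(levels, level_weights):
    return sum(ok * w for ok, w in zip(levels, level_weights)) / sum(level_weights)


for crawler, levels in results.items():
    print(crawler,
          round(simple_average(levels), 2),
          maximum_level(levels),
          round(weighted_average(levels, weights), 2))
```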

The paper further concludes that the only crawlers that achieve good results at processing the client-side hidden web are WebCopier and GoogleBot, and it notes that these crawlers are surely using an interpreter that allows them to execute code. This is what lets them read the hidden web, since it means JavaScript, Flash, and other client-side technologies are interpreted properly by these crawlers.

Possible Wrong Deductions by Readers of the Paper

Inferior Crawlers are Inferior for a Reason

Although Google was one of the crawlers chosen as able to read the hidden web, Google has more power than it appears to have in this study. Google's main objective is not to crawl everything, and it purposely disregards some pages for a reason. Google's main objective is to serve its users the most relevant data related to any search query, returning results in the fastest possible way while searching records of billions of pages. To please its users, Google's algorithm may purposely disregard pages that do not seem to provide quality content. Thus, pages or other information not crawled by Google do not necessarily mean that Google lacks the capacity to crawl data in the hidden web properly. If businesses want to take advantage of the greater power Google has to offer, they may opt to use the Google Search Appliance, which can be configured to crawl more file types and to crawl within private company networks that the normal Googlebot search engine crawler does not have access to.

Not Crawling AJAX and Flash Links


In the specific data within the paper, no crawler was able to crawl Flash and AJAX links. Again, this shows the lack of research behind the paper. Google is able to crawl Flash and AJAX links even though the results may appear negative. Google has been trying to crawl Flash for a very long time, as shown by the fact that Flash pages are indexed. By simply using the ext: search operator on Google, you can filter results to URLs having a specific file name extension. Running this for Flash files with the .swf extension (https://www.google.com/search?q=ext%3Aswf), Google shows about 239 billion Flash pages indexed. This does not immediately prove that Flash is crawled by Google, but noting that the titles and descriptions of the Flash files are pulled from within the Flash files themselves shows that Google can read the text within a Flash file. If a Flash file contained a link, it would be included in the text that is readable by Google.

Figure 5: Screenshot of indexed Flash files showing titles and descriptions pulled from within the Flash files. This proves that Google can read the contents of a Flash file.


Aside from this, it was also Adobe's desire to make Flash crawlable, and Google and Adobe partnered up2,3,4 back in 2008 to further improve Google's crawling of Flash files. As for the crawling of AJAX, Google announced at the Search Marketing Expo (SMX) conference back in 2009 the existence of a headless browser. In the session entitled CSS, AJAX, Web 2.0 & SEO,5 Vanessa Fox, a former Google engineer, Kathrin Probst and Bruce Johnson, both Google engineers, Benj Arriola, SEO engineer at BusinessOnLine, and Richard Chavez of iCrossing discussed search engine optimization for technologies that can cause hidden web issues. At that time, the headless browser was a proposed feature6 of Google which is now in effect today.

There may be a number of reasons why AJAX and Flash links do not appear in the experiment results reported by the professors of the University of A Coruña. Google takes more time to crawl and index Flash and AJAX files since it has to run a separate crawling technology, and Google also crawls and updates several billion pages a day; the results in the paper may therefore not reflect Google's capabilities if it was not given enough time. Google may also decide to omit pages from its displayed index. Google's main concern is to deliver the most relevant results to a user, and if a page shows low-quality content or no content at all, Google may decide to deindex it. The study done by the professors was in 2012, and throughout 2011 Google rolled out a series of updates code-named the Panda Update,7 which focused on a number of changes related to content quality. Not only does Google look at words and the semantic analysis of words, but also at user activity, to determine if a page contains good content.
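As a concrete illustration of the AJAX crawling proposal mentioned above, the published 2009 scheme let sites expose JavaScript-generated states through "hash bang" (#!) URLs, which a crawler rewrites into a form the server can answer with static HTML. The sketch below shows that rewriting; the example URL is invented, and this is an illustration of the public proposal, not Google's internal implementation.

```python
# Sketch of the URL rewriting described in the 2009 "Making AJAX Applications
# Crawlable" proposal: a #! fragment is exposed to the crawler as an
# _escaped_fragment_ query parameter. The example URL below is invented.
from urllib.parse import quote, urlsplit, urlunsplit


def to_escaped_fragment(url):
    """Rewrite a #! URL into the crawler-fetchable _escaped_fragment_ form."""
    parts = urlsplit(url)
    if not parts.fragment.startswith("!"):
        return url  # not a crawlable-AJAX URL under the scheme
    fragment = quote(parts.fragment[1:], safe="")  # drop "!" and percent-encode
    extra = "_escaped_fragment_=" + fragment
    query = parts.query + "&" + extra if parts.query else extra
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, ""))


print(to_escaped_fragment("http://www.example.com/ajax.html#!state=products/page/2"))
# -> http://www.example.com/ajax.html?_escaped_fragment_=state%3Dproducts%2Fpage%2F2
```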

Lack of Research of Crawling Technologies


The research covered a good number of crawlers. This may deceive readers of this paper into thinking that these are the only crawlers, or the best crawlers, out there. Many other crawlers exist that were not tested in comparison to the ones in the research paper, such as Xenu, Screaming Frog, MajesticSEO, OpenSiteExplorer and even the Google Search Appliance.

Crawling and Information Retrieval are Two Different Things


The average reader of this paper may assume that the most comprehensive crawler would be the best one to use for an internal search technology. Just because a crawler can index more files does not necessarily mean it is the best tool to use for information retrieval.
2 Adobe Corporation. Adobe Flash Technology Enhances Search Results for Dynamic Content and Rich Internet Applications. (2008) URL: http://www.adobe.com/aboutadobe/pressroom/pressreleases/200806/070108AdobeRichMediaSearch.html
3 Adler, R., Stipins, J. Google Learns to Crawl Flash. (2008) Google Official Blog. URL: http://googleblog.blogspot.com/2008/06/google-learns-to-crawl-flash.html
4 Adler, R., Stipins, J., Ohye, M. Improved Flash Indexing. (2008) Google Webmaster Central Blog. URL: http://googlewebmastercentral.blogspot.com/2008/06/improved-flash-indexing.html
5 Schwartz, B. Live: CSS, AJAX, Web 2.0 & SEO. (2009) Search Engine Roundtable. URL: http://www.seroundtable.com/archives/020845.html
6 Mueller, J. A Proposal for Making AJAX Crawlable. (2009) Google Webmaster Central Blog. URL: http://googlewebmastercentral.blogspot.com/2009/10/proposal-for-making-ajax-crawlable.html
7 SEOmoz. Google Algorithm Change History. (2011) URL: http://www.seomoz.org/google-algorithm-change#2011

When searching within files of various formats, even if your database of URLs is very comprehensive, returning the most relevant results is a different technology in itself. Databases often index records in a specific format where data is arranged in pre-defined tables, each table containing fields with index keys that define how one table relates to another, and search queries for finding the proper information are run against the appropriate fields. In contrast, a full-text search for any text string over any type of file, with no specific field, table or variable type, can return an overwhelming number of results; this is where relevancy matters and where search engine algorithms come into play.8 The best crawler may have the worst text search algorithm. Some of the best algorithms for relevancy and search result quality include the PageRank algorithm,9 EigenTrust,10 Hilltop,11 HITS,12 Topic-Sensitive PageRank13 and more. Given that the top crawlers in this paper were WebCopier and Google, and that WebCopier does not have these complex algorithms for ranking search results, Google would be the better tool in a business environment, not only for crawling but for searching as well.
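To illustrate the distinction, below is a minimal sketch of an inverted index, the core data structure behind full-text retrieval; the documents are invented examples, and real engines layer tokenization, stemming and ranking models such as PageRank on top of this. The point is that building and querying this index is a separate step from crawling.

```python
# Minimal inverted index over (already crawled) documents, to show that retrieval
# is a separate concern from crawling. The documents are invented examples.
from collections import defaultdict

documents = {
    "page1.html": "web crawlers follow links between pages",
    "page2.html": "the hidden web is hard for crawlers to reach",
    "page3.html": "search engines rank pages by relevance",
}

# Build the inverted index: term -> set of documents containing that term.
index = defaultdict(set)
for doc_id, text in documents.items():
    for term in text.lower().split():
        index[term].add(doc_id)


def search(query):
    """Return documents containing every query term (a simple boolean AND)."""
    terms = query.lower().split()
    if not terms:
        return set()
    matches = set(index.get(terms[0], set()))
    for term in terms[1:]:
        matches &= index.get(term, set())
    return matches


print(sorted(search("crawlers web")))  # ['page1.html', 'page2.html']
```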

Lack of Knowledge of Google, Bing and Yahoo


Because of the authors' lack of knowledge of, or lack of research on, how search engines work, the study may be reporting results that do not reflect the real capabilities of present-day search engines such as Google, Bing and Yahoo. Some of the areas that may be causing false reporting of the search engines' indexing capabilities involve the identification of the search engine robots and the way search engines treat URL redirections.

Search Engine Robots & IP Addresses

According to the paper, search engines were identified by user agent and IP address, relying on the official robots.txt site at http://www.robotstxt.org. Even if this is the official page about robots.txt, that does not make it the most authoritative source for the user agents and IP addresses used by the search engines. Google uses more than one user agent string when crawling websites, yet in the experiment the results were only tested against the single user agent strings for Google, Bing and Yahoo listed on Robotstxt.org. Google itself mentions that it has various crawlers with different user agent names, and each crawler may target a different type of file. There are other resources on the Internet with more comprehensive lists of user agents, such as http://user-agentstring.info/ and http://iplists.com/, and an even more comprehensive list, updated almost daily but offered as a paid service, is available at http://fantomaster-seo.com/. A better way to identify search engines is a reverse and forward DNS check, which also prevents false positives caused by IP and DNS spoofing. Google provides a support page on how this can be done: http://support.google.com/webmasters/bin/answer.py?hl=en&answer=80553.
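A minimal sketch of that reverse and forward DNS check is shown below; the accepted hostname suffixes follow Google's published guidance, the example IP address is only illustrative, and the surrounding request handling would depend on the web server in use.

```python
# Verify a claimed Googlebot visit with a reverse + forward DNS check: resolve the
# visiting IP to a hostname, confirm the hostname belongs to googlebot.com or
# google.com, then resolve that hostname forward and confirm it maps back to the
# same IP. Requires network access when run.
import socket


def is_real_googlebot(ip_address):
    try:
        hostname, _aliases, _ips = socket.gethostbyaddr(ip_address)  # reverse DNS
    except socket.herror:
        return False
    if not hostname.endswith((".googlebot.com", ".google.com")):
        return False
    try:
        _name, _cnames, forward_ips = socket.gethostbyname_ex(hostname)  # forward DNS
    except socket.gaierror:
        return False
    return ip_address in forward_ips


# Example: an address copied from a server log line claiming to be Googlebot.
print(is_real_googlebot("66.249.66.1"))
```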
8 Zobel, J., RMIT University; Moffat, A., The University of Melbourne. Inverted Files for Text Search Engines. ACM Computing Surveys, Vol. 38, No. 2, 1-55. (2006) Association for Computing Machinery
9 Brin, S., Page, L., Stanford University. The Anatomy of a Large-Scale Hypertextual Web Search Engine (PageRank). Computer Networks and ISDN Systems, Vol. 30, No. 1-7, 107-117. (1998) Elsevier Science Publishers
10 Kamvar, S. D., Schlosser, M. T., Garcia-Molina, H., Stanford University. The EigenTrust Algorithm for Reputation Management in P2P Networks. Proceedings of the 12th International World Wide Web Conference. (2003)
11 Bharat, K., Compaq Research Center; Mihaila, A., University of Toronto. Hilltop: A Search Engine based on Expert Documents. Proceedings of the 9th International World Wide Web Conference. (2000)
12 Kleinberg, J. M., Cornell University. Authoritative Sources in a Hyperlinked Environment (HITS). Journal of the Association for Computing Machinery, Vol. 46, No. 5, 604-632. (1999)
13 Haveliwala, T. H., Stanford University. Topic-Sensitive PageRank: A Context-Sensitive Ranking Algorithm for Web Search. Proceedings of the 11th International World Wide Web Conference. (2002) Extended version in IEEE Transactions on Knowledge and Data Engineering, Vol. 15, No. 4, 784-796. (2003)

Redirection Handling

There are four ways to redirect one page to another:

- Meta Refresh HTML tag
- JavaScript window.location function
- HTTP server header 302 Temporary Redirect
- HTTP server header 301 Permanent Redirect

Search engines only honor the 301 redirect and do not pay attention to the others, although if the other redirect types have been in place for over two years, search engines treat them as 301 permanent redirects as well. The paper does not give specific details on the duration of the experiments, but it is highly unlikely that they ran for more than two years.
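As a sketch of how a crawler could check which redirect type a URL actually returns (the URL below is a placeholder), the HTTP status code distinguishes a 301 from a 302, while meta refresh and JavaScript redirects return 200 and are only visible in the page body:

```python
# Inspect how a URL redirects: a 301/302 shows up in the HTTP status code, while
# meta refresh and JavaScript redirects return 200 and live in the HTML body.
# The URL is a placeholder; requires network access when run.
import urllib.error
import urllib.request


class NoRedirect(urllib.request.HTTPRedirectHandler):
    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None  # stop urllib from following the redirect automatically


opener = urllib.request.build_opener(NoRedirect)
try:
    response = opener.open("http://example.com/", timeout=10)
    print(response.status, "- no server-side redirect; inspect the body for meta refresh or JavaScript")
except urllib.error.HTTPError as err:
    # With redirects suppressed, a 301 or 302 surfaces here as an HTTPError.
    print(err.code, "->", err.headers.get("Location"))
```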

Report Conclusion
The paper is a good piece of research with regard to comparing crawlers within the range of crawlers it used. What the paper lacks is more research on other crawlers that exist, some of which are popularly known as some of the best crawlers available and yet were not used by the authors. The methodology was also not implemented in the best way possible because of the authors' limited knowledge of how search engines work; they could clearly have run more accurate tests. The paper made WebCopier and Google the main winners without the authors knowing that Google can crawl Flash and AJAX; they may be losing data by misidentifying Google's crawlers, and their unawareness of the Google Search Appliance already shows strong signs that Google can outweigh WebCopier as the better crawler. PhD-level research by four professors of a university, published in an international IT journal as recently as 2012, can still show disappointing results, where knowledge and experience in the industry sometimes, or probably often, beats knowledge from within the academe alone.

A copy of this paper can be downloaded at: http://bit.ly/crawlerreview
