Prieto, V. M., Álvarez, M., López-García, R., Cacheda, F. (University of A Coruña), A Scale for Crawler Effectiveness on the Client-Side Hidden Web. Computer Science and Information Systems, Vol. 9, No. 2, 561-583. (2012) ComSIS Consortium
Text Search Information Retrieval Article Report ISYS683W MBA University of Redlands Benj Arriola http://www.internetmarketinginc.com http://www.seoreligion.com
7/8/2012
The Academic Paper by the Professors of the University of A Coruña

The Conducted Experiment
Most crawlers run by following hyperlinks on a page, which in their simplest form are plain-text HTML anchor tags. To test how well crawlers handle the hidden web, several other types of links were also tested. The table below, taken from the paper, summarizes the types of links used in the experiment.
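To make the distinction concrete, here is a minimal sketch, using only Python's standard-library html.parser, of the link extraction a simple text-based crawler performs. The sample page is hypothetical; the point is that a link written into the page by JavaScript never surfaces as an anchor tag to such a parser, which is exactly the hidden-web problem the experiment probes.

```python
from html.parser import HTMLParser

class AnchorExtractor(HTMLParser):
    """Collects href targets from plain <a> tags -- the only links
    a simple text-based crawler can see."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

# Hypothetical page: one plain anchor, one link generated by JavaScript.
page = """
<a href="/visible.html">plain anchor</a>
<script>document.write('<a href="/hidden.html">JS link</a>');</script>
"""

parser = AnchorExtractor()
parser.feed(page)
print(parser.links)  # ['/visible.html'] -- the JS-generated link is invisible
```

The script contents are treated as raw character data by the parser, so the second link is never seen; only a crawler that actually executes the JavaScript would discover /hidden.html.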
Figure 1: Image from the academic paper showing the links used in the experiment to test each crawler's capability of crawling the hidden web.

To test these links on a variety of crawlers, the authors observed results from existing web search engines such as Google, Yahoo, Bing, Ask, Alexa, Gigablast and PicSearch, and also ran free and commercial crawlers, which are summarized in the following table.
Figure 2: Image from the academic paper comparing the crawling capabilities of the crawlers used in the experiment.

The paper omits several SaaS web crawlers that were not considered, such as MajesticSEO (http://www.majesticseo.com/), OpenSiteExplorer (http://www.opensiteexplorer.org/), and Screaming Frog (http://www.screamingfrog.co.uk/seo-spider/), all of which run as both commercial and limited free services. Another great free, though not open-source, web crawler that runs as a Windows desktop application is Xenu Link Sleuth (http://home.snafu.de/tilman/xenulink.html). A free open-source crawler written in PHP that also did not make the list is Sphider (http://www.sphider.eu/). It is understandable if this study was on a limited budget and could not purchase the more expensive commercial crawlers, but there was not even a single mention of the Google Search Appliance (http://www.google.com/enterprise/search/), a configurable hardware and software appliance that can be set to crawl different types of files and formats in a variety of ways, even on a private network closed off from the Internet. Considering that this paper was internationally published in the Computer Science and Information Systems journal, it is appalling that authors with PhD degrees and strong academic backgrounds did not do enough research to find some of the best crawlers on the Internet.
Figure 3: Results for the free and commercial crawlers on the different types of links in the experiment.
Figure 4: Results for the web search engine crawlers on the different types of links in the experiment.
The paper further concludes that the only crawlers that achieve good results at processing the client-side hidden web are WebCopier and GoogleBot, and it notes that these crawlers are surely using an interpreter that allows them to execute code. Being able to execute code means JavaScript, Flash, and other technologies are interpreted properly by these crawlers, which is what lets them read the hidden web.
Possible Wrong Deductions by Readers of the Paper

Inferior Crawlers are Inferior for a Reason
Although Google was one of the top crawlers chosen as able to read the hidden web, Google has more power than it appears to have in this study. Google's main objective is not to crawl everything; it purposely disregards some pages for a reason. Its main objective is to serve its users the most relevant data for any search query, returning results as fast as possible from records of billions of pages. To please its users, Google's algorithm may purposely disregard pages that do not seem to provide quality content. Thus, a page or any type of information not being crawled by Google does not necessarily mean Google lacks the capacity to crawl data in the hidden web properly. Businesses that want to take advantage of the greater power Google has to offer may opt to use the Google Search Appliance, which can be configured to crawl more file types and to work within private company networks that the normal Googlebot search engine crawler cannot access.
For Flash files with the .swf extension, Google has about 239 billion Flash pages indexed (https://www.google.com/search?q=ext%3Aswf). This does not immediately prove that Flash is crawled by Google, but noting that the titles and descriptions of the indexed Flash files are pulled from within the Flash files themselves shows that Google can read the text inside a Flash file. If a Flash file contained a link, it would be included in the text that is readable by Google.
Figure 5: Screenshot of indexed Flash files showing titles and descriptions pulled from within the Flash files. This proves that Google can read the contents of a Flash file.
Aside from this, it was also Adobe's desire to make Flash crawlable, and Google and Adobe partnered up[2][3][4] back in 2008 to further improve Google's crawling of Flash files.

As for the crawling of AJAX, Google announced the existence of a headless browser at the Search Marketing Expo (SMX) conference back in 2009. In the session entitled CSS, AJAX, Web 2.0 & SEO,[5] Vanessa Fox, a former Google engineer; Kathrin Probst and Bruce Johnson, both Google engineers; Benj Arriola, SEO engineer at BusinessOnLine; and Richard Chavez of iCrossing discussed search engine optimization of technologies that may cause hidden web issues. At that time, the headless browser was a proposed feature[6] of Google; it is now in effect today.

There may be a number of reasons why AJAX and Flash results do not appear in the experiment done in the paper by the professors of the University of A Coruña. Google takes more time to crawl and index Flash and AJAX files since it has to run a separate crawling technology, and on top of that, Google crawls and updates several billion pages a day. Thus the results in the paper may not reflect the capabilities of Google if it was not given enough time. Google may also decide to omit pages from its displayed index. Google's main concern is to deliver the most relevant results to a user; if a page shows low-quality content or no content at all, Google may decide to deindex the page. The study done by the professors was in 2012. Throughout 2011, Google rolled out a series of updates code-named the Panda Update,[7] which focused on a number of changes related to content quality. Google looks not only at words and the semantic analysis of words, but also at user activity, to determine if a page contains good content.
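The AJAX crawling proposal mentioned above worked by mapping "#!" URLs to an "_escaped_fragment_" query parameter that a site's server could answer with a static snapshot of the page. A minimal sketch of that mapping, with a hypothetical example URL, might look like this:

```python
from urllib.parse import quote

def escaped_fragment_url(url):
    """Map an AJAX '#!' URL to the crawler-friendly form described in
    Google's 2009 'Making AJAX Crawlable' proposal: the fragment after
    '#!' is re-sent as a percent-encoded _escaped_fragment_ parameter.
    A sketch, not the full specification."""
    if "#!" not in url:
        return url  # not an AJAX-crawlable URL; leave it alone
    base, _, fragment = url.partition("#!")
    sep = "&" if "?" in base else "?"
    return base + sep + "_escaped_fragment_=" + quote(fragment, safe="")

print(escaped_fragment_url("http://example.com/page#!state=about"))
# http://example.com/page?_escaped_fragment_=state%3Dabout
```

The headless browser complements this scheme: instead of relying on site owners to serve snapshots, the crawler itself renders the page and executes the JavaScript.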
2. Adobe Corporation, Adobe Flash Technology Enhances Search Results for Dynamic Content and Rich Internet Applications. (2008) URL: http://www.adobe.com/aboutadobe/pressroom/pressreleases/200806/070108AdobeRichMediaSearch.html
3. Adler, R., Stipins, J. Google Learns to Crawl Flash (2008) Google Official Blog URL: http://googleblog.blogspot.com/2008/06/google-learns-to-crawl-flash.html
4. Adler, R., Stipins, J., Ohye, M. Improved Flash Indexing (2008) Google Webmaster Central Blog URL: http://googlewebmastercentral.blogspot.com/2008/06/improved-flash-indexing.html
5. Schwartz, B. Live: CSS, AJAX, Web 2.0 & SEO (2009) Search Engine Roundtable URL: http://www.seroundtable.com/archives/020845.html
6. Mueller, J. A Proposal for Making AJAX Crawlable (2009) Google Webmaster Central Blog URL: http://googlewebmastercentral.blogspot.com/2009/10/proposal-for-making-ajax-crawlable.html
7. SEOMoz. Google Algorithm Change History (2011) URL: http://www.seomoz.org/google-algorithm-change#2011
Even if a crawler can read files of various formats and your database of URLs is very comprehensive, returning the most relevant results is a different technology in itself. Databases often index records in a specific format where data is arranged in pre-defined tables; each table has fields with index keys that define how one table relates to another, and search queries for the proper information are run against the appropriate fields. Full-text search of any text string on any type of file, with no specific field, table, or variable type, can return an overwhelming number of results, and this is where relevancy becomes important and search engine algorithms come into play.[8] The best crawler may have the worst text search algorithm. Some of the best algorithms for relevancy and search result quality include PageRank,[9] EigenTrust,[10] Hilltop,[11] HITS,[12] Topic-Sensitive PageRank,[13] and more. Given that the top crawlers in this paper were WebCopier and Google, and that WebCopier does not have these complex algorithms for ranking search results, Google would be the better tool in a business environment, not only for crawling but for searching as well.
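As an illustration of the kind of relevancy computation a crawler alone does not provide, here is a minimal sketch of PageRank's power iteration over a hypothetical three-page link graph. This is a teaching sketch of the idea from Brin and Page's paper, not a production implementation, which also has to handle scale, spam, and convergence criteria.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power iteration over an adjacency dict {page: [pages it links to]}.
    Each page repeatedly shares its rank with the pages it links to;
    the damping factor models a surfer who sometimes jumps at random."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # start with uniform rank
    for _ in range(iterations):
        new = {p: (1.0 - damping) / n for p in pages}
        for page, outs in links.items():
            if outs:
                share = rank[page] / len(outs)
                for target in outs:
                    new[target] += damping * share
            else:
                # dangling page: spread its rank evenly over all pages
                for target in pages:
                    new[target] += damping * rank[page] / n
        rank = new
    return rank

# Hypothetical graph: a links to b and c, b links to c, c links back to a.
ranks = pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]})
print(ranks)
```

In this toy graph, page c ends up ranked above b because it receives links from both a and b, while the ranks always sum to 1.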
8. Zobel, J. (RMIT University), Moffat, A. (The University of Melbourne), Inverted Files for Text Search Engines. ACM Computing Surveys, Vol. 38, No. 2, 1-55. (2006) Association for Computing Machinery
9. Brin, S., Page, L. (Stanford University), The Anatomy of a Large-Scale Hypertextual Web Search Engine (PageRank). Computer Networks and ISDN Systems, Vol. 30, No. 1-7, 107-117. (1998) Elsevier Science Publishers
10. Kamvar, S. D., Schlosser, M. T., Garcia-Molina, H. (Stanford University), The EigenTrust Algorithm for Reputation Management in P2P Networks. Proceedings of the 12th International World Wide Web Conference. (2003)
11. Bharat, K. (Compaq Research Center), Mihaila, G. A. (University of Toronto), Hilltop: A Search Engine Based on Expert Documents. Proceedings of the 9th International World Wide Web Conference. (2000)
12. Kleinberg, J. M. (Cornell University), Authoritative Sources in a Hyperlinked Environment (HITS). Journal of the Association for Computing Machinery, Vol. 46, No. 5, 604-632. (1999)
13. Haveliwala, T. H. (Stanford University), Topic-Sensitive PageRank: A Context-Sensitive Ranking Algorithm for Web Search. Proceedings of the 11th International World Wide Web Conference. (2002) Extended version: IEEE Transactions on Knowledge and Data Engineering, Vol. 15, No. 4, 784-796. (2003)
http://support.google.com/webmasters/bin/answer.py?hl=en&answer=80553

Redirection Handling

There are four ways to redirect one page to another:

- Meta Refresh HTML tag
- JavaScript window.location function
- HTTP server header 302 temporary redirect
- HTTP server header 301 permanent redirect
Search engines only honor the 301 redirect and pay little attention to the others, although if any of the others has been in place for over two years, search engines treat it as a 301 permanent redirect as well. The paper does not give specific details on the duration of the experiments, but it is highly unlikely that they ran for more than two years.
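A crawler that distinguishes the four redirect methods above might classify a fetched response roughly as follows. This is a simplified sketch: detect_redirect is a hypothetical helper, and real JavaScript redirects require far more robust parsing than a pattern match.

```python
import re

def detect_redirect(status=None, body=""):
    """Classify which redirect mechanism a fetched page uses, given the
    HTTP status code and the response body. Server-header redirects are
    checked first, then client-side redirects embedded in the HTML."""
    if status == 301:
        return "301 permanent redirect"
    if status == 302:
        return "302 temporary redirect"
    if re.search(r'http-equiv=["\']?refresh', body, re.I):
        return "meta refresh"
    if re.search(r'window\.location', body, re.I):
        return "JavaScript redirect"
    return "no redirect"

print(detect_redirect(status=301))
# 301 permanent redirect
print(detect_redirect(body='<meta http-equiv="refresh" content="0;url=/new/">'))
# meta refresh
```

The server-header redirects are visible without parsing the page at all, which is one reason simple crawlers handle them while missing the meta refresh and JavaScript variants.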
Report Conclusion
The paper was a good piece of research in terms of comparing crawlers within the range of crawlers it used. What the paper lacks is more research on the other crawlers that exist, some of which are popularly known to be among the best and yet were not used by the authors. The methodology was not implemented in the best way possible due to the authors' lack of knowledge of how search engines work, and the tests could clearly have been more accurate. The paper named WebCopier and Google the main winners without the authors knowing that Google can crawl Flash and AJAX, so they may have lost some data by misjudging Google, and their unawareness of the Google Search Appliance already shows strong signs that Google can outweigh WebCopier as the better crawler. PhD-level research by four professors of a university, published in an international IT journal as recently as 2012, can still show disappointing results, where knowledge and experience in the industry sometimes, or probably often, beats knowledge from the academe alone.
A copy of this paper can be downloaded at http://bit.ly/crawlerreview or by simply scanning the QR code.