
Mitchell Krieger

Using Linear Algebra to Calculate Page Rankings on Google

For over a decade, Google has been the dominant search engine on the internet. "To google" has even become a verb in the dictionary because of its influence on popular culture and the internet at large. Having your webpage at the top of a Google search exponentially increases the number of hits your website receives, due to Google's overwhelming popularity. As of 2010, Google gets 6.91 billion visits each day (via Wolfram Alpha). Furthermore, since its acquisition of YouTube in 2006, it has also remained the top authority on video searches on the internet. What makes Google so reliable? Its unique and innovative way of searching the web, which calculates each page's PageRank (importance score). Developed by Google co-founder Larry Page¹ at Stanford University in 1996, the PageRank system revolutionized the search engine by ranking the relevance and importance of a web page not by the number of times the search query is mentioned on the page, but by the number of other webpages that link to it. This creates a democracy on the internet, in which web pages endorse, or vote for, other pages as authorities on a topic simply by having a hyperlink to them on their own sites; this yields better search results than classic searches that count how many times a query is mentioned. In essence, in order to be the top result on Google, your webpage has to have the most other websites citing it as a source through hyperlinks. PageRank's patent is held by Stanford University, with an exclusive licensing agreement with Google, and this was the key to Google's success in the market. Let's concentrate on the algorithm that ranks websites rather than Google's other algorithm that goes through the process of counting the links. If the internet is thought of as an actual web of interconnected pages, it is easy to count the number of times each page receives a hyperlink referencing it.

¹ This is why it's called PageRank: after Larry Page, not because of web pages.

[Diagram: a five-site internet, sites A through E, with arrows showing which sites hyperlink to which]

In the diagram above, each letter represents a website on an internet that consists of a total of five websites (ignoring that search engines are also websites). Each arrow represents what a particular website hyperlinks to. If xA is defined to be the number of links that A is receiving, we can say xA = 4 because sites B, C, D and E all link to it. It is then easy to write a computer program that sorts these counts from greatest to least, and a simple version of ranking the importance of pages has been created. In this case a search engine would list A as the first result in a search, then B and E (both xB and xE = 2), then D (xD = 1), and lastly C (xC = 0). This seems like a nice and simple way to rank pages, but we must also consider the credibility or significance of a page. If site A is a major website such as Yahoo!, CNN, or BBC, a link from it is more significant than a link from site C, which is a minor website such as www.tomsmomscookies.com. Therefore, although sites B and E have equal x values, E should be ranked higher in a search than B, because a major website has linked to it, giving it much more significance than a link from the relatively unimportant site C.
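The link-counting scheme just described can be sketched in a few lines of Python. The outbound-link structure below is an assumption chosen to reproduce the counts quoted above; the original diagram's exact arrows are not recoverable here.

```python
# Naive ranking: count inbound links for each site and sort.
# "outbound" maps each site to the sites it links to. This structure is
# an assumption consistent with the counts in the text (xA = 4,
# xB = xE = 2, xD = 1, xC = 0), not necessarily the original diagram.
outbound = {
    "A": ["E"],
    "B": ["A"],
    "C": ["A", "B", "D"],
    "D": ["A", "B", "E"],
    "E": ["A"],
}

# Tally how many links each site receives.
inbound_count = {site: 0 for site in outbound}
for source, targets in outbound.items():
    for target in targets:
        inbound_count[target] += 1

# Sort sites from most-linked to least-linked.
ranking = sorted(inbound_count, key=inbound_count.get, reverse=True)
print(ranking)  # ['A', 'B', 'E', 'D', 'C']
```

This reproduces the ordering in the text: A first, then B and E tied, then D, and C last.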


So how do we incorporate the significance of a link from a website? Rather than simply counting the number of links to a site to calculate its importance score, we can instead add the importance scores of all the sites that link to it. This effectively creates a hierarchy in which a link from a site of greater significance increases a website's importance score more. Site A's importance score could then be calculated as xA = xB + xC + xD + xE, because sites B, C, D and E all link to it. The problem with this model, however, is that a site with more outbound links has more influence over which pages are ranked higher in a search. To combat this, we can dilute a score by dividing a site's score (x) by the number of times that site links to any other site (n); in other words, each page's score is divided up and given to the sites the page links to. So we can now write the score xA:

xA = xB/nB + xC/nC + xD/nD + xE/nE

Or in a general form:

xA = sum of xj/nj over all sites j in LA

where LA is the set of sites that link to site A. Now we can put together a system of equations for the simplified version of the internet above:

xA = xB/2 + xC/2 + xD/3 + xE/1
xB = xC/2 + xD/3
xC = xB/2
xD = xC/2
xE = xA/1 + xD/3

Put into a matrix (and assuming you can't link to yourself), with rows and columns ordered A through E:

M = [ 0   1/2  1/2  1/3  1 ]
    [ 0   0    1/2  1/3  0 ]
    [ 0   1/2  0    0    0 ]
    [ 0   0    1/2  0    0 ]
    [ 1   0    0    1/3  0 ]
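The eigenvector computation described next can be checked numerically. Here is a minimal sketch, assuming NumPy is available; the matrix entries come directly from the system of equations above.

```python
import numpy as np

# Link matrix for the five-site web, rows/columns ordered A..E.
# Entry M[i, j] is 1/n_j if site j links to site i, else 0, so that
# M @ x reproduces the system of equations in the text.
M = np.array([
    [0,   1/2, 1/2, 1/3, 1],   # xA = xB/2 + xC/2 + xD/3 + xE/1
    [0,   0,   1/2, 1/3, 0],   # xB = xC/2 + xD/3
    [0,   1/2, 0,   0,   0],   # xC = xB/2
    [0,   0,   1/2, 0,   0],   # xD = xC/2
    [1,   0,   0,   1/3, 0],   # xE = xA/1 + xD/3
])

eigenvalues, eigenvectors = np.linalg.eig(M)

# Select the eigenvector whose eigenvalue is (numerically) 1.
idx = int(np.argmin(np.abs(eigenvalues - 1)))
x = np.real(eigenvectors[:, idx])
if x.sum() < 0:          # fix the arbitrary sign returned by eig
    x = -x
print(np.round(x, 6))    # approximately [0.707107, 0, 0, 0, 0.707107]
```

The result matches the eigenvector quoted in the text: sites A and E share all of the importance, and B, C and D get essentially none.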


Here Mx = x, where x = [xA, xB, xC, xD, xE]T. If we can find an eigenvector whose eigenvalue is 1 (thereby satisfying the equation Mx = x), we can calculate a new, more accurate importance score for each site. The real eigenvalues of matrix M are (1, -1, -0.62), and the eigenvector for eigenvalue 1 is [0.707107, 0, 0, 0, 0.707107]T. Here we can see that websites A and E will be ranked high while the other websites are ranked low. What is interesting is that sites A and E were given the same ranking value, even though A receives many more links. This demonstrates the power and equity that exist in the PageRank algorithm. Site A was obviously ranked high because every other site in the web references it, and that may be due to the quality, authority, or resources (money and time to spend on the creation of the website) that site A has. In a classic search engine a small website like E might be ranked very low, but because the large site A gives credibility to it, it gets ranked high, providing a fair chance for all sites to use Google as a way to gain visits. An example of this might be an independently run grocery store in a small town; we will use Atkins Farms in Amherst, MA as an example. Atkins might not have the resources to build a fancy website or employ a web developer who could get them a lot of hits. So if you were to search "Atkins" in a classical search engine, many pages that discuss the Atkins diet might come up long before you ever reached the actual Atkins Farms website. On a search engine using the PageRank algorithm, the first result will still be the Atkins diet main page, but the second result will be Atkins Farms, because the algorithm dilutes the power of other websites' importance to the search query "Atkins". The PageRank algorithm isn't perfect, though. In a phenomenon that has become known as "google bombing" or "google washing", the results of a Google search can be greatly distorted. If there is enough movement among people to link to certain websites, ranks may be changed, even


if the websites' relevance to a search query is low. Many times this is done for political or commercial gain. The most famous of these was the google bombing of former President George W. Bush's biography. In 2006, if you typed the query "miserable failure" into Google, the first result was a link to a page on the White House's official website detailing the life of then-President Bush. This was done by a movement of people who posted links to the President's biography with the anchor text "miserable failure". Since then, Google has rearranged the PageRank algorithm so that the first result for the query "miserable failure" is a link to an article about the phenomenon of google bombing, and it has also made google bombing significantly harder. Google has not released the algorithm used to combat Google bombs, but one could guess that another variable representing the subject of the linking site might be involved. For example, if a small-town general store's website links to George W. Bush's biography, it may be reasonable to be suspicious about the reason for the link. Another issue is that not all websites on the internet are interconnected with hyperlinks, nor are all websites linked to. These are called dangling nodes. PageRank takes this into account by thinking about the internet as multiple webs:

[Diagram: two disconnected webs, one containing sites A, E and F, the other containing sites B, C and D]

In the example above, we find that B, C and D are not connected to A, E, and F, and that sites C and E have no other sites linking to them. How do we find a universal reference point for ranking


pages between webs, when our algorithm is contingent upon self-referencing importance scores within a web? PageRank does this by creating matrices built up from the matrices that describe each individual web. Eigenvectors and eigenvalues can then be found using a similar method as before to create an importance ranking of webs, so that rankings of sites within those webs can be compared.
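The "multiple webs" idea can be sketched as a block-diagonal matrix: when no hyperlink crosses between disconnected webs, the combined link matrix splits into blocks, and its eigen-analysis reduces to the eigen-analysis of each block. The two small webs below are illustrative assumptions, not the webs in the figure (assumes NumPy):

```python
import numpy as np

# Two disconnected webs, each described by its own column-stochastic
# link matrix (illustrative examples, not the webs in the figure).
web1 = np.array([[0.0, 1.0],
                 [1.0, 0.0]])        # two sites linking to each other
web2 = np.array([[0.0, 0.5, 0.5],
                 [0.5, 0.0, 0.5],
                 [0.5, 0.5, 0.0]])   # three sites, each linking to the other two

# The combined matrix is block diagonal: no link crosses between webs.
n1, n2 = web1.shape[0], web2.shape[0]
M = np.zeros((n1 + n2, n1 + n2))
M[:n1, :n1] = web1
M[n1:, n1:] = web2

# The spectrum of M is the union of the blocks' spectra, so each web
# can be ranked separately and the results combined afterwards.
eig_M = np.sort(np.linalg.eigvals(M).real)
eig_blocks = np.sort(np.concatenate(
    [np.linalg.eigvals(web1).real, np.linalg.eigvals(web2).real]))
print(np.allclose(eig_M, eig_blocks, atol=1e-6))  # True
```

In practice PageRank goes further than this sketch (it also perturbs the matrix so that every page is reachable), but the block structure is the essence of the "multiple webs" description above.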
This is a very simplified overview of the PageRank algorithm, which is much more complex than what has been outlined here; the full system also includes the "did you mean" feature, the related-searches feature, sponsored advertisements, and other features that Google has spent years developing. It is quite amazing how a relatively simple mathematical algorithm has revolutionized the way information is found, and it continues to be used 17 years after its inception.
