
Journal of Security Engineering
Vol. 2, No. 1, November 2005

Reliable Evaluations of URL Normalization


Sung Jin Kim¹, Hyo Sook Jeong², Sang Ho Lee³
¹ School of Computer Science and Engineering, Seoul National University
² Department of Computing, Graduate School, Soongsil University
³ School of Computing, Soongsil University

sjkim@oopsla.snu.ac.kr, hsjeong@comp.ssu.ac.kr, shlee@comp.ssu.ac.kr

Abstract
URL normalization is a process of transforming URL strings into canonical form. Through this process, duplicate URL representations of web pages can be reduced significantly. There are a number of normalization methods. In this paper, we describe four metrics for evaluating normalization methods; the reliability and consistency of a URL are also considered in our evaluation. With the metrics proposed, we evaluate seven normalization methods. The evaluation results on over 25 million URLs, extracted from the web, are reported in this paper.

1. Introduction

A Uniform Resource Locator (URL) is a string representing a web resource (hereafter referred to as a "web page"). With a URL, we can access a single web page on the World Wide Web (WWW). A web page can have two or more (syntactically different) URLs with which it can be accessed. Equivalent URLs are URLs that are syntactically different but represent the same page. The inability to recognize equivalent URLs gives rise to a large processing overhead in web applications; for example, a web crawler may repeatedly request, download, and store the same page, resulting in unnecessary network bandwidth, disk I/Os, disk space, and so on.

URL normalization is a process of transforming URL strings into canonical form. After normalization, identically transformed URLs are regarded as equivalent URLs. Basically, the URL normalization determines whether two URLs are equivalent prior to accessing the corresponding web pages. The term "false positive" means that non-equivalent URLs are determined to be equivalent, whereas "false negative" means that equivalent URLs are determined to be non-equivalent.

The standards body [1] defined three types of URL normalization, namely the syntax-based normalization, the scheme-based normalization, and the protocol-based normalization. The standard normalizations reduce false negatives while strictly avoiding false positives (they never transform non-equivalent URLs into a syntactically identical string). Lee and Kim [6] argued the necessity of extending the standard normalization methods and introduced three issues of extended normalization: the case sensitivity of the path component, the trailing slash symbol at the end of the path component, and the designation of a default page.

Selecting which URL normalization methods to use depends on the web application. Users should take into consideration the efficiency and effectiveness of their web applications. If effectiveness is the more important factor, users should select normalization methods that do not cause false positives. On the other hand, if efficiency is more important, users should select normalization methods that reduce the number of duplicate URLs as much as possible.

The goal of this paper is to evaluate normalization methods in a reliable way. We describe four evaluation metrics. First, the URL consistency measures how consistently a URL retrieves the same page during a given time unit. Second, the URL applying rate represents how many URLs are transformed by a normalization method. Third, the URL reduction rate represents how many URLs are reduced (how many URLs become the same) after a normalization method is applied to a set of URLs. Fourth, the true positive rate represents how many URLs are transformed correctly. Finally, with the metrics we propose, we evaluate the standard URL normalization methods. The evaluation was performed on over 25 million URLs, which were extracted from 20,000 Korean web sites in July 2005.

Our paper is organized as follows. In section 2, URL normalization is discussed. In section 3, we describe the evaluation metrics. Section 4 presents the experimental results, and lastly, section 5 contains the closing remarks.
2. Preliminary Study
This work was supported by Korea Research Foundation Grant (KRF-2004-005-D00172).


2.1 URL Representation

A URL is composed of five components: the scheme, authority, path, query, and fragment components. Fig. 1 shows all the components of a URL.

http://example.com:8042/over/there?name=ferret#nose
(scheme: "http", authority: "example.com:8042", path: "/over/there", query: "name=ferret", fragment: "nose")

Fig. 1. An example of a URL

The scheme component contains a protocol (here, the Hypertext Transfer Protocol) that is used for communication between a web server and a client. The authority component has three subcomponents: user information, host, and port. The user information may consist of a user name and, optionally, scheme-specific information about how to gain authorization to access the resource. The user information, if present, is followed by a commercial at-sign ("@") that delimits it from the host. The host component contains the location of a web server; the location can be described as either a domain name or an IP (Internet Protocol) address. A port number can be specified in the component, prefixed by the colon symbol (":"). For instance, the port number of the example URL is 8042.

The path component contains directories, including a web page and a file name of the page. The query component contains parameter names and values that may be supplied to web applications. The query string starts with the question mark symbol ("?"), and a parameter name and a parameter value are separated by the equals symbol ("="). For instance, in Fig. 1, the value of the "name" parameter is "ferret". The fragment component is used for indicating a particular part of a document. The fragment string starts with the sharp symbol ("#"). For instance, the example URL denotes a particular part (here, "nose") of the "there" page.

A percent-encoding mechanism is used to represent a data octet in a URL component when that octet's corresponding character is outside the allowed set or is being used as a delimiter of, or within, the component. A percent-encoded octet is encoded as a character triplet, consisting of the percent character "%" followed by the two hexadecimal digits representing that octet's numeric value. For example, "%20" is the percent-encoding for the binary octet "00100000", which corresponds to the space character in US-ASCII.

2.2 Standard URL Normalization

The URL normalization is a process that transforms a URL into a canonical form. During the URL normalization, syntactically different URLs that are equivalent should be transformed into a syntactically identical URL (simply, the same URL string), and URLs that are not equivalent should not be transformed into a syntactically identical URL. The standard document [1] describes three types of standard URL normalization: syntax-based normalization, scheme-based normalization, and protocol-based normalization.

The syntax-based normalization uses logic based on the URL definitions provided by the standard specification to reduce the probability of false negatives. The following three normalizations belong to the syntax-based normalization:

- Case normalization
- Percent-encoding normalization
- Path segment normalization

The hexadecimal digits within a percent-encoding triplet (e.g., "%3a" versus "%3A") are case-insensitive. The scheme and host components are also case-insensitive. The case normalization transforms the hexadecimal digits a-f within a triplet into upper-case letters, and transforms characters in the scheme and host components into lower-case letters. For example, "HTTP://EXAMPLE.com" is transformed into "http://example.com".

During the percent-encoding normalization, all percent-encoded unreserved characters (i.e., uppercase and lowercase letters, decimal digits, hyphens, periods, underscores, and tildes) should be decoded. For example, "http://example.com/%7Esmith" should be transformed into "http://example.com/~smith".

The path segments "." and ".." are intended only to be used within relative references. During the path segment normalization, the path segments "." and ".." are removed. When ".." is removed, the path segment located on its left side is also removed, as deemed necessary. For example, "http://example.com/a/b/./../c.htm" is normalized into "http://example.com/a/c.htm".

The scheme-based normalization may use scheme-specific rules, at additional processing cost (compared with the syntax-based normalization), to reduce the probability of false negatives. Given the "http" scheme, the following normalizations can be done. First, the default port number (i.e., 80 for the "http" scheme) is truncated from the URL, since two URLs with or without the default port number represent the same page. For example, "http://example.com:80/" is normalized into "http://example.com/". Second, if a path string is null, the path string is transformed into "/". A URL with a null path string and a URL with a "/" path string represent the same page. For instance, "http://example.com" and "http://example.com/" represent the same page; the former URL is transformed into the latter one. Third, a URL with a fragment and the same URL without the fragment represent the same page. For instance, "http://example.com/list.htm#chap1" and "http://example.com/list.htm" represent the same page. During the normalization, the fragment in the URL is truncated; the former URL is transformed into the latter one.

The protocol-based normalization is only appropriate when equivalence is clearly indicated by both the result of accessing the resources and the common conventions of their scheme's dereference algorithm (in this case, redirection is used by HTTP origin servers to avoid problems with relative references). For example, "http://example.com/a/b" (if the path segment "b" represents a directory) is very likely to be redirected to "http://example.com/a/b/".
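The syntax-based and scheme-based normalizations described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation; the function name `normalize` and its exact rule set are our own, and the sketch deliberately preserves trailing slashes so that it stays within the standard (false-positive-free) normalizations.

```python
# A minimal sketch (not the paper's implementation) of syntax-based and
# scheme-based URL normalization, using only the Python standard library.
import posixpath
import re
from urllib.parse import urlsplit, urlunsplit

def normalize(url):
    scheme, netloc, path, query, fragment = urlsplit(url)

    # Case normalization: scheme and host become lower-case.
    scheme = scheme.lower()
    netloc = netloc.lower()

    # Percent-encoding normalization: decode percent-encoded unreserved
    # characters; upper-case the hex digits of any remaining triplet.
    def decode_unreserved(m):
        ch = chr(int(m.group(1), 16))
        return ch if ch.isalnum() or ch in "-._~" else m.group(0).upper()
    path = re.sub(r"%([0-9A-Fa-f]{2})", decode_unreserved, path)

    # Path segment normalization: remove "." and resolve "..".
    # posixpath.normpath drops a trailing slash, so we restore it to avoid
    # silently applying the (non-standard) trailing-slash elimination.
    if path:
        had_slash = path.endswith("/")
        path = posixpath.normpath(path)
        if had_slash and not path.endswith("/"):
            path += "/"

    # Scheme-based normalization for "http": drop the default port,
    # turn a null path into "/", and truncate the fragment.
    if scheme == "http" and netloc.endswith(":80"):
        netloc = netloc[:-3]
    if path == "":
        path = "/"
    fragment = ""

    return urlunsplit((scheme, netloc, path, query, fragment))

print(normalize("HTTP://EXAMPLE.com:80/a/b/./../%7Esmith#nose"))
# → "http://example.com/a/~smith"
```

Running the sketch on the paper's examples reproduces the expected canonical forms, e.g. "http://example.com" becomes "http://example.com/" and the fragment of "http://example.com/list.htm#chap1" is truncated.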


2.3 Reliability and Consistency

A URL does not in itself pose a security threat. However, as URLs are often used to provide a compact set of instructions for access to network resources, care must be taken to properly interpret the data within a URL, to prevent that data from causing unintended access, and to avoid including data that should not be revealed in plain text.

There is no guarantee that once a URL has been used to retrieve a web page, the same page will be retrievable by that URL in the future. Nor is there any guarantee that the page retrievable via that URL in the future will be observably similar to that retrieved in the past. The URL syntax does not constrain how a given scheme or authority apportions its namespace or maintains it over time. Such guarantees can only be obtained from the person(s) controlling that namespace and the page in question.

3. Evaluation Metrics for URL Normalization

This section describes four metrics (namely, URL consistency, URL applying rate, URL reduction rate, and true positive rate) for evaluating URL normalization. We define the URL consistency metric in order to evaluate normalization methods on consistent URLs only. Given a time unit t, a "consistent" URL is a URL via which the same page has been retrieved throughout the time unit. Let Rt be the number of download requests during the time unit t. The URL consistency metric is defined as below:

▪ URL consistency = 1 – ((the number of unique pages – 1) / (Rt – 1))

If downloading a web page is unsuccessful, the downloaded page is regarded as a page with a null string. For example, suppose that we request a web page five times in a second, and that the download results are ⓐ, ●, ⓑ, ●, ⓒ, where a black circle means that downloading was unsuccessful and circled characters denote the contents of downloaded pages. Then, the URL consistency is 1 – ((4 – 1) / (5 – 1)) = 0.25.

URL consistency is critical for evaluating URL normalization reliably. Note that once a URL has been used to retrieve a web page, there is no guarantee that the same page will be retrievable by that URL in the future. Hence, when a normalization method is applied to an inconsistent URL, the request results via the original URL and its normalized URL cannot be compared reliably. In other words, even if the pages retrieved via an inconsistent URL and via its normalized URL are different, we cannot be sure that the normalization method is incorrect. It is therefore necessary to evaluate normalization on URLs with sufficiently high URL consistency values.

Let Mb be the total number of URLs to be handled (or to be collected) before normalization. The URL applying rate represents how many URLs participate in the normalization. Let N be the number of URLs to which a normalization method is applied (i.e., the number of URLs that are actually transformed). The URL applying rate is defined as below:

▪ URL applying rate = N / Mb

For example, let us suppose we collect 100 URLs (i.e., Mb = 100), namely u1, u2, …, u100, and that the last ten URLs are normalized into u1, u1, u101, u101, u101, u101, u102, u103, u104, and u105, respectively. Then N = 10, and the URL applying rate is (10 / 100) = 0.1.

When different URL strings become identical after normalization, users keep only one URL among the identically transformed URLs and throw away the others. Let Ma be the total number of URLs to be handled after normalization. We define the URL reduction rate as below:

▪ URL reduction rate = (Mb – Ma) / N

In the above example, Ma is 95 because 95 URLs (i.e., u1, u2, …, u90, u101, u102, u103, u104, u105) remain after normalization. As a result, the URL reduction rate is (100 – 95) / 10 = 0.5. The URL reduction rate shows how many URL strings become equal to others. More precisely, this metric represents the probability that a normalized URL ux equals either the original form of a non-normalized URL (i.e., one of u1 to u90) or the transformed form (i.e., u1, u101, u102, u103, u104, u105) of another normalized URL (i.e., one of u91 to u100).

When an original URL string is transformed into another URL string, the original URL is not used any more. Therefore, when the pages downloaded with the original and the transformed URLs are different, the original page could be lost. When those pages are identical, we call the transformation a correct transformation. The true positive rate represents how correctly a normalization method transforms URLs, and is defined as below:

▪ True positive rate = the number of correct URL transformations / N

For example, suppose that nine transformations are correct but one transformation is incorrect. Then, the true positive rate is (9/10) = 0.9.

4. Empirical Evaluation

Our experiment was performed with the following procedure. First, the robot [4] was used to collect web pages. Second, we extracted raw URLs (URL strings as found) from the collected web pages. Third, we eliminated duplicate URLs with simple string comparison methods (which will be discussed in more detail later) to obtain a set of URLs to be normalized. This step is simply intended to get a set of URLs that are syntactically different from each other, irrespective of URL normalization. Fourth, relative URLs were transformed into absolute URLs. Fifth, we applied each of the standard normalization methods to the absolute URLs. Sixth, after requesting web pages with the absolute URLs and their normalized URLs, we compared the download results before and after the normalization.

We randomly selected 20,000 Korean sites. The web robot collected 655,645 web pages from the sites in July 2005. The robot was allowed to crawl at most 3,000 pages per site, and it requested web pages within nine hops from the root page of a site. The timeout was set to two seconds: if there was no communication between the web robot and the web server for two seconds, the robot


gave up the URL.
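Before turning to the data, the four metrics defined in section 3 can be replayed with a short script. This is a minimal sketch in our own (hypothetical) names, not the authors' code; it reproduces the worked examples given with each metric definition.

```python
# A minimal sketch of the four evaluation metrics from section 3,
# replayed on the paper's worked examples.

def url_consistency(pages):
    # pages: contents downloaded via one URL during the time unit,
    # with None standing for a failed download; len(pages) = Rt requests.
    rt = len(pages)
    unique_pages = len(set(pages))
    return 1 - (unique_pages - 1) / (rt - 1)

def applying_and_reduction_rates(before, after):
    # before/after: parallel lists mapping each collected URL (Mb in total)
    # to its normalized form.
    mb = len(before)
    n = sum(1 for b, a in zip(before, after) if a != b)  # transformed URLs
    ma = len(set(after))        # distinct URLs kept after normalization
    return n / mb, (mb - ma) / n

def true_positive_rate(correct_transformations, n):
    return correct_transformations / n

# Worked example: 100 URLs u1..u100; the last ten are normalized into
# u1, u1, u101, u101, u101, u101, u102, u103, u104, u105.
before = [f"u{i}" for i in range(1, 101)]
after = before[:90] + ["u1", "u1", "u101", "u101", "u101",
                       "u101", "u102", "u103", "u104", "u105"]
print(applying_and_reduction_rates(before, after))   # → (0.1, 0.5)

# Consistency example: results a, <fail>, b, <fail>, c over five requests.
print(url_consistency(["a", None, "b", None, "c"]))  # → 0.25

# Nine of ten transformations correct.
print(true_positive_rate(9, 10))                     # → 0.9
```

The script confirms the figures quoted in section 3: an applying rate of 0.1, a reduction rate of 0.5, a consistency of 0.25 for the five-request example, and a true positive rate of 0.9.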


From the collected web pages, we were able to extract over 25 million (exactly, 25,838,285) raw URLs, where a single URL could be counted many times if it was found in many places. For instance, when the string "http://www.example.com/" was found twice on the same page, the number of extracted URLs was counted as two in our experiment.

First, let us see how often raw URLs are found in duplicate on web pages. We considered the following three cases. First, we eliminated duplicates of syntactically identical raw URLs found on the same web page. Second, we eliminated duplicates of syntactically identical URLs starting with the slash symbol (meaning that these URLs are expressed as an absolute path) as long as they were found on the same site. Third, we eliminated duplicates of syntactically identical URLs starting with the "http:" prefix. Table 1 shows the number of remaining URLs after each elimination method was applied. Note that these simple eliminations of duplicate URLs removed more than half of all the raw URLs found in the beginning.

Table 1. Eliminating duplicate URLs

Actions | Number of remaining URLs | Percent of remaining URLs to all the extracted URLs
All extracted URLs | 25,838,285 | 100%
Eliminate the same URLs on a web page | 22,757,954 | 88.1%
Eliminate the same URLs expressed as an absolute path on each site | 19,647,693 | 76.0%
Eliminate the same URLs starting with "http:" | 11,046,159 | 42.8%

After transforming relative URLs, in which some components of the URL are omitted, into absolute URLs, we obtained 2,329,770 unique absolute URLs. We computed the URL consistencies of the absolute URLs with Rt = 3, where t is one second. When we request web pages three times with an absolute URL, three consistency values can be produced by our consistency metric. When the number of unique downloaded pages is 1, the consistency of the URL is 1, because the same page is downloaded successively (or consistently). When the number of unique downloaded pages is 3, the consistency is 0, because a different page is downloaded on every request. When the number of unique downloaded pages is 2, the consistency is 0.5 (i.e., 1 – ((2 – 1) / (3 – 1)) = 0.5).

Fig. 2. Distribution of URL consistency (a bar chart of the number of URLs against the consistency values 0, 0.5, and 1)

Fig. 2 shows the distribution of the URL consistencies. The X-axis represents the consistency value, and the Y-axis represents the number of URLs whose consistency value corresponds to that of the X-axis. About 31% (exactly, 718,038) of the absolute URLs were completely inconsistent (i.e., their consistency values were 0). There were 1,515,522 URLs (approximately 65% of the absolute URLs) whose consistencies were 1. Only consistent absolute URLs were used for evaluating the normalization methods.

We evaluated the following seven normalization methods:

- Method 1: Change letters in the scheme component into lower-case letters
- Method 2: Change letters in the host component into lower-case letters
- Method 3: Eliminate the default port (i.e., ":80")
- Method 4: Transform a null path string into the slash symbol
- Method 5: Decode unreserved characters
- Method 6: Eliminate the fragment component
- Method 7: Eliminate the trailing slash symbol

The first six methods (i.e., Methods 1 to 6) are defined in the standard document [1]; the last method (i.e., Method 7) is introduced in [1] and [6].

Fig. 3. Evaluation results of the seven normalization methods (a bar chart of the URL applying rate, URL reduction rate, and true positive rate for Methods 1 to 7)

Fig. 3 shows the URL applying rate, URL reduction rate, and true positive rate of the seven normalization methods. The applying rates of Methods 1 to 4 were below 0.01, and those of Methods 5 to 7 were 0.03, 0.12,

and 0.03, respectively. The reduction rates of Methods 2, 3, and 7 were below 0.05. The reduction rates of Methods 1 and 4 showed that about one third of the URLs that we transformed were removed. About half (48.3%) of the URLs to which we applied Method 5 were removed, and most (97.3%) of the URLs to which we applied Method 6 were removed. The applying rate and the reduction rate of Method 6 were relatively very high: 11.7% of the absolute URLs were transformed by Method 6, and 97.3% of the transformed URLs were duplicates. Note that the standard normalizations (Methods 1 to 6) do not cause false positives. The true positive rates of Methods 1 to 6 were 1, and that of Method 7 was 0.95. We learned that approximately 5% of the URLs transformed by Method 7 were wrongly transformed, even though the method reduced about 5% of the transformed URLs.

5. Conclusion and Future Work

In this paper, we described four metrics for evaluating URL normalization methods. The proposed metrics are summarized in Table 2.

Table 2. Summary of the proposed metrics

Metric | Description
URL consistency | Represents how consistently a URL is used to retrieve the same page during a given time unit
URL applying rate | Represents how many URLs are transformed by a normalization method
URL reduction rate | Represents how many URLs are reduced (how many URLs become identical) after a normalization method is applied
True positive rate | Represents how correctly URLs are transformed

With the metrics proposed, we evaluated seven normalization methods. The first six methods were standard normalization methods, and the last method (Method 7) eliminated the trailing slash symbol. Among the 2,329,770 URLs, approximately 65% were consistent URLs. The true positive rates of the standard normalization methods were, of course, 1. The true positive rate of the seventh method was 0.95, which means that 5% of the URLs were incorrectly transformed by eliminating the trailing slash symbol. The sixth method (i.e., eliminating the fragment component) exhibited the highest URL applying rate and URL reduction rate.

In practice, the adoption of URL normalization methods has been treated heuristically so far, in that each normalization method is primarily devised on the basis of developer experience. The contributions of this paper include details on the effects of URL normalization and, at the same time, an analytic way to evaluate URL normalization methods. Our metrics can be used to evaluate not only the standard normalization methods but also extended normalization methods to be developed in the future.

Lastly, we would like to mention our future work. First, we plan to devise evaluation metrics measuring the effectiveness of combinations of the normalization methods; the order in which the normalization methods are applied will be investigated, too. Second, we will study how to find equivalent URLs effectively. Using information such as page contents, site characteristics, and so on, we can build a mapping table in which pairs of equivalent URLs are listed, and then normalize URLs not only by using the normalization rules but also by referring to the mapping table.

References

[1] Berners-Lee, T., Fielding, R., and Masinter, L.: Uniform Resource Identifiers (URI): Generic Syntax, http://gbiv.com/protocols/uri/rev-2002/rfc2396bis.html, (2005)
[2] Burner, M.: Crawling Towards Eternity: Building an Archive of the World Wide Web, Web Techniques Magazine, Vol. 2, No. 5, (1997) 37-40
[3] Heydon, A. and Najork, M.: Mercator: A Scalable, Extensible Web Crawler, International Journal of WWW, Vol. 2, No. 4, (1999) 219-229
[4] Kim, S.J. and Lee, S.H.: Implementation of a Web Robot and Statistics on the Korean Web, Springer-Verlag Lecture Notes in Computer Science, Vol. 2713, (2003) 341-350
[5] Kim, S.J. and Lee, S.H.: An Empirical Study on the Change of Web Pages, Springer-Verlag Lecture Notes in Computer Science, Vol. 3399, (2005) 632-642
[6] Lee, S.H., Kim, S.J. and Hong, S.: On URL Normalization, Springer-Verlag Lecture Notes in Computer Science, Vol. 3481, (2005) 1076-1085
[7] Shkapenyuk, V. and Suel, T.: Design and Implementation of a High-performance Distributed Web Crawler, In Proceedings of the 18th International Conference on Data Engineering, (2002) 357-368
[8] Netcraft: Web Server Survey, http://news.netcraft.com/archives/web_server_survey.html, (2004)
