Journal of Security Engineering
Vol. 2, No. 1, November 2005
Abstract

URL normalization is the process of transforming URL strings into a canonical form. Through this process, duplicate URL representations of web pages can be reduced significantly. There are a number of normalization methods. In this paper, we describe four metrics for evaluating normalization methods. The reliability and consistency of a URL are also considered in our evaluation. With the proposed metrics, we evaluate seven normalization methods. The evaluation results on over 25 million URLs extracted from the web are reported in this paper.
2.3 Reliability and Consistency

A URL does not in itself pose a security threat. However, as URLs are often used to provide a compact set of instructions for access to network resources, care must be taken to properly interpret the data within a URL, to prevent that data from causing unintended access, and to avoid including data that should not be revealed in plain text.

There is no guarantee that once a URL has been used to retrieve a web page, the same page will be retrievable by that URL in the future. Nor is there any guarantee that the page retrievable via that URL in the future will be observably similar to that retrieved in the past. The URL syntax does not constrain how a given scheme or authority apportions its namespace or maintains it over time. Such guarantees can only be obtained from the person(s) controlling that namespace and the page in question.

3. Evaluation Metrics for URL Normalization

This section describes four metrics (namely, URL consistency, URL applying rate, URL reduction rate, and true positive rate) for evaluating URL normalization. We define the URL consistency metric in order to evaluate normalization methods on consistent URLs. Given a time unit t, a "consistent" URL is a URL via which the same page has been retrieved throughout the time unit. Let Rt be the number of download requests during the time unit t. The URL consistency metric is defined as follows:

▪ URL consistency = 1 – ((the number of unique pages – 1) / (Rt – 1))

If downloading a web page is unsuccessful, the downloaded page is regarded as a page with the null string. For example, suppose that we request a web page five times within a second, and that the download results are ⓐ, ●, ⓑ, ●, ⓒ, where black circles mean that downloading was unsuccessful and circled characters denote the contents of downloaded pages. Then, the URL consistency is 1 – ((4 – 1) / (5 – 1)) = 0.25.

URL consistency is critical for evaluating URL normalization reliably. Note that once a URL has been used to retrieve a web page, there is no guarantee that the same page will be retrievable by that URL in the future. Hence, when a normalization method is applied to an inconsistent URL, the request results via the original URL and its normalized URL cannot be compared reliably. In other words, even when the pages retrieved via an inconsistent URL and its normalized URL differ, we cannot be sure that the normalization method is incorrect. Normalization should therefore be evaluated on URLs with sufficiently high URL consistency values.

Let Mb be the total number of URLs to be handled (or to be collected) before normalization. The URL applying rate represents how many URLs are affected by normalization. Let N be the number of URLs to which a normalization method is applied. The URL applying rate is defined as follows:

▪ URL applying rate = N / Mb

For example, let us suppose we collect 100 URLs (i.e., Mb = 100), such as u1, u2, …, u100, and that the last ten URLs are normalized into u1, u1, u101, u101, u101, u101, u102, u103, u104, and u105, respectively. Then, N = 10 and the URL applying rate is (10 / 100) = 0.1.

Since different URL strings can become identical after normalization, users keep only one URL among the identically transformed URLs and throw away the others. Let Ma be the total number of URLs to be handled after normalization. We define the URL reduction rate as follows:

▪ URL reduction rate = (Mb – Ma) / N

In the above example, Ma is 95 because 95 URLs (i.e., u1, u2, …, u90, u101, u102, u103, u104, u105) remain after normalization. As a result, the URL reduction rate is (100 – 95) / 10 = 0.5. The URL reduction rate shows how many URL strings become equal to others. More precisely, this metric represents the probability that a normalized URL ux equals either the original form of a non-normalized URL (i.e., one of u1 to u90) or the transformed form (i.e., u1, u101, u102, u103, u104, u105) of another normalized URL (i.e., one of u91 to u100).

When an original URL string is transformed into another URL string, the original URL is not used any more. Therefore, when the pages downloaded with the original and the transformed URLs are different, the original page could be lost. When those pages are identical, we call the transformation a correct transformation. The true positive rate represents how correctly a normalization method transforms URLs. The true positive rate is defined as follows:

▪ True positive rate = the number of correct URL transformations / N

For example, suppose that nine transformations are correct but one transformation is incorrect. Then, the true positive rate is (9 / 10) = 0.9.

4. Empirical Evaluation

Our experiment was performed in the following procedure. First, the robot [4] was used to collect web pages. Second, we extracted raw URLs (URL strings as found) from the collected web pages. Third, we eliminated duplicate URLs with simple string comparison methods (discussed in more detail later) to obtain the set of URLs to be normalized. This step is simply intended to obtain a set of URLs that are syntactically different from each other, irrespective of URL normalization. Fourth, relative URLs were transformed into absolute URLs. Fifth, we applied each of the standard normalization methods to the absolute URLs. Sixth, after requesting web pages with the absolute URLs and their normalized URLs, we compared the download results before and after the normalization.

We randomly selected 20,000 Korean sites. The web robot collected 655,645 web pages from the sites in July 2005. The robot was allowed to crawl at most 3,000 pages for each site, and the robot requested web pages within nine hops from the root page of a site. The timeout was set to two seconds: if there was no communication between the web robot and the web server for two seconds, the robot
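The metric definitions and worked examples in this section can be replayed in a few lines of Python. This is our own sketch (the function names are not from the paper), using the paper's example numbers: Mb = 100 collected URLs, N = 10 transformed, Ma = 95 remaining, and nine of ten correct transformations:

```python
def url_consistency(pages):
    """1 - ((unique pages - 1) / (Rt - 1)); a failed download counts as the null string."""
    rt = len(pages)                  # Rt: number of download requests
    unique = len(set(pages))         # number of distinct page contents
    return 1 - (unique - 1) / (rt - 1)

def applying_rate(n, mb):
    """N transformed URLs out of Mb collected URLs."""
    return n / mb

def reduction_rate(mb, ma, n):
    """(Mb - Ma) / N: duplicates removed per transformed URL."""
    return (mb - ma) / n

def true_positive_rate(correct, n):
    """Fraction of transformations that preserve the downloaded page."""
    return correct / n

# Five requests returning pages a, <null>, b, <null>, c (nulls are failed downloads)
print(url_consistency(["a", "", "b", "", "c"]))   # 0.25

print(applying_rate(10, 100))        # 0.1
print(reduction_rate(100, 95, 10))   # 0.5
print(true_positive_rate(9, 10))     # 0.9
```

Each printed value matches the corresponding worked example in the text.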
URL is found at many places. For instance, when the string "http://www.example.com/" was found twice on the same page, the number of extracted URLs was counted as two in our experiment.

First, let us see how often raw URLs are found in duplicates on web pages. We considered the following three cases. First, we eliminated duplicates of syntactically identical raw URLs that are found on the same web page. Second, we eliminated duplicates of syntactically identical URLs starting with the slash symbol (meaning that these URLs are expressed as an absolute path), as long as they are found on the same site. Third, we eliminated duplicates of syntactically identical URLs starting with the "http:" prefix. Table 1 shows the numbers of remaining URLs after each elimination method was applied. Note that these simple eliminations of duplicated URLs were able to remove more than half of all the raw URLs that were found in the beginning.

Table 1. Eliminating duplicate URLs

  Actions                                 Number of        Percent of remaining URLs
                                          remaining URLs   to all the extracted URLs
  All extracted URLs                      25,838,285       100%
  Eliminate the same URLs on a web page   22,757,954       88.1%
  Eliminate the same URLs expressed as
    an absolute path on each site         19,647,693       76.0%
  Eliminate the same URLs starting
    with "http:"                          11,046,159       42.8%

After transforming relative URLs, in which some components of a URL are omitted, into absolute URLs, we obtained 2,329,770 unique absolute URLs. We computed URL consistencies of the absolute URLs with Rt = 3, where t is one second. When we request web pages three times with an absolute URL, three consistency values can be produced by our consistency metric. When the number of downloaded pages is 1, the consistency of the URL is 1 because the same page is downloaded successively (or consistently). When the number of downloaded pages is 3, the consistency is 0 because a different page is downloaded on every request. When the number of downloaded pages is 2, the consistency is 0.5 (i.e., 1 – ((2 – 1) / (3 – 1)) = 0.5).

[Figure: distribution plot; X-axis: URL consistency (0 to 1); Y-axis: number of URLs (0 to 1,000,000).]

Fig. 2. Distribution of URL consistency

Fig. 2 shows the distribution of the URL consistencies. The X-axis represents the consistency value, and the Y-axis represents the number of URLs whose consistency values correspond to that of the X-axis. About 31% (exactly, 718,038) of the absolute URLs were completely inconsistent (i.e., their consistency values were 0). There were 1,515,522 URLs (approximately 65% of the absolute URLs) whose consistencies were 1. Only consistent absolute URLs were used for evaluating normalization methods.

We evaluate the seven normalization methods as follows:

- Method 1: Change letters in the scheme component into lower-case letters
- Method 2: Change letters in the host component into lower-case letters
- Method 3: Eliminate the default port (i.e., ":80")
- Method 4: Transform a null path string into the slash symbol
- Method 5: Decode unreserved characters
- Method 6: Eliminate the fragment component
- Method 7: Eliminate the trailing slash symbol

The first six methods (i.e., Methods 1 to 6) are defined in the standard document [1]; the last method (i.e., Method 7) is introduced in [1] and [6].

[Figure: URL applying rate, URL reduction rate, and true positive rate (0.00 to 1.00) of Methods 1 to 7.]
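As a rough illustration (not the authors' implementation), the seven methods can be sketched with Python's standard urllib.parse. Note that unquote decodes all percent-escapes rather than only the unreserved characters, so Method 5 is simplified here:

```python
from urllib.parse import urlsplit, urlunsplit, unquote

def normalize(url):
    """Apply Methods 1-7 in order (a sketch; a real normalizer handles more edge cases)."""
    scheme, netloc, path, query, fragment = urlsplit(url)
    scheme = scheme.lower()                      # Method 1: lower-case the scheme
    netloc = netloc.lower()                      # Method 2: lower-case the host
    if netloc.endswith(":80"):                   # Method 3: drop the default port
        netloc = netloc[:-3]
    if path == "":                               # Method 4: null path -> "/"
        path = "/"
    # Method 5 (simplified): decode percent-encoded characters in the path
    path = unquote(path)
    fragment = ""                                # Method 6: drop the fragment
    if path.endswith("/") and path != "/":       # Method 7: drop the trailing slash
        path = path[:-1]
    return urlunsplit((scheme, netloc, path, query, fragment))

print(normalize("HTTP://WWW.Example.COM:80"))           # http://www.example.com/
print(normalize("http://www.example.com/a%7Eb/#top"))   # http://www.example.com/a~b
```

The first call exercises Methods 1-4; the second exercises Methods 5-7 (%7E is the unreserved character "~").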
and 0.03, respectively. The reduction rates of Methods 2, 3, and 7 were below 0.05. The reduction rates of Methods 1 and 4 showed that about one third of the URLs that we transformed were removed. About half (48.3%) of the URLs to which we applied Method 5 were removed. Most (97.3%) of the URLs to which we applied Method 6 were removed. The applying rate and the reduction rate of Method 6 were relatively very high. The figures showed that 11.7% of the absolute URLs were transformed by Method 6, and 97.3% of the transformed URLs were duplicates. Note that the standard normalizations (Methods 1 to 6) do not cause false positives. The true positive rates of Methods 1 to 6 were 1, and that of Method 7 was 0.95. We learned that approximately 5% of the transformed URLs were wrongly transformed by Method 7, even though the method reduced 5% of the transformed URLs.

5. Conclusion and Future Work

In this paper, we described four metrics for evaluating URL normalization methods. The proposed metrics are summarized in Table 2.

Table 2. Summary of the proposed metrics

  Metric              Description
  URL consistency     Represents how consistently a URL is used to retrieve
                      the same page during a given time unit
  URL applying rate   Represents how many URLs are transformed by a
                      normalization method
  URL reduction rate  Represents how many URLs are reduced (how many URLs
                      become identical) after a normalization method is applied
  True positive rate  Represents how correctly URLs are transformed

With the proposed metrics, we evaluated seven normalization methods. The first six methods were standard normalization methods, and the last method (Method 7) was eliminating the trailing slash symbol. Among 2,329,770 URLs, approximately 65% were consistent URLs. The true positive rates of the standard normalization methods were, of course, 1. The true positive rate of the seventh method was 0.95, which means that 5% of the URLs were incorrectly transformed by eliminating the trailing slash symbol. The sixth method (i.e., eliminating the fragment component) exhibited the highest URL applying rate and URL reduction rate values.

In practice, the adoption of URL normalization methods has so far been treated heuristically, in that each normalization method is primarily devised on the basis of developer experience. The contributions of this paper include details on the effects of URL normalization and, at the same time, an analytic way to evaluate URL normalization methods. Our metrics can be used to evaluate not only standard normalization methods but also extended normalization methods to be developed in the future.

Lastly, we would like to mention our future work. First, we plan to devise evaluation metrics measuring the effectiveness of combinations of the normalization methods. The orders of the normalization methods will be investigated, too. Second, we will study how to find equivalent URLs effectively. Using information such as page contents, site characteristics, and so on, we can build a mapping table in which pairs of equivalent URLs are listed. Then we can normalize URLs not only by using the normalization rules but also by referring to the mapping table.

References

[1] Berners-Lee, T., Fielding, R., and Masinter, L.: Uniform Resource Identifiers (URI): Generic Syntax, http://gbiv.com/protocols/uri/rev-2002/rfc2396bis.html, (2005)
[2] Burner, M.: Crawling Towards Eternity: Building an Archive of the World Wide Web, Web Techniques Magazine, Vol. 2, No. 5, (1997) 37-40
[3] Heydon, A. and Najork, M.: Mercator: A Scalable, Extensible Web Crawler, International Journal of WWW, Vol. 2, No. 4, (1999) 219-229
[4] Kim, S.J. and Lee, S.H.: Implementation of a Web Robot and Statistics on the Korean Web, Springer-Verlag Lecture Notes in Computer Science, Vol. 2713, (2003) 341-350
[5] Kim, S.J. and Lee, S.H.: An Empirical Study on the Change of Web Pages, Springer-Verlag Lecture Notes in Computer Science, Vol. 3399, (2005) 632-642
[6] Lee, S.H., Kim, S.J. and Hong, S.: On URL Normalization, Springer-Verlag Lecture Notes in Computer Science, Vol. 3481, (2005) 1076-1085
[7] Shkapenyuk, V. and Suel, T.: Design and Implementation of a High-performance Distributed Web Crawler, In Proceedings of the 18th International Conference on Data Engineering, (2002) 357-368
[8] Netcraft: Web Server Survey, http://news.netcraft.com/archives/web_server_survey.html, (2004)