
Web Caching: A Clustered Approach for Web Content Mining

Dissertation

SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS
FOR THE AWARD OF THE DEGREE OF
MASTER OF TECHNOLOGY
IN
COMPUTER ENGINEERING
TO
MAHARISHI DAYANAND UNIVERSITY
ROHTAK – 124001-HARYANA

Submitted By
Ritika Balhara
1132312320

Under the Supervision of

Mr. Praveen Kumar


Asstt. Prof. (CSE).

DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING


BHIWANI INSTITUTE OF TECHNOLOGY & SCIENCES
BHIWANI– 127021-HARYANA-INDIA
June, 2011
Chapter 1 – Background

The World Wide Web is a huge information repository. When so many users access this
information repository, it is easy to find certain patterns in the way they access web resources.
Web request prediction has been implemented in the past, primarily for static content. Increasing
web content and Internet traffic are making web prediction models very popular.

The objective of a prediction model is to identify the subsequent requests of a user, given the
current request that the user has made. This way the server can pre-fetch and cache these pages,
or it can pre-send this information to the client. The idea is to control the load on the server and
thus reduce the access time. Careful implementation of this technique can reduce access time
and latency, making optimal use of the server's computing power and the network bandwidth.

The Markov model is a machine learning technique and differs from the data mining approach to
web logs. The data mining approach identifies classes of users from their attributes and predicts
future actions without considering interactivity and immediate implications. There are other
techniques, such as prediction by partial matching and information retrieval, that may be used in
conjunction with Markov modeling to enhance performance and accuracy.
A web prediction model, unlike other prediction models, is particularly challenging because of
the many states it has to hold and the dynamic nature of the web in terms of user actions and
continuously changing content. We therefore use the Markov probabilistic idea to design a
prediction model. A Markov process is the stochastic counterpart of a deterministic process in
probability theory.

1.1 Web Cache

A web cache is a mechanism for the temporary storage (caching) of web documents, such as
HTML pages and images, to reduce bandwidth usage, server load, and perceived lag. A web
cache stores copies of documents passing through it; subsequent requests may be satisfied from
the cache if certain conditions are met.

Figure-1.1 Cache architecture

Web caching is one of the most successful solutions for improving the performance of Web-
based systems. In Web caching, popular web objects that are likely to be visited in the near future
are stored in positions closer to the user, such as the client machine or a proxy server. Thus, web
caching helps reduce Web service bottlenecks, alleviate traffic over the Internet and improve the
scalability of the Web system. Web caching offers three attractive advantages to Web
participants, including end users, network managers, and content creators:

a. Web caching decreases user-perceived latency.

b. Web caching reduces network bandwidth usage.

c. Web caching reduces the load on origin servers.

1.1.1 Basic Types of Web Cache

Web caching keeps a local copy of Web pages in places close to the end user. Caches are found
in browsers and in any of the Web intermediaries between the user agent and the origin server.
Typically, a cache is located in the client (browser cache), the proxy server (proxy cache) or the
origin server (cache server).

a. Browser cache: It is located in the client. The user can see the cache settings of any modern
Web browser such as Internet Explorer, Safari, Mozilla Firefox, Netscape, and Google Chrome.
This cache is useful, especially when users hit the "back" button or click a link to see a page they
have just looked at. In addition, if the same navigation images are used throughout the site, they
will be served from the browser's cache almost instantaneously.
b. Proxy server cache: It is found in the proxy server, which is located between client machines
and origin servers. It works on the same principle as the browser cache, but on a much larger
scale. Unlike the browser cache, which serves only a single user, a proxy serves hundreds or
thousands of users in the same way. When a request is received, the proxy server checks its
cache. If the object is available, it sends the object to the client. If the object is not available, or it
has expired, the proxy server requests the object from the origin server and sends it to the
client. The object is then stored in the proxy's local cache for future requests.

c. Origin server cache: Even at the origin server, web pages can be stored in a server-side cache
to reduce the need for redundant computations or database retrievals. Thus, the server load
can be reduced if an origin server cache is employed.

A cache is a component that transparently stores data so that future requests for that data can be
served faster. The data stored within a cache might be values that have been computed
earlier or duplicates of original values that are stored elsewhere. If requested data is contained in
the cache (cache hit), the request can be served by simply reading the cache, which is
comparatively fast. Otherwise (cache miss), the data has to be recomputed or fetched from its
original storage location, which is comparatively slower. Hence, the more requests that can be
served from the cache, the faster the overall system performs.
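
As a minimal illustration of the hit/miss behaviour described above, the following Python sketch models a cache as a small in-memory store in front of a slower origin lookup. The names fetch_from_origin and CACHE_CAPACITY are illustrative assumptions, not part of any particular proxy product.

```python
# Minimal sketch of the cache hit/miss logic described above.
from collections import OrderedDict

CACHE_CAPACITY = 1000            # assumed maximum number of cached objects

cache = OrderedDict()            # URL -> object body, kept in LRU order


def fetch_from_origin(url):
    """Placeholder for the slow path: contact the origin server."""
    return "<body of %s>" % url


def get(url):
    if url in cache:               # cache hit: serve the local copy
        cache.move_to_end(url)     # refresh its LRU position
        return cache[url]
    body = fetch_from_origin(url)  # cache miss: go to the origin server
    cache[url] = body              # keep a copy for future requests
    if len(cache) > CACHE_CAPACITY:
        cache.popitem(last=False)  # evict the least recently used object
    return body
```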

To be cost efficient and to enable an efficient use of data, caches are relatively small.
Nevertheless, caches have proven themselves in many areas of computing because access
patterns in typical computer applications have locality of reference. References exhibit temporal
locality if recently requested data is requested again. References exhibit spatial locality if data is
requested that is physically stored close to data that has already been requested.

In simple terms, Web caching is a technology that can significantly enhance the end-user's web
browsing experience and, at the same time, save bandwidth for service providers. In detail, a
web cache is a temporary storage place for data content requested from the Internet. After an
original request for data has been successfully fulfilled, and that data has been stored in the
cache, further requests for those files (e.g., HTML pages, images) result in the information
being returned from the cache, if certain conditions are met, rather than from the original location.
Web caching is a widely used technique, employed by Internet Service Providers (ISPs) all around
the world, to save bandwidth and to improve user response time. In short, web caching
temporarily stores web objects – HTTP and FTP data – flowing into an ISP's network. This is not
an entirely new invention, in that caching is an integral part of computer architecture; for
example, a CPU cache speeds up access to main memory, and a file system cache stores commonly
requested blocks for faster access.

1.1.2 Need for Web Caching

We'll start with a simple analogy. This analogy is not a perfect match but it will give you the
basic idea. Each morning before John goes to get the daily paper he asks his roommate, Bill, if
he has already purchased one. If Bill already has the paper there is no reason for John to walk
all the way to the store and spend money on the exact same information. He'll just read the
paper that's already there. If a copy of the daily paper was already retrieved by Bill and is on
hand, John saves his money and his time by not making a trip to the store.

Web caching enhances web browsing in much the same way. When a user visits a site, say
www.yahoo.com, web caching (if in use and available) will retrieve the page from Yahoo's web
server and store a copy of that page locally, in a cache server. The next time a user requests
www.yahoo.com, the web cache delivers the locally cached copy of the page (without fetching it
from Yahoo's web server). The user will experience a very fast download because the request did
not have to traverse the entire Internet, all the way to where Yahoo's server is located. Also, the
bandwidth that would normally be used to download the web site is not required and is free for
other information retrieval or delivery.

1.1.3 How Web Caches Work

Web sites are continually updating their content. News headlines change, stock quotes change,
weather changes. It may seem that caching is not worthwhile if it returns outdated material.
A traffic report that is two hours old doesn't do you much good. Fortunately, there are checks and
balances in place to ensure that the content you are viewing is current. Web sites are made up of
many small pieces that come together to make a complete page. A site might have logos,
photographs, tables, text, and sounds. Each item will be cached as a different object, and some
items may not be cached at all. For example, when you access CNN.com frequently, your cache
may hang on to the CNN logo, some advertising bars, and the rest of the elements that make up
the basic look of the CNN Web site. But the news items will not sit in the cache because they
change so often. In this case your cache has made the CNN site much easier and faster to download
because all the static graphics are already on hand and the only thing you need to complete the
picture is the news content.

All caches have a set of rules that they use to determine when to serve a representation from the
cache, if it’s available. Some of these rules are set in the protocols and some are set by the
administrator of the cache (either the user of the browser cache, or the proxy administrator).

Generally speaking, these are the most common rules that are followed (don’t worry if you don’t
understand the details, it will be explained below):

1. If the response’s headers tell the cache not to keep it, it won’t.
2. If the request is authenticated or secure (i.e., HTTPS), it won’t be cached.
3. A cached representation is considered fresh (that is, able to be sent to a client without
checking with the origin server) if:

• It has an expiry time or other age-controlling header set, and is still within the
fresh period, or
• If the cache has seen the representation recently, and it was modified relatively
long ago. Fresh representations are served directly from the cache, without
checking with the origin server.

4. If a representation is stale, the origin server will be asked to validate it, or tell the cache
whether the copy that it has is still good.
5. Under certain circumstances — for example, when it’s disconnected from a network — a
cache can serve stale responses without checking with the origin server.

So how does your cache know what to hang on to and what to let go? That depends on choices
made by the Web developer as well as the way the user configures his cache. As mentioned
above, Web sites are made up of individual pieces. Each one of these pieces is encoded with
specific information that tells your cache how to handle it. This information may specify "don't
cache this item," in which case the cache will ignore it. The item may have a "max age"
specified. This tells the cache that after a set amount of time it must check in with the
web site for newer versions of that object. The "expires" field serves roughly the same purpose.
The item might also have a "last modified" field. Last modified is another way for your cache to
ask the Web server if the object has been modified since your last visit. If it has, the cache gets
a new copy; if not, the cache just hangs on to the copy it already has. The web site administrator
controls each of these items. There are many cache products available. Each has lots of different
configuration options to help ensure that your data is current. Caches can be configured to accept
all, some, or none of the priorities that the web site administrator sets.
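
The freshness rules just described can be condensed into a small decision function. The sketch below is a simplified approximation (it ignores many details of real HTTP caching and assumes the response headers have already been parsed into a dictionary); it only shows how max-age, expires and last-modified interact.

```python
import time
from email.utils import parsedate_to_datetime


def is_fresh(headers, stored_at, now=None):
    """Return True if a cached copy may be served without revalidation.

    `headers` is a dict of lower-cased response headers captured when the
    object was stored; `stored_at` is the Unix time at which it was cached.
    Simplified sketch only, not a full HTTP freshness implementation.
    """
    now = now or time.time()
    age = now - stored_at

    cache_control = headers.get("cache-control", "")
    if "no-store" in cache_control or "no-cache" in cache_control:
        return False                        # the site asked us not to reuse it

    for directive in cache_control.split(","):
        directive = directive.strip()
        if directive.startswith("max-age="):
            return age < int(directive.split("=", 1)[1])

    if "expires" in headers:                # absolute expiry time
        expires = parsedate_to_datetime(headers["expires"]).timestamp()
        return now < expires

    if "last-modified" in headers:          # heuristic: 10% of the object's age
        modified = parsedate_to_datetime(headers["last-modified"]).timestamp()
        return age < 0.1 * (stored_at - modified)

    return False                            # no freshness information: revalidate
```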

1.1.4 Why Web Caching Is Important

Caching is an advanced topic for server developers who want to speed up their Web servers. It
helps speed up frequently accessed pages and is a tool for improving the performance of Web
servers. Caches store Web pages in a central place and only update them every once in a while.
They make the pages appear to load faster, while sacrificing some timeliness.
This will give several positive outcomes:

1. Reduced cost of Internet traffic:


Traffic on the web consumes more bandwidth than any other Internet service. Therefore any method that
can reduce the bandwidth requirements is welcome, especially in parts of the world in which
telecommunication services are expensive. Even when telecommunication costs are reasonable, a large
fraction of traffic is destined to or from the U.S. and so must use expensive trans-oceanic network links.

2. Reduced latency:
One of the most common end-user desires is more speed, and many people believe that web
caching can help reduce the "World Wide Wait." Latency improvements are most noticeable in
areas of the world in which data must travel long distances, accumulating significant latency as a
result of speed-of-light constraints, accumulating processing time across many systems over many
network hops, and facing an increased likelihood of congestion as more networks are
traversed to cover such distances. High latency due to speed-of-light constraints is
particularly taxing in satellite communications.

1.1.5 Where Web Caching Is Performed

Caching can be performed by the client application and is built in to most web browsers. There
are a number of products that extend or replace the built-in caches with systems that contain
larger storage, more features, or better performance. In any case, these systems cache net objects
from many servers but all for a single user.

Figure-1.2 Caching Performed at the client location

Caching can also be utilized in the middle, between the client and the server, as part of a proxy.
Proxy caches are often located near network gateways to reduce the bandwidth required over
expensive dedicated Internet connections. These systems serve many users (clients) with cached
objects from many servers. In fact, much of the usefulness (reportedly up to 80% for some
installations) is in caching objects requested by one client for later retrieval by another client. For
even greater performance, many proxy caches are part of cache hierarchies, in which a cache can
inquire of neighboring caches for a requested document to reduce the need to fetch the object
directly.

Figure-1.3 Caching performed at the proxy server location

Finally, caches can be placed directly in front of a particular server to reduce the number of
requests that the server must handle. Most proxy caches can be used in this fashion, but this form
has a different name (reverse cache, inverse cache, or HTTP accelerator) to reflect the fact that it
caches objects for many clients but from only one server.

Figure-1.4 Caching Performed at the web server location.

1.1.6 How Does Caching Apply to the Web

Caching is a well-known concept in computer science: when programs continually access the
same set of instructions, a massive performance benefit can be realized by storing those
instructions in RAM. This prevents the program from having to access the disk thousands or
even millions of times during execution by quickly retrieving them from RAM. Caching on the
web is similar in that it avoids a roundtrip to the origin web server each time a resource is
requested and instead retrieves the file from a local computer's browser cache or a proxy cache
closer to the user.

1.2 What Is Data Mining

Web usage mining is a subset of web mining, which itself is a subset of data mining in
general. The aim is to use the data and information extracted in web systems in order to reach
knowledge of the system itself. Data mining is a set of operations performed on a collection of
data, or a subset of it, so as to extract meaningful patterns from the data. Another definition is:
"Data mining is the semi-automatic discovery of patterns, associations, changes, anomalies,
rules, and statistically significant structures and events in data". That is, data mining attempts to
extract knowledge from data. If a subset is to be used, careful and unbiased sampling algorithms
should be used to avoid biased results. Data mining is different from information extraction,
although they are closely related. To better understand these concepts, brief definitions of the key
terms are given below:
Data: “A class of information objects, made up of units of binary code that are intended to be
stored, processed, and transmitted by digital computers”
Information: “is a set of facts with processing capability added, such as context, relationships to
other facts about the same or related objects, implying an increased usefulness. Information
provides meaning to data”
Knowledge: “is the summation of information into independent concepts and rules that can
explain relationships or predict outcomes”

Information extraction is the process of extracting information from data sources, whether they
are structured, unstructured or semi-structured, into structured and computer-understandable data
formats. Data mining operations are performed on the data already extracted by means of
information retrieval. Data mining can be categorized based on the types of data sources it is
applied to. One such application is on geographic data, such as the digital maps usually seen in
GIS applications, which is called spatial data mining. Another area where data mining is widely
used is bioinformatics, where very large data sets about protein structures, networks and genetic
material are analyzed. The subcategory of interest in this thesis is web mining, which acts on the
data made available on World Wide Web (WWW) servers.

1.2.1 Web Mining

Web mining consists of a set of operations defined on data residing on WWW servers; it has been
defined as "…the discovery and analysis of useful information from the World Wide Web".
Such data can be the content presented to users of the web sites, such as hyper text markup
language (HTML) files, images, text, audio or video. Also, the physical structure of the websites
or the server logs that keep track of user accesses to the resources mentioned above can be
targets of web mining techniques. Web mining as a subcategory of data mining is fairly recent
compared to other areas, since the introduction of the Internet and its widespread usage are also
recent. However, the incentive to mine the data available on the Internet is quite strong. Both the
number of users around the world accessing online data and the volume of the data itself
motivate the stakeholders of web sites to consider analyzing the data and user behavior. Web
mining is mainly categorized into two subsets, namely web content mining and web usage
mining. While the content mining approaches focus on the content of single web pages, web
usage mining uses server logs that detail the past accesses to the web site data made available to
the public. Usually the physical structure of the web site itself, which is a graph representation of
all web pages in the web site, is used as a part of either method. However, recent approaches that
place more focus on the physical link structure of the web site have introduced web structure
mining as a separate concept.

1.2.2 Web Content Mining

"Web content mining describes the automatic search of information resources available on-line".
The focus is on the content of web pages themselves. Content mining approaches can be divided
into agent-based approaches, where intelligent web agents such as crawlers autonomously crawl
the web and classify data, and database approaches, where information retrieval tasks are employed
to store web data in databases on which the data mining process can take place. Most web content
mining studies have focused on textual and graphical data, since the early years of the Internet
mostly featured textual or graphical information. Recent studies have started to focus on visual and
aural data such as sound and video content too.

1.2.3 Web Usage Mining

The main topic of this thesis is web usage mining. Usage mining, as the name implies, focuses
on how the users of a website interact with the web site: the web pages visited, the order of visits,
the timestamps of visits and their durations. The main source of data for web usage mining is
the server log, which records each visit to each web page, typically with the IP address, referrer,
time, browser and accessed page link. Although many areas and applications can be cited where
usage mining is useful, it can be said that the main idea behind web usage mining is to let users of
a web site use it easily and efficiently, and to predict and recommend parts of the web site to a user
based on their own and previous users' actions on the web site.

Chapter 2 – Introduction

2.1 Introduction

Web prefetching is an attractive solution to reduce the network resources consumed by Web
services as well as the access latencies perceived by Web users. Unlike Web caching, which
exploits the temporal locality of Web objects, Web prefetching utilizes their spatial locality.
Specifically, Web prefetching fetches objects that are likely to be accessed in the near future and
stores them in advance. In this context, a sophisticated combination of these two techniques may
yield significant improvements in the performance of the Web infrastructure. Considering that
several caching policies have been proposed in the past, the challenge is to extend them by
using data mining techniques. In this work, we present a clustering-based prefetching scheme
in which a graph-based clustering algorithm identifies clusters of "correlated" Web pages based on
the users' access patterns. This scheme can be integrated easily into a Web proxy server,
improving its performance. Through a simulation environment, using a real data set, we show
that the proposed integrated framework is robust and effective in improving the performance of
the Web caching environment.

The ongoing increase of digital data on the Web has resulted in an overwhelming amount of
research in the area of Web user browsing personalization and next-page access prediction. It is
a rather complicated issue since, until now, there is no single theory or approach that can
handle the increasing volume of data with improved performance, efficiency and accuracy of
Web page prediction. Two of the most common approaches used for Web user browsing pattern
prediction are the Markov model and clustering. Each of these approaches has its own
shortcomings. The Markov model is the most commonly used prediction model because of its
high accuracy. Low-order Markov models have higher accuracy and lower coverage than
clustering. In order to overcome low coverage, all-kth order Markov models have been used,
where the highest order is first applied to predict the next page. If it cannot predict the page, the
order is decreased by one until prediction is successful. This can increase the coverage, but it is
associated with higher state space complexity. Clustering methods are unsupervised methods, and
normally are not used for classification directly. However, proper clustering groups users' sessions
with similar browsing history together, and this facilitates classification. Prediction is performed on
the cluster sets rather than the actual sessions. Clustering accuracy is based on the features selected
for partitioning. For instance, partitioning based on semantic relationships, contents or link
structure usually provides higher accuracy than partitioning based on bit vectors, time spent, or
frequency. However, even the semantic, content and link structure accuracy is limited due to the
unidirectional nature of the clusters and the multidirectional structure of Web pages.

Reducing web latency is one of the primary concerns of Internet research. Web caching and
web prefetching are two effective techniques for latency reduction. A primary method for
intelligent prefetching is to rank potential web documents based on prediction models that are
trained on past web server and proxy server log data, and to pre-fetch the highly ranked
objects. For this method to work well, the prediction model must be updated constantly, and
different queries must be answered efficiently. In this work we present a data-cube model to
represent Web access sessions for data mining, supporting the construction of the prediction model.
The cube model organizes session data into three dimensions. With the data cube in place, we
apply efficient data mining algorithms for clustering and correlation analysis. As a result of the
analysis, the web page clusters can then be used to guide the prefetching system. We also
propose an integrated web-caching and web-prefetching model, where the issues of
prefetching aggressiveness, replacement policy and increased network traffic are addressed
together in an integrated framework. The core of our integrated solution is a prediction model
based on statistical correlation between web objects. This model can be frequently updated by
querying the data cube of web server logs. This integrated data cube and prediction-based
prefetching framework, to our knowledge, represents the first such effort.

Web usage mining is the application of data mining techniques to Web click-stream data in order
to extract usage patterns. As Web sites continue to grow in size and complexity, the results of
Web usage mining have become critical for a number of applications such as Web site design,
business and marketing decision support, personalization, usability studies, and network traffic
analysis. The two major challenges involved in Web usage mining are preprocessing the raw
data to provide an accurate picture of how a site is being used, and filtering the results of the
various data mining algorithms in order to present only the rules and patterns that are potentially
interesting. Web usage mining consists of three phases, namely preprocessing, pattern discovery,
and pattern analysis. Given its application potential, Web usage mining has seen a rapid increase
in interest, from both the research and practice communities. Web data are those that can be
collected and used in the context of Web personalization.

Web mining can be decomposed into the following subtasks:

1. Resource finding: the task of retrieving intended Web documents.


2. Information selection and preprocessing: automatically selecting and pre-processing specific
information from retrieved Web resources.

3. Generalization: automatically discovering general patterns at individual Web sites as well as
across multiple sites.
4. Analysis: validation and/or interpretation of the mined patterns.

The Web is an excellent tool to deliver on-line courses in the context of distance education.
However, relying only on statistical analysis of web traffic does not take advantage of the
potential of the hidden patterns inside the web logs. Web usage mining is a non-trivial process of
extracting useful, implicit and previously unknown patterns from the usage of the Web.
Significant research is invested in discovering these useful patterns to increase the profitability of
e-commerce sites.

2.2 Markov model

Markov models have been widely used for predicting the next Web page from the users'
navigational behavior recorded in the Web log. This usage-based technique can be combined
with the structural properties of the Web pages to achieve better prediction accuracy. One
pre-fetching technique proposed here relies on both the Markov model and ranking, which
takes the structural properties of the Web into account. In this approach, prediction accuracy is
realized as a linear function of the transition probability of the first-order Markov model and the
ranking of the Web page; the chance of the predicted Web page being the next Web page is higher
when this prediction accuracy is higher. This research work also proposes an improved hash-based
association mining algorithm. It minimizes the number of candidate sets generated while producing
association rules by evaluating the quantitative information associated with each item that occurs
in a transaction, information which is usually discarded because traditional association rules focus
only on qualitative correlations. The proposed approach reduces not only the number of item sets
generated but also the overall execution time of the algorithm. Any valued attribute is treated as
quantitative and is used to derive quantitative association rules, which usually increases the rules'
information content. Transaction reduction is achieved by discarding transactions that do not
contain any frequent item set in subsequent scans, which in turn reduces the overall execution
time. Dynamic item set counting is done by adding new candidate item sets only when all of their
subsets are estimated to be frequent. The frequent item ranges are the basis for generating higher-
order item ranges using an Apriori algorithm. During each iteration of the algorithm, the frequent
sets from the previous iteration are used to generate the candidate sets, whose support is then
checked against the threshold.
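
As a concrete illustration of the iteration just described, the sketch below implements the plain Apriori loop: frequent item sets of size k are joined into candidates of size k+1, candidates whose subsets are not all frequent are pruned, and the remaining candidates are kept only if their support meets the threshold. It is only a baseline sketch and does not include the hash-based or quantitative extensions discussed above.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return all frequent item sets with support >= min_support.

    `transactions` is a list of sets of items; support is expressed as a
    fraction of the total number of transactions. Plain Apriori sketch only.
    """
    n = len(transactions)

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    # Frequent 1-item sets.
    items = {item for t in transactions for item in t}
    frequent = [{frozenset([i]) for i in items
                 if support(frozenset([i])) >= min_support}]

    k = 1
    while frequent[-1]:
        prev = frequent[-1]
        # Join step: combine frequent k-item sets into (k+1)-item candidates.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k + 1}
        # Prune step: every k-subset must be frequent, and the candidate's
        # own support must reach the threshold.
        next_level = {c for c in candidates
                      if all(frozenset(s) in prev for s in combinations(c, k))
                      and support(c) >= min_support}
        frequent.append(next_level)
        k += 1

    return [s for level in frequent for s in level]
```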

If a model possesses the Markov property, it implies that, given the present state of the model,
future states are independent of the past states. Thus the description of the present state fully
captures all the information that could influence the future evolution of the process. Future states
will be reached through a probabilistic process instead of a deterministic one.

At each instant, the system may change its state from the current state to another state, or remain
in the same state, according to a certain probability distribution. The changes of state are called
transitions, and the probabilities associated with various state changes are termed transition
probabilities. The transition probabilities manifest themselves as the branches of a Markov tree.

Let P = {p1, p2, p3, …, pn} be the set of pages in a web site, and let S be a user session consisting
of a sequence of pages visited by the user. Then prob(pi | S) is the probability that the user visits
page pi next. Assuming that the user has visited l pages, the page Pl+1 that the user will visit next
is estimated by:
Pl+1 = max pεP { P(Pl+1 = p | S) } = max pεP { P(Pl+1 = p | pl, pl-1, …, p1) }.
The larger the values of S and l, the higher the accuracy of the prediction. However, this also
increases the model complexity. Hence we apply the Markov rule to reduce the estimation of Pl+1:
Pl+1 = max pεP { P(Pl+1 = p | pl, pl-1, …, pl-(k-1)) }.
Here, k is the number of preceding pages and identifies the order of the Markov model. We can
also estimate a lower order (say, a single transition step) from n-grams of the form <X1, X2> to
yield the conditional probabilities
p(x2 | x1) = Pr(X2 = x2 | X1 = x1).
The zeroth-order Markov model is then the unconditional base rate probability
p(xn) = Pr(Xn), which is the proportion of visits to a page over a period of time.
In general, the probability of going from state i to state j in n time steps is
p(n)ij = Pr(Xn = j | X0 = i), and the single-step transition probability is
pij = Pr(X1 = j | X0 = i).
These facts have been used to determine probabilities in various applications. For instance, to
determine probabilities of weather conditions, the weather on the preceding day is represented by
a transition matrix P, where (P)ij is the probability that, if a given day is of type i, it will be
followed by a day of type j.
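
The transition probabilities above are estimated directly from observed sessions. The following sketch (restricted to a first-order model for brevity) counts single-step transitions in a list of user sessions and normalises them into probabilities; the session data shown is made up purely for illustration.

```python
from collections import defaultdict


def first_order_transitions(sessions):
    """Estimate pij = Pr(X1 = j | X0 = i) from a list of page sequences."""
    counts = defaultdict(lambda: defaultdict(int))
    for session in sessions:
        for current_page, next_page in zip(session, session[1:]):
            counts[current_page][next_page] += 1
    # Normalise each row of the transition matrix.
    return {i: {j: c / sum(row.values()) for j, c in row.items()}
            for i, row in counts.items()}


# Illustrative sessions (hypothetical page identifiers).
sessions = [["p1", "p2", "p3"], ["p1", "p2", "p4"], ["p2", "p3"]]
P = first_order_transitions(sessions)
print(P["p2"])   # {'p3': 0.666..., 'p4': 0.333...}
```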

Markov models are becoming very commonly used in identifying the next page to be accessed
by the Web site user based on the sequence of previously accessed pages.
Let P = {p1, p2, …, pm} be a set of pages in a Web site. Let W be a user session including a
sequence of pages visited by the user in a visit. Assuming that the user has visited l pages, then
prob(pi|W) is the probability that the user visits page pi next. The page Pl+1 the user will visit
next is estimated by:

Pl+1 = max pεP { P(Pl+1 = p | W) } = max pεP { P(Pl+1 = p | pl, pl-1, …, p1) }

This probability, prob(pi|W), is estimated using all sequences of all users in the history (or
training data), denoted by W. Naturally, the longer l and the larger W, the more accurate
prob(pi|W). However, it is infeasible to have a very long l and a large W, and it leads to
unnecessary complexity. Therefore, to overcome this problem, a more feasible probability is
estimated by assuming that the sequence of the Web pages visited by users follows a Markov
process. The Markov process imposes a limit on the number of previously accessed pages, k. In
other words, the probability of visiting a page pi does not depend on all the pages in the Web
session, but only on a small set of k preceding pages, where k << l. The equation becomes:

Pl+1 = max pεP { P(Pl+1 = p | pl, pl-1, …, pl-(k-1)) }

k denotes the number of preceding pages and identifies the order of the Markov model. The
resulting model of this equation is called the all-kth order Markov model. Of course, the Markov
model starts calculating the highest probability from the last page visited, because during a Web
session the user can only link the page he is currently visiting to the next one.
An example:

Let Sk be a state containing the k preceding pages, Sk = {pl-(k-1), pl-(k-2), …, pl}. The
probability P(pi | Sk) is estimated as follows from a history (training) data set:

P(pi | Sk) = frequency(<Sk, pi>) / frequency(Sk)

This formula calculates the conditional probability as the ratio of the frequency with which the
sequence is followed directly by page pi in the training set to the frequency of the sequence itself.
The fundamental assumption of predictions based on Markov models is that the next state
depends on the previous k states. The longer k is, the more accurate the predictions are.
However, a longer k causes two problems: the coverage of the model is limited and leaves many
states uncovered, and the complexity of the model becomes unmanageable.
Therefore, the following are three modified Markov models for predicting Web page access
(a sketch of the first two is given after this list).
1. All-kth order Markov model: This model tackles the problem of low coverage of a high-
order Markov model. For each test instance, the highest-order Markov model that covers
the instance is used to make the prediction. For example, if we build an all-4th order
model (including the 1st-, 2nd-, 3rd- and 4th-order models), for a test instance we first try
the 4th-order model. If the 4th-order model does not contain the corresponding state, we
then use the 3rd-order model, and so forth.
2. Frequency-pruned Markov model: Though all-kth order Markov models resolve the low-
coverage problem, they exacerbate the problem of complexity since the states of all Markov
models are added up. Note that many states have low statistical predictive reliability
since their occurrence frequencies are very low. The removal of these low-frequency
states may affect the accuracy of a Markov model; however, the number of states of the
pruned Markov model will be significantly reduced.
3. Accuracy-pruned Markov model: Frequency pruning does not capture the factors that
affect the accuracy of states. A highly frequent state may not give accurate predictions.
When we use some means to estimate the predictive accuracy of states, states with
low predictive accuracy can be eliminated. One way to estimate the predictive accuracy
using conditional probability is called confidence pruning. Another way to estimate the
predictive accuracy is to count the (estimated) errors involved, called error pruning.
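
A minimal sketch of the first two variants is given below: it builds context-to-next-page counts for every order up to k_max, discards rare states (frequency pruning), and at prediction time falls back from the longest matching context to shorter ones. The function names and thresholds are illustrative assumptions.

```python
from collections import defaultdict


def build_all_kth_model(sessions, k_max=3, min_count=2):
    """Map context tuples (length 1..k_max) to next-page counts,
    discarding contexts seen fewer than `min_count` times (frequency pruning)."""
    model = defaultdict(lambda: defaultdict(int))
    for session in sessions:
        for i in range(1, len(session)):
            for k in range(1, min(k_max, i) + 1):
                context = tuple(session[i - k:i])
                model[context][session[i]] += 1
    return {ctx: dict(nxt) for ctx, nxt in model.items()
            if sum(nxt.values()) >= min_count}


def predict(model, recent_pages, k_max=3):
    """Try the longest context first; fall back to shorter orders."""
    for k in range(min(k_max, len(recent_pages)), 0, -1):
        context = tuple(recent_pages[-k:])
        if context in model:
            return max(model[context], key=model[context].get)
    return None   # no covering state at any order
```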

When choosing the Markov model order, our aim is to determine an order that leads to high
accuracy with low state space complexity. Figure 2.1 shows the increase in precision as the order
of the all-kth order Markov model increases. On the other hand, Table 1 shows the increase in
state space complexity as the order of the all-kth order Markov model increases. Based on this
information, we use the all-2nd order Markov model because it has better accuracy than the
all-1st order Markov model without the state space complexity drawback of the all-3rd and
all-4th order Markov models.

Figure 2.1: Precision of all 1-, 2-, 3- and 4-Markov model orders

2.3 Markov based web prediction model

The results collected for the Markov model in this implementation clearly indicate that a third-
and higher-order model has a high success rate in terms of positive future predictions. One can
therefore build variations of this model and use the one with the highest applicability and success
rate, or use a combination. The different-order models are directly associated with n-grams, as
used by the speech and language processing community. We may borrow this idea and consider
an n-gram as a sequence of n consecutive requests. To make a prediction, one should match the
prefix of length n-1 of an n-gram and use the Markov model to predict the nth request. However,
the important point here is that, given a prefix of length n-1, there are several possibilities for the
nth request. How do we identify the nth request appropriately? We use the Markov model's idea
of states. Each node, and therefore each request, represents a state, and moving from one request
to the next is a state transition. The transition from one state to another has some probability
associated with it. Given a sequence of n-1 states, we pick the nth state with the highest
probability. How we calculate this probability is explained in the next section. Note that the
transition with the highest probability may not always be the correct request, and hence we use
the idea of top-n predictions. Here we consider not only the nth request with the highest
probability but several requests with high probabilities. We could establish a minimum threshold
to achieve higher accuracy. The change between different states is known as a transition, and the
probability with which this occurs is known as the transition probability. So when a request
fetches a web page or a resource, it is considered the current state of the system. Given this
prediction model (with its associated transition probabilities) and a set of request sequences, our
goal is to predict the future state of the request stream.
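
For the top-n idea, the same counts can be turned into a ranked list rather than a single guess. The sketch below assumes a model built as in the previous sketch (a mapping from context tuples to next-page counts) and returns the n most probable next requests for the longest matching prefix, subject to a minimum probability threshold; the parameter values are illustrative.

```python
def top_n_predictions(model, recent_pages, n=3, min_prob=0.1, k_max=3):
    """Return up to n (page, probability) pairs for the longest matching context."""
    for k in range(min(k_max, len(recent_pages)), 0, -1):
        context = tuple(recent_pages[-k:])
        if context in model:
            counts = model[context]
            total = sum(counts.values())
            ranked = sorted(((page, c / total) for page, c in counts.items()),
                            key=lambda item: item[1], reverse=True)
            # Keep only predictions above the minimum probability threshold.
            return [(page, p) for page, p in ranked if p >= min_prob][:n]
    return []
```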

2.4 Applications of prediction models

Prediction models have a wide range of applications. Some of the applications of a prediction
model are listed below:

1. Pre-fetching
Web pre-fetching is a mechanism by which a web server can pre-fetch web pages well in advance,
before a request is actually received by the server or sent by a client. The question here is: given a
request, how accurately can you predict the next consequent request? A web server can cache the
most probable next request, reducing the time taken to respond to a request considerably. It can
help make up for the web latency that we face on the Internet today. Pre-fetching has been
performed in the past for static web content.

2. Pre-sending
While pre-fetching caches the resources at the server side, pre-sending goes further by
forwarding the resources to the respective clients. Requests are thus served locally.

3. Recommendation systems
Recommendation systems are tools that suggest related pages or resources to web surfers. They
may be simple or sophisticated tools that assist clients in maneuvering through a web site.
Dynamically adapting web pages and applications are examples where recommendation systems
can be useful. Web intelligence, dynamic site adaptation to the user, and dynamic customization
to reach user information in reduced time are other applications.

4. Web caching policies


Even with good hardware support, there is always a threshold to caching performance. Most of
the caching algorithms are performance oriented and do not consider user preferences and
patterns in recognizing potential cacheable content. Such models can be used to locally cache
resources in a personalized fashion.

5. Web site design and analysis


Predictive models of browsing could help accurately simulate surfer paths through hypothetical
web site designs. This can further help web site designers design the site so as to encourage a
certain way of browsing through the content. It will also help in designing websites capable of
dynamically adapting themselves to meet the needs of the user.

2.5 Types of prediction model profiles

There are two types of prediction model profiles: point profile and path profile; path profile
being the more commonly used profile.
2.5.1 Point profile

Point-based prediction models are effectively first-order Markov models; in other words, they
give page-to-page transition probabilities. Given the current state, the model will predict the
future state. We already know that first-order models have high applicability, but they do not
have the same precision as higher-order models. First-order models are general models without
any pattern specificity associated with them.
2.5.2 Path profile

Path-based models make use of the principle of specificity, using longer paths to make
predictions. A path-based model makes use of the user's previous navigation path; it can, for
instance, use the longest available sequence to make a prediction. "Path profile" is a term
borrowed from the compiler optimization community. The model effectively uses the longest
path profile matching the current context to make a prediction. Alternatively, among the paths in
the profile that match, the one with the highest observed frequency is selected to make the
prediction. So if (a, b, c, d) was the best observed match with the highest frequency, then it can
be used to predict (d) given that the user visited (a, b, c). The n-gram and the Markov property
together make path-profile-based models more efficient, and hence they are more popular today.
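
One way to read the path-profile idea is as longest-suffix matching over recorded paths. In the hedged sketch below, the profile maps observed paths to their frequencies; the path whose prefix matches the longest suffix of the current context is chosen, with frequency as a tie-breaker, and its final page is the prediction. The data structure is an assumption made for illustration only.

```python
def predict_from_path_profile(profile, context):
    """`profile` maps observed paths (tuples of pages) to frequencies.
    A path (a, b, c, d) matches when the user's context ends with (a, b, c);
    the longest match wins, with frequency used as a tie-breaker."""
    best = None                      # (match_length, frequency, predicted_page)
    for path, freq in profile.items():
        prefix, predicted = path[:-1], path[-1]
        if len(prefix) <= len(context) and tuple(context[-len(prefix):]) == prefix:
            candidate = (len(prefix), freq, predicted)
            if best is None or candidate[:2] > best[:2]:
                best = candidate
    return best[2] if best else None


# Illustrative use: the profile and context below are made-up examples.
profile = {("a", "b", "c", "d"): 5, ("b", "c", "e"): 3}
print(predict_from_path_profile(profile, ["a", "b", "c"]))   # -> 'd'
```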

2.6 Pre-processing

Web prediction models are generally based on machine learning techniques to identify the most
probable future action for a sequence of requests for the following classes of users:

a. Specific users (client-based models are more suitable)

b. Similar-minded users (e.g. a group of students in the same research team)
c. General users (e.g. the wide range of users in an Internet cafe)

Information about the above-mentioned classes of users and their browsing patterns has to be
extracted from web server log files. These files have thousands of log entries and do not readily
come with the information that we need to classify users. Some of the issues associated with log
files, and the preprocessing steps to make them more usable, are discussed below.

Precisely mapping a web domain is not an easy task. There are hundreds of documents that have
links between them, and they are changing continuously. Web pages are updated and new pages
are introduced. Tracking these changes and incorporating them into a model on a regular basis is
a challenging task. Some of the reasons that make these logs difficult to work with are:
i. Most of the logs that are available record only one type of action performed by a user, namely
the document request on the World Wide Web. The more commonly available logs that the
server maintains are the access logs.
ii. The server log entries will not directly reflect any specific pattern, as they represent a variety
of users with different objectives.
iii. Again, not all user requests are recorded by the web server. Some of the excluded requests
are the ones served by the browser cache or a client/proxy cache.

2.7 Preprocessing steps

Server log preprocessing involves removing log entries generated by web crawlers and
search engines. These log entries are automatically generated every time a web crawler tries to
index web pages to update page rankings and other bookkeeping parameters. They do not reflect
any user visit pattern. Also, a considerable number of log entries have their referrer pointing to
search engines. Such log entries can also be ignored.

One can use referrer information to identify requests that are not directly recorded (self-
referring pages or cached web pages). Referrers in logs provide information about the originating
point of the request. They can be useful to identify user movements such as back-button actions
and pages that are served by an intermediate cache or proxy before a request reaches the server.
Embedded documents are assumed to be implicitly fetched during pre-fetching, and hence these
requests are neglected. A great number of log entries are requests for embedded documents such
as graphical, audio and video files. Such components are an integral part of a web page, and
hence pre-fetching a page will implicitly pre-fetch the embedded components as well. Eliminating
these components considerably reduces the size of the hash file used to maintain identifiers for
unique URLs. Clustering algorithms may be used to categorize web requests according to
parameters such as users (IP address) or session interval.
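
A rough sketch of these preprocessing steps is shown below. It assumes log entries have already been parsed into (IP, timestamp, URL, referrer, user-agent) fields, and it uses a hypothetical list of crawler signatures, simple suffix matching for embedded objects and a 30-minute session gap; real log formats and filters vary.

```python
from collections import defaultdict

# Illustrative assumptions: crawler signatures and embedded-object suffixes.
CRAWLER_HINTS = ("bot", "crawler", "spider")
EMBEDDED_SUFFIXES = (".gif", ".jpg", ".png", ".css", ".js", ".mp3", ".avi")
SESSION_GAP = 30 * 60          # 30-minute session interval, in seconds


def keep_entry(url, referrer, user_agent):
    """Drop crawler hits, search-engine referrals and embedded objects."""
    agent = user_agent.lower()
    if any(hint in agent for hint in CRAWLER_HINTS):
        return False
    if "google." in referrer or "bing." in referrer:   # assumed referrer filter
        return False
    return not url.lower().endswith(EMBEDDED_SUFFIXES)


def sessionize(entries):
    """Group (ip, timestamp, url) entries into per-IP sessions, starting a
    new session whenever the gap between requests exceeds SESSION_GAP."""
    sessions = defaultdict(list)
    last_seen = {}
    for ip, ts, url in sorted(entries):
        if ip not in last_seen or ts - last_seen[ip] > SESSION_GAP:
            sessions[ip].append([])        # open a new session for this user
        sessions[ip][-1].append(url)
        last_seen[ip] = ts
    return sessions
```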

2.8 Clustering

Many clustering schemes have been proposed that, when applied along with Markov prediction
techniques, achieve better accuracy. One such scheme is an unsupervised, distance-based,
partitioned clustering scheme, widely used for grouping web user sessions; it is also known as
the C-means (or k-means) clustering algorithm. Prediction techniques were applied using each
cluster and using the whole data set. The results indicate that applying clustering algorithms to
the data set improves the accuracy of the prediction model. There are other clustering schemes,
such as distance-based hierarchical clustering and model-based clustering, which are also known
to improve predictive accuracy. In our implementation of the model, we have not used any
advanced clustering scheme on the data set. However, we grouped the URLs as accessed by the
user, classifying them using the IP address available in the log files. We then see whether these
URLs have been accessed within a fifteen-, thirty-, forty-five-minute or longer interval. Our
results revealed that the prediction was better for the fifteen- and thirty-minute schemes as
compared to other intervals or to having no session intervals at all. However, we do believe that
this conclusion may not hold for all web servers and largely depends on the content the web site
hosts and the users visiting that website.

Clustering is a pattern discovery technique used in the Web usage mining stage of Web
mining. It is defined as the classification of patterns into groups (clusters) based on similarity, in
order to improve common Web activities. Clustering can be model-based or distance-based.
With model-based clustering, the model type is often specified a priori; the model structure
can be determined by model selection techniques and the parameters estimated using maximum
likelihood algorithms, e.g., Expectation Maximization (EM). Distance-based clustering
involves determining a distance measure between pairs of data objects, and then grouping similar
objects together into clusters. The most popular distance-based clustering techniques include
partitioned clustering and hierarchical clustering. A partitioned method partitions the data objects
into K groups and is represented by the k-means algorithm. A hierarchical method builds a
hierarchical set of nested clusters, with the clustering at the top level containing a single cluster
of all data objects and the clustering at the bottom level containing one cluster for each data
object. Model-based clustering has been shown to be effective for high-dimensional text
clustering. However, hierarchical distance-based clustering has proved to be unsuitable for the
vast amount of Web data. Although distance-based clustering methods are computationally more
complex than model-based clustering approaches, they have displayed their ability to produce
more efficient Web document clustering results. Clustering can also be supervised or
unsupervised. The difference between supervised and unsupervised clustering is that with
supervised clustering, patterns in the training data are labeled. New patterns are labeled and
classified into existing labeled groups. Unsupervised clustering can be classified as hierarchical
or non-hierarchical. A common method of non-hierarchical clustering is the k-means algorithm,
which tends to cluster data into even populations. In this work, we use a straightforward
implementation of the k-means clustering algorithm. It is distance-based, unsupervised and
partitioned.

Clustering Web sites can be achieved through page clustering or user clustering. Web page
clustering is performed by grouping pages having similar content. Page clustering can be simple
if the Web site is structured hierarchically; in this case, clustering is obtained by choosing a
higher level of the tree structure of the Web site. On the other hand, clustering user sessions
involves selecting an appropriate data abstraction for a user session and defining the similarity
between two sessions. This process can get complicated due to the number of features that exist
in each session.

Hierarchical distance-based clustering, however, has proved to be unsuitable for the vast amount
of Web data. Partitioned distance-based clustering is disadvantaged by the large number of
different distance measures proposed for clustering purposes; defining a good similarity measure
is very much data-dependent and often requires expert domain knowledge. The most commonly
used distance measures are the Euclidean distance, and the Manhattan or Cosine distance for data
that can be represented in a vector space. Although distance-based clustering methods are
computationally more complex than model-based clustering approaches, they have displayed
their ability to produce more efficient Web document clustering results.

2.8.1 Distance Measures

The clustering algorithm chosen for this work is the k-means clustering algorithm, a simple and
popular form of cluster analysis that has been widely used in grouping Web user sessions. It is
distance-based, as opposed to more complex model-based algorithms. It involves the following:
1. Define a set of sessions (an n-by-p data matrix) to be clustered.
2. Define a chosen number of clusters (k).
3. Randomly assign a number of sessions to each cluster.

The k-means clustering repeatedly performs the following:

1. Calculate the mean vector for all items in each cluster.
2. Reassign the sessions to the cluster whose center is closest to the session.
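
A compact sketch of this procedure on session vectors is given below, using plain Python lists and Euclidean distance (basic Lloyd's algorithm). The random initialisation and fixed iteration count are simplifications; in practice a library implementation would normally be preferred.

```python
import math
import random


def kmeans(sessions, k, iterations=20):
    """Cluster n-by-p session vectors into k groups (basic Lloyd's algorithm)."""
    centers = random.sample(sessions, k)                # random initial centers
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for s in sessions:                              # reassign each session to
            nearest = min(range(k),                     # its closest center
                          key=lambda c: math.dist(s, centers[c]))
            clusters[nearest].append(s)
        for c, members in enumerate(clusters):          # recompute the mean vector
            if members:
                centers[c] = [sum(vals) / len(members) for vals in zip(*members)]
    return centers, clusters
```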

2.8.2 Types of clustering

Hierarchical algorithms find successive clusters using previously established clusters. These
algorithms usually are either agglomerative ("bottom-up") or divisive ("top-down").
Agglomerative algorithms begin with each element as a separate cluster and merge them into
successively larger clusters. Divisive algorithms begin with the whole set and proceed to divide
it into successively smaller clusters.

Partitioned algorithms typically determine all clusters at once, but can also be used as divisive
algorithms in the hierarchical clustering.

Density-based clustering algorithms are devised to discover arbitrary-shaped clusters. In this
approach, a cluster is regarded as a region in which the density of data objects exceeds a
threshold. DBSCAN and OPTICS are two typical algorithms of this kind.

Subspace clustering methods look for clusters that can only be seen in a particular projection
(subspace, manifold) of the data. These methods thus can ignore irrelevant attributes. The
general problem is also known as Correlation clustering while the special case of axis-parallel
subspaces is also known as Two-way clustering, co-clustering or bi-clustering: in these methods
not only the objects are clustered but also the features of the objects, i.e., if the data is
represented in a data matrix, the rows and columns are clustered simultaneously. They usually do
not however work with arbitrary feature combinations as in general subspace methods. But this
special case deserves attention due to its applications in bioinformatics.

Many clustering algorithms require the specification of the number of clusters to produce in the
input data set, prior to execution of the algorithm. Barring knowledge of the proper value
beforehand, the appropriate value must be determined, a problem on its own for which a number
of techniques have been developed.

2.9 Clustering Modules

1. Web caching approach:

Web caching introduces new issues in Web object management and retrieval across the
network. Specifically, Web caching is implemented by proxy server applications developed to
support many users. Proxy applications act as intermediaries between Web users and servers.
Users make their connections to proxy applications running on their hosts. The proxy connects
to the server and relays data between the user and the server. At each request, the proxy server is
contacted first to find out whether it has a valid copy of the requested object. If the proxy has the
requested object, this is considered a cache hit; otherwise a cache miss occurs and the proxy
must forward the request on behalf of the user. Upon receiving a new object, the proxy serves a
copy to the end-user and keeps another copy in its local storage.

2. Web prefetching approach:


The Web prefetching schemes are based on the spatial locality of Web objects. Typically, the
main benefit of employing prefetching is that it prevents bandwidth underutilization and reduces
latency. Therefore, bottlenecks and traffic jams on the Web are bypassed and objects are
transferred faster. Thus, the proxies may effectively serve more users' requests, reducing the
workload on the origin servers. Consequently, the origin servers are protected from flash crowd
events, as a significant part of the Web traffic is dispersed over the proxy servers.

3. Short-term prefetching:
Requests are predicted from the cache's recent access history and, based on these predictions,
clusters of Web objects are pre-fetched. In this context, short-term prefetching schemes use the
Dependency Graph (DG), where the patterns of accesses are held by a graph, and Prediction by
Partial Matching (PPM), a scheme adopted from the text compression domain.

4. Long-term prefetching:
Global object access pattern statistics (e.g., objects' popularity, objects' consistency) are used to
identify valuable (clusters of) objects for prefetching. In this type of scheme, objects with higher
access frequencies and longer update intervals are more likely to be pre-fetched.

5. The clustering approach:
One simplistic solution to this problem is to cluster, for each client group, the most popular
objects. However, it has been reported that the popularity of each object varies considerably. In
addition, the use of administratively tuned parameters to select the most popular objects, or to
decide the number of clusters, causes additional headaches, since there is no a priori knowledge
about where to set the popularity threshold or how many clusters of objects exist. Avoiding the
above limitations, we present a graph-based approach in order to cluster the Web pages in an
efficient way (a sketch of this idea follows the list).

2.10 Association Rule Mining

Association rule mining is a major pattern discovery technique. Association rule discovery on usage data results in finding groups of items or pages that are commonly accessed or purchased together. The original goal of association rule mining was to solve the market basket problem, but its applications reach far beyond market basket analysis and it has been used in various domains, including web mining. In the web mining context, association rules help optimize the organization and structure of a web site. Association rules are mainly characterized by two metrics: support and confidence. The minimum support requirement dictates the efficiency of association rule mining. Support corresponds to statistical significance, while confidence is a measure of the rule's strength.
There are four types of sequential association rules:
1. Subsequence rules: they represent sequential association rules where the items are listed in order.
2. Latest subsequence rules: they take into consideration the order of the items and the most recent items in the set.
3. Substring rules: they take into consideration the order and the adjacency of the items.
4. Latest substring rules: they take into consideration the order of the items, the most recent items in the set, as well as the adjacency of the items.

Mining proceeds in two steps: the discovery of frequent item sets (i.e., item sets that satisfy a minimum support threshold), followed by the discovery of association rules, satisfying a minimum confidence, from these frequent item sets.
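As a small illustration of these two metrics, the MATLAB sketch below computes the support of the item set {A, B} and the confidence of the rule A ⇒ B over a handful of made-up transactions.

% Minimal sketch: support and confidence of a rule A => B over example
% transactions (page sets). The transactions below are made up for illustration.
transactions = { {'A','B','C'}, {'A','B'}, {'A','C'}, {'B','D'}, {'A','B','D'} };
N = numel(transactions);

hasA  = cellfun(@(t) ismember('A', t), transactions);
hasAB = cellfun(@(t) all(ismember({'A','B'}, t)), transactions);

support    = sum(hasAB) / N;            % statistical significance of {A, B}
confidence = sum(hasAB) / sum(hasA);    % strength of the rule A => B

fprintf('support(A,B) = %.2f, confidence(A => B) = %.2f\n', support, confidence);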

Association rule mining allows businesses to infer useful information on customer purchase patterns, shelving criteria in retail chains, stock trends, and so on. The basket data essentially consist of a large number of individual records called transactions, and each transaction is a list of items that participated in it. The goal of association rule mining is to discover rules stating that if a transaction contains certain items, it is likely to contain a specific further item. Formal definitions of association rule mining, as well as sampling techniques for association rule mining in massive databases, have been presented in the literature; sampling has been used quite effectively for solving several problems in databases and data mining.

Association (rule) mining is the task of finding correlations between items in a dataset. Initial research was largely motivated by the analysis of market basket data, the results of which allowed companies to understand purchasing behavior more fully and, as a result, to target market audiences better. For example, an insurance company, by finding a strong correlation between two policies A and B of the form A ⇒ B, indicating that customers that held policy A were also likely to hold policy B, could more efficiently target the marketing of policy B to those clients that held policy A but not B. In effect, the rule represents knowledge about purchasing behavior. Association mining has since been applied to many different domains including market basket and risk analysis in commercial environments, epidemiology, clinical medicine, fluid dynamics, astrophysics, crime prevention, and counter-terrorism: all areas in which the relationship between objects can provide useful knowledge.

Association mining is user-centric, as the objective is the elicitation of useful (or interesting) rules from which new knowledge can be derived. The association mining system's role in this process is to facilitate the discovery, heuristically filter, and enable the presentation of these inferences or rules for subsequent interpretation by the user, who determines their usefulness. Association mining analysis has become a mature field of research: its fundamentals are now well established, and there appears to be little current research on optimizing the performance of classic item set identification.

The majority of current research involves the specialization of fundamental association mining algorithms to address specific issues, such as the development of incremental algorithms to facilitate dynamic dataset mining, or the inclusion of additional semantics (such as time, space, ontologies, etc.) to discover, for example, temporal or spatial association rules. Association mining analysis is a two-part process: first, the identification of sets of items (item sets) within the dataset; second, the subsequent derivation of inferences from these item sets. The majority of related research to date has focused on the efficient discovery of item sets, as its level of complexity is significantly greater than that of inference generation.

2.11 Pruning

Previous work on pruning focuses on the different strategies that can be used to prune effectively; for instance, one study proposes three effective pruning schemes and discusses their advantages over each other. Pruning is very useful as it reduces the space and time complexity of the prediction model.

While pruning a Markov tree, we need to ensure that the Markov property is not disturbed. During pruning operations, we alter the state information of the node and the sub-states associated with it. In subsequent sub-topics, we will discuss how to retain the necessary information while pruning safely. Before that, we describe the different pruning techniques that have been used on Markov models.

Several pruning strategies have been proposed for Markov prediction models [4]. Some of the effective ones are frequency pruning, confidence pruning, error pruning and pessimistic selection.
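A minimal sketch of how frequency and confidence pruning could be applied to a first-order transition count matrix is given below; the counts and thresholds are illustrative assumptions, not values used in this work.

% Minimal sketch of frequency and confidence pruning on a first-order
% transition count matrix C, where C(i,j) = number of times page j followed
% page i in the training data. Counts and thresholds are made-up values.
C = [12 3 0; 1 0 9; 4 4 4];          % transition counts for three pages
minSupport    = 2;                    % frequency pruning threshold (assumed)
minConfidence = 0.3;                  % confidence pruning threshold (assumed)

rowTotals  = sum(C, 2);
confidence = C ./ repmat(rowTotals, 1, size(C, 2));

pruned = C;
pruned(C < minSupport)             = 0;   % frequency pruning: drop rarely seen transitions
pruned(confidence < minConfidence) = 0;   % confidence pruning: drop weak predictions
disp(pruned);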

2.12 Parameters used to measure the efficiency of the model

We use global parameters and local parameters to measure the efficacy of the model. There are
three different global parameters that are used to determine the efficacy of a prediction model.
They are discussed in subsequent paragraphs.

2.13 Global parameters to measure model performance

a. Accuracy (predictive precision): The overall accuracy of the model is the ratio of correct predictions to all user requests. But as pointed out in [1], we should be concerned with the accuracy of the predictions that are actually made, i.e., the ratio of true predictions to attempted predictions. This is normally established using a set of real-world logs that was not used to train the model: the model predicts the trimmed n-gram, and the prediction is compared with the actual web request sequence. Accuracy = Number of true predictions / Number of predictions made.

b. Number of states: This parameter measures the space-time complexity of learning. A model with a large number of states will be difficult to implement, as it will have a large memory overhead, and updating the model will have higher time complexity. Hence we need to make sure that the model does not retain unwanted states.

c. Coverage of the model (applicability): This measures the number of times the model was able to compute a prediction without falling back on any default prediction. Applicability is therefore measured by the number of times predictions are made by following one of the branches of the Markov tree. Note that the count is incremented irrespective of whether the prediction is true or not. Applicability = Number of predictions made / Total number of requests.

2.14 Local parameters to describe model state

Predictive performance of a model state depends on two local values associated with each state:
the predictive confidence and the frequency of the state. Confidence and frequency (support) can
be defined as follows:

a. Confidence: Confidence is the fraction of times the predicted request occurred in this context during the training of the model. It is therefore the probability with which a particular request will be chosen for prediction. In the case of a Markov tree, this value is calculated by dividing the self-count of the node by the child-count of its parent.

b. Frequency (support): Frequency is the number of times that a state has been visited in a particular context. In order to ensure that a given request is the most likely future request, it must have both high confidence and high support. In the case of a Markov tree, the self-count of a node is its frequency.

Note: A prediction with high confidence based on low frequency may not be the best choice, and the converse is also true. Confidence and support thresholds can be used to prune states.
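For illustration, the following sketch computes the confidence and support of a single Markov tree node from its self-count and the child-count of its parent; the counts are made-up values.

% Minimal sketch: confidence and support of a node in a Markov tree,
% using the self-count / child-count convention described above.
parentChildCount = 40;   % total count over all children of the parent state (assumed)
nodeSelfCount    = 25;   % how often this particular next request was seen (assumed)

support    = nodeSelfCount;                       % frequency of the state
confidence = nodeSelfCount / parentChildCount;    % probability of this prediction

fprintf('support = %d, confidence = %.2f\n', support, confidence);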

2.15 Factors that can affect predictive performance

a. Number of predictions: Under normal circumstances, given a state, we make only one prediction for the next possible action, choosing the state with the highest predictive confidence. Predictive performance is seen to be much better when more than one prediction is made, since the chances of making a correct prediction increase. Making extra predictions comes at the cost of system resources and is hence a trade-off, as are some of the other parameters discussed below. This technique is popularly known as top-n prediction: we cache/pre-fetch n resources, starting with the one that has the highest probability.
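A minimal sketch of top-n selection is shown below; the pages and probabilities are invented for illustration.

% Minimal sketch of top-n prediction: given estimated next-page probabilities
% for the current state, pre-fetch the n most probable resources.
pages = {'home', 'news', 'sports', 'mail', 'blog'};   % made-up pages
probs = [0.35 0.25 0.20 0.15 0.05];                   % made-up probabilities
n = 3;

[~, order] = sort(probs, 'descend');
topN = pages(order(1:n));                 % candidates to cache / pre-fetch

fprintf('Pre-fetching:');
fprintf(' %s', topN{:});
fprintf('\n');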

b. N-gram size: The n-gram size, or the order of the Markov model, has been experimentally shown to have a huge impact on predictive capability. The effectiveness of a Markov tree of a particular order is decided by the predictive precision and the applicability of the model. While a first-order model has the highest applicability, higher-order models such as fourth- and fifth-order models have high precision (a 2-gram model is effectively the same as a first-order prediction model). An increase in context helps the model to learn more specific patterns. Third- and fourth-order models have been seen to perform best in terms of balanced precision and support. If the longest sequence is always used to make predictions, performance peaks and then wanes as longer but uncommon sequences are used.

c. Prediction window: An improvement in predictive accuracy can be observed if we allow predictions to match more than the immediate request. As stated, if the request is satisfied by any of the cached (predicted) resources, it can be treated as a successful prediction. Using a prediction window is particularly useful for a server catering to large web traffic.

d. Clustering algorithm: Clustering has already been discussed. There are different types of clustering algorithms: some are supervised while others are not, and they may be partitional or non-partitional. The two most commonly used families of clustering algorithms are distance-based and model-based clustering.

e. Mistake costs: If the total number of predictions is increased, the fraction of correct predictions also tends to grow. There is, however, a trade-off between resource utilization (bandwidth/cache usage) and the total number of predictions made: the more predictions are made, the higher the resource utilization. It is therefore important to consider the probability of false predictions, which can be controlled to some extent by establishing a confidence threshold.

2.16 Different types of logs

The server maintains a variety of logs that can be of assistance in performance analysis, failure inspection, and also in identifying patterns in user activity. Some of the commonly found logs are error logs, query logs, access logs and referrer logs. We are concerned with access logs and referrer/combined format logs because they contain information about user visits and browsing activities on the server. Which logs are maintained largely depends on how the server is configured. The access log contains all the requests that a server receives; these are by and large new requests, since repeated requests may be served by the browser cache, the proxy server, the reverse proxy server or the main server itself if it maintains a cache. The referrer and combined logs capture traversal information. However, the Referer header is not a mandatory field in HTTP and may often be empty.

2.16.1 Access logs

The web server access logs record all requests processed by the server. The location and content of the access logs are controlled by custom log directives, and a log format directive can be used to simplify the selection of the contents of the logs.

2.16.2 Referrer logs

While the access logs record the requests received by the server, the referrer logs additionally document the browsing activity on the server. They also help to document the requests served by intermediate stages such as the browser cache, the proxy server or the reverse proxy server. This format is often called the combined log format.

Chapter-3 Methodology
3.1 About Methodology

Our integration model involves using a low-order Markov model to predict the next page to be visited by a user, and then applying association rule techniques to predict the next page based on longer history data.

Web page prediction involves anticipating the next page to be accessed by the user, or the link the Web user will click next when browsing a Web site. For example, what is the chance that a Web user visiting a site that sells computers will buy an extra battery when buying a laptop? Or is there a greater chance the user will buy an external floppy drive instead? Users' past browsing experience is fundamental in extracting such information, and this is where modeling techniques come in. For instance, using clustering algorithms, we are able to group users according to their browsing experience: different users with different browsing behaviour are grouped together, and prediction is then performed based on the user's link path in the appropriate cluster. A similar kind of prediction can be achieved using Markov model conditional probabilities; for instance, if 50% of the users access page D after accessing pages A, B, C, then there is a 50% chance that a new user who accesses pages A, B, C will access page D next. Our work improves Web page access prediction accuracy by combining both Markov models and clustering techniques. It is based on dividing Web sessions into groups according to Web services and performing Markov model analysis using clusters of sessions instead of the whole data set. This process involves the following steps (a rough sketch of the pipeline is given after the list):
1. Pre-process the Web server log files in a manner where similar Web sessions are
allocated to appropriate categories.
2. Analyze and calculate different distance measures and determine the most suitable
distance measure.
3. Decide on the number of clusters (k) and partition the Web sessions into clusters
according to the chosen distance measure.

4. For each cluster, return the data to its uncategorized and expanded state.
5. Perform Markov model analysis on the whole data set.
6. For each item in the test data set, find the appropriate cluster the item belongs to.
7. Calculate 2-Markov model accuracy using the cluster data as the training data set.
8. Calculate the total prediction accuracy based on clusters.
9. Compare the Markov model accuracy of the clusters to that of the whole data set.
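As a rough illustration of this pipeline (not the actual dissertation code), the sketch below clusters a made-up session matrix with k-means and evaluates a stand-in per-cluster accuracy function; the session matrix, the number of clusters and the markovAccuracy helper are all assumptions, and kmeans here is the Statistics Toolbox function.

% Illustrative sketch of the clustering-then-Markov pipeline described above.
sessions = rand(200, 155);             % 200 sessions over 155 page categories (made up)
k = 5;                                 % chosen number of clusters
idx = kmeans(sessions, k);             % step 3: partition sessions into k clusters

markovAccuracy = @(trainRows) rand();  % hypothetical stand-in for 2-Markov accuracy

accPerCluster = zeros(k, 1);
for c = 1:k
    clusterRows = sessions(idx == c, :);          % steps 6-7: cluster acts as training data
    accPerCluster(c) = markovAccuracy(clusterRows);
end
overallAcc = mean(accPerCluster);                 % step 8: total accuracy over the clusters
fprintf('Overall accuracy: %.2f\n', overallAcc);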

3.2 Integration Markov Model with Association Rules

We integrate association rules with the Markov model in order to improve prediction accuracy. Both have been used individually for prediction purposes, but each of them has its own limitations with respect to web page prediction accuracy and state space complexity. The main advantage of Markov models is that they can generate navigation paths that can be used automatically for prediction, without any extra processing, which makes them very useful for web personalization. Association rule mining, in turn, is used to predict the next page to be accessed by web users: the more frequently pages are accessed together, the higher the probability of the user accessing the next page. Its drawback is that it involves dealing with too many rules, and it is not easy to find a suitable subset of rules to make accurate and reliable predictions.
The architecture of the integration Markov & Association Model (IMAM) is depicted below:

Figure-3.1 The integration Markov & Association Model (IMAM)

The integration model profits from the decreased state space complexity of the lower-order Markov model, using association mining only in cases of ambiguity. The integration model also limits the complexity of the association rules, since rules are generated only in special cases. In brief, the new integration model results in an increase in accuracy and a decrease in state and rule complexity.
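A minimal sketch of this fallback idea is given below; the transition counts, the ambiguity margin and the association rule used as a fallback are illustrative assumptions.

% Minimal sketch: use the low-order Markov prediction when it is unambiguous,
% and fall back to an association rule otherwise. All values are made up.
transitionCounts = [10 9 1];                 % counts for candidate next pages
pages            = {'P1', 'P2', 'P3'};
probs            = transitionCounts / sum(transitionCounts);

ambiguityMargin = 0.05;                      % "ambiguous" if the top two are this close
sortedProbs = sort(probs, 'descend');
[~, best]   = max(probs);

if sortedProbs(1) - sortedProbs(2) > ambiguityMargin
    prediction = pages{best};                % Markov model is decisive
else
    prediction = 'P2';                       % hypothetical rule {long history} => P2
end
fprintf('Predicted next page: %s\n', prediction);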

3.3 Integration Markov with clustering Model (IMCM)

Web page prediction anticipates the next page to be accessed by the user, or the link the web user will click next. Clustering is able to group users according to their browsing experience. This is significant because the Markov model built for a subgroup, which is assumed to be more homogeneous than the whole data set, has a higher quality than the Markov model of the whole data set. Higher-order Markov models achieve higher accuracy but are associated with higher state space complexity; clustering helps to contain this complexity. Clustering techniques have also been used for personalization purposes, by discovering web site structure and extracting useful patterns. Proper clustering groups user sessions with similar browsing behaviour and facilitates classification. In our approach, web sessions are first allocated to a number of categories; k-means clustering is then carried out on the identified web sessions according to some distance measure.

Figure-3.2 The integration Markov & Clustering Model (IMCM)
3.4 The integration Markov Model with Clustering and Association
Rule (IMCA)

We combine the Markov model, association rules and clustering to improve next page prediction accuracy; combining the Markov model with a clustering technique has already been shown to improve prediction accuracy to a great extent. The approach helps reduce the I/O overhead associated with large databases by making only one pass over the database when learning association rules. Combining the rules with the Markov model and clustering is, to our knowledge, novel, and only a few past research efforts have combined all three modules. Integrating all these models together improves on the performance of Markov models, sequential association rules and clustering used individually; in particular, it improves prediction accuracy, as opposed to other combinations that mainly improve prediction coverage and complexity. Better clustering means better Markov model prediction accuracy, because the Markov model prediction will be based on more meaningfully grouped data. It also improves the state space complexity, because the Markov model prediction is carried out on one particular cluster as opposed to the whole data set.

Figure-3.3 The integration Markov Model, Clustering & Association Rules (IMCA)

The integration model then computes the Markov model predictions on the resulting clusters. Association rules are examined only in the cases where the Markov prediction for the current state is ambiguous.

3.5 C-means clustering

Here we consider C-means methods, which are of the relocation partitioning type. This type of clustering depends on the calculation of the mean of each cluster in the process of assigning objects to clusters. The focus on C-means methods follows from the main concern of this work. C-means methods can be of the hard clustering type (crisp C-means) or of the soft clustering type (fuzzy C-means). The crisp C-means clustering algorithm is a widely used and effective method for finding clusters in data, in which each data object belongs to one and only one cluster. The crisp C-means clustering algorithm is as follows:

Step 1. The user is asked to provide the number of clusters k to be sought.
Step 2. Randomly choose k objects as seeds (representatives) of the k clusters.
Step 3. For the rest of the objects in the data set, assign each object to the nearest representative object. Repeat this process until all the objects are assigned to clusters, so that we end up with k clusters C1, C2, ..., Ck, called the initial clustering.
Step 4. Calculate the mean of each cluster and update the assignment of each object: each object is reassigned to the cluster whose mean is at minimum distance.
Step 5. Repeat step 4 until some termination condition is met.

In the crisp C-means clustering algorithm, after the creation of the initial clustering, the mean of each cluster is calculated and some objects change cluster. The mean of a cluster can be an object in the cluster or an imaginary point in the cluster. Our modification lies in determining the mean of a cluster: the mean of a cluster is an actual object instead of an imaginary point, obtained by computing the mean as in the C-means algorithm and then assigning the closest object to be the mean of the cluster. The objectives of that work were to implement a modified version of the C-means algorithm in a system developed in the Visual Basic .NET programming language and to test the modified version on two well-known databases against criteria such as time and number of passes. Fuzzy C-means is a soft, or fuzzy, version of the k-means least-squares clustering algorithm; its objective function is the sum of the squared distances of each data point to each cluster centre, weighted by the membership of the data point in each cluster, over all data points. There is a lot of inherent parallelism in the C-means algorithm: the most obvious is that the distance calculation between every data point and each cluster centre is independent, so C-means has a good mix of coarse- and fine-grained parallelism.

Partitioned clustering is an important part of cluster analysis. Based on various theories, numerous clustering algorithms have been developed, and new clustering algorithms continue to appear in the literature. Partitioned clustering plays a pivotal role in data-based models and is itself categorized as a data-based model.

The three main contributions of that work can be summarized as follows:

1) According to a novel definition of the mean, a unifying generative framework for partitioned clustering algorithms, called the general c-means clustering model (GCM), is presented and studied.

2) Based on the local optimality test of the GCM, the connection between Occam's razor and partitioned clustering is established for the first time. As an application, a comprehensive review of the existing objective-function-based clustering algorithms is presented based on the GCM.

3) Under a common assumption about partitioned clustering, a theoretical guide for devising and implementing clustering algorithms is derived. These conclusions are verified by numerical experimental results.

3.5.1 Distance measure

An important step in most clustering is to select a distance measure, which will determine how
the similarity of two elements is calculated. This will influence the shape of the clusters, as some
elements may be close to one another according to one distance and farther away according to
another. Common distance functions:

 The Euclidean distance (also called distance as the crow flies or 2-norm distance). A
review of cluster analysis in health psychology research found that the most common
distance measure in published studies in that research area is the Euclidean distance or
the squared Euclidean distance.

 The Manhattan distance.
 The maximum norm.
 The Mahalanobis distance corrects data for different scales and correlations in the variables.
 The angle between two vectors can be used as a distance measure when clustering high
dimensional data. See Inner product space.
 The Hamming distance measures the minimum number of substitutions required to
change one member into another.

Another important distinction is whether the clustering uses symmetric or asymmetric distances.
Many of the distance functions listed above have the property that distances are symmetric (the
distance from object A to B is the same as the distance from B to A).
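For illustration, the sketch below evaluates several of the distance functions listed above for two small session vectors; the vectors are made-up values.

% Minimal sketch: common distance measures between two session vectors (made up).
a = [1 0 2 3];
b = [0 1 2 1];

euclidean = norm(a - b);                          % 2-norm ("crow flies") distance
manhattan = sum(abs(a - b));                      % Manhattan (city-block) distance
maxNorm   = max(abs(a - b));                      % maximum norm
cosAngle  = dot(a, b) / (norm(a) * norm(b));      % angle-based similarity
hamming   = sum(a ~= b);                          % substitutions needed (for symbols)

fprintf('Euclidean=%.2f Manhattan=%d Max=%d Cosine=%.2f Hamming=%d\n', ...
        euclidean, manhattan, maxNorm, cosAngle, hamming);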

3.5.2 C-means Algorithm

The algorithm minimizes a quality index defined as the sum of squared distances from all points in a cluster to the centre of that cluster.
Algorithm:
1. Fix the number of clusters.
2. Randomly assign all training input vectors to clusters. This creates a partition.
3. Calculate each cluster centre as the mean of each vector component over all vectors assigned to that cluster. Repeat for all clusters.
4. Compute all Euclidean distances between each cluster centre and each input vector.
5. Update the partition by assigning each input vector to its nearest cluster centre (minimum Euclidean distance).
6. Stop if the centres do not move any more; otherwise loop back to step 3, the calculation of the cluster centres.

Quality of result:
The result of C-means depends on:
1. The number of chosen cluster centres.
2. The order in which the patterns are examined.
3. The geometric properties of the data.

Chapter-4 Implementation
4.1 About MATLAB

MATLAB is a high-level technical computing language and interactive environment for algorithm development, data visualization, data analysis, and numeric computation. Using the MATLAB product, you can solve technical computing problems faster than with traditional programming languages, such as C, C++, and Fortran. You can use MATLAB in a wide range of applications, including signal and image processing, communications, control design, test and measurement, financial modeling and analysis, and computational biology. Add-on toolboxes (collections of special-purpose MATLAB functions, available separately) extend the MATLAB environment to solve particular classes of problems in these application areas.
MATLAB provides a number of features for documenting and sharing your work. You can integrate your MATLAB code with other languages and applications, and distribute your MATLAB algorithms and applications.
Features include:
• High-level language for technical computing.
• Development environment for managing code, files, and data.
• Interactive tools for iterative exploration, design, and problem solving.
• Mathematical functions for linear algebra, statistics, Fourier analysis, filtering,
optimization and numerical integration.
• 2-D and 3-D graphics functions for visualizing data.
• Tools for building custom graphical user interfaces.
• Functions for integrating MATLAB based algorithms with external applications and languages, such as C, C++, FORTRAN, Java™, COM, and Microsoft® Excel.

MATLAB is a high-performance language for technical computing. It integrates computation, visualization, and programming in an easy-to-use environment where problems and solutions are expressed in familiar mathematical notation.
Typical uses include:
• Math and computation

• Algorithm development
• Modeling, simulation, and prototyping
• Data analysis, exploration, and visualization
• Scientific and engineering graphics
• Application development, including Graphical User Interface building

MATLAB is an interactive system whose basic data element is an array that does not require
dimensioning. This allows you to solve many technical computing problems, especially those
with matrix and vector formulations, in a fraction of the time it would take to write a program in
a scalar non-interactive language such as C or Fortran. The name MATLAB stands for matrix
laboratory. MATLAB was originally written to provide easy access to matrix software developed
by the LINPACK and EISPACK projects, which together represent the state-of-the-art in
software for matrix computation.

MATLAB has evolved over a period of years with input from many users. In university
environments, it is the standard instructional tool for introductory and advanced courses in
mathematics, engineering, and science. In industry, MATLAB is the tool of choice for high-
productivity research, development, and analysis.

MATLAB features a family of application-specific solutions called toolboxes. Very important to most users of MATLAB, toolboxes allow you to learn and apply specialized technology. Toolboxes are comprehensive collections of MATLAB functions (M-files) that extend the MATLAB environment to solve particular classes of problems. Areas in which toolboxes are available include signal processing, control systems, neural networks, fuzzy logic, wavelets, simulation, and many others.

The MATLAB system consists of five main parts:

1. The MATLAB language.
This is a high-level matrix/array language with control flow statements, functions, data structures, input/output, and object-oriented programming features. It allows both "programming in the small" to rapidly create quick and dirty throw-away programs, and "programming in the large" to create complete, large and complex application programs.

2. The MATLAB working environment.
This is the set of tools and facilities that you work with as the MATLAB user or programmer. It
includes facilities for managing the variables in your workspace and importing and exporting
data. It also includes tools for developing, managing, debugging, and profiling M-files,
MATLAB's applications.

3. Handle Graphics.
This is the MATLAB graphics system. It includes high-level commands for two-dimensional and
three-dimensional data visualization, image processing, animation, and presentation graphics. It
also includes low-level commands that allow you to fully customize the appearance of graphics
as well as to build complete Graphical User Interfaces on your MATLAB applications.

4. The MATLAB mathematical function library.
This is a vast collection of computational algorithms ranging from elementary functions like sum, sine, cosine, and complex arithmetic, to more sophisticated functions like matrix inverse, matrix eigenvalues, Bessel functions, and fast Fourier transforms.

5. The MATLAB Application Program Interface (API).
This is a library that allows you to write C and FORTRAN programs that interact with MATLAB. It includes facilities for calling routines from MATLAB (dynamic linking), calling MATLAB as a computational engine, and reading and writing MAT-files.

4.2 About the MATLAB Desktop

When you start MATLAB, the desktop appears, containing tools (graphical user interfaces) for
managing files, variables, and applications associated with MATLAB.
The following illustration shows the default desktop. You can customize the arrangement of
tools and documents to suit your needs. For more information about the desktop tools, see
Desktop Tools and Development Environment.

Figure-4.1 Matlab Window

4.3 About MATLAB Matrices

In the MATLAB environment, a matrix is a rectangular array of numbers. Special meaning is sometimes attached to 1-by-1 matrices, which are scalars, and to matrices with only one row or column, which are vectors. MATLAB has other ways of storing both numeric and nonnumeric data, but in the beginning, it is usually best to think of everything as a matrix. The operations in MATLAB are designed to be as natural as possible. Where other programming languages work with numbers one at a time, MATLAB allows you to work with entire matrices quickly and easily. The best way for you to get started with MATLAB is to learn how to handle matrices. Start MATLAB and follow along with each example.
To enter a matrix, simply type in the Command Window
A = [16 3 2 13; 5 10 11 8; 9 6 7 12; 4 15 14 1]

MATLAB displays the matrix you just entered:
A=
16 3 2 13
5 10 11 8
9 6 7 12
4 15 14 1

MATLAB actually has a built-in function that creates magic squares of almost any size. Not
surprisingly, this function is named magic:
B = magic(4)
B=
16 2 3 13
5 11 10 8
9 7 6 12
4 14 15 1
This matrix is almost the same as the matrix A entered above and has the same "magic" properties; the only difference is that the two middle columns are exchanged.

To make this B into A, swap the two middle columns:
A = B(:,[1 3 2 4])
This subscript indicates that, for each of the rows of matrix B, the elements should be reordered in the order 1, 3, 2, 4. It produces:
A=
16 3 2 13
5 10 11 8
9 6 7 12
4 15 14 1

Chapter-5 Dissertation Result
5.1 Result

Merging Web pages by web service according to functionality reduces the data from 2924 unique pages to 155 categories. The sessions were divided into 5 clusters using the k-means algorithm and the C-means distance measure. For each cluster, the categories were expanded back to their original form in the data set; this is performed using a simple program that seeks and displays the data related to each category. A Markov model implementation was first carried out for the whole data set: the data set was divided into a training set and a test set, and the 2-Markov model accuracy was calculated accordingly. Then, using the test set, each transaction was considered as a new point and the distance measures were calculated in order to determine the cluster the point belongs to. Next, the 2-Markov model prediction accuracy was computed using the transaction as a test set and only the cluster that the transaction belongs to as the training set. Prediction accuracy results were obtained using maximum likelihood based on conditional probabilities. All predictions in the test data that did not exist in the training data sets were counted as incorrect and given a zero value. All implementations were carried out using MATLAB. The Web pages in the user sessions are first allocated to categories according to Web services that are functionally meaningful. Then, the k-means clustering algorithm is applied with the most appropriate number of clusters and distance measure. Prediction techniques are applied using each cluster as well as using the whole data set. The experimental results reveal that applying the k-means clustering algorithm to the data set improves the accuracy of next page access prediction. The prediction accuracy achieved is an improvement over previous research that addressed mainly recall and coverage.

Figure-5.1 Pages inside Web Cache

This figure shows the web page sessions, i.e., how many times each page is visited inside the web cache.

Figure-5.2 sessions visited by users

The figure above shows the web pages most recently visited by the users. Modules 7, 8 and 9 are the pages most recently visited by the user and stored inside the web cache.

Figure-5.3 Markov Model with Association

Markov models have been widely used for predicting the next Web page from users' requests recorded in the Web log. Association rule discovery on usage data results in finding groups of items or pages that are commonly accessed and purchased together. Here the web pages in the web cache are considered individually; the figure shows the corresponding data in different colors.

Figure-5.4 pruning with association rule.

The pruning phase with association rules starts by counting the item ranges in the database, in order to determine the frequent ones. Pruning with association rules removes the data or pages that have not been used for a long duration of time. Modules 2, 3 and 4 show the pruning with the association rules.

Figure-5.5 Pruning with clustering

Pruning with the association rules was completed in the previous figure. Here we apply clustering-based pruning: pages or data not used for a long duration of time are, after the association step, pruned within the clustering. As shown in the figure above, module 4 shows the pruned clustering.

Figure-5.6 Clustering

In the dissertation we have so far applied the Markov model with association rules, association rules with pruning, and clustering with pruning. The final clusters appear in modules 1, 2, 3 and 4; they represent the final clustered data, or pages, to be stored inside the web cache. Grouping web sessions into meaningful clusters helps increase the Markov model accuracy. Clustering evaluation methods that adhere to an internal criterion assign the best score to the algorithm that produces clusters with high similarity within a cluster and low similarity between clusters.

Figure-5.7 C-means Methodology

Here, the C-means algorithm assigns each point to the cluster whose centre is nearest. The centre is the average of all the points in the cluster, that is, its coordinates are the arithmetic mean for each dimension separately over all the points in the cluster. Clustering requires selecting a distance measure, which determines how the similarity of two elements is calculated. C-means has been a very important tool for clustering web cache pages by web content. In k-means clustering methods, several analyses are often required before the number of clusters can be determined, and the result can be very sensitive to the choice of the initial cluster centres.

Figure-5.8 Cluster web pages

Finally, clustering can be achieved through page clustering or user clustering. Web page clustering is performed by grouping pages having similar content. Page clustering can be simple if the Web site is structured hierarchically; in this case, clustering is obtained by choosing a higher level of the tree structure of the Web site. On the other hand, clustering user sessions involves selecting an appropriate data abstraction for a user session and defining the similarity between two sessions.

Chapter-6 Conclusions
6.1 General Discussion

The main objective of this dissertation is to help achieve better prediction accuracy for web page access. Recommending the next page a web user will access is very important for various web applications, and web page prediction is addressed by many publications in the literature. The main technique implemented for this purpose is a web page prediction model. In this dissertation we examine the most important techniques: the Markov model, association rule mining, clustering and C-means. Through the integration of these pattern discovery models, we exploited their varied positive impact on web page prediction accuracy. By keeping the modules' limitations to a minimum, preserving their advantages, and integrating the different modules according to different constraints, we are able to achieve more accurate prediction results.

6.2 Conclusion with Result

With the growth of Web-based applications, specifically electronic commerce, there is significant interest in analyzing Web usage data to better understand Web usage and to apply this knowledge to better serve users. This has led to a number of open issues in the Web usage mining area. In many practical applications, due to the introduction of stricter laws, respect for privacy represents a big challenge. In this dissertation, we briefly explored various applications of web usage mining suggested by other authors, and we also analyzed some problems and challenges of Web usage mining. We believe that the most interesting research area deals with the integration of semantics within Web site design, so as to improve the results of Web usage mining applications. Efforts in this direction are likely to be the most fruitful in the creation of much more effective Web usage mining and personalization systems that are consistent with the emergence and proliferation of the Semantic Web. Web prediction can go a long way in improving the experience of users on the web. The web and web technologies are still evolving, and the opportunities to incorporate such techniques are wide open.

6.3 Scope for future work

The above model can be modified to make further improvements in predictive accuracy. An important observation made while going through the literature on various web prediction models is that not many researchers have made an effort to consolidate the different mechanisms that improve the accuracy of the model. These mechanisms may only give a small increase when applied singly; when applied together, however, they may make a substantial difference. This involves using a model such as an all-Kth-order Markov model in conjunction with clustering algorithms and pruning techniques. The above-mentioned reasons make it compelling to consider this proposition.

This model has a few drawbacks too. One of the drawbacks is that it does not track the probability of a next item that has never been seen. Using a variant of prediction by partial matching would help take care of this situation and should be considered in future work. Another improvement would be to perform statistical evaluation to establish pruning thresholds on the fly. The present results are based on logs from a single web server; it is important to validate these observations over log files from different sources.

Chapter-7 References

1. Ian Tian Yi Li, Qiang Yang, Ke Wang. Classification pruning for web request prediction. WWW Posters, 2001.

2. Faten Khalil, Jiuyong Li, Hua Wang. Integrating Markov model with clustering for
predicting web page accesses. Vol. 74. Conferences in Research and Practice in Information
Technology (CRPIT) 2008.

3. D. Bustard, W. Liu and R. Sterritt. Using Markov chains for link prediction in adaptive web sites. LNCS 2311, pp. 60-73, Springer-Verlag Berlin Heidelberg, 2002.

4. James Pitkow and Peter Pirolli. Mining longest repeating subsequences to predict World
Wide Web surfing. Proceedings of USITS’ 99. The 2nd USENIX symposium on Internet
technologies & systems 1999.

5. Mukund Deshpande and George Karypis. Selective Markov model for predicting web-page
accesses. Volume 4, Issue 2,pp. 163-184 ACM transactions on Internet technology 2004.

6. Raluca Popa and Tihamer Levendovszky. Markov model for web access prediction. 8th International Symposium of Hungarian Researchers on Computational Intelligence and Informatics, November 2007.

7. A. Banerjee and J. Ghosh. Clickstream clustering using weighted longest common subsequences. SIAM Conference on Data Mining, Chicago, pages 33–40, 2001.

9. I. Cadez, D. Heckerman, C. Meek, P. Smyth, and S. White. Model-based clustering and visualization of navigation patterns on a web site. Data Mining and Knowledge Discovery, 7, 2003.

9. M. Deshpande and G. Karypis. Selective models for predicting web page accesses.
Transactions on Internet Technology, 4(2):163–184, 2004.

10. C. F. Eick, N. Zeidat, and Z. Zhao. Supervised clustering -algorithms and benefits. IEEE
ICTAI’04, pages 774–776, 2004.

11. S. Gunduz and M. T. OZsu. A web page prediction model based on clickstream tree
representation of user behavior. SIGKDD’03, USA, pages 535– 540, 2003.

12. A. K. Jain, M. N. Murty, and P. J. Flynn. Data clustering: A review. ACM Computing
Surveys, 31(3):264–323, 1999.

13. D. Kim, N. Adam, V. Alturi, M. Bieber, and Y. Yesha. A clickstream-based collaborative filtering personalization model: Towards a better performance. WIDM '04, pages 88–95, 2004.

14. R. Sarukkai. Link prediction and path analysis using markov chains. 9th International
WWW Conference, Amsterdam, pages 377–386, 2000.

15. M. Spiliopoulou, L. C. Faulstich, and K. Winkler. A data miner analysing the navigational
behaviour of web users. Workshop on Machine Learning in User Modelling of the
ACAI’99, Greece, 1999.

16. J. Srivastava, R. Cooley, M. Deshpande, and P. Tan. Web usage mining: Discovery and applications of usage patterns from web data. SIGKDD Explorations, 1(2):12–23, 2000.

17. A. Strehl, J. Ghosh, and R. J. Mooney. Impact of similarity measures on web-page clustering.
AI for Web Search, pages 58–64, 2000.

18. H. Xiong, J. Wu, and J. Chen. K-means clustering versus validation measures: A data distribution perspective. KDD'06, USA, pages 779–784, 2006.

19. Q. Zhao, S. S. Bhomick, and L. Gruenwald. Wam miner: In the search of web access motifs
from historical web log data. CIKM’05, Germany, pages 421–428, 2005.

20. S. Zhong and J. Ghosh. A unified framework for model-based clustering. Machine
Learning Research, 4:1001–1037, 2003.

21. J. Zhu, J. Hong, and J. G. Hughes. Using markov models for web site link prediction.
HT’02, USA, pages 169–170, 2002.

22. A Tutorial on Clustering Algorithms. http://www.elet.polimi.it/upload/matteucc/Clustering/tutorial_html/kmeans.html

23. Chen X, Zhang X. Popularity-based PPM: an effective web prefetching technique for high
accuracy and low storage. In: Proceedings of the international conference on parallel
processing. Canada, Vancouver; 2002.

24. Chen Y, Qiu L, Chen W, Nguyen L, Katz RH. Efficient and adaptiveWeb replication
using content clustering. IEEE J Selected Areas Commun 2003;21(6):979–94.

25. Deshpande M, Karypis G. Selective Markov models for predicting Web-page accesses. In: Proceedings of the 1st SIAM international conference on data mining. Chicago, USA; 2001.

26. Mukund Deshpande and George Karypis. Selective Markov model for predicting web-page
accesses. Volume 4, Issue 2,pp. 163-184 ACM transactions on Internet technology 2004.

27. Miyamoto, S., Ichihashi, H. and Honda, K., Algorithms for Fuzzy Clustering Methods,
Springer, Verlag Berlin Heidelberg, 2008, Pages 16 – 39.

28. Kundu, A., Dutta, R. and Mukhopadhyay, D. (2006) "An Alternate Way to Rank Hyper-linked Web-pages," 9th International Conference on Information Technology, ICIT 2006 Proceedings; Bhubaneswar, India; IEEE Computer Society Press, New York, USA, December 18-21, 2006, pp. 297-298.

29. Mukhopadhyay, D., Mishra, P. and Saha, D. (2007) “An Agent Based Method for Web
Page Prediction,” 1st KES Symposium on Agent and Multi-Agent Systems – Technologies and
Applications, AMSTA 2007 Proceedings, Wroclow, Poland, Lecture Notes in Computer Science,
Springer-Verlag, Germany, May 31- June 1, 2007, pp.219-228.

30. Mukhopadhyay, D., Dutta, R. Kundu, A. and Kim, Y. (2006) “A Model for Web Page
Prediction using Cellular Automata,” The 6th International Workshop MSPT 2006
Proceedings, Youngil

31. A. Savasere, E. Omiecinski, and S. Navathe. An efficient algorithm for mining association rules in large databases. In Proc. 21st International Conference on Very Large Databases (VLDB), Zurich, Switzerland, 1995. Also Technical Report No. GIT-CC-95-04.

