Web Mining Notes PDF

Web Mining
Data mining is the nontrivial process of identifying valid, novel, potentially useful, and ultimately
understandable patterns in data (Fayyad). The most commonly used techniques in data mining are
artificial neural networks, decision trees, genetic algorithms, the nearest-neighbour method, and rule
induction. Data mining research has drawn on a number of other fields, such as inductive
learning, machine learning, and statistics.
Machine learning: the automation of a learning process, where learning is based on observations
of environmental statistics and transitions. Machine learning examines previous examples and
their outcomes and learns how to reproduce these and make generalizations about new cases.
Inductive learning: Induction means the inference of information from data, and inductive learning
is a model-building process in which the database is analyzed to find patterns. The main strategies
are supervised learning and unsupervised learning.
Statistics: used to detect unusual patterns and to explain patterns using statistical models such as
linear models.
Data mining models can be either a discovery model, in which the system automatically discovers
important information hidden in the data, or a verification model, which takes a hypothesis from the
user and tests its validity against the data.
The web contains a collection of pages that includes countless hyperlinks and huge volumes of
access and usage information. Because of the ever-increasing amount of information in
cyberspace, knowledge discovery and web mining are becoming critical for successfully
conducting business in the cyber world. Web mining is the discovery and analysis of useful
information from the web. Web mining is the use of data mining techniques to automatically
discover and extract information from web documents and services (content, structure, and
usage). Two different approaches were taken in initially defining web mining:
i. Process-centric view: web mining as a sequence of tasks.
ii. Data-centric view: web mining defined in terms of the web data that is used in the mining process.
The important data mining techniques applied in the web domain include association rules,
sequential pattern discovery, clustering, path analysis, classification, and outlier discovery.
i. Association Rule Mining: Predict the association and correlation among sets
of items, "where the presence of one set of items in a transaction implies (with
a certain degree of confidence) the presence of other items". That is, it
1) discovers the correlations between pages that are most often referenced
together in a single server session/user session.
2) provides the information: i. What are the sets of pages frequently accessed
together by web users? ii. What page will be fetched next? iii. What are the
paths frequently accessed by web users?
3) Associations and correlations: i. Page associations from usage data: user
sessions, user transactions. ii. Page associations from content data:
similarity based on content analysis. iii. Page associations based on structure:
link connectivity between pages.
Advantages: a) Guide for web site restructuring, by adding links that
interconnect pages often viewed together. b) Improve system performance by
prefetching web data.
ii. Sequential Pattern Discovery: Applied to web access server transaction logs.
The purpose is to discover sequential patterns that indicate user visit patterns
over a certain period, that is, the order in which URLs tend to be accessed.
Advantages: a) useful user trends can be discovered; b) predictions concerning
visit patterns can be made; c) website navigation can be improved; d)
advertisements can be personalized; e) the link structure can be dynamically
reorganized and web site contents adapted to individual client requirements, or
clients can be provided with automatic recommendations that best suit their
customer profiles.
iii. Clustering: Group together items (users, pages, etc.) that have similar
characteristics. a) Page clusters: groups of pages that seem to be conceptually
related according to users' perception. b) User clusters: groups of users that
seem to behave similarly when navigating through a web site.
iv. Classification: maps a data item into one of several predetermined classes.
Example: describing each user's category using profiles. Classification
algorithms include decision trees, the naïve Bayesian classifier, and neural networks.
v. Path Analysis: A technique that involves the generation of some form of
graph that "represents relation[s] defined on web pages". This can be the
physical layout of a web site, in which the web pages are nodes and the links
between these pages are directed edges. Most graphs are involved in
determining frequent traversal patterns, that is, the more frequently visited paths
in a web site. Example: What paths do users traverse before they go to a particular
URL?
To use data mining on our web site, we have to establish and record visitor characteristics, item
characteristics, and visitor interactions. Visitor characteristics include:
i. Demographics: tangible attributes such as home address, income,
property, etc.
ii. Psychographics: personality types, such as early technology interest and
buying tendencies;
iii. Technographics: attributes of the visitor's system, such as operating system,
browser, and modem speed;
Item characteristics include:
i. Web content information: media type, content category, URL;
ii. Product information: product category, color, size, price
Visitor interactions include:
i. Visitor-item interactions: purchase history, advertising history, and
preference information;
ii. Visitor-site statistics: per-session characteristics, such as total time, pages
viewed, and so on.
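One possible way to record the characteristics listed above is as simple records. The sketch below is only an assumption about how they might be laid out; all class names, field names, and sample values are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Visitor:
    visitor_id: str
    demographics: dict = field(default_factory=dict)    # e.g. home address, income
    psychographics: dict = field(default_factory=dict)  # e.g. buying tendencies
    technographics: dict = field(default_factory=dict)  # e.g. operating system, browser


@dataclass
class Item:
    url: str
    media_type: str
    content_category: str
    product: dict = field(default_factory=dict)  # e.g. category, color, size, price


@dataclass
class SessionStats:
    visitor_id: str
    total_seconds: float
    pages_viewed: int
    purchased_urls: list = field(default_factory=list)  # visitor-item interactions


visitor = Visitor("v1", technographics={"os": "Linux", "browser": "Firefox"})
item = Item(
    url="/products/42",
    media_type="text/html",
    content_category="catalog",
    product={"category": "shoes", "color": "red", "size": "42", "price": 59.0},
)
stats = SessionStats("v1", total_seconds=312.0, pages_viewed=7)
print(visitor, item, stats, sep="\n")
```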
We have a lot of information about web visitors and content, but we probably are not
making the best use of it. The existing OLAP systems can report only on directly
observed and easily correlated information. They rely on users to discover patterns and
decide what to do with them. The information is often too complex for humans to
discover these patterns using an OLAP system. To solve these problems, data mining
techniques are utilized.
The scope of data mining is: i. Automated prediction of trends and behaviors. ii.
Automated discovery of previously unknown patterns.
Web mining searches for i. web access patterns, ii. web structure, iii. regularity and
dynamics of web contents. Web mining research is a converging research area from
several research communities, such as the database, information retrieval, and AI research
communities, especially from machine learning and natural language processing. The World
Wide Web is a popular and interactive medium to gather information today. The WWW
provides every Internet citizen with access to an abundance of information. Users
encounter some problems when interacting with the web:
i. Finding relevant information (information overload: only a small portion of
the web pages contain truly relevant/useful information):
a. Low precision (the abundance problem: 99% of information of no interest
to 99% of people), which is due to the irrelevance of many of the search
results. This makes it difficult to find the relevant information.
b. Low recall (limited coverage of the web: Internet sources hidden behind
search interfaces), which is due to the inability to index all the information
available on the web. This makes it difficult to find the unindexed
information that is relevant.
ii. Discovery of existing but "hidden" knowledge (retrieve 1/3rd of the "indexable
web")
iii. Personalization of the information (type & presentation of information):
limited customization to individual users.
iv. Learning about customers/individual users.
v. Lack of feedback on human activities.
vi. Lack of multidimensional analysis and data mining support.
vii. The web constitutes a highly dynamic information source. Not only does the
web continue to grow rapidly, the information it holds also receives constant
updates. News, stock market, service center, and corporate sites revise their
web pages regularly. Linkage information and access records also undergo
frequent updates.
viii. The web serves a broad spectrum of user communities. The Internet's rapidly
expanding user community connects millions of workstations with widely varying
usage purposes. Many users lack good knowledge of the information network's
structure, are unaware of a particular search's heavy cost, frequently get lost
within the web's ocean of information, and face the lengthy waits required to
retrieve search results.
ix. Web page complexity far exceeds the complexity of any traditional text
document collection. Although the web functions as a huge digital library, the
pages themselves lack a uniform structure and contain far more authoring
style and content variations than any set of books or traditional text-based
documents. Moreover, searching it is extremely difficult.
Common problems web marketers want to solve are how to target advertisements
(targeting), personalize web pages (personalization), create web pages that show
products often bought together (associations), classify articles automatically
(classification), characterize groups of similar visitors (clustering), estimate missing data,
and predict future behavior.
Web mining techniques could be used to solve the above problems directly or indirectly.
Subtasks in web mining:
1. Resource finding: the task of retrieving/discovering the locations of
unfamiliar files on the network.
2. Information selection and pre-processing: automatically selecting and
preprocessing specific information from retrieved web resources.
3. Generalization: automatically discovering general patterns at individual web
sites as well as across multiple sites.
4. Analysis: validation and/or interpretation of the mined patterns.
In general, web mining tasks are: i. mining web search engine data, ii. analyzing the
web's link structures, iii. classifying web documents automatically, iv. mining web page
semantic structure and page contents, v. mining web dynamics, vi. personalization.
Thus, web mining refers to the overall process of discovering potentially useful and
previously unknown information or knowledge from web data. Web mining aims at
finding and extracting relevant information that is hidden in web-related data, in
particular in text documents that are published on the web. Like data mining, it is a
multi-disciplinary effort that draws techniques from fields like information retrieval,
statistics, machine learning, natural language processing, and others. Web mining can be a
promising tool to address ineffective search engines that produce incomplete indexing,
retrieval of irrelevant information, and unverified reliability of retrieved information. It is
essential to have a system that helps the user find relevant and reliable information easily
and quickly on the web. Web mining not only discovers information from mounds of data on
the WWW, but also monitors and predicts user visit patterns. This gives designers more
reliable information for structuring and designing a web site.
Given the rate of growth of the web, scalability of search engines is a key issue, as
the amount of hardware and network resources needed is large and expensive. In
addition, search engines are popular tools, so they have heavy constraints on query
answer time. So, the efficient use of resources can improve both scalability and answer
time. One tool to achieve these goals is web mining.
Web mining can be categorized into three areas of interest based on which part of
the web to mine (web mining research lines):
1. Web content mining: the discovery of useful information from web
contents/data/documents, or the application of data mining techniques to
content published on the Internet. The web contains many kinds and types of
data. Basically, web content consists of several types of data, such as plain
text (unstructured), image, audio, video, and metadata, as well as HTML
(semi-structured) or XML (structured) documents, dynamic documents, and
multimedia documents. Recent research on mining multiple types of data is termed
multimedia data mining. Thus we could consider multimedia data mining as an
instance of web content mining. The research around applying data mining
techniques to unstructured text is termed knowledge discovery in texts, text data
mining, or text mining. Hence we could consider text mining as an instance of
web content mining. Research issues addressed in text mining are: topic
discovery, extracting association patterns, clustering of web documents, and
classification of web pages.
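As a small illustration of text mining on web content, the sketch below ranks keywords per page with a basic TF-IDF score (term frequency times the log of inverse document frequency). The page texts, the tiny stop-word set, and the function names are assumptions, and real pages would first need their HTML markup stripped.

```python
import math
import re
from collections import Counter

# Hypothetical page texts, already stripped of markup.
documents = {
    "page1": "web mining applies data mining to web data",
    "page2": "data mining finds patterns in data",
    "page3": "web pages link to other web pages",
}
STOP_WORDS = {"in", "to"}  # deliberately tiny, for illustration only


def tokenize(text):
    """Lowercase the text and keep alphabetic words that are not stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]


def top_keywords(documents, k=2):
    """Return the k highest-scoring TF-IDF terms for each document."""
    tokens = {name: tokenize(text) for name, text in documents.items()}
    doc_freq = Counter(w for words in tokens.values() for w in set(words))
    n_docs = len(documents)
    result = {}
    for name, words in tokens.items():
        tf = Counter(words)
        scores = {
            w: (c / len(words)) * math.log(n_docs / doc_freq[w])
            for w, c in tf.items()
        }
        # sort by descending score, breaking ties alphabetically
        result[name] = sorted(scores, key=lambda w: (-scores[w], w))[:k]
    return result


keywords = top_keywords(documents)
print(keywords)
```

Terms that appear in every page score zero, so words that distinguish a page from the rest of the collection rise to the top.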
Issues in Web content Mining:
i. developing intelligent tools for information retrieval
ii. finding keywords and key phrases
iii. discovering grammatical rules collections
iv. hypertext classification/categorization
v. extracting key phrases from text documents
vi. learning extraction rules
v. Testing user interfaces, monitoring for security purposes, and more
importantly, in web personalization applications.
A typical Web usage mining system consists of 2 tiers: i. Tracking, in which user
interactions are captured and acquired. ii. Analysis, in which user access patterns are
discovered and interpreted by applying typical data mining techniques to the acquired
data.
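The two tiers can be sketched as two small functions: a tracking step that turns raw log lines into events, and an analysis step that groups them into sessions. The log format ("seconds visitor url"), the 30-minute session timeout, and all names are assumptions made for this illustration, not a real server log format.

```python
from collections import Counter, defaultdict

# Hypothetical, simplified log: "<seconds since start> <visitor> <url>" per line.
raw_log = """\
0 alice /home
40 alice /products
100 bob /home
5000 alice /home
5030 alice /cart
"""


def track(raw):
    """Tracking tier: parse raw log lines into (seconds, visitor, url) events."""
    events = []
    for line in raw.splitlines():
        seconds, visitor, url = line.split()
        events.append((int(seconds), visitor, url))
    return sorted(events)


def sessionize(events, timeout=1800):
    """Analysis tier: split each visitor's events into sessions at gaps over `timeout` seconds."""
    sessions = defaultdict(list)  # visitor -> list of sessions (lists of URLs)
    last_seen = {}
    for seconds, visitor, url in events:
        if visitor not in last_seen or seconds - last_seen[visitor] > timeout:
            sessions[visitor].append([])  # start a new session
        sessions[visitor][-1].append(url)
        last_seen[visitor] = seconds
    return dict(sessions)


events = track(raw_log)
sessions = sessionize(events)
page_views = Counter(url for _, _, url in events)
print(sessions)
print(page_views.most_common(1))
```

The resulting sessions are the kind of input that the association, sequential pattern, and path analysis techniques would then be applied to.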
(eb ining
(eb 8ontent ining (eb $tructure ining (eb 6sage ining
Te!t #mage -udio ideo $tructured yperlinks Document (eb -ppln -ppln
ecords $tructure $erver $erver 7evel
#ntra_document #nter_document
