Submitted by: Aradhana Yadav, IT 7th Sem, 0203it091011
Submitted to: Ms. Richa Nakra
AGENDA
o Introduction
o Overview of data mining technology
o Association rules
o Classification
o Clustering
o Applications of data mining
o Commercial tools
o Conclusion
Introduction
o What is data mining?
o Why do we need to mine data?
o On what kind of data can we mine?
[Example of minable data: a GeneFilter microarray comparison report. For each ORF (e.g. YAL001C/TFC3, YBL080C/PET112) it lists the gene name, chromosome, grid coordinates (F, G, R), and the raw and normalized hybridization intensities on two filters (GF1, GF2).]
Knowledge Discovery in Databases and Data Mining
o Data mining: the non-trivial extraction of implicit, previously unknown, and potentially useful information from databases.
o The knowledge discovery process comprises six phases: data selection, data cleansing, enrichment, data transformation or encoding, data mining, and the reporting and display of the discovered information.
Association Rules
o Purpose
Provides rules that correlate the presence of one set of items with the presence of another set of items. Example: customers who buy milk also tend to buy bread (milk => bread).
Association Rules
o Some concepts
o Market-basket model
If customers who buy SHOES also tend to buy SOCKS, put the shoes near the socks so that a customer who buys one will buy the other
o Transaction: the set of items from the itemset that a person buys at the supermarket in a single visit
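Under this model a transaction is just a set of items, and the support of an itemset is the fraction of transactions that contain every item in it. A minimal Python sketch (the `support` function and the sample basket data are illustrative, not part of the slides):

```python
# Market-basket data: each transaction is the set of items bought in one visit.
transactions = [
    {"milk", "bread", "cookies", "juice"},
    {"milk", "juice"},
    {"milk", "eggs"},
    {"bread", "cookies", "coffee"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

print(support({"milk"}, transactions))              # 0.75
print(support({"bread", "cookies"}, transactions))  # 0.5
```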
Association Rules
o Apriori Algorithm
Input: a database of m transactions, D, and a minimum support, mins, represented as a fraction of m
Output: frequent itemsets L1, L2, ..., Lk
o Apriori algorithm: worked example (mins = 2, minf = 0.5)

Database D:
Transaction-id  Time  Items
101    6:35  milk, bread, cookies, juice
792    7:38  milk, juice
1130   8:05  milk, eggs
1735   8:40  bread, cookies, coffee

Candidate 1-itemsets C1, with support: milk 0.75, bread 0.5, juice 0.5, cookies 0.5, eggs 0.25, coffee 0.25
Frequent 1-itemsets L1: milk 0.75, bread 0.5, juice 0.5, cookies 0.5
Frequent 2-itemsets L2: {milk, juice} 0.5, {bread, cookies} 0.5
No frequent 3-itemsets: the two members of L2 share no common item, so C3 is empty
o Apriori Algorithm
Begin
  compute support(ij) = count(ij)/m for each individual item i1, i2, ..., in by scanning the database once and counting the number of transactions in which item ij appears;
  the candidate frequent 1-itemset, C1, is the set of items i1, i2, ..., in;
  the subset of items from C1 where support(ij) >= mins becomes the frequent 1-itemset, L1;
  k = 1;
  termination = false;
  repeat
    Lk+1 = empty set;
    create the candidate frequent (k+1)-itemset, Ck+1, by combining members of Lk that have k-1 items in common (this forms candidate frequent (k+1)-itemsets by selectively extending frequent k-itemsets by one item);
    in addition, keep in Ck+1 only those (k+1)-itemsets such that every subset of size k appears in Lk;
    scan the database once and compute the support for each member of Ck+1;
    if the support for a member of Ck+1 is >= mins then add that member to Lk+1;
    if Lk+1 is empty then termination = true else k = k+1;
  until termination
End;
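The pseudocode above can be sketched in Python. This is a minimal illustration, not the book's implementation: the function name `apriori` and the frozenset representation are mine, and support is counted as an absolute count to match mins = 2 in the worked example:

```python
from itertools import combinations

def apriori(transactions, mins):
    """Return all frequent itemsets with support count >= mins."""
    transactions = [frozenset(t) for t in transactions]
    # C1 -> L1: count individual items in one database scan
    counts = {}
    for t in transactions:
        for item in t:
            counts[item] = counts.get(item, 0) + 1
    L = {frozenset([i]): c for i, c in counts.items() if c >= mins}
    frequent = dict(L)
    k = 1
    while L:
        # Ck+1: join members of Lk sharing k-1 items, then prune any
        # candidate that has an infrequent k-subset
        candidates = {a | b for a in L for b in L
                      if len(a | b) == k + 1
                      and all(frozenset(s) in L for s in combinations(a | b, k))}
        # one scan to count the support of each surviving candidate
        L = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        L = {c: s for c, s in L.items() if s >= mins}
        frequent.update(L)
        k += 1
    return frequent

# The four-transaction example with mins = 2:
txns = [["milk", "bread", "cookies", "juice"], ["milk", "juice"],
        ["milk", "eggs"], ["bread", "cookies", "coffee"]]
result = apriori(txns, 2)
print(result[frozenset(["milk", "juice"])])   # 2
```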
Association Rules
o Frequent-pattern tree algorithm
Motivated by the fact that Apriori based algorithms may generate and test a very large number of candidate itemsets. Example:
with 1000 frequent 1-items, Apriori would have to generate and test 1000*999/2 = 499,500 candidate 2-itemsets
The FP-growth algorithm is one approach that eliminates the generation of a large number of candidate itemsets
Association Rules
o Frequent-pattern tree algorithm
Two steps: (1) generate a compressed version of the database in the form of an FP-tree; (2) apply the FP-growth algorithm to the FP-tree to find the frequent itemsets
Association Rules
o FP-Tree algorithm
Item header table (item, support, link into the tree): milk 3, bread 2, cookies 2, juice 2

Transactions, with items reordered by descending support and infrequent items (eggs, coffee) dropped:
1: milk, bread, cookies, juice
2: milk, juice
3: milk
4: bread, cookies

Resulting FP-tree:
Root
+-- Milk:3
|   +-- Bread:1 -- Cookies:1 -- Juice:1
|   +-- Juice:1
+-- Bread:1 -- Cookies:1
Association Rules
o FP-Growth algorithm
[Figure: the FP-tree from the previous slide, traversed by FP-growth via the header links — Root; Milk:3 with subpaths Bread:1 -> Cookies:1 -> Juice:1 and Juice:1; plus Bread:1 -> Cookies:1 directly under the root.]
Association Rules
o FP-Growth algorithm
Procedure FP-growth (tree, alpha);
Begin
  If tree contains a single path then
    For each combination, beta, of the nodes in the path
      Generate pattern (beta U alpha) with support = minimum support of the nodes in beta
  Else
    For each item, i, in the header of the tree do
    Begin
      Generate pattern beta = (i U alpha) with support = i.support;
      Construct beta's conditional pattern base;
      Construct beta's conditional FP-tree, beta_tree;
      If beta_tree is not empty then FP-growth (beta_tree, beta);
    End
End;
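A compact Python sketch of FP-growth is shown below. It is illustrative only: the class and function names are mine, it counts support as absolute counts, and it omits the single-path shortcut from the pseudocode (each item's conditional pattern base is simply mined recursively, which gives the same frequent itemsets):

```python
from collections import defaultdict, Counter

class Node:
    def __init__(self, item, parent):
        self.item, self.parent, self.count, self.children = item, parent, 0, {}

def build_fp_tree(transactions, mins):
    """One pass to count items, a second to insert each transaction in
    descending-support order; returns the root and the item header table."""
    counts = Counter(i for t in transactions for i in set(t))
    frequent = {i for i, c in counts.items() if c >= mins}
    root, header = Node(None, None), defaultdict(list)
    for t in transactions:
        node = root
        for i in sorted((i for i in set(t) if i in frequent),
                        key=lambda i: (-counts[i], i)):
            if i not in node.children:
                node.children[i] = Node(i, node)
                header[i].append(node.children[i])
            node = node.children[i]
            node.count += 1
    return root, header

def fp_growth(transactions, mins, suffix=()):
    """Mine all frequent itemsets; returns {sorted item tuple: support count}."""
    _, header = build_fp_tree(transactions, mins)
    patterns = {}
    for item, nodes in header.items():
        patterns[tuple(sorted(suffix + (item,)))] = sum(n.count for n in nodes)
        # conditional pattern base: the prefix path of every node for `item`,
        # repeated once per occurrence of that node
        base = []
        for n in nodes:
            path, p = [], n.parent
            while p.item is not None:
                path.append(p.item)
                p = p.parent
            base.extend([path] * n.count)
        patterns.update(fp_growth(base, mins, suffix + (item,)))
    return patterns

txns = [["milk", "bread", "cookies", "juice"], ["milk", "juice"],
        ["milk", "eggs"], ["bread", "cookies", "coffee"]]
pats = fp_growth(txns, 2)
print(pats[("bread", "cookies")])   # 2
```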
Association Rules
o Demo of FP-Growth algorithm
Classification
o Introduction
Classification is the process of learning a model that describes different classes of data; the classes are predetermined. The model produced is usually in the form of a decision tree or a set of rules.
[Figure: a sample decision tree for loan-risk assessment. Internal nodes test married (yes/no), salary (<20k, 20k..50k, >=50k), account balance (<5k, >=5k), and age (<25, >=25); leaves are labeled poor risk, fair risk, or good risk (e.g. salary < 20k -> poor risk; age < 25 -> fair risk, age >= 25 -> good risk).]
RID  Married  Salary     Acct balance  Age    Loanworthy (class attribute)
1    No       >=50k      <5k           >=25   Yes
2    Yes      >=50k      >=5k          >=25   Yes
3    Yes      20k..50k   <5k           <25    No
4    No       <20k       >=5k          <25    No
5    No       <20k       <5k           >=25   No
6    Yes      20k..50k   >=5k          >=25   Yes
E(Married) = 0.92, Gain(Married) = 0.08
E(Salary) = 0.33, Gain(Salary) = 0.67
E(A.balance) = 0.82, Gain(A.balance) = 0.18
E(Age) = 0.81, Gain(Age) = 0.19
Salary has the highest information gain, so it is chosen as the root attribute of the decision tree.
Expected information needed to classify a sample set S over n classes, where p_i is the fraction of S in class i:

I(S_1, S_2, ..., S_n) = -\sum_{i=1}^{n} p_i \log_2 p_i        e.g. I(3,3) = 1

Entropy (expected information) after partitioning S on an attribute A with m distinct values:

E(A) = \sum_{j=1}^{m} \frac{|S_{j1}| + ... + |S_{jn}|}{|S|} \, I(S_{j1}, ..., S_{jn})

Gain(A) = I(S_1, ..., S_n) - E(A)
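These formulas can be checked numerically. A small Python sketch (function names are mine) that reproduces I(3,3) = 1 and Gain(Salary) = 0.67 for the six loan records, where splitting on Salary gives the class counts (0 yes, 2 no), (1 yes, 1 no), and (2 yes, 0 no):

```python
from math import log2

def info(counts):
    """I(S1,...,Sn) = -sum p_i log2 p_i over the class counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c)

def entropy_of_split(partitions):
    """E(A): weighted expected information after splitting on attribute A."""
    total = sum(sum(p) for p in partitions)
    return sum(sum(p) / total * info(p) for p in partitions)

# Splitting the 6 loan records (3 yes / 3 no) on Salary:
# <20k -> (0 yes, 2 no), 20k..50k -> (1 yes, 1 no), >=50k -> (2 yes, 0 no)
e = entropy_of_split([[0, 2], [1, 1], [2, 0]])
print(info([3, 3]))                             # 1.0
print(round(e, 2), round(info([3, 3]) - e, 2))  # 0.33 0.67
```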
Classification
o Algorithm for decision tree induction
Procedure Build_tree (Records, Attributes);
Begin
  Create a node N;
  If all Records belong to the same class, C, then
    Return N as a leaf node with class label C;
  If Attributes is empty then
    Return N as a leaf node with class label C, the class to which the majority of Records belong;
  Select the attribute Ai with the highest information gain from Attributes;
  Label node N with Ai;
  For each known value, Vj, of Ai do
  Begin
    Add a branch from node N for the condition Ai = Vj;
    Sj = subset of Records where Ai = Vj;
    If Sj is empty then
      Add a leaf, L, with class label C, the class to which the majority of Records belong, and return L
    Else add the node returned by Build_tree(Sj, Attributes - {Ai})
  End;
End;
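The induction procedure above can be sketched in Python on the loan table. This is an illustrative sketch, not the book's code: attribute and record names are taken from the table, while the nested-tuple tree representation and function names are mine:

```python
from collections import Counter
from math import log2

def info(labels):
    """Expected information of a class distribution."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def gain(records, attr, target):
    """Information gain from splitting `records` on `attr`."""
    base = info([r[target] for r in records])
    split = 0.0
    for v in {r[attr] for r in records}:
        part = [r[target] for r in records if r[attr] == v]
        split += len(part) / len(records) * info(part)
    return base - split

def build_tree(records, attributes, target):
    classes = [r[target] for r in records]
    if len(set(classes)) == 1 or not attributes:
        return Counter(classes).most_common(1)[0][0]   # leaf: majority class
    best = max(attributes, key=lambda a: gain(records, a, target))
    rest = [a for a in attributes if a != best]
    return (best, {v: build_tree([r for r in records if r[best] == v], rest, target)
                   for v in {r[best] for r in records}})

# The six loan records from the table above
data = [
    {"Married": "No",  "Salary": ">=50k",    "Balance": "<5k",  "Age": ">=25", "Loanworthy": "Yes"},
    {"Married": "Yes", "Salary": ">=50k",    "Balance": ">=5k", "Age": ">=25", "Loanworthy": "Yes"},
    {"Married": "Yes", "Salary": "20k..50k", "Balance": "<5k",  "Age": "<25",  "Loanworthy": "No"},
    {"Married": "No",  "Salary": "<20k",     "Balance": ">=5k", "Age": "<25",  "Loanworthy": "No"},
    {"Married": "No",  "Salary": "<20k",     "Balance": "<5k",  "Age": ">=25", "Loanworthy": "No"},
    {"Married": "Yes", "Salary": "20k..50k", "Balance": ">=5k", "Age": ">=25", "Loanworthy": "Yes"},
]
tree = build_tree(data, ["Married", "Salary", "Balance", "Age"], "Loanworthy")
print(tree[0])   # Salary -- the attribute with the highest gain becomes the root
```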
Classification
o Demo of decision tree
Clustering
o Introduction
The previous data mining task, classification, partitions data based on a pre-classified training sample. Clustering is an automated process that groups related records together, on the basis of their having similar values for their attributes. The resulting groups are usually disjoint.
Clustering
o Some concepts
An important facet of clustering is the similarity function that is used. When the data is numeric, a similarity function based on distance is typically used: the Euclidean metric (Euclidean distance), the Minkowski metric, or the Manhattan metric.
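These metrics are all special cases of the Minkowski formula, differing only in the exponent p. A quick sketch (the example points are made up for illustration):

```python
def minkowski(x, y, p):
    """Minkowski metric: (sum |x_i - y_i|^p)^(1/p).
    p = 1 gives the Manhattan metric, p = 2 the Euclidean metric."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

x, y = (1, 2), (4, 6)
print(minkowski(x, y, 2))   # 5.0 (Euclidean)
print(minkowski(x, y, 1))   # 7.0 (Manhattan)
```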
Clustering
o K-means clustering algorithm
o Input: a database D of m records r1, ..., rm, and the desired number of clusters, k
o Output: a set of k clusters

Begin
  Randomly choose k records as the centroids for the k clusters;
  Repeat
    Assign each record, ri, to the cluster whose centroid (mean) is nearest to ri among the k clusters;
    Recalculate the centroid (mean) of each cluster from the records assigned to it;
  Until no change;
End;
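The loop above, sketched in Python. The function name, the seeded random initialization, and the convergence test on unchanged centroids are my own choices for a self-contained illustration:

```python
import random

def dist2(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(records, k, seed=0):
    """Basic k-means; records are tuples of numbers."""
    centroids = random.Random(seed).sample(records, k)
    while True:
        # assignment step: put each record in the cluster with nearest centroid
        clusters = [[] for _ in range(k)]
        for r in records:
            clusters[min(range(k), key=lambda i: dist2(r, centroids[i]))].append(r)
        # update step: recompute each centroid as the mean of its cluster
        new = [tuple(sum(vals) / len(c) for vals in zip(*c)) if c else centroids[i]
               for i, c in enumerate(clusters)]
        if new == centroids:          # no change -> converged
            return centroids, clusters
        centroids = new

# Two well-separated groups of three points each:
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, clusters = kmeans(pts, 2)
print(sorted(len(c) for c in clusters))   # [3, 3]
```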
Clustering
o Demo of K-means algorithm
Commercial tools
o Oracle Data Miner
http://www.oracle.com/technology/products/bi/od m/odminer.html
o Data To Knowledge
http://alg.ncsa.uiuc.edu/do/tools/d2k
o SAS
http://www.sas.com/
o Clementine
http://spss.com/clemetine/
o Intelligent Miner
http://www-306.ibm.com/software/data/iminer/
Conclusion
o Data mining is a decision support process in which we search for patterns of information in data.
o This technique can be used on many types of data.
o Data mining overlaps with machine learning, statistics, artificial intelligence, databases, and visualization.
Conclusion
The result of mining may be to discover the following type of new information:
o Association rules
o Sequential patterns
o Classification trees
References
o Fundamentals of Database Systems, fourth edition -- R. Elmasri, S.B. Navathe -- Addison-Wesley -- ISBN 0-321-20448-4