Data Mining
CompSci 316
Introduction to Database Systems
Midterm: on Thursday, in class
Open-book, open-notes
No communication devices
Solution to sample midterm was emailed this weekend
Will cover all materials through today
But more focus will be on parts that you already exercised
Project
Data integration
Data
Need
OLTP                         | OLAP
-----------------------------+--------------------------
Mostly updates               | Mostly reads
Short, simple transactions   | Long, complex queries
Clerical users               | Analysts, decision makers
Goal: transaction throughput | Goal: fast queries
Lazy
On demand: at query time
Leave data at sources
Faster
Can operate when sources are unavailable
Approaches
Recomputation
Easy to implement; just take periodic dumps of the sources, say, every night
What if there is no night, e.g., a global organization?
What if recomputation takes more than a day?
Incremental maintenance
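The contrast between the two approaches can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the warehouse is taken to hold a hypothetical summary of total quantity (plus row count) per product, and `recompute` and `apply_delta` are made-up names, not part of any system described in these slides.

```python
def recompute(source_rows):
    """Recomputation: rebuild the whole summary from a full dump of the source."""
    view = {}
    for pid, qty in source_rows:
        total, count = view.get(pid, (0, 0))
        view[pid] = (total + qty, count + 1)
    return view

def apply_delta(view, inserted=(), deleted=()):
    """Incremental maintenance: touch only the groups affected by changed rows."""
    for pid, qty in inserted:
        total, count = view.get(pid, (0, 0))
        view[pid] = (total + qty, count + 1)
    for pid, qty in deleted:
        total, count = view[pid]
        if count == 1:
            del view[pid]          # last row of the group is gone
        else:
            view[pid] = (total - qty, count - 1)
    return view

# Hypothetical source data: (PID, qty) pairs.
source = [("p1", 1), ("p2", 2), ("p1", 5)]
view = recompute(source)

# One sale is added and one is removed at the source.
new_rows, old_rows = [("p2", 3)], [("p1", 1)]
view = apply_delta(view, inserted=new_rows, deleted=old_rows)

# The incrementally maintained view agrees with a fresh recomputation.
source = [r for r in source if r not in old_rows] + new_rows
fresh = recompute(source)
print(view == fresh)
```

Keeping a row count next to each sum is a design choice of this sketch: it lets a group disappear when its last row is deleted, which a sum alone cannot detect.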
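Dimension table: Store
SID | city
----+------------
s1  | Durham
s2  | Chapel Hill
s3  | RTP

Dimension table: Product
PID | name   | cost
----+--------+-----
p1  | beer   | 10
p2  | diaper | 16

Fact table: Sale
OID | Date       | CID | PID | SID | qty | price
----+------------+-----+-----+-----+-----+------
100 | 08/23/2012 | c3  | p1  | s1  |     | 12
102 | 09/12/2012 | c3  | p2  | s1  |     | 17
105 | 09/24/2012 | c5  | p1  | s3  |     | 13

Dimension table: Customer
CID | name | address | city
----+------+---------+-------
c3  | Amy  |         | Durham
c4  | Ben  |         | Durham
c5  | Coy  |         | Durham

Fact table
Big
Constantly growing
Stores measures (often aggregated in queries)

Dimension table
Small
Updated infrequently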
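A typical warehouse query joins the fact table with its dimension tables and aggregates a measure. The sketch below uses Python's built-in `sqlite3` module to show the shape of such a query. It assumes that the values 12, 17 and 13 in the Sale rows are prices, and it leaves out the columns whose values did not survive (qty, address), so treat it as an illustration rather than the course's own example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Store (SID TEXT PRIMARY KEY, city TEXT);
    CREATE TABLE Product (PID TEXT PRIMARY KEY, name TEXT, cost INTEGER);
    -- Fact table, trimmed to the columns this sketch needs.
    CREATE TABLE Sale (OID INTEGER PRIMARY KEY, PID TEXT, SID TEXT, price INTEGER);

    INSERT INTO Store VALUES ('s1', 'Durham'), ('s2', 'Chapel Hill'), ('s3', 'RTP');
    INSERT INTO Product VALUES ('p1', 'beer', 10), ('p2', 'diaper', 16);
    -- Assumption: 12, 17 and 13 are the price values of the three sales.
    INSERT INTO Sale VALUES (100, 'p1', 's1', 12), (102, 'p2', 's1', 17), (105, 'p1', 's3', 13);
""")

# Join the fact table with two dimension tables, then aggregate the measure.
rows = conn.execute("""
    SELECT Store.city, Product.name, SUM(Sale.price) AS total_price
    FROM Sale
    JOIN Store ON Sale.SID = Store.SID
    JOIN Product ON Sale.PID = Product.PID
    GROUP BY Store.city, Product.name
    ORDER BY Store.city, Product.name
""").fetchall()

for row in rows:
    print(row)
```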
Data cube
Simplified schema: Sale (CID, PID, SID, qty)
[Figure: data cube with axes Customer (c3, c4, c5, ALL), Product (p1, p2, ALL) and Store (s1, s2, s3, ALL); example cell (c3, p1, s1) = 1]
CUBE operator
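As a rough illustration of what a CUBE computes, the Python sketch below sums qty for every subset of the dimensions of Sale (CID, PID, SID, qty) and marks each rolled-up dimension with ALL. The function name `cube` and the sample rows are assumptions made for this sketch; only the cell (c3, p1, s1) = 1 is taken from the slides.

```python
from collections import defaultdict
from itertools import combinations

ALL = "ALL"

def cube(rows, dims, measure):
    """Sum `measure` for every subset of `dims`; rolled-up dimensions become ALL."""
    result = defaultdict(int)
    for r in range(len(dims) + 1):
        for kept in combinations(dims, r):
            for row in rows:
                key = tuple(row[d] if d in kept else ALL for d in dims)
                result[key] += row[measure]
    return dict(result)

# Hypothetical rows; only (c3, p1, s1) = 1 comes from the slides.
sales = [
    {"CID": "c3", "PID": "p1", "SID": "s1", "qty": 1},
    {"CID": "c3", "PID": "p2", "SID": "s1", "qty": 2},
    {"CID": "c5", "PID": "p1", "SID": "s3", "qty": 5},
]

c = cube(sales, ("CID", "PID", "SID"), "qty")
print(c[("c3", "p1", "s1")])   # a base cell
print(c[("c3", ALL, ALL)])     # everything bought by customer c3
print(c[(ALL, "p1", ALL)])     # everything sold of product p1
print(c[(ALL, ALL, ALL)])      # grand total
```

With three dimensions there are 2^3 = 8 group-bys, which is why the result of a CUBE can be much larger than the base table.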
Aggregation lattice
Roll up: move toward the top
GROUP BY ∅
GROUP BY CID | GROUP BY PID | GROUP BY SID
GROUP BY CID, PID | GROUP BY CID, SID | GROUP BY PID, SID
GROUP BY CID, PID, SID
Drill down: move toward the bottom
A parent can be computed from any child
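The last point can be checked with a small Python sketch: a coarser group-by (the parent) is derived from a finer one (a child) without going back to the base data. It assumes the aggregate is SUM, which can be combined from partial sums; `roll_up` and the sample numbers are hypothetical.

```python
from collections import defaultdict

def roll_up(child, child_dims, parent_dims):
    """Compute a coarser SUM group-by from a finer one."""
    idx = [child_dims.index(d) for d in parent_dims]
    parent = defaultdict(int)
    for key, total in child.items():
        parent[tuple(key[i] for i in idx)] += total
    return dict(parent)

# Hypothetical results of two different children over the same sales.
by_cid_pid = {("c3", "p1"): 1, ("c3", "p2"): 2, ("c5", "p1"): 5}
by_cid_sid = {("c3", "s1"): 3, ("c5", "s3"): 5}

# GROUP BY CID can be obtained from either child.
from_pid = roll_up(by_cid_pid, ("CID", "PID"), ("CID",))
from_sid = roll_up(by_cid_sid, ("CID", "SID"), ("CID",))
print(from_pid)
print(from_pid == from_sid)
```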
Materialized views
Idea:
Example
GROUP BY ∅ (no grouping attributes) is small, but not useful to most queries
GROUP BY CID, PID, SID is useful to any query, but too large to be beneficial
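One way to picture this trade-off: given a set of materialized group-bys, a query can be answered from any view whose grouping attributes include the query's, and the smallest such view is the cheapest to scan. The sketch below assumes made-up row counts and a hypothetical helper `pick_view`; it is not a view-selection algorithm from the slides.

```python
def pick_view(query_dims, materialized):
    """Return the smallest materialized group-by that can answer the query, or None."""
    usable = [
        (size, dims)
        for dims, size in materialized.items()
        if set(query_dims) <= set(dims)
    ]
    if not usable:
        return None
    return min(usable, key=lambda v: v[0])[1]

# Hypothetical row counts for a few materialized group-bys.
materialized = {
    (): 1,
    ("CID",): 1_000,
    ("CID", "PID"): 200_000,
    ("CID", "PID", "SID"): 5_000_000,
}

print(pick_view(("CID",), materialized))   # answered from the small CID view
print(pick_view(("PID",), materialized))   # needs the CID, PID view
print(pick_view(("SID",), materialized))   # only the largest view works
```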
GROUP BY
sid     | pid    | cid   | qty
--------+--------+-------+----
Durham  | beer   | Alice | 10
Durham  | beer   | Bob   | 2
Durham  | chips  | Bob   | 3
Durham  | diaper | Alice | 5
Raleigh | beer   | Alice | 2
Raleigh | diaper | Bob   | 100
sid     | pid    | sum | rank
--------+--------+-----+-----
Durham  | beer   | 12  | 1
Durham  | diaper | 5   | 2
Durham  | chips  | 3   | 3
Raleigh | diaper | 100 | 1
Raleigh | beer   | 2   | 2
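The query that produced this result did not survive in the text, but the output is consistent with summing qty per (sid, pid) and then ranking the groups within each sid by that sum, highest first. The Python sketch below reproduces that reading. It assumes the second input row is (Durham, beer, Bob, 2), which matches the sum of 12 for Durham beer.

```python
from collections import defaultdict

sales = [
    ("Durham", "beer", "Alice", 10),
    ("Durham", "beer", "Bob", 2),      # assumption: sid and pid inferred from the sum of 12
    ("Durham", "chips", "Bob", 3),
    ("Durham", "diaper", "Alice", 5),
    ("Raleigh", "beer", "Alice", 2),
    ("Raleigh", "diaper", "Bob", 100),
]

# Step 1: GROUP BY sid, pid with SUM(qty).
totals = defaultdict(int)
for sid, pid, _cid, qty in sales:
    totals[(sid, pid)] += qty

# Step 2: partition the groups by sid and rank them by total, highest first.
by_store = defaultdict(list)
for (sid, pid), total in totals.items():
    by_store[sid].append((pid, total))

ranked = []
for sid in sorted(by_store):
    ordered = sorted(by_store[sid], key=lambda g: -g[1])
    for i, (pid, total) in enumerate(ordered):
        # Tied totals share the rank of the first row in the tie.
        if i > 0 and total == ordered[i - 1][1]:
            rank = ranked[-1][3]
        else:
            rank = i + 1
        ranked.append((sid, pid, total, rank))

for row in ranked:
    print(row)
```

The grouping step collapses rows, while the ranking step keeps one output row per group and only adds a column; that second behavior is what distinguishes a window from a GROUP BY.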
Multiple windows
sid     | pid    | cid   | qty
--------+--------+-------+----
Durham  | beer   | Alice | 10
Durham  | beer   | Bob   | 2
Durham  | chips  | Bob   | 3
Durham  | diaper | Alice | 5
Raleigh | beer   | Alice | 2
Raleigh | diaper | Bob   | 100
Data mining
Data → knowledge
DBMS meets AI and statistics
Clustering, prediction (classification and regression), association analysis, outlier analysis, evolution analysis, etc.
A large database of transactions, each containing a set of items
Example: market baskets
Find
TID
items
First try
A naïve algorithm
2 , where
Think:
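All subsets of a frequent itemset must also be frequent
If an itemset is infrequent, none of its supersets can be frequent, so they need not be counted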
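Example: pass 1
Transactions
TID  | items
-----+-----------
T001 | A, B, E
T002 | B, D
T003 | B, C
T004 | A, B, D
T005 | A, C
T006 | B, C
T007 | A, C
T008 | A, B, C, E
T009 | A, B, C
T010 | F
Minimum support = 20% (at least 2 of the 10 transactions)
Frequent 1-itemsets
itemset | count
--------+------
{A}     | 6
{B}     | 7
{C}     | 6
{D}     | 2
{E}     | 2
(Itemset {F} is infrequent)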
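Pass 1 is a single scan that counts each item and keeps those meeting the minimum support. A minimal Python sketch follows. The contents of T010 are not legible in the table, so the sketch assumes T010 = {F}, which is what the note about {F} being infrequent suggests.

```python
from collections import Counter

transactions = {
    "T001": {"A", "B", "E"},
    "T002": {"B", "D"},
    "T003": {"B", "C"},
    "T004": {"A", "B", "D"},
    "T005": {"A", "C"},
    "T006": {"B", "C"},
    "T007": {"A", "C"},
    "T008": {"A", "B", "C", "E"},
    "T009": {"A", "B", "C"},
    "T010": {"F"},   # assumption: inferred from "(Itemset {F} is infrequent)"
}

min_support_pct = 20
# Smallest number of transactions that reaches 20% of the database.
min_count = (min_support_pct * len(transactions) + 99) // 100

# One scan: count how many transactions contain each item.
counts = Counter(item for items in transactions.values() for item in items)
frequent_1 = {item: n for item, n in counts.items() if n >= min_count}

print(min_count)
print(sorted(frequent_1.items()))
```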
Example: pass 2
Transactions: T001 to T010 as in pass 1; minimum support = 20%
Frequent 1-itemsets
itemset | count
--------+------
{A}     | 6
{B}     | 7
{C}     | 6
{D}     | 2
{E}     | 2
Candidate 2-itemsets
itemset | count
--------+------
{A,B}   | 4
{A,C}   | 4
{A,D}   | 1
{A,E}   | 2
{B,C}   | 4
{B,D}   | 2
{B,E}   | 2
{C,D}   | 0
{C,E}   | 1
{D,E}   | 0
Check min. support
Frequent 2-itemsets
itemset | count
--------+------
{A,B}   | 4
{A,C}   | 4
{A,E}   | 2
{B,C}   | 4
{B,D}   | 2
{B,E}   | 2
Example: pass 3
Transactions: T001 to T010 as in pass 1; minimum support = 20%
Frequent 2-itemsets
itemset | count
--------+------
{A,B}   | 4
{A,C}   | 4
{A,E}   | 2
{B,C}   | 4
{B,D}   | 2
{B,E}   | 2
Candidate 3-itemsets
itemset | count
--------+------
{A,B,C} | 2
{A,B,E} | 2
Check min. support
Frequent 3-itemsets
itemset | count
--------+------
{A,B,C} | 2
{A,B,E} | 2
Example: pass 4
Transactions: T001 to T010 as in pass 1; minimum support = 20%
Frequent 3-itemsets
itemset | count
--------+------
{A,B,C} | 2
{A,B,E} | 2
Generate candidates
Candidate 4-itemsets
None: the only possible candidate, {A,B,C,E}, contains the infrequent subsets {A,C,E} and {B,C,E}, so the algorithm stops
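The four passes above can be put together in a short Python sketch of the level-wise procedure: count candidates, keep those meeting the minimum support, then join and prune to get the next level's candidates. This is one possible rendering, not the course's reference code; the function name `apriori` is chosen for this sketch and T010 = {F} is an assumption.

```python
from itertools import combinations

def apriori(transactions, min_count):
    """Return {itemset: count} for every frequent itemset, level by level."""
    def count(candidates):
        # One pass over the data per level.
        return {c: sum(1 for t in transactions if c <= t) for c in candidates}

    items = {item for t in transactions for item in t}
    singles = count({frozenset([i]) for i in items})
    level = {c: n for c, n in singles.items() if n >= min_count}
    frequent = dict(level)
    k = 2
    while level:
        # Join: union pairs of frequent (k-1)-itemsets that form a k-itemset.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune: drop candidates that have an infrequent (k-1)-subset.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
        level = {c: n for c, n in count(candidates).items() if n >= min_count}
        frequent.update(level)
        k += 1
    return frequent

transactions = [
    {"A", "B", "E"}, {"B", "D"}, {"B", "C"}, {"A", "B", "D"}, {"A", "C"},
    {"B", "C"}, {"A", "C"}, {"A", "B", "C", "E"}, {"A", "B", "C"},
    {"F"},   # assumption: contents of T010
]

frequent = apriori(transactions, min_count=2)
for itemset, n in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), n)
```

The prune step is where the subset property pays off: {A,B,C,E} is discarded before any counting because {A,C,E} is not among the frequent 3-itemsets.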
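Frequent 1-itemsets
itemset | count
--------+------
{A}     | 6
{B}     | 7
{C}     | 6
{D}     | 2
{E}     | 2
Frequent 2-itemsets
itemset | count
--------+------
{A,B}   | 4
{A,C}   | 4
{A,E}   | 2
{B,C}   | 4
{B,D}   | 2
{B,E}   | 2
Frequent 3-itemsets
itemset | count
--------+------
{A,B,C} | 2
{A,B,E} | 2
Summary
Data warehousing
Data mining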