
Homework Index

The following problems are based on problems from your textbook.
However, most of them are slightly revised. Please check this spreadsheet
to see if the questions have been changed, or if you are required to use
different data or examples.

Chapter 1:

1, 2, 3, 4, 5, 6

Chapter 2:

2, 3, 4, 6

Chapter 3:

1, 2, 3, 4, 7, 11

Chapter 4:

1, 2, 3, 4, 5, 16

Chapter 5:

1, 2, 4, 6 (there are example problems to help you)

Chapter 6:

6.6, 6.14c

Chapter 7:

6

Chapter 8:

7, 12

Chapter 9:

1, 3, 4, 6

Chapter 10:

1, 2, 6, 10, 16

1 and 3 are directly from the text.

Complete problems 1, 2, 3, 4, 5, 6 from the text.


Do not answer verbatim from the Instructor's Manual or you will not receive credit.
Do not answer verbatim from previous students' answers. I will check, and you
will receive a zero on this homework, and possibly be reported to the Dean's office.


Use this Data for Chapter 2, Problem 2

Problem 2.3 Use the following data:

Time (in minutes)      Freq.   Rel. Freq.
0 but less than 5       53
5 but less than 10      37
10 but less than 15     65
15 but less than 20     18
20 but less than 25     12
25 but less than 30     13
30 or more               2

a. What is the width of each class?
b. How many sessions in the sample?
c. What is the relative frequency of sessions 15 but less than 20?
d. What is the cumulative frequency of sessions 15 but less than 20?
e. In what class does the median occur?
f. Compute an approximate median, using the grouped-data formula:
   median = L_1 + ((n/2 - freq_l) / freq_median) * width

L_1
n
freq_l
freq_median
width
median:
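A minimal Python sketch of this grouped-data median (the function name approx_median and the nominal width of 5 for the open "30 or more" class are illustrative assumptions, not part of the assignment):

    # Approximate median for grouped data: L_1 + ((n/2 - freq_l)/freq_median)*width
    def approx_median(classes):
        """classes: list of (lower_boundary, width, freq), ascending."""
        n = sum(freq for _, _, freq in classes)
        cum = 0
        for lower, width, freq in classes:
            if cum + freq >= n / 2:          # this is the median class
                return lower + ((n / 2 - cum) / freq) * width
            cum += freq

    # The Problem 2.3 table; the open class "30 or more" is given width 5
    # here purely so the list is complete (an assumption for illustration).
    table = [(0, 5, 53), (5, 5, 37), (10, 5, 65), (15, 5, 18),
             (20, 5, 12), (25, 5, 13), (30, 5, 2)]
    print(approx_median(table))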

Problem 2.4

Using the data given in the book (also below), plot a scatter plot of %Fat v. Age.
Also, plot a q-q plot of these two variables.
See instructions below for plotting a q-q plot.
(Those instructions are for plotting vs. a normal dist. NOT the case here.)
Note: If you have the same number of observations in both datasets (which you do in this case),
you can plot a q-q plot by simply plotting a scatter plot of the sorted datasets against each other.
If you are using an Excel trendline, it cannot be customized to run through Q1 and Q3.
For this assignment, you don't have to plot the line.
If you were using a statistical package that supports q-q plots, the line would be drawn.
Alternatively, you could paste the chart into Word or PPT and draw the line there.
Detailed instructions on plotting a q-q plot are below.
These instructions are for plotting a distribution against a normal distribution
to see if the distribution is normal. These can be modified for other q-q plots.
AGE    % FAT
23     9.5
23     26.5
27     7.8
27     17.8
39     31.4
41     25.9
47     27.4
49     27.2
50     31.2
52     34.6
54     42.5
54     28.8
56     33.4
57     30.2
58     34.1
58     32.9
60     41.2
61     35.7

Creating Quantile-Quantile (Q-Q) plots in Excel

These instructions are for plotting against a normal distribution, using z-scores. However, the same instructions can be modified for plotting other datasets against each other.

1. Place or load your data values into the first column. Leave the first row blank for labeling the columns. Sort the data in ascending order (look under the Data menu).
2. Label the second column as Rank. Enter the ranks, starting with 1 in the row right below the label. Each following row will be one more than the last (note: you can use an expression, copy and then paste to save you time).
3. Label the third column as Rank Proportion. This column shows the rank proportion of each value. Use this expression for the first data value: =(B2 - 0.5) / COUNT(B$2:B$N), where N should have the row number of the last cell. Finish the column by copying the first data expression to the remaining rows. Check to make sure your percentiles look like they are correct!
4. Label the fourth column as Rank-based z-scores. Excel provides these values with the NORMSINV function. Use this function to create the values in the fourth column.
5. Copy the first column to the fifth column. The Excel chart wizard works better if the x-axis values are just to the left of the y-axis values.
6. Select the fourth and fifth column. Select the chart wizard and then the scatter plot. The default data values should be good, but you should provide good labels.
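If you would like to check your Excel q-q plot programmatically, here is a minimal Python sketch of the same rank-based procedure (it assumes numpy, scipy, and matplotlib are available; scipy's norm.ppf plays the role of Excel's NORMSINV):

    import numpy as np
    from scipy.stats import norm
    import matplotlib.pyplot as plt

    # %Fat values from Problem 2.4, sorted ascending (step 1).
    data = np.sort(np.array([9.5, 26.5, 7.8, 17.8, 31.4, 25.9, 27.4, 27.2, 31.2,
                             34.6, 42.5, 28.8, 33.4, 30.2, 34.1, 32.9, 41.2, 35.7]))
    n = len(data)
    rank_prop = (np.arange(1, n + 1) - 0.5) / n   # step 3: (rank - 0.5) / count
    z = norm.ppf(rank_prop)                       # step 4: rank-based z-scores

    plt.scatter(z, data)                          # steps 5-6: z on x, data on y
    plt.xlabel("Rank-based z-score")
    plt.ylabel("%Fat")
    plt.show()

For the two-sample q-q plot that Problem 2.4 actually asks for (equal sample sizes), scatter the two sorted samples against each other instead of using z-scores.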

For Problem 2.6,

Using the following tuples, compute the distances listed below:
Tuples: (6, 3, 20, 11) and (12, 0, 17, 9)

Euclidean Distance:  = SQRT( (5-10)^2 + ... + (7-4)^2 )

Tuple i   Tuple j   (xi-xj)^2
6         12
3         0
20        17
11        9
                    SUM =
                    SQRT(SUM) =

As an example, using the book's data:

Tuple i   Tuple j   (xi-xj)^2
22        20        4
1         0         1
42        36        36
10        8         4
                    SUM =        45
                    SQRT(SUM) =  6.708204

Manhattan Distance:  = ABS(5-10) + ... + ABS(7-4)

Tuple i   Tuple j   ABS(xi-xj)
6         12
3         0
20        17
11        9
                    SUM =

As an example, using the book's data:

Tuple i   Tuple j   ABS(xi-xj)
22        20        2
1         0         1
42        36        6
10        8         2
                    SUM =  11

Minkowski Distance (using h=3):  = the h-root of ( ABS(5-10)^h + ... + ABS(7-4)^h )

Tuple i   Tuple j   ABS(xi-xj)^3
6         12
3         0
20        17
11        9
                    SUM =
                    h-root =

As an example, using the book's data:

Tuple i   Tuple j   ABS(xi-xj)^3
22        20        8
1         0         1
42        36        216
10        8         8
                    SUM =     233
                    h-root =  6.153449

Supremum Distance: the maximum distance for one of the attributes:

Tuple i   Tuple j   ABS(xi-xj)
6         12
3         0
20        17
11        9
                    MAX =
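A minimal Python sketch of all four distances, checked against the book's-data example above (names like minkowski are illustrative):

    # Minkowski distance of order h; h=2 is Euclidean.
    def minkowski(i, j, h):
        return sum(abs(a - b) ** h for a, b in zip(i, j)) ** (1 / h)

    i, j = (22, 1, 42, 10), (20, 0, 36, 8)        # the book's tuples
    print(minkowski(i, j, 2))                     # Euclidean: sqrt(45) = 6.708204
    print(sum(abs(a - b) for a, b in zip(i, j)))  # Manhattan: 11
    print(minkowski(i, j, 3))                     # h=3: cube root of 233 = 6.153449
    print(max(abs(a - b) for a, b in zip(i, j)))  # Supremum: max |xi - xj|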

Problem 3.1:

From textbook, but you must use different examples than what is in the Instructor's Manual.
And don't tell me that you don't have access to the Instructor's Manual. Or previous students' answers.
Answers must be in your own words.

Problem 3.2:

Describe the methods and how they were or could have been implemented in the IRIS dataset.

Problem 3.3, Revised

Data: 14, 17, 19, 19, 23, 25, 27, 31, 31, 32, 33, 37, 37, 37, 37, 45, 45, 48, 48, 49, 49, 53, 75, 79, 79

Smoothing by bin means, bin-depth of 3:

          Sum of Bin    Mean    New Bin
Bin 1:
Bin 2:
Bin 3:
Bin 4:
Bin 5:
Bin 6:
Bin 7:
Bin 8:
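A minimal Python sketch of smoothing by bin means with a bin depth of 3 (variable names are illustrative; the final bin simply keeps whatever values remain):

    data = [14, 17, 19, 19, 23, 25, 27, 31, 31, 32, 33, 37, 37,
            37, 37, 45, 45, 48, 48, 49, 49, 53, 75, 79, 79]
    depth = 3
    for k in range(0, len(data), depth):
        b = data[k:k + depth]                     # one bin of (up to) 3 values
        mean = sum(b) / len(b)
        print(b, "sum:", sum(b), "mean:", round(mean, 2),
              "new bin:", [round(mean, 2)] * len(b))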

Problem 3.4
Regarding data integration:

a. List some domain-specific (e.g., business, scientific, etc., NOT technical) reasons why heterogeneous data sources may require integration.
b. List some ways that data can be heterogeneous and require integration. Give a one-sentence example from any domain:
   synonyms, homonyms, formatting issues, levels of granularity, etc.
c. What is a schema? What is the difference between static schema integration and partial dynamic integration?
   Give one example of when each would be appropriate.

Problem 3.7
DATA: 14, 17, 19, 19, 23, 25, 27, 31, 31, 32, 33, 37, 37, 37, 37, 45, 45, 48, 48, 49, 49, 53, 75, 79, 80

a.  Transform the value 53 for this dataset onto the range [0.0, 1.0]:
    v' = [ (v - minA)/(maxA - minA) ] * (new_maxA - new_minA) + new_minA
    v =
    minA =
    maxA =
    new_minA =
    new_maxA =
    v' =

b.  Transform the value 35 to a z-value.
    (Recall that you have mu and sigma from Problem 2.2 -- use 18.18 for sigma.)
    z = (x - mu) / sigma
    x =
    mu =
    sigma = 18.18
    z =

c.  Use normalization by decimal scaling:
    (Explain how many decimal places you are using for the scaling.)
    v' =

d.  Which of the normalization methods would be appropriate for the IRIS dataset?
    (Review pages 113-115.)
    Discuss one of the numeric attributes, such as PetalWidth.
    (Consider the limitations of the various methods.)

    decimal scaling:

    min-max:

    z-score:
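A minimal Python sketch of the three normalizations (mu is left symbolic because it comes from your Problem 2.2 work; only sigma = 18.18 is given above):

    def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
        return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

    def z_score(x, mu, sigma):
        return (x - mu) / sigma

    def decimal_scaling(v, j):
        return v / 10 ** j        # j = smallest integer with max(|v'|) < 1

    print(min_max(53, min_a=14, max_a=80))        # part a
    print(decimal_scaling(53, j=2))               # part c: scale by 10^2
    # part b: z_score(35, mu, 18.18) once you have mu from Problem 2.2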

Problem 3.11
Use the data listed for Part a. Use the IRIS dataset for the other parts.

Data: 14, 17, 19, 19, 23, 25, 27, 31, 31, 32, 33, 37, 37, 37, 37, 45, 45, 48, 48, 49, 49, 53, 75, 79, 80

Bins: 20, 30, 40, 50, 60, 70, 80

Problem 3.11
a.  Plot an equi-width histogram of width 10.
b.  Using the IRIS dataset, sketch examples of sampling: SRSWOR, SRSWR, cluster sampling.
    Sample petalwidth. In your results, show observation #, petalwidth, and class.
    Use samples of size 5 (and also 5 clusters).
c.  Would you recommend using a stratified sample? Why or why not?
    (Hint: sorting the data by your stratification variable may help you decide.)

SRSWOR:
Obsv. #    PetalWidth    Class

SRSWR:
Obsv. #    PetalWidth    Class

Clustered: (create clusters on a separate worksheet)
Obsv. #    PetalWidth    Class

To create clusters, keep the data in the same order.
Divide the data into however many clusters you want.
Take a SRS from each cluster. Include that observation in your final sample.
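A minimal Python sketch of the three sampling designs (it assumes rows is a list of (obs, petalwidth, cls) tuples built from the IRIS data below; all names are illustrative):

    import random

    def srswor(rows, n=5):
        return random.sample(rows, n)                      # without replacement

    def srswr(rows, n=5):
        return [random.choice(rows) for _ in range(n)]     # with replacement

    def cluster_sample(rows, n_clusters=5):
        # Keep the data in order, split into n_clusters equal chunks,
        # then take one SRS from each cluster, as described above.
        size = len(rows) // n_clusters
        clusters = [rows[k:k + size] for k in range(0, size * n_clusters, size)]
        return [random.choice(c) for c in clusters]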

Observation #   sepallength   sepalwidth   petallength   petalwidth   class
1     6.2   2.8   4.8   1.8   Iris-virginica
2     6.3   2.9   5.6   1.8   Iris-virginica
3     5.1   3.5   1.4   0.3   Iris-setosa
4     5.2   3.5   1.5   0.2   Iris-setosa
5     5.9   3     4.2   1.5   Iris-versicolor
6     5.7   3     4.2   1.2   Iris-versicolor
7     5.5   2.6   4.4   1.2   Iris-versicolor
8     6.4   2.8   5.6   2.2   Iris-virginica
9     5.1   2.5   3     1.1   Iris-versicolor
10    6.7   3.3   5.7   2.5   Iris-virginica
11    7.7   3.8   6.7   2.2   Iris-virginica
12    4.8   3.4   1.6   0.2   Iris-setosa
13    7     3.2   4.7   1.4   Iris-versicolor
14    6.3   2.5   4.9   1.5   Iris-versicolor
15    5.2   4.1   1.5   0.1   Iris-setosa
16    5.8   2.6   4     1.2   Iris-versicolor
17    5.7   2.8   4.5   1.3   Iris-versicolor
18    4.6   3.4   1.4   0.3   Iris-setosa
19    6.4   2.9   4.3   1.3   Iris-versicolor
20    5.6   2.8   4.9   2     Iris-virginica
21    6.3   2.7   4.9   1.8   Iris-virginica
22    7.7   2.8   6.7   2     Iris-virginica
23    6.1   2.9   4.7   1.4   Iris-versicolor
24    5.5   2.5   4     1.3   Iris-versicolor
25    4.9   3     1.4   0.2   Iris-setosa
26    6     2.2   5     1.5   Iris-virginica
27    4.9   3.1   1.5   0.1   Iris-setosa
28    5.7   2.8   4.1   1.3   Iris-versicolor
29    5     2     3.5   1     Iris-versicolor
30    5.4   3     4.5   1.5   Iris-versicolor
31    6.5   3     5.8   2.2   Iris-virginica
32    6.5   3     5.2   2     Iris-virginica
33    4.9   3.1   1.5   0.1   Iris-setosa
34    6.7   3.1   4.7   1.5   Iris-versicolor
35    5.4   3.4   1.5   0.4   Iris-setosa
36    4.4   3.2   1.3   0.2   Iris-setosa
37    7.4   2.8   6.1   1.9   Iris-virginica
38    6.1   3     4.6   1.4   Iris-versicolor
39    6.2   2.2   4.5   1.5   Iris-versicolor
40    6.9   3.1   4.9   1.5   Iris-versicolor
41    6     2.9   4.5   1.5   Iris-versicolor
42    5.4   3.7   1.5   0.2   Iris-setosa
43    5.1   3.4   1.5   0.2   Iris-setosa
44    5.3   3.7   1.5   0.2   Iris-setosa
45    6     2.2   4     1     Iris-versicolor
46    5.1   3.3   1.7   0.5   Iris-setosa
47    6.5   2.8   4.6   1.5   Iris-versicolor
48    5.4   3.9   1.7   0.4   Iris-setosa
49    5.7   3.8   1.7   0.3   Iris-setosa
50    6.1   3     4.9   1.8   Iris-virginica
51    5.4   3.4   1.7   0.2   Iris-setosa
52    5     3.5   1.6   0.6   Iris-setosa
53    5     3     1.6   0.2   Iris-setosa
54    5     3.3   1.4   0.2   Iris-setosa
55    6.3   2.3   4.4   1.3   Iris-versicolor
56    4.6   3.1   1.5   0.2   Iris-setosa
57    6.4   3.2   5.3   2.3   Iris-virginica
58    5.5   2.4   3.7   1     Iris-versicolor
59    7.9   3.8   6.4   2     Iris-virginica
60    6.4   3.1   5.5   1.8   Iris-virginica
61    5.8   2.7   5.1   1.9   Iris-virginica
62    6.3   3.3   6     2.5   Iris-virginica
63    5     3.4   1.5   0.2   Iris-setosa
64    5.7   4.4   1.5   0.4   Iris-setosa
65    6.5   3.2   5.1   2     Iris-virginica
66    7.6   3     6.6   2.1   Iris-virginica
67    6.4   3.2   4.5   1.5   Iris-versicolor
68    5.9   3.2   4.8   1.8   Iris-versicolor
69    6.7   2.5   5.8   1.8   Iris-virginica
70    5.5   4.2   1.4   0.2   Iris-setosa
71    4.9   3.1   1.5   0.1   Iris-setosa
72    4.9   2.5   4.5   1.7   Iris-virginica
73    6.3   3.4   5.6   2.4   Iris-virginica
74    7.7   3     6.1   2.3   Iris-virginica
75    5     3.4   1.6   0.4   Iris-setosa
76    5.2   2.7   3.9   1.4   Iris-versicolor
77    6.8   3.2   5.9   2.3   Iris-virginica
78    5.4   3.9   1.3   0.4   Iris-setosa
79    6.6   2.9   4.6   1.3   Iris-versicolor
80    5.6   3     4.5   1.5   Iris-versicolor
81    4.8   3.4   1.9   0.2   Iris-setosa
82    6.7   3     5     1.7   Iris-versicolor
83    5.5   3.5   1.3   0.2   Iris-setosa
84    5.8   2.8   5.1   2.4   Iris-virginica
85    5.8   2.7   4.1   1     Iris-versicolor
86    6.1   2.8   4.7   1.2   Iris-versicolor
87    6.2   2.9   4.3   1.3   Iris-versicolor
88    6.7   3.1   4.4   1.4   Iris-versicolor
89    4.6   3.6   1     0.2   Iris-setosa
90    4.7   3.2   1.6   0.2   Iris-setosa
91    5.1   3.8   1.9   0.4   Iris-setosa
92    7.2   3     5.8   1.6   Iris-virginica
93    5     2.3   3.3   1     Iris-versicolor
94    6.9   3.2   5.7   2.3   Iris-virginica
95    5.5   2.4   3.8   1.1   Iris-versicolor
96    4.8   3.1   1.6   0.2   Iris-setosa
97    6.4   2.7   5.3   1.9   Iris-virginica
98    5.9   3     5.1   1.8   Iris-virginica
99    6.7   3.1   5.6   2.4   Iris-virginica
100   4.5   2.3   1.3   0.3   Iris-setosa
101   6.7   3     5.2   2.3   Iris-virginica
102   6.1   2.8   4     1.3   Iris-versicolor
103   5.5   2.3   4     1.3   Iris-versicolor
104   5.6   2.9   3.6   1.3   Iris-versicolor
105   6.5   3     5.5   1.8   Iris-virginica
106   6.6   3     4.4   1.4   Iris-versicolor
107   6     2.7   5.1   1.6   Iris-versicolor
108   6.9   3.1   5.4   2.1   Iris-virginica
109   6.2   3.4   5.4   2.3   Iris-virginica
110   7.2   3.6   6.1   2.5   Iris-virginica
111   4.9   2.4   3.3   1     Iris-versicolor
112   5.1   3.7   1.5   0.4   Iris-setosa
113   5.8   2.7   5.1   1.9   Iris-virginica
114   5.7   2.9   4.2   1.3   Iris-versicolor
115   7.3   2.9   6.3   1.8   Iris-virginica
116   4.7   3.2   1.3   0.2   Iris-setosa
117   5     3.5   1.3   0.3   Iris-setosa
118   5.7   2.6   3.5   1     Iris-versicolor
119   6.8   2.8   4.8   1.4   Iris-versicolor
120   6.3   2.5   5     1.9   Iris-virginica
121   5.8   4     1.2   0.2   Iris-setosa
122   4.6   3.2   1.4   0.2   Iris-setosa
123   6     3     4.8   1.8   Iris-virginica
124   6.3   2.8   5.1   1.5   Iris-virginica
125   5.1   3.8   1.5   0.3   Iris-setosa
126   6.8   3     5.5   2.1   Iris-virginica
127   5.7   2.5   5     2     Iris-virginica
128   5.6   3     4.1   1.3   Iris-versicolor
129   7.7   2.6   6.9   2.3   Iris-virginica
130   4.4   2.9   1.4   0.2   Iris-setosa
131   4.8   3     1.4   0.1   Iris-setosa
132   6.9   3.1   5.1   2.3   Iris-virginica
133   5     3.2   1.2   0.2   Iris-setosa
134   6     3.4   4.5   1.6   Iris-versicolor
135   5.1   3.8   1.6   0.2   Iris-setosa
136   5.6   2.7   4.2   1.3   Iris-versicolor
137   5.1   3.5   1.4   0.2   Iris-setosa
138   5.2   3.4   1.4   0.2   Iris-setosa
139   6.7   3.3   5.7   2.1   Iris-virginica
140   6.4   2.8   5.6   2.1   Iris-virginica
141   4.4   3     1.3   0.2   Iris-setosa
142   7.2   3.2   6     1.8   Iris-virginica
143   5.8   2.7   3.9   1.2   Iris-versicolor
144   6.3   3.3   4.7   1.6   Iris-versicolor
145   5     3.6   1.4   0.2   Iris-setosa
146   5.6   2.5   3.9   1.1   Iris-versicolor
147   7.1   3     5.9   2.1   Iris-virginica
148   6.1   2.6   5.6   1.4   Iris-virginica
149   4.3   3     1.1   0.1   Iris-setosa
150   4.8   3     1.4   0.3   Iris-setosa

Observation #   sepallength   sepalwidth   petallength   petalwidth   class
3     5.1   3.5   1.4   0.3   Iris-setosa
4     5.2   3.5   1.5   0.2   Iris-setosa
12    4.8   3.4   1.6   0.2   Iris-setosa
15    5.2   4.1   1.5   0.1   Iris-setosa
18    4.6   3.4   1.4   0.3   Iris-setosa
25    4.9   3     1.4   0.2   Iris-setosa
27    4.9   3.1   1.5   0.1   Iris-setosa
33    4.9   3.1   1.5   0.1   Iris-setosa
35    5.4   3.4   1.5   0.4   Iris-setosa
36    4.4   3.2   1.3   0.2   Iris-setosa
42    5.4   3.7   1.5   0.2   Iris-setosa
43    5.1   3.4   1.5   0.2   Iris-setosa
44    5.3   3.7   1.5   0.2   Iris-setosa
46    5.1   3.3   1.7   0.5   Iris-setosa
48    5.4   3.9   1.7   0.4   Iris-setosa
49    5.7   3.8   1.7   0.3   Iris-setosa
51    5.4   3.4   1.7   0.2   Iris-setosa
52    5     3.5   1.6   0.6   Iris-setosa
53    5     3     1.6   0.2   Iris-setosa
54    5     3.3   1.4   0.2   Iris-setosa
56    4.6   3.1   1.5   0.2   Iris-setosa
63    5     3.4   1.5   0.2   Iris-setosa
64    5.7   4.4   1.5   0.4   Iris-setosa
70    5.5   4.2   1.4   0.2   Iris-setosa
71    4.9   3.1   1.5   0.1   Iris-setosa
75    5     3.4   1.6   0.4   Iris-setosa
78    5.4   3.9   1.3   0.4   Iris-setosa
81    4.8   3.4   1.9   0.2   Iris-setosa
83    5.5   3.5   1.3   0.2   Iris-setosa
89    4.6   3.6   1     0.2   Iris-setosa
90    4.7   3.2   1.6   0.2   Iris-setosa
91    5.1   3.8   1.9   0.4   Iris-setosa
96    4.8   3.1   1.6   0.2   Iris-setosa
100   4.5   2.3   1.3   0.3   Iris-setosa
112   5.1   3.7   1.5   0.4   Iris-setosa
116   4.7   3.2   1.3   0.2   Iris-setosa
117   5     3.5   1.3   0.3   Iris-setosa
121   5.8   4     1.2   0.2   Iris-setosa
122   4.6   3.2   1.4   0.2   Iris-setosa
125   5.1   3.8   1.5   0.3   Iris-setosa
130   4.4   2.9   1.4   0.2   Iris-setosa
131   4.8   3     1.4   0.1   Iris-setosa
133   5     3.2   1.2   0.2   Iris-setosa
135   5.1   3.8   1.6   0.2   Iris-setosa
137   5.1   3.5   1.4   0.2   Iris-setosa
138   5.2   3.4   1.4   0.2   Iris-setosa
141   4.4   3     1.3   0.2   Iris-setosa
145   5     3.6   1.4   0.2   Iris-setosa
149   4.3   3     1.1   0.1   Iris-setosa
150   4.8   3     1.4   0.3   Iris-setosa
5     5.9   3     4.2   1.5   Iris-versicolor
6     5.7   3     4.2   1.2   Iris-versicolor
7     5.5   2.6   4.4   1.2   Iris-versicolor
9     5.1   2.5   3     1.1   Iris-versicolor
13    7     3.2   4.7   1.4   Iris-versicolor
14    6.3   2.5   4.9   1.5   Iris-versicolor
16    5.8   2.6   4     1.2   Iris-versicolor
17    5.7   2.8   4.5   1.3   Iris-versicolor
19    6.4   2.9   4.3   1.3   Iris-versicolor
23    6.1   2.9   4.7   1.4   Iris-versicolor
24    5.5   2.5   4     1.3   Iris-versicolor
28    5.7   2.8   4.1   1.3   Iris-versicolor
29    5     2     3.5   1     Iris-versicolor
30    5.4   3     4.5   1.5   Iris-versicolor
34    6.7   3.1   4.7   1.5   Iris-versicolor
38    6.1   3     4.6   1.4   Iris-versicolor
39    6.2   2.2   4.5   1.5   Iris-versicolor
40    6.9   3.1   4.9   1.5   Iris-versicolor
41    6     2.9   4.5   1.5   Iris-versicolor
45    6     2.2   4     1     Iris-versicolor
47    6.5   2.8   4.6   1.5   Iris-versicolor
55    6.3   2.3   4.4   1.3   Iris-versicolor
58    5.5   2.4   3.7   1     Iris-versicolor
67    6.4   3.2   4.5   1.5   Iris-versicolor
68    5.9   3.2   4.8   1.8   Iris-versicolor
76    5.2   2.7   3.9   1.4   Iris-versicolor
79    6.6   2.9   4.6   1.3   Iris-versicolor
80    5.6   3     4.5   1.5   Iris-versicolor
82    6.7   3     5     1.7   Iris-versicolor
85    5.8   2.7   4.1   1     Iris-versicolor
86    6.1   2.8   4.7   1.2   Iris-versicolor
87    6.2   2.9   4.3   1.3   Iris-versicolor
88    6.7   3.1   4.4   1.4   Iris-versicolor
93    5     2.3   3.3   1     Iris-versicolor
95    5.5   2.4   3.8   1.1   Iris-versicolor
102   6.1   2.8   4     1.3   Iris-versicolor
103   5.5   2.3   4     1.3   Iris-versicolor
104   5.6   2.9   3.6   1.3   Iris-versicolor
106   6.6   3     4.4   1.4   Iris-versicolor
107   6     2.7   5.1   1.6   Iris-versicolor
111   4.9   2.4   3.3   1     Iris-versicolor
114   5.7   2.9   4.2   1.3   Iris-versicolor
118   5.7   2.6   3.5   1     Iris-versicolor
119   6.8   2.8   4.8   1.4   Iris-versicolor
128   5.6   3     4.1   1.3   Iris-versicolor
134   6     3.4   4.5   1.6   Iris-versicolor
136   5.6   2.7   4.2   1.3   Iris-versicolor
143   5.8   2.7   3.9   1.2   Iris-versicolor
144   6.3   3.3   4.7   1.6   Iris-versicolor
146   5.6   2.5   3.9   1.1   Iris-versicolor
1     6.2   2.8   4.8   1.8   Iris-virginica
2     6.3   2.9   5.6   1.8   Iris-virginica
8     6.4   2.8   5.6   2.2   Iris-virginica
10    6.7   3.3   5.7   2.5   Iris-virginica
11    7.7   3.8   6.7   2.2   Iris-virginica
20    5.6   2.8   4.9   2     Iris-virginica
21    6.3   2.7   4.9   1.8   Iris-virginica
22    7.7   2.8   6.7   2     Iris-virginica
26    6     2.2   5     1.5   Iris-virginica
31    6.5   3     5.8   2.2   Iris-virginica
32    6.5   3     5.2   2     Iris-virginica
37    7.4   2.8   6.1   1.9   Iris-virginica
50    6.1   3     4.9   1.8   Iris-virginica
57    6.4   3.2   5.3   2.3   Iris-virginica
59    7.9   3.8   6.4   2     Iris-virginica
60    6.4   3.1   5.5   1.8   Iris-virginica
61    5.8   2.7   5.1   1.9   Iris-virginica
62    6.3   3.3   6     2.5   Iris-virginica
65    6.5   3.2   5.1   2     Iris-virginica
66    7.6   3     6.6   2.1   Iris-virginica
69    6.7   2.5   5.8   1.8   Iris-virginica
72    4.9   2.5   4.5   1.7   Iris-virginica
73    6.3   3.4   5.6   2.4   Iris-virginica
74    7.7   3     6.1   2.3   Iris-virginica
77    6.8   3.2   5.9   2.3   Iris-virginica
84    5.8   2.8   5.1   2.4   Iris-virginica
92    7.2   3     5.8   1.6   Iris-virginica
94    6.9   3.2   5.7   2.3   Iris-virginica
97    6.4   2.7   5.3   1.9   Iris-virginica
98    5.9   3     5.1   1.8   Iris-virginica
99    6.7   3.1   5.6   2.4   Iris-virginica
101   6.7   3     5.2   2.3   Iris-virginica
105   6.5   3     5.5   1.8   Iris-virginica
108   6.9   3.1   5.4   2.1   Iris-virginica
109   6.2   3.4   5.4   2.3   Iris-virginica
110   7.2   3.6   6.1   2.5   Iris-virginica
113   5.8   2.7   5.1   1.9   Iris-virginica
115   7.3   2.9   6.3   1.8   Iris-virginica
120   6.3   2.5   5     1.9   Iris-virginica
123   6     3     4.8   1.8   Iris-virginica
124   6.3   2.8   5.1   1.5   Iris-virginica
126   6.8   3     5.5   2.1   Iris-virginica
127   5.7   2.5   5     2     Iris-virginica
129   7.7   2.6   6.9   2.3   Iris-virginica
132   6.9   3.1   5.1   2.3   Iris-virginica
139   6.7   3.3   5.7   2.1   Iris-virginica
140   6.4   2.8   5.6   2.1   Iris-virginica
142   7.2   3.2   6     1.8   Iris-virginica
147   7.1   3     5.9   2.1   Iris-virginica
148   6.1   2.6   5.6   1.4   Iris-virginica

50 Iris-setosa.
50 Iris-versicolor.
50 Iris-virginica.

Problem 4.1
from the textbook

Update-driven is a static-schema integration. Query-driven uses mappings and ontologies
(just different words for the same thing.)

Problem 4.2

from the text.

Problem 4.3
Consider the following situation:

A chemical research company runs multiple projects. Each project is
headquartered in a specific region. Projects require chemists to work on them,
and equipment must be assigned to those chemists for specific projects. The
company maintains many categories of equipment, much of it quite expensive.
Therefore, the company maintains records of the number of hours that a specific
piece of equipment is checked out to a specific chemist for a specific project. In
addition, the charge to the project varies, depending upon the project, the
chemist and the equipment.

The following attributes are stored in the data warehouse for each instance of equipment assignment
(hours used and amount charged):

For each Chemist:


ChemistID
Chemist Name
Chemist Rank

For each piece of Equipment:


equipment_serial#
equipmentDescription
EquipmentCategory
Category Manager

For each Project:


ProjectID
ProjectName
HQ Region
HQ City
HQ State
HQ Zip Code

Construct a star-schema and a snowflake schema for this data warehouse.


Determine the keys, and construct new keys as you extract additional entities
for the snowflake schema.


Problem 4.4
Refer to the problem in the text.
Part-a (the snowflake schema) is completed below.

Using this schema, starting with the base cuboid [student, course, semester, instructor], what specific OLAP operations
(e.g., roll up from semester to year) should one perform in order to list the following:
to list the average grade for all students in COMP 300
to list the average grade for all of my students in COMP 300 (I am instructor #007)

to list the average grade for all students taking courses in the CS department
to list the average grade for all math majors in 2013
to list the average grade for all students in 2013
to list the average grade of English courses for each student
to list the average grade for each student in the year 2012
to list the average grade for all students in the year 2012

Problem 4.5:
Consider the scenario in the textbook for this problem, and given the Star schema below.

a.  Given the dimensions shown in the schema, give an example of a concept hierarchy.
    This may be a dimension, or even parts of a dimension.
    (Given: Spectator status is indeed hierarchical--so don't use that as an example.)
b.  Give an example of a dimension that most likely has attributes that are not part of
    a concept hierarchy.
c.  Starting with the base cuboid [date, spectator, location, game], what specific OLAP operations
    should one perform in order to list the total charge paid in Chicago?
d.  Starting with the base cuboid [date, spectator, location, game], what specific OLAP operations
    should one perform in order to list the total charge paid by students
    in the Allstate Arena in December?

Problem 4.16:

Answer the questions in the text, but use as an example the 3-D data cube shown in Figure 4.3 on p. 138.
That is the same as the left-most cuboid in Figure 4.4, but without the dimension of Supplier.
Although the book states that Fig 4.3 is not a base cuboid (rather, they state it is a cuboid summarized
by Supplier), it displays the same data as the first base cuboid in Figure 4.4.
So for the purposes of this assignment, treat Fig 4.3 as a base cuboid.
Modify it as follows:
The company also does business in Los Angeles.
The company has recently expanded into wiring.
The company has decided to track its sales data by the fifth of the year instead of by quarter.
(It's a strange company. They call them Q1, Q2, Q3, Q4, and Q5.)
This means that instead of four distinct values for each dimension in the base cuboid, there are 5.

A data cube, C, has n dimensions, and each dimension has exactly p distinct values in the base cuboid.
Assume that there are no concept hierarchies associated with the dimensions.
For our example, n=3 and p=5.
a.
What is the maximum number of cells possible in the base cuboid?
(To visualize, the cuboid is pictured in Figure 4.3.)
b.
What is the minimum number of cells possible in the base cuboid?
Give an example, listing those cells that constitute an example of a minimum number.
For instance, cells (1,1,1), (1,1,2), (1,1,3), (1,1,4) and (1,1,5). (This is an incorrect example.)
Then, give the values of those cells. For instance, (Q1, home entertainment, Vancouver) (Fig. 4.3).
c.
What is the maximum number of cells possible (including both base cells and aggregate cells) in
the data cube, C?
Again, using the example of Figure 4.3, what is the maximum number of cells?
How many cells in the base cuboid?
How many cells in the 3-D cuboids?
How many cells in the 2-D cuboids?
How many cells in the 1-D cuboids?
How many cells in the Apex?
d.
What is the minimum number of cells possible in the data cube, C?
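For parts a and c, the standard counting argument can be checked with a few lines of Python (this assumes no concept hierarchies, as stated above: each dimension contributes its p values plus the aggregate value '*'):

    n, p = 3, 5
    print(p ** n)           # part a: maximum base-cuboid cells = 125
    print((p + 1) ** n)     # part c: maximum cells overall = 216
    # breakdown for n=3, p=5: 125 base + 3*25 (2-D) + 3*5 (1-D) + 1 (apex) = 216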

Problem 5.2:

straight from the text

Problem 5.4 homework assignment:

Suppose that a base cuboid has three dimensions, A, B, C, with the following number of cells:
|A| = 5000, |B| = 500, and |C| = 50000.
Suppose that each dimension is evenly partitioned into 5 equal-size portions for chunking.

a.  Assuming each dimension has only one level, draw the complete lattice of the cube.

b.  If each cube cell stores one measure with 1 byte:
    What is the size of each portion in A?
    in B?
    in C?

c.  What is the total size of the computed cube if the cube is dense?

d.  State the order for computing the chunks in the cube that requires the least amount of space, and
    compute the total amount of main memory space required for computing the 2-D planes.
    The order of computation that requires the least amount of space is C-A-B.
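A minimal Python tally for parts b and c (it assumes one 1-byte measure per cell and that a dense cube materializes every cuboid in the lattice):

    A, B, C = 5000, 500, 50000
    print(A // 5, B // 5, C // 5)      # part b: portion sizes 1000, 100, 10000

    cuboids = [A * B * C, A * B, A * C, B * C, A, B, C, 1]
    print(sum(cuboids))                # part c: total cells (= bytes) if dense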

Problem 5.6
Suppose that there are only 2 base cells in a 20-dimensional base cuboid:
{(a1, a2, a3, a4, a5, ..., a19, a20), (a1, a2, a3, a4, b5, ..., b19, b20)}
Compute the # of non-empty aggregate cells.
List the overlapped cells.

When computing a cube of high dimensionality, we encounter the inherent curse of dimensionality
problem: there exists a huge number of subsets of combinations of dimensions.
a.
Suppose that there are only two base cells, {(a1, a2, a3, . . . , a100), (a1, a2, b3, . . . , b100)}, in a 100-
dimensional base cuboid. Compute the number of nonempty aggregate cells. Comment on the
storage space and time required to compute these cells.
Each base cell generates 2^100 - 1 aggregate cells. (We subtract 1 because, for example, (a1, a2, a3, . . . , a100)
is not an aggregate cell.) Thus, the two base cells generate 2(2^100 - 1) = 2^101 - 2 aggregate cells;
however, four of these cells are counted twice. These four cells are: (a1, a2, *, . . . , *), (a1, *, . . . , *),
(*, a2, *, . . . , *), and (*, *, . . . , *). Therefore, the total number of cells generated is 2^101 - 6.
NOTE: there are 2 elements in common. So you subtract 2^2 from the number of aggregate cells.
If there would be 5 elements in common, you would subtract 2^5 from the # of aggregate cells.
Note that any cell that has a3, ..., a100 or b3, ..., b100 in it will NOT be a duplicate cell.
So the only possible duplicate cells are those with ONLY a1 and/or a2 (or all *).

b.
Suppose we are to compute an iceberg cube from the above. If the minimum support count in
the iceberg condition is two, how many aggregate cells will there be in the iceberg cube? Show
the cells.
They are 4: {(a1, a2, *, . . . , *), (a1, *, *, . . . , *), (*, a2, *, . . . , *), (*, *, *, . . . , *)}.
Note that this is 2^2. The exponent is the same as the number of common elements.

c.
Introducing iceberg cubes will lessen the burden of computing trivial aggregate cells in a data
cube. However, even with iceberg cubes, we could still end up having to compute a large number
of trivial uninteresting cells (i.e., with small counts). Suppose that a database has 20 tuples that
map to (or cover) the two following base cells in a 100-dimensional base cuboid, each with a cell
count of 10: {(a1, a2, a3, . . . , a100) : 10, (a1, a2, b3, . . . , b100) : 10}.

Let the minimum support be 10. How many distinct aggregate cells will there be like the following:
{(a1, a2, a3, a4, . . . , a99, *) : 10, . . . , (a1, a2, *, a4, . . . , a99, a100) : 10, . . . , (a1, a2, a3, *, . . . , *, *) : 10}?
There will be 2^101 - 6, as shown above.

That's because the base cells already have a count of 10. So all of the aggregate cells based on them
will have a minimum count of 10 also.

Example for Chapter 5 Exercise 1 using only 5 dimensions

base cell c1: (a1,d2,d3,d4,d5)
base cell c2: (d1,b2,d3,d4,d5)
base cell c3: (d1,d2,c3,d4,d5)

Agg. #   c1 aggregate       c1 overlaps     c2 aggregate       additional c2 overlaps   c3 aggregate
1        (*,d2,d3,d4,d5)    none            (*,b2,d3,d4,d5)    none                     (*,d2,c3,d4,d5)
2        (a1,*,d3,d4,d5)    none            (d1,*,d3,d4,d5)    none                     (d1,*,c3,d4,d5)
3        (*,*,d3,d4,d5)     c2:3            (*,*,d3,d4,d5)     none                     (*,*,c3,d4,d5)
4        (a1,d2,*,d4,d5)    none            (d1,b2,*,d4,d5)    none                     (d1,d2,*,d4,d5)
5        (*,d2,*,d4,d5)     c3:5            (*,b2,*,d4,d5)     none                     (*,d2,*,d4,d5)
6        (a1,*,*,d4,d5)     none            (d1,*,*,d4,d5)     c3:6                     (d1,*,*,d4,d5)
7        (*,*,*,d4,d5)      c2:7, c3:7      (*,*,*,d4,d5)      none                     (*,*,*,d4,d5)
8        (a1,d2,d3,*,d5)    none            (d1,b2,d3,*,d5)    none                     (d1,d2,c3,*,d5)
9        (*,d2,d3,*,d5)     none            (*,b2,d3,*,d5)     none                     (*,d2,c3,*,d5)
10       (a1,*,d3,*,d5)     none            (d1,*,d3,*,d5)     none                     (d1,*,c3,*,d5)
11       (*,*,d3,*,d5)      c2:11           (*,*,d3,*,d5)      none                     (*,*,c3,*,d5)
12       (a1,d2,*,*,d5)     none            (d1,b2,*,*,d5)     none                     (d1,d2,*,*,d5)
13       (*,d2,*,*,d5)      c3:13           (*,b2,*,*,d5)      none                     (*,d2,*,*,d5)
14       (a1,*,*,*,d5)      none            (d1,*,*,*,d5)      c3:14                    (d1,*,*,*,d5)
15       (*,*,*,*,d5)       c2:15, c3:15    (*,*,*,*,d5)       none                     (*,*,*,*,d5)
16       (a1,d2,d3,d4,*)    none            (d1,b2,d3,d4,*)    none                     (d1,d2,c3,d4,*)
17       (*,d2,d3,d4,*)     none            (*,b2,d3,d4,*)     none                     (*,d2,c3,d4,*)
18       (a1,*,d3,d4,*)     none            (d1,*,d3,d4,*)     none                     (d1,*,c3,d4,*)
19       (*,*,d3,d4,*)      c2:19           (*,*,d3,d4,*)      none                     (*,*,c3,d4,*)
20       (a1,d2,*,d4,*)     none            (d1,b2,*,d4,*)     none                     (d1,d2,*,d4,*)
21       (*,d2,*,d4,*)      c3:21           (*,b2,*,d4,*)      none                     (*,d2,*,d4,*)
22       (a1,*,*,d4,*)      none            (d1,*,*,d4,*)      c3:22                    (d1,*,*,d4,*)
23       (*,*,*,d4,*)       c2:23, c3:23    (*,*,*,d4,*)       none                     (*,*,*,d4,*)
24       (a1,d2,d3,*,*)     none            (d1,b2,d3,*,*)     none                     (d1,d2,c3,*,*)
25       (*,d2,d3,*,*)      none            (*,b2,d3,*,*)      none                     (*,d2,c3,*,*)
26       (a1,*,d3,*,*)      none            (d1,*,d3,*,*)      none                     (d1,*,c3,*,*)
27       (*,*,d3,*,*)       c2:27           (*,*,d3,*,*)       none                     (*,*,c3,*,*)
28       (a1,d2,*,*,*)      none            (d1,b2,*,*,*)      none                     (d1,d2,*,*,*)
29       (*,d2,*,*,*)       c3:29           (*,b2,*,*,*)       none                     (*,d2,*,*,*)
30       (a1,*,*,*,*)       none            (d1,*,*,*,*)       c3:30                    (d1,*,*,*,*)
31       (*,*,*,*,*)        c2:31, c3:31    (*,*,*,*,*)        none                     (*,*,*,*,*)

c1 single overlaps:             8
c1 double overlaps:             4    (1 x 2^2 -- this has to be counted twice)
additional c2 single overlaps:  4
total single overlaps:         12    (3 x 2^2)

non-empty aggregate cells:    93  =  3 x 2^5 - 3   (each cell generates 2^5 - 1 nonempty aggregate cells)
non-overlapping aggregates:   73  =  3 x 8 x 2^2 - 1 x 3 x 2^2 - 2 x 1 x 2^2 - 3  =  19 x 2^2 - 3

How many non-empty cuboids will a full data cube contain?
2^5 = 32

How many non-empty aggregate (i.e. non-base) cells will a full cube contain?
Each cell generates 2^5 - 1 nonempty aggregated cells. So we have 3 x 2^5 - 3 cells.

Double Overlap: There is one case where all 3 base cells overlap: d4, d5 in common.
1 case x 2^2 = 4. So there are 4 aggregate cells that "double" overlap.
So this must be subtracted twice, since it was originally included 3 times.

Single Overlap: There are 3 cases where 2 base cells overlap:
base cell c1: a1, d2, d3, d4, d5   (green circles around d3)
base cell c2: d1, b2, d3, d4, d5   (blue circles around d2)
base cell c3: d1, d2, c3, d4, d5   (red circles around d1)
3 x 2^2 (exponent is the number of elements all 3 have in common)
(The circles were placed so that they printed correctly.)

---------------------------------------------------
Explaining the final formula:
3 x 8 x 2^2 - 1 x 3 x 2^2 - 2 x 1 x 2^2 - 3  =  19 x 2^2 - 3  =  73
Begin with the non-empty aggregate cells:   3 x 2^5 - 3
Factor out 2^3 (since everything else is written in terms of 2^2):   3 x 2^3 x 2^2 - 3
Simplify:   3 x 8 x 2^2 - 3
Subtract single overlaps:   1 x 3 x 2^2
Subtract double overlaps (twice!!):   2 x 1 x 2^2

An easier way to view single overlaps: We know that each cell generates 2^n non-empty cells.
So, for double overlaps, all of the cells have 2 elements in common (d4 and d5), so each cell generates
2^2 aggregate cells from d4 and d5. That's 3 x 2^2. One copy we keep. The other two are duplicates.
So we subtract 2 x 2^2.
For single overlaps, consider the case of cell 1 and cell 2. They have 3 elements in common: d3, d4 and d5.
Those dimensions will generate 2^3 overlapped cells. However, we already counted 2^2 cells for the
double overlap. What we are really saying is that we have one additional case (when d3 overlaps) that
d4 and d5 also overlap. So we have the 1 additional single overlap of 2^2. The same situation holds true
for cell2 and cell3. There is one additional element (d1) that they have in common. This would generate
an additional 2^2 overlapping cells. (Why is the exponent 2, when there's only one additional element
in common? Because it is generating duplicates for when you have d3, d4 and d5 ALL in common.
See aggregate #3 and aggregate #11 as examples.)

This may help to explain why in your textbook problem 5.1, they subtract 3 x 2^7 -- because a total of 7
elements overlap in all 3 cells, so you are subtracting those additional cases of double overlap where you
also have an overlap with yet another element, which only overlaps in two of the cells.
In the problem in the book (5.1), note that the exponent is 2^7.
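The counts above (93 with duplicates, 73 distinct) can be verified by brute force; a minimal Python sketch:

    from itertools import combinations

    base_cells = [("a1", "d2", "d3", "d4", "d5"),
                  ("d1", "b2", "d3", "d4", "d5"),
                  ("d1", "d2", "c3", "d4", "d5")]

    def aggregates(cell):
        # Every cell obtained by replacing a non-empty subset of positions with '*'.
        n = len(cell)
        for r in range(1, n + 1):
            for pos in combinations(range(n), r):
                yield tuple("*" if k in pos else v for k, v in enumerate(cell))

    all_aggs = [a for c in base_cells for a in aggregates(c)]
    print(len(all_aggs))        # 93 = 3 x (2^5 - 1), duplicates included
    print(len(set(all_aggs)))   # 73 distinct non-empty aggregate cells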

Problem 5.1
Consider a base cuboid of 4 dimensions with 3 base cells:

base cell 1: (a1, a2, a3, a4)
base cell 2: (a1, a2, a3, b4)
base cell 3: (a1, c2, c3, b4)

a. How many non-empty cuboids will a full data cube contain?
b. How many non-empty aggregate (i.e. non-base) cells will a full cube contain?
c. How many nonempty aggregate cells will an iceberg cube contain?

NOTES:
1. Notice that there is one overlapped element across all 3 cells.
2. Notice that there are NO additional overlapped elements between cell 1 and cell 3.
   This is different from our examples in class.
3. Notice that there is 1 additional (single overlapped) element between cell 2 and cell 3.
4. Notice that there are 2 additional (single overlapped) elements between cell 1 and cell 2.
A good way to proceed might be:
1. How many double overlapped cells are there?
2. How many single overlapped should be subtracted for overlaps between cell2 and cell3?
3. Now consider Cell1 and Cell2. If you were considering them separately, just those two cells,
   how many overlapped cells would there be (they have 3 elements in common)?
   So how many would you subtract?
   But wait!!! You already subtracted some of those, because they are double overlaps.
   So how many do you have left to subtract?
This problem is a little bit different from in-class examples, just to see if you really "get it".
Even if you generate all of the cells, and highlight them to show the overlaps, answer the
questions in the steps above, so that I know that you understand why.


Problem 6.6 Find frequent itemsets, using both apriori and FP-tree
For apriori: show each C_k and L_k, as demonstrated in class
For FP: show each tree iteration

T100    {H, O, A, R, D, S, E}
T200    {C, O, A, R, S, E}
T300    {E, C, A, R, D, S}
T400    {R, O, A, D, S}
T500    {H, O, U, S, E}

min_sup = 60%
60% of 5 transactions = 3

Create the strong association rules that can be inferred from L_2.
Create the strong association rules for set SOR.
To create association rules where min_sup = 60% and min_conf = 80%:
For each set, L, generate all non-empty subsets. For each non-empty subset, s:
support_count is simply how often it appears in the list.
support is support_count over total # of transactions.
confidence = support_count(L) / support_count(s)
BTW, this is P(Y and K)/P(K). It's conditional probability...
More precisely, it's also P(Y U K)/P(K)
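A minimal Apriori sketch in Python for these five transactions (no pruning step, so each C_k is the full self-join of the previous L; names are illustrative):

    T = [set("HOARDSE"), set("COARSE"), set("ECARDS"), set("ROADS"), set("HOUSE")]
    min_sup = 3
    sup = lambda c: sum(1 for t in T if c <= t)   # support count of itemset c

    C = [frozenset([i]) for i in sorted(set().union(*T))]
    k = 1
    while C:
        print(f"C_{k}:", sorted(map(sorted, C)))
        L = [c for c in C if sup(c) >= min_sup]
        print(f"L_{k}:", sorted(map(sorted, L)))
        C = list({a | b for a in L for b in L if len(a | b) == k + 1})
        k += 1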

Problem 6.14 - revised

Use the data below:

                hot dogs    ~hot dogs
hamburgers      1500        2000
~hamburgers     1000        500

Compute the Chi-Square statistic.

1. Add a totals row and column to your contingency table.
2. Create a new contingency table with the expected frequencies.
3. What is the chi-square critical value?
4. Compute the chi-square statistic.
5. What is your conclusion regarding independence?
Compare several pattern evaluation measures. Refer to Table 6.9 on page 269.
1. Using the data from this problem, complete one row of a table similar to the one on p. 269.
2. Interpret your results.
It will help if you complete the following statistics first:
P(Hamburgers):
P(Hot Dogs):
P(Hamburgers|HotDogs):
P(HotDogs|Hamburgers):
When your book shows formulae for P(A|B) and so forth:
Event A will be "Hamburgers" and Event B will be "Hot Dogs".

Once your contingency table is complete, and you have also computed the above conditional probabilities:
It is probably easier to use the formulae that express the measurement in terms of conditional probabilities.
These are bolded and blue below:
all_conf = sup(Hamburgers U Hot Dogs) / max(sup(Hamburgers), sup(Hot Dogs))
all_conf = min[ P(A|B), P(B|A) ]
lift = P(Hamburgers U Hot Dogs) / ( P(Hot Dogs) P(Hamburgers) )
lift = P(B|A) / P(B)
max_confidence = max( sup(ab)/sup(a), sup(ab)/sup(b) )
max_confidence = max[ P(A|B), P(B|A) ]
Kulczynski = sup(ab)/2 * ( 1/sup(a) + 1/sup(b) )
Kulczynski = 1/2 [ P(A|B) + P(B|A) ]
cosine = sup(ab) / sqrt( sup(a) * sup(b) )
cosine = sqrt[ P(A|B) * P(B|A) ]
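A minimal Python sketch of the chi-square statistic and the five measures (the cell counts follow the contingency table as reconstructed above; if you read the original layout differently, swap them accordingly):

    from math import sqrt

    n11, n10 = 1500, 2000   # hamburgers & hot dogs, hamburgers & ~hot dogs
    n01, n00 = 1000, 500    # ~hamburgers & hot dogs, neither
    N = n11 + n10 + n01 + n00

    observed = [[n11, n10], [n01, n00]]
    row = [n11 + n10, n01 + n00]
    col = [n11 + n01, n10 + n00]
    chi2 = sum((observed[r][c] - row[r] * col[c] / N) ** 2 / (row[r] * col[c] / N)
               for r in range(2) for c in range(2))
    print("chi-square:", chi2)

    p_ab, p_a, p_b = n11 / N, row[0] / N, col[0] / N   # A = hamburgers, B = hot dogs
    print("all_conf:", p_ab / max(p_a, p_b))
    print("lift:    ", p_ab / (p_a * p_b))
    print("max_conf:", max(p_ab / p_a, p_ab / p_b))
    print("kulc:    ", 0.5 * (p_ab / p_a + p_ab / p_b))
    print("cosine:  ", p_ab / sqrt(p_a * p_b))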

Problem 7.6 Homework

Consider the following cases:

Case #1:  v = 5          S = {1,2,3,4,5}      S' = {1,2,3,4,5,6}
Case #2:  v = 6          S = {1,2,3,4}        S' = {1,2,3,4,5,6}
Case #3:  v = 6          S = {8,9,10}         S' = {2,3,4,5,6,7,8,9,10,11}
Case #4:  v = 4          S = {1,2,3}          S' = {1,2,3,4,5}
Case #5:  V = {1,2,3}    S = {1,2,3,4}        S' = {1,2,3,4,5}
Case #6:  V = {1,2,3}    S = {1,2}            S' = {1,2,3,4,5}

Using the above cases as examples, prove by counter-example, or demonstrate with an example,
whether the following rule constraints are antimonotonic or monotonic.

a.  V ⊆ S

b.  S ⊆ V

c.  min(S) <= v

d.  max(S) <= v

e.  max(S) >= v

Problem 8.7, revised

Problem 8.7 (revised) HW
Part 1:
Using the data in Table 8.1 on page 338 of your text, construct a Decision Tree in RapidMiner.
Create your decision tree three ways:
*using Information Gain
*using Gain Ratio
*using Gini
Compare your results.

Part 2:
Consider the data for problem 8.7 on page 387.
Calculate (outside of RapidMiner) the Gain and the Gain Ratio for the attribute department.

NOTE: The target (label) attribute is status.

NOTE: These data have been aggregated already. For instance, there are 30 tuples represented by the first row.
These counts must be considered in your calculations.
For example, if you review Example 8.1 of your text, they list 9 tuples of class "yes", 5 of class "no".
In this problem, there are 30 tuples of class "senior" summarized in the first row alone!!
The table summarizes 165 tuples, 113 Juniors and 52 Seniors!!
Your calculations should be for the first data split.
Show your calculations.
Discuss your results and comparison.
Refer to pages 338-341 for your calculations and to support your conclusions.

Step 1: Start with equation 8.1, p. 337, to calculate Info(D):
        (you may find it helpful to write some basic variables here)

Step 2: Calculate Info_Dept (Equation 8.2, page 337)

Step 3: Calculate Gain_Dept (Equation 8.3, page 337)

Step 4: Calculate SplitInfo_Dept (Equation 8.5, p. 340)

Step 5: Calculate Gain Ratio (Equation 8.6, p. 341)
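A minimal Python sketch of Steps 1-5 for any aggregated counts (checked against Example 8.1, where Info(D) for 9 "yes" / 5 "no" tuples is 0.940 bits; function names are illustrative):

    from math import log2

    def info(counts):                    # eq. 8.1
        n = sum(counts)
        return -sum(c / n * log2(c / n) for c in counts if c)

    def info_attr(partitions):           # eq. 8.2: one count-list per attribute value
        n = sum(sum(p) for p in partitions)
        return sum(sum(p) / n * info(p) for p in partitions)

    def split_info(partitions):          # eq. 8.5
        return info([sum(p) for p in partitions])

    # gain (eq. 8.3) = info(class_counts) - info_attr(partitions)
    # gain_ratio (eq. 8.6) = gain / split_info(partitions)
    print(info([9, 5]))                  # 0.940...  (Example 8.1)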

Problem 8.12
Complete the following table, and then plot the ROC curve:
TPR = TP/P
FPR = FP/N
P = 5 and N = 5 (we know, as we are looking at training data)

Tuple #   Class   Prob.   TP   FP   TN   FN   TPR   FPR
1         P       0.91
2         N       0.83
3         P       0.72
4         N       0.66
5         N       0.60
6         N       0.55
7         P       0.53
8         P       0.52
9         N       0.45
10        P       0.37
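A minimal Python sketch that fills the table row by row (each prefix of the probability-ranked tuples is treated as "predicted positive"):

    P, N = 5, 5
    classes = ["P", "N", "P", "N", "N", "N", "P", "P", "N", "P"]

    tp = fp = 0
    print("k TP FP TN FN TPR FPR")
    for k, cls in enumerate(classes, start=1):
        tp += cls == "P"
        fp += cls == "N"
        print(k, tp, fp, N - fp, P - tp, tp / P, fp / N)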

Homework Problem 9.1:

Continuing Example 9.1 of your text (pp. 404-405), assume that the second training tuple
is X = (1, 0, 1). Using the new weights and biases as your "Old" weights,
show the backpropagation calculations that will be triggered by this training tuple.
Assume that the known class label of this tuple is "1".
Assume that the learning rate remains .9.

You may use any programming tool you wish (or none at all). But your results
should be presented in a format similar to the tables below,
as demonstrated in class.

item      Old_w
x1         1
x2         1
x3         0
w14        0.1921
w15       -0.3059
w24        0.4
w25        0.1
w34       -0.5079
w35        0.1941
w46       -0.2608
w56       -0.138
theta4    -0.4079
theta5     0.1941
theta6     0.2181

Table 9.2 Net Input and Output Calculations


Use equation 9.4 on p. 402 to compute Net Inputs
Use equation 9.5 on p. 402 to compute Output
Unit    Net Input Formula    Net Input    Output
4
5
6

Table 9.3 Calculation of error at each node

Use equation 9.6 on page 403 for output layer error
Use equation 9.7 on page 403 for hidden layer error
The example in the book states that the known class label is "1".

Unit    Error Formula    Error
6
5
4

Table 9.4 Calculations for Weight and Bias Updating

Use equations 9.8 and 9.9 on page 403
L = 0.9  (This is the constant learning rate they chose for this specific model.)

item      Old_w      New weight formula    New weight
x1         1
x2         1
x3         0
w14        0.1921
w15       -0.3059
w24        0.4
w25        0.1
w34       -0.5079
w35        0.1941
w46       -0.2608
w56       -0.138
theta4    -0.4079
theta5     0.1941
theta6     0.2181
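A minimal Python sketch of one backpropagation step for this 3-2-1 network, using the "Old" weights and biases tabulated above (note the x values follow the table, i.e. x1=1, x2=1, x3=0; adjust them if your training tuple differs):

    from math import exp

    sig = lambda v: 1 / (1 + exp(-v))              # eq. 9.5
    x1, x2, x3, target, rate = 1, 1, 0, 1, 0.9
    w14, w15, w24, w25 = 0.1921, -0.3059, 0.4, 0.1
    w34, w35, w46, w56 = -0.5079, 0.1941, -0.2608, -0.138
    t4, t5, t6 = -0.4079, 0.1941, 0.2181           # the biases (theta)

    # forward pass (eqs. 9.4, 9.5)
    o4 = sig(w14 * x1 + w24 * x2 + w34 * x3 + t4)
    o5 = sig(w15 * x1 + w25 * x2 + w35 * x3 + t5)
    o6 = sig(w46 * o4 + w56 * o5 + t6)

    # errors (eqs. 9.6, 9.7)
    e6 = o6 * (1 - o6) * (target - o6)
    e5 = o5 * (1 - o5) * e6 * w56
    e4 = o4 * (1 - o4) * e6 * w46

    # weight/bias updates (eqs. 9.8, 9.9), shown here for w46 and theta6;
    # the remaining weights update the same way with their input values.
    w46 += rate * e6 * o4
    t6 += rate * e6
    print(o4, o5, o6, e6, w46, t6)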

Problem 9.4, as in the textbook.

Problem 9.6 as in the textbook, but add this question:
What kinds of applications might be appropriate for each of these classification methods?

HW Problems, Chapter 10
Homework Problems 10.1, 10.2, 10.6, 10.12, 10.16

as in the textbook, no changes
