
Simple learning algorithms

One R learning;
Bayes model;
Decision tree;
Covering algorithm;
Mining for association rules;
Linear models for numeric prediction;
Instance-based learning.
Reading materials: Chapter 4 of the textbook
by Witten et al.; Sections 6.1, 6.2, 7.1-7.4, 7.8
of the textbook by Han.

Inferring rudimentary rules

One R (1R) algorithm: learns a 1-level decision
tree, i.e. generates a set of rules that all test
one particular attribute.
Basic version for nominal attributes:
one branch for each of the attribute's
values; each branch assigns the most frequent
class.
Error rate: proportion of instances that
don't belong to the majority class of their
corresponding branch.
Choose the attribute with the lowest error rate.
Pseudo-code for 1R:
For each attribute,
  For each value of the attribute:
    count how often each class appears
    find the most frequent class
    make the rule assign that class to this attribute value
  Calculate the error rate of the rules
Choose the rules with the smallest error rate
Note: "missing" is treated as a separate attribute value.
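A minimal Python sketch of 1R for nominal attributes (the dictionary-based data layout and attribute names are illustrative assumptions, not part of the original slides):

from collections import Counter, defaultdict

def one_r(instances, attributes, class_attr="play"):
    """1R: for each attribute build one rule per value (majority class);
    keep the attribute whose rule set makes the fewest errors."""
    best = None
    for attr in attributes:
        # count classes per attribute value ("missing" is just another value)
        counts = defaultdict(Counter)
        for inst in instances:
            counts[inst.get(attr, "missing")][inst[class_attr]] += 1
        rules = {v: c.most_common(1)[0][0] for v, c in counts.items()}
        errors = sum(sum(c.values()) - max(c.values()) for c in counts.values())
        if best is None or errors < best[2]:
            best = (attr, rules, errors)
    return best  # (attribute, {value: predicted class}, error count)

# Tiny illustrative call (the full weather data appears in the table below)
data = [{"outlook": "sunny", "windy": "false", "play": "no"},
        {"outlook": "overcast", "windy": "false", "play": "yes"},
        {"outlook": "rainy", "windy": "true", "play": "no"}]
print(one_r(data, ["outlook", "windy"]))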

The weather problem


Outlook     Temper.   Humidity   Windy   Play
Sunny       Hot       High       False   No
Sunny       Hot       High       True    No
Overcast    Hot       High       False   Yes
Rainy       Mild      High       False   Yes
Rainy       Cool      Normal     False   Yes
Rainy       Cool      Normal     True    No
Overcast    Cool      Normal     True    Yes
Sunny       Mild      High       False   No
Sunny       Cool      Normal     False   Yes
Rainy       Mild      Normal     False   Yes
Sunny       Mild      Normal     True    Yes
Overcast    Mild      High       True    Yes
Overcast    Hot       Normal     False   Yes
Rainy       Mild      High       True    No

In total, there are 6 instances whose temperature is mild: four of them have final decision
Yes and two have No. The rule is
If Temper. = mild then Play = yes
Error rate: 2/6.

1R algorithm

Attribute   Rules               Errors   Total err.
Outlook     sunny -> no         2/5      4/14
            overcast -> yes     0/4
            rainy -> yes        2/5
Temper.     hot -> no           2/4      5/14
            mild -> yes         2/6
            cool -> yes         1/4
Humidity    high -> no          3/7      4/14
            normal -> yes       1/7
Windy       false -> yes        2/8      5/14
            true -> no          3/6

Dealing with numeric attributes

Discretization: the range of the attribute is
divided into a set of intervals.
Instances are sorted according to the attribute's
values.
Breakpoints are placed where the class
changes (minimizing the total number of errors).

Discretization
Temperature:  64 | 65 | 68 69 70 | 71 72 | 72 75 75 | 80 | 81 83 | 85
Play:         Yes| No | Yes Yes Yes | No No | Yes Yes Yes | No | Yes Yes | No

Overfitting problem: the procedure is very
sensitive to noise.
A single instance with an incorrect class
label can produce a separate interval.
Also, a time-stamp attribute (which has a different
value for every instance) would have zero errors.
Simple solution: enforce a minimum number
of instances of the majority class per interval.
64  65  68  69  70  71  72  72  75  75  80  81  83  85
Yes No  Yes Yes Yes No  No  Yes Yes Yes No  Yes Yes No

Merging two adjacent partitions with a common majority class, we get

64 65 68 69 70 71 72 72 75 75 | 80 81 83 85
Yes No Yes Yes Yes No No Yes Yes Yes | No Yes Yes No

How about the following partition?

64 65 68 69 70 71 72 72 75 75 80 81 83 | 85
Yes No Yes Yes Yes No No Yes Yes Yes No Yes Yes | No
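A minimal Python sketch of this partition-and-merge procedure, assuming the temperature/Play sequence above and a minimum of three majority-class instances per interval (the tie-breaking order is an implementation choice, not specified in the slides):

from collections import Counter

def partition(values, classes, min_majority=3):
    """Sweep the sorted (value, class) pairs; close the current interval once its
    majority class has at least `min_majority` members, the next class differs
    from that majority, and the next value is strictly larger."""
    intervals, current = [], []
    for v, c in zip(values, classes):
        if current:
            maj, maj_n = Counter(cls for _, cls in current).most_common(1)[0]
            if maj_n >= min_majority and c != maj and v != current[-1][0]:
                intervals.append(current)
                current = []
        current.append((v, c))
    intervals.append(current)
    # merge adjacent intervals whose majority class is the same
    merged = [intervals[0]]
    for part in intervals[1:]:
        maj_prev = Counter(c for _, c in merged[-1]).most_common(1)[0][0]
        maj_here = Counter(c for _, c in part).most_common(1)[0][0]
        if maj_here == maj_prev:
            merged[-1] = merged[-1] + part
        else:
            merged.append(part)
    return merged

temps = [64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85]
play  = ["Y","N","Y","Y","Y","N","N","Y","Y","Y","N","Y","Y","N"]
for part in partition(temps, play):
    print(part)   # two intervals: 64..75 (majority yes) and 80..85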

Statistical Modelling

Basic assumptions: attributes are equally
important and statistically independent.
These idealized assumptions are never met in practice, but the scheme works well!
The weather data with probabilities
                         Counts (yes / no)    Fractions (yes / no)
Outlook     Sunny            2 / 3                2/9 / 3/5
            Overcast         4 / 0                4/9 / 0/5
            Rainy            3 / 2                3/9 / 2/5
Temper.     Hot              2 / 2                2/9 / 2/5
            Mild             4 / 2                4/9 / 2/5
            Cool             3 / 1                3/9 / 1/5
Humidity    High             3 / 4                3/9 / 4/5
            Normal           6 / 1                6/9 / 1/5
Windy       False            6 / 2                6/9 / 2/5
            True             3 / 3                3/9 / 3/5
Play                         9 / 5                9/14 / 5/14

New instance: (sunny, cool, high, true, ?)

Likelihood of yes = 2/9 × 3/9 × 3/9 × 3/9 × 9/14 = 0.0053
Likelihood of no  = 3/5 × 1/5 × 4/5 × 3/5 × 5/14 = 0.0206

Normalization into probabilities:

P(yes) = 0.0053 / (0.0053 + 0.0206) = 20.5%
P(no)  = 0.0206 / (0.0053 + 0.0206) = 79.5%

Naive Bayes Model

Bayes rule: the probability of event H given evidence E is

Pr[H|E] = Pr[E|H] Pr[H] / Pr[E].

A priori probability of H: Pr[H], the probability of
the event before the evidence has been seen.
A posteriori probability of H: Pr[H|E], the probability of the event after the evidence has been seen.

Naive Bayes for classification:
What is the probability of the class for a given
instance?
Evidence E = the instance.
Event H = class value for the instance.
Naive Bayes assumption: the evidence can be
split into independent parts (i.e. the attributes
of the instance!), so

Pr[H|E] = Pr[E1|H] Pr[E2|H] ... Pr[En|H] Pr[H] / Pr[E].

Consider the weather problem with the instance (sunny, cool, high, true, ?):

Pr[yes|E] Pr[E] = Pr[sunny|yes] Pr[cool|yes] Pr[high|yes] Pr[true|yes] Pr[yes]
                = 2/9 × 3/9 × 3/9 × 3/9 × 9/14.

No need to worry about Pr[E]: it disappears
after normalization!
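A minimal Python sketch of this computation on the nominal weather data (the conditional fractions are hard-coded from the table above; the dictionary layout is an illustrative assumption):

# Conditional fractions Pr[attribute = value | class] and priors Pr[class]
cond = {
    "yes": {"outlook": {"sunny": 2/9, "overcast": 4/9, "rainy": 3/9},
            "temper.": {"hot": 2/9, "mild": 4/9, "cool": 3/9},
            "humidity": {"high": 3/9, "normal": 6/9},
            "windy": {"false": 6/9, "true": 3/9}},
    "no":  {"outlook": {"sunny": 3/5, "overcast": 0/5, "rainy": 2/5},
            "temper.": {"hot": 2/5, "mild": 2/5, "cool": 1/5},
            "humidity": {"high": 4/5, "normal": 1/5},
            "windy": {"false": 2/5, "true": 3/5}},
}
prior = {"yes": 9/14, "no": 5/14}

def naive_bayes(instance):
    """Return normalized class probabilities for a dict of attribute values."""
    likelihood = {}
    for cls in prior:
        p = prior[cls]
        for attr, value in instance.items():
            p *= cond[cls][attr][value]   # independence assumption
        likelihood[cls] = p
    total = sum(likelihood.values())
    return {cls: p / total for cls, p in likelihood.items()}

print(naive_bayes({"outlook": "sunny", "temper.": "cool",
                   "humidity": "high", "windy": "true"}))
# -> roughly {'yes': 0.205, 'no': 0.795}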

Zero probabilities and missing values

Another instance: (overcast, mild, high, false, ?).
Likelihood of yes = 4/9 × 4/9 × 3/9 × 6/9 × 9/14 = 0.0282
Likelihood of no  = 0/5 × 2/5 × 4/5 × 2/5 × 5/14 = 0.

Does it make sense to claim that the likelihood of no is zero? If not, how should we deal with
this issue?
Remedy: add 1 to the count for every attribute value-class combination (Laplace estimator).
In some cases adding a constant μ different
from 1 might be more appropriate. For attribute outlook and class yes:

(2 + μα)/(9 + μ),  (4 + μβ)/(9 + μ),  (3 + μγ)/(9 + μ),

with weights satisfying α + β + γ = 1, α > 0, β > 0, γ > 0.
Extra merit: missing values are simply not counted,
in both training and prediction!
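A small sketch of such smoothed fractions, assuming equal weights (the function name and call are illustrative):

def smoothed(counts, mu=1.0, weights=None):
    """Smooth raw value counts within one class: (count + mu*weight) / (total + mu)."""
    total = sum(counts.values())
    weights = weights or {v: 1 / len(counts) for v in counts}   # equal weights
    return {v: (c + mu * weights[v]) / (total + mu) for v, c in counts.items()}

# Outlook counts for class "yes": sunny 2, overcast 4, rainy 3
print(smoothed({"sunny": 2, "overcast": 4, "rainy": 3}, mu=3.0))
# -> sunny 3/12, overcast 5/12, rainy 4/12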

Dealing with numeric attributes

1. Usual assumption: numeric attributes have a
normal (Gaussian) probability distribution.
2. The probability density function of the
normal distribution is defined by two parameters:
i.   the sample mean  μ = (1/n) Σ_{i=1}^{n} x_i;
ii.  the standard deviation  σ = sqrt( Σ_{i=1}^{n} (x_i − μ)² / (n − 1) );
iii. the density function  f(x) = 1/(sqrt(2π) σ) · exp( −(x − μ)² / (2σ²) ).

For the weather problem, if the attribute temperature has a mean of 73 and a standard
deviation of 6.2 for class yes, then the density is

f(temperature = 66 | yes) = 1/(sqrt(2π) · 6.2) · exp( −(73 − 66)² / (2 · 6.2²) ) = 0.0340.
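A quick check of this density value in Python (mean and standard deviation taken from the slide):

import math

def gaussian_density(x, mean, std):
    """Normal probability density used by Naive Bayes for numeric attributes."""
    return math.exp(-(x - mean) ** 2 / (2 * std ** 2)) / (math.sqrt(2 * math.pi) * std)

print(gaussian_density(66, mean=73, std=6.2))   # ~0.0340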

Dealing with numeric attributes


Weather data with numeric attributes (per class yes / no):

Outlook (counts, fractions):   sunny 2, 3 (2/9, 3/5);  overcast 4, 0 (4/9, 0/5);  rainy 3, 2 (3/9, 2/5)
Temperature (values):          yes: 83, 70, 68, ...    no: 85, 80, 65, ...
                               mean 73 / 74.6,  std. dev. 6.2 / 7.9
Humidity (values):             yes: 86, 96, 80, ...    no: 85, 90, 70, ...
                               mean 79.1 / 86.2,  std. dev. 10.2 / 9.7
Windy (counts, fractions):     false 6, 2 (6/9, 2/5);  true 3, 3 (3/9, 3/5)
Play:                          yes 9, no 5 (9/14, 5/14)

For a new day (sunny, 66, 90, true, ?), using

f(temperature = 66 | yes) = 0.0340,  f(humidity = 90 | yes) = 0.0221,
f(temperature = 66 | no)  = 0.0291,  f(humidity = 90 | no)  = 0.0380,

we have

Likelihood of yes = 2/9 × 0.0340 × 0.0221 × 3/9 × 9/14 = 0.000036,
Likelihood of no  = 3/5 × 0.0291 × 0.0380 × 3/5 × 5/14 = 0.000136,

which gives

Pr(yes) = 20.9%,  Pr(no) = 79.1%.

Missing values are not counted!

Probability densities

Relationship between probability and density:

Pr[c − ε/2 < x ≤ c + ε/2] ≈ ε · f(c).

This doesn't change the calculation of a posteriori probabilities because ε cancels out.
Exact relationship:

Pr[a ≤ x ≤ b] = ∫_a^b f(t) dt.

Merits and flaws of Naive Bayes

Naive Bayes works surprisingly well (despite
the unrealistic independence assumption).
Reason: classification doesn't require accurate probability estimates as long as the maximum probability is assigned to the correct class.
Redundant attributes might cause problems.
Note: for numeric attributes that are not
normally distributed, other (kernel) density estimators should be considered!

Decision trees

Normal procedure: top-down, in a recursive
divide-and-conquer fashion:
1. An attribute is selected for the root node and
a branch is created for each possible attribute value.
2. The instances are split into subsets (one
for each branch extending from the node).
3. The procedure is repeated recursively for
each branch, using only the instances that
reach the branch.
The process stops if all instances have the
same class.
Issue: how to select the splitting attribute?
Criterion: the best attribute is the one leading to the smallest tree.
Trick: choose the attribute that produces
the purest nodes.
How to measure purity? Information gain!

Computing information

Information gain increases with the average
purity of the subsets that an attribute produces.
Strategy: choose the attribute that results
in the greatest information gain.
Information is measured in bits:
1. Given a probability distribution, the info
required to predict an event is the distribution's entropy.
2. Entropy gives the required information
in bits (it may involve fractions of a bit!).
Formula for computing the entropy:

entropy(p1, p2, ..., pn) = − Σ_{i=1}^{n} pi log pi.

Example: attribute Outlook, value Sunny:

info([2, 3]) = entropy(2/5, 3/5)
             = −(2/5) log(2/5) − (3/5) log(3/5) = 0.971 bits;

More Examples

Outlook = Overcast:
info([4, 0]) = entropy(1, 0) = −1 · log 1 = 0 bits;
Outlook = Rainy:
info([3, 2]) = entropy(3/5, 2/5)
             = −(3/5) log(3/5) − (2/5) log(2/5) = 0.971 bits;
Expected information for the attribute:

info([2, 3], [4, 0], [3, 2]) = 0.971 × 10/14 + 0 = 0.693 bits.

Information gain: the gap between the information before and after the split.
Information gain for the weather problem:

gain(outlook)  = info([9, 5]) − info([2, 3], [4, 0], [3, 2])
               = 0.940 − 0.693 = 0.247 bits;
gain(temper.)  = 0.029 bits;
gain(humidity) = 0.152 bits;
gain(windy)    = 0.048 bits.
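A short Python sketch that reproduces these numbers (class counts hard-coded from the weather table; the gain ratio used later is included for convenience):

import math

def entropy(counts):
    """Entropy in bits of a class-count list, e.g. [2, 3] for 2 yes / 3 no."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

def info(subsets):
    """Expected information after splitting into the given class-count subsets."""
    total = sum(sum(s) for s in subsets)
    return sum(sum(s) / total * entropy(s) for s in subsets)

before = entropy([9, 5])                  # 0.940 bits
split  = [[2, 3], [4, 0], [3, 2]]         # outlook = sunny / overcast / rainy
gain   = before - info(split)             # 0.247 bits
split_info = entropy([5, 4, 5])           # intrinsic info of the split, 1.577 bits
print(gain, gain / split_info)            # gain and gain ratio (~0.156)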

Wishlist for a purity measure

Desirable properties of a purity measure:
A pure node should be measured as zero.
When impurity is maximal (all classes equally
likely), the measure should be maximal.
The measure should enjoy the multistage property (decisions can be made in several stages):

entr(p, q, r) = entr(p, q + r) + (q + r) · entr( q/(q+r), r/(q+r) ).

For example,

measure([2, 3, 4]) = measure([2, 7]) + 7/9 · measure([3, 4]).

Entropy is the only function that satisfies all
the above properties!
Simplification of computation:

entr(x1, ..., xn) = − ( Σ_{i=1}^{n} xi log xi ) / ( Σ_{i=1}^{n} xi ) + log( Σ_{i=1}^{n} xi ).

Instead of maximizing info gain we can minimize information!

Avoiding overfitting

Trouble: for attributes with a large number of
values (extreme case: an ID code), the
corresponding subsets are more likely to be
pure. Information gain is therefore biased towards choosing attributes with a large
number of values. This leads to so-called
overfitting (selection of an attribute that is useless
for prediction).

Remedy: use the gain ratio, which takes
the number and size of branches into account when
choosing an attribute. Let a be an attribute; then

gain ratio(a) = gain(a) / split info(a),

where split info(a) is the information of the split itself.
For example, an attribute splitting n instances into n singleton branches has
split info([1, 1, ..., 1]) = log n, where
n is the dimension of the all-ones vector.
Note the distinction between the two info
functions: info([1, 1]) = log 2 (entropy of the branch sizes), while
info([0, 1], [1, 0]) = 0 (expected information of the class distributions in the branches).

Computing gain ratio

Suppose there is a different ID code associated with every instance of the weather problem: the information gain of the ID code is 0.940 bits, but
its split info is log 14 = 3.807 bits.

Information for the weather data:

Attribute   Info    Info gain   Split info              Gain ratio
outlook     0.693   0.247       info([5,4,5]) = 1.577   0.156
temper.     0.911   0.029       info([4,6,4]) = 1.557   0.019
humidity    0.788   0.152       info([7,7])   = 1       0.152
windy       0.892   0.048       info([8,6])   = 0.985   0.049

The ID code still has a very high gain ratio (0.940/3.807 = 0.247).


Ad hoc test is needed to prevent the ID case.
The gain ratio might bias to attributes with
low intrinsic information. Restriction only to
the attribute whose information gain is above
the average is necessary.
Comments: Algorithm for induction decision
trees ID3 was developed by Ross Quinlan.
Gain ratio is one modification of this basic algorithm. Its advanced version is coined C4.5,
which can deal with numeric attributes, missing values, and noisy data.

Covering algorithm

A decision tree can be converted into a rule
set by a straightforward conversion, but
this usually leads to a very complex rule set.
Efficient conversions are useful but not easy
to find.
The covering approach generates a rule
set directly (excluding instances of other classes).
Key idea: find the rule set that covers all the
instances of one class.
Consider the problem of classifying a set
of points on the plane belonging to two classes (circles and boxes). We can start with
If ? then the point belongs to the circle class.
It covers all the instances of the circle class, but
it is too general.
Adding a precondition (x <= 1), we get:
If x <= 1 then class is circle.
The rule covers some of the circle instances;
we need more rules to cover the remaining circle
instances, and further rules for the box instances.

Figure: Covering. A two-dimensional example with boundaries x = 1 and y = 1.5; the rule "If ? then the point belongs to the circle class" is refined to "If x <= 1 then the point belongs to the circle class".

Procedure for covering

Simple approach:
generate a rule by adding tests that maximize the rule's accuracy.
Each new test reduces the rule's coverage: start
from the whole space of examples, then the rule so far, then the
rule after adding a new term.

The selection of a test depends on the following quantities:
t:     total number of instances covered by the rule
p:     positive examples of the class covered by the rule
t − p: number of errors made by the rule
Select the test that maximizes the ratio p/t.
We are finished when p/t = 1 or the set
of instances can't be split any further.

Covering for the contact lenses data

Rule we seek: If ? then recommendation = hard
Possible tests:
Age = Young                              2/8
Age = Pre-presbyopic                     1/8
Age = Presbyopic                         1/8
Spectacle prescription = Myope           3/12
Spectacle prescription = Hypermetrope    1/12
Astigmatism = no                         0/12
Astigmatism = yes                        4/12
Tear production rate = Reduced           0/12
Tear production rate = Normal            4/12

We choose the test Astigmatism = yes and obtain the subset:

Age             Spect. prescr.   Astig.   Tear prod.   Rec. lenses
Pre-presbyopic  Hypermetrope     Yes      Reduced      None
Pre-presbyopic  Hypermetrope     Yes      Normal       None
Presbyopic      Myope            Yes      Reduced      None
Presbyopic      Myope            Yes      Normal       Hard
Presbyopic      Hypermetrope     Yes      Reduced      None
Presbyopic      Hypermetrope     Yes      Normal       None
Pre-presbyopic  Myope            Yes      Normal       Hard
Pre-presbyopic  Myope            Yes      Reduced      None
Young           Hypermetrope     Yes      Normal       Hard
Young           Hypermetrope     Yes      Reduced      None
Young           Myope            Yes      Normal       Hard
Young           Myope            Yes      Reduced      None

Further refinement

Rule we seek: If astigmatism = yes and ? then recommendation = hard
Possible tests:
Age = Young                              2/4
Age = Pre-presbyopic                     1/4
Age = Presbyopic                         1/4
Spectacle prescription = Myope           3/6
Spectacle prescription = Hypermetrope    1/6
Tear production rate = Reduced           0/6
Tear production rate = Normal            4/6

Adding the test Tear production rate = Normal, we obtain the resulting table:

Age             Spect. prescr.   Astig.   Tear prod.   Rec. lenses
Pre-presbyopic  Hypermetrope     Yes      Normal       None
Presbyopic      Myope            Yes      Normal       Hard
Presbyopic      Hypermetrope     Yes      Normal       None
Pre-presbyopic  Myope            Yes      Normal       Hard
Young           Hypermetrope     Yes      Normal       Hard
Young           Myope            Yes      Normal       Hard

Further refinement

Rule we seek: If astigmatism = yes and tear production rate = normal and ? then recommendation = hard
Possible tests:
Age = Young                              2/2
Age = Pre-presbyopic                     1/2
Age = Presbyopic                         1/2
Spectacle prescription = Myope           3/3
Spectacle prescription = Hypermetrope    1/1
Between the first and the fourth test (both have accuracy p/t = 1), we select the one with
the larger coverage.
Resulting table:

Age             Spect. prescr.   Astig.   Tear prod.   Rec. lenses
Presbyopic      Myope            Yes      Normal       Hard
Pre-presbyopic  Myope            Yes      Normal       Hard
Young           Myope            Yes      Normal       Hard

The final rule: If astigmatism = yes and tear production rate = normal and spectacle prescription =
myope then recommendation = hard.
Second rule for recommending hard lenses: If age =
young and astigmatism = yes and tear production rate
= normal then recommendation = hard.

This rule is built from the instances not covered
by the first rule.
The above two rules cover all the hard-lens cases.
We can follow a similar process for the other two
classes.

PRISM

Pseudo-code for PRISM:

For each class C
  Initialize E to the instance set
  While E contains instances in class C
    1. Create a rule R with an empty left-hand side
       that predicts class C
    2. Until R is perfect (or there are no more attributes to use) do
         For each attribute A not mentioned in R, and each value v,
           consider adding the condition A = v to the left-hand side of R
         Select A and v to maximize the accuracy p/t
         (break ties by choosing the condition with the largest p)
         Add A = v to R
    3. Remove the instances covered by R from E
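A compact Python sketch of this pseudo-code (the dictionary-based data layout is an illustrative assumption):

def prism(instances, attributes, class_attr, target_class):
    """Learn a list of rules (each a list of (attribute, value) conditions)
    that together cover all instances of `target_class`."""
    E = list(instances)
    rules = []
    while any(x[class_attr] == target_class for x in E):
        rule, covered = [], E
        # grow the rule until it is perfect or every attribute has been used
        while any(x[class_attr] != target_class for x in covered) and len(rule) < len(attributes):
            best = None   # ((p/t, p), (attribute, value), covered subset)
            used = {a for a, _ in rule}
            for attr in attributes:
                if attr in used:
                    continue
                for value in {x[attr] for x in covered}:
                    subset = [x for x in covered if x[attr] == value]
                    p = sum(x[class_attr] == target_class for x in subset)
                    t = len(subset)
                    # maximize accuracy p/t; break ties by choosing the larger p
                    if best is None or (p / t, p) > best[0]:
                        best = ((p / t, p), (attr, value), subset)
            rule.append(best[1])
            covered = best[2]
        rules.append(rule)
        E = [x for x in E if x not in covered]   # remove instances covered by the rule
    return rules

On the contact-lens data, run for the class "hard", this sketch should reproduce the two rules derived step by step above.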

Comments: this simple algorithm uses a
separate-and-conquer (covering) strategy to extract all rules. However,
it does not tell us in which order the rules should be interpreted, and it does not address
how to deal with missing values or with the
overfitting problem.

Mining association rules

A simple method for finding association rules:
use the standard separate-and-conquer method,
treating every possible combination of attribute
values as a separate class.
Two problems:
1. computational complexity;
2. the resulting number of rules.
Remedy: focus on rules with high support and
confidence only!

Difficulty: coverage and accuracy are hard to define directly for such rules, so we use the following notions.
Item: one test / attribute-value pair.
Item set: the set of all items occurring in a rule.
Frequency (support): the occurrence frequency of an
item set in the data set.
Goal: only rules that exceed a pre-defined minimum frequency/support are reported.
We can find all item sets with the given
minimum frequency and generate rules from
them!

Item sets

Item sets for the weather data (examples):

One-item sets:    outlook = sunny (5);
                  temperature = cool (4)
Two-item sets:    outlook = sunny, temperature = mild (2);
                  outlook = sunny, humidity = high (3)
Three-item sets:  outlook = sunny, temperature = hot, humidity = high (2);
                  outlook = sunny, humidity = high, windy = false (2)
Four-item sets:   outlook = sunny, temperature = hot, humidity = high, play = no (2);
                  outlook = rainy, temperature = mild, windy = false, play = yes (2)

In total (with minimum support 2): 12 one-item sets, 47 two-item
sets, 39 three-item sets, 6 four-item sets.
Once all item sets with minimum support
have been generated, we can turn them into
rules.
Example: Humidity = Normal, Windy = False, Play = Yes (4). In total seven (2^n − 1, with n = 3) potential rules:

If Humidity = Normal and Windy = False then Play = Yes                 4/4
If Humidity = Normal and Play = Yes then Windy = False                 4/6
If Windy = False and Play = Yes then Humidity = Normal                 4/6
If Humidity = Normal then Windy = False and Play = Yes                 4/7
If Windy = False then Humidity = Normal and Play = Yes                 4/8
If Play = Yes then Humidity = Normal and Windy = False                 4/9
If True then Humidity = Normal and Windy = False and Play = Yes        4/14
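A short sketch of enumerating these candidate rules from one frequent item set (the support-counting function is assumed to be available and is not shown):

from itertools import combinations

def rules_from_itemset(itemset, support_of):
    """Enumerate all 2^n - 1 rules 'antecedent => consequent' of a frequent
    item set, with their confidence. `support_of` maps a frozenset of items
    to its support count (the empty set maps to the number of instances)."""
    items = frozenset(itemset)
    total = support_of(items)
    rules = []
    for k in range(len(items)):                 # antecedent sizes 0 .. n-1
        for antecedent in combinations(sorted(items), k):
            antecedent = frozenset(antecedent)
            s = support_of(antecedent)
            confidence = total / s if s else 0.0
            rules.append((set(antecedent), set(items - antecedent), confidence))
    return rules

# e.g. rules_from_itemset({"humidity=normal", "windy=false", "play=yes"}, count_support)
# where count_support is a hypothetical function that scans the weather data.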

Association rules

Rules for the weather data with support > 1:
in total, 3 rules with support four, 5 with
support three, and 50 with support two.
Example rules from the same item set
Temperature = Cool, Humidity = Normal, Windy = False, Play = Yes (2):

Temperature = Cool, Windy = False  =>  Humidity = Normal, Play = Yes
Temperature = Cool, Windy = False, Humidity = Normal  =>  Play = Yes
Temperature = Cool, Windy = False, Play = Yes  =>  Humidity = Normal

owing to the following frequent item sets:

Temperature = Cool, Windy = False
Temperature = Cool, Humidity = Normal, Windy = False
Temperature = Cool, Windy = False, Play = Yes

All these item sets have minimum support 2.

Frequent item sets

A frequent item set is an item set whose support
exceeds the minimal requirement.
Apriori property:
if (A B) is a frequent item set, then (A) and
(B) have to be frequent item sets as well;
in general, if X is a frequent k-item set, then
all (k-1)-item subsets of X are also frequent.
Based on the Apriori property, we can compute
candidate k-item sets by merging (k-1)-item sets.
Finding one-item sets is easy;
use one-item sets to get two-item sets,
two-item sets to get three-item sets, and so on.
An example: given the five three-item sets

(A B C), (A B D), (A C D), (A C E), (B C D)

(the sets are lexicographically ordered!)

Candidate four-item set (A B C D):
OK, because (B C D) also has at least minimum coverage.
Candidate (A C D E): not OK, because
(C D E) does not satisfy the minimum coverage requirement.
A final check is made by counting instances in the dataset;
the (k-1)-item sets are stored in a hash table.

Apriori Algorithm

We are looking for all high-confidence rules:
the support of the antecedent is obtained from the hash
table;
(c+1)-consequent rules are built from c-consequent ones.
Observation: a (c+1)-consequent rule can
only hold if all corresponding c-consequent
rules also hold.
This is just like the procedure for large item sets.
Key steps from k-item sets to (k+1)-item sets:
1. Create a table of potential candidate
(k+1)-item sets from the hash table of k-item sets by joining pairs of k-item
sets.
2. Use the Apriori property of frequent
item sets and the order in the hash table to
improve efficiency: remove non-promising candidates from the
table by consulting the hash table of k-item
sets.
3. Scan the whole data set to remove the
candidates that do not satisfy the minimum support requirement and obtain the frequent
(k+1)-item sets.
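A minimal sketch of the candidate-generation and pruning step (item sets are assumed to be represented as sorted tuples; the example input is the set L2 from the transaction data discussed next):

def apriori_gen(frequent_k):
    """Generate candidate (k+1)-item sets from frequent k-item sets:
    join pairs sharing their first k-1 items, then prune every candidate
    that has an infrequent k-subset (Apriori property)."""
    frequent = set(frequent_k)                    # O(1) subset lookups
    items = sorted(frequent_k)
    candidates = []
    for i, a in enumerate(items):
        for b in items[i + 1:]:
            if a[:-1] == b[:-1]:                  # share the first k-1 items
                cand = a + (b[-1],)
                # prune: every k-subset of cand must be frequent
                if all(cand[:j] + cand[j + 1:] in frequent for j in range(len(cand))):
                    candidates.append(cand)
    return candidates

L2 = [("I1", "I2"), ("I1", "I3"), ("I1", "I5"),
      ("I2", "I3"), ("I2", "I4"), ("I2", "I5")]
print(apriori_gen(L2))   # -> [('I1', 'I2', 'I3'), ('I1', 'I2', 'I5')]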

Mining for transaction data

TID   Items
T1    I1, I2, I5
T2    I2, I4
T3    I2, I3
T4    I1, I2, I4
T5    I1, I3
T6    I2, I3
T7    I1, I3
T8    I1, I2, I3, I5
T9    I1, I2, I3

C: candidate set;  L: frequent item set.
Candidate 1-item sets and their supports (counted from the transactions):
I1: 6, I2: 7, I3: 6, I4: 2, I5: 2.
With minimum support 2, every item is frequent, so C1 = L1.

Mining Transaction Data

Candidate 2-item sets C2 and frequent 2-item sets L2 (minimum support 2):

C2: itemset  sup.        L2: itemset  sup.
{I1,I2}      4           {I1,I2}      4
{I1,I3}      4           {I1,I3}      4
{I1,I4}      1           {I1,I5}      2
{I1,I5}      2           {I2,I3}      4
{I2,I3}      4           {I2,I4}      2
{I2,I4}      2           {I2,I5}      2
{I2,I5}      2
{I3,I4}      0
{I3,I5}      1
{I4,I5}      0

Candidate 3-item sets C3 and frequent 3-item sets L3 (here C3 = L3):

Itemset      Sup.
{I1,I2,I3}   2
{I1,I2,I5}   2

Mining transaction data

Procedure:

1. Generate the candidate 1-item sets and
scan all the transactions to count the occurrences of each item; keep the 1-item
sets satisfying the minimum support.

2. Generate the candidate 2-item sets, and
scan the data set to find the frequent 2-item sets L2.

3. Generate the candidate 3-item sets and scan
the data set to get L3.
To generate the candidate set, let

C3 = L2 ⋈ L2 = {(I1, I2, I3), (I1, I2, I5), (I1, I3, I5),
                (I2, I3, I4), (I2, I3, I5), (I2, I4, I5)}.

By using the Apriori property, we can remove
the last four candidates. Thus we have

C3 = {(I1, I2, I3), (I1, I2, I5)}.

Linear models

Work naturally with numeric attributes.
Standard technique for numeric prediction: linear regression.
The output is a linear combination of the attributes,

y = w0 + w1 a1 + w2 a2 + ... + wk ak.

The weights are calculated from the training data.
Predicted value for the first training instance a(1) (with a0(1) = 1):

y(1) = w0 a0(1) + w1 a1(1) + w2 a2(1) + ... + wk ak(1).

All k + 1 coefficients are chosen so that
the squared error on the training data,

Σ_{i=1}^{n} ( y(i) − Σ_{j=0}^{k} wj aj(i) )²,

is minimized.
The coefficients can be derived by solving a
linear system.
The method can handle many instances.

Solving the Least Squares Problem

Linear regression always leads to an unconstrained
convex quadratic optimization problem of the form

(QP)   min_{x ∈ R^n}  f(x) = (1/2) xᵀQx − qᵀx.

Theorem 1. Suppose that the matrix Q is
positive semidefinite. Then x solves (QP)
if and only if it is a solution of the linear
equation system Qx = q.

Let us consider linear regression for the
point set S = {(1, 0), (1, 2), (2, −1)}; in other
words, try to find a linear model y = ax + b that
approximates these points. Thus we have to solve

min f(a, b) = (a + b)² + (2 − a − b)² + (−1 − 2a − b)².

By Theorem 1, we need only solve the linear system

[ 12  8 ] [ a ]   [ 0 ]
[  8  6 ] [ b ] = [ 2 ].

Solving the above system we get the linear model y = −2x + 3.
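A quick numerical check of this example with numpy's least-squares solver (the signs of the third point and of the resulting slope are reconstructed as above):

import numpy as np

# Points (x_i, y_i) from the example; fit y = a*x + b
X = np.array([[1.0, 1.0],
              [1.0, 1.0],
              [2.0, 1.0]])       # columns: x and the constant term
y = np.array([0.0, 2.0, -1.0])

# The normal equations (X^T X) w = X^T y correspond to the system Qx = q above
coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
print(coeffs)                     # -> approximately [-2.  3.], i.e. y = -2x + 3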

From Regression to Classification

Any regression technique can be used for
classification:
Training: perform a regression for each
class, setting the output to 1 (y(i) = 1) for
training instances that belong to the class, and to 0
(y(i) = 0) for those that don't.
Prediction: predict the class corresponding to the
model with the largest output value.
Let us consider the case of 3 classes (S1, S2
and S3) on a plane, where each point is characterized by the parameter pair (x, y). We
define the label of each class in the following way:
(x, y) ∈ S1: labelled (1, 0, 0)ᵀ;
(x, y) ∈ S2: labelled (0, 1, 0)ᵀ;
(x, y) ∈ S3: labelled (0, 0, 1)ᵀ.
Perform linear regression for all three
classes simultaneously by minimizing
f(a1, b1, c1, a2, b2, c2, a3, b3, c3),
where the function f is defined on the next page.

Multiple-Class classification

Let r(x, y) = (a1 x + b1 y + c1, a2 x + b2 y + c2, a3 x + b3 y + c3)ᵀ. Then

f = Σ_{(x,y) ∈ S1} || (1, 0, 0)ᵀ − r(x, y) ||²
  + Σ_{(x,y) ∈ S2} || (0, 1, 0)ᵀ − r(x, y) ||²
  + Σ_{(x,y) ∈ S3} || (0, 0, 1)ᵀ − r(x, y) ||².

A complex optimization problem is involved;
this is known as multi-response linear regression.
By minimizing f over the parameter space, we find
the three models for the sets S1, S2, S3. For
any new point (x, y), we calculate the value
of each model, namely a1 x + b1 y + c1, a2 x + b2 y + c2 and a3 x + b3 y + c3.
The model with the largest value assigns
the class of the point (x, y).
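A minimal numpy sketch of multi-response linear regression on 2-D points (the sample points below are made up purely for illustration):

import numpy as np

# Illustrative 2-D points for three classes
S1 = np.array([[0.0, 0.0], [0.2, 0.4], [0.5, 0.1]])
S2 = np.array([[3.0, 0.2], [3.5, 0.8], [2.8, 0.5]])
S3 = np.array([[1.5, 3.0], [1.2, 3.5], [2.0, 2.8]])

X = np.vstack([S1, S2, S3])
X = np.hstack([X, np.ones((len(X), 1))])       # columns: x, y, 1 (for the c_i terms)
# One-hot targets (1,0,0), (0,1,0), (0,0,1), one row per training point
Y = np.repeat(np.eye(3), [len(S1), len(S2), len(S3)], axis=0)

# Least squares fits the coefficients (a_i, b_i, c_i) of all three models at once
W, *_ = np.linalg.lstsq(X, Y, rcond=None)      # shape (3, 3): one column per class

def predict(point):
    """Class whose linear model gives the largest output for a new (x, y) point."""
    scores = np.append(point, 1.0) @ W
    return int(np.argmax(scores)) + 1          # 1-based class index

print(predict([3.2, 0.4]))                     # for these made-up samples this should be 2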

Two-class classification

For a two-class classification problem, we
change the labels of the two classes to (1, 0)ᵀ and
(0, 1)ᵀ, respectively.
Another way to do binary classification is to perform a regression for each class separately,

f1(a) = w0(1) + w1(1) a1 + w2(1) a2 + ... + wk(1) ak,
f2(a) = w0(2) + w1(2) a1 + w2(2) a2 + ... + wk(2) ak,

and then use the two models to predict the
class of an instance a:
If f1(a) >= f2(a), then a is in class 1.
If f1(a) < f2(a), then a is in class 2.

(Figure: two regression lines, y = a1 x + b1 and y = a2 x + b2, one fitted for each class.)

Pairwise regression

Another regression model for classification:
perform a regression for each pair of classes, using
only the instances from these two classes.
An output of +1 is assigned to one member of the pair, and −1 to the other.
The class receiving the most votes is predicted.
What to do if there is no agreement?
Maybe more accurate, but expensive.

Logistic regression:
Designed for classification problems.
Tries to estimate class probabilities directly, using the following linear model:

log( Pr(G = i | X = x) / Pr(G = K | X = x) ) = β_i0 + β_iᵀ x,   i = 1, 2, ..., K − 1,

so that

Pr(G = i | X = x) = exp(β_i0 + β_iᵀ x) / ( 1 + Σ_{l=1}^{K−1} exp(β_l0 + β_lᵀ x) ),   i = 1, ..., K − 1,
Pr(G = K | X = x) = 1 / ( 1 + Σ_{l=1}^{K−1} exp(β_l0 + β_lᵀ x) ).

Define θ = {β_10, β_1, ..., β_(K−1)0, β_(K−1)} and
denote Pr(G = i | X = x) = p_i(x, θ). Logistic regression uses the conditional likelihood
of G given X: for N observations {g_1, ..., g_N},
the log-likelihood is

l(θ) = Σ_{i=1}^{N} log p_{g_i}(x_i, θ).

Instance-based learning

The distance function defines what is learned.
The most popular distance function is the Euclidean distance

( (a1(1) − a1(2))² + (a2(1) − a2(2))² + ... + (ak(1) − ak(2))² )^(1/2),

where a(1) and a(2) are two instances with k attributes.

If the attributes are measured on different scales, normalization is necessary.
The distance for nominal attributes should be
defined carefully (e.g. 0 if the values are equal, 1 otherwise).
A missing value is usually assumed to be maximally distant.
1-NN algorithm:
Very accurate on the training data.
Very slow in prediction: the entire
training set is scanned for each prediction.
Usually treats all attributes equally;
however, attribute weighting may be necessary!
Remedy for noisy instances: take a
majority vote over the k nearest neighbors.
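A minimal k-NN sketch following the conventions above (numeric attributes assumed already normalized; nominal distance is 0/1; the sample instances are illustrative):

import math
from collections import Counter

def distance(a, b):
    """Euclidean-style distance: squared difference for numbers, 0/1 mismatch for nominal values."""
    total = 0.0
    for x, y in zip(a, b):
        if isinstance(x, (int, float)) and isinstance(y, (int, float)):
            total += (x - y) ** 2
        else:
            total += 0.0 if x == y else 1.0
    return math.sqrt(total)

def knn_predict(train, query, k=3):
    """Majority vote among the k nearest training instances (attributes, label)."""
    neighbors = sorted(train, key=lambda inst: distance(inst[0], query))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

# Illustrative nominal instances: (outlook, temper., humidity, windy) -> play
train = [(("sunny", "hot", "high", "false"), "no"),
         (("overcast", "hot", "high", "false"), "yes"),
         (("rainy", "mild", "high", "false"), "yes")]
print(knn_predict(train, ("sunny", "hot", "high", "true"), k=1))   # -> 'no'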

Examples for K-NN Methods

The weather problem:

Outlook     Temper.   Humidity   Windy   Play
Sunny       Hot       High       False   No
Sunny       Hot       High       True    Yes
Overcast    Hot       High       False   Yes
Rainy       Mild      High       False   Yes
Rainy       Cool      Normal     False   Yes
Rainy       Cool      Normal     True    No
Overcast    Cool      Normal     True    Yes
Sunny       Mild      High       False   No
Sunny       Cool      Normal     False   Yes
Rainy       Mild      Normal     False   No
Sunny       Mild      Normal     True    Yes
Overcast    Mild      High       True    Yes

New instances:
(sunny, hot, high, true): easy, there is an exact match!
(sunny, cool, high, false): the nearest neighbours do not agree!
What to do? Go with the majority: no.
(rainy, hot, normal, false): no agreement!
A tie between (rainy, mild, normal, false) and
(rainy, cool, normal, false)! Maybe go from
1-NN to 2-NN, ..., K-NN.

Figure: Linear Model vs. K-NN — classification by a linear model (a straight decision boundary) compared with the decision boundary produced by K-NN.
