
Simple learning algorithms

One R learning;
Bayes model;
Decision tree;
Covering algorithm;
Mining for association rules;
Linear models for numeric prediction;
Instance-based learning.
Reading materials: Chapter 4 of the textbook
by Witten et al.; Sections 6.1, 6.2, 7.1-7.4, 7.8
of the textbook by Han.

Inferring rudimentary rules

One R (1R) algorithm: learns a 1-level decision
tree, i.e. generates a set of rules that all test
one particular attribute.
Basic version for nominal attributes:
one branch for each of the attribute's
values; each branch assigns the most frequent
class.
Error rate: proportion of instances that
don't belong to the majority class of their
corresponding branch.
Choose the attribute with the lowest error rate.
Pseudo-code for 1R:
For each attribute,
  For each value of the attribute:
    count how often each class appears
    find the most frequent class
    make the rule assign that class to this attribute value
  Calculate the error rate of the rules
Choose the rules with the smallest error rate
Note: "missing" is treated as a separate attribute value.
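A minimal Python sketch of 1R for nominal attributes (the dictionary-based data layout and attribute names are illustrative assumptions, not part of the original slides):

from collections import Counter, defaultdict

def one_r(instances, attributes, class_attr="play"):
    """1R: for each attribute build one rule per value (majority class);
    keep the attribute whose rule set makes the fewest errors."""
    best = None
    for attr in attributes:
        # count classes per attribute value ("missing" is just another value)
        counts = defaultdict(Counter)
        for inst in instances:
            counts[inst.get(attr, "missing")][inst[class_attr]] += 1
        rules = {v: c.most_common(1)[0][0] for v, c in counts.items()}
        errors = sum(sum(c.values()) - max(c.values()) for c in counts.values())
        if best is None or errors < best[2]:
            best = (attr, rules, errors)
    return best  # (attribute, {value: predicted class}, error count)

# Tiny illustrative call (the full weather data appears in the table below)
data = [{"outlook": "sunny", "windy": "false", "play": "no"},
        {"outlook": "overcast", "windy": "false", "play": "yes"},
        {"outlook": "rainy", "windy": "true", "play": "no"}]
print(one_r(data, ["outlook", "windy"]))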

The weather problem


Outlook     Temper.   Humidity   Windy   Play
Sunny       Hot       High       False   No
Sunny       Hot       High       True    No
Overcast    Hot       High       False   Yes
Rainy       Mild      High       False   Yes
Rainy       Cool      Normal     False   Yes
Rainy       Cool      Normal     True    No
Overcast    Cool      Normal     True    Yes
Sunny       Mild      High       False   No
Sunny       Cool      Normal     False   Yes
Rainy       Mild      Normal     False   Yes
Sunny       Mild      Normal     True    Yes
Overcast    Mild      High       True    Yes
Overcast    Hot       Normal     False   Yes
Rainy       Mild      High       True    No

In total, there are 6 instances whose temperature is mild: four of them have final decision
Yes and two have No. The rule is
If Temper. = mild then Play = yes
Error rate: 2/6.

1R algorithm

Attribute   Rules               Errors   Total err.
Outlook     sunny -> no         2/5      4/14
            overcast -> yes     0/4
            rainy -> yes        2/5
Temper.     hot -> no           2/4      5/14
            mild -> yes         2/6
            cool -> yes         1/4
Humidity    high -> no          3/7      4/14
            normal -> yes       1/7
Windy       false -> yes        2/8      5/14
            true -> no          3/6

Dealing with numeric attributes

Discretization: the range of the attribute is
divided into a set of intervals.
Instances are sorted according to the attribute's
values.
Breakpoints are placed where the class
changes (minimizing the total number of errors).

Discretization
Temperature:  64 | 65 | 68 69 70 | 71 72 | 72 75 75 | 80 | 81 83 | 85
Play:         Yes| No | Yes Yes Yes | No No | Yes Yes Yes | No | Yes Yes | No

Overfitting problem: the procedure is very
sensitive to noise.
A single instance with an incorrect class
label can produce a separate interval.
Also, a time-stamp attribute (which has a different
value for every instance) would have zero errors.
Simple solution: enforce a minimum number
of instances of the majority class per interval.
64  65  68  69  70  71  72  72  75  75  80  81  83  85
Yes No  Yes Yes Yes No  No  Yes Yes Yes No  Yes Yes No

Merging two adjacent partitions with a common majority class, we get

64 65 68 69 70 71 72 72 75 75 | 80 81 83 85
Yes No Yes Yes Yes No No Yes Yes Yes | No Yes Yes No

How about the following partition?

64 65 68 69 70 71 72 72 75 75 80 81 83 | 85
Yes No Yes Yes Yes No No Yes Yes Yes No Yes Yes | No
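A minimal Python sketch of this partition-and-merge procedure, assuming the temperature/Play sequence above and a minimum of three majority-class instances per interval (the tie-breaking order is an implementation choice, not specified in the slides):

from collections import Counter

def partition(values, classes, min_majority=3):
    """Sweep the sorted (value, class) pairs; close the current interval once its
    majority class has at least `min_majority` members, the next class differs
    from that majority, and the next value is strictly larger."""
    intervals, current = [], []
    for v, c in zip(values, classes):
        if current:
            maj, maj_n = Counter(cls for _, cls in current).most_common(1)[0]
            if maj_n >= min_majority and c != maj and v != current[-1][0]:
                intervals.append(current)
                current = []
        current.append((v, c))
    intervals.append(current)
    # merge adjacent intervals whose majority class is the same
    merged = [intervals[0]]
    for part in intervals[1:]:
        maj_prev = Counter(c for _, c in merged[-1]).most_common(1)[0][0]
        maj_here = Counter(c for _, c in part).most_common(1)[0][0]
        if maj_here == maj_prev:
            merged[-1] = merged[-1] + part
        else:
            merged.append(part)
    return merged

temps = [64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85]
play  = ["Y","N","Y","Y","Y","N","N","Y","Y","Y","N","Y","Y","N"]
for part in partition(temps, play):
    print(part)   # two intervals: 64..75 (majority yes) and 80..85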

Statistical Modelling

Basic assumptions: attributes are equally
important and statistically independent.
These idealized assumptions are never met in practice, but the scheme works well!
The weather data with probabilities
                         Counts (yes / no)    Fractions (yes / no)
Outlook     Sunny            2 / 3                2/9 / 3/5
            Overcast         4 / 0                4/9 / 0/5
            Rainy            3 / 2                3/9 / 2/5
Temper.     Hot              2 / 2                2/9 / 2/5
            Mild             4 / 2                4/9 / 2/5
            Cool             3 / 1                3/9 / 1/5
Humidity    High             3 / 4                3/9 / 4/5
            Normal           6 / 1                6/9 / 1/5
Windy       False            6 / 2                6/9 / 2/5
            True             3 / 3                3/9 / 3/5
Play                         9 / 5                9/14 / 5/14

New instance: (sunny, cool, high, true, ?)

Likelihood of yes = 2/9 × 3/9 × 3/9 × 3/9 × 9/14 = 0.0053
Likelihood of no  = 3/5 × 1/5 × 4/5 × 3/5 × 5/14 = 0.0206

Normalization into probabilities:

P(yes) = 0.0053 / (0.0053 + 0.0206) = 20.5%
P(no)  = 0.0206 / (0.0053 + 0.0206) = 79.5%

Naive Bayes Model

Bayes rule: the probability of event H given evidence E is

Pr[H|E] = Pr[E|H] Pr[H] / Pr[E].

A priori probability of H: Pr[H], the probability of
the event before the evidence has been seen.
A posteriori probability of H: Pr[H|E], the probability of the event after the evidence has been seen.

Naive Bayes for classification:
What is the probability of the class for a given
instance?
Evidence E = the instance.
Event H = class value for the instance.
Naive Bayes assumption: the evidence can be
split into independent parts (i.e. the attributes
of the instance!), so

Pr[H|E] = Pr[E1|H] Pr[E2|H] ... Pr[En|H] Pr[H] / Pr[E].

Consider the weather problem with the instance (sunny, cool, high, true, ?):

Pr[yes|E] Pr[E] = Pr[sunny|yes] Pr[cool|yes] Pr[high|yes] Pr[true|yes] Pr[yes]
                = 2/9 × 3/9 × 3/9 × 3/9 × 9/14.

No need to worry about Pr[E]: it disappears
after normalization!
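A minimal Python sketch of this computation on the nominal weather data (the conditional fractions are hard-coded from the table above; the dictionary layout is an illustrative assumption):

# Conditional fractions Pr[attribute = value | class] and priors Pr[class]
cond = {
    "yes": {"outlook": {"sunny": 2/9, "overcast": 4/9, "rainy": 3/9},
            "temper.": {"hot": 2/9, "mild": 4/9, "cool": 3/9},
            "humidity": {"high": 3/9, "normal": 6/9},
            "windy": {"false": 6/9, "true": 3/9}},
    "no":  {"outlook": {"sunny": 3/5, "overcast": 0/5, "rainy": 2/5},
            "temper.": {"hot": 2/5, "mild": 2/5, "cool": 1/5},
            "humidity": {"high": 4/5, "normal": 1/5},
            "windy": {"false": 2/5, "true": 3/5}},
}
prior = {"yes": 9/14, "no": 5/14}

def naive_bayes(instance):
    """Return normalized class probabilities for a dict of attribute values."""
    likelihood = {}
    for cls in prior:
        p = prior[cls]
        for attr, value in instance.items():
            p *= cond[cls][attr][value]   # independence assumption
        likelihood[cls] = p
    total = sum(likelihood.values())
    return {cls: p / total for cls, p in likelihood.items()}

print(naive_bayes({"outlook": "sunny", "temper.": "cool",
                   "humidity": "high", "windy": "true"}))
# -> roughly {'yes': 0.205, 'no': 0.795}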

Zero probabilities and missing values

Another instance: (overcast, mild, high, false, ?).
Likelihood of yes = 4/9 × 4/9 × 3/9 × 6/9 × 9/14 = 0.0282
Likelihood of no  = 0/5 × 2/5 × 4/5 × 2/5 × 5/14 = 0.

Does it make sense to claim that the likelihood of no is zero? If not, how should we deal with
this issue?
Remedy: add 1 to the count for every attribute value-class combination (Laplace estimator).
In some cases adding a constant μ different
from 1 might be more appropriate. For attribute outlook and class yes:

(2 + μα)/(9 + μ),  (4 + μβ)/(9 + μ),  (3 + μγ)/(9 + μ),

with weights satisfying α + β + γ = 1, α > 0, β > 0, γ > 0.
Extra merit: missing values are simply not counted,
in both training and prediction!
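A small sketch of such smoothed fractions, assuming equal weights (the function name and call are illustrative):

def smoothed(counts, mu=1.0, weights=None):
    """Smooth raw value counts within one class: (count + mu*weight) / (total + mu)."""
    total = sum(counts.values())
    weights = weights or {v: 1 / len(counts) for v in counts}   # equal weights
    return {v: (c + mu * weights[v]) / (total + mu) for v, c in counts.items()}

# Outlook counts for class "yes": sunny 2, overcast 4, rainy 3
print(smoothed({"sunny": 2, "overcast": 4, "rainy": 3}, mu=3.0))
# -> sunny 3/12, overcast 5/12, rainy 4/12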

Dealing with numeric attributes

1. Usual assumption: numeric attributes have a
normal (Gaussian) probability distribution.
2. The probability density function of the
normal distribution is defined by two parameters:
i.   the sample mean  μ = (1/n) Σ_{i=1}^{n} x_i;
ii.  the standard deviation  σ = sqrt( Σ_{i=1}^{n} (x_i − μ)² / (n − 1) );
iii. the density function  f(x) = 1/(sqrt(2π) σ) · exp( −(x − μ)² / (2σ²) ).

For the weather problem, if the attribute temperature has a mean of 73 and a standard
deviation of 6.2 for class yes, then the density is

f(temperature = 66 | yes) = 1/(sqrt(2π) · 6.2) · exp( −(73 − 66)² / (2 · 6.2²) ) = 0.0340.
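A quick check of this density value in Python (mean and standard deviation taken from the slide):

import math

def gaussian_density(x, mean, std):
    """Normal probability density used by Naive Bayes for numeric attributes."""
    return math.exp(-(x - mean) ** 2 / (2 * std ** 2)) / (math.sqrt(2 * math.pi) * std)

print(gaussian_density(66, mean=73, std=6.2))   # ~0.0340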

Dealing with numeric attributes


Weather data with numeric attributes (per class yes / no):

Outlook (counts, fractions):   sunny 2, 3 (2/9, 3/5);  overcast 4, 0 (4/9, 0/5);  rainy 3, 2 (3/9, 2/5)
Temperature (values):          yes: 83, 70, 68, ...    no: 85, 80, 65, ...
                               mean 73 / 74.6,  std. dev. 6.2 / 7.9
Humidity (values):             yes: 86, 96, 80, ...    no: 85, 90, 70, ...
                               mean 79.1 / 86.2,  std. dev. 10.2 / 9.7
Windy (counts, fractions):     false 6, 2 (6/9, 2/5);  true 3, 3 (3/9, 3/5)
Play:                          yes 9, no 5 (9/14, 5/14)

For a new day (sunny, 66, 90, true, ?), using

f(temperature = 66 | yes) = 0.0340,  f(humidity = 90 | yes) = 0.0221,
f(temperature = 66 | no)  = 0.0291,  f(humidity = 90 | no)  = 0.0380,

we have

Likelihood of yes = 2/9 × 0.0340 × 0.0221 × 3/9 × 9/14 = 0.000036,
Likelihood of no  = 3/5 × 0.0291 × 0.0380 × 3/5 × 5/14 = 0.000136,

which gives

Pr(yes) = 20.9%,  Pr(no) = 79.1%.

Missing values are not counted!

Probability densities

Relationship between probability and density:

Pr[c − ε/2 < x ≤ c + ε/2] ≈ ε · f(c).

This doesn't change the calculation of a posteriori probabilities because ε cancels out.
Exact relationship:

Pr[a ≤ x ≤ b] = ∫_a^b f(t) dt.

Merits and flaws of Naive Bayes

Naive Bayes works surprisingly well (despite
the unrealistic independence assumption).
Reason: classification doesn't require accurate probability estimates as long as the maximum probability is assigned to the correct class.
Redundant attributes might cause problems.
Note: for numeric attributes that are not
normally distributed, other (kernel) density estimators should be considered!

Decision trees

Normal procedure: top-down, in a recursive
divide-and-conquer fashion:
1. An attribute is selected for the root node and
a branch is created for each possible attribute value.
2. The instances are split into subsets (one
for each branch extending from the node).
3. The procedure is repeated recursively for
each branch, using only the instances that
reach the branch.
The process stops if all instances have the
same class.
Issue: how to select the splitting attribute?
Criterion: the best attribute is the one leading to the smallest tree.
Trick: choose the attribute that produces
the purest nodes.
How to measure purity? Information gain!

Computing information

Information gain increases with the average
purity of the subsets that an attribute produces.
Strategy: choose the attribute that results
in the greatest information gain.
Information is measured in bits:
1. Given a probability distribution, the info
required to predict an event is the distribution's entropy.
2. Entropy gives the required information
in bits (it may involve fractions of a bit!).
Formula for computing the entropy:

entropy(p1, p2, ..., pn) = − Σ_{i=1}^{n} pi log pi.

Example: attribute Outlook, value Sunny:

info([2, 3]) = entropy(2/5, 3/5)
             = −(2/5) log(2/5) − (3/5) log(3/5) = 0.971 bits;

More Examples

Outlook = Overcast:
info([4, 0]) = entropy(1, 0) = −1 · log 1 = 0 bits;
Outlook = Rainy:
info([3, 2]) = entropy(3/5, 2/5)
             = −(3/5) log(3/5) − (2/5) log(2/5) = 0.971 bits;
Expected information for the attribute:

info([2, 3], [4, 0], [3, 2]) = 0.971 × 10/14 + 0 = 0.693 bits.

Information gain: the gap between the information before and after the split.
Information gain for the weather problem:

gain(outlook)  = info([9, 5]) − info([2, 3], [4, 0], [3, 2])
               = 0.940 − 0.693 = 0.247 bits;
gain(temper.)  = 0.029 bits;
gain(humidity) = 0.152 bits;
gain(windy)    = 0.048 bits.
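A short Python sketch that reproduces these numbers (class counts hard-coded from the weather table; the gain ratio used later is included for convenience):

import math

def entropy(counts):
    """Entropy in bits of a class-count list, e.g. [2, 3] for 2 yes / 3 no."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

def info(subsets):
    """Expected information after splitting into the given class-count subsets."""
    total = sum(sum(s) for s in subsets)
    return sum(sum(s) / total * entropy(s) for s in subsets)

before = entropy([9, 5])                  # 0.940 bits
split  = [[2, 3], [4, 0], [3, 2]]         # outlook = sunny / overcast / rainy
gain   = before - info(split)             # 0.247 bits
split_info = entropy([5, 4, 5])           # intrinsic info of the split, 1.577 bits
print(gain, gain / split_info)            # gain and gain ratio (~0.156)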

Wishlist for a purity measure

Desirable properties of a purity measure:
A pure node should be measured as zero.
When impurity is maximal (all classes equally
likely), the measure should be maximal.
The measure should enjoy the multistage property (decisions can be made in several stages):

entr(p, q, r) = entr(p, q + r) + (q + r) · entr( q/(q+r), r/(q+r) ).

For example,

measure([2, 3, 4]) = measure([2, 7]) + 7/9 · measure([3, 4]).

Entropy is the only function that satisfies all
the above properties!
Simplification of computation:

entr(x1, ..., xn) = − ( Σ_{i=1}^{n} xi log xi ) / ( Σ_{i=1}^{n} xi ) + log( Σ_{i=1}^{n} xi ).

Instead of maximizing info gain we can minimize information!

Avoiding overfitting

Trouble: for attributes with a large number of
values (extreme case: an ID code), the
corresponding subsets are more likely to be
pure. Information gain is therefore biased towards choosing attributes with a large
number of values. This leads to so-called
overfitting (selection of an attribute that is useless
for prediction).

Remedy: use the gain ratio, which takes
the number and size of branches into account when
choosing an attribute. Let a be an attribute; then

gain ratio(a) = gain(a) / split info(a),

where split info(a) is the information of the split itself.
For example, an attribute splitting n instances into n singleton branches has
split info([1, 1, ..., 1]) = log n, where
n is the dimension of the all-ones vector.
Note the distinction between the two info
functions: info([1, 1]) = log 2 (entropy of the branch sizes), while
info([0, 1], [1, 0]) = 0 (expected information of the class distributions in the branches).

Computing gain ratio

Suppose there is a different ID code associated with every instance of the weather problem: the information gain of the ID code is 0.940 bits, but
its split info is log 14 = 3.807 bits.

Information for the weather data:

Attribute   Info    Info gain   Split info              Gain ratio
outlook     0.693   0.247       info([5,4,5]) = 1.577   0.156
temper.     0.911   0.029       info([4,6,4]) = 1.557   0.019
humidity    0.788   0.152       info([7,7])   = 1       0.152
windy       0.892   0.048       info([8,6])   = 0.985   0.049

The ID code still has a very high gain ratio (0.940/3.807 = 0.247).


Ad hoc test is needed to prevent the ID case.
The gain ratio might bias to attributes with
low intrinsic information. Restriction only to
the attribute whose information gain is above
the average is necessary.
Comments: Algorithm for induction decision
trees ID3 was developed by Ross Quinlan.
Gain ratio is one modification of this basic algorithm. Its advanced version is coined C4.5,
which can deal with numeric attributes, missing values, and noisy data.

Covering algorithm

A decision tree can be converted into a rule
set by a straightforward conversion, but
this usually leads to a very complex rule set.
Efficient conversions are useful but not easy
to find.
The covering approach generates a rule
set directly (excluding instances of other classes).
Key idea: find the rule set that covers all the
instances of one class.
Consider the problem of classifying a set
of points on the plane belonging to two classes (circles and boxes). We can start with
If ? then the point belongs to the circle class.
It covers all the instances of the circle class, but
it is too general.
Adding a precondition (x <= 1), we get:
If x <= 1 then class is circle.
The rule covers some of the circle instances;
we need more rules to cover the remaining circle
instances, and further rules for the box instances.

Figure: Covering. A two-dimensional example with boundaries x = 1 and y = 1.5; the rule "If ? then the point belongs to the circle class" is refined to "If x <= 1 then the point belongs to the circle class".

Procedure for covering

Simple approach:
generate a rule by adding tests that maximize the rule's accuracy.
Each new test reduces the rule's coverage: start
from the whole space of examples, then the rule so far, then the
rule after adding a new term.

The selection of a test depends on the following quantities:
t:     total number of instances covered by the rule
p:     positive examples of the class covered by the rule
t − p: number of errors made by the rule
Select the test that maximizes the ratio p/t.
We are finished when p/t = 1 or the set
of instances can't be split any further.

Covering for the contact lenses data

Rule we seek: If ? then recommendation = hard
Possible tests:
Age = Young                              2/8
Age = Pre-presbyopic                     1/8
Age = Presbyopic                         1/8
Spectacle prescription = Myope           3/12
Spectacle prescription = Hypermetrope    1/12
Astigmatism = no                         0/12
Astigmatism = yes                        4/12
Tear production rate = Reduced           0/12
Tear production rate = Normal            4/12

We choose the test Astigmatism = yes and obtain the subset:

Age             Spect. prescr.   Astig.   Tear prod.   Rec. lenses
Pre-presbyopic  Hypermetrope     Yes      Reduced      None
Pre-presbyopic  Hypermetrope     Yes      Normal       None
Presbyopic      Myope            Yes      Reduced      None
Presbyopic      Myope            Yes      Normal       Hard
Presbyopic      Hypermetrope     Yes      Reduced      None
Presbyopic      Hypermetrope     Yes      Normal       None
Pre-presbyopic  Myope            Yes      Normal       Hard
Pre-presbyopic  Myope            Yes      Reduced      None
Young           Hypermetrope     Yes      Normal       Hard
Young           Hypermetrope     Yes      Reduced      None
Young           Myope            Yes      Normal       Hard
Young           Myope            Yes      Reduced      None

Further refinement

Rule we seek: If astigmatism = yes and ? then recommendation = hard
Possible tests:
Age = Young                              2/4
Age = Pre-presbyopic                     1/4
Age = Presbyopic                         1/4
Spectacle prescription = Myope           3/6
Spectacle prescription = Hypermetrope    1/6
Tear production rate = Reduced           0/6
Tear production rate = Normal            4/6

Adding the test Tear production rate = Normal, we obtain the resulting table:

Age             Spect. prescr.   Astig.   Tear prod.   Rec. lenses
Pre-presbyopic  Hypermetrope     Yes      Normal       None
Presbyopic      Myope            Yes      Normal       Hard
Presbyopic      Hypermetrope     Yes      Normal       None
Pre-presbyopic  Myope            Yes      Normal       Hard
Young           Hypermetrope     Yes      Normal       Hard
Young           Myope            Yes      Normal       Hard

Further refinement

Rule we seek: If astigmatism = yes and tear production rate = normal and ? then recommendation = hard
Possible tests:
Age = Young                              2/2
Age = Pre-presbyopic                     1/2
Age = Presbyopic                         1/2
Spectacle prescription = Myope           3/3
Spectacle prescription = Hypermetrope    1/1
Between the first and the fourth test (both have accuracy p/t = 1), we select the one with
the larger coverage.
Resulting table:

Age             Spect. prescr.   Astig.   Tear prod.   Rec. lenses
Presbyopic      Myope            Yes      Normal       Hard
Pre-presbyopic  Myope            Yes      Normal       Hard
Young           Myope            Yes      Normal       Hard

The final rule: If astigmatism = yes and tear production rate = normal and spectacle prescription =
myope then recommendation = hard.
Second rule for recommending hard lenses: If age =
young and astigmatism = yes and tear production rate
= normal then recommendation = hard.

This rule is built from the instances not covered
by the first rule.
The above two rules cover all the hard-lens cases.
We can follow a similar process for the other two
classes.

PRISM

Pseudo-code for PRISM:

For each class C
  Initialize E to the instance set
  While E contains instances in class C
    1. Create a rule R with an empty left-hand side
       that predicts class C
    2. Until R is perfect (or there are no more attributes to use) do
         For each attribute A not mentioned in R, and each value v,
           consider adding the condition A = v to the left-hand side of R
         Select A and v to maximize the accuracy p/t
         (break ties by choosing the condition with the largest p)
         Add A = v to R
    3. Remove the instances covered by R from E
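A compact Python sketch of this pseudo-code (the dictionary-based data layout is an illustrative assumption):

def prism(instances, attributes, class_attr, target_class):
    """Learn a list of rules (each a list of (attribute, value) conditions)
    that together cover all instances of `target_class`."""
    E = list(instances)
    rules = []
    while any(x[class_attr] == target_class for x in E):
        rule, covered = [], E
        # grow the rule until it is perfect or every attribute has been used
        while any(x[class_attr] != target_class for x in covered) and len(rule) < len(attributes):
            best = None   # ((p/t, p), (attribute, value), covered subset)
            used = {a for a, _ in rule}
            for attr in attributes:
                if attr in used:
                    continue
                for value in {x[attr] for x in covered}:
                    subset = [x for x in covered if x[attr] == value]
                    p = sum(x[class_attr] == target_class for x in subset)
                    t = len(subset)
                    # maximize accuracy p/t; break ties by choosing the larger p
                    if best is None or (p / t, p) > best[0]:
                        best = ((p / t, p), (attr, value), subset)
            rule.append(best[1])
            covered = best[2]
        rules.append(rule)
        E = [x for x in E if x not in covered]   # remove instances covered by the rule
    return rules

On the contact-lens data, run for the class "hard", this sketch should reproduce the two rules derived step by step above.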

Comments: this simple algorithm uses a
separate-and-conquer (covering) strategy to extract all rules. However,
it does not tell us in which order the rules should be interpreted, and it does not address
how to deal with missing values or with the
overfitting problem.

Mining association rules

A simple method for finding association rules:
use the standard separate-and-conquer method,
treating every possible combination of attribute
values as a separate class.
Two problems:
1. computational complexity;
2. the resulting number of rules.
Remedy: focus on rules with high support and
confidence only!

Difficulty: coverage and accuracy are hard to define directly for such rules, so we use the following notions.
Item: one test / attribute-value pair.
Item set: the set of all items occurring in a rule.
Frequency (support): the occurrence frequency of an
item set in the data set.
Goal: only rules that exceed a pre-defined minimum frequency/support are reported.
We can find all item sets with the given
minimum frequency and generate rules from
them!

Item sets

Item sets for the weather data (examples):

One-item sets:    outlook = sunny (5);
                  temperature = cool (4)
Two-item sets:    outlook = sunny, temperature = mild (2);
                  outlook = sunny, humidity = high (3)
Three-item sets:  outlook = sunny, temperature = hot, humidity = high (2);
                  outlook = sunny, humidity = high, windy = false (2)
Four-item sets:   outlook = sunny, temperature = hot, humidity = high, play = no (2);
                  outlook = rainy, temperature = mild, windy = false, play = yes (2)

In total (with minimum support 2): 12 one-item sets, 47 two-item
sets, 39 three-item sets, 6 four-item sets.
Once all item sets with minimum support
have been generated, we can turn them into
rules.
Example: Humidity = Normal, Windy = False, Play = Yes (4). In total seven (2^n − 1, with n = 3) potential rules:

If Humidity = Normal and Windy = False then Play = Yes                 4/4
If Humidity = Normal and Play = Yes then Windy = False                 4/6
If Windy = False and Play = Yes then Humidity = Normal                 4/6
If Humidity = Normal then Windy = False and Play = Yes                 4/7
If Windy = False then Humidity = Normal and Play = Yes                 4/8
If Play = Yes then Humidity = Normal and Windy = False                 4/9
If True then Humidity = Normal and Windy = False and Play = Yes        4/14
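A short sketch of enumerating these candidate rules from one frequent item set (the support-counting function is assumed to be available and is not shown):

from itertools import combinations

def rules_from_itemset(itemset, support_of):
    """Enumerate all 2^n - 1 rules 'antecedent => consequent' of a frequent
    item set, with their confidence. `support_of` maps a frozenset of items
    to its support count (the empty set maps to the number of instances)."""
    items = frozenset(itemset)
    total = support_of(items)
    rules = []
    for k in range(len(items)):                 # antecedent sizes 0 .. n-1
        for antecedent in combinations(sorted(items), k):
            antecedent = frozenset(antecedent)
            s = support_of(antecedent)
            confidence = total / s if s else 0.0
            rules.append((set(antecedent), set(items - antecedent), confidence))
    return rules

# e.g. rules_from_itemset({"humidity=normal", "windy=false", "play=yes"}, count_support)
# where count_support is a hypothetical function that scans the weather data.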

Association rules

Rules for the weather data with support > 1:
in total, 3 rules with support four, 5 with
support three, and 50 with support two.
Example rules from the same item set
Temperature = Cool, Humidity = Normal, Windy = False, Play = Yes (2):

Temperature = Cool, Windy = False  =>  Humidity = Normal, Play = Yes
Temperature = Cool, Windy = False, Humidity = Normal  =>  Play = Yes
Temperature = Cool, Windy = False, Play = Yes  =>  Humidity = Normal

owing to the following frequent item sets:

Temperature = Cool, Windy = False
Temperature = Cool, Humidity = Normal, Windy = False
Temperature = Cool, Windy = False, Play = Yes

All these item sets have minimum support 2.

Frequent item sets

A frequent item set is an item set whose support
exceeds the minimal requirement.
Apriori property:
if (A B) is a frequent item set, then (A) and
(B) have to be frequent item sets as well;
in general, if X is a frequent k-item set, then
all (k-1)-item subsets of X are also frequent.
Based on the Apriori property, we can compute
candidate k-item sets by merging (k-1)-item sets.
Finding one-item sets is easy;
use one-item sets to get two-item sets,
two-item sets to get three-item sets, and so on.
An example: given the five three-item sets

(A B C), (A B D), (A C D), (A C E), (B C D)

(the sets are lexicographically ordered!)

Candidate four-item set (A B C D):
OK, because (B C D) also has at least minimum coverage.
Candidate (A C D E): not OK, because
(C D E) does not satisfy the minimum coverage requirement.
A final check is made by counting instances in the dataset;
the (k-1)-item sets are stored in a hash table.

Apriori Algorithm

We are looking for all high-confidence rules:
the support of the antecedent is obtained from the hash
table;
(c+1)-consequent rules are built from c-consequent ones.
Observation: a (c+1)-consequent rule can
only hold if all corresponding c-consequent
rules also hold.
This is just like the procedure for large item sets.
Key steps from k-item sets to (k+1)-item sets:
1. Create a table of potential candidate
(k+1)-item sets from the hash table of k-item sets by joining pairs of k-item
sets.
2. Use the Apriori property of frequent
item sets and the order in the hash table to
improve efficiency: remove non-promising candidates from the
table by consulting the hash table of k-item
sets.
3. Scan the whole data set to remove the
candidates that do not satisfy the minimum support requirement and obtain the frequent
(k+1)-item sets.
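A minimal sketch of the candidate-generation and pruning step (item sets are assumed to be represented as sorted tuples; the example input is the set L2 from the transaction data discussed next):

def apriori_gen(frequent_k):
    """Generate candidate (k+1)-item sets from frequent k-item sets:
    join pairs sharing their first k-1 items, then prune every candidate
    that has an infrequent k-subset (Apriori property)."""
    frequent = set(frequent_k)                    # O(1) subset lookups
    items = sorted(frequent_k)
    candidates = []
    for i, a in enumerate(items):
        for b in items[i + 1:]:
            if a[:-1] == b[:-1]:                  # share the first k-1 items
                cand = a + (b[-1],)
                # prune: every k-subset of cand must be frequent
                if all(cand[:j] + cand[j + 1:] in frequent for j in range(len(cand))):
                    candidates.append(cand)
    return candidates

L2 = [("I1", "I2"), ("I1", "I3"), ("I1", "I5"),
      ("I2", "I3"), ("I2", "I4"), ("I2", "I5")]
print(apriori_gen(L2))   # -> [('I1', 'I2', 'I3'), ('I1', 'I2', 'I5')]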

Mining for transaction data

TID   Items
T1    I1, I2, I5
T2    I2, I4
T3    I2, I3
T4    I1, I2, I4
T5    I1, I3
T6    I2, I3
T7    I1, I3
T8    I1, I2, I3, I5
T9    I1, I2, I3

C: candidate set;  L: frequent item set.
Candidate 1-item sets and their supports (counted from the transactions):
I1: 6, I2: 7, I3: 6, I4: 2, I5: 2.
With minimum support 2, every item is frequent, so C1 = L1.

Mining Transaction Data

Candidate 2-item sets C2 and frequent 2-item sets L2 (minimum support 2):

C2: itemset  sup.        L2: itemset  sup.
{I1,I2}      4           {I1,I2}      4
{I1,I3}      4           {I1,I3}      4
{I1,I4}      1           {I1,I5}      2
{I1,I5}      2           {I2,I3}      4
{I2,I3}      4           {I2,I4}      2
{I2,I4}      2           {I2,I5}      2
{I2,I5}      2
{I3,I4}      0
{I3,I5}      1
{I4,I5}      0

Candidate 3-item sets C3 and frequent 3-item sets L3 (here C3 = L3):

Itemset      Sup.
{I1,I2,I3}   2
{I1,I2,I5}   2

Mining transaction data

Procedure:

1. Generate the candidate 1-item sets and
scan all the transactions to count the occurrences of each item; keep the 1-item
sets satisfying the minimum support.

2. Generate the candidate 2-item sets, and
scan the data set to find the frequent 2-item sets L2.

3. Generate the candidate 3-item sets and scan
the data set to get L3.
To generate the candidate set, let

C3 = L2 ⋈ L2 = {(I1, I2, I3), (I1, I2, I5), (I1, I3, I5),
                (I2, I3, I4), (I2, I3, I5), (I2, I4, I5)}.

By using the Apriori property, we can remove
the last four candidates. Thus we have

C3 = {(I1, I2, I3), (I1, I2, I5)}.

Linear models

Work naturally with numeric attributes.
Standard technique for numeric prediction: linear regression.
The output is a linear combination of the attributes,

y = w0 + w1 a1 + w2 a2 + ... + wk ak.

The weights are calculated from the training data.
Predicted value for the first training instance a(1) (with a0(1) = 1):

y(1) = w0 a0(1) + w1 a1(1) + w2 a2(1) + ... + wk ak(1).

All k + 1 coefficients are chosen so that
the squared error on the training data,

Σ_{i=1}^{n} ( y(i) − Σ_{j=0}^{k} wj aj(i) )²,

is minimized.
The coefficients can be derived by solving a
linear system.
The method can handle many instances.

Solving the Least Squares Problem

Linear regression always leads to an unconstrained
convex quadratic optimization problem of the form

(QP)   min_{x ∈ R^n}  f(x) = (1/2) xᵀQx − qᵀx.

Theorem 1. Suppose that the matrix Q is
positive semidefinite. Then x solves (QP)
if and only if it is a solution of the linear
equation system Qx = q.

Let us consider linear regression for the
point set S = {(1, 0), (1, 2), (2, −1)}; in other
words, try to find a linear model y = ax + b that
approximates these points. Thus we have to solve

min f(a, b) = (a + b)² + (2 − a − b)² + (−1 − 2a − b)².

By Theorem 1, we need only solve the linear system

[ 12  8 ] [ a ]   [ 0 ]
[  8  6 ] [ b ] = [ 2 ].

Solving the above system we get the linear model y = −2x + 3.
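A quick numerical check of this example with numpy's least-squares solver (the signs of the third point and of the resulting slope are reconstructed as above):

import numpy as np

# Points (x_i, y_i) from the example; fit y = a*x + b
X = np.array([[1.0, 1.0],
              [1.0, 1.0],
              [2.0, 1.0]])       # columns: x and the constant term
y = np.array([0.0, 2.0, -1.0])

# The normal equations (X^T X) w = X^T y correspond to the system Qx = q above
coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
print(coeffs)                     # -> approximately [-2.  3.], i.e. y = -2x + 3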

From Regression to Classification

Any regression technique can be used for
classification:
Training: perform a regression for each
class, setting the output to 1 (y(i) = 1) for
training instances that belong to the class, and to 0
(y(i) = 0) for those that don't.
Prediction: predict the class corresponding to the
model with the largest output value.
Let us consider the case of 3 classes (S1, S2
and S3) on a plane, where each point is characterized by the parameter pair (x, y). We
define the label of each class in the following way:
(x, y) ∈ S1: labelled (1, 0, 0)ᵀ;
(x, y) ∈ S2: labelled (0, 1, 0)ᵀ;
(x, y) ∈ S3: labelled (0, 0, 1)ᵀ.
Perform linear regression for all three
classes simultaneously by minimizing
f(a1, b1, c1, a2, b2, c2, a3, b3, c3),
where the function f is defined on the next page.

Multiple-Class classification

Let r(x, y) = (a1 x + b1 y + c1, a2 x + b2 y + c2, a3 x + b3 y + c3)ᵀ. Then

f = Σ_{(x,y) ∈ S1} || (1, 0, 0)ᵀ − r(x, y) ||²
  + Σ_{(x,y) ∈ S2} || (0, 1, 0)ᵀ − r(x, y) ||²
  + Σ_{(x,y) ∈ S3} || (0, 0, 1)ᵀ − r(x, y) ||².

A complex optimization problem is involved;
this is known as multi-response linear regression.
By minimizing f over the parameter space, we find
the three models for the sets S1, S2, S3. For
any new point (x, y), we calculate the value
of each model, namely a1 x + b1 y + c1, a2 x + b2 y + c2 and a3 x + b3 y + c3.
The model with the largest value assigns
the class of the point (x, y).
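A minimal numpy sketch of multi-response linear regression on 2-D points (the sample points below are made up purely for illustration):

import numpy as np

# Illustrative 2-D points for three classes
S1 = np.array([[0.0, 0.0], [0.2, 0.4], [0.5, 0.1]])
S2 = np.array([[3.0, 0.2], [3.5, 0.8], [2.8, 0.5]])
S3 = np.array([[1.5, 3.0], [1.2, 3.5], [2.0, 2.8]])

X = np.vstack([S1, S2, S3])
X = np.hstack([X, np.ones((len(X), 1))])       # columns: x, y, 1 (for the c_i terms)
# One-hot targets (1,0,0), (0,1,0), (0,0,1), one row per training point
Y = np.repeat(np.eye(3), [len(S1), len(S2), len(S3)], axis=0)

# Least squares fits the coefficients (a_i, b_i, c_i) of all three models at once
W, *_ = np.linalg.lstsq(X, Y, rcond=None)      # shape (3, 3): one column per class

def predict(point):
    """Class whose linear model gives the largest output for a new (x, y) point."""
    scores = np.append(point, 1.0) @ W
    return int(np.argmax(scores)) + 1          # 1-based class index

print(predict([3.2, 0.4]))                     # for these made-up samples this should be 2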

Two-class classification

For a two-class classification problem, we
change the labels of the two classes to (1, 0)ᵀ and
(0, 1)ᵀ, respectively.
Another way to do binary classification is to perform a regression for each class separately,

f1(a) = w0(1) + w1(1) a1 + w2(1) a2 + ... + wk(1) ak,
f2(a) = w0(2) + w1(2) a1 + w2(2) a2 + ... + wk(2) ak,

and then use the two models to predict the
class of an instance a:
If f1(a) >= f2(a), then a is in class 1.
If f1(a) < f2(a), then a is in class 2.

(Figure: two regression lines, y = a1 x + b1 and y = a2 x + b2, one fitted for each class.)

Pairwise regression

Another regression model for classification:
perform a regression for each pair of classes, using
only the instances from these two classes.
An output of +1 is assigned to one member of the pair, and −1 to the other.
The class receiving the most votes is predicted.
What to do if there is no agreement?
Maybe more accurate, but expensive.

Logistic regression:
Designed for classification problems.
Tries to estimate class probabilities directly, using the following linear model:

log( Pr(G = i | X = x) / Pr(G = K | X = x) ) = β_i0 + β_iᵀ x,   i = 1, 2, ..., K − 1,

so that

Pr(G = i | X = x) = exp(β_i0 + β_iᵀ x) / ( 1 + Σ_{l=1}^{K−1} exp(β_l0 + β_lᵀ x) ),   i = 1, ..., K − 1,
Pr(G = K | X = x) = 1 / ( 1 + Σ_{l=1}^{K−1} exp(β_l0 + β_lᵀ x) ).

Define θ = {β_10, β_1, ..., β_(K−1)0, β_(K−1)} and
denote Pr(G = i | X = x) = p_i(x, θ). Logistic regression uses the conditional likelihood
of G given X: for N observations {g_1, ..., g_N},
the log-likelihood is

l(θ) = Σ_{i=1}^{N} log p_{g_i}(x_i, θ).

Instance-based learning

The distance function defines what is learned.
The most popular distance function is the Euclidean distance

( (a1(1) − a1(2))² + (a2(1) − a2(2))² + ... + (ak(1) − ak(2))² )^(1/2),

where a(1) and a(2) are two instances with k attributes.

If the attributes are measured on different scales, normalization is necessary.
The distance for nominal attributes should be
defined carefully (e.g. 0 if the values are equal, 1 otherwise).
A missing value is usually assumed to be maximally distant.
1-NN algorithm:
Very accurate on the training data.
Very slow in prediction: the entire
training set is scanned for each prediction.
Usually treats all attributes equally;
however, attribute weighting may be necessary!
Remedy for noisy instances: take a
majority vote over the k nearest neighbors.
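A minimal k-NN sketch following the conventions above (numeric attributes assumed already normalized; nominal distance is 0/1; the sample instances are illustrative):

import math
from collections import Counter

def distance(a, b):
    """Euclidean-style distance: squared difference for numbers, 0/1 mismatch for nominal values."""
    total = 0.0
    for x, y in zip(a, b):
        if isinstance(x, (int, float)) and isinstance(y, (int, float)):
            total += (x - y) ** 2
        else:
            total += 0.0 if x == y else 1.0
    return math.sqrt(total)

def knn_predict(train, query, k=3):
    """Majority vote among the k nearest training instances (attributes, label)."""
    neighbors = sorted(train, key=lambda inst: distance(inst[0], query))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

# Illustrative nominal instances: (outlook, temper., humidity, windy) -> play
train = [(("sunny", "hot", "high", "false"), "no"),
         (("overcast", "hot", "high", "false"), "yes"),
         (("rainy", "mild", "high", "false"), "yes")]
print(knn_predict(train, ("sunny", "hot", "high", "true"), k=1))   # -> 'no'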

Examples for K-NN Methods

The weather problem:

Outlook     Temper.   Humidity   Windy   Play
Sunny       Hot       High       False   No
Sunny       Hot       High       True    Yes
Overcast    Hot       High       False   Yes
Rainy       Mild      High       False   Yes
Rainy       Cool      Normal     False   Yes
Rainy       Cool      Normal     True    No
Overcast    Cool      Normal     True    Yes
Sunny       Mild      High       False   No
Sunny       Cool      Normal     False   Yes
Rainy       Mild      Normal     False   No
Sunny       Mild      Normal     True    Yes
Overcast    Mild      High       True    Yes

New instances:
(sunny, hot, high, true): easy, there is an exact match!
(sunny, cool, high, false): the nearest neighbours do not agree!
What to do? Go with the majority: no.
(rainy, hot, normal, false): no agreement!
A tie between (rainy, mild, normal, false) and
(rainy, cool, normal, false)! Maybe go from
1-NN to 2-NN, ..., K-NN.

Figure: Linear Model vs. K-NN — classification by a linear model (a straight decision boundary) compared with the decision boundary produced by K-NN.
