
Classification

The basic methods used to solve the classification problem:

1. Specifying boundaries: dividing the input space of potential database tuples into regions, where each region is associated with one class. E.g. a teacher classifying students based on their "grades" (a very simple classification). Decision trees take this approach.

2. Using probability distributions: for any given class Cj, P(ti | Cj) is the PDF for the class evaluated at one point ti. Here each tuple in the database is assumed to consist of a single value rather than a set of values. If a probability of occurrence for each class, P(Cj), is known (perhaps determined by a domain expert), then P(Cj) P(ti | Cj) is used to estimate the probability that ti is in class Cj.

3. Using posterior probabilities: given a data value ti, we would like to determine the probability that ti is in a class Cj. This is denoted by P(Cj | ti) and is called the posterior probability. One classification approach would be to determine the posterior probability for each class and then assign ti to the class with the highest probability. E.g. NN.
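For reference, the posterior probability in approach 3 is related to the quantities in approach 2 by Bayes' rule:

P(Cj | ti) = P(ti | Cj) P(Cj) / P(ti)

Since P(ti) is the same for every candidate class, the classes can be ranked using the product P(ti | Cj) P(Cj) alone, which is exactly the quantity used in approach 2.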

Different Algorithms Used in Classification

1. Statistical-based algorithms (use statistical information)
   - Regression
   - Bayesian
2. Distance-based algorithms (use similarity or distance measures)
   - IR approach
   - K Nearest Neighbours
3. Decision-tree based algorithms (use tree structures)
   - ID3
   - C4.0 and C5.0
   - CART
   - CHAID
4. Neural network based algorithms (use the network structure)
   - Propagation
   - NN Supervised Learning
   - Radial Basis Function Networks
   - Perceptrons
5. Rule-based algorithms (generate if-then rules)
   - Generating Rules from a DT
   - Generating Rules from a NN
   - Generating Rules without a DT or NN
6. Combining techniques

Statistical-based algorithms (use statistical information)

Regression
Regression problems deal with the "estimation" of an output value based on input values: a set of data is taken and fitted to a formula. Regression can be used to solve classification as well as forecasting problems.

Classification approaches:

a) Division: the data are divided into regions based on class.
   Step 1: This method views the data as plotted in an n-dimensional space WITHOUT any EXPLICIT class values shown.
   Step 2: Through regression, the space is divided into regions, one per class.

b) Prediction: formulas are generated to predict the output class value.
   Step 1: A value for each class is included in the graph.
   Step 2: Using regression, the formula for a line to predict class values is generated.
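As a rough illustration of the prediction approach, the sketch below is a minimal example (the data, the 0/1 class encoding and the 0.5 threshold are all illustrative assumptions, not taken from the notes). It fits a linear formula to numeric inputs with least squares and classifies new points by thresholding the predicted class value.

    # A sketch of the "prediction" approach: fit a linear formula to class values
    # encoded as 0/1, then classify new points by thresholding the fitted output.
    import numpy as np

    # Two numeric input attributes per tuple; class 1 = Good, 0 = Poor (assumed)
    X = np.array([[1.0, 2.0], [2.0, 1.5], [0.5, 0.4], [0.3, 0.8], [0.2, 0.1]])
    y = np.array([1.0, 1.0, 0.0, 0.0, 0.0])

    # Least-squares fit of y ~ w0 + w1*x1 + w2*x2
    A = np.hstack([np.ones((len(X), 1)), X])        # add an intercept column
    w, *_ = np.linalg.lstsq(A, y, rcond=None)

    def predict(point):
        """Evaluate the fitted formula and threshold it to get a class."""
        value = w[0] + point @ w[1:]
        return 1 if value >= 0.5 else 0

    print(predict(np.array([1.5, 1.8])))            # expected: 1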

Bayesian
Assuming that the contributions of all attributes are independent and that each contributes equally to the classification problem, a simple classification scheme called Naïve Bayes classification has been proposed, based on Bayes' rule of conditional probability.

Naïve Bayes
Naïve Bayes is a classification technique that is both predictive and descriptive. It analyses the relationship between each independent variable and the dependent variable to derive a conditional probability for each relationship. All variables (independent and dependent) must be categorical. It requires only one pass through the training set to generate a classification model, which makes it one of the most efficient data mining techniques. It does not handle continuous data, so any independent or dependent variables that contain continuous values must be binned or bracketed.

A credit risk example:

Name         Debt   Income   Married?   Risk
Rex          High   High     Yes        Good
Ram Prasad   Low    High     Yes        Good
Biranchi     Low    High     No         Poor
Satish       High   Low      Yes        Poor
Mayur        Low    Low      Yes        Poor

Risk is the value to be predicted: the dependent variable (target variable). The other columns are independent variables; the Name column is ignored. All the columns, except Name, have two possible values. The restriction to two values is only to keep the example simple.

Process:

1) Training: the probability (prior) of each outcome (dependent variable value) is computed by counting how many times it occurs in the training dataset. Example: the prior probability for Good Risk = 2/5 = 0.4.

This can be read the following way: "If I know nothing else about a loan applicant, there is a 0.4 probability that the applicant is a Good Risk."

2) Count how frequently each independent variable value occurs in combination with each dependent (output) variable value. From the sample data, we cross-tabulate counts of each Risk outcome (Good or Poor) against each value in the independent variable columns. For example, row 3 below reports two cases where Income is High and Risk is Good, and one case where Income is High and Risk is Poor.

Independent          Counts            Likelihood given
variable value       Good    Poor      Good Risk    Poor Risk
Debt = High           1       1         0.50         0.33
Debt = Low            1       2         0.50         0.67
Income = High         2       1         1.00         0.33
Income = Low          0       2         0.00         0.67
Married? = Yes        2       2         1.00         0.67
Married? = No         0       1         0.00         0.33
Total by Risk         2       3

(Note: the bottom row is also used to compute the prior probabilities for Good and Poor Risk. The prior probability for Good is 0.40 (two of five cases) and that for Poor is 0.60 (three of five cases).) These frequencies are then used to compute conditional probabilities that are combined with the prior probability to make the predictions.

3) Compute the conditional probabilities: the likelihood that one of the independent variables has a particular value, given a known risk level, is obtained by taking the count and dividing by the "Total by Risk" number on the bottom row. Example: the likelihood that a Good Risk has High Income is 2/2 = 1.00 (see row 3).
p(Income = High | Risk = Good) = 2/2 = 1.00
p(Income = Low | Risk = Poor) = 2/3 = 0.67 (see row 4)

4) Compute a score (related to the posterior probability): for a given case, a score is computed for both values of the risk level simply by multiplying the prior probability for that risk level by all the likelihood figures from the above table for the case's independent variable values.

5) Decision: the highest-scoring value becomes the predicted value.

Example: the first row, for Rex, in the training set has High Debt, High Income and Married? = Yes. In the table above, the likelihoods associated with these values and Good Risk are 0.5, 1.0 and 1.0 respectively (rows 1, 3 and 5). The score for Rex (for Good) is the product of these three numbers and the prior probability for Good (0.40): 0.50 × 1 × 1 × 0.40 = 0.20.

For Poor Risk the likelihoods (also from rows 1, 3 and 5) are 0.33, 0.33 and 0.67, and the prior probability for Poor Risk is 0.60, so the score for Rex (for Poor) is 0.33 × 0.33 × 0.67 × 0.60 = 0.044.

Decision: because the score for Good is higher, we predict that Rex will be a Good risk. The following table presents the actual risk, the scores for Good risk and Poor risk, and the predicted risk for all cases in the sample data.

Name         Debt   Income   Married?   Actual Risk   Good Risk Score   Poor Risk Score   Predicted Risk
Rex          High   High     Yes        Good          0.20              0.044             Good
Ram Prasad   Low    High     Yes        Good          0.20              0.089             Good
Biranchi     Low    High     No         Poor          0.00              0.044             Poor
Satish       High   Low      Yes        Poor          0.00              0.089             Poor
Mayur        Low    Low      Yes        Poor          0.00              0.18              Poor

Accuracy: 100% on the training set. A good sign? Satisfied? Never! We must also validate the model by using it to predict the outcomes for separate test data.

Scores to posterior probabilities: divide the score of a case by the sum of all scores for that case. Example: the posterior probability that Rex is a Good Risk is approximately 82 percent (0.2 divided by 0.244, the sum of the Good Risk score of 0.2 and the Poor Risk score of 0.044). The posterior probability that Rex is a Poor Risk is 18 percent.

Note that we do not need to know values for all the independent variables to make a prediction. In fact, if we know none of them, we can still predict using just the prior probability. If we know only the value for Income, we can use the conditional probabilities associated with Income to modify the prior probabilities and make a prediction. This ability to make predictions from partial information is a significant advantage of Naïve Bayes.

The use of Bayes' theorem in computing the scores and posterior probabilities in this way is valid only if we assume statistical independence between the various independent variables, such as debt and income (hence the term "naïve"). Despite the fact that this assumption is usually not correct, the algorithm gives good results in practice.

Advantages:
- Easy to use.
- Only one scan of the training data is required.
- Missing values are handled easily, by simply omitting that probability when calculating the likelihoods of membership in each class.

Disadvantages:
- The attributes are usually not independent.
- It does not handle continuous data. Dividing the continuous values into ranges could be used to solve the problem, but the division of the domain into ranges is not an easy task, and how this is done can certainly impact the results.
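Putting the whole procedure together, the sketch below is a minimal illustration (not from the notes; the variable and function names are made up). It reproduces the priors, likelihoods, scores and posterior probabilities for the five-row credit-risk table.

    # A minimal Naive Bayes sketch for the credit-risk example; names are illustrative.
    from collections import Counter, defaultdict

    rows = [  # (Debt, Income, Married?, Risk)
        ("High", "High", "Yes", "Good"),   # Rex
        ("Low",  "High", "Yes", "Good"),   # Ram Prasad
        ("Low",  "High", "No",  "Poor"),   # Biranchi
        ("High", "Low",  "Yes", "Poor"),   # Satish
        ("Low",  "Low",  "Yes", "Poor"),   # Mayur
    ]
    attrs = ["Debt", "Income", "Married?"]

    # 1) Priors: count each Risk outcome in the training data.
    class_counts = Counter(r[-1] for r in rows)
    priors = {c: n / len(rows) for c, n in class_counts.items()}   # Good 0.4, Poor 0.6

    # 2) Counts of each attribute value occurring with each Risk value.
    counts = defaultdict(lambda: defaultdict(Counter))
    for *values, risk in rows:
        for attr, value in zip(attrs, values):
            counts[risk][attr][value] += 1

    def score(case, risk):
        """Prior for the risk level times the likelihood of each known attribute value."""
        s = priors[risk]
        for attr, value in case.items():            # unknown attributes are simply omitted
            s *= counts[risk][attr][value] / class_counts[risk]
        return s

    rex = {"Debt": "High", "Income": "High", "Married?": "Yes"}
    scores = {risk: score(rex, risk) for risk in priors}           # Good 0.20, Poor ~0.044
    total = sum(scores.values())
    posteriors = {risk: s / total for risk, s in scores.items()}   # Good ~0.82, Poor ~0.18
    print(max(scores, key=scores.get), posteriors)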

Distance-based algorithms (use similarity or distance measures)

K Nearest Neighbour (k-NN)

A predictive technique suitable for classification models.
Here the training data is not scanned or processed to create a model; instead, the training data is the model. When a new case is presented, the algorithm looks at all the data to find the subset of cases that are most similar to it and uses them to predict the outcome.

There are two principal drivers: the number of nearest cases to be used (k) and a metric to measure what is meant by "nearest". This gives the k-NN family of algorithms: 1-NN, 2-NN, 3-NN, and so forth. k-NN is based on a concept of distance.

Metric to determine distances: the metric is arbitrary, because there is no preset definition of what constitutes a "good" metric, and it is important, because the choice of metric greatly affects the predictions. Different metrics, used on the same training data, can result in completely different predictions. A domain expert should determine a good metric.

A credit risk example:

Name         Debt   Income   Married?   Risk
Rex          High   High     Yes        Good
Ram Prasad   Low    High     Yes        Good
Biranchi     Low    High     No         Poor
Satish       High   Low      Yes        Poor
Mayur        Low    Low      Yes        Poor

Risk is the value to be predicted: the dependent variable (target variable). The other columns are independent variables; the Name column is ignored. All the columns, except Name, have two possible values. The restriction to two values is only to keep the example simple.

We will use k = 3, i.e. 3-NN. A simple metric: sum the scores for each of the three independent columns, where the score for a column is 0 if the values in the two instances are the same, and 1 if they differ.

Distance between Rex and Ram Prasad? The column scores are 1, 0 and 0, because they have different values only in the Debt column. The distance, the sum of these scores, is equal to 1.

              Rex   Ram Prasad   Biranchi   Satish   Mayur
Rex            0        1           2          1        2
Ram Prasad     1        0           1          2        1
Biranchi       2        1           0          3        2
Satish         1        2           3          0        1
Mayur          2        1           2          1        0

The matrix is symmetrical.

Apply the k-NN technique to see how it classifies our training data. Remember that we chose k = 3, so we are interested in the three neighbours nearest to Rex. The distances in column 1 show that Rex, Ram Prasad and Satish are Rex's three nearest neighbours, because they have the lowest distance scores. (But how can Rex be his own neighbour?) The Risk values for Rex's three nearest neighbours (Rex himself, Ram Prasad and Satish) are Good, Good and Poor, respectively. The predicted Risk for Rex is the value that is most frequent among the k neighbours, or Good in this case.

Who are Ram Prasad's nearest neighbours? Clearly Ram Prasad himself, but what about Rex, Biranchi and Mayur, who are all the same distance (1) from Ram Prasad? We could include all three, exclude all three, or include all three with a proportionate vote (2/3 each in this case). The decision is entirely up to the implementers of the algorithm. For our example, we'll use a 2/3 vote each, resulting in votes of 1 Good (Ram Prasad himself), 2/3 Good, 2/3 Poor and 2/3 Poor, for a consensus of Good. The following table enumerates the predictions from the 3-NN algorithm.

Name         Debt   Income   Married?   Risk   3-NN Prediction
Rex          High   High     Yes        Good   Good
Ram Prasad   Low    High     Yes        Good   Good
Biranchi     Low    High     No         Poor   - (tie)
Satish       High   Low      Yes        Poor   Poor
Mayur        Low    Low      Yes        Poor   Poor

What is a good value for k? With 1-NN, accuracy on the training set will be 100%, but the model is extremely susceptible to noise and will never reflect any kind of pattern in the data. Ideally k should be around 10. k-NN does not make a "learning" pass through the data; all the work is deferred to prediction time, so this does not make k-NN an especially efficient algorithm, particularly with large training datasets. While the nearest neighbour technique is simple in concept, the selection of k and the choice of distance metric pose definite challenges.
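The whole procedure can be sketched compactly. The example below is illustrative rather than from the notes: it implements the mismatch-count metric and a simple majority vote over the k nearest cases, and ties are broken arbitrarily, since the notes leave that decision to the implementer.

    # A minimal k-NN sketch using the mismatch-count metric from the example.
    from collections import Counter

    train = [  # ((Debt, Income, Married?), Risk)
        (("High", "High", "Yes"), "Good"),   # Rex
        (("Low",  "High", "Yes"), "Good"),   # Ram Prasad
        (("Low",  "High", "No"),  "Poor"),   # Biranchi
        (("High", "Low",  "Yes"), "Poor"),   # Satish
        (("Low",  "Low",  "Yes"), "Poor"),   # Mayur
    ]

    def distance(a, b):
        """Score 0 for each matching column and 1 for each mismatch, then sum."""
        return sum(1 for x, y in zip(a, b) if x != y)

    def predict(case, k=3):
        # Keep the k training cases nearest to the new case (ties broken arbitrarily).
        nearest = sorted(train, key=lambda rec: distance(rec[0], case))[:k]
        votes = Counter(risk for _, risk in nearest)
        return votes.most_common(1)[0][0]    # most frequent Risk among the neighbours

    print(predict(("High", "High", "Yes")))  # Rex's attribute values -> Good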

Decision-tree based algorithms (use tree structures)

Decision Trees


Aim: to be able to produce a set of rules, or a model of some sort, that can identify a high percentage of responders. A decision tree may formulate a condition such as:

Example: customers who are male and married and have incomes over $50,000 and who are home-owners responded to our offer.

The condition selects a much higher percentage of responders than if you took a random selection of customers. In contrast, a NN (neural network) identifies which class a customer belongs to, but cannot tell you why: the factors that determine its classifications are not available for analysis, but remain implicit in the network itself.

Decision trees are common and popular in data mining tools. Splitting a data set builds a model that classifies each record in terms of a target variable; for example, a decision tree which classifies a data set according to whether customers did or did not buy a particular product.

A DT is generated from the training set of data by splitting the data into progressively smaller subsets. Each iteration considers the data in only one node. The first iteration considers the root node, which contains all the data. Subsequent iterations work on derivative nodes that contain subsets of the data.

Which independent variable? At each iteration we need to choose the independent variable that most effectively splits the data. This means that the subsets produced by splitting the data according to the value of the independent variable should be as homogeneous as possible with respect to the dependent variable.

Issues faced by most DT algorithms:
1. Choosing splitting attributes
2. Ordering of splitting attributes
3. Splits
4. Tree structures
5. Stopping criteria
6. Training data
7. Pruning

Common algorithms: CHAID, CART and C4.5. CHAID and CART were developed by statisticians. CHAID produces trees with multiple sub-nodes for each split. CART requires less data preparation than CHAID, but produces only two-way splits. ID3 is a predecessor of C4.5. C4.0, C4.5 and C5.0 come from the world of machine learning and are based on information theory.

ID3 algorithm (a predecessor of C4.5)


ID3 attempts to minimize the expected number of comparisons. It uses the entropy concept from information theory to measure how much uncertainty (surprise, randomness) there is in a set with respect to the value of the dependent variable. When all data in a set belong to a single class, there is no uncertainty; in that case the entropy is 0. The objective of DT classification is to iteratively partition the given data set into subsets where all elements in each final subset belong to the same class.

Entropy is used to quantify information and is measured on a scale from 0 to 1. If a set were split 50-50 between good and poor risks, we would be completely uncertain whether a person picked at random from the set would be a good or a poor risk; in this case, the entropy of the set would be 1. If, on the other hand, the whole set were good risks, there would be no uncertainty and the entropy would be 0. Similarly if they were all poor risks.

Measuring the entropy of the complete training set: we find the proportion p1 of good risks in the set and the proportion p2 of poor risks in the set. Shannon's formula for entropy is:

entropy = - Σ_i p_i log2(p_i)

where we take p log2(p) = 0 if p = 0, and i runs over the different classes (here Good and Poor).

Loan example: there are two good risks and three poor risks in the complete training set, and so:

entropy = -( (2/5) log2(2/5) + (3/5) log2(3/5) )
        = -( 0.4 × (-1.3219) + 0.6 × (-0.73697) )
        = -( -0.52877 - 0.44218 )
        = 0.97095

We then consider all the possible ways of splitting the set, by debt, by income and by marital status, and calculate the overall entropy of the resulting subsets for each of the three cases in turn. Consider, first, splitting the training set by debt. The following is a cross-tabulation of the training set by debt and by risk:

             Good Risk   Poor Risk   Total
High Debt        1           1         2
Low Debt         1           2         3
Total            2           3         5

The subset with high debt has one good risk and one poor risk, and so:

entropy = -( (1/2) log2(1/2) + (1/2) log2(1/2) )
        = -( 0.5 × (-1) + 0.5 × (-1) )
        = 1.00000

as we would expect for a set that is split down the middle. The subset with low debt has one good risk and two poor risks, and so:

entropy = -( (1/3) log2(1/3) + (2/3) log2(2/3) )
        = -( (1/3) × (-1.58496) + (2/3) × (-0.58496) )
        = 0.91830

Since there are altogether two high debts and three low debts, the average (or expected) entropy of the two subsets resulting from splitting by debt is:

(2/5) × 1 + (3/5) × 0.91830 = 0.95098

In other words, splitting by debt reduces the entropy by:

0.97095 - 0.95098 = 0.01997

Similar calculations show that splitting the training set by income reduces the entropy by 0.41997, whilst splitting by marital status reduces it by 0.17095. So splitting by income is the most effective way of reducing the entropy in the training data set, and thereby producing subsets that are as homogeneous as possible.
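This attribute-selection step can be reproduced with a short routine. The sketch below is illustrative (the data layout and function names are mine, not from the notes); it computes the entropy of a set of Risk labels and the entropy reduction obtained by splitting on each attribute, and reproduces the figures above.

    # Entropy and entropy reduction (information gain) for the credit-risk example.
    from collections import Counter
    from math import log2

    rows = [  # (Debt, Income, Married?, Risk)
        ("High", "High", "Yes", "Good"), ("Low", "High", "Yes", "Good"),
        ("Low", "High", "No", "Poor"), ("High", "Low", "Yes", "Poor"),
        ("Low", "Low", "Yes", "Poor"),
    ]
    attrs = {"Debt": 0, "Income": 1, "Married?": 2}

    def entropy(labels):
        """Shannon entropy of a list of class labels (classes with zero count never appear)."""
        total = len(labels)
        return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

    def gain(attr):
        """Entropy of the whole set minus the size-weighted entropy of its subsets."""
        col = attrs[attr]
        before = entropy([r[-1] for r in rows])
        after = 0.0
        for value in {r[col] for r in rows}:
            subset = [r[-1] for r in rows if r[col] == value]
            after += len(subset) / len(rows) * entropy(subset)
        return before - after

    for attr in attrs:
        print(attr, round(gain(attr), 5))   # Debt 0.01997, Income 0.41997, Married? 0.17095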
Decision tree after the first split, on Income:

Root:           Good (2) 40.0%    Poor (3) 60.0%    Total 5
Income = High:  Good (2) 66.7%    Poor (1) 33.3%    Total 3 (60.0%)
Income = Low:   Good (0)  0.0%    Poor (2) 100.0%   Total 2 (40.0%)

The second of these subsets (Income = Low) consists of 100% poor risks. Since it is totally homogeneous (and has an entropy of 0), there is no more work to be done on that branch. But the first branch is a mix of good and poor risks; it has an entropy of 0.91830 and needs to be split by a second independent variable. Should it be debt or marital status?

Consider, first, splitting the high-income set by debt. The following is a cross-tabulation of that subset by debt and by risk:

             Good Risk   Poor Risk   Total
High Debt        1           0         1
Low Debt         1           1         2
Total            2           1         3

The subset with high debt has one good risk and no poor risks. It is completely homogeneous and has an entropy of 0.

The subset with low debt has one good risk and one poor risk. It is split down the middle and has an entropy of 1. Since there are altogether one high debt and two low debts, the average (or expected) entropy of the two subsets resulting from splitting the high incomes by debt is:

(1/3) × 0 + (2/3) × 1 = 0.66667

In other words, splitting the high incomes by debt reduces the entropy of this set by:

0.91830 - 0.66667 = 0.25163

On the other hand, if we use marital status, we obtain two completely homogeneous subsets, and so the entropy is brought down to zero. Marital status is obviously the one to go with.
Decision tree after the second split, on Married?, within the Income = High branch:

Root:               Good (2) 40.0%    Poor (3) 60.0%    Total 5
Income = High:      Good (2) 66.7%    Poor (1) 33.3%    Total 3 (60.0%)
  Married? = Yes:   Good (2) 100.0%   Poor (0) 0.0%     Total 2 (40.0%)
  Married? = No:    Good (0) 0.0%     Poor (1) 100.0%   Total 1 (20.0%)
Income = Low:       Good (0) 0.0%     Poor (2) 100.0%   Total 2 (40.0%)

No more splitting needs to be done, and we have a decision tree that we can apply to new cases. The operational tree can be expressed as a set of rules:

IF Income = High AND Married? = Yes THEN Risk = Good
IF Income = High AND Married? = No  THEN Risk = Poor
IF Income = Low                     THEN Risk = Poor
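These rules translate directly into code; the short sketch below is just that transcription (the function name is illustrative).

    def predict_risk(income, married):
        """Apply the decision tree learned above to a new case."""
        if income == "High":
            return "Good" if married == "Yes" else "Poor"
        return "Poor"                      # Income = Low

    print(predict_risk("High", "Yes"))     # -> Good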

Note: the other classification techniques will be provided in separate teaching notes.
***
