
Comparison of Feature Selection Methods for Web Classification

Hakan zpalamutcu

Feature Selection
Prepares data for data mining and machine learning
Commonly used on high-dimensional data
Studies how to select a subset of attributes (variables) used to construct models that describe the data
Purposes include reducing dimensionality and removing irrelevant and redundant features

Feature Selection for Classification

Select, from a set of variables, the smallest subset that maximizes classification performance

Given: a set of predictors (features) and a class/category
Find: the minimum set of features that achieves maximum classification performance

Why Is Feature Selection Important?

May improve the performance of the classification algorithm
The classification algorithm may not scale up to the full feature set, either in sample size or in time
Allows us to better understand the domain

Comparison Steps
Choosing the data set
Preprocessing the data
Converting the data to WEKA format
Applying feature selection methods to the data
Applying classification to the data

Choosing data set

Category      # of Documents
Course        927
Department    140
Faculty       1124
Other         3761
Project       504
Staff         137
Student       1640

Category    # of Positive Documents    # of Negative Documents
Course      100                        25
Faculty     100                        25

Preprocessing of Data
Removal of HTML tags
Removal of punctuation characters and numeric values
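
The two preprocessing steps can be sketched in a few lines of Python. This is only an illustration, not the tool used in the study: the function name `preprocess` is hypothetical, lowercasing is an added assumption, and the contents of script or style elements are not treated specially.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML page and discards the tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def preprocess(html_page):
    """Strip HTML tags, then punctuation and numeric values (hypothetical helper)."""
    extractor = _TextExtractor()
    extractor.feed(html_page)
    extractor.close()
    text = " ".join(extractor.parts)
    # Keep ASCII letters only: punctuation and digits become spaces.
    text = re.sub(r"[^A-Za-z]+", " ", text)
    # Lowercasing is an assumption, not stated in the slides.
    return text.lower().split()
```

For example, `preprocess("<h1>CS 101: Intro</h1>")` yields the tokens `cs` and `intro`.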

Converting data to WEKA format

Text2arff tool is used
Stopwords are removed
Minimum frequency is 100
Frequency is calculated using the tf-idf scheme
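
The conversion can be pictured with the following sketch, which does not reproduce Text2arff itself. It assumes that "minimum frequency" means the total count of a term across the corpus, and that the tf-idf weight is the raw term count times log(N / document frequency); the function name `tfidf_arff`, the relation name and the `class_label` attribute name are all hypothetical.

```python
import math
from collections import Counter


def tfidf_arff(docs, labels, stopwords, min_freq, relation="webpages"):
    """Build ARFF text with one numeric tf-idf attribute per kept term.

    docs: list of token lists; labels: one class label per document.
    """
    docs = [[t for t in doc if t not in stopwords] for doc in docs]
    total = Counter(t for doc in docs for t in doc)          # corpus frequency
    doc_freq = Counter(t for doc in docs for t in set(doc))  # document frequency
    # Assumption: a term is kept when its corpus frequency reaches min_freq.
    vocab = sorted(t for t, n in total.items() if n >= min_freq)
    n_docs = len(docs)

    lines = [f"@relation {relation}", ""]
    lines += [f"@attribute {term} numeric" for term in vocab]
    classes = ",".join(sorted(set(labels)))
    # The underscore keeps this name distinct from letters-only tokens.
    lines += [f"@attribute class_label {{{classes}}}", "", "@data"]
    for doc, label in zip(docs, labels):
        tf = Counter(doc)
        # Assumed weighting: raw term count times log(N / document frequency).
        weights = [tf[t] * math.log(n_docs / doc_freq[t]) for t in vocab]
        lines.append(",".join(f"{w:.4f}" for w in weights) + f",{label}")
    return "\n".join(lines)
```

Each kept term becomes one numeric attribute, so the number of initial attributes equals the number of terms that pass the stopword and frequency filters.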

Category    # of Initial Attributes
Course      76
Faculty     93

Applying feature selection methods to data

Attribute Evaluators
CfsSubsetEval
ConsistencySubsetEval
ClassifierSubsetEval

Search Methods
GeneticSearch
BestFirst
RankSearch

Attribute evaluator      Search method
CfsSubsetEval            GeneticSearch
CfsSubsetEval            BestFirst
ConsistencySubsetEval    RankSearch
ConsistencySubsetEval    BestFirst
ClassifierSubsetEval     GeneticSearch
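
As a rough illustration of the idea behind correlation-based selection, the approach CfsSubsetEval is built on, a subset scores well when its features correlate strongly with the class and weakly with each other. The sketch below computes the standard CFS merit, k * r_cf / sqrt(k + k(k-1) * r_ff), from correlations supplied by the caller; the function name and the example correlation values are hypothetical, and WEKA's own correlation measure is not reproduced here.

```python
import math


def cfs_merit(feature_class_corr, feature_feature_corr):
    """Merit of a feature subset from its average correlations.

    feature_class_corr: correlation of each selected feature with the class.
    feature_feature_corr: correlation of each distinct pair of selected features.
    """
    k = len(feature_class_corr)
    if k == 0:
        return 0.0
    r_cf = sum(feature_class_corr) / k
    r_ff = (sum(feature_feature_corr) / len(feature_feature_corr)
            if feature_feature_corr else 0.0)
    return k * r_cf / math.sqrt(k + k * (k - 1) * r_ff)


# Hypothetical values: two equally predictive features.
independent = cfs_merit([0.6, 0.6], [0.0])  # uncorrelated with each other
redundant = cfs_merit([0.6, 0.6], [1.0])    # perfectly correlated with each other
```

With these made-up numbers the redundant pair scores no better than a single feature, while the independent pair scores higher, which is why such an evaluator tends to drop redundant attributes.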

Applying feature selection methods to data


Category  Attribute evaluator    Search method  # of Features Selected  Selected Features
Course    CfsSubsetEval          GeneticSearch  18                      3,6,9,13,14,19,34,40,42,43,45,48,49,52,59,65,69,70
Course    CfsSubsetEval          BestFirst      12                      3,6,13,18,19,42,43,45,48,64,67,70
Course    ConsistencySubsetEval  RankSearch     12                      6,14,18,19,40,42,43,45,48,64,67,70
Course    ConsistencySubsetEval  BestFirst      7                       3,6,13,19,42,48,70
Course    ClassifierSubsetEval   GeneticSearch  6                       2,27,31,33,73,75

Category  Attribute evaluator    Search method  # of Features Selected  Selected Features
Faculty   CfsSubsetEval          GeneticSearch  20                      2,6,12,16,19,27,42,43,53,56,58,61,65,67,73,74,76,84,90,92
Faculty   CfsSubsetEval          BestFirst      3                       16,43,74
Faculty   ConsistencySubsetEval  RankSearch     3                       16,43,74
Faculty   ConsistencySubsetEval  BestFirst      3                       16,43,74
Faculty   ClassifierSubsetEval   GeneticSearch  10                      1,3,35,47,49,59,64,67,81,90

Applying classification to data

Classifiers
Naive Bayes (bayes)
Class for a Naive Bayes classifier using estimator classes

Bagging (meta)
Class for bagging a classifier to reduce variance. Can do classification and regression depending on the base learner

J48 (trees)
Class for generating a pruned or unpruned C4.5 decision tree

Results

Quality measures
CCI: correctly classified instances
F-measure
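
Both measures are simple to compute from prediction counts. The sketch below is illustrative only: the function names are hypothetical, the confusion counts in the last line are made up, and the slides do not say whether the reported F-measure is per class or averaged over classes. The total of 125 assumes that each category was evaluated on its 100 positive and 25 negative documents.

```python
def accuracy(cci, total):
    """Fraction of instances that were correctly classified."""
    return cci / total


def f_measure(tp, fp, fn):
    """Harmonic mean of precision and recall for one class."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# A CCI of 121 out of an assumed 125 instances.
acc = accuracy(121, 125)
# Hypothetical confusion counts, not taken from the slides.
f1 = f_measure(tp=98, fp=2, fn=2)
```

Under that assumption, a CCI of 121 corresponds to an accuracy of 96.8%.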

Results
Before applying feature selection

Category  Naive Bayes CCI  Naive Bayes F-Measure  J48 CCI  J48 F-Measure  Bagging CCI  Bagging F-Measure
Course    106              0.857                  121      0.967          107          0.821

After applying feature selection

CATEGORY: COURSE

Attribute evaluator    Search method  Naive Bayes CCI  Naive Bayes F-Measure  J48 CCI  J48 F-Measure  Bagging CCI  Bagging F-Measure
CfsSubsetEval          GeneticSearch  106              0.853                  120      0.959          104          0.779
CfsSubsetEval          BestFirst      108              0.867                  118      0.938          112          0.884
ConsistencySubsetEval  RankSearch     100              0.801                  118      0.938          112          0.881
ConsistencySubsetEval  BestFirst      105              0.840                  109      0.855          110          0.863
ClassifierSubsetEval   GeneticSearch  104              0.813                  119      0.947          105          0.794

Results
Before applying feature selection

Category  Naive Bayes CCI  Naive Bayes F-Measure  J48 CCI  J48 F-Measure  Bagging CCI  Bagging F-Measure
Faculty   99               0.811                  121      0.967          114          0.902

After applying feature selection

CATEGORY: FACULTY

Attribute evaluator    Search method  Naive Bayes CCI  Naive Bayes F-Measure  J48 CCI  J48 F-Measure  Bagging CCI  Bagging F-Measure
CfsSubsetEval          GeneticSearch  101              0.815                  119      0.951          112          0.898
CfsSubsetEval          BestFirst      104              0.808                  105      0.802          107          0.821
ConsistencySubsetEval  RankSearch     104              0.808                  105      0.802          107          0.821
ConsistencySubsetEval  BestFirst      104              0.808                  105      0.802          107          0.821
ClassifierSubsetEval   GeneticSearch  92               0.750                  105      0.794          107          0.821