
SWENG 545: Term Project
Team submission by Karen Kraus and Steve Luttrell
April 25, 2011

Section 1: Description of the dataset

The ADA_Prior.arff dataset, found on the Tunedit.org website, contains census data. The dataset contains 4562 instances and 15 attributes as follows:

a. age: numeric
b. workclass: {Federal-gov, Without-pay, Self-emp-inc, State-gov, Local-gov, Self-emp-not-inc, Private}
c. fnlwgt: numeric
d. education: {11th, Masters, Some-college, Assoc-voc, 5th-6th, 10th, Preschool, 9th, Assoc-acdm, Doctorate, Bachelors, HS-grad, 12th, Prof-school, 1st-4th, 7th-8th}
e. educationNum: numeric
f. maritalStatus: {Married-civ-spouse, Divorced, Married-spouse-absent, Separated, Widowed, Married-AF-spouse, Never-married}
g. occupation: {Sales, Protective-serv, Prof-specialty, Adm-clerical, Tech-support, Priv-house-serv, Other-service, Handlers-cleaners, Transport-moving, Craft-repair, Armed-Forces, Machine-op-inspct, Exec-managerial, Farming-fishing}
h. relationship: {Unmarried, Not-in-family, Other-relative, Wife, Husband, Own-child}
i. race: {White, Black, Asian-Pac-Islander, Other, Amer-Indian-Eskimo}
j. sex: {Male, Female}
k. capitalGain: numeric
l. capitalLoss: numeric
m. hoursPerWeek: numeric
n. nativeCountry: {Portugal, Cuba, Philippines, Iran, Taiwan, Greece, Ecuador, Yugoslavia, Columbia, United-States, Ireland, England, Nicaragua, South, Italy, India, Vietnam, France, Haiti, Honduras, Peru, China, Trinadad&Tobago, Puerto-Rico, Hong, Guatemala, Outlying-US(Guam-USVI-etc), Jamaica, Scotland, Cambodia, Hungary, Mexico, Laos, El-Salvador, Canada, Poland, Dominican-Republic, Germany, Japan}
o. label: {-1 if <=50K, 1 if >50K}
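To make the attribute layout concrete, here is a minimal Python sketch that records the 15 attributes and their types and decodes the label column. The names `SCHEMA`, `NUMERIC`, `NOMINAL`, and `decode_label` are hypothetical and chosen only for illustration. The nominal value lists are omitted for brevity, treating the label as nominal is an assumption based on its brace notation, and the sketch does not read the actual ARFF file.

```python
NUMERIC = "numeric"
NOMINAL = "nominal"

# Attribute name -> type, in the order listed above (a through o).
# The label is treated as nominal here; that is an assumption.
SCHEMA = {
    "age": NUMERIC,
    "workclass": NOMINAL,
    "fnlwgt": NUMERIC,
    "education": NOMINAL,
    "educationNum": NUMERIC,
    "maritalStatus": NOMINAL,
    "occupation": NOMINAL,
    "relationship": NOMINAL,
    "race": NOMINAL,
    "sex": NOMINAL,
    "capitalGain": NUMERIC,
    "capitalLoss": NUMERIC,
    "hoursPerWeek": NUMERIC,
    "nativeCountry": NOMINAL,
    "label": NOMINAL,
}


def decode_label(value):
    """Map the class label (-1 or 1) to the income bracket it stands for."""
    brackets = {-1: "<=50K", 1: ">50K"}
    return brackets[int(value)]


# Names of the attributes declared numeric in the dataset description.
numeric_attributes = [name for name, kind in SCHEMA.items() if kind == NUMERIC]
```

A mapping like this gives a quick sanity check that a loaded file has the expected columns before any classifier is run.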

Section 2: Objectives of the experiment

The dataset is a real training and validation dataset used by the ADA marketing database to discover high revenue people from the census data. The dataset could also be used to determine:
- What are the occupation categories of the high revenue people?
- What's the education level of high revenue people?
- What are the primary occupations of people with a High School graduate education level?

Section 3: Individual experiments

Talk about preprocessing here....

Karen

1. K Nearest Neighbor Lazy Classification - For K nearest neighbor, I'll set KNN to 1, 3, 5, 7, and 9 and compare the results to find the best setting. I'll also experiment with the distanceWeighting setting.
   a. Algorithm description (inputs, outputs, main steps)
   b. Performance Metrics used to evaluate
   c. Training and Testing Instances
   d. Performance
   e. Time to Construct Model
   f. Conclusion

2. J48/C4.5 Decision Tree - I'll work with MinNumObj, reducedErrorPruning, unpruned, and default Weka settings.
   a. Algorithm description (inputs, outputs, main steps) - J48 builds a decision tree by choosing, at each node, the one attribute that splits the training data into the cleanest subsets. The attribute with the highest information gain is used (C4.5 normalizes information gain into a gain ratio). As inputs, J48 can handle both nominal (discrete) and numeric (continuous) attributes. For a continuous attribute, J48 determines a threshold and splits the instances in two: those less than or equal to the threshold and those greater than it.
   b. Performance Metrics used to evaluate - I'll use Entropy because it indicates how clean the leaf node is (representing 1 class versus more than 1). The lower the entropy, the better. I'm also planning to use ROC, F-Measure, and Precision.
   c. Training and Testing Instances
   d. Performance
   e. Time to Construct Model
   f. Conclusion

Steve

3. Simple CART
   a. Algorithm description (inputs, outputs, main steps)
   b. Performance Metrics used to evaluate
   c. Training and Testing Instances
   d. Performance
   e. Time to Construct Model
   f. Conclusion

4. ADTree
   a. Algorithm description (inputs, outputs, main steps)
   b. Performance Metrics used to evaluate
   c. Training and Testing Instances
   d. Performance
   e. Time to Construct Model
   f. Conclusion

5. Decision Stump
   a. Algorithm description (inputs, outputs, main steps)
   b. Performance Metrics used to evaluate
   c. Training and Testing Instances
   d. Performance
   e. Time to Construct Model
   f. Conclusion

Section 4: Compare and Contrast Individual Experiments and Results

Might make sense to provide a chart here...

Section 5: Group Experiment

1. ????
   a. Algorithm description (inputs, outputs, main steps)
   b. Performance Metrics used to evaluate
   c. Training and Testing Instances
   d. Performance
   e. Time to Construct Model
   f. Conclusion

Section 6: Summary of Results

2. Performance
3. Evaluation
4. Discussion
5. Conclusion
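As an illustration of the J48 split criterion described in Section 3 (entropy, information gain, and a threshold split on a continuous attribute), here is a minimal Python sketch. It is not Weka's implementation: the function names are hypothetical, the toy data is made up, candidate thresholds are taken as midpoints between adjacent values as a simplification, and plain information gain is used rather than the gain ratio.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(values, labels, threshold):
    """Gain from splitting into value <= threshold and value > threshold."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    total = len(labels)
    weighted = ((len(left) / total) * entropy(left)
                + (len(right) / total) * entropy(right))
    return entropy(labels) - weighted


def best_threshold(values, labels):
    """Try each midpoint between adjacent distinct values.

    Returns the (threshold, gain) pair with the highest gain.
    Assumes at least two distinct values are present.
    """
    distinct = sorted(set(values))
    candidates = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
    return max(((t, information_gain(values, labels, t)) for t in candidates),
               key=lambda pair: pair[1])


# Made-up toy data: hours worked per week and an income label (-1 or 1).
hours = [20, 25, 30, 40, 45, 50, 60, 65]
label = [-1, -1, -1, -1, 1, 1, 1, 1]
threshold, gain = best_threshold(hours, label)
```

On this toy data the two classes separate cleanly, so the chosen split leaves both subsets with zero entropy, which is the "cleanest subsets" case the algorithm description refers to.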
