
CSC 573: Data Mining
Weka Assignment #1: Attribute Relevance Analysis in WEKA
Instructor: Ratko Orlandic

For this assignment, your task is to familiarize yourself with the WEKA machine learning tool and with its attribute ranking facilities (the Select attributes feature in WEKA Explorer). You will use the contact-lenses, iris, and soybean data sets, all of which are available in the required .arff format in the WEKA package:
- The contact-lenses data set has 24 instances with 5 nominal attributes, the last of which (contact-lenses) is the class dimension.
- The iris data set has 150 instances with 4 continuous attributes and a nominal class, which is the last (5th) dimension.
- The soybean data set has 683 instances with 36 nominal attributes, the last of which is the class dimension. Unlike the other two sets, soybean has missing values.
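All three files use WEKA's plain-text .arff layout: a @relation line naming the data set, one @attribute declaration per dimension (numeric, or nominal with its list of allowed values), and a @data section with one comma-separated instance per line; missing values are written as ?. A rough sketch of the layout (loosely following the iris data, not copied verbatim from the distributed file) looks like this:

    @relation iris

    @attribute sepallength numeric
    @attribute sepalwidth  numeric
    @attribute petallength numeric
    @attribute petalwidth  numeric
    @attribute class {Iris-setosa, Iris-versicolor, Iris-virginica}

    @data
    5.1,3.5,1.4,0.2,Iris-setosa
    7.0,3.2,4.7,1.4,Iris-versicolor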

Installing WEKA on Your Computer


The WEKA machine learning tool is installed on the computers in the computer lab UHB 2030. You can also download a free copy of the software to your own computer as follows:
1. Go to: http://www.cs.waikato.ac.nz/~ml/ .
2. Click on the software tab.
3. Under Getting started, click on Download.
4. Under Windows, click on the link to download a self-extracting executable that includes Java VM 1.4 (weka-3-4-10jre.exe).
5. Install the WEKA software on your computer, selecting the default directories.
WEKA comes with certain data files and some documentation. Once you install the software, you can find these in the directory C:\Program Files\Weka-3-4. Whether you work in the lab or on your own computer, you should spend some time familiarizing yourself with WEKA. For this assignment, you will be working with WEKA Explorer.

Attribute Relevance Ranking


For each step, open the indicated file in the Preprocess window. Then go to the Select attributes window and set the Attribute Selection Mode to Use full training set. For each of cases A-E below, perform attribute ranking using the following attribute selection methods with default parameters:
a) InfoGainAttributeEval; and
b) GainRatioAttributeEval.
These attribute selection methods should consider only the non-class dimensions (for each set, the class attribute is indicated above the Start button). Record the output of each run in a text file called output.txt: copy the output of each run from the Attribute selection output window in the Explorer and paste it at the end of the output.txt file. (Illustrative sketches of how the same ranking and discretization steps could be scripted with the WEKA Java API appear after case E below.)

A. Perform attribute ranking on the contact-lenses.arff data set using the two attribute-ranking methods with default parameters.

B. Load the iris.arff data set. Perform attribute ranking on it using the two attribute-ranking methods with default parameters.

C. Go back to Preprocess and reload the iris.arff data set. Discretize all non-class attributes into 10 equal-width bins as follows: under Filter in the Preprocess window of the Explorer, select filters -> unsupervised -> attribute -> Discretize (use the default parameters of the Discretize filter) and hit Apply. Verify that all attributes are now nominal by clicking on the individual attributes in the Attributes window in Preprocess. Then perform attribute ranking on the discretized set using the two attribute-ranking methods with default parameters.

D. Go back to Preprocess and load the original iris.arff data set again. Discretize all non-class attributes into 5 close-to-equal-height bins: select the Discretize filter, click on it in the Filter window to open its parameters, set bins to 5 and useEqualFrequency to true, and hit Apply. After you verify that all attributes are nominal, perform attribute ranking on the new set using the two attribute-ranking methods with default parameters.

E. Load the soybean.arff data set. Then perform attribute ranking on it using the two attribute-ranking methods with default parameters.
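If you would like to double-check your Explorer runs, the same ranking can be reproduced with WEKA's Java API. The sketch below is only illustrative (it assumes the .arff file is in the working directory and that the WEKA jar is on the classpath); the graded output must still come from the Explorer's Attribute selection output window.

    import java.io.BufferedReader;
    import java.io.FileReader;

    import weka.attributeSelection.AttributeSelection;
    import weka.attributeSelection.GainRatioAttributeEval;
    import weka.attributeSelection.InfoGainAttributeEval;
    import weka.attributeSelection.Ranker;
    import weka.core.Instances;

    // Illustrative sketch: rank the non-class attributes of an .arff file with
    // InfoGainAttributeEval and GainRatioAttributeEval (default parameters).
    public class RankAttributes {
        public static void main(String[] args) throws Exception {
            String file = args.length > 0 ? args[0] : "contact-lenses.arff";
            Instances data = new Instances(new BufferedReader(new FileReader(file)));
            data.setClassIndex(data.numAttributes() - 1);   // last attribute is the class

            // Information-gain ranking, evaluated on the full training set
            AttributeSelection infoGain = new AttributeSelection();
            infoGain.setEvaluator(new InfoGainAttributeEval());
            infoGain.setSearch(new Ranker());               // Ranker produces the ranked list
            infoGain.SelectAttributes(data);
            System.out.println(infoGain.toResultsString());

            // Gain-ratio ranking
            AttributeSelection gainRatio = new AttributeSelection();
            gainRatio.setEvaluator(new GainRatioAttributeEval());
            gainRatio.setSearch(new Ranker());
            gainRatio.SelectAttributes(data);
            System.out.println(gainRatio.toResultsString());
        }
    }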
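The discretization in cases C and D can likewise be reproduced programmatically with the unsupervised Discretize filter. The snippet below mirrors case D (5 close-to-equal-height bins) and is again only a sketch; the resulting Instances object can be passed to the ranking code above.

    import java.io.BufferedReader;
    import java.io.FileReader;

    import weka.core.Instances;
    import weka.filters.Filter;
    import weka.filters.unsupervised.attribute.Discretize;

    // Illustrative sketch: discretize the numeric attributes of iris.arff
    // into 5 close-to-equal-height (equal-frequency) bins, as in case D.
    public class DiscretizeIris {
        public static void main(String[] args) throws Exception {
            Instances data = new Instances(new BufferedReader(new FileReader("iris.arff")));

            Discretize disc = new Discretize();
            disc.setBins(5);                    // bins = 5
            disc.setUseEqualFrequency(true);    // close-to-equal-height bins
            disc.setInputFormat(data);          // must be called before useFilter

            // Only numeric attributes are discretized; the nominal class is left alone.
            Instances discretized = Filter.useFilter(data, disc);
            discretized.setClassIndex(discretized.numAttributes() - 1);  // class stays last

            System.out.println(discretized.toSummaryString());
        }
    }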

Evaluation
Once you have performed the experiments, you should spend some time evaluating your results. In particular, try to answer at least the following questions:
- Why would one need attribute relevance ranking?
- Do these attribute-ranking methods often agree or disagree?
- On which data set(s), if any, do these methods disagree?
- Do discretization and the choice of discretization method affect the results of attribute ranking?
- Do missing values affect the results of attribute ranking?
Record these and any other observations in a Word file called Observations.doc.
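When thinking about why the two rankings may agree or disagree, it may help to recall the standard definitions the two evaluators implement. In LaTeX notation, with $H$ denoting entropy:

    \mathrm{InfoGain}(C, A) = H(C) - H(C \mid A)

    \mathrm{GainRatio}(C, A) = \frac{H(C) - H(C \mid A)}{H(A)}

Gain ratio normalizes information gain by the entropy of the attribute itself, which penalizes attributes with many distinct values; this normalization is one reason the two rankings can diverge, particularly when discretization changes the number and size of an attribute's value groups.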

Assignment Submission and Grading


On or before the due date, please submit the output.txt file with the results of your runs and the Observations.doc file in a single zipped file through the Blackboard system. Please adhere to the following submission procedure:
1. ZIP all files using WinZip.
2. Name the zipped file as follows: LastnameFirstnameAssign1.zip.
3. Submit the zipped file through the digital drop box in the Blackboard system.
Grading will be based on the correctness of the results in your output file as well as the extensiveness, clarity, and correctness of your observations. Good luck!
