Submitted by: Shashidhar Shenoy N (10BM60083) MBA, 2nd Year, Vinod Gupta School of Management, IIT Kharagpur As part of the course IT for Business Intelligence
Introduction to Weka
Weka stands for Waikato Environment for Knowledge Analysis. It is free, open source software developed at the University of Waikato, New Zealand. It is a very popular suite of machine learning software, containing a collection of visualization tools and algorithms for data analysis and predictive modeling, together with graphical user interfaces for easy access to this functionality. Although not as sophisticated as some other statistical packages, Weka owes its popularity to the fact that it is not only free but also open source, which means that new algorithms can be implemented by taking the existing algorithms and modifying them as needed. Weka can be used to carry out a wide variety of operations on data. Some of the important operations that can be carried out using the Weka suite are: classification of data, regression analysis and prediction, clustering of data, and association of data.
This document is a quick guide to carrying out some of these operations.
The data is imported into Weka in its native ARFF (Attribute-Relation File Format) format. Weka also supports importing the ubiquitous .csv format. To load a file, click on Explorer in the Weka GUI Chooser and then go to Open file... under the Preprocess tab.
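For readers who have not seen an ARFF file, the following is a minimal sketch of its layout: a relation name, one attribute declaration per column, and then the comma-separated data rows. The helper name `to_arff`, the relation name, the attribute names and the values are all made up for illustration and are not taken from the data used in this paper.

```python
def to_arff(relation, attributes, rows):
    """Render rows as ARFF text.

    attributes: list of (name, arff_type) pairs, e.g. ("height_cm", "numeric").
    rows: list of sequences holding one value per attribute.
    """
    lines = [f"@relation {relation}", ""]
    for name, arff_type in attributes:
        lines.append(f"@attribute {name} {arff_type}")
    lines += ["", "@data"]
    for row in rows:
        if len(row) != len(attributes):
            raise ValueError("row length does not match attribute count")
        lines.append(",".join(str(value) for value in row))
    return "\n".join(lines) + "\n"


# Hypothetical two-column data set (names and values are invented).
arff_text = to_arff(
    "body_measurements",
    [("height_cm", "numeric"), ("body_weight_kg", "numeric")],
    [(150, 52.0), (165, 61.5), (180, 77.0)],
)
print(arff_text)
```

Saving the printed text with an .arff extension should give a file that can be opened from the Preprocess tab in the same way as a .csv file.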
Once the file is loaded, a variety of pre-processing operations can be performed on the data. The data can also be edited using the Edit option. The left section of the Explorer window lists all of the columns in the data (Attributes) and shows the number of rows of data supplied (Instances). When a column is selected, the right section of the Explorer window gives information about the data in that column of the data set. There is also a visual way of examining the data, which we can see by clicking the Visualize All button.
The next step is to perform the regression analysis. For this, we go to the Classify tab and click on the Choose button. Since we are running a simple linear regression, we go to classifiers -> functions -> SimpleLinearRegression and click on it. Once this is done, we need to supply the test options for building the regression model. The following options are available:
Use training set. The classifier is evaluated on how well it predicts the class of the instances it was trained on.
Supplied test set. The classifier is evaluated on how well it predicts the class of a set of instances loaded from a file. Clicking the Set... button brings up a dialog allowing you to choose the file to test on.
Cross-validation. The classifier is evaluated by cross-validation, using the number of folds entered in the Folds text field.
Percentage split. The classifier is evaluated on how well it predicts a certain percentage of the data which is held out for testing. The amount of data held out depends on the value entered in the % field.
Choose one of these for the model, make sure that the dependent variable shown in the field below is body weight (kg), and click on Start. This is the output we get:
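The difference between the Percentage split and Cross-validation options can be sketched in a few lines of Python. This is only an illustration of the idea, not Weka's own implementation: the function names and the seed are hypothetical, and the sketch assumes that the percentage denotes the share of instances used for training, with the remainder held out for testing.

```python
import random


def percentage_split(n_instances, train_percent, seed=1):
    """Return (train_indices, test_indices) for a percentage split."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)  # shuffle so the split is not order-dependent
    n_train = round(n_instances * train_percent / 100)
    return indices[:n_train], indices[n_train:]


def cross_validation_folds(n_instances, n_folds, seed=1):
    """Yield (train_indices, test_indices) once per fold."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into n_folds nearly equal groups.
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


train, test = percentage_split(100, 80)
print(len(train), len(test))  # 80 20

for train, test in cross_validation_folds(10, 5):
    print(sorted(test))  # each instance appears in exactly one test fold
```

With a percentage split the model is trained and tested once; with cross-validation every instance is used for testing exactly once, at the cost of training the model once per fold.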
The output gives the model summary and the details of the regression. Thus, a simple linear regression model has been built using the Weka suite.
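To make clear what a simple linear regression computes, here is a minimal sketch of an ordinary least-squares fit of one dependent variable on one independent variable. It is not Weka's code; the function name and the sample numbers are hypothetical and chosen only so that the result is easy to check by hand.

```python
def fit_simple_linear_regression(xs, ys):
    """Return (slope, intercept) of the least-squares line y = slope * x + intercept."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of cross deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up points lying exactly on y = 2x + 1.
slope, intercept = fit_simple_linear_regression([1, 2, 3, 4], [3, 5, 7, 9])
print(slope, intercept)  # 2.0 1.0
```

The slope and intercept play the same role as the coefficient and constant term reported in the regression output.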
Number of attributes: 8
Missing values: Yes. 8 instances are removed because their horsepower value is unknown.
This data set is loaded into the Weka suite using the Open file... option as explained before. This is how the window looks when the data is imported.
The first seven attributes are all independent variables, while the eighth one, i.e. CLASS, is the dependent variable for which we try to build a predictive model. Before doing so, we can use as many visualizations of the data as necessary to see the relevant information in each attribute.
The next step is to perform the regression. Go to the Classify tab, click on the Choose button, and go to classifiers -> functions -> LinearRegression. Once this is done, we need to supply the test options for building the regression model, in the same manner as for simple linear regression. We initially choose a Percentage split of 80%, so that 80% of the data is used for training and the remaining 20% for testing, and see the output:
This model may look complex to beginners, but it is not. For example, the first line of the regression model, -2.2744 * cylinders=6,3,5,4, means that the coefficient is multiplied by 1 if the car has 6, 3, 5 or 4 cylinders, and by 0 otherwise. So if the car has six cylinders, you would place a 1 in this column, and if it has eight cylinders, you would place a 0. We could use a test set, see the deviation from the expected results and calculate the error. Example data:
data = 8,390,190,3850,8.5,70,1,15

class (aka MPG) =
-2.2744 * 0 +
-4.4421 * 0 +
6.74 * 0 +
0.012 * 390 +
-0.0359 * 190 +
-0.0056 * 3850 +
1.6184 * 0 +
1.8307 * 0 +
1.8958 * 0 +
1.7754 * 0 +
1.167 * 0 +
1.2522 * 0 +
2.1363 * 0 +
37.9165

Expected Value = 15 mpg
Regression Model Output = 14.2 mpg
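The arithmetic behind this worked example can be checked with a short script. The coefficients, input values and constant term are copied from the model above; the helper `indicator` and the variable names are illustrative only. The indicator is shown for the first term, whose condition (cylinders=6,3,5,4) is stated in the text; the remaining 0/1 inputs are taken as given in the example.

```python
def indicator(value, members):
    """Return 1 if value is one of the listed categories, else 0."""
    return 1 if value in members else 0


# The example car has 8 cylinders, so the first term is switched off.
cylinders = 8
first_term_input = indicator(cylinders, {6, 3, 5, 4})

coefficients = [-2.2744, -4.4421, 6.74, 0.012, -0.0359, -0.0056,
                1.6184, 1.8307, 1.8958, 1.7754, 1.167, 1.2522, 2.1363]
inputs = [first_term_input, 0, 0, 390, 190, 3850, 0, 0, 0, 0, 0, 0, 0]
constant = 37.9165

# Weighted sum of the inputs plus the constant term.
predicted_mpg = sum(c * x for c, x in zip(coefficients, inputs)) + constant
expected_mpg = 15

print(round(predicted_mpg, 1))                 # 14.2
print(round(expected_mpg - predicted_mpg, 1))  # 0.8 (absolute error in mpg)
```

Only the three numeric terms contribute here: 0.012 * 390 = 4.68, -0.0359 * 190 = -6.821 and -0.0056 * 3850 = -21.56, which together with the constant 37.9165 give about 14.2 mpg.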
So, we see that the regression model output is close to the expected value, and thus we have a simple working predictive model. We could continue to refine this model to improve its accuracy. We can also use visualization to plot each of the independent variables against the dependent one and see how it varies. A sample plot of horsepower versus miles per gallon is shown. The relationship is an inverse one: miles per gallon falls as horsepower rises.
Use the Open file... option to import the ARFF file into the Weka suite as instructed before. The tenth attribute, i.e. the contraceptive method used, is the predicted variable, and the data looks like this:
Next, go to the Classify tab and use the ZeroR algorithm to run the classification model. ZeroR is the most basic classification model: it does nothing but classify all the instances into one class. We ask Weka to run the model using the entire training set, without splitting it into training and test sets. This can be done by choosing Use training set under Test options, as explained in the case of regression before. As expected, the model will be inaccurate. This is the output from Weka.
Of particular importance is the confusion matrix, which shows the correctly and incorrectly classified instances. Here, we see that all samples have been classified as a: the 333 samples which should have been classified as b and the 511 samples which should have been classified as c are incorrectly classified as a. Thus, the accuracy of the model is only about 42.7% (629 out of 1473 samples). We could now try more accurate algorithms like NaiveBayes or NaiveBayesUpdateable to improve the accuracy of the predictions. Here is the output of the NaiveBayes simple classification scheme:
Here we see that the accuracy of this model, although still below acceptable levels, has improved over the previous model. Thus, we can make the predictions more accurate by using better algorithms. Various visualization schemes are available which help visualize the independent and dependent variables.
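The ZeroR baseline discussed earlier in this section, and the accuracy read off its confusion matrix, can be reproduced with a minimal sketch. The class counts (629, 333 and 511) are the ones reported above; the function names are hypothetical and the code only illustrates the idea of always predicting the majority class, not Weka's own implementation.

```python
from collections import Counter


def zero_r(train_labels):
    """Return the most frequent class label; ZeroR predicts it for every instance."""
    return Counter(train_labels).most_common(1)[0][0]


def confusion_matrix(actual, predicted, classes):
    """Return matrix[actual_class][predicted_class] = number of instances."""
    matrix = {a: {p: 0 for p in classes} for a in classes}
    for a, p in zip(actual, predicted):
        matrix[a][p] += 1
    return matrix


def accuracy(matrix):
    """Fraction of instances on the diagonal of the confusion matrix."""
    correct = sum(matrix[c][c] for c in matrix)
    total = sum(sum(row.values()) for row in matrix.values())
    return correct / total


classes = ["a", "b", "c"]
actual = ["a"] * 629 + ["b"] * 333 + ["c"] * 511  # class counts from the text

majority = zero_r(actual)
predicted = [majority] * len(actual)  # evaluate on the training set itself
matrix = confusion_matrix(actual, predicted, classes)

print(majority)                            # a
print(matrix["b"]["a"], matrix["c"]["a"])  # 333 511
print(round(100 * accuracy(matrix), 1))    # 42.7
```

Any classifier worth using on this data should beat this 42.7% baseline, which is why ZeroR is a useful first reference point.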
Conclusion
In this term paper, two simple techniques for getting started with Weka, regression and classification, are presented. In regression, we have demonstrated how Weka can be used to build a regression model with one dependent variable and many independent variables. The live example used was the prediction of automobile miles per gallon from several attributes of a car. In classification, we have demonstrated how Weka can be trained to classify a given data set based on observations in a training set. The live data used was the choice of contraceptive method based on a number of demographic factors. Though the results shown here are modest, the real power of Weka lies in the fact that the algorithms can be tuned and trained to produce better results. Since the source code is open, anyone can download it and modify the existing algorithms to produce more accurate ones. Hence, Weka is used by many researchers in their studies.
References
1. Weka reference manual (PDF), available on the Weka website
2. http://www.cs.waikato.ac.nz/ml/weka/
3. http://archive.ics.uci.edu/ml/datasets.html
4. http://www.ibm.com/developerworks/opensource/library/os-weka1/index.html#N100F6