
Data Mining Techniques Using Weka

(IT for Business Intelligence)

Submitted By: Amit Bhatia 10BM60005 MBA 2010-12

Introduction
Data mining, the analysis step of the knowledge discovery in databases (KDD) process, is a relatively young and interdisciplinary field of computer science concerned with discovering new patterns in large data sets, using methods at the intersection of artificial intelligence, machine learning, statistics, and database systems. The overall goal of the data mining process is to extract knowledge from a data set in a human-understandable structure; beyond the raw analysis step, it involves database and data management aspects, data pre-processing, model and inference considerations, interestingness metrics, complexity considerations, post-processing of discovered structure, visualization, and online updating.

Weka (Waikato Environment for Knowledge Analysis) is a popular suite of machine learning software written in Java, developed at the University of Waikato, New Zealand. Weka is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from your own Java code. Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. It is also well suited for developing new machine learning schemes.

Example 1: Classification via Decision Trees in WEKA

This example illustrates the use of the C4.5 (J48) classifier in WEKA. The sample data set used for this example is the bank data available in comma-separated format (bank-data.csv) from the WEKA site. WEKA has implementations of numerous classification and prediction algorithms.

Step 1: We begin by loading the data into WEKA by choosing the PREPROCESS option.

Step 2: Next we click the CLASSIFY tab and click on the CHOOSE button to select the J48 classifier.
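J48 is WEKA's implementation of the C4.5 algorithm, which grows the tree by repeatedly splitting on the attribute that most reduces class entropy (C4.5 itself refines this into a gain ratio). As a minimal sketch of the underlying idea, in Python and on made-up toy rows rather than the real bank data:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attr_index):
    """Entropy reduction obtained by splitting on one nominal attribute."""
    total = entropy(labels)
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attr_index], []).append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return total - remainder

# Toy rows (married?, region) with a yes/no class label -- purely illustrative
rows = [("YES", "TOWN"), ("YES", "RURAL"), ("NO", "TOWN"), ("NO", "RURAL")]
labels = ["YES", "YES", "NO", "NO"]
print(information_gain(rows, labels, 0))  # gain 1.0: attribute 0 separates the classes perfectly
print(information_gain(rows, labels, 1))  # gain 0.0: attribute 1 tells us nothing
```

The tree builder picks the highest-gain attribute, splits the data, and recurses on each branch until the leaves are (nearly) pure.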

Under "Test options" in the main panel we select 10-fold cross-validation as our evaluation approach. Since we do not have a separate evaluation data set, this is necessary to get a reasonable estimate of the accuracy of the generated model. We now click "Start" to generate the model. The ASCII version of the tree as well as the evaluation statistics will appear in the right panel when model construction is completed. Note that the classification accuracy of our model is only about 69%. This may indicate that more work is needed (either in pre-processing or in selecting the correct parameters for classification) before building another model. WEKA also lets us view a graphical rendition of the classification tree. This can be done by right-clicking the last result set (as before) and selecting "Visualize tree" from the pop-up menu.
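Ten-fold cross-validation partitions the data into 10 folds, trains on 9 of them, tests on the held-out fold, and averages the 10 accuracies. A sketch of the fold bookkeeping in plain Python, assuming a 600-instance data set (WEKA additionally randomizes and stratifies the folds):

```python
def cross_validation_folds(n_instances, k=10):
    """Yield (train_indices, test_indices) for each of k folds."""
    indices = list(range(n_instances))
    start = 0
    for i in range(k):
        # spread any remainder instances over the first folds
        size = n_instances // k + (1 if i < n_instances % k else 0)
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

folds = list(cross_validation_folds(600, 10))
print(len(folds))        # 10 folds
print(len(folds[0][1]))  # 60 test instances in each fold
print(len(folds[0][0]))  # 540 training instances in each fold
```

Every instance is tested exactly once, which is why the method gives a reasonable accuracy estimate without a separate evaluation set.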

Step 3: ARFF file. In this case, we open the file "bank-new.arff" and, upon returning to the main window, we click the "Start" button. This once again generates the model from our training data, but this time it applies the model to the new, unclassified instances in the "bank-new.arff" file in order to predict the value of the "pep" attribute. In our test instances the value of the class attribute ("pep") was left as "?", as WEKA has no actual values against which it can compare the predicted values of the new instances.

Step 4: We are interested in knowing how our model classified the new instances. To do so, we need to create a file containing all the new instances along with their predicted class values resulting from the application of the model. Doing this is much simpler using the command-line version of the WEKA classifier application. However, it is possible to do so in the GUI version using an "indirect" approach, as follows. First, right-click the most recent result set in the left "Result list" panel. In the resulting pop-up window select the menu item "Visualize classifier errors". This brings up a separate window containing a two-dimensional graph.

This window also allows us to "save" the classification results from which the graph is generated. In the new window, we click on the "Save" button and save the result as the file "bank-predicted.arff".

Example 2: Regression

Regression is the easiest technique to use, but is also probably the least powerful. The model can be as simple as one input variable and one output variable (called a scatter diagram in Excel, or an XY Diagram in OpenOffice.org). Of course, it can get more complex than that, including dozens of input variables. In effect, regression models all fit the same general pattern: there are a number of independent variables which, when taken together, produce a result, the dependent variable. The regression model is then used to predict the value of an unknown dependent variable, given the values of the independent variables.

There are two main forms:
1. Simple linear regression
2. Nonlinear regression
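In the simple (one input, one output) case, the coefficients have a closed-form least-squares solution. A sketch in Python, on hypothetical experience-vs-sales data:

```python
def fit_simple_regression(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: years of experience vs. yearly sales (in $1000s)
experience = [1, 2, 3, 4, 5]
sales = [52, 54, 58, 60, 66]
slope, intercept = fit_simple_regression(experience, sales)
print(round(slope, 2), round(intercept, 2))  # 3.4 47.8
prediction = slope * 6 + intercept  # predicted sales after 6 years: about 68.2
```

The fitted line minimizes the sum of squared vertical distances to the data points, which is exactly what a linear-regression tool computes behind the scenes.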

Linear Regression

Linear regression estimates the coefficients of the linear equation, involving one or more independent variables, that best predict the value of the dependent variable. For example, you can try to predict a salesperson's total yearly sales (the dependent variable) from independent variables such as age, education, and years of experience.

Nonlinear Regression

In statistics, nonlinear regression is a form of regression analysis in which observational data are modeled by a function which is a nonlinear combination of the model parameters and depends on one or more independent variables. The data are fitted by a method of successive approximations.

Linear regression example in Weka

In this example we have taken a file containing students' responses about their GPA, siblings, area of interest, personality, and other attributes:
1. Number of Older Siblings
2. Number of Younger Siblings
3. Grade Point Average
4. Predicted Points in Class
5. Major If Not Psychology
6. Pet
7. Area of Interest
8. Intend To Get PhD or PsyD
9. Section
10. I would rather stay at home and read than go out with my friends
11. One of my favorite pastimes is talking to people
12. I live a fast paced life
13. I hardly ever sit around doing nothing
14. I am an extravert
15. I rarely forget when an appointment is
16. I often have overdue library books just because I forgot to return them
17. I usually put bills next to the front door so I will remember to mail them
18. If I tell friends that I will meet them for dinner, I rarely forget my commitment
19. I rely on a calendar / day-planner to remember what I am supposed to do
For attributes 10-19, students were asked to rate their preference on a scale of 5, with 1 = Strongly Agree and 5 = Strongly Disagree.
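The "successive approximations" that nonlinear regression relies on can be illustrated with a tiny gradient-descent fit of the one-parameter model y = exp(b * x). This is only a sketch on synthetic data: real nonlinear fits estimate several parameters, and statistical packages use more robust iterative schemes (e.g. Gauss-Newton or Levenberg-Marquardt).

```python
import math

def fit_nonlinear(xs, ys, lr=0.01, steps=5000):
    """Fit y = exp(b * x) by gradient descent on the squared error,
    i.e. repeated successive approximations of the parameter b."""
    b = 0.0  # initial guess
    for _ in range(steps):
        grad = sum(2 * (math.exp(b * x) - y) * x * math.exp(b * x)
                   for x, y in zip(xs, ys))
        b -= lr * grad  # move b a small step against the error gradient
    return b

# Synthetic data generated from y = exp(0.5 * x)
xs = [0.0, 1.0, 2.0]
ys = [math.exp(0.5 * x) for x in xs]
b = fit_nonlinear(xs, ys)
print(round(b, 3))  # converges to b = 0.5
```

Unlike the linear case, there is no closed-form answer here; each iteration merely improves the previous estimate, which is why nonlinear regression is described as fitting by successive approximations.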

Steps to be followed
1. Select Explorer from the WEKA GUI window and load the regression file as described in the clustering example. The following screen will appear:

2. Click the Classify tab in the Explorer window and then click the Choose button in the Classifier panel. Then select LinearRegression from functions. The following screen will appear:

3. In this case we are predicting the value of the attribute "I would rather stay at home and read than go out with my friends", so select it.
4. Press the Start button. The following output will be generated:

5. The output can also be viewed in a separate window (as described earlier in the clustering example).

From the regression equation we see that the "I would rather stay at home and read than go out with my friends" attribute is positively correlated with "Major If Not Psychology=ENGLISH,ART,MATH", "Major If Not Psychology=ART,MATH", and "I rarely forget when an appointment is", and negatively correlated with the attributes "Number of Older Siblings", "I hardly ever sit around doing nothing", "I often have overdue library books just because I forgot to return them", and "If I tell friends that I will meet them for dinner, I rarely forget my commitment".
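Once fitted, the equation is applied by taking the intercept plus the weighted sum of the attribute values; the sign of each coefficient is what makes a correlation positive or negative. The coefficients below are hypothetical stand-ins, not WEKA's actual output for this data set:

```python
def predict(intercept, coefficients, values):
    """Apply a fitted linear-regression equation: intercept plus weighted sum."""
    return intercept + sum(c * v for c, v in zip(coefficients, values))

# Hypothetical coefficients: a positive one raises the prediction,
# a negative one lowers it (e.g. a rating and a sibling count).
coefs = [0.8, -0.3]
intercept = 2.5
result = predict(intercept, coefs, [4, 2])  # 2.5 + 0.8*4 - 0.3*2 = 5.1
print(round(result, 2))  # 5.1
```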
