
A Comparative Study on Mushrooms Classification

Muhammad Azhar Fairuzz Hiloh
Faculty of Information Science and Technology, National University of Malaysia

Abstract. This paper illustrates how several data mining tools can be used to classify a mushrooms dataset. Decision Trees, Artificial Neural Networks and Rough Set Theory are the three classifiers used in this study to classify the dataset, which contains 22 attributes, as either edible or poisonous. The experiments for Decision Trees and Artificial Neural Networks were performed in the WEKA environment, applying the J48 and ID3 algorithms and the Multilayer Perceptron respectively. The Rough Set Theory experiments were performed in the Rosetta environment, applying a genetic algorithm for reduction. The study reports the accuracy of each classifier, and the results are compared and discussed to determine which classifier best represents the classification of the mushrooms dataset.

Keywords: Classification, Decision Trees, Artificial Neural Network, Rough Set Theory.

1 Introduction

Classification is the task of learning a target function, also known informally as a classification model, that maps each attribute set to one of the predefined class labels [1]. Data classification is normally performed as a two-step process [2,10]. The first is the learning step, in which a classification algorithm builds the classifier by analyzing a training set made up of database tuples and their associated class labels. In the second step, the predictive accuracy of the classifier is estimated using a test set made up of test tuples and their associated class labels [2,10]. The accuracy of a classifier on a given test set is the percentage of test set tuples that are correctly classified by the classifier. The classification process needs to be fast, reliable and adaptable [12]. Recent data mining research has contributed to the development of scalable algorithms for classification, and as a result several methods are available for effective classification. Most classification algorithms seek a model that reaches the highest accuracy or, equivalently, the lowest error rate when applied to the test set [1]. To develop a reliable model that makes fewer classification errors, the available population of examples whose classes are already known is normally divided into two subsets: a training set and a test set. Examples from the training set are used to induce a model, whereas examples from the test set are used to evaluate its accuracy. The accuracy is measured by the error rate of classification when the model is used to classify the examples in the test set. Normally, several classification models are developed by using different parts of the available population of examples as training and test sets, and a model with significantly better accuracy is then selected as the final model [3].
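The hold-out procedure and the accuracy measure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the WEKA or Rosetta procedure used in the study: the function names `holdout_split` and `accuracy`, the toy examples and the majority-class "model" are all invented for the example.

```python
import random

def holdout_split(examples, train_fraction, seed=0):
    """Shuffle labelled examples and cut them into a training set and a test set."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def accuracy(true_labels, predicted_labels):
    """Percentage of test tuples whose predicted label equals the true label."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return 100.0 * correct / len(true_labels)

# Toy labelled examples: (attribute tuple, class label). The values are invented.
examples = [
    (("foul",), "poisonous"), (("none",), "edible"),
    (("almond",), "edible"), (("pungent",), "poisonous"),
    (("none",), "edible"), (("foul",), "poisonous"),
    (("anise",), "edible"), (("fishy",), "poisonous"),
    (("none",), "edible"), (("almond",), "edible"),
]

# A 60:40 data allocation: 60% of the examples for training, 40% for testing.
train, test = holdout_split(examples, train_fraction=0.6)

# Deliberately trivial stand-in for a classifier (an assumption of this sketch):
# always predict the majority class seen in the training set.
train_labels = [label for _, label in train]
majority = max(set(train_labels), key=train_labels.count)

test_labels = [label for _, label in test]
predictions = [majority for _ in test]

acc = accuracy(test_labels, predictions)
error_rate = 100.0 - acc  # accuracy and error rate are complementary
print(f"accuracy = {acc:.2f}%, error rate = {error_rate:.2f}%")
```

Repeating the split with different fractions or seeds gives the "number of attempts" mentioned above, from which the model with the best test accuracy would be kept.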
The mushrooms dataset (available at http://archive.ics.uci.edu/ml/), after the preprocessing tasks are applied, contains 21 attributes, and each species is classified as either edible or poisonous. A mushroom, which grows in soil, wood or decaying matter, is actually the fruiting body of a fungus. Fungi with a neat, clean appearance are not necessarily edible, and ugly or messy fungi are not necessarily poisonous [18]. In this dataset, the characteristics of each mushroom from its root to its cap, as well as its habitat and population, are collected to describe the species and classify it as edible or poisonous. In this paper, based on the 21 attributes of the mushrooms dataset, a model or classifier is constructed to predict the categorical labels of the mushrooms data, which are edible or poisonous.

2 Related Works

The benefits of classification extend across various industries. The decision trees technique, for instance, has been used to develop treatment plan support in outpatient healthcare delivery, where decision support that incorporates expert knowledge or evidence-based clinical data has emerged as an alternative for improving healthcare delivery and supporting decision-making processes [4]. Decision trees have also been used as a classifier in health-related work to model outbreak detection in oil and gas pollution areas [5] and dengue outbreaks [6]. In addition, [7] compared the available decision tree algorithms on a spam email dataset. That study compared the ID3, CART, J48 and ADTree algorithms in classifying email as spam or non-spam, and its results suggested J48 as the best algorithm for the spam email dataset. The J48 algorithm is also used in [4-6]. Although decision trees have been widely used in health, they are not always the best classification technique in this discipline. A comparative study of classification methods in cardiovascular disease prediction found the Support Vector Machine to be the best classifier compared with the rule-based RIPPER technique, Decision Tree and ANN [8]. Classification using ANN has likewise been used in numerous fields, including health. For instance, ANN assists in the diagnosis of neurological disease by classifying electroencephalograph (EEG) signals as either normal or epileptic [9]. Using a three-layer feed-forward ANN as the classifier, that study also showed that relative wavelet energy is a useful technique for representing the characteristics of EEG signals [9]. ANN classifiers have also contributed to multimedia indexing by classifying audio data into segments such as speech, music and silence [12].
The Backpropagation (BP) algorithm is the most popular learning method in the neural networks community, but it has two weaknesses: slow computing speed and the possibility of getting trapped in local minima [10]. To address these, a hybrid method combining Particle Swarm Optimization (PSO) and BP, known as PSO-BP, was developed to improve the algorithm, leading further to a Modified PSO (MPSO) that gave better results than BP and PSO-BP in protein feature classification [10]. Another example of work to improve ANN is the evolution of ANN for pairwise classification using gene expression programming [11]. Rough set theory (RST) is very useful, especially in handling imprecise data and extracting relevant patterns from crude data for proper utilization of knowledge [13]. For example, it can be used to show the rough relationships between historical and current traffic pattern data and to propose a traffic signal system that is based on the patterns of historical data yet does not follow a fixed cycle [13]. In addition, RST helps in selecting relevant features for web-page classification [14] and in extracting multiple classification rules, each of which correctly identifies a subset of the shots of an event, for event retrieval in video archives [15, 16]. When rough set theory is combined with the Naïve Bayes classifier, the resulting Rough-Naïve Bayes classifier can achieve better performance than Naïve Bayes or rough sets alone [17].

3 Mushrooms Classification Methods

The mushrooms dataset is tested with three classification methods: Decision Trees, Artificial Neural Networks and Rough Set Theory. A decision tree consists of internal nodes that represent attributes, leaf nodes that each hold a class label, and links from a parent node to a child node that each represent a value of the parent node's attribute [2,3]. Decision trees have emerged as a popular classifier because they do not require any domain knowledge or parameter setting and hence are appropriate for exploratory knowledge discovery [2]. In addition, decision trees can handle high-dimensional data, their learning and classification steps are simple and fast, and they generally have good accuracy [2]. An artificial neural network (ANN) is a connected network of artificial neuron nodes that imitates the network of biological neurons in the human brain [3]. An ANN consists of many computational neural units connected to each other. In an ANN, knowledge of the problem is distributed across the neurons and the connection weights of the links between them. The network has to be trained to adjust the connection weights and biases in order to produce the desired mapping [9]. One advantage of neural networks is their high tolerance of noisy data, as well as their ability to classify patterns on which they have not been trained [2]. However, ANN is suitable only for applications where long training times are acceptable [2], as it involves slow computing speed and the possibility of getting trapped in local minima [10]. Rough set theory (RST), on the other hand, is used to discover structural relationships within imprecise or noisy data. This classifier applies to discrete-valued attributes; consequently, continuous-valued attributes must be discretized before it is used [2]. RST is very useful, especially in handling imprecise data and extracting relevant patterns from crude data for proper utilization of knowledge [13]. In addition, rough set theory can be combined with other classifiers.
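To make the rough set idea concrete, the following sketch computes the standard lower and upper approximations of a class over discrete-valued attributes. It is a minimal illustration, not Rosetta's implementation: the function names, the two attributes (`odor`, `cap`) and the five toy objects are assumptions made for the example.

```python
from collections import defaultdict

def indiscernibility_classes(objects, attributes):
    """Group object ids that share the same values on the chosen attributes."""
    groups = defaultdict(set)
    for obj_id, values in objects.items():
        key = tuple(values[a] for a in attributes)
        groups[key].add(obj_id)
    return list(groups.values())

def approximations(objects, attributes, target):
    """Return the (lower, upper) approximations of a target set of object ids."""
    lower, upper = set(), set()
    for block in indiscernibility_classes(objects, attributes):
        if block <= target:
            lower |= block  # every object in the block certainly belongs to the target
        if block & target:
            upper |= block  # at least one object in the block belongs to the target
    return lower, upper

# Hypothetical toy objects with discrete-valued attributes.
objects = {
    1: {"odor": "foul", "cap": "flat"},
    2: {"odor": "foul", "cap": "flat"},
    3: {"odor": "none", "cap": "bell"},
    4: {"odor": "none", "cap": "bell"},
    5: {"odor": "none", "cap": "flat"},
}
poisonous = {1, 2, 3}  # ids of the objects labelled poisonous in this toy example

lower, upper = approximations(objects, ["odor", "cap"], poisonous)
boundary = upper - lower
print(lower, upper, boundary)
```

Objects 3 and 4 have identical attribute values but different labels, so they fall in the boundary region: the attributes alone cannot decide their class. This is the kind of imprecision that rough sets are designed to describe.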
As an example of such a combination, the ability of RST to detect attribute dependency and significance is used in the Rough-Naïve Bayes (RNB) algorithm [17]. The experiments in this study were conducted in the WEKA environment for the decision tree and ANN classifiers, whereas the RST experiments were conducted in Rosetta. The ID3 and J48 algorithms are used for decision trees, the multilayer perceptron for ANN, and a genetic algorithm for RST. For each classifier, the split ratio between training and test sets is varied from 90:10 to 10:90, and the results are compared and discussed in the following section.

4 Results and Discussion

This section describes the experimental results from the various models built for each classifier. The best model identified for each classifier is then compared with those of the other classifiers to discover which classifier is best suited to classifying the mushrooms dataset. The experiments begin with identifying the best model for the decision trees classifier. For this purpose, the ID3 and J48 algorithms are used and compared. Data allocations from 90:10 to 10:90 are tested to compare the two algorithms on accuracy, mean squared error, time taken to build the model, and length of the rules generated from the dataset. The experiments show that ID3 and J48 have the same accuracy for every data allocation. This might be because J48 is an evolution and refinement of ID3 [7]. However, J48 yields slightly better results in time taken to build the model and length of rules. Nevertheless, all J48 and ID3 models have acceptable accuracy, ranging from 98.58% to 100%. For this study, the J48 algorithm with the 60:40 data allocation is chosen to represent the decision trees classifier.
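One commonly cited difference between the two algorithms is the attribute selection measure: ID3 picks the attribute with the highest information gain, while C4.5, which J48 implements in WEKA, normalizes that gain by the split information (gain ratio). The sketch below computes both measures on a toy dataset. It is an illustration only: the function names and the four toy rows are assumptions, and it omits everything else the real algorithms do, such as pruning.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of discrete labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def partition(rows, labels, attr_index):
    """Group the class labels by the value each row takes on one attribute."""
    parts = {}
    for row, label in zip(rows, labels):
        parts.setdefault(row[attr_index], []).append(label)
    return parts

def information_gain(rows, labels, attr_index):
    """Entropy reduction achieved by splitting on the attribute (ID3 criterion)."""
    total = len(labels)
    remainder = sum(len(part) / total * entropy(part)
                    for part in partition(rows, labels, attr_index).values())
    return entropy(labels) - remainder

def gain_ratio(rows, labels, attr_index):
    """Information gain divided by split information (C4.5 criterion)."""
    split_info = entropy([row[attr_index] for row in rows])
    if split_info == 0:
        return 0.0  # the attribute takes a single value, so it cannot split the data
    return information_gain(rows, labels, attr_index) / split_info

# Hypothetical toy rows: (odor, cap shape). Odor separates the classes perfectly.
rows = [("foul", "flat"), ("foul", "bell"), ("none", "flat"), ("none", "bell")]
labels = ["poisonous", "poisonous", "edible", "edible"]

for index, name in enumerate(["odor", "cap"]):
    print(name, information_gain(rows, labels, index), gain_ratio(rows, labels, index))
```

When an attribute separates the classes cleanly, as `odor` does here, both measures rank it first, which is consistent with the two algorithms reaching the same accuracy on an easily separable dataset.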

As with decision trees, all multilayer perceptron models produce acceptable accuracy. Hence, the experiment focuses on the number of hidden layers that best represents the ANN classifier. The results show that the more hidden layers are applied, the smaller the mean squared error of the model. Nonetheless, increasing the number of hidden layers increases the time taken to build the model. Therefore, the model with the smallest number of hidden layers that still produces acceptable accuracy is selected to represent the ANN classifier. A genetic algorithm (GA) is applied to discover the best model for RST. The mushroom dataset is repeatedly split into two parts, following each data allocation, to build a model. The reduction step is performed using GA and the rules are generated. The rules, when applied to the remaining test objects, give acceptable accuracy for every data allocation. Hence, the model with the smallest number of rules is chosen to represent RST. The models representing each classifier are as follows:
Table 1: Comparative results of Classification Techniques

Technique   Accuracy (%)   Number of Rules   Length of Rules
DT          100            23                6
ANN         100            2*                NIL
RST         100            6930              7

*No. of Hidden Layers
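The selection rule used for the RST row of Table 1 (the model with the fewest rules among those with acceptable accuracy) can be sketched as follows, using the figures reported for the rough set models in the Appendix. The function name `pick_smallest` is hypothetical, and the 100% threshold is an assumption of this sketch, since the text does not state a numeric cut-off for "acceptable" accuracy.

```python
# (data allocation, accuracy in %, number of rules) as reported for the
# rough set classification models in the Appendix.
rst_models = [
    ("90:10", 100.0, 7289),
    ("80:20", 100.0, 7546),
    ("70:30", 100.0, 8058),
    ("60:40", 100.0, 6930),
    ("50:50", 99.90, 7829),
    ("40:60", 100.0, 8922),
    ("30:70", 99.96, 11538),
    ("20:80", 99.97, 8889),
    ("10:90", 100.0, 8480),
]

def pick_smallest(models, min_accuracy=100.0):
    """Keep models that reach min_accuracy, then return the one with the fewest rules.

    The default threshold is an assumption made for this sketch.
    """
    eligible = [m for m in models if m[1] >= min_accuracy]
    return min(eligible, key=lambda m: m[2])

chosen = pick_smallest(rst_models)
print(chosen)
```

Under this assumption the rule selects the 60:40 allocation with 6930 rules, which matches the RST row of Table 1.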

Table 1 presents the classification accuracy of the three classifiers. The results show that DT, ANN and RST alike classify the mushrooms dataset with 100% accuracy. Therefore, all three classifiers used in this study are equally good tools for classifying the mushrooms dataset.

5 Conclusion and Suggestion

In this study, we performed experiments to determine the classification accuracy of three classification techniques, in terms of which classifier better determines whether a mushroom is edible or poisonous, with the help of the data mining tools WEKA and Rosetta. Three classifiers, namely decision trees, artificial neural networks and rough set theory, were compared on the basis of the percentage of correctly classified instances. All three classifiers fall under the classification methods of data mining, which establish a relationship between a dependent (output) variable and independent (input) variables by mapping the data points. In simple terms, a classification problem refers to identifying an object as belonging to a given class, for example whether a mushroom is edible or poisonous. It is clear from the results that none of the selected classifiers misclassified the mushroom dataset: all three are able to classify it with 100% accuracy. Therefore, any of the classification methods in this study can be used as a classifier that reaches the best accuracy. However, this study compares the classifiers only in terms of accuracy. Owing to limitations of the tools, other factors such as the number of rules and the length of rules could not be compared, since these factors are not common to every classifier. Other algorithms available for each classifier, as well as classifiers not named in this study, could be used to re-analyze the dataset and produce useful knowledge in the research field.

References

1. Tan, P. N., Steinbach, M., Kumar, V.: Introduction to Data Mining. Pearson Education, Boston (2006)
2. Han, J., Kamber, M.: Data Mining: Concepts and Techniques (2nd Edition). Morgan Kaufmann, San Francisco (2006)
3. Du, H.: Data Mining Techniques and Applications. Cengage Learning EMEA, Hampshire (2010)
4. Ali, S. N. S., Razali, A. M., Bakar, A. A., Suradi, N. R.: Development Treatment Plan Support in Outpatient Health Care Delivery with Decision Trees Techniques. In: Proceedings of the 6th International Conference on Advanced Data Mining and Applications - Volume Part II, pp. 475-482 (2010)
5. Bakar, A. A., Idris, N., Hamdan, A. R., Othman, Z., Nazari, M. Z. A., Zainudin, S.: Classification Models for Outbreak Detection in Oil and Gas Pollution Area. In: International Conference on Electrical Engineering and Informatics, pp. 1-6 (2011)
6. Bakar, A. A., Kefli, Z., Abdullah, S., Sahani, M.: Predictive Models for Dengue Outbreak Using Multiple Rulebase Classifier. In: International Conference on Electrical Engineering and Informatics, pp. 1-6 (2011)
7. Sharma, A. K., Sahni, S.: A Comparative Study of Classification Algorithms for Spam Email Data Analysis. International Journal on Computer Science and Engineering 3, 1890-1895 (2011)
8. Kumari, M., Godara, S.: Comparative Study of Data Mining Classification Methods in Cardiovascular Disease Prediction. International Journal of Computer Science and Technology 2, 304-308 (2011)
9. Guo, L., Seoane, J. A., Rivero, D., Pazos, A.: Classification of EEG Signals Using Relative Wavelet Energy and Artificial Neural Networks. In: Proceedings of the First ACM/SIGEVO Summit on Genetic and Evolutionary Computation, pp. 177-183 (2009)
10. Shee, B. K., Vipsita, S., Rath, S. K.: Protein Feature Classification Using Particle Swarm Optimization and Artificial Neural Networks. In: Proceedings of the International Conference on Communication, Computing and Security, pp. 313-318 (2011)
11. Johns, S., Santos, M. V.: On the Evolution of Neural Networks for Pairwise Classification Using Gene Expression Programming. In: Proceedings of the Annual Conference on Genetic and Evolutionary Computation, pp. 1903-1904 (2009)
12. Khan, M. K. S., Al-Khatib, W. G., Moinuddin, M.: Automatic Classification of Speech and Music Using Neural Networks. In: Proceedings of the 2nd ACM International Workshop on Multimedia Databases, pp. 94-99 (2004)
13. Bit, M., Beaubouef, T.: A Rough Set Approach for Traffic Light System with No Fixed Cycle. In: CCSC South Central Conference, pp. 14-20 (2009)
14. Wakaki, T., Itakura, H., Tamura, M.: Rough Set-Aided Feature Selection for Automatic Web-Page Classification. In: Proceedings of the IEEE/WIC/ACM International Conference on Web Intelligence, pp. 70-76 (2004)
15. Shirahama, K., Uehara, K.: Example-based Event Retrieval in Video Archive Using Rough Set Theory and Video Ontology. In: Proceedings of the 10th International Workshop on Multimedia Data Mining, pp. 30-37 (2010)
16. Shirahama, K., Uehara, K.: Query by Shots: Retrieving Meaningful Events Using Multiple Queries and Rough Set Theory. In: Proceedings of the 9th International Workshop on Multimedia Data Mining, pp. 43-52 (2008)
17. Al-Aidaroos, K., Bakar, A. A., Othman, Z.: Data Classification Using Rough Sets and Naïve Bayes. In: Proceedings of Rough Set and Knowledge Technology, pp. 134-142 (2010)
18. Hoffmann, H., Bougher, N., Wood, P.: Recognising Edible Fields Mushrooms. Gardennote. Western Australian Agriculture Authority (2010)

APPENDIX
Table 2: Experimental Results for J48 Models

Model  Data Allocation  Accuracy (%)  Model Building Time (Second)  Number of Rules  Mean Squared Error  Length of Rules
1      90:10            100           0.09                          23               0                   6
2      80:20            100           0.11                          23               0                   6
3      70:30            100           0.04                          23               0                   6
4      60:40            100           0.01                          23               0                   6
5      50:50            100           0.02                          23               0                   6
6      40:60            100           0.01                          23               0                   6
7      30:70            99.65         0.01                          23               0.0329              6
8      20:80            99.75         0.01                          23               0.0496              6
9      10:90            98.58         0.01                          23               0.1218              6

Table 3: Experimental Results for ID3 Models

Model  Data Allocation  Accuracy (%)  Model Building Time (Second)  Mean Squared Error  Length of Rules
1      90:10            100           0.10                          0                   5
2      80:20            100           0.06                          0                   5
3      70:30            100           0.05                          0                   5
4      60:40            100           0.15                          0                   5
5      50:50            100           0.04                          0                   5
6      40:60            100           0.03                          0                   5
7      30:70            99.65         0.06                          0.0531              5
8      20:80            99.75         0.05                          0.0496              5
9      10:90            98.58         0.05                          0.0743              5

Table 4: Experimental Results for ANN Models

Model  Data Allocation  Num. Hidden Nodes  Model Building Time  Learning Rate  Accuracy (%)  MSE (Mean Square Error)
1      90:10            a                  430.83               0.3            100           0.0007
2      80:20            a                  431.62               0.3            100           0.0006
3      70:30            a                  441.78               0.3            100           0.0007
4      60:40            a                  435.10               0.3            100           0.0008
5      50:50            a                  442.67               0.3            100           0.0011
6      40:60            a                  440.90               0.3            100           0.0025
7      30:70            a                  449.59               0.3            99.95         0.0175
8      20:80            a                  438.28               0.3            99.98         0.0128
9      10:90            a                  445.16               0.3            99.78         0.0354
10     90:10            2                  25                   0.3            100           0.0013
11     80:20            2                  25.38                0.3            100           0.0012
12     70:30            2                  24.49                0.3            100           0.0010
13     60:40            2                  26.08                0.3            100           0.0012
14     50:50            2                  25.12                0.3            100           0.0014
15     40:60            2                  25.32                0.3            100           0.0048
16     30:70            2                  26.26                0.3            99.95         0.0205
17     20:80            2                  25.33                0.3            99.98         0.0139
18     10:90            2                  25.48                0.3            99.70         0.0474
19     90:10            3                  34.13                0.3            100           0.0008
20     80:20            3                  33.63                0.3            100           0.0011
21     70:30            3                  34.44                0.3            100           0.0009
22     60:40            3                  33.61                0.3            100           0.0011
23     50:50            3                  34.15                0.3            100           0.0014
24     40:60            3                  34.27                0.3            100           0.0029
25     30:70            3                  34.80                0.3            99.96         0.0147
26     20:80            3                  35.76                0.3            99.95         0.0144
27     10:90            3                  36.29                0.3            99.70         0.0464

*a = (attribs + classes) / 2

Table 5: Experimental Results for Rough Set Classification Models

Model  Data Allocation  Accuracy (%)  Number of Rules  Length of Rules
1      90:10            100           7289             7
2      80:20            100           7546             7
3      70:30            100           8058             7
4      60:40            100           6930             7
5      50:50            99.90         7829             7
6      40:60            100           8922             6
7      30:70            99.96         11538            7
8      20:80            99.97         8889             6
9      10:90            100           8480             6