
Meta-learning systems using ID3 in Zoo Dataset

Mohamad Raziff Ramli, Omar Mukhtar Hambaran, Muhammad AlHafiz Baharuddin, Lovelone Juin, Siti Hawa Abd Hamid
Faculty of Information and Communication Technology, Universiti Teknikal Malaysia Melaka
yasalam_ajibtu90@yahoo.com, kidzeclipes@yahoo.com, love.guardian@yahoo.com, llone13@yahoo.com, khashah_hawa89@yahoo.com

Abstract
The Weka machine learning workbench provides a general-purpose environment for automatic classification, clustering, and feature selection, which are common data mining problems in bioinformatics research. Weka also contains an extensive collection of data pre-processing methods and supports the experimental comparison of different machine learning techniques on the same problem. Its main objectives are to (a) assist users in extracting useful information from data and (b) enable them to easily identify a suitable algorithm for generating an accurate predictive model from it. In this paper, we use classification to classify the zoo dataset.
Keywords: Bioinformatics research, zoo dataset, classification.
Availability: http://www.cs.waikato.ac.nz/ml/weka

Data mining refers to extracting, or mining, knowledge from large amounts of data. It is an increasingly popular field that uses statistics, visualization, machine learning, and other data manipulation and knowledge extraction techniques to gain insight into the relationships and patterns hidden in the data. Data mining is most useful when its results can be communicated to humans in an understandable way. Databases often grow so large that human interpretation of the data is not feasible, and accordingly there is a gap between data generation and data understanding and usage. One of the main goals of applying data mining learning algorithms in a zoo environment is to uncover new relations among the data and reveal new patterns that identify the type of animals in the zoo. The quality of the data matters: if information is irrelevant or redundant, or the data is noisy and unreliable, then knowledge discovery during training is more difficult. Some algorithms classify novel examples by retrieving the nearest stored training example, using all the available features in the distance computation. In Weka, there are also algorithms that try to focus on relevant features and ignore irrelevant ones. Decision tree inducers are examples of this approach. By testing the values of certain features, decision tree algorithms attempt to divide the training data into subsets containing a strong majority of one class.
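As a minimal sketch of how a decision tree inducer might score a feature test, the snippet below computes entropy and information gain, the splitting criterion that ID3 uses, on a handful of made-up animal records. The function names, attribute names, and rows are illustrative assumptions; they are not taken from the zoo dataset or from Weka.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def information_gain(rows, labels, feature):
    """Reduction in entropy obtained by splitting on one nominal feature."""
    subsets = {}
    for row, label in zip(rows, labels):
        subsets.setdefault(row[feature], []).append(label)
    remainder = sum(len(s) / len(labels) * entropy(s) for s in subsets.values())
    return entropy(labels) - remainder

# Hypothetical toy records (not actual zoo dataset rows).
rows = [
    {"feathers": 1, "milk": 0, "aquatic": 0},
    {"feathers": 1, "milk": 0, "aquatic": 1},
    {"feathers": 0, "milk": 1, "aquatic": 0},
    {"feathers": 0, "milk": 1, "aquatic": 1},
]
labels = ["bird", "bird", "mammal", "mammal"]

# Score every feature and pick the one that yields the purest subsets.
gains = {f: information_gain(rows, labels, f) for f in rows[0]}
best_feature = max(gains, key=gains.get)
```

In this toy data, a test on "feathers" or "milk" separates the two classes completely, while "aquatic" leaves both subsets evenly mixed, so a gain-based inducer would ignore it.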

1. Problem Background

In this paper, we compare different classifier algorithms on the zoo dataset. We introduce an efficient machine learning algorithm, one of the data mining techniques, that learns the important attributes of the zoo dataset needed for interpretation. The proposed technique is based on a Weka classifier algorithm that has high transparency and accuracy. In addition, among all features, we use only the subset of features that leads to the best performance. The evaluation of our results is based on a cross-validation approach.

In a real situation, animal experts have to conduct research or experiments on an animal to gather the data needed to determine its species or class, which takes a long time. By using data mining tools, we can obtain the animal class in a short time and with high accuracy. We also investigate the performance of the classifiers, mainly the decision-tree algorithm and meta-learners, to determine which meta-learner is best suited to Id3.

4. Scope
The scope of this project is based on the zoo dataset obtained from the UCI Machine Learning Repository at the University of California (UCI). Pre-processing, including a conversion step, was required on this dataset because most of the data is in numeric format. In addition, the dataset is run through Weka classifiers to obtain their accuracy and to compare which meta-learners are suitable for use with ID3 in Weka.

2. Problem Statement
A frequent problem of studies carried out on zoo datasets is that, due to practical or ethical limitations, they are often based on a limited number of replicates. For example, zoos are limited in the number of animals available to test a hypothesis, or in the number of independent enclosures in which animals can be kept while being studied. In multi-zoo studies, individual zoos will often be used as the independent data points, creating obvious difficulties in generating large data sets. Data mining is a relatively new field of research whose objective is to acquire knowledge from large amounts of data. In zoos, due to regulations and the availability of computers, a large amount of data is becoming available. On the one hand, practitioners (zoologists and veterinarians) are expected to use all this data in their work; on the other hand, such a large amount of data cannot be processed manually in a short time to classify the animals.

5. The zoo dataset


The dataset used in this investigation was obtained from the UCI Machine Learning Repository at the University of California (UCI). It contains 101 instances. The input dataset is in the Waikato Environment for Knowledge Analysis (WEKA) ARFF file format. The zoo dataset contains 18 attributes: the animal name, 15 Boolean-valued attributes, and 2 numeric attributes. The dataset has no missing values, so no cleaning or normalization step is needed. The type attribute is the class attribute and is represented numerically (integer values in the range [1, 7]).

3. Objective
A major objective of this project is to evaluate data mining tools on animal data in a zoo environment and to suggest a tool that can help make accurate decisions.

6. Classifier algorithm

Classification is an important problem in the emerging field of data mining. Although classification has been studied extensively in the past, most classification algorithms are designed only for memory-resident data, which limits their suitability for mining large data sets. In this project we used two data mining techniques:
Decision tree (ID3): Tree-shaped structures that represent sets of decisions. These decisions generate rules for the classification of a dataset.
Meta-learner: A subfield of machine learning in which automatic learning algorithms are applied to meta-data about machine learning experiments. The main goal is to use such meta-data to understand how automatic learning can become flexible in solving different kinds of learning problems, and hence to improve the performance of existing learning algorithms.

6.1 ID3
Classification trees are used to predict membership of cases or objects in the classes of a categorical dependent variable from their measurements on one or more predictor variables. Classification tree analysis is one of the main techniques used in data mining. The goal of classification trees is to predict or explain responses on a categorical dependent variable, and as such, the available techniques have much in common with the techniques used in the more traditional methods of Discriminant Analysis, Cluster Analysis, Nonparametric Analysis and Nonlinear Estimation. The flexibility of classification trees makes them a very attractive analysis option, but this is not to say that their use is recommended to the exclusion of more traditional methods. Indeed, when the typically more stringent theoretical and distributional assumptions of more traditional methods are met, the traditional methods may be preferable. But as an exploratory technique, or as a technique of last resort when traditional methods fail, classification trees are, in the opinion of many researchers, unsurpassed. The study and use of classification trees are not widespread in the fields of probability and statistical pattern recognition, but classification trees are widely used in applied fields as diverse as medicine (diagnosis), computer science (data structures), botany (classification), and psychology (decision theory). Classification trees readily lend themselves to being displayed graphically, which makes them easier to interpret than they would be if only a strict numerical interpretation were possible.

Tree classifiers perform a greedy search for rules by heuristically selecting the most promising features. Such a greedy (local) search may discard important rules. Associative classifiers, on the other hand, perform a global search for rules satisfying some quality constraints (e.g., minimum support). This global search, however, may generate a large number of rules. Further, many of these rules may be useless during classification and, worse, important rules may never be mined.

6.2 AdaBoostM1
Boosting is a general method of producing a very accurate prediction rule by combining rough and moderately inaccurate "rules of thumb." Boosting can also be defined as a machine learning meta-algorithm for performing supervised learning. The training set used for each member of the series is chosen based on the performance of the earlier classifier(s) in the series. In boosting, examples that are incorrectly predicted by previous classifiers in the series are chosen more often than examples that were correctly predicted. AdaBoost, the algorithm we focus on, calls a weak classifier repeatedly in a series of rounds t = 1, ..., T. On each call, a distribution of weights D_t is updated that indicates the importance of each example in the data set for the classification.

The strengths of boosting are that it can be used for regression (including generalized regression), density estimation, survival analysis, or multivariate analysis. Boosting can then be seen as an interesting regularization scheme for estimating a model. This statistical perspective will drive the focus of our exposition of boosting. On the other hand, its weakness relates to the problem of improving the accuracy of a hypothesis output by a learning algorithm in the distribution-free (PAC) learning model. A concept class is learnable (or strongly learnable) if, given access to a source of examples of the unknown concept, the learner with high probability is able to output a hypothesis that is correct on all but an arbitrarily small fraction of the instances.
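The weight update described above can be sketched for a single round as follows. This is a minimal illustration of the AdaBoost.M1 rule (scale the weight of correctly classified examples by beta = error / (1 - error), then renormalize), not Weka's implementation; the function name and the four-example toy input are assumptions made for the illustration.

```python
def adaboost_m1_update(weights, correct):
    """One AdaBoost.M1 round: return (new_weights, beta).

    weights -- current distribution D_t over training examples (sums to 1)
    correct -- list of booleans, True where the weak classifier was right
    """
    # Weighted error of the weak classifier under the current distribution.
    error = sum(w for w, ok in zip(weights, correct) if not ok)
    if error == 0 or error >= 0.5:
        raise ValueError("AdaBoost.M1 needs 0 < weighted error < 0.5")
    beta = error / (1.0 - error)
    # Shrink the weight of correctly classified examples, keep the rest.
    scaled = [w * beta if ok else w for w, ok in zip(weights, correct)]
    total = sum(scaled)
    # Renormalize so the weights form a distribution again.
    return [w / total for w in scaled], beta

# Four equally weighted examples; the weak classifier misses the last one.
d1 = [0.25, 0.25, 0.25, 0.25]
d2, beta = adaboost_m1_update(d1, [True, True, True, False])
```

After the update, the single misclassified example carries half of the total weight, so the next weak classifier is pushed to get it right.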

6.3 Bagging
The randomization meta-learning algorithm uses random subsamples of the training data and a randomized base-level algorithm. Bagging is a method that generates a diverse ensemble of classifiers by manipulating the training data given to a base learning algorithm. In other words, bagging is a "bootstrap" ensemble method that creates individuals for its ensemble by training each classifier on a random redistribution of the training set. Bagging and Randomization both construct each decision tree independently of the others. Bagging and Randomization do well in both the noisy and noise-free cases, because they focus on the statistical problem, and noise increases this statistical problem. Bagging can enhance accuracy when random features are used. Randomization gives useful internal estimates of error, strength, and correlation, and it is relatively robust to outliers and noise.

6.4 Dagging
Dagging feeds data chunks to copies of another classifier, in our case SMO (sequential minimal optimization). Dagging is very similar to bagging, but instead of using bootstrap sampling it uses disjoint samples. The training set is partitioned into k subsets. A base classifier generates a hypothesis for each subset. The final prediction is made by plurality vote, as in bagging. Another difference is that dagging uses no extra resources, since the total number of examples used equals the size of the training set. Dagging cannot stand by itself; it has to be combined with another classifier or meta-classifier to get a better result.

6.5 Stacking
Stacking is one way of combining classifiers. Combining classifiers are learning algorithms that construct a set of classifiers and then classify new data points by taking a vote of their predictions. Combining classifiers can also be defined as a set of classifiers whose individual decisions are combined in some way (typically by weighted or unweighted voting) to classify new examples. The main discovery is that combining classifiers are often much more accurate than the individual classifiers that make them up. Moreover, combining multiple classifiers is of particular interest in multimedia applications. Each modality in multimedia data can be analyzed individually, and combining multiple pieces of evidence can usually improve classification accuracy. However, a weakness is that most combination strategies used in previous studies implement ad hoc designs and ignore the varying expertise of specialized individual modality classifiers in recognizing a category under particular circumstances.
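The sampling difference between bagging (bootstrap samples) and dagging (disjoint subsets), and the plurality vote that combines the resulting predictions, can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, the "training set" is a list of ten integers standing in for examples, and no base classifier is trained.

```python
import random
from collections import Counter

def bootstrap_samples(data, n_models, rng):
    """Bagging: each model gets len(data) examples drawn with replacement."""
    return [[rng.choice(data) for _ in data] for _ in range(n_models)]

def disjoint_samples(data, k, rng):
    """Dagging: shuffle once, then deal the examples into k disjoint subsets."""
    shuffled = data[:]
    rng.shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]

def plurality_vote(predictions):
    """Return the label predicted most often (ties go to the first seen)."""
    return Counter(predictions).most_common(1)[0][0]

rng = random.Random(0)   # fixed seed so the sketch is repeatable
data = list(range(10))   # stand-in for 10 training examples
bags = bootstrap_samples(data, 3, rng)
dags = disjoint_samples(data, 3, rng)
winner = plurality_vote(["mammal", "bird", "mammal"])
```

Each bootstrap sample has as many examples as the training set and may repeat some of them, while the dagging subsets together use every example exactly once.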

Classifier                  Correctly Classified (%)   Incorrectly Classified (%)
meta.Grading+Id3            40.566                     59.434
meta.RandomSubSpace+Id3     93.3962                    6.6038
meta.Stacking+Id3           40.566                     59.434

6 Experiment
For the experiment setup, the original dataset is converted to CSV (Comma-Separated Values) format as the input file format for the Weka system. Before pre-processing, we remove the animal name attribute because its value is unique for every instance; this removal does not affect the dataset or the result. The zoo dataset is then pre-processed before classification to improve the accuracy of the result. All numeric attributes of the dataset are filtered into nominal attributes because we need to avoid decimal values in the graph. Next, all the identified classifiers (ID3 and the meta-classifiers) are tested on the zoo dataset using 10-fold cross-validation (10CV), a standard way of estimating the error rate. For testing, we use the default settings of the Weka system (version 3.6.4) without any modification. For the experiment result, we are interested in the percentage of correctly classified instances (accuracy percentage) of each algorithm, and we compare these accuracies to determine which classifier gives the most accurate result.
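The pre-processing and evaluation steps above can be sketched outside Weka as follows. This is a simplified illustration, not the Weka pipeline: the function names and the two sample rows are hypothetical, and the folds here are plain shuffled folds, whereas Weka stratifies its cross-validation folds by class.

```python
import random

def drop_column(rows, index):
    """Remove one column (e.g. the unique animal name) from every row."""
    return [row[:index] + row[index + 1:] for row in rows]

def to_nominal(rows):
    """Treat every numeric value as a nominal (string) category."""
    return [[str(v) for v in row] for row in rows]

def k_fold_indices(n, k, seed=0):
    """Yield (train_idx, test_idx) pairs; each example is tested exactly once."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, test

def accuracy(predicted, actual):
    """Percentage of correctly classified instances."""
    hits = sum(p == a for p, a in zip(predicted, actual))
    return 100.0 * hits / len(actual)

# Hypothetical two-row sample: name, one Boolean attribute, legs, type.
rows = [["animal_a", 1, 4, 1], ["animal_b", 0, 2, 2]]
clean = to_nominal(drop_column(rows, 0))

# Ten train/test splits over 101 instances, the size of the zoo dataset.
splits = list(k_fold_indices(101, 10))
```

In a full run, a classifier would be trained on each `train` index list and scored with `accuracy` on the matching `test` list, and the ten scores would be combined into the reported percentage.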

Table 1: accuracy of Id3 classifier and meta-classifiers with Id3

[Figure: bar chart "Meta-learners with Id3" showing Accuracy (%) on a scale of 20 to 100 for each classifier: Id3, meta.AdaBoostM1+Id3, meta.AttributeSelectedClassifier+Id3, meta.Bagging+Id3, meta.Dagging+Id3, meta.Decorate+Id3, meta.FilteredClassifier+Id3, meta.Grading+Id3, meta.RandomSubSpace+Id3, meta.Stacking+Id3]

8 Conclusion
In conclusion, the plain ID3 classifier performed best compared with ID3 combined with a meta-learner, because ID3 gives more accurate values in terms of precision, recall, F-measure, and ROC area. As noted above, decision trees are widely used in botany (classification) and commonly work with attribute-value descriptions. The meta-learner best suited to Id3 is FilteredClassifier.

7 Result
Classifier                              Correctly Classified (%)   Incorrectly Classified (%)
Id3                                     93.3962                    5.6604
meta.AdaBoostM1+Id3                     93.3962                    6.6038
meta.AttributeSelectedClassifier+Id3    88.6792                    10.3774
meta.Bagging+Id3                        92.4528                    6.6038
meta.Dagging+Id3                        76.4151                    22.6415
meta.Decorate+Id3                       90.566                     9.434
meta.FilteredClassifier+Id3             93.3962                    5.6604

