
JOURNAL OF COMPUTER SCIENCE AND ENGINEERING, VOLUME 14, ISSUE 1, JULY 2012

A New Decision Tree Learning Approach for Novel Class Detection in Concept Drifting Data Stream Classification
Amit Biswas, Dewan Md. Farid and Chowdhury Mofizur Rahman
Abstract—Novel class detection in concept drifting data stream classification addresses learning problems where the data distribution changes over time, as in weather prediction, economic forecasting, astronomical analysis, and intrusion detection. A novel class arises under concept-drift in a data stream when new data introduce new concept classes or remove old ones. Existing data mining classifiers cannot detect and classify novel class instances until they are trained with labeled instances of the novel class. In this paper, we propose a new approach for detecting novel classes in concept drifting data stream classification using a decision tree classifier that can determine whether a new data instance belongs to a novel class. The proposed approach builds a decision tree from training data points and continuously updates it with recent data points so that the tree represents the most recent concept in the data stream. Experimental analysis on benchmark datasets from the UCI machine learning repository shows that the proposed approach can detect novel classes in concept drifting data stream classification problems.

Index Terms—Concept Drifting, Data Stream Classification, Decision Tree, Novel Class.

1 INTRODUCTION

Data stream classification is the process of extracting knowledge and information from continuous data instances. A data stream is an ordered sequence of data points that include attribute values and class values. The goal of a data mining classifier is to predict the class value of a new or unseen instance whose attribute values are known but whose class value is unknown. Existing data mining classifiers (or classification models) are trained on instances of a dataset with a fixed number of class values, but in real-world data stream classification problems a new data instance with a new class value may appear, and the classification model will misclassify it. Most existing data mining classifiers cannot detect and classify novel class instances until they are trained with labeled instances of the novel class. In real-life data stream mining problems the data distribution changes over time, as in weather prediction, astronomical analysis, and intrusion detection. Novel class detection in concept drifting data stream mining is a problem because classification models become less accurate as time passes. Concept drift means that the statistical properties of the target class, which the data mining classifier is trying to predict, change over time in unforeseen ways. Novel class detection in concept drifting data stream classification refers to a change in the data stream when the underlying concept of the data changes over time.

Amit Biswas is with the Department of Computer Science and Engineering, United International University, Dhaka, Bangladesh. Dewan Md. Farid is with the Department of Computer Science and Engineering, United International University, Dhaka, Bangladesh. Chowdhury Mofizur Rahman is with the Department of Computer Science and Engineering, United International University, Dhaka, Bangladesh.

Recently, research on novel class detection in concept drifting data stream classification has received much attention from computational intelligence researchers [1], [2], [3]. A data mining classifier should be updated continuously so that it reflects the most recent concept in the data stream. Data stream classifiers are divided into two categories: single-model and ensemble-model classifiers. A single model incrementally updates one classifier and responds effectively to concept drift [9], [13]. An ensemble model, on the other hand, combines a series of classifiers with the aim of creating an improved composite model, and also handles concept drift efficiently [1], [5], [10], [12]. In this paper, we provide a solution for handling the novel class detection problem using a decision tree. Our approach builds a decision tree from the data stream and continuously updates it with new data points so that the latest tree represents the most recent concept in the data stream. We calculate a threshold value for each leaf node based on the ratio of the percentage of data points classified by that leaf to the percentage of data points in the training dataset, and we also cluster the data points of the training dataset based on the similarity of their attribute values. If the number of data points classified by a leaf node of the tree exceeds the threshold value calculated before, a novel class may have arrived. We then compare the new data point with the existing data points based on the similarity of attribute values. If the attribute values of the new data point differ from those of the existing data points and the new data point does not belong to any cluster, the arrival of a novel class is confirmed. We then add the new data point to the training dataset and rebuild the decision tree.

We organize this paper as follows. Section 2 discusses related work. Section 3 provides an overview of learning algorithms. Our approach is introduced in section 4. Section 5 discusses the datasets and experimental analysis. Finally, conclusions and future work are drawn in section 6.

2 RELATED WORK

Novelty detection and data stream classification, where data distributions inherently change over time, have received much attention from computational intelligence researchers because of many practical real-world applications, such as spam filtering, climate-change monitoring, and intrusion detection. In 2011, Masud et al. proposed a novelty detection and data stream classification technique that integrates a novel class detection mechanism into traditional mining classifiers, enabling automatic detection of novel classes before the true labels of the novel class instances arrive [1]. In order to determine whether an instance belongs to a novel class, the classification model sometimes needs to wait for more test instances to discover similarities among those instances. In the same year, R. Elwell and R. Polikar introduced an ensemble-of-classifiers approach named Learn++.NSE for incremental learning of concept drift, characterized by nonstationary environments [2]. Learn++.NSE trains one new classifier for each batch of data it receives and combines these classifiers using dynamically weighted majority voting. The novelty of the approach lies in determining the voting weights based on each classifier's time-adjusted accuracy on current and past environments. In 2007, Kolter and Maloof proposed an ensemble approach for concept drifting data stream classification that dynamically creates and removes weighted experts in response to changes in performance using dynamic weighted majority (DWM) [5]. It trains the online learners of the ensemble and adds or removes experts based on the global performance of the ensemble. In 2006, Gaber and Yu [8] proposed a novel class detection approach termed STREAM-DETECT to identify changes in data streams, which detects changes by measuring online clustering result deviation over time. In 2005, Yang et al. [9] proposed an approach that incorporates proactive and reactive predictions. In proactive mode, it anticipates what the new concept will be if a future concept change takes place and prepares prediction strategies in advance. If the anticipation turns out to be correct, a proper prediction model can be launched instantly upon the concept change. If not, it promptly resorts to a reactive mode, adapting a prediction model to the new data. Widmer and Kubat presented a single classifier named FLORA, which uses a sliding window to choose a block of new instances to train a new classifier [14]. FLORA has a built-in forgetting mechanism with the implicit assumption that instances falling outside the window are no longer relevant and the information carried by them can be forgotten.

3 LEARNING ALGORITHMS

Data mining is the process of finding hidden information and patterns in a huge database. Data mining algorithms have two major functions: classification and clustering. Classification maps data into predefined groups or classes. It is often referred to as supervised learning because the classes are determined before examining the data; classification creates a function from training data. Clustering, on the other hand, is similar to classification except that the groups are not predefined but rather defined by the data alone. It is alternatively referred to as unsupervised learning.

3.1 Decision Tree Learning

Decision tree (DT) learning is a very popular mining tool for classification and prediction. It is easy to implement and requires little prior knowledge, and a DT can be built from a large dataset with many attributes. In a DT, the successive division of the set of training instances proceeds until each subset consists of instances of a single class. There are three main components in a DT: nodes, leaves, and edges. Each node is labeled with the attribute by which the data are partitioned, and has a number of edges labeled according to the possible values of that attribute. An edge connects either two nodes or a node and a leaf. Leaves are labeled with a decision value for categorization of the data. To make a decision using a DT, start at the root node and follow the tree down the branches until a leaf node representing the class is reached. Each DT represents a rule set that categorizes data according to the attributes of the dataset. The ID3 (Iterative Dichotomiser) technique builds a DT using information theory [16]. The basic strategy used by ID3 is to choose the splitting attribute with the highest information gain. The amount of information associated with an attribute value is related to its probability of occurrence. The concept used to quantify information is called entropy, which measures the amount of randomness in a dataset. When all data in a set belong to a single class, there is no uncertainty and the entropy is zero. The objective of decision tree classification is to iteratively partition the given dataset into subsets where all elements in each final subset belong to the same class. The entropy calculation is shown in equation (1). Given probabilities p_1, p_2, ..., p_s, where \sum_{i=1}^{s} p_i = 1,
H(p_1, p_2, \ldots, p_s) = \sum_{i=1}^{s} p_i \log\left(\frac{1}{p_i}\right)    (1)
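To make equation (1) concrete, the following Java sketch computes the entropy of a subset from its per-class instance counts using base-2 logarithms. The class and method names are ours for illustration; this is not code from the paper's implementation.

class EntropyUtil {
    // Entropy (equation 1) of a data subset, from per-class instance counts.
    static double entropy(int[] classCounts) {
        int total = 0;
        for (int c : classCounts) total += c;
        double h = 0.0;
        for (int c : classCounts) {
            if (c == 0) continue;                          // 0 * log(1/0) is taken as 0
            double p = (double) c / total;
            h += p * (Math.log(1.0 / p) / Math.log(2.0));  // log base 2
        }
        return h;
    }
}

For example, EntropyUtil.entropy(new int[]{8, 8}) returns 1.0 (two equally likely classes), while a pure subset returns 0.0.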

Given a dataset D, H(D) measures the entropy, i.e. the amount of disorder, of the dataset. When D is split into s new subsets S = {D_1, D_2, ..., D_s}, we can again look at the entropy of those subsets. A subset is completely ordered if all examples in it belong to the same class. ID3 chooses the splitting attribute with the highest gain, which it calculates by equation (2).

Gain(D, S) = H(D) - \sum_{i=1}^{s} p(D_i)\, H(D_i)    (2)

C4.5 is a successor of ID3 that splits using the GainRatio criterion [15]. For splitting, C4.5 chooses the attribute with the largest GainRatio among those whose information gain is at least average.

GainRatio(D, S) = \frac{Gain(D, S)}{H\left(\frac{|D_1|}{|D|}, \ldots, \frac{|D_s|}{|D|}\right)}    (3)
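Building on the entropy sketch above, information gain (equation 2) and gain ratio (equation 3) can be computed from the per-class counts of the parent set and of each subset produced by a candidate split; again, this is our illustrative sketch rather than the paper's code.

class SplitCriteria {
    // Information gain (equation 2): parent entropy minus the weighted
    // entropy of the subsets D_1..D_s produced by the split.
    static double gain(int[] parentCounts, int[][] subsetCounts) {
        int n = 0;
        for (int c : parentCounts) n += c;
        double weighted = 0.0;
        for (int[] sub : subsetCounts) {
            int ni = 0;
            for (int c : sub) ni += c;
            weighted += ((double) ni / n) * EntropyUtil.entropy(sub);
        }
        return EntropyUtil.entropy(parentCounts) - weighted;
    }

    // Gain ratio (equation 3): gain normalized by the entropy of the
    // subset-size distribution |D_1|/|D|, ..., |D_s|/|D|.
    static double gainRatio(int[] parentCounts, int[][] subsetCounts) {
        int n = 0;
        for (int c : parentCounts) n += c;
        double splitInfo = 0.0;
        for (int[] sub : subsetCounts) {
            int ni = 0;
            for (int c : sub) ni += c;
            if (ni == 0) continue;
            double p = (double) ni / n;
            splitInfo += p * (Math.log(1.0 / p) / Math.log(2.0));
        }
        return splitInfo == 0.0 ? 0.0 : gain(parentCounts, subsetCounts) / splitInfo;
    }
}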

The C5.0 algorithm improves the performance of tree building using boosting, an approach to combining different classifiers, although boosting does not always help when the training data contain a lot of noise. When C5.0 performs a classification, each classifier is assigned a vote, voting is performed, and each example is assigned to the class with the most votes. CART (Classification and Regression Trees) generates a binary tree for decision making [17]; it handles missing data and contains a pruning strategy. The SPRINT (Scalable Parallelizable Induction of Decision Trees) algorithm uses an impurity function called the gini index to find the best split [18]. Equation (4) defines the gini index for a dataset D.

gini(D) = 1 - \sum_{j} p_j^2    (4)

where p_j is the frequency of class C_j in D. The goodness of a split of D into subsets D_1 and D_2 is defined by

gini_{split}(D) = \frac{n_1}{n}\, gini(D_1) + \frac{n_2}{n}\, gini(D_2)    (5)

The split with the best gini value is chosen. A number of research projects on optimal feature selection and classification have adopted a hybrid strategy involving evolutionary algorithms and inductive decision tree learning [19], [20], [21], [22], [23].
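The gini computations of equations (4) and (5) follow the same pattern as the earlier sketches; the code below is our illustration under the same assumptions.

class GiniUtil {
    // Gini index (equation 4) of a subset, from per-class instance counts.
    static double gini(int[] classCounts) {
        int n = 0;
        for (int c : classCounts) n += c;
        double sumSquares = 0.0;
        for (int c : classCounts) {
            double p = (double) c / n;
            sumSquares += p * p;
        }
        return 1.0 - sumSquares;
    }

    // Goodness of a binary split (equation 5): size-weighted gini of the halves.
    static double giniSplit(int[] leftCounts, int[] rightCounts) {
        int n1 = 0, n2 = 0;
        for (int c : leftCounts)  n1 += c;
        for (int c : rightCounts) n2 += c;
        int n = n1 + n2;
        return ((double) n1 / n) * gini(leftCounts)
             + ((double) n2 / n) * gini(rightCounts);
    }
}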

3.2 Clustering

Clustering can be considered the most important unsupervised learning problem and has been used in many real-world application domains, including biology, medicine, anthropology, and marketing. It is the process of organizing objects into groups whose members are similar in some way: a cluster is a collection of objects that are similar to one another and dissimilar to the objects belonging to other clusters. The goal of clustering is therefore to determine the intrinsic grouping in a set of unlabeled data. Given a dataset D = {t_1, t_2, ..., t_n} of data points, a similarity measure sim(t_i, t_l) defined between any two data points t_i, t_l \in D, and an integer value k, the clustering problem is to define a mapping f: D \to {1, ..., k} where each t_i is assigned to one cluster K_j, 1 \le j \le k. Clustering algorithms can be categorized by their cluster model, such as k-means clustering, distribution-based clustering, and density-based clustering.
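Our approach clusters training points by the similarity of their attribute values, but the paper does not specify a particular algorithm. The sketch below is one plausible instantiation we supply for illustration, assuming nominal attributes, simple matching similarity, and a leader-style assignment governed by a hypothetical minSim threshold.

import java.util.ArrayList;
import java.util.List;

class LeaderClustering {
    // Simple matching similarity: the fraction of attribute positions
    // on which two nominal-valued data points agree.
    static double similarity(String[] a, String[] b) {
        int matches = 0;
        for (int i = 0; i < a.length; i++)
            if (a[i].equals(b[i])) matches++;
        return (double) matches / a.length;
    }

    // Assign each point to the first cluster whose leader (first member)
    // is similar enough; otherwise open a new cluster around the point.
    static List<List<String[]>> cluster(List<String[]> points, double minSim) {
        List<List<String[]>> clusters = new ArrayList<>();
        for (String[] p : points) {
            boolean placed = false;
            for (List<String[]> c : clusters) {
                if (similarity(c.get(0), p) >= minSim) {
                    c.add(p);
                    placed = true;
                    break;
                }
            }
            if (!placed) {
                List<String[]> fresh = new ArrayList<>();
                fresh.add(p);
                clusters.add(fresh);
            }
        }
        return clusters;
    }
}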

4 PROPOSED APPROACH

The data stream is a continuous sequence of data points {x_1, x_2, ..., x_now}, where x_1 is the very first data point in the stream and x_now is the latest data point, which has just arrived. Each data point x_i is an n-dimensional feature vector over the attributes {A_1, A_2, ..., A_n}, with a class label C_i drawn from the set of classes {C_1, C_2, ..., C_m}. Each attribute A_i takes one of a number of attribute values {A_i1, A_i2, ..., A_ip}. Algorithm 1 outlines our approach. We build a decision tree from the training data points, calculate a threshold value for each leaf node based on the ratio of the percentage of data points classified by that leaf to the percentage of data points in the training dataset, and cluster the training data points based on the similarity of their attribute values. When classifying the continuous data stream in real time, if the number of data points classified by a leaf node of the tree exceeds the threshold value calculated before, a novel class may have arrived. We then compare the new data point with the existing data points based on the similarity of attribute values. If the attribute values of the new data point differ from those of the existing data points and the new data point does not belong to any cluster, the arrival of a novel class is confirmed. We then add the new data point to the training dataset and rebuild the decision tree. The decision tree classifier is updated continuously so that it represents the most recent concept in the data stream.

Algorithm 1: Novel Class Detection using Decision Tree
1. Find the best splitting attribute, i.e. the one with the highest information gain, in the training dataset.
2. Create a node and label it with the splitting attribute. [The first node is the root node, T, of the decision tree.]
3. For each branch of the node, partition the data points and grow sub training datasets D_i by applying the splitting predicate to the training dataset D.
4. For each sub training dataset D_i, if the data points in D_i all have the same class value C_i, then label the leaf node with C_i; otherwise continue steps 1 to 4 until each final subset belongs to the same class value or a leaf node is created.
5. When the decision tree construction is complete, calculate the threshold value for each leaf node in the tree based on the ratio of the percentage of data points classified by that leaf to the percentage of data points in the training dataset.
6. Cluster the training data points based on the similarity of attribute values.
7. When classifying the continuous data stream in real time, if the number of data points classified by a leaf node of the decision tree exceeds the threshold value calculated before, a novel class may have arrived.
8. If the attribute values of the new data point differ from the existing data points of that leaf node, and the new data point does not belong to any existing cluster, the arrival of a novel class is confirmed.
9. If a novel class is detected, add the new data point to the existing training data points to generate a new training dataset, D_new.
10. Rebuild the decision tree using the updated training dataset, D_new.
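To illustrate steps 5, 7, and 8 of Algorithm 1, the following Java sketch expresses the leaf-threshold test and the similarity confirmation. The method names, the exact form of the threshold test, and the reuse of the similarity helper from section 3.2 are our assumptions; the paper does not publish its implementation.

import java.util.List;

class NovelClassDetector {
    // Step 5: a leaf's threshold is its share of the training data.
    static double leafThreshold(int leafTrainCount, int trainSize) {
        return (double) leafTrainCount / trainSize;
    }

    // Step 7: flag the leaf if its share of the stream seen so far
    // exceeds its share of the training data.
    static boolean leafExceedsThreshold(int leafStreamCount, int streamSize,
                                        double threshold) {
        return (double) leafStreamCount / streamSize > threshold;
    }

    // Step 8: confirm a novel class only if the new point is dissimilar to
    // the leaf's training points and falls inside no existing cluster.
    static boolean confirmsNovelClass(String[] x, List<String[]> leafPoints,
                                      List<List<String[]>> clusters, double minSim) {
        for (String[] p : leafPoints)
            if (LeaderClustering.similarity(p, x) >= minSim) return false;
        for (List<String[]> c : clusters)
            if (LeaderClustering.similarity(c.get(0), x) >= minSim) return false;
        return true;  // steps 9-10 then extend the training set and rebuild the tree
    }
}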

5 EXPERIMENTAL ANALYSIS
In this section, we describe the datasets and the experimental results.
TABLE 1
Data Set Descriptions

Dataset                         | No. of Attributes | Attribute Types | No. of Instances | No. of Classes
Iris Plants Database            | 4                 | Real            | 150              | 3
Image Segmentation Data         | 19                | Real            | 1500             | 7
Large Soybean Database          | 35                | Nominal         | 683              | 19
Fitting Contact Lenses Database | 4                 | Nominal         | 24               | 3
NSL-KDD Dataset                 | 41                | Real & Nominal  | 25192            | 23

5.1 Datasets

Data stream mining is the process of analyzing online data to discover patterns; it uses sophisticated mathematical algorithms to segment the continuous data and evaluate the probability of future events. A set of data items is called a dataset, the most basic concept in data mining and machine learning research; a dataset is roughly equivalent to a two-dimensional spreadsheet or database table. Table 1 describes the datasets from the UCI machine learning repository that are used in the experimental analysis [26].
1. Iris Plants Database: This is one of the best known datasets in the pattern recognition literature. It contains 3 class values (Iris Setosa, Iris Versicolor, and Iris Virginica), where each class refers to a type of iris plant. There are 150 instances and 4 attributes in this dataset (50 instances in each of the three classes). One class is linearly separable from the other two.
2. Image Segmentation Data: The goal of this dataset is to provide an empirical basis for research on image segmentation and boundary detection. There are 1500 data instances with 19 attributes, all real-valued. There are 7 class values: brickface, sky, foliage, cement, window, path, and grass.
3. Large Soybean Database: There are 35 attributes in this dataset, all nominal. There are 683 data instances and 19 class values.
4. Fitting Contact Lenses Database: This is a very small dataset with only 24 data instances, 4 attributes, and 3 class values (soft, hard, and none). All attribute values are nominal. The instances are complete and noise free, and 9 rules cover the training set.
5. NSL-KDD Dataset: The Knowledge Discovery and Data Mining 1999 (KDD99) competition data contains simulated intrusions in a military network environment and is often used as a benchmark for evaluating the handling of concept drift. The NSL-KDD dataset is a new version of the KDD99 dataset that solved some of the inherent problems of KDD99 [25], although it still suffers from some of the problems discussed by McHugh [24]. The main advantage of the NSL-KDD dataset is that the numbers of training and testing data points are reasonable, so it becomes affordable to run experiments on the complete training and testing sets without the need to randomly select a small portion. Each record consists of 41 attributes and 1 class attribute. The NSL-KDD training dataset does not include redundant or duplicate examples.

5.2 Results

We implemented our algorithm in Java. The code for the decision tree has been adapted from the Weka machine learning open source repository (http://www.cs.waikato.ac.nz/ml/weka). Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. The experiments were run on an Intel Core 2 Duo 2.0 GHz processor (2 MB cache, 800 MHz FSB) with 1 GB of RAM. There are various approaches to determining the performance of data stream classifiers. The performance can most simply be measured by counting the proportion of correctly classified instances in an unseen test dataset.
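For reference, a stock Weka J48 (C4.5-style) decision tree can be trained in a few lines; this minimal usage sketch, with a placeholder dataset path, shows the kind of Weka building block that adapted code starts from, not our adapted implementation itself.

import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class BaselineTree {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("iris.arff");  // placeholder dataset path
        data.setClassIndex(data.numAttributes() - 1);   // class is the last attribute
        J48 tree = new J48();                           // Weka's C4.5-style learner
        tree.buildClassifier(data);
        System.out.println(tree);                       // prints the induced tree
    }
}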
Table 2 summarizes the symbols and terms used in equations 6 to 8.

TABLE 2
Used Symbols and Terms

Symbol | Term
N      | Total instances in the data stream
N_c    | Total novel class instances in the data stream
F_p    | Total existing class instances misclassified as novel class
F_n    | Total novel class instances misclassified as existing class
F_e    | Total existing class instances misclassified
M_new  | % of novel class instances misclassified as existing class
F_new  | % of existing class instances falsely identified as novel class
ERR    | Total misclassification error (%)

M_{new} = \frac{F_n \times 100}{N_c}    (6)

F_{new} = \frac{F_p \times 100}{N - N_c}    (7)

ERR = \frac{(F_p + F_n + F_e) \times 100}{N}    (8)
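Translated directly into Java, the three metrics can be computed from the raw counts of Table 2 tallied during stream classification; this is a minimal sketch, not the paper's evaluation code.

class StreamMetrics {
    // M_new (equation 6): % of novel class instances misclassified as existing.
    static double mNew(int fn, int nc) { return 100.0 * fn / nc; }

    // F_new (equation 7): % of existing class instances flagged as novel.
    static double fNew(int fp, int n, int nc) { return 100.0 * fp / (n - nc); }

    // ERR (equation 8): total misclassification error.
    static double err(int fp, int fn, int fe, int n) {
        return 100.0 * (fp + fn + fe) / n;
    }
}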

Equations 6, 7, and 8 are used to evaluate our approach. Tables 3 and 4 tabulate the results of the performance comparison between our approach and a traditional decision tree classifier.

Fig. 1. Decision Tree DTA using Sub-dataset A.

TABLE 3
Performance of Proposed Approach

Dataset                         | ERR  | M_new | F_new
Iris Plants Database            | 4    | 4     | 3
Image Segmentation Data         | 2.9  | 1.3   | 2.9
Large Soybean Database          | 9.2  | 2.8   | 1.9
Fitting Contact Lenses Database | 16.6 | 0     | 5.2
NSL-KDD Dataset                 | 4.0  | 8.4   | 1.2

TABLE 4
Performance of Traditional Decision Tree

Dataset                         | ERR  | M_new | F_new
Iris Plants Database            | 5.3  | 4     | 5
Image Segmentation Data         | 5.2  | 3.7   | 5.2
Large Soybean Database          | 10.8 | 6.5   | 2.8
Fitting Contact Lenses Database | 50   | 100   | 5.2
NSL-KDD Dataset                 | 5.3  | 10.0  | 1.5

6 CONCLUSION

In this paper, we introduce a decision tree classifier based approach for novel class detection in concept drifting data stream classification, which builds a decision tree from the data stream. The decision tree is continuously updated with new data points so that the most recent tree represents the most recent concept in the data stream. The main purpose of this paper is to improve the performance of the decision tree classifier in concept drifting data stream classification problems. The decision tree classifier is a very popular supervised learning algorithm with several advantages: it is easy to implement and requires little prior knowledge. We tested the proposed approach on several benchmark datasets, and the results show that it efficiently detects novel classes and improves classification accuracy. Future work will focus on addressing this problem under dynamic attribute sets.

APPENDIX: AN ILLUSTRATIVE EXAMPLE

In the large soybean database from the UCI machine learning repository [26], there are 35 attributes in total, all with nominal values. There are 683 data points, categorized into 19 class values. We split the dataset into 3 sub-datasets: sub-dataset A contains 356 instances with 10 class values, sub-dataset B contains 107 instances with 5 class values, and sub-dataset C contains 220 instances with 4 class values. We built a decision tree, DTA, using sub-dataset A, which is shown in figure 1.

Fig. 2. Decision Tree DTX using Sub-dataset XA+B.

Then we classified the 356 instances of sub-dataset A by applying the decision tree DTA, which correctly classified 323 instances and misclassified 33 instances. After that we classified the 107 instances of sub-dataset B (which contains 5 novel classes) by applying the decision tree DTA, which detected that a novel class had arrived. For example, the leaf node "leafspot-size = lt-1/8 and seed = norm: bacterial-blight" satisfied 20 instances from sub-dataset A and 10 instances from sub-dataset B. The other attribute values of the 10 instances from sub-dataset B are quite dissimilar to those of the 20 instances from sub-dataset A, which confirms that a novel class arrived. Then we merged sub-dataset A and sub-dataset B to generate a new dataset XA+B and rebuilt the decision tree as DTX, which is shown in figure 2. Similarly, we merged dataset XA+B with sub-dataset C (which contains 220 instances with 4 novel classes) to generate a new dataset XA+B+C. Finally, we again rebuilt the decision tree as DTY, which classified the 683 instances of dataset XA+B+C with 91.5081% accuracy and the 220 instances of sub-dataset C with 98.6364% accuracy. Decision tree DTY is shown in figure 3.

Fig. 3. Decision Tree DTY using Sub-dataset XA+B+C.

ACKNOWLEDGMENT

This research work was supported by the Department of Computer Science and Engineering, United International University, Dhaka, Bangladesh.

REFERENCES

[1] M. M. Masud, J. Gao, L. Khan, J. Han, and B. Thuraisingham, "Classification and Novel Class Detection in Concept-Drifting Data Streams under Time Constraints," IEEE Transactions on Knowledge and Data Engineering, Vol. 23, No. 6, pp. 859-874, June 2011.
[2] R. Elwell, and R. Polikar, "Incremental Learning of Concept Drift in Nonstationary Environments," IEEE Transactions on Neural Networks, Vol. 22, No. 10, pp. 1517-1531, October 2011.
[3] A. Zhou, F. Cao, W. Qian, and C. Jin, "Tracking Clusters in Evolving Data Streams over Sliding Windows," Knowledge and Information Systems, Vol. 15, No. 2, pp. 181-214, May 2008.
[4] E. J. Spinosa, A. P. de Leon F. de Carvalho, and J. Gama, "Cluster-Based Novel Concept Detection in Data Streams Applied to Intrusion Detection in Computer Networks," Proc. 2008 ACM Symp. Applied Computing, pp. 976-980, 2008.
[5] J. Z. Kolter, and M. A. Maloof, "Dynamic Weighted Majority: An Ensemble Method for Drifting Concepts," Journal of Machine Learning Research, Vol. 8, pp. 2755-2790, 2007.
[6] B. R. Dai, J. W. Huang, M. Y. Yeh, and M. S. Chen, "Adaptive Clustering for Multiple Evolving Streams," IEEE Transactions on Knowledge and Data Engineering, Vol. 18, No. 9, pp. 1166-1180, September 2006.
[7] C. C. Aggarwal, J. Han, J. Wang, and P. S. Yu, "A Framework for On-Demand Classification of Evolving Data Streams," IEEE Transactions on Knowledge and Data Engineering, Vol. 18, No. 5, pp. 577-589, May 2006.



[8] M. M. Gaber, and P. S. Yu, "Detection and Classification of Changes in Evolving Data Streams," Int'l Journal of Information Technology & Decision Making, Vol. 5, No. 4, pp. 659-670, 2006.
[9] Y. Yang, X. Wu, and X. Zhu, "Combining Proactive and Reactive Predictions for Data Streams," Proc. ACM SIGKDD, pp. 710-715, 2005.
[10] W. Fan, "Systematic Data Selection to Mine Concept-Drifting Data Streams," Proc. 10th ACM SIGKDD Int'l Conference on Knowledge Discovery and Data Mining, pp. 128-137, 2004.
[11] M. Markou, and S. Singh, "Novelty Detection: A Review, Part 2: Neural Network Based Approaches," Signal Processing, Vol. 83, Issue 12, pp. 2499-2521, December 2003.
[12] H. Wang, W. Fan, P. S. Yu, and J. Han, "Mining Concept-Drifting Data Streams using Ensemble Classifiers," Proc. 9th ACM SIGKDD Int'l Conference on Knowledge Discovery and Data Mining, 2003.
[13] G. Hulten, L. Spencer, and P. Domingos, "Mining Time-Changing Data Streams," Proc. 7th ACM SIGKDD Int'l Conference on Knowledge Discovery and Data Mining, ACM, New York, NY, USA, pp. 97-106, 2001.
[14] G. Widmer, and M. Kubat, "Learning in the Presence of Concept Drift and Hidden Contexts," Machine Learning, Vol. 23, No. 1, pp. 69-101, April 1996.
[15] J. R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann Publishers, San Mateo, CA, 1993.
[16] J. R. Quinlan, "Induction of Decision Trees," Machine Learning, Vol. 1, pp. 81-106, 1986.
[17] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone, Classification and Regression Trees, Statistics/Probability Series, Wadsworth, Belmont, 1984.
[18] J. Shafer, R. Agrawal, and M. Mehta, "SPRINT: A Scalable Parallel Classifier for Data Mining," Proc. 22nd Int'l Conference on Very Large Data Bases, Morgan Kaufmann, pp. 544-555, 1996.
[19] P. D. Turney, "Cost-Sensitive Classification: Empirical Evaluation of a Hybrid Genetic Decision Tree Induction Algorithm," Journal of Artificial Intelligence Research, Vol. 2, pp. 369-409, 1995.
[20] J. Bala, J. Huang, H. Vafaie, K. DeJong, and H. Wechsler, "Hybrid Learning using Genetic Algorithms and Decision Trees for Pattern Classification," Proc. 14th Int'l Joint Conference on Artificial Intelligence, Montreal, pp. 1-6, 19-25 August 1995.
[21] C. G. Salcedo, S. Chen, D. Whitley, and S. Smith, "Fast and Accurate Feature Selection using Hybrid Genetic Strategies," Proc. Genetic and Evolutionary Computation Conference, pp. 1-8, 1999.
[22] S. R. Safavian, and D. Landgrebe, "A Survey of Decision Tree Classifier Methodology," IEEE Transactions on Systems, Man, and Cybernetics, Vol. 21, No. 3, pp. 660-674, 1991.
[23] W. Y. Loh, and Y. S. Shih, "Split Selection Methods for Classification Trees," Statistica Sinica, Vol. 7, pp. 815-840, 1997.
[24] J. McHugh, "Testing Intrusion Detection Systems: A Critique of the 1998 and 1999 DARPA Intrusion Detection Evaluations as Performed by Lincoln Laboratory," ACM Transactions on Information and System Security, Vol. 3, No. 4, pp. 262-294, 2000.
[25] The KDD Archive, KDD99 Cup Dataset, http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html, 1999.
[26] A. Frank, and A. Asuncion, UCI Machine Learning Repository, University of California, Irvine, School of Information and Computer Sciences, 2010, http://archive.ics.uci.edu/ml

Amit Biswas is currently completing a Master of Science in Computer Science and Engineering at United International University, Bangladesh. He obtained a Bachelor of Computer Application (BCA) from Bangalore University, India, in 2004. He is an IT professional working as a Team Leader in the software department of a reputed IT company named BASE Limited. He has also worked for the Access to Information Programme (A2I), Prime Minister's Office, supported by UNDP Bangladesh. He has extensive experience and knowledge of software development and databases. Some of the software he has developed is successfully used by PLAN Bangladesh, CARE Bangladesh, Bangladesh Small and Cottage Industries Corporation (BSCIC), Habib Bank Limited, Dutch Bangla Bank, Rahimafrooz, and others.

Dr. Dewan Md. Farid received a B.Sc. in Computer Science and Engineering from Asian University of Bangladesh in 2003, an M.Sc. in Computer Science and Engineering from United International University, Bangladesh in 2004, and a Ph.D. in Computer Science and Engineering from Jahangirnagar University, Bangladesh in 2012. He is a part-time faculty member in the Department of Computer Science and Engineering at United International University, Bangladesh and at Daffodil International University, Bangladesh. He has published 1 book chapter, 8 journal papers, and 10 conference papers in machine learning, data mining, and intrusion detection, and has presented his papers at international conferences in Malaysia, Portugal, Italy, and France. Dr. Farid is a member of the IEEE and the IEEE Computer Society. He worked as a visiting researcher at the ERIC Laboratory, Université Lumière Lyon 2, France from September 2009 to June 2010. He received Senior Fellowships I and II, awarded by the National Science & Information and Communication Technology (NSICT) programme, Ministry of Science & Information and Communication Technology, Government of Bangladesh, in 2008 and 2011 respectively.

Professor Dr. Chowdhury Mofizur Rahman received his B.Sc. (EEE) and M.Sc. (CSE) from Bangladesh University of Engineering and Technology (BUET) in 1989 and 1992 respectively. He earned his Ph.D. from the Tokyo Institute of Technology in 1996 under the auspices of a Japanese Government scholarship. Prof. Chowdhury is presently working as the Pro Vice Chancellor and acting treasurer of United International University (UIU), Dhaka, Bangladesh, and is one of the founder trustees of UIU. Before joining UIU he was the head of the Computer Science & Engineering department of Bangladesh University of Engineering & Technology, the leading technical public university in Bangladesh. His research areas cover data mining, machine learning, AI, and pattern recognition. He is active in research and has published around 100 technical papers in international journals and conferences. He was the editor of the IEB journal and a moderator of NCC accredited centers in Bangladesh. He has served as organizing chair and program committee member for a number of international conferences held in Bangladesh and abroad. At present he is the coordinator from Bangladesh for the EU-sponsored eLINK project. Prof. Chowdhury has been an external expert member for the computer science departments of a number of renowned public and private universities in Bangladesh, and he actively contributes to the national goal of converting the country towards Digital Bangladesh.
