Dissertation
submitted in partial fulfillment of the requirements
for the degree of
Master of Technology, Computer Engineering
by
Abhijeet B. Godase
Roll No: 121022003
under the guidance of
Prof. V. Z. Attar
Dedicated to
my mother
Smt. Aruna B. Godase
and
my father
Shri. Balasaheb J. Godase
for their love, endless support
and encouragement.
CERTIFICATE
This is to certify that the dissertation titled
Prof. V. Z. Attar,
Guide,
Department of Computer Engineering
and Information Technology,
College of Engineering, Pune,
Shivaji Nagar, Pune-411005.
Date :
Abstract
The emerging domain of data stream mining is one of the important areas of research for the data mining community. Data streams in various real-life applications are characterized by concept drift. Such data streams may also exhibit skewed or imbalanced class distributions, for example in financial fraud detection and network intrusion detection. In such cases, the skewed class distribution of the stream compounds the problems associated with classifying stream instances. Learning from such skewed data streams results in a classifier which is biased towards the majority class, so the model built on such streams tends to misclassify the minority class examples. In some applications, for instance financial fraud detection, identifying fraudulent transactions is the main focus, because misclassification of such minority class instances might result in heavy losses, in this case financial. Increasingly higher losses due to misclassification of minority class instances cannot be ruled out in many other data stream applications either. The challenge, therefore, is to pro-actively identify such minority class instances in order to avoid the losses associated with them. As an effort in this direction, we propose a method that uses a k-nearest-neighbours approach together with an oversampling technique to classify such skewed data streams. Oversampling is achieved by reusing minority class examples retained from the stream as time progresses. Experimental results show that our approach achieves good classification performance on synthetic as well as real-world datasets.
Acknowledgements
I express my sincere gratitude towards my guide Prof. V. Z. Attar for her constant help, encouragement and inspiration throughout the project work. Without her invaluable guidance, this work would never have been successful. She also guided me in the essentials of time management, efficient organization and presentation skills. I would also like to thank the Head of the Department, Dr. Jibi Abraham, and all other faculty members who made my journey of post-graduation and technical learning such an enjoyable experience.
I would also like to thank Dr. Albert Bifet, University of Waikato, New Zealand, for his help and guidance during the implementation of the project. Last but not least, I would like to thank all my classmates for their support and help throughout the course of the project.
Abhijeet B. Godase
College of Engineering, Pune
May 30, 2012
Contents

Abstract  iii
Acknowledgements  iv
List of Tables  viii
List of Figures  viii

1 Introduction  1
  1.1 Introduction to Data Mining and Techniques  1
      1.1.1 Classification  2
  1.2 An Overview of Data Streams  2
  1.3 Data Stream Classification  3
  1.4 An Overview of Skewed Data Sets in the Real World  4
  1.5 Issues in Learning from Skewed Data Streams  6
  1.6 Thesis Outline  7

2 Literature Survey  8
  2.1 Overview of Methods for Dealing with Skewed Data Streams - Traditional Approaches  8
      2.1.1 Oversampling  9
      2.1.2 Under-sampling  10
      2.1.3 Cost Sensitive Learning  11

3 Problem Description  14
  3.1 Motivation  14
  3.2 Problem Statement  15
      3.2.1 How did we reach our problem statement  16
      3.2.2 Why are Existing Classifiers Weak?  16
  3.3 Evaluation Metrics  16

4 Our Approach  20
  4.1 Approach to Deal with Skewed Data Streams  20
  4.2 Algorithm for Skewed Data Streams  22

5 Experimental Results and Discussions  25
  5.1 Experimental Setup  25
  5.2 Datasets  25
      5.2.1 Synthetically Generated Datasets  26
      5.2.2 Results of Our Approach on Synthetically Generated Datasets  28
      5.2.3 Real World Datasets  33
      5.2.4 Results of Our Approach on Real World Datasets  35
      5.2.5 Effect of Noise Level on the Performance of the Approach  37

List of Tables

5.1 Description of Synthetic Datasets used
5.2
5.3

List of Figures

1.1 Classification as a task of mapping input attribute set x into its class label y  2
1.2 Classification model in data streams  3
1.3 Skewed Distributions, each data chunk has fewer positive examples than negative examples  4
4.1 Our Approach to Classify Data Streams with Skewed Distributions  21
5.1  29
5.2  29
5.3  30
5.4  30
5.5  31
5.6  31
5.7  32
5.8  32
5.9  35
5.10  35
5.11  36
5.12  37
5.13  37
5.14  38
5.15  38
5.16  39
Chapter 1
Introduction
This chapter introduces the basics of the project area and gives some insight into the problem domain under consideration.
1.1 Introduction to Data Mining and Techniques
In the internet age, the explosion of data and information has resulted in huge amounts of data. Fortunately, data mining techniques exist to gather knowledge from such abundant data. As defined by Jiawei Han in his book Data Mining: Concepts and Techniques [18], data mining is the extraction of interesting, non-trivial, implicit, previously unknown and potentially useful patterns or knowledge from huge amounts of data. Data mining has been used in various areas such as health care, business intelligence, financial trade analysis, network intrusion detection, etc.
The general process of knowledge discovery from data involves data cleaning, data integration, data selection, data mining, pattern evaluation and knowledge presentation. Data cleaning and data integration constitute data preprocessing, in which data is processed so that it becomes appropriate for the data mining step. Data mining forms the core of the knowledge discovery process. There exist various data mining techniques, viz. classification, clustering, association rule mining, etc. Our work mainly falls under the classification technique.
1.1.1 Classification

Figure 1.1: Classification as a task of mapping input attribute set x into its class label y
Classification is a pervasive problem that encompasses many diverse applications, from static datasets to data streams. Classification tasks have been employed on static data over the years. In the last decade, more and more applications featuring data streams have evolved, and these pose a challenge to traditional classification algorithms.
1.2 An Overview of Data Streams
Many real-world applications, such as network traffic monitoring, credit card transactions, real-time surveillance systems, electric power grids, remote sensors, web click streams, etc., generate continuously arriving data known as data streams [4],[2]. Unlike traditional data sets, data streams arrive continuously at varying speeds. Data streams are fast changing, temporally ordered, potentially infinite and massive [18]. It may be impossible to store the entire data stream in memory or to go through it more than once due to its voluminous nature. Thus there is a need for algorithms that process the stream in a single pass using limited memory.
1.3 Data Stream Classification
Since classification can help decision making by predicting class labels for given data based on past records, classification of stream data has been extensively studied in recent years, and many interesting algorithms have been developed; some of them are cited here: [4],[21].
Fig 1.2 depicts the classification model in data streams. As shown in Fig 1.2, data chunks C1, C2, C3, ..., Ci arrive one by one.
Figure 1.3: Skewed Distributions, each data chunk has fewer positive examples
than negative examples
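The chunk-wise arrival just described can be sketched as follows; `chunks` is an illustrative helper (not part of any particular stream framework) that groups a potentially unbounded iterable of instances into fixed-size chunks C1, C2, C3, ...

```python
from itertools import islice

def chunks(stream, chunk_size):
    """Group a (potentially unbounded) iterable of instances into
    fixed-size chunks as they arrive; the final chunk may be shorter."""
    it = iter(stream)
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            return
        yield chunk

# Example: a stream of 10 instances processed in chunks of 4.
print([len(c) for c in chunks(range(10), 4)])  # → [4, 4, 2]
```

A classifier would be trained on each completed chunk before the next one is consumed, which is the processing model assumed throughout this thesis.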
1.4 An Overview of Skewed Data Sets in the Real World
The rate at which science and technology have developed has resulted in a proliferation of data at an exponential pace. This unrestrained increase in data has intensified the need for various data mining applications. This huge data is in no way necessarily equally distributed. Class skew or class imbalance refers to domains in which the instances of one class outnumber those of another, i.e. some classes, known as majority classes, occupy most of the dataset, while the remaining classes are known as minority classes. The most vital issue in such data sets is that, compared to the majority classes, the minority classes are often of much more significance and interest to the user.
There are many real-world applications whose datasets exhibit such a skewed nature. The following paragraphs give an overview of some real-world problems of this kind.
The above real-world examples signify the importance of dealing with imbalanced data sets. Most of them also fall into the category of skewed data streams. Most learning algorithms work well with balanced data streams, as their aim is to improve overall accuracy or a related measure. When such algorithms are applied to skewed data streams, their accuracy in classifying majority examples is good, but their accuracy in classifying minority examples is poor. This happens because learning from imbalanced/skewed streams causes the learner to become biased towards the majority class, so the minority examples are likely to be misclassified. Thus the main issue the research community is working on with regard to skewed data streams is correctly classifying the minority instances without affecting the accuracy on majority instances. In recent years, learning from skewed data streams has been recognized as one of the crucial problems in machine learning and data mining. In short, the principal problem is improving the accuracy on both the minority and the majority class instances of the stream.
1.5 Issues in Learning from Skewed Data Streams
In general, learning from skewed data streams is challenging due to the following issues.
1. Evaluation metric: The appropriate choice of evaluation metric is important in this domain. Evaluation metrics play a vital role in data mining; they are used to guide the algorithms to the desired solution. If the evaluation metric does not take the minority class into consideration, the learning algorithm will not be able to cope well with skewed data streams. Standard evaluation metrics like overall accuracy are not valid in this case, because even when minority class instances are misclassified the overall accuracy may remain high, the primary reason being the negligible number of minority class instances.
2. Lack of minority class data for training: In skewed data streams, the lack of minority class data makes it difficult to learn the class boundaries, as very few instances are available. Training a classifier in such situations is therefore very difficult.
3. Treatment of minority class data as noise: One of the major issues is noise. Noisy data in the stream affects the minority class more than the majority class.
1.6 Thesis Outline
Chapter 2 gives a brief description of the literature survey we carried out.
Chapter 3 describes our problem statement and how we arrived at it. It also focuses on the different evaluation metrics used throughout this thesis.
Chapter 4 elaborates our approach to dealing with data streams with a skewed distribution of classes and explains our algorithm and its implementation details.
Chapter 5 contains the results and discussions of the implemented algorithm and the evaluation of its performance.
Chapter 6 provides the conclusion and the future enhancements that can be carried out.
Chapter 2
Literature Survey
This chapter gives brief details of the literature survey we carried out in the area of imbalanced datasets and the skewed data streams problem.
2.1 Overview of Methods for Dealing with Skewed Data Streams - Traditional Approaches
We went through the various methods available in the literature for dealing with imbalanced datasets, and here portray some of the well-known and most popular approaches, algorithms and methods that have been devised to deal with skewed data streams. Some of the books we referred to for an effective understanding of data mining concepts are Data Mining: Concepts and Techniques by Han and Kamber [18] and Introduction to Data Mining by Kumar et al. [23]. The literature contains a number of methods addressing the class imbalance problem, but the area of skewed data streams is relatively new to the research community. The sampling-based and ensemble algorithms are the simplest yet most effective ones. The following paragraphs provide a brief overview of them.
Some of the approaches for dealing with skewed data streams are categorised under the following methods.
Oversampling.
Under-sampling.
Cost Sensitive Learning.
Oversampling and under-sampling are sampling-based preprocessing methods in data mining. The main idea in these methods is to manipulate the data distribution so that all classes are well represented in the training or learning datasets. Recent studies in this domain have shown that sampling is an effective way to deal with such problems. Cost sensitive learning, in essence, associates a cost with misclassifying examples in order to penalise the classifier.
2.1.1 Oversampling
Oversampling is one of the sampling-based preprocessing techniques in data mining. In oversampling, the number of minority class instances is increased either by reusing instances from previous training/learning chunks or by creating synthetic examples. Oversampling tries to strike a balance in the ratio of majority to minority classes. One advantage of this method is that normal stream classification methods can then be used. The most commonly used oversampling method is SMOTE (Synthetic Minority Oversampling Technique) [7]. Some of the oversampling-based approaches in the literature are discussed below.
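As a rough illustration of the idea behind SMOTE, the sketch below creates synthetic minority examples by interpolating between a minority example and one of its k nearest minority-class neighbours. The function name, plain Euclidean distance and numeric-only features are our simplifying assumptions, not the exact algorithm of [7].

```python
import random

def smote(minority, n_synthetic, k=5):
    """Minimal SMOTE-style sketch: each synthetic example lies on the
    segment between a random minority example and one of its k nearest
    minority-class neighbours (Euclidean distance, numeric features)."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    synthetic = []
    for _ in range(n_synthetic):
        x = random.choice(minority)
        neighbours = sorted((m for m in minority if m is not x),
                            key=lambda m: dist(x, m))[:k]
        nb = random.choice(neighbours)
        gap = random.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(xi + gap * (ni - xi)
                               for xi, ni in zip(x, nb)))
    return synthetic
```

Because every synthetic point is an interpolation, it stays inside the region already occupied by the minority class rather than duplicating existing examples.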
Most of the available stream classification algorithms assume that the streams have a balanced distribution of classes. In the last few years, a few attempts have been made to address the problem of skewed data streams. The first such attempt was by Gao et al. [16], who proposed the SE (Sampling + Ensemble) approach, which processes the stream in batches. In the SE approach, each classifier in the ensemble is trained by drawing an uncorrelated sample of negative instances together with all the positive instances in the current training chunk as well as the positive instances of all previous training chunks. Thus in the SE approach, oversampling of positive instances is done by incorporating old positive examples, along with under-sampling by way of using disjoint subsets of negative examples.
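The SE-style construction of per-classifier training sets could be sketched as follows; `se_training_sets` and the `neg_frac` sampling rate are illustrative assumptions, and disjoint slices of the shuffled negatives stand in for the paper's uncorrelated samples.

```python
import random

def se_training_sets(chunks_so_far, n_classifiers, neg_frac=0.3):
    """SE-style sketch: every classifier sees ALL positives accumulated
    from past and present chunks, plus its own disjoint random sample of
    the current chunk's negatives. Instances are (features, label) pairs
    with label 1 = minority/positive, 0 = majority/negative."""
    positives = [x for c in chunks_so_far for x, y in c if y == 1]
    negatives = [x for x, y in chunks_so_far[-1] if y == 0]
    random.shuffle(negatives)
    per = max(1, int(len(negatives) * neg_frac))
    sets = []
    for i in range(n_classifiers):
        sample = negatives[i * per:(i + 1) * per]  # disjoint slices
        sets.append((positives, sample))
    return sets
```

Averaging classifiers trained on such disjoint negative samples is what lets their uncorrelated errors cancel out, as [16] argues.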
The SERA (Selectively Recursive Approach) framework was proposed by Chen and He [9]. In this framework, minority examples from previous chunks are selectively absorbed into the current training chunk to balance it; the similarity measure used to select minority examples from previous chunks is the Mahalanobis distance. This differs from the approach of Gao et al. [16], which uses a take-in-all strategy. In the SERA framework, a single hypothesis based on the current training chunk is maintained.
2.1.2 Under-sampling
In this approach, centroids were drawn from the clusters formed to represent each of those clusters. The number of clusters formed was equal to the number of positive examples in the current training batch, and the current training batch was then updated by taking all positive examples along with the centroids of the clusters of negative samples. A new classifier was created on these sampled instances. Further, since the size of the ensemble was fixed, AUROC was used as the measure to select the best classifiers, from among those already present in the ensemble together with the new classifier built on the sampled instances, to be included in the ensemble. Weights of the classifiers were assigned on the basis of the calculated AUROC, and the ensemble thus formed was used to classify the instances in the test chunk. The work by Song et al. [28], which addresses the issue of skewed data streams in cloud security, follows a similar approach of using k-means clustering to draw centroids of clusters of the negative class in order to undersample it.
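The centroid-based undersampling step can be illustrated with a minimal k-means sketch, assuming numeric feature tuples and Euclidean distance; keeping only the k centroids (with k set to the number of positives in the batch) compresses the negative class down to the minority-class size.

```python
import random

def kmeans_centroids(points, k, iters=20):
    """Undersample a set of negative-class points by clustering them into
    k clusters and keeping only the cluster centroids."""
    centroids = random.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # assign each point to its nearest centroid (squared distance)
            i = min(range(k),
                    key=lambda j: sum((a - b) ** 2
                                      for a, b in zip(p, centroids[j])))
            clusters[i].append(p)
        centroids = [
            tuple(sum(d) / len(c) for d in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids
```

Each centroid summarises one region of the negative class, so the balanced training batch still reflects the overall shape of the majority class.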
Recently, Zhang et al. [36] proposed an approach called ECSDS (Ensemble Classifier for Skewed Data Streams), which aims at reducing the time needed for ensemble learning by updating the ensemble only when required. In this algorithm, the ensemble is initially built in a similar way to the approach of Gao et al. [16], but the mechanism for updating the ensemble is slightly different. When the ensemble makes predictions for a test batch, the positive instances that are misclassified are retained as MI. After prediction on two batches, the difference of the F1 values on these two batches is calculated. If the difference crosses a threshold value, the ensemble is updated using all misclassified positive instances MI and all previous positive instances AP, along with the negative instances in the current batch.
2.1.3 Cost Sensitive Learning
Cost sensitive learning is one of the important techniques of data mining. It assigns different misclassification penalties to each class. Cost sensitive learning has been incorporated into classification algorithms by taking the cost information into account and trying to optimize the overall cost during the learning process. In cost sensitive classification, the problem is dealt with by adjusting the learning: costs are associated with misclassification of the minority class, and the learner is adjusted based on a punishment-reward system. One advantage of this method is that the training dataset is unchanged, unlike oversampling and under-sampling, where the data distribution changes completely.
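A minimal sketch of the cost sensitive idea, using a hypothetical two-class cost matrix of our own choosing: instead of predicting the most probable class, the classifier predicts the class with the lowest expected misclassification cost.

```python
def min_cost_class(probs, cost):
    """Cost-sensitive decision rule.
    probs[c]   : classifier's estimated probability of class c
    cost[a][c] : cost of predicting a when the true class is c"""
    classes = list(probs)
    def expected_cost(action):
        return sum(probs[c] * cost[action][c] for c in classes)
    return min(classes, key=expected_cost)

# With a 10x penalty for missing the minority class '+', even a
# 30%-probable minority instance is predicted as minority.
probs = {"+": 0.3, "-": 0.7}
cost = {"+": {"+": 0, "-": 1},    # a false alarm costs 1
        "-": {"+": 10, "-": 0}}   # a missed minority instance costs 10
print(min_cost_class(probs, cost))  # → +
```

This is how the punishment-reward idea shifts the decision boundary towards the minority class without changing the training data at all.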
Polikar et al. [11] proposed an algorithm named Learn++.SMOTE, in which the distribution of instance weights is updated by evaluating the current training chunk Dn on the ensemble and accordingly adjusting the weights of misclassified instances to form a new instance-weighting distribution Dm. A new classifier is trained on this data chunk Dm, and a new data subset is then created by calling SMOTE. All the classifiers in the ensemble are evaluated on this synthetic dataset, and the classifiers with lower error are chosen to form the ensemble. In doing so, if the new classifier has an error greater than 0.5 it is discarded and a new one is created; but if an older classifier has an error greater than 0.5, its error is set to 0.5, since it might still be useful if the stream follows a cyclic nature.
Cooper et al. [26] proposed an online approach to deal with skewed data streams: an incremental learning algorithm for skewed data stream classification. They argue that in batch learning algorithms the model update is delayed until the next training chunk is received, but in some cases the model needs to be updated with every new instance. They build the initial ensemble model in a similar way to the approach of Gao et al. [16], with the difference that they draw a random subset of negative instances with replacement. Once the initial model is built, every new incoming positive instance is used to update all the base models. A new incoming negative instance is used to update the models only with probability p/n, where p and n are the numbers of positive and negative instances observed so far, respectively, so that negatives are used at roughly the minority rate.
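This per-instance update rule can be sketched as follows; `online_update`, `update_fn` and the counts dictionary are hypothetical names, and we assume the sampling rate is intended to keep the negatives used for updates at roughly the rate of the positives (probability p/n with p positives and n negatives seen so far).

```python
import random

def online_update(models, instance, label, counts, update_fn):
    """Online skew-aware update sketch: every positive (label 1) instance
    updates all base models; a negative (label 0) instance updates them
    only with probability p/n."""
    counts[label] += 1
    p, n = counts[1], counts[0]
    if label == 1 or random.random() < (p / n if n else 1.0):
        for m in models:
            update_fn(m, instance, label)
```

In expectation this feeds the models a roughly balanced mix of classes even when the raw stream is heavily skewed towards negatives.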
Liu et al. [25] proposed RDFCSDS (Reusing Data for Classifying Skewed Data Streams). In this algorithm, before the ensemble is trained, it is checked whether concept drift has occurred. If it has, the current ensemble is cleared and rebuilt from scratch by sampling data from the current training chunk as well as the previous training chunk, thus reusing the previous data. If no concept drift has occurred, a new classifier is built on the current training chunk, AUROC is calculated for all the classifiers in the ensemble, and the top k classifiers by AUROC value are chosen to form the ensemble. Prediction of instances in the test chunk is done by weighted majority voting.
From the above discussion we can see that most of the approaches proposed so far have used oversampling and/or under-sampling to balance the training chunks when dealing with skewed data streams, while some have opted for the cost sensitive classification approach.
Chapter 3
Problem Description
This chapter gives the motivation behind choosing the problem under consideration and how we arrived at the problem statement.
3.1 Motivation
While identifying a project topic in the area of data mining, we found that a lot of work has already been done in the different areas of data mining with respect to static datasets. Further, in the last decade the class imbalance problem in static datasets has drawn the attention of the data mining community, and various workshops at different conferences have been dedicated specifically to the problem of class imbalance. The first such workshop was organized back in 2000 at the AAAI 2000 conference; another workshop on Learning from Imbalanced Datasets was organized at ICML 2003. More recently, a workshop named Data Mining when Classes are Imbalanced and Errors have Cost was organized at the PAKDD 2009 conference.
Various points from the above discussion drew our attention to the problem of class imbalance. We found that a considerable amount of work has been done on the class imbalance problem for static datasets, so we went on to look for another area in which class imbalance is of more primary concern. We found that data streams are an area where class imbalance has not been thoroughly studied, and we then concentrated on various real-life applications in which data streams are prominent. Applications like network intrusion detection and financial fraud detection are areas characterised by stream data.
3.2 Problem Statement
3.2.1 How did we reach our problem statement

3.2.2 Why are Existing Classifiers Weak?
Most existing stream classifiers cannot be used to classify skewed data streams. These classifiers face the issues mentioned in Section 1.5. Further, they assume that the examples in the stream have a fairly balanced distribution over the different classes. Hence standard stream classifiers are dominated by the majority class instances, and thus they tend to treat the minority class instances as noise.
3.3 Evaluation Metrics
Although most stream classifiers measure their performance by overall accuracy, in the case of imbalanced datasets such a measure is not appropriate. Consider, for example, two classifiers tested on an imbalanced dataset with a class distribution ratio of 0.99:0.01. If the first classifier classifies all of the majority class correctly but none of the minority class, while the second classifies 97% of the majority class and a fraction 0.8 (80%) of the minority class correctly, then by overall accuracy the first classifier beats the second; but if the minority class is the main class of interest, the second classifier is the one that should be chosen.
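Working through this example with concrete, illustrative counts on a 10,000-instance dataset (interpreting the 0.8 figure as a fraction, i.e. 80% minority recall) makes the accuracy paradox explicit:

```python
def accuracy_and_recall(tp, fn, fp, tn):
    """Overall accuracy and minority-class recall from raw counts."""
    total = tp + fn + fp + tn
    accuracy = (tp + tn) / total
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, recall

# 10,000 instances, 99:1 skew.
# Classifier 1 predicts everything as the majority class;
# Classifier 2 trades a little majority accuracy for minority recall.
acc1, rec1 = accuracy_and_recall(tp=0,  fn=100, fp=0,   tn=9900)
acc2, rec2 = accuracy_and_recall(tp=80, fn=20,  fp=297, tn=9603)
print(acc1, rec1)  # → 0.99 0.0
print(acc2, rec2)  # → 0.9683 0.8
```

Classifier 1 "wins" on accuracy while being useless on the minority class, which is exactly why the metrics below are preferred for skewed streams.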
In such cases, various performance measures have been suggested in the literature. In this section we present the evaluation metrics that we have used.
Confusion Matrix: The columns of the confusion matrix represent the predictions, and the rows represent the actual classes. Correct predictions always lie on the diagonal of the matrix. Equation 3.1 shows the general structure of the confusion matrix.
    Confusion Matrix = | TP   FN |
                       | FP   TN |                      (3.1)
Here, True Positives (TP) is the number of minority instances that were correctly predicted, and True Negatives (TN) is the number of majority instances that were correctly predicted. False Positives (FP) is the number of majority instances that were incorrectly predicted as minority class instances, and False Negatives (FN) is the number of minority instances that were incorrectly predicted as majority class instances. Though the confusion matrix gives a better outlook on how the classifier performed than accuracy, a more detailed analysis is preferable, which is provided by the following metrics.
Recall: Recall gives the percentage of the actual minority class members that the classifier correctly identified; (TP + FN) is the total number of minority members. Recall is given by Equation 3.2.

    Recall = TP / (TP + FN)                             (3.2)
Precision: Precision gives the percentage of the instances labelled as minority by the model or classifier that actually belong to the minority class; (TP + FP) is the total number of positive predictions made by the classifier. Precision is given by Equation 3.3.

    Precision = TP / (TP + FP)                          (3.3)
Thus, in general, recall is said to be a completeness measure and precision an exactness measure. The ideal classifier would give a value of 1 for both recall and precision, but if a classifier gives a higher value (closer to one) for one of these metrics and a lower value for the other, choosing between classifiers becomes a difficult task. In such cases some other metrics, discussed below, are suggested in the literature.
Further, a few metrics based on the above have been suggested in the literature for use with imbalanced data sets or streams. Performance measures like AUROC (Area Under the ROC Curve) [12],[20] and G-Mean [17] are well suited to such situations.
    F-Measure = 2 / (1/Recall + 1/Precision)            (3.4)

    G-Mean = sqrt(Recall * TNR)                         (3.5)

    TNR = TN / (TN + FP)                                (3.6)
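A small sketch computing the metrics of Equations 3.2-3.6 directly from the confusion-matrix counts (the counts used in the example are illustrative):

```python
def stream_metrics(tp, fn, fp, tn):
    """Evaluation metrics for skewed streams, following Eqs. 3.2-3.6:
    recall, precision, F-Measure (harmonic mean), TNR and G-Mean."""
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f_measure = 2 / (1 / recall + 1 / precision)
    tnr = tn / (tn + fp)
    g_mean = (recall * tnr) ** 0.5
    return recall, precision, f_measure, tnr, g_mean

r, p, f, tnr, g = stream_metrics(tp=80, fn=20, fp=297, tn=9603)
# recall = 0.8, tnr = 0.97; precision, F-Measure and G-Mean follow.
```

Note how the same counts that gave a flattering 96.8% accuracy above yield a precision of only about 0.21, exposing the many false alarms.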
Area Under ROC Curve: The area under the ROC (Receiver Operating Characteristics) curve gives the probability that, when one majority and one minority class example are drawn at random, the decision function assigns a higher value to the minority class example than to the majority class example. AUROC is not sensitive to the class distributions in the dataset. Generally the ROC curve is plotted as the True Positive Rate versus the False Positive Rate. It was originally used in signal detection theory and the medical domain, where it is described as the plot of Sensitivity versus (1 - Specificity); since Sensitivity is the same as the True Positive Rate and (1 - Specificity) reduces to the False Positive Rate, both definitions are one and the same.
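The probabilistic definition above can be computed directly by comparing all positive-negative score pairs; this brute-force sketch is for illustration only (practical implementations use rank statistics instead).

```python
def auroc(scores_pos, scores_neg):
    """AUROC via its probabilistic definition: the chance that a randomly
    drawn minority (positive) example receives a higher score than a
    randomly drawn majority (negative) example; ties count as half."""
    pairs = [(p, n) for p in scores_pos for n in scores_neg]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p, n in pairs)
    return wins / len(pairs)

# 8 of the 9 positive-negative pairs are ranked correctly.
print(auroc([0.9, 0.8, 0.4], [0.7, 0.3, 0.2]))
```

Because the score counts every cross-class pair equally, shrinking the minority class does not dilute its contribution, which is why AUROC is insensitive to class skew.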
Chapter 4
Our Approach
As mentioned in the previous chapter, after the literature survey we came to the conclusion that the issue of skewed data streams certainly needs more attention and that there is still room for improvement in the performance of existing classifiers. We therefore came up with our own approach to deal with skewed data streams.
4.1 Approach to Deal with Skewed Data Streams
This section gives a brief description of our approach to the classification of data streams with skewed distribution. In general, it is as follows. Instead of using a single model built from a single training set, we propose to use multiple models built from different sets: an ensemble of classifiers/models to classify the data streams with skewed distributions. The use of an ensemble of classifiers has been proven effective for dealing with concept-drifting data streams [31]. In our approach we have also used oversampling so as to balance the training chunk contents. Fig 4.1 shows our approach in brief.
In our approach we process the data stream in chunks. If the sets of instances used for training the classifiers are disjoint, then the classifiers make uncorrelated errors, which can be eliminated by averaging [16]. Along similar lines, we build a new classifier for the ensemble for every incoming batch, such that the instances of the training chunks are as uncorrelated as possible.
The best way to learn from a data stream is to construct the model on the most recent training chunk. This works for instances of the majority class, because they are abundant in every chunk, but the minority class samples are very few. Some of the earlier approaches [7],[11] depend on creating synthetic minority class examples.
Figure 4.1: Our Approach to Classify Data Streams with Skewed Distributions
While balancing the training chunk we follow the k-nearest-neighbour approach. We select examples from the previous minority examples that are accumulated over time: to balance the training chunk, we find the k nearest neighbours in the current training chunk of each minority example stored from previous chunks, and an appropriate number of these minority samples is used to balance the chunk. Using this balanced training chunk we build a new classifier, which is then added to the ensemble.
The size of the ensemble is limited to 10. When the ensemble grows beyond this limit, the classifiers with the best AUROC (Area Under ROC Curve) values are retained. The process of learning is continuous, as can be seen from Fig 4.1: as the stream keeps coming in, the ensemble is continuously updated. Meanwhile, the test examples are classified by taking predictions from the ensemble. The predictions are made by weighted majority voting among the classifiers; the weights of the classifiers in the ensemble are assigned as per their AUROC values and are normalized while taking the majority vote.
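The weighted-majority-voting step could be sketched as follows; `weighted_vote` is an illustrative name and the AUROC weights in the example are made up.

```python
def weighted_vote(predictions, auroc_weights):
    """Ensemble prediction by weighted majority voting: each classifier's
    vote counts in proportion to its normalized AUROC weight."""
    total = sum(auroc_weights)
    weights = [w / total for w in auroc_weights]  # normalize
    tally = {}
    for label, w in zip(predictions, weights):
        tally[label] = tally.get(label, 0.0) + w
    return max(tally, key=tally.get)

# Three classifiers vote '-', but the two most reliable ones vote '+'.
print(weighted_vote(['+', '+', '-', '-', '-'],
                    [0.9, 0.85, 0.5, 0.5, 0.5]))  # → +
```

Weighting by AUROC rather than accuracy means classifiers that rank minority instances well get the loudest voice, which matters on skewed streams.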
Thus our approach consists of an oversampling-based approach together with the k-nearest-neighbour algorithm and an ensemble-based approach to deal with data streams with a skewed distribution of classes. Algorithm 1 presents in detail the k-nearest-neighbours algorithm used by our approach.
Algorithm 1: K Nearest Neighbours Algorithm
Input: Set of training examples, i.e. the set of examples from which the nearest neighbours are to be determined.
Output: K nearest neighbours.
begin
    Determine the parameter K = number of nearest neighbours.
    Calculate the distance between the query instance and all the training samples.
    Sort the distances and determine the nearest neighbours based on the K-th minimum distance.
    Return the K nearest neighbours.
end
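Algorithm 1 translates directly into code; this sketch assumes numeric feature tuples and Euclidean distance, which are our simplifying assumptions.

```python
def k_nearest_neighbours(query, examples, k):
    """Return the k training examples nearest to the query instance
    under Euclidean distance (Algorithm 1)."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    # sort by distance to the query, keep the k closest
    return sorted(examples, key=lambda e: dist(query, e))[:k]

retained = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0), (0.5, 0.2)]
print(k_nearest_neighbours((0.0, 0.0), retained, 2))
# → [(0.0, 0.0), (0.5, 0.2)]
```

In our approach, `examples` would be the minority instances retained from earlier chunks and `query` an instance of the current training chunk.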
4.2 Algorithm for Skewed Data Streams
Chapter 5
Experimental Results and Discussions
This chapter presents the results obtained from the experiments carried out on our algorithm.
5.1 Experimental Setup
This section summarises the experimental setup on which we performed our experiments and tested the effectiveness of our algorithm. For the implementation of our algorithm we used MOA [6], an open source framework for mining data streams. It is implemented in Java and contains a collection of various data stream mining algorithms; MOA being open source, we could easily add and integrate our algorithm into it. The other hardware and software details are as follows: we used MOA release 20110124 along with WEKA version 3.7.5 on the Windows 7 Professional operating system, running on a dual-core AMD Opteron 2210 processor @ 1.8 GHz with 2 GB RAM.
5.2 Datasets
To evaluate the performance of our algorithm we have tested it on various
synthetic as well as real world datasets. We have used datasets available in the
UCI repository [13] and the MOA dataset repository [1], along with several
synthetically generated datasets. These datasets were chosen
such that they model real world scenarios, have a variety of features, and vary
widely in size and class distribution.
5.2.1 Synthetic Datasets
This subsection presents the details of the synthetic datasets that we have
generated, the SEA and SPH datasets, each of which is briefly explained later in
this section. The main reason for generating these two datasets is that they
exhibit one of the main characteristics of data streams, namely concept drift.
Table 5.1 shows the characteristics of the synthetically generated datasets:
the number of instances, the number of majority class instances, the number of
minority class instances, the number of attributes, the chunk size, and the
ratio of majority class to minority class.
Table 5.1: Description of Synthetic Datasets used

Dataset    Instances   Max-Class   Min-Class   Attributes   Chunk-size   Ratio
SPH 1%     100,000     99,000      1,000       10           1000         0.99::0.01
SPH 3%     100,000     97,000      3,000       10           1000         0.97::0.03
SPH 5%     100,000     95,000      5,000       10           1000         0.95::0.05
SPH 10%    100,000     90,000      10,000      10           1000         0.90::0.10
SEA 1%     80,000      79,200      800         3            1000         0.99::0.01
SEA 3%     80,000      77,600      2,400       3            1000         0.97::0.03
SEA 5%     80,000      76,000      4,000       3            1000         0.95::0.05
SEA 10%    80,000      72,000      8,000       3            1000         0.90::0.10
Each instance of the SEA dataset [29] has 3 attributes or features, each taking
a value between 0 and 10. Only the first two attributes are relevant; the third
is added to act as noise. If the sum of the first two attributes crosses a
certain threshold value then the instance belongs to class 1, else it belongs to
class 2. The threshold values used are 8, 9, 7 and 9.5 for the four data blocks.
From each block some of the instances are reserved as test instances.
We generated 80,000 instances of the SEA dataset with different percentages of
class imbalance. These were used in batches of 1000 each: a batch of 1000
instances was used for training and the next 1000 instances were used for
testing the model built. We generated 4 variants of the SEA dataset, with 1%,
3%, 5% and 10% skew respectively. To each of these datasets we added 1% noise so
as to make the learning task more difficult; the noise was added by flipping the
correct class label to an incorrect one in the training set.
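The SEA generation and noise procedure described above can be sketched in Python (an illustration only; the thesis used MOA's generator, and the skew percentages are applied separately):

```python
import random

def sea_stream(n, thresholds=(8, 9, 7, 9.5), noise=0.01, seed=42):
    """SEA-style stream: 3 attributes in [0, 10], only the first two
    relevant; the threshold changes across 4 equal blocks (abrupt
    concept drift). `noise` is the probability of flipping a label."""
    rng = random.Random(seed)
    block = n // len(thresholds)
    stream = []
    for i in range(n):
        x = [rng.uniform(0, 10) for _ in range(3)]
        theta = thresholds[min(i // block, len(thresholds) - 1)]
        label = 1 if x[0] + x[1] > theta else 2   # third attribute is noise
        if rng.random() < noise:                  # label noise
            label = 2 if label == 1 else 1
        stream.append((x, label))
    return stream
```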
In the SPH dataset, each instance x = (x_1, ..., x_d) is labelled by a
hyperplane with weights a_1, ..., a_d, where the threshold a_0 is set to

    a_0 = (1/2) * sum_{i=1}^{d} a_i

and the class label l for an instance x is then defined as l = 1 if
sum_{i=1}^{d} a_i * x_i >= a_0, and l = 0 otherwise. As opposed to the abrupt
concept drift in the SEA dataset, the SPH dataset is characterised by gradual
concept drift: some of the coefficients a_i are sampled at random and have a
small increment added, defined as

    delta = s * (t / N)

where t is the magnitude of the change for every N examples and s, taking
values in the closed interval [-1, 1], specifies the direction of change; s has
a 20% chance of being reversed every N examples. a_0 is also recomputed
according to the equation above after each change. In this way the class
boundary acts like a spinning hyperplane in the process of creating the data.
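The spinning-hyperplane process just described can be sketched in Python. This is an illustrative generator under assumed parameter values (d = 10, half the weights drifting), not the exact generator used in the experiments:

```python
import random

def sph_stream(n, d=10, t=0.1, N=1000, seed=7):
    """Gradually drifting hyperplane: a_0 = (1/2) * sum(a_i); label is 1
    when sum(a_i * x_i) >= a_0. Each drifting weight is incremented by
    s * t / N per example, and the direction s is reversed with
    probability 0.2 every N examples."""
    rng = random.Random(seed)
    a = [rng.random() for _ in range(d)]
    drifting = list(range(d // 2))      # subset of weights that drift
    s = 1
    stream = []
    for i in range(n):
        x = [rng.random() for _ in range(d)]
        a0 = 0.5 * sum(a)               # threshold recomputed after drift
        label = 1 if sum(ai * xi for ai, xi in zip(a, x)) >= a0 else 0
        stream.append((x, label))
        for j in drifting:              # gradual drift of the boundary
            a[j] += s * t / N
        if (i + 1) % N == 0 and rng.random() < 0.2:
            s = -s                      # 20% chance to reverse direction
    return stream
```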
5.2.2 Results on Synthetic Datasets
We present graphical results of our approach in comparison with some other state
of the art algorithms. Details of the algorithms used for comparison are as
follows:

- Our approach, CDSSDC, which uses the k nearest neighbours approach to balance
the current training chunk; the ratio of positive to negative examples is kept
at 0.5.

- SMOTE [7], which creates synthetic examples to balance the dataset. The
implementation of SMOTE available in the WEKA toolkit [14] has been used with
WEKA's default settings, except for the SPH and SEA datasets, where SMOTE was
applied so as to maintain a ratio of positive to negative examples of 0.5.

- FLC [3], which was designed to deal with concept drift in data streams
efficiently.
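For reference, the core idea of SMOTE can be sketched in a few lines of Python. This is an illustrative sketch, not the WEKA implementation used above: each synthetic example is placed at a random point on the segment between a minority instance and one of its k nearest minority-class neighbours.

```python
import math
import random

def smote(minority, n_synthetic, k=5, seed=0):
    """Minimal SMOTE sketch: interpolate between a minority instance
    and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        x = rng.choice(minority)
        # k nearest minority-class neighbours of x (Euclidean distance)
        neighbours = sorted((p for p in minority if p != x),
                            key=lambda p: math.dist(p, x))[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()              # position along the segment
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, nb)))
    return synthetic
```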
From Fig. 5.1 we can clearly see that our approach performs better than SMOTE
and FLC in terms of AUROC on the SPH datasets with various imbalance levels. As
the class skew eases from 99:1 towards 90:10, FLC and SMOTE come closer in
performance to our approach.
Fig. 5.2 shows the comparative performance of our approach, SMOTE and FLC with
respect to G-mean on the SPH datasets. Here we can see that our approach
performs better than SMOTE and FLC for all the class distributions of the SPH
datasets shown. Again, the other algorithms close the gap as the class skew
eases.
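The skew-insensitive metrics reported in these figures can be computed as follows. This is an illustrative Python sketch of the standard definitions (G-mean as the geometric mean of sensitivity and specificity, F-Measure as the harmonic mean of precision and recall on the minority class), not code from the thesis:

```python
def gmean_fmeasure(y_true, y_pred, positive=1):
    """G-mean and F-Measure for a binary problem with minority
    class `positive`."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    sens = tp / (tp + fn) if tp + fn else 0.0   # recall on minority class
    spec = tn / (tn + fp) if tn + fp else 0.0
    prec = tp / (tp + fp) if tp + fp else 0.0
    gmean = (sens * spec) ** 0.5
    fmeasure = 2 * prec * sens / (prec + sens) if prec + sens else 0.0
    return gmean, fmeasure
```

Unlike overall accuracy, both metrics collapse to zero whenever the minority class is entirely misclassified, which is why they are preferred on skewed streams.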
In the presence of abrupt concept drift, the ensemble is not able to adjust
easily to these sudden changes in the stream behaviour.
5.2.3 Real World Datasets
This subsection gives the details of the real world datasets used for evaluating
the performance of our approach. We have used datasets collected and provided by
different repositories.
Table 5.2: Description of the Electricity dataset used

Dataset    Instances   Max-Class   Min-Class   Attributes   Chunk-size   Ratio
Elec2 5%   16,000      15,207      793         -            1000         0.95::0.05
This dataset did not have any class skew present, hence we extracted instances
in such a manner that a skewed data stream was obtained; while extracting the
instances, the order of the examples was not altered. We thus extracted 16,000
instances from the original dataset to form a dataset with 5% skew. Table 5.2
gives the detailed description of the Electricity dataset.
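The thesis does not give the exact extraction procedure, so the following hypothetical Python helper only sketches one way to derive such a skewed stream: keep every minority instance, subsample the majority instances, and never reorder the examples.

```python
import random

def extract_skewed(stream, minority_label, keep_prob, seed=3):
    """Hypothetical helper: form a skewed stream by keeping all minority
    instances and each majority instance with probability `keep_prob`,
    preserving the original arrival order."""
    rng = random.Random(seed)
    return [(x, y) for x, y in stream
            if y == minority_label or rng.random() < keep_prob]
```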
Table 5.3: Description of Real World Datasets used

Dataset     Instances   Max-Class   Min-Class   Attributes   Chunk-size   Ratio
Letter      20,000      19,195      805         17           3332         0.96::0.04
Connect-4   50,922      44,473      6,449       42           5009         0.88::0.12
Adult       40,402      32,561      7,841       14           4070         0.81::0.19
5.2.4 Results on Real World Datasets
We present graphical results of our approach in comparison with the state of the
art algorithms mentioned earlier in subsection 5.2.2.

Fig. 5.9 shows a comparison of the AUROC values obtained on the real world
datasets. Here we can observe that our approach matches or achieves better
performance than the others, except on the Letter dataset, where FLC achieves a
better result. This may be because FLC deals with the concept drift better than
our approach.
5.2.5 Effect of Varying Noise Levels
In this subsection we examine how our approach performs when the noise level is
changed. To study this behaviour, we tested the approach on the SPH dataset with
5% minority class instances and noise levels ranging from 1% to 5%. The
following graphs depict how the approach responds to the varied noise levels.
We can observe from Fig. 5.13 how the approach behaves on the SPH dataset as the
noise level is varied.

Figure 5.16: Effect on Overall Accuracy when noise levels are varied
We can see from Fig. 5.15 that as the noise level increases there is a slight
drop in the F-Measure values obtained. The F-Measure values are almost the same
at noise levels of 2% and 4%, but show a slight decrease at noise levels of 3%
and 5%.
We can observe from Fig. 5.16 that as the noise level increases the overall
accuracy remains almost the same, with only very slight variations. This is
because even though the noise level rises to 5%, the majority class still
accounts for most of the instances, and it is the major driving factor behind
the overall accuracy in this case.
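This effect is easy to demonstrate with a small illustrative calculation (hypothetical numbers matching the 95::5 skew used above):

```python
# A classifier that always predicts the majority class still reaches 95%
# overall accuracy on a 95::5 stream, which is why overall accuracy barely
# reacts to noise while G-mean and F-Measure do.
y_true = [0] * 95 + [1] * 5      # 95% majority class, 5% minority class
y_pred = [0] * 100               # always predict the majority class
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)                  # 0.95, despite zero recall on the minority
```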
Chapter 6
Conclusion and Future Work
This chapter presents the conclusions drawn from the project work and the future
work that can be carried out.
6.1 Conclusion
In this project work we have developed an approach to deal with skewed data
streams using oversampling and the k nearest neighbours technique. We have
evaluated our algorithm on various real world as well as synthetic datasets with
a variety of features and imbalance levels. The results obtained indicate that
our approach deals well with skewed data streams; in particular, it has shown
comparable, and in some cases slightly better, performance in terms of Area
Under the ROC Curve, F-Measure and G-Mean, except in a few cases where it stays
behind some of the other algorithms.

As seen earlier, various real life data stream applications such as financial
fraud detection and network intrusion detection are characterised by skewed data
streams, and in such cases this approach would help identify and classify
minority class instances appropriately.
6.2 Future Work
This section outlines the future enhancements that can be carried out. One
enhancement would be to implement the approach on a parallel computing platform,
which would help reduce the time it requires; there are various parts of the
approach into which parallelism can be introduced.

We are also working on an approach that combines two different stream
classification approaches so as to get the best out of both algorithms. By
combining our approach with another that handles general data streams, or data
streams with a balanced class distribution, we would like to extend the scope of
our approach so that it is applicable to data streams with any kind of class
distribution.
Appendix A
Publications
This appendix lists our research papers that have been accepted and the papers
that are under review.
A.1
A.2
Bibliography
[1] MOA dataset repository: http://moa.cs.waikato.ac.nz/datasets/.

[2] Charu C. Aggarwal, Jiawei Han, Jianyong Wang, and Philip S. Yu. On demand
classification of data streams. In Proceedings of the tenth ACM SIGKDD
international conference on Knowledge discovery and data mining, KDD '04, pages
503–508, New York, NY, USA, 2004. ACM.

[3] Vahida Attar, Pradeep Sinha, and Kapil Wankhade. A fast and light classifier
for data streams. Evolving Systems, 1:199–207, 2010.

[4] Brian Babcock, Shivnath Babu, Mayur Datar, Rajeev Motwani, and Jennifer
Widom. Models and issues in data stream systems. In Proceedings of the
twenty-first ACM SIGMOD-SIGACT-SIGART symposium on Principles of database
systems, PODS '02, pages 1–16, New York, NY, USA, 2002. ACM.

[5] Stephen Bay, Krishna Kumaraswamy, Markus G. Anderle, Rohit Kumar, and David
M. Steier. Large scale detection of irregularities in accounting data. In
Proceedings of the Sixth International Conference on Data Mining, ICDM '06,
pages 75–86, Washington, DC, USA, 2006. IEEE Computer Society.

[6] Albert Bifet, Geoff Holmes, Richard Kirkby, and Bernhard Pfahringer. MOA:
Massive online analysis. Journal of Machine Learning Research, 11:1601–1604,
2010.

[7] Nitesh V. Chawla, Kevin W. Bowyer, Lawrence O. Hall, and W. Philip
Kegelmeyer. SMOTE: synthetic minority over-sampling technique. J. Artif. Int.
Res., 16:321–357, June 2002.

[8] S. Chen and H. He. Towards incremental learning of nonstationary imbalanced
data stream: a multiple selectively recursive approach. Evolving Systems, pages
1–16, 2011.
[9] Sheng Chen and Haibo He. SERA: Selectively recursive approach towards
nonstationary imbalanced stream data mining. In Neural Networks, 2009. IJCNN
2009. International Joint Conference on, pages 522–529, June 2009.

[10] Sheng Chen, Haibo He, Kang Li, and S. Desai. MuSeRA: Multiple selectively
recursive approach towards imbalanced stream data mining. In Neural Networks
(IJCNN), The 2010 International Joint Conference on, pages 1–8, July 2010.

[11] G. Ditzler, R. Polikar, and N. Chawla. An incremental learning algorithm
for non-stationary environments and class imbalance. In Pattern Recognition
(ICPR), 2010 20th International Conference on, pages 2997–3000, Aug. 2010.

[12] Tom Fawcett. ROC graphs: Notes and practical considerations for
researchers. Technical report, 2004.

[13] A. Frank and A. Asuncion. UCI machine learning repository, 2010.

[14] E. Frank, M. Hall, G. Holmes, R. Kirkby, B. Pfahringer, I.H. Witten, and
L. Trigg. WEKA. In Data Mining and Knowledge Discovery Handbook, pages
1305–1314, 2005.

[15] Peter W. Frey and David J. Slate. Letter recognition using Holland-style
adaptive classifiers. Machine Learning, 6:161–182, 1991.

[16] J. Gao, W. Fan, J. Han, and P.S. Yu. A general framework for mining
concept-drifting data streams with skewed distributions. In Proc. of SIAM ICDM,
2007.

[17] Qiong Gu, Li Zhu, and Zhihua Cai. Evaluation measures of the classification
performance of imbalanced data sets. In Computational Intelligence and
Intelligent Systems, volume 51, pages 461–471. Springer Berlin Heidelberg, 2009.

[18] Jiawei Han and Micheline Kamber. Data Mining: Concepts and Techniques.
Morgan Kaufmann Publishers, 2nd edition, 2006.

[19] Michael Harries and New South Wales. Splice-2 comparative evaluation:
Electricity pricing, 1999.

[20] Haibo He and Edwardo A. Garcia. Learning from imbalanced data. IEEE Trans.
Knowl. Data Eng., 21(9):1263–1284, 2009.
[21] Geoff Hulten, Laurie Spencer, and Pedro Domingos. Mining time-changing data
streams. In Proceedings of the seventh ACM SIGKDD international conference on
Knowledge discovery and data mining, KDD '01, pages 97–106, New York, NY, USA,
2001. ACM.

[22] Miroslav Kubat, Robert C. Holte, and Stan Matwin. Machine learning for the
detection of oil spills in satellite radar images. Mach. Learn.,
30(2-3):195–215, February 1998.

[23] Vipin Kumar, Pang-Ning Tan, and Michael Steinbach. Introduction to Data
Mining. Addison-Wesley, 2006.

[24] Yanling Li, Guoshe Sun, and Yehang Zhu. Data imbalance problem in text
classification. In Proceedings of the 2010 Third International Symposium on
Information Processing, ISIP '10, pages 301–305, Washington, DC, USA, 2010. IEEE
Computer Society.

[25] Peng Liu, Yong Wang, Lijun Cai, and Longbo Zhang. Classifying skewed data
streams based on reusing data. In Computer Application and System Modeling
(ICCASM), 2010 International Conference on, volume 4, pages V4-90–V4-93, Oct.
2010.

[26] H.M. Nguyen, E.W. Cooper, and K. Kamei. Online learning from imbalanced
data streams. In Soft Computing and Pattern Recognition (SoCPaR), 2011
International Conference of, pages 347–352, Oct. 2011.

[27] Dan Pelleg and Andrew Moore. Active learning for anomaly and rare-category
detection. In Advances in Neural Information Processing Systems 18, pages
1073–1080. MIT Press, 2004.

[28] Qun Song, Jun Zhang, and Qian Chi. Assistant detection of skewed data
streams classification in cloud security. In Intelligent Computing and
Intelligent Systems (ICIS), 2010 IEEE International Conference on, volume 1,
pages 60–64, Oct. 2010.

[29] W. Nick Street and YongSeog Kim. A streaming ensemble algorithm (SEA) for
large-scale classification. In Proceedings of the seventh ACM SIGKDD
international conference on Knowledge discovery and data mining, KDD '01, pages
377–382, New York, NY, USA, 2001. ACM.

[30] Pavan Vatturi and Weng-Keen Wong. Category detection using hierarchical
mean shift. In Proceedings of the 15th ACM SIGKDD international conference on
Knowledge discovery and data mining, KDD '09, pages 847–856, New York, NY, USA,
2009. ACM.
[31] Haixun Wang, Wei Fan, Philip S. Yu, and Jiawei Han. Mining concept-drifting
data streams using ensemble classifiers. In Proceedings of the ninth ACM SIGKDD
international conference on Knowledge discovery and data mining, KDD '03, pages
226–235, New York, NY, USA, 2003. ACM.

[32] Yi Wang, Yang Zhang, and Yong Wang. Mining data streams with skewed
distribution by static classifier ensemble. In Been-Chian Chien and Tzung-Pei
Hong, editors, Opportunities and Challenges for Next-Generation Applied
Intelligence, volume 214 of Studies in Computational Intelligence, pages 65–71.
Springer Berlin / Heidelberg, 2009.

[33] Zhe Wang, William Josephson, Qin Lv, Moses Charikar, and Kai Li. Filtering
image spam with near-duplicate detection. In Proceedings of the Fourth
Conference on Email and Anti-Spam, CEAS 2007, 2007.

[34] Gang Wu and Edward Y. Chang. Class-boundary alignment for imbalanced
dataset learning. In ICML 2003 Workshop on Learning from Imbalanced Data Sets,
pages 49–56, 2003.

[35] Junjie Wu, Hui Xiong, Peng Wu, and Jian Chen. Local decomposition for rare
class analysis. In Proceedings of the 13th ACM SIGKDD international conference
on Knowledge discovery and data mining, KDD '07, pages 814–823, New York, NY,
USA, 2007. ACM.

[36] Juan Zhang, Xuegang Hu, Yuhong Zhang, and Pei-Pei Li. An efficient ensemble
method for classifying skewed data streams. In De-Shuang Huang, Yong Gan,
Prashan Premaratne, and Kyungsook Han, editors, ICIC (3), volume 6840 of Lecture
Notes in Computer Science, pages 144–151. Springer, 2011.