Dissertation
submitted in partial fulfillment of the requirements
for the degree of
Master of Technology, Computer Engineering
by
Abhijeet B. Godase
Roll No: 121022003
under the guidance of
Prof. V. Z. Attar
Dedicated to
my mother
Smt. Aruna B. Godase
and
my father
Shri. Balasaheb J. Godase
for their love, endless support
and encouragement.
CERTIFICATE
This is to certify that the dissertation titled
Prof. V. Z. Attar,
Guide,
Department of Computer Engineering
and Information Technology,
College of Engineering, Pune,
Shivaji Nagar, Pune-411005.
Date :
Abstract
The emerging domain of data stream mining is one of the important areas of research for the data mining community. Data streams in various real-life applications are characterized by concept drift. Such data streams may also exhibit skewed or imbalanced class distributions, for example in financial fraud detection and network intrusion detection. In such cases, the skewed class distribution of the stream compounds the problems associated with classifying stream instances. Learning from such skewed data streams results in a classifier which is biased towards the majority class, so the model built on such streams tends to misclassify the minority class examples. In some applications, for instance financial fraud detection, identifying fraudulent transactions is the main focus, because misclassification of such minority class instances might result in heavy losses, in this case financial. Increasingly higher losses due to misclassification of minority class instances cannot be ruled out in many other data stream applications either. The challenge, therefore, is to pro-actively identify such minority class instances in order to avoid the losses associated with them. As an effort in this direction, we propose a method that uses a k-nearest-neighbours approach together with an oversampling technique to classify such skewed data streams. Oversampling is achieved by reusing minority class examples retained from the stream as time progresses. Experimental results show that our approach achieves good classification performance on synthetic as well as real-world datasets.
Acknowledgements
I express my sincere gratitude towards my guide Prof. V. Z. Attar for her constant help, encouragement and inspiration throughout the project work. Without her invaluable guidance, this work would never have been successful. She also guided me in the essentials of time management, efficient organization and presentation skills. I would also like to thank the Head of the Department, Dr. Jibi Abraham, and all other faculty members who made my journey of post-graduation and technical learning such an enjoyable experience.
I would also like to thank Dr. Albert Bifet, University of Waikato, New Zealand, for his help and guidance during the implementation of the project. Last but not least, I would like to thank all my classmates for their support and help throughout the course of the project.
Abhijeet B. Godase
College of Engineering, Pune
May 30, 2012
Contents

Abstract  iii
Acknowledgements  iv
List of Tables  viii
List of Figures  viii

1 Introduction  1
  1.1 Introduction to Data Mining and Techniques  1
      1.1.1 Classification  2
  1.2 An Overview of Data Streams  2
  1.3 Data Stream Classification  3
  1.4 An Overview of Skewed Data Sets in the Real World  4
  1.5 Issues in Learning from Skewed Data Streams  6
  1.6 Thesis Outline  7

2 Literature Survey  8
  2.1 Overview of Methods for Dealing with Skewed Data Streams - Traditional Approaches  8
      2.1.1 Oversampling  9
      2.1.2 Under-sampling  10
      2.1.3 Cost Sensitive Learning  11

3 Problem Description  14
  3.1 Motivation  14
  3.2 Problem Statement  15
      3.2.1 How did we reach our problem statement  16
      3.2.2 Why are Existing Classifiers Weak?  16
  3.3 Evaluation Metrics  16

4 Our Approach  20
  4.1 Approach to Deal with Skewed Data Streams  20
  4.2 Algorithm for Skewed Data Streams  22

5 Experimental Results and Discussions  25
  5.1 Experimental Setup  25
  5.2 Datasets  25
      5.2.1 Synthetically Generated Datasets  26
      5.2.2 Results of Our Approach on Synthetically Generated Datasets  28
      5.2.3 Real World Datasets  33
      5.2.4 Results of Our Approach on Real World Datasets  35
      5.2.5 Effect of Noise Level on the Performance of the Approach  37

List of Tables

5.1 Description of Synthetic Datasets used
5.2
5.3

List of Figures

1.1 Classification as a task of mapping input attribute set x into its class label y  2
1.2 Classification model in data streams  3
1.3 Skewed Distributions, each data chunk has fewer positive examples than negative examples  4
4.1 Our Approach to Classify Data Streams with Skewed Distributions  21
5.1  29
5.2  29
5.3  30
5.4  30
5.5  31
5.6  31
5.7  32
5.8  32
5.9  35
5.10  35
5.11  36
5.12  37
5.13  37
5.14  38
5.15  38
5.16  39
Chapter 1
Introduction
This chapter introduces the basics of the project area and gives some insight into the problem domain under consideration.
1.1 Introduction to Data Mining and Techniques
In the internet age, the explosion of data and information has resulted in huge amounts of data. Fortunately, data mining techniques exist to gather knowledge from such abundant data. As defined by Jiawei Han in his book Data Mining: Concepts and Techniques [18], data mining is the extraction of interesting, non-trivial, implicit, previously unknown and potentially useful patterns or knowledge from huge amounts of data. Data mining has been used in various areas such as health care, business intelligence, financial trade analysis, network intrusion detection, etc.
The general process of knowledge discovery from data involves data cleaning, data integration, data selection, data mining, pattern evaluation and knowledge presentation. Data cleaning and data integration constitute data preprocessing, in which data is processed so that it becomes appropriate for the data mining step. Data mining forms the core of the knowledge discovery process. There exist various data mining techniques, viz. classification, clustering, association rule mining, etc. Our work mainly falls under the classification technique.
1.1.1 Classification

Figure 1.1: Classification as a task of mapping input attribute set x into its class label y
Classification is a pervasive problem that encompasses many diverse applications, from static datasets to data streams. Classification tasks have been employed on static data over the years. In the last decade, more and more applications featuring data streams have evolved, and these pose a challenge to traditional classification algorithms.
1.2 An Overview of Data Streams
Many real-world applications, such as network traffic monitoring, credit card transactions, real-time surveillance systems, electric power grids, remote sensors, web click streams, etc., generate continuously arriving data known as data streams [4],[2]. Unlike traditional data sets, data streams arrive continuously at varying speeds. Data streams are fast changing, temporally ordered, potentially infinite and massive [18]. It may be impossible to store the entire data stream in memory or to go through it more than once due to its voluminous nature. Thus there is a need for algorithms that process the stream in a single pass using limited memory.
1.3 Data Stream Classification
Since classification can help decision making by predicting class labels for given data based on past records, classification of stream data has been extensively studied in recent years, and many interesting algorithms have been developed; some of them are cited here: [4],[21].
Fig 1.2 depicts the classification model in data streams. As shown in Fig 1.2, data chunks C1, C2, C3, ..., Ci arrive one by one.
Figure 1.3: Skewed Distributions, each data chunk has fewer positive examples
than negative examples
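The chunk-wise arrival just described can be sketched as follows; `chunks` is an illustrative helper (not part of any particular stream framework) that groups a potentially unbounded iterable of instances into fixed-size chunks C1, C2, C3, ...

```python
from itertools import islice

def chunks(stream, chunk_size):
    """Group a (potentially unbounded) iterable of instances into
    fixed-size chunks as they arrive; the final chunk may be shorter."""
    it = iter(stream)
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            return
        yield chunk

# Example: a stream of 10 instances processed in chunks of 4.
print([len(c) for c in chunks(range(10), 4)])  # → [4, 4, 2]
```

A classifier would be trained on each completed chunk before the next one is consumed, which is the processing model assumed throughout this thesis.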
1.4 An Overview of Skewed Data Sets in the Real World
The rate at which science and technology have developed has resulted in a proliferation of data at an exponential pace. This unrestrained increase in data has intensified the need for various data mining applications. This huge data is in no way necessarily equally distributed. Class skew or class imbalance refers to domains in which the instances of one class outnumber those of another, i.e. some classes, known as majority classes, occupy most of the dataset, while the remaining classes are known as minority classes. The most vital issue in such data sets is that, compared to the majority classes, the minority classes are often of much more significance and interest to the user.
There are many real-world applications whose datasets exhibit such a skewed nature. The following paragraphs give an overview of some real-world problems of this kind.
The above real-world examples signify the importance of dealing with imbalanced data sets. Most of them also fall into the category of skewed data streams. Most learning algorithms work well with balanced data streams, as their aim is to improve overall accuracy or a related measure. When such algorithms are applied to skewed data streams, their accuracy in classifying majority examples is good, but their accuracy in classifying minority examples is poor. This happens because learning from imbalanced/skewed streams causes the learner to become biased towards the majority class, so the minority examples are likely to be misclassified. Thus the main issue the research community is working on with regard to skewed data streams is correctly classifying the minority instances without affecting the accuracy on majority instances. In recent years, learning from skewed data streams has been recognized as one of the crucial problems in machine learning and data mining. In short, the principal problem is improving the accuracy on both the minority and the majority class instances of the stream.
1.5 Issues in Learning from Skewed Data Streams
In general, learning from skewed data streams is challenging due to the following issues.
1. Evaluation metric: The appropriate choice of evaluation metric is important in this domain. Evaluation metrics play a vital role in data mining; they are used to guide the algorithms to the desired solution. If the evaluation metric does not take the minority class into consideration, the learning algorithm will not be able to cope well with skewed data streams. Standard evaluation metrics like overall accuracy are not valid in this case, because even when minority class instances are misclassified the overall accuracy may remain high, the primary reason being the negligible number of minority class instances.
2. Lack of minority class data for training: In skewed data streams, the lack of minority class data makes it difficult to learn the class boundaries, as very few instances are available. Training a classifier in such situations is therefore very difficult.
3. Treatment of minority class data as noise: One of the major issues is noise. Noisy data in the stream affects the minority class more than the majority class.
1.6 Thesis Outline
Chapter 2 gives a brief description of the literature survey we carried out.
Chapter 3 describes our problem statement and how we arrived at it. It also focuses on the different evaluation metrics used throughout this thesis.
Chapter 4 elaborates our approach to dealing with data streams with a skewed distribution of classes and explains our algorithm and its implementation details.
Chapter 5 contains the results and discussions of the implemented algorithm and the evaluation of its performance.
Chapter 6 provides the conclusion and the future enhancements that can be carried out.
Chapter 2
Literature Survey
This chapter gives brief details of the literature survey we carried out in the area of imbalanced datasets and the skewed data streams problem.
2.1 Overview of Methods for Dealing with Skewed Data Streams - Traditional Approaches
We went through the various methods available in the literature for dealing with imbalanced datasets, and here portray some of the well-known and most popular approaches, algorithms and methods that have been devised to deal with skewed data streams. Some of the books we referred to for an effective understanding of data mining concepts are Data Mining: Concepts and Techniques by Han and Kamber [18] and Introduction to Data Mining by Kumar et al. [23]. The literature contains a number of methods addressing the class imbalance problem, but the area of skewed data streams is relatively new to the research community. The sampling-based and ensemble algorithms are the simplest yet most effective ones. The following paragraphs provide a brief overview of them.
Some of the approaches for dealing with skewed data streams are categorised under the following methods.
Oversampling.
Under-sampling.
Cost Sensitive Learning.
Oversampling and under-sampling are sampling-based preprocessing methods in data mining. The main idea in these methods is to manipulate the data distribution so that all classes are well represented in the training or learning datasets. Recent studies in this domain have shown that sampling is an effective way to deal with such problems. Cost sensitive learning, in essence, associates a cost with misclassifying examples in order to penalise the classifier.
2.1.1 Oversampling
Oversampling is one of the sampling-based preprocessing techniques in data mining. In oversampling, the number of minority class instances is increased either by reusing instances from previous training/learning chunks or by creating synthetic examples. Oversampling tries to strike a balance in the ratio of majority to minority classes. One advantage of this method is that normal stream classification methods can then be used. The most commonly used oversampling method is SMOTE (Synthetic Minority Oversampling Technique) [7]. Some of the oversampling-based approaches in the literature are discussed below.
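As a rough illustration of the idea behind SMOTE, the sketch below creates synthetic minority examples by interpolating between a minority example and one of its k nearest minority-class neighbours. The function name, plain Euclidean distance and numeric-only features are our simplifying assumptions, not the exact algorithm of [7].

```python
import random

def smote(minority, n_synthetic, k=5):
    """Minimal SMOTE-style sketch: each synthetic example lies on the
    segment between a random minority example and one of its k nearest
    minority-class neighbours (Euclidean distance, numeric features)."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    synthetic = []
    for _ in range(n_synthetic):
        x = random.choice(minority)
        neighbours = sorted((m for m in minority if m is not x),
                            key=lambda m: dist(x, m))[:k]
        nb = random.choice(neighbours)
        gap = random.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(xi + gap * (ni - xi)
                               for xi, ni in zip(x, nb)))
    return synthetic
```

Because every synthetic point is an interpolation, it stays inside the region already occupied by the minority class rather than duplicating existing examples.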
Most of the available stream classification algorithms assume that the streams have a balanced distribution of classes. In the last few years, a few attempts have been made to address the problem of skewed data streams. The first such attempt was by Gao et al. [16], who proposed the SE (Sampling + Ensemble) approach, which processes the stream in batches. In the SE approach, each classifier in the ensemble is trained by drawing an uncorrelated sample of negative instances together with all the positive instances in the current training chunk as well as the positive instances of all previous training chunks. Thus in the SE approach, oversampling of positive instances is done by incorporating old positive examples, along with under-sampling by way of using disjoint subsets of negative examples.
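The SE-style construction of per-classifier training sets could be sketched as follows; `se_training_sets` and the `neg_frac` sampling rate are illustrative assumptions, and disjoint slices of the shuffled negatives stand in for the paper's uncorrelated samples.

```python
import random

def se_training_sets(chunks_so_far, n_classifiers, neg_frac=0.3):
    """SE-style sketch: every classifier sees ALL positives accumulated
    from past and present chunks, plus its own disjoint random sample of
    the current chunk's negatives. Instances are (features, label) pairs
    with label 1 = minority/positive, 0 = majority/negative."""
    positives = [x for c in chunks_so_far for x, y in c if y == 1]
    negatives = [x for x, y in chunks_so_far[-1] if y == 0]
    random.shuffle(negatives)
    per = max(1, int(len(negatives) * neg_frac))
    sets = []
    for i in range(n_classifiers):
        sample = negatives[i * per:(i + 1) * per]  # disjoint slices
        sets.append((positives, sample))
    return sets
```

Averaging classifiers trained on such disjoint negative samples is what lets their uncorrelated errors cancel out, as [16] argues.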
The SERA (Selectively Recursive Approach) framework was proposed by Chen and He [9]. In this framework, minority examples from previous chunks are selectively absorbed into the current training chunk to balance it; the similarity measure used to select minority examples from previous chunks is the Mahalanobis distance. This differs from the approach of Gao et al. [16], which uses a take-in-all strategy. In the SERA framework, a single hypothesis based on the current training chunk is maintained.
2.1.2 Under-sampling
In this approach, centroids were drawn from the clusters formed to represent each of those clusters. The number of clusters formed was equal to the number of positive examples in the current training batch, and the current training batch was then updated by taking all positive examples along with the centroids of the clusters of negative samples. A new classifier was created on these sampled instances. Further, since the size of the ensemble was fixed, AUROC was used as the measure to select the best classifiers, from among those already present in the ensemble together with the new classifier built on the sampled instances, to be included in the ensemble. Weights of the classifiers were assigned on the basis of the calculated AUROC, and the ensemble thus formed was used to classify the instances in the test chunk. The work by Song et al. [28], which addresses the issue of skewed data streams in cloud security, follows a similar approach of using k-means clustering to draw centroids of clusters of the negative class in order to undersample it.
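The centroid-based undersampling step can be illustrated with a minimal k-means sketch, assuming numeric feature tuples and Euclidean distance; keeping only the k centroids (with k set to the number of positives in the batch) compresses the negative class down to the minority-class size.

```python
import random

def kmeans_centroids(points, k, iters=20):
    """Undersample a set of negative-class points by clustering them into
    k clusters and keeping only the cluster centroids."""
    centroids = random.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # assign each point to its nearest centroid (squared distance)
            i = min(range(k),
                    key=lambda j: sum((a - b) ** 2
                                      for a, b in zip(p, centroids[j])))
            clusters[i].append(p)
        centroids = [
            tuple(sum(d) / len(c) for d in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids
```

Each centroid summarises one region of the negative class, so the balanced training batch still reflects the overall shape of the majority class.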
Recently, Zhang et al. [36] proposed an approach called ECSDS (Ensemble Classifier for Skewed Data Streams), which aims at reducing the time needed for ensemble learning by updating the ensemble only when required. In this algorithm, the ensemble is initially built in a similar way to the approach of Gao et al. [16], but the mechanism for updating the ensemble is slightly different. When the ensemble makes predictions for a test batch, the positive instances that are misclassified are retained as MI. After prediction on two batches, the difference of the F1 values on these two batches is calculated. If the difference crosses a threshold value, the ensemble is updated using all misclassified positive instances MI and all previous positive instances AP, along with the negative instances in the current batch.
2.1.3 Cost Sensitive Learning
Cost sensitive learning is one of the important techniques of data mining. It assigns different misclassification penalties to each class. Cost sensitive learning has been incorporated into classification algorithms by taking the cost information into account and trying to optimize the overall cost during the learning process. In cost sensitive classification, the problem is dealt with by adjusting the learning: costs are associated with misclassification of the minority class, and the learner is adjusted based on a punishment-reward system. One advantage of this method is that the training dataset is unchanged, unlike oversampling and under-sampling, where the data distribution changes completely.
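A minimal sketch of the cost sensitive idea, using a hypothetical two-class cost matrix of our own choosing: instead of predicting the most probable class, the classifier predicts the class with the lowest expected misclassification cost.

```python
def min_cost_class(probs, cost):
    """Cost-sensitive decision rule.
    probs[c]   : classifier's estimated probability of class c
    cost[a][c] : cost of predicting a when the true class is c"""
    classes = list(probs)
    def expected_cost(action):
        return sum(probs[c] * cost[action][c] for c in classes)
    return min(classes, key=expected_cost)

# With a 10x penalty for missing the minority class '+', even a
# 30%-probable minority instance is predicted as minority.
probs = {"+": 0.3, "-": 0.7}
cost = {"+": {"+": 0, "-": 1},    # a false alarm costs 1
        "-": {"+": 10, "-": 0}}   # a missed minority instance costs 10
print(min_cost_class(probs, cost))  # → +
```

This is how the punishment-reward idea shifts the decision boundary towards the minority class without changing the training data at all.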
Polikar et al. [11] proposed an algorithm named Learn++.SMOTE, in which the distribution of instance weights is updated by evaluating the current training chunk Dn on the ensemble and accordingly adjusting the weights of misclassified instances to form a new instance-weighting distribution Dm. A new classifier is trained on this data chunk Dm, and a new data subset is then created by calling SMOTE. All the classifiers in the ensemble are evaluated on this synthetic dataset, and the classifiers with lower error are chosen to form the ensemble. In doing so, if the new classifier has an error greater than 0.5 it is discarded and a new one is created; but if an older classifier has an error greater than 0.5, its error is set to 0.5, since it might still be useful if the stream follows a cyclic nature.
Cooper et al. [26] proposed an online approach to deal with skewed data streams: an incremental learning algorithm for skewed data stream classification. They argue that in batch learning algorithms the model update is delayed until the next training chunk is received, but in some cases the model needs to be updated with every new instance. They build the initial ensemble model in a similar way to the approach of Gao et al. [16], with the difference that they draw a random subset of negative instances with replacement. Once the initial model is built, every new incoming positive instance is used to update all the base models. A new incoming negative instance is used to update the models only with probability p/n, where p and n are the numbers of positive and negative instances observed so far, respectively, so that negatives are used at roughly the minority rate.
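This per-instance update rule can be sketched as follows; `online_update`, `update_fn` and the counts dictionary are hypothetical names, and we assume the sampling rate is intended to keep the negatives used for updates at roughly the rate of the positives (probability p/n with p positives and n negatives seen so far).

```python
import random

def online_update(models, instance, label, counts, update_fn):
    """Online skew-aware update sketch: every positive (label 1) instance
    updates all base models; a negative (label 0) instance updates them
    only with probability p/n."""
    counts[label] += 1
    p, n = counts[1], counts[0]
    if label == 1 or random.random() < (p / n if n else 1.0):
        for m in models:
            update_fn(m, instance, label)
```

In expectation this feeds the models a roughly balanced mix of classes even when the raw stream is heavily skewed towards negatives.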
Liu et al. [25] proposed RDFCSDS (Reusing Data for Classifying Skewed Data Streams). In this algorithm, before the ensemble is trained, it is checked whether concept drift has occurred. If it has, the current ensemble is cleared and rebuilt from scratch by sampling data from the current training chunk as well as the previous training chunk, thus reusing the previous data. If no concept drift has occurred, a new classifier is built on the current training chunk, AUROC is calculated for all the classifiers in the ensemble, and the top k classifiers by AUROC value are chosen to form the ensemble. Prediction of instances in the test chunk is done by weighted majority voting.
From the above discussion we can see that most of the approaches proposed so far have used oversampling and/or under-sampling to balance the training chunks when dealing with skewed data streams, while some have opted for the cost sensitive classification approach.
Chapter 3
Problem Description
This chapter gives the motivation behind choosing the problem under consideration and how we arrived at the problem statement.
3.1 Motivation
While identifying a project topic in the area of data mining, we found that a lot of work has already been done in the different areas of data mining with respect to static datasets. Further, in the last decade the class imbalance problem in static datasets has drawn the attention of the data mining community, and various workshops at different conferences have been dedicated specifically to the problem of class imbalance. The first such workshop was organized back in 2000 at the AAAI 2000 conference; another workshop on Learning from Imbalanced Datasets was organized at ICML 2003. More recently, a workshop named Data Mining when Classes are Imbalanced and Errors have Cost was organized at the PAKDD 2009 conference.
Various points from the above discussion drew our attention to the problem of class imbalance. We found that a considerable amount of work has been done on the class imbalance problem for static datasets, so we went on to look for another area in which class imbalance is of more primary concern. We found that data streams are an area where class imbalance has not been thoroughly studied, and we then concentrated on various real-life applications in which data streams are prominent. Applications like network intrusion detection and financial fraud detection are areas characterised by stream data.
3.2 Problem Statement
3.2.1 How did we reach our problem statement

3.2.2 Why are Existing Classifiers Weak?
Most existing stream classifiers cannot be used to classify skewed data streams. These classifiers face the issues mentioned in Section 1.5. Further, they assume that the examples in the stream have a fairly balanced distribution over the different classes. Hence standard stream classifiers are dominated by the majority class instances, and thus they tend to treat the minority class instances as noise.
3.3 Evaluation Metrics
Although most stream classifiers measure their performance by overall accuracy, in the case of imbalanced datasets such a measure is not appropriate. Consider, for example, two classifiers tested on an imbalanced dataset with a class distribution ratio of 0.99:0.01. If the first classifier classifies all of the majority class correctly but none of the minority class, while the second classifies 97% of the majority class and a fraction 0.8 (80%) of the minority class correctly, then by overall accuracy the first classifier beats the second; but if the minority class is the main class of interest, the second classifier is the one that should be chosen.
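Working through this example with concrete, illustrative counts on a 10,000-instance dataset (interpreting the 0.8 figure as a fraction, i.e. 80% minority recall) makes the accuracy paradox explicit:

```python
def accuracy_and_recall(tp, fn, fp, tn):
    """Overall accuracy and minority-class recall from raw counts."""
    total = tp + fn + fp + tn
    accuracy = (tp + tn) / total
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, recall

# 10,000 instances, 99:1 skew.
# Classifier 1 predicts everything as the majority class;
# Classifier 2 trades a little majority accuracy for minority recall.
acc1, rec1 = accuracy_and_recall(tp=0,  fn=100, fp=0,   tn=9900)
acc2, rec2 = accuracy_and_recall(tp=80, fn=20,  fp=297, tn=9603)
print(acc1, rec1)  # → 0.99 0.0
print(acc2, rec2)  # → 0.9683 0.8
```

Classifier 1 "wins" on accuracy while being useless on the minority class, which is exactly why the metrics below are preferred for skewed streams.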
In such cases, various performance measures have been suggested in the literature. In this section we present the evaluation metrics that we have used.
Confusion Matrix: The columns of the confusion matrix represent the predictions, and the rows represent the actual classes. Correct predictions always lie on the diagonal of the matrix. Equation 3.1 shows the general structure of the confusion matrix.
    Confusion Matrix = | TP   FN |
                       | FP   TN |                      (3.1)
Here, True Positives (TP) is the number of minority instances that were correctly predicted, and True Negatives (TN) is the number of majority instances that were correctly predicted. False Positives (FP) is the number of majority instances that were incorrectly predicted as minority class instances, and False Negatives (FN) is the number of minority instances that were incorrectly predicted as majority class instances. Though the confusion matrix gives a better outlook on how the classifier performed than accuracy, a more detailed analysis is preferable, which is provided by the following metrics.
Recall: Recall gives the percentage of the actual minority class members that the classifier correctly identified; (TP + FN) is the total number of minority members. Recall is given by Equation 3.2.

    Recall = TP / (TP + FN)                             (3.2)
Precision: Precision gives the percentage of the instances labelled as minority by the model or classifier that actually belong to the minority class; (TP + FP) is the total number of positive predictions made by the classifier. Precision is given by Equation 3.3.

    Precision = TP / (TP + FP)                          (3.3)
Thus, in general, recall is said to be a completeness measure and precision an exactness measure. The ideal classifier would give a value of 1 for both recall and precision, but if a classifier gives a higher value (closer to one) for one of these metrics and a lower value for the other, choosing between classifiers becomes a difficult task. In such cases some other metrics, discussed below, are suggested in the literature.
Further, a few metrics based on the above have been suggested in the literature for use with imbalanced data sets or streams. Performance measures like AUROC (Area Under the ROC Curve) [12],[20] and G-Mean [17] are well suited to such situations.
    F-Measure = 2 / (1/Recall + 1/Precision)            (3.4)

    G-Mean = sqrt(Recall * TNR)                         (3.5)

    TNR = TN / (TN + FP)                                (3.6)
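A small sketch computing the metrics of Equations 3.2-3.6 directly from the confusion-matrix counts (the counts used in the example are illustrative):

```python
def stream_metrics(tp, fn, fp, tn):
    """Evaluation metrics for skewed streams, following Eqs. 3.2-3.6:
    recall, precision, F-Measure (harmonic mean), TNR and G-Mean."""
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f_measure = 2 / (1 / recall + 1 / precision)
    tnr = tn / (tn + fp)
    g_mean = (recall * tnr) ** 0.5
    return recall, precision, f_measure, tnr, g_mean

r, p, f, tnr, g = stream_metrics(tp=80, fn=20, fp=297, tn=9603)
# recall = 0.8, tnr = 0.97; precision, F-Measure and G-Mean follow.
```

Note how the same counts that gave a flattering 96.8% accuracy above yield a precision of only about 0.21, exposing the many false alarms.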
Area Under ROC Curve: The area under the ROC (Receiver Operating Characteristics) curve gives the probability that, when one majority and one minority class example are drawn at random, the decision function assigns a higher value to the minority class example than to the majority class example. AUROC is not sensitive to the class distributions in the dataset. Generally the ROC curve is plotted as the True Positive Rate versus the False Positive Rate. It was originally used in signal detection theory and the medical domain, where it is described as the plot of Sensitivity versus (1 - Specificity); since Sensitivity is the same as the True Positive Rate and (1 - Specificity) reduces to the False Positive Rate, both definitions are one and the same.
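The probabilistic definition above can be computed directly by comparing all positive-negative score pairs; this brute-force sketch is for illustration only (practical implementations use rank statistics instead).

```python
def auroc(scores_pos, scores_neg):
    """AUROC via its probabilistic definition: the chance that a randomly
    drawn minority (positive) example receives a higher score than a
    randomly drawn majority (negative) example; ties count as half."""
    pairs = [(p, n) for p in scores_pos for n in scores_neg]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p, n in pairs)
    return wins / len(pairs)

# 8 of the 9 positive-negative pairs are ranked correctly.
print(auroc([0.9, 0.8, 0.4], [0.7, 0.3, 0.2]))
```

Because the score counts every cross-class pair equally, shrinking the minority class does not dilute its contribution, which is why AUROC is insensitive to class skew.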
Chapter 4
Our Approach
As mentioned in the previous chapter, after the literature survey we came to the conclusion that the issue of skewed data streams certainly needs more attention and that there is still room for improvement in the performance of existing classifiers. We therefore came up with our own approach to deal with skewed data streams.
4.1 Approach to Deal with Skewed Data Streams
This section gives a brief description of our approach to the classification of data streams with skewed distribution. In general, it is as follows. Instead of using a single model built from a single training set, we propose to use multiple models built from different sets: an ensemble of classifiers/models to classify the data streams with skewed distributions. The use of an ensemble of classifiers has been proven effective for dealing with concept-drifting data streams [31]. In our approach we have also used oversampling so as to balance the training chunk contents. Fig 4.1 shows our approach in brief.
In our approach we process the data stream in chunks. If the sets of instances used for training the classifiers are disjoint, then the classifiers make uncorrelated errors, which can be eliminated by averaging [16]. Along similar lines, we build a new classifier for the ensemble for every incoming batch, such that the instances of the training chunks are as uncorrelated as possible.
The best way to learn from a data stream is to construct the model on the most recent training chunk. This works for instances of the majority class, because they are abundant in every chunk, but the minority class samples are very few. Some of the earlier approaches [7],[11] depend on creating synthetic minority class examples.
Figure 4.1: Our Approach to Classify Data Streams with Skewed Distributions
While balancing the training chunk we follow the k-nearest-neighbour approach. We select examples from the previous minority examples that are accumulated over time: to balance the training chunk, we find the k nearest neighbours in the current training chunk of each minority example stored from previous chunks, and an appropriate number of these minority samples is used to balance the chunk. Using this balanced training chunk we build a new classifier, which is then added to the ensemble.
The size of the ensemble is limited to 10. When the ensemble grows beyond this limit, the classifiers with the best AUROC (Area Under ROC Curve) values are retained. The process of learning is continuous, as can be seen from Fig 4.1: as the stream keeps coming in, the ensemble is continuously updated. Meanwhile, the test examples are classified by taking predictions from the ensemble. The predictions are made by weighted majority voting among the classifiers; the weights of the classifiers in the ensemble are assigned as per their AUROC values and are normalized while taking the majority vote.
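The weighted-majority-voting step could be sketched as follows; `weighted_vote` is an illustrative name and the AUROC weights in the example are made up.

```python
def weighted_vote(predictions, auroc_weights):
    """Ensemble prediction by weighted majority voting: each classifier's
    vote counts in proportion to its normalized AUROC weight."""
    total = sum(auroc_weights)
    weights = [w / total for w in auroc_weights]  # normalize
    tally = {}
    for label, w in zip(predictions, weights):
        tally[label] = tally.get(label, 0.0) + w
    return max(tally, key=tally.get)

# Three classifiers vote '-', but the two most reliable ones vote '+'.
print(weighted_vote(['+', '+', '-', '-', '-'],
                    [0.9, 0.85, 0.5, 0.5, 0.5]))  # → +
```

Weighting by AUROC rather than accuracy means classifiers that rank minority instances well get the loudest voice, which matters on skewed streams.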
Thus our approach consists of an oversampling-based approach together with the k-nearest-neighbour algorithm and an ensemble-based approach to deal with data streams with a skewed distribution of classes. Algorithm 1 presents in detail the k-nearest-neighbours algorithm used by our approach.
Algorithm 1: K Nearest Neighbours Algorithm
Input: Set of training examples, i.e. the set of examples from which the nearest neighbours are to be determined.
Output: K nearest neighbours.
begin
    Determine the parameter K = number of nearest neighbours.
    Calculate the distance between the query instance and all the training samples.
    Sort the distances and determine the nearest neighbours based on the K-th minimum distance.
    Return the K nearest neighbours.
end
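Algorithm 1 translates directly into code; this sketch assumes numeric feature tuples and Euclidean distance, which are our simplifying assumptions.

```python
def k_nearest_neighbours(query, examples, k):
    """Return the k training examples nearest to the query instance
    under Euclidean distance (Algorithm 1)."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    # sort by distance to the query, keep the k closest
    return sorted(examples, key=lambda e: dist(query, e))[:k]

retained = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0), (0.5, 0.2)]
print(k_nearest_neighbours((0.0, 0.0), retained, 2))
# → [(0.0, 0.0), (0.5, 0.2)]
```

In our approach, `examples` would be the minority instances retained from earlier chunks and `query` an instance of the current training chunk.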
4.2 Algorithm for Skewed Data Streams
Chapter 5
Experimental Results and Discussions
This chapter presents the results obtained from the experiments carried out on our algorithm.
5.1 Experimental Setup
This section summarises the experimental setup on which we performed our experiments and tested the effectiveness of our algorithm. For the implementation of our algorithm we used MOA [6], an open source framework for mining data streams. It is implemented in Java and contains a collection of various data stream mining algorithms; MOA being open source, we could easily add and integrate our algorithm into it. The other hardware and software details are as follows: we used MOA release 20110124 along with WEKA version 3.7.5 on the Windows 7 Professional operating system, running on a dual-core AMD Opteron 2210 processor @ 1.8 GHz with 2 GB RAM.
5.2 Datasets
To evaluate the performance of our algorithm we have tested it on various
synthetic as well as real world datasets. We have used datasets available in the
UCI repository [13] and the MOA dataset repository [1], along with several
synthetically generated datasets. These datasets were chosen
such that they model real world scenarios, have a variety of features, and vary
widely in size and class distribution.
5.2.1 Synthetic Datasets
This subsection presents the details of the synthetic datasets that we have
generated, the SEA and SPH datasets, each of which is briefly explained later in
this section. The main reason for generating these two datasets is that they
exhibit one of the main characteristics of data streams, namely concept drift.
Table 5.1 shows the characteristics of the synthetically generated datasets:
the number of instances, the number of majority class instances, the number of
minority class instances, the number of attributes, the chunk size, and the
ratio of majority class to minority class.
Table 5.1: Description of Synthetic Datasets used

Dataset    Instances   Max-Class   Min-Class   Attributes   Chunk-size   Ratio
SPH 1%     100,000     99,000      1,000       10           1000         0.99::0.01
SPH 3%     100,000     97,000      3,000       10           1000         0.97::0.03
SPH 5%     100,000     95,000      5,000       10           1000         0.95::0.05
SPH 10%    100,000     90,000      10,000      10           1000         0.90::0.10
SEA 1%     80,000      79,200      800         3            1000         0.99::0.01
SEA 3%     80,000      77,600      2,400       3            1000         0.97::0.03
SEA 5%     80,000      76,000      4,000       3            1000         0.95::0.05
SEA 10%    80,000      72,000      8,000       3            1000         0.90::0.10
Each instance of the SEA dataset [29] has 3 attributes or features, each taking
a value between 0 and 10. Only the first two attributes are relevant; the third
is added to act as noise. If the sum of the first two attributes crosses a
certain threshold value then the instance belongs to class 1, else it belongs to
class 2. The threshold values used are 8, 9, 7 and 9.5 for the four data blocks.
From each block some of the instances are reserved as test instances.
We generated 80,000 instances of the SEA dataset with different percentages of
class imbalance. These were used in batches of 1000 each: a batch of 1000
instances was used for training and the next 1000 instances were used for
testing the model built. We generated 4 variants of the SEA dataset, with 1%,
3%, 5% and 10% skew respectively. To each of these datasets we added 1% noise so
as to make the learning task more difficult; the noise was added by flipping the
correct class label to an incorrect one in the training set.
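The SEA generation and noise procedure described above can be sketched in Python (an illustration only; the thesis used MOA's generator, and the skew percentages are applied separately):

```python
import random

def sea_stream(n, thresholds=(8, 9, 7, 9.5), noise=0.01, seed=42):
    """SEA-style stream: 3 attributes in [0, 10], only the first two
    relevant; the threshold changes across 4 equal blocks (abrupt
    concept drift). `noise` is the probability of flipping a label."""
    rng = random.Random(seed)
    block = n // len(thresholds)
    stream = []
    for i in range(n):
        x = [rng.uniform(0, 10) for _ in range(3)]
        theta = thresholds[min(i // block, len(thresholds) - 1)]
        label = 1 if x[0] + x[1] > theta else 2   # third attribute is noise
        if rng.random() < noise:                  # label noise
            label = 2 if label == 1 else 1
        stream.append((x, label))
    return stream
```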
In the SPH dataset, each instance x = (x_1, ..., x_d) is labelled by a
hyperplane with weights a_1, ..., a_d, where the threshold a_0 is set to

    a_0 = (1/2) * sum_{i=1}^{d} a_i

and the class label l for an instance x is then defined as l = 1 if
sum_{i=1}^{d} a_i * x_i >= a_0, and l = 0 otherwise. As opposed to the abrupt
concept drift in the SEA dataset, the SPH dataset is characterised by gradual
concept drift: some of the coefficients a_i are sampled at random and have a
small increment added, defined as

    delta = s * (t / N)

where t is the magnitude of the change for every N examples and s, taking
values in the closed interval [-1, 1], specifies the direction of change; s has
a 20% chance of being reversed every N examples. a_0 is also recomputed
according to the equation above after each change. In this way the class
boundary acts like a spinning hyperplane in the process of creating the data.
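The spinning-hyperplane process just described can be sketched in Python. This is an illustrative generator under assumed parameter values (d = 10, half the weights drifting), not the exact generator used in the experiments:

```python
import random

def sph_stream(n, d=10, t=0.1, N=1000, seed=7):
    """Gradually drifting hyperplane: a_0 = (1/2) * sum(a_i); label is 1
    when sum(a_i * x_i) >= a_0. Each drifting weight is incremented by
    s * t / N per example, and the direction s is reversed with
    probability 0.2 every N examples."""
    rng = random.Random(seed)
    a = [rng.random() for _ in range(d)]
    drifting = list(range(d // 2))      # subset of weights that drift
    s = 1
    stream = []
    for i in range(n):
        x = [rng.random() for _ in range(d)]
        a0 = 0.5 * sum(a)               # threshold recomputed after drift
        label = 1 if sum(ai * xi for ai, xi in zip(a, x)) >= a0 else 0
        stream.append((x, label))
        for j in drifting:              # gradual drift of the boundary
            a[j] += s * t / N
        if (i + 1) % N == 0 and rng.random() < 0.2:
            s = -s                      # 20% chance to reverse direction
    return stream
```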
5.2.2 Results on Synthetic Datasets
We present graphical results of our approach in comparison with some other state
of the art algorithms. Details of the algorithms used for comparison are as
follows:

- Our approach, CDSSDC, which uses the k nearest neighbours approach to balance
the current training chunk; the ratio of positive to negative examples is kept
at 0.5.

- SMOTE [7], which creates synthetic examples to balance the dataset. The
implementation of SMOTE available in the WEKA toolkit [14] has been used with
WEKA's default settings, except for the SPH and SEA datasets, where SMOTE was
applied so as to maintain a ratio of positive to negative examples of 0.5.

- FLC [3], which was designed to deal with concept drift in data streams
efficiently.
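For reference, the core idea of SMOTE can be sketched in a few lines of Python. This is an illustrative sketch, not the WEKA implementation used above: each synthetic example is placed at a random point on the segment between a minority instance and one of its k nearest minority-class neighbours.

```python
import math
import random

def smote(minority, n_synthetic, k=5, seed=0):
    """Minimal SMOTE sketch: interpolate between a minority instance
    and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        x = rng.choice(minority)
        # k nearest minority-class neighbours of x (Euclidean distance)
        neighbours = sorted((p for p in minority if p != x),
                            key=lambda p: math.dist(p, x))[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()              # position along the segment
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, nb)))
    return synthetic
```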
From Fig. 5.1 we can clearly see that our approach performs better than SMOTE
and FLC in terms of AUROC on the SPH datasets with various imbalance levels. As
the class skew eases from 99:1 towards 90:10, FLC and SMOTE come closer in
performance to our approach.
Fig. 5.2 shows the comparative performance of our approach, SMOTE and FLC with
respect to G-mean on the SPH datasets. Here we can see that our approach
performs better than SMOTE and FLC for all the class distributions of the SPH
datasets shown. Again, the other algorithms close the gap as the class skew
eases.
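The skew-insensitive metrics reported in these figures can be computed as follows. This is an illustrative Python sketch of the standard definitions (G-mean as the geometric mean of sensitivity and specificity, F-Measure as the harmonic mean of precision and recall on the minority class), not code from the thesis:

```python
def gmean_fmeasure(y_true, y_pred, positive=1):
    """G-mean and F-Measure for a binary problem with minority
    class `positive`."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    sens = tp / (tp + fn) if tp + fn else 0.0   # recall on minority class
    spec = tn / (tn + fp) if tn + fp else 0.0
    prec = tp / (tp + fp) if tp + fp else 0.0
    gmean = (sens * spec) ** 0.5
    fmeasure = 2 * prec * sens / (prec + sens) if prec + sens else 0.0
    return gmean, fmeasure
```

Unlike overall accuracy, both metrics collapse to zero whenever the minority class is entirely misclassified, which is why they are preferred on skewed streams.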
In the presence of abrupt concept drift, the ensemble is not able to adjust
easily to these sudden changes in the stream behaviour.
5.2.3 Real World Datasets
This subsection gives the details of the real world datasets used for evaluating
the performance of our approach. We have used datasets collected and provided by
different repositories.
Table 5.2: Description of the Electricity dataset used

Dataset    Instances   Max-Class   Min-Class   Attributes   Chunk-size   Ratio
Elec2 5%   16,000      15,207      793         -            1000         0.95::0.05
This dataset did not have any class skew present, hence we extracted instances
in such a manner that a skewed data stream was obtained; while extracting the
instances, the order of the examples was not altered. We thus extracted 16,000
instances from the original dataset to form a dataset with 5% skew. Table 5.2
gives the detailed description of the Electricity dataset.
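The thesis does not give the exact extraction procedure, so the following hypothetical Python helper only sketches one way to derive such a skewed stream: keep every minority instance, subsample the majority instances, and never reorder the examples.

```python
import random

def extract_skewed(stream, minority_label, keep_prob, seed=3):
    """Hypothetical helper: form a skewed stream by keeping all minority
    instances and each majority instance with probability `keep_prob`,
    preserving the original arrival order."""
    rng = random.Random(seed)
    return [(x, y) for x, y in stream
            if y == minority_label or rng.random() < keep_prob]
```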
Table 5.3: Description of Real World Datasets used

Dataset     Instances   Max-Class   Min-Class   Attributes   Chunk-size   Ratio
Letter      20,000      19,195      805         17           3332         0.96::0.04
Connect-4   50,922      44,473      6,449       42           5009         0.88::0.12
Adult       40,402      32,561      7,841       14           4070         0.81::0.19
5.2.4 Results on Real World Datasets
We present graphical results of our approach in comparison with the state of the
art algorithms mentioned earlier in subsection 5.2.2.

Fig. 5.9 shows a comparison of the AUROC values obtained on the real world
datasets. Here we can observe that our approach matches or achieves better
performance than the others, except on the Letter dataset, where FLC achieves a
better result. This may be because FLC deals with the concept drift better than
our approach.
5.2.5 Effect of Varying Noise Levels
In this subsection we examine how our approach performs when the noise level is
changed. To study this behaviour, we tested the approach on the SPH dataset with
5% minority class instances and noise levels ranging from 1% to 5%. The
following graphs depict how the approach responds to the varied noise levels.
We can observe from Fig. 5.13 how the approach behaves on the SPH dataset as the
noise level is varied.

Figure 5.16: Effect on Overall Accuracy when noise levels are varied
We can see from Fig. 5.15 that as the noise level increases there is a slight
drop in the F-Measure values obtained. The F-Measure values are almost the same
at noise levels of 2% and 4%, but show a slight decrease at noise levels of 3%
and 5%.
We can observe from Fig. 5.16 that as the noise level increases the overall
accuracy remains almost the same, with only very slight variations. This is
because even though the noise level rises to 5%, the majority class still
accounts for most of the instances, and it is the major driving factor behind
the overall accuracy in this case.
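This effect is easy to demonstrate with a small illustrative calculation (hypothetical numbers matching the 95::5 skew used above):

```python
# A classifier that always predicts the majority class still reaches 95%
# overall accuracy on a 95::5 stream, which is why overall accuracy barely
# reacts to noise while G-mean and F-Measure do.
y_true = [0] * 95 + [1] * 5      # 95% majority class, 5% minority class
y_pred = [0] * 100               # always predict the majority class
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)                  # 0.95, despite zero recall on the minority
```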
Chapter 6
Conclusion and Future Work
This chapter presents the conclusions drawn from the project work and the future
work that can be carried out.
6.1 Conclusion
In this project work we have developed an approach to deal with skewed data
streams using oversampling and the k nearest neighbours technique. We have
evaluated our algorithm on various real world as well as synthetic datasets with
a variety of features and imbalance levels. The results obtained indicate that
our approach deals well with skewed data streams; in particular, it has shown
comparable, and in some cases slightly better, performance in terms of Area
Under the ROC Curve, F-Measure and G-Mean, except in a few cases where it stays
behind some of the other algorithms.

As seen earlier, various real life data stream applications such as financial
fraud detection and network intrusion detection are characterised by skewed data
streams, and in such cases this approach would help identify and classify
minority class instances appropriately.
6.2 Future Work
This section outlines the future enhancements that can be carried out. One
enhancement would be to implement the approach on a parallel computing platform,
which would help reduce the time it requires; there are various parts of the
approach into which parallelism can be introduced.

We are also working on an approach that combines two different stream
classification approaches so as to get the best out of both algorithms. By
combining our approach with another that handles general data streams, or data
streams with a balanced class distribution, we would like to extend the scope of
our approach so that it is applicable to data streams with any kind of class
distribution.
Appendix A
Publications
This appendix lists our research papers that have been accepted and the papers
that are under review.
A.1
A.2
Bibliography
[1] MOA dataset repository: http://moa.cs.waikato.ac.nz/datasets/.

[2] Charu C. Aggarwal, Jiawei Han, Jianyong Wang, and Philip S. Yu. On demand
classification of data streams. In Proceedings of the tenth ACM SIGKDD
international conference on Knowledge discovery and data mining, KDD '04, pages
503–508, New York, NY, USA, 2004. ACM.

[3] Vahida Attar, Pradeep Sinha, and Kapil Wankhade. A fast and light classifier
for data streams. Evolving Systems, 1:199–207, 2010.

[4] Brian Babcock, Shivnath Babu, Mayur Datar, Rajeev Motwani, and Jennifer
Widom. Models and issues in data stream systems. In Proceedings of the
twenty-first ACM SIGMOD-SIGACT-SIGART symposium on Principles of database
systems, PODS '02, pages 1–16, New York, NY, USA, 2002. ACM.

[5] Stephen Bay, Krishna Kumaraswamy, Markus G. Anderle, Rohit Kumar, and David
M. Steier. Large scale detection of irregularities in accounting data. In
Proceedings of the Sixth International Conference on Data Mining, ICDM '06,
pages 75–86, Washington, DC, USA, 2006. IEEE Computer Society.

[6] Albert Bifet, Geoff Holmes, Richard Kirkby, and Bernhard Pfahringer. MOA:
Massive online analysis. Journal of Machine Learning Research, 11:1601–1604,
2010.

[7] Nitesh V. Chawla, Kevin W. Bowyer, Lawrence O. Hall, and W. Philip
Kegelmeyer. SMOTE: synthetic minority over-sampling technique. J. Artif. Int.
Res., 16:321–357, June 2002.

[8] S. Chen and H. He. Towards incremental learning of nonstationary imbalanced
data stream: a multiple selectively recursive approach. Evolving Systems, pages
1–16, 2011.
[9] Sheng Chen and Haibo He. SERA: Selectively recursive approach towards
nonstationary imbalanced stream data mining. In Neural Networks, 2009. IJCNN
2009. International Joint Conference on, pages 522–529, June 2009.

[10] Sheng Chen, Haibo He, Kang Li, and S. Desai. MuSeRA: Multiple selectively
recursive approach towards imbalanced stream data mining. In Neural Networks
(IJCNN), The 2010 International Joint Conference on, pages 1–8, July 2010.

[11] G. Ditzler, R. Polikar, and N. Chawla. An incremental learning algorithm
for non-stationary environments and class imbalance. In Pattern Recognition
(ICPR), 2010 20th International Conference on, pages 2997–3000, Aug. 2010.

[12] Tom Fawcett. ROC graphs: Notes and practical considerations for
researchers. Technical report, 2004.

[13] A. Frank and A. Asuncion. UCI machine learning repository, 2010.

[14] E. Frank, M. Hall, G. Holmes, R. Kirkby, B. Pfahringer, I.H. Witten, and
L. Trigg. WEKA. In Data Mining and Knowledge Discovery Handbook, pages
1305–1314, 2005.

[15] Peter W. Frey and David J. Slate. Letter recognition using Holland-style
adaptive classifiers. Machine Learning, 6:161–182, 1991.

[16] J. Gao, W. Fan, J. Han, and P.S. Yu. A general framework for mining
concept-drifting data streams with skewed distributions. In Proc. of SIAM ICDM,
2007.

[17] Qiong Gu, Li Zhu, and Zhihua Cai. Evaluation measures of the classification
performance of imbalanced data sets. In Computational Intelligence and
Intelligent Systems, volume 51, pages 461–471. Springer Berlin Heidelberg, 2009.

[18] Jiawei Han and Micheline Kamber. Data Mining: Concepts and Techniques.
Morgan Kaufmann Publishers, 2nd edition, 2006.

[19] Michael Harries and New South Wales. Splice-2 comparative evaluation:
Electricity pricing, 1999.

[20] Haibo He and Edwardo A. Garcia. Learning from imbalanced data. IEEE Trans.
Knowl. Data Eng., 21(9):1263–1284, 2009.
[21] Geoff Hulten, Laurie Spencer, and Pedro Domingos. Mining time-changing data
streams. In Proceedings of the seventh ACM SIGKDD international conference on
Knowledge discovery and data mining, KDD '01, pages 97–106, New York, NY, USA,
2001. ACM.

[22] Miroslav Kubat, Robert C. Holte, and Stan Matwin. Machine learning for the
detection of oil spills in satellite radar images. Mach. Learn.,
30(2-3):195–215, February 1998.

[23] Vipin Kumar, Pang-Ning Tan, and Michael Steinbach. Introduction to Data
Mining. Addison-Wesley, 2006.

[24] Yanling Li, Guoshe Sun, and Yehang Zhu. Data imbalance problem in text
classification. In Proceedings of the 2010 Third International Symposium on
Information Processing, ISIP '10, pages 301–305, Washington, DC, USA, 2010. IEEE
Computer Society.

[25] Peng Liu, Yong Wang, Lijun Cai, and Longbo Zhang. Classifying skewed data
streams based on reusing data. In Computer Application and System Modeling
(ICCASM), 2010 International Conference on, volume 4, pages V4-90–V4-93, Oct.
2010.

[26] H.M. Nguyen, E.W. Cooper, and K. Kamei. Online learning from imbalanced
data streams. In Soft Computing and Pattern Recognition (SoCPaR), 2011
International Conference of, pages 347–352, Oct. 2011.

[27] Dan Pelleg and Andrew Moore. Active learning for anomaly and rare-category
detection. In Advances in Neural Information Processing Systems 18, pages
1073–1080. MIT Press, 2004.

[28] Qun Song, Jun Zhang, and Qian Chi. Assistant detection of skewed data
streams classification in cloud security. In Intelligent Computing and
Intelligent Systems (ICIS), 2010 IEEE International Conference on, volume 1,
pages 60–64, Oct. 2010.

[29] W. Nick Street and YongSeog Kim. A streaming ensemble algorithm (SEA) for
large-scale classification. In Proceedings of the seventh ACM SIGKDD
international conference on Knowledge discovery and data mining, KDD '01, pages
377–382, New York, NY, USA, 2001. ACM.

[30] Pavan Vatturi and Weng-Keen Wong. Category detection using hierarchical
mean shift. In Proceedings of the 15th ACM SIGKDD international conference on
Knowledge discovery and data mining, KDD '09, pages 847–856, New York, NY, USA,
2009. ACM.
[31] Haixun Wang, Wei Fan, Philip S. Yu, and Jiawei Han. Mining concept-drifting
data streams using ensemble classifiers. In Proceedings of the ninth ACM SIGKDD
international conference on Knowledge discovery and data mining, KDD '03, pages
226–235, New York, NY, USA, 2003. ACM.

[32] Yi Wang, Yang Zhang, and Yong Wang. Mining data streams with skewed
distribution by static classifier ensemble. In Been-Chian Chien and Tzung-Pei
Hong, editors, Opportunities and Challenges for Next-Generation Applied
Intelligence, volume 214 of Studies in Computational Intelligence, pages 65–71.
Springer Berlin / Heidelberg, 2009.

[33] Zhe Wang, William Josephson, Qin Lv, Moses Charikar, and Kai Li. Filtering
image spam with near-duplicate detection. In Proceedings of the Fourth
Conference on Email and Anti-Spam, CEAS 2007, 2007.

[34] Gang Wu and Edward Y. Chang. Class-boundary alignment for imbalanced
dataset learning. In ICML 2003 Workshop on Learning from Imbalanced Data Sets,
pages 49–56, 2003.

[35] Junjie Wu, Hui Xiong, Peng Wu, and Jian Chen. Local decomposition for rare
class analysis. In Proceedings of the 13th ACM SIGKDD international conference
on Knowledge discovery and data mining, KDD '07, pages 814–823, New York, NY,
USA, 2007. ACM.

[36] Juan Zhang, Xuegang Hu, Yuhong Zhang, and Pei-Pei Li. An efficient ensemble
method for classifying skewed data streams. In De-Shuang Huang, Yong Gan,
Prashan Premaratne, and Kyungsook Han, editors, ICIC (3), volume 6840 of Lecture
Notes in Computer Science, pages 144–151. Springer, 2011.