
DATA STREAM ANALYSIS

MOA (Massive On-line Analysis)

PROJECT REPORT
NAME - SAKSHAM KAPOOR
GUIDE/MENTOR Ms. BHAWANA SAINI

INTRODUCTION
The amount of available electronic information, such as online newspapers,
journals, conference proceedings, Web sites, and e-mails, is growing rapidly.
Controlling, indexing, or searching all of this information is not feasible
for humans, and is difficult even for search engines. The data *****
MOA (Massive On-line Analysis) is a framework for data stream mining. It
includes tools for evaluation and a collection of machine learning algorithms.
It is related to the WEKA project and is also written in Java, but it scales
to more demanding problems. The goal of MOA is to serve as a benchmark
framework for running experiments in the data stream mining context by
providing:
- storable settings for data streams (real and synthetic) for repeatable
  experiments,
- a set of existing algorithms and measures from the literature for
  comparison, and
- an easily extendable framework for new streams, algorithms, and
  evaluation methods.
MOA currently supports stream classification, stream clustering, and outlier
detection.
Automatic data organization is an important issue. Clustering methods let us
gain insight into the data distribution or preprocess data for other
applications. For example, a search engine that searches over clustered
documents can produce results more effectively and efficiently.
Clustering is an unsupervised learning method: it does not need a training
step, pre-defined categories, or labeled documents, so no training set is
required when applying a clustering algorithm. It uses only the input data to
find regularities in it. Although stream clustering algorithms are designed
for data streams, they can also be used on non-streaming data. In this paper
we investigate how to use data stream clustering techniques on large,
non-streaming data.

Literature Survey
Data Set*definition*
Data Streams
A data stream is an ordered sequence of points x1, ..., xn that must be
accessed in order and that can be read only once or a small number of times.
Each reading of the sequence is called a linear scan or a pass. The stream
model is motivated by emerging applications involving massive data sets; for
example, customer click streams, telephone records, large sets of web
pages, multimedia data, financial transactions, and observational science
data are better modeled as data streams. These data sets are far too large to
fit in main memory and are typically stored in secondary storage devices.
Data stream clustering is applied in applications that involve large amounts
of streaming data. For clustering, k-means is a widely used heuristic, but
alternative algorithms such as k-medoids and CURE have also been developed.
For data streams, one of the first results appeared in 1980, but the model was
formalized in 1998.
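
As an illustration of clustering in a single linear scan, the following is a minimal sketch of a sequential (online) k-means update. It is not taken from MOA; the function name `sequential_kmeans` and the choice to seed the centers with the first k points are assumptions made for this example.

```python
from typing import Iterable, List, Sequence


def sequential_kmeans(stream: Iterable[Sequence[float]], k: int) -> List[List[float]]:
    """Cluster a stream in one pass, storing only k centers and k counts."""
    centers: List[List[float]] = []
    counts: List[int] = []
    for point in stream:
        if len(centers) < k:
            # Assumption for this sketch: the first k points seed the centers.
            centers.append(list(point))
            counts.append(1)
            continue
        # Find the nearest center by squared Euclidean distance.
        nearest = min(
            range(k),
            key=lambda i: sum((c - x) ** 2 for c, x in zip(centers[i], point)),
        )
        counts[nearest] += 1
        step = 1.0 / counts[nearest]
        # Incremental mean: the center stays the average of its assigned points.
        centers[nearest] = [
            c + step * (x - c) for c, x in zip(centers[nearest], point)
        ]
    return centers
```

Each point is read once and then discarded, so memory use depends on k and the dimensionality, not on the length of the stream.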
Data clustering techniques are categorized by various factors as follows:
1. Partitioning
2. Hierarchical
3. Density based
4. Grid based
5. Model based
6. Frequent pattern based
7. Constraint based
8. Link based

Clustering

Data stream clustering has become an important field of research in recent
years. A data stream is an ordered and potentially unbounded sequence of
objects (e.g. data points representing sensor readings). Data stream
algorithms have been developed in order to process large volumes of data in
an efficient manner using a single pass over the data while having only
minimal storage overhead requirements.
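
To make "a single pass with minimal storage" concrete, here is a small sketch that computes the mean and variance of a numeric stream using Welford's online update. It keeps three numbers regardless of stream length. The function name `running_mean_variance` is hypothetical and the example is not part of MOA.

```python
from typing import Iterable, Tuple


def running_mean_variance(stream: Iterable[float]) -> Tuple[int, float, float]:
    """Return (count, mean, population variance) after one pass over the stream."""
    n = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in stream:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    variance = m2 / n if n > 0 else 0.0
    return n, mean, variance
```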
*info about clustering*
Data Stream Clustering

In today's applications, evolving data streams are ubiquitous. Stream
clustering algorithms were introduced to gain useful knowledge from these
streams in real time. The quality of the obtained clusterings, i.e. how well
they reflect the data, can be assessed by evaluation measures. A multitude
of stream clustering algorithms and evaluation measures for clusterings have
been introduced in the literature. The clustering tab in MOA makes it easy to
test and compare stream clustering algorithms as well as evaluation measures.
Moreover, it is easily extensible with new stream generators.
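
The text above does not name a specific evaluation measure. As one illustrative example, purity is a commonly used external measure: each cluster is credited with its most frequent ground-truth label, and the credited points are divided by the total. The sketch below assumes ground-truth labels are available; the function name `purity` is chosen for this example only.

```python
from collections import Counter, defaultdict
from typing import Dict, Hashable, Sequence


def purity(cluster_ids: Sequence[Hashable], true_labels: Sequence[Hashable]) -> float:
    """Fraction of points whose true label is the majority label of their cluster."""
    if len(cluster_ids) != len(true_labels):
        raise ValueError("cluster_ids and true_labels must have the same length")
    if not cluster_ids:
        return 0.0
    # Count how often each true label occurs inside each cluster.
    labels_in_cluster: Dict[Hashable, Counter] = defaultdict(Counter)
    for cluster, label in zip(cluster_ids, true_labels):
        labels_in_cluster[cluster][label] += 1
    majority_total = sum(max(c.values()) for c in labels_in_cluster.values())
    return majority_total / len(cluster_ids)
```

A purity of 1.0 means every cluster contains points of a single class; lower values indicate mixed clusters.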
Data feeds and data generators
For stream clustering, MOA provides data generators that support the
simulation of cluster evolution events such as merging or disappearing of
clusters. In the configuration dialog, the dimensionality, number, and size of
clusters can be set, as well as the drift speed, decay horizon (aging), and
noise rate. Events constitute changes in the underlying data model, such
as growing of clusters, merging of clusters, or creation of new clusters. Using
the event frequency and the individual event weights, one can study the
behavior and performance of different approaches on various settings.
Finally, the settings for the data generators can be stored and loaded, which
offers the opportunity of sharing settings and thereby providing benchmark
streaming data sets for repeatability and comparison. New data feeds and
generators can be added to the MOA framework by implementing the
ClusteringStream.java interface.
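
The following is a minimal sketch of what a configurable synthetic stream generator might look like. It is not MOA's generator and does not implement ClusteringStream; the function name, parameter names, and default values are all assumptions. It covers only dimensionality, number of clusters, cluster size (as a Gaussian radius), drift speed, and noise rate, and leaves out merge and split events.

```python
import random
from typing import Iterator, List, Tuple


def drifting_cluster_stream(
    n_points: int,
    n_clusters: int = 2,
    dim: int = 2,
    radius: float = 0.05,
    drift_speed: float = 0.001,
    noise_rate: float = 0.1,
    seed: int = 0,
) -> Iterator[Tuple[List[float], int]]:
    """Yield (point, label) pairs in the unit cube; label -1 marks noise."""
    rng = random.Random(seed)  # a fixed seed makes the stream repeatable
    centers = [[rng.random() for _ in range(dim)] for _ in range(n_clusters)]
    directions = [
        [rng.choice((-1.0, 1.0)) for _ in range(dim)] for _ in range(n_clusters)
    ]
    for _ in range(n_points):
        # Move every center a little; bounce off the unit-cube walls.
        for center, direction in zip(centers, directions):
            for d in range(dim):
                center[d] += direction[d] * drift_speed
                if not 0.0 <= center[d] <= 1.0:
                    direction[d] = -direction[d]
                    center[d] = min(1.0, max(0.0, center[d]))
        if rng.random() < noise_rate:
            yield [rng.random() for _ in range(dim)], -1
        else:
            label = rng.randrange(n_clusters)
            yield [rng.gauss(c, radius) for c in centers[label]], label
```

Because the generator is fully determined by its arguments, storing the argument values is enough to reproduce the same stream later.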
Clustering in MOA
Currently MOA contains several stream clustering methods including:
- StreamKM++: It computes a small weighted sample of the data stream
  and uses the k-means++ algorithm as a randomized seeding technique
  to choose the first values for the clusters. To compute the small sample, it
  employs coreset constructions using a coreset tree for speed-up.
- CluStream: It maintains statistical information about the data using
  micro-clusters. These micro-clusters are temporal extensions of cluster
  feature vectors. The micro-clusters are stored at snapshots in time
  following a pyramidal pattern. This pattern makes it possible to recall
  summary statistics from different time horizons.
- ClusTree: It is a parameter-free algorithm that automatically adapts to the
  speed of the stream and is capable of detecting concept drift, novelty,
  and outliers in the stream. It uses a compact and self-adaptive index
  structure for maintaining stream summaries.
- DenStream: It uses dense micro-clusters (named core-micro-clusters) to
  summarize clusters. To maintain and distinguish the potential clusters and
  outliers, this method presents core-micro-cluster and outlier micro-cluster
  structures.
- D-Stream: This method maps each input data record into a grid and
  computes the grid density. The grids are clustered based on the density.
  This algorithm adopts a density decaying technique to capture the
  dynamic changes of a data stream.
- CobWeb: One of the first incremental methods for clustering data. It
  uses a classification tree. Each node in a classification tree represents a
  class (concept) and is labeled by a probabilistic concept that summarizes
  the attribute-value distributions of objects classified under the node.
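
Several of the methods above rely on micro-clusters. As a rough sketch of the idea (not the code of CluStream or DenStream), a micro-cluster can be kept as a cluster feature vector: the point count plus linear and squared sums per dimension, extended with sums over timestamps. The class and method names below are hypothetical.

```python
import math
from typing import List, Sequence


class MicroCluster:
    """Illustrative micro-cluster built from additive summary statistics."""

    def __init__(self, dim: int) -> None:
        self.n = 0
        self.linear_sum = [0.0] * dim
        self.squared_sum = [0.0] * dim
        self.time_sum = 0.0          # temporal extension: sum of timestamps
        self.time_squared_sum = 0.0  # and sum of squared timestamps

    def insert(self, point: Sequence[float], timestamp: float) -> None:
        """Absorb one point; the point itself is not stored."""
        self.n += 1
        for d, x in enumerate(point):
            self.linear_sum[d] += x
            self.squared_sum[d] += x * x
        self.time_sum += timestamp
        self.time_squared_sum += timestamp * timestamp

    def merge(self, other: "MicroCluster") -> None:
        """Sums are additive, so merging two summaries is exact."""
        self.n += other.n
        for d in range(len(self.linear_sum)):
            self.linear_sum[d] += other.linear_sum[d]
            self.squared_sum[d] += other.squared_sum[d]
        self.time_sum += other.time_sum
        self.time_squared_sum += other.time_squared_sum

    # The accessors below assume at least one point has been inserted.
    def center(self) -> List[float]:
        return [s / self.n for s in self.linear_sum]

    def radius(self) -> float:
        """Root-mean-square distance of the absorbed points from the center."""
        variance = sum(
            sq / self.n - (ls / self.n) ** 2
            for ls, sq in zip(self.linear_sum, self.squared_sum)
        )
        return math.sqrt(max(variance, 0.0))

    def mean_time(self) -> float:
        return self.time_sum / self.n
```

The additive property is what makes snapshots useful: subtracting or merging summaries yields statistics for a time horizon without revisiting the original points.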

The set of algorithms is extensible through classes that implement the
interface Clusterer.java. These are added to the framework via reflection on
start-up. The three main methods of this interface are:
- void resetLearningImpl(): a method for initializing a clusterer learner
- void trainOnInstanceImpl(Instance): a method to train on a new instance
- Clustering getClusteringResult(): a method to obtain the current
  clustering result for evaluation or visualization
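
The three-method contract described above can be mirrored in a few lines. The sketch below is a Python analogue written for illustration only; it is not MOA's Java API, the snake_case method names are adaptations of the Java names, and `LeaderClusterer` with its `threshold` parameter is a hypothetical toy learner.

```python
import math
from abc import ABC, abstractmethod
from typing import List, Sequence


class Clusterer(ABC):
    """Python analogue of the three main methods named in the text."""

    @abstractmethod
    def reset_learning_impl(self) -> None:
        """Initialize (or re-initialize) the learner's state."""

    @abstractmethod
    def train_on_instance_impl(self, instance: Sequence[float]) -> None:
        """Update the learner with one new instance."""

    @abstractmethod
    def get_clustering_result(self) -> List[List[float]]:
        """Return the current clustering (here: a list of centers)."""


class LeaderClusterer(Clusterer):
    """Toy learner: join the nearest center within `threshold`, else open a new one."""

    def __init__(self, threshold: float) -> None:
        self.threshold = threshold
        self.reset_learning_impl()

    def reset_learning_impl(self) -> None:
        self.centers: List[List[float]] = []
        self.counts: List[int] = []

    def train_on_instance_impl(self, instance: Sequence[float]) -> None:
        best, best_dist = None, math.inf
        for i, center in enumerate(self.centers):
            dist = math.dist(center, instance)
            if dist < best_dist:
                best, best_dist = i, dist
        if best is None or best_dist > self.threshold:
            self.centers.append(list(instance))
            self.counts.append(1)
        else:
            self.counts[best] += 1
            step = 1.0 / self.counts[best]
            self.centers[best] = [
                c + step * (x - c) for c, x in zip(self.centers[best], instance)
            ]

    def get_clustering_result(self) -> List[List[float]]:
        return [list(c) for c in self.centers]
```

Keeping all state inside the reset method means a fresh experiment can reuse the same object without leftover clusters from a previous run.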

WORKFLOW OF MOA
The workflow in MOA follows the simple schema depicted below: first, a data
stream (feed, generator) is chosen and configured; second, an algorithm (e.g.
a classifier) is chosen and its parameters are set; third, the evaluation
method or measure is chosen; and finally, the results are obtained after
running the task.
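
The four workflow steps can be sketched end to end in a few lines. This is an illustrative stand-in, not a MOA task: the stream, the one-dimensional learner, the initial centers (2.0 and 8.0), and the mean-squared-distance measure are all assumptions chosen to keep the example short.

```python
import random
from typing import Dict, Iterator, List


def two_cluster_stream(n_points: int, seed: int = 0) -> Iterator[float]:
    """Step 1: a synthetic 1-D feed with clusters around 0 and 10."""
    rng = random.Random(seed)
    for _ in range(n_points):
        yield rng.gauss(rng.choice((0.0, 10.0)), 0.5)


class SequentialKMeans1D:
    """Step 2: a tiny learner that keeps one running mean per cluster."""

    def __init__(self, initial_centers: List[float]) -> None:
        self.centers = list(initial_centers)
        self.counts = [0] * len(self.centers)

    def nearest(self, x: float) -> int:
        return min(range(len(self.centers)), key=lambda i: abs(self.centers[i] - x))

    def train(self, x: float) -> None:
        i = self.nearest(x)
        self.counts[i] += 1
        self.centers[i] += (x - self.centers[i]) / self.counts[i]


def run_task(n_points: int = 2000, seed: int = 0) -> Dict[str, object]:
    stream = two_cluster_stream(n_points, seed)
    learner = SequentialKMeans1D([2.0, 8.0])
    squared_error = 0.0
    for x in stream:
        # Step 3: evaluate each point against the current model, then train on it.
        squared_error += (x - learner.centers[learner.nearest(x)]) ** 2
        learner.train(x)
    # Step 4: collect the results of the run.
    return {
        "centers": sorted(learner.centers),
        "mean_squared_distance": squared_error / max(n_points, 1),
    }
```

Calling `run_task()` returns the learned centers and the average squared distance of each point to its nearest center at the time the point arrived.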

[Figure: MOA framework workflow. Data feed/Generator -> Learning Algorithm ->
Evaluation Method -> Results]
