
WEKA

Weka is a collection of machine learning algorithms for data mining tasks.
The algorithms can either be applied directly to a dataset or called from your own Java code.

Weka is open source software written in Java and issued under the GNU General Public License.

Main Features

Weka contains tools for data pre-processing, classification, clustering, association rules, and visualization.

It provides an environment for comparing learning algorithms.

It is also well-suited for developing new machine learning schemes.

Resources:
WEKA is available at http://www.cs.waikato.ac.nz/ml/weka
The site also has a list of projects based on WEKA.

Tutorial: http://prdownloads.sourceforge.net/weka/weka.ppt

WEKA Knowledge Explorer


Preprocess: Choose and modify the data
Classify: Train and test learning schemes that classify
Cluster: Learn clusters for the data
Associate: Learn association rules for the data
Select attributes: Select the most relevant attributes in the data
Visualize: View an interactive 2D plot of the data

WEKA Explorer: Pre-processing the Data

Data can be imported from a file in various formats: ARFF, CSV, C4.5, binary
Data can also be read from a URL or from an SQL database (using JDBC)
Pre-processing tools in WEKA are called filters
WEKA contains filters for discretization, normalization, re-sampling, attribute selection, and transforming and combining attributes.
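
To make two of the filter types above concrete, here is a minimal sketch of min-max normalization and equal-width discretization in plain Python. It is an illustration under stated assumptions, not WEKA's filter code: the function names `min_max_normalize` and `equal_width_bins` are hypothetical, and WEKA's own filters may use different defaults.

```python
def min_max_normalize(values):
    """Rescale numeric values linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so there is no range to scale by.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def equal_width_bins(values, n_bins):
    """Map each value to a bin index in 0..n_bins-1 using equal-width intervals."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0 for _ in values]
    width = (hi - lo) / n_bins
    # The maximum value would fall just past the last bin, so clamp it.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


# Hypothetical sample values chosen for illustration.
sample_values = [0, 5, 10]
print(min_max_normalize(sample_values))
print(equal_width_bins(sample_values, 2))
```

Normalization keeps the attribute numeric, while discretization turns it into a small set of ordered categories, which matters for learning schemes that only accept nominal attributes.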

WEKA only deals with flat files

The data must be converted to ARFF format before applying any algorithm.
The dataset's name: @relation
The attribute information: @attribute
The data section begins with @data
Data: a list of instances, with the attribute values separated by commas.
By default, the class is the last attribute in the ARFF file.

Numeric attribute and Missing Value

@relation heart-disease-simplified

@attribute age numeric
@attribute gender { female, male}
@attribute chest_pain_type { typ_angina, asympt, non_anginal, atyp_angina}
@attribute cholesterol numeric
@attribute exercise_induced_angina { no, yes}
@attribute class { present, not_present}
@data
63,?,typ_angina,233,no,not_present
67,male,asympt,286,yes,present
67,male,asympt,229,yes,present
38,female,non_anginal,?,no,not_present
...
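
As a rough illustration of how the ARFF layout above maps onto data structures, the following sketch parses a small ARFF text in plain Python. It is not WEKA's loader: `parse_arff` is a hypothetical helper that handles only numeric and nominal attributes, treats `?` as a missing value, and ignores quoting, sparse data, and string or date attributes.

```python
def parse_arff(text):
    """Parse a small ARFF document into (relation, attributes, rows).

    attributes is a list of (name, kind) pairs, where kind is "numeric"
    or a list of nominal labels. Missing values ("?") become None.
    """
    relation = None
    attributes = []
    rows = []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):  # skip blank lines and comments
            continue
        lower = line.lower()
        if in_data:
            cells = [c.strip() for c in line.split(",")]
            row = []
            for cell, (_, kind) in zip(cells, attributes):
                if cell == "?":
                    row.append(None)          # missing value
                elif kind == "numeric":
                    row.append(float(cell))
                else:
                    row.append(cell)          # nominal label kept as text
            rows.append(row)
        elif lower.startswith("@relation"):
            relation = line.split(None, 1)[1]
        elif lower.startswith("@attribute"):
            _, name, spec = line.split(None, 2)
            if spec.startswith("{"):
                kind = [label.strip() for label in spec.strip("{}").split(",")]
            else:
                kind = "numeric" if spec.lower() in ("numeric", "real", "integer") else spec.lower()
            attributes.append((name, kind))
        elif lower.startswith("@data"):
            in_data = True
    return relation, attributes, rows


sample = """\
@relation heart-disease-simplified
@attribute age numeric
@attribute gender { female, male}
@attribute chest_pain_type { typ_angina, asympt, non_anginal, atyp_angina}
@attribute cholesterol numeric
@attribute exercise_induced_angina { no, yes}
@attribute class { present, not_present}
@data
63,?,typ_angina,233,no,not_present
67,male,asympt,286,yes,present
67,male,asympt,229,yes,present
38,female,non_anginal,?,no,not_present
"""

relation, attributes, rows = parse_arff(sample)
print(relation)
print(attributes[-1])  # the last attribute is the class by default
print(rows[0])
```

The first and last rows show the two missing values: a missing nominal value (gender) and a missing numeric value (cholesterol) are both written as `?` in the file.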

Explorer: clustering data


WEKA contains clusterers for finding groups of similar instances in a dataset
Implemented schemes are: k-Means, EM, Cobweb, X-means, FarthestFirst
Clusters can be visualized and compared to true clusters
Evaluation is based on log-likelihood if the clustering scheme produces a probability distribution
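
To show what the first scheme in that list does, here is a minimal sketch of k-means (Lloyd's algorithm) in plain Python. It is not WEKA's implementation: the function name `k_means` is hypothetical, and seeding the centroids with the first k points is an assumption made so the result is reproducible.

```python
def k_means(points, k, iterations=10):
    """Cluster points (tuples of floats) into k groups with Lloyd's algorithm.

    Assumption for reproducibility: the first k points seed the centroids.
    """
    centroids = [tuple(p) for p in points[:k]]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid
        # (squared Euclidean distance).
        for i, p in enumerate(points):
            assignment[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, assignment


# Hypothetical 2D points forming two well-separated groups.
points = [(1.0, 1.0), (9.0, 9.0), (1.0, 2.0), (8.0, 9.0), (2.0, 1.0), (9.0, 8.0)]
centroids, assignment = k_means(points, k=2)
print(centroids)
print(assignment)
```

The returned assignment list is what would be compared against true cluster labels when those are available.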

Performing experiments
The Experimenter makes it easy to compare the performance of different learning schemes
For classification and regression problems
Results can be written to a file or a database
Evaluation options: cross-validation, learning curve, hold-out
Can also iterate over different parameter settings
Significance testing is built in
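
The cross-validation option above can be sketched as follows. This is only an illustration of how a dataset is split into folds, not the Experimenter's code: `k_fold_indices` is a hypothetical helper, and it omits the shuffling and stratification that a real evaluation would normally add.

```python
def k_fold_indices(n_instances, n_folds):
    """Yield (train_indices, test_indices) for each fold of a k-fold split."""
    indices = list(range(n_instances))
    # Spread the remainder so that fold sizes differ by at most one.
    base, extra = divmod(n_instances, n_folds)
    start = 0
    for fold in range(n_folds):
        size = base + (1 if fold < extra else 0)
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


folds = list(k_fold_indices(10, 3))
for train, test in folds:
    print(len(train), test)
```

Each instance appears in exactly one test fold, so every instance is used for testing once and for training in all the other folds.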