What is WEKA?
• Machine learning/data mining software written in Java (distributed under the GNU General Public License)
• Used for research, education, and applications
• Complements “Data Mining” by Witten & Frank
• Main features:
– Comprehensive set of data pre-processing tools, learning algorithms and evaluation methods
– Graphical user interfaces (incl. data visualization)
– Environment for comparing learning algorithms
• WEKA versions
– WEKA 3.4: “book version”, compatible with the description in the data mining book
– WEKA 3.5: “developer version”, with lots of improvements
Formatting Data into ARFF (Attribute Relation File Format)
@relation iris
@attribute sepallength real
@attribute sepalwidth real
@attribute petallength real
@attribute petalwidth real
@attribute class {Iris-setosa, Iris-versicolor, Iris-virginica}
@data
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
…
7.0,3.2,4.7,1.4,Iris-versicolor
6.4,3.2,4.5,1.5,Iris-versicolor
…
6.3,3.3,6.0,2.5,Iris-virginica
5.8,2.7,5.1,1.9,Iris-virginica
…
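An ARFF file like the one above is plain text, so it can also be read outside WEKA. Below is a minimal sketch of an ARFF reader in pure Python; the `load_arff` helper is illustrative (it is not WEKA's own parser) and handles only the simple attribute declarations shown above.

```python
# Minimal ARFF reader sketch (illustrative, not WEKA's parser).
# Handles only simple declarations like those in the iris example above.

def load_arff(lines):
    """Parse ARFF text lines into (attribute_names, data_rows)."""
    attributes, data, in_data = [], [], False
    for line in lines:
        line = line.strip()
        if not line or line.startswith("%"):      # skip blanks and comments
            continue
        lower = line.lower()
        if lower.startswith("@attribute"):
            attributes.append(line.split()[1])    # keep the attribute name
        elif lower.startswith("@data"):
            in_data = True
        elif in_data:
            # numeric fields become floats, class labels stay strings
            row = [float(v) if v.replace(".", "", 1).lstrip("-").isdigit()
                   else v.strip()
                   for v in line.split(",")]
            data.append(row)
    return attributes, data

arff_text = """@relation iris
@attribute sepallength real
@attribute class {Iris-setosa, Iris-versicolor}
@data
5.1,Iris-setosa
7.0,Iris-versicolor"""

names, rows = load_arff(arff_text.splitlines())
print(names)  # ['sepallength', 'class']
print(rows)   # [[5.1, 'Iris-setosa'], [7.0, 'Iris-versicolor']]
```

A full ARFF parser would also handle comments after values, quoted nominal labels, and sparse data; this sketch covers only the layout used in the iris example.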
Practicing WEKA
• What is WEKA?
• Formatting the data into ARFF
• Classification
– Steps to build a classifier
– Case study: classification of iris flowers
– Summarizing the k-Nearest Neighbor Classifier experiments
– Experiments with other classifiers (neural networks, SVM)
– Classification of cancers based on gene expression
– Parkinson Disease Detection
• K-Means Clustering
Steps to Build a Classifier
Case Study: Classification of Iris Flowers
Flower’s parts
Steps to Build a Classifier
Open the file “iris-training.arff”.
Click on the Classify tab to choose a classifier algorithm.
Click “Choose” to select a classifier algorithm.
Naïve Bayes
IB1: 1-Nearest Neighbor Classifier
IBk: k-Nearest Neighbor Classifier
Multilayer Perceptron (artificial neural network)
SMO stands for Sequential Minimal Optimization. SMO is an implementation of SVM, based on John Platt’s paper.
Decision Tree: J48 (C4.5)
Suppose we choose IBk, the k-Nearest Neighbor Classifier.
Next, choose the accuracy-measurement scenario. Of the four options offered, choose “Supplied test set” and click the “Set” button to select the testing set file “iris-testing.arff”.
Steps to Build a Classifier

[Diagram: iris-training.arff and iris-testing.arff each contain 25 instances of iris setosa, 25 of iris versicolor, and 25 of iris virginica.]

Classifiers:
1. Naïve Bayes
2. k-Nearest Neighbor Classifier (lazy → IBk)
3. Artificial Neural Network (functions → MultilayerPerceptron)
4. Support Vector Machine (functions → SMO)

What accuracy does each classifier achieve on the testing set?
What does “measuring accuracy” mean?
• The classification model that has been built must be able to predict the class of an instance correctly.
Ways to measure accuracy
• “Using training set”: uses the entire data set as the training set and, at the same time, as the testing set. Accuracy will be very high, but this does not give a realistic estimate of the accuracy on other data (data not used for training).
• Hold-out method: uses part of the data as the training set and the rest as the testing set. This is the commonly used method, provided there are enough samples. It comes in two variants: supplied test set and percentage split. Choose “Supplied test set” if the training and testing files are available separately. Choose “Percentage split” if there is only one file that should be divided into training and testing sets; the percentage in the field is the portion used as the training set.
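The percentage-split variant of the hold-out method can be sketched in plain Python. The `percentage_split` helper below is illustrative (it is not a WEKA function): shuffle the instances, then cut at the given training proportion.

```python
import random

def percentage_split(instances, train_fraction=0.66, seed=1):
    """Hold-out split: shuffle, then cut into training and testing sets."""
    rng = random.Random(seed)      # fixed seed makes the split repeatable
    shuffled = instances[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

data = list(range(150))            # stand-in for the 150 iris instances
train, test = percentage_split(data, train_fraction=0.66)
print(len(train), len(test))       # 99 51
```

Shuffling before the cut matters: the iris file lists the instances grouped by class, so cutting without shuffling would leave whole classes out of the training set.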
Illustration of Cross-Validation (k=5)
1. The data consist of 100 instances (samples), divided into 5 blocks with equal numbers of samples. The blocks are named A, B, C, D and E, each containing 20 instances.
2. The quality of a given parameter combination is tested as follows:
step 1: train on A, B, C, D; test on E → accuracy a
step 2: train on A, B, C, E; test on D → accuracy b
step 3: train on A, B, D, E; test on C → accuracy c
step 4: train on A, C, D, E; test on B → accuracy d
step 5: train on B, C, D, E; test on A → accuracy e
3. The average accuracy (a+b+c+d+e)/5 reflects the quality of the chosen parameters.
4. Change the model parameters and repeat from step 2 until the desired accuracy is reached.
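The rotation of blocks described above can be sketched directly in Python. The `cross_validation_folds` helper is illustrative (not WEKA code): it splits the instances into k equal blocks and, in turn, holds each block out for testing.

```python
def cross_validation_folds(instances, k=5):
    """Yield (training_set, testing_set) pairs, one per fold."""
    fold_size = len(instances) // k
    blocks = [instances[i * fold_size:(i + 1) * fold_size] for i in range(k)]
    for i in range(k):
        # block i plays the role of E, D, C, B, A in turn
        test = blocks[i]
        train = [x for j, b in enumerate(blocks) if j != i for x in b]
        yield train, test

data = list(range(100))            # 100 instances, as in the illustration
for train, test in cross_validation_folds(data, k=5):
    print(len(train), len(test))   # 80 20, printed five times
```

In a real experiment, each (train, test) pair would be passed to the classifier and the five resulting accuracies averaged, exactly as in step 3 above.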
This time we use “Supplied test set”. Next, click inside the highlighted box to set the parameter values; in this case, the value of “k” for the k-Nearest Neighbour Classifier (nicknamed IBk).
Set the value of “k”, for example to 3, and click OK. To understand the other parameters, click the “More” and “Capabilities” buttons.
Click the “Start” button.
Experiment result: correct classification rate of 96% (72 correct out of 75 instances in the testing set).
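IBk’s core idea can be sketched in a few lines of plain Python (an illustrative re-implementation, not WEKA’s code): compute the Euclidean distance from the query to every training instance, take the k nearest, and predict by majority vote.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """train: list of (feature_vector, label) pairs; query: feature vector."""
    # sort training instances by Euclidean distance to the query
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]   # majority class among the k nearest

# toy training set in the spirit of the iris data (two features only)
train = [((1.4, 0.2), "setosa"), ((1.3, 0.2), "setosa"),
         ((4.7, 1.4), "versicolor"), ((4.5, 1.5), "versicolor")]
print(knn_predict(train, (1.5, 0.3), k=3))   # setosa
```

The only model parameter is k, which is why the experiment on the following slides simply reruns the classifier with different k values.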
No.  k   setosa  versicolor  virginica  overall
1    1   ?       ?           ?          ?
2    3   100%    96%         92%        96%
3    5   ?       ?           ?          ?
4    7   ?       ?           ?          ?
5    9   ?       ?           ?          ?
• Task: continue the experiment above for k = 1, 3, 5, 7 and 9.
• Draw a graph showing the accuracy achieved for each class at the various values of k. Horizontal axis: value of k; vertical axis: accuracy.
• At which value of k is the highest accuracy achieved? What is the accuracy trend for each class?
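The per-class accuracies asked for in the task can be computed from the true and predicted labels. The `per_class_accuracy` helper below is a sketch (not a WEKA function); WEKA’s own output reports the equivalent figures in its confusion matrix and per-class statistics.

```python
from collections import defaultdict

def per_class_accuracy(true_labels, predicted_labels):
    """Fraction of correctly classified instances, computed per class."""
    correct, total = defaultdict(int), defaultdict(int)
    for truth, pred in zip(true_labels, predicted_labels):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    return {cls: correct[cls] / total[cls] for cls in total}

truth = ["setosa", "setosa", "versicolor", "versicolor", "virginica"]
pred  = ["setosa", "setosa", "versicolor", "virginica", "virginica"]
print(per_class_accuracy(truth, pred))
# {'setosa': 1.0, 'versicolor': 0.5, 'virginica': 1.0}
```

Running this for each value of k gives exactly the numbers needed for the accuracy-vs-k graph in the task.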
Experiments with a Neural Network
Experiments with an SVM
C: complexity parameter (usually takes a large value: 100, 1000, etc.)
Experiments with an SVM
Classification of cancers based on gene expression
• Biological reference:
Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks, J. Khan, et al., Nature Medicine 7, pp. 673-679, 2001 (http://www.thep.lu.se/~carsten/pubs/lu_tp_01_06.pdf)
• Data is available from http://research.nhgri.nih.gov/microarray/Supplement/
• Small Round Blue Cell Tumors (SRBCT) comprise four classes:
– EWS: Ewing family of tumors
– NB: Neuroblastoma
– BL: Burkitt lymphoma
– RMS: Rhabdomyosarcoma
• Characteristics of the data
– Training samples: 63 (EWS: 23, BL: 8, NB: 12, RMS: 20)
– Testing samples: 20 (EWS: 6, BL: 3, NB: 6, RMS: 5)
– Number of features (attributes): 2308
Parkinson Disease Detection
Max Little (Oxford University) recorded speech signals and biomedical voice measurements from 31 people, 23 of them with Parkinson’s disease (PD). In the dataset, which will be distributed during the final examination, each column in the table is a particular voice measure, and each row corresponds to one of 195 voice recordings from these individuals (“name” column). The main aim of the data is to discriminate healthy people from those with PD, according to the “status” column, which is set to 0 for healthy and 1 for PD. There are around six recordings per patient, making a total of 195 instances. (Ref. ‘Exploiting Nonlinear Recurrence and Fractal Scaling Properties for Voice Disorder Detection’, Little MA, McSharry PE, Roberts SJ, Costello DAE, Moroz IM. BioMedical Engineering OnLine 2007, 6:23, 26 June 2007.)
Experiment using the k-Nearest Neighbor Classifier
Conduct classification experiments using the k-Nearest Neighbor Classifier and Support Vector Machines, using 50% of the data as the training set and the rest as the testing set. Try at least 5 different values of k for k-Nearest Neighbor, and draw a graph showing the relationship between k and the classification rate. For the Support Vector Machine experiments, try several parameter combinations by modifying the type of kernel and its parameters (at least 5 experiments). Compare and discuss the results obtained by the two classifiers. Which of them achieved higher accuracy?
Practicing WEKA
• What is WEKA?
• Formatting the data into ARFF
• Classification
– Steps to build a classifier
– Case study: classification of iris flowers
– Summarizing the k-Nearest Neighbor Classifier experiments
– Experiments with other classifiers (neural networks, SVM)
– Classification of cancers based on gene expression
– Parkinson Disease Detection
• K-Means Clustering
K-Means Clustering: Step by Step
Filename: kmeans_clustering.arff
Click to choose the clustering algorithm.
Click to choose the value of k.
maxIterations: stops the clustering process once the number of iterations exceeds the given value.
numClusters: the value of k (the number of clusters).
Clustering result: 3 clusters are formed, each containing 50 instances.
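The algorithm behind WEKA’s SimpleKMeans can be sketched in plain Python (an illustrative implementation, not WEKA’s code): assign each point to its nearest centroid, recompute each centroid as the mean of its members, and stop when the centroids no longer move or maxIterations is reached.

```python
import math

def kmeans(points, centroids, max_iterations=100):
    """Plain k-means on numeric tuples; returns (centroids, assignments)."""
    for _ in range(max_iterations):
        # assignment step: index of the nearest centroid for every point
        assign = [min(range(len(centroids)),
                      key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        # update step: move each centroid to the mean of its members
        new_centroids = []
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                new_centroids.append(tuple(sum(v) / len(members)
                                           for v in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid
        if new_centroids == centroids:              # converged before maxIterations
            break
        centroids = new_centroids
    return centroids, assign

# two well-separated toy clusters
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centroids, assign = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(centroids)   # [(0.0, 0.5), (10.0, 10.5)]
print(assign)      # [0, 0, 1, 1]
```

The two stopping conditions mirror the Explorer options above: `max_iterations` corresponds to maxIterations, and the length of the initial centroid list plays the role of numClusters.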
Right-click to display the cluster visualization.
The value of attribute x is displayed on the x axis, and the value of attribute y on the y axis.