
DATA MINING AND BIG DATA

LSCAMP, LSCM, TI-ITS 2018


BUDI SANTOSA, BSANTOSA@GMAIL.COM
INDUSTRIAL ENGINEERING, ITS
OUTLINE

• Why data mining

• What is data mining

• The knowledge discovery process

• Disciplines behind data mining

• Supervised vs unsupervised

• Applications of data mining

• Software

• Data mining techniques


WHY DATA MINING?
• The Explosive Growth of Data: from terabytes to petabytes

• Data collection and data availability

• Automated data collection tools, database systems, Web, computerized society

• Major sources of abundant data

• Business: Web, e-commerce, transactions, stocks, …

• Science: Remote sensing, bioinformatics, scientific simulation, …

• Society and everyone: news, digital cameras, YouTube

• We are drowning in data, but starving for knowledge!

• “Necessity is the mother of invention”—Data mining—Automated analysis of massive data sets


WHAT IS DATA MINING

• Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amounts of data
• Is everything “data mining”? No. These, for example, are not:
• Simple search and query processing
• (Deductive) expert systems
KD PROCESS
DATA MINING IN THE KNOWLEDGE DISCOVERY PROCESS

Input Data → Data Pre-Processing → Data Mining → Post-Processing

• Data Pre-Processing: data integration, normalization, feature selection, dimension reduction
• Data Mining: pattern discovery, association and correlation, classification, clustering, outlier analysis, …
• Post-Processing: pattern evaluation, pattern selection, pattern interpretation, pattern visualization
DATA MINING: CONFLUENCE OF MULTIPLE DISCIPLINES

Data mining sits at the intersection of:
• Machine Learning
• Statistics
• Visualization
• Database Technology
• High-Performance Computing
SUPERVISED VS UNSUPERVISED

Supervised
• Starts by building a model from a training dataset whose labels or classes are already known
• The model is then used to predict the label or class of new data
• Has a training phase, followed by validation and testing phases
• Examples: SVM, ANN, LDA, logistic regression

Unsupervised
• Used to learn from data that has no labels or classes
• Learns how data or objects can be grouped into several groups without any prior example groups
• No training phase
• Examples: clustering, Self-Organizing Map

Semi-supervised learning
• Uses both labeled and unlabeled data for training, usually a small amount of labeled data and a large amount of unlabeled data
• Example: KNN (a supervised classifier in itself, but one that can be used to propagate labels from labeled to unlabeled points)
APPLICATIONS OF DATA MINING

• Web page analysis: web page classification and clustering
• Customer segmentation: industry can use data mining to find customer segments by considering additional variables beyond those commonly used
• Warranties: companies need to predict the number of customers who will file warranty claims and the average cost of those claims
• Manufacturing: companies can customize products for customers, so they need to predict which features must be included in a product to meet customer wishes
• Frequent flier incentives: airlines can identify groups of customers who can be given incentives to fly more often
SOFTWARE

• Clementine
• WEKA (University of Waikato)
• KNIME
• R
• NLTK
• Orange
• RapidMiner
• MATLAB
DATA MINING TASKS

[Figure: three scatter plots. Clustering and Classification are drawn on axes X1 and X2 (classification with points marked + and -); Regression is drawn as y against X1.]

• Clustering: k-means, hierarchical clustering, SOM
• Classification: Linear Discriminant Analysis, QDA, Logistic Regression (Logit), Decision Trees, LSSVM, NN, VS
• Regression: Classical Linear Regression, Ridge Regression, NN, CART
CLASSIFICATION

• Builds a model (function) from training data (supervised learning)
• Predicts the class of a new data point or object
• Methods: decision trees, naïve Bayesian classification, support vector machines, neural networks, K-NN, logistic regression
• Applications: credit card fraud detection, classifying diseases, web pages, customers
PROCESS (1): MODEL CONSTRUCTION

Training Data → Classification Algorithms → Classifier (Model)

Training data:

NAME   RANK             YEARS   TENURED
Mike   Assistant Prof   3       no
Mary   Assistant Prof   7       yes
Bill   Professor        2       yes
Jim    Associate Prof   7       yes
Dave   Assistant Prof   6       no
Anne   Associate Prof   3       no

Resulting classifier (model):
IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’
PROCESS (2): USING THE MODEL IN PREDICTION

The classifier is first checked against the testing data and then applied to unseen data.

Testing data:

NAME      RANK             YEARS   TENURED
Tom       Assistant Prof   2       no
Merlisa   Associate Prof   7       no
George    Professor        5       yes
Joseph    Assistant Prof   7       yes

Unseen data: (Jeff, Professor, 4). Tenured?
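As a minimal sketch of this prediction step, the rule from the model-construction slide can be written as a small function and applied to the testing data and to the unseen tuple. The function and variable names are illustrative and do not come from the slides.

```python
# Rule read off the model-construction slide:
# IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
def predict_tenured(rank, years):
    return "yes" if rank == "Professor" or years > 6 else "no"

# Testing data from the slide: (name, rank, years, actual label)
testing_data = [
    ("Tom", "Assistant Prof", 2, "no"),
    ("Merlisa", "Associate Prof", 7, "no"),
    ("George", "Professor", 5, "yes"),
    ("Joseph", "Assistant Prof", 7, "yes"),
]

# Predict every testing tuple and compare with the known label
predictions = [predict_tenured(rank, years) for _, rank, years, _ in testing_data]
correct = sum(pred == row[3] for pred, row in zip(predictions, testing_data))
accuracy = correct / len(testing_data)

# Apply the same rule to the unseen tuple (Jeff, Professor, 4)
jeff_prediction = predict_tenured("Professor", 4)

print(predictions, accuracy, jeff_prediction)
```

On this testing set the rule misclassifies Merlisa (7 years, not tenured), which is why a held-out testing set is used before trusting the model on unseen data.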
DECISION TREE

Training data set: Buys_computer

age     income   student   credit_rating   buys_computer
<=30    high     no        fair            no
<=30    high     no        excellent       no
31…40   high     no        fair            yes
>40     medium   no        fair            yes
>40     low      yes       fair            yes
>40     low      yes       excellent       no
31…40   low      yes       excellent       yes
<=30    medium   no        fair            no
<=30    low      yes       fair            yes
>40     medium   yes       fair            yes
<=30    medium   yes       excellent       yes
31…40   medium   no        excellent       yes
31…40   high     yes       fair            yes
>40     medium   no        excellent       no

Resulting tree:

age?
  <=30: student?
    no: no
    yes: yes
  31…40: yes
  >40: credit rating?
    excellent: no
    fair: yes
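The slide does not say how the root attribute was chosen. One common criterion is information gain (as in ID3), and the sketch below assumes that criterion to show why age ends up at the root for this training set. The helper names are illustrative.

```python
from collections import Counter
from math import log2

# The 14 training tuples from the slide:
# (age, income, student, credit_rating, buys_computer)
ROWS = [
    ("<=30", "high", "no", "fair", "no"),
    ("<=30", "high", "no", "excellent", "no"),
    ("31…40", "high", "no", "fair", "yes"),
    (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),
    (">40", "low", "yes", "excellent", "no"),
    ("31…40", "low", "yes", "excellent", "yes"),
    ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),
    (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"),
    ("31…40", "medium", "no", "excellent", "yes"),
    ("31…40", "high", "yes", "fair", "yes"),
    (">40", "medium", "no", "excellent", "no"),
]
ATTRS = ["age", "income", "student", "credit_rating"]

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def information_gain(rows, attr_index):
    """Entropy of the class label minus the weighted entropy after splitting."""
    labels = [r[-1] for r in rows]
    remainder = 0.0
    for value in set(r[attr_index] for r in rows):
        subset = [r[-1] for r in rows if r[attr_index] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy(labels) - remainder

gains = {name: information_gain(ROWS, i) for i, name in enumerate(ATTRS)}
best_attribute = max(gains, key=gains.get)

for name, gain in gains.items():
    print(f"{name}: {gain:.3f}")
print("root attribute:", best_attribute)
```

Under this assumption, age has the largest gain, which is consistent with the tree on the slide splitting on age first.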
DISCRIMINANT FUNCTION
• Can be any function of x, for example a nearest-neighbor rule, a decision tree, a linear function, or a nonlinear function
• Linear function: $g(x) = w^T x + b$
• K-nearest neighbor uses Euclidean distance to determine class membership.
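A minimal sketch of K-nearest-neighbor classification with Euclidean distance follows. The toy points, the choice k = 3 and the function name are assumptions made for illustration; they are not taken from the slides.

```python
from collections import Counter
from math import dist  # Euclidean distance between two points (Python 3.8+)

def knn_predict(train, query, k=3):
    """Return the majority label among the k training points closest to query.

    train is a list of (point, label) pairs; query is a point.
    """
    neighbors = sorted(train, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: two well-separated groups labeled +1 and -1
train = [
    ((1.0, 1.0), "+1"), ((1.5, 2.0), "+1"), ((2.0, 1.5), "+1"),
    ((6.0, 6.0), "-1"), ((6.5, 7.0), "-1"), ((7.0, 6.5), "-1"),
]

label_a = knn_predict(train, (1.8, 1.6))
label_b = knn_predict(train, (6.2, 6.4))
print(label_a, label_b)
```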
CLUSTER ANALYSIS

• Unsupervised learning (no classes, no training)
• Groups a set of objects into clusters so that each cluster contains objects that are similar to one another
• Principle: maximize intra-class similarity and minimize inter-class similarity
• Methods: k-means, hierarchical clustering, fuzzy c-means
• Applications: customer segmentation, marketing, population clustering
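As an illustration of the first method listed, here is a minimal k-means sketch (Lloyd's algorithm) in plain Python. The toy points are hypothetical, and the initial centroids are passed in explicitly so that the run is deterministic; practical implementations usually pick them at random.

```python
from math import dist

def kmeans(points, centroids, max_iter=100):
    """Alternate assignment and update steps until the centroids stop moving."""
    centroids = [tuple(c) for c in centroids]
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda j: dist(p, centroids[j]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid)
        new_centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else centroids[j]
            for j, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters

# Hypothetical toy data with two obvious groups
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (6.0, 6.0), (6.5, 7.0), (7.0, 6.5)]
final_centroids, final_clusters = kmeans(points, centroids=[points[0], points[3]])
print(final_centroids)
print(final_clusters)
```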

BASIC CONCEPTS: ASSOCIATION RULES

Tid   Items bought
10    coffee, Nuts, Diaper
20    coffee, tea, Diaper
30    coffee, Diaper, Eggs
40    Nuts, Eggs, Milk
50    Nuts, tea, Diaper, Eggs, Milk

• Find all the rules X → Y with minimum support and confidence
• support, s: probability that a transaction contains X ∪ Y
• confidence, c: conditional probability that a transaction having X also contains Y

[Figure: Venn diagram of “Customer buys coffee” and “Customer buys diaper”, with the overlap “Customer buys both”]

Let minsup = 50%, minconf = 50%
• Freq. Pat.: coffee:3, Nuts:3, Diaper:4, Eggs:3, {coffee, Diaper}:3
• Association rules (many more!):
  coffee → Diaper (60%, 100%)
  Diaper → coffee (60%, 75%)
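The support and confidence figures on this slide can be checked with a short sketch over the five transactions. The function names are illustrative; confidence is computed from raw counts to avoid floating-point noise.

```python
# The five transactions from the slide, keyed by transaction id
transactions = {
    10: {"coffee", "Nuts", "Diaper"},
    20: {"coffee", "tea", "Diaper"},
    30: {"coffee", "Diaper", "Eggs"},
    40: {"Nuts", "Eggs", "Milk"},
    50: {"Nuts", "tea", "Diaper", "Eggs", "Milk"},
}

def count(itemset):
    """Number of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= items for items in transactions.values())

def support(itemset):
    """Fraction of transactions that contain the itemset."""
    return count(itemset) / len(transactions)

def confidence(lhs, rhs):
    """Of the transactions containing lhs, the fraction that also contain rhs."""
    return count(set(lhs) | set(rhs)) / count(lhs)

rule_support = support({"coffee", "Diaper"})
conf_coffee_to_diaper = confidence({"coffee"}, {"Diaper"})
conf_diaper_to_coffee = confidence({"Diaper"}, {"coffee"})

print(rule_support, conf_coffee_to_diaper, conf_diaper_to_coffee)
```

Both rules share the same support because support depends only on the combined itemset {coffee, Diaper}; only the confidence depends on the direction of the rule.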
KEYS TO SUCCESS

• Data exploration followed by visualization of model results


• Overall data quality and management
• Easy model deployment to quickly get reliable and repeatable results
• Comparing various data mining models and identifying the best
• Automated data-to-decision process

Occam's razor: when two competing theories make exactly the same
predictions, the simpler one is better.
BIG DATA

• Software, big data analytics platforms: Apache Hadoop, Apache Spark, Storm, Samza (frameworks)
• Hardware: big data analytics requires a computer cluster, that is, computers connected in a
network, with a master (the head or brain) and slaves (the worker units).
• A computer cluster can be built in-house or rented from a cloud computing platform provider
such as AWS (Amazon Web Services) or Microsoft Azure.
• Big data analytics frameworks provide resource management, data parallelism,
parallel programming and distributed computing.
• They make it possible to write and run our own code on the cluster, so that a program
becomes far more powerful and faster by using all the computing resources
connected in the cluster.
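The master/worker split and the idea of data parallelism can be imitated on a single machine. The sketch below is only an analogy: it uses Python threads as stand-in workers and is not the API of Hadoop, Spark, Storm or Samza. The word-count task, the function names and the sample lines are assumptions chosen for illustration.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def count_words(chunk):
    """Worker task: count the words in one partition of the data."""
    return Counter(word for line in chunk for word in line.split())

def parallel_word_count(lines, n_workers=2):
    # The "master" partitions the data, one chunk per worker
    chunks = [lines[i::n_workers] for i in range(n_workers)]
    # Each worker processes its own chunk independently (data parallelism)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partial_counts = list(pool.map(count_words, chunks))
    # The "master" merges the partial results into the final answer
    total = Counter()
    for partial in partial_counts:
        total.update(partial)
    return total

lines = ["big data", "data mining", "big data analytics"]
word_counts = parallel_word_count(lines)
print(word_counts)
```

On a real cluster the chunks would live on different machines and the framework would handle scheduling, data movement and failures; the partition, process and merge pattern is the part this sketch is meant to show.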
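LARGE MARGIN LINEAR CLASSIFIER

[Figure: points in the (x1, x2) plane, with one marker denoting +1 and the other denoting -1; labels in the figure: Margin, x+, x-, n]

• Formulation:

$$\text{maximize } \frac{2}{\|w\|}$$

such that

$$\text{for } y_i = +1:\quad w^T x_i + b \ge 1$$
$$\text{for } y_i = -1:\quad w^T x_i + b \le -1$$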
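LARGE MARGIN LINEAR CLASSIFIER

[Figure: points in the (x1, x2) plane, with one marker denoting +1 and the other denoting -1; labels in the figure: Margin, x+, x-, n]

• Formulation:

$$\text{minimize } \frac{1}{2}\|w\|^2$$

such that

$$\text{for } y_i = +1:\quad w^T x_i + b \ge 1$$
$$\text{for } y_i = -1:\quad w^T x_i + b \le -1$$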
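LARGE MARGIN LINEAR CLASSIFIER

[Figure: points in the (x1, x2) plane, with one marker denoting +1 and the other denoting -1; labels in the figure: Margin, x+, x-, n]

• Formulation:

$$\text{minimize } \frac{1}{2}\|w\|^2$$

such that

$$y_i (w^T x_i + b) \ge 1$$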
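To make the formulation concrete, the sketch below checks the constraint y_i (w^T x_i + b) >= 1 and computes the margin width 2 / ||w|| for two candidate separators. The toy data and both (w, b) pairs are hypothetical and were chosen by hand; nothing here solves the optimization problem.

```python
from math import sqrt

def decision_value(w, b, x):
    """g(x) = w^T x + b"""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def satisfies_margin(w, b, data):
    """True if every point meets the constraint y_i * (w^T x_i + b) >= 1."""
    return all(y * decision_value(w, b, x) >= 1 for x, y in data)

def margin_width(w):
    """Width of the margin, 2 / ||w||."""
    return 2 / sqrt(sum(wi * wi for wi in w))

# Hypothetical toy data: (point, label) with labels +1 and -1
data = [((3.0, 3.0), +1), ((4.0, 3.0), +1), ((1.0, 1.0), -1), ((0.0, 1.0), -1)]

# Two hand-picked separators that both satisfy the constraints
w_wide, b_wide = (0.5, 0.5), -2.0
w_narrow, b_narrow = (1.0, 1.0), -4.0

for w, b in [(w_wide, b_wide), (w_narrow, b_narrow)]:
    print(w, b, satisfies_margin(w, b, data), round(margin_width(w), 3))
```

Both separators are feasible, but the one with the smaller norm of w has the wider margin, which is why maximizing 2 / ||w|| is equivalent to minimizing (1/2) ||w||^2.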
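SOLVE THE OPTIMIZATION PROBLEM

Quadratic programming with linear constraints:

$$\text{minimize } \frac{1}{2}\|w\|^2 \quad \text{s.t. } y_i (w^T x_i + b) \ge 1$$

Lagrangian function:

$$\text{minimize } L_p(w, b, \alpha_i) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{n} \alpha_i \left( y_i (w^T x_i + b) - 1 \right) \quad \text{s.t. } \alpha_i \ge 0$$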
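SOLVE THE OPTIMIZATION PROBLEM

$$\text{minimize } L_p(w, b, \alpha_i) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{n} \alpha_i \left( y_i (w^T x_i + b) - 1 \right) \quad \text{s.t. } \alpha_i \ge 0$$

Lagrangian dual problem:

$$\text{maximize } \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j x_i^T x_j$$

$$\text{s.t. } \alpha_i \ge 0, \text{ and } \sum_{i=1}^{n} \alpha_i y_i = 0$$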
CLASSIFIER EVALUATION METRICS: CONFUSION MATRIX

Confusion matrix:

Actual class \ Predicted class   C1                     ¬C1
C1                               True Positives (TP)    False Negatives (FN)
¬C1                              False Positives (FP)   True Negatives (TN)

Example of confusion matrix:

Actual class \ Predicted class   buy_computer = yes   buy_computer = no   Total
buy_computer = yes               6954                 46                  7000
buy_computer = no                412                  2588                3000
Total                            7366                 2634                10000

• Given m classes, an entry CM_{i,j} in a confusion matrix indicates the number of tuples in class i that were labeled by the classifier as class j
• May have extra rows/columns to provide totals
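The example matrix can be turned into the usual summary metrics. The slide shows only the counts, so the metric definitions below (accuracy, sensitivity, specificity, precision) are the standard ones and are added here as an illustration, with buy_computer = yes taken as the positive class.

```python
# Counts from the example confusion matrix (positive class: buy_computer = yes)
TP, FN = 6954, 46     # actual yes: predicted yes / predicted no
FP, TN = 412, 2588    # actual no:  predicted yes / predicted no

total = TP + FN + FP + TN

accuracy = (TP + TN) / total       # share of all tuples classified correctly
sensitivity = TP / (TP + FN)       # recall: share of actual positives found
specificity = TN / (TN + FP)       # share of actual negatives found
precision = TP / (TP + FP)         # share of predicted positives that are right

print(f"accuracy    = {accuracy:.4f}")
print(f"sensitivity = {sensitivity:.4f}")
print(f"specificity = {specificity:.4f}")
print(f"precision   = {precision:.4f}")
```

Accuracy alone hides the gap between the two classes here: the classifier finds almost all buyers but is noticeably weaker on non-buyers.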
DEEP LEARNING

• Deep learning is a learning approach that learns features and tasks directly from data. The data can
be images, text, or sound. The use of deep learning has grown over the past five years
because:
• 1. On some image classification tasks, deep learning methods now match or exceed human accuracy.

• 2. GPUs now make it possible to train deep networks in less time.

• 3. The large amounts of labeled data required for deep learning have become accessible over the last few
years.
Convolutional Neural Network
EXAMPLE APPLICATIONS

• Marketing and Sales


Companies are using machine learning technology to analyze the purchase
history of their customers and make personalized product recommendations for
their next purchase. This ability to capture, analyze, and use customer data to
provide a personalized shopping experience is the future of sales and marketing.
• Healthcare
With the advent of wearable sensors and devices that use data to assess the health of a
patient in real time, ML is becoming a fast-growing trend in healthcare.
Sensors in wearables provide real-time patient information, such as overall health
condition, heartbeat, blood pressure and other vital parameters.
Doctors and medical experts can use this information to analyze the health condition of
an individual, draw a pattern from the patient history, and predict the occurrence of any
ailments in the future.
The technology also empowers medical experts to analyze data to identify trends that
facilitate better diagnoses and treatment.
• Transportation
Based on the travel history and pattern of traveling across various routes,
machine learning can help transportation companies predict potential problems
that could arise on certain routes, and accordingly advise their customers to opt
for a different route. Transportation firms and delivery organizations are
increasingly using machine learning technology to carry out data analysis and
data modeling to make informed decisions and help their customers make smart
decisions when they travel.
• Oil and Gas
This is perhaps the industry that needs the application of machine learning the
most. From analyzing underground minerals and finding new energy
sources to streamlining oil distribution, ML applications for this industry are vast
and are still expanding.
• Medical Diagnosis:
• Based on a patient's symptoms and data collected from
previous patients, we can predict whether the patient will
suffer from the same disease. This can help support
paramedics.
PRODUCT RECOMMENDATION
Based on customers' purchase history and the product inventory, we can identify which products
are likely to interest a customer. The model yields a program whose job is to make
recommendations. Amazon has this capability. Netflix can likewise recommend which film is
relevant to watch next based on viewing history.
TORNADO DETECTION

• From Radar data


• Separated as tornado and no-tornado
• Data is imbalanced
• Detect tornadoes for the next few minutes
GENE EXPRESSION ANALYSIS

• The data is a set of E. coli whole-genome gene expression profiles.


• The problem is how to classify these genes based on their behavior in response to changing pH of the
growth medium and mutation of the acid tolerance response gene regulator GadX. K-Means clustering
is applied in a multi-level scheme to label the genes.
• Multi-level K-Means is itself an improvement over standard K-Means applications.
• The labels indicate the response of genes to the experimental variables: 1-unchanged, 2-decreased
expression level and 3-increased expression level.
