DATA SCIENCE
WEEK 7 BIG DATA AND DATA ANALYTICS
ANDRY ALAMSYAH
@ANDRYBREW
OUTLINE
o Data Simulation (Monte Carlo)
o Data Preprocessing
o Conceptual Learning Data / Machine Learning
o Model Evaluation / Accuracy
o Case Study / Exercise
Monte Carlo
Monte Carlo methods (or Monte Carlo experiments) are a broad class of computational
algorithms that rely on repeated random sampling to obtain numerical results. Their
essential idea is using randomness to solve problems that might be deterministic in principle.
They are often used in physical and mathematical problems and are most useful when it is
difficult or impossible to use other approaches. Monte Carlo methods are mainly used in
three distinct problem classes: [1] optimization, numerical integration, and generating draws
from a probability distribution.
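As a small illustration of repeated random sampling applied to numerical integration, the sketch below estimates pi by drawing points uniformly in the unit square and counting how many fall inside the quarter circle. The function name `estimate_pi`, the sample sizes, and the fixed seed are illustrative choices, not part of the original slides.

```python
import random


def estimate_pi(num_samples, seed=0):
    """Estimate pi from the fraction of random points inside a quarter circle."""
    rng = random.Random(seed)  # fixed seed (an assumption) so runs are repeatable
    inside = 0
    for _ in range(num_samples):
        x, y = rng.random(), rng.random()  # uniform point in the unit square
        if x * x + y * y <= 1.0:           # point lies inside the quarter circle
            inside += 1
    # Quarter-circle area is pi/4, so scale the hit fraction by 4.
    return 4.0 * inside / num_samples


if __name__ == "__main__":
    # More samples generally give an estimate closer to pi.
    for n in (100, 10_000, 200_000):
        print(n, estimate_pi(n))
```

The problem itself is deterministic (pi is a fixed number), yet randomness gives a usable answer whose error shrinks as the number of samples grows.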
Why Simulation
Simulations can often be even more realistic than traditional experiments, as they
allow the free configuration of environment parameters found in the operational
application field of the final product. Examples are supporting deep-water
operations of the US Navy and simulating the surface of neighboring planets in
preparation for NASA missions.
Simulations can often be conducted faster than real time. This allows using them
for efficient if-then-else analyses of different alternatives, in particular when the
necessary data to initialize the simulation can easily be obtained from operational
data. This use of simulation adds decision support simulation systems to the
toolbox of traditional decision support systems.
2. Data reduction
Dimensionality reduction
Numerosity reduction
Data compression
4. Data integration
Integration of multiple databases or files
Data Cleaning
Data in the Real World Is Dirty: Lots of potentially incorrect data, e.g., faulty
instruments, human or computer error, transmission error
Incomplete: lacking attribute values, lacking certain attributes of interest, or
containing only aggregate data
e.g., Occupation = (missing data)
Missing data may be due to:
equipment malfunction
values inconsistent with other recorded data and thus deleted
data not entered due to misunderstanding
certain data not considered important at the
time of entry
history or changes of the data not registered
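A minimal sketch of how incomplete records might be detected and filled in is given below. The records, the attribute names `age` and `occupation`, and the fill strategy (mean for numeric attributes, most frequent value otherwise) are all assumptions made for illustration; the slides do not prescribe a particular imputation method.

```python
from collections import Counter
from statistics import mean

# Hypothetical records; None marks a missing attribute value.
records = [
    {"age": 34, "occupation": "teacher"},
    {"age": None, "occupation": "engineer"},
    {"age": 51, "occupation": None},
    {"age": 29, "occupation": "teacher"},
]


def count_missing(rows):
    """Return how many rows lack a value for each attribute."""
    counts = Counter()
    for row in rows:
        for key, value in row.items():
            if value is None:
                counts[key] += 1
    return dict(counts)


def impute(rows):
    """Fill numeric gaps with the column mean and other gaps with the mode.

    Assumes every column has at least one observed value.
    """
    columns = list(rows[0].keys())
    fill = {}
    for col in columns:
        present = [r[col] for r in rows if r[col] is not None]
        if all(isinstance(v, (int, float)) for v in present):
            fill[col] = mean(present)                           # numeric attribute
        else:
            fill[col] = Counter(present).most_common(1)[0][0]   # categorical attribute
    return [
        {c: (r[c] if r[c] is not None else fill[c]) for c in columns}
        for r in rows
    ]


print(count_missing(records))
print(impute(records))
```

Whether filling in values is appropriate depends on why the data is missing; dropping the affected records is another common option.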
2. Prediction/Forecasting:
3. Classification:
4. Clustering:
5. Association:
Machine Learning
In the field of data analytics, machine learning is a method used to devise complex models and
algorithms that lend themselves to prediction; in commercial use, this is known as predictive analytics.
These analytical models allow researchers, data scientists, engineers, and analysts to "produce
reliable, repeatable decisions and results" and uncover "hidden insights" through learning from
historical relationships and trends in the data (Wikipedia)
Machine learning is the science of getting computers to act without being explicitly
programmed. In the past decade, machine learning has given us self-driving cars,
practical speech recognition, effective web search, and a vastly improved
understanding of the human genome. Machine learning is so pervasive today that you
probably use it dozens of times a day without knowing it. Many researchers also think
it is the best way to make progress towards human-level AI. (Stanford/Coursera)
Machine learning is a type of artificial intelligence (AI) that provides computers with the
ability to learn without being explicitly programmed. Machine learning focuses on the
development of computer programs that can teach themselves to grow and change
when exposed to new data. (whatis.com)
Data Split
The Split Data operator takes a dataset as its input and delivers the subsets of that
dataset through its output ports
The sampling type parameter decides how the examples should be shuffled in the
resultant partitions:
1. Linear sampling: Linear sampling simply divides the dataset into partitions
without changing the order of the examples
Subsets with consecutive examples are created
2. Shuffled sampling: Shuffled sampling builds random subsets of the dataset
Examples are chosen randomly for making subsets
3. Stratified sampling: Stratified sampling builds random subsets and ensures that
the class distribution in the subsets is the same as in the whole dataset
In the case of a binominal classification, stratified sampling builds random
subsets so that each subset contains roughly the same proportions of the two
values of the label
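The three sampling types can be sketched in plain Python as follows. This is not the RapidMiner Split Data operator itself, only an illustration of the ideas; the function names, the two-way split, and the `ratio` parameter are assumptions introduced here.

```python
import random
from collections import defaultdict


def linear_split(examples, ratio):
    """Cut the list at a fixed position, keeping the original order."""
    cut = int(len(examples) * ratio)
    return examples[:cut], examples[cut:]


def shuffled_split(examples, ratio, seed=0):
    """Shuffle a copy of the examples, then cut it into two subsets."""
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)
    return linear_split(shuffled, ratio)


def stratified_split(examples, labels, ratio, seed=0):
    """Split each class separately so both subsets keep the class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for example, label in zip(examples, labels):
        by_class[label].append(example)
    first, second = [], []
    for members in by_class.values():
        rng.shuffle(members)
        cut = int(len(members) * ratio)
        first.extend(members[:cut])
        second.extend(members[cut:])
    return first, second


# Hypothetical dataset: ten example ids, six labelled "yes" and four "no".
ids = list(range(10))
labels = ["yes"] * 6 + ["no"] * 4
print(linear_split(ids, 0.5))
print(shuffled_split(ids, 0.5))
print(stratified_split(ids, labels, 0.5))
```

With the hypothetical labels above, each half produced by `stratified_split` holds three "yes" and two "no" examples, matching the 60/40 proportion of the whole dataset.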
Use each subset in turn as testing data and the rest as training data
This method is also called k-fold cross-validation
We often use stratified sampling before the cross-validation process,
because it reduces the variance of the estimate
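A minimal sketch of stratified k-fold cross-validation, assuming the examples are identified by their index and the labels are given as a list. The function names are hypothetical; tools such as RapidMiner provide this as a built-in operator.

```python
import random
from collections import defaultdict


def stratified_k_folds(labels, k, seed=0):
    """Assign every example index to one of k folds, class by class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(k)]
    position = 0
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the indices of each class round-robin over the folds.
        for index in indices:
            folds[position % k].append(index)
            position += 1
    return folds


def cross_validation_splits(labels, k, seed=0):
    """Yield (train_indices, test_indices); each fold is the test set once."""
    folds = stratified_k_folds(labels, k, seed)
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# Hypothetical labels: 60 "yes" and 40 "no" examples, split into 10 folds.
labels = ["yes"] * 60 + ["no"] * 40
for train, test in cross_validation_splits(labels, k=10):
    print(len(train), len(test))
```

In each of the ten iterations a model would be trained on `train` and scored on `test`; the ten scores are then averaged.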
10-Fold Cross-Validation
Experiment    Accuracy
1             93%
2             91%
3             90%
4             93%
5             93%
6             91%
7             94%
8             93%
9             91%
10            90%
Average accuracy: 92%
Orange box in the Dataset column: k-th subset (testing data)
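The average accuracy can be checked from the ten per-fold accuracies listed above. The variable names below are illustrative.

```python
from statistics import mean

# Accuracy (in percent) of each of the ten cross-validation experiments.
fold_accuracy = [93, 91, 90, 93, 93, 91, 94, 93, 91, 90]

average = mean(fold_accuracy)
print(f"average accuracy: {average:.1f}%")
print(f"lowest fold: {min(fold_accuracy)}%, highest fold: {max(fold_accuracy)}%")
```

The mean of the ten values is 91.9%, which the slide reports rounded to 92%.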
Exercise:
1. Use one of the following tools: RapidMiner, R, Orange, Weka
2. Create a prediction model (predicting the electability of legislative candidates) using the
training data from the election dataset (datapemilukpu.xls) with the following
algorithms:
1. Decision Tree (C4.5)
2. Naive Bayes (NB)
3. K-Nearest Neighbor (K-NN)
            C4.5      NB        K-NN
Accuracy    92.45%    77.46%    88.72%
AUC         0.851     0.840     0.5
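The two measures reported above can be computed as in the sketch below. The labels and scores are made-up values, and the pairwise AUC formula (the fraction of positive/negative pairs in which the positive example receives the higher score, with ties counted as half) is one standard way to compute it, not necessarily how a given tool does so internally.

```python
def accuracy(y_true, y_pred):
    """Fraction of examples whose predicted label equals the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def auc(y_true, scores, positive=1):
    """Probability that a random positive outranks a random negative."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


# Hypothetical true labels and model scores for six examples.
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.2]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # threshold of 0.5 is an assumption

print(accuracy(y_true, y_pred))
print(auc(y_true, scores))
print(auc(y_true, [0.5] * 6))  # identical scores for every example
```

A model that gives every example the same score reaches an AUC of exactly 0.5, so an AUC of 0.5 indicates scores that rank positives no better than chance, even when the accuracy looks high.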