
CONCEPTUAL DATA SCIENCE
WEEK 7: BIG DATA AND DATA ANALYTICS
ANDRY ALAMSYAH
@ANDRYBREW

OUTLINE
o Data Simulation (Monte Carlo)
o Data Preprocessing
o Conceptual Learning from Data / Machine Learning
o Model Evaluation / Accuracy
o Case Study / Exercise

Modeling and Simulation


Modeling and simulation (M&S) refers to using models (physical, mathematical, or otherwise logical representations of a system, entity, phenomenon, or process) as a basis for simulations (methods for implementing a model, either statically or over time) to develop data as a basis for managerial or technical decision making. M&S helps get information about how something will behave without actually testing it in real life. (Wikipedia)

An example of simulation: Monte Carlo methods

Monte Carlo
Monte Carlo methods (or Monte Carlo experiments) are a broad class of computational algorithms that rely on repeated random sampling to obtain numerical results. Their essential idea is using randomness to solve problems that might be deterministic in principle. They are often used in physical and mathematical problems and are most useful when it is difficult or impossible to use other approaches. Monte Carlo methods are mainly used in three distinct problem classes: optimization, numerical integration, and generating draws from a probability distribution.
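To make the idea concrete, here is a minimal Python sketch (not from the slides) that estimates pi by repeated random sampling: points are drawn uniformly in the unit square, and the fraction that lands inside the quarter circle approaches pi/4.

import random

def estimate_pi(n_samples: int = 1_000_000) -> float:
    """Estimate pi via Monte Carlo: repeated random sampling."""
    inside = 0
    for _ in range(n_samples):
        x, y = random.random(), random.random()
        if x * x + y * y <= 1.0:  # point falls inside the quarter circle
            inside += 1
    # (area of quarter circle) / (area of unit square) = pi / 4
    return 4 * inside / n_samples

print(estimate_pi())  # roughly 3.14; accuracy improves with more samples

The estimate converges slowly (error shrinks like 1/sqrt(n)), which is exactly why Monte Carlo shines where deterministic methods are hard, not where they are cheap.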

Monte Carlo Example

GoldSim video: Monte Carlo simulation


Why Simulation?

Simulation is generally cheaper, safer, and sometimes more ethical than conducting real-world experiments. For example, supercomputers are sometimes used to simulate the detonation of nuclear devices and their effects in order to support better preparedness in the event of a nuclear explosion. Similar efforts are conducted to simulate hurricanes and other natural catastrophes.

Simulations can often be even more realistic than traditional experiments, as they allow the free configuration of environment parameters found in the operational application field of the final product. Examples are supporting deep-water operations of the US Navy or simulating the surfaces of neighboring planets in preparation for NASA missions.

Simulations can often be conducted faster than real time. This allows using them for efficient if-then-else analyses of different alternatives, in particular when the necessary data to initialize the simulation can easily be obtained from operational data. This use of simulation adds decision support simulation systems to the toolbox of traditional decision support systems.

Data Preprocessing (Why?)

Measures for data quality: a multidimensional view

Accuracy: correct or wrong, accurate or not
Completeness: not recorded, unavailable
Consistency: some modified but some not
Timeliness: timely update?
Believability: how trustworthy is the data?
Interpretability: how easily can the data be understood?

Major Tasks in Data Preprocessing

1. Data cleaning
   Fill in missing values
   Smooth noisy data
   Identify or remove outliers
   Resolve inconsistencies

2. Data reduction
   Dimensionality reduction
   Numerosity reduction
   Data compression

3. Data transformation and data discretization
   Normalization (a small sketch follows this list)
   Concept hierarchy generation

4. Data integration
   Integration of multiple databases or files
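As a small illustration of the normalization step, here is a hedged Python sketch; the function name and the sample salary column are invented for the example.

import numpy as np

def min_max_normalize(values):
    """Rescale a column to the [0, 1] range (min-max normalization)."""
    values = np.asarray(values, dtype=float)
    lo, hi = values.min(), values.max()
    if hi == lo:  # avoid division by zero on a constant column
        return np.zeros_like(values)
    return (values - lo) / (hi - lo)

salaries = [30_000, 45_000, 60_000, 120_000]  # illustrative salary column
print(min_max_normalize(salaries))            # approx. [0. 0.167 0.333 1.]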

Data Cleaning

Data in the real world is dirty: lots of potentially incorrect data, e.g., faulty instruments, human or computer error, transmission errors

Incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
   e.g., Occupation = "" (missing data)

Noisy: containing noise, errors, or outliers
   e.g., Salary = -10 (an error)

Inconsistent: containing discrepancies in codes or names
   e.g., Age = 42, Birthday = 03/07/2010
   Was rating 1, 2, 3, now rating A, B, C
   Discrepancy between duplicate records

Intentional (e.g., disguised missing data)
   Jan. 1 as everyone's birthday?

Incomplete (Missing) Data

Data is not always available
   E.g., many tuples have no recorded value for several attributes, such as customer income in sales data

Missing data may be due to:
   equipment malfunction
   data deleted because it was inconsistent with other recorded data
   data not entered due to misunderstanding
   certain data not considered important at the time of entry
   history or changes of the data not being registered

Missing data may need to be inferred (a small imputation sketch follows)
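A minimal sketch of filling in missing values with pandas; the DataFrame and column names are illustrative, and mean imputation is just one common choice (scikit-learn's SimpleImputer is another).

import pandas as pd

df = pd.DataFrame({
    "income": [52_000, None, 48_000, None, 61_000],  # missing customer income
    "age":    [34, 41, None, 29, 50],
})

# Fill numeric gaps with each column's mean; the median is more robust to outliers
df_filled = df.fillna(df.mean(numeric_only=True))
print(df_filled)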

Data Reduction Strategies

Data reduction: obtain a reduced representation of the data set that is much smaller in volume, yet produces (almost) the same analytical results

Why data reduction?
   A database/data warehouse may store terabytes of data
   Complex data analysis may take a very long time to run on the complete dataset

Data reduction strategies:
1. Dimensionality reduction
   1. Feature extraction (a small PCA sketch follows this list)
   2. Feature selection
2. Numerosity reduction (data reduction)
   Regression and log-linear models
   Histograms, clustering, sampling
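As one hedged example of feature extraction, principal component analysis (PCA) with scikit-learn compresses correlated columns into a few components; the synthetic data below is only for illustration.

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))                          # 200 examples, 10 features
X[:, 5:] = X[:, :5] + 0.1 * rng.normal(size=(200, 5))   # make half the columns redundant

pca = PCA(n_components=3)              # keep the 3 strongest directions
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)                 # (200, 3)
print(pca.explained_variance_ratio_)   # variance captured by each component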

General Methods in Data Analytics

1. Estimation:
   Linear Regression, Neural Network, Support Vector Machine, etc.

2. Prediction/Forecasting:
   Linear Regression, Neural Network, Support Vector Machine, etc.

3. Classification:
   Naive Bayes, K-Nearest Neighbor, C4.5, ID3, CART, Linear Discriminant Analysis, Logistic Regression, etc.

4. Clustering:
   K-Means, K-Medoids, Self-Organizing Map (SOM), Fuzzy C-Means, etc. (a short K-Means sketch follows this list)

5. Association:
   FP-Growth, Apriori, Coefficient of Correlation, Chi-Square, etc.
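To make one of these methods concrete, here is a minimal K-Means clustering sketch with scikit-learn; the toy blob data is invented for the example.

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Toy dataset: 300 points scattered around 3 centers
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)   # cluster index assigned to each point
print(kmeans.cluster_centers_)   # learned cluster centers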

Evaluation (Accuracy, Error)

1. Estimation:
   Error: Root Mean Square Error (RMSE), MSE, MAPE, etc.

2. Prediction/Forecasting:
   Error: Root Mean Square Error (RMSE), MSE, MAPE, etc.

3. Classification:
   Confusion Matrix: Accuracy
   ROC Curve: Area Under Curve (AUC)
   (a metrics sketch follows this list)

4. Clustering:
   Internal evaluation: Davies-Bouldin index, Dunn index
   External evaluation: Rand measure, F-measure, Jaccard index, Fowlkes-Mallows index, Confusion matrix

5. Association:
   Lift Charts: Lift Ratio
   Precision and Recall (F-measure)
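A hedged sketch of the classification metrics above, computed with scikit-learn on made-up labels and scores.

from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score

y_true  = [1, 0, 1, 1, 0, 1, 0, 0]                    # actual classes (toy values)
y_pred  = [1, 0, 1, 0, 0, 1, 1, 0]                    # hard predictions from some model
y_score = [0.9, 0.2, 0.8, 0.4, 0.3, 0.7, 0.6, 0.1]    # predicted probabilities

print(confusion_matrix(y_true, y_pred))   # rows: actual class, columns: predicted class
print(accuracy_score(y_true, y_pred))     # fraction of correct predictions
print(roc_auc_score(y_true, y_score))     # Area Under the ROC Curve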

Machine Learning

In the field of data analytics, machine learning is a method used to devise complex models and algorithms that lend themselves to prediction; in commercial use, this is known as predictive analytics. These analytical models allow researchers, data scientists, engineers, and analysts to "produce reliable, repeatable decisions and results" and uncover "hidden insights" through learning from historical relationships and trends in the data. (Wikipedia)

Machine learning is the science of getting computers to act without being explicitly programmed. In the past decade, machine learning has given us self-driving cars, practical speech recognition, effective web search, and a vastly improved understanding of the human genome. Machine learning is so pervasive today that you probably use it dozens of times a day without knowing it. Many researchers also think it is the best way to make progress towards human-level AI. (Stanford/Coursera)

Machine learning is a type of artificial intelligence (AI) that provides computers with the ability to learn without being explicitly programmed. Machine learning focuses on the development of computer programs that can teach themselves to grow and change when exposed to new data. (WhatIs.com)

Data Split

The Split Data operator (in RapidMiner) takes a dataset as its input and delivers subsets of that dataset through its output ports.

The sampling type parameter decides how the examples are distributed into the resulting partitions:
1. Linear sampling: simply divides the dataset into partitions without changing the order of the examples; subsets of consecutive examples are created
2. Shuffled sampling: builds random subsets of the dataset; examples are chosen randomly for the subsets
3. Stratified sampling: builds random subsets and ensures that the class distribution in the subsets is the same as in the whole dataset; in the case of a binominal (two-class) classification, stratified sampling builds random subsets so that each subset contains roughly the same proportions of the two label values

We split the data into two groups: training data and testing data (a split sketch follows).
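A minimal scikit-learn counterpart to a stratified split; the 80/20 ratio and the Iris dataset are illustrative choices, not from the slides.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# stratify=y keeps the class proportions the same in both subsets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print(len(X_train), len(X_test))  # 120 30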

Cross-Validation Methods

Cross-validation is used to avoid overlap between the test sets chosen in successive experiments.
Cross-validation steps:
   Divide the data into k subsets of equal size
   Use each subset in turn as testing data and the rest as training data
This method is also called k-fold cross-validation.
We often use stratified sampling before cross-validation, because it reduces the variance of the accuracy estimate (a stratified k-fold sketch follows).
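A hedged sketch of stratified 10-fold cross-validation with scikit-learn; the decision tree model and the Iris dataset are chosen only for illustration.

from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
scores = cross_val_score(DecisionTreeClassifier(random_state=42), X, y, cv=cv)

print(scores)         # one accuracy value per fold (10 values)
print(scores.mean())  # average accuracy across the folds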

10-Fold Cross-Validation

Experiment    Accuracy
 1            93%
 2            91%
 3            90%
 4            93%
 5            93%
 6            91%
 7            94%
 8            93%
 9            91%
10            90%
Average       92%

(In the original slide, a figure shows the dataset split into k subsets, with an orange box marking the subset used as testing data in each experiment.)

Case Study: NBA

Exercise:
1. Use one of the following tools: RapidMiner, R, Orange, Weka
2. Create a prediction model (predicting the electability of legislative candidates) from the training data in the election dataset (datapemilukpu.xls), using the following algorithms:
   1. Decision Tree (C4.5)
   2. Naïve Bayes (NB)
   3. K-Nearest Neighbor (K-NN)
3. Evaluate accuracy using 10-fold cross-validation (a sketch follows)
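The exercise targets RapidMiner/R/Orange/Weka, but the same comparison can be sketched in Python with scikit-learn. This is only a hedged illustration: a built-in toy dataset stands in for datapemilukpu.xls, and scikit-learn's entropy-based decision tree approximates C4.5 rather than implementing it exactly.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)  # stand-in for datapemilukpu.xls

models = {
    "C4.5 (approximated by CART)": DecisionTreeClassifier(criterion="entropy", random_state=42),
    "Naive Bayes": GaussianNB(),
    "K-NN": KNeighborsClassifier(n_neighbors=5),
}

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    print(f"{name}: mean accuracy {scores.mean():.4f}")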


Results:

            C4.5      NB        K-NN
Accuracy    92.45%    77.46%    88.72%
AUC         0.851     0.840     0.5
