
CONCEPTUAL DATA SCIENCE
WEEK 7: BIG DATA AND DATA ANALYTICS
ANDRY ALAMSYAH
@ANDRYBREW

OUTLINE
o Data Simulation (Monte Carlo)
o Data Preprocessing
o Conceptual Learning from Data / Machine Learning
o Model Evaluation / Accuracy
o Case Study / Exercise

Modeling and Simulation


Modeling and simulation (M&S) refers to using models (physical, mathematical, or otherwise logical representations of a system, entity, phenomenon, or process) as a basis for simulations (methods for implementing a model, either statically or over time) to develop data as a basis for managerial or technical decision making. M&S helps get information about how something will behave without actually testing it in real life. (Wikipedia)

An example of simulation: Monte Carlo methods

Monte Carlo
Monte Carlo methods (or Monte Carlo experiments) are a broad class of computational algorithms that rely on repeated random sampling to obtain numerical results. Their essential idea is using randomness to solve problems that might be deterministic in principle. They are often used in physical and mathematical problems and are most useful when it is difficult or impossible to use other approaches. Monte Carlo methods are mainly used in three distinct problem classes: optimization, numerical integration, and generating draws from a probability distribution.
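To make the idea concrete, here is a minimal Python sketch (not from the slides) that estimates pi by repeated random sampling: points are drawn uniformly in the unit square, and the fraction that lands inside the quarter circle approaches pi/4.

import random

def estimate_pi(n_samples: int = 1_000_000) -> float:
    """Estimate pi via Monte Carlo: repeated random sampling."""
    inside = 0
    for _ in range(n_samples):
        x, y = random.random(), random.random()
        if x * x + y * y <= 1.0:  # point falls inside the quarter circle
            inside += 1
    # (area of quarter circle) / (area of unit square) = pi / 4
    return 4 * inside / n_samples

print(estimate_pi())  # roughly 3.14; accuracy improves with more samples

The estimate converges slowly (error shrinks like 1/sqrt(n)), which is exactly why Monte Carlo shines where deterministic methods are hard, not where they are cheap.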

Monte Carlo Example

GoldSim video: Monte Carlo simulation


Why Simulation?

Simulation is generally cheaper, safer, and sometimes more ethical than conducting real-world experiments. For example, supercomputers are sometimes used to simulate the detonation of nuclear devices and their effects in order to support better preparedness in the event of a nuclear explosion. Similar efforts are conducted to simulate hurricanes and other natural catastrophes.

Simulations can often be even more realistic than traditional experiments, as they allow the free configuration of environment parameters found in the operational application field of the final product. Examples are supporting deep-water operations of the US Navy or simulating the surfaces of neighboring planets in preparation for NASA missions.

Simulations can often be conducted faster than real time. This allows using them for efficient if-then-else analyses of different alternatives, in particular when the necessary data to initialize the simulation can easily be obtained from operational data. This use of simulation adds decision support simulation systems to the toolbox of traditional decision support systems.

Data Preprocessing (Why?)

Measures for data quality: a multidimensional view

Accuracy: correct or wrong, accurate or not
Completeness: not recorded, unavailable
Consistency: some modified but some not
Timeliness: timely update?
Believability: how trustworthy is the data?
Interpretability: how easily can the data be understood?

Major Tasks in Data Preprocessing

1. Data cleaning
   Fill in missing values
   Smooth noisy data
   Identify or remove outliers
   Resolve inconsistencies

2. Data reduction
   Dimensionality reduction
   Numerosity reduction
   Data compression

3. Data transformation and data discretization
   Normalization (a small sketch follows this list)
   Concept hierarchy generation

4. Data integration
   Integration of multiple databases or files
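As a small illustration of the normalization step, here is a hedged Python sketch; the function name and the sample salary column are invented for the example.

import numpy as np

def min_max_normalize(values):
    """Rescale a column to the [0, 1] range (min-max normalization)."""
    values = np.asarray(values, dtype=float)
    lo, hi = values.min(), values.max()
    if hi == lo:  # avoid division by zero on a constant column
        return np.zeros_like(values)
    return (values - lo) / (hi - lo)

salaries = [30_000, 45_000, 60_000, 120_000]  # illustrative salary column
print(min_max_normalize(salaries))            # approx. [0. 0.167 0.333 1.]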

Data Cleaning

Data in the real world is dirty: lots of potentially incorrect data, e.g., faulty instruments, human or computer error, transmission errors

Incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
   e.g., Occupation = "" (missing data)

Noisy: containing noise, errors, or outliers
   e.g., Salary = -10 (an error)

Inconsistent: containing discrepancies in codes or names
   e.g., Age = 42, Birthday = 03/07/2010
   Was rating 1, 2, 3, now rating A, B, C
   Discrepancy between duplicate records

Intentional (e.g., disguised missing data)
   Jan. 1 as everyone's birthday?

Incomplete (Missing) Data

Data is not always available
   E.g., many tuples have no recorded value for several attributes, such as customer income in sales data

Missing data may be due to:
   equipment malfunction
   data deleted because it was inconsistent with other recorded data
   data not entered due to misunderstanding
   certain data not considered important at the time of entry
   history or changes of the data not being registered

Missing data may need to be inferred (a small imputation sketch follows)
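A minimal sketch of filling in missing values with pandas; the DataFrame and column names are illustrative, and mean imputation is just one common choice (scikit-learn's SimpleImputer is another).

import pandas as pd

df = pd.DataFrame({
    "income": [52_000, None, 48_000, None, 61_000],  # missing customer income
    "age":    [34, 41, None, 29, 50],
})

# Fill numeric gaps with each column's mean; the median is more robust to outliers
df_filled = df.fillna(df.mean(numeric_only=True))
print(df_filled)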

Data Reduction Strategies

Data reduction: obtain a reduced representation of the data set that is much smaller in volume, yet produces (almost) the same analytical results

Why data reduction?
   A database/data warehouse may store terabytes of data
   Complex data analysis may take a very long time to run on the complete dataset

Data reduction strategies:
1. Dimensionality reduction
   1. Feature extraction (a small PCA sketch follows this list)
   2. Feature selection
2. Numerosity reduction (data reduction)
   Regression and log-linear models
   Histograms, clustering, sampling
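As one hedged example of feature extraction, principal component analysis (PCA) with scikit-learn compresses correlated columns into a few components; the synthetic data below is only for illustration.

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))                          # 200 examples, 10 features
X[:, 5:] = X[:, :5] + 0.1 * rng.normal(size=(200, 5))   # make half the columns redundant

pca = PCA(n_components=3)              # keep the 3 strongest directions
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)                 # (200, 3)
print(pca.explained_variance_ratio_)   # variance captured by each component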

General Methods in Data Analytics

1. Estimation:
   Linear Regression, Neural Network, Support Vector Machine, etc.

2. Prediction/Forecasting:
   Linear Regression, Neural Network, Support Vector Machine, etc.

3. Classification:
   Naive Bayes, K-Nearest Neighbor, C4.5, ID3, CART, Linear Discriminant Analysis, Logistic Regression, etc.

4. Clustering:
   K-Means, K-Medoids, Self-Organizing Map (SOM), Fuzzy C-Means, etc. (a short K-Means sketch follows this list)

5. Association:
   FP-Growth, Apriori, Coefficient of Correlation, Chi-Square, etc.
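To make one of these methods concrete, here is a minimal K-Means clustering sketch with scikit-learn; the toy blob data is invented for the example.

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Toy dataset: 300 points scattered around 3 centers
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)   # cluster index assigned to each point
print(kmeans.cluster_centers_)   # learned cluster centers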

Evaluation (Accuracy, Error)

1. Estimation:
   Error: Root Mean Square Error (RMSE), MSE, MAPE, etc.

2. Prediction/Forecasting:
   Error: Root Mean Square Error (RMSE), MSE, MAPE, etc.

3. Classification:
   Confusion Matrix: Accuracy
   ROC Curve: Area Under Curve (AUC)
   (a metrics sketch follows this list)

4. Clustering:
   Internal evaluation: Davies-Bouldin index, Dunn index
   External evaluation: Rand measure, F-measure, Jaccard index, Fowlkes-Mallows index, Confusion matrix

5. Association:
   Lift Charts: Lift Ratio
   Precision and Recall (F-measure)
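A hedged sketch of the classification metrics above, computed with scikit-learn on made-up labels and scores.

from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score

y_true  = [1, 0, 1, 1, 0, 1, 0, 0]                    # actual classes (toy values)
y_pred  = [1, 0, 1, 0, 0, 1, 1, 0]                    # hard predictions from some model
y_score = [0.9, 0.2, 0.8, 0.4, 0.3, 0.7, 0.6, 0.1]    # predicted probabilities

print(confusion_matrix(y_true, y_pred))   # rows: actual class, columns: predicted class
print(accuracy_score(y_true, y_pred))     # fraction of correct predictions
print(roc_auc_score(y_true, y_score))     # Area Under the ROC Curve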

Machine Learning

In the field of data analytics, machine learning is a method used to devise complex models and algorithms that lend themselves to prediction; in commercial use, this is known as predictive analytics. These analytical models allow researchers, data scientists, engineers, and analysts to "produce reliable, repeatable decisions and results" and uncover "hidden insights" through learning from historical relationships and trends in the data. (Wikipedia)

Machine learning is the science of getting computers to act without being explicitly programmed. In the past decade, machine learning has given us self-driving cars, practical speech recognition, effective web search, and a vastly improved understanding of the human genome. Machine learning is so pervasive today that you probably use it dozens of times a day without knowing it. Many researchers also think it is the best way to make progress towards human-level AI. (Stanford/Coursera)

Machine learning is a type of artificial intelligence (AI) that provides computers with the ability to learn without being explicitly programmed. Machine learning focuses on the development of computer programs that can teach themselves to grow and change when exposed to new data. (WhatIs.com)

Data Split

The Split Data operator (in RapidMiner) takes a dataset as its input and delivers subsets of that dataset through its output ports.

The sampling type parameter decides how the examples are distributed into the resulting partitions:
1. Linear sampling: simply divides the dataset into partitions without changing the order of the examples; subsets of consecutive examples are created
2. Shuffled sampling: builds random subsets of the dataset; examples are chosen randomly for the subsets
3. Stratified sampling: builds random subsets and ensures that the class distribution in the subsets is the same as in the whole dataset; in the case of a binominal (two-class) classification, stratified sampling builds random subsets so that each subset contains roughly the same proportions of the two label values

We split the data into two groups: training data and testing data (a split sketch follows).
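A minimal scikit-learn counterpart to a stratified split; the 80/20 ratio and the Iris dataset are illustrative choices, not from the slides.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# stratify=y keeps the class proportions the same in both subsets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print(len(X_train), len(X_test))  # 120 30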

Cross-Validation Methods

Cross-validation is used to avoid overlap between the test sets chosen in successive experiments.
Cross-validation steps:
   Divide the data into k subsets of equal size
   Use each subset in turn as testing data and the rest as training data
This method is also called k-fold cross-validation.
We often use stratified sampling before cross-validation, because it reduces the variance of the accuracy estimate (a stratified k-fold sketch follows).
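A hedged sketch of stratified 10-fold cross-validation with scikit-learn; the decision tree model and the Iris dataset are chosen only for illustration.

from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
scores = cross_val_score(DecisionTreeClassifier(random_state=42), X, y, cv=cv)

print(scores)         # one accuracy value per fold (10 values)
print(scores.mean())  # average accuracy across the folds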

10-Fold Cross-Validation

Experiment    Accuracy
 1            93%
 2            91%
 3            90%
 4            93%
 5            93%
 6            91%
 7            94%
 8            93%
 9            91%
10            90%
Average       92%

(In the original slide, a figure shows the dataset split into k subsets, with an orange box marking the subset used as testing data in each experiment.)

Case Study: NBA

Exercise:
1. Use one of the following tools: RapidMiner, R, Orange, Weka
2. Create a prediction model (predicting the electability of legislative candidates) from the training data in the election dataset (datapemilukpu.xls), using the following algorithms:
   1. Decision Tree (C4.5)
   2. Naïve Bayes (NB)
   3. K-Nearest Neighbor (K-NN)
3. Evaluate accuracy using 10-fold cross-validation (a sketch follows)
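The exercise targets RapidMiner/R/Orange/Weka, but the same comparison can be sketched in Python with scikit-learn. This is only a hedged illustration: a built-in toy dataset stands in for datapemilukpu.xls, and scikit-learn's entropy-based decision tree approximates C4.5 rather than implementing it exactly.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)  # stand-in for datapemilukpu.xls

models = {
    "C4.5 (approximated by CART)": DecisionTreeClassifier(criterion="entropy", random_state=42),
    "Naive Bayes": GaussianNB(),
    "K-NN": KNeighborsClassifier(n_neighbors=5),
}

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    print(f"{name}: mean accuracy {scores.mean():.4f}")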


Results:

            C4.5      NB        K-NN
Accuracy    92.45%    77.46%    88.72%
AUC         0.851     0.840     0.5
