Data Analytics - Simulations and Applications

Data Analytics: Simulation and Applications
Menchita F. Dumlao, Ph.D.
drmenchi@gmail.com
Data Science Projects
• Determining Rice Bug Epidemic Using Decision Trees
• Prediction Model for Students’ Performance in Java Programming with Course-content Recommendation System
• Predicting IT Employability Using Data Mining Techniques
Data Science Projects:
Determining Rice Bug Epidemic Using Decision Trees
• Roland Calderon et al. (2016)

• applies data mining techniques in agriculture to predict future trends such as bug epidemics
• Insect Epidemiology Data Mining (IEDM)
• IEDM is linked to the Center for Discrete Mathematics and Theoretical Computer Science (DIMACS), which aims to provide an opportunity to develop and test problem instances and other methods for testing and comparing the performance of algorithms
• uses a decision tree

• classification and prediction
• represents rules
• CRISP-DM methodology
• Rice Field Insect Light Trap (RFILT): mass-traps both sexes of insect pests
• insect distribution, abundance, flight patterns, and timing of pesticide application
• The confusion matrix shows how well the classifier recognizes each class and whether the model is confusing two classes.
• A confusion matrix displays the number of correct and incorrect predictions made by the model compared with the actual classifications in the test data.
• The matrix is n-by-n, where n is the number of classes. The rows present the number of actual classifications in the test data. The columns present the number of predicted classifications made by the model.
• The forecasting precision of a predictive model is assessed with a confusion matrix.
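The row/column layout described above can be sketched in a few lines of Python. This is a minimal illustration, not the study's own code: the labels below are hypothetical and only the two status names (infested, outbreak) are taken from the slides.

```python
from collections import Counter

def confusion_matrix(actual, predicted, classes):
    """Return an n-by-n matrix: rows = actual class, columns = predicted class."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in classes] for a in classes]

def accuracy(matrix):
    """Share of all predictions that lie on the diagonal (correct predictions)."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

# Hypothetical labels for illustration only (not data from the study).
classes = ["infested", "outbreak"]
actual = ["outbreak", "outbreak", "infested", "infested", "outbreak", "infested"]
predicted = ["outbreak", "infested", "infested", "infested", "outbreak", "outbreak"]

matrix = confusion_matrix(actual, predicted, classes)
for label, row in zip(classes, matrix):
    print(label, row)
print("accuracy:", round(accuracy(matrix), 3))
```

Off-diagonal cells show which pair of classes the model confuses; here one actual outbreak is predicted as infested and one actual infested case is predicted as outbreak.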



• Lunar Cycle level is the best predictor of epidemic status,
• followed by Vegetative level
• At the Vegetative stage level, 100% resulted in outbreak status
• For the Ripening stage, the next best predictor is temperature.
• Over 82% of bugs occurred in the outbreak status when the temperature was less than or equal to 32 to 38
• 97.3% when the temperature was greater than 32
• For the Reproduction and Resting stage, 52.7% of bugs occurred in the infested status, and this is also considered a terminal node.
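How a decision tree ranks predictors can be sketched with information gain, one common splitting criterion. The slides do not state which criterion the study used, so this is an assumption, and the rows, attribute names, and attribute values below are hypothetical.

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(rows, attribute, target):
    """Entropy of the target minus the weighted entropy after splitting on attribute."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[attribute]].append(row[target])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([row[target] for row in rows]) - remainder

def best_predictor(rows, attributes, target):
    """Attribute with the highest information gain; it becomes the root split."""
    return max(attributes, key=lambda a: information_gain(rows, a, target))

# Hypothetical toy rows (not data from the study).
rows = [
    {"lunar_cycle": "full", "stage": "vegetative", "status": "outbreak"},
    {"lunar_cycle": "full", "stage": "ripening", "status": "outbreak"},
    {"lunar_cycle": "new", "stage": "vegetative", "status": "infested"},
    {"lunar_cycle": "new", "stage": "ripening", "status": "infested"},
]

print(best_predictor(rows, ["lunar_cycle", "stage"], "status"))
```

A branch whose cases all share one status has zero entropy and becomes a terminal node; otherwise the same ranking is repeated on the cases in that branch to find the next best predictor.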
Prediction Model for Students’ Performance in Java Programming with Course-content Recommendation System
• Evale, Digna, et al. (2016)

• Comparative analysis of different data mining algorithms for attribute selection and classification
• A two-phase study that aimed to predict students’ performance in Java Programming and to generate recommendations

• Knowledge Discovery in Databases (KDD)

• Logistic Regression and Correlation-based Feature Selection were used to find significant predictors
• Classifiers such as CHAID, Exhaustive CHAID, CRT, QUEST, J48, BayesNet, NaïveBayes, and JRip were implemented

• J48 had the highest prediction accuracy.

• For the second phase, evolutionary prototyping was implemented
• Ruby on Rails: a web-based examination module that determines the students’ index of learning style and assesses their prior knowledge in Java

• The system automatically generates a course-content recommendation that presents the learner’s strengths and weaknesses in the subject, with a suggested learning method based on learning style.

• KDD: selection, pre-processing, transformation, mining, and interpretation.
• Selection: possible attributes are collected for the data set.
• Pre-processing: filtering and removing irrelevant data.
• Transformation: determining the most suitable data mining technique to provide the best prediction algorithm.
• Mining: discovering the patterns captured through classification rules, regression models, or decision trees.
• Evaluation or interpretation: visualizing the patterns extracted from the models.

• Tools: Waikato Environment for Knowledge Analysis (WEKA) data mining tool and IBM Statistical Package for the Social Sciences (SPSS).
• There were 8 attributes: gender, age, course, section, schedule, and 3 academic performance attributes in programming.

• Attribute selection was done using Standard Regression Analysis, Forward and Backward Conditional Regression, Likelihood Ratio, and Wald
• WEKA was also used to conduct pre-processing through filtering by AttributeSelection

• Summary of Attribute Selection Result

• 2 significant attributes out of the eight original attributes
• With a critical p-value of .05 (significant predictors should have a p-value smaller than the critical value),
• Binary Logistic Regression (SPSS)
– section and course were highly insignificant, with p-values of .747 and .221, respectively.
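The significance rule above amounts to comparing each p-value with the critical value. A minimal sketch, using only the two p-values quoted on this slide (the function name is illustrative):

```python
ALPHA = 0.05  # critical p-value stated on the slide

# p-values reported on the slide for the two attributes found insignificant.
p_values = {"section": 0.747, "course": 0.221}

def is_significant(p_value, alpha=ALPHA):
    """A predictor counts as significant when its p-value is below alpha."""
    return p_value < alpha

insignificant = [name for name, p in p_values.items() if not is_significant(p)]
print(insignificant)
```

Both values are well above .05, which is why section and course are candidates for removal.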

• Pre-processing using attribute selection (SPSS and WEKA)
• course and section were automatically removed (highly insignificant)

• CfsSubsetEval (correlation-based feature subset evaluation): used to further verify the significance of the attribute gender
• BestFirst method: gender was found significant, with a merit of best subset of 0.239 (scale of 0 to 1; incorrectly classified instances)
• 76.1% correctly classified instances

• GreedyStepwise search method (through cross-validation)
• course and section were not found in any of the ten folds, while gender appeared in 7 out of 10 folds (70%).
• Significant predictors: age, gender, schedule, grade in Programming 1, grade in Programming 2, and grade in Programming 3.

Summary of Accuracy of Different Algorithms tested


• J48 is the best algorithm

• J48 has the highest accuracy in making predictions
• It also has the highest Cohen’s Kappa value, which means that the prediction is strongly reliable, with 64% to 81% reliability
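Cohen’s Kappa compares the observed agreement between predictions and actual classes with the agreement expected by chance: kappa = (p_o - p_e) / (1 - p_e). A minimal sketch computed from a confusion matrix; the counts are hypothetical, not the study’s results.

```python
def cohens_kappa(matrix):
    """Cohen's kappa from a square confusion matrix (rows = actual, columns = predicted)."""
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    # Observed agreement: share of cases on the diagonal.
    observed = sum(matrix[i][i] for i in range(n)) / total
    # Chance agreement: product of matching row and column marginals.
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(matrix[r][c] for r in range(n)) for c in range(n)]
    expected = sum(row_totals[i] * col_totals[i] for i in range(n)) / total ** 2
    return (observed - expected) / (1 - expected)

# Hypothetical counts for illustration only.
matrix = [[40, 10],
          [5, 45]]
print(round(cohens_kappa(matrix), 3))
```

Unlike raw accuracy, kappa is 0 when the classifier does no better than chance and 1 when it agrees perfectly with the actual classes.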
Predicting IT Employability Using Data Mining Techniques
• Piad, Keno, et al. (2016)

• Knowledge Discovery in Databases (KDD)
• CRISP-DM (Cross-Industry Standard Process for Data Mining)
• Naive Bayes
• Decision Tree
• Ensemble
• pre-processing
• data sets: training and testing data sets
• training data sets: used to generate the model
• testing data sets: used to determine the acceptability of the model
• Apriori algorithm: determines associated attributes that frequently occur in the data sets
• decision tree and Naive Bayes algorithms: used to design the predictive model
• predictive model = equation or rule sets for prediction
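The Apriori step listed above can be sketched as a level-wise search for attribute values that frequently occur together. This is a simplified illustration, not the study’s implementation: the records, the attribute=value names, and the 0.5 support threshold are all hypothetical.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support, max_size=2):
    """Level-wise Apriori search; returns {itemset: support} up to max_size items."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)
    # Level 1 candidates: every single item seen in the data.
    candidates = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    size = 1
    while candidates and size <= max_size:
        # Support = fraction of transactions that contain the candidate.
        support = {c: sum(c <= t for t in transactions) / n for c in candidates}
        level = {c: s for c, s in support.items() if s >= min_support}
        frequent.update(level)
        # Join step: merge frequent itemsets into candidates one item larger.
        size += 1
        joined = {a | b for a, b in combinations(level, 2) if len(a | b) == size}
        # Prune step: every subset of the previous size must itself be frequent.
        candidates = {c for c in joined
                      if all(frozenset(s) in level for s in combinations(c, size - 1))}
    return frequent

# Hypothetical graduate records encoded as attribute=value items.
records = [
    {"cgpa=high", "employed=yes", "gender=F"},
    {"cgpa=high", "employed=yes", "gender=M"},
    {"cgpa=low", "employed=no", "gender=F"},
    {"cgpa=high", "employed=yes", "gender=F"},
]

result = frequent_itemsets(records, min_support=0.5)
for itemset, support in sorted(result.items(), key=lambda kv: -kv[1]):
    print(sorted(itemset), support)
```

The pruning step is what makes Apriori practical: an itemset can only be frequent if all of its smaller subsets are frequent, so most candidates are discarded without counting.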

• Data: graduate tracer, student’s biographic profile, cumulative grade point average (CGPA)
• Tools: WEKA and SPSS
• 685 instances (tuples), SY 2011-2015
• Training and testing sets of data
• Rule set or equation; learning instances of the testing sets

Accuracy Result in Predicting IT Graduate Employability
Algorithm             Accuracy Result (%)   Error Estimation Rate (%)
Naive Bayes           75.33                 24.47
J48                   74.95                 25.05
SimpleCart            73.01                 26.99
Logistic regression   78.4                  22.60
Chaid                 76.3                  23.70
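Accuracy and error estimation rate should sum to 100 for each algorithm, so the table can be checked mechanically. In the sketch below the figures are copied from the table as given. The Naive Bayes and Logistic regression rows do not sum to 100, which suggests a transcription slip in one value of each row, and the check flags them rather than guessing a correction.

```python
# (accuracy %, error estimation rate %) as given in the table.
results = {
    "Naive Bayes": (75.33, 24.47),
    "J48": (74.95, 25.05),
    "SimpleCart": (73.01, 26.99),
    "Logistic regression": (78.4, 22.60),
    "Chaid": (76.3, 23.70),
}

def inconsistent_rows(table, tolerance=0.01):
    """Names of rows where accuracy + error rate differs from 100."""
    return [name for name, (acc, err) in table.items()
            if abs(acc + err - 100.0) > tolerance]

# Algorithm with the highest reported accuracy.
best = max(results, key=lambda name: results[name][0])

print("highest accuracy:", best)
print("rows to re-check:", inconsistent_rows(results))
```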

• Logistic regression measures the relationship between the categorical dependent variable and one or more independent variables by estimating probabilities using a logistic function

Accuracy Result in Predicting IT Specific Profession
Algorithm          Accuracy Result (%)   Error Estimation Rate (%)
Chaid              70.1                  29.9
Quest              40                    60
CRT                70.2                  29.8
Exhaustive Chaid   70.1                  29.9
ID3                67                    33
J48                70                    30
• Classification and Regression Trees (CRT)

– CRT splits the data into segments that are as homogeneous as possible with respect to the dependent variable.
– A node in which all cases have the same value for the dependent variable is a homogeneous, "pure" node.
• The CRT growing method attempts to maximize within-node homogeneity.
• The extent to which a node does not represent a homogeneous subset of cases is an indication of impurity.
• A terminal node in which all cases have the same value for the dependent variable is a homogeneous node that requires no further splitting because it is “pure.”
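Node purity can be made concrete with the Gini index, a common impurity measure for CRT-style trees. The slides do not say which impurity measure the study used, so Gini is an assumption here, and the node contents are hypothetical.

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity of one node: 0.0 for a pure node, larger when classes are mixed."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())

def weighted_impurity(groups):
    """Impurity of a split: child impurities weighted by child size."""
    total = sum(len(g) for g in groups)
    return sum(len(g) / total * gini_impurity(g) for g in groups)

# Hypothetical nodes for illustration only.
pure_node = ["Related"] * 4
mixed_node = ["Related", "Not Related", "Not Related", "Not Related"]

print(gini_impurity(pure_node))
print(gini_impurity(mixed_node))
print(weighted_impurity([pure_node, mixed_node]))
```

Growing the tree means choosing, at each node, the split with the lowest weighted impurity; a node with impurity 0.0 is terminal because no split can improve it.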

Results of Testing the Accuracy of Logistic Regression in Predicting Employability
Observed (Target)    Predicted: Not Related   Predicted: Related   Percentage Correct
Related              22                       48                   68.5
Not Related          72                       28                   72
Average Percentage                                                 70.5

Classification Table of Logistic Regression in Testing Data (N=170)
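The percentages in the classification table follow from the counts. The sketch below recomputes them from the four cell counts given in the table; the variable and function names are illustrative.

```python
# Counts from the classification table: outer key = observed, inner key = predicted.
table = {
    "Related": {"Not Related": 22, "Related": 48},
    "Not Related": {"Not Related": 72, "Related": 28},
}

def percent_correct(observed, row):
    """Share of an observed class that was predicted as that same class."""
    return 100.0 * row[observed] / sum(row.values())

per_class = {label: percent_correct(label, row) for label, row in table.items()}
total = sum(sum(row.values()) for row in table.values())
overall = 100.0 * sum(table[label][label] for label in table) / total

print("N =", total)
for label, pct in per_class.items():
    print(label, round(pct, 2))
print("overall", round(overall, 2))
```

The counts sum to the stated N of 170. The per-class values come out at about 68.57 and 72.0, and the overall accuracy at about 70.59, which matches the reported 68.5, 72, and 70.5 if those figures were truncated rather than rounded.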

Results of Testing the Accuracy of CRT in Predicting Specific IT Field/Job to be Employed
IT Classifications               IT Specific Career   Correct Classification (%)   Error Rate (%)
1 (IT Software)                  34                   23 (67.64)                   11 (32.35)
2 (IT Network/Sys/DB Admin)      25                   16 (64.00)                   9 (36.00)
3 (other IT related field)       11                   5 (45.45)                    6 (54.54)

Classification Table of CRT in Testing Data (N=70)
