
Data Analytics: Simulation and Applications
Menchita F. Dumlao, Ph.D.
drmenchi@gmail.com
Data Science Projects
• Determining Rice Bug Epidemic Using Decision
Trees
• Prediction Model for Students’ Performance in
Java Programming with Course-content
Recommendation System
• Predicting IT Employability Using Data Mining
Techniques
Data Science Projects:

Determining Rice Bug Epidemic Using Decision Trees

• Roland Calderon et al. (2016)
• Applies data mining techniques in agriculture to predict future trends such as bug epidemics.
• Insect Epidemiology Data Mining (IEDM).
• IEDM - Discrete Mathematics and Theoretical Computer Science (DIMACS), which aims to provide an opportunity to develop and test problem instances and other methods of testing and comparing the performance of algorithms.
• Uses a decision tree
• Classification and prediction
• Represents rules
• CRISP-DM methodology
• Rice Field Insect Light Trap (RFILT): mass-traps both sexes of insect pests
• Used to track insect distribution, abundance, and flight patterns, and to time the application of pesticide
• The confusion matrix shows how well the classifier recognizes each class and whether the model is confusing two classes.
• A confusion matrix displays the number of correct and incorrect predictions made by the model compared with the actual classifications in the test data.
• The matrix is n-by-n, where n is the number of classes. The rows present the number of actual classifications in the test data. The columns present the number of predicted classifications made by the model.
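As a minimal sketch of the layout described above, the following Python snippet builds an n-by-n confusion matrix with actual classes on the rows and predicted classes on the columns. The function name, the class labels, and the small sample of predictions are hypothetical and chosen only for illustration; they are not data from the study.

```python
def confusion_matrix(actual, predicted, labels):
    """Return an n-by-n matrix: rows = actual class, columns = predicted class."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix


# Hypothetical labels and predictions (illustrative only, not study data).
labels = ["outbreak", "infested"]
actual = ["outbreak", "outbreak", "infested", "infested", "outbreak"]
predicted = ["outbreak", "infested", "infested", "infested", "outbreak"]

cm = confusion_matrix(actual, predicted, labels)

# The diagonal holds the correct predictions; everything else is a confusion.
correct = sum(cm[i][i] for i in range(len(labels)))
accuracy = correct / len(actual)
```

Off-diagonal cells show which pairs of classes the model confuses, which is why the matrix is more informative than accuracy alone.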

• The forecasting precision of a predictive model is assessed with a confusion matrix
• Lunar Cycle level is the best predictor of epidemic status
• Followed by Vegetative level
• At the Vegetative stage level, 100% resulted in outbreak status
• For the Ripening stage, the next best predictor is temperature.
• Over 82% of bugs occurred in the outbreak status when the temperature was less than or equal to 32 to 38.
• 97.3% when the temperature was greater than 32.
• For the Reproduction and Resting stage, 52.7% of bugs occurred in the infested status, and this is also considered a terminal node.
Data Science Projects:

Prediction Model for Students’ Performance in Java Programming with Course-content Recommendation System

• Evale, Digna, et al. (2016)
• Comparative analysis of different data mining algorithms for attribute selection and classification
• A two-phase study that aimed to predict students’ performance in Java Programming and to generate recommendations
• Knowledge Discovery in Databases (KDD)
• Logistic Regression and Correlation-based Feature Selection were used to find significant predictors
• Classifiers such as CHAID, Exhaustive CHAID, CRT, QUEST, J48, BayesNet, NaïveBayes and JRip were implemented
• J48 has the highest prediction accuracy.
• For the second phase, evolutionary prototyping was implemented
• Ruby on Rails: a web-based examination module that determines the students’ index of learning style and assesses their prior knowledge in Java
• A course-content recommendation presenting the learners’ strengths and weaknesses in the subject, with a suggested method of learning style, will be automatically generated by the system.
• KDD: selection, pre-processing, transformation, mining, and interpretation.
• Selection: possible attributes are collected for the data set.
• Pre-processing: filtering and removing irrelevant data.
• Transformation: determining the most suitable data mining technique to provide the best prediction algorithm.
• Mining: discovering the patterns captured through classification rules, regression models, or decision trees.
• Evaluation or interpretation: visualizing the patterns extracted from the models.
• Waikato Environment for Knowledge Analysis (WEKA) data mining tool and IBM Statistical Package for the Social Sciences (SPSS).
• There were 8 attributes: gender, age, course, section, schedule, and 3 academic performance attributes (grades in Programming 1, 2, and 3).
• Attribute selection was done using Standard Regression Analysis, Forward and Backward Conditional Regression, Likelihood Ratio, and Wald
• WEKA was also used to conduct pre-processing through filtering with AttributeSelection
• Summary of Attribute Selection Result
• 2 insignificant attributes out of the eight original attributes
• With a critical p value of .05 (significant predictors should have a p value smaller than the critical value),
• Binary Logistic Regression (SPSS)
– section and course were highly insignificant, with p values of .747 and .221 respectively.
• Pre-processing using attribute selection (SPSS and WEKA)
• course and section were automatically removed (highly insignificant)
• CfsSubsetEval: used to further verify the significance of the attribute gender
• BestFirst method: gender was found significant, with a merit of best subset of 0.239 (range 0 to 1; incorrectly classified instances)
• 76.1% of instances correctly classified
• GreedyStepwise search method (through cross-validation)
• course and section were not selected in any of the ten folds, while gender appeared in 7 out of 10 folds (70%).
• Significant predictors: age, gender, schedule, grade in Programming 1, grade in Programming 2, and grade in Programming 3.
Summary of Accuracy of Different Algorithms tested


• J48 is the best algorithm
• J48 has the highest accuracy in making predictions
• It also has the highest Cohen’s Kappa value, which means that the prediction is strongly reliable, with 64% to 81% reliability
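To make the Kappa statistic concrete, here is a small sketch of how Cohen's Kappa can be computed from a confusion matrix: it compares the observed agreement with the agreement expected by chance from the row and column totals. The function name and the 2-by-2 example matrix are hypothetical, not results from the study.

```python
def cohens_kappa(matrix):
    """Cohen's Kappa for a square confusion matrix (rows = actual, columns = predicted)."""
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    # Observed agreement: share of instances on the diagonal.
    observed = sum(matrix[i][i] for i in range(n)) / total
    # Chance agreement: product of the row and column proportions, summed per class.
    expected = sum(
        (sum(matrix[i]) / total) * (sum(row[i] for row in matrix) / total)
        for i in range(n)
    )
    return (observed - expected) / (1 - expected)


# Hypothetical confusion matrix (illustrative only, not study data).
example = [[40, 10],
           [5, 45]]
kappa = cohens_kappa(example)
```

Here the observed agreement is 0.85 and the chance agreement is 0.50, so Kappa is 0.70. A value of 1 means perfect agreement and 0 means agreement no better than chance.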
Data Science Projects:

Predicting IT Employability Using Data Mining Techniques

• Piad, Keno, et al. (2016)
• Knowledge Discovery in Databases (KDD)
• CRISP-DM (Cross-Industry Standard Process for Data Mining)
• Naive Bayes
• Decision Tree
• Ensemble
• Pre-processing
• Data sets: training and testing data sets
• Training data sets: used to generate the model
• Testing data sets: used to determine the acceptability of the model.
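The split into training and testing data can be sketched as follows. This is only an illustration: the function name, the 25% test fraction, the fixed seed, and the record IDs are assumptions, since the slides do not state how the study partitioned its data.

```python
import random


def train_test_split(rows, test_fraction=0.25, seed=42):
    """Shuffle a copy of the rows and split it into (training, testing) lists."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical record IDs standing in for data set rows (not study data).
records = list(range(100))
train, test = train_test_split(records)
```

The model is fitted on `train` only; `test` is held back so that the reported accuracy reflects records the model has not seen.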
• Apriori algorithm: determines associated attributes that frequently occur together in the data sets
• Decision tree and naive Bayes algorithms: used to design the predictive model
• Predictive model = equation or rule sets for prediction
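The idea behind Apriori can be sketched in a few lines: count how often each attribute combination occurs, keep those that meet a minimum support, and only extend combinations whose smaller subsets were themselves frequent. The version below is a simplified illustration, not the WEKA implementation; the attribute names, transactions, and support threshold are hypothetical.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Level-wise search returning {itemset: support} for itemsets meeting min_support."""
    n = len(transactions)
    items = sorted({item for t in transactions for item in t})
    current = [frozenset([item]) for item in items]
    result = {}
    k = 1
    while current:
        # Support = share of transactions that contain the candidate itemset.
        support = {c: sum(1 for t in transactions if c <= t) / n for c in current}
        frequent = {c: s for c, s in support.items() if s >= min_support}
        result.update(frequent)
        # Join frequent k-itemsets into (k+1)-candidates, then prune candidates
        # that have an infrequent k-subset (the Apriori property).
        k += 1
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
        current = [
            c for c in candidates
            if all(frozenset(s) in frequent for s in combinations(c, k - 1))
        ]
    return result


# Hypothetical attribute sets per graduate (illustrative only, not study data).
transactions = [
    {"high_cgpa", "employed"},
    {"high_cgpa", "employed", "internship"},
    {"internship", "employed"},
    {"high_cgpa"},
]
itemsets = frequent_itemsets(transactions, min_support=0.5)
```

With this toy data the pairs {high_cgpa, employed} and {employed, internship} each reach 50% support, while {high_cgpa, internship} does not, so the three-item combination is pruned without being counted.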
• Data sources: graduate tracer, student’s biographic profile, cumulative grade point average (CGPA)
• 685 instances (tuples), SY 2011-2015, divided into training and testing sets of data
• Tools: WEKA and SPSS
• Rule set or equation; learning instances of the testing sets
Accuracy Result in Predicting IT Graduate Employability

Algorithm              Accuracy Result (%)    Error Estimation Rate (%)
Naive Bayes            75.33                  24.47
J48                    74.95                  25.05
SimpleCart             73.01                  26.99
Logistic regression    78.4                   22.60
CHAID                  76.3                   23.70

• Logistic regression measures the relationship between the categorical dependent variable and the independent variables by estimating probabilities using a logistic function
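A short sketch of the logistic function may help here: it maps a weighted sum of predictor values to a probability between 0 and 1, which is then thresholded to give a class. The weights, bias, feature values, and the 0.5 cut-off below are hypothetical and are not the coefficients fitted in the study.

```python
import math


def logistic(z):
    """Logistic (sigmoid) function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_probability(features, weights, bias):
    """Probability of the positive class for one record."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return logistic(z)


# Hypothetical coefficients and one hypothetical record (not study values).
p = predict_probability([3.2, 1.0], weights=[0.8, -0.5], bias=-2.0)
label = "Related" if p >= 0.5 else "Not Related"
```

Because the output is a probability, the categorical prediction depends on the chosen cut-off; 0.5 is a common default, assumed here.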
Accuracy Result in Predicting IT Specific Profession

Algorithm           Accuracy Result (%)    Error Estimation Rate (%)
CHAID               70.1                   29.9
QUEST               40                     60
CRT                 70.2                   29.8
Exhaustive CHAID    70.1                   29.9
ID3                 67                     33
J48                 70                     30
• Classification and Regression Trees (CRT).
– CRT splits the data into segments that are as homogeneous as possible with respect to the dependent variable.
– A terminal node in which all cases have the same value for the dependent variable is a homogeneous, "pure" node.
• The CRT growing method attempts to maximize within-node homogeneity.
• The extent to which a node does not represent a homogeneous subset of cases is an indication of impurity.
• A terminal node in which all cases have the same value for the dependent variable is a homogeneous node that requires no further splitting because it is “pure.”
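Node impurity can be quantified in several ways. The sketch below uses the Gini index, one common choice for CART-style trees; the slides do not say which impurity measure the study used, so treat this as an assumption. The labels are hypothetical.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity of a node: 0.0 when every case has the same class."""
    n = len(labels)
    counts = Counter(labels)
    return 1.0 - sum((count / n) ** 2 for count in counts.values())


# Hypothetical node contents (illustrative only, not study data).
pure_node = gini_impurity(["software", "software", "software", "software"])
mixed_node = gini_impurity(["software", "software", "network", "network"])
```

The pure node scores 0.0 and needs no further splitting, while the evenly mixed two-class node scores 0.5, the maximum for two classes. A tree grower picks the split that reduces this impurity the most.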
Results of Testing the Accuracy of Logistic Regression in Predicting Employability

                           Predicted
Observed (Target)     Not Related    Related    Percentage Correct
Related               22             48         68.5
Not Related           72             28         72
Average Percentage                              70.5

Classification Table of Logistic Regression in Testing Data (N=170)
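The percentages in this classification table can be recomputed from the counts. The sketch below assumes that rows are observed classes and columns are predicted classes, as the headers suggest; the function and variable names are illustrative.

```python
# Counts as printed in the classification table (rows = observed, columns = predicted).
table = {
    "Related": {"Not Related": 22, "Related": 48},
    "Not Related": {"Not Related": 72, "Related": 28},
}


def percent_correct(table):
    """Return per-class percent correct, overall percent correct, and total N."""
    per_class = {}
    correct_total = 0
    grand_total = 0
    for observed, row in table.items():
        row_total = sum(row.values())
        per_class[observed] = 100.0 * row[observed] / row_total
        correct_total += row[observed]
        grand_total += row_total
    return per_class, 100.0 * correct_total / grand_total, grand_total


per_class, overall, n = percent_correct(table)
```

This gives about 68.57% for Related, 72% for Not Related, and about 70.59% overall on 170 cases, which is consistent with the 68.5, 72, and 70.5 reported in the table if those values were truncated to one decimal.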
Results of Testing the Accuracy of CRT in Predicting Specific IT Field/Job to be Employed

IT Classification               IT Specific Career    Correct Classification (%)    Error Rate (%)
1 (IT Software)                 34                    23 (67.64)                    11 (32.35)
2 (IT Network/Sys/DB Admin)     25                    16 (64.00)                    9 (36.00)
3 (other IT-related field)      11                    5 (45.45)                     6 (54.54)

Classification Table of CRT in Testing Data (N=70)
