By:
Fadhlurrohman Henriwan
4SK1 (09)
NIM : 16.9112
Description
The CIA Factbook has geographic, demographic, and economic data on a country-by-country
basis. In the description of the variables, the 4-digit number indicates the code used to specify that
variable on the data and documentation web site. For instance,
https://www.cia.gov/library/publications/the-world-factbook/fields/2153.html contains
documentation for variable code 2153, network users.
Data
The researchers focus on the following variables:
Fert (dependent variable): Children born/woman (#/person), 2127
Pop: Number of people, 2119
Birth: Birth rate (#/1000), 2054
Death: Death rate (#/1000), 2066
Infant: Infant deaths per 1000 live births, 2091
Life: Life expectancy (years), 2102
Labor: Labor force (people), 2095
Tax: Taxes and other revenues (% of GDP), 2221
Imports: Imports ($), 2087
Gold: Reserves of foreign exchange and gold ($), 2188
Mainlines: Telephones - main lines in use, 2150
library(class)
library(nnet)
library(caret)
library(dplyr)
library(ggplot2)
library(ggpubr)
library(psycho)
library(tidyverse)
library(MASS)
library(plotrix)
library(rcompanion)
library(pROC)
1. Linear Discriminant Analysis
summary(country$fert)
head(trainingData)
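The construction of trainingData and testData is not shown in this output. A minimal sketch of how the grpfert factor and the 75/25 split could have been produced; the seed, the use of birth as the grouping variable, and all object names other than those printed above are assumptions:

```r
# Label each country by whether its birth rate is at or below the mean,
# then split the rows 75% / 25% into training and test sets.
set.seed(123)  # assumed; the original seed is not reported
country$grpfert <- factor(
  ifelse(country$birth <= mean(country$birth, na.rm = TRUE),
         "Under Mean Birth", "Over Mean Birth"),
  levels = c("Under Mean Birth", "Over Mean Birth")
)
train_idx <- sample(seq_len(nrow(country)), size = round(0.75 * nrow(country)))
trainingData <- country[train_idx, ]
testData <- country[-train_idx, ]
```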
## Call:
## lda(grpfert ~ pop + birth + death + infant + life + labor + tax +
## gold + imports + mainlines, data = trainingData)
##
## Prior probabilities of groups:
## Under Mean Birth Over Mean Birth
## 0.7135417 0.2864583
##
## Group means:
## pop birth death infant life labor
## Under Mean Birth 28599924 14.62102 7.814522 14.23520 75.33274 13279914
## Over Mean Birth 16454099 32.08745 8.767273 50.97491 62.11818 6233726
## tax gold imports mainlines
## Under Mean Birth 30.65330 75269965590 107841140590 6141750.9
## Over Mean Birth 25.39019 31601898229 12699399321 474140.3
##
## Coefficients of linear discriminants:
## LD1
## pop 1.701532e-08
## birth 2.515883e-01
## death 4.303350e-02
## infant -1.651347e-02
## life 1.638464e-02
## labor -4.636950e-08
## tax 3.055525e-03
## gold 6.476736e-13
## imports 1.393661e-12
## mainlines -1.414573e-08
plot(model_lda)
LDA determines group means and computes, for each observation, the probability of
belonging to each group; the observation is then assigned to the group with the highest
probability score. The prior probabilities of groups show the proportion of training observations
in each group. In this case, about 71% of the countries are in the Under Mean Birth group, while
the rest are in the Over Mean Birth group.
Group Means
The group means show the mean of each variable within each group. For instance, in the Under
Mean Birth group the mean number of infant deaths is 14 per 1,000 live births, while in the Over
Mean Birth group it is 51 per 1,000 live births.
The coefficients of linear discriminants give the linear combination of predictor variables that
forms the LDA decision rule. In this case, the linear combination obtained is
LD1 = 1.70e-08*pop + 0.2516*birth + 0.0430*death - 0.0165*infant + 0.0164*life
      - 4.64e-08*labor + 3.06e-03*tax + 6.48e-13*gold + 1.39e-12*imports - 1.41e-08*mainlines
#CONFUSION MATRIX
predict_lda <- predict(model_lda, testData[,-11])
predict_lda
The confusion matrix calculates a cross-tabulation of observed and predicted classes with
associated statistics. For the Under Mean Birth group, 43 out of 44 countries were classified
correctly, while in the Over Mean Birth group 12 out of 20 were classified correctly.
## [1] 0.859375
CM_lda <- table(testData$grpfert, predict_lda$class)
sensitivity(CM_lda)
## [1] 0.8431373
specificity(CM_lda)
## [1] 0.9230769
2. Logistic Regression
library(ISLR)
library(corrplot)
library(caret)
library(pROC)
#CM
tab2<-table(testData$grpfert,pr2_or)
tab2
## pr2_or
## Under Mean Birth Over Mean Birth
## Under Mean Birth 44 0
## Over Mean Birth 2 18
cm2=confusionMatrix(tab2)
cm2
## [1] 0.96875
Confusion Matrix of Logistic Regression
The confusion matrix calculates a cross-tabulation of observed and predicted classes with
associated statistics. For the Under Mean Birth group, all 44 countries were classified correctly,
while in the Over Mean Birth group 18 out of 20 were classified correctly.
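The logistic regression fit that produced pr2_or is not shown in the output above. A minimal sketch with glm; the 0.5 threshold and all object names other than pr2_or are assumptions:

```r
# Binomial logistic regression on the training set
model_lr <- glm(grpfert ~ ., data = trainingData, family = binomial)
# Predicted probabilities for the test set; with factor levels
# (Under Mean Birth, Over Mean Birth), the modeled probability is
# that of the second level, "Over Mean Birth"
prob2 <- predict(model_lr, newdata = testData[, -11], type = "response")
pr2_or <- factor(ifelse(prob2 > 0.5, "Over Mean Birth", "Under Mean Birth"),
                 levels = levels(trainingData$grpfert))
```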
Percent Correct for each Category
#ROC table
3. K-Nearest Neighbor
##
## Call:
## train.kknn(formula = grpfert ~ ., data = trainingData, kmax = 30)
##
## Type of response variable: nominal
## Minimal misclassification: 0.05208333
## Best kernel: optimal
## Best k: 5
summary(model_knn)
##
## Call:
## train.kknn(formula = grpfert ~ ., data = trainingData, kmax = 30)
##
## Type of response variable: nominal
## Minimal misclassification: 0.05208333
## Best kernel: optimal
## Best k: 5
model_knn$MISCLASS
## optimal
## 1 0.05729167
## 2 0.05729167
## 3 0.05729167
## 4 0.05729167
## 5 0.05208333
## 6 0.05208333
## 7 0.05208333
## 8 0.05208333
## 9 0.05208333
## 10 0.05729167
## 11 0.05729167
## 12 0.05729167
## 13 0.05729167
## 14 0.05729167
## 15 0.05729167
## 16 0.05729167
## 17 0.05729167
## 18 0.05729167
## 19 0.07291667
## 20 0.07291667
## 21 0.07812500
## 22 0.07812500
## 23 0.07812500
## 24 0.07812500
## 25 0.07812500
## 26 0.07812500
## 27 0.07812500
## 28 0.07812500
## 29 0.07812500
## 30 0.07291667
The k-nearest neighbor method was applied by splitting the data into two groups: a training set
containing 75% of the data and a test set containing the remaining 25%.
Output:
K   Misclassification
1   0.05729167
2   0.05729167
3   0.05729167
4   0.05729167
5   0.05208333
6   0.05208333
7   0.05208333
8   0.05208333
The maximum number of neighbors tested was k = 30, and the optimal number of nearest
neighbors obtained was 5.
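The fitting and prediction code for this section is not shown. Based on the Call printed above, it could be reconstructed roughly as follows; the prediction and table object names are assumptions:

```r
library(kknn)
# Tune the number of neighbors up to kmax = 30, as in the Call above
model_knn <- train.kknn(grpfert ~ ., data = trainingData, kmax = 30)
# Classify the test set with the selected model (best k = 5)
predictionknn <- predict(model_knn, testData[, -11])
CMknn <- table(testData$grpfert, predictionknn)
```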
## [1] 0.921875
sensitivity(CMknn)
## [1] 0.9148936
specificity(CMknn)
## [1] 0.9411765
The confusion matrix calculates a cross-tabulation of observed and predicted classes with
associated statistics. For the Under Mean Birth group, 43 out of 44 countries were classified
correctly, while in the Over Mean Birth group 16 out of 20 were classified correctly.
knn.roc <- roc(testData$grpfert, ordered(predictionknn), levels = c("Under Mean Birth", "Over Mean Birth"), direction = "<")
plot.roc(knn.roc, print.auc = T, main="KNN Regression ROC Curve")
4. Decision Tree
library(party)
##
## Conditional inference tree with 2 terminal nodes
##
## Response: grpfert
## Inputs: pop, birth, death, infant, life, labor, tax, gold, imports, mainlines
## Number of observations: 192
##
## 1) birth <= 21.85; criterion = 1, statistic = 135.95
## 2)* weights = 137
## 1) birth > 21.85
## 3)* weights = 55
The decision tree method was applied to the data using the same scheme as the k-nearest
neighbor method: the data were split into a training set containing 75% of the data and a test set
containing the remaining 25%.
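The tree itself is printed above, but the fitting call is not shown. With the party package loaded above, it would plausibly be:

```r
library(party)
# Conditional inference tree for grpfert on all predictors;
# the printed tree splits once on birth at 21.85
output.tree <- ctree(grpfert ~ ., data = trainingData)
plot(output.tree)  # visualize the two terminal nodes
```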
#PREDICTION
predictiondt <- predict(output.tree, testData[,-11])
CMtree <- table(testData$grpfert, predictiondt)
CMtree
## predictiondt
## Under Mean Birth Over Mean Birth
## Under Mean Birth 44 0
## Over Mean Birth 1 19
## [1] 0.984375
sensitivity(CMtree)
## [1] 0.9777778
specificity(CMtree)
## [1] 1
Confusion Matrix of Decision Tree
The confusion matrix calculates a cross-tabulation of observed and predicted classes with
associated statistics. For the Under Mean Birth group, all 44 countries were classified correctly,
while in the Over Mean Birth group 19 out of 20 were classified correctly.
DT.roc <- roc(testData$grpfert, ordered(predictiondt), levels = c("Under Mean Birth", "Over Mean Birth"), direction = "<")
plot.roc(DT.roc, print.auc = T, main="Decision Tree Regression ROC Curve")
5. Random Forest
library(randomForest)
## randomForest 4.6-14
##
## Call:
## randomForest(formula = grpfert ~ ., data = trainingData)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 1.04%
## Confusion matrix:
## Under Mean Birth Over Mean Birth class.error
## Under Mean Birth 136 1 0.00729927
## Over Mean Birth 1 54 0.01818182
Classification with the random forest method was conducted on the data. The data were split into
two groups: a training set containing 75% of the data and a test set containing the remaining 25%.
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 3
OOB estimate of error rate: 1.04%
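The fitting step that produced the Call above is not shown; a sketch with the default settings, where the seed is an assumption:

```r
library(randomForest)
set.seed(123)  # assumed; random forests are stochastic, so results vary by seed
# Defaults: 500 trees, mtry = floor(sqrt(10)) = 3 variables per split
output.forest <- randomForest(grpfert ~ ., data = trainingData)
print(output.forest)  # OOB error estimate and training confusion matrix
```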
Confusion Matrix of Training Set
# model validation
prediksi_rf <- predict(output.forest, testData[,-11])
#confusion matrix
CM_rf <- table(testData$grpfert, prediksi_rf)
CM_rf
## prediksi_rf
## Under Mean Birth Over Mean Birth
## Under Mean Birth 44 0
## Over Mean Birth 1 19
## [1] 0.984375
sensitivity(CM_rf)
## [1] 0.9777778
specificity(CM_rf)
## [1] 1
6. Naïve Bayes
library(naivebayes)
##
## Naive Bayes Classifier for Discrete Predictors
##
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
##
## A-priori probabilities:
## Y
## Under Mean Birth Over Mean Birth
## 0.7135417 0.2864583
##
## Conditional probabilities:
## pop
## Y [,1] [,2]
## Under Mean Birth 28599924 112638536
## Over Mean Birth 16454099 27478113
##
## birth
## Y [,1] [,2]
## Under Mean Birth 14.62102 4.376068
## Over Mean Birth 32.08745 6.445796
##
## death
## Y [,1] [,2]
## Under Mean Birth 7.814522 2.659023
## Over Mean Birth 8.767273 3.378159
##
## infant
## Y [,1] [,2]
## Under Mean Birth 14.23520 11.05347
## Over Mean Birth 50.97491 25.85683
##
## life
## Y [,1] [,2]
## Under Mean Birth 75.33274 5.619746
## Over Mean Birth 62.11818 7.861525
##
## labor
## Y [,1] [,2]
## Under Mean Birth 13279914 45815657
## Over Mean Birth 6233726 9543885
##
## tax
## Y [,1] [,2]
## Under Mean Birth 30.65330 11.84013
## Over Mean Birth 25.39019 10.97577
##
## gold
## Y [,1] [,2]
## Under Mean Birth 75269965590 127941891477
## Over Mean Birth 31601898229 42659520087
##
## imports
## Y [,1] [,2]
## Under Mean Birth 107841140590 249494276730
## Over Mean Birth 12699399321 23700059683
##
## mainlines
## Y [,1] [,2]
## Under Mean Birth 6141750.9 15276853
## Over Mean Birth 474140.3 1031074
The Naïve Bayes method uses probability as the basis of classification, as an implication of
Bayes' theorem. Its assumptions are that the attributes are mutually independent and carry equal
priority. As with the other methods, the data were split into two groups: a training set containing
75% of the data and a test set containing the remaining 25%.
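Although library(naivebayes) is loaded above, the printed Call, naiveBayes.default(x = X, y = Y, laplace = laplace), is the interface of e1071's naiveBayes, so the model was most likely fitted as follows (a sketch, assuming e1071):

```r
library(e1071)
# Gaussian naive Bayes: each numeric predictor is summarized per class
# by its mean and standard deviation (the two columns printed above)
model_nb <- naiveBayes(grpfert ~ ., data = trainingData)
```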
predict_nb <- predict(model_nb, testData[, -11])
CM_nb <- table(testData$grpfert, predict_nb)
CM_nb
## predict_nb
## Under Mean Birth Over Mean Birth
## Under Mean Birth 32 12
## Over Mean Birth 4 16
confusionMatrix(CM_nb)
7. Support Vector Machine
model_SVM
##
## Call:
## svm(formula = grpfert ~ ., data = trainingData, type = "C-classification",
## kernel = "linear")
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: linear
## cost: 1
##
## Number of Support Vectors: 30
The Support Vector Machine (SVM) method uses points in n-dimensional space and a separating
hyperplane to classify items. Besides linear classification, it can also perform non-linear
classification by means of the "kernel trick". First, a classifier is built as the hyperplane; here the
fert category was used as the factor variable.
Parameters:
SVM-Type: C-classification
SVM-Kernel: linear
cost: 1
Number of Support Vectors: 30
A linear kernel was used, yielding 30 support vectors. This is the number of points that lie close
to the decision boundary or on the wrong side of it.
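The fitting call matching the printed parameters would be the following sketch, assuming e1071, the standard provider of this svm interface:

```r
library(e1071)
# C-classification SVM with a linear kernel and the default cost = 1
model_SVM <- svm(grpfert ~ ., data = trainingData,
                 type = "C-classification", kernel = "linear")
```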
# Predicting the Test set results
predict_SVM = predict(model_SVM, newdata = testData[-11])
predict_SVM
## predict_SVM
## Under Mean Birth Over Mean Birth
## Under Mean Birth 44 0
## Over Mean Birth 2 18
## [1] 0.96875
sensitivity(cm_svm)
## [1] 0.9565217
specificity(cm_svm)
## [1] 1
#ROC table
8. Neural Networks
library(mlbench)
library(nnet)
nnet.mod=train(grpfert~.,data = trainingData,method="nnet")
predict_nn<-predict(nnet.mod,testData[,-11])
predict_nn
The neural network method works by interconnecting information-processing units that transform
inputs into outputs through activation functions. The network consists of an input layer, one or
more hidden layers, and an output layer.
CM_nn <- table(testData$grpfert, predict_nn)
CM_nn
## predict_nn
## Under Mean Birth Over Mean Birth
## Under Mean Birth 44 0
## Over Mean Birth 20 0
confusionMatrix(CM_nn)
Supervised analysis was conducted on the data using the various models described above.
Model properties such as sensitivity, specificity, and accuracy were also collected from the output.
Comparison of Models
Considering these properties, the best models for classifying the data are the Decision Tree and
the Random Forest. These models were chosen because their area under the curve (AUC) shows
an ability to distinguish the two diagnostic groups of 97.50%. Another consideration was their
consistency across the sensitivity, specificity, and accuracy properties.