
2.3 Random Forest

We performed random forest, which averages many trees instead of constructing a single decision tree. A random forest is an ensemble of fully grown decision trees, each built from a random sample of the training data and using a random subset of predictors at each split; the predictions of the different trees are then averaged. The advantage of random forest is that it effectively reduces the high variance that often occurs in a single fully grown decision tree. Hence, the classifications produced by a random forest are more stable and reliable than those of a single fully grown tree.
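As a sketch of the idea, the following compares a single fully grown tree with a 100-tree forest. This assumes scikit-learn and uses synthetic data, not the project's spam data set:

```python
# Hypothetical illustration: single fully grown tree vs. a bagged,
# feature-subsampled ensemble of 100 trees (random forest).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# max_depth=None leaves the tree fully grown (high variance).
tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
# The forest averages 100 such trees, each fit on a bootstrap sample.
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

print("tree accuracy:  ", tree.score(X_te, y_te))
print("forest accuracy:", forest.score(X_te, y_te))
```

On most random splits the averaged ensemble scores at least as well as the single tree, reflecting the variance reduction described above.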

Support Vector Machine (SVM)

Given labelled training data, the support vectors are the coordinates of the individual observations closest to the decision boundary. SVM is a classifier that outputs an optimal line/hyperplane separating the two classes. For this project we set the SVM kernel parameter to linear, so the prediction for a new input x is calculated from the dot products between x and each support vector xi as f(x) = β0 + ∑ αi⟨x, xi⟩. The coefficients β0 and αi are estimated by the algorithm from the training data.
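The formula can be checked numerically against scikit-learn's own decision function. This is a sketch on synthetic data; beta0 corresponds to the intercept β0 and alphas to the signed dual coefficients αi attached to each support vector xi:

```python
# Reproduce f(x) = β0 + Σ αi⟨x, xi⟩ for a linear-kernel SVM and compare
# it with the library's decision_function (synthetic data, assumed setup).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
clf = SVC(kernel="linear").fit(X, y)

x_new = X[:3]                      # a few "new" inputs
beta0 = clf.intercept_             # β0
alphas = clf.dual_coef_.ravel()    # signed αi for each support vector
# ⟨x, xi⟩ for every support vector, then the weighted sum plus β0.
manual = beta0 + (x_new @ clf.support_vectors_.T) @ alphas

print(np.allclose(manual, clf.decision_function(x_new)))  # → True
```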

2.4 Discussion

Based on the table, random forest shows the highest accuracy among the models. Although it performs best, the complexity of the spam data set means the constructed decision trees may be overly complex. It is also very time-consuming: the higher the complexity and the larger the data set, the longer it takes to construct the trees, as the table shows with random forest having the highest elapsed time, 165.3 seconds. The complexity of the data set can also lead to a less generalized model, i.e. overfitting. This makes the model harder to modify if it is to be enhanced in the future, as the trees must be rebuilt from scratch.
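The cost of rebuilding many trees can be illustrated with a rough timing sketch. The sizes here are hypothetical, not the project's 165.3-second run:

```python
# Rough timing sketch: fitting a 100-tree forest costs roughly
# n_estimators times a single tree, so larger/more complex data
# inflates the elapsed time quickly.
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=30, random_state=0)

t0 = time.perf_counter()
DecisionTreeClassifier(random_state=0).fit(X, y)
t_tree = time.perf_counter() - t0

t0 = time.perf_counter()
RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
t_forest = time.perf_counter() - t0

print(f"single tree: {t_tree:.3f}s, forest: {t_forest:.3f}s")
```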

Next, the Support Vector Machine (SVM) also performs well in terms of accuracy, even though the large dimensionality of the data set makes the distance calculations less simple. Besides, SVM does not directly provide probability estimates: it only determines which side of the hyperplane an observation falls on, so the class assignment cannot be transformed into probabilities directly. Instead, the probabilities are obtained via 5-fold cross-validation on the coefficient β that minimizes the loss with the greatest possible margin. Cross-validation gets the most "mileage" out of the training data, and the performance metrics are averaged across the k folds. In some cases the five folds may be similar, and the decision boundary separating the two classes may not have a wide margin.
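In scikit-learn this behaviour is visible directly: an SVC only exposes probabilities when probability=True, which fits a sigmoid on held-out decision values via internal cross-validation (Platt scaling). A minimal sketch on synthetic data, assuming scikit-learn:

```python
# SVC does not output probabilities from the hyperplane alone;
# probability=True calibrates them with internal cross-validation.
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
clf = SVC(kernel="linear", probability=True, random_state=0).fit(X, y)

proba = clf.predict_proba(X[:5])  # one (P(class 0), P(class 1)) row per input
print(proba.sum(axis=1))          # each row sums to 1
```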
