
FIT5149 APPLIED DATA ANALYSIS

SEMESTER 2, 2018
DATA ANALYSIS CHALLENGE:
CLASSIFYING NEWS ARTICLES

Nikhil Taralkar 28980433
Kunal Sharma 29009626
Varun Mathur 28954114
Contents
1. Introduction
2. Libraries Used
3. Pre-processing Steps
4. Feature Selection
5. Model/Algorithm for classifiers
5.1 Naïve Bayes
5.2 Auto-Encoder
5.3 SVM (Support Vector Machines)
6. Conclusion
7. References
1. Introduction
This project asks us to classify a set of articles (text documents). A set of training_docs, testing_docs and training_labels is provided, from which a classifier must be built that predicts the class label for each testing article. An appropriate classifier needs to be built to satisfy these requirements.

2. Libraries Used
1. e1071 – Contains the SVM implementation. This package must be loaded for the svm() function to run.

2. caTools – Used for statistical functions. It consists of several basic utility functions, including moving-window statistics, fast calculation of AUC, a LogitBoost classifier, and a base64 encoder/decoder.

3. caret – Used for creating predictive models. It contains tools for data splitting, pre-processing, feature selection, etc.

4. tm – Used for text-mining applications within R. This library is used to modify the documents, e.g. stemming and stop-word removal. It helps in transforming the data via the tm_map() function, which applies a function (a map) to all elements of the corpus.

5. h2o – Offers parallelised implementations of different supervised and unsupervised machine-learning algorithms, such as Random Forest, Naïve Bayes, K-means, and deep neural networks.

3. Pre-processing Steps
The following pre-processing steps were applied:
• Case Normalization: While counting frequencies, tokens are folded to lower case so that a word spelled with capitals and the same word in lower case are treated as one word. This helps in better analysis of the documents.

• Tokenization: Tokenization was used to segment the text into smaller units such as numbers, punctuation, words, alpha-numerics, etc. In the process of tokenization, some characters, such as punctuation marks, are discarded.

• Removal of stop words: Stop words are words that occur very frequently but carry very little information. Removing stop words allowed a better analysis of our text.

• Removal of the most/least frequent words: Words that appeared in more than 95% of the documents, or in fewer than 5% of them, were removed.
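The steps above can be sketched in plain Python. This is a minimal illustration, not the report's actual code: the stop-word list and the regex tokeniser here are illustrative stand-ins, and the toy documents are invented.

```python
import re
from collections import Counter

# Small illustrative stop-word list (a real stop list is much larger)
STOP_WORDS = {"the", "a", "is", "of", "and", "in", "on"}

def preprocess(doc):
    """Case-normalise, tokenise, and remove stop words."""
    # Folding to lower case; punctuation is discarded by the token regex
    tokens = re.findall(r"[a-z0-9]+", doc.lower())
    return [t for t in tokens if t not in STOP_WORDS]

docs = [
    "The cat sat on the mat.",
    "A dog sat in the hall.",
    "The cat and the dog played.",
    "Rain fell on the garden.",
]
tokenised = [preprocess(d) for d in docs]

# Frequency pruning: keep terms that appear in between 5% and 95% of documents
n = len(docs)
df = Counter(t for toks in tokenised for t in set(toks))
vocab = {t for t, c in df.items() if 0.05 * n <= c <= 0.95 * n}
```

On a real corpus the document-frequency thresholds remove both ubiquitous terms and very rare ones in a single pass.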

4. Feature Selection
Feature selection determines which properties of the text are passed to the classifiers as predictors. The following features were extracted from our data:
• Unigram feature: The unigrams (single words) were identified and processed. This allows the frequency of each word to be used as a predictor.
• Bigram feature: Each pair of adjacent words forms a bigram. A bigram allows a word to be predicted from the one that precedes it. The bigrams were identified and processed.
• TF-IDF (Term Frequency-Inverse Document Frequency): TF-IDF is used to evaluate how important a word is to a document. Words with a high TF-IDF score occur frequently in that document and provide the most information about it; the weighting also reduces the influence of words that are common across all documents.

We used Python for the processing and generation of the features. The feature-selection code was run on both the training and the testing documents.
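A minimal sketch of the unigram, bigram, and TF-IDF extraction, assuming scikit-learn is used (the report does not show the original code, and the two-document corpus here is invented):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["the cat sat", "the dog sat"]

# Unigram + bigram term counts: each single word and each adjacent pair
counts = CountVectorizer(ngram_range=(1, 2))
X_counts = counts.fit_transform(docs)

# TF-IDF weighting over the same n-grams: down-weights terms that are
# common across documents
tfidf = TfidfVectorizer(ngram_range=(1, 2))
X_tfidf = tfidf.fit_transform(docs)
```

The same fitted vectorisers would then be applied to the testing documents via `transform`, so training and testing share one vocabulary.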

5. Model/ Algorithm for classifiers


The following classifiers were used to predict the class labels of the testing articles:

5.1 Naïve Bayes


Naïve Bayes is a probabilistic learning method based on applying Bayes' theorem. It assumes independence between the features. It tends to perform well even under this unrealistic assumption and can outperform other alternatives for very small sample sizes. It is robust, fast, accurate, and easy to implement. If the Naïve Bayes conditional independence assumption holds, convergence is much quicker than for discriminative models such as logistic regression. It needs less training data, can be used for both binary and multi-class classification problems, and can handle both continuous and discrete data.

This classifier was run on both the training_docs and testing_docs, and the resulting accuracy was:
Accuracy = 44.7%
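A sketch of this kind of classifier in Python with scikit-learn (an analogue, not the report's actual e1071-based code; the training documents and labels below are invented stand-ins for training_docs and training_labels):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data standing in for training_docs / training_labels
train_docs = [
    "stocks rose sharply on the market today",
    "the team won the football match",
    "market shares fell after the report",
    "the player scored a late goal",
]
train_labels = ["business", "sport", "business", "sport"]
test_docs = ["market stocks fell", "the player won the match"]

# Word counts feed the multinomial model; Laplace smoothing is the default
nb = make_pipeline(CountVectorizer(), MultinomialNB())
nb.fit(train_docs, train_labels)
pred = nb.predict(test_docs)
```

Each test document is assigned the class whose word distribution makes it most probable under the independence assumption.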

5.2 Auto-Encoder
An auto-encoder is based on neural networks. Given some contextual information about a document (in the form of unigrams, bigrams, etc.), this classifier helps to predict the document's class label. An artificial neural network is a nonlinear, non-parametric model, whereas many other statistical methods are parametric models that require a stronger statistical background. Neural models can be applied to highly complex problems with too many variables to be simplified. They have the ability to detect complex non-linear relationships between dependent and independent variables, and to detect possible interactions between predictor variables.
This classifier was run on both the training_docs and testing_docs, and the resulting accuracy was:

Accuracy = 65.611%
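The report used H2O's deep-learning tools in R; the following is only a rough scikit-learn analogue of the idea, on synthetic data. An MLP is trained to reconstruct its input through a narrow bottleneck, and the bottleneck activations are then fed to an ordinary classifier.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.random((200, 20))                  # synthetic "documents" as feature vectors
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)  # synthetic labels

# Auto-encoder: train the network to reproduce its input through
# a 5-unit bottleneck
ae = MLPRegressor(hidden_layer_sizes=(5,), max_iter=2000, random_state=0)
ae.fit(X, X)

# Encode: apply the first layer's weights plus ReLU
# (MLPRegressor's default activation) to get compressed features
H = np.maximum(0, X @ ae.coefs_[0] + ae.intercepts_[0])

# Classify on the compressed representation
clf = LogisticRegression().fit(H, y)
pred = clf.predict(H)
```

The compressed representation H plays the role of the learned features; any downstream classifier can be trained on it.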
5.3 SVM (Support Vector Machines)
An SVM is a discriminative classifier defined by a separating hyperplane. Given labelled training data, the algorithm outputs an optimal hyperplane that can categorise new examples. SVMs are useful in the case of non-regularity in the data, i.e. when the data is not regularly distributed. SVMs deliver a unique solution, which is an advantage over neural networks, which have multiple solutions associated with local minima and may not be robust across different samples. SVMs work well and can handle unbalanced data.
This classifier was run on both the training_docs and testing_docs, and the resulting accuracy was:

Accuracy = 66.44%
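A sketch of a linear SVM on TF-IDF features in Python with scikit-learn (again an analogue of the approach, not the report's e1071 code; the toy documents are invented):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

# Hypothetical toy data standing in for training_docs / training_labels
train_docs = [
    "stocks rose sharply on the market today",
    "the team won the football match",
    "market shares fell after the report",
    "the player scored a late goal",
]
train_labels = ["business", "sport", "business", "sport"]

# TF-IDF features feed a linear SVM, which fits a separating hyperplane
svm = make_pipeline(TfidfVectorizer(), LinearSVC())
svm.fit(train_docs, train_labels)
pred = svm.predict(["market stocks fell", "the player won the match"])
```

With linearly separable classes the SVM finds a unique maximum-margin hyperplane, which is the property highlighted above.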

6. Conclusion
Different classifiers were built to classify the testing articles and predict the class label for each testing article. The accuracy achieved by the Naïve Bayes classifier is 44.7%, by the Auto-Encoder classifier 65.611%, and by SVM (Support Vector Machines) 66.44%. The best results are obtained with the SVM classifier.
7. References
Nazrul, S. (2017). Multinomial Naive Bayes Classifier for Text Analysis (Python). Retrieved from https://towardsdatascience.com/multinomial-naive-bayes-classifier-for-text-analysis-python-8dd6825ece67

Raschka, S. (2014). Naive Bayes and Text Classification. Retrieved from https://sebastianraschka.com/Articles/2014_naive_bayes_1.html

Varma, R. (2016). How are neural networks used in Natural Language Processing? Retrieved from https://www.quora.com/How-are-neural-networks-used-in-Natural-Language-Processing

Sallehuddin, R. (2014). What are the advantages of using Artificial Neural Network compared to other approaches? Retrieved from https://www.researchgate.net/post/What_are_the_advantages_of_using_Artificial_Neural_Network_compared_to_other_approaches
