
ASSIGNMENT 1

MACHINE LEARNING

B.S. (IT)
Submitted by:
Bushra Urooj Ansari
Sana Iqbal

6th March 2016


DEPARTMENT OF COMPUTER SCIENCE AND INFORMATION TECHNOLOGY

JINNAH UNIVERSITY FOR WOMEN


5-C NAZIMABAD, KARACHI 74600

A data mining experiment: SMS spam classification using WEKA
The SMS Spam dataset
The dataset contains 5,574 user-created SMS messages. The collection consists of a
single file in which the class attribute takes only two values: spam and ham.

Importing the dataset in WEKA

Fig. 1: imported dataset and class distribution


As expected, we get a relation containing 5,574 instances and two attributes (text and
class). The histogram in the figure shows the distribution of the two message classes
(blue = ham, red = spam).
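
The class histogram is simply a count of instances per class value. A minimal Python sketch of that count follows; it assumes the messages are available as "label, tab, text" lines, and the sample messages are made up for illustration rather than taken from the dataset.

```python
from collections import Counter

# Hypothetical sample lines in "label<TAB>text" form (not from the real dataset).
sample_lines = [
    "ham\tAre we still meeting at five?",
    "spam\tYou have won a free prize, call now",
    "ham\tI will call you later",
    "ham\tSee you at home",
]

def class_distribution(lines):
    """Count how many messages carry each class label."""
    labels = [line.split("\t", 1)[0] for line in lines]
    return Counter(labels)

counts = class_distribution(sample_lines)
print(counts)
```

Running the same count over the full file would give the two bar heights shown in the WEKA histogram.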

Text preprocessing and feature extraction in WEKA


In the Preprocess tab, choose the StringToWordVector filter.

Fig. 2: StringToWordVector filter configuration

After applying the StringToWordVector filter, we get the result shown in figure 3.

Fig. 3: after term extraction


We get a relation containing 13,833 binary attributes. The histogram shows the
distribution of the term "I" among the messages.
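
To illustrate what "binary attributes" means here, the sketch below turns each message into a 0/1 vector with one entry per vocabulary term. It is a simplified illustration in plain Python and not WEKA's implementation: the tokenizer, the lowercasing, and the sample messages are assumptions, and none of the StringToWordVector options are reproduced.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens (a simplification)."""
    return re.findall(r"[a-z']+", text.lower())

def binary_word_vectors(documents):
    """Return (vocabulary, vectors) with one 0/1 entry per vocabulary term per document."""
    vocabulary = sorted({tok for doc in documents for tok in tokenize(doc)})
    vectors = []
    for doc in documents:
        present = set(tokenize(doc))
        vectors.append([1 if term in present else 0 for term in vocabulary])
    return vocabulary, vectors

# Hypothetical messages used only to show the shape of the output.
docs = ["Free prize call now", "I will call you later"]
vocab, vecs = binary_word_vectors(docs)
print(vocab)
print(vecs)
```

Each column of the resulting matrix corresponds to one term, which is why the attribute count grows with the size of the vocabulary.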
The last preprocessing operation is attribute selection. Eliminating poorly
characterizing attributes can help achieve better classification accuracy. For this
task, WEKA provides the AttributeSelection filter from the
weka.filters.supervised.attribute package. The filter lets you choose an attribute
evaluation method and a search strategy (fig. 4).

Fig. 4: AttributeSelection filter parameters


The default evaluation method is CfsSubsetEval (Correlation-based feature subset
selection). This method works by choosing attributes that are highly correlated with the
class attribute while having a low correlation with other attributes.
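
The trade-off described above is usually summarized by the CFS merit heuristic, merit = k * r_cf / sqrt(k + k * (k - 1) * r_ff), where k is the subset size, r_cf the average attribute-class correlation, and r_ff the average attribute-attribute correlation. The sketch below only evaluates that formula; the function name and the input values are illustrative assumptions, and it does not reproduce how CfsSubsetEval estimates the correlations.

```python
import math

def cfs_merit(k, avg_feature_class_corr, avg_feature_feature_corr):
    """Merit of a k-attribute subset: rewards class correlation, penalizes redundancy."""
    numerator = k * avg_feature_class_corr
    denominator = math.sqrt(k + k * (k - 1) * avg_feature_feature_corr)
    return numerator / denominator

# Same class correlation, different redundancy among the attributes (made-up values).
independent = cfs_merit(4, 0.5, 0.0)   # attributes uncorrelated with each other
redundant = cfs_merit(4, 0.5, 1.0)     # attributes fully correlated with each other
print(independent, redundant)
```

With equal class correlation, the subset of mutually redundant attributes scores lower, which is the behavior the paragraph above describes.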
After applying the AttributeSelection filter, we obtain the result shown in figure 5.

Fig. 5: after applying the AttributeSelection filter


As you can see, the number of attributes has greatly decreased, dropping from
13,833 to just 64.

Classification
Naive Bayes classifier
In WEKA, the Naive Bayes classifier is implemented in the NaiveBayes component from
the weka.classifiers.bayes package. The best result achieved with this classifier on
this dataset was a correctness percentage of 96.2684% (fig. 6).

Fig. 6: best classification result with Naive Bayes
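
For readers who want to see the idea behind the classifier, here is a minimal Bernoulli Naive Bayes sketch over 0/1 term vectors, written in plain Python. It is an illustration under stated assumptions, not WEKA's NaiveBayes: the Laplace smoothing, the function names, and the toy data are all chosen for this example.

```python
import math

def train_bernoulli_nb(vectors, labels):
    """Estimate class priors and per-term presence probabilities (Laplace smoothed)."""
    n_terms = len(vectors[0])
    model = {}
    for c in sorted(set(labels)):
        rows = [v for v, y in zip(vectors, labels) if y == c]
        prior = len(rows) / len(vectors)
        # P(term present | class), smoothed so no probability is exactly 0 or 1.
        presence = [(sum(r[j] for r in rows) + 1) / (len(rows) + 2)
                    for j in range(n_terms)]
        model[c] = (prior, presence)
    return model

def predict_nb(model, vector):
    """Return the class with the highest log posterior for a 0/1 vector."""
    best_class, best_score = None, -math.inf
    for c, (prior, presence) in model.items():
        score = math.log(prior)
        for x, p in zip(vector, presence):
            score += math.log(p if x else 1.0 - p)
        if score > best_score:
            best_class, best_score = c, score
    return best_class

# Hypothetical toy data: columns stand for the terms ["call", "free", "later", "prize"].
X = [[1, 1, 0, 1], [0, 1, 0, 1], [1, 0, 1, 0], [0, 0, 1, 0]]
y = ["spam", "spam", "ham", "ham"]
nb_model = train_bernoulli_nb(X, y)
print(predict_nb(nb_model, [0, 1, 0, 1]))
```

Working in log space avoids numerical underflow when many term probabilities are multiplied together.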

K-nearest neighbors classifier


In WEKA, the k-NN classifier is implemented in the weka.classifiers.lazy.IBk component.
The best result achieved with this kind of classifier on this dataset was a correctness
percentage of 96.3581% (fig. 7), using the 1-nearest-neighbor classifier.

Fig. 7: best classification result with k-nearest neighbors (1-NN)
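
The 1-nearest-neighbor rule assigns a message the class of its closest training message. The plain Python sketch below shows that rule on 0/1 term vectors; it is not the IBk implementation, and the choice of squared Euclidean distance, the function names, and the toy data are assumptions made for illustration.

```python
def squared_euclidean(a, b):
    """Squared Euclidean distance; on 0/1 vectors it equals the number of differing terms."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def predict_1nn(train_vectors, train_labels, query):
    """Return the label of the training vector closest to the query."""
    nearest = min(range(len(train_vectors)),
                  key=lambda i: squared_euclidean(train_vectors[i], query))
    return train_labels[nearest]

# Hypothetical toy data: columns stand for the terms ["call", "free", "later", "prize"].
train_X = [[1, 1, 0, 1], [1, 0, 1, 0]]
train_y = ["spam", "ham"]
print(predict_1nn(train_X, train_y, [0, 1, 0, 1]))
```

Because no model is built in advance, all of the work happens at prediction time, which is why such classifiers are called lazy.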


Classification algorithm    Correctness percentage
Naive Bayes                 96.27%
KNN (K=1)                   96.36%
Table 1: Comparison
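
The correctness percentage is the share of correctly classified instances. As a worked check in Python, the sketch below assumes both classifiers were evaluated over all 5,574 instances; the correct-instance counts are inferred from the reported percentages under that assumption and are not stated in the original figures.

```python
TOTAL_INSTANCES = 5574  # size of the dataset as reported above

def accuracy_percent(correct, total):
    """Percentage of correctly classified instances."""
    return 100.0 * correct / total

# Inferred counts (an assumption): the integers that reproduce the reported percentages.
nb_correct = 5366
knn_correct = 5371

nb_accuracy = round(accuracy_percent(nb_correct, TOTAL_INSTANCES), 4)
knn_accuracy = round(accuracy_percent(knn_correct, TOTAL_INSTANCES), 4)
print(nb_accuracy, knn_accuracy, knn_correct - nb_correct)
```

Under that assumption, the gap between the two classifiers amounts to only five messages, so the difference in the table should be read as small.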