
26th Telecommunications forum TELFOR 2018 Serbia, Belgrade, November 20-21, 2018.

Convolutional Neural Network based SMS Spam Detection
Milivoje Popovac, Mirjana Karanovic, Srdjan Sladojevic, Marko Arsenovic, Andras Anderla

Milivoje B. Popovac, Faculty of Technical Sciences, Trg Dositeja Obradovica 6, University of Novi Sad, 21101 Novi Sad, Serbia (phone: 381-21-4852186, e-mail: milivojepopovac@uns.ac.rs)
Mirjana M. Karanovic, Faculty of Technical Sciences, Trg Dositeja Obradovica 6, University of Novi Sad, 21101 Novi Sad, Serbia (phone: 381-21-4852186, e-mail: mkaranovic@uns.ac.rs)
Srdjan Sladojevic, Faculty of Technical Sciences, Trg Dositeja Obradovica 6, University of Novi Sad, 21101 Novi Sad, Serbia (phone: 381-21-4852186, e-mail: sladojevic@uns.ac.rs)
Marko Arsenovic, Faculty of Technical Sciences, Trg Dositeja Obradovica 6, University of Novi Sad, 21101 Novi Sad, Serbia (phone: 381-21-4852186, e-mail: arsenovic@uns.ac.rs)
Andras Anderla, Faculty of Technical Sciences, Trg Dositeja Obradovica 6, University of Novi Sad, 21101 Novi Sad, Serbia (phone: 381-21-4852117, e-mail: andras@uns.ac.rs)

978-1-5386-7171-9/18/$31.00 ©2018 IEEE

Abstract: SMS spam refers to undesired text messages. Machine Learning methods for anti-spam filters have been noticeably effective in categorizing spam messages. The dataset used in this research is known as Tiago’s dataset. A crucial step in the experiment was data preprocessing, which involved reducing text to lower case, tokenization, and removing stopwords. A Convolutional Neural Network was the proposed method for classification. The model’s overall accuracy was 98.4%. The obtained model can be used as a tool in many applications.

Keywords: CNN, Cost-sensitive classification, Imbalanced dataset, Machine Learning, SMS Spam

I. INTRODUCTION
Text messages are used by youth and adults for personal and social purposes and in business; hence, they are becoming more popular with marketers for use in their campaigns. Two of the most common means of sending text messages are SMS and email. According to [1], SMS messages have a 98% open rate, while email has only a 20% open rate; additionally, as stated in [2], SMS messages have a 45% response rate, as opposed to email, which has a 6% response rate. For these reasons, marketers in some regions choose SMS as their primary way of advertising, which leads to spamming. SMS spam refers to any undesired text message that marketers send at random for commercial purposes. It can take the form of a simple message that contains a link to a number, a link to a website, etc. Some of the reasons why spam is bad include: it is annoying, it wastes time, it misuses resources, and it is an invasion of privacy. However, the main reason is that spamming costs its victims much more than it costs its senders. Thus, SMS spam detection proves to be a necessity.
SMS spam is a problem that does not yet have a clear and simple solution, and many efforts have been made to build a model that will detect (classify) SMS spam. Although these models can be helpful, there are still opportunities for further enhancements. In this paper, an experiment was conducted in order to classify spam and non-spam messages by using a Convolutional Neural Network (CNN).
This research confirms that a CNN can be used to build a spam detection model that categorizes messages with high accuracy. Moreover, an adjusted model can be applied to sentiment analysis, text categorization, or spam detection in other types of communication media such as emails, social network systems, online reviews, etc.
The rest of the paper is organized as follows: Section II presents the related work, Section III contains a description of the experiment, and Section IV presents the results and discussion. Section V holds the conclusion, after which a list of literature is given.

II. RELATED WORK
Detection of spam messages, as a classification problem, can be solved by using different methods of data analysis. With respect to the method chosen for this research, papers in which artificial intelligence methods have been used are listed below.
The authors of paper [3], who are also the creators of Tiago’s dataset, tested multiple machine learning algorithms and provided the foundation for further research. They used two tokenizers (in order to recognize a domain and to preserve symbols that help in separating spam from ham messages) in combination with classifiers, where SVM, Boosted NB, Boosted C4.5 and PART achieved the best results. SVM showed the best results, with an accuracy of over 97.5%; it caught 83.10% of spam messages and blocked only 0.18% of non-spam messages.
In paper [4], the SMS Spam Corpus and SMS Spam Collection datasets were used, individually and merged. Both methods used, the Naive Bayes classifier and the FP-Growth algorithm, accomplished an accuracy rate higher than 90%. The best average accuracy (98.5%) was obtained with the implementation of the FP-Growth algorithm on Tiago’s dataset.
In paper [5], the research was conducted on Tiago’s dataset. In the experiment, Naive Bayes outperformed the Random Forest and Logistic Regression algorithms. NB provided the results of almost 98.5%
accuracy.
The same dataset was used in paper [6]. The authors concluded that boosting of the Random Forest and SVM algorithms gave the best results, and that using both Linguistic Inquiry and Word Count (LIWC) and SMS-specific content-based features can positively affect results.
After comparing multiple algorithms, the authors of paper [7] chose the GentleBoost classifier as the most suitable for Tiago’s dataset. The specified algorithm is a mix of the LogitBoost and AdaBoostM1 algorithms, which makes it convenient for binary classification and unbalanced data. This approach led to an accuracy of over 98.3%.
In paper [8], the authors used the following algorithms: NB, SVM, k-NN, RF and AdaBoost. All of the mentioned algorithms accomplished an accuracy rate higher than 97%, with multinomial naive Bayes with Laplace smoothing and SVM with a linear kernel among the best classifiers.
Further research on Tiago’s dataset was presented in paper [9]. Besides pre-processing and classification with various classifiers, the experiment included clustering using the K-Means algorithm or the NMF model. After the mentioned steps, a solution for SMS thread identification was proposed. The research led to the conclusion that the SVM algorithm performs better in categorizing SMS messages and that the combination of the NMF and SVM algorithms gave good results in thread identification.
The authors of paper [10] compared the performance of multiple algorithms on four spam datasets. They proposed a novel spam filter integrating N-gram tf.idf feature selection, a modified distribution-based balancing algorithm and a regularized deep multi-layer perceptron NN model with rectified linear units (DBB-RDNN-ReL). The accuracy of the suggested model on Tiago’s dataset was approximately 98.5%, the FP rate was around 0.0024, and the AUC was 0.961.
Similar methods can also be used for email spam detection, sentiment analysis, text categorization, etc. The authors of paper [11] trained a CNN for sentence classification; that paper shows that a simple CNN with one layer of convolution can achieve extraordinary results. In paper [12], a study was made to test the application of a CNN-RNN model for multi-label text categorization. Feature extraction was conducted using a CNN, while an RNN was used to extract local semantic information and to model label correlation. It is shown that the size of the dataset has a significant impact on the performance of the applied model. A small dataset may lead to overfitting, while a large dataset can achieve remarkable performance.
Considering the presented basis, this study was conducted in order to evaluate the applicability of a CNN to the given problem.

III. MATERIALS AND METHODS
A. Dataset
The dataset used in this research is also known as Tiago’s dataset [13]. It is composed of 5,574 English, real and non-encoded messages labeled as ham (non-spam) or spam.
The dataset represents a collection of 4 subsets:
• A subset extracted from the Grumbletext Web site (UK forum). It contains 425 SMS spam messages.
• A collection of 3,375 SMS non-spam messages provided by the NUS SMS Corpus (NSC) and collected for research at the Department of Computer Science at the National University of Singapore.
• The SMS Spam Corpus v.0.1 Big subset. It consists of 322 spam and 1,002 ham messages.
• A collection of 450 SMS ham messages from Caroline Tagg’s PhD thesis.

B. Preprocessing
One of the essential steps in creating a good model for spam classification is preprocessing the text data. Converting text into something an algorithm can work with is a complex process. First, text often has varied capitalization, which has no big impact on the final model, so for the purpose of this research, text from the whole dataset was reduced to lower case for simplicity.
In the next step, tokenization (splitting text into individual words) was conducted in order to perform stemming. By reducing words to a root, the sentiment of the text was preserved. Also, words which serve only to connect parts of a given text (message), rather than to influence the model (so-called stopwords), were removed. Further, the polarity and subjectivity of every message were extracted in order to interpret sentiment.
In the final phase of preprocessing, text from messages was converted into a matrix of TF-IDF features. This step reflects how important a word is to a text. The values within the given matrix increased in proportion to the number of times a word appeared in the messages. Fig. 1 shows the most frequently repeated words in spam messages.

Fig. 1. Most frequent words in spam messages

The described preprocessing and the following experiment were implemented using the Python programming language, in the Spyder IDE.

C. Experiment
CNNs proved to be useful for image classification [14]-[16]; nevertheless, they can be quite useful for text and time-
series data [17], [18].
A CNN consists of an input and an output layer, with one or more hidden layers between them. The hidden layers of a CNN typically consist of convolutional layers, pooling layers, fully connected layers and normalization layers [19].
For the purpose of this research, the model was composed of two convolutional layers. Both of them had 32 filters with kernel size 3 and the ReLU activation function. After each of them, a MaxPooling layer with pool size 2 was added. A Flatten layer preceded the fully connected layer, which contained 128 units with the ReLU activation function. The output layer consisted of one unit with the sigmoid activation function. The weights of the CNN were randomly initialized. The described architecture of the CNN is shown in Fig. 2.

Fig. 2. Proposed architecture of CNN

Due to the unbalanced class distribution, cost-sensitive classification was performed and class weights were assigned accordingly: 1 for non-spam and 1.5 for spam. Hyperparameters were assigned based on tests of multiple models with different values.
Because of the nature of the problem, the Adam optimizer and binary cross-entropy loss were used for compiling the model. The model was updated through 10 epochs, since fifth caused overfitting.

IV. RESULTS AND DISCUSSION
In order to estimate how accurate the predictions will be in practice, the model was assessed using 10-fold cross-validation.

TABLE 1: CONFUSION MATRIX
            Predicted 0   Predicted 1
Actual 0        1440            8
Actual 1          19          205

In each step, the AUC, F1 score and confusion matrix were evaluated. The final values were obtained by taking the mean over all steps.
Of the total 224 spam messages, the model correctly classified 205, hence having a sensitivity (TPR) of 91.51%. Out of 1448 non-spam messages, 1440 messages were classified accurately, giving a specificity (TNR) of 99.44%. The values obtained for AUC and F1 score are 0.955 and 0.938, respectively, as shown in Table 2.

TABLE 2: PERFORMANCE MEASURES
TPR   0.915
TNR   0.994
AUC   0.955
F1    0.938
ACC   0.984

Taking into consideration the results obtained by the proposed model and comparing them with the similar papers [3], [7], [8] listed in Section II, the CNN produced better results evaluated on the same dataset. As in papers [11] and [12], it is shown that a CNN can be useful for SMS spam detection and similar classification tasks.

V. CONCLUSION
In this paper, a CNN was applied to Tiago’s dataset in order to distinguish spam from non-spam messages. The dataset is composed mostly of ham messages; hence it is described as strongly imbalanced.
In order to obtain a good model, preprocessing of the data was conducted. First, text from the dataset was reduced to lower case, and after that tokenization was applied. By performing stemming, the sentiment of the text was preserved; afterwards, stopwords were removed. The last step consisted of converting the text into a matrix of TF-IDF features.
There are many Machine Learning algorithms that can be used for spam detection. This paper proposed a CNN for spam classification; with an AUC score of 0.955 and an accuracy of 98.4%, the results indicate that this model can perform better than many other ML techniques.
The proposed model can be adjusted to fit many related types of problems. Also, it can be used as a background service in many applications.
Future research should focus on applying more complex preprocessing techniques (spelling correction, term frequency, N-grams, etc.), or on choosing another CNN architecture. Also, combining models through ensemble learning can lead to exceptional results.

ACKNOWLEDGEMENT
This research was supported by the PanonIT company [19].

REFERENCES
[1] Mobile Marketing Watch, "SMS Marketing Wallops Email with 98% Open Rate and Only 1% Spam," [Online]. Available: https://mobilemarketingwatch.com/sms-marketing-wallops-email-with-98-open-rate-and-only-1-spam-43866/. [Accessed 2 September 2018].
[2] A. Small, "How to Use SMS to Win Love, Leads, Revenue," 5 May 2013. [Online]. Available: https://martech.zone/text-messaging/. [Accessed 1 September 2018].
[3] T. A. Almeida, J. M. Gómez Hidalgo and A. Yamakami, "Contributions to the Study of SMS Spam Filtering: New Collection and Results," in Proceedings of the 11th ACM symposium on Document engineering, Mountain View, 2011.
[4] D. Delvia Arifin, Shaufiah and M. A. Bijaksana, "Enhancing spam detection on mobile phone Short Message Service (SMS) performance using FP-growth and Naive Bayes Classifier," in IEEE
Asia Pacific Conference on Wireless and Mobile (APWiMob), Bandung, 2016.
[5] P. Sethi, V. Bhandari and B. Kohli, "SMS spam detection and comparison of various machine learning algorithms," in International Conference on Computing and Communication Technologies for Smart Nation (IC3TSN), Gurgaon, 2017.
[6] A. Karami and L. Zhou, "Improving Static SMS Spam Detection by Using New Content-based Features," in Twentieth Americas Conference on Information Systems, Savannah, 2014.
[7] F. Akbari and H. Sajedi, "SMS spam detection using selected text features and Boosting Classifiers," in 2015 7th Conference on Information and Knowledge Technology (IKT), Urmia, 2015.
[8] H. Shirani-Mehr, "SMS Spam Detection using Machine Learning Approach," [Online]. Available: http://cs229.stanford.edu/proj2013/ShiraniMehr-SMSSpamDetectionUsingMachineLearningApproach.pdf. [Accessed 27 August 2018].
[9] N. K. Nagwani and A. Sharaff, "SMS spam filtering and thread identification using bi-level text classification and clustering techniques," Journal of Information Science, vol. 43, no. 1, pp. 75-87, 2017.
[10] A. Barushka and P. Hajek, "Spam filtering using integrated distribution-based balancing approach and regularized deep neural networks," Applied Intelligence, vol. 48, no. 10, pp. 3538-3556, 2018.
[11] Y. Kim, "Convolutional Neural Networks for Sentence Classification," in Conference on Empirical Methods in Natural Language Processing, Doha, 2014.
[12] G. Chen, D. Ye, Z. Xing, J. Chen and E. Cambria, "Ensemble application of convolutional and recurrent neural networks for multi-label text categorization," in International Joint Conference on Neural Networks (IJCNN), Anchorage, 2017.
[13] T. A. Almeida and J. M. G. Hidalgo, "SMS Spam Collection v. 1," 2011. [Online]. Available: http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/. [Accessed 25 August 2018].
[14] C. Zhang, X. Pan, H. Li, A. Gardiner, I. Sargent, J. Hare and P. M. Atkinson, "A hybrid MLP-CNN classifier for very fine resolution remotely sensed image classification," ISPRS Journal of Photogrammetry and Remote Sensing, vol. 140, pp. 133-144, 2018.
[15] A. Scherzinger, S. Klemm, D. Berh and X. Jiang, "CNN-Based Background Subtraction for Long-Term In-Vial FIM Imaging," in International Conference on Computer Analysis of Images and Patterns, Ystad, 2017.
[16] M. Arsenovic, S. Sladojevic, A. Anderla and D. Stefanovic, "FaceTime - Deep learning based face recognition attendance system," in IEEE 15th International Symposium on Intelligent Systems and Informatics (SISY), Subotica, 2017.
[17] L. Eren, T. Ince and S. Kiranyaz, "A Generic Intelligent Bearing Fault Diagnosis System Using Compact Adaptive 1D CNN Classifier," Journal of Signal Processing Systems, pp. 1-11, 2018.
[18] J. Liu, Y. Cheng, X. Wang and Y. Kong, "Joint Sample Expansion and 1D Convolutional Neural Networks for Tumor Classification," in ICIC 2017: Intelligent Computing Theories and Application, Liverpool, 2017.
[19] "PanonIT: Homepage," PanonIT [Online]. Available: http://panonit.com/. [Accessed 12 September 2018].
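
As an illustration of the preprocessing described in Section III-B, the following minimal sketch lower-cases messages, tokenizes them, removes stopwords and builds a TF-IDF matrix using only the Python standard library. It is not the authors' code. The stopword list, the function names `preprocess` and `tfidf_matrix`, the sample messages and the smoothed IDF formula are assumptions made for this example, and stemming and polarity extraction are omitted for brevity.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; the paper does not list its stopwords).
STOPWORDS = {"a", "an", "the", "to", "is", "and", "or", "you", "your", "for", "in", "of"}


def preprocess(message):
    """Lower-case the message, split it into alphanumeric tokens, drop stopwords."""
    tokens = re.findall(r"[a-z0-9]+", message.lower())
    return [t for t in tokens if t not in STOPWORDS]


def tfidf_matrix(messages):
    """Return (vocabulary, rows); rows[i][j] is the TF-IDF weight of term j in message i."""
    docs = [preprocess(m) for m in messages]
    vocabulary = sorted({t for doc in docs for t in doc})
    n_docs = len(docs)
    # Document frequency: number of messages that contain each term.
    df = Counter(t for doc in docs for t in set(doc))
    # Smoothed inverse document frequency (one common variant, assumed here).
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1.0 for t in vocabulary}
    rows = []
    for doc in docs:
        counts = Counter(doc)
        # Raw term count times IDF; grows with the number of occurrences.
        rows.append([counts[t] * idf[t] for t in vocabulary])
    return vocabulary, rows


# Hypothetical sample messages, not taken from the dataset.
messages = [
    "WINNER! Claim your FREE prize now, call now",
    "Are you coming to the meeting?",
    "Free entry: text WIN to claim",
]
vocabulary, rows = tfidf_matrix(messages)
```

In this sketch a term that occurs twice in one message and in no other message (such as "now") receives a larger weight than a term shared by several messages (such as "free"), which is the behaviour the TF-IDF step is meant to capture.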

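
The cost-sensitive training described in Section III-C combines binary cross-entropy with class weights of 1 for non-spam and 1.5 for spam. The sketch below shows one plausible way such weights scale the per-message loss. The function name `weighted_binary_crossentropy`, the clipping constant and the example probabilities are assumptions for illustration, and the averaging may differ from what a deep learning framework does internally.

```python
import math

# Class weights reported in the text: 1 for non-spam (label 0), 1.5 for spam (label 1).
CLASS_WEIGHTS = {0: 1.0, 1: 1.5}


def weighted_binary_crossentropy(y_true, y_pred, class_weights=None, eps=1e-7):
    """Mean binary cross-entropy with each sample's loss scaled by its class weight."""
    if class_weights is None:
        class_weights = CLASS_WEIGHTS
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        loss = -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
        total += class_weights[y] * loss
    return total / len(y_true)


# An equally confident mistake costs 1.5 times more on a spam message than on a ham message.
spam_miss = weighted_binary_crossentropy([1], [0.1])  # spam predicted as ham
ham_miss = weighted_binary_crossentropy([0], [0.9])   # ham predicted as spam
```

With these weights, missing a spam message is penalized more heavily than blocking a legitimate one, which pushes the model toward the minority class.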
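
The performance measures in Section IV other than AUC can be recomputed from the confusion matrix in Table 1. The following short sketch does so, with the helper name `confusion_metrics` being hypothetical. AUC is not included because it depends on the predicted scores rather than on the confusion matrix alone.

```python
def confusion_metrics(tn, fp, fn, tp):
    """Derive standard binary classification metrics from confusion-matrix counts."""
    tpr = tp / (tp + fn)                    # sensitivity (recall on spam)
    tnr = tn / (tn + fp)                    # specificity (recall on ham)
    acc = (tp + tn) / (tp + tn + fp + fn)   # overall accuracy
    precision = tp / (tp + fp)
    f1 = 2 * precision * tpr / (precision + tpr)
    return {"TPR": tpr, "TNR": tnr, "ACC": acc, "F1": f1}


# Counts taken from Table 1: actual 0 -> (1440, 8), actual 1 -> (19, 205).
metrics = confusion_metrics(tn=1440, fp=8, fn=19, tp=205)
```

Rounded to three decimals, this gives TPR 0.915, TNR 0.994, ACC 0.984 and F1 0.938, in line with the values reported in the text.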