Mark Stamp
San Jose State University
Keywords: Malware, Android, Machine Learning, Random Forest, Logistic Model Tree, Artificial Neural Network.
Abstract: In this paper, we present a comparative analysis of benign and malicious Android applications, based on static
features. In particular, we focus our attention on the permissions requested by an application. We consider both
binary classification of malware versus benign, as well as the multiclass problem, where we classify malware
samples into their respective families. Our experiments are based on substantial malware datasets and we
employ a wide variety of machine learning techniques, including decision trees and random forests, support
vector machines, logistic model trees, AdaBoost, and artificial neural networks. We find that permissions are
a strong feature and that by careful feature engineering, we can significantly reduce the number of features
needed for highly accurate detection and classification.
The static analysis can be done using the Java bytecode extracted after disassembling the apk file. We can also extract permissions from the manifest file. In this paper, we take advantage of static analysis using the permissions of applications, and we use these permissions to detect malware and to classify malware into families. The effectiveness of these techniques is analyzed using multiple machine learning algorithms.

Artificial Neural Network (ANN) represents a large class of machine learning techniques that attempt to (loosely) model the behavior of neurons, and that are trained using backpropagation (Stamp, 2018). While ANNs are not a new concept, having first been proposed in the 1940s, they have found renewed interest in recent years as computing power has become sufficient to effectively deal with "deep" neural networks, i.e., networks that include many hidden layers. Such deep networks have pushed machine learning to new heights. For our ANN experiments, we use two hidden layers, with 10 neurons per layer, the rectified linear unit (ReLU) for the activation functions on the hidden layers, and a sigmoid function for the output layer. Training consists of 100 epochs, with the learning rate set at α = 0.001.

Support Vector Machine (SVM) is a popular and effective machine learning technique. According to (Bennett and Campbell, 2000), "SVMs are a rare example of a methodology where geometric intuition, elegant mathematics, theoretical guarantees, and practical algorithms meet." When training an SVM, we attempt to find a separating hyperplane that maximizes the "margin," i.e., the minimum distance between the classes in the training set. A particularly nice feature of SVMs is that we can map the input data to a higher-dimensional feature space.

2.2 Selected Related Work

The paper (Feng et al., 2014) discusses a tool the authors refer to as Apposcopy, which implements a semantic language-based Android signature detection strategy. In their research, general signatures are created for each malware family. Signature matching is achieved using inter-component call graphs based on control flow properties, and the results are enhanced using static taint analysis. The authors report an accuracy of 90% on a malware dataset containing 1027 samples, with the accuracy for individual families ranging from a high of 100% to a low of 38%.

In the research (Fuchs et al., 2009), the authors analyze a tool called SCanDroid that they have developed. This tool extracts features based on data flow.

The work in (Abah et al., 2015) relies on k-nearest neighbor classification based on a variety of features, including incoming and outgoing SMS messages and calls, device status, running processes, and more. This work claims that an accuracy of 93.75% is achieved.

In the research (Aung and Zaw, 2013), the authors propose a framework and test a variety of machine learning algorithms to analyze features based on Android events and permissions. Experimental results from a dataset of some 500 malware samples yield a maximum accuracy of 91.75% for a random forest model.

In the paper (Afonso et al., 2015), the authors propose a dynamic analysis technique that is focused on the frequency of system and API calls. A large number of machine learning techniques are tested on a dataset of about 4500 malicious Android apps. The authors give accuracy results ranging from 74.53% to 95.96%. Again, a random forest algorithm achieves the best results.

The research (Enck et al., 2010) discusses a dynamic analysis tool, TaintDroid. This sophisticated system analyzes network traffic to search for anomalous behavior—the research is in a similar vein as (Feng et al., 2014), but with the emphasis on efficiency. Another Android system call analysis technique is considered in (Dimjašević et al., 2015).

Our work is perhaps most closely related to the research in (Sugunan et al., 2018) and (Kapratwar et al., 2017) which, in turn, built on the groundbreaking work of (Arp et al., 2014) and (Schmeelk et al., 2015), as well as that in (Zhou et al., 2012). In (Arp et al., 2014), for example, an accuracy of 93.9% is attained over a dataset of 5600 malicious Android apps. The paper (Kapratwar et al., 2017) considers static and dynamic analysis of Android malware based on permissions and API calls, respectively. A robustness analysis is presented, and it is suggested that malware writers can most likely defeat detectors that rely on permissions. We provide a more careful analysis in this paper and find that such is not the case.

3 EXPERIMENTS AND RESULTS

In this section, we first discuss our datasets and feature extraction process. Then we turn our attention to feature engineering, that is, we determine the most significant features for use in our experiments. We also discuss our experimental design before presenting results from a wide variety of experiments.

3.1 Datasets

We use the Android Malware Genome Project (Zhou and Jiang, 2012) dataset. This data consists mainly of apk files obtained from various malware forums and Android markets—these samples have been widely used in previous research. Labels are included, which specify the family to which each sample belongs. Thus, the data can be used for both binary classification (i.e., malware versus benign) and the multiclass (i.e., family) classification problems.

For our benign dataset, we crawled the PlayDrone project (PlayDrone, 2018), as found on the Internet Archive (Internet Archive, 2018). The resulting apk files might include malicious samples. Therefore, we used Androguard (Androguard, 2018) to filter broken and potentially malicious apk files. Table 1 gives the number of malware and benign samples that we obtained. These samples will be used in our binary classification (malware versus benign) and multiclass (malware family) experiments discussed below.

Table 1: Datasets.

Experiment      Sample type   Number
Detection       Malware          989
Detection       Benign          2657
Classification  Malware         1260

Figure 1: Binary classification architecture.

3.2 Feature Extraction

To extract static features, we need to reverse engineer the apk files. We again use Androguard for this reverse engineering task. The manifest file, AndroidManifest.xml, contains numerous potential static features; here we focus on the permissions requested by an application.

From the superset of malware and benign samples, we find that there are 230 distinct permissions. Thus, for each apk, a feature vector is generated based on these permissions. The feature vector is simply a binary sequence of length 230, which indicates whether each of the corresponding permissions is requested by the application or not. Along with each feature vector, we have a label of +1 or −1, indicating whether the sample is malware or benign, respectively. The overall architecture, in the case of binary classification, is given in Figure 1.
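The feature encoding just described can be sketched as follows. This is an illustrative reconstruction, not the authors' code: it assumes the requested permissions have already been extracted from each sample's AndroidManifest.xml (e.g., with Androguard), and the function name and toy permissions are hypothetical.

```python
# Sketch of the binary permission feature vectors described in Section 3.2.
# Assumes each sample's permission set was already extracted (e.g., via
# Androguard); names and toy data below are illustrative only.
def build_feature_vectors(samples, all_permissions):
    """Map each sample's permission set to a 0/1 vector with a +1/-1 label.

    samples: list of (permission_set, is_malware) pairs
    all_permissions: sorted list of distinct permissions observed in the data
    """
    X, y = [], []
    for perms, is_malware in samples:
        X.append([1 if p in perms else 0 for p in all_permissions])
        y.append(+1 if is_malware else -1)  # +1 = malware, -1 = benign
    return X, y

# Toy usage with 3 of the (in the paper, 230) distinct permissions:
perms = ["READ_SMS", "SEND_SMS", "WRITE_SMS"]
samples = [({"READ_SMS", "WRITE_SMS"}, True), (set(), False)]
X, y = build_feature_vectors(samples, perms)
# X == [[1, 0, 1], [0, 0, 0]] and y == [1, -1]
```

In the paper's setting, `all_permissions` would hold the 230 distinct permissions, so each row of `X` is a binary sequence of length 230.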
For the multiclass (family) classification problem, essentially the same process is followed as for the binary classification case. However, we only examine malware samples, and over our malware dataset, we find that only 118 distinct permissions occur. Thus, the feature vectors for the multiclass problem are of length 118.

3.3 Feature Engineering

It is likely that many of the features under consideration (i.e., permissions) provide little or no discriminating information with respect to the malware versus benign or the malware classification problem. It is useful to remove such features from the analysis, as they essentially act as noise, and can therefore cause us to obtain worse results than we would with a smaller, but more informative, feature set. It is also useful to remove extraneous features so that scoring is as efficient as possible. Consequently, our immediate goal is to determine features that are of no value for our analysis, and remove them from subsequent consideration.

There are several techniques for determining feature significance. Here we consider two distinct approaches to this problem. First, we use information gain to reduce the feature set. Second, we use recursive feature elimination (RFE) based on a linear SVM. Information gain is easily computed and gives us a straightforward means of eliminating features. RFE is somewhat more involved, but accounts for feature interactions in a way that a simple information gain calculation cannot.

The information gain (IG) provided by a feature is defined as the expected reduction in entropy when we branch on that feature. In the context of a decision tree, information gain can be computed as the entropy of the parent node minus the average weighted entropy of its child nodes. We measure the information gain for each feature, and select features in a greedy manner. In a decision tree, this has the desirable effect of putting decisions based on the most informative features closest to the root. This is desirable, since the entropy is reduced as rapidly as possible, and enables the tree to be simplified by trimming features that provide little or no gain.

For our purposes, we simply use the information gain to reduce the number of features, then apply various machine learning techniques to this reduced feature set. Based on the information gain, we selected the 74 highest ranked features—the top 10 of these features are given in Table 2. Features that ranked outside the top 74 provided no improvement in our results.

Table 2: Permissions ranked by IG.

Score    Permission
0.30682  READ SMS
0.28129  WRITE SMS
0.17211  READ PHONE STATE
0.15197  RECEIVE BOOT COMPLETED
0.14087  WRITE APN SETTINGS
0.13045  RECEIVE SMS
0.10695  SEND SMS
0.10614  CHANGE WIFI STATE
0.10042  INSTALL PACKAGES
0.10019  RESTART PACKAGES

As mentioned above, we also reduce the feature set using RFE based on a linear SVM. In a linear SVM, a weight is assigned to each feature, with the weight signifying the importance that the SVM attaches to the feature. For our RFE approach, we eliminate the feature with the lowest linear SVM weight, then train a new SVM on this reduced (by one) feature set. Then we again eliminate the feature with the lowest SVM weight, and train a new linear SVM on this reduced feature set. This process is continued until a single feature remains, and in this way, we obtain a complete ranking of the features. The potential advantage of this RFE technique is that it accounts for feature interactions among all of the reduced feature sets. The top 10 features obtained using RFE based on a linear SVM are listed in Table 3.

Table 3: Permissions ranked by RFE using a linear SVM.

Rank  Permission
 1    WRITE APN SETTINGS
 2    WRITE CALENDAR
 3    WRITE CALL LOG
 4    WRITE CONTACTS
 5    WRITE INTERNAL STORAGE
 6    WRITE OWNER DATA
 7    WRITE SECURE SETTINGS
 8    WRITE SETTINGS
 9    WRITE SMS
10    WRITE SYNC SETTINGS

In Figure 2, we give the cross validation score of the linear SVM as a function of the number of features, as obtained by RFE. We see that the top 82 features give us an optimal result—additional features beyond this number provide no benefit. Consequently, we use the 82 top RFE features in our experiments below.
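Both feature-ranking approaches described above can be sketched with scikit-learn on synthetic stand-in data. This is an illustration under stated assumptions, not the authors' code: mutual information serves as the information gain score for binary features, and `RFE` with `step=1` implements the one-feature-at-a-time SVM elimination.

```python
# Illustrative feature ranking on synthetic binary "permission" features:
# information gain via mutual information, and RFE with a linear SVM.
import numpy as np
from sklearn.feature_selection import RFE, mutual_info_classif
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(300, 20)).astype(float)
y = (X[:, 3] + X[:, 7] > 1).astype(int)  # labels driven by features 3 and 7

# Information gain: score each feature independently, keep the top k.
ig = mutual_info_classif(X, y, discrete_features=True, random_state=0)
top_ig = np.argsort(ig)[::-1][:5]

# RFE: drop the lowest-weight SVM feature, retrain, and repeat (step=1)
# until the requested number of features remains; ranking_ records the
# order in which features were eliminated.
rfe = RFE(LinearSVC(max_iter=20000), n_features_to_select=5, step=1).fit(X, y)

print(sorted(int(i) for i in top_ig), int(rfe.ranking_.max()))
```

Since the synthetic labels depend only on features 3 and 7, both rankings place those two features at the top; in the paper's setting, the analogous run over all 230 permission features yields the rankings in Tables 2 and 3.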
Figure 2: Recursive feature elimination (RFE).

3.4 Binary Classification

In this section, we discuss our binary classification experiments. We consider both the IG features and the RFE features. The resulting precision for each of the techniques discussed in Section 2.1 is plotted in Figure 3. For this experiment, the number of samples in the malware and benign datasets is as listed in Table 1, where we see that there is a total of 3647 samples. Here, and in all subsequent experiments, we use 5-fold cross validation. Cross validation serves to maximize the number of test cases, while simultaneously minimizing the effect of any bias that might exist in the training data (Stamp, 2017b).

[Figure 3: training and testing precision (IG features) for random forest, ANN, linear SVM, J48, LMT, random tree, AdaBoost, and multinomial naive Bayes.]

Given a scatterplot of experimental results, an ROC curve is obtained by graphing the true positive rate versus the false positive rate, as the threshold varies through the range of values. The area under the ROC curve (AUC) is between 0 and 1, inclusive, and can be interpreted as the probability that a randomly selected positive instance scores higher than a randomly selected negative instance (Stamp, 2017b). In Figure 4, we give the AUC statistic for the same set of IG feature experiments that we have summarized in Figure 3.

Figure 4: Machine learning comparison based on AUC (IG features).

We repeated the experiments above using the 82 RFE features, rather than the 74 IG features. The precision results for these machine learning experiments are given in Figure 5, while the corresponding AUC results are summarized in Figure 6.

[Figure 5: training and testing precision (RFE features) for the same set of machine learning techniques.]

[Figure 6: training and testing AUC (RFE features) for the same set of machine learning techniques.]
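For concreteness, the ANN configuration from Section 2.1 combined with the 5-fold cross validation and AUC scoring used above can be sketched with scikit-learn. This is a stand-in on synthetic data, not the authors' implementation; unspecified settings (solver, batch size) are scikit-learn defaults, and α = 0.001 corresponds to `learning_rate_init`.

```python
# Sketch: the two-hidden-layer ANN of Section 2.1, scored with 5-fold
# cross validation and AUC (synthetic stand-in for the permission data).
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(1)
X = rng.integers(0, 2, size=(300, 74)).astype(float)  # 74 binary "IG" features
y = (X[:, 0] + X[:, 1] + rng.random(300) > 1.5).astype(int)

# Two hidden layers of 10 neurons each, ReLU activations, 100 epochs,
# learning rate 0.001; for binary problems, MLPClassifier applies a
# sigmoid (logistic) output, matching the configuration in the text.
ann = MLPClassifier(hidden_layer_sizes=(10, 10), activation="relu",
                    learning_rate_init=0.001, max_iter=100, random_state=0)

auc = cross_val_score(ann, X, y, cv=5, scoring="roc_auc")
print(auc.mean())  # mean AUC over the 5 folds
```

`cross_val_score` with `cv=5` performs the stratified 5-fold split, so every sample is used for testing exactly once, which is the bias-reduction property noted above.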
be small), and the samples are likely to be highly skewed towards the benign case.
Figure 7: ANN results for various malware to benign ratios.
[Panels: (a) ratio of 1:3 (200:600, 400:1200, 800:2400), (b) ratio of 1:6 (100:600, 200:1200, 400:2400), and (c) ratio of 1:12 (100:1200, 200:2400), where each pair gives the number of malware to benign samples.]

[Figure: multiclass classification results, with panels (a) precision and (b) AUC (training and testing) for random forest, J48, LMT, linear SVM, and AdaBoost.]

Figure 9: Distribution of malware families.
[Bar chart of samples per family; the largest families include DroidKungFu3 (309), AnserverBot (187), BaseBridge (122), DroidKungFu4 (96), Geinimi (69), Pjapps (58), KMin (52), GoldDream (47), and DroidDreamLight (46), while many families are represented by only one or two samples.]

[Confusion matrix for the multiclass (family) classification experiments, giving the percentage of each true family assigned to each predicted family. Most families are classified with 100% accuracy; the largest confusion occurs for Asroot, BaseBridge, DroidKungFu2, DroidDream, DroidDreamLight, and YZHC.]
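Per-family results such as those shown above can be summarized with a row-normalized confusion matrix. A minimal sketch using scikit-learn, with toy labels (the family names are reused here for illustration only, not taken from the actual experimental output):

```python
# Sketch: row-normalized confusion matrix for family classification,
# analogous to the per-family percentages shown above (toy labels only).
import numpy as np
from sklearn.metrics import confusion_matrix

families = ["BaseBridge", "DroidKungFu2", "Geinimi"]
y_true = ["BaseBridge", "BaseBridge", "DroidKungFu2", "Geinimi", "Geinimi"]
y_pred = ["BaseBridge", "DroidKungFu2", "DroidKungFu2", "Geinimi", "Geinimi"]

cm = confusion_matrix(y_true, y_pred, labels=families).astype(float)
cm_pct = 100 * cm / cm.sum(axis=1, keepdims=True)  # each row sums to 100%
print(np.round(cm_pct))
```

Each row corresponds to a true family and each column to a predicted family, so the diagonal gives the per-family classification accuracy.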
that is, adding unnecessary permissions that are common among benign apps, is also of limited value. We conclude that features based on permissions are likely to remain a viable option for detecting Android malware.

Our experimental results also show that malware detection on an Android device is practical, since the necessary features can be extracted and scored efficiently. For example, using an ANN on a reduced feature set, we can obtain an AUC of 0.9920 for the binary classification problem. And even in the case of highly skewed data—as would typically be expected in a realistic scenario—an ANN can attain a testing accuracy in excess of 96%.

The malware classification problem is inherently more challenging than the malware detection problem. But even in this difficult case, we obtained a testing accuracy of almost 95%, based on a random forest. It is worth noting that a random forest also performs well for binary classification, with about 97% testing accuracy. A random forest requires significantly less computing power to train, as compared to an ANN, and this might be a factor in some implementations, although training is often considered one-time work.

For future work, it would be interesting to further explore deep learning for Android malware detection, based on permissions. For ANNs, there are many parameters that can be tested, and it is possible that the ANN results presented in this paper can be significantly improved upon. As another avenue for future work, recent research has shown promising malware detection results by applying image analysis techniques to binary executable files; see, for example, (Huang et al., 2018; Yajamanam et al., 2018). As far as the authors are aware, such analysis has not been applied to the mobile malware detection or classification problems.
REFERENCES

Abah, J., V, W. O., B, A. M., M, A. U., and S, A. O. (2015). A machine learning approach to anomaly-based detection on Android platforms. https://arxiv.org/abs/1512.04122.

Afonso, V. M., de Amorim, M. F., Grégio, A. R. A., Junquera, G. B., and de Geus, P. L. (2015). Identifying Android malware using dynamically obtained features. Journal of Computer Virology and Hacking Techniques, 11(1):9–17.

Androguard (2018). Androguard: Github repository. https://github.com/androguard/androguard.

Android Statistics (2017). Android statistics marketers need to know. http://mediakix.com/2017/08/android-statistics-facts-mobile-usage/.

Arp, D., Spreitzenbarth, M., Hubner, M., Gascon, H., and Rieck, K. (2014). DREBIN: Effective and explainable detection of Android malware in your pocket. In Proceedings of the 2014 Network and Distributed System Security Symposium, NDSS 2014. The Internet Society. http://wp.internetsociety.org/ndss/wp-content/uploads/sites/25/2017/09/11_3_1.pdf.

Aung, Z. and Zaw, W. (2013). Permission-based Android malware detection. International Journal of Scientific & Technology Research, 2(3).

Bennett, K. P. and Campbell, C. (2000). Support vector machines: Hype or hallelujah? SIGKDD Explorations, 2(2):1–13.

Breiman, L. and Cutler, A. (2001). Random forests. https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm.

Damodaran, A., Troia, F. D., Visaggio, C. A., Austin, T. H., and Stamp, M. (2017). A comparison of static, dynamic, and hybrid analysis for malware detection. Journal of Computer Virology and Hacking Techniques, 13(1):1–12.

Dimjašević, M., Atzeni, S., Ugrina, I., and Rakamarić, Z. (2015). Evaluation of Android malware detection based on system calls. Technical Report UUCS-15-003, School of Computing, University of Utah. http://www.cs.utah.edu/docs/techreports/2015/pdf/UUCS-15-003.pdf.

Enck, W., Gilbert, P., Chun, B.-G., Cox, L. P., Jung, J., McDaniel, P., and Sheth, A. N. (2010). TaintDroid: An information-flow tracking system for realtime privacy monitoring on smartphones. In Proceedings of the 9th USENIX Conference on Operating Systems Design and Implementation, OSDI'10, pages 393–407. USENIX Association.

Feng, Y., Anand, S., Dillig, I., and Aiken, A. (2014). Apposcopy: Semantics-based detection of Android malware through static analysis. In Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering, pages 576–587.

Fuchs, A. P., Chaudhuri, A., and Foster, J. S. (2009). SCanDroid: Automated security certification of Android applications. https://www.cs.umd.edu/~avik/papers/scandroidascaa.pdf.

Huang, W., Troia, F. D., and Stamp, M. (2018). Robust hashing for image-based malware classification. In ICETE (1), pages 617–625. SciTePress.

Internet Archive (2018). Internet Archive. https://archive.org.

Kapratwar, A., Troia, F. D., and Stamp, M. (2017). Static and dynamic analysis of Android malware. In Proceedings of the 1st International Workshop on Formal Methods for Security Engineering, ForSE 2017, in conjunction with the 3rd International Conference on Information Systems Security and Privacy (ICISSP 2017), pages 653–662. SciTePress. http://www.scitepress.org/DigitalLibrary/PublicationsDetail.aspx?ID=mI9FBvhgap4=&t=1.

Landwehr, N., Hall, M., and Frank, E. (2005). Logistic model trees. Machine Learning, 59(1-2):161–205.

Lin, D. and Stamp, M. (2011). Hunting for undetectable metamorphic viruses. Journal in Computer Virology, 7(3):201–214.

Malware Forecast (2017). Malware forecast: The onward march of Android malware. https://nakedsecurity.sophos.com/2017/11/07/2018-malware-forecast-the-onward-march-of-android-malware/.

PlayDrone (2018). PlayDrone: A measurement study of Google Play. https://systems.cs.columbia.edu/projects/playdrone/.

Quinlan, R. (2018). Software available for download. http://www.rulequest.com/Personal/.

Schmeelk, S., Yang, J., and Aho, A. (2015). Android malware static analysis techniques. In Proceedings of the 10th Annual Cyber and Information Security Research Conference, CISR '15, pages 5:1–5:8, New York, NY, USA. ACM.

Stamp, M. (2017a). Boost your knowledge of AdaBoost. https://www.cs.sjsu.edu/~stamp/ML/files/ada.pdf.

Stamp, M. (2017b). Introduction to Machine Learning with Applications in Information Security. Chapman and Hall/CRC, Boca Raton.

Stamp, M. (2018). Deep thoughts on deep learning. https://www.cs.sjsu.edu/~stamp/ML/files/ann.pdf.

Sugunan, K., Gireesh Kumar, T., and Dhanya, K. A. (2018). Static and dynamic analysis for Android malware detection. In Rajsingh, E. B., Veerasamy, J., Alavi, A. H., and Peter, J. D., editors, Advances in Big Data and Cloud Computing, pages 147–155, Singapore. Springer Singapore.

Yajamanam, S., Selvin, V. R. S., Troia, F. D., and Stamp, M. (2018). Deep learning versus gist descriptors for image-based malware classification. In ICISSP, pages 553–561. SciTePress.

Zhou, Y. and Jiang, X. (2012). Dissecting Android malware: Characterization and evolution. In Proceedings of the 2012 IEEE Symposium on Security and Privacy, SP '12, pages 95–109, Washington, DC, USA. IEEE Computer Society.

Zhou, Y., Wang, Z., Zhou, W., and Jiang, X. (2012). Hey, you, get off of my market: Detecting malicious apps in official and alternative Android markets. In 19th Annual Network and Distributed System Security Symposium, NDSS 2012.