
Information Processing Letters 117 (2017) 19–24

Contents lists available at ScienceDirect

Information Processing Letters


www.elsevier.com/locate/ipl

Research on dynamic heuristic scanning technique and the application of the malicious code detection model
Zhang Bo a, Li Qianmu a,∗, Ma Yuanyuan b
a School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China
b Global Energy Interconnection Research Institute, Nanjing 210003, China

Article info

Article history:
Received 7 April 2016
Accepted 30 June 2016
Available online 7 July 2016
Communicated by X. Wu

Keywords:
Security in digital systems
Software engineering
Performance evaluation

Abstract

With the rapid development of computer technology, people pay more attention to the security of computer data and the computer virus has become a chief threat to computer data security. By using an antivirus system that can identify randomly generated computer viruses and on the basis of the basic characteristics of the computer code, this paper investigates the heuristic scanning technique. This paper proposes the minimum distance classifier and detection model through the analysis of the malicious code. This model can identify unknown feature codes of illegal procedures and construct a healthy network environment by using a combination of model and experimental method, which can intercept the illegal virus program in the installation and operation stages.

© 2016 Elsevier B.V. All rights reserved.

* Corresponding author.
E-mail address: lqmnust@sina.com (Q. Li).
http://dx.doi.org/10.1016/j.ipl.2016.06.014
0020-0190/© 2016 Elsevier B.V. All rights reserved.

1. Introduction

The rapid development of networks has made information sharing possible on a global scale and has significantly changed how people work and live. With the wide use of networks in finance, defense, education, and other fields, several security risks have also emerged for network users. Network security has become a major issue in the development of the information society. Research on malicious code therefore contributes significantly to improving network security.

There are several lines of research on malicious code detection technology, such as the linkage of firewall and intrusion detection technology, active defense technology, static signature detection technology, and behavior analysis technology [1]. Among them, the main technology is behavior analysis, which can detect the signatures of unknown illegal procedures. Furthermore, it is advantageous as it can minimize the behavior analysis. Johannes Kinder and coworkers described malicious code by using computation tree logic (CTL); through the abstract generalization of CNF, this method works well for proactive inspection, but it can be applied only at the level of assembly instructions. Zhangboyun used Naive Bayes and K-NN algorithms to detect unknown viruses and also used rough sets to simplify the features while avoiding loss of information. Scholars from Germany placed malicious code in a virtual machine environment and analyzed the code by tracking program behavior. In response to this research at home and abroad, hackers, in order to increase the survivability of malicious code, adopted anti-debugging techniques that check whether the code is being debugged. Therefore, in the context of malicious behavior, security experts are still needed to study and analyze the resulting data, and this judgment process consumes much time [2].

On the basis of previous research, this paper conducts further research on malicious code detection technology. It focuses on the analysis of malicious code, discusses methods for describing malicious behavior, and applies behavior analysis technology in a virus detection model. This paper provides guidance for future research in this field [3].

2. Research on dynamic heuristic scanning technique

2.1. Dynamic heuristic scanning technique

Dynamic heuristic scanning is a behavior-based technique that monitors a running program and restricts its dynamic behavior. While a program runs, malicious and illegal procedures that conflict with normal procedures are often generated; the dynamic heuristic scanning technique intercepts and stops them.

2.1.1. Characteristics of illegal viruses

According to the analysis and identification performed by the dynamic heuristic scanning technique, an illegal virus usually has the following characteristics:

(1) Illegal procedures invade memory and modify the total system memory to remain concealed from the disk operating system (DOS).
(2) While switching between programs and viruses, the antivirus system executes commands before the host program.
(3) Boot viruses are carried by the guidance virus; the start-up and executive commands are obtained by the main Boot sector and the Boot sector. Simultaneously, the virus can occupy the INT13 interrupt and set up the virus code before system setup is completed.
(4) The viruses can also invade certain files, such as EXE and BAT files, to tamper with them.

Because illegal procedures exhibit these behavior characteristics, the dynamic heuristic scanning technique can test for them [4].

2.1.2. Principle of dynamic heuristic scanning technique

Because the dynamic heuristic scanning technique can identify malicious code, it has become a hot spot in research on computer antivirus software. This technique closely monitors the operating system (OS) and preserves the normal operation of the system. When the system32 files of the OS behave abnormally, network port traffic increases and an unknown program runs on the computer; the dynamic heuristic scanning technique finds it by analyzing the software, and hence, it is widely used in the field of antivirus software.

2.1.3. Role of dynamic heuristic scanning technique

(1) The technique scans for unauthorized behavior inside the network and sends a warning to network users.
(2) It monitors the user's network connections; large-scale bandwidth usage depletes network resources. By analyzing and identifying unauthorized file-sharing software, the technique ensures the safety of users whose Internet usage is within a certain range.
(3) The technique can detect illegal procedures in unauthorized P2P applications, which secretly access the external network, and it can detect network-cheating behavior that transmits data over non-standard channels.
(4) The technique can identify potentially dangerous programs; for example, it can detect an attacker's use of illegal procedures or malformed packets. Furthermore, it can make accurate judgments [5].

3. Research on the detection model based on dynamic heuristic scanning technique

3.1. Establishment of the model

Detection indices are the basis for judging the merits of the model's test results. In this paper, test results are assessed mainly through false positives and false negatives. A false positive (false alarm) treats a legitimate program as malicious code; a false negative (missed report) treats malicious code as a normal legitimate program [6].

Let N be the number of programs to be tested, m the number of malicious codes, and n the number of legitimate programs, so that N = m + n, with all three values positive. If p of the n legitimate programs are falsely reported as malicious code, then

$$ \text{False alarm rate} = (p/n) \times 100\% \tag{1} $$

If q of the m malicious code programs are judged to be normal legitimate programs, then

$$ \text{Rate of missing report} = (q/m) \times 100\% \tag{2} $$

3.2. Detection using dynamic heuristic scanning technique

In this paper, we chose four ROC indices from medical statistics (Table 1), whose interrelationship is shown in Table 2.

Table 1
ROC test index of the statistic.

No. | Display | Type
(A) | True | The virus detection model identifies the virus
(B) | False | The virus detection model identifies a virus as a legitimate program
(C) | True | The detection model identifies a legitimate program as legitimate
(D) | False | The detection model identifies a legitimate program as a virus

As shown in Table 2, the rate at which malicious code is correctly detected, the true positive rate, is given as

$$ \mathrm{TPF} = a/(a + c) = 100\% - \text{False negative rate} \tag{3} $$

Correspondingly, the rate of false detections is the false positive rate (FPF), given by Eq. (4).

Table 2
Relationship between four types of ROC indices.

Test results | Sample whitelist (malware negatives) | Sample blacklist (malware positives) | Total
Negative test (−) | True negatives: amount a | False negatives: amount b | a + b (test negatives)
Positive test (+) | False positives: amount c | True positives: amount d | c + d (test positives)
Total | a + c (actual number of white samples) | b + d (actual number of black samples) | a + b + c + d (all samples)
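As an illustration only, the four cell counts of a table such as Table 2 can be turned into detection rates with a few lines of Python. The sketch below uses the conventional confusion-matrix definitions with malware as the positive class; the names `ConfusionCounts` and `rates` and the example counts are hypothetical and are not taken from the paper's experiments.

```python
from dataclasses import dataclass


@dataclass
class ConfusionCounts:
    """Counts from one detection run; malware is the positive class."""
    tp: int  # malware flagged as malware
    fn: int  # malware passed as legitimate (missed report)
    fp: int  # legitimate program flagged as malware (false alarm)
    tn: int  # legitimate program passed as legitimate


def rates(c: ConfusionCounts) -> dict:
    """Return the conventional rates as fractions in [0, 1]."""
    malicious = c.tp + c.fn    # m in Section 3.1
    legitimate = c.fp + c.tn   # n in Section 3.1
    if malicious == 0 or legitimate == 0:
        raise ValueError("need at least one sample of each class")
    tpr = c.tp / malicious
    fpr = c.fp / legitimate
    return {
        "true_positive_rate": tpr,
        "false_negative_rate": 1.0 - tpr,  # rate of missing report
        "false_positive_rate": fpr,        # false alarm rate
    }


# Illustrative counts only, not the paper's data.
example = rates(ConfusionCounts(tp=90, fn=10, fp=5, tn=95))
```

Naming the counts by meaning rather than by cell letter keeps the sketch independent of how the table is laid out.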

$$ \mathrm{FPF} = b/(b + d) \tag{4} $$

Fig. 1. ROC curve diagram.

Note: In Fig. 1, the X axis represents the false positive rate (FPF) and the Y axis the true positive rate (TPF).

Fig. 1 shows that for a random model the X and Y values are equal, and that the points (FPF, TPF) constitute the ROC curve. Every ROC curve passes through the points (0, 0) and (1, 1). A model with TPF = 1 and FPF = 0 produces the best results. The accuracy of the model is measured by the area under the ROC curve, which lies in [0.5, 1]: an area of 0.5 means the model has no discriminating value, and an area of 1 is the best result [7].

3.3. Establishment and solving of the malicious code detection model

3.3.1. Choice of classification algorithm

The classification algorithm is the minimum distance classifier, which is based on the vector space model. Its main principle is to represent each class by a center vector U_k (k = 1, 2, ..., m), computed as the arithmetic mean of the training samples of that class, where m is the number of classes, X is the data tuple to be classified, and d_k^2 is the distance between X and U_k. X is assigned to the class whose center is nearest. The data tuple to be classified has the form X = {x_1, x_2, ..., x_n, c}, the center vector has the form U_k = {u_k1, u_k2, ..., u_kn, c}, and the class c takes values in {c_1, c_2, ..., c_m} [8].

The key to designing the minimum distance classifier is selecting an effective distance measure. The standardized Euclidean and Mahalanobis distance classifiers work well for these types of data sets.

(1) Euclidean distance classifier:

$$ d_k^2 = (X - U_k)^{T} (X - U_k) \tag{5} $$

Standardized Euclidean distance classifier:

$$ d_k^2 = (X - U_k)^{T} \phi_k^{-1} (X - U_k) \tag{6} $$

$$ \phi_k = \begin{bmatrix} \sigma_{11} & & & \\ & \sigma_{22} & & \\ & & \ddots & \\ & & & \sigma_{nn} \end{bmatrix} \tag{7} $$

where the element \sigma_{ii} (i = 1, 2, ..., n) is the variance of attribute i over the data tuples of class k.

(2) Mahalanobis distance classifier:

$$ d_k^2 = (X - U_k)^{T} \psi_k^{-1} (X - U_k) \tag{8} $$

$$ \psi_k = \begin{bmatrix} \sigma_{11} & \sigma_{12} & \cdots & \sigma_{1n} \\ \sigma_{21} & \sigma_{22} & \cdots & \sigma_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{n1} & \sigma_{n2} & \cdots & \sigma_{nn} \end{bmatrix} \tag{9} $$

where \sigma_{ii} (i = 1, 2, ..., n) is the variance of attribute i over the data tuples of class k, and \sigma_{ij} (i, j = 1, 2, ..., n; i ≠ j) is the covariance of attributes i and j over the data tuples of class k [9].

With the Euclidean distance, neither covariances nor attribute variances need to be computed, so the calculation is simple, but the classification effect is not pronounced. With the Mahalanobis distance, the classification is ideal but requires a longer time. Therefore, in this paper, we use the standardized Euclidean distance in the minimum distance classifier to obtain the highest classification accuracy in the shortest possible time.

3.3.2. Analysis of malicious code behavior (Table 3)

A series of host behaviors executed by malicious code can cause significant damage to the computer system. In this article, we analyze the parameters of API calls on the host to study the host behavior of malicious code [10].

Table 3
Malicious code behavior vector table.

Behavior-related classes | Specific behavior | Related API calls
Process-related behavior class | Forced termination of other processes | TerminateProcess
Process-related behavior class | Create a remote thread in another process | CreateRemoteThread
Process-related behavior class | Start an external program via WinExec | WinExec
Process-related behavior class | Start an external program via the shell, possibly hidden | ShellExecute, ShellExecuteEx
Process-related behavior class | Check whether it is being debugged | IsDebuggerPresent, CheckRemoteDebuggerPresent
File-related classes | Delete files in key directories | DeleteFile
File-related classes | Read e-mail-related files in key directories | ReadFile, ReadFileEx
File-related classes | Write data to PE files in sensitive directories | WriteFile, WriteFileEx
File-related classes | Copy the PE file to the directory | CopyFile, CopyFileEx
File-related classes | Set up a shared folder | NetShareAdd

File writing is one of the operations performed by illegal procedures, but on its own it makes it difficult to determine whether a program is malicious code. If the call parameters are taken into account, differences in the target folder names become visible. Malicious code is typically a program that is more likely to interfere with the computer system [11].

In this paper, the host behavior characteristics of each sample are treated as one tuple of a statistical table, and each tuple defines a feature vector in a multidimensional space.

The malicious code detection model classifies programs into normal programs and malicious code programs. In this paper, mathematical models are used to train on and analyze these two types of samples to solve the malicious code classification problem. Therefore, a two-dimensional table (Table 4) is used to describe the behavior characteristics of each sample: ⟨α_1, α_2, ..., α_n, c⟩ denotes the conjunction of characteristic quantities, where α_i represents a host behavior characteristic and c is the sample type with range {0, 1}, where 1 represents the black sample collection and 0 the normal procedure samples. A characteristic takes the value 1 when the behavior is observed and does not change with the number of API calls.

Table 4
Table of statistical characteristics.

Sample ID | EVENT1 | EVENT2 | ··· | EVENT N | Sample type
1 | 1 | 0 | ··· | 1 | 0
2 | 1 | 1 | ··· | 1 | 1
3 | 0 | 0 | ··· | 0
... | ... | ... | ··· | ... | ...
N | 1 | 1 | ··· | 0 | 0

Note: 1 sample space for black sample set, 0 sample space for normal process.

3.3.3. Experimental procedure

The black sample U is the malicious code sample and the normal application sample is denoted by U′. Sample data are as follows (Table 5): the total number of samples is 1448, divided into samples 1 (black sample) and 0 (normal procedure). The data tuple X = [x_1, x_2, ..., x_n, c]^T is written as a feature vector, where x_i (i = 1, 2, ..., n) is a behavior characteristic and c ∈ {0, 1} is the category.

Table 5
Data of the experimental sample.

Sample type | Test set | Train set | Sample space
Sample 1 | 576 | 242 | 818
Sample 0 | 387 | 243 | 630
Total | 963 | 485 | 1448

The experimental steps are as follows:

(1) Choose the samples 0 and 1 from the training set.
(2) Set the center vectors of samples 0 and 1 as U_k and U_k′ (k = 0, 1), respectively.
(3) Calculate the attribute variance \sigma_{ki}.
(4) Calculate the total variance \sigma_i.
(5) Calculate \sigma_i = \sum_{k=0}^{1} \sigma_{ki} (k = 0, 1; i = 1, 2, ..., n).

Calculating the distance d_k^2 between X and U_k is the key of the experiment. The coefficient of (x_i − u_{ki})^2 has a decisive effect on the contribution of each attribute value to the total distance d_k^2. The coefficients of the non-normalized and normalized Euclidean distances are 1 and 1/\sigma_{ki}, respectively.

The first experimental model is the standardized Euclidean distance model. The variance of attribute i in class k over the training set is \sigma_{ki}; thus

$$ d_k^2 = \sum_{i=1}^{n} \frac{1}{\sigma_{ki} + \varepsilon} (x_i - u_{ki})^2 \tag{10} $$

where \varepsilon is the auxiliary constant whose choice is shown in Table 6.

The second model is the evolution of the first experimental model. The variance of attribute i over the whole training set is \sigma_i; thus

$$ d_k^2 = \sum_{i=1}^{n} \frac{1}{\sigma_i + \varepsilon} (x_i - u_{ki})^2 $$

The third model is a refinement of the second experimental model, with \sigma_i taken as the sum of the class variances; thus

$$ \sigma_i = \sum_{k=0}^{1} \sigma_{ki}, \qquad d_k^2 = \sum_{i=1}^{n} \frac{1}{\sigma_i + \varepsilon} (x_i - u_{ki})^2 \tag{11} $$
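To make the three weighted-distance models concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes that each behavior characteristic is a 0/1 flag per API name, that variances are population variances, and that the squared difference is divided by the variance term plus the auxiliary constant. The event list, the example traces, and the function names (`encode`, `fit`, `variance_term`, `classify`) are hypothetical; only the API names come from Table 3, and the default constant 5e-5 is one of the values listed in Table 6.

```python
# API names from Table 3; using exactly these four as events is an assumption.
EVENTS = ["TerminateProcess", "CreateRemoteThread", "WriteFile", "NetShareAdd"]


def encode(trace):
    """Map an API-call trace to a 0/1 feature row (repeated calls still give 1)."""
    seen = set(trace)
    return [1.0 if api in seen else 0.0 for api in EVENTS]


def column_stats(rows):
    """Return (means, population variances) for each attribute column."""
    means, variances = [], []
    for col in zip(*rows):
        mu = sum(col) / len(col)
        means.append(mu)
        variances.append(sum((v - mu) ** 2 for v in col) / len(col))
    return means, variances


def fit(X, y):
    """Per class k: (center vector U_k, attribute variances sigma_ki)."""
    return {k: column_stats([x for x, label in zip(X, y) if label == k])
            for k in sorted(set(y))}


def variance_term(X, stats, model):
    """Variances used in the denominator for models 1, 2 and 3."""
    if model == 1:   # per-class variance sigma_ki
        return {k: var for k, (_, var) in stats.items()}
    if model == 2:   # variance over the whole training set
        _, whole = column_stats(X)
        return {k: whole for k in stats}
    if model == 3:   # sum of the class variances
        summed = [sum(col) for col in zip(*(var for _, var in stats.values()))]
        return {k: summed for k in stats}
    raise ValueError("model must be 1, 2 or 3")


def classify(x, stats, var_term, eps=5e-5):
    """Return the class whose center has the smallest weighted squared distance."""
    def d2(k):
        center = stats[k][0]
        return sum((xi - ui) ** 2 / (s + eps)
                   for xi, ui, s in zip(x, center, var_term[k]))
    return min(stats, key=d2)


# Hypothetical traces: label 1 = black sample, 0 = normal program.
traces = [
    (["TerminateProcess", "CreateRemoteThread", "WriteFile"], 1),
    (["TerminateProcess", "TerminateProcess", "NetShareAdd"], 1),
    (["CreateRemoteThread", "WriteFile", "NetShareAdd"], 1),
    (["WriteFile"], 0),
    ([], 0),
    (["WriteFile", "WriteFile"], 0),
]
X = [encode(trace) for trace, _ in traces]
y = [label for _, label in traces]
stats = fit(X, y)

unknown = [encode(["CreateRemoteThread", "TerminateProcess"]), encode(["WriteFile"])]
predictions = {m: [classify(x, stats, variance_term(X, stats, m)) for x in unknown]
               for m in (1, 2, 3)}
```

The auxiliary constant keeps the denominator away from zero when an attribute never varies within a class, which happens in this toy data for the normal programs.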

Table 6
Choice of auxiliary constants.

Evaluation index | Auxiliary constant (ε)
0.741 | 5 × 10^-1
0.757 | 5 × 10^-2
0.752 | 5 × 10^-3
0.754 | 5 × 10^-4
0.773 | 5 × 10^-5
0.766 | 5 × 10^-6
0.762 | 5 × 10^-7
0.754 | 5 × 10^-8
0.769 | 5 × 10^-9
0.748 | 5 × 10^-10

The value of the auxiliary constant ε can be selected between 5 × 10^-1 and 5 × 10^-10, with the aim of obtaining a higher Youden's index. Table 6 shows that the values of auxiliary constants ranging from 5 × 10^-5 to 5 × 10^-9 can achieve higher evaluation indices.
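As a small worked example, the selection described above amounts to picking the row of Table 6 with the largest evaluation index. The sketch below copies the values from Table 6; the names `TABLE_6`, `youden_index`, and `best_constant` are hypothetical, and `youden_index` simply restates the definition J = TPR − FPR used in Section 3.4.

```python
# (evaluation index, auxiliary constant) pairs copied from Table 6.
TABLE_6 = [
    (0.741, 5e-1), (0.757, 5e-2), (0.752, 5e-3), (0.754, 5e-4), (0.773, 5e-5),
    (0.766, 5e-6), (0.762, 5e-7), (0.754, 5e-8), (0.769, 5e-9), (0.748, 5e-10),
]


def youden_index(tpr, fpr):
    """Youden's index J = TPR - FPR, with both rates given as fractions."""
    return tpr - fpr


def best_constant(rows):
    """Return the (index, constant) pair with the highest evaluation index."""
    return max(rows, key=lambda row: row[0])


best_index, best_eps = best_constant(TABLE_6)  # (0.773, 5e-05)
```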
Fig. 2. Youden’s index for the detection of malicious code host behavior.

3.4. Evaluation principle of the experiment


In this paper, the true positive rate is the ratio of the number of 1 samples detected as malicious to the total number of 1 samples, and the false positive rate is the ratio of the number of 0 samples detected as malicious to the total number of 0 samples.

Youden's index is the true positive rate minus the false positive rate, that is, J = TPR − FPR. It takes values between −1 and 1, and the closer it is to 1, the higher the accuracy of the detection. This paper is based on the 0–1 feature space, and the evaluation index used is Youden's index.

3.5. Analysis of the experimental results

In this paper, two different training samples, four types of feature classifiers, and variances of the four attributes are established. The aim is to detect the behavior characteristics of malicious code using a user-friendly model. To determine the difference between a sample and the overall sample, we conduct the training experiment 10 times. It can be seen from Fig. 2 that over the 10 experimental runs, the evaluation indices of all four model classifiers are stable. This confirms that the type and quantity of the training set are reasonably distributed and that the centers of the sample features are clear and objectively reflected.

The results show that the detection index of the standardized Euclidean minimum distance classifier is higher than that of the other methods by about 4%, which indicates that this method performs better than the other methods. The deficiency of this method is that its false detection rate on dialog samples is higher than that of the other methods by 0.6%, and hence, this method needs to be improved.

4. Conclusions

This paper focuses on the dynamic heuristic scanning technique and a malicious code detection model. First, the dynamic heuristic scanning technique is analyzed and summarized; it is widely used in the field of antivirus software and can detect malicious code, which accounts for its wide application in the maintenance of network security. Second, the behavioral characteristics of malicious code and the minimum distance classifier are used to establish a model that distinguishes normal programs from malicious viruses by using the standardized Euclidean distance, which makes it possible to assess the detection accuracy for malicious code. Because of the evolution and breeding of malignant viruses, we need more antivirus applications and active defense systems. The model developed in this study is still in its primary stage and needs further optimization and improvement to better serve the intended purpose.

Acknowledgements

This study was supported by the Fundamental Research Funds for the Central Universities (No. 3091601510).

References

[1] Ryuiti Koike, Naoshi Nakaya, Yuji Koi, Development of system for the automatic generation of unknown virus extermination software, in: Proceedings of the 2007 International Symposium on Applications and the Internet, Hiroshima, Japan, 2007.
[2] Hassan Salmani, Mohammad Tehranipoor, Jim Plusquellic, A novel technique for improving hardware Trojan detection and reducing Trojan activation time, IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 6 (2011).
[3] Christopher Kruegel, Increase dynamic coverage, Secure Systems Lab, Technical University, Vienna, Sep. 2007.
[4] Francesco Di Cerbo, Andrea Girardello, Florian Michahelles, Svetlana Voronkova, Detection of malicious applications on Android OS, IEEE Comput. Soc. 11 (2010).
[5] Po-Ching Lin, Ying-Dar Lin, Yuan-Cheng Lai, A hybrid algorithm of Backward Hashing and automaton tracking for virus scanning, IEEE Trans. Comput. (2011) 594–601.
[6] U. Bayer, C. Kruegel, E. Kirda, TTAnalyze: a tool for analyzing malware, in: 15th Annual Conference of the European Institute for Computer Antivirus Research, Vienna, 2006.
[7] Wook Shin, Shinsaku Kiyomoto, Kazuhide Fukushima, Toshiaki Tanaka, A formal model to analyze the permission authorization and enforcement in the Android framework, IEEE Comput. Soc. (2010).
[8] Symantec Security Response Center, http://www.symantec.com/business/security_response, 2010.
[9] Thomas Blasing, Leonid Batyuk, Aubrey-Derrick Schmidt, Seyit Ahmet Camtepe, Sahin Albayrak, An Android application sandbox system for suspicious software detection, in: 2010 IEEE International Conference on Malicious and Unwanted Software, MALWARE, October 2010, pp. 55–62.
[10] Bobby D. Birrer, Richard A. Raines, Rusty O. Baldwin, Mark E. Oxley, Using qualia and multi-layered relationships in malware detection, in: IEEE Symposium on Computational Intelligence in Cyber Security, CICS'09, vol. 7, 2009, pp. 91–98.
[11] K. Ashcraft, D. Engler, Using programmer-written compiler extensions to catch security holes, in: 2002 IEEE Symposium on Security and Privacy, May 2002, pp. 143–159.
