Reversing PDF

UNIVERSITÉ LIBRE DE BRUXELLES
Faculté des Sciences
Département d'Informatique
Digital reverse engineering of executable files.
Obfuscation techniques against patching.
Nikita Veshchikov
Advisor: Prof. Olivier Markowitch
Thesis submitted for the degree of Master in Computer Science (Master en Sciences Informatiques)
Academic year 2010 - 2011
Acknowledgments
First of all, I would like to thank my family for their patience.
I would like to thank my advisor, Olivier Markowitch, for his advice and support.
I am grateful to everyone who helped me edit this paper, especially Tony Osborne
and Julia Zavyalova.
I would also like to thank the people who suggested interesting ideas and new sections
for this work: Liran Lerman and Markus Lindstrom.
A very special thanks goes to everyone who listened to my explanations about reverse
engineering, code obfuscation and error correction on numerous occasions. Thank you
for your patience!
Contents
1 Introduction
1.1 Goal and context
1.2 Organization
1.3 Contributions
1.4 Notations
I Understanding reverse engineering
2 History
2.1 Reasons for reverse engineering
2.2 Reverse engineering in military
2.3 Digital reverse engineering
3 Definitions of reverse engineering
3.1 Intuition behind reverse engineering
3.2 Definition of reverse engineering
3.3 Definition of digital reverse engineering
4 Legal aspects of digital reverse engineering
4.1 Intellectual property protection
4.1.1 Copyright
4.1.2 Patent
II Minimum knowledge required to perform digital reverse engineering
5 Theoretical knowledge
5.1 Programming languages
5.1.1 Determining language used
5.2 Compilers
5.2.1 General changes in the code structure
5.2.2 Changes due to optimization
5.3 Operating systems
6 Reversing tools
6.1 Hex editors
6.2 Sandboxes
6.2.1 Virtual environments
6.3 Disassemblers
6.3.1 Decompilers
6.4 Debuggers
6.5 Monitoring tools
6.6 Dumping tools
6.7 Visual representations
6.8 Automated deobfuscators
6.9 Miscellaneous useful tools
6.9.1 File type recognition
6.9.2 Strings and pattern searching
7 Code obfuscation
7.1 The definition
7.2 The problem
7.2.1 Why obfuscate?
7.3 Anti-reversing techniques
7.3.1 Packing techniques
7.3.2 Control flow obfuscation
7.3.3 Detection of digital reverse engineering
7.3.4 Crashing and confusing reversing tools
7.3.5 Data transformations
7.3.6 Hiding data
7.3.7 Eliminating symbolic information
7.3.8 Human reversers versus automated deobfuscators
7.4 Pushing the reversing problem out of the software world
7.4.1 Program as a service
7.4.2 Cryptoprocessors
7.4.3 Dongles
7.4.4 Trusted computing
7.4.5 Hardware protections summary
8 Applied reversing
8.1 Training
III Contribution
9 Anti-patching
9.1 A known problem
9.2 Existing solutions
9.2.1 Manual checking
9.2.2 Automatic error detection
9.2.3 Check results of computations
9.2.4 Algorithm TPCA: Checker Network
9.3 Error detecting and error correcting codes
9.3.1 The idea behind error detection and correction
9.3.2 Error detecting codes
9.3.3 Error correcting codes
9.4 My addition
9.4.1 The idea
9.4.2 Implementation
9.4.3 Other possible implementations and improvements
9.4.4 Advantages and disadvantages
10 Conclusions
10.1 Anti-patching
10.2 Further work
A Abbreviations
B Images
B.1 History of reverse engineering
B.1.1 Reverse engineering in military
B.1.2 Reverse engineering in digital world
B.2 Flowchart
C Code
C.1 Hello, world
C.2 ComputeHash
C.3 AddChecksum
C.4 SecretValue
C.5 AutoCorrect
C.6 Makefile
D Internship report
Chapter 1
Introduction
1.1 Goal and context
Digital reverse engineering is a sub-domain of reverse engineering. It consists of
extracting from a piece of software the knowledge of how that software works.
Digital reverse engineering is a powerful tool that can be used for good or evil.
For example, it can be used to understand how a malicious program works (and then
to protect a system against this program), or it can be used to crack copyright-protected
software, i.e. to disable its protection. Developers of malware do not want their
programs to be disabled, and developers of copyright-protected software do not want
their programs to be cracked and distributed for free.
That is why almost all software developers are concerned about reverse engineering.
Obfuscation techniques are techniques used for protection against reverse engineering.
Many kinds of obfuscation exist, each aimed at different reversing techniques. One
commonly used reversing technique is called patching.
Patching is the process of modifying a program. It is used by software developers to
update their programs, but it is also used by reverse engineers (especially by crackers,
who use patching to disable the protection of a program). Several obfuscation
techniques can be used against patching.
This work has three main goals. The first is to understand the basic concepts
of digital reverse engineering. The second is to propose an anti-patching technique
based on error correcting codes and to implement a proof of concept program showing
that this technique works. The third is to suggest improvements and other possible
implementations of anti-patching techniques which use error correcting codes.
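To make the second goal more concrete, here is a minimal sketch of the general idea of
using redundancy to undo a patch. It is not the proof of concept program of Chapter 9:
it assumes a simple threefold repetition code with a bytewise majority vote, and the
function names (encode, correct) and the sample bytes are purely illustrative.

```python
from collections import Counter


def encode(code: bytes, copies: int = 3) -> list[bytes]:
    """Store several identical copies of the protected bytes (repetition code)."""
    return [bytes(code) for _ in range(copies)]


def correct(copies: list[bytes]) -> bytes:
    """Rebuild the original bytes by taking a majority vote at every position."""
    restored = bytearray()
    for column in zip(*copies):
        value, _count = Counter(column).most_common(1)[0]
        restored.append(value)
    return bytes(restored)


# Hypothetical protected fragment: an x86 short "jne" (0x75) guarding a check.
original = bytes([0x75, 0x0C, 0x90])
stored = encode(original)

# A cracker patches one copy, turning the conditional jump into "jmp" (0xEB).
patched = bytearray(stored[0])
patched[0] = 0xEB
stored[0] = bytes(patched)

restored = correct(stored)
print(restored == original)  # prints True: the patched byte is outvoted
```

With three copies, a single patched copy is always outvoted. A repetition code is the
simplest possible choice and is used here only because it fits in a few lines.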
1.2 Organization
Part I of this paper introduces digital reverse engineering. Chapter 2 describes two
cases from the history of reverse engineering. Chapter 3 defines reverse engineering
and digital reverse engineering. Chapter 4 gives a quick overview of the legislation
related to reverse engineering.
Part II describes the minimum knowledge needed in order to do digital reverse
engineering. Chapter 5 presents the theoretical background that any reverser must
have. Chapters 6 and 7 give an overview of reversing tools and of the techniques
used to prevent digital reverse engineering (also called obfuscating techniques).
Chapter 8 points out that theoretical knowledge is not enough to do digital reverse
engineering and explains how to start practicing it.
Finally, Part III describes the proof of concept program and concludes this paper.
Chapter 9 presents patching, explains why some developers are concerned about it,
presents existing solutions against patching and explains how error correcting codes
could be used against patching. Chapter 10 concludes this paper by pointing out
that the proof of concept program (described in Chapter 9) works well.
A list of abbreviations used in this paper is given in Appendix A on page 91.
The source code of the proof of concept program is given in Appendix C on page 96.
Appendix D is the internship report at Forensic Technology Solutions of PriceWaterhouseCoopers.
1.3 Contributions
The list of contributions:
1. Implementation of a proof of concept program which uses an error correcting code
in order to restore its original code if it has been tampered with (see Chapter 9 and
Section 9.4.2).
2. List of difficulties related to the use of error correcting codes against patching and
suggestions on how to overcome these difficulties (see Section 9.4.3).
1.4 Notations
The terms "reverse engineering" and "reversing" are used interchangeably in this paper,
as are the terms "reverse engineer" and "reverser". The abbreviations RE and DRE are
used for reverse engineering and digital reverse engineering respectively.
Part I
Understanding reverse engineering
Chapter 2
History
The history of reverse engineering is older than might appear at first glance. Reverse
engineering was used long before the invention of computers. Before delving into history,
let us see why people do reverse engineering.
2.1 Reasons for reverse engineering
One of the most relevant questions that one might ask is: "Why do people want or need
to reverse-engineer something?"
There are many answers to this question. Here is a list of the most common goals
achieved by reverse engineering (RE):
• Creation and completion of documentation for an existing product or device.
Sometimes the existing documentation for a device or a piece of software is not
clear enough, has been lost, or was never written, and nobody knows how the
device works.
• Interoperability. When documentation is lacking, RE might help to design a
device that is able to use or otherwise interact with an existing component or
product.
• Re-designing. Incorporation of new functionalities into an existing device or
program.
• Product analysis. For example, to identify potential patent/copyright violations
or to do malware analysis.
• Security auditing. The best way to audit a secured system is to try to break it.
• Espionage. Military or commercial espionage to acquire sensitive data.
• Protection removal. Copyright protection or access restriction removal.
• Replication of an existing device or object.
• Learning. Academic purposes, learning from others' successes and mistakes.
• Mere curiosity.
2.2 Reverse engineering in military
As with many other technologies (e.g. GPS), the military was instrumental in applying
RE techniques.
The most common military motivations for reverse engineering have been:
• to find vulnerabilities in a potential enemy's equipment
• to produce a copy of a component or an entire machine (e.g. airplane, tank)
• to upgrade existing equipment
There are many known examples of machines that were reverse engineered. Here
is the story of one such case that illustrates the military approach towards reverse
engineering.
On 21 September 1942 the prototype of a new Boeing bomber, the B-29 Superfortress
(see figure B.1 on page 93), had its first flight [57, 56]. Two years later, after
substantial modifications, the Boeing B-29 became one of the best-performing bombers
in the world (in terms of range and gross weight). It was mainly used by the United
States Army Air Forces (USAAF) in the war against Japan. At that time the Soviet Union
did not have any comparable bomber. The USA refused Soviet requests to purchase this
aircraft. However, B-29 bombers flew many missions to Japan and sometimes had to make
emergency landings, occasionally on Soviet territory.
In 1944 four B-29 bombers landed in the USSR. One of them was broken and could
not be repaired, while the three others had only minor damage and were repaired. In this
way the USSR acquired three B-29 bombers, which were then reverse engineered by
the Tupolev Design Bureau in order to build the Tu-4.
One B-29 was given to the Gromov Flight Research Institute for flight training. Based
on the flight tests at this institute, full instructions for the B-29 as well as its flight
characteristics and capabilities were provided. The second B-29 was fully disassembled
by the Tupolev Design Bureau in order to study its parts. The third was left untouched as
an original reference (for more information about the creation of the Tu-4, see [52] and [64]).
A few years later, on 19 May 1947, a prototype of the Tu-4 (see figure B.2, page 93)
had its first flight. The Tu-4 was put into mass production two years later.
2.3 Digital reverse engineering
The history of digital reverse engineering (DRE) began in June 1982 with one of the
most famous cases of reverse engineering - the implementation of the first non-IBM PC
BIOS by Columbia Data Products, see [6].
On 12 August 1981 the first IBM PC was released [59, 34], see figure B.3 (page 94).
Originally it had a 4.77 MHz CPU and 16 kB - 256 kB of memory. It also
had three possible operating systems (OS):
• IBM BASIC / PC-DOS 1.0
• CP/M-86 (later MS-DOS)
• UCSD p-System
The Basic Input/Output System (BIOS) was the only component with an IBM
copyright (see Section 4.1.1).
IBM wanted to be the only PC manufacturer but could hold neither a copyright nor a
patent on the entire IBM PC, because it was designed and manufactured from existing
parts and chips of non-IBM origin. Thus anyone could reproduce the entire hardware of
an IBM PC by simply buying the same chips from the same suppliers that IBM used.
Since putting a license on the hardware was not possible, IBM decided to put a
copyright license on the BIOS.
IBM encouraged other manufacturers to produce add-on modules for the IBM PC.
It released a technical reference manual with diagrams of all the PC's hardware components
as well as the motherboard specifications (footnote 1) for the contributing manufacturers.
In this way IBM hoped to be the only manufacturer able to produce PCs, confident
that no other company could reproduce the BIOS. This turned out to be wrong.
Although many companies wanted to create their own PC-compatible computers,
they were not allowed to use IBM's BIOS. They could not copy IBM's BIOS, as that
was illegal. It was possible to understand the inner workings of the BIOS with the help
of the manual. It was also possible to acquire the original assembly code and then
reverse engineer it. These two sources of information would make it possible to recreate
the BIOS. However, recreating the code via RE was illegal, because lawyers consider the
result a copy: the reverser would have seen the original code, which was under copyright
protection.
Here was the problem: how to create similar software that would be compatible
with all existing hardware without knowing the BIOS?
Many companies had been thinking about how to do this ever since the IBM PC became
a big success. Finally the answer was found, and within a year IBM PC compatible
computers were produced by the following companies:
• Columbia Data Products presented its IBM compatible MPC 1600 "Multi Personal
Computer" in June 1982 [6].
• Compaq Computer Corporation released its version of the BIOS in November 1982.
• Phoenix Software created its BIOS in 1983 and introduced it to the market in May
1984 [34, 42].
Here is the answer to the problem: the copy was made through a process that is
called clean room reverse engineering. The expression "the BIOS was clean-roomed" is
also used.
Definition 1. Clean room reverse engineering (footnote 2) (also called Chinese wall) is a
reversing method that involves two separate groups of engineers. The first group
reverse-engineers the original product and creates its documentation. The second group
creates a new product based on the new documentation.
The idea of clean room reversing is based on the principle of independent invention.
The main principle: in order to copy something, two separate teams are created. The
goal of the first team is to reverse engineer the subject in order to create exhaustive
documentation (diagrams, schemes, flow charts etc.), which is then given to the
second team. The task of the second team is to create a completely new product using
only the documentation produced by the first team.
Footnote 1. Hardware specifications are not enough to produce add-ons; one has to know
how to talk to the PC's motherboard in order to design a new component.
Footnote 2. Do not confuse clean room reverse engineering with clean-room software
engineering. Clean-room software engineering is a software development process that
intends to produce software with a certifiable level of reliability, i.e. it focuses on
defect prevention.
The most challenging part of this process is to ensure, and to be able to prove, that the
members of the second team have never seen or studied the original product. Otherwise
a judge will consider the new product an illegal copy, because it was created
after seeing the original product.
The above example illustrates how to get around copyright protection, but not
patents (see Section 4.1). However, reversing a patented product for security auditing or
for studying could be very interesting.
Nowadays clean room RE is not widely used, as it is far less cost effective than it was
20-30 years ago: in 1982 the BIOS was quite a small piece of code compared to modern
software. Recreating a large modern program from scratch is a very challenging exercise.
Chapter 3
Definitions of reverse engineering
3.1 Intuition behind reverse engineering
Here are two little stories that illustrate the intuition behind reverse engineering, the
process of reverse engineering, and the resemblance between scientific research and
reverse engineering.
Once upon a time...
Alice, a system administrator, was inspecting her network when she found a strange
executable file.
Alice had a friend Bob - an experienced reverse engineer. Alice immediately called Bob
and asked him to come in order to inspect her finding.
Bob was not able to tell directly what it was, but he was curious about Alice's finding,
so he took an image of the hard drive (with the suspicious file) in order to analyse it
in his laboratory.
Bob started his analysis by dating the executable file. He discovered that Alice's
finding was compiled with g++ 1.0.2. The file was a dynamically linked executable of
26 kilobytes.
Then Bob disassembled the code of the program and started to analyse the general
structure of the program. The program contained about 40 different functions.
He found that the program communicates through the network.
Further analysis of the program showed that the program copies itself.
Finally, Bob classified Alice's finding as a computer worm.

Alice, a famous explorer, was exploring a desert when she found strange fossils.
Alice had a friend Bob - an experienced paleontologist. Alice immediately called Bob
and asked him to come in order to inspect her finding.
Bob was not able to tell directly what it was, but he was curious about Alice's finding,
so he extracted the fossils from the ground in order to analyse them in his laboratory.
Bob started his analysis by dating the fossils. He discovered that Alice's finding
lived about 66 million years ago.
Then Bob assembled all the fossils of the skeleton together and started to analyse the
general structure of the skeleton. The skeleton was 12.8 meters long and 4.0 meters
tall. It contained about 150 bones.
He found that the animal was a bipedal reptile - a dinosaur.
Further analysis of the fossils showed that the animal was a predator.
Finally, Bob called Alice's finding Tyrannosaurus.
These two stories are works of fiction and any resemblance between the characters and
persons living or dead is purely coincidental.
The first worm to spread through the Internet was the Morris worm (November 2,
1988). The first time a part of a Tyrannosaurus rex (a tooth) was found was in
1874 near Golden, Colorado.
3.2 Definition of reverse engineering
Different sources give slightly different definitions of reverse engineering.
Here is the definition from the Oxford dictionary [17]:
Reverse engineering is the reproduction of another manufacturer's product following
detailed examination of its construction or composition.
The definition from the Cambridge dictionary [16]:
Reverse engineering is when a company copies the product of another
company by looking carefully at how it is made.
Let us consider a more complete definition, as given in the Free On-Line Dictionary of
Computing [20]:
Reverse engineering is the process of analyzing an existing system to
identify its components and their interrelationships and create representations of the system in another form or at a higher level of abstraction. Reverse engineering is usually undertaken in order to redesign the system for
better maintainability or to produce a copy of a system without access to the
design from which it was originally produced. For example, one might take
the executable code of a computer program, run it to study how it behaved with
different input and then attempt to write a program oneself which behaved
identically (or better). An integrated circuit might also be reverse engineered
by an unscrupulous company wishing to make unlicensed copies of a popular
chip.
Here is a very compact definition that Andrew Huang gives in his book about
reverse engineering of the Xbox [25]:
Reverse engineering is the process of extracting know-how or knowledge
from an artifact
Here we would like to propose the following definition, based on various sources of
information ([19] and those listed above):
Definition 2. Reverse engineering is the process of discovering the technological
principles of a man-made object or system through analysis of its structure and operation.
In other words, reverse engineering is a methodology used to find out how things
work.
When a person tries to study a natural phenomenon by observing and analyzing it,
or by reproducing it in an experiment in order to understand how it works, the whole
process is called scientific research.
RE is not unlike scientific research. RE involves man-made objects or devices,
whereas scientific research usually involves natural things or phenomena.
In both cases the subject can be studied both passively and actively: disassembly
or execution in a virtual environment in reverse engineering, and observations or
experiments in science. In both cases the study is made because the person who undertakes
it does not have access to the relevant documentation or any other kind of explanation
of how the subject works.
3.3 Definition of digital reverse engineering
This paper is about digital reverse engineering (DRE) (also called software reverse
engineering) - a sub-domain of reverse engineering. RE is not necessarily attached to
the IT domain (see the example in Section 2.2), whereas digital reverse engineering is.
Definition 3. Digital (or Software) Reverse Engineering is the reverse engineering of any
part of computer software.
Generally reversers distinguish between two types of digital reverse engineering:
• Binary reverse engineering, e.g. malware reverse engineering, BIOS (see Section
2.3).
• Data reverse engineering, e.g. the proprietary Microsoft Word, Excel and PowerPoint
formats were reverse engineered and now OpenOffice can handle files in these
formats.
Definition 4. Binary reverse engineering (also called code reverse engineering) is the
reverse engineering of executable files.
Definition 5. Data reverse engineering is the reverse engineering of file formats, data
structures and protocols (the structure of fields as well as the order of messages).
Quite often a reverser has to do some data reverse engineering in order to reverse
an executable file, and vice versa. Understanding a data structure helps to understand
how the data is managed, and the algorithms that use the data can reveal its
structure.
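As an illustration of data reverse engineering, the following minimal sketch shows how a
guessed record layout can be written down as a parser once it has been worked out. The
format itself (a "DEMO" magic value, a version field and a length-prefixed payload, all
little-endian) is entirely hypothetical and is not taken from any real file format.

```python
import struct

# Hypothetical layout recovered by inspecting sample files in a hex editor:
#   4 bytes  magic   "DEMO"
#   2 bytes  version (little-endian unsigned)
#   4 bytes  payload length (little-endian unsigned)
#   N bytes  payload
HEADER = struct.Struct("<4sHI")


def parse_record(blob: bytes) -> dict:
    """Decode one record and check that it matches the assumed layout."""
    magic, version, length = HEADER.unpack_from(blob, 0)
    if magic != b"DEMO":
        raise ValueError("unexpected magic bytes")
    payload = blob[HEADER.size:HEADER.size + length]
    if len(payload) != length:
        raise ValueError("truncated payload")
    return {"version": version, "payload": payload}


# Build a sample record with the same layout and parse it back.
sample = HEADER.pack(b"DEMO", 2, 5) + b"hello"
record = parse_record(sample)
print(record)
```

In practice the layout is refined iteratively: each sample file that fails to parse
reveals a wrong assumption about a field.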
Chapter 4
Legal aspects of digital reverse engineering
Legal aspects of digital reverse engineering seem to be a very complicated area, as
there is almost no clarity (see [53, 5, 44, 19, 25]). While legislation on murder, theft,
divorce, inheritance and many other domains is well developed, legislation on RE still
remains quite unclear. This is due to the fact that RE (and IT in general) is a very
young domain compared to the others mentioned above. Therefore reverse engineering is
still considered a grey area (footnote 1).
Here are some examples which illustrate this.
Clean room reverse engineering (see Section 2.3) is legal in most countries, and it
is an obvious way to copy any system which is protected by copyright law (see
Section 4.1.1). However, making copies of copyright-protected material is forbidden by
copyright law.
Another example is the reversing of the Xbox, which is described in [25]. The
reversing was done for learning purposes, but the developers of the original product
were not happy about it.
The final example is about the data reverse engineering of Skype's protocol. Skype
is a proprietary Voice-over-IP (VoIP) program, which is free of charge. The protocol
used by this program to communicate between clients is also proprietary. Nowadays
many people try to reverse Skype's protocol (see [12] and [11]) for different reasons
(see Section 2.1). Skype's owners would like to retain monopoly ownership of the
protocol. On the one hand, it is not a high price to pay, because Skype provides very
good service and has developed a good protocol. On the other hand, the monopoly on
Skype's protocol makes it impossible to check whether personal data is collected (and
used by Skype). There is also another important issue: the security of the protocol
(at present it looks more like security through obscurity, see definition 22).
Maybe it would be better if the structure of Skype's protocol were revealed. In this
way it would be possible to check whether personal data is collected and whether the
protocol is secure.
In many cases it is hard to tell whether RE of a given system is legal or not. However,
some programs are open to being reverse engineered - CrackMes and KeygenMes (see
Chapter 8) - programs specially created in order to be reverse-engineered for studying
purposes, and no legislation clearly forbids it.
Footnote 1. Do not confuse this with the grey area that refers to grey hats. Grey hat, as
well as black hat and white hat, are terms usually used to refer to security experts in
terms of their attitude towards sharing information about security issues. Their
respective attitudes are: black hats - full concealment, white hats - full disclosure;
grey hats are somewhere in the middle.
4.1 Intellectual property protection
Laws that protect intellectual property differ somewhat from country to country. Any
reverser should know whether he is reversing a protected product. He should also be
aware of the difference between a copyright and a patent.
4.1.1 Copyright
Copyright law protects any recorded (written) work of expression from the moment it
is created. It gives the owner exclusive rights to distribute and reproduce his (or
her) product. However, it does not protect against independent invention or against the
use of ideas based on the original work, e.g. it is legally possible to write a program
that does the same thing as long as its code is not a copy of the original program.
Copyright law is very complex. For example, if you buy a program, you own a copy,
but only the copyright owner can copy and distribute it. However, a program is copied
into RAM and into the processor's cache memory during its execution. Copyright
law contains an exception that allows the owner of a copy to copy the program into the
memory of a computer.
Copyright law differs between countries. Some countries have no copyright
law (e.g. Afghanistan). In most countries the length of the copyright is the author's life
plus 70 years (e.g. EU) or 50 years (e.g. Canada), see [60] for copyright lengths. There
also exist several exceptions, e.g. the Berne Convention [1] sets the author's life plus
25 years as the copyright length for photographic works, and 50 years from publication
(or, if not shown, 50 years from creation) as the copyright length for cinematographic works.
4.1.2 Patent
A patent is another commonly used protection for original works. It gives the author
of an invention the exclusive right to make and distribute his invention. Patents also
protect against independent invention.
In return, the inventor has to provide all the information someone would need in order
to recreate the invention. In this way the inventor contributes to the general store of
knowledge of society.
Not all inventions can be patented. In order to be patented an invention has to be
non-obvious and novel.
Nowadays patents are granted for at least 20 years (the first patent protection was
granted for only 14 years), see [25] and [62].
Part II
Minimum knowledge required to perform digital reverse engineering
Chapter 5
Theoretical knowledge
Before studying reverse engineering, one should become familiar with several concepts that could be referred to as low-level. A software engineer usually uses
high-level concepts: UML diagrams, flow charts, design patterns. A programmer deals
with lower-level concepts such as classes and algorithms. In order to do DRE, a reverse
engineer has to spend most of the time working with assembly code, which is about as low-level
as you can get (in software).
5.1 Programming languages
In order to do DRE it is necessary to understand the differences between programming
languages.
There are many different programming languages, and they can be classified in several ways: compiled and interpreted, high-level and low-level, functional and procedural
(imperative), etc.
A future reverse engineer has to understand the difference between a high-level
language and assembly code (a low-level language), and what happens when a
program is compiled, i.e. transformed from a high-level language into a low-level language.
Also, see [47].
Definition 6. A high-level programming language is a programming language with a high
level of abstraction from the particular computer's instruction set.
Definition 7. A low-level programming language is a programming language that provides no (or very little) abstraction from the particular computer's instruction set.
From the point of view of a reverse engineer (or obfuscator) there are three types of
programming languages to be distinguished:
interpreted or just-in-time (JIT) compiled programming languages - at each execution the program is translated into machine code by an interpreter, e.g. Perl,
Ruby, PHP
compiled to bytecode or precompiled programming languages - the program is translated once into an intermediate language of a virtual machine and then interpreted
by this virtual machine at each execution, e.g. Java, .NET
compiled programming languages - the program is translated into machine
code once and can then be executed many times, e.g. Fortran, Cobol, C/C++
Interpreted and compiled to bytecode languages are executed by an interpreter.
Compiled to bytecode and compiled languages are compiled off-line. Even though there are
some similarities between these types of languages, in DRE they should
be treated separately, because different obfuscation techniques are applied to programs
written in different types of programming languages, see chapter 7. For example, some
obfuscations do not make sense for compiled languages but are useful for compiled to
bytecode languages, e.g. renaming of variables and functions: a compiled executable usually
no longer contains these names, whereas bytecode keeps many of them.
A good knowledge of the differences between the various programming languages can
significantly help the reverser in his (or her) work. Since obfuscation techniques are
different for different types of languages, the DRE techniques required are also different.
5.1.1 Determining language used
Knowing which language was used to write the program in question can help
a reverse engineer, because different programming
languages have different structures and different underlying logic.
Once the reverser knows which programming language was used, he may try to
figure out which compiler was used (in some cases this is of interest, e.g. in malware
analysis [44]).
Here are some basic features that can help the reverser find out which programming language was used.
Libraries used - different programming languages have their own libraries. Generally, their names can be found in the header of the executable file, see figure 6.5
on page 31 (the name of a library is needed in order to load it before the
execution of the main program).
String representation - each individual character is usually represented
in a standard format such as ASCII or Unicode, but there are many possible
representations of strings, e.g. in C/C++ each string is terminated by a special
symbol, the zero byte \0; in Fortran there are no delimiters between strings; in Pascal
each string is prefixed by its length.
Function and procedure calls - there are many conventions for passing arguments
to a callee and returning the result to the caller, e.g. cdecl, stdcall, fastcall, thiscall,
etc. (see [19]). Arguments may be passed (and the result returned) through the
stack or through registers. If the stack is used, one of two conventions for the order
of parameters can be used (straight or reverse). Also, depending on the convention,
either the caller or the callee is responsible for restoring the stack. For example, x86 C/C++
compilers use the cdecl convention by default. In cdecl, parameters are passed to the callee through
the stack. Parameters are pushed in reverse order (last parameter first). The
caller is responsible for restoring the stack after the call (as in figure 5.1). In stdcall, which is
used by the Win32 API, the parameters are pushed in the same order but the callee restores the stack.
File headers - most of the time compilers add their name and version to the
header of the file (see the example in figure 7.21).
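The string representations mentioned in the list above can be told apart mechanically when raw bytes are inspected. The following C++ sketch is only an illustration: the function names readCString and readPascalString are hypothetical, and the Pascal layout assumed here is the classic short string with a single length byte in front of the characters.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// C/C++ style: the string runs until the first zero byte (not included).
std::string readCString(const std::vector<std::uint8_t>& bytes, std::size_t offset) {
    std::string out;
    for (std::size_t i = offset; i < bytes.size() && bytes[i] != 0; ++i) {
        out.push_back(static_cast<char>(bytes[i]));
    }
    return out;
}

// Pascal style (assumed short-string layout): the first byte holds the length.
std::string readPascalString(const std::vector<std::uint8_t>& bytes, std::size_t offset) {
    if (offset >= bytes.size()) {
        return std::string();
    }
    std::size_t length = bytes[offset];
    const std::size_t start = offset + 1;
    if (start + length > bytes.size()) {
        length = bytes.size() - start;  // truncated buffer: read what is there
    }
    std::string out;
    for (std::size_t i = 0; i < length; ++i) {
        out.push_back(static_cast<char>(bytes[start + i]));
    }
    return out;
}
```

Applied to the same data section, only one of the two readers usually yields sensible text, which already hints at the language (or at least the runtime library) that produced the file.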
5.2 Compilers
It is important to understand the transformations that compilers apply to a program
during compilation.
Definition 8. A compiler is a program that transforms source code into another
programming language (the target language).
Compilers are mostly used to transform a program written in a high-level language
into a low-level language.
Understanding a program written in a high-level language is usually easier than trying to understand a program written in a low-level language. Since a reverser has to
deal with assembly code, he should know what kinds of changes the compiler applies to the original source code during the translation from a higher-level programming
language into machine code.
The main changes that a reverser has to be aware of are described below.
5.2.1 General changes in the code structure
A compiler makes many changes during compilation, and it can be very hard
to recognize the source code once it has been compiled. Understanding what a program does is a
very challenging task even if you have access to the source code; for a compiled
program this task is harder still, mainly because of the changes made by the compiler.
A computer needs neither the names of variables, classes, functions and methods nor
any kind of labels¹. If you have the source code of a program, these names help you to
understand what the program does, but once the program has been compiled they are usually
gone and understanding it becomes more difficult.
Listing 5.1: drawLine function call in C++
    drawLine(x1, y1, x2, y2);
Listing 5.2: drawLine function call in x86 assembly
    mov EAX, [y2]
    push EAX
    mov EBX, [x2]
    push EBX
    mov ECX, [y1]
    push ECX
    mov EDX, [x1]
    push EDX
    call drawLine
    pop EDX
    pop ECX
    pop EBX
    pop EAX
Figure 5.1: Example of translation of a simple function call from C++ code into x86
assembly.
Knowing what a given variable represents can be even harder, because computers
have a limited number of registers², which are recycled and reused for different variables
many times during the execution.
¹ In some cases compilers keep some labels.
² Many operations cannot be done directly in memory, so the data has to be copied into a
register, modified and then copied back.
Listing 5.3: "Is it a frog?" program in C++
    if ((paws == 4 and ears == 0)
            or saysRibbet) {
        // it is a frog!
    } else {
        // it is not a frog
    }
Listing 5.4: "Is it a frog?" program in x86 assembly
    mov EAX, [paws]
    cmp EAX, 4
    jne checkOR
    mov EBX, [ears]
    cmp EBX, 0
    je frog
checkOR:
    mov ECX, [saysRibbet]
    cmp ECX, 1
    jne notFrog
frog:
    ; it is a frog!
    jmp continue
notFrog:
    ; it is not a frog
continue:
Figure 5.2: Example of an if-else statement with several conditions. Translation from
C++ code into x86 assembly.
Generally, an instruction of a high-level programming language cannot be translated
directly into a single low-level instruction, but only into a block of instructions (see figures 5.1
and 5.2).
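To make the shape of such a block easier to see without reading assembly, the following C++ sketch writes the same test twice: once as in listing 5.3 and once with explicit jumps that mirror the compare-and-jump structure of listing 5.4. The function names isFrog and isFrogLowLevel are hypothetical and only serve this illustration; a real compiler may of course produce a different instruction sequence.

```cpp
// High-level version, as a programmer would write it (see listing 5.3).
bool isFrog(int paws, int ears, bool saysRibbet) {
    return (paws == 4 && ears == 0) || saysRibbet;
}

// The same logic with one comparison and one jump per step,
// following the labels used in listing 5.4.
bool isFrogLowLevel(int paws, int ears, bool saysRibbet) {
    if (paws != 4) goto checkOR;    // cmp EAX, 4 / jne checkOR
    if (ears == 0) goto frog;       // cmp EBX, 0 / je frog
checkOR:
    if (!saysRibbet) goto notFrog;  // cmp ECX, 1 / jne notFrog
frog:
    return true;                    // it is a frog!
notFrog:
    return false;                   // it is not a frog
}
```

A reverser typically has to perform the opposite transformation: recognize the chain of comparisons and jumps and fold it back into a single boolean condition.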
The entire structure of the code changes after compilation. Most modern languages allow variables to be declared in different blocks of the code, so instructions and
data are mixed in the source code. After compilation, however, the data and text (code)
sections are clearly separated.
5.2.2 Changes due to optimization
Some changes are not obligatory but are made in order to optimize the program, e.g. to
speed up the program's execution or to save space by reducing the size of the final
executable file.
A compiler can perform many different types of optimization during the
translation of a program from a high-level programming language into machine code.
Most optimization options can be turned on and off by the programmer depending
on the purpose of the program, e.g. version 4.4.3 of the g++ compiler has about 150 optimization options such as -finline-small-functions and -funsafe-loop-optimizations (see
the g++ man pages, man g++).
Code duplication
Sometimes you can find that some pieces of the original code are duplicated several
times in the final program. This kind of optimization targets the space-time trade-off by
trying to minimize branch penalties³. It speeds up the execution of a program,
but it increases the size of the code. In certain cases this can be useful.
Most modern compilers provide this option, although sometimes it is embedded in
the programming language itself, e.g. the inline keyword in C/C++.
The inline keyword asks the compiler to insert the code of a given function or method
directly at the place of the call instead of generating a function call (the compiler is free to
ignore this request).
Since a CALL instruction takes more time than simply executing the next
instruction, an inlined function executes faster. This optimization is usually used
for relatively small procedures.
Listing 5.5: Normal loop
    for (int i = 0; i < 1000; i++) {
        print(i);
    }
Listing 5.6: Same loop after unrolling
    for (int i = 0; i < 1000; i += 5) {
        print(i);
        print(i + 1);
        print(i + 2);
        print(i + 3);
        print(i + 4);
    }
Figure 5.3: Example of loop unrolling in C++. In this example there are five times
fewer branch penalties after the loop unrolling.
Loop unrolling (also known as loop unwinding) is another case of code duplication.
Loop unrolling also minimizes branch penalties and can speed up the program's
execution further if the statements inside the loop are independent, i.e. the next instruction does
not need the result of the previous one; in that case they can be executed in parallel
(for a loop, this means that an iteration does not depend on the result of
the previous iteration). See the example of loop unrolling in figure 5.3.
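The loop in figure 5.3 runs exactly 1000 times, which is divisible by the unrolling factor. When the number of iterations is not known to be a multiple of the factor, unrolled code usually ends with a short clean-up loop, and this pattern is worth recognizing in disassembled code. The sketch below illustrates it in C++ under that assumption; the function names sumSimple and sumUnrolled and the factor 4 are arbitrary choices for this example, not something taken from a particular compiler.

```cpp
#include <cstddef>
#include <vector>

// Reference version: one addition and one branch per element.
long sumSimple(const std::vector<int>& values) {
    long total = 0;
    for (std::size_t i = 0; i < values.size(); ++i) {
        total += values[i];
    }
    return total;
}

// Unrolled by 4: one branch per four elements, plus a clean-up loop
// for the 0 to 3 elements that remain at the end.
long sumUnrolled(const std::vector<int>& values) {
    const std::size_t n = values.size();
    long total = 0;
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        total += values[i];
        total += values[i + 1];
        total += values[i + 2];
        total += values[i + 3];
    }
    for (; i < n; ++i) {  // remainder
        total += values[i];
    }
    return total;
}
```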
Replacing
Different instructions are not executed at the same speed, and in some cases different
instructions give the same result. Knowing this, we are interested in using
the faster instruction whenever possible. Programmers often do not know (or do not think
about) which instruction is faster and whether an equivalent faster
instruction exists. Fortunately, the compiler is well suited to this kind of replacement and
can do it for programmers during compilation.
Multiplications and divisions take longer than additions, even for a computer.
The most widely known example of an optimization that uses an equivalent instruction
to obtain the same result is the equivalence between multiplication/division by the base
b (also called the radix) and shifting the radix point (the decimal point in base 10) one position to the right/left. If a number is
multiplied/divided by b^n, the radix point has to be shifted n positions to the right/left
in order to obtain the same result. See the example in figure 5.4.
³ Taking a branch (a CALL or JUMP instruction) takes more time than executing the instruction that physically follows.
base = 10
404.0 × 100 = 404.0 × 10^2 = 40400
42.0 × 1000 = 42.0 × 10^3 = 42000
1024.0 ÷ 100 = 1024.0 ÷ 10^2 = 10.24
Figure 5.4: Example: multiplication (division) by the base is equivalent to shifting the
decimal point to the right (left).
In the digital world, the base is 2. This means that multiplying a number by 2^n
is equivalent to shifting the binary point n digits to the right. Many processors have
a SHIFT instruction which does exactly this. See the example in figure 5.5.
base = 2
101 × 100 = 101 × 10^10 = 10100
Same operation in base 10:
5 × 4 = 5 × 2^2 = 20
Result of the SHL instruction on a register: [diagram: every bit moves one position to the
left, the most significant bit goes into CF and a 0 enters on the right]
Instructions:
    mov EAX, 5          ; eax = 0..000101
    imul EAX, EAX, 4    ; eax = 0..010100
are equivalent to:
    mov EAX, 5          ; eax = 0..000101
    shl EAX, 2          ; eax = 0..010100
Figure 5.5: Multiplication by the base. The SHL instruction is generally faster than a
multiplication instruction (MUL or IMUL). CF - Carry Flag.
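The same equivalence can be checked from C++. The sketch below is a minimal illustration with hypothetical helper names (mulByPowerOfTwo, divByPowerOfTwo); it is restricted to unsigned 32-bit values and to shift counts below 32, because shifts of signed or wider values follow additional rules that are outside the scope of this example.

```cpp
#include <cstdint>

// x * 2^n computed with a left shift (n must be smaller than 32).
// Bits shifted out of the 32-bit value are lost, as with SHL.
std::uint32_t mulByPowerOfTwo(std::uint32_t x, unsigned n) {
    return x << n;
}

// x / 2^n computed with a right shift (n must be smaller than 32).
// The result is truncated, like an unsigned integer division.
std::uint32_t divByPowerOfTwo(std::uint32_t x, unsigned n) {
    return x >> n;
}
```

For instance, mulByPowerOfTwo(5, 2) gives the same value as 5 * 4, which is the case shown in figure 5.5.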
Reorganizing
Sometimes reorganizing the code, i.e. placing certain blocks of code in a different order,
can speed up the program.
The first example concerns the if - else statement (see figure 5.6).
If the condition of the if statement is inverted and the instructions inside the if
and else branches are swapped, the result is the same.
However, as the example in figure 5.6 shows, inverting the condition can
save some instructions in the final assembly code.
The other case in which the compiler decides to reorganize some instructions of a program is based on its knowledge of the processor's architecture.
Modern processors can execute several instructions at the same time (processors with
several cores and programs with multiple threads) or almost at the same time (several
instructions are in different stages: fetch, decode, execute, memory access, write back;
also see [51]). In order to exploit this, instructions sometimes need to be
arranged in such a way that, of two consecutive instructions, the second
does not need the result of the first. This can be achieved by putting
another, independent instruction between them; this way the processor is better used, i.e. the
program executes faster.
Listing 5.7: Written code
    // valueFound is a boolean
    if (not valueFound) {
        // I have no answer :(
    } else {
        // I have an answer :)
    }
Listing 5.8: Direct translation into assembly language
    movzx EAX, BYTE [valueFound]
    xor EAX, 1          ; invert the boolean (not valueFound)
    cmp EAX, 0
    je else
    ; I have no answer :(
    jmp continue
else:
    ; I have an answer :)
continue:
Listing 5.9: Translation into assembly language after reorganization
    movzx EAX, BYTE [valueFound]
    ; no need to invert
    cmp EAX, 0
    je else
    ; I have an answer :)
    jmp continue
else:
    ; I have no answer :(
continue:
Figure 5.6: Example of code reorganization in C++: instructions inside if and else were
swapped. In this particular case (listing 5.7) the programmer would probably do the
reorganization himself, but in most situations this is not so obvious.
In the example in figure 5.7 you can see that the instruction add EAX, EDX has
to wait for the result of the instruction mov EAX, [EBX+ECX]. The processor has
to wait until the latter instruction has been executed. If the inc ECX instruction (which does
not affect the two previous instructions) is placed in between (see listing 5.11), the
processor can execute inc ECX while the result of the mov EAX, [EBX+ECX]
instruction is not yet ready.
Listing 5.10: Direct translation into assembly language
 1     xor ECX, ECX
 2     mov EBX, [myVector]
 3     mov EDX, 5
 4 loop:
 5     cmp ECX, 10
 6     je end
 7     mov EAX, [EBX+ECX]
 8     add EAX, EDX
 9     inc ECX
10     jmp loop
11 end:
Listing 5.11: Translation into assembly language after reorganization
 1     xor ECX, ECX
 2     mov EBX, [myVector]
 3     mov EDX, 5
 4 loop:
 5     cmp ECX, 10
 6     je end
 7     mov EAX, [EBX+ECX]
 8     inc ECX
 9     add EAX, EDX
10     jmp loop
11 end:
Figure 5.7: Example of code reorganization: the order of instructions changes.
The instructions on lines 8 and 9 are swapped.
5.3 Operating systems
An operating system (OS) coordinates the different elements of a computer. It manages the
hardware and the software. An operating system is a key component of any computer,
and a reverser must understand how it works.
Since an OS plays the role of a guardian that controls all links between a program
and the outside world, many reversing techniques (and some obfuscation techniques)
are based on the functioning of the OS.
A reverse engineer must have at least some basic knowledge of the following concepts:
Memory management - kernel and user memory, memory management scheme
(e.g. paging), memory allocation mechanisms, memory sharing.
Interrupts - exception handling.
Process handling - initialization, context switching, synchronization, threads.
The structure of the executable file format.
Communication with the rest of the world - I/O, system calls, APIs.
For example, a reverser should know that 32-bit Windows uses 32-bit memory addresses
(4 gigabytes (GB) of addressable memory). By default, Windows uses the upper 2 GB for kernel-related memory
and the lower 2 GB for user-related memory, so the most significant bit of every
user-mode address is 0. This means that all
addresses that have the most significant bit set to 1 are not valid user-mode pointers.
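This rule of thumb can be written as a one-line test. The C++ sketch below assumes the default 2 GB / 2 GB split of a 32-bit address space described above; the function name isUserModeAddress is hypothetical, and systems configured with a different split would need a different boundary.

```cpp
#include <cstdint>

// Assumes a 32-bit address space split into a lower 2 GB user half and an
// upper 2 GB kernel half: a user-mode address has its top bit cleared.
bool isUserModeAddress(std::uint32_t address) {
    return (address & 0x80000000u) == 0;
}
```

When a reverser sees a value such as 0x80001000 used as a pointer in user-mode code, this check suggests that it is either not a pointer at all or not one the program is allowed to dereference.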
Here is another example that shows the importance of a basic understanding of OS
fundamentals: all programs communicate with the outside world. Intercepting and understanding these communications can give the reverser some clues about the program. In order
to communicate with the OS, a program uses system calls. Understanding exactly what
each system call does helps the reverser to quickly understand the general structure of
the program.
See [51] and [19] to learn more about OS fundamentals.
Chapter 6
Reversing tools
There are two main reversing techniques: active reversing and passive reversing. There
are many tools (programs) for doing active or passive DRE.
Active and passive techniques can be used regardless of the object that is reversed.
Section 2.2 shows a good example in which both approaches were used.
Definition 9. Active reverse engineering is reverse engineering that is done using the
reversed object as a black box. It is based on observations made during and after
the use of the reversed object. In the case of DRE, active reverse engineering is
also called live code analysis.
The main idea of active reverse engineering in DRE is to execute the code and
then observe and analyse the result: how the system (the environment) changed after
(or during) the execution. This may be done at different granularities, i.e. by observing
changes after the execution of the entire program, after each system call or after each
instruction.
Definition 10. Passive reverse engineering is reverse engineering done by disassembling the reversed object and studying its components. Passive digital reverse engineering
is also called off-line code analysis.
In the case of passive DRE the file is analyzed without being executed (sometimes
the permission to execute the file is removed at the system level as an additional
precaution against execution).
Here is a list of the basic tools that are most used by reverse engineers. Sometimes
the features offered by these different tools are combined into one "Swiss army knife"
tool that can do many things (e.g. IDA Pro Debugger & Disassembler).
6.1 Hex editors
One of the most basic, but still useful, tools used by reverse engineers is the hex
editor. A hex editor is a program that is very similar to a simple text editor such as
gedit, notepad++, emacs, nano or vim. The prefix hex means hexadecimal and refers
to base sixteen¹, which is used to represent numbers.
Basically, all hex editors show the content of any file in two forms:
numbers in hexadecimal representation - for all bytes of the file
characters - for bytes which correspond to printable characters
Generally, modern and more sophisticated hex editors also show other information
(see the example in figure 6.1):
offset from the beginning of the file (in bytes)
number representation in other bases, e.g. 10, 2
possibility to switch between big-endian and little-endian representations
¹ Numerals go from 0 to F: 0,1,2,3,4,5,6,7,8,9,A,B,C,D,E,F
Figure 6.1: View of a part of the hello world program (see source in appendix C.1) in ghex2,
a hex editor for Linux. The left column gives the addresses in hexadecimal representation
(i.e. offsets from the beginning of the file), the central column shows the hex representations
of the byte values and the right column shows the same values as characters (non-printable
characters are replaced by dots).
Hex editors are useful for modifying files, e.g. replacing one instruction with another or modifying some part of the data. That is why hex editors are sometimes
classified as patching tools.
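The three-column view described in the caption of figure 6.1 is simple to reproduce. The following C++ sketch is a minimal illustration, not the implementation of ghex2 or of any other editor; the function name hexDump and the exact column layout are assumptions made for this example.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>

// Formats a byte buffer as "offset  hex bytes  printable characters",
// one line per bytesPerLine bytes. Non-printable bytes are shown as dots.
std::string hexDump(const std::vector<std::uint8_t>& data, std::size_t bytesPerLine = 16) {
    std::string out;
    char buffer[16];
    for (std::size_t offset = 0; offset < data.size(); offset += bytesPerLine) {
        std::snprintf(buffer, sizeof buffer, "%08zx  ", offset);
        out += buffer;
        std::string text;
        for (std::size_t i = 0; i < bytesPerLine; ++i) {
            if (offset + i < data.size()) {
                const std::uint8_t byte = data[offset + i];
                std::snprintf(buffer, sizeof buffer, "%02x ", static_cast<unsigned>(byte));
                out += buffer;
                text += (byte >= 0x20 && byte < 0x7f) ? static_cast<char>(byte) : '.';
            } else {
                out += "   ";  // padding so that the text column stays aligned
            }
        }
        out += " " + text + "\n";
    }
    return out;
}
```

A hex editor adds navigation and in-place modification on top of such a view, which is what makes it usable as a patching tool.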
Definition 11. Patching is the process of modifying code in a binary executable to
somehow alter its behavior. Definition from [19].
Patching a program can be useful in active DRE, e.g. to ensure that the
program executes the particular branch that the reverser wants to explore. Patching
is also used for protection removal, i.e. cracking. Many popular proprietary programs such as
Photoshop, Microsoft Office, WinRAR etc. are cracked soon after their release.
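As an illustration of forcing a particular branch, the C++ sketch below replaces a single byte in an in-memory copy of some code, after checking that the byte found at the given offset is the expected one. The function name patchByte is hypothetical. The example values used in the note after the code rely on the x86 encoding of short conditional jumps, where opcode 0x74 is JE and 0x75 is JNE; locating the right offset in a real file is the difficult part and is not shown.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Replaces the byte at `offset` with `replacement`, but only if the current
// value is `expected`. Returns true when the patch was applied.
bool patchByte(std::vector<std::uint8_t>& image, std::size_t offset,
               std::uint8_t expected, std::uint8_t replacement) {
    if (offset >= image.size() || image[offset] != expected) {
        return false;  // wrong place or already patched: leave the image alone
    }
    image[offset] = replacement;
    return true;
}
```

For the byte sequence 83 F8 00 74 05 (cmp EAX, 0 followed by je +5), calling patchByte(image, 3, 0x74, 0x75) turns the je into a jne, so the program takes the other branch. Checking the expected value first protects against patching a different version of the file by mistake.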
There are similar programs that do not allow the content of the
file to be modified; these tools are called hex readers or hex viewers.
6.2 Sandboxes
Usually a reverse engineer deals with an unknown executable file, i.e. he does not know
whether the given file could be harmful to his system. A reverser therefore has to analyze the file in a
closed environment called a sandbox.
Definition 12. A sandbox (in computer security) is a mechanism that allows
programs to be run in separate environments. It is often used to separate untrusted programs from
the rest of the system.
In the case of DRE a sandbox could be a computer (or a network) which is used
only for DRE and which is disconnected from the Internet and from any kind of external
network. This way a virus (if analyzed in such a closed environment) is not able
to contaminate the rest of the digital world.
A real (physical) system, e.g. a computer or a network, can be used as a sandbox in
order to analyze an unknown executable file. The disadvantage of using a real physical
system is the time spent re-installing the system if the analyzed executable
file harms it. In order to avoid this time-consuming task, reversers should use
virtual environments.
6.2.1 Virtual environments
Definition 13. Virtualization is the use of a virtual version of a real object.
Virtualization is not used solely for DRE. Sometimes virtualization is used
to host several virtual machines on one physical machine in order to give
end users the impression that they have an entire machine to themselves. Virtualization is also used
during software development for devices such as mobile phones, CD players etc.
In order to do the DRE of a potentially harmful executable file, hardware virtualization needs to be used². Hardware virtualization is the virtualization of a computer; the
virtualized computer is called a virtual machine (VM).
In DRE a virtual machine is used as a sandbox instead of a real computer. A
lot of software for creating virtual machines is available, e.g. VirtualBox and
VMware.
Virtual machines make it possible to do things that cannot be done with physical machines,
e.g. pause and resume the execution or take a snapshot (save the current state of the machine).
The use of snapshots significantly simplifies the life of a reverse engineer: once
the system is installed he takes a snapshot of it, and then works with
any suspicious files. If a suspicious file breaks the system on the virtual machine, the
reverser just needs to restore the initial state of the system from the snapshot.
This operation is similar to opening a file in any program.
In some cases virtualization of a whole network is necessary for a complete analysis.
In this situation several virtual machines are installed on the
same physical machine and then interconnected (software such as VirtualBox and VMware makes this more
or less easy).
The use of a virtual machine has several disadvantages. Some of them, like performance degradation, are not very significant for DRE, but others are crucial.
Usually a virtual environment is not exactly the same as a real environment, so
there are ways for a program to detect whether it is executed on a real or a virtual machine and to
act accordingly. For more information see Section 7.3.3.
² It is strongly recommended to do DRE of all unknown executable files only in virtual environments.
6.3 Disassemblers
The disassembler is the most important reversing tool. A disassembler is a tool that shows
the content of the text (code) section of an executable file. It converts a
raw stream of numbers (machine code) into a human-readable format, a program written
in assembly language, i.e. disassemblers translate the machine code bytes used by computers into
human-readable text. See figure 6.2.
Some powerful disassemblers (e.g. IDA Pro) can also generate flowcharts of functions
and entire programs.
Listing 6.1: objdump -d hello world
 80486d4:  55                      push   %ebp
 80486d5:  89 e5                   mov    %esp,%ebp
 80486d7:  83 e4 f0                and    $0xfffffff0,%esp
 80486da:  83 ec 10                sub    $0x10,%esp
 80486dd:  c7 44 24 04 30 88 04    movl   $0x8048830,0x4(%esp)
 80486e4:  08
 80486e5:  c7 04 24 40 a0 04 08    movl   $0x804a040,(%esp)
 80486ec:  e8 ef fe ff ff          call   80485e0
 80486f1:  c7 44 24 04 00 86 04    movl   $0x8048600,0x4(%esp)
 80486f8:  08
 80486f9:  89 04 24                mov    %eax,(%esp)
 80486fc:  e8 ef fe ff ff          call   80485f0
 8048701:  b8 00 00 00 00          mov    $0x0,%eax
 8048706:  c9                      leave
 8048707:  c3                      ret
Figure 6.2: Disassembled code of the main() function of the hello world program (see code in
appendix C.1). Disassembled by the disassembler integrated into the objdump program.
Command line: objdump -d hello world. Left column - addresses in hexadecimal. Middle column - raw data (values in hexadecimal). Two right columns - corresponding
instructions.
The format used to encode instructions and the set of instructions are platform-specific (they depend on the system), so disassemblers are also platform-specific, although
some disassemblers support several platforms.
Disassemblers can use one of two different approaches in order to disassemble the
code:
sequential, also called linear sweep
recursive, also called recursive traversal
The sequential algorithm is simpler than the recursive one. With the sequential algorithm, the disassembler reads the code section of the file byte after byte and translates
it into instructions (in a human-readable form, i.e. assembly language).
The recursive algorithm does not follow the physical order of instructions;
it follows their logical order, i.e. it follows the control flow through
branches, so a given address is decoded only if it is
reachable from previously decoded code.
The first disassemblers used only sequential algorithms; the recursive approach appeared as a countermeasure to the insertion of data into the code section of the file. Such data insertion is
used by some compilers to optimize switch statements (see the example in figure 6.3).
It is also used as a control flow obfuscation technique, e.g. jump tables (see 7.3.2). It
is difficult to determine the length of these jump tables, so recursive algorithms always
use some heuristics to determine which instruction to disassemble next.
Insertion of data into the code section is also an obfuscation technique used in order
to break or confuse different analysis tools (see more in section 7.3.4).
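The difference between the two approaches can be demonstrated on a toy example. The C++ sketch below uses an invented four-instruction machine (NOP, JMP, RET and a conditional JZ, with one-byte absolute targets); this instruction set, the opcode values and the function names linearSweep and recursiveTraversal are assumptions made purely for illustration and do not correspond to x86 or to any real disassembler. Both functions return the set of offsets at which an instruction was decoded.

```cpp
#include <cstddef>
#include <cstdint>
#include <set>
#include <vector>

// Hypothetical toy instruction set: JMP and JZ are followed by a one-byte
// absolute target, every other byte value is a one-byte instruction.
constexpr std::uint8_t OP_NOP = 0x01;
constexpr std::uint8_t OP_JMP = 0x02;  // unconditional jump
constexpr std::uint8_t OP_RET = 0x03;
constexpr std::uint8_t OP_JZ  = 0x04;  // conditional jump

std::size_t instructionLength(std::uint8_t opcode) {
    return (opcode == OP_JMP || opcode == OP_JZ) ? 2 : 1;
}

// Sequential algorithm: decode byte after byte in physical order.
std::set<std::size_t> linearSweep(const std::vector<std::uint8_t>& code) {
    std::set<std::size_t> starts;
    std::size_t pc = 0;
    while (pc < code.size()) {
        starts.insert(pc);
        pc += instructionLength(code[pc]);
    }
    return starts;
}

// Recursive algorithm: decode only what is reachable from the entry point.
std::set<std::size_t> recursiveTraversal(const std::vector<std::uint8_t>& code,
                                         std::size_t entry) {
    std::set<std::size_t> starts;
    std::vector<std::size_t> worklist{entry};
    while (!worklist.empty()) {
        std::size_t pc = worklist.back();
        worklist.pop_back();
        // Stop when leaving the code or reaching an already decoded address.
        while (pc < code.size() && starts.insert(pc).second) {
            const std::uint8_t opcode = code[pc];
            if (opcode == OP_RET) break;
            if (opcode == OP_JMP || opcode == OP_JZ) {
                if (pc + 1 < code.size()) worklist.push_back(code[pc + 1]);
                if (opcode == OP_JMP) break;  // no fall-through after JMP
            }
            pc += instructionLength(opcode);
        }
    }
    return starts;
}
```

For the bytes 02 03 02 01 03 (JMP 3, one data byte 02, NOP, RET), the linear sweep decodes the data byte as a second JMP that swallows the NOP and reports offsets 0, 2 and 4, while the recursive traversal follows the jump and reports the real instructions at 0, 3 and 4. The sketch does not handle indirect jumps, which is exactly where real recursive disassemblers need the heuristics mentioned above.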
6.3.1 Decompilers
A decompiler is the dream tool of every reverse engineer. Decompilers, like disassemblers, translate programs into a human-readable form, but decompilers try to translate
programs into a high-level programming language.
The work of a decompiler is harder than that of a disassembler because it has
to deal with all the transformations that a compiler applied to the program (see 5.2).
Decompiling a program is a very complicated task, especially if obfuscation techniques were applied (see chapter 7).
Nowadays there are several more or less successful decompiler projects,
e.g. Andromeda [14], Boomerang [15].
Decompilers do their best job in the case of compiled to bytecode languages (see
Section 5.1 about programming languages). A lot of information from the source code remains in the compiled
program, so decompilers for Java and .NET produce very readable
code.
6.4 Debuggers
A debugger is a program that allows the execution of a process (called the
debuggee) to be monitored. Originally, debuggers were used by software developers for tracing and
fixing errors and bugs. Most debuggers were created for bug fixing; however, there are
several debuggers that were created for DRE (although they can still be used for bug fixing),
e.g. OllyDbg [66].
Debuggers make it possible to execute programs and stop them at various breakpoints. They
can show the values of variables and registers and can also show the content of the
stack (the order of called procedures with their parameters).
All debuggers include disassembler features. Generally, debuggers that were created
for DRE purposes incorporate more powerful disassemblers.
In order to halt the execution of a program, debuggers use breakpoints. There are
two types of breakpoints:
Listing 6.2: Switch statement in C++
    switch (state) {
        case 0:
            statements_0;
            break;
        case 1:
            statements_1;
            break;
        case 8:
            statements_8;
            break;
        default:
            statements_def;
            break;
    }
Listing 6.3: Switch statement in assembler
    mov EAX, [state]
    jmp find_adr
states_vec: [0, 1, 8]
adr_vec: [adr0, adr1, adr8]

find_adr:
    xor ESI, ESI                ; ESI <- 0
loop:
    cmp ESI, [vec_size]
    je adr_default              ; end of the table reached: no case matched
    cmp EAX, [states_vec+ESI]
    je found
    inc ESI
    jmp loop                    ; try the next table entry
found:
    jmp [adr_vec+ESI]

adr0:
    statements_0
    jmp continue
adr1:
    statements_1
    jmp continue
adr8:
    statements_8
    jmp continue
adr_default:
    statements_def

continue:
    ...
Figure 6.3: Example of implementation of a switch statement in C++ and in x86
assembler. The tables states_vec and adr_vec are in the code section of the file.
Software breakpoints - special instructions that a debugger adds to a program.
When such an instruction is reached during the execution of the program, the program
pauses and control is transferred to the debugger.
Hardware breakpoints - special processor features (interrupts) that pause the execution of a program and transfer control to the debugger when a certain
memory address is accessed. Hardware breakpoints are very useful for detecting
the code regions responsible for managing data structures.
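On x86, the special instruction used for software breakpoints is typically the one-byte INT3 instruction (opcode 0xCC): the debugger overwrites the first byte of the target instruction with it and remembers the original byte so that it can be restored later. The C++ sketch below models only this bookkeeping on a local byte buffer; the class name BreakpointTable and its interface are hypothetical, and a real debugger would write into the memory of the debuggee through the debugging interface of the OS, which is not shown here.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

constexpr std::uint8_t kInt3 = 0xCC;  // x86 INT3 opcode

// Keeps track of software breakpoints placed in a buffer of code bytes.
class BreakpointTable {
public:
    explicit BreakpointTable(std::vector<std::uint8_t>& code) : code_(code) {}

    // Saves the original byte and replaces it with INT3.
    bool set(std::size_t offset) {
        if (offset >= code_.size() || saved_.count(offset) != 0) {
            return false;  // out of range or breakpoint already present
        }
        saved_[offset] = code_[offset];
        code_[offset] = kInt3;
        return true;
    }

    // Restores the original byte so that the instruction can execute again.
    bool clear(std::size_t offset) {
        const auto it = saved_.find(offset);
        if (it == saved_.end()) {
            return false;
        }
        code_[offset] = it->second;
        saved_.erase(it);
        return true;
    }

private:
    std::vector<std::uint8_t>& code_;
    std::map<std::size_t, std::uint8_t> saved_;
};
```

Because a software breakpoint modifies the code itself, a program that checks its own bytes can notice it, whereas a hardware breakpoint leaves the code untouched.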
There are two different types of debuggers:
User-mode debuggers (e.g. OllyDbg [66], gdb [21], WinDbg [36], IDA Pro [43])
are the simplest and most conventional debuggers. User-mode debuggers are
programs that, as their name implies, operate in user mode. A user-mode
debugger attaches to the debuggee program in order to take full control of it.
Kernel-mode debuggers (e.g. Numega SoftICE³ [61]) are more powerful
than user-mode debuggers. A kernel-mode debugger is installed as a component of the
kernel (the core of the system). Such debuggers make it possible to stop and observe the entire
system at any given moment.
User-mode and kernel-mode debuggers are very different, and which one is used depends on
what kind of program is analyzed.
Generally, a simple user-mode debugger is enough for almost all kinds of DRE.
Kernel-mode debuggers are mandatory if kernel-mode code is analyzed (e.g. a
driver).
Since a user-mode debugger is a program that runs on top of the system, it is
easier to install and operate than any kernel-mode debugger (which is a component of
the system).
The biggest disadvantage of kernel-mode debuggers is that they affect the system, i.e.
they can destabilize the operating system that they are attached to. This is not the case for user-mode
debuggers.
The main advantage of kernel-mode debuggers is that they make it possible to monitor the
entire system, while user-mode debuggers can monitor only one process. Most
user-mode debuggers cannot analyze the code that is executed before the main entry
point of the debuggee process is reached (user-mode code from libraries loaded during
the initialization, before the main program).
Main advantages and disadvantages of user-mode and kernel-mode debuggers:

Debugger type | Advantages                | Disadvantages
User-mode     | easy to install and use   | limited field of view; one process monitoring; cannot debug kernel-mode code
Kernel-mode   | monitor the entire system | hard to install; destabilize the system
6.5 Monitoring tools
Monitoring tools are used to perform live-code analysis, i.e. active reverse engineering.
Some questions that a reverse engineer asks (e.g. which files does this program use? which ports is it listening on?) can easily be answered without digging into the code, just by monitoring the analyzed program while it executes.
Generally, monitoring tools observe inputs and outputs (I/O) on the channels that exist between a process and the OS. This is done by monitoring all system calls made by a process.
There exists a variety of monitoring tools that can monitor different parts of a process or an entire system (see [50] and [19]). Here is a list of the most commonly used types of monitoring tools:
Footnote 3: Unfortunately, the SoftICE project was closed in 2006, but the program is still used.
File monitoring tools - monitor all I/O on files. Generally, these tools show all opened files with their permissions per process.
Network traffic monitoring tools - watch all open network connections. Generally, they show which port is used by which program and what kind of traffic it carries (e.g. TCP, UDP, etc.).
Port monitoring tools - monitor I/O on parallel and serial ports.
Process monitoring tools - show information about running programs, e.g. loaded libraries, CPU usage, memory usage, etc. In effect, these tools are more powerful versions of the Windows Task Manager and the Linux top program.
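As a rough illustration of what a file monitoring tool reports, the following Python sketch lists the files that a process holds open. It assumes a Linux system that exposes open descriptors as symbolic links under /proc/<pid>/fd; the function name open_files and the proc_root parameter are illustrative choices, not part of any tool cited above.

```python
import os

def open_files(pid, proc_root="/proc"):
    """Return {descriptor number: target path} for one process.

    Assumption: descriptors appear as symbolic links in
    <proc_root>/<pid>/fd, as they do on Linux.
    """
    fd_dir = os.path.join(proc_root, str(pid), "fd")
    result = {}
    for name in os.listdir(fd_dir):
        try:
            result[int(name)] = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            # The descriptor was closed between listdir() and readlink().
            continue
    return result

# Typical use on Linux: open_files(os.getpid())
```

Real monitoring tools go further and follow the system calls of the process over time; this sketch only takes a snapshot.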
6.6 Dumping tools
In IT, to dump means to copy data from one place to another, e.g. from memory to a printout or from RAM to a file.
The best-known use of dumping is the core dump.
Definition 14. A core dump is a disk file containing an image of the process memory at the time of termination. This image can be used in a debugger to inspect the state of the program at the time when it terminated. Definition from the Linux man pages (man core).
Sometimes core dump files are generated automatically when a program crashes. Such files can also be used for active reverse engineering. A tool like gcore (in Linux) extracts the content of the memory of a running process into a file. For example, the command gcore 42 generates a core dump file (core.42) of the process whose process id (pid) is 42.
Dumping the memory of a process is useful if the code of the original program could not be extracted during off-line analysis using deobfuscators (see Section 6.8). In this case live-code analysis can be used: during the execution of a program, its code can be extracted directly from the memory of the process and then analyzed (in a debugger).
Another type of dumping is object dumping. Object dumping means presenting the metadata, i.e. the data contained in the header of an object, usually a file.
In the case of executable files the term executable-dumping is used.
Programs such as objdump in Linux and dumpbin in Windows perform executable-dumping. These programs can present all data from header to footer in a simple human-readable format; see the examples in figures 6.4 and 6.5.
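To make the idea of executable-dumping more concrete, here is a minimal Python sketch that decodes a few fields of an ELF file header from raw bytes. It is only an approximation of what objdump -f reports, the function name parse_elf_header is hypothetical, and only the class, byte order, type, machine and entry point fields are handled.

```python
import struct

def parse_elf_header(data):
    """Decode a few fields of an ELF file header from raw bytes."""
    if data[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    bits = {1: 32, 2: 64}[data[4]]        # EI_CLASS: 32-bit or 64-bit
    endian = {1: "<", 2: ">"}[data[5]]    # EI_DATA: little or big endian
    # After the 16-byte identification block come e_type (16 bits),
    # e_machine (16 bits), e_version (32 bits) and e_entry, which is
    # 32 or 64 bits wide depending on the class.
    entry_format = "I" if bits == 32 else "Q"
    e_type, e_machine, _version, e_entry = struct.unpack_from(
        endian + "HHI" + entry_format, data, 16)
    return {"bits": bits, "type": e_type,
            "machine": e_machine, "entry": e_entry}
```

For a real file, the first 64 bytes read in binary mode are enough input, e.g. parse_elf_header(open(path, "rb").read(64)).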
6.7 Visual representations
People understand information more easily and quickly in a visual representation, such as images and diagrams, than as a long text.
Several visualization tools exist that help with digital reverse engineering. These tools represent information about a file (or a part of it) in a graphical format, which is sometimes very useful for a reverse engineer.
The most conventional visual representation of a program is a flowchart. See an example of a flowchart in the appendix (figure B.5 on page 95).
Listing 6.4: objdump -f hello_world

hello_world:     file format elf32-i386
architecture: i386, flags 0x00000112:
EXEC_P, HAS_SYMS, D_PAGED
start address 0x08048620
Figure 6.4: A part of the object dump of the hello world program (see source in appendix C.1). The command that displays the contents of the overall file header: objdump -f hello_world.
Definition 15. A flowchart is a diagram that represents an algorithm. Each action is
shown as a box and the order of actions is represented by arrows connecting the boxes.
Several reversing tools (e.g. IDA Pro disassembler & debugger [43]) can generate
flowcharts, see figure 6.6.
Flowcharts give a significant level of abstraction over assembly code. By using
flowcharts a reverser can quickly understand the general structure of a function.
There also exist less conventional visualization tools that were created for DRE purposes, like the one presented in [10]. It tries to give a general impression of the entire program by showing the overall control flow of the program as a graph. This representation can help to find interesting areas of code in the analyzed executable file (see figure 6.7).
Another interesting tool, created for DRE of executable files and data formats, is presented in [23]. This tool gives several graphical representations of a file, for example plots that map each byte to a pixel on the display (see figure 6.8).
This tool can also be used to reveal some types of steganography, i.e. hidden message passing (see Section 7.3.6 and figure 6.9).
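The data behind two of these views can be sketched in a few lines of Python. This is only an assumption about how such plots could be fed, not the implementation of the tool from [23]: the function names are hypothetical, and splitting the file into fixed-size rows is an illustrative choice.

```python
from collections import Counter

def byte_frequency(data):
    """Count how often each byte value (0..255) occurs in data."""
    counts = Counter(data)
    return [counts.get(value, 0) for value in range(256)]

def byte_presence_rows(data, row_size=256):
    """For each chunk of the file, mark which byte values are present.

    Each row has 256 columns; a plotting library could draw True as
    green and False as black.
    """
    rows = []
    for start in range(0, len(data), row_size):
        present = set(data[start:start + row_size])
        rows.append([value in present for value in range(256)])
    return rows
```

Regions that use only a narrow range of byte values (e.g. ASCII text) or almost all values (e.g. compressed or encrypted data) stand out in such a table even before it is drawn.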
Case studies in original papers [23] and [10] show that such visualizing tools are very
helpful in DRE: they can significantly reduce the time that a reverse engineer spends in
order to find an interesting part of a file.
6.8 Automated deobfuscators
A fully automated deobfuscator that effectively removes the effect of any obfuscation would be a dream tool for a reverse engineer. Creating such a tool is a very challenging task that has not yet been achieved. However, many automated deobfuscators exist that are well suited to handle particular types of obfuscation, such as packing or control flow obfuscation.
Here are some of the most common types of automated deobfuscators:
Code extractors or unpackers - can extract code from packed executables (see Section 7.3.1). Many code extractors exist, using different techniques to unpack the original code from the packed file, for example Renovo [37], PolyUnpack [41] and UUnP (a plug-in for IDA Pro [43]).
Control flow deobfuscators - try to remove different kinds of control flow (see Section 7.3.2)
Listing 6.5: objdump -p hello_world

hello_world:     file format elf32-i386
Program Header:
...
Dynamic Section:
  NEEDED      libstdc++.so.6
  NEEDED      libm.so.6
  NEEDED      libgcc_s.so.1
  NEEDED      libc.so.6
  INIT        0x08048550
  FINI        0x0804880c
  HASH        0x080481ac
  GNU_HASH    0x080481f4
  STRTAB      0x080482f8
  SYMTAB      0x08048228
  STRSZ       0x00000183
  SYMENT      0x00000010
  DEBUG       0x00000000
  PLTGOT      0x08049ff4
  PLTRELSZ    0x00000048
  PLTREL      0x00000011
  JMPREL      0x08048508
  REL         0x080484f8
  RELSZ       0x00000010
  RELENT      0x00000008
  VERNEED     0x08048498
  VERNEEDNUM  0x00000002
  VERSYM      0x0804847c
Version References:
  required from libstdc++.so.6:
    0x056bafd3 0x00 05 CXXABI_1.3
    0x08922974 0x00 03 GLIBCXX_3.4
  required from libc.so.6:
    0x0d696910 0x00 04 GLIBC_2.0
    0x09691f73 0x00 02 GLIBC_2.1.3
Figure 6.5: A part of the object dump of the hello world program. The command that displays the object-format-specific file header contents: objdump -p hello_world. Here the reverser can see the libraries (and their versions) used by the program, different flags, etc.
Figure 6.6: A flowchart generated by IDA Pro disassembler. The picture comes from
the official site of IDA Pro [43].
obfuscations such as jump tables. For example, Loco [33], Diablo [40], Pltobased [4].
Automated format reversers - are tools that usually monitor programs in order
to discover the structure of files and/or protocols that these programs use. For
example, Autoformat [31] and Tupni [55]. These tools are mostly used in data
Generally, automated deobfuscators of executable files create a new executable file which contains the deobfuscated version of the code from the original file. Automated deobfuscators of file formats and protocol structures usually create diagrams or textual representations of protocol structures and file formats.
6.9 Miscellaneous useful tools
There exist many miscellaneous tools that do not belong to any category but are very useful for DRE. Most of these tools were not originally created for DRE purposes, but once reverse engineers started to use them, some DRE-specific versions were created.
The most common ones are presented below.
Figure 6.7: Visualization of the control flow of a program with utility created in [10].
OEP - original entry point. Image comes from the original paper.
6.9.1 File type recognition
A very useful tool for DRE is a program that tries to identify the type of a file, like the file command in Linux (see figure 6.10). Most of the time, in order to find out the type of a file, programs simply check its extension, which is not foolproof.
Tools such as the Linux file command also check for magic patterns that are specific to particular types of files (see the Linux man pages, man file).
Tools that can identify file types are very useful in DRE (especially in DRE of malicious programs), because sometimes, in order to obfuscate a program, a file that contains sensitive data or code is disguised as a simple image file (see figure 6.11).
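The magic-pattern check can be sketched as follows in Python. The table below contains only four well-known signatures and the descriptions are illustrative; the real file command relies on a much larger database, so this is a simplified assumption of how it works rather than its actual implementation.

```python
# A tiny, hand-picked signature table (assumption: real tools know
# hundreds of patterns and also inspect data beyond the first bytes).
MAGIC_PATTERNS = [
    (b"\x7fELF", "ELF executable or object file"),
    (b"\xff\xd8\xff", "JPEG image data"),
    (b"\x89PNG\r\n\x1a\n", "PNG image data"),
    (b"#!", "script text executable"),
]

def identify(data):
    """Guess the type of a file from its first bytes, ignoring its name."""
    for magic, description in MAGIC_PATTERNS:
        if data.startswith(magic):
            return description
    return "unknown data"

def identify_file(path):
    """Read the beginning of a file and identify it by content."""
    with open(path, "rb") as handle:
        return identify(handle.read(16))
```

Because the extension is never consulted, a shell script renamed to picture2.jpg is still reported as a script.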
6.9.2 Strings and pattern searching
Tools that perform string search in files are extremely useful in DRE. Searching for strings helps to find the messages that are displayed in case of errors and the names of the libraries used (see figure 6.12).
Most such programs can look for strings in different encodings (ASCII, Unicode, etc.).
There exists one such program that is integrated in most of Linux distributions -
Figure 6.8: Program presented in [23]. Visualization of a file. Image from the original paper. (a) Current position in the file. (b) Byte Presence Visualization - each of 256 columns shows presence (green) and absence (black) of bytes of a given value. (c) Byteview Visualization - the color of each pixel maps to the value of a byte from 00 (black) to FF (green). (d) ASCII strings contained in the file. (e) Dot Plot Visualization - used in bioinformatics for visual detection of repeated sequences. (f) Byte Frequency tag cloud. (g) Canonical hex editor view - hexadecimal and ASCII. (h) Control Toolbar.
program strings, which can be called from bash (see for example figure 6.12).
Once a reverse engineer discovers all messages (strings) that can be displayed, he or she can find the addresses where these strings are stored. Once these addresses are found, the reverser can find the part of the code that uses these strings (e.g. by using hardware breakpoints, see Section 6.4 about debuggers) and start to analyze it.
Nowadays almost all powerful disassemblers and hex editors integrate utilities for string search. Another search utility, which is rarely integrated in specialized tools, is the search for strings using regular expressions (see [27]), i.e. searching for patterns instead of exact strings (see the example in figure 6.13).
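Both kinds of search can be approximated with a short Python sketch. It assumes that a string is a run of at least four printable ASCII characters, which mirrors the usual default of the strings tool; the function names are hypothetical and other encodings are not handled.

```python
import re

def extract_strings(data, min_length=4):
    """Return runs of printable ASCII characters found in binary data."""
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_length)
    return [match.decode("ascii") for match in pattern.findall(data)]

def search_strings(data, regex, min_length=4):
    """Keep only the extracted strings that match a regular expression."""
    compiled = re.compile(regex)
    return [text for text in extract_strings(data, min_length)
            if compiled.search(text)]
```

For example, search_strings(data, r"[Ee]rror") keeps the messages that contain Error or error.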
Figure 6.9: Steganographic message hidden in an audio mp3 file. The image was generated by the program described in [23] (Byte presence view). Image from the original
paper.
/> file hello_world
hello_world: ELF 32-bit LSB executable, Intel 80386, version 1 (SYSV), dynamically linked (uses shared libs), for GNU/Linux 2.6.15, not stripped
/> file hello_world.cpp
hello_world.cpp: ASCII C program text
Figure 6.10: Example of the output of the file program for a C++ source file hello_world.cpp (see source in appendix C.1) and its compiled version hello_world.
/> file picture1.jpg

picture1.jpg: JPEG image data, JFIF standard 1.01
/> file picture2.jpg
picture2.jpg: Bourne-Again shell script text executable
Figure 6.11: Example of the output of the file program for a real picture (picture1.jpg) and for a shell script disguised as a picture (picture2.jpg). Both files have the same extension.
/> strings secretValue
/lib/ld-linux.so.2
: v,
CyIk
libstdc++.so.6
__gmon_start__
_Jv_RegisterClasses
_ZSt4endlIcSt11char_traitsIcEERSt13basic_ostreamIT_T0_ES6_
_ZSt4cout
_ZStlsISt11char_traitsIcEERSt13basic_ostreamIcT_ES5_PKc
_ZNSt8ios_base4InitC1Ev
_ZSt3cin
_ZNSt8ios_base4InitD1Ev
_ZNSirsERj
_ZNSolsEPFRSoS_E
__gxx_personality_v0
libm.so.6
libgcc_s.so.1
libc.so.6
_IO_stdin_used
__cxa_atexit
__libc_start_main
GLIBC_2.0
GLIBC_2.1.3
CXXABI_1.3
GLIBCXX_3.4
PTRh
QVhD
[ ]
Enter the secret value:
Congratulations!
Wrong value!
Bye-bye!
Figure 6.12: Example of the output of the strings program for the program secretValue (see source code in appendix C.4).
/> strings autoCorrect | grep [E,e]rror

Error while exec corrected file
Error while fork() for copy exec.
Error while opening file
exec cp error
Figure 6.13: Search for all embedded messages which contain the word Error or error
in executable file autoCorrect (see code in appendix C.4).
Chapter 7
Code obfuscation
Generally, if there are techniques that are used to attack a system, then there also exist techniques to protect the system against these attacks. So in DRE, as in many other domains, one can observe an arms race between reversers and anti-reversers (or obfuscators).
In the case of binary reverse engineering, the anti-reversing technique is called code obfuscation.
7.1 The definition
Definition 16. An obfuscator is a function that takes a program as input and returns a program that is equivalent to the input program but harder to reverse-engineer.
Definition 17. Two programs are equivalent if they give equal outputs for equal inputs (see footnote 1).
Sometimes this function is applied by hand, but different automated obfuscators also exist.
Harder to reverse-engineer means that doing DRE on the obfuscated program requires more resources (see Section 7.2) than doing it on the original program.
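A toy Python example may help to fix these two definitions. The secret value 1337, the mask and all function names are invented for illustration. The second function is a hand-obfuscated variant of the first, and the helper compares their outputs on a set of inputs, which gives evidence of equivalence for those inputs only and is not a proof.

```python
def check_original(value):
    """Original program: accept exactly one secret value."""
    return value == 1337

def check_obfuscated(value):
    """Hand-obfuscated variant: the constant 1337 never appears as such."""
    mask = 0x5A5A
    # (value ^ mask) equals (0x539 ^ mask) only when value == 0x539 == 1337.
    return ((value ^ mask) - (0x539 ^ mask)) == 0

def equivalent_on(inputs, program_a, program_b):
    """Check that two programs give equal outputs for the given inputs."""
    return all(program_a(x) == program_b(x) for x in inputs)
```

Calling equivalent_on(range(-5000, 5000), check_original, check_obfuscated) compares the two versions on ten thousand inputs.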
7.2 The problem
Before presenting different obfuscation techniques we would like to point out that there is no way to create a perfectly secure or unbreakable system. It is also impossible to create a system that cannot be reverse-engineered (given unlimited resources). The only issue is the amount of resources required to break the system. This means that any system could be broken if it is cost-effective to do so.
Three main resources are needed in order to reverse-engineer any system:
tools - in the case of digital reversing this includes computational power, e.g. a powerful computer.
skills/knowledge - having a modern computer does not help to reverse-engineer a system if you do not know how to use it for DRE.
time - how much time someone is willing to spend on this task.
Footnote 1: In the case of programs that use heuristics or calculate an irrational number (e.g. π), the definition of equality between two outputs can be soft, e.g. equal to the third decimal place: 3.141 ≈ 3.141592.
7.2.1 Why obfuscate?
If there is no sound protection against reverse engineers, why then should one obfuscate programs?
As already mentioned in Section 7.2, reverse-engineering a system requires a certain amount of resources. The aim of obfuscation is to make the reverser's work very difficult, i.e. to discourage as many reversers as possible by using obfuscation techniques that require a lot of resources to reverse. Most of the time obfuscation techniques put the emphasis on two resources: time and knowledge.
All obfuscation techniques require knowledge about how to bypass (or remove) them. Some obfuscation techniques might require trying all possibilities in order to reverse them (e.g. finding a password using exhaustive search [39]), and even with a powerful computer this could take a lot of time. For example, if a string was encrypted and you do not have the decryption key, which is 1024 bits long, it could take up to 4 × 10^275 times the age of the universe to find the key using exhaustive search (see figure 7.1).
Key length = 1024 bits.
Number of possible values = 2^1024 ≈ 2 × 10^308.
If trying 10^15 possibilities takes 1 second, in the worst case it will take 2 × 10^293 seconds to find the key.
2 × 10^293 seconds ≈ 3 × 10^291 minutes ≈ 5 × 10^289 hours ≈ 2 × 10^288 days ≈ 6 × 10^285 years ≈ 4 × 10^275 times the age of the universe.
Figure 7.1: Time needed to find a 1024-bit secret key using exhaustive search.
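The orders of magnitude in figure 7.1 can be rechecked with exact integer arithmetic, as in the Python sketch below. Two values are assumptions that the figure does not state: a year of 365.25 days and an age of the universe of about 13.8 billion years. The helper sci is a hypothetical formatting function that truncates rather than rounds.

```python
def sci(n):
    """Format a large positive integer as truncated scientific notation."""
    digits = str(n)
    return f"{digits[0]}.{digits[1:3]}e{len(digits) - 1}"

KEY_BITS = 1024
TRIES_PER_SECOND = 10**15
SECONDS_PER_YEAR = 31_557_600            # assumption: 365.25 days
AGE_OF_UNIVERSE_YEARS = 13_800_000_000   # assumption: about 13.8 billion years

keys = 2**KEY_BITS                       # number of possible keys
seconds = keys // TRIES_PER_SECOND       # worst-case search time
years = seconds // SECONDS_PER_YEAR
ages = years // AGE_OF_UNIVERSE_YEARS

for label, value in [("keys", keys), ("seconds", seconds),
                     ("years", years), ("ages of the universe", ages)]:
    print(f"{label}: {sci(value)}")
```

Integers are used on purpose: 2^1024 is slightly too large for a double-precision float, so converting it with float() would raise an overflow error.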
Each obfuscation technique might protect a given system against some reversers. Somebody who knows how to circumvent only one type of protection would not be able to reverse-engineer an obfuscation technique of another type. By adding one more obfuscation technique to our program we increase its level of protection.
Some obfuscation techniques are easier to bypass than others, but it is often difficult to say whether a given obfuscation technique is indeed easier to bypass than another one.
Generally, an obfuscation technique protects against one or two (sometimes more) reversing techniques, so combining several obfuscation techniques is very useful. By combining many different obfuscation techniques it is possible to reduce dramatically the number of potential reverse engineers who could successfully reverse the program.
Of course, an experienced reverse engineer has enough skill to bypass many different obfuscations, so it is impossible to guarantee that a given program cannot be reverse-engineered.
7.3 Anti-reversing techniques
Many anti-reversing techniques exist nowadays. The challenge of obfuscation is to create irreversible or very difficult to reverse transformation(s) that complicate the reversing process but do not affect the program's execution. In the case of a compiled program some such transformations are done by the compiler (in a compiled version of a program there are no variable or function names), but these are only minor transformations compared to other obfuscation techniques used nowadays.
An irreversible transformation is preferable, because the obfuscated program is then harder to reverse-engineer: there is no way to reverse the transformation. Of course, the program could be encrypted and the key kept secret (see Section 7.3.1). This way nobody would be able to do DRE of the program, but nobody would be able to execute it either!
Performing irreversible transformations while keeping the program executable is almost impossible; that is why most obfuscation techniques use reversible transformations and other tricks that make them as difficult to reverse as possible.
All obfuscation techniques somehow reduce the performance of the obfuscated program or function. Some obfuscations increase the size of the final executable file, others reduce the execution speed of the program, and some do both.
Most of the time obfuscating the entire program is pointless, because the majority of the code is not sensitive and because of the performance degradation of the obfuscated program. Thus it is a good idea to obfuscate only sensitive code and sensitive data.
Obfuscation is always a trade-off between how concerned the programmer is about the program being reversed and the performance degradation.
Several of the most common code obfuscation techniques are presented below.
7.3.1 Packing techniques
One of the most common techniques used to obfuscate an executable file is packing. In a normal executable file anyone can read the content and see the assembly code. In order to hide the code, the content of the file is modified: the actual code is packed and an unpacker's code is added to the file. When the new file is executed, the unpacker extracts the hidden code and then jumps into the original (extracted) program. By simply reading the file's content it is not possible to see the original assembly code, but only the unpacker's assembly code and the packed version of the original code (see figure 7.2).
In order to execute a packed program, the transformation that the original code undergoes has to be reversible. So, basically, the packing algorithm can use one of three ways to pack the code:
encryption
compression
virtualization
Encryption
In Greek κρυπτός [kryptos] means hidden, secret. Cryptology is the science of hiding information (see footnote 2). In the case of anti-reversing techniques, we aim at hiding the assembly code by encrypting it.
Footnote 2: Packing uses only encryption. However, modern cryptology also studies digital signatures, protocols and data integrity.
Figure 7.2: General scheme of a packed program.

Imagine that Alice wants to send a message to Bob (see footnote 3). If Alice does not care whether somebody else reads it, she can simply send it as a plaintext message. If she does not want anyone except Bob to read this message, she has to transform the message from plaintext to ciphertext before sending it. In this way nobody except Bob should be able to reverse this transformation and read the original message. However, Bob needs to know the secret (the decryption key) that allows him to recover the original plaintext message from the ciphertext message that he received.
Definition 18. Plaintext or cleartext is the original message before encryption.
Definition 19. Ciphertext is the plaintext after the encryption.
Definition 20. An encryption (also called enciphering), is a mapping of plaintext to
ciphertext, based on some chosen keytext. It is performed by a stepwise application of a
(more or less formalized) encryption algorithm (definition from [2]).
In other words, encryption is the process of transforming a plaintext into a ciphertext. Decryption is the reverse transformation.
Here is the general scheme (see figure 7.3): there are two algorithms, the encryption algorithm and the decryption algorithm, and two corresponding keys.
Footnote 3: Alice and Bob are common names used in cryptography for convenience (rather than party A and party B) to designate the participants of protocols.
Definition 21. A key is a parameter that determines the output of an encryption (or decryption) algorithm.
Figure 7.3: General scheme for sending encrypted message.

A long time ago even the encryption and decryption algorithms were part of the secret. Nowadays this is considered bad practice, as is any kind of security through obscurity. In modern cryptography the algorithms are publicly known and studied in order to test their properties (e.g. resistance against different kinds of attacks). The only secret is the decryption key.
Definition 22. Security through obscurity is a way of securing a system by hiding how it was secured.
Modern cryptography relies on another principle, Kerckhoffs's principle, which is the opposite of security through obscurity.
Definition 23. Kerckhoffs's principle says that a cryptographic system has to be secure even if the potential enemy knows everything about the system except the secret key.
In order to ensure that a random person who does not know the secret cannot recover the plaintext from the ciphertext, algorithms rely on different principles.
Asymmetric encryption algorithms rely on difficult problems, e.g. discrete logarithm and integer factorization.
A problem is considered difficult if no efficient algorithm for solving it is known; the naive way to solve such a problem is exhaustive search.
Definition 24. Exhaustive search (or brute-force search) is a problem-solving technique that consists of systematically enumerating all possible candidates for the solution and checking whether each candidate satisfies the problem's statement.
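Definition 24 can be illustrated by a small Python sketch that recovers a very short key by trying every candidate. The cipher is a toy repeating-key XOR chosen only to keep the example short (it is not a secure algorithm), and the sketch assumes that the attacker knows the first bytes of the plaintext; all names are hypothetical.

```python
from itertools import product

def xor_bytes(data, key):
    """Toy cipher (not secure): XOR data with a repeating key."""
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))

def brute_force_key(ciphertext, known_prefix, key_length):
    """Enumerate every key of key_length bytes until the prefix matches."""
    head = ciphertext[:len(known_prefix)]
    for candidate in product(range(256), repeat=key_length):
        key = bytes(candidate)
        # Check whether this candidate satisfies the problem statement.
        if xor_bytes(head, key) == known_prefix:
            return key
    return None
```

A 2-byte key means at most 65,536 candidates. Every additional key byte multiplies the work by 256, which is why key length has such a strong effect on the time needed.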
Symmetric encryption algorithms rely on transformations that are practically irreversible unless the secret key is known. An exhaustive search for such a key could take a lot of time (see figure 7.1).
Note that in order to unpack, i.e. decrypt, the original code, the decryption algorithm needs the decryption key. If we want our program to be executed on many different computers without our intervention, the decryption function and the corresponding key have to be stored in the executable file. This means that they can be found by a reverse engineer. If a reverser finds only the decryption algorithm (which is easy, because the packed program starts by executing this algorithm), then by Kerckhoffs's principle our program is still safe; but should he also find the decryption key, he would be able to decrypt the original code.
Hiding the decryption key in the executable file is equivalent to security through obscurity, because we are hiding the information about where the key is stored, and once it is found the system is no longer secure. Therefore the key should be well hidden in the file (see Section 7.3.6).
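The weakness described above can be shown with a toy packer in Python. Everything in it is an assumption made for illustration: the XOR "encryption" is not a real cipher, the STUB bytes merely stand in for unpacker code, and the file layout (stub, key length, key, encrypted code) is invented.

```python
STUB = b"UNPACKER"   # placeholder standing in for real unpacker code

def xor_bytes(data, key):
    """Toy cipher (not secure): XOR data with a repeating key."""
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))

def pack(code, key):
    """Assumed layout: stub | key length (1 byte) | key | encrypted code."""
    return STUB + bytes([len(key)]) + key + xor_bytes(code, key)

def unpack(packed):
    """What the stub does at start-up, and what a reverser can replay."""
    offset = len(STUB)
    key_length = packed[offset]
    key = packed[offset + 1:offset + 1 + key_length]
    return xor_bytes(packed[offset + 1 + key_length:], key)
```

The packed file no longer contains the original bytes in clear, but unpack needs nothing that is not already in the file, so anyone who understands the layout can recover the code.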
Compression
Another technique that is used for code obfuscation is compression. Normally compression is used in order to save space on a storage device or to use less bandwidth in the case of network transfers.
Definition 25. Data compression is the reduction in the amount of signal space that must be allocated to a given message set or a data sample set. Definition from [32].
In other words, data compression is the process of transforming data into a format that requires less space.
The idea of packing executables is to hide the original assembly code. Since some compression techniques heavily transform the original data, the original assembly code cannot be found by simply reading the transformed file, so this method can also be used as an anti-reversing technique.
Compression is done by reducing the amount of redundant information. There are
two kinds of data compression:
lossless data compression
lossy data compression
Definition 26. Lossless data compression is a kind of data encoding method that uses statistical redundancy in data in order to compress it. Lossless data compression allows the original data to be reconstructed from the compressed data without losses.
Definition 27. Lossy data compression is a kind of data encoding method that exploits the fact that in some cases a part of the original data can be discarded.
The kind of data compression method that should be used depends on the application. Lossy data compression is mostly used to compress images and audio. Lossless data compression is used when information loss is not tolerable, for example for text. Lossless data compression is possible because there are many redundancies in real-world data (see figure 7.4).
In the case of packing computer programs only lossless data compression may be used; otherwise the processor would not be able to execute the decompressed program, since it would no longer correspond to its original version.
Phrase:
There are two types of data compression: lossless and lossy.
Length of this phrase is 60 symbols
If following replacements are done:
string new symbol
loss
ss
re
The same phrase could be rewritten:

The a two types of data compion: le and y.
The length of the new phrase is 48 symbols. The table of replacements needs an additional storage of 11 symbols. The total space used to store the compressed phrase is 59 symbols.
Figure 7.4: Example of lossless data compression.
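The lossless round trip that packing by compression depends on can be sketched with Python's standard zlib module. The function names are hypothetical, and zlib is used here only as a convenient stand-in for whatever compression algorithm a real packer embeds.

```python
import zlib

def pack_compressed(code):
    """Packer side: compress the original code bytes."""
    return zlib.compress(code, 9)

def unpack_compressed(packed):
    """Unpacker side: restore exactly the original bytes."""
    return zlib.decompress(packed)
```

If the round trip changed even one byte, the unpacked program could crash or behave differently, which is why lossy methods are ruled out for code.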

Virtualization
Virtualization is slightly different from the two previous techniques. During the packing process all instructions of the original program are translated into instructions of a nonexistent virtual machine (see Section 6.2.1).
This virtual machine must not already exist, because otherwise packing would not really hide the original code. Generally, automated packers that embed virtual machines into the original code generate a random instruction set for the new virtual machine and then translate the original program into it.
Definition 28. An instruction set is a list of all the instructions that a processor (or, in the case of a virtual machine, an interpreter) can execute.
See figure 7.5. See more about instruction sets in [51].
Instruction | Name                         | Description
ADD         | Addition                     | regA ← regB + regC
ADDI        | Add Immediate                | regA ← regB + immediate
NAND        | Not And                      | regA ← NOT (regB AND regC)
LUI         | Load Upper Immediate         | regA ← immediate + 0xFFC0
LW          | Load Word                    | regA ← Mem[regB + immediate]
SW          | Store Word                   | regA → Mem[regB + immediate]
BEQ         | Branch If EQual              | if (regA == regB): PC ← PC + 1 + immediate; else: PC ← PC + 1
JALR        | Jump And Link Using Register | PC ← regB, regA ← PC + 1
Figure 7.5: Instruction set of a reduced instruction set computing (RISC) processor (developed for learning purposes by Bruce Jacob from the University of Maryland [26]). PC - program counter (sometimes called IP, instruction pointer).
In the case of virtualization, there are two options for the unpacker. The first option is the same as in the case of encryption or compression: translating all instructions back into the original program before jumping into it.
In the second case the unpacking algorithm does not just unpack the original code and jump into it; it executes the code like an interpreter, reading and then executing instruction after instruction. This means that only the code of the interpreter is loaded into memory, while the code of the original packed program remains in the file. The virtual machine reads one (or several) instructions from the file, translates them into real instructions (instructions that can be executed by the physical machine) and then executes these instructions.
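The interpreter variant can be sketched in Python with a deliberately tiny virtual machine. This is an illustration under several assumptions and not how any particular packer works: the machine is stack-based with only four operations (unlike the register machine of figure 7.5), each instruction is encoded as one opcode byte plus one operand byte, and the opcode values are drawn at random from a seed.

```python
import random

REAL_OPS = ["PUSH", "ADD", "MUL", "HALT"]

def make_instruction_set(seed):
    """Assign a random opcode byte to each operation (assumed scheme)."""
    rng = random.Random(seed)
    codes = rng.sample(range(256), len(REAL_OPS))
    return dict(zip(REAL_OPS, codes))

def translate(program, instruction_set):
    """Packer side: encode (operation, operand) pairs as virtual bytecode."""
    bytecode = bytearray()
    for operation, operand in program:
        bytecode += bytes([instruction_set[operation], operand])
    return bytes(bytecode)

def interpret(bytecode, instruction_set):
    """Unpacker side: execute the bytecode one instruction at a time."""
    decode = {code: name for name, code in instruction_set.items()}
    stack, pc = [], 0
    while pc < len(bytecode):
        operation, operand = decode[bytecode[pc]], bytecode[pc + 1]
        pc += 2
        if operation == "PUSH":
            stack.append(operand)
        elif operation == "ADD":
            stack.append(stack.pop() + stack.pop())
        elif operation == "MUL":
            stack.append(stack.pop() * stack.pop())
        elif operation == "HALT":
            break
    return stack[-1]
```

The same program translated with two different seeds generally yields different bytecode, yet both versions compute the same result when run with their own instruction set. The original instructions are never reconstructed as a whole.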
Packing summary
Packing techniques are not very difficult to implement, but they have two disadvantages:
executable file size increase - the obfuscated program contains the original program and the unpacker's code (which is usually small in relation to the original code).
execution speed decrease - in the case of encryption or compression there is an additional initialization phase during which the original code is unpacked; in the case of a virtual machine the interpreter has to translate instructions (thereby adding a delay).
The basic use of packing techniques protects against simple reading of the executable file, but it can be circumvented more or less easily.
Note that the unpacker's code and the packed original code are in the same file, so the reverser has access to both of them and can use the unpacker to access the original code. That is why in this case encryption does not add any security in the cryptographic sense. Nevertheless, encryption can bring more security in terms of obfuscation of the file, because in order to unpack it the reverser needs two things: the algorithm and the key, which could be very well hidden somewhere inside the executable file.
Basically there are two ways to access the original unpacked assembly code in order
to reverse engineer it and as for almost all techniques one of them is passive and the
other is active.
Passive techniques consist of finding the unpackers code in the packed executable
file and write a little program that would use it in order to unpack the original assembly
code.
An active technique means that the unpacker would unpack the original code into the
memory at the beginning of the execution. So the active technique consists of executing
the packed program and dumping its memory (see Section 6.6) just after the unpacking
is finished. In order to stop the program at the right moment it could be executed in the
debugger environment (see Section 6.4) with a breakpoint placed after the unpackers
code.
The advantage of active reversing in this case is that the reverser does not need to
understand how the unpackers algorithm works.
Note, that if the virtual machine works as an interpreter, the memory dump would
not work, because the only code that would be in memory all the time is the interpreters
code (see Section 7.3.1).
These techniques executed by hand are acceptable if the original program was
packed only a few times. But since the aim of obfuscation is to discourage as many
reversers as possible, nowadays packing techniques are not used in their basic form.
Here are several improvements and upgrades used to render the basic packing scheme
more resistant to DRE.
Improvements to basic packing techniques
A first and very simple improvement is to pack the program more than once, even
up to 100 times. There exist many automated packers, e.g. Armadillo, ASProtect,
FSG, MoleBox, PECompact, UPX, WinUPack, etc. In this way retrieving the original
program manually is no longer a viable option. As a countermeasure to this improvement
there exist automated unpackers, e.g. Renovo, PolyUnpack, UUnP (see also Section 6.8
about reversing tools).
The second upgrade used in order to discourage as many reversers as possible, is to
pack different parts of the original program separately. These parts could be packed
with different algorithms. Then, they could be unpacked at once or only as soon as they
are needed during the execution of the program (if different branches of the program
are packed separately, the entire code of the program would never be present in the
memory at one given moment). Another interesting addition that could be made, in
order to protect the program against active reversing: as soon as an unpacked part of
the program is used it could be repacked back again, and thus once again, the entire
code of the original program would never be in the memory, making a memory dump
less useful.
Finally, one of the most interesting things that could be done in order to improve
packing is finding better ways to hide the decryption key (if encryption is used). If
it is simply stored somewhere in the file (see also Section 7.3.6 about steganography) it
could be found more or less easily, so the first thing to do is to break the key into pieces
and hide each one of them separately. Different keys could be used in order to decrypt
different parts of the code. If different parts of the original code are packed separately,
the key needed to decrypt the next part could be hidden in the previous part. Even
better: the key does not need to be stored in the file at all; it could be calculated
on the fly during the execution, e.g. using several variables of the original program.
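The last two ideas (a key for the next part hidden in the previous part, and a key calculated on the fly) can be illustrated by a minimal sketch. It assumes a toy one-byte XOR in place of a real cipher; the names xor_bytes, derive_key, pack_chain and unpack_chain, as well as the derivation rule itself, are hypothetical and chosen only for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Toy "encryption": XOR every byte with a one-byte key (stands in for a real cipher).
std::string xor_bytes(const std::string& data, std::uint8_t key) {
    std::string out = data;
    for (char& c : out) c = static_cast<char>(static_cast<std::uint8_t>(c) ^ key);
    return out;
}

// Hypothetical derivation rule: the next key is computed from the clear bytes
// of the previous part, so it is never stored in the packed file.
std::uint8_t derive_key(const std::string& clear_part) {
    std::uint8_t k = 0x5A;
    for (char c : clear_part)
        k = static_cast<std::uint8_t>(k * 31 + static_cast<std::uint8_t>(c));
    return k;
}

// Packs every part with its own key; only the first key has to be hidden somewhere.
std::vector<std::string> pack_chain(const std::vector<std::string>& parts,
                                    std::uint8_t first_key) {
    std::vector<std::string> packed;
    std::uint8_t key = first_key;
    for (const std::string& p : parts) {
        packed.push_back(xor_bytes(p, key));
        key = derive_key(p);  // key for the next part
    }
    return packed;
}

// Unpacks the parts in order; part i+1 cannot be unpacked before part i.
std::vector<std::string> unpack_chain(const std::vector<std::string>& packed,
                                      std::uint8_t first_key) {
    std::vector<std::string> parts;
    std::uint8_t key = first_key;
    for (const std::string& p : packed) {
        parts.push_back(xor_bytes(p, key));
        key = derive_key(parts.back());
    }
    return parts;
}
```

With such a chain a reverser who finds one key can unpack only one part directly; every following part requires the previous part to be unpacked first.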
7.3.2 Control flow obfuscation
Different techniques of control flow obfuscation of a program work against human reversers as well as against automated reversers.
Definition 29. Control flow is the order in which instructions of a program are executed.
The main idea of all control flow obfuscation techniques is to transform the program
in such a way that it becomes difficult to follow the logical order of instructions.
Normally, the logical order of instructions follows the physical order of instructions, except for branching instructions such as jump or call (other exceptions are interrupts
provoked by different events, e.g. a keystroke).
All control flow obfuscations introduce additional branches into the program, so it
becomes harder to follow the control flow.
The most common control flow obfuscation techniques are listed below:
Mixing the code
The idea is to break the original code of several functions into little chunks and mix
them, using jump instructions to go through them in the right order. See the example
in figure 7.6.
Listing 7.1: Original functions

    void function1() {
        f1_segment1;
        f1_segment2;
        f1_segment3;
    }
    int function2() {
        f2_segment1;
        f2_segment2;
        f2_segment3;
    }
    string function3() {
        f3_segment1;
        f3_segment2;
        f3_segment3;
    }
Listing 7.2: After obfuscation

    f2_segment3;        // end of f2
    f1_segment2;
    goto f1_segment3;
    f2_segment1;        // entry of f2
    goto f2_segment2;
    f3_segment1;        // entry of f3
    goto f3_segment2;
    f2_segment2;
    goto f2_segment3;
    f3_segment2;
    goto f3_segment3;
    f1_segment1;        // entry of f1
    goto f1_segment2;
    f3_segment3;        // end of f3
    f1_segment3;        // end of f1
Figure 7.6: Example of mixing 3 functions, each of them separated into 3 pieces.
Example inspired by [19].
Generally, mixing the code of several functions is more effective if it is combined
with obfuscations of conditions instead of simple goto statements.
This technique confuses human reversers and can confuse some automated deobfuscators (most of the time they fail to reconstruct the correct functions).
Mixing several functions in such a way has very little impact on the size of the code
as well as on the execution speed of the obfuscated program.
Jump tables
This obfuscating technique is very similar to the previous one (mixing the code). However, the two are always presented separately, because they use different code
structures and because the techniques used to deobfuscate them are different.
The idea of jump tables is the same as in the previous technique - break the original
code of one or several function(s) into short parts and mix them.
The difference lies in the way of determining which segment should be executed
next. If the segments are mixed using the technique described earlier, each block of
the code passes the control flow to the (logically) next block. In the case of jump
tables, each little block ends by jumping into a control loop - a piece of code that
determines which block has to be executed next. In other words, with the mixing code
obfuscation each block of the code knows its next neighbour, whereas with jump tables
each block knows only itself and the control loop; see figure 7.7.
Each part of the code sets a special unique value (usually something that
corresponds to an index in a table or a pointer), so that the control loop knows which
part to execute next. See the example in figure 7.8.
Figure 7.7: Structure of control flow obfuscations that mix different blocks of the code.
This kind of control flow obfuscation reduces the readability of the code which can
confuse human reversers as well as repel automated deobfuscators.
An improvement that could be made on the idea of jump tables is to add a second
layer of indirection, i.e. to have a second table, which could be filled in at run time.
Indexes in that table would give the indexes in the first jump table, and the code would
use indexes of the second table; see figure 7.9.
Jump tables, especially jump tables with several levels of indirection, dramatically
reduce the execution speed of the obfuscated program or function. This is due to the
additional jumps that the control flow has to follow; jump instructions, and indirect
jumps in particular, are among the most costly instructions that a processor can execute.
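The control loop and the optional second layer of indirection can be sketched in C++ as follows. This is only an illustration under simplifying assumptions: a switch statement stands in for the indirect jump, each "statement" just appends its number to a trace, and the names Block, run_obfuscated and second are hypothetical. The index starts at -1 so that the first lookup reads entry 0 of the table.

```cpp
#include <array>
#include <cassert>
#include <string>

// Physical blocks of the obfuscated code, in the order in which they are stored.
enum Block { ADR1, ADR2, ADR3, ADR4, ADR5, ADR6, ADR7, END };

// Executes statements 0..6 in their logical order although the blocks are shuffled.
std::string run_obfuscated(bool second_layer) {
    // jump_table[i] is the block that holds statement i (entry 7 ends the run).
    const std::array<Block, 8> jump_table = {ADR7, ADR4, ADR2, ADR6,
                                             ADR1, ADR3, ADR5, END};

    // Second table, filled in at run time: slot (3*i mod 8) holds index i
    // of the first table (3 is coprime with 8, so every slot is used once).
    std::array<int, 8> second{};
    for (int i = 0; i < 8; ++i) second[(3 * i) % 8] = i;

    std::string trace;
    int idx = -1;  // nothing executed yet
    for (;;) {     // control loop
        int next = idx + 1;
        assert(next >= 0 && next < 8);
        if (second_layer) next = second[(3 * next) % 8];  // extra indirection
        switch (jump_table[next]) {
            case ADR1: trace += '4'; idx = 4; break;
            case ADR2: trace += '2'; idx = 2; break;
            case ADR3: trace += '5'; idx = 5; break;
            case ADR4: trace += '1'; idx = 1; break;
            case ADR5: trace += '6'; idx = 6; break;
            case ADR6: trace += '3'; idx = 3; break;
            case ADR7: trace += '0'; idx = 0; break;
            case END:  return trace;
        }
    }
}
```

Every block returns to the control loop, so a reverser who reads the blocks in their physical order (4, 2, 5, 1, 6, 3, 0) learns nothing about the logical order without also analysing the tables.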
Inlining and outlining
Inlining duplicates some parts of the code (see more in Section 5.2.2) in order to improve
execution speed.
Taking a function and duplicating its code makes the work of a reverse engineer
harder. Inlining reduces the abstraction created by the programmer in the first place.
The reverser does not know that the code he is looking at is actually an inlined function.
Outlining is the inverse transformation, i.e. a part of the code is transformed
into a function. This obfuscation is effective if a function is created from a random piece
of code, because it will confuse human reversers.
Listing 7.3: Original code

    statement 0;
    statement 1;
    statement 2;
    statement 3;
    statement 4;
    statement 5;
    statement 6;

Listing 7.4: After obfuscation

    jump_table:
        [adr7, adr4, adr2, adr6,
         adr1, adr3, adr5, end]   // entries are indexed from 0

        idx = -1;                 // nothing executed yet
    control:
        goto jump_table[idx+1]

    adr1: statement 4;
          idx = 4;
          goto control;
    adr2: statement 2;
          idx = 2;
          goto control;
    adr3: statement 5;
          idx = 5;
          goto control;
    adr4: statement 1;
          idx = 1;
          goto control;
    adr5: statement 6;
          idx = 6;
          goto control;
    adr6: statement 3;
          idx = 3;
          goto control;
    adr7: statement 0;
          idx = 0;
          goto control;
    end:  continue;

Figure 7.8: Example of use of jump tables.

Another idea consists in combining inlining with outlining: several copies of one
function are created and all of them are used in the main program. This
adds confusion for reversers.
Condition obfuscation
All techniques of control flow obfuscation play with branches. This means that, in the case
of conditional branching, obfuscating the conditions that decide whether a branch is taken raises the
level of difficulty of understanding the program, i.e. raises the level of protection against
DRE.
Such obfuscated conditions are also called opaque predicates.
Figure 7.9: Example of jump table with a layer of indirection.

Definition 30. Opaque predicate is a logical statement whose outcome is constant
(always true or always false) and is known in advance (by the programmer). Definition
from [19].
One of the main ideas of condition obfuscation consists in creating a condition which
is always true (or always false).
For example, see figure 7.10. The result of the condition in listing 7.5 is
always true and the program will never execute the else statement.
The result of the condition in listing 7.6 is always false and the program
will always execute the else statement.
The simple examples from listings 7.5 and 7.6 could be easily reverse-engineered by
human reversers and also by automated control-flow deobfuscators (see Section 6.8).
Generally, in order to confuse reversers, the values used inside opaque predicates have
to be generated during the execution of the program.
For example, one thread of the program can generate random values that
adhere to some rule (e.g. be greater than some fixed value, be divisible by 7, etc.) and
store them in a place that is accessible by the main thread. In its turn, the main thread
uses these values in opaque predicates. Knowing the rule(s) used to generate the
random values, there always exists a way to create a fake control flow statement (i.e. a
branching point with a condition which is always true or always false). Such an approach
is described in [19].

Listing 7.5: Condition is always true

    bool bb;
    // ... some actions ...
    if (bb or not bb) {
        cout << "..that is the question" << endl;
    } else {
        cout << "Problems with logic?!" << endl;
    }

Listing 7.6: Condition is always false

    int n;
    // ... some actions ...
    if (n == n+1) {
        cout << ".oO(Problems with math?)" << endl;
    } else {
        cout << "Everything is fine." << endl;
    }

Figure 7.10: Example of opaque predicates in C++.
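The idea of opaque predicates built on values generated at run time can be sketched as follows. This is a minimal single-threaded illustration, not the construction of [19]: the rule "multiple of 7" is the one mentioned in the text, while the function names (make_guard_value, opaque_true, opaque_true_arith, protected_compute) are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <random>

// Producer side: returns a random value that obeys the hidden rule "multiple of 7".
std::uint32_t make_guard_value(std::mt19937& rng) {
    std::uniform_int_distribution<std::uint32_t> dist(1, 1000000);
    return 7 * dist(rng);
}

// Consumer side: looks like a data-dependent test, but it is always true
// for values produced by make_guard_value.
bool opaque_true(std::uint32_t guard) {
    return (guard % 7) == 0;
}

// Always-true predicate that needs no hidden state: x*(x+1) is a product of two
// consecutive integers, hence even (this still holds after wrap-around modulo 2^32).
bool opaque_true_arith(std::uint32_t x) {
    return ((x * (x + 1)) % 2) == 0;
}

// The else branch is a fake control flow path that is never executed.
int protected_compute(int input, std::uint32_t guard) {
    if (opaque_true(guard)) {
        return input * 2;      // real code
    } else {
        return input - 12345;  // bogus code
    }
}
```

A static analysis that does not know the rule used by the producer has to treat both branches of protected_compute as reachable.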
Reordering
Generally, reversers rely on the locality of the code i.e. assume that operations that
reside near each other are somehow codependent.
The idea here is to randomize the order of operations as much as possible. This is
not always possible, because many operations are codependent. The order of operations
that are not codependent could be randomized, see figure 7.11.
This transformation has almost no effect on the performance of the final code. This
technique is not very effective against automated deobfuscators but can confuse most
human reversers.
7.3.3 Detection of digital reverse engineering
One obfuscating technique consists in detecting that someone is trying to reverse-engineer the program and then taking some precautions in order to complicate the reversing
process.
Basically, if the program detects that it is being reverse-engineered, it will jump into
code that was concealed in order to confuse reversers (see figure 7.12).
Since a program can detect that it is being analyzed only during its execution, this method
of obfuscation mostly protects against live code analysis and patching.
Generally, one of three actions is taken in order to stop the reverser:
Stop the program - this is the simplest thing to do; it forces the reverser to
patch the obfuscated program (i.e. to disable the part of the code that detects that the
program is being reversed) in order to proceed with the live-code analysis.
Execute meaningless code - this is another option; in this case the reverser will spend his
time trying to understand a meaningless and irrelevant part of the program. This
change in the execution may go unnoticed by the reverser, and will not be detected by
automated deobfuscators (because an automated deobfuscator cannot distinguish
between interesting and meaningless code).
Crash the system - this is the most radical thing that can be done. The most harmless variant
is deleting or wiping (see footnote 4) the executable file of the program. The most harmful variant
consists in crashing the entire system, in order to do maximum damage.

Listing 7.7: Normal order of operations

        ; get parameter
        mov EAX, [EBX]
        push EAX
        call foo            ; result in ECX
        pop EAX

        ; get parameters
        ; for loop condition
        mov EDX, [EBP+14]
        mov ESI, [EDX]
    loop:
        ; operations
        cmp ECX, ESI
        je loop

Listing 7.8: Randomized order of operations

        mov EAX, [EBX]
        mov EDX, [EBP+14]
        push EAX
        mov ESI, [EDX]
        call foo            ; result in ECX
    loop:
        ; operations
        cmp ECX, ESI
        je loop
        pop EAX

Figure 7.11: Example of randomization of the order of operations.
The methods used to detect different reversing tools and reversing techniques
are varied. Here we present the most common techniques used to detect the presence of
the most widespread reversing tools.
Detecting debuggers
A debugger is one of the most used reversing tools (see Section 6.4). Obfuscators are
thus very interested in ways that can detect the presence of a debugger.
Obfuscation techniques that detect the presence of a debugger are more effective if they are
combined with packing techniques. If automated unpackers are unable to extract the
original code from the obfuscated file, the reverser is forced to use a debugger in order
to analyze it.
Techniques used for detecting debuggers have two significant disadvantages:
Almost all solutions are platform-specific - this means that the programmer has to
know what kind of system will execute his program. One way to reduce the impact
of this disadvantage is to implement several debugger-detecting techniques.
False positives could be generated by the part of the code that detects the debugger, so the final code can malfunction even if no debugger is present.
Figure 7.12: Flowchart: idea of behavior in case of detection of DRE.
Footnote 4: When a file is wiped, some meaningless data is written into it before the file is deleted,
in order to overwrite the original data. In this way, even if the storage blocks are restored, the
original data cannot be retrieved.
Generally, it is easier to detect a user-mode debugger than a kernel-mode debugger.
This is due to the fact that a kernel-mode debugger has less direct impact on the
program. Kernel-mode debuggers are not attached to the process directly, but observe
the program from the kernel (see more in Section 6.4).
Since almost all solutions for detecting the presence of a debugger are platform-specific, it is difficult to describe all of them. Some general ideas that are
used to detect debuggers are presented below.
Some systems have an application programming interface (API) which returns true if
a user-mode debugger is present, e.g. the IsDebuggerPresent API in Windows. This method
is not very effective because an API call is easy to detect and easy to bypass. However,
it could be improved by inlining the code of the API (see Section 5.2.2 about inlining)
in the program instead of calling it.
Windows also allows a program to make a SystemKernelDebuggerInformation request, which
returns a structure (see figure 7.13). This structure shows whether a kernel debugger is present
and whether it is enabled.
On Linux there is also a way to discover whether the program is being traced; see the
example in figure 7.14.
Another approach is more generic and consists in the use of the Trap Flag. The
Trap Flag is a flag defined for x86 processors.
Listing 7.9: Structure SYSTEM_KERNEL_DEBUGGER_INFORMATION

    typedef struct SYSTEM_KERNEL_DEBUGGER_INFORMATION {
        bool DebuggerEnabled;
        bool DebuggerNotPresent;
    } SYSTEM_KERNEL_DEBUGGER_INFORMATION,
     *PSYSTEM_KERNEL_DEBUGGER_INFORMATION;

Figure 7.13: Structure returned by the request SystemKernelDebuggerInformation.
Listing 7.10: debuggerPresent program for Linux

    #include <stdio.h>
    #include <stdlib.h>        /* EXIT_SUCCESS */
    #include <sys/ptrace.h>

    int main(void) {
        printf("Hello!\n");
        /* PTRACE_TRACEME fails (returns -1) if the process is already traced */
        long trace = ptrace(PTRACE_TRACEME, 0, NULL, NULL);
        if (trace == -1) {
            printf("Debugger is present!\n");
        }
        return EXIT_SUCCESS;
    }

Output:

    /> ./debuggerPresent
    Hello!
    /> gdb ./debuggerPresent
    (gdb) run
    Hello!
    Debugger is present!
    />

Figure 7.14: C debuggerPresent program for Linux and its output.

When the Trap Flag is set, the processor executes only one instruction and
then raises an interrupt in order to allow the debugger to inspect the debuggee.
The idea is to enable the Trap Flag and check whether the exception was raised. If the
program does not receive the exception, it means that a debugger handled it instead, i.e. that a
debugger is present.
The use of checksums is another generic approach that makes it possible to detect whether a
debugger is present.
When debuggers set software breakpoints, they change the code of the program. If
the program checks its integrity, it will detect that its code was modified. Read more
about checksums in Section 9.3.
This method works only if the debugger sets a software breakpoint. It can be bypassed
by the use of hardware breakpoints; see Section 6.4 about software and hardware breakpoints.
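The effect of a software breakpoint on a checksum can be illustrated with a small simulation. The sketch below assumes that the code of a function is available as a plain byte buffer (reading the code section of a running process is platform-specific and is not shown) and that the debugger replaces the first byte of an instruction with the x86 INT3 opcode (0xCC). The checksum function and all names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Simple rolling checksum over a code region (a real protection would more
// likely use a CRC or a cryptographic hash).
std::uint32_t checksum(const std::vector<std::uint8_t>& code) {
    std::uint32_t sum = 0;
    for (std::uint8_t b : code) sum = sum * 33 + b;
    return sum;
}

// What a debugger does when it sets a software breakpoint on x86:
// it overwrites the first byte of an instruction with INT3 (0xCC).
void set_software_breakpoint(std::vector<std::uint8_t>& code, std::size_t offset) {
    assert(offset < code.size());
    code[offset] = 0xCC;
}

// Integrity check performed by the protected program itself.
bool code_was_modified(const std::vector<std::uint8_t>& code, std::uint32_t expected) {
    return checksum(code) != expected;
}
```

A hardware breakpoint does not modify the buffer, so the same check would not notice it.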
Sometimes debuggers may also be detected by measuring the time spent in a given
procedure. However, this method could give a lot of false positives, e.g. if the
process has a very low priority and is often rescheduled.
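A timing check of this kind can be sketched with the standard C++ clock. The names looks_debugged and timed_check, as well as the idea of passing the time limit as a parameter, are assumptions made for this illustration; choosing a limit that avoids the false positives mentioned above is the difficult part and is not addressed here.

```cpp
#include <cassert>
#include <chrono>

using Clock = std::chrono::steady_clock;

// Decision rule: the procedure took suspiciously long, e.g. because someone
// was single-stepping through it.
bool looks_debugged(std::chrono::microseconds elapsed, std::chrono::microseconds limit) {
    return elapsed > limit;
}

// Measures the time spent in a procedure and applies the rule above.
template <typename F>
bool timed_check(F&& sensitive_procedure, std::chrono::microseconds limit) {
    const Clock::time_point start = Clock::now();
    sensitive_procedure();
    const auto elapsed =
        std::chrono::duration_cast<std::chrono::microseconds>(Clock::now() - start);
    return looks_debugged(elapsed, limit);
}
```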
Detecting virtual environments
Since reversers very often work in virtual machines (VM) (see Section 6.2.1), it is interesting
from the obfuscator's point of view to be able to detect virtual environments.
The main ideas used for detecting virtual environments are similar to the ideas used for
detecting debuggers. Detection of virtual environments has the same general disadvantages:
All solutions are platform-specific
False positives
However there are several differences between detecting debuggers and detecting
virtual environments. Generally, detecting a virtual environment is much more difficult,
than detecting a debugger.
First of all, virtual machines do not set any software breakpoints, so checksums can
not help in detection of virtual environments.
Secondly, detecting that the program is executed inside a virtual machine is based
on some differences between real and virtual machines (e.g. some low-level operations
could have a different effect on a real and on a virtual machine). These differences exist
because it is almost impossible to create an environment which would perfectly simulate
the real machine. There would always be something missing in the virtual machine,
because even if a virtual machine perfectly simulates a real machine's hardware, the set
of possible interactions between a user and the virtual machine is different from the set
of possible interactions between a user and a physical machine.
Virtualization software usually reserves a rarely used key (or a combination of keys)
in order to switch from the virtual to the physical machine, i.e. to release the keyboard and mouse
controls. VirtualBox uses the right Ctrl key, so when the end user presses the right Ctrl
key, VirtualBox releases the mouse and keyboard from the virtual machine (which means that the
key press is not delivered to the virtual machine); see figure 7.15.
Imagine the following scenario: a program asks the user to press the right Ctrl key on
the keyboard. If VirtualBox is used (with its default configuration), the user would
not be able to press the right Ctrl key in the virtual machine. The program might also ask
the user to press a random key from the set of all keys used in order to switch from the virtual to the
physical machine. Of course, the virtualization software could implement a functionality
that sends the signal (used to switch from the virtual to the physical machine) to the virtual
machine. In that case the protected program might ask the user to press different keys and the
switching key in order to measure the time between two signals (two key presses). Such
tests are limited only by the imagination of the developer, so there is always a
more or less tricky way to detect a virtual environment.
Nowadays, virtual machines are becoming more and more sophisticated, and the differences
between real (physical) and virtual machines tend to disappear.
Figure 7.15: Virtualization : example of passing control (keyboard and mouse) from
virtual to physical machine.
Finally, virtual machines are not only used by reverse engineers (see more in Section 6.2.1). So even if a VM is detected, there is a good chance that the program is not
being reverse engineered.
Detecting patching
Reverse engineers use patching i.e. modification of the original code of the analyzed
program in several cases:
Disabling a part of a program is generally used for protection removal.
Forcing a branch is mostly used in live code analysis, when the reverser wants to
analyze a precise part of the code.
Modifying (sometimes adding) functionalities could be useful in some rare cases.
Also used in order to crack programs.
All methods that detect patching use error detection mechanisms (see more in Section 9.3.2).
The main idea is to add a control checksum to the file and to have a control procedure.
This control procedure is called just before the execution of the code of a function
in order to calculate the checksum of the function and compare it with the precalculated
control checksum. If the checksums are equal (i.e. the file was not modified) the program
continues its execution, otherwise the program stops.
Calculating checksums is an expensive operation i.e. it takes a lot of time. If the
program does many checksum verifications its execution speed is reduced dramatically.
Generally, programs have several highly sensitive functions that are called at the
initialization of the program (this is typically the case for programs that have license
verification procedures). A good way to use checksums is to calculate them only for
the most sensitive parts of the code.
Ironically, anti-patching techniques could be deactivated by patching. There are two
ways to deactivate checking procedures:
Patch instructions - disable the checking procedure itself, or circumvent (disable) the
function call to the checking procedure.
Patch data - recalculate the checksum for the patched part of the code/data.
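The control procedure and the "patch data" attack can be sketched together. The example is a simulation under stated assumptions: a byte vector stands in for the code of a protected function, the checksum is an Adler-32 style sum chosen only for illustration, and the names ProtectedFunction, control_checksum and control_procedure are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct ProtectedFunction {
    std::vector<std::uint8_t> code;  // stand-in for the bytes of a function
    std::uint32_t stored_checksum;   // precalculated control checksum
};

// Adler-32 style checksum: two running sums modulo 65521.
std::uint32_t control_checksum(const std::vector<std::uint8_t>& bytes) {
    std::uint32_t a = 1, b = 0;
    for (std::uint8_t x : bytes) {
        a = (a + x) % 65521;
        b = (b + a) % 65521;
    }
    return (b << 16) | a;
}

// Called just before the function is executed; false means "stop the program".
bool control_procedure(const ProtectedFunction& f) {
    return control_checksum(f.code) == f.stored_checksum;
}
```

In the usage below the reverser first patches an instruction (the check fails) and then also patches the stored checksum (the check passes again), which is exactly the weakness described above.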
Section 9.3 presents more information about error detecting codes and techniques
used against patching. Other techniques used against patching are discussed in Section 9.2.
7.3.4 Crashing and confusing reversing tools
One of the countermeasures against DRE is to confuse or crash the reversing
tool, or to make it produce an incorrect output. The most extreme type of such a countermeasure is to
exploit a bug in the analyzing tool and to take control of it. This last option is extremely
rare and very difficult to implement, because its success depends on the precise version
of the precise tool that is used for analysis.
Confusing disassemblers
Since a disassembler is one of the most important reversing tools, confusing a disassembler is a very interesting obfuscation technique.
The basic idea consists in adding a piece of data into the code section of a program.
This technique is also called byte insertion, because generally only one byte is inserted
in order to confuse a disassembler. Disassemblers that use a linear sweep algorithm (see
Section 6.3) would interpret this data as an instruction. If the value of the inserted
data is well chosen, all or several of the following instructions will be misinterpreted by the
disassembler.
See an example in figure 7.16 (listing 7.12). It shows how a byte could be inserted into
the code section of a program without affecting the normal execution of the program.
Listings 7.13 and 7.14 show how such code would be disassembled by recursive traversal
and linear sweep disassemblers. The destination address of the jump, e.g. adr+04 in listing 7.13, is
calculated from the address of the instruction that follows the jump (adr+03, because the
instruction pointer has already been advanced past the two-byte jump) plus the parameter 01.
In listing 7.14 the bytecodes of the instructions following adr+05 will also be disassembled
with errors.
This technique is not very difficult to implement and has almost no effect on the
execution speed of the program.
Disassemblers which use recursive traversal algorithms (see Section 6.3) will disassemble code such as that in figure 7.16 correctly. Disassemblers that use recursive
traversal algorithms could therefore be used in order to detect byte insertions.

Listing 7.11: Original code

    ; ...
    instruction 1
    instruction 2
    ; ...

Listing 7.12: Modified code

    instruction 1
    jump continue
    DATA BYTE[75h]    ; one byte insertion = bogus instruction
    continue:
    instruction 2

Listing 7.13: Modified code disassembled correctly

    adr+00    bytecode1    instruction 1
    adr+01    eb 01        jump <adr+04> = <(adr+03)+01>
    adr+03    75           DATA
    adr+04    bytecode2    instruction 2
    adr+05    ...

Listing 7.14: Modified code disassembled incorrectly

    adr+00    bytecode1       instruction 1
    adr+01    eb 01           jump <adr+04>
    adr+03    75 bytecode2    jne <(adr+05)+bytecode2>
    adr+05    ...

Figure 7.16: An example of one byte insertion into the code section of a program.
The program will still be executed correctly, but will be disassembled incorrectly by
linear sweep disassemblers. Instruction jne (jump-if-not-equal) has bytecode 75 (in
hexadecimal) and needs a parameter - the next byte.
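The difference between the two disassembling strategies can be reproduced on a toy example. The sketch below assumes a tiny subset of x86 (NOP = 90, RET = C3, JMP rel8 = EB xx, JNE rel8 = 75 xx, with displacements counted from the next instruction) and only records the offsets at which each strategy believes an instruction starts. The function names are hypothetical and the code is not a real disassembler.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>
#include <vector>

using Bytes = std::vector<std::uint8_t>;

// Instruction length in the toy subset: the two jumps take a one-byte operand.
std::size_t length_of(std::uint8_t opcode) {
    return (opcode == 0xEB || opcode == 0x75) ? 2 : 1;
}

// Linear sweep: decode one instruction after another, from the first byte to the last.
std::set<std::size_t> linear_sweep(const Bytes& code) {
    std::set<std::size_t> starts;
    for (std::size_t pc = 0; pc < code.size(); pc += length_of(code[pc]))
        starts.insert(pc);
    return starts;
}

// Recursive traversal: follow the control flow, starting at offset pc.
void traverse(const Bytes& code, std::size_t pc, std::set<std::size_t>& starts) {
    while (pc < code.size() && starts.insert(pc).second) {
        const std::uint8_t op = code[pc];
        if (op == 0xC3) return;                         // RET ends this path
        if (op == 0xEB || op == 0x75) {
            if (pc + 1 >= code.size()) return;          // truncated operand
            const std::size_t next = pc + 2;
            const std::int8_t rel = static_cast<std::int8_t>(code[pc + 1]);
            if (op == 0x75) traverse(code, next, starts);  // JNE may fall through
            pc = static_cast<std::size_t>(static_cast<std::ptrdiff_t>(next) + rel);
        } else {
            pc += 1;
        }
    }
}

std::set<std::size_t> recursive_traversal(const Bytes& code) {
    std::set<std::size_t> starts;
    traverse(code, 0, starts);
    return starts;
}
```

For the byte sequence 90 EB 01 75 90 C3 (the byte 75 at offset 3 is the inserted data), the linear sweep reports instructions at offsets 0, 1, 3 and 5 and thus swallows the real instruction at offset 4, whereas the recursive traversal reports offsets 0, 1, 4 and 5.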
Since recursive traversal disassemblers use heuristics in order to estimate when to
stop, it is still possible to make them produce wrong results (the code is not fully
disassembled) at least for some parts of the code.
Opaque predicates (see Section 7.3.2) could also be used in order to confuse recursive
traversal disassemblers. If opaque predicates are used, the disassembler would not be
able to always tell if a section of a code could be accessed or not.
Confusing decompilers
Since decompilers for languages that are compiled to bytecode can produce good results (see Section 6.3.1 about decompilers), it is interesting to try to obfuscate the final executable
files of programs written in such languages, e.g. Java or the .NET languages.
Differences between the bytecodes and high-level language can be used in order to
confuse a decompiler. For example, there is a goto statement in Java bytecode, but not
in the high-level language. This could be used in order to break the control flow of the
program in such way that a decompiler would not be able to create a corresponding
high-level language structure; see an example in figure 7.17.
Listing 7.15: Normal flow

    loop1:
        statement1_1;
        statement1_2;
        goto loop1;
    loop2:
        statement2_1;
        statement2_2;
        goto loop2;

Listing 7.16: Two loops with an overlay

    loop1:
        statement1_1;
    loop2:
        if (in_loop2) {
            statement2_1;
        }
        if (in_loop1) {
            statement1_2;
            goto loop1;
        }
        statement2_2;
        goto loop2;

Figure 7.17: Example: two loops with an overlay.
7.3.5 Data transformations
The way in which different variables and data structures are stored and handled could reveal
their purpose and their meaning, i.e. what a given variable represents (a counter, a sum,
a coordinate, etc.).
Data transformations can significantly reduce the readability of the code and will
make the work of the reverser harder.
Many data transformations are possible; here are some general rules that may be
applied in order to obfuscate data.
Encoding formats
There exist many standard formats that are used to encode (represent) data, e.g. binary-coded decimal (BCD), two's complement, Gray code, etc. (see figure 7.18).
The use of several different formats and use of non-standard encodings will reduce
the readability of the program.
The main idea consists in changing the normal order of bits in a byte (or bytes in
a word) or introducing bogus values in the real data.
    Number   Binary   BCD         Gray code
    0        0000     0000        0000
    1        0001     0001        0001
    2        0010     0010        0011
    3        0011     0011        0010
    4        0100     0100        0110
    5        0101     0101        0111
    6        0110     0110        0101
    7        0111     0111        0100
    8        1000     1000        1100
    9        1001     1001        1101
    10       1010     0001 0000   1111
    11       1011     0001 0001   1110
    12       1100     0001 0010   1010
    13       1101     0001 0011   1011
    14       1110     0001 0100   1001
    15       1111     0001 0101   1000

Figure 7.18: Examples of different binary representations for numbers.
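The encodings of figure 7.18 are easy to compute, which is what makes them cheap to use for obfuscation. The following sketch shows the standard conversions to and from Gray code and to packed BCD; the function names are chosen for this illustration only.

```cpp
#include <cassert>
#include <cstdint>

// Binary to Gray code: neighbouring numbers differ in exactly one bit.
std::uint32_t to_gray(std::uint32_t n) {
    return n ^ (n >> 1);
}

// Gray code back to binary (prefix XOR over all higher bits).
std::uint32_t from_gray(std::uint32_t g) {
    std::uint32_t n = g;
    for (std::uint32_t shift = 1; shift < 32; shift <<= 1) n ^= n >> shift;
    return n;
}

// Packed BCD: one decimal digit per 4-bit nibble (at most 8 digits in 32 bits).
std::uint32_t to_bcd(std::uint32_t n) {
    assert(n <= 99999999u);
    std::uint32_t bcd = 0;
    for (int shift = 0; n != 0; shift += 4, n /= 10) bcd |= (n % 10) << shift;
    return bcd;
}
```

A program could, for instance, keep a counter in Gray code and convert it with from_gray only at the moment the real value is needed.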

Many such transformations can be done in high-level languages (see listings 7.17
and 7.18). For example, all bits in a byte can be shifted n positions to the left; see the
example for n = 1 in figure 7.19. In this case all numbers are multiplied by 2^n.
The reverser could spend a lot of time trying to understand this logic.
    Decimal   Representations
    number    Normal     Shifted
    0         00000000   00000000
    1         00000001   00000010
    2         00000010   00000100
    3         00000011   00000110
    4         00000100   00001000
    5         00000101   00001010
    6         00000110   00001100
    7         00000111   00001110
    8         00001000   00010000
    9         00001001   00010010
    10        00001010   00010100
    11        00001011   00010110
    12        00001100   00011000
    13        00001101   00011010
    14        00001110   00011100
    15        00001111   00011110

Listing 7.17: Normal code

    for (int i = 0; i < 16; i++) {
        // ... do something ...
    }

Listing 7.18: With shifted encoding

    for (int i = 0; i < 32; i += 2) {
        // ... do something ...
    }

Figure 7.19: Example: all bits in a variable are shifted one position to the left. Example
inspired by [19].
    Array A:                 1 2 ... n
    Array B:                 1 2 ... n
    Array AB, concatenated:  1 2 ... n 1 2 ... n
    Array AB, mixed:         1 1 2 2 ... n n

Figure 7.20: Example of how two arrays could be stored.

Data structures
The structures used to store data can be altered in order to mislead reversers.
For example, several arrays that contain totally independent information could be
stored together: concatenated or mixed (see figure 7.20).
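The "mixed" layout can be sketched as a small wrapper class that stores two logically independent arrays interleaved in one buffer. The class name MixedArrays and its accessors are hypothetical; in obfuscated code the index arithmetic would normally be inlined at every access rather than hidden behind readable accessor names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Two independent arrays A and B of n elements each, stored as
// A[0] B[0] A[1] B[1] ... A[n-1] B[n-1] in a single buffer.
class MixedArrays {
public:
    explicit MixedArrays(std::size_t n) : storage_(2 * n, 0) {}

    // Element i of array A lives at the even position 2*i.
    int& a(std::size_t i) {
        assert(2 * i < storage_.size());
        return storage_[2 * i];
    }

    // Element i of array B lives at the odd position 2*i + 1.
    int& b(std::size_t i) {
        assert(2 * i + 1 < storage_.size());
        return storage_[2 * i + 1];
    }

    // What a reverser sees in memory: one array of 2*n unrelated-looking values.
    const std::vector<int>& raw() const { return storage_; }

private:
    std::vector<int> storage_;
};
```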
7.3.6 Hiding data
Reversers always look for interesting pieces of data and code. Finding a good way to
hide sensitive data improves the resistance of a program to reverse engineering and raises
its level of protection.
Hiding strings
During the reversing process reversers often look for strings stored in the file (see Section 6.9.2).
Several precautions can be taken in order to prevent reversers from finding these
strings.
For example, encryption (see Section 7.3.1) could be used. All strings could be
stored in encrypted form and decrypted before each use. A reverser can execute a
program and observe its output, so generally reversers know only the cleartext
form of the strings that they are looking for. If a string is stored in its encrypted form (as
a ciphertext), the standard modules (embedded in different reversing tools) that search for
patterns become useless.
The decryption key could be hidden somewhere in the file or in some rare cases
generated from the input (see [39]).
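Storing strings in encrypted form can be sketched as follows, assuming a repeating-key XOR in place of a real cipher; the key and the function names are hypothetical. The helper contains plays the role of the pattern search performed by a reversing tool.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Repeating-key XOR: applying it twice with the same key restores the text.
std::string xor_with_key(const std::string& text, const std::string& key) {
    assert(!key.empty());
    std::string out = text;
    for (std::size_t i = 0; i < out.size(); ++i)
        out[i] = static_cast<char>(out[i] ^ key[i % key.size()]);
    return out;
}

// Stand-in for a string search run by a reversing tool over the file.
bool contains(const std::string& haystack, const std::string& needle) {
    return haystack.find(needle) != std::string::npos;
}
```

Only the result of xor_with_key would be stored in the file; the cleartext exists in memory just before the message is displayed.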
Another way to prevent reversers from finding messages stored in the file is to store
them as images. In this case an image is shown instead of a text message.
This should be done in such a way that the reverser does not suspect that it is an image,
e.g. by using an image of an error message instead of a standard error message. In
order to use this technique, the function that displays windows with error messages has to
be replaced.
Steganography
The word steganography comes from the Greek words steganos, which means
covered, and graphei, which means writing. So steganography means concealed writing.
Steganography is the science of writing hidden messages and thus could be used in order
to hide any type of data, a possibility which is of great interest for obfuscation.
Unlike cryptography, steganography is a form of security through obscurity (see 7.3.1),
i.e. in theory only the sender and the receiver know how and where the message was
hidden. Note that steganography could be used together with cryptography, in which
case the message would be encrypted and then hidden.
Steganography has an advantage over cryptography. If a message is encrypted, in
most cases this means that the message contains something interesting, and it will
attract the attention of reversers. When steganography is used, the message might go
unnoticed.
One of the most famous (and one of the oldest) cases of steganography is the case
of Histiaeus. He tattooed a secret message on the shaved head of his slave. After some
time, the message was hidden, covered by his hair.
Nowadays digital steganography is used more often than tattoos on slaves' heads.
Secret messages can be embedded into images, video or audio files.
For example, a secret message could be embedded into an image that is stored in a
format that does not use lossy compression (see Section 7.3.1). If lossy compression is
used, the receiver would not be able to restore the message. Generally, steganographic
messages are more easily hidden in data that contains a lot of redundancy, i.e.
in files that are not stored using compressed file formats.
In order to represent an image, each color receives a numerical value. Since in most
cases the human eye cannot distinguish the color represented by 11111111 from the
color represented by 11111110, the least significant bit of each pixel could be used for
purposes of steganography.
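The least significant bit technique can be sketched as follows. This is a minimal illustration, assuming that the image is available as a flat sequence of 8-bit color values (one byte per value) and that the receiver knows the length of the hidden message; the function names embed and extract are hypothetical.

```python
def embed(pixels: bytes, message: bytes) -> bytes:
    """Hide message in the least significant bits of the color values."""
    bits = [(byte >> shift) & 1 for byte in message for shift in range(7, -1, -1)]
    if len(bits) > len(pixels):
        raise ValueError("cover image is too small for the message")
    stego = bytearray(pixels)
    for i, bit in enumerate(bits):
        stego[i] = (stego[i] & 0xFE) | bit  # overwrite only the lowest bit
    return bytes(stego)


def extract(pixels: bytes, length: int) -> bytes:
    """Read back length bytes from the least significant bits."""
    out = bytearray()
    for i in range(length):
        byte = 0
        for value in pixels[8 * i: 8 * i + 8]:
            byte = (byte << 1) | (value & 1)
        out.append(byte)
    return bytes(out)
```

Each color value changes by at most one, which is why the modification is hard to see; any lossy recompression of the image would destroy the hidden bits.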
There exist programs that can help to find embedded steganographic messages, see
example in Section 6.7, figure 6.9. Also see [54].
7.3.7 Eliminating symbolic information
Names of variables and functions certainly help the reverser to do DRE. So, eliminating all
symbolic information will complicate the analysis.
In the case of compiled languages (see Section 5.1) part of the symbolic information
is eliminated by the compiler (e.g. names of variables).
Such information is not eliminated in the case of languages that are compiled to bytecode.
It remains in the final compiled code and is used by the interpreter for
cross-referencing (instead of addresses).
Another kind of symbolic information that could be replaced by something misleading is stored in the header (or in the footer) of the file e.g. the name and the version of
the compiler.
In the example in figure 7.21, the compiler (g++) added a string with its version and
its name in the executable file. In order to confuse reversers, the original string GCC:
(Ubuntu 4.4.3-4ubuntu5) 4.4.3 could be replaced by something like GNU Fortran
(openSUSE 3.2.3-3suse4) 3.2.3.
/> g++ --version

g++ (Ubuntu 4.4.3-4ubuntu5) 4.4.3
Copyright (C) 2009 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO warranty;
not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR
PURPOSE.
Figure 7.21: View of hello world program in ghex hex editor. Compiler: GCC, version:
4.4.3, compiled on Ubuntu OS. This information can be seen in the string GCC: (Ubuntu
4.4.3-4ubuntu5) 4.4.3. Command g++ --version gives the same information.
7.3.8 Human reversers versus automated deobfuscators
Obfuscation techniques that prevent deobfuscation by automated systems are different from techniques used to prevent deobfuscation by human reversers.
Some techniques could easily fool a debugger, but a more or less experienced reverser
would understand what is going on in less than a minute. The opposite is also possible:
some techniques are effective only against human reversers, but not against automated
deobfuscators.
Generally, in the DRE process the first part is done by automated deobfuscators (e.g.
extraction of the packed code) and then human reversers try to reverse-engineer the
rest.
It is therefore better to combine techniques against human reversers with techniques against automated deobfuscators.
7.4 Pushing the reversing problem out of the software world
As was already mentioned in Section 7.2, it is impossible to create a perfectly secure
system.
In the case of DRE, the reverser has full access to the executable file and can do whatever
he (or she) wants with it. This means that sooner or later the file will be cracked (if it is
cost effective).
There exist several solutions that consist of taking the sensitive parts of the code and
putting them in a place that is inaccessible to reversers by software means alone.
If such a solution is applied (assuming that it was properly designed and
implemented), reverse engineering becomes significantly harder and thus requires more
resources. The reversing could then be achieved only by combining digital
reverse engineering with hardware reverse engineering.
The main solutions that protect against software reverse engineering are
presented below.
7.4.1 Program as a service
The first solution consists in implementing a client-server architecture. In this case, the
sensitive code is stored and executed on the server side. All end users execute
client code which contains no sensitive parts.
Several years ago this solution was not realistic. Nowadays, when many computers
are connected to the Internet and technology provides high-bandwidth connectivity, a client-server solution is possible.
Clients make requests to the server (like requests to a database). The server
handles these requests and sends the results back to the clients.
In such an implementation the sensitive code can never be accessed by the end
users.
There exist many options in terms of how the end users might pay for such a service:
Data volume processed - by megabytes or gigabytes.
Time spent - time of connection between the client and the server or time that
the server spends to treat the client's request.
Package - per month, per year, per number of requests or per number of connections.
Note that this kind of network solution is not suitable for all kinds of applications.
7.4.2 Cryptoprocessors
Another solution is to use encryption (see definition 20). The difference from the idea
described in 7.3.1 lies in the place where the decryption key is stored.
The idea is to create a processor that can execute encrypted code (by decrypting it
on the fly). The code of a sensitive program is stored on the user's computer, but it is
encrypted and cannot be accessed (in theory) without a secret decryption key.
Implementing such a processor is not enough: a whole system must be created in order
to support cryptoprocessors. See an example of a procedure in figure 7.22.
Manufacturing
1. The manufacturer asks a certification authority (CA) for a set of identification
numbers (IDs) and corresponding private keys.
2. During manufacturing, each processor receives an ID and the corresponding private key.
3. Processors are sold.
Purchasing a program
1. An end user buys a program.
2. The software developer asks him for his processor ID.
3. The user sends his processor ID.
4. The software developer asks the CA for the public key that corresponds
to the ID of the processor.
5. The CA responds by sending the public key of the processor.
6. The software developer encrypts the program with the public key.
7. The software developer sends the encrypted program to the user.
8. The user is happy.
Figure 7.22: Basic steps for protecting a program using a cryptoprocessor. Note that
steps 3 and 4 could be replaced by: the user sends the processor's public key.
Cryptoprocessors have to separate different processes and prevent them from accessing
each other's data; otherwise a reverser can write a program that accesses the data
of the encrypted program (by interacting with it).
7.4.3 Dongles
The use of software protection dongles is a very interesting solution that could be used
as a countermeasure against digital reverse engineering.
Definition 31. A dongle (see footnote 5) is a piece of hardware that plugs into a computer.
A software protection dongle usually contains a microcontroller. Since the dongle is
accessible to the end user, it has to be tamper resistant (see Section 7.4.5); otherwise
the dongle would be reverse-engineered more or less easily.
The use of dongles can be seen as a hybrid of the two previously discussed solutions (cryptoprocessors and client-server): the dongle plays the role of a server (it returns
responses to user requests). The code of sensitive functions is usually encrypted.
Footnote 5: Originally, the term dongle was used for any piece of hardware that plugs into a computer.
Nowadays this term is used when the device serves copyright protection, e.g. contains a license
key. Instead of calling all types of hardware a dongle, different names are used for small hardware
devices (depending on their purpose). Pluggable storage devices are generally called USB keys or flash
drives; authentication devices are called security tokens (there also exist security tokens that do not
plug into a computer).
The basic idea consists in having a dongle which is able to execute the sensitive code
and return the answers to the main program.
Two general schemes that could be used in a dongle solution (in terms of where the sensitive code is stored)
are presented below.
Store all code together
In this case, the sensitive code is encrypted and stored with the main program.
When the main program needs to execute the sensitive part of the code, it sends
the encrypted code and any parameters to the dongle and then waits for the results.
The dongle can contain either a cryptoprocessor, or a processor together with a decryption module and
a secret decryption key. In the first case it executes the encrypted code directly and
then returns the result to the main program. In the second case, the dongle decrypts
the code, executes it and then sends the result back to the main program.
The first advantage of this scheme is the simplicity with which code updates may
be handled. If the developer updates a sensitive part of the code, he can just include its
encrypted version in general updates.
The second advantage is the small storage space that the dongle needs. If several parts of
the program that contain sensitive code are encrypted, the dongle only has to be able to
store the biggest encrypted function (and any parameters).
The disadvantage of storing the encrypted sensitive code with the main code is
a security issue. Since the encrypted code can be accessed by the reverser, it can also
be modified. There exist attacks that consist in sending incorrect messages (which will
be handled as encrypted code by the dongle) and observing the results.
Store the sensitive code on the dongle
This solution has its own advantages and disadvantages.
If this scheme is applied, all sensitive functions are stored on the dongle. When the
main program needs the result of one particular function, it sends the ID (a number or
a name) of the function and its parameters. In turn, the dongle executes the
function and sends its results back to the main program.
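The request and response exchange described above can be sketched as follows. This is only a software stand-in for the hardware: the class Dongle, the function ID 1 and the licence check itself are hypothetical, and in Python nothing is really hidden from the caller, whereas on a real dongle only the equivalent of execute would be reachable from outside.

```python
class Dongle:
    """Software stand-in for the hardware; only execute() models the external interface."""

    def __init__(self):
        # Hypothetical table of sensitive functions stored on the dongle.
        self._functions = {1: self._check_license}

    def _check_license(self, serial: str) -> bool:
        # Placeholder for a sensitive algorithm (invented for this sketch).
        return sum(map(ord, serial)) % 7 == 0

    def execute(self, function_id, *params):
        """Run the function selected by its ID and return the result."""
        return self._functions[function_id](*params)


# The main program knows only function IDs and parameters, never the code.
dongle = Dongle()
is_valid = dongle.execute(1, "F")
```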
Note that the code that is stored on the dongle does not have to be encrypted.
If it is not, the dongle has to be designed in such a way that the code cannot be
accessed from the outside. In order to add one more level of protection, the code could
be encrypted. In this case, if the reverser finally obtains the code (by means of
some hardware hacks), he still would not be able to read it (except in the case where
the reverser finds both the encrypted code and the decryption key).
The main advantage of such an implementation (when the sensitive code is stored on
the dongle) is that the sensitive code is harder to obtain. Supposing that the dongle is
well designed, DRE could be done only by using hardware hacks.
There are two main disadvantages of storing all sensitive code and data on the dongle.
First of all, updates are not as easy as in the store all code together solution. Secondly,
the dongle needs to have a larger storage space in order to store all sensitive functions (and
their parameters).
7.4.4 Trusted computing
The trusted computing platform (see [49]) is another solution that could be used to
protect programs from being reversed. The trusted computing concept includes hardware
and software; it also relies on cryptography. The main idea is to be able to verify that
only authorized code is executed on the system.
Trusted computing is not only for anti-reverse engineering purposes. One of the
main motivations was preventing users from sharing copyrighted files (e.g. music, films
or programs). Trusted computing could also be used for:
Disk encryption
Platform integrity verification
Digital Rights Management
Password protection
The Trusted Computing Group (a non-profit organization) promotes trusted computing and
creates specifications needed to meet the requirements of a particular trusted system.
See more on http://www.trustedcomputinggroup.org/.
Trusted computing has a lot of opponents. Their main criticism is based on the fact
that trusted computing restricts the end user too much. While using trusted computing,
the end users would lose their anonymity and full control over their data (e.g. the ability
to move files from one computer to another), etc. Also see [49].
7.4.5 Hardware protections summary
Hardware solutions do not offer full protection against
reverse engineering. They transform the reversing from a purely software operation into one that also requires
hardware reverse engineering.
The use of hardware protection does not mean that a program cannot be reverse engineered. There exist many techniques that could be used in order to overcome hardware
protections, e.g. power analysis.
Since the hardware is accessible to the end user, it has to be tamper resistant in order to be difficult to reverse
engineer, i.e. hard to tamper with even with physical access
to it. There exist many hardware protections that make tampering difficult, e.g. screws
with special (non-standard) heads, chips that erase their memory if opened (exposed to
sunlight), etc.
See more about hardware reverse engineering in [25].
Chapter 8
Applied reversing
General knowledge about reversing tools, reversing techniques and obfuscation techniques is required for doing DRE. Unfortunately, in most real-world cases it is not
enough.
When analysing a file, a reverser has to know as much as possible about the underlying operating system, the assembly language, etc. (see chapter 5). He or she also
has to know as much as possible about error handling, more precisely what results are
generated by different errors.
One of the most difficult questions remains: what does a given error message mean
exactly, i.e. why did the error occur, where in the code did it happen and how can it be corrected
(see figure 8.1)? Software developers also encounter this problem, but generally compilers
and debuggers help to answer these questions quickly (see figure 8.2).
The only effective way to learn about the outputs produced by errors is practice.
/ > ./autoCorrect
Segmentation fault
Figure 8.1: Error: Segmentation fault. Reason: the process tried to access a part of memory
that does not belong to it.
/ > g++ hello_world.cpp -o hello_world
hello_world.cpp: In function 'int main()':
hello_world.cpp:9: error: expected ';' before 'return'
Figure 8.2: g++ compiler output for a common error: a forgotten semicolon.
8.1 Training
As was already mentioned, training is very important for learning DRE. The main
difficulty in training for DRE consists in choosing a suitable target program for
practicing. This is mostly due to two reasons:
Choosing the right level of difficulty - most modern pieces of software are very
difficult to reverse-engineer; a beginner should not start practicing using such
programs. There is no way to know in advance how difficult reversing a given
program will be. It is also impossible to know what kinds of obfuscating techniques were
used (before starting DRE). If the reverser knew what he would be dealing with,
the reversing process would become easier, less interesting and less didactic.
Legislation for DRE is not clear - in some countries reverse engineering is not
always legal. Most of the time the legislation is not clear enough to understand whether
someone can legally reverse-engineer a given program.
Regardless of these difficulties, there exist many programs that are good to start
DRE training with. Generally, these programs are called CrackMe or KeygenMe. There
exist entire databases of such programs. Sites like http://www.bright-shadows.net/
offer such training programs for reverse engineers.
There are two main advantages in using programs such as CrackMe. First of all, these
programs were created to be reverse-engineered, so reversing them is legal. Secondly,
users who were able to reverse a program can generally rate the level of difficulty of
reversing it. This means that a beginner can start from easier DRE challenges without knowing what obfuscation techniques (or any other kind of difficulties)
he will have to deal with.
Part III
Contribution
Chapter 9
Anti-patching
Usually patching is used in one of the following cases:
Updating - in this case, the software developer modifies his own code. This is done
in order to fix bugs or in order to add a new functionality to the program.
Reverse engineering - in this case, a reverse engineer modifies the program in order
to accomplish his (or her) goal.
Cracking - usually done by a hacker, most of the time it is done in order to break
the protection of a program.
9.1 A known problem
A software developer has to be able to fix bugs and update his own product. In this
case patching is 100% legitimate and raises no questions.
At the same time, developers of proprietary software do not want others to reverse-engineer and modify (patch) their programs. Software developers may not want their
programs to be patched for several reasons:
Proprietary algorithms and protocols - many proprietary programs use secret algorithms and protocols that are better (more efficient) than the equivalents used by
other developers, e.g. the protocol used by Skype or the algorithms used by the Oracle database
manager.
Illegal and free copies - in many cases, patching is used in order to break or
circumvent protections. Once the software protection is broken, the cracked program
can spread very quickly. Nowadays, protections of almost all proprietary non-free
programs are cracked as soon as they appear on the market.
Embedded malware - software developers do not want their programs to be modified
because a piece of malicious code could be embedded in their code.
9.2 Existing solutions
Different solutions can protect programs from patching. The idea consists in checking
the integrity of the code (or data) before executing (or using) it (also see Section 7.3.3).
General schemes and ideas of existing solutions are presented below.
9.2.1 Manual checking
One of the first protections against patching was the manual checking of a checksum.
Generally, an MD5 (Message-Digest Algorithm 5) hash is used (see Section 9.3.2).
Imagine the following scenario:
Alice obtains a program.
She computes the MD5 hash of the program and compares it to the value given by
some trusted authority, e.g. the value published on the website of the software developer (who
developed the original program).
Alice installs the program if and only if the values match.
This solution can prevent legitimate users from installing software with embedded
malware.
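The comparison that Alice performs can be sketched with Python's standard hashlib module. This is a minimal illustration; the function name matches_published_hash is hypothetical, the program is passed as a byte string for brevity, and MD5 is the default only because it is the hash named in the scenario.

```python
import hashlib


def matches_published_hash(program: bytes, published_hex: str,
                           algorithm: str = "md5") -> bool:
    """Return True if the hash of the program equals the published value."""
    digest = hashlib.new(algorithm, program).hexdigest()
    return digest == published_hex.strip().lower()


# Alice installs the program only if the values match.
downloaded = b"abc"  # stands in for the bytes of the downloaded file
published = "900150983cd24fb0d6963f7d28e17f72"  # MD5 of b"abc"
install = matches_published_hash(downloaded, published)
```

Passing another algorithm name supported by hashlib, e.g. "sha256", switches the check to a different hash function without changing the rest of the procedure.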
However, not all users check whether the program that they install was modified.
This solution can only protect legitimate users from using modified programs; it
does not protect programs from being modified.
Only cryptographic collision resistant hash functions (see Section 9.3.2) should be
used for checking. Otherwise, a malicious person would be able to create a program
that has the same hash value as a legitimate program.
This protection could be defeated. For example, a hacker who is trying to embed a
malicious program into a legitimate one can try to find a second preimage for the hash function
(see the definition in Section 9.3.2). The hacker can also try to break into the server of the software
developer and replace the hash value of the program on it.
9.2.2 Automatic error detection
This technique is very similar to the previous one, except that it is not done manually
by the end user. As a result, this technique offers an additional advantage: it protects
programs from being modified.
As was already mentioned in the previous section, not all users check new software
before installation. Programs therefore started to check their own integrity.
The idea is to use error detecting codes (see Sections 9.3 and 9.3.2): if an error is
detected, it means that the original program was modified.
In order to check themselves, programs use the following scheme: before executing
a part of the program (e.g. before executing a sensitive function), the program checks
the checksum of that part. If the value is correct, the execution continues normally. Otherwise,
the program changes its normal behavior, see Section 7.3.3 and figure 7.12.
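The self-checking scheme can be sketched in Python, where the compiled bytecode of a function plays the role of the machine code of a native program. This is an illustration under that assumption, not a description of any particular protection; the names code_checksum, sensitive, EXPECTED and guarded_call are hypothetical.

```python
import hashlib


def code_checksum(func) -> str:
    """Hash the compiled bytecode and constants of a function."""
    code = func.__code__
    return hashlib.sha256(code.co_code + repr(code.co_consts).encode()).hexdigest()


def sensitive(x):
    return x * 2 + 1


# In a real build this value would be precomputed and embedded in the program.
EXPECTED = code_checksum(sensitive)


def guarded_call(func, expected, *args):
    """Check the checksum before executing; change behavior if it is wrong."""
    if code_checksum(func) != expected:
        raise RuntimeError("integrity check failed: code was modified")
    return func(*args)
```

Here the changed behavior is simply an exception; a protected program could instead exit silently or corrupt its own results.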
Generally, cryptographic collision resistant hash functions are used as checksums
(see Section 9.3.2).
In order to break this protection, all functions that check a patched part of the program have to be disabled; alternatively, the protection could be circumvented by recalculating the checksum.
This process could be very difficult, especially if many obfuscations (see chapter 7) were
applied to the functions that check the program's integrity.
9.2.3 Check results of computations
This technique attempts to check the code integrity through data. It consists in checking
the results of a given function after the function call.
For example, consider a function QSort that sorts an array. If after the function call
of QSort the array is not sorted, it means that the function was altered or circumvented
(assuming, that the function was properly implemented in the first place).
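The QSort example can be sketched as follows. This is a minimal illustration; the names is_sorted and checked_sort are hypothetical, and the additional test that the result contains the same elements as the input is an assumption added here, since a patched routine could otherwise return any sorted array.

```python
from collections import Counter


def is_sorted(values) -> bool:
    """True if every element is less than or equal to its successor."""
    return all(a <= b for a, b in zip(values, values[1:]))


def checked_sort(sort_function, values):
    """Call the sort routine, then verify its result after the call."""
    result = sort_function(list(values))
    if not is_sorted(result) or Counter(result) != Counter(values):
        raise RuntimeError("sort routine was altered or circumvented")
    return result
```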
This technique has a significant disadvantage: it checks the code only after it was executed, which is a problem if a part of a legitimate program was replaced
by malicious code.
This protection could also be defeated by disabling all checking procedures.
9.2.4 Algorithm TPCA: Checker Network
Mikhail J. Atallah and Chang Hoi patented an algorithm that detects and corrects
patches, see [3].
The idea consists in having a network of checker procedures that verify the integrity
of the entire code and of each other's code. Typically, they compute hash value(s) over
a region of the code. If a checker detects that a part of the code was patched, a
responder (i.e. corrector) procedure replaces the patched region with a copy of the
code stored elsewhere, see [7].
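The detect-and-repair behavior of one checker and its responder can be illustrated with a toy sketch. It is not the patented algorithm: the code is modeled as a bytearray, the checkers guard overlapping regions but do not verify each other's code, and the class name Checker and all region boundaries are hypothetical.

```python
import hashlib


class Checker:
    """Toy checker with its responder: hashes one region and restores it if patched."""

    def __init__(self, memory: bytearray, start: int, end: int):
        self.memory, self.start, self.end = memory, start, end
        self.backup = bytes(memory[start:end])  # the copy stored elsewhere
        self.expected = hashlib.sha256(self.backup).digest()

    def check_and_repair(self) -> bool:
        """Return True if a patch was detected (and undone)."""
        region = bytes(self.memory[self.start:self.end])
        if hashlib.sha256(region).digest() == self.expected:
            return False
        self.memory[self.start:self.end] = self.backup  # responder step
        return True


memory = bytearray(b"AAAABBBBCCCC")  # stands in for the code of the program
checkers = [Checker(memory, 0, 8), Checker(memory, 4, 12)]  # overlapping regions
```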
9.3 Error detecting and error correcting codes
Some time ago people started to use electromagnetic waves (e.g. radio waves, electricity)
as a medium for information, i.e. electromagnetic waves are used to transfer information
from one point (the sender) to another (the receiver).
Waves interfere with each other, which means that a message could be modified
during the transmission, i.e. the receiver may receive a message with errors and there
is no guarantee that the received message is correct. In computer science all
messages are streams of bits. Since one bit can take only one of two possible values
(zero or one), a transmission error is also called a bit inversion.
In order to solve the problem of errors that occur during the transmission of a
message, error detecting and error correcting codes may be used.
Error detecting codes make it possible to detect that a message was modified during its transmission. Error correcting codes make it possible to detect and correct errors that occurred during
the transmission of a message. This means that all error correcting codes are also error
detecting codes.
Nowadays, error detection and error correction are used in almost all data transmissions, e.g. the TCP protocol and CD/DVD disks; space rovers also use error correction in order
to send photos of space to the Earth.
9.3.1 The idea behind error detection and correction
The main idea is to introduce redundancy into the message before transmitting it. If
the sender and the receiver know what kind of redundancy should be introduced in the
message, the receiver will be able to detect whether he received a correct message or whether errors
(bit inversions) occurred during the transmission.
The efficiency of error correcting and error detecting codes can be evaluated using
two parameters:
Redundancy rate - the amount of redundancy (otherwise useless information); the
ratio between the size of the useful information and the total size of the message is often
used.
Number of errors - the maximum number of errors that could be detected and/or
corrected in the message.
The redundancy that has to be added to the message can be calculated using different
algorithms. This redundancy is generally called a checksum.
The general ideas used in error detection and error correction are presented below.
9.3.2 Error detecting codes
As was already mentioned, error detecting codes introduce redundancy in the message.
The idea is that the receiver checks whether the message is valid, i.e. whether the message contains the specific
redundancies. If the message is valid, the receiver acknowledges it by sending an OK message to the sender. If the sender does not receive an acknowledgement
within a certain period of time, or if he receives an incomprehensible answer, the sender
sends the last message one more time. This scheme is similar to the scheme used by the
TCP/IP protocol.
Repetition code
Here is an example of the simplest error detecting code, also known as the repetition
code.
The idea is simply to repeat the original message (of size n) k times. See the example
in figure 9.1.
Original message   Message to send   Examples of invalid messages
00                 000000            010001
01                 010101            110110
10                 101010            010111
11                 111111            001000
Figure 9.1: Example of repetition code. k = 3 and n = 2.

Such a code can detect up to $kn - 1$ errors (single bit inversions), where $n$ is the size
of the original message and $k$ is the number of repetitions. See an example of the detection of
$kn - 1$ errors (all bits received incorrectly except one) in figure 9.2.
Original message: 0000 (n = 4)
Message to send: 0000 0000 0000 (k = 3)
Received message: 1111 1110 1111
11 errors = $4 \cdot 3 - 1$
Figure 9.2: Example of detecting errors with repetition code.
However, this code would be unable to detect $k$ bit inversions if they all occur at
the same position of each copy, i.e. at positions $i + t \cdot n$ for all $t \in [0, k-1]$, where $i \in [0, n-1]$. See the example in figure 9.3.
If this error detecting code is used, each message contains $\frac{n}{kn} = \frac{1}{k}$ of useful information and $\frac{k-1}{k}$ of redundancy.
Original message: 0000 (n = 4)
Message to send: 0000 0000 0000 (k = 3)
Received message: 0010 0010 0010
3 errors occurred at positions $2 + t \cdot 4$ for all $t \in [0, 2]$, i.e. at positions 2, 6 and 10 (counting from 0),
where $k = 3$, $n = 4$ and $i = 2$.
Figure 9.3: Example of undetected errors. k = 3 and n = 4.
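The repetition code and its blind spot can be reproduced with a short sketch. Messages are represented as strings of 0 and 1 for readability; the function names encode and is_valid are hypothetical.

```python
def encode(message: str, k: int) -> str:
    """Repeat the original message (of size n) k times."""
    return message * k


def is_valid(received: str, n: int) -> bool:
    """A received message is valid if all its copies of size n are identical."""
    copies = [received[i:i + n] for i in range(0, len(received), n)]
    return all(copy == copies[0] for copy in copies)


sent = encode("0000", 3)                    # "000000000000"
detected = not is_valid("111111101111", 4)  # figure 9.2: 11 errors are detected
missed = is_valid("001000100010", 4)        # figure 9.3: 3 errors go unnoticed
```

The message of figure 9.3 is accepted because the same bit was inverted in every copy, so the copies still agree with each other.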

Parity bits
Here is another family of slightly more sophisticated error detecting codes, called parity
bits.
This error detecting code adds only one bit to the message. There exist two types
of parity bit codes: even and odd. The idea is to set the additional bit of the message to
0 or to 1 in order to obtain an even or an odd number of bits set to 1 in the message.
See an example in figure 9.4.
Original message   Number of 1s   Even parity   Odd parity
0100110            3              01001101      01001100
0010100            2              00101000      00101001
Figure 9.4: Example of parity bits error detection code.

A parity bit error detecting code can detect any odd number of bit inversions in the
message. It cannot detect an even number of errors. A message with
a parity bit contains $\frac{n-1}{n}$ of useful data, where $n$ is the number of bits in the message.
The parity bit error detection code has a big advantage: it is very simple to implement
using a logical exclusive or (XOR), a very simple and fast instruction. See figure 9.5.
Parity bit codes are used in redundant arrays of independent disks (RAID 5).
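The XOR implementation can be sketched as follows, using the values of figures 9.4 and 9.5. The function names even_parity_bit, add_parity and column_parity are hypothetical; messages are strings of 0 and 1 in the first two functions and bytes in the last one.

```python
from functools import reduce
from operator import xor


def even_parity_bit(bits: str) -> int:
    """XOR of all bits: 1 if the number of 1s is odd, 0 otherwise."""
    return reduce(xor, (int(b) for b in bits), 0)


def add_parity(bits: str, odd: bool = False) -> str:
    """Append an even (default) or odd parity bit to the message."""
    parity = even_parity_bit(bits) ^ (1 if odd else 0)
    return bits + str(parity)


def column_parity(data: bytes) -> int:
    """Even parity of every bit position at once: XOR of all bytes."""
    return reduce(xor, data, 0)


codeword = add_parity("0100110")  # "01001101", as in figure 9.4
parity_byte = column_parity(bytes([0b01001010, 0b11010110, 0b11101001]))
```

With even parity a received codeword is accepted when even_parity_bit of the whole codeword is 0, which is why any odd number of inversions is detected and any even number is not.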
Hash functions
Hash functions are used for different purposes, e.g. hash tables and digital signatures. Cryptographic hash functions could also be used in order to calculate a checksum. Because
of the conditions (fixed size, collision resistance) that a cryptographic hash function has
to satisfy, such hash functions are good checksums.
Definition 32. One-way hash function (OWHI) is a function h satisfying the following
conditions:
The argument X can be of arbitrary length and the result h(X) has a fixed length
of n bits.
The hash function must be one-way in the sense that given a $Y$ in the image of $h$,
it is computationally infeasible to find a message $X$ such that $h(X) = Y$ (preimage
resistant), and given $X$ it is computationally infeasible to find a message $X' \neq X$
such that $h(X') = h(X)$ (second preimage resistant).
Definition from [2].
Original message: 01001010 11010110 11101001 (3 bytes)
Original message presented as 8 messages of size n = 3.
8 parity bits will be used.
message   number of 1s   parity bit
011       2              0
111       3              1
001       1              1
010       1              1
101       2              0
010       1              1
110       2              0
001       1              1
Same result using XOR:
byte 1     01001010      r1         10011100
byte 2     11010110      byte 3     11101001
XOR (r1)   10011100      XOR (r2)   01110101
Byte containing parity bits: 01110101.
Figure 9.5: Implementation of even parity bits error detection code using XOR.
Definition 33. A collision resistant hash function is a function h satisfying the following
conditions:
The argument X can be of arbitrary length and the result h(X) has a fixed length
of n bits.
The hash function must be OWHI
The hash function must be collision resistant: it is computationally infeasible to
find two distinct messages that hash to the same result.
Definition from [2].
The advantages of using cryptographic hash functions for error detection are:
Fixed size - a very small amount of redundant information is added to the original
message.
Collision resistance - it is extremely unlikely that several bit inversions in a message produce another message with the same hash value.
9.3.3 Error correcting codes
Sometimes detecting an error is not enough, since in some cases the information always
arrives at the receiver with errors.
For example, imagine that there is a little scratch on a CD: no matter how many
times the track is read, each time the CD player will read the message with errors.
One error is a single bit inversion; it is enough to invert that one bit again in order to correct
the error.
In order to be able to correct a single bit inversion two things are required: the error
must be detected and the place where the error occurred must be found.
Generally, error correcting codes rely on the fact that if a received message is not a
valid message then, most likely, the original message is the valid message closest to the
received message. See figure 9.6.
The sender can send only two possible messages: 0000 or 1111.
The receiver receives 1101.
Hamming distance (0000, 1101) = 3
Hamming distance (1111, 1101) = 1
It is more likely that the sender sent 1111.
Figure 9.6: Example of a valid message close to the received invalid message.
The Hamming distance is used to quantify the distance between two messages.
Definition 34. The Hamming distance between two codewords $c$ and $c'$ is defined by:
$d(c, c') = \mathrm{card}\{\, i \in [0, n-1] \mid c_i \neq c'_i \,\}$. Definition from [2].
In other words, the Hamming distance is the number of positions at which the corresponding symbols of two equal-length strings are different.
When a received message has more than one closest valid message, errors
are detected but cannot be corrected (see figure 9.7).
Sender can send only two possible messages: 0000 or 1111.
If the receiver receives 1001.
Figure 9.7: Example of equal Hamming distances between the received message and two
distinct valid messages. Errors are detected, but can not be corrected.
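Correction by choosing the closest valid message can be sketched as follows, using the messages of figures 9.6 and 9.7. The function names hamming_distance and decode are hypothetical; decode returns None when the closest valid message is not unique.

```python
def hamming_distance(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("messages must have the same length")
    return sum(x != y for x, y in zip(a, b))


def decode(received: str, valid_messages):
    """Return the unique closest valid message, or None if there is a tie."""
    distances = {m: hamming_distance(received, m) for m in valid_messages}
    best = min(distances.values())
    closest = [m for m, d in distances.items() if d == best]
    return closest[0] if len(closest) == 1 else None


corrected = decode("1101", ["0000", "1111"])      # "1111", as in figure 9.6
uncorrectable = decode("1001", ["0000", "1111"])  # None, as in figure 9.7
```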
There exist a variety of different error correcting codes (see [24]). Here, only a simple
error correcting code (used in 9.4.2) is presented.
Parity bits
The code described in Section 9.3.2 can only detect errors; it cannot correct them.
There is a simple way to modify the parity bits error detecting code in order to
transform it into an error correcting code. The idea is to represent a message as a
matrix M of m × n bits and then to add a parity bit to each line and to each column (m + n
parity bits). See the example in figure 9.8.
Suppose that a message was received with one error (a single bit inversion). In this
case an error is detected in a line i of M and an error is also detected in
a column j of M, which means that the bit inversion occurred in M[i][j]. See the example in
figure 9.9.
Original message: 0100 1010 1101 0110 (2 bytes)

Message presented as a 4 × 4 matrix:

Line    Parity bit
0100    0
1010    1
1101    0
0110    1

Parity bits for lines: 0101
Parity bits for columns: 1010
Message to send: 0100 1010 1101 0110 0101 1010
Figure 9.8: Example of odd parity bits error correcting code.

Sent message: 0100 1010 1101 0110 0101 1010
Received message (with one error): 0100 1000 1101 0110 0101 1010
Received message presented as a matrix M:

Line    Parity bit (received)    Parity bit (calculated)
0100    0                        0
1000    1                        0
1101    0                        0
0110    1                        1

Column parity bits (received):   1010
Column parity bits (calculated): 1000

There is an error in the second line and in the third column: M[2][3] has to be
corrected from zero to one.
Figure 9.9: Example of odd parity bits correction of one error.

This error correcting code can always correct a single bit error. If there are more
errors, they may or may not be detected, depending on their configuration. See the examples
in figures 9.10 and 9.11.
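The scheme can be sketched in C++ as follows. This is a minimal illustration under stated assumptions, not the code of the proof of concept: the types Matrix and Parity and the functions computeParity and checkAndCorrect are hypothetical names, bits are stored as integers 0 and 1, and an error located in a parity bit itself is simply reported as uncorrectable.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

typedef std::vector<int> Bits;
typedef std::vector<Bits> Matrix;

struct Parity {
    Bits lines;    // one odd parity bit per line of the matrix
    Bits columns;  // one odd parity bit per column of the matrix
};

// Odd parity: each parity bit is chosen so that the number of ones in its
// line (or column), parity bit included, is odd.
Parity computeParity(const Matrix& m) {
    Parity p;
    p.lines.assign(m.size(), 1);
    p.columns.assign(m.empty() ? 0 : m[0].size(), 1);
    for (std::size_t i = 0; i < m.size(); ++i) {
        for (std::size_t j = 0; j < m[i].size(); ++j) {
            p.lines[i] ^= m[i][j];
            p.columns[j] ^= m[i][j];
        }
    }
    return p;
}

enum Outcome { Clean, Corrected, Uncorrectable };

// Compares the received parity bits with the recalculated ones. Exactly one
// failing line and one failing column locate a single inverted bit, which
// is inverted back. Every other failing pattern is reported as
// uncorrectable in this sketch.
Outcome checkAndCorrect(Matrix& m, const Parity& received) {
    Parity calculated = computeParity(m);
    assert(calculated.lines.size() == received.lines.size());
    assert(calculated.columns.size() == received.columns.size());
    std::vector<std::size_t> badLines, badColumns;
    for (std::size_t i = 0; i < calculated.lines.size(); ++i) {
        if (calculated.lines[i] != received.lines[i]) badLines.push_back(i);
    }
    for (std::size_t j = 0; j < calculated.columns.size(); ++j) {
        if (calculated.columns[j] != received.columns[j]) badColumns.push_back(j);
    }
    if (badLines.empty() && badColumns.empty()) return Clean;
    if (badLines.size() == 1 && badColumns.size() == 1) {
        m[badLines[0]][badColumns[0]] ^= 1;
        return Corrected;
    }
    return Uncorrectable;
}
```

Applied to the messages of figures 9.9, 9.10 and 9.11, checkAndCorrect returns Corrected, Uncorrectable and Clean respectively; the last result shows that the four errors of figure 9.11 pass unnoticed.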
9.4 My addition
9.4.1 The idea
Sent message: 0100 1010 1101 0110 0101 1010
Received message (with two errors): 0100 1100 1101 0110 0101 1010

Line    Parity bit (received)    Parity bit (calculated)
0100    0                        0
1100    1                        1
1101    0                        0
0110    1                        1

Column parity bits (received):   1010
Column parity bits (calculated): 1100

The message cannot be corrected, since there is no way to know which lines of the
matrix were affected by errors.
Figure 9.10: Example of 2 detected errors that could not be corrected.

Sent message: 0100 1010 1101 0110 0101 1010
Received message (with four errors): 0100 0000 1101 1100 0101 1010

Line    Parity bit (received)    Parity bit (calculated)
0100    0                        0
0000    1                        1
1101    0                        0
1100    1                        1

Column parity bits (received):   1010
Column parity bits (calculated): 1010

Errors were not detected. All parity bits are correct.
Figure 9.11: Example of 4 errors that could not be detected.

The general idea consists in using error correcting codes instead of error detecting
codes. Nate Lawson mentions in his blog [30] that the use of error correcting codes is
possible but, to the best of my knowledge, there has been no publication about the use of
error correcting codes against patching.
This offers an advantage over the simple use of error detection: the error can be
corrected, i.e. a patch can be deactivated and replaced by the original code. It raises
the level of protection of a program.
Consider the following scenario: a reverse engineer modifies the code of a program
which uses an error detection mechanism. He will see that the program has completely
changed its behavior e.g. unexpectedly terminated with an error. In this case the reverser
will start to search where the program checks its integrity in order to deactivate the
integrity checking procedure.
Now, consider the scenario in which an error correction mechanism is used. A reverser will see that there are no changes in the behavior of the program. The reverser
might consider that he patched a wrong part of the code (e.g. a part that was not reached
during the execution of the program).
Use of error correction might be an additional source of frustration for a reverse
engineer.
9.4.2 Implementation
The general idea is to reuse the scheme used by error detection mechanisms, i.e. before a part of the code is executed, a checking procedure is called in order to
ensure that the code that follows was not patched. See figure 9.12.
Figure 9.12: General idea of implementation.
Proof of concept
Two proof of concept programs were implemented (under Ubuntu OS) in order to show
that use of error correcting codes is possible and that it is possible to restore the original
code (i.e. deactivate a patch) and execute it.
An odd parity bits error correcting code (see Section 9.3.3) was used as follows:
the executable file is presented as a matrix of $1024 \times \lceil \mathit{fileSize}/1024 \rceil$ bytes; parity bits
were calculated for all bits at the same position in bytes (in order to form a parity byte),
see figure 9.13.
The proof of concept program is able to correct a 1 byte error. This kind of patch
(a 1 byte replacement) is very common when a hacker (or a reverser) wants to
invert the condition of a loop or of an if-else statement (see further in this section).
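The byte-oriented variant of the code can be sketched in memory as follows. This is a hedged illustration rather than the proof of concept itself (which works on files, see appendices C.3 and C.5): the names Bytes, ParityBytes, computeParity, addParity, singleMismatch and correctSingleByte are hypothetical, the line size is a parameter instead of the constant 1024, and errors located in the parity bytes themselves are not handled.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

typedef std::vector<std::uint8_t> Bytes;

struct ParityBytes {
    Bytes lines;    // one parity byte per line of the matrix
    Bytes columns;  // one parity byte per column of the matrix
};

// Computes odd parity bytes; data.size() must be a multiple of lineSize.
ParityBytes computeParity(const Bytes& data, std::size_t lineSize) {
    assert(lineSize > 0 && data.size() % lineSize == 0);
    ParityBytes p;
    p.lines.assign(data.size() / lineSize, 0xFF);  // starting from 0xFF inverts
    p.columns.assign(lineSize, 0xFF);              // the XOR sum: odd parity
    for (std::size_t k = 0; k < data.size(); ++k) {
        p.lines[k / lineSize] ^= data[k];
        p.columns[k % lineSize] ^= data[k];
    }
    return p;
}

// Pads data with zero bytes to complete the last line, then computes parity.
ParityBytes addParity(Bytes& data, std::size_t lineSize) {
    assert(lineSize > 0);
    std::size_t linesNbr = (data.size() + lineSize - 1) / lineSize;
    data.resize(linesNbr * lineSize, 0x00);
    return computeParity(data, lineSize);
}

// Index of the only position where a and b differ; -1 if they are equal,
// -2 if they differ at several positions. Both must have the same size.
long singleMismatch(const Bytes& a, const Bytes& b) {
    assert(a.size() == b.size());
    long position = -1;
    for (std::size_t i = 0; i < a.size(); ++i) {
        if (a[i] != b[i]) {
            if (position != -1) return -2;
            position = static_cast<long>(i);
        }
    }
    return position;
}

// Returns true if data is clean or if a single modified byte was restored.
bool correctSingleByte(Bytes& data, std::size_t lineSize, const ParityBytes& stored) {
    ParityBytes now = computeParity(data, lineSize);
    long line = singleMismatch(now.lines, stored.lines);
    long column = singleMismatch(now.columns, stored.columns);
    if (line == -1 && column == -1) return true;  // nothing to correct
    if (line < 0 || column < 0) return false;     // not a single data byte error
    std::uint8_t lineMask = static_cast<std::uint8_t>(now.lines[line] ^ stored.lines[line]);
    std::uint8_t columnMask = static_cast<std::uint8_t>(now.columns[column] ^ stored.columns[column]);
    if (lineMask != columnMask) return false;     // inconsistent: several errors
    data[line * lineSize + column] ^= lineMask;   // undo the modification
    return true;
}
```

On x86, for instance, the short forms of je and jne are encoded with the opcodes 0x74 and 0x75, so inverting a condition amounts to changing one byte; correctSingleByte restores such a byte from the stored parity bytes.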
Figure 9.13: Use of parity bits error correcting code in programs addChecksum and
autoCorrect.
The first program, addChecksum (see code in appendix C.3), adds parity bits to a file.
The executable file is presented as a matrix of $1024 \times \lceil \mathit{fileSize}/1024 \rceil$ bytes. In order to complete the last
line of the matrix, it is padded with zero bytes (00000000). Then the two arrays of parity
bytes (lines and columns) are appended to the file. See figure 9.14.
The second program, autoCorrect (see source code in appendix C.5), is able to correct
its own code. The autoCorrect program is equivalent (see definition 17) to the program
secretValue (see source code in appendix C.4).
Program autoCorrect checks its own code before executing it. Since a simple parity
bits error correcting code is used, the program can always restore one byte of its code. If
there are no errors, the execution continues normally. If an error is detected, the main
process saves its current state (modeled by the variable state). The main
process then forks a first time in order to create a new process, which copies the
executable file using the program cp¹ (called with the system call exec). This first fork
could be replaced by a procedure which opens the executable file and then copies all
data from the executable file to a new one.
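Such a copying procedure could look like the following C++ sketch. It is an assumption about how the replacement might be written, not the code of the proof of concept (which calls cp); the function name copyFile and the buffer size are hypothetical.

```cpp
#include <cassert>
#include <fstream>
#include <string>

// Copies the file `source` to `destination` byte for byte.
// Returns false if a file cannot be opened or if the copy fails.
bool copyFile(const std::string& source, const std::string& destination) {
    assert(source != destination);  // copying a file onto itself would truncate it
    std::ifstream in(source.c_str(), std::ios::in | std::ios::binary);
    if (!in.is_open()) return false;
    std::ofstream out(destination.c_str(),
                      std::ios::out | std::ios::binary | std::ios::trunc);
    if (!out.is_open()) return false;
    char buffer[4096];
    // read() fails on the last, partial block; gcount() tells how many
    // bytes were actually read, so that this block is still written.
    while (in.read(buffer, sizeof buffer) || in.gcount() > 0) {
        out.write(buffer, in.gcount());
    }
    return !in.bad() && out.good();
}
```

Unlike a copy made by cp, a file created this way does not inherit the execution permission of the original, so a real implementation would also have to set it (e.g. with the POSIX function chmod) before executing the copy.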
Definition 35. System call is a mechanism used by programs in order to request a
service from the OS.
Once the file has been copied, the main process corrects the error (in the newly created
file) and forks for the second time. The second fork is used in order to execute the new
clean file. Once a new, clean file has been created, it is executed with the saved state as a
parameter, i.e. the execution continues as if there were no errors (patches) in the file. See
figure 9.15.
¹ The Linux program cp copies a file.

input: file
Integer columnsNbr = 1024;
Integer linesNbr = ceil(file.size() / 1024);
Vector columnsChecksum[columnsNbr] = 0;
Vector linesChecksum[linesNbr] = 0;
// Pad the last line of the matrix with zero bytes
Integer zeroesToAdd = columnsNbr * linesNbr - file.size();
for (unsigned i = 0; i < zeroesToAdd; ++i) {
    file.append(BYTE(00000000));
}
Vector buffer[columnsNbr] = 0;
for (unsigned i = 0; i < linesNbr; ++i) {
    file.read(buffer); // read one line of the matrix
    for (unsigned j = 0; j < columnsNbr; ++j) {
        linesChecksum[i] = linesChecksum[i] XOR buffer[j];
        columnsChecksum[j] = columnsChecksum[j] XOR buffer[j];
    }
}
// Odd parity:
linesChecksum = linesChecksum XOR vector(0xFF);
columnsChecksum = columnsChecksum XOR vector(0xFF);
file.append(linesChecksum);
file.append(columnsChecksum);
Figure 9.14: Pseudocode of addChecksum program.
The proof of concept program creates a copy of the original executable file and
executes it. This has to be done because any change made to the original executable file
during its execution would not affect the execution of the program (since it is
already loaded into memory). Moreover, some operating systems (e.g. Windows)
do not allow a file to be modified while it is being executed.
For educational purposes, the corrected file is not deleted after its execution. If such a
scheme is used in a real application, the corrected file should be deleted.
The autoCorrect and secretValue programs ask the user for a password.
Depending on the result, the program enters the if or the else branch. A hacker can simply
invert the condition of the if statement by replacing the jump instruction of the if-else
statement, e.g. a je (jump-if-equal) instruction could be replaced by jne (jump-if-not-equal).
Figure 9.15: Flowchart of checking procedure of autoCorrect program.
9.4.3 Other possible implementations and improvements
The proof of concept program shows that it is possible to use error correction against
patching. Error correction mechanisms could be implemented in different ways:
several parts of a program that uses an error correcting mechanism could be implemented
differently. The main possible improvements are presented here.
Where to put the corrected code
There are several places where the corrected code could be stored.
• Copy the original file. This technique (used in the proof of concept program) has one
  advantage: it is simple to implement. In this case, however, a system of checkpoints
  must also be implemented (see Section 9.4.3). This technique also has several
  disadvantages:
  - If monitoring tools are used, the reverser will notice the disk I/O.
  - The original program might not have permission to write to the disk or to
    create files.
• Correct the code in RAM. This technique has a big advantage: the reverser
  would not be able to easily see any trace of the correction. However, this technique
  cannot be applied on all operating systems, because most operating systems do not allow
  a program to write directly to the executable part of its memory. A problem might also appear if a
  corrected page is swapped out and then reloaded into memory (see [51] about
  memory management).
• Use the stack. This is the trickiest technique. A program sometimes
  has permission to execute stack memory (a security issue often exploited by
  hackers); in that case, the corrected (small) part of the program could be placed on the stack
  and executed.
How to correct
The proof of concept uses a very simple error correcting code, which is only guaranteed to
correct a single error (one byte).
There exist many other error correcting codes that are able to correct many more
errors, e.g. Reed-Solomon codes, turbo codes, etc.
A more powerful error correction mechanism would improve the resistance of
the program against patching.
What to correct
The proof of concept applies one error correction mechanism to the entire file, including its headers, code and data sections.
An error correcting code could instead be applied separately to each function or to each
block of the code (and data). Blocks of code that have checksums could overlap; in this case the code is checked several times, using different checksums.
These kinds of improvements would complicate the process of reverse engineering and
patching.
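The idea of overlapping blocks can be sketched as follows. This is only an illustration under simplifying assumptions: the names Block, xorChecksum, protectBlocks and failingBlocks are hypothetical, and a plain XOR checksum (which only detects modifications) stands in for the error correcting code that would be attached to each block.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

typedef std::vector<std::uint8_t> Bytes;

struct Block {
    std::size_t begin;      // index of the first byte of the block
    std::size_t end;        // index one past the last byte of the block
    std::uint8_t checksum;  // XOR of all bytes in [begin, end)
};

std::uint8_t xorChecksum(const Bytes& data, std::size_t begin, std::size_t end) {
    assert(begin <= end && end <= data.size());
    std::uint8_t sum = 0;
    for (std::size_t i = begin; i < end; ++i) sum ^= data[i];
    return sum;
}

// Creates one block of at most blockSize bytes every `stride` bytes.
// When stride < blockSize, consecutive blocks overlap.
std::vector<Block> protectBlocks(const Bytes& data, std::size_t blockSize,
                                 std::size_t stride) {
    assert(blockSize > 0 && stride > 0);
    std::vector<Block> blocks;
    for (std::size_t begin = 0; begin < data.size(); begin += stride) {
        std::size_t end = std::min(begin + blockSize, data.size());
        Block b = {begin, end, xorChecksum(data, begin, end)};
        blocks.push_back(b);
        if (end == data.size()) break;  // the end of the data is covered
    }
    return blocks;
}

// Returns the indices of the blocks whose checksum no longer matches.
std::vector<std::size_t> failingBlocks(const Bytes& data,
                                       const std::vector<Block>& blocks) {
    std::vector<std::size_t> failing;
    for (std::size_t k = 0; k < blocks.size(); ++k) {
        if (xorChecksum(data, blocks[k].begin, blocks[k].end) != blocks[k].checksum) {
            failing.push_back(k);
        }
    }
    return failing;
}
```

With blocks of 4 bytes created every 2 bytes, each byte in the middle of the data belongs to two blocks, so a patch has to remain consistent with two checksums at once.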
Hide checksum
Sometimes it is difficult to circumvent the checking mechanism, but easier to calculate a new checksum and replace the original checksum with it, so the checksum
should be well hidden in the file.
This could be achieved using encryption (see Section 7.3.1) and steganography (see
Section 7.3.6). If a program communicates through the network, all checksums could be
stored on a remote (always accessible) machine.
Checkpoints
The system of management of checkpoints could be improved in order to raise the level
of protection against patching.
Definition 36. A checkpoint is a place in a program where the current state of the program is saved. The state of a program includes everything that is needed (the values of the instruction pointer, registers, stack and variables) in order to continue the execution from the
checkpoint.
The idea of checkpoints is to be able to continue (resume) the execution of a program
if the program stops unexpectedly e.g. crashes.
Against patching, a system of checkpoints could be used in order to be able to execute a
program from the last clean checkpoint.
Definition 37. A clean checkpoint is a checkpoint C such that all previously used blocks
(data and instructions) of the program contain no patches.
Imagine that the checking procedure is inlined (see Section 5.2.2). Suppose that a
reverse engineer was able to detect and disable all checking procedures between the first
checking procedure and the patch that he (or she) applied (see figure 9.16). If the system
of checkpoints is separated from the error-checking procedure, the system of checkpoints
is not disabled. The first checking procedure that was not disabled will detect errors in
the previous (already executed) part of the code. In this case, it will be able to find the
last clean checkpoint, correct the error and resume the execution. This means that all
checkpoints must be saved.
In order to be able to patch such a system all checking procedures have to be found
and deactivated. This could be much harder than deactivating one checking procedure
especially if different obfuscating techniques are applied to different checking procedures.
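Resuming from a checkpoint can be sketched in the style of the autoCorrect program (appendix C.5), where the state is an integer passed on the command line and a switch statement jumps to the corresponding block. The sketch below is a simplified illustration: the functions runFrom and parseState and the three steps are hypothetical, and the returned trace only serves to show which blocks are executed.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Parses a checkpoint number given as a decimal command-line argument.
unsigned parseState(const char* argument) {
    assert(argument != 0);
    unsigned state = 0;
    for (; *argument != '\0'; ++argument) {
        state = state * 10 + unsigned(*argument - '0');
    }
    return state;
}

// Executes the program starting from checkpoint `state` and returns the
// list of executed steps. Each case falls through to the next one, so the
// execution continues normally after the checkpoint it was resumed from.
std::vector<std::string> runFrom(unsigned state) {
    std::vector<std::string> trace;
    switch (state) {
    case 0:
        trace.push_back("print prompt");
        // a checking procedure would be called here with checkpoint 1
        // fall through
    case 1:
        trace.push_back("read and verify password");
        // a checking procedure would be called here with checkpoint 2
        // fall through
    case 2:
        trace.push_back("print result");
        break;
    default:
        break;  // unknown checkpoint: nothing is executed
    }
    return trace;
}
```

A corrected copy started with the argument 1 would thus skip the prompt, which was already printed by the patched program, and continue with the password check.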
Other improvements
The code of the checking procedure will be harder to delete if many obfuscations (see
Chapter 7) are applied to it e.g. the procedure, that checks a block of the code or data
for errors, will be harder to circumvent if it is inlined (see Section 5.2.2).
Another possible improvement is duplication: many sensitive parts of
the code could be duplicated (see also Section 9.2.4). In this way, if one part is patched and
the error correcting code is not able to correct all errors (restore the original program
from the patched version), the program might execute the code of a duplicate procedure (or replace
the patched part of the program with clean code stored elsewhere and then execute it, see
Section 9.2.4 and [3]). See the example in figure 9.17.
9.4.4 Advantages and disadvantages
The use of an error checking mechanism against patching has its own advantages and
disadvantages.
First of all, with the use of checking procedures, the size of the file grows, especially
if checking procedures are inlined.
Secondly, computing a checksum before entering each function (or a block of the
code) might take a lot of time. This technique should be used mostly on sensitive parts
of the code.
Figure 9.16: Inlined checking procedure with multiple checkpoints.

The advantage of this technique is the additional source of frustration for a reverse
engineer who tries to break the protection of the program. This mechanism will also
protect the program from the use of some automated deobfuscators, e.g. if an automated
unpacker unpacks the code and the reverser then tries to execute it, the program will
stop because it is not packed anymore.
The disadvantages and the advantages described above are also present in the anti-patching error detection mechanism. However, error correction has an additional
advantage: the ability to disable patches, i.e. to restore the original code. Normally a reverser
will patch the code and then try to execute it. If the program changes its behavior unexpectedly, the reverser will try to find out why, i.e. he will try to find the error detection
mechanism. If error correction is used, the program will continue its execution
normally and the reverser might think that the program did not reach the patched
code (another branch of the program was chosen).
Figure 9.17: Duplicated function bar. Bar copy is a procedure equivalent to bar.
Weaknesses
Any system has its weaknesses and the error correction mechanism is no exception.
Error detection and error correction mechanisms are closely related, so they share
some weaknesses.
First of all, the error checking procedure could be disabled. If it is disabled a patch
could be applied to a file and this patch will work.
Secondly, if disabling the checking procedure is too hard, the checksum could be
recalculated after a patch was applied. Since a checksum calculation mechanism is
implemented in the file it could be used by a reverser to recalculate the new checksum
(even if the reverser does not understand how the checking procedure works).
Finally, the error correction mechanism will not be able to restore the original code
or data if too many changes were made. However, this weakness could be mitigated by
the use of duplication (see Section 9.4.3) and by the use of error correcting codes that are able
to correct many errors.
Chapter 10
Conclusions
Many different goals could be achieved through reverse engineering - a methodology
used to find out how things work.
Digital reverse engineering (a sub-domain of reverse engineering) and code obfuscation are closely connected and should be studied together. One should know how to
obfuscate systems in order to understand how to reverse-engineer them and vice versa.
Any system can be reverse-engineered, and it will be reverse-engineered if it is cost-effective to do so. Even though any system can be reversed, a developer concerned
about his product being reversed must use obfuscations, because each obfuscation raises
the level of protection of the system and decreases the number of potential reversers.
10.1 Anti-patching
One technique used by reverse engineers and by crackers in order to deactivate the
protection of a program is patching.
Patches are applied to different programs on a daily basis. There are legitimate
(updates) and illegitimate (cracking, protection removal) uses of patching. Developers
(especially developers of non-free programs) are very concerned about their programs
being patched and distributed freely.
There exist several solutions that prevent programs from being patched. This paper
presents an anti-patching technique based on error correcting codes, together with a proof of concept
program.
The proof of concept program demonstrates, that it is possible to restore and execute
the original code after it was patched.
10.2 Further work
Further work on this subject could study different implementations of error correcting
mechanisms based on the possible improvements presented in this paper.
It would also be interesting to implement an efficient system of checkpoints.
Finally, it would be very interesting (and challenging) to implement an automated obfuscator based on the ideas presented in this work and then to study how different
automated deobfuscators handle this obfuscation combined with other obfuscation
techniques.
Appendix A
Abbreviations
API Application Programming Interface
BCD Binary-coded decimal
BIOS Basic Input/Output System
CA certification authority
CPU Central Processing Unit
IBM International Business Machines
I/O Input/Output
IP Internet Protocol
IT Information Technology
DRE Digital Reverse Engineering
EU European Union
GB Gigabyte
GPS Global Positioning System
OS Operating System
PC Personal Computer
RAID Redundant Array of Independent Disks
RAM Random Access Memory
RE Reverse Engineering
RISC Reduced Instruction Set Computing
TCP Transmission Control Protocol
VM Virtual Machine
UDP User Datagram Protocol
UML Unified Modeling Language
USA United States of America
USSR Union of Soviet Socialist Republics
Appendix B
Images
B.1 History of reverse engineering
B.1.1 Reverse engineering in military
Figure B.1: B-29 Superfortress created by Boeing. Image from [56]
Figure B.2: Tu-4 created by Tupolev Design bureau. Image from [63]
B.1.2 Reverse engineering in digital world
Figure B.3: IBM PC 5150. Image from [58]
Figure B.4: IBM-compatible portable by Compaq. Image from [58]
B.2 Flowchart
Figure B.5: Guide to understand flowcharts. Image comes from http://xkcd.com/518/
Appendix C
Code
C.1 Hello, world
#include <iostream>

using namespace std;

// Minimal program: prints a greeting and exits.
int main() {
    cout << "Hello, world!" << endl;
    return 0;
}
C.2 ComputeHash
#include <iostream>
#include <string>
#include <stdlib.h>
#include <locale> // hash

using namespace std;

// Reads a value from the standard input and prints its hash.
int main(int argc, char* argv[]) {
    string value;
    cout << "Enter a value: ";
    getline(cin, value);
    locale loc; // the C locale
    const collate<char>& coll = use_facet<collate<char> >(loc);
    long myhash = coll.hash(value.data(), value.data() + value.length());
    cout << "Hash: " << myhash << endl;
    cout << "Byebye!" << endl;
}
C.3 AddChecksum
#include <iostream>    // I/O
#include <fstream>     // file
#include <stdlib.h>    // exit success/failure
#include <sys/types.h>
#include <sys/stat.h>  // stat
#include <math.h>      // ceil

using namespace std;

const unsigned lineSize = 1024;

// ODD parity in horizontal and in vertical parity vectors
int main(int argc, char* argv[]) {
    unsigned origFileSize, linesNbr, zerosToAdd;
    char* buffer;
    char* hpVector; // horizontal parity vector; size = bufferSize
    char* vpVector; // vertical parity vector
    fstream file;

    if (argc < 2) {
        cerr << "Bad number of arguments!" << endl;
        cerr << "Usage: " << argv[0] << " <filename>" << endl;
        return EXIT_FAILURE;
    }
    // get file size:
    struct stat filestatus;
    stat(argv[1], &filestatus);
    origFileSize = unsigned(filestatus.st_size);
    cout << "File size: " << origFileSize << " bytes\n";
    // calculate number of lines in the file
    // and number of 00 to add at the end
    // to get a complete last line
    linesNbr = unsigned(ceil(double(origFileSize) / double(lineSize)));
    zerosToAdd = linesNbr * lineSize - origFileSize;
    // open file:
    file.open(argv[1], fstream::in | fstream::out);
    if (!file.is_open()) {
        cerr << "Error while opening file \"" << argv[1] << "\"" << endl;
        return EXIT_FAILURE; // nothing can be done without the file
    }
    buffer = new char[lineSize];
    hpVector = new char[lineSize];
    vpVector = new char[linesNbr];
    // zerosToAdd < lineSize
    // init vectors:
    for (unsigned i = 0; i < lineSize; ++i) {
        buffer[i] = 0;
        hpVector[i] = 0;
    }
    // go to the end of the file && add zeros
    file.seekp(0, ios::end);
    file.write(buffer, zerosToAdd);
    cout << "Zeros added." << endl;
    // go to the beginning of the file
    file.seekg(0, ios::beg);
    for (unsigned i = 0; i < linesNbr; ++i) { // == while (!file.eof()) {...}
        file.read(buffer, lineSize);
        vpVector[i] = 0;
        for (unsigned j = 0; j < lineSize; ++j) { // XOR == sum bit by bit
            vpVector[i] = vpVector[i] ^ buffer[j];
            hpVector[j] = hpVector[j] ^ buffer[j];
        }
    }
    // parity bytes for odd parity: XOR FF = 255
    for (unsigned i = 0; i < lineSize; ++i) {
        hpVector[i] = hpVector[i] ^ 255;
    }
    for (unsigned i = 0; i < linesNbr; ++i) {
        vpVector[i] = vpVector[i] ^ 255;
    }
    cout << "Parity vectors calculated." << endl;
    // add parity vectors to the end of the file
    file.seekp(0, ios::end);
    file.write(vpVector, linesNbr);
    file.write(hpVector, lineSize);
    cout << "Parity vectors written." << endl;
    // delete buffers and close files
    delete[] vpVector;
    delete[] hpVector;
    delete[] buffer;
    file.close();
    // check new file size
    stat(argv[1], &filestatus);
    unsigned newFileSize = unsigned(filestatus.st_size);
    if (newFileSize == (linesNbr + 1) * (lineSize + 1) - 1) {
        cout << "check" << endl;
        cout << "New file size: " << newFileSize << [...]
C.4 SecretValue
#include <iostream>
#include <string>
#include <locale> // hash

using namespace std;

// Asks for a secret value and compares its salted hash with a constant.
int main() {
    string secretValue = "";
    locale loc;
    // same collate facet as in C.2
    const collate<char>& coll = use_facet<collate<char> >(loc);
    long secretValHash = 0;
    cout << "Enter the secret value: ";
    getline(cin, secretValue);
    secretValue += "nv"; // salt
    secretValHash = coll.hash(secretValue.data(), secretValue.data() + secretValue.length());
    if (secretValHash == 109885302) { // secretValue == 42
        cout << "Congratulations!" << endl;
    } else {
        cout << "Wrong value!" << endl;
    }
}
C.5 AutoCorrect
#include <iostream>
#include <fstream>     // file
#include <stdlib.h>    // exit success/failure
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/wait.h>
#include <unistd.h>    // fork, execvp
#include <math.h>      // ceil
#include <string>
#include <sstream>     // string stream
#include <locale>      // hash

using namespace std;

const int correction_impossible = -1;
const int good_checksum = 0;
const int bad_checksum = 1;

void check(char*, unsigned);
int checkAndCorrectFile(char*);

int main(int argc, char* argv[]) {
    // init
    unsigned state = 0; // state changes at each CheckPoint.
    // original call? continue, else: get state and continue from CheckPoint
    if (argc == 2) {
        // get state (param)
        unsigned i = 0;
        while (argv[1][i] != '\0') {
            state *= 10;
            state += argv[1][i] - unsigned('0');
            ++i;
        }
    }
    // cout << "fname:" << argv[0] << endl;
    char fileName[32];
    unsigned i = 2; // delete "./" at the beginning
    while (argv[0][i] != '\0') {
        fileName[i - 2] = argv[0][i];
        i++;
    }
    fileName[i - 2] = '\0';
    // main
    string secretValue = "";
    locale loc;
    // same collate facet as in C.2
    const collate<char>& coll = use_facet<collate<char> >(loc);
    long secretValHash = 0;
    // check(fileName, 0);
    switch (state) {
    case 0:
        state = 0;
        // cout << "State 0" << endl;
        cout << "Enter the secret value: ";
        cout << flush;
        check(fileName, 1);
        // no break: the execution continues with the next checkpoint
    case 1:
        state = 1;
        // cout << "State 1" << endl;
        getline(cin, secretValue);
        secretValue += "nv"; // salt
        secretValHash = coll.hash(secretValue.data(), secretValue.data() + secretValue.length());
        if (secretValHash == 109885302) { // secretValue == 42
            cout << "Congratulations!" << endl;
        } else {
            cout << "Wrong value!" << endl;
        }
        break;
    default:
        // cout << "Unknown state " << state << endl;
        ;
        break;
    }
}
// Checks the executable file. If a patch was detected and corrected,
// executes the corrected copy starting from checkpoint currentState.
void check(char* fileName, unsigned currentState) {
    int corrected = checkAndCorrectFile(fileName);
    if (corrected == correction_impossible) {
        // cerr << "Recovery is impossible" << endl;
        exit(EXIT_FAILURE);
    } else if (corrected == bad_checksum) {
        pid_t pid = fork();
        if (pid == 0) { // son
            // exec new file
            string correctedFileName, stateStr;
            std::stringstream out, out2;
            out << "./" << fileName << "_cor";
            correctedFileName = out.str();
            out2 << currentState;
            stateStr = out2.str();
            char* arguments[3];
            arguments[0] = (char*) correctedFileName.c_str();
            arguments[1] = (char*) (stateStr.c_str());
            arguments[2] = NULL;
            if (execvp(correctedFileName.c_str(), arguments) == -1) {
                // cerr << "Error while exec corrected file" << endl;
                exit(EXIT_FAILURE); // do not continue in the patched code
            }
        } else if (pid != -1) { // father
            int status;
            wait(&status);
            // *eventually erase the copy file here*
            exit(EXIT_SUCCESS);
        } else {
            // cerr << "Error while fork() for copy exec." << endl;
        }
    }
}
// checkAndCorrect r e t u r n s :
// 1 = bad checksum AND c o r r e c t i o n i s i m p o s s i b l e
// 0 = checksum OK
// 1 = bad checksum AND c o r r e c t e d . c o r r e c t f i l e N a m e = <origFileName> c o r
int c h e c k A n d C o r r e c t F i l e ( char * f i l e N a m e ) {
const unsigned l i n e S i z e = 1 0 2 4 ;
unsigned f i l e S i z e , l i n e s N b r , zerosToAdd ;
char * b u f f e r ;
char * hpVector ; // h o r i z o n t a l p a r i t y v e c t o r ; s i z e = b u f f e r S i z e
char * vpVector ; // v e r t i c a l p a r i t y v e c t o r
fstream f i l e ;
103
int r e s u l t = 1;
// g e t f i l e s i z e :
struct s t a t f i l e s t a t u s ;
s t a t ( fileName , & f i l e s t a t u s ) ;
f i l e S i z e = unsigned ( f i l e s t a t u s . s t s i z e ) ;
// c o u t << F i l e s i z e : << f i l e S i z e << b y t e s \n ;
// c a l c u l a t e number o f l i n e s i n t h e o r i g i n a l f i l e
l i n e s N b r = ( f i l e S i z e +1)/( l i n e S i z e +1) 1 ;
// open f i l e :
f i l e . open ( fileName , f s t r e a m : : i n ) ;
if (! f i l e . is open ()){
// c e r r <<Error w h i l e o p e n i n g f i l e \<<fileName <<\<<e n d l ;
}
b u f f e r = new char [ l i n e S i z e ] ;
hpVector = new char [ l i n e S i z e ] ;
vpVector = new char [ l i n e s N b r ] ;
// i n i t v e c t o r s from f i l e
f i l e . s e e k g ( l i n e s N b r * l i n e S i z e , i o s : : beg ) ;
f i l e . r e a d ( vpVector , l i n e s N b r ) ;
f i l e . r e a d ( hpVector , l i n e S i z e ) ;
// g o t o t h e b e g i n i n g o f t h e f i l e
f i l e . s e e k g ( 0 , i o s : : beg ) ;
for ( unsigned i =0; i <l i n e s N b r ;++ i ) {
f i l e . read ( buffer , l i n e S i z e ) ;
fo r ( unsigned j =0; j <l i n e S i z e ;++ j ) { // XOR == sum b i t b i t
vpVector [ i ] = vpVector [ i ] b u f f e r [ j ] ;
hpVector [ j ] = hpVector [ j ] b u f f e r [ j ] ;
}
}
// cout <<F i l e re ad .<< e n d l ;
// c h e c k r e s u l t s :
int vPos=1, hPos=1;
bool p o s s i b l e R e c o v e r y = true , needRecovery = f a l s e ;
for ( unsigned i = 0 ; i <l i n e s N b r && p o s s i b l e R e c o v e r y ;++ i ) {
// c l e a n CMP i s VERY i m p o r t a n t ! ( same t y p e )
i f ( vpVector [ i ] ! = char ( 2 5 5 ) ) {
i f ( vPos != 1){
possibleRecovery = false ;
104
}
vPos = i ;
needRecovery = true ;
}
for ( unsigned i = 0 ; i <l i n e S i z e && p o s s i b l e R e c o v e r y ;++ i ) {

// c l e a n CMP i s VERY i m p o r t a n t ! ( same t y p e )
i f ( hpVector [ i ] ! = char ( 2 5 5 ) ) {
i f ( hPos != 1){
possibleRecovery = false ;
}
hPos = i ;
needRecovery = true ;
}
}
// cout <<R e s u l t s c h e c k e d .<< e n d l ;
bool c l e a n = true ;
//
//
//
//
// Assumption: either exactly 1 real error occurred and it was detected,
// or more than 1 error occurred and was detected.
// There exist some cases with many errors that could not be detected.
// seekg: reposition the get pointer
// seekp: reposition the put pointer
if (possibleRecovery && needRecovery) {
    string copyFileName;
    std::stringstream out;
    out << fileName << "cor";
    copyFileName = out.str();
    pid_t pid = fork();
    if (pid == 0) {  // son process: copy the corrupted file
        char* arguments[4];
        arguments[0] = (char*)("cp");
        arguments[1] = fileName;
        arguments[2] = (char*)(copyFileName.c_str());
        arguments[3] = NULL;
        if (execvp("cp", arguments) == -1) {
            // cerr << "exec cp error" << endl;
            exit(1);  // the son must not continue with the father's code
        }
    } else if (pid != -1) {  // father
        // wait for son death
        int status;
        wait(&status);
        /* if (WIFEXITED(status)) {
               cout << "son exited OK" << endl;
           } */
        // open the copy of the corrupted file
        // (the copy created by the son process)
        fstream copyFile;
        copyFile.open(copyFileName.c_str(), fstream::in | fstream::out);
        if (!copyFile.is_open()) {
            // cerr << "Error while opening file \"" << copyFileName << "\"" << endl;
            exit(1);
        }
        if (vPos == -1 && hPos != -1) {  // error only in hpVector
            copyFile.seekg(linesNbr * (lineSize + 1) + hPos, ios::beg);
            copyFile.get(buffer[0]);
            buffer[0] = buffer[0] ^ hpVector[hPos] ^ 255;
            copyFile.seekp(linesNbr * (lineSize + 1) + hPos, ios::beg);
            copyFile.put(buffer[0]);
        } else if (hPos == -1 && vPos != -1) {  // error only in vpVector
            copyFile.seekg(linesNbr * lineSize + vPos, ios::beg);
            copyFile.get(buffer[0]);
            buffer[0] = buffer[0] ^ vpVector[vPos] ^ 255;
            copyFile.seekp(linesNbr * lineSize + vPos, ios::beg);
            copyFile.put(buffer[0]);
        } else if (hPos != -1 && vPos != -1) {  // error in the middle
            copyFile.seekg(vPos * lineSize + hPos, ios::beg);
            copyFile.get(buffer[0]);
            // calculating the correct byte:
            buffer[0] = buffer[0] ^ (hpVector[hPos] ^ 255);
            copyFile.seekp(vPos * lineSize + hPos, ios::beg);
            copyFile.put(buffer[0]);
        }
        copyFile.close();
        // cout << "Recovery done." << endl;
        result = bad_checksum;
    } else {
        // cerr << "Error while fork() for copy." << endl;
    }
} else if (needRecovery) {
    // cout << "File was corrupted and recovery is impossible!" << endl;
    result = correction_impossible;
}
if (!needRecovery) {
    // cout << "Clean file, no need to recover data." << endl;
    result = good_checksum;
}
// delete buffers and close files
delete[] vpVector;
delete[] hpVector;
delete[] buffer;
file.close();
return result;
}
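The recovery branch above repairs one byte using a horizontal and a vertical parity vector. The following is a minimal sketch of that general idea, assuming plain XOR parity per row and per column; it does not reproduce the additional XOR with 255 used in the listing, and all names (Parity, computeParity, correctSingleByte) are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical container for the two parity vectors (not the thesis code).
struct Parity {
    std::vector<unsigned char> row;  // XOR of all bytes of each row
    std::vector<unsigned char> col;  // XOR of all bytes of each column
};

// Computes row and column XOR parity of data laid out in rows of lineSize bytes.
Parity computeParity(const std::vector<unsigned char>& data, std::size_t lineSize) {
    assert(lineSize > 0 && data.size() % lineSize == 0);
    Parity p;
    p.row.assign(data.size() / lineSize, 0);
    p.col.assign(lineSize, 0);
    for (std::size_t i = 0; i < data.size(); ++i) {
        p.row[i / lineSize] ^= data[i];
        p.col[i % lineSize] ^= data[i];
    }
    return p;
}

// Repairs at most one corrupted data byte in place.
// Returns true if the data is clean or was repaired, false otherwise.
bool correctSingleByte(std::vector<unsigned char>& data, std::size_t lineSize,
                       const Parity& stored) {
    const Parity now = computeParity(data, lineSize);
    std::size_t badRow = 0, badCol = 0, rowErrors = 0, colErrors = 0;
    for (std::size_t r = 0; r < now.row.size(); ++r)
        if (now.row[r] != stored.row[r]) { badRow = r; ++rowErrors; }
    for (std::size_t c = 0; c < now.col.size(); ++c)
        if (now.col[c] != stored.col[c]) { badCol = c; ++colErrors; }
    if (rowErrors == 0 && colErrors == 0) return true;   // clean data
    if (rowErrors != 1 || colErrors != 1) return false;  // not a single-byte error
    // The syndrome is (corrupted byte XOR original byte); both views must agree.
    const unsigned char syndrome = now.col[badCol] ^ stored.col[badCol];
    if ((now.row[badRow] ^ stored.row[badRow]) != syndrome) return false;
    data[badRow * lineSize + badCol] ^= syndrome;  // flip the byte back
    return true;
}
```

The failing row and the failing column together locate the corrupted byte, and XOR-ing it with the syndrome restores the original value. Two errors in different rows are rejected rather than guessed at.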
C.6 Makefile
TARGET=addChecksum autoCorrect computeHash secretValue
all: computeHash secretValue addChecksum autoCorrect
normal: $(TARGET)
computeHash: computeHash.cpp
	g++ computeHash.cpp -o $@
secretValue: secretValue.cpp
	g++ secretValue.cpp -o $@
addChecksum: addChecksum.cpp
	g++ addChecksum.cpp -o $@
autoCorrect: autoCorrect.cpp addChecksum
	g++ autoCorrect.cpp -o $@
	./addChecksum autoCorrect
clean:
	$(RM) $(TARGET)
Appendix D
Internship report
Report
Internship at Forensic Technology Solutions
of PricewaterhouseCoopers
Nikita Veshchikov
08-09/2010
Contents
1 Introduction
2 Effort
2.1 Inventory DB
2.2 Import scripts
2.3 Investigations
2.3.1 General information
2.3.2 Data acquisition
2.3.3 Analysis
2.3.4 Meet the client
2.4 Scientific articles
2.5 Miscellaneous
3 Acknowledgments
4 Conclusions
A Consent
B Data Acquisition Form
C Chain of custody
D Case procedure form
Bibliography
Introduction
I started my seven-week internship on 9 August 2010 with the Belgian FTS (Forensic Technology Solutions) team of PwC.
PwC has FTS teams in many countries. At the request of firms all over the world, FTS teams provide services such as fraud and bribery detection (usually for litigation), securing data and recovering lost data, and securing IT infrastructure and reducing its complexity (see also [11]).
This internship had several goals:
• integrate into a real-world working environment
• better define and understand the subject of my master's thesis
• apply theoretical knowledge in practice
Before the beginning of my internship I had found two subjects for my master's thesis that I was deeply interested in: reverse engineering, and volatile data acquisition and analysis.
Reverse engineering is the process of discovering the technological principles of a human-made device, object or system through analysis of its structure, function and operation.
- Wikipedia
Volatile data acquisition is the process of acquiring information from volatile memory (memory that requires a power supply to maintain the stored information, e.g. RAM).
Both subjects deal with sensitive and confidential data and with security, so finding an internship in either domain was very challenging.
FTS mostly deals with non-volatile data capture and analysis, which is quite close to the second subject. This internship allowed me to investigate some cases myself and, importantly, it helped me to choose between the two subjects mentioned above.
Effort
During my internship I worked on several tasks, mainly within digital forensics projects. I worked with different database scripts and read some scientific articles (mainly during the first three weeks, see 2.4). During the last weeks I worked on several digital forensics projects.
Because of the nature of the tasks I was given (quite often I had to work with confidential information), no real names or numbers are mentioned in this report.
2.1 Inventory DB
One of the first tasks entrusted to me was the design of an inventory database for the FTS laboratory. The database had to be created in Microsoft (MS) Access and had to store most of the FTS Lab equipment together with some related information. It also had to support decommissioning for almost all types of equipment, including operating systems and software: once an object is decommissioned its record must not be deleted, because the database has to keep records of decommissioned objects.
Here is the list of the equipment that this database had to manage:
• External Hard Drives: name, type of connectors, capacity, encrypted (y/n), data on the drive, location
• Internal Hard Drives: name, type of connector, capacity, location
• Flash Drives: name, capacity
• Forensic Equipment: id, name
• Computers: general computer characteristics, operating system and software installed
It took me several days to accomplish this task.
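The decommissioning requirement (a record is flagged, never deleted) can be sketched in a few lines. This is only an illustration in C++, not the MS Access database that was actually built; the Equipment and Inventory types and their field names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical record layout; the real database stored more fields per type.
struct Equipment {
    int id;
    std::string name;
    std::string kind;     // e.g. "External Hard Drive", "Flash Drive"
    bool decommissioned;  // set instead of deleting the record
};

class Inventory {
public:
    // Adds a new item and returns its identifier.
    int add(const std::string& name, const std::string& kind) {
        assert(!name.empty());
        const int id = static_cast<int>(items_.size()) + 1;
        items_.push_back(Equipment{id, name, kind, false});
        return id;
    }

    // Flags an item as decommissioned; the record itself is kept.
    // Returns false if the id is unknown or already decommissioned.
    bool decommission(int id) {
        for (Equipment& e : items_) {
            if (e.id == id && !e.decommissioned) {
                e.decommissioned = true;
                return true;
            }
        }
        return false;
    }

    // Number of items still in service.
    std::size_t activeCount() const {
        std::size_t n = 0;
        for (const Equipment& e : items_)
            if (!e.decommissioned) ++n;
        return n;
    }

    // Number of records ever created, decommissioned ones included.
    std::size_t totalCount() const { return items_.size(); }

private:
    std::vector<Equipment> items_;
};
```

Keeping a flag rather than removing the row preserves the history of every piece of equipment, which matters when an old case refers to a drive that is no longer in use.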
2.2 Import scripts
Another task, also related to databases, was to import information from text files into MS SQL Server. I installed MS SQL Server on my laptop...
There were several types of such files (they contained different information about the same objects).
After analyzing all the types of files that I had to handle, I first drew the complete structure of the future database on paper in order to understand and plan it before implementing it. I then created several SQL scripts:
• a setup script that created all the tables needed
• import scripts, one per file type
• finally, a main script that called the other scripts with the required arguments
2.3 Investigations
During my internship I worked on several digital forensics investigations. This was the part that interested me most, so I would like to describe the different parts of a digital forensics investigation in detail.
The following procedures were carried out several times, with a few variations, during my internship.
2.3.1 General information
Each new case, also called a project, has its own name. This makes it possible to refer to a project without mentioning the client's name or any person's first or last name. This is very useful because most of the time the team of investigators deals with private and confidential data, so someone who sees a document or overhears something related to a project will not understand exactly what it is about.
Everything has to be documented: every action (script, procedure, etc.) that an investigator runs on evidence. This generates a lot of paperwork, but not an unmanageable amount.
To facilitate the investigator's paperwork, and the work in general, there are forms to fill in (see appendices B, C and D), so that he (or she) remembers to carry out all the steps of the procedure¹. However, investigators sometimes forget to fill in all these forms. This can create problems later for colleagues who continue working on the project, especially if the case has to be presented as evidence in court.
During my internship I had the opportunity to work on Windows-based Windows forensics.
Windows-based forensics are digital forensics in which the investigation is carried out on a Windows system.
Windows forensics are digital forensics of a Windows system.
In other words, Windows-based forensics means that the investigator's computer runs Windows in order to analyse the evidence, while Windows forensics means that a suspect Windows system is analysed (read more in [10]).
Several software packages for digital forensics exist; I mostly worked with EnCase², but I also have some experience with FTK³.
FTK requires an Oracle database (provided on the installation CD) to store all search results. This makes the installation slightly more complex but allows FTK to start up very quickly (a few seconds to open a case). EnCase, on the other hand, has its own index system and works on its own, but it has to read and load all index files when a case is opened, which can take some time⁴; that is why EnCase is generally kept running for the whole duration of a case investigation.
I will elaborate on different aspects of EnCase in the following sections. I will also present differences between EnCase and FTK.
¹ the order of actions to perform and scripts to run, what has to be documented, etc.
² http://www.guidancesoftware.com/forensic.htm
³ http://accessdata.com/products/forensic-investigation/ftk
2.3.2 Data acquisition
The first part of any digital forensic investigation is data acquisition, and it has to be done in the right way (see also appendix B).
Basically there are two ways to acquire data from a hard drive:
• direct acquisition requires extracting the hard drive from the computer and attaching it to the system that will receive the image copy of that drive.
• network acquisition requires booting (from a live CD or USB) the computer that contains the hard drive to be acquired (the evidence) and connecting it by cable to the computer used for the investigation. This type of acquisition is sometimes called cross-over acquisition because of the crossover cable used.
To avoid any risk of corrupting the evidence (changing it by writing something to the hard drive), direct acquisition is always preferred. Cross-over acquisition is used only when for some reason it is impossible to extract the hard drive from the computer. I did not encounter such a case during my internship, so I always used direct acquisition.
First of all, forensics has to be done on an image (copy) of the evidence, not on the original data. A special device, FastBloc, physically blocks all writes, ensuring that no process can write to the hard drive being acquired.
EnCase and FTK both allow acquiring different kinds of evidence:
• an image of a hard drive
• regular files
• a hard drive itself (in this case the program creates an image)
The image acquisition process takes a long time (several hours). Not only the allocated space has to be acquired: often the most interesting things are found in unallocated space. EnCase double-checks the image after the acquisition and also offers options such as computing the hash value of files during the acquisition. It can also compress the image during the acquisition; the duration of the process depends on the compression level.
⁴ if a lot of keyword searches were run, it could take up to 30-40 minutes
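The check that an acquired image matches its source can be illustrated by hashing both in fixed-size chunks and comparing the results. The sketch below is an assumption-laden illustration, not how EnCase works internally: it uses the non-cryptographic FNV-1a hash only to stay self-contained, whereas the forms in appendix D refer to MD5, and the function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <istream>
#include <sstream>
#include <string>

// FNV-1a, 64-bit. NOT a cryptographic hash: it stands in for MD5/SHA here
// purely so that the sketch needs no external library.
std::uint64_t hashStream(std::istream& in) {
    std::uint64_t h = 0xcbf29ce484222325ULL;  // FNV offset basis
    char chunk[4096];
    // Read chunk by chunk so that arbitrarily large images fit in memory.
    while (in.read(chunk, sizeof chunk) || in.gcount() > 0) {
        const std::streamsize n = in.gcount();
        for (std::streamsize i = 0; i < n; ++i) {
            h ^= static_cast<unsigned char>(chunk[i]);
            h *= 0x100000001b3ULL;  // FNV prime
        }
    }
    return h;
}

// Convenience wrapper for in-memory data.
std::uint64_t hashBytes(const std::string& bytes) {
    std::istringstream in(bytes);
    return hashStream(in);
}

// True if both streams produce the same hash value.
bool imageMatchesSource(std::istream& source, std::istream& image) {
    return hashStream(source) == hashStream(image);
}
```

In practice the two streams would be the evidence drive (behind a write blocker) and the image file; a mismatch means the image cannot be used as evidence and the acquisition has to be repeated.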
The investigator has to take photos of the evidence: workstation, laptop, hard drive. The serial number must be easily readable in these photos. Photos of the person's workplace are also recommended when the person does not know that his disk is being acquired. These photos help to put everything back in place after the data acquisition.
2.3.3 Analysis
Once the data is acquired, the analysis can begin. As already mentioned, there are some standard procedures that an investigator has to run for every case:
• search for hidden partitions
• restore deleted files
• recover lost files
• etc.
EnCase offers tools for all of these. Of course, each action has to be documented (see appendix D). Fortunately, the standard case procedure forms prescribe exactly which actions to perform, so the investigator follows the forms, runs the scripts and fills in the forms.
On average these actions (scripts) take a long time (3 to 6 hours per script). This leaves plenty of time for the paperwork. Some processes, however, can take more than 10 hours (many keywords or regular expressions to search for on a large hard drive), so I always tried to run those scripts overnight in order to continue the work the next morning.
Keywords are the variable part of any digital forensics case. They are provided by the client, depending on what he is looking for.
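A keyword search over acquired data can be pictured as scanning a byte buffer once per keyword and recording every offset at which the keyword occurs. The sketch below is a naive illustration under that assumption; it says nothing about how EnCase or FTK actually index or search, and findKeywords is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Returns, for each keyword that occurs in data, the byte offsets of all
// its occurrences. The match is exact and case-sensitive.
std::map<std::string, std::vector<std::size_t>>
findKeywords(const std::string& data, const std::vector<std::string>& keywords) {
    std::map<std::string, std::vector<std::size_t>> hits;
    for (const std::string& kw : keywords) {
        if (kw.empty()) continue;  // an empty keyword would match everywhere
        for (std::size_t pos = data.find(kw); pos != std::string::npos;
             pos = data.find(kw, pos + 1)) {
            hits[kw].push_back(pos);
        }
    }
    return hits;
}
```

Because this approach makes one full pass over the data per keyword, its cost grows with both the size of the drive and the number of keywords, which is consistent with long searches being left to run overnight.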
2.3.4 Meet the client
After the analysis, once the investigator has all the results, he (or she) needs to report to the client.
Generally the keywords mean little to the investigator, but the client knows exactly what he is looking for, so he usually reviews all search results and, based on what he saw, asks for more (more keywords or some specific files).
Unfortunately, EnCase and FTK are quite hacker-friendly, so the investigator has to explain and show to the client how the reviewing environment works.
A hacker-friendly environment is an environment that is friendly only to experienced users who are familiar with the IT domain.
FTK has a slightly friendlier internal reviewing environment for e-mails and HTML pages, but it is still not easy to understand for an inexperienced user.
EnCase has a simple environment for reviewing pictures (jpg, bmp, png, etc.) and documents (PowerPoint, Excel, Word).
2.4 Scientific articles
Sometimes, as already mentioned, I had to wait several hours for a task to finish.
During that time I generally read scientific articles about one of the two subjects of my interest:
• Reverse engineering ([13], [4], [3], [8])
• Malware reverse engineering ([12], [9], [7], [15], [1])
• File formats and protocols reverse engineering ([16], [2], [14], [5])
• Volatile data capture/analysis ([6])
This deepened my knowledge of these domains and helped me to understand that I am more interested in reverse engineering than in volatile data capture or analysis.
During the internship I also saw that in practice investigators do not usually have access to volatile data. Even if they did, the structure of volatile data can change very quickly (with each new release of an OS), so any program for volatile data analysis is quite difficult to maintain (especially for non-open-source operating systems). This made me understand that volatile data analysis is not cost-effective. That is why I mostly read articles about reverse engineering.
While reading more about reverse engineering I found two very interesting domains:
• files/protocols structure reverse engineering is the reverse engineering of file formats (e.g. png, mp3); the goal is to understand the meaning of each field of a file and to discover its structure (which fields are optional, etc.). In the case of a protocol (e.g. IP, TCP, HTTP) the goal is to understand the meaning of each field in a message and the related state machine (a diagram that shows all sequences of different messages).
• executable files reverse engineering is the reverse engineering of executable files; the goal is to understand how a given program works and what it does. Counter reverse engineering techniques (code obfuscation) are techniques used to protect an executable file from reverse engineering⁵.
⁵ executable files reverse engineering and code obfuscation techniques should be learned together, because in order to reverse-engineer an executable file one has to understand code obfuscation techniques and vice versa.
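To make the notion of "fields of a file" concrete, here is a small sketch that walks the top-level structure of a PNG file, whose layout is publicly documented: an 8-byte signature followed by chunks, each made of a 4-byte big-endian length, a 4-byte type, the data and a 4-byte CRC. The Chunk type and the function names are hypothetical, and the CRC is skipped rather than verified.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// One top-level PNG chunk, reduced to the two fields this sketch reports.
struct Chunk {
    std::string type;      // four ASCII letters, e.g. "IHDR"
    std::uint32_t length;  // size of the chunk data in bytes
};

// Reads a big-endian 32-bit integer starting at data[pos].
static std::uint32_t readBE32(const std::vector<unsigned char>& data, std::size_t pos) {
    return (std::uint32_t(data[pos]) << 24) | (std::uint32_t(data[pos + 1]) << 16) |
           (std::uint32_t(data[pos + 2]) << 8) | std::uint32_t(data[pos + 3]);
}

// Lists the chunks of a PNG byte buffer. Returns an empty list if the
// signature is missing; stops at IEND or at the first truncated chunk.
std::vector<Chunk> listPngChunks(const std::vector<unsigned char>& data) {
    static const unsigned char sig[8] = {0x89, 'P', 'N', 'G', 0x0D, 0x0A, 0x1A, 0x0A};
    std::vector<Chunk> chunks;
    if (data.size() < 8) return chunks;
    for (std::size_t i = 0; i < 8; ++i)
        if (data[i] != sig[i]) return chunks;
    std::size_t pos = 8;
    while (data.size() - pos >= 12) {  // length + type + CRC at minimum
        const std::uint32_t len = readBE32(data, pos);
        if (len > data.size() - pos - 12) break;  // truncated chunk
        chunks.push_back(Chunk{std::string(data.begin() + pos + 4, data.begin() + pos + 8), len});
        pos += 12 + static_cast<std::size_t>(len);  // skip data and CRC
        if (chunks.back().type == "IEND") break;
    }
    return chunks;
}
```

For a documented format such as PNG this structure is known in advance; reverse engineering an unknown format amounts to recovering an equivalent description (field boundaries, lengths, optional parts) from sample files alone.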
2.5 Miscellaneous
While I was waiting (for a project to investigate) I also did some other work:
• installed Windows Server
• encrypted several external hard drives that are used to store case files with TrueCrypt⁶
• tested some equipment (FastBloc devices, cables and external hard drives)
⁶ Free open-source disk encryption software for Linux, Windows and Mac OS X (http://www.truecrypt.org/)
Acknowledgments
I am very grateful to all members of the PwC FTS team for their help and advice, for the great experience I acquired, and for trusting me by actively involving me in projects.
Conclusions
This internship allowed me to integrate into a real-world working environment and gave me some very interesting experience.
I learned how to work with digital evidence and saw theoretical knowledge applied in practice.
During my internship I understood that volatile data acquisition and analysis are not really cost- and time-effective for investigations and tend to be useless in many situations (although they remain very interesting).
This helped me to choose my master's thesis subject between volatile data acquisition/analysis and reverse engineering.
I also managed to find two interesting domains of reverse engineering:
• files/protocols structure reverse engineering
• executable files reverse engineering and counter reverse engineering techniques (code obfuscation)
This internship was valuable and helped me to make this choice, so I can start writing my master's thesis immediately.
Appendix A
Consent
CONSENT TO ENTER, SEARCH, SEIZE AND REMOVE
I ... (full name), being employed by ... (company name) as a/the ... (position), do hereby consent and authorise the appointed members of PricewaterhouseCoopers Ltd Forensic Technology Solutions to:
• Search the premises ... (location);
• Seize all and any relevant electronic data, stored in any format;
• Copy all and any relevant electronic data;
• Seize all and any relevant computer or related equipment;
being the property of or in legal possession of ... (company).
I further declare that I, in my personal capacity or due to the position I hold, am duly authorised to grant the authorisation as above.
SIGNED: ..
DATE: ..
FULL NAMES: ..
PLACE SIGNED: ..
Data Acquisition Form

Forensic Technology Solutions
DATA ACQUISITION RECORD
Case Information:
PwC Office
Client
Acquired by:
FTS Data Tracker #

Image Name: ______________________________
Signature:
Date of Acquisition:
Image created at:
Office / Client Site
SUBJECT COMPUTER INFORMATION

MAKE:
CMOS TIME:
MODEL:
Date:
ACTUAL TIME:
Time:
Date:
Time:
SERIAL #:
Desktop:
Laptop:
Photos taken: Yes
No
Server:
Time Zone:
Daylight Saving:
Bare Dr:
If no, reason not taken:
Source:
Photo CD/DVD: FTS Data Tracker #:
Photo range on memory card (e.g. IMG_011.jpg - IMG_016.jpg):

Computer State: OFF:
ON:
ON and LOGGED IN:
User:
Other:
If ON Shutdown method:
Normal shutdown / Pulled the plug
Encryption:
Encryption Type (e.g. Safeguard Easy):
Other:
UN:
Password:
Acquisition Notes (note any running processes):
SUBJECT HARD DRIVE PHYSICAL INFORMATION

MAKE:
IDE / SCSI/SATA
MODEL:
SCSI ID# :
SERIAL #:
M / S / C / OTHER
Terminated: Y
LBA
Cylinders
N N/A
Heads
Sectors
CAPACITY:
RAID type:
Stripe size:
Notes:
CONTROLLER:
Photos taken: YES / NO
If no, reason photos not taken:
Photo CD/DVD: FTS Data Tracker #:
Photo range on memory card (e.g. IMG_017.jpg - IMG_022.jpg):
OTHER SUBJECT MEDIA INFORMATION

FLOPPY
Label:
Description:
CD
Label:
Description:
ZIP DISK/LS-120
Label:
Description:
OTHER
PDA
Other Media :
Description:
Make:
Model :
Version 7
April 16, 2008
Memory :
Page 1 of 2
Privileged and Confidential
Attorney Work Product

DATA ACQUISITION RECORD
FTS Data Tracker #:
ENCASE ACQUISITION INFORMATION

Encase Acquisition Version:
Method:
Other method:
After Acquisition:
Suspect (Encase Boot Disk) Version:

DOS/BIOS
DOS/ATA
Parallel
Network
FastBloc2 FE
FastBloc FE
Other
(Detail other method here)
Restart Acquisition:
Do not add / Add to case / Replace source drive
Notes:
File Segment Size:
Start Sector:
Compression:
Password:
Block size:
Image Hash:
640 MB
Other:
0/
Stop Sector:
Sectors reported by Encase matches total

physical sectors?
NONE / GOOD / BEST

NONE / YES
Password:
64 /
Error granularity:
YES
NO
If NO, image with Linux or DOS.
64 /
CHECKED / UNCHECKED
Read Ahead:
Quick Acquisition:
Remote Acquisition:
Output Path:
Alternate Path:
IMAGE HASH and VERIFICATION

Acquisition Hash Value:
Verify Hash matches Acquisition Hash? YES / NO
Acquisition Notes (Document any errors):
PwC ACQUISITION DESTINATION DRIVE INFORMATION

Make:
Capacity:
Model:
Acquisition Drive FTS
Tracking #:
Serial #:
Description:
PwC ACQUISITION BACKUP DESTINATION DRIVE INFORMATION

Make:
Capacity:
Model:
Backup Drive FTS
Tracking #:
Serial #:
Description
ADDITIONAL NOTES
Version 7
April 16, 2008
Page 2 of 2
Privileged and Confidential
Attorney Work Product
Appendix C
Chain of custody
Project Name: .. Case Number: ....

Client: ..Contact person: Telephone: ..........
Physical Address: ........
.
Date: Time: FTS Examiner:
Short Summary of case: ......
Owner (or agent thereof) of PC:
Location of PC: ...
EQUIPMENT TO BE EXAMINED:
Evidence Number:
System:
Model: ..
Serial No:
System date: System time:
Real date:..Real time:..
HARD DRIVE DATA ACQUISITION
Drive: .. of .
Manufacturer
Model
Serial Number
Type
Acquisition Method:
Desktop PC Acquisition:
Laptop Acquisition
Acquisition Type:
Fast Bloc
DOS
Network
Parallel Cable
HARD DRIVE DATA ACQUISITION

Drive:.. of .
Manufacturer
Model
Serial Number
Acquisition Method:
Desktop PC Acquisition:
Laptop Acquisition
Acquisition Type:
Fast Bloc
DOS
Network
Parallel Cable
Type
FDISK
Drive 1
Partition
Status
Type
Vol. Label
MB
Sys
Usage
Status
Type
Vol. Label
MB
Sys
Usage
Drive 2
Partition
I hereby confirm that the computer, laptop or

electronic equipment was left in a working condition after the image process was completed by the
FTS investigator. (signature of Owner (or agent thereof) of PC ) .
Case procedure form

(En)CASE PROCEDURE RECORD
Evidence Number
_____________________
Case Notes Developer:

Signature:
Case Date( mm/dd/yyyy):
_____/_____/ 200_____
Forensic Computer Info:
Make: _______________ Model: _______________ Notes: ______________________
Default Export Folder:

Temporary Folder:
CASE PROCEDURES
EnScript Partition Finder :
Number of Partition signatures found/valid : ____/____
Types : _______________
Notes : __________________________________________________________________________________________________
(This EnScript searches for the signature of a partition in unallocated disk space, i.e. a potentially deleted partition.)
RECOVER LOST FOLDERS :
NTFS
Drive: _____
Number of Recovered Folders: _______
FAT32
Drive: _____
Drive: _____
Other:
For NTFS file system, right-click on the volume and the option Recover Folders will be available for additional, advanced recovery. Initial
recovery is automatically available. For FAT file system, scan the volume for lost folders. Locate the Recovered Folders folder and open it.
The recovered folders will be listed. Any sub-folders or files will be undeleted.
Notes : __________________________________________________________________________________________________
EnScript - FILE MOUNTER :
EXTENSION / SIGNATURE / BOTH
Number of files mounted : _______

Mount Persistent:
File Types:
Notes :
(This EnScript mounts the selected file types in a case to allow viewing and searching.)
VERIFY FILE SIGNATURE AND COMPUTE HASH VALUE :
Notes :
(Verify File Signature: compares each file signature with its extension to identify any files whose extensions have been
deliberately changed. Hash Value: MD5 algorithm used to generate a unique 128-bit fingerprint.)
EXPORT FILE LISTING AND BOOKMARK FILES SELECTED:
Notes :
(This creates a text file containing the attributes of the files viewed in the case. Note the path of the text file.)
EnScript FILTER MAIL AND BOOKMARK FILES SELECTED:
Number of files script identified: _________
(Search for common mail file types 1. Create a note 2. Paste the contents of the EnScript in the note 3. Save note as EnScript - Mail Filter )
COPY\UNERASE FILTER MAIL :

From :
To :
Highlighted Files
All Selected Files
Separate Files
Merge into one file
Automatically Replace First Character With (circle one):

Copy :
Character Mask:
Logical File Only

None
Entire Physical File
Don't Write Non-ASCII Characters
Number of Files Copied to folder:
_______________
Other : _____
RAM and Disk Slack
RAM Slack Only
Replace Non-ASCII Characters .

Copy Files Size :
Destination of files:
_______________________________________________________
Split files larger than:
500,000 (MB)
640 (MB) Default
Show Errors
_______________ (kb)
Other ___________
Notes: ___________________________________________________________________________________________
Version 5
Page 1 of 2
April 16, 2008

(En)CASE PROCEDURE RECORD
Evidence Number__________________________
EnScript FILTER COMMON FILE TYPES (Active & Deleted) AND BOOKMARK FILES SELECTED:
Number of files script identified: _________
(Search for common file types Example: .doc, .xls, .ppt, .pdf, etc.
1. Create a note. 2. Paste the contents of the EnScript in the note. 3. Save note as EnScript
Common File Types Active & Deleted .)
COPY\UNERASE COMMON FILE TYPES:

From :
To :
Highlighted Files
All Selected Files
Separate Files
Merge into one file
Automatically Replace First Character With (circle one):

Copy :
Character Mask:
Logical File Only

None
Don't Write Non-ASCII Characters
Number of Files Copied to folder:

Destination of files:
Split files larger than:
Entire Physical File
Other : _____
RAM and Disk Slack
RAM Slack Only
Replace Non-ASCII Characters .
_______________
Copy Files Size :
Show Errors
_______________ (kb)
________________________________________________________________________________
500,000 (MB)
640 (MB) Default
Other ___________
Notes:
ADDITIONAL EnScript(s):
Notes :
(1. Create a note. 2. Paste the contents of the EnScript(s) in the note. 3. Save note after the executed EnScript(s).)
FILE SIGNATURE Export:
Count :
_________
(1. Tools. 2. File Signature and Viewers. 3. Select all File Signatures. 4. Export. 5. Select all Export checked columns .
6. Output file : __________________________________________________________________________________________
7. Open text file and copy the contents. 8. Create a note. 9. Paste the contents of the File Signature list in the note.
10. Save note as File Signature List .)
HASH SET Export:
Count :
_________
(1. Tools. 2. Hash Set. 3. Export. 4. Select all Export checked columns .
5. Output file : __________________________________________________________________________________________
6. Open text file and copy the contents. 7. Create a note. 8. Paste the contents of the Hash Set list in the note.
9. Save note as Hash Set .)
KEYWORDS :
Notes:
SAVE CASE: (Note the path)

Notes :
ADDITIONAL CASE NOTES :
Version 5
April 16, 2008
Page 2 of 2
References
[1] Daniel A. Quist and Lorie M. Liebrock. Visualizing compiled executables for malware analysis. Visualization for Cyber Security, 2009. VizSec '09, October 2009.
[2] Dario Bonfiglio, Marco Mellia, Michela Meo, Nicolò Ritacca, and Dario Rossi. Tracking down Skype traffic. In INFOCOM '08, pages 261-265. IEEE, 2008.
[3] E. Buss, R. De Mori, M. Gentleman, J. Henshaw, H. Johnson, K. Kontogiannis, E. Merlo, H. Muller, J. Mylopoulos, S. Paul, A. Prakash, M. Stanley, S. Tilley, J. Troster, and K. Wong. Investigating reverse engineering technologies: The CAS program understanding project. IBM Systems Journal, 33(3):477-500, July 1994.
[4] Gerardo Canfora and Massimiliano Di Penta. New frontiers of reverse engineering. In FOSE '07, pages 326-341. IEEE Computer Society Washington, 2007.
[5] Gregory Conti, Erik Dean, Matthew Sinda, and Benjamin Sangster. Visual reverse engineering of binary and data files. In VizSec '08, pages 1-17. Springer-Verlag Berlin, 2008.
[6] Iain Sutherland, Jon Evans, Theodore Tryfonas, and Andrew Blyth. Acquiring volatile operating system data tools and techniques. ACM SIGOPS Operating Systems Review, 42:65-73, April 2008.
[7] Kris Kendall. Practical malware analysis. 2007.
[8] Min Gyung Kang, Pongsin Poosankam, and Heng Yin. Renovo: A hidden code extractor for packed executables. In WORM '07, pages 46-53. ACM, 2007.
[9] Monirul Sharif, Vinod Yegneswaran, Hassen Saidi, Phillip Porras, and Wenke Lee. Eureka: A framework for enabling static analysis on malware. In ESORICS '08, pages 481-500. Springer-Verlag Berlin, 2008.
[10] The Honeynet Project. Know your enemy. Learning about security threats. Boston, 2nd edition, 2004.
[11] PwC. FTS brochure. Available at http://www.pwc.com/en_BE/be/dispute-analysis-and-investigation/forensic-technology-solutions-pwc-08.pdf, consulted 28 December 2010.
[12] Robert Lyda and James Hamrock. Using entropy analysis to find encrypted and packed malware. IEEE Security and Privacy, 5:40-45, March 2007.
[13] Rui Wang, XiaoFeng Wang, Kehuan Zhang, and Zhuowei Li. Towards automatic reverse engineering of software security configurations. In CCS '08, pages 245-256. ACM, 2008.
[14] Weidong Cui, Marcus Peinado, Karl Chen, Helen J. Wang, and Luiz Irun-Briz. Tupni: Automatic reverse engineering of input formats. In CCS '08, pages 391-402. ACM, 2008.
[15] Lenny Zeltser. Reverse engineering malware. April/May 2001.
[16] Zhiqiang Lin, Xuxian Jiang, Dongyan Xu, and Xiangyu Zhang. Automatic protocol format reverse engineering through context-aware monitored execution. 2007.
Bibliography
[1] Berne convention for the protection of literary and artistic works, 1971. http://
www.wipo.int/treaties/en/ip/berne/trtdocs_wo001.html, consulted 20 May
2011.
[2] Encyclopedia of cryptography and security. Eindhoven University of Technology,
springer edition, 2005.
[3] Mikhail J. Atallah and Chang Hoi. Method and system for tamperproofing software,
2006. United States Application 20060031686, February 2006. Assigned to Purdue
Research Foundation.
[4] Gregory Andrews Matthew Legendre Benjamin Schwarz, Saumya Debray. Plto: A
link-time optimizer for the intel ia-32 architecture. In In Proc. 2001 Workshop on
Binary Translation (WBT-2001, 2001.
[5] Bostjan Bercic. Software emulation in the light of eu legislation. In BILETA05,
2005.
[6] CDP. Cdp history. http://www.cdp.com/history.shtml, consulted 15 November
2010.
[7] Jasvir Nagra Christian Collberg. Surreptitious Software: Obfuscation, Watermarking, and Tamperproofing for Software Protection. Addison-wesley edition, 2009.
[8] Gareth Cronin. A taxonomy of methods for software piracy prevention. http:
//www.croninsolutions.com/writing/piracytaxonomy.pdf, consulted 16 May
2011.
[9] Saumya Debray Cullen Linn. Obfuscation of executable code to improve resistance
to static disassembly. In CCS 03, pages 290299. ACM, 2003.
[10] Lorie M. Liebrock Daniel A. Quist. Visualizing compiled executables for malware
analysis. Visualization for Cyber Security, 2009. VizSec 09, October 2009.
[11] Michela Meo Dario Rossi Paolo Tofanelli Dario Bonfiglio, Marco Mellia. Revealing
skype traffic: when randomness plays with you. In SIGCOMM 07, pages 3748.
ACM, 2007.
[12] Michela Meo Nicol
o Ritacca Dario Bonfiglio, Marco Mellia and Dario Rossi. Tracking down skype traffic. In INFOCOM 08, pages 261265. IEEE, 2008.
[13] Dept. of Comput. Sci. Northern Illinois Univ.-DeKalb IL Davis K.H., Alken P.H.
Data reverse engineering: a historical survey. In Seventh Working Conference on
Reverse Engineering, pages 7078. IEEE, 2000.
127
128
[14] Andromeda decompiler. Andromeda decompiler. http://shulgaaa.at.tut.by/,
consulted 16 May 2011.
[15] Boomerang decompiler.
Boomerang decompiler.
sourceforge.net/, consulted 16 May 2011.
http://boomerang.
[16] Cambridge dictionary. Cambridge dictionary. http://dictionary.cambridge.

org/dictionary/british/reverse-engineering, consulted 16 May 2011.
[17] Oxford dictionary. Oxford dictionary. http://oxforddictionaries.com/view/
entry/m_en_gb0707320#m_en_gb0707320, consulted 16 May 2011.
[18] M. Gentleman J. Henshaw H. Johnson-K. Kontogiannis E. Merlo H. Muller J. Mylopoulos S. Paul A. Prakash M. Stanley S. Tilley J. Troster K. Wong E. Buss, R.
De Mori. Investigating reverse engineering technologies: The cas program understanding project. IBM Systems Journal, 33(3):477-500, July 1994.
[19] Eldad Eilam. Reversing: Secrets of Reverse Engineering. 10475 Crosspoint Boulevard. Indianapolis, IN 46256, wiley publishing edition, 2005.
[20] FOLDOC. Free on-line dictionary of computing.
http://foldoc.org/reverse+engineering, consulted 16 May 2011.
[21] GDB. Gdb: The gnu project debugger. http://www.gnu.org/software/gdb/,
consulted 16 May 2011.
[22] Massimiliano Di Penta Gerardo Canfora. New frontiers of reverse engineering. In
FOSE '07, pages 326-341. IEEE Computer Society Washington, 2007.
[23] Matthew Sinda Gregory Conti, Erik Dean and Benjamin Sangster. Visual reverse
engineering of binary and data files. In VizSec '08, pages 1-17. Springer-Verlag
Berlin, 2008.
[24] Raymond Hill. A First Course in Coding Theory. Oxford, clarendon press edition,
1986.
[25] Andrew Huang. Hacking the Xbox. An introduction to reverse engineering. San
Francisco, no starch press edition, 2003.
[26] Bruce Jacob. The risc-16 instruction-set architecture.
http://www.ece.umd.edu/~blj/RiSC/, consulted 16 May 2011.
[27] Steven Levithan Jan Goyvaerts. Regular expression cookbook. O'Reilly edition,
2009.
[28] Helmut Veith Johannes Kinder. Jakstab: A static analysis platform for binaries.
In CAV '08, pages 423-427. Springer-Verlag Berlin, Heidelberg, 2008.
[29] Kris Kendall. Practical malware analysis, 2007.
[30] Nate Lawson. Mesh design pattern: error correction, 2007.
http://rdist.root.org/2007/08/21/mesh-design-pattern-error-correction/, consulted 16 May 2011.
[31] Zhiqiang Lin, Xuxian Jiang, Dongyan Xu, and Xiangyu Zhang. Automatic protocol
format reverse engineering through context-aware monitored execution. In 15th
Symposium on Network and Distributed System Security (NDSS), 2008.
[32] Thomas J. Lynch. Data compression techniques and applications. Van Nostrand
Reinhold, New York, 1985.
[33] Koen De Bosschere Matias Madou, Ludo Van Put. Loco: an interactive code
(de)obfuscation tool. In PEPM '06, pages 140-144. ACM, 2006.
[34] Scott Mueller. Upgrading and repairing PCs. 10475 Crosspoint Boulevard. Indianapolis, IN 46256, que edition, 2004.
[35] Anup K. Ghosh Michael N. Gagnon, Stephen Taylor. Software protection through
anti-debugging. IEEE Security and Privacy, 5(3):82-84, May/June 2007.
[36] Microsoft. Wingdb. http://www.wingdb.com/, consulted 16 May 2011.
[37] Pongsin Poosankam Min Gyung Kang and Heng Yin. Renovo: A hidden code
extractor for packed executables. In WORM '07, pages 46-53. ACM, 2007.
[38] Hassen Saidi Phillip Porras Wenke Lee Monirul Sharif, Vinod Yegneswaran. Eureka:
A framework for enabling static analysis on malware. In ESORICS '08, pages 481-500.
Springer-Verlag Berlin, 2008.
[39] Jonathon Giffin Wenke Lee Monirul Sharif, Andrea Lanzi. Impeding malware analysis using conditional code obfuscation. 2009 30th IEEE Symposium on Security
and Privacy, pages 94-109, May 2009.
[40] Ghent University PARIS research group, ELIS department. Diablo deobfuscator.
http://diablo.elis.ugent.be/obf_deobfuscation_byhand, consulted 16 May
2011.
[41] David Dagon Robert Edmonds Wenke Lee Paul Royal, Mitch Halpin. Polyunpack:
Automating the hidden-code extraction of unpack-executing malware. In ACSAC
'06, pages 289-300. IEEE Computer Society, 2006.
[42] Phoenix. Phoenix press-center. http://www.phoenix.com/pages/press-center,
consulted 15 November 2010.
[43] IDA Pro. Idapro. http://www.hex-rays.com/idapro/, consulted 16 May 2011.
[44] The Honeynet Project. Know your enemy. Learning about security threats. Boston,
addison-wesley edition, 2004.
[45] James Hamrock Robert Lyda. Using entropy analysis to find encrypted and packed
malware. IEEE Security and Privacy, 5:40-45, March 2007.
[46] Kehuan Zhang Zhuowei Li Rui Wang, XiaoFeng Wang. Towards automatic reverse
engineering of software security configurations. In CCS '08, pages 245-256. ACM,
2008.
[47] Ravi Sethi. Programming languages concepts and constructs. Addison-Wesley, the
University of Michigan, second edition, 1996.
[48] Matias Madou Sharath K. Udupa, Saumya K. Debray. Deobfuscation: Reverse
engineering obfuscated code. In WCRE 2005, pages 45-54, 2005.
[49] Sean W. Smith. Trusted computing platforms: design and applications. Boston,
springer edition, 2005.
[50] Sysinternals. Sysinternals. http://sysinternals.com, consulted 16 May 2011.
[51] Andrew S. Tanenbaum. Structured Computer Organization. Pearson Prentice Hall,
fifth edition, 2006.
[52]
.. .
-4.
pages 26, 2
2008.
[53] Federal Circuit U.S. Court of Appeals. Atari vs. nintendo, 1992.
http://digital-law-online.info/cases/24PQ2D1015.htm, consulted 16 May 2011.
[54] Peter Wayner. Disappearing cryptography. Information hiding: Steganography &
Watermarking. Morgan Kaufmann publishers.
[55] Karl Chen Helen J. Wang Luiz Irun-Briz Weidong Cui, Marcus Peinado. Tupni:
Automatic reverse engineering of input formats. In CCS '08, pages 391-402. ACM,
2008.
[56] Wikipedia. Boeing b-29 superfortress.
http://en.wikipedia.org/wiki/B-29_Superfortress, consulted 15 November 2010.
[57] Wikipedia. Boeing b-29 superfortress history.
http://www.boeing.com/history/boeing/b29.html, consulted 15 November 2010.
[58] Wikipedia. Ibm pc compatible.
http://en.wikipedia.org/wiki/IBM_PC_compatible, consulted 15 November 2010.
[59] Wikipedia. Ibm personal computer.
http://en.wikipedia.org/wiki/IBM_Personal_Computer, consulted 15 November 2010.
[60] Wikipedia. list of countries copyright length.
http://en.wikipedia.org/wiki/List_of_countries%27_copyright_length, consulted 20 May 2011.
[61] Wikipedia. Numega softice. http://en.wikipedia.org/wiki/SoftICE, consulted
16 May 2011.
[62] Wikipedia. Patent. http://en.wikipedia.org/wiki/Patent, consulted 20 May
2011.
[63] Wikipedia. Tupolev tu-4. http://en.wikipedia.org/wiki/Tupolev_Tu-4, consulted 15 November 2010.
[64] Wikipedia. Tupolev tu-4 history.
http://www.tupolev.ru/Russian/Show.asp?SectionID=135, consulted 15 November 2010.
[65] Gregory Wroblewski. General method of program code obfuscation. In SERP '02,
2002.
[66] Oleh Yuschuk. Ollydbg. http://www.ollydbg.de/, consulted 16 May 2011.
[67] Lenny Zeltser. Reverse engineering malware, April/May 2001.
