
Embedded Security Challenge (ESC) '16 Final Report

Tri Minh Cao, Sai-Kumar Marri, Danqing Liu

Jeyavijayan Rajendran, Yiorgos Makris

Department of Electrical Engineering


The University of Texas at Dallas
Richardson, USA
tmc096020@utdallas.edu, skm150130@utdallas.edu,
dxl161630@utdallas.edu

Department of Electrical Engineering


The University of Texas at Dallas
Richardson, USA
jv.ee@utdallas.edu, yiorgos.makris@utdallas.edu

Abstract: In this report, we propose two techniques, Dynamic
Trusted Platform Module (DTPM) and Secure Return Address
Stack (SRAS), to defend against TOCTOU (time-of-check to
time-of-use) and ROP (return-oriented programming) attacks.
These techniques prevent an adversary from modifying a
program without the user's knowledge and protect the stack
from being exploited. We have implemented the defensive
modules and verified the HDL code locally. Unfortunately, we
had difficulty obtaining basic block information on the
OpenRISC system, so our solution is not yet compatible with
the OpenRISC system.

Keywords: OpenRISC; TOCTOU; Code Reuse; DTPM; SRAS; GCM

I. INTRODUCTION

Our solution defends against two types of attacks. The first is
Time-of-check to Time-of-use (TOCTOU). In a TOCTOU attack, an
adversary modifies a program after it has been checked by the
system, usually at boot time. We counter TOCTOU attacks by
checking the integrity of the program at runtime, so that
malicious code injected after the boot-time check can be
detected. We ensure the integrity of a program by comparing the
hash values of the program's basic blocks computed at runtime
against pre-computed hash values, which are stored in a
dedicated secure memory area.
Even without the ability to inject malicious code, an adversary
can still mount Code Reuse Attacks (CRAs), such as
return-oriented programming (ROP). To defend against ROP, we
need to protect the return addresses on the stack. The
adversary's goal is to change these return addresses through
techniques such as buffer overflow and stack smashing. Our
solution, called Secure Return Address Stack (SRAS), creates a
shadow stack. On each function return, we compare the return
address at the top of the original stack with the address from
the shadow stack. If they do not match, an attack has occurred.
The rest of the report is organized as follows. First, we
discuss the background topics related to our project. Then we
describe the implementation of our proposed defense mechanisms,
followed by the simulation and experimental setup and results.
The report ends with a conclusion.

Figure 1: Basic block example

II. BACKGROUND

A. Basic blocks
A basic block is a sequence of consecutive instructions with
only one entry point and one exit point [6]. Having one entry
point means that no instruction inside the block, other than the
first, is the destination of a jump. Having one exit point means
that only the last instruction can transfer control to another
basic block. When a basic block executes, each of its
instructions is executed exactly once, in order. This restricted
form makes a basic block highly amenable to analysis [2].
Compilers usually split programs into basic blocks as the first
step of the analysis process. The analyzer scans over the code
and marks block boundaries, which are instructions that either
transfer control or accept control from another point. The code
is then cut at each of these points to obtain the basic blocks.
Basic blocks are the vertices (nodes) of a control flow graph.
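The boundary-marking procedure can be sketched in Python as follows. This is only an illustration: the instruction record (address, mnemonic, optional branch target), the set of branch mnemonics, and the sample program are all hypothetical, and a real analyzer would decode actual machine instructions.

```python
from typing import List, Optional, Tuple

# Hypothetical instruction record: (address, mnemonic, branch target or None).
Instr = Tuple[int, str, Optional[int]]

# Assumed set of control-transfer mnemonics for this toy example.
BRANCHES = {"jmp", "beq", "bne", "call", "ret"}


def split_basic_blocks(program: List[Instr]) -> List[List[Instr]]:
    """Split a linear instruction list into basic blocks."""
    if not program:
        return []
    addresses = {addr for addr, _, _ in program}
    leaders = {program[0][0]}  # the first instruction starts a block
    for i, (addr, op, target) in enumerate(program):
        if op in BRANCHES:
            # A branch target accepts control, so it starts a block.
            if target is not None and target in addresses:
                leaders.add(target)
            # The instruction after a branch also starts a block.
            if i + 1 < len(program):
                leaders.add(program[i + 1][0])
    blocks: List[List[Instr]] = []
    for instr in program:
        if instr[0] in leaders:
            blocks.append([])  # cut at each boundary
        blocks[-1].append(instr)
    return blocks


# Made-up six-instruction program used only to exercise the function.
program = [
    (0x00, "add", None),
    (0x04, "beq", 0x10),
    (0x08, "sub", None),
    (0x0C, "jmp", 0x14),
    (0x10, "mul", None),
    (0x14, "ret", None),
]
blocks = split_basic_blocks(program)
block_ranges = [(b[0][0], b[-1][0]) for b in blocks]
print(block_ranges)
```

Each resulting (start, end) address pair corresponds to one vertex of the control flow graph.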

B. TOCTOU attacks
TOCTOU is a type of attack in which the attacker exploits the
race condition between the time of check and the time of use of
a resource [7]. The software checks the state of a resource
before using it, but the resource's state can change between the
check and the use in a way that makes the result of the check
invalid. This can cause the software to perform invalid actions
when the resource is in an unexpected state. The problem can
occur with shared resources such as files, memory, or even
variables in multithreaded programs. An attacker can exploit
such a race condition to modify application data, files,
directories, and memory.
Programs are usually checked for integrity when the system
boots. At runtime, however, the boot-time check no longer
guarantees the integrity of a program, because a TOCTOU attack
may have modified it in the meantime. As a result, runtime
integrity checking should be implemented.
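The gap between check and use can be illustrated with a small, deterministic Python simulation. The `memory` dictionary and the program strings are hypothetical, and SHA-256 is used here only as a convenient stand-in for whatever integrity measurement the system performs.

```python
import hashlib


def digest(code: bytes) -> str:
    # SHA-256 stands in for the system's integrity measurement.
    return hashlib.sha256(code).hexdigest()


# Hypothetical program image held in writable memory.
memory = {"program": b"print_balance()"}
trusted = digest(memory["program"])  # reference value recorded in advance

# Time of check: the boot-time integrity check passes.
boot_check_passed = digest(memory["program"]) == trusted

# Window between check and use: the adversary rewrites the program.
memory["program"] = b"transfer_funds()"

# Time of use: the stale boot-time result still says "trusted",
# whereas re-checking at the moment of use exposes the modification.
runtime_check_passed = digest(memory["program"]) == trusted

print(boot_check_passed, runtime_check_passed)
```

The boot-time result remains true even though the program has changed, which is why a check performed only once is insufficient.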
C. Code Reuse Attacks (CRAs)
CRAs are attacks in which an attacker uses the code inside a
compromised program to execute arbitrary malicious behavior by
diverting control flow through existing code. In other words,
an attacker can launch an attack without explicitly injecting
code into the program. Return-oriented programming (ROP) [8]
[9] and jump-oriented programming (JOP) [10] are two popular
CRA techniques.
In ROP, an attacker gains control of the program's call stack
to hijack program control flow. The attacker can then use the
call stack to execute multiple short code sequences, called
gadgets. By carefully choosing the sequences of code from the
program or the shared library code, an attacker can exploit the
system to perform arbitrary actions. Researchers have shown
that ROP poses a severe threat on various platforms, including
SPARC, embedded systems, and even the kernel.
JOP is another CRA, but it does not need the call stack.
Instead of using the return addresses of gadgets, a JOP attack
uses indirect branches to chain the gadgets.
D. Hash Function
A hash function is a one-way mathematical function of a
message: the output cannot easily be turned back into the
original message, even with knowledge of the hash algorithm.
Such a one-way function maps a message of arbitrary length to a
fixed-length string. It should also be infeasible to find two
different messages that produce the same hash value.
In our defense framework, we use AES-GCM to compute the hash
value of the machine instructions that make up each basic block
of the compiled C program. The construction must be secure
enough that it is difficult for an attacker to break.
E. GCM algorithm
Galois/Counter Mode (GCM) is a block cipher mode of operation
that provides authenticated encryption and authenticated
decryption. We only need authenticated encryption in our
implementation. As the underlying block cipher, we chose the
128-bit Advanced Encryption Standard (AES).
Figure 2 illustrates the GCM authenticated encryption
operation.
The GCM authenticated encryption operation has four inputs:
A secret key $K$. We assume that it is 128 bits long,
consistent with the underlying AES block cipher.
An initialization vector (IV) that can have up to $2^{64}$
bits. A 96-bit IV is recommended for efficiency.
A plaintext $P$ that can have up to $2^{39} - 256$ bits.
Additional authenticated data $A$ that can have up to $2^{64}$
bits. This data is authenticated but not encrypted.
It has two outputs:
A ciphertext $C$ whose length is identical to that of the
plaintext $P$.
An authentication tag $T$ that can have up to 128 bits.

Figure 2: The authenticated encryption operation. For
simplicity, a case with only a single block of additional
authenticated data (labeled AuthData1) and two blocks of
plaintext is shown. Here $E_K$ denotes the block cipher
encryption using the key $K$, $\mathrm{mult}_H$ denotes
multiplication in $GF(2^{128})$ by the hash key $H$, and incr
denotes the counter increment function.

The length of the tag is denoted as $t$. The plaintext and the
additional authenticated data are segmented into 128-bit
blocks. Suppose there are $n$ plaintext blocks
$P_1, P_2, \ldots, P_{n-1}, P_n$ and $m$ additional
authenticated data blocks $A_1, A_2, \ldots, A_{m-1}, A_m$,
where the final plaintext block $P_n$ has $u$ bits. The GCM
authenticated encryption operation is defined as follows:

$$
\begin{aligned}
H &= E(K, 0^{128}) \\
Y_0 &= IV \,\|\, 0^{31}1 \quad \text{if } \mathrm{len}(IV) = 96 \\
Y_i &= \mathrm{incr}(Y_{i-1}) \quad \text{for } i = 1, \ldots, n \\
C_i &= P_i \oplus E(K, Y_i) \quad \text{for } i = 1, \ldots, n-1 \\
C_n &= P_n \oplus \mathrm{MSB}_u(E(K, Y_n)) \\
T &= \mathrm{MSB}_t(\mathrm{GHASH}(H, A, C) \oplus E(K, Y_0))
\end{aligned}
$$

In our implementation, we use 128 bits of additional
authenticated data $A$, a 128-bit secret key, and a random
96-bit IV to process each basic block. We disable the
ciphertext output and use only the tag $T$ as the hash value of
the basic block.
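The non-AES building blocks of these equations can be sketched in plain Python. This is a minimal illustration rather than our HDL: the function names `gf_mult`, `initial_counter`, and `incr` are chosen here for readability, 128-bit blocks are represented as Python integers, and the block cipher $E$ is deliberately left out because AES is not part of the Python standard library. The bit ordering and the reduction constant follow the GCM specification [1].

```python
R = 0xE1 << 120            # reduction constant from the GCM specification
MASK128 = (1 << 128) - 1


def gf_mult(x: int, y: int) -> int:
    """Multiply two 128-bit blocks in GF(2^128) using GCM bit ordering
    (the most significant bit of the integer is bit 0 of the block)."""
    z, v = 0, x
    for i in range(128):
        if (y >> (127 - i)) & 1:   # bit i of y, counted from the left
            z ^= v
        # Shift v right by one bit; reduce if a bit fell off the end.
        v = (v >> 1) ^ R if v & 1 else v >> 1
    return z


def initial_counter(iv: int) -> int:
    """Y0 = IV || 0^31 1, valid only for a 96-bit IV."""
    assert 0 <= iv < (1 << 96)
    return (iv << 32) | 1


def incr(y: int) -> int:
    """Increment the rightmost 32 bits modulo 2^32; keep the other 96."""
    low = ((y & 0xFFFFFFFF) + 1) & 0xFFFFFFFF
    return (y & ~0xFFFFFFFF & MASK128) | low


ONE = 1 << 127  # the field element "1" in GCM bit ordering
```

Multiplying any block by `ONE` returns the block unchanged, and `incr(initial_counter(iv))` gives the counter value $Y_1$ used for the first plaintext block.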
F. GCM module
The AES-GCM module we use comes from the OpenCores repository
[3]. It needs about 96 clock cycles to compute the tag value
for test case 4 from [1].
G. OpenRISC
OpenRISC is an open-source project with the goal of creating a
free and open processor for embedded systems [11]. To make a
complete embedded system, the OpenRISC project consists of
several parts, all of them free and open-source: a RISC
instruction set architecture, a set of implementations of the
architecture, a complete set of software development tools,
libraries, operating systems, and applications.
OpenRISC 1000 is the name of the open-source RISC architecture
used in the project. There are two main implementations of
OpenRISC 1000: OR1200 and mor1kx [12]. We chose the mor1kx CPU
because it is newer, better supported, and more flexible. The
mor1kx designers focus on demonstrating implementation
trade-offs such as area and performance. They provide three
pipeline implementations: Cappuccino (standard, 6-stage
pipeline), Espresso (2-stage pipeline), and Pronto-espresso
(2-stage pipeline). The system provided for the CSAW ESC 2016
competition uses the Cappuccino pipeline, as shown in Figure 3.

Figure 3: mor1kx CPU with Cappuccino-type pipeline.

III. DTPM ARCHITECTURE

A. High-level idea
Assume that the user wants to run a program on the system. We
want to make sure that the program is not modified by an
adversary at runtime. To ensure the program's integrity, we
pre-compute the hash values of all basic blocks of the program.
At runtime, we extract the basic block data from the CPU
pipeline and compute the hash values of these basic blocks
again. We then compare the two sets of hash values to determine
the integrity of the program. If they match, the program is
clean. Otherwise, the program has been modified.
The Dynamic Trusted Platform Module (DTPM) analyzes the fetch
address and checks for the start and end addresses of the basic
blocks [13]. It computes the hash value using the GCM module.

Figure 4: Hash value computation.

B. Pre-compute hash values of basic blocks from a program


We run Valgrind [14] alongside a C program to obtain a list of
its basic blocks. Each basic block has a starting program
counter (PC) address and an ending address. Next, we gather all
the instructions in each basic block. We use objdump to
disassemble the C program. The output produced by objdump
consists of (PC address, instruction) pairs. We concatenate all
the instructions in a basic block into one combined instruction
(a hex string), and then compute the hash value of the basic
block by passing the hex string to an AES-GCM function. This
process is implemented in Python, with AES-GCM provided by the
Python cryptography library (cryptography.io) [15].
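The flow of this pre-computation can be sketched as follows. The sketch is not our actual script: the instruction listing, the block boundaries, and the all-zero key are made-up placeholders, and a truncated HMAC-SHA256 from the standard library stands in for the AES-GCM tag so that the example runs without third-party packages.

```python
import hashlib
import hmac
from typing import Dict, List, Tuple

# Hypothetical (PC address, instruction word as hex) pairs, standing in
# for disassembler output. The values are made up for illustration.
listing: List[Tuple[int, str]] = [
    (0x100, "9c21fff8"),
    (0x104, "d4011000"),
    (0x108, "04000010"),
    (0x10C, "15000000"),
    (0x110, "9c210008"),
    (0x114, "44004800"),
]

# Hypothetical basic blocks as (start PC, end PC), both inclusive.
basic_blocks: List[Tuple[int, int]] = [(0x100, 0x108), (0x10C, 0x114)]

KEY = bytes(16)  # placeholder 128-bit key, for illustration only


def block_hash(block_bytes: bytes, key: bytes = KEY) -> str:
    """Keyed 128-bit digest; HMAC-SHA256 is a stand-in for the AES-GCM tag."""
    return hmac.new(key, block_bytes, hashlib.sha256).digest()[:16].hex()


def precompute(
    listing: List[Tuple[int, str]], blocks: List[Tuple[int, int]]
) -> Dict[Tuple[int, int], str]:
    """Build the table of pre-computed basic block hash values."""
    table = {}
    for start, end in blocks:
        # Concatenate every instruction whose PC lies inside the block.
        combined = "".join(word for pc, word in listing if start <= pc <= end)
        table[(start, end)] = block_hash(bytes.fromhex(combined))
    return table


hash_table = precompute(listing, basic_blocks)
```

The resulting table, keyed by the start and end addresses of each basic block, corresponds to the data that would be placed in the dedicated secure memory area.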

Figure 5: DTPM architecture.

C. DTPM architecture
Figure 5 shows the architecture of our implementation. DTPM is
implemented outside the processor pipeline to keep the design
simple and generic, so that it can be ported to any processor
architecture. For this demonstration, we ported it to the
OpenRISC architecture.
In the OpenRISC architecture, the CPU fetches instructions from
the IMMU through the Wishbone interface.
The DTPM inputs are sampled on the Wishbone interface signals
driven by the Fetch module, and DTPM generates a stall signal
based on the basic block hash value. The sampled input signals
are ibus_addr, ibus_addr_req, ibus_data, and ibus_ack. The
output stall signal is OR-ed with the external stall signal and
driven to the execute stage.
D. DTPM State Machine
Upon PC reset, DTPM stalls the PC and initializes the AES-GCM
module. DTPM then monitors the addresses issued by the Fetch
block and looks them up in the cache memory that holds the
start and end addresses of the basic blocks. Once DTPM
encounters a basic block start address, it samples the
instructions and loads them into the AES-GCM module. If two
start addresses are encountered consecutively, DTPM resets the
AES-GCM module, treating the event as a wrong branch
prediction. After DTPM encounters the end address, it loads the
last word into the AES-GCM module and stalls the PC while it
waits for the hash value. Once the hash value is available,
DTPM compares it with the basic block hash value stored in the
cache memory. If the comparison passes, DTPM resumes the PC and
proceeds to the next basic block; otherwise, DTPM keeps the PC
stalled.
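The control flow of this state machine can be approximated by a small software model. The model below is a hedged sketch, not the HDL: the class name `DTPMModel`, its `fetch` method, and the sample addresses are hypothetical, the model is not cycle accurate, and a truncated SHA-256 digest stands in for the tag produced by the AES-GCM module.

```python
import hashlib
from typing import Dict, Optional, Tuple


def digest(data: bytes) -> bytes:
    # Stand-in for the 128-bit AES-GCM tag computed by the hardware module.
    return hashlib.sha256(data).digest()[:16]


class DTPMModel:
    """Toy software model of the DTPM control flow (not cycle accurate)."""

    def __init__(self, blocks: Dict[int, Tuple[int, bytes]]) -> None:
        # blocks maps start address -> (end address, expected hash).
        self.blocks = blocks
        self.buffer = b""
        self.current: Optional[int] = None  # start of the block being hashed
        self.stalled = False                # True once a mismatch halts the PC

    def fetch(self, addr: int, instr: bytes) -> None:
        if self.stalled:
            return                  # PC stays stalled after a failed check
        if addr in self.blocks:
            # A start address (re)starts hashing; a second start before an
            # end address is treated as a wrong branch prediction.
            self.current = addr
            self.buffer = b""
        if self.current is None:
            return                  # outside any monitored basic block
        self.buffer += instr
        end, expected = self.blocks[self.current]
        if addr == end:
            # Last word loaded: compare, then resume or stall.
            self.stalled = digest(self.buffer) != expected
            self.current = None
            self.buffer = b""


# Made-up three-instruction basic block from 0x100 to 0x108.
code = {0x100: b"\x01\x02\x03\x04", 0x104: b"\x05\x06\x07\x08",
        0x108: b"\x09\x0a\x0b\x0c"}
table = {0x100: (0x108, digest(b"".join(code[a] for a in sorted(code))))}

clean = DTPMModel(table)
for a in (0x100, 0x104, 0x108):
    clean.fetch(a, code[a])

tampered = DTPMModel(table)
for a in (0x100, 0x104, 0x108):
    tampered.fetch(a, b"\xff\xff\xff\xff" if a == 0x104 else code[a])

print(clean.stalled, tampered.stalled)
```

In the model, the unmodified block leaves the PC running, while the block with one altered instruction leaves it stalled.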
E. Optimization of GCM module
We use Galois/Counter Mode (GCM) authenticated encryption to
calculate the hash values of basic blocks, with the
fixed-length 128-bit authentication tag serving as the hash
value. The underlying block cipher for encryption and
authentication is the 128-bit Advanced Encryption Standard
(AES). We tried to use a fully pipelined GCM architecture to
improve performance, but this work is not finished yet.
A fully pipelined AES architecture can be applied to GCM. Since
AES-128 has 10 rounds, a fully pipelined AES implementation can
achieve high throughput at the cost of almost 10 times the area
of an iterative implementation. By analyzing the AES and GF
multiplier cores and the data dependencies in the GCM algorithm
at the architecture level, the authors of [2] developed
high-speed hardware architectures for GCM.
We have implemented the pipelined AES core. To use the
high-speed architecture in [2], we also need a parallel GF
multiplier module that completes the 128-bit GF multiplication
in 1 clock cycle. In our implementation, one GF multiplication
takes 8 clock cycles, so we have not realized this high-speed
architecture yet.
Figure 6: DTPM state machine.

IV. SRAS
A. High-level idea
As mentioned before, SRAS is our solution to defend against ROP
attacks. In a ROP attack, the attacker overwrites the return
address of a function on the stack so that it points to a
different address. Upon function return, execution is not
redirected to the original calling function but to another
instruction sequence. SRAS detects a ROP attack by using a
shadow stack to store a copy of the return address when a
function is called, and then checking each return instruction
issued to the processor against that copy [21]. Figure 7
illustrates how SRAS works.
SRAS can also defend against any attack that is based on
corrupting a return address, including conventional stack
smashing and all buffer overflow attacks that overwrite return
addresses.
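The shadow stack check can be modeled in a few lines of Python. This is an illustrative sketch only: the class `ShadowStackModel`, the exception name, and the addresses are hypothetical, and in hardware the shadow copy would live in storage that the program cannot write.

```python
from typing import List


class ReturnAddressMismatch(Exception):
    """Raised when the return address differs from the shadow copy."""


class ShadowStackModel:
    def __init__(self) -> None:
        self.stack: List[int] = []    # program stack (attacker writable)
        self.shadow: List[int] = []   # protected copy of return addresses

    def call(self, return_address: int) -> None:
        # On a call, the return address is stored in both stacks.
        self.stack.append(return_address)
        self.shadow.append(return_address)

    def ret(self) -> int:
        # On a return, the two top-of-stack values must agree.
        address = self.stack.pop()
        if address != self.shadow.pop():
            raise ReturnAddressMismatch(hex(address))
        return address


model = ShadowStackModel()
model.call(0x400)
model.call(0x500)
first_return = model.ret()       # matches the shadow copy

model.stack[-1] = 0xDEAD         # simulated overwrite of a return address
try:
    model.ret()
    attack_detected = False
except ReturnAddressMismatch:
    attack_detected = True

print(hex(first_return), attack_detected)
```

The first return succeeds because both stacks agree, while the overwritten return address is caught at the next return.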

V. RESULTS

A. Valgrind issue
As mentioned before, the idea of checking the hash values of
basic blocks comes from the paper by Arun Kanuparthi et al.
[13], which does not describe how to obtain basic block
information from an executable file. When we implemented the
DTPM, we used the Valgrind tool [14] to extract the basic
blocks from a C program running on Linux on the amd64
architecture. We mistakenly assumed that Valgrind supports all
processor architectures, including OpenRISC. Unaware of this
mistake, we set out to implement the AES-GCM hash function and
the DTPM module, and only noticed the problem when we simulated
the DTPM module. If Valgrind supported OpenRISC, or if we had
had time to write such a tool for the OpenRISC architecture, we
would be able to show how DTPM defends against TOCTOU attacks.
B. Simulation setup
Simulation is done with the help of FuseSoC [16], a program
that manages HDL code and simulates and builds system-on-chip
solutions. FuseSoC stores a library of modules (called cores in
FuseSoC) together with instructions for using them. Most cores
and instructions can be downloaded from GitHub repositories,
and they have already been used and tested by developers in the
OpenRISC community. Hence, when one needs to simulate a core or
implement it on an FPGA, FuseSoC automatically downloads the
HDL code and uses the given instructions to let the FPGA design
software (in our case, Altera Quartus) finish the job. In our
solution, we modify the mor1kx CPU to add the defensive
mechanisms. To do that, we need to download the mor1kx source
code and then create a new core in FuseSoC.
Figure 7: SRAS high-level approach.

While FuseSoC is useful software, it has multiple problems that
hindered us from simulating the system properly. FuseSoC cannot
simulate the OpenRISC system prepared for the Terasic DE0-Nano
board, using either Icarus Verilog or Altera ModelSim. The two
biggest issues are bugs in FuseSoC's source code in corner
cases and the use of 32-bit libraries in Altera ModelSim. It
should be noted that all Altera ModelSim versions from 2014 are
64-bit programs, indicating that they cannot be run on 32-bit
operating systems. However, Altera ModelSim still uses 32-bit
libraries, which causes compatibility problems. We even asked
the author of FuseSoC about the bugs, but he cannot keep up
with the changes in all the tools. Our problem also shows how
difficult it is to maintain free and open-source software.
After several days of effort, we were finally able to simulate
the mor1kx CPU running a simple "Hello, World" program.
However, only the latest version of the mor1kx CPU can be
simulated. This is problematic because the system for the
DE0-Nano requires mor1kx version 3.1, which is an older
version. As a result, we could not make FuseSoC simulate the
whole OpenRISC system, and we can only simulate running
programs using the OpenRISC architectural simulator, or1ksim
[17]. Unfortunately, or1ksim is not a cycle-accurate simulator,
so we cannot use it to run benchmarks.

C. Area, power, and delay overhead
In CSAW ESC 2016, the OpenRISC system is implemented on the
Terasic DE0-Nano board. All the modules are synthesized and
implemented using Altera Quartus. Table 1 shows the resulting
overhead.

Module   Area      Power        Delay
DTPM     13 LUTs   6402.45 mW   1.889 ns
SRAS     N/A       N/A          N/A

Table 1: DTPM synthesis and implementation results.

D. Performance Overhead
Because we want to measure the performance of the OpenRISC
system with and without the defensive mechanisms, we need to
run a benchmark program on the OpenRISC system implemented on
the DE0-Nano board. We decided to use CoreMark [18] as the
benchmark to report the performance overhead of our defensive
modules.
There are two reasons why CoreMark is the appropriate
benchmark. First, it is the only common benchmark used by the
OpenRISC community [19] [20]. This may be because the OpenRISC
architecture is not supported by any architectural simulator.
Also, benchmarking is not crucial for OpenRISC, an open-source
architecture aimed at academic and non-commercial use. Second,
our defensive modules are added mostly to the CPU of the
system, and CoreMark focuses primarily on the CPU's
performance. As a result, CoreMark can reflect the performance
changes caused by modifications to the CPU.
The CoreMark benchmark consists of C programs that perform
read/write operations, integer operations, and control
operations. These programs run commonly used algorithms
including matrix manipulation, linked list manipulation, state
machine operation, and Cyclic Redundancy Check. In general,
CoreMark is used to test a processor's pipeline operation,
memory or cache access, and handling of integer operations.
After running, CoreMark gives a single-number score for easy
comparison between processors.

Configuration                CoreMark Score   Score per MHz
Original Design              86.6961          1.734
With Defense (deactivated)   86.6961          1.734
With Defense (activated)     N/A              N/A

Table 2: Benchmark result with CoreMark.

Table 2 shows the CoreMark scores that we obtained from running
CoreMark on the DE0-Nano board. One good aspect of the DTPM
defensive module is that it does not interfere with the CPU
pipeline when the defense is deactivated, so in this case there
is no performance overhead. The defensive mechanism only
activates when the starting and ending addresses of a basic
block are given to the DTPM module. Unfortunately, given our
difficulty in obtaining the basic block data, we cannot
determine the performance overhead of our defense at the
moment.
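As a quick consistency check on Table 2, the score per MHz is the CoreMark score divided by the processor clock. The short calculation below assumes the 50 MHz clock listed in the experimental system specifications (Table 3); the variable names are ours.

```python
coremark_score = 86.6961   # CoreMark score from Table 2
clock_mhz = 50.0           # processor clock from Table 3 (assumed to apply here)

# Normalizing by the clock frequency allows comparison across clock speeds.
score_per_mhz = coremark_score / clock_mhz
print(round(score_per_mhz, 3))
```

Rounded to three decimal places, the quotient agrees with the value reported in Table 2.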

Condition           Value
Development Board   Terasic DE0-Nano
FPGA                Altera Cyclone IV
Processor Clock     50 MHz
Instruction Cache   32 KB
Data Cache          32 KB
MMU                 Yes
Hardware Multiply   Yes
Hardware Divide     Yes
Floating Point      Single Precision

Table 3: Experimental system specifications.

VI. CONCLUSION

In this report, we have presented our ideas and implementation
of two defensive techniques, DTPM and SRAS. Unfortunately, we
could not complete the defensive mechanisms for the OpenRISC
system in time. Despite the shortcomings of our implementation,
we learned many things during the competition: the free and
open-source OpenRISC system-on-chip, the many tools and
software packages developed for designing and debugging
systems, and the difficulty of implementing solutions for a
non-commercial system.
Finally, we want to thank the CSAW ESC organizers for giving us
an opportunity to have fun and learn more about hardware
security.

REFERENCES
[1] David A. McGrew, John Viega. The Galois/Counter Mode of
Operation (GCM). [Online]. Available:
http://csrc.nist.gov/groups/ST/toolkit/BCM/documents/proposedmodes/gcm/gcm-spec.pdf
[2] Bo Yang, Sambit Mishra, Ramesh Karri. High Speed
Architecture for Galois/Counter Mode of Operation (GCM).
[Online]. Available: https://eprint.iacr.org/2005/146.pdf
[3] Rudolf Usselmann. (2010, Oct. 16). [Online]. Available:
http://opencores.org/project,gcm-aes
[4] J. Wang, G. Shou, Y. Hu, and Z. Guo, "High-Speed
Architectures for GHASH Based on Efficient Bit-parallel
Multipliers," IEEE International Conference on Wireless
Communications, Networking and Information Security (WCNIS),
pp. 582-586, 2010.
[5] C. Paar. "A New Architecture for a Parallel Finite Field
Multiplier with Low Complexity Based on Composite Fields." IEEE
Transactions on Computers, 45(7):856-861, July 1996.
[6] Hennessy, John L., and David A. Patterson. Computer
architecture: a quantitative approach. Elsevier, 2011.
[7] S. Bratus, N. D'Cunha, E. Sparks, and S. W. Smith,
"TOCTOU, Traps, and Trusted Computing," in Proc. of the 1st
International Conference on Trusted Computing and Trust in
Information Technologies: Trusted Computing - Challenges and
Applications, March 2008, pp. 14-32.
[8] H. Shacham, "The geometry of innocent flesh on the bone:
return-into-libc without function calls (on the x86)," in Proc.
of ACM Conf. on Computer and Communications Security, pp.
552-561, 2007.
[9] E. J. Schwartz et al., "Q: Exploit hardening made easy," in
Proc. of USENIX Security Symposium, August 2011.
[10] S. Checkoway et al., "Return-oriented programming without
returns," in Proc. of ACM Conference on Computer and
Communications Security, pp. 559-572, 2010.
[11] OpenRISC official website. [Online]. Available:
http://openrisc.io.
[12] Julius Baxter. Official mor1kx documentation. [Online].
Available:
https://github.com/openrisc/mor1kx/blob/master/doc/mor1kx.asciidoc.
[13] A. K. Kanuparthi, M. Zahran and R. Karri, "Feasibility
study of dynamic Trusted Platform Module," Computer Design
(ICCD), 2010 IEEE International Conference on, Amsterdam, 2010,
pp. 350-355.
[14] Valgrind official website. [Online]. Available:
http://valgrind.org.
[15] Cryptography.io library. [Online]. Available:
https://cryptography.io.
[16] Olof Kindgren. FuseSoC tool. [Online]. Available:
https://github.com/olofk/fusesoc.
[17] Or1k OpenRISC simulator. [Online]. Available:
http://opencores.org/or1k/Or1ksim.
[18] CoreMark official website. [Online]. Available:
http://www.eembc.org/coremark/.
[19] S. Andersson. Benchmarking OpenRISC 1200. [Online].
Available:
http://www.rte.se/blog/blogg-modesty-corex/benchmarking-openrisc1200/2.12.
[20] ORCONF2013 Workshop ORPSoC on DE0 Nano. [Online].
Available:
http://opencores.org/or1k/ORCONF2013_Workshop_ORPSoC_On_DE0_Nano.
[21] Lucas Davi, Ahmad-Reza Sadeghi, Marcel Winandy,
"ROPdefender: a detection tool to defend against
return-oriented programming attacks," Proceedings of the 6th
ACM Symposium on Information, Computer and Communications
Security, March 22-24, 2011, Hong Kong, China.
