
Virtual Machine Showdown: Stack Versus Registers

Yunhe Shi, David Gregg, Andrew Beatty
Department of Computer Science
University of Dublin, Trinity College
Dublin 2, Ireland
{yshi, David.Gregg, Andrew.Beatty}@cs.tcd.ie

M. Anton Ertl
Institut für Computersprachen
TU Wien
Argentinierstraße 8
A-1040 Wien, Austria
anton@complang.tuwien.ac.at

ABSTRACT

Virtual machines (VMs) are commonly used to distribute programs in an architecture-neutral format, which can easily be interpreted or compiled. A long-running question in the design of VMs is whether stack architecture or register architecture can be implemented more efficiently with an interpreter. We extend existing work on comparing virtual stack and virtual register architectures in two ways. Firstly, our translation from stack to register code is much more sophisticated. The result is that we eliminate an average of more than 47% of executed VM instructions, with the register machine bytecode size only 25% larger than that of the corresponding stack bytecode. Secondly, we present an implementation of a register machine in a fully standard-compliant implementation of the Java VM. We find that, on the Pentium 4, the register architecture requires an average of 32.3% less time to execute standard benchmarks if dispatch is performed using a C switch statement. Even if more efficient threaded dispatch is available (which requires labels as first-class values), the reduction in running time is still approximately 26.5% for the register architecture.

Categories and Subject Descriptors
D.3 [Software]: Programming Language; D.3.4 [Programming Language]: Processor—Interpreter

General Terms
Performance, Language

Keywords
Interpreter, Virtual Machine, Register Architecture, Stack Architecture

1. MOTIVATION
Virtual machines (VMs) are commonly used to distribute programs in an architecture-neutral format, which can easily be interpreted or compiled. The most popular VMs, such as the Java VM, use a virtual stack architecture, rather than the register architecture that dominates in real processors. A long-running question in the design of VMs is whether stack architecture or register architecture can be implemented more efficiently with an interpreter. On the one hand, stack architectures allow smaller VM code, so less code must be fetched per VM instruction executed. On the other hand, stack machines require more VM instructions for a given computation, each of which requires an expensive (usually unpredictable) indirect branch for VM instruction dispatch. Several authors have discussed the issue [12, 15, 11, 16] and presented small examples where each architecture performs better, but no general conclusions can be drawn without a larger study.
The first large-scale quantitative results on this question were presented by Davis et al. [5, 10], who translated Java VM stack code to a corresponding register machine code. A straightforward translation strategy was used, with simple compiler optimizations to eliminate instructions which become unnecessary in register format. The resulting register code required around 35% fewer executed VM instructions to perform the same computation than the stack architecture. However, the resulting register VM code was around 45% larger than the original stack code and resulted in a similar increase in bytecodes fetched. Given the high cost of unpredictable indirect branches, these results strongly suggest that register VMs can be implemented more efficiently than stack VMs using an interpreter. However, Davis et al.'s work did not include an implementation of the virtual register architecture, so no real running times could be presented.
This paper extends the work of Davis et al. in two respects. First, our translation from stack to register code is much more sophisticated. We use a more aggressive copy propagation approach to eliminate almost all of the stack load and store VM instructions. We also optimize constant load instructions, to eliminate common constant loads and move constant loads out of loops. The result is that an average of more than 47% of executed VM instructions are eliminated. The resulting register VM code is roughly 25% larger than the original stack code, compared with 45% for Davis et al. We find that the increased cost of fetching more VM code involves only 1.07 extra real machine loads per VM instruction eliminated. Given that VM dispatches are much more expensive than real machine loads, this indicates strongly that register VM code is likely to be much

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
VEE'05, June 11-12, 2005, Chicago, Illinois, USA.
Copyright 2005 ACM 1-59593-047-7/05/0006...$5.00.

more time-efficient when implemented with an interpreter, although at the cost of increased VM code size.
The second contribution of our work is an implementation of a register machine in a fully standard-compliant implementation of the Java VM. While implementing the register VM interpreter is simple, integrating it with the garbage collection, exception handling and threading systems is more complicated. We present experimental results on the behaviour of the stack and register versions of JVMs, including hardware performance counter results. We find that on the Pentium 4, the register architecture requires an average of 32.3% less time to execute standard benchmarks if dispatch is performed using a C switch statement. Even if more efficient threaded dispatch is available (which requires labels as first-class values), the reduction in running time is still about 26.5% for the register architecture.
The rest of this paper is organised as follows. In section 2 we describe the main differences between virtual stack and virtual register machines from the point of view of the interpreter. In section 3, we show how stack-based Java bytecode is translated into register-based bytecode. In sections 4 and 5, our copy propagation and constant instruction optimization algorithms are presented. Finally, in section 6, we analyze the static and dynamic code behaviour before and after optimization, and we show the performance improvement in our register-based JVM when compared to the original stack-based JVM.

2. STACK VERSUS REGISTERS
The cost of executing a VM instruction in an interpreter consists of three components:

• Dispatching the instruction

• Accessing the operands

• Performing the computation

Instruction dispatch involves fetching the next VM instruction from memory, and jumping to the corresponding segment of interpreter code that implements the VM instruction. A given task can often be expressed using fewer register machine instructions than stack ones. For example, the local variable assignment a = b + c might be translated to stack JVM code as ILOAD c, ILOAD b, IADD, ISTORE a. In a virtual register machine, the same code would be a single instruction IADD a, b, c. Thus, virtual register machines have the potential to significantly reduce the number of instruction dispatches.
In C, dispatch is typically implemented with a large switch statement, with one case for each opcode in the VM instruction set. Switch dispatch is simple to implement, but rather inefficient. Most compilers produce a range check, and an additional unconditional branch, in the generated code for the switch. In addition, the indirect branch generated by most compilers is highly (around 95% [7]) unpredictable on current architectures.
The main alternative to the switch statement is threaded dispatch. Threaded dispatch takes advantage of languages with labels as first-class values (such as GNU C and assembly language) to optimize the dispatch process. This allows the range check and additional unconditional branch to be eliminated, and allows the code to be restructured to improve the predictability of the dispatch indirect branch (to around 45% [7]).
More sophisticated approaches, such as Piumarta and Ricardi's [14] approach of copying executable code just-in-time, further reduce dispatch costs, at a further cost in simplicity, portability and memory consumption. Context threading [2] uses subroutine threading to change indirect branches into calls and returns, which better exploits the hardware's return-address stack, to reduce the cost of dispatches. As the cost of dispatches falls, any benefit from using a register VM instead of a stack VM falls. However, switch and simple threaded dispatch are the most commonly used interpreter techniques, and switch is the only efficient alternative if ANSI C must be used.
The second cost component of executing a VM instruction is accessing the operands. The location of the operands must appear explicitly in register code, whereas in stack code operands are found relative to the stack pointer. Thus, the average register instruction is longer than the corresponding stack instruction; register code is larger than stack code; and register code requires more memory fetches to execute. Small code size and a small number of memory fetches are the main reasons why stack architectures are so popular for VMs.
The final cost component of executing a VM instruction is performing the computation. Given that most VM instructions perform a simple computation, such as an add or load, this is usually the smallest part of the cost. The basic computation has to be performed regardless of the format of the intermediate representation. However, eliminating invariant and common expressions is much easier on a register machine, which we exploit to eliminate repeated loads of identical constants (see section 5).

3. TRANSLATING STACK TO REGISTER
In this section we describe a system for translating JVM stack code to register code just-in-time. However, it is important to note that we do not advocate run-time translation from stack to register format as the best or only way to use virtual register machines. This is clearly a possibility, maybe even an attractive one, but our main intention in doing this work is to evaluate free-standing virtual register machines. Run-time translation is simply a mechanism we use to compare stack and register versions of the JVM easily. In a real system, we would use only the register machine, and compile for that directly.
Our implementation of the JVM pushes a new Java frame onto a run-time stack for each method call. The Java frame contains local variables, frame data, and the operand stack for the method (see Figure 1). In the stack-based JVM, a local variable is accessed using an index, and the operand stack is accessed via the stack pointer. In the register-based JVM, both the local variables and the operand stack can be considered as virtual registers for the method. There is a simple mapping from stack locations to register numbers, because the height and contents of the JVM operand stack are known at any point in a program [9].
All values on the operand stack in a Java frame can be considered as temporary variables (registers) for a method, and are therefore short-lived. Their lifetime extends from the instruction that pushes them onto the operand stack to the instruction that consumes the value. On the other hand, local variables (also registers) are long-lived; their lifetime is the duration of the method execution.
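The dispatch saving in the a = b + c example of section 2 can be made concrete with a toy simulation. This is our illustration, not code from any of the JVM implementations discussed here; each loop iteration models one instruction dispatch, i.e. the (usually unpredictable) indirect branch:

```python
# Toy stack VM and register VM for the statement a = b + c.
# One loop iteration = one VM instruction dispatch.

def run_stack(code, var):
    stack, dispatches = [], 0
    for op, *args in code:
        dispatches += 1
        if op == "iload":               # push a local variable
            stack.append(var[args[0]])
        elif op == "iadd":              # operands are implicit (on the stack)
            stack.append(stack.pop() + stack.pop())
        elif op == "istore":            # pop into a local variable
            var[args[0]] = stack.pop()
    return dispatches

def run_register(code, var):
    dispatches = 0
    for op, dest, src1, src2 in code:
        dispatches += 1
        if op == "iadd":                # operands are explicit in the instruction
            var[dest] = var[src1] + var[src2]
    return dispatches

var = {"a": 0, "b": 2, "c": 3}
stack_dispatches = run_stack(
    [("iload", "c"), ("iload", "b"), ("iadd",), ("istore", "a")], var)
reg_dispatches = run_register([("iadd", "a", "b", "c")], var)
print(stack_dispatches, reg_dispatches, var["a"])  # 4 1 5
```

Four dispatches collapse to one; the register instruction pays instead with a longer encoding and more operand fetches, which is exactly the trade-off quantified in section 6.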

Table 1: Bytecode translation. Assumption: the current stack pointer before the code shown below is 10. In most cases, the first operand in an instruction is the destination register.

Stack-based bytecode    Register-based bytecode
iload_1                 move r10, r1
iload_2                 move r11, r2
iadd                    iadd r10, r10, r11
istore_3                move r3, r10
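The translation in Table 1 can be sketched as follows. This is our simplified illustration, not the actual translator: locals keep their indices as register numbers, and stack slot i maps to register num_locals + i (here 10, matching Table 1's starting stack pointer), which is well-defined because the stack height is known at every program point:

```python
# Sketch of stack-to-register translation with a simulated stack pointer.

def translate(stack_code, num_locals):
    out, sp = [], 0                       # sp = simulated operand stack height
    reg = lambda slot: num_locals + slot  # stack slot -> register number
    for op, *args in stack_code:
        if op == "iload":                 # local -> stack becomes a move
            out.append(("move", reg(sp), args[0]))
            sp += 1
        elif op == "istore":              # stack -> local becomes a move
            sp -= 1
            out.append(("move", args[0], reg(sp)))
        elif op == "iadd":                # implicit operands made explicit
            sp -= 1
            out.append(("iadd", reg(sp - 1), reg(sp - 1), reg(sp)))
        elif op == "pop":                 # stack pops need no register code
            sp -= 1
            out.append(("nop",))
    return out

# Table 1's example, with the operand stack starting at register 10.
code = translate([("iload", 1), ("iload", 2), ("iadd",), ("istore", 3)], 10)
print(code)
```

The output reproduces the register column of Table 1: two moves, a three-operand iadd, and a final move.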

Figure 1: The structure of a Java frame

In the stack-based JVM, most operands of an instruction are implicit; they are found on the top of the operand stack. Most of the stack-based JVM instructions are translated into corresponding register-based virtual machine instructions, with implicit operands translated to explicit operand registers. The new register-based instructions use one byte for the opcode and one byte for each operand register (similar to the stack JVM).
Table 1 shows a simple example of bytecode translation. The function of the bytecode is to add two integers from two local variables and store the result back into another local variable.
There are a few exceptions to the above one-to-one translation rule:

• Operand stack pop instructions (pop and pop2) are translated into nop because they are not needed in register-based code.

• Instructions that load a local variable onto the operand stack, or store data from the operand stack into a local variable, are translated into move instructions.

• Stack manipulation instructions (e.g. dup, dup2, ...) are translated into appropriate sequences of move instructions by tracking the state of the operand stack.

3.1 Parameter Passing
A common way to implement a stack-based JVM is to overlap the current Java frame's operand stack (which contains a method call's parameters) with the new Java frame's local variables. The parameters on the stack in the current Java frame become the start of the called method's local variables. Although this provides efficient parameter passing, it prevents us from copy propagating into the source registers (parameters) of a method call. To solve this problem, we change the parameter passing mechanism in the register VM to non-overlapping, and copy all the parameters to the location where the new Java frame will start. The benefit is that we can eliminate more move instructions. The drawback is that we need to copy all the parameters before we push a new Java frame onto the Java stack.

3.2 Variable Length Instructions
Most of the instructions in Java bytecode are fixed-length. There are three variable-length instructions in the stack-based JVM instruction set (multianewarray, tableswitch, and lookupswitch). In addition to these original three, all method call instructions become variable in length after the translation to register-based bytecode format. Here is the instruction format for a method call:

op cpi1 cpi2 ret_reg arg1 arg2 ...

op is the opcode of a method call. cpi1 and cpi2 are the two-byte constant-pool indexes. ret_reg is the return value register number. arg1, arg2, ... are the argument register numbers. The number of arguments, which can be determined when the method call instruction is executed in the interpreter loop, is not part of the instruction format. The main reason for doing so is to reduce the code size.

4. COPY PROPAGATION
In the stack-based JVM, operands are pushed from local variables onto the operand stack before they can be used, and results must be stored from the stack to local variables. More than 40% of executed instructions in common Java benchmarks consist of loads and stores between local variables and the stack [5]. Most of these stack push and pop operations are redundant in our register-based JVM, as instructions can directly use local variables (registers) without going through the stack. In the translation stage, all loads and stores from/to local variables are translated into register move instructions. In order to remove these redundant move instructions, we apply both forward and backward copy propagation.
We take advantage of the stack-based JVM's stack operation semantics to help implement both varieties of copy propagation. During copy propagation, we use the stack pointer information after each instruction, which tells us which values on the stack are still alive.

4.1 Forward Copy Propagation
The convention in Java is that the operand stack is usually empty at the end of each source statement, so the lifetimes of values on the stack are usually short. Values pushed onto the operand stack are almost immediately consumed by a following instruction. Thus, we mainly focus our copy propagation optimization on basic blocks.
We separate move instructions into different categories and apply different types of copy propagation depending on the location of the source and destination operands in the original JVM stack code. We apply forward propagation to the following categories of move instructions:

• Local variables → stack

• Local variables → local variables (these do not exist in the original translated code but will appear after forward or backward copy propagation)

• Stack → stack (these originate from the translation of dup instructions)

The main benefit of forward copy propagation is to collapse dependencies on move operations. In most cases, this allows the move to be eliminated as dead code.
While doing forward copy propagation, we try to copy forward and identify whether a move instruction can be removed (see Table 2). X is a move instruction which is being copied forward and Y is a following instruction in the same basic block. X.dest is the destination register of the move instruction and X.src is its source register. In a following instruction Y, Y.src represents all the source registers and Y.dest is the destination register.

Table 2: Forward copy propagation algorithm. X is a move instruction being copy propagated and Y is a following instruction in the same basic block. src and dest are the source and destination registers of these instructions.

  Y.src:  if X.src = Y.src, do nothing;
          if X.dest = Y.src, replace Y.src with X.src.
  Y.dest: if X.src = Y.dest, X.src is redefined after Y: can't remove X / stop;
          if X.dest = Y.dest, X.dest is redefined after Y: can remove X / stop.

The forward copy propagation algorithm is implemented with additional operand stack pointer information after each instruction, to help decide whether a register is alive or redefined. The following outlines our algorithm for copy propagation:

• Y.dest = X.dest: X.dest is redefined; stop copy propagation and remove instruction X.

• After instruction Y, the stack pointer is below X.dest, where X.dest is a register on the stack: X.dest can be considered to be redefined; stop copy propagation and remove instruction X.

• If Y is a return instruction: stop copy propagation and remove instruction X.

• If Y is an athrow and X.dest is on the operand stack: stop copy propagation and remove instruction X, because the operand stack will be cleared during exception handling.

• Y.dest = X.src: X.src is redefined and the value in X.dest would still be used after instruction Y. Stop copy propagation and do not remove instruction X. However, we can continue to check whether X.dest is unused in the following instructions and then redefined; if so, remove instruction X.

• After instruction Y, the stack pointer is below X.src, where X.src is a register on the stack: X.src can be considered as being redefined; stop copy propagation and do not remove instruction X. We ignore this rule during the second run of forward copy propagation; otherwise it is quite similar to the rule above.

Several techniques are used to improve the ability of our algorithm to eliminate move instructions:

• All dup instructions (such as dup, dup2_x2) are translated into one or more move instructions, which allows them to be eliminated using our algorithm.

• All iinc instructions are moved as far towards the end of a basic block as possible, because iinc instructions are commonly used to increment an index into an array. The push of the index onto the stack and the iinc instruction used to increase the index are usually next to each other and thus prevent forward copy propagation.

• In a few special cases, forward copy propagation across basic block boundaries is used to eliminate more move instructions. If a move instruction's forward copy propagation reaches the end of a basic block and its destination operand is on the stack, we can follow its successor basic blocks to find all the usages of the operand, and then trace back from the operand consumption instruction to the definition instruction. If we find no other instructions except the one instruction being copy propagated forward, then we can continue the cross-basic-block copy propagation.

4.2 Backward Copy Propagation
Backward copy propagation is used to backward copy and eliminate the following type of move instruction:

• Stack → local variables

Most stack JVM instructions put their result on the stack, and a store instruction then stores the result into a local variable. The role of backward copy propagation is to store the result directly into the local variable without going through the operand stack. In reality, we can't copy forward this type of move instruction, because after the instruction the source register is above the top of the stack pointer. Due to the characteristics of this type of move instruction, many criteria required by backward copy propagation are already satisfied. Suppose Z is a move instruction being considered for backward copy propagation, and Y is a previous instruction in the same basic block which has Y.dest = Z.src. Whether we can do the backward copy propagation and remove instruction Z depends on the following criteria:

1. Y.dest is a register

2. Z is a move instruction

3. Z.dest is a register

4. Z.src = Y.dest

5. Z.dest is not consumed between Y..Z

6. Z.dest is not redefined between Y..Z

7. Y.dest is not alive out of the basic block, which is satisfied because Y.dest = Z.src and Z.src is above the top of the stack pointer after Z

8. After the copy propagation, the original Y.dest (Z.src) is not used anymore. This is satisfied as long as criteria 5 and 6 are satisfied, because Y.dest (Z.src) is above the top of the stack pointer after instruction Z.
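The within-block core of forward copy propagation can be sketched as follows. This is our simplified illustration of the decision logic in Table 2: the stack-pointer liveness rules, the return and athrow cases, and cross-block propagation are omitted, and instructions are modelled as (opcode, dest, sources...) tuples:

```python
# Sketch of forward copy propagation for one move instruction X inside a
# basic block: reads of X.dest are replaced by X.src until either register
# is redefined (the two "stop" cases of Table 2).

def forward_propagate(block, i):
    """block[i] must be ('move', dest, src); returns (new_block, removed)."""
    _, dest, src = block[i]
    out = list(block)
    for j in range(i + 1, len(out)):
        op, d, *srcs = out[j]
        # X.dest = Y.src case: replace the read of X.dest with X.src
        out[j] = (op, d, *[src if s == dest else s for s in srcs])
        if d == dest:        # X.dest = Y.dest: X.dest redefined, remove X
            del out[i]
            return out, True
        if d == src:         # X.src = Y.dest: X.src redefined, keep X, stop
            return out, False
    return out, False

block = [("move", "r10", "r1"),            # X: r10 <- r1
         ("iadd", "r10", "r10", "r11")]    # reads, then redefines, r10
new_block, removed = forward_propagate(block, 0)
print(new_block, removed)
```

On this input the read of r10 is rewritten to r1 and the move is removed as dead, leaving a single iadd.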

Another way to think of backward copy propagation in our case is that some computation puts its result on the operand stack, and then a move instruction stores the result from the stack into a local variable in the stack-based Java virtual machine. In a register-based Java virtual machine, we can shortcut these steps and save the result directly into the local variable.
A simple version of across-basic-block backward copy propagation is also used. If a backward copy instruction reaches the beginning of a basic block, we need to find out whether we can backward copy into all its predecessors. If so, we backward copy into all its predecessors.

4.3 Example
The following example demonstrates both forward and backward copy propagation. We assume that the first operand register in each instruction is the destination register.

1. move r10, r1 //iload_1
2. move r11, r2 //iload_2
3. iadd r10, r10, r11 //iadd
4. move r3, r10 //istore_3

Instructions 1 and 2 move the values of registers r1 and r2 (local variables) to registers r10 and r11 (stack) respectively. Instruction 3 adds the values in registers r10 and r11 (stack) and puts the result back into register r10 (stack). Instruction 4 moves register r10 (stack) into register r3 (local variable). This is typical of stack-based Java virtual machine code. We can apply forward copy propagation to instructions 1 and 2, whose sources are copy propagated into instruction 3's sources. We can apply backward copy propagation to instruction 4, whose destination replaces instruction 3's destination. After both copy propagations, instructions 1, 2, and 4 can be removed. The only remaining instruction is:

3. iadd r3, r1, r2

5. CONSTANT INSTRUCTIONS
In the stack-based Java virtual machine, there are a large number of constant instructions pushing immediate constant values, or constant values from the constant pool of a class, onto the operand stack. For example, we have found that an average of more than 6% of executed instructions in the SPECjvm98 and Java Grande benchmarks push constants onto the stack. In many cases the same constant is pushed onto the stack in every iteration of a loop. Unfortunately, it is difficult to reuse constants in a stack VM, because VM instructions which take values from the stack also destroy those values. Virtual register machines have no such problem. Once a value is loaded into a register, it can be used repeatedly until the end of the method. To remove redundant loads of constant values, we apply the following optimizations.

5.1 Combine Constant Instruction and iinc Instruction
In the stack-based JVM, the iinc instruction can only be used to increase a local variable by an immediate value. However, in the register machine we make no distinction between stack and local variables, so we can use the iinc instruction with all registers. This allows us to combine sequences of instructions which add a small integer to a value on the stack.
We scan the translated register-based instructions to find all those iadd and isub instructions which have one of their operands pushed by a constant instruction with a byte constant value (due to the byte immediate value in the iinc instruction). Then we use a single iinc instruction to replace the iadd (or isub) instruction and the constant instruction.

5.2 Move Constant Instructions out of Loops and Eliminate Duplicate Constant Instructions
Because the storage locations of most constant instructions are on the stack, they are temporary variables, quickly consumed by a following instruction. The only way that we can reuse a constant value is to allocate a dedicated register for it above the operand stack. We only optimize those constant instructions that store constant values into stack locations, where those constant values are consumed in the same basic block. Constant instructions that store values into local variables, which have wider scope, are not targeted by our optimization. (A constant instruction which stores directly into a local variable can appear after backward copy propagation.) The following steps are carried out to optimize constant instructions:

• Scan all basic blocks in a method to find (1) multiple constant VM instructions which push the same constant and (2) constant VM instructions that are inside a loop. All constant values pushed by these VM instructions onto the operand stack must be consumed by a following instruction in the same basic block for our optimization to be applied.

• A dedicated virtual register is allocated for each constant value used in the method¹. The constant VM instruction's destination virtual register is updated to the new dedicated virtual register, as are the VM instruction(s) that consume the constant.

• All load constant VM instructions are moved to the beginning of the basic block. All load constant VM instructions inside a loop are moved to a loop preheader.

• The immediate dominator tree is used to eliminate redundant initializations of dedicated constant registers.

The above procedure produces two benefits. First, redundant loads of the same constant are eliminated. If there is more than one constant instruction that tries to initialize the same dedicated constant register in the same basic block, or in two basic blocks where one dominates the other, the duplicate dedicated constant register initialization instructions can be removed. The other benefit is that it allows us to move constant instructions out of loops.

¹Given that we use one-byte indices to specify the virtual register, a method can use up to 256 virtual registers. Thus, our current implementation does not attempt to minimize register usage, because we have far more registers than we need. A simple register allocator could greatly reduce register requirements.

Figure 2: The control flow of the medium-size example

5.3 Putting it all together
The runtime process for translating stack-based bytecode and optimizing the resulting register-based instructions for a Java method is as follows:

• Find basic blocks, their predecessors and successors.

• Translate stack-based bytecode into an intermediate register-based bytecode representation.

• Find loops and build the dominator matrix.

• Apply forward copy propagation.

• Apply backward copy propagation.

• Combine constant instructions and iadd/isub instructions into iinc instructions.

• Move iinc instructions as far down their basic block as possible.

• Eliminate redundant constant load operations and move constant load operations out of loops.

• Apply forward copy propagation again².

• Write the optimized register code into virtual register bytecode in memory.

In order to better demonstrate the effect of the optimizations, we present the following more complicated example with 4 basic blocks and one loop (see Figure 2). The number operands without r are either constant-pool indexes, immediate values, or branch offsets (absolute basic block numbers are used here instead, to clearly indicate which basic block is the target of a jump).

The translated intermediate code, with all operands explicitly specified, before optimizations:

basic block(0):
 1. iconst_0 r17
 2. move r1, r17
 3. move r17, r0
 4. getfield_quick 3, 0, r17, r17
 5. move r2, r17
 6. move r17, r0
 7. agetfield_quick 2, 0, r17, r17
 8. move r3, r17
 9. move r17, r0
10. getfield_quick 4, 0, r17, r17
11. move r4, r17
12. iconst_0 r17
13. move r5, r17
14. goto 0, 2 //jump to basic block 2

basic block(1):
15. bipush r17, 31
16. move r18, r1
17. imul r17, r17, r18
18. move r18, r3
19. move r19, r2
20. iinc r2, r2, r1
21. caload r18, r18, r19
22. iadd r17, r17, r18
23. move r1, r17
24. iinc r5, r5, 1

basic block(2):
25. move r17, r5
26. move r18, r4
27. if_icmplt 0, 1, r17, r18 //jump to basic block 1

basic block(3):
28. move r17, r1
29. ireturn r17

The intermediate code after optimizations:

basic block(0):
15. bipush r20, 31 //constant moved out of loop
 1. iconst_0 r1
 4. getfield_quick 3, 0, r2, r0
 7. agetfield_quick 2, 0, r3, r0
10. getfield_quick 4, 0, r4, r0
12. iconst_0 r5
14. goto 0, 2

basic block(1):
17. imul r17, r20, r1
21. caload r18, r3, r2
22. iadd r1, r17, r18
24. iinc r5, r5, r1
20. iinc r2, r2, r1

basic block(2):
27. if_icmplt 0, 1, r5, r4

basic block(3):
29. ireturn r1

All the move instructions have been eliminated by the optimizations. Constant instruction 15 has been assigned a new dedicated register, number 20, to store the constant value, and has been moved out of the loop to its preheader, which is then merged with its predecessor because it has only one predecessor. Instruction 20 has been moved down inside its basic block to provide more opportunities for copy propagation.

²We have found that we can eliminate a small percentage of move instructions by applying the forward copy propagation algorithm a second time. dup instructions generally shuffle the stack operands around the stack and redefine the values in those registers. This stops the copy propagation. After the first forward copy propagation and the backward copy propagation, new opportunities for forward copy propagation are created.
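The hoisting applied to instruction 15 above can be sketched as follows. This is our simplified illustration: loop detection, the consumed-in-the-same-block check, and the dominator-based de-duplication are omitted, and use renaming stays within a single basic block:

```python
# Sketch of section 5.2's hoisting: constant pushes inside a loop get a
# dedicated register and move to the loop preheader (as instruction 15
# above gets r20), and uses inside the block are renamed.

def hoist_constants(blocks, loop_blocks, preheader, next_reg):
    """blocks: {name: [instruction tuples]}; rewritten in place."""
    dedicated = {}                      # constant value -> dedicated register
    for name in loop_blocks:
        rename, kept = {}, []
        for ins in blocks[name]:
            if ins[0] == "bipush":      # constant push inside the loop
                _, dest, value = ins
                if value not in dedicated:
                    dedicated[value] = "r%d" % next_reg
                    next_reg += 1
                    blocks[preheader].append(("bipush", dedicated[value], value))
                rename[dest] = dedicated[value]
            else:                       # rewrite uses of the old register
                op, d, *srcs = ins
                kept.append((op, d, *[rename.get(s, s) for s in srcs]))
        blocks[name] = kept
    return blocks

blocks = {"pre": [], "loop": [("bipush", "r17", 31),
                              ("imul", "r17", "r17", "r1")]}
hoist_constants(blocks, ["loop"], "pre", 20)
print(blocks["pre"], blocks["loop"])
```

The push of 31 moves to the preheader with dedicated register r20, and the imul inside the loop reads r20 directly, mirroring instructions 15 and 17 in the optimized listing.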

[Bar charts over the benchmarks Compress, Jess, Db, Javac, Mpegaudio, Mtrt, Jack, MolDyn, RayTracer, Euler, MonteCarlo, Search, and their average.]

Figure 3: Breakdown of statically appearing VM instructions before and after optimization for all the benchmarks.

Figure 4: Breakdown of dynamically appearing VM instructions before and after optimization for all the benchmarks.

6. EXPERIMENTAL EVALUATION

Stack-based JVM bytecode is very compact because the location of operands is implicitly on the operand stack. Register-based bytecode needs to have all those implicit operands as part of an instruction, which means that register-based code will usually be much larger than stack-based bytecode. Larger code size means that more instruction bytecode must be fetched from memory as part of VM instruction execution, slowing down the register machine. On the other hand, virtual register machines can express computations using fewer VM instructions than a corresponding stack machine. Dispatching a VM instruction is expensive, so the reduction in executed VM instructions is likely to significantly improve the speed of the virtual register machine. An important question is whether the increase in VM instruction fetches or the decrease in dispatches from using a virtual register machine has a greater effect on execution time.

In this section we describe the experimental evaluation of two interpreter-based JVMs. The first is a conventional stack-based JVM (Sun's J2ME CDC 1.0 foundation profile), and the second is a modified version of this JVM which translates stack code into register code just-in-time, and implements an interpreter for a virtual register machine.

We use the SPECjvm98 client benchmarks [1] (size 100 inputs) and Java Grande [3] (Section 3, data set size A) to benchmark both implementations of the JVM. Methods are translated to register code the first time they are executed; thus all measurements in the following analysis include only methods that are executed at least once. The measurements include both benchmark program code and Java library code executed by the VMs.

6.1 Static Instruction Analysis after Optimization

Generally, there is a one-to-one correspondence between stack VM instructions and the translated register VM instructions. However, there are a couple of exceptions. First, the JVM includes some very complicated instructions for duplicating data on the stack, which we translate into a sequence of move VM instructions. The result of this transformation is that the number of static instructions in the translated code is about 0.35% larger than in the stack code. Secondly, some JVM stack manipulation instructions (such as pop and pop2) can simply be translated to nop instructions, and can be eliminated directly.

Figure 3 shows the breakdown of statically appearing VM instructions after translation and optimization. On average we can simply eliminate the 2.84% of instructions that become nops (translated from pop and pop2), because they manipulate the stack but perform no computation. Applying copy propagation allows a further 33.67% of statically appearing instructions to be eliminated. Our copy propagation algorithm is so successful that the remaining move instructions account for only an average of 0.78% of the original instructions; almost all moves are eliminated. Constant optimization allows a further average of 6.95% of statically appearing VM instructions to be eliminated. The remaining load-constant VM instructions account for an average of 10.89% of the original VM instructions. However, these figures are for statically appearing code, so moving a constant load out of a loop to a loop preheader does not result in any static reduction. Overall, 43.47% of static VM instructions are eliminated.

6.2 Dynamic Instruction Analysis after Optimization

In order to study the dynamic (runtime) behaviour of our register-based JVM code, we counted the number of executed VM instructions without any optimization as the starting point of our analysis. However, the stack VM instructions that translate to nop instructions have already been eliminated at this point and are not included in the analysis.

Figure 4 shows the breakdown of dynamically executed VM instructions before and after optimization. Interestingly, move instructions account for a much greater percentage of executed VM instructions than of static ones. This allows our copy propagation to eliminate move VM instructions accounting for 43.78% of dynamically executed VM instructions. The remaining moves account for only 0.53% of the original VM instructions. Applying constant optimizations allows a further 3.33% of the original VM instructions to be eliminated. The remaining dynamically executed constant VM instructions account for 2.98%. However, there are far more static constant instructions (17.84%) than dynamically run ones (6.26%) in the benchmarks. We discovered

Figure 5: Increase in code size and resulting net increase in bytecode loads from using a register rather than a stack architecture.

Figure 6: Increase in dynamically loaded bytecode instructions per VM instruction dispatch eliminated by using a register rather than a stack architecture.

Figure 7: Dynamic number of real machine loads and stores required to access virtual registers in our virtual register machine, as a percentage of the corresponding loads and stores to access the stack and local variables in a virtual stack machine.

Figure 8: The reduction of real machine memory accesses for each register-based bytecode instruction eliminated.
that there are a large number of constant instructions in the initialization bytecode which are usually executed only once. On average, our optimizations remove 47.21% of the dynamically executed original VM instructions.

6.3 Code Size

Generally speaking, the code size of register-based VM instructions is larger than that of the corresponding stack VM instructions. Figure 5 shows the percentage increase in code size of our register machine code compared to the original stack code. On average, the register code is 25.05% larger than the original stack code, despite the fact that the register machine requires 43% fewer static instructions than the stack architecture. This is a significant increase in code size, but it is far lower than the 45% increase reported by Davis et al. [5].

As a result of the increased code size of the register-based JVM, more VM instruction bytecodes must be fetched from memory as the program is interpreted. Figure 5 also shows the resulting increase in bytecode loads. Interestingly, the increase in overall code size is often very different from the increase in instruction bytecode loaded in the parts of the program that are executed most frequently. Nonetheless, the average increase in loads is similar to the average increase in code size, at 26.03%.

The performance advantage of using a register rather than a stack VM is that fewer VM instructions are needed. On the other hand, this comes at the cost of increased bytecode loads due to larger code. To measure the relative importance of these two factors, we compared the number of extra dynamic bytecode loads required by the register machine per dynamically executed VM instruction eliminated. Figure 6 shows that this number is small, at an average of 1.07. On most architectures one load costs much less to execute than an instruction dispatch, with its difficult-to-predict indirect branch. This strongly suggests that register machines can be interpreted more efficiently on most modern architectures.

6.4 Dynamic Local Memory Access

Apart from real machine loads of instruction bytecodes, the main source of real machine loads in a JVM interpreter comes from moving data between the local variables and the stack. In most interpreter-based JVM implementations,

the stack and the local variables are represented as arrays in memory. Thus, moving a value from a local variable to the stack (or vice versa) involves both a real machine load to read the value from one array, and a real machine store to write the value to the other array. As a result, even a simple operation such as adding two numbers can involve a large number of real machine loads and stores to implement the shuffling of values between the stack and the local variables.

In our register machine, the virtual registers are also represented as an array. However, VM instructions can access their operands in the virtual register array directly, without first moving the values to an operand stack array. Thus, the virtual register machine can actually require fewer real machine loads and stores to perform the same computation. Figure 7 shows (as a simulated measure) the number of dynamic real machine loads and stores required for accessing the virtual register array, as a percentage of the corresponding loads and stores for the stack JVM to access the local variable and operand stack arrays. The virtual register machine requires only 67.8% as many real machine loads and 55.07% as many real machine stores, with an overall figure of 62.58%.

In order to compare these numbers with the number of additional loads required for fetching instruction bytecodes, we expressed these memory operations as a ratio to the dynamically executed VM instructions eliminated by using the virtual register machine. Figure 8 shows that on average, the register VM requires 1.53 fewer real machine memory operations to access such variables per VM instruction eliminated. This is actually larger than the number of additional loads required due to the larger size of the virtual register code.

However, these measures of memory accesses for the local variables, the operand stack and the virtual registers depend entirely on the assumption that they are implemented as arrays in memory. In practice, we have little choice but to use an array for the virtual registers, because there is no way to index real machine registers like an array on most real architectures. However, stack caching [6] can be used to keep the topmost stack values in registers, and to eliminate large numbers of the associated real machine loads and stores. For example, Ertl [6] found that around 50% of stack-access real machine memory operations could be eliminated by keeping just the topmost stack item in a register. Thus, in many implementations, the virtual register architecture is likely to need more real machine loads and stores to access these kinds of values.

6.5 Timing Results

To measure the real running times of the stack- and register-based implementations of the JVM, we ran both VMs on Pentium 3 and Pentium 4 systems. The stack-based JVM simply interprets standard JVM bytecode. The running time for the register-based JVM includes the time necessary to translate and optimize each method the first time it is executed. However, our translation routines are fast, and consume less than 1% of the execution time, so we believe the comparison is fair. In our performance benchmarking, we run SPECjvm98 with a heap size of 70MB and Java Grande with a heap size of 160MB. Each benchmark is run independently.

We compare the performance of four different interpreter implementations: (1) a stack-based JVM interpreter using switch dispatch (see section 2), (2) a stack-based JVM interpreter using threaded dispatch, (3) a virtual register-based JVM using switch dispatch, and (4) a virtual register-based JVM using threaded dispatch. For fairness, we always compare implementations which use the same dispatch mechanism.

Figure 9 shows the percentage reduction in running time of our implementation of the virtual register machine compared to the virtual stack machine, for variants of both interpreters using both switch and threaded dispatch. Switch dispatch is more expensive, so the reduction in running time is slightly larger for the switch versions (30.69%) than for the threaded versions of the interpreters (29.36%). Nonetheless, a reduction in running time of around 30% for both variants of the interpreters is a very significant improvement. There are few interpreter optimizations that give a 30% reduction in running time.

Figure 9: Register-based virtual machine reduction in running time (based on average real running time of five runs): switch and threaded (Pentium 3).

Figure 10 shows the same figures for a Pentium 4 machine. The Pentium 4 has a very deep pipeline (20 stages), so the cost of branch mispredictions is very much higher than on the Pentium 3. The result is that switch dispatch is very slow on the Pentium 4, due to the large number of indirect branch mispredictions it causes. On average, the switch-dispatch register machine requires 32.28% less execution time than the switch-dispatch stack machine. The corresponding figure for the threaded-dispatch JVMs is only 26.53%.

Figure 10: Register-based virtual machine performance improvement in terms of performance-counter running time: switch and threaded (Pentium 4: performance counter).

To explore the reasons for the relative performance more deeply, we use the Pentium 4's hardware performance counters to measure various processor events during the execution of the programs. Figures 11 and 12 show performance counter results for the SPECjvm98 benchmark compress and the Java Grande benchmark moldyn. We measure the number of cycles of execution time, the number of retired Pentium 4 instructions, the numbers of retired Pentium 4 load and store micro-operations, the number of level 1 data cache misses, the number of indirect branches retired, and the number of retired indirect branches mispredicted.

Figure 11: Compress (Pentium 4 performance counter results).

Figure 12: Moldyn (Pentium 4 performance counter results).

Figures 11 and 12 show that threaded dispatch is much more efficient than switch dispatch. The interpreters that use switch dispatch require far more cycles and executed instructions, and their indirect branch misprediction rate is significantly higher.

When we compare the stack and register versions of our JVM, we see that the register JVM is much more efficient in several respects. It requires significantly fewer executed instructions than the stack-based JVM. More significantly, for the compress benchmark, it requires less than half the number of indirect branches. Given the large rate of indirect branch misprediction, and the high cost of indirect branches, it is not surprising that the virtual register implementation of the JVM is faster.

Figures 11 and 12 also show that the uop load count for the threaded VMs is much higher than for the switch VMs. The most likely reason is that it is more difficult for the compiler to optimize register allocation for a threaded interpreter loop than for a switch-based one. It is easy for the compiler to recognize a switch-based interpreter loop and the different segments of the switch statement. The threaded interpreter loop, on the other hand, consists of many code segments with labels, and the execution of the bytecode jumps around between those labels. It is therefore much easier for the compiler to optimize register allocation in a switch-based interpreter loop than in a threaded one.

6.6 Discussion

Our implementation of the register-based VM translates the stack-based bytecode into register-based bytecode at runtime. We do not propose to do this in a real-life implementation; the purpose of our implementation is to evaluate the register-based VM. Our register-based JVM implementation is derived from the stack-based JVM implementation. Except for the necessary adaptation of the interpreter loop, garbage collection and exception handling to the new instruction format, there is very little change to the original code segments that interpret bytecode instructions. The objective is to provide a fair comparison between the stack-based and the register-based JVM.

Another technique to eliminate redundant stack-load and -store move instructions would be to use register coalescing. However, that technique is less efficient and more complex than our simple copy propagation algorithm, because it involves repeatedly performing data flow analysis and building an interference graph. Moreover, our copy propagation is so effective that less than 2% of move instructions remain in the static code, whereas the results reported in [13] show only about 96% of move instructions removed by the most aggressive register coalescing, and 86% removed in [8].

Superinstructions are another technique to reduce the number of indirect branches and to eliminate the intermediate storage of results on the stack. In most cases, the performance improvements are quite modest [4]. Our preliminary study estimates that around 512 extra superinstructions must be added to the interpreter to achieve the same static instruction reduction presented in this paper.

The arithmetic instructions in our register-based JVM use a three-address format. Another obvious way to reduce code size would be to use a two-address format for these instructions. We chose the three-address format in order to improve the chances of our copy propagation. Moreover, static arithmetic instructions make up, on average, only 6.28% of all instructions in the SPECjvm98 client benchmarks, and most individual arithmetic instructions each account for less than 1% statically. The contribution of a two-address arithmetic instruction format to code size reduction would therefore be very small.

After copy propagation, most of the stack slots are no longer used. One improvement we could make is to perform dataflow analysis and try to compact the virtual register usage so that the Java frame becomes smaller. This would probably have a small impact on memory usage and performance.

Given a computation task, a register-based VM inherently needs far fewer instructions than a stack-based VM. In our case, our register-based JVM implementation reduces the static number of bytecode instructions by 43.47% and the dynamic number of executed bytecode instructions by 47.21% when compared to the stack-based JVM. The reduction in executed bytecode instructions leads to fewer real machine instructions for the benchmarks and a significantly smaller number of indirect branches, which are very costly when mispredicted. On the other hand, the larger code size (25.05% larger) could result in more level-1 data cache misses and load/store operations for the processor. In terms of running time, the benchmark results show that our register-based JVM gives an average improvement of 30.69% (switch) and 29.36% (threaded) on the Pentium 3, and 32.28% (switch) and 26.53% (threaded) on the Pentium 4. This is a very strong indication that the register architecture is superior to the stack architecture for implementing interpreter-based virtual machines.

7. CONCLUSIONS

A long-standing question has been whether virtual stack or virtual register VMs can be executed more efficiently using an interpreter. Virtual register machines can be an attractive alternative to stack architectures because they allow the number of executed VM instructions to be substantially reduced. In this paper we have built on the previous work of Davis et al. [5], which counted the number of instructions for the two architectures using a simple translation scheme. We have presented a much more sophisticated translation and optimization scheme for translating stack VM code to register code, which we believe gives a more accurate measure of the potential of virtual register architectures. We have also presented results for a real implementation in a fully-featured, standard-compliant JVM.

We found that a register architecture requires an average of 47% fewer executed VM instructions, and that the resulting register code is 25% larger than the corresponding stack code. The increased cost of fetching more VM code due to the larger code size involves only 1.07 extra real machine loads per VM instruction eliminated. On a Pentium 4 machine, the register machine required 32.3% less time to execute standard benchmarks if dispatch is performed using a C switch statement. Even if more efficient threaded dispatch is available (which requires labels as first-class values), the reduction in running time is still around 26.5% for the register architecture.

8. REFERENCES

[1] SPEC releases SPEC JVM98, first industry-standard benchmark for measuring Java virtual machine performance. Press release, http://www.specbench.org/osg/jvm98/press.html, August 19, 1998.
[2] M. Berndl, B. Vitale, M. Zaleski, and A. D. Brown. Context threading: A flexible and efficient dispatch technique for virtual machine interpreters. In 2005 International Symposium on Code Generation and Optimization, March 2005.
[3] M. Bull, L. Smith, M. Westhead, D. Henty, and R. Davey. Benchmarking Java Grande applications. In Second International Conference and Exhibition on the Practical Application of Java, Manchester, UK, April 2000.
[4] K. Casey, D. Gregg, M. A. Ertl, and A. Nisbet. Towards superinstructions for Java interpreters. In Proceedings of the 7th International Workshop on Software and Compilers for Embedded Systems (SCOPES 03), pages 329–343, September 2003.
[5] B. Davis, A. Beatty, K. Casey, D. Gregg, and J. Waldron. The case for virtual register machines. In Interpreters, Virtual Machines and Emulators (IVME '03), pages 41–49, 2003.
[6] M. A. Ertl. Stack caching for interpreters. In SIGPLAN '95 Conference on Programming Language Design and Implementation, pages 315–327, 1995.
[7] M. A. Ertl and D. Gregg. The behaviour of efficient virtual machine interpreters on modern architectures. In Euro-Par 2001, pages 403–412. Springer LNCS 2150, 2001.
[8] L. George and A. W. Appel. Iterated register coalescing. Technical Report TR-498-95, Princeton University, Computer Science Department, Aug. 1995.
[9] J. Gosling. Java intermediate bytecodes. In Proc. ACM SIGPLAN Workshop on Intermediate Representations, volume 30:3 of ACM SIGPLAN Notices, pages 111–118, San Francisco, CA, Jan. 1995.
[10] D. Gregg, A. Beatty, K. Casey, B. Davis, and A. Nisbet. The case for virtual register machines. Science of Computer Programming, Special Issue on Interpreters, Virtual Machines and Emulators, 2005. To appear.
[11] B. McGlashan and A. Bower. The interpreter is dead (slow). Isn't it? In OOPSLA '99 Workshop: Simplicity, Performance and Portability in Virtual Machine Design, 1999.
[12] G. J. Myers. The case against stack-oriented instruction sets. Computer Architecture News, 6(3):7–10, August 1977.
[13] J. Park and S.-M. Moon. Optimistic register coalescing, Mar. 30, 1999.
[14] I. Piumarta and F. Riccardi. Optimizing direct threaded code by selective inlining. In SIGPLAN '98 Conference on Programming Language Design and Implementation, pages 291–300, 1998.
[15] P. Schulthess and E. Mumprecht. Reply to the case against stack-oriented instruction sets. Computer Architecture News, 6(5):24–27, December 1977.
[16] P. Winterbottom and R. Pike. The design of the Inferno virtual machine. In IEEE Compcon 97 Proceedings, pages 241–244, San Jose, California, 1997.
