
2008 International Conference on Computer Science and Software Engineering

Multithreaded optimizing technique for dynamic binary translator CrossBit

Haibing Guan, Dept. of Computer Science, Shanghai Jiao Tong Univ., Shanghai, China (hbguan@sjtu.edu.cn)
Bo Liu, School of Software, Shanghai Jiao Tong Univ., Shanghai, China (boliu@sjtu.edu.cn)
Tingtao Li, School of Software, Shanghai Jiao Tong Univ., Shanghai, China (ltt@sjtu.edu.cn)
Alei Liang, School of Software, Shanghai Jiao Tong Univ., Shanghai, China (liangalei@sjtu.edu.cn)

Abstract—Dynamic binary translator (DBT) systems enable architecturally incompatible platforms to execute binaries of other architectures transparently. Binary translation and optimization are always deemed the key techniques for constructing high-performance DBT systems. Many previous optimizations, including fragment chaining, trace/superblock formation, and translated-code optimization, have been applied to improve performance. However, seldom has any effort been made to optimize under multi-core processor environments. Therefore, this paper describes the new multiple-thread execution engine (MTEE) of the dynamic binary translator CrossBit, which is a both resourceable and retargetable infrastructure for cross binary translation. The multithreaded technique decomposes the common binary translation working routine into dynamic translation and translated-code execution phases, and multiple threads then enable parallelized translation of host binaries. As a result, the multithreaded optimizing technique accelerates the translation of binaries.

We evaluated our binary translation system across SPEC CPU2000 and found that the system under MTEE has reached a performance level equal to that of the conventional CrossBit engine executed in the IA32 multi-core processor environment. Several SPEC benchmark programs even show that the MTEE speeds up the execution of CrossBit by about 30% at best. And we still look forward to a higher performance improvement through this new software-pipelined method.

Keywords: MTEE; DBT; Dynamic Optimization; CrossBit

This work was supported by the National Nature Science Foundation of China under Grant No. 60773093, the National Grand Fundamental Research 973 Program of China under Grant No. 2007CB316506, and the 863 Research and Development Program of China under Grant No. 2006AA01Z169.

978-0-7695-3336-0/08 $25.00 © 2008 IEEE
DOI 10.1109/CSSE.2008.32

I. INTRODUCTION

Dynamic translation comes as the most popular simulation approach to running existing binaries of established platforms. It eliminates the high overhead of frequent guest-to-host instruction conversion by caching the translated native binaries in a software cache sub-system, usually called the translated cache (TCache). The dynamic translation method changes the binaries of the guest platform into a host representation called the translated code, executes it natively, and thus sets up the virtualized guest platform environment. Systems such as dynamic binary translators (DBT) [2][4][6] are estimated to have better performance than most simulation systems using the interpretation method. DBT is widely used for various purposes, getting involved in fields like program instrumentation, dynamic optimization, security, etc., and continues to expand its application areas.

High performance is deemed the nature of DBT systems. The caching and reusing of translated code ordinarily makes a DBT system ten times faster than conventional interpretation-based emulation systems, or more. However, there are many factors affecting the performance. With different system configurations and execution environments, the processing of the DBT system shows different features, and the key factors affecting the system differ. This results in increasing difficulty in constructing effective DBT systems.

There are still some general techniques conducive to performance improvement. In the dimension of the running DBT application, the translator can be viewed as two parts: the runtime (RT for short) part and the target code (TC for short) part. Here the RT represents the DBT application itself. And the optimization of a DBT system is always destined for two major goals:
• Reduce the RT overhead,
• Speed up the TC execution.

Normally, if a DBT system has the ability to consecutively execute the translated code natively most of the time, rarely with any interruption by the RT, it is considered to have eliminated most of the DBT runtime overhead. The most popular approach to reaching consecutive TC execution is the fragment chaining (or block linking) method. The linking of translated code blocks effectively reduces the control transfers between RT and TC and statistically brings another ten-fold speed-up. It is considered the most efficient method in addition to caching translated code.

Another common DBT implementation of runtime overhead elimination is called trace/superblock formation. Such a trace/superblock could be a control flow on a hot path composed of several translated code blocks. Execution on traces/superblocks is even more efficient than on linked basic blocks, because it removes control-flow joint points and doesn't have to transfer control between blocks.

On the other hand, to speed up the TC execution, DBTs have to make efforts to optimize both the pre-execution (the translating) process and the post-execution optimization
(always the superblock formation, etc.) of translated code blocks. These optimizations intend to improve the quality of the translated code on the fly. This mainly involves the mapping of the guest/host registers, the selection of the register mapping algorithm, etc. In particular, many compiler optimizations have been applied dynamically to speed up the DBT system at post-block-execution time. Related research is called dynamic optimization and is being widely studied. However, it is beyond the scope of this paper.

To sum up, with the above approaches, DBT systems have reached a much better performance level than ever before. However, as processor technology evolves into a new age in which multiple processing units (usually referred to as multi-core technology) empower computer systems to high performance, a multiple-threading software architecture that can fully make use of the processors' power provides the DBT system another chance to be more powerful than before. Accordingly, the best practice is to decompose the tasks of the DBT system into different parts, irrespective of the dependencies. Working threads can then take these parts separately. These threads are considered either as programmable software pipeline stages or as single co-workers for applications broken down into independent subordinate routines. If the overhead of overcoming dependencies is cheaper than serial execution, the DBT system may have the opportunity to be enhanced by proper task decomposition and parallelized execution.

There are two major contributions of this paper. First, it reveals the major stages affecting DBT CrossBit performance, analyzes their relationship, and depicts the relationship between these stages and the performance of CrossBit with some formulas. Second, this paper gives the evaluation results of the new MTEE of CrossBit, which decomposes the common DBT working routine into dynamic translation and native binary execution phases, and then multi-threads them to enable parallelized translation of guest binaries. The evaluation results show MTEE's ability to pre-translate binaries (potentially accelerating the translation) and to eliminate the context-switch overhead for unlinked translated blocks, under the new multi-core computing environment.

The following sections are organized as follows. First, we introduce the resourceable and retargetable binary translator system CrossBit in Section 2. CrossBit is designed to handle multi-guest-to-multi-host platform binary translation for a virtualized execution environment. Some formula-based description of DBT performance is then introduced. Section 3 introduces the multi-threaded execution engine (MTEE) for CrossBit, both the architecture and the implementation. Section 4 shows the experimental results of the MTEE and gives the analysis. Some related works are shown in Section 5. Finally, we come to the conclusion in Section 6.

II. DBT CROSSBIT

A. Architecture

CrossBit is a resourceable and retargetable dynamic binary translation system implemented by Shanghai Jiao Tong University. It is a process virtual machine developed mainly to provide a platform-independent computing service for a new virtualized execution environment. Until recently, it has fully or partially supported guest platforms including SimpleScalar, IA32, and MIPS, and has fully supported the IA32 host platform. Another RISC instruction-set host platform is planned.

For platform adaptability, CrossBit creates the intermediate representation named the intermediate instruction (II) layer, which is used to unify the representation of various source and target instruction sets. It mainly serves the CrossBit designers in quickly adding guest/host platforms. However, unlike the typical machine-adaptive DBT systems, CrossBit doesn't generate the translator automatically. The binary interpreting from the guest to II, and the translating from II to the host, are all hand-written in an object-oriented language. Programmers may take the infrastructural interfaces of the CrossBit translation module and reuse the others together with the overall system mechanism. The obvious advantage of this infrastructure is that it provides CrossBit with direct II quality control, which is much harder to achieve in the above systems. And with the intermediate representation, CrossBit can unify the semantics of guest platforms and port existing optimizations from compilers to the II.

Figure 1 CrossBit Architecture

Figure 1 illustrates the architecture of DBT CrossBit. There are six major components: the bootstrapper, the translation module, the TCache, the memory manager, the syscall handler, and the runtime profiler.

The bootstrapper is responsible for analyzing the guest executable and loading its code and data into the physical memory space called the guest's memory image. The identification of the executables, the mapping of every executable section, the loading of runtime code, and some dynamic linkage library issues are dealt with by the bootstrapper. After the initialization, the engine transfers control to the dispatcher. The dispatcher is an abstract existence which represents the control dispatching of the engine. It first transfers control to the look-up of the
hash table. The look-up operation verifies the existence of a certain translated code block in the TCache, which is the manager of a code cache limited in size. If successful, this operation returns the translated code block's entry address for execution. The dispatcher then performs context-switch operations before and after the block execution is taken. The execution time may vary according to the linking rate (explained later) of blocks. Finally the dispatcher regains control and continues the look-up operation of the next round.

If the block is not cached, the engine calls the translation module to work. This may happen when the required block is not in, or has been evicted from, the TCache. As Figure 1 shows, the II layer separates the translation process into two phases, interpreting and translating, both of which are time-consuming. The interpreting phase performs the binary decoding and the binary transformation to II operations. The translating phase performs the register allocation, the binary encoding, and the binary transformation to host operations. Moreover, to get more effective translated code, some II filters may be utilized for optimization. A DBT system always has a great miss penalty; being an adaptable DBT system, the miss penalty of CrossBit is even higher. After the translated block is generated, the engine executes it as usual.

CrossBit is a process virtual machine system and has to deal with all system calls (syscalls) provided by the underlying operating system. This is handled by the syscall handler module and triggered by the dispatcher. By now CrossBit can translate binaries between different hardware architectures but only deals with the Linux operating system. So the syscall handler can transfer most of the POSIX syscalls directly to the underlying system services, although some special APIs related to processes and threads must be treated specially.

Another module called the profiler gathers all necessary information at run time. The information profiled may be utilized for determining hot paths and thus guiding the generation of superblocks. It also provides CrossBit with sufficient information for other instrumentation work. To advance the optimization and extension features, the profiling module leaves the facility open to adaptive requirements.

B. CrossBit Performance

Not considering the initialization expenditure, we have found by experimentation that there are five major stages affecting the overall time of the CrossBit system:
• Look-up stage. This stage queries whether the translated code block resides in the TCache. If the cache hits, it returns the block entry address.
• Context-switch stage. When the block entry is obtained, either by querying the hash table or by a code block generation process, the dispatcher switches control to the translated code. After the execution, control is switched back.
• Execution stage. It is the native execution of the translated code blocks. The better the quality of the translation algorithm applied, the less time it takes.
• Translation stage. The first phase is interpreting, and the second phase is translating.
• Linking stage. It does fragment chaining for the translated code blocks.

As mentioned above, the DBT CrossBit has an even higher miss penalty than common translators do. This implies that the Translation stage is much slower than the others. Suppose the DBT system only generates basic blocks without trace optimization; the overall DBT running time (Tsys) can be considered as

Tsys = Nblock * Tblock (1)

Nblock is the number of blocks treated. Its value equals the number of look-up operations performed. It is determined by the code inflation rate, the cache size, and the replacement strategy/algorithm of the TCache. A smaller inflation rate and a bigger cache size would certainly benefit the DBT at running time.

Tblock is the per-block time consumption. It can be concluded as the formula below:

Tblock = Texecute +
        Runlink * (Tlookup + Tcontext-switch) +
        Rmiss * (Tinterpret + Ttranslate + Treplace + Tlink) (2)

The Runlink parameter stands for the percentage of the translated blocks left unlinked. It is complementary to the linking rate (Rlink), which is composed of two categories sorted by the blocks' exit type: commonly, the direct control-transfer type and the indirect control-transfer type. The former may be a constant at running time, but the linking rate is not. It is affected by the possibility of indirect control-transfer linking failures. A relatively stable linking rate can be reached, but it is brought down when the TCache system evicts some translated blocks.

The Rmiss parameter represents the miss rate of the TCache system. It is commonly determined by the cache size, the binary inflation rate, and the replacement strategy. The foremost factor is the cache size. Commonly, if the size of the text section of the guest platform (the guest binaries' size) multiplied by the binaries' inflation rate of the DBT system doesn't exceed the size of the TCache, or is not much larger than it, the miss rate is low and consequently leads to good system performance. Otherwise, replacement happens frequently.

Basically, if the TCache is efficient and large enough to hold all translated blocks, the TCache will not miss. The per-block processing time (Tblock) is then:

Tblock = Texecute + Runlink * (Tlookup + Tcontext-switch) (3)

The per-block execution time (Texecute) means the per-block execution time of the native binaries wrapped in the translated blocks. The per-block hashmap look-up time (Tlookup) together with the per-block context-switching time (Tcontext-switch),
multiplied by the unlinked rate (Runlink), comes as the penalty for the blocks-unlinked condition. We call it the unlinked penalty (Punlink):

Punlink = Tlookup + Tcontext-switch (4)

However, without replacement in the TCache, translated blocks may be well linked. As a result, the DBT system will easily reach a steady-state processing status, in which the linking penalty is minor and Texecute dominates the per-block processing time and even the whole system processing time.

When the TCache system misses, there are two kinds of miss penalty (Pmiss): the per-block translating time (Ttranslate) and the per-block linking time (Tlink).

Pmiss = Ttranslate + Tlink (5)

The replacement does not bring about the miss penalty only; it also brings up the unlinked rate, as mentioned above. What's more, the number of blocks treated also rises. In such a condition, the whole system processing time is not certain but determined by Rmiss. The per-block consumption is

Tblock = Texecute + Runlink * Punlink + Rmiss * Pmiss (6)

And the system running time is

Tsys = Nblock * (Texecute + Runlink * Punlink + Rmiss * Pmiss) (7)

Formula 7 shows all the factors. Generally, if Rmiss is low enough, the whole system processing time is still dominated by Texecute. However, when Rmiss grows, Runlink grows too, and the system execution time will no longer be predominated by a single factor; any stage may harm or benefit Tsys. At last, when Rmiss reaches a certain level, the miss penalty becomes the dominator.

Although optimization of Texecute is welcome in most cases, optimization of any of the stages mentioned above is also welcome, because they may all be candidates for the system bottleneck as the execution environment changes.

III. MULTIPLE-THREAD EXECUTION ENGINE

A. Goals and mechanism

The design of the new execution engine aims at reducing the RT overhead. In the last section, we discussed the factors affecting DBT performance. From Formula 6, we come to the conclusion that the RT overhead arises when:

• the cache misses,
• blocks are unlinked.

Reducing any of Runlink, Rmiss, Punlink, or Pmiss will benefit the system. Therefore, the goal of the new engine design is to reduce the Punlink and Pmiss overhead. More accurately, MTEE tries to reduce the Ttranslate and Tcontext-switch overhead.

Unfortunately, there is one thing inevitable for a DBT running on a single processor: the RT and the TC have to share the native registers, which causes the register context-switch overhead. Under certain conditions, the context-switch overhead can still be the foremost factor affecting the DBT system.

One suggestion is that the RT and TC share the target machine registers under the control of some static or dynamic/monitoring mechanism. However, DBT developers handle only the register usage of the translated code, which is determined at the translation phase (mostly done by the register allocation algorithms). But the register context of the RT is determined ahead of time by the compiler (the native compiler which builds the DBT). It is almost impracticable or inefficient to make them co-workers, because the RT code is always built prior to the TC, and it changes by versions if the developer does not make any modification to the local compilers.

Nevertheless, as processor technology evolves into the multi-core age, there is another possibility to decouple the RT and TC executions. The simple concept is to set the RT and TC executions apart by threads. The RT and the TC can run on different threads, and the threads may then be mapped onto different processor cores on the fly. The obvious advantage is that both the RT and the TC have their own register contexts. Consequently, the elimination of the context-switching overhead is achieved.

Besides context-switch elimination, reducing the translation time of the DBT system is also a goal of the MTEE design. As mentioned in the previous section, the long translation phase may greatly prolong the total running time of the DBT system if frequent replacements from the TCache occur. This is not occasional, since there are many memory-limited embedded processors planning to have multiple cores. What's more, some DBT implementations involving hardware TCache support also have to deal with probably frequent TCache replacement. Therefore, MTEE proposes to implement multiple-thread translation, also by making use of the multiple cores' power, to accelerate the translation phase. The simple implementation is that the RT issues several translations at one time. Every translation translates the source code by blocks and commits the results out of order. If the prediction of translation is always correct, the TC can get more code to execute and spend less time waiting for the long translation.

The coming section will introduce the implementation of the above mechanism in MTEE.

B. Architecture
Figure 2 Multi-Threaded Execution Engine of CrossBit

Figure 2 shows the architecture of the MTEE of CrossBit. According to the principle of decoupling the RT and TC executions, the DBT CrossBit is decomposed into two parts, translation and execution, which are taken by the TranslationThread and the ExecutionThread. Here, a new component called the BranchTree is added. It is the controller for dispatching tasks to the TranslationThreads.

The engine starts the bootstrapper and then ramifies into threads. The TranslationThreads perform the overall translation, including translating source binaries, wrapping translated code into blocks, and committing them to the TCache. There are several TranslationThreads working in parallel. The BranchTree is the controller of the threads' work dispatching. Every TranslationThread asks the BranchTree for a Source PC address, from which the next translation starts. At the end of each translation, every TranslationThread commits a translated code block to the TCache. As mentioned above, all the commitments are out of order. The blocks committed can either be the block the ExecutionThread waits for or candidates for later TC executions. The TranslationThread then queries the BranchTree for the next translation. Definitely there is some mechanism inside the BranchTree to make the work of the TranslationThread valuable; this is introduced later.

Meanwhile, the ExecutionThread is initialized and gets to work. It has a much easier job, however: to look up the translated code blocks from the TCache and perform the TC execution circularly.

Figure 3 Data flow of TranslationThread and ExecutionThread

Figure 3 shows the data that flow from the TranslationThread to the ExecutionThread, and also the data flowing backwards. The ExecutionThread should provide the next Source PC value as a notification to the BranchTree. The BranchTree can then adapt to this change, quickly dispatch tasks to the TranslationThreads, and finally fulfill the requirement of the ExecutionThread.

The translation module is expected to have a higher throughput than before. While the normal DBT system fills the TCache with one translated code block at a time, the MTEE fills more, according to the number of TranslationThreads. However, the number of threads should never exceed the number of processor cores; otherwise, performance is handicapped rather than increased.

C. The BranchTree

The BranchTree is the key data structure enabling the parallel translation and the RT & TC decomposition. In MTEE, the BranchTree is a description of the control flow ignoring any loop conditions, and it is organized as a binary tree. The BranchTree has the following features:
• Every tree node represents a basic block and is tagged with the block entry address.
• The root of the BranchTree represents the block last being executed.
• The daughter nodes are the exits (destination addresses of branching) of this block.

Figure 4 The BranchTree

Figure 4 is one example of a BranchTree. Each block has a token to show whether it is processed/processing/available. When there is a request from a TranslationThread, the BranchTree selects the most valuable available node for the next translation. The selection algorithm determines the accuracy of prediction (branch prediction of instructions, accordingly). In the current version, MTEE implements a simple selection algorithm based on the tree structure only. The evaluation results, however, will later show the defect it brings.

And whenever the ExecutionThread finishes a block execution, it updates the global shared Source PC variable. Then whenever a TranslationThread queries for a new task, it may find a mismatch between the BranchTree root and the global shared variable. The TranslationThread may then update the BranchTree data structure according to the change. This mechanism ensures that the TranslationThread never delays in responding to the real requirement from the ExecutionThread. What's more, it limits the synchronization overhead.
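To make the bookkeeping above concrete, the following is a minimal, single-threaded sketch of a BranchTree with per-node tokens, a "tree-structure-only" selection (here assumed to be a breadth-first search for the nearest available node), and the root re-synchronization on a Source PC mismatch. The class and member names, the two-child layout, and the BFS policy are our assumptions for illustration; the paper does not publish CrossBit's actual interfaces, and the real MTEE accesses this structure from multiple threads.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>
#include <memory>

// Token states for a BranchTree node, per Section III-C.
enum class Token { Available, Processing, Processed };

struct Node {
    uint64_t entryPC;                   // guest entry address of the basic block
    Token token = Token::Available;
    std::unique_ptr<Node> taken;        // branch-taken exit (daughter node)
    std::unique_ptr<Node> fallthrough;  // fall-through exit (daughter node)
    explicit Node(uint64_t pc) : entryPC(pc) {}
};

class BranchTree {
public:
    explicit BranchTree(uint64_t rootPC) : root_(std::make_unique<Node>(rootPC)) {}

    // Re-root the tree when the globally shared Source PC no longer matches
    // the root: the ExecutionThread has moved on, so follow the matching
    // child, or rebuild from the new PC if neither child matches.
    void sync(uint64_t executedPC) {
        if (root_->entryPC == executedPC) return;
        if (root_->taken && root_->taken->entryPC == executedPC)
            root_ = std::move(root_->taken);
        else if (root_->fallthrough && root_->fallthrough->entryPC == executedPC)
            root_ = std::move(root_->fallthrough);
        else
            root_ = std::make_unique<Node>(executedPC);
    }

    // "Tree-structure-only" selection: breadth-first, the available node
    // nearest to the last-executed block. Returns nullptr when none is pending.
    Node* pick() {
        std::deque<Node*> q{root_.get()};
        while (!q.empty()) {
            Node* n = q.front(); q.pop_front();
            if (n->token == Token::Available) {
                n->token = Token::Processing;
                return n;
            }
            if (n->taken) q.push_back(n->taken.get());
            if (n->fallthrough) q.push_back(n->fallthrough.get());
        }
        return nullptr;
    }

    Node* root() { return root_.get(); }

private:
    std::unique_ptr<Node> root_;
};
```

In this sketch, a TranslationThread would call `pick()`, translate the returned block, mark it `Processed`, attach the block's two exit addresses as daughter nodes, and commit the block to the TCache; the ExecutionThread's Source PC updates drive `sync()`.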
D. Lock-free R/W operations

A fact worth noticing is that the synchronization overhead between the TranslationThread and the ExecutionThread could be high. There is no question that any access to shared modules (the TCache, the BranchTree, etc.) should be mutually exclusive. However, this problem is mostly caused by the frequent TCache accesses from the ExecutionThread. It is reasonable to lock any operation on the TCache; in this condition, however, locking could be quite expensive. By evaluation, we found that any additional code added to the look-up operations may result in a multiplication of the actual running time.

TABLE 1. OPERATIONS ON TCACHE

TCache operator      Operations     Operation type
TranslationThread    Commit block   Write
ExecutionThread      Lookup block   Read

From Table 1, the TCache operations may be of two kinds: write (TranslationThreads commit blocks) and read (the ExecutionThread looks up blocks). Whenever the ExecutionThread's look-up fails, it may sleep until a new block arrives. Then it may be woken up to perform the look-up operation again. The look-up is a kind of read operation on the TCache, and it is estimated that a failed read may be harmless even when performed without locks.

Suppose a TranslationThread tries to commit a block into the TCache which the ExecutionThread currently requires, while the look-up is performed directly without locking: the look-up may fail, and the ExecutionThread may then have to sleep. But after a short while, another commitment from a TranslationThread wakes up the ExecutionThread. The latter performs another look-up and finally gets the block it required before. As long as the TranslationThreads do not stop, the failure of a look-up can never harm the MTEE. Consequently, it is a better solution to drop such locks between reads and writes to the TCache. However, mutual exclusion must be maintained between writes to avoid mistakes.

The same mechanism is also applied to the "next Source PC" variable to relieve the synchronization overhead. The synchronized writes to the TCache remain necessary, though, even if the commitments of blocks are out of order.

E. Context-switch elimination

When ramified into threads, the context switch is no longer necessary for the DBT system. The TC may have its own register context, for there are more physical register groups to use. On some lightweight-process (LWP) operating system implementations, the RT and TC can be properly mapped to different physical cores with hardly any in-process interference. However, the MTEE implementation does not completely give the TC its own register context. As Figure 2 shows, the ExecutionThread still involves some shared context, including shared variables such as the BranchTree, the TCache, etc. When optimization options are selected for the native compilers, shared variables may also be compiled and optimized into registers. This may lead to unpredictable runtime behaviors of the ExecutionThread.

However, with a clear definition of the threads' categories, the shared variables inside the ExecutionThread are under control. In fact, all the shared variables may be protected by a compiler-supported keyword (like "volatile" in C++) which makes sure they are not optimized into registers. And the protected variables will not occupy the registers required by the TC. Compared with the register context in the conventional CrossBit engine, the shared variables are fewer in quantity, so the DBT programmer can protect them all easily. Moreover, DBT version changes do not affect the correctness either.

IV. EVALUATIONS

The preliminary version of the MTEE of DBT CrossBit has been completed. The experiments were taken on an Intel Core 2 Duo host platform with the Fedora Core 6 Linux OS. The experimental guest platform of CrossBit is SimpleScalar. The benchmark programs are selections from SPEC CPU2000. Both direct block linking and indirect block linking are disabled in the following tests, and the translation is limited to one TranslationThread.

The first evaluation is the performance comparison between the CrossBit MTEE and the CrossBit conventional engine. Figure 5 shows that MTEE leads in running the programs mcf and gzip, by about 30% at best. However, the MTEE is much slower than the conventional engine in running the program gcc. This is because the MTEE miss rate is much higher in running gcc than the other programs, and it brings up the overhead of useless block generation. Here useless blocks are the blocks generated by the TranslationThread but not required by the ExecutionThread.

Figure 5 Performance Evaluation of MTEE

Figure 6 shows how the cache size affects MTEE performance. The statistical data is gathered when running the SPEC program mcf. Generally, when the cache size is smaller, the chance of replacement rises, and the miss rate of MTEE grows higher. But as of the recent release, the MTEE has a much greater impact on the cache than the conventional engine. This is because the BranchTree selection algorithm is still based on the tree structure only. The branch prediction is not accurate enough and leads to increasingly useless block generation as the cache replacements happen.

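The cache-size sensitivity observed here is what the cost model of Section II-B predicts: as Rmiss rises, the Rmiss * Pmiss term overtakes Texecute as the dominant component of Tsys. The following tiny sketch evaluates Formula (7) numerically; all timing figures are made up for illustration only (they are not measured from CrossBit) and merely show how the dominant term shifts between a large and a small TCache.

```cpp
#include <cassert>

// Formula (7):  Tsys = Nblock * (Texecute + Runlink*Punlink + Rmiss*Pmiss)
// Parameters are illustrative per-block times in arbitrary units.
struct CostModel {
    double Texecute;  // per-block native execution time
    double Punlink;   // unlinked penalty, Formula (4): Tlookup + Tcontext_switch
    double Pmiss;     // miss penalty, Formula (5): Ttranslate + Tlink
    double tsys(double Nblock, double Runlink, double Rmiss) const {
        return Nblock * (Texecute + Runlink * Punlink + Rmiss * Pmiss);
    }
};
```

With, say, `CostModel m{1.0, 0.5, 50.0}`, a low miss rate (`Rmiss = 0.001`) leaves Tsys dominated by Texecute, while a thrashing TCache (`Rmiss = 0.2`, which also raises Runlink) makes the miss-penalty term roughly ten times Texecute, matching the behavior seen in the mcf cache-size experiment.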
Figure 6 Cache size and performance relationship of MTEE

Figure 7 illustrates the useless block generation of the MTEE when running the program mcf. Useless blocks are generated every time the TCache misses. Although block generation consumes only the TranslationThread's time, the operating system gives no guarantee that the TranslationThread always runs without exhausting its timeslice. Since the translation overhead is still high, useless block generation can be harmful to the whole system's execution time. In the figure, the black curve, which represents the percentage of effective blocks in the cache (those actually used by the ExecutionThread), never drops; however, the absolute number of useless blocks rises as the TCache size becomes smaller. Consequently, translation pollutes the cache, as Figure 6 shows.

V. RELATED WORK
There are several existing machine-adaptable DBT systems, such as the UQDBT [8] dynamic translator, which is based on the well-known static UQBT [7] framework. It uses specifications to describe the guest and host architectures at various levels of abstraction, and completes binary translation dynamically by passing through several intermediate representations. However, the low performance and large translation overhead of dynamic translation on machine-adaptable DBT systems raised concerns, so the bintrans [9] project adopts specifications but omits the intermediate representation to accelerate the translation process. A later system, Walkabout [10], studies dynamic translation for machine-adaptable DBTs further; besides machine adaptability, it also constructs binary rewriting tools for various purposes. The Strata [11] project provides a binary translation framework that partly frees developers from the full work of constructing a DBT system; Strata fully implements the basic optimizations, and its evaluation shows the effectiveness of these approaches [16].
To make DBTs practical to use, much research has been performed, especially on dynamic optimizers. One of the most famous dynamic optimization systems is Dynamo [12]. Aside from approaches such as fragment chaining, hot trace identification, and trace/superblock formation, Dynamo applies many compiler optimizations dynamically. Its advantage is its transparent operation, which makes Dynamo easy to bundle with the operating system. Dynamo's IA-32 derivative DynamoRIO [13] even reaches better performance than native execution. Yet the literature mentions that when implementing Dynamo on the HP PA-8000, the designers preferred interpretation to translation and block caching precisely to eliminate the overhead of a large number of register context switches; we therefore presume that the performance gain from eliminating context-switch overhead will be even more obvious on register-rich architectures. Finally, Mojo [14], developed by Microsoft Research, is a dynamic optimizer for the IA-32/Windows platform, and gives a good example of handling multithreaded applications.
Figure 7 MTEE useless block generation

With context-switch elimination, the MTEE shows some superiority over the conventional CrossBit execution engine. However, multithreaded translation was not enabled, because the experiment platform has only two physical cores. Moreover, the branch prediction mechanism of the MTEE is still crude and results in too much useless block generation. Given the threading support in current operating system implementations, we consider the BranchTree selection algorithm the key to better performance, and we expect the MTEE to become more powerful after further optimization.

VI. CONCLUSION AND FUTURE WORK
Conventional binary translation optimization methods put most of their effort into reorganizing or reducing the size of the translated binaries, yet binary translation remains sensitive to changes in execution conditions. The MTEE provides another way to optimize DBT systems, one that employs the power of multiple processors. On one hand, it aims at providing the translated code with its own register context; on the other hand, it aims at building software pipeline stages for the DBT system. However, software threads do not behave as hardware components do: they rely heavily on the operating system implementation, and conventional operating systems make a greater effort to enable concurrency than true parallel execution, giving the multithreaded program developer less control over multi-core power. A DBT built upon multi-core processors still has many barriers to overcome.
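The two software pipeline stages described above can be sketched as a single-producer, single-consumer pair, with the TranslationThread filling a block queue that the ExecutionThread drains. The following C++ sketch is purely illustrative (the Block type, queue, and function names are hypothetical, not CrossBit's actual implementation):

```cpp
#include <condition_variable>
#include <cstdint>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Placeholder for a translated basic block.
struct Block { uint64_t guest_pc; /* translated host code would live here */ };

// Bounded hand-off between the translation and execution stages.
class Pipeline {
    std::queue<Block> ready_;
    std::mutex m_;
    std::condition_variable cv_;
    bool done_ = false;
public:
    void push(Block b) {
        { std::lock_guard<std::mutex> g(m_); ready_.push(b); }
        cv_.notify_one();
    }
    void finish() {
        { std::lock_guard<std::mutex> g(m_); done_ = true; }
        cv_.notify_all();
    }
    // Returns false once translation is finished and the queue is drained.
    bool pop(Block& out) {
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [&] { return !ready_.empty() || done_; });
        if (ready_.empty()) return false;
        out = ready_.front(); ready_.pop();
        return true;
    }
};

// Drive the pipeline over a list of guest PCs; returns the PCs in the
// order the "ExecutionThread" consumed them.
std::vector<uint64_t> run_pipeline(const std::vector<uint64_t>& pcs) {
    Pipeline p;
    std::thread translator([&] {
        for (uint64_t pc : pcs) p.push(Block{pc});  // translation stage
        p.finish();
    });
    std::vector<uint64_t> executed;
    Block b;
    while (p.pop(b)) executed.push_back(b.guest_pc);  // execution stage
    translator.join();
    return executed;
}
```

Because the queue is FIFO with one producer and one consumer, blocks are executed in translation order; a real engine would instead index the cache by guest PC and fall back to on-demand translation on a miss.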
Further work on the MTEE involves a better branch prediction mechanism and its implementation, as well as optimization of the thread programming. As processor core counts grow, we expect to run experiments on these new platforms and finally power DBTs with parallelism.
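As an illustration of one possible direction for the BranchTree selection algorithm (hypothetical, not CrossBit's actual design), a predictor could rank branch targets by observed taken counts and let the TranslationThread translate the hottest not-yet-translated targets first:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

// Illustrative branch-target predictor: records how often each guest
// branch target is taken, and proposes the hottest targets for
// speculative translation by the TranslationThread.
class BranchPredictor {
    std::map<uint64_t, uint64_t> taken_;  // guest PC -> taken count
public:
    void record(uint64_t target_pc) { ++taken_[target_pc]; }

    // Return up to 'n' targets, hottest first.
    std::vector<uint64_t> hottest(std::size_t n) const {
        std::vector<std::pair<uint64_t, uint64_t>> v(taken_.begin(), taken_.end());
        std::sort(v.begin(), v.end(),
                  [](const std::pair<uint64_t, uint64_t>& a,
                     const std::pair<uint64_t, uint64_t>& b) {
                      return a.second > b.second;
                  });
        std::vector<uint64_t> out;
        for (std::size_t i = 0; i < v.size() && i < n; ++i)
            out.push_back(v[i].first);
        return out;
    }
};
```

Feeding such a ranking to the TranslationThread would bias its idle time toward blocks the ExecutionThread is most likely to need, reducing the useless block generation observed in Figure 7.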
ACKNOWLEDGEMENT
This work was supported by the National Natural Science
Foundation of China under Grant No. 60773093, the National
Grand Fundamental Research 973 Program of China under
Grant No. 2007CB316506, and the 863 Research and Development
Program of China under Grant No. 2006AA01Z169.

REFERENCES
[1] J.E. Smith, and R. Nair, Virtual Machines: Versatile Platforms for
Systems and Process, Morgan Kaufman, 2005.
[2] Y. Bao, “Building Process Virtual Machine via Dynamic Binary
Translation”, Master thesis, School of Software, Shanghai Jiao Tong
University, January 2007.
[3] A. Chernoff, M. Herdeg, R. Hookway, C. Reeve, N. Rubin, T. Tye, S.
B. Yadavalli, and J. Yates, “FX!32: A profile-directed binary
translator”, IEEE Micro, April 1998.
[4] B. Leonid; T. Devor, O. Etzion, S. Goldenberg, A. Skaletsky, Y. Wang
and Y. Zemach. “IA-32 Execution Layer: a two-phase dynamic
translator designed to support IA-32 applications on Itanium®-based
systems”, MICRO-36, 2003.
[5] J. Dehnert, B. K. Grant, J. P. Banning, R. Johnson, T. Kistler, A.
Klaiber, and J. Mattson, “The Transmeta Code Morphing Software:
Using speculation, recovery, and adaptive retranslation to address
real-life challenges”, CGO-2003, Pages 15-24, March 2003.
[6] C. Zheng and C. Thompson, “PA-RISC to IA-64: transparent execution,
no recompilation”, IEEE Computer, 33(3): 47-52, March 2000.
[7] C. Cifuentes and M. Van Emmerik, “UQBT: Adaptable binary
translation at low cost.” Computer, 33(3):60-66, March 2000.
[8] D. Ung, C. Cifuentes, “Machine-adaptable dynamic binary translation”,
ACM SIGPLAN Workshop on Dynamic and Adaptive Compilation and
Optimization, Pages: 30-40, January 2000.
[9] M. Probst, “Fast Machine-Adaptable Dynamic Binary Translation”,
Proceedings of the Workshop on Binary Translation, 2001.
[10] C. Cifuentes, B. Lewis, and D. Ung, “Walkabout —A retargetable
dynamic binary translation framework”, Workshop on Binary
Translation, January 2002.
[11] K. Scott, and J. Davidson, “Strata: A software dynamic translation
infrastructure”, Proceedings of the IEEE 2001 Workshop on Binary
Translation, 2001.
[12] V. Bala, E. Duesterwald, and S. Banerjia, “Dynamo: A transparent
dynamic optimization system”, Proceedings of the ACM SIGPLAN ’00
Conference on Programming Language Design and Implementation,
June 2000.
[13] D. Bruening, T. Garnett, and S. Amarasinghe, “An infrastructure for
adaptive dynamic optimization”, CGO-2003, Pages: 265-275, March
2003.
[14] W. K. Chen, S. Lerner, R. Chaiken, and D.M. Gillies, “Mojo: A
dynamic optimization system”, Proceedings of 3rd ACM Workshop on
Feedback-Directed and Dynamic Optimization, 2000.
[15] U. Hölzle, L. Bak, S. Grarup, R. Griesemer, and S. Mitrovic, “Java on
steroids: Sun's high-performance Java implementation”, Proceedings of
HotChip IX, 1997.
[16] K. Scott, and J. Davidson, “Strata: A software dynamic translation
infrastructure”, Proceedings of the IEEE 2001 Workshop on Binary
Translation, 2001.
[17] K. Scott, N. Kumar, B.R. Childers, J.W. Davidson, M.L. Soffa,
“Overhead Reduction Techniques for Software Dynamic Translation”,
Proceedings of 18th International Parallel and Distributed Processing
Symposium, 2004, April 2004.

