
IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 14, NO. 11, NOVEMBER 2006

A Scalable Synthesis Methodology for Application-Specific Processors


Fei Sun, Srivaths Ravi, Senior Member, IEEE, Anand Raghunathan, Senior Member, IEEE, and Niraj K. Jha, Fellow, IEEE

Abstract: Custom processors based on application-specific or domain-specific instruction sets are gaining popularity, and are often used to implement critical architectural blocks in complex systems-on-chip. While several advances have been made in the area of custom processor architectures, tools, and design methodologies, designers are still required to manually perform some critical tasks, such as selection of the custom instructions best suited to the given application and design constraints. We present a scalable methodology for the synthesis of a custom processor from an embedded software program. A key feature of the proposed methodology is its scalability, which is achieved by exploiting the structured, hierarchical nature of large software programs. We motivate the need for such a methodology, and describe the algorithms used for the critical steps, including hardware resource budgeting, local optimizations, and global exploration. Our methodology utilizes the concept of soft instruction templates, which can be adapted by adding operations to them or deleting operations from them at any time during the design space exploration process, allowing for global design decisions to be interleaved with fine-grained optimizations. To the best of our knowledge, this is the first work that uses the program hierarchy to derive soft instruction templates to synthesize application-specific processors for scalable applications. We have integrated our methodology in an open-source compiler, and verified it using a commercial extensible processor. Experiments with several benchmarks indicate that our methodology can effectively tackle large programs. It results in the synthesis of high-quality custom processors that demonstrate an average speedup of 2.82× and a maximum speedup of 6.07×. As a side-effect, the processor energy is also reduced. The average and maximum reductions in the energy-delay product for the benchmarks are 7.64× and 18.85×, respectively. The CPU times required for custom processor synthesis are quite small, indicating that the proposed techniques can be applied to embedded software programs of significant complexity.

Index Terms: Application-specific instruction set processors (ASIPs), custom processors, extensible processors.

I. INTRODUCTION

IMPROVEMENTS in semiconductor fabrication technologies promise to make it feasible to replace logic gates or hardware macroblocks with microprocessors as building blocks

Manuscript received December 12, 2003; revised February 23, 2006. This work was supported in part by the New Jersey Commission on Science and Technology Center for Embedded System-on-a-Chip Design and by the National Science Foundation under Grant CCR-0310477. F. Sun is with Tensilica Inc., Santa Clara, CA 95054 USA (e-mail: fsun@tensilica.com). S. Ravi and A. Raghunathan are with NEC Laboratories America Inc., Princeton, NJ 08540 USA (e-mail: sravi@nec-labs.com; anand@nec-labs.com). N. K. Jha is with the Department of Electrical Engineering, Princeton University, Princeton, NJ 08544 USA (e-mail: jha@princeton.edu). Digital Object Identifier 10.1109/TVLSI.2006.886410

for integrated circuit (IC) design. Such programmable solutions provide the ability to achieve short product cycles and cope with changing application functionality (e.g., in areas with evolving standards). Furthermore, they allow a design to be reused across different product variants or versions, addressing the challenge of burgeoning nonrecurring engineering (design and mask) costs. However, the rapid expansion of the market for embedded systems with tight constraints on cost, performance, size, and power consumption (e.g., digital cameras, cell phones, PDAs, and HDTVs, to name a few) implies that the need to customize the architecture to the application or application domain will continue to be a primary driving requirement in system-on-chip (SoC) design. The application-specific instruction set processor (ASIP) can provide a good tradeoff between flexibility and efficiency, by tailoring the instruction set and microarchitecture to one specific application, or to a set of applications from the same domain. The recent emergence of configurable and extensible processors [1], [2] has generated significant interest, and several complex SoCs featuring such processors (often several on the same IC) are already in production.

Enabling the vision of the application-specific processor as the ubiquitous building block of future SoCs requires several innovations in architecture and design methodologies. These innovations should enable designers to create highly optimized instances of custom processors, which achieve high levels of processing efficiency in very short turnaround times (hardware-like efficiency from software solutions). Despite significant advances in the supporting methodologies and automation tools for application-specific processor design (e.g., retargetable software tool chains), designers are still required to manually design custom instructions and the hardware to speed up the application, which is a slow, tedious, and error-prone process. Application-specific processor synthesis (or custom processor synthesis) attempts to address this problem by developing tools that automatically analyze the application, generate and evaluate various choices of the custom processor architecture, and select the one that best meets the designer's constraints (performance, area, power).

Custom processor synthesis is quite challenging due to the complex inter-dependencies and tradeoffs involved in making the design decisions, which can be difficult to explore even for small applications. Moreover, the complexity of real-life embedded software applications makes this task even more challenging. A moderately sized program may have tens of thousands of lines of code and hundreds or thousands of functions. Fig. 1 shows the number of lines of code in some examples of embedded programs.



Fig. 1. Size (lines of code) of various embedded programs.

Consider the task of synthesizing an application-specific processor-based MPEG-2 encoder/decoder. A typical MPEG-2 decoder implementation that we analyzed contains 7832 lines of code and 114 functions, while the encoder has 7605 additional lines of code comprising 95 additional functions. As shown later in this paper, the number of candidate custom instructions that can be generated to accelerate such applications is huge. Thus, any method to explore the architectural design space must do so very efficiently.

While a significant body of research, built up over a decade, exists for ASIP synthesis, most of it has focused on generating custom instruction sets and architectures from scratch, rather than specializing a given base processor and extending its existing instruction set. Most current commercial offerings, however, are based on the latter approach, since starting with a preoptimized, preverified base architecture considerably eases the challenging tasks of design verification and generation of the software development tool chain. More importantly, we believe that further research is required to allow these techniques to scale with application program size, in order to handle large, realistic embedded software applications.

The rest of this paper is organized as follows. In Section II, we discuss related work in the area of application-specific processor design. In Section III, we outline our contributions. In Section IV, we demonstrate the need for scalable custom processor synthesis techniques, by illustrating the size and complexity of the custom instruction design space, and by showing that straightforward approaches may lead to poor-quality solutions. In Section V, we introduce the proposed methodology and describe its constituent algorithms. In Section VI, we explain the software in which our methodology is implemented and the processor platform on which it is verified. In Section VII, we discuss the experimental results. We conclude in Section VIII.

II. RELATED WORK

In this section, we discuss previous work on ASIP design, configurable processor synthesis, and custom instruction generation and selection. Fisher [3] and Keutzer et al. [4] have provided overviews of the benefits, challenges, and barriers in ASIP design. When designing ASIPs from scratch, the applications are usually expressed as dependence graphs, and the complete instruction set is

generated either together with the microarchitecture [5], [6], or using retargetable compilers based on given hardware descriptions [7]-[9]. ASIP design can be viewed as a hardware/software partitioning process, in which part of the application is mapped to software and part to hardware. Adams and Thomas presented different approaches for designing a mixed hardware/software system [10]. In most hardware/software partitioning, the system architecture consists of a microprocessor and some coprocessors (ASICs) connected by a bus [11], [12]. In our approach, custom instructions are truly integrated into the pipeline stages of the processor, which forms a finer-grained partition.

Instead of designing an ASIP from scratch, configurable processor synthesis tailors custom processors by adding or deleting options, or functional units, from the core processor. The PICO-NPA system, which automatically synthesizes nonprogrammable accelerators (NPAs), is described in [13]. In the PEAS-III system, designers are given the freedom to select various processor parameters, describe micro-operations, and declare modules through a simple graphical user interface [14]. Seng et al. introduced the notion of a flexible instruction processor and parameterization of modular processor templates [15]. Gupta et al. proposed a compiler-directed approach to select among a few processor options under an area constraint [16]. Dougherty et al. obtained the instruction set of ASIPs as a subset of the instruction set of a more general processor [17].

Other research has focused on extending the existing instruction set of a base processor in order to speed up specific applications. Kucukcakar described an ASIP design methodology to modify processor architectures for custom instructions [18]. Choi et al. described a method to generate complex instructions from primitive ones [19]. Arnold and Corporaal provided a semi-automated method for the detection and exploitation of custom instructions for domain-specific processors [20]. In [21], Lee et al. proposed a method to choose among different instruction encoding alternatives in ASIP design. Atasu et al. described methods to generate custom instructions from operation patterns [22]. Cheung et al. presented a technique to rapidly select custom instructions from a set of presynthesized custom instructions [23]. Clark et al. used a compiler-driven approach and examined the cost/performance tradeoffs involved in selecting custom instructions [24].

However, these works all have a flat view of the target application. The computational complexity of the techniques in [19] and [20] makes them unlikely to scale to large programs. Some input/output (I/O) constraints are specified in [22] that would ideally be reflected in the cost function. The area cost of custom instructions is considered in [23]; however, those custom instructions are presynthesized, and the method can only select or drop an entire instruction. Similar to our work, an area model is considered in [24] for making performance-area tradeoffs; however, it limits its search space to I/O requirements and straight-line chains of operations.

III. PAPER OVERVIEW AND CONTRIBUTIONS

Our earlier work on custom instruction synthesis [25] provided a solution for small programs that was, however, not scalable. In this paper, we make the following contributions.


We provide a scalable methodology for application-specific processor synthesis. Given one or more application programs, and the desired design constraints (e.g., area), the methodology generates a custom processor architecture by extending the instruction set of a base processor, so that its performance in executing the given program is significantly improved.

We develop efficient and accurate macromodels to estimate the performance and area of a base processor augmented with a given set of custom instructions. Instead of considering already selected custom instructions as atomic units, we consider the adding and dropping of operations inside each custom instruction, resulting in a more flexible, fine-grained approach.

In order to address the size of the design space, we formulate the custom instruction pruning problem. While pruning, we not only consider each custom instruction independently, but also consider its performance and area impact with respect to all the other custom instructions.

We exploit the fact that large software programs are written in a modular and hierarchical manner, and are composed of functions or procedures. We use the hierarchical structure of the program during several phases of our methodology, including hardware resource budgeting, and to structure our custom instruction selection process into local and global optimizations, much like a software compiler. To the best of our knowledge, this is the first work that uses the program hierarchy to synthesize application-specific processors for scalable applications. The proposed methodology solves the problem of custom instruction selection by recursively traversing the function call graph bottom-up, starting with smaller subprograms at the leaves of the function call graph, and finally merging them to obtain the solution for the entire program.

We have implemented the proposed techniques by integrating them in an open-source compiler research framework (SUIF [26]), and have verified them using a commercial extensible processor (Tensilica's Xtensa [1]). We have evaluated our tool by using it to generate custom processor architectures based on Xtensa for several embedded software programs. The resulting custom processors were subjected to logic synthesis and technology mapping to a commercial technology library, in order to estimate their area and performance. Our experiments indicate that custom processors generated using the proposed methodology can result in significant efficiency improvements in very small synthesis times.

IV. MOTIVATION

In traditional manual custom instruction design flows, designers need to read through the program, pick out performance-critical functions, analyze them, specify custom instructions, and rewrite the program to invoke them. This is a very slow and error-prone process. Moreover, it is difficult for designers to perform global tradeoffs. Hence, manually generated custom instructions often result in poor-quality solutions.

The objective of custom instruction synthesis for extensible processors is to solve the following problem.

Problem 1: Given an extensible processor and an application, generate a set of custom instructions to speed up the application as much as possible while keeping the total area of the processor within the given area budget.

In a manual design, designers may be able to obtain custom instructions for different parts of the program, and then select a subset of these instructions to meet the area budget. However, this approach has several drawbacks. One frequently finds identical code in different parts of a large program; it is easy for a designer to overlook this fact and fail to exploit it for larger speed-ups. It may be possible to reduce the number of operations in the derived custom instructions to try to meet the area budget; however, because of the huge search space involved, a designer is likely to just choose a subset of the custom instructions he/she found, and may not try to shrink individual custom instructions in order to meet the area budget. As shown later, reducing operations in one custom instruction has an impact on how other custom instructions have to be transformed as a result. Finally, a program may have a large number of functions and a hierarchical structure. The following example illustrates the huge custom instruction design space, and why manual design is very difficult.

Example 1: Fig. 2(a) shows part of the function call graph of the mpeg2decode program, rooted at function slice. In this graph, an edge from a parent function to a child function indicates that the parent calls the child. The edges are marked with the number of times each function is called. This is not a complete call graph, as some children and library function calls are omitted from the figure for the sake of clarity. Fig. 2(b) shows a table whose entries indicate the number of cycles each function consumes. The cycle counts of the functions do not include their children, and are obtained from the profiling statistics of the application on the base processor.

From the function call graph, we can obtain a number of custom instructions. In functions idctcol and idctrow, the largest custom instructions have 63 and 64 operations, respectively. The largest custom instruction is an instruction that includes as many operations as possible inside a basic block. As custom instructions do not include memory operations, the sizes of such instructions are also limited by memory accesses. Note that the largest custom instruction may not necessarily yield the best performance (it may require too many inputs and outputs, which may reduce its performance, as discussed later). If we need to drop some operations from the largest custom instruction because of other constraints (e.g., area), we may end up with more than 100 000 custom instruction candidates just for these two functions. Additionally, for functions form_component_prediction, Add_Block, and saturate, we can obtain 16, 11, and 1 largest custom instructions, respectively, that need to be investigated further. If we only consider the largest custom instructions, we have 2^30 possible choices for this small call graph, because each of the 30 largest custom instructions may either be selected in the final implementation or not. In the entire program, we can find nearly 300 largest custom instructions (custom instructions whose size cannot be increased any further).
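As a back-of-the-envelope check of this count, the short C++ snippet below tallies the largest templates enumerated in this example (a hypothetical illustration; the 30-template total is inferred from the figures quoted above):

```cpp
#include <cstdio>

int main() {
    // Largest templates in the subgraph of Fig. 2(a): one each in idctcol
    // and idctrow, plus 16, 11, and 1 in form_component_prediction,
    // Add_Block, and saturate, respectively.
    int largest = 1 + 1 + 16 + 11 + 1;               // 30 largest templates
    double choices = 1.0;
    for (int i = 0; i < largest; ++i) choices *= 2;  // each selected or not
    std::printf("%d templates -> %.2e selections\n", largest, choices);
    return 0;                                        // ~1.07e+09 selections
}
```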


TABLE I CUSTOM INSTRUCTIONS FOUND FOR THE CALL GRAPH IN FIG. 2

Fig. 2. mpeg2decode. (a) Function call graph. (b) Number of cycles each function consumes.

If we consider the possible combinations of smaller custom instructions, the number of choices runs into the billions. Example 1 illustrates that the design space is very large for hierarchical programs, and hence, it is not possible to efficiently solve the custom processor synthesis problem through exhaustive means. One approach may be to divide a large program into several smaller programs and solve the problem for the latter first. This divide-and-conquer approach can significantly reduce the design space. However, a simple divide-and-conquer algorithm may lead to a very inferior solution. For example, a straightforward divide-and-conquer algorithm would be to divide the available area among the different functions based on the number of cycles that each function consumes on the base processor. While on the surface this may seem like a reasonable area allocation, the following example illustrates why it can be highly suboptimal.

Example 2: Table I describes candidate custom instructions for some of the functions in the call graph of Fig. 2(a). A candidate custom instruction is a candidate to be used to augment the processor. Whether it is selected in the final implementation depends on its quality (e.g., performance improvement, influence on other code) and cost (e.g., extra area, power). The total number of cycles for all the functions is 3 674 411. The table also shows the number of cycles saved if that custom instruction were to be used, the area of the custom instructions (in layout grids), and the number of user-defined custom registers. The cycles saved is the difference between the number of cycles used to execute the target application when the custom instruction is not used and when it is used.

The problem with the allocation policy mentioned above is that it allocates the area budget for each function before we even know whether custom instructions can be found for it, and if so, how much area they actually require. Hence, except for the few functions that yield custom instructions, the area allocated to all other functions is totally wasted. It turns out that the area allocated to function idctcol is much larger than the area required for its best custom instruction; if the allocated area is fixed, and we cannot move the spare area to other functions, then this also represents waste. To make matters worse, the area required to implement the best custom instruction for function idctrow is larger than the area allocated to that function. Hence, in this case, we will not even be able to choose the custom instruction that can yield the best performance; we will need to shrink the custom instruction in some way to satisfy the local area requirement. Finally, in some cases, a custom instruction may yield a speed-up only if it is large enough (otherwise, the overheads may swamp the benefits). This means that there is an area lower bound for a custom instruction to be useful. If it so happens that this lower bound is a little above the custom instruction's local area allocation, we cannot even choose it. In summary, if we use a straightforward area budgeting algorithm, because of the large discrepancy between the saved cycles, the area of custom instructions, and the number of cycles consumed by the functions that the custom instructions belong to, the result may very easily get stuck in an inferior solution.

In Example 2, we assumed that each custom instruction has an area requirement that is independent of the area required for the other custom instructions. However, sometimes it is possible for two custom instructions to share area because of a similar structure, or a similar number of inputs and outputs, which further complicates the problem.

A typical reduced instruction set computer (RISC) instruction can only have a few inputs and outputs encoded in an instruction. Suppose these numbers are $N_{in}$ and $N_{out}$, respectively.


Fig. 3. Custom instruction generation flow for hierarchical programs.

If we restrict the number of inputs and outputs of custom instructions to $N_{in}$ and $N_{out}$, respectively, we may end up with very few custom instruction candidates, and hence may not obtain the complete benefit of using custom instructions. Fortunately, many commercial extensible processors, such as Xtensa [1], allow more inputs and outputs for custom instructions. Besides the general-purpose register file, they provide user-defined registers. Before a custom instruction is invoked, WRITE instructions write the input data to the relevant user-defined registers. The custom instruction then implicitly reads its data from the user-defined registers during computation, and writes results back to user-defined registers. After the computation is complete, READ instructions explicitly read the data from the user-defined registers into the processor's general-purpose registers (a code sketch of this mechanism follows Example 3).

User-defined registers are local to the custom hardware and are used just before and after a custom instruction's computation; at other times, the values in those registers are invalid. Hence, we can reuse user-defined registers for different custom instructions if they are guaranteed to execute sequentially, which is true for programs on single-pipeline processors. Some software transformations, such as loop-invariant code motion, can increase the lifetime of user-defined registers, so this constraint needs to be kept in mind while performing such transformations.

Example 3: In Example 2, area sharing of user-defined registers was not considered in the area allocation for custom instructions; each custom instruction had its own user-defined registers. If we consider sharing of user-defined registers, we can use 13 such registers for all the custom instructions, instead of the previous 30. Assuming each user-defined register consumes 3800 grids of area, we save 64 600 grids of area, which can be redirected to obtain larger performance improvements.
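To make the register mechanics concrete, the following C++ sketch mimics the WRITE/compute/READ sequence in software. The names (wur, rur, custom_op, the register-file size) are hypothetical stand-ins for illustration, not Tensilica's actual intrinsics, and the "instruction" is an ordinary function:

```cpp
#include <cstdint>

// Hypothetical user-defined register file, local to the custom hardware.
// Values are live only immediately around a custom instruction, so the
// same registers can be reused by instructions that execute sequentially.
static uint32_t udr[4];

void wur(int idx, uint32_t v) { udr[idx] = v; }    // WRITE user register
uint32_t rur(int idx)         { return udr[idx]; } // READ user register

// A custom instruction with three inputs and two outputs. Only two inputs
// and one output fit in the instruction encoding; the extras go through udr.
uint32_t custom_op(uint32_t in1, uint32_t in2) {
    uint32_t in3 = udr[0];       // implicit read of the extra input
    udr[1] = (in1 + in2) ^ in3;  // implicit write of the extra output
    return (in1 + in2) + in3;    // output encoded in the instruction
}

void invoke(uint32_t in1, uint32_t in2, uint32_t in3,
            uint32_t &out1, uint32_t &out2) {
    wur(0, in3);                 // extra WRITE cycle before the instruction
    out1 = custom_op(in1, in2);  // one (possibly multicycle) instruction
    out2 = rur(1);               // extra READ cycle after the instruction
}
```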

V. METHODOLOGY AND ALGORITHMS

In this section, we describe our methodology for the synthesis of custom instructions. We first provide an overview of the whole design flow, and then describe the important steps in detail.

A. Overview

Fig. 3 illustrates the three phases involved in the automatic generation of custom instructions for hierarchical programs. The rectangular blocks denote the steps involved, and the cylindrical blocks denote data. The intermediate representation (IR) is obtained from the SUIF compiler [26].

In Phase I, the algorithm first takes a target program (in C) as input (block 1), and performs some target-independent transformations, e.g., constant folding, copy propagation, common subexpression elimination, etc. (block 2), to simplify the input program. Typical input data sequences are fed to the transformed IR (block 3) to generate the profiling statistics (block 4), program dependence graph (block 5), and call graph (block 6). The profiling statistics are used to obtain the execution frequency of each basic block, while the hierarchical program dependence graph is used to obtain the operation connectivity within and across basic blocks. Based on all these data, custom instruction templates are generated (block 7).1 Only the largest possible template inside each basic block is generated, leaving the pruning of templates to Phase II. The largest template is obtained by greedily including as many operations as possible; this significantly reduces the complexity of template generation (a sketch of largest-template generation follows below). Note that the largest template may not necessarily be the fastest template, since the extra inputs and outputs may reduce the speed-up. Hence, it is important to use a fast, yet accurate, algorithm to pick out the best templates under a given area constraint.

Many C programs use macros for convenience, and the same macro may appear at different places in the program, resulting in identical code. It would be ideal to identify good macros and transform them into custom instructions before they are expanded; however, macros are dismantled in a preprocessing step and carry no type information, so this is not possible. Therefore, after all the largest custom instruction templates are generated, we collect information on which templates are identical and merge such templates into one (block 8). The merged template has the attributes (e.g., number of execution cycles) of all the identical templates.
1 A template is a custom instruction candidate in the custom instruction generation and selection process. A detailed description of initial template generation can be found in [25].
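As a rough illustration of block 7, the sketch below forms largest templates under the simplifying assumption that a basic block is an ordered list of operations (the tool itself works on the program dependence graph); memory operations, which cannot join a template, split the block into maximal segments, each of which becomes one largest template:

```cpp
#include <string>
#include <vector>

struct Op {
    std::string kind;  // e.g., "add", "shl", "load"
    bool is_memory;    // loads/stores cannot be part of a template
};

// Greedy largest-template generation for one basic block: include as many
// operations as possible; each maximal run of non-memory operations
// becomes one largest template (returned as lists of operation indices).
std::vector<std::vector<int>> largest_templates(const std::vector<Op> &bb) {
    std::vector<std::vector<int>> templates;
    std::vector<int> current;
    for (int i = 0; i < (int)bb.size(); ++i) {
        if (bb[i].is_memory) {  // memory access limits template size
            if (!current.empty()) templates.push_back(current);
            current.clear();
        } else {
            current.push_back(i);  // greedily include the operation
        }
    }
    if (!current.empty()) templates.push_back(current);
    return templates;
}
```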


Finally, a modified IR with custom instruction template annotations (block 9) is passed to Phase II.

Since the design space for realistic software programs is extremely large, we cannot use a flattened description of the application to solve Problem 1. Therefore, Phase II makes use of the program hierarchy to divide the large program into several smaller programs, each having a much smaller design space. The custom instruction solutions for the smaller programs are recursively combined to gradually converge to a solution for the original program.

For each custom instruction template, we first perform some local optimizations (block 10). We next mark the level in the hierarchy at which each function resides (block 11). We start from the leaf functions (block 12), dynamically compute the area and number of cycles saved for all the templates (block 13), and then compute the target area for the templates, which is the area budget for the current function and its subgraph (block 14). Based on the target area, we shrink or expand the custom instruction templates in the function subgraph (block 15). If all the functions at a level have been processed, we ascend levels in the function hierarchy and repeat the process (block 16). Finally, when we reach the root function, a set of custom instruction templates satisfying the global area constraints has been selected (block 17). A more detailed explanation of these steps is given in Section V-B.

Phase III performs more software transformations (block 18), including optimizations like loop-invariant code motion, and generates the custom instruction descriptions (block 21). The original program is also modified through the insertion of calls to custom instructions (blocks 19 and 20). Finally, the custom processor, which includes the base processor and custom instructions, is built, synthesized, and verified (block 22).

B. Details

In this section, we describe the steps in Phase II in detail. First, we mathematically formulate the problem using a flattened hierarchy (note that this formulation is only applicable to small programs). Then, we present an algorithm to solve the formulated problem in a flat manner. After that, we describe a global exploration algorithm that reduces the design space using the program hierarchy. Detailed methods for target area computation and custom instruction partitioning are described last.

1) Problem Formulation: Our objective is to solve Problem 1. To have a clearer understanding of this problem, we define a custom instruction more rigorously.

Definition 1: A custom instruction can be represented as a directed acyclic graph (DAG), $G = (V, E)$. A node $v \in V$ represents an operation. An edge $e = (u, v) \in E$ represents the data dependence between two nodes $u, v \in V$: $u$ is called the data dependence predecessor of $v$ (abbreviated to predecessor), while $v$ is called the data dependence successor of $u$ (abbreviated to successor), and $e$ is labeled with the symbol that connects $u$ and $v$. There are two special nodes, source $v_s$ and drain $v_d$: $v_s$ is the predecessor of all input nodes, and $v_d$ is the successor of all output nodes. $In(G)$ and $Out(G)$ are the input and output sets of the custom instruction, respectively.

$|In(G)|$ and $|Out(G)|$ represent the number of inputs and outputs, respectively.

Definition 2: Each node $v$ has two attributes: $cycle(v)$, representing the number of cycles required to compute the operation in the base processor, and $area(v)$, representing the area incurred if the operation is part of a custom instruction.

Definition 3: Each custom instruction $C$ has three attributes: execution count $N(C)$, representing the number of times the custom instruction is executed, cycles saved $s(C)$, and area $area(C)$. They are computed using the following equations:

s(C) = \sum_{v \in C} cycle(v) - cycle(C) \quad (1)

cycle(C) = sc(C) + \max(|In(C)| - N_{in}, 0) + \max(|Out(C)| - N_{out}, 0) \quad (2)

area(C) = A_{base} + \sum_{v \in C} area(v) \quad (3)

In Definition 3, (1) computes the cycles saved for each execution of the custom instruction. The first term gives the number of cycles needed on the base processor, which is the summation of the number of cycles of all operations.2 The second term gives the number of cycles needed using the custom instruction, which is derived as shown in (2). Its first term, $sc(C)$, denotes the number of scheduled cycles used by the custom instruction. Because each custom instruction is composed of a number of operations, and each operation takes some time to execute, the execution time of the custom instruction is determined by the longest delay path from all inputs to all outputs of the custom instruction.3 The delay may be longer than the clock period, in which case a multicycle instruction is used. The number of cycles $sc(C)$ needed for the computation can be obtained by comparing the longest delay of the custom instruction with the processor clock period, so that all operations of the custom instruction can be finished within $sc(C)$ cycles. The multicycle instruction is automatically pipelined: even though the delay of the instruction is $sc(C)$ cycles, the throughput is still one instruction per cycle. The second and third terms of (2) give the extra cycles needed to WRITE and READ values to and from user-defined registers, assuming the custom instruction can encode $N_{in}$ inputs and $N_{out}$ outputs in the instruction, respectively. The extra inputs have to be written to user-defined registers before computation; these WRITE instructions take extra cycles. Similar arguments hold for outputs. In (1), $s(C)$ is the number of cycles saved per iteration; the total number of cycles saved can be obtained by multiplying $s(C)$ by the execution count $N(C)$.

In both terms of (1), only the computational operations are considered. Such an estimate is reasonable because the custom instruction generation pass is performed after all other platform-independent optimizations; hence, the code is already well optimized.
2 Many embedded processors have RISC-type base instruction sets and issue one instruction per cycle. In this paper, we focus on extending the instruction sets of such base embedded processors.

3 We do not impose hardware constraints on the custom instructions. Hence, the custom instruction delay is determined only by data dependences, and the choice of scheduling algorithm does not affect the longest delay. In this paper, we use the as-soon-as-possible (ASAP) scheduling algorithm.
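For concreteness, the macromodels above translate directly into code. The following C++ sketch evaluates the reconstructed forms of (1)-(5) for a set of candidate instructions; the base-area constant is a placeholder (only the 3800-grid register area comes from Example 3), and the encoding limits match the two-input/one-output restriction used later in Section VII:

```cpp
#include <algorithm>
#include <vector>

// Node and custom-instruction attributes from Definitions 2 and 3.
struct Node { int cycles; int area; };

struct CustomInstr {
    std::vector<Node> ops;  // selected operations
    long exec_count;        // N(C): execution count from profiling
    int sched_cycles;       // sc(C): longest path delay / clock period
    int n_in, n_out;        // |In(C)|, |Out(C)|
};

const int N_IN = 2, N_OUT = 1;  // inputs/outputs encodable in an instruction
const int A_BASE = 1000;        // base decode/interface area (assumed value)
const int A_REG = 3800;         // one user-defined register, in grids

// Cycles per execution with the custom instruction, per (2): scheduled
// cycles plus one WRITE/READ per input/output beyond the encoding limit.
int cycle(const CustomInstr &c) {
    return c.sched_cycles + std::max(c.n_in - N_IN, 0)
                          + std::max(c.n_out - N_OUT, 0);
}

// Cycles saved per execution, per (1): base-processor cycles (one term per
// operation) minus cycles with the custom instruction.
int saved_per_exec(const CustomInstr &c) {
    int base = 0;
    for (const Node &v : c.ops) base += v.cycles;
    return base - cycle(c);
}

long total_saved(const CustomInstr &c) {  // s(C) x N(C)
    return (long)saved_per_exec(c) * c.exec_count;
}

// Custom-instruction area, per (3): base area plus functional-unit area.
int area(const CustomInstr &c) {
    int a = A_BASE;
    for (const Node &v : c.ops) a += v.area;
    return a;
}

// Total area, per (4)-(5): per-instruction areas plus one shared pool of
// user-defined registers sized by the worst-case extra inputs and outputs.
long total_area(const std::vector<CustomInstr> &cis) {
    long sum = 0;
    int max_in = 0, max_out = 0;
    for (const CustomInstr &c : cis) {
        sum += area(c);
        max_in  = std::max(max_in,  std::max(c.n_in  - N_IN,  0));
        max_out = std::max(max_out, std::max(c.n_out - N_OUT, 0));
    }
    return sum + (long)(max_in + max_out) * A_REG;
}
```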


The equations do not consider the cycles consumed by cache misses and register spills. However, custom instruction implementations generally require fewer registers; hence, cache misses and register spills are less likely to appear, and the number of cycles actually saved can be potentially higher.

Equation (3) computes the area of the custom instruction. It is composed of two parts. $A_{base}$ denotes the base area of the custom instruction, required for decoding and other interface logic; if any operation in the custom instruction is chosen, the base area must be added to the custom instruction area. The second term gives the functional unit area. The total area of all the custom instructions can be computed as follows:

AREA = \sum_i area(C_i) + A_{reg} \quad (4)

A_{reg} = \Big( \max_i \max(|In(C_i)| - N_{in}, 0) + \max_i \max(|Out(C_i)| - N_{out}, 0) \Big) \times a_{reg} \quad (5)

Because the area needed to implement user-defined registers can be shared across custom instructions, we include it as a separate, common term in (4). $A_{reg}$ represents the area for user-defined registers, which is the sum of the maximum number of extra inputs and the maximum number of extra outputs, times the unit user-defined register area $a_{reg}$. The total area is thus the sum of the area of each custom instruction and the area of the user-defined registers. Now, Problem 1 can be restated as follows.

Problem 2: Given a base processor, an application, and an area budget $A_{budget}$, find a set of custom instructions such that $\sum_i s(C_i) \times N(C_i)$ is maximized while $AREA \le A_{budget}$.

After Phase I, a number of custom instructions $C_1, C_2, \ldots, C_M$ become available, where $M$ is typically large. Thus, we need to select a subset of the custom instruction nodes. The problem can now be further refined as follows.

Problem 3: Given a base processor, an application, and an area budget $A_{budget}$, suppose $M$ custom instructions are found and there are $n_i$ operations in custom instruction $C_i$. We need to find an assignment of values to variables $x_{ij} \in \{0, 1\}$ for node $v_{ij}$ of custom instruction $C_i$, in order to maximize the following cost function:

S = \sum_{i=1}^{M} s(C_i) \times N(C_i) \quad (6)

where

s(C_i) = \sum_{j=1}^{n_i} x_{ij} \cdot cycle(v_{ij}) - cycle(C_i) \quad (7)

cycle(C_i) = sc(C_i) + \max(|In(C_i)| - N_{in}, 0) + \max(|Out(C_i)| - N_{out}, 0) \quad (8)

In(C_i) = \{ v_{ij} : x_{ij} = 1 \text{ and some predecessor of } v_{ij} \text{ is not selected} \} \quad (9)

Out(C_i) = \{ v_{ij} : x_{ij} = 1 \text{ and some successor of } v_{ij} \text{ is not selected} \} \quad (10)

subject to

AREA \le A_{budget} \quad (11)

where

AREA = \sum_{i=1}^{M} area(C_i) + A_{reg} \quad (12)

area(C_i) = A_{base} \cdot \max_j x_{ij} + \sum_{j=1}^{n_i} x_{ij} \cdot area(v_{ij}) \quad (13)

A_{reg} = \Big( \max_i \max(|In(C_i)| - N_{in}, 0) + \max_i \max(|Out(C_i)| - N_{out}, 0) \Big) \times a_{reg} \quad (14)

and (15) constrains which nodes may be deselected: $x_{ij}$ is 0 if any node on any path from the source $v_s$ to $v_{ij}$, or from $v_{ij}$ back to the drain $v_d$, has its $x$ variable equal to 0. In other words, operations are deleted only from the input or output boundary of a template, so that the remaining selection still forms a valid custom instruction.

2) Custom Instruction Scaling: Given a target area, all the templates in the function subgraph are shrunk or expanded to meet the target area, in order to solve Problem 3. We chose a greedy algorithm with some relaxation to obtain the custom instructions under the area constraint.

In the algorithm, we first compare the area AREA of the initial solution with the target area. If the actual area is larger than the target area, we start deleting operation nodes; otherwise, we start adding nodes. Because of user register sharing across custom instruction templates, each template cannot be considered separately: each operation deletion/addition impacts the number of cycles saved and the area of all custom instructions under consideration.

Fig. 4 gives the pseudocode for node deletion. We first save the initially selected operations of the templates in the solution (line 1), and try all single-operation deletion possibilities (lines 2-4). Then we choose to delete the operation whose removal leaves the remaining custom instructions with the highest ratio of cycles saved over area, and update the single-operation deletion possibilities of that template correspondingly (lines 6-29). Note that with node deletion, the area is always smaller in the long run. If, in one try, the custom instruction area is smaller than the area budget, we compare it with the best solution computed so far: if its number of cycles saved is larger than that of the best solution, or they are equal but its area is smaller, we update the best solution (lines 13-16). This process is repeated until no further deletion can be done.

For node addition, we first try all addition possibilities that lead to a performance improvement, compute the area and cycles saved, and choose the one that has the highest ratio of cycles saved over area. We also keep track of the best solution computed thus far. If a better solution is found, we add the node and update the best solution.

Example 4: Fig. 5 gives the structure of a medium-sized custom instruction from the program implementing the secure hash algorithm (SHA). The software implementation has 14 operations and consumes at least 14 cycles. The 14 operations can be deleted one by one using the pseudocode presented in Fig. 4. For example, from the output side, the addition operation that computes the value A can be deleted, then the addition operation that computes a9 can be deleted. From the input side, the addition operation that computes a6 can be deleted, or the shift operation that computes a1. The exact sequence of deletions is determined by the algorithm presented in Fig. 4.
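The deletion loop of Fig. 4 can be condensed as in the C++ sketch below; this is only a sketch, with single_deletions left as a stub standing in for the re-evaluation of the macromodels (including shared user-register area) after each candidate single-operation deletion:

```cpp
#include <vector>

struct Solution {
    long saved = 0;  // total cycles saved across all templates
    long area = 0;   // total area, including shared user registers
    // selected operations per template omitted for brevity
};

// Assumed helper (stubbed): every solution reachable from s by deleting one
// operation from one template, with saved cycles and area re-evaluated.
std::vector<Solution> single_deletions(const Solution &s) { return {}; }

// Greedy shrinking in the spirit of Fig. 4: repeatedly apply the single
// deletion that keeps the highest saved-cycles/area ratio, remembering the
// best solution seen so far that fits the target area.
Solution shrink_to_budget(Solution sol, long target_area) {
    Solution best = sol;
    bool feasible = (sol.area <= target_area);
    while (true) {
        std::vector<Solution> cands = single_deletions(sol);
        if (cands.empty()) break;  // no further deletion possible
        Solution next = cands[0];
        for (const Solution &c : cands) {
            // c.saved/c.area > next.saved/next.area, cross-multiplied
            if ((double)c.saved * (double)next.area >
                (double)next.saved * (double)c.area)
                next = c;
        }
        sol = next;
        if (sol.area <= target_area &&
            (!feasible || sol.saved > best.saved ||
             (sol.saved == best.saved && sol.area < best.area))) {
            best = sol;  // new best feasible solution
            feasible = true;
        }
    }
    return best;  // caller must handle the case where nothing fits
}
```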


Fig. 6. Pseudocode for global exploration.

Fig. 4. Pseudocode for template deletion.

Fig. 5. Custom instruction graph for the SHA benchmark.

3) Global Exploration: Directly applying the previously described algorithm to the entire application under a tight area budget may lead to inferior solutions. Using other search methods, such as simulated annealing or genetic algorithms, may improve the solution quality, but would significantly increase the synthesis time, especially when the initial area AREA is much larger than the area budget $A_{budget}$. Instead, we

make use of the program hierarchy to recursively solve Problem 3 for subprograms of the program, gradually reducing the area budget and finally converging to the global solution. In this way, bad custom instructions are pruned out early, when the subprograms are small, and good custom instructions are preserved until the end of the global exploration process.

Fig. 6 gives the pseudocode for the algorithm. For each template, we first perform local optimization by dividing the template if the divided template can yield better performance and smaller area (Section V-B5). We also perform some initial pruning of the templates (Section V-B4). Then we break the loops in the call graph to form a DAG. We mark the levels of all the functions in the call graph, starting from the root function (main) and marking it as level 1, then marking all the functions it calls as level 2, and so on. Basically, a function is marked as being at level $l + 1$ if the highest level among its caller functions is $l$, on the condition that it has not already been assigned a level.

Starting from the functions at the leaf level, we compute the total number of cycles saved and area AREA for all the templates passed from Phase I, using (6) and (12), respectively, disregarding the area constraint for now. For each function at the current level, we again compute the total number of cycles saved and area AREA for all the templates in the function subgraph, i.e., in functions directly or indirectly called by the current function. Then we compute the target area (a detailed discussion is given in Section V-B4) and solve Problem 3 by shrinking or expanding all the templates in the function subgraph to meet the target area, depending on whether the target area is smaller or larger than the area the currently selected custom instructions consume. As a special case of shrinking, a custom instruction may be completely dropped.
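The level marking and bottom-up traversal can be sketched as follows, assuming the call-graph loops have already been broken; levels propagate so that a function ends up one level below its deepest caller, and the per-function exploration body is elided:

```cpp
#include <algorithm>
#include <queue>
#include <vector>

struct Func {
    std::vector<int> callees;  // call-graph edges (loops already broken)
    int level = 0;             // 0 means "not yet assigned"
};

// Mark levels top-down from the root (main, level 1): a function is placed
// one level below the deepest of its callers.
void mark_levels(std::vector<Func> &cg, int root) {
    cg[root].level = 1;
    std::queue<int> q;
    q.push(root);
    while (!q.empty()) {
        int f = q.front();
        q.pop();
        for (int c : cg[f].callees) {
            if (cg[c].level < cg[f].level + 1) {
                cg[c].level = cg[f].level + 1;
                q.push(c);
            }
        }
    }
}

// Bottom-up skeleton of Fig. 6: process the deepest level first and ascend
// toward the root, rescaling the templates of each function subgraph.
void explore(std::vector<Func> &cg, int root) {
    mark_levels(cg, root);
    int max_level = 0;
    for (const Func &f : cg) max_level = std::max(max_level, f.level);
    for (int l = max_level; l >= 1; --l) {
        for (int fi = 0; fi < (int)cg.size(); ++fi) {
            if (cg[fi].level != l) continue;
            // Recompute cycles saved and AREA for all templates in the
            // subgraph of fi ((6) and (12)), compute its target area per
            // (17), and shrink or expand the templates to meet it.
        }
    }
}
```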


Fig. 7. Illustration of global exploration. (a) Call graph. (b) Number of largest custom instructions found in each function subgraph. (c) Example of custom instruction versions for one largest custom instruction. (d) Number of cycles saved and area before and after applying the global exploration algorithm as well as the computed target area.

After all the functions at the current level have been processed, we decrease the level number and perform the computation again. Note that a lower level corresponds to a larger function subgraph, and hence a larger design space to search; the previous custom instruction selection step just provides an initial solution for the current search. The following example illustrates the algorithm.

Example 5: Fig. 7(a) depicts part of the call graph of the pretty good privacy (PGP) application. Each node is a function. The application signs and encrypts a plain text into an encrypted file with the owner's signature attached to it. The MD5 algorithm is used to compute the digital signature of the text, and the text body is encrypted via the AES algorithm. Fig. 7(b) gives the number of largest custom instructions found in each function subgraph. A function subgraph is the group of functions directly or indirectly called by the root function of the subgraph. The largest custom instruction is an instruction that includes as many operations as possible inside a basic block. Fig. 7(c) gives examples of custom instruction versions found for one largest custom instruction. The saved cycles, area, and user-register area of the custom instruction versions are shown in the second, third, and fourth columns, respectively. Note that, since each individual operation inside a custom instruction can be added or dropped, the number of possible custom instruction versions is exponential; Fig. 7(c) does not exhaustively list all the custom instruction versions for that largest custom instruction. Fig. 7(d) gives the number of cycles saved and area before and after applying the global

exploration algorithm, as well as the computed target area. If all best-performance custom instructions are chosen, disregarding the area budget, the execution time of the program can be reduced by 440 157 cycles, while consuming 960 618 grids of area. These numbers are obtained by greedily utilizing custom instructions to speed up the application as much as possible; they give the upper bound on the number of cycles that can be saved. Suppose the target area for the entire program is 360 000 grids. As shown in Fig. 7(d), at level 3, before processing function subgraph make_certificate, its number of saved cycles is 66 110, while consuming 70 494 grids of area. The computed target area is 75 083 grids. After processing the function subgraph, the number of cycles saved and the custom instruction area remain the same. Similarly, before processing function subgraph MDfile, its number of saved cycles is 178 717, while consuming 188 771 grids of area. The computed target area is 167 674 grids. After scaling the function subgraph, the number of cycles saved and area become 137 088 cycles and 161 454 grids, respectively. The target area is not very different for the function subgraphs. Similar results are obtained at levels 2 and 1.

4) Target Area Computation: In both the local optimization and global exploration steps, we need to compute the target area, and shrink or expand templates correspondingly to meet it. The modified templates form the new initial solution for the next level. The following factors need to be considered to compute the target area.


First, the target area that a template occupies should be in proportion to its potential for performance savings. For example, a template contributing 1% of the total cycles saved cannot be allowed to occupy 10% of the total budgeted area if that area could be used for other templates that contribute more saved cycles. Second, while the global area budget must be satisfied, the local area budgets (or targets) should not be treated as rigid constraints, since that frequently results in highly suboptimal solutions. We therefore use a gradual reduction of the area constraint, in order to retain sufficient flexibility for optimization until the last stages of the exploration process, when the optimization procedure has visibility of the entire function call graph. This reduces the chances of getting stuck in local optima.

Based on these factors, we first compute the local target area $A_{local}(T)$ for each template $T$, and shrink the template correspondingly:

A_{local}(T) = k \times \frac{s(T)}{S} \times A_{budget} \quad (16)

where $k$ is a constant relaxation factor, usually greater than one. Its purpose is to relax the area budget so that custom instructions are not pruned too aggressively at the very beginning. For example, if $k$ is set to 3, only the custom instructions consuming more than three times the area they deserve are pruned. In some programs, certain custom instructions save only a few cycles while consuming a significant portion of the area; these should be pruned at the very beginning. $s(T)$ is the number of cycles saved by the current template, and $S$ is the number of cycles saved by all the templates. Since, in the global exploration phase, some templates will be considered earlier than others, local optimization provides a relatively good initial solution for the next phase.

In the global exploration phase, at level $l$, we compute the target area based on the following function:

A_{target}(l) = \frac{s_{sub}}{S} \times \left( AREA - \frac{AREA - A_{budget}}{l} \right) \quad (17)

Here, $s_{sub}$ represents the number of cycles saved by the templates in the current function subgraph. The first factor in (17) reflects the first aspect mentioned above: ideally, the percentage of the area each template occupies should correspond to the percentage of the number of cycles saved by the template. The second factor reflects the philosophy of gradually reducing the area. There are $l$ steps from level $l$ to level 1; each time, the area budget is tightened a little, i.e., by $(AREA - A_{budget})/l$, so that the custom instruction area is gradually reduced to the area budget.
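For concreteness, the two target-area computations can be sketched in C++ as follows, using the forms of (16) and (17) given above; the function and parameter names are illustrative:

```cpp
// Local target area per (16): a proportional share of the budget,
// relaxed by the constant factor k (e.g., with k = 3, a template saving
// 1% of the cycles may keep up to 3% of the budget).
long local_target(long saved_t, long saved_all, long area_budget,
                  double k = 3.0) {
    return (long)(k * (double)saved_t / (double)saved_all
                    * (double)area_budget);
}

// Global target area at level l per (17): scale the subgraph's share of
// saved cycles while tightening the current total area toward the budget
// by 1/l of the remaining gap.
long global_target(long saved_sub, long saved_all,
                   long cur_area, long area_budget, int level) {
    double relaxed = (double)cur_area
                   - (double)(cur_area - area_budget) / (double)level;
    return (long)((double)saved_sub / (double)saved_all * relaxed);
}
```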

5) Custom Instruction Partitioning: In Phase I of Fig. 3, we compute the largest custom instruction in a basic block, but it may not necessarily be the best candidate in terms of performance. If the custom instruction has more than $N_{in}$ inputs and $N_{out}$ outputs, some cycles are needed to WRITE/READ the extra values to/from user-defined registers. If the number of inputs or outputs is too large, the WRITE or READ instructions will dominate the number of cycles required to execute the custom instruction. However, if the large custom instruction can be divided into several smaller custom instructions, and the execution cycles of the large custom instruction exceed the sum of the execution cycles of the smaller ones, the division can improve performance without significant cost. Usually, when the division is beneficial, the total area of the divided custom instructions is smaller too, because of area sharing of user-defined registers. The following example illustrates custom instruction partitioning.

Example 6: Suppose one large custom instruction has three inputs (In1, In2, In3) and two outputs (Out1, Out2), uses two user-defined registers (one for the extra input and one for the extra output), and takes three cycles in all. The code is shown below, where WUR (RUR) refers to writing (reading) a user-defined register:

    WUR In3               // write the extra input to a user-defined register
    Out1 = CI(In1, In2)   // implicitly reads In3; implicitly writes Out2
    RUR Out2              // read the extra output from the register

If the above custom instruction is divided into two smaller custom instructions, as shown next, they will not use any user-defined registers and take only two cycles in all:

    Out1 = CI1(In1, In2)
    Out2 = CI2(Out1, In3)

Thus, one cycle is saved for each execution of the custom instruction. This example illustrates that it is sometimes beneficial to divide a large custom instruction into two or more smaller custom instructions. Our methodology performs this optimization before global exploration to further improve the solution quality.

VI. IMPLEMENTATION BASIS

In this section, we describe the extensible processor platform used in our work, as well as the compiler infrastructure within which our tools were implemented.

A. Xtensa Processor

Tensilica's Xtensa [1] is a five-stage pipeline ASIP whose architecture was designed from scratch to be customizable. Its main features are as follows. The designer can select a number of architectural options to enhance the base processor core, e.g., include certain functional units (e.g., multiplier, multiply-accumulator) or a floating-point coprocessor; configure the register file, cache/memory interface, and exception/interrupt mechanisms; and include test and debugging support. The designer can also extend the base processor by writing custom instructions to speed up specific applications. A complete GNU-based software tool suite and hardware synthesis and verification scripts are automatically generated to match the configuration specified in the processor generator. This enables rapid design, verification, and evaluation of the application-specific hardware and software.


TABLE II COMPARISON OF PERFORMANCE, ENERGY, AND AREA USING OUR TOOL

Tensilica provides the Tensilica instruction extension (TIE) language [27], in which custom instructions can be specified. The custom instructions can be either single-cycle or multicycle. Instead of invoking the custom instructions at the assembly level, TIE instructions can be directly inserted into high-level languages such as C and C++. At most two inputs and one output can be encoded in an instruction. If a custom instruction has more inputs or outputs, it can implicitly READ from or WRITE to some internal user-defined registers, whose values need to be explicitly written in or read out.

B. SUIF

SUIF is an open-source compiler research tool [26]. Its intermediate format is a mixed-level program representation: it has high-level constructs, such as loops, condition statements, arrays, and structures, as well as low-level constructs. The intermediate format is target-independent and can be transformed back to a high-level language (C). A number of optimizations (passes) can be applied to the intermediate format, with the output written in the same format. Each pass performs one analysis or transformation. Because of this modularity, designers can select and order the optimizations very easily; they can also write their own passes for specific purposes. We wrote a number of passes to generate, select, and optimize the custom instructions.

VII. EXPERIMENTAL RESULTS

We have implemented the design flow given in Fig. 3 using SUIF [26]. We calibrated the area and delay of a library of register-transfer level (RTL) components. As in (3), the area of one custom instruction is the sum of the base area and the area required for the components implementing the operations.

The delay of one custom instruction is the sum of the operation component delays along its critical path (note that the final area/delay results are reported based on accurate lower-level tools). We used the simple profiling system in SUIF [28] to obtain the profiling statistics (block 4 in the flow of Fig. 3), the program dependence graph generator [29] to get the program dependence graph (block 5), and some code transformation passes of SUIF to transform the code (blocks 2, 3, 6, 9, 18, and 20). We also implemented passes to perform some custom instruction-specific transformations (blocks 2, 7-9, and 17-19). For example, we computed the bitwidths of variables to reduce the hardware area [30]. The chosen custom instructions were transformed into the TIE language (block 21), and automatically inserted into the optimized IR (block 19).4

We verified our implementation on the Xtensa [1] platform from Tensilica, using a GNU-based compiler to compile the program with and without custom instructions, the instruction set simulator (ISS) to obtain the execution cycle count, the TIE compiler to convert custom instruction descriptions to RTL Verilog code, and Synopsys Design Compiler [31] to synthesize the RTL Verilog code and map it to a commercial 0.18-μm technology library [32]. We limit the number of inputs and outputs that can be encoded in a custom instruction to two and one, respectively.

We evaluated our flow using 11 benchmarks. ADPCM refers to adaptive differential pulse code modulation. AES refers to the advanced encryption standard algorithm. DCT denotes a discrete cosine transform algorithm. DES refers to the data encryption standard algorithm. G721 is a voice compression algorithm. MD5 denotes a message digest algorithm. SHA denotes a secure hash algorithm. SSL refers to the secure socket layer protocol.
4 More discussion on custom instruction insertion can be found in [25].


TABLE III COMPARISON OF PERFORMANCE AND AREA USING OUR TOOL

GSM_ENC and GSM_DEC are the encoder and decoder for the GSM protocol, respectively. MPEG2_DEC is a decoder for the MPEG2 standard. ADPCM, DCT, G721, and MPEG2_DEC were obtained from MediaBench [33]. AES, SHA, GSM_ENC, and GSM_DEC were obtained from MiBench [34]. The other benchmarks were obtained from the Internet. We ran our tool on a 1.2-GHz Pentium-III processor with 256-MB memory. The area budget for each benchmark was set in such a way that increasing it further did not significantly increase benchmark performance.

Tables II and III summarize the results of our experiments. Since the benchmarks are hierarchical, we first show the number of their functions and levels. Table II compares the execution time and energy5 of the benchmarks on the original base processor and the customized processor (with custom instructions generated by our tool). The benchmarks in Table III need fairly large input datasets to produce reasonable time distributions among the functions; their energy consumption could not be obtained by RTL simulation, hence only performance is reported. These benchmarks are so large that the flat approach proposed in our earlier work [25] failed to report results in a reasonable time. The clock period for both processors is 6.5 ns. We also include the area overhead due to the addition of the custom instructions, the number of custom instructions selected, and the time required by our tool to generate and select the custom instructions. The number of execution cycles is reported by a cycle-accurate ISS. The area is obtained after logic synthesis of the entire processor. Energy is estimated by a commercial tool, PowerTheater from Sequence Design Inc. [35].

Custom instruction synthesis using our tool usually takes less than a minute, whereas the synthesis time was hours with the flat approach of our earlier work. Verification, based on simulating and profiling the modified program on the custom processor using the Xtensa [1] software tool chain, and synthesizing the entire custom processor using Design Compiler [31], usually takes several hours. The results indicate that our methodology can quickly (in seconds to minutes) generate and prune custom instructions for realistic programs. The selected custom instructions achieve a performance improvement of up to 6.07× (average of 2.82×) over the original processor core with the base instruction set. Energy consumption for the first eight benchmarks is reduced by up to 3.45× (average of 2.16×), and the energy-delay product is reduced by up to 18.85× (average of 7.64×).
5 Energy is computed as the product of the average power of the customized processor over the duration of the execution and the execution time of the application program.

VIII. CONCLUSIONS

Custom processor synthesis is quite challenging for real-life applications because of the complex interdependencies and tradeoffs involved in making design decisions. Companies that license application-specific processors, such as Tensilica, are also working towards developing such frameworks [36]. In this paper, we exploited the modular and hierarchical structure of programs to provide a scalable methodology for application-specific processor synthesis. We have integrated our design flow in an open-source compiler and verified it on a commercial extensible processor. Our experiments demonstrate that our hierarchical approach can effectively tackle large programs, achieving significant performance/energy improvements with very small synthesis times.

REFERENCES
[1] Tensilica Inc., Xtensa Microprocessor, 2006. [Online]. Available: http://www.tensilica.com
[2] Arc International, ARCtangent Processor, 2006. [Online]. Available: http://www.arc.com
[3] J. A. Fisher, "Customized instruction sets for embedded processors," in Proc. Des. Autom. Conf., 1999, pp. 253–257.
[4] K. Keutzer, S. Malik, and A. R. Newton, "From ASIC to ASIP: The next design discontinuity," in Proc. Int. Conf. Comput. Des., 2002, pp. 84–90.
[5] I.-J. Huang and A. M. Despain, "Generating instruction sets and microarchitectures from applications," in Proc. Int. Conf. Comput.-Aided Des., 1994, pp. 391–396.
[6] S. Aditya, B. R. Rau, and V. Kathail, "Automatic architectural synthesis of VLIW and EPIC processors," in Proc. Int. Symp. Syst. Synth., 1999, pp. 107–113.
[7] J. van Praet, G. Goossens, D. Lanneer, and H. De Man, "Instruction set definition and instruction set selection for ASIPs," in Proc. Int. Symp. High-Level Synthesis, 1994, pp. 11–16.
[8] R. Leupers and P. Marwedel, "Instruction set extraction from programmable structures," in Proc. Eur. Des. Autom. Conf., 1994, pp. 156–160.
[9] C. Liem, T. May, and P. Paulin, "Instruction-set matching and selection for DSP and ASIP code generation," in Proc. Eur. Des. Autom. Conf., 1994, pp. 31–37.
[10] J. K. Adams and D. E. Thomas, "The design of mixed hardware/software systems," in Proc. Des. Autom. Conf., 1996, pp. 515–520.
[11] R. K. Gupta and G. De Micheli, "Hardware-software cosynthesis for digital systems," IEEE Des. Test Comput., vol. 10, no. 3, pp. 29–41, Sep. 1993.
[12] R. Ernst, J. Henkel, and T. Benner, "Hardware-software cosynthesis for microcontrollers," IEEE Des. Test Comput., vol. 10, no. 4, pp. 64–75, Dec. 1993.

[13] R. Schreiber, S. Aditya, S. Mahlke, V. Kathail, B. R. Rau, D. Cronquist, and M. Sivaraman, "PICO-NPA: High-level synthesis of nonprogrammable hardware accelerators," HP Laboratories, Palo Alto, CA, 2001.
[14] A. Kitajima, M. Itoh, J. Sato, A. Shiomi, Y. Takeuchi, and M. Imai, "Effectiveness of the ASIP design system PEAS-III in design of pipelined processors," in Proc. Asia South Pacific Des. Autom. Conf., 2001, pp. 649–654.
[15] S. P. Seng, W. Luk, and P. Y. K. Cheung, "Flexible instruction processors," in Proc. Int. Conf. Compilers, Arch., Synth. Embedded Syst., 2000, pp. 193–200.
[16] T. V. K. Gupta, R. Ko, and R. Barua, "Compiler-directed customization of ASIP cores," in Proc. Int. Symp. HW/SW Codesign, 2002, pp. 97–102.
[17] W. E. Dougherty, D. J. Pursley, and D. E. Thomas, "Subsetting behavioral intellectual property for low power ASIP design," J. VLSI Signal Process., vol. 21, no. 3, pp. 209–218, Jul. 1999.
[18] K. Kucukcakar, "An ASIP design methodology for embedded systems," in Proc. Int. Symp. HW/SW Codesign, 1999, pp. 17–21.
[19] H. Choi, J.-S. Kim, C.-W. Yoon, I.-C. Park, S. H. Hwang, and C.-M. Kyung, "Synthesis of application specific instructions for embedded DSP software," IEEE Trans. Comput., vol. 48, no. 6, pp. 603–614, Jun. 1999.
[20] M. Arnold and H. Corporaal, "Designing domain-specific processors," in Proc. Int. Symp. HW/SW Codesign, 2001, pp. 61–66.
[21] J. E. Lee, K. Choi, and N. Dutt, "Efficient instruction encoding for automatic instruction set design of configurable ASIPs," in Proc. Int. Conf. Comput.-Aided Des., 2002, pp. 649–654.
[22] K. Atasu, L. Pozzi, and P. Ienne, "Automatic application-specific instruction-set extensions under microarchitectural constraints," in Proc. Des. Autom. Conf., 2003, pp. 256–261.
[23] N. Cheung, J. Henkel, and S. Parameswaran, "Rapid configuration and instruction selection for an ASIP: A case study," in Proc. Des. Autom. Test Eur. Conf., 2003, pp. 802–807.
[24] N. Clark, W. Tang, and S. Mahlke, "Automatically generating custom instruction set extensions," in Proc. Workshop Appl. Specific Process., 2002, pp. 94–101.
[25] F. Sun, S. Ravi, A. Raghunathan, and N. K. Jha, "Custom instruction synthesis for extensible processor platforms," IEEE Trans. Comput.-Aided Des., vol. 23, no. 2, pp. 216–228, Feb. 2004.
[26] R. P. Wilson, R. S. French, C. S. Wilson, S. P. Amarasinghe, J. A. M. Anderson, S. W. K. Tjiang, S. W. Liao, C. W. Tseng, M. W. Hall, M. S. Lam, and J. L. Hennessy, "SUIF: An infrastructure for research on parallelizing and optimizing compilers," SIGPLAN Notices, vol. 29, no. 12, pp. 31–37, 1994. [Online]. Available: http://www.citeseer.nj.nec.com/wilson94suif.html
[27] A. Wang, E. Killian, D. Maydan, and C. Rowen, "Hardware/software instruction set configurability for system-on-chip processors," in Proc. Des. Autom. Conf., 2001, pp. 184–188.
[28] T. Callahan and J. Wawrzynek, "Simple profiling system for SUIF," presented at the 1st SUIF Compiler Workshop, Stanford, CA, 1996.
[29] J. B. Fenwick, Jr. and L. L. Pollock, "Implementing an optimizing Linda compiler using SUIF," presented at the 1st SUIF Compiler Workshop, Stanford, CA, 1996.
[30] S. Mahlke, R. Ravindran, M. Schlansker, R. Schreiber, and T. Sherwood, "Bitwidth cognizant architecture synthesis of custom hardware accelerators," IEEE Trans. Comput.-Aided Des., vol. 20, no. 11, pp. 1355–1371, Nov. 2001.
[31] Synopsys Inc., Design Compiler, 2006. [Online]. Available: http://www.synopsys.com
[32] NEC Electronics Inc., "CB-11 cell based IC product family," 2006. [Online]. Available: http://www.necel.com
[33] C. Lee, M. Potkonjak, and W. H. Mangione-Smith, "MediaBench: A tool for evaluating and synthesizing multimedia and communications systems," in Proc. Int. Symp. Microarch., 1997, pp. 330–337.
[34] M. R. Guthaus, J. S. Ringenberg, D. Ernst, T. M. Austin, T. Mudge, and R. B. Brown, "MiBench: A free, commercially representative embedded benchmark suite," in Proc. IEEE 4th Annu. Workshop Workload Characterization, 2001, pp. 3–14.
[35] PowerTheater Manual, Sequence Design Inc., Santa Clara, CA, 2005. [Online]. Available: http://www.sequencedesign.com
[36] D. Goodwin and D. Petkov, "Automatic generation of application specific processors," in Proc. Int. Conf. Compilers, Arch., Synth. Embedded Syst., 2003, pp. 137–147.

Fei Sun received the B.S. degree in computer science from Peking University, Beijing, China, in 2000, and the M.A. and Ph.D. degrees in electrical engineering from Princeton University, Princeton, NJ, in 2002 and 2005, respectively. Dr. Sun is currently a Member of the Technical Staff at Tensilica Inc., Santa Clara, CA. His research interests include electronic design automation, high-level synthesis, ASIP design, and hardware/software codesign.

Srivaths Ravi (SM'05) received the B.Tech. degree in electrical and electronics engineering from the Indian Institute of Technology, Madras, India, in 1996, and the M.A. and Ph.D. degrees in electrical engineering from Princeton University, Princeton, NJ, in 1998 and 2001, respectively. Currently, he is a Research Staff Member with NEC Laboratories America Inc., Princeton, NJ, and also holds a Visiting Research Collaborator position with the Department of Electrical Engineering, Princeton University. He leads various projects in the areas of embedded security and low-power design. He co-architected the MOSES security processing architecture for NEC's MP211 multiprocessor mobile phone application system-on-chip (SoC). He has also been responsible for developing RTL and C-based power estimation engines in NEC's C-based design flow, CYBER. His research interests include advanced embedded processing architectures, system-level and RTL test technologies, and low-power design. His publications have appeared in leading ACM/IEEE conferences and journals on VLSI/computer-aided design (CAD), including invited contributions and talks at the International Forum on Application-Specific Multi-Processor SoC, ACM Transactions on Embedded Computing Systems, the International Conference on VLSI Design, and the Design Automation and Test in Europe Conference. Dr. Ravi was a recipient of Best Paper Awards at the International Conference on VLSI Design in 1998, 2000, and 2003. He received the Siemens Medal from the Indian Institute of Technology, Madras, India, in 1996. He serves on the organizing/program committees of various conferences, including the VLSI Test Symposium (VTS) and Design Automation and Test in Europe (DATE).

Anand Raghunathan (S'93–M'97–SM'00) received the B.Tech. degree in electrical and electronics engineering from the Indian Institute of Technology, Madras, India, in 1992, and the M.A. and Ph.D. degrees in electrical engineering from Princeton University, Princeton, NJ, in 1994 and 1997, respectively. Dr. Raghunathan is currently a Senior Research Staff Member at NEC Laboratories America Inc., Princeton, NJ, where he leads research projects related to system-on-chip (SoC) architectures, design methodologies, and design tools. He coauthored High-Level Power Analysis and Optimization and six book chapters, and has presented several full-day and embedded conference tutorials in the previously mentioned areas. He holds or has filed for 20 U.S. patents in the areas of advanced SoC architectures, design methodologies, and VLSI computer-aided design (CAD). Dr. Raghunathan was a recipient of Best Paper Awards at the IEEE International Conference on VLSI Design (one in 1998 and two in 2003) and at the ACM/IEEE Design Automation Conference (1999 and 2000), and of three Best Paper Award nominations at the ACM/IEEE Design Automation Conference (1996, 1997, and 2003). He received a Patent of the Year Award (an award recognizing the invention that has achieved the highest impact) and a Technology Commercialization Award from NEC in 2001 and 2005, respectively. He was chosen by MIT Technology Review to be among the TR35 (top 35 technology innovators under 35 years) for 2006. He was a recipient of the IEEE Meritorious Service Award (2001) and Outstanding Service Award (2004), and was elected a Golden Core Member of the IEEE Computer Society in 2001, in recognition of his contributions. He has been a member of the technical program and organizing committees of several leading conferences and workshops. He has served as Program Chair for the IEEE VLSI Test Symposium and the ACM/IEEE International Symposium on Low Power Electronics and Design. He has also served as Associate Editor of the IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN, the IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, IEEE Design and Test of Computers, and the Journal of Low Power Electronics. He is currently the Vice-Chair of the Tutorials and Education Group of the IEEE Computer Society's Test Technology Technical Council.

Niraj K. Jha (S'85–M'85–SM'93–F'98) received the B.Tech. degree in electronics and electrical communication engineering from the Indian Institute of Technology, Kharagpur, India, in 1981, the M.S. degree in electrical engineering from the State University of New York (SUNY) at Stony Brook in 1982, and the Ph.D. degree in electrical engineering from the University of Illinois at Urbana-Champaign in 1985. He is a Professor of Electrical Engineering at Princeton University, Princeton, NJ. He has coauthored Testing and Reliable Design of CMOS Circuits (Kluwer, 1990), High-Level Power Analysis and Optimization (Kluwer, 1998), and Testing of Digital Systems (Cambridge Univ. Press, 2003). He has also authored six book chapters, and has authored or coauthored more than 300 technical papers. He is currently serving as an Associate Editor of the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS II: ANALOG AND DIGITAL SIGNAL PROCESSING, the IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN, the IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, and the Journal of Low Power Electronics. He has served as an Editor of the Journal of Electronic Testing: Theory and Applications (JETTA) in the past, and as Guest Editor for the JETTA special issue on high-level test synthesis. He served as the Director of the Center for Embedded System-on-a-Chip (SoC) Design, funded by the New Jersey Commission on Science and Technology. He has received 11 U.S. patents. His research interests include nanotechnology, thermal analysis and optimization, computer-aided design of integrated circuits and systems, digital system testing, and computer security. Dr. Jha is a Fellow of the ACM. He is the recipient of the AT&T Foundation Award and NEC Preceptorship Award for research excellence, the NCR Award for teaching excellence, and the Princeton University Graduate Mentoring Award. He has coauthored six papers that have won the Best Paper Award at ICCD'93, FTCS'97, ICVLSID'98, DAC'99, PDCS'02, and ICVLSID'03. A paper of his was selected for "The Best of ICCAD: A Collection of the Best IEEE International Conference on Computer-Aided Design Papers of the Past 20 Years," and another was recognized by IEEE Micro as being among the best 2005 computer architecture conference papers. He has served as the Program Chairman of the 1992 Workshop on Fault-Tolerant Parallel and Distributed Systems and the 2004 International Conference on Embedded and Ubiquitous Computing.
