
NIAGARA: A 32-WAY MULTITHREADED SPARC PROCESSOR

THE NIAGARA PROCESSOR IMPLEMENTS A THREAD-RICH ARCHITECTURE DESIGNED TO PROVIDE A HIGH-PERFORMANCE SOLUTION FOR COMMERCIAL SERVER APPLICATIONS. THE HARDWARE SUPPORTS 32 THREADS WITH A MEMORY SUBSYSTEM CONSISTING OF AN ON-BOARD CROSSBAR, LEVEL-2 CACHE, AND MEMORY CONTROLLERS FOR A HIGHLY INTEGRATED DESIGN THAT EXPLOITS THE THREAD-LEVEL PARALLELISM INHERENT TO SERVER APPLICATIONS, WHILE TARGETING LOW LEVELS OF POWER CONSUMPTION.

Poonacha Kongetira
Kathirgamar Aingaran
Kunle Olukotun
Sun Microsystems

Over the past two decades, microprocessor designers have focused on improving the performance of a single thread in a desktop processing environment by increasing frequencies and exploiting instruction-level parallelism (ILP) using techniques such as multiple instruction issue, out-of-order issue, and aggressive branch prediction. The emphasis on single-thread performance has shown diminishing returns because of the limitations in terms of latency to main memory and the inherently low ILP of applications. This has led to an explosion in microprocessor design complexity and made power dissipation a major concern.

For these reasons, Sun Microsystems' Niagara processor takes a radically different approach to microprocessor design. Instead of focusing on the performance of single or dual threads, Sun optimized Niagara for multithreaded performance in a commercial server environment. This approach increases application performance by improving throughput, the total amount of work done across multiple threads of execution. This is especially effective in commercial server applications such as databases1 and Web services,2 which tend to have workloads with large amounts of thread-level parallelism (TLP).

In this article, we present the Niagara processor's architecture. This is an entirely new implementation of the Sparc V9 architectural specification, which exploits large amounts of on-chip parallelism to provide high throughput. Niagara supports 32 hardware threads by combining ideas from chip multiprocessors3 and fine-grained multithreading.4 Other studies5 have also indicated the significant performance gains possible using this approach on multithreaded workloads. The parallel execution of many threads effectively hides memory latency. However, having 32 threads places a heavy demand on the memory system to support high bandwidth.


Table 1. Commercial server applications.

Benchmark | Application category         | Instruction-level parallelism | Thread-level parallelism | Working set | Data sharing
Web99     | Web server                   | Low    | High | Large  | Low
JBB       | Java application server      | Low    | High | Large  | Medium
TPC-C     | Transaction processing       | Low    | High | Large  | High
SAP-2T    | Enterprise resource planning | Medium | High | Medium | Medium
SAP-3T    | Enterprise resource planning | Low    | High | Large  | High
TPC-H     | Decision support system      | High   | High | Large  | Medium

To provide this bandwidth, a crossbar interconnect scheme routes memory references to a banked on-chip level-2 cache that all threads share. Four independent on-chip memory controllers provide in excess of 20 Gbytes/s of bandwidth to memory.

Exploiting TLP also lets us improve performance significantly without pushing the envelope on CPU clock frequency. This and the sharing of CPU pipelines among multiple threads enable an area- and power-efficient design. Designers expect Niagara to dissipate about 60 W of power, making it very attractive for high-compute-density environments. In data centers, for example, power supply and air conditioning costs have become very significant. Data center racks often cannot hold a complete complement of servers because this would exceed the rack's power supply envelope.

We designed Niagara to run the Solaris operating system, and existing Solaris applications will run on Niagara systems without modification. To application software, a Niagara processor will appear as 32 discrete processors with the OS layer abstracting away the hardware sharing. Many multithreaded applications currently running on symmetric multiprocessor (SMP) systems should realize performance improvements. This is consistent with observations from previous multithreaded-processor development at Sun6,7 and from Niagara chips and systems, which are undergoing bring-up testing in the laboratory.

Recently, the movement of many retail and business processes to the Web has triggered the increasing use of commercial server applications (Table 1). These server applications exhibit large degrees of client request-level parallelism, which servers using multiple threads can exploit. The key performance metric for a server running these applications is the sustained throughput of client requests.

Furthermore, the deployment of servers commonly takes place in high-compute-density installations such as data centers, where supplying power and dissipating server-generated heat are very significant factors in the center's operating cost. Experience at Google shows a representative power density requirement of 400 to 700 W/sq. foot for racked server clusters.2 This far exceeds the typical power densities of 70 to 150 W/sq. foot supported by commercial data centers. It is possible to reduce power consumption by simply running the ILP processors in server clusters at lower clock frequencies, but the proportional loss in performance makes this less desirable. This situation motivates the requirement for commercial servers to improve performance per watt. These requirements have not been efficiently met using machines optimized for single-thread performance.

Commercial server applications tend to have low ILP because they have large working sets and poor locality of reference on memory access; both contribute to high cache-miss rates. In addition, data-dependent branches are difficult to predict, so the processor must discard work done on the wrong path. Load-load dependencies are also present, and are not detectable in hardware at issue time, resulting in discarded work. The combination of low available ILP and high cache-miss rates causes memory access time to limit performance. Therefore, the performance advantage of using a complex ILP processor over a single-issue processor is not significant, while the ILP processor incurs the costs of high power and complexity, as Figure 1 shows.

However, server applications tend to have large amounts of TLP. Therefore, shared-memory machines with discrete single-threaded processors and coherent interconnect have tended to perform well because they exploit TLP. However, the use of an SMP composed of multiple processors designed to exploit ILP is neither power efficient nor cost-efficient. A more efficient approach is to build a machine using simple cores aggregated on a single die, with a shared on-chip cache and high bandwidth to large off-chip memory, thereby aggregating an SMP server on a chip. This has the added benefit of low-latency communication between the cores for efficient data sharing in commercial server applications.

Figure 1. Behavior of processors optimized for TLP and ILP on commercial server workloads. In comparison to the single-issue machine, the ILP processor mainly reduces compute time, so memory access time dominates application performance. In the TLP case, multiple threads share a single-issue pipeline, and overlapped execution of these threads results in higher performance for a multithreaded application.

Niagara overview
The Niagara approach to increasing throughput on commercial server applications involves a dramatic increase in the number of threads supported on the processor and a memory subsystem scaled for higher bandwidths. Niagara supports 32 threads of execution in hardware. The architecture organizes four threads into a thread group; the group shares a processing pipeline, referred to as the Sparc pipe. Niagara uses eight such thread groups, resulting in 32 threads on the CPU. Each Sparc pipe contains level-1 caches for instructions and data. The hardware hides memory and pipeline stalls on a given thread by scheduling the other threads in the group onto the Sparc pipe with a zero-cycle switch penalty. Figure 1 schematically shows how reusing the shared processing pipeline results in higher throughput.

The 32 threads share a 3-Mbyte level-2 cache. This cache is 4-way banked and pipelined for bandwidth; it is 12-way set-associative to minimize conflict misses from the many threads. Commercial server code has data sharing, which can lead to high coherence miss rates. In conventional SMP systems using discrete processors with coherent system interconnects, coherence misses go out over low-frequency off-chip buses or links, and can have high latencies. The Niagara design with its shared on-chip cache eliminates these misses and replaces them with low-latency shared-cache communication.

The crossbar interconnect provides the communication link between Sparc pipes, L2 cache banks, and other shared resources on the CPU; it provides more than 200 Gbytes/s of bandwidth. A two-entry queue is available for each source-destination pair, and it can queue up to 96 transactions each way in the crossbar. The crossbar also provides a port for communication with the I/O subsystem. Arbitration for destination ports uses a simple age-based priority scheme that ensures fair scheduling across all requestors. The crossbar is also the point of memory ordering for the machine.
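
The age-based arbitration policy can be illustrated with a small scheduling sketch. The Python fragment below is a minimal model and not the hardware implementation: it assumes each pending request records the cycle in which it entered its source queue, and the arbiter grants the oldest request for a destination port each cycle, breaking ties by requestor index.

```python
from collections import deque

class CrossbarPort:
    """Minimal model of one crossbar destination port with age-based arbitration.

    Each source keeps a small queue of pending requests (the real design uses a
    two-entry queue per source-destination pair); the arbiter grants the oldest
    request first, so every requestor is eventually served.
    """

    def __init__(self, num_sources, queue_depth=2):
        self.queues = [deque(maxlen=queue_depth) for _ in range(num_sources)]

    def request(self, source, cycle, payload):
        if len(self.queues[source]) < self.queues[source].maxlen:
            self.queues[source].append((cycle, payload))
            return True
        return False  # queue full; the source must retry later

    def arbitrate(self):
        # Grant the request that has waited longest (smallest entry cycle).
        candidates = [(q[0][0], src) for src, q in enumerate(self.queues) if q]
        if not candidates:
            return None
        _, winner = min(candidates)
        return winner, self.queues[winner].popleft()[1]

# Example: three Sparc pipes contending for one L2 bank port.
port = CrossbarPort(num_sources=3)
port.request(source=2, cycle=0, payload="load A")
port.request(source=0, cycle=1, payload="load B")
print(port.arbitrate())  # (2, 'load A'): the older request wins

```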
The memory interface is four channels of double data rate 2 (DDR2) DRAM, supporting a maximum bandwidth in excess of 20 Gbytes/s, and a capacity of up to 128 Gbytes. Figure 2 shows a block diagram of the Niagara processor.

Sparc pipeline
Here we describe the Sparc pipe implementation, which supports four threads. Each thread has a unique set of registers and instruction and store buffers. The thread group shares the L1 caches, translation look-aside buffers (TLBs), execution units, and most pipeline registers. We implemented a single-issue pipeline with six stages (fetch, thread select, decode, execute, memory, and write back).

In the fetch stage, the instruction cache and instruction TLB (ITLB) are accessed. The following stage completes the cache access by selecting the way. The critical path is set by the 64-entry, fully associative ITLB access. A thread-select multiplexer determines which of the four thread program counters (PCs) should perform the access. The pipeline fetches two instructions each cycle. A predecode bit in the cache indicates long-latency instructions.


Figure 2. Niagara block diagram. Eight four-way multithreaded Sparc pipes connect through the crossbar to four L2 cache banks (B0 through B3), each with its own DRAM controller driving a DDR2 channel (channels 0 through 3). The crossbar also links the Sparc pipes to the I/O and shared functions block and the I/O interface.
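
The article quotes an aggregate memory bandwidth above 20 Gbytes/s from the four DDR2 channels shown in Figure 2. As a rough sanity check, the arithmetic below is a sketch under assumed parameters; the 128-bit data path per channel and the 400 million transfers per second rate are illustrative values, not figures from the article.

```python
# Hedged back-of-the-envelope peak-bandwidth check for four DDR2 channels.
# The per-channel width and transfer rate are assumptions for illustration;
# the article only states ">20 Gbytes/s" in aggregate.
channels = 4
bytes_per_transfer = 128 // 8     # assumed 128-bit data path per channel
transfers_per_sec = 400e6         # assumed DDR2-400: 400 million transfers/s

peak_bw = channels * bytes_per_transfer * transfers_per_sec
print(f"peak bandwidth = {peak_bw / 1e9:.1f} GB/s")  # 25.6 GB/s, consistent with >20 GB/s

```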

In the thread-select stage, the thread-select multiplexer chooses a thread from the available pool to issue an instruction into the downstream stages. This stage also maintains the instruction buffers. Instructions fetched from the cache in the fetch stage can be inserted into the instruction buffer for that thread if the downstream stages are not available. Pipeline registers for the first two stages are replicated for each thread.

Instructions from the selected thread go into the decode stage, which performs instruction decode and register file access. The supported execution units include an arithmetic logic unit (ALU), shifter, multiplier, and a divider. A bypass unit handles instruction results that must be passed to dependent instructions before the register file is updated. ALU and shift instructions have single-cycle latency and generate results in the execute stage. Multiply and divide operations are long latency and cause a thread switch.

The load-store unit contains the data TLB (DTLB), data cache, and store buffers. The DTLB and data cache access take place in the memory stage. Like the fetch stage, the critical path is set by the 64-entry, fully associative DTLB access. The load-store unit contains four 8-entry store buffers, one per thread. Checking the physical tags in the store buffer can indicate read-after-write (RAW) hazards between loads and stores. The store buffer supports the bypassing of data to a load to resolve RAW hazards. The store buffer tag check happens after the TLB access in the early part of the write back stage. Load data is available for bypass to dependent instructions late in the write back stage. Single-cycle instructions such as ADD will update the register file in the write back stage.
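
To make the store-buffer behavior concrete, here is a minimal sketch of a per-thread store buffer with physical-tag checking and load bypassing. It is an illustrative model only; the entry format and the way the real hardware handles partial-word stores are not described in the article and are assumed here.

```python
class StoreBuffer:
    """Toy model of one thread's 8-entry store buffer.

    A load checks the buffered physical addresses; on a match (a RAW hazard)
    the youngest matching store forwards its data to the load, otherwise the
    load must read the data cache.
    """

    def __init__(self, entries=8):
        self.entries = entries
        self.buffer = []  # list of (physical_address, data), oldest first

    def store(self, paddr, data):
        if len(self.buffer) >= self.entries:
            raise RuntimeError("store buffer full: thread must stall")
        self.buffer.append((paddr, data))

    def load(self, paddr):
        # Search from youngest to oldest so the most recent store wins.
        for addr, data in reversed(self.buffer):
            if addr == paddr:
                return data          # RAW hazard resolved by bypassing
        return None                  # no match: read the D-cache instead

    def drain_one(self):
        # The oldest store leaves the buffer once the L2 has been updated.
        return self.buffer.pop(0) if self.buffer else None

sb = StoreBuffer()
sb.store(0x1000, 42)
print(sb.load(0x1000))  # 42, forwarded from the store buffer
print(sb.load(0x2000))  # None, must access the data cache

```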

Figure 3. Sparc pipeline block diagram. Four threads share a six-stage single-issue pipeline (fetch, thread select, decode, execute, memory, write back) with local instruction and data caches, per-thread instruction buffers, store buffers, and register files, and thread-select logic driven by instruction type, misses, traps and interrupts, and resource conflicts. Communication with the rest of the machine occurs through the crossbar interface.

The thread-select logic decides which thread is active in a given cycle in the fetch and thread-select stages. As Figure 3 shows, the thread-select signals are common to the fetch and thread-select stages. Therefore, if the thread-select stage chooses a thread to issue an instruction to the decode stage, the fetch stage also selects the same instruction to access the cache. The thread-select logic uses information from various pipeline stages to decide when to select or deselect a thread. For example, the thread-select stage can determine instruction type using a predecode bit in the instruction cache, while some traps are only detectable in later pipeline stages. Therefore, instruction type can cause deselection of a thread in the thread-select stage, while a late trap detected in the memory stage needs to flush all younger instructions from the thread and deselect itself during trap processing.

Pipeline interlocks and scheduling
For single-cycle instructions such as ADD, Niagara implements full bypassing to younger instructions from the same thread to resolve RAW dependencies. As mentioned before, load instructions have a three-cycle latency before the results of the load are visible to the next instruction. Such long-latency instructions can cause pipeline hazards; resolving them requires stalling the corresponding thread until the hazard clears. So, in the case of a load, the next instruction from the same thread waits for two cycles for the hazards to clear.

In a multithreaded pipeline, threads competing for shared resources also encounter structural hazards. Resources such as the ALU that have a one-instruction-per-cycle throughput present no hazards, but the divider, which has a throughput of less than one instruction per cycle, presents a scheduling problem. In this case, any thread that must execute a DIV instruction has to wait until the divider is free. The thread scheduler guarantees fairness in accessing the divider by giving priority to the least recently executed thread. While the divider is in use, other threads can use free resources such as the ALU, load-store unit, and so on.
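
The two-cycle wait after a load can be seen with a tiny interlock model. This sketch is illustrative only; it assumes simple issue-cycle bookkeeping rather than the actual interlock hardware. A load issued in cycle n makes its result visible to a dependent instruction from the same thread no earlier than cycle n + 3.

```python
LOAD_USE_LATENCY = 3  # load result visible to the next instruction after 3 cycles

def earliest_issue(dep_ready_cycle, load_issue_cycle):
    """Return the cycle in which a load-dependent instruction may issue.

    A dependent instruction that is otherwise ready one cycle after the load
    ends up waiting two extra cycles for the hazard to clear.
    """
    return max(dep_ready_cycle, load_issue_cycle + LOAD_USE_LATENCY)

load_cycle = 5
dependent_ready = load_cycle + 1          # next instruction from the same thread
issue = earliest_issue(dependent_ready, load_cycle)
print(issue, "-> waited", issue - dependent_ready, "cycles")  # 8 -> waited 2 cycles

```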


Thread selection policy
The thread selection policy is to switch between available threads every cycle, giving priority to the least recently used thread. Threads can become unavailable because of long-latency instructions such as loads, branches, and multiply and divide. They also become unavailable because of pipeline stalls such as cache misses, traps, and resource conflicts. The thread scheduler assumes that loads are cache hits, and can therefore issue a dependent instruction from the same thread speculatively. However, such a speculative thread is assigned a lower priority for instruction issue as compared to a thread that can issue a nonspeculative instruction.

Figure 4. Thread selection: all threads available.

Figure 4 indicates the operation when all threads are available. In the figure, reading from left to right indicates the progress of an instruction through the pipeline. Reading from top to bottom indicates new instructions fetched into the pipe from the instruction cache. The notation St0-sub refers to a Subtract instruction from thread 0 in the S stage of the pipe. In the example, the t0-sub is issued down the pipe. As the other three threads become available, the thread state machine selects thread1 and deselects thread0. In the second cycle, similarly, the pipeline executes the t1-sub and selects t2-ld (load instruction from thread 2) for issue in the following cycle. When t3-add is in the S stage, all threads have been executed, and for the next cycle the pipeline selects the least recently used thread, thread0. When the thread-select stage chooses a thread for execution, the fetch stage chooses the same thread for instruction cache access.

Figure 5. Thread selection: only two threads available. The ADD instruction from thread0 is speculatively switched into the pipeline before the hit/miss for the load instruction has been determined.

Figure 5 indicates the operation when only two threads are available. Here thread0 and thread1 are available, while thread2 and thread3 are not. The t0-ld in the thread-select stage in the example is a long-latency operation. Therefore it causes the deselection of thread0. The t0-ld itself, however, issues down the pipe. In the second cycle, since thread1 is available, the thread scheduler switches it in. At this time, there are no other threads available and the t1-sub is a single-cycle operation, so thread1 continues to be selected for the next cycle. The subsequent instruction is a t1-ld and causes the deselection of thread1 for the fourth cycle. At this time only thread0 is speculatively available and therefore can be selected. If the first t0-ld was a hit, data can bypass to the dependent t0-add in the execute stage. If the load missed, the pipeline flushes the subsequent t0-add to the thread-select stage instruction buffer, and the instruction reissues when the load returns from the L2 cache.
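
The selection policy described above, switching every cycle among available threads with least-recently-used priority and ranking speculative threads below nonspeculative ones, can be captured in a few lines. The sketch below is a simplified software model of the policy, not the hardware state machine; the availability and speculation flags are assumed inputs.

```python
def select_thread(threads, last_selected_cycle):
    """Pick the next thread for the thread-select stage.

    threads: dict of thread id -> {"available": bool, "speculative": bool}
    last_selected_cycle: dict of thread id -> cycle in which it was last selected
    Preference order: available nonspeculative threads first, then available
    speculative threads; within each class the least recently used thread wins.
    """
    candidates = [tid for tid, st in threads.items() if st["available"]]
    if not candidates:
        return None  # the pipeline inserts a bubble this cycle
    return min(candidates,
               key=lambda tid: (threads[tid]["speculative"], last_selected_cycle[tid]))

threads = {
    0: {"available": True,  "speculative": True},   # issued a load, hit not yet known
    1: {"available": True,  "speculative": False},
    2: {"available": False, "speculative": False},  # waiting on a cache miss
    3: {"available": False, "speculative": False},
}
last_selected = {0: 7, 1: 5, 2: 3, 3: 2}
print(select_thread(threads, last_selected))  # 1: nonspeculative beats speculative

```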

Integer register file
The integer register file has three read and two write ports. The read ports correspond to those of a single-issue machine; the third read port handles stores and a few other three-source instructions. The two write ports handle single-cycle and long-latency operations. Long-latency operations from various threads within the group (load, multiply, and divide) can generate results simultaneously. These instructions will arbitrate for the long-latency port of the register file for write backs. The Sparc V9 architecture specifies the register window implementation shown in Figure 6. A single window consists of eight in, local, and out registers; they are all visible to a thread at a given time. The out registers of a window are addressed as the in registers of the next sequential window, but are the same physical registers. Each thread has eight register windows. Four such register sets support each of the four threads, which do not share register space among themselves. The register file consists of a total of 640 64-bit registers and is a 5.7-Kbyte structure. Supporting the many multiported registers can be a major challenge from both area and access time considerations. We have chosen an innovative implementation to maintain a compact footprint and a single-cycle access time.

Procedure calls can request a new window, in which case the visible window slides up, with the old outputs becoming the new inputs. Return from a call causes the window to slide down. We take advantage of this characteristic to implement a compact register file. The set of registers visible to a thread is the working set, and we implement it using fast register file cells. The complete set of registers is the architectural set; we implement it using compact six-transistor SRAM cells. A transfer port links the two register sets. A window-changing event triggers deselection of the thread and the transfer of data between the old and new windows. Depending on the event type, the data transfer takes one or two cycles. When the transfer is complete, the thread can be selected again. This data transfer is independent of operations to registers from other threads; therefore, operations on the register file from other threads can continue. In addition, the registers of all threads share the read circuitry because only one thread can read the register file in a given cycle.

Figure 6. Windowed register file. w(n - 1), w(n), and w(n + 1) are neighboring register windows. Pipeline accesses occur in the working set (fast register file cells), while window-changing events cause data transfer through a transfer port between the working set and old or new architectural windows held in compact SRAM cells. This structure is replicated for each of the four threads.
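
The working-set/architectural-set split can be sketched in software. The Python model below is illustrative only: it tracks one thread's eight windows and copies a whole window between a fast working copy and a compact backing array on call and return. The one- or two-cycle transfer timing, the transfer port, and the overlap between one window's out registers and the next window's in registers are not modeled.

```python
WINDOWS = 8
REGS_PER_WINDOW = 24   # 8 ins + 8 locals + 8 outs visible per window

class WindowedRegFile:
    """Toy model of one thread's windowed registers.

    `architectural` is the full per-thread register state (the compact SRAM
    copy); `working` is the single window the pipeline reads and writes.
    A call or return replaces the working set with the new current window.
    """

    def __init__(self):
        self.architectural = [[0] * REGS_PER_WINDOW for _ in range(WINDOWS)]
        self.cwp = 0                                  # current window pointer
        self.working = list(self.architectural[0])    # fast cells seen by the pipe

    def _switch_window(self, new_cwp):
        # The thread is deselected while this transfer happens (1-2 cycles in hardware).
        self.architectural[self.cwp] = list(self.working)
        self.cwp = new_cwp
        self.working = list(self.architectural[self.cwp])

    def call(self):
        self._switch_window((self.cwp + 1) % WINDOWS)

    def ret(self):
        self._switch_window((self.cwp - 1) % WINDOWS)

rf = WindowedRegFile()
rf.working[0] = 0xDEAD     # write a register in window 0
rf.call()                  # slide up to window 1
rf.ret()                   # slide back down
print(hex(rf.working[0]))  # 0xdead, restored from the architectural copy

```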
Memory subsystem
The L1 instruction cache is 16 Kbytes, 4-way set-associative with a block size of 32 bytes. We implement a random replacement scheme for area savings, without incurring significant performance cost. The instruction cache fetches two instructions each cycle. If the second instruction is useful, the instruction cache has a free slot, which the pipeline can use to handle a line fill without stalling.

The L1 data cache is 8 Kbytes, 4-way set-associative with a line size of 16 bytes, and implements a write-through policy. Even though the L1 caches are small, they significantly reduce the average memory access time per thread with miss rates in the range of 10 percent. Because commercial server applications tend to have large working sets, the L1 caches must be much larger to achieve significantly lower miss rates, so this sort of trade-off is not favorable for area. However, the four threads in a thread group effectively hide the latencies from L1 and L2 misses. Therefore, the cache sizes are a good trade-off between miss rates, area, and the ability of other threads in the group to hide latency.
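
With per-thread miss rates around 10 percent, the impact of the small L1 caches on average memory access time can be estimated with the standard AMAT formula. The latency and L2 miss-rate numbers below are illustrative assumptions, not figures from the article.

```python
# Average memory access time (AMAT) sketch for one thread.
# Hit time, L2 latency, L2 miss rate, and memory latency are assumed values;
# the article only gives L1 miss rates in the range of 10 percent.
l1_hit_time  = 1      # cycles, assumed
l1_miss_rate = 0.10   # from the article: "in the range of 10 percent"
l2_latency   = 22     # cycles, assumed
l2_miss_rate = 0.05   # assumed
mem_latency  = 200    # cycles, assumed

amat = l1_hit_time + l1_miss_rate * (l2_latency + l2_miss_rate * mem_latency)
print(f"AMAT per thread = {amat:.1f} cycles")
# With four threads per pipe, much of this latency is overlapped with useful
# work from the other threads rather than paid as stall cycles.

```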


Hardware multithreading primer

A multithreaded processor is one that allows more than one thread of execution to exist on the CPU at the same time. To software, a dual-threaded processor looks like two distinct CPUs, and the operating system takes advantage of this by scheduling two threads of execution on it. In most cases, though, the threads share CPU pipeline resources, and hardware uses various means to manage the allocation of resources to these threads.

The industry uses several terms to describe the variants of multithreading implemented in hardware; we show some in Figure A.

In a single-issue, single-thread machine (included here for reference), hardware does not control thread scheduling on the pipeline. Single-thread machines support a single context in hardware; therefore a thread switch by the operating system incurs the overhead of saving and retrieving thread states from memory.

Coarse-grained multithreading, or switch-on-event multithreading, is when a thread has full use of the CPU resource until a long-latency event such as a miss to DRAM occurs, in which case the CPU switches to another thread. Typically, this implementation has a context switch cost associated with it, and thread switches occur when event latency exceeds a specified threshold.

Fine-grained multithreading is sometimes called interleaved multithreading; in it, thread selection typically happens at a cycle boundary. The selection policy can be simply to allocate resources to a thread on a round-robin basis, with threads becoming unavailable for longer periods of time on events like memory misses.

The preceding types time-slice the processor resources, so these implementations are also called time-sliced or vertical multithreading. An approach that schedules instructions from different threads on different functional units at the same time is called simultaneous multithreading (SMT) or, alternately, horizontal multithreading. SMT typically works on superscalar processors, which have hardware for scheduling across multiple pipes.

Another type of implementation is chip multiprocessing, which simply calls for instantiating single-threaded processor cores on a die, perhaps sharing a next-level cache and system interfaces. Here, each processor core executes instructions from a thread independently of the other thread, and they interact through shared memory. This has been an attractive option for chip designers because of the availability of cores from earlier processor generations, which, when shrunk down to present-day process technologies, are small enough for aggregation onto a single die.

Figure A. Variants of multithreading implementations in hardware: single-issue, single-thread pipeline (a); single-issue, 2 threads, coarse-grain threading (b); single-issue, 2 threads, fine-grain threading (c); dual-issue, 2 threads, simultaneous threading (d); and single-issue, 2 threads, chip multiprocessing (e).
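
As a small illustration of the fine-grained (interleaved) style in Figure A(c), the sketch below issues one instruction per cycle from the next ready thread in round-robin order, skipping threads that are waiting on a fixed-latency miss. The instruction streams, miss latency, and two-thread configuration are assumed values for illustration only.

```python
# Tiny cycle-by-cycle model of fine-grained (interleaved) multithreading.
# All parameters are illustrative, not Niagara figures.
MISS_LATENCY = 4
programs = {                      # per-thread instruction streams
    0: ["add", "ld", "add", "add"],
    1: ["add", "add", "ld", "add"],
}
pc = {t: 0 for t in programs}
ready_at = {t: 0 for t in programs}   # cycle at which each thread is next ready
schedule = []

cycle, last = 0, -1
while any(pc[t] < len(programs[t]) for t in programs) and cycle < 50:
    issued = None
    for off in range(1, len(programs) + 1):       # round-robin after last issuer
        t = (last + off) % len(programs)
        if pc[t] < len(programs[t]) and ready_at[t] <= cycle:
            issued = t
            break
    if issued is None:
        schedule.append("bubble")                 # every thread is waiting
    else:
        op = programs[issued][pc[issued]]
        schedule.append(f"t{issued}-{op}")
        pc[issued] += 1
        if op == "ld":                            # a miss makes the thread unavailable
            ready_at[issued] = cycle + MISS_LATENCY
        last = issued
    cycle += 1

print(" ".join(schedule))

```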

Niagara uses a simple cache coherence protocol. The L1 caches are write through, with allocate on load and no-allocate on stores. L1 lines are either in valid or invalid states. The L2 cache maintains a directory that shadows the L1 tags. The L2 cache also interleaves data across banks at a 64-byte granularity. A load that missed in an L1 cache (load miss) is delivered to the source bank of the L2 cache along with its replacement way from the L1 cache. There, the load miss address is entered in the corresponding L1 tag location of the directory, the L2 cache is accessed to get the missing line, and data is then returned to the L1 cache. The directory thus maintains a sharers list at L1-line granularity. A subsequent store from a different or the same L1 cache will look up the directory and queue up invalidates to the L1 caches that have the line. Stores do not update the local caches until they have updated the L2 cache. During this time, the store can pass data to the same thread but not to other threads; therefore, a store attains global visibility in the L2 cache. The crossbar establishes memory order between transactions from the same and different L2 banks, and guarantees delivery of transactions to L1 caches in the same order. The L2 cache follows a copy-back policy, writing back dirty evicts and dropping clean evicts. Direct memory accesses from I/O devices are ordered through the L2 cache. Four channels of DDR2 DRAM provide in excess of 20 Gbytes/s of memory bandwidth.
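
The directory behavior described above can be summarized in a small model. The sketch below is a simplified software illustration under assumptions (a full-map sharers set per line, single-value data, and no banking or ordering logic); it is not the Niagara hardware.

```python
class L2Directory:
    """Toy directory for write-through L1 caches backed by a shared L2.

    A load miss allocates the requester into the sharers set; a store updates
    the L2 first and then invalidates every other L1 that holds the line, so
    the store becomes globally visible at the L2.
    """

    def __init__(self, num_l1_caches):
        self.sharers = {}          # line address -> set of L1 cache ids
        self.l2_data = {}          # line address -> value
        self.num_l1 = num_l1_caches

    def load_miss(self, cache_id, addr):
        self.sharers.setdefault(addr, set()).add(cache_id)
        return self.l2_data.get(addr, 0)       # line returned to the L1

    def store(self, cache_id, addr, value):
        self.l2_data[addr] = value             # L2 updated first
        invalidates = self.sharers.get(addr, set()) - {cache_id}
        self.sharers[addr] = {cache_id}        # only the writer keeps the line
        return invalidates                     # L1s that must invalidate the line

directory = L2Directory(num_l1_caches=8)
directory.load_miss(0, 0x80)
directory.load_miss(3, 0x80)
print(directory.store(3, 0x80, 7))   # {0}: cache 0 must invalidate its copy

```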
Niagara systems are presently undergoing testing and bring-up. We have run existing multithreaded application software written for Solaris without modification on laboratory systems. The simple pipeline requires no special compiler optimizations for instruction scheduling. On real commercial applications, we have observed the performance in lab systems to scale almost linearly with the number of threads, demonstrating that there are no bandwidth bottlenecks in the design. The Niagara processor represents the first in a line of microprocessors that Sun designed with many hardware threads to provide high throughput and high performance per watt on commercial server applications. The availability of a thread-rich architecture opens up new avenues for developers to efficiently increase application performance. This architecture is a paradigm shift in the way that microprocessors have been designed thus far. MICRO

Acknowledgments
This article represents the work of the very talented and dedicated Niagara development team, and we are privileged to present their work.

References
1. S.R. Kunkel et al., "A Performance Methodology for Commercial Servers," IBM J. Research and Development, vol. 44, no. 6, 2000, pp. 851-872.
2. L. Barroso, J. Dean, and U. Hölzle, "Web Search for a Planet: The Architecture of the Google Cluster," IEEE Micro, vol. 23, no. 2, Mar.-Apr. 2003, pp. 22-28.
3. K. Olukotun et al., "The Case for a Single Chip Multiprocessor," Proc. 7th Conf. Architectural Support for Programming Languages and Operating Systems (ASPLOS VII), 1996, pp. 2-11.
4. J. Laudon, A. Gupta, and M. Horowitz, "Interleaving: A Multithreading Technique Targeting Multiprocessors and Workstations," Proc. 6th Int'l Conf. Architectural Support for Programming Languages and Operating Systems (ASPLOS VI), ACM Press, 1994, pp. 308-316.
5. L. Barroso et al., "Piranha: A Scalable Architecture Based on Single-Chip Multiprocessing," Proc. 27th Ann. Int'l Symp. Computer Architecture (ISCA 00), IEEE CS Press, 2000, pp. 282-293.
6. S. Kapil, H. McGhan, and J. Lawrendra, "A Chip Multithreaded Processor for Network-Facing Workloads," IEEE Micro, vol. 24, no. 2, Mar.-Apr. 2004, pp. 20-30.
7. J. Hart et al., "Implementation of a 4th-Generation 1.8 GHz Dual Core Sparc V9 Microprocessor," Proc. Int'l Solid-State Circuits Conf. (ISSCC 05), IEEE Press, 2005, http://www.isscc.org/isscc/2005/ap/ISSCC2005AdvanceProgram.pdf.

Poonacha Kongetira is a director of engineering at Sun Microsystems and was part of the development team for the Niagara processor. His research interests include computer architecture and methodologies for SoC design. Kongetira has an MS from Purdue University and a BS from Birla Institute of Technology and Science, both in electrical engineering. He is a member of the IEEE.

Kathirgamar Aingaran is a senior staff engineer at Sun Microsystems and was part of the development team for Niagara. His research interests include architectures for low power and low design complexity. Aingaran has an MS from Stanford University and a BS from the University of Southern California, both in electrical engineering. He is a member of the IEEE.

Kunle Olukotun is an associate professor of electrical engineering and computer science at Stanford University. His research interests include computer architecture, parallel programming environments, and formal hardware verification. Olukotun has a PhD in computer science and engineering from the University of Michigan. He is a member of the IEEE and the ACM.

Direct questions and comments about this article to Poonacha Kongetira, MS: USUN05-215, 910 Hermosa Court, Sunnyvale, CA 94086; poonacha.kongetira@sun.com.

For further information on this or any other computing topic, visit our Digital Library at http://www.computer.org/publications/dlib.

