

The cache and memory subsystems of the IBM POWER8 processor

W. J. Starke
J. Stuecheli
D. M. Daly
J. S. Dodson
F. Auernhammer
P. M. Sagmeister
G. L. Guthrie
C. F. Marino
M. Siegel
B. Blaner

In this paper, we describe the IBM POWER8* cache, interconnect, memory, and input/output subsystems, collectively referred to as the "nest." This paper focuses on the enhancements made to the nest to achieve balanced and scalable designs, ranging from small 12-core single-socket systems, up to large 16-processor-socket, 192-core enterprise rack servers. A key aspect of the design has been increasing the end-to-end data and coherence bandwidth of the system, now featuring more than twice the bandwidth of the POWER7* processor. The paper describes the new memory-buffer chip, called Centaur, providing up to 128 MB of eDRAM (embedded dynamic random-access memory) buffer cache per processor, along with an improved DRAM (dynamic random-access memory) scheduler with support for prefetch and write optimizations, providing industry-leading memory bandwidth combined with low memory latency. It also describes new coherence-transport enhancements and the transition to directly integrated PCIe** (PCI Express**) support, as well as additions to the cache subsystem to support higher levels of virtualization and scalability, including snoop filtering and cache sharing.

Introduction
The IBM POWER8* processor, shown in Figure 1, is IBM's latest generation of POWER* processors, boasting significant increases in thread, core, and system performance, scaling to large core-count SMPs (symmetric multiprocessors) in order to support big data analytics, cognitive computing, and transaction processing. These workloads require large memory capacities with high bandwidth, accessed with low latency, and optimally supporting locking and shared data. The performance of the systems, from 12-core single-socket systems up to the largest 192-core 16-processor-socket SMPs, depends on caches, coherence, data interconnects, memory subsystems, and I/O subsystems to provide the cores with all the data they demand. Additionally, the POWER8 core has roughly doubled in performance compared to the IBM POWER7* core, with roughly 1.5 times the single-thread performance, and supports up to 8 concurrent threads (SMT8) [1]. The POWER processor family has consistently excelled at large SMP systems, and the POWER8 processor builds upon that history [2]. The POWER8 design provides industry-leading memory bandwidth and capacity, allowing the cores to run at full speed, while minimizing average memory latency and power consumption.

This paper describes the architectural enhancements made to the POWER8 cache, interconnect, memory, and input/output subsystems, collectively referred to as the "nest" (shown in Figure 2), to achieve a balanced and scalable design. A major focus of the design has been increasing the end-to-end data and coherence bandwidth of the system (more than twice that of the POWER7 processor). The paper describes the new memory-buffer chip, called Centaur, with up to 128 MB of eDRAM buffer cache per processor, along with an improved DRAM scheduler with support for prefetch and write optimizations, providing industry-leading memory bandwidth combined with low memory latency. It also describes new coherence-transport enhancements to optimize the performance on shared data, and integrated PCI Express** (PCIe**) support to enable higher-bandwidth, lower-latency I/O (input/output) operations.

Digital Object Identifier: 10.1147/JRD.2014.2376131

© Copyright 2015 by International Business Machines Corporation. Copying in printed form for private use is permitted without payment of royalty provided that (1) each reproduction is done without alteration and (2) the Journal reference and IBM copyright notice are included on the first page. The title and abstract, but no other portions, of this paper may be copied by any means or distributed royalty free without further permission by computer-based and other information-service systems. Permission to republish any other portion of this paper must be obtained from the Editor.

0018-8646/15 © 2015 IBM

The rest of this paper is organized as follows. The "Cache hierarchy" section describes the L2 (level 2) and L3 (level 3) caches. The "Memory subsystem" section describes the memory subsystem, including the new Centaur chip with L4 (level 4) cache and technology-agnostic memory channel. The "I/O subsystem" section describes the I/O subsystem, while the "SMP interconnect" and "On-chip accelerator" sections describe the processor data and coherence interconnects with coherence filtering and the nest accelerators available in the POWER8 processor, before providing a brief conclusion.

Cache hierarchy
The cache hierarchy for the POWER8 processor builds on the basic organization created for the POWER7 architecture [3]. The cache hierarchy local to a processing core comprises a store-through L1 (level 1) data cache [1] and a relatively small but low-latency and high-bandwidth, private, store-in L2 (level 2) cache built from SRAM (static random-access memory). Additionally, the POWER8 processor chip has up to 96 MB of a large, dense, shared L3 (level 3) cache, comprised of 8 MB cache regions built from eDRAM [4] and local to the processor cores. The L3 cache-management algorithm selectively migrates data to the local L3 region attached to the core that is using the data, even cloning heavily used shared data copies as needed and collapsing them as they fall out of use. The same 13-state coherence protocol developed for the POWER6* architecture [5] and the POWER7 architecture is utilized for the POWER8 architecture.

While similar in form, the POWER8 processor's cache hierarchy has been significantly enhanced to accommodate the computational strength of the core; the core has roughly doubled in performance from the POWER7 processor core for most workloads [6], supports twice as many hardware threads, and increases the L1 data cache capacity from 32 KB to 64 KB [1]. As shown in Table 1, the L2 capacity per core has grown from 256 KB to 512 KB, while the L3 capacity per core has increased from 4 MB to 8 MB. The number of cores has grown from 8 to 12, and the aggregate L3 capacity per chip was extended from 32 MB to 96 MB. The latencies to each level of cache remain roughly the same as in the POWER7 processor [3]. Finally, the POWER8 processor supports up to 128 MB of shared L4 (level 4) cache per processor, included in the new Centaur chip. The L4 memory cache will be detailed in the "Memory subsystem" section of this paper.

Figure 1
Annotated die photo of the POWER8 chip showing 12 processor cores, each with local L2 and L3 cache, on-chip interconnect, memory controllers, onboard PCIe, and off-chip SMP interconnects.

Designed for big data workloads, most data paths throughout the L1 data cache and core-execution pipelines are twice as wide (providing twice the data width per processor clock) as those found in the POWER7 processor [1]. As described in [1], this double-wide dataflow extends through the L2 load and store paths, L2 cache arrays, local L3 region read/write paths and cache arrays, and seamlessly through the on-chip interconnect to the memory read/write interfaces.

More in-flight requests must be tracked to manage the increased traffic flow enabled by the double-wide dataflow. Table 1 depicts the growth from the POWER7 processor to the POWER8 processor for the major classes of L2 and L3 cache resources to manage the increased flow. The L2 core store-gather cache consolidates store-through traffic from the core to optimize updates to the L2 cache. L2 core read/write state machines manage coherence negotiations with the system and L2 cache reads and writes. L2 cache reads and writes can originate from consolidated core-store traffic, core data loads, instruction fetches, and address-translation fetches. L2 castout state machines manage capacity-related migration of L2 data to the L3 cache or memory, while L2 system-read/write state machines manage coherence and data requests from the rest of the system. L3 core-read state machines handle core-fetch traffic for L3 hits, while L3 write state machines manage L2 castouts (data evicted from L2), lateral L3-region castouts (data evicted from other L3 regions to another region), and cache injection (data installed directly into the L3 from a remote agent). L3 prefetch-write state machines stage prefetch data from memory to the L3 cache. L3 castout and L3 system-read/write state machines perform the same functions as their L2 counterparts.

In addition to the fundamental scaling improvements afforded by increased bandwidth, the POWER8 processor has significant hardware-based improvements to enhance

Figure 2
POWER8 Nest is composed of up to 12 cores with local caches, connected through an on-chip coherence and data interconnect to off-chip interconnect,
accelerators, PCIe host controllers, and memory controllers. The accelerators include a true random number generator (RNG), cryptography accelerator
(Crypto), CAPP (coherent attached processor proxy), and compression accelerator (CMP).

Table 1 Comparison of POWER7 and POWER8 cache hierarchy capacities, resources, and bandwidth.

Figure 3
The POWER8 memory subsystem includes a high-speed memory-technology-agnostic memory channel connected to the new Centaur chip, which
includes an eDRAM cache and 4 DDR3/DDR4 DRAM ports.

uncontested-lock performance (time to access a lock that is free), highly-contested-lock scaling (handling multiple threads competing for the same lock), and address-translation management scaling. For example, uncontested-lock performance benefits from two changes. Atomic operations are able to make binding coherence commit decisions in the L1 data cache instead of the L2 cache, reducing commit latency for uncontested locks. Additionally, hardware transactional memory has been added to the POWER8 processor, enabling significant software-scaling capabilities with software lock elision for software that previously would have used locking [7], further reducing uncontested-lock latency.
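To make the lock-elision point concrete, below is a minimal, illustrative sketch of software lock elision on top of POWER8 hardware transactional memory. It assumes GCC's PowerPC HTM built-ins (__builtin_tbegin, __builtin_tend, __builtin_tabort, available with -mhtm); the lock variable, retry count, and fallback policy are invented for illustration and are not the design described in [7].

```c
/* Sketch of lock elision via POWER8 HTM, assuming GCC -mhtm built-ins.
 * Retry policy and lock representation are illustrative only. */
#include <stdbool.h>

static volatile int lock = 0;              /* 0 = free, 1 = held (assumed) */

static void lock_acquire_elided(void)
{
    for (int tries = 0; tries < 3; tries++) {
        if (__builtin_tbegin(0)) {         /* transaction started */
            if (lock == 0)                 /* lock word stays read-only */
                return;                    /* run critical section transactionally */
            __builtin_tabort(0);           /* lock held: abort, maybe retry */
        }
    }
    while (__sync_lock_test_and_set(&lock, 1))  /* fall back to a real lock */
        while (lock)
            ;                              /* spin until the lock looks free */
}

static void lock_release_elided(void)
{
    if (lock == 0)
        __builtin_tend(0);                 /* commit the elided section */
    else
        __sync_lock_release(&lock);        /* release the real lock */
}
```

When the elided path succeeds, threads never write the lock word, so uncontested and read-shared critical sections avoid cache-line ping-ponging; the explicit fallback preserves correctness when transactions repeatedly abort.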
Beyond these foundational protocol improvements to single-thread and system-wide scaling for shared data and volatile-translation contexts, several hardware capabilities have been added to the cache hierarchy to directly enable high-value system software, middleware, and application features, as described in [8]. Such features include micro-partition prefetch and support for multiple concurrent partitions on a core, both of which enable higher performance for many smaller partitions, as well as the hardware-managed reference-history array to improve virtual-memory management.

Memory subsystem
The memory subsystem has been a major area of our focus for the POWER8 processor, resulting in a substantial increase in bandwidth (roughly 230 GB/s per socket in the POWER8 processor compared to roughly 100 GB/s for the POWER7 processor) while reducing the latency to memory from over 100 ns in the POWER7 processor to around 80 ns in the POWER8 processor.

The POWER8 processor chip has 8 memory controllers (MCs) that operate synchronously with the data and coherence interconnects. Figure 3 shows the memory subsystem for one MC. Each memory controller can track up to 32 requests at a time. Memory is interleaved at a cache-line granularity of 128 B across memory controllers to support low-latency operation. In contrast to the POWER7 processor, which spread a cache line over two memory channels, every POWER8 memory channel supplies a full 128 B cache line of data.
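As a sketch of what 128 B interleaving across controllers means for address mapping, consider the following hypothetical hash; the actual POWER8 address-to-MC mapping is not specified in the text, so the simple round-robin below is an assumption.

```c
/* Illustrative (not the documented) mapping: select a memory controller
 * by interleaving physical addresses at the 128 B cache-line granularity
 * across the chip's 8 memory controllers. */
#include <stdint.h>

#define LINE_BYTES 128u
#define NUM_MCS    8u

static inline unsigned mc_for_address(uint64_t paddr)
{
    uint64_t line = paddr / LINE_BYTES;   /* 128 B cache-line index */
    return (unsigned)(line % NUM_MCS);    /* round-robin across 8 MCs */
}
```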

Memory-technology-agnostic memory channel
The POWER7 and POWER8 processors both use a high-speed memory channel and memory-buffer chip to increase both the memory bandwidth and memory capacity above what could be supported by a direct-attached DRAM configuration. The POWER8 Centaur chip itself is manufactured in the same 22 nm SOI (silicon-on-insulator) process as the main processor chip, using the same design rules, enabling a high-performance solution. Each memory controller can connect to a memory-buffer chip (Centaur chip) over a high-speed, differential memory channel providing a 2 B wide read width and a 1 B wide write width with a bit rate of 9.6 Gb/s. This results in a peak of over 28 GB/s of usable memory bandwidth per memory channel. The 8 high-speed memory channels provide bandwidth equivalent to 32 direct-attached DRAM channels. The Centaur chip will be included directly on the custom DIMM (dual-inline memory module) for some systems, and can be soldered onto the backplane or a riser card for use with industry-standard DIMMs (ISDIMMs) for other systems.
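The per-channel figure follows directly from the link widths and bit rate quoted above; a quick arithmetic check (all values from the text):

```c
/* Worked numbers behind the "over 28 GB/s per channel" claim:
 * a 2 B wide read interface plus a 1 B wide write interface,
 * each bit lane running at 9.6 Gb/s. */
#include <stdio.h>

int main(void)
{
    const double gbit_per_lane = 9.6;                 /* Gb/s per bit lane */
    double read_gbs  = 2 * 8 * gbit_per_lane / 8.0;   /* 16 lanes -> 19.2 GB/s */
    double write_gbs = 1 * 8 * gbit_per_lane / 8.0;   /*  8 lanes ->  9.6 GB/s */
    printf("per channel: %.1f GB/s read + %.1f GB/s write = %.1f GB/s\n",
           read_gbs, write_gbs, read_gbs + write_gbs); /* 28.8 GB/s */
    return 0;
}
```

Eight such channels give 8 x 28.8, or roughly 230 GB/s, matching the per-socket bandwidth quoted earlier.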
The Centaur chip also enables higher efficiency on the memory channel, greatly increasing the total usable bandwidth compared to the POWER7 design. The memory-channel efficiency increase is due to architectural improvements. The POWER7 processor had the memory scheduler on the processor chip, and the scheduler was DDR3 specific. The POWER7 buffer chip merely received requests from the processor chip, transmitting them to the appropriate DRAM channel, and returning data to the processor chip when it came back from the DRAM. The memory controller on the processor chip was responsible for scheduling all operations, and ensuring that there were no collisions on the channel back to the processor chip. With the scheduling tightly controlled from the processor chip, the efficiency of the memory channel to the buffer chip was tied to the efficiency of the DRAM channel itself. In the POWER8 memory subsystem, the memory channel is agnostic with respect to the memory technology used by the buffer chip. Requests sent to the Centaur chip are high-level commands (e.g., cache-line read, cache-line write) and include a tag identifying the request. The high-level commands contrast with the low-level DDR commands sent across the POWER7 memory channel. The Centaur chip processes each request, and sends back data as quickly as possible with the appropriate tag, possibly sending data back in a different order than the requests arrived. Requests might be reordered because of an L4 hit passing an earlier request. Additionally, the Centaur chip can reorder requests to the DRAM for optimal efficiency. The scheduling flexibility and L4 cache allow us to drive the memory channel to the processor chip at an efficiency of over 90%, representing a significant performance improvement over previous designs.
The memory-technology-agnostic memory channel has another benefit: it enables possible upgrades to new memory technologies without requiring a new processor chip. For instance, we could develop a new variant of the Centaur chip to support novel memory technologies (e.g., phase-change memory, STT-MRAM (spin-transfer torque magnetoresistive random-access memory), flash) or a newer DDR specification by changing the scheduler logic. New custom DIMMs with this new buffer chip could be used seamlessly in POWER systems built with POWER8 processors.

L4 buffer cache
The memory-technology-agnostic memory channel also enables another major change in the Centaur chip: the inclusion of an L4 buffer cache per Centaur, providing up to 128 MB of L4 buffer cache per processor chip. The cache is organized as a 16-way set-associative cache, with the data stored in eDRAM and the directory in SRAM. The cache serves as a buffer for the memory and is only accessed on memory-system accesses. As such, it is not snooped on every bus request like the processor caches, and does not participate in the coherence protocol. The L4 cache provides several memory improvements: lower average memory-read latency, lower write latency, more efficient memory scheduling, lower DIMM power, and prefetch extensions. The latency and power impacts are the most obvious of the improvements. Read requests that are satisfied by the L4 cache merely pay the cost of the eDRAM access (latency and power), and do not have to perform a DRAM access. In general, an L4 hit reduces the latency of an L3 miss by over 35 ns and requires less energy to complete.
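For a sense of the geometry: assuming the system's 128 B lines and dividing the 128 MB per processor evenly across eight Centaur chips (an inference, not a stated parameter), each Centaur's 16 MB, 16-way L4 would have 8,192 sets:

```c
/* Geometry check for one Centaur's share of the L4 (assumptions noted):
 * 128 MB per processor / 8 channels = 16 MB per Centaur; with 128 B
 * lines and 16 ways, that is 16 MiB / (128 * 16) = 8192 sets. */
#include <stdint.h>

enum {
    L4_BYTES_PER_CENTAUR = 16 << 20,   /* 16 MB eDRAM data array (inferred) */
    L4_LINE_BYTES        = 128,
    L4_WAYS              = 16,
    L4_SETS = L4_BYTES_PER_CENTAUR / (L4_LINE_BYTES * L4_WAYS),  /* 8192 */
};

static inline uint32_t l4_set_index(uint64_t paddr)
{
    return (uint32_t)((paddr / L4_LINE_BYTES) % L4_SETS);
}
```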

The other features enabled by the L4 cache are also significant. The first feature is reduced write latency. All writes received by the Centaur chip are installed into the L4 cache, whether they hit the cache or not. This allows the write to be retired quickly, freeing the memory-controller resources associated with the write for use for the next command. Additionally, the cache also enables much more efficient scheduling of writes to the DRAM. The Virtual Write Queue [9] introduced the idea of using the last-level cache as a large write queue. We use the L4 cache for the same purpose. We have added machinery, called the cache cleaner, to track how many dirty lines are in the L4 cache, and to scan the cache for lines to write back to memory. The cache cleaner attempts to keep the DRAM write queue mostly full and to schedule bursts of writes to a page on each rank when writes are scheduled. This active process of scanning the cache for page-mode writes enables more efficient use of the DRAM data bus, as multiple writes to the same DRAM page do not hit any DRAM-scheduling constraints; additionally, the page-mode writes save energy, as the page only needs to be activated once. In previous designs, we would let writes accumulate in the DRAM write queue, and perform a burst of writes when either the read queue was empty or the write queue became too full. In the current design, we try to keep the write queue mostly full. Instead of forcing writes when the write queue is mostly full, the cache cleaner monitors the number of dirty lines in the cache, and switches to a burst of writes when the number of dirty lines in the cache exceeds a threshold. In this way, we allow reads to pass writes and perform efficient bursts of page-mode writes when necessary to drain writes from the system, reducing the likelihood of writes delaying critical reads.
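A schematic model of that cleaner policy is sketched below; the threshold, hysteresis, and queue-depth values are invented for illustration and are not the hardware's actual parameters.

```c
/* Schematic model of the cache-cleaner decision described above. */
#include <stdbool.h>

#define DIRTY_BURST_THRESHOLD 4096   /* assumed dirty-line high-water mark */
#define WRITEQ_TARGET_DEPTH   28     /* keep DRAM write queue mostly full */

struct cleaner_state {
    int  dirty_lines;    /* dirty L4 lines, tracked by the cleaner */
    int  writeq_depth;   /* entries currently in the DRAM write queue */
    bool bursting;       /* currently draining a page-mode write burst */
};

/* Called periodically: decide whether to scan the L4 for page-mode
 * write-back candidates and issue a burst of writes to one DRAM page. */
static bool cleaner_should_burst(struct cleaner_state *s)
{
    if (!s->bursting && s->dirty_lines > DIRTY_BURST_THRESHOLD)
        s->bursting = true;                  /* too many dirty lines: drain */
    if (s->dirty_lines < DIRTY_BURST_THRESHOLD / 2)
        s->bursting = false;                 /* hysteresis: stop the burst */
    return s->bursting && s->writeq_depth < WRITEQ_TARGET_DEPTH;
}
```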
Extended prefetching is also enabled by the L4 cache.
As prefetch requests are sent from the cores, extra information
in the form of hints is included with the prefetch requests.
The extra information indicates if the prefetch is part of a
stream of prefetches, and the confidence associated with the
prefetch stream. For high-confidence streaming prefetches,
the Centaur chip can prefetch extra data into the L4 cache
before it is requested by the core, reducing latency of future
prefetch requests in that stream. The extra prefetching has
two forms. The first involves fetching two 128 B lines of data
for one request. The second line of data is installed into the
L4 cache for later use. Additionally, the two lines can be
fetched together from the DRAM, using an open-page
access for the second line. The open-page access increases
efficiency and lowers DRAM-power usage. The second
form involves fetching the next N (e.g., 4) pairs of lines
in the stream into the L4 cache. This is done for the
highest-confidence streams, and can essentially transform
all of the prefetch requests into L4 cache hits.
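The following sketch shows one way the buffer chip could act on such hints; the hint encoding and the prefetch depth (N = 4 pairs, as in the text's example) are otherwise assumptions.

```c
/* Sketch of acting on stream-confidence prefetch hints (encoding assumed). */
#include <stdbool.h>
#include <stdint.h>

enum stream_confidence { CONF_NONE, CONF_LOW, CONF_HIGH, CONF_HIGHEST };

struct prefetch_hint {
    bool is_stream;                 /* part of a detected prefetch stream */
    enum stream_confidence conf;    /* confidence reported by the core */
};

#define LINE 128u

/* For one read at 'paddr', decide which extra lines to pull into the L4:
 * the paired line (fetched with an open-page access), and for the
 * strongest streams the next 4 pairs of lines in the stream. */
static int extra_l4_fetches(uint64_t paddr, struct prefetch_hint h,
                            uint64_t out[/* >= 9 */])
{
    int n = 0;
    if (!h.is_stream || h.conf < CONF_HIGH)
        return 0;                              /* no speculative L4 fill */
    out[n++] = paddr + LINE;                   /* pair line, same DRAM page */
    if (h.conf == CONF_HIGHEST)
        for (int p = 1; p <= 4; p++) {         /* next 4 pairs of lines */
            out[n++] = paddr + 2 * p * LINE;
            out[n++] = paddr + (2 * p + 1) * LINE;
        }
    return n;                                  /* lines to install in the L4 */
}
```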

DRAM interface
Each Centaur chip has four 9 or 10 B (9 B with up to 1 B of spare data) DRAM ports, supporting the DDR3 (double data rate) [10] and DDR4 [11] JEDEC standards, with the DRAM scheduler included on the Centaur chip. Each port can address up to 16 logically independent ranks of DRAM, in order to support multiple physical ranks of DRAM as well as stacked DRAM. A rank of DRAM is a collection of DRAM chips (e.g., 8) that are accessed together in lockstep to provide a cache line of data. Additionally, the Centaur chip contains an eDRAM buffer cache, providing up to 128 MB of aggregate L4 memory buffer to the processor chip, enabling lower latency and improved scheduling opportunities. The DRAM ports are accessed in pairs to provide data, fetching a full 128 B cache line from a port pair. The die photo in Figure 4 shows the four DRAM ports arranged around the outside of the Centaur chip, with the high-speed channel to the processor chip taking up the remaining perimeter on the right. The eDRAM cache and the related control structures fill the center of the chip, along with the DRAM scheduling logic.

Figure 4
Annotated die photo of the Centaur buffer chip with high-speed link to the processor chip, eDRAM memory buffer cache, DDR interfaces, and control logic.
DRAM scheduling must satisfy a number of scheduling constraints, including the need for regular refresh operations. While a refresh operation is occurring, no read or write commands can be issued to the rank in refresh. In traditional systems, this can have a noticeable performance impact. We have included a number of features to mitigate the refresh impact. Before the memory controller issues a refresh to a rank, it does a number of things. First, it checks the high-confidence prefetch streams that are being fetched into the L4. If any of the streams will be impacted by the upcoming refresh, the stream is allowed to prefetch further ahead than normal, in order to fetch lines from the soon-to-be-refreshed rank. After the prefetcher is allowed to prefetch ahead for a period of time, no new requests to that rank are allowed into the read queue (from the prefetcher or from L4 misses). The reads that remain in the read queue are drained from the queue with high priority, and the refresh is performed after the reads are drained. Extended prefetching before performing refresh allows the stream to continue for longer before being disrupted by the refresh. Draining the read queue has a different effect: it keeps the read queue from becoming filled with reads that cannot be serviced at the expense of reads that can be serviced.
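Rendered as a small state machine, the refresh-mitigation sequence looks roughly as follows; the phase durations and bookkeeping fields are invented for illustration.

```c
/* The per-rank refresh-mitigation sequence described above, as a sketch. */
enum refresh_phase {
    RP_IDLE,            /* no refresh pending for this rank */
    RP_PREFETCH_AHEAD,  /* let high-confidence streams run further ahead */
    RP_BLOCK_NEW,       /* stop admitting new reads for the rank */
    RP_DRAIN,           /* drain remaining queued reads at high priority */
    RP_REFRESH          /* issue the refresh itself */
};

struct rank_refresh {
    enum refresh_phase phase;
    int timer;                   /* cycles left in the current phase */
    int queued_reads;            /* reads still in the read queue */
};

static void refresh_step(struct rank_refresh *r)
{
    switch (r->phase) {
    case RP_IDLE:
        break;
    case RP_PREFETCH_AHEAD:
        if (--r->timer <= 0) r->phase = RP_BLOCK_NEW;
        break;
    case RP_BLOCK_NEW:           /* prefetcher and L4-miss reads rejected */
        r->phase = RP_DRAIN;
        break;
    case RP_DRAIN:
        if (r->queued_reads == 0) r->phase = RP_REFRESH;
        break;
    case RP_REFRESH:
        r->phase = RP_IDLE;      /* rank usable again after the refresh */
        break;
    }
}
```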

Memory RAS
POWER7 systems had industry-leading RAS (reliability, availability, and serviceability) properties, using a 72 B marking code to support chipkill (failure of a complete DRAM chip) and sparing. For POWER7, the full 128 B access used two memory channels, and the ECC (error-correcting code) correction was performed on the processor chip. In the POWER8 processor, using the memory-technology-agnostic memory channel, the full 128 B cache line comes from one memory channel, allowing the ECC check and correction to be performed on the buffer chip. The Centaur chip supports 10 B wide DRAM ports, consisting of 72 b of data and ECC plus 8 b of spare data. The data is arranged into a 72 B code word for 64 B of data. The DRAM is accessed in a burst length of 8 beats, with the first four beats of data from the two DRAM ports forming one code word, and the second four beats of data forming a second code word. The spare data can be multiplexed in and used in place of a failed chip (chipkill) discovered by the ECC. The ECC is able to detect and correct a complete DRAM chip failure. When a DRAM chip is determined to be faulty, a mark is placed and the chip is spared out. The spare data is steered into the place of the failed chip and, for 4-bit-wide DRAM, supports steering at 4-bit-wide granularity. In this way, the ECC is capable of handling three chipkill events.
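The code-word arithmetic above can be checked directly: per beat, a port pair carries 2 x 72 b of data-plus-ECC and 2 x 8 b of spare, so an 8-beat burst yields two 72 B code words (128 B of payload) plus 16 B of spare.

```c
/* Arithmetic behind the code-word layout: each of the two DRAM ports in
 * a pair is 10 B wide per beat (72 b data+ECC, 8 b spare). */
enum {
    PORT_DATA_ECC_BITS = 72,          /* per port, per beat */
    PORT_SPARE_BITS    = 8,
    PORTS_PER_PAIR     = 2,
    BURST_BEATS        = 8,

    /* 4 beats across the pair: 4 * 2 * 72 b = 576 b = one 72 B code word */
    CODEWORD_BYTES = 4 * PORTS_PER_PAIR * PORT_DATA_ECC_BITS / 8,    /* 72 */
    CODEWORD_DATA  = 64,              /* payload covered per code word */

    /* two code words per 8-beat burst -> one 128 B cache line */
    LINE_DATA_BYTES = 2 * CODEWORD_DATA,                             /* 128 */
    SPARE_BYTES_PER_BURST =
        BURST_BEATS * PORTS_PER_PAIR * PORT_SPARE_BITS / 8,          /* 16 */
};
```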
In addition to protecting against failures in the DRAM itself, the rest of the path to memory is also protected. The high-speed memory channel to the processor chip is protected by a CRC (cyclic redundancy check) code with a retry protocol. Transmissions over the memory channel are buffered until confirmed, and any transmission that encounters an error is resent, removing the error from the system and protecting the memory channel with the same level of protection as the rest of the system. Additionally, all the primary data paths through the chip are protected with a SECDED (single-error-correcting, double-error-detecting) ECC code, while the eDRAM cache is protected by an ECC code with a fuse-controlled repair facility.

The POWER8 processor supports an additional RAS feature called Selective Memory Mirroring (SMM), originally introduced in the POWER7 processor [3]. SMM enables mirroring of critical data regions such as hypervisor data. Protected data is mirrored across two separate memory channels, protecting against a complete memory-channel failure.

I/O subsystem
The I/O subsystem of the POWER8 processor saw a complete refresh compared to its predecessor designs. This refresh increased bandwidth, significantly lowered memory latency, lowered power consumption, and enables easy adoption by third parties, especially for those that want to use the coherent coupling features of the Coherent Accelerator Processor Interface (CAPI) protocol (see the "On-chip accelerator" section).

Starting with the POWER4 [12] and z10* [13] processors, the IBM I/O subsystem was built upon a proprietary ecosystem of I/O chips [14] attached to IBM's proprietary GX bus and later GX+ and GX++ buses to achieve higher I/O throughput than possible with the standardized interfaces available at the time. Different break-out I/O hub chips provided connectivity to PCI-X**, PCIe devices, and InfiniBand** or Ethernet fabrics. The POWER8 processor is the first generation of POWER processors that features integrated, industry-standard PCIe Gen 3.0 [15] host controllers. PCIe was chosen because of its maturity, industry-leading bandwidth, and widespread adoption by third-party vendors. The POWER8 processor chip integrates 3 PCIe host bridges (PHBs) per chip, with a maximum of 16 lanes per PHB, resulting in a maximum bidirectional bandwidth of 32 GB/s per PHB. The 12-core POWER8 processor chip provides a total of 32 PCI Express lanes. A single POWER8 processor chip thus offers a maximum bidirectional bandwidth of 64 GB/s.
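Those bandwidth figures are consistent with standard PCIe Gen 3.0 signaling; a back-of-envelope check using the standard 8 GT/s per lane and 128b/130b encoding (standard PCIe values, not taken from the text):

```c
/* Back-of-envelope check of the PHB bandwidth figures. */
#include <stdio.h>

int main(void)
{
    const double gen3_gtps = 8.0;              /* GT/s per lane, PCIe Gen 3 */
    const double encoding  = 128.0 / 130.0;    /* 128b/130b line encoding */
    double lane_gbs = gen3_gtps * encoding / 8.0;   /* ~0.985 GB/s per lane */
    double phb_dir  = 16 * lane_gbs;                /* x16 PHB, one direction */
    printf("x16 PHB: ~%.0f GB/s each way, ~%.0f GB/s bidirectional\n",
           phb_dir, 2 * phb_dir);              /* ~16 and ~32 GB/s */
    printf("32 lanes/chip: ~%.0f GB/s bidirectional total\n",
           2 * 2 * phb_dir);                   /* ~64 GB/s */
    return 0;
}
```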
The move towards an integrated PCIe-based I/O subsystem allows for more flexibility in adapting to specific application needs. The POWER7 I/O subsystem [16] was primarily built around external PCIe I/O hub chips that split the GX bus bandwidth of up to 20 GB/s into multiple, but slower, x8 PCIe 2.0 [17] buses. In contrast, the POWER8 processor chip now offers the increased bandwidth of 32 GB/s per controller, but also the flexibility to either couple PCIe devices with very high-bandwidth requirements directly to the processor chip or to split the bandwidth among multiple slower devices using external PCIe switches. Another advantage of moving the PHB onto the processor chip is the significant reduction in memory-access latency, providing very low DMA (direct memory access) read roundtrip times that are crucial for high bandwidth and low latencies with high-performance network adapter devices. The on-chip PHBs achieve time-to-first-byte latencies for DMA reads as low as 200 ns, which is less than one third of the time-to-first-byte latency feasible with the POWER7 GX-attached PCIe hub chip.

I/O memory management unit
The I/O memory management unit (IOMMU) provides enhanced protection and reliability, availability, and serviceability (RAS) features. The protection mechanisms allow freezing all traffic to or from a device in an isolation domain on an error condition, without affecting other domains. The protection mechanisms also enable recovery without having to reset the PHB. Therefore, multiple devices can be attached to the same PHB securely without sacrificing device reliability and availability.

The IOMMU resided on the GX hub chip in POWER systems built using the POWER7 processor, and has been moved onto the POWER8 processor chip. A number of enhancements were made to the IOMMU to meet the increased I/O requirements of the POWER8 processor, and to further exploit the inclusion of the unit in the processor chip. The IOMMU's virtualization and protection capabilities were increased to meet the higher fan-out requirements of the POWER8 processor. Each POWER8 PHB is capable of handling 256 error isolation domains, which allow devices to continue operation even when another device sharing the same PHB encounters a failure

and has to be reinitialized. Moreover, every POWER8 PHB can manage 512 virtual address spaces in order to provide enough resources to fully support and isolate single-root I/O virtualized (SRIOV) [18] PCIe devices.

The address-translation structures in main memory used by the IOMMU were completely rearranged and optimized to take full advantage of the new location of the IOMMU, which is now on-chip and thus able to profit from core-like low-latency memory access. The new structures thus provide the optimal balance between the number of lookups needed, translation latency, and area spent on cache or array structures in each PHB. Also, the IOMMU was extended to support multi-level translation tables in addition to the traditional single-level translation scheme used in AIX, in order to better support the Linux** memory-management schemes.
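As a purely hypothetical illustration of a multi-level table walk (the real POWER8 TCE entry format and table geometry are not reproduced here; the page size, level width, and valid-bit layout below are all assumptions), a two-level translation might look like:

```c
/* Hypothetical two-level TCE walk; formats are illustrative only. */
#include <stdint.h>

#define TCE_PAGE_SHIFT 12u                  /* assume 4 KB I/O pages */
#define TCE_LEVEL_BITS 9u                   /* assume 512 entries per level */
#define TCE_VALID      1ull

/* Translate a device (DMA) address to a real address through a
 * two-level table rooted at 'root'; returns 0 if unmapped. */
static uint64_t tce_translate(const uint64_t *root, uint64_t dma_addr)
{
    uint64_t page   = dma_addr >> TCE_PAGE_SHIFT;
    uint64_t l1_idx = page >> TCE_LEVEL_BITS;
    uint64_t l2_idx = page & ((1ull << TCE_LEVEL_BITS) - 1);

    uint64_t l1 = root[l1_idx];
    if (!(l1 & TCE_VALID))
        return 0;                           /* fault: freeze the domain */
    const uint64_t *l2 = (const uint64_t *)(uintptr_t)(l1 & ~0xFFFull);
    uint64_t tce = l2[l2_idx];
    if (!(tce & TCE_VALID))
        return 0;
    return (tce & ~0xFFFull) | (dma_addr & ((1u << TCE_PAGE_SHIFT) - 1));
}
```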
Throughput optimization
The design was optimized to fully utilize the available link bandwidth, achieving close to the theoretical limits with more than 95% raw link utilization and roughly 90% data utilization, even with traffic using the full address-translation and protection capabilities of the IOMMU. The high sustained link speeds are made possible by a combination of data and coherence interconnect and micro-architectural enhancements.

An improved streaming mode exclusively used for I/O data is implemented in the data interconnect, based on the streaming mode introduced in the POWER7 processor [3]. It allows data bandwidths of more than 14 GB/s in each direction for reads and writes, while still maintaining the strict ordering requirements imposed by the PCIe specification. Frequently, achieving high bandwidths on PCIe buses also requires the use of relaxed ordering to reduce dependencies between reads and writes wherever possible. The POWER8 processor does not require these ordering relaxations due to the high bandwidth provided by the streaming mode. This is critical, as allowing relaxed ordering would create a potential threat to the error-isolation concept of the IOMMU.

In addition, various micro-architectural optimizations were made within the PHB to enable higher link efficiency. Address-translation prefetching allows early inspection of PCIe addresses close to the link interface and early fetches of translation control entries (TCEs) [19] in order to hide most of the lookup latency behind stack queuing and packet processing. In combination with the low latency to memory, this allows for more efficient receive-buffer use, and achieves four times higher bandwidth than the prior-generation external I/O hub chip, even with the same-sized link receive buffers. The PHB further applies prioritization mechanisms in many areas of the design to optimize the scheduling and arbitration, especially of reads for control and payload data. It also relaxes ordering requirements wherever possible by splitting up streams from different error isolation domains to prevent stalls, but without compromising on domain-isolation capabilities. Besides the basic PCIe standard functionality, the PHB now also supports the use of Transaction Layer Packet (TLP) hint bits on the PCIe link to control the cache-injection mechanism on the coherent interconnect.

SMP interconnect
POWER systems are symmetric multiprocessor (SMP) systems, providing snooping-based cache coherence across all the cores in the system. On the POWER8 processor chip, the SMP data and coherence interconnects consist of on-chip and off-chip interconnects, plus memory-coherence additions to enable efficient scaling up to 192 cores. The SMP interconnect is responsible for transferring the current value of memory locations to and from the processor cores, as well as implementing the coherence protocol and enabling efficient lock handling.

On-chip SMP interconnect
As shown in Figure 2, the POWER8 processor on-chip interconnect consists of a multi-ported coherency interface and a highly distributed data interconnect between the processor cores (each with associated L2 and L3 caches), the memory subsystem, and the I/O subsystem. As in the POWER7 processor chip, the on-chip data interconnect consists of eight 16 B buses that span the chip horizontally and are broken into multiple segments to handle the propagation delay across the chip, allowing the interconnect to be pipelined. Four of the buses flow left to right, four flow right to left, and the buses operate at up to 2.4 GHz. Micro-architecture improvements were made to the internal data interconnect to reduce request latency by resolving on-chip data-routing decisions using a distributed-arbitration scheme instead of the centralized data arbiter of previous generations. The on-chip interconnect also contains an adaptive-control mechanism to support independently controlled core frequencies while optimizing coherence-traffic bandwidth. As core frequencies are adjusted upward or downward, the coherence-arbitration logic will throttle up or throttle down the rate at which commands are issued based on the frequency of the slowest core.

Off-chip SMP interconnect
The off-chip interconnect, shown in Figure 5, is a highly scalable, multi-tiered, fully connected topology redesigned for the POWER8 processor chip to reduce latency. POWER systems built on the POWER7 processor required up to 3 hops between chips to get from one chip to another, but POWER systems built on the POWER8 processor require no more than 2 hops. By eliminating a chip hop, the POWER system has been flattened, significantly reducing the latency between the furthest ends of the SMP (an approximately 25 ns reduction). The first-level system is a single chip.

Figure 5
POWER8 SMP topology consists of up to 4 POWER8 processors connected in an all-to-all manner to form a 4-chip group, and up to 4 groups with each chip in a group connected to the matching chips in the other groups.

The second-level system is fully connected with 10 B single-ended SMP links running at 4.8 Gbps. The POWER8 processor chip has three such links, enabling direct connection to each of the three other processor chips, in order to create a four-chip group. The third-level system connects each processor chip in a group to its corresponding processor chip in each other group. Three of the inter-group links are provided per chip, supporting a total of four groups, each containing four processor chips. A full four-group system of four chips per group comprises a maximum system of 16 processor chips and a 192-way SMP. The inter-group-level SMP link uses a 22-bit high-speed differential bus running at 6.4 Gbps.
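The two-hop bound falls out of the topology directly. Below is a small sketch with an illustrative chip-id encoding (group = id / 4, position within group = id % 4; the encoding is an assumption, not the hardware's numbering):

```c
/* Hop count between two chips in the maximal 16-chip topology:
 * all-to-all within a 4-chip group, plus one link between corresponding
 * chips of different groups. Confirms the 2-hop worst case. */
static int smp_hops(int chip_a, int chip_b)     /* ids 0..15 */
{
    int group_a = chip_a / 4, pos_a = chip_a % 4;
    int group_b = chip_b / 4, pos_b = chip_b % 4;

    if (chip_a == chip_b)   return 0;   /* first-level system: one chip */
    if (group_a == group_b) return 1;   /* intra-group all-to-all link */
    if (pos_a == pos_b)     return 1;   /* direct inter-group link */
    return 2;                           /* inter-group, then intra-group */
}
```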
The reliability, availability, and serviceability (RAS) and the cost of the SMP links have also been improved in the POWER8 processor. The SMP links have the ability to dynamically detect and repair bit lanes without removing functional lanes from operation. In other words, SMP-coherence traffic continues to function normally while a bad lane is repaired. Another major change for the POWER8 processor was to replace the SMP backplane with a cabled SMP link connection, significantly reducing the system cost. System configuration is discussed in more detail in [2].

Coherence scopes
The data and coherence interconnects described above provide a large amount of data and command bandwidth. However, with up to 192 cores and a snooping memory-coherence protocol, the cores could easily saturate the links with coherence traffic. The POWER7 architecture had two coherence scopes to limit the coherence traffic on the SMP links by filtering the traffic that needs to be broadcast to other chips and groups [3]. The POWER8 architecture extends the POWER7 architecture's coherence scopes, adding a third scope.

Coherence filtering
Commands are issued to the coherence interconnect with one of three scopes specified: chip, group, and system, where the scopes match the tiers described above. Commands issued with chip or group scope are not broadcast to the complete system, but rather only to the local chip or local group, respectively. Since commands issued with chip or group scope only access a portion of the system, they require much less command bandwidth and much lower latency to complete. However, the system must still ensure memory coherence across the complete system, including the portions not snooped: the command must be seen by any cache holding the line. Hardware prediction mechanisms and software collaboration allow the use of the smallest scope that maintains coherence, while hardware checks ensure that the scope includes all relevant caches.

Memory-Domain Indicators (MDIs) are included for each line of data in the system in order to determine if the scope of the command is adequate to find all caches holding the requested line. MDIs are assigned classes. Chip-class MDIs indicate if the line has moved off the line's home chip, where the Memory Controller (MC) owning the line is attached, and have been included in POWER systems since the POWER6 processor [5]. Group-class MDIs are new in the POWER8 processor and indicate if the line has moved off the line's home group. The new group-class MDIs are kept in a directory structure called an MCD (memory-coherence directory), included on the processor chip. Each MCD is associated with the full address range of the memory served by that processor chip's memory controllers. In the POWER8 processor chip, the MCD is a coarse-grained directory, with each MDI "granule" representing a 16 MB granule of memory.

The MCD snoops commands issued with group and system scope. It participates in the address tenure of commands issued with group scope to indicate if the line specified by the command has moved off the home group (i.e., when the MCD MDI is marked "remote"). At system scope, the MCD determines during the address tenure if the line is going to be moved off the group. If the destination of the line is off the group, the MCD sets the MDI for the line's granule to "remote." When the MCD snoops a command at group scope and indicates that the line is "remote," the master may be required to retry the operation using system scope, or may be allowed to complete the operation and be directed to perform a background kill operation using system scope.

Using both the chip and group MDIs, the coherence protocol may direct a requestor that initially issues a request
with chip scope to increase its scope to group, and then to system. Scope-prediction logic within the requestor is used to determine the initial scope for the operation. For example, if the line is in the local cache in the IG state, the predictor can determine that the line was last sent off the chip and that the request should have group or system scope. A cache prefetching consecutive lines could use the scope needed to complete the operation for one line in the stream for subsequent lines in the stream.

The MC holds chip-class MDIs for each 128 B cache line it holds and can reset the MDI bit to "home" when a line is removed from all caches in the system. However, each MCD group-class MDI entry holds the status for a 16 MB granule, consisting of 128 K lines, and requires a recovery mechanism to reset the MDI. A specialized coherence-protocol command exists to determine if any cache holds the line or holds a reservation for the line, and this command is used to implement the recovery mechanism.
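A sketch of the coarse-grained MCD check described above follows; the 16 MB granule size is from the text, but the bitmap layout and coverage (4,096 granules, i.e., 64 GB) are assumptions for illustration.

```c
/* Coarse-grained MCD lookup: one "remote" bit per 16 MB granule of the
 * memory behind this chip's controllers (bitmap layout is illustrative). */
#include <stdbool.h>
#include <stdint.h>

#define GRANULE_SHIFT 24u                    /* 16 MB granules */
#define MCD_GRANULES  4096u                  /* covers 64 GB, assumed */

static uint64_t mcd_remote[MCD_GRANULES / 64];   /* 1 bit per granule */

static inline bool mcd_line_is_remote(uint64_t paddr)
{
    uint64_t g = (paddr >> GRANULE_SHIFT) % MCD_GRANULES;
    return (mcd_remote[g / 64] >> (g % 64)) & 1;
}

/* On a group-scope command: if any line in the granule may live outside
 * the home group, the master must escalate to system scope (or complete
 * and issue a background kill at system scope). */
static bool group_scope_sufficient(uint64_t paddr)
{
    return !mcd_line_is_remote(paddr);
}
```

Note the trade-off visible in the sketch: a single bit stands in for 128 K lines, so one migrated line marks the whole granule "remote" until the recovery mechanism can prove otherwise.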
Data routing through topology
Data needs to be routed over the data interconnect to get to its destination. On the larger SMP systems, there are multiple routes using inter- and intra-group interconnect buses that could be used to route the data. Routing normally follows a preferred shortest route between source and destination, but may use alternate routing in the case of congestion.

Optimistic non-blocking coherence flow
The POWER8 processor SMP interconnect utilizes a non-blocking snooping protocol, which has highly scalable, low-latency characteristics. This has an advantage over directory-based coherence protocols by decentralizing coherency and thus vastly reducing latency. The non-blocking snooping protocol also has an advantage over message-passing snooping coherence protocols because it is temporally bounded, i.e., snoopers respond in a fixed time called Tsnoop. Furthermore, message-passing snooping protocols rely on queuing structures and communication bandwidth, which become more constrained as the SMP system scales to larger n-way systems.

In previous POWER server generations, coherence operations were evenly divided using time-division multiplexing in order to ensure that each processor chip's coherency bandwidth does not exceed a limit when all processor chips are issuing requests. However, limiting the broadcast rate for worst-case conditions often leaves significant coherence bandwidth unused. The POWER8 SMP interconnect introduced the capability to oversubscribe each processor chip's allotment of bandwidth in order to take advantage of this unused bandwidth, without impacting the base non-blocking operation.

The optimistic coherence flow's oversubscription uses the existing POWER coherence-retry protocol. When a coherence operation is received and the resource required to resolve coherence is unavailable, the protocol allows the snooper to retry the operation. Oversubscription provides the same retry capability, but at a chip-level scale. This chip-level retry can occur when the coherency operations received from the incoming SMP links exceed the chip's snoop bandwidth. This implies that the coherence operation was not broadcast to any downstream processor chips in the SMP topology. Using oversubscription, a coherence operation can still complete successfully even if the broadcast did not reach all chips in the SMP system, so long as the broadcast was received by the appropriate subset for that request.

Not all requests are handled in the same manner when they are retried. Speculative commands, such as prefetches, do not incur as high a penalty if retried, while critical coherence requests suffer a much higher penalty. A priority was added to requests to determine which requests to drop when a request must be dropped. The priority of a command can be specified as low, medium, or high. Low-drop-priority commands are the first commands to be dropped, while high-priority commands are the last commands to be dropped. Furthermore, the interconnect issues requests at different rates depending on their priority. Low-drop-priority commands have a higher issue rate than high-drop-priority commands, as they may be more freely dropped. Dropped requests that are retried are retried with higher priority. The priorities and request issue rates allow the interconnect to manage the number of coherence operations concurrently in flight, allowing most high-priority commands to succeed regardless of SMP system coherency traffic, while speculative, low-priority commands utilize the remaining coherence bandwidth, increasing overall coherence-bandwidth efficiency.

The coherence interconnect issues requests at different rates based on commands' scope, priority, and current interconnect usage. The rates are dynamically controlled in hardware using feedback from dropped commands, in order to optimize bandwidth use and minimize dropped requests.
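The drop-priority behavior can be sketched as follows; the issue-rate numbers are invented, but the ordering matches the text (low-drop-priority commands issue fastest and are shed first, and dropped commands retry at a higher priority).

```c
/* Sketch of the drop-priority policy; rate values are illustrative. */
enum drop_prio { DROP_LOW, DROP_MEDIUM, DROP_HIGH };

struct cmd { enum drop_prio prio; int times_dropped; };

/* Commands a chip may issue per window, by priority (invented numbers:
 * freely droppable work may be issued more aggressively). */
static const int issue_rate[] = { [DROP_LOW] = 8, [DROP_MEDIUM] = 4,
                                  [DROP_HIGH] = 2 };

/* When incoming coherence traffic exceeds snoop bandwidth, shed the
 * most freely droppable work first. */
static int should_drop(const struct cmd *c, int overload_level)
{
    return (int)c->prio < overload_level;    /* low-drop-priority goes first */
}

static void on_dropped(struct cmd *c)
{
    c->times_dropped++;
    if (c->prio < DROP_HIGH)
        c->prio++;                            /* retry at higher priority */
}
```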
On-chip accelerators
In POWER7+*, IBM introduced on-chip accelerators for cryptography and active memory expansion (AME) [20] and provided a true-hardware random-number generator for cryptographic applications [21]. In POWER8, these accelerators have been carried forward and improved, adding new capabilities, such as providing data-prefetching hints to the memory controller to reduce cache-line-access latency and improve throughput.

POWER8 introduces the Coherent Accelerator Processor Interface (CAPI), described in detail elsewhere in this issue of the IBM Journal of Research and Development [22]. CAPI enables off-chip accelerators to be plugged into an up to 16-lane PCIe slot and participate in the system memory-coherence protocol as a peer of other caches in the system at high bandwidth. Accelerators use effective
addresses to reference data structures in memory just like applications running on the cores. Accelerator designers are given the freedom to implement their PCIe-card-based designs using field-programmable gate arrays (FPGAs), application-specific integrated circuit (ASIC) chips, semi-custom integrated circuits, and so forth.

Conclusion
In this paper, we have described the IBM POWER8 cache, interconnect, memory, and input/output subsystems, collectively referred to as the "nest." Systems built from the POWER8 processor and nest represent a substantial increase in single-thread and throughput computing. The paper focused on the enhancements made to the nest to achieve balanced and scalable designs, ranging from small 12-core single-socket systems up to large 16-socket, 192-core enterprise rack servers. The local-cache hierarchy has been improved to accommodate the computational strength of the core (which has roughly doubled in performance), with twice the L2 and L3 cache capacity, twice the bandwidth, more in-flight commands, and more efficient locking and translation support, while maintaining access latencies similar to those of the POWER7 cache. The memory subsystem was a major area of focus for the POWER8 processor, providing industry-leading bandwidth while simultaneously reducing the latency to memory significantly. The memory subsystem uses the new Centaur chip, which includes a new L4 memory-buffer cache and contains the DRAM scheduler, enabling higher memory-channel efficiency. The I/O subsystem of the POWER8 processor saw a complete refresh compared to its predecessor designs, bringing 32 lanes of PCIe 3.0 onto the processor chip, significantly increasing bandwidth while lowering DMA memory latencies considerably. The data and coherence interconnects were improved to increase end-to-end bandwidth by increasing on-chip efficiency, increasing intra-group and inter-group bandwidth, reducing the maximum number of hops, and improving the coherence-scope filtering. Finally, the on-chip accelerators from POWER7 were extended, featuring the inclusion of the new Coherent Accelerator Processor Interface (CAPI).

*Trademark, service mark, or registered trademark of International Business Machines Corporation in the United States, other countries, or both.

**Trademark, service mark, or registered trademark of PCI-SIG, InfiniBand Trade Association, or Linus Torvalds, Inc., in the United States, other countries, or both.

References
1. B. Sinharoy, J. A. Van Norstrand, R. J. Eickemeyer, H. Q. Le, J. Leenstra, D. Q. Nguyen, B. Konigsburg, K. Ward, M. D. Brown, J. E. Moreira, D. Levitan, S. Tung, D. Hrusecky, J. W. Bishop, M. Gschwind, M. Boersma, M. Kroener, M. Kaltenbach, T. Karkhanis, and K. M. Fernsler, "IBM POWER8 processor core microarchitecture," IBM J. Res. & Dev., vol. 59, no. 1, paper 2, pp. 2:1–2:21, 2015.
2. J. Cahill, T. Nguyen, M. Vega, D. Baska, D. Szerdi, H. Pross, R. Arroyo, H. Nguyen, M. Mueller, D. Henderson, and J. Moreira, "IBM Power Systems built with the POWER8 architecture and processors," IBM J. Res. & Dev., vol. 59, no. 1, paper 4, pp. 4:1–4:10, 2015.
3. B. Sinharoy, R. R. Kalla, W. J. Starke, H. Q. Le, R. Cargnoni, J. A. Van Norstrand, B. Ronchetti, J. Stuecheli, J. Leenstra, G. L. Guthrie, D. Nguyen, B. Blaner, C. F. Marino, E. Retter, and P. Williams, "IBM POWER7 multicore server processor," IBM J. Res. & Dev., vol. 55, no. 3, paper 1, pp. 1:1–1:29, May/Jun. 2011.
4. S. Narasimha, P. Chang, C. Ortolland, D. Fried, E. Engbrecht, K. K. Nummy, P. Parries, T. T. Ando, M. Aquilino, N. Arnold, R. Bolam, J. Cai, M. Chudzik, B. Cipriany, G. Costrini, M. Dai, J. Dechene, C. Dewan, B. Engel, M. Gribelyuk, D. Guo, G. Han, N. Habib, J. Holt, D. Ioannou, B. Jagannathan, D. Jaeger, J. Johnson, W. Kong, J. Koshy, R. Krishnan, A. Kumar, M. M. Kumar, J. Lee, X. Li, C. Lin, B. Linder, S. Lucarini, N. Lustig, P. McLaughlin, K. Onishi, V. Ontalus, R. Robison, C. Sheraw, M. Stoker, A. Thomas, G. Wang, R. Wise, L. Zhuang, G. Freeman, J. Gill, E. Maciejewski, R. Malik, J. Norum, and P. Agnello, "22 nm high-performance SOI technology featuring dual-embedded stressors, Epi-Plate High-K deep-trench embedded DRAM and self-aligned via 15LM BEOL," in Proc. IEEE Int. Electron Devices Meeting (IEDM), Dec. 10–13, 2012, pp. 3.3.1–3.3.4.
5. H. Q. Le, W. J. Starke, J. S. Fields, F. P. O'Connell, D. Q. Nguyen, B. J. Ronchetti, W. M. Sauer, E. M. Schwarz, and M. T. Vaden, "IBM POWER6 microarchitecture," IBM J. Res. & Dev., vol. 51, no. 6, pp. 639–662, Nov. 2007.
6. A. Mericas, N. Peleg, L. Pesantez, S. B. Purushotham, P. Oehler, C. A. Anderson, B. A. King-Smith, M. Anand, B. Rogers, L. Maurice, and K. Vu, "IBM POWER8 performance features and evaluation," IBM J. Res. & Dev., vol. 59, no. 1, paper 6, pp. 6:1–6:10, 2015.
7. H. Q. Le, G. L. Guthrie, D. E. Williams, M. M. Michael, B. G. Frey, W. J. Starke, C. May, R. Odaira, and T. Nakaike, "Transactional memory support in the IBM POWER8 processor," IBM J. Res. & Dev., vol. 59, no. 1, paper 8, pp. 8:1–8:14, 2015.
8. B. Sinharoy, R. Swanberg, N. Nayar, B. Mealey, J. Stuecheli, B. Schiefer, J. Leenstra, J. Jann, P. Oehler, D. Levitan, S. Eisen, D. Sanner, T. Pflueger, C. Lichtenau, W. E. Hall, and T. Block, "Advanced features in IBM POWER8 systems," IBM J. Res. & Dev., vol. 59, no. 1, paper 1, pp. 1:1–1:18, 2015.
9. J. Stuecheli, D. Kaseridis, L. John, D. Daly, and H. C. Hunter, "Coordinating DRAM and last-level-cache policies with the virtual write queue," IEEE Micro, vol. 31, no. 1, pp. 90–98, Jan./Feb. 2011.
10. JEDEC Solid State Technology Association, JEDEC Standard: DDR3 SDRAM, 2010. [Online]. Available: www.jedec.org/sites/default/files/docs/JESD79-3E.pdf
11. JEDEC Solid State Technology Association, JEDEC Standard: DDR4 SDRAM, 2012. [Online]. Available: www.jedec.org/sites/default/files/docs/JESD79-4.pdf
12. J. M. Tendler, J. S. Dodson, J. S. Fields, H. Le, and B. Sinharoy, "POWER4 system microarchitecture," IBM J. Res. & Dev., vol. 46, no. 1, pp. 5–25, Jan. 2002.
13. C.-L. K. Shum, F. Busaba, S. Dao-Trong, G. Gerwig, C. Jacobi, T. Koehler, E. Pfeffer, B. R. Prasky, J. G. Rell, and A. Tsai, "Design and microarchitecture of the IBM System z10 microprocessor," IBM J. Res. & Dev., vol. 53, no. 1, pp. 1:1–1:12, Jan. 2009.
14. E. W. Chencinski, M. A. Check, C. DeCusatis, H. H. Deng, M. Grassi, T. A. Gregg, M. M. Helms, A. D. Koenig, L. Mohr, K. Pandey, T. Schlipf, T. Schober, H. Ulrich, and C. R. Walters, "IBM System z10 I/O subsystem," IBM J. Res. & Dev., vol. 53, no. 1, pp. 6:1–6:13, Jan. 2009.
15. PCI Express Base Specification, Revision 3.0, Nov. 2010. [Online]. Available: http://www.pcisig.com/specifications/pciexpress/base3/
16. R. X. Arroyo, R. J. Harrington, S. P. Hartman, and T. Nguyen, "IBM POWER7 systems," IBM J. Res. & Dev., vol. 55, no. 3, pp. 2:1–2:13, May/Jun. 2011.
17. PCI Express Base Specification, Revision 2.1, Mar. 2009. [Online]. Available: http://www.pcisig.com/members/downloads/specifications/pciexpress/PCI_Express_Base_r2_1_04Mar09.pdf
18. Single Root I/O Virtualization and Sharing Specification, Revision 1.1, Jan. 2010. [Online]. Available: http://www.pcisig.com/specifications/iov/single_root/
19. Logical Partition Security in the IBM eServer pSeries 690, IBM white paper. [Online]. Available: http://www-1.ibm.com/servers/eserver/pseries/hardware/whitepapers/lpar_security.html
20. B. Blaner, B. Abali, B. M. Bass, S. Chari, R. Kalla, S. Kunkel, K. Lauricella, R. Leavens, J. J. Reilly, and P. A. Sandon, "IBM POWER7+ processor on-chip accelerators for cryptography and active memory expansion," IBM J. Res. & Dev., vol. 57, no. 6, pp. 3:1–3:16, Nov./Dec. 2013.
21. J. S. Liberty, A. Barrera, D. W. Boerstler, T. B. Chadwick, S. R. Cottier, H. P. Hofstee, J. A. Rosser, and M. L. Tsai, "True hardware random number generation implemented in the 32-nm SOI POWER7+ processor," IBM J. Res. & Dev., vol. 57, no. 6, pp. 4:1–4:7, Nov./Dec. 2013.
22. J. Stuecheli, B. Blaner, C. R. Johns, and M. S. Siegel, "CAPI: A Coherent Accelerator Processor Interface," IBM J. Res. & Dev., vol. 59, no. 1, paper 7, pp. 7:1–7:7, 2015.

Received March 17, 2014; accepted for publication April 12, 2014

William J. Starke IBM Systems and Technology Group, Austin, TX 78758 USA (wstarke@us.ibm.com). Mr. Starke joined IBM in 1990 after graduating from Michigan Technological University with a B.S. degree in computer science. He currently serves as the IBM Distinguished Engineer and Chief Architect for the Power processor storage hierarchy, and is responsible for shaping the processor cache hierarchy, symmetric multi-processor (SMP) interconnect, cache coherence, memory and I/O controllers, accelerators, and logical system structures for Power systems. He leads a large engineering team that spans multiple geographies. Mr. Starke has been employed by IBM for almost 25 years in several roles, spanning mainframe and Power systems performance analysis, logic design, and microarchitecture. Over the past decade, he has served as the storage hierarchy Chief Architect for POWER6, POWER7, POWER8, and follow-on design points that are currently in development. Mr. Starke holds approximately 200 U.S. patents.

Jeff Stuecheli IBM Systems and Technology Group, Austin, TX 78758 USA (jeffas@us.ibm.com). Dr. Stuecheli is a Senior Technical Staff Member in the Systems and Technology Group. He works in the area of server hardware architecture. His most recent work includes advanced memory architectures, cache coherence, and accelerator design. He has contributed to the development of numerous IBM products in the POWER architecture family, most recently the POWER8 design. He has been appointed an IBM Master Inventor, authoring about 100 patents. He received B.S., M.S., and Ph.D. degrees from The University of Texas at Austin in electrical engineering.

David M. Daly IBM Research Division, Thomas J. Watson Research Center, Yorktown Heights, NY 10598 USA (ddaly@ieee.org). Dr. Daly was a Research Staff Member at the IBM T. J. Watson Research Center. He received a B.S. degree in computer engineering from Syracuse University in 1998, and M.S. and Ph.D. degrees in electrical engineering from the University of Illinois at Urbana-Champaign in 2001 and 2005, respectively. He subsequently joined IBM at the IBM T. J. Watson Research Center, where he worked on next-generation POWER microprocessor design, including the POWER8 microprocessor, from 2008 to 2014. The main focus of his work has been the design and performance analysis of the memory subsystem using trace-driven, queuing, and spreadsheet analysis. Additional work has focused on system scalability, memory coherence, and network interconnects, as well as system reliability and availability. Dr. Daly is a senior member of the Institute of Electrical and Electronics Engineers (IEEE).

J. Steve Dodson IBM Systems and Technology Group, Austin, Texas 78758 USA (jsdodson@us.ibm.com). Mr. Dodson is a Senior Engineer in the POWER microprocessor development organization at IBM Austin. He received a B.S.E.E. degree from the University of Kentucky in 1982. He subsequently joined IBM, where he has worked on the development of I/O bridge chips, cache controllers, and memory controllers for POWER processors from the earliest POWER systems through the POWER8 processor. Mr. Dodson is currently a memory stack hardware architect for the POWER8 processor. He is an IBM Master Inventor and coauthor of over 200 issued U.S. patents.

Florian Auernhammer IBM Research - Zurich, Switzerland (fau@zurich.ibm.com). Dr. Auernhammer is a Research Staff Member in the Cloud and Computing Infrastructure department at IBM Research - Zurich. He received an M.S. degree in general engineering from Ecole Centrale Paris in 2005, and Dipl.-Ing. and Ph.D. degrees in electrical engineering from the Technical University of Munich in 2005 and 2011, respectively. He joined IBM at the IBM Research - Zurich Lab in 2006, where he has worked on I/O host bridges for InfiniBand and PCI Express, their efficient integration into coherent fabrics, and I/O virtualization. He is author or coauthor of six issued patents in addition to several currently pending. Dr. Auernhammer is a member of the Institute of Electrical and Electronics Engineers (IEEE).

Patricia M. Sagmeister IBM Research - Zurich, Switzerland (psa@zurich.ibm.com). Dr. Sagmeister is a Research Staff Member in the Cloud and Computing Infrastructure department at IBM Research - Zurich. She received a Dipl.-Inf. degree in computer science from the University of Passau in 1993, as well as a Ph.D. degree in computer science from the University of Stuttgart in 2000. In 2013, she received an M.B.A. degree from Warwick Business School. She joined IBM in 1999 at IBM Research - Zurich, where she has worked on various aspects of datacenter optimization. In 2008 and 2009, she had an assignment at the IBM Thomas J. Watson Research Center working specifically on POWER8 I/O architecture. She is author or coauthor of several patents and technical papers. Dr. Sagmeister is a senior member of the Institute of Electrical and Electronics Engineers (IEEE).

Guy L. Guthrie IBM Systems and Technology Group, Austin, TX 78758 USA (gguthrie@us.ibm.com). Mr. Guthrie is a Senior Technical Staff Member on the POWER Processor development team in the Systems and Technology Group and is an architect for the IBM POWER8 cache hierarchy, coherence protocol, SMP interconnect, and memory and I/O subsystems. He served in a similar role for the POWER4, POWER5, POWER6, and POWER7 programs as well. Prior to that, he worked as a hardware development engineer on several PCI (Peripheral Component Interconnect) host bridge designs and also worked in the IBM Federal Systems Division on a number of IBM Signal Processor development programs. He received his B.S. degree in electrical engineering from Ohio State University in 1985. Mr. Guthrie is an IBM Master Inventor and holds 184 issued U.S. patents.

Charles F. Marino IBM Systems and Technology Group, Austin, TX 78758 USA (marinoc@us.ibm.com). Mr. Marino received his B.S. degree in electrical and computer engineering from Carnegie-Mellon University. He is a Senior Engineer in the IBM Systems and Technology Group. In 1984, he joined IBM in Owego, New York. Mr. Marino is currently the interconnect team lead for the IBM POWER8 servers.

Michael Siegel IBM Systems and Technology Group, Research Triangle Park, NC 27709 USA (siegelm@us.ibm.com). Mr. Siegel is a Senior Technical Staff Member in the IBM Systems and Technology Group. He currently works as the hardware architect of coherent bus architectures developed for POWER system applications. In 2003, Mr. Siegel joined the PowerPC* development team to support the high-performance processor roadmap and create standard products for both internal and external customers. His roles included memory controller design lead and coherency bus design lead and architect, leading to the development of the PowerBus architecture in use in POWER processor chips starting with POWER7. He supported multiple projects, including the IBM POWER7 and POWER8 processors, and assisted in customer discussions for future game and joint chip development activity, incorporating new functions into the architecture as system requirements evolved. While working on the POWER8 PowerBus architecture, he served as a hardware architect of the processor side of the Coherent Attached Processor Interface, working with development teams spanning the processor chip and the first-generation coherently attached external coprocessor by specifying hardware behavior and microcode architecture. Prior to his work in the IBM Systems and Technology Group, Mr. Siegel worked in NHD developing the Rainier network processor, the IEEE 802.5 DTR standard, and IBM token ring switch products based on the standard. He started working for IBM in Poughkeepsie, New York, on the 3081 I/O subsystem and the ES/9000 Vector Facility. Mr. Siegel is an IBM Master Inventor and holds over 70 patents issued by the U.S. Patent Office.

Bart Blaner IBM Systems and Technology Group, Essex Junction, VT 05452 USA (blaner@us.ibm.com). Mr. Blaner earned
a B.S.E.E. degree from Clarkson University. He is a Senior Technical
Staff Member in the POWER development team of the Systems
and Technology Group. He joined IBM in 1984 and has held a
variety of design and leadership positions in processor and ASIC
development. Recently, he has led accelerator designs in POWER7+
and POWER8 platforms, including the Coherent Accelerator Processor
Proxy design. He is presently focused on the architecture and
implementation of hardware acceleration technologies spanning a
variety of applications for future POWER processors. He is an IBM
Master Inventor, a Senior Member of the Institute of Electrical and
Electronics Engineers (IEEE) and holds more than 30 patents.
