


Guest Editorial Contribution Linley Group
Blue Gene/Q Is Super Efficient
IBM's Innovative Multiversioned L2 Cache Speeds Concurrency

By Kevin Krewell, Senior Analyst, The Linley Group

At the 2011 Hot Chips conference, IBM revealed the architecture behind the Q version of its Blue Gene supercomputer processors, which is already deployed in prototype systems. Blue Gene/Q is scheduled to power a pair of upcoming multi-petaflop supercomputers that are expected to be among the fastest in the world in 2012-2013. Although the market for dedicated supercomputer processors is not large, each of these installations will consume more than 100,000 of the processor chips for computing and I/O processing. Not only is Blue Gene/Q on the leading edge of performance, it is a test bed for processor innovations like transactional memory.

IBM's Blue Gene chip-design manager, Ruud Haring, spoke at Hot Chips. Blue Gene/Q is the first production microprocessor to integrate transactional memory, which should speed multiprocessing by eliminating the need for memory locks. Transactional memory guarantees correctness in hardware without incurring long core-to-L2 cache turnaround delays. In addition, the L2 cache is built using embedded DRAM (eDRAM) instead of SRAM.

As Figure 1 shows, the Blue Gene/Q processor has 18 cores on die, 16 of which are available for application programs. Blue Gene/Q has two extra cores that are not software-visible. One of them runs its own kernel, offloading interrupts and other bursty traffic from the operating system. The last core is a spare that can replace any defective core, improving die yield. The cores are equipped with multimode prefetch units to reduce average memory latency.

Each of Blue Gene/Q's processor cores has a quad floating-point unit (FPU) that, combined with the numerous cores on the chip, gives it exceptional floating-point performance: over 200 gigaflops when running at 1.6GHz.

Figure 1. Die photo of the Blue Gene/Q SoC. The 18 cores, which IBM calls processor units (PUs), are numbered from 00 to 17 and form a horseshoe shape around the die. Inside this horseshoe are the 16 L2 cache segments. On either side are DDR3 memory interfaces. The die measures 360mm² and contains 1.47 billion transistors in IBM's 45nm SOI process. (Photo courtesy of IBM)

Transactional Memory Boosts Multiprocessing

IBM is the first company to build a commercial microprocessor that uses transactional memory, a new multicore-chip feature that researchers have studied for years. It replaces the current practice of locking a data address while performing a read-modify-write to that address, an approach that scales poorly as more processes work concurrently on the same data set.
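The contrast between locking and transactions can be illustrated with a rough software analogy. The sketch below is not IBM's hardware mechanism: `VersionedCell`, `try_commit`, and `transactional_add` are hypothetical names, and the small internal lock only stands in for the atomic commit step that hardware would provide. Each update reads a value together with a version tag, computes without holding a lock, and commits only if the version is unchanged; otherwise it retries.

```python
import threading


class VersionedCell:
    """Hypothetical memory cell tagged with a version number."""

    def __init__(self, value=0):
        self.value = value
        self.version = 0
        # Stand-in for a hardware-atomic commit; held only briefly.
        self._commit_guard = threading.Lock()

    def read(self):
        # Take a consistent snapshot of the value and its version tag.
        with self._commit_guard:
            return self.value, self.version

    def try_commit(self, new_value, seen_version):
        # Commit only if no other writer changed the cell since the read.
        with self._commit_guard:
            if self.version != seen_version:
                return False  # conflict detected: caller must retry
            self.value = new_value
            self.version += 1
            return True


def transactional_add(cell, delta):
    """Optimistic read-modify-write; returns the number of retries."""
    retries = 0
    while True:
        value, version = cell.read()
        if cell.try_commit(value + delta, version):
            return retries
        retries += 1  # discard the stale result and re-execute


def run_workers(n_threads=4, n_increments=1000):
    """Run concurrent increments and return the final cell value."""
    cell = VersionedCell()

    def worker():
        for _ in range(n_increments):
            transactional_add(cell, 1)

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return cell.value
```

In this analogy, threads that do not collide never wait on one another during the computation itself; only a detected conflict costs a retry.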


The central point of coherence on Blue Gene/Q is the 32MB L2 cache, which is 16-way set associative. In the multiversioned design, the data is tagged with additional bits, which allow multiple versions of a memory location to be maintained and tracked by scoreboard logic. The tag bits are used to detect any load/store conflicts for each transaction. If no such conflicts are found, the transaction can be committed. The success or failure of a transaction is recorded in registers. If a conflict does appear, the system software must resolve it, typically by invalidating and then re-executing the transaction.

The PowerPC A2 core design emphasizes throughput and energy efficiency; it runs at a quick but not blazing clock speed (1.6GHz). The A2 core is a multithreaded, in-order design that includes a dynamic branch predictor. To further reduce power, the chip makes extensive use of clock gating. These design choices hold the processor's total power to 55W (maximum) while enabling sufficient performance for supercomputer applications.

The Blue Gene/Q core really shines on floating-point applications. This exceptional performance comes from its quad-pipe floating-point unit. Each cycle, the quad FPU, shown in Figure 2, can serve as a simple scalar FPU or a four-wide SIMD FPU, or it can perform two complex-arithmetic SIMD operations. All of these operations can be single- or double-precision.

Figure 2. Quad FPU in IBM's PowerPC A2 core. With four multiply-add units, the quad FPU can perform eight double-precision floating-point operations per cycle.

The processor can execute up to eight double-precision floating-point operations, based on a fused multiply-add (FMA), along with an FP load and an FP store in a single cycle.
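The per-chip peak follows from these figures by simple arithmetic. The sketch below is a back-of-the-envelope check, assuming that each fused multiply-add counts as two floating-point operations and that only the 16 application cores mentioned earlier in the article contribute to the peak; the constant and function names are illustrative.

```python
# Figures taken from the article text.
FMA_UNITS_PER_CORE = 4   # quad FPU: four multiply-add units
CLOCK_GHZ = 1.6          # core clock frequency
APPLICATION_CORES = 16   # cores available to application programs

# Assumption: one fused multiply-add counts as a multiply plus an add.
FLOPS_PER_FMA = 2


def flops_per_cycle(fma_units=FMA_UNITS_PER_CORE, flops_per_fma=FLOPS_PER_FMA):
    """Double-precision operations one core can retire per cycle."""
    return fma_units * flops_per_fma


def peak_gigaflops(cores=APPLICATION_CORES, clock_ghz=CLOCK_GHZ):
    """Peak chip throughput in gigaflops: cores x GHz x flops per cycle."""
    return cores * clock_ghz * flops_per_cycle()


if __name__ == "__main__":
    print(f"Per core, per cycle: {flops_per_cycle()} flops")
    print(f"Per chip peak: {peak_gigaflops():.1f} gigaflops")
```

Under these assumptions the result is 204.8 gigaflops, which agrees with the article's "over 200 gigaflops" at 1.6GHz.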

As Table 1 shows, Blue Gene/Q represents a tremendous upgrade in performance over its predecessors, owing in large part to increases in both the core count and clock frequency. The Blue Gene/Q chip delivers 15 times the peak flops of the previous Blue Gene/P and 36 times that of the original Blue Gene/L.

Processor      Blue Gene/L     Blue Gene/P      Blue Gene/Q
Core Type      PowerPC 440     PowerPC 450      PowerPC A2
Cores          2               4                18
Core Speed     700MHz          850MHz           1600MHz
Peak FP Perf   5.6 gigaflops   13.6 gigaflops   205 gigaflops
Date           2004            2007             2012

Table 1. Three generations of Blue Gene processors. The Blue Gene/Q processor represents the largest increase in performance. (Source: IBM and HPCWire)
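The ratios quoted for Table 1 can be reproduced from the table itself, and the same data shows how the gain splits between core count, clock speed, and work per cycle. This is a sketch under one assumption not stated in the table: for Blue Gene/Q, only 16 of the 18 cores are counted, since the article says two are not available to applications. The dictionary layout and function names are illustrative.

```python
# Values copied from Table 1; "compute_cores" for Blue Gene/Q is an
# assumption (16 application cores out of the 18 on die).
GENERATIONS = {
    "Blue Gene/L": {"compute_cores": 2, "clock_ghz": 0.70, "peak_gflops": 5.6},
    "Blue Gene/P": {"compute_cores": 4, "clock_ghz": 0.85, "peak_gflops": 13.6},
    "Blue Gene/Q": {"compute_cores": 16, "clock_ghz": 1.60, "peak_gflops": 205.0},
}


def speedup(newer, older):
    """Ratio of peak floating-point performance between two generations."""
    return GENERATIONS[newer]["peak_gflops"] / GENERATIONS[older]["peak_gflops"]


def flops_per_core_cycle(name):
    """Peak flops divided by (cores x clock): work per core per cycle."""
    g = GENERATIONS[name]
    return g["peak_gflops"] / (g["compute_cores"] * g["clock_ghz"])


def gain_factors(newer, older):
    """Split the speedup into core-count, clock, and per-cycle factors."""
    n, o = GENERATIONS[newer], GENERATIONS[older]
    return {
        "cores": n["compute_cores"] / o["compute_cores"],
        "clock": n["clock_ghz"] / o["clock_ghz"],
        "per_cycle": flops_per_core_cycle(newer) / flops_per_core_cycle(older),
    }


if __name__ == "__main__":
    print(f"Q vs P: {speedup('Blue Gene/Q', 'Blue Gene/P'):.1f}x")
    print(f"Q vs L: {speedup('Blue Gene/Q', 'Blue Gene/L'):.1f}x")
    for part, factor in gain_factors("Blue Gene/Q", "Blue Gene/P").items():
        print(f"  {part}: {factor:.2f}x")
```

With these inputs, the Blue Gene/Q gain over Blue Gene/P comes out to roughly 4x from cores, a little under 2x from clock frequency, and about 2x from the wider FPU, which multiply to the roughly 15x quoted above.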

Blue Gene Is Very Green

The supercomputer community is already aware of Blue Gene/Q. A prototype computer using 512 processor chips is the top-ranked system on the latest Green500 list, which ranks supercomputers on the basis of Linpack performance per watt. The prototype has a rating of 2.1 gigaflops per watt, beating even the newest GPU-accelerated machines.

Two U.S. Department of Energy labs are building the most powerful Blue Gene systems ever deployed: the 10-petaflop Mira system at Argonne National Laboratory and the 20-petaflop Sequoia supercomputer at Lawrence Livermore National Laboratory. These supercomputers will both rank among the world's fastest supercomputers, and Sequoia could take the top spot in 2012.

The new processor will elevate the Blue Gene franchise to tens of petaflops. Blue Gene/Q introduces new architectural concepts like transactional memory that enable better scaling with large numbers of threads. It embodies the manycore model that is becoming more prevalent in HPC, but adds HPC-specific optimizations. These features enable industry-leading flops per watt, which will help Blue Gene/Q systems triple the performance of today's top supercomputers while paving the way for future exaflop (1,000-petaflop) systems.
