Transactional Memory Boosts Multiprocessing

IBM is the first company to build a commercial microprocessor that uses transactional memory, a new multicore-chip feature that researchers have studied for years. It replaces the current practice of locking a data address while performing a read-modify-write to that address, an approach that scales poorly as more processes work concurrently on the same data set.
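The difference between locking and an optimistic, transactional update can be illustrated with a small software analogy. The sketch below is not IBM's hardware mechanism: the names `VersionedCell`, `transactional_update`, and `run_demo` are hypothetical, and a version counter stands in for whatever conflict tracking the hardware performs. Each thread reads a value, computes a new one without holding a lock, and commits only if nobody else committed in the meantime; otherwise it retries.

```python
import threading


class VersionedCell:
    """One memory location plus a version counter (illustrative only)."""

    def __init__(self, value):
        self.value = value
        self.version = 0
        # Guards only the tiny read and compare-and-commit steps; real
        # transactional-memory hardware would not need a software lock here.
        self._guard = threading.Lock()

    def read(self):
        with self._guard:
            return self.value, self.version

    def try_commit(self, new_value, seen_version):
        """Commit new_value only if no other commit happened since the read."""
        with self._guard:
            if self.version != seen_version:
                return False  # conflict: another transaction committed first
            self.value = new_value
            self.version += 1
            return True


def transactional_update(cell, fn):
    """Apply fn to the cell's value, retrying on conflict. Returns retry count."""
    retries = 0
    while True:
        value, version = cell.read()
        if cell.try_commit(fn(value), version):
            return retries
        retries += 1  # discard the stale result and re-execute


def run_demo(n_threads=4, increments=1000):
    """Concurrent increments; no update is lost despite the lack of a long-held lock."""
    cell = VersionedCell(0)

    def worker():
        for _ in range(increments):
            transactional_update(cell, lambda v: v + 1)

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return cell.value
```

The work inside `fn` runs without excluding other threads, so uncontended updates proceed in parallel; only a detected conflict costs a re-execution.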
The central point of coherence on Blue Gene/Q is the 32MB L2 cache, which is 16-way set associative. In the multiversioned design, the data is tagged with additional bits, which allow multiple versions of a memory location to be maintained and tracked by scoreboard logic. The tag bits are used to detect any load/store conflicts for each transaction. If no such conflicts are found, the transaction can be committed. The success or failure of a transaction is recorded in registers. If a conflict does appear, the system software must resolve it, typically by invalidating and then re-executing the transaction.

The PowerPC A2 core design emphasizes throughput and energy efficiency; it runs at a quick but not blazing clock speed (1.6GHz). The A2 core is a multithreaded, in-order design that includes a dynamic branch predictor. To further reduce power, the chip makes extensive use of clock gating. These changes reduce the processor's total power to 55W (maximum) while enabling sufficient performance for supercomputer applications.

The Blue Gene/Q core really shines on floating-point applications. This exceptional performance comes from its quad-pipe floating-point unit. Each cycle, the quad FPU, shown in Figure 2, can serve as a simple scalar FPU or a four-wide SIMD FPU, or it can perform two complex-arithmetic SIMD operations. All of these operations can be single- or double-precision. The processor can execute up to eight double-precision floating-point operations, based on a fused multiply-add (FMA), along with an FP load and an FP store in a single cycle.

Figure 2. Quad FPU in IBM's PowerPC A2 core. With four multiply-add units, the quad FPU can perform eight double-precision floating-point operations per cycle.
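The per-core peak follows directly from the figures above: four multiply-add units, each counted as two floating-point operations (one multiply, one add), at 1.6GHz. The short calculation below is a sketch of that arithmetic; the constant and function names are illustrative, and counting an FMA as two operations is the usual convention rather than something the article spells out.

```python
FMA_UNITS = 4          # multiply-add units in the quad FPU (from the text)
FLOPS_PER_FMA = 2      # assumption: one multiply plus one add per FMA
CLOCK_HZ = 1.6e9       # 1.6GHz core clock (from the text)


def flops_per_cycle(units=FMA_UNITS, flops_per_fma=FLOPS_PER_FMA):
    """Double-precision operations one core can issue per cycle."""
    return units * flops_per_fma


def core_peak_gflops(clock_hz=CLOCK_HZ):
    """Peak double-precision gigaflops for a single core."""
    return flops_per_cycle() * clock_hz / 1e9


if __name__ == "__main__":
    print(flops_per_cycle(), "flops/cycle,", core_peak_gflops(), "gigaflops per core")
```

Under these assumptions a single A2 core peaks at about 12.8 gigaflops.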
As Table 1 shows, Blue Gene/Q represents a tremendous upgrade in performance over its predecessors, owing in large part to increases in both the core count and clock frequency. The Blue Gene/Q chip delivers 15 times the peak flops of the previous Blue Gene/P and 36 times that of the original Blue Gene/L.

Processor       Blue Gene/L      Blue Gene/P      Blue Gene/Q
Core Type       PowerPC 440      PowerPC 450      PowerPC A2
Cores           2                4                18
Core Speed      700MHz           850MHz           1600MHz
Peak FP Perf    5.6 gigaflops    13.6 gigaflops   205 gigaflops
Date            2004             2007             2012

Table 1. Three generations of Blue Gene processors. The Blue Gene/Q processor represents the largest increase in performance. (Source: IBM and HPCWire)
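The generation-to-generation ratios quoted in the text can be checked against the Table 1 figures with a few lines of arithmetic. The sketch below uses only numbers from the table plus the eight-operations-per-cycle figure given for the quad FPU; the dictionary layout and function names are illustrative. It also shows that the 205-gigaflop peak corresponds to roughly 16 cores' worth of floating-point throughput rather than all 18 listed cores. The excerpt does not explain the difference; Blue Gene/Q is generally described as using 16 cores for computation, with one for the operating system and one as a spare.

```python
# Figures copied from Table 1.
CHIPS = {
    "Blue Gene/L": {"cores": 2, "ghz": 0.70, "peak_gflops": 5.6},
    "Blue Gene/P": {"cores": 4, "ghz": 0.85, "peak_gflops": 13.6},
    "Blue Gene/Q": {"cores": 18, "ghz": 1.60, "peak_gflops": 205.0},
}


def speedup(newer, older, chips=CHIPS):
    """Ratio of peak floating-point performance between two generations."""
    return chips[newer]["peak_gflops"] / chips[older]["peak_gflops"]


def cores_implied_by_peak(name, flops_per_cycle, chips=CHIPS):
    """How many cores the quoted peak corresponds to at a given flops/cycle."""
    chip = chips[name]
    return chip["peak_gflops"] / (flops_per_cycle * chip["ghz"])


if __name__ == "__main__":
    print(f"Q vs P: {speedup('Blue Gene/Q', 'Blue Gene/P'):.1f}x")
    print(f"Q vs L: {speedup('Blue Gene/Q', 'Blue Gene/L'):.1f}x")
    print(f"Cores implied by Q peak: {cores_implied_by_peak('Blue Gene/Q', 8):.2f}")
```

The ratios come out to roughly 15 and 36.6, consistent with the rounded "15 times" and "36 times" in the text.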
Blue Gene Is Very Green

The supercomputer community is already aware of Blue Gene/Q. A prototype computer using 512 processor chips is the top-ranked system on the latest Green500 list, which ranks supercomputers on the basis of Linpack performance per watt. The prototype has a rating of 2.1 gigaflops per watt, beating even the newest GPU-accelerated machines.

Two U.S. Department of Energy labs are building the most powerful Blue Gene systems ever deployed: the 10-petaflop Mira system at Argonne National Laboratory and the 20-petaflop Sequoia supercomputer at Lawrence Livermore National Laboratory. Both will rank among the world's fastest supercomputers, and Sequoia could take the top spot in 2012.

The new processor will elevate the Blue Gene franchise to tens of petaflops. Blue Gene/Q introduces new architectural concepts like transactional memory that enable better scaling with large numbers of threads. It embodies the manycore model that is becoming more prevalent in HPC, but adds HPC-specific optimizations. These features enable industry-leading flops per watt, which will help Blue Gene/Q systems triple the performance of today's top supercomputers while paving the way for future exaflop (1,000-petaflop) systems.
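To see why flops per watt matters at these scales, the rough estimate below converts a performance target into a power budget at the prototype's 2.1 gigaflops per watt. This is back-of-the-envelope arithmetic under strong assumptions that the article does not make: that the prototype's efficiency holds unchanged at full system scale, and that the quoted petaflop ratings can be treated like the sustained Linpack figure used by the Green500. The function name is illustrative.

```python
PROTOTYPE_GFLOPS_PER_WATT = 2.1  # Green500 rating quoted in the text


def power_megawatts(petaflops, gflops_per_watt=PROTOTYPE_GFLOPS_PER_WATT):
    """Estimated power draw (MW) for a system of the given performance.

    Assumes constant efficiency at any scale, which is a simplification.
    """
    gflops = petaflops * 1e6          # 1 petaflop = 1,000,000 gigaflops
    watts = gflops / gflops_per_watt
    return watts / 1e6


if __name__ == "__main__":
    for label, pf in [("10-petaflop system", 10),
                      ("20-petaflop system", 20),
                      ("exaflop system", 1000)]:
        print(f"{label}: about {power_megawatts(pf):.1f} MW")
```

Under these assumptions a 20-petaflop machine would draw on the order of 10 megawatts, and an exaflop machine at the same efficiency would need several hundred, which suggests why further efficiency gains are needed on the way to exaflop systems.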