
Multicore Architectures

A. P. Shanthi
Department of Computer Science
CEG Campus, Anna University.
Agenda

 The era of Microprocessors
 Motivation for Multicore Architectures
 Types of Multicore Architectures
 Case Studies
 Intel
 IBM
 Sun
 AMD and others
 Challenges Ahead
 Summary
The All Familiar Moore’s Law

[Figure: Moore's Law transistor-count trend, © Intel]
Transistor count doubles roughly every 18–24 months


20 years of Microprocessor Evolution
• Performance has grown 1000x
• Transistor density gains from Moore’s Law
have driven
– increases in transistor speed from higher clock
rates
– energy scaling
– microarchitectural advances from additional
transistors
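The 1000x-in-20-years figure implies a steady compound growth rate; a quick back-of-the-envelope check (a sketch, not from the slides):

```python
# Annualized growth rate implied by a 1000x performance gain over 20 years.
def annual_growth(total_gain, years):
    """Return the constant per-year factor that compounds to total_gain."""
    return total_gain ** (1.0 / years)

rate = annual_growth(1000, 20)
print(f"~{(rate - 1) * 100:.0f}% per year")  # roughly 41% per year
```

This matches the often-quoted ~40-50% annual single-thread performance growth of that era.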
Microprocessor Evolution Contd.
• Rising transistor count enables more
functionality
• Why not make single cores more
sophisticated?
– let’s take a look at a few microprocessors for some
intuition …
Pentium 4 (Super-Pipelined, Super-Scalar)
Opteron Pipeline (Super-Pipelined, Super-Scalar)
Evolution in a Nutshell
Why the Roadblock?
Superscalar Designs
Circuit Technology Impact
Circuit Technology Impact Contd.
• Higher clock rates
– increase power consumption
• proportional to f and V²
• higher frequency needs higher voltage
• Small structures: Energy loss by leakage
– increase heat output and cooling requirements
– limit chip size (speed of light)
– at fixed technology (e.g. 65 nm)
• Smaller number of transistor levels per pipeline stage possible
• More, simpler pipeline stages (P4: >30 stages)
• Higher penalty for pipeline stalls
(on conflicts, e.g. branch misprediction)
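The power relation on the previous slide can be made concrete. Below is a sketch of dynamic CMOS power P ∝ C·V²·f; the capacitance value and the 1.3x voltage bump needed for a 2x frequency increase are illustrative assumptions, not figures from the slides:

```python
# Dynamic CMOS switching power: P ~ C * V^2 * f
def dynamic_power(c, v, f):
    """c: switched capacitance, v: supply voltage, f: clock frequency."""
    return c * v * v * f

# Doubling frequency usually also needs a higher voltage, so power grows
# much faster than linearly: f x2 at V x1.3 costs ~3.4x the power.
base = dynamic_power(1.0, 1.0, 1.0)
fast = dynamic_power(1.0, 1.3, 2.0)
print(round(fast / base, 2))  # 3.38
```

This super-linear cost is exactly why clock scaling hit the "thermal wall" discussed later in the deck.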
ILP Concerns
Sources of Wasted Issue Slots
Simulations of an 8-issue Superscalar
What Next?
• If not increase in issue width, what else?
• Alternatives
– Single chip multiprocessors
• Replicate processors
– Exploit thread level parallelism
• Fine grained and coarse grained parallelism
• How to decide?
• Best approach depends on the application
characteristics
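For workloads with request-level parallelism, the thread-level approach can be sketched in a few lines: independent requests are spread over a pool of worker threads rather than squeezed through one wider core. The `handle_request` function is a hypothetical stand-in for a server workload:

```python
# Thread-level parallelism (TLP) sketch: independent client requests are
# handled concurrently by a pool of workers, one natural fit for CMPs.
from concurrent.futures import ThreadPoolExecutor

def handle_request(req_id):
    # stand-in for an independent client request (no shared state)
    return req_id * req_id

with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(handle_request, range(8)))
print(results)  # [0, 1, 4, 9, 16, 25, 36, 49]
```

The requests need no coordination, which is why simple replicated cores exploit this parallelism more cheaply than a wider superscalar pipeline would.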
Single-core CPU chip
[Figure: a single-core CPU chip]
Single Chip Multiprocessors
• Replicate multiple processor cores on a single
die.

[Figure: a multi-core CPU chip, with Core 1 … Core 4 on a single die]
Comparing Alternative Designs
Performance Comparisons
Exploiting TLP : Threading Alternatives
SMT – Maximizing on-chip Parallelism
Can we combine Multiple Cores and SMT?
IBM Power 7
Types of Multicore Architectures
• Two general types of multicore or chip multiprocessor (CMP) architectures
– Homogeneous CMPs – all processing elements (PEs) are the same
– Heterogeneous CMPs – comprise different PEs
• specialized accelerator cores
– SIMD
– GPU operations
– Cryptography
– DSP functions (e.g. FFT)
– FPGA (programmable circuits)
• Homogeneous processors (2, 4, 6, 8, … cores) for PCs are now available from
all major manufacturers
• Heterogeneous CMPs are available in the form of multiprocessor systems-on-chip (MPSoCs)
Examples of Heterogeneous Systems
Case Studies
• Intel Processors
• IBM’s CellBE
• Sun’s Niagara
• AMD and others
Intel x86 Development Changes Course …
• May 17, 2004 … Intel acknowledged that it had hit a "thermal
wall"
• Disbanded one of its most advanced design groups and said it
would abandon two advanced chip development projects …
• Now, Intel is embarked on a course already adopted by some
of its major rivals: obtaining more computing power by
stamping multiple processors on a single chip rather than
straining to increase the speed of a single processor

New York Times, May 17, 2004


Beginning of Intel’s Multicore
Architectures
• Intel Core
• Yonah 0.065 µm (65 nm) process technology
– Introduced January 2006
– 533/667 MHz front side bus
– 2 MB L2 cache
– SSE3 SIMD instructions
– 31W
– Variants:
• Intel Core Duo T2700 2.33 GHz
• Intel Core Duo T2600 2.16 GHz
Intel’s Multicore Architectures Contd.

• Several dual-core, quad-core, six-core and
many-core processors introduced using the
65 nm, 45 nm, 32 nm and 22 nm
technologies for various application
domains
• Example case study of one of the latest
architectures follows
Intel’s i7
• The Intel Nehalem microarchitecture uses a
45nm fabrication process for different
processors in the Core i7 family
Intel’s i7-Salient Features

• New Platform Architecture
• Higher-Performance Multiprocessor Systems with QPI
• CPU Performance Boost via Intel Turbo Boost Technology
• Improved Cache Latency with Smart L3 Cache
• Optimized Multithreaded Performance through Hyper-Threading
• Intelligent Power Technology
• Higher Data-Throughput via PCI Express 2.0 and DDR3 Memory Interface
• Improved Virtualization Performance
• Remote Management of Networked Systems with Intel Active
Management Technology
New Platform Architecture
• Penryn architecture - the front-side bus (FSB) was
the interface for exchanging data between the CPU
and the northbridge.
• Nehalem microarchitecture - the memory controller
and PCI Express controller shifted from the
northbridge onto the CPU die, reducing the number
of external databus transfers that the data had to
traverse.
• Helps increase data-throughput and reduce the
latency for memory and PCI Express data
transactions.
New Platform Architecture Contd.
Higher Performance with QPI
• QPI is the new point-to-point interconnect for
connecting a CPU to either a chipset or
another CPU.
• Provides up to 25.6 GB/s of total bidirectional
data throughput per link.
• Nehalem is a distributed shared memory
architecture using QPI.
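The 25.6 GB/s figure follows from the commonly quoted QPI link parameters: 6.4 GT/s, a 2-byte-wide data payload per direction, and two directions per link. A quick check of the arithmetic:

```python
# QPI link bandwidth: 6.4 GT/s x 2 bytes/transfer/direction x 2 directions
transfers_per_s = 6.4e9      # giga-transfers per second
bytes_per_transfer = 2       # 16 data bits per direction
directions = 2               # full-duplex link
total_gb_s = transfers_per_s * bytes_per_transfer * directions / 1e9
print(total_gb_s)  # 25.6
```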
Higher Performance with QPI Contd.
Turbo Boost Technology
• Provides a performance boost for lightly threaded applications and optimizes the
processor power consumption
• An innovative feature that automatically allows active processor cores to run faster
than the base operating frequency when certain conditions are met
• Intel Turbo Boost is activated when the OS requests the highest processor
performance state.
• The maximum frequency of the specific processing core on the Core i7 processor is
dependent on the number of active cores, and the amount of time the processor
spends in the Turbo Boost state depends on the workload and operating
environment.
• The duration of time that the processor spends in a specific Turbo Boost state
depends on how soon it reaches thermal, power, and current thresholds.
Turbo Boost Technology Contd.
Turbo Boost Technology Contd.

No. of Active Cores   Mode          Base Frequency   Max. Turbo Boost Frequency
4                     Quad Core     1.73 GHz         2.00 GHz
2                     Dual Core     1.73 GHz         2.08 GHz
1                     Single Core   1.73 GHz         3.06 GHz
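The table's behaviour can be captured as a tiny lookup: fewer active cores leave more thermal headroom, so the remaining cores may clock higher. A sketch using the example part's figures from the table:

```python
# Turbo Boost ceiling vs. number of active cores (figures from the
# example Core i7 part in the table above).
MAX_TURBO_GHZ = {4: 2.00, 2: 2.08, 1: 3.06}  # active cores -> max GHz
BASE_GHZ = 1.73

def turbo_headroom(active_cores):
    """Frequency uplift over base frequency for a given active-core count."""
    return MAX_TURBO_GHZ[active_cores] - BASE_GHZ

print(round(turbo_headroom(1), 2))  # 1.33: biggest boost with one core active
```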
Improved Cache Latency with Smart L3
Cache
• L3 cache can be up to 12 MB in size
• L1 cache: 32 KB for instructions and 32 KB for data, per core
• L2 cache: 256 KB per core
• The L3 cache is shared across all cores and its inclusive nature
helps increase performance and reduces latency by reducing
cache snooping traffic to the processor cores.
• An inclusive shared L3 cache guarantees that on an L3 miss the data
cannot be present in the local caches of the other cores, which eliminates
unnecessary cache snooping.
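The snoop-filtering property of inclusion can be shown with a toy model (a sketch; the tag set below is invented for illustration): because every line in any core's L1/L2 is also in L3, an L3 miss proves no private cache holds the line, so the snoop can be skipped entirely.

```python
# Inclusive L3 as a snoop filter: snoop other cores only on an L3 hit.
def needs_snoop(addr, l3_tags):
    """An L3 miss guarantees no core's private cache holds the line."""
    return addr in l3_tags

l3 = {0x100, 0x140}                # lines currently resident in shared L3
print(needs_snoop(0x100, l3))      # True: another core may hold a copy
print(needs_snoop(0x200, l3))      # False: fetch straight from memory
```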
Improved Cache Latency with Smart L3
Cache Contd.
Intelligent Power Technology

• Intelligent power gates reduce idling
processor cores to near-zero power
• Idle power consumption reduced to 10 watts,
compared to 16 to 50 watts in earlier designs
• Automated low-power states put the
processor and memory into the lowest power
state allowable for a workload
Higher Data-Throughput via PCI Express 2.0
and DDR3 Memory Interface
• PCI Express 2.0 databus doubles the data throughput
from PCI Express 1.0 while maintaining full hardware
and software compatibility with PCI Express 1.0.
• Has a maximum throughput of 8 GB/s per direction for a x16 link.
• Features multiple DDR3 1333 MHz memory
channels.
• A system with two channels of DDR3 1333 MHz RAM
has a memory bandwidth of 21.3 GB/s.
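Both throughput figures on this slide follow from standard interface parameters; a quick check of the arithmetic (using the usual 64-bit DDR3 channel width and PCIe 2.0's 8b/10b encoding overhead):

```python
# DDR3-1333: 1333 MT/s x 8 bytes/transfer x 2 channels ~= 21.3 GB/s
ddr3_gb_s = 1333e6 * 8 * 2 / 1e9

# PCIe 2.0 x16: 5 GT/s/lane x 16 lanes x 8/10 encoding / 8 bits/byte
pcie_gb_s = 5e9 * 16 * 0.8 / 8 / 1e9

print(round(ddr3_gb_s, 1), round(pcie_gb_s, 1))  # 21.3 8.0
```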
Improved Virtualization Performance

• Virtualization enables running multiple OSs
side-by-side on the same processing hardware
• New features such as hardware-assisted page-
table management and directed I/O in the
Core i7 processors and their chipsets allow
software to further improve performance in
virtualized environments.
Remote Management of Networked Systems
with Intel Active Management Technology
• AMT provides system administrators the ability to
remotely monitor, maintain, and update systems
• Intel AMT is part of the Intel Management Engine, which
is built into the chipset of a Nehalem-based system
• This feature allows administrators to boot systems from a
remote media, track hardware and software assets, and
perform remote troubleshooting and recovery
Other Features
• Enhanced Branch Prediction Features
– New second level BTB
– New renamed Return Stack Buffer (RSB)
• 7 new application-targeted accelerators for
accelerated string and text processing
– Parsing of XML strings and text
i7 6-Core Processor
• The i7-980X (previously code-named Gulftown)
brings Intel's Turbo Boost and Hyper-Threading
technologies to the 32 nm process, and is also
Intel's first processor with six physical cores.
• L3 cache size of 12 MB
• 12 threads instead of 8
Intel Sandy Bridge (January 2011)
Intel’s MIC
• Many Integrated Core Architecture
• Codenamed Knights Corner, it uses the
22-nanometer manufacturing process, scaling to more
than 50 Intel processing cores on a single chip
• The Intel Xeon Phi coprocessor is the first product
based on Intel MIC architecture
• Targets HPC segments such as oil exploration,
scientific research, financial analyses, and climate
simulation, among many others
• Single-chip Cloud Computer
Intel’s MIC Contd.
Sun Niagara-1
Motivation for Niagara
• Designed for high performance server
applications
– Have client request-level parallelism (TLP)
• For such workloads, shared-memory single-issue CPUs perform
better than complex multiple-issue CPUs
• Combines CMP and fine-grained
multithreading
Design Goals
Niagara-1 at a Glance
Block Diagram
Core Details
SPARC Pipeline Features
Niagara-1’s SPARC Pipeline
Niagara-1’s SPARC Pipeline Contd.
• Fetch:
– 64-entry ITLB access
– Two instructions per cycle with a predecode bit
• Thread select:
– Selects the thread to fetch and decode
• Decode:
– Decode and register file access
– Forwarding for data-dependent instructions
• Execute:
– ALU, shift : single cycle
– mul, div : long latency and causes thread switch
• Memory:
– DTLB access
– Can cause a late trap which flushes all subsequently fetched
instructions from the same thread and a thread switch
• Write back
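The thread-select stage described above can be sketched as a round-robin scheduler that skips stalled threads. This is an illustrative model, not Niagara's actual selection logic: each cycle the pipeline issues from the next ready thread, and a thread that hit a long-latency op (load miss, mul, div) is marked not-ready and passed over.

```python
# Fine-grained multithreading sketch (Niagara style): round-robin issue
# among ready threads; stalled threads are skipped until ready again.
def schedule(threads, cycles):
    """threads: dict thread_id -> ready flag; returns the issue order."""
    order, ids = [], sorted(threads)
    i = 0
    for _ in range(cycles):
        for _ in range(len(ids)):          # scan for the next ready thread
            tid = ids[i % len(ids)]
            i += 1
            if threads[tid]:
                order.append(tid)
                break
        else:
            order.append(None)             # all threads stalled: bubble

    return order

t = {0: True, 1: False, 2: True, 3: True}  # thread 1 stalled on a miss
print(schedule(t, 6))  # [0, 2, 3, 0, 2, 3]
```

The stalled thread costs no issue slots: the other three threads keep the single pipeline busy, which is the whole point of the design.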
Thread Selection Logic
Niagara-1’s Memory Subsystem
Highlights of the Memory System
• No external chipset needed because of the 4 on-chip
memory controllers
– Much faster: 20 GB/s vs. 6.4 GB/s with an off-chip memory controller
• Shared L2 cache
– Eliminates snooping
– Provides a single unified address space to the OS
• Crossbar enables 8 simultaneous memory
addresses per cycle (one per core)
It is Open Source!
Niagara-1 vs Niagara-2
Latest SPARC Processors
Introduction

• The Cell concept was originally
conceived by Sony Computer
Entertainment Inc. of Japan, for the
PlayStation 3
• The architecture as it exists today was
the work of three companies: Sony,
Toshiba and IBM

http://www.blachford.info/computer/Cell/Cell0_v2.html
http://domino.research.ibm.com/comm/research.nsf/pages/r.arch.innovation.html
Cell Processor
• 1 Power Processor Element (PPE)
• 8 Synergistic Processor Elements (SPEs)
• Element Interconnect Bus (EIB)
• Direct Memory Access Controller (DMAC)
• 2 Rambus XDR memory controllers
• Rambus FlexIO (Input / Output) interface
Power Processor Element (PPE)

• 64-bit "Power Architecture" processor with 512 KB
cache
• Dual issue, dual threaded, in-order processor
• Control unit for the SPEs
• Runs the OS and most of the applications but
compute intensive parts of the OS and
applications will be offloaded to the SPEs
• RISC architecture
• Uses considerably less power than other
PowerPC devices, even at higher clock rates
PPE Major Units
PPE Pipeline
Synergistic Processing Element (SPE)

• Dual issue, 128-bit 4-way SIMD
– Vector processing
• 4 integer units + 4 FP units
• 8-, 16-, 32-bit integer + 32-, 64-bit FP
• 128 x 128-bit registers
• 256 KB local-store memory (specially designed)
– Caches are not used
– Data & instructions reside in the LS
SPE Contd.
• Coherent & cooperative off-load engines for the CPU
– Work independently
– Not directly tied to the CPU as a co-processor
• Dedicated DMA engine
– Moves data: CPU ↔ SPE or SPE ↔ SPE
– In parallel or serial with other SPEs
SPE Block Diagram
SPE Pipeline
Element Interconnect Bus (EIB)
• Data ring for internal communication
• Four 16-byte data rings – low latency
• Multiple simultaneous transfers
• 96 B/cycle peak bandwidth (@ ½ CPU speed)
SPE Local Stores
• 256 KB, 1 per SPE
• The SPEs operate on registers which are read from or written to the
local stores
• The local stores can access main memory in blocks of 1KB minimum
(16KB maximum) but the SPEs cannot act directly on main memory
(they can only move data to or from the local stores).
• Caches can deliver similar or even faster data rates, but only in very
short bursts (a couple of hundred cycles at best); the local stores can
each deliver data at this rate continually for over ten thousand cycles
without going to RAM.
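The typical way software exploits the local stores is double buffering: while the SPE computes on one buffer, the DMA engine fills the other, hiding main-memory latency. The sketch below simulates this in plain Python; `dma_fetch` is a hypothetical stand-in, not the actual Cell DMA API:

```python
# Double-buffered streaming through a small local store (simulated).
def dma_fetch(main_mem, offset, size):
    """Stand-in for a DMA get: copy one block from main memory."""
    return main_mem[offset:offset + size]

def stream_sum(main_mem, block=4):
    """Sum main_mem while always prefetching the next block early."""
    total, buf = 0, dma_fetch(main_mem, 0, block)
    for off in range(block, len(main_mem) + block, block):
        nxt = dma_fetch(main_mem, off, block)  # "DMA" next block in advance
        total += sum(buf)                      # compute on the current block
        buf = nxt                              # swap buffers
    return total

print(stream_sum(list(range(16))))  # 120
```

On real hardware the fetch and the compute overlap in time; here they merely alternate, but the buffer-swap structure is the same.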
Cell Characteristics
• A big difference between Cell and normal CPUs is
the ability of the SPEs in a Cell to be chained
together to act as a stream processor
Cell Architecture
Memory and I/O
External Memory Bus:
• Licensed from Rambus
• Dual XDR™ interface (25.6GB/s @ 3.2GHz)
External IO:
• Licensed from Rambus
• FlexIO™ interface (each 2-wire bit @ 800Mbps)
• Total 76.8 GB/s (7 Tx byte lanes + 5 Rx byte lanes)
• Extensive shielding is necessary
– Many VDD/GND wires
– 90% of all pins
Exploiting Application Level
Parallelism at Different Levels
• Data level parallelism – SIMD instruction support
• ILP – static scheduling and power aware
microarchitecture
• Compute-transfer parallelism – programmable data
transfer engines
• TLP – Multicore and hardware multithreading
• Memory level parallelism – overlapping transfers
from multiple requests per core and from multiple
cores
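The first level listed, SIMD data parallelism, can be illustrated with a toy 4-lane vector unit (a conceptual sketch: one "instruction" acts on all lanes in lockstep, as a 128-bit unit does on four 32-bit values):

```python
# SIMD sketch: a single vector instruction operates on all lanes at once.
LANES = 4  # models a 128-bit register holding four 32-bit elements

def vadd(a, b):
    """One 'vector add' applied to every lane in lockstep."""
    assert len(a) == len(b) == LANES
    return [x + y for x, y in zip(a, b)]

print(vadd([1, 2, 3, 4], [10, 20, 30, 40]))  # [11, 22, 33, 44]
```

One instruction, four results: that 4x throughput per instruction is the data-level parallelism the SPE and SSE units exploit in hardware.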
Other Multicore Processors from
IBM
• POWER4, the world's first non-embedded dual-core
processor, released in 2001
• POWER5, a dual-core processor, released in 2004
• POWER6, a dual-core processor, released in 2007
• POWER7, a 4-, 6-, or 8-core processor, released in 2010
• PowerPC 970MP, a dual-core processor, used in the
Apple Power Mac G5
• Xenon, a triple-core, SMT-capable, PowerPC microprocessor
used in the Microsoft Xbox 360 game console
IBM Power4 - December 2001
IBM Power 5 - August 2003
IBM Power7 (Feb 2010)
Power 7 and SMT
Fujitsu SPARC64 VIIIfx (February
2009)
AMD Interlagos (November 2011)
NVIDIA Kepler GK110 (May 2012)
Intel Sandy Bridge (January 2011)
AMD Trinity (May 2012)
Blue Gene/L Architecture

[Figure: Blue Gene/L architecture, 1024 nodes]
System Overview
Why the name “Blue Gene”?

• “Blue”: the corporate color of IBM
• “Gene”: the massive computing power of the
supercomputer was initially used to model
the folding of human proteins
Blue Gene/Q

• Third and last known supercomputer in
the Blue Gene series
• Expected to reach 20 petaflops in 2012
• Enhancement of the Blue Gene/L and /P
architectures
Blue Gene/Q Compute Chip (2012)
Using Multicore Processors
What about the Software?
Summary
• A paradigm shift towards
Multicore architectures due to
technological limitations and ILP
limitations
• Several multicore architectures
available
• Parallel programming techniques
likely to gain importance
References
• Simultaneous multithreading: maximizing on-chip parallelism, Dean
Tullsen, Susan Eggers, and Henry Levy, In 25 Years of ISCA, 1995.
• The case for a single-chip multiprocessor, Kunle Olukotun, Basem Nayfeh,
Lance Hammond, Ken Wilson, and Kunyung Chang, ASPLOS-VII, 1996.
• Characterization of simultaneous multithreading (SMT) efficiency in
POWER5, H. M. Mathis, A. E. Mericas, J. D. McCalpin, R. J. Eickemeyer, and
S. R. Kunkel, IBM Journal of Research and Development, 49(4/5), 2005.
• Purple: Fifth Generation ASC Platform,
http://www.llnl.gov/asci/platforms/purple
• IBM POWER7 multicore server processor. B. Sinharoy et al. IBM Journal of
Research and Development 55(3), May-June 2011, 1:1-1:29.
• Intel, AMD, and Sun websites.
• Rice University lecture notes
• MIT lecture notes
