
Jeff Stuecheli

Hardware Architect
IBM

The Future of Innovation in Computing

Copyright IBM Corporation 2014

The last 20 years

POWER1 (1990) → POWER7 QCM (2011)

Execution BW: 6 × 10^7 FLOPS → ~1 × 10^12 FLOPS (16,000×)

Storage BW:
- 240 MB/s to L1 → 6 TB/s to L1, 6 TB/s to L2, 3 TB/s to L3 (25,000×)
- 144 MB/s to 1 GB DRAM → 400 GB/s to 1 TB DRAM (2,700×)
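These ratios can be checked with quick arithmetic (a sketch using the slide's own figures; the POWER7 QCM FLOPS value is taken as ~10^12, the only value consistent with the quoted 16,000× ratio):

```python
# Sanity check of the POWER1 -> POWER7 QCM scaling ratios quoted
# on the slide (all input values come from the slide itself).
power1_flops = 6e7            # POWER1 (1990)
power7_flops = 1e12           # POWER7 QCM (2011)

power1_l1_bw = 240e6          # bytes/sec to L1
power7_l1_bw = 6e12

power1_dram_bw = 144e6        # bytes/sec to DRAM
power7_dram_bw = 400e9

print(round(power7_flops / power1_flops))      # ~16,667 -> "16000x"
print(round(power7_l1_bw / power1_l1_bw))      # 25,000  -> "25000x"
print(round(power7_dram_bw / power1_dram_bw))  # ~2,778  -> "2700x"
```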

Innovations in those 20 years

1.0 µm → 45 nm feature size
- ~500× the transistor density
25 MHz → 4 GHz clock rates
- 133× the clock rate (enabled by faster gates and deeper pipelines)
25 MHz → 6.4 GHz buses
- High-frequency communication

The next 20 years

At the same rate (16k×), in 20 years 44 Watsons would fit in 1U of rack space!

But:
- Gates would be 125 pm (< 1 Si atom wide)
- Voltage scaling limits: prior power increases were offset by lower-voltage operation (Dennard scaling)

What will the future bring?

For computer architects, it's likely more exciting than the last 20 years.
Vision presented in this talk: 16k× is possible!
More diverse innovations:
- More gates without smaller devices (cheaply manufactured)
  - 3D structures
- Power reduction through more efficient gate utilization
  - Gate leakage (power gating)
  - Integration
  - Sophisticated power management/voltage control
  - Reconfigurable logic
- Higher power density through advancements in packaging and cooling technologies
  - Liquid replaces air cooling
  - Energy recovery through reuse
- System integration enables higher bandwidth at reduced energy
  - Si-interposer-based communication
  - Optics

5

OpenPOWER: The Beginning

What is OpenPOWER?
An industry consortium focused on innovation:
- Across the server HW/SW stack
- For customized servers and components
- Leveraging complementary skills and investments
- To provide differentiated architectural alternatives

Open innovation around the IBM stack of research and innovation; members include IBM, Google, NVIDIA, Mellanox, and TYAN.

Benefits for Clients


New Innovators on Power Platform = More Value
OpenPOWER = Greater choice for IBM Clients
More Innovation = Increased Adoption of Power

OpenPOWER: Today

Member activity spans the stack: chip/SOC; boards/systems; I/O, storage, and acceleration; system, software, and services; and implementation, HPC, and research.

Growing transistors without process shrinks

Current industry expectation is ~6 nm in 2026 (International Technology Roadmap for Semiconductors); Moore's law would predict 0.25 nm.
Density doubles every 4 years instead of every 2 years.

How can we achieve more gates without smaller transistors?
- Every 4 years we need some other doubling improvement to stay on the 2-year growth rate
- Today's example: eDRAM
  - eDRAM reduces both transistors and energy for semiconductor arrays
  - ~equivalent to a new process generation
- Beyond more gates, more useful gates
  - Remove gates through integration and optimization
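The gap between the two doubling rates can be quantified with a quick back-of-the-envelope calculation (a sketch using only the doubling periods quoted above):

```python
# Density growth over 20 years: 2-year doubling (historical Moore's
# law) vs 4-year doubling (the slowed industry expectation).
years = 20
moore = 2 ** (years / 2)    # 2-year doubling: 1024x
slowed = 2 ** (years / 4)   # 4-year doubling: 32x
gap = moore / slowed        # the 32x that must come from elsewhere
print(moore, slowed, gap)   # 1024.0 32.0 32.0
```

This is the slide's point: roughly a 32× shortfall over 20 years has to be made up by non-lithographic improvements (eDRAM, 3D stacking, integration).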

3D Stacking

- Many levels of chips with low-power, low-latency communication
- Enables larger caches stacked below the CPU
- Enables larger chips with good yield
- DRAM TSVs enable larger capacity without power and frequency cuts

Gates in 3D are better than 2D

CPU design example: today's 2D designs
- Larger structures introduce longer physical distances, creating a design conflict
- Example: multi-level design structures

[Chip floorplan: cores with L2 caches, a fast local L3 region, the L3 cache and chip, memory controller, and local and remote SMP links]

Data/instruction caches: tertiary levels are inherently forced to sit across adjacent structures (vs. inside the CPU core). This results in wide, high-power buses crossing large distances.

TLB (Translation Lookaside Buffer): the POWER7 design uses a two-level structure. The second tier is outside the critical logic path, but adds area, pushing other structures farther apart.

10

Gates in 3D advantage

- The critical high-power execution core takes the penthouse (where heat can more easily be removed)
- A smaller core yields higher frequency and reduced energy
- Second-level structures are pulled under the core
  - They can grow to ideal size without hurting the critical execution loop
- L3 cache pulled to the 3rd level
- The 4th level can be power control, I/O transceivers, or more cache

11

Staging current CPUs into many layers

The key limiter is available vertical wiring channels.
- 1st generation: pull external interface logic and voltage regulation into a lower layer
- 2nd generation: add an L3 cache layer
- 3rd generation: move the L2 cache and large second-level core-centric structures (TLB, predictors)
- 4th generation: two-layer CPU, with execution units on top and L1 caches below
12

Efficient integration with Si Interposers

- Use an old manufacturing line to produce a large active Si interconnect (the base layer of the 3D stack)
- Enables efficient communication between CPU compute stacks, memory stacks, accelerators, and optical transceivers
- MCM (multi-chip module), where the module is active Si logic
- Enables very high-bandwidth, low-power interconnect (micro bumps)
- Conceptually, the system could fit on the Si carrier, with optical external attach points

13

Si interposer communication advantages

On- vs off-chip communication:
- On chip: a "bucket brigade"; clock skew managed along the path; wire pitch ~10s of nm
- Off chip: wave pulses along a string; clock skew managed at the endpoint; wire pitch ~10s of µm

14

Circuit based energy improvements

Power gating: turn off voltage to prevent gate leakage
- Utilized in the POWER7+ core (entire core as one domain), with a multi-cycle transition
- Current server-class designs gate large blocks (an entire CPU), required to provide voltage stability through capacitance in the power grid
- Fine-grain power gating will become possible in the server space with sophisticated 3D-based power delivery
- Potential ~4× reduction in leakage power
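As an illustration of where a ~4× figure could come from, here is a toy model in which leakage is proportional to the fraction of gates left powered on. The 25% utilization number is invented for illustration; it is not from the talk:

```python
# Toy leakage model: leakage scales with the fraction of gates
# that remain powered. Utilization numbers are illustrative only.
total_leakage = 100.0      # arbitrary units, no gating at all

# Coarse gating can only shut off a fully idle CPU; if the chip as
# a whole stays busy, nothing gets gated off.
coarse = total_leakage * 1.0

# Fine-grain gating: suppose only ~25% of blocks are active at any
# moment and each idle block can be gated individually.
active_fraction = 0.25
fine = total_leakage * active_fraction

print(coarse / fine)       # 4.0 -> a ~4x leakage reduction
```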
15

Scalable Heat Removal by Interlayer Cooling (IBM Research Zurich)

- 3D integration requires interlayer cooling for stacked logic chips
- Bonding scheme isolates the electrical interconnects from the microchannel coolant (pin-fin structures)
- Through-silicon-via electrical bonding and water insulation scheme
- A large fraction of the energy in computers is spent on data transport; shrinking computers saves energy
- Test vehicle with fluid manifold and connection

19

Future Memory

Today:
- DRAM: ~100 ns, read and write durable, volatile; technology scaling is slowing down
- FLASH: ~100 µs, read durable
- Disk: ~10 ms, read and write durable

Tomorrow:
- Phase Change Memory (PCM) and Resistive RAM (RRAM): ~100 ns read, ~1 µs write, read durable, non-volatile


17

Optical Interconnects

Disruptive optics evolution and silicon photonics (POWER7 775 HPC system)
- High-density I/O: off-module optical transceivers
- Physical escape density: P7 775 HPC network chip shown below
- Silicon photonics: multi-wavelength, 25 Gb/s optics

[Link bandwidth figure ("Deuce" links):]
- 10 Gb/s, 1 color: 24 GB/s
- 25 Gb/s, 1 color: 60 GB/s
- 25 Gb/s, 4 colors: 240 GB/s
18
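The three bandwidth figures are mutually consistent with a 24-lane link whose line coding carries ~80% payload (e.g. 8b/10b). Both the lane count and the encoding efficiency are assumptions used to reconcile the numbers, not values stated in the talk:

```python
# Reconcile the slide's Gb/s-per-lane and aggregate GB/s figures.
# ASSUMPTIONS: 24 lanes and 80% line-coding efficiency (8b/10b-like);
# neither is stated on the slide.
lanes = 24
efficiency = 0.8

def link_gbytes(gbits_per_lane, colors):
    # aggregate GB/s = lanes * wavelengths * per-lane Gb/s * efficiency / 8
    return lanes * colors * gbits_per_lane * efficiency / 8

print(link_gbytes(10, 1))   # 24.0 GB/s
print(link_gbytes(25, 1))   # 60.0 GB/s
print(link_gbytes(25, 4))   # 240.0 GB/s
```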


Heterogeneous Computing

ASIC: An application-specific integrated circuit (ASIC) is an integrated circuit (IC) customized for a particular use, rather than intended for general-purpose use. For example, a chip designed solely to run a specific cell phone is an ASIC.

FPGA: A field-programmable gate array (FPGA) is an integrated circuit designed to be configured by the customer or designer after manufacturing, hence "field-programmable".

GPGPU: A general-purpose graphics processing unit is a massively threaded processing engine capable of accelerating highly parallel computation programs using many very lightweight threads.
19

FPGA Capability

FPGA trends: currently field-deployed in appliances (WFO, PoC / Datapower)

[Chart: high-end FPGA price-to-logic ratio at 100Ku/yr, in $/K LEs on a log scale, 1993-2014, falling from ~$100/K LEs (1993) to ~$0.7/K LEs (2014); mid-range cost per LE shown alongside. Note: DSP, memory blocks, and hard IP (e.g. PCIe) were added over time.]

This 3× rise in LE density occurred as technology shifted from lagging edge (180 nm) to leading edge (40 nm). This brings new speeds and capabilities and lower costs (still preserving >60% margins), while ASIC costs are expected to rise exponentially.
=> This has primed a tipping point in the industry.
20

CPU vs ASIC vs FPGA efficiency

Custom logic:
- 4 GHz
- Highly optimized for one task, fixed at time of fabrication

Generic logic:
- 250 MHz
- Configurable for ~any task; can change at any time

[POWER7 core floorplan: instruction fetch and decode, instruction sequencing, fixed-point unit, vector and scalar unit, decimal unit, load/store unit, pervasive logic]
21

FPGAs and Workload Optimized Systems

Big data has inherent data-parallel components:
- Data compression
  - Algorithms in logic are 10-100× more efficient than CPU-based
  - Negligible latency
  - Increases effective disk capacity
  - Increases effective disk and network bandwidth
- FPGA logic can be used to sift large volumes of data, which is then passed to the CPUs for detailed analysis

Packaged solutions hide FPGA programming complexity from the user:
- Workload-optimized appliance delivery model
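The sift-then-analyze pattern above can be sketched in software; this is a toy analogy in which a cheap streaming filter stands in for the FPGA stage, and the function names are invented for illustration:

```python
# Toy sketch of the sift-then-analyze offload pattern: a cheap
# streaming filter (the FPGA's role) discards most records, and only
# the survivors reach the expensive CPU-side analysis.
def coarse_filter(records, keyword):
    # FPGA-like stage: a simple, stateless match over a large stream.
    return (r for r in records if keyword in r)

def detailed_analysis(record):
    # CPU-side stage: arbitrarily complex per-record work.
    return {"record": record, "words": len(record.split())}

stream = ["error: disk 3 offline", "ok", "error: fan 1", "ok", "ok"]
hits = [detailed_analysis(r) for r in coarse_filter(stream, "error")]
print(len(hits))   # 2 -> only 2 of 5 records reach the CPU stage
```

The design point mirrors the slide: the filter increases effective bandwidth because the expensive stage never sees data that was going to be discarded anyway.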

22

Software challenges

All of the following apply to applications, middleware (compilers, databases, etc.), and system SW (OS, hypervisors, cluster, etc.):
- Parallel programming
- Accelerator usage (heterogeneous computing): workload partitioning, FPGA compilation
- Tiered memory management: more levels, diverse types; melding of main memory and storage
- EDA tools required to support complex design structures and circuit power optimization (e.g. productive fine-grain power gating, diffraction mask generation)
23

IBM as the Innovator

The only company with the resources to design such integrated systems.

World leading in:
- Technology
- Research labs
- Hardware design
- Software design
- System design
24
