
Jeff Stuecheli

Hardware Architect
IBM

The Future of Innovation in Computing

Copyright IBM Corporation 2014

The last 20 years

POWER1 (1990) → POWER7 QCM (2011)

Execution BW: 6 × 10^7 FLOPS → ~1 × 10^12 FLOPS (16,000×)

Storage BW:
- 240 MB/s to L1 → 6 TB/s to L1, 6 TB/s to L2, 3 TB/s to L3 (25,000×)
- 144 MB/s to 1 GB DRAM → 400 GB/s to 1 TB DRAM (2,700×)
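These ratios can be checked with quick arithmetic (a sketch using the slide's own figures; the POWER7 QCM FLOPS value is taken as ~10^12, the only value consistent with the quoted 16,000× ratio):

```python
# Sanity check of the POWER1 -> POWER7 QCM scaling ratios quoted
# on the slide (all input values come from the slide itself).
power1_flops = 6e7            # POWER1 (1990)
power7_flops = 1e12           # POWER7 QCM (2011)

power1_l1_bw = 240e6          # bytes/sec to L1
power7_l1_bw = 6e12

power1_dram_bw = 144e6        # bytes/sec to DRAM
power7_dram_bw = 400e9

print(round(power7_flops / power1_flops))      # ~16,667 -> "16000x"
print(round(power7_l1_bw / power1_l1_bw))      # 25,000  -> "25000x"
print(round(power7_dram_bw / power1_dram_bw))  # ~2,778  -> "2700x"
```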

Innovations in those 20 years

1.0 µm → 45 nm feature size
- ~500× the transistor density
25 MHz → 4 GHz clock rates
- 133× the clock rate (enabled by faster gates and deeper pipelines)
25 MHz → 6.4 GHz buses
- High-frequency communication

The next 20 years

At the same rate (16k×), in 20 years 44 Watsons would fit in 1U of rack space!

But:
- Gates would be 125 pm (< 1 Si atom wide)
- Voltage scaling limits: prior power increases were offset by lower-voltage operation (Dennard scaling)

What will the future bring?

For computer architects, it's likely more exciting than the last 20 years.
Vision presented in this talk: 16k× is possible!
More diverse innovations:
- More gates without smaller devices (cheaply manufactured)
  - 3D structures
- Power reduction through more efficient gate utilization
  - Gate leakage (power gating)
  - Integration
  - Sophisticated power management/voltage control
  - Reconfigurable logic
- Higher power density through advancements in packaging and cooling technologies
  - Liquid replaces air cooling
  - Energy recovery through reuse
- System integration enables higher bandwidth at reduced energy
  - Si-interposer-based communication
  - Optics

5

OpenPOWER: The Beginning

What is OpenPOWER?
An industry consortium focused on innovation:
- Across the server HW/SW stack
- For customized servers and components
- Leveraging complementary skills and investments
- To provide differentiated architectural alternatives

Open innovation around the IBM stack of research and innovation; members include IBM, Google, NVIDIA, Mellanox, and TYAN.

Benefits for Clients


New Innovators on Power Platform = More Value
OpenPOWER = Greater choice for IBM Clients
More Innovation = Increased Adoption of Power

OpenPOWER: Today

Member activity spans the stack: chip/SOC; boards/systems; I/O, storage, and acceleration; system, software, and services; and implementation, HPC, and research.

Growing transistors without process shrinks

Current industry expectation is ~6 nm in 2026 (International Technology Roadmap for Semiconductors); Moore's law would predict 0.25 nm.
Density doubles every 4 years instead of every 2 years.

How can we achieve more gates without smaller transistors?
- Every 4 years we need some other doubling improvement to stay on the 2-year growth rate
- Today's example: eDRAM
  - eDRAM reduces both transistors and energy for semiconductor arrays
  - ~equivalent to a new process generation
- Beyond more gates, more useful gates
  - Remove gates through integration and optimization
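The gap between the two doubling rates can be quantified with a quick back-of-the-envelope calculation (a sketch using only the doubling periods quoted above):

```python
# Density growth over 20 years: 2-year doubling (historical Moore's
# law) vs 4-year doubling (the slowed industry expectation).
years = 20
moore = 2 ** (years / 2)    # 2-year doubling: 1024x
slowed = 2 ** (years / 4)   # 4-year doubling: 32x
gap = moore / slowed        # the 32x that must come from elsewhere
print(moore, slowed, gap)   # 1024.0 32.0 32.0
```

This is the slide's point: roughly a 32× shortfall over 20 years has to be made up by non-lithographic improvements (eDRAM, 3D stacking, integration).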

3D Stacking

- Many levels of chips with low-power, low-latency communication
- Enables larger caches stacked below the CPU
- Enables larger chips with good yield
- DRAM TSVs enable larger capacity without power and frequency cuts

Gates in 3D are better than 2D

CPU design example: today's 2D designs
- Larger structures introduce longer physical distances, creating a design conflict
- Example: multi-level design structures

[Chip floorplan: cores with L2 caches, a fast local L3 region, the L3 cache and chip, memory controller, and local and remote SMP links]

Data/instruction caches: tertiary levels are inherently forced to sit across adjacent structures (vs. inside the CPU core). This results in wide, high-power buses crossing large distances.

TLB (Translation Lookaside Buffer): the POWER7 design uses a two-level structure. The second tier is outside the critical logic path, but adds area, pushing other structures farther apart.

10

Gates in 3D advantage

- The critical high-power execution core takes the penthouse (where heat can more easily be removed)
- A smaller core yields higher frequency and reduced energy
- Second-level structures are pulled under the core
  - They can grow to ideal size without hurting the critical execution loop
- L3 cache pulled to the 3rd level
- The 4th level can be power control, I/O transceivers, or more cache

11

Staging current CPUs into many layers

The key limiter is available vertical wiring channels.
- 1st generation: pull external interface logic and voltage regulation into a lower layer
- 2nd generation: add an L3 cache layer
- 3rd generation: move the L2 cache and large second-level core-centric structures (TLB, predictors)
- 4th generation: two-layer CPU, with execution units on top and L1 caches below
12

Efficient integration with Si Interposers

- Use an old manufacturing line to produce a large active Si interconnect (the base layer of the 3D stack)
- Enables efficient communication between CPU compute stacks, memory stacks, accelerators, and optical transceivers
- MCM (multi-chip module), where the module is active Si logic
- Enables very high-bandwidth, low-power interconnect (micro bumps)
- Conceptually, the system could fit on the Si carrier, with optical external attach points

13

Si interposer communication advantages

On- vs off-chip communication:
- On chip: a "bucket brigade"; clock skew managed along the path; wire pitch ~10s of nm
- Off chip: wave pulses along a string; clock skew managed at the endpoint; wire pitch ~10s of µm

14

Circuit based energy improvements

Power gating: turn off voltage to prevent gate leakage
- Utilized in the POWER7+ core (entire core as one domain), with a multi-cycle transition
- Current server-class designs gate large blocks (an entire CPU), required to provide voltage stability through capacitance in the power grid
- Fine-grain power gating will become possible in the server space with sophisticated 3D-based power delivery
- Potential ~4× reduction in leakage power
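As an illustration of where a ~4× figure could come from, here is a toy model in which leakage is proportional to the fraction of gates left powered on. The 25% utilization number is invented for illustration; it is not from the talk:

```python
# Toy leakage model: leakage scales with the fraction of gates
# that remain powered. Utilization numbers are illustrative only.
total_leakage = 100.0      # arbitrary units, no gating at all

# Coarse gating can only shut off a fully idle CPU; if the chip as
# a whole stays busy, nothing gets gated off.
coarse = total_leakage * 1.0

# Fine-grain gating: suppose only ~25% of blocks are active at any
# moment and each idle block can be gated individually.
active_fraction = 0.25
fine = total_leakage * active_fraction

print(coarse / fine)       # 4.0 -> a ~4x leakage reduction
```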
15

Scalable Heat Removal by Interlayer Cooling (IBM Research Zurich)

- 3D integration requires interlayer cooling for stacked logic chips
- Bonding scheme isolates the electrical interconnects from the microchannel coolant (pin-fin structures)
- Through-silicon-via electrical bonding and water insulation scheme
- A large fraction of the energy in computers is spent on data transport; shrinking computers saves energy
- Test vehicle with fluid manifold and connection

19

Future Memory

Today:
- DRAM: ~100 ns, read and write durable, volatile; technology scaling is slowing down
- FLASH: ~100 µs, read durable
- Disk: ~10 ms, read and write durable

Tomorrow:
- Phase Change Memory (PCM) and Resistive RAM (RRAM): ~100 ns read, ~1 µs write, read durable, non-volatile


17

Optical Interconnects

Disruptive optics evolution and silicon photonics (POWER7 775 HPC system)
- High-density I/O: off-module optical transceivers
- Physical escape density: P7 775 HPC network chip shown below
- Silicon photonics: multi-wavelength, 25 Gb/s optics

[Link bandwidth figure ("Deuce" links):]
- 10 Gb/s, 1 color: 24 GB/s
- 25 Gb/s, 1 color: 60 GB/s
- 25 Gb/s, 4 colors: 240 GB/s
18
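The three bandwidth figures are mutually consistent with a 24-lane link whose line coding carries ~80% payload (e.g. 8b/10b). Both the lane count and the encoding efficiency are assumptions used to reconcile the numbers, not values stated in the talk:

```python
# Reconcile the slide's Gb/s-per-lane and aggregate GB/s figures.
# ASSUMPTIONS: 24 lanes and 80% line-coding efficiency (8b/10b-like);
# neither is stated on the slide.
lanes = 24
efficiency = 0.8

def link_gbytes(gbits_per_lane, colors):
    # aggregate GB/s = lanes * wavelengths * per-lane Gb/s * efficiency / 8
    return lanes * colors * gbits_per_lane * efficiency / 8

print(link_gbytes(10, 1))   # 24.0 GB/s
print(link_gbytes(25, 1))   # 60.0 GB/s
print(link_gbytes(25, 4))   # 240.0 GB/s
```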


Heterogeneous Computing

ASIC: An application-specific integrated circuit (ASIC) is an integrated circuit (IC) customized for a particular use, rather than intended for general-purpose use. For example, a chip designed solely to run a specific cell phone is an ASIC.

FPGA: A field-programmable gate array (FPGA) is an integrated circuit designed to be configured by the customer or designer after manufacturing, hence "field-programmable".

GPGPU: A general-purpose graphics processing unit is a massively threaded processing engine capable of accelerating highly parallel computation programs using many very lightweight threads.
19

FPGA Capability

FPGA trends: currently field-deployed in appliances (WFO, PoC / Datapower)

[Chart: high-end FPGA price-to-logic ratio at 100Ku/yr, in $/K LEs on a log scale, 1993-2014, falling from ~$100/K LEs (1993) to ~$0.7/K LEs (2014); mid-range cost per LE shown alongside. Note: DSP, memory blocks, and hard IP (e.g. PCIe) were added over time.]

This 3× rise in LE density occurred as technology shifted from lagging edge (180 nm) to leading edge (40 nm). This brings new speeds and capabilities and lower costs (still preserving >60% margins), while ASIC costs are expected to rise exponentially.
=> This has primed a tipping point in the industry.
20

CPU vs ASIC vs FPGA efficiency

Custom logic:
- 4 GHz
- Highly optimized for one task, fixed at time of fabrication

Generic logic:
- 250 MHz
- Configurable for ~any task; can change at any time

[POWER7 core floorplan: instruction fetch and decode, instruction sequencing, fixed-point unit, vector and scalar unit, decimal unit, load/store unit, pervasive logic]
21

FPGAs and Workload Optimized Systems

Big data has inherent data-parallel components:
- Data compression
  - Algorithms in logic are 10-100× more efficient than CPU-based
  - Negligible latency
  - Increases effective disk capacity
  - Increases effective disk and network bandwidth
- FPGA logic can be used to sift large volumes of data, which is then passed to the CPUs for detailed analysis

Packaged solutions hide FPGA programming complexity from the user:
- Workload-optimized appliance delivery model
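The sift-then-analyze pattern above can be sketched in software; this is a toy analogy in which a cheap streaming filter stands in for the FPGA stage, and the function names are invented for illustration:

```python
# Toy sketch of the sift-then-analyze offload pattern: a cheap
# streaming filter (the FPGA's role) discards most records, and only
# the survivors reach the expensive CPU-side analysis.
def coarse_filter(records, keyword):
    # FPGA-like stage: a simple, stateless match over a large stream.
    return (r for r in records if keyword in r)

def detailed_analysis(record):
    # CPU-side stage: arbitrarily complex per-record work.
    return {"record": record, "words": len(record.split())}

stream = ["error: disk 3 offline", "ok", "error: fan 1", "ok", "ok"]
hits = [detailed_analysis(r) for r in coarse_filter(stream, "error")]
print(len(hits))   # 2 -> only 2 of 5 records reach the CPU stage
```

The design point mirrors the slide: the filter increases effective bandwidth because the expensive stage never sees data that was going to be discarded anyway.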

22

Software challenges

All of the following apply to applications, middleware (compilers, databases, etc.), and system SW (OS, hypervisors, cluster, etc.):
- Parallel programming
- Accelerator usage (heterogeneous computing): workload partitioning, FPGA compilation
- Tiered memory management: more levels, diverse types; melding of main memory and storage
- EDA tools required to support complex design structures and circuit power optimization (e.g. productive fine-grain power gating, diffraction mask generation)
23

IBM as the Innovator

The only company with the resources to design such integrated systems.

World leading in:
- Technology
- Research labs
- Hardware design
- Software design
- System design
24
