[Figure: CMP with processors around a distributed L2 cache; corner banks are labeled 0, 7, 56, 63]
Dual-Core: monolithic shared cache
Multi-Core: large shared cache; distributed cache
NoC-based: How?
E. Bolotin – The Power of Priority, NoCs 2007
Future Cache - Physics Perspective
• Large cache → large access time (global wire delay)
[Figure: gate delay vs. global wire delay, and fraction of chip reachable in 1 clock cycle, over technology nodes 250, 180, 130, 90, 65, 45, 32 nm. Source: Keckler et al. ISSCC 2003]
Sources:
Kim et al. ASPLOS 2002
Beckmann et al. MICRO 2004
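A rough way to see why access time grows with cache size is to count network hops to a bank. The sketch below is only an illustration: the 8 x 8 row-major bank numbering is inferred from the bank labels 0 to 63 in the figure, and `bank_cycles` and `hop_cycles` are hypothetical values, not numbers from the talk.

```python
GRID = 8  # assumption: 8 x 8 banks, numbered row-major from 0 to 63


def bank_coords(bank: int) -> tuple[int, int]:
    """Return (row, col) of a bank in the assumed row-major numbering."""
    if not 0 <= bank < GRID * GRID:
        raise ValueError(f"bank {bank} out of range")
    return divmod(bank, GRID)


def access_cycles(attach_bank: int, target_bank: int,
                  bank_cycles: int = 3, hop_cycles: int = 1) -> int:
    """Round-trip latency: request hops + bank access + response hops.

    bank_cycles and hop_cycles are hypothetical placeholders.
    """
    r0, c0 = bank_coords(attach_bank)
    r1, c1 = bank_coords(target_bank)
    hops = abs(r0 - r1) + abs(c0 - c1)  # Manhattan distance on the grid
    return 2 * hops * hop_cycles + bank_cycles


if __name__ == "__main__":
    for target in (0, 7, 56, 63):
        print(target, access_cycles(0, target))
```

Under these assumptions a processor attached at bank 0 sees the far corner bank cost roughly ten times the local bank, which is the non-uniformity that NUCA designs work with.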
Issues in NUCA-based CMP
• NoC performance → CMP performance
• Cache coherency and transaction order (correctness)
• Search (in DNUCA)
• Different traffic types (e.g. fetch vs. prefetch)
• Synchronization (locks)
NoC Services for CMP?
[Figure: read transaction between P0 (L1) and the L2 directory over the NoC. Legend: ctrl. packet, data packet]
1. READ REQ
2. READ RESP (data transfer)
[Figure: read by P0 of a line held modified by P2. Directory state: P2-MOD., then P0-SHARED]
2. READ RESP (data transfer)
5. WR BACK RESP (data transfer)
6. READ RESP (data transfer)
[Figure: read exclusive by P0 of a line shared by P1 and P2. Directory state: P1-Shared, P2-Shared, then P0-MOD.]
1. READ REQ
3. READ EXCL. REQ
5. INVALID. ACK
6. READ EXCL. RESP (data transfer)
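The transactions in these diagrams can be sketched as a small directory model that emits the NoC messages for a read and a read exclusive. This is a minimal sketch, not the protocol from the talk: message names follow the diagram labels where they are legible, `WR BACK REQ` is an assumed name for the request that precedes `WR BACK RESP`, and keeping the old owner as a sharer after a write-back is also an assumption.

```python
from dataclasses import dataclass, field

CTRL, DATA = "ctrl", "data"  # short control packet vs. long data packet


@dataclass
class DirectoryEntry:
    state: str = "INVALID"                     # INVALID, SHARED or MODIFIED
    holders: set = field(default_factory=set)  # L1 caches holding the line


def read(entry, requester):
    """Messages (name, src, dst, type) for a read by an L1 without the line."""
    msgs = [("READ REQ", requester, "DIR", CTRL)]
    if entry.state == "MODIFIED":
        (owner,) = entry.holders
        # WR BACK REQ is an assumed name; the slides only show WR BACK RESP.
        msgs.append(("WR BACK REQ", "DIR", owner, CTRL))
        msgs.append(("WR BACK RESP", owner, "DIR", DATA))
    msgs.append(("READ RESP", "DIR", requester, DATA))
    entry.state = "SHARED"
    entry.holders.add(requester)  # assumption: the old owner stays a sharer
    return msgs


def read_exclusive(entry, requester):
    """Messages for a read exclusive; other holders lose their copies."""
    msgs = [("READ EXCL. REQ", requester, "DIR", CTRL)]
    others = sorted(entry.holders - {requester})
    if entry.state == "MODIFIED":
        for owner in others:  # at most one owner in MODIFIED state
            msgs.append(("WR BACK REQ", "DIR", owner, CTRL))
            msgs.append(("WR BACK RESP", owner, "DIR", DATA))
    else:
        msgs += [("INVALID. REQ", "DIR", p, CTRL) for p in others]
        msgs += [("INVALID. ACK", p, "DIR", CTRL) for p in others]
    msgs.append(("READ EXCL. RESP", "DIR", requester, DATA))
    entry.state, entry.holders = "MODIFIED", {requester}
    return msgs


if __name__ == "__main__":
    line = DirectoryEntry()
    read(line, "P1")
    read(line, "P2")
    for msg in read_exclusive(line, "P0"):
        print(msg)
```

In this model a read exclusive on a line with two sharers produces five control packets and a single data packet, so the requester waits on several short packets before its data can be sent.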
[Figure: read exclusive transaction on the NoC: 3. READ EXCL. REQ, 4. INVALID. REQ, 5. INVALID. ACK, 6. READ EXCL. RESP (data transfer); directory state P0-MOD.]
• Grid of wormhole routers
• Unicast only
• Ordering in network
  • Static routing
  • No virtual channels
• Smart interfaces
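Static routing on a grid can be illustrated with dimension-ordered (XY) routing. The slides say only "static routing", so XY is an assumption here; the point of the sketch is that every packet between the same pair of routers follows the same path, which is what keeps packets between that pair in order when there are no virtual channels.

```python
def xy_route(src, dst):
    """Return the routers visited from src to dst, moving in X first, then Y.

    src and dst are (x, y) grid coordinates. XY routing is an assumed
    instance of the static routing named in the slides.
    """
    x, y = src
    path = [(x, y)]
    while x != dst[0]:          # finish the X dimension first
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:          # then the Y dimension
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path


if __name__ == "__main__":
    print(xy_route((0, 0), (2, 1)))
```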
Can We Do Better?
Vanilla NoC
B) All NoC transactions are equally important
• Short ctrl. packets
• Long data packets
[Figure: priority-based router. Input and output ports with buffers per service level (SL 0 to SL 3), a cross-bar, a scheduler, and credit-based control. A multiple-SL link carries long data and short requests over one physical link]
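The service levels and credits in the router figure suggest a scheduler that, for each flit slot on the physical link, picks the most urgent service level that has both a waiting flit and a credit. The sketch below is one possible reading, not the router from the talk: treating SL 0 as the highest priority, the class name, and the credit counts are all assumptions.

```python
from collections import deque


class PriorityLinkScheduler:
    """Hypothetical output-port scheduler for a multiple-SL link."""

    def __init__(self, num_sl=4, credits_per_sl=2):
        self.queues = [deque() for _ in range(num_sl)]
        # One credit per free buffer slot at the downstream router, per SL.
        self.credits = [credits_per_sl] * num_sl

    def enqueue(self, sl, flit):
        self.queues[sl].append(flit)

    def return_credit(self, sl):
        """Called when the downstream router frees a buffer slot."""
        self.credits[sl] += 1

    def next_flit(self):
        """Pick the flit for the next link slot, or None if nothing can go.

        Assumption: a lower SL number means higher priority.
        """
        for sl, queue in enumerate(self.queues):
            if queue and self.credits[sl] > 0:
                self.credits[sl] -= 1
                return sl, queue.popleft()
        return None


if __name__ == "__main__":
    link = PriorityLinkScheduler()
    for flit in ("d0", "d1", "d2"):
        link.enqueue(3, flit)      # long data packet on SL 3
    print(link.next_flit())        # first data flit goes out
    link.enqueue(0, "req")         # short request arrives on SL 0
    print(link.next_flit())        # the request overtakes the data packet
    print(link.next_flit())        # data resumes
```

Because the choice is made per flit, a short control packet does not have to wait behind a long data packet that is already in flight on the same link.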
[Figure: P0 (L1) and the L2 directory exchanging 2. Read Resp. and 4. Invalidation Req.]
• Obtain L2-cache access traces
• QNoC simulator (OPNET)
[Figure: two panels plotted against Link Capacity [Gbps] (1, 4, 16)]
[Figure: L2 Access Delay Reduction by Priority-based NoC. Delay Reduction [%] for Read and Read Exclusive on apache, zeus, fft, ocean, radix; bar labels range from 13.5 to 32.9]
[Figure: Total Program Speedup by Priority-based NoC. Speedup [%] on apache, zeus, fft, ocean, radix; bar labels range from 5.0 to 9.4]
Good For:
• Coherency
• Traffic differentiation (e.g. Fetch vs. Pre-Fetch)
• Search in DNUCA
• Synchronization (Locks)
Forwarding
• Virtual Ring
• No Additional Cost
• For Invalidation Multicast
• Snooping or synchronization
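One way to picture forwarding on a virtual ring is a closed tour over the existing mesh links, along which each router passes a packet to its ring successor and every target keeps a copy. The sketch below is an illustration under assumptions: the ring construction (row 0, a snake through the remaining columns, then back up column 0) and the function names are hypothetical and not taken from the talk.

```python
def virtual_ring(rows, cols):
    """Return a closed tour visiting every (row, col) node of a mesh once.

    Consecutive nodes, including last-to-first, are mesh neighbours, so the
    ring uses only existing links. Needs an even number of rows.
    """
    if rows % 2 or rows < 2 or cols < 2:
        raise ValueError("need an even number of rows and at least 2 columns")
    ring = [(0, c) for c in range(cols)]                 # along row 0
    for r in range(1, rows):                             # snake over columns 1..cols-1
        cs = range(cols - 1, 0, -1) if r % 2 else range(1, cols)
        ring += [(r, c) for c in cs]
    ring += [(r, 0) for r in range(rows - 1, 0, -1)]     # return up column 0
    return ring


def ring_multicast(ring, src, targets):
    """Forward one packet around the ring; return (node, hop) per delivery."""
    start = ring.index(src)
    pending = set(targets) - {src}
    deliveries = []
    for hop in range(1, len(ring)):
        if not pending:
            break
        node = ring[(start + hop) % len(ring)]
        if node in pending:
            deliveries.append((node, hop))
            pending.remove(node)
    return deliveries


if __name__ == "__main__":
    ring = virtual_ring(4, 4)
    print(ring_multicast(ring, (0, 0), [(0, 2), (1, 3)]))
```

A single packet reaches all sharers this way, which fits invalidation multicast on a unicast-only network, at the price of latency that grows with the ring position of the farthest target.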