these technologies. In Section IV we review some previously proposed architectural and circuit techniques which reduce leakage current and bitline latency. In Section V we present a dynamically configured memory architecture, followed by dynamic memory partitioning (DMP) algorithms in Section VI. In Section VII we present the circuit simulation model, followed by conclusions in Section VIII.

Authorized licensed use limited to: University of Central Florida. Downloaded on October 26, 2008 at 00:02 from IEEE Xplore. Restrictions apply.

via access transistors T5 and T6. Figure 2 shows the SRAM cell with dual ports, each hardwired with dedicated word and bit lines, and how duplicating bit lines causes bit-line leakage current to multiply. The word and bit lines and access transistors T7 and T8 for the second port would almost double the silicon area [10] of the single-port configuration.

Subthreshold leakage current, the drain-source current of a transistor operating in weak inversion, is a major contributor to SRAM leakage current [11]. Pre-charging, as well as keeping the bit lines high, causes significant power dissipation and contributes heavily to the total power dissipation.

When "0" is stored, transistors T1, T5 and T4 dissipate leakage current. When "1" is stored, T2, T3 and T6 dissipate leakage current. The leakage current in a single-port SRAM cell is described as follows:

Idcell = Idsub(T1) + Idsub(T5) + Idsub(T4) (2)

Fig. 1. A Single-Port SRAM Cell

For a memory core with N rows and M columns this leakage current is described as:

Idcell = N * M * (Idsub(T1) + Idsub(T5) + Idsub(T4)) (3)

Fig. 2. A Dual-Port SRAM Cell

Fig. 2 shows the classic hardwired dual-port memory architecture, where each SRAM cell is accessible by two ports with dedicated word and bit lines for each. The addition of the word and bit lines and access transistors T7 and T8 would almost double the silicon area.

The dual-port (as well as multi-port) memory architecture has been implemented with instruction and data caches in multi-core processors in recent years. The most important advantage of this architecture is that it can execute multiple cache accesses simultaneously. Therefore, it doubles (in the case of dual-port) or multiplies (in the case of multi-port) the bandwidth of a single-port cache.

Leakage current for a dual-port memory cell can be described as:

Idcell = Idsub(T1) + Idsub(T5) + Idsub(T4) + Idsub(T7) (4)

For a memory core with N rows and M columns the leakage current of Equation 4 can be represented as:

Idcell = N * M * (Idsub(T1) + Idsub(T5) + Idsub(T4) + Idsub(T7)) (5)

Most of the recent research activities in this area are geared toward reduction of sub-threshold leakage current in on-chip cache. Among process and circuit level techniques, dynamic Vt, dual-threshold voltage, reduced-gate SRAM (RG SRAM) and gated Vdd have been discussed in [12-15]. Dynamic Vt SRAM [12] reduces leakage current in cache memories by switching cache lines to high Vt if the probability of access is small. The dual-threshold voltage technology [13] uses high threshold voltage devices to reduce leakage current; low threshold voltage devices are used where high performance is required. Reduced-gate SRAM (RG SRAM) uses two additional pass transistors connected between the cross-coupled inverters to decrease gate leakage current [14]. Gated Vdd [2] reduces leakage power by using a high threshold transistor between a virtual ground and GND to cut off the power supply to the memory cell in a low power mode.

Among architectural techniques, bitline segmentation [6, 15-18] and sub-banking have reduced leakage power significantly. Most of the architectural techniques are combined with relevant circuit techniques to suppress unnecessary leakage power. Albonesi [15] reduces power dissipation by enabling only a small portion of the L2 cache at a time. Zhu et al. [19] designed a low power SRAM by enabling banks to switch between active and standby mode. Bitline segmentation efficiently reduces leakage current by shortening the bitline length dynamically. Karandikar et al. [16] divide bit lines into hierarchical segments to reduce bitline capacitance and add parallel bit lines to access SRAM cells. Adding parallel bit lines in the above architecture, however, has the drawback of larger memory area. Yang et al. [18] explored this architecture further and proposed hierarchical bit lines with local sense amplifiers. One major source of energy dissipation is charging the whole bitline. Rao [6] divides bitlines into smaller segments such that segments higher than the currently accessed cell are isolated from bitline precharge. Although this approach incurs additional delay, it reduces the
length of bit lines for accessing cells near the physical ports, hence reducing latency and power dissipation.

The cache memory configurations employed in the above-mentioned approaches use fixed bank size and duplicated word and bit lines (without providing dual- and multi-port accesses), and hence incur either moderate performance degradation or large area overhead.
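The per-cell leakage sums of Equations (2) and (4), and their scaling to an N x M core, can be checked numerically. The following is a minimal sketch under stated assumptions: the per-transistor currents in `IDSUB`, the array size, and the function names are hypothetical placeholders chosen for illustration, not values or code from the paper.

```python
# Hypothetical per-transistor subthreshold leakage currents in amperes.
# These are placeholder values, not measurements from the paper.
IDSUB = {"T1": 10e-9, "T4": 10e-9, "T5": 10e-9, "T7": 10e-9}

# Leaking transistors for a cell storing "0".
SINGLE_PORT = ("T1", "T5", "T4")        # terms of Equation (2)
DUAL_PORT = ("T1", "T5", "T4", "T7")    # terms of Equation (4)


def cell_leakage(transistors, idsub=IDSUB):
    """Sum the subthreshold leakage of the listed transistors for one cell."""
    return sum(idsub[t] for t in transistors)


def core_leakage(n_rows, m_cols, transistors, idsub=IDSUB):
    """Scale the per-cell leakage to a core with N rows and M columns."""
    return n_rows * m_cols * cell_leakage(transistors, idsub)


if __name__ == "__main__":
    n, m = 256, 128  # assumed array size, for illustration only
    print("single-port core leakage (A):", core_leakage(n, m, SINGLE_PORT))
    print("dual-port core leakage (A):  ", core_leakage(n, m, DUAL_PORT))
```

Passing a different `idsub` dictionary lets the same two functions be reused with currents taken from a device model or from simulation.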
V. DYNAMIC MEMORY PARTITIONING
The proposed architecture employs a DMP technique that uses isolation nodes to partition a cache memory block into two virtually independent sections based on real-time access addresses of multiple ports. Figure 3 shows the placement of the isolation control line (ICL) and isolation node on each of the bit lines to divide an SRAM block into the upper and lower sections, which are to be accessed by the upper and lower ports, respectively. A selected ICL turns off isolation nodes based on the real-time access addresses at the upper and lower ports. Compared with the hardwired dual-port SRAM shown in Fig. 2, DMP can provide dual-port accesses without the need for the second pair of bit lines and effectively reduce leakage current, bitline latency and silicon area.

[Figure residue, caption not recovered: bit-line isolation diagram with labels pre-charge circuit, ICL, leakage current, Cell and Set(x) through Set(x+7), WL, Group(g), BL]
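The partitioning idea can be illustrated with a small model of one bank. This is a sketch under assumptions, not the authors' circuit: word lines are numbered 1 to n, ICL(k) is assumed to sit between WL(k) and WL(k+1) (the placement implied by the algorithm description in Section VI), and the function names are hypothetical.

```python
def partition(n_wordlines, k):
    """Split word lines 1..n into upper and lower sections at ICL(k).

    Assumes ICL(k) isolates the bit lines between WL(k) and WL(k+1),
    so valid indices are 1 <= k <= n - 1.
    """
    if not 1 <= k <= n_wordlines - 1:
        raise ValueError("ICL index out of range")
    upper = range(1, k + 1)                 # reachable from the upper port
    lower = range(k + 1, n_wordlines + 1)   # reachable from the lower port
    return upper, lower


def serves_both(n_wordlines, k, upper_addr, lower_addr):
    """Return True if ICL(k) lets both ports reach their word lines at once."""
    upper, lower = partition(n_wordlines, k)
    return upper_addr in upper and lower_addr in lower


if __name__ == "__main__":
    upper, lower = partition(8, 3)
    print(list(upper), list(lower))
    print(serves_both(8, 3, upper_addr=2, lower_addr=6))
```

In this model each port drives only the word lines in its own section, which is why a partition shortens the active bit line seen by each port.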
A small circuit block is used to compare the addresses and calculate ICLs. DMP operations are carried out in parallel with the address (including bank selection) setup and bitline precharge phases, as depicted in Fig. 5. There is no overall operation delay due to DMP. It is important to note that a new ICL is enabled only when the current ICL cannot provide valid multi-port accesses.

[Timing diagram residue: Address, DMP and Data waveforms]
Toc: Operation cycle time; TAS: Addr setup time; TBA: Bank address setup time; TICL: ICL identification time; TDP: Valid data; TP: Precharge time
Fig. 5. Memory Operation Timing with DMP

VI. DYNAMIC MEMORY PARTITIONING ALGORITHMS
Optimal cache partitioning not only provides dual-port access but also reduces delay and minimizes power dissipation. The efficiency of a dynamic partitioning algorithm is determined by three factors: delay, power dissipation and the complexity of the proposed algorithm. Two generic DMP algorithms are developed based on the fine-grain DMP illustrated in Figure 6.

Algorithm DMP-1 provides the optimal partitioning that minimizes bitline latency and power dissipation, with which two ICLs are turned off during partitioning: ICL(j) is turned off for the upper port to access WL(j), while ICL(i-1) is turned off for the lower port to access WL(i), where j < i. This approach ensures the shortest active bitlines for both the upper and lower ports, hence minimizing latency and power dissipation. The pseudo-code of DMP-1 is as follows:

    addr (A) <1:n>; addr (B) <1:n>;
    where addr (A) = i > addr (B) = j;
    if i = j + 1 return ICL (j);
    else return ICL (j) and ICL (i-1);

For coarse-grain DMP (an isolation node is placed on bit lines between every n word lines), DMP-1 is adjusted accordingly to select the correct ICLs to turn off.

For applications that do not require a new partition for every memory access, algorithm DMP-2 minimizes the switching of isolation nodes by identifying whether or not the current partition can facilitate new accesses. In this case, one selected ICL is turned off for a DMP. The generic fine-grain DMP-2 algorithm is pseudo-coded as follows:

    addr (A) <1:n>; addr (B) <1:n>;
    where addr (A) = i > addr (B) = j;
    k = current ICL;
    if (j < k < i) return NULL (no new DMP);
    else return (j + (i-j)/2);

For coarse-grain cases, DMP-2 is adjusted accordingly.

[Figure residue: ICL(1) through ICL(n-1) interleaved with WL(1) through WL(n)]
Fig. 6. Generic DMP Model

VII. CIRCUIT SIMULATION MODEL
Scaling down geometry and voltage parameters has proven to be too simple to predict the behavior of nano-scale technologies. We have used the BSIM Predictive Technology Model [20] to calculate resistance and capacitance of MOS devices. Bit line delay, wire resistance and capacitance are an increasing source of concern in sub-micron cache design [21]. Figure 7 shows the wire resistance trends evaluated from [22] as technology scales down from 90nm to 32nm. Fig. 8 shows transmission gate and MOS device resistance generated with the BSIM Predictive Technology Model [21]. As technology scales down, device resistance decreases and the delay due to isolation nodes will become smaller. Fig. 9 shows the circuit simulation model for the active bit lines with DMP. The isolation node (implemented as a transmission gate) is modeled by RT and 2CT. With Wn/Wp = 0.5, using 90nm technology, we calculated the capacitance and resistance for the transmission gate. For simulation purposes the drive inverter of the SRAM cell is replaced with equivalent Ro and Co. Though the length and width of the wire are difficult to predict in the earlier stages of design, we used a stick diagram to evaluate the length of the wire to be 60λ per memory cell. Rw and Cw represent wire resistance and capacitance between two bit cells on the same bitline. CL is the total load capacitance of the local sense amplifier structure.
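The fine-grain DMP-1 and DMP-2 pseudo-code of Section VI can be transcribed into a runnable form. This is a minimal sketch rather than the authors' implementation: the function names are hypothetical, ICLs are represented by their integer indices, `None` stands for the pseudo-code's NULL, and the division in DMP-2 is assumed to be integer division.

```python
def dmp1(i, j):
    """Fine-grain DMP-1: return the indices of the ICLs to turn off.

    i is the word line addressed by port A, j the word line addressed
    by port B, with i > j as required by the pseudo-code.
    """
    if i <= j:
        raise ValueError("expected addr(A) = i > addr(B) = j")
    if i == j + 1:
        return (j,)           # adjacent word lines share one isolation point
    return (j, i - 1)         # ICL(j) for the upper port, ICL(i-1) for the lower


def dmp2(i, j, k):
    """Fine-grain DMP-2: return a new ICL index, or None to keep ICL(k).

    k is the index of the currently selected ICL. The midpoint uses
    integer division, which is an assumption of this sketch.
    """
    if i <= j:
        raise ValueError("expected addr(A) = i > addr(B) = j")
    if j < k < i:
        return None           # current partition still serves both accesses
    return j + (i - j) // 2   # otherwise pick the ICL midway between j and i


if __name__ == "__main__":
    print(dmp1(7, 2))      # two isolation points
    print(dmp2(7, 2, 4))   # current ICL is reused
    print(dmp2(7, 2, 1))   # a new ICL is selected
```

Returning `None` when the current partition is still usable mirrors the stated goal of DMP-2, which is to avoid switching isolation nodes on every access.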
multi-port memory. Shorter active bit lines also mean less latency.

Cache memory often contributes a large part of the total system power dissipation. This happens because the bit lines remain pre-charged even when not accessed. The proposed architecture reduces leakage power by using bitline isolation and selective pre-charging. Dynamically configured memory not only reduces leakage current by eliminating pass transistors in hardwired multi-port memory, but also reduces the bitline leakage power to half by eliminating additional bitlines. Leakage current in dual-port dynamically configured memory is the same as (2). For a memory core with N rows and M columns the leakage current is reduced to less than half the value of hardwired dual-port memories.

Idcell = N * M * (Idsub(T1) + Idsub(T5) + Idsub(T4)) (6)

[Plot residue: wire resistance RWIRE versus technology node, 32nm to 90nm]
[Plot residue: R-NMOS, R-PMOS and RT-GATE resistance (Ohms) versus technology node, 32nm to 90nm]

[Schematic residue: RC model of the active bit line with isolation nodes, with labels Ro, Rw, Cw, CT and CL]

Fig. 10. Bitline Delay with/without Isolation Nodes
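The RC model described in Section VII (driver Ro and Co, per-cell wire Rw and Cw, isolation node RT with a capacitance CT on each side, and load CL) forms an RC ladder. One common first-order estimate for such a ladder is the Elmore delay. The paper does not state which delay formula it used, so the sketch below is an assumption, and the function names, the `iso_every` parameter and all numeric values are hypothetical.

```python
def elmore_delay(stages):
    """Elmore delay of an RC ladder.

    stages is a sequence of (R, C) pairs ordered from driver to load;
    each capacitance is charged through all resistance upstream of it.
    """
    delay = 0.0
    upstream_r = 0.0
    for r, c in stages:
        upstream_r += r
        delay += upstream_r * c
    return delay


def bitline_ladder(n_cells, ro, co, rw, cw, cl, rt=0.0, ct=0.0, iso_every=None):
    """Build the (R, C) ladder for a bit line spanning n_cells cells.

    If iso_every is set, an isolation node (RT with CT on each side) is
    assumed after every iso_every cells, except after the last cell.
    """
    stages = [[ro, co]]
    for cell in range(1, n_cells + 1):
        stages.append([rw, cw])
        if iso_every and cell % iso_every == 0 and cell < n_cells:
            stages[-1][1] += ct       # CT on the near side of the gate
            stages.append([rt, ct])   # gate resistance and far-side CT
    stages[-1][1] += cl               # sense amplifier load at the far end
    return [tuple(s) for s in stages]


if __name__ == "__main__":
    # Placeholder component values (ohms, farads), for illustration only.
    common = dict(ro=1e3, co=1e-15, rw=2.0, cw=0.2e-15, cl=5e-15)
    plain = bitline_ladder(64, **common)
    isolated = bitline_ladder(64, rt=500.0, ct=0.5e-15, iso_every=8, **common)
    print("delay without isolation nodes (s):", elmore_delay(plain))
    print("delay with isolation nodes (s):   ", elmore_delay(isolated))
```

Comparing the two ladders for the same span shows the overhead the isolation nodes add; the benefit of partitioning appears when `n_cells` is reduced to the length of the active section only.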
Using 90 nm technology data we estimated bitline delay by
calculating resistance and capacitance in the drive circuit,
Combined with local sense amplifiers and port multiplexing, the proposed DMP can support efficient multi-port cache architecture to reduce silicon footprint, wiring contention, power dissipation and bitline latency.

REFERENCES
[1] X. Chen and H. Bajwa, "Energy-efficient dual-port cache architecture with improved performances," in IEE Journal of Electronic Letters, Vol. 43, No. 1, pp. 12-14, Jan. 2007.
[2] N. S. Kim, K. Flautner, D. Blaauw, T. Mudge, "Circuit and Micro-architectural techniques for reducing cache leakage power," IEEE Transaction on VLSI Systems, Vol. 12, No. 2, pp. 167-184, Feb. 2004.
[3] M. Powell, S. Yang, B. Falsafi, K. Roy, and T. Vijaykumar, "Gated-Vdd: A circuit technique to reduce leakage in deep-submicron cache memories," in Proc. IEEE/ACM Int. Symposium on Low Power Electronics and Design, pp. 90-95, 2000.
[4] S. Kim, N. Vijaykrishnan, M. Kandemir and M. J. Irwin, "Optimizing Leakage Energy Consumption in Cache Bitlines," in Journal of Design Automation for Embedded Systems, Vol. 9, No. 1, pp. 5-18(14), Mar. 2004.
[5] S.-H. Yang and B. Falsafi, "Performance and Energy Trade-offs of Bitline Isolation in Nano-scale CMOS Caches," presented at the Workshop on Complexity-Effective Design (WCED) held in conjunction with the 30th International Symposium on Computer Architecture (ISCA-30), Jun. 2003.
[6] R. Rao, J. Wence, D. Franklin, R. Amirtharajah and V. Akella, "Exploiting Non-Uniform Memory Access Pattern through Bit Line Segmentation," presented at the Workshop on Memory Performance Issues, in conjunction with High Performance Computer Architecture (HPCA), Feb. 2006.
[7] B. Amelifard, F. Fallah, M. Pedram, "Reducing the sub-threshold and Gate-tunneling Leakage of SRAM cells using Dual-Vt and Dual-Tox Assignment," in IEEE Proceedings of Design, Automation and Test, Vol. 1, pp. 1-6, 2006.
[8] S. Mutoh, T. Douseki, Y. Matsuya, T. Aoki, S. Shigematsu, and J. Yamada, "1-V power supply high-speed digital circuit technology with multithreshold-voltage CMOS," in IEEE J. Solid-State Circuits, Vol. 30, pp. 847-854, Aug. 1995.
[9] N. Vijaykrishnan, M. Kandemir, M.J. Irwin, "Optimizing Leakage energy Consumption in cache Bit-lines," in Design Automation for Embedded Systems, Vol. 9, pp. 5-18, 2004.
[10] R.D. Adams, "High Performance Memory Testing: Design Principles, Fault Modeling and Self-Test," Kluwer Academic Publishers, 2003.
[11] M. Mamidipaka, K. Khouri, N. Dutt, and M. Abadir, "Analytical Models for Leakage Power Estimation of Memory Array Structures," in International Conference on Hardware/Software and Co-design and System Synthesis (CODES+ISSS), pp. 146-151, 2004.
[12] C.H. Kim and K. Roy, "A Leakage Tolerant Cache Memory for Low Voltage Microprocessors," in the Proc. of the 2002 International Symposium on Low Power Electronics and Design, pp. 251-254, 2002.
[13] J.T. Kao and A.P. Chandrakasan, "Dual threshold voltage techniques for Low-Power digital circuits," in IEEE Journal of Solid State Circuits, Vol. 35, No. 7, pp. 1009-1018, Jul. 2000.
[14] C. Thondapu, P. Elakkumanan, R. Sridhar, "RG-SRAM: a low gate leakage memory design," in the Proc. of the IEEE Computer Society Annual Symposium on VLSI, pp. 295-296, 2005.
[15] D.H. Albonesi, "Selective Cache Ways: On-Demand Cache Resource Allocation," in Proc. of the 32nd Annual International Symposium on Microarchitecture, pp. 248-259, Nov. 1999.
[16] A. Karandikar and K.K. Parhi, "Low power SRAM design using hierarchical divided bitline approach," in Proc. Int. Conf. Computer Design: VLSI in Computers and Processors, pp. 82-88, 1998.
[17] K. Ghose, M.B. Kamble, "Reducing power in superscalar processor caches using sub-banking, multiple line buffers and bit-line segmentation," in Proc. of the International Symposium on Low Power Electronics and Design, pp. 70-75, Aug. 1999.
[18] B.D. Yang and L.-S. Kim, "A Low Power SRAM Using Hierarchical Bit Line and Local Sense Amplifier," in IEEE Journal of Solid State Circuits, Vol. 40, No. 6, pp. 1366-1376, Jun. 2005.
[19] Z. Zhu, K. Johguchi, H.J. Mattausch, T. Koide, T. Hironaka, "Low power bank-based multi-port SRAM design due to bank standby mode," in Proc. of the 47th Midwest Symposium on Circuits and Systems, Vol. 1, pp. 569-72, 2004.
[20] W. Zhao, Y. Cao, "New Generation of Predictive Technology Model for Sub-45nm Design Exploration," in the Proc. of the 7th International Symposium on Quality Electronic Design, pp. 585-590, 2006.
[21] P. Kapur, J.P. McVittie, K.C. Saraswat, "Technology and reliability constrained future copper interconnects. I. Resistance modeling," in IEEE Transactions on Electron Devices, Vol. 4, pp. 590-597, 2002.
[22] C. Grecu, P.P. Pande, A. Ivanov, R. Saleh, "A scalable communication-centric SoC interconnect architecture," in the Proc. of 5th International Symposium on Quality Electronic Design, pp. 343-348, 2004.