
DESIGN TRACK DIGEST

29th IEEE International System-on-Chip Conference (SOCC)
September 06-09, 2016, Seattle, WA, USA

Editors
Karan Bhatia
Texas Instruments

Massimo Alioto
National University of Singapore
Danella Zhao
Old Dominion University
Andrew Marshall
University of Texas at Dallas

Ramalingam Sridhar
University at Buffalo

The SOC Conference is sponsored by the IEEE Circuits and Systems Society

Design Track
Chair: Gururaj Shamanna (Qualcomm, USA)

Design Challenges and Practical Solutions for a Mobile SoC in the 10nm FinFET Process
Hyosig Won (1), Hyounsoo Park (2), Kyongjun Noh, Myungsoo Jang, Dayeon Cho, Seongmin Ryu, Minkook Kim
Samsung Electronics
Hwaseong, Korea
(1) hs.won@samsung.com, (2) hs09.park@samsung.com
Abstract: In the 10nm process, the newly adopted triple patterning technology (TPT) and quadruple patterning technology (QPT) cause excessive coloring runtime. For TPT coloring, a combination of shifting, cell flipping, and color swapping ran 3.28X faster than the existing method on a design with 2.1M instances. For QPT coloring, the existing method took over 7 days on a 10,000um x 10,000um design, whereas the proposed method, which divides the large coloring graph into physically isolated small graphs, took 0.17 hours. Another problem in the 10nm process is serious voltage drop. Unlike the previous power routing, which used only the lowest metal layer for standard-cell power, using the lowest two metal layers for power lowered the estimated worst dynamic voltage drop in the CPU design from 193mV to below 130mV, which is acceptable.
Keywords: mobile SoC; multi-patterning technology; dynamic voltage drop; adaptive clocking

I. INTRODUCTION
For the next generation of mobile SoCs, the 10nm FinFET process is being prepared as the successor to the latest 14nm FinFET process. Lithography is more difficult in the 10nm process. Instead of extreme ultraviolet lithography, which has issues such as high cost and low productivity, the double patterning technology (DPT) of the 14nm process was extended to triple patterning technology (TPT) and quadruple patterning technology (QPT). Adopting TPT and QPT in the 10nm process requires coloring of the layers patterned with them. However, in a large mobile SoC design, it is very difficult to finish 3-coloring and 4-coloring, which are fundamentally NP-complete problems, in a reasonable runtime [1].
When the process node scaled down from 14nm to 10nm, the 1X metal resistance increased by a factor of about 2.8 as the metal pitch shrank from 64nm to 48nm [2]. The rapidly increased metal resistance in the 10nm process makes voltage drop more serious and high-speed chip implementation more difficult. In addition, because serious voltage drop raises the risk of timing failure, it has become difficult to further lower the minimum supply voltage (LVcc) at a given operating clock frequency, which a mobile SoC needs for low power.
In this paper, we show how serious the coloring problem and the rapidly increased metal resistance are in 10nm mobile SoC design, and present practical solutions for both.

II. 10NM DESIGN CHALLENGES AND PRACTICAL SOLUTIONS


A. Design Challenges for the Multi-Patterning Technology
To pattern the scaled-down layers in the 10nm process, multi-patterning technology (MPT) is still the key technology. While the 1X metal layers for routing between standard cells are patterned by DPT, the 1X metal layer inside the cell (AL) is patterned by TPT to minimize cell size. For the same reason, the layer for local interconnection (BL) below AL in standard cells is patterned by QPT.
The MPT layers inside standard cells, such as AL and BL, are colored, and design rules including coloring are cleaned, before the standard cells are released to chip designers [3]. Therefore, color conflicts occur at the boundaries of abutted cells. In chip design, coloring of MPT layers is generally executed after the design layout is fixed. However, doing TPT coloring on a fixed layout is too risky because there are special cases where no TPT coloring solution can be found without modifying the layout. To avoid this risk, TPT coloring of AL is executed in the place and route (P&R) stage, and the physical DB of the standard cells carries the AL color information for this purpose.
Fig. 1 (a) shows the methods used to solve color conflicts during P&R of a design: shifting, which inserts space between cells with a color conflict; cell flipping; and color swapping, which reassigns colors. Color swapping solves the 3-color mapping problem easily after reducing it to a 2-color mapping problem [3]. To minimize design area overhead, shifting is used only for cases that cell flipping and color swapping cannot solve.
After cell placement of the design with 2.1M instances, 0.11M instances had AL color violations. On a Linux machine with 8 cores and 256 Gbytes of memory, the combination of shifting, cell flipping, and color swapping cleared the violations in 123 minutes. In contrast, AL coloring of the fixed design layout using the commercial tool took 403 minutes, so the combined method was 403/123 = 3.28X faster.
4-coloring of BL in a fixed design layout can be solved more easily than a typical 4-coloring problem because BL in a standard cell has quite uniform patterns, unlike AL with its varied patterns [1, 3]. However, BL coloring in a large design still does not guarantee an acceptable runtime. On a Linux machine with 8 cores and 256 Gbytes of memory, BL coloring using the latest (2015) version of the commercial tool took over 7 days for a 10,000um x 10,000um design. Until the commercial tool is dramatically enhanced, a workaround is required for the excessive BL coloring runtime.
The runtime of graph coloring, which is an NP-complete problem, depends largely on the graph size. We therefore reduce the graph size for BL coloring at the cost of a slight area overhead. A minimum-size special physical cell from which BL is intentionally removed is pre-placed and fixed before functional cell placement, as shown in Fig. 1 (b). In this way, BL in the design is divided into physically isolated small regions, and each region is colored independently.
In an experiment using the same commercial tool on the same Linux machine as before, our method reduced the coloring runtime from over 7 days to 0.17 hours with under 0.08% area overhead when dividing BL into 500um x 500um regions in a 10,000um x 10,000um design.
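The benefit of physical isolation can be illustrated with a small sketch. This is not the authors' flow or the commercial tool's algorithm; it is a generic backtracking 4-coloring that is applied separately to each connected component of a conflict graph. The function names (`components`, `color_component`, `color_graph`) and the toy graph are assumptions made for illustration only.

```python
from collections import deque

def components(adj):
    """Split an undirected conflict graph into its connected components."""
    seen, parts = set(), []
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        queue, part = deque([start]), []
        while queue:
            node = queue.popleft()
            part.append(node)
            for nxt in adj[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        parts.append(part)
    return parts

def color_component(nodes, adj, k=4):
    """Backtracking k-coloring of one component; returns a dict or None."""
    colors = {}

    def assign(i):
        if i == len(nodes):
            return True
        node = nodes[i]
        used = {colors[n] for n in adj[node] if n in colors}
        for c in range(k):
            if c not in used:
                colors[node] = c
                if assign(i + 1):
                    return True
                del colors[node]
        return False

    return colors if assign(0) else None

def color_graph(adj, k=4):
    """Color each isolated component on its own and merge the results."""
    result = {}
    for part in components(adj):
        sub = color_component(part, adj, k)
        if sub is None:
            return None
        result.update(sub)
    return result

# Toy conflict graph (hypothetical): a 4-clique and a separate triangle.
edges = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c"), ("b", "d"),
         ("c", "d"), ("x", "y"), ("y", "z"), ("x", "z")]
adj = {}
for u, v in edges:
    adj.setdefault(u, []).append(v)
    adj.setdefault(v, []).append(u)

coloring = color_graph(adj)
# Tile count for the paper's example: 500um regions in a 10,000um square.
tiles = (10_000 // 500) ** 2
```

Because the worst-case search cost grows with the size of each component rather than with the size of the whole design, splitting the 10,000um x 10,000um layout into 400 isolated 500um x 500um regions bounds the size of every individual coloring problem.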
Fig. 1. (a) Methods to solve color conflicts in the P&R stage: shifting (Cell A, Cell B), cell flipping, and color swapping (AL_E1, AL_E2, AL_E3); (b) physical layer separation using special physical cells without BL, pre-placed for QPT coloring.

B. Design Challenges by High Metal Resistance
In the power routing of a chip, it is common to use only the lowest metal layer, connected through vias from the power meshes of higher layers, for the power/ground rails of standard cells. However, this power routing may be insufficient in the 10nm process. When this conventional power routing was used in the 10nm 2GHz CPU design, the estimated worst dynamic voltage drop (DVD) was 193mV, which is unacceptable considering the nominal voltage of 0.75V. Even though the serious voltage drop was mainly caused by the high resistance of the lowest power/ground rails, we could not widen the lowest rails because of area overhead. To solve the serious voltage drop without area overhead, the power routing must use metal layers with the same pitch as the lowest metal layer, which were not used as power/ground rails in the 14nm process. When the metal layer directly above the lowest metal layer was added to the previous power routing in the CPU design, the estimated worst DVD fell below 130mV, which is acceptable.
In the 10nm process, the 1X metal layer has several times larger unit resistance than the wider and thicker metal layer directly above it. Therefore, high-speed chip implementation requires using higher metal layers for long-distance critical paths. The global routing flow of commercial P&R tools typically routes all cells at the same route stage after cell placement and clock-tree synthesis. In this flow, it may be difficult to route all critical paths optimally because a few critical paths can be detoured around already routed non-critical paths, and the timing correlation between a P&R tool and a static timing analysis (STA) tool is not perfect.
Instead of the typical global routing, a 2-step global routing, which routes critical paths first and non-critical paths afterward, can give better design speed. Critical paths are extracted from STA of the design pre-routed using the typical routing flow. In the CPU design, the 2-step global routing provided 1.69% faster clock speed and 18.7% fewer setup-time-violating paths than the typical routing flow.
C. Low Power Design
The adaptive clocking circuit reduces the risk of timing failures by modulating the clock frequency dynamically when serious voltage drop occurs in a chip, which allows LVcc to be lowered further [4]. For real-time monitoring of voltage drop in a chip, we implemented a droop detector using a delay line of 128 cell stages and a delay comparator that compares the line's delay under voltage drop with a reference delay. When the voltage drop exceeds the pre-defined threshold, the droop detector generates an error signal, and a clock modulator using a clock gating cell lowers the clock frequency to half. When implementing the adaptive clocking in the 10nm CPU, we obtained a 7.1% lower LVcc in high-frequency operation.

III. SUMMARY

We solved the high complexity of TPT and QPT coloring in the 10nm process by using the combination of shifting, cell flipping, and color swapping, and by physical layer separation. In addition, we overcame the serious voltage drop without area overhead by adding a 1X power metal layer to the previous power routing. As shown in Fig. 2, we implemented a 10nm test chip including a mobile SoC using these practical solutions and successfully verified its operation through EDS and board tests.
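To put the voltage-drop figures from Section II.B in proportion, the short calculation below expresses each worst-case dynamic voltage drop as a fraction of the 0.75V nominal supply. The numbers are taken from the text; the helper name `drop_fraction` is an assumption for illustration, and 130mV is treated as an upper bound because the paper reports "below 130mV".

```python
NOMINAL_V = 0.75  # nominal supply voltage quoted in the paper

def drop_fraction(drop_mv, nominal_v=NOMINAL_V):
    """Return a voltage drop in millivolts as a fraction of the nominal supply."""
    return (drop_mv / 1000.0) / nominal_v

before = drop_fraction(193)      # lowest metal layer only
after_bound = drop_fraction(130)  # lowest two metal layers (upper bound)
min_improvement_mv = 193 - 130   # improvement is at least this large

print(f"one-layer rails: {before:.1%} of nominal")
print(f"two-layer rails: under {after_bound:.1%} of nominal")
```

With the single-layer rails, roughly a quarter of the nominal supply is lost in the worst case, which makes clear why the drop was judged unacceptable.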
Fig. 2. A 10nm test chip (58mm²), with the GPU, CPU1, and CPU2 blocks labeled, and ARM low Vcc shmoo plot.

REFERENCES
[1] B. Yu, K. Yuan, B. Zhang, D. Ding, and D.Z. Pan, "Layout decomposition for triple patterning lithography," ICCAD, 2011, pp. 1-8.
[2] D. Guo, et al., "10nm FINFET technology for low power and high performance applications," ICSICT, 2014, pp. 1-4.
[3] K.B. Agarwal and L.W. Liebmann, "Hierarchical approach to triple patterning decomposition," Mar. 26, 2015, US Patent 0089457 A1.
[4] K.A. Bowman, C. Tokunaga, T. Karnik, V.K. De, and J.W. Tschanz, "A 22nm Dynamically Adaptive Clock Distribution for Voltage Droop Tolerance," Symposium on VLSI Circuits, 2012, pp. 94-95.

Efficient Circuit Architecture and FPGA Implementation for LTE Single Carrier FDMA DFT
J. Greg Nash, Senior Member, IEEE
Centar LLC, Los Angeles, CA USA jgregnash@centar.net
Abstract: A new memory-based circuit architecture for computing the DFT is presented and applied to the LTE SC-FDMA DFT protocol requirements. The implementation focuses on efficiently using the LUT/register fabric of FPGA-based hardware. The fastest available commercial design uses 29%/52% more LUTs/registers, while the proposed design is 40% faster in computing LTE resource blocks, a measure that reflects both circuit throughput and latency. The circuit is programmed by simply entering parameter values into a single ROM memory, so that any number of transform sizes, including powers-of-two, can be accommodated. The architecture provides throughput that scales with array size and high dynamic range, and leads to simple, regular implementations.
Keywords: LTE; SC-FDMA; FPGA; DFT; Fast Fourier transform; discrete Fourier transform; non-power-of-two.

I. INTRODUCTION
Single carrier frequency division multiple access (SC-FDMA) is a part of the LTE protocol used for up-link data transmission. It involves a discrete Fourier transform (DFT) pre-coding of the transmitted signal, where the DFT can be any one of 35 transform sizes N from 12 points to 1296 points, with N = 2^a 3^b 5^c (a, b, c are non-negative integers). FPGAs are targeted because of their rapidly growing use in communications applications, e.g., base stations and remote radio heads at the top of cell phone towers. Here we provide results of mapping the architecture to Xilinx Virtex and Altera Stratix devices.
FPGAs as an implementation platform have unique features, such as large numbers of embedded multipliers and memories, that lead to very different design tradeoffs compared to ASIC designs. In particular, embedded elements are available in such quantities that they are almost free compared to their ASIC implementation costs. Consequently, the design goal is to produce a circuit that minimizes usage of the expensive FPGA look-up-table (LUT) and register fabric rather than embedded element usage. An additional motivation for this goal is that this fabric, rather than the embedded elements, is also the source of most of the FPGA dynamic power consumption, an important consideration since FPGAs are increasingly being used in mobile devices.
II. BACKGROUND
A. FFT computing models
The proposed memory-based FFT model departs considerably from traditional memory-based designs, as illustrated in Fig. 1. A traditional high-performance memory-based design (Fig. 1a) contains physically separate arithmetic (butterfly) and data units. The goal in such designs is to sequence data to/from the memories in such a way that data I/O rates are maximized. The model proposed in this paper does the same thing, but at a finer level of granularity, such that data is placed in close proximity to computing resources. As shown in Fig. 1b, this is done using many very small processing elements (PEs), each containing a multiplier/adder and a few registers. Since each PE reads and writes to a small, simple dual-port memory, aggregate bandwidth is limited only by the number of PEs. Additionally, the well-known scalability of array structures means that high bandwidth, and thus performance, is achieved by simply increasing the array size. It is far more difficult to do the same for traditional memory-based designs (Fig. 1a).

Fig. 1. (a) Traditional memory-based FFT architecture and (b) proposed fine-grained, locally connected equivalent.

Fig. 1b also shows that each PE is locally connected to its four neighbors, which keeps interconnections short, resulting in reduced power dissipation and higher clock speeds.
B. Related Work
Both Xilinx and Altera [1] provide users of their FPGAs with options to support the LTE SC-FDMA DFT protocol using a memory-based architecture as in Fig. 1a, consisting of a single multi-port memory that sends/receives data to/from a single arithmetic unit that performs the required butterfly computations. For these designs the number of clock cycles per DFT is greater than the transform size N, so it is not possible to continuously stream data into and out of the circuit.
A couple of other published designs differ from those discussed above in that they use either a higher-radix memory-based design [2] or a pipelined architecture [3] to reduce the overall number of cycles needed to compute a DFT to the actual transform size N.


III. IMPLEMENTATION
For our proposed memory-based architecture, transforms are performed using a 6x6 PE virtual array to compute the appropriate butterflies for the mixed radices needed. Since N can also be obtained from the expression N = 2^a 3^b 4^c 5^d 6^e, where all exponents are non-negative integers, additional radices can be employed, improving computational efficiency compared to using only radices 2, 3, and 5. Twiddle memory is minimized by on-the-fly generation of values. More implementation details are provided in [4].
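The transform sizes and the effect of the extra radices can be checked with a short sketch. It assumes that every LTE SC-FDMA size is a multiple of 12 (one resource block spans 12 subcarriers), which together with the 12-to-1296 range from Section I yields the 35 sizes. The greedy largest-radix-first factorization in `radix_plan` is an illustrative choice, not the scheduling used in the proposed circuit, and both function names are hypothetical.

```python
def lte_dft_sizes(n_min=12, n_max=1296):
    """Multiples of 12 in [n_min, n_max] whose only prime factors are 2, 3, 5."""
    sizes = []
    for n in range(n_min, n_max + 1, 12):
        m = n
        for p in (2, 3, 5):
            while m % p == 0:
                m //= p
        if m == 1:
            sizes.append(n)
    return sizes

def radix_plan(n, radices=(6, 5, 4, 3, 2)):
    """Greedy factorization of n into the given radices, largest first."""
    plan = []
    for r in radices:
        while n % r == 0:
            plan.append(r)
            n //= r
    if n != 1:
        raise ValueError("size is not a product of the given radices")
    return plan

sizes = lte_dft_sizes()
print(len(sizes), sizes[0], sizes[-1])

# Stage count for N = 1296 with and without the extra radices 4 and 6.
print(radix_plan(1296))             # four radix-6 stages
print(radix_plan(1296, (5, 3, 2)))  # eight stages with radices 5, 3, 2 only
```

For N = 1296 = 6^4, allowing radix 6 halves the number of butterfly stages compared with using only radices 2, 3, and 5, which is the kind of efficiency gain the text refers to.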

IV. COMPARATIVE ANALYSIS
A. Introduction
In order to provide a more relevant metric than throughput and latency numbers alone, we also calculate, where possible, the length of time necessary to compute an LTE resource block (RB). The RB is the minimum processing unit of data for the LTE protocol and consists of 7 symbols (for the normal cyclic prefix). This is a better performance comparison metric in that it requires both low latency and high throughput for good results.
B. Commercial FPGAs
For comparison with Xilinx FPGAs, a Virtex-6 (XC6VLX75T-3) FPGA was used as the target hardware for both the Xilinx and the proposed circuit. The Xilinx LogiCORE IP version 3.1 was used to generate a 16-bit version of their DFT because its SQNR of 60.0 dB (average over all 35 transform sizes) was comparable to the proposed circuit's average SQNR of 61.3 dB.
The resource comparisons in Table I use a Xilinx block RAM normalized to 18K bits, so that a Xilinx 36K block RAM is considered equal to two 18K RAMs. The "RB Avg" column provides the average number of cycles (over all 35 DFT sizes) needed to compute the DFT for the 7 symbols defined by an RB, as a function of the transform size N. Finally, the Fmax (maximum clock frequency) value and the number of RB cycles are combined to provide a measure of throughput, which is normalized to a value of 1 for the proposed design (higher is better). Table I shows that the Xilinx design uses 29%/52% more LUTs/registers while the proposed design provides 40% higher RB computation throughput, so the overall combined gain is significant. The proposed design uses more embedded memory and multipliers, but this was less of a consideration, as discussed in Section I.
Altera does not offer an LTE DFT core as Xilinx does; however, they have published results of an example design running on a Stratix III FPGA that provides a useful basis for comparison. This design example differs from the proposed design in that its outputs are not in normal order. Adding buffer circuitry to sort the output data would require additional logic and add ~N additional words of memory (~5 RAM blocks) to the numbers shown in Table I. (Stratix III block RAMs are 9K bits.)
For comparison, the proposed design was also targeted to a Stratix III FPGA of the same speed grade. The Altera implementation uses less logic, but is far slower, both in its lower Fmax and in the increased number of cycles needed to complete the RB computation. Consequently, the proposed design has ~3x higher throughput while its LUT usage is only ~47% higher.

TABLE I. LTE CIRCUIT TECHNOLOGY COMPARISONS
Design      FPGA         LUT   Reg   Blk RAM  Mult 18-bit  Fmax (MHz)  RB Avg  Thrpt norm
Proposed    Virtex-6     2975  2853  19       72           401         16.6N   1
Xilinx [1]  Virtex-6     3851  4326  10       16           403         23.4N   0.71
Proposed    Stratix III  3816  3188  29       60           400         16.6N   1
Altera [1]  Stratix III  2600  N/A   17       32           260         32.9N   0.33
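The normalized throughput column of Table I can be reproduced from the Fmax and RB Avg columns. The sketch below assumes that "combined" means Fmax divided by the RB cycle coefficient, which is one reading of the text that happens to match the tabulated values; the dictionary layout and function names are illustrative assumptions.

```python
# Values copied from Table I; "rb" is the coefficient of N in the RB Avg column.
rows = {
    "proposed_v6":  {"lut": 2975, "reg": 2853, "fmax": 401, "rb": 16.6},
    "xilinx_v6":    {"lut": 3851, "reg": 4326, "fmax": 403, "rb": 23.4},
    "proposed_s3":  {"lut": 3816, "reg": 3188, "fmax": 400, "rb": 16.6},
    "altera_s3":    {"lut": 2600, "reg": None, "fmax": 260, "rb": 32.9},
}

def rb_rate(row):
    """Resource blocks per unit time, up to the common factor 1/N (assumed metric)."""
    return row["fmax"] / row["rb"]

def norm_throughput(design, baseline):
    """Throughput of `design` relative to `baseline`."""
    return rb_rate(rows[design]) / rb_rate(rows[baseline])

def percent_more(a, b):
    """How many percent larger a is than b."""
    return 100.0 * (a / b - 1.0)

xilinx_norm = norm_throughput("xilinx_v6", "proposed_v6")
altera_norm = norm_throughput("altera_s3", "proposed_s3")
xilinx_lut = percent_more(rows["xilinx_v6"]["lut"], rows["proposed_v6"]["lut"])
xilinx_reg = percent_more(rows["xilinx_v6"]["reg"], rows["proposed_v6"]["reg"])
proposed_lut_vs_altera = percent_more(rows["proposed_s3"]["lut"], rows["altera_s3"]["lut"])
```

Under this reading, the reciprocals of the two normalized values give the "40% higher" and "~3x higher" throughput figures quoted in the text.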

C. Other FPGA implementations
Other published design comparison results are shown in Table II for Virtex FPGAs. For the proposed architecture, the average throughput as a function of N over all 35 transform sizes is 2.1N cycles, which is a factor of two higher than each of the other implementations. However, these more complex architectures require far more LUT hardware, 162% and 262% more for [2] and [3], respectively. Although [3] uses fewer registers, this is less meaningful because the 10:1 ratio of LUTs/registers in FPGA hardware leads to imbalances that can cause many registers to be inaccessible. Additionally, the Thrpt norm values in Table II (here based only on throughputs, since latency values weren't supplied) show them to be much slower designs.
TABLE II. LTE CIRCUIT TECHNOLOGY COMPARISONS
Design     FPGA      LUT    Reg   Blk RAM  Mult 18-bit  Fmax (MHz)  Thrpt (cycles)  Thrpt norm
Chen [2]   Virtex-5  7791   N/A   7        44           123         N               0.65
Niras [3]  Virtex-6  10768  786   45       41           61.3        N               0.32
Proposed   Virtex-6  2975   2853  19       72           401         2.1N            1

V. CONCLUSION
We have shown how a new memory-based model for the
FFT combines algorithm efficiency and programmability with
new circuit features leading to higher throughputs, lower
latencies and at the same time reduced LUT/register usage
compared to other FPGA implementations.
REFERENCES
[1] Application Note: Xilinx DFT v3.1, DS615, Mar. 1, 2011, and Altera DFT/IDFT Reference Design, 464, May 2007.
[2] J. Chen, J. Hu, and S. Li, "High throughput and hardware efficient FFT architecture for LTE application," Proc. 2012 IEEE Wireless Communications and Networking Conf., pp. 826-83.
[3] C.V. Niras and V. Thomas, "Systolic variable length architecture for discrete Fourier transform in Long Term Evolution," Int. Symp. on Electronic System Design, 2012, pp. 52-55.
[4] J. G. Nash, "High-throughput programmable systolic array FFT architecture and FPGA implementations," Int. Conf. on Computing, Networking and Communication, Honolulu, HI, Feb. 2014, pp. 878-884.

Design Challenges in a Low-Power Management Unit for a GNSS Receiver System in 28nm CMOS
Filippo Neri, Thomas Brauner, Eric De Mey
u-blox
Zürcherstrasse 68, 8800 Thalwil, Switzerland
filippo.neri@u-blox.com, thomas.brauner@u-blox.com, eric.demey@u-blox.com
Christian Schippel
Globalfoundries
Wilschdorfer Landstrasse 101, 01109 Dresden, Germany
christian.schippel@globalfoundries.com
Keywords: low-power; CMOSFET circuits; battery management systems; switches

I. INTRODUCTION
Nano-scale CMOS technologies have been used to implement integrated circuits with the advantages of scaled-down feature size, improved high-frequency characteristics, low power consumption, high integration capability, and low cost for mass production. However, the thinner gate oxide in nano-scale CMOS technology poses big challenges for analog design and can seriously degrade the overall robustness of IC products [1]. In mixed-signal SoCs, several power domains are present at the I/O interfaces, ranging from 1 V up to 5 V (and higher), especially for battery-powered devices. To reach 5 V compatibility in the available 28 nm CMOS process, the options are LDMOS devices, stacked standard transistors that can each handle only 1.8 V, or a combination of the two.
The widespread use of battery-operated systems, the relatively slow progress of the battery performance/cost ratio, and the need to minimize maintenance procedures such as battery replacement are pushing the design of very low-voltage, low-power systems, both digital and analog.
This work highlights the main design challenges faced during the design of a low-power management unit (PMU) for a GNSS receiver.
II. SYSTEM DESCRIPTION
Fig. 1 shows a simplified block diagram of the proposed PMU. It can select either the main supply or the backup battery to power the system. To do so, a complex stacked architecture for the backup switch has been implemented in order to be compatible with a 4.8 V input standard. A high-PSRR, wide-supply 1 V bandgap generator, together with a decision voltage regulator, allows the system to decide continuously whether to switch to the main supply or to the backup battery. This bandgap consumes less than 3 µA of current. Several linear regulators power the different sub-systems: five are capacitor-less and four are filtered with an external capacitor. The start-up circuitry is completed by three comparators consuming 0.75 µA each, a reference current generator derived from the bandgap circuit, and a power-on reset circuit. The capacitor-less regulators generating Vdec and Vddb are always on, as is LDOA, consuming only 3 µA in total. To achieve this low current consumption when the system is in stand-by mode, LDOA drops its voltage to 1.0 V (from 1.5 V nominal) and LDOB goes into bypass, so only LDOA, and not LDOB, counts toward the total consumption. The DCDC converter is implemented in a hysteretic architecture using LDMOS as power devices, with 85 % efficiency at 3.0 V input for a maximum load current of 100 mA. Vcore can be modulated depending on system state: a higher Vcore allows better noise rejection, but at the cost of lower system efficiency (from main supply down to Vddc). Several isolated linear regulators power sensitive RF sub-blocks such as the LNA, PLL, and ADC. All the capacitor-less regulators have been implemented using the architecture in [2] and support load currents ranging from 5 mA for LDOB up to 30 mA for LDOE, which is used for OTP memory. A 32 kHz, 1.5 % temperature-compensated oscillator, together with a PSEQ state machine, ensures proper control and start-up of the whole system, including when the digital core is enabled.
III. DESIGN CHALLENGES
The main design challenges faced during the design of the proposed system are listed below.
A. Backup switch
The (start-up) backup switch selects whether the system runs on the main supply or on the backup battery (in stand-by power mode). The input standard is 4.8 V, while the maximum Vgs allowed by the process is 1.8 V for both LDMOS and normal MOS transistors. A novel architecture consuming only 1 µA has been implemented to achieve this goal. It makes use of LDMOS and is fully integrated on chip.
B. Overvoltage protection
All the blocks connected to the input lines had to be specifically designed to tolerate voltages higher than 1.8 V, in terms of Vds and Vgs. Not all the blocks could be implemented using only LDMOS (e.g., the backup switch); some required a mix of low-voltage devices (1 V and 1.8 V) and LDMOS.
Figure 1. PMU simplified block diagram.

C. Stand-by current
The specification targeted a sub-20 µA system at minimal consumption. About 5 µA are lost as leakage in the memories supplied by Vddb. The remaining 15 µA are shared among the bandgap, the comparators, the bias voltage generators that protect junctions from over-voltage conditions, the current generators, three linear regulators (LDOD, LDOA, LDOB), and a 32 kHz clock generator, which has been implemented using only 0.5 µA.
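The stand-by budget can be tallied from the figures the paper does give. The sketch below is only an illustrative bookkeeping exercise: it assumes the three 0.75 µA start-up comparators are the comparators counted in stand-by, it uses the 3 µA bandgap figure as an upper bound, and it leaves the regulators, bias generators, and current generators as an unallocated remainder because their individual currents are not broken out in the text.

```python
TARGET_UA = 20.0          # stand-by target from the specification
MEMORY_LEAKAGE_UA = 5.0   # leakage in the memories supplied by Vddb

# Line items with a current stated in the paper (assumed to apply in stand-by).
known_ua = {
    "bandgap (upper bound)": 3.0,
    "three comparators": 3 * 0.75,
    "32 kHz clock generator": 0.5,
}

remaining_ua = TARGET_UA - MEMORY_LEAKAGE_UA
left_for_others_ua = remaining_ua - sum(known_ua.values())

print(f"available after memory leakage: {remaining_ua} uA")
print(f"left for regulators, bias and current generators: {left_for_others_ua} uA")
```

Under these assumptions, a little over 9 µA is left for the three linear regulators and the bias and current generators.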
D. Layout-dependent effects (LDE)
Several modifiers of device performance had to be considered. These modifiers arise from LDE such as the distance between gates, including dummy poly, which has a direct effect on the drain current of the transistor [3]. Mechanical stress is another modifier, mainly through shallow trench isolation (STI). These effects are not modelled in the parameterised cells (or device models) and, as a result, schematic simulations were far from reality. The design had to go through several iterations in which the quality of the layout was evaluated through extensive extracted-layout simulations, even for the more complex structures.
E. Design rules
Each advanced sub-micron CMOS process generation adds more design rules, and more complex ones. In this case about 20,000 design rules had to be respected, and some very specific advanced rules posed a big challenge for mixed-signal design, among them: density rules, which forced gates to be placed further apart at the cost of worse gate matching; direction orientation rules; transistor proximity rules; and new rules governing legal inter-digitation patterns.
F. Device complexity and variation
Transistors at 28 nm are very small and very fast, and variation is a constant challenge. Transistors are sensitive to channel length and channel doping, and transistor behavior is subject to short-channel effects. The focus has been on minimizing leakage, ensuring reliability, and achieving acceptable yields. For several sub-blocks this translated into digitally assisted calibration.
IV. SIMULATIONS AND MEASUREMENTS
Simulations of the whole system had to be run on the extracted layout only, in order to take LDE factors into account. Stand-by current was under 18 µA in all cases, while safe-operating-area checks showed the system to be reliable with respect to overvoltage protection. Functionality of the system has been successfully verified on silicon, and investigations are currently ongoing to verify the stand-by current.
V. CONCLUSIONS
A challenging low-power management system for a GNSS receiver has been successfully implemented in 28 nm CMOS, and this work highlights the related challenges.
REFERENCES
[1] F. Neri, T. Brauner, E. De Mey and C. Schippel, "Low-power, wide supply voltage bandgap reference circuit in 28nm CMOS," Applied Electrical Engineering and Computing Technologies (AEECT), 2015 IEEE Jordan Conference on, Amman, 2015, pp. 1-6.
[2] T. Y. Man, P. K. T. Mok and M. Chan, "A High Slew-Rate Push-Pull Output Amplifier for Low-Quiescent Current Low-Dropout Regulators With Transient-Response Improvement," in IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 54, no. 9, pp. 755-759, Sept. 2007.
[3] S. Carlson, "The five key challenges of sub-28nm custom and analog design," in http://www.techdesignforums.com/practice/technique/fivekey-challenges-20nm-custom-design/.

Intelligent Low Power Wake-Up Protocol for Multi-Regulator Power Management Architectures
Sunny Gupta, Kumar Abhishek, Nitin Pant, Garima Sharda, Gautham S. Harinarayan
Automotive Microcontrollers and Processors,
NXP Semiconductors.
{sunny.gupta, kumar.abhishek, nitin.pant, garima.sharda, gautham.harinarayan}@nxp.com

Keywords: power management controller, low power, standby, power management architecture, automotive MCU, multi-domain, multi-regulator, 40 nm CMOS.
I. INTRODUCTION
Systems-on-Chip (SOCs) integrate a large number of analog and digital circuits, enabling a wide variety of features. Depending on the application use case, many parts of the chip may not need to be active during a given window of time (Figure 1). We can turn off the clock sources to these inactive circuits, which stops their dynamic power consumption, but a sizable amount of static leakage remains. At smaller technology nodes, static leakage increases. Thus, to maximize power savings, additional features such as power gating need to be introduced. The SOC may have different circuits operating at different voltages, and these voltages must be generated from the externally driven input supply. This is achieved by on-chip voltage regulators. Finally, all the input and generated supplies need monitors to ensure that each voltage is within the range of the design specifications. The circuits used for this are called Power-on Resets (PORs) and Low and High Voltage Detectors (LVDs and HVDs). Together all these circuits form the Power Management Controller (PMC) on the SOC [1], [2].

Figure 1 - Example of various application use case modes

II. PMC ARCHITECTURE
In multi-regulator low power architecture systems, there are various low power modes in the SOC, each capable of certain functions while consuming power within a certain range [3]. A typical example of multiple power modes and their use case profile is shown in Figure 1. In our design, the IPs were grouped into two power domains, PD0 (always ON) and PD1 (switchable domain). There are three voltage regulator circuits, and together they provide the two power domains with regulated supply. There is a power switch in the path between the two power domains, as shown in Figure 2. The lowest power mode, called Standby mode, is where all the system clocks are gated, the power switch is opened, and only an Ultra-Low Power (ULP) regulator is active to keep a certain amount of logic alive for retention purposes. The other regulators are turned off. The high performance Run mode, on the other hand, can potentially have all the logic and functionality in an active state. In this mode the power switch is closed, and the SOC is powered by the main High Power (HP) regulator, which has a high current carrying capacity on the order of 100s of mA. There is an intermediate Functional Low Power mode in which the SOC uses a lower clock frequency. The dynamic current scales with clock frequency, hence the Low Power (LP) regulator, with a current drive strength on the order of 10s of mA, is sufficient.

Figure 2 - PMC Architecture block diagram

III. STANDBY MODE ENTRY - EXIT PROTOCOL
As discussed in the previous section, there are various Low
Power modes, and the entry protocol involves handshaking
between the PMC, digital state machines and clocking
peripherals. The steps during Standby entry are: configuring
wake up circuits, turning off clocks, enabling isolations,
opening the power switch and disabling regulators, as shown in
Figure 3. The protocol to wake up from Standby mode
requires these steps to be carried out as shown in Figure 4.
The first step of the wakeup protocol is enabling a fast internal
oscillator (FIRC). This clock source is needed for the
state machines and wakeup protocol to function. But ULP
regulators are specifically designed to have a very low
self-current, and also have an upper limit on the amount of current
drive they can support. The main requirement of the ULP
regulator is to keep the states of certain memories and state
machines retained, so that the functionalities of the SOC can
be resumed instantaneously without a lengthy reset
sequencing process on recovery from low power modes. In the
worst case the regulator may not handle sudden load
transitions very well, even though the total current itself might
be within design specification. Such sudden load transitions
can be caused during clock startup and can lead to a low
voltage condition upon wakeup. The low voltage event created
in such a condition may end up resetting the whole SOC
instead of gracefully waking up the system. Mixed mode
simulation results depict this limitation in Figure 5.
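The limitation described above can be illustrated with a toy model. The sketch below is not the authors' mixed mode simulation: the `RegulatorLimits` class, the `clock_start_outcome` function and every current value are hypothetical, chosen only to show how a clock start can stay within the total current specification and still exceed the load step the regulator can absorb. The 24 MHz and 32 kHz clock rates are taken from the Figure 6 caption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RegulatorLimits:
    """Hypothetical regulator limits (illustrative values, not from the paper)."""
    max_current_ma: float    # steady-state current the regulator can supply
    max_load_step_ma: float  # largest sudden current step it can absorb


# Assumed figures for an ultra-low power regulator.
ULP = RegulatorLimits(max_current_ma=2.0, max_load_step_ma=0.5)


def clock_start_outcome(reg: RegulatorLimits, idle_ma: float,
                        clock_mhz: float, ma_per_mhz: float = 0.05) -> str:
    """Classify what happens when a clock of `clock_mhz` is started.

    Dynamic current is modeled as proportional to clock frequency
    (`ma_per_mhz` is an assumed scaling factor).
    """
    step_ma = clock_mhz * ma_per_mhz  # sudden load added by the clock
    total_ma = idle_ma + step_ma
    if total_ma > reg.max_current_ma:
        return "over_current"
    if step_ma > reg.max_load_step_ma:
        # Total current is within spec, but the transient is too large.
        return "low_voltage_reset"
    return "ok"


fast = clock_start_outcome(ULP, idle_ma=0.05, clock_mhz=24.0)   # 24 MHz FIRC
slow = clock_start_outcome(ULP, idle_ma=0.05, clock_mhz=0.032)  # 32 kHz clock
print(fast, slow)
```

With these assumed numbers the 24 MHz start stays below the total current limit but exceeds the tolerated load step, which is the failure pattern the text describes; a 32 kHz start produces a negligible step.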

Figure 3 The Standby mode entry protocol of our design

Figure 4 The Standby mode exit protocol of our design

Figure 5 A limitation seen when the ULP regulator faces a sudden
transient load due to the FIRC being enabled

IV. INTELLIGENT LOW POWER WAKE-UP PROTOCOL

The application use case that triggers an exit from low power
modes may require that the wakeup be serviced as soon as
possible. This is especially true if the SOC is used in critical
applications. Thus there is a need to complete the exit protocol
in the shortest time, and to do this we need to use a relatively
fast clock source. But the use of fast clock sources during
wakeup can lead to problems, as described in Section III.
There is usually a slow oscillator (say 32 kHz) as part of the
SOC for enabling the Real Time Clock (RTC) or for functioning
as a back-up clock source. We solved the above limitation by
enabling and utilizing a slow oscillator first when wakeup is
triggered, so that the ULP regulator is able to handle the
load. In parallel we switched ON the main HP regulator. Until
the HP regulator is ready to take the load, the logic is
kept running on the slow oscillator. Once the HP regulator is
ready, we switch the state machine over to the fast
oscillator. In this way the wake up time is not
compromised and the system becomes robust enough for
multiple low power mode transitions. The timing diagram of
the new wake up scheme is shown in Figure 6.

Figure 6 The Intelligent Low Power Wake-up Protocol, with a 32
kHz Slow RC Oscillator and a 24 MHz Fast RC Oscillator

V. MEASUREMENT RESULTS

This SOC design was fabricated in 40nm CMOS as a 32-bit
MCU for automotive and industrial applications. It was
extensively tested in silicon with all possible mode transition
sequences. The design was found to work robustly
across all use cases and mode transitions. Figure 7 shows the
exit from Standby mode, where the HP regulator is turned on
before moving to full power mode. The wakeup time needed
for recovery to full power mode from Standby is 260 µs.

Figure 7 Silicon results showing exit from Standby mode, where
HP regulator is turned on before moving to full power mode

VI. CONCLUSION

A new architecture for a Power Management Controller
supporting multiple regulators and modes, with a novel
scheme for both a very low power retention state and a
robust and fast wakeup, has been presented [4]. Silicon results
show correct behavior. Such an architecture is crucial for
safety, reliability and quick wake up from low power modes.
REFERENCES
[1] Andre Mansano, Andre Vilas Boas, Alfredo Olmos, Stefano Pietri, and
Jefferson D. B. Soldera, "Power management controller for automotive
MCU applications in 90nm CMOS technology," IEEE International
Symposium on Circuits and Systems (ISCAS), pp. 2545-2548, 2011.
[2] Stefano Pietri, Chris Dao, Juxiang Ren, Jehoda Refaeli, and Alfredo
Olmos, "Safety oriented automotive MCU power management," 2012
IEEE/IFIP 20th International Conference on VLSI and System-on-Chip
(VLSI-SoC), pp. 36-40, Oct 2012.
[3] G. Harinarayan, et al., "A Robust Architecture for a Complex On-Chip
Power Management Controller with External Regulator Handshake for
Automotive SOCs," 2015 IEEE 28th International System-on-Chip
Conference (SOCC), pp. 80-81, Sept 2015.
[4] Gupta, et al., "Integrated circuit wake-up control system," U.S. Patent
9,252,774, issued February 2nd, 2016.

Energy Efficient Design of Ultra-Lightweight
Hardware Security Circuits for IoT Applications
Vikram B. Suresh, Sudhir K. Satpathy, Sanu K. Mathew, Ram K. Krishnamurthy
Circuits Research Lab, Intel Corporation, Hillsboro, Oregon, USA
Abstract: The advent of the new paradigm of the Internet of
Things (IoT) has resulted in a boundless increase in data
accumulation and transfer. The reliability of IoT platforms rests
on reliable data security, authentication and resilience to attacks,
demanding energy efficient and ultra-lightweight hardware
accelerators for security. In this work, we present µRNG, an
energy efficient all-digital full-entropy TRNG, and nanoAES, a
lightweight single-Sbox AES datapath, both optimized for IoT
applications to provide high quality and reliable security features
in the edge devices.
Keywords: IoT, Security, TRNG, AES, Lightweight, Energy
Efficient

I. INTRODUCTION
The global internet-based connection of numerous physical
devices has steered the development of platforms for the Internet
of Things (IoT). Billions of devices, ranging from sensor nodes
and RFID tags to mobile devices and electronic appliances, interact
with each other and with cloud servers to enable smart
connectivity in IoT [1, 2]. The application of IoT spans
various domains, including industrial supply chains, health care,
smart homes and autonomous vehicles. The vast amount of data
collection, storage and communication warrants
data security, resilience to attacks, secure authentication and
user privacy [3]. These security and privacy challenges are
compounded by the lack of compute resources and energy at the
edge devices in IoT. Edge devices such as sensor nodes
and RFID tags consist of modest processing engines optimized
for ultra-low power operation, which limits traditional
implementation of crypto algorithms. The cost limitation of
these devices further restricts the implementation of extensive
security features. The edge devices may also be passively
powered (as in the case of RFIDs) or powered on a highly
energy constrained platform. These challenges have
necessitated the design of energy efficient and ultra-lightweight
crypto primitives and hardware accelerators to address the
security requirements of IoT applications.
One of the critical crypto primitives for data encryption and
secure communication protocols is a True Random Number
Generator (TRNG). TRNG circuits harness random physical
phenomena to generate high entropy bit streams that are used
as encryption keys, session IDs and nonces. Conventional TRNG
circuits sample and digitize thermal noise using Analog-to-Digital
Converters, jitter in Ring Oscillators or the resolution
state of a metastable element [4,5]. Variations in process and
operating conditions introduce bias and correlation in the
TRNG output, making it non-ideal for cryptographic
applications. The non-idealities are compensated using
adaptive circuit tuning and extensive algorithmic post-processing
using crypto-hash or ciphers [4]. This increases the
area and energy consumption of the TRNG circuits, making
them unsuitable for IoT applications. In this work, we present
µRNG, an energy efficient lightweight full-entropy TRNG
circuit optimized for IoT applications [6].
Block ciphers are the bedrock of secure data storage and
communication. The Advanced Encryption Standard (AES) is
the de facto symmetric cipher used for data encryption.
Conventionally, the computation intensive AES algorithm is
implemented in software using dedicated instructions, such as
Intel's AES-NI. The restricted processing resources in IoT
devices require a software implementation using primitive
instructions, resulting in low energy efficiency. Hardware
accelerators for AES use 16 Substitution Box (Sbox) instances
for the 128-bit encrypt/decrypt datapath, spanning more than
50K gates and requiring significant energy/bit [7]. In this work,
we present nanoAES, a lightweight single-Sbox AES design
with a hybrid encrypt/decrypt datapath for IoT applications [8].
II. µRNG: ENERGY EFFICIENT ALL-DIGITAL FULL-ENTROPY
TRNG FOR IOT APPLICATIONS

The µRNG consists of three independent all-digital self-calibrating
entropy sources followed by correlation
suppressors. The three uncorrelated raw bit streams are
combined using the Barak-Impagliazzo-Wigderson (BIW)
extractor [9] to obtain a full-entropy random bit stream, as
shown in Fig. 1. The all-digital entropy source consists of a pair
of cross-coupled inverters whose internal nodes are pre-charged
to the high gain state of Vcc and allowed to resolve to
a stable state. Ideally, the resolution state depends on the
differential thermal noise in the inverters. However, process
variation and power supply noise can bias the circuit to
generate a disproportionate 0/1 ratio. To compensate for the bias,
a digital tuning mechanism turns ON additional
NMOS/PMOS legs in the inverters to provide coarse control of
the effective P/N skew of the inverter pair. A second control
loop configures the delay on the pre-charge clock buffers to
provide fine granularity mismatch compensation. The output
of the entropy sources may be self-correlated due to the
adaptive tuning process or mutually correlated due to the
spatial locality of the circuits. A serial decorrelator using an
undersampled XOR-feedback shift-register is used to reduce
the correlation between consecutive bits and across the three
entropy sources. The uncorrelated output is combined using a
BIW extractor consisting of bit-serial Galois Field
multiplication and addition.
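The combining step can be sketched in software. The example below is a minimal illustration and not the circuit described here: it assumes the commonly cited three-source form a*b + c over a binary field, an 8-bit word width, and the AES reduction polynomial 0x11B, none of which the text specifies. The function names `gf256_mul_bit_serial` and `biw_combine` are hypothetical.

```python
REDUCTION_POLY = 0x11B  # x^8 + x^4 + x^3 + x + 1 (assumed, not from the paper)


def gf256_mul_bit_serial(a: int, b: int, poly: int = REDUCTION_POLY) -> int:
    """Multiply two GF(2^8) elements one bit of `b` per iteration."""
    acc = 0
    for _ in range(8):
        if b & 1:
            acc ^= a      # addition in GF(2^n) is XOR
        b >>= 1
        a <<= 1           # multiply the running operand by x
        if a & 0x100:
            a ^= poly     # reduce modulo the field polynomial
    return acc


def biw_combine(a: int, b: int, c: int) -> int:
    """Combine three source words as a*b + c in GF(2^8) (assumed form)."""
    return gf256_mul_bit_serial(a, b) ^ c


print(hex(gf256_mul_bit_serial(0x57, 0x83)))  # 0xc1
print(hex(biw_combine(0x57, 0x83, 0x01)))     # 0xc0
```

The loop mirrors a bit-serial datapath: each iteration consumes one bit of `b`, so a hardware version needs only a shift register, a conditional XOR and a reduction step.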
Fig 1: µRNG Architecture and Entropy Source

Fig 2: nanoAES Architecture with hybrid encrypt/decrypt datapath

The µRNG circuit implemented in 14nm high-k metal gate
CMOS technology occupies a silicon area of 1088 µm². The
decorrelators and BIW extractor constitute 12% of the µRNG area
at 131 µm² (1.3K gates), resulting in lightweight entropy
extraction. The circuit has a throughput of 163Mbps at the
nominal operating voltage of 750mV and a throughput of
0.4Mbps at a scaled supply of 300mV. Peak energy efficiency
of 323Gbps/W is measured at 400mV, resulting in a 7.7x
higher energy efficiency compared to previously published
results [4].
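The efficiency figure can be restated as energy per bit, which is often easier to compare across designs. The short calculation below uses only the numbers quoted above; the helper name `pj_per_bit` is hypothetical, and the baseline value is merely what the stated 7.7x ratio implies, not a figure reported in this text.

```python
def pj_per_bit(gbps_per_watt: float) -> float:
    """Convert an efficiency in Gbps/W to energy in picojoules per bit.

    1 Gbps/W = 1e9 bit/J, so energy per bit = 1e-9 / efficiency joules,
    which equals 1000 / efficiency picojoules.
    """
    return 1000.0 / gbps_per_watt


peak_efficiency = 323.0                   # Gbps/W at 400mV, from the text
implied_baseline = peak_efficiency / 7.7  # Gbps/W implied by the 7.7x ratio

print(f"{pj_per_bit(peak_efficiency):.2f} pJ/bit at peak efficiency")
print(f"{implied_baseline:.1f} Gbps/W implied for the baseline")
```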


III. NANOAES: ENERGY EFFICIENT LIGHTWEIGHT HYBRID AES
ENCRYPT/DECRYPT HARDWARE ACCELERATOR
The nanoAES hardware accelerator consists of a 1-Sbox 8-bit
hybrid datapath for AES-128 encrypt/decrypt operation
with on-the-fly key expansion, shown in Fig. 2. The dataflow
toggles between 16 cycles of encrypt/decrypt operation
interspersed with 4 cycles of key expansion. The 10-round
AES-128 operation has an overall latency of 216 cycles. The
serial accumulating MixColumn and InvMixColumn operations
accumulate scaled factors of four consecutive bytes over 4
clock cycles to compute the mix/inv. mix column outputs of
each 32-bit column. The plaintext or ciphertext and the round
keys are mapped from the native field GF(2⁸) to a composite
field GF(2⁴)² during the first round. This enables the area
and performance intensive inverse computation in the Sbox
operation to be performed in the composite field GF(2⁴).
The hybrid Sbox design (Fig. 3) performs the Sbox operation
during encrypt and key generation and the inverse Sbox
operation during decrypt. The ShiftRow and inverse ShiftRow
operations are performed by writing the intermediate results to
fixed locations of the data register to account for the shift at
byte boundaries. The ciphertext or plaintext is mapped back to
the native field GF(2⁸) during the last round.
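The serial accumulating MixColumn idea can be sketched as follows. This is an illustrative software model, not the nanoAES datapath: it works in the native field with the standard AES polynomial rather than the composite field used by the design, and the names `xtime`, `scale` and `serial_mix_column` are hypothetical. One input byte is consumed per loop iteration, standing in for one clock cycle, and four accumulators collect the scaled contributions.

```python
# Standard AES MixColumns coefficients; row r gives the weights for output byte r.
MIX = (
    (2, 3, 1, 1),
    (1, 2, 3, 1),
    (1, 1, 2, 3),
    (3, 1, 1, 2),
)


def xtime(x: int) -> int:
    """Multiply a GF(2^8) element by 2, reducing with the AES polynomial."""
    x <<= 1
    if x & 0x100:
        x ^= 0x11B
    return x


def scale(x: int, k: int) -> int:
    """Multiply x by a MixColumns coefficient k in {1, 2, 3}."""
    if k == 1:
        return x
    if k == 2:
        return xtime(x)
    return xtime(x) ^ x  # 3*x = 2*x + x


def serial_mix_column(column):
    """Accumulate one 32-bit column over four steps, one input byte per step."""
    acc = [0, 0, 0, 0]
    for cycle, byte in enumerate(column):
        for row in range(4):
            acc[row] ^= scale(byte, MIX[row][cycle])
    return acc


print([hex(b) for b in serial_mix_column([0xDB, 0x13, 0x53, 0x45])])
```

The inverse operation follows the same accumulation pattern with the InvMixColumns coefficients (14, 11, 13, 9) in place of (2, 3, 1, 1).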
Fig 3: Hybrid Sbox for combined Sbox and InvSbox operation in GF(2⁴)

The nanoAES accelerator implemented in 22nm high-k
metal gate CMOS technology occupies a silicon area of
2736 µm². At a nominal supply voltage of 900mV, the design
operates at 1.1GHz, achieving a throughput of 671Mbps. A
peak energy efficiency of 289Gbps/W is achieved at a scaled
supply voltage of 430mV, resulting in an 11x improvement
compared to conventional AES hardware accelerators.
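The latency and throughput figures can be cross-checked with simple arithmetic. The sketch below is a back-of-the-envelope check under stated assumptions: the split of 216 cycles into an initial 16-cycle pass plus ten rounds of 16 + 4 cycles is one decomposition consistent with the numbers in the text, not something the text states, and the function names are hypothetical.

```python
BLOCK_BITS = 128      # AES block size
LATENCY_CYCLES = 216  # from the text


def cycles_for_aes128(rounds: int = 10, data_cycles: int = 16,
                      key_cycles: int = 4, initial_cycles: int = 16) -> int:
    """Assumed breakdown: an initial pass, then (data + key) cycles per round."""
    return initial_cycles + rounds * (data_cycles + key_cycles)


def throughput_mbps(clock_hz: float, latency_cycles: int = LATENCY_CYCLES) -> float:
    """One 128-bit block completes every `latency_cycles` clock cycles."""
    return BLOCK_BITS * clock_hz / latency_cycles / 1e6


def implied_clock_ghz(mbps: float, latency_cycles: int = LATENCY_CYCLES) -> float:
    """Clock frequency that would yield the given throughput."""
    return mbps * 1e6 * latency_cycles / BLOCK_BITS / 1e9


print(cycles_for_aes128())                # 216
print(round(throughput_mbps(1.1e9)))      # 652
print(round(implied_clock_ghz(671.0), 3)) # 1.132
```

With the rounded 1.1GHz clock the model gives about 652Mbps, slightly below the reported 671Mbps; the reported value would correspond to a clock near 1.13GHz, so the quoted frequency is presumably rounded. The peak efficiency of 289Gbps/W corresponds to roughly 1000 / 289, or about 3.46 pJ per bit.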
REFERENCES
[1] D. Giusto, et al., "The Internet of Things," Springer, 2010.
[2] L. Tan, N. Wang, "Future internet: The Internet of Things," at Intl.
Conference on Advanced Computer Theory and Engineering, 2010.
[3] A.R. Sadeghi, C. Wachsmann, M. Waidner, "Security and privacy
challenges in industrial Internet of Things," at DAC, 2015.
[4] G. Cox, et al., "Intel's Digital Random Number Generator," at Hot
Chips 23: A Symposium on High-performance Chips, July 2011.
[5] Kaiyuan Yang, et al., "A 23Mb/s 23pJ/b fully synthesized TRNG in
28nm and 65nm CMOS," ISSCC, pp. 280-281, Feb. 2014.
[6] S. Mathew, et al., "µRNG: A 300-950mV 323Gbps/W all-digital
full-entropy true random number generator in 14nm FinFET CMOS," at
ESSCIRC, 2015.
[7] S. Mathew, et al., "53Gbps Native GF(2⁴)² Composite-Field
AES-Encrypt/Decrypt Accelerator for Content-Protection in 45nm
High-Performance Microprocessors," in IEEE JSSC, Apr 2011.
[8] S. Mathew, et al., "340 mV-1.1 V, 289 Gbps/W, 2090-Gate NanoAES
Hardware Accelerator With Area-Optimized Encrypt/Decrypt GF(2⁴)²
Polynomials in 22 nm Tri-Gate CMOS," in IEEE JSSC, 2015.
[9] B. Barak et al., "Extracting randomness using few independent
sources," SIAM Journal of Computing, v.36, pp.1095-1118, Dec 2006.
