Editors
Karan Bhatia
Texas Instruments
Massimo Alioto
National University of Singapore
Danella Zhao
Old Dominion University
Andrew Marshall
University of Texas at Dallas
Ramalingam Sridhar
University at Buffalo
The SOC Conference is sponsored by the IEEE Circuits and Systems Society
Design Track
Chair: Gururaj Shamanna (Qualcomm, USA)
I. INTRODUCTION
For next-generation mobile SoCs, the 10nm FinFET process is
being prepared as the successor to the latest 14nm FinFET
process. Lithography is more difficult at 10nm. Because extreme
ultraviolet lithography still suffers from issues such as high
cost and low productivity, the double patterning technology
(DPT) of the 14nm process was instead extended to triple
patterning technology (TPT) and quadruple patterning technology
(QPT). Adopting TPT and QPT in the 10nm process requires
coloring of the affected layers. However, 3-coloring and
4-coloring are fundamentally NP-complete problems, so in a
large mobile SoC design they are too difficult to finish in a
reasonable runtime [1].
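As a rough illustration of why coloring gets expensive, the sketch below treats mask assignment as graph k-coloring and solves it by plain backtracking. This is not the flow used in the paper: the `conflicts` input, the shape names and the `color_layout` helper are all hypothetical, and a production tool would derive conflicts from layout geometry and use far stronger heuristics. Backtracking has exponential worst-case runtime, which is the practical face of NP-completeness.

```python
def color_layout(conflicts, num_masks):
    """Assign one of num_masks mask colors to each shape via backtracking.

    conflicts maps a shape name to the set of shapes placed closer than the
    same-mask spacing limit (hypothetical input, assumed symmetric).
    Returns a dict shape -> mask index, or None if no assignment exists.
    """
    # Visit the most constrained shapes first (a simple ordering heuristic).
    shapes = sorted(conflicts, key=lambda s: -len(conflicts[s]))
    assignment = {}

    def place(i):
        if i == len(shapes):
            return True
        shape = shapes[i]
        # Masks already taken by conflicting neighbours are forbidden.
        used = {assignment[n] for n in conflicts[shape] if n in assignment}
        for mask in range(num_masks):
            if mask not in used:
                assignment[shape] = mask
                if place(i + 1):
                    return True
                del assignment[shape]   # undo and try the next mask
        return False

    return dict(assignment) if place(0) else None


# Four mutually conflicting shapes: impossible with 3 masks, fine with 4.
k4 = {s: {t for t in "ABCD" if t != s} for s in "ABCD"}
tpt = color_layout(k4, 3)
qpt = color_layout(k4, 4)
```

Here `tpt` comes back as `None` (a color conflict that needs a layout fix), while `qpt` is a valid four-mask assignment.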
When the process node scaled down from 14nm to 10nm, the 1X
metal pitch shrank from 64nm to 48nm and the 1X metal
resistance rose to roughly 2.8X [2]. This rapid increase in
metal resistance made voltage drop more severe and high-speed
chip implementation more difficult. In addition, because severe
voltage drop raises the risk of timing failure, it became
difficult to further lower the minimum supply voltage (LVcc)
for a given operating clock frequency, which limits the power
savings of a mobile SoC.
In this paper, we show how serious coloring and the rapidly
increased metal resistance are in 10nm mobile SoC design, and
present practical solutions for both.
[Fig. 1(a) labels: Cell A, Cell B; Cell Flipping; Color Swapping; AL_E1, AL_E2, AL_E3]
because a few critical paths can be forced to detour by already-routed
non-critical paths, and the timing correlation between a P&R tool
and a static timing analysis (STA) tool is imperfect.
Instead of the typical global routing, a 2-step global
routing that routes critical paths first and non-critical paths
afterward can give better design speed. The critical paths are
extracted by STA from a design pre-routed with the typical
routing flow. In the CPU design, the 2-step global routing
provided a 1.69% faster clock speed and 18.7% fewer paths with
setup time violations than the typical routing flow.
C. Low Power Design
The adaptive clocking circuit reduces the risk of timing
failures by modulating the clock frequency dynamically when a
serious voltage drop occurs in the chip, which allows LVcc to
be lowered further [4]. For real-time monitoring of voltage
drop, we implemented a droop detector consisting of a 128-stage
delay line and a delay comparator that compares the line's
delay under voltage drop with a reference delay. When the
voltage drop exceeds a pre-defined threshold, the droop detector
generates an error signal and a clock modulator built on a
clock gating cell lowers the clock to half its frequency.
With adaptive clocking implemented in the 10nm CPU, LVcc was
lowered by 7.1% in high-frequency operation.
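The behaviour described above can be sketched as a small behavioural model. Only the 128-stage delay line, the threshold comparison and the halving of the clock come from the text. The alpha-power delay model, every numeric constant (`vth`, `alpha`, `k`, `vnom`, `threshold`), the supply trace and the function names are illustrative assumptions, not values from the paper.

```python
STAGES = 128  # delay-line length stated in the text


def stage_delay_ps(vdd, vth=0.3, alpha=1.3, k=4.0):
    """Alpha-power-law gate delay; vth, alpha and k are illustrative guesses."""
    return k * vdd / (vdd - vth) ** alpha


def droop_error(vdd, vnom=0.8, threshold=0.05):
    """True when the delay line is slower than the reference by more than threshold."""
    line = STAGES * stage_delay_ps(vdd)
    reference = STAGES * stage_delay_ps(vnom)   # delay at the assumed nominal supply
    return line > reference * (1.0 + threshold)


def modulated_clock_mhz(f_mhz, vdd_trace):
    """Per-sample clock frequency: halved while the droop detector flags an error."""
    return [f_mhz / 2 if droop_error(v) else f_mhz for v in vdd_trace]


trace = [0.80, 0.79, 0.72, 0.70, 0.78, 0.80]   # made-up supply samples in volts
clocks = modulated_clock_mhz(2000.0, trace)
```

In this model the two deepest samples (0.72 V and 0.70 V) cross the 5% delay threshold, so the clock runs at half rate only for those samples and returns to full rate once the supply recovers.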
Fig. 1. (a) Methods to solve color conflicts in the P&R stage; (b) physical
layer separation using special cells pre-placed for QPT coloring.
Fig. 2. A 10nm test chip (58 mm²) and ARM low-Vcc shmoo plot.
III. SUMMARY
REFERENCES
[1]
[2]
[3]
[4]
I. INTRODUCTION
Single-carrier frequency division multiple access (SC-FDMA)
is the part of the LTE protocol used for up-link data
transmission. It involves discrete Fourier transform (DFT)
pre-coding of the transmitted signal, where the DFT can be any
one of 35 transform sizes N from 12 to 1296 points, with
N = 2^a 3^b 5^c (a, b, c are non-negative integers). FPGAs are
targeted because of their rapidly growing use in
communications applications, e.g., base stations and remote
radio heads at the top of cell phone towers. Here we provide
results of mapping the architecture to Xilinx Virtex and Altera
Stratix devices.
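The set of 35 transform sizes can be reproduced with a few lines of Python. The sketch assumes, beyond what the text states, that every size is a multiple of 12 (one LTE resource block spans 12 subcarriers); the function names are hypothetical.

```python
def is_5_smooth(n):
    """True if n has no prime factors other than 2, 3 and 5."""
    for p in (2, 3, 5):
        while n % p == 0:
            n //= p
    return n == 1


def sc_fdma_dft_sizes(n_min=12, n_max=1296):
    """Multiples of 12 in [n_min, n_max] of the form 2^a * 3^b * 5^c."""
    return [n for n in range(n_min, n_max + 1, 12) if is_5_smooth(n)]


sizes = sc_fdma_dft_sizes()
```

`sizes` starts at 12, ends at 1296 and contains 35 entries; a value such as 84 = 12 x 7 is excluded because 7 is not an allowed factor.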
FPGAs as an implementation platform have unique
features such as large numbers of embedded multipliers and
memories, leading to very different design tradeoffs compared
to ASIC designs. In particular, embedded elements in such
quantities make them almost free, compared to their ASIC
implementation costs. Consequently, the design goal is to
produce a circuit that minimizes the expensive FPGA lookup-table (LUT) and register fabric usage rather than embedded
element usage. An additional motivation for this goal is that
this fabric is also the source of most of the FPGA dynamic
power consumption, as opposed to the embedded elements, an
important consideration since FPGAs are increasingly being
used in mobile devices.
II. BACKGROUND
A. FFT computing models
The proposed memory-based FFT model departs
considerably from traditional memory-based designs as
illustrated in Fig. 1. Here a traditional high-performance
memory-based design (Fig. 1a) contains physically separate
Fig. 1. (a) Traditional memory-based FFT architecture and (b) proposed fine-grained, locally connected equivalent.
Table residue (cells in extracted order):
Design: Proposed, Xilinx [1], Proposed, Altera [2]
FPGA: Virtex-6, Virtex-6, Stratix III, Stratix III
LUT, Reg: 2975, 3851, 3816, 2600, 2853, 4326, 3188, N/A
V. CONCLUSION
We have shown how a new memory-based model for the
FFT combines algorithm efficiency and programmability with
new circuit features leading to higher throughputs, lower
latencies and at the same time reduced LUT/register usage
compared to other FPGA implementations.
REFERENCES
[1] Application Note: Xilinx DFT v3.1, DS615, Mar. 1, 2011, and Altera
DFT/IDFT Reference Design, 464, May 2007.
[2] J. Chen, J. Hu, and S. Li, High throughput and hardware efficient FFT
architecture for LTE application, Proc. 2012 IEEE Wireless
Communications and Networking Conf., pp. 826-83.
[3] C. V. Niras and V. Thomas, Systolic variable length architecture for
discrete Fourier transform in Long Term Evolution, Int. Symp. on
Electronic System Design, 2012, pp. 52-55.
[4] J. G. Nash, High-throughput programmable systolic array FFT
architecture and FPGA implementations, Int. Conf. on Computing,
Networking and Communication, Honolulu, HI, Feb. 2014, pp. 878-884.
CMOSFET circuits; battery
I. INTRODUCTION
Nano-scale CMOS technologies are used to implement
integrated circuits because scaling down the feature size
brings better high-frequency characteristics, lower power
consumption, higher integration capability, and lower cost in
mass production. However, the thinner gate oxide in
nano-scale CMOS technology poses major challenges for
analog design and can seriously degrade the overall
robustness of IC products [1]. In a mixed-signal SoC, several
power domains ranging from 1 V up to 5 V (and higher) are
present at the I/O interfaces, especially in battery-powered
devices. To reach 5 V compatibility in the available 28 nm
CMOS process, the options are LDMOS devices, stacked
standard transistors that can each handle only 1.8 V, or a
combination of the two.
The widespread use of battery-operated systems, the
relatively slow progress of the battery performance/cost ratio,
and the need to minimize maintenance procedures such as
battery replacement are pushing the design of very low-voltage,
low-power systems, both digital and analog.
This work highlights the main design challenges faced
during the design of a low-power Power Management Unit
(PMU) for a GNSS receiver.
II. SYSTEM DESCRIPTION
Fig. 1 shows a simplified block diagram of the proposed
PMU. It can select either the main supply or the backup
battery to power the system. To be compatible with a 4.8 V
input standard, a rather complicated stacked architecture was
implemented for the backup switch. A high-PSRR, wide-supply
1-V bandgap generator, together with a decision voltage
regulator, allows the system to decide continuously whether to
switch to the main supply or to the backup battery. This
bandgap consumes less than 3 µA of current. Several linear
regulators power the different sub-systems: five are
capacitor-less and four are filtered with an external
capacitor. The start-up circuitry is completed by three
comparators consuming 0.75 µA each, a reference current
generator derived from the bandgap circuit, and a power-on
reset circuit. The capacitor-less regulators generating Vdec
and Vddb are always on, as is LDOA, consuming only 3 µA in
total. To achieve this low current consumption, when the
system is in stand-by mode LDOA drops its voltage to 1.0 V
(from 1.5 V nominal) and LDOB goes into bypass, so only
LDOA, and not LDOB, counts toward the total consumption.
The DCDC converter uses a hysteretic architecture with LDMOS
power devices and reaches 85 % efficiency at 3.0 V input for
a maximum load current of 100 mA. Vcore can be modulated
depending on the system state: a higher Vcore gives better
noise rejection, but at the cost of lower system efficiency
(from the main supply down to Vddc). Several isolated linear
regulators power sensitive RF sub-blocks such as the LNA, PLL,
and ADC. All the capacitor-less regulators use the
architecture in [2] and support load currents ranging from
5 mA for LDOB up to 30 mA for LDOE, which is used for the OTP
memory. A 32 kHz, 1.5 % temperature-compensated oscillator,
together with a PSEQ state machine, ensures proper control and
start-up of the whole system, also when the digital core is
enabled.
Christian Schippel
Globalfoundries
Wilschorfer Landstrasse 101, 01109 Dresden, Germany
christian.schippel@globalfoundries.com
III. DESIGN CHALLENGES
This is a list of the main design challenges faced during
the design of the proposed system.
A. Backup switch
The (start-up) backup switch selects whether the system
runs on the main supply or on the backup battery (in
stand-by power mode). The input standard is 4.8 V, while
the maximum Vgs allowed by the process is 1.8 V for both
LDMOS and normal MOS transistors. A novel architecture
was implemented to achieve this, consuming only 1 µA. It
makes use of LDMOS and is fully integrated on chip.
B. Overvoltage protection
All the blocks connected to the input lines had to be
specifically designed to tolerate voltages higher than
1.8 V, in terms of Vds and Vgs. Not all the blocks could
be implemented using only LDMOS (e.g., the backup switch);
some required a mix of low-voltage devices (1 V and 1.8 V)
and LDMOS.
C. Stand-by current
The specification targeted a sub-20 µA system at
minimal consumption. About 5 µA are lost as leakage in
the memories supplied by Vddb. The remaining 15 µA are
shared among the bandgap, the comparators, the bias
voltage generators that protect junctions from over-voltage
conditions, the current generators, three linear regulators
Intelligent Low Power Wake-Up Protocol for Multi-Regulator Power Management Architectures
Sunny Gupta, Kumar Abhishek, Nitin Pant, Garima Sharda, Gautham S. Harinarayan,
Automotive Microcontrollers and Processors,
NXP Semiconductors.
sunny.gupta, kumar.abhishek, nitin.pant, garima.sharda, gautham.harinarayan@nxp.com
I. INTRODUCTION
Systems-on-Chip (SOCs) integrate a large number of analog
and digital circuits, enabling a wide variety of features.
Depending on the application use case, many parts of the chip
need not be active during a given window of time (Figure 1).
Turning off the clock sources to these inactive circuits stops
their dynamic power consumption, but a sizable amount of
static leakage remains, and it increases at smaller technology
nodes. To maximize power savings, additional features such as
power gating therefore need to be introduced. The SOC may
also have different circuits operating at different voltages,
which must be generated from the externally driven input
supply. This is done by on-chip voltage regulators.
Finally, all the input and generated supplies need monitors to
ensure that each voltage stays within its design
specification. The circuits used for this are called Power-on
Resets (PORs) and Low and High Voltage Detectors (LVDs
and HVDs). Together, all these circuits form the Power
Management Controller (PMC) of the SOC [1], [2].
V. MEASUREMENT RESULTS
Figure 4. The Standby mode exit protocol of our design
[1] Andre Mansano, Andre Vilas Boas, Alfredo Olmos, Stefano Pietri, and
Jefferson D. B. Soldera, Power management controller for automotive
MCU applications in 90nm CMOS technology, IEEE International
Symposium on Circuits and Systems (ISCAS), pp. 2545-2548, 2011.
[2] Stefano Pietri, Chris Dao, Juxiang Ren, Jehoda Refaeli, and Alfredo
Olmos, Safety oriented automotive MCU power management, 2012
IEEE/IFIP 20th International Conference on VLSI and System-on-Chip
(VLSI-SoC), pp. 36-40, Oct. 2012.
[3] G. Harinarayan, et al., A Robust Architecture for a Complex On-Chip
Power Management Controller with External Regulator Handshake for
Automotive SOCs, 2015 IEEE 28th International System-on-Chip
Conference (SOCC), pp. 80-81, Sept. 2015.
[4] Gupta, et al., Integrated circuit wake-up control system, U.S. Patent
9,252,774, issued February 2, 2016.
I. INTRODUCTION
The global internet-based connection of numerous physical
devices has driven the development of platforms for the
Internet of Things (IoT). Billions of devices, ranging from
sensor nodes, RFID tags and mobile devices to electronic
appliances, interact with each other and with cloud servers to
enable smart connectivity in IoT [1, 2]. IoT applications span
domains such as the industrial supply chain, health care, smart
homes and autonomous vehicles. The vast amount of data
collection, storage and communication creates the need for
data security, resilience to attacks, secure authentication and
user privacy [3]. These security and privacy challenges are
compounded by the lack of compute resources and energy at the
IoT edge devices. Edge devices such as sensor nodes
and RFID tags consist of modest processing engines optimized
for ultra-low power operation, which limits traditional
implementations of crypto algorithms. The cost limitation of
these devices further restricts the implementation of extensive
security features. The edge devices may also be passively
powered (as in the case of RFIDs) or powered from a highly
energy-constrained platform. These challenges have
necessitated the design of energy-efficient and
ultra-lightweight crypto primitives and hardware accelerators to
address the security requirements of IoT applications.
One of the critical crypto primitives for data encryption and
secure communication protocols is a True Random Number
Generator (TRNG). TRNG circuits harness random physical
phenomena to generate high-entropy bit streams that are used
as encryption keys, session IDs and nonces. Conventional TRNG
circuits sample and digitize thermal noise using
Analog-to-Digital Converters, jitter in Ring Oscillators, or the
resolution state of a metastable element [4,5]. Variations in
process and operating conditions introduce bias and correlation
in the TRNG output, making it non-ideal for cryptographic
applications. These non-idealities are compensated using
adaptive circuit tuning and extensive algorithmic
post-processing using crypto-hashes or ciphers [4]. This increases the
The RNG consists of three independent all-digital
self-calibrating entropy sources followed by correlation
suppressors. The three uncorrelated raw bit streams are
combined using the Barak-Impagliazzo-Wigderson (BIW)
extractor [9] to obtain a full-entropy random bit stream, as
shown in Fig. 1. The all-digital entropy source consists of a
pair of cross-coupled inverters whose internal nodes are
pre-charged to the high gain state of Vcc and allowed to resolve
to a stable state. Ideally, the resolution state depends on the
differential thermal noise in the inverters. However, process
variation and power supply noise can bias the circuit so that
it generates a disproportionate 0/1 ratio. To compensate for
the bias, a digital tuning mechanism turns ON additional
NMOS/PMOS legs in the inverters to provide coarse control of
the effective P/N skew of the inverter pair. A second control
loop configures the delay on the pre-charge clock buffers to
provide fine-grained mismatch compensation. The outputs
of the entropy sources may be self-correlated due to the
adaptive tuning process or mutually correlated due to the
spatial locality of the circuits. A serial decorrelator using
an undersampled XOR-feedback shift register reduces
the correlation between consecutive bits and across the three
entropy sources. The uncorrelated outputs are combined using a
BIW extractor consisting of bit-serial Galois Field
multiplication and addition.
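The extractor step can be illustrated with a short software model. The text only states that the extractor uses bit-serial Galois Field multiplication and addition; the combination a*b + c used below is the commonly described three-source BIW form and is an assumption here, as are the 8-bit word width, the field polynomial (the AES polynomial, chosen only because its test vectors are well known) and the function names.

```python
AES_POLY = 0x11B  # x^8 + x^4 + x^3 + x + 1; the field choice is an assumption


def gf_mul(a, b, poly=AES_POLY, width=8):
    """Bit-serial (shift-and-add) multiplication in GF(2^width)."""
    result = 0
    for _ in range(width):
        if b & 1:
            result ^= a        # add (XOR) the current partial product
        b >>= 1
        a <<= 1                # multiply a by x
        if a >> width:         # degree reached width: reduce modulo poly
            a ^= poly
    return result


def biw_extract(a, b, c):
    """Combine three independent words as a*b + c over GF(2^8)."""
    return gf_mul(a, b) ^ c   # field addition is XOR


word = biw_extract(0x57, 0x83, 0x00)
```

With the third input set to zero the result reduces to the field product, so `word` equals the familiar AES test value 0xC1.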
The RNG circuit, implemented in 14nm high-k metal-gate
CMOS technology, occupies a silicon area of 1088 µm². The
decorrelators and BIW extractor constitute 12% of the RNG area
at 131 µm² (1.3K gates), resulting in lightweight entropy
extraction. The circuit has a throughput of 163Mbps at the
[Figure labels, RNG block diagram: Entropy sources A, B, C; Raw streams A, B, C; Correlation suppressors; BIW Extractor (1.3K gates); Full-entropy bitstream. Entropy source: inv0, inv1, clock delay (clk0, clk1), clkconf0[4:0], clkconf1[4:0], nconf0[5:0], nconf1[5:0], pconf0[1:0], pconf1[1:0], loop control logic.]
[Figure labels, encryption datapath: Data registers and Key registers (indices 0-15, data[15], key[15]); plaintext, key, nextdatain; Sbox, Map, InvMap, MixCols, Inv MixCols, Key generate; control signals encrypt, keygen, load, nextdata, lastround, first round.]
[Figure labels, Sbox: SboxIn, Inv. Affine, Inv_in[7:0], Inv_out[7:0], Affine, SboxOut; select encrypt | keygen.]
Fig 3: Hybrid Sbox for combined Sbox and InvSbox operation in GF(2^4)