GlitchFreeFPGA HOST12

Glitch-Free Implementation of Masking in Modern FPGAs
Amir Moradi and Oliver Mischke

Horst Grtz Institute for IT Security, Ruhr University Bochum, Germany {moradi, mischke}@crypto.rub.de
AbstractDue to the propagation of the glitches in combinational circuits side-channel leakage of the masked S-boxes realized in hardware is a known issue. Our contribution in this paper is to adopt a masked AES S-box circuit according to the FPGA resources in order to avoid the glitches. Our design is suitable for the 5, 6, and 7 FPGA series of Xilinx although our practical investigations are performed using a Virtex-5 chip. In short, compared to the original design synthesized by automatic tools while requiring the same area (slice count) our design reduces power consumption, critical path delay, and more importantly the side-channel leakage. In our practical investigations we could not recover any rst-order leakage of our design using up to 50 million traces. However, since the targeted S-box realizes a rst-order boolean masking, the second-order leakage could be revealed using around 25 million measurements.
I. I NTRODUCTION With the increasing pervasion of cryptography in more and more embedded systems to protect either the intellectual property of a vendor or to preserve privacy by allowing secure communications, the need of secure implementations of cryptographic primitives like AES is at an all-time high. These implementations should not only be resistant to classical attacks but also be protected against side-channel attacks like power analysis [11], [12]. Countermeasures against power analysis attacks in hardware can be realized on multiple levels. However, if the target platform is an FPGA, the algorithmic-level countermeasures are mainly the possible choices. Masking of sensitive values is one of the most considered solutions, and several schemes have already been published. These options include multiplicative [2], [10], additive [3], [7], [20], or relatively recent afne [9] masking schemes. The problem of masking in hardware could not yet be solved by these schemes. Several attacks have been published, e.g., [13], [15], which exploit a remaining rst-order leakage in the designs. The reason for the remaining leakage namely glitches in the combinational circuits is well known to the community. A couple of new schemes have been proposed to solve this issue by creating glitch-resistant implementations. The notable ones are the threshold implementation (TI) [17], [18], [19] and a new proposal based on a mixture of multi-party computation (MPC) and Shamir secret sharing [22], [23]. However, making a correct TI of most algorithms is very challenging. So far only the Noekeon [8] and the PRESENT [5] S-boxes could be successfully implemented [19], [21]. The MPC scheme has not been practically evaluated yet, but because of the proposed design of the
inversion, the area and speed overheads of a single S-box computation are quite large. In this work we try not to create a glitch-resistant implementation but instead try to avoid causing any glitches. The target of our implementation is the Virtex-5 LX-50 FPGA of the readily available side-channel evaluation platform SASEBOGII [1]. For this we take the very compact masked S-box by Canright-Batina [7] and manually map the combinational functions to the resources of our target platform. By efciently using special enable signals in each FPGA Look-Up-Table (LUT), we can suppress any glitches at the LUT outputs by enabling them only sequentially. We have evaluated different versions of our design including a fully pipelined one achieving a very high clock frequency. Note that although our design has been initially optimized to the 6-Input LUT architecture of the Xilinx Virtex-5 FPGA, the same architecture is used in their newer Series 6 and 7 FPGAs which allows using the same design on these recent platforms. When evaluating the side-channel leakage of our nal design, contrary to the original S-box implementation our design did not show any rst-order leakage by analyzing 50 million measurements. Since the scheme only implements a rst-order masking, a second-order attack is expected to be successful, which is practically conrmed using a very high amount of 25 million measurements. In the next section we briey describe the reasons why we have selected the Canright-Batina masked S-box as the basis of our implementation. Moreover, we introduce the Xilinx LUT architecture and how we have used it to eliminate glitches. Section III gives an overview of our S-box design and names the implementation proles used in the evaluation whose results are depicted in Section IV. Finally, Section V concludes this article. II. TARGETS In the following we will rst give a short summary of the recent masked S-box designs and state why we have chosen the one of Canright and Batina as basis for our modications to create a glitch-free version. Afterwards we will describe the architecture of the Xilinx 6-Input LUT and how we use it to minimize the possible leakage. A. Masked AES S-box As stated previously the currently known glitch-resistant schemes come with some drawbacks. Threshold implementation has been shown to be quite effective when using small
input mask
output mask
Optimized xor/sq/scl/ mul
XORS
Fig. 1.
Masked GF(28 ) Inverter by Canright-Batina (taken from [15]) (a) (b)
S-boxes [21], but because of the large S-box size of AES up to now no expressions could be found to rewrite the AES S-box using this scheme. Note that the implementation reported in [16] has been made by masking the multipliers of a tower-eld implementation of the AES S-box which could not follow the requirements of the threshold implementation. At CHES 2011 a mixture of Shamir secret sharing scheme and multi-party computation was introduced [22]. While it has not been practically evaluated yet, it is clear that the hardware resource requirements are quite high. Furthermore, because of the sequential way of computing the inversion of the S-box a large number of clock cycles are necessary to compute only one S-box output. All these predicted area and time overheads may hinder its practical feasibility. Instead of focusing on glitch resistance in this article we try to avoid any glitches at the FPGA LUTs at all. From the more traditional currently known masking schemes the one of Canright-Batina [7] uses an additive masking and implements the S-box in a tower-eld approach using carefully chosen normal bases to minimize the circuit size. It is based on the area-optimized S-box by Canright [6], and it is still supposed to be the most compact design available. While it was claimed to be perfectly secure by the denition of [3], it was shown in [15] that because of glitches in the circuit there still exists an exploitable rst-order leakage. Figure 1 shows an overview of the GF(28 ) inverter design omitting the towereld conversions. The GF(24 ) inverter is implemented using the same design the only difference being that the inversion in GF(22 ) is also merged to this module. The authors of the original design were kind enough to supply the HDL source code online1 which we used as basis for our modications detailed in the following. B. Xilinx FPGA Resources When not using dedicated hardware blocks like Multipliers/DSPs, a combinational logic circuit in an FPGA is usually implemented by means of many-to-one Look-Up Tables. Their general design is as a number of single-bit storage elements whose values are initialized during the conguration of the FPGA by the bitstream. The inputs of the LUT control the setting of internal multiplexers thereby choosing which stored
1 http://faculty.nps.edu/drcanrig/pub/index.html
Fig. 2. Two possible LUTs in Virtex-5: (a) 6-input LUT, (b) 32-bit ShiftRegister [25]
bit value is available at the output of the LUT. As example, considering the 6-to-1 LUT of the Xilinx Series 5, 6, and 7 FPGAs, the implementation of this LUT is realized as two 5-to-1 LUTs and a multiplexer as can be seen in Fig. 2(a). Each of these 5-to-1 LUTs themselves can again be seen as two 4-to-1 LUTs and a multiplexer and so on. In our device under test, the Xilinx Virtex-5 LX50 FPGA mounted on a SASEBO-GII Board, each slice consists of four LUT6 and four single-bit ip-ops. The LUT6, as depicted in Fig. 2(a), can be hardinstanced in two different congurations. As LUT6_1 any combinational function having up to 6 input signals and one output signal can be implemented. Using the LUT in a LUT5_2 conguration allows providing two output signals from the 5 inputs but only if these 5 inputs are the same for both internal 5-to-1 LUTs, i.e., the inputs must be shared. Glitches at the output of a LUT happen since the input signals arrive at different instances of time because of the routing specication in the device. In order to avoid this the output of the LUT must be hold stable until all input signals have arrived. We achieve this by using one of the input signals as an active low enable signal, i.e., in our case as long as this input signal is set to logic 1, the LUT output will always be logic 0 no matter the values of the other input signals. Here it is important to choose the correct LUT input signal as enable carefully. Let us consider choosing the input I5 in Fig. 2(a) as the enable signal. While the output of the LUT_6 will actually not change during the transition period of the other input signals, there will still be glitches at the output of one of the internal LUT_5 instances. We therefore have to choose the input signal which controls the very rst multiplexer stage so that toggles at the select signals of the following multiplexers do not cause any glitches. Although the details of the internal architecture of the FPGA resources are not publicly available, this input signal can be identied by looking at the architecture of the SRLC32E depicted in Fig. 2(b). It is a special mode of operation for LUTs in some slices of Xilinx FPGAs that realizes a shift register. In this mode the content of the LUT storage cells 2
ah al
XOR
LUT 5x2
XOR
ahal
GF_MUL_4x4
MUL.SCL 2x2
bh bl
LUT 5x2
MUL 2x2
bhbl
LUT 5x2
ph
p an XOR
LUT 5x2
MUL 2x2
LUT 5x2
XOR
Q1
en1
LUT 5x2
XOR
pl en2 en3
LUT 5x2
Q0
ah al
LUT 5x2
XOR
ahal
GF_MUL_4x4
MUL.SCL 2x2 ah al p mb Q1 XOR
bh bl
LUT 5x2
MUL 2x2
bhbl
LUT 5x2
ph
GF_INV_8 (masked)
LUT 5x2
XOR
ahal
GF_MUL_4x4
MUL.SCL 2x2
XOR
bh bl
LUT 5x2
MUL 2x2
LUT 5x2
XOR
LUT 5x2
MUL 2x2
bhbl
LUT 5x2
ph
XOR
en1
LUT 5x2
XOR
pl en2 en3
LUT 5x2
Q0
LUT 5x2
MUL 2x2
LUT 5x2
XOR
Q1 p XOR Q0
m n
ah al
LUT 5x2
XOR
ahal
GF_MUL_4x4
MUL.SCL 2x2 d p XOR XOR ph mn Q1
en12
LUT 5x2
pl en13 en14
LUT 5x2
LUT 6x1
XOR
bh bl
ah al
XOR
LUT 5x2
MUL 2x2
bhbl
LUT 5x2
LUT 5x2
XOR
ahal
GF_MUL_4x4
MUL.SCL 2x2
LUT 6x1
XOR
QH
LUT 5x2
MUL 2x2
LUT 5x2
XOR
LUT 6x1
XOR
cl
bh bl
LUT 6x1
p XOR
LUT 5x2
MUL 2x2
bhbl
LUT 5x2
ph
en1
LUT 5x2
pl en2 en3
LUT 5x2
Q0
LUT 6x1
XOR
XOR
LUT 6x1
Q1 dn
b a MUL 2x2
GF_INV_4
LUT 5x2
MUL 2x2
LUT 5x2
XOR
LUT 6x1
XOR
c1
en3
cst
LUT 6x1
XOR
LUT 6x1
af8
LUT 6x1 LUT 5x2 LUT 5x2 LUT 5x2
c3
LUT 6x1 LUT 6x1

XOR
LUT 5x2
ch MUL 2x2
an
en12
LUT 5x2
pl en13 en14
LUT 5x2
Q0
XOR mb
LUT 5x2
MUL 2x2
LUT 6x1
XOR
ah al Q1 bh bl
XOR
LUT 5x2
XOR
ahal
GF_MUL_4x4
MUL.SCL 2x2
c2
c4 c5 c6 c7 c8
LUT 6x1 LUT 6x1 LUT 6x1
LUT 5x2
XOR
cst
LUT 5x2 LUT 6x1 LUT 6x1 LUT 6x1 LUT 6x1
cst1
mn
LUT 6x1
MUL 2x2 e
LUT 5x2
MUL 2x2
bhbl
LUT 5x2
ph
a b en2
LUT 5x2
LUT 5x2
XOR MUL 2x2
XOR q
XOR
cst0 XOR cm1
LUT 5x2
LUT 5x2
MUL 2x2
LUT 5x2
XOR
Q1 q XOR Q0
m m4 mn en4 XOR
LUT 6x1
XOR
LUT 6x1
XOR
LUT 5x2
MUL 2x2 d
em XOR p
en12
LUT 5x2
pl en13 en14
LUT 5x2
LUT 6x1
XOR
m2
LUT 6x1
XOR
csm
cm0
LUT 6x1
LUT 6x1
LUT 5x2
MUL 2x2
LUT 6x1
XOR
Q0
XOR
ah al
XOR
LUT 6x1
XOR
LUT 5x2
csm en6 en7 en8 en9
dn
LUT 6x1
en10 XOR
LUT 5x2
LUT 5x2
XOR
ahal
GF_MUL_4x4
MUL.SCL 2x2
LUT 6x1
XOR
QL
bh bl
LUT 6x1
p XOR
LUT 5x2
MUL 2x2
bhbl
LUT 5x2
ph
XOR
LUT 6x1
Q1 em
LUT 6x1
XOR
LUT 5x2
XOR
o1
LUT 5x2
MUL 2x2
LUT 5x2
XOR
LUT 6x1
LUT 5x2
o0 en12
LUT 5x2
pl en13 en14
LUT 5x2
Q0
m4 m5 en1 en2 en3 en4 en5 en6 en7 en8 en9 en10 en11 en12 en13 en14 en15
Fig. 3.
Design of our full-custom optimized S-box (inversion part only)
can be changed in a serial fashion during the operation of the FPGA. By using the LUT inputs as select lines, the length of the shift register can be set dynamically. Since the all zero input sets the length to 1 bit, and switching the I0 input signal to logic 1 increases the length to 2 bits, i.e., choosing the neighboring cell, the I0 signal must control the very rst multiplexer stage. Therefore, I0 is the correct choice for the enable signal. Note that since the synthesizer permutes the LUT input signals (and accordingly changes the LUT conguration) to optimize the routing, by special constraints [24] one has to keep the PIN positions of the hardinstanced LUTs locked. III. O UR D ESIGN The detailed structure of our design is given by Fig. 3. Omitting the tower-eld conversion, 15 LUT stages are required to perform the full inversion in GF(28 ). We give performance gures for 6 different implementation proles, 3
from the original unmodied design to our optimized one with or without pipelining stages and when the special enable signals to minimize glitches in the circuit are used or not. The implementation proles of the S-box are as follows: 1) The original HDL code optimized by the ISE synthesizer 2) The original HDL but avoiding any optimizations or trimming by the synthesizer, i.e., one LUT per gate to keep all hierarchy levels 3) Our modied design using hardinstanced LUTs, all enable signals always 0, no pipeline registers 4) Our modied design without pipelining but activating each stage sequentially by the enable signals 5) Our modied design using pipelining to hinder glitch propagation, but all enable signals always 0 6) Our modied design using both pipelining to hinder the glitch propagation and using the enable signals to avoid glitches in the circuit
In Proles 1, 2, and 3 the implementations are pure combinational functions where at each clock the full S-box is computed at once. Glitches in the rst LUT stage therefore are passed through the whole S-box generating a highly glitching circuit until all signals get stable. Therefore, we do not consider Proles 1 and 2 in our side-channel evaluations (Section IV). Prole 4 avoids this issue. Here only one LUT stage is activated in each clock cycle, thereby not only hindering the propagation of glitches but also not causing any glitches at all. That is because the input signals of the next LUT stage are stable when they are activated in the following clock cycle. The downside of this prole is the apparent non-practicality. One needs 15 clock cycles to compute a single S-box output while the inputs must be hold stable. In order to make matters worse one would need to spend another 15 clock cycles to deactivate each stage in the reverse order before the next Sbox computation can begin. In Prole 5 the pipelining stages hinder the glitch propagation. On the other hand, keeping all enable signals at 0 glitches will still occur at the LUT outputs of each stage. Finally, in the last Prole 6 we combine both the pipelining to avoid any glitch propagation and the use of the active-low enable signals to completely shut down glitches at the LUT outputs. In order to reach our goal in a straightforward way one would need to i) rst disable all LUTs, ii) clock every second pipelining registers after enabling their corresponding LUTs, iii) disable all LUTs again, iv) clock the other half of pipelining registers having their corresponding LUTs enabled and so on. This means that only every four clock cycles a new S-box input can be feed into the circuit, and it leads to a latency of 30 clock cycles from input to output. This is necessary because if one would simply merge clocking every second register and disabling the connected LUT stage at the same time, the routing of the signals would determine whether the disable signal arrives at the LUT rst or if other inputs arrive earlier, which the later causes glitches at the LUT output. To avoid this issue we can use the special way the clock signal is routed in the FPGA. The clock is routed on special dedicated paths to each switch box separately to avoid race problems in synchronous circuits. However, the LUT output signals need to rst go back to the corresponding slices switch box and from there travel to the destination LUT inputs where more switch boxes might be passed. Therefore, a transition, e.g., low-to-high, on the clock signal arrives at the registers and LUTs of each slice earlier than the other signals. Therefore, by tying our active-low LUT enable signals to the clock signal the LUT gets deactivated at each rising clock edge before the new inputs arrive. At the falling edge of the clock the LUT gets active and provides the output signal to the next ip-op stage where it will be stored at the next rising edge. This way the pipelining registers can be active at every clock cycle and no glitches will occur. Please note that the maximum clock frequency in this case cannot be faster than twice the longest critical path delay of the S-box circuit. In order to provide 4
clk/LUT en LUT output LUT data in 0 output (i) inputs (i) 0 output (ii) inputs (ii) 0 inputs (iii)
Fig. 4.
Signal timings on LUT inputs and outputs
a better understanding Fig. 4 showcases the different signal timings. Also, the performance results of each implementation prole for only the inversion module of the S-box is given in Table I.
TABLE I S YNTHESIS RESULTS FOR ALL PROFILES ( INVERSION ONLY ) Prole 1 2 3 4 5 6 Max. Freq. 105.519MHz 56.504MHz 88.300MHz 641.026MHz 641.026MHz 320.513MHz #LUTs 99 244 100 100 100 100 #FFs 0 0 0 0 649 649 Latency (#clocks) 0 0 0 30 15 (piped) 15 (piped) Throughput (16 Inv. /s) 6 594 937 3 531 500 5 518 750 1 335 471 20 678 258 10 339 129
IV. E VALUATION We used a SASEBO-GII [1] board as the target platform to examine the side-channel leakage of our designs. Different proles of our design were implemented on the Virtex-5 (XC5VLX50) FPGA embedded on the target board, and the power consumption traces were collected using a LeCroy WP715Zi 1.5GHz oscilloscope at the sampling rate of 1GS/s. Since our design employs a very few number of LUTs in the target FPGA, and the number of toggles in each clock cycle is restricted, the peak-to-peak amplitude of the signal in the power traces was quite low. Therefore, we measured the power traces by means of a 1 resistor in the VDD path, a DC blocker, a passive probe and an amplier. Furthermore, we restricted the bandwidth of the measurements (on the oscilloscope) to 20MHz to eliminate the electrical noise while our designs run by a stable 3MHz oscillator. We made an exemplary architecture where one AddRoundKey module (128-bit) and one instance of the targeted Sbox exist. The 128-bit (masked) input is XORed by a 128bit secret key, and the result is sequentially given to the Sbox module one byte per clock cycle. The method we used to examine the side-channel leakage of our targeted designs is a correlation collision attack [15]. It examines the rstorder leakage of one circuit instance that is used in different time instances. Therefore, it perfectly suits to our exemplary architecture since the targeted S-box instance is shared for all SubBytes transformations. The target masked S-box [7] uses two different mask bytes per input byte, i.e., a random byte to mask an input byte and another random byte as the mask of S-box output. Therefore, we provided two random values for each input byte, and gave the above mentioned architecture the masked inputs and the corresponding masks. In other words, in each run of the circuit two independent 128-bit random values
10
Voltage [mV]
5 0
Time [s]
Voltage [mV]
5 0
Time [s]
13
18
(a)
(a)
(b)
(b)
(c) Fig. 5. Prole 3: evaluation results (a) a sample trace, (b) attack result using 50 000 traces, (c) over the number of traces.
(c) Fig. 6. Prole 5: evaluation results (a) a sample trace, (b) attack result using 20 000 000 traces, (c) over the number of traces.
as input and output masks are provided for the aforementioned circuit. For comparison purposes we start our evaluations by Prole 3 to have a reference as a design where glitches are not controlled and can be propagated. Please note that we omitted the evaluation results of Proles 1 and 2 since there is no control over the glitches, and they have the same side-channel leakage as that of Prole 3. A sample power trace of this design is shown in Fig. 5(a). Sixteen clock cycles related to the sixteen S-box computations are clearly distinguishable. We measured 50 000 traces, and performed a correlation collision attack considering two plaintext bytes which are processed consecutively by the targeted S-box instance. Note that this attack, similar to the most of the side-channel collision attacks, recovers a relation between the targeted secret key bytes. In case of our targets (like the linear collision attack on AES [4]) the attack searches for the XOR difference of the key bytes corresponding to the two targeted plaintext bytes. The result of this attack is depicted in Fig. 5(b) and Fig. 5(c) showing the simplicity of recovering the secret, i.e., 5 000 traces, when the glitches in the masked S-box are not controlled. Prole 5 is the next S-box design we evaluated. As mentioned in Section III, this design does not avoid the glitches, but it prevents their propagation to the next circuit stages. Since this design provides a pipeline with 15 stages, sequentially giving the 16 key-whitened plaintext bytes to this S-box instance leads to requiring 31 clock cycles to compute all the 5
SubBytes transformations. These 31 clock cycles are clearly distinguishable in Fig. 6(a) which shows a sample power trace of this design. As an interesting point, compared to that of Prole 3 (Fig. 5(a)) the power consumption of Prole 5 is reduced though it needs 15 more clock cycles to nish all SubBytes transformations. In order to perform a successful attack on this design and recover the desired secret, we required to collect much more traces compared to Prole 3, i.e., 20 000 000. This is due to preventing the glitch propagations, which control the datadependent leakage and consequently is harder to detect. The same attack scheme with the same target as in Prole 3 was performed. As shown in Fig. 6(b), there is still a rst-order leakage. This shows that controlling the propagation of the glitches is effective to signicantly reduce the side-channel leakage, but it does not completely prevent it, as we need about 8 000 000 traces to see the desired leakage (Fig. 6(c)). The last design we considered for evaluation is Prole 6, where by a sophisticated control over the LUT enable signals the glitches are prevented. The level of power consumption of this design, as shown by Fig. 7(a), is roughly the same2 as that of Prole 5. In order to perform the attacks, we measured 50 000 000 traces of this design. Performing the same attack as before led to the unsuccessful result which is depicted in Fig. 7(b). In fact, it shows that preventing the glitches signicantly helps resisting against the rst-order attacks. However, this design should have second-order leakage because of its
2 Indeed,
it is slightly lower because of the glitch prevention.
5 0
Time [s]
13
18
i.e., a multivariate attack. This is out of the evaluation criteria we have considered in this paper. However, we believe that combining leakages of different time instances leads to increasing the noise factor and most likely provides not a better result than the univariate second-order attack whose result is shown here. V. C ONCLUSIONS In this work we have taken the highly optimized for ASICs very compact masked S-box by Canright and Batina, and ported it to use the available resources of the current Xilinx FPGA Series (Virtex-5 onward) in a size-optimized manner. Compared to a design created by an automatic synthesizer this led to the same number of LUTs and slight decrease of the operation frequency. We could also, as already pointed out in [15], conrm the still available rst-order leakage of this S-box design when implemented in a straightforward manner. Since this leakage was caused by glitches in the circuit, we have rst eliminated the glitches by placing enable signals in each used LUT, so that no output is propagated while the inputs are not stable. By combining this solution together with pipelining stages and utilizing the special way how the clock signals are routed for the LUT enable signals, we could create an implementation which operates at an extremely high clock frequency while showing absolutely no rst-order leakage by means of 50 million power consumption measurements. While not specically focusing on this, we also achieved a quite high resistance against univariate second-order attacks. In this case 25 million traces is the threshold after which the secrets become slowly distinguishable using the very sophisticated attacks of [14]. We should emphasize a comparison between our results and those of a threshold implementation of AES reported in [16] and [14]. Although their implementation platform is different to ours, their scheme required roughly the same number of traces the secondorder leakage to be exploited while the area overhead of their design excluding all the internal PRNGs is much higher than our optimized one. In order to allow further study of our design and to use it in real applications the HDL source code of our masked S-box design is available online at http://www.emsec.rub.de/research/publications. ACKNOWLEDGMENT In this project O. Mischke has been part-nanced by the European Union, Investing in your future, European Regional Development Fund. R EFERENCES
[1] Side-channel attack standard evaluation board (sasebo). Further information are available via http://www.morita-tech.co.jp/SASEBO/en/index. html. [2] M.-L. Akkar and C. Giraud. An Implementation of DES and AES, Secure against Some Attacks. In CHES 2001, volume 2162 of LNCS, pages 309318. Springer, 2001. [3] J. Blmer, J. Guajardo, and V. Krummel. Provably Secure Masking of AES. In SAC 2004, volume 3357 of LNCS, pages 6983. Springer, 2004.
Voltage [mV]
(a)
(b)
(c)
(d) Fig. 7. Prole 6: evaluation results (a) a sample trace, (b) attack result using 50 000 000 traces, (c) attack result on 50 000 000 squared mean-free traces, (d) over the number of traces.
underlying rst-order masking scheme. In order to check this issue we performed the same attack, i.e., correlation collision attack, but using the second-order moments. That is, as illustrated in [14], in a correlation collision attack one can employ the variance traces of the measurements instead of the averages to examine the second-order moments. It is, in fact, the same as squaring the mean-free traces and then performing a correlation collision attack [14]. We performed this preprocessing step prior to the same correlation collision attack as before, and the result is presented by Fig. 7(c). As expected, the second-order leakage is available, and can be used to reveal the desired secret using around 25 000 000 measurements (see Fig. 7(d)). We should mention that we considered only the univariate attacks, i.e., rst-order and zero-offset second-order. Because of the pipeline architecture of our design the leakages relevant to the one S-box computation are distributed over 15 clock cycles. Therefore, one may perform a second-order attack by combining the leakages appearing at different clock cycles, 6
[4] A. Bogdanov. Multiple-Differential Side-Channel Collision Attacks on AES. In CHES 2008, volume 5154 of LNCS, pages 3044. Springer, 2008. [5] A. Bogdanov, G. Leander, L. Knudsen, C. Paar, A. Poschmann, M. Robshaw, Y. Seurin, and C. Vikkelsoe. PRESENT - An Ultra-Lightweight Block Cipher. In CHES 2007, number 4727 in LNCS, pages 450466. Springer, 2007. [6] D. Canright. A Very Compact S-Box for AES. In CHES 2005, volume 3659 of LNCS, pages 441455. Springer, 2005. The HDL specication is available at the authors ofcial webpage http://faculty.nps.edu/drcanrig/ pub/index.html. [7] D. Canright and L. Batina. A Very Compact "Perfectly Masked" SBox for AES. In ACNS 2008, volume 5037 of LNCS, pages 446459. Springer, 2008. the corrected version at Cryptology ePrint Archive, Report 2009/011 http://eprint.iacr.org/. [8] J. Daemen, M. Peeters, G. Assche, and V. Rijmen. Nessie proposal: NOEKEON. Submitted as an NESSIE Candidate Algorithm, 2000. http: //www.cryptonessie.org. [9] L. Genelle, E. Prouff, and M. Quisquater. Thwarting Higher-Order Side Channel Analysis with Additive and Multiplicative Maskings. In CHES 2011, volume 6917 of LNCS, pages 240255. Springer, 2011. [10] J. D. Goli and C. Tymen. Multiplicative Masking and Power Analysis of c AES. In CHES 2002, volume 2523 of LNCS, pages 198212. Springer, 2003. [11] P. C. Kocher, J. Jaffe, and B. Jun. Differential Power Analysis. In CRYPTO 1999, volume 1666 of LNCS, pages 388397. Springer, 1999. [12] S. Mangard, E. Oswald, and T. Popp. Power Analysis Attacks: Revealing the Secrets of Smart Cards. Springer, 2007. [13] S. Mangard, N. Pramstaller, and E. Oswald. Successfully Attacking Masked AES Hardware Implementations. In CHES 2005, volume 3659 of LNCS, pages 157171. Springer, 2005. [14] A. Moradi. Statistical Tools Flavor Side-Channel Collision Attacks. In EUROCRYPT 2012, volume 7237 of LNCS, pages 428445. Springer, 2012. [15] A. Moradi, O. Mischke, and T. Eisenbarth. Correlation-Enhanced Power Analysis Collision Attack. In CHES 2010, volume 6225 of LNCS, pages 125139. Springer, 2010. the extended version at Cryptology ePrint Archive, Report 2010/297 http://eprint.iacr.org/. [16] A. Moradi, A. Poschmann, S. Ling, C. Paar, and H. Wang. Pushing the Limits: A Very Compact and a Threshold Implementation of AES. In EUROCRYPT 2011, volume 6632 of LNCS, pages 6988. Springer, 2011. [17] S. Nikova, C. Rechberger, and V. Rijmen. Threshold Implementations Against Side-Channel Attacks and Glitches. In ICICS 2006, volume 4307 of LNCS, pages 529545. Springer, 2006. [18] S. Nikova, V. Rijmen, and M. Schlffer. Secure Hardware Implementations of Non-Linear Functions in the Presence of Glitches. In ICISC 2008, volume 5461 of LNCS, pages 218234. Springer, 2008. [19] S. Nikova, V. Rijmen, and M. Schlffer. Secure Hardware Implementation of Nonlinear Functions in the Presence of Glitches. J. Cryptology, 24(2):292321, 2011. [20] E. Oswald, S. Mangard, N. Pramstaller, and V. Rijmen. A Side-Channel Analysis Resistant Description of the AES S-Box. In FSE 2005, volume 3557 of LNCS, pages 413423. Springer, 2005. [21] A. Poschmann, A. Moradi, K. Khoo, C.-W. Lim, H. Wang, and S. Ling. Side-Channel Resistant Crypto for Less than 2, 300 GE. J. Cryptology, 24(2):322345, 2011. [22] E. Prouff and T. Roche. Higher-Order Glitches Free Implementation of the AES Using Secure Multi-party Computation Protocols. In CHES 2011, volume 6917 of LNCS, pages 6378. Springer, 2011. [23] A. Shamir. How to Share a Secret. Commun. ACM, 22(11):612613, 1979. [24] Xilinx. Constraints Guide. Available via http://www.xilinx.com/itp/ xilinx10/books/docs/cgd/cgd.pdf, 2008. [25] Xilinx. Virtex-5 Libraries Guide for HDL Designs. Available via http://www.xilinx.com/support/documentation/sw_manuals/xilinx11/ virtex5_hdl.pdf, September 2009.

GlitchFreeFPGA HOST12

Încărcat de

Informații document

Descriere originală:

Titlu original

Drepturi de autor

Formate disponibile

Partajați acest document

Partajați sau inserați document

Opțiuni de partajare

Vi se pare util acest document?

Este necorespunzător acest conținut?

Drepturi de autor:

Formate disponibile

GlitchFreeFPGA HOST12

Încărcat de

Drepturi de autor:

Formate disponibile

Glitch-Free Implementation of Masking in Modern FPGAs

Amir Moradi and Oliver Mischke

Optimized xor/sq/scl/ mul

Masked GF(28 ) Inverter by Canright-Batina (taken from [15]) (a) (b)

LUT 6x1 LUT 5x2 LUT 5x2 LUT 5x2

LUT 6x1 LUT 6x1

LUT 6x1 LUT 6x1 LUT 6x1

cst0 XOR cm1

Design of our full-custom optimized S-box (inversion part only)

Signal timings on LUT inputs and outputs

it is slightly lower because of the glitch prevention.

S-ar putea să vă placă și