IEEE TRANSACTIONS ON MULTI-SCALE COMPUTING SYSTEMS, VOL. XX, NO. XX, XXX-XXX 2017 1
Abstract—The demand for low-power and high-performance computing has driven the semiconductor industry for decades, and semiconductor technology has been scaled down to satisfy these demands. At the same time, scaled semiconductor technology faces severe reliability challenges such as soft-errors. Research has improved the soft-error reliability of the GPU using various methodologies, most notably redundancy-based ones. However, the GPU compiler has not yet been considered as a means to improve the soft-error reliability of the GPU. In this paper, in order to improve the soft-error reliability of the GPU, we propose a novel GPU architecture aware compilation methodology. The proposed methodology jointly considers the parallel behavior of the GPU hardware and the applications, and minimizes the vulnerability of GPU applications during instruction scheduling. In addition, the proposed methodology can complement any hardware based soft-error reliability improvement technique. We compared our compilation methodology with the state-of-the-art soft-error reliability aware techniques and with performance aware instruction scheduling. We injected soft-errors during the experiments and compared the number of correct executions, i.e., executions with no erroneous output. Our methodology incurs less performance and power overhead than the state-of-the-art soft-error reliability methodologies in most cases, and its compilation time overhead is 8.13 seconds on average. The experimental results show that our methodology improves the soft-error reliability by 23% and 12% (up to 64% and 52%) compared to the state-of-the-art soft-error reliability aware and performance aware compilation techniques, respectively. Moreover, we show that the soft-error reliability of a GPU is related not to raw performance, but to the fine-grained timing behavior of an application.
2332-7766 (c) 2016 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TMSCS.2017.2667661, IEEE
Transactions on Multi-Scale Computing Systems
the number of threads is 841, Scheduling Algorithm 2 shows the smallest vulnerable period.

These experimental results indicate that the vulnerable period of an application depends not only on the instruction scheduling but also on the parallel behavior of the GPU. The result may show only a small change in the vulnerable period, because of the simple kernel function and the small number of threads. However, the result in Figure 1 still shows the possibility that the vulnerable period, which is related to the soft-error reliability, may be improved through instruction scheduling. Therefore, the parallel behavior of the GPU and the instruction scheduling need to be considered together to further improve the soft-error reliability.

2 RELATED WORK

Research has been conducted to improve the soft-error reliability of processors including GPUs [10] [11] [18] [23] [33] [39] [42] [45] [47]. Some research works have shown the impact of soft-errors by using radioactive sources [30] [44]. However, these radioactive sources are difficult to use and not suitable for general purpose applications; therefore, some research works have focused on techniques to model the soft-error behavior [2] [16] [20] and to evaluate the soft-error resiliency of an application [9] [15] [27] [30] [40] [44]. Other research works have proposed to detect the occurrence of soft-errors and ensure the correctness of an application by using various techniques (i.e., redundant execution [18] [39] [42] [45] [47], insertion of protection code [10] [23], and leveraging architectural characteristics [11] [33]). In the rest of this section, we discuss the above mentioned research works in detail.

Various research works have been conducted to demonstrate the impact of soft-errors and to evaluate the soft-error resilience of GPU applications [9] [15] [27] [30] [40] [44]. One research work [27] proposed a metric to quantify the soft-error reliability based on the detailed timing behavior of an application. Other research works [30] and [44] showed the impact of radiation-induced soft-errors on NVIDIA's GPUs; in order to inject actual soft-errors into the GPU, the target system is exposed to a neutron beam. Another research work [37] has evaluated the soft-error resilience of several safety-critical applications: an embedded GPGPU platform has been exposed to a neutron flux in order to measure the soft-error resilience. A soft-error estimation framework is proposed in [15] in order to accurately estimate the soft-error rate. Unlike traditional netlist-based techniques, the proposed framework estimates the soft-error rate from the layout of the target processor. The impact of soft-errors on the GPU has been discussed in [9] and [40], where various techniques (e.g., debugger based fault injection) are used to show this impact. These research works have successfully demonstrated the impact of soft-errors in real-world environments and applications. However, these works do not propose techniques to improve the soft-error reliability.

Research has been conducted to model the soft-error behavior. One research work [2] has proposed a soft-error model to provide the failure probability of various interleaving techniques for SRAM. Other research [16] has discussed a soft-error model for 25nm technology. Another research work [20] has proposed a cross-layer analysis approach for modeling the soft-error behavior of FinFET transistors; the proposed approach performs 3D simulations from the interactions in FinFET structures up to the circuit level. A soft-error susceptibility estimation technique has been proposed in [26]; the proposed technique uses symbolic modeling to estimate the soft-error susceptibility of a combinational logic circuit. However, since the detailed GPU architecture is not available to the research community, these research works are not applicable to the GPU.

Research has been conducted to provide soft-error injection tools, because it is extremely difficult to perform experiments in a radiation environment [8] [22]. One research work [22] has proposed a soft-error injection tool that injects a single bit-flip into a data object in the binary. Other research [8] has proposed a soft-error injection technique that randomly selects an instruction and changes its result. However, these research works have limitations in modeling the random behavior of soft-errors. For instance, although not all injected soft-errors cause a bit-flip, these research works inject a bit-flip whenever a soft-error occurs.

Research has been conducted in order to detect incorrect behavior and ensure functional correctness [18] [25] [39] [42] [45] [47]. One research work [25] has proposed an application framework to handle soft-errors in GPU DRAM; the DRAM errors in GPUs are detected by using a dual parity technique, and the application is recovered from checkpoints. Other research [47] has proposed a software-level Dual-Modular Redundancy (DMR) technique: each stage has a monitor function, and its result is compared to the result from the monitor function. Another research work [18] has proposed a redundancy technique that utilizes the GPU idle time caused by branch divergence. A duplication technique proposed in [39] redundantly executes the critical parts of the GPU pipeline and recomputes erroneous results when an error is detected. A redundant execution methodology is proposed in [42], and it uses the idle time of the GPU in order to minimize the performance overhead. An automatic Redundant Multithreading (RMT) technique is proposed in [45]; the proposed technique modifies an application's code during compile-time to add redundant execution. However, these works may cause a significant amount of performance and power overhead, because the proposed techniques are based on redundant execution or recomputation.

Research has also been conducted to improve the soft-error reliability by leveraging the parallel behavior of the GPU [11] [33]. It is shown in [11] that the GPU's soft-error reliability is affected by its parallel behavior. For example, one research work [33] has demonstrated that the Mean Executions Between Failures (MEBF) of a GPGPU application is affected by the parallel behavior of the GPU; in order to show the relationship between the MEBF and the parallel behavior of the GPU, the grid and block sizes of a GPGPU application are modified. However, these research works do not provide techniques to find the grid and block sizes that improve the soft-error reliability of the GPU.

Various instruction scheduling algorithms have been proposed in [34], [35] and [36]. One research project [35] has proposed a metric to quantify the soft-error reliability by using both the detailed timing behavior of the application and the hardware information; based on the proposed metric, an instruction schedule is generated to maximize the soft-error reliability. Other research [34] has proposed an instruction scheduling algorithm to maximize the soft-error reliability under various performance constraints. Another research work [36] has proposed an instruction scheduling algorithm to maximize the soft-error reliability of a specific component: based on the compiler option, the proposed algorithm maximizes the soft-error reliability of the selected components. For example, if the compiler option selects the register file, then the proposed algorithm generates an instruction schedule that maximizes the soft-error reliability of the register file. However, since these instruction scheduling algorithms target RISC processors such as the SPARC-V8 architecture, they are not applicable to the GPU. In addition, since the probability of a soft-error occurrence is proportional to the time that a hardware component is used, performance aware instruction scheduling may be considered to improve the soft-error reliability [17]. However, since performance aware instruction scheduling focuses on maximizing the GPU resource utilization, its improvement in the soft-error reliability is limited.

Research has been conducted for detection and protection of the vulnerable parts of GPGPU applications [6] [10] [23] [32]. One research work [10] has proposed checker functions that are inserted to protect the potentially vulnerable parts of a GPGPU application. Other research [32] has proposed a compilation technique that improves the control-flow reliability. Another research work [23] has proposed a compile-time methodology that protects memory access instructions by inserting checker instructions. One study [6] has proposed an application-level technique that modifies the loop code during compile-time. However, these works have limitations in improving the soft-error reliability of the GPU, because a soft-error may occur in any part of a GPGPU application. In summary, the above-mentioned state-of-the-art methodologies suffer from the following limitations:

1) They control the parallel behavior of the GPU and modify the application source code to improve the soft-error reliability. However, instruction scheduling methodologies have not been considered to further improve the soft-error reliability.
2) Their techniques may cause significant performance and power overhead, because they are based on redundancy or recomputation.
3) Their fault injection tools do not properly model the random behavior of soft-errors. Not all injected soft-errors cause a bit-flip, but existing fault injection techniques cause a bit-flip for every soft-error injection.

3 RESEARCH CHALLENGES AND CONTRIBUTIONS

3.1 Problem and Research Challenges

The problem of minimizing the vulnerable period during compile-time to improve soft-error reliability poses the following research challenges:

1) Estimation of the vulnerable period of GPU applications by considering an accurate GPU execution model, in order to minimize the vulnerable period of an application during compile-time.
2) Efficient selection of the locations for the instructions while minimizing the total vulnerable period of an application. In addition, the instruction scheduling should not cause significant compilation overhead.
3) Modeling of the random behavior of soft-errors to properly evaluate methodologies. During run-time, soft-errors should be randomly injected into the GPU at a given rate. Moreover, we should consider that not all injected soft-errors cause a bit-flip.

3.2 Our Novel Contributions

To address the above-mentioned challenges, we propose a novel methodology to minimize the vulnerable period of an application on a GPU-based system which employs:

1) A GPU architecture aware instruction scheduling algorithm (Section 6.1) that minimizes the total vulnerable period of a GPU application by considering the impact of the parallel behavior of the GPU. To support this algorithm, we require all the following:
2) Estimation of the vulnerable period of a GPU application during compile-time (Section 5), which provides the thread-level information for the GPU.
3) Theorems to handle the application's control flow (Section 5.2) during the vulnerable period estimation. These theorems help to estimate control flow constructs such as loops and branches.
4) A fine-grained clock-cycle-level fault injection tool (Section 7.1) to verify our methodology. The fault injection tool, which can randomly inject faults at clock-cycle granularity, is integrated with the state-of-the-art GPGPU-Sim simulator.

In addition to the above mentioned contributions, we also provide a comparison with the state-of-the-art performance aware instruction scheduling [17] (Section 6.3) to illustrate why our methodology may achieve maximum soft-error reliability.

Fig. 2: Overview of the Proposed Soft-error Reliability Improvement Methodology.

Figure 2 shows the overview of the methodology in this paper. The proposed compilation methodology has strong potential to improve the soft-error reliability of the GPU and can complement hardware oriented techniques such as [1], [7], [31], [43], and [46].

4 SYSTEM MODELS

4.1 GPGPU Application Model

A GPGPU application is considered as our target. Generally, a GPGPU application consists of a host code and a kernel code. The host code is executed on a CPU and the kernel code is executed on the GPU. The proposed instruction scheduling and vulnerable period estimation algorithms
are designed based on an architecture similar to NVIDIA's Fermi architecture [28], which contains hardware units similar to those of the Kepler architecture [29]. The host orchestrates the overall behavior of a GPGPU application, and the kernel handles most of the computation in the GPGPU application.

During run-time, the configuration must be provided by the host before it launches a kernel. This configuration consists of two numbers: the grid size and the block size. The grid size represents the dimension of the GPU thread block hierarchy at the Streaming Multiprocessor (SM) level, and the block size represents the size of each thread block. Right after the host launches a kernel function, the mapping between the thread blocks and the SMs is created based on the configuration, and the entire kernel function is submitted to the GPU hardware. When the GPU executes a kernel function, the threads in the same thread block are grouped into warps. Note that a single thread block can have multiple warps. A warp is the basic unit of execution of the GPU: the threads in the same warp fetch and execute the same instruction. The behavior of warps in the same thread block is sequential because they share a single SM. Since multiple SMs operate at the same time, the behavior of thread blocks can be either parallel or sequential. Since the detailed architecture of the GPU is not available to the research community, we assume that the GPU uses a round-robin algorithm to schedule the warps, as has been adopted in [4] as well.

5 VULNERABLE PERIOD ESTIMATION FOR GPU

Since the parallel behavior of the GPU affects the soft-error reliability of the GPU as well, the architectural characteristics of the GPU also need to be considered to estimate the vulnerable period of a GPU application. During run-time, each warp runs instructions in lock-step, and it is assumed that the GPU hardware schedules the warps in a round-robin manner [4]. The latency of an instruction depends on the parallel behavior of the GPU pipeline and the memory access latency [14]. By using the state-of-the-art GPU performance models, we may estimate the timing behavior of the GPU and extract information to estimate the vulnerable period. Below, we provide more details about the latency estimation of an instruction, which uses the state-of-the-art GPU performance models from [3] [14]. The model parameters are generated from the actual GPU by using micro benchmarks.

5.1 Latency of Instruction Execution

The n-th instruction can be issued after all warps have issued the (n − 1)-th instruction and all the data for the n-th instruction is ready to use. Therefore, the amount of time to execute n instructions may be represented by the following equation:

t_In = Σ_{i=1}^{n} (L^i_ready + L_issue) + L^n_pipe + L^n_mem    (1)

where L^n_ready represents the latency to make the n-th
latency, the latencies of the arithmetic and memory access instructions need to be considered together.

After that, the maximum number of concurrent memory accesses may be estimated by the following equation:

N_mem_req = ⌊BW_MEM / bw⌋    (6)
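The accumulation in Equation (1) and the bandwidth division in Equation (6) can be sketched as a small numeric model. This is our reading of the garbled equations, not the authors' tool; all function names, parameter names, and latency values below are our own placeholders:

```python
# Sketch of the Section 5.1 latency model (our reading of Equations (1)
# and (6); all names and the placeholder cycle counts are ours).
from math import floor

def time_to_execute(n, L_ready, L_issue, L_pipe, L_mem):
    """Equation (1), as we read it: the ready and issue latencies of all n
    instructions accumulate (every warp must issue instruction i before
    any warp issues i+1), while the pipeline and memory latencies of the
    n-th instruction are paid once at the end."""
    return (sum(L_ready[i] for i in range(n)) + n * L_issue
            + L_pipe[n - 1] + L_mem[n - 1])

def max_concurrent_mem_accesses(BW_MEM, bw):
    """Equation (6): N_mem_req = floor(BW_MEM / bw), i.e. how many accesses
    of per-access bandwidth bw fit into the total memory bandwidth."""
    return floor(BW_MEM / bw)

# Three instructions with placeholder latencies (cycles):
t = time_to_execute(3, L_ready=[2, 4, 1], L_issue=1,
                    L_pipe=[8, 8, 8], L_mem=[0, 0, 200])
assert t == (2 + 4 + 1) + 3 * 1 + 8 + 200   # ready+issue accumulate
assert max_concurrent_mem_accesses(BW_MEM=192.0, bw=12.5) == 15
```

The memory term dominates here, which is consistent with the text's point that arithmetic and memory access latencies must be considered together.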
by using the above mentioned equations (Equations 9-11). In addition, N^n_nonfull is the same as N_remain.

2) t^n_startmem is between t^{n−1}_nonfull and t^{n−1}_memfinish. In this scenario, there are ongoing memory accesses, but the memory bandwidth is not fully occupied. Some memory accesses can begin immediately and the other memory accesses can begin after t^{n−1}_memfinish. Therefore, d^n_full becomes 0 and, based on the available bandwidth, N^{n−1}_nonfull, and N_remain, the memory latency and the parameters are updated. When N_remain is greater than N^{n−1}_nonfull, the last memory access is followed by the memory accesses that start at t^{n−1}_memfinish. Therefore, the parameters are updated as follows:

t^n_nonfull = t^n_startmem + L^n_mem    (12)
t^n_memfinish = t^{n−1}_memfinish + L^n_mem    (13)
N^n_nonfull = N_mem_req − (N_remain − N^{n−1}_nonfull)    (14)

In the case when N_remain is less than or equal to N^{n−1}_nonfull, the last memory access is followed by the memory access that starts at t^n_startmem. In addition, the available memory bandwidth, N^n_nonfull, is related to the memory accesses that finish at time t^{n−1}_memfinish. Therefore, the parameters are updated as follows:

t^n_nonfull = t^{n−1}_memfinish + (N_full × L_avg_mem)    (15)
t^n_memfinish = t^{n−1}_memfinish + L^n_mem    (16)
N^n_nonfull = N_mem_req − N_remain    (17)

In the case when N_remain is equal to 0, the memory access behavior is exactly the same as the previous one. Therefore, the parameters are updated as follows:

t^n_nonfull = t^n_startmem + (N_full × L_avg_mem)    (18)
t^n_memfinish = t^{n−1}_memfinish + (N_full × L_avg_mem)    (19)
N^n_nonfull = N^{n−1}_nonfull    (20)

3) t^n_startmem is less than t^{n−1}_nonfull. In this scenario, there is no available memory bandwidth; some memory bandwidth becomes available after t^{n−1}_nonfull. This scenario is a special case of the previous scenario, where d^n_full > 0 and the memory accesses start at t^{n−1}_nonfull, because the memory access behavior will be similar to the previous scenario after t^{n−1}_nonfull. The memory access latency may be estimated by adding the delay d^n_full to the equations in the previous scenario. d^n_full may be estimated by the following equation:

d^n_full = t^{n−1}_nonfull − t^n_startmem    (21)

on the state-of-the-art GPU architecture and performance models [4] [14].

Observation 1. Let us assume that the register R includes the moment of data production before the loop and the last moment of data consumption in the loop. The vulnerable period of the register R, VP_R, depends on the following:

1) The execution time of one iteration of the loop, l_loop.
2) The time from the data production to the first data consumption in the loop, l_pre-loop.

Proof. During run-time, the number of iterations N changes depending on the data. In other words, two different instruction schedules will have the same number of loop iterations if the input data is the same. The vulnerable period of the register R may be represented by the following equation:

VP_R = { N × l_loop,                if l_pre-loop / l_loop ≈ 0
       { N × l_loop + l_pre-loop,   otherwise                    (22)

From Equation 22, we can see that l_pre-loop's contribution to VP_R becomes negligible if one of the following two cases is satisfied: 1) l_pre-loop / l_loop is approximately zero; or 2) N is very large. Therefore, the vulnerable period of the loop can be improved during compile-time by minimizing both l_loop and l_pre-loop. Figure 5(a) shows an example code to demonstrate the proof of this theorem for a for-loop.

Observation 2. For a register R which includes the moment of data production before a branch instruction and the last moment of data consumption within the branch body, the vulnerable period of R depends on the location of the last consumption of R.

Proof. Unlike a general CPU, the GPU likely executes every branch body to handle control flow divergence, no matter how the branch is taken. Moreover, for the same input data, two different instruction schedules show the same branch behavior. Therefore, the vulnerable period of R is calculated based on the last place where R is used. Figure 5(b) shows one such example case to demonstrate the proof of this theorem.
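Observation 1 can be checked numerically. The following is a minimal sketch of Equation (22); the symbols N, l_loop, and l_pre_loop follow the text, while the function name and the cycle counts are our own illustration:

```python
# Minimal numeric sketch of Equation (22) from Observation 1.

def vulnerable_period(N, l_loop, l_pre_loop, eps=1e-9):
    """VP_R = N * l_loop when l_pre_loop / l_loop is approximately zero,
    else VP_R = N * l_loop + l_pre_loop."""
    if l_pre_loop / l_loop < eps:
        return N * l_loop
    return N * l_loop + l_pre_loop

# With few iterations, the pre-loop term is visible; with many iterations
# its relative contribution vanishes. Either way, compile-time scheduling
# should shrink both l_loop and l_pre_loop to reduce VP_R.
assert vulnerable_period(N=2, l_loop=100, l_pre_loop=50) == 250
assert vulnerable_period(N=10_000, l_loop=100, l_pre_loop=50) == 10_000 * 100 + 50
```

This is why the proposed scheduler targets both terms: l_loop is paid on every iteration, while l_pre_loop is paid once.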
(a) Instruction Scheduling from Default NVCC Compiler; (b) Performance Aware Instruction Scheduling [17]; (c) Our Proposed Instruction Scheduling.
Fig. 7: Instruction Scheduling Results for the BFS Application from the NVCC, Performance Aware Instruction Scheduling [17], and Our Proposed Instruction Scheduling.

Algorithm 2: Algorithm for Instruction Scheduling.
Input: Instruction Flow Graph Gin
Output: New Instruction Flow Graph Gnew
1  Function BuildFromBottom(Gin)
2  begin
3    Gnew.clear();
4    Pos ← |Gin|;
5    ntgt ← ∅; RegSet ← ∅;
6    while Gin ≠ ∅ do
7      CandidateSet ← ∅;
8      pass ← true;
9      if Pos = |Gin| then
10       ntgt ← GetSinkNode(); Gnew.push(ntgt);
11       Gin.remove(ntgt);
12       Pos ← Pos − 1;
13     else
14       for idx = 0; idx < Pos; idx++ do
15         for cnt = idx; cnt < Pos; cnt++ do
16           if Gin[cnt].is_sync() then
17             pass ← false;
18           if pass = true and Gin[cnt].is_consume(Gin[idx].GetDest()) then
19             pass ← false;
20         if pass = true and Gin[idx].is_in_loop() then
21           if Gin[idx].GetLastLoopPos() < Pos then
22             pass ← false;
23         if pass = true and RegSet.exist(Gin[idx].GetDest()) then
24           CandidateSet.push(Gin[idx]);
25       if CandidateSet = ∅ then
26         ntgt ← Gin.GetLastNode();
27       else
28         ntgt ← CandidateSet.front();
29       Gnew.push(ntgt);
30       Gin.remove(ntgt);
31       RegSet.RemoveDestRegs(ntgt);
32       RegSet.AddSrcRegs(ntgt);
33       Pos ← Pos − 1;

In order to minimize the vulnerable period of the last instruction, our heuristic schedules the predecessor instruction that satisfies all three conditions. After that, the scheduled predecessor instruction becomes the consumer instruction, and our heuristic repeats the above mentioned process until all instructions are scheduled.

Algorithm 2 shows the pseudo-code for the proposed instruction scheduling algorithm. At the beginning, the algorithm selects the last instruction and places it at the end of the application code (Lines 10-13). For each predecessor instruction, the algorithm examines the aforementioned three conditions (Lines 15-25). If the predecessor instruction satisfies these three conditions, it will be placed near the consumer instruction (Lines 30-34). The overall complexity of the instruction scheduling algorithm is O(n × n × n) = O(n³) because of the three-level loop hierarchy.

6.2 Example of the Proposed Instruction Scheduling

An example of our scheduling process is shown in Figure 6. In the figure, there are four producer and consumer instruction pairs: the instructions in Lines 2 and 5, in Lines 6 and 9, in Lines 8 and 13, and in Lines 10 and 14. The first pair of instructions, in Lines 2 and 5, violates the third condition because they are not in the same loop. The second pair of instructions, in Lines 6 and 9, violates the first condition due to the synchronization instruction in Line 7. The third pair of instructions, in Lines 8 and 13, violates the second condition because there is another consumption of the register %r22 in Line 11. However, the last pair of instructions, in Lines 10 and 14, satisfies all three conditions, and therefore the proposed heuristic is able to place the instruction in Line 10 next to its consumer instruction in Line 14.

6.3 Comparison with Performance Aware Instruction Scheduling

As mentioned in Section 1, the probability that a soft-error occurs in a single hardware component is proportional to the time that the hardware component is used [48]. Therefore, it may be possible to improve the soft-error reliability through performance improvement. The instruction
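The three legality conditions that Algorithm 2 checks in Lines 15-25 (no intervening synchronization, no other consumer of the produced register in between, producer and consumer in the same loop) can be sketched as a standalone predicate. The instruction encoding below is our own deliberate simplification, not the paper's actual IR:

```python
# Simplified sketch of the three conditions of Algorithm 2. We model an
# instruction as a tuple (dest, srcs, is_sync, loop_id); this encoding
# and the function name are ours.

def can_move_next_to_consumer(code, prod, cons):
    """True iff the producer at index `prod` may be placed next to its
    consumer at index `cons`: (1) no synchronization instruction lies
    between them, (2) no other instruction between them reads the
    produced register, and (3) both sit in the same loop."""
    dest, _, _, prod_loop = code[prod]
    _, _, _, cons_loop = code[cons]
    if prod_loop != cons_loop:                  # condition 3: same loop
        return False
    for k in range(prod + 1, cons):
        _, srcs, is_sync, _ = code[k]
        if is_sync:                             # condition 1: no sync barrier
            return False
        if dest in srcs:                        # condition 2: no other consumer
            return False
    return True

# (dest, srcs, is_sync, loop_id)
code = [("r1", (), False, 0),
        ("r2", (), False, 0),
        (None, (), True, 0),                    # synchronization barrier
        ("r3", ("r1",), False, 0),
        ("r4", ("r2", "r3"), False, 0)]
assert not can_move_next_to_consumer(code, 0, 3)  # barrier at index 2 blocks it
assert can_move_next_to_consumer(code, 3, 4)      # r3 -> r4 is unobstructed
```

The bottom-up pass of Algorithm 2 applies this check while walking candidates inside two nested loops over positions, which is where the O(n³) bound in the text comes from.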
selected bit from the result of the instruction that uses the
faulty component. After the second execution, the entire
simulation results are compared to the correct results, which
are generated without any fault injection. The output is clas-
sified into the following categories: correct output, incorrect
output, and application crash.
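The classification above presupposes an injector that models both the random arrival of soft-errors and the fact that not every soft-error flips a bit. A minimal sketch of such an injector follows; this is our illustration, not the actual GPGPU-Sim integration, and the 32-bit width, names, and probabilities are assumptions:

```python
# Illustrative clock-cycle-level fault injector in the spirit of Section
# 7.1: soft-errors arrive as discrete hits, and only a fraction of them
# (flip_prob) actually flip one randomly selected bit; the rest are masked.
import random

def inject(value, cycle_hits, flip_prob, rng):
    """Apply `cycle_hits` soft-error hits to a 32-bit result. Each hit
    flips one randomly chosen bit with probability `flip_prob` and is
    otherwise masked (no architectural effect)."""
    for _ in range(cycle_hits):
        if rng.random() < flip_prob:
            value ^= 1 << rng.randrange(32)
    return value & 0xFFFFFFFF

rng = random.Random(0)
faulty = inject(0x12345678, cycle_hits=5, flip_prob=0.4, rng=rng)
# Classify the run exactly as the text does (crash detection omitted here):
outcome = "correct output" if faulty == 0x12345678 else "incorrect output"
```

Seeding the generator makes an injection campaign reproducible while still exercising the random arrival and masking behavior the text calls for.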
Fig. 11: Soft-error Reliability Improvement Compared to [10], [23] and Performance Driven Instruction Scheduling [17]. Each Application is Executed 25 Times with Different Fault Injection Rates.
hotspot application with 50 fault injections and plotted the ratio of correct output over all 50 simulation runs (Figure 12). From Figure 12, we observe that our methodology has not converged in terms of the ratio of correct output after 25 simulation runs. However, after 50 simulation runs, all the methodologies, including ours, have converged in terms of the ratio of correct output. In addition, the results show that our methodology has a failure rate similar to that of the original application, which matches the vulnerable period estimation (Figure 10).
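The convergence argument above can be sketched as follows; this is a hypothetical Python sketch in which `outcomes` is the list of per-run results (True for correct output), and the trailing window size and tolerance are illustrative assumptions rather than values from the paper:

```python
def correct_ratio(outcomes, n):
    """Ratio of correct outputs over the first n simulation runs."""
    return sum(outcomes[:n]) / n

def has_converged(outcomes, n, window=10, tol=0.05):
    """Treat the running ratio as converged after n runs if the estimates
    over the trailing `window` run counts stay within `tol` of each other."""
    ratios = [correct_ratio(outcomes, k) for k in range(n - window + 1, n + 1)]
    return max(ratios) - min(ratios) <= tol

# 50 runs in which 3 out of every 5 produce correct output.
outcomes = [True, True, True, False, False] * 10
print(correct_ratio(outcomes, 25))   # 0.6
print(has_converged(outcomes, 50))   # True with this window and tolerance
```

With a stable underlying failure rate, the running ratio flattens as more runs accumulate, which is why 25 runs may be insufficient while 50 runs suffice for all the compared methodologies.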
Fig. 12: Ratio of Correct Output for the Hotspot Application with 50 Fault Injections.

We observe that there is no correct output from the Matrix multiplication and BoxFilter applications when the fault injection rate is 100 faults/1M cycles. This is because the behavior of the Matrix multiplication and BoxFilter applications is sensitive to soft-errors. The failure rate of an application depends on two factors: the timing behavior (vulnerable period) and the functional behavior (masking effect, propagation, etc.). The application's default soft-error reliability may be determined by its functional behavior. For example, let us assume two different applications with the same vulnerable period: one with a multiplication instruction (MUL) and one with an addition instruction (ADD). Although they have the same vulnerable period, the MUL application may have a higher failure rate because the multiplication operation is more likely to propagate the soft-error effect and produce incorrect results. In Figure 11, the results from the Matrix multiplication and BoxFilter applications show examples of the above-mentioned case. Both applications have a kernel function that consists of multiple loops with multiplication operations; thus, the matrix multiplication and box filter applications may have higher soft-error sensitivities than other applications. Therefore, with a 100 faults/1M cycles fault injection rate, the
Application Faulty 10 Faults / 1M Clock Cycle 50 Faults / 1M Clock Cycle 100 Faults / 1M Clock Cycle
Component Orig. Our [10] [23] [17] Orig. Our [10] [23] [17] Orig. Our [10] [23] [17]
REGISTER FILE 14 27 13 35 12 66 186 26 122 61 115 403 106 495 129
LDSTR UNIT 5 7 3 20 7 13 32 9 25 25 27 51 30 66 37
Bp INT ALU 6 16 8 23 19 10 17 3 42 7 14 49 16 138 19
FLOAT ALU 19 19 18 18 5 55 85 46 60 100 103 235 149 142 131
SFU ALU 16 17 9 11 6 34 38 24 29 44 56 106 43 73 64
REGISTER FILE 43 26 27 68 12 158 149 61 158 101 268 266 249 596 189
LDSTR UNIT 6 1 7 7 2 4 6 12 3 5 5 8 5 8 5
Bfs INT ALU 61 0 0 4 31 4 4 0 2 227 4 4 6 10 391
FLOAT ALU 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
SFU ALU 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0
REGISTER FILE 5 0 1 6 1 19 3 14 7 7 48 17 32 56 18
LDSTR UNIT 1 0 0 1 1 8 5 4 4 6 9 5 22 8 0
Srad INT ALU 6 3 0 2 10 9 5 0 3 40 7 11 13 6 62
FLOAT ALU 3 0 4 0 3 8 7 5 3 7 16 7 29 13 7
SFU ALU 2 1 0 0 0 1 2 0 1 0 1 4 7 3 0
REGISTER FILE 3 9 23 6 0 30 36 30 38 33 109 92 53 195 76
LDSTR UNIT 3 1 3 10 2 14 7 4 10 11 38 13 26 23 19
Km INT ALU 46 0 7 0 32 7 4 2 7 216 11 14 5 21 437
FLOAT ALU 24 4 13 12 11 51 31 34 37 58 115 93 54 88 113
SFU ALU 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
REGISTER FILE 25 23 39 27 16 96 56 53 58 56 199 134 293 287 116
LDSTR UNIT 15 6 25 14 10 55 24 32 35 37 120 46 92 160 65
Mat INT ALU 2 0 4 6 64 13 10 9 13 306 377 21 26 51 557
FLOAT ALU 9 8 5 7 3 24 30 15 24 48 86 76 28 62 72
SFU ALU 0 0 0 0 0 0 2 4 0 0 0 0 12 0 0
REGISTER FILE 6 5 14 21 5 19 19 34 41 35 66 87 189 228 73
LDSTR UNIT 2 2 5 6 1 2 0 5 11 8 9 28 33 33 11
Hs INT ALU 16 20 46 31 67 29 25 78 48 310 87 110 257 248 585
FLOAT ALU 7 6 16 18 3 9 12 32 27 18 68 35 146 118 42
SFU ALU 3 0 3 0 2 5 0 8 6 1 8 8 24 23 9
REGISTER FILE 29 31 68 80 67 248 178 259 308 274 443 477 472 457 434
LDSTR UNIT 0 0 0 2 0 0 0 0 0 0 3 2 0 1 2
Box INT ALU 41 25 53 35 37 242 89 219 238 212 538 325 469 424 409
FLOAT ALU 46 30 43 28 50 177 195 195 216 195 426 425 434 380 338
SFU ALU 18 16 37 31 23 111 67 140 159 122 255 185 219 271 229
REGISTER FILE 45 48 55 206 47 260 220 245 246 137 325 391 481 546 491
LDSTR UNIT 32 36 24 25 35 176 145 64 152 71 313 269 26 269 364
Conv INT ALU 7 7 6 42 4 20 30 28 79 5 22 33 26 146 20
FLOAT ALU 97 89 71 47 126 416 301 563 419 197 662 543 785 636 854
SFU ALU 0 0 0 4 0 5 0 0 2 0 7 1 0 2 2
REGISTER FILE 247 240 250 181 226 978 1086 852 967 1082 1860 2396 1599 1444 1614
LDSTR UNIT 3 2 1 0 2 1 11 1 2 16 3 13 10 5 6
Man INT ALU 0 7 0 1 1 5 10 7 14 2 11 42 15 20 6
FLOAT ALU 367 1432 371 404 361 2086 6466 1903 2216 1974 4084 13469 3633 2370 2928
SFU ALU 183 606 186 180 216 939 3130 861 1002 928 1747 6440 1685 1076 1400
REGISTER FILE 46.33 45.44 54.44 70.00 42.89 208.22 214.78 174.89 216.11 198.44 381.44 473.67 386.00 478.22 348.89
LDSTR UNIT 7.44 6.11 7.56 9.44 6.67 30.33 25.56 14.56 26.89 19.89 58.56 48.33 27.11 63.67 56.56
Avg. INT ALU 13.89 8.67 13.78 16.00 26.11 37.67 21.56 38.44 49.56 147.22 119.00 67.67 92.56 118.22 276.22
FLOAT ALU 63.56 176.44 60.11 59.33 62.44 314.00 791.89 310.33 333.56 288.56 617.78 1,653.67 584.22 423.22 498.33
SFU ALU 24.67 71.11 26.11 25.11 27.44 121.67 359.89 115.22 133.22 121.67 230.56 749.33 221.11 160.89 189.33
TABLE 2: The List of Effective Faults During the 25 Runs of the Soft-error Reliability Experiments. Each Number Indicates How Many Effective Faults Occurred in Each Component During the Experiments.
failure of the Matrix multiplication and BoxFilter applications is an expected outcome.

In Figure 11, for the Srad application, we can observe that the performance-driven instruction scheduling shows better soft-error reliability than ours. This is one example of how soft-error reliability can be achieved through performance improvement. Since the execution time of Srad is very short, the performance improvement may reduce the window in which a soft-error can occur during execution.

Figure 13 and Figure 14 show the performance and power overheads. The performance and power overhead results are normalized with respect to the original application's performance and power consumption. In addition, in Figure 13 and Figure 14, the original application's performance and power consumption are indicated by red lines. As mentioned, since our methodology sacrifices performance to improve the soft-error reliability, it consumes less power compared to the other methodologies. Note that [23] has even lower power consumption than our methodology because of the idle time from its additional memory operations. The performance overhead of our methodology is 125% on average and the power overhead is -7% on average. This negative power overhead is an expected result because there is no change in the GPU hardware and our methodology sacrifices performance to improve the soft-error reliability. Therefore, since our methodology takes more time to execute the application, the lower power consumption is an expected result.

From Figure 13, we can observe that our methodology has a considerable performance overhead (125%). However, excluding the Mandelbrot application, the performance overhead of our methodology is only 47%, which is less than the other two reliability methodologies: the performance overheads of [10] and [23] are 53% and 92%, respectively. The major reasons are that 1) our methodology does not add any additional source code; and 2) our methodology tries to generate an instruction schedule and configuration that minimize the vulnerable period. The Srad and Matrix applications are examples where the proposed methodology improves performance. However, our methodology shows a significant
Fig. 13: Performance Overheads Compared to [10], [23] and [17].

Fig. 14: Average Power Consumption Overheads Compared to [10], [23], and [17].

performance overhead in the Mandelbrot application because of the scheduling of memory access instructions and the change of the application's configuration to minimize the effect of the effective faults. These results show that the outcome of a soft-error reliability improvement is not always a performance or power consumption improvement. Our methodology may sacrifice performance to improve the soft-error reliability if necessary. In other words, the trade-off for the soft-error improvement is very application specific.

In summary, the experimental results imply that, as shown in Figure 7, the soft-error reliability is related to the detailed timing behavior of an application, and performance improvement does not guarantee an improvement in soft-error reliability.

8 CONCLUSION

In this paper, we have proposed a GPU architecture-aware compilation methodology to improve the soft-error reliability of GPU-based systems. In this methodology, we have developed a novel instruction scheduling algorithm to minimize the vulnerable period and improve the soft-error reliability of an application. Based on the analysis of the state-of-the-art GPU architecture, we model the behavior of an application to estimate its soft-error vulnerability and generate the best instruction schedule and configuration. In addition, we have developed a fine-grained fault injection tool that is integrated with a state-of-the-art cycle-level GPU simulator to evaluate the proposed methodology. We have also developed theorems and proofs to handle the application's control flows during the vulnerable period estimation. The experimental results show that our methodology generates the instruction schedule within 8.13 seconds on average. Through our methodology, the soft-error

REFERENCES

[1] N. Avirneni and A. Somani. "Low Overhead Soft Error Mitigation Techniques for High-Performance and Aggressive Designs". IEEE Transactions on Computers, 61(4):488–501, 2012.
[2] S. Baeg, S. Wen, and R. Wong. "SRAM Interleaving Distance Selection With a Soft Error Failure Model". IEEE Transactions on Nuclear Science, 56(4):2111–2118, 2009.
[3] S. Baghsorkhi, M. Delahaye, S. Patel, W. Gropp, and W. Hwu. "An Adaptive Performance Modeling Tool for GPU Architectures". Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP'10), pages 105–114, 2010.
[4] A. Bakhoda, G. Yuan, W. Fung, H. Wong, and T. Aamodt. "Analyzing CUDA workloads using a detailed GPU simulator". IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS'09), pages 163–174, 2009.
[5] S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, S.-H. Lee, and K. Skadron. "Rodinia: A benchmark suite for heterogeneous computing". Proceedings of the 2009 IEEE International Symposium on Workload Characterization (IISWC'09), pages 44–54, 2009.
[6] J. Cong and C. Yu. "Impact of loop transformations on software reliability". IEEE/ACM International Conference on Computer-Aided Design (ICCAD'15), pages 278–285, 2015.
[7] A. El-Maleh and K. Daud. "Method for synthesizing soft error tolerant combinational circuits", 2014. US Patent 8,640,063.
[8] B. Fang, K. Pattabiraman, M. Ripeanu, and S. Gurumurthi. "GPU-Qin: A methodology for evaluating the error resilience of GPGPU applications". IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS'14), pages 221–230, 2014.
[9] B. Fang, J. Wei, K. Pattabiraman, and M. Ripeanu. "Evaluating Error Resiliency of GPGPU Applications". High Performance Computing, Networking, Storage and Analysis (SCC'13), pages 1502–1503, 2012.
[10] B. Fang, J. Wei, K. Pattabiraman, and M. Ripeanu. "Towards Building Error Resilient GPGPU Applications". 3rd Workshop on Resilient Architecture (WRA'12), 2012.
[11] L. B. Gomez, F. Cappello, L. Carro, N. Debardeleben, B. Fang, S. Gurumurthi, K. Pattabiraman, P. Rech, and M. S. Reorda. "GPGPUs: How to Combine High Computational Power with High Reliability". Design, Automation and Test in Europe Conference and Exhibition (DATE'14), pages 1–9, 2014.
[12] I. Haque and V. Pande. "Hard Data on Soft Errors: A Large-Scale Assessment of Real-World Error Rates in GPGPU". 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing (CCGrid'10), pages 691–696, 2010.
[13] M. Harris. "How to implement performance metrics in CUDA C/C++". https://devblogs.nvidia.com/parallelforall/how-implement-performance-metrics-cuda-cc, 2012.
[14] S. Hong and H. Kim. "An Integrated GPU Power and Performance Model". Proceedings of the 37th Annual International Symposium on Computer Architecture (ISCA'10), pages 280–289, 2010.
[15] H. Huang and C.-P. Wen. "Layout-Based Soft Error Rate Estimation Framework Considering Multiple Transient Faults - from Device to Circuit Level". IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD'15), pages 1–1, 2015.
[16] S. i. Abe, R. Ogata, and Y. Watanabe. "Impact of Nuclear Reaction Models on Neutron-Induced Soft Error Rate Analysis". IEEE Transactions on Nuclear Science, pages 1806–1812, 2014.
[17] J. Jablin, T. Jablin, O. Mutlu, and M. Herlihy. "Warp-aware Trace Scheduling for GPUs". Proceedings of the 23rd International Conference on Parallel Architectures and Compilation (PACT'14), pages 163–174, 2014.
[18] H. Jeon and M. Annavaram. "Warped-DMR: Light-weight Error Detection for GPGPU". Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-45), pages 37–47, 2012.
[19] H. Jeon, M. Wilkening, V. Sridharan, S. Gurumurthi, and G. Loh. "Architectural Vulnerability Modeling and Analysis of Integrated Graphics Processors". Workshop on Silicon Errors in Logic-System Effects (SELSE'13), pages 1–6, 2013.
[20] S. Kiamehr, T. Osiecki, M. Tahoori, and S. Nassif. "Radiation-Induced Soft Error Analysis of SRAMs in SOI FinFET Technology: A Device to Circuit Approach". Proceedings of the 51st Annual Design Automation Conference (DAC'14), 2014.
[21] H. Lee and M. A. A. Faruque. "GPU-EvR: Run-Time Event Based Real-Time Scheduling Framework on GPGPU Platform". Design, Automation and Test in Europe Conference and Exhibition (DATE'14), pages 1–6, 2014.
[22] D. Li, J. Vetter, and W. Yu. "Classifying Soft Error Vulnerabilities in Extreme-scale Scientific Applications Using a Binary Instrumentation Tool". Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC'12), pages 1–11, 2012.
[23] S. Li, N. Farooqui, and S. Yalamanchili. "Software Reliability Enhancements for GPU Applications". Sixth Workshop on Programmability Issues for Heterogeneous Multicores (MULTIPROG'13), 2013.
[24] A. Maghazeh, U. D. Bordoloi, A. Horga, P. Eles, and Z. Peng. "Saving Energy Without Defying Deadlines on Mobile GPU-Based Heterogeneous Systems". International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS'14), pages 1–10, 2014.
[25] N. Maruyama, A. Nukada, and S. Matsuoka. "A High-Performance Fault-Tolerant Software Framework for Memory on Commodity GPUs". IEEE International Symposium on Parallel & Distributed Processing (IPDPS'10), pages 1–12, 2010.
[26] N. Miskov-Zivanov and D. Marculescu. "MARS-C: Modeling and Reduction of Soft Errors in Combinational Circuits". Proceedings of the 43rd Annual Design Automation Conference (DAC'06), pages 767–772, 2006.
[27] S. S. Mukherjee, C. Weaver, J. Emer, S. K. Reinhardt, and T. Austin. "A Systematic Methodology to Compute the Architectural Vulnerability Factors for a High-Performance Microprocessor". Proceedings of the 36th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'03), pages 29–40, 2003.
[28] NVIDIA. "NVIDIA's next generation CUDA compute architecture: Fermi". 2009.
[29] NVIDIA. "NVIDIA's next generation CUDA compute architecture: Kepler GK110". 2012.
[30] D. Oliveira, C. Lunardi, L. Pilla, P. Rech, P. Navaux, and L. Carro. "Radiation Sensitivity of High Performance Computing Applications on Kepler-Based GPGPUs". 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN'14), pages 732–737, 2014.
[31] D. Palframan, N. Kim, and M. Lipasti. "Precision-aware soft error protection for GPUs". IEEE 20th International Symposium on High Performance Computer Architecture (HPCA'14), pages 49–59, 2014.
[32] R. B. Parizi, R. Ferreira, L. Carro, and Á. Moreira. "Compiler Optimizations Do Impact the Reliability of Control-Flow Radiation Hardened Embedded Software". Embedded Systems: Design, Analysis and Verification: 4th IFIP TC 10 International Embedded Systems Symposium (IESS'13), pages 49–60, 2013.
[33] P. Rech, L. Pilla, P. Navaux, and L. Carro. "Impact of GPUs Parallelism Management on Safety-Critical and HPC Applications Reliability". 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN'14), pages 455–466, 2014.
[34] S. Rehman, M. Shafique, and J. Henkel. "Instruction scheduling for reliability-aware compilation". 49th ACM/EDAC/IEEE Design Automation Conference (DAC'12), pages 1288–1296, 2012.
[35] S. Rehman, M. Shafique, F. Kriebel, and J. Henkel. "Reliable Software for Unreliable Hardware: Embedded Code Generation Aiming at Reliability". Proceedings of the Seventh IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS'11), pages 237–246, 2011.
[36] S. Rehman, M. Shafique, F. Kriebel, and J. Henkel. "RAISE: Reliability-Aware Instruction SchEduling for unreliable hardware". 17th Asia and South Pacific Design Automation Conference (ASP-DAC'12), pages 671–676, 2012.
[37] D. Sabena, L. Sterpone, L. Carro, and P. Rech. "Reliability Evaluation of Embedded GPGPUs for Safety Critical Applications". IEEE Transactions on Nuclear Science, pages 23–29, 2014.
[38] G. Sadowski. "Design Challenges Facing CPU-GPU-Accelerator Integrated Heterogeneous Systems". Design Automation Conference (DAC'14), 2014.
[39] J. W. Sheaffer, D. P. Luebke, and K. Skadron. "A Hardware Redundancy and Recovery Mechanism for Reliable Scientific Computation on Graphics Processors". SIGGRAPH/Eurographics Workshop on Graphics Hardware, pages 55–64, 2007.
[40] G. Shi, J. Enos, M. Showerman, and V. Kindratenko. "On Testing GPU Memory for Hard and Soft Errors". Proc. Symposium on Application Accelerators in High-Performance Computing (SAAHPC'09), pages 1–3, 2009.
[41] G. Shobaki, K. Wilken, and M. Heffernan. "Optimal Trace Scheduling Using Enumeration". ACM Transactions on Architecture and Code Optimization (TACO'09), pages 1–32, 2009.
[42] J. Tan and X. Fu. "RISE: Improving the Streaming Processors Reliability Against Soft Errors in GPGPUs". Proceedings of the 21st International Conference on Parallel Architectures and Compilation Techniques (PACT'12), pages 191–200, 2012.
[43] J. Tan, Z. Li, and X. Fu. "Soft-error reliability and power co-optimization for GPGPUs register file using resistive memory". Design, Automation and Test in Europe Conference and Exhibition (DATE'15), pages 369–374, 2015.
[44] D. Tiwari, S. Gupta, J. Rogers, D. Maxwell, P. Rech, S. Vazhkudai, D. Oliveira, D. Londo, N. DeBardeleben, P. Navaux, L. Carro, and A. Bland. "Understanding GPU errors on large-scale HPC systems and the implications for system design and operation". IEEE 21st International Symposium on High Performance Computer Architecture (HPCA'15), pages 331–342, 2015.
[45] J. Wadden, A. Lyashevsky, S. Gurumurthi, V. Sridharan, and K. Skadron. "Real-world Design and Evaluation of Compiler-managed GPU Redundant Multithreading". Proceedings of the 41st Annual International Symposium on Computer Architecture (ISCA'14), pages 73–84, 2014.
[46] K. Wu and D. Marculescu. "A Low-Cost, Systematic Methodology for Soft Error Robustness of Logic Circuits". IEEE Transactions on Very Large Scale Integration (VLSI) Systems, pages 367–379, 2013.
[47] R. Xiaoguang, X. Xinhai, W. Qian, C. Juan, W. Miao, and Y. Xuejun. "GS-DMR: Low-overhead soft error detection scheme for stencil-based computation". Parallel Computing, pages 50–65, 2015.
[48] J. Yan and W. Zhang. "Compiler-guided Register Reliability Improvement Against Soft Errors". Proceedings of the 5th ACM International Conference on Embedded Software (EMSOFT'05), pages 203–209, 2005.

Haeseung Lee received the Master's degree in Electrical Engineering from Arizona State University in 2013. Currently, he is pursuing his Ph.D. degree in the Department of Electrical Engineering and Computer Science at the University of California, Irvine. His current research interests include real-time systems, multi-core systems, image processing, networks-on-chip, and reliability.

Mohammad Al Faruque is currently with the University of California, Irvine (UCI), where he is a tenure-track assistant professor directing the Cyber-Physical Systems Lab. Prof. Al Faruque is the recipient of the IEEE CEDA Ernest S. Kuh Early Career Award 2016. He served as an Emulex Career Development Chair from October 2012 until July 2015. Before that, he was with Siemens Corporate Research and Technology in Princeton, NJ. His current research is focused on the system-level design of embedded systems and Cyber-Physical Systems (CPS), with special interest in multi-core systems, real-time scheduling algorithms, dependable systems, etc. Dr. Al Faruque received the Thomas Alva Edison Patent Award 2016 from the Edison Foundation, the 2016 DATE Best Paper Award, the 2015 DAC Best Paper Award, the 2009 IEEE/ACM William J. McCalla ICCAD Best Paper Award, the 2016 NDSS Distinguished Poster Award, the 2015 Hellman Fellow Award, the 2012 DATE Best IP Award Nomination, the 2008 HiPEAC Paper Award, and the 2005 DAC Best Paper Award Nomination. Besides 55+ IEEE/ACM publications in premier journals and conferences, Dr. Al Faruque holds 6 US patents. Dr. Al Faruque is an IEEE senior member.