GPU Architecture Aware Instruction Scheduling for Improving Soft-error Reliability

Haeseung Lee, Student Member, IEEE, and Mohammad Abdullah Al Faruque, Senior Member, IEEE
The authors are with the Department of Electrical Engineering and Computer Science, University of California, Irvine, California, USA. E-mail: {haeseunl, alfaruqu}@uci.edu. Manuscript received Mar 07, 2016; revised Nov 08, 2016.

Abstract—The demand for low-power and high-performance computing has been driving the semiconductor industry for decades, and semiconductor technology has been scaled down to satisfy these demands. At the same time, this scaling has introduced severe reliability challenges such as soft-errors. Research has been conducted to improve the soft-error reliability of the GPU using various methodologies, most of them based on redundancy. However, the GPU compiler has yet to be considered for improving the soft-error reliability of the GPU. In this paper, we propose a novel GPU architecture aware compilation methodology to improve the soft-error reliability of the GPU. The proposed methodology jointly considers the parallel behavior of the GPU hardware and the application, and minimizes the vulnerability of GPU applications during instruction scheduling. In addition, the proposed methodology can complement any hardware-based soft-error reliability improvement technique. We compared our compilation methodology with the state-of-the-art soft-error reliability aware techniques and with performance aware instruction scheduling. We injected soft-errors during the experiments and compared the number of correct executions, i.e., executions with no erroneous output. Our methodology incurs less performance and power overhead than the state-of-the-art soft-error reliability methodologies in most cases, and its compilation time overhead is 8.13 seconds on average. The experimental results show that our methodology improves the soft-error reliability by 23% and 12% on average (up to 64% and 52%) compared to the state-of-the-art soft-error reliability aware and performance aware compilation techniques, respectively. Moreover, we show that the soft-error reliability of a GPU is not determined by performance alone, but by the fine-grained timing behavior of an application.

Index Terms—GPGPU, Soft-error, Reliability, Instruction Scheduling, Compiler.

1 INTRODUCTION AND MOTIVATIONAL EXAMPLE

In recent years, the number of devices that require low-power and high-performance computing has been increasing. In order to satisfy these demands, the semiconductor industry has performed extensive technology scaling. In addition, many systems use a Graphics Processing Unit (GPU) to handle extensive amounts of computation [21] [24] [38]. However, aggressive technology scaling causes several reliability issues such as soft-errors. In particular, real-world GPUs are susceptible to soft-errors even under normal operating conditions [12]. In addition, the probability that a soft-error occurs on a single hardware component is proportional to the time that the hardware component is in use [48]. The soft-error reliability of the GPU is important because a GPU-based system handles most of its computation on the GPU; in other words, if the GPU produces incorrect results, the entire system may behave incorrectly. In order to improve the soft-error reliability of the GPU, many methodologies have been proposed [10] [19] [23] [45] [47]. However, instruction scheduling has not been considered for improving the soft-error reliability of the GPU (see Section 2 for detailed related work on improving the soft-error reliability of the GPU).

In order to observe how instruction scheduling affects the vulnerable period, we created three versions of a matrix multiplication application by modifying its instruction schedule. The vulnerable period is a metric that measures the soft-error reliability of a GPU application [19] [27]; it is the time from the moment that a datum is produced until the last moment that it is consumed [27] (see Section 4 for a more detailed definition and Section 5 for the vulnerable period estimation technique). We then measured the vulnerable period of these three matrix multiplication applications by using the GPGPU-Sim [4] simulator and show the experimental results in Figure 1. Below we provide some observations on these experimental results.

Fig. 1: Motivational Example to Illustrate the Relation Between Vulnerable Period and Instruction Schedule.

The total amount of the vulnerable period is not proportional to the number of threads in a thread block; the vulnerable periods vary depending on the number of threads in a thread block. In addition, there is no instruction schedule that always shows the minimum vulnerable period. For example, Scheduling Algorithm 1 shows the smallest vulnerable period when the number of threads is 625. However, when
the number of threads is 841, Scheduling Algorithm 2 shows the smallest vulnerable period.

These experimental results indicate that the vulnerable period of an application depends not only on the instruction scheduling but also on the parallel behavior of the GPU. The results show only a small amount of change in the vulnerable period, because the kernel function is simple and the number of threads is small. However, the results in Figure 1 still show that the vulnerable period, which is related to the soft-error reliability, may be improved through instruction scheduling. Therefore, the parallel behavior of the GPU and the instruction scheduling need to be considered together to further improve the soft-error reliability.

2 RELATED WORK

Research has been conducted to improve the soft-error reliability of processors including GPUs [10] [11] [18] [23] [33] [39] [42] [45] [47]. Some research works have shown the impact of soft-errors by using radioactive sources [30] [44]. However, these radioactive sources are difficult to use and are not suitable for general purpose applications; therefore, some research works have focused on techniques to model the soft-error behavior [2] [16] [20] and to evaluate the soft-error resiliency of an application [9] [15] [27] [30] [40] [44]. Other research works have proposed to detect the occurrence of soft-errors and ensure the correctness of an application by using various techniques (i.e., redundant execution [18] [39] [42] [45] [47], insertion of protection code [10] [23], and leveraging architectural characteristics [11] [33]). In the rest of this section, we discuss the above-mentioned research works in detail.

Various research works have been conducted to demonstrate the impact of soft-errors and evaluate the soft-error resilience of GPU applications [9] [15] [27] [30] [40] [44]. One research work [27] proposed a metric to quantify the soft-error reliability based on the detailed timing behavior of an application. Other research works [30] and [44] showed the impact of radiation-induced soft-errors on NVIDIA's GPUs; in order to inject actual soft-errors into the GPU, the target system is exposed to a neutron beam. Another research work [37] has evaluated the soft-error resilience of several safety-critical applications, exposing an embedded GPGPU platform to a neutron flux in order to measure the soft-error resilience. A soft-error estimation framework is proposed in [15] to accurately estimate the soft-error rate; unlike traditional netlist-based techniques, the proposed framework estimates the soft-error rate from the layout of the target processor. The impact of soft-errors has been discussed in [9] and [40], where various techniques (i.e., debugger based fault injection) are used to show the impact of soft-errors on the GPU. These research works have successfully demonstrated the impact of soft-errors in real-world environments and applications. However, these works do not propose techniques to improve the soft-error reliability.

Research has been conducted to model the soft-error behavior. One research work [2] has proposed a soft-error model to provide the failure probability of various interleaving techniques for SRAM. Other research in [16] has discussed a soft-error model for 25nm technology. Another research work [20] has proposed a cross-layer analysis approach for modeling the soft-error behavior of FinFET transistors; the proposed approach performs 3D simulations from particle interactions in FinFET structures up to the circuit level. A soft-error susceptibility estimation technique has been proposed in [26], which uses a symbolic modeling technique to estimate the soft-error susceptibility of a combinational logic circuit. However, since the detailed GPU architecture is not available to the research community, these research works are not applicable to the GPU.

Research has been conducted to provide soft-error injection tools, because it is extremely difficult to perform experiments in a radiation environment [8] [22]. One research work [22] has proposed a soft-error injection tool that injects a single bit-flip into a data object in the binary. Other research [8] has proposed a soft-error injection technique that randomly selects an instruction and changes its result. However, these research works have limitations in modeling the random behavior of soft-errors: although not all injected soft-errors cause a bit-flip, these tools inject a bit-flip whenever a soft-error occurs.

Research has been conducted to find incorrect behavior and ensure functional correctness [18] [25] [39] [42] [45] [47]. One research work [25] has proposed an application framework to handle soft-errors in GPU DRAM; the DRAM errors are detected by using a dual parity technique and the application is recovered from checkpoints. Other research [47] has proposed a software-level Dual-Modular Redundancy (DMR) technique, where each stage has a monitor function and its result is compared to the result from the monitor function. Another research work [18] has proposed a redundancy technique that utilizes the GPU idle time caused by branch divergence. A duplication technique proposed in [39] redundantly executes the critical parts of the GPU pipeline and recomputes erroneous results when an error is detected. A redundant execution methodology is proposed in [42] that uses the idle time of the GPU in order to minimize the performance overhead. An automatic Redundant Multithreading (RMT) technique is proposed in [45], which modifies an application's code during compile-time to add redundant execution. However, these works may cause a significant amount of performance and power overhead, because the proposed techniques are based on redundant execution or recomputation.

Research has also been conducted to improve the soft-error reliability by leveraging the parallel behavior of the GPU [11] [33]. It is shown in [11] that the GPU's soft-error reliability is affected by its parallel behavior. For example, one research work [33] has demonstrated that the Mean Executions Between Failures (MEBF) of a GPGPU application is affected by the parallel behavior of the GPU; in order to show the relationship between the MEBF and the parallel behavior of the GPU, the grid and the block size of a GPGPU application are modified. However, these research works do not provide techniques to find the grid and block sizes that improve the soft-error reliability of the GPU.

Various instruction scheduling algorithms have been proposed in [34], [35] and [36]. One research project [35] has proposed a metric to quantify the soft-error reliability by using both the detailed timing behavior of the
application and the hardware information. Based on the proposed metric, an instruction schedule is generated to maximize the soft-error reliability. Other research [34] has proposed an instruction scheduling algorithm to maximize the soft-error reliability under various performance constraints. Another research work [36] has proposed an instruction scheduling algorithm that maximizes the soft-error reliability of a specific component: based on a compiler option, the proposed algorithm maximizes the soft-error reliability of the selected components. For example, if the compiler option selects the register file, then the proposed algorithm generates an instruction schedule that maximizes the soft-error reliability of the register file. However, since these instruction scheduling algorithms are designed for RISC processors such as the SPARC-V8 architecture, they are not applicable to the GPU. In addition, since the probability of a soft-error occurrence is proportional to the time that a hardware component is in use, performance aware instruction scheduling may also be considered to improve the soft-error reliability [17]. However, since performance aware instruction scheduling focuses on maximizing the GPU resource utilization, its improvement in the soft-error reliability is limited.

Research has been conducted for detection and protection of the vulnerable parts of GPGPU applications [6] [10] [23] [32]. One research work [10] has proposed checker functions that are inserted to protect the potentially vulnerable parts of a GPGPU application. Other research [32] has proposed a compilation technique that improves the control-flow reliability. Another research work [23] has proposed a compile-time methodology that protects the memory access instructions by inserting checker instructions. One study [6] has proposed an application-level technique that modifies the loop code during compile-time. However, these works have limitations in improving the soft-error reliability of the GPU because a soft-error may occur in any part of a GPGPU application. In summary, the above-mentioned state-of-the-art methodologies suffer from the following limitations:

1) They control the parallel behavior of the GPU and modify the application source code to improve the soft-error reliability. However, instruction scheduling methodologies have not been considered to further improve the soft-error reliability.
2) Their techniques may cause significant performance and power overhead because they are based on redundancy or recomputation.
3) Their fault injection tools do not properly model the random behavior of soft-errors. Not all injected soft-errors cause a bit-flip, but existing fault injection techniques cause a bit-flip for every soft-error injection.

3 RESEARCH CHALLENGES AND CONTRIBUTIONS

3.1 Problem and Research Challenges

The problem of minimizing the vulnerable period during compile-time to improve the soft-error reliability poses the following research challenges:

1) Estimation of the vulnerable period of GPU applications by considering an accurate GPU execution model, in order to minimize the vulnerable period of an application during compile-time.
2) Efficient selection of the location of instructions while minimizing the total vulnerable period of an application. In addition, the instruction scheduling should not cause significant compilation overhead.
3) Modeling of the random behavior of soft-errors to properly evaluate methodologies. During run-time, soft-errors should be randomly injected into the GPU at a given rate. Moreover, we should consider that not all injected soft-errors cause a bit-flip.

3.2 Our Novel Contributions

To address the above-mentioned challenges, we propose a novel methodology to minimize the vulnerable period of an application on a GPU-based system, which employs:

1) A GPU architecture aware instruction scheduling algorithm (Section 6.1) that minimizes the total vulnerable period of a GPU application by considering the impact of the parallel behavior of the GPU. To support this algorithm, we require all of the following:
2) Estimation of the vulnerable period of a GPU application during compile-time (Section 5), which provides the thread-level information for the GPU.
3) Theorems to handle the application's control flow (Section 5.2) during the vulnerable period estimation. These theorems help to estimate control flow constructs such as loops and branches.
4) A fine-grained clock-cycle-level fault injection tool (Section 7.1) to verify our methodology. The fault injection tool, which can randomly inject faults at the granularity of a clock cycle, is integrated with the state-of-the-art GPGPU-Sim simulator.

In addition to the above-mentioned contributions, we also provide a comparison with the state-of-the-art performance aware instruction scheduling [17] (Section 6.3) to illustrate why our methodology may achieve maximum soft-error reliability.

Fig. 2: Overview of the Proposed Soft-error Reliability Improvement Methodology.

Figure 2 shows the overview of the methodology in this paper. The proposed compilation methodology has strong potential to improve the soft-error reliability of the GPU and can complement hardware oriented techniques such as [1], [7], [31], [43], and [46].

4 SYSTEM MODELS

4.1 GPGPU Application Model

A GPGPU application is considered as our target. Generally, a GPGPU application consists of a host code and a kernel code. The host code is executed on a CPU and the kernel code is executed on the GPU. The proposed instruction scheduling and vulnerable period estimation algorithms
are designed based on an architecture similar to NVIDIA's Fermi architecture [28], which contains similar hardware units compared to the Kepler architecture [29]. The host orchestrates the overall behavior of a GPGPU application and the kernel handles most of the computation in the GPGPU application.

During run-time, the configuration must be provided by the host before it launches a kernel. This configuration consists of two numbers: the grid size and the block size. The grid size represents the dimension of the GPU thread block hierarchy at the Streaming Multiprocessor (SM) level, and the block size represents the size of each thread block. Right after the host launches a kernel function, the mapping between the thread blocks and the SMs is created based on the configuration, and the entire kernel function is submitted to the GPU hardware. When the GPU executes a kernel function, the threads in the same thread block are grouped into warps. Note that a single thread block can have multiple warps. A warp is the basic unit of execution of the GPU: the threads in the same warp fetch and execute the same instruction. The behavior of warps in the same thread block is sequential because they share a single SM. Since multiple SMs operate at the same time, the behavior of thread blocks can be either parallel or sequential. Since the detailed architecture of the GPU is not available to the research community, we assume that the GPU uses a round-robin algorithm to schedule the warps, as has also been adopted in [4].

4.2 Fault Model

Neutron-induced soft-errors are considered as a primary reason for a bit-flip. Since natural radiation is evenly distributed across the chip, we assume that neutron-induced soft-errors are evenly distributed over the GPU area. As discussed in Section 2, several metrics and techniques have been proposed to quantify the impact of soft-errors. Among these soft-error quantification metrics, the Architectural Vulnerability Factor (AVF) [27] and the Instruction Vulnerability Index (IVI) [35] are two of the most well-known metrics for quantifying the soft-error reliability. The AVF represents the probability that a single soft-error on a hardware component causes a visible error. In the AVF metric, the probability of a soft-error is proportional to the amount of time that the data resides on a hardware component. In order to estimate the AVF, the detailed timing behavior of the hardware components needs to be obtained; based on the usage of each hardware component, the AVF is calculated. In the AVF metric, a hardware component is susceptible to soft-errors while the hardware component is in use. This susceptible period is referred to as the vulnerable period. The IVI metric considers not only the temporal effect, which is considered in the AVF metric, but also the spatial effect, i.e., the area of each component. Since soft-errors are evenly distributed throughout the processor, the probability that a soft-error occurs on a hardware component is affected by the area of the component as well. In this paper, since a detailed Register Transfer Level (RTL) model, which is required for the IVI metric, is not available to the research community, we use the AVF metric to quantify the soft-error effect. However, if an RTL model of the GPU is provided, then our methodology is also scalable to the IVI metric, as has been done for the single-core SPARC-V8 CPU in [35].

5 VULNERABLE PERIOD ESTIMATION FOR GPU

Since the parallel behavior of the GPU affects the soft-error reliability of the GPU as well, the architectural characteristics of the GPU also need to be considered to estimate the vulnerable period of a GPU application. During run-time, each warp runs instructions in lock-step, and it is assumed that the GPU hardware schedules the warps in a round-robin manner [4]. The latency of an instruction depends on the parallel behavior of the GPU pipeline and the memory access latency [14]. By using the state-of-the-art GPU performance models, we may estimate the timing behavior of the GPU and extract information to estimate the vulnerable period. Below, we provide more details about the latency estimation of an instruction, which uses the state-of-the-art GPU performance models from [3] [14]. The model parameters are generated from the actual GPU by using micro-benchmarks.

5.1 Latency of Instruction Execution

The nth instruction can be issued after all warps have issued the (n - 1)th instruction and all the data for the nth instruction is ready to use. Therefore, the amount of time to execute n instructions may be represented by the following equation:

    t_{I_n} = \sum_{i=1}^{n} ( L^{i}_{ready} + L_{issue} ) + L^{n}_{pipe} + L^{n}_{mem}    (1)

where L^{i}_{ready} represents the latency to make the ith instruction ready, L_{issue} represents the latency to issue an instruction for all the warps in a thread block, and L^{n}_{pipe} and L^{n}_{mem} represent the pipeline and the memory access latency of the nth instruction, respectively. The instruction issue latency, L_{issue}, shown in Figure 3, is represented by the following equation:

    L_{issue} = ( n_{warp} \times sizeof(warp) ) / ( n_{core} \times r_{issue} )    (2)

where n_{warp} represents the number of warps in a thread block, n_{core} represents the number of streaming processors in a single SM, and r_{issue} represents the instruction issue rate. The pipeline latency, L_{pipe}, depends on the instruction and can be obtained empirically.

As mentioned in the beginning of this section, the nth instruction will be issued after all of its data dependencies are resolved. After the (n - 1)th instruction is issued completely, if there is an unresolved data dependency, then the nth instruction will wait until all the data is ready to use. The instruction ready latency, L^{n}_{ready}, is a function of the L^{p}_{pipe} and L^{p}_{mem} of the producer instructions p. Therefore, L^{n}_{ready} depends on the latencies of the producer instructions and may be represented by the following equation:

    L^{n}_{ready} = \max( t^{n-1}_{issue\_comp}, t^{n}_{data\_ready} ) - t^{n-1}_{issue\_comp}    (3)

where t^{n-1}_{issue\_comp} represents the time at which all of the first (n - 1) instructions have been issued, and t^{n}_{data\_ready} represents the time at which all the data for the nth instruction is ready to use. Moreover, the instruction execution may overlap with the issue of other instructions and with memory operations; the completion time of the nth instruction is therefore given by Equation 1. Equation 1 also shows that the latencies caused by the data dependencies are related to both the arithmetic and the memory operations.
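To make Equations 1-3 concrete, the following Python sketch evaluates them for a short straight-line instruction sequence. All hardware parameters (warp count, core count, issue rate, per-instruction latencies) are illustrative assumptions, not values from the paper's measured models.

    # Sketch of the latency model of Equations 1-3 (illustrative parameters).
    WARP_SIZE = 32   # sizeof(warp)
    N_WARP    = 8    # warps per thread block (assumed)
    N_CORE    = 32   # streaming processors per SM (assumed)
    R_ISSUE   = 1    # instruction issue rate (assumed)

    # Equation 2: latency to issue one instruction for all warps in the block.
    L_ISSUE = (N_WARP * WARP_SIZE) / (N_CORE * R_ISSUE)

    def completion_times(instructions):
        """instructions: list of (l_pipe, l_mem, producer_indices).
        Returns the completion time t_In of every instruction (Equation 1)."""
        t_issue_comp = 0.0  # time when instructions 1..n are all issued
        t_done = []
        for l_pipe, l_mem, producers in instructions:
            # Data is ready once every producer instruction has completed.
            t_data_ready = max((t_done[p] for p in producers), default=0.0)
            # Equation 3: stall between issue completion and operand readiness.
            l_ready = max(t_issue_comp, t_data_ready) - t_issue_comp
            # Accumulate the sum of (L_ready + L_issue) terms of Equation 1.
            t_issue_comp += l_ready + L_ISSUE
            # Equation 1: add this instruction's own pipe and memory latency.
            t_done.append(t_issue_comp + l_pipe + l_mem)
        return t_done

    # A load (I0), an independent add (I1), and an add consuming I0's result.
    print(completion_times([(4, 100, []), (4, 0, []), (4, 0, [0])]))

In this example, the third instruction's completion time is dominated by the load's memory latency, which is exactly the stall that Equation 3 captures.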


To accurately measure the data dependent latency, the latencies of the arithmetic and the memory access instructions need to be considered together.

Fig. 3: Example Pipeline Stages of Arithmetic Instructions.

5.1.1 Latency of Arithmetic Instruction

The latency of an arithmetic instruction consists of two latencies: the instruction ready latency and the pipeline latency. The instruction ready latency is the period between the issue completion time of the previous instruction and the latest completion time among the predecessor instructions. Figure 3 shows an example of the instruction ready latency. The issue completion time may be estimated by the summation of the instruction ready latencies and the L_{issue} of all the instructions preceding the nth instruction. The instruction ready latency may be estimated from the predecessor instruction whose execution time is the longest and which finishes after the issue completion time of the dependent instructions. Note that the pipeline latency likely overlaps with the issue latency; therefore, the instruction ready latency is calculated by subtracting the issue latency from the pipeline latency. In Figure 3, the instruction I_{n-2} actually dominates the instruction ready latency since it has the longest execution time. The instruction ready latency is two clock cycles for I_n because one overlapped issue cycle of I_{n-2} and one overlapped issue cycle of I_{n-1} are subtracted from the longest execution time.

5.1.2 Latency of Memory Access Instruction

The latency of a memory access instruction is calculated by subtracting the issue latency from the total memory access time. Figure 4 shows an example of the memory access latency.

Fig. 4: Example Pipeline Stages of Memory Access Instructions.

Assuming that the system has adopted GDDR memory, the maximum memory bandwidth, BW_{MEM}, is the bandwidth of the bus multiplied by the clock frequency of the memory (Equation 4) [13]. Due to the fact that the exact cache behavior may be impossible to estimate during compile-time, the average memory access latency is used to estimate the memory access latency of the nth instruction.

    BW_{MEM} = Bus\_width \times Memory\_freq \times 2    (4)

During run-time, the memory bandwidth is consumed by the memory accesses. The bandwidth consumed by each memory access may be estimated by the following equation:

    bw = Fetch\_size \times Core\_freq    (5)

After that, the maximum number of concurrent memory accesses may be estimated by the following equation:

    N_{mem\_req} = \lfloor BW_{MEM} / bw \rfloor    (6)

During run-time, the total number of memory accesses from an instruction is calculated by the following equation:

    n_{mem} = n_{warp} \times n_{SM}    (7)

Since all the memory accesses may not be served at the same time, the memory latency is estimated based on the currently available bandwidth, the number of ongoing memory accesses, and the number of requested memory accesses. The memory latency consists of two parts: the latency of the memory accesses themselves (L_{mem}) and the delay due to fully occupied memory bandwidth (d^{n}_{full}):

    L^{n}_{mem} = L_{mem} + d^{n}_{full}    (8)

If there is no ongoing memory access, d^{n}_{full} becomes 0 and the memory access latency may be estimated by the following equations:

    N_{full} = \lfloor n_{mem} / N_{mem\_req} \rfloor    (9)

    N_{remain} = MOD( n_{mem}, N_{mem\_req} )    (10)

    L_{mem} = ( N_{full} + \lceil N_{remain} / N_{mem\_req} \rceil ) \times L^{avg}_{mem}    (11)

The above equations imply that the memory bandwidth is not always fully occupied. Therefore, in order to track the memory bandwidth and describe the memory access latency, we define three parameters: 1) the time when memory bandwidth becomes available (t^{n}_{nonfull}), 2) the time when the memory access is completed (t^{n}_{memfinish}), and 3) the number of available memory accesses between t^{n}_{nonfull} and t^{n}_{memfinish} (N^{n}_{nonfull}). By using these parameters, we may describe the memory latency and track the bandwidth status. Note that N^{n}_{nonfull} may represent the available memory bandwidth between t^{n}_{nonfull} and t^{n}_{memfinish}, because the available memory bandwidth can be represented by the number of available memory accesses. In order to track the memory status, t^{n}_{nonfull}, t^{n}_{memfinish}, and N^{n}_{nonfull} need to be updated for each instruction. If the nth instruction is not a memory access instruction, then t^{n}_{nonfull}, t^{n}_{memfinish}, and N^{n}_{nonfull} are the same as the previous instruction's values, which are t^{n-1}_{nonfull}, t^{n-1}_{memfinish}, and N^{n-1}_{nonfull}. However, if the nth instruction is a memory access instruction, then t^{n}_{startmem}, t^{n}_{nonfull}, t^{n}_{memfinish}, and N^{n}_{nonfull} are updated based on the current memory status, which is represented by t^{n-1}_{nonfull}, t^{n-1}_{memfinish}, and N^{n-1}_{nonfull}. From Equation 1, by setting L^{n}_{mem} = 0, it is possible to estimate the time when the nth instruction's memory accesses are sent to the memory system, t^{n}_{startmem}.
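As a worked illustration of Equations 4-11, the sketch below computes the peak bandwidth, the maximum number of concurrent accesses, and the idle-bus access latency. The bus width, clock frequencies, fetch size, and average latency are assumed example values, not parameters reported in the paper.

    import math

    # Assumed example parameters (not the paper's measured values).
    BUS_WIDTH  = 48      # bytes per memory bus cycle
    MEM_FREQ   = 924e6   # memory clock in Hz
    FETCH_SIZE = 32      # bytes fetched per memory access
    CORE_FREQ  = 700e6   # core clock in Hz
    L_AVG_MEM  = 440     # average memory access latency in cycles

    BW_MEM = BUS_WIDTH * MEM_FREQ * 2       # Equation 4 (double data rate)
    bw = FETCH_SIZE * CORE_FREQ             # Equation 5: one access's share
    N_MEM_REQ = int(BW_MEM // bw)           # Equation 6: concurrent accesses

    def idle_bus_latency(n_warp, n_sm):
        """Equations 7 and 9-11: latency when no access is ongoing."""
        n_mem = n_warp * n_sm               # Equation 7: requested accesses
        n_full = n_mem // N_MEM_REQ         # Equation 9: fully loaded rounds
        n_remain = n_mem % N_MEM_REQ        # Equation 10: leftover accesses
        # Equation 11: full rounds plus at most one partial round.
        return (n_full + math.ceil(n_remain / N_MEM_REQ)) * L_AVG_MEM

    print(N_MEM_REQ, idle_bus_latency(n_warp=8, n_sm=15))

With these assumed numbers, three accesses can be in flight at once, so 120 simultaneous requests serialize into 40 rounds of the average access latency.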


Based on t^{n-1}_{nonfull}, t^{n-1}_{memfinish}, and t^{n}_{startmem}, there are three possible scenarios for estimating the memory access latency.

1) t^{n}_{startmem} is greater than t^{n-1}_{memfinish}. In this scenario, there is no ongoing memory access at time t^{n}_{startmem}. Therefore, the memory access latency may be estimated by using the above-mentioned equations (Equations 9-11). In addition, N^{n}_{nonfull} is the same as N_{remain}.

2) t^{n}_{startmem} is between t^{n-1}_{nonfull} and t^{n-1}_{memfinish}. In this scenario, there are ongoing memory accesses, but the memory bandwidth is not fully occupied. Some memory accesses can begin immediately and the other memory accesses can begin after t^{n-1}_{memfinish}. Therefore, d^{n}_{full} becomes 0 and, based on the available bandwidth, N^{n-1}_{nonfull}, and N_{remain}, the memory latency and the parameters are updated. When N_{remain} is greater than N^{n-1}_{nonfull}, the last memory access is followed by the memory accesses that start at t^{n-1}_{memfinish}. Therefore, the parameters are updated as follows:

    t^{n}_{nonfull} = t^{n}_{startmem} + L^{n}_{mem}    (12)

    t^{n}_{memfinish} = t^{n-1}_{memfinish} + L^{n}_{mem}    (13)

    N^{n}_{nonfull} = N_{mem\_req} - ( N_{remain} - N^{n-1}_{nonfull} )    (14)

In the case when N_{remain} is less than or equal to N^{n-1}_{nonfull}, the last memory access is followed by the memory access that starts at t^{n}_{startmem}. In addition, the available memory bandwidth, N^{n}_{nonfull}, is related to the memory accesses that finish at time t^{n-1}_{memfinish}. Therefore, the parameters are updated as follows:

    t^{n}_{nonfull} = t^{n-1}_{memfinish} + ( N_{full} \times L^{avg}_{mem} )    (15)

    t^{n}_{memfinish} = t^{n-1}_{memfinish} + L^{n}_{mem}    (16)

    N^{n}_{nonfull} = N_{mem\_req} - N_{remain}    (17)

In the case when N_{remain} is equal to 0, the memory access behavior is exactly the same as that of the previous access. Therefore, the parameters are updated as follows:

    t^{n}_{nonfull} = t^{n}_{startmem} + ( N_{full} \times L^{avg}_{mem} )    (18)

    t^{n}_{memfinish} = t^{n-1}_{memfinish} + ( N_{full} \times L^{avg}_{mem} )    (19)

    N^{n}_{nonfull} = N^{n-1}_{nonfull}    (20)

3) t^{n}_{startmem} is less than t^{n-1}_{nonfull}. In this scenario, there is no available memory bandwidth; some memory bandwidth becomes available only after t^{n-1}_{nonfull}. This scenario is a special case of the previous scenario where d^{n}_{full} > 0 and the memory accesses start at t^{n-1}_{nonfull}, because the memory access behavior will be similar to the previous scenario after t^{n-1}_{nonfull}. The memory access latency may be estimated by adding the delay d^{n}_{full} to the equations of the previous scenario. d^{n}_{full} may be estimated by the following equation:

    d^{n}_{full} = t^{n-1}_{nonfull} - t^{n}_{startmem}    (21)
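The three scenarios amount to a small state-update routine. The sketch below keeps the (t_nonfull, t_memfinish, N_nonfull) triple and dispatches on where t_startmem falls; the function decomposition and variable names are ours, and where the text leaves an update implicit (the time stamps in scenario 1) we fill in the natural choice.

    def update_mem_status(t_nonfull, t_memfinish, n_nonfull,
                          t_startmem, l_mem, n_full, n_remain,
                          n_mem_req, l_avg_mem):
        """One bandwidth-tracking step per memory instruction (Equations 12-21).
        Returns updated (t_nonfull, t_memfinish, n_nonfull) and d_full.
        The caller forms L^n_mem = l_mem + d_full per Equation 8."""
        d_full = 0
        if t_startmem < t_nonfull:            # Scenario 3: bus fully occupied
            d_full = t_nonfull - t_startmem   # Equation 21
            t_startmem = t_nonfull            # then behaves like scenario 2
        if t_startmem > t_memfinish:          # Scenario 1: bus idle
            # Equations 9-11 give l_mem directly; assumed time-stamp update.
            return t_startmem + l_mem, t_startmem + l_mem, n_remain, d_full
        if n_remain > n_nonfull:              # Scenario 2, overflow case
            return (t_startmem + l_mem,                    # Equation 12
                    t_memfinish + l_mem,                   # Equation 13
                    n_mem_req - (n_remain - n_nonfull),    # Equation 14
                    d_full)
        if n_remain > 0:                      # Scenario 2, fits in spare bw
            return (t_memfinish + n_full * l_avg_mem,      # Equation 15
                    t_memfinish + l_mem,                   # Equation 16
                    n_mem_req - n_remain,                  # Equation 17
                    d_full)
        # Scenario 2, N_remain == 0: same shape as the previous access.
        return (t_startmem + n_full * l_avg_mem,           # Equation 18
                t_memfinish + n_full * l_avg_mem,          # Equation 19
                n_nonfull,                                 # Equation 20
                d_full)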

5.2 Control Flow Management During Vulnerable Period Estimation

Before estimating the vulnerable period of the application and scheduling the instructions, the control flow of the application needs to be considered. Control flow is implemented by using loops and branches. We provide the following two observations to handle loops and branches during the vulnerable period estimation. Note that these observations are based on the state-of-the-art GPU architecture and performance models [4] [14].

Observation 1. Assume that the register R holds data from the moment of its production before the loop until the last moment of its consumption in the loop. The vulnerable period of the register R, VP_R, depends on the following:

1) the execution time of one iteration of the loop, l_{loop};
2) the time from the data production to the first data consumption in the loop, l_{pre-loop}.

Proof. During run-time, the number of iterations N changes depending on the data. In other words, two different instruction schedules will have the same number of loop iterations if the input data is the same. The vulnerable period of the register R may be represented by the following equation:

    VP_R = \begin{cases} N \times l_{loop}, & \text{if } l_{pre-loop}/l_{loop} \approx 0 \\ N \times l_{loop} + l_{pre-loop}, & \text{otherwise} \end{cases}    (22)

From Equation 22, we can see that the contribution of l_{pre-loop} to VP_R becomes negligible if one of the following two cases is satisfied: 1) l_{pre-loop}/l_{loop} is approximately zero; or 2) N is sufficiently large. Therefore, the vulnerable period of the loop can be improved during compile-time by minimizing both l_{loop} and l_{pre-loop}. Figure 5(a) shows an example code to demonstrate the proof of this observation for a for-loop.

Observation 2. For a register R that holds data from the moment of its production before a branch instruction until the last moment of its consumption within the branch body, the vulnerable period of R depends on the location of the last consumption of R.

Proof. Unlike a general CPU, the GPU likely executes every branch body to handle control flow divergence, no matter which way the branch is taken. Moreover, for the same input data, two different instruction schedules show the same branch behavior. Therefore, the vulnerable period of R is calculated based on the last place where R is used. Figure 5(b) shows one such example case to demonstrate the proof of this observation.

Fig. 5: Example Kernel Code for For-Loop and If-Else: (a) for-loop, (b) if-else.
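Equation 22 can be transcribed directly; the sketch below does so, with an arbitrary cutoff for "approximately zero" (the paper does not give a numeric threshold, so the 5% value is our assumption).

    def loop_vulnerable_period(n_iter, l_loop, l_pre_loop, eps=0.05):
        """Equation 22: vulnerable period of a register produced before a
        loop and last consumed inside it. eps approximates the
        l_pre_loop / l_loop ~ 0 test; the 5% default is our assumption."""
        if l_pre_loop / l_loop < eps:
            return n_iter * l_loop
        return n_iter * l_loop + l_pre_loop

    # Shrinking l_pre_loop (moving the first consumer toward the producer)
    # reduces the estimate, which is what the scheduler exploits.
    print(loop_vulnerable_period(n_iter=100, l_loop=50, l_pre_loop=20))
    print(loop_vulnerable_period(n_iter=100, l_loop=50, l_pre_loop=2))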


Algorithm 1: Algorithm for Computing Vulnerable Period.

Input: Instruction Flow Graph G, configuration c, # of SMs NSM
Output: Total vulnerable period TotVulnPeriod
1   Function EstimateVulnPeriod(G, c, NSM)
2   begin
3       InitializeGraph();  // Initialize the node and the edge variables
4       TotVulnPeriod <- 0;
5       ProgMissMatchFlag <- 0;
6       G <- DuplicateGraph(G, c, NSM);
7       foreach N in G do
8           Lissue <- GetIssueLatency(N);  // Equation 2
9           SetNodeInfo(Lissue);
10          foreach Eout of N do
11              if Eout != NULL then
12                  if N = MemOp then
13                      nconcurrent <- FindConcurrentAccess(N);  // Get the number of concurrent accesses
14                      LIn <- EstimateMemLatency(N, nconcurrent);  // Equations 4-21
15                  else
16                      LIn <- GetPipeLatency(N);
17                  SetOutgoingEdge(N, Eout, LIn);
18      foreach N in G do
19          Edges <- NULL;
20          foreach E of N do
21              Elongest <- Edges.find(E);
22              if Elongest = NULL then
23                  Edges.add(E);
24              else
25                  SetLongest(Edges, E);
26      TotVulnPeriod <- GetSummation(G, Edges);
27      return TotVulnPeriod;

5.3 Vulnerable Period Estimation

Based on the GPU execution model and the control flow management, we can estimate the vulnerable period of the application. The pseudo-code for the vulnerable period estimation is shown in Algorithm 1. The input to Algorithm 1 includes the instruction flow graph of the kernel code, G, the configuration of the application, c, and the number of SMs, NSM (Line 1). Each node in the graph represents an instruction and each edge in the graph represents a data dependency. Based on the degree of parallelism provided by the configuration c and NSM, Algorithm 1 first updates the timing information in the graph (Line 6). The first loop (Lines 7-17) updates the weights of the edges, which store LIn from a node to its dependent nodes: for each node in the graph, Lissue is estimated (Line 8), and then for each outgoing edge of the node N, the execution latency LIn is determined according to the node type (Lines 10-17). In the second loop (Lines 18-25), the total vulnerable period is explored by searching for and adding the longest edge from each node.

6 GPU ARCHITECTURE AWARE INSTRUCTION SCHEDULING

As mentioned in Section 3.2, our primary goal is to find the instruction schedule that has the minimum vulnerable period. To do that, every possible instruction schedule would need to be tested and verified. However, finding an instruction schedule is known to be an NP-hard problem [41], and the overhead of finding an instruction schedule that minimizes the vulnerable period could be significant. Therefore, in order to find the best instruction schedule while minimizing the compilation overhead, we propose a heuristic instruction scheduling algorithm that schedules the instructions based on the data dependencies.

Fig. 6: Example Code for the Proposed Instruction Scheduling.

6.1 Vulnerable Period Aware Instruction Scheduling

The vulnerable period of a register reaches a minimum if the data is consumed immediately after it is produced. Therefore, the primary objective of our heuristic is to minimize the distance between the producer instruction and the consumer instruction. Since our heuristic places instructions based on the data dependencies, it is important to know the last consumption of the data. Therefore, our heuristic uses a bottom-up strategy and places the instructions starting from the ending point of the application. At the beginning of the instruction scheduling, the last instruction is selected and placed. Afterward, the producer instructions corresponding to the last instruction are examined based on the following three conditions:

1) There is no synchronization instruction between the producer and the consumer instruction. The synchronization instruction is used to prevent race conditions between multiple threads. For example, if the instruction in Line 6 of Figure 6 is moved after the synchronization instruction, then the functional correctness of the application is not guaranteed.
2) There is no other instruction that consumes the same data produced by the producer instruction. If the instruction is moved after another of its consumer instructions, that consumer instruction receives wrong data and the result will be corrupted. For example, in Figure 6, if the instruction in Line 8 is moved after Line 11, then the instruction in Line 11 will use the data in register %r22, which is not updated properly.
3) The loop depth of the producer instruction and the consumer instruction must be identical. If an instruction is moved from the inside of a loop to the outside of the loop, the data will not be properly updated. For example, in Figure 6, if the instruction in Line 0 is moved after Line 4, which is outside of the loop, the value in register %f1 will not be properly updated and the functional correctness is not guaranteed.
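A compact way to read these three conditions is as a legality test that the scheduler applies to every producer-consumer pair. The sketch below is our own restatement in Python, assuming a simple instruction record; it is not the paper's implementation (Algorithm 2 below gives the full pseudo-code).

    from dataclasses import dataclass

    @dataclass
    class Instr:
        """Minimal instruction record assumed for this sketch."""
        is_sync: bool = False
        dest: str = ""       # register written, "" if none
        srcs: tuple = ()     # registers read
        loop_depth: int = 0

    def may_place_near_consumer(code, p, c):
        """True if producer code[p] may be hoisted next to consumer code[c]
        under the three conditions of Section 6.1 (p < c, bottom-up order)."""
        between = code[p + 1 : c]
        # Condition 1: no synchronization instruction in between.
        if any(i.is_sync for i in between):
            return False
        # Condition 2: no other consumer of the produced value in between.
        if any(code[p].dest and code[p].dest in i.srcs for i in between):
            return False
        # Condition 3: producer and consumer at the same loop depth.
        return code[p].loop_depth == code[c].loop_depth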


Fig. 7: Instruction Scheduling Results for the BFS Application from (a) the Default NVCC Compiler, (b) the Performance Aware Instruction Scheduling [17], and (c) Our Proposed Instruction Scheduling.

In order to minimize the vulnerable period of the last instruction, our heuristic schedules the predecessor instruction that satisfies all three conditions. After that, the scheduled predecessor instruction becomes the consumer instruction, and our heuristic repeats the above-mentioned process until all instructions are scheduled.

Algorithm 2 shows the pseudo-code for the proposed instruction scheduling algorithm. At the beginning, the algorithm selects the last instruction and places it at the end of the application code (Lines 10-13). For each predecessor instruction, the algorithm examines the aforementioned three conditions (Lines 15-25). If the predecessor instruction satisfies these three conditions, it is placed near the consumer instruction (Lines 30-34). The overall complexity of the instruction scheduling algorithm is O(n x n x n) = O(n^3) because of the three-level loop hierarchy.

Algorithm 2: Algorithm for Instruction Scheduling.

Input: Instruction Flow Graph Gin
Output: New Instruction Flow Graph Gnew
1   Function BuildFromBottom(Gin)
2   begin
3       Gnew.clear();
4       Pos <- |Gin|;
5       ntgt <- empty; RegSet <- empty;
6       while Gin != empty do
7           CandidateSet <- empty;
8           pass <- true;
9           if Pos = |Gin| then
10              ntgt <- GetSinkNode(); Gnew.push(ntgt);
11              Gin.remove(ntgt);
12              Pos <- Pos - 1;
13          else
14              for idx = 0; idx < Pos; idx++ do
15                  for cnt = idx; cnt < Pos; cnt++ do
16                      if Gin[cnt].is_sync() then
17                          pass <- false;
18                      if pass = true and Gin[cnt].is_consume(Gin[idx].GetDest()) then
19                          pass <- false;
20                      if pass = true and Gin[idx].is_in_loop() then
21                          if Gin[idx].GetLastLoopPos() < Pos then
22                              pass <- false;
23                  if pass = true and RegSet.exist(Gin[idx].GetDest()) then
24                      CandidateSet.push(Gin[idx]);
25              if CandidateSet = empty then
26                  ntgt <- Gin.GetLastNode();
27              else
28                  ntgt <- CandidateSet.front();
29              Gnew.push(ntgt);
30              Gin.remove(ntgt);
31              RegSet.RemoveDestRegs(ntgt);
32              RegSet.AddSrcRegs(ntgt);
33              Pos <- Pos - 1;

6.2 Example of the Proposed Instruction Scheduling

An example of our scheduling process is shown in Figure 6. In the figure, there are four producer-consumer instruction pairs: the instructions in Lines 2 and 5, in Lines 6 and 9, in Lines 8 and 13, and in Lines 10 and 14. The first pair of instructions, in Lines 2 and 5, violates the third condition because they are not in the same loop. The second pair of instructions, in Lines 6 and 9, violates the first condition due to the synchronization instruction in Line 7. The third pair of instructions, in Lines 8 and 13, violates the second condition because there is another consumption of the register %r22 in Line 11. However, the last pair of instructions, in Lines 10 and 14, satisfies all three conditions, and therefore the proposed heuristic is able to place the instruction in Line 10 next to its consumer instruction in Line 14.
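The example generalizes to the short driver loop below, which combines the legality test from the earlier sketch (may_place_near_consumer) with the bottom-up placement order; again this is our illustrative rendering, and Algorithm 2 remains the authoritative pseudo-code.

    def schedule_bottom_up(code):
        """Bottom-up heuristic sketch: repeatedly pick a producer that may
        legally sit next to the most recently placed consumer, else fall
        back to the last remaining instruction (cf. Algorithm 2)."""
        remaining = list(range(len(code)))
        scheduled = [remaining.pop()]      # start from the sink instruction
        while remaining:
            consumer = scheduled[-1]
            # Producers of the current consumer, filtered by the three
            # conditions of Section 6.1.
            candidates = [p for p in remaining
                          if code[p].dest in code[consumer].srcs
                          and may_place_near_consumer(code, p, consumer)]
            pick = candidates[0] if candidates else remaining[-1]
            remaining.remove(pick)
            scheduled.append(pick)
        scheduled.reverse()                # emit in top-down program order
        return scheduled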


6.3 Comparison with Performance Aware Instruction Scheduling

As mentioned in Section 1, the probability that a soft-error occurs on a single hardware component is proportional to the time that the hardware component is in use [48]. Therefore, it may be possible to improve the soft-error reliability through performance improvement, and the instruction scheduling technique in [17] may be used to improve the performance of the applications.

Current state-of-the-art GPUs do not support out-of-order execution, and instructions are held until all their data is ready to use [4] [14]. In order to improve the performance of the GPU during compile-time, the instructions are scheduled to increase the Instruction Level Parallelism (ILP). The increased ILP implies that some of the data may not be consumed right after its creation, so the lifetime of the data is likely increased. Therefore, with the performance aware instruction scheduling, the execution time of a GPU application may be decreased while the total vulnerable period may be increased. Moreover, the instruction scheduling in [17] does not change the instruction schedule within a basic block, so there is a possibility that the soft-error reliability of the basic block is not improved through the performance aware instruction scheduling.

Figure 7 shows the instruction scheduling results for the Breadth First Search (BFS) application. Figures 7(a), 7(b), and 7(c) represent the instruction scheduling results from the default NVCC compiler, the performance aware instruction scheduling [17], and our proposed instruction scheduling, respectively. In Figure 7, three registers, %rd3, %r8, and %r9, are highlighted to show the difference between the three instruction schedules. For these three registers, the difference between the default instruction scheduling and our instruction scheduling is the lifetime of the register %r9: our instruction scheduling puts fewer instructions between the creation and the last consumption of the data in register %r9. In contrast, in Figure 7(b), the performance aware instruction scheduling puts the largest number of instructions within the lifetime of the register %r9. For the registers %rd3 and %r8, both the default and our instruction scheduling put the same number of instructions within their lifetimes, while the performance aware instruction scheduling again puts a larger number of instructions within the lifetimes of %rd3 and %r8.

The examples in Figure 7 show that the performance aware instruction scheduling may have the largest vulnerable period even though the execution time of the application is minimal. In Section 7, Figure 14 shows that the performance aware instruction scheduling achieves the highest performance; however, Figure 11 shows that it cannot maximize the soft-error reliability of a GPU.

7 RESULTS AND EVALUATION

7.1 Experimental Setup and Fault Injection Tool

In order to evaluate our instruction scheduling methodology, we selected nine benchmark applications because of their intensive usage in GPU applications for image processing and scientific computing. The applications are selected from the Rodinia benchmark suite [5], the GPGPU-Sim benchmarks [4], and the CUDA examples. The selected benchmark applications are Backprop (Bp), BFS (Bfs), Srad, Kmeans (Km), Matrix Multiplication (Mat), Hotspot (Hs), BoxFilter (Box), ConvolutionSeparable (Conv), and Mandelbrot (Man). During the experiments, each benchmark application is executed 25 times with various fault injection rates (10, 50, and 100 faults/1 Million (M) cycles). In addition, we have implemented a clock-cycle-level fault injection tool, which is integrated with the GPGPU-Sim [4] simulator.

Fig. 8: Experimental Setup for Fault Injection Flow.

Figure 8 shows the overview of our fault injection tool. In our fault injection tool, each kernel is executed twice with GPGPU-Sim: the list of effective faults is created during the first execution, and the actual data is changed based on the effective fault list during the second execution. Since the injected faults do not all cause a bit-flip, it is essential to know which fault actually causes a bit-flip. During the first execution, based on the given fault rate, our fault injection tool periodically and randomly generates the list of injected faults, which includes the fault injection time (clock cycles) and the faulty components. The possible faulty components are shown in Figure 9, which shows an abstracted block diagram of the NVIDIA Fermi architecture used for our experiments. Among these components, based on the area information from GPGPU-Sim, our fault injection tool selects the following components to inject soft-errors: Register File, Load/Store Unit, Integer ALU, Floating Point ALU, and Special Function Units.

Fig. 9: A High-Level Block Diagram of the GPU Architecture Used for Our Experiments.

After that, on each clock cycle, the fault injection tool tries to find a match in the list of injected faults based on the following conditions:

- The fault injection time matches the current clock cycle number.
- The component in the injected fault list is in use.

If there is a match, the fault is considered an effective fault, which means that this fault causes a bit-flip. The fault injection tool adds the fault to the effective faults log with the following information: the clock cycle number (injection time), the SM number, the thread id, the instruction string, and the faulty component.

At the beginning of the second execution, the fault injection tool reads the effective faults log and obtains the information for the faults.
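The first-pass matching rule can be written down directly; the sketch below models the per-cycle check and the bit-flip applied in the second pass. The record layout and the 32-bit result width are assumptions for illustration, not details of the GPGPU-Sim integration.

    import random

    def find_effective_faults(injected, usage):
        """First pass: a fault becomes effective when its injection cycle
        arrives while its target component is in use.
        injected: list of (cycle, component);
        usage: dict mapping cycle -> set of components in use."""
        return [(cycle, comp) for cycle, comp in injected
                if comp in usage.get(cycle, set())]

    def apply_bit_flip(result, width=32):
        """Second pass: flip one randomly chosen bit of an instruction
        result that used the faulty component (assumed word width)."""
        return result ^ (1 << random.randrange(width))

    usage = {100: {"INT_ALU"}, 101: {"REG_FILE", "LDST"}}
    injected = [(100, "INT_ALU"), (100, "FP_ALU"), (101, "LDST")]
    print(find_effective_faults(injected, usage))  # FP_ALU fault is dropped
    print(apply_bit_flip(0x000000FF))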


selected bit from the result of the instruction that uses the
faulty component. After the second execution, the entire
simulation results are compared to the correct results, which
are generated without any fault injection. The output is clas-
sified into the following categories: correct output, incorrect
output, and application crash.

7.2 Experimental Results


The algorithm execution time of our instruction scheduling
is shown in Table 1. An Intel CoreT M i7 Quad-core processor
at 3.5 GHz is used to measure the algorithm execution time.
On average, our instruction scheduling algorithm requires Fig. 10: Vulnerable Period Improvement Compared to [10],
8.13 seconds to generate the reliable instruction schedule [23] and [17].
(standard deviation is 5.356 seconds). The proposed instruc- BFS application, which has a small performance overhead.
tion scheduling methodology adds two steps on top of On the other hand, our methodology has more effective
vendor provided compilation process: 1) PTX code extrac- faults for the Mandelbrot application. However, since our
tion; and 2) execution of instruction scheduling algorithm. methodology has large performance overhead, the actual
After that, vendor provided compiler takes the optimized effect of the effective faults may be smaller compared to
PTX code and generates the binary. For these two steps, other methodologies. The smaller amount of effect for the
extraction of PTX code has negligible overhead. Therefore, effective faults is shown in Figure 11 and Figure 10. Figure
the execution time of the proposed instruction algorithm 10 shows that our methodology has smaller amount of nor-
contributes most to the compilation overhead. malized vulnerable period and Figure 11 supports the result
Application            Our algorithm exec. time (sec)
Backprop               13.215
BFS                    0.94
Srad                   4.41
Kmeans                 2.355
Matrix                 13.91
Hotspot                13.91
BoxFilter              3.24
convolutionSeparable   9.69
Mandelbrot             11.53
Avg.                   8.13
Standard Dev.          5.356

TABLE 1: Our Algorithm Execution Time.

Fig. 10: Vulnerable Period Improvement Compared to [10], [23] and [17].

Figure 10 shows the vulnerable period improvement of our instruction scheduling compared to the reliability enhancement methodology proposed in [10], the array bounds checker in [23], and the performance aware instruction scheduling in [17]. All results are normalized to the vulnerable period obtained with the default NVCC compiler (with the -O3 option). The results show that our instruction scheduling achieves the smallest vulnerable period for all applications except Hotspot. This is because our algorithm could not modify the instruction schedule due to the frequent use of synchronization instructions in the Hotspot application's kernel function (Section 6.1). As mentioned in Section 6.3, some of the applications achieve similar vulnerable period improvements through performance improvement alone. However, as shown in Figures 7 and 10, performance improvement cannot guarantee the minimum vulnerable period.
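
The Hotspot result follows from how barriers constrain any PTX-level scheduler: instructions cannot legally migrate across a synchronization point, so a kernel with frequent barriers leaves few reorderable regions. The sketch below illustrates only this constraint, not the proposed algorithm itself, whose cost model is described in Section 6:

# Sketch: barriers partition a kernel into regions, and instructions may be
# reordered only within a region, never across a barrier.
def barrier_regions(instructions, is_barrier=lambda i: i.startswith("bar.sync")):
    regions, current = [], []
    for inst in instructions:
        if is_barrier(inst):
            if current:
                regions.append(current)   # close the reorderable region
            regions.append([inst])        # the barrier itself is a fence
            current = []
        else:
            current.append(inst)
    if current:
        regions.append(current)
    return regions

# A Hotspot-like kernel with a barrier after almost every instruction group
# yields many tiny regions, leaving the scheduler essentially no freedom.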
Table 2 and Figure 11 show the list of effective faults during the entire execution time and the soft-error reliability improvement compared to the state-of-the-art soft-error reliability improvement methodologies [10] [23] and the performance aware instruction scheduling [17], respectively. The results in Table 2 are related to the results in Figures 11, 10 and 13. Since the number of effective faults is proportional to the execution time, in order to properly compare the numbers of effective faults, we also need to consider the execution time of each application. For example, our methodology shows a smaller number of effective faults for the BFS application, which has a small performance overhead. On the other hand, our methodology has more effective faults for the Mandelbrot application; however, since our methodology has a large performance overhead for this application, the actual effect of these effective faults may be smaller than in the other methodologies. This smaller effect of the effective faults is shown in Figure 11 and Figure 10: Figure 10 shows that our methodology has a smaller normalized vulnerable period, and Figure 11 supports the result in Figure 10 with the soft-error reliability improvement.

Figure 11 shows the comparison based on the actual output of the benchmark applications. The results show that our methodology can further improve the soft-error reliability by 23% on average compared to the other two compilation methodologies (up to 64%, with a standard deviation of 19%). In other words, with our methodology, the failure probability is decreased by 23% on average (up to 64%). Compared to the performance aware instruction scheduling, our methodology shows 12% improved soft-error reliability on average (up to 52%, with a standard deviation of 14%).
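
Because the injector fires at a fixed rate per cycle, raw counts of effective faults scale with runtime and are only comparable as rates. A minimal sketch of this normalization, with illustrative numbers rather than measured data from Table 2:

# Raw effective-fault counts grow with execution time, so compare rates.
def fault_rate(effective_faults, exec_cycles):
    return effective_faults / exec_cycles     # effective faults per cycle

# More raw faults combined with a proportionally longer runtime can still
# mean lower per-cycle exposure (illustrative numbers only):
print(fault_rate(186, 2_000_000) < fault_rate(122, 1_000_000))   # True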
Fig. 11: Soft-error Reliability Improvement Compared to [10], [23] and Performance Driven Instruction Scheduling [17]. Each Application is Executed 25 Times with Different Fault Injection Rates.

The major reason for the improvement is the decreased vulnerable period compared to the other methodologies. The methodologies presented in [10] and [23] protect particular parts of an application with additional code. However, in general, soft-errors occur randomly throughout the application, and these methods may provide worse results by forcing the use of additional protection functions; the additional functions increase the overall vulnerable period of the application.

The performance aware instruction scheduling may improve the soft-error reliability by improving the performance and thereby decreasing the vulnerable period. However, as shown in Section 6.3, the total vulnerable period is not minimized through performance aware instruction scheduling. Therefore, the methodologies in [10], [23], and [17] may achieve a smaller soft-error reliability improvement than our instruction scheduling does.

From Figure 11, we observe that the simulation result for the Hotspot application with 50 faults injected does not match the vulnerable period estimation result from Figure 10: our methodology should show a similar failure rate, but it shows an improved failure rate. This is an effect of the random fault injection. Since we randomly inject faults into random components, the randomly generated faults may cause extreme behavior and generate abnormal results. In order to verify and show this effect of the extreme behavior of random fault injection, we performed an additional 25 simulations of the Hotspot application with 50 fault injections and plotted the ratio of correct output over all 50 simulations (Figure 12). From Figure 12, we observe that our methodology has not converged in terms of the ratio of correct output after 25 simulations. However, after 50 simulations, all the methodologies, including ours, have converged. In addition, the results show that our methodology has a failure rate similar to the original application, which matches the vulnerable period estimation (Figure 10).

Fig. 12: Ratio of Correct Output for Hotspot Application with 50 Faults Injection.
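
The convergence argument above is the usual Monte-Carlo one: the ratio of correct outputs is a sample mean that stabilizes as runs accumulate. A small sketch of such a check; the helper names and thresholds are illustrative assumptions, not part of our tool:

# Sketch: convergence check for the ratio of correct outputs over runs.
def running_ratio(outcomes):
    """Ratio of 'correct' outcomes after each additional simulation."""
    correct, ratios = 0, []
    for n, outcome in enumerate(outcomes, start=1):
        correct += (outcome == "correct")
        ratios.append(correct / n)
    return ratios

def converged(ratios, window=10, tol=0.02):
    """Stable when the ratio varies by less than tol over the last window runs."""
    tail = ratios[-window:]
    return max(tail) - min(tail) < tol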
We observe that there is no correct output from the Matrix multiplication and BoxFilter applications when the fault injection rate is 100 faults/1M cycles. This is because the behavior of the Matrix multiplication and BoxFilter applications is sensitive to soft-errors. The failure rate of an application depends on two things: the timing behavior (the vulnerable period) and the functional behavior (masking effects, propagation, etc.). An application's default soft-error reliability may be decided by its functional behavior. For example, let us assume two different applications that have the same vulnerable period: an application with a multiplication instruction (MUL) and an application with an addition instruction (ADD). Although they have the same vulnerable period, MUL may have a higher failure rate because the multiplication operation likely propagates the soft-error effect and produces incorrect results. In Figure 11, the results from the Matrix multiplication and BoxFilter applications show examples of the above mentioned case. Both applications have a kernel function that consists of multiple loops with multiplication operations. Thus, the matrix multiplication and box filter applications may have higher soft-error sensitivities than other applications. Therefore, with a fault injection rate of 100 faults/1M cycles, the failure of the Matrix multiplication and BoxFilter applications is an expected outcome.
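
The propagation intuition behind the MUL/ADD comparison can be checked with a few lines of arithmetic: flipping one operand bit changes only a handful of sum bits, but is smeared across many product bits. This is a stand-alone illustration of the masking argument, not part of the proposed tool; the operand values are arbitrary:

# A single-bit operand fault usually perturbs few result bits under ADD but
# many under MUL, so MUL-heavy kernels propagate soft-errors more aggressively.
def perturbed_bits(op, a, b, bit):
    faulty_a = a ^ (1 << bit)                 # inject one bit-flip into a
    return bin(op(a, b) ^ op(faulty_a, b)).count("1")

a, b, bit = 0x12345678, 0x0F0F0F0F, 3
print(perturbed_bits(lambda x, y: x + y, a, b, bit))   # few bits differ
print(perturbed_bits(lambda x, y: x * y, a, b, bit))   # many bits differ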


Application Faulty 10 Faults / 1M Clock Cycle 50 Faults / 1M Clock Cycle 100 Faults / 1M Clock Cycle
Component Orig. Our [10] [23] [17] Orig. Our [10] [23] [17] Orig. Our [10] [23] [17]
Bp REGISTER FILE 14 27 13 35 12 66 186 26 122 61 115 403 106 495 129
LDSTR UNIT 5 7 3 20 7 13 32 9 25 25 27 51 30 66 37
INT ALU 6 16 8 23 19 10 17 3 42 7 14 49 16 138 19
FLOAT ALU 19 19 18 18 5 55 85 46 60 100 103 235 149 142 131
SFU ALU 16 17 9 11 6 34 38 24 29 44 56 106 43 73 64
Bfs REGISTER FILE 43 26 27 68 12 158 149 61 158 101 268 266 249 596 189
LDSTR UNIT 6 1 7 7 2 4 6 12 3 5 5 8 5 8 5
INT ALU 61 0 0 4 31 4 4 0 2 227 4 4 6 10 391
FLOAT ALU 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
SFU ALU 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0
Srad REGISTER FILE 5 0 1 6 1 19 3 14 7 7 48 17 32 56 18
LDSTR UNIT 1 0 0 1 1 8 5 4 4 6 9 5 22 8 0
INT ALU 6 3 0 2 10 9 5 0 3 40 7 11 13 6 62
FLOAT ALU 3 0 4 0 3 8 7 5 3 7 16 7 29 13 7
SFU ALU 2 1 0 0 0 1 2 0 1 0 1 4 7 3 0
Km REGISTER FILE 3 9 23 6 0 30 36 30 38 33 109 92 53 195 76
LDSTR UNIT 3 1 3 10 2 14 7 4 10 11 38 13 26 23 19
INT ALU 46 0 7 0 32 7 4 2 7 216 11 14 5 21 437
FLOAT ALU 24 4 13 12 11 51 31 34 37 58 115 93 54 88 113
SFU ALU 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Mat REGISTER FILE 25 23 39 27 16 96 56 53 58 56 199 134 293 287 116
LDSTR UNIT 15 6 25 14 10 55 24 32 35 37 120 46 92 160 65
INT ALU 2 0 4 6 64 13 10 9 13 306 377 21 26 51 557
FLOAT ALU 9 8 5 7 3 24 30 15 24 48 86 76 28 62 72
SFU ALU 0 0 0 0 0 0 2 4 0 0 0 0 12 0 0
Hs REGISTER FILE 6 5 14 21 5 19 19 34 41 35 66 87 189 228 73
LDSTR UNIT 2 2 5 6 1 2 0 5 11 8 9 28 33 33 11
INT ALU 16 20 46 31 67 29 25 78 48 310 87 110 257 248 585
FLOAT ALU 7 6 16 18 3 9 12 32 27 18 68 35 146 118 42
SFU ALU 3 0 3 0 2 5 0 8 6 1 8 8 24 23 9
Box REGISTER FILE 29 31 68 80 67 248 178 259 308 274 443 477 472 457 434
LDSTR UNIT 0 0 0 2 0 0 0 0 0 0 3 2 0 1 2
INT ALU 41 25 53 35 37 242 89 219 238 212 538 325 469 424 409
FLOAT ALU 46 30 43 28 50 177 195 195 216 195 426 425 434 380 338
SFU ALU 18 16 37 31 23 111 67 140 159 122 255 185 219 271 229
Conv REGISTER FILE 45 48 55 206 47 260 220 245 246 137 325 391 481 546 491
LDSTR UNIT 32 36 24 25 35 176 145 64 152 71 313 269 26 269 364
INT ALU 7 7 6 42 4 20 30 28 79 5 22 33 26 146 20
FLOAT ALU 97 89 71 47 126 416 301 563 419 197 662 543 785 636 854
SFU ALU 0 0 0 4 0 5 0 0 2 0 7 1 0 2 2
Man REGISTER FILE 247 240 250 181 226 978 1086 852 967 1082 1860 2396 1599 1444 1614
LDSTR UNIT 3 2 1 0 2 1 11 1 2 16 3 13 10 5 6
INT ALU 0 7 0 1 1 5 10 7 14 2 11 42 15 20 6
FLOAT ALU 367 1432 371 404 361 2086 6466 1903 2216 1974 4084 13469 3633 2370 2928
SFU ALU 183 606 186 180 216 939 3130 861 1002 928 1747 6440 1685 1076 1400
Avg. REGISTER FILE 46.33 45.44 54.44 70.00 42.89 208.22 214.78 174.89 216.11 198.44 381.44 473.67 386.00 478.22 348.89
LDSTR UNIT 7.44 6.11 7.56 9.44 6.67 30.33 25.56 14.56 26.89 19.89 58.56 48.33 27.11 63.67 56.56
INT ALU 13.89 8.67 13.78 16.00 26.11 37.67 21.56 38.44 49.56 147.22 119.00 67.67 92.56 118.22 276.22
FLOAT ALU 63.56 176.44 60.11 59.33 62.44 314.00 791.89 310.33 333.56 288.56 617.78 1,653.67 584.22 423.22 498.33
SFU ALU 24.67 71.11 26.11 25.11 27.44 121.67 359.89 115.22 133.22 121.67 230.56 749.33 221.11 160.89 189.33

TABLE 2: The List of Effective Faults During the 25 Runs of the Soft-error Reliability Experiments. Each Number Indicates How Many Effective Faults Occurred on Each Component During the Experiments.
In Figure 11, for the Srad application, we can observe that the performance driven instruction scheduling shows better soft-error reliability than ours. This is one example of how soft-error reliability can be achieved through performance improvement: since the execution time of Srad is very short, the performance improvement may reduce the time during which a soft-error can occur.

Figure 13 and Figure 14 show the performance and power overheads. The performance and power overhead results are normalized with respect to the original application's performance and power consumption, and in both figures the original application's values are indicated by red lines. As mentioned, since our methodology sacrifices performance to improve the soft-error reliability, it consumes less power than the other methodologies. Note that [23] has even lower power consumption than our methodology because of the idle time caused by its additional memory operations. The performance overhead of our methodology is 125% on average, and the power overhead is -7% on average. This negative power overhead is an expected result: there is no change in the GPU hardware, and our methodology sacrifices performance to improve the soft-error reliability; since the application takes more time to execute, the lower average power consumption follows.
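
The negative power overhead follows directly from how average power is measured: with the hardware unchanged, runtime grows faster than energy, so average power drops. A worked sketch with illustrative numbers chosen to reproduce the reported averages (not our measured data):

# Average power is energy over time; a slower schedule on unchanged hardware
# can therefore show negative power overhead. Illustrative numbers only.
T_orig, E_orig = 1.00, 100.00        # baseline runtime (s) and energy (J)
T_ours, E_ours = 2.25, 209.25        # 125% longer run, slightly more energy

P_orig = E_orig / T_orig             # 100.0 W
P_ours = E_ours / T_ours             # 93.0 W

perf_overhead  = (T_ours - T_orig) / T_orig    # +125%
power_overhead = (P_ours - P_orig) / P_orig    # -7%
print(f"{perf_overhead:+.0%} runtime, {power_overhead:+.0%} average power")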
From Figure 13, we can observe that our methodology has a considerable performance overhead (125%). However, without the Mandelbrot application, the performance overhead of our methodology is only 47%, which is less than that of the other two reliability methodologies: the performance overheads of [10] and [23] are 53% and 92%, respectively. The major reasons for this lower performance overhead are that 1) our methodology does not add any code to the application; and 2) our methodology generates an instruction schedule and configuration that minimize the vulnerable period. Srad and Matrix are examples where the proposed methodology even improves the performance.


However, our methodology shows a significant performance overhead for the Mandelbrot application because of the scheduling of memory access instructions and the change of the application's configuration to minimize the effect of the effective faults. These results show that the outcome of the soft-error reliability improvement is not always a performance or power consumption improvement; our methodology may sacrifice performance to improve the soft-error reliability if necessary. In other words, the trade-off for the soft-error improvement is very application specific.

In summary, the experimental results imply that, as shown in Figure 7, the soft-error reliability is related to the detailed timing behavior of an application, and performance improvement does not guarantee an improvement in soft-error reliability.

Fig. 13: Performance Overheads Compared to [10], [23] and [17].

Fig. 14: Average Power Consumption Overheads Compared to [10], [23], and [17].

8 CONCLUSION

In this paper, we have proposed a GPU architecture-aware compilation methodology in order to improve the soft-error reliability of GPU-based systems. Within this methodology, we have developed a novel instruction scheduling algorithm to minimize the vulnerable period and improve the soft-error reliability of an application. Based on the analysis of the state-of-the-art GPU architecture, we model the behavior of an application to estimate its soft-error vulnerability and generate the best instruction schedule and configuration. In addition, we have developed a fine-grained fault injection tool that is integrated with a state-of-the-art cycle-level GPU simulator to evaluate the proposed methodology. We have also developed theorems and proofs to handle an application's control flows during the vulnerable period estimation. The experimental results show that our methodology generates the instruction schedule within 8.13 seconds on average. Through our methodology, the soft-error reliability is improved by 23% and 12% (up to 64% and 52%) compared to the state-of-the-art soft-error reliability improvement methodologies and the performance aware instruction scheduling, respectively. Compared to the state-of-the-art methodologies, our methodology has similar performance and power overheads in most cases while improving the soft-error reliability. In addition, the experimental results show that the soft-error reliability of a GPU is not related to the performance, but instead to the fine-grained timing behavior of an application.

REFERENCES
[1] N. Avirneni and A. Somani. "Low Overhead Soft Error Mitigation Techniques for High-Performance and Aggressive Designs". IEEE Transactions on Computers, 61(4):488–501, 2012.
[2] S. Baeg, S. Wen, and R. Wong. "SRAM Interleaving Distance Selection With a Soft Error Failure Model". IEEE Transactions on Nuclear Science, 56(4):2111–2118, 2009.
[3] S. Baghsorkhi, M. Delahaye, S. Patel, W. Gropp, and W. Hwu. "An Adaptive Performance Modeling Tool for GPU Architectures". Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP'10), pages 105–114, 2010.
[4] A. Bakhoda, G. Yuan, W. Fung, H. Wong, and T. Aamodt. "Analyzing CUDA workloads using a detailed GPU simulator". IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS'09), pages 163–174, 2009.
[5] S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, S.-H. Lee, and K. Skadron. "Rodinia: A benchmark suite for heterogeneous computing". Proceedings of the 2009 IEEE International Symposium on Workload Characterization (IISWC'09), pages 44–54, 2009.
[6] J. Cong and C. Yu. "Impact of loop transformations on software reliability". IEEE/ACM International Conference on Computer-Aided Design (ICCAD'15), pages 278–285, 2015.
[7] A. El-Maleh and K. Daud. "Method for synthesizing soft error tolerant combinational circuits", 2014. US Patent 8,640,063.
[8] B. Fang, K. Pattabiraman, M. Ripeanu, and S. Gurumurthi. "GPU-Qin: A methodology for evaluating the error resilience of GPGPU applications". IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS'14), pages 221–230, 2014.
[9] B. Fang, J. Wei, K. Pattabiraman, and M. Ripeanu. "Evaluating Error Resiliency of GPGPU Applications". High Performance Computing, Networking, Storage and Analysis (SCC'13), pages 1502–1503, 2012.
[10] B. Fang, J. Wei, K. Pattabiraman, and M. Ripeanu. "Towards Building Error Resilient GPGPU Applications". 3rd Workshop on Resilient Architecture (WRA'12), 2012.
[11] L. B. Gomez, F. Cappello, L. Carro, N. Debardeleben, B. Fang, S. Gurumurthi, K. Pattabiraman, P. Rech, and M. S. Reorda. "GPGPUs: How to Combine High Computational Power with High Reliability". Design, Automation and Test in Europe Conference and Exhibition (DATE'14), pages 1–9, 2014.
[12] I. Haque and V. Pande. "Hard Data on Soft Errors: A Large-Scale Assessment of Real-World Error Rates in GPGPU". 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing (CCGrid'10), pages 691–696, 2010.
[13] M. Harris. "How to Implement Performance Metrics in CUDA C/C++". https://devblogs.nvidia.com/parallelforall/how-implement-performance-metrics-cuda-cc, 2012.
[14] S. Hong and H. Kim. "An Integrated GPU Power and Performance Model". Proceedings of the 37th Annual International Symposium on Computer Architecture (ISCA'10), pages 280–289, 2010.
[15] H. Huang and C.-P. Wen. "Layout-Based Soft Error Rate Estimation Framework Considering Multiple Transient Faults - From Device to Circuit Level". IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD'15), pages 1–1, 2015.
[16] S.-i. Abe, R. Ogata, and Y. Watanabe. "Impact of Nuclear Reaction Models on Neutron-Induced Soft Error Rate Analysis". IEEE Transactions on Nuclear Science, pages 1806–1812, 2014.
[17] J. Jablin, T. Jablin, O. Mutlu, and M. Herlihy. "Warp-aware Trace Scheduling for GPUs". Proceedings of the 23rd International Conference on Parallel Architectures and Compilation (PACT'14), pages 163–174, 2014.
[18] H. Jeon and M. Annavaram. "Warped-DMR: Light-weight Error Detection for GPGPU". Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-45), pages 37–47, 2012.
[19] H. Jeon, M. Wilkening, V. Sridharan, S. Gurumurthi, and G. Loh. "Architectural Vulnerability Modeling and Analysis of Integrated Graphics Processors". Workshop on Silicon Errors in Logic-System Effects (SELSE'13), pages 1–6, 2013.
[20] S. Kiamehr, T. Osiecki, M. Tahoori, and S. Nassif. "Radiation-Induced Soft Error Analysis of SRAMs in SOI FinFET Technology: A Device to Circuit Approach". Proceedings of the 51st Annual Design Automation Conference (DAC'14), 2014.
[21] H. Lee and M. A. Al Faruque. "GPU-EvR: Run-Time Event Based Real-Time Scheduling Framework on GPGPU Platform". Design, Automation and Test in Europe Conference and Exhibition (DATE'14), pages 1–6, 2014.
[22] D. Li, J. Vetter, and W. Yu. "Classifying Soft Error Vulnerabilities in Extreme-scale Scientific Applications Using a Binary Instrumentation Tool". Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC'12), pages 1–11, 2012.
[23] S. Li, N. Farooqui, and S. Yalamanchili. "Software Reliability Enhancements for GPU Applications". Sixth Workshop on Programmability Issues for Heterogeneous Multicores (MULTIPROG'13), 2013.
[24] A. Maghazeh, U. D. Bordoloi, A. Horga, P. Eles, and Z. Peng. "Saving Energy Without Defying Deadlines on Mobile GPU-Based Heterogeneous Systems". International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS'14), pages 1–10, 2014.
[25] N. Maruyama, A. Nukada, and S. Matsuoka. "A High-Performance Fault-Tolerant Software Framework for Memory on Commodity GPUs". IEEE International Symposium on Parallel & Distributed Processing (IPDPS'10), pages 1–12, 2010.
[26] N. Miskov-Zivanov and D. Marculescu. "MARS-C: Modeling and Reduction of Soft Errors in Combinational Circuits". Proceedings of the 43rd Annual Design Automation Conference (DAC'06), pages 767–772, 2006.
[27] S. S. Mukherjee, C. Weaver, J. Emer, S. K. Reinhardt, and T. Austin. "A Systematic Methodology to Compute the Architectural Vulnerability Factors for a High-Performance Microprocessor". Proceedings of the 36th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'03), pages 29–40, 2003.
[28] NVIDIA. "NVIDIA's next generation CUDA compute architecture: Fermi". 2009.
[29] NVIDIA. "NVIDIA's next generation CUDA compute architecture: Kepler GK110". 2012.
[30] D. Oliveira, C. Lunardi, L. Pilla, P. Rech, P. Navaux, and L. Carro. "Radiation Sensitivity of High Performance Computing Applications on Kepler-Based GPGPUs". 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN'14), pages 732–737, 2014.
[31] D. Palframan, N. Kim, and M. Lipasti. "Precision-aware soft error protection for GPUs". IEEE 20th International Symposium on High Performance Computer Architecture (HPCA'14), pages 49–59, 2014.
[32] R. B. Parizi, R. Ferreira, L. Carro, and Á. Moreira. "Compiler Optimizations Do Impact the Reliability of Control-Flow Radiation Hardened Embedded Software". Embedded Systems: Design, Analysis and Verification: 4th IFIP TC 10 International Embedded Systems Symposium (IESS'13), pages 49–60, 2013.
[33] P. Rech, L. Pilla, P. Navaux, and L. Carro. "Impact of GPUs Parallelism Management on Safety-Critical and HPC Applications Reliability". 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN'14), pages 455–466, 2014.
[34] S. Rehman, M. Shafique, and J. Henkel. "Instruction scheduling for reliability-aware compilation". 49th ACM/EDAC/IEEE Design Automation Conference (DAC'12), pages 1288–1296, 2012.
[35] S. Rehman, M. Shafique, F. Kriebel, and J. Henkel. "Reliable Software for Unreliable Hardware: Embedded Code Generation Aiming at Reliability". Proceedings of the Seventh IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS'11), pages 237–246, 2011.
[36] S. Rehman, M. Shafique, F. Kriebel, and J. Henkel. "RAISE: Reliability-Aware Instruction SchEduling for unreliable hardware". 17th Asia and South Pacific Design Automation Conference (ASP-DAC'12), pages 671–676, 2012.
[37] D. Sabena, L. Sterpone, L. Carro, and P. Rech. "Reliability Evaluation of Embedded GPGPUs for Safety Critical Applications". IEEE Transactions on Nuclear Science, pages 23–29, 2014.
[38] G. Sadowski. "Design Challenges Facing CPU-GPU-Accelerator Integrated Heterogeneous Systems". Design Automation Conference (DAC'14), 2014.
[39] J. W. Sheaffer, D. P. Luebke, and K. Skadron. "A Hardware Redundancy and Recovery Mechanism for Reliable Scientific Computation on Graphics Processors". SIGGRAPH/Eurographics Workshop on Graphics Hardware, pages 55–64, 2007.
[40] G. Shi, J. Enos, M. Showerman, and V. Kindratenko. "On Testing GPU Memory for Hard and Soft Errors". Proc. Symposium on Application Accelerators in High-Performance Computing (SAAHPC'09), pages 1–3, 2009.
[41] G. Shobaki, K. Wilken, and M. Heffernan. "Optimal Trace Scheduling Using Enumeration". ACM Transactions on Architecture and Code Optimization (TACO'09), pages 1–32, 2009.
[42] J. Tan and X. Fu. "RISE: Improving the Streaming Processors Reliability Against Soft Errors in GPGPUs". Proceedings of the 21st International Conference on Parallel Architectures and Compilation Techniques (PACT'12), pages 191–200, 2012.
[43] J. Tan, Z. Li, and X. Fu. "Soft-error reliability and power co-optimization for GPGPUs register file using resistive memory". Design, Automation and Test in Europe Conference and Exhibition (DATE'15), pages 369–374, 2015.
[44] D. Tiwari, S. Gupta, J. Rogers, D. Maxwell, P. Rech, S. Vazhkudai, D. Oliveira, D. Londo, N. DeBardeleben, P. Navaux, L. Carro, and A. Bland. "Understanding GPU errors on large-scale HPC systems and the implications for system design and operation". IEEE 21st International Symposium on High Performance Computer Architecture (HPCA'15), pages 331–342, 2015.
[45] J. Wadden, A. Lyashevsky, S. Gurumurthi, V. Sridharan, and K. Skadron. "Real-world Design and Evaluation of Compiler-managed GPU Redundant Multithreading". Proceedings of the 41st Annual International Symposium on Computer Architecture (ISCA'14), pages 73–84, 2014.
[46] K. Wu and D. Marculescu. "A Low-Cost, Systematic Methodology for Soft Error Robustness of Logic Circuits". IEEE Transactions on Very Large Scale Integration (VLSI) Systems, pages 367–379, 2013.
[47] R. Xiaoguang, X. Xinhai, W. Qian, C. Juan, W. Miao, and Y. Xuejun. "GS-DMR: Low-overhead soft error detection scheme for stencil-based computation". Parallel Computing, pages 50–65, 2015.
[48] J. Yan and W. Zhang. "Compiler-guided Register Reliability Improvement Against Soft Errors". Proceedings of the 5th ACM International Conference on Embedded Software (EMSOFT'05), pages 203–209, 2005.

Haeseung Lee received the Master's degree in Electrical Engineering from Arizona State University in 2013. Currently, he is pursuing his Ph.D. degree in the Department of Electrical Engineering and Computer Science at the University of California, Irvine. His current research interests include real-time systems, multi-core systems, image processing, networks-on-chip, and reliability.

Mohammad Al Faruque is currently with the University of California, Irvine (UCI), where he is a tenure-track assistant professor directing the Cyber-Physical Systems Lab. Prof. Al Faruque is the recipient of the IEEE CEDA Ernest S. Kuh Early Career Award 2016. He served as an Emulex Career Development Chair from October 2012 until July 2015. Before that, he was with Siemens Corporate Research and Technology in Princeton, NJ. His current research is focused on the system-level design of embedded systems and Cyber-Physical Systems (CPS), with special interest in multi-core systems, real-time scheduling algorithms, dependable systems, etc. Dr. Al Faruque received the Thomas Alva Edison Patent Award 2016 from the Edison Foundation, the 2016 DATE Best Paper Award, the 2015 DAC Best Paper Award, the 2009 IEEE/ACM William J. McCalla ICCAD Best Paper Award, the 2016 NDSS Distinguished Poster Award, the 2015 Hellman Fellow Award, the 2012 DATE Best IP Award Nomination, the 2008 HiPEAC Paper Award, and the 2005 DAC Best Paper Award Nomination. Besides 55+ IEEE/ACM publications in premier journals and conferences, Dr. Al Faruque holds 6 US patents. Dr. Al Faruque is an IEEE Senior Member.