[Figure 2: diagram of SSD internals, showing channels, planes, blocks, and flash pages marked free, valid, or invalid.]
Figure 2: SSD Internals (Section 3).

[Figure 3: three panels over the SSD's planes and channels: (a) Controller blocking, (b) Channel blocking, (c) Plane blocking.]
Figure 3: Various levels of GC blocking. Colored I/Os in bright planes are servable while non-colored I/Os in dark planes are blocked. (a) In controller-blocking (Section 3), a GC blocks the controller/entire SSD. (b) In channel-blocking (Section 3), a GC blocks the channel connected to the GC-ing plane. (c) In plane-blocking (Section 4.1), a GC only blocks the GC-ing plane.
GC operation (4 main steps): When the used-page count increases above a certain threshold (e.g., 70%), a GC will start. A possible GC operation reads valid pages from an old block, writes them to a free block, and erases the old block, all within the same plane. Figure 2 shows two copybacks in a GC-ing plane (two valid pages being copied to a free block). Most importantly, with 4-KB register support in every plane, page copybacks happen within the GC-ing plane without using the channel [11, 18].

The controller then performs the following for-loop of four steps for every page copyback: (1) send a flash-to-register read command through the channel (only 0.2 µs) to the GC-ing plane, (2) wait until the plane executes the 1-page read command (40 µs, without using the channel), (3) send a register-to-flash write command, and (4) wait until the plane executes the 1-page write command (800 µs, without using the channel). Steps 1-4 are repeated until all valid pages are copied, and then the old block is erased. The key point here is that the copyback operations (steps 2 and 4; roughly 840 µs) are done internally within the GC-ing plane without crossing the channel.
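To make that division of labor concrete, the following minimal simulation (our sketch, not firmware code; all names are ours, and we assume step 3's command transfer costs about the same 0.2 µs as step 1 and that the block erase takes 2 ms) tallies how little of the copyback loop actually occupies the channel:

    #include <stdio.h>

    #define PAGES_PER_BLOCK 256

    static double channel_us = 0.0, plane_us = 0.0;

    static void channel_cmd(double us) { channel_us += us; } /* crosses the channel */
    static void plane_exec(double us)  { plane_us   += us; } /* inside the plane    */

    int main(void) {
        int nvalid = 0;
        for (int i = 0; i < PAGES_PER_BLOCK; i++) {
            if (i % 2) continue;    /* assume half the pages are still valid   */
            channel_cmd(0.2);       /* (1) flash-to-register read command      */
            plane_exec(40.0);       /* (2) 1-page read; channel stays idle     */
            channel_cmd(0.2);       /* (3) register-to-flash write command     */
            plane_exec(800.0);      /* (4) 1-page write; channel stays idle    */
            nvalid++;
        }
        plane_exec(2000.0);         /* erase the old block (assumed 2 ms)      */
        printf("%d copybacks: channel busy %.1f us, plane busy %.1f us\n",
               nvalid, channel_us, plane_us);
        return 0;
    }

Running this for a half-valid 256-page block reports the channel busy for only about 51 µs versus roughly 110 ms of plane-internal work, which is exactly the property that plane-blocking GC (Section 4.1) exploits.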
GC Blocking: GC blocking occurs when some resources (e.g., controller, channel, planes) are used by a GC activity, which will delay subsequent requests, similar to head-of-line blocking. Blocking designs are used because they are simple and cheap (small gate counts). But because GC latencies are long, blocking designs can produce significant tail latencies.

One simple approach to implement GC is with a blocking controller. That is, even when only one plane is performing GC, the controller is busy communicating with the GC-ing plane and unable to serve outstanding I/Os that are designated to any other planes. We refer to this as controller-blocking GC, as illustrated in Figure 3a. Here, a single GC (the striped plane) blocks the controller, thus technically all channels and planes are blocked (the bold lines and dark planes). All outstanding I/Os cannot be served (represented by the non-colored I/Os). OpenSSD [4], VSSIM [50], and low-cost systems such as eMMC devices adopt this implementation.

Another approach is to support multi-threading/multiple CPUs with channel queueing. Here, while a thread/CPU is communicating with a GC-ing plane (in a for-loop) and blocking the plane's channel (e.g., the bold line in Figure 3b), other threads/CPUs can serve other I/Os designated to other channels (the colored I/Os in bright planes). We refer to this as channel-blocking GC (i.e., a GC blocks the channel of the GC-ing plane). SSDSim [32] and disksim+SSD [18] adopt this implementation. Commodity SSDs do not come with layout specifications, but from our experiments (Section 2), we suspect some form of channel-blocking (at least in client SSDs) exists.
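The three designs can be summarized by a single admission check that differs only in granularity (our illustration; the enum, struct, and function are hypothetical, not taken from any of the systems cited above):

    #include <stdbool.h>

    /* Which outstanding I/O is servable while one plane is GC-ing?
     * (Illustrative only; names are ours.) */
    enum gc_design { CONTROLLER_BLOCKING, CHANNEL_BLOCKING, PLANE_BLOCKING };

    struct io { int channel; int plane; };

    bool servable(enum gc_design d, struct io req, int gc_channel, int gc_plane)
    {
        switch (d) {
        case CONTROLLER_BLOCKING:  /* Figure 3a: the whole SSD stalls         */
            return false;
        case CHANNEL_BLOCKING:     /* Figure 3b: only other channels proceed  */
            return req.channel != gc_channel;
        case PLANE_BLOCKING:       /* Figure 3c: only the GC-ing plane stalls */
            return req.channel != gc_channel || req.plane != gc_plane;
        }
        return false;
    }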
Figure 1 also implicitly shows how blocked I/Os create cascading queueing delays. Imagine the Outstanding I/Os represent a full device queue (e.g., typically 32 I/Os). When this happens, the host OS cannot submit more I/Os, hence user I/Os are blocked in the OS queues. We show this impact in our evaluation.
4 TTFLASH Design

We now present the design of TTFLASH, a new SSD architecture that achieves guaranteed performance close to a no-GC scenario. We are able to remove GC blocking at all levels with the following four key strategies:

1. Devise a non-blocking controller and channel protocol, pushing any resource blocking from a GC to only the affected planes. We call this fine-grained architecture plane-blocking GC (Section 4.1).

2. Exploit RAIN parity-based redundancy (Section 4.2) and combine it with GC information to proactively regenerate reads blocked by GC at the plane level, which we name GC-tolerant read (Section 4.3).

3. Schedule GC in a rotating fashion to enforce at most one GC in every plane group, such that no read will see more than one GC; one parity can only cut one tail. We name this rotating GC (Section 4.4). A sketch combining strategies 2 and 3 follows this list.

4. Use a capacitor-backed write buffer to deliver fast, durable completion of writes, allowing them to be evicted to flash pages at a later time in a GC-tolerant manner. We name this GC-tolerant flush (Section 4.5).
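The following self-contained sketch (ours; the names and the stripe layout are illustrative assumptions, not the paper's code) shows the essence of strategies 2 and 3: because rotating GC admits at most one GC-ing plane per plane group, one XOR parity page is always sufficient to regenerate a read blocked by GC:

    #include <stdint.h>
    #include <string.h>

    #define GROUP_SIZE 8     /* N = 8: 7 data pages + 1 parity page per stripe */
    #define PAGE_SIZE  4096  /* matches the 4-KB flash pages described above   */

    /* GC-tolerant read: regenerate the page stored on the (single) GC-ing
     * plane by XOR-ing the other GROUP_SIZE-1 pages of the stripe. */
    void gc_tolerant_read(uint8_t stripe[GROUP_SIZE][PAGE_SIZE],
                          int gc_plane, uint8_t out[PAGE_SIZE])
    {
        memset(out, 0, PAGE_SIZE);
        for (int p = 0; p < GROUP_SIZE; p++) {
            if (p == gc_plane)
                continue;                /* skip the blocked plane */
            for (int b = 0; b < PAGE_SIZE; b++)
                out[b] ^= stripe[p][b];  /* XOR reconstruction     */
        }
    }

    /* Rotating GC: planes in a group take turns, so at most one plane per
     * group is ever GC-ing and one parity always suffices to cut the tail. */
    int next_gc_plane(int current)
    {
        return (current + 1) % GROUP_SIZE;
    }

Plane-blocking GC (strategy 1) is what guarantees the other N-1 planes remain readable; rotating GC (strategy 3) guarantees that gc_plane is unique within the group.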
OpenSSD also has platform limitations: constrained data transfer from flash RAM to host DRAM (preventing parity regeneration), no support for wall-clock time (preventing GC time prediction), and inaccessible request queues and absence of GC queues (OpenSSD is whole-blocking). We would like to reiterate that these are not hardware limitations, but rather the ramifications of the elegant simplicity of the OpenSSD programming model (which is its main goal). Nevertheless, our conversations with hardware architects suggest that TTFLASH is implementable on real firmware (e.g., roughly a 1-year development and testing project on an FPGA-based platform).

Finally, we also investigated the LightNVM (Open-Channel SSD) QEMU test platform [16]. LightNVM [21] is an in-kernel framework that manages Open-Channel SSDs (which expose individual flash channels to the host, akin to Software-Defined Flash [44]). Currently, neither Open-Channel SSD nor LightNVM's QEMU test platform supports an intra-SSD copy-page command. Without such support, and since GC is managed by the host OS, GC-ed pages must cross back and forth between the device and the host. This creates heavy background-vs-foreground I/O transfer contention between GC and user I/Os. For example, the user's maximum of 50K IOPS can degrade to 3K IOPS while GC is happening. We leave this integration for future work, after the intra-SSD copy-page command is supported.
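To make the contention concrete, compare the two ways a GC can relocate one valid page (a sketch under our own assumptions; the dev_* functions are hypothetical stand-ins, not LightNVM's actual API):

    #include <stdint.h>

    /* Hypothetical driver hooks (declarations only, for illustration). */
    void dev_read_page(int src_ppa, uint8_t *buf);        /* device -> host  */
    void dev_write_page(int dst_ppa, const uint8_t *buf); /* host -> device  */
    void dev_copy_page(int src_ppa, int dst_ppa);         /* stays on device */

    /* Host-managed GC without an intra-SSD copy-page command: every valid
     * page crosses the host interface twice, competing with user I/O. */
    void gc_move_host_managed(int src_ppa, int dst_ppa)
    {
        uint8_t buf[4096];
        dev_read_page(src_ppa, buf);
        dev_write_page(dst_ppa, buf);
    }

    /* With an intra-SSD copy-page command, the data never leaves the SSD. */
    void gc_move_intra_ssd(int src_ppa, int dst_ppa)
    {
        dev_copy_page(src_ppa, dst_ppa);
    }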
[Figure 5: CDF panels per workload, including Exchange Server (Exch), Live Maps Server (LMBE), MSN File Server (MSNFS), and TPC-C; x-axis: Latency (ms), 0-80; y-axis: .95-1; lines: Base, +PB, GTR+PB, and (All) RGC+GTR+PB.]
Figure 5: Tail Latencies. The figures show the CDF of read latencies (x = 0-80 ms) in different workloads as we add each TTFLASH strategy: +PB (Section 4.1), +GTR (Section 4.3), and +RGC (Section 4.4). The y-axis shows the 95-100th percentiles.
[Figure 6: bars of average read latency (µs, 0-1500) per workload (DAPPS, DTRS, Exch, LMBE, MSNFS, TPCC).]
Figure 6: Average Latencies. The figure compares the average read latencies of Base, TTFLASH, and NoGC scenarios from the same experiments in Figure 5.

[Figure 7: stacked bars of % of blocked reads (0-8) per workload (DAPPS, DTRS, Exch, LMBE, MSNFS, TPCC).]
Figure 7: %GC-blocked read I/Os. The figure corresponds to the results in Figure 5. The bars represent the ratio (in percent) of read I/Os that are GC-blocked (bottom bar) and queue-blocked (top bar), as explained in Section 6.1.

[Figure: % of blocked reads across filebench workloads (File Server, Network FS, OLTP, Varmail, Video Server, Web Proxy) for Base, +PB, +GTR, and +RGC.]

[Figure 9 panels, including (c) DTRS (Re-rated) and (d) Synthetic Workload; lines: NoGC, ttFlash, Preempt; x-axes: Latency (ms).]
Figure 9: TTFLASH vs. Preemptive GC. The figures are explained in Section 6.2.
Even with preemption, a read behind an ongoing GC must still wait for the block erase (up to 2 ms) or the current page copyback to finish (up to 800 µs of delay), mainly because the finest-grained preemption unit is a page copyback (Section 3). TTFLASH, on the other hand, can rapidly regenerate the delayed data.

Most importantly, TTFLASH does not postpone GC indefinitely. In contrast, preemptive GC piles up GC impact into the future, with the hope that there will be idle time. However, with a continuous I/O stream, at some point the SSD will hit a GC high water mark (not enough free pages), which is when preemptive GC becomes non-preemptive [40]. To create this scenario, we run the same workload but make the SSDSim GC threshold hit the high water mark. Figure 9b shows that as preemptive GC becomes non-preemptive, it becomes GC-intolerant and creates long tail latencies.
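The high-water-mark fallback just described reduces to a single predicate (our sketch; the names are ours, and the 75% occupancy threshold is the one we set in the experiment below, not a published constant):

    #include <stdbool.h>

    /* Semi-preemptive GC may yield to user I/O between page copybacks only
     * while page occupancy stays below the high water mark; beyond it, GC
     * runs to completion and reads block again (Figure 9b). */
    #define HIGH_WATER_OCCUPANCY 0.75

    bool gc_may_yield(double page_occupancy)
    {
        return page_occupancy < HIGH_WATER_OCCUPANCY;
    }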
To be more realistic with the setup, we perform an experiment similar to the one in the Semi-Preemptive GC paper [40, Section IV]. We re-rate DTRS I/Os by 10x and re-size them by 30x, in order to reach the high GC water mark (which we set to 75% to speed up the experiment). Figure 9c shows the timeline of observed latencies with TTFLASH and preemptive GC. We also run a synthetic workload with continuous I/Os to prevent idle time (Figure 9d); the workload generates 28-KB I/Os (full-stripe) every 130 µs, with 70% reads and 30% writes. Overall, Figures 9c-d highlight that preemptive GC creates backlogs of GC activities, which will eventually cause SSD "lock-down" when page occupancy reaches the high water mark. On the other hand, TTFLASH can provide stable latencies without postponing GC activities indefinitely.

The last two experiments above create a high intensity of writes, and within the same experiments, our GC-tolerant flush (GTF; Section 4.5) provides stable latencies, as implicitly shown in Figures 9c-d.
6.3 TTFLASH vs. GC Optimizations

GC can be optimized and reduced with better FTL management, special coding, novel write-buffer schemes, or an SSD-based log-structured file system. For example, in comparison to base approaches, value locality reduces erase counts by 65% [29, Section 5], flash-aware RAID by 40% [33, Figure 20], BPLRU by 41% [35, Section 4 and Figure 7], eSAP by 10-45% [37, Figures 11-12], F2FS by 10% [39, Section 3], LARS by 50% [41, Figure 4], FRA by 10% [42, Figure 12], SFS by 7.5x [43, Section 4], and WOM codes by 33% [48, Section 6].

Contrary to these efforts, our approach is fundamentally different. We do not focus on reducing the number of GCs; instead, we eliminate the blocking nature of GC operations. With reduction, even if the GC count is reduced by multiple times, it only makes GC-induced tail latencies shorter, not disappear (e.g., as in Figure 5). Nevertheless, the techniques above are crucial in extending SSD lifetime, and hence orthogonal to TTFLASH.

6.4 Write (P/E Cycle) Overhead

Figure 10 compares the number of GCs (P/E cycles) completed by the Base approach and TTFLASH within the experiments in Figure 5. We make two observations. First, TTFLASH does not delay GCs; it actively performs GCs at a similar rate as the base approach, yet still delivers predictable performance. Second, TTFLASH introduces 15-18% additional P/E cycles (in 4 out of 6 workloads), which mainly come from RAIN; as we use N = 8, there are roughly 15% (1/7) more writes.
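As a quick sanity check on that overhead (our arithmetic, not an additional measurement): with a stripe width of N = 8, every 7 data pages are accompanied by 1 parity page, so parity adds 1/7, or about 14.3% (roughly 15%), extra page writes, consistent with the lower end of the observed 15-18% range.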
[Figure 10: bars of GC completions per workload (DAPPS, DTRS, Exch, LMBE, MSNFS, TPCC) for Base and TTFLASH; the tallest bars reach 49k and 57k.]
Figure 10: GC Completions (P/E Cycles). The figure is explained in Section 6.4.

[Figure: latency CDF (y = .70-1, x = 0-80 ms) and timeline (x = 0-100 s) comparing RGC at 6 DWPD, RGC at 7 DWPD, and F-RGC at 7 DWPD.]