
Using GPFS to Manage NVRAM-Based Storage Cache

Salem El Sayed¹, Stephan Graf¹, Michael Hennecke², Dirk Pleiter¹, Georg Schwarz¹, Heiko Schick³, and Michael Stephan¹

¹ JSC, Forschungszentrum Jülich, 52425 Jülich, Germany
² IBM Deutschland GmbH, 40474 Düsseldorf, Germany
³ IBM Deutschland Research & Development GmbH, 71032 Böblingen, Germany
Abstract. I/O performance of large-scale HPC systems grows at a significantly slower rate than compute performance. In this article we investigate architectural options and technologies for a tiered storage system to mitigate this problem. Using GPFS and flash memory cards, a prototype is implemented and evaluated. We compare performance numbers obtained by running synthetic benchmarks on a petascale Blue Gene/Q system connected to our prototype. Based on these results an assessment of the architecture and technology is performed.
1 Introduction
For very large high-performance computing (HPC) systems it has become a major challenge to maintain a reasonable balance between compute performance and the performance of the I/O sub-system. In practice, this gap is growing and systems are moving away from Amdahl's rule of thumb for a balanced performance ratio, namely one bit of I/O per second for each instruction per second (see [1] for an updated version). Bandwidth is only one metric which describes the capability of an I/O sub-system. Additionally, capacity and access rates, i.e. the number of I/O requests per second which can be served, have to be taken into account. While today's technology makes it possible to build high-capacity storage systems, reaching high access rates is even more challenging than improving the bandwidth.

These technology trends are increasingly in conflict with application demands in computational sciences, in particular as these are becoming more I/O intensive [2]. With traditional technologies these trends are not expected to reverse. Therefore, the design of the I/O sub-system will be a key issue which needs to be addressed for future exascale architectures.
Today, storage systems attached to HPC facilities typically comprise an aggregation of spinning disks. These are hosted in external file servers which are connected via a commodity network to the compute system. This design has the advantage that it allows a huge number of disks to be integrated and ensures the availability of multiple data paths for resilience.

The extensive use of disks is largely driven by costs. Disk technology has improved dramatically over a long period of time in terms of capacity versus cost.
[Figure 1: two panels plotting bandwidth [GB/s] versus time [hours]; left: read and write on a single I/O node, right: read and write aggregated over all I/O nodes.]

Fig. 1. Read and write bandwidth averaged over 120 s as a function of time. The left pane shows the measurements on a single Blue Gene/P I/O node while the right pane shows the results from all 600 I/O nodes of the JUGENE system.
Bandwidth per disk and access rates, however, increase at a much slower rate (see, e.g., [3]) than compute performance. High bandwidth can thus only be achieved by increasing the number of disks within a single I/O sub-system, with the load being suitably distributed. For HPC systems the number of disks has meanwhile started to be determined mainly by bandwidth rather than by capacity requirements. Scaling the number of disks is, however, limited by the cost and power budget as well as by the exponentially increasing risk of failures and data corruption. Using today's disk technology to meet exascale bandwidth requirements of 60 TByte/s [4] would exceed the currently mandated power budget of 20 MW, whereas satisfying exascale bandwidth with flash memory and exascale capacity with a disk tier is possible at an affordable power consumption.
To meet future demands it is therefore necessary to consider other non-volatile storage technologies and to explore new designs for the I/O sub-system architecture. Promising opportunities arise from storage devices based on flash memory. Compared to disk technologies they feature order(s) of magnitude higher bandwidth and access rates. The main disadvantage is the poor capacity-to-cost ratio, but device capacity is slowly increasing.
In this article we explore an architecture where we integrate flash memory devices into IBM's General Parallel File System (GPFS) to implement an intermediate layer between compute nodes and disk-based file servers. GPFS's Information Lifecycle Management (ILM) [5] is used to manage the tiered storage such that this additional complexity is hidden from the user. The key feature of GPFS which we exploit is the option to define groups for different kinds of storage devices, known as GPFS storage pools. Furthermore, the GPFS policy engine is used to manage the available storage and handle data movement between different storage pools.

This approach is motivated by the observation that applications running on HPC systems as they are operated at Jülich Supercomputing Centre (JSC) tend to produce bursts of I/O operations (such behaviour has also been reported elsewhere in the literature, see, e.g., [6]). In Fig. 1 we show the read and write
bandwidth measured on JSC's petascale Blue Gene/P facility JUGENE. On each I/O node the bandwidth has been determined from the GPFS counters, which have been retrieved every 120 s.
If we assume that an intermediate storage layer is available which provides a much higher I/O bandwidth to the compute nodes (but an unchanged bandwidth towards the external storage), it is possible to reduce the time needed for I/O. Alternatively, it is also possible to lower the total power consumption by providing the original bandwidth through flash and reducing the bandwidth of the disk storage. This requires mechanisms to stage data kept in the external storage system before it is read by the application, as well as to cache data generated by the application before it is written to the external storage. All I/O operations related to staging are supposed to be executed asynchronously with respect to the application. The fast intermediate storage layer can also be used for out-of-core computations where data is temporarily moved out of main memory. Another use case is checkpointing. In both cases we assume the intermediate storage layer to be large enough to hold all data, such that migration of the data to the external storage is avoided.
The key contributions of this paper are:
1. We designed and implemented a tiered storage system where non-volatile flash memory is used to realize a fast cache layer and where resources and data transfer are managed by GPFS ILM features.
2. We provide results for synthetic benchmarks which were executed on a petascale Blue Gene/Q system connected to the small-scale prototype storage cluster. We compare results obtained using different flash memory cards.
3. Finally, we perform a performance and usability analysis for such a tiered storage architecture. We use I/O statistics collected on a Blue Gene/P system as input for a simple model of using the system as a fast write cache.
2 Related Work
In recent years a number of papers have been published which investigate the use of fast storage technologies for staging I/O data, as well as different software architectures to manage such a tiered storage architecture.

In [7] part of the nodes' volatile main memory is used to create a temporary parallel file system. The performance of their RAMDISK nominally increases at the same rate as the number of nodes used by the application is increased. The authors implemented their own mechanisms to stage the data before a job starts and after it completes. Data staging is controlled by a scheduler which is coupled to the system's resource manager (here SLURM).

The DataStager architecture [8, 9] uses the local volatile memory to keep the buffers needed to implement asynchronous write operations. No file system is used to manage the buffer space; instead a concept of data objects is introduced which are created under application control. After being notified, DataStager
processes running on separate nodes manage the data transport, i.e. data transport is server directed and data is pulled whenever sufficient resources are available.

To overcome the power and cost constraints of DRAM-based staging, the use of NVRAM is advocated in [10]. Here a future node design scenario is evaluated comprising Active NVRAM, i.e. NVRAM plus a low-power compute element. The latter allows for asynchronous processing of data chunks committed by the application. Final data may be flushed to external disk storage. Data transport and resource management is thus largely controlled by the application.
3 Background
NAND flash memory belongs to the class of non-volatile memory technologies. Here we only consider Single-Level Cell (SLC) NAND flash, which features the highest number of write cycles. SLC flash chips currently have an endurance of up to O(100,000) erase/write cycles. At the device level the problem of failing memory chips is significantly mitigated by wear-leveling and RAID mechanisms. A large number of write cycles is critical when using flash memory devices for HPC systems. The advantages of flash memory devices with respect to standard disks are a significantly higher bandwidth and orders of magnitude higher I/O operation rates due to significantly lower access latencies, at much lower power consumption.
GPFS is a flexible, scalable parallel file system architecture which can distribute both data and metadata. The smallest building block, a Network Shared Disk (NSD), which can be any storage device, is dedicated to data only, metadata only, or both. In this work we exploit several features of GPFS. First, we make use of storage pools, which allow storage elements to be grouped. Pools are traditionally used to handle different types or usage modes of disks (and tapes). In the context of tiered storage this feature can be used to group the flash storage into one pool and the slower disk storage into another. Second, a policy engine provides the means to let GPFS manage the placement of new files and the staging of files. This engine is controlled by a simple set of rules.
4 GPFS-Based Staging
In this paper we investigate a setup where flash memory cards are integrated into a persistent GPFS instance. GPFS features are used (1) to steer initial file placement, (2) to manage migration from flash to external, disk-based storage, and (3) to stage files from external storage to flash memory. The user therefore continues to access a standard file system.

We organise the external disk storage and the flash memory into a disk and a flash pool, respectively. Then we use the GPFS policy engine to manage automatic data staging. In Fig. 2 we show our policy rules with the parameters defined in Table 1. These rules control when the GPFS events listed in Table 1 are thrown; the numerical values should be set according to the operational characteristics of an HPC system's workload.
RULE SET POOL flash LIMIT(fmax)
RULE SET POOL disk
RULE MIGRATE
FROM POOL flash
THRESHOLD(fstart,fstop)
WEIGHT(CURRENT_TIMESTAMP-ACCESS_TIME)
TO POOL disk
Fig. 2. The GPFS policy rules for the target file system define the placement and migration rules for the files' data blocks
Files are created in the storage pool flash as long as its filling does not exceed the threshold fmax. When the filling exceeds fmax, the second rule applies, which allows files to be created in the pool disk. This fall-back mechanism is foreseen to avoid writes failing when the pool flash is full but there is still free space in the storage pool disk.
The third rule manages the migration of data from the pool flash to disk. If the filling of the pool flash exceeds the threshold fstart, the event lowDiskSpace is thrown and automatic data migration is initiated. Migration stops once the filling of the pool flash drops below the limit fstop (where fstop < fstart). For GPFS to decide which files to migrate first, a weight factor is assigned to each file, which we have chosen such that the least recently accessed files are migrated first. Migration is controlled by a callback function which is bound to the events lowDiskSpace and noDiskSpace. Whenever these events occur, the installed policy rules for the corresponding file system are re-evaluated. The GPFS parameter noSpaceEventInterval defines at which intervals the events lowDiskSpace and noDiskSpace may occur (it defaults to 120 seconds).
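As an illustration of how such a callback can be wired up, the sketch below registers a policy re-evaluation for the lowDiskSpace and noDiskSpace events from a small administration script. It is only a sketch: the callback name, the mmapplypolicy path and the option strings are assumptions modelled on the GPFS 3.5 command set, not the exact configuration of our prototype.

```python
import subprocess

def register_migration_callback():
    """Hypothetical helper: bind an mmapplypolicy run to the low/no disk
    space events so that the migration rule of Fig. 2 is applied
    automatically. All names and options are assumptions."""
    subprocess.run(
        [
            "mmaddcallback", "MIGRATE_FLASH_POOL",
            "--command", "/usr/lpp/mmfs/bin/mmapplypolicy",
            "--event", "lowDiskSpace,noDiskSpace",
            # %fsName is replaced by GPFS with the file system that raised
            # the event; --single-instance avoids overlapping policy runs.
            "--parms", "%fsName --single-instance",
        ],
        check=True,
    )

if __name__ == "__main__":
    register_migration_callback()
```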
Finally, to steer the staging of files to flash memory, we propose the implementation of a prefetch mechanism which is triggered either by the user application or by the system's resource manager (like in [7]). Technically this can be realized by using the command mmchattr -P flash <filename>. To avoid this file being automatically moved back to the pool disk, the last access time has to be updated, too. A possible way to achieve this is to create a library which offers a function like prefetch(<filename>). This library also has to ensure that only one rank of the application is in charge.
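A minimal sketch of such a prefetch helper is shown below, assuming a Python wrapper around mmchattr; the function name prefetch and its behaviour follow the proposal above, but this is our illustration, not code from the prototype. Depending on the GPFS version, an additional option may be required to migrate the data blocks immediately rather than only changing the pool assignment.

```python
import os
import subprocess
import time

def prefetch(filename, pool="flash"):
    """Illustrative prefetch helper (assumption, not the prototype code):
    assign a file to the flash pool and refresh its access time so that
    the least-recently-accessed migration rule does not move it back
    right away. Only one rank of a parallel job should call this."""
    # Reassign the file to the flash storage pool.
    subprocess.run(["mmchattr", "-P", pool, filename], check=True)
    # Refresh the access time only, keeping the modification time intact.
    st = os.stat(filename)
    os.utime(filename, (time.time(), st.st_mtime))

# Usage (e.g. from rank 0, well before the file is opened for reading):
# prefetch("/gpfs/work/input/dataset.h5")
```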
Table 1. Policy rule parameters and the values used in this paper

  Parameter  Description                                         Value  Event
  fmax       Maximal filling (in percent)                        90     noDiskSpace
  fstart     Filling limit which starts migration (in percent)   30     lowDiskSpace
  fstop      Filling limit which stops migration (in percent)     5     --
The optimal choice of the parameters fstart, fstop and fmax depends on how the system is used. For larger values of fmax the probability of files being opened for writing in the slow disk pool is reduced. Large values for fstart and fstop may cause migration to start too late or finish too early and thus increase the risk of the pool flash becoming full. On the other hand, small values would lead to early migration, which could have a bad effect on read performance in case the application tries to access the data for reading after migration.
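To make the trade-off concrete, the small helper below estimates how much burst data the flash pool can absorb before the fall-back rule places files in the disk pool, and how much data one migration pass frees. It is our own back-of-the-envelope illustration using the threshold values of Table 1; the pool size is taken from the prototype configuration described later (4 RamSan-70 cards of 450 GByte each).

```python
def flash_headroom_gb(pool_size_gb, filling_pct, f_max=90):
    """Data volume (GByte) a write burst can place in the flash pool
    before the fall-back rule starts creating files in the disk pool."""
    return pool_size_gb * max(f_max - filling_pct, 0) / 100.0

def migration_volume_gb(pool_size_gb, f_start=30, f_stop=5):
    """Data volume (GByte) one migration pass moves to the disk pool
    once the lowDiskSpace event has been raised."""
    return pool_size_gb * (f_start - f_stop) / 100.0

# Example: 4 x 450 GByte RamSan-70 cards, pool filled up to fstart.
print(flash_headroom_gb(1800, filling_pct=30))  # 1080 GByte of burst headroom
print(migration_volume_gb(1800))                # 450 GByte freed per pass
```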
5 Test Setup
5.1 Prototype Configuration

Our prototype I/O system JUNIORS (JUlich Novel IO Research System) consists of IBM x3650 M3 servers with two hex-core Intel Xeon X5650 CPUs (running at 2.67 GHz) and 48 GBytes of DDR3 main memory. For the tests reported here we use up to 6 nodes. Each node is equipped with 2 flash memory cards, either from Fusion-io or from Texas Memory Systems (TMS). The most important performance parameters of these devices, as reported by the corresponding vendor, are collected in Table 2. Note that these numbers only give an indication of the achievable performance. Each node is equipped with 2 dual-port 10-GbE adapters. The number of ports has been chosen such that the node's nominal bandwidth to the flash memory and to the network is roughly balanced. Channel bonding is applied to reduce the number of logical network interfaces per node.

The Ethernet network connects the prototype I/O system to the petascale Blue Gene/Q system JUQUEEN and the petascale disk storage system JUST. To use the massively parallel compute system for generating load, we mount our experimental GPFS file system on the Blue Gene/Q I/O nodes. For an overview of the prototype system and the Ethernet interconnect see Fig. 3 (left pane).

On each node we installed the operating system RHEL 6.1 (64 bit) and GPFS version 3.5.0.7. For the Fusion-io ioDrive Duo we used the driver/utility software version 3.1.5 including firmware v7.0.0 revision 107322. For the TMS RamSan-70 cards we used the driver/utility software version 3.6.0.11.
To monitor the data flow within the system we designed a simple monitoring infrastructure consisting of different sensors, as shown in Fig. 3 (right pane). The tools netstat and iostat are used to sense the data flow between node and network as well as through the Linux block layer, respectively. Vendor-specific tools are used to sense the amount of data written to (and read from) the flash memory devices.
Table 2. Manufacturer hardware specification of the flash memory cards (for the RamSan-70 we report the performance numbers for 4 kBytes block size)

                                  Fusion-io ioDrive Duo SLC   TMS RamSan-70
  Capacity                        320 GByte                   450 GByte
  Read/write bandwidth [GByte/s]  1.5 / 1.5                   1.25 / 0.9
  Read/write IOPS                 261,000 / 262,000           300,000 / 220,000
[Figure 3: left pane, network schematic connecting the JUNIORS nodes (juniors1-3, 2x10GE bonded links per node) to JUQUEEN (2x10GE bonded per I/O node) and to the JUST storage system (4x10GE per node, ~60 GB/s); right pane, monitoring sensors along the data path: network driver (netstat), block layer (iostat), flash driver (vendor-specific tools), GPFS (mmpmon).]

Fig. 3. The left pane shows a schematic view of the I/O prototype system and the network through which it is connected to a compute system as well as a storage system. The right pane illustrates the data flow monitoring infrastructure.
GPFS statistics are collected using the mmpmon facility.
5.2 Benchmark Definitions

To test the performance we used three different benchmarks. For network bandwidth measurements we used the micro-benchmark nsdperf, a tool which is part of GPFS. It allows one to define a set of servers and clients and then generate the network traffic which real write and read operations would cause, without actually performing any disk I/O.

For sequential I/O bandwidth measurements we used the SIONlib [11] benchmark tool partest, which generates a load similar to what one would expect from HPC applications. These often perform task-local I/O, where each job rank creates its own file. If the number of ranks is large, then the time needed to handle file system metadata becomes large. This is a software effect which is not mitigated by using faster storage devices like flash memory cards. SIONlib is a parallel I/O library that addresses this problem by transparently mapping a large number of task-local files onto a small number of physical files. To benchmark our setup we run this test repeatedly using 32,768 processes, distributed over 512 Blue Gene/Q nodes. During each iteration 2 TBytes of data are written in total.

To measure bandwidth as well as I/O operation (IOP) rates we furthermore use the synthetic benchmark tool IOR [12] to generate a parallel I/O workload. It is highly configurable and supports various interfaces and access patterns and thus allows mimicking the I/O workload of different applications. We used 512 Blue Gene/Q nodes with one task per node, which all performed random I/O operations on task-local files opened via the POSIX interface, with the flag O_DIRECT set to minimize caching effects. In total 128 GBytes are written and read using transfers of size 4 kBytes.
[Figure 4: two panels, bandwidth [GB/s] versus time [s]; left: Fusion-io monitoring, right: TMS RamSan monitoring; each panel shows network received/sent, iostat read/write, and flash card read/write.]

Fig. 4. Read and write bandwidth as a function of time for the SIONlib benchmark using flash memory cards from Fusion-io (left) or TMS (right). Both figures show the I/O monitoring of one JUNIORS node (2 flash cards).
6 Evaluation
We start the performance evaluation by testing the network bandwidth between the JUQUEEN system, where the load is generated, and the prototype I/O system JUNIORS using nsdperf. For these tests we used 16 Blue Gene/Q I/O nodes as clients and 2 JUNIORS nodes as servers. We found a bandwidth of 8.1 GByte/s for emulated writes and 9.7 GByte/s for reads. Comparing these results with the nominal parameters of the flash cards listed in Table 2 indicates that our network provides sufficient bandwidth to balance the bandwidth of 2 flash cards per node.

For the following benchmarks each JUNIORS server was configured as an NSD server within a GPFS file system which was mounted by the JUQUEEN I/O nodes. Using 16 clients we measured the I/O bandwidth for sequential access using partest. The observed read bandwidth is 12.5 GByte/s using 4 JUNIORS servers and 8 Fusion-io flash cards, slightly more than we would expect from the vendor's specifications. However, a significantly different behaviour is observed for writing, where we observe a performance drop of more than 40%, from 6.5 to 3.7 GByte/s, after a short period of writing. To investigate the cause of this behaviour we analyse the data flow information from our monitoring system. The sensor values collected on 1 out of 4 nodes are plotted in Fig. 4 (left pane). The benchmark first performs a write and then a read cycle. We first notice that the amount of data passing the network device is consistent with the amount of data transferred over the operating system's block device layer. At the beginning of the initial write cycle the amount of data received via the network agrees with the amount of data written to the flash card. However, after 67 s of writing we observe a drop in the bandwidth of received data, while at the same time the flash card reports read operations. We observe that the amount of data passing the block device layer towards the processor remains zero. Furthermore, the amount of data written to the flash card agrees with the amount of data received via the network plus the amount of data read from the flash cards. We therefore conclude
that the drop in write performance is caused by read and write operations from and to the flash card initiated by the flash card driver.

Let us now consider the performance obtained when using the flash cards from TMS. Here we used 2 JUNIORS servers with 2 flash cards each. We observe a read (write) bandwidth of 5.7 (3.2) GByte/s. This corresponds to 114% (90%) of the read (write) bandwidth one would naively expect from the vendor's performance specification. As can be seen from Fig. 4, the write bandwidth is sustained during the whole test.
Next, we evaluated the I/O operation (IOPS) rate which we can obtain on our prototype for a random access pattern generated by IOR. We again used the setup with 2 JUNIORS servers and 4 TMS cards. For comparison we repeated the benchmark run using the large-capacity storage system JUST, where about 5,000 disks are aggregated in a single GPFS file system. In Table 3 the mean IOPS values for read and write are listed. We observe the IOPS rate on our prototype I/O system to be significantly higher than on the standard, disk-based storage system. For a fair comparison it has to be taken into account that the storage system JUST has been optimized for streaming I/O, with disks organized into RAID6 arrays with large stripes of 4 MBytes. Furthermore, the prototype and the compute system are slightly closer in terms of network hops. The results on the prototype are an order of magnitude smaller than one would naively expect from the vendor's specification (see Table 2). We could only reproduce those numbers when performing local raw flash device access without a file system.
In the final step we defined two storage pools in the GPFS file system. For the pool flash we used 4 TMS cards in 2 JUNIORS nodes. The pool disk was implemented with 6 Fusion-io cards in 3 other servers, because there were no disks in the JUNIORS cluster. These pools were all configured as dataOnly, while a third pool on yet another flash device is used for metadata. The GPFS placement and migration rules are defined as shown in Fig. 2 and Table 1. To evaluate the setup we use the SIONlib benchmark partest, which uses POSIX I/O routines, as well as the IOR benchmark, which we configured to use MPI-IO, to create 1.5 TBytes of data.
In Fig. 5 we show the data throughput at the different sensors for the two benchmarks as a function of time. No significant differences between the two benchmarks are observed, and we obtained similar results for IOR using the POSIX instead of the MPI-IO interface. The SIONlib benchmark starts with a write cycle followed by a read cycle, which here starts and ends about 500 s and 800 s after the start of the test, respectively. Initially all data is placed in the pool flash. After 260 s GPFS started the migration process. Only a small impact on the I/O performance is seen.
Table 3. IOPS comparison between two JUNIORS nodes using 4 TMS RamSan-70 cards and the classical scratch file system using about 5,000 disks

  storage system   write IOPS (mean)   read IOPS (mean)
  JUNIORS          52234               122123
  JUST             20456               69712
[Figure 5: two panels, bandwidth [GB/s] versus time [s]; each panel shows iostat read/write for the flash and disk pools, with markers for the start of GPFS migration, the end of the benchmark, and the end of GPFS migration.]

Fig. 5. Read and write bandwidth (per node) as a function of time for the SIONlib benchmark using the POSIX interface (left) and the IOR benchmark using the MPI-IO interface (right)
After the benchmark run ended, it took up to 17 minutes to finish the migration.
7 Performance Analysis
To assess the potential of this architecture when used exclusively as a write cache, it is instructive to consider a simple model. Let us denote the time required to perform the computations and the (synchronous) I/O operations by t_comp and t_io, respectively, and assume t_comp > t_io. While the application is computing, data can be staged between the disk and flash pools. Therefore, it is reasonable to choose the bandwidth between the compute system and the staging area y = t_comp/t_io times larger than the bandwidth between the staging area and the external storage. As a result the time for I/O should reduce by a factor 1/y, and therefore the overall execution time should reduce by a factor

  (t_comp + t_io/y) / (t_comp + t_io) = (y + 1/y) / (y + 1).

In this simple model the performance gain for the overall system would be up to 17% for y = 1 + √2. Since the storage subsystem accounts for only a fraction of the overall costs, this could significantly improve overall cost efficiency.
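The 17% figure follows from minimising the factor above; the short check below is our own verification of that arithmetic and introduces no new assumptions.

```python
import math

def relative_runtime(y):
    """(t_comp + t_io/y) / (t_comp + t_io) with t_comp = y * t_io,
    which simplifies to (y + 1/y) / (y + 1)."""
    return (y + 1.0 / y) / (y + 1.0)

y_opt = 1.0 + math.sqrt(2.0)            # stationary point of the factor
print(y_opt, relative_runtime(y_opt))   # ~2.414, ~0.83 -> up to ~17% gain
```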
To investigate this further we implemented a simple simulation model, as shown in Fig. 6. The model mimics the GPFS policy rules used for the JUNIORS prototype. The behaviour is controlled by the policy parameters fstart, fstop and fmax as well as by a set of bandwidth parameters. A separate bandwidth parameter is foreseen for each data path. The bandwidth along the data paths connecting the processing device and the pool flash as well as the pools flash and disk may change when data migration is started, as observed for our prototype. All bandwidth parameters are chosen such that the performance of our prototype is resembled. Note that the bandwidth is assumed not to depend on I/O patterns. This is a simplification, as the sustainable bandwidth of a disk-based system depends heavily on the I/O request sizes. One big advantage of flash storage is that it can sustain close to peak bandwidth for much smaller I/O request sizes (and file system block sizes). From the JUGENE I/O statistics (see Fig. 1) we extract the amount of data written and the time between write operations.
Fig. 6. Schematic view of the simulation model, consisting of a processing device, a storage manager and 2 storage pools, flash and disk
With the additional pool flash disabled, the execution times obtained from the model and the real execution times agree within 10%. When the write cache is enabled, the simulated execution time reduces only at the 1% level. This result is consistent with the above model analysis, as for the monitored time period t_comp/t_io ≈ 300, i.e. the system was not used by applications which are I/O bound. We consider this typical for current HPC systems, since the performance penalty for executing a significant amount of I/O operations is large.
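For concreteness, a minimal sketch of such a write-cache simulation is given below. It is our own simplification under the assumptions stated above (pattern-independent bandwidths, bursts replayed from a trace of compute time and written data); the parameter values are illustrative and it is not the simulator used to produce the numbers quoted here.

```python
from dataclasses import dataclass

@dataclass
class WriteCacheModel:
    """Toy replay of write bursts through a flash pool with GPFS-like
    migration thresholds. Sizes in GByte, bandwidths in GByte/s.
    All parameter values below are illustrative assumptions."""
    flash_size: float = 1800.0
    bw_proc_to_flash: float = 3.2   # compute system -> pool flash
    bw_flash_to_disk: float = 1.0   # pool flash -> pool disk (migration)
    bw_proc_to_disk: float = 1.0    # fall-back path once flash is full
    f_max: float = 0.90
    f_start: float = 0.30
    f_stop: float = 0.05

    def run(self, bursts):
        """bursts: iterable of (compute_seconds, written_gbytes) tuples
        extracted from an I/O trace. Returns the total elapsed time [s]."""
        filled, elapsed = 0.0, 0.0
        for compute_s, data_gb in bursts:
            # Migration drains the flash pool asynchronously while the
            # application computes, once the fstart threshold is exceeded.
            if filled > self.f_start * self.flash_size:
                drain = min(self.bw_flash_to_disk * compute_s,
                            filled - self.f_stop * self.flash_size)
                filled -= max(drain, 0.0)
            elapsed += compute_s
            # The burst goes to flash until fmax is reached; the remainder
            # falls back to the disk pool (second placement rule).
            to_flash = max(min(data_gb,
                               self.f_max * self.flash_size - filled), 0.0)
            to_disk = data_gb - to_flash
            filled += to_flash
            elapsed += to_flash / self.bw_proc_to_flash
            elapsed += to_disk / self.bw_proc_to_disk
        return elapsed

# Example: ten bursts of 100 GByte with 300 s of computation in between.
print(WriteCacheModel().run([(300.0, 100.0)] * 10))
```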
8 Discussion and Conclusions
In this paper we evaluated the functionality and performance of an I/O prototype system comprising flash memory cards in addition to a disk pool. We could demonstrate that the system is capable of sustaining high write and read bandwidth to and from the flash cards, using a massively parallel Blue Gene/Q system to generate the load. Taking into account that in standard user operation I/O operations occur in bursts, we investigated how such an I/O architecture could be used to realise a tiered storage system where flash memory is used for staging data. We demonstrated how such a system can be managed using the policy rule mechanism of GPFS.
Our results indicate that for loads extracted from current I/O statistics some gain is to be expected when using the architecture just as a write cache. The statistics are, however, biased, as I/O-bound applications hardly use the system for performance reasons. Transferring data sequentially using large blocks does not meet the requirements of many scientific applications. A performance assessment of the proposed prefetching mechanism from disk to flash could not be carried out within the scope of this paper. It requires extensions to workflow managers and other software tools which make it easy for the user to specify which files will be accessed for reading before or while executing a job (see RAMDISK [7] for a possible solution).
The considered hierarchical storage system comprising non-volatile memory has an even higher potential for performance improvements when the application's I/O performance is mainly limited by the rate at which read and write requests can be served. For our prototype system we show that a high IOPS rate can be achieved. We therefore expect our tiered storage system to be particularly efficient, e.g., when used for multi-pass analysis applications performing a large number of small random I/O operations.
Acknowledgements. Results presented in this paper are partially based on the project "MPP Exascale system I/O concepts", a joint project of IBM, CSCS and Jülich Supercomputing Centre (JSC) which was partly supported by the EU grant 261557 (PRACE-1IP). We would like to thank all members of this project, in particular H. El-Harake, W. Homberg and O. Mextorf, for their contributions.
References
[1] Gray, J., Shenoy, P.: Rules of Thumb in Data Engineering, pp. 3–10 (2000)
[2] Bell, G., Gray, J., Szalay, A.: Petascale Computational Systems. Computer 39(1), 110–112 (2006)
[3] Hitachi, https://www1.hgst.com/hdd/technolo/overview/storagetechchart.html (accessed: January 26, 2013)
[4] Stevens, R., White, A., et al. (2010), http://www.exascale.org/mediawiki/images/d/db/planningforexascaleapps-steven.pdf (accessed: January 26, 2013)
[5] Mueller-Wicke, D., Mueller, C.: TSM for Space Management for UNIX GPFS Integration (2010)
[6] Miller, E.L., Katz, R.H.: Input/output Behavior of Supercomputing Applications. In: SC 1991, pp. 567–576. ACM, New York (1991)
[7] Wickberg, T., Carothers, C.: The RAMDISK Storage Accelerator: A Method of Accelerating I/O Performance on HPC Systems Using RAMDISKs. In: ROSS 2012, pp. 5:1–5:8. ACM, New York (2012)
[8] Abbasi, H., Wolf, M., Eisenhauer, G., Klasky, S., Schwan, K., Zheng, F.: DataStager: Scalable Data Staging Services for Petascale Applications. In: HPDC 2009, pp. 39–48. ACM, New York (2009)
[9] Abbasi, H., Wolf, M., Eisenhauer, G., Klasky, S., Schwan, K., Zheng, F.: DataStager: Scalable Data Staging Services for Petascale Applications. Cluster Computing 13(3), 277–290 (2010)
[10] Kannan, S., Gavrilovska, A., Schwan, K., Milojicic, D., Talwar, V.: Using Active NVRAM for I/O Staging. In: PDAC 2011, pp. 15–22. ACM, New York (2011)
[11] Frings, W., Wolf, F., Petkov, V.: Scalable Massively Parallel I/O to Task-local Files. In: SC 2009, pp. 17:1–17:11. ACM, New York (2009)
[12] Borrill, J., Oliker, L., Shalf, J., Shan, H.: Investigation of Leading HPC I/O Performance Using a Scientific-application Derived Benchmark. In: SC 2007, pp. 10:1–10:12. ACM, New York (2007)

IBM, Blue Gene and GPFS are trademarks of IBM in the USA and/or other countries. Linux is a registered trademark of Linus Torvalds in the USA, other countries, or both. RamSan and Texas Memory Systems are registered trademarks of Texas Memory Systems, an IBM Company. Fusion-io, ioDrive, ioDrive2 Duo, and ioDrive Duo are trademarks or registered trademarks of Fusion-io, Inc.
