) to implement an intermediate layer between compute nodes and disk-based file servers. GPFS's Information Lifecycle Management (ILM) [5] is used to manage the tiered storage such that this additional complexity is hidden from the user. The key feature of GPFS which we exploit is the option to define groups for different kinds of storage devices, known as GPFS storage pools. Furthermore, the GPFS policy engine is used to manage the available storage and handle data movement between different storage pools.
This approach is motivated by the observation that applications running on HPC systems as they are operated at Jülich Supercomputing Centre (JSC) tend to start bursts of I/O operations; such behaviour has also been reported elsewhere in the literature, see, e.g., [6]. In Fig. 1 we show the read and write bandwidth measured on JSC's petascale Blue Gene/P facility JUGENE. On each I/O node the bandwidth has been determined from the GPFS counters, which were retrieved every 120 s.
If we assume that an intermediate storage layer is available which provides a much higher I/O bandwidth to the compute nodes (but an unchanged bandwidth towards the external storage), it is possible to reduce the time needed for I/O. Alternatively, it is also possible to lower the total power consumption by providing the original bandwidth through flash and reducing the bandwidth of the disk storage. This requires mechanisms to stage data kept in the external storage system before it is read by the application, as well as to cache data generated by the application before it is written to the external storage. All I/O operations related to staging are supposed to be executed asynchronously with respect to the application. The fast intermediate storage layer can also be used for out-of-core computations where data is temporarily moved out of main memory. Another use case is check-pointing. In both cases we assume the intermediate storage layer to be large enough to hold all data such that migration of the data to the external storage is avoided.
The key contributions of this paper are:
1. We designed and implemented a tiered storage system where non-volatile flash memory is used to realize a fast cache layer and where resources and data transfer are managed by GPFS ILM features.
2. We provide results for synthetic benchmarks which were executed on a petascale Blue Gene/Q system connected to the small-scale prototype storage cluster. We compare results obtained using different flash memory cards.
3. Finally, we perform a performance and usability analysis for such a tiered storage architecture. We use I/O statistics collected on a Blue Gene/P system as input for a simple model of using the system as a fast write cache.
2 Related Work
In recent years a number of papers have been published which investigate the use of fast storage technologies for staging I/O data, as well as different software architectures to manage such tiered storage.
In [7] part of the nodes' volatile main memory is used to create a temporary parallel file system. The performance of their RAMDISK nominally increases at the same rate as the number of nodes used by the application increases. The authors implemented their own mechanisms to stage the data before a job starts and after the job has completed. Data staging is controlled by a scheduler which is coupled to the system's resource manager (here SLURM).
The DataStager architecture [8, 9] uses the local volatile memory to keep buffers needed to implement asynchronous write operations. No file system is used to manage the buffer space; instead, a concept of data objects is introduced which are created under application control. After being notified, DataStager processes running on separate nodes manage the data transport, i.e. data transport is server-directed and data is pulled whenever sufficient resources are available.
To overcome power and cost constraints of DRAM-based staging, the use of NVRAM is advocated in [10]. Here a future node design scenario is evaluated comprising Active NVRAM, i.e. NVRAM plus a low-power compute element. The latter allows for asynchronous processing of data chunks committed by the application. Final data may be flushed to external disk storage. Data transport and resource management are thus largely controlled by the application.
3 Background
NAND flash memory belongs to the class of non-volatile memory technologies. Here we only consider Single-Level Cell (SLC) NAND flash, which features the highest number of write cycles. SLC flash chips currently have an endurance of up to O(100,000) erase/write cycles. At device level the problem of failing memory chips is significantly mitigated by wear-levelling and RAID mechanisms. A large number of write cycles is critical when using flash memory devices in HPC systems. The advantages of flash memory devices with respect to standard disks are a significantly higher bandwidth and orders of magnitude higher I/O operation rates due to significantly lower access latencies, at much lower power consumption.
GPFS is a flexible, scalable parallel file system architecture which allows both data and metadata to be distributed. The smallest building block, a Network Shared Disk (NSD), which can be any storage device, is dedicated to data only, metadata only, or both. In this work we exploit several features of GPFS. First, we make use of storage pools, which allow storage elements to be grouped. Pools are traditionally used to handle different types or usage modes of disks (and tapes). In the context of tiered storage this feature can be used to group the flash storage into one pool and the slower disk storage into another. Second, a policy engine provides the means to let GPFS manage the placement of new files and the staging of files. This engine is controlled by a simple set of rules.
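As an illustration of how such pools could be populated (all NSD, device and server names here are hypothetical), recent GPFS versions allow the pool of each NSD to be declared in a stanza file of the kind passed to mmcrnsd and mmcrfs, e.g.:

%nsd: nsd=flash01 device=/dev/fioa servers=ionode1 usage=dataOnly pool=flash
%nsd: nsd=disk01  device=/dev/sdb  servers=ionode2 usage=dataOnly pool=disk

Metadata would then reside on NSDs assigned to the default system pool.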
4 GPFS-Based Staging
In this paper we investigate a setup where flash memory cards are integrated into a persistent GPFS instance. GPFS features are used (1) to steer initial file placement, (2) to manage migration from flash to external, disk-based storage, and (3) to stage files from external storage to flash memory. The user therefore continues to access a standard file system.
We organise external disk storage and flash memory into a disk and a flash pool, respectively. Then we use the GPFS policy engine to manage automatic data staging. In Fig. 2 we show our policy rules with the parameters defined in Table 1. These rules control when the GPFS events listed in Table 1 are thrown; the numerical values should be set according to the operational characteristics of an HPC system's workload.
RULE SET POOL flash LIMIT(f_max)
RULE SET POOL disk
RULE MIGRATE
     FROM POOL flash
     THRESHOLD(f_start,f_stop)
     WEIGHT(CURRENT_TIMESTAMP-ACCESS_TIME)
     TO POOL disk
Fig. 2. The GPFS policy rules for the target file system define the placement and migration rules for the files' data blocks
Files are created in the storage pool flash as long as its filling does not exceed the threshold f_max. When the filling exceeds f_max, the second rule applies, which allows files to be created in the pool disk. This fall-back mechanism avoids writes failing when the pool flash is full while there is still free space in the storage pool disk.
The third rule manages migration of data from the pool flash to disk. If the filling of the pool flash exceeds the threshold f_start, the event lowDiskSpace is thrown and automatic data migration is initiated. Migration stops once the filling of the pool flash drops below the limit f_stop (where f_stop < f_start). For GPFS to decide which files to migrate first, a weight factor is assigned to each file, which we have chosen such that least recently accessed files are migrated first.
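For concreteness, instantiating the rules of Fig. 2 with the values from Table 1 gives a policy file of the following form (the rule names are illustrative), which would be installed with mmchpolicy <filesystem> <policyfile>:

RULE 'placeInFlash' SET POOL 'flash' LIMIT(90)
RULE 'fallbackToDisk' SET POOL 'disk'
RULE 'drainFlash' MIGRATE
     FROM POOL 'flash'
     THRESHOLD(30,5)
     WEIGHT(CURRENT_TIMESTAMP-ACCESS_TIME)
     TO POOL 'disk'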
Migration is controlled by a callback function which is bound to the events lowDiskSpace and noDiskSpace. Whenever one of these events occurs, the installed policy rules for the corresponding file system are re-evaluated. The GPFS parameter noSpaceEventInterval defines the minimal interval between occurrences of the events lowDiskSpace and noDiskSpace (it defaults to 120 seconds).
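A minimal sketch of such a callback registration, assuming a helper script /usr/local/sbin/migrate.sh (the identifier and path are hypothetical) which simply re-runs the policy engine on the affected file system:

mmaddcallback migrateCallback \
    --command /usr/local/sbin/migrate.sh \
    --event lowDiskSpace,noDiskSpace \
    --parms "%fsName"

# migrate.sh re-evaluates the rules installed by mmchpolicy:
#   mmapplypolicy $1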
Finally, to steer staging of files to flash memory we propose the implementation of a prefetch mechanism which is triggered either by the user application or by the system's resource manager (as in [7]). Technically this can be realized using the command mmchattr -P flash <filename>. To avoid this file being automatically moved back to the pool disk, the last access time has to be updated, too. A possible way to achieve this is to create a library which offers a function like prefetch(<filename>). This library also has to ensure that only one rank of the application is in charge.
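A sketch of the two steps such a prefetch(<filename>) would issue (the file name is only an example; in an MPI application a single rank would execute them):

# move the file's data blocks into the flash pool
mmchattr -P flash /gpfs/work/input.dat
# refresh the access time so that the migration weight
# CURRENT_TIMESTAMP-ACCESS_TIME keeps the file in flash
touch -a /gpfs/work/input.dat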
Table 1. Policy rule parameters and the values used in this paper

Parameter  Description                                         Value  Event
f_max      Maximal filling (in percent)                        90     noDiskSpace
f_start    Filling limit which starts migration (in percent)   30     lowDiskSpace
f_stop     Filling limit which stops migration (in percent)     5
The optimal choice of the parameters f_start, f_stop and f_max depends on how the system is used. For larger values of f_max the probability of files being opened for writing in the slow disk pool is reduced. Large values for f_start and f_stop may cause migration to start too late or finish too early and thus increase the risk of the pool flash becoming full. On the other hand, small values would lead to early migration, which could harm read performance in case the application tries to read the data after it has been migrated.
5 Test Setup
5.1 Prototype Configuration
Our prototype I/O system JUNIORS (Jülich Novel IO Research System) consists of IBM x3650 M3 servers with two hex-core Intel Xeon X5650 CPUs (running at 2.67 GHz) and 48 GBytes of DDR3 main memory. For the tests reported here we use up to 6 nodes. Each node is equipped with 2 flash memory cards, either from Fusion-io or a TMS RamSan-70.