
Copyright

© 1999 by Rakesh Dilip Barve


All rights reserved
ALGORITHMIC TECHNIQUES TO OVERCOME THE I/O
BOTTLENECK
by
Rakesh Dilip Barve
Department of Computer Science
Duke University

Date:
Approved:

Jeffrey S. Vitter, Supervisor


Pankaj Agarwal
Lars Arge
Jeffrey Chase
Gregory Lawler

Dissertation submitted in partial fulfillment of the


requirements for the degree of Doctor of Philosophy
in the Department of Computer Science
in the Graduate School of
Duke University
1998
ABSTRACT
(Computer Science – Algorithms)
ALGORITHMIC TECHNIQUES TO OVERCOME THE I/O
BOTTLENECK
by
Rakesh Dilip Barve
Department of Computer Science
Duke University

Date:
Approved:

Jeffrey S. Vitter, Supervisor


Pankaj Agarwal
Lars Arge
Jeffrey Chase
Gregory Lawler

An abstract of a dissertation submitted in partial


fulfillment of the requirements for the degree
of Doctor of Philosophy in the Department of
Computer Science in the Graduate School of
Duke University
1998
Abstract
The I/O bottleneck is the bottleneck in the performance of large scale computing applications caused by the (widening)
gap in the performance of fast CPU and internal memory on the one hand, and slower external memory devices on
the other. This disparity in performance is a well-known problem, so researchers have developed several approaches
to alleviate the e ects of the I/O bottleneck. In this thesis, we present algorithmic techniques relevant to some of
these approaches to overcome the I/O problem.
One useful approach is for the operating system to implement clever caching or prefetching strategies, so that
accessed data is always in internal memory as far as possible. We present an application-controlled paging algorithm
to manage a cache shared by several competing applications; using the framework of competitive analysis, we prove
that using hints from applications can result in significant improvements in paging performance compared to
previous approaches. Using hints from applications to incorporate locality of reference of individual applications into
the paging model represents a significantly different approach from previous theoretical approaches to incorporate
locality of reference.
Another approach to improve I/O performance is to develop applications using custom-made I/O algorithms.
We present an I/O algorithm called simple randomized mergesort (SRM) to perform external sorting using parallel
disks. Mergesort is widely used for external memory sorting, but some difficulties inherent to external merging have
to be overcome in order to adapt it to use parallel disks. SRM is simple, has provably efficient I/O performance,
and is attractive for practical implementation. The analysis used to bound SRM's performance includes interesting
reductions to certain maximum occupancy problems.
Recent developments in database systems necessitate the development of external memory applications capable
of efficient I/O performance while adapting to dynamic, unpredictable fluctuations in their memory allocations.
Designing such applications requires incorporating notions of online algorithms with techniques used in I/O algorithms.
We present a theoretical framework for such "memory-adaptive" algorithms: a computational model in which to
design and analyze memory-adaptive algorithms, notions of optimality, and lower bounds and optimal (to within
constant factors) algorithms for sorting, matrix multiplication, and related problems.
A common approach to improve I/O performance is to employ many disks in parallel, organizing them using
one or more I/O buses with several disks on each bus. Motivated by the complexity of such systems, we present an
accurate analytical model of their I/O performance on interesting workloads relevant to I/O algorithms and other
large-scale applications. We present a simple technique, based on insights gained from our experiments, that results
in significant I/O performance improvement in certain situations.
Finally, we present an implementation of SRM, using simple and practical data structures and techniques. We
present a comparison of our SRM implementation with an implementation of the popular disk-striped mergesort
(DSM) algorithm, showing that SRM outperforms DSM and is efficient in practice.

Acknowledgements
Many people – peers, colleagues, teachers, friends, and family – have (directly or indirectly) played a positive role
with respect to this thesis, and I am grateful to all of them.
I would like to thank my thesis supervisor, Jeff Vitter, for guiding me through my thesis. I have learnt a tremendous
amount about various technical and non-technical aspects of computer science research from my interactions with
him over the past five years.
I would also like to thank Lars Arge for many useful discussions about I/O algorithms and TPIE, Jeff Chase
for useful discussions and help relevant to implementations of external memory algorithms, and Pankaj Agarwal for
interesting discussions and the courses he taught.
My stay at Duke University was made pleasant and enjoyable because of many friends: Thomas Alexander, Amit
Bagga, Kausar Banoo, Subhrajit Bhattacharya, Pavan Desikan, Sachin Garg, Ranjit Gupta, P. Krishnan, Jagdish
Krishnaswamy, Vishal Lulla, T. M. Murali, Ghazala Shahbuddin, Rajesh Rao and Kasturi Varadarajan.
I am immensely grateful to my wife Shalini Pillay-Barve, for being with me during the last year of my dissertation
work, for her love, calming presence, constant support, and the wonderful times. I am indebted to Rekha and Dilip
Barve for being such wonderful parents: For the past twenty-six years, I could always count on their love, guidance,
and belief in my abilities. I thank my brother Neil for his love and friendship.

Credits
I would like to acknowledge co-authors with whom I have had the opportunity to interact fruitfully, while working
on the material presented in this thesis. I collaborated with Eddie Grove and Jeff Vitter while working on material
presented in Chapter 2 and Chapter 3, with Jeff Vitter while working on material in Chapter 4 and Chapter 5, and
with Liddy Shriver, Phil Gibbons, Bruce Hillyer, Yossi Matias and Jeff Vitter while working on material in Chapter 6.
The work presented in Chapter 6 was partly carried out when I was visiting Bell Laboratories.
I am grateful for a discussion with Greg Lawler in which he suggested a simplification to the original version of
our proof of Theorem 7 pertaining to the "Dependent Maximum Occupancy" problem discussed in Chapter 3. A
similar simplification was independently communicated by Don Knuth and incorporated in [Knu98, Section 5.4.9] in
his treatment of the material in Chapter 3.
I have availed of computing facilities at the computer science department of Duke University and in the Information
Sciences Center of Bell Laboratories in the course of my thesis work.
I gratefully acknowledge the financial support provided to me by an IBM Fellowship (from August 1995 through
May 1998), and by research grants from the National Science Foundation and the Army Research Office.

Contents
Abstract iv
Acknowledgements vi
List of Tables xiv
List of Figures xvi
1 Introduction 1
1.1 The I/O Bottleneck . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Approaches to Overcome the I/O Bottleneck . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2.1 Operating System Caching and Prefetching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2.2 External Memory Algorithms (with statically allocated memory) . . . . . . . . . . . . . . . . . 6
1.2.3 EM Algorithms with Dynamic Memory Allocations . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.2.4 Techniques to Improve Data Transfer Speed . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

2 Application-Controlled Paging for a Shared Cache 12


2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.2 Classical Caching and Competitive Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.3 Multi-application Caching Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.4 Online Algorithm for Multi-application Caching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.4.1 Relation to Previous Work on Classical Caching . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.5 Lower Bounds for OPT and Competitive Ratio . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.6 Holes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.6.1 General observations about holes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.6.2 Useful properties of holes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.6.3 Relation to the algorithm of Cao et al. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.7 Competitive Analysis of our Online Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.8 Application-Controlled Caching with Fairness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.8.1 Extending Fairness to the algorithm by Cao et al. . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.9 Conclusions and Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3 Simple Randomized Mergesort for Parallel Disks 32
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.2 Main Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.2.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.2.2 Overview of SRM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.3 SRM Data Layout . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.4 Forecasting Format and Data Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.5 The SRM Merging Procedure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.5.1 Internal Memory Management . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.5.2 Maintaining the dynamic partition of internal memory . . . . . . . . . . . . . . . . . . . . . . . 43
3.5.3 Maintaining the Forecasting Data Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.5.4 Terminology and Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.5.5 I/O Scheduling Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.6 Using Phases to Count ParRead Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.7 Probabilistic Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
3.7.1 The Dependent Occupancy Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
3.7.2 Asymptotic Expansions of the Maximum Occupancy . . . . . . . . . . . . . . . . . . . . . . . . 52
3.7.3 Proof of Theorem 6 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
3.8 A Deterministic Variant . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
3.9 Comparisons between SRM and DSM in Practice . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
3.9.1 Expressions for the number of I/O operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
3.9.2 Comparison based upon expected worst-case performance of SRM . . . . . . . . . . . . . . . . 64
3.9.3 Using simulations to count SRM's I/O operations . . . . . . . . . . . . . . . . . . . . . . . . . . 66
3.10 Realistic Values for Parameters k, D, and B . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
3.11 Conclusions and Related Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
4 Design and Implementation of SRM in TPIE 70
4.1 Introduction and Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
4.1.1 Our Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
4.2 External Mergesort Preliminaries and Previous Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
4.2.1 Run Formation + Merging Passes = Mergesort . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
4.2.2 Participation Order of Blocks in a Merge . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
4.2.3 Previous Work on External Merging . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76

4.2.4 Parallel Disk Merging in DSM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
4.2.5 Difficulty of Merging Optimally with Parallel Independent Disks . . . . . . . . . . . . . . . 77
4.3 SRM Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
4.3.1 The Forecast and Flush Scheme . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
4.3.2 Provable Performance Guarantees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
4.3.3 Data Structures Required in the Straightforward Implementation . . . . . . . . . . . . . . . . . 80
4.4 Implementation Techniques and Data Structures for SRM . . . . . . . . . . . . . . . . . . . . . . . . . 81
4.4.1 Managing Forecasting Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
4.4.2 Other Data Structures and Primitive Operations . . . . . . . . . . . . . . . . . . . . . . . . . . 84
4.4.3 Basic Ideas of Our Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
4.4.4 Algorithmic Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
4.5 Performance Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
4.5.1 Computer System and Environment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
4.5.2 Input Characteristics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
4.5.3 DSM and SRM Configurations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
4.5.4 Performance Numbers and Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
4.5.5 Relative Performance Comparisons . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
4.5.6 Improving I/O Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
4.6 Other Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
4.6.1 Distribution Sort and Multi-way Partitioning . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
4.6.2 Streaming Through Multimedia Files . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
4.7 Conclusions and Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
5 Memory-Adaptive External Memory Algorithms 99
5.1 Introduction and Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
5.2 Dynamic Memory Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
5.2.1 Dynamically Optimal Memory-Adaptive Algorithms . . . . . . . . . . . . . . . . . . . . . . . . 104
5.3 Memory-Adaptive Lower Bounds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
5.3.1 Memory-Adaptive Lower Bounds for Permuting . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
5.3.2 Memory-Adaptive Lower Bounds for Sorting, FFT and Permutation Networks . . . . . . . . . 110
5.3.3 Memory-Adaptive Lower Bounds for Matrix Multiplication . . . . . . . . . . . . . . . . . . . . 111

5.4 Designing Memory-Adaptive Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
5.4.1 Optimal non-adaptive external memory algorithms . . . . . . . . . . . . . . . . . . . . . . . . . 113
5.4.2 Mimicking Optimal Static Memory Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
5.4.3 Adaptive Organization of Computation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
5.4.4 Allocation Levels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
5.5 A Framework for Memory-Adaptive Mergesort . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
5.5.1 Memory-Adaptive Run Formation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
5.5.2 Memory-adaptive merging stage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
5.5.3 Resource consumption of memory-adaptive external mergesort . . . . . . . . . . . . . . . . . . 117
5.6 Potential of a Merge . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
5.7 Nonoptimality of a simple memory-adaptive mergesort algorithm . . . . . . . . . . . . . . . . . . . . . 120
5.7.1 Sketch of the memory-adaptive external memory mergesort . . . . . . . . . . . . . . . . . . . . 121
5.7.2 Lower Bound on Resource Consumption . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
5.8 Dynamically Optimal Memory-Adaptive Sorting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
5.8.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
5.8.2 Run-Records . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
5.8.3 Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
5.8.4 Level-record Data Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
5.8.5 Invariants for run-records and level-records . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
5.8.6 Low-level Merge Computation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
5.8.7 Downloading Work for Adaptivity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
5.8.8 Putting download() and fillmerge() Together . . . . . . . . . . . . . . . . . . . . . . . . . . 153
5.9 Analysis of resource consumption . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
5.9.1 Resource Consumption of download() calls, bad fillmerge() calls and fillmerge(rr_global) calls . . 157
5.9.2 Potential Function Argument . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
5.9.3 Optimality of Resource Consumption for Sorting . . . . . . . . . . . . . . . . . . . . . . . . . . 166
5.10 Some notes on the potential function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166
5.11 Dynamically Optimal Permuting, FFT, and Permutation Networks . . . . . . . . . . . . . . . . . . . . 168
5.11.1 Permuting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168
5.11.2 Dynamically Optimal FFT and Permutation Networks . . . . . . . . . . . . . . . . . . . . . . . 169
5.12 Memory-Adaptive Buffer Tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171

5.12.1 Memory-adaptive buffer emptying of internal nodes . . . . . . . . . . . . . . . . . . . . . . 172
5.12.2 Memory-adaptive buffer emptying of other nodes . . . . . . . . . . . . . . . . . . . . . . . . 173
5.13 Dynamically Optimal Memory-adaptive Matrix Arithmetic . . . . . . . . . . . . . . . . . . . . . . . . 175
5.13.1 Transformation between different blocking orders . . . . . . . . . . . . . . . . . . . . . . . . 175
5.13.2 Memory-adaptive Matrix Multiplication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177
5.13.3 Mop Records and Level-Records . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178
5.13.4 The loadlevel(), download(), and fillmult() Subroutines . . . . . . . . . . . . . . . . . . . . . 180
5.13.5 Algorithm MAMultiply . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184
5.13.6 Resource Consumption Analysis of MAMultiply . . . . . . . . . . . . . . . . . . . . . . . . . . . 185
5.13.7 Proving Optimality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189
5.14 Conclusions and Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 190
6 Modeling and Optimizing the I/O Performance of Multiple Disks on a SCSI Bus 192
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 192
6.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195
6.3 Workload . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196
6.4 Hardware Configuration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 197
6.4.1 The Fence Parameter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 197
6.5 Components Of Service Time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 199
6.6 Rounds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 200
6.7 Analytical Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 201
6.7.1 Read duration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202
6.7.2 Single disk model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202
6.7.3 Parallel disk model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 204
6.8 Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205
6.8.1 Experiments to Validate Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 206
6.8.2 Conclusions From Experimental Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 212
6.9 Pipelining . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 212
6.10 Conclusions and Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 215
Biography 229

List of Tables
3.1 The overhead v(k, D), computed by estimating C(kD, D)/k using computer simulations. . . . . . . . . 66

3.2 The performance ratio C_SRM/C_DSM for memory size M = (2k + 4)DB + kD², with block size B = 1000.
(Both M and B are expressed in units of records.) The overhead factor v in C_SRM is based upon
computer simulations of C(kD, D)/k. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66

3.3 The overhead factor v(k, D) for memory size M = (2k + 4)DB + kD² obtained from simulations. . . 67
3.4 The performance ratio C'_SRM/C_DSM for memory size M = (2k + 4)DB + kD², where C'_SRM is computed
using the overhead value v(k, D) obtained from simulations. . . . . . . . . . . . . . . . . . . . . . . . 67

4.1 Comparing SRM and DSM when internal memory is 15 MB and there are D = 6 disks. The input size
N is in units of 1 million items, each of size 104 bytes. . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
4.2 Comparing SRM and DSM when memory is 24 MB and there are D = 6 disks. The input size N is in
units of 1 million items, each of size 104 bytes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94

6.1 Disk parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 198

6.2 The average minimum time to read one sector on a Wren 7. . . . . . . . . . . . . . . . . . . . . . . . . 208

6.3 Validating equation (6.1) (1 Wren, fence 0). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 208

6.4 Validating equation (6.2) (1 Wren, fence 255). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 208

6.5 Validating equation (6.3) (Wren disks, fence 0). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209

6.6 Validating equation (6.4) (Wren disks, fence 255). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209

6.7 The Barracuda device parameters. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209

6.8 Validating equation (6.1) (1 Barracuda, fence 0). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209

6.9 Validating equation (6.3) (Barracuda disks, fence 0). . . . . . . . . . . . . . . . . . . . . . . . . . . . . 210

6.10 The Cheetah device parameters. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 210

6.11 Validating equation (6.1) (1 Cheetah, fence 0, Sparc-20). . . . . . . . . . . . . . . . . . . . . . . . . . . 210

6.12 Validating equation (6.2) (1 Cheetah, fence 255, Sparc-20). . . . . . . . . . . . . . . . . . . . . . . . . 210

6.13 Validating equation (6.3) (Cheetah disks, fence 0, Sparc-20). . . . . . . . . . . . . . . . . . . . . . . . . 211

6.14 Validating equation (6.4) (Cheetah disks, fence 255, Sparc-20). . . . . . . . . . . . . . . . . . . . . . . 211

6.15 Validating equation (6.1) (1 Cheetah, fence 0, Ultra). . . . . . . . . . . . . . . . . . . . . . . . . . . . . 211

6.16 Validating equation (6.2) (1 Cheetah, fence 255, Ultra). . . . . . . . . . . . . . . . . . . . . . . . . . . 211

6.17 Validating equation (6.3) (Cheetah disks, fence 0, Ultra). . . . . . . . . . . . . . . . . . . . . . . . . . 212

6.18 Validating equation (6.4) (Cheetah disks, fence 255, Ultra). . . . . . . . . . . . . . . . . . . . . . . . . 212

6.19 MB/s for naive and pipelined I/O, fence 0, Wren/Sun Sparc-20. . . . . . . . . . . . . . . . . . . . . . . 214

6.20 MB/s for naive and pipelined I/O, fence 0, Barracuda/DEC Alpha. . . . . . . . . . . . . . . . . . . . . 215

6.21 MB/s for naive and pipelined I/O, fence 0, Cheetah/Sun Ultra-1. . . . . . . . . . . . . . . . . . . . . . 215

List of Figures
1.1 A Single Platter of a Magnetic Disk Drive. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

3.1 The Parallel Disk Model [VS94]: D, B, M and N respectively denote the number of independent disk
drives, the size of each disk block, the size of internal memory, and the size of the input. . . . . . . . 33

3.2 (a) Dependent occupancy instance with Nb = 12, C = 5, D = 4. Arrows indicate the cyclical order of
blocks in the same chain. The maximum occupancy is 4, as realized in the second bin. (b) Classical
occupancy instance with Nb = 12, D = 4. Blocks fall independently of each other. The maximum
occupancy is 5, as realized in the second bin. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51

4.1 Merging Phase Timings of SRM and DSM. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94

[Figure 1.1: A Single Platter of a Magnetic Disk Drive. Labels: magnetic surface of disk, disk read/write arm, read/write head, disk track.]

Chapter 1
Introduction
In this thesis, we investigate algorithmic techniques to alleviate the effect of the so-called "I/O bottleneck". Through-
out this thesis, I/O (that is, Input/Output) refers to transfer of data between internal (or primary or main ) memory
consisting of RAM memory and secondary (or external ) memory consisting of magnetic disk drives [RW94]. We are
primarily concerned with computation involving the two-level memory hierarchy consisting of RAM and disk drive
memories. Although several computing systems also have a tertiary memory consisting of magnetic tape(s), we do
not address issues relevant to tertiary memory accesses in this thesis.
In this first chapter, we give an overview of the entire thesis. In Section 1.1, we describe the nature of magnetic
disk drive accesses and the reason for the I/O bottleneck. In Section 1.2, we describe each of the remaining chapters
in the thesis and survey some topics and previous work relevant to the approach of each chapter.

1.1 The I/O Bottleneck


In the past decade, there has been a two-order-of-magnitude increase in processor speed but only a two-fold improve-
ment in disk access time. (See [Dah96] for details of disk, RAM, processor speeds and other interesting technology
trends.) A typical disk drive is a factor of 10^5–10^6 times slower [GVW96] performing random access than is the
internal memory of the computer system.
In general, the reason for the poor data access performance of disk drives is the need to rely on mechanical (head

movement) operations that are slow compared to the solid-state electronic switching based operations involved in
internal memory accesses. Magnetic disks consist of one or more rotating platters and one read/write head per platter
surface. Figure 1.1 shows one such platter. The data are stored in concentric circles on the platters called tracks. To
read or write a data item at a certain address on disk, the read/write head must mechanically seek to the correct
track and then wait for the desired address to pass by. The rate at which data passes under the head determines the
disk bandwidth. The seek time to move from one random track to another is often on the order of 5–10 milliseconds,
and the average rotational latency, which is the time for half a revolution, has the same order of magnitude. The term
disk latency refers to the combination of seek and rotational latencies. In order to amortize the delay corresponding
to the disk latency, it pays to transfer a large collection of contiguous data items, sometimes called a block or a page.
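To make the effect of block size concrete, here is a back-of-the-envelope sketch; the parameter values are purely illustrative (chosen to lie in the ranges quoted above), not measurements of any particular drive.

```python
# Effective throughput of a single random disk access as a function of the
# amount of data transferred.  All parameter values below are hypothetical.
SEEK_MS = 8.0          # average seek time (ms), in the 5-10 ms range quoted above
ROTATION_MS = 6.0      # average rotational latency, i.e. half a revolution (ms)
BANDWIDTH_MB_S = 10.0  # sustained transfer rate once the head is positioned (MB/s)

def effective_mb_per_s(transfer_kb):
    """Effective throughput for one random access transferring `transfer_kb` KB."""
    transfer_ms = (transfer_kb / 1024.0) / BANDWIDTH_MB_S * 1000.0
    total_ms = SEEK_MS + ROTATION_MS + transfer_ms
    return (transfer_kb / 1024.0) / (total_ms / 1000.0)

for kb in (4, 64, 256, 1024, 4096):
    print(f"{kb:5d} KB per access -> {effective_mb_per_s(kb):5.2f} MB/s")
```

With these hypothetical numbers a 4 KB access achieves well under 1 MB/s of effective throughput, while a 1 MB access comes close to the drive's raw bandwidth, which is exactly why I/O is performed in units of blocks or pages.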
The bandwidth performance of disks is improving at only half the rate at which the bandwidth performance of
main memory is improving [Dah96]; disk latency is improving at an even smaller rate than disk bandwidth [Dah96]
and processor performance is increasing at a slightly faster rate than internal memory performance. The performance
discrepancy between disk and main memory access times, also known as the I/O bottleneck, will likely be with us
for quite a long time [GVW96].

1.2 Approaches to Overcome the I/O Bottleneck


At a high level, there are only a few fundamental techniques [GVW96] available to alleviate the effect of the I/O
bottleneck:

1. using multiple disks in parallel, thus increasing effective bandwidth of the I/O system;

2. more effective caching, and, more generally, reorganizing data within the storage system to exploit locality,
thereby reducing the cost of accessing data;

3. overlapping I/O with computation (for example, by prefetching data before it is needed) in order to reduce the
time that an application spends waiting for data to be transferred from or to a disk drive;

4. reducing or rearranging the accesses made to data by redesigning the applications themselves; and

5. scheduling and/or co-ordinating accesses to disk drives to increase the effective speed of data transfer to or from
disk drives, by exploiting characteristics and features of modern disk drives.

The algorithmic techniques we propose in this thesis span all the abovementioned techniques. However, instead of
classifying our proposed techniques according to the above categories, it is more fruitful to view each of our algorithmic
techniques in terms of its relationship with the principal components of a computing system. For our purposes, it is
enough to view a computing system as comprised of an operating system [Nut97] or a database management system

(DBMS) [UW97], and one or more application programs, or just applications. Although there are differences in the
design objectives of database management systems and operating systems, throughout the rest of this chapter, we
use the term "operating system" to refer to database management systems as well. An important sub-component of
the operating system is the I/O system 1 that implements the transfer of data between internal and external memory,
carries out "low level" functions with respect to organizing data on disks, etc. The operating system, amongst other
things, is generally in charge of managing resources to applications based upon various criteria [Nut97, UW97]. From
the perspective of this thesis, one key resource that governs the performance of an application, and as a consequence,
of the entire computing system, is the amount of internal memory allocated to the application over its run time.
Whenever an application has less internal memory than it ideally would like to have, I/O operations are required
to transfer data between internal memory and disk: The disk plays the role of a store that backs up application
data that do not fit in internal memory. Since I/O operations are necessary to make "progress" in an application's
computation, we can view even I/O operations as a resource that governs the performance of an application.
Algorithmic techniques can be applied to overcome the I/O bottleneck on several different fronts [GVW96]. Each
chapter of the thesis presents an algorithmic technique based upon a distinct approach to improve I/O performance.
For each of the remaining chapters of the thesis, we now describe the approach corresponding to that chapter, briefly
survey other techniques based upon that approach, and then describe key aspects of our algorithmic technique. Each
approach differs from other approaches with respect to the nature of its relationship with the principal components
of the computing system.

1.2.1 Operating System Caching and Prefetching


One approach towards overcoming the I/O bottleneck is caching and/or prefetching implemented in the operating
system. Programmers write applications as though all the data required by the application will be memory-resident;
it is then entirely up to the operating system to ensure that the application is processed correctly even if all of the
application's data may not t in the memory. The operating system tries to manage memory such that whenever
possible, the data required by the application is cached in memory. It is efficient to manage memory and execute I/O
operations in units of pages [Nut97]. Since the total amount of memory may be smaller than the total amount of
data of all the applications executing concurrently on a machine, efficient page replacement policies are required to
ensure that at any time only useful pages are cached in memory while remaining pages are stored on disk. Very often,
on detecting specific page access patterns of an application, or on receiving "hints" regarding future page accesses,
an operating system may also prefetch pages from disk into memory before their use so that by the time the page
is accessed, it is already in memory. Under this approach, each application can be modeled as an arbitrary fixed
sequence of page accesses and has no control over the size or contents of the memory it gets allocated. The operating
1 Our usage of the term I/O system is generic; where applicable, low-level components of a filesystem
or virtual memory paging system relevant to actual data transfer are all clubbed together under
the term I/O system.
system is supposed to use clever, general-purpose caching, prefetching and memory management strategies that will
provide good I/O performance for a large variety of applications and workloads. An operating system's workload
may consist of multiple, concurrently executing applications. The operating system is solely in control of memory
and I/O operations of the applications under this approach.
There is a large literature on operating system caching and prefetching techniques [Nut97]. We briefly mention the
many flavors in which caching and prefetching techniques are available, and then mention some relevant theoretical
work in these scenarios. One widely studied problem is the problem of demand paging : Assuming that a bounded
number of application pages can be cached in memory at any time, what page should be evicted when the next
accessed page is not in memory, that is, when a page fault occurs? Belady [Bel66] presented an offline algorithm
that minimizes page faults assuming that the entire sequence of page accesses is known a priori. Subsequently, many
empirical studies showed that the Least Recently Used (LRU) algorithm is an extremely efficient online demand
paging algorithm in general, for the practical case in which the sequence of page accesses to be served is unknown.
In pioneering work, Sleator and Tarjan [ST85], and then Fiat et al. [FKL+91] applied the framework of competitive
analysis to theoretically evaluate the performance of online demand paging algorithms. The worst-case nature of
the competitive analysis framework used in [ST85, FKL+91] has often been considered unrealistic because it does
not adequately capture the characteristic locality of reference [Nut97] that most applications exhibit in practice.
Subsequent work [BIRS91, KPR92, IKP92, FK95, Alb93] tried to incorporate various notions of locality of reference
within the competitive analysis framework, thereby proving bounds on the performance of various online demand
paging algorithms under various assumptions. Recently, an interesting version of the demand paging problem involving
multiple applications sharing a fixed-size cache, with each application capable of furnishing useful "hints" to the
paging algorithm was proposed in [CFL94a], with an accompanying paging algorithm. Generalizations of the demand
paging problem in which page sizes may differ [You98], or the access times of pages may be non-uniform [You91],
and generalized variants [AAK99], geared towards scenarios involving wide area networks and the world-wide web,
have also been studied. Another generalization of the demand paging problem addresses the issue of distributing
a set of pages across a network of machines each with local memory; online algorithms have been proposed and
analyzed [ABF96, ABF93] to decide how to keep dynamically changing the contents of the local memories under
various constraints so as to minimize page faults at any machine.
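As a concrete illustration of the demand-paging setting surveyed above, the following minimal sketch contrasts Belady's offline rule (evict the cached page whose next request is furthest in the future) with online LRU on an arbitrary example sequence; it illustrates the model only and is not code from this thesis.

```python
def belady_faults(requests, k):
    """Offline MIN: on a fault, evict the cached page whose next request is furthest away."""
    cache, faults = set(), 0
    for i, p in enumerate(requests):
        if p in cache:
            continue
        faults += 1
        if len(cache) >= k:
            def next_use(q):
                # Index of the next request to q; pages never requested again are best victims.
                for j in range(i + 1, len(requests)):
                    if requests[j] == q:
                        return j
                return float("inf")
            cache.remove(max(cache, key=next_use))
        cache.add(p)
    return faults

def lru_faults(requests, k):
    """Online LRU: on a fault, evict the least recently used page."""
    cache, faults = [], 0          # list ordered from least to most recently used
    for p in requests:
        if p in cache:
            cache.remove(p)
        else:
            faults += 1
            if len(cache) >= k:
                cache.pop(0)
        cache.append(p)
    return faults

seq = [1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5]
print(belady_faults(seq, 3), lru_faults(seq, 3))   # 7 faults offline vs. 10 for LRU
```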
A novel adaptation of the competitive analysis framework and a stall time model to study the performance of
prefetching algorithms while serving an application modeled as a sequence of page accesses was proposed in [CFKL95].
In prefetching algorithms, (some amount of) information regarding future page accesses is necessary. Besides making
page replacement decisions, a prefetching algorithm also has to decide when to initiate an I/O operation; so techniques
used to prefetch and their theoretical analyses [CFKL95, Kk96, AGL98] appear to be relatively more complex than
those for demand paging.
In Chapter 2, we present an application-controlled paging algorithm to manage a cache shared by P competing,

concurrently executing applications. The application-controlled approach of using "hints" (regarding future page
accesses) provided by applications to aid replacement decisions represents a markedly different, and in many cases
improved, approach to incorporate each application's locality of reference compared to previous approaches mentioned
above. If k is used to denote the number of pages that can be stored in the cache shared by the P applications, for our
scenario, we show how "hints" provided by the applications can be used to circumvent a lower bound of ≈ ln k proved
for the competitive ratio [FKL+91] of any online algorithm for the "classical" paging model [FKL+91]: Our algorithm
has a competitive ratio of ≈ 2 ln P, which can be much smaller than ln k since k ≫ P typically. Our algorithm can
also be modified to incorporate some notions of fairness in situations when applications provide incorrect information
in their "hints". The theoretical application-controlled paging model we use is inspired by recent work on application-
controlled file caching by Cao et al. [CFL94a].

1.2.2 External Memory Algorithms (with statically allocated memory)
In contrast to the previous approach in which the operating system solely controls memory contents and I/O op-
erations, the approach of external memory (EM) (or I/O or out-of-core) algorithms requires the application to
completely manage its own internal memory. Each EM algorithm is custom-made for the problem being solved.
In this approach, the operating system statically allocates a fixed, invariant amount of memory to the application,
after which the application solely determines the sequence of I/O operations it carries out in conjunction with its
computation. Each I/O operation transfers one block of B contiguous items or records to or from disk into memory.
If there are D > 1 disks, each (parallel) I/O operation may transfer at most one block to or from each disk drive;
so at most D blocks can be transferred when there are D disks, one to or from each disk. The goal is to design the
EM algorithm so that the number of I/O operations incurred is the smallest possible for that application. Once it is
allocated memory by the operating system, the EM application exercises complete control over its memory and I/O
operations.
The fact that algorithms that work well in internal memory very often result in a non-optimal number of I/O
operations when intermediate data needs to be stored on disk has long been well known, and so EM algorithms for
fundamental and frequently occurring problems such as sorting [Knu98] for the special case D = 1 have been in use
for many years. Database researchers have also developed I/O-efficient algorithms (often analyzed experimentally)
for various other EM query processing applications such as sort-join and hash-join [UW97] for the case when D = 1.
However, recent research has led to the development of provably efficient parallel disk (D > 1) algorithms for several
basic problems, optimal EM algorithms for several graph theoretic and computational geometry problems, and new
external memory data structures.
It is convenient to use N to denote the number of items input to an EM algorithm and M to denote the number

of items that can be stored in the memory statically allocated to an EM algorithm. A pioneering step towards the
notion of I/O complexity of a problem, that is, the smallest number of I/O operations required to solve the problem,
was made by Floyd [Flo72] when he was able to prove a lower bound on the number of I/O operations required to
transpose a disk-resident matrix for the specific case D = 1 and B = Θ(M) = Θ(N^c), where 0 < c < 1 is a constant.
The lower bounds apply to external sorting and permuting as well. However, the lower bounds are not tight in general
for other values of B, N, M, and D. Hong and Kung [HK81a] then presented "pebbling"-based lower bounds on the
number of I/O operations required in external FFT computation and external matrix multiplication when B = O(1)
and D = 1. In pioneering work, Aggarwal and Vitter [AV88] then proved general I/O complexity lower bounds for
sorting, permuting, FFT, permutation networks, and matrix transposition valid for all values of B, D, M, and N,
paving the way for increased research activity in I/O algorithms. Vitter and Shriver [VS94] then proposed the so-
called parallel disk model (PDM) meant to facilitate the design and analysis of parallel disk algorithms, and proposed
optimal parallel disk algorithms for sorting and related problems [AV88], and matrix multiplication. There has since
been a large amount of work on external memory algorithms. Several optimal parallel disk sorting algorithms have
been proposed in [NV90, NV95, AP94, NV93]. Efficient algorithms for various EM graph problems were proposed
in [CGG+95, KS96, ABW98] and optimal algorithms for various EM computational geometry problems were proposed
in [GTVV93, CFM+98]. After the B-tree indexing data structures [UW97] proposed in the seventies, recently, many
new external memory data structures have been proposed. A notable data structure is the versatile buffer tree
developed by Arge [Arg96], with many applications. Other external memory data structures relevant to efficient
geometric processing have been proposed in [RS94, VV96a, AV96]. Efficient EM algorithms have also been proposed
for various large-scale problems occurring in geographical information systems [AVVar, Arg97, AAM+98, APR+98]
and string processing applications [FG95, AFGV98]. A comprehensive survey of the state of the art in the field of
I/O algorithms can be found in [Vit98].
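To illustrate how the parameters N, M, B, and D interact in this model, the following rough sketch counts the parallel I/Os performed by an idealized R-way external mergesort: one run-formation pass followed by merge passes, each pass reading and writing every item once. The numeric values are hypothetical, and with R = Θ(M/B) the pass count matches the familiar Θ((N/DB) log_{M/B}(N/B)) sorting bound.

```python
import math

def mergesort_parallel_ios(N, M, B, D, R):
    """Approximate parallel I/O count for an idealized R-way external mergesort.

    Each pass reads and writes all N items in blocks of B records, moving up to
    D blocks per parallel I/O; sizes are measured in records.
    """
    ios_per_pass = 2 * math.ceil(N / (D * B))     # read + write every item once
    runs = math.ceil(N / M)                       # memory-sized runs after pass 0
    merge_passes = math.ceil(math.log(runs, R)) if runs > 1 else 0
    return (1 + merge_passes) * ios_per_pass

# Hypothetical example: 2^30 records, 2^25 records of memory, blocks of 2^13
# records, 6 disks, merging R = M/(2B) runs at a time (an illustrative choice).
N, M, B, D = 2**30, 2**25, 2**13, 6
print(mergesort_parallel_ios(N, M, B, D, R=M // (2 * B)))
```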
In Chapter 3, we present a simple randomized mergesort (SRM) for parallel disk sorting on the PDM [VS94].
Previously proposed theoretically ecient parallel disk sorting algorithms have the drawback that they are somewhat
complicated and have larger than desired constant factors, and so are not suitable for implementation. In contrast,
SRM is simple, consists of a simple prefetching and buffer-management scheme that efficiently utilizes available
memory and D-disk parallelism, and is easily implementable. With respect to the entire range of values of M,
D and B, SRM is either provably optimal or provably efficient. Experimental evidence suggests that our analysis
is somewhat conservative with respect to practical situations. For all practical purposes, SRM may be considered
better than most other algorithms, even for ranges of M, D and B values over which SRM is, strictly speaking, not
theoretically optimal. Recently, Knuth [Knu98, Section 5.4.9] identified SRM (which he calls "randomized striping")
as the method of choice for sorting on parallel disks. The probabilistic analysis involved in proving bounds on SRM's
performance includes reductions to certain maximum occupancy problems, which may be of independent interest.
The prefetching and buffer management used in SRM also has applications in parallel prefetching in certain situations,

as observed in [BKVV97].
In Chapter 4, we present techniques and issues relevant to the implementation of I/O algorithms using
TPIE [Ven95]: TPIE (Transparent Parallel I/O Environment) is a programming environment, originally developed
by Darren Vengroff [Ven96] for his PhD thesis to facilitate the implementation of efficient I/O algorithms. Chapter 4
describes the implementation of the SRM algorithm in TPIE. We use elegant data structures and techniques to imple-
ment the forecasting [Knu98] and prefetching approach of SRM. We demonstrate that SRM is significantly faster than
disk-striped mergesort (DSM), a popularly used parallel disk sorting algorithm. Generalizations of our forecasting
data structures and techniques also have applications in multi-way external distribution and in other situations such
as in video servers etc.

1.2.3 EM Algorithms with Dynamic Memory Allocations


The approach in Section 1.2.1 has the operating system in complete control over the memory allocation and the mem-
ory contents of the applications, and the operating system executes I/O operations on behalf of the applications. The
approach in Section 1.2.2 has the operating system only allocating a fixed amount of memory to the EM application,
after which the EM application is solely responsible for managing its memory and I/O operations appropriately. We
now consider a dynamic approach applicable frequently in specific situations nowadays, in which the operating system
changes the memory allocation of an EM application from time to time, but the EM application is in complete charge
of the memory allocated to it at any time. Compared to the approach in Section 1.2.1, the application has greater
control over its resources, and in contrast to the approach of Section 1.2.2, the operating system can potentially
influence the application's memory allocation (and hence its performance) throughout its execution.
Recent developments related to real-time database systems [Sys92, Rec88] and database systems based upon
administratively defined goals [FNG89, BCL93] have motivated the study of EM applications that can adapt to
memory fluctuations, but such a study can also be viewed as a natural extension of the statically allocated, fixed-
memory approach (Section 1.2.2) to develop EM algorithms. Drastic performance degradation can occur when
the internal memory allocation of an EM application that assumes a static memory allocation is unpredictably
reduced. EM techniques that incorporate notions of online algorithms are needed in order to design memory-adaptive
algorithms that can cope with memory fluctuations. Work on memory-adaptive hash-join algorithms is reported
in [ZG90, PCL93b]; a memory-adaptive mergesort is proposed in [PCL93a, ZL97]. Prior work on memory-adaptive
EM algorithms has been incomplete and exclusively empirical in nature.
In Chapter 5, we present a theoretical framework for memory-adaptive algorithms. We present a theoretical
model for dynamic memory allocation and present a notion of optimality for memory-adaptive algorithms. Since
the performance of a memory-adaptive algorithm cannot be captured by considering in isolation either only its
memory usage or only its number of I/O operations, we introduce the problem-specific notion of resource consumption
of a memory-adaptive algorithm. We prove resource consumption bounds and present optimal memory-adaptive

algorithms for sorting (and related problems) and matrix multiplication (and related problems). Our upper bound
analysis includes a novel potential function approach that provides insights on the notion of optimal memory utilization
during EM computation. We prove that a memory-adaptive sorting algorithm based upon a previously proposed
approach is non-optimal. Our lower bound techniques are related to techniques in [AV88] and [HK81a].

1.2.4 Techniques to Improve Data Transfer Speed


In the above three approaches, the emphasis is on developing algorithmic techniques (for the operating system or
application) that attain efficient I/O performance by appropriately determining what blocks or pages are to be
transferred in each I/O operation. A completely complementary approach is to develop techniques that will result in
efficient data transfer between memory and disk, assuming some other component of the computing system specifies
the blocks that need to be transferred in I/O operations. An instance of this approach is [MK91] and related work,
which proposed disk block allocation algorithms aimed to optimize the performance of file accesses. Researchers
have also proposed and experimentally analyzed techniques to dynamically reorganize data on disks based upon
access statistics, the goal being to accelerate data transfer rates [dJKH93, ES92, RW91, VC90]: The general idea
is to use access statistics and features of modern disk drives [RW94] to ensure that frequently used ("hot") data
resides in the middle of the disk surface, to ensure that data blocks accessed together are also stored together on
disk, and so on, thereby increasing locality of disk accesses. A technique used at \run-time" to accelerate data
transfer rates via optimized request sequencing is for disk drive scheduling algorithms to perform dynamic request
reordering [JW91, SCO90, WGP94] often taking advantage of low-level information available only to the disk drive
controller. A theoretical result pertaining to an ecient disk drive scheduling algorithm incorporating some recent
developments in disk drive technology appears in [ABZ96]. Scheduling techniques can improve device throughput by
reordering accesses for lowered positioning times, balance the use of resources over multiple requesting applications,
and reduce processing overheads by coalescing distinct accesses into a single larger access [Kot94, PEK96]. In DBMS,
which often tend to avoid the filesystem interface [Sto81], researchers have proposed application-specific disk drive
scheduling and disk block layout techniques [ZL96, ZL98] designed to take advantage of disk controller features such
as sequential readahead to optimize data transfer during I/O. The technique in [CT96] proposes a specialized disk
scheduling algorithm to be used in conjunction with some advanced features of disk drive controllers to improve I/O
performance of certain multimedia applications. Several researchers have also carried out detailed modeling [RW94,
WGPW95, Che89, Shr97] of storage devices, which are getting more and more complicated [GVW96]: Detailed models
of underlying I/O devices are necessary to develop and implement algorithms to accelerate data transfer to or from
devices.
In Chapter 6, we model the performance of a parallel I/O system consisting of multiple disk drives sharing a
SCSI bus and present a technique to improve the I/O performance of the system. Our workload consists of an
approximation of the I/O workload generated by a PDM algorithm. Our analytical model for the I/O performance of

a system comprising multiple disks sharing a bus is different from previous approaches to modeling multiple disks on
a bus. We take into account some interesting phenomena (not reported in previous literature on this subject) that we
discovered while monitoring the behavior of the system and some advanced features of modern disk drive controllers.
Our model has been validated on several disks and systems. We also present a simple "pipelining" technique based
upon the SCSI Prefetch command [CT96] to hide disk latency; we present an implementation for situations in which
the SCSI Prefetch command is not supported by the disk drive controllers. Our pipelining implementation (even
though we could not use SCSI Prefetch commands) yields significant performance improvements in certain situations.

Chapter 2
Application-Controlled Paging for a
Shared Cache
Summary
We propose a provably efficient application-controlled global strategy for organizing a cache of size k shared among
P application processes. Each application has access to information about its own future page requests, and by
using that local information along with randomization in the context of a global caching algorithm, we are able to
break through the conventional H_k ≈ ln k lower bound on the competitive ratio for the caching problem. If the
P application processes always make good cache replacement decisions, our online application-controlled caching
algorithm attains a competitive ratio of 2H_{P-1} + 2 ≈ 2 ln P. Typically, P is much smaller than k, perhaps by several
orders of magnitude. Our competitive ratio improves upon the 2P + 2 competitive ratio achieved by the deterministic
application-controlled strategy of Cao, Felten, and Li. We show that no online application-controlled algorithm can
have a competitive ratio better than min{H_{P-1}, H_k}, even if each application process has perfect knowledge of its
individual page request sequence. Our results are with respect to a worst-case interleaving of the individual page
request sequences of the P application processes.
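Since the bounds above are stated in terms of harmonic numbers, a quick calculation shows how large the gap can be; the values of P and k below are illustrative only.

```python
def H(n):
    """n-th harmonic number H_n = 1 + 1/2 + ... + 1/n, roughly ln n for large n."""
    return sum(1.0 / i for i in range(1, n + 1))

P, k = 10, 100000          # illustrative: a few processes sharing a large cache
print(f"classical caching lower bound    H_k          = {H(k):5.2f}")
print(f"application-controlled bound     2H_(P-1) + 2 = {2 * H(P - 1) + 2:5.2f}")
```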
We introduce a notion of fairness in the more realistic situation when application processes do not always make
good cache replacement decisions. We show that our algorithm ensures that no application process needs to evict one
of its cached pages to service some page fault caused by a mistake of some other application. Our algorithm is not
only fair, but remains efficient; the global paging performance can be bounded in terms of the number of mistakes
that application processes make.

2.1 Introduction
Caching is a useful technique for obtaining high performance nowadays, when the latency of disk access is relatively
high. Today's computers typically have several application processes running concurrently on them, by means of time
sharing and multiple processors. Some processes have special knowledge of their future access patterns. Cao et al.
[CFL94a, CFL94b] exploit this special knowledge to develop effective file caching strategies.
An application providing specific information about its future needs is equivalent to the application having its own
caching strategy for managing its own pages in cache. We consider the multi-application caching problem, formally
defined in Section 2.3, in which P concurrently executing application processes share a common cache of size k. In

Section 2.4 we propose an online application-controlled caching scheme in which decisions need to be taken at two
levels: when a page needs to be evicted from cache, the global strategy chooses a victim process, but the process
itself decides which of its pages will be evicted from cache.
Each application process may use any available information about its future page requests when deciding which
of its pages to evict. However, we assume no global information about the interleaving of the individual page request
sequences; all our bounds are with respect to a worst-case interleaving of the individual request sequences.
Competitive ratios smaller than the H_k lower bound for classical caching [FKL+91] are possible for multi-
application caching, because each application may employ future information about its individual page request
sequence.¹ The deterministic application-controlled algorithm proposed by Cao, Felten, and Li [CFL94a] achieves a
competitive ratio of 2P + 2, as we prove in Section 2.6.3. We show in Sections 2.5–2.7 that our new randomized online
application-controlled caching algorithm improves the competitive ratio to 2H_{P-1} + 2 ≈ 2 ln P, which is optimal up
to a factor of 2 in the realistic scenario when P < k. (If we use the algorithm of [FKL+91] for the case P ≥ k, the
resulting bound is optimal up to a factor of 2 for all P.) Our results are significant since P is often much smaller
than k, perhaps by several orders of magnitude.

¹Here H_n represents the nth harmonic number Σ_{i=1}^{n} 1/i ≈ ln n.
In Section 2.6.3, we also prove that no deterministic online application-controlled algorithm can have a competitive
ratio better than P + 1. Thus the difference between our competitive ratio and that attained by the algorithm of
Cao, Felten, and Li is a further indication of the (well-known) power of randomization in solving online problems
in general and online paging in particular.
In the scenario where application processes occasionally make bad page replacement decisions (or "mistakes"),
we show in Section 2.8 that our online algorithm incurs very few page faults globally as a function of the number of
mistakes. Our algorithm is also fair, in the sense that the mistakes made by one process in its page replacement
decisions do not worsen the page fault rate of other processes.

2.2 Classical Caching and Competitive Analysis


The well-known classical caching (or paging) problem deals with a two-level memory hierarchy consisting of a fast
cache of size k and slow memory of arbitrary size. A sequence of requests to pages is to be satisfied in their order
of occurrence. In order to satisfy a page request, the page must be in fast memory. When a requested page is not
in fast memory, a page fault occurs, and some page must be evicted from fast memory to slow memory in order to
make room for the new page to be put into fast memory. The caching (or paging) problem is to decide which page
must be evicted from the cache. The cost to be minimized is the number of page faults incurred over the course of
servicing the page requests.
Belady [Bel66] gives a simple optimum offline algorithm for the caching problem; the page chosen for eviction is the
one in cache whose next request is furthest in the future. In order to quantify the performance of an online algorithm,
Sleator and Tarjan [ST85] introduce the notion of competitiveness, which in the context of caching can be defined
as follows: For a caching algorithm A, let F_A(σ) be the number of page faults generated by A while processing page
request sequence σ. If A is a randomized algorithm, we let F_A(σ) be the expected number of page faults generated
by A on processing σ, where the expectation is with respect to the random choices made by the algorithm. An
online algorithm A is called c-competitive if for every page request sequence σ, we have F_A(σ) ≤ c · F_OPT(σ) + b,
where b is some fixed constant. The constant c is called the competitive ratio of A. Under this measure, an online
algorithm's performance needs to be relatively good on worst-case page request sequences in order for the algorithm
to be considered good.
Sleator and Tarjan [ST85] show a lower bound of k on the competitive ratio of any deterministic caching algorithm.
Fiat et al. [FKL+91] prove a lower bound of H_k if randomized algorithms are allowed. They also give a simple and
elegant randomized algorithm for the problem that achieves a competitive ratio of 2H_k. Sleator and McGeoch [MS91]
give a rather involved randomized algorithm that attains the theoretically optimal competitive ratio of H_k.

2.3 Multi-application Caching Problem


In this chapter we take up the theoretical issue of how best to use application processes' knowledge about their individual
future page requests so as to optimize caching performance. For analysis purposes we use an online framework similar
to that of [FKL+ 91, MS91]. As mentioned before, the caching algorithms in [FKL+ 91, MS91] use absolutely no
information about future page requests. Intuitively, knowledge about future page requests can be exploited to decide
which page to evict from the cache at the time of a page fault. In practice an application often has advance knowledge
of its individual future page requests. Cao, Felten and Li [CFL94a, CFL94b] introduced strategies that try to combine
the advance knowledge of the processors in order to make intelligent page replacement decisions.

In the multi-application caching problem we consider a cache capable of storing k pages that is shared
by P different application processes, which we denote P_1, P_2, ..., P_P. Each page in cache and memory
belongs to exactly one process. The individual request sequences of the processes may be interleaved in
an arbitrary (worst-case) manner.

The worst-case measure is often criticized when used for evaluating caching algorithms for individual application
request sequences [BIRS91, KPR92], but we feel that the worst-case measure is appropriate for considering a global
paging strategy for a cache shared by concurrent application processes that have knowledge of their individual page
request sequences. The locality of reference within each application's individual request sequence is accounted for in
our model by each application process's knowledge of its own future requests. The worst-case nature of our model
is that it assumes nothing about the order and durations of time for which application processes are active. In

this model our worst-case measure of competitive performance amounts to considering a worst-case interleaving of
individual sequences.
The approach of Cao et al [CFL94a] is to have the kernel deterministically choose the process owning the least
recently used page at the time of a page fault and ask that process to evict a page of its choice (which may be different
from the least recently used page). In Section 2.6.3, we show, under the assumption that processes always make good
page replacement decisions, that Cao et al.'s algorithm has a competitive ratio between P + 1 and 2P + 2. The algorithm
we present in the next section and analyze thereafter improves the competitive ratio to 2H_{P-1} + 2 ≈ 2 ln P.

2.4 Online Algorithm for Multi-application Caching


Our algorithm is an online application-controlled caching strategy for an operating system kernel to manage a shared
cache in an efficient and fair manner. We show in the subsequent sections that the competitive ratio of our algorithm
is 2H_{P-1} + 2 ≈ 2 ln P and that it is optimal to within a factor of about 2 among all online algorithms. (If P ≥ k, we
can use the algorithm of [FKL+ 91].)
On a page fault, we first choose a victim process and then ask it to evict a suitable page. Our algorithm can
detect mistakes made by application processes, which enables us to reprimand such application processes by having
them pay for their mistakes. In our scheme, we mark pages as well as processes in a systematic way while processing
the requests that constitute a phase.

Definition 1 The global sequence of page requests is partitioned into a consecutive sequence of phases; each phase
is a sequence of page requests. At the beginning of each phase, all pages and processes are unmarked. A page gets
marked during a phase when it is requested. A process is marked when all of its pages in cache are marked. A new
phase begins when a page is requested that is not in cache and all the pages in cache are marked. A page accessed
during a phase is called clean with respect to that phase if it was not in the online algorithm's cache at the beginning
of a phase. A request to a clean page is called a clean page request. Each phase always begins with a clean page
request.

Our marking scheme is similar to the one in [FKL+ 91] for the classical caching problem. However, unlike the
algorithm in [FKL+ 91], the algorithm we develop is a non-marking algorithm, in the sense that our algorithm may
evict marked pages. In addition, our notion of phase in Definition 1 is different from the notion of phase in [FKL+91],
which can be looked upon as a special case of our more general notion. We put the di erences into perspective in
Section 2.4.1.
Our algorithm works as follows when a page p belonging to process Pr is requested:

1. If p is in cache:

(a) If p is not marked, we mark it.
(b) If process Pr has no unmarked pages in cache, we mark Pr .

2. If p is not in cache:

(a) If process Pr is unmarked and page p is not a clean page with respect to the ongoing phase (i.e., Pr has
made a mistake earlier in the phase by evicting p) then:
i. We ask process Pr to make a page replacement decision and evict one of its pages from cache in order
to bring page p into cache. We mark page p and also mark process Pr if it now has no unmarked pages
in cache.
(b) Else (process Pr is marked or page p is a clean page, or both):
i. If all pages in cache are marked, we remove marks from all pages and processes, and we start a new
phase, beginning with the current request for p.
ii. Let S denote the set of unmarked processes having pages in the cache. We randomly choose a process Pe
from S, each process being chosen with uniform probability 1/|S|.
iii. We ask process Pe to make a page replacement decision and evict one of its pages from cache in order
to bring page p into cache. We mark page p and also mark process Pe if it now has no unmarked page
in cache.
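
To make the two-level control flow of the algorithm concrete, the following is a minimal Python sketch of the kernel-side logic. The (process, page) representation of pages, the per-process choose_victim callbacks, and the handling of a not-yet-full cache are assumptions of the sketch, not part of the thesis.

```python
import random

class SharedCache:
    """Minimal sketch of the randomized two-level caching scheme of Section 2.4.

    A page is a (process_id, page_id) pair.  choose_victim[proc] is a callback
    supplied by application process proc; ideally it returns a page from the
    process's good set (Definition 2), but the kernel-side logic below does not
    depend on that.
    """

    def __init__(self, k, choose_victim):
        self.k = k                        # shared cache size in pages
        self.cache = set()                # pages currently in cache
        self.marked_pages = set()         # pages marked in the current phase
        self.marked_procs = set()         # processes marked in the current phase
        self.phase_start = set()          # cache contents when the phase began
        self.choose_victim = choose_victim

    def _mark(self, page):
        proc = page[0]
        self.marked_pages.add(page)
        owned = [q for q in self.cache if q[0] == proc]
        if owned and all(q in self.marked_pages for q in owned):
            self.marked_procs.add(proc)   # all of proc's cached pages are marked

    def _evict_from(self, proc):
        candidates = [q for q in self.cache if q[0] == proc]
        self.cache.remove(self.choose_victim[proc](candidates))

    def request(self, page):
        proc = page[0]
        if page in self.cache:                                # Step 1: hit
            self._mark(page)
            return 0
        clean = page not in self.phase_start                  # Step 2: fault
        if len(self.cache) >= self.k:                         # an eviction is needed
            if proc not in self.marked_procs and not clean:   # Step 2(a): proc pays for its own mistake
                victim_proc = proc
            else:                                             # Step 2(b)
                if self.cache <= self.marked_pages:
                    # every cached page is marked: start a new phase with this request
                    self.marked_pages.clear()
                    self.marked_procs.clear()
                    self.phase_start = set(self.cache)
                owners = {q[0] for q in self.cache}
                unmarked = sorted(owners - self.marked_procs)
                victim_proc = random.choice(unmarked)         # uniform over unmarked processes
            self._evict_from(victim_proc)
        self.cache.add(page)
        self._mark(page)
        return 1                                              # one page fault
```

A process whose callback always returns a page from its good set realizes the good decisions assumed in Sections 2.6 and 2.7; a process that does not is charged for its own mistakes through Step 2(a), as analyzed in Section 2.8.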
Note that in Steps 2(a)i and 2(b)iii our algorithm seeks paging decisions from application processes that are
unmarked. Consider an unmarked process Pi that has been asked to evict a page in a phase, and consider Pi 's pages
in cache at that time. Let ui denote the farthest unmarked page of process Pi ; that is, ui is the unmarked page
of process Pi whose next request occurs furthest in the future among all of Pi 's unmarked cached pages. Note that
process Pi may have marked pages in cache whose next requests occur after the request for ui .

Definition 2 The good set of an unmarked process Pi at the current point in the phase is the set consisting of its
farthest unmarked page ui in cache and every marked page of Pi in cache whose next request occurs after the next
request for page ui . A page replacement decision made by an unmarked process Pi in either Step 2(a)i or Step 2(b)iii
that evicts a page from its good set is regarded as a good decision with respect to the ongoing phase. Any page from
the good set of Pi is a good page for eviction purposes at the time of the decision. Any decision made by an unmarked
process Pi that is not a good decision is regarded as a mistake by process Pi .

If a process Pi makes a mistake by evicting a certain page from cache, we can detect the mistake made by Pi if
and when the same page is requested again by Pi in the same phase while Pi is still unmarked.
In Sections 2.6 and 2.7 we specifically assume that application processes are always able to make good decisions
about page replacement. In Section 2.8 we consider fairness properties of our algorithm in the more realistic scenario
where processes can make mistakes.

2.4.1 Relation to Previous Work on Classical Caching
Our marking scheme approach is inspired by a similar approach for the classical caching problem in [FKL+ 91].
However, the phases defined by our algorithm are significantly different in nature from those in [FKL+91]. Our phase
ends when there are k distinct marked pages in cache; more than k distinct pages may be requested in the phase.
The phases depend upon the random choices made by the algorithm and are probabilistic in nature. On the other
hand, a phase defined in [FKL+91] ends when exactly k distinct pages have been accessed, so that given the input
request sequence, the phases can be determined independently of the caching algorithm being used.
The definition in [FKL+91] is suited to facilitate the analysis of online caching algorithms that never evict marked
pages, called marking algorithms. In the case of marking algorithms, since marked pages are never evicted, as soon
as k distinct pages are requested, there are k distinct marked pages in cache. This means that the phases determined
by our de nition for the special case of marking algorithms are exactly the same as the phases determined by the
de nition in [FKL+ 91]. Note that our algorithm is in general not a marking algorithm since it may evict marked
pages. While marking algorithms always evict unmarked pages, our algorithm always calls on unmarked processes to
evict pages; the actual pages evicted may be marked.

2.5 Lower Bounds for OPT and Competitive Ratio


In Section 2.6.3, we prove a lower bound of min{P + 1, k} on the competitive ratio of any deterministic online
algorithm for multi-application caching. In this section, we prove the stronger result that the competitive ratio
of any online caching algorithm can be no better than min{H_{P-1}, H_k}. This lower bound holds even for randomized
caching algorithms.
Let us denote by OPT the optimal offline algorithm for caching that works as follows: When a page fault occurs,
OPT evicts the page whose next request is furthest in the future request sequence among all pages in cache.
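
For reference, Belady's furthest-in-the-future rule can be written as a short (inefficient but straightforward) sketch; representing the request sequence as a Python list is an assumption of the sketch.

```python
def opt_faults(requests, k):
    """Belady's offline rule: on a fault with a full cache, evict the cached
    page whose next request lies furthest in the future."""
    cache, faults = set(), 0
    for t, page in enumerate(requests):
        if page in cache:
            continue
        faults += 1
        if len(cache) == k:
            future = requests[t + 1:]
            def next_use(q):
                return future.index(q) if q in future else float("inf")
            cache.remove(max(cache, key=next_use))   # furthest next request
        cache.add(page)
    return faults
```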
As in [FKL+ 91], we will compare the number of page faults generated by our online algorithm during a phase with
the number of page faults generated by OPT during that phase. We express the number of page faults as a function
of the number of clean page requests during the phase. Here we state and prove a lower bound on the (amortized)
number of page faults generated by OPT in a single phase. The proof is a simple generalization of an analogous
proof in [FKL+ 91], which deals only with the deterministic phases of marking algorithms.

Lemma 1 Consider any phase i of our online algorithm in which ℓ_i clean pages are requested. Then OPT incurs
an amortized cost of at least ℓ_i/2 on the requests made in that phase.²
²By "amortized" in Lemma 1 we mean, for each j ≥ 1, that the number of page faults made by
OPT while serving the first j phases is at least Σ_{i=1}^{j} ℓ_i/2, where ℓ_i is the number of clean page
requests in the ith phase.

Proof: Let d_i be the number of clean pages in OPT's cache at the beginning of phase i; that is, d_i is the number
of pages requested in phase i that are in OPT's cache but not in our algorithm's cache at the beginning of phase i.
Let d_{i+1} represent the same quantity for the next phase, i+1. Let d_{i+1} = d_m + d_u, where d_m of the d_{i+1}
clean pages in OPT's cache at the beginning of phase i+1 are marked during phase i and d_u of them are not marked
during phase i. Note that d_1 = 0 and d_i ≤ k for all i.
Of the ℓ_i clean pages requested during phase i, only d_i are in OPT's cache, so OPT generates at least ℓ_i - d_i page
faults during phase i. On the other hand, while processing the requests in phase i, OPT cannot use d_u of the cache
locations, since at the beginning of phase i+1 there are d_u pages in OPT's cache that are not marked during phase i.
(These d_u pages would have to be in OPT's cache before phase i even began.) There are k marked pages in our
algorithm's cache at the end of phase i, and there are d_m other pages marked during phase i that are out of our
algorithm's cache. So the number of distinct pages requested during phase i is at least d_m + k. Hence, OPT serves
at least d_m + k requests corresponding to phase i without using d_u of the cache locations. This means that OPT
generates at least (k + d_m) - (k - d_u) = d_{i+1} faults during phase i. Therefore, the number of faults OPT generates
on phase i is at least

    max{ℓ_i - d_i, d_{i+1}}  ≥  (ℓ_i - d_i + d_{i+1}) / 2.

Let us consider the first j phases. In the jth phase of the sequence, OPT has at least ℓ_j - d_j ≥ (ℓ_j - d_j)/2 faults.
In the first phase, OPT generates k faults and ℓ_1 = k. Thus the sum of OPT's faults over all phases is at least

    ℓ_1 + Σ_{i=2}^{j-1} (ℓ_i - d_i + d_{i+1})/2 + (ℓ_j - d_j)/2  ≥  Σ_{i=1}^{j} ℓ_i/2,

where we use the fact that d_2 ≤ k = ℓ_1. Thus by definition, the amortized number of faults OPT generates over any
phase i is at least ℓ_i/2.
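
The telescoping step can also be checked numerically. The snippet below merely verifies the displayed inequality for random values satisfying d_1 = 0, d_i ≤ k, and ℓ_1 = k; it is an illustration of the algebra, not part of the thesis.

```python
import random

def check_once(j=8, k=50):
    l = {1: k}   # phase 1 requests exactly k clean pages, so l_1 = k
    d = {1: 0}   # d_1 = 0
    for i in range(2, j + 1):
        l[i] = random.randint(1, k)
        d[i] = random.randint(0, k)   # every d_i is at most k, so in particular d_2 <= k = l_1
    lhs = l[1] + sum((l[i] - d[i] + d[i + 1]) / 2 for i in range(2, j)) + (l[j] - d[j]) / 2
    rhs = sum(l.values()) / 2
    assert lhs >= rhs                 # the claimed lower bound on OPT's faults

for _ in range(2000):
    check_once()
```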

Next we will construct a lower bound for the competitive ratio of any randomized online algorithm even when
application processes have perfect knowledge of their individual request sequences. The proof is a straightforward
adaptation of the proof of the H_k lower bound for classical caching [FKL+91]. However, in the situation at hand, the
adversary has more restrictions on the request sequences it can use to prove the lower bound, which results
in a weaker lower bound.

Theorem 1 The competitive ratio of any randomized algorithm for the multi-application caching problem is at least
min{H_{P-1}, H_k}, even if application processes have perfect knowledge of their individual request sequences.

Proof: If P > k, the H_k lower bound on the classical caching problem from [FKL+91] is directly applicable by
considering the case where each process accesses only one page. This gives a lower bound of H_k on the competitive
ratio.

In the case when P ≤ k, we construct a multi-application caching problem based upon the nemesis sequence used
in [FKL+91] for classical caching. In [FKL+91] a lower bound of H_{k'} is proved for the special case of a cache of
size k' and a total of k' + 1 pages, which we denote c_1, c_2, ..., c_{k'+1}. All but one of the pages can fit in cache at
the same time. Our corresponding multi-application caching problem consists of P = k' + 1 application processes
P_1, P_2, ..., P_P, so that there is one process corresponding to each page of the classical caching lower bound instance
for a k'-sized cache. Process P_i owns r_i pages p_{i,1}, p_{i,2}, ..., p_{i,r_i}. The total number Σ_{i=1}^{P} r_i of pages among all
the processes is k + 1, where k is the cache size; that is, all but one of the pages among all the processes can fit in
memory simultaneously.
In the instance of the multi-application caching problem we construct, the request sequence for each process P_i
consists of repetitions of the double round-robin sequence

    p_{i,1}, p_{i,2}, ..., p_{i,r_i}, p_{i,1}, p_{i,2}, ..., p_{i,r_i}        (2.1)

of length 2r_i. We refer to the double round-robin sequence (2.1) as a touch of process P_i. When the adversary
generates requests corresponding to a touch of process P_i, we say that it "touches process P_i."
Given an arbitrary adversarial sequence for the classical caching problem described above, we construct an adver-
sarial sequence for the multi-application caching problem by replacing each request for page ci in the former problem
by a touch of process Pi in the latter problem. We can transform an algorithm for this instance of multi-application
caching into one for the classical caching problem by the following correspondence: If the multi-application algorithm
evicts a page from process Pj while servicing the touch of process Pi , the classical caching algorithm evicts page cj
in order to service the request to page ci . In Lemma 2 below, we show that there is an optimum online algorithm
for the above instance of multi-application caching that never evicts a page belonging to process Pi while servicing a
fault on a request for a page from process Pi . Thus the transformation is valid, in that page ci is always resident in
cache after the page request to ci is serviced. This reduction immediately implies that the competitive ratio for this
instance of multi-application caching must be at least H_{k'} = H_{P-1}.

Lemma 2 For the above instance of multi-application caching, any online algorithm A can be converted into an
online algorithm A' that is at least as good in an amortized sense and that has the property that all the pages for
process Pi are in cache immediately after a touch of Pi is processed.

Proof: Intuitively, the double round-robin sequences force an optimal online algorithm to service the touch of a
process by evicting a page belonging to another process. We construct online algorithm A' from A in an online
manner. Suppose that both A and A' fault during a touch of process Pi. If algorithm A evicts a page of Pj, for some
j ≠ i, then A' does the same. If algorithm A evicts a page of Pi during the first round-robin while servicing a touch
of Pi, then there will be a page fault during the second round-robin. If A then evicts a page of another process during
the second round-robin, then A' evicts that page during the first round-robin and incurs no fault during the second
round-robin. The first page fault of A was wasted; the other page could have been evicted instead during the first
round-robin. If instead A evicts another page of Pi during the second round-robin, then A' evicts an arbitrary page
of another process during the first round-robin, and A' incurs no page fault during the second round-robin. Thus, if
A evicts a page of Pi, it incurs at least one more page fault than does A'.
If A faults during a touch of Pi, but A' doesn't, there is no paging decision for A' to make. If A does not fault
during a touch of Pi, but A' does fault, then A' evicts the page that is not in A's cache. The page fault for A' is
charged to the extra page fault that A incurred earlier when A' evicted one of Pi's pages.
Thus the number of page faults that A' incurs is no more than the number of page faults that A incurs. By
construction, all pages of process Pi are in the cache of algorithm A' immediately after a touch of process Pi.

The double round-robin sequences in the above reduction can be replaced by single round-robin sequences by
redoing the explicit lower bound argument of [FKL+ 91].

2.6 Holes
In this section, we introduce the notion of holes, which plays a key role in the analysis of our online caching algorithm.
In Section 2.6.2, we mention some crucial properties of holes of our algorithm under the assumption that applications
always make good page replacement decisions. These properties are also useful in bounding the page faults that can
occur in a phase when applications make mistakes in their page replacement decisions.

Definition 3 The eviction of a cached page at the time of a page fault on a clean page request is said to create a hole
at the evicted page. Intuitively, a hole is the lack of space for some page, so that the page's place in cache contains
a hole and not the page. If page p1 is evicted for servicing the clean page request, page p1 is said to be associated
with the hole. If page p1 is subsequently requested and another page p2 is evicted to service the request, the hole is
said to move to p2 , and now p2 is said to be associated with the hole. And so on, until the end of the phase. We say
that hole h moves to process Pi to mean that the hole h moves to some page p belonging to process Pi .

2.6.1 General observations about holes


All requests to clean pages during a phase are page faults and create holes. The number of holes created during
a particular phase equals the number of clean pages requested during that phase. Apart from clean page requests,
requests to holes also cause page faults to occur. By a request to a hole we mean a request for the page associated
with that hole. As we proceed down the request sequence during a phase, the page associated with a particular hole
varies with time. Consider a hole h that is created at a page p1 that is evicted to serve a request for clean page pc .
When a request is made for page p1 , some page p2 is evicted, and h moves to p2 . Similarly when page p2 is requested,
h moves to some p3, and so on. Let p1, p2, ..., pm be the temporal sequence of pages all associated with hole h in a

particular phase such that page p1 is evicted when clean page pc is requested, page pi, where i > 1, is evicted when
p_{i-1} is requested, and the request for pm falls in the next phase. Then the number of faults incurred in the particular
phase being considered on account of requests to h is m - 1.

2.6.2 Useful properties of holes


In this section we make the following observations about holes under the assumption that application processes make
only good decisions.

Lemma 3 Let ui be the farthest unmarked page in cache of process Pi at some point in a phase. Then process Pi
is a marked process by the time the request for page ui is served.

Proof: This follows from the definition of farthest unmarked page and the nature of the marking scheme employed
in our algorithm.

Lemma 4 Suppose that there is a request for page pi , which is associated with hole h. Suppose that process Pi owns
page pi . Then process Pi is already marked at the time of the present request for page pi .

Proof : Page pi is associated with hole h because process Pi evicted page pi when asked to make a page replacement
decision, in order to serve either a clean request or a page fault at the previous page associated with h. In either case,
page pi was a good page at the time process Pi made the particular paging decision. Since process Pi was unmarked
at the time the decision was made, pi was either the farthest unmarked page of process Pi then or some marked page
of process Pi whose next request is after the request for Pi 's farthest unmarked page. By Lemma 3, process Pi is a
marked process at the time of the request for page pi .

Lemma 5 Suppose that page pi is associated with hole h. Let Pi denote the process owning page pi . Suppose page
pi is requested at some time during the phase. Then hole h does not move to process Pi subsequently during the
current phase.

Proof : The hole h belongs to process Pi . By Lemma 4 when a request is made to h, Pi is already marked and will
remain marked until the end of the phase. Since only unmarked processes are chosen to evict pages, a request for
h thereafter cannot result in eviction of any page belonging to Pi , so a hole can never move to a process more than
once.

Let there be R unmarked processes at the time of a request to a hole h. For any unmarked process Pj, 1 ≤ j ≤ R,
let uj denote the farthest unmarked page of process Pj at the time of the request to hole h. Without loss of generality,
let us relabel the processes so that
    u_1, u_2, u_3, ..., u_R        (2.2)

is the temporal order of the first subsequent appearance of the pages uj in the global page request sequence.

Lemma 6 In the situation described in (2.2) above, suppose during the page request for hole h that the hole moves
to a good page pi of unmarked process Pi to serve the current request for h. Then h can never move to any of the
processes P_1, P_2, ..., P_{i-1} during the current phase.

Proof: The first subsequent request for the good page pi that Pi evicts, by definition, must be the same as or must
be after the first subsequent request for the farthest unmarked page ui. So process Pi will be marked by the next
time hole h is requested, by Lemma 4. On the other hand, the first subsequent requests of the respective farthest
unmarked pages u_1, ..., u_{i-1} appear before that of page ui. Thus, by Lemma 3, the processes P_1, P_2, ..., P_{i-1} are
already marked before the next time hole h (page pi) gets requested and will remain marked for the remainder of the
phase. Hence, by the fact that only unmarked processes get chosen, hole h can never move to any of the processes
P_1, P_2, ..., P_{i-1}.

Next we use the concept of holes to prove a lower bound on the competitive ratio of any deterministic online
multi-application caching algorithm, and an upper bound on the competitive ratio of the algorithm by [CFL94a].

2.6.3 Relation to the algorithm of Cao et al.


The algorithm proposed by Cao, Felten and Li [CFL94a] for the multi-application caching problem amounts to
evicting, at the time of a page fault, the farthest page from cache belonging to the process that owns the LRU page
in cache. Thus, for a given interleaving of individual request sequences, the paging decisions made by that algorithm
are deterministic. We now prove that no deterministic online algorithm for the multi-application caching problem can
have a competitive ratio better than P + 1, and that the algorithm proposed by Cao, Felten,
and Li [CFL94a] attains a competitive ratio of 2P + 2. The performance of the algorithm in [CFL94a] is thus
within a factor of two of the best possible performance by any deterministic online algorithm for the multi-application
caching problem.

Theorem 2 The algorithm of Cao, Felten, and Li is (2P + 2)-competitive. A generalized version of the algorithm
of Cao, Felten, and Li in which, at the time of a page fault, the process owning the LRU page in cache evicts any
(deterministically chosen) good page, is also (2P + 2)-competitive.

Proof: Let there be ℓ clean page requests in a phase. Then there are ℓ faults corresponding to clean page requests,
resulting in ℓ holes. The algorithm evicts only good pages from cache, so holes are associated only with such pages.
By Lemma 5 we can conclude that each hole can result in at most one page fault per process up to the end of the

phase, so that the total number of page faults in the phase is bounded by ℓ + ℓP. Using Lemma 1 gives the above
competitive factor.

Theorem 3 The competitive ratio of any deterministic online algorithm for the multi-application caching problem
is at least min{P + 1, k}.

Proof: We first consider the case in which P + 1 ≤ k. Since the algorithm is deterministic, we can construct an
interleaving that costs the algorithm a factor of P + 1 times the number of faults that OPT will incur. For instance,
consider a single clean request in each phase. On the basis of our knowledge of the deterministic choices made by
the algorithm, we can easily make the resulting hole visit each process at least once so that the deterministic online
algorithm incurs at least P + 1 faults per phase whereas OPT incurs just one fault.
On the other hand, in the case when P + 1 > k, the classical paging lower bound of k from [ST85] becomes
applicable, hence proving the theorem.

Given the notion of good pages that we developed above, it turns out that we can define a slightly more general
version of the algorithm in [CFL94a] without changing its paging performance. Basically, in order to attain the
competitive ratio of 2P + 2, it is enough, at the time of a page fault, to evict from cache any good page belonging to
the process that owns the LRU page in cache. We say that this is a slightly more general version than the algorithm
presented in [CFL94a] because it may very often be the case that the good set of the process that owns the LRU
page contains several pages other than the farthest page of that process.

2.7 Competitive Analysis of our Online Algorithm


Our main result is Theorem 4, which states that our online algorithm for the multi-application caching problem is
roughly 2 ln P -competitive, assuming application processes always make good decisions (e.g., if each process knows
its own future page requests). By the lower bound of Theorem 1, it follows that our algorithm is optimal in terms of
competitive ratio up to a factor of 2.

Theorem 4 The competitive ratio of our online algorithm in Section 2.4 for the multi-application caching problem,
assuming that good evictions are always made, is at most 2H_{P-1} + 2. Our competitive ratio is within a factor of
about 2 of the best possible competitive ratio for this problem.

The rest of this section is devoted to proving Theorem 4. To count the number of faults generated by our algorithm
in a phase, we make use of the properties of holes from the previous section. If ℓ requests are made to clean pages

during a phase, there are ℓ holes that move about during the phase. We can count the number of faults generated by
our algorithm during the phase as

    ℓ + Σ_{i=1}^{ℓ} N_i ,        (2.3)

where Ni is the number of times hole hi is requested during the phase. Assuming good decisions are always made,
we will now prove for each phase and for any hole hi that the expected value of Ni is bounded by H_{P-1}.
Consider the first request to a hole h during the phase. Let R_h be the number of unmarked processes at that
point in time. Let C_{R_h} be the random variable associated with the number of page faults because of requests to hole
h during the phase.

Lemma 7 The expected number E(C_{R_h}) of page faults caused by requests to hole h is at most H_{R_h}.

Proof: We prove this by induction over R_h. We have E(C_0) = 0 and E(C_1) = 1. Suppose for 0 ≤ j ≤ R_h - 1 that
E(C_j) ≤ H_j. Using the same terminology and notation as in Lemma 6, let the farthest unmarked pages of the R_h
unmarked processes at the time of the request for h appear in the temporal order

    u_1, u_2, u_3, ..., u_{R_h}

in the global request sequence. We renumber the R_h unmarked processes for convenience so that page ui is the
farthest unmarked page of unmarked process Pi.
When the hole h is requested, our algorithm randomly chooses one of the R_h unmarked processes, say, process
Pi, and asks process Pi to evict a suitable page. Under our assumption, the hole h moves to some good page pi of
process Pi. From Lemmas 5 and 6, if our algorithm chooses unmarked process Pi so that its good page pi is evicted,
then at most R_h - i processes remain unmarked the next time h is requested. Since each of the R_h unmarked
processes is chosen with probability 1/R_h, we have

    E(C_{R_h}) ≤ 1 + (1/R_h) Σ_{i=1}^{R_h} E(C_{R_h - i})
               = 1 + (1/R_h) Σ_{i=0}^{R_h - 1} E(C_i)
               ≤ 1 + (1/R_h) Σ_{i=0}^{R_h - 1} H_i
               = H_{R_h}.

The last equality follows easily by induction and algebraic manipulations.
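
One quick way to confirm the final step is to evaluate the recurrence exactly (treating the inequalities as equalities) and compare the result with the harmonic numbers:

```python
from fractions import Fraction

def harmonic(n):
    return sum(Fraction(1, i) for i in range(1, n + 1))

# E(C_0) = 0 and E(C_n) = 1 + (1/n) * sum_{i=0}^{n-1} E(C_i) in the tight case.
E = [Fraction(0)]
for n in range(1, 20):
    E.append(1 + sum(E) / n)
    assert E[n] == harmonic(n)   # the recurrence solves to exactly H_n
```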

We now complete the proof of Theorem 4. By Lemma 4 the maximum possible number of unmarked processes
at the time a hole h is first requested is P - 1. Lemma 7 implies that the average number of times any hole can
be requested during a phase is bounded by H_{P-1}. By (2.3), the total number of page faults during the phase is
at most ℓ(1 + H_{P-1}). We have already shown in Lemma 1 that the OPT algorithm incurs an amortized cost of
at least ℓ/2 for the requests made in the phase. Therefore, the competitive ratio of our algorithm is bounded by
ℓ(1 + H_{P-1})/(ℓ/2) = 2H_{P-1} + 2. Applying the lower bound of Theorem 1 completes the proof.
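
To give a feel for the magnitudes involved, the following snippet compares the bounds discussed in this chapter for one illustrative choice of P and k (the specific values are examples only):

```python
def H(n):
    return sum(1.0 / i for i in range(1, n + 1))

P, k = 64, 2**20                 # e.g. 64 processes sharing a million-page cache
print(2 * H(P - 1) + 2)          # our randomized bound:          about 11.5
print(2 * P + 2)                 # deterministic bound [CFL94a]:  130
print(H(k))                      # classical H_k bound:           about 14.4
```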

2.8 Application-Controlled Caching with Fairness


In this section we analyze our algorithm's performance in the realistic scenario where application processes can make
mistakes, as defined in Definition 2. We bound the number of page faults it incurs in a phase in terms of page faults
caused by mistakes made by application processes during that phase. The main idea here is that if an application
process Pi commits a mistake by evicting a certain page p and then during the same phase requests page p while
process Pi is still unmarked, our algorithm makes process Pi pay for the mistake in Step 2(a)i.
On the other hand, if page p's eviction from process Pi was a mistake, but process Pi is marked when page p is
later requested in the same phase, say, at time t, then process Pi's mistake is "not worth detecting" for the following
reason: Since evicting page p was a mistake, it must mean that at the time t_1 of p's eviction, there existed a set U
of one or more unmarked pages of process Pi in cache whose subsequent requests appear after the next request for
page p. Process Pi is marked at the time of the next request for p, implying that all pages in U were evicted by
Pi at some times t_2, t_3, ..., t_{|U|+1} after the mistake of evicting p. If instead at times t_1, t_2, ..., t_{|U|+1} process Pi
makes the specific good paging decisions of evicting the farthest unmarked pages, the same set {p} ∪ U of pages will
be out of cache at time t. In our notion of fairness we choose to ignore all such mistakes and consider them "not worth
detecting."

Definition 4 During an ongoing phase, any page fault corresponding to a request for a page p of an unmarked
process Pi is called an unfair fault if the request for page p is not a clean page request. All faults during the phase
that are not unfair are called fair faults.

The unfair faults are precisely those page faults which are caused by mistakes considered "worth detecting." We
state the following two lemmas that follow trivially from the de nitions of mistakes, good decisions, unfair faults,
and fair faults.

Lemma 8 During a phase, all page requests that get processed in Step 2(a)i of our algorithm are precisely the unfair
faults of that phase. That is, unfair faults correspond to mistakes that get caught in Step 2(a)i of our algorithm.

Lemma 9 All fair faults are precisely those requests that get processed in Step 2(b)iii.
We now consider the behavior of holes in the current mistake-prone scenario.

Lemma 10 The number of holes in a phase equals the number of clean pages requested in the phase.

Lemma 11 Consider a hole h associated with a page p of a process Pi . If a request for h is an unfair fault, process
Pi is still unmarked and the hole h moves to some other page belonging to process Pi . If a request for hole h is a fair
fault, then process Pi is already marked and the hole h can never move to process Pi subsequently during the phase.

Proof: If the request for hole h is an unfair fault, then by definition process Pi is unmarked and, by Lemma 8, h
moves to some other page p' of process Pi. If the request for h is a fair fault, then by definition and the fact that the
request for h is not a clean page request, process Pi is marked. Since our algorithm never chooses a marked process
for eviction, it follows that h can never visit process Pi subsequently during the phase.

During a phase, a hole h is created in some process, say P1 , by some clean page request. It then moves around
zero or more times within process P1 on account of P1 's mistakes, until a request for hole h is a fair fault, upon which
it moves to some other process P2 , never to come back to process P1 during the phase. It behaves similarly in process
P2 , and so on up to the end of the phase. Let Th denote the total number of faults attributed to requests to hole h
during a phase, of which Fh faults are fair faults and Uh faults are unfair faults. We have Th = Fh + Uh .
By Lemma 11 and the same proof techniques as those in the proofs of Lemma 7 and Theorem 4, we can prove
the following key lemma:

Lemma 12 The expected number E(F_h) of page requests to hole h during a phase that result in fair faults is at
most H_{P-1}.

By Lemma 10, our algorithm incurs at most ℓ + Σ_{i=1}^{ℓ} T_{h_i} page faults in a phase with ℓ clean page requests. The
expected value of this quantity is at most ℓ(H_{P-1} + 1) + Σ_{i=1}^{ℓ} U_{h_i}, by Lemma 12.
The expression Σ_{i=1}^{ℓ} U_{h_i} is the number of unfair faults, that is, the number of mistakes considered "worth
detecting." Our algorithm is very efficient in that the number of unfair faults is an additive term. For any phase
with ℓ clean requests, we denote Σ_{i=1}^{ℓ} U_{h_i} by M.

Theorem 5 The number of faults in a phase with ℓ clean page requests and M unfair faults is bounded by
ℓ(1 + H_{P-1}) + M. At the time of each of the M unfair faults, the application process that makes the mistake that
causes the fault must evict a page from its own cache. No application process is ever asked to evict a page to service
an unfair fault caused by some other application process.

2.8.1 Extending Fairness to the algorithm by Cao et al.
It turns out that our notion of fairness extends, without any change, to the generalized version of the deterministic
algorithm of [CFL94a] that we mentioned in Section 2.6.3. It is easy to see that in the case of the generalized version
of the algorithm of [CFL94a], a process incurs an unfair fault only if, at some time in the past, that process had the
LRU page and the page it evicted was not a good page. Consequently, a result similar to Theorem 5 with (1 + H_{P-1})
replaced by (1 + P ) holds for the generalized version of the algorithm of [CFL94a].

2.9 Conclusions and Future Work


Cache management strategies are of prime importance for high performance computing. We consider the case where
there are P independent processes running on the same computer system and sharing a common cache of size k.
Applications often have advance knowledge of their page request sequences. In this chapter we have addressed the issue
of exploiting this advance knowledge to devise intelligent strategies to manage the shared cache, in a theoretical
setting. We have presented a simple and elegant randomized application-controlled caching algorithm for the multi-
application caching problem that achieves a competitive ratio of 2H_{P-1} + 2. Our result is a significant improvement
over the competitive ratios of 2P + 2 [CFL94a] for deterministic multi-application caching and Θ(H_k) for classical
caching, since the cache size k is often orders of magnitude greater than P. We have proven that no online algorithm
for this problem can have a competitive ratio smaller than min{H_{P-1}, H_k}, even if application processes have perfect
knowledge of individual request sequences. We conjecture that an upper bound of H_{P-1} can be proven, up to second-
order terms, perhaps using techniques from [MS91], although the resulting algorithm is not likely to be practical.
Using our notion of mistakes we are able to consider a more realistic setting when application processes make bad
paging decisions and show that our algorithm is a fair and efficient algorithm in such a situation. No application
needs to pay for some other application process's mistake, and we can bound the global caching performance of our
algorithm in terms of the number of mistakes. Our notions of good page replacement decisions, mistakes, and fairness
in this context are new.
One related area of possible future work is to consider alternative models to our model of worst-case interleaving:
A most challenging problem is to consider application-controlled paging with realistic constraints based on context
switching and CPU scheduling. In realistic systems, when a process incurs a page fault, it gets context switched
and some other process gets scheduled according to the CPU scheduling algorithm. We note that the request
sequence that leads to a lower bound on the performance of online paging needs the adversary to be unconstrained.
The difficulty that arises from the context switching and CPU scheduling constraints is that the request sequence
cannot be considered to be fixed a priori. This is because context switch events depend on previous paging decisions
and also determine the interleaving of the request sequences, which in turn determines good paging decisions. We
therefore refer to the online paging problem under context switching and CPU scheduling constraints as a dynamic

paging problem. At this point, we do not know to what extent or under what restrictions the above problem
can be theoretically solved.
Another interesting area would be to consider caching in a situation where some applications have good knowledge
of future page requests while other applications have no knowledge of future requests. We could also consider pages
shared among application processes.

Chapter 3
Simple Randomized Mergesort for
Parallel Disks
Summary
We consider the problem of sorting a file of N records on the D-disk model of parallel I/O in which there are two
sources of parallelism: block transfer and multiple disks. Records are transferred to and from disk concurrently
in blocks of B contiguous records. In each I/O operation, up to one block can be transferred to or from each of
the D disks in parallel. We propose a simple, efficient, randomized mergesort algorithm called SRM that uses a
forecast-and-flush approach to overcome the inherent difficulties of simple merging on parallel disks. SRM exhibits a
limited use of randomization and also has a useful deterministic version. Generalizing the technique of forecasting,
our algorithm is able to read in, at any time, the "right" block from any disk, and using the technique of flushing, our
algorithm evicts, without any I/O overhead, just the "right" blocks from memory to make space for new ones to be
read in. The disk layout of SRM is such that it enjoys perfect write parallelism, avoiding fundamental inefficiencies
of previous mergesort algorithms. By analysis of generalized maximum occupancy problems we are able to derive an
analytical upper bound on SRM's expected overhead valid for arbitrary inputs.
The upper bound derived on expected I/O performance of SRM indicates that SRM is provably better than
disk-striped mergesort (DSM) for realistic parameter values D, M , and B. Average-case simulations show further im-
provement on the analytical upper bound. Unlike previously proposed optimal sorting algorithms, SRM outperforms
DSM even when the number D of parallel disks is small.

3.1 Introduction
The classical problem of sorting and related processing is reported to consume roughly 20 percent of computing
resources in large-scale installations [Knu98, LV85]. In the light of the rapidly increasing gap between processor
speeds and disk memory access times, popularly referred to as the I/O bottleneck, the specific problem of external
memory sorting assumes particular importance. In external sorting, the records to be sorted are simply too many
to fit in internal memory and so have to be stored on disk, thus necessitating I/O as a fundamental, frequently used
operation during sorting.
One way to alleviate the effects of the I/O bottleneck is to use parallel disk systems [HGK+94, PGK88, Uni89,
GS84, Mag87]. Aggarwal and Vitter [AV88], generalizing initial work done by Floyd [Flo72] and Hong and Kung

Figure 3.1: The Parallel Disk Model [VS94]: D, B, M, and N respectively denote the number of independent disk
drives, the size of each disk block, the size of internal memory, and the size of the input. The D disks are connected
to the internal memory (of size M ≥ 2DB) by block I/O.

[HK81b], laid the foundation for I/O algorithms by studying the I/O complexity of sorting and related problems.
The model they studied [AV88] considers an internal memory of size M and I/O reads or writes that each result in a
transfer of D blocks, where each block is comprised of B contiguous records, from or to disks. Subsequently, Vitter
and Shriver [VS94] considered the realistic, two-level Parallel Disk Model (PDM) illustrated in Figure 3.1, in which
secondary memory is partitioned into D physically distinct and independent disk drives or read-write heads that can
simultaneously transmit a block of data, with the requirement that M ≥ 2DB.
In the D-disk two-level memory hierarchy [VS94], there are two sources of parallelism. First, as in traditional I/O
systems, records are transferred concurrently in blocks of B contiguous records. Secondly, in a single I/O operation,
each of the D disks can simultaneously transfer one (but only one) block of B records, so that each I/O operation
can potentially transfer DB records in parallel.
To be more precise, D is the number of blocks that can be transmitted in parallel at the speed at which data comes
off or goes onto the magnetic disk media. The parameter D may thus be smaller than the actual number of disks if
the channel bandwidth from the disks is not sufficient for each disk to transmit or receive a block simultaneously. It
then might be useful to consider two disk parameters D and D', where D is the channel bandwidth in terms of the
number of blocks coming off or going onto disk that can be transferred simultaneously, and D' is the number of disks
sharing the bandwidth. As long as at least D of the D' disks have a block to transmit, the I/O channel can remain
busy. Hybrid models with multiple channels and several disks per channel are also possible. In this chapter, we adopt
the simpler and more restrictive model in which D = D' and develop algorithms that are near-optimal even for this
more restrictive model. A more detailed discussion of disk models and characteristics appears in [RW94].
The problem of external sorting in the D-disk model has been extensively studied. Algorithms have been developed
that use an asymptotically optimal Θ((N/(DB)) log(N/B)/log(M/B)) number of I/O operations¹, as well as doing an
optimal amount of work in internal memory.
The previously developed sorting algorithms for the D-disk model have larger than
desired constant factors, and some are complicated to implement. As a result, the simple technique of mergesort
with disk striping (DSM), which can be asymptotically sub-optimal in terms of number of I/Os by a multiplicative
factor of ln(M/B), is commonly used in practice for external sorting on parallel disks [NV90, NV93]. In DSM, the
disks are coordinated so that in each parallel read and write, the locations of the blocks accessed on each disk are
the same, which has the logical effect of sorting with D' = 1 disk and block size B' = DB. In each merge pass,
R = Θ(M/(DB)) runs are merged together at a time, except possibly in the last pass. The number of passes required
is thus ln(N/M)/ln(M/(DB)), with an additional pass for initial run formation. The resulting I/O complexity of DSM
is Θ((N/(DB))(1 + log(N/M)/log(M/(DB)))). The advantage of DSM is its simplicity, which results in very small
constant factors, thus making it attractive for efficient implementation on existing parallel disk systems. But the
disadvantage of DSM is that it becomes inefficient as the number of disks gets larger.

¹Throughout this chapter we use the standard asymptotic notation f(n) = O(g(n)) to mean that there are constants
C, n_0 > 0 such that |f(n)| ≤ C|g(n)| for all n ≥ n_0. We say that f(n) = Ω(g(n)) if g(n) = O(f(n)). We write
f(n) = Θ(g(n)) if f(n) = O(g(n)) and f(n) = Ω(g(n)). We say that f(n) = o(g(n)) if f(n)/g(n) → 0 as n → ∞. The
above asymptotic notations can be applied in a generalized manner as well. For example, we write f(n) = g(n) + O(h(n))
if f(n) - g(n) = O(h(n)), and we say that f(n) ≤ g(n) + O(h(n)) if there is some j(n) such that f(n) ≤ j(n) and
j(n) = g(n) + O(h(n)).
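
A rough way to see the effect of DSM's smaller fan-in is to count merge passes directly. The parameter values below are illustrative only, and the sketch assumes initial runs of length about M:

```python
import math

def merge_passes(N, M, fan_in):
    """Merge passes needed after run formation (initial runs of length about M)."""
    runs = math.ceil(N / M)
    return 0 if runs <= 1 else math.ceil(math.log(runs) / math.log(fan_in))

# Example only: N = 2^40 records, M = 2^25 records of memory, B = 2^13, D = 256 disks.
N, M, B, D = 2**40, 2**25, 2**13, 256
print(merge_passes(N, M, max(2, M // (D * B))))   # DSM fan-in M/(DB) = 16   -> 4 passes
print(merge_passes(N, M, max(2, M // B)))         # full fan-in  M/B  = 4096 -> 2 passes
```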
Thus, on the one hand we have external sorting algorithms that are optimal (up to a constant factor) but are not
efficient in practice unless the number of disks is very large, and on the other hand we have a simple disk striping
technique that works well with a small number of disks.
In order to realize an optimal and practical D-disk external mergesort, it is necessary to merge R = Θ(M/B)
runs at a time in such a manner that the number of I/Os required in each merge pass is Θ(N/(DB)). An interesting
observation made in [Knu98, Section 5.4.9, Page 370] by Knuth is that the so-called Gilbreath principle, based upon
an elegant card trick, can be used to design a simple I/O algorithm to merge together R = 2 runs striped across
D disks in the ideal number of I/O operations. But there is no analog of the Gilbreath principle when R > 2.
Fundamental difficulties need to be overcome while merging together a large number of runs striped across D disks
when the average amount of internal memory available as buffer space for each run is only O(1) blocks.
In this chapter, we present an elegant technique to merge R = Θ(M/B) runs striped across D disks using parallel
I/O operations. Our Simple Randomized Mergesort (SRM) technique is practical and provably efficient for a wide
range of values of D. We show that SRM is asymptotically optimal for sorting within very small constant factors
when M = Ω(DB log D), which is generally satisfied in practice. Moreover, even for ranges of D, M, and B when
SRM is sub-optimal, it still remains efficient and faster than DSM, as borne out by our theoretical analysis and
SRM is sub-optimal, it still remains ecient and faster than DSM, as borne out by our theoretical analysis and
empirical simulations.
In the next section we describe SRM and some previous approaches to external mergesort, and we list our main
theoretical result, Theorem 6, which gives an analytical upper bound on SRM's expected number of I/Os in the worst
case. Sections 3.3–3.8 focus on SRM and its analysis. Section 3.3 discusses the basic idea of SRM and the way that
records are distributed on the disks, which is important to attain efficient read and write parallelism. Section 3.4
discusses the forecasting data structure needed for parallel reads. Section 3.5 describes the merge process in detail.
We analyze the merge process in Sections 3.6, 3.7, and 3.7.1 and prove Theorem 6. In Section 3.8, we briefly discuss
the potential benefits of a deterministic version of our algorithm.
In Sections 3.9 and 3.10, we demonstrate the practical merit of SRM by comparison with DSM. We base our
comparisons on both our analytical worst-case expected bounds as well as the more optimistic empirical simulations
of average-case performance, for a wide variety of values for the parameters M , D, and B.

3.2 Main Results


3.2.1 Background
The first step in mergesorting a file of N records is to sort the records internally, one half memoryload at a time
so as to overlap computation with I/O, to get 2N/M sorted runs each of length M/2. Techniques like replacement
selection [Knu98] can produce roughly N/M runs each of length about M. Then, in a series of passes over the file, R
sorted runs of records are repeatedly merged together until the file is sorted. At each step of the merge, the record
with the smallest key value among the R leading records of the R runs being merged is appended to the output run.
The merge order parameter R is preferably chosen as close to M/2B as possible so as to reduce the number of passes
and still allow double buffering.
In order to carry out a merge, the leading portions of all the runs need to be in main memory. As soon as any
run's leading portion in memory gets depleted by the merge process, an I/O read operation is required. Since a large
number of runs (close to M/2B runs) are merged together at a time, the amount of memory available for buffering
leading portions of runs is very small on average. It is of paramount importance that only those blocks that will be
required in the near future with respect to the merge process are fetched into main memory by the parallel reads.
However, this memory management is difficult to achieve since the order in which the data in various disk blocks will
participate in the merge process is input dependent and unknown. Vitter and Shriver [VS94] give further intuition
regarding the difficulty of mergesorting on parallel disks. In Greed Sort [NV90], the trick of "approximate merging"
is used to circumvent this difficulty. Aggarwal and Plaxton [AP94] use the Sharesort technique that does repeated
merging with accompanying overhead.
Recently, Pai et al [PSV94] considered the average-case performance of a simple merging scheme for R = D sorted
runs, one run on each disk. They use an approximate model of average-case inputs and require that the internal
memory be sufficiently large. They also require that each run reside entirely on a single disk; in order to get full
output bandwidth, the output run must be striped across disks. A mergesort based upon their merge scheme thus
requires an extra transposition pass between merge passes so that striped output runs of the previous merge pass can
be realigned onto individual disks.

3.2.2 Overview of SRM
We present an efficient, practical, randomized algorithm for external sorting called Simple Randomized Mergesort
(SRM). Our algorithm uses a generalization of the forecasting technique [Knu98] in order to carry out parallel
prefetching of appropriate disk blocks during merging. Randomization is used only while choosing, for each input
run, the disk on which to place the first block of the run; the remaining blocks of that run are cyclically striped across
the disks. The use of randomization is thus very restricted; the merging itself is deterministic.
If the internal memory available is of size M, with block size B and with D parallel disks, SRM merges R runs
together at a time in each merge pass, where R is the largest integer satisfying M/B ≥ 2R + 4D + RD/B. We note
that M/B is the number of internal memory blocks available. Hence the merge order (the number of runs SRM
merges together at a time) is determined by the amount of internal memory at its disposal. As a function of M, B,
and D, the merge order R is given by (M/B - 4D)/(2 + D/B). In the realistic case that D = O(B), SRM merges
an optimal number (namely, R = Θ(M/B)) of runs together at a time. For simplicity in the exposition, we assume
that D = O(B). (We can use the partial striping technique of [VS94] to enforce the assumption if needed.)
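For concreteness, the merge order follows from the memory budget as in the following small helper (a sketch; the function name is ours, and M and B are measured in records as in the text):

    def merge_order(M, B, D):
        # Largest R satisfying M/B >= 2R + 4D + R*D/B, i.e. the floor of
        # (M/B - 4D) / (2 + D/B).
        return int((M / B - 4 * D) / (2 + D / B))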
There are two aspects to our analysis of the algorithm: one theoretical and one practical. The first (and main)
part of our analysis is bounding the expected number of I/Os of SRM, where the expectation is with respect to the
randomization internal to SRM and the bound on the expectation holds for any input. SRM uses an elegant internal
memory management scheme to solve the fundamental prefetching difficulties of simple merging on parallel disks
alluded to earlier in Section 3.2.1. Below we briefly sketch this technique:
At any time while merging together R runs, let us consider how SRM reads in the next R blocks on disk required
by the merge process. Observing that the next R blocks on disk will not, in general, be uniformly distributed among
the D disks, we denote by d the maximum number of blocks among these R blocks that reside on a single disk.
The blocks of the R input runs are striped across the disks. In order to maximize parallelism, in each I/O read
operation, SRM fetches into memory one block from each disk whenever there is space to accommodate D blocks in
main memory. In any I/O read operation, the block that is read in from each disk is the one that contains the smallest
key among all the input blocks on that disk that are not yet in memory. This is achieved by using the forecasting
technique. In the event that there is only enough free memory to read in F additional disk blocks, where F < D,
SRM flushes out D − F previously read blocks from memory back to disk. Among the blocks already in memory,
the D − F blocks chosen for flushing are precisely the D − F blocks that will participate in the merge farthest in
the future. The internal memory buffer that SRM uses is just large enough to ensure that no block that is among
the R blocks required next by the merge is ever flushed out. Moreover, there is no actual I/O that accompanies the
flushing operation; SRM merely pretends that those blocks had never been read into memory. The forecasting and
flushing are enough to ensure that bringing precisely those R blocks that are next required by the merge from disk
into memory will take no more than d I/O read operations.
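To make the flushing rule concrete, here is a minimal sketch of the selection of victims; blocks are represented as dictionaries with illustrative fields (is_leading, smallest_key), relying on the fact that non-leading blocks participate in order of their smallest keys:

    def blocks_to_flush(in_memory_blocks, free_blocks, D):
        # If only F = free_blocks < D internal blocks are free, virtually flush
        # the D - F non-leading blocks that participate farthest in the future,
        # i.e. those with the largest smallest-key values; no write-back I/O is
        # performed, the blocks are simply treated as never having been read.
        if free_blocks >= D:
            return []
        candidates = [b for b in in_memory_blocks if not b["is_leading"]]
        candidates.sort(key=lambda b: b["smallest_key"], reverse=True)
        return candidates[:D - free_blocks]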
This memory management scheme enables us to obtain a handle on SRM's performance. We are able to relate

SRM's performance to a combinatorial problem we call the dependent maximum occupancy problem, which is a
generalization of the classical maximum occupancy problem [KSC78, VF90].
Our main theorem below gives expressions for the I/O performance of SRM for three patterns of growth rate
between the number M/B of blocks in internal memory and the number D of disks in the parallel disk system. The
different sizes of internal memory considered correspond to certain values of the merge order R as defined by the
formula R = (M/B − 4D)/(2 + D/B) given above. The general formulation in terms of occupancy statistics is given
later.

Theorem 6  Under the assumption that D = O(B), the merge order R = Θ(M/B) of SRM is optimal. SRM mergesort uses (N/DB)(1 + ln(N/M)/ln R) I/O write operations, which is optimal. (Strictly speaking, the expressions for the exact number of "passes," and hence for the number of I/O operations required to sort N records using a mergesort, involve applying the ceiling function to certain logarithmic terms. Employing these exact expressions would mean dealing with a complicated dependence upon N without any significant impact on our results for large N; thus in this exposition we choose to overlook this issue and not apply the ceiling function, obtaining simplified expressions for the I/O complexity.) The expected number Reads_SRM of I/O read operations done by SRM can be bounded from above as follows:

1. If R = kD for constant k, as R, D → ∞, we have
$$\mathit{Reads}_{SRM} \le \frac{N}{DB}\,\frac{\ln(N/M)}{\ln(kD)}\,\frac{\ln D}{k\ln\ln D}\left(1 + \frac{\ln\ln\ln D}{\ln\ln D} + \frac{1+\ln k}{\ln\ln D} + O\!\left(\frac{(\log\log\log D)^2}{(\log\log D)^2}\right)\right) + \frac{N}{DB}.$$

2. If R = rD ln D for constant r, as R, D → ∞, we have
$$\mathit{Reads}_{SRM} \le \frac{N}{DB} + c\,\frac{N\ln(N/M)}{DB\ln(rD\ln D)} + O(1),$$
which is optimal within a constant factor, namely c. The magnitude of c depends upon r.

3. If R = rD ln D where r = ω(1), we have
$$\mathit{Reads}_{SRM} \le \frac{N\ln(N/M)}{DB\ln(rD\ln D)}\left(1 + \sqrt{\frac{2}{r}} + \frac{\ln r}{2\sqrt{2r}\,\ln D} + O\!\left(\frac{1}{r} + \frac{\log r}{\sqrt{r}\log D} + \frac{1}{D\sqrt{r}}\right)\right) + \frac{N}{DB},$$
which is asymptotically optimal; that is, the factor of proportionality multiplying the (N/DB) ln(N/M)/ln(rD ln D) term is 1.

The mergesort algorithm by Pai et al. [PSV94] uses significantly more I/Os and internal memory. In their scheme,
the internal memory size needs to be Ω(D²B) to attain an efficient merging algorithm.
The other aspect of our analysis deals with the practical merit of our SRM mergesort algorithm on existing
parallel disk systems. In Sections 3.9 and 3.10, we consider implementations of disk striping (DSM) and SRM
mergesort algorithms for the interesting case R = kD, in which the number of blocks in internal memory scales
linearly with the number of disks D being used. We demonstrate that SRM's good I/O performance is not merely
asymptotic, but extends to practical situations as well. Like DSM, SRM also overlaps I/O operations and internal
computation, which is important in practice.

3.3 SRM Data Layout


Our primary goal in this chapter is to take advantage of mergesort's simplicity and potential for high efficiency in
practice when used with parallel disk systems. We use striped input runs in our merging procedure so that the output
runs of one merge pass, written with full write parallelism, can participate as input runs in the next pass without any
reordering or transposition. In this scheme, if the 0th block of a run r is on disk d_r, then the ith block resides on disk
(i + d_r) mod D. Striping alone is not enough to ensure good performance. If many runs are being merged and the
disk d_r for each run r is chosen deterministically, the merging algorithm can have abysmal worst-case performance.
Throughout the entire duration of the merge, the R leading blocks of the R runs may always lie on the same disk,
thus causing I/O throughput to be only a factor of 1/D of optimal.
A natural alternative that we pursue in this chapter is to consider a randomized approach to mergesort. We
randomly assign the starting disk dr for each input run r. Then, on average, it is not necessary to read many blocks
from the same disk at one time. The analysis involves an interesting reduction to a maximum bucket occupancy
problem.
In order to facilitate analysis, for each run r we choose the disk d_r on which r's initial block will reside
independently and identically with uniform probability. The location of every subsequent block of run r is determined
deterministically by cycling through the disks. We will measure our algorithm's performance by bounding the expected
number of I/Os required to merge R runs for any arbitrary (that is, worst-case) set of input runs; the expectation
is with respect to the randomization in the placement of the first block of the runs. The actual key values of the
records that constitute the runs can be arbitrary, and their relative order does not affect the bounds we derive.
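The layout rule itself is simple enough to summarize in a small sketch (function and parameter names are ours):

    import random

    def assign_disks(run_lengths_in_blocks, D, rng=random):
        # For each run r, pick a uniformly random starting disk d_r; block i of
        # run r then resides on disk (d_r + i) mod D.  Returns one list of disk
        # numbers per run.
        layout = []
        for length in run_lengths_in_blocks:
            d_r = rng.randrange(D)
            layout.append([(d_r + i) % D for i in range(length)])
        return layout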
If we consider the problem of deterministic mergesorting with random inputs, as opposed to randomized mergesorting,
we believe that the same analysis technique can yield similar bounds (but for the average case) for a version
of our algorithm in which the starting disk d_r for run r is deterministically chosen to be uniformly staggered, so that
d_r = 0 for r = 0, 1, ..., R/D − 1, d_r = 1 for r = R/D, R/D + 1, ..., 2R/D − 1, and so on. Moreover, as we indicate
in Section 3.8, by taking into consideration the lengths of the runs being merged in different stages of the mergesort, it
may be possible to improve the bound on overall average I/O performance. In this chapter, however, we concentrate
on our SRM method, since a very limited amount of explicit randomization in SRM (used to determine the random
starting disks d_r) enables us to get around the average-case assumption of the deterministic version, thus facilitating
the elegant analysis presented here.

3.4 Forecasting Format and Data Structure


As observed earlier, each of the R runs that participate in a merge is output onto disk either during the run formation
stage or during some earlier merge. This enables us to begin each run on a uniformly random disk. We can also
format each block of every run so as to implant some future information, as explained below. This is a generalization
of the forecasting technique of [Knu98]. We denote the ith block of a run r as b_{r,i} and the smallest (first) key in
block b_{r,i} as k_{r,i}. Blocks of a run r are formatted as follows:

- Implanted in the initial block b_{r,0} of run r are the key values k_{r,j} for 0 ≤ j ≤ D − 1. (Typically in practice,
the D key values will indeed fit in a small portion of one block. Even if this is not so, since D = O(B), this
information will fit in at most O(1) blocks.)

- Implanted in the ith block b_{r,i}, where i > 0, of run r is the single key value k_{r,i+D}.

The extra information for forecasting in each block is only one key value, not one record, so the extra space taken
up by the forecasting format of runs is negligible. For simplicity, we assume that all key values are distinct.

Definition 5  At any time t during the merge, consider the unprocessed portions of the R runs. Among these records,
the record with the smallest key is called the next record of the merge at time t. A block belonging to any run is
said to begin participating in the merge process at the time t when its first record becomes the next record of the
merge. A block ceases to participate in the merge as soon as all of its records get depleted by the merge. At any
point of time, consider all the blocks not in memory that reside on disk i. The smallest block of run j on disk i is
the block that will be the earliest participating block among all the blocks of run j on disk i. The smallest block on
disk i is the earliest participating block on disk i at that time. At any time, the leading block of a run is the
block (possibly partly consumed) that contains the smallest key value in that run at that time. Note that a block
can become a leading block well before it begins to participate in the merge.

On an I/O read, the block read in by SRM from disk i is always the smallest block on disk i at that time. To be
able to do this, SRM maintains a forecasting data structure.

Definition 6  The forecasting data structure (FDS) consists of D arrays H_0, H_1, ..., H_{D−1}, one corresponding to
each disk. At any point of time, H_i[j] stores K_{i,j}, where K_{i,j} is the smallest key value in the smallest block of run j
on disk i.

At any time, in order for SRM to read in the smallest block from disk i, SRM merely needs to read in the smallest
block of run j on disk i, where j is the run with the smallest key in H_i.
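As an illustration (not the thesis implementation), the FDS can be realized as D arrays of keys with two operations: updating an entry when a block is read or virtually flushed, and selecting the run whose next block on a given disk participates earliest:

    class ForecastingDataStructure:
        # H[i][j] holds K_{i,j}: the smallest key in the smallest block of run j
        # still on disk i (an illustrative realization of Definition 6).
        INF = float("inf")

        def __init__(self, D, R):
            self.H = [[self.INF] * R for _ in range(D)]

        def update(self, disk, run, key):
            # Called when a block of `run` is read from (or virtually flushed
            # back to) `disk`: `key` is the smallest key of the new smallest
            # block of that run on that disk.
            self.H[disk][run] = key

        def smallest_block_run(self, disk):
            # The run whose smallest block on `disk` participates earliest.
            return min(range(len(self.H[disk])), key=lambda j: self.H[disk][j])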

3.5 The SRM Merging Procedure


In this section, we discuss how SRM merges R runs striped across D disks, with the randomized layout described in
Section 3.3.
The SRM merging process can be specified as a set of two concurrent logical control flows corresponding to internal
merge processing and I/O scheduling, respectively. By internal merge processing, we refer to the computation that
merges the leading portions of runs after these leading portions have been brought into internal memory. Internal
merge processing has to wait when some run's leading portion in memory gets depleted or when the write buffers
get full and need to be written to disk. By I/O scheduling we refer to the algorithm that decides which blocks to
bring in from the disks. I/O scheduling decisions are based upon information from the FDS data structure and the
amount of unoccupied internal memory. Implementing internal merge processing and the I/O schedule as concurrent
modules makes the overlapping of CPU computation and I/O operations an easier task.
Internal merge processing can be implemented using a variety of techniques; we refer the reader to [Knu98]. In
this section, we focus on the I/O scheduling algorithm. We first discuss issues pertaining to management of internal
memory and maintaining the FDS data structure during the merge. We then go on to introduce some terminology
and notation that helps us describe I/O operations and the I/O scheduling algorithm. The terminology we develop
here will also be used in subsequent sections when we analyze the algorithm.

3.5.1 Internal Memory Management


Management of internal memory blocks plays a key role in the interaction between the I/O scheduling algorithm and
internal merge processing. In SRM, internal memory management of merge data is done at block-level granularity,
so internal fragmentation within internal memory blocks is not an issue. By an internal block, we mean a set of
contiguous internal memory locations with fixed boundaries, large enough to hold B records. Note that the internal
structures pertaining to internal memory management need to be updated whenever internal merge processing is in
certain critical states and after every I/O read operation. In this subsection, we describe some of the more important
aspects of SRM's internal memory management.
The number M/B of blocks in internal memory, expressed in terms of R, B and D, is 2R + 4D + RD/B. Of
these, SRM maintains a dynamically changing partition {ML, MR, MD, MW} of 2R + 4D physical blocks of internal
memory.

Definition 7  The set ML of R internal blocks is maintained such that at any time, if the leading block of any run
is in internal memory, its internal block is in ML. The set MD contains D internal blocks maintained such that each
I/O read operation of SRM can read in D blocks from disks into the internal memory blocks of MD. The remaining
R + D internal memory blocks comprise the set MR. The sets are maintained such that as soon as there are D
unoccupied internal blocks, a parallel read can be initiated into MD. In addition, SRM uses a set MW of 2D internal
memory blocks as an output buffer.

The D internal memory blocks of MD are used specifically to initiate I/O read operations at the earliest possible
time, potentially bringing in blocks even before they begin participating. The output buffer MW is large enough to
ensure that the output run can be written out with full parallelism in the forecasting format described earlier.
The forecasting data structure FDS and related auxiliary data structures occupy not more than about RD/B
blocks.

3.5.2 Maintaining the dynamic partition of internal memory


As internal merging proceeds and the parallel I/O operations keep transferring blocks between the parallel disk system
and internal memory, internal memory blocks need to be exchanged among the sets ML , MR , MD , and MW in order
to ensure that the assertions in Definition 7 are met. Below, we describe three types of exchanges of internal memory
blocks in the management of ML , MR , and MD :

1. Consider any state of internal merge processing when the last record in the leading block of a run r gets
consumed by the merge. If run r's new leading block occupies an internal block of MR , then MR and ML
mutually exchange the internal blocks corresponding to the new and old leading blocks so that MR gets an
unoccupied internal block and ML gets an occupied one.

2. Consider any I/O read operation that has just completed. I/O read operations always read in blocks into the
D internal memory blocks of set MD . If the read operation brings into MD a block that is the leading block
of some run r, then ML and MD exchange internal blocks corresponding to the old and new leading blocks, so
that MD gets an unoccupied internal block and ML gets an occupied one.

3. Whenever MR has at least one unoccupied internal memory block and MD has at least one occupied internal
memory block, MR and MD exchange internal memory blocks so that MD obtains an unoccupied block and
MR gets an occupied one.

In addition, records are added to blocks of MW as the internal merging proceeds.

3.5.3 Maintaining the Forecasting Data Structure
Whenever a block belonging to run j is read into memory from disk i, the forecasting data from that block is used
to update the entry H_i[j] of the FDS.
We will see in Definition 10 that SRM may at times flush a block belonging to run j back to the disk i it originated
from. This does not require any I/O but merely requires that the entry H_i[j] of the FDS be updated with the smallest
key of the particular block being flushed. If more than one block from a particular run j is flushed together, all to
the same disk i, then the entry H_i[j] is updated with the smallest key value among all the blocks from run j being
flushed to disk i.

3.5.4 Terminology and Notation


Definition 8  Suppose we order a set A of blocks in ascending order by the blocks' smallest key values. We define
Rank_A(b), for b ∈ A, to be the rank of block b in this order. The first (smallest-valued) block has rank 1 and the last
(largest-valued) block has rank |A|. For any time t during the merge, we define the following terms:

- F_t denotes the set of full blocks in internal memory such that no block b in F_t is the leading block of any run
at time t.
- S_t denotes the set of D blocks, each of which is the smallest block on one of the D disks.
- Fset_t(l) = {b : b ∈ F_t, Rank_{F_t}(b) ≥ |F_t| − l + 1} denotes the set of the l highest ranked blocks of F_t. By
definition, no block of Fset_t(l) is a leading block.
- OutRank_t = min_{b ∈ S_t} {Rank_{F_t ∪ S_t}(b)} denotes the rank of the smallest-ranked block of S_t in the set F_t ∪ S_t.

Definition 9  The operation ParRead_t is executed at a time t only when D unoccupied internal blocks are available
in MD. Using the FDS, it reads in the set S_t of D blocks into MD, exchanging internal blocks of MD with ML as in
point 2 of Section 3.5.2 if necessary. The FDS is updated as described in Section 3.5.3 using the implanted information
in the blocks of S_t.

Definition 10  The operation Flush_t(j) virtually flushes out the j blocks of the set Fset_t(j) occupying internal blocks
in MR at time t. In SRM, we have j ≤ D whenever Flush_t(j) is invoked. The flushing of blocks is virtual in the
sense that the blocks do not actually have to be written out to disk, so there is no I/O involved. The FDS is updated as
described in Section 3.5.3 to reflect the fact that the flushed out blocks will have to be read back from their original
disks when needed.

An important feature of SRM, exploited in Section 3.6, is that leading blocks of runs are never flushed out.

3.5.5 I/O Scheduling Algorithm
In this subsection, we specify the parallel I/O schedule of SRM. As far as writes go, each write is executed with full
write parallelism as soon as the next output stripe of D formatted blocks is ready in the output buffer MW. With
respect to reads, the basic idea is to read in the D blocks of the set S_t whenever possible, flushing a few of the highest
ranked non-leading blocks in F_t if necessary.

1. At the beginning, SRM reads the first block from each of the R runs into the R internal blocks of ML using
parallel reads.

2. Until the merge is completed, whenever the I/O system is free at a time t when D unoccupied blocks are
available in MD, SRM does the following:

(a) If there are D unoccupied internal blocks in the set MR at time t, a ParRead_t operation is initiated.
(b) Else if the number of occupied internal blocks in MR at time t is R + extra, where 1 ≤ extra ≤ D, and if
OutRank_t > extra, a ParRead_t operation is initiated.
(c) Else if the number of occupied internal blocks in MR at time t is R + extra, where 1 ≤ extra ≤ D, and
if OutRank_t ≤ extra, a Flush_t(extra − OutRank_t + 1) operation is invoked, followed by initiation of a
ParRead_t operation.

In Step 2c, the flushed out blocks are the ones that will not be used in the "near future," as we will see in the
next section; the decision rule of Step 2 is sketched below. Lemma 13 then ensures that SRM merging runs to
completion, needing only the specified amount of memory.
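The read-scheduling decision of Step 2 can be summarized by the following illustrative Python sketch, which returns the action SRM would take given the counts and ranks defined above (argument names are ours, not the thesis's):

    def schedule_read(free_in_MR, occupied_in_MR, out_rank, R, D):
        # Returns ("read", 0) for a plain ParRead, or ("flush_then_read", n)
        # when n highest-ranked non-leading blocks must be virtually flushed
        # before the parallel read is issued.
        if free_in_MR >= D:                               # step 2(a)
            return ("read", 0)
        extra = occupied_in_MR - R                        # here 1 <= extra <= D
        if out_rank > extra:                              # step 2(b)
            return ("read", 0)
        return ("flush_then_read", extra - out_rank + 1)  # step 2(c)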

Lemma 13  Consider any ParRead_t operation invoked by SRM at time t. No block that is not in internal memory when
ParRead_t has completed begins participating in the merge until D unoccupied physical blocks become available in the
set MD of internal memory blocks. This ensures that there will be enough space for the next read operation to begin
before any outside block begins participating.

Proof: Consider step 2a of the algorithm: D unoccupied internal blocks are available in the set MR before ParRead_t
gets initiated. In this case, the requirement of having D unoccupied blocks in internal memory for the next parallel
read is trivially met by exchange operations as in point 3 of Section 3.5.2. Consider step 2b of the algorithm, in which
only D − extra unoccupied blocks are available in MR but OutRank_t > extra just before ParRead_t gets initiated. In
this case, by definition of OutRank_t, extra blocks that belong to the set MR just before ParRead_t gets initiated begin
participation before any block not yet in memory after completion of ParRead_t. These extra blocks become leading
blocks of their respective runs before participating. Thus, by exchange operations as in point 1 of Section 3.5.2, MR
gets extra unoccupied internal memory blocks from ML. This means that, by means of exchange operations as in
point 3 of Section 3.5.2, MD has (D − extra) + extra = D unoccupied blocks before participation of any block that
is still on disk after ParRead_t is completed. Similarly, in case of step 2c of the algorithm, OutRank_t − 1 internal
blocks of MR become unoccupied owing to participation while extra − OutRank_t + 1 blocks of MR are flushed. These
unoccupied blocks, along with the D − extra unoccupied blocks already present in MR, ensure that the set MD does
obtain D unoccupied blocks before any outside block begins participating in the merge. This completes the proof of
the lemma.
Lemma 13 shows that a ParRead operation can be initiated before any block brought in by that operation begins
participation. Thus there is genuine prefetching ability, which is useful in overlapping I/O operations with internal
processing.

3.6 Using Phases to Count ParRead Operations


For the purpose of analysis, we break the process of merging R runs comprising a total of N' records into a sequence
of phases, and we upper bound the expected number of read operations required during a phase. The overall bound
on the merge process can then be computed using the linearity of expectations.
Consider the set of all blocks except the blocks in the initial set. Each such block can participate in the merge only
after its preceding blocks have ceased to participate, so that if the ith block of a run is currently participating, at
least (i − 1)B records from that run have already been output.

Definition 11  Consider the set R' of all blocks of all runs excluding the initial block of each run. The participation
index of any block in R' is i if it is the ith block, in chronological order, to begin participating in the merge, where
1 ≤ i ≤ (N' − RB)/B. We denote by P_j, where 1 ≤ j ≤ (N' − RB)/(RB), the subset of R' consisting of blocks whose
participation indices are in the range [(j − 1)R + 1, jR].

Using parallel reads, SRM tries to fill the R internal blocks of MR using the FDS. Every incoming block provides new
implanted information. When a block with small participation index is on disk while memory is full, we bring it in
by flushing blocks that are surely not among the next R blocks to begin participating in the merge, as Lemma 14
shows.
Lemma 14  Consider a flush Flush_t at time t. The R + OutRank_t − 1 smallest-ranked blocks of F_t (in MR) do not
get flushed out: they remain in MR after the flush completes.

Proof: Since Flush_t is invoked at time t, the number of blocks in MR is R + extra (with 1 ≤ extra ≤ D)
and 0 < OutRank_t ≤ extra. By definition of the Flush_t operation, the extra − OutRank_t + 1 blocks corresponding to the
extra − OutRank_t + 1 highest ranked blocks among all blocks occupying internal blocks of MR are flushed out.
This means that R + OutRank_t − 1 of the lowest ranking blocks remain in internal memory even after the Flush_t
operation. Hence the lemma is proved.

The merge proceeds in a sequence of phases, each contributing R blocks to the output run.

Definition 12  The first phase begins at the time p_0 of the last ParRead_{p_0} read operation invoked in Step 1 of SRM.
The jth phase ends at the time p_j of the read ParRead_{p_j} such that any block b ∈ R' with participation index i_b,
where i_b ≤ jR, has been read into memory (at least once) from disk by some ParRead_t with t ≤ p_j. The
ParRead_t read operations such that p_{j−1} < t ≤ p_j are the reads forming the jth phase, and the set P_j defined above
is the set of blocks forming the jth phase.

The next step is to show that when the jth phase is over, blocks with participation indices smaller than jR have
already been "taken care of."

Lemma 15  After the read ParRead_{p_j} at time p_j is complete, the following invariants hold:

1. No block still on disk has a participation index less than jR + 1.

2. No block b with participation index i_b ≤ (j + 1)R can be flushed out by any Flush_t, where t > p_j.

Proof: We prove the claims by induction. For the base case j = 1, observe that by time p_1, all blocks having
participation indices at most R have been read in at least once. By Lemma 14, none of these blocks could be flushed out at
any time t ≤ p_1, because they would be among the R smallest ranked blocks of MR, thus proving the claim. We now
consider the second claim for j = 1. Suppose there is a flush Flush_t at any time t > p_1. By the previous claim, if X
is the number of blocks in MR that have participation indices at most R, then OutRank_t > X. By Lemma 14, the
R + OutRank_t − 1 smallest ranked blocks at time t remain in MR after the flush. Of these, at most X ≤ OutRank_t − 1
blocks have ranks smaller than R. Thus if there is any block b with rank at most 2R at time t, then b must be among
the R + OutRank_t − 1 smallest ranked blocks. Hence the base case is proved for both claims.
Suppose now that both claims are true for some j − 1. Then, by the inductive hypothesis for Claim 1, when the read
ParRead_{p_{j−1}} at p_{j−1} completes, the smallest participation index of any block still on disk is (j − 1)R + 1. By the
inductive hypothesis for Claim 2, none of the blocks with participation indices at most jR can get flushed at any time
t > p_{j−1}. By definition, all blocks with indices at most jR are read in by p_j. Since none of them can get flushed
after p_{j−1}, the first claim is true for j. The second claim for j follows by an argument identical to the argument for
its base case.

We are led immediately to the following lemma, which gives the number of blocks in the output run of the merge
at the end of the first j phases.

Lemma 16  At least jRB records from blocks of the set R' are already in the merged output of SRM before any
block still on disk after time p_j begins participating in the merge. Thus all reads ParRead_t such that t ≤ p_j, and the
I_0 reads required in Step 1 of SRM, can be charged to these jRB records in the output of SRM.

We wish to obtain a handle on the number of read operations ParRead_t with p_{j−1} < t ≤ p_j (the reads associated
with the jth phase). We will show that this number is the largest number of blocks in P_j that need to be read in
from any particular disk.

Definition 13  Consider the blocks of the set P_{j+1} just after ParRead_{p_j} is completed. Such a block b is said to be
on level 0 if it is already in internal memory, and on level h if it is the hth smallest block on its disk at that time. Let
L_{j+1} denote the highest level of any block of P_{j+1}. In the case of the first phase, let L_1 be the highest level of any
block of P_1 after Step 1 of the algorithm in Section 3.5.5.

We prove the following characterization of the number of reads associated with the (j + 1)st phase.

Lemma 17  The L_{j+1}th read of SRM after time p_j is done at time p_{j+1}.

Proof: The proof follows from Lemma 15. No block with participation index at most jR is still on disk after the read
at p_j completes. Moreover, no block with participation index at most (j + 1)R can be flushed out after p_j. Clearly
then, since any ParRead_t operation reads in the smallest block from every disk at time t, we have the nice property
that all blocks of P_{j+1} that are at the hth level will get read by the hth read after p_j, never to get flushed out. This
proves the lemma.

We can now relate progress in the merge to the number of read operations involved in phases by means of the
following lemma, which follows from Lemmas 17 and 16.

Lemma 18  The first I_0 + Σ_{1≤i≤j} L_i reads of SRM can be charged to at least jRB records in the output of the
merge, where I_0 is the number of reads SRM incurs in Step 1 of the algorithm in Section 3.5.5. The total number
of I/O read operations required to complete the merge is given by the above sum with j = (N'/B − R)/R.

In the next section we give a bound on the expectation of L_i with respect to the randomization involved in the
choice of disks for the initial blocks of the runs. Thereafter, Lemma 18 can be used to obtain the overall bound on
the number of reads involved in SRM.

3.7 Probabilistic Analysis


Consider any arbitrary R runs input to SRM. The initial block of each run is placed on a disk chosen with uniform
probability 1/D, independently of the other runs. Since runs are cycled across disks, the position of the initial block
of a run determines the position of the other blocks of the run. As a result, for any j, the jth block of a run is on
any one of the D disks, each with probability 1/D, depending only upon the disk containing that run's initial block.

Lemma 19  Consider the ith phase and the related set P_i of blocks, where 1 ≤ i ≤ (N' − RB)/(RB). Let n_j denote the
number of blocks from the jth run in P_i, so that Σ_{0≤j≤R−1} n_j = R. Let C_j denote the ordered list ⟨b_{j,0}, b_{j,1}, ..., b_{j,n_j−1}⟩
of the n_j blocks of run j in P_i, where the order is the chronological order of participation. Let D_j denote the disk
from which the block b_{j,0} originates, for each run j, 0 ≤ j ≤ R − 1. Then the disks D_j are independently distributed,
each with uniform probability over the set {0, 1, ..., D − 1} of D disks.

Definition 14  We call each C_j a chain of length n_j of contiguous blocks of run j. The disk corresponding to any
given block of C_j is determined by the disk of the lead block b_{j,0} of C_j.

We recall that the number of reads L_i in the ith phase is the maximum level of any block on any disk at time
p_{i−1}.

Definition 15  Consider the set P_i. Let L'_i be the maximum level of any block of P_i on disk, considering all of P_i's
blocks on their respective original disks.

In the following subsections, we will be able to bound E(L'_i). The following lemma shows that L'_i is an overestimate
of L_i, so we get a bound on L_i too.

Lemma 20  With L_i and L'_i defined as above, we have L_i ≤ L'_i and E(L_i) ≤ E(L'_i).

3.7.1 The Dependent Occupancy Problem


In this section, we define the dependent occupancy problem, which, along with the well-studied classical occupancy
problem (a special case of the former), is directly related to the I/O performance of SRM.

In the classical occupancy problem with parameters {N_b, D}, N_b balls are thrown into D bins independently,
each with uniform probability 1/D of falling into any bin. We denote by C(N_b, D) the asymptotically tight
expression for the expectation of the maximum number of balls in any bin [VF90, KSC78].
In the dependent occupancy problem, we consider D bins but, instead of balls, we consider C chains of
balls, such that the total number of balls, summed over the C chains, is N_b. A chain of length ℓ consists
of ℓ balls linked together in a chain. A chain of length ℓ is said to fall or get thrown into a bin s, where
0 ≤ s ≤ D − 1, if the leading ball of the chain falls into bin s and the remaining ℓ − 1 balls of that chain
are deposited cyclically into bins; that is, the ith ball (with 0 ≤ i ≤ ℓ − 1) of that chain falls in bin
(s + i) mod D. Since the sum of the number of balls from all the chains is exactly N_b, the maximum
length of a chain is N_b. Denoting by n_j the number of chains of length j, we have Σ_{1≤j≤N_b} n_j = C and
Σ_{1≤j≤N_b} j·n_j = N_b. We are interested in the expectation of the maximum occupancy of any bin when
each of the C chains gets thrown independently and identically at random, such that the probability of
each chain falling into any of the D bins is 1/D.
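For intuition, a single random trial of the dependent occupancy experiment can be simulated directly; the following Python sketch (illustrative only) throws each chain onto a uniformly random starting bin and deposits its balls cyclically:

    import random

    def dependent_max_occupancy(chain_lengths, D, rng=random):
        # One random trial: each chain falls into a uniformly random bin s and
        # its balls are deposited cyclically into bins s, s+1, ... (mod D).
        # Returns the maximum occupancy over the D bins.  Classical occupancy
        # is the special case chain_lengths = [1] * Nb.
        bins = [0] * D
        for length in chain_lengths:
            s = rng.randrange(D)
            for i in range(length):
                bins[(s + i) % D] += 1
        return max(bins)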

[Figure 3.2: (a) Dependent occupancy instance with N_b = 12, C = 5, D = 4. Arrows indicate the cyclical order
of blocks in the same chain. The maximum occupancy is 4, as realized in the second bin. (b) Classical occupancy
instance with N_b = 12, D = 4. Blocks fall independently of each other. The maximum occupancy is 5, as realized in
the second bin.]

The classical occupancy problem is the special case of the dependent occupancy problem in which C = N_b and
n_1 = N_b, with n_j = 0 for 1 < j ≤ N_b. Figure 3.2 above shows one instance of both types of problems for N_b = 12,
D = 4, and with C = 5 for the dependent occupancy problem. The figure uses square blocks instead of balls and
uses arrows to connect together blocks from the same chain.

3.7.2 Asymptotic Expansions of the Maximum Occupancy


Our goal here is to obtain a bound on the expected maximum occupancy in dependent occupancy problems. As
mentioned earlier, the leading terms in our upper bounds for this quantity are the same as those in the well-known
bounds for C(N_b, D) in the classical occupancy problem [KSC78]. We conjecture that the expected maximum classical
occupancy C(N_b, D) is an upper bound for the maximum dependent occupancies.
Consider the following intuition: in dependent occupancy, a chain of length ℓ results in a maximum occupancy
of ⌈ℓ/D⌉ because of the constraint that the balls of the chain are cyclically distributed among the D bins. The cyclic
distribution tends to reduce the variance of the individual occupancies, thus lowering the maximum occupancy. In
classical occupancy, the average occupancy of any bin when ℓ balls are independently and uniformly randomly thrown
is ℓ/D, but more than ℓ/D of them can easily fall into the same bin.
For instance, in Figure 3.2, it can be seen that pairs of blocks from the same chain that are forced to occupy different
bins in the dependent occupancy problem may occupy the same bin in the classical version of the occupancy problem.
Intuitively, independence increases the expected maximum occupancy.

In [BGV97], our approach was to use the well-known bounds for the classical maximum occupancy to bound
dependent maximum occupancies; however, the proof of the bound we gave was erroneous. We do believe, however,
that the bound is correct, as we conjecture above. In this chapter, we instead derive a direct bound on the expectation
of the maximum dependent occupancy via interesting analytical and asymptotic techniques, and as a result our proof
is independent of the classical occupancy bounds of [KSC78].
As a first step toward obtaining an upper bound on the maximum dependent occupancy, we first state and prove a
simple lemma that shows that it suffices for us to focus on dependent problems having no chains of length greater
than D:

Lemma 21  Consider any dependent occupancy problem X involving C chains, D bins, and a total of N_b balls, such
that one or more chains in X are of length greater than D. Let X_max denote the maximum occupancy random
variable of X. Then there is a dependent occupancy problem X* involving a total of N_b balls and D bins such
that no chain involved in X* has length greater than D and E[X_max] = E[X*_max], where X*_max is the maximum
occupancy random variable for X*.

Proof: Consider the dependent occupancy problem X^1 obtained by replacing a chain C_1 of X of length aD + b, where
a ≥ 1 and 0 ≤ b < D, with a chains of length D and one chain of length b. Let X^1_max denote the maximum
occupancy random variable of X^1. We claim that the occupancy distribution of the N_b balls among the D bins in the
problem X is exactly the same as that of the problem X^1.
To see this, first consider the occupancy distribution of the balls of chain C_1 among the bins when it is randomly
thrown into the bins. Let us suppose that 1 ≤ b < D. When the chain C_1 gets thrown, b contiguously placed bins,
numbered, say, s, (s + 1) mod D, ..., (s + b − 1) mod D with 0 ≤ s ≤ D − 1, each get a + 1 balls. The other D − b
bins each get exactly a balls. There are D distinct sets S_0, S_1, ..., S_{D−1}, each set consisting of b contiguously placed
bins. Let us denote by S'_i the set of bins other than those in S_i. Then, for 0 ≤ i ≤ D − 1, the probability that set S_i's
b bins each receive a + 1 balls of C_1 whereas bins in the set S'_i each receive a balls is exactly 1/D.
We now consider the occupancy distribution of balls when a chains each of length D and one chain of length b
get randomly thrown into the bins. It is not hard to see that the occupancy distribution of balls that results from
throwing in these a + 1 chains independently and uniformly at random among the bins is precisely the occupancy
distribution of balls that results from throwing chain C_1, described in the previous paragraph. Even in the case when
b = 0, the occupancy distribution of balls that results from throwing in chain C_1 is the same as that obtained from
throwing in a chains of length D each.
Moreover, the two occupancy distributions that result from the other balls are obviously identical, since the same
C − 1 chains are involved in both distributions. By independence and the claim in the previous paragraphs, the
occupancy distribution of balls in problem X is the same as the one in problem X^1.
Since we can replace one chain (C_1) of problem X with multiple chains of length at most D to obtain
a problem X^1 with the same occupancy distribution of balls, we can repeat this process on X^1 to replace one more
chain of length greater than D to obtain another problem X^2, and so on, retaining the same occupancy distribution
of balls as in the original problem X. We continue this process until we have a dependent occupancy problem with no chain of
length greater than D, and we denote this problem by X*. Since the occupancy distributions of X and X* are identical,
so are their expected maximum occupancies. This proves the lemma.

We now use the above lemma in proving an upper bound on the expected maximum occupancy of dependent
occupancy problems involving a total of Nb balls and D bins.

Theorem 7  Consider any dependent occupancy problem X' that involves C' chains, D bins, and a total of N_b balls
from the C' chains. Let X'_max denote the maximum occupancy random variable for X'. Let N_b = kD, for k > 0.
Then

1. If k is a constant as D → ∞,
$$E[X'_{max}] \le \frac{\ln D}{\ln\ln D}\left(1 + \frac{\ln\ln\ln D}{\ln\ln D} + \frac{1+\ln k}{\ln\ln D} + O\!\left(\frac{(\log\log\log D)^2}{(\log\log D)^2}\right)\right).$$

2. In the case where k = r ln D and r = Ω(1), we have
$$E[X'_{max}] \le \left(1 + \sqrt{\frac{2}{r}} + \frac{\ln r}{2\sqrt{2r}\,\ln D} + O\!\left(\frac{1}{r} + \frac{\log r}{\sqrt{r}\log D} + \frac{1}{D\sqrt{r}}\right)\right)\frac{N_b}{D}.$$
Note that the right hand side is of the form (N_b/D)(1 + o(1)) when r → ∞. In the case when r = Θ(1), we have
E[X'_max] ≤ cN_b/D for some constant c > 0.

Since the classical occupancy problem is a special case of the dependent occupancy problem, an immediate
implication is the following corollary.

Corollary 1  Consider a classical occupancy problem in which N_b balls are independently and uniformly randomly
thrown into D bins. The expected maximum occupancy C(N_b, D) of this problem is bounded from above by the
same upper bounds that we proved in Theorem 7 for the expectation E[X'_max] of the maximum occupancy of any
dependent occupancy problem involving a total of N_b balls and D bins.

The upper bounds we prove for the expectation of the maximum occupancy in the dependent occupancy problem are
precisely the asymptotically tight bounds for the classical maximum occupancy problem derived in [KSC78]. In fact,
our proof for the upper bound on the dependent expected maximum occupancy constitutes an alternate approach to
obtaining the same asymptotically leading terms as those in [KSC78] for the classical expected maximum occupancy.
(The analysis in [KSC78] also establishes a matching lower bound on the classical expected maximum occupancy;
our techniques can be modified to do the same.)

Proof of Theorem 7: The given dependent occupancy problem X' might involve some chains that are longer than D.
Lemma 21 ensures that there always exists a dependent occupancy problem X such that E[X_max] = E[X'_max], where X_max
is the maximum occupancy random variable of X, and X involves no chain of length greater than D. We will thus
focus our attention on this problem X.
Let C denote the number of chains in X, and for 1 ≤ j ≤ N_b, let n_j denote the number of chains of length j.
Thus we have
$$\sum_{1\le j\le D} n_j = C, \tag{3.1}$$
and
$$\sum_{1\le j\le D} j\,n_j = N_b. \tag{3.2}$$
Let T denote any positive integer. One way of computing E[X_max] is
$$E[X_{max}] = \sum_{m\ge 0}\Pr\{X_{max} > m\} \tag{3.3}$$
$$\le T + \sum_{m\ge T}\Pr\{X_{max} > m\} \tag{3.4}$$
$$\le T + \sum_{m\ge T} D\cdot\Pr\{X > m\}, \tag{3.5}$$
where X is the random variable corresponding to the occupancy of one particular bin, say bin b_0, the bin numbered
0. We will derive an appropriate bound on the quantity Σ_{m≥T} Pr{X > m} and then apply it in inequality (3.5)
above to bound E[X_max].
If all the chains involved are of length 1, computing Pr{X > m} is relatively straightforward. The fact that there
can be up to C = N_b chains whose lengths may vary from 1 through D introduces dependencies that complicate the
task of obtaining a bound for Pr{X > m}. To obtain a bound on Pr{X > m}, we will use a generating function
approach.

Definition 16  Consider a random variable W that takes integral values. The probability generating function (PGF)
of W, denoted G_W(z), is defined to be the function
$$G_W(z) = \sum_t \Pr\{W = t\}\,z^t.$$
Thus the coefficient of z^t in G_W(z) is Pr{W = t}.

In the simple case when the dependent occupancy problem X has only one chain, and the chain has length ℓ, the PGF G_X(z)
is
$$G_X(z) = \left(1 - \frac{\ell}{D}\right)z^0 + \frac{\ell}{D}\,z^1 = 1 - \frac{\ell}{D} + \frac{\ell}{D}\,z.$$
In the general case when the dependent occupancy problem X has C independent chains, the PGF is the product of the PGFs
corresponding to each individual chain, which gives us
$$G_X(z) = \prod_{1\le j\le D}\left(1 - \frac{j}{D} + \frac{j}{D}\,z\right)^{n_j}. \tag{3.6}$$
Suppose that P ≥ 1 is a number to be determined later. By definition, we have G_X(P) = E[P^X]. Thus, since
X is a nonnegative integer-valued random variable, we have
$$G_X(P) = \sum_{t\ge 0} P^t\,\Pr\{X = t\} \ge P^m\,\Pr\{X \ge m\}, \tag{3.7}$$
for any positive integer m. By (3.6) and (3.7), we get
$$\Pr\{X \ge m\} \le \frac{G_X(P)}{P^m} \tag{3.8}$$
$$= \frac{1}{P^m}\prod_{1\le j\le D}\left(1 + \frac{(P-1)j}{D}\right)^{n_j} \tag{3.9}$$
$$\le \frac{1}{P^m}\prod_{1\le j\le D}\left(1 + \frac{P-1}{D}\right)^{j\,n_j} \tag{3.10}$$
$$= \frac{1}{P^m}\left(1 + \frac{P-1}{D}\right)^{N_b}. \tag{3.11}$$
Step (3.10) follows from the inequality 1 + (P − 1)j/D ≤ (1 + (P − 1)/D)^j when j ≥ 0 and P ≥ 1. Step (3.11)
follows from (3.2). Note that if the probability generating function for the occupancy of a single bin in the classical
occupancy problem with N_b balls and D bins is denoted G_Y(z), then G_Y(P) is precisely the expression
(1 + (P − 1)/D)^{N_b}.

We need to select an appropriate value of P in the bound (3.11) for Pr{X ≥ m}. The idea is to optimize the
value of P so that the right hand side of (3.5) is minimal. In what follows, we will denote P as 1 + β, where β is a
positive quantity to be determined later. It is understood that P (and thus β) may be a function of D.
Substituting P = 1 + β into (3.11), we have
$$\Pr\{X \ge m\} \le \frac{1}{(1+\beta)^m}\left(1 + \frac{\beta}{D}\right)^{N_b}. \tag{3.12}$$
We know that E[X_max] ≥ N_b/D trivially. In order to conveniently manipulate the right hand side of (3.5), which
will be a function of T and β, we parameterize T as T = 1 + λN_b/D, where λ > 1 is chosen so that λN_b/D is an
integer. Plugging in this parameterization and simplifying (3.5) using (3.12), we have
$$E[X_{max}] \le \left(\frac{\lambda N_b}{D} + 1\right) + \sum_{m > \lambda N_b/D} D\cdot\Pr\{X > m\} \tag{3.13}$$
$$\le \left(\frac{\lambda N_b}{D} + 1\right) + \sum_{m > \lambda N_b/D} D\left(1 + \frac{\beta}{D}\right)^{N_b}\frac{1}{(1+\beta)^m} \tag{3.14}$$
$$\le \left(\frac{\lambda N_b}{D} + 1\right) + D\left(1 + \frac{\beta}{D}\right)^{N_b}\frac{1}{(1+\beta)^{\lambda N_b/D}}\sum_{m > 0}\frac{1}{(1+\beta)^m} \tag{3.15}$$
$$= \left(\frac{\lambda N_b}{D} + 1\right) + \frac{D}{\beta}\left(1 + \frac{\beta}{D}\right)^{N_b}\frac{1}{(1+\beta)^{\lambda N_b/D}}. \tag{3.16}$$
If the value of λ is large enough, the second term of (3.16) will be negligible. We now derive the smallest value λ
can take that still ensures that the second term in (3.16) is at most 1:
$$\frac{D}{\beta}\left(1 + \frac{\beta}{D}\right)^{N_b}(1+\beta)^{-\lambda N_b/D} \le 1.$$
Taking natural logarithms on both sides, we have
$$\ln D + N_b\ln\!\left(1 + \frac{\beta}{D}\right) - \lambda\,\frac{N_b}{D}\ln(1+\beta) - \ln\beta \le 0.$$
Rearranging terms, we get
$$\lambda\,\frac{N_b}{D}\ln(1+\beta) \ge N_b\ln\!\left(1 + \frac{\beta}{D}\right) + \ln D - \ln\beta \tag{3.17}$$
$$\lambda \ge \frac{D\ln(1+\beta/D)}{\ln(1+\beta)} + \frac{D\ln D}{N_b\ln(1+\beta)} - \frac{D\ln\beta}{N_b\ln(1+\beta)}. \tag{3.18}$$
We take λ* to be the smallest real number satisfying (3.18) such that λ*N_b/D is integral. Replacing λ by λ* in
(3.16) and exploiting the fact that the second term in (3.16) is then at most 1, we get
$$E[X_{max}] \le \left(\frac{\lambda^* N_b}{D} + 1\right) + 1 \tag{3.19}$$
$$\le \frac{\lambda^* N_b}{D} + 2. \tag{3.20}$$

We are now ready to consider the different cases regarding the relationship between N_b and D. The basic idea is
to choose different values of β for the different cases so as to optimize the bounds obtained using (3.20). Throughout
this derivation we will make use of the expansion ln(1 + x) = x − x²/2 + O(x³) for bounded positive x.
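As a purely illustrative check of the derivation above, the following Python sketch evaluates the bound λ*N_b/D + 2 from (3.18)-(3.20) for a given β and compares it against a simulated classical maximum occupancy (a special case of the dependent problem). The parameter values are hypothetical, the integrality adjustment for λN_b/D is ignored, and β here follows the Case 1 choice discussed next:

    import math
    import random

    def occupancy_bound(Nb, D, beta):
        # Evaluate the right hand side of (3.18) for the given beta = P - 1 and
        # return the bound lambda*Nb/D + 2 from (3.20); beta is a free parameter
        # one may tune, and integrality of lambda*Nb/D is ignored here.
        log1p_beta = math.log(1.0 + beta)
        lam = (D * math.log(1.0 + beta / D) / log1p_beta
               + D * math.log(D) / (Nb * log1p_beta)
               - D * math.log(beta) / (Nb * log1p_beta))
        return lam * Nb / D + 2

    def simulated_classical_max(Nb, D, trials=200, rng=random):
        # Average maximum occupancy when Nb single balls (chains of length 1)
        # are thrown into D bins, a special case of the dependent problem.
        total = 0
        for _ in range(trials):
            bins = [0] * D
            for _ in range(Nb):
                bins[rng.randrange(D)] += 1
            total += max(bins)
        return total / trials

    # Hypothetical parameters: Nb = kD with k = 10, D = 100.
    k, D = 10, 100
    beta = math.log(D) / (k * math.log(math.log(D)))
    print(occupancy_bound(k * D, D, beta))      # analytic upper bound
    print(simulated_classical_max(k * D, D))    # simulated expectation

For finite D the analytic bound is valid but loose; its value lies in the asymptotic expansions derived below.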

Case 1: N_b = kD for constant k > 0. In this case, we choose
$$\beta = \frac{\ln D}{k\ln\ln D}, \qquad\text{so that}\qquad \ln\beta = \ln\ln D - \ln\ln\ln D - \ln k.$$
Substituting and simplifying, the first term in expression (3.18) can be written as
$$\frac{\ln D}{k(\ln\ln D)^2}\left(1 + O\!\left(\frac{\log\log\log D}{\log\log D}\right)\right),$$
and the second term in expression (3.18) can be written as
$$\frac{\ln D}{k\ln\ln D}\left(1 + \frac{\ln\ln\ln D}{\ln\ln D} + \frac{\ln k}{\ln\ln D} + O\!\left(\frac{(\log\log\log D)^2}{(\log\log D)^2}\right)\right).$$
The third term of expression (3.18) is O(1). Hence the bound on λ* can be written as
$$\lambda^* = \frac{\ln D}{k\ln\ln D}\left(1 + \frac{\ln\ln\ln D}{\ln\ln D} + \frac{1+\ln k}{\ln\ln D} + O\!\left(\frac{(\log\log\log D)^2}{(\log\log D)^2}\right)\right). \tag{3.21}$$
Using inequality (3.21) in inequality (3.20) and simplifying, we get
$$E[X_{max}] \le \frac{\ln D}{\ln\ln D}\left(1 + \frac{\ln\ln\ln D}{\ln\ln D} + \frac{1+\ln k}{\ln\ln D} + O\!\left(\frac{(\log\log\log D)^2}{(\log\log D)^2}\right)\right). \tag{3.22}$$
This completes the proof of Case 1 of Theorem 7.


Case 2: N_b = rD ln D with r = Ω(1). In this case, we choose β = √(2/r). Since β/D = √(2/r)/D → 0 asymptotically
as D → ∞, it follows that D ln(1 + β/D) = √(2/r)·(1 + O(√(2/r)/D)).

Let us first consider the more interesting subcase in which r → ∞, so that β → 0. In this subcase the first term on
the right hand side of inequality (3.18) equals
$$\frac{D\ln(1+\beta/D)}{\ln(1+\beta)} = \frac{\sqrt{2/r}\,\bigl(1 + O(\sqrt{2/r}/D)\bigr)}{\sqrt{2/r} - 1/r + O(1/r\sqrt{r})} = 1 + \frac{1}{\sqrt{2r}} + O\!\left(\frac{1}{r} + \frac{1}{D\sqrt{r}}\right). \tag{3.23}$$
The second term on the right hand side of (3.18) equals
$$\frac{D\ln D}{N_b\ln(1+\beta)} = \frac{1/r}{\sqrt{2/r} - 1/r + O(1/r\sqrt{r})} = \frac{1}{\sqrt{2r}} + O\!\left(\frac{1}{r}\right). \tag{3.24}$$
The third term of (3.18) is
$$-\frac{D\ln\beta}{N_b\ln(1+\beta)} = \frac{-\ln\beta}{(r\ln D)\,\beta\,\bigl(1 + O(\beta)\bigr)} \tag{3.25}$$
$$= \frac{\ln r - \ln 2}{2r(\ln D)\sqrt{2/r}\,\bigl(1 + O(1/\sqrt{r})\bigr)} \tag{3.26}$$
$$< \frac{\ln r}{2\sqrt{2r}\,(\ln D)\bigl(1 + O(1/\sqrt{r})\bigr)} \tag{3.27}$$
$$= \frac{\ln r}{2\sqrt{2r}\,\ln D} + O\!\left(\frac{\log r}{r\log D}\right). \tag{3.28}$$
By (3.23), (3.24), and (3.28), we can choose λ* to be
$$\lambda^* = 1 + \sqrt{\frac{2}{r}} + \frac{\ln r}{2\sqrt{2r}\,\ln D} + O\!\left(\frac{1}{r} + \frac{\log r}{\sqrt{r}\log D} + \frac{1}{D\sqrt{r}}\right). \tag{3.29}$$
For the bound on the expected maximum occupancy, applying (3.29) in inequality (3.20) and simplifying yields
$$E[X_{max}] \le \left(1 + \sqrt{\frac{2}{r}} + \frac{\ln r}{2\sqrt{2r}\,\ln D} + O\!\left(\frac{1}{r} + \frac{\log r}{\sqrt{r}\log D} + \frac{1}{D\sqrt{r}}\right)\right)\frac{N_b}{D} + 2$$
$$\le \left(1 + \sqrt{\frac{2}{r}} + \frac{\ln r}{2\sqrt{2r}\,\ln D} + O\!\left(\frac{1}{r} + \frac{\log r}{\sqrt{r}\log D} + \frac{1}{D\sqrt{r}}\right)\right)\frac{N_b}{D}.$$
In the subcase when r = Θ(1), it is easily seen that the right hand side of (3.18) depends upon
r but is O(1), and thus λ* = O(1). Using λ* ≤ c_0, where c_0 is a positive constant, in inequality (3.20), we have
$$E[X_{max}] \le c_0\,\frac{N_b}{D} + 2. \tag{3.30}$$
This completes the proof of Case 2 of Theorem 7.

3.7.3 Proof of Theorem 6
An intuitive way to look at the analysis of SRM is that the number of I/O read operations corresponding to a phase
is the expected maximum occupancy of a dependent occupancy problem involving N_b = R balls and D bins. Thus,
to compute the number of reads for the entire mergesort, we have to multiply the expected maximum occupancy by
the total number of phases in the mergesort, which is the product of the number ln(N/M)/ln R of passes over the
file other than the initial run formation pass and the number N/RB of phases in a merge pass.
Using Lemmas 18 and 20, with the random variables I_0 and L'_i defined as in those lemmas, the random variable
that bounds the number of I/O read operations during a merge of N' records by SRM is I_0 + Σ_{1≤i≤J} L'_i, where
J = (N'/B − R)/R. The expected number of read operations in a merge is thus
$$E[I_0] + \sum_{1\le i\le J} E[L'_i]. \tag{3.31}$$
The random variable I_0 is the maximum occupancy of a classical occupancy problem involving R balls and D bins,
and each L'_i is the maximum occupancy of a dependent occupancy problem involving R balls and D bins.
Let us consider Case 1 of Theorem 6, in which we have R = kD for constant k > 0. In terms of occupancies, this
corresponds to Case 1 of Theorem 7, in which a total of N_b = R = kD balls are thrown into D bins. From the bounds
for Case 1 of Theorem 7 and Corollary 1, the expectations E[I_0] and E[L'_i] are all bounded by the right hand side
of (3.22). Substituting (3.22) into (3.31) and simplifying, we find that the expected number of reads in a merge of N' records
by SRM is at most
$$(J + 1)\,\frac{\ln D}{\ln\ln D}\left(1 + \frac{\ln\ln\ln D}{\ln\ln D} + \frac{1+\ln k}{\ln\ln D} + O\!\left(\frac{(\log\log\log D)^2}{(\log\log D)^2}\right)\right)
= \frac{N'}{kDB}\,\frac{\ln D}{\ln\ln D}\left(1 + \frac{\ln\ln\ln D}{\ln\ln D} + \frac{1+\ln k}{\ln\ln D} + O\!\left(\frac{(\log\log\log D)^2}{(\log\log D)^2}\right)\right). \tag{3.32}$$
In each pass over the file, the sum of the sizes of all the runs that are merged is Σ N' = N. Therefore, by (3.32), the
expected number of reads done by SRM in each pass over the file is
$$\frac{N}{kDB}\,\frac{\ln D}{\ln\ln D}\left(1 + \frac{\ln\ln\ln D}{\ln\ln D} + \frac{1+\ln k}{\ln\ln D} + O\!\left(\frac{(\log\log\log D)^2}{(\log\log D)^2}\right)\right). \tag{3.33}$$
There are ln(N/M)/ln(kD) merge passes other than the initial run formation pass, which costs N/DB read
operations. Thus, the expected number of read operations required to sort N records using SRM in Case 1 of Theorem 6 is
$$\frac{N}{DB} + \frac{\ln(N/M)}{\ln(kD)}\,\frac{N}{DB}\,\frac{\ln D}{k\ln\ln D}\left(1 + \frac{\ln\ln\ln D}{\ln\ln D} + \frac{1+\ln k}{\ln\ln D} + O\!\left(\frac{(\log\log\log D)^2}{(\log\log D)^2}\right)\right).$$
Similarly, we can apply the occupancy bounds from the other cases of Theorem 7 and Corollary 1 to bound the expectations
E[I_0] and E[L'_i], where 1 ≤ i ≤ J, in Cases 2 and 3 of Theorem 6, and we get the desired expressions shown in the
statement of Theorem 6. This completes the proof of Theorem 6.

3.8 A Deterministic Variant
As mentioned earlier in Section 3.3, when the input data records of the runs are randomly and uniformly distributed,
we expect a version of our algorithm that uses a deterministic staggered distribution of starting blocks of runs on
disks to give an average performance comparable to the bounds we showed in Theorem 6. Intuitively, this effect
occurs because with long, random input runs, the R leading blocks of the R input runs at any time during the merge
tend to be spread no worse than randomly among the D disks.
When the R input runs are staggered on the D disks, the first run has its first block on disk 0 with subsequent
blocks cyclically stored on the disks, the next run has its first block stored on disk 1, and so on. If the runs are "short
enough," we can expect the runs to maintain their stagger on the disks throughout the duration of the merge on
average, ensuring very efficient I/O. On the other hand, as observed in the previous paragraph, even when runs are
"long," we can expect I/O to be reasonably efficient. Hence, if an analysis of having runs staggered deterministically
during initial merge stages (when runs are "short") is combined with an analysis that exploits the randomness in
maximum occupancy situations during later merge stages (when runs are "long"), we might be able to obtain an
improvement in overall I/O performance of the mergesort. We might thus be able to prove theoretical optimality
(within a constant factor) for an increased range of values of M, B and D for our mergesort, compared with the
range in Theorem 1.

3.9 Comparisons between SRM and DSM in Practice
In DSM, the popularly used merging technique with disk striping, there are O(M/DB) runs merged in each merge
step with perfect read parallelism. The price that DSM pays for not using the disks independently is that it can
merge only Θ(M/DB) rather than Θ(M/B) runs at a time. However, DSM is often more efficient in practice than
the previously developed asymptotically optimal sorting algorithms, since the latter algorithms have larger overheads
[VV96b].
When the amount M of internal memory is small or the number D of disks is large, our SRM method is clearly
superior to DSM. For example, if M = Θ(DB), DSM is suboptimal by a factor of Θ(log D) whereas SRM is suboptimal
by a factor of Θ(log D/log log D). The improvement of SRM over DSM shows up in practice even when M is
substantially larger or the number of disks D is small, which is when DSM has better I/O performance than previously
developed external sorting algorithms.
In this section, we compare the performance of the DSM and SRM mergesorts on existing parallel disk systems. In
Subsection 3.9.2 we make a comparison based upon an estimate of the upper bound on the expected worst-case number
of I/O operations of an SRM mergesort with the exact number of I/O operations for the standard DSM mergesort
with disk striping, using parameters of presently existing or feasible parallel disk systems. In Subsection 3.9.3 we
make a similar comparison based upon actual simulations of SRM to estimate its average-case I/O performance on
random inputs. Both comparisons show that SRM's I/O performance is always better than DSM's. Moreover,
simulations indicate that SRM's actual I/O performance is much better than that implied by the estimate of the
analytical bound in Subsection 3.9.2.

3.9.1 Expressions for the number of I/O operations


In this subsection and the following ones, we assume that SRM is able to merge R = kD runs at a time, for interesting
values of k and D. We consider DSM using the same amount of memory and estimate the relative performances of
SRM and DSM.
We will first consider SRM, which merges R = kD runs at a time. The amount of internal memory needed to
support the merging is M = (2kD + 4D)B + kD². The total number of writes SRM requires is (N/DB)(1 + ln(N/M)/ln(kD)),
since it has perfect write parallelism. In terms of reads, SRM requires N/DB reads corresponding to the initial run
formation pass. It subsequently requires ln(N/M)/ln(kD) passes to perform the merging, each incurring v·(N/DB) reads, where
v = v(k, D) is the overhead factor that represents a multiplicative overhead over and above the minimum number
N/DB of reads for a single pass. The total number of I/O operations SRM takes to sort N records is thus
$$\frac{N}{DB}\left(2 + \frac{\ln(N/M)}{\ln(kD)}(1 + v)\right) = \frac{N}{DB}\left(2 + C_{SRM}\ln\frac{N}{M}\right),$$
where
$$C_{SRM} = \frac{1+v}{\ln(kD)}. \tag{3.34}$$
We now compute in a similar way the number of I/O operations needed by DSM to sort N records using the same
amount of memory as SRM above. We assume that DSM uses 2D blocks per run for I/O read buffers and 2D blocks
for I/O write buffers. DSM merges (M/B − 2D)/2D = k + 1 + kD/2B runs at a time. It can be verified that the
number of I/O operations required by DSM to sort N records is
$$\frac{N}{DB}\left(2 + \frac{2\ln(N/M)}{\ln(k + 1 + kD/2B)}\right) = \frac{N}{DB}\left(2 + C_{DSM}\ln\frac{N}{M}\right),$$
where
$$C_{DSM} = \frac{2}{\ln(k + 1 + kD/2B)}. \tag{3.35}$$
3.9.2 Comparison based upon expected worst-case performance of SRM
Our goal in this subsection is to compare SRM and DSM based upon the expected worst-case performance of SRM.
Our comparison uses k and D values corresponding to realistic systems. To make the comparison possible, we need
to obtain an estimate of the overhead factor v(k, D) defined in the previous subsection corresponding to these values
of k and D.
Note that for a given pair of finite values of k and D, the estimate of the expected worst-case value of v using
Theorem 6 is lax because of contributions from lower-order terms. As explained in Section 3.7.3, the expected number
of reads done by SRM is bounded by an expression involving expected maximum occupancies of the classical and
dependent cases. We get more useful comparisons between SRM and DSM by replacing the maximum dependent
occupancy by the maximum classical occupancy in the bound for the expected number of reads required by SRM.
This approach can be justified on two counts. First, the upper bound in Theorem 7 that we proved for the expected
maximum dependent occupancy is the same as the expression for the expected maximum of the classical occupancy
problem. Moreover, we conjecture that the expected classical maximum occupancy is never less than the expected
maximum occupancy of dependent occupancy problems involving the same number of balls and bins, as explained in
Section 3.7.2.
For given values of k and D, we simulate throwing kD balls into D bins repeatedly to estimate C(kD, D). Thus
the expected worst-case overhead v(k, D) is estimated by repeated ball-throwing experiments that estimate C(kD, D)/k.
Table 3.1 shows the values of v based upon such experiments.
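A minimal sketch of this ball-throwing estimate (function name and trial count are illustrative) is:

    import random

    def estimate_overhead_v(k, D, trials=1000, rng=random):
        # Estimate v(k, D) ~ C(kD, D)/k: repeatedly throw kD balls into D bins
        # uniformly at random and average the maximum bin occupancy.
        total_max = 0
        for _ in range(trials):
            bins = [0] * D
            for _ in range(k * D):
                bins[rng.randrange(D)] += 1
            total_max += max(bins)
        return (total_max / trials) / k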
For every k, D pair considered, we use (3.34) to estimate an expected worst-case value for C_SRM using the
abovementioned estimates for v. In Table 3.2 we present the C_SRM/C_DSM ratio for several values of k and D. The
ratio C_SRM/C_DSM represents the relative I/O advantage of SRM over DSM, neglecting the 2N/DB I/Os that both
methods require during the initial run formation. We used block size B = 1000 records for all k, D pairs. (The choice
of B is not significant so long as it is reasonable.) In our representation, 2k is roughly the number of internal memory
blocks available per disk.
Table 3.2 shows that SRM uses significantly fewer I/Os than does DSM. For example, for D = 50, k = 100,
which translates into M = 10.5 million records of internal memory, SRM uses 0.60 times as many I/Os as does the
slower DSM, not counting the initial run formation pass that they share in common. When D is small, as k increases
relative to D, the C_SRM/C_DSM ratio gradually increases toward 1, which indicates a lessening advantage of SRM over
DSM when there are few disks and a huge amount of internal memory. As we show in the next subsection, since we
overestimate the number of reads required by SRM in our analysis, SRM actually exhibits even better performance
than indicated in Tables 3.1 and 3.2.

             D = 5    D = 10   D = 50   D = 100   D = 1000
  k = 5      1.6      1.7      2.2      2.3       2.7
  k = 10     1.4      1.5      1.8      1.9       2.2
  k = 20     1.3      1.4      1.5      1.6       1.8
  k = 50     1.2      1.2      1.3      1.4       1.5
  k = 100    1.11     1.16     1.22     1.26      1.3
  k = 1000   1.04     1.05     1.08     1.08      1.1

Table 3.1: The overhead v(k, D), computed by estimating C(kD, D)/k using computer simulations.
             D = 5    D = 10   D = 50   D = 100   D = 1000
  k = 5      0.71     0.62     0.51     0.48      0.46
  k = 10     0.72     0.66     0.54     0.50      0.48
  k = 20     0.75     0.68     0.56     0.53      0.49
  k = 50     0.77     0.71     0.59     0.55      0.50
  k = 100    0.78     0.72     0.61     0.57      0.51
  k = 1000   0.83     0.77     0.67     0.63      0.56

Table 3.2: The performance ratio C_SRM/C_DSM for memory size M = (2k + 4)DB + kD^2, with block size B = 1000. (Both M and B are expressed in units of records.) The overhead factor v in C_SRM is based upon computer simulations of C(kD, D)/k.

3.9.3 Using simulations to count SRM's I/O operations


In the previous subsection we showed by using classical maximum occupancy to estimate the overhead term v that
SRM performs well compared to DSM on arbitrary inputs. In this section we use simulations of the algorithm itself
to estimate v on average-case inputs and make a similar comparison.
Our experiments consist of simulating SRM while merging R = kD sorted runs, each of length L, for a large
range of k and D values. The runs input to the merge were generated such that each set of input runs was equally
likely. SRM's actions while merging depend only upon the relative order of the input keys. Thus, there is an obvious
one-to-one correspondence between the set of all possible input runs to the merge and the set of partitions of the
set I = f1; 2; : : : ; LkDg, each partition splitting I into kD disjoint subsets of size L. We generate average-case inputs
to the merge by generating partitions of the set I , with each partition being equally likely.
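A simple way to generate such average-case inputs, sketched below with hypothetical type choices, is to draw a uniformly random permutation of I and cut it into kD pieces of length L, sorting each piece to obtain one run; every partition of I into kD subsets of size L is then equally likely.

    #include <algorithm>
    #include <numeric>
    #include <random>
    #include <vector>

    // Sketch: generate kD sorted runs of length L so that every partition of the
    // key set I = {1, ..., LkD} into kD runs of size L is equally likely.
    std::vector<std::vector<long>> random_merge_input(int k, int D, long L) {
        const long n = L * k * D;
        std::vector<long> keys(n);
        std::iota(keys.begin(), keys.end(), 1L);         // the set I = {1, ..., LkD}
        std::mt19937_64 gen(std::random_device{}());
        std::shuffle(keys.begin(), keys.end(), gen);     // uniformly random permutation

        std::vector<std::vector<long>> runs(k * D);
        for (int r = 0; r < k * D; ++r) {
            runs[r].assign(keys.begin() + r * L, keys.begin() + (r + 1) * L);
            std::sort(runs[r].begin(), runs[r].end());   // each input run is sorted
        }
        return runs;
    }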
Our simulations indicate that SRM's I/O overhead is noticeable only when k is small compared with the number
of disks D. We ran our simulations for several di erent sets of parameters. Not only did we vary k and D, but for
each k, D pair we also varied B and L (where L is as de ned above). We present here an illustrative sample of typical
simulation outcomes, with our main focus on the parameters k and D. Table 3.3 shows the multiplicative overhead v corresponding to simulations for interesting k, D pairs. It can be seen that the estimates of v based upon average-case inputs are smaller than the corresponding estimates of the expected worst-case values of v in Table 3.1. (For values of k larger than the ones shown here, the simulations gave values of v that were practically equal to 1.) The simulations show that in the average case SRM has little or no overhead when k is reasonably large and that the number of I/O read operations required to carry out the merge is practically N'/DB, where N' = LkD is the number of records in the merged output. In our simulations, N' was always 1000 times larger than kDB; longer simulations tend to be time consuming.
            D = 5    D = 10   D = 50
  k = 5     1.0      1.0      1.2
  k = 10    1.00     1.0      1.1
  k = 50    1.00     1.00     1.00

Table 3.3: The overhead factor v(k, D) for memory size M = (2k + 4)DB + kD^2, obtained from simulations.
            D = 5    D = 10   D = 50
  k = 5     0.56     0.47     0.37
  k = 10    0.61     0.52     0.40
  k = 50    0.71     0.63     0.51

Table 3.4: The performance ratio C'_SRM/C_DSM for memory size M = (2k + 4)DB + kD^2, where C'_SRM is computed using the overhead value v(k, D) obtained from simulations.

The average-case simulations indicate that when k is reasonably large there is little or no flushing of blocks by SRM. Intuitively, the rate of consumption (on account of merging) of the blocks in internal memory is such that there is almost always space to read in the next "smallest" D blocks without the need to flush.

In order to make a comparison similar to that of the previous subsection, but based upon results of simulations of the algorithm itself, we computed the term C'_SRM analogously to C_SRM using (3.34), by using the average-case values for v obtained from simulations of the algorithm.

Table 3.4 shows the C'_SRM/C_DSM ratios so obtained, for various k, D pairs. We note that the entries in Table 3.4 are smaller than the corresponding entries in Table 3.2, indicating that SRM's performance is indeed better than that implied by the more pessimistic upper bound. Moreover, SRM's low overhead in the average case indicates that for all practical purposes it is an optimal external sorting algorithm.

3.10 Realistic Values for Parameters k, D, and B


We have shown in the previous section for a wide range of k, D, and B values that SRM performs very efficiently.
Since not every k, D, B triplet corresponds to that of a realistic machine, we try in this section to obtain a crude
approximation of the relative magnitudes of k, D, and B in typical computers. We argue that k is generally much
larger than D in realistic machines, which implies that SRM is the method of choice on such machines, still noticeably
faster than DSM.
In our terminology, where the internal memory size is M = (2kD + 4D)B + kD^2 and the merge order is R = kD, the expression 2kD is roughly equal to M/B, the number of blocks that fit in main memory, under the realistic assumption that D = O(B). Let us consider a fast, uniprocessor workstation attached to, say, D = 5 independent disks for parallel I/O. We are likely to find internal memories on the order of 100 megabytes on such machines and disk block or track sizes on the order of 10-50 kilobytes. This would mean that k may be on the order of 200-1000 when D = 5 on such workstations. If the same amount of main memory is instead used with 10 disks and 50-kilobyte disk blocks, k would be on the order of 100. On the other hand, when the number of disks is relatively high, as in large-scale multiprocessor computing systems, we would still expect k to be large, even after factoring in the increased block sizes that such machines have. This is primarily because large-scale computing systems tend to have huge internal memories. For instance, in a system with 100 parallel disks and 100-kilobyte disk blocks, one would expect on the order of at least 5-10 gigabytes or more of aggregate internal memory. This would correspond to k being on the order of 500-1000 or more. The relative magnitudes of k, D, and B cited here are meant to represent likely scenarios; it is conceivable that there are systems with different relationships among the values of k, D, and B.
For most values of k, D, and B, the previous section shows that SRM is extremely efficient. Looking back at the occupancy analysis, the reason for SRM's efficiency is that throwing a large number (say, kD) of balls uniformly and independently into a small number of bins (say, D) results in a more or less balanced distribution. Even DSM will perform well when D is small and k is large, since it will merge k runs at a time, which is not much worse than merging the optimal kD runs at a time. However, SRM's extremely low overhead still gives it an advantage over DSM.

3.11 Conclusions and Related Notes


In this chapter, we have presented a simple, efficient mergesort algorithm for parallel disks that makes limited use of randomization. The analysis of its I/O performance involved a reduction to certain maximum occupancy problems. We demonstrated the practical merit of the algorithm by showing that it incurs fewer I/O operations than the commonly used disk-striped mergesort, even on realistic parallel disk systems with a small number of disks. We did so analytically by estimating the maximum bucket occupancy values for several k, D pairs and empirically by using simulations to count the number of I/O operations needed by SRM in the average case. We argued that SRM may be considered an optimal external sorting algorithm in practice. The technique of staggering runs might yield further gains in practice if combined with our general approach.

An interesting point to note is that the efficient parallel disk merging procedure we present here, wherein Θ(M/B) sorted runs are merged together, can be "reversed" to obtain an efficient parallel disk distribution procedure, wherein records can be efficiently distributed into Θ(M/B) distinct buckets, each one striped across the D disks. On simple examination we observe that the bounds in the proof of Theorem 6 that hold for the number of reads incurred while externally merging R striped runs, each of length N/R records, also hold for the number of writes while distributing a total of N records into R striped buckets.

Another interesting note is that the forecast and flush approach can be used fruitfully in a framework for competitive prefetching on parallel disks with a notion of local lookahead, for read-once sequences [BKVV97]. In [BKVV97] we prove lower bounds on online parallel prefetching for read-once sequences and present a simple, optimal (up to constant factors) prefetching algorithm for parallel disks.

Chapter 4
Design and Implementation of SRM in
TPIE
Summary
External sorting is fundamental to database systems, not only for producing sorted output but also as a core subroutine in many database operations [Gra93, IBM90]. Technology trends indicate that developing techniques that effectively use multiple disks in parallel in order to speed up the performance of external sorting is of prime importance. The simple randomized merging (SRM) mergesort algorithm proposed in Chapter 3 is the first parallel disk sorting algorithm that requires a provably optimal number of passes and that is fast in practice. Knuth [Knu98, Section 5.4.9] recently identified SRM (which he calls "randomized striping") as the method of choice for sorting with parallel disks.

In this chapter, we present an efficient implementation of SRM, based upon novel and elegant data structures. We give a new implementation for SRM's lookahead forecasting technique for parallel prefetching and its forecast and flush technique for buffer management. Our techniques amount to a significant improvement in the way SRM carries out the parallel, independent disk accesses necessary to efficiently read blocks of input runs during external merging. We present the performance of SRM over a wide range of input sizes and compare its performance with that of disk-striped mergesort (DSM), the commonly used technique to implement external mergesort on D parallel disks. DSM consists of using a standard mergesort algorithm in conjunction with striped I/O for parallel disk access. SRM merges together significantly more runs at a time than DSM, and thus it requires fewer merge passes. We demonstrate in practical scenarios that even though the streaming speeds for merging with DSM are a little higher than those for SRM (since DSM merges fewer runs at a time), sorting using SRM is significantly faster than with DSM, since SRM requires fewer passes.

The techniques in this chapter can be generalized to meet the load-balancing requirements of other applications using parallel disks, including distribution sort and multiway partitioning of a file into several other files. Since both parallel disk merging and multimedia processing deal with streams that get "consumed" at nonuniform and partially predictable rates, our techniques for lookahead based upon forecasting data may have relevance in video server applications.

4.1 Introduction and Motivation
External sorting is a fundamental operation used widely in database systems. It is used not only for producing sorted
output but also as a core subroutine in many other database operations [Gra93, IBM90]. Modern technology trends
indicate that processor speeds are increasing at a faster rate than disk drive performance [Dah96, GVW96], and so
the development of external sorting techniques capable of utilizing multiple disks in parallel is of prime importance
for database systems.
External mergesort is the most commonly used technique to perform large-scale sorting [ZL98]. In this chapter,
we address problems arising in the development of a simple and ecient parallel disk mergesort. External mergesort
consists of a run formation phase, which produces sorted runs, and a merge phase, which merges sorted runs to
produce the sorted output. While it is simple to modify run formation techniques developed for single disk systems
to work eciently on parallel disk systems, fundamental diculties1 need to be overcome in order to merge together
several runs, each one striped across the disks, in a manner that eciently utilizes all the disks in parallel.
In this chapter, we present the design, implementation, and performance of an extremely simple and ecient
parallel disk merging technique. The most signi cant aspects of our parallel disk merging technique are its elegant
and novel data structures and the prefetching and bu er management technique that together constitute a practical
implementation of the external mergesort algorithm SRM,which was shown to have provably ecient parallel I/O
performance guarantees for the parallel disk model of Vitter and Shriver [VS94]. SRM was recently identi ed by
Knuth [Knu98]2 in the new edition of his seminal work as the method of choice for optimal sorting on parallel disks.
Another interesting aspect of our implementation technique is that it can easily be modi ed to suit the load balancing
needs of other applications in a parallel disk context, including external distribution sort and a multiway partitioning
of a le into other les. The technique we develop to implement a lookahead mechanism via forecasting also has
potential applications during parallel prefetching in video servers in which many striped les may need to be streamed
at nonuniform rates with real time constraints.
Although a tremendous amount of research has been conducted on external sorting [ECW94, ZL98, Sal89, ZL96],
the focus has largely focussed on single disk systems and on goals such as ecient layout of disk blocks, ecient
scheduling of I/O at disks, and techniques to implement read-aheads in course of external sorting. In this chapter,
we focus on the orthogonal approach of minimizing elapsed time by developing techniques to exploit I/O parallelism
and minimize parallel I/O operations in a system containing several disks. While some of the techniques developed
in [ECW94, ZL98, Sal89, ZL96] can potentially be used in conjunction with the ones we develop here, exploring that
avenue is beyond the scope of this chapter. The parallel disk model (PDM) [VS94] is meant for designing algorithms
1 It is easy to carry out an ecient parallel disk merge of several runs when each run resides entirely
on a single disk [PV92], provided the output can be striped across all the disks. However such a
merging scheme is fundamentally inecient for a parallel disk mergesort, since extra transposition
passes would be needed.
2 Knuth refers to the SRM algorithm as \randomized striping".

capable of exploiting I/O parallelism. In the PDM, an input le containing N items3 is striped in blocks containing
B items across D disk drives all of which may be used in parallel as follows: In each I/O operation, an application
can transfer at most one block of B items between internal memory and each disk drive; so up to D blocks can be
transferred in a single I/O operation. With respect to the problem of external sorting, results in [VS94] and earlier work [AV88] show that given an internal memory capable of holding up to M items, sorting a file of N items requires Θ((N/DB) log_{M/B}(N/B)) I/O operations. Sorting requires Θ(log_{M/B}(N/B)) passes over the data; each pass can be done in a linear number of I/O operations (N/DB reads and N/DB writes). The main difficulty in parallel disk sorting is laying out intermediate data blocks and accessing blocks in such a manner that "useful computation" is performed in memory while ensuring that on average each I/O operation transfers Θ(D) blocks. Several interesting parallel disk sorting algorithms [VS94, NV95, AP94] performing an optimal number Θ((N/DB) log_{M/B}(N/B)) of I/O operations have been proposed, but they are somewhat complicated and difficult to implement in practice.
As a consequence, an attractive alternative to implement sorting algorithms for parallel disks is to use the technique
of disk striping (or striped I/O ) in conjunction with well known single disk sorting techniques as follows: In each
striped-I/O operation, the logical locations of the blocks accessed at each one of the D disks are the same. Logically,
the e ect of striped I/O is to reduce the number of disks to 1 and increase the block size to DB from the application's
point of view. As a result, single disk algorithms such as external mergesort and external radix sort with block size
con gured to DB can be implemented to utilize D disks on a parallel disk system. Since double-bu ered mergesort
has been shown to be very ecient [Sal89], its disk-striped version, called disk-striped mergesort (DSM) [Ven94], is
considered particularly attractive. Sorting algorithms such as DSM and disk-striped radix sort [CH96] based upon
striped I/O are simple and all their I/O operations can achieve full D-disk parallelism. However, because the logical
block size blows up from B to DB, the number of runs participating in each merge of DSM goes down by a factor
of D, and hence DSM (and for similar reasons, all other striped-I/O sorting algorithms) require a non-optimal number
log 
M=DB (N=DB ) of passes over the data; the degree of non-optimality increases with D and most importantly,
the non-optimality shows up in practice even for a moderately small number D of disks.
The SRM algorithm proposed in Chapter 3 is the first parallel disk sorting algorithm that requires a provably optimal number Θ(log_{M/B}(N/B)) of passes over the data and is simple enough to be considered a candidate for implementation. For practical ranges of the parameters B and D, each pass takes a linear optimal number O(N/DB) of I/Os. While merging, SRM uses a generalization of Knuth's forecasting technique [Knu98] and a new buffer management technique called forecast and flush in order to efficiently access the D disks in a parallel, independent manner fundamentally different from striped I/O. The basic prefetching technique in SRM is to use forecasting information to read in the "smallest" block from each one of the D disks in every I/O operation; if there is not enough space in internal memory to read D blocks, SRM simply flushes (without any I/O) a sufficient number of "largest" blocks from memory.

3 By items, we refer to records or tuples. While discussing I/O complexity bounds, it is convenient to state block sizes and file sizes in terms of items instead of bytes.

4.1.1 Our Contributions


In this chapter, we present novel and elegant data structures and techniques to implement the forecast and ush bu er
management scheme of SRM. The latter scheme, as proposed in Chapter 3, requires the use of D separate priority
queues [CLR90], each corresponding to a unique disk and each involving R forecasting keys at any time, where R is
the merge order. Furthermore, the scheme in Chapter 3 did not cover details of internal memory management and
how to eciently track down the \largest" blocks in memory at the time of the ush operations. In this chapter, we
present an implementation of the forecasting and ush scheme that requires only a single priority queue comprising
R forecasting keys, to be used in conjunction with D ordinary queues implemented as simple arrays. The novelty lies
in the way we store, use, and manage forecasting data during the merging process. We also show how to implement
memory management and how to perform ush operations. Our design signi cantly simpli es the implementation of
SRM.
The second interesting contribution of our chapter is the practical comparison of the merging phase implementation
of DSM and SRM, on a state-of-the-art computer system consisting of six disks that can be used independently and
in parallel. In each merge operation, DSM merges together approximately M=2DB runs, whereas SRM merges
together a signi cantly larger number of runs (which can go up to approximately M=2B). DSM will thus tend to
have higher disk locality in each merging operation, compared to SRM. As a result, each merge pass of SRM incurs
some overhead relative to each merge of DSM, on account of higher disk latencies (since a larger number of streams
is involved) as well as because each merge pass may require more than N=DB read I/O operations, as we noted in
Chapter 3. Hence, even if SRM requires a smaller number of merge passes on a given input, comparing the practical
performance of SRM and DSM's merging phases remains an interesting exercise. Our intuition is borne out in practice
when we nd that the data streaming speed attained by DSM while merging is noticeably better than that of SRM.
However, the overhead in our implementation of SRM relative to DSM is small enough that SRM's merging phase
easily outperforms DSM's merging phase by a signi cant margin.
In Section 4.2 below, we go through some preliminaries for external mergesort, discuss relevant previous work
and describe DSM. In Section 4.3, we present the SRM merging algorithm and sketch previous results for SRM.
In Section 4.4, we present our implementation of the SRM merging algorithm, including some pseudo-code. In
Section 4.5, we present various aspects of the performance comparison of the merging phases of SRM and DSM. In
Section 4.6, we present various applications of generalizations of the implementation techniques we developed here
and nally, in Section 4.7, we make some concluding remarks including avenues for related future work.

4.2 External Mergesort Preliminaries and Previous Work
External sorting has been studied extensively by many researchers. Knuth [Knu98] contains a comprehensive study of the field. External mergesort is the most widely used external sorting technique.

4.2.1 Run Formation + Merging Passes = Mergesort


While sorting a file of N items using an internal memory capable of storing M items, the run formation pass of external mergesort forms Θ(N/M) runs, each of size Θ(M). The implementation of the run formation phase has been researched extensively [Knu98, ADADC+97, LG98]. At any given time, run formation involves only one input stream and one output stream, so it is straightforward to adapt previously developed run formation techniques to run efficiently on parallel disks. Since run formation is not the focus of this chapter, for the algorithms considered here we use a simple quicksort-based run formation implementation, which achieves good cache utilization in internal memory. At the end of such a run formation pass, there are approximately N/M runs of length approximately M.
After the run formation pass, merge operations executed during the merging phase repeatedly merge together an appropriately chosen number R of runs until a single sorted run is output. The choice of R, the merge order, is typically made in accordance with the amount M of internal memory available, the size B of disk blocks, and, in the case of striped-I/O mergesort, the number D of disks. The total number of merge passes required by the mergesort is Θ(log_R(N/M)).

4.2.2 Participation Order of Blocks in a Merge


During a merge, data is transferred between internal memory and disks in blocks of B items. The blocks of the runs input to an external merge operation have a natural total order that is useful to define. In the process, we also define the crucial notion of the leading block of a run at any time.

Definition 17 Consider an external merge involving R input runs. A block of items is said to be depleted or consumed by a merge as soon as the last item in that block is written into the output run. The leading block of the rth run, where 0 ≤ r < R, at any time is defined as follows: At the beginning of the merge, the 0th block of the rth run is the leading block of that run. For i > 0, the ith block of the rth run becomes that run's leading block as soon as the (i − 1)st block of that run gets depleted by the merge. A block is said to begin participating in the merge as soon as it becomes the leading block of its run. The participation order⁴ of the blocks of the input runs is defined as follows: The 0th block of the rth run is the rth block in the participation order. The remaining blocks of the input runs follow the 0th blocks of the R runs in the order in which they begin participating in the merge.
4 Sometimes this order is also referred to as the consumption sequence [ZL98] of blocks of a merge.

In general the leading block of every input run needs to be in internal memory for a merge computation to proceed,
unless the input run has been completely depleted by the merge. As a result, the order in which input blocks are
read into memory tends to approximately follow the participation order of input blocks.

4.2.3 Previous Work on External Merging


It is beyond the scope of this chapter to carry out a comprehensive survey of the vast body of research on external
merging. Salzberg [Sal89] showed that double bu ering [Knu98] with reasonably large sized bu ers is an ecient
approach to implement external merging in general. Zheng and Larson [ZL96] suggested using six to ten oating
bu ers per input run on average and proposed a planning strategy that utilizes the extra bu er space to read
disk blocks in an order di erent from participation order with a view to optimizing seek time. Estivill-Castro
and Wood [ECW94] extended this work, in part, to exploit pre-existing order in input data. Recently, Zhang and
Larson [ZL98] suggested further improvements to the planning strategy via extended forecasting and block clustering.
The abovementioned approaches try to maximize overlapping of I/O and computation and minimize delays on
account of disk latency during external merging. While Zheng and Larson do apply their planning strategy in
a multiple disk situation, these studies are primarily oriented towards single disk systems. Our interest lies in
speeding up external merging using the orthogonal approach of maximizing I/O parallelism. Although some of the
abovementioned approaches can be used in conjunction with parallel I/O, we do not pursue that line of work in
this chapter. In this chapter, as in the NOWSort [ADADC+97] implementation, we assume that ecient lesystem
performance can be obtained so long as the logical block size B is reasonably large (of the order of 256 KB) and B is
also the size of an input or output bu er in internal memory. Since all the merging approaches mentioned in the above
paragraph use, on average, a constant number of input bu ers for each input run, the merge order is R = O(M=B)
and the number of merge passes required is roughly logM=B (N=M ), which is optimal for the case when D = 1.

4.2.4 Parallel Disk Merging in DSM


DSM is a double-buffered mergesort that can be easily implemented with striped I/O. The NOWSort [ADADC+97] implementation uses a disk-striped mergesort to locally perform an external sort at each individual workstation. By the nature of striped I/O, the size of each input buffer in DSM is DB and the order R of external merging in DSM is R ≈ M/2DB. Each run is striped blockwise across all the D disks in a round-robin manner. Initially the two buffers of each input run are read into internal memory. During the merge, whenever a run's leading buffer gets depleted, a parallel read operation to load the next DB items into the free input buffer is issued, while the merge proceeds using the other input buffer of the run. The output of the merge is written in units of DB items and is doubly buffered as well. Clearly, DSM has the advantage of being simple, enjoying full D-disk parallelism, and overlapping computation and I/O activity. However, the number of merge passes required in DSM is approximately log_{M/DB}(N/M), which can be larger than the optimal number log_{M/B}(N/M) of passes by a factor approaching log D. More importantly, from a practical point of view, the increase in the number of merge passes shows up even when the number D of disks is only moderately high, thus hindering the performance of DSM in practice.⁵

4.2.5 Difficulty of Merging Optimally with Parallel Independent Disks
Optimal sorting on parallel disks requires the ability to access D disks in a parallel, independent manner in which different logical blocks may be accessed on each disk; this access mode is therefore fundamentally different from striped I/O. In order to optimally sort using parallel disks, a mergesort needs to optimally merge a large number⁶ R = Θ(M/B) of runs striped across D disks in each merge operation.

Fundamental difficulties arise from the fact that very often during the merge there are times when the set of the R next participating blocks all reside on a small subset of the D disks, thereby causing "hotspots". Such hotspots are caused by the unpredictable, nonuniform rates at which runs get consumed by the merge. When there are hotspots, reading the set of the next R participating blocks can take many more parallel I/Os than the optimal number ⌈R/D⌉ of parallel I/Os. We refer the reader to [VS94] for more intuition regarding the difficulty of merging with parallel independent disks. Nodine and Vitter [NV95] overcame this difficulty by first approximately merging the runs and then performing additional passes to refine the merge. Aggarwal and Plaxton's Sharesort [AP94] technique does repeated merging and has accompanying overheads. Each of these approaches involves extra overheads and is not ideal for practical implementation.

4.3 SRM Algorithm


The SRM algorithm overcomes the difficulties involved in parallel disk merging by using a generalization of the forecasting technique in an elegant prefetching and buffer management scheme. In single disk systems, forecasting refers to the technique of using the last key of an in-memory block of a run to predict when the next block of that run will begin participating in the merge. In SRM, whenever intermediate runs are written to disk (during run formation or as the output of a merge operation), they are striped blockwise in a round-robin manner across the D disks. While writing the blocks of the rth run to disk, SRM implants the following forecasting information in each block of that

run: In the ith block of the rth run, it stores the key value of the last item in the (i + D − 1)st block⁷ of that run. SRM uses the forecasting information in the ith block of a run to predict the time at which the (i + D)th block of that run begins participating in the merge. If the ith block of a run is from disk d, then the (i + D)th block of that run is the next block of that run on disk d.

The round-robin blockwise striping employed by SRM while writing out runs to disk differs from the usual striping technique in the following sense: The first block of a run is written on a disk d0 chosen uniformly at random from among the D disks; thereafter, blocks of the run are placed in the usual round-robin fashion on disks (d0 + 1) mod D, (d0 + 2) mod D, and so on. This is the only application of randomization in SRM. The randomization helps SRM avoid poor merging performance for any particular ordering of items in the input file. The probabilistic analysis in Chapter 3 of SRM's I/O performance is with respect to the randomization mentioned here; there are no assumptions whatsoever regarding the input file to be sorted.

5 In the NOWSort parameters for the MinuteSort record, the size of the local internal memory and the size of the file that needs to be sorted locally at each workstation is such that the number of runs merged during the merging phase is very small, so their application of disk-striped mergesort involves only a single merge pass.

6 It is enough to merge R = Θ((M/B)^c) runs, for any constant 0 < c ≤ 1, in each merge operation in order to attain optimality (within constant factors) in the number of passes.
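The layout rule is easy to state in code. The sketch below is only an illustration of the Chapter 3 scheme described above (the record, key, and write interfaces are placeholders of ours, not SRM's actual I/O interface): block i of run r is written to disk (d0 + i) mod D, and it carries the last key of block i + D − 1 as its forecasting information.

    #include <cstddef>
    #include <random>
    #include <vector>

    // Sketch of SRM's randomized round-robin striping of one run, with the
    // forecasting key of Chapter 3 implanted in each block.
    struct Block {
        std::vector<long> keys;     // placeholder for the B records of the block
        long forecast_key = 0;      // last key of the block D - 1 blocks ahead
    };

    void stripe_run(std::vector<Block>& run, int D, std::mt19937& gen,
                    void (*write_block)(int disk, std::size_t index, const Block&)) {
        std::uniform_int_distribution<int> pick(0, D - 1);
        const int d0 = pick(gen);                       // the only randomization in SRM
        for (std::size_t i = 0; i < run.size(); ++i) {
            const std::size_t ahead = i + D - 1;        // implant forecasting information
            if (ahead < run.size())
                run[i].forecast_key = run[ahead].keys.back();
            write_block((d0 + static_cast<int>(i % D)) % D, i, run[i]);
        }
    }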

4.3.1 The Forecast and Flush Scheme


We now present the simple prefetching and buffer management scheme employed by SRM. The total number of internal memory blocks used by SRM as presented in Chapter 3 is D blocks for blocks actively being read into memory, R blocks to hold the R leading blocks of the R runs, R blocks for holding prefetched data, and an additional 2D blocks for output run data.⁸ The floating buffers technique is used to implement internal memory management. The internal merge process works on the R leading blocks corresponding to the R input runs to produce blocks of the output run. As soon as an in-memory block becomes a leading block, it is pinned in internal memory until it gets depleted. The same holds for a block that becomes a leading block while still on disk, as soon as it is read into internal memory. Whenever each one of a set of D blocks of the output run has its forecasting information, that set of blocks is written out with full D-disk parallelism. The forecast and flush scheme for prefetching and buffer management in order to "feed" the internal computation works as follows:

1. Until there are no more input blocks to be read into internal memory:

   (a) Find out, for each disk, the smallest block (with respect to participation order) among all blocks on that disk. Suppose that of these D smallest blocks (one per disk), ℓ of them are leading blocks of their respective runs, where 0 ≤ ℓ ≤ D.

   (b) If the number of free blocks available to hold prefetched data is D − ℓ − f, for some f > 0, then flush out f of the largest blocks (with respect to participation order) among all the prefetched blocks. This merely involves tracking down the f largest blocks among the prefetched blocks in internal memory and then simply marking them as free blocks, so there is no I/O involved in flushing. If at least one block is flushed, update the information about the smallest blocks on the D disks (since we now pretend as though the flushed blocks are on disk).

   (c) In parallel, read in the smallest block from each one of the D disks.

7 In the original presentation of SRM, the forecasting information in the ith block of a run is the key value of the first item in the (i + D)th block of that run. However, the approach here, taken from [Knu98], is a little simpler.

8 Knuth [Knu98] points out that SRM can be configured to work with any number R + m' of blocks for prefetched data as long as m' ≥ D − 1, so there is some flexibility here.

4.3.2 Provable Performance Guarantees


As stated above, SRM merges approximately M/2B runs in each merge operation, and so the number of merge passes it requires is Θ(log_{M/B}(N/M)), which is optimal. Write operations during SRM proceed at full disk parallelism as indicated above. A rigorous analysis of the expected number of parallel reads (Step 1c) required in SRM is presented in Chapter 3; here, the expectation is with respect to the randomization used by SRM to choose the starting disk for each intermediate output run, and the analysis is worst-case and so holds for arbitrary inputs.

The flushing of blocks in Step 1b above may lead to extra parallel read operations; so the expected number of parallel read operations required in an SRM merging pass can, in general, exceed N/DB. This brings us to the following definition.

Definition 18 The parallel I/O overhead of an SRM merging pass is defined as the ratio between the number of parallel read operations incurred during that merging pass and the optimal quantity N/DB.

The analysis of SRM in Chapter 3 implies that for most values of M, D, and B, which together determine the merge order in SRM, the expected value of this overhead is 1 or a small constant greater than 1. Although the upper bound analysis indicates that there are some R, D pairs for which the expected number of parallel read operations in an SRM merging pass is non-optimal by small factors, simulations (see Chapter 3) suggest that the upper bound analysis is pessimistic, and SRM's performance in practice is optimal, with overhead close to 1. Knuth [Knu98] recently identified SRM as the method of choice for sorting on parallel disks.

4.3.3 Data Structures Required in the Straightforward Implementation
In a direct implementation of SRM as stated in Chapter 3, one forecasting heap would be required for each one
of the D disks in order to keep track of the smallest block (w.r.t. participation order) on each disk at any time.
Each forecasting heap is basically a priority queue [CLR90]. In general, at any time, the forecasting heap for disk d
would contain R elements, each one corresponding to the smallest block of a unique input run on disk d at that
time. Whenever a parallel read operation is completed, all the D forecasting heaps would have to be updated. Flush
operations also require updating one or more forecasting heaps.

Additionally, SRM needs to maintain order among the prefetched blocks, since from time to time the algorithm may need to flush out of internal memory up to D − 1 of the largest prefetched blocks. There is also a need to ensure that I/O and computation are overlapped as far as possible.

4.4 Implementation Techniques and Data Structures for SRM
In this section, we present an implementation of the forecast and flush scheme using novel data structures and techniques. Our approach greatly simplifies the task of implementing SRM. We require only one priority queue in conjunction with D ordinary queues, as opposed to the D priority queues required by a naive implementation. Maintenance of order among prefetched blocks falls out as a natural consequence of our technique. We also propose a simple technique to overlap I/O and computation.

For the moment, we assume that we have access to an appropriate high-level interface to specify I/O. In the next section, we show how such an interface was implemented in the TPIE [Ven95] programming environment for external memory programming. Below, we describe a novel approach to processing forecasting data in the course of an SRM merge operation, which forms the basis of the SRM merge implementation. Then we describe the basic data structures involved in our implementation. We end this section with an algorithmic description of our implementation.

4.4.1 Managing Forecasting Data


In single disk merging, the key of the last item of a block of a run is the forecaster for the next block of that run, so there is no inherent need to explicitly store forecasting information in blocks. However, in SRM, given the ith block of a run, we need to be able to forecast when the (i + D)th block of that run will begin participating, so forecasting information has to be stored explicitly. While the original presentation of SRM proposed implanting one forecasting key (the key of the last item in the block D − 1 blocks farther along in the run) in each block of a run, here we propose that forecasting information be managed altogether separately.
Our proposal is motivated by the observation that in most applications in practice (and in particular, in database applications), the size of the forecasting data involved in a merge operation is much smaller than the size of the runs participating in the merge. If the size of an item is I bytes and the size of its key is K bytes, then while the size (in bytes) of the runs input to a merging pass is N · I bytes, the size of the corresponding forecasting data is only NK/B bytes, so the forecasting data is smaller by a factor of BI/K, which is large in practice. For instance, database benchmarks typically have I = 100 bytes and K = 10 bytes. If B = 1000 items⁹, then the forecasting data is 10000 times smaller than the files being merged. Thus while merging files totaling 1 GB of data, the total amount of forecasting data is only about 100 KB.
9 In our implementation and in [ADADC+ 97], B is even larger.

Given the importance of the role of forecasting in SRM and the size of internal memory likely to be used, in most situations all the forecasting data relevant to a merge operation can be kept resident in internal memory. One way to implement this would be to first read in the small file(s) containing all the forecasting data related to the input runs at the beginning of the merge operation. The forecasting data corresponding to the output run can either be written out all at once at the end of the merge operation or written out in a blocked manner from time to time during the merge. The forecasting data corresponding to the input runs can be disposed of after the merge operation. Using this approach, the number of parallel I/O operations for reading and writing forecasting data over a merging pass would be ⌈2NK/DB²I⌉, which is a small fraction of the minimum number ⌈2N/DB⌉ of parallel I/O operations required for transferring the run data blocks during the merging pass. An alternative implementation that is much easier and likely to be feasible most of the time is to never have forecasting data go to disk at any time throughout the entire mergesort. Forecasting data is intermediate data generated during SRM, so there is no fear of it being lost on account of system failures. In internal memory, two forecasting buffers can be used over all merging passes; the two buffers flip-flop in their role as buffers for input run forecasting data and output run forecasting data from each merge pass to the next.

There may, however, be exceptional situations in which keeping the forecasting data resident in internal memory may not be reasonable. For instance, this can happen while merging terabytes of data; in such cases, forecasting data may consume either a significant fraction of available internal memory or (in really extreme situations) may even exceed internal memory. For such cases, we propose yet another technique to process forecasting data that will consume only a small fraction of internal memory (only a small portion of the forecasting data corresponding to a merge operation will be in memory at a time) and will incur only a small I/O overhead relative to the I/O required for transferring the run data blocks.

In the remainder of this section we describe the precise operations on forecasting data in our implementation, followed by a discussion of how to handle the forecasting data when it cannot be kept in internal memory.

The Forecasting Heap


Consider an SRM merge operation involving R input runs. Since we propose to manage the forecasting data separately from the runs themselves, for each run there is a forecasting data run that contains the forecasting keys of that run in sorted order. In all situations other than the exceptional situation discussed in Section 4.4.1, the R forecasting data runs remain resident in internal memory during the course of the merge. The only operation that needs to be supported on the forecasting data runs in order to facilitate our implementation of SRM's forecast and flush scheme is
an incremental merge of the R forecasting data runs that outputs one forecasting key at a time. Such an incremental
merge can be implemented by having a priority queue that contains, at any time, the leading forecasting key of each
of the R forecasting runs. The smallest key in the priority queue is output at each step; if that key belongs to the
rth forecasting data run, then the next key from that run is inserted into the priority queue. The process can thus

68
continue one step at a time. We use the term forecasting heap to refer to the priority queue on the forecasting data
runs.
The forecasting key output at each step predicts the time at which some block from some run begins participating
in the merge. The order of the forecasting keys output by the forecasting heap is precisely the participation order
of their corresponding run blocks. By maintaining R counters, one per run, each initialized to 0 at the start of the
merge, we can keep track of the index of the rst block in the rth run whose participation time has not yet been
predicted; we simply increment the rth counter whenever the forecasting heap outputs a key from the rth run. The
forecasting heap can thus be used to perform a \lookahead" as we discuss below.
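The incremental merge just described requires only a handful of lines; the sketch below (our own names and types, not those of the TPIE code) keeps one (key, run) pair per forecasting data run in a standard priority queue and returns, one at a time, the run whose next block begins participating.

    #include <queue>
    #include <utility>
    #include <vector>

    // Sketch: incremental merge of R in-memory forecasting data runs.
    struct ForecastingHeap {
        std::vector<std::vector<long>> forecast_runs;   // sorted keys, one vector per run
        std::vector<std::size_t> next_key;              // next unread key of each run
        std::priority_queue<std::pair<long, int>,
                            std::vector<std::pair<long, int>>,
                            std::greater<>> heap;       // (key, run id), smallest on top

        explicit ForecastingHeap(std::vector<std::vector<long>> runs)
            : forecast_runs(std::move(runs)), next_key(forecast_runs.size(), 0) {
            for (int r = 0; r < static_cast<int>(forecast_runs.size()); ++r)
                if (!forecast_runs[r].empty()) {
                    heap.emplace(forecast_runs[r][0], r);
                    next_key[r] = 1;
                }
        }

        bool empty() const { return heap.empty(); }

        // Returns the run whose next block is the smallest in participation order,
        // and refills the heap with the next forecasting key of that run.
        int next_prediction() {
            const int r = heap.top().second;
            heap.pop();
            if (next_key[r] < forecast_runs[r].size())
                heap.emplace(forecast_runs[r][next_key[r]++], r);
            return r;
        }
    };

The R per-run counters of the text are then maintained by the caller: whenever next_prediction() returns run r, the caller increments its counter for r to obtain the index of the block whose participation time has just been predicted.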

When Forecasting Data is Too Large


In the exceptional situations when the forecasting data involved in a merge operation is so large that it would consume a significant portion of, or may even exceed, available internal memory, we propose the following implementation for the forecasting heap. In such situations, we choose a special block size B_f (in bytes) for processing files containing forecasting data runs. The size of B_f is chosen such that B_f = BI/b_f, where b_f is a small constant reasonably greater than 1 that is chosen for a desired performance. The idea is that a portion of size R · B_f bytes of internal memory is dedicated to the forecasting heap; a buffer of B_f bytes is used to buffer each forecasting data run. Whenever the buffer of a forecasting data run gets depleted, we read in (using a sequential I/O operation from a single disk) the next B_f bytes from the forecasting data run. In this manner the forecasting heap can be easily implemented.

The R · B_f bytes occupied by the forecasting heap can be made small by choosing a large value of b_f, but since the total number of extra I/O operations required over a merging pass on account of the forecasting heap would be ⌈NKb_f/B²I⌉, the parameter b_f should not be made too large. As a fraction of the minimum number ⌈N/DB⌉ of (parallel) I/O operations required during a merge, the I/O overhead of the forecasting heap is DKb_f/BI, which is extremely small in most practical situations. For example, with K = 10 bytes, I = 100 bytes, B = 1000, and b_f = 8, the fractional I/O overhead is D/1250, which is small in practice. The memory usage for the forecasting heap in this case would be less than 1/16 of the internal memory usage of SRM, assuming that at least 2R + D blocks of internal memory are used by SRM to process the main merge. The memory and I/O overhead of the forecasting data corresponding to the output run of a merge operation can be somewhat smaller, but are in the same ballpark.

4.4.2 Other Data Structures and Primitive Operations


Let m denote the total number of internal memory blocks used to hold input run blocks, including leading blocks, prefetched blocks, and active blocks into which read operations are currently reading data. Our implementation maintains a set of R internal memory pointers, denoted Leading_0, Leading_1, ..., Leading_{R-1}, pointing to the leading blocks of the R runs. The implementation also maintains a main merge heap that is continuously merging leading blocks to produce items of the output run. All of the R runs' data blocks in internal memory other than the R leading blocks are maintained

in a queue of placeholders, called the lookahead queue LQ for reasons that will soon be clear. Each placeholder is a structure with a field block_ptr to store a pointer to a block in internal memory, a field run_id to store the identity of a run, and a field block_num to store the index of a block within a run. We use LQ.head and LQ.tail to denote the placeholders at the head and the tail of the queue LQ. For each disk 0 ≤ d < D, we maintain an occupancy queue OQ_d of elements. Each element is a pointer to some placeholder stored in the lookahead queue LQ. Elements in OQ_d always point to placeholders of blocks on disk d that are not yet in internal memory. We use OQ_d.head and OQ_d.tail to denote the elements at the head and tail of the queue OQ_d. By appending (resp., prepending) an element to OQ_d, we refer to the act of adding an element behind (resp., in front of) the element OQ_d.tail (resp., OQ_d.head). The lookahead queue as well as all the occupancy queues must be traversable in both directions. We use s_0, s_1, ..., s_{R-1} to denote the starting disks of the R input runs. Each starting disk is chosen randomly when the run is begun and is known to the merging algorithm.

The purpose of the lookahead queue LQ is to maintain prefetched input run blocks in participation order. The purpose of each occupancy queue OQ_d, where 0 ≤ d < D, is to maintain (pointers to) placeholders corresponding to blocks from disk d in their participation order, so that blocks from disk d can be read by Parallel Read operations in proper sequence.

Our implementation and SRM's properties ensure that the number¹⁰ of elements in the lookahead queue LQ is never more than max{m, RD} + R + D, and the number of elements in any occupancy queue cannot exceed R + D. Since the elements of LQ and of the OQ_d's are O(1) bytes in size, we can very simply implement LQ and all the OQ_d queues using statically allocated circular arrays in the obvious manner, with insignificant space overhead in practice.
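In C++ the structures just described can be declared roughly as follows; the field names follow the text, while the circular-queue mechanics and capacities are illustrative assumptions rather than the exact TPIE code.

    #include <cstddef>
    #include <vector>

    // Sketch of a placeholder and of the statically allocated circular queues used
    // for the lookahead queue LQ and the occupancy queues OQ_d.
    struct Placeholder {
        char* block_ptr = nullptr;   // nil when the block is not in internal memory
        int   run_id    = 0;         // which input run the block belongs to
        long  block_num = 0;         // index of the block within its run
    };

    template <class T>
    struct CircularQueue {
        std::vector<T> slots;        // statically allocated backing array
        std::size_t head = 0, count = 0;

        explicit CircularQueue(std::size_t capacity) : slots(capacity) {}

        bool empty() const { return count == 0; }
        T&   at(std::size_t i) { return slots[(head + i) % slots.size()]; }  // i = 0 is the head
        T&   front()           { return at(0); }
        T&   back()            { return at(count - 1); }

        void push_back(const T& x)  { at(count) = x; ++count; }
        void pop_front()            { head = (head + 1) % slots.size(); --count; }
        void push_front(const T& x) {                        // needed when flushed blocks
            head = (head + slots.size() - 1) % slots.size(); // are put back "on disk"
            slots[head] = x;
            ++count;
        }
    };

    // Per the bounds above, LQ needs at most max{m, RD} + R + D slots and each
    // OQ_d at most R + D slots, so fixed-size arrays suffice.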
Next we define some primitive operations in order to facilitate the presentation of our implementation; a minimal code sketch of these two operations is given after the list below.

1. Lookahead(). The operation Lookahead() gets the next forecasting key from the forecasting heap and updates the forecasting heap appropriately, as discussed earlier. If the key so obtained predicts the participation time of the ith block in the rth run, then a new placeholder p with p.block_num := i, p.run_id := r, and p.block_ptr := nil is appended to LQ, which makes LQ.tail = p. Finally, an element pointing to placeholder p is appended to OQ_d, where d = (s_r + i) mod D is the disk on which the ith block of run r resides.

2. Parallel Read. The operation Parallel Read issues read requests¹¹ for a set of at most D blocks, one per disk. The reads are carried out in parallel. Parallel Read is non-blocking in the sense that it returns control immediately without waiting for the reads to actually complete. The precise block of disk d for which a read is issued is determined as follows: If the occupancy queue OQ_d is empty, no block is read from disk d. Otherwise, if OQ_d.head points to placeholder p in LQ, the following is done:

   (a) Block p.block_num of run p.run_id is read from disk d into internal memory.
   (b) The field p.block_ptr is set to point to the newly read block in internal memory.
   (c) Element OQ_d.head is removed from OQ_d.

10 Although the lookahead queue LQ can have R entries for each of the D disks, the number of blocks of internal memory actually in use can never exceed m.

11 Our implementation uses memory-mapped I/O. The memory-mapped I/O calls we use are an enhanced version [CABG98] of the original memory-mapped calls provided by Digital Unix. The enhanced version sends off an asynchronous I/O request under the hood as soon as the call is made. We implement a D-disk parallel I/O operation by issuing D such calls for blocks on different disks.
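Using the declarations sketched earlier, the two primitives can be written roughly as follows. The forecasting-heap interface, the per-run block counters, and issue_async_read are our stand-ins for the forecasting heap of Section 4.4.1 and for the non-blocking, memory-mapped reads of the actual implementation.

    // Sketch of the two primitives; OQ[d] stores pointers to placeholders in LQ,
    // which is safe because LQ's backing array is allocated once and never moves.
    struct MergeState {
        CircularQueue<Placeholder>               LQ;          // lookahead queue
        std::vector<CircularQueue<Placeholder*>> OQ;          // one occupancy queue per disk
        std::vector<int>                         start_disk;  // s_0, ..., s_{R-1}
        int                                      D;
    };

    // Lookahead(): consume one forecasting key and append the corresponding
    // placeholder to LQ and to the occupancy queue of the disk holding the block.
    void Lookahead(MergeState& st, ForecastingHeap& fheap,
                   std::vector<long>& next_block /* per-run counters */) {
        const int  r = fheap.next_prediction();
        const long i = next_block[r]++;               // block i of run r is predicted next
        st.LQ.push_back(Placeholder{nullptr, r, i});
        const int d = (st.start_disk[r] + static_cast<int>(i % st.D)) % st.D;
        st.OQ[d].push_back(&st.LQ.back());
    }

    // Parallel_Read(): for every disk whose occupancy queue is nonempty, issue a
    // non-blocking read for the block named by the placeholder at its head.
    void Parallel_Read(MergeState& st,
                       char* (*issue_async_read)(int disk, int run, long block)) {
        for (int d = 0; d < st.D; ++d) {
            if (st.OQ[d].empty()) continue;
            Placeholder* p = st.OQ[d].front();
            p->block_ptr = issue_async_read(d, p->run_id, p->block_num);
            st.OQ[d].pop_front();
        }
    }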

4.4.3 Basic Ideas of Our Implementation


Our implementation reads blocks of the input runs into internal memory only via the Parallel Read operation described above. We maintain the invariant that whenever a Parallel Read operation is executed, the block corresponding to the placeholder pointed to by OQ_d.head for each 0 ≤ d < D is the smallest block on disk d with respect to participation order at that time. Thus, the smallest block is read from each of the D disks, which implements Step 1c of the forecast and flush scheme of Section 4.3.1. Our implementation also maintains the important invariant that the participation order of the input run blocks with placeholders in LQ is precisely the order of their placeholders. A placeholder p in LQ has p.block_ptr = nil if and only if its corresponding block is not in internal memory. Thus, in order to flush f blocks as in Step 1b of Section 4.3.1, we simply traverse LQ from tail to head, find the first f placeholders p_0, p_1, ..., p_{f-1} with p_j.block_ptr ≠ nil, and among other things set p_j.block_ptr := nil.
The lookahead queue feeds blocks to the merge process; the head LQ :head of the lookahead queue gets removed
whenever a run's leading block gets depleted by the merge. Placeholders are added at the end of the lookahead queue
by the Lookahead ( ) operations. Even if a block has a placeholder in LQ , the block is not necessarily in internal
memory. The order in which blocks are read into memory is determined by the occupancy queues. The occupancy
queue feeds the Parallel Read operations; the head of each occupancy queue OQ d , where 0  d < D, is removed
during a Parallel Read operation. Elements are added at the end of the occupancy queues by the Lookahead ( )
operations.
Whenever a Lookahead() operation is executed, a placeholder p for the block that immediately follows LQ.tail in participation order is created and made the new LQ.tail. In addition, the Lookahead() operation adds an element pointing to p at the end of OQ_d, where d is the disk on which the block resides. In order to ensure that a Parallel Read operation reads one block from each disk, we simply have to ensure that OQ_d is nonempty for all 0 ≤ d < D. Hence, just before executing a Parallel Read operation we perform as many Lookahead() operations as are necessary to ensure that each OQ_d queue has at least one element. The precise number of Lookahead() operations needed to do so is unpredictable and depends upon the disk distribution of the relevant blocks and their participation order. The above mechanism automatically guarantees the invariant that each Parallel Read operation always reads the smallest block from each disk at that time.

Our implementation guarantees that whenever a leading run block gets depleted, a read request for the next block of that run (unless the run itself is depleted) was already issued in some previous Parallel Read operation, and moreover, that block is precisely the block pointed to by p.block_ptr, where placeholder p = LQ.head. In order to do so, we maintain a special variable, marked, that remembers the placeholder of the smallest block for which a read was issued in the most recent Parallel Read operation. Whenever the placeholder marked begins participating in the merge, we temporarily interrupt the (internal memory) merging computation, employ the mechanism of the previous paragraph to ensure full parallelism for the Parallel Read operation, execute the Parallel Read operation, update the value of marked, and then resume the merging computation. Since Parallel Read is non-blocking, our implementation performs I/O overlapped with computation. Sometimes, in the above process, a few blocks may need to be flushed.

4.4.4 Algorithmic Description


We are now in a position to give an algorithmic description of our implementation. For simplicity, we do not mention the I/O operations in the context of the forecasting heap, with the understanding that the implementor will choose an appropriate technique and perform the corresponding operations, based upon the discussion in Section 4.4.1. (A compact code sketch of the flush-and-read portion of the algorithm, Steps 6-8, is given after the step-by-step description below.)

We use two variables marked and next_marked to record appropriate placeholders in the lookahead queue LQ. As mentioned earlier, the total number of blocks available at the start of the merge operation for input run blocks (including leading blocks, blocks currently being read into internal memory, and prefetched blocks) is assumed to be a number m ≥ 2R + D. The output run has two DB-item-sized buffers that are used in the usual double-buffered fashion, with writes at full D-disk parallelism. Whenever an input run block gets depleted by the merge or when an input run block gets flushed, the number of free blocks increases by 1. If p is a placeholder in LQ, then we follow the convention that p.block_ptr is nil whenever the block p.block_num of run p.run_id is not in internal memory and is still on disk.

1. For each run 0  r < R, read into internal memory its rst block (on starting disk sr ). The pointers Leading 0 ,
Leading 1 , : : : , Leading R 1 are made to point to the corresponding rst blocks. The total number of parallel
I/O operations required to implement this step is equal to the maximum number of rst blocks on any one disk.

2. Insert the first key from each one of the R forecasting data runs into the forecasting heap. Insert the first item
from each run into the main merge heap.

3. Initialize the lookahead queue LQ and the D occupancy queues OQ_d, for 0 ≤ d < D, to be empty queues.

4. While there exists at least one empty occupancy queue OQ_d and the forecasting heap is not empty, execute a
Lookahead() operation.

5. Set marked := LQ.head.

6. [Get ready for the next parallel read in the merge by flushing blocks if necessary.] If there are at least D free
blocks in internal memory, then proceed to Step 7. Otherwise let the number of free blocks in internal memory
be D - f, where f ≥ 1. Beginning with the tail of LQ, traverse LQ towards its head until f placeholders p_0, p_1,
..., p_{f-1} are found such that p_j.block_ptr ≠ nil, for 0 ≤ j < f. Suppose that disk d_j is the disk from which
the block p_j.block_num of run p_j.run_id originates. Then, for each 0 ≤ j < f (a sketch of this flushing step appears after the algorithm),

(a) Set p_j.block_ptr := nil.


(b) If the occupancy queue OQ_{d_j} is empty, insert an element pointing to the placeholder p_j into OQ_{d_j};
otherwise prepend an element pointing to placeholder p_j to the head of queue OQ_{d_j}.

(c) If placeholder p_j is ahead of the placeholder marked in queue LQ, then set marked := p_j.

7. Execute a Parallel Read operation.

8. While there exists at least one empty occupancy queue OQ_d and the forecasting heap is not empty, execute a
Lookahead() operation.

9. Set next_marked := p', where p' is the placeholder in LQ closest to LQ.head among the placeholders pointed
to by the elements OQ_d.head, for 0 ≤ d < D.

10. flag := 0.

11. While (flag = 0)

(a) Generate the next item x from the main merge heap. Let r be the run containing x.
(b) If run r has no more items to be merged, free the leading block Leading_r and proceed to Step 11d.
Otherwise, if the leading block Leading_r of run r has just been depleted, free that block, set Leading_r to
point to the block p.block_ptr, for placeholder p = LQ.head, and remove placeholder p from the lookahead
queue LQ. If p = marked, set flag := 1.
(c) Insert the next item from run r into the main merge priority queue. (If the leading block Leading_r just
changed in Step 11b above, the next item from run r is the first item of block Leading_r.)
(d) Add item x to the output run buffer.
(e) If adding x completes the block of an output run, add the key of x to the forecasting data run of the
output run.
(f) If the current output run buffer now has DB items, then switch output buffers and issue a non-blocking
request to write out DB items to disk with full D-disk parallelism.

(g) If the main merge heap is empty, set flag := 2.

12. If flag = 1, set marked := next_marked and loop back to Step 6.

13. If flag = 2, write to disk the remaining items from the output run buffer; the merge is completed.
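The flushing rule of Step 6 can be sketched in the same illustrative style; Placeholder and the queue types repeat the assumed definitions from the sketch in Section 4.4.3 so that the fragment stands alone, and flush_for_parallel_read is a name of our choosing, not part of the actual implementation.

```cpp
#include <cstddef>
#include <deque>
#include <iterator>
#include <list>
#include <vector>

struct Placeholder {                       // illustrative, as in the earlier sketch
    int run_id; long block_num; int disk;
    const char* block_ptr = nullptr;       // non-null iff the block is in internal memory
};

using LookaheadQueue  = std::list<Placeholder>;
using OccupancyQueues = std::vector<std::deque<Placeholder*>>;

// Step 6: if only D - f free blocks remain (f >= 1), walk LQ from tail to head,
// flush the first f placeholders whose blocks are resident, and put each flushed
// block back at the *head* of its disk's occupancy queue so it is re-read first.
// 'marked' is updated in place if a flushed placeholder lies ahead of it.
inline void flush_for_parallel_read(LookaheadQueue& LQ, OccupancyQueues& OQ,
                                    std::size_t f, LookaheadQueue::iterator& marked) {
    std::size_t flushed = 0;
    for (auto it = LQ.rbegin(); it != LQ.rend() && flushed < f; ++it) {
        if (it->block_ptr == nullptr) continue;   // block not resident; nothing to flush
        it->block_ptr = nullptr;                  // (a) drop the in-memory copy
        Placeholder* p = &*it;
        OQ[p->disk].push_front(p);                // (b) it must be read again before anything queued behind it
        auto fwd = std::next(it).base();          // forward iterator to the same element
        // (c) if this placeholder is ahead of 'marked' (closer to LQ.head), pull marked up to it
        for (auto scan = LQ.begin(); scan != marked; ++scan) {
            if (scan == fwd) { marked = fwd; break; }
        }
        ++flushed;
    }
}
```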

4.5 Performance Results


In this section we present the performance of SRM in practical scenarios and, in particular, compare its performance
to that of an efficient implementation of DSM and demonstrate that SRM significantly outperforms DSM. We begin
with a description of the computer system and programming environment of our implementations.

4.5.1 Computer System and Environment


Our experiments were carried out on a Digital Personal Workstation with a 500 MHz EV5.6 (21164A) CPU. We used
D = 6 state-of-the-art ST34501W Cheetah [Inc] disks for our experiments, two disks on each one of three Ultra-Wide
SCSI buses attached to the system. The operating system was Digital Unix Version 4.0.
Both algorithms were implemented using the Transparent Parallel I/O Environment (TPIE) [Ven95] programming
environment, which was originally developed by Darren Vengroff [Ven94] for his PhD and is currently
being extended as part of an ongoing project at Duke University's Center for Geometric Computing. TPIE is a
stream-oriented environment written in C++ designed to enable the implementation of efficient external memory
algorithms on single and multiple disk systems. It provides basic building blocks for programmers to use while writing
external memory programs. TPIE has built-in features such as a memory manager that manages buffers; it also keeps
track of the amount of internal memory used by a program, which is very useful to control memory utilization during
experiments as well as in memory management in general.
We implemented an interface for parallel disk streams striped in the usual round-robin manner in units of logical
blocks across the six disks. Each striped stream consists of one Unix file on each disk; each disk is a separately
mounted filesystem. In order to facilitate randomized striping, our interface allows an application to begin striping
on any disk of its choice. The I/O operations of TPIE used in our experiments were implemented using an enhanced
version [CABG98] of memory-mapped I/O calls. A parallel I/O operation is simulated by six memory-mapped I/O
calls, one to each disk. Each memory-mapped I/O call is a non-blocking call that instantaneously dispatches off an
asynchronous I/O operation under the hood. In all our experiments, the size of the unified buffer cache was small
enough (relative to the amount of data involved while sorting) so that effects from the buffering in the unified buffer
cache were negligible.
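The following fragment is only a generic stand-in that shows the shape of a parallel I/O operation as one request per disk; it uses ordinary POSIX pread calls issued from one thread per disk rather than the enhanced memory-mapped mechanism of [CABG98], and it blocks on completion, whereas the calls used in our experiments are non-blocking.

```cpp
#include <cstddef>
#include <sys/types.h>
#include <thread>
#include <unistd.h>
#include <vector>

// Generic stand-in for a parallel read: one request per disk, issued concurrently.
struct DiskRequest {
    int   fd;       // file descriptor of the Unix file on that disk
    off_t offset;   // byte offset of the 256 KB logical block to read
    char* dest;     // destination buffer in internal memory
};

inline void parallel_read(const std::vector<DiskRequest>& reqs, std::size_t block_bytes) {
    std::vector<std::thread> workers;
    workers.reserve(reqs.size());
    for (const DiskRequest& r : reqs) {
        workers.emplace_back([r, block_bytes] {
            // Read one logical block from this disk; error handling omitted in this sketch.
            (void)::pread(r.fd, r.dest, block_bytes, r.offset);
        });
    }
    for (std::thread& t : workers) t.join();   // wait for all D per-disk reads
}
```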

Block Size
In our experiments, we used a logical block size of 256 KB; thus, all the memory-mapped I/O calls mapped regions
of size 256 KB. It would be interesting to explore the use of a smaller logical block size if we had control over disk
block allocation, disk scheduling, and so on, because we could then use techniques as in [ZL96, ZL98] to achieve good I/O
performance. Since we use filesystems and do not have such control, we can still ensure that disk block allocation,
readahead, and disk scheduling are done efficiently by using large block sizes in the memory-mapped calls (hence
our choice of the 256 KB size) and by making sure that each Unix file is accessed sequentially (so that filesystem
readahead is triggered wherever possible). We set the block size to 256 KB for both DSM and SRM so as to allow
proper comparisons in performance.

4.5.2 Input Characteristics


For all our experiments, we considered items of size I = 104 bytes, with keys of size K = 8 bytes, and a block size of
B = 2520 items (256 KB). The unsorted input stream for each run of the two sorting algorithms was always
a uniformly randomly generated sequence of items. Both SRM and DSM, in our implementation, use the same run
formation algorithm, so for a given internal memory size both algorithms have to merge an identical number of
runs during their merging phase.
Because of the randomization used by SRM, it is hard to construct a particular sequence of input records that
brings out bad I/O performance in SRM's merging phase; indeed, the whole point of randomization is to ensure
that no pathologically ordered input file can hinder the performance of SRM. Moreover, the very nature of SRM
ensures that its performance cannot degrade if some run suddenly gets consumed at a higher rate relative to others;
this is because all runs are striped and because each Parallel Read operation always reads into internal memory
the smallest block from each disk. We believe that skewed and nonuniformly randomly generated inputs cannot
significantly change the performance characteristics of SRM and DSM.

4.5.3 DSM and SRM Configurations


Each buffer of a striped-I/O parallel disk stream contains DB items. Since each striped-I/O stream uses double
buffering, the amount of internal memory used by each striped-I/O parallel disk stream is 3.2 MB, which is slightly
larger than 2DBI bytes owing to some other implementation-related overheads.
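As a quick sanity check on the 3.2 MB figure, the following throwaway snippet evaluates 2DBI with the experimental parameters of Section 4.5.2 (D = 6, B = 2520 items, I = 104 bytes); the small difference from 3.2 MB is the implementation-related overhead mentioned above.

```cpp
#include <cstdio>

int main() {
    const long D = 6;       // disks
    const long B = 2520;    // items per logical block
    const long I = 104;     // bytes per item
    // Two DB-item buffers per striped-I/O stream:
    const long bytes = 2 * D * B * I;    // = 3,144,960 bytes
    std::printf("2DBI = %ld bytes (about %.1f MB)\n", bytes, bytes / 1.0e6);
    return 0;
}
```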

Run Formation and the Number of Runs Formed


The run formation stage of both SRM and DSM involves at most two striped-I/O parallel disk streams active at any
time. Thus, the number of runs generated during run formation is determined by the amount of internal memory the
algorithm is allowed to use, the amount of internal memory consumed by the buffers of a striped-I/O stream, and the
amount of internal memory that TPIE reserves for program variables. In the rest of this section, we use the symbol
U to denote the number of runs formed during the run formation stage of any given experiment.

Determining the Merge Order


During the merging pass, SRM and DSM use internal memory in very different ways. Given the same amount of
memory, the maximum merge order of an SRM merge operation is significantly larger than the maximum merge order
of a DSM merge operation. During a merge of order R, DSM requires enough internal memory to have 3.2 MB-sized
buffers for each one of R + 1 streams. Thus the merge order for DSM is determined in a straightforward manner.
On the other hand, in order to carry out an R-way merge, SRM requires only one buffer of size 3.2 MB (corresponding
to two buffers of size DB for its output run), 2R + D buffers of size 256 KB (corresponding to B), space
for forecasting data, and some other small per-run memory overheads. In all our experiments, the forecasting data
consumed a very small fraction of the total amount of internal memory available: it was always smaller than 100 KB,
whereas the internal memory available in our two sets of experiments was approximately 15 MB and approximately 24 MB, respectively. It is
very often the case in SRM that there is a wide range of feasible values for R for which the resulting number ⌈log_R U⌉
of merging passes required to merge the initial U runs is the same optimal value. In such cases, we set SRM's merge
order R to be equal to the smallest feasible value resulting in an optimal number ⌈log_R U⌉ of merging passes,
but for practical reasons we never set R higher than 19. The advantage of using the smallest possible merge order is
that the number of files involved in the merge is the smallest possible, which tends to keep the amount of disk latency
incurred while merging as small as possible. Another advantage is that the average amount of internal memory space
available per input run during a merge operation is increased, which has the effect of minimizing the overhead
(defined in Section 4.3) incurred in every merging pass.
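The rule just described (take the smallest feasible R that still achieves the optimal number ⌈log_R U⌉ of merging passes, but never more than 19) can be computed as in the following sketch; R_max, the largest merge order that the available memory permits, is an assumed input, and the function names are ours.

```cpp
#include <algorithm>

// Number of merging passes needed to merge U initial runs with merge order R (R >= 2):
// the smallest p such that R^p >= U.
inline int passes(long U, long R) {
    int p = 0;
    for (long capacity = 1; capacity < U; capacity *= R) ++p;
    return p;
}

// Smallest feasible merge order achieving the optimal number of passes, capped at 19.
// R_max is the largest merge order that the available internal memory permits
// (an assumed input; it depends on the 2R + D block budget discussed above).
inline long srm_merge_order(long U, long R_max) {
    const long cap = std::min(R_max, 19L);
    if (U <= cap) return U;                    // a single merge finishes the job
    const int best = passes(U, cap);           // optimal pass count within the cap
    for (long R = 2; R <= cap; ++R)
        if (passes(U, R) == best) return R;    // smallest R attaining that optimum
    return cap;                                // unreachable; keeps the compiler happy
}
```

For example, with U = 47 initial runs this rule yields R = 7, matching the entry for N = 4 million items in Table 4.1.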
Even though we try to keep the merge order of SRM as small as possible, the merge order of SRM merging
operations is significantly greater than that of DSM merging operations. Hence one expects that DSM merging passes will
have higher disk locality and that SRM merging passes will incur more overhead relative to DSM merging passes on
account of disk latency.

4.5.4 Performance Numbers and Graphs


In this section we report on two sets of experiments comparing the performance of SRM and DSM. In both cases,
the input file size was varied in units of one million items (approximately 100 MB) in the range from 1 million items (approximately 100 MB)
to 10 million items (approximately 1 GB). In the first set of experiments, the amount of internal memory available to the sorting
algorithm was 15 MB, whereas in the second set of experiments it was 24 MB.
In Figure 4.1 and Tables 4.1 and 4.2, we present the performance numbers for the merging phases of both algorithms
for the two sets of experiments. Table 4.1 corresponds to the experiments with internal memory size 15 MB,
and Table 4.2 corresponds to the experiments with internal memory size 24 MB.
N (millions)       1     2     3     4     5     6      7      8      9      10
U                  12    24    36    47    59    71     82     94     106    117
Passes_DSM         3     3     4     4     4     4      5      5      5      5
R_DSM              3     3     3     3     3     3      3      3      3      3
Time_DSM (s)       18.3  35.7  70.8  95.8  122   141.2  198.4  226.2  254.8  284.2
Rate_DSM (MB/s)    32.5  33.3  33.6  33.1  32.5  33.7   35.0   35.1   35.0   34.9
Passes_SRM         1     2     2     2     2     2      2      2      2      2
R_SRM              12    5     6     7     8     9      10     10     11     11
Time_SRM (s)       6.3   26.6  39.2  52.0  65.8  79.2   91.0   104.8  119.2  138.4
Rate_SRM (MB/s)    31.5  29.8  30.4  30.5  30.2  30.1   30.5   30.3   30.0   28.7
Overhead           1.04  1.03  1.04  1.03  1.03  1.04   1.03   1.03   1.03   1.03

Table 4.1: Comparing SRM and DSM when internal memory is 15 MB and there are D = 6 disks. The input size
N is in units of 1 million items, each of size 104 bytes.
N (millions)       1     2     3     4     5     6      7      8      9      10
U                  7     14    20    27    33    40     46     53     60     66
Passes_DSM         2     2     2     3     3     3      3      3      3      3
R_DSM              5     5     5     5     5     5      5      5      5      5
Time_DSM (s)       11.5  21.2  32.4  63.0  79.6  95.0   110.6  127.2  143.2  158.2
Rate_DSM (MB/s)    34.5  37.4  36.7  37.8  37.4  37.6   37.7   37.4   37.4   37.6
Passes_SRM         1     1     2     2     2     2      2      2      2      2
R_SRM              7     14    5     6     6     7      7      8      8      9
Time_SRM (s)       6.0   20.0  36.0  48.8  62.4  75.4   89.2   100.4  114.6  126.0
Rate_SRM (MB/s)    33.1  19.8  33.1  32.5  31.8  31.6   31.1   31.6   31.2   31.5
Overhead           1.00  1.00  1.00  1.01  1.01  1.01   1.01   1.01   1.01   1.01

Table 4.2: Comparing SRM and DSM when memory is 24 MB and there are D = 6 disks. The input size N is in
units of 1 million items, each of size 104 bytes.

Each data point in Tables 4.1 and 4.2 is based upon the average value obtained by conducting the same experiment five times with a different
random input on each run. The graph in Figure 4.1 plots the average time in seconds required to complete the
merging phase of SRM or DSM at a given data point. The tables provide additional insightful information. Of
particular interest is the average data streaming rate during a merging phase, which is defined as the total amount
of I/O (reads as well as writes) in bytes during the merging phase, divided by the time required to complete the
merging phase.
Each table lists the total number U of runs formed during run formation, which is identical for both SRM and
DSM. For DSM, each table lists the merge order R_DSM (each merge operation, except possibly the last merge operation
of a DSM merging phase, has this merge order), the number Passes_DSM of passes, the time Time_DSM required to
complete the merging phase, and the data streaming rate Rate_DSM attained by DSM during its merging phase.
For SRM, for each data point, the table lists the number Passes_SRM of passes, the merge order R_SRM (of each
SRM merging operation in the merging phase except possibly the last one), the time Time_SRM required to complete
the merging phase, SRM's data streaming rate Rate_SRM, and the overhead (listed in the row labeled Overhead)
corresponding to extra parallel read operations.

4.5.5 Relative Performance Comparisons


SRM significantly outperforms DSM in the experiments. For the experiments with internal memory size 15 MB,
SRM's performance is better by a margin of almost 50%. SRM's margin of improvement is less impressive with
internal memory size 24 MB, but the improvement is still in the 25% ballpark for input sizes larger than N = 4
million items. There is one data point (N = 3 million items, with memory size of 24 MB) at which DSM is actually
marginally better than SRM; this happens to be the only point in all our experiments at which DSM and SRM require
the same number of passes.

[Figure 4.1: Merging Phase Timings of SRM and DSM. The plot shows the merging-phase time in seconds as a
function of the input size N in millions of items, with one curve for each of the configurations "DSM,15", "SRM,15",
"DSM,24", and "SRM,24".]
The average data streaming rate of DSM is consistently better than that of SRM, as anticipated, but SRM
outperforms DSM because of its smaller number of passes. When we compare SRM's streaming rate for 15 MB with
its streaming rate for 24 MB, we see an overall improvement in streaming rate for the larger internal memory size,
since the number of runs is reduced and SRM's merge order tends to be smaller. We are not able to explain the
improvement in the streaming rate of DSM's performance in the experiments with 24 MB relative to its performance
in the experiments with 15 MB; the improvement is somewhat surprising because DSM's merge order increases from 3
to 5 when the internal memory is increased.
The overhead in the total number of Parallel Read operations required by SRM is small, very close to 1, as
expected on the basis of the previous analysis in Chapter 3. This suggests that the elapsed-time performance is
not hindered by the flush operations incurred by SRM; if the implementation of the Parallel Read can be improved,
which we think is possible, we can obtain a further improvement in SRM's elapsed-time performance. We briefly mention
possible approaches to improve the performance of each parallel I/O operation in Section 4.5.6.
One interesting observation regarding our data points is the sudden drop in streaming rate when the merge order
becomes 14, as in the case with N = 2 million items and an internal memory size of 24 MB. In this case, the
streaming rate is 22 MB/sec, although a streaming rate of almost 30 MB/sec is possible for R = 11, N = 9 million
items and N = 10 million items, and an internal memory size of 15 MB. We were not able to account for the sudden

drop in streaming rate; some preliminary experiments indicate that the most important reason for the sudden drop in
streaming rate may not be the increased seeking but some other effects, perhaps related to the number of files opened
by the application at any time.

4.5.6 Improving I/O Performance


SRM can be made to perform even better if the implementation of the parallel I/O is improved. The hardware being
used (the CPU and the I/O system) and our experience with parallel I/O systems [BSG+ 98, BGH+ 97] suggest that
there is scope for improving the performance of parallel I/O operations in our implementation, thereby improving the
streaming data rates. Possible techniques to improve performance when we have control over disk block allocation,
disk scheduling, etc. (as discussed in Section 4.5.1) include controlling the layout of blocks of runs and carefully
planning the sequence in which blocks from each disk are read into memory [ZL96, ZL98]. Another approach that
may help in improving I/O performance is to use techniques such as the one in [BSG+ 98], which exploit the readahead
mechanism used by disk drive controllers to load data into their track buffers. A simple high-level approach that may
result in improved performance is to split the set of disks into two sets and then use one set to store input runs
and the other for output runs, swapping their roles after each merge pass; this approach ensures that writes and reads
do not interfere during the external merging process.

4.6 Other Applications


In this section we briefly mention other situations in which the data structures and techniques we developed to
implement SRM can be used fruitfully.

4.6.1 Distribution Sort and Multi-way Partitioning


Consider a distribution sort or the partitioning of a stream into several other streams on an I/O system with D disks.
Such a distribution/partitioning type of computation may be required as part of some other database operation, for
example in a hash join. We could envisage using striped I/O to ensure perfect parallelism on all D disks. But the use
of striped I/O would mean that the number of streams into which an input stream can be distributed using internal
memory size M would be O(M/DB); when the number of partitions or buckets desired is large, many distribution
passes would be required. In this situation, just as it was desirable to merge O(M/B) runs at a time during external
mergesort, it is desirable to be able to partition an input stream into O(M/B) streams. A randomized striping of
the R output streams and 2R internal memory blocks will help in implementing an R-way distribution. In such a
scheme, data structures and mechanisms similar to the ones we developed in this chapter can be used: for instance,
blocks destined to go to disk d can be queued up in queue OQ_d, and a Parallel Write analogous to Parallel Read
can write the block corresponding to the head of the OQ_d queue appropriately to disk d.
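A minimal sketch of this distribution scheme, with illustrative names and a write_block callback standing in for the actual per-disk write, might look as follows.

```cpp
#include <cstddef>
#include <deque>
#include <vector>

// Illustrative sketch of the R-way distribution described above: each output
// block is queued on the occupancy queue OQ[d] of the disk d it is destined for,
// and a Parallel Write drains at most one block per disk per I/O step.
struct OutBlock {
    int   stream_id;   // which of the R output streams the block belongs to
    long  block_num;   // position of the block within that stream
    char* data;        // B items worth of data in internal memory
};

struct Distributor {
    std::vector<std::deque<OutBlock>> OQ;            // one queue per disk
    explicit Distributor(std::size_t D) : OQ(D) {}

    // Queue a completed block for the disk that randomized striping assigns to it.
    void enqueue(int disk, const OutBlock& b) { OQ[disk].push_back(b); }

    // One Parallel Write: write the block at the head of each nonempty OQ[d].
    // write_block stands in for the real per-disk write call.
    template <typename WriteBlock>
    void parallel_write(WriteBlock write_block) {
        for (std::size_t d = 0; d < OQ.size(); ++d) {
            if (OQ[d].empty()) continue;
            write_block(d, OQ[d].front());
            OQ[d].pop_front();
        }
    }
};
```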

4.6.2 Streaming Through Multimedia Files
The problem of external merging of streams that are striped across disks is similar in terms of access patterns to a
video server that has to stream through multiple streams that are striped across disks. The nonuniformity of the
rates at which runs get depleted is similar to the nonuniformity of streaming rates owing to different compression rates
for different frames in the stream. In both cases, though, the rates of streaming required are partially predictable,
albeit to a limited extent. While merging, the forecasting keys predict the participation order of blocks of a merge.
Consider using a file of timestamps, where each timestamp corresponds to the time at which a block from that stream
must be in internal memory. A file of such timestamps corresponding to a video stream is analogous to the forecasting
data run corresponding to a run. Hence, the techniques we developed to implement the lookahead mechanism and
the forecast-and-flush buffer management and prefetching scheme can be analogously implemented using the
timestamps to predict the time at which a block must be in memory. Our data structures and techniques may thus
have applications to video servers, although substantial modifications may be needed to implement the real-time
aspect of video servers.

4.7 Conclusions and Future Work


In this chapter, we considered the important problem of external sorting in a parallel disk setting. We have proposed
simple and elegant data structures and techniques to implement the SRM mergesort algorithm for parallel disks.
To our knowledge, this is the first practical implementation of a parallel disk sorting algorithm that performs a
provably optimal number of passes. Our simplified implementation of SRM includes a novel technique to implement
a lookahead mechanism using forecasting keys. Our implementation significantly outperforms the popular double-buffered
disk-striped mergesort (DSM) technique. Although each merging pass of DSM proceeds a little faster than
that of SRM, the smaller number of passes required by SRM makes SRM's overall performance better than that of
DSM. Our techniques are also applicable to other streaming operations in databases.
In future work, we hope to improve the implementation of parallel I/O operations in TPIE and thus obtain an
improvement in the elapsed-time performance of SRM. We also plan to implement a parallel disk distribution sort based
upon a simplified version of the algorithm in [VS94], but doing the distribution using the approach of Section 4.6.1,
analogous to the one in SRM executed in "reverse". Comparing the performance of such a sort with SRM should
be particularly interesting because disk drives may be somewhat better at performing the kind of I/O needed for
distribution than that needed for merging. (In the envisioned parallel external distribution operation, the input stream is
implemented using striped I/O, whereas the output stream requires parallel independent disk accesses.)

Chapter 5
Memory-Adaptive External Memory
Algorithms
Summary.
We consider the problem of devising external memory algorithms whose memory allocations can change dynamically
and unpredictably at run time. The investigation of "memory-adaptive" algorithms, which are designed to adapt to
dynamically changing memory allocations, can be considered a natural extension of the investigation of traditional,
non-adaptive external memory algorithms. Our study is motivated by high-performance database systems and
operating systems in which applications are prioritized and internal memory is dynamically allocated in accordance
with the priorities. In such situations, external memory applications are expected to perform as well as possible for
the current memory allocation. The computation must be reorganized to adapt to the sequence of memory allocations
in an online manner.
In this chapter we present a simple and natural dynamic memory allocation model. We define memory-adaptive
external memory algorithms and specify what is needed for them to be dynamically optimal. Using novel techniques,
we design and analyze dynamically optimal memory-adaptive algorithms for the problems of sorting, permuting,
FFT, permutation networks, (standard) matrix multiplication and LU decomposition. We also present a dynamically
optimal (in an amortized sense) memory-adaptive version of the buffer tree, a generic external memory data structure
for a large number of batched dynamic applications. We show that a previously devised approach to memory-adaptive
external mergesort is provably nonoptimal because of fundamental drawbacks. The lower bound proof techniques for
sorting and matrix multiplication are fundamentally distinct techniques, and they are invoked by most other external
memory lower bounds; hence we anticipate that the techniques presented here will apply to many external memory
problems.

5.1 Introduction and Motivation


The great majority of the previous work on external memory (EM) algorithms, including our work in the previous
chapters, assumes that internal memory is statically allocated and can hold M items throughout the execution of
the EM algorithm. A natural extension is
to consider the performance of an external memory algorithm when the size of the available internal memory varies
dynamically because of other ongoing activity on the computing machine. With the advent of real-time database
systems [Sys92, Rec88] and database systems based upon administratively defined goals [FNG89, BCL93], it has
become necessary [PCL93b] to design EM applications that can cope efficiently with dynamic and unpredictable
fluctuations in memory allocation. As a result, external memory applications and queries are expected to adapt efficiently
to situations in which portions of their internal memory are taken away from them or are allocated to them
unpredictably and dynamically in the course of the execution of the computation. External memory algorithms that can
adapt to dynamically changing amounts of internal memory are said to be memory-adaptive [PCL93a]. Memory-adaptive
algorithms should perform as efficiently as possible when memory is scarce and should take advantage of
extra memory when it becomes available. The computation must be reorganized to adapt to the sequence of memory
allocations in an online manner.
Prior work on memory-adaptive EM algorithms has been incomplete and exclusively empirical in nature. The
development of memory-adaptive algorithms seems to have been motivated by the fact that conventional join algorithms
(intended to have a fixed memory allocation throughout their execution) are extremely sensitive to memory
fluctuations and are time consuming. Hash join algorithms capable of adapting efficiently to memory fluctuations, as
long as the size of internal memory remains at least the square root of the file size, were developed and experimentally
analyzed by Zeller and Gray [ZG90] and Pang et al. [PCL93b]. The notion of memory-adaptive algorithms is much
more powerful than the passive approach of relying on virtual memory paging or suspending an application until there
is adequate memory [PCL93a]. Pang et al. [PCL93a] proposed a scheme for memory-adaptive external mergesort
and a related sort-join scheme, which they analyzed using an experimental framework. Zhang and Larson [ZL97]
experimentally evaluated their memory-adaptive mergesort strategy, but they made the restriction that once a merge
operation is in progress its memory allocation cannot be changed until it completes.
In the next section, we present a realistic model for the design and analysis of memory-adaptive algorithms
and we define dynamically optimal memory-adaptive algorithms. In Section 5.3, we present asymptotically tight
resource consumption bounds for key problems such as permuting, sorting, FFT, permutation networks and matrix
multiplication. Our lower bounds provide a reinterpretation of the lower bounds of [AV88] and [HK81a] in a dynamic
memory allocation context. In order to prove algorithms for the above problems to be dynamically optimal, we define
natural, application-specific measures for the resource consumption at each I/O step. The measures determine how
efficiently an algorithm adapts to memory fluctuations. In the remaining sections, besides the above problems, we
also show how to design and analyze dynamically optimal algorithms for a memory-adaptive version of the buffer
tree [Arg94] and LU decomposition [WGWR93]. The lower bound proof techniques for sorting and related problems
on the one hand, and the problem of matrix multiplication on the other, are fundamentally distinct techniques, and
they are invoked by most other external memory lower bounds; hence we anticipate that the techniques presented
here will apply to many external memory problems [VS94, GTVV93, CGG+ 95, AV96, Arg94, VV95].
In Section 5.4 we discuss an approach to designing memory-adaptive algorithms using optimal static-memory I/O
algorithms, and then provide intuition on the nature of difficulties caused by dynamic memory allocation.
Sections 5.5 through 5.10 discuss various aspects of memory-adaptive mergesorting. In Section 5.5,
we present a natural framework for designing memory-adaptive mergesort algorithms. In Section 5.6, we define the
fundamental notion of merge potential, which is meant to quantify the "progress" made by a memory-adaptive
merge algorithm up to any time during its execution. In Section 5.7, we use an adversarial argument and construct
a "nemesis" sequence of memory allocations to prove that two variants of a mergesort algorithm based upon the
memory-adaptive merging techniques proposed by Pang et al. [PCL93a] are not dynamically optimal. The notion of
merge potential helps us isolate the fundamental drawback of the merging techniques of [PCL93a]. In Section 5.8 we
present an efficient and elegant memory-adaptive merging algorithm that forms the basis of our dynamically optimal
sorting algorithm. Our algorithm uses novel data structures and online techniques to reorganize merge computation
in response to dynamic fluctuations in memory. In Section 5.9, we analyze the resource consumption of our mergesort
algorithm and establish that it is dynamically optimal. In Section 5.10, we discuss how, besides general insights
into the problem of memory-adaptive merging, the notion of merge potential also gives interesting insights into
the difference in the resource consumption required of a memory-adaptive merging algorithm when it is used as a
subroutine during a mergesort and when it is used in isolation as a dynamically optimal algorithm for merging.
In Section 5.11, we show how the sorting algorithm can be used to obtain memory-adaptive algorithms for
permuting, FFT and permutation networks. In Section 5.12, we show how to apply our memory-adaptive sorting technique
to implement the buffer emptying operation of the I/O buffer tree [Arg94], realizing a dynamically optimal (in an
amortized sense) memory-adaptive buffer tree. This result is particularly significant with respect to the extendibility
of our techniques to diverse applications, since the buffer tree is an I/O-optimal data structure for several
applications involving batched dynamic problems [Arg96, Vit98], including such time-consuming operations as bulk-loading
of B-trees and R-trees [AHVV99].
In Section 5.13, we present simple techniques resulting in a dynamically optimal algorithm for memory-adaptive
matrix multiplication and LU factorization; our techniques can thus form a basis for memory-adaptive scientific
computing.

5.2 Dynamic Memory Model


We wish to enable an external memory algorithm to perform efficiently even when the amount of allocated internal
memory fluctuates unpredictably. The amount of internal memory is dynamically determined by the resource
allocator, which is typically a database system or an operating system. In this section we propose a model for how
a resource allocator may dynamically change the memory allocations of an algorithm during "run time", and we
specify when memory-adaptive algorithms are dynamically optimal.

Definition 19 We assume that the resource allocator allocates memory blocks to the external memory algorithm
in a sequence of allocation phases. Each allocation phase is characterized by its size. Consider an external memory
algorithm A with an input of size n = N/B disk blocks. In the dynamic memory model, the resource allocation to
A consists of a sequence of phases (called the allocation sequence) of sizes s_1, s_2, s_3, ..., where s_i is the size of the
ith phase. The following constraints are met by the allocation phases:

1. In each I/O operation at most B contiguous items can be transferred between internal memory and disk.

2. During the ith allocation phase, the external memory algorithm A is allocated exactly s_i internal memory
blocks by the resource allocator. These s_i blocks are A's to use as it sees fit until it executes 2s_i I/O operations.
The external memory algorithm A can voluntarily terminate an allocation phase of size s_i before completing
2s_i I/O operations during that phase.

3. For each i, we have m_model ≤ s_i ≤ m_max, where m_model ≥ 4 is some constant to be determined later and
m_max = min{cn, phys_max}, where phys_max is the maximum number of internal blocks the physical memory of the
computer can accommodate and c is an application-specific positive constant.

The maximum allocation is trivially bounded by cn. Assuming that c is defined appropriately for the given
application, the algorithm A can complete all computation if the internal memory allocation is cn blocks.

Definition 20 A memory-adaptive algorithm is an algorithm that adheres to the dynamic memory model of Definition
19. We assume that at any time a memory-adaptive algorithm A has access to the following in-memory variables
relevant to A, which can easily be maintained by the resource allocator in internal memory:

1. Variable mem, which contains the size of the ongoing allocation phase of algorithm A.

2. Variable left, which contains the number of I/O operations remaining in the ongoing allocation phase of algorithm A.

3. Variable next, which contains the size of the next allocation phase¹ of algorithm A.

¹ The variable next is relevant only in practice: our techniques (with minor modifications) and theoretical results
hold even when the resource allocator cannot provide this information about the next phase until that phase begins.

The internal memory resource allocator (adversary) has tremendous flexibility since it can dynamically choose
allocation phases of arbitrary sizes, varying from m_model blocks to the maximum possible memory allocation of m_max
blocks. The memory-adaptive algorithm has to adapt to the sizes of allocation phases in an online manner. Requirement 2
of Definition 19, which specifies that each memory allocation of m blocks must last for 2m I/Os, is a very natural
assumption: the duration of the allocation enables the memory-adaptive algorithm to load up to m blocks into memory,
carry out internal memory computation, and then write up to m blocks back to disk, so it is long enough to allow
all m internal memory blocks to be used. That requirement is implicitly met in conventional virtual memory paging
systems designed for non-adaptive external memory algorithms. For example, in a virtual memory system suppose that at
some point an application has m memory blocks. If, for some reason, the virtual memory system now decides to leave
that application with only (say) √m memory blocks in order to create internal memory space for some other higher-priority
application, the virtual memory system would immediately have to write m − √m = Θ(m) blocks to disk, thereby
incurring Θ(m) I/O operations.
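Read operationally, Definitions 19 and 20 amount to the following small interface between the resource allocator and a memory-adaptive algorithm; the class and method names are ours, and the sketch omits the allocator's own policy for choosing phase sizes.

```cpp
#include <cstddef>
#include <stdexcept>

// Illustrative interface for the dynamic memory model of Definitions 19 and 20.
// The allocator fixes the size of each phase; the algorithm sees mem, left and
// next, spends at most 2*mem I/Os per phase, and may end a phase early.
class AllocationPhases {
public:
    AllocationPhases(std::size_t mem, std::size_t next)
        : mem_(mem), next_(next), left_(2 * mem) {}

    std::size_t mem()  const { return mem_;  }   // blocks allocated in this phase
    std::size_t left() const { return left_; }   // I/Os remaining in this phase
    std::size_t next() const { return next_; }   // size of the next phase

    // Charge one I/O operation against the current phase.
    void do_io() {
        if (left_ == 0) throw std::logic_error("phase exhausted; call begin_next_phase()");
        --left_;
    }

    // Move to the next phase, either because 2*mem I/Os were used or because the
    // algorithm terminates the phase voluntarily. next_size is chosen by the
    // resource allocator, subject to m_model <= next_size <= m_max.
    void begin_next_phase(std::size_t next_size) {
        mem_  = next_;
        left_ = 2 * mem_;
        next_ = next_size;
    }

private:
    std::size_t mem_, next_, left_;
};
```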

5.2.1 Dynamically Optimal Memory-Adaptive Algorithms


We now define what it means for a memory-adaptive algorithm to be dynamically optimal.

Definition 21 Consider a computational problem P and a memory-adaptive algorithm A that solves P. Given any
N-sized instance I_N of P, we say that algorithm A solves I_N during allocation sequence σ if A begins execution with
the first phase of σ and completes execution by the end of σ. We say that A solves P during allocation sequence σ
if A can solve any instance I_N of P during σ.

Definition 22 Consider a memory-adaptive algorithm A for problem P. We say that A is dynamically optimal for
P if, for all minimal allocation sequences σ such that A solves P during σ (but A does not solve P during a proper
prefix of σ), no other memory-adaptive algorithm can solve P more than a constant number of times during σ.

For instance, suppose that a memory-adaptive sorting algorithm A_S can sort a file of N items during an allocation
sequence σ = s_1, s_2, ..., s_ℓ. Then for A_S to be dynamically optimal there must be no more than a constant number
c of non-overlapping contiguous subsequences σ_1, σ_2, ..., σ_c of σ such that some memory-adaptive sorting algorithm
can sort an arbitrary N-sized file during each σ_i.

5.3 Memory-Adaptive Lower Bounds


We now present asymptotically tight bounds for the memory and I/O resources consumed by memory-adaptive
algorithms for the fundamental problems of permuting, sorting, fast Fourier transform (FFT), permutation networks,
buffer tree operations and matrix multiplication. The problem of permuting a file of N items is the same as sorting
a file of N items except that the key values of the N items in the output are required to form a permutation of
{1, 2, ..., N}. Buffer tree operations refer to operations on a memory-adaptive version of Arge's buffer-tree [Arg94]
data structure, discussed in Section 5.12.
In this section, we prove only the pertinent lower bounds; the upper bounds are proved in subsequent sections as
indicated below. Based upon the lower bound on the resources that are needed to solve a problem, we can define the
resource consumption at each I/O step of a memory-adaptive algorithm for that problem.

In the following theorem, we use the notion of dynamically optimal algorithms to present our resource consumption
bounds.

Theorem 8 Suppose that A is a memory-adaptive algorithm that finishes its computation during an allocation
sequence σ of sizes m_1, m_2, ..., m_{ℓ(A)}. Let T_A denote the total number Σ_{j=1}^{ℓ(A)} 2m_j of I/O operations incurred by A.

1. Suppose that A is a dynamically optimal algorithm for permuting. Then

   (1/B)(T_A lg N) + Σ_{j=1}^{ℓ(A)} 2 m_j lg m_j = Θ(n lg n).     (5.1)

2. Suppose that A is a dynamically optimal algorithm for sorting or FFT or permutation networks or executing
a sequence of insert/delete operations² on our memory-adaptive buffer tree. Then we have

   Σ_{j=1}^{ℓ(A)} 2 m_j lg m_j = Θ(n lg n).     (5.2)

3. Suppose that A is a dynamically optimal algorithm for (standard) matrix multiplication of two N̂ × N̂ matrices,
or LU decomposition of an N̂ × N̂ matrix³. Then

   Σ_{j=1}^{ℓ(A)} m_j^{3/2} = Θ(n^{3/2}).     (5.3)

² In the case of the buffer tree, N denotes the number of insert/delete operations.
³ In the case of matrix multiplication and LU decomposition, N denotes N̂².

The buffer tree alluded to in the theorem is a memory-adaptive version of the original buffer tree [Arg94]. It is
described in Section 5.12. The bounds in Theorem 8 lead to natural notions of resource consumption of memory-adaptive
algorithms for the various problems discussed.

Definition 23 Consider the 2m I/O operations of a memory-adaptive algorithm A during any allocation phase of
size m.

1. If A is a permuting algorithm, the resource consumption of each I/O operation is defined to be the quantity
(1/B) lg N + lg m.

2. If A is an algorithm for sorting, FFT, permutation networks, or for executing a sequence of operations on a
buffer tree [Arg94], then the resource consumption of each I/O operation is defined to be the quantity lg m.

3. If A is an algorithm for matrix multiplication or LU decomposition, then the resource consumption of each I/O
operation is defined to be the quantity √m / 2.

The resource consumption of algorithm A is defined to be the sum of the resource consumptions of the I/O operations
of A.

We can recast Theorem 8 in terms of resource consumption as follows:

Corollary 2 A memory-adaptive algorithm A is dynamically optimal if and only if its resource consumption is
Θ(n lg n) for permuting; Θ(n lg n) for sorting, FFT, permutation networks, and buffer tree operations; and Θ(n^{3/2})
for (standard) matrix multiplication and LU decomposition.
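Definition 23 and Corollary 2 translate into a simple per-phase accounting rule; the following sketch (with names of our choosing) tallies the resource consumption of an allocation sequence for each of the three problem classes.

```cpp
#include <cmath>
#include <vector>

// Resource consumption of an allocation sequence (m_1, ..., m_l), charging each
// of the 2*m_i I/Os of phase i according to Definition 23.

// Sorting, FFT, permutation networks, buffer tree: lg m per I/O.
inline double sorting_consumption(const std::vector<double>& phases) {
    double total = 0.0;
    for (double m : phases) total += 2.0 * m * std::log2(m);
    return total;
}

// Permuting: (1/B) lg N + lg m per I/O.
inline double permuting_consumption(const std::vector<double>& phases, double N, double B) {
    double total = 0.0;
    for (double m : phases) total += 2.0 * m * (std::log2(N) / B + std::log2(m));
    return total;
}

// Matrix multiplication / LU decomposition: sqrt(m)/2 per I/O, i.e. m^{3/2} per phase.
inline double matmul_consumption(const std::vector<double>& phases) {
    double total = 0.0;
    for (double m : phases) total += std::pow(m, 1.5);
    return total;
}
```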

Below, we prove the lower bounds implicit in Theorem 8 for permuting, sorting, FFT, permutation networks
and matrix multiplication by reinterpreting the original I/O lower bounds proved in [AV88] and [HK81a, SV87] in
a dynamic memory context. The lower bound for buffer tree operations can be proved by adapting the arguments
of [AKL93], which relate comparison tree lower bounds to I/O lower bounds, to the dynamic memory model.
In Section 5.8 and Section 5.13, we present dynamically optimal algorithms for sorting and matrix multiplication,
respectively. We demonstrate optimality in each case by showing that the resource consumption meets the bound
given above. In Section 5.11, we show how to apply our memory-adaptive mergesort and related techniques to obtain
dynamically optimal algorithms for permuting, FFT and permutation networks. In Section 5.12, we show how to use
our sorting algorithm as a subroutine to devise a dynamically optimal memory-adaptive buffer tree. By observations
made in [WGWR93], the dynamically optimal algorithm developed for matrix multiplication can be modified to
obtain a dynamically optimal algorithm for LU factorization.

5.3.1 Memory-Adaptive Lower Bounds for Permuting


We prove a lower bound on the resource consumption incurred by any memory-adaptive algorithm to permute a file
of n blocks of items.

Theorem 9 Consider any memory-adaptive algorithm A that permutes a file containing N = nB items during the
allocation sequence σ = m_1, m_2, ..., m_{ℓ(A)}. Let T_A denote the total number Σ_{i=1}^{ℓ(A)} 2m_i of I/O operations incurred
by A. Then we have

   (1/B)(T_A lg N) + Σ_{i=1}^{ℓ(A)} 2 m_i lg m_i = Ω(n lg n),     (5.4)

where, by definition, the left-hand side is the resource consumption of algorithm A. In the case when
(1/B)(T_A lg N) ≥ Σ_{i=1}^{ℓ(A)} 2m_i lg m_i, the bound (5.4) implies the lower bound

   T_A = Ω(N).     (5.5)

Otherwise, the bound (5.4) implies the lower bound

   Σ_{i=1}^{ℓ(A)} 2m_i lg m_i = Ω(n lg n).     (5.6)

Proof: Without loss of generality, we make the following assumptions made in [AV88]: I/O operations are "simple"
and respect block boundaries. If a block of B items is input into memory at any time and those B items had been
output to disk during an earlier output operation, then we assume that the relative order of those B items was
computed when they were last together in internal memory. We also assume that if mB − B items reside in memory
at the time of initiating an input operation, then the relative ordering of the mB in-memory items is determined on
completion of the input operation.
Consider any one of the 2m_i I/O operations of the ith allocation phase of algorithm A, and suppose it is an
input I/O operation. There are less than n + Σ_{j=1}^{i} 2m_j ≤ n + T_A ≤ N lg N blocks on disk of which A must read
one, so there are no more than N lg N choices available to A. Let us consider how the number of realizable orderings
changes when a given disk block is read into internal memory. By definition, the maximum number of in-memory
items at the time of any I/O operation during the ith phase is M_i, where M_i = m_i B. There are at most B items
in the input block and they can intersperse among the M_i items in internal memory in at most (M_i choose B) ways, so the
number of realizable orderings increases by a factor of (M_i choose B). If the input block has never before resided in internal
memory, the number of realizable orderings increases by an additional B! factor, since the B items can be permuted amongst
themselves. (This extra contribution of B! can happen only once for each of the n original blocks.) The increase in
the number of realizable orderings from writing a disk block is considerably less than that from reading it. Thus the number of
distinct orderings that can be realized by algorithm A increases by a factor of at most

   (B!)^{i'} ( N (lg N) (M_i choose B) )^{2m_i}

during the ith allocation phase, where i' is the number of previously unread blocks read during the ith phase. The
total number of distinct orderings that can be realized by A during allocation sequence σ is no more than

   (B!)^{N/B} ∏_{1≤i≤ℓ(A)} ( N (lg N) (M_i choose B) )^{2m_i}.     (5.7)
Setting the above expression to be at least N!/2, taking logarithms and applying Stirling's formula [Knu97], we have

   N lg B + Σ_{i=1}^{ℓ(A)} 2m_i ( lg N + B lg(M_i/B) ) = Ω(N lg N)

   ⟹  Σ_{i=1}^{ℓ(A)} 2m_i ( lg N + B lg(M_i/B) ) = Ω( N lg(N/B) ),

which can be simplified further to

   Σ_{i=1}^{ℓ(A)} 2m_i ( lg N + B lg m_i ) = Ω(N lg n)

   ⟹  T_A lg N + Σ_{i=1}^{ℓ(A)} 2m_i B lg m_i = Ω(N lg n).

Dividing throughout by B and simplifying, we establish the lower bound (5.4).


We now have to consider two separate cases. First we consider the case (1/B)(T_A lg N) ≥ Σ_{i=1}^{ℓ(A)} 2m_i lg m_i, so that we
have the lower bound

   (1/B)(T_A lg N) = Ω(n lg n).     (5.8)

Since (1/B)(T_A lg N) ≥ Σ_{i=1}^{ℓ(A)} 2m_i lg m_i, we have

   lg N ≥ B lg( min_{1≤i≤ℓ(A)} {m_i} ).     (5.9)

Using the fact that

   lg( min_{1≤i≤ℓ(A)} {m_i} ) ≥ lg m_model,

where m_model is the constant defined in the dynamic memory model, we have

   lg( min_{1≤i≤ℓ(A)} {m_i} ) ≥ c_0 ≥ 2

(we may take c_0 = lg m_model). By (5.9), we have lg N ≥ c_0 B, and so

   N^{1/c_0} = (2^{lg N})^{1/c_0} ≥ (2^{c_0 B})^{1/c_0} = 2^B,

which implies that B < √N. The bound (5.5) follows from (5.8) after some simplification.
In the case when (1/B)(T_A lg N) < Σ_{i=1}^{ℓ(A)} 2m_i lg m_i, the lower bound (5.6) follows trivially from (5.4).

The lower bound (5.1) on the resource consumption of dynamically optimal permuting algorithms in Theorem 8
follows from (5.4).
Intuitively, Theorem 9 says that the resource consumption of permuting is identical to the resource consumption of
sorting (Theorem 8) except when the allocation sequence is what we call a scanty allocation sequence. An allocation
sequence of ℓ phases m_1, m_2, ..., m_ℓ is said to be scanty if

   (lg N) Σ_{i=1}^{ℓ} 2m_i = Ω(N lg n)

but

   Σ_{i=1}^{ℓ} 2m_i lg m_i ≤ (1/B)(lg N) Σ_{i=1}^{ℓ} 2m_i.

It is unlikely for an allocation sequence to be scanty in practice, so, in the most likely case when an allocation sequence
is not scanty, the resource consumption of permuting is identical to that of sorting. Interestingly, when the allocation
sequence is scanty, even the naive permuting algorithm incurring T = Θ(N) I/O operations is a dynamically optimal
permuting algorithm.

5.3.2 Memory-Adaptive Lower Bounds for Sorting, FFT and Permutation Networks
Permuting is a special case of sorting, and hence the lower bound for permuting applies to sorting as well. However,
in the case in which the naive Θ(N)-I/O permuting algorithm is dynamically optimal for permuting, we can prove a
stronger lower bound for sorting using an adversarial argument and the comparison model of computation. Using the
same notation as in Theorem 9 and arguments similar to the ones in [AV88], we can show that the maximum number
of total orders consistent with comparisons made during the allocation sequence m_1, m_2, ..., m_{ℓ(A)} is no more than

   (B!)^{N/B} ∏_{1≤i≤ℓ(A)} (M_i choose B)^{2m_i}.     (5.10)

Using arguments similar to the ones in [AV88], we can show that the maximum number of realizable orderings for
a memory-adaptive permutation network algorithm during ℓ(A) allocation phases is no more than (5.10) as well. In
contrast to sorting, the proof of the above bound for permutation networks does not involve any comparison-model-based
arguments and follows directly from a counting argument.
As in [AV88], we can exploit the fact that any permutation network can be constructed by stacking together at
most three appropriate FFT networks to conclude that the FFT and permutation network problems are essentially
equivalent. Thus the problem of FFT computation has the same lower bound as that for permutation networks.
The theorem below follows by setting the expression (5.10) to be at least N !, taking logarithms, applying Stirling's
inequality [Knu97] and then simplifying.

Theorem 10 Consider any memory-adaptive algorithm A for sorting a file containing N = nB items, or for
computing the N-input FFT digraph or an N-input permutation network, during the allocation sequence σ =
m_1, m_2, ..., m_{ℓ(A)}. Then we have

   Σ_{i=1}^{ℓ(A)} 2m_i lg m_i = Ω(n lg n),     (5.11)

where, by definition, the left-hand side is the resource consumption of algorithm A.

The lower bound (5.2) of Theorem 8 follows from Theorem 10.

5.3.3 Memory-Adaptive Lower Bounds for Matrix Multiplication
We now derive lower bounds on the resources consumed by any memory-adaptive algorithm to multiply two N̂ × N̂
matrices.
Consider the problem of multiplying two N̂ × N̂ matrices, consisting of N = N̂² elements in total. Hong and
Kung [HK81a] proved fundamental I/O bounds for this problem using graph pebbling arguments. Based upon their
arguments, Savage and Vitter [SV87] showed that if the amount of main memory available to an external memory
algorithm is fixed to be M, then any (standard) matrix multiplication algorithm takes at least Ω(N^{3/2}/(B √M)) =
Ω(n^{3/2}/√m) I/O operations, where n = N/B and m = M/B. This bound is easily realized by a simple external
memory algorithm [VS94].
The problem of computing the product C = A · B of two N̂ × N̂ matrices can be viewed as pebbling through a
directed acyclic graph (DAG) containing Θ(N̂³) constant-degree nodes. For our purposes, it suffices to say that in
order to compute C it is necessary to pebble through all Θ(N̂³) nodes of the matrix multiplication DAG. We refer
the reader to [SV87] for a study of issues related to pebbling-based I/O complexity arguments.

In order to adapt the lower bound argument for matrix multiplication to our memory-adaptive setting, we consider
the maximum number of nodes (of the matrix multiplication DAG) that may be pebbled in a single allocation phase
of size s; that is, the maximum number of DAG nodes pebbled using at most s internal memory blocks and no more
than 2s I/O operations. We state the following lemma, whose proof follows from the arguments of [SV87], which are
based upon ideas from [HK81a].

Lemma 22 Let the pebbling operations of matrix multiplication be as defined in [SV87]. The maximum number
of matrix multiplication DAG nodes that can be pebbled during an allocation phase of size m is O(M^{3/2}), where
M = mB ≤ N̂².

The lower bound of Theorem 8 for matrix multiplication follows easily from Lemma 22. The same lower bound
applies to any standard LU factorization algorithm as well: this is because the DAG corresponding to a standard
LU factorization algorithm on an N̂ × N̂ matrix also contains Θ(N̂³) nodes that need to be pebbled, and a version of
Lemma 22 is applicable to an LU factorization DAG.

5.4 Designing Memory-Adaptive Algorithms


We now consider how memory-adaptive algorithms can make "optimum utilization" of the memory and I/O resources
comprising an m-sized allocation phase. In order to get an idea of how this may be achieved, we examine external
memory algorithms that are optimal for a fixed internal memory allocation of m blocks.

5.4.1 Optimal non-adaptive external memory algorithms


Consider an optimal external mergesort algorithm. Given a fixed number m of internal memory blocks, an optimal
mergesort algorithm consists of a run-formation stage followed by a sequence of u(m)-way external merge operations,
where u(m) = Θ(m^c) for some 0 < c ≤ 1. During a typical sequence of Θ(m) I/O operations, this algorithm's execution
consists of generating Θ(m) blocks output by a u(m)-way external merge. Consider the design of an external
memory algorithm to carry out the matrix multiplication AB = C. Given a fixed number m of internal memory
blocks, an optimal matrix multiplication algorithm consists of a sequence of "v(m)-operations", each of which consists
of carrying out an internal memory multiplication of submatrices of A and B, each consisting of v(m) blocks, where
v(m) = Θ(m). Clearly each such operation consists of Θ(m) I/O operations.

5.4.2 Mimicking Optimal Static Memory Algorithms


The computation carried out by an optimal external memory algorithm over a sequence of Θ(m) I/O operations is
determined by the number m of internal memory blocks allocated to the algorithm. Optimality results from the
fact that the algorithm achieves "optimal resource utilization" during a typical allocation phase of size m. By the
above reasoning, for a memory-adaptive algorithm to be efficient, the computation it carries out over each allocation
phase should be determined by the size of that phase. In order to attain dynamic optimality, a memory-adaptive
algorithm's "progress" during an allocation phase of size m should be comparable to that of its optimal, fixed-memory
analog over Θ(m) I/O operations. Ideally, during each allocation phase of size m, where 1 ≤ m ≤ m_max,
a memory-adaptive mergesort algorithm should write to disk Θ(m) blocks resulting from u(m)-way merging, and a
memory-adaptive matrix multiplication algorithm should execute a v(m)-operation.

5.4.3 Adaptive Organization of Computation


The challenge in designing dynamically optimal memory-adaptive algorithms is to organize, in an efficient and online
manner, the external memory computation so as to attain optimal resource utilization during every allocation phase
as described above. The issue of efficiency in this online organization of external memory computation arises in two
contexts.
Firstly, in order to cope with arbitrary variation in allocations, the computation needs to be broken down in
an online manner into a sequence of "smaller granularity" computations such that executing that sequence of
computations is a resource-efficient rendering of the original computation. For instance, an s-way merge of the runs
of the set U = {r_0, r_1, ..., r_{s−1}} can be reorganized as follows: first we can compute the h runs of the set
U' = {r'_0, r'_1, ..., r'_{h−1}} such that run r'_i, where 0 ≤ i ≤ h−1, is a merge of s_i runs of U and Σ_{i=0}^{h−1} s_i = s.
For each i, j such that i ≠ j and 0 ≤ i, j ≤ h−1, the s_i runs merged to obtain r'_i are all distinct from the s_j runs
merged to obtain r'_j. Then we can merge together the runs of the set U' to complete the original merge of the runs of the
set U. Each of the h merge operations producing r'_0, r'_1, ..., r'_{h−1} may or may not subsequently need to be broken down
further into smaller merge operations, and so on, depending upon the allocation sequence. The dynamic optimality (or
non-optimality) of a memory-adaptive mergesort depends upon how it decides h, the s_i, and the s_i runs chosen to produce
run r'_i. Similarly, a computation consisting of multiplying matrices, each consisting of s blocks, can be achieved by
carrying out a series of appropriately chosen matrix multiplication operations, each multiplying matrices consisting of
s' < s blocks. Each s'-sized multiplication operation may or may not need to be broken down into smaller operations,
depending upon the allocation sequence. The dynamic optimality of a memory-adaptive multiplication depends upon
how it determines the value of s'.
The second context in which efficiency is important is while breaking down a computation into subcomputations:
since such activity may also involve external memory operations, the data structures and techniques chosen to
reorganize external memory computation in an online manner must themselves be efficient.

5.4.4 Allocation Levels
As observed in Section 5.4.2, a memory-adaptive algorithm's computation in an allocation phase should be determined
by the size of the phase. While mimicking an optimal external memory algorithm, it is convenient to clump together
ranges of allocation phase sizes into single allocation levels and thereby partition the whole range {m_model, ..., m_max}
of allocation phase sizes into several allocation levels. Consider using the following mimicking strategy in an attempt
to devise a dynamically optimal mergesort algorithm: during an allocation phase of size m, we always try to compute
Θ(m) blocks corresponding to the output of a u(m)-way merge, where u(m) = 2^⌊lg(m−1)⌋ ≈ m/2. In this case, any
allocation phase of size s such that s ∈ {2^ℓ + 1, 2^ℓ + 2, ..., 2^{ℓ+1}} is said to be at allocation level ℓ with respect to
our strategy, since we always try to output Θ(s) blocks of a 2^ℓ-way merge during that allocation phase. On the other
hand, suppose we set u(m) = 2^{2^⌊lg lg(m−1)⌋} ≈ √m; then any phase of size s ∈ {2^{2^ℓ} + 1, 2^{2^ℓ} + 2, ..., 2^{2^{ℓ+1}}} can be
said to be at allocation level ℓ with respect to the modified strategy. In this chapter, we formally define allocation levels
for each memory-adaptive algorithm we present.
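Under the two strategies just described, the allocation level of a phase of size s is a simple function of s; the following sketch computes it, assuming s ≥ m_model ≥ 4 (the function names are ours).

```cpp
#include <cstdint>

// Floor of log base 2 of x (x >= 1).
inline int floor_lg(std::uint64_t x) {
    int k = -1;
    while (x) { x >>= 1; ++k; }
    return k;
}

// Allocation level under the first strategy: phases of size s in
// {2^l + 1, ..., 2^(l+1)} are at level l, and a level-l phase attempts a
// 2^l-way merge (u(m) = 2^(floor lg(m-1)), roughly m/2).
inline int level_linear(std::uint64_t s) { return floor_lg(s - 1); }

// Allocation level under the second (square-root) strategy: phases of size s in
// {2^(2^l) + 1, ..., 2^(2^(l+1))} are at level l, with u(m) roughly sqrt(m).
// Requires s >= 3, which holds since s >= m_model >= 4.
inline int level_sqrt(std::uint64_t s) {
    return floor_lg(static_cast<std::uint64_t>(floor_lg(s - 1)));
}
```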

5.5 A Framework for Memory-Adaptive Mergesort
In this section, we discuss a simple framework that can be used to construct memory-adaptive mergesort algorithms.
The memory-adaptive mergesort based upon the techniques proposed by Pang et al. [PCL93a] can be cast in terms of our
framework. However, the techniques of [PCL93a] suffer from fundamental drawbacks that make the resulting sorting
algorithm nonoptimal, as we show in Sections 5.6 and 5.7. In Section 5.8, we design a new memory-adaptive merging
algorithm that yields a dynamically optimal sorting algorithm when applied using our framework.
External mergesorts [Knu98, AV88] consist of a run formation stage, in which sorted runs are formed (by reading in
memoryloads of items, sorting them, and writing them out to disk), followed by a merging stage, in which the mergesort
algorithm keeps merging together as many runs as possible until there is only one run remaining. Thus
we need to devise memory-adaptive techniques for run formation and merging.

5.5.1 Memory-Adaptive Run Formation


The straightforward memory-adaptive run formation technique we propose is as follows: In an allocation phase of size $m$, we read in $m$ blocks of items from the input file, sort them using an optimal in-place sorting technique, and write them out using $m$ write I/O operations.⁴ The number and the lengths of the runs formed during run formation depend upon the allocation sequence.
4 In order to simplify discussions we neglect details regarding small amounts of additional memory space required by in-memory sorting.
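A minimal sketch of this run formation strategy follows. It assumes a hypothetical interface providing the size of each allocation phase and block-level read/write primitives; the names next_phase_size, read_blocks and write_run are illustrative only.

    def memory_adaptive_run_formation(input_file, next_phase_size, read_blocks, write_run):
        """Form sorted runs: in a phase of m blocks, read m blocks, sort in memory, write m blocks."""
        runs = []
        while not input_file.exhausted():
            m = next_phase_size()                # memory blocks available in this phase
            items = read_blocks(input_file, m)   # uses at most m read I/O operations
            if not items:
                break
            items.sort()                         # in-memory sort of the m blocks just read
            runs.append(write_run(items))        # uses at most m write I/O operations
        return runs                              # run lengths depend on the allocation sequence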

Lemma 23 For a file of $n$ blocks, each run that is formed using the above memory-adaptive run formation strategy, with the possible exception of one run, is at least $\mathit{model}$ blocks long, and the number of runs formed is at most $\lceil n/\mathit{model}\rceil$. The resource consumption of the above run formation strategy is no more than $2n\lg m_{\max}$, where $m_{\max}$ is defined in Definition 19.

5.5.2 Memory-adaptive merging stage


The main difficulty of memory-adaptive mergesorting lies in merging. If an external mergesort algorithm has a static number $m$ of internal memory blocks to use throughout the algorithm, an I/O-optimal strategy [AV88] is to merge together the $m-1$ shortest runs and to repeat the process until a single sorted run remains.⁵ Thus if the input file is $n$ blocks long and there are $n'$ sorted runs after run formation, $\lceil \log_{m-1} n'\rceil = \lceil (\lg n')/\lg(m-1)\rceil$ merge passes are required to complete the sort.
The merging stage we propose is a modification of the above approach. Let $Q$ be the queue of (pointers to) runs that need to be merged during the merging stage. We implement $Q$ as a blocked list of pointers. Whenever a run is formed during the run-formation stage, a pointer to that run is inserted into $Q$. At the end of run formation, $Q$ can have up to $\lceil n/\mathit{model}\rceil$ pointers. Let $\mathcal{M}$ be any memory-adaptive merge subroutine that can merge up to $\Theta_{\mathcal{M}}$ runs, where $\Theta_{\mathcal{M}} \geq 2$, to produce a single output run. In the merging stage, the memory-adaptive merge subroutine $\mathcal{M}$ is used repeatedly in the following manner until only a single run remains in $Q$ (a sketch of this loop in code follows the steps below):

1. Remove the leading $\min\{\Theta_{\mathcal{M}}, |Q|\}$ (pointers to) runs from $Q$.

2. Merge these runs together using $\mathcal{M}$ in a memory-adaptive manner.

3. Append (the pointer to) the output run to the end of $Q$.
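The following sketch shows this merging-stage loop. Here memory_adaptive_merge stands for the subroutine $\mathcal{M}$ of the framework, arity stands for $\Theta_{\mathcal{M}}$, and a Python deque stands in for the blocked list $Q$; these names are mine, not the thesis's.

    from collections import deque

    def merging_stage(runs, memory_adaptive_merge, arity):
        """Repeatedly merge the leading min(arity, |Q|) runs until one run remains."""
        Q = deque(runs)                                  # queue of (pointers to) runs after run formation
        while len(Q) > 1:
            k = min(arity, len(Q))
            batch = [Q.popleft() for _ in range(k)]      # step 1: remove the leading runs
            merged = memory_adaptive_merge(batch)        # step 2: merge them memory-adaptively
            Q.append(merged)                             # step 3: append the output run to the end of Q
        return Q[0] if Q else None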

We have the following estimate of resource consumption over the merging stage as a function of the resource consumption of $\mathcal{M}$.

Lemma 24 Suppose that the resource consumption of merge algorithm $\mathcal{M}$ when merging $\Theta_{\mathcal{M}}$ runs totally containing $n'$ blocks is bounded by $n' \cdot g(\Theta_{\mathcal{M}})$. Suppose that the size of the file to sort is $n$ blocks and that the total number of runs in $Q$ immediately after the run formation stage is $n' \leq \lfloor n/\mathit{model}\rfloor$. Then the resource consumption of the merging stage is no more than

  $n \left( \left\lceil \frac{\lg n'}{\lg \Theta_{\mathcal{M}}} \right\rceil + 1 \right) g(\Theta_{\mathcal{M}})$.

5 If double buffering is used to overlap I/O and CPU time, then approximately $m/2$ runs are merged together.

Proof: Each time an item $x$ is in one of the runs being merged by $\mathcal{M}$, we say that $\mathcal{M}$ touches item $x$. By assumption, each item touched by $\mathcal{M}$ is charged a resource consumption of $g(\Theta_{\mathcal{M}})/B$ for that execution of $\mathcal{M}$. We prove below that the maximum number of times any element can be touched is $\lceil \lg n'/\lg \Theta_{\mathcal{M}}\rceil + 1$, thus proving the lemma.
Instead of using the list $Q$ of runs throughout the merging stage, consider, for the sake of this analysis, the following modified merging stage using several different queues. Let $Q_0$ be the list of $n'$ runs immediately after the run formation stage. At merge level $i$, we merge runs from queue $Q_i$ and insert each run output by such a merge into list $Q_{i+1}$. At the beginning of merge level $i$, if the number of runs in $Q_i$ is no greater than $\Theta_{\mathcal{M}}$, we merge all runs in $Q_i$ to terminate the merging stage. If this is not the case, then while the number of runs in $Q_i$ is at least $\Theta_{\mathcal{M}}$, we repeat the following operation: We remove $\Theta_{\mathcal{M}}$ runs from $Q_i$, merge them together to form a new run $r$, and insert run $r$ into list $Q_{i+1}$. When the number of runs in $Q_i$ becomes less than $\Theta_{\mathcal{M}}$, we append $Q_i$ to $Q_{i+1}$.
The maximum number of times any item can be touched during our original merging stage is no more than the number of merge levels in the merging stage described above. Below we bound the total number of merge levels, which we denote by #Passes. Consider a $\Theta_{\mathcal{M}}$-ary representation of the number $|Q_i|$ of runs in $Q_i$ just prior to beginning the $i$th merge level, for $0 \leq i \leq \#\mathrm{Passes} - 1$. The number of digits in the $\Theta_{\mathcal{M}}$-ary representation of $|Q_i|$ is always one less than the number of digits in the $\Theta_{\mathcal{M}}$-ary representation of $|Q_{i-1}|$, for all except possibly one value of $i$ in the range $1 \leq i \leq \#\mathrm{Passes} - 1$. If the number of $\Theta_{\mathcal{M}}$-ary digits in the representation of $|Q_i|$ is the same as that of $|Q_{i-1}|$, then the most significant digit in the $\Theta_{\mathcal{M}}$-ary representation of $|Q_i|$ is 1 and all other digits are strictly less than $\Theta_{\mathcal{M}} - 1$. It follows that $\#\mathrm{Passes} \leq \lceil \lg n'/\lg \Theta_{\mathcal{M}}\rceil + 1$. This proves the lemma.

5.5.3 Resource consumption of memory-adaptive external mergesort
We will now discuss the requirements that the memory-adaptive merging subroutine $\mathcal{M}$ needs to satisfy in order for our approach to result in a dynamically optimal memory-adaptive sorting algorithm.

Lemma 25 The resource consumption of the sorting algorithm based upon our memory-adaptive mergesort (run formation and merging stage) framework is at most

  $2n\lg m_{\max} + n \left( \left\lceil \frac{\lg n'}{\lg \Theta_{\mathcal{M}}} \right\rceil + 1 \right) g(\Theta_{\mathcal{M}})$,

where $n$ is the number of blocks in the input file, $\mathcal{M}$ and $\Theta_{\mathcal{M}}$ are as defined in Section 5.5.2, and $g(\cdot)$ and $n' \leq \lceil n/\mathit{model}\rceil$ are as defined in Lemma 24.

If for a given memory-adaptive subroutine $\mathcal{M}$ we have $g(\Theta_{\mathcal{M}}) = O(\lg \Theta_{\mathcal{M}})$, then by Lemma 25, our framework results in a total resource consumption of $O(n\lg n)$ and by Corollary 2 the sorting algorithm is dynamically optimal.

On the other hand, if $g(\Theta_{\mathcal{M}}) = \omega(\lg \Theta_{\mathcal{M}})$, then using $\mathcal{M}$ in our framework results in a dynamically nonoptimal sorting algorithm.

Corollary 3 A memory-adaptive sorting algorithm cast in our mergesort framework is dynamically optimal if and only if $g(\Theta_{\mathcal{M}}) = O(\lg \Theta_{\mathcal{M}})$.

Thus, as far as our framework is concerned, the optimality of the memory-adaptive sorting algorithm depends upon the resource consumption of the memory-adaptive merging subroutine $\mathcal{M}$.
Let us consider the quantity $\Theta_{\mathcal{M}}$, the number of runs merged by a single application of the memory-adaptive merging subroutine $\mathcal{M}$. If $\Theta_{\mathcal{M}}$ is a constant, say 2, then the resource consumption of $\mathcal{M}$ can be as high as $\Omega(n'\lg m_{\max}) = \omega(n'\lg \Theta_{\mathcal{M}})$, where $n'$ is the total number of blocks involved in the binary merge carried out by $\mathcal{M}$. This high resource consumption is incurred by $\mathcal{M}$ if the allocation sequence consists of allocation phases of $m_{\max}$ blocks throughout the duration of $\mathcal{M}$'s execution, which spans $\Omega(n')$ I/O operations. Clearly, the strategy of restricting $\Theta_{\mathcal{M}}$ to 2 is in contrast to the strategy (suggested in Section 5.4.2) of having memory-adaptive algorithms mimic optimal static memory algorithms: When the memory size is $m$ blocks, an optimal static memory mergesort algorithm always merges $\Omega(m^c)$ runs, where $c$ is a positive constant. However, as illustrated in Section 5.7, the task of dynamically reorganizing merge computation in a manner that ensures that the arity of merging computation is proportional to the memory allocation is not a straightforward one.

5.6 Potential of a Merge


We now discuss the notion of the potential of a merge, which applies to any merging algorithm. The potential of a merge at any time is a quantification of the progress made by the merge up to that time.

Definition 24 Let $U$ be a set of runs input to a merging routine $\mathcal{M}$. Every subset $u$ of $U$ has a run $r_u$ associated with it: If $u$ is a singleton set then $r_u$ is the run it contains; otherwise $r_u$ is the run output by a merge of the runs contained in $u$. The rank $p(r_u)$ of a run $r_u$ is defined to be the cardinality $|u|$ of $u$. The physical sequence $q(r_u, t)$ corresponding to run $r_u$ at time $t$ is defined as follows:

1. At time $t = 0$, the physical sequence $q(r_u, 0)$ of any input run $r_u$, with $|u| = 1$, is the entire sequence of elements of $r_u$ taken in order. The physical sequence $q(r_{u'}, 0)$ of run $r_{u'}$ at time $t = 0$, where $u' \subseteq U$ is not a singleton set, is the empty sequence.
2. Suppose that at time $t$, algorithm $\mathcal{M}$ is in the process of executing a merge operation. Let $r_u$ be the run corresponding to the output run of that merge operation and let runs $r_{u_1}, r_{u_2}, \ldots, r_{u_h}$ be the input runs of the merge operation, so we have $u = \bigcup_{1 \leq j \leq h} u_j$. Suppose that the physical sequences $q(r_u, t)$ and $q(r_{u_j}, t)$, where $1 \leq j \leq h$, are defined inductively. Then if, at time $t$, algorithm $\mathcal{M}$ removes the leading item $x$ of the physical sequence of run $r_{u_j}$ and appends it to the physical sequence of run $r_u$, we have $q(r_u, t+1) = q(r_u, t) \odot x$ and $q(r_{u_j}, t) = x \odot q(r_{u_j}, t+1)$, where $\odot$ denotes concatenation.

We say the formation of the run $r_u$ associated with set $u$ is logically complete at time $t$ if no append operation (as in Step 2 above) corresponding to $q(r_u, t'+1) = q(r_u, t') \odot x$ for some $x$ can be executed at any time $t' \geq t$.

We illustrate this notion by means of an example: Suppose that the merge of runs of the set $S' = \{p_0, p_1, p_2, p_3, p_4\}$ is in progress and by some point $t$ of time, runs $p_2$ and $p_4$ have already been depleted by that merge. Then at time $t$ the formation of the run $r_s$ associated with the set $s = \{p_2, p_4\}$ is logically complete: Since all elements of $p_2$ and $p_4$ have already been appended into the physical sequence $q(r_{S'}, t)$ of run $r_{S'}$ by time $t$, no element can ever be appended to the empty physical sequence $q(r_s, t)$ at any time after time $t$.

Definition 25 The rank $p(x, t)$ of an item $x$ at time $t$ is the rank $p(r)$ of the run $r$ such that $x$ lies in the physical sequence $q(r, t)$ of run $r$ at time $t$. Suppose that $x_1, x_2, \ldots, x_{N'}$ are the $N' = n'B$ items of all the runs in the set $U$ being merged by the merging routine $\mathcal{M}$. Then the potential $\Phi(t)$ of the merge at time $t$ is defined to be

  $\Phi(t) = \frac{1}{B}\sum_{i=1}^{N'} \lg p(x_i, t)$.

Clearly, at the time $t = 0$ of the beginning of the merge, the potential of the merge is 0. When algorithm $\mathcal{M}$ finishes the merge of runs of set $U$, the potential of the merge (or rather, the progress made by $\mathcal{M}$) is $n'\lg|U|$, where $n'B$ is the total number of items merged.
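The small Python sketch below evaluates the potential of Definition 25 on a toy instance with four input runs and block size $B = 2$ ($n' = 8$ blocks), merged pairwise and then fully; the representation of runs as lists and the specific item values are illustrative assumptions, not part of the thesis.

    import math

    def potential(ranks, B):
        """Phi(t) = (1/B) * sum over all items of lg(rank of the run currently holding the item)."""
        return sum(math.log2(r) for r in ranks) / B

    B = 2
    runs = [[1, 5, 9, 13], [2, 6, 10, 14], [3, 7, 11, 15], [4, 8, 12, 16]]
    rank = {x: 1 for run in runs for x in run}     # every item starts in a rank-1 input run
    print(potential(rank.values(), B))             # 0.0 at time t = 0

    left = sorted(runs[0] + runs[1])               # merge two pairs: output runs have rank |u| = 2
    right = sorted(runs[2] + runs[3])
    for x in left + right:
        rank[x] = 2
    print(potential(rank.values(), B))             # 8.0 = n' * lg 2

    final = sorted(left + right)                   # final merge: output run has rank |U| = 4
    for x in final:
        rank[x] = 4
    print(potential(rank.values(), B))             # 16.0 = n' * lg|U| = 8 * lg 4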
The manner in which the potential of a merge changes over an allocation sequence depends upon the merging algorithm. We are able to relate the cumulative resource consumption of our algorithm MAMerge, presented in Section 5.8, up to any time $t$ to the potential $\Phi(t)$ of the merge at time $t$; thus we use the potential of the merge to keep track of the resource consumption of MAMerge.

5.7 Nonoptimality of a simple memory-adaptive mergesort algorithm
In this section, we discuss an elegant memory-adaptive mergesort based upon the techniques developed by Pang et
al. [PCL93a]. We prove that its resource consumption is nonoptimal.
The memory-adaptive merging algorithm presented in [PCL93a] is not described completely. Below, we present an intuitively appealing algorithm for memory-adaptive merging called $\mathcal{M}_0$ which can reasonably be considered to be an extension of the memory-adaptive merging techniques of [PCL93a]. We present two variants of $\mathcal{M}_0$: One variant, which we call the linear variant, tries to execute $\Theta(m)$-way merges when the memory allocated is $m$ blocks, and the other variant, which we call the sublinear variant, tries to execute $\Theta(\sqrt{m})$-way merges when the memory allocated is $m$ blocks. Using an adversarial argument we show that sorting algorithms based upon the linear and sublinear variants of $\mathcal{M}_0$ incur a resource consumption of $\Omega(n(\lg n)^2)$ and $\Omega(n(\lg n)(\lg\lg n))$ respectively and so, by Corollary 2, are dynamically nonoptimal.

5.7.1 Sketch of the memory-adaptive external memory mergesort
The run formation stage remains the one we proposed in Section 5.5.⁶ During the merging stage, a single application of the subroutine $\mathcal{M}_0$ of [PCL93a] can be considered to merge at most $\Theta_{\mathcal{M}_0} = m_{\max} - 1$ runs, where $m_{\max}$ is the maximum number of disk blocks that can fit in physical memory. We will only sketch the memory-adaptive merging algorithm $\mathcal{M}_0$ in this section: Our focus is on aspects of the algorithm that make the resource consumption of $\mathcal{M}_0$ suboptimal, so we skip several of the details of the algorithm [PCL93a].

Memory-adaptive subroutine $\mathcal{M}_0$

We now describe the memory-adaptive merging subroutine $\mathcal{M}_0$, which attempts to mimic an optimal (static) external memory mergesort as described in Section 5.4.2. The two variants of $\mathcal{M}_0$ correspond to the two mimicking strategies mentioned in Section 5.4.4 for memory-adaptive mergesort.
Suppose that $\mathcal{M}_0$ tries to execute a $u(m)$-way merge during allocation phases of size $m$. Then the linear variant of $\mathcal{M}_0$ corresponds to $u(m) = 2^{\lfloor\lg(m-1)\rfloor}$ while the sublinear variant of $\mathcal{M}_0$ corresponds to $u(m) = 2^{2^{\lfloor\lg\lg(m-1)\rfloor}}$. We now define the level function $f(\cdot)$ that maps allocation phases to allocation levels, as explained in Section 5.4.4.

Definition 26 For the linear variant of $\mathcal{M}_0$, we define $f(m)$ to be $\lfloor\lg(m-1)\rfloor$ and $f^{-1}(\ell)$ to be $2^\ell$. For the sublinear variant of $\mathcal{M}_0$ we define $f(m)$ to be $\lfloor\lg\lg(m-1)\rfloor$ and $f^{-1}(\ell)$ to be $2^{2^\ell}$. (Strictly speaking, the functions $f(\cdot)$ and $f^{-1}(\cdot)$ are not mathematical inverses of each other.) An allocation phase of size $m$ is said to be at allocation level $f(m)$; or alternatively, during an allocation phase of size $m$, the allocation level is said to be $f(m)$.

In the context of the memory-adaptive mergesort framework we introduced in Section 5.5, we use $\mathcal{M}_0$ as a subroutine to memory-adaptively merge $\Theta_{\mathcal{M}_0} = f^{-1}(f(m_{\max}))$ runs at a time. This means that the linear variant of $\mathcal{M}_0$ has $\Theta_{\mathcal{M}_0} = \Theta(m_{\max})$ whereas the sublinear variant has $\Theta_{\mathcal{M}_0} = \Omega(\sqrt{m_{\max}})$.

6 The paper [PCL93a] considers quicksort and different variants of replacement-selection in the context of practical issues such as lengths of runs, response time to fluctuations in memory, and disk locality during the run-formation stage. Our approach can be extended to address most of the practical issues they consider with respect to the run-formation stage.
The basic idea of $\mathcal{M}_0$ is to always associate each allocation level $\ell$ with a set $S_\ell$ of runs such that the rank $p(r_{S_\ell})$ of the run $r_{S_\ell}$ associated with $S_\ell$ is $f^{-1}(\ell)$. The set $S_\ell$ of input runs associated with allocation level $\ell$ may change over the course of the allocation sequence. The computation linked to level $\ell$ at any time is the merge computation necessary to produce blocks of the run $r_{S_\ell}$ associated with $S_\ell$. Whenever the level of allocation is $\ell$, $\mathcal{M}_0$ executes a portion of the computation linked to level $\ell$. Whenever the formation of the run $r_{S_\ell}$ is logically complete, $\mathcal{M}_0$ is assigned a new set of runs such that the condition $p(r_{S_\ell}) = f^{-1}(\ell)$ is satisfied even with the new value of $S_\ell$, and the process continues.
Member runs of the set $S_\ell$ may either be marked or unmarked. The sets $S_\ell$ for $\ell \in \{1, 2, \ldots, f(m_{\max})\}$ are maintained by $\mathcal{M}_0$ over time as follows (a sketch of this maintenance in code appears after the list):

1. Initialization: The set $S_{f(m_{\max})}$ is initialized to the $\Theta_{\mathcal{M}_0}$ runs that are the runs to be merged. All runs in $S_{f(m_{\max})}$ are unmarked. The sets $S_\ell$ corresponding to all other allocation levels are set to nil.

2. During the algorithm, as soon as the merge producing the run $r_{S_\ell}$ associated with a set $S_\ell$ becomes logically complete, the set $S_\ell$ is set to an empty set.

3. During an allocation phase of size $m$, we check to see if set $S_\ell$, where $\ell = f(m)$, is empty on account of points 1 or 2. If set $S_\ell$ is empty, we execute procedure $load(S_\ell)$ defined in point 4 below. Supposing $S_\ell$ is non-empty, then during that allocation phase $\mathcal{M}_0$ computes blocks of the run $r_{S_\ell}$ associated with set $S_\ell$.

4. Whenever a set $S_\ell$, where $\ell \neq f(m_{\max})$, needs to be loaded, we execute the procedure $load(S_\ell)$. In this procedure, if set $S_{\ell+1}$ is empty we recursively load it by executing $load(S_{\ell+1})$. Supposing $S_{\ell+1}$ is not empty, if it contains no unmarked runs, we unmark all runs of that set and set $S_\ell$ to $S_{\ell+1}$. On the other hand, if $S_{\ell+1}$ contains unmarked runs, we remove a subset $s \subseteq S_{\ell+1}$ containing $f^{-1}(\ell)$ unmarked runs⁷ from $S_{\ell+1}$, set $S_\ell = s$, and add a marked run $r_s$ that is associated with subset $s$ into set $S_{\ell+1}$.
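The following Python sketch mirrors the $load(S_\ell)$ procedure of point 4 above. Levels are modeled as a dictionary S mapping a level to a list of (run, marked) pairs, and make_merged_run stands for creating the placeholder run $r_s$ associated with a chosen subset; these representations are my own simplifications, and the top level $S_{f(m_{\max})}$ is assumed to have been initialized as in point 1.

    def load(S, level, f_inv, make_merged_run):
        """Ensure level `level` holds a set of runs whose associated run has rank f_inv(level)."""
        if level + 1 not in S or not S[level + 1]:
            load(S, level + 1, f_inv, make_merged_run)    # recursively load the next level up
        parent = S[level + 1]
        unmarked = [entry for entry in parent if not entry[1]]
        if not unmarked:
            # No unmarked runs: unmark everything and move the whole set down one level.
            S[level] = [(run, False) for run, _ in parent]
            S[level + 1] = []
        else:
            # Take f_inv(level) unmarked runs for this level; leave behind a marked run r_s
            # standing for the (future) output of their merge.
            chosen = unmarked[:f_inv(level)]
            for entry in chosen:
                parent.remove(entry)
            S[level] = [(run, False) for run, _ in chosen]
            parent.append((make_merged_run([run for run, _ in chosen]), True))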

The biggest drawback of the above scheme for memory-adaptive merging is that merge computation producing blocks of run $r_{S_\ell}$ associated with set $S_\ell$ is carried out in allocation phases of size at least $f^{-1}(\ell)$ even if only a 2-way (binary) merge is needed to produce run $r_{S_\ell}$, owing to previously executed computations at lower levels: The first time set $S_\ell$ gets loaded it may have a large number, $f^{-1}(\ell)$, of runs to merge. At such times, the merge computation producing blocks of the run $r_{S_\ell}$ makes very efficient use of $m$-sized allocation phases, where $\ell = f(m)$. However, in general, allocation levels fluctuate. If the allocation levels remain smaller than $\ell$ for long enough, it is possible that when the allocation level next becomes $\ell$, a binary merge is what is required in order to produce blocks of the run $r_{S_\ell}$.
7 If set $S_{\ell+1}$ contains even one unmarked run, it contains at least $f^{-1}(\ell)$ unmarked runs.

The transformation of what was originally an $f^{-1}(\ell)$-way merge into a 2-way merge can take place if the formation of both the runs $r_{S'}$ and $r_{S''}$, where $S_\ell = S' \cup S''$, is logically completed when the allocation level is smaller than $\ell$.
Even when the merge required to produce blocks of $r_{S_\ell}$ is transformed into a binary merge, the algorithm $\mathcal{M}_0$ will persist in using level $\ell$ allocation phases to execute that binary merge. This is inefficient because huge allocation phases of size $m$ such that $f(m) = \ell$ can end up being used to execute the binary merge: The increase in merge potential registered by $\mathcal{M}_0$ during such an $m$-sized phase is only $O(m)$ whereas the resource consumption is $2m\lg m$. The fact that resource consumption is much greater than the increase in merge potential registered by $\mathcal{M}_0$ during such phases turns out to be a fundamental inefficiency, as will be seen from our analysis below and in the following two sections. This inefficiency is precisely what we exploit to produce a nemesis sequence of allocation phases resulting in suboptimal resource consumption for the algorithm.
We do not mention details regarding data structures used to maintain the sets $S_\ell$ of runs associated with levels during $\mathcal{M}_0$ since we assume, conservatively, that the sets of runs can be maintained without any cost.

5.7.2 Lower Bound on Resource Consumption


We will construct a sequence of allocation phases that forces suboptimal resource consumption for the memory-adaptive sorting algorithm obtained by applying the previous section's memory-adaptive algorithm $\mathcal{M}_0$ to our memory-adaptive mergesort framework. We will express the nemesis sequence construction in terms of $f(m)$ and $f^{-1}(\ell)$ so that it is applicable to both the linear and sublinear variants of the merging scheme.
For convenience, we assume that in case of the linear variant of $\mathcal{M}_0$, $m_{\max} = 2^{\ell_{\max}} + 1$, whereas in case of the sublinear variant of $\mathcal{M}_0$, we assume that $m_{\max} = 2^{2^{\ell_{\max}}} + 1$. We use a simple technique to construct our nemesis allocation sequence. We first introduce some terminology that is applicable to the nemesis allocation sequence we construct for the algorithm $\mathcal{M}_0$.

Definition 27 We use $\Sigma(n', r, m)$ to denote the allocation sequence $m_1, m_2, \ldots, m_f$ satisfying the following conditions:

1. $m_i \leq m$ for $1 \leq i \leq f$.

2. Beginning with the allocation phase $m_1$, the algorithm $\mathcal{M}_0$ completes the merge of $r$ runs totally consisting of $n'$ blocks of items precisely at the end of the allocation sequence $\Sigma(n', r, m)$.

We use $\rho(n', r, m)$ to denote the resource consumption corresponding to the allocation sequence $\Sigma(n', r, m)$.

Below we show how to recursively construct the nemesis sequence $\Sigma(n', m_{\max} - 1, m_{\max})$. We prove a lower bound on $\rho(n', m_{\max} - 1, m_{\max})$ assuming that all the $m_{\max} - 1$ runs being merged are of length $n'/(m_{\max} - 1)$ blocks.

Recursive Formulation
We use the notation $\sigma_1 \odot \sigma_2$ to mean the concatenation of the two sequences $\sigma_1$ and $\sigma_2$. We also use the notation $\sigma_1^p$ to mean $\sigma_1 \odot \sigma_1^{p-1}$. Our construction only uses $r$'s of the form $2^\ell$ (respectively $2^{2^k}$) in the linear (respectively sublinear) variant. In the definition below, we use $\hat{r}$ to denote $f^{-1}(f(r))$. We have $\hat{r} = r/2$ in the linear variant and $\hat{r} = \sqrt{r}$ in the sublinear variant, for the $r$'s we consider. Our recursive construction of the nemesis allocation sequence $\Sigma(n', m_{\max} - 1, m_{\max})$ is as follows:

1. Base Case. We define

  $\Sigma(n', 2, m) = m^{n'/m}$.

Thus $\rho(n', 2, m) = 2n'\lg m$, by definition.

2. Recursion. We define

  $\Sigma(n', r, m) = \left(\Sigma\!\left(\frac{n'\hat{r}}{r}, \hat{r}, \hat{r} + 1\right)\right)^{r/\hat{r}} \odot\; \Sigma\!\left(n', \frac{r}{\hat{r}}, m\right)$.

Thus we have

  $\rho(n', r, m) = \frac{r}{\hat{r}}\,\rho\!\left(\frac{n'\hat{r}}{r}, \hat{r}, \hat{r} + 1\right) + \rho\!\left(n', \frac{r}{\hat{r}}, m\right)$.

It is easy to prove inductively that $\Sigma(n', r, m)$ constructed as above meets the requirements mentioned in the definition above. We will now prove lower bounds on the resource consumption of the memory-adaptive merging algorithm by solving for $\rho(n', m_{\max} - 1, m_{\max})$.
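The recursion above can be made concrete with the short Python sketch below, which generates the nemesis sequence as a list of phase sizes and evaluates its resource consumption (charging $2m\lg m$ per phase of size $m$). The list representation, the r_hat parameter and the sample values are illustrative assumptions only.

    import math

    def nemesis_sequence(n_blocks, r, m, r_hat):
        """Sigma(n', r, m): list of allocation-phase sizes, built by the recursion above.
        r_hat(r) plays the role of f^{-1}(f(r)): r/2 in the linear variant, sqrt(r) in the sublinear one."""
        if r == 2:
            return [m] * (n_blocks // m)          # base case: n'/m phases of m blocks each
        rh = r_hat(r)
        inner = nemesis_sequence(n_blocks * rh // r, rh, rh + 1, r_hat)
        return inner * (r // rh) + nemesis_sequence(n_blocks, r // rh, m, r_hat)

    def resource_consumption(seq):
        # Each allocation phase of size m costs 2 * m * lg(m) resource consumption.
        return sum(2 * m * math.log2(m) for m in seq)

    # Example: linear variant with m_max = 2^6 + 1 = 65 and n' = 2^12 blocks to merge.
    linear = nemesis_sequence(2 ** 12, 64, 65, lambda r: r // 2)
    print(resource_consumption(linear))           # grows like n' * (lg m_max)^2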

Lemma 26 In the linear variant of the memory-adaptive external memory algorithm,

  $\rho(n', m_{\max} - 1, m_{\max}) = \Omega\!\left(n'(\lg m_{\max})^2\right)$,

whereas in case of the sublinear variant,

  $\rho(n', m_{\max} - 1, m_{\max}) = \Omega\!\left(n'(\lg m_{\max})(\lg\lg m_{\max})\right)$.

Proof: In case of the linear version of the algorithm, we have

  $\rho(n', r, r + 1) = 2\,\rho(n'/2, r/2, r/2 + 1) + \rho(n', 2, r + 1)$.

Using the base case and supposing inductively that $\rho(n'', r', r' + 1) \geq n''(\lg(r' + 1))^2$ for $n'' \leq n'$ and $r' < r$, we have

  $\rho(n', r, r + 1) \geq 2(n'/2)(\lg(r/2 + 1))^2 + 2n'\lg(r + 1)$
  $\geq n'(\lg(r + 1) - 1)^2 + 2n'\lg(r + 1)$
  $\geq n'(\lg(r + 1))^2 - 2n'\lg(r + 1) + 2n'\lg(r + 1)$
  $\geq n'(\lg(r + 1))^2$.

Thus $\rho(n', m_{\max} - 1, m_{\max}) \geq n'(\lg m_{\max})^2$ in case of the linear variant.
In case of the sublinear variant the recurrence for $\rho(n', r, r + 1)$ unfolds as

  $\rho(n', r, r + 1) = \sqrt{r}\,\rho(n'/\sqrt{r}, \sqrt{r}, \sqrt{r} + 1) + \rho(n', \sqrt{r}, r + 1)$.

Recall that in the sublinear variant we assume that the $r$'s are such that $\lg\lg r$ is integral. We transform the second and third variables of $\rho(n', r, r' + 1)$ by defining the function $\hat{\rho}(n', \lg\lg r, \lg\lg r')$ to be identical to $\rho(n', r, r' + 1)$, where $\lg\lg r$ and $\lg\lg r'$ are both integral. Thus $\rho(n', r, r + 1) = \hat{\rho}(n', k, k)$, where $k = \lg\lg r$, and we bound $\rho(n', r, r + 1)$ as follows:

  $\hat{\rho}(n', k, k) = 2^{2^{k-1}}\,\hat{\rho}(n'/2^{2^{k-1}}, k - 1, k - 1) + \hat{\rho}(n', k - 1, k)$
  $= \sum_{i=1}^{k-1} 2^{2^i}\,\hat{\rho}(n'/2^{2^i}, i, i) + \hat{\rho}(n', 0, k)$.

Using the base case and inductively assuming that $\hat{\rho}(n'', k', k') \geq n''k'2^{k'-1}$ for $n'' \leq n'$ and $k' < k$, we have

  $\hat{\rho}(n', k, k) \geq (n'/2)\sum_{i=1}^{k-1} i2^i + 2n'\cdot 2^k$
  $\geq n'k2^{k-1}$.

Thus we have $\rho(n', m_{\max} - 1, m_{\max}) \geq n'(\lg\lg m_{\max})(\lg m_{\max})/2$ with respect to the sublinear variant.

Using the above lemma in conjunction with Lemma 25, we have the following theorem regarding resource consumption of the memory-adaptive external memory sorting algorithm based upon techniques of [PCL93a].

Theorem 11 While sorting a file of $n$ blocks, the memory-adaptive external memory sorting algorithms based upon the linear and sublinear variants of the memory-adaptive external memory merging subroutine $\mathcal{M}_0$ have resource consumption of $\Omega(n(\lg n)(\lg m_{\max}))$ and $\Omega(n(\lg n)(\lg\lg m_{\max}))$ respectively.

By Theorem 11 and Corollary 2, the above approach to memory-adaptive sorting is dynamically nonoptimal.

5.8 Dynamically Optimal Memory-Adaptive Sorting
In this section we present a new memory-adaptive merging subroutine MAMerge that can be used as $\mathcal{M}$ in the framework of Section 5.5.2 to obtain a dynamically optimal sorting algorithm.
Throughout this section, we use $\Theta$ to denote the number $\Theta_{\mathrm{MAMerge}}$ of runs that each application of MAMerge merges together. The value of $\Theta$ is appropriately chosen to be $\Omega(m_{\max})$, except possibly for the final application, in which case it can be as small as 2. The novelty of MAMerge lies in the data structures and techniques it uses to reorganize the original merge computation adaptively so as to ensure that in "typical" allocation phases, the resource consumption of MAMerge is within a constant factor of the increase in merge potential it registers, and, during all other allocation phases, the total number of I/O operations it incurs is linear in the total number of blocks output by MAMerge.

Theorem 12 Suppose that MAMerge is used to merge together a set of $\Theta$ input runs totally comprising $n'$ blocks. Consider any time $t$ during the execution of MAMerge, up to and including the time MAMerge finishes execution. Then the resource consumption of MAMerge during the allocation sequence up to time $t$ is $O(\Phi(t) + n'\lg m_{\max})$, where $\Phi(t)$ is the potential of the merge at time $t$. In the case $\Theta = \Omega(m_{\max})$, we have $O(\Phi(t) + n'\lg m_{\max}) = O(n'\lg\Theta)$.

The final application of MAMerge may merge only a small number $o(m_{\max})$ of runs. This application of MAMerge may incur a resource consumption of $O(n'\lg m_{\max})$ as opposed to $O(n'\lg\Theta)$. However, we can show that our sorting algorithm remains dynamically optimal.
A proof sketch of the above theorem gives a high-level idea of the technique used by MAMerge to tie its resource consumption at any time to the potential of its merge at that time. In order to sketch the proof of Theorem 12, we define "optimal" and "nonoptimal" phases.

Definition 28 An allocation phase of size $m$, where $\mathit{model} \leq m \leq m_{\max}$, in which the potential of the merge being carried out by MAMerge increases by an additive amount of $\Theta(m\lg m)$ is an optimal phase. Every other allocation phase is said to be a nonoptimal phase.

The novel aspects of MAMerge are the techniques and data structures it employs to ensure that a typical allocation phase is an optimal phase. In a typical allocation phase of size $m$, MAMerge can efficiently access the physical sequences of $m' = \Omega(\sqrt{m})$ appropriate runs $r_u, r_{u_1}, r_{u_2}, \ldots, r_{u_{m'}}$ such that

1. $u = \bigcup_{1 \leq i \leq m'} u_i$, and

2. For $1 \leq i \leq m'$, we have $p(r_u) = |u| = m'|u_i| = m'\,p(r_{u_i})$.

Whenever these conditions are satisfied, MAMerge can use the $2m$ I/O operations of the phase to append $\Theta(mB)$ new items to the physical sequence of run $r_u$, where each appended item belongs to the physical sequence of one of the runs $r_{u_1}, r_{u_2}, \ldots, r_{u_{m'}}$. By definition, the increase in the potential of the merge during such an allocation phase is at least $mB \cdot \frac{1}{B}\lg\frac{p(r_u)}{p(r_{u_i})} = \Theta(m\lg m)$, and thus the phase is optimal. The resource consumption $2m\lg m$ incurred during an optimal phase of size $m$ can be charged to the potential increase $\Theta(m\lg m)$ registered by MAMerge during that phase. Since the potential of the merge can never exceed $n'\lg\Theta$, the net resource consumption during all optimal phases is no more than $O(n'\lg\Theta)$.
On the other hand, the techniques used by MAMerge also ensure that the total number of I/O operations obtained by summing the I/O operations over all nonoptimal phases is $O(n')$. Since the maximum resource consumption of an I/O operation is $\lg m_{\max}$, the resource consumption during all the nonoptimal phases of MAMerge remains $O(n'\lg m_{\max})$. This concludes the sketch of the proof of Theorem 12.

5.8.1 Overview
Each application of MAMerge, except possibly the final application, merges together $\Theta = 2^{2^{\lceil\lg\lg(m_{\max}/\mathit{level})\rceil}}$ runs from $Q$ as described in Section 5.5.2, where $\mathit{level}$ is a constant to be determined later. MAMerge partitions the set of possible sizes of allocation phases into "allocation levels".

Definition 29 An allocation phase of size $s$ is said to be at allocation level $\mathit{level}(s) = \lceil\lg\lg(s/\mathit{level})\rceil$; or alternatively, during allocation phases of size $s$, the (ongoing) allocation level is said to be $\mathit{level}(s)$. In our scheme, we require that the integral constant $\mathit{model}$ defined in Definition 19 be large enough for $\mathit{level}(\mathit{model})$ to be 1. We use $\ell_{\max}'$ to denote the integer $\mathit{level}(m_{\max})$.

By definition, each allocation phase is at one of the allocation levels $\ell$, where $\ell \in \{1, 2, \ldots, \ell_{\max}'\}$.
The basic strategy employed by MAMerge is to dynamically maintain an association of a merge operation "appropriate for level $\ell$" with each allocation level $\ell$ and to generate $\Theta(m)$ blocks output by that merge operation during an allocation phase of size $m$ at level $\ell$.⁸ In the case when the formation of the output run of that merge operation is logically completed before $\Theta(m)$ blocks can be output, MAMerge ends up generating fewer than $\Theta(m)$ blocks during that phase. Whenever the formation of the output run of the merge operation associated with level $\ell$ gets logically completed, MAMerge has to reorganize the global merge computation so as to find a new merge operation "appropriate for level $\ell$".

8 An exceptional case is when $\ell > \mathit{maxlevel}$, where $\mathit{maxlevel}$ is a variable maintained by MAMerge as described below.

Definition 30 A merge operation is said to be appropriate for level $\ell$ if the rank $p(x, t)$ of an element $x$ appended at time $t$ to the physical sequence of the output run of that merge operation is such that

  $p(x, t) \geq 2^{2^{\ell - 1}}\cdot p(x, t - 1)$.

Put another way, if $x$ lies in the physical sequence of run $r'$ at time $t - 1$ and in the physical sequence of the output run $r$ of the merge operation at time $t$, then we have $p(r) \geq 2^{2^{\ell - 1}}\cdot p(r')$.

Every allocation phase in which MAMerge outputs $\Theta(m)$ blocks of a merge operation appropriate for level $\ell$ is an optimal phase, by definition. Allocation phases spent by MAMerge in reorganizing the global merge so as to find merge operations for levels not currently associated with appropriate merges may be nonoptimal. An allocation phase of size $m$ at level $\ell$ can also be nonoptimal if the formation of the output run of the merge operation appropriate for level $\ell$ gets completed during that phase. Another source of possibly nonoptimal phases is phases at any allocation level $\ell > \mathit{maxlevel}$, where $\mathit{maxlevel}$ is a special variable maintained by MAMerge. The number of nonoptimal phases is small enough that the number of I/O operations summed over all nonoptimal phases is $O(n')$, where $n'$ is the number of blocks output by MAMerge.
In Section 5.8.2 we present the recursively defined data structure of a "run-record" that plays a central role in the manner in which MAMerge dynamically reorganizes its merge computation. In Section 5.8.3, we describe the preprocessing stage of MAMerge and some other preliminaries of MAMerge. In Section 5.8.4, we present a data structure called a level-record that stores, for each level $\ell$, the merge operation appropriate for level $\ell$. In Section 5.8.5, we state the invariants pertaining to run-records and level-records that MAMerge maintains. In Section 5.8.6, we describe the simple procedure executed by MAMerge during a phase at allocation level $\ell$ when there does exist an appropriate merge operation for level $\ell$. In Section 5.8.7, we present the download() operation used to find new appropriate merge operations for levels that are currently not associated with any merge operation. In Section 5.8.8, we sew together our data structures and techniques to obtain the memory-adaptive merging algorithm MAMerge. We analyze MAMerge's resource consumption in Section 5.9.

5.8.2 Run-Records
We associate a "run-record", defined below, with every run formed in the course of our mergesort algorithm. Each run-record contains a pointer to the start and end of its run's physical sequence on disk. The queue $Q$ of runs defined in Section 5.5.2 is, in fact, implemented as a queue of run-records.
At any time the "state" of MAMerge consists of a set of merge operations that can collectively be viewed as an adaptive reorganization of the original $\Theta$-way merge that MAMerge sets out to compute. Linked implicitly to each such merge operation in the state of MAMerge is a run-record, defined below. Consider, for example, merging a set $P$ of runs $r_0, r_1, \ldots, r_{p-1}$, where $p = |P|$, into the run $r$, and suppose that $p = 2^{2^\ell}$ for a non-negative integer $\ell$. In our scheme, we ensure that the number of runs in all merge operations in MAMerge's state is always of the form $2^{2^x}$, where $x$ is a non-negative integer. Algorithm MAMerge maintains a run-record $rr$ associated with the output run $r$. Run-record $rr$ contains pointers to the leading and trailing disk blocks of run $r$, if any, and to a list of the run-records $rr_i$ associated with the runs $r_i \in P$. Thus whenever we wish to work on the merge operation whose output is run $r$, we can do so by using the run-records $rr_i$ to get pointers to blocks of the runs $r_i$. We append the merge output to the trailing disk block of run $r$ pointed to by $rr$. In general, the runs $r_i \in P$ can themselves be runs that correspond to the outputs of some other merge operations. More importantly, the design of algorithm MAMerge easily handles situations in which the formation of a run $r_i \in P$ may not be logically complete, in the sense of Definition 24. Algorithm MAMerge has the flexibility of implementing the $p$-way merge linked to $rr$ by recursively splitting it into $\sqrt{p}$-way merge operations: The run-record $rr$ stores a pointer to a list of $\sqrt{p}$ run-records $rr'_0, rr'_1, \ldots, rr'_{\sqrt{p}-1}$ associated with the runs $r'_0, r'_1, \ldots, r'_{\sqrt{p}-1}$ such that run $r$ is logically the output of the $\sqrt{p}$-way merge of runs $r'_0, r'_1, \ldots, r'_{\sqrt{p}-1}$, and each run $r'_i$ is logically the output of the $\sqrt{p}$-way merge of the runs $r_{i\sqrt{p}}, r_{i\sqrt{p}+1}, \ldots, r_{(i+1)\sqrt{p}-1}$.
Below we give a precise definition of the fields that form a run-record. It is useful to separate the logical notion of a run from the way it may actually exist on disk at any time.

Definition 31 In our scheme, the physical sequence $q(r, t)$ at any time $t$ of a run $r$ is stored in a blocked manner on disk.

Before defining the recursive run-record data structure, we define the various fields of a run-record.

Definition 32 The fields of a run-record associated with a run $r'$ are as follows:

1. begin: At any time $t$, the begin field points to the leading element of the physical sequence $q(r', t)$ on disk, assuming $q(r', t)$ is non-empty.
2. end: At any time $t$, the end field points to the trailing element of the physical sequence $q(r', t)$ on disk, assuming $q(r', t)$ is non-empty.
3. Order: An integer field.
4. inputs: The inputs field points to the disk location of the leading run-record of a list of Order run-records stored in a blocked manner on disk. The inputs field implicitly represents this blocked list of Order run-records, so we sometimes refer to the pointer inputs as a list of run-records. This list of run-records may actually exist as a sub-list of a larger blocked list of run-records on disk.
5. flag: The flag field records whether or not the formation of run $r'$ is logically complete, as in Definition 24. Accordingly, flag is set to Done or NotDone respectively. The flag field of the run-record associated with any run input to MAMerge is initialized to Done.
6. splitters: The splitters field points to the disk location of the leading run-record of a list of $\sqrt{\mathit{Order}}$ run-records stored in a blocked manner on disk. The splitters pointer implicitly represents the blocked list of $\sqrt{\mathit{Order}}$ run-records.

Each run-record occupies only $O(1)$ amount of space, which is proportional to $O(1/B)$ disk blocks.
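The fields of Definition 32 can be pictured as the following Python dataclass. Disk pointers are modeled as plain Python references and the blocked lists as Python lists, which is a simplifying assumption; this sketch is reused by the later sketches in this section.

    from dataclasses import dataclass
    from typing import Optional, List

    @dataclass
    class RunRecord:
        # Pointers to the leading/trailing element of the run's physical sequence on disk
        # (None while the physical sequence is empty).
        begin: Optional[object] = None
        end: Optional[object] = None
        # Number of run-records in the list referenced by `inputs`.
        order: int = 1
        # Blocked list of `order` run-records whose merge logically produces this run.
        inputs: Optional[List["RunRecord"]] = None
        # Done once the formation of the associated run is logically complete.
        flag: str = "Done"
        # Blocked list of sqrt(order) run-records splitting the merge into sqrt(order)-way merges.
        splitters: Optional[List["RunRecord"]] = None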

Definition 33 If run $r$ is one of the $\Theta$ input runs of MAMerge, then in the run-record $rr$ associated with $r$, we have $rr.\mathrm{Order} = 1$ and $rr.\mathrm{inputs} = rr.\mathrm{splitters} = \mathit{nil}$. On the other hand, suppose that run $r$ is logically the run corresponding to the output of the merge of the runs $r_i$, where $0 \leq i \leq p - 1$ and $p = 2^{2^\ell}$ for a non-negative integer $\ell$, and $rr_i$ is the recursively defined run-record associated with run $r_i$. Then given a run-record $rr$ at any time, we say

  $rr = \bigvee\{rr_0, rr_1, \ldots, rr_{p-1}\}$

if and only if the following conditions are satisfied:

1. $rr.\mathrm{Order} = p$.
2. The list $rr.\mathrm{inputs}$ contains precisely the $p$ run-records $rr_0, rr_1, \ldots, rr_{p-1}$.

3. The list $rr.\mathrm{splitters}$ contains $\sqrt{p}$ run-records $rr'_j$, where $0 \leq j \leq \sqrt{p} - 1$, such that $rr'_j = \bigvee\{rr_{j\sqrt{p}}, rr_{j\sqrt{p}+1}, \ldots, rr_{(j+1)\sqrt{p}-1}\}$.

We say that the computation associated with run-record $rr$ is the computation involved in merging the runs associated with the $rr.\mathrm{Order}$ run-records in $rr.\mathrm{inputs}$ to produce blocks appended to the physical sequence of run $r$ associated with $rr$.

Although the definition of a run-record is recursive, we do not employ recursion to construct a run-record $rr$ such that $rr = \bigvee\{rr_0, rr_1, \ldots, rr_{p-1}\}$, given the run-records $rr_i$.

Definition 34 Given a blocked list $L_p$ of $p = 2^{2^\ell}$ run-records $rr_0, rr_1, \ldots, rr_{p-1}$, where $\ell$ is a non-negative integer, the construction of a run-record $rr$ satisfying the condition $rr = \bigvee\{rr_j \mid 0 \leq j \leq p - 1\}$ is called a $\mathit{construct}(rr, L_p)$ operation.

The $\mathit{construct}(rr, L_p)$ operation can be implemented by successively constructing a sequence of $\ell$ blocked lists $L_i$ of run-records, for $0 \leq i \leq \ell - 1$, and then setting $rr.\mathrm{inputs}$ to be the list $L_p$ and $rr.\mathrm{splitters}$ to be the list $L_{\ell-1}$. In general, the blocked lists $L_i$ are constructed so as to satisfy the following conditions:

1. List $L_i$ is a blocked list containing $p/2^{2^i}$ run-records, which we denote $rr(i, j)$, where $0 \leq j \leq p/2^{2^i} - 1$.

2. The $j$th run-record $rr(i, j)$ in list $L_i$, where $0 \leq i \leq \ell - 1$ and $0 \leq j \leq p/2^{2^i} - 1$, satisfies $rr(i, j).\mathrm{Order} = 2^{2^i}$ and the list $rr(i, j).\mathrm{inputs}$ consists of the $2^{2^i}$ run-records $rr_{jx}, rr_{jx+1}, \ldots, rr_{(j+1)x-1}$ of list $L_p$, where $x = 2^{2^i}$.

3. Consider a run-record $rr(i, j)$, where $1 \leq i \leq \ell - 1$ and $0 \leq j \leq p/2^{2^i} - 1$, from any list other than list $L_0$. Let $x$ denote $2^{2^i}$. Then, the $\sqrt{x} = 2^{2^{i-1}}$ run-records in the list $rr(i, j).\mathrm{splitters}$ are precisely the run-records $rr(i-1, j\sqrt{x}), rr(i-1, j\sqrt{x}+1), \ldots, rr(i-1, (j+1)\sqrt{x}-1)$ of list $L_{i-1}$. Each run-record in list $L_0$ has its splitters field set to nil.

4. The fields $rr.\mathrm{inputs}$ and $rr.\mathrm{splitters}$ are set so that $rr.\mathrm{inputs}$ represents list $L_p$ and $rr.\mathrm{splitters}$ represents list $L_{\ell-1}$.

5. The flag fields of all run-records of the lists $L_0, L_1, \ldots, L_{\ell-1}$ and of the run-record $rr$ are set to NotDone; the begin and end fields of these run-records are set to nil, indicating that their respective physical sequences at that time are all empty.

It can be inductively shown that the above conditions imply, for $0 \leq i \leq \ell - 1$ and $0 \leq j \leq p/x - 1$, that

  $rr(i, j) = \bigvee\{rr_{jx}, rr_{jx+1}, \ldots, rr_{(j+1)x-1}\}$,

where $x = 2^{2^i}$. This means that $rr = \bigvee\{rr_j \mid 0 \leq j \leq p - 1\}$, as desired.
If the lists $L_i$ are constructed in ascending order of $i$, then all the run-records of the blocked list $L_i$ can be constructed in a single traversal of list $L_{i-1}$ for $i > 0$ and of list $L_p$ for $i = 0$: For $i > 0$, Step 2 above is implemented by setting $rr(i, j).\mathrm{inputs}$ to be equal to (the value of) the field $rr(i-1, j2^{2^{i-1}}).\mathrm{inputs}$ and $rr(i, j).\mathrm{splitters}$ to store the disk location of the run-record $rr(i-1, j2^{2^{i-1}})$. The following lemma follows from the fact that the total number of run-records summed over all the lists $L_i$ and list $L_p$ is $O(p)$ and that constructing each list $L_i$ requires no more than $O(1)$ blocks of internal memory.
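A sketch of the construct operation, following the ascending-order construction just described and reusing the RunRecord sketch above; it keeps everything in memory as Python lists, whereas the real operation works on blocked lists on disk.

    def construct(rr, Lp):
        """Build rr = \/{Lp[0], ..., Lp[p-1]} for p = 2^(2^l), creating lists L_0, ..., L_{l-1}."""
        p = len(Lp)
        ell = 0
        while 2 ** (2 ** ell) < p:
            ell += 1                               # now p = 2^(2^ell)
        prev, levels = Lp, []
        for i in range(ell):                       # build L_0, ..., L_{ell-1} in ascending order
            x = 2 ** (2 ** i)                      # each record of L_i covers x records of Lp
            Li = []
            for j in range(p // x):
                rec = RunRecord(order=x, flag="NotDone")
                rec.inputs = Lp[j * x:(j + 1) * x]                 # the x covered records of Lp
                if i > 0:
                    root = 2 ** (2 ** (i - 1))                     # sqrt(x)
                    rec.splitters = prev[j * root:(j + 1) * root]  # sqrt(x) records of L_{i-1}
                Li.append(rec)
            levels.append(Li)
            prev = Li
        rr.order, rr.flag = p, "NotDone"
        rr.inputs = Lp
        rr.splitters = levels[-1] if levels else None  # for p = 2 no intermediate lists are needed
        return rr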

Lemma 27 The total number of I/O operations incurred by a $\mathit{construct}(rr, L_p)$ operation, where $L_p$ is a blocked list of $p$ run-records, is $O(p/B + \lg\lg p) = O(p)$. The total number of memory blocks required to implement $\mathit{construct}(rr, L_p)$ is $O(1)$.

Another useful observation is expressed by the following lemma.

Lemma 28 The total number of new run-records created by $\mathit{construct}(rr, L_p)$, where $L_p$ contains $p$ run-records, is $O(p)$.

During the preprocessing stage of MAMerge, we execute a $\mathit{construct}(rr, L_{p'})$ operation in which the number $p'$ of run-records in $L_{p'}$ is not of the form $2^{2^\ell}$ but rather of the form $w2^{2^\ell}$, where $w$ is a positive integer less than $2^{2^\ell}$. In this case the above procedure can be modified so that the number of run-records in list $L_{\ell-1}$ is $w$, instead of $\sqrt{p'}$. However, the condition that $rr = \bigvee\{rr' \mid rr' \in L_{p'}\}$ remains satisfied.

Corollary 4 The total number of I/O operations incurred by the $\mathit{construct}(rr, L_{p'})$ operation described in the above paragraph, in which $L_{p'}$ is a blocked list of $p' = w2^{2^\ell}$ run-records and $w$ is a positive integer less than $2^{2^\ell}$, is no more than $O(p'/B + \lg\lg p')$, which is $O(p')$. The total number of memory blocks required in the implementation is $O(1)$.

5.8.3 Preprocessing
Having defined run-records and the $\mathit{construct}()$ operation, we now describe the preprocessing carried out by MAMerge before it starts merging. First we introduce some terminology that we will be using throughout the next two sections.

Definition 35 The number of runs merged by an application of MAMerge is given by

  $\Theta = \min\left\{2^{2^{\lceil\lg\lg(m_{\max}/\mathit{level})\rceil}},\; |Q|\right\}$,

where $|Q|$ is the number of runs in list $Q$ at the beginning of that application of MAMerge. We use $\ell_{\max}$ to denote the integer $\lceil\lg\lg\Theta\rceil$, and $\bar{\Theta}$ to denote the integer $2^{2^{\ell_{\max}}}$.

The value of $\Theta$ for each application of MAMerge in our framework is such that, except perhaps in the final application of MAMerge, we have $\ell_{\max} = \ell_{\max}' = \lceil\lg\lg(m_{\max}/\mathit{level})\rceil$ and $\Theta = \bar{\Theta}$.
During the final application of MAMerge, it is possible that $\Theta < \bar{\Theta}$, in which case it is convenient to introduce some dummy⁹ run-records to the list of run-records corresponding to runs being merged. Adding dummy run-records enables us to make the convenient assumption that the number of run-records in the list $rr'.\mathrm{inputs}$ of any run-record $rr'$ is always of the form $2^{2^\ell}$ for integral $\ell$. If $\Theta < \bar{\Theta}$, one possibility is to add $\bar{\Theta} - \Theta$ dummy run-records, but this can be extremely inefficient since $\bar{\Theta}$ can be as high as roughly $\Theta^2$. So we first add enough dummy run-records to obtain a total of $w\bar{\Theta}^{1/2}$ run-records, where $w$ is an integer no larger than $\bar{\Theta}^{1/2}$. Then, after running the modified version of the $\mathit{construct}()$ operation alluded to in Corollary 4, we add some more dummy run-records appropriately to ensure that, after preprocessing, the number of run-records in the list $rr'.\mathrm{inputs}$ of any run-record $rr'$ is of the form $2^{2^\ell}$, while maintaining the condition that the number of dummy run-records added is $O(\Theta)$.
We execute the following steps during the preprocessing stage of MAMerge in our framework (a sketch of the preprocessing appears below):
9 We define the run associated with a dummy run-record in Section 5.9.

1. Copy the $\Theta$ leading run-records from $Q$ into a new blocked list $L$ of run-records and remove these run-records from $Q$.

2. Append at most $\bar{\Theta}^{1/2} - 1$ dummy run-records to the end of $L$ so that the number of run-records in $L$ is $w\bar{\Theta}^{1/2}$ for a non-negative integer $w$.

3. Execute a modified version of the $\mathit{construct}(rr', L)$ operation (alluded to in Corollary 4) on the blocked list $L$ containing $w\bar{\Theta}^{1/2}$ run-records. The blocked list $rr'.\mathrm{splitters}$, denoted $L'$, contains $w$ run-records.

4. Discard run-record $rr'$ and append fewer than $\bar{\Theta}^{1/2}$ dummy run-records to $L'$ so that $L'$ now contains $w'$ run-records, where $w'$ is the smallest integer such that $w' \geq w$ and $w' = 2^{2^x} \leq \bar{\Theta}^{1/2}$ for an integral $x$.

5. Execute a $\mathit{construct}(rr, L')$ operation on the blocked list $L'$ containing $w'$ run-records.

The run-record $rr$ constructed in Step 5 has $rr.\mathrm{inputs}$ pointing to a blocked list of $w'$ run-records $rr_j$, where $0 \leq j \leq w' - 1$. With one possible exception, the blocked list $rr_j.\mathrm{inputs}$ of each non-dummy run-record $rr_j$ of list $rr.\mathrm{inputs}$ contains a unique set of $\bar{\Theta}^{1/2}$ run-records from the set of $\Theta$ run-records associated with runs MAMerge sets out to merge; one non-dummy run-record of list $rr.\mathrm{inputs}$ may possibly contain the dummy run-records introduced in Step 2. The union of the sets represented by the lists $rr_j.\mathrm{inputs}$ includes the $\Theta$ run-records corresponding to runs input to MAMerge and thus, the run associated with run-record $rr$ is logically the run corresponding to a merge of the $\Theta$ input runs.
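The following sketch walks through the five preprocessing steps, reusing the RunRecord and construct sketches above. Dummy run-records are modeled as fresh RunRecords, the modified construct of Step 3 is simulated by grouping directly, and the parameter names are assumptions of this sketch.

    import math

    def preprocess(Q, theta, theta_bar):
        """Steps 1-5 of the preprocessing: build rr_global for the theta leading runs of Q."""
        sq = int(round(math.sqrt(theta_bar)))          # theta_bar^(1/2) = 2^(2^(ell_max - 1))
        L = [Q.popleft() for _ in range(theta)]        # step 1: take the theta leading run-records
        while len(L) % sq != 0:
            L.append(RunRecord())                      # step 2: pad with dummies up to w * sq records
        # Step 3 (modified construct): group L into w splitter records of sq inputs each.
        L_prime = []
        for j in range(len(L) // sq):
            L_prime.append(RunRecord(order=sq, flag="NotDone", inputs=L[j * sq:(j + 1) * sq]))
        # Step 4: pad L' with dummies up to the smallest w' = 2^(2^x) with w' >= w.
        x = 0
        while 2 ** (2 ** x) < len(L_prime):
            x += 1
        while len(L_prime) < 2 ** (2 ** x):
            L_prime.append(RunRecord())
        # Step 5: build rr_global over L'.
        return construct(RunRecord(), L_prime)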
Lemma 27 and Corollary 4 above imply the following lemma.

Lemma 29 The total number of I/O operations incurred during the preprocessing stage is no more than $O(\Theta/B + \ell_{\max})$, which is $O(\Theta)$. The total number of memory blocks required to implement the preprocessing is $O(1)$. The total number of new run-records created during the preprocessing stage, including dummy run-records, is $O(\Theta)$.

The following definitions refer to variables maintained and used by MAMerge.

Definition 36 The algorithm MAMerge maintains a variable $rr_{\mathrm{global}}$ storing a special run-record. At the end of Step 5 above, we initialize $rr_{\mathrm{global}}$ to the run-record $rr$ resulting from the execution of the $\mathit{construct}(rr, L')$ operation. We use $W_{\mathrm{top}}$ to denote the number of non-dummy run-records in the list $rr_{\mathrm{global}}.\mathrm{inputs}$ immediately after Step 5 of the preprocessing completes.

Throughout MAMerge's execution, the variable $rr_{\mathrm{global}}$ stores the run-record whose run is logically the merge of all the $\Theta$ runs input to MAMerge.

5.8.4 Level-record Data Structure
We now describe the "level-record" data structure that associates with each level $\ell$, where $1 \leq \ell \leq \ell_{\max}$, a merge operation appropriate for level $\ell$. A level-record stores a pointer to a run-record together with some supplementary information.
Definition 37 Every allocation level $\ell$, where $1 \leq \ell \leq \ell_{\max}$, is associated with its level-record, denoted $lr[\ell]$. Level-record $lr[\ell]$ is either nil or it comprises the following three fields:
1. rr: The rr field stores the location of a run-record, which can be viewed as a repository of computation that may be carried out at level $\ell$.
2. current: The current field stores a non-negative integer.
3. active: The active field is a pointer to a run-record which we call the active run-record of level $\ell$ and denote by $rr_\ell$, in short. The merge operation associated with run-record $rr_\ell$ is always a merge operation appropriate for level $\ell$.
Level-records $lr[1]$ through $lr[\ell_{\max}]$ are stored in a blocked list, so they occupy $O(\ell_{\max}/B + 1)$ disk blocks in total.
By "computation associated with level $\ell$" we refer to the merge operation associated with run-record $rr_\ell$.
Immediately upon completion of the preprocessing of MAMerge, all level-records except $lr[\ell_{\max}]$ are initialized to nil. Level-record $lr[\ell_{\max}]$ is initialized as follows: Its rr field is set to run-record $rr_{\mathrm{global}}$, its current field is set to 0, and its active field is set to the same value as $rr_{\mathrm{global}}.\mathrm{inputs}$ (which points to the first run-record in the blocked list implicitly represented by $rr_{\mathrm{global}}.\mathrm{inputs}$).
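A sketch of the level-record of Definition 37 and of its initialization right after preprocessing; the array lr indexed by level is a simplification of the blocked list on disk.

    from dataclasses import dataclass

    @dataclass
    class LevelRecord:
        rr: "RunRecord"          # repository of computation that may be carried out at this level
        current: int             # index of the active run-record within rr.inputs
        active: "RunRecord"      # the active run-record rr_ell of this level

    def initialize_level_records(rr_global, ell_max):
        """All level-records except lr[ell_max] start out as nil (None)."""
        lr = [None] * (ell_max + 1)       # levels 1..ell_max; index 0 unused
        lr[ell_max] = LevelRecord(rr=rr_global, current=0, active=rr_global.inputs[0])
        maxlevel = ell_max
        return lr, maxlevel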
The following definition refers to a variable maintained and used by MAMerge.
Definition 38 Throughout its operation, MAMerge maintains a special variable $\mathit{maxlevel}$, initialized to $\ell_{\max}$. The value of $\mathit{maxlevel}$ is always such that $lr[\mathit{maxlevel}].rr = rr_{\mathrm{global}}$.

5.8.5 Invariants for run-records and level-records

We now present the invariants pertaining to run-records and level-records maintained by MAMerge. The invariants hold immediately after preprocessing is completed and after each $\mathit{llmerge}()$ and $\mathit{download}()$ operation during MAMerge's execution.
In the invariants specified below, we use the following variables and shortened names.
Definition 39 By $\ell$, we denote an integer such that $1 \leq \ell \leq \ell_{\max}$. We use current to denote the field $lr[\ell].\mathrm{current}$ of level-record $lr[\ell]$, rr to denote $lr[\ell].rr$, $m$ to denote $lr[\ell].rr.\mathrm{Order}$, and active to denote $lr[\ell].\mathrm{active}$. By $rr_i$, where $0 \leq i \leq m - 1$, we denote the $i$th run-record in the list $rr.\mathrm{inputs}$ of $m$ run-records.
In case of the invariants below that refer to level-record $lr[\ell]$, it is obviously assumed that $lr[\ell]$ is not nil.

Invariants
1. $1 \leq \mathit{maxlevel} \leq \ell_{\max}$.

2. The run associated with run-record $rr_{\mathrm{global}}$ is always the run corresponding to the output of the merge of the $\Theta$ runs input to MAMerge.

3. If $\mathit{maxlevel} < \ell \leq \ell_{\max}$, then level-record $lr[\ell] = \mathit{nil}$.

4. $lr[\mathit{maxlevel}].rr$ stores the location of run-record $rr_{\mathrm{global}}$.

5. $rr.\mathrm{flag}$ is set to NotDone.

6. We have $0 \leq \mathit{current} \leq m$. For the case $\ell = \ell_{\max}$, $\mathit{current} \notin \{W_{\mathrm{top}}, W_{\mathrm{top}} + 1, \ldots, m - 1\}$.

7. If $\mathit{current} < m$ then the run-record $rr_\ell$ pointed to by active is the current-th run-record $rr_{\mathit{current}}$ in the list $rr.\mathrm{inputs}$; otherwise run-record $rr_\ell$ is the run-record $rr$ itself.

8. If $\ell = \mathit{maxlevel}$, then $m = 2^{2^x}$, where $x$ is an integer no larger than $\ell - 1$.

9. If $\ell < \mathit{maxlevel}$, then $m = 2^{2^{\ell-1}}$.

10. Each one of the $m - \mathit{current}$ run-records $rr_i$, where $\mathit{current} \leq i \leq m - 1$, has $rr_i.\mathrm{flag}$ set to NotDone and $rr_i.\mathrm{Order}$ set to $2^{2^{\ell-1}}$. In the specific case of $\ell = \mathit{maxlevel} = \ell_{\max}$, this invariant holds for the non-dummy run-records $rr_i$, where $\mathit{current} \leq i \leq W_{\mathrm{top}} - 1$, of the list $rr.\mathrm{inputs}$.

11. Consider the ordered list $S_\ell$ of run-records obtained as follows: First concatenate together the lists $rr_i.\mathrm{inputs}$ of the run-records $rr_i$, where $\mathit{current} \leq i \leq m - 1$, in increasing order of $i$ to obtain the list $S'$. Then list $S_\ell$ is obtained by appending the ordered list $rr_0, rr_1, \ldots, rr_{\mathit{current}-1}$ of run-records to the tail of list $S'$. Let the elements, in order, of list $S_\ell$ be $s_0, s_1, \ldots, s_{|S_\ell|-1}$, where $|S_\ell|$ denotes the number of run-records in $S_\ell$.

(a) At most $\ell - 1$ run-records of $S_\ell$ have their flag field set to NotDone.

(b) Consider a run-record $s \in S_\ell$ with $s.\mathrm{flag} = \mathit{NotDone}$. Then there is precisely one level $\ell'$, where $0 < \ell' < \ell$, such that $lr[\ell'].rr$ points to $s$.
(c) Consider any pair $s_{i_1}$ and $s_{i_2}$ of elements of list $S_\ell$ such that $s_{i_1}.\mathrm{flag} = s_{i_2}.\mathrm{flag} = \mathit{NotDone}$ and $i_1 < i_2$, with $lr[\ell_1].rr$ pointing to $s_{i_1}$ and $lr[\ell_2].rr$ pointing to $s_{i_2}$. Then $\ell_1 < \ell_2$.

We present a couple of useful observations as lemmas based upon the above invariants, using the same notation.

Lemma 30 Unless $\ell = \mathit{maxlevel}$ and $\mathit{current} = m$, given level-record $lr[\ell]$, we can obtain a run-record $rr'$ using a single I/O operation, where $rr'$ is such that $rr'.\mathrm{flag} = \mathit{NotDone}$ and the merge operation associated with $rr'$ is appropriate for level $\ell$.

Proof: Invariants 5, 6, 7, 9 and 10 together imply that the active run-record $rr_\ell$ of level $\ell$ has $rr_\ell.\mathrm{Order} = 2^{2^{\ell-1}}$, unless $\ell = \mathit{maxlevel}$ and $\mathit{current} = m$. By the definition of run-records and by Definition 30, the merge operation affiliated with $rr_\ell$ is appropriate for level $\ell$, unless $\ell = \mathit{maxlevel}$ and $\mathit{current} = m$. Since level-record $lr[\ell].\mathrm{active}$ is a pointer to $rr_\ell$, the lemma is true.

When the allocation level is $\ell$, MAMerge carries out computation producing blocks belonging to the run associated with the active run-record $rr_\ell$ of level $\ell$ by merging the runs associated with the run-records in the list $rr_\ell.\mathrm{inputs}$: Lemma 30 ensures that this amounts to making optimal utilization of resources. However, as will be seen later, it is possible for a run-record $rr'$ in the list $rr_\ell.\mathrm{inputs}$ to have its flag field set to NotDone, meaning that the formation of the run associated with $rr'$ is not logically complete: This is a potential problem since it means that in order to produce blocks of the run associated with $rr_\ell$, MAMerge would inherently have to also carry out the merge linked to each such run-record $rr'$, and so producing blocks of the run associated with $rr_\ell$ may require MAMerge to have more memory than originally expected. However, invariants 11a and 11b avert this potential problem. Invariant 11a implies that there are at most $\ell - 1$ run-records in list $rr_\ell.\mathrm{inputs}$ with their flag field set to NotDone, and invariant 11b implies that the total number of extra run-records necessitated by these run-records is no more than $O(2^{2^{\ell-1}})$. This number is within a constant factor of the number $2^{2^{\ell-1}}$ of run-records originally expected to be involved in the merge producing blocks of the run associated with $rr_\ell$, thus averting the potential problem.
It is easy to prove that the above invariants are all true immediately after the preprocessing computation and initialization of level-records is completed.

5.8.6 Low-level Merge Computation

We now describe the procedure $\mathit{llmerge}(rr)$ used by MAMerge to carry out the merge computation affiliated with run-record $rr$. Whenever the allocation level is $\ell \leq \mathit{maxlevel}$ and level-record $lr[\ell]$ is not nil, MAMerge executes the procedure $\mathit{llmerge}(rr_\ell)$ described below, producing blocks of the run associated with the active run-record $rr_\ell$ of level $\ell$. Lemma 30 ensures that $rr_\ell$ is affiliated with a merge operation appropriate for level $\ell$, and so level $\ell$ allocation phases are optimal, except possibly when $\ell = \mathit{maxlevel}$ and $lr[\ell].\mathrm{current} = lr[\ell].rr.\mathrm{Order}$.
Whenever the allocation level $\ell$ is greater than maxlevel, or when $\ell = \mathit{maxlevel}$ and $lr[\ell].\mathrm{current} = lr[\ell].rr.\mathrm{Order}$, MAMerge executes the procedure $\mathit{llmerge}(rr_{\mathrm{global}})$: By invariants 4 and 8, such a phase can be nonoptimal even if $\Theta(mB)$ elements are appended to the physical sequence of the run associated with $rr_{\mathrm{global}}$ during that phase. However, we argue in Section 5.9 that the total amount of resource consumption over all executions of $\mathit{llmerge}(rr_{\mathrm{global}})$ is $O(n''\lg m_{\max})$, where $n''$ is the sum of the number of disk blocks of all the $\Theta$ runs merged by MAMerge.

Invariants for llmerge()

In our description of the procedure $\mathit{llmerge}()$, we use $rr$ to denote the run-record passed as an argument to $\mathit{llmerge}()$. We ensure that the following invariants are always satisfied whenever MAMerge executes procedure $\mathit{llmerge}(rr)$ and the allocation level is $\ell$:

1. $rr.\mathrm{flag} = \mathit{NotDone}$.

2. If $\ell \leq \mathit{maxlevel}$, then

(a) $rr = rr_\ell$, the active run-record of level $\ell$.

(b) During the level $\ell$ allocation phase, at the instant when MAMerge makes the call to execute procedure $\mathit{llmerge}(rr)$, the following condition is satisfied: Either the number left of I/O operations remaining in the ongoing allocation phase is such that $\mathit{left} \geq \mathit{llm}\cdot 2^{2^{\ell-1}}$, where $\mathit{llm}$ is a positive constant defined below, or the immediately following allocation phase¹⁰ of size next is at level $\mathit{level}(\mathit{next}) = \ell$.
(c) The level-records $lr[1]$ through $lr[\ell]$ are already in memory at the time the call to execute $\mathit{llmerge}(rr)$ is made.

3. If $\ell > \mathit{maxlevel}$, then

(a) $rr = rr_{\mathrm{global}}$.
(b) During the level $\ell$ allocation phase, at the instant when MAMerge makes the call to execute procedure $\mathit{llmerge}(rr)$, the following condition is satisfied: Either the number left of I/O operations remaining in the ongoing allocation phase is such that $\mathit{left} \geq \mathit{llm}\cdot 2^{2^{\mathit{maxlevel}}}$, where $\mathit{llm}$ is a positive constant defined below, or the immediately following allocation phase of size next is at allocation level $\mathit{level}(\mathit{next}) > \mathit{maxlevel}$.
(c) The level-records $lr[1]$ through $lr[\mathit{maxlevel}]$ are already in memory at the time the call to execute $\mathit{llmerge}(rr)$ is made.

4. When the execution of $\mathit{llmerge}(rr)$ is completed, the flag field of run-record $rr$ is set to Done if and only if the formation of the run associated with $rr$ is logically complete, as in Definition 24.

The following definition is used during our discussion of $\mathit{llmerge}()$.

10 As mentioned before, it is possible to modify our strategy to make do without the information corresponding to next with no loss of efficiency, asymptotically.
Definition 40 We define $k$ to be the quantity $\min\{\ell, \mathit{maxlevel} + 1\}$ and $\bar{m}$ to be the quantity $2^{2^{k-1}}$.

The execution of procedure $\mathit{llmerge}(rr)$, producing blocks of the run associated with run-record $rr$, is split into two parts, called the merging part and the state-saving part respectively. During the merging part, $\mathit{llmerge}(rr)$ performs I/O related to the merging process, whereas during the state-saving part, $\mathit{llmerge}(rr)$ performs I/O in which relevant data structures, updated so as to maintain the invariants of Section 5.8.5, and partially full buffers of runs are committed to disk before the allocation level changes. Invariants 2b and 3b above ensure that $\mathit{llmerge}(rr)$ is executed only when the ongoing allocation level is guaranteed to last for a number of I/Os large enough to accommodate both the merging part and the state-saving part.
During the merging part, $\mathit{llmerge}(rr)$ loads into memory the run-records in the list $rr.\mathrm{inputs}$ and then starts merging the physical sequences of the runs associated with these run-records, appending the merge output to the physical sequence of the run associated with $rr$. If the physical sequence of any run associated with a run-record $rr'$ of $rr.\mathrm{inputs}$ becomes empty during the merge and $rr'.\mathrm{flag} = \mathit{NotDone}$, $\mathit{llmerge}(rr)$ now has to include the physical sequences of the runs associated with run-records in the list $rr'.\mathrm{inputs}$ in the merge operation. In order to do this, $\mathit{llmerge}(rr)$ first has to load the run-records of the list $rr'.\mathrm{inputs}$ into memory. Similar steps result if some other run-record has a flag field value of NotDone when the physical sequence of the associated run becomes empty. The algorithm executed during the merging part of $\mathit{llmerge}(rr)$ is therefore as follows (a sketch in code follows the steps):

1. The set $T$ is initialized to contain all run-records in $rr.\mathrm{inputs}$.

2. Ensure that each run-record of $T$ is allocated one internal memory block to buffer the leading block of the physical sequence of the run it is associated with.

3. While the physical sequence of the run associated with every run-record in set $T$ is non-empty, merge the physical sequences corresponding to run-records of $T$ into the physical sequence of the run associated with $rr$.

4. If the physical sequence of the run associated with run-record $rr' \in T$ becomes empty, then $T = T - \{rr'\}$. If $rr'.\mathrm{flag} = \mathit{NotDone}$, then add all run-records of list $rr'.\mathrm{inputs}$ to $T$. Go to Step 2.
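A sketch of this merging part follows, using a heap in place of the block buffers. Disk I/O and the preemption/state-saving logic are not modeled; each run-record is assumed (an assumption of this sketch only) to carry a `sequence` list holding its physical sequence, with rr.sequence collecting the output.

    import heapq

    def llmerge_merging_part(rr):
        """Merge the physical sequences of the run-records in rr.inputs, expanding any
        depleted run-record whose flag is NotDone into its own inputs (steps 1-4 above)."""
        T, heap = [], []

        def add(record):
            # Steps 1/2: put the record in T and buffer the leading item of its physical sequence.
            T.append(record)
            if record.sequence:
                heapq.heappush(heap, (record.sequence[0], id(record), record))
            elif record.flag == "NotDone":
                # Empty but not logically complete: its own inputs join the merge (step 4).
                for child in record.inputs:
                    add(child)

        for record in rr.inputs:
            add(record)

        while heap:                                        # step 3: repeatedly take the smallest leading item
            _, _, record = heapq.heappop(heap)
            rr.sequence.append(record.sequence.pop(0))     # move the item to the output's physical sequence
            if record.sequence:
                heapq.heappush(heap, (record.sequence[0], id(record), record))
            elif record.flag == "NotDone":                 # step 4: depleted, include its inputs
                for child in record.inputs:
                    add(child)

        return T                                           # every record in T was "touched" (Definition 41)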

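To make the control flow of Steps 1 through 4 concrete, here is a minimal, self-contained Python sketch. It is purely illustrative: the RunRecord class, the in-memory lists standing in for physical sequences, and the item-at-a-time merging are assumptions made for this sketch only, not the actual on-disk, block-oriented data structures used by MAMerge.

    # Toy model: a run-record's physical sequence is an in-memory list of items.
    class RunRecord:
        def __init__(self, seq=None, inputs=None, flag="Done"):
            self.seq = seq if seq is not None else []        # physical sequence of the associated run
            self.inputs = inputs if inputs is not None else []  # run-records whose runs feed this run
            self.flag = flag                                 # "Done" iff run formation is logically complete

    def merging_part(rr):
        T = list(rr.inputs)                                  # Step 1
        while T:
            # Step 4 (applied eagerly): remove depleted run-records; a NotDone
            # record's own inputs take its place in the merge.
            settled = False
            while not settled:
                settled = True
                for r in [r for r in T if not r.seq]:
                    T.remove(r)
                    settled = False
                    if r.flag == "NotDone":
                        T.extend(r.inputs)
            if not T:
                break
            # Steps 2-3: every sequence in T is now non-empty; move the smallest
            # leading item into rr's physical sequence (block buffering elided).
            src = min(T, key=lambda r: r.seq[0])
            rr.seq.append(src.seq.pop(0))

In this toy form, when a sequence runs dry the merge transparently widens to include the depleted record's inputs, which is exactly the behavior the four steps describe.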
Definition 41 Any run-record that becomes an element of set T at some point during the merging part is said to have been touched by llmerge(rr). Let y(k) and z(k) respectively denote the minimum and maximum number of run-records touched during the merging part of llmerge(rr).

The number of I/Os required to load the data blocks and data structures necessary to begin the merging part of llmerge(rr) is proportional to the number of run-records it touches. So we first obtain a handle on this quantity by means of the following two lemmas.

Lemma 31 The maximum number z(k) of run-records touched during the execution of llmerge(rr) is no more than O(2^{2^{k-1}}).

Proof: We first consider the case when k = ℓ ≤ maxlevel and rr = rr_ℓ. In this case, by invariants 8 and 10 of Section 5.8.5, we know that the number of run-records in list rr.inputs is 2^{2^{k-1}} and all of these are touched. We will consider run-records other than those in list rr.inputs that get touched. For a given touched run-record to cause more run-records to be touched, that run-record necessarily must have a flag field value of NotDone.
We say that a run-record rr' belongs to level ℓ' if rr' is either in the inputs list of the run-record lr[ℓ'].rr or in the inputs list of some run-record in the inputs list of the run-record lr[ℓ'].rr.
As a result of invariant 11 of Section 5.8.5 and the fact that any run-record not included in list rr.inputs is touched only if it lies in a depleted run-record's inputs list, each run-record touched by llmerge(rr) must belong to a level ℓ' < k = ℓ. By invariants 10 and 9 of Section 5.8.5, the maximum number of run-records belonging to level ℓ' is 2^{2^{ℓ'}}. Hence the maximum number z(k) of run-records touched by llmerge(rr) when k = ℓ is

2^{2^{k-1}} + Σ_{ℓ'=1}^{k-1} 2^{2^{ℓ'}} = O(2^{2^{k-1}}).

In the case when k = maxlevel + 1 and rr is rr_global, every touched run-record belongs to a level ℓ' < k, so the maximum number z(k) of touched run-records is no more than

Σ_{ℓ'=1}^{k-1} 2^{2^{ℓ'}} = O(2^{2^{k-1}}).

Hence the lemma is proved.

The following lemma can be easily proved.

Lemma 32 If the run-record rr is not the run-record rr_global, then we have y(k) ≥ 2^{2^{k-1}}, where y(k) is the smallest possible number of run-records that llmerge(rr) touches.

If the runs being merged during llmerge(rr) are long enough, the merging part of llmerge(rr) can proceed for an indefinitely long time unless we preempt llmerge(rr). The need to preempt stems from the fact that if the allocation level changes to ℓ' ≤ maxlevel, MAMerge would then execute llmerge(rr_ℓ'), where rr_ℓ' is the active run-record of level ℓ'. In order to be able to resume a merge operation at some later stage, during the state-saving part of llmerge(rr) we commit partially empty buffer blocks of physical sequences being merged, updated touched run-records, and pertinent updated level-records back to disk: Transferring these to disk obviously requires I/O operations, so llmerge(rr) reserves a certain number of I/O operations from its total number, c_llm · 2^{2^{k-1}}, of I/O operations specifically for this purpose. We now describe the state-saving part of llmerge(rr) and determine how many I/O operations it requires.
During its state-saving part, llmerge(rr) executes the following steps (a sketch of Step 1 appears after the list):

1. The flag field of any run-record rr' such that rr' is either rr or a touched run-record is set to Done if it was previously NotDone and the formation of the run associated with rr' is logically complete, as in Definition 24. (By virtue of invariant 4 of Section 5.8.6, it is enough to set rr'.flag = Done whenever, after recursively determining the values of the flag fields of all the run-records in list rr'.inputs, it is found that they all have their flag fields set to Done.^11)

2. If a touched run-record whose flag field changes value from NotDone to Done is the run-record whose location is stored in lr[ℓ'].rr or lr[ℓ'].active, for some ℓ' such that 1 ≤ ℓ' ≤ min{ℓ, maxlevel}, we need to accordingly update the level-record lr[ℓ'] in order to ensure that the invariants of Section 5.8.5 remain true: It is not hard to see that these invariants can be maintained easily for all relevant ℓ'.

3. Write out to disk the internal memory blocks that buffer the physical sequences being merged. Write out all touched run-records and the run-record rr back to disk. Write out the level-records lr[1] through lr[min{ℓ, maxlevel}] back to disk.

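A small illustrative sketch of Step 1, in the same toy Python model used earlier: the recursion simply mirrors the parenthetical remark and footnote 11, under the assumption that dummy run-records already carry flag "Done". It is not the thesis's actual on-disk procedure.

    def settle_flags(rr):
        # Set rr.flag to "Done" exactly when the formation of rr's run is logically
        # complete, i.e. when every run-record in rr.inputs is (recursively) Done.
        if rr.flag != "Done" and all(settle_flags(child) for child in rr.inputs):
            rr.flag = "Done"
        return rr.flag == "Done"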
The state-saving part basically ensures that the following lemma is true.

Lemma 33 The invariants of Section 5.8.5 and invariant 4 of Section 5.8.6 remain true after the execution of llmerge(rr) is completed.

By Lemma 31, since llmerge(rr) has at most one internal memory block corresponding to each touched run-record, we have the following lemma.

Lemma 34 There exists a constant c_save such that the total number of I/O operations required to implement llmerge(rr)'s state-saving part, in which touched run-records, pertinent level-records and partially filled blocks of physical sequences are written to disk, is no more than c_save · 2^{2^{k-1}}.

11 Dummy run-records introduced in the preprocessing stage are all to be treated as run-records with their flag fields set to Done.

In order to complete the description of llmerge(rr), we need to define the constant c_llm used in invariants 2b and 3b. As mentioned above, c_llm · 2^{2^{k-1}} should take into account the c_save · 2^{2^{k-1}} I/O operations required to complete the state-saving part of llmerge(rr). Additionally, c_llm should also be large enough for a useful amount of "work" to get done during the merging part of llmerge(rr). Below we quantify the notion of a useful amount of work.

Definition 42 Consider the physical sequence q(r, t) of the run r associated with run-record rr at time t just before the execution of llmerge(rr) begins. We call a particular execution of llmerge(rr) a good call if at least 2^{2^{k-1}} · B items are appended to the physical sequence q(r, t) during that execution of llmerge(rr). Any execution of llmerge(rr) that is not a good call is a bad call.

We consider the work involved in appending 2^{2^{k-1}} · B items to the physical sequence of the run associated with rr a useful amount of work during the merging part of llmerge(rr). The following lemma bounds the total number of I/O operations incurred in carrying out this useful amount of work.

Lemma 35 Let q_max be the total number of items that need to be appended to the physical sequence of the run r associated with run-record rr for the formation of r to be logically complete. Then the total number of I/O operations incurred by the merging part of llmerge(rr) to carry out enough merging computation to append min{2^{2^{k-1}} · B, q_max} items to the physical sequence of run r is no more than c_load · 2^{2^{k-1}}, where c_load is a small positive constant.

Sketch of Proof: Suppose that the merging technique used during the merging part is the "standard" external memory |T|-way merge technique, where T is the set defined in the description of the merging part. The lemma then follows from the fact that there are at most O(2^{2^{k-1}}) touched run-records during llmerge(rr) and, after incurring O(1) I/O operations corresponding to each touched run-record as "start-up overhead", the merging process results in one block of items being appended to the physical sequence of run r every O(1) I/O operations.

The execution of llmerge(rr) requires one internal memory block to buffer the physical sequence corresponding to each touched run-record and O(1/B) internal memory blocks to store each touched run-record or each level-record loaded by llmerge(rr). Thus, by Lemma 35, we have the following lemma.

Lemma 36 The total number of internal memory blocks required by llmerge(rr) is no more than c_load · 2^{2^{k-1}}.

We now define the constant c_llm used in invariants 2b and 3b.

Definition 43 We define the constant c_llm to be c_load + c_save, where c_load and c_save are respectively defined in Lemma 35 and Lemma 34. We define the constant c'_llm to be c_llm + δ, where δ is as small as possible such that, for 1 ≤ ℓ ≤ ℓ_max, the total number of I/O operations required to load the level-records lr[1] through lr[ℓ] is no more than δ · 2^{2^{ℓ-1}}.

The definition of c'_llm is for minor technical reasons. Suppose that the invariants mentioned earlier are satisfied at the time MAMerge makes the call to execute llmerge(rr) at allocation level ℓ. Then the complete description of llmerge(rr), based upon the merging and state-saving parts described above, is as follows (an outline in code appears after these steps).

1. Execute the merging part until a time t at which at least one of the following conditions is violated:

(a) If k = ℓ < maxlevel + 1, then either the number left of I/O operations remaining in the ongoing allocation phase is such that left ≥ c_save · 2^{2^{ℓ-1}}, or the immediately following allocation phase of size next is at level level(next) = ℓ. If k = maxlevel + 1, then either the number left of I/O operations remaining in the ongoing allocation phase is such that left ≥ c_save · 2^{2^{k-1}}, or the immediately following allocation phase of size next is at level level(next) > maxlevel.
(b) The formation of the run associated with run-record rr is not logically complete at time t.

2. Execute the state-saving part described above.

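The outline below restates this two-step description in Python form. Every callable passed in (the merging chunk, the state-saving routine, and the phase-accounting queries) is a stand-in assumed for this sketch only; the preemption test is the part that reflects conditions 1a and 1b above.

    def llmerge(rr, k, maxlevel, c_save, ios_left_in_phase, next_phase_level,
                merge_one_chunk, state_saving_part):
        reserve = c_save * 2 ** (2 ** (k - 1))            # I/Os kept back for state saving
        while rr.flag != "Done":                          # condition 1b still holds
            if k <= maxlevel:
                ok = ios_left_in_phase() >= reserve or next_phase_level() == k
            else:
                ok = ios_left_in_phase() >= reserve or next_phase_level() > maxlevel
            if not ok:                                    # condition 1a violated: preempt
                break
            merge_one_chunk(rr)                           # a bounded slice of the merging part
        state_saving_part(rr)                             # Step 2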
The number of I/O operations during the merging part can be more than 2^{2^{k-1}}: Computation corresponding to the merging part is not interrupted so long as the conditions in Steps 1a and 1b above are satisfied. Thus the number of blocks appended to the physical sequence of the run associated with rr can also exceed O(2^{2^{k-1}}). The only situation that causes the execution of llmerge(rr) to be a bad call is when the condition in Step 1b is violated during the execution of llmerge(rr): As a result of invariants 2b and 3b of Section 5.8.6 and the definition of c_llm, the execution of llmerge(rr) can never end up being a bad call on account of violating condition 1a during the execution of llmerge(rr). We have the following lemmas regarding good and bad llmerge(rr) calls, relating the number of I/O operations incurred by llmerge(rr) to the number of blocks appended to the physical sequence of the run associated with run-record rr.

Lemma 37 Suppose that the execution of llmerge(rr) is a good call in which g items are appended to the physical sequence of the run associated with rr. Then the total number of I/O operations incurred by the execution of llmerge(rr) is O(g/B).

Sketch of Proof: After the merging process has loaded into memory the leading B items of the runs being merged, roughly speaking, each time a block is brought into memory there is a corresponding block being written to the output run. The total number of touched run-records is O(2^{2^{k-1}}). The total number of I/O operations required to append 2^{2^{k-1}} · B items to the concerned physical sequence is O(2^{2^{k-1}}), by Lemma 35. Since the llmerge(rr) in question is a good call, we have g ≥ 2^{2^{k-1}} · B. Hence the lemma follows.

The following lemma follows directly from Lemma 35.

Lemma 38 Suppose that the execution of llmerge(rr) is a bad call. Then the total number of I/O operations incurred by the execution of llmerge(rr) is O(2^{2^{k-1}}).

In order to test that the condition of Step 1b above remains true, at any time during its execution llmerge(rr) needs to maintain the invariant that the flag field of rr is Done if and only if the formation of the run associated with rr is logically complete: By virtue of Step 1 of the state-saving part and the fact that all touched run-records reside in internal memory at the time the formation of the run associated with rr does become logically complete, this invariant can be achieved easily by llmerge(rr).

5.8.7 Downloading Work for Adaptivity

If lr[ℓ] is not nil, whenever the allocation level is ℓ, MAMerge executes the merging operation affiliated with rr_ℓ. If MAMerge's allocation level remains ℓ or smaller for extended periods of time, MAMerge runs out of all the (2^{2^{ℓ-1}})-way merging work associated with lr[ℓ] and the level-record lr[ℓ] becomes nil. When the level-record lr[ℓ] becomes nil, we need to associate level ℓ with "fresh" appropriate merging work. We refer to the process of establishing such an association as a "downloading" process since it requires "stealing" some of the merging work associated with a higher allocation level ℓ' > ℓ. Once the new association of level ℓ with such a merge operation is made, the run-record rr_ℓ is once more affiliated with a merge operation which is appropriate for level ℓ, and which can be executed when the allocation level next becomes ℓ.

Loading Work Down One Level

We now consider what exactly constitutes downloading work from level ℓ + 1 to level ℓ, assuming lr[ℓ] is nil but lr[ℓ + 1] is not. If m = 2^{2^ℓ}, downloading work from level ℓ + 1 to level ℓ involves reorganizing the m-way merge affiliated with the active run-record rr_{ℓ+1} of level ℓ + 1 into a √m-way merge of √m-way runs, each the output of a √m-way merge. We use the procedure construct() in order to bring about such a reorganization.
Assuming that level-record lr[ℓ + 1] is not nil and level-record lr[ℓ] is nil, we use the procedure loadlevel(ℓ), described below, to download the active run-record rr_{ℓ+1} from lr[ℓ + 1] to lr[ℓ], appropriately updating lr[ℓ + 1] in the process. (A sketch of the Step 3 reorganization appears after the list.)

1. Let rr_new be a new run-record.

2. If ℓ + 1 = maxlevel and lr[ℓ + 1].current = lr[ℓ + 1].rr.Order < 2^{2^ℓ}, then

(a) Set rr_new to be the run-record lr[ℓ + 1].rr.

3. Else

(a) Let m = 2^{2^ℓ} and let L_√m be the blocked list containing √m run-records, implicitly represented by the field rr_{ℓ+1}.splitters of the active run-record rr_{ℓ+1} of level ℓ + 1.
(b) Discard the run-record rr_{ℓ+1}, execute construct(rr_new, L_√m) to appropriately initialize run-record rr_new, and let rr_new now take the place of the old run-record rr_{ℓ+1} on disk.

4. Set lr[ℓ].rr to store the disk location of run-record rr_new, set lr[ℓ].current = 0 and set lr[ℓ].active = rr_new.inputs so it points to the first run-record of the list represented by rr_new.inputs.

5. If lr[ℓ + 1].current ≤ lr[ℓ + 1].rr.Order − 1 then

(a) If ℓ + 1 = ℓ_max and lr[ℓ + 1].current = W_top − 1, then

i. Set lr[ℓ + 1].current to the value lr[ℓ + 1].rr.Order.
ii. Set lr[ℓ + 1].active to store the location of run-record lr[ℓ + 1].rr.

(b) Otherwise, increment the value of lr[ℓ + 1].current by 1. Then, if lr[ℓ + 1].current < lr[ℓ + 1].rr.Order − 1, set lr[ℓ + 1].active to store the location of the lr[ℓ + 1].current-th run-record of list lr[ℓ + 1].rr.inputs; else set lr[ℓ + 1].active to store the location of run-record lr[ℓ + 1].rr.

6. Else (that is, lr[ℓ + 1].current = lr[ℓ + 1].rr.Order)

(a) Set lr[ℓ + 1] to nil.

(b) If maxlevel = ℓ + 1, set maxlevel to ℓ and set rr_global to be the run-record pointed to by lr[maxlevel].rr.

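The heart of Step 3 is the reorganization performed by construct(): an m-way merge is regrouped into a √m-way merge of √m-way merges. The Python fragment below illustrates that regrouping on the in-memory toy model from earlier; the real construct() operates on blocked lists of run-records on disk, and the Order bookkeeping here is a simplification assumed for the sketch.

    import math

    def reorganize(rr_old):
        # Regroup the m inputs of rr_old into sqrt(m) groups of sqrt(m) run-records
        # each; each group becomes a sqrt(m)-way merge feeding a new sqrt(m)-way merge.
        m = len(rr_old.inputs)
        s = math.isqrt(m)                       # sqrt(m); m is assumed to be a perfect square
        groups = [rr_old.inputs[i:i + s] for i in range(0, m, s)]
        middle = [RunRecord(inputs=g, flag="NotDone") for g in groups]
        rr_new = RunRecord(inputs=middle, flag="NotDone")
        rr_new.Order = s
        for child in middle:
            child.Order = s
        return rr_new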
The following lemma follows from Lemma 30, Lemma 27 and the observation that the list involved in the construct() operation of Step 3b contains √m run-records.

Lemma 39 Suppose that m = 2^{2^ℓ}, lr[ℓ] = nil and lr[ℓ + 1] ≠ nil, where 1 ≤ ℓ < maxlevel, at some time during our algorithm. Then the execution of procedure loadlevel(ℓ) described above takes no more than O(√m) I/O operations and can be implemented using O(1) blocks of internal memory.

The significance of the upper bound on internal memory above is that the I/O operations of procedure loadlevel(ℓ) can be carried out over several arbitrary allocation phases, since the constant m_model will be set to a value larger than the number of blocks required by loadlevel(ℓ) for any ℓ.

In MAMerge, the only time new run-records are created is during loadlevel() operations. Using Lemma 28 and the fact that the list L_√m passed as argument to construct() during loadlevel() contains √m run-records, we have the following lemma.

Lemma 40 Suppose that m = 2^{2^ℓ}, lr[ℓ] = nil and lr[ℓ + 1] ≠ nil, where 1 ≤ ℓ < maxlevel, at some time during our algorithm. Then the total number of new run-records created during loadlevel(ℓ) is no more than O(√m).

We also make the crucial observation that loadlevel(ℓ) maintains the invariants proposed in Section 5.8.5.

Lemma 41 The invariants of Section 5.8.5 remain true after the execution of loadlevel(ℓ).

Proof: Since lr[ℓ] = nil and since lr[ℓ + 1] satisfies invariant 11a prior to loadlevel(ℓ), the list S_{ℓ+1} defined in invariant 11 contains at most ℓ − 1 run-records with flag fields set to NotDone.
First we will prove that lr[ℓ] satisfies all the invariants after the execution of loadlevel(ℓ). Consider the active run-record rr_{ℓ+1} of level ℓ + 1 just before loadlevel(ℓ) is executed. In the case when ℓ + 1 = maxlevel and lr[ℓ + 1].current = lr[ℓ + 1].rr.Order < 2^{2^ℓ}, rr_{ℓ+1} happens to be the run-record pointed to by lr[ℓ + 1].rr. Invariant 8 implies that rr_{ℓ+1}.Order = 2^{2^x} for an integer x such that x ≤ ℓ − 1. Thus, the assignment making lr[ℓ].rr store the location of the run-record rr_{ℓ+1}, the other assignments in Step 4, and the fact that maxlevel decreases to ℓ, ensure that lr[ℓ] satisfies the invariants after the execution of loadlevel(ℓ). When either of the conditions ℓ + 1 = maxlevel and lr[ℓ + 1].current = lr[ℓ + 1].rr.Order < 2^{2^ℓ} is not satisfied, the run-record rr_{ℓ+1} necessarily has rr_{ℓ+1}.Order = 2^{2^ℓ}. In Step 3b, we replace this run-record on disk with an appropriately constructed new run-record rr_new with rr_new.Order = 2^{2^{ℓ-1}}. Thus, the assignment making lr[ℓ].rr store the location of this newly constructed run-record and the other assignments in Step 4 ensure that lr[ℓ] satisfies the invariants after the execution of loadlevel(ℓ).
We will now prove that lr[ℓ + 1] satisfies the invariants after the execution of loadlevel(ℓ). If lr[ℓ + 1].current was lr[ℓ + 1].rr.Order prior to loadlevel(ℓ), then lr[ℓ + 1] becomes nil, trivially satisfying all invariants. Step 5 ensures that invariants 6 and 7 remain true even after loadlevel(ℓ). In the case when lr[ℓ + 1] is not nil after the execution of loadlevel(), we now consider the parts of invariant 11. After the execution of loadlevel(ℓ), the list S_{ℓ+1} has at most one more run-record with flag set to NotDone than just before loadlevel(ℓ). But since the maximum number of such run-records in S_{ℓ+1} prior to loadlevel(ℓ) is ℓ − 1, invariant 11a remains true even after loadlevel(ℓ). Invariant 11b remains true of S_{ℓ+1} even after loadlevel(ℓ) because the only change among level-records of levels lower than ℓ + 1 is that lr[ℓ].rr now stores the location of the run-record rr_new, which lies in the list lr[ℓ + 1].rr.inputs. Invariant 11c remains true of the ordered list S_{ℓ+1} after loadlevel(ℓ) since the last element of S_{ℓ+1} is the run-record pointed to by lr[ℓ].rr, which is at the highest level lower than level ℓ + 1.
In case maxlevel decreases during loadlevel(), the run associated with the run-record pointed to by lr[maxlevel].rr is logically the same as the one associated with the run-record pointed to by the lr[maxlevel].rr with the old value of maxlevel, so invariant 4 remains true.
It can be easily verified that the other invariants remain true after the execution of loadlevel(ℓ) as well. Thus the lemma is proved.

Loading Work Down Several Levels

Consider the problem of downloading work to level k when several levels k, k + 1, ..., ℓ have level-records set to nil and only the level-record lr[ℓ + 1] of level ℓ + 1 is not nil. In such a situation, MAMerge has to download work to level k from level ℓ + 1 and it does so using the procedure download(k) described below (a short sketch follows).

1. Traverse the (blocked) list of level-records beginning with lr[k + 1] until a level-record lr[ℓ + 1] such that lr[ℓ + 1] ≠ nil is found.

2. Execute loadlevel(i) for i going from ℓ down to k.

Thus, during the execution of download(k), MAMerge ends up downloading work to the other levels k + 1, ..., ℓ that had their level-records set to nil as well.

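A short sketch of download(k), again over an in-memory stand-in (lr as a dictionary from level to level-record or None, and loadlevel passed in as a callable); only the control flow of Steps 1 and 2 is represented, and a non-nil level-record above k is assumed to exist, as in the text.

    def download(k, lr, loadlevel):
        l = k
        while lr.get(l + 1) is None:       # Step 1: find the first non-nil level-record above k
            l += 1
        for i in range(l, k - 1, -1):      # Step 2: loadlevel(i) for i = l, l-1, ..., k
            loadlevel(i)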
Definition 44 We say that the execution of download(k) loads levels k through ℓ, where ℓ is as defined above. The highest level loaded by download(k) is level ℓ.

Using Lemma 39 and the invariants of Section 5.8.5, we have the following lemma regarding resource consumption during the execution of download(k).

Lemma 42 Suppose m = 2^{2^ℓ}, where ℓ is the highest level that is loaded by download(k). The total number of I/O operations involved in first traversing the list of level-records lr[1] through lr[k] and then executing download(k) is O(√m). The download(k) operation can be implemented such that it requires no more than O(1) internal memory blocks.

Proof: Traversing the list lr[1] through lr[k] of pointers before executing download(k) and then traversing level-records lr[k] through lr[ℓ + 1] during the initial part of download(k) requires no more than O((lg lg m)/B + 1) I/O operations. Using Lemma 39 and the invariants of Section 5.8.5, the ℓ − k + 1 calls to loadlevel() in total incur

O(Σ_{i=0}^{ℓ−k} √(m^{1/2^i})) = O(Σ_{i=0}^{ℓ−k} (√m)^{1/2^i}) = O(√m)

I/O operations. Thus the total number of I/O operations is O(√m). As in the case of loadlevel(), we require only O(1) memory blocks to complete the operation.

In the above lemma we also counted the I/O operations required to load level-record lr[k] into memory, assuming we have to follow the blocked list of pointers beginning with lr[1] to do so.
The following lemma follows from Lemma 42 and is used later, among other things, in setting the constant c_level to an appropriate value.

Lemma 43 Consider a sequence of d operations such that:

1. In the i-th operation, where 0 ≤ i ≤ d − 1, we traverse the blocked linked list of level-records to access lr[ℓ_i] and then execute download(ℓ_i).

2. Over the entire sequence of d operations, for no level ℓ', where 1 ≤ ℓ' ≤ ℓ_max − 1, is loadlevel(ℓ') executed more than once. Moreover, let ℓ be the highest level such that loadlevel(ℓ) is executed during the sequence of d operations and let m = 2^{2^ℓ}.

Then there exists a small positive constant c_dl such that the number of I/O operations over the entire sequence of d operations is no more than c_dl · √m.

Using Lemma 40, we can easily prove the following lemma.

Lemma 44 The total number of new run-records created during the sequence of d download() operations described in Lemma 43 is no more than O(√m).

We also make the following simple observation regarding the invariants of Section 5.8.5.

Lemma 45 The invariants of Section 5.8.5 remain true after the execution of download(k).

Proof: During download(k), run-records and level-records are manipulated only during calls to loadlevel(). The lemma here follows from Lemma 41 and the fact that any download() execution calls loadlevel(ℓ') only when lr[ℓ'] is nil and lr[ℓ' + 1] ≠ nil.

5.8.8 Putting download() and llmerge() Together
We describe how to combine the download() procedure, used to dynamically reorganize merging computation, and the llmerge() procedure, used to merge physical sequences in appropriate merge operations, together with the run-record and level-record data structures, to realize the memory-adaptive merging routine MAMerge that merges together ρ runs. By using MAMerge in our framework of Section 5.5.2 as described at the beginning of Section 5.8, we obtain a memory-adaptive mergesort.
In order to complete the description of our memory-adaptive sorting algorithm we need to define the constant c_level used in our definition of allocation levels, and the constant m_model, in the context of our memory-adaptive mergesort.

Definition 45 The constant c_level is defined to be c_level = (c'_llm + c_dl)/2. The constant m_model is defined to be m_model = ⌈2 c_level⌉.

If m is of the form 2^{2^ℓ}, where 1 ≤ ℓ ≤ ℓ_max, then the number 2 c_level √m of I/O operations in a level ℓ allocation phase of size c_level √m is large enough to first accommodate c_dl √m I/O operations (corresponding to the sequence of download() operations in Lemma 43) "loading" level ℓ and then permit the smallest number c_llm √m of I/O operations involved in a good llmerge(rr) call, where rr = rr_ℓ, the active run-record of level ℓ. The smallest size m_model of an allocation phase is such as to permit a binary merge during a phase of that size.
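For concreteness, the inequality behind this claim follows directly from Definitions 43 and 45 (recall that c'_llm = c_llm + δ ≥ c_llm); the display below, written in LaTeX, just spells out the arithmetic.

\[
  2\,c_{level}\sqrt{m} \;=\; \bigl(c'_{llm} + c_{dl}\bigr)\sqrt{m} \;\ge\; c_{dl}\sqrt{m} \;+\; c_{llm}\sqrt{m},
\]

so a level-ℓ allocation phase has room for a complete sequence of download() operations as in Lemma 43 followed by the c_llm √m I/O operations of a good llmerge(rr_ℓ) call.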
Next we define some useful abbreviations used in our description of MAMerge.

Definition 46 We use the symbol clevel to mean the current allocation level level(mem). If ℓ ≤ maxlevel, we define the predicate enough(ℓ) to be true whenever invariant 2b of Section 5.8.6 is satisfied and false otherwise. If ℓ > maxlevel, we define the predicate enough(ℓ) to be true whenever invariant 3b of Section 5.8.6 is satisfied and false otherwise.

Algorithm MAMerge
The algorithm MAMerge can be summarized as follows:

1. If the allocation level is ℓ ≤ maxlevel and there is work associated with level ℓ (meaning lr[ℓ] ≠ nil), then execute llmerge(rr_ℓ). If ℓ > maxlevel, execute llmerge(rr_global).

2. If the allocation level is ℓ and lr[ℓ] = nil, then download some work to level ℓ by executing download(ℓ).

A more precise description is as follows (a compact code outline appears after these steps):

1. Execute the preprocessing stage of Section 5.8.3 and then the initialization of level-records as indicated in Section 5.8.4.

2. While rr_global.flag ≠ Done, do:

(a) Let ℓ̂ = min{clevel, maxlevel}.
(b) Load the level-records lr[1] through lr[ℓ̂] into memory in a blocked manner.
(c) If clevel > maxlevel then
i. Execute llmerge(rr_global).
ii. When the call llmerge(rr_global) returns control, if rr_global.flag = Done, then MAMerge is completed.
iii. If the call llmerge(rr_global) returns control at a time when the allocation level level(next) of the following phase is maxlevel or smaller, then relinquish the remaining portion of the current phase, whose allocation level must necessarily be greater than maxlevel. Otherwise proceed immediately to the following step.
iv. GoTo Step 2a.
(d) When execution reaches this point, we have clevel ≤ maxlevel and we need to consider the two cases lr[clevel] = nil and lr[clevel] ≠ nil.
i. If lr[clevel] = nil, then
A. If clevel = maxlevel, then MAMerge is completed; otherwise execute download(clevel).
B. When the call download() returns control, the level of allocation clevel may have changed since the time the call was made. If enough(clevel) is false for the new value of clevel, then relinquish the remaining portion of the allocation phase ongoing when download() returns. If enough(clevel) is true, then proceed immediately to the following step.
C. GoTo Step 2a.
ii. Else
A. While enough(clevel) AND lr[clevel] ≠ nil, execute llmerge(rr_clevel), where rr_clevel is the active run-record of level clevel.
B. When the last call llmerge(rr_clevel) of the while loop returns, if enough(clevel) is false, then relinquish the remaining portion of the ongoing allocation phase; otherwise proceed immediately to the following step.
C. GoTo Step 2a.

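The following compact outline mirrors the steps above. The state object and all of its methods (clevel, lr, enough, llmerge, download, relinquish_phase and so on) are placeholders assumed for this sketch; they stand for the machinery described throughout Sections 5.8.5 through 5.8.7.

    def mamerge_main_loop(state):
        while state.rr_global.flag != "Done":                    # Step 2
            l_hat = min(state.clevel(), state.maxlevel)          # Step 2a
            state.load_level_records(1, l_hat)                   # Step 2b
            if state.clevel() > state.maxlevel:                  # Step 2c
                state.llmerge(state.rr_global)
                if state.rr_global.flag == "Done":
                    return
                if state.next_phase_level() <= state.maxlevel:
                    state.relinquish_phase()
                continue
            if state.lr(state.clevel()) is None:                 # Step 2(d)i
                if state.clevel() == state.maxlevel:
                    return
                state.download(state.clevel())
                if not state.enough(state.clevel()):
                    state.relinquish_phase()
            else:                                                # Step 2(d)ii
                while state.enough(state.clevel()) and state.lr(state.clevel()) is not None:
                    state.llmerge(state.active_run_record(state.clevel()))
                if not state.enough(state.clevel()):
                    state.relinquish_phase()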
Relinquished I/O Operations
Since invariant 2b (respectively invariant 3b) has to be met at the time the call to llmerge(rr_ℓ) (respectively llmerge(rr_global)) is made, MAMerge sometimes relinquishes a portion of an allocation phase. We count the relinquished I/O operations among the I/O operations incurred by MAMerge by using the following charging scheme.

Definition 47 Whenever MAMerge relinquishes part of an allocation immediately after executing a download() operation, we charge the relinquished I/O operations to that particular download() operation. Whenever MAMerge relinquishes part of an allocation immediately after executing an llmerge() operation, we charge the relinquished I/O operations to that particular llmerge() operation.

First we account for I/O operations charged to download() operations by using the notion of (ℓ, d)-sequences, which are sequences of d consecutive download() operations, possibly followed by a relinquish operation.

Definition 48 Consider a sequence of d consecutive download(ℓ_i) operations, where 0 ≤ i ≤ d − 1 and d ≥ 1, that satisfy the conditions mentioned in Lemma 43, and let ℓ be as defined in Lemma 43. We call the above sequence of d download() operations an (ℓ, d)-sequence if the first time MAMerge either relinquishes I/O operations (in Step 2(d)iB above) or executes an llmerge() operation after it executes download(ℓ_0) is only immediately after it executes download(ℓ_{d−1}).

We make a crucial observation regarding (ℓ, d)-sequences.

Lemma 46 Consider any (ℓ, d)-sequence download(ℓ_0), ..., download(ℓ_{d−1}) and let m = 2^{2^ℓ}. If, at any time after download(ℓ_0) begins and before download(ℓ_{d−1}) ends, MAMerge is subjected to an allocation phase of size m_h such that level(m_h) ≥ ℓ, then the execution of download(ℓ_{d−1}) is immediately followed (that is, without relinquishing any I/Os) by the execution of llmerge(rr), where ℓ_h = level(m_h) and rr is the active run-record rr_{ℓ_h} of level ℓ_h if ℓ_h ≤ maxlevel and the run-record rr_global otherwise.

Proof: From Lemma 43, we know that the total number of I/O operations required over the entire (ℓ, d)-sequence is no more than c_dl √m. Suppose that after download(ℓ_0) begins execution and before download(ℓ_{d−1}) completes execution, MAMerge gets an allocation phase of size m_h, where m_h is as defined above. Then even if all c_dl √m I/O operations corresponding to the (ℓ, d)-sequence occurred during the phase of size m_h, that phase is still left with c_llm √(m_h) pending I/O operations, which would cause enough(level(m_h)) to evaluate to true: This follows from the definition of c_level above. Furthermore, by definition of an (ℓ, d)-sequence, the d-th download() operation download(ℓ_{d−1}) of the (ℓ, d)-sequence cannot be immediately followed by another download() operation, so llmerge(rr) is the next operation executed by MAMerge. Thus the lemma is proved.

If I/O operations corresponding to a portion of an allocation are relinquished by MAMerge after an (ℓ, d)-sequence, then the level of that allocation is necessarily smaller than ℓ, by virtue of the above lemma. This also means that the maximum number of I/O operations relinquished by MAMerge that can be charged to an (ℓ, d)-sequence is no more than O(2^{2^{ℓ-1}}).
We now state a simple consequence of the above lemma, Lemma 43 and Definition 47 to bound the number of I/O operations that can be charged to an (ℓ, d)-sequence.

Lemma 47 The number of I/O operations that can be charged to an (ℓ, d)-sequence is no more than O(√m), where m = 2^{2^ℓ}.

When MAMerge relinquishes the I/O operations of a portion of a level ℓ allocation phase immediately after executing an llmerge() call (Step 2(d)iiB or Step 2(c)iii), the number of I/O operations charged to that llmerge() call remains within a constant factor of the I/O operations incurred by that llmerge() call, as per Lemma 37 or Lemma 38.

Lemma 48 The total number of I/O operations charged to an llmerge() operation, including I/Os relinquished by MAMerge, is given by Lemma 37 if llmerge() is a good call and by Lemma 38 if it is a bad call.

5.9 Analysis of resource consumption

In this section we show that in merging together ρ runs, totally consisting of n' blocks, our memory-adaptive merging algorithm consumes only O(n' lg ρ + n' lg m_max) resources and results in an optimal sorting algorithm.
We already showed that the total number of I/O operations incurred while preprocessing is no more than O(ρ), which means a resource consumption of no more than O(ρ lg m_max). Next we show that the resource consumption incurred by the downloading activity, which makes MAMerge memory-adaptive, is only O(ρ lg m_max). Then, by charging the I/O operations incurred by a bad llmerge() call to the run-records touched by that call, we argue that the total number of I/O operations incurred by bad llmerge() calls throughout the execution of MAMerge is no more than O(ρ). Then we argue that the total number of I/O operations incurred by llmerge(rr_global) calls is no more than O(n'). Finally, we employ the notion of merge potential to show that MAMerge makes optimal utilization of available resources during good llmerge() calls. We do so by showing that the resource consumption of each good llmerge() call is always within a constant factor of the increase in merge potential it registers. A corollary of the above fact is that the net resource consumption charged to all good llmerge() calls cannot exceed O(n' lg ρ).

5.9.1 Resource Consumption of download() calls, bad llmerge() calls and llmerge(rr_global) calls
By definition, the allocation phases utilized by MAMerge for the execution of download() calls, bad llmerge() calls, and llmerge(rr_global) calls can be nonoptimal phases. Here we show that the net resource consumption over all such activity during MAMerge is no more than O(n' lg m_max), where n' is the total number of blocks in the ρ runs being merged by MAMerge.

Resource consumption during download() computation

As we noted earlier, download() computation is memory-oblivious: Since it never requires more than m_model blocks, it can be carried out during allocation phases at arbitrary levels. Every (ℓ, d)-sequence involves a loadlevel(ℓ) operation. We bound the resource consumption of all the download() operations that occur in the course of the algorithm by bounding the maximum possible number of loadlevel(ℓ) operations in the course of the algorithm, for 1 ≤ ℓ ≤ ℓ_max − 1, and then using Lemma 47.
First we make an observation that bounds the maximum number of times level ℓ can be involved in a loadlevel(ℓ − 1) operation at a stretch, without the occurrence of a loadlevel(ℓ) operation.

Lemma 49 Suppose that 1 ≤ ℓ < ℓ_max. If ℓ + 1 < ℓ_max, the maximum number of loadlevel(ℓ) operations the level-record lr[ℓ + 1] can be involved in before it becomes nil is lr[ℓ + 1].rr.Order + 1; if ℓ + 1 = ℓ_max, the maximum number of loadlevel(ℓ) operations the level-record lr[ℓ + 1] can be involved in before it becomes nil is W_top + 1.

Since each (ℓ, d)-sequence necessarily includes a loadlevel(ℓ) operation, we have the following lemma.

Lemma 50 Consider any level ℓ_max − j, where 1 ≤ j ≤ ℓ_max − 1. The total number of (ℓ_max − j, d)-sequences possible in the course of MAMerge is no more than

(⌈ρ / 2^{2^{ℓ_max − 1}}⌉ + 1) · Π_{i=2}^{j} (ρ^{1/2^i} + 1),

where ρ = 2^{2^{ℓ_max}} and the product above is defined to be 1 when j = 1. Moreover, the above quantity is always O(ρ / ρ^{1/2^j}).

Proof: Firstly, by Lemma 49, the number of times loadlevel(ℓ_max − 1) can be called is ⌈ρ / 2^{2^{ℓ_max − 1}}⌉ + 1, by the definition of the loadlevel() procedure, from invariant 6 of Section 5.8.5 and the fact that the number W_top of non-dummy run-records immediately after the preprocessing stage is no more than ⌈ρ / 2^{2^{ℓ_max − 1}}⌉.
By invariant 9 of Section 5.8.5, for all ℓ such that 1 ≤ ℓ ≤ ℓ_max − 1, we have lr[ℓ].rr.Order = 2^{2^{ℓ-1}}.
We prove the lemma by induction on j. From Lemma 49 and the observation above, we know that loadlevel(ℓ_max − (j + 1)) can be called at most 2^{2^{ℓ_max − (j+1)}} + 1 = ρ^{1/2^{j+1}} + 1 times each time a loadlevel(ℓ_max − j) operation completes. The fact that the above expression is O(ρ / ρ^{1/2^j}) also follows by induction on j; in fact the expression can be shown to be at most a constant multiple of ρ / ρ^{1/2^j}.
The lemma follows from the observation that each (ℓ_max − j, d)-sequence must involve a loadlevel(ℓ_max − j) operation.

Lemma 51 Suppose that allocation level ℓ_max − j, where 1 ≤ j ≤ ℓ_max − 1, is charged all the I/O operations charged to any (ℓ_max − j, d)-sequence. Then the total number of I/O operations charged to level ℓ_max − j, where 1 ≤ j ≤ ℓ_max − 1, is no more than O(ρ / ρ^{1/2^{j+1}}).

Proof: Consider any (ℓ_max − j, d)-sequence, where j ≥ 1. By Lemma 50, the maximum number of (ℓ_max − j, d)-sequences is no more than O(ρ / ρ^{1/2^j}). Since Lemma 47 proves that the maximum number of I/O operations that can be charged to any (ℓ_max − j, d)-sequence is O(2^{2^{ℓ_max − j − 1}}) = O(ρ^{1/2^{j+1}}), the maximum number of I/O operations that can be charged to level ℓ_max − j is

O(ρ / ρ^{1/2^j}) · O(ρ^{1/2^{j+1}}),

which is O(ρ / ρ^{1/2^{j+1}}). Hence the lemma is proved.

We finally bound the total number of I/O operations and the total amount of resource consumption charged to all download() calls.

Theorem 13 The total number of I/O operations charged to all download() calls is O(ρ). The total resource consumption over all these I/O operations is O(ρ lg m_max).

Proof: Each download() is, by definition, part of an (ℓ, d)-sequence. So it is enough to bound the I/O operations charged to (ℓ, d)-sequences. The total number of I/O operations to be bounded is the sum, over all levels, of the number of I/O operations charged to each level. Using Lemma 51, this number is

O(Σ_{j=0}^{ℓ_max − 1} ρ / ρ^{1/2^{j+1}}),

which can be simplified as

O(ρ (1/2 + 1/4 + 1/16 + 1/256 + ⋯ + 1/ρ^{1/4} + 1/ρ^{1/2})).

The above expression is O(ρ).
The bound on the total resource consumption is obtained by assuming that the amount of memory available during all the O(ρ) I/O operations counted above is m_max and using the definition of resource consumption for memory-adaptive sorting.

Resource Consumption during bad llmerge() calls

A bad llmerge() call can consume a large amount of resources while doing very little "work". Such a situation occurs when an llmerge(rr_ℓ) execution is preempted at a time when very few items remain to be added to the physical sequence of the run associated with rr_ℓ: If the level of allocation becomes ℓ once more soon after this preemption, just setting up the merge operation llmerge(rr_ℓ) can incur O(2^{2^{ℓ-1}}) I/O operations belonging to an allocation phase of size O(2^{2^ℓ}).
We now bound the total resource consumption of all bad llmerge() calls over the execution of MAMerge by proving that the total number of I/O operations charged to bad llmerge() calls over the execution of MAMerge is O(ρ). We do so by charging the O(2^{2^{ℓ-1}}) I/O operations incurred by a bad llmerge(rr_ℓ) call to the run-records it touches, and then observing on the one hand that no run-record can ever be touched by more than one bad llmerge() call and on the other that the total number of run-records created during MAMerge is O(ρ).
By invariant 4 of Section 5.8.6 and invariants 8 and 9 of Section 5.8.5, we have the following lemma.

Lemma 52 If an execution of llmerge(rr_global) is a bad call, then the algorithm MAMerge terminates after that call. The total number of I/O operations that can be charged to that llmerge(rr_global) call is O(ρ).

On account of the above lemma we now consider only bad llmerge(rr_ℓ) calls executed when the allocation level is ℓ. We use the following charging scheme to count the total number of I/O operations over all such bad llmerge() calls.

Definition 49 We charge the total number O(2^{2^{ℓ-1}}) of I/O operations charged to a bad llmerge(rr_ℓ) call (see Lemma 48 and Lemma 38) to the Θ(2^{2^{ℓ-1}}) run-records touched (see Lemma 32) by that llmerge(rr_ℓ) call.

We now make a useful observation regarding bad llmerge(rr_ℓ) calls.

Lemma 53 When the execution of a bad llmerge(rr_ℓ) call completes, we have rr_ℓ.flag = Done.

Proof: This follows from the definition of the constant c_llm in Definition 43, the constant c_load in Lemma 35, invariants 2b and 4 of Section 5.8.6, and the fact that the only reason for llmerge(rr_ℓ) to end up being a bad call is that the formation of the run associated with rr_ℓ is logically complete.

Based upon Lemma 53, we make the following useful observation.

Lemma 54 Any run-record can be touched by at most one bad llmerge() call.

Proof: Consider any run-record rr' other than run-record rr_global. Consider the first bad llmerge(rr_ℓ) call that touches run-record rr' and let t denote the time at which that llmerge(rr_ℓ) call completes its execution. By Lemma 53, we have rr_ℓ.flag = Done when the execution of llmerge(rr_ℓ) completes. Since rr' is touched by llmerge(rr_ℓ), by definition of the merging part of llmerge(), there must exist a "path" rr_ℓ = rr(0), rr(1), rr(2), ..., rr(d) = rr' such that rr(i) is in list rr(i−1).inputs. It can be proved via induction on the length of the path that no llmerge() call after time t can touch run-record rr'. This proves the lemma.

From the charging scheme of Definition 49, it is clear that we can charge each run-record touched by a bad llmerge() call at most O(1) I/O operations of that llmerge() call and so, by Lemma 54 above, the total number of I/O operations incurred by bad llmerge() calls throughout the execution of MAMerge is bounded by the total number of run-records created during MAMerge. Below we prove that the total number of run-records created during MAMerge is O(ρ).

Lemma 55 The total number of run-records created during MAMerge is O(ρ).

Proof: The proof follows by observing that the total number O(√m) of run-records created during an (ℓ, d)-sequence (Lemma 44) is roughly the same as the number of I/O operations incurred by the (ℓ, d)-sequence (Lemma 43), where m = 2^{2^ℓ}. Hence, using the same techniques as in Lemma 50, Lemma 51 and Theorem 13, we can argue that the total number of run-records created during MAMerge is no more than O(ρ).

Thus we have proved the following theorem.

Theorem 14 The total number of I/O operations that can be charged to bad llmerge() calls made during the execution of MAMerge is O(ρ). The total resource consumption that can be charged to bad llmerge() calls made throughout MAMerge is no more than O(ρ lg m_max).

Resource Consumption of llmerge(rr_global) calls
Each llmerge(rr_global) call appends blocks of items to the physical sequence of the run associated with the "global" run-record. By definition, when MAMerge merges ρ runs totally consisting of n' blocks, the number of blocks in the run associated with rr_global is no more than n'. Blocks are appended to the physical sequence corresponding to rr_global in an efficient manner such that the total number of I/Os charged to llmerge(rr_global) calls is no more than O(n').
We have the following theorem regarding the I/O operations incurred by llmerge(rr_global) calls.

Theorem 15 Suppose that the ρ runs input to MAMerge totally consist of n' blocks. The total number of I/O operations that can be charged to llmerge(rr_global) calls made during the execution of MAMerge is O(n'). The total resource consumption that can be charged to llmerge(rr_global) calls made during the execution of MAMerge is O(n' lg m_max).

Proof: By Lemma 52, we know that there can be at most one bad llmerge(rr_global) call during MAMerge and it incurs no more than O(ρ) I/O operations. By Lemma 48 and Lemma 37, we know that if a good llmerge(rr_global) call appends g' blocks of items to the run associated with rr_global, then that llmerge(rr_global) call can be charged at most O(g') I/O operations. From invariant 2 of Section 5.8.5, we know that the total number of blocks that can get appended to the run associated with rr_global is no more than n'. The theorem follows.

5.9.2 Potential Function Argument

We have shown that the number of I/O operations charged to download() operations and bad llmerge() calls is O(ρ) = O(n'), whereas the number of I/O operations charged to llmerge(rr_global) calls is O(n'). Since the net number of I/O operations in these activities is small, even if we assume conservatively that throughout these O(n') I/O operations the allocation level was at its maximum value ℓ_max, the net resource consumption of these activities remains O(n' lg m_max).
On the other hand, the number of I/O operations charged to good llmerge() calls in general may be superlinear in the number of blocks n' output by MAMerge. The number of I/O operations charged to good llmerge() calls may be as high as O(n' lg ρ), which means we cannot adopt the counting strategy mentioned in the above paragraph: Assuming that the allocation level throughout these O(n' lg ρ) I/O operations is ℓ_max would lead to an O(n' (lg ρ)(lg m_max)) bound on the resource consumption of good llmerge() calls, which is clearly not acceptable.
In order to demonstrate that the resource consumption of good llmerge() calls is bounded by O(n' lg m_max) as well, we use the notion of merge potential defined in Section 5.6. Informally speaking, when a good llmerge(rr_ℓ) call is charged g' I/O operations, its resource consumption is O(g' · 2^ℓ); but the good llmerge(rr_ℓ) call also raises the rank of each one of the Ω(g'B) items it appends to the physical sequence of the run associated with rr_ℓ by a multiplicative factor of 2^{Ω(2^ℓ)}, so that the potential of the merge increases by an amount of Ω(g' · 2^ℓ). Since the maximum value the potential of the merge assumes is O(n' lg m_max), it follows that the net resource consumption is O(n' lg m_max).
We now formally prove our claims. Each of the run-records of the ρ runs input to MAMerge has a unique run assigned to it. It is convenient to logically assign sets containing "dummy input runs" to dummy run-records that may possibly be introduced in the course of our preprocessing. The assignment of sets containing dummy input runs to dummy run-records is only performed for the sake of analysis and does not alter the computation in any way.

Definition 50 Let U denote the set of ρ runs being merged by MAMerge. For the sake of convenience in analysis, we assign a dummy set to each dummy run-record created during MAMerge's preprocessing stage. Each dummy set is a set of one or more dummy input runs. Each dummy input run is a run distinct from any of the ρ runs of U and is defined to satisfy the following conditions:
1. Each dummy input run is contained in at most one dummy set.
2. The physical sequence of each dummy input run is always empty.
3. The rank of each dummy input run is 1.

We assign appropriately sized dummy sets to each of the dummy run-records introduced in Step 2 and Step 4 of the preprocessing stage.

Definition 51 Consider the D_1 dummy run-records d_0, d_1, ..., d_{D_1 − 1} introduced in Step 2 of the preprocessing stage. Each dummy run-record d_j is assigned a dummy set containing a single dummy input run, for 0 ≤ j ≤ D_1 − 1. Consider the D_2 dummy run-records d'_0, d'_1, ..., d'_{D_2 − 1} introduced in Step 4 of the preprocessing stage. Each dummy run-record d'_j is assigned a dummy set containing 2^{2^{ℓ_max − 1}} dummy input runs, for 0 ≤ j ≤ D_2 − 1.

Just as there is a run associated with a non-dummy run-record, we can now associate runs with dummy run-records.
Definition 52 The run associated with a dummy run-record assigned a set s of dummy input runs is defined to be the (vacuous) merge of the dummy input runs of s. The rank of this run is |s|.

A simple lemma that can be proved by induction is the following.

Lemma 56 Consider a run-record rr which is neither the run-record associated with an input run nor a dummy run-record. Let rr' and rr'' be any two run-records in the list rr.inputs. Let r, r' and r'' respectively be the runs associated with run-records rr, rr' and rr''. Then we have

p(r') = p(r'') = p(r) / rr.Order.

The following lemma follows easily from the definition of the potential assigned to a run-record and the potential of the state of the merge.

Lemma 57 The rank of the run r_global associated with run-record rr_global does not change with time and is larger than the rank of the run associated with any other run-record in MAMerge. Moreover, we have lg p(r_global) < 2 lg ρ = O(lg m_max). The potential of the merge when MAMerge completes execution is no more than 2n' lg ρ, where n' is the total number of blocks merged by MAMerge.

We consider any good llmerge(rr_ℓ) operation, where rr_ℓ is the active run-record of level ℓ and rr_ℓ ≠ rr_global, executed when the allocation level is ℓ. The following lemma proves that the total resource consumption charged to the llmerge(rr_ℓ) call is within a constant factor of the increase in the potential of the state of the merge brought about by llmerge(rr_ℓ).

Lemma 58 Suppose that the resource consumption of an llmerge() call is defined to be the resource consumption over all the I/O operations charged to that llmerge() call. Consider a good llmerge(rr_ℓ) call in which G' items are added to the run associated with the active run-record rr_ℓ of level ℓ, where rr_ℓ ≠ rr_global. Then

1. The resource consumption of that llmerge(rr_ℓ) call is O(G' 2^ℓ / B).

2. The increase in the potential of the state of the merge is at least G' 2^ℓ / 2B.

Proof: By invariant 2b and the description of the llmerge() procedure, we know that all I/O operations charged to llmerge(rr_ℓ) are ones in which the allocation phase is of size at most O(2^{2^ℓ}). By Lemma 48 and Lemma 37 we know that the total number of I/O operations charged to the llmerge(rr_ℓ) call is O(G'/B). Thus the net resource consumption charged to llmerge(rr_ℓ) is O((G'/B) lg 2^{2^ℓ}) = O((G' 2^ℓ)/B), by the definition of resource consumption.
Consider any item x appended to the physical sequence of the run r_ℓ, associated with rr_ℓ, by llmerge(rr_ℓ). From the procedure llmerge() and the definition of the potential assigned to an item at any time, we know that the maximum possible rank x could have had prior to llmerge(rr_ℓ) is p(r'), where r' is the run associated with a run-record in the list rr_ℓ.inputs. By invariants 10 and 9 and the definition of the active run-record rr_ℓ of level ℓ, we know that rr_ℓ.Order = 2^{2^{ℓ-1}}. Thus, by Lemma 56, the rank of element x increases by a factor of at least 2^{2^{ℓ-1}} during llmerge(rr_ℓ). Since llmerge(rr_ℓ) adds G' items to the physical sequence of the run associated with rr_ℓ, the net increase in merge potential is at least

G' · (1/B) · lg 2^{2^{ℓ-1}} = G' 2^ℓ / 2B.

Thus the lemma is proved.

Lemma 58 and Lemma 57 can now be used to prove the following theorem.

Theorem 16 The total resource consumption charged to all good llmerge(rr) calls during MAMerge in which rr is not the run-record rr_global is O(n' lg ρ), where n' is the total number of blocks of items among the ρ runs input to MAMerge.

Proof: By Lemma 58, whenever any of the good llmerge() calls in question is charged a resource consumption of R, the increase in the potential of the state of the merge is Ω(R). Thus at any time the total amount of resource consumption charged to all good llmerge(rr) calls in which rr ≠ rr_global is within a constant factor of the potential of the state of the merge. Since, by Lemma 57, the maximum value for the potential of the state of the merge is 2n' lg ρ, the theorem is proved.

5.9.3 Optimality of Resource Consumption for Sorting

From Lemma 29, Theorem 13, Theorem 14, Theorem 16 and the definition of ρ, the number of runs input to MAMerge, we have the following theorem.

Theorem 17 Consider our algorithm MAMerge used to merge ρ runs, each of length at least m_model blocks, totally comprising n' blocks. The total amount of resource consumption incurred by MAMerge is O(n' lg m_max).

Applying Theorem 17 to Lemma 25 and Corollary 2, we have the following theorem.

Theorem 18 Consider the memory-adaptive sorting algorithm based upon our mergesort framework of Section 5.5.2, using MAMerge as the memory-adaptive merging routine, as indicated at the beginning of Section 5.8. When used to sort an input file of n blocks, the resource consumption of our sorting algorithm is O(n lg n) and the algorithm is dynamically optimal.

5.10 Some notes on the potential function

Our potential argument implies that in order to attain dynamic optimality it is necessary for the memory-adaptive mergesort to pump up the merge potential by at least Ω(m lg m) during a typical allocation phase of size m. In this context, consider a memory-adaptive merge M' that tries to attain efficient memory utilization by using the following technique during allocation phases of size m: Algorithm M' carries out Θ(m) binary merge operations "in parallel" by dividing the I/Os and the memory blocks of the allocation phase among the Θ(m) binary merges. This way M' makes use of all m memory blocks of the allocation. However, in terms of our potential function, the increase in potential during an allocation phase of size m using algorithm M' is only 1/B for each one of Θ(mB) items (that is, one block of items for each one of the Θ(m) binary merges), resulting in a net potential increase of Θ(m), in contrast to the desired increase of Ω(m lg m). In fact, the trivial memory-adaptive merging algorithm that consists of always executing a single binary merge, oblivious of the allocation level, also attains the same Θ(m) increase in merge potential during an allocation phase of size m, even though it uses only O(1) of the m blocks allocated to it. This indicates that the strategy M' is as bad as simple binary merging, notwithstanding the fact that it uses all blocks of the allocation phase.
A much more subtle issue related to our particular potential function is that the maximum potential, and hence the resource consumption, can be O(n lg |S|) when the runs of set S totally consist of n blocks. This resource consumption is optimal for memory-adaptive merging in the case when all input runs are equal in length; however, it is nonoptimal for memory-adaptive merging when the input runs can be of arbitrary lengths: To see this point, consider a set S of runs such that one run is n − |S| + 1 blocks long and all other runs are one block long each. Suppose that n > |S| lg |S|. This set of runs can be merged by first merging together the |S| − 1 one-block runs using O(|S| lg |S|) I/O operations in a binary merging process and then merging the resulting (|S| − 1)-block run with the (n − |S| + 1)-block run to get the n-block output run. The net number of I/O operations is O(n). Since all merges here were binary, they can all be performed over a sequence of arbitrary allocation phases. When the allocation sequence consists of O(1)-sized allocation phases, the net resource consumption of this strategy is O(n). The O(n lg |S|) resource consumption permitted by our potential function is rendered nonoptimal in the above example. Obtaining a dynamically optimal general-purpose algorithm for merging remains an interesting open question. In the case of our mergesort, the runs being merged by algorithm MAMerge are, in general, unequal in length, so its resource consumption O(n lg m_max) is nonoptimal with respect to the general problem of memory-adaptive merging. Yet our MAMerge-based framework is dynamically optimal for sorting: Even though MAMerge is not dynamically optimal for the general problem of memory-adaptive merging, it does yield a dynamically optimal sorting algorithm.
Yet another question pertains to the manner in which MAMerge reorganizes merge computation. Algorithm MAMerge splits an m = 2^{2^{ℓ-1}}-way merge into √m-way merges when it needs to reorganize the merge to assign work to allocation levels smaller than ℓ. The potential increase registered by MAMerge during a typical phase of size m is optimal (with respect to sorting) because this potential increase corresponds to that attained by 2m I/O operations of a "traditional" mergesort algorithm A' that repeatedly merges together the fixed number √m of runs, which is optimal for a static memory allocation of m blocks. However, for a static memory of m blocks, the mergesort algorithm A'' that repeatedly merges together Θ(m) runs is more efficient by a constant factor. One interesting question is whether or not there exists a memory-adaptive mergesort which registers a potential increase comparable to that of A'' during a phase of size m.

5.11 Dynamically Optimal Permuting, FFT, and
Permutation Networks
We now consider dynamically optimal algorithms for problems related to sorting. We use the memory-adaptive
merging and sorting algorithms as subroutines for developing memory-adaptive permuting and FFT algorithms: The
memory-adaptive permutation network follows from the FFT algorithm since a series of at most three appropriately
designed FFT circuits can simulate any permutation network [AV88].

5.11.1 Permuting
A permuting problem can be solved by rst attaching with each item to be permuted its destination address and then
sorting the items to be permuted using the address eld as key. Our sorting algorithm can be used for this purpose.
However, as pointed out in Theorem 9, there are certain circumstances under which the sorting lower bound does not
hold for the permuting problem. In the text immediately following Theorem 9 in Section 5.3.1, we argued that when
the sorting lower bound does not hold for permuting, a naive internal memory algorithm using O(N ) I/Os and O(1)
blocks of internal memory to permute a le of N items is dynamically optimal. Based upon this observation we can
develop a dynamically optimal permuting algorithm.

Theorem 19 There exists a simple dynamically optimal permuting algorithm.

Sketch of Proof: We can run "in parallel" the naive permuting algorithm, which at any time takes only O(1) memory blocks, and the dynamically optimal sorting algorithm. The sorting algorithm gets to use all but O(1) blocks of memory in each allocation phase, and the naive permuting algorithm gets to use one I/O operation for every O(1) I/O operations the sorting algorithm gets, so that the total number of I/Os spent is within a constant factor of the number of I/Os spent on permuting. It can be verified that if we terminate the above algorithm as soon as one of the two algorithms running "in parallel" completes execution, the resource consumption is within a constant factor of optimal. Thus our algorithm is dynamically optimal.
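The interleaving used in this proof sketch can be pictured with the following hypothetical Python fragment; the iterators and the constant c are placeholders for the two algorithms' I/O streams and the O(1) ratio, not part of the thesis's implementation:

    def run_in_parallel(sorter_ios, permuter_ios, c=4):
        """Interleave two I/O streams: the sorting algorithm gets c I/Os for every single
        I/O the naive permuter gets; stop as soon as either algorithm finishes. Both
        arguments are iterators yielding one item per I/O performed; returns the total
        number of I/Os spent by the combined algorithm."""
        total = 0
        while True:
            for _ in range(c):                          # c I/Os for the sorting algorithm
                if next(sorter_ios, None) is None:
                    return total
                total += 1
            if next(permuter_ios, None) is None:        # one I/O for the naive permuter
                return total
            total += 1

    # Toy usage: pretend the sorter needs 1000 I/Os and the naive permuter needs 300.
    print(run_in_parallel(iter(range(1000)), iter(range(300))))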

5.11.2 Dynamically Optimal FFT and Permutation Networks

We now develop a memory-adaptive version of the FFT algorithm of Vitter and Shriver [VS94], which is based upon a series of shuffle-merge operations. A shuffle-merge is a merge in which all input runs are equal in length and the output run consists of a perfect item-wise interleaving of the input runs. We define a shuffle merge as follows.

Definition 53 An f-way shuffle merge is a merge of f runs, each consisting of an equal number, say p, of blocks of items, such that if the sequence of items of the ith run is denoted by

    x_{i,0}, x_{i,1}, x_{i,2}, ..., x_{i,pB-1},

for each i satisfying 0 ≤ i ≤ f − 1, then the output of the merge is

    x_{0,0}, x_{1,0}, x_{2,0}, ..., x_{f-1,0}, x_{0,1}, x_{1,1}, ..., x_{f-1,1}, ..., x_{0,pB-1}, x_{1,pB-1}, ..., x_{f-1,pB-1}.
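As a small illustration of Definition 53 (a sketch under the assumption that runs are plain Python lists of items), the interleaving can be expressed as:

    def shuffle_merge(runs):
        """Perfect item-wise interleaving of f equal-length runs (an f-way shuffle merge).
        runs[i][j] plays the role of x_{i,j} in Definition 53."""
        length = len(runs[0])
        assert all(len(r) == length for r in runs), "all runs must be equally long"
        return [runs[i][j] for j in range(length) for i in range(len(runs))]

    # Example with f = 3 runs of 4 items each:
    print(shuffle_merge([[0, 1, 2, 3], [10, 11, 12, 13], [20, 21, 22, 23]]))
    # -> [0, 10, 20, 1, 11, 21, 2, 12, 22, 3, 13, 23]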

The following lemma follows from the definition above and Theorem 12.

Lemma 59 An m_max-way shuffle merge in which each run consists of n′/m_max blocks of items can be performed in a memory-adaptive manner by algorithm MAMerge with a resource consumption of O(n′ lg m_max).

In our memory-adaptive simulation of the FFT algorithm of Vitter and Shriver, we chop up the N-input, lg N-level FFT digraph into (lg N)/lg M′_max "layers", where each layer consists of lg M′_max levels and M′_max = 2^{2^{ℓ_max-1}} B. Below we describe how to process each layer. In order to route the outputs of one layer to the appropriate input destinations of the next layer, we need to perform a series of max{1, log_{m_max}(min{M′_max, N/M′_max})} m_max-way shuffle-merge operations, each involving n blocks of items [VS94]. By Lemma 59, we can use MAMerge to implement all the shuffle-merge operations required to route the outputs of one layer to the inputs of the next. The net resource consumption of all such m_max-way shuffle-merge operations, summed over all (lg N)/lg M′_max layers, is

    O( n (lg m_max) · (lg N / lg M′_max) · (1 + log_{m_max} min{M′_max, N/M′_max}) ),    (5.12)

which can be simplified to O(n lg n) [VS94].


Each layer consists of lg M′_max levels and a total of N input items. We can split each layer into N/M′_max "groups" such that each group itself is an independent FFT graph on M′_max = 2^{2^{ℓ_max-1}} B inputs. While processing a particular layer, whenever our algorithm has an allocation phase of size y · 2^{2^{ℓ_max-1}} for y ≥ 1, we compute the outputs of ⌊y⌋ groups during that phase. In order to function efficiently during smaller allocation phases, we make the crucial observation that pebbling through an FFT graph with M′_max = 2^{2^{ℓ_max-1}} B inputs is equivalent to first pebbling through 2^{2^{ℓ_max-2}} independent FFT graphs, each with 2^{2^{ℓ_max-2}} B input nodes, then executing a 2^{2^{ℓ_max-2}}-way shuffle-merge involving 2^{2^{ℓ_max-1}} blocks in total, and then pebbling through 2^{2^{ℓ_max-2}} independent FFT graphs, each with 2^{2^{ℓ_max-2}} B input nodes, once more. This decomposition of an m = 2^{2^{ℓ_max-1}}-block FFT digraph into √m = 2^{2^{ℓ_max-2}} equally sized, independent FFT digraphs is analogous to the splitting of an m-way merge into √m merges, each one itself a √m-way merge. Each of these steps can be implemented in a memory-adaptive manner by using modifications of the data structures and online reorganization techniques that we developed for MAMerge, so that the net resource consumption incurred in processing a single group is O((M′_max/B) lg(M′_max/B)) = O((M′_max/B) lg m_max). As a result, each layer can be processed with a resource consumption of O(n lg m_max).
Thus the net resource consumption of our FFT algorithm is O(n lg n) and we have a dynamically optimal FFT
algorithm. It follows that we also have a dynamically optimal algorithm for arbitrary permutation networks since it
is well known that any N input permutation network can be simulated by at most three appropriate N input FFT
digraphs placed one after the other; see [AV88] for references.
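The recursive decomposition described above can be pictured with the following schematic Python sketch (ours); it merely counts block I/Os under the simplifying assumptions of perfect-square sizes and a constant-size base case, and the count solves to O(m lg m) per group:

    import math

    def pebble_fft(m):
        """Count block I/Os for pebbling an m-block FFT digraph via the decomposition
        described above: pebble sqrt(m) independent sub-FFTs, do a sqrt(m)-way shuffle
        merge over all m blocks, then pebble sqrt(m) sub-FFTs once more. With an O(m)
        base case the count solves to O(m lg m)."""
        if m <= 4:                        # small enough to pebble directly
            return m
        r = math.isqrt(m)                 # sqrt(m) sub-digraphs of m // r blocks each
        sub = m // r
        return r * pebble_fft(sub) + m + r * pebble_fft(sub)

    print(pebble_fft(2 ** 16))            # grows like m * lg m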

5.12 Memory-Adaptive Buffer Tree

In this section we present a memory-adaptive version of the buffer tree data structure introduced by Arge [Arg94]. The buffer tree is a general technique to efficiently externalize many internal memory data structures. The most appealing aspect of the buffer tree is that it isolates the I/O-specific parts of data structures so that a specific set of I/O-efficient techniques can be applied to several different internal memory data structures that have a "buffer tree wrapper", resulting in I/O-optimal (in an amortized sense [Arg94]) algorithms for several applications consisting of batched dynamic problems. These include improved graph algorithms, ordered binary decision diagrams, external heaps, and string sorting, among other applications. (See [Arg96] and [Vit98] for details.) More recent applications of the buffer tree include "bulk loading" operations on R-trees and B-trees [AHVV99].
In order to present a concise description of our memory-adaptive buffer tree, it is convenient to informally differentiate between two types of external memory problems: external memory problems that are memory-oblivious and external memory problems that are memory-intensive. Memory-oblivious problems include operations such as scanning a file, partitioning a file into a constant number of buckets, and merging together a constant number of runs: assuming that the input consists of n blocks, all the above problems can be solved in an optimal, linear number O(n) of I/O operations regardless of internal memory size and are unaffected, positively or negatively, by memory fluctuations. The download() operation of MAMerge is an example of a memory-oblivious operation. On the other hand, memory-intensive problems include sorting a file, merging a set of ω(1) runs together, and distributing an unsorted file into ω(1) buckets. In such applications it is essential to mimic an optimal algorithm designed for the static, m-block memory version of the problem during allocation phases of size m; not doing so results in nonoptimal utilization of allocations and accompanying overhead. Memory-adaptive algorithms for such applications are affected by memory fluctuations and changes in the allocation level: when allocation levels drop they have to do more I/O to do the same amount of work, and when allocation levels rise they can do the same work with fewer I/O operations.
We identify the buffer-emptying operation to be the only memory-intensive operation involved in implementing buffer trees and use our dynamically optimal sorting technique to render an optimal, memory-adaptive version of the buffer emptying operation. Since all other operations involved in the buffer tree technique are memory-oblivious, this is enough to guarantee a dynamically optimal buffer tree (in an amortized sense) by invoking the same arguments as in [Arg94].

Definition 54 A buffer tree of fanout parameter m′ consists of an (a, b)-tree extended with m′ buffer blocks per internal node, where a = m′/4 and b = m′, as defined in [Arg94]. A memory-adaptive buffer tree is a buffer tree of fanout parameter m′_max, where m′_max is a polynomial in m_max, with memory-adaptive performance; that is, the buffer tree operations have to be performed over an allocation sequence as per our dynamic memory model.

Insert and Delete operations on the buffer tree may involve the execution of buffer emptying computations and the splitting and merging of nodes in the course of rebalancing operations. We refer the reader to [Arg94] for details on the buffer tree. An examination of buffer tree operations and the techniques used to implement them reveals that the buffer emptying operation is the only memory-intensive buffer tree operation, all other operations being memory-oblivious.

5.12.1 Memory-adaptive buffer emptying of internal nodes

In the original buffer tree [Arg94], the buffer emptying process at an internal node v consists of using Θ(m′) blocks of memory to empty the contents associated with node v as follows:

1. Load into internal memory and sort at most m′B/2 timestamped items of the buffer associated with node v.

2. Distribute these sorted items, after cancelling out "annihilating pairs", into at most O(m′) nodes corresponding to the children of node v using the Θ(m′) partitioning elements associated with v.

The above procedure takes O(m′) I/O operations using Θ(m′) internal memory blocks. We show how to implement the above buffer emptying process in our dynamic memory model using no more than O(m′_max lg m_max) resource consumption in our memory-adaptive buffer tree.

1. The m′_max B/2 or fewer timestamped items of node v are sorted using our memory-adaptive sorting algorithm, using no more than O(m′_max lg m_max) resource consumption. This implements Step 1 of the above buffer emptying operation.

2. Since the set of m′_max B/2 or fewer items has been sorted, we can carry out the distribution of these items to the children of v in a memory-oblivious manner: we stream through these records and the sorted list of partitioning elements of node v; this takes no more than O(m′_max) I/O operations. The net resource consumption of this implementation of Step 2 of the buffer emptying process cannot be more than O(m′_max lg m_max).
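A minimal sketch of this two-step buffer-emptying procedure, assuming records are (key, timestamp, operation) triples and using a simplified cancellation rule for annihilating pairs, might look as follows; the routine names are our own, and Python's built-in sort merely stands in for the memory-adaptive sort:

    def empty_buffer(buffer_items, partition_keys, ma_sort):
        """Two-step buffer emptying: sort the timestamped records, cancel annihilating
        insert/delete pairs, then distribute the survivors to the children of v by one
        synchronized scan over the sorted partitioning elements."""
        srt = ma_sort(buffer_items, key=lambda r: (r[0], r[1]))        # Step 1: sort
        surviving = []
        for rec in srt:                    # simplified cancellation of "annihilating pairs"
            if surviving and surviving[-1][0] == rec[0] and rec[2] == "delete":
                surviving.pop()
            else:
                surviving.append(rec)
        children = [[] for _ in range(len(partition_keys) + 1)]        # Step 2: distribute
        child = 0
        for rec in surviving:              # memory-oblivious streaming distribution
            while child < len(partition_keys) and rec[0] >= partition_keys[child]:
                child += 1
            children[child].append(rec)
        return children

    # Toy usage, with Python's built-in sort standing in for the memory-adaptive sort:
    recs = [(5, 0, "insert"), (2, 1, "insert"), (5, 2, "delete"), (9, 3, "insert")]
    print(empty_buffer(recs, [4, 8], lambda xs, key: sorted(xs, key=key)))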

Hence we have the following lemma.

Lemma 60 The total amount of resource consumption during a buffer emptying operation on an internal node of our buffer tree with fanout parameter m′_max is O(m′_max lg m_max).

5.12.2 Memory-adaptive buffer emptying of other nodes

We use the buffer emptying process of Section 5.12.1 even in the case of tree nodes just above the level of leaf nodes, which are not considered internal nodes [Arg94]. However, while the buffer tree guarantees that the buffer emptying computation at each internal node never involves more than O(m′_max) blocks of records to be emptied, there is no such guarantee while emptying the buffer of a tree node just above the level of leaf nodes. So, as far as nodes just above the level of leaf nodes are concerned, the buffer emptying process described above can involve more than m′_max B/2 items. If the buffer tree is being used to carry out a sequence of N = nB arbitrary insert/delete operations, the number of items involved in a buffer emptying operation at a node just above the leaf level could be as high as O(nB). Fortunately, it can be shown [Arg94] that each item^{12} can be involved at most once in a buffer-emptying computation at any node on the level just above the leaf level. This means that the number of items involved in buffer emptying operations, summed over all the nodes on the level just above the leaf nodes, is O(nB).
Using the same technique used to implement the buffer emptying computation for internal nodes, we have the following lemma.

Lemma 61 Consider an initially empty buffer tree with fanout parameter m′_max that evolves over an arbitrary sequence of nB insert/delete operations. Consider all the buffer emptying operations that occur at nodes on the level just above the leaf nodes of the tree. Let the ith such buffer emptying operation occur at node v_i. Let the total number of blocks of the items associated with node v_i's buffer be n_i. Then

1. Σ_i n_i ≤ n.

2. The total amount of resource consumption R_i of the buffer emptying operation on node v_i is no more than O(n_i lg n_i).

3. The total amount Σ_i R_i of resource consumption over all buffer emptying operations at nodes at the level just above the leaf nodes over the entire sequence of nB insert/delete operations is Σ_i R_i = O(n lg n).

Main result on buffer tree

Based upon Lemmas 60 and 61 and the amortization arguments of Theorem 1 of [Arg94], we have the following theorem regarding the resource consumption of our memory-adaptive buffer tree.
12 Each item here has a unique timestamp. The ith insert/delete operation on the buffer tree generates a record with timestamp i.

Theorem 20 The total resource consumption over an arbitrary sequence of nB insert and delete operations on an initially empty memory-adaptive buffer tree is O(n lg n); the amortized resource consumption of each operation is O((1/B) lg n). Our memory-adaptive buffer tree is dynamically optimal.

Proof: The total number of times each of the N items can be involved in buffer emptying operations at "full" internal nodes is O(lg n / lg m′_max) [Arg94]. Since each such buffer emptying process involves Θ(m′_max B) items and incurs a resource consumption of O(m′_max lg m_max) (by Lemma 60), the total resource consumption for all buffer emptying operations involving "full" internal nodes is

    N · (lg n / lg m′_max) · (m′_max lg m_max) / (m′_max B) = O(n lg n).

Each rebalance operation^{13} can incur no more than O(m′_max lg m_max) resource consumption (by Lemma 60), and there are no more than O(n/m′_max) rebalancing operations over the entire sequence of nB insert/delete operations [Arg94]. Thus rebalancing cannot cost more than O(n lg m_max) = O(n lg n) resource consumption. Lemma 61 accounts for the resource consumption during buffer emptying operations at non-internal nodes just above the level of leaves. This proves the theorem.

5.13 Dynamically Optimal Memory-adaptive Matrix Arithmetic

In this section we consider the problem of multiplying two N̂ × N̂ matrices in a dynamically optimal manner. It turns out that the techniques developed here for matrix multiplication also apply to the problem of LU factorization of an n-block matrix, as implied by results in [WGWR93]. We first consider issues related to the disk block layout of matrices that arise during the algorithm.

5.13.1 Transformation between different blocking orders

Consider the multiplication of two N̂ × N̂ matrices A and B, each consisting of N = N̂² elements spread over n = N/B disk blocks.^{14} The product matrix C = AB also consists of N elements. For convenience, we assume in this section that n is a power of 4: if this condition is not met, each matrix can be padded without changing the asymptotic running time of our algorithm.
13 In each rebalance operation, either a node is split into two or two nodes are fused into one; the latter operation may require emptying some "non-full" buffer containing O(m′_max) blocks. See [Arg94] for details.
14 In contrast to previous discussions, in this section we use N to denote the size of the output, as opposed to the input, which here is of size 2N.
Very often, matrices are stored in row-major order, defined below, on disk. However, in many external memory matrix algorithms it is more convenient to store matrices in two-dimensional blocked order, defined below, on disk.

Definition 55 An N̂ × N̂ matrix A with N = N̂² elements is said to be stored in row-major order on disk if its elements are on n blocks b_{i,j}, where 0 ≤ j ≤ N̂/B − 1 and 0 ≤ i ≤ N̂ − 1, such that block b_{i,j} contains elements A[i, jB] through A[i, jB + B − 1].
An N̂ × N̂ matrix A with N = N̂² elements is said to be stored in two-dimensional blocked order on disk if its elements are on n blocks b_{i,j}, where 0 ≤ i, j ≤ N̂/√B − 1, such that b_{i,j} contains the B elements A[i′, j√B] through A[i′, j√B + √B − 1] of row i′, for each i′ such that i√B ≤ i′ ≤ i√B + √B − 1. We assume that each block b_{i,j} of the two-dimensional blocked order has pointers to the (at most two) blocks adjacent to it in row i of blocks and the (at most two) blocks adjacent to it in column j of blocks.
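To make the two layouts concrete, the following small Python sketch (ours, assuming B is a perfect square) maps an element's row and column to its block and in-block offset under each order:

    import math

    def row_major_block(i_elem, j_elem, B):
        """Block index and in-block offset of element A[i_elem][j_elem] in row-major order:
        block b[i][j] holds A[i][jB .. jB+B-1]."""
        return (i_elem, j_elem // B), j_elem % B

    def blocked_order_block(i_elem, j_elem, B):
        """Block index and in-block offset in two-dimensional blocked order:
        block b[i][j] is the sqrt(B) x sqrt(B) tile of elements with
        i*sqrt(B) <= row < (i+1)*sqrt(B) and j*sqrt(B) <= col < (j+1)*sqrt(B)."""
        s = math.isqrt(B)                 # sqrt(B); B is assumed to be a perfect square
        return (i_elem // s, j_elem // s), (i_elem % s) * s + (j_elem % s)

    # Example: B = 16 elements per block (4 x 4 tiles).
    print(row_major_block(5, 9, B=16))       # -> ((5, 0), 9)
    print(blocked_order_block(5, 9, B=16))   # -> ((1, 2), 5)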

We can transform an N̂ × N̂ matrix's storage format from row-major order to two-dimensional blocked order as follows: if N̂ ≤ √B, the matrix is already in two-dimensional blocked order. Assuming N̂ > √B, in a linear pass over the matrix we add dummy elements if necessary to ensure that the number of rows and columns in the matrix is a multiple of √B. With some abuse of notation, we use the same symbol N̂ to denote the resulting number of rows and columns. If r = min{√B, N̂/√B}, we can perform the desired transformation by performing a series of r-way merges. Each of the r runs is one block long and the (implicit) keys of the merged items are determined by their position in the row-major order.
During allocation phases of size xr, where ⌊x⌋ ≥ 1, ⌊x⌋r row-major order blocks can be trivially transformed into ⌊x⌋r two-dimensional blocked order blocks: the resource consumption of the transformation algorithm during any such allocation phase is O((xr)^{3/2}).

On the other hand, during allocation phases smaller than r, we can employ the memory-adaptive merging algorithm MAMerge to appropriately implement the r-way merges realizing the required transformation. If m_1, m_2, ..., m_t are the sizes of the allocation phases involved in a single such r-way memory-adaptive merge computation, we know from Theorem 17 that

    Σ_{i=1}^{t} m_i lg m_i ≤ c r lg r,

where c is a positive constant. It follows that

    Σ_{i=1}^{t} (m_i lg m_i) · (√m_i / lg m_i) ≤ (c r lg r) · max_{1≤i≤t} {√m_i / lg m_i},

which implies that the net resource consumption over the phases m_1, m_2, ..., m_t is no more than O(r^{3/2}), since max_{1≤i≤t} {√m_i / lg m_i} ≤ √r / lg r. Thus we have the following lemma.

Lemma 62 Any N̂ × N̂ matrix consisting of N = nB = N̂² elements can be transformed from row-major order to two-dimensional blocked order, and vice versa, without incurring a resource consumption of more than O(n^{3/2}).

Since we are interested in an O(n^{3/2}) bound on the resource consumption of our memory-adaptive matrix multiplication algorithm, by Lemma 62 we can assume without loss of generality that all matrices are stored in two-dimensional blocked order on disk.

5.13.2 Memory-adaptive Matrix Multiplication


The in-memory matrix multiplication AB = C, where each of A, B, and C consists of n blocks, can be executed using approximately 3n blocks of internal memory and Θ(n) I/O operations. Thus we set the parameter m_max of the dynamic memory model to be m_max = min{3n, phy_max}.
The multiplication of large matrices can be carried out by a series of multiplications of smaller matrices, as explained below. We first chop the matrices A, B, and C into submatrices, each consisting of an appropriate number m̂_max = Θ(m_max) of disk blocks. We organize the computation AB = C to proceed in (n/m̂_max)^{3/2} steps such that each step is guaranteed to incur a resource consumption of O(m̂_max^{3/2}), thus obtaining a dynamically optimal memory-adaptive algorithm.
Suppose that matrix A is partitioned into (n/m̂_max) square submatrices A_{i,j}, where 0 ≤ i, j ≤ √(n/m̂_max) − 1, such that each square submatrix consists of √m̂_max × √m̂_max = m̂_max blocks. Suppose that B and C are similarly partitioned into square submatrices each consisting of m̂_max blocks. Then we organize our computation of AB = C to proceed as follows:

1. For 0 ≤ i, j ≤ √(n/m̂_max) − 1, C_{i,j} := 0. (Set each submatrix C_{i,j} of matrix C to zero.)

2. For 0 ≤ i, j, k ≤ √(n/m̂_max) − 1, C_{i,j} := C_{i,j} + A_{i,k} B_{k,j}. (Compute A_{i,k} B_{k,j} and add the resulting m̂_max B elements to the corresponding existing m̂_max B elements of C_{i,j}.)
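A schematic Python rendering of Steps 1 and 2 (our illustration; multiply_submatrices stands in for the memory-adaptive subroutine developed below) is:

    def blocked_multiply(A, B, C, multiply_submatrices):
        """A, B, C are t x t arrays of square submatrices (each holding about m_hat_max
        disk blocks); multiply_submatrices(Aik, Bkj, Cij) performs C_ij := C_ij + A_ik*B_kj.
        Step 2 runs t^3 = (n / m_hat_max)^(3/2) times."""
        t = len(A)
        for i in range(t):                       # Step 1: C_ij := 0
            for j in range(t):
                for row in C[i][j]:
                    for x in range(len(row)):
                        row[x] = 0.0
        for i in range(t):                       # Step 2: submatrix products
            for j in range(t):
                for k in range(t):
                    multiply_submatrices(A[i][k], B[k][j], C[i][j])

    def naive_submatrix_multiply(Aik, Bkj, Cij):
        """In-memory stand-in for the memory-adaptive subproduct C_ij += A_ik * B_kj."""
        s = len(Aik)
        for x in range(s):
            for y in range(s):
                Cij[x][y] += sum(Aik[x][z] * Bkj[z][y] for z in range(s))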

The following lemma can be proved easily.

Lemma 63 The computation indicated above in Steps 1 and 2 correctly outputs the product matrix C = AB.

The computation indicated in Step 1 above can easily be carried out implicitly. In the remainder of this section we show how to carry out the computation C_{i,j} := C_{i,j} + A_{i,k} B_{k,j} indicated in Step 2 above in a memory-adaptive manner incurring only O(m̂_max^{3/2}) resource consumption, thus yielding a dynamically optimal matrix multiplication algorithm. In order to mimic the standard, I/O-optimal matrix multiplication algorithm for static m-block internal memory, we need to carry out a matrix multiplication operation involving Θ(m) blocks during an allocation phase of size m: as a result of the nature of the resource consumption, we need to "pebble" Θ((mB)^{3/2}) nodes in a "typical" allocation phase of size m in order to achieve optimality. This suggests that the matrix multiplication computation should be organized in such a manner that whenever the allocation phase size m′ is in the range [m, cm], for a constant c > 1, we should carry out a multiplication of two square submatrices each containing Θ(c^{⌊log_c m⌋}) disk blocks: this ensures that the number of DAG nodes pebbled is Θ((c^{⌊log_c m⌋} B)^{3/2}) = Θ((m′ B)^{3/2}).

Definition 56 In our memory-adaptive matrix multiplication algorithm, an allocation phase of size m is said to be at allocation level level(m), where level(m) is defined (for our matrix multiplication algorithm) to be ⌈log_4(m/c_level)⌉, where c_level is an appropriately chosen positive constant. Thus each allocation phase is at some level ℓ where 1 ≤ ℓ ≤ ℓ_max and ℓ_max = level(m_max). We define m̂_max to be the number 4^{ℓ_max − 1}.

5.13.3 Mop Records and Level-Records

We focus now on the problem of implementing the computation of Step 2 in a memory-adaptive manner. In our scheme we have to deal with square submatrices consisting of 2^ℓ × 2^ℓ blocks, where 1 ≤ ℓ ≤ ℓ_max. Recall that we assume that matrices are stored in two-dimensional blocked order on disk. Thus, given a pointer to any one block of a submatrix, we can easily access all other blocks of the submatrix in order to load the blocks of the submatrix into memory. Without loss of generality, we choose the pointer p(Â) to a specific block of a matrix Â to act as the handle for matrix Â.

Definition 57 Given a square matrix Â stored in two-dimensional blocked order on disk, the lt-ptr p(Â) of matrix Â is the pointer to that block of Â that is the intersection of Â's first row of blocks with its first column of blocks.

We are now in a position to describe mop (matrix operation) records, each of which corresponds to a multiplication of submatrices consisting of 2^ℓ × 2^ℓ blocks, for some ℓ such that 1 ≤ ℓ ≤ ℓ_max.

Definition 58 Consider the mop record mr corresponding to the matrix multiplication Ĉ := Ĉ + Â B̂, where each one of Â, B̂, and Ĉ consists of 2^ℓ × 2^ℓ blocks. We denote by Â_{i,j}, where 0 ≤ i, j ≤ 1, the four 2^{ℓ-1} × 2^{ℓ-1}-block non-overlapping square submatrices that Â can be decomposed into. Similarly, we denote by B̂_{i,j} and Ĉ_{i,j}, where 0 ≤ i, j ≤ 1, the square submatrices resulting from a similar decomposition of B̂ and Ĉ respectively. The mop record mr then consists of the following fields:

1. ltptrs: This field is assigned the triple p(Â), p(B̂), and p(Ĉ) of lt-ptrs of matrices Â, B̂, and Ĉ respectively.

2. lsize: This field is assigned the number ℓ.

3. split: This field is assigned the twelve pointers p(Â_{i,j}), p(B̂_{i,j}), p(Ĉ_{i,j}), where 0 ≤ i, j ≤ 1, which are respectively the lt-ptrs of the twelve submatrices Â_{i,j}, B̂_{i,j}, and Ĉ_{i,j} defined above.

The twelve pointers assigned to the field split of a mop record are used to further split the matrix multiplication operation if needed, as follows: if matrix Â (respectively B̂ and Ĉ) is broken down into four square submatrices Â_{i,j} (respectively B̂_{i,j} and Ĉ_{i,j}), where 0 ≤ i, j ≤ 1, then the product Ĉ := Ĉ + Â B̂ can be computed by computing the eight products Ĉ_{i,j} := Ĉ_{i,j} + Â_{i,k} B̂_{k,j}, where 0 ≤ i, j, k ≤ 1.

Definition 59 Consider the mop record mr corresponding to the operation Ĉ := Ĉ + Â B̂, with Â_{i,j}, B̂_{i,j}, and Ĉ_{i,j} for 0 ≤ i, j ≤ 1 as defined above. Suppose q is an integer such that 0 ≤ q ≤ 7. Then by "the qth 0-1 triple" we refer to the triple (i′, j′, k′) that is the qth triple in a lexicographic ordering of the eight triples {(i, j, k) : 0 ≤ i, j, k ≤ 1}. And by "the qth subproduct of mop record mr" we refer to the product Ĉ_{i′,j′} := Ĉ_{i′,j′} + Â_{i′,k′} B̂_{k′,j′}, where (i′, j′, k′) is the qth 0-1 triple.
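A hypothetical sketch of these definitions in Python, with pointers modelled as opaque integers, might look as follows; the field names mirror Definitions 58 and 59 but the representation is our own:

    from dataclasses import dataclass, field
    from typing import Dict, Tuple

    @dataclass
    class MopRecord:
        """A mop (matrix operation) record in the spirit of Definition 58; pointers are
        modelled as opaque block identifiers (plain ints)."""
        ltptrs: Tuple[int, int, int]                       # lt-ptrs p(A), p(B), p(C)
        lsize: int                                         # level l: 2^l x 2^l-block submatrices
        split: Dict[Tuple[str, int, int], int] = field(default_factory=dict)

    def qth_triple(q):
        """The qth 0-1 triple (i, j, k) in lexicographic order, 0 <= q <= 7 (Definition 59);
        the qth subproduct is then C_{i,j} := C_{i,j} + A_{i,k} B_{k,j}."""
        return (q >> 2) & 1, (q >> 1) & 1, q & 1

    print([qth_triple(q) for q in range(8)])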

We are now in a position to describe level-records for matrix multiplication, which perform the same role here that they played in our memory-adaptive sorting algorithm: that is, given an allocation phase at level ℓ, we can simply look up lr[ℓ], the level-record corresponding to level ℓ, to decide what computation to carry out during that phase.

Definition 60 Consider ℓ such that 1 ≤ ℓ ≤ ℓ_max. The level-record lr[ℓ] corresponding to allocation level ℓ is either set to nil or consists of the following fields:

1. mr: This field is assigned a mop record mr such that mr.lsize = ℓ.

2. The integer ctriple such that 0 ≤ ctriple ≤ 7.

All level-records lr[ℓ], where 1 ≤ ℓ ≤ ℓ_max, are stored as a blocked linked list.

When our algorithm is subjected to a phase at level ℓ, it looks up lr[ℓ] and then executes the computation corresponding to the lr[ℓ].ctriple-th subproduct of mop record lr[ℓ].mr, incrementing lr[ℓ].ctriple.

5.13.4 The loadlevel(), download(), and llmult() Subroutines

We now describe the loadlevel() and download() functions, which provide the same functionality they did during memory-adaptive sorting.
The algorithm maintains a variable called maxlevel such that 1 ≤ maxlevel ≤ ℓ_max is always true. Intuitively, maxlevel is such that any time our algorithm is subjected to an allocation phase at level maxlevel + 1 or higher, it completes the entire computation involved in an instance of Step 2 of Section 5.13.2: if, on any given instance, some computation pertaining to that instance has already been completed before receiving the phase at allocation level greater than maxlevel, we simply execute the remaining computation required to finish processing that instance during that phase. Even if the allocation level is never maxlevel + 1, our algorithm completes the computation of each instance of Step 2 of Section 5.13.2 efficiently. Until the processing of a given instance of Step 2 of Section 5.13.2 is completed, we have lr[maxlevel] ≠ nil. Thus the variable maxlevel is updated appropriately during the functions loadlevel() and download().

The subroutine loadlevel(ℓ)

As mentioned earlier, our goal is to execute a subproduct of lr[ℓ].mr when the allocation level is ℓ. When all such subproducts of a given mop record lr[ℓ].mr are completed, lr[ℓ] is set to nil. When this happens, we need to assign computation work to lr[ℓ] appropriately in a dynamic manner in order to use future phases at level ℓ effectively. The procedure loadlevel(ℓ), where 1 ≤ ℓ ≤ ℓ_max − 1, given below, is executed to assign work from lr[ℓ + 1] to lr[ℓ] and is executed only when lr[ℓ + 1] ≠ nil:

1. Suppose that lr[ℓ + 1].mr corresponds to the matrix operation Ĉ := Ĉ + Â B̂. Let q = lr[ℓ + 1].ctriple and let (i, j, k) be such that Ĉ_{i,j} := Ĉ_{i,j} + Â_{i,k} B̂_{k,j} is the qth subproduct of lr[ℓ + 1].mr. Suppose x is a new mop record to be appropriately initialized.

2. Set x.ltptrs to the triple p(Â_{i,k}), p(B̂_{k,j}), p(Ĉ_{i,j}) of lt-ptrs.

3. Set x.lsize to ℓ.

4. If ℓ > 1, then compute the four lt-ptrs p(X_{i′,j′}), where 0 ≤ i′, j′ ≤ 1, corresponding to the four 2^{ℓ-1} × 2^{ℓ-1}-block submatrices X_{i′,j′}, for each one of X = Â_{i,k}, X = B̂_{k,j}, and X = Ĉ_{i,j}. These pointers can be computed by appropriately traversing the boundary blocks of X, for a given value of X. Set x.split to the twelve pointers so obtained.

5. Set lr[ℓ].mr to x and lr[ℓ].ctriple to 0.

6. If lr[ℓ + 1].ctriple < 7, increment lr[ℓ + 1].ctriple. Otherwise, set lr[ℓ + 1] to nil and, if maxlevel = ℓ + 1, set maxlevel to ℓ.
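Purely as an illustration, a Python sketch of loadlevel(ℓ) under the dictionary representation assumed in the previous sketch might read as follows; compute_split_ptrs is a hypothetical stand-in for the boundary-block traversal of Step 4:

    def loadlevel(lr, level, maxlevel, compute_split_ptrs):
        """Move one subproduct of lr[level+1] down to lr[level]. lr maps a level either
        to None (nil) or to {'mr': {'ltptrs', 'lsize', 'split'}, 'ctriple'}.
        compute_split_ptrs(ltptr, lsize) stands in for the O(2^(level-1))-I/O traversal
        of boundary blocks in Step 4; it returns {(i', j'): lt-ptr}."""
        parent = lr[level + 1]
        q = parent['ctriple']
        i, j, k = (q >> 2) & 1, (q >> 1) & 1, q & 1            # the qth 0-1 triple (Step 1)
        psplit = parent['mr']['split']
        pA, pB, pC = psplit[('A', i, k)], psplit[('B', k, j)], psplit[('C', i, j)]
        x = {'ltptrs': (pA, pB, pC), 'lsize': level, 'split': {}}   # Steps 2 and 3
        if level > 1:                                          # Step 4: twelve split pointers
            for name, ptr in zip('ABC', (pA, pB, pC)):
                for (ii, jj), p in compute_split_ptrs(ptr, level).items():
                    x['split'][(name, ii, jj)] = p
        lr[level] = {'mr': x, 'ctriple': 0}                    # Step 5
        if parent['ctriple'] < 7:                              # Step 6
            parent['ctriple'] += 1
        else:
            lr[level + 1] = None
            if maxlevel == level + 1:
                maxlevel = level
        return maxlevel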

The following lemma bounds the total number of I/Os and the total number of internal memory blocks required to execute loadlevel(ℓ).

Lemma 64 The total number of internal memory blocks required during the execution of loadlevel(ℓ) is O(1). The total number of I/O operations incurred during loadlevel(ℓ) is O(2^{ℓ-1}).

Proof: It is easy to see that no more than a constant amount of internal memory is required during loadlevel(ℓ). As regards the number of I/O operations, it can be seen that, for each one of the three instances of X, obtaining the lt-ptrs of X_{i′,j′}, where 0 ≤ i′, j′ ≤ 1, takes no more than O(2^{ℓ-1}) I/O operations. No other activity during loadlevel(ℓ) incurs any I/O. Thus the lemma is proved.

The subroutine download(ℓ′)

Whenever lr[ℓ′] is nil we may need to assign some new work to lr[ℓ′] from some level-record at a higher level ℓ + 1, where ℓ′ ≤ ℓ, via a series of applications of loadlevel(ℓ″) for ℓ′ ≤ ℓ″ ≤ ℓ. We present below the steps involved in download(ℓ′), which is only executed when ℓ′ < maxlevel:

1. Set ℓ″ = ℓ′.

2. While lr[ℓ″] = nil, set ℓ″ = ℓ″ + 1.

3. Set ℓ = ℓ″ − 1.

4. For ℓ″ going from ℓ down to ℓ′, execute loadlevel(ℓ″).
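A corresponding sketch of download(ℓ′), again with hypothetical helpers and the same level-record representation as above, is:

    def download(lr, lprime, loadlevel):
        """Walk up from l' to the first non-nil level-record, then pull work down one
        level at a time via loadlevel; returns the highest level that was loaded."""
        l2 = lprime                                # Step 1
        while lr[l2] is None:                      # Step 2
            l2 += 1
        l = l2 - 1                                 # Step 3
        for lvl in range(l, lprime - 1, -1):       # Step 4
            loadlevel(lvl)
        return l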

Definition 61 The levels ℓ′ through ℓ are said to have been loaded by the download(ℓ′) call described above. Level ℓ is said to be the highest level to be loaded.

The following lemma bounds the total memory and I/O requirement of download(ℓ′). The proof follows easily from Lemma 64.

Lemma 65 Suppose ℓ is as defined above; that is, ℓ is the highest level to get loaded during download(ℓ′). Then the total number of I/Os incurred in first accessing lr[ℓ′] by following the blocked linked list of level-records and then executing download(ℓ′) is O(2^{ℓ-1}). The total number of internal memory blocks required is O(1).

The following lemma is useful while accounting for the resource consumption during download() operations.

Lemma 66 Consider a sequence of d operations such that:

1. In the ith operation, where 0 ≤ i ≤ d − 1, we traverse the blocked linked list of level-records to access lr[ℓ_i] and then execute download(ℓ_i).

2. Over the entire sequence of d operations, for no level ℓ′, where 1 ≤ ℓ′ ≤ ℓ_max − 1, is loadlevel(ℓ′) executed more than once. Moreover, let ℓ denote the highest level ℓ′ for which loadlevel(ℓ′) was executed over the sequence of d operations.

Then there exists a small positive constant c_dl such that the number of I/O operations over the entire sequence of d operations is no more than c_dl · 2^{ℓ-1}.

The subroutine llmult(ℓ)

We now describe the simple matrix multiplication routine llmult(ℓ), executed when the allocation level is ℓ and lr[ℓ] is not nil. Basically this routine simply reads in the blocks of the submatrices involved in the qth subproduct of lr[ℓ].mr, where q = lr[ℓ].ctriple, carries out the multiplication and addition, and then writes the blocks back to disk.

1. Suppose lr[ℓ].mr.ltptrs contains the lt-ptrs of submatrices Â, B̂, and Ĉ respectively, each consisting of 2^ℓ × 2^ℓ blocks. Suppose lr[ℓ].ctriple is q and that (i, j, k) is the qth 0-1 triple.

2. Use the lt-ptrs p(Â_{i,k}), p(B̂_{k,j}), and p(Ĉ_{i,j}) stored in the field lr[ℓ].mr.split to read in the blocks of the three 2^{ℓ-1} × 2^{ℓ-1}-block submatrices Â_{i,k}, B̂_{k,j}, and Ĉ_{i,j}, respectively.

3. Perform the internal memory computation Ĉ_{i,j} := Ĉ_{i,j} + Â_{i,k} B̂_{k,j}.

4. Write the disk blocks of Ĉ_{i,j} back to disk.^{15}

5. If lr[ℓ].ctriple < 7, it is incremented. Otherwise lr[ℓ] is set to nil. Level-record lr[ℓ] is written to disk.
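The following Python sketch mirrors these five steps; read_submatrix, write_submatrix, and multiply_add are hypothetical stand-ins for the block transfers and the in-memory computation:

    def llmult(lr, level, read_submatrix, write_submatrix, multiply_add):
        """Execute the current subproduct of lr[level] entirely in memory.
        read_submatrix/write_submatrix stand in for the O(4^(level-1)) block transfers of a
        2^(level-1) x 2^(level-1)-block submatrix; multiply_add(A, B, C) does C := C + A*B."""
        rec = lr[level]
        q = rec['ctriple']
        i, j, k = (q >> 2) & 1, (q >> 1) & 1, q & 1                  # Step 1: qth 0-1 triple
        split = rec['mr']['split']
        A = read_submatrix(split[('A', i, k)], level - 1)            # Step 2: read A_{i,k}
        B = read_submatrix(split[('B', k, j)], level - 1)            #         read B_{k,j}
        C = read_submatrix(split[('C', i, j)], level - 1)            #         read C_{i,j}
        multiply_add(A, B, C)                                        # Step 3
        write_submatrix(split[('C', i, j)], C)                       # Step 4: write C_{i,j}
        if rec['ctriple'] < 7:                                       # Step 5
            rec['ctriple'] += 1
        else:
            lr[level] = None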

We will now bound the total number of I/Os and the total memory requirement of llmult(ℓ).

Lemma 67 The number of I/O operations incurred in accessing lr[ℓ] using the blocked linked list of level-records is no more than ℓ/B + 1. The number of I/O operations incurred during llmult(ℓ) is no more than 4 · 4^{ℓ-1}. The total number of internal memory blocks required is no more than 4 · 4^{ℓ-1}. There exists a small constant c′_llm such that the total number of I/O operations incurred in first accessing lr[ℓ] and then executing llmult(ℓ) is bounded by c′_llm · 4^{ℓ-1}, and the total number of memory blocks required is bounded by c′_llm · 4^{ℓ-1}/2.

Proof: The proof is trivial since the number of level-records in the accessed portion of the blocked list of level-records is ℓ and the number of blocks of each of Â_{i,k}, B̂_{k,j}, and Ĉ_{i,j} is 4^{ℓ-1}; blocks of Â_{i,k} and B̂_{k,j} are only read in, whereas blocks of Ĉ_{i,j} are input and then output after computation.

We present a useful lemma that will come in handy while accounting for resource consumption.

Lemma 68 Suppose c′_llm is as defined in Lemma 67. Then there exists a small positive constant c_llm such that Σ_{ℓ′=1}^{ℓ} c′_llm · 4^{ℓ′-1} ≤ c_llm · 4^{ℓ-1}.
15 In practice, the level ℓ matrix multiplication operations can be ordered such that the next level ℓ operation is Ĉ_{i,j} := Ĉ_{i,j} + Â_{i,k+1} B̂_{k+1,j}, so that the disk blocks of Ĉ_{i,j} would not be written back to disk if the allocation level remains ℓ long enough for this next operation to immediately follow the just completed operation.
5.13.5 Algorithm MAMultiply
The procedure download() described above is memory-oblivious in the sense that it can function with some constant number c of internal memory blocks, and since we ensure that the smallest allocation phase has size c, it can execute in any allocation phase. The procedure llmult(ℓ), on the other hand, requires O(4^{ℓ-1}) internal memory blocks over a sequence of O(4^{ℓ-1}) I/O operations, so it is executed when the allocation level is ℓ. We now show how to sew these two procedures together to obtain a memory-adaptive matrix multiplication algorithm.
Consider some point of time at which the allocation level is ℓ and we could start execution of llmult(ℓ): it is appropriate to actually go through with the call to llmult(ℓ) only when either the current allocation phase, say of size m, has Ω(4^{ℓ-1}) I/O operations remaining in it or we know that the next allocation phase is also a level ℓ allocation phase. We define the following predicate enough(m) to guide this decision of the memory-adaptive algorithm.

Definition 62 During an ongoing allocation phase of size m such that level(m) = ℓ, the boolean predicate enough(m) is true if and only if left ≥ c_llm · 4^{ℓ-1} or level(next) = ℓ.

We now define the constant c_level appropriately; it is instrumental in the classification of allocation phase sizes into different allocation levels.

Definition 63 We define the constant c_level to be the smallest constant such that 2c_level ≥ c_llm + c_dl.

We now present the memory-adaptive matrix multiplication algorithm that carries out the computation Ĉ := Ĉ + Â · B̂, where each of Â, B̂, and Ĉ consists of √m̂_max × √m̂_max blocks, thus yielding a memory-adaptive implementation of Step 2. We use clevel to mean level(mem) in the following description.

1. Initialize all fields of a new mop record x corresponding to the matrix multiplication operation Ĉ := Ĉ + Â · B̂. Then set lr[ℓ_max].mr to x and lr[ℓ_max].ctriple to 0. Set maxlevel to ℓ_max.

2. While (lr[maxlevel] ≠ nil) execute the following:

(a) Walk through level-records until lr[min{clevel, maxlevel}] is in memory.

(b) If (clevel > maxlevel), complete the operation by loading appropriate blocks into internal memory, performing the required operations, and then writing them out to disk. Set lr[ℓ] to nil for all ℓ.

(c) Otherwise; that is, if (clevel ≤ maxlevel), then

(d) If (lr[clevel] = nil), then

i. Execute download(clevel).

ii. (Here, mem may have changed from its value at the beginning of Step 2(d)i.) If enough(mem) is false, then relinquish what's left of the ongoing allocation phase.

iii. GoTo Step 2a.

(e) Otherwise; that is, if (lr[clevel] ≠ nil), then

i. While (enough(mem) AND lr[clevel] ≠ nil), execute llmult(clevel).

ii. If enough(mem) is false, then relinquish what's left of the ongoing allocation phase.

iii. GoTo Step 2a.
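The control flow of Step 2 can be summarized by the following hypothetical Python sketch; state, level_of, enough, download, llmult, and finish_directly are all placeholders for the pieces described in the text, not actual interfaces from this thesis:

    def ma_multiply_loop(lr, lmax, state, level_of, enough, download, llmult, finish_directly):
        """Main loop of Step 2: state.mem is the current allocation size, state.maxlevel the
        variable maxlevel, and state.relinquish_phase() gives back the rest of a phase."""
        while lr[state.maxlevel] is not None:
            clevel = level_of(state.mem)                 # Step 2a
            if clevel > state.maxlevel:                  # Step 2b: finish in one shot
                finish_directly()
                for l in range(1, lmax + 1):
                    lr[l] = None
            elif lr[clevel] is None:                     # Step 2d: pull work down
                download(clevel)
                if not enough(state.mem):
                    state.relinquish_phase()
            else:                                        # Step 2e: execute subproducts
                while enough(state.mem) and lr[clevel] is not None:
                    llmult(clevel)
                if not enough(state.mem):
                    state.relinquish_phase()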

For analysis, it is convenient to define the call llmult(maxlevel + 1), although llmult(ℓ) is only defined for the case ℓ ≤ maxlevel.

Definition 64 We define the computation involved in the execution of Step 2b above to be the computation corresponding to llmult(maxlevel + 1). Thus, llmult(maxlevel + 1) is said to be executed when (and if) Step 2b above is executed.

5.13.6 Resource Consumption Analysis of MAMultiply


We will now prove that the algorithm MAMultiply computes the matrix multiplication operation Ĉ := Ĉ + Â B̂, where each of the three submatrices consists of √m̂_max × √m̂_max blocks, incurring a resource consumption of no more than O(m̂_max^{3/2}). We first prove that the resource consumption in the download() expense account is O(m̂_max^{3/2}) by combining bounds on the number of download() operations with bounds on the amount of resource consumption of individual download() operations. The fact that the resource consumption of all llmult() operations is O(m̂_max^{3/2}) follows from the observation that the O(4^{3ℓ/2}) resource consumption during the execution of llmult(ℓ) is charged to the Ω((4^ℓ B)^{3/2}) pebbling operations performed during llmult(ℓ).

Resource consumption of download() operations

We begin with the definition of a certain type of download() operation sequence, followed by a couple of key lemmas about such sequences.

Definition 65 Consider a sequence of d consecutive download(ℓ_i) operations, where 0 ≤ i ≤ d − 1 and d ≥ 1, that satisfy the conditions mentioned in Lemma 66, and let ℓ be as defined in Lemma 66. We call the above sequence of d download() operations an (ℓ, d)-sequence if the first time MAMultiply either relinquishes I/O operations (in Step 2(d)ii above) or executes an llmult() operation after it executes download(ℓ_0) is only immediately after it executes download(ℓ_{d-1}).

Lemma 69 Consider any (ℓ, d)-sequence download(ℓ_0), ..., download(ℓ_{d-1}). If, at any time after download(ℓ_0) begins and before download(ℓ_{d-1}) ends, MAMultiply is subjected to an allocation phase of size m_h such that level(m_h) ≥ ℓ/2 + 1, then the execution of download(ℓ_{d-1}) is immediately followed (that is, without relinquishing any I/Os) by the execution of llmult(ℓ_h), where ℓ_h = min{level(m_h), maxlevel + 1}.

Proof: From Lemma 66, we know that the total number of I/O operations required over the entire (ℓ, d)-sequence is no more than c_dl · 2^{ℓ-1}. Suppose that after download(ℓ_0) begins execution and before download(ℓ_{d-1}) completes execution, MAMultiply gets an allocation phase of size m_h, where m_h is as defined above. Then even if all the c_dl · 2^{ℓ-1} ≤ c_dl · 4^{ℓ/2} I/O operations incurred during the (ℓ, d)-sequence occurred during the phase of size m_h, that allocation phase is still left with c_llm · 4^{level(m_h)-1} pending I/O operations. Hence on completion of the (ℓ, d)-sequence, enough(m_h) evaluates to true: this follows from the definition of c_level above. Furthermore, by definition of an (ℓ, d)-sequence, the dth download() operation download(ℓ_{d-1}) of the (ℓ, d)-sequence cannot be immediately followed by another download() operation, so llmult(ℓ_h) is the next operation executed by MAMultiply. Thus the lemma is proved.

We state now a simple corollary of the above lemma.

Corollary 5 If an (ℓ, d)-sequence is followed by the execution of Step 2(d)ii, the total number of I/O operations relinquished is no more than O(4^{ℓ/2}).

Charging Scheme for download() Operations

Each (ℓ, d)-sequence incurs a certain amount of resource consumption. We use the following charging scheme to account for the resource consumption of (ℓ, d)-sequences:

1. In the event that the (ℓ, d)-sequence is followed by llmult(ℓ_h), where ℓ_h > ℓ/2, we charge the resource consumption of the (ℓ, d)-sequence to the llmult(ℓ_h) operation.

2. In the event that the (ℓ, d)-sequence is followed by the execution of Step 2(d)ii or by the execution of an llmult(ℓ′_h) operation, where ℓ′_h ≤ ℓ/2, we account for its resource consumption using Lemma 69 and Corollary 5.

We first count the maximum number of (ℓ, d)-sequences that can occur during MAMultiply.

Lemma 70 The total number of times an (ℓ, d)-sequence can occur during the entire execution of MAMultiply is 8^{ℓ_max − ℓ}, where 1 ≤ ℓ ≤ ℓ_max − 1.

Proof: This follows from the fact that each time lr[ℓ + 1] is set to a non-nil value, the maximum number of (ℓ, d)-sequences that can occur before lr[ℓ + 1] becomes nil is 8.

We will now bound the total resource consumption of all download() operations, barring those involved in (ℓ, d)-sequences whose resource consumption is charged to llmult() operations.

Theorem 21 Suppose that the resource consumption of an (ℓ, d)-sequence is the resource consumption of the I/O operations incurred during the (ℓ, d)-sequence plus the resource consumption corresponding to the I/O operations relinquished by the (possible) execution of Step 2(d)ii immediately following the (ℓ, d)-sequence. The total resource consumption of all (ℓ, d)-sequences, except any (ℓ, d)-sequence whose resource consumption we charge in Step 1 of Section 5.13.6 to an llmult(ℓ_h) operation with ℓ_h > ℓ/2, is no more than O(m̂_max^{3/2}).

Proof: By Lemma 70, the total number of (ℓ, d)-sequences that can occur is 8^{ℓ_max − ℓ}. Including the I/O operations that are possibly relinquished on account of executing Step 2(d)ii immediately after the (ℓ, d)-sequence, the total number of I/O operations charged to the (ℓ, d)-sequence is no more than O(4^{ℓ/2}), by Lemma 66 and Corollary 5. Also, by Lemma 69, the maximum allocation level during any of these O(4^{ℓ/2}) I/O operations is no more than ℓ/2, implying that the maximum resource consumption of each (ℓ, d)-sequence relevant to this theorem is no more than O((4^{ℓ/2})^{3/2}). Hence the total amount of resource consumption that can be charged to all (ℓ, d)-sequences is

    Σ_{ℓ=1}^{ℓ_max−1} 8^{ℓ_max−ℓ} · O((2^ℓ)^{3/2}) = O( 8^{ℓ_max} Σ_{ℓ=1}^{ℓ_max−1} 2^{3ℓ/2} / 2^{3ℓ} )
                                                = O( 8^{ℓ_max} Σ_{ℓ=1}^{ℓ_max−1} 1 / 2^{3ℓ/2} )
                                                = O( 8^{ℓ_max} ),

which is O(m̂_max^{3/2}) since 8^{ℓ_max} = (4^{3/2})^{ℓ_max} = (4^{ℓ_max})^{3/2} = O(m̂_max^{3/2}).

Resource Consumption of llmult() Operations

We argue here that the total amount of resource consumption that can be charged to an llmult(ℓ) operation is no more than O((4^ℓ)^{3/2}), while the number of pebbling operations accomplished is Ω((4^ℓ B)^{3/2}).

We first have the following lemma implying that our reorganization of the operation Ĉ′ := Ĉ′ + Â′ B̂′, where each of Â′, B̂′, and Ĉ′ is a submatrix consisting of 2^ℓ × 2^ℓ blocks, into the 8 subproduct operations Ĉ′_{i,j} := Ĉ′_{i,j} + Â′_{i,k} B̂′_{k,j} corresponding to the 8 0-1 triples is correct.

Lemma 71 The matrix operation Ĉ′ := Ĉ′ + Â′ B̂′ is correctly implemented by the 8 operations Ĉ′_{i,j} := Ĉ′_{i,j} + Â′_{i,k} B̂′_{k,j} corresponding to the 8 (i, j, k) 0-1 triples.

It can be inductively proved that the number of pebbling operations performed by MAMultiply using the above approach is no more than (m̂_max B)^{3/2}. We consider now the maximum resource consumption that can be charged to a single llmult(ℓ) operation.

Lemma 72 The maximum resource consumption that can be charged to a single llmult(ℓ) operation, where 1 ≤ ℓ ≤ maxlevel, is O((4^ℓ)^{3/2}).

Proof: By Lemmas 67 and 68, the total number of I/O operations incurred by an llmult(ℓ) operation is O(4^ℓ) for any ℓ, including ℓ = maxlevel + 1. The total number of I/O operations possibly relinquished on account of executing Step 2(e)ii immediately after the llmult(ℓ) operation is O(4^ℓ). If 1 ≤ ℓ ≤ maxlevel, the allocation level throughout the above I/O operations is ℓ, so that the resource consumption is O((4^ℓ)^{3/2}). On the other hand, the total number of I/O operations incurred by the (ℓ′, d)-sequence, where ℓ′ ≤ 2ℓ − 2, whose resource consumption we possibly charge in Step 1 of Section 5.13.6 to llmult(ℓ) is no more than O(4^ℓ), by Lemma 66. The maximum allocation level during these O(4^ℓ) I/O operations is ℓ, so their resource consumption is O((4^ℓ)^{3/2}). This proves the lemma.

We now bound the total resource consumption charged to all the llmult() operations incurred during MAMultiply by O(m̂_max^{3/2}).

Theorem 22 The total resource consumption charged to all the llmult() operations incurred during MAMultiply is O(m̂_max^{3/2}).

Proof: First we consider all llmult(ℓ) operations with ℓ < maxlevel + 1. By Lemma 72, we know that each llmult(ℓ) operation can be charged a resource consumption of no more than O((4^ℓ)^{3/2}). By definition, each llmult(ℓ) operation performs Ω((4^ℓ B)^{3/2}) pebbling operations. Thus at any time during MAMultiply, B^{3/2} times the total resource consumption charged to llmult() operations up to that point of time is always of the order of the total number of pebbling operations performed by MAMultiply up to that point. Since MAMultiply performs no more than (m̂_max B)^{3/2} pebbling operations, the theorem holds for all llmult(ℓ) operations with ℓ < maxlevel + 1. There can be at most one llmult(ℓ) operation with ℓ = maxlevel + 1, and such an operation incurs O(m̂_max) I/O operations, so its resource consumption is also O(m̂_max^{3/2}). This proves the theorem.

5.13.7 Proving Optimality


Theorems 21 and 22 together imply the following theorem bounding the total resource consumption of MAMultiply.

Theorem 23 The total resource consumption of MAMultiply is no more than O(m̂_max^{3/2}).

Since the total number of MAMultiply operations involved is the same as the number of times Step 2 of Section 5.13.2 is executed, which is (n/m̂_max)^{3/2}, Theorem 23, Corollary 2, and the definition of dynamic optimality imply the following theorem.

Theorem 24 The total amount of resource consumption of our memory-adaptive matrix multiplication algorithm when used to multiply two n-block matrices is O(n^{3/2}). Our memory-adaptive matrix multiplication algorithm is dynamically optimal.

5.14 Conclusions and Future Work


In this chapter we have presented a simple and reasonable dynamic memory allocation model that enables database and operating systems to dynamically change the amount of memory that external memory algorithms are allocated. We have defined what it means for memory-adaptive external memory algorithms designed to work in this model to be dynamically optimal. We have presented dynamically optimal memory-adaptive algorithms for fundamental problems such as sorting, permuting, other problems related to sorting, and matrix multiplication. We have also presented a dynamically optimal (in an amortized sense) version of the buffer tree, which has a large number of batched dynamic applications as well as applications such as "bulk loading" of external memory data structures. We have shown that a previously devised approach to memory-adaptive external mergesort is provably nonoptimal because of fundamental drawbacks. The lower-bound proof techniques for sorting and matrix multiplication are the two fundamentally distinct proof techniques invoked by most other external memory lower bounds, and hence we anticipate that the techniques presented here will apply to many external memory problems.
In the case of mergesorting and matrix multiplication, we have shown how to sew together a conventional external memory algorithm with an appropriate "memory-adaptivity" data structure that balances work across "levels of allocation" to obtain a memory-adaptive external memory algorithm. Our proof techniques deal with the interesting constraints faced while proving optimality of resource consumption in our dynamic memory model.
We believe that our techniques for memory-adaptive merging apply to memory-adaptive distribution and thus to a dynamically optimal distribution sort. Since the BatchMerge_K() operation used in [BK98] can be performed using a modification of our memory-adaptive merging technique, we conjecture that we can design a dynamically optimal memory-adaptive version of the worst-case optimal external priority queue of [BK98].
We have mentioned in Section 5.10 some interesting open questions regarding dynamically optimal memory-adaptive merging. Another interesting open question pertains to the memory-adaptive buffer tree: in contrast to the buffer tree presented in Section 5.12, which has relatively huge buffers of fixed size m′_max blocks for each node, is it possible to design a "finer granularity" buffer tree with "memory-adaptive buffer sizes"? It would be fruitful to extend our approach to other domains and applications. An interesting question is whether or not we can devise a general technique that takes any external memory algorithm that is optimal for static memory and converts it into a dynamically optimal memory-adaptive algorithm.

Chapter 6
Modeling and Optimizing the I/O
Performance of Multiple Disks on a SCSI
Bus
Summary
For a wide variety of computational tasks, disk I/O continues to be a serious obstacle to high performance. To meet demanding I/O requirements, systems are designed to use multiple disk drives that share one or more I/O ports to form a disk farm or RAID array. The focus of the present chapter is on systems that use multiple disks per SCSI bus. We measured the performance of concurrent random I/Os for three types of SCSI disk drives and three types of computers. The measurements enable us to study bus-related phenomena that impair performance. We describe these phenomena, and present a new I/O performance model that incorporates bus effects to predict the average throughput achieved by concurrent random I/Os that share a SCSI bus. This model, although relatively simple, predicts performance on these platforms to within 11% for fixed I/O sizes in the range 16-128 KB. We then describe a technique to improve the I/O throughput. This technique increases the percentage of disk head positioning time that is overlapped with data transfers, and increases the percentage of transfers that occur at bus bandwidth, rather than at disk-head bandwidth. Our technique is most effective for large I/Os and high concurrency, an important performance region for large-scale computing; for random workloads our improvements are 10-20% over the naive method.

6.1 Introduction
As a consequence of the disparity in internal memory and magnetic disk drive access performances, computer systems that perform I/O-intensive processing are often designed to use many disks in parallel, usually organized as a disk farm or a RAID array. The physical organization generally consists of one or more I/O buses (e.g., SCSI, FC, or SSA) with several disks on each bus. Very often, it is convenient to build parallel I/O systems comprising only "off-the-shelf" hardware components (without any specialized hardware controllers, etc.) by attaching several disks to a SCSI bus and then performing parallel I/O using software striping, or in the application itself. Since modeling such parallel I/O systems is inherently complex, understanding, in a quantitative manner, the performance of a parallel I/O system is a significant problem that must be overcome by system designers, application writers, and developers of RAID controllers.
Previous work related to disk I/O performance has focused on the disk drive as the bottleneck, downplaying the importance of bus contention and other bus effects. Indeed, bus effects play an insignificant role in I/O performance for UNIX workloads with small I/O request sizes. But many I/O-intensive applications benefit significantly from large requests. Among these are multimedia servers and certain database and scientific computing applications based upon EM algorithms to process massive data sets (e.g., [VS94, MNO+98, CH96]). In such applications, parallel I/O performance is often limited by bus effects, rather than by the performance of individual disk drives.
In this chapter, our goal is to understand the I/O performance of a system consisting of a SCSI bus with several disk drives attached to that bus. The workload we primarily focus on is intended to model the I/O workload of applications based upon parallel disk algorithms and other similar applications [VS94, MNO+98, CH96]. We develop a general analytic model to predict the I/O performance of a parallel disk system for such workloads. Such an understanding is a step towards the elusive goal of being able to accurately and generally predict the running time of an EM application. Based upon our understanding, we also design and implement a clever technique that can exploit future access information and features of modern disk drive controllers to significantly improve the I/O performance of a parallel disk system in certain situations. A patent application based upon the modeling and performance improvement aspects of our work has been filed.
In a parallel I/O system with multiple disks on a bus, many factors influence the throughput. The factors that are usually considered include the number of disks, the size of each I/O transfer, and the positioning time (seek and rotational latency). We also account for the difference between the transfer rate from a disk platter through the disk head to the disk's internal cache (that is, the disk bandwidth), and the (significantly higher) burst rate from a disk's cache to the host over the SCSI bus (that is, the bus bandwidth). We model the effect of a disk drive's read lookahead mechanism, which sometimes enables the disk to prefetch data into its internal cache in anticipation of a future sequential read request. We also consider the effect of a relatively obscure disk control parameter called the fence.^1 The fence setting can significantly influence I/O performance. Based upon experiments and observations, we found an interesting convoy-like behavior not reported in the literature, which we call rounds, that plays a key role in determining I/O performance. We present a description, analysis, and explanation of rounds in Section 6.6.
Our performance model differs from previous work on parallel I/O performance because our model is based upon a quantitative study of I/O bandwidth involving bus effects, disk drive mechanisms, and important parameters not considered by previous approaches. For a workload consisting of random, parallel accesses to multiple disks on a bus, with zero "think time" between requests, we quantitatively model the sustained I/O throughput. We validate our model on four parallel I/O systems (Sun Ultra-1 with four Seagate Cheetah disks, Sun Sparc-20 with four Seagate Cheetah disks, Sun Sparc-20 with seven Seagate Wren-7 disks, and DEC Alpha with four Seagate Barracuda disks).
Our modeling and analysis revealed that the I/O performance of the parallel disk system can very often be improved
1 The fence is called the buffer full ratio on the SCSI-2 disconnect/reconnect control mode page.

using simple pipelining techniques that exploit disk drive mechanisms. Specifically, we employ a prefetching scheme, similar to that in [CT96], that causes the disks to prefetch data into their internal caches, even for workloads that have random requests. This achieves greater overlap of bus transfers with disk seeks, and increases the percentage of transfers that occur at the full bus bandwidth rather than at the significantly smaller disk bandwidth. Whereas the simulations in [CT96] employ the SCSI PREFETCH command, which is not widely supported, we show how to obtain the effect of this command on drives that do not support a SCSI PREFETCH command, and we describe the measured performance impact of doing so on real systems. Experiments show that bandwidth improvements of 10-20% can be obtained when using our technique to perform large reads from high-performance disk drives. (Our techniques are not beneficial in the case of small I/Os or a lightly-loaded SCSI bus.)
In Section 6.2, we mention some related work. Section 6.3 describes the workload that we focus on during this study. In Section 6.4, we describe the hardware configurations of the systems we worked on. Section 6.5 describes specific components of the service time at a disk drive that are known to influence I/O performance. In Section 6.6, we discuss the interesting phenomenon of rounds. In Section 6.7, we present the analytical models for I/O performance pertaining to our workloads, and in Section 6.8, we validate our models on the different configurations. In Section 6.9, we present our pipelining technique to improve I/O performance and also present results of experiments demonstrating the performance improvement. Finally, in Section 6.10, we conclude and discuss future work.

6.2 Related Work


A number of analytic models exist for the I/O subsystem. The analytic disk model of [Shr97] captures bus effects only in the single-disk case and does not directly model the fence. The Pantheon disk simulator [RW94] incorporates bus contention and other bus effects, but no results have been published that describe the idle periods and disk-bandwidth-limited bus transfers that we observe. [HP96] presents a method for approximating the throughput of multiple disks on a SCSI bus by summing the seek time, rotational latency, and transfer time, and derating the performance by a bus contention factor derived from a general queuing model. None of this previous literature describes the rounds phenomenon that we observe. Ignoring this phenomenon when modeling can result in throughput prediction errors greater than 100% for the workloads we consider. [WGPW95] proposes many experiments to measure the performance parameters of an individual disk drive. We use similar approaches to determine the values that parameterize our model.
There have been several performance analyses of RAID [PGK88] based systems, such as [Che89, CLG+94], but these analyses do not adopt the quantitative approach of devising a model to predict I/O performance based solely upon parameters of the disk drives and the bus. With respect to improving I/O performance, in contrast to work on RAIDs, our work describes how to model and improve application-level performance by coordinating concurrent accesses to multiple disks on a SCSI bus. Our technique may also be applicable to the internal algorithms used in RAID controllers.
In Section 6.9, we describe a technique to increase I/O throughput by causing disks to prefetch data from the disk surface to their internal caches (thus overlapping one disk's positioning time with other disks' bus transfer time). This technique differs from previous techniques that prefetch from disk to main memory during sequential access workloads (thus overlapping I/O with application think time), by using historical access patterns (see [TPG97] for references) or by using application- or compiler-provided predictions of future accesses [CFL94b, KTP+96, MDK96]. [TPG97] mentions using application-provided knowledge to schedule future disk I/O to reduce access latency, but this idea is not developed further. [CFL94b] presents results on prefetching into main memory by scheduling multiple prefetch commands for two disks on a SCSI bus. The above prefetching techniques attempt to hide disk latency by overlapping I/O with computation, whereas our technique tries to overlap disk latency with bus transfer from other disks; thus our technique operates at a finer level of granularity and may even be used in a manner complementary to the prefetching techniques above. [CT96] proposes and studies a scheduling algorithm for a video server architecture consisting of multiple disks on each bus. A SCSI PREFETCH command is used to prefetch into each disk's internal cache, prior to a read operation that transfers the data on the bus. A token-based round-robin scheduling of these reads is proposed to overcome the problem of unfair arbitration in the SCSI protocol. [CT96] reports only simulation results, with no validation of whether or not the simulator adequately captures the disk and bus effects of real systems. Moreover, many current disk drives (including the Seagate Barracuda and Cheetah) do not implement the SCSI PREFETCH command.
6.3 Workload
Our workload consists of fixed-size2 read requests directed to a collection of independent disks that share a SCSI bus. The requests are generated by multiple processes of equal priority running concurrently on a uniprocessor, one process per disk. Each process executes a tight loop that generates a random block address on its disk, takes a timestamp, issues a seek and a read system call to the raw disk (bypassing the file system), and takes another timestamp when the read request completes. Each experiment consists of three phases: a startup period during which requests are issued but not timed, a measurement period during which the timings are accumulated in tables in main memory, and a cool down period during which requests continue to be issued. The purpose of the startup and cool down periods is to ensure that the I/O system is under full load during the measurements. We observed that the I/O systems provided fairness in all our experiments: each disk completed approximately the same number of I/Os.
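To make the workload concrete, the following is a minimal sketch, in C, of the per-disk process loop just described. The raw-device path, block-size constant, and disk-size constant are illustrative assumptions, not the exact values or code used in our experiments.

    /* Sketch of one workload process: random fixed-size reads to a raw disk,
       timestamped before and after each read (device path and sizes assumed). */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/time.h>
    #include <sys/types.h>
    #include <unistd.h>

    #define BLOCK_SIZE (64 * 1024)      /* fixed request size (example) */
    #define NUM_BLOCKS 2000000L         /* usable blocks on the disk (example) */

    int main(void)
    {
        int fd = open("/dev/rdsk/c0t1d0s2", O_RDONLY);   /* raw device, bypasses the file system */
        char *buf = malloc(BLOCK_SIZE);
        struct timeval t0, t1;

        if (fd < 0 || buf == NULL) { perror("setup"); return 1; }
        for (;;) {
            off_t block = (off_t)(drand48() * NUM_BLOCKS);   /* random block address */
            gettimeofday(&t0, NULL);                          /* timestamp before */
            lseek(fd, block * BLOCK_SIZE, SEEK_SET);          /* seek system call */
            read(fd, buf, BLOCK_SIZE);                        /* read system call */
            gettimeofday(&t1, NULL);                          /* timestamp after */
            /* during the measurement phase, (t0, t1) would be appended to an
               in-memory table; startup and cool down iterations are not recorded */
        }
        return 0;
    }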
This workload captures the access patterns of external-memory algorithms designed for the Parallel Disk Model
[VS94]. Examples of such algorithms are parallel disk sorting algorithms [BGV97, CH96] and parallel disk matrix
multiplication [VS94]. In Parallel Disk Model algorithms, reads and writes are concurrent requests to a set of disks,
2 In a recent extension to this work, we have also incorporated requests for variable-sized blocks; essentially the same model works, with the difference that we have to use the average block size instead of any fixed size.
issued in lock-step, one request per disk. Our workload also models applications that use balanced collective I/O,
i.e., where all processes make a single joint I/O request rather than numerous independent requests. This workload
also can be used to model a video-on-demand server when the data is striped across multiple disks.
6.4 Hardware Configuration
We took measurements on four hardware configurations. The first consists of four Seagate Cheetah ST-34501W disks connected to a Sun Ultra-1 computer running Solaris 2.5.1. Although this Cheetah has an ultra-wide SCSI interface capable of a 40 MB/s maximum bus rate, the controller card in the computer is a fast-wide card that cannot exceed 20 MB/s. The second configuration is four Seagate Cheetah (ST-34501W) disks connected to a Sun Sparc-20 running Solaris 2.5, again with the 20 MB/s controller. The third configuration is four Seagate Barracuda (ST-32171W) disks connected to a DEC AlphaStation 600 5/266 running Digital UNIX 4.0. The fourth configuration uses seven Imprimis (Seagate) Wren-7 (94601-15) disks (5 MB/s SCSI-1) on a Sun Sparc-20 running Solaris 2.5. Table 6.1 presents additional parameters of the disks we used [Cor89, Tec97b, Tec97a].
Each disk has a unique SCSI id which determines the priority of the disk when multiple devices are contending
for use of the idle SCSI bus. The SCSI controller at the host has the highest priority, so it will win any contention
in which it participates.
When the Wren is transferring data from the disk surface to the host (through the cache), it disconnects from
the bus when it reaches a track boundary. When the Wren is performing readahead into the disk cache, readahead
stops when a cylinder boundary is reached. In the more modern Barracudas and Cheetahs, readahead can continue
across a cylinder boundary, although these disks must detect a sequential access pattern before they will commence
readahead.
6.4.1 The Fence Parameter
As mentioned earlier, the time at which the disk contends for reconnection to the bus depends upon the fence parameter. When a SCSI disk is instructed to perform a read, and the disk recognizes that there will be a significant delay, such as for a seek, the disk releases control of the SCSI bus (it disconnects). When the disk is ready to transfer the data to the host, the disk contends for control of the SCSI bus (reconnect) so that the read can be completed.3 The fence determines the time at which the disk will begin to contend for the SCSI bus. If the fence is set to the minimum value, the disk will contend after the first sector of data has been transferred from the disk surface to the disk's internal cache. By contrast, if the fence is set to the maximum value, the disk will wait4 until almost all of the
3 As discussed in Section 6.4 and Section 6.7.3, disks reaching track or cylinder boundaries on the disk surface may also disconnect, and later reconnect, during data transfer.
4 Actually, the behavior of the disk is a little more complicated than stated here; a precise specification is generally available from the product manual.
Table 6.1: Disk parameters
parameter                                    Cheetah     Barracuda   Wren
maximum disk queue length                    64          64          1
average seek latency                         7.7 ms      9.4 ms      15 ms
rotational speed (revolutions per minute)    10033       7200        3597
average sectors per track                    170         163         70
sector size                                  512 bytes   512 bytes   512 bytes
number of data surfaces                      8           5           15
disk buffer size                             512 KB      512 KB      240 KB
number of buffer segments                    1-15        1-15        1
requested data has accumulated in the disk cache before contending for the bus. The performance implication is as follows. A low fence setting tends to reduce the response time, because the disk attempts to send data to the host as soon as the first sector is available. But when the cached data has been sent to the host (at the full bus bandwidth), the disk continues to hold the bus. The remainder of the transfer occurs at the disk head bandwidth, that is, the rate at which bits pass under the disk head. The disk head bandwidth is usually less than 25% of the bus bandwidth, and for some disks, far less. A high fence setting causes the disk to delay the start of data transfer to the host, but when the transfer does occur, it proceeds at "bus bandwidth", from the semiconductor cache on the disk drive into the host controller and memory system. In systems with multiple disks on a bus, a high fence setting potentially increases overall throughput for I/O-intensive workloads.
If the fence value5 is 0, the disk contends for the bus as soon as one sector has been read into the cache. If the
fence value is 255/256, the disk waits until 255/256 of the requested number of sectors are in the cache, or until the
cache becomes full.
6.5 Components Of Service Time
The significant components of the time to complete a read operation are as follows.

Host queue time: the time during which a request remains queued up in the device driver or the host SCSI controller, which depends upon the particular strategy employed in those components of the system.

SCSI overhead: the time for the SCSI protocol to send a request from the device driver to the disk, denoted as Overhead in the equations of Sections 6.7.2 and 6.7.3.

Device queue time: the time that a request waits in the disk controller while the controller is serving some previous request. This time is zero for a drive that can only handle one request at a time.

Seek time: the time required by the disk head to move to the track containing a requested block address. Seek time has a non-linear dependency upon the number of tracks to be traversed.
5 In this chapter, we report only for the fence set to zero or the fence set to 255/256 case; in a recent extension, we have incorporated other fence values in our model.
Rotational latency time: after a seek completes, the time during which the disk rotates to bring the requested block to the disk head.

Rotational transfer time: after the rotational latency completes, the time required for the head to transfer data from the disk surface to the disk buffer. This time is largely governed by the speed of rotation and the number of bytes per track. It is proportional to the number of bytes transferred, and includes any additional time required for track switches and cylinder switches when an I/O extends across multiple tracks or cylinders.

Bus busy time: the time period during which a requested block sits in the disk's buffer, waiting for the SCSI bus to become available for a transfer to the host.

Bus transfer time: the time required to transmit the block over the SCSI bus, at the sustained bus bandwidth, from the disk to the host. It is proportional to the size of the block to be transferred.

The service time for a disk request is not the simple sum of these components. For instance, if the fence is 0, some of the rotational transfer time may be overlapped with the bus transfer time. If many disks share a bus, the overlapped I/O transfers may cause the bus busy time to dominate, leading to service times much larger than the bus transfer time. If the I/O requests are small, then the SCSI overhead may dominate, in which case the effective data rate on the bus cannot approach the bus bandwidth, even if many disks share the bus.

Our experiments have at most one outstanding request per disk, so both the host queue time and the device queue time are zero.
6.6 Rounds
In our experiments, we observed fairness in the servicing of sufficiently large I/O requests, despite the fact that the SCSI disks have different priorities when contending for the bus. Although each process attempts to progress through its requests as fast as possible, without coordinating with other processes, we typically observe a convoy behavior among the requests by all the processes. Namely, all disks receive a request, then all disks transmit data back to the host before any disk receives another request. We use the term rounds for this periodic convoy behavior.
We were surprised to see rounds. Since the host has the highest SCSI priority, one would expect that soon after a disk completes one request, the host would seize the bus to send another request to that disk, thereby keeping the bus and all the disks busy.
We observed rounds on a variety6 of hardware architectures (Sun, DEC, Intel-PC) and SCSI controllers, running several different versions of UNIX such as Solaris, Ultrix, and Net-BSD. In a recent development, by instrumenting
6 Recently our experiment was replicated [Nie98] on a 26-CPU Sun E6000 with Seagate Cheetah (ST-19171FC) Fibre Channel disks. Using one CPU to send requests to four disks on one Fibre Channel loop, it was observed that there was round-like behavior for request sizes 16-200 KB.
the NetBSD kernel, we determined that the operating system does not queue requests before sending them to the host bus adapter. When an Ancot SCSI bus analyzer was used to determine whether the host bus adapter arbitrates for the bus when it has a disk request, it was found that it does not: if any disk wants the bus, the host bus adapter will not arbitrate for the bus, even if it is waiting to send a request to an idle disk. The host bus adapter only arbitrates for the bus if no disk is arbitrating.
Since using the SCSI bus analyzer is time-intensive and since we did not originally have easy access to a SCSI analyzer, we first determined when rounds happen using application-level timestamps. To determine whether D disks are being served in rounds under some workload, we examine the ordered I/O completion timestamps from a 10 second run (after a startup interval) using a sliding window of size D. A detailed description of our timestamp analysis algorithms that reveal the round-like behavior of the I/O can be found in [BGH+98]; here we sketch our technique. A violation of round ordering is said to occur on the jth timestamp in the window (where 0 ≤ j ≤ D − 1) if there is an i < j such that the ith and jth I/O of the window both originate from the same disk. If the current sliding window contains a violation at the jth position, the window is advanced by j positions; otherwise it is advanced by D positions. The fraction of I/O operations that do not violate round ordering is our measure of the extent of round formation for that experiment. In our experiments, rounds occurred 88-99% of the time for uniform random workloads containing a mixture of 1, 2, 3, or 4 different request sizes and for workloads that have spatial locality. The workloads that we experimented with have request sizes of B, ..., iB, for i the number of request sizes in the workload and for B = 8, 16, 32, 64, or 128 KB.
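The following is a small sketch, in C, of the sliding-window check just described. The record layout and array-based input are assumptions made for illustration; only the window-advance logic follows the description in the text.

    /* Sketch of the round-ordering check: given I/O completion records sorted
       by timestamp, estimate the fraction of I/Os that do not violate round
       ordering. The record layout and input format are illustrative. */
    struct io_record { double completion_time; int disk; };

    double round_fraction(const struct io_record *rec, int n, int D)
    {
        int pos = 0, violations = 0;

        while (pos + D <= n) {
            int advance = D;                       /* assume a clean round of D I/Os */
            for (int j = 1; j < D && advance == D; j++)
                for (int i = 0; i < j; i++)
                    if (rec[pos + i].disk == rec[pos + j].disk) {
                        violations++;              /* jth I/O violates round ordering */
                        advance = j;               /* advance the window by j positions */
                        break;
                    }
            pos += advance;
        }
        /* fraction of I/Os that did not violate round ordering */
        return n > 0 ? 1.0 - (double)violations / n : 1.0;
    }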
The Wren disks exhibit rounds for requests of size 8 KB or larger; more details can be found in [BGH+98]. The Barracudas and Cheetahs exhibit rounds with requests of size 16 KB or larger. If the request size is very small, we do not observe rounds. In this case, the bus is not a bottleneck, but the random seeks and rotational latencies cause reasonably fair workloads on the disks.
The current literature does not discuss rounds as we observed them. In fact, [STV96] states that even in cases when
the load is symmetrically distributed and balanced, one process can monopolize the disks while all other processes
come to a halt.
6.7 Analytical Model
In this section, we develop a disk model for one or more disks that share a SCSI bus. For the workload described above, our model predicts the average time to complete a read request, as a function of the number of disks sharing the bus, the request size in bytes, the setting of the fence parameter (0 or 255/256), and a collection of performance parameters for individual disks and the SCSI bus. We note that these two fence settings emphasize response time and bandwidth, respectively.
In Section 6.7.1, we first define a read duration, the quantity whose average value our model actually predicts. The average read duration is inherently related to the parallel I/O throughput being delivered by the I/O system. In
Section 6.7.2, we present a model to predict average read duration on a single-disk system, and in Section 6.7.3, we present a model to predict average read duration on a system with multiple disks sharing the bus.
6.7.1 Read duration
The workload experiments described in Section 6.3 collect timestamps immediately before a read call is made and immediately after the read call returns.7 We use the term read duration for the time period between such a pair of timestamps.
The average read duration is closely related to the throughput at which data blocks are being retrieved by the parallel I/O system. For workloads that have zero think-time between requests for blocks of size B bytes, retrieved in parallel from D disks, the average throughput in bytes per second is DB/ReadDuration, where ReadDuration is the average read duration, computed over all the read calls of all the processes.
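As a quick illustration with purely hypothetical numbers (not taken from our measurements): with D = 4 disks, B = 64 KB blocks, and an average read duration of 25 ms, the aggregate throughput would be 4 x 64 KB / 0.025 s, which is roughly 10.5 MB/s.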
The bus bandwidth places an upper bound on performance. If the bus were continuously busy transferring data, the corresponding average read duration would equal the time required to transmit D blocks of data. This scenario would correspond with a situation in which all the other components of the service time for a disk (e.g., seek time, SCSI overhead, bus busy time, rotational latency and transfer) are overlapped with the data transfers of the other disks that share the bus. Thus, the average read duration is an indicator of how effectively the disk latency is hidden by other concurrent transfers on the bus.
6.7.2 Single disk model
In this section we present our model of read duration when only a single disk is active.
Read duration for fence value 0. When the fence value is 0, the disk requests the bus as soon as the first sector is in the disk cache. After the first sector has been transferred to the host, the transfer of the remainder of the data occurs at the media-to-cache rate (bandwidth_rot), which is smaller than the cache-to-host rate (bandwidth_bus). When using only a single disk, and the block does not cross a track or cylinder boundary, the average time to read a block of size B (B >> 1 sector) is well approximated by

    ReadDuration = Overhead + E[SeekTime] + E[RotationalLatency] + B / bandwidth_rot.
This equation approximates the average read duration as the sum of the SCSI protocol overhead time, the expected
seek time, the expected rotational latency, and the time to read the data from the disk surface.
When B is large, the data will be spread out over a number of tracks and possibly cylinders. Thus the track
and cylinder switch times must be taken into account. Let TrackSwitchTime and CylinderSwitchTime represent the
7 The preemption of a process during the brief time interval between the return of a read call and the invocation of the next read call is sufficiently rare to be negligible.
amount of time to perform one track switch and one cylinder switch, respectively. We can approximate the number of cylinder switches by B / AverageCylinderSize, and the number of track switches (including those that also cross a cylinder boundary) by B / AverageTrackSize. Let TrackCylinderSwitchTime be the sum of the track and cylinder switch times, given by

    TrackCylinderSwitchTime = TrackSwitchTime * (B / AverageTrackSize - B / AverageCylinderSize)
                              + CylinderSwitchTime * (B / AverageCylinderSize).

Using the above definition of TrackCylinderSwitchTime, we get the following expression for the average read duration:

    ReadDuration = Overhead + E[SeekTime] + E[RotationalLatency]
                   + B / bandwidth_rot + TrackCylinderSwitchTime.        (6.1)
Read duration for maximum fence value. When the fence is set to its maximum value, most of the data is read into the disk drive's cache before the bus is requested. Data is transferred first from the disk platter into the disk cache (at bandwidth_rot), and then over the bus at the cache-to-host rate (bandwidth_bus). When using only a single disk, the average time to read a block of size B that does not span across multiple tracks or cylinders is

    ReadDuration = Overhead + E[SeekTime] + E[RotationalLatency]
                   + B / bandwidth_rot + B / bandwidth_bus.

Taking into account the time for the cylinder and track crossings, we have

    ReadDuration = Overhead + E[SeekTime] + E[RotationalLatency]
                   + B / bandwidth_rot + TrackCylinderSwitchTime + B / bandwidth_bus.        (6.2)
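As a sanity check on how these formulas fit together, the following sketch in C evaluates equations (6.1) and (6.2) from the per-disk parameters. The structure layout and field names are our own illustrative assumptions; only the arithmetic follows the equations above.

    /* Sketch: single-disk read-duration predictions of equations (6.1) and (6.2). */
    struct disk_params {
        double overhead;            /* SCSI overhead (s) */
        double avg_seek;            /* E[SeekTime] (s) */
        double avg_rot_latency;     /* E[RotationalLatency] (s) */
        double bandwidth_rot;       /* media-to-cache rate (bytes/s) */
        double bandwidth_bus;       /* cache-to-host rate (bytes/s) */
        double track_switch;        /* one track switch (s) */
        double cylinder_switch;     /* one cylinder switch (s) */
        double avg_track_size;      /* bytes per track */
        double avg_cylinder_size;   /* bytes per cylinder */
    };

    static double track_cyl_switch_time(const struct disk_params *d, double B)
    {
        return d->track_switch * (B / d->avg_track_size - B / d->avg_cylinder_size)
             + d->cylinder_switch * (B / d->avg_cylinder_size);
    }

    /* Equation (6.1): single disk, fence 0 */
    double read_duration_fence0(const struct disk_params *d, double B)
    {
        return d->overhead + d->avg_seek + d->avg_rot_latency
             + B / d->bandwidth_rot + track_cyl_switch_time(d, B);
    }

    /* Equation (6.2): single disk, maximum fence */
    double read_duration_fence_max(const struct disk_params *d, double B)
    {
        return d->overhead + d->avg_seek + d->avg_rot_latency
             + B / d->bandwidth_rot + track_cyl_switch_time(d, B)
             + B / d->bandwidth_bus;
    }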
6.7.3 Parallel disk model
As explained in Section 6.6, we observe the formation of rounds of I/O transactions. In each round, one block is read from each disk. When the fence is 0, a disk is ready to transfer data to the host after it has positioned its head to the data and read the first sector into its disk cache. This time is dominated by the positioning time, which greatly exceeds the rotational transfer time for one sector. Transmission of data to the host begins when any one of the disks is ready, so on a bus with D disks, the idle time on the bus at the beginning of a round is well approximated by the expected minimum positioning time8, denoted MPT(D).
8 Here we use experimentally measured values of the expected minimum positioning time, but techniques based upon [Shr97] can be used to analytically estimate the same quantity.
Parallel read duration for fence value 0. The general scenario in a round is as follows. D blocks are requested. Usually the requested blocks are not in the disk caches, so the drives disconnect from the SCSI bus. After the smallest of the D positioning times, that disk reads the first sector into its cache, and reconnects to the host. It transmits the first sector at bandwidth_bus, and then continues transmitting at bandwidth_rot. It transmits a leading portion of the requested block to the host, after which it disconnects, either because the leading portion was the entire block, or because the remaining portion of the block lies on the next track or cylinder. By the time this disconnection occurs, it is likely that other drives have read enough data into their disk caches that the remaining portion of the D blocks can be transmitted to the host at bandwidth_bus. There may be several disconnects during this transmission, as various drives reach track or cylinder boundaries, but as soon as one drive disconnects, another reconnects to continue sending data to the host. We denote the average size of the leading portion of the first block by LeadingPortion(B).
We make two simplifications. Although the first disk sends one sector at bandwidth_bus before sending more at bandwidth_rot, we say that the entire leading portion from the first disk is sent at bandwidth_rot. Second, the overhead of the disconnection and reconnection is sufficiently small that it is absorbed into the overhead term.
Thus the average read duration is given by

    ReadDuration = Overhead + MPT(D) + LeadingPortion(B) / bandwidth_rot
                   + (DB - LeadingPortion(B)) / bandwidth_bus.        (6.3)
When the block size B is small, it is usual for the entire block to reside on a single track, whereas for large blocks the expected size of the leading portion is one half the track size. Thus if B ≤ AverageTrackSize/2, we approximate LeadingPortion(B) = B; otherwise LeadingPortion(B) = AverageTrackSize/2.
Note that equation (6.3) does not contain terms to account for the track and cylinder crossings as equations (6.1) and (6.2) do. These crossings do not add to the read duration, because the bus remains busy: one disk disconnects and another disk immediately seizes the bus to send its data to the host.
Parallel read duration for maximum fence value. In this case, the bus is idle during the shortest positioning time, then the bus continues to remain idle while that disk reads the block of B bytes into its cache. Next the bus transmits that block to the host, followed by the blocks from the other D − 1 disks. Thus the average read duration in this case is given by

    ReadDuration = Overhead + MPT(D) + B / bandwidth_rot + DB / bandwidth_bus.        (6.4)
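Continuing the illustrative sketch from Section 6.7.2 (same assumed struct and naming), the parallel-disk predictions of equations (6.3) and (6.4) can be written as follows; MPT(D) is passed in as a measured value, as described in the text.

    /* Sketch: parallel-disk read-duration predictions of equations (6.3) and (6.4).
       Reuses the illustrative struct disk_params from the single-disk sketch;
       mpt is the measured expected minimum positioning time MPT(D) in seconds. */
    static double leading_portion(const struct disk_params *d, double B)
    {
        /* entire block if it fits in half a track, else half the average track */
        return (B <= d->avg_track_size / 2.0) ? B : d->avg_track_size / 2.0;
    }

    /* Equation (6.3): D disks sharing the bus, fence 0 */
    double parallel_read_duration_fence0(const struct disk_params *d,
                                         int D, double B, double mpt)
    {
        double lead = leading_portion(d, B);
        return d->overhead + mpt
             + lead / d->bandwidth_rot
             + (D * B - lead) / d->bandwidth_bus;
    }

    /* Equation (6.4): D disks sharing the bus, maximum fence */
    double parallel_read_duration_fence_max(const struct disk_params *d,
                                            int D, double B, double mpt)
    {
        return d->overhead + mpt
             + B / d->bandwidth_rot
             + (double)D * B / d->bandwidth_bus;
    }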
6.8 Validation
To validate the four performance equations of the previous section, we ran experiments using the workload of Section 6.3 and the hardware configurations described in Section 6.4 to measure the average read durations. The experiments contained
startup and cool down phases to ensure that the bus experienced a typical load during the measurement period. The measurement period was 10 seconds; this period was small, yet gave results close to longer measurement periods. We varied the request size from 8 KB to 128 KB and the values of D from 1 to the maximum for that hardware configuration (4 or 7 disks on a SCSI bus). We compared the measured average read durations with the values predicted by equations (6.1)-(6.4) to calculate the relative model error.
Below, we first present the parameters used to validate the performance model on all four of the I/O systems described in Section 6.4, along with the additional experiments used to measure them, and then present in detail the results of our validation experiments. Equations (6.1)-(6.4) require a number of disk-specific parameters such as Overhead, bandwidth_rot, and bandwidth_bus. We measured some of these values experimentally, and used the device specifications for others.
We wrap up this section by drawing conclusions regarding the accuracy of our model.
6.8.1 Experiments to Validate Models
In Section 6.7, we presented equations that predict the average read latency for one or more disks that share a bus, for an intensive workload of fixed-size random reads. In this subsection, we describe the experiments and parameters that were used to validate the performance model on all four of the I/O systems described in Section 6.4.

Seagate Wren 7
First we describe how we determine the values of the model parameters for the Wren 7, and then we present tables
that compare the model predictions with the measured values.

The SCSI overhead. When an I/O request is issued by an application, it travels a path through the operating system, device driver, SCSI controller, SCSI bus, and the disk controller. This path can be delayed at many points, so the time to deliver the request to the disk is not constant. For instance, the device driver can be preempted by a high-priority interrupt, or the SCSI bus can be busy transferring other data so that it is not immediately available to send a command to the disk, or the disk controller can be busy with housekeeping operations so it is slow to accept the request. The overhead also varies across different kinds of requests. For example, the disk checks its cache during a read command, but not for a seek.
We determine a value for Overhead by measuring the time to seek to the track where the disk head already is (a zero-length seek, and no track switch). This time ranges from 1.36 to 1.45 ms. Thus, we use 1.4 ms for the value of Overhead when validating the model. As a second confirmation of this value, we measure the time to read a cached block of size 512 bytes. Using the measured mean bus bandwidth, we compute the amount of time required to transfer the data over the bus, and attribute the remaining time to overhead. This technique yields values in the range 1.2-1.3 ms.
The seek time. Since we are reading random blocks, the seek time contributing to a particular read duration depends on the head position just prior to the read. In predicting the read duration for a particular block size we use the average seek time given in the disk drive manual, as depicted in Table 6.1. We experimentally confirm this value by measuring the time needed to seek over 1/3 of the total number of cylinders, which is 14.9 ms.

The rotational latency. The rotational latency is the time required for the beginning of the block to come under the head after the seek. On average this corresponds to the time for half a rotation. We experimentally confirm this value by measuring random workloads with different request sizes.

The media-cache bandwidth. We can calculate the media-to-cache bandwidth bandwidth_rot: it is just the bytes per track divided by the rotation time of the disk. For the Wren 7, the average value is 2.13 MB/second. The Wren 7 is a zoned disk. The transfer rate ranges from 2.34 MB/s in the outer-most zone to 1.62 MB/s in the inner-most zone.
The track and cylinder switches. The disk manual [Cor89] states that the time for a track switch is 2 ms. We
measured the time for a cylinder crossing to be 3.5 ms.

The sustained SCSI bus bandwidth. We experimentally determine the bus bandwidth bandwidth_bus by the method suggested in [WGPW95]. We measure the mean read service time for request sizes ranging from 8 KB to 128 KB, where the data to be read has previously been prefetched into the disk cache. For all pairs of read sizes, we calculate the bus bandwidth as the mean of Δb/Δt, where Δb is the difference between the read sizes in bytes, and Δt is the difference in times. The values ranged between 3.22 and 3.46; we use 3.3 MB/s as the bus bandwidth.
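A minimal sketch of this pairwise estimate, in C, is given below; the array-based inputs (read sizes in bytes and the corresponding mean cached-read times in seconds) are illustrative assumptions.

    /* Sketch of the pairwise bus-bandwidth estimate described above: for every
       pair of cached-read measurements, take (difference in bytes) / (difference
       in seconds) and average the results. */
    double estimate_bus_bandwidth(const double *bytes, const double *seconds, int n)
    {
        double sum = 0.0;
        int pairs = 0;

        for (int i = 0; i < n; i++)
            for (int j = i + 1; j < n; j++) {
                sum += (bytes[j] - bytes[i]) / (seconds[j] - seconds[i]);
                pairs++;
            }
        return pairs > 0 ? sum / pairs : 0.0;   /* mean of delta-b / delta-t */
    }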

The expected minimum positioning time. We experimentally determine MPT(D), the expected minimum positioning time for D concurrent requests, by issuing random one-sector non-blocking reads to D disks and measuring the time for the first one to complete. Note that these measurements are the sum of Overhead and MPT(D). The measurements are averaged over 1000 trials for 1-5 disks and over 200 trials for 6 and 7 disks. The results appear in Table 6.2. We use these values for Overhead + MPT in equations (6.3) and (6.4).

Table 6.2: The average minimum time to read one sector on a Wren 7.
D               1      2      3      4      5      6      7
mean min (ms)   26.7   22.3   20.0   19.1   18.3   17.2   16.6

We can perform a back-of-the-envelope calculation when D = 1 to check the reasonableness of Table 6.2. For one disk, this value should be the sum of the mean seek time, the mean rotational latency, the time to read one sector into the disk cache (i.e., B / bandwidth_rot), and the time to transfer the sector from the disk cache to the host (i.e., B / bandwidth_bus). For the Wren 7 disks, the mean seek time is 15.0 ms and the mean rotational latency is 8.3 ms. The transfer rate from media to disk cache is 2.13 MB/s, giving us 0.25 ms to transfer a 512 byte sector. Using the transfer rate of 3.3 MB/sec from the disk cache to the host, we get a transfer time of 0.1 ms. In addition,
there is some overhead time to get the SCSI request to the disk and process it; we have measured this value as 1.4
ms. Summing these values, we get 25.1 ms. Thus, the measured value of 26.7 ms seems reasonable.
Tables 6.3-6.6 compare the predictions of the four equations of Section 6.7 with measured values, using 1, 2, 4, and 7 Wren disks on a SCSI bus. Our model's predictions of average read duration for request sizes of 16 KB or more are accurate; the maximum relative error is 8.5%, and most errors are below 3%.
For parallel disks and requests of size 8 KB, the errors are larger (2.3-10.0%). Our model counts the positioning time of only one disk in the read duration, under the assumption that the positioning times of the others will be overlapped with the positioning time and bus transfer time of the first disk. We believe that sometimes the fastest request completes before any of the remaining D − 1 disks becomes ready to transfer data. We suspect that this occurrence may also explain some of the inaccuracy in the 8-KB experiments on the Barracudas and Cheetahs.

Table 6.3: Validating equation (6.1) (1 Wren, fence 0).


Block size (KB) Measured (ms) Estimated (ms) Relative error %
8 31.6 29.3 7.3
16 35.1 33.6 4.5
32 43.2 42.0 2.8
64 60.8 58.9 3.1
96 76.0 75.8 0.2
128 95.4 92.8 2.7

Table 6.4: Validating equation (6.2) (1 Wren, fence 255).


Block size (KB) Measured (ms) Estimated (ms) Relative error %
8 33.5 31.8 5.0
16 40.6 38.5 5.2
32 53.6 51.9 3.1
64 81.3 78.8 3.1
96 106.3 105.6 0.6
128 136.2 132.5 2.7

Table 6.5: Validating equation (6.3) (Wren disks, fence 0).

Block          D=2                          D=4                          D=7
size    Meas.   Est.   Rel.          Meas.   Est.   Rel.          Meas.   Est.   Rel.
(KB)    (ms)    (ms)   err (%)       (ms)    (ms)   err (%)       (ms)    (ms)   err (%)
8       31.7    28.5   10.0          32.6    30.3   7.0           37.9    35.3   6.9
16      36.6    34.8   5.2           42.5    41.5   2.3           59.0    53.9   8.5
32      46.9    47.2   0.8           65.0    63.9   1.6           94.9    91.2   3.9
64      68.3    67.6   1.0           105.8   104.2  1.5           159.1   161.3  1.4
96      89.2    87.5   1.9           145.2   143.9  0.9           225.0   230.8  2.6
128     114.8   107.4  6.4           190.4   183.6  3.5           305.9   300.3  1.8

The Barracudas
We determine the parameters for the Barracuda disk by the techniques previously described for the Wren disks. The
parameters are shown in Table 6.7.
For the Barracudas we did not modify the fence value, so we only evaluate the model for fence 0. Tables 6.8 and 6.9 compare the predictions of the fence-0 equations of Section 6.7 with measured values, using 1-4 Barracuda disks on a DEC Alpha SCSI bus. The largest error is 10.8%.

Seagate Cheetah on Sun Sparc-20


For the Cheetah disk, we determine the disk parameters (Table 6.10) by the techniques previously described for the
Wren disk.
Tables 6.11-6.14 compare the predictions of the four equations of Section 6.7 with measured values on a Sun Sparc-20 Model 61 workstation, using 1-4 Cheetah disks on a fast-wide SCSI bus. The relative errors are smaller than 8%, except for the case of 8 KB, fence 0, and 4 disks, which has an error of 14%.

Table 6.6: Validating equation (6.4) (Wren disks, fence 255).

Block          D=2                          D=4                          D=7
size    Meas.   Est.   Rel.          Meas.   Est.   Rel.          Meas.   Est.   Rel.
(KB)    (ms)    (ms)   err (%)       (ms)    (ms)   err (%)       (ms)    (ms)   err (%)
8       33.7    31.0   7.9           34.4    32.8   4.7           38.7    37.8   2.3
16      41.9    39.7   5.1           46.2    46.5   0.7           63.1    58.9   6.6
32      57.0    57.2   0.2           74.7    73.9   1.1           107.0   101.2  5.5
64      91.9    92.0   0.2           131.6   128.6  2.2           182.5   185.7  1.7
96      128.0   126.9  0.9           185.0   183.3  0.9           263.4   270.2  2.6
128     164.7   161.8  1.7           241.6   238.1  1.5           354.6   354.8  0.0

Table 6.7: The Barracuda device parameters.

parameter                              value
SCSI overhead                          0.7 ms
expected minimum positioning time      10.8-12.6 ms
bandwidth_bus                          18.77 MB/s
single track switch time               2 ms
single cylinder switch time            2 ms

Table 6.8: Validating equation (6.1) (1 Barracuda, fence 0).


Block size (KB) Measured (ms) Estimated (ms) Relative error %
8 15.2 14.2 6.3
16 16.8 15.3 8.7
32 17.8 17.4 2.1
64 22.4 21.7 3.1
96 26.6 25.9 2.7
128 31.4 30.2 3.8

Table 6.9: Validating equation (6.3) (Barracuda disks, fence 0).


Block D=2 D=3 D=4
size Meas. Est. Rel. Meas. Est. Rel. Meas. Est. Rel.
(KB) (ms) (ms) err (%) (ms) (ms) err (%) (ms) (ms) err (%)
8 14.6 13.9 4.9 14.2 13.3 6.7 14.4 12.9 10.3
16 15.8 15.1 4.2 15.7 15.0 4.5 15.4 15.1 2.4
32 17.7 17.6 0.2 17.6 18.4 4.5 18.0 19.3 7.3
64 22.4 22.6 1.1 23.2 25.1 8.4 25.1 27.8 10.8
96 27.3 27.0 1.2 29.2 31.2 6.9 33.5 35.6 6.3
128 32.2 30.5 5.4 35.5 36.4 2.7 41.5 42.6 2.7
Seagate Cheetah on Sun Ultra-1
The device parameters for the Cheetahs on the Sun Ultra-1 are the same as on the Sun Sparc-20 with the exception
of the SCSI overhead; we measured this to be 0.4 ms.
Tables 6.15-6.18 compare the predictions of the four equations of Section 6.7 with measured values on a Sun
Table 6.10: The Cheetah device parameters.

parameter                              value
SCSI overhead                          0.8 ms
expected minimum positioning time      8.0-11.53 ms
bandwidth_bus                          18.64 MB/s
single track switch time               0.2 ms
single cylinder switch time            0.2 ms

Table 6.11: Validating equation (6.1) (1 Cheetah, fence 0, Sparc-20).


Block size (KB) Measured (ms) Estimated (ms) Relative error %
8 11.7 12.1 3.7
16 12.6 12.7 1.3
32 13.9 13.9 0.1
64 16.8 16.4 2.4
96 19.8 18.8 5.0
128 22.8 21.2 7.0

Table 6.12: Validating equation (6.2) (1 Cheetah, fence 255, Sparc-20).


Block size (KB) Measured (ms) Estimated (ms) Relative error %
8 11.9 12.6 5.7
16 13.2 13.6 3.1
32 15.4 15.7 2.0
64 19.8 19.9 0.5
96 24.4 24.0 1.6
128 29.4 28.2 3.9

Table 6.13: Validating equation (6.3) (Cheetah disks, fence 0, Sparc-20).


Block D=2 D=4
size Meas. Est. Rel. Meas. Est. Rel.
(KB) (ms) (ms) err (%) (ms) (ms) err (%)
8 11.8 10.9 7.5 11.5 9.9 14.0
16 12.6 11.9 5.5 12.5 11.8 5.3
32 14.2 14.0 1.9 14.9 15.6 4.8
64 17.6 18.0 2.7 21.6 23.2 7.7
96 21.1 21.9 3.6 28.3 30.6 8.0
128 25.9 25.4 1.8 37.6 37.6 0.1

Table 6.14: Validating equation (6.4) (Cheetah disks, fence 255, Sparc-20).
Block D=2 D=4
size Meas. Est. Rel. Meas. Est. Rel.
(KB) (ms) (ms) err (%) (ms) (ms) err (%)
16 13.2 12.8 3.0 13.1 12.7 3.5
32 15.7 15.7 0.3 16.2 17.4 7.0
64 20.5 21.6 5.2 24.5 26.1 6.7
96 26.1 27.4 4.8 34.3 36.1 5.3
128 32.9 33.3 1.0 46.0 45.5 1.3

Ultra-1 workstation, using 1-4 Cheetah disks on a fast-wide SCSI bus. The relative error is smaller than 10% for request sizes 16 KB and greater, and smaller than 14% for 8 KB requests.

Table 6.15: Validating equation (6.1) (1 Cheetah, fence 0, Ultra).


Block size (KB) Measured (ms) Estimated (ms) Relative error %
8 11.8 12.1 3.1
16 12.3 12.7 3.9
32 13.6 13.9 2.7
64 16.0 16.4 2.5
96 18.8 18.8 0.4
128 21.3 21.2 0.5

Table 6.16: Validating equation (6.2) (1 Cheetah, fence 255, Ultra).


Block size (KB) Measured (ms) Estimated (ms) Relative error %
8 12.1 12.6 3.9
16 13.1 13.6 3.6
32 15.3 15.7 2.7
64 19.5 19.9 1.9
96 24.4 24.0 1.5
128 28.5 28.2 1.0

Table 6.17: Validating equation (6.3) (Cheetah disks, fence 0, Ultra).


Block D=2 D=3 D=4
size Meas. Est. Rel. Meas. Est. Rel. Meas. Est. Rel.
(KB) (ms) (ms) err (%) (ms) (ms) err (%) (ms) (ms) err (%)
8 11.4 10.9 4.4 11.3 9.8 13.7 11.1 9.9 11.4
16 12.2 11.9 2.3 12.3 11.3 8.3 12.2 11.8 3.4
32 13.7 14.0 1.6 14.4 14.2 1.7 15.1 15.6 3.2
64 17.4 18.0 3.6 19.7 20.0 1.9 22.8 23.2 1.7
96 21.2 21.9 0.9 25.6 25.6 0.1 30.6 30.6 0.3
128 25.2 25.4 0.4 31.5 30.9 1.8 38.2 37.6 1.7

Table 6.18: Validating equation (6.4) (Cheetah disks, fence 255, Ultra).
Block D=2 D=3 D=4
size Meas. Est. Rel. Meas. Est. Rel. Meas. Est. Rel.
(KB) (ms) (ms) err (%) (ms) (ms) err (%) (ms) (ms) err (%)
8 11.7 11.3 3.0 11.6 10.3 11.8 11.4 10.3 9.1
16 13.0 12.8 1.8 13.1 12.2 7.3 13.0 12.7 2.6
32 15.6 15.7 1.0 16.0 16.0 0.0 16.5 17.4 5.4
64 20.9 21.6 3.1 22.6 23.6 4.1 26.3 26.7 1.6
96 26.6 27.4 2.9 31.5 31.2 0.9 37.2 36.1 3.1
128 32.2 33.3 3.1 39.1 38.8 0.8 46.2 45.5 1.6
6.8.2 Conclusions From Experimental Validation
Based upon the above experiments, we now present the conclusions regarding the accuracy of our model. For request
sizes 16 KB to 128 KB, the maximum relative error on the Wren disk for the single disk model is less than 5.2% and
for the parallel disk model, 8.5%. Our maximum errors for the Barracuda are larger: 8.7% for the single disk model

and 10.8% for the parallel disk model. Our maximum errors for the Cheetah are similar to the Barracuda: 7.0% for the single disk model and 9.2% for the parallel disk model. The Cheetah model error increases to 14% with 8 KB blocks. We believe that these disks are so fast that the transfer of the first small block sometimes completes before any other disks are ready to use the bus, so the model underestimates the amount of bus idle time.
6.9 Pipelining
When we examine the model equations (6.3) and (6.4) to look for opportunities to decrease the read duration, we notice that one possibility is to decrease the minimum positioning time, and another possibility is to convert those transfers that occur at bandwidth_rot to the faster bandwidth_bus.
Assuming that during iteration j − 1 we know the blocks that will be requested during iteration j, we propose a pipelining technique to overlap the positioning time for iteration j with the transfer time of the previous iteration. Furthermore, this pipelining technique stages data in the on-disk caches, so that the first block transmitted during iteration j is sent from cache at bandwidth_bus, rather than from the disk surface at bandwidth_rot.
Let b_{i,j} denote the block to be retrieved from disk i in round j. Then the pipelining technique schedules the SCSI bus as follows.

    for 0 ≤ i ≤ D − 1
        Request LoadIntoDiskBuffer(b_{i,0}) on disk i.
    for 0 ≤ j ≤ NumRequests
        for 0 ≤ i ≤ D − 1
            Read(b_{i,j}) from disk i.                     // blocks b_{i,j} are already in the disk buffer
            Request LoadIntoDiskBuffer(b_{i,j+1}) on disk i.

We assume that each disk drive begins prefetching data into its cache as soon as it receives the LoadIntoDiskBuffer command. The prefetching is overlapped with the bus transmission of blocks from other disks and the previous round. This technique gives us a simple way to hide disk latency, ensuring fair parallel I/O in rounds. Random blocks are retrieved from each disk at close to the SCSI bus bandwidth.
In a sense, we are using the disk caches as an extension of the computer's main memory. But we would not achieve the same performance gain by adding main memory in the computer and using double-buffered I/O. Double buffering overlaps I/O with computation, but does not decrease the time required for the I/O. Our pipelining technique effects an explicit scheduling of the SCSI bus, to overlap various operations within the I/O subsystem, thereby shortening the time required to complete the I/O. In particular, we overlap the bus transfer with the random access latency, and cause the bus transfer to occur at the higher cache data rate, rather than the slower disk-head rate.
An ideal implementation of our pipelining technique would use one SCSI command to execute the LoadIntoDiskBuffer step. Indeed, there is such a command; it is called the SCSI Prefetch. Unfortunately, it is an optional command, and is not supported by the Wren, Barracuda, or Cheetah. But nearly all disks support cache readahead,
Table 6.19: MB/s for naive and pipelined I/O, fence 0, Wren/Sun Sparc-20.
Block size D=2 D=4 D=7
(KB) Naive Pipeline % Naive Pipeline % Naive Pipeline %
8 0.52 0.48 -8 0.97 0.85 -13 1.43 1.22 -15
16 0.89 0.83 -7 1.53 1.49 -3 1.89 1.93 2
32 1.37 1.40 2 1.97 2.19 11 2.35 2.45 4
64 1.88 2.05 9 2.42 2.72 12 2.80 2.98 6
96 2.15 2.42 13 2.65 2.89 9 3.00 3.09 3
128 2.26 2.53 12 2.68 2.98 11 2.93 2.90 -1

and at the cost of some overhead, we can trigger the readahead to achieve the effect of a SCSI prefetch. The overhead of triggering the readahead is non-negligible. The experimental results described below indicate the conditions under which the performance gain from pipelining outweighs the additional overhead.
For many disks, we implement the LoadIntoDiskBuffer for block b by a non-blocking asynchronous read of the disk sector just before b. The asynchronous read triggers the disk cache readahead mechanism to load b into the disk cache. In the ith iteration of the for loop, during the jth round of I/O, to cause a LoadIntoDiskBuffer(b_{i,j+1}), our implementation uses the aioread() system call to send a 1-sector non-blocking read for the sector immediately preceding block b_{i,j+1} of disk i. In the next round, block b_{i,j+1} on disk i will be found in the disk's cache.
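The following sketch illustrates this readahead trigger. It uses the portable POSIX aio_read() interface as a stand-in for the Solaris aioread() call mentioned above, and the function and buffer names are illustrative assumptions, not our actual implementation.

    /* Sketch of LoadIntoDiskBuffer(b): trigger the drive's readahead by issuing
       a non-blocking 1-sector read of the sector immediately preceding block b.
       The caller keeps one control block and one sector buffer per disk. */
    #include <aio.h>
    #include <string.h>
    #include <sys/types.h>

    #define SECTOR_SIZE 512

    int load_into_disk_buffer(int fd, off_t block_offset,
                              struct aiocb *cb, char *sector_buf)
    {
        memset(cb, 0, sizeof *cb);
        cb->aio_fildes = fd;
        cb->aio_buf    = sector_buf;                       /* contents are discarded */
        cb->aio_nbytes = SECTOR_SIZE;
        cb->aio_offset = block_offset - SECTOR_SIZE;       /* sector just before block b */
        /* non-blocking; the result is never waited on here. The drive's readahead
           should continue into block b, staging it in the disk's internal cache
           for the blocking Read issued in the next round. */
        return aio_read(cb);
    }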
Table 6.19 evaluates the effectiveness of the pipelining technique with 2, 4, and 7 Wren disks on a Sparc-20, using transfers ranging from 8 KB to 128 KB. The measurements are averaged over 1000 I/Os9. The table compares the aggregate transfer rate in MB/s achieved by the "naive" approach (one process per disk performing random I/Os) with the pipelined technique. The column labeled "%" contains the relative improvement (in percent) of the pipelined technique. Performance in the experiments with small block sizes is inhibited by unhidden positioning time; this harms performance by several percent. With 7 disks, the bus is so overloaded that there is little room for improvement. With 2 or 4 disks and moderate or large block sizes, the overlaps gained by the pipeline technique more than compensate for the increased overhead. For example, with 4 disks and 64 KB blocks, the bandwidth improves from 2.42 MB/s to 2.72 MB/s.
Table 6.20 presents the results of the corresponding experiment on the DEC Alpha with Seagate Barracuda disks. The measurements are averaged over 300 I/Os; this number of I/Os is small enough to fit within the limited number of unclaimed aioread() structure stubs in Digital Unix. For large transfers and many disks sharing the bus, the throughput gain achieved by the pipelining technique is greater than the overhead of the asynchronous I/O operations.
When we run the same set of experiments on the Cheetah disks on a Sun Sparc-20, we do not see improvements with the pipelining method. We attribute this to the host overhead for the aioread() command, which we measure to be 2.1-2.3 ms (approximately equivalent to the amount of time needed to read 16-32 KB from disk cache to host). As shown in Table 6.21, the faster Sun Ultra-1 reduces the overhead sufficiently to make the pipelining technique
9 We also performed experiments with 300 I/Os for most of our hardware configurations and received similar numbers.
Table 6.20: MB/s for naive and pipelined I/O, fence 0, Barracuda/DEC Alpha.
Block size D=2 D=3 D=4
(KB) Naive Pipeline % Naive Pipeline % Naive Pipeline %
8 1.05 0.93 -11 1.58 1.36 -13 2.10 1.78 -15
16 1.96 1.75 -11 2.95 2.62 -11 3.90 3.30 -15
32 3.49 3.15 -10 5.18 4.66 -10 6.78 6.06 -11
64 5.51 5.19 -6 7.97 7.43 -7 9.76 9.92 2
96 6.87 6.48 -7 9.58 9.69 1 11.25 12.51 11
128 7.87 7.60 -3 10.35 11.19 8 12.13 14.40 19

Table 6.21: MB/s for naive and pipelined I/O, fence 0, Cheetah/Sun Ultra-1.
Block size D=2 D=3 D=4
(KB) Naive Pipeline % Naive Pipeline % Naive Pipeline %
8 1.32 1.22 -8 1.97 1.74 -12 2.63 2.30 -13
16 2.48 2.34 -6 3.65 3.31 -9 4.83 4.28 -11
32 4.43 4.27 -4 6.32 6.07 -4 8.02 7.75 -3
64 7.08 7.09 0 9.38 10.05 7 10.72 12.48 16
96 8.76 9.48 8 10.85 12.76 18 12.09 14.12 17
128 9.86 11.01 12 11.79 14.19 20 13.00 14.45 11

viable for large block sizes. The measurements are averaged over 1000 I/Os. If the disks supported SCSI prefetch with a command overhead equal to that of a seek, we would expect to see improvements of 10-30% for pipelining over the naive method. Moreover, as CPU speeds increase in the future, a wider variety of workloads may benefit from pipelining.
Consider the I/O requests generated by I/O-intensive algorithms with D processes or threads running on multiprocessors or uniprocessors. These I/O requests all funnel through the disk device driver code. Depending upon the I/O sizes and the disk configuration, it may be possible to improve the performance of such algorithms by interposing an I/O scheduling thread that causes pipelining by issuing the appropriate LoadIntoDiskBuffer operations. Similarly, it may be possible to improve the performance of a RAID disk array by an internal implementation of this pipelining technique.
6.10 Conclusions and Future Work
Based upon an extensive series of measurements, we have developed accurate models of the I/O throughput achieved by multiple disks that share a SCSI bus for a balanced random-access workload.
The performance advantage of scheduling within the request queue for each disk is well known. We have shown that coordinating the accesses across a collection of disks that share a SCSI bus can also improve performance by 10-20%. This coordination across disks does not have the NP-hard complexity of general scheduling. Moreover, our technique enables the disks to be self-governing, in that we do not need to predict the positioning time that will be incurred by each I/O request.
We have used our performance models to develop a pipelining technique and have described the circumstances under which it improves the aggregate disk bandwidth on the shared SCSI bus. The improvement is a consequence
of increased overlap between disk seeks and data transfers, and an increase in the fraction of transfers that occur at the disk cache transfer rate rather than the slower disk-head rate.
We performed preliminary experiments that indicate that the model remains accurate when the workload exhibits spatial locality or has multiple request sizes.
In each of our experiments there is at most one request in the I/O path for each disk. As future work, we would like to extend our technique to incorporate the advantages of disk controller command queuing to schedule multiple requests per disk. Preliminary experiments suggest that our pipelining technique may be beneficial in this environment too. This technique could be beneficial for applications that can generate batches of requests for each disk, such as video-on-demand servers. We would also like to model write requests; writes are more difficult to model because of the complexity of the write-to-disk policies. Our model could also be extended to approximate wall clock time in an I/O system-dependent fashion, furthering the results of [CH96].

[AAK99] Susanne Albers, Sanjeev Arora, and Sanjeev Khanna. Page replacement for general caching problems.
Proc. 10th Annual Symposium on Discrete Algorithms (SODA), 1999.

[AAM+98] Pankaj Agarwal, Lars Arge, T. M. Murali, Kasturi Varadarajan, and Jeff Vitter. I/O-efficient algorithms for contour line extraction and planar graph blocking. Proc. 9th Symposium on Discrete Algorithms (SODA), 1998.

[ABF93] Baruch Awerbuch, Yair Bartal, and Amos Fiat. Heat and dump: Randomized competitive distributed
paging. In Proc. 34th IEEE Symp. Foundations of Computer Science, 1993.

[ABF96] Baruch Awerbuch, Yair Bartal, and Amos Fiat. Distributed paging for general networks. In Proceedings
8th Annual ACM-SIAM Symposium on Discrete Algorithms, pages 574{583, 1996.

[ABW98] J. Abello, A. Buchsbaum, and J. Westbrook. A functional approach to external memory graph algo-
rithms. External Memory Algorithms and Visualization, 1998.

[ABZ96] Mathew Andrews, Michael Bender, and Lisa Zhang. New algorithms for the disk scheduling problem.
37th Annual Symposium on Foundations of Computer Science (FOCS), pages 580{589, October 1996.

[ADADC+97] Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau, David E. Culler, Joseph M. Hellerstein, and David A. Patterson. High-performance sorting on networks of workstations. SIGMOD97, 1997.

[AFGV98] Lars Arge, Paolo Ferragina, Roberto Grossi, and Jeff Vitter. On sorting strings in external memory. ACM Symposium on the Theory of Computing (STOC), 1998.
[AGL98] S. Albers, N. Garg, and S. Leonardi. Minimizing stall time in single and parallel disk systems. Proc.
30th Annual ACM Symposium on Theory of Computing (STOC98), 1998.

[AHVV99] Lars Arge, Klaus Hinrichs, Jan Vahrenhold, and Jeff Vitter. Efficient bulk operations on dynamic R-trees. ALENEX '99, 1999.

[AKL93] L. Arge, M. Knudsen, and K. Larsen. A general lower bound on the I/O-complexity of comparison-
based algorithms. In Proceedings of the 3rd Workshop on Algorithms and Data Structures, volume
709, pages 83{94. Lecture Notes in Computer Science, Springer-Verlag, 1993.

[Alb93] S. Albers. The influence of lookahead in competitive paging algorithms. In Proceedings 1st Annual European Symposium on Algorithms, pages 1-12, 1993.

[AP94] Alok Aggarwal and C. Greg Plaxton. Optimal parallel sorting in multi-level storage. Proc. Fifth
Annual ACM-SIAM Symp. on Discrete Algorithms, pages 659{668, 1994.

[APR+98] Lars Arge, Octavian Procopiuc, Sridhar Ramaswamy, Torsten Suel, and Jeff Vitter. Theory and practice of I/O-efficient algorithms for multidimensional batched searching problems. Proc. 9th Symposium on Discrete Algorithms (SODA), 1998.

[Arg94] Lars Arge. The buffer tree: A new technique for optimal I/O-algorithms. Technical Report RS-94-16, BRICS, Univ. of Aarhus, Denmark, 1994.

[Arg96] Lars Arge. Efficient External-Memory Data Structures and Applications. PhD thesis, Department of Computer Science, University of Aarhus, 1996.

[Arg97] L. Arge. External-memory algorithms with applications in geographic information systems. In M. van
Kreveld, J. Nievergelt, T. Roos, and P. Widmayer, editors, Algorithmic Foundations of GIS. Springer-
Verlag, LNCS 1340, 1997.

[AV88] Alok Aggarwal and Jeffrey S. Vitter. The input/output complexity of sorting and related problems. Communications of the ACM, 31(9):1116-1127, 1988.

[AV96] L. Arge and J. S. Vitter. Optimal interval management in external memory. Proc. of the 37th Annual IEEE Symposium on Foundations of Computer Science (FOCS '96), pages 560-569, October 1996. Also appeared in Abstracts of the First CGC Workshop on Computational Geometry, Center for Geometric Computing, Johns Hopkins University, Baltimore, MD, October 1996.
[AVVar] Lars Arge, Darren Erik Vengroff, and Jeffrey Scott Vitter. External-memory algorithms for processing line segments in geographic information systems. Special issue on cartography and geographic information systems in Algorithmica, to appear. A shortened version appears in Proceedings of the 3rd Annual European Symposium on Algorithms (ESA '95), September 1995, published in Lecture Notes in Computer Science, Springer-Verlag, Berlin, 295-310.

[BCL93] K. P. Brown, M. J. Carey, and M. Livny. Managing memory to meet multiclass workload response
time goals. Proc. 19th Int. Conf. on Very Large Data Bases, August 1993.

[Bel66] L. A. Belady. A study of replacement algorithms for virtual storage computers. IBM Systems Journal, 5:78-101, 1966.

[BGH+97] Rakesh Barve, Phillip B. Gibbons, Bruce Hillyer, Yossi Matias, Elizabeth Shriver, and Jeffrey Scott Vitter. Modeling and optimizing I/O throughput of multiple disks on a bus: the long version. Technical Report BL97.01578, Bell Labs, 1997.

[BGH+98] Rakesh Barve, Phillip B. Gibbons, Bruce K. Hillyer, Yossi Matias, Elizabeth Shriver, and Jeffrey Scott Vitter. Round-like behavior in multiple disks on a bus: a study with Wren 7 disks. Technical report, Bell Labs, October 1998.

[BGV97] Rakesh D. Barve, Edward F. Grove, and Jeffrey Scott Vitter. Simple randomized mergesort on parallel disks. Parallel Computing, 23(4), 1997. Special issue on parallel I/O. An earlier version appears in Proc. of the 8th Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA '96), Padua, Italy, June 1996, 109-118.

[BIRS91] A. Borodin, S. Irani, P. Raghavan, and B. Schieber. Competitive paging with locality of reference. In
Proceedings of the 23rd Annual ACM Symposium on Theory of Computation, May 1991.

[BK98] G. S. Brodal and J. Katajainen. Worst-case efficient external-memory priority queues. Scandinavian Workshop on Algorithm Theory, 1998.

[BKVV97] Rakesh D. Barve, Mahesh Kallahalla, Peter J. Varman, and J. S. Vitter. Competitive parallel disk prefetching and buffer management. Proceedings of the Fifth Annual Workshop on I/O in Parallel and Distributed Systems (IOPADS), 1997.

[Bre95] Dany Breslauer. On competitive online paging with lookahead. (RS-95-50), September 1995.

[BSG+98] Rakesh Barve, Elizabeth Shriver, Phillip B. Gibbons, Bruce K. Hillyer, Yossi Matias, and Jeffrey Scott Vitter. Modeling and optimizing I/O throughput of multiple disks on a bus. In Joint International Conference on Measurement and Modeling of Computer Systems (Sigmetrics '98/Performance '98), June 1998. Extended abstract.

[CABG98] Jeffrey S. Chase, Darrell Anderson, Rakesh Barve, and Syam Gadde. Improving I/O performance with blocked file mapping. 1998. In submission.

[CFKL95] Pei Cao, Edward Felten, Anna Karlin, and Kai Li. A study of integrated prefetching and caching
strategies. Proceedings of the ACM SIGMETRICS Conference on Measurement and Modelling of
Computer Systems, May 1995.

[CFL94a] P. Cao, E. W. Felten, and K. Li. Application-controlled file caching policies. In Proceedings of the Summer USENIX Conference, 1994.

[CFL94b] P. Cao, E. W. Felten, and K. Li. Implementation and performance of application-controlled file caching. In Proceedings of the First OSDI Symposium, 1994.

[CFM+98] A. Crauser, P. Ferragina, K. Mehlhorn, U. Meyer, and E. Ramos. Randomized external-memory algo-
rithms for geometric problems. Proceedings of the 14th ACM Symposium on Computational Geometry,
June 1998.

[CGG+95] Y.-J. Chiang, M. T. Goodrich, E. F. Grove, R. Tamassia, D. E. Vengroff, and J. S. Vitter. External-memory graph algorithms. In Proceedings of the Sixth Annual ACM-SIAM Symposium on Discrete Algorithms, January 1995.

[CH96] Thomas H. Cormen and Melissa Hirschl. Early experiences in evaluating the parallel disk model with
the vic* implementation. Technical Report PCS-TR96-293, Dept. of Computer Science, Dartmouth
College, August 1996.

[Che89] P. M. Chen. An evaluation of redundant arrays of disks using an amdahl 5890. UCB/CSD 89 506, U.
C. Berkeley, May 1989.

[CLG+ 94] Peter M. Chen, Edward K. Lee, Garth A. Gibson, Randy H. Katz, and David A. Patterson. RAID:
high-performance, reliable secondary storage. ACM Computing Surveys, 26(2):145{185, June 1994.

[CLR90] Thomas H. Cormen, Charles E. Leiserson, and Ronald L. Rivest. Introduction to Algorithms. The
MIT Press and McGraw-Hill, 1990.

[Cor89] Control Data Corporation. Product specification for Wren VII SCSI disk drive model 94601. Spec. 77765417, Revision A, September 1989.
[CT96] Shenze Chen and Manu Thapar. I/O channel and real-time disk scheduling for video servers. In
NOSSDAV'96, 1996.

[Dah96] Mike Dahlin. The impact of technology trends on file system design. http://www.cs.utexas.edu/users/dahlin/techTrends/trends.ps, 1996.

[dJKH93] W. de Jonge, M. F. Kaashoek, and W. C. Hsieh. The logical disk: A new approach to improving
file systems. Proc. of the 14th ACM Symposium on Operating Systems Principles (SOSP), December
1993.

[ECW94] Vladimir Estivill-Castro and Derrick Wood. Foundations of external merging. Proc. Fourteenth Con-
ference on Foundations of Software Technology and Theoretical Computer Science, pages 414–425,
1994.

[ES92] R. M. English and A. A. Stepanov. Loge: A self-organizing storage device. Proc. of the USENIX
Winter'92 Technical Conference, pages 237–251, 1992.

[FG95] P. Ferragina and R. Grossi. A fully-dynamic data structure for external substring search. In Proceedings
of the 27th Annual ACM Symposium on Theory of Computing, May 1995.

[FK95] A. Fiat and A. Karlin. Randomized and multi-pointer paging with locality of reference. In Proc. of the
27th Annual ACM Symp. on the Theory of Computing, pages 626–634, 1995.

[FKL+ 91] A. Fiat, R. M. Karp, M. Luby, L. A. McGeoch, D. D. Sleator, and N. E. Young. On competitive
algorithms for paging problems. Journal of Algorithms, 12:685–699, 1991.

[Flo72] R. W. Floyd. Permuting information in idealized two-level storage. In R. Miller and J. Thatcher,
editors, Complexity of Computer Computations, pages 105–109. Plenum, 1972.

[FNG89] D. Ferguson, C. Nikolaou, and L. Georgiadis. Goal oriented, adaptive transaction routing for high
performance transaction processing systems. Proc. of the 2nd Int. Conf. on Parallel and Distributed
Information Systems, January 1989.

[GK81] Daniel H. Greene and Donald E. Knuth. Mathematics for the Analysis of Algorithms. Birkhäuser,
Boston, 1981.

[Gra93] Goetz Graefe. Query evaluation techniques for large databases. ACM Computing Surveys, 25(2):73–170,
1993.

[GS84] D. Gifford and A. Spector. The TWA reservation system. Comm. ACM, 27(7):650–665, July 1984.

[GTVV93] M. T. Goodrich, J.-J. Tsay, D. E. Vengroff, and J. S. Vitter. External-memory computational geometry.
In IEEE Foundations of Computer Science, pages 714–723, 1993.

[GVW96] G. A. Gibson, J. S. Vitter, and J. Wilkes. Report of the working group on storage I/O issues
in large-scale computing. ACM Computing Surveys, 28(4), December 1996. Also available as
http://www.cs.duke.edu/~jsv/report.ps.

[HGK+94] L. Hellerstein, G. Gibson, R. M. Karp, R. H. Katz, and D. A. Patterson. Coding techniques for
handling failures in large disk arrays. Algorithmica, 12(2–3), 1994.

[HK81a] J. W. Hong and H. T. Kung. I/O complexity: The red-blue pebble game. Proc. 13th Annual ACM
Symp. on Theory of Computing, pages 326–333, May 1981.

[HK81b] J. W. Hong and H. T. Kung. I/O complexity: The red-blue pebble game. Proc. 13th Annual ACM
Symp. on Theory of Computing, pages 326–333, May 1981.

[HP96] John L. Hennessy and David A. Patterson. Computer Architecture: A Quantitative Approach. Morgan
Kaufmann Publishers Inc., 1996.

[IBM90] IBM. DATABASE 2, Administration Guide for Common Servers. June 1990.

[IKP92] S. Irani, A. R. Karlin, and S. Phillips. Strongly competitive algorithms for paging with locality of
reference. In Proceedings of the 3rd Annual ACM-SIAM Symposium on Discrete Algorithms, January
1992.

[Inc] Seagate Technology Inc. ST-34501W/WC Ultra-SCSI Wide (Cheetah 4LP) data sheet. Found at
ftp://ftp.seagate.com/techsuppt/scsi/st34501w.txt.

[JW91] D. M. Jacobson and J. Wilkes. Disk scheduling algorithms based on rotational position. Technical
Report HPL-91-7, Hewlett-Packard Laboratories, February 1991.

[Kk96] Tracy Kimbrel and Anna Karlin. Near-optimal parallel prefetching and caching. Proc. of the 37th
Annual IEEE Symposium on Foundations of Computer Science (FOCS '96), October 1996.

[Knu97] D. E. Knuth. Fundamental Algorithms, volume 1 of The Art of Computer Programming. Addison-
Wesley, Reading, MA, third edition, 1997.

[Knu98] D. E. Knuth. Sorting and Searching, volume 3 of The Art of Computer Programming. Addison-Wesley,
Reading, MA, second edition, 1998.

[Kot94] David Kotz. Disk-directed I/O for MIMD multiprocessors. Proc. of 1994 Symposium on Operating
Systems Design and Implementation, November 1994.

[KP94] E. Koutsoupias and C. Papadimitriou. Beyond competitive analysis. In Proc. of the 35th Annual
IEEE Foundations of Computer Science, pages 394–400, 1994.

[KPR92] A. R. Karlin, S. J. Phillips, and P. Raghavan. Markov paging. In Proceedings of the 33rd Annual
IEEE Conference on Foundations of Computer Science, pages 208–217, October 1992.

[KS96] V. Kumar and E. Schwabe. Improved algorithms and data structures for solving graph problems in
external memory. Proceedings of the 8th IEEE Symposium on Parallel and Distributed Processing,
pages 169–176, October 1996.

[KSC78] V. F. Kolchin, B. A. Sevastyanov, and V. P. Chistyakov. Random Allocations. Winston & Sons,
Washington, 1978.

[KTP+ 96] Tracy Kimbrel, Andrew Tomkins, R. Hugo Patterson, Brian Bershad, Pei Cao, Edward Felten, Garth
Gibson, Anna R. Karlin, and Kai Li. A trace-driven comparison of algorithms for parallel prefetching
and caching. Proc. of the 1996 Symposium on Operating Systems Design and Implementation, 1996.

[LG98] P.-A. Larson and Goetz Graefe. Memory management during run generation in external sorting.
SIGMOD '98, June 1998.

[LV85] E. E. Lindstrom and J. S. Vitter. The design and analysis of bucketsort for bubble memory secondary
storage. IEEE Transactions on Computers, C-34:218–233, March 1985.

[Mag87] N. B. Maginnis. Store more, spend less: Mid-range options abound. Computerworld, pages 71–82,
November 1987.

[MDK96] Todd C. Mowry, Angela K. Demke, and Orran Krieger. Automatic compiler-inserted I/O prefetching
for out-of-core applications. Proc. of the 1996 Symposium on Operating Systems Design and Imple-
mentation, 1996.

[MK91] L. W. McVoy and S. R. Kleiman. Extent-like performance from a UNIX file system. In Proceedings
Winter Usenix 1991, Dallas, Texas, 1991.

[MNO+ 98] C. Martin, P. S. Narayan, B. Özden, R. Rastogi, and A. Silberschatz. The Fellini multimedia storage
system. Journal of Digital Libraries, 1998. To Appear.

[MS91] L. A. McGeoch and D. D. Sleator. A strongly competitive randomized paging algorithm. Algorithmica,
6:816–825, 1991.

[Nie98] Nils Nieuwejaar. Personal Communication. 1998.

[Nut97] Gary Nutt. Operating Systems: A Modern Perspective. Addison-Wesley Publishing Company, 1997.

[NV90] M. H. Nodine and J. S. Vitter. Large-scale sorting in parallel memories. In Proc. 3rd ACM Symp. on
Parallel Algorithms and Architectures, pages 29–39, 1990.

[NV93] M. H. Nodine and J. S. Vitter. Deterministic distribution sort in shared and distributed memory
multiprocessors. Proc. 5th Annual ACM Symp. on Parallel Algorithms and Architectures, pages
120–129, 1993.

[NV95] M. H. Nodine and J. S. Vitter. Greed sort: An optimal sorting algorithm for multiple disks. Journal
of the ACM, 42(4):919–933, July 1995.

[PCL93a] H. Pang, M. Carey, and M. Livny. Memory-adaptive external sorts. Proc. Nineteenth International
Conf. on Very Large Data Bases, 1993.

[PCL93b] H. Pang, M.J. Carey, and M. Livny. Partially preemptible hash joins. Proc. 1993 ACM-SIGMOD
Conf. on Management of Data, 1993.

[PEK96] A. Purakayastha, C. S. Ellis, and D. Kotz. ENWRICH: A compute-processor write caching scheme for
parallel file systems. Fourth Workshop on Input/Output in Parallel and Distributed Systems, May
1996.

[PGK88] D. A. Patterson, G. Gibson, and R. H. Katz. A case for redundant arrays of inexpensive disks (RAID).
Proc. 1988 ACM-SIGMOD Conf. on Management of Data, pages 109–116, 1988.

[PSV94] V. S. Pai, A. A. Schaffer, and P. J. Varman. Markov analysis of multiple-disk prefetching strategies
for external merging. Theoretical Computer Science, 128(2):211–239, June 1994.

[PV92] V. S. Pai and P. J. Varman. Prefetching with multiple disks for external mergesort: Simulation and
analysis. In 8th International Conference on Data Engineering, pages 273–282, 1992.

[Rec88] SIGMOD Record. Special issue on real-time database systems. 17(1), March 1988.

[RS94] S. Ramaswamy and S. Subramanian. Path caching: a technique for optimal external searching. Proc.
13th ACM Conf. on Princ. of Database Systems, 1994.

[RW91] Chris Ruemmler and John Wilkes. Disk shuffling. Technical Report HPL-91-156, Hewlett-Packard
Laboratories, October 1991.

[RW94] C. Ruemmler and J. Wilkes. An introduction to disk drive modeling. IEEE Computer, pages 17–28,
March 1994.

[Sal89] B. Salzberg. Merging sorted runs using large main memory. Acta Informatica, 27:195–215, 1989.

[SCO90] M. Seltzer, P. Chen, and J. Ousterhout. Disk scheduling revisited. Proc. of Winter USENIX Technical
Conference, pages 313–323, January 1990.

[Shr97] Elizabeth Shriver. Performance Modelling For Realistic Storage Devices. Ph.D. thesis, New York University,
May 1997. Available at http://www.bell-labs.com/~shriver.

[ST85] D. D. Sleator and R. E. Tarjan. Amortized efficiency of list update and paging rules. Communications
of the ACM, 28(2):202–208, February 1985.

[Sto81] Michael Stonebraker. Operating system support for database management. Communications of the
ACM, 24(7), July 1981.

[STV96] J. B. Sinclair, J. Tang, and P. J. Varman. Placement-related problems in shared disk I/O. In Ravi
Jain, John Werth, and James Browne, editors, Input/Output in Parallel and Distributed Computer
Systems. Volume 362, The Kluwer International Series in Engineering and Computer Science, Kluwer
Academic Publishers, 1996.

[SV87] J. E. Savage and J. S. Vitter. Parallelism in space-time tradeoffs. In F. P. Preparata, editor, Advances
in Computing Research, Volume 4, pages 117–146. JAI Press, 1987.

[Sys92] Real-Time Systems. Special issue on real-time databases. 4(3), September 1992.

[Tec97a] Seagate Technologies. Cheetah 4LP family: ST34501 N/W/WC/WD/DC. Product Manual, volume 1,
Document number 83329120, Revision A, April 1997.

[Tec97b] Seagate Technologies. Specifications for ST-32171W. http://www.seagate.com/, July 1997.

[Tor95] Eric Torng. A unified analysis of paging and caching. In Proc. of the 36th Annual IEEE Foundations
of Computer Science, pages 194–203, 1995.

[TPG97] Andrew Tomkins, R. Hugo Patterson, and Garth Gibson. Informed multi-process prefetching and
caching. Proceedings of SIGMETRICS'97, 1997.

[Uni89] University of California at Berkeley. Massive Information Storage, Management, and Use (NSF
Institutional Infrastructure Proposal), January 1989. Technical Report No. UCB/CSD 89/493.

[UW97] Jeffrey D. Ullman and Jennifer Widom. A First Course in Database Systems. Prentice Hall, 1997.

[VC90] P. Vongsathorn and S. D. Carson. A system for adaptive disk rearrangement. Software–Practice and
Experience, 20(3):225–242, March 1990.

[Ven94] Darren Erik Vengroff. A transparent parallel I/O environment. In Proc. 1994 DAGS Symposium on
Parallel Computation, July 1994.

[Ven95] Darren Erik Vengroff. TPIE User Manual and Reference. Duke University, 1995. Available via WWW
at http://www.cs.duke.edu:~dev/tpie.html.

[Ven96] Darren Vengroff. The Theory and Practice of I/O Algorithms. Ph.D. thesis, Brown University, 1996.

[VF90] J. S. Vitter and Ph. Flajolet. Average-case analysis of algorithms and data structures. In Jan van
Leeuwen, editor, Handbook of Theoretical Computer Science, Volume A: Algorithms and Complexity,
chapter 9, pages 431–524. North-Holland, 1990.

[Vit98] Jeffrey S. Vitter. External memory algorithms. Proceedings of the 17th Annual ACM
Symposium on Principles of Database Systems (PODS '98), pages 119–178, 1998. See
http://www.cs.duke.edu/~jsv/bib/jsvbibown.bib.

[VS94] J. S. Vitter and E. A. M. Shriver. Algorithms for parallel memory I: Two-level memories. Algorithmica,
12(2–3):110–147, 1994.

[VV95] Darren Erik Vengroff and Jeffrey Scott Vitter. I/O-efficient scientific computation using TPIE. Tech-
nical Report CS-1995-18, Duke University Dept. of Computer Science, 1995.

[VV96a] D. E. Vengroff and J. S. Vitter. Efficient 3-d range searching in external memory. Proc. of the 28th
Annual ACM Symposium on Theory of Computing (STOC '96), May 1996.

[VV96b] D. E. Vengroff and J. S. Vitter. I/O-efficient computation: The TPIE approach. In Proceedings of the
Goddard Conference on Mass Storage Systems and Technologies, NASA Conference Publication 3340,
Volume II, pages 553–570, College Park, MD, September 1996.

[WGP94] Bruce L. Worthington, Gregory R. Ganger, and Yale N. Patt. Scheduling algorithms for modern disk
drives. Proceedings of the ACM Sigmetrics Conference, May 1994.

[WGPW95] Bruce L. Worthington, Gregory R. Ganger, Yale N. Patt, and John Wilkes. On-line extraction of SCSI
disk drive parameters. Proceedings of the ACM Sigmetrics Conference, pages 146–156, May 1995.

[WGWR93] D. Womble, D. Greenberg, S. Wheat, and R. Riesen. Making parallel computer I/O practical. Pro-
ceedings of the 1993 DAGS/PC Symposium, pages 56–63, June 1993.

[You91] Neal Young. Competitive Paging and Dual-Guided Online Weighted Caching and Matching Algo-
rithms. Ph.D. thesis, Princeton University, 1991.

[You98] Neal E. Young. Online file caching. ACM-SIAM Symposium on Discrete Algorithms (SODA), 1998.

[ZG90] H. Zeller and J. Gray. An adaptive hash join algorithm for multiuser environments. Proc. of the 16th
Intl. Conf. on Very Large Data Bases, 1990.

[ZL96] L. Q. Zheng and P.-A. Larson. Speeding up external mergesort. IEEE Trans. Knowledge and Data
Engineering, 8(2):322–332, 1996.

[ZL97] W. Zhang and P.-A. Larson. Dynamic memory adjustment for external mergesort. Proc. Twenty-third
International Conf. on Very Large Data Bases, 1997.

[ZL98] Weiye Zhang and P.-A. Larson. Buffering and read-ahead strategies for external mergesort. Proceedings
of the 24th VLDB Conference, pages 523–532, 1998.

Biography
Rakesh Barve was born on March 4, 1972 in Bombay (now known as Mumbai), India. He received a Bachelor of
Technology degree in Computer Science and Engineering from the Indian Institute of Technology (IIT), Bombay in
May 1993. He joined the computer science doctoral program at Duke University in August 1993. In the course of his
Ph.D., he was the recipient of an IBM Graduate Fellowship Award for three academic years from 1995 through 1998.
He received the Doctor of Philosophy in Computer Science from Duke University in December 1998.
Rakesh Barve is interested in the design, analysis, and implementation of algorithms and in the use of practical
techniques and data structures to design efficient, application-specific solutions. His research work has focused on
techniques to overcome the slow access speeds of secondary memory: these include external memory data structures
and algorithms, caching, prefetching, buffer management, load balancing across multiple disk drives, and operating
systems issues pertaining to I/O performance and the implementation of external memory computations.

Selected Publications
1. "Application Controlled Paging for a Shared Cache", Rakesh D. Barve, Edward F. Grove and Jeffrey S. Vitter.
Accepted for publication in the SIAM Journal on Computing. A short version appeared in the Proceedings of
the 36th Annual IEEE Symposium on Foundations of Computer Science (FOCS 95), August 1995.
2. "Simple Randomized Mergesort on Parallel Disks", Rakesh D. Barve, Edward F. Grove and Jeffrey S. Vitter,
in Parallel Computing, 1997, Volume 23, Number 4, pp. 601–631. An earlier version appeared in Proceedings of
the 8th ACM Symposium on Parallel Algorithms and Architectures (SPAA 96), 1996, Padua, Italy. Another
version appeared in the Proceedings of the DIMACS Workshop on Randomization in Algorithm Design, 1998,
edited by Panos Pardalos, José Rolim and Sanguthevar Rajasekaran, for the American Mathematical Society.
3. "Competitive Parallel Disk Prefetching and Buffer Management", Rakesh D. Barve, M. Kallahalla, P. J. Varman
and Jeffrey S. Vitter, in Proceedings of the Fifth Workshop on Input/Output in Parallel and Distributed
Systems (IOPADS-97), November 1997, pp. 47–56.
4. "Modeling and optimizing I/O throughput of multiple disks on a bus", Rakesh Barve, Elizabeth Shriver, Phillip
B. Gibbons, Bruce K. Hillyer, Yossi Matias, and Jeffrey Scott Vitter, in Proceedings of Sigmetrics '98 as an
extended abstract, pages 264–265, June 1998.
5. "I/O bandwidth of multiple disks on a bus", 1998, Rakesh Barve, Elizabeth Shriver, Phillip B. Gibbons, Bruce
K. Hillyer, Yossi Matias, and Jeffrey Scott Vitter, In Submission.
6. "A Theory of Memory-Adaptive Algorithms", 1998, Rakesh D. Barve and Jeffrey S. Vitter, In Submission.
A longer version of this paper, called "External Memory Algorithms with Dynamically Changing Memory
Allocations: Long Version", appears as Duke University Technical Report CS-1998-09.
7. "Improving I/O performance with blocked file mapping", 1998, Jeffrey S. Chase, Darrell Anderson, Rakesh
Barve and Syam Gadde, In Submission.
8. "A Simple and Efficient Parallel Disk Mergesort", 1998, Rakesh D. Barve and Jeffrey S. Vitter, In Submission.
9. "Engineering External Memory Computations Using TPIE", 1998, Rakesh D. Barve, Lars A. Arge and
Jeffrey S. Vitter, In Progress.
10. "A Characterization of Simple and Provably Efficient Prefetching Algorithms", Wei Jin, Rakesh D. Barve and
Kishore S. Trivedi, In Submission.
11. "On the Complexity of Learning from Drifting Distributions", Rakesh D. Barve and Philip M. Long, in
Information and Computation, Volume 138, Number 2, pp. 101–123, 1997. A short version appeared in Proceedings
of the Ninth Annual Conference on Computational Learning Theory (COLT '96), 1996, Desenzano Del Garda,
Italy.

Patents
Two patents related to "Modeling and optimizing I/O throughput of multiple disks on a bus" are currently filed and
pending.

