
Lecture Notes on Operating Systems

Marvin Solomon
Computer Sciences Department
University of Wisconsin -- Madison
solomon@cs.wisc.edu
Mon Jan 24 13:28:57 CST 2000

Copyright © 1996-1999 by Marvin Solomon. All rights reserved.


Contents
• Introduction
• History
• What is an OS For?
• Bottom-up View
• Top-Down View
• Course Outline
• Java for C++ Programmers
• Processes and Synchronization
• Using Processes
• What is a Process?
• Why Use Processes
• Creating Processes
• Process States
• Synchronization
• Race Conditions
• Semaphores
• The Bounded Buffer Problem
• The Dining Philosophers
• Monitors
• Messages
• Deadlock
• Terminology
• Deadlock Detection
• Deadlock Recovery
• Deadlock Prevention
• Deadlock Avoidance
• Implementing Processes
• Implementing Monitors
• Implementing Semaphores
• Implementing Critical Sections
• Short-term Scheduling
• Memory Management
• Allocating Main Memory
• Algorithms for Memory Management
• Compaction and Garbage Collection
• Swapping
• Paging
• Page Tables
• Page Replacement
• Frame Allocation for a Single Process
• Frame Allocation for Multiple Processes
• Paging Details
• Segmentation
• Multics
• Intel x86
• Disks
• File Systems
• The User Interface to Files
• Naming
• File Structure
• File Types
• Access Modes
• File Attributes
• Operations
• The User Interface to Directories
• Implementing File Systems
• Files
• Directories
• Symbolic Links
• Mounting
• Special Files
• Long File Names
• Space Management
• Block Size and Extents
• Free Space
• Reliability
• Bad-block Forwarding
• Back-up Dumps
• Consistency Checking
• Transactions
• Performance
• Protection and Security
• Security
• Threats
• The Trojan Horse
• Design Principles
• Authentication
• Protection Mechanisms
• Access Control Lists
• Capabilities
• Encryption
• Key Distribution
• Public Key Encryption
CS 537 Lecture Notes Part 1
Introduction
Contents
• History
• What is an OS For?
• Bottom-up View
• Top-Down View
• Course Outline

History
The first computers were built for military purposes during World War II, and the first commercial
computers were built during the 50's. They were huge (often filling a large room with tons of
equipment), expensive (millions of dollars, back when that was a lot of money), unreliable, and slow
(about the power of today's $1.98 pocket calculator). Originally, there was no distinction between
programmer, operator, and end-user (the person who wants something done). A physicist who wanted to
calculate the trajectory of a missile would sign up for an hour on the computer. When his time came, he
would come into the room, feed in his program from punched cards or paper tape, watch the lights
flash, maybe do a little debugging, get a print-out, and leave.
The first card in the deck was a bootstrap loader. The user/operator/programmer would push a
button that caused the card reader to read that card, load its contents into the first 80 locations in
memory, and jump to the start of memory, executing the instructions on that card. Those instructions
read in the rest of the cards, which contained the instructions to perform all the calculations desired:
what we would now call the "application program".
This set-up was a lousy way to debug a program, but more importantly, it was a waste of the
fabulously expensive computer's time. Then someone came up with the idea of batch processing.
User/programmers would punch their jobs on decks of cards, which they would submit to a
professional operator. The operator would combine the decks into batches. He would precede the batch
with a batch executive (another deck of cards). This program would read the remaining programs into
memory, one at a time, and run them. The operator would take the printout from the printer, tear off the
part associated with each job, wrap it around the associated deck, and put it in an output bin for the user
to pick up. The main benefit of this approach was that it minimized the wasteful down time between
jobs. However, it did not solve the growing I/O bottleneck.
Card readers and printers got faster, but since they are mechanical devices, there were limits to how
fast they could go. Meanwhile the central processing unit (CPU) kept getting faster and was spending
more and more time idly waiting for the next card to be read in or the next line of output to be printed.
The next advance was to replace the card reader and printer with magnetic tape drives, which were
much faster. A separate, smaller, slower (and presumably cheaper) peripheral computer would copy
batches of input decks onto tape and transcribe output tapes to print. The situation was better, but there
were still problems. Even magnetic tape drives were not fast enough to keep the mainframe CPU busy,
and the peripheral computers, while cheaper than the mainframe, were still not cheap (perhaps
hundreds of thousands of dollars).
Then someone came up with a brilliant idea. The card reader and printer were hooked up to the
mainframe (along with the tape drives) and the mainframe CPU was reprogrammed to switch rapidly
among several tasks. First it would tell the card reader to start reading the next card of the next input
batch. While it was waiting for that operation to finish, it would go and work for a while on another job
that had been read into "core" (main memory) earlier. When enough time had gone by for that card to be
read in, the CPU would temporarily set aside the main computation, start transferring the data from that
card to one of the tape units (say tape 1), start the card reader reading the next card, and return to the
main computation. It would continue this way, servicing the card reader and tape drive when they
needed attention and spending the rest of its time on the main computation. Whenever it finished
working on one job in the main computation, the CPU would read another job from an input tape that
had been prepared earlier (tape 2). When it finished reading in and executing all the jobs from tape 2, it
would swap tapes 1 and 2. It would then start executing the jobs from tape 1, while the input "process"
was filling up tape 2 with more jobs from the card reader. Of course, while all this was going on, a
similar process was copying output from yet another tape to the printer. This amazing juggling act was
called Simultaneous Peripheral Operations On Line, or SPOOL for short.
The hardware that enabled SPOOLing is called direct memory access, or DMA. It allows the card
reader to copy data directly from cards to core and the tape drive to copy data from core to tape, while
the expensive CPU is doing something else. The software that enabled SPOOLing is called
multiprogramming. The CPU switches from one activity, or "process" to another so quickly that it
appears to be doing several things at once.
In the 1960's, multiprogramming was extended to ever more ambitious forms. The first extension
was to allow more than one job to execute at a time. Hardware developments supporting this extension
included the decreasing cost of core memory (replaced during this period by semiconductor random-access
memory (RAM)) and the introduction of direct-access storage devices (called DASD - pronounced
"dazdy" - by IBM and "disks" by everyone else). With larger main memory, multiple jobs could be kept
in core at once, and with input spooled to disk rather than tape, each job could get directly at its part of
the input. With more jobs in memory at once, it became less likely that they would all be
simultaneously blocked waiting for I/O, leaving the expensive CPU idle.
Another break-through idea from the 60's based on multiprogramming was timesharing, which
involves running multiple interactive jobs, switching the CPU rapidly among them so that each
interactive user feels as if he has the whole computer to himself. Timesharing let the programmer back
into the computer room - or at least a virtual computer room. It allowed the development of interactive
programming, making programmers much more productive. Perhaps more importantly, it supported
new applications such as airline reservation and banking systems that allowed 100s or even 1000s of
agents or tellers to access the same computer "simultaneously". Visionaries talked about a "computing
utility" by analogy with the water and electric utilities, which would deliver low-cost computing power
to the masses. Of course, it didn't quite work out that way. The cost of computers dropped faster than
almost anyone expected, leading to mini computers in the '70s and personal computers (PCs) in the
80's. It was only in the 90's that the idea was revived, in the form of an information utility otherwise
known as the information superhighway or the World-Wide Web.
Today, computers are used for a wide range of applications, including personal interactive use
(word-processing, games, desktop publishing, web browsing, email), real-time systems (patient care,
factories, missiles), embedded systems (cash registers, wrist watches, toasters), and transaction
processing (banking, reservations, e-commerce).
What is an OS For?
Beautification Principle
The goal of an OS is to make hardware look better than it is.
• More regular, uniform (instead of lots of idiosyncratic devices)

• Easier to program (e.g., don't have to worry about speeds, asynchronous


events)
• Closer to what's needed for applications:

• named, variable-length files, rather than disk blocks

• multiple ``CPU's'', one for each user (in shared system) or activity (in
single-user system)
• multiple large, dynamically-growing memories (virtual memory)
Resource principle
• The goal of an OS is to mediate sharing of scarce resources
Q: What is a ``resource''?
A: Something that costs money!
• Why share?
• expensive devices
• need to share data (database is an ``expensive device''!)
• cooperation between people (community is an ``expensive device''!!)
• Problems:
• getting it to work at all
• getting it to work efficiently
• utilization (keeping all the devices busy)
• throughput (getting a lot of useful work done per hour)
• response (getting individual things done quickly)
• getting it to work correctly
• limiting the effects of bugs (preventing idiots from ruining it for everyone)
• preventing unauthorized
• access to data
• modification of data
• use of resources
(preventing bad guys from ruining it for everyone)
Bottom-up View (starting with the hardware)
Hardware (summary; more details later)
• components
• one or more central processing units (CPU's)
• main memory (RAM, core)
• I/O devices
• bus, or other communication mechanism connects them all together
• CPU has a PC1
• fetches instructions one at a time from location specified by PC
• increments PC after fetching instruction; branch instructions can also alter the PC
• responds to "interrupts" by jumping to a different location (like an unscheduled
procedure call)
• Memory responds to "load" and "store" requests from the CPU, one at a time.
• I/O device
• Usually looks like a chunk of memory to the CPU.
• CPU sets options and starts I/O by sending "store" requests to a particular address.
• CPU gets back status and small amounts of data by issuing "load" requests.
• Direct memory access (DMA): Device may transfer large amounts of data directly
to/from memory by doing loads and stores just like a CPU.
• Issues an interrupt to the CPU to indicate that it is done.
Timing problem
• I/O devices are millions or even billions of times slower than CPU.
• E.g.:
• Typical PC is >10 million instructions/sec
• Typical disk takes >10 ms to get one byte from disk; ratio: 100,000 : 1
• Typical typist = 60 wpm ≈ 1 word/sec = 5 bytes/sec = 200 ms per keystroke = 2 million instructions per
keystroke. And that doesn't include head-scratching time!
• Solution:
start disk device
do 100,000 instructions of other useful computation
wait for disk to finish

• Terrible program to write and debug. And it would change with a faster disk!
• Better solution:
Process 1:
for (;;) {
start I/O
wait for it to finish
use the data for something
}
Process 2:
for (;;) {
do some useful computation
}
Operating system takes care of switching back and forth between process 1 and process 2 as
``appropriate''.
(Question: which process should have higher priority?)
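To make the idea concrete, here is a minimal sketch of the same structure written with Java threads (threads are
introduced in Part 2 of these notes). The names IoLoop, ComputeLoop, readBlock(), useData(), and doUsefulWork()
are hypothetical placeholders, not real library calls.

    class IoLoop implements Runnable {
        public void run() {
            for (;;) {
                byte[] data = readBlock();   // start I/O and wait for it to finish
                useData(data);               // use the data for something
            }
        }
        private byte[] readBlock() { return new byte[512]; }   // placeholder for real I/O
        private void useData(byte[] data) { }                  // placeholder
    }

    class ComputeLoop implements Runnable {
        public void run() {
            for (;;) {
                doUsefulWork();              // do some useful computation
            }
        }
        private void doUsefulWork() { }                        // placeholder
    }

    // The operating system (or Java runtime) switches the CPU between them:
    // new Thread(new IoLoop()).start();
    // new Thread(new ComputeLoop()).start();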
Space problem
• Most of the time, a typical program is "wasting" most of the memory space allocated to it.
• Looping in one subroutine (wasting space allocated to rest of program)
• Fiddling with one data structure (wasting space allocated to other data structures)
• Waiting for I/O or user input (wasting all of its space)
• Solution: virtual memory
• Keep program and data on disk (100-1000 times cheaper/byte).
• OS automatically copies to memory pieces needed by program on demand.

Top-Down View (what does it look like to various kinds of users?)


• End user.
• Wants to get something done (bill customers, write a love letter, play a game, design a
bomb).
• Doesn't know what an OS is (or care!)
May not even realize there is a computer there.
• Application programmer.
• Writes software for end users. Uses ``beautified'' virtual machine
• named files of unlimited size
• unlimited memory
• read/write returns immediately
• Calls library routines
• some really are just subroutines written by someone else
• sort an array
• solve a differential equation
• search a string for a character
• others call the operating system
• read/write
• create process
• get more memory
• Systems programmer (you, at the end of this course)
• Creates abstractions for application programmers
• Deals with real devices
Course Outline
1. Processes.
• What processes are.
• Using processes
• synchronization and communication
• semaphores, critical regions, monitors, conditions,
• messages, pipes
• process structures
• pipelines, producer/consumer, remote procedure call
• deadlock
• Implementing processes
• mechanism
• critical sections
• process control block
• process swap
• semaphores, monitors
• policy (short-term scheduling)
• fcfs, round-robin, shortest-job next, multilevel queues
2. Memory
• Main-memory allocation
• Swapping, overlays
• Stack allocation (implementation of programming languages)
• Virtual memory hardware
• paging, segmentation, translation lookaside buffer
• policy
• page-replacement algorithms
• random, fifo, lru, clock, working set
3. I/O devices
• device drivers, interrupt handlers
• disks
• hardware characteristics
• disk scheduling
• elevator algorithm
4. File systems
• file naming
• file structure (user's view)
• flat (array of bytes)
• record-structured
• indexed
• random-access
• metadata
• mapped files
• implementation
• structure
• linked, tree-structured, B-tree
• inodes
• directories
• free-space management
5. Protection and security
• threats
• access policy
• capabilities, access-control lists
• implementation
• authentication/determination/enforcement
• encryption
• conventional
• public-key
• digital signatures
1
In this course PC stands for program counter, not personal computer or politically correct
CS 537 Lecture Notes Part 2
Java for C++ Programmers
Contents
• Introduction
• A First Example
• Names, Packages, and Separate Compilation
• Values, Objects, and Pointers
• Garbage Collection
• Static, Final, Public, and Private
• Arrays
• Strings
• Constructors and Overloading
• Inheritance, Interfaces, and Casts
• Exceptions
• Threads
• Input and Output
• Other Goodies

Introduction
The purpose of these notes is to help students in Computer Sciences 537 (Introduction to Operating
Systems) at the University of Wisconsin - Madison learn enough Java to do the course projects. The
Computer Sciences Department is in the process of converting most of its classes from C++ to Java as
the principal language for programming projects. CS 537 was the first course to make the switch, Fall
term, 1996. At that time virtually all the students had heard of Java and none had used it. Over the last
few years more and more of our courses were converted to Java. Finally last year (1998-99), the
introductory programming prerequisites for this course, CS 302 and CS 367, were taught in Java.
Nonetheless, many students are unfamiliar with Java, having learned how to program from earlier
versions of 302 and 367, or from courses at other institutions.
Applications vs Applets
The first thing you have to decide when writing a Java program is whether you are writing an
application or an applet. An applet is a piece of code designed to display a part of a document. It is run
by a browser (such as Netscape Navigator or Microsoft Internet Explorer) in response to an <applet>
tag in the document. We will not be writing any applets in this course.
An application is a stand-alone program. All of our programs will be applications.
Java was originally designed to build active, multimedia, interactive environments, so its standard
runtime library has lots of features to aid in creating user interfaces. There are standard classes to create
scrollbars, pop-up menus, etc. There are special facilities for manipulating URL's and network
connections. We will not be using any of these features. On the other hand, there is one thing operating
systems and user interfaces have in common: They both require multiple, cooperating threads of
control. We will be using those features in this course.
JavaScript
You may have heard of JavaScript. JavaScript is an addition to HTML (the language for writing
Web pages) that supports creation of ``subroutines''. It has a syntax that looks sort of like Java, but
otherwise it has very little to do with Java. I have heard one very good analogy: JavaScript is to Java as
the C Shell (csh) is to C.
The Java API
The Java language is actually rather small and simple - an order of magnitude smaller and simpler
than C++, and in some ways, even smaller and simpler than C. However, it comes with a very large and
constantly growing library of utility classes. Fortunately, you only need to know about the parts of this
library that you really need, you can learn about it a little at a time, and there is excellent, browsable,
on-line documentation. These libraries are grouped into packages. One set of about 60 packages,
called the Java 2 Platform API, comes bundled with the language (API stands for "Application
Programming Interface"). You will probably only use classes from three of these packages:
• java.lang contains things like character-strings, that are essentially "built in" to the language.
• java.io contains support for input and output, and
• java.util contains some handy data structures such as lists and hash tables.

A First Example
Large parts of Java are identical to C++. For example, the following procedure, which sorts an
array of integers using insertion sort, is exactly the same in C++ or Java.1

/** Sort the array a[] in ascending order
** using an insertion sort.
*/
void sort(int a[], int size) {
for (int i = 1; i < size; i++) {
// a[0..i-1] is sorted
// insert a[i] in the proper place
int x = a[i];
int j;
for (j = i-1; j >=0; --j) {
if (a[j] <= x)
break;
a[j+1] = a[j];
}
// now a[0..j] are all <= x
// and a[j+2..i] are > x
a[j+1] = x;
}
}
Note that the syntax of control structures (such as for and if), assignment statements, variable
declarations, and comments are all the same in Java as in C++.
To test this procedure in a C++ program, we might use a ``main program'' like this:

#include <iostream.h>
#include <stdlib.h>
extern "C" long random();

/** Test program to test sort */


int main(int argc, char *argv[]) {
if (argc != 2) {
cerr << "usage: sort array-size" << endl;
exit(1);
}
int size = atoi(argv[1]);
int *test = new int[size];
for (int i = 0; i < size; i++)
test[i] = random() % 100;
cout << "before" << endl;
for (int i = 0; i < size; i++)
cout << " " << test[i];
cout << endl;

sort(test, size);

cout << "after" << endl;


for (int i = 0; i < size; i++)
cout << " " << test[i];
cout << endl;
return 0;
}
A Java program to test the sort procedure is different in a few ways. Here is a complete Java
program using the sort procedure.
import java.io.*;
import java.util.Random;
class SortTest {
/** Sort the array a[] in ascending order
** using an insertion sort.
*/
static void sort(int a[], int size) {
for (int i = 1; i < size; i++) {
// a[0..i-1] is sorted
// insert a[i] in the proper place
int x = a[i];
int j;
for (j = i-1; j >=0; --j) {
if (a[j] <= x)
break;
a[j+1] = a[j];
}
// now a[0..j] are all <= x
// and a[j+2..i] are > x
a[j+1] = x;
}
}

/** Test program to test sort */


public static void main(String argv[]) {
if (argv.length != 1) {
System.out.println("usage: sort array-size");
System.exit(1);
}
int size = Integer.parseInt(argv[0]);
int test[] = new int[size];
Random r = new Random();

for (int i = 0; i < size; i++)


test[i] = (int)(r.nextFloat() * 100);
System.out.println("before");
for (int i = 0; i < size; i++)
System.out.print(" " + test[i]);
System.out.println();

sort(test, size);

System.out.println("after");
for (int i = 0; i < size; i++)
System.out.print(" " + test[i]);
System.out.println();

System.exit(0);
}
}
A copy of this program is available in ~cs537-1/public/examples/SortTest.java.
To try it out, create a new directory and copy the example to a file named SortTest.java in that directory
or visit it with your web browser and use the Save As... option from the File menu. The file must be called
SortTest.java!

mkdir test1
cd test1
cp ~cs537-1/public/examples/SortTest.java SortTest.java
javac SortTest.java
java SortTest 10
(The C++ version of the program is also available in ~cs537-1/public/examples/sort.cc ).
The javac command invokes the Java compiler on the source file SortTest.java. If all goes well,
it will create a file named SortTest.class, which contains code for the Java virtual machine. The java
command invokes the Java interpreter to run the code for class SortTest. Note that the first parameter is
SortTest, not SortTest.class or SortTest.java because it is the name of a class, not a file.
There are several things to note about this program. First, Java has no ``top-level'' or ``global''
variables or functions. A Java program is always a set of class definitions. Thus, we had to make sort
and main member functions (called ``methods'' in Java) of a class, which we called SortTest.
Second, the main function is handled somewhat differently in Java from C++. In C++, the first function
to be executed is always a function called main, which has two arguments and returns an integer value.
The return value is the ``exit status'' of the program; by convention, a status of zero means ``normal
termination'' and anything else means something went wrong. The first argument is the number of
words on the command-line that invoked the program, and the second argument is an array of
character strings (denoted char *argv[] in C++) containing those words. If we invoke the program by
typing

sort 10
we will find that argc==2, argv[0]=="sort", and argv[1]=="10".
In Java, the first thing executed is the method called main of the indicated class (in this case
SortTest). The main method does not return any value (it is of type void). For now, ignore the words
``public static'' preceding void. We will return to these later. The main method takes only one
parameter, an array of strings (denoted String argv[] in Java). This array will have one element for each
word on the command line following the name of the class being executed. Thus in our example call,

java SortTest 10
argv[0] == "10". There is no separate argument to tell you how many words there are, but in
Java, you can tell how big any array is by using length. In this case argv.length == 1, meaning argv
contains only one word.
The third difference to note is the way I/O is done in Java. System.out in Java is roughly
equivalent to cout in C++ (or stdout in C), and

System.out.println(whatever);
is (even more) roughly equivalent to
cout << whatever << endl;
Our C++ program used three functions from the standard library, atoi, random, and exit.
Integer.parseInt does the same thing as atoi: It converts the character-string "10" to the integer value
ten, and System.exit(1) does the same thing as exit(1): It immediately terminates the program, returning
an exit status of 1 (meaning something's wrong). The library class Random defines random-number
generators. The statement Random r = new Random() creates an instance of this class, and r.nextFloat()
uses it to generate a floating point number between 0 and 1. The cast (int) means the same thing in Java
as in C++. It converts its floating-point argument to an integer, throwing away the fraction.
Finally, note that the #include directives from C++ have been replaced by import declarations.
Although they have roughly the same effect, the mechanisms are different. In C++, #include
<iostream.h> pulls in a source file called iostream.h from a source library and compiles it along with
the rest of the program. #include is usually used to include files containing declarations of library
functions and classes, but the file could contain any C++ source code whatever. The Java declaration
import java.util.Random imports the pre-compiled class Random from a package called java.util. The
next section explains more about packages.

Names, Packages, and Separate Compilation


As in C or C++, case is significant in identifiers in Java. Aside from a few reserved words, like if,
while, etc., the Java language places no restrictions on what names you use for functions, variables,
classes, etc. However, there is a standard naming convention, which all the standard Java libraries
follow, and which you must follow in this class.
• Names of classes are in MixedCase starting with a capital letter. If the most natural name for the
class is a phrase, start each word with a capital letter, as in StringBuffer.
• Names of "constants" (see below) are ALL_UPPER_CASE. Separate words of phrases with
underscores as in MIN_VALUE.
• Other names (functions, variables, reserved words, etc.) are in lower case or mixedCase, starting
with a lower-case letter.
A more extensive set of guidelines is included in the Java Language Specification.
Simple class definitions in Java look rather like class definitions in C++ (although, as we shall see later,
there are important differences).
class Pair { int x, y; }
Each class definition should go in a separate file, and the name of the source file must be
exactly the same (including case) as the name of the class, with ".java" appended. For example, the
definition of Pair must go in file Pair.java. The file is compiled as shown above and produces a .class
file. There are exceptions to the rule that requires a separate source file for each class. In particular,
class definitions may be nested. However, this is an advanced feature of Java, and you should never
nest class definitions unless you know what you're doing!
There is a large set of predefined classes, grouped into packages. The full name of one of these
predefined classes includes the name of the package as prefix. We already saw the class
java.util.Random. The import statement allows you to omit the package name from one of these classes.
Because the SortTest program starts with
import java.util.Random;
we can write
Random r = new Random();
rather than
java.util.Random r = new java.util.Random();
You can import all the classes in a package at once with a notation like
import java.io.*;
The package java.lang is special; every program behaves as if it started with
import java.lang.*;
whether it does or not. You can define your own packages, but defining packages is an advanced
topic beyond the scope of what's required for this course.
The import statement doesn't really "import" anything. It just introduces a convenient
abbreviation for a fully-qualified class name. When a class needs to use another class, all it has to do is
use it. The Java compiler will know that it is supposed to be a class by the way it is used, will import
the appropriate .class file, and will even compile a .java file if necessary. (That's why it's important for
the name of the file to match the name of the class). For example, here is a simple program that uses
two classes:
class HelloTest {
public static void main(String[] args) {
Hello greeter = new Hello();
greeter.speak();
}
}
class Hello {
void speak() {
System.out.println("Hello World!");
}
}
Put each class in a separate file (HelloTest.java and Hello.java). Then try this:
javac HelloTest.java
java Hello
You should see a cheery greeting. If you type ls you will see that you have both HelloTest.class
and Hello.class even though you only asked to compile HelloTest.java. The Java compiler figured out
that class HelloTest uses class Hello and automatically compiled it. Try this to learn more about what's
going on:
rm -f *.class
javac -verbose HelloTest.java
java Hello

Values, Objects, and Pointers


It is sometimes said that Java doesn't have pointers. That is not true. In fact, objects can only be
referenced with pointers. More precisely, variables can hold primitive values (such as integers or
floating-point numbers) or references (pointers) to objects. A variable cannot hold an object, and you
cannot make a pointer to a primitive value. Since you don't have a choice, Java doesn't have a special
notation like C++ does to indicate when you want to use a pointer.
There are exactly eight primitive types in Java, boolean, char, byte, short, int, long, float, and
double. Most of these are similar to types with the same name in C++. We mention only the differences.
A boolean value is either true or false. You cannot use an integer where a boolean is required
(e.g. in an if or while statement) nor is there any automatic conversion between boolean and integer.
A char value is 16 bits rather than 8 bits, as it is in C or C++, to allow for all sorts of international
alphabets. As a practical matter, however, you are unlikely to notice the difference. The byte type is an
8-bit signed integer (like signed char in C or C++).
A short is 16 bits and an int is 32 bits, just as in C or C++ on most modern machines (in C++ the
size is machine-dependent, but in Java it is guaranteed to be 32 bits). A Java long is not the same as in
C++; it is 64 bits long--twice as big as a normal int--so it can hold any value from
-9,223,372,036,854,775,808 to 9,223,372,036,854,775,807. The types float and double are just like in
C++: 32-bit and 64-bit floating point.
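A few concrete consequences of these rules, as a small sketch:

    int flag = 1;
    // if (flag) { ... }          // error in Java: an int is not a boolean
    if (flag != 0) { /* ok */ }   // the comparison must be explicit

    long big = 9000000000L;       // too big for an int; note the L suffix
    char c = '\u00e9';            // chars are 16 bits, so non-ASCII letters fit
    byte b = -128;                // 8-bit signed: range -128..127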
As in C++, objects are instances of classes. There is no prefix * or & operator or infix ->
operator.
As an example, consider the class declaration (which is the same in C++ and in Java)
class Pair { int x, y; }

C++                          Java
Pair origin;                 Pair origin = new Pair();
Pair *p, *q, *r;             Pair p, q, r;
origin.x = 0;                origin.x = 0;
p = new Pair;                p = new Pair();
p -> y = 5;                  p.y = 5;
q = p;                       q = p;
r = &origin;                 (not possible)

As in C or C++, arguments to a Java procedure are passed ``by value'':

void f() {
int n = 1;
Pair p = new Pair();
p.x = 2; p.y = 3;
System.out.println(n); // prints 1
System.out.println(p.x); // prints 2
g(n,p);
System.out.println(n); // still prints 1
System.out.println(p.x); // prints 100
}
void g(int num, Pair ptr) {
System.out.println(num); // prints 1
num = 17; // changes only the local copy
System.out.println(num); // prints 17

System.out.println(ptr.x);// prints 2
ptr.x = 100; // changes the x field of caller's Pair
ptr = null; // changes only the local ptr
}
The formal parameters num and ptr are local variables in the procedure g initialized with copies
of the values of n and p. Any changes to num and ptr affect only the copies. However, since ptr and p
point to the same object, the assignment to ptr.x in g changes the value of p.x.
Unlike C++, Java has no way of declaring reference parameters, and unlike C++ or C, Java has
no way of creating a pointer to a (non-object) value, so you can't do something like this

/* C or C++ */
void swap1(int *xp, int *yp) {
int tmp;
tmp = *xp;
*xp = *yp;
*yp = tmp;
}
int foo = 10, bar = 20;
swap1(&foo, &bar); /* now foo==20 and bar==10 */

// C++ only
void swap2(int &xp, int &yp) {
int tmp;
tmp = xp;
xp = yp;
yp = tmp;
}
int this_one = 88, that_one = 99;
swap2(this_one, that_one); // now this_one==99 and that_one==88
You'll probably miss reference parameters most in situations where you want a procedure to
return more than one value. As a work-around you can return an object or array or pass in a pointer to
an object. See Section 2.6 on page 36 of the Java book for more information.
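As a concrete sketch of both work-arounds (the class names MinMax and Stats are made up for this
illustration): minMax returns two values by packaging them in a small object, and swap changes the
caller's data by being passed an array whose elements it can modify.

    class MinMax {
        int min, max;
    }

    class Stats {
        // return two values by returning an object
        static MinMax minMax(int[] a) {
            MinMax result = new MinMax();
            result.min = a[0];
            result.max = a[0];
            for (int i = 1; i < a.length; i++) {
                if (a[i] < result.min) result.min = a[i];
                if (a[i] > result.max) result.max = a[i];
            }
            return result;
        }

        // "swap" by passing an array; the callee can modify its elements
        static void swap(int[] pair) {
            int tmp = pair[0];
            pair[0] = pair[1];
            pair[1] = tmp;
        }
    }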

Garbage Collection
New objects are created by the new operator in Java just like C++ (except that an argument list is
required after the class name, even if the constructor for the class doesn't take any arguments, so the list
is empty). However, there is no delete operator. The Java system automatically deletes objects when no
references to them remain. This is a much more important convenience than it may at first seem. The delete
operator is extremely error-prone. Deleting objects too early can lead to dangling references, as in
p = new Pair();
// ...
q = p;
// ... much later
delete p;
q -> x = 5; // oops!
while deleting them too late (or not at all) can lead to garbage, also known as a storage leak.

Static, Final, Public, and Private


Just as in C++, it is possible to restrict access to members of a class by declaring them private,
but the syntax is different:
In C++:
class C {
private:
int i;
double d;
public:
int j;
void f() { /*...*/ }
}
In Java:
class C {
private int i;
public int j;
private double d;
public void f() { /* ... */ }
}
As in C++, private members can only be accessed from inside the bodies of methods (function
members) of the class, not ``from the outside.'' Thus if x is an instance of C, x.i is not legal, but i can be
accessed from the body of x.f(). (protected is also supported; it is similar to C++, except that in Java a protected member can also be accessed from anywhere in the same package.)
The default (if neither public nor private is specified) is that a member can be accessed from anywhere
in the same package, giving a facility rather like ``friends'' in C++. You will probably be putting all
your classes in one package, so the default is essentially public, but you should not rely on this default.
In this course, every member must be declared public, protected, or private.
The keyword static also means the same thing in Java as in C++, which is not what the word implies:
Ordinary members have one copy per instance, whereas a static member has only one copy, which is
shared by all instances. In effect, a static member lives in the class itself, rather than instances.
class C {
int x = 1; // by the way, this is ok in Java but not C++
static int y = 1;
void f(int n) { x += n; }
static int g() { return ++y; }
}
C p = new C();
C q = new C();
p.f(3);
q.f(5);
System.out.println(p.x); // prints 4
System.out.println(q.x); // prints 6
System.out.println(C.y); // prints 1
System.out.println(p.y); // means the same thing
System.out.println(C.g());// prints 2
System.out.println(q.g());// prints 3
Static members are often used instead of global variables and functions, which do not exist in
Java. For example,
Math.tan(x); // tan is a static method of class Math
Math.PI; // a static "field" of class Math with value 3.14159...
Integer.parseInt("10"); // used in the sorting example
The keyword final is roughly equivalent to const in C++: final fields cannot be changed. It is
often used in conjunction with static to define named constants.
class Card {
public int suit = CLUBS; // default
public final static int CLUBS = 1;
public final static int DIAMONDS = 2;
public final static int HEARTS = 3;
public final static int SPADES = 4;
}
Card c = new Card();
c.suit = Card.SPADES;
Each Card has its own suit. The value CLUBS is shared by all instances of Card so it only needs
to be stored once, but since it's final, it doesn't need to be stored at all!

Arrays
In Java, arrays are objects. Like all objects in Java, you can only point to them, but unlike a C++
array variable, which is treated like a pointer to the first element of the array, a Java array variable points to
the whole array object. There is no way to point to a particular slot in an array.
Each array has a read-only (final) field length that tells you how many elements it has. The elements are
numbered starting at zero as in C++: a[0] ... a[a.length-1]. Once you create an array (using new), you
can't change its size. If you need more space, you have to create a new (larger) array and copy over the
elements (but see the library class Vector below).

int x = 3; // a value
int[] a; // a pointer to an array object; initially null
int a[]; // means exactly the same thing (for compatibility with C)
a = new int[10]; // now a points to an array object
a[3] = 17; // accesses one of the slots in the array
a = new int[5]; // assigns a different array to a
// the old array is inaccessible (and so
// is garbage-collected)
int[] b = a; // a and b share the same array object
System.out.println(a.length); // prints 5
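For instance, here is a sketch of "growing" an array by hand, as described above: allocate a bigger
array and copy the old elements into it.

    int[] a = new int[10];
    // ... later, when more room is needed:
    int[] bigger = new int[2 * a.length];
    System.arraycopy(a, 0, bigger, 0, a.length); // copy the existing elements
    a = bigger;   // the old array becomes inaccessible and is garbage-collected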

Strings
Since you can make an array of anything, you can make an array of char or an array of
byte, but Java has something much better: the type String. The + operator is overloaded on Strings to
mean concatenation. What's more, you can concatenate anything with a string; Java automatically
converts it to a string. Built-in types such as numbers are converted in the obvious way. Objects are
converted by calling their toString() methods. Library classes all have toString methods that do
something reasonable. You should do likewise for all classes you define. This is great for debugging.
String s = "hello";
String t = "world";
System.out.println(s + ", " + t); // prints "hello, world"
System.out.println(s + "1234"); // "hello1234"
System.out.println(s + (12*100 + 34)); // "hello1234"
System.out.println(s + 12*100 + 34); // "hello120034" (why?)
System.out.println("The value of x is " + x); // will work for any x
System.out.println("System.out = " + System.out);
// "System.out = java.io.PrintStream@80455198"
String numbers = "";
for (int i=0; i<5; i++)
numbers += " " + i;
System.out.println(numbers); // " 0 1 2 3 4"
Strings have lots of other useful operations:
String s = "whatever", t = "whatnow";
s.charAt(0); // 'w'
s.charAt(3); // 't'
t.substring(4); // "now" (positions 4 through the end)
t.substring(4,6); // "no" (positions 4 and 5, but not 6)
s.substring(0,4); // "what" (positions 0 through 3)
t.substring(0,4); // "what"
s.compareTo(t); // a value less than zero
// s precedes t in "lexicographic"
// (dictionary) order
t.compareTo(s); // a value greater than zero (t follows s)
t.compareTo("whatnow"); // zero
t.substring(0,4) == s.substring(0,4);
// false (they are different String objects)
t.substring(0,4).equals(s.substring(0,4));
// true (but they are both equal to "what")
t.indexOf('w'); // 0
t.indexOf('t'); // 3
t.indexOf("now"); // 4
t.lastIndexOf('w'); // 6
t.endsWith("now"); // true
and more.
You can't modify a string, but you can make a string variable point to a new string (as in
numbers += " " + i;). See StringBuffer if you want a string you can scribble on.
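A small sketch of StringBuffer, which (unlike String) can be modified in place; the append calls here
avoid creating a new String object on every iteration:

    StringBuffer buf = new StringBuffer();
    for (int i = 0; i < 5; i++) {
        buf.append(" ").append(i);   // modifies buf in place
    }
    String numbers = buf.toString(); // " 0 1 2 3 4"
    buf.setCharAt(1, 'X');           // change a single character
    buf.reverse();                   // reverse the contents in place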

Constructors and Overloading


A constructor is like in C++: a method with the same name as the class. If a constructor has
arguments, you supply corresponding values when using new. Even if it has no arguments, you still
need the parentheses (unlike C++). There can be multiple constructors, with different numbers or types
of arguments. The same is true for other methods. This is called overloading. Unlike C++, you cannot
overload operators. The operator `+' is overloaded for strings and (various kinds of) numbers, but user-
defined overloading is not allowed.

class Pair {
int x, y;
Pair(int u, int v) {
x = u; // the same as this.x = u
y = v;
}
Pair(int x) {
this.x = x; // not the same as x = x!
y = 0;
}
Pair() {
x = 0;
y = 0;
}
}
class Test {
public static void main(String[] argv) {
Pair p1 = new Pair(3,4);
Pair p2 = new Pair(); // same as new Pair(0,0)
Pair p3 = new Pair; // error!
}
}
NB: The bodies of the methods have to be defined in line right after their headers as shown
above. You have to write
class Foo {
double square(double d) { return d*d; }
};
rather than
class Foo {
double square(double);
};
double Foo::square(double d) { return d*d; }
// ok in C++ but not in Java
Inheritance, Interfaces, and Casts
In C++, when we write
class Derived : public Base { ... }
we mean two things:
• A Derived can do anything a Base can, and perhaps more.
• A Derived does things the way a Base does them, unless specified otherwise.
The first of these is called interface inheritance or subtyping and the second is called method
inheritance. In Java, they are specified differently.
Method inheritance is specified with the keyword extends.
class Base {
int f() { /* ... */ }
void g(int x) { /* ... */ }
}
class Derived extends Base {
void g(int x) { /* ... */ }
double h() { /* ... */ }
}
Class Derived has three methods: f, g, and h. The method Derived.f() is implemented in the
same way (the same executable code) as Base.f(), but Derived.g() overrides the implementation of
Base.g(). We call Base the super class of Derived and Derived a subclass of Base. Every class (with
one exception) has exactly one super class (single inheritance). If you leave out the extends
specification, Java treats it like ``extends Object''. The primordial class Object is the lone exception -- it
does not extend anything. All other classes extend Object either directly or indirectly. Object has a
method toString, so every class has a method toString; either it inherits the method from its super class
or it overrides it.
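Since every class inherits toString from Object, supplying your own is just a matter of overriding it. A
small sketch, reusing the Pair class from earlier with a toString method added:

    class Pair {
        int x, y;
        public String toString() {
            return "(" + x + ", " + y + ")";
        }
    }

    Pair p = new Pair();
    p.x = 3; p.y = 4;
    System.out.println("p = " + p);  // concatenation calls p.toString(): prints "p = (3, 4)"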
Interface inheritance is specified with implements. A class implements an Interface, which is
like a class, except that the methods don't have bodies. Two examples are given by the built-in
interfaces Runnable and Enumeration.
interface Runnable {
void run();
}
interface Enumeration {
Object nextElement();
boolean hasMoreElements();
}
An object is Runnable if it has a method named run that is public2 and has no arguments or
results. To be an Enumeration, a class has to have a public method nextElement() that returns an Object
and a public method hasMoreElements that returns a boolean. A class that claims to implement these
interfaces has to either inherit them (via extends) or define them itself.
class Words extends StringTokenizer implements Enumeration, Runnable {
public void run() {
while (hasMoreTokens()) {
String s = nextToken();
System.out.println(s);
}
}
Words(String s) {
super(s);
// perhaps do something else with s as well
}
}
The class Words needs methods run, hasMoreElements, and nextElement to meet its promise to
implement interfaces Runnable and Enumeration. It inherits implementations of hasMoreElements and
nextElement from StringTokenizer , but it has to give its own implementation of run. The implements
clause tells users of the class what they can expect from it. If w is an instance of Words, I know I can
write
w.run();
or
if (w.hasMoreElements()) ...
A class can only extend one class, but it can implement any number of interfaces.
By the way, constructors are not inherited. The call super(s) in class Words calls the constructor
of StringTokenizer that takes one String argument. If you don't explicitly call super, Java automatically
calls the super class constructor with no arguments (such a constructor must exist in this case). Note the
call nextToken() in Words.run, which is short for this.nextToken(). Since this is an instance of Words, it
has a nextToken method -- the one it inherited from StringTokenizer.
A cast in Java looks just like a cast in C++: It is a type name in parentheses preceding an
expression. We have already seen an example of a cast used to convert between primitive types. A cast
can also be used to convert an object reference to a super class or subclass. For example,
Words w = new Words("this is a test");
Object o = w.nextElement();
String s = (String)o;
System.out.println("The first word has length " + s.length());
We know that w.nextElement() is ok, since Words implements the interface Enumeration, but all
that tells us is that the value returned has type Object. We cannot call o.length() because class Object
does not have a length method. In this case, however, we know that o is not just any kind of Object, but
a String in particular. Thus we cast o to type String. If we were wrong about the type of o we would get
a run-time error. If you are not sure of the type of an object, you can test it with instanceof (note the
lower case `o'), or find out more about it with the method Object.getClass()
if (o instanceof String) {
n = ((String)o).length();
} else {
System.err.println("Bad type " + o.getClass().getName());
}

Exceptions
A Java program should never ``core dump,'' no matter how buggy it is. If the compiler accepts it
and something goes wrong at run time, Java throws an exception. By default, an exception causes the
program to terminate with an error message, but you can also catch an exception.

try {
// ...
foo.bar();
// ...
a[i] = 17;
// ...
}
catch (IndexOutOfBoundsException e) {
System.err.println("Oops: " + e);
}
The try statement says you're interested in catching exceptions. The catch clause (which can
only appear after a try) says what to do if an IndexOutOfBoundsException occurs anywhere in the try
clause. In this case, we print an error message. The toString() method of an exception generates a string
containing information about what went wrong, as well as a call trace. Because we caught this
exception, it will not terminate the program. If some other kind of exception occurs (such as divide by
zero), the exception will be thrown back to the caller of this function and if that function doesn't catch
it, it will be thrown to that function's caller, and so on back to the main function, where it will terminate
the program if it isn't caught. Similarly, if the function foo.bar throws an IndexOutOfBoundsException
and doesn't catch it, we will catch it here.
The catch clause actually catches IndexOutOfBoundsException or any of its subclasses,
including ArrayIndexOutOfBoundsException , StringIndexOutOfBoundsException , and others. An
Exception is just another kind of object, and the same rules for inheritance hold for exceptions as any
other king of class.
You can define and throw your own exceptions.
class SyntaxError extends Exception {
int lineNumber;
SyntaxError(String reason, int line) {
super(reason);
lineNumber = line;
}
public String toString() {
return "Syntax error on line " + lineNumber + ": " + getMessage();
}
}
class SomeOtherClass {
public void parse(String line) throws SyntaxError {
// ...
if (...)
throw new SyntaxError("missing comma", currentLine);
//...
}
public void parseFile(String fname) {
//...
try {
// ...
nextLine = in.readLine();
parse(nextLine);
// ...
}
catch (SyntaxError e) {
System.err.println(e);
}
}
}
Each function must declare in its header (with the keyword throws) all the exceptions that may
be thrown by it or any function it calls. It doesn't have to declare exceptions it catches. Some
exceptions, such as IndexOutOfBoundsException, are so common that Java makes an exception for
them (sorry about that pun) and doesn't require that they be declared. This rule applies to
RuntimeException and its subclasses. You should never define new subclasses of RuntimeException.
There can be several catch clauses at the end of a try statement, to catch various kinds of
exceptions. The first one that ``matches'' the exception (i.e., is a super class of it) is executed. You can
also add a finally clause, which will always be executed, no matter how the program leaves the try
clause (whether by falling through the bottom, executing a return, break, or continue, or throwing an
exception).
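For example, here is a sketch combining the pieces above. It assumes parse() and SyntaxError from the
previous example, and that in is a BufferedReader opened elsewhere.

    try {
        String line = in.readLine();   // may throw IOException
        parse(line);                   // may throw SyntaxError
    } catch (SyntaxError e) {
        System.err.println(e);                     // first matching clause runs
    } catch (IOException e) {
        System.err.println("I/O trouble: " + e);
    } finally {
        // always executed: normal exit, return, or exception
        System.err.println("done with this line");
    }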

Threads
Java lets you do several things at once by using threads. If your computer has more than one CPU, it
may actually run two or more threads simultaneously. Otherwise, it will switch back and forth among
the threads at times that are unpredictable unless you take special precautions to control it.
There are two different ways to create threads. I will only describe one of them here.

Thread t = new Thread(command); // create a thread to run command
t.start(); // t starts running command, but we don't wait for it to finish
// ... do something else (perhaps start other threads?)
// ... later:
t.join(); // wait for t to finish running command
The constructor for the built-in class Thread takes one argument, which is any object that has a method
called run. This requirement is specified by requiring that command implement the Runnable interface
described earlier. (More precisely, command must be an instance of a class that implements Runnable).
The way a thread ``runs'' a command is simply by calling its run() method. It's as simple as that!
In project 1, you are supposed to run each command in a separate thread. Thus you might declare
something like this:

class Command implements Runnable {


String theCommand;
Command(String c) {
theCommand = c;
}
public void run() {
// Do what the command says to do
}
}
You can parse the command string either in the constructor or at the start of the run() method.
The main program loop reads a command line, breaks it up into commands, runs all of the commands
concurrently (each in a separate thread), and waits for them to all finish before issuing the next prompt.
In outline, it may look like this.

for (;;) {
System.out.print("% "); System.out.flush();
String line = inputStream.readLine();
int numberOfCommands = // count how many commands there are on the line
Thread t[] = new Thread[numberOfCommands];
for (int i=0; i<numberOfCommands; i++) {
String c = // next command on the line
t[i] = new Thread(new Command(c));
t[i].start();
}
for (int i=0; i<numberOfCommands; i++) {
t[i].join();
}
}
This main loop is in the main() method of your main class. It is not necessary for that class to
implement Runnable.
Although you won't need it for project 1, the next project will require you to synchronize threads with
each other. There are two reasons why you need to do this: to prevent threads from interfering with
each other, and to allow them to cooperate. You use synchronized methods to prevent interference, and
the built-in methods Object.wait() , Object.notify() , Object.notifyAll() , and Thread.yield() to support
cooperation.
Any method can be preceded by the word synchronized (as well as public, static, etc.). The rule is:
No two threads may be executing synchronized methods of the same object at the same time.
The Java system enforces this rule by associating a monitor lock with each object. When a thread calls a
synchronized method of an object, it tries to grab the object's monitor lock. If another thread is holding
the lock, it waits until that thread releases it. A thread releases the monitor lock when it leaves the
synchronized method. If one synchronized method of a class contains a call to another, a thread may
have the same lock ``multiple times.'' Java keeps track of that correctly. For example,

class C {
public synchronized void f() {
// ...
g();
// ...
}
public synchronized void g() { /* ... */ }
}
If a thread calls C.g() ``from the outside'', it grabs the lock before executing the body of g() and releases
it when done. If it calls C.f(), it grabs the lock on entry to f(), calls g() without waiting, and only
releases the lock on returning from f().
Sometimes a thread needs to wait for another thread to do something before it can continue. The
methods wait() and notify(), which are defined in class Object and thus inherited by all classes, are
made for this purpose. They can only be called from within synchronized methods. A call to wait()
releases the monitor lock and puts the calling thread to sleep (i.e., it stops running). A subsequent call to
notify on the same object wakes up a sleeping thread and lets it start running again. If more than one
thread is sleeping, one is chosen arbitrarily.3 If no threads are sleeping in this object, notify() does nothing.
The awakened thread has to wait for the monitor lock before it starts; it competes on an equal basis with
other threads trying to get into the monitor. The method notifyAll is similar, but wakes up all threads
sleeping in the object.
class Buffer {
private Queue q;
public synchronized void put(Object o) {
q.enqueue(o);
notify();
}
public synchronized Object get() {
while (q.isEmpty())
wait();
return q.dequeue();
}
}
This class solves the so-called ``producer-consumer'' problem (it assumes the Queue class has been
defined elsewhere). ``Producer'' threads somehow create objects and put them into the buffer by calling
Buffer.put(), while ``consumer'' threads remove objects from the buffer (using Buffer.get()) and do
something with them. The problem is that a consumer thread may call Buffer.get() only to discover that
the queue is empty. By calling wait() it releases the monitor lock and goes to sleep so that producer
threads can call put() to add more objects. Each time a producer adds an object, it calls notify() just in
case there is some consumer waiting for an object.
This example is not correct as it stands (and the Java compiler will reject it). The wait() method can
throw an InterruptedException exception, so the get() method must either catch it or declare that it
throws InterruptedException as well. The simplest solution is just to catch the exception and ignore it:

class Buffer {
private Queue q;
public synchronized void put(Object o) {
q.enqueue(o);
notify();
}
public synchronized Object get() {
while (q.isEmpty()) {
try {
wait();
} catch (InterruptedException e) {
e.printStackTrace();
}
}
return q.dequeue();
}
}
The method printStackTrace() prints some information about the exception, including the line number
where it happened. It is a handy thing to put in a catch clause if you don't know what else to put there.
Never use an empty catch clause. If you violate this rule, you will live to regret it!
There is also a version of Object.wait() that takes an integer parameter. The call wait(n) will return after
n milliseconds if nobody wakes up the thread with notify or notifyAll sooner.
You may wonder why Buffer.get() uses while (q.isEmpty()) rather than if (q.isEmpty()). In this
particular case, either would work. However, in more complicated situations, a sleeping thread might be
awakened for the ``wrong'' reason. Thus it is always a good idea, when you wake up, to recheck the
condition that made you decide to go to sleep before you continue.
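Here is a sketch of how producer and consumer threads might actually use the Buffer class above. The
Producer and Consumer classes are made up for this illustration (and, as noted earlier, the Queue class
used by Buffer is assumed to be defined elsewhere).

    class Producer implements Runnable {
        private Buffer buffer;
        Producer(Buffer b) { buffer = b; }
        public void run() {
            for (int i = 0; i < 10; i++) {
                buffer.put("item " + i);   // may wake up a waiting consumer
            }
        }
    }

    class Consumer implements Runnable {
        private Buffer buffer;
        Consumer(Buffer b) { buffer = b; }
        public void run() {
            for (int i = 0; i < 10; i++) {
                Object o = buffer.get();   // blocks (waits) while the buffer is empty
                System.out.println(o);
            }
        }
    }

    // somewhere in main():
    // Buffer b = new Buffer();
    // new Thread(new Producer(b)).start();
    // new Thread(new Consumer(b)).start();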

Input and Output


Input/Output, as described in Chapter 12 of the Java book, is not as complicated as it looks. You can get
pretty far just writing to System.out (which is of type PrintStream ) with methods println and print. For
input, you probably want to wrap the standard input System.in in a BufferedReader , which provides
the handy method readLine()

BufferedReader in =
new BufferedReader(new InputStreamReader(System.in));
for(;;) {
String line = in.readLine();
if (line == null) {
break;
}
// do something with the next line
}
If you want to read from a file, rather than from the keyboard (standard input), you can use FileReader,
probably wrapped in a BufferedReader.

BufferedReader in =
new BufferedReader(new FileReader("somefile"));
for (;;) {
String line = in.readLine();
if (line == null) {
break;
}
// do something with the next line
}
Similarly, you can use new PrintWriter(new FileOutputStream("whatever")) to write to a file.
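A sketch of writing a file this way (the file name is made up). As with readLine() in the examples above,
these operations can throw IOException, so the enclosing method must catch it or declare it with throws.

    PrintWriter out = new PrintWriter(new FileOutputStream("somefile.out"));
    for (int i = 0; i < 10; i++) {
        out.println("line " + i);
    }
    out.close();   // don't forget to close, or buffered output may be lost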

Other Goodies
The library of pre-defined classes has several other handy tools. See the online manual , particularly
java.lang and java.util for more details.
Integer, Character, etc.
Java makes a big distinction between values (integers, characters, etc.) and objects. Sometimes you
need an object when you have a value (the next paragraph has an example). The classes Integer,
Character, etc. serve as convenient wrappers for this purpose. For example, Integer i = new Integer(3)
creates a version of the number 3 wrapped up as an object. The value can be retrieved as i.intValue().
These classes also serve as convenient places to define utility functions for manipulating values of the
given types, often as static methods or defined constants.

int i = Integer.MAX_VALUE; // 2147483647, the largest possible int


int i = Integer.parseInt("123"); // the int value 123
String s = Integer.toHexString(123);// "7b" (123 in hex)
double x = Double.parseDouble("123e-2");
// the double value 1.23
Character.isDigit('3') // true
Character.isUpperCase('a') // false
Character.toUpperCase('a') // 'A'
Vector
A Vector is like an array, but it grows as necessary to allow you to add as many elements as you like.
Unfortunately, there is only one kind of Vector--a vector of Object. Thus you can insert objects of any
type into it, but when you take objects out, you have to use a cast to recover the original type.4

Vector v = new Vector(); // an empty vector


for (int i=0; i<100; i++)
v.add(new Integer(i));
// now it contains 100 Integer objects

// print their squares


for (int i=0; i<100; i++) {
Integer member = (Integer)(v.get(i));
int n = member.intValue();
System.out.println(n*n);
}

// another way to do that


for (Iterator i = v.iterator(); i.hasNext(); ) {
int n = ((Integer)(i.next())).intValue();
System.out.println(n*n);
}
v.set(5, "hello"); // like v[5] = "hello"
Object o = v.get(3); // like o = v[3];
v.add(6, "world"); // set v[6] = "world" after first shifting
// element v[7], v[8], ... to the right
// to make room
v.remove(3); // remove v[3] and shift v[4], ... to the
// left to fill in the gap
Elements of a Vector must be objects, not values. That means you can put a String or an instance of a
user-defined class into a Vector, but if you want to put an integer, floating-point number, or character
into a Vector, you have to wrap it:

v.add(47); // WRONG!
sum += v.get(i); // WRONG!
v.add(new Integer(47)); // right
sum += ((Integer)v.get(i)).intValue();
// ugly, but right
The class Vector is implemented using an ordinary array that is generally only partially filled. If Vector
runs out of space, it allocates a bigger array and copies over the elements. There are a variety of
additional methods, not shown here, that let you give the implementation advice on how to manage the
extra space more efficiently. For example, if you know that you are not going to add any more elements
to v, you can call v.trimToSize() to tell the system to repack the elements into an array just big enough
to hold them.
Don't forget to import java.util.Vector; or import java.util.*; .
Maps and Sets
The interface Map5 represents a table mapping keys to values. It is sort of like an array or Vector, except
that the ``subscripts'' can be any objects, rather than non-negative integers. Since Map is an interface
rather than a class you cannot create instances of it, but you can create instances of the class HashMap,
which implements Map using a hash table.

Map table = new HashMap(); // an empty table


table.put("seven", new Integer(7)); // key is the String "seven";
// value is an Integer object
table.put("seven", 7); // WRONG! (7 is not an object)
Object o = table.put("seven", new Double(7.0));
// binds "seven" to a double object
// and returns the previous value
int n = ((Integer)o).intValue(); // n = 7
table.containsKey("seven"); // true
table.containsKey("twelve"); // false

// print out the contents of the table


for (Iterator i = table.keySet().iterator(); i.hasNext(); ) {
Object key = i.next();
System.out.println(key + " -> " + table.get(key));
}
o = table.get("seven"); // get current binding (a Double)
o = table.remove("seven"); // get current binding and remove it
table.clear(); // remove all bindings
Sometimes, you only care whether a particular key is present, not what it's mapped to. You could
always use the same object as a value (or use null), but it would be more efficient (and, more
importantly, clearer) to use a Set.

System.out.println("What are your favorite colors?");


BufferedReader in =
new BufferedReader(new InputStreamReader(System.in));
Set favorites = new HashSet();
try {
for (;;) {
String color = in.readLine();
if (color == null) {
break;
}
if (!favorites.add(color)) {
System.out.println("you already told me that");
}
}
} catch (IOException e) {
e.printStackTrace();
}
int n = favorites.size();
if (n == 1) {
System.out.println("your favorite color is:");
} else {
System.out.println("your " + n + " favorite colors are:");
}
for (Iterator i = favorites.iterator(); i.hasNext(); ) {
System.out.println(i.next());
}
StringTokenizer
A StringTokenizer is handy in breaking up a string into words separated by white space (or other
separator characters). The following example is from the Java book:

String str = "Gone, and forgotten";


StringTokenizer tokens = new StringTokenizer(str, " ,");
while (tokens.hasMoreTokens())
System.out.println(tokens.nextToken());
It prints out

Gone
and
forgotten
The second argument to the constructor is a String containing the characters that should be considered
separators (in this case, space and comma). If it is omitted, it defaults to space, tab, return, and newline
(the most common ``white-space'' characters).
There is a much more complicated class StreamTokenizer for breaking up an input stream into tokens.
Many of its features seem to be designed to aid in parsing the Java language itself (which is not a
surprise, considering that the Java compiler is written in Java).
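Here is a tiny sketch just to give the flavor (nextToken() can throw IOException, which real code would have to catch or declare):

StreamTokenizer tokens = new StreamTokenizer(new StringReader("x = 3 + 4.5"));
while (tokens.nextToken() != StreamTokenizer.TT_EOF) {
    if (tokens.ttype == StreamTokenizer.TT_WORD)
        System.out.println("word:   " + tokens.sval);
    else if (tokens.ttype == StreamTokenizer.TT_NUMBER)
        System.out.println("number: " + tokens.nval);
    else
        System.out.println("other:  " + (char) tokens.ttype);
}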
Other Utilities
The random-number generator Random was presented above. See Chapter 13 of the Java book for
information about other handy classes.

1
Throughout this tutorial, examples in C++ are shown in green and examples in Java are shown in blue.
This example could have been in either green or blue!
2
All the members of an Interface are implicitly public. You can explicitly declare them to be public, but
you don't have to, and you shouldn't.
3
as a practical matter, it's probably the one that has been sleeping the longest, but you can't depend on
that
4
Interface Iterator was introduced with Java 1.2. It is a somewhat more convenient version of the older
interface Enumeration discussed earlier.
5
Interfaces Map and Set were introduced with Java 1.2. Earier versions of the API contained only
Hashtable, which is similar to HashMap.
CS 537 Lecture Notes Part 3
Processes and Synchronization

Contents
• Using Processes
• What is a Process?
• Why Use Processes
• Creating Processes
• Process States
• Synchronization
• Race Conditions
• Semaphores
• The Bounded Buffer Problem
• The Dining Philosophers
• Monitors
• Messages

The text book mixes a presentation of the features of processes of interest to programmers creating
concurrent programs with discussion of techniques for implementing them. The result is (at least to me)
confusing. I will attempt to first present processes and associated features from the user's point of view
with as little concern as possible for questions about how they are implemented, and then turn to the
question of implementing processes.

Using Processes
What is a Process?
[Silberschatz, Galvin, and Gagne, Sections 4.1, 5.1, 5.2]
A process is a ``little bug'' that crawls around on the program executing the instructions it sees there.
Normally (in so-called sequential programs) there is exactly one process per program, but in
concurrent programs, there may be several processes executing the same program. The details of what
constitutes a ``process'' differ from system to system. The main difference is the amount of private state
associated with each process. Each process has its own program counter, the register that tells it where
it is in the program. It also needs a place to store the return address when it calls a subroutine, so that
two processes executing the same subroutine called from different places can return to the correct
calling points. Since subroutines can call other subroutines, each process needs its own stack of return
addresses.
Processes with very little private memory are called threads or light-weight processes. At a minimum,
each thread needs a program counter and a place to store a stack of return addresses; all other values
could be stored in memory shared by all threads. At the other extreme, each process could have its own
private memory space, sharing only the read-only program text with other processes. This is essentially
the way a Unix process works. Other points along the spectrum are possible. One common approach is
to put the local variables of procedures on the same private stack as the return addresses, but let all
global variables be shared between processes. A stack frame holds all the local variables of a procedure,
together with an indication of where to return to when the procedure returns, and an indication of where
the calling procedure's stack frame is stored. This is the approach taken by Java threads. Java has no
global variables, but threads all share the same heap. The heap is the region of memory used to hold
objects allocated by new. In short, variables declared in procedures are local to threads, but objects are
all shared. Of course, a thread can only ``see'' an object if it can reach that object from its ``base'' object
(the one containing its run method) or from one of its local variables.

class Worker implements Runnable {


Object arg, other;
Worker(Object a) { arg = a; }
public void run() {
Object tmp = new Object();
other = new Object();
for (int i = 0; i < 1000; i++) { /* do something */ }
}
}
class Demo {
static public void main(String args[]) {
Object shared = new Object();

Runnable worker1 = new Worker(shared);


Thread t1 = new Thread(worker1);

Runnable worker2 = new Worker(shared);


Thread t2 = new Thread(worker2);

t1.start(); t2.start();
// do something here
}
}
There are three threads in this program, the main thread and two child threads created by it. Each child
thread has its own stack frame for Worker.run(), with space for tmp and i. Thus there are two
copies of the variable tmp, each of which points to a different instance of Object. Those objects are
in the shared heap, but since one thread has no way of getting to the object created by the other thread,
these objects are effectively ``private'' to the two threads.1 Similarly, the objects pointed to by other
are effectively private. But both copies of the field arg and the variable shared in the main thread all
point to the same (shared) object.
Other names sometimes used for processes are job or task.
It is possible to combine threads with processes in the same system. For example, when you run Java
under Unix, each Java program is run in a separate Unix process. Unix processes share very little with
each other, but the Java threads in one Unix process share everything but their private stacks.
Why Use Processes
Processes are basically just a programming convenience, but in some settings they are such a great
convenience that it would be nearly impossible to write the program without them. A process allows you to
write a single thread of code to get some task done, without worrying about the possibility that it may
have to wait for something to happen along the way. Examples:
A server providing services to others.
One thread for each client.
A timesharing system.
One thread for each logged-in user.
A real-time control computer controlling a factory.
One thread for each device that needs monitoring.
A network server.
One thread for each connection (see the sketch below).
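For the network-server case, for example, the main loop might look roughly like this (serveOneClient is a hypothetical method, the port number is arbitrary, and the IOExceptions are not shown being caught):

ServerSocket server = new ServerSocket(8000);
for (;;) {
    final Socket client = server.accept();      // wait for the next connection
    new Thread(new Runnable() {
        public void run() {
            serveOneClient(client);             // hypothetical: handle one client
        }
    }).start();                                 // go right back to accept()
}
Each connection gets its own thread, which can block on reads from that one client without holding up anybody else.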

Creating Processes
[Silberschatz, Galvin, and Gagne, Sections 4.3.1, 4.3.2, 5.6.1]
When a new process is created, it needs to know where to start executing. In Java, a thread is given an
object when it is created. When it is started, it starts execution at the beginning of the run method of
that object.
In Unix, a new process is started with the fork() command. It starts a new process running in the
same program, starting at the statement immediately following the fork() call. After the call, the
parent (the process that called fork()) and the child are both executing at the same point in the
program. The child is given its own memory space, which is initialized with an exact copy of the
memory space (globals, stack, heap objects) of the parent. Thus the child looks like an exact clone of
the parent, and indeed, it's hard to tell them apart. The only difference is that fork() returns 0 in the
child, but a non-zero value in the parent.

#include <iostream.h>
#include <unistd.h>

char *str;

int f() {
int k;

k = fork();
if (k == 0) {
str = "the child has value ";
return 10;
}
else {
str = "the parent has value ";
return 39;
}
}

main() {
int j;
str = "the main program ";
j = f();
cout << str << j << endl;
}
This program starts with one process executing main(). This process calls f(), and inside f() it
calls fork(). Two processes appear to return from fork(), a parent and a child process. Each has its
own copy of the global variable str and its own copy of the stack, which contains a frame for
main with variable j and a frame for f with variable k. After the return from fork the parent sets its
copy of k to a non-zero value, while the child sets its copy of k to zero. Each process then assigns a
different string to its copy of the global str and returns a different value, which is assigned to the
process' own copy of j. Two lines are printed:

the parent has value 39


the child has value 10
(actually, the lines might be intermingled).
Process States
[Silberschatz, Galvin, and Gagne, Sections 4.1.2, 5.6.2, 5.6.3. See Figure 4.1 on page 89]
Once a process is started, it is either runnable or blocked. It can become blocked by doing something
that explicitly blocks itself (such as wait()) or by doing something that implicitly blocks it (such as a
read() request). In some systems, it is also possible for one process to block another (e.g.,
Thread.suspend() in Java2). A runnable process is either ready or running. There can only be as
many running processes as there are CPUs. One of the responsibilities of the operating system, called
short-term scheduling, is to switch processes between ready and running state. Two other possible states
are new and terminated. In a batch system, a newly submitted job might be left in new state until the
operating system decides there are enough available resources to run the job without overloading the
system. The decision of when to move a job from new to ready is called long-term scheduling. A
process may stay in terminated state after finishing so that the OS can clean up after it (print out its
output, etc.) Many systems also allow one process to enquire about the state of another process, or to
wait for another process to complete. For example, in Unix, the wait() command blocks the current
process until at least one of its children has terminated. In Java, the method Thread.join() blocks
the caller until the indicated thread has terminated (returned from its run method). To implement
these functions, the OS has to keep a copy of the terminated process around. In Unix, such a process is
called a ``zombie.''
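For example, borrowing the Worker class from the example above, a parent thread waits for a child like this:

Thread t = new Thread(new Worker(shared));
t.start();
// ... the parent does something else for a while ...
try {
    t.join();       // block until the child's run() method returns
} catch (InterruptedException e) {
    e.printStackTrace();
}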
Some systems require every process to have a parent. What happens when the parent dies before the
child? One possibility is cascading termination. Unix uses a different model. An ``orphan'' process is
adopted by a special process called ``init'' that is created at system startup and only goes away at system
shutdown.
Synchronization
[Silberschatz, Galvin, and Gagne, Chapter 7]

Race Conditions
Consider the following extremely simple procedure

void deposit(int amount) {


balance += amount;
}
(where we assume that balance is a shared variable). If two processes try to call deposit
concurrently, something very bad can happen. The single statement balance += amount is really
implemented, on most computers, by a sequence of instructions such as

Load Reg, balance


Add Reg, amount
Store Reg, balance
Suppose process P1 calls deposit(10) and process P2 calls deposit(20). If one completes
before the other starts, the combined effect is to add 30 to the balance, as desired. However, suppose the
calls happen at exactly the same time, and the executions are interleaved. Suppose the initial balance is
100, and the two processes run on different CPUs. One possible result is

P1 loads 100 into its register


P2 loads 100 into its register
P1 adds 10 to its register, giving 110
P2 adds 20 to its register, giving 120
P1 stores 110 in balance
P2 stores 120 in balance
and the net effect is to add only 20 to the balance!
This kind of bug, which only occurs under certain timing conditions, is called a race condition. It is an
extremely difficult kind of bug to track down (since it may disappear when you try to debug it) and may
be nearly impossible to detect from testing (since it may occur only extremely rarely). The only way to
deal with race conditions is through very careful coding. To avoid these kinds of problems, systems that
support processes always contain constructs called synchronization primitives.
Semaphores
[Silberschatz, Galvin, and Gagne, Section 7.5]
One of the earliest and simplest synchronization primitives is the semaphore. We will consider later
how semaphores are implemented, but for now we can treat them like a Java object that hides an integer
value and only allows three operations: initialization to a specified value, increment, or decrement.3

class Semaphore {
private int value;
public Semaphore(int v) { value = v; }
public void up() { /* ... */ }
public void down() { /* ... */ };
}
Although there are methods for changing the value, there is no way to read the current value! There are two
bits of ``magic'' that make this seemingly useless class extremely useful:
1. The value is never permitted to be negative. If the value is zero when a process calls down, that
process is forced to wait (it goes into blocked state) until some other process calls up on the
semaphore.
2. The up and down operations are atomic: A correct implementation must make it appear that
they occur instantaneously. In other words, two operations on the same semaphore attempted at
the same time must not be interleaved. (In the case of a down operation that blocks the caller, it
is the actual decrementing that must be atomic; it is ok if other things happen while the calling
process is blocked).
Our first example uses semaphores to fix the deposit function above.

shared Semaphore mutex = new Semaphore(1);


void deposit(int amount) {
mutex.down();
balance += amount;
mutex.up();
}
We assume there is one semaphore, which we call mutex (for ``mutual exclusion'') shared by all
processes. The keyword shared (which is not Java) will be omitted if it is clear which variables are
shared and which are private (have a separate copy for each process). Semaphores are useless unless
they are shared, so we will omit shared before Semaphore. Also we will abbreviate the declaration
and initialization as

Semaphore mutex = 1;
Let's see how this works. If only one process wants to make a deposit, it does mutex.down(),
decreasing the value of mutex to zero, adds its amount to the balance, and returns the value of mutex
to one. If two processes try to call deposit at about the same time, one of them will get to do the
down operation first (because down is atomic!). The other will find that mutex is already zero and be
forced to wait. When the first process finishes adding to the balance, it does mutex.up(), returning
the value to one and allowing the other process to complete its down operation. If there were three
processes trying at the same time, one of them would do the down first, as before, and the other two
would be forced to wait. When the first process did up, one of the other two would be allowed to
complete its down operation, but then mutex would be zero again, and the third process would
continue to wait.
The Bounded Buffer Problem
[Silberschatz, Galvin, and Gagne, Sections 4.4 and 7.6.1 ]
Suppose there are producer and consumer processes. There may be many of each. Producers somehow
produce objects, which consumers then use for something. There is one Buffer object used to pass
objects from producers to consumers. A Buffer can hold up to 10 objects. The problem is to allow
concurrent access to the Buffer by producers and consumers, while ensuring that
1. The shared Buffer data structure is not screwed up by race conditions in accessing it.
2. Consumers don't try to remove objects from Buffer when it is empty.
3. Producers don't try to add objects to the Buffer when it is full.
When condition (3) is dropped (the Buffer is assumed to have infinite capacity), the problem is called
the unbounded-buffer problem, or sometimes just the producer-consumer problem. Here is a solution.
First we implement the Buffer class. This is just an easy CS367 exercise; it has nothing to do with
processes.

class Buffer {
private Object[] elements;
private int size, nextIn, nextOut;
Buffer(int size) {
this.size = size;
elements = new Object[size];
nextIn = 0;
nextOut = 0;
}
public void addElement(Object o) {
elements[nextIn++] = o;
if (nextIn == size) nextIn = 0;
}
public Object removeElement() {
Object result = elements[nextOut++];
if (nextOut == size) nextOut = 0;
return result;
}
}
Now for a solution to the bounded-buffer problem using semaphores.4

shared Buffer b = new Buffer(10);


Semaphore
mutex = 1,
empty = 10,
full = 0;

class Producer implements Runnable {


Object produce() { /* ... */ }
public void run() {
Object item;
for (;;) {
item = produce();
empty.down();
mutex.down();
b.addElement(item);
mutex.up();
full.up();
}
}
}
class Consumer implements Runnable {
void consume(Object o) { /* ... */ }
public void run() {
Object item;
for (;;) {
full.down();
mutex.down();
item = b.removeElement();
mutex.up();
empty.up();
consume(item);
}
}
}
As before, we surround operations on the shared Buffer data structure with mutex.down() and
mutex.up() to prevent interleaved changes by two processes (which may screw up the data
structure). The semaphore full counts the number of objects in the buffer, while the semaphore
empty counts the number of free slots. The operation full.down() in Consumer atomically waits
until there is something in the buffer and then ``lays claim'' to it by decrementing the semaphore.
Suppose it was replaced by

while (b.empty()) { /* do nothing */ }


mutex.down();
/* as before */
(where empty is a new method added to the Buffer class). It would be possible for one process to
see that the buffer was non-empty, and then have another process remove the last item before it got a
chance to grab the mutex semaphore.
There is one more fine point to notice here: Suppose we reversed the down operations in the consumer

mutex.down();
full.down();
and a consumer tries to do these operation when the buffer is empty. It first grabs the mutex
semaphore and then blocks on the full semaphore. It will be blocked forever because no other
process can grab the mutex semaphore to add an item to the buffer (and thus call full.up()). This
situation is called deadlock. We will study it in length later.
The Dining Philosophers
[Silberschatz, Galvin, and Gagne, Section 7.6.3]
There are five philosopher processes numbered 0 through 4. Between each pair of philosophers is a
fork. The forks are also numbered 0 through 4, so that fork i is between philosophers i-1 and i (all
arithmetic on fork numbers and philosopher numbers is modulo 5 so fork 0 is between philosophers 4
and 0).

Each philosopher alternates between thinking and eating. To eat, he needs exclusive access to the forks
on both sides of him.

class Philosopher implements Runnable {


int i; // which philosopher
public void run() {
for (;;) {
think();
take_forks(i);
eat();
put_forks(i)
}
}
}
A first attempt to solve this problem represents each fork as a semaphore:

Semaphore fork[5] = 1;
void take_forks(int i) {
fork[i].down();
fork[i+1].down();
}
void put_forks(int i) {
fork[i].up();
fork[i+1].up();
}
The problem with this solution is that it can lead to deadlock. Each philosopher picks up his right fork
before he tried to pick up his left fork. What happens if the timing works out such that all the
philosophers get hungry at the same time, and they all pick up their right forks before any of them gets
a chance to try for his left fork? Then each philosopher i will be holding fork i and waiting for fork
i+1, and they will all wait forever.

There's a very simple solution: Instead of trying for the right fork first, try for the lower numbered fork
first. We will show later that this solution cannot lead to deadlock.
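In the same pseudocode style as the first attempt (and writing the mod-5 arithmetic explicitly), the fix changes only take_forks:

void take_forks(int i) {
    int lower  = Math.min(i, (i+1) % 5);    // always grab the lower-numbered
    int higher = Math.max(i, (i+1) % 5);    // fork first
    fork[lower].down();
    fork[higher].down();
}
void put_forks(int i) {
    fork[i].up();                           // release order doesn't matter
    fork[(i+1) % 5].up();
}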
This solution, while deadlock-free, is still not as good as it could be. Consider again the situation in
which all philosophers get hungry at the same time and pick up their lower-numbered fork. Both
philosopher 0 and philosopher 4 try to grab fork 0 first. Suppose philosopher 0 wins. Since
philosopher 4 is stuck waiting for fork 0, philosopher 3 will be able to grab both his forks and start
eating.

Philosopher 3 gets to eat, but philosophers 0 and 1 are waiting, even though neither of them shares a
fork with philosopher 3, and hence one of them could eat right away. In summary, this solution is safe
(no two adjacent philosophers eat at the same time), but not as concurrent as possible: A philosopher's
meal may be delayed even though the delay is not required for safety.
Dijkstra suggests a better solution. More importantly, he shows how to derive the solution by thinking
about two goals of any synchronization problem:
Safety
Make sure nothing bad happens.
Liveness
Make sure something good happens whenever it can.
For each philosopher i let state[i] be the state of philosopher i--one of THINKING, HUNGRY, or
EATING. The safety requirement is that no two adjacent philosophers are simultaneously EATING. The
liveness criterion is that no philosopher is hungry unless one of his neighbors is eating (a hungry
philosopher should start eating unless the safety criterion prevents him). More formally,
Safety
For all i, !(state[i]==EATING && state[i+1]==EATING)

Liveness
For all i, !(state[i]==HUNGRY && state[i-1]!=EATING &&
state[i+1]!=EATING)
With this observation, the solution almost writes itself

Semaphore mayEat[5] = { 0, 0, 0, 0, 0};


Semaphore mutex = 1;
final static public int THINKING = 0;
final static public int HUNGRY = 1;
final static public int EATING = 2;
int state[5] = { THINKING, THINKING, THINKING, THINKING, THINKING };
void take_forks(int i) {
mutex.down();
state[i] = HUNGRY;
test(i);
mutex.up();
mayEat[i].down();
}
void put_forks(int i) {
mutex.down();
state[i] = THINKING;
test(i==0 ? 4 : i-1); // i-1 mod 5
test(i==4 ? 0 : i+1); // i+1 mod 5
mutex.up();
}
void test(int i) {
if (state[i]==HUNGRY &amp;&amp; state[i-1]!=EATING &amp;&amp; state[i+1] != EATING) {
state[i] = EATING;
mayEat[i].up();
}
}
The method test(i) checks for a violation of liveness at position i. Such a violation can only occur
when philosopher i gets hungry or one of his neighbors finishes eating. Each philosopher has his own
mayEat semaphore, which represents permission to start eating. Philosopher i calls
mayEat[i].down() immediately before starting to eat. If the safety condition allows philosopher i
to eat, the procedure test(i) grants permission by calling mayEat[i].up(). Note that the
permission may be granted by a neighboring philosopher, in the call to test(i) in put_forks, or
the hungry philosopher may give himself permission to eat, in the call to test(i) in take_forks.
Monitors
[Silberschatz, Galvin, and Gagne, Section 7.7]
Although semaphores are all you need to solve lots of synchronization problems, they are rather ``low
level'' and error-prone. As we saw before, a slight error in placement of semaphores (such as switching
the order of the two down operations in the Bounded Buffer problem) can lead to big problems. It is
also easy to forget to protect shared variables (such as the bank balance or the buffer object) with a
mutex semaphore. A better (higher-level) solution is provided by the monitor (also invented by
Dijkstra).
If you look at the example uses of semaphores above, you see that they are used in two rather different
ways: One is simple mutual exclusion. A semaphore (always called mutex in our examples) is
associated with a shared variable or variables. Any piece of code that touches these variables is
preceded by mutex.down() and followed by mutex.up(). Since it's hard for a programmer to
remember to do this, but easy for a compiler, why not let the compiler do the work?5

monitor class BankAccount {


private int balance;
public void deposit(int amount) {
balance += amount;
}
// etc
}
The keyword monitor tells the compiler to add a field

Semaphore mutex = 1;
to the class, add a call of mutex.down() to the beginning of each method, and put a call of
mutex.up() at each return point in each method.
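In other words, the compiler would in effect rewrite the BankAccount monitor above as something like this (writing the initialization out in full and using the Semaphore class from the previous section):

class BankAccount {
    private Semaphore mutex = new Semaphore(1);
    private int balance;
    public void deposit(int amount) {
        mutex.down();       // inserted at the start of every method
        balance += amount;
        mutex.up();         // inserted at every return point
    }
    // etc
}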
The other way semaphores are used is to block a process when it cannot proceed until another process
does something. For example, a consumer, on discovering that the buffer is empty, has to wait for a
producer; a philosopher, on getting hungry, may have to wait for a neighbor to finish eating. To provide
this facility, monitors can have a special kind of variable called a condition variable.

interface Condition {
public void signal();
public void wait();
}
A condition variable is like a semaphore, with two differences:
1. A semaphore counts the number of excess up operations, but a signal operation on a
condition variable has no effect unless some process is waiting. A wait on a condition variable
always blocks the calling process.
2. A wait on a condition variable atomically does an up on the monitor mutex and blocks the
caller. In other words if c is a condition variable c.wait() is rather like mutex.up();
c.down(); except that both operations are done together as a single atomic action.
Here is a solution to the Bounded Buffer problem using monitors.
monitor BoundedBuffer {
private Buffer b = new Buffer(10);
private int count = 0;
private Condition nonfull, nonempty;
public void insert(Object item) {
if (count == 10)
nonfull.wait();
b.addElement(item);
count++;
nonempty.signal();
}
public Object remove() {
if (count == 0)
nonempty.wait();
Object result = b.removeElement();
count--;
nonfull.signal();
return result;
}
}
In general, each condition variable is associated with some logical condition on the state of the monitor
(some expression that may be either true or false). If a process discovers, part-way through a method,
that some logical condition it needs is not satisfied, it waits on the corresponding condition variable.
Whenever a process makes one of these conditions true, it signals the corresponding condition variable.
When the waiter wakes up, he knows that the problem that caused him to go to sleep has been fixed,
and he may immediately proceed. For this kind of reasoning to be valid, it is important that nobody else
sneak in between the time that the signaller does the signal and the waiter wakes up. Thus, calling
signal blocks the signaller on yet another queue and immediately wakes up the waiter (if there are
multiple processes blocked on the same condition variable, the one waiting the longest wakes up).
When a process leaves the monitor (returns from one of its methods), a sleeping signaller, if any, is
allowed to continue. Otherwise, the monitor mutex is released, allowing a new process to enter the
monitor. In summary, waiters have precedence over signalers.
This strategy, while nice for avoiding certain kinds of errors, is very inefficient. As we will see when
we consider implementation, it is expensive to switch processes. Consider what happens when a
consumer is blocked on the nonempty condition variable and a producer calls insert.
• The producer adds the item to the buffer and calls nonempty.signal().
• The producer is immediately blocked and the consumer is allowed to continue.
• The consumer removes the item from the buffer and leaves the monitor.
• The producer wakes up, and since the signal operation was the last statement in insert,
leaves the monitor.
There is an unnecessary switch from the producer to the consumer and back again.
To avoid this inefficiency, all recent implementations of monitors replace signal with notify. The
notify operation is like signal in that it awakens a process waiting on the condition variable if
there is one and otherwise does nothing. But as the name implies, a notify is a ``hint'' that the
associated logical condition might be true, rather than a guarantee that it is true. The process that called
notify is allowed to continue. Only when it leaves the monitor is the awakened waiter allowed to
continue. Since the logical condition might not be true anymore, the waiter needs to recheck it when it
wakes up. For example the Bounded Buffer monitor should be rewritten to replace

if (count == 10)
nonfull.wait();
with

while (count == 10)


nonfull.wait();

Java has built into it something like this, but with two key differences. First, instead of marking a whole
class as monitor, you have to remember to mark each method as synchronized. Every object is
potentially a monitor. Second, there are no explicit condition variables. In effect, every monitor has
exactly one anonymous condition variable. Instead of writing c.wait() or c.notify(), where c
is a condition variable, you simply write wait() or notify(). A solution to the Bounded Buffer
problem in Java might look like this:

class BoundedBuffer {
private Buffer b = new Buffer(10);
private int count = 0;
synchronized public void insert(Object item) {
while (count == 10)
wait();
b.addElement(item);
count++;
notifyAll();
}
synchronized public Object remove() {
while (count == 0)
wait();
Object result = b.removeElement();
count--;
notifyAll();
return result;
}
}
Instead of waiting on a specific condition variable corresponding to the condition you want (buffer non-
empty or buffer non-full), you simply wait, and whenever you make either of these conditions true,
you simply notifyAll. The operation notifyAll is similar to notify, but it wakes up all the
processes that are waiting rather than just one.6 In general, a process has to use notifyAll rather than
notify, since the process awakened by notify is not necessarily waiting for the condition that the
notifier just made true.
This BoundedBuffer solution is not correct if it uses notify instead of notifyAll. Consider a
system with 20 consumer threads and one producer and suppose the following sequence of events
occurs.
1. All 20 consumer threads call remove. Since the buffer starts out empty, they all call wait and
stop running.
2. The producer thread calls insert 11 times. Each of the first 10 times, it adds an object to the
buffer and wakes up one of the waiting consumers. The 11th time, it finds that count == 10
and calls wait. Unlike signal, which blocks the caller, notify allows the producer thread
to continue, so it may finish this step before any of the awakened consumer threads resume
execution.
3. Each of the 10 consumer threads awakened in Step 2 re-tests the condition count == 0, finds
it false, removes an object from the buffer, decrements count, and calls notify. Java
makes no promises about which thread is awakened by each notify, but it is possible, indeed likely,
that each notify will awaken one of the remaining 10 consumer threads blocked in Step 1.
4. Each of the consumer threads awakened in Step 3 finds that count == 0 and calls wait
again.
At this point, the system grinds to a halt. The lone producer thread is blocked on wait even though the
buffer is empty. The problem is that the notify calls in Step 3 woke up the "wrong" threads; the
notify in BoundedBuffer.remove is meant to wake up waiting producers, not waiting
consumers. The correct solution, which uses notifyAll rather than notify, wakes up the
remaining 10 consumers and the producer in Step 3. The 10 consumers go back to sleep, but the
producer is allowed to continue adding objects to the buffer.
As another example, here's a version of Dijkstra's solution to the dining philosophers problem in
"Java".

class Philosopher implements Runnable {


private int id;
private DiningRoom diningRoom;
public Philosopher(int id, DiningRoom diningRoom) {
this.id = id;
this.diningRoom = diningRoom;
}
public void think() { ... }
public void run() {
for (int i=0; i<100; i++) {
think();
diningRoom.dine(id);
}
}
}
class DiningRoom {
final static private int THINKING = 0;
final static private int HUNGRY = 1;
final static private int EATING = 2;
private int[] state = { THINKING, THINKING, ... };
private Condition[] ok = new Condition[5];

private void eat(int p) { ... }

public synchronized void dine(int p) {


state[p] = HUNGRY;
test(p);
while (state[p] != EATING)
try { ok[p].wait(); } catch (InterruptedException e) {}
eat(p);
state[p] = THINKING;
test((p+4)%5); // (p-1) mod 5
test((p+1)%5); // (p+1) mod 5
}

private void test(int p) {


if (state[p] == HUNGRY
&& state[(p+1)%5] != EATING
&& state[(p+4)%5] != EATING
){
state[p] = EATING;
ok[p].notify();
}
}
}
When a philosopher p gets hungry, he calls DiningRoom.dine(p). In that procedure, he advertises
that he is HUNGRY and calls test to see if he can eat. Note that in this case, the notify has no effect:
the only thread that ever waits for ok[p] is philosopher p, and since he is the caller, he can't be
waiting! If his neighbors are not eating, he will set his own state to EATING, the while loop will be
skipped, and he will immediately eat. Otherwise, he will wait for ok[p].
When a philosopher finishes eating, he calls test for each of his neighbors. Each call checks to see if
the neighbor is hungry and able to eat. If so, it sets the neighbor's state to EATING and notify's the
neighbor's ok condition in case he is already waiting.
This solution is fairly simple and easy to read. Unfortunately, it is wrong! There are two problems.
First, it isn't legal Java (which is why I put "Java" in quotes above). Java does not have a Condition
type. Instead it has exactly one anonymous condition variable per monitor. That part is (surprisingly)
easy to fix. Get rid of all mention of the array ok (e.g., ok[p].wait() becomes simply wait())
and change notify() to notifyAll(). Now, whenever any philosopher's state is changed to
EATING, all blocked philosophers are awakened. Those whose states are still HUNGRY will simply go
back to sleep. (Now you see why we wrote while (state[p] != EATING) rather than if
(state[p] != EATING)). The solution is a little less efficient, but not enough to worry about. If
there were 10,000 philosophers, and if a substantial fraction of them were blocked most of the time, we
would have more to worry about, and perhaps we would have to search for a more efficient solution.
The second problem is that this solution only lets one philosopher at a time eat. The call to eat is inside
the synchronized method dine, so while a philosopher is eating, no other thread will be able to
enter the DiningRoom. The solution to this problem is to break dine into two pieces: one piece that
grabs the forks and another piece that releases them. The dine method no longer needs to be
synchronized.

public void dine(int p) {


grabForks(p);
eat(p);
releaseForks(p);
}

private synchronized void grabForks(int p) {


state[p] = HUNGRY;
test(p);
while (state[p] != EATING)
try { wait(); } catch (InterruptedException e) {}
}

private synchronized void releaseForks(int p) {


state[p] = THINKING;
test((p+4)%5);
test((p+1)%5);
}

Messages
[Silberschatz, Galvin, and Gagne, Section 4.5]
Since shared variables are such a source of errors, why not get rid of them altogether? In this section,
we assume there is no shared memory between processes. That raises a new problem. Instead of
worrying about how to keep processes from interfering with each other, we have to figure out how to
let them cooperate. Systems without shared memory provide message-passing facilities that look
something like this:

send(destination, message);
receive(source, message_buffer);
The details vary substantially from system to system.
Naming
How are destination and source specified? Each process may directly name
the other, or there may be some sort of mailbox or message queue object to be used as
the destination of a send or the source of a receive. Some systems allow
a set of destinations (called multicast and meaning ``send a copy of the message to
each destination'') and/or a set of sources, meaning ``receive a message from any one
of the sources.'' A particularly common feature is to allow source to be ``any'',
meaning that the receiver is willing to receive a message from any other process that
is willing to send a message to it.
Synchronization
Does send (or receive) block the sender, or can it immediately continue? One
common combination is non-blocking send together with blocking receive.
Another possibility is rendezvous, in which both send and receive are blocking.
Whoever gets there first waits for the other one. When a sender and matching
receiver are both waiting, the message is transferred and both are allowed to
continue.
Buffering
Are messages copied directly from the sender's memory to the receiver's memory, or are they
first copied into some sort of ``system'' memory in between?
Message Size
Is there an upper bound on the size of a message? Some systems have small, fixed-
size messages to send signals or status information and a separate facility for
transferring large blocks of data.
These design decisions are not independent. For example, non-blocking send is generally only
available in systems that buffer messages. Blocking receive is only useful if there is some way to
say ``receive from any'' or receive from a set of sources.
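As a concrete sketch of one such combination (indirect naming through a mailbox, non-blocking send, blocking receive, and buffering in between), here is a made-up Mailbox class for threads within a single Java program:

class Mailbox {
    private Vector messages = new Vector();     // the "system" buffer

    public synchronized void send(Object message) {    // never blocks
        messages.add(message);
        notifyAll();
    }

    public synchronized Object receive() {              // blocks while empty
        while (messages.isEmpty()) {
            try {
                wait();
            } catch (InterruptedException e) {
                e.printStackTrace();
            }
        }
        return messages.remove(0);
    }
}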
Message-based communication between processes is particularly attractive in distributed systems (such
as computer networks) where processes are on different computers and it would be difficult or
impossible to allow them to share memory. But it is also used in situations where processes could share
memory but the operating system designer chose not to allow sharing. One reason is to avoid the bugs that
can occur with sharing. Another is to build a wall of protection between processes that don't trust each
other. Some systems even combine message passing with shared memory. A message may include a
pointer to a region of (shared) memory. The message is used as a way of transferring ``ownership'' of
the region. There might be a convention that a process that wants to access some shared memory had to
request permission from its current owner (by sending a message).
Unix is a message-based system (at the user level). Processes do not share memory but communicate
through pipes.7 A pipe looks like an output stream connected to an input stream by a chunk of memory
used to make a queue of bytes. One process sends data to the output stream the same way it would write
data to a file, and another reads from it the way it would read from a file. In the terms outlined above,
naming is indirect (with the pipe acting as a mailbox or message queue), send (called write in Unix)
is non-blocking, while receive (called read) is blocking, and there is buffering in the operating
system. At first glance it would appear that the message size is unbounded, but it would actually be
more accurate to say each ``message'' is one byte. The amount of data sent in a write or received in a
read is unbounded, but the boundaries between writes are erased in the pipe: If the sender does three
writes of 60 bytes each and the receiver does two reads asking for 100 bytes, it will get back the first
100 bytes the first time and the remaining 80 bytes the second time.
Continued...

1
I'm using the term ``private'' informally here. The variable tmp is not a field of a class but rather a
local variable, so it cannot be declared public, private, etc. It is ``private'' only in the sense that no
other thread has any way of getting to this object.
2
Note that this method is deprecated, which means you should never use it!
3
In the original definition of semaphores, the up and down operations were called V() and P(),
respectively, but people had trouble remembering which was which. Some books call them signal
and wait, but we will be using those names for other operations later.
4
Remember, this is not really legal Java. We will show a Java solution later.
5
Monitors are not available in this form in Java. We are using Java as a vehicle for illustrating various
ideas present in other languages. See the discussion of monitors later for a similar feature that is
available in Java.
6
The Java language specification says that if any threads are blocked on wait() in an object, a
notify in that object will wake up exactly one thread. It does not say that it has to be any particular
thread, such as the one that waited the longest. In fact, some Java implementations actually wake up the
thread that has been waiting the shortest time!
7
There are so many versions of Unix that just about any blanket statement about Unix is sure to be a
lie. Some versions of Unix allow memory to be shared between processes, and some have other ways
for processes to communicate other than pipes.



CS 537
Lecture Notes Part 4
Processes and Synchronization, Continued
Deadlock
Contents
• Terminology
• Deadlock Detection
• Deadlock Recovery
• Deadlock Prevention
• Deadlock Avoidance

Using Processes (Continued)

Deadlock
[Silberschatz, Galvin, and Gagne, Chapter 8]

Terminology
The Dining Philosophers problem isn't just a silly exercise. It is a scale-model example of a very
important problem in operating systems: resource allocation. A ``resource'' can be defined as
something that costs money. The philosophers represent processes, and the forks represent resources.
There are three kinds of resources:
• sharable
• serially reusable
• consumable
Sharable resources can be used by more than one process at a time. A consumable resource can only be
used by one process, and the resource gets ``used up.'' A serially reusable resource is in between. Only
one process can use the resource at a time, but once it's done, it can give it back for use by another
process. Examples are the CPU and memory. These are the most interesting type of resource. We won't
say any more about the other kinds.
A process requests a (serially reusable) resource from the OS and holds it until it's done with it; then it
releases the resource. The OS may delay responding to a request for a resource. The requesting process
is blocked until the OS responds. Sometimes we say the process is ``blocked on the resource.'' In actual
systems, resources might be represented by semaphores, monitors, or condition variables in monitors--
anything a process may wait for.
A resource might be preemptable, meaning that the resource can be ``borrowed'' from the process
without harm. Sometimes a resource can be made preemptable by the OS, at some cost. For example,
memory can be preempted from a process by suspending the process, and copying the contents of the
memory to disk. Later, the data is copied back to the memory, and the process is allowed to continue.
Preemption effectively makes a serially reusable resource look sharable.
There are three ways of dealing with deadlocks: detection and recovery, prevention, or avoidance.
Deadlock Detection
[Silberschatz, Galvin, and Gagne, Section 8.6]
The formal definition of deadlock:
A set of processes is deadlocked if each process in the set is waiting for an event
that only a process in the set can cause.
We can show deadlock graphically by building the waits-for graph. Draw each process as a little circle,
and draw an arrow from P to Q if P is waiting for Q. The picture is called a graph, the little circles are
called nodes, and the arrows connecting them are called arcs [Silberschatz, Galvin, and Gagne, Figure
8.10(b)]. We can find out whether there is a deadlock as follows:

for (;;) {
find a node n with no arcs coming out of it;
if (no such node can be found)
break;
erase n and all arcs coming into it;
}
if (any nodes are left)
there is a deadlock;
This algorithm simulates a best-case scenario: Every runnable process runs and causes all events that
are expected from it, and no process waits for any new events. A node with no outgoing arcs represents
a process that isn't waiting for anything, so is runnable. It causes all events other processes are waiting
for (if any), thereby erasing all incoming arcs. Then, since it will never wait for anything, it cannot be
part of a deadlock, and we can erase it.
Any processes that are left at the end of the algorithm are deadlocked, and will wait forever. The graph
that's left must contain a cycle (a path starting and ending at the same node and following the arcs). It
may also contain processes that are not part of the cycle but are waiting for processes in the cycle, or
for processes waiting for them, etc. The algorithm will never erase any of the nodes in a cycle, since
each one will always have an outgoing arc pointing to the next node in the cycle.
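Here is a straightforward Java rendering of this algorithm, assuming (purely for illustration) that the graph is stored as a boolean adjacency matrix waitsFor, with waitsFor[p][q] true when process p is waiting for process q:

boolean hasDeadlock(boolean[][] waitsFor) {
    int n = waitsFor.length;
    boolean[] erased = new boolean[n];
    boolean progress = true;
    while (progress) {
        progress = false;
        for (int p = 0; p < n; p++) {
            if (erased[p]) continue;
            // Does p still have an outgoing arc (is it waiting for anyone)?
            boolean waiting = false;
            for (int q = 0; q < n; q++) {
                if (!erased[q] && waitsFor[p][q]) {
                    waiting = true;
                    break;
                }
            }
            if (!waiting) {
                erased[p] = true;   // erase p and all arcs coming into it
                progress = true;
            }
        }
    }
    for (int p = 0; p < n; p++) {
        if (!erased[p]) return true;    // somebody is left over: deadlock
    }
    return false;
}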
The simplest cycle is an arc from a node to itself. This represents a process that is waiting for itself, and
usually represents a simple programming bug:

Semaphore s = 0;
...
s.down();
s.up();
If no other process can do s.up(), this process is deadlocked with itself.
Usually, processes block waiting for (serially reusable) resources. The ``events'' they are waiting for are
release of resources. In this case, we can put some more detail into the graph. Add little boxes
representing resources. Draw an arc from a process to a resource if the process is waiting for the
resource, and an arc from the resource to the process if the process holds the resource. The same
algorithm as before will tell whether there is a deadlock. As before, deadlock is associated with cycles:
If there is no cycle in the original graph, there is no deadlock, and the algorithm will erase everything.
If there is a cycle, the algorithm will never erase any part of it, and the final graph will contain only
cycles and nodes that have paths from them to cycles.
Resource Types
[Silberschatz, Galvin, and Gagne, Section 8.2.2]
Often, a request from a process is not for a particular resource, but for any resource of a given type. For
example, a process may need a block of memory. It doesn't care which block of memory it gets. To
model this, we will assume there there some number m of resource types, and some number U[r] of
units of resource r, for each r between 1 and m. To be very general, we will allow a process to request
multiple resources at once: Each request will tell now many units of each resource the process needs to
continue. The graph gets pretty hard to draw [Silberschatz, Galvin, and Gagne, Figure 8.1], but
essentially the same algorithm can be used to determine whether there is a deadlock. We will need a
few arrays for bookkeeping.

U[r] = total number of units of resource r in the system


curAlloc[p][r] = number of units of r currently allocated to process p
available[r] = number of units of r that have not been allocated to any process
request[p][r] = number of units of r requested by p but not yet allocated
As before, the algorithm works by simulating a best-case scenario. We add an array of boolean
done[] with one element for each process, and initially set all elements to false. In this, and later
algorithms, we will want to compare arrays of numbers. If A and B are arrays, we say that A <= B if
A[i] <= B[i] for all subscripts i.1

boolean lessOrEqual(int[] a, int[] b) {


for (int i=0; i<a.length; i++)
if (a[i] > b[i]) return false;
return true;
}
Similarly, when we add together two arrays, we add them element by element. The following methods
increment or decrement each element of one array by the corresponding element of the second.

void incr(int[] a, int[] b) {


for (int i=0; i<a.length; i++)
a[i] += b[i];
}
void decr(int[] a, int[] b) {
for (int i=0; i<a.length; i++)
a[i] -= b[i];
}
We will sometimes need to make a temporary copy of an array

int[] copy(int[] a) {
return (int[])a.clone();
}
int[][] copy(int[][] a) {
int[][] b = new int[a.length][];
for (int i = 0; i < a.length; i++)
b[i] = copy(a[i]);
return b;
}
Finally, note that request is a two dimensional array, but for any particular value of p,
request[p] is a one-dimensional array rp corresponding to the pth row of request and
representing the current allocation state of process p: For each resource r, rp[r] =
request[p][r] = the amount of resource r requested by process p. Similar remarks apply to
curAlloc and other two-dimensional arrays we will introduce later.
With this machinery in place, we can easily write a procedure to test for deadlock.

/** Check whether the state represented by request[][] and the


** global arrays curAlloc[][] and available[] is deadlocked.
** Return true if there is a deadlock.
*/
boolean deadlocked(int[][] request) {
int[] save = copy(available);
boolean[] done = new boolean[numberOfProcesses];
for (int i = 0; i < done.length; i++)
done[i] = false;
for (int i = 0; i < numberOfProcesses; i++) {
// Find a process that hasn't finished yet, but
// can get everything it needs.
int p;
for (p = 0; p < numberOfProcesses; p++) {
if (!done[p] && lessOrEqual(request[p], available))
break;
}
if (p == numberOfProcesses) {
// No process can continue. There is a deadlock
available = save;
return true;
}
// Assume process p finishes and gives back everything it has
// allocated.
incr(available, curAlloc[p]);
done[p] = true;
}
available = save;
return false;
}
The algorithm looks for a process whose request can be satisfied immediately. If it finds one, it assumes
that the process could be given all the resources it wants, would do what ever it wanted with them, and
would eventually give them back, as well as all the resources it previously got. It can be proved that it
doesn't matter what order we consider the processes; either we succeed in completing them, one at a
time, or there is a deadlock.
How expensive is this algorithm? Let n denote the number of processes and m denote the number of
resources. The body of the third for loop (the line containing the call to lessOrEqual) is executed at
most n² times and each call requires m comparisons. Thus the entire method may make up to n²m
comparisons. Everything else in the procedure has a lower order of complexity, so the running time of the
procedure is O(n²m). If there are 100 processes and 100 resources, n²m = 1,000,000, so if each iteration
takes about a microsecond (a reasonable guess on current hardware), the procedure will take about a
second. If, however, the number of processes and resources each increase to 1000, the running time
would be more like 1000 seconds (16 2/3 minutes)! We might want to use a more clever coding in such
a situation.
Deadlock Recovery
Once you've discovered that there is a deadlock, what do you do about it? One thing to do is simply re-
boot. A less drastic approach is to yank back a resource from a process to break a cycle. As we saw, if
there are no cycles, there is no deadlock. If the resource is not preemptable, snatching it back from a
process may do irreparable harm to the process. It may be necessary to kill the process, under the
principle that at least that's better than crashing the whole system.
Sometimes, we can do better. For example, if we checkpoint a process from time to time, we can roll it
back to the latest checkpoint, hopefully to a time before it grabbed the resource in question. Database
systems use checkpoints, as well as a technique called logging, allowing them to run processes
``backwards,'' undoing everything they have done. It works like this: Each time the process performs an
action, it writes a log record containing enough information to undo the action. For example, if the
action is to assign a value to a variable, the log record contains the previous value of the variable. When a
database discovers a deadlock, it picks a victim and rolls it back.
Rolling back processes involved in deadlocks can lead to a form of starvation, if we always choose the
same victim. We can avoid this problem by always choosing the youngest process in a cycle. After
being rolled back enough times, a process will grow old enough that it never gets chosen as the victim--
at worst by the time it is the oldest process in the system. If deadlock recovery involves killing a
process altogether and restarting it, it is important to mark the ``starting time'' of the reincarnated
process as being that of its original version, so that it will look older than new processes started since
then.
When should you check for deadlock? There is no one best answer to this question; it depends on the
situation. The most ``eager'' approach is to check whenever we do something that might create a
deadlock. Since a process cannot create a deadlock when releasing resources, we only have to check on
allocation requests. If the OS always grants requests as soon as possible, a successful request also
cannot create a deadlock. Thus we only have to check for a deadlock when a process becomes
blocked because it made a request that cannot be immediately granted. However, even that may be too
frequent. As we saw, the deadlock-detection algorithm can be quite expensive if there are a lot of
processes and resources, and if deadlock is rare, we can waste a lot of time checking for deadlock every
time a request has to be blocked.
What's the cost of delaying detection of deadlock? One possible cost is poor CPU utilization. In an
extreme case, if all processes are involved in a deadlock, the CPU will be completely idle. Even if there
are some processes that are not deadlocked, they may all be blocked for other reasons (e.g. waiting for
I/O). Thus if CPU utilization drops, that might be a sign that it's time to check for deadlock. Besides, if
the CPU isn't being used for other things, you might as well use it to check for deadlock!
On the other hand, there might be a deadlock, but enough non-deadlocked processes to keep the system
busy. Things look fine from the point of view of the OS, but from the selfish point of view of the
deadlocked processes, things are definitely not fine. The processes may represent interactive users,
who can't understand why they are getting no response. Worse still, they may represent time-critical
processes (missile defense, factory control, hospital intensive care monitoring, etc.) where something
disastrous can happen if the deadlock is not detected and corrected quickly. Thus another reason to
check for deadlock is that a process has been blocked on a resource request ``too long.'' The definition
of ``too long'' can vary widely from process to process. It depends both on how long the process can
reasonably expect to wait for the request, and how urgent the response is. If an overnight run deadlocks
at 11pm and nobody is going to look at its output until 9am the next day, it doesn't matter whether the
deadlock is detected at 11:01pm or 8:59am. If all the processes in a system are sufficiently similar, it
may be adequate simply to check for deadlock at periodic intervals (e.g., once every 5 minutes in a batch
system; once every millisecond in a real-time control system).
Deadlock Prevention
There are four necessary conditions for deadlock.
1. Mutual Exclusion. Resources are not sharable.
2. Non-preemption. Once a resource is given to a process, it cannot be revoked until the process
voluntarily gives it up.
3. Hold/Wait. It is possible for a process that is holding resources to request more.
4. Cycles. It is possible for there to be a cyclic pattern of requests.
It is important to understand that all four conditions are necessary for deadlock to occur. Thus we can
prevent deadlock by removing any one of them.
There's not much hope of getting rid of condition (1)--some resources are inherently non-sharable--but
attacking (2) can be thought of as a weak form of attack on (1). By borrowing back a resource when
another process needs to use it, we can make it appear that the two processes are sharing it.
Unfortunately, not all resources can be preempted at an acceptable cost. Deadlock recovery, discussed
in the previous section, is an extreme form of preemption.
We can attack condition (3) either by forcing a process to allocate all the resources it will ever need at
startup time, or by making it release all of its resources before allocating any more. The first approach
fails if a process needs to do some computing before it knows what resources it needs, and even if it is
practical, it may be very inefficient, since a process that grabs resources long before it really needs
them may prevent other processes from proceeding. The second approach (making a process release
resources before allocating more) is in effect a form of preemption and may be impractical for the same
reason preemption is impractical.
An attack on the fourth condition is the most practical. The algorithm is called hierarchical allocation.
If resources are given numbers somehow (it doesn't matter how the numbers are assigned), and
processes always request resources in increasing order, deadlock cannot occur.
Proof.
As we have already seen, a cycle in the waits-for graph is necessary for there to be
deadlock. Suppose there is a deadlock, and hence a cycle. A cycle consists of
alternating resources and processes. As we walk around the cycle, following the
arrows, we see that each process holds the resource preceding it and has requested the
one following it. Since processes are required to request resources in increasing
order, that means the numbers assigned to the resources must be increasing as we go
around the cycle. But it is impossible for the numbers to keep increasing all the way
around the cycle; somewhere there must be a drop. Thus we have a contradiction: Either
some process violated the rule on requesting resources, or there is no cycle, and
hence no deadlock.
More precisely stated, the hierarchical allocation algorithm is as follows:
When a process requests resources, the requested resources must all have numbers
strictly greater than the number of any resource currently held by the process.
This algorithm will work even if some of the resources are given the same number. In fact, if they are
all given the same number, this rule reduces to the ``no-hold-wait'' condition, so hierarchical allocation
can also be thought of as a relaxed form of the no-hold-wait condition.
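Here is a small sketch of how the rule might be checked when a request arrives (the array and method names are illustrative):

/** Returns true if the request obeys the hierarchical-allocation rule: every
 ** requested resource must have a number strictly greater than the number of
 ** every resource the process currently holds. */
boolean obeysHierarchy(int[] heldResourceNumbers, int[] requestedResourceNumbers) {
    int maxHeld = Integer.MIN_VALUE;
    for (int h : heldResourceNumbers)
        maxHeld = Math.max(maxHeld, h);
    for (int r : requestedResourceNumbers)
        if (r <= maxHeld)
            return false;   // would violate the increasing-order rule
    return true;
}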
These ideas can be applied to the Dining Philosophers problem. Dijkstra's solution to the dining
philosophers problem gets rid of hold-wait. The mutex semaphore allows a philosopher to pick up
both forks ``at once.'' Another algorithm would have a philosopher pick up one fork and then try to get
the other one. If he can't, he puts down the first fork and starts over. This is a solution using preemption.
It is not a very good solution (why not?).
If each philosopher always picks up the lower numbered fork first, there cannot be any deadlock. This
algorithm is an example of hierarchical allocation. It is better than Dijkstra's solution because it
prevents starvation. (Can you see why starvation is impossible?) The forks don't have to be numbered 0
through 4; any numbering that doesn't put any philosopher between two forks with the same number
would do. For example, we could assign the value 0 to fork 0, 1 to all other even-numbered forks, and 2
to odd-numbered forks. (One numbering is better than the other. Can you see why?)
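Here is a sketch of the ordered-fork version of the philosopher loop; it assumes an array fork of n semaphores (each initialized to 1) and think/eat routines, as in the earlier dining-philosophers discussion:

/** Philosopher i under hierarchical allocation: always pick up the
 ** lower-numbered of the two forks first. */
void philosopher(int i) {
    int left = i, right = (i + 1) % n;
    int first = Math.min(left, right);    // the lower-numbered fork
    int second = Math.max(left, right);
    for (;;) {
        think();
        fork[first].down();
        fork[second].down();
        eat();
        fork[second].up();
        fork[first].up();
    }
}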
Deadlock Avoidance

The final approach we will look at is called deadlock avoidance. In this approach, the OS may delay
granting a resource request, even when the resources are available, because doing so will put the system
in an unsafe state where deadlock may occur later. The best-known deadlock avoidance algorithm is
called the ``Banker's Algorithm,'' invented by the famous E. W. Dijkstra.
This algorithm can be thought of as yet another relaxation of the no-hold-wait restriction. Processes
do not have to allocate all their resources at the start, but they have to declare an upper bound on the
amount of resources they will need. In effect, each process gets a ``line of credit'' that it can draw on
when it needs it (hence the name of the algorithm).
When the OS gets a request, it ``mentally'' grants the request, meaning that it updates its data structures
to indicate it has granted the request, but does not immediately let the requesting process proceed. First
it checks to see whether the resulting state is ``safe''. If not, it undoes the allocation and keeps the
requester waiting.
To check whether the state is safe, it assumes the worst case: that all running processes immediately
request all the remaining resources that their credit lines allow. It then checks for deadlock using the
algorithm above. If deadlock occurs in this situation, the state is unsafe, and the resource allocation
request that led to it must be delayed.
To implement this algorithm in Java, we will need one more table beyond those defined above.

creditLine[p][r] = number of units of r reserved by process p but not yet allocated to it


Here's the procedure:

/** Try to satisfy a particular request in the state indicated by the
 ** global arrays curAlloc, creditLine, and available.
 ** If the request can be safely granted, update the global state
 ** appropriately and return true.
 ** Otherwise, leave the state unchanged and return false.
 */
boolean tryRequest(int p, int[] req) {
    if (!lessOrEqual(req, creditLine[p])) {
        System.out.println("process " + p
            + " is requesting more than it reserved!");
        return false;
    }
    if (!lessOrEqual(req, available)) {
        System.out.println("process " + p
            + " is requesting more than there is available!");
        return false;
    }
    int[] saveAvail = copy(available);
    int[][] saveAlloc = copy(curAlloc);
    int[][] saveLine = copy(creditLine);

    // Tentatively give him what he wants
    decr(available, req);
    decr(creditLine[p], req);
    incr(curAlloc[p], req);

    if (safe()) {
        return true;
    }
    else {
        curAlloc = saveAlloc;
        available = saveAvail;
        creditLine = saveLine;
        return false;
    }
}

/** Check whether the current state is safe. */
boolean safe() {
    // Assume everybody immediately calls in their credit.
    int[][] request = copy(creditLine);

    // See whether that causes a deadlock.
    return !deadlocked(request);
}
When a process p starts, creditLine[p][r] is set to p's declared maximum claim on resource r.
Whenever p is granted some resource, not only is the amount deducted from available, it is also
deducted from creditLine.
When a new request arrives, we first see if it is legal (it does not exceed the requesting process'
declared maximum allocation for any resources), and if we have enough resources to grant it. If so, we
tentatively grant it and see whether the resulting state is safe. To see whether a state is safe, we consider
a ``worst-case'' scenario. What if all processes suddenly requested all the resources remaining in their
credit lines? Would the system deadlock? If so, the state is unsafe, so we reject the request and
``ungrant'' it.
The code written here simply rejects requests that cannot be granted because they would lead to an
unsafe state or because there are not enough resources available. A more complete version would record
such requests and block the requesting processes. Whenever another process released some resources,
the system would update the state accordingly and reconsider all the blocked processes to see whether it
could safely grant the request of any of them.

An Example
A system has three classes of resource: A, B, and C. Initially, there are 8 units of A and 7 units each of
resources B and C. In other words, the array U above has the value { 8, 7, 7 }. There are five
processes that have declared their maximum demands, and have been allocated some resources as
follows:
Process    Maximum Demand    Current Allocation
             A   B   C          A   B   C
   1         4   3   6          1   1   0
   2         0   4   4          0   2   1
   3         4   2   2          1   1   1
   4         1   6   3          0   0   2
   5         7   3   2          2   1   0
(The table Current Allocation is the array curAlloc in the Java program.)
To run the Banker's Algorithm, we need to know the amount of remaining credit available for each
process (creditLine[p][r]), and the amount of resources left in the bank after the allocations
(available[r]). The credit line for a process and resource type is computed by subtracting the
current allocation for that process and resource from the corresponding maximum demand.
Process    Remaining Credit
             A   B   C
   1         3   2   6
   2         0   2   3
   3         3   1   1
   4         1   6   1
   5         5   2   2
The value available[r] is calculated by subtracting from U[r] the sum of the rth column of
curAlloc: available = { 4, 2, 3 }.
If process 4 were to request two units of resource C, the request would be rejected as an error because
process 4 initially declared that it would never need more than 3 units of C and it has already been
granted 2.
A request of five units of resource A by process 5 would be delayed, even though it falls within his
credit limit, because 4 of the original 8 units of resource A have already been allocated, leaving only 4
units remaining.
Suppose process 1 were to request 1 unit each of resources B and C. To see whether this request is safe,
we grant the request by subtracting it from process 1's remaining credit and adding it to his current
allocation, yielding
Process    Current Allocation    Remaining Credit
             A   B   C              A   B   C
   1         1   2   1              3   1   5
   2         0   2   1              0   2   3
   3         1   1   1              3   1   1
   4         0   0   2              1   6   1
   5         2   1   0              5   2   2
We also have to subtract the allocation from the amount available, yielding available = { 4, 1, 2 }.
To see whether the resulting state is safe, we treat the Remaining Credit array as a Request array
and check for deadlock. We note that the amounts in available are not enough to satisfy the
request of process 1 because it wants 5 more units of C and we have only 2. Similarly, we cannot
satisfy 2, 4, or 5 because we have only one unit remaining of B and they all want more than that.
However, we do have enough to grant 3's request. Therefore, we assume that we will give process
3 its request, and it will finish and return those resources, along with the remaining resources
previously allocated to it, and we will increase our available holdings to { 5, 2, 3 }. Now we
can satisfy the request of either 2 or 5. Suppose we choose 2 (it doesn't matter which process we
choose first). After 2 finishes we will have { 5, 4, 4 } and after 5 finishes, our available
will increase to { 7, 5, 4 }. However, at this point, we do not have enough to satisfy the
request of either of the remaining processes 1 or 4, so we conclude that the system is deadlocked,
so the original request was unsafe.
If the original request (1 unit each of B and C) came from process 2 rather than 1, however, the
state would be found to be safe (try it yourself!) and so it would be granted immediately.
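As a sketch of how the two scenarios just described map onto the code (assuming the global arrays and the helper routines used by tryRequest are the ones defined earlier in these notes, and remembering that the Java arrays are indexed from 0 while the tables number processes from 1):

void runExample() {
    // The state from the tables above; row 0 is process 1, row 1 is process 2, etc.
    available  = new int[]   { 4, 2, 3 };
    curAlloc   = new int[][] { {1,1,0}, {0,2,1}, {1,1,1}, {0,0,2}, {2,1,0} };
    creditLine = new int[][] { {3,2,6}, {0,2,3}, {3,1,1}, {1,6,1}, {5,2,2} };

    // Process 1 asks for one unit each of B and C: the resulting state is unsafe.
    System.out.println(tryRequest(0, new int[] { 0, 1, 1 }));  // prints false

    // The same request from process 2 leaves the state safe, so it is granted.
    System.out.println(tryRequest(1, new int[] { 0, 1, 1 }));  // prints true
}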

1
Note that, unlike numbers, it is possible to have arrays A and B such that neither A <= B nor B
<= A. This will happen if some of the elements of A are smaller than the corresponding elements
of B and some are bigger.



CS 537
Lecture Notes Part 5
Processes and Synchronization, Continued
Implementation of Processes
Contents
• Implementing Monitors
• Implementing Semaphores
• Implementing Critical Sections
• Short-term Scheduling

Implementing Processes
We presented processes from the ``user's'' point of view bottom-up: starting with the process concept,
then introducing semaphores as a way of synchronizing processes, and finally adding a higher-level
synchronization facility in the form of monitors. We will now explain how to implement these things in
the opposite order, starting with monitors, and finishing with the mechanism for making processes run.
Some text books make a big deal out of showing that various synchronization primitives are equivalent
to each other. While this is true, it kind of misses the point. It is easy to implement semaphores with
monitors,

class Semaphore {
    private int value;
    public Semaphore(int initialValue) { value = initialValue; }
    public synchronized void up() { value++; notify(); }
    public synchronized void down() {
        while (value == 0) wait();
        value--;
    }
}
but that's not the way it usually works. Normally, semaphores (or something very like them) are
implemented using lower level facilities, and then they are used to implement monitors.
Implementing Monitors
Since monitors are a language feature, they are implemented with the help of a compiler. In response to
the keywords monitor, condition, signal, wait, and notify, the compiler inserts little bits
of code here and there in the program. We will not worry about how the compiler manages to do that,
but only concern ourselves with what the code is and how it works.
The monitor keyword in ``standard'' monitors says that there should be mutual exclusion between the
methods of the monitor class (the effect is similar to making every method a synchronized
method in Java). Thus the compiler creates a semaphore mutex initialized to 1 and adds
mutex.down();
to the head of each method. It also adds a chunk of code that we call exit (described below) to each
place where a method may return--at the end of the procedure, at each return statement, at each point
where an exception may be thrown, at each place where a goto might leave the procedure (if the
language has gotos), etc. Finding all these return points can be tricky in complicated procedures,
which is why we want the compiler to help us out.
When a process signals or notifies a condition variable on which some other process is waiting,
we have a problem: We can't let both of the processes continue immediately, since that would violate
the cardinal rule that there may never be more than one process active in methods of the same monitor
object at the same time. Thus we must block one of the processes: the signaller in the case of signal
and the waiter in the case of notify. We will first show how signal-style monitors are
implemented, and later show the (simpler) solution for notify-style monitors.
When a process calls signal, it blocks itself on a semaphore we will call highPriority since
processes blocked on it are given preference over processes blocked on mutex trying to get in ``from
the outside.'' We will also need to know whether any process is waiting for this semaphore. Since
semaphores have no method for asking whether anybody is waiting, we will use an ordinary integer
variable highCount to keep track of the number of processes waiting for highPriority. Both
highPriority and highCount are initialized to zero.
Each condition variable c is replaced by a semaphore cSem, initialized to zero, and an integer
variable cCount, also initialized to zero. Each call c.wait() becomes

cCount++;
if (highCount > 0)
    highPriority.up();
else
    mutex.up();
cSem.down();
cCount--;
Before a process blocks on a condition variable, it lets some other process go ahead, preferably one
waiting on the highPriority semaphore.
The operation c.signal() becomes

if (cCount > 0) {
    highCount++;
    cSem.up();
    highPriority.down();
    highCount--;
}
Notice that a signal of a condition that is not awaited has no effect, and that a signal of a
condition that is awaited immediately blocks the signaller.
Finally, the code for exit which is placed at every return point, is

if (highCount > 0)
    highPriority.up();
else
    mutex.up();
Note that this is the code for c.wait() with the code manipulating cCount and cSem deleted.
If a signal call is the very last thing before a return, the operations on highPriority and
highCount may be deleted. If all calls of signal are at return points (not an unusual situation),
highPriority and highCount can be deleted altogether, along with all code that mentions them.
The variables highCount and cCount are ordinary integer variables, so there can be problems if
two or more processes try to access them at the same time. The code here is carefully written so that a
process only inspects or changes one of these variables before calling up() on any semaphore that
would allow another process to become active inside the monitor.
In systems that use notify (such as Java), c.notify() is replaced by

if (cCount > 0) {
    cCount--;
    cSem.up();
}
In these systems, the code for c.wait() also has to be modified to delay waking up until the
notifying process has blocked itself or left the monitor. One way to do this would be for the process to
call highPriority.down immediately after waking up from a wait. A simpler solution (the one
actually used in Java) is to get rid of the highPriority semaphore and make the waiter call
mutex.down. In summary, the code for c.wait() is

cCount++;
mutex.up();
cSem.down();
mutex.down();
and the code for exit is

mutex.up();
Note that when a language has notify instead of signal it has to implement wait and exit
differently. No system offers both signal and notify.
In summary,

source code     signal implementation           notify implementation

method start    mutex.down();                   mutex.down();

c.wait()        cCount++;                       cCount++;
                if (highCount > 0)              mutex.up();
                    highPriority.up();          cSem.down();
                else                            mutex.down();
                    mutex.up();
                cSem.down();
                cCount--;

c.signal()      if (cCount > 0) {               (not applicable)
                    highCount++;
                    cSem.up();
                    highPriority.down();
                    highCount--;
                }

c.notify()      (not applicable)                if (cCount > 0) {
                                                    cCount--;
                                                    cSem.up();
                                                }

c.notifyAll()   (not applicable)                while (cCount > 0) {
                                                    cCount--;
                                                    cSem.up();
                                                }

method exit     if (highCount > 0)              mutex.up();
                    highPriority.up();
                else
                    mutex.up();

Finally, note that we do not use the full generality of semaphores in this implementation of monitors.
The semaphore mutex only takes on the values 0 and 1 (it is a so-called binary semaphore) and the
other semaphores never have any value other than zero.
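To see the translation in action, here is a hand translation of a trivial monitor with one condition variable: a ``gate'' whose pass() method waits until open() has been called. The gate itself is just an illustrative example; the translation follows the notify-style column above and uses the Semaphore class sketched at the beginning of this section.

class GateMonitor {
    private final Semaphore mutex = new Semaphore(1);  // the monitor lock
    private final Semaphore cSem = new Semaphore(0);   // stands in for condition c
    private int cCount = 0;                            // number of waiters on c
    private boolean isOpen = false;

    // Monitor source: void pass() { while (!isOpen) c.wait(); }
    void pass() {
        mutex.down();                 // method start
        while (!isOpen) {
            cCount++;                 // translation of c.wait()
            mutex.up();
            cSem.down();
            mutex.down();
        }
        mutex.up();                   // method exit
    }

    // Monitor source: void open() { isOpen = true; c.notifyAll(); }
    void open() {
        mutex.down();                 // method start
        isOpen = true;
        while (cCount > 0) {          // translation of c.notifyAll()
            cCount--;
            cSem.up();
        }
        mutex.up();                   // method exit
    }
}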
Implementing Semaphores

A simple-minded attempt to implement semaphores might look like this:

class Semaphore {
    private int value;
    Semaphore(int v) { value = v; }
    public void down() {
        while (value == 0) {}
        value--;
    }
    public void up() {
        value++;
    }
}
There are two things wrong with this solution: First, as we have seen before, attempts to manipulate a
shared variable without synchronization can lead to incorrect results, even if the manipulation is as
simple as value++. If we had monitors, we could make the modifications of value atomic by
making the class into a monitor (or by making each method synchronized), but remember that
monitors are implemented with semaphores, so we have to implement semaphores with something even
more primitive. For now, we will assume that we have critical sections: If we bracket a section of code
with beginCS and endCS,

beginCS()
do something;
endCS()
the code will execute atomically, as if it were protected by a semaphore

mutex.down();
do something;
mutex.up();
where mutex is a semaphore initialized to 1. Of course, we can't actually use a semaphore to
implement semaphores! We will show how to implement beginCS and endCS in the next section.
The other problem with our implementation of semaphores is that it includes a busy wait. While
Semaphore.down() is waiting for value to become non-zero, it is looping, continuously testing
the value. Even if the waiting process is running on its own CPU, this busy waiting may slow down
other processes, since it is repeatedly accessing shared memory, thus interfering with accesses to that
memory by other CPU's (a shared memory unit can only respond to one CPU at a time). If there is only
one CPU, the problem is even worse: Because the process calling down() is running, another process
that wants to call up() may not get a chance to run. What we need is some way to put a process to
sleep. If we had semaphores, we could use a semaphore, but once again, we need something more
primitive.
For now, let us assume that there is a data structure called a PCB (short for ``Process Control Block'')
that contains information about a process, and a procedure swapProcess that takes a pointer to a
PCB as an argument. When swapProcess(pcb) is called, the state of the currently running process (the
one that called swapProcess) is saved in pcb and the CPU starts running the process whose state
was previously stored in pcb instead. Given beginCS, endCS, and swapProcess, the complete
implementation of semaphores is quite simple (but very subtle!).

class Semaphore {
    private PCBqueue waiters;       // processes waiting for this Semaphore
    private int value;              // if negative, number of waiters

    private static PCBqueue ready;  // list of all processes ready to run

    Semaphore(int initialValue) { value = initialValue; }

    public void down() {
        beginCS();
        value--;
        if (value < 0) {
            // The current process must wait.
            // Find some other process to run. The ready list must
            // be non-empty or there is a global deadlock.
            PCB pcb = ready.removeElement();

            swapProcess(pcb);

            // Now pcb contains the state of the process that called
            // down(), and the currently running process is some
            // other process.
            waiters.addElement(pcb);
        }
        endCS();
    }
    public void up() {
        beginCS();
        value++;
        if (value <= 0) {
            // The value was previously negative, so there is
            // some process waiting. We must wake it up.
            PCB pcb = waiters.removeElement();
            ready.addElement(pcb);
        }
        endCS();
    }
} // Semaphore
The implementation of swapProcess is ``magic'':

/* This procedure is probably really written in assembly language,
 * but we will describe it in Java. Assume the CPU's current
 * stack-pointer register is accessible as "CPU.sp".
 */
void swapProcess(PCB pcb) {
    int newSP = pcb.savedSP;
    pcb.savedSP = CPU.sp;
    CPU.sp = newSP;
}
As we mentioned earlier, each process has its own stack with a stack frame for each procedure that
process has called but not yet completed. Each stack frame contains, at the very least, enough
information to implement a return from the procedure: the address of the instruction that called the
procedure, and a pointer to the caller's stack frame. Each CPU devotes one of its registers (call it SP) to
point to the current stack frame of the process it is currently running. When the CPU encounters a
return statement, it reloads its SP and PC (program counter) registers from the stack frame. An
approximate description in pseudo-Java might be something like this.

class StackFrame {
    int callersSP;
    int callersPC;
}
class CPU {
    static StackFrame sp;          // the current stack pointer
    static InstructionAddress pc;  // the program counter
}

// Here's how to do a "return"
register InstructionAddress rtn = CPU.sp.callersPC;
CPU.sp = CPU.sp.callersSP;
goto rtn;
(of course, there isn't really a goto statement in Java, and this would all be done in the hardware or a
sequence of assembly language statements).
Suppose process P0 calls swapProcess(pcb), where pcb.savedSP points to a stack frame
representing a call of swapProcess by some other process P1. The call to swapProcess creates a
frame on P0's stack and makes SP point to it. The second statement of swapProcess saves a pointer
to that stack frame in pcb. The third statement then loads SP with a pointer to P1's stack frame for
swapProcess. Now, when the procedure returns, it will be a return to whatever procedure called
swapProcess in process P1.
Implementing Critical Sections
The final piece in the puzzle is to implement beginCS and endCS. There are several ways of doing
this, depending on the hardware configuration. First suppose there are multiple CPU's accessing a
single shared memory unit. Generally, the memory or bus hardware serializes requests to read and write
memory words. For example, if two CPU's try to write different values to the same memory word at the
same time, the net result will be one of the two values, not some combination of the values. Similarly, if
one CPU tries to read a memory word at the same time another modifies it, the read will return either
the old or new value--it will not see a ``half-changed'' memory location. Surprisingly, that is all the
hardware support we need to implement critical sections.
The first solution to this problem was discovered by the Dutch mathematician T. Dekker. A simpler
solution was later discovered by Gary Peterson. Peterson's solution looks deceptively simple. To see
how tricky the problem is, let us look at a couple of simpler--but incorrect--solutions. For now, we will
assume there are only two processes, P0 and P1. The first idea is to have the processes take turns.

shared int turn; // 0 or 1

void beginCS(int i) { // process i's version of beginCS
    while (turn != i) { /* do nothing */ }
}
void endCS(int i) { // process i's version of endCS
    turn = 1 - i; // give the other process a chance.
}
This solution is certainly safe, in that it never allows both processes to be in their critical sections at the
same time. The problem with this solution is that it is not live. If process P0 wants to enter its critical
section and turn == 1, it will have to wait until process P1 decides to enter and then leave its
critical section. Since we will only use critical sections to protect short operations (see the
implementation of semaphores above), it is reasonable to assume that a process that has done
beginCS will soon do endCS, but the converse is not true: There's no reason to assume that the other
process will want to enter its critical section any time in the near future (or even at all!).
To get around this problem, a second attempt to solve the problem uses a shared array critical to
indicate which processes are in their critical sections.

shared boolean critical[] = { false, false };

void beginCS(int i) {
    critical[i] = true;
    while (critical[1 - i]) { /* do nothing */ }
}
void endCS(int i) {
    critical[i] = false;
}
This solution is unfortunately prone to deadlock. If both processes set their critical flags to true
at the same time, they will each loop forever, waiting for the other process to go ahead. If we switch the
order of the statements in beginCS, the solution becomes unsafe. Both processes could check each
other's critical states at the same time, see that they were false, and enter their critical sections.
Finally, if we change the code to

void beginCS(int i) {
    critical[i] = true;
    while (critical[1 - i]) {
        critical[i] = false;
        /* perhaps sleep for a while */
        critical[i] = true;
    }
}
livelock can occur. The processes can get into a loop in which each process sets its own critical
flag, notices that the other critical flag is true, clears its own critical flag, and repeats.
Peterson's (correct) solution combines ideas from both of these attempts. Like the second ``solution,''
each process signals its desire to enter its critical section by setting a shared flag. Like the first
``solution,'' it uses a turn variable, but it only uses it to break ties.

shared int turn;
shared boolean critical[] = { false, false };

void beginCS(int i) {
    critical[i] = true;     // let other guy know I'm trying
    turn = 1 - i;           // be nice: let him go first
    while (
        critical[1-i]       // the other guy is trying
        && turn != i        // and he has precedence
    ) { /* do nothing */ }
}
void endCS(int i) {
    critical[i] = false;    // I'm done now
}
Peterson's solution, while correct, has some drawbacks. First, it employs a busy wait (sometimes called
a spin lock) which is bad for reasons suggested above. However, if critical sections are only used to
protect very short sections of code, such as the down and up operations on semaphores as above, this
isn't too bad a problem. Two processes will only rarely attempt to enter their critical sections at the
same time, and even then, the loser will only have to ``spin'' for a brief time. A more serious problem is
that Peterson's solution only works for two processes. Next, we present three solutions that work for
arbitrary numbers of processes.
Most computers have additional hardware features that make the critical section easier to solve. One
such feature is a ``test and set'' instruction that sets a memory location to a given value and at the same
time records in the CPU's unshared state information about the location's previous value. For example,
the old value might be loaded into a register, or a condition code might be set to indicate whether the
old value was zero. Here is a version using Java-like syntax:

shared boolean lock = false; // true if any process is in its CS

void beginCS() { // same for all processes
    for (;;) {
        boolean key = testAndSet(lock);
        if (!key)
            return;
    }
}
void endCS() {
    lock = false;
}
Some other computers have a swap instruction that swaps the value in a register with the contents of a
shared memory word.

shared boolean lock = false; // true if any process is in its CS

void beginCS() { // same for all processes
    boolean key = true;
    for (;;) {
        swap(key, lock);
        if (!key)
            return;
    }
}
void endCS() {
    boolean key = false;
    swap(key, lock);
}
The problem with both of these solutions is that they do not necessarily prevent starvation. If several
processes try to enter their critical sections at the same time, only one will succeed (safety) and the
winner will be chosen in a bounded amount of time (liveness), but the winner is chosen essentially
randomly, and there is nothing to prevent one process from winning all the time. The ``bakery
algorithm'' of Leslie Lamport solves this problem. When a process wants to get service, it takes a ticket.
The process with the lowest numbered ticket is served first. The process id's are used to break ties.

static final int N = ...; // number of processes

shared boolean choosing[] = { false, false, ..., false };
shared int ticket[] = { 0, 0, ..., 0 };

void beginCS(int i) {
    choosing[i] = true;
    ticket[i] = 1 + max(ticket[0], ..., ticket[N-1]);
    choosing[i] = false;
    for (int j = 0; j < N; j++) {
        while (choosing[j]) { /* nothing */ }
        while (ticket[j] != 0
               && (ticket[j] < ticket[i]
                   || (ticket[j] == ticket[i] && j < i))
        ) { /* nothing */ }
    }
}
void endCS(int i) {
    ticket[i] = 0;
}
Finally, we note that all of these solutions to the critical-section problem assume multiple CPU's
sharing one memory. If there is only one CPU, we cannot afford to busy-wait. However, the good news
is that we don't have to. All we have to do is make sure that the short-term scheduler (to be discussed in
the next section) does not switch processes while a process is in a critical section. One way to do this is
simply to block interrupts. Most computers have a way of preventing interrupts from occurring. It can
be dangerous to block interrupts for an extended period of time, but it's fine for very short critical
sections, such as the ones used to implement semaphores. Note that a process that blocks on a
semaphore does not need mutual exclusion the whole time it's blocked; the critical section is only long
enough to decide whether to block.
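On a uniprocessor, then, beginCS and endCS can be little more than the following sketch (disableInterrupts and enableInterrupts stand for privileged hardware operations, not real Java calls):

/** Critical sections on a single CPU by blocking interrupts. This is acceptable
 ** only because our critical sections -- the bodies of down() and up() above --
 ** are just a few instructions long. */
void beginCS() {
    disableInterrupts();   // the short-term scheduler cannot preempt us now
}
void endCS() {
    enableInterrupts();    // preemption (and interrupt handling) resumes
}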
Short-term Scheduling
Earlier, we called a process that is not blocked ``runnable'' and said that a runnable process is either
ready or running. In general, there is a list of runnable processes called the ready list. Each CPU picks a
process from the ready list and runs it until it blocks. It then chooses another process to run, and so on.
The implementation of semaphores above illustrates this. This switching among runnable processes is
called short-term scheduling 1 and the algorithm that decides which process to run and how long to run
it is called a short-term scheduling policy or discipline. Some policies are preemptive, meaning that the
CPU may switch processes even when the current process isn't blocked.
Before we look at various scheduling policies, it is worthwhile to think about what we are trying to
accomplish. There is a tension between maximizing overall efficiency and giving good service to
individual ``customers.'' From the system's point of view, two important measures are
Throughput.
The amount of useful work accomplished per unit time. This depends, of course, on
what constitutes ``useful work.'' One common measure of throughput is jobs/minute
(or second, or hour, depending on the kinds of job).
Utilization.
For each device, the utilization of a device is the fraction of time the device is busy. A
good scheduling algorithm keeps all the devices (CPU's, disk drives, etc.) busy most
of the time.
Both of these measures depend not only on the scheduling algorithm, but also on the offered load. If
load is very light--jobs arrive only infrequently--both throughput and utilization will be low. However,
with a good scheduling algorithm, throughput should increase linearly with load until the available
hardware is saturated and throughput levels off.

Each ``job''2 also wants good service. In general, ``good service'' means good response: It starts
quickly, runs quickly, and finishes quickly. There are several ways of measuring response:
Turnaround.
The length of time between when the job arrives in the system and when it finally
finishes.
Response Time.
The length of time between when the job arrives in the system and when it starts to
produce output. For interactive jobs, response time might be more important than
turnaround.
Waiting Time.
The amount of time the job is ready (runnable but not running). This is a better
measure of scheduling quality than turnaround, since the scheduler has no control of
the amount of time the process spends computing or blocked waiting for I/O.
Penalty Ratio.
Elapsed time divided by the sum of the CPU and I/O demands of the job. This is a
still better measure of how well the scheduler is doing. It measures how many times
worse the turnaround is than it would be in an ``ideal'' system. If the job never had to
wait for another job, could allocate each I/O device as soon as it wants it, and
experienced no overhead for other operating system functions, it would have a
penalty ratio of 1.0. If it takes twice as long to complete as it would in the perfect
system, it has a penalty ratio of 2.0.
To measure the overall performance, we can then combine the performance of all jobs using any one of
these measures and any way of combining. For example, we can compute average waiting time as the
average of waiting times of all jobs. Similarly, we could calculate the sum of the waiting times, the
average penalty ratio, the variance in response time, etc. There is some evidence that a high variance in
response time can be more annoying to interactive users than a high mean (within reason).
Since we are concentrating on short-term (CPU) scheduling, one useful way to look at a process is as a
sequence of bursts. Each burst is the computation done by a process between the time it becomes ready
and the next time it blocks. To the short-term scheduler, each burst looks like a tiny ``job.''
First-Come-First-Served
The simplest possible scheduling discipline is called First-come, first-served (FCFS). The ready list is a
simple queue (first-in/first-out). The scheduler simply runs the first job on the queue until it blocks,
then it runs the new first job, and so on. When a job becomes ready, it is simply added to the end of the
queue.
Here's an example, which we will use to illustrate all the scheduling disciplines.
Burst   Arrival Time   Burst Length
  A           0              3
  B           1              5
  C           3              2
  D           9              5
  E          12              5
(All times are in milliseconds). The following Gantt chart shows the schedule that results from FCFS
scheduling.

The main advantage of FCFS is that it is easy to write and understand, but it has some severe problems.
If one process gets into an infinite loop, it will run forever and shut out all the others. Even if we
assume that processes don't have infinite loops (or take special precautions to catch such processes),
FCFS tends to excessively favor long bursts. Let's compute the waiting time and penalty ratios for these
jobs.
Burst   Start Time   Finish Time   Waiting Time   Penalty Ratio
  A          0             3             0             1.0
  B          3             8             2             1.4
  C          8            10             5             3.5
  D         10            15             1             1.2
  E         15            20             3             1.6
Average                                 2.2            1.74
As you can see, the shortest burst (C) has the worst penalty ratio. The situation can be much worse if a
short burst arrives after a very long one. For example, suppose a burst of length 100 arrives at time 0
and a burst of length 1 arrives immediately after it, at time 1. The first burst doesn't have to wait at all,
so its penalty ratio is 1.0 (perfect), but the second burst waits 99 milliseconds, for a penalty ratio of
100.
Favoring long bursts means favoring CPU-bound processes (which have very long CPU bursts between
I/O operations). In general, we would like to favor I/O-bound processes, since if we give the CPU to an
I/O-bound process, it will quickly finish its burst, start doing some I/O, and get out of the ready list.
Consider what happens if we have one CPU-bound process and several I/O-bound processes. Suppose
we start out on the right foot and run the I/O-bound processes first. They will all quickly finish their
bursts and go start their I/O operations, leaving us to run the CPU-bound job. After a while, they will
finish their I/O and queue up behind the CPU-bound job, leaving all the I/O devices idle. When the
CPU-bound job finishes its burst, it will start an I/O operation, allowing us to run the other jobs. As
before, they will quickly finish their bursts and start to do I/O. Now we have the CPU sitting idle, while
all the processes are doing I/O. Since the CPU hog started its I/O first, it will likely finish first,
grabbing the CPU and making all the other processes wait. The system will continue this way,
alternating between periods when the CPU is busy and all the I/O devices are idle with periods when
the CPU is idle and all the processes are doing I/O. We have destroyed one of the main motivations for
having processes in the first place: to allow overlap of computation with I/O. This phenomenon is
called the convoy effect.
In summary, although FCFS is simple, it performs poorly in terms of global performance measures,
such as CPU utilization and throughput. It also gives lousy response to interactive jobs (which tend to
be I/O bound). The one good thing about FCFS is that there is no starvation: Every burst does get
served, if it waits long enough.
Shortest-Job-First
A much better policy is called shortest-job-first (SJF). Whenever the CPU has to choose a burst to run,
it chooses the shortest one. (The algorithm really should be called ``shortest burst first'', but the name
SJF is traditional). This policy certainly gets around all the problems with FCFS mentioned above. In
fact, we can prove that SJF is optimal with respect to average waiting time. That is, any other policy
whatsoever will have worse average waiting time. By decreasing average waiting time, we also
improve processor utilization and throughput.
Here's the proof that SJF is optimal. Suppose we have a set of bursts ready to run and we run them in
some order other than SJF. Then there must be some burst that is run before a shorter burst, say b1 is run
before b2, but b1 > b2. If we reversed the order, we would increase the waiting time of b1 by b2, but
decrease the waiting time of b2 by b1. Since b1 > b2, we have a net decrease in total, and hence average,
waiting time. Continuing in this manner to move shorter bursts ahead of longer ones, we eventually end
up with the bursts sorted in increasing order of size (think of this as a bubble sort!).
Here's our previous example with SJF scheduling:
Burst   Start Time   Finish Time   Waiting Time   Penalty Ratio
  A          0             3             0             1.0
  B          5            10             4             1.8
  C          3             5             0             1.0
  D         10            15             1             1.2
  E         15            20             3             1.6
Average                                 1.6            1.32
Here's the Gantt chart:
As described, SJF is a non-preemptive policy. There is also a preemptive version of SJF, which is
sometimes called shortest-remaining-time-first (SRTF). Whenever a new job enters the ready queue,
the algorithm reconsiders which job to run. If the new arrival has a burst shorter than the remaining
portion of the current burst, the scheduler moves the current job back to the ready queue (to the
appropriate position considering the remaining time in its burst) and runs the new arrival instead.
With SJF or SRTF, starvation is possible. A very long burst may never get run, because shorter bursts
keep arriving in the ready queue. We will return to this problem later.
There's only one problem with SJF (or SRTF): We don't know how long a burst is going to be until we
run it! Luckily, we can make a pretty good guess. Processes tend to be creatures of habit, so if one burst
of a process is long, there's a good chance the next burst will be long as well. Thus we might guess that
each burst will be the same length as the previous burst of the same process. However, that strategy
won't work so well if a process has an occasional oddball burst that is unusually long or short. Not
only will we get that burst wrong, we will guess wrong on the next burst, which is more typical for the
process. A better idea is to make each guess the average of the length of the immediately preceding
burst and the guess we used before that burst: guess = (guess + previousBurst)/2. This
strategy takes into account the entire past history of a process in guessing the next burst length, but it
quickly adapts to changes in the behavior of the process, since the ``weight'' of each burst in computing
the guess drops off exponentially with the time since that burst. If we call the most recent burst length
b1, the one before that b2, etc., then the next guess is b1/2 + b2/4 + b3/8 + b4/16 + ....
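Here is a sketch of this estimator (the class and method names are illustrative):

/** Exponential-average burst predictor: each new guess is the average of the
 ** old guess and the most recent burst, so the weight of older bursts halves
 ** with each new burst. */
class BurstPredictor {
    private double guess;
    BurstPredictor(double initialGuess) { guess = initialGuess; }
    double nextGuess() { return guess; }
    void recordBurst(double actualLength) { guess = (guess + actualLength) / 2.0; }
}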
Round-Robin and Processor Sharing
Another scheme for preventing long bursts from getting too much priority is a preemptive strategy
called round-robin (RR). RR keeps all the bursts in a queue and runs the first one, like FCFS. But after
a length of time q (called a quantum), if the current burst hasn't completed, it is moved to the tail of the
queue and the next burst is started. Here are Gantt charts of our example with round-robin and quantum
sizes of 4 and 1.
With q = 4, we get an average waiting time of 3.2 and an average penalty ratio of 1.88 (work it out
yourself!). With q = 1, the averages increase to 3.6 and 1.98, respectively, but the variation in penalty
ratio decreases. With q = 4 the penalty ratios range from 1.0 to 3.0, whereas with q = 1, the range is
only 1.6 to 2.5.
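If you want to check figures like these, here is a minimal simulator sketch (all names are mine, not from these notes). It treats a quantum of Integer.MAX_VALUE as FCFS, and it assumes one particular tie-breaking convention: a burst that arrives at the same instant a quantum expires is queued ahead of the preempted burst. With that convention it reproduces the averages above (2.2/1.74 for FCFS, 3.2/1.88 for q = 4, and 3.6/1.98 for q = 1).

import java.util.ArrayDeque;

class SchedSim {
    static void simulate(String[] names, int[] arrival, int[] length, int quantum) {
        int n = names.length;
        int[] remaining = length.clone();
        int[] finish = new int[n];
        boolean[] queued = new boolean[n];
        ArrayDeque<Integer> ready = new ArrayDeque<>();
        int time = 0, done = 0;
        while (done < n) {
            // admit every burst that has arrived by now
            for (int i = 0; i < n; i++)
                if (!queued[i] && arrival[i] <= time) { ready.add(i); queued[i] = true; }
            if (ready.isEmpty()) { time++; continue; }        // CPU idle
            int cur = ready.remove();
            int slice = Math.min(quantum, remaining[cur]);
            for (int t = 0; t < slice; t++) {                 // run one tick at a time,
                time++;                                       // admitting arrivals as we go
                for (int i = 0; i < n; i++)
                    if (!queued[i] && arrival[i] <= time) { ready.add(i); queued[i] = true; }
            }
            remaining[cur] -= slice;
            if (remaining[cur] == 0) { finish[cur] = time; done++; }
            else ready.add(cur);                              // preempted: back to the tail
        }
        double totalWait = 0, totalPenalty = 0;
        for (int i = 0; i < n; i++) {
            int wait = finish[i] - arrival[i] - length[i];
            double penalty = (double) (finish[i] - arrival[i]) / length[i];
            totalWait += wait;
            totalPenalty += penalty;
            System.out.printf("%s: finish=%d wait=%d penalty=%.2f%n",
                              names[i], finish[i], wait, penalty);
        }
        System.out.printf("average wait=%.2f average penalty=%.2f%n",
                          totalWait / n, totalPenalty / n);
    }

    public static void main(String[] args) {
        String[] names = { "A", "B", "C", "D", "E" };
        int[] arrival  = { 0, 1, 3, 9, 12 };
        int[] length   = { 3, 5, 2, 5, 5 };
        simulate(names, arrival, length, Integer.MAX_VALUE);  // FCFS
        simulate(names, arrival, length, 4);                  // round-robin, q = 4
        simulate(names, arrival, length, 1);                  // round-robin, q = 1
    }
}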
The limit, as q approaches zero, is called processor sharing (PS). PS causes the CPU to be shared
equally among all the ready processes. In the steady state of PS, when no bursts enter or leave the ready
list, each burst sees a penalty ratio of exactly n, the length of the ready queue. In this particular
example, burst A arrives at time 0 and for one millisecond, it has the CPU to itself, so when B arrives at
time 1, A has used up 1 ms of its demand and has 2 ms of CPU demand remaining. From time 1 to 3, A
and B share the CPU equally. Thus each of them gets 1 ms CPU time, leaving A with 1 ms remaining
and B with 4 ms remaining. After C arrives at time 3, there are three bursts sharing the CPU, so it takes
3 ms -- until time 6 -- for A to finish. Continuing in a similar manner, you will find that for this
example, PS gives exactly the same results as RR with q = 1. Of course PS is only of theoretical
interest. There is a substantial overhead in switching from one process to another. If the quantum is too
small, the CPU will spend most of its time switching between processes and practically none of it actually
running them!
Priority Scheduling
There are a whole family of scheduling algorithms that use priorities. The basic idea is always to run
the highest priority burst. Priority algorithms can be preemptive or non-preemptive (if a burst arrives
that has higher priority than the currently running burst, do we switch to it immediately, or do we
wait until the current burst finishes?). Priorities can be assigned externally to processes based on their
importance. They can also be assigned (and changed) dynamically. For example, priorities can be used
to prevent starvation: If we raise the priority of a burst the longer it has been in the ready queue,
eventually it will have the highest priority of all ready bursts and be guaranteed a chance to finish. One
interesting use of priority is sometimes called multi-level feedback queues (MLFQ). We maintain a
sequence of FIFO queues, numbered starting at zero. New bursts are added to the tail of queue 0. We
always run the burst at the head of the lowest numbered non-empty queue. If it doesn't complete
within a specified time limit, it is moved to the tail of the next higher queue. Each queue has
its own time limit: one unit in queue 0, two units in queue 1, four units in queue 2, eight units in queue
3, etc. This scheme combines many of the best features of the other algorithms: It favors short bursts,
since they will be completed while they are still in low-numbered (high priority) queues. Long bursts,
on the other hand, will be run with comparatively few expensive process switches.
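Here is a sketch of the MLFQ bookkeeping (the class and names are illustrative; it only models which queue a burst is in, not real timing):

import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;

/** Multi-level feedback queues: queue k has a time limit of 2^k units, and a
 ** burst that exhausts its limit moves to the tail of the next queue. */
class MLFQ {
    static class Burst {
        final String name;
        int remaining;
        Burst(String name, int remaining) { this.name = name; this.remaining = remaining; }
    }

    private final List<ArrayDeque<Burst>> queues = new ArrayList<>();

    MLFQ(int levels) {
        for (int k = 0; k < levels; k++) queues.add(new ArrayDeque<>());
    }

    /** New bursts enter at the tail of queue 0. */
    void admit(Burst b) { queues.get(0).add(b); }

    /** One scheduling decision: run the head of the lowest-numbered non-empty
     ** queue for at most that queue's limit. Returns false if all queues are empty. */
    boolean runOne() {
        for (int k = 0; k < queues.size(); k++) {
            ArrayDeque<Burst> q = queues.get(k);
            if (q.isEmpty()) continue;
            Burst b = q.remove();
            int limit = 1 << k;                       // 1, 2, 4, 8, ... time units
            b.remaining -= Math.min(limit, b.remaining);
            if (b.remaining > 0) {
                int next = Math.min(k + 1, queues.size() - 1);
                queues.get(next).add(b);              // demote (the bottom queue keeps it)
            }
            return true;
        }
        return false;
    }
}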
This idea can be generalized. Each queue can have its own scheduling discipline, and you can use any
criterion you like to move bursts from queue to queue. There's no end to the number of algorithms you
can dream up.
Analysis
It is possible to analyze some of these algorithms mathematically. There is a whole branch of computer
science called ``queuing theory'' concerned with this sort of analysis. Usually, the analysis uses
statistical assumptions. For example, it is common to assume that the arrival of new bursts is Poisson:
The expected time to wait until the next new burst arrives is independent of how long it has been since
the last burst arrived. In other words, the amount of time that has passed since the last arrival is no clue
to how long it will be until the next arrival. You can show that in this case, the probability of an arrival
in the next t milliseconds is 1 - e^(-at), where a is a parameter called the arrival rate. The average time
between arrivals is 1/a. Another common assumption is that the burst lengths follow a similar
``exponential'' distribution: the probability that the length of a burst is less than t is 1 - e^(-bt), where b is
another parameter, the service rate. The
average burst length is 1/b. This kind of system is called an ``M/M/1 queue.''
The ratio p = a/b is of particular interest:3 If p > 1, bursts are arriving, on the average, faster than they
are finishing, so the ready queue grows without bound. (Of course, that can't happen because there is at
most one burst per process, but this is theory!) If p = 1, arrivals and departures are perfectly balanced.
It can be shown that for FCFS, the average penalty ratio for bursts of length t is
P(t) = 1 + p / [ (1-p)bt ]
As you can see, as t decreases, the penalty ratio increases, proving that FCFS doesn't like short bursts.
Also note that as p approaches one, the penalty ratio approaches infinity.
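For example, with p = 0.8 and an average burst length of 10 ms (so b = 0.1 per millisecond), a 1 ms burst sees P(1) = 1 + 0.8/(0.2 * 0.1 * 1) = 41, while a 100 ms burst sees only P(100) = 1 + 0.8/(0.2 * 0.1 * 100) = 1.4.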
For processor sharing, as we noticed above, all processes have a penalty ratio that is the length of the
queue. It can be shown that on the average, that length is 1/(1-p).

1
We will see medium-term and long-term scheduling later in the course.
2
A job might be a batch job (such as printing a run of paychecks), an interactive login session, or a
command issued by an interactive session. It might consist of a single process or a group of related
processes.
3
Actually, a, b, and p are supposed to be the Greek letters ``alpha,'' ``beta,'' and ``rho,'' but I can't figure
out how to make them in HTML.
CS 537
Lecture Notes Part 6
Memory Management
Contents
• Allocating Main Memory
• Algorithms for Memory Management
• Compaction and Garbage Collection
• Swapping

Allocating Main Memory


We first consider how to manage main (``core'') memory (also called random-access memory (RAM)).
In general, a memory manager provides two operations:

Address allocate(int size);


void deallocate(Address block);
The procedure allocate receives a request for a contiguous block of size bytes of memory and
returns a pointer to such a block. The procedure deallocate releases the indicated block, returning it
to the free pool for reuse. Sometimes a third procedure is also provided,

Address reallocate(Address block, int new_size);


which takes an allocated block and changes its size, either returning part of it to the free pool or
extending it to a larger block. It may not always be possible to grow the block without copying it to a
new location, so reallocate returns the new address of the block.
Memory allocators are used in a variety of situations. In Unix, each process has a data segment. There
is a system call to make the data segment bigger, but no system call to make it smaller. Also, the system
call is quite expensive. Therefore, there are library procedures (called malloc, free, and realloc)
to manage this space. Only when malloc or realloc runs out of space is it necessary to make the
system call. The C++ operators new and delete are just dressed-up versions of malloc and free.
The Java operator new also uses malloc, and the Java runtime system calls free when an object is
found to be inaccessible during garbage collection (described below).
The operating system also uses a memory allocator to manage space used for OS data structures and
given to ``user'' processes for their own use. As we saw before, there are several reasons why we might
want multiple processes, such as serving multiple interactive users or controlling multiple devices.
There is also a ``selfish'' reason why the OS wants to have multiple processes in memory at the same
time: to keep the CPU busy. Suppose there are n processes in memory (this is called the level of
multiprogramming) and each process is blocked (waiting for I/O) a fraction p of the time. In the best
case, when they ``take turns'' being blocked, the CPU will be 100% busy provided n(1-p) >= 1. For
example, if each process is ready 20% of the time, p = 0.8 and the CPU could be kept completely busy
with five processes. Of course, real processes aren't so cooperative. In the worst case, they could all
decide to block at the same time, in which case, the CPU utilization (fraction of the time the CPU is
busy) would be only 1 - p (20% in our example). If each process decides randomly and independently
when to block, the chance that all n processes are blocked at the same time is only p^n, so CPU
utilization is 1 - p^n. Continuing our example in which n = 5 and p = 0.8, the expected utilization would
be 1 - 0.8^5 = 1 - 0.32768 = 0.67232. In other words, the CPU would be busy about 67% of the time on the
average.
Algorithms for Memory Management
[ Silberschatz, Galvin, and Gagne, Section 9.3 ]
Clients of the memory manager keep track of allocated blocks (for now, we will not worry about what
happens when a client ``forgets'' about a block). The memory manager needs to keep track of the
``holes'' between them. The most common data structure is doubly linked list of holes. This data
structure is called the free list. This free list doesn't actually consume any space (other than the head
and tail pointers), since the links between holes can be stored in the holes themselves (provided each
hole is at least as large as two pointers. To satisfy an allocate(n) request, the memory manager
finds a hole of size at least n and removes it from the list. If the hole is bigger than n bytes, it can split
off the tail of the hole, making a smaller hole, which it returns to the list. To satisfy a deallocate
request, the memory manager turns the returned block into a ``hole'' data structure and inserts it into the
free list. If the new hole is immediately preceded or followed by a hole, the holes can be coalesced into
a bigger hole, as explained below.
How does the memory manager know how big the returned block is? The usual trick is to put a small
header in the allocated block, containing the size of the block and perhaps some other information. The
allocate routine returns a pointer to the body of the block, not the header, so the client doesn't need
to know about it. The deallocate routine subtracts the header size from its argument to get the
address of the header. The client thinks the block is a little smaller than it really is. So long as the client
``colors inside the lines'' there is no problem, but if the client has bugs and scribbles on the header, the
memory manager can get completely confused. This is a frequent problem with malloc in Unix
programs written in C or C++. The Java system uses a variety of runtime checks to prevent this kind of
bug.
To make it easier to coalesce adjacent holes, the memory manager also adds a flag (called a ``boundary
tag'') to the beginning and end of each hole or allocated block, and it records the size of a hole at both
ends of the hole.

When the block is deallocated, the memory manager adds the size of the block (which is stored in its
header) to the address of the beginning of the block to find the address of the first word following the
block. It looks at the tag there to see if the following space is a hole or another allocated block. If it is a
hole, it is removed from the free list and merged with the block being freed, to make a bigger hole.
Similarly, if the boundary tag preceding the block being freed indicates that the preceding space is a
hole, we can find the start of that hole by subtracting its size from the address of the block being freed
(that's why the size is stored at both ends), remove it from the free list, and merge it with the block
being freed. Finally, we add the new hole back to the free list. Holes are kept in a doubly-linked list to
make it easy to remove holes from the list when they are being coalesced with blocks being freed.
How does the memory manager choose a hole to respond to an allocate request? At first, it might
seem that it should choose the smallest hole that is big enough to satisfy the request. This strategy is
called best fit. It has two problems. First, it requires an expensive search of the entire free list to find the
best hole (although fancier data structures can be used to speed up the search). More importantly, it
leads to the creation of lots of little holes that are not big enough to satisfy any requests. This situation
is called fragmentation, and is a problem for all memory-management strategies, although it is
particularly bad for best-fit. One way to avoid making little holes is to give the client a bigger block
than it asked for. For example, we might round all requests up to the next larger multiple of 64 bytes.
That doesn't make the fragmentation go away, it just hides it. Unusable space in the form of holes is
called external fragmentation, while unused space inside allocated blocks is called internal
fragmentation.
Another strategy is first fit, which simply scans the free list until a large enough hole is found. Despite
the name, first-fit is generally better than best-fit because it leads to less fragmentation. There is still
one problem: Small holes tend to accumulate near the beginning of the free list, making the memory
allocator search farther and farther each time. This problem is solved with next fit, which starts each
search where the last one left off, wrapping around to the beginning when the end of the list is reached.
Yet another strategy is to maintain separate lists, each containing holes of a different size. This
approach works well at the application level, when only a few different types of objects are created
(although there might be lots of instances of each type). It can also be used in a more general setting by
rounding all requests up to one of a few pre-determined choices. For example, the memory manager
may round all requests up to the next power of two bytes (with a minimum of, say, 64) and then keep
lists of holes of size 64, 128, 256, ..., etc. Assuming the largest request possible is 1 megabyte, this
requires only 15 lists (one for each size from 64 = 2^6 up to 1M = 2^20). This is the approach taken by most implementations of malloc. This approach
eliminates external fragmentation entirely, but internal fragmentation may be as bad as 50% in the
worst case (which occurs when all requests are one byte more than a power of two).
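As a minimal sketch of the size-class arithmetic (the power-of-two classes and the 64-byte minimum are just the illustrative choices from the paragraph above):

#include <stddef.h>

#define MIN_CLASS 64   /* smallest size class, in bytes */

/* Round a request up to the next power of two, but never below MIN_CLASS. */
static size_t round_up(size_t n) {
    size_t p = MIN_CLASS;
    while (p < n)
        p <<= 1;
    return p;
}

/* Index of the free list for a request of n bytes:
   0 for 64, 1 for 128, 2 for 256, and so on. */
static int class_index(size_t n) {
    int i = 0;
    for (size_t p = MIN_CLASS; p < n; p <<= 1)
        i++;
    return i;
}

With helpers like these, allocation just pops a hole from the free list selected by class_index(n) (the array of lists is another assumed structure), splitting a block from a larger class when that list happens to be empty.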
Another problem with this approach is how to coalesce neighboring holes. One possibility is not to try.
The system is initialized by splitting memory up into a fixed set of holes (either all the same size or a
variety of sizes). Each request is matched to an ``appropriate'' hole. If the request is smaller than the
hole size, the entire hole is allocated to it anyhow. When the allocate block is released, it is simply
returned to the appropriate free list. Most implementations of malloc use a variant of this approach
(some implementations split holes, but most never coalesce them).
An interesting trick for coalescing holes with multiple free lists is the buddy system. Assume all blocks
and holes have sizes which are powers of two (so requests are always rounded up to the next power of
two) and each block or hole starts at an address that is an exact multiple of its size. Then each block has
a ``buddy'' of the same size adjacent to it, such that combining a block of size 2^n with its buddy creates
a properly aligned block of size 2^(n+1). For example, blocks of size 4 could start at addresses 0, 4, 8, 12,
16, 20, etc. The blocks at 0 and 4 are buddies; combining them gives a block at 0 of length 8. Similarly
8 and 12 are buddies, 16 and 20 are buddies, etc. The blocks at 4 and 8 are not buddies even though
they are neighbors: Combining them would give a block of size 8 starting at address 4, which is not a
multiple of 8. The address of a block's buddy can be easily calculated by flipping the bit of the block's
address that corresponds to the block's size. For example, the pairs of buddies (0,4), (8,12),
(16,20) in binary are (00000,00100), (01000,01100), (10000,10100). In each case, the two addresses in
the pair differ only in the third bit from the right. In short, you can find the address of the buddy of a
block by taking the exclusive or of the address of the block with its size. To allocate a block of a given
size, first round the size up to the next power of two and look on the list of blocks of that size. If that
list is empty, split a block from the next higher list (if that list is empty, first add two blocks to it by
splitting a block from the next higher list, and so on). When deallocating a block, first check to see
whether the block's buddy is free. If so, combine the block with its buddy and add the resulting block to
the next higher free list. As with allocations, deallocations can cascade to higher and higher lists.
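The buddy computation and the cascading coalesce on free can be sketched as follows. Addresses here are offsets from the start of the managed region (so a block of size 2^k always starts at a multiple of 2^k), and the three free-list helpers are assumed to exist; this is an illustration of the technique, not a complete allocator.

#include <stdint.h>
#include <stddef.h>

/* The whole trick: a block's buddy differs from it only in the bit
   corresponding to the block's size. */
static uintptr_t buddy_of(uintptr_t addr, size_t size) {
    return addr ^ size;
}

/* Assumed helpers that manage one free list per power-of-two size. */
int  is_free_block(uintptr_t addr, size_t size);
void remove_free_block(uintptr_t addr, size_t size);
void add_free_block(uintptr_t addr, size_t size);

/* Free a block, coalescing with its buddy as long as the buddy is free
   (a real implementation would also stop at the maximum block size). */
void buddy_free(uintptr_t addr, size_t size) {
    for (;;) {
        uintptr_t b = buddy_of(addr, size);
        if (!is_free_block(b, size))
            break;                   /* buddy in use: stop coalescing */
        remove_free_block(b, size);
        if (b < addr)
            addr = b;                /* merged block starts at the lower address */
        size <<= 1;                  /* try again one level up */
    }
    add_free_block(addr, size);
}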
Compaction and Garbage Collection
What do you do when you run out of memory? Any of these methods can fail because all the memory is
allocated, or because there is too much fragmentation. Malloc, which allocates space within the data
segment of a Unix process, simply calls the (expensive) OS call that expands the data segment.
A memory manager allocating real physical memory doesn't have that luxury. The allocation attempt
simply fails. There are two ways of delaying this catastrophe, compaction and garbage collection.
Compaction attacks the problem of fragmentation by moving all the allocated blocks to one end of
memory, thus combining all the holes. Aside from the obvious cost of all that copying, there is an
important limitation to compaction: Any pointers to a block need to be updated when the block is
moved. Unless it is possible to find all such pointers, compaction is not possible. Pointers can be stored in
the allocated blocks themselves as well as other places in the client of the memory manager. In some
situations, pointers can point not only to the start of blocks but also into their bodies. For example, if a
block contains executable code, a branch instruction might be a pointer to another location in the same
block. Compaction is performed in three phases. First, the new location of each block is calculated to
determine the distance the block will be moved. Then each pointer is updated by adding to it the
amount that the block it is pointing (in)to will be moved. Finally, the data is actually moved. There are
various clever tricks possible to combine these operations.
Garbage collection finds blocks of memory that are inaccessible and returns them to the free list. As
with compaction, garbage collection normally assumes we can find all pointers to blocks, both within the
blocks themselves and ``from the outside.'' If that is not possible, we can still do ``conservative''
garbage collection in which every word in memory that contains a value that appears to be a pointer is
treated as a pointer. The conservative approach may fail to collect blocks that are garbage, but it will
never mistakenly collect accessible blocks. There are three main approaches to garbage collection:
reference counting, mark-and-sweep, and generational algorithms.
Reference counting keeps in each block a count of the number of pointers to the block. When the count
drops to zero, the block may be freed. This approach is only practical in situations where there is some
``higher level'' software to keep track of the counts (it's much too hard to do by hand), and even then, it
will not detect cyclic structures of garbage: Consider a cycle of blocks, each of which is only pointed to
by its predecessor in the cycle. Each block has a reference count of 1, but the entire cycle is garbage.
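For concreteness, here is a minimal reference-counting sketch for a hypothetical object type (the names are invented for the example). A cycle of objects whose only references come from each other never reaches a count of zero, which is exactly the weakness described above.

#include <stdlib.h>

typedef struct obj {
    int         refcount;
    struct obj *child;              /* an example pointer field */
} obj;

obj *obj_new(void) {
    obj *o = calloc(1, sizeof *o);
    o->refcount = 1;                /* the creator holds the first reference */
    return o;
}

void obj_retain(obj *o)  { if (o) o->refcount++; }

void obj_release(obj *o) {
    if (o && --o->refcount == 0) {
        obj_release(o->child);      /* drop this object's reference to its child */
        free(o);
    }
}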
Mark-and-sweep works in two passes: First we mark all non-garbage blocks by doing a depth-first
search starting with each pointer ``from outside'':

void mark(Address b) {
mark block b;
for (each pointer p in block b) {
if (the block pointed to by p is not marked)
mark(p);
}
}
The second pass sweeps through all blocks and returns the unmarked ones to the free list. The sweep
pass usually also does compaction, as described above.
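Assuming the collector keeps a table of block descriptors and that mark() has already set a marked flag on every reachable block, the sweep pass might look like the sketch below (the descriptor layout and add_to_free_list are assumptions for the illustration, and the compaction step is omitted).

#include <stdbool.h>
#include <stddef.h>

typedef struct {
    void  *addr;
    size_t size;
    bool   marked;
} block_desc;

void add_to_free_list(void *addr, size_t size);   /* assumed helper */

void sweep(block_desc *blocks, size_t nblocks) {
    for (size_t i = 0; i < nblocks; i++) {
        if (blocks[i].marked)
            blocks[i].marked = false;             /* clear for the next collection */
        else
            add_to_free_list(blocks[i].addr, blocks[i].size);
    }
}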
There are two problems with mark-and-sweep. First, the amount of work in the mark pass is
proportional to the amount of non-garbage. Thus if memory is nearly full, it will do a lot of work with
very little payoff. Second, the mark phase does a lot of jumping around in memory, which is bad for
virtual memory systems, as we will soon see.
The third approach to garbage collection is called generational collection. Memory is divided into
spaces. When a space is chosen for garbage collection, all subsequent references to objects in that space
cause the object to be copied to a new space. After a while, the old space either becomes empty and can
be returned to the free list all at once, or at least it becomes so sparse that a mark-and-sweep garbage
collection on it will be cheap. As an empirical fact, objects tend to be either short-lived or long-lived. In
other words, an object that has survived for a while is likely to live a lot longer. By carefully choosing
where to move objects when they are referenced, we can arrange to have some spaces filled only with
long-lived objects, which are very unlikely to become garbage. We garbage-collect these spaces seldom
if ever.
Swapping
[ Silberschatz, Galvin, and Gagne, Section 9.2 ]
When all else fails, allocate simply fails. In the case of an application program, it may be adequate
to simply print an error message and exit. An OS must be able to recover more gracefully.
We motivated memory management by the desire to have many processes in memory at once. In a
batch system, if the OS cannot allocate memory to start a new job, it can ``recover'' by simply delaying
starting the job. If there is a queue of jobs waiting to be created, the OS might want to go down the list,
looking for a smaller job that can be created right away. This approach maximizes utilization of
memory, but can starve large jobs. The situation is analogous to short-term CPU scheduling, in which
SJF gives optimal CPU utilization but can starve long bursts. The same trick works here: aging. As a
job waits longer and longer, increase its priority, until its priority is so high that the OS refuses to skip
over it looking for a more recently arrived but smaller job.
An alternative way of avoiding starvation is to use a memory-allocation scheme with fixed partitions
(holes are not split or combined). Assuming no job is bigger than the biggest partition, there will be no
starvation, provided that each time a partition is freed, we start the first job in line that fits in
that partition. However, we have another choice analogous to the difference between first-fit and best
fit. Of course we want to use the ``best'' hole for each job (the smallest free partition that is at least as
big as the job), but suppose the next job in line is small and all the small partitions are currently in use.
We might want to delay starting that job and look through the arrival queue for a job that better uses the
partitions currently available. This policy re-introduces the possibility of starvation, which we can
combat by aging, as above.
If a disk is available, we can also swap blocked jobs out to disk. When a job finishes, we first swap
back jobs from disk before allowing new jobs to start. When a job is blocked (either because it wants to
do I/O or because our short-term scheduling algorithm says to switch to another job), we have a choice
of leaving it in memory or swapping it out. One way of looking at this scheme is that it increases the
multiprogramming level (the number of jobs ``in memory'') at the cost of making it (much) more
expensive to switch jobs. A variant of the MLFQ (multi-level feedback queues) CPU scheduling
algorithm is particularly attractive for this situation. The queues are numbered from 0 up to some
maximum. When a job becomes ready, it enters queue zero. The CPU scheduler always runs a job from
the lowest-numbered non-empty queue (i.e., the priority is the negative of the queue number). It runs a
job from queue i for a maximum of i quanta. If the job does not block or complete within that time
limit, it is added to the next higher queue. This algorithm behaves like RR with short quanta in that
short bursts get high priority, but does not incur the overhead of frequent swaps between jobs with long
bursts. The number of swaps is limited to the logarithm of the burst size.



CS 537
Lecture Notes Part 7
Paging
Contents
• Paging
• Page Tables
• Page Replacement
• Frame Allocation for a Single Process
• Frame Allocation for Multiple Processes

Paging
[ Silberschatz, Galvin, and Gagne, Section 9.4 ]
Most modern computers have special hardware called a memory management unit (MMU). This unit
sits between the CPU and the memory unit. Whenever the CPU wants to access memory (whether it is
to load an instruction or load or store data), it sends the desired memory address to the MMU, which
translates it to another address before passing it on to the memory unit. The address generated by the
CPU, after any indexing or other addressing-mode arithmetic, is called a virtual address, and the
address it gets translated to by the MMU is called a physical address.

Normally, the translation is done at the granularity of a page. Each page is a power of 2 bytes long,
usually between 1024 and 8192 bytes. If virtual address p is mapped to physical address f (where p is a
multiple of the page size), then address p+o is mapped to physical address f+o for any offset o less than
the page size. In other words, each page is mapped to a contiguous region of physical memory called a
page frame.
The MMU allows a contiguous region of virtual memory to be mapped to page frames scattered around
physical memory, making life much easier for the OS when allocating memory. Much more importantly,
however, it allows infrequently-used pages to be stored on disk. Here's how it works: The tables used
by the MMU have a valid bit for each page in the virtual address space. If this bit is set, the translation
of virtual addresses on a page proceeds as normal. If it is clear, any attempt by the CPU to access an
address on the page generates an interrupt called a page fault trap. The OS has an interrupt handler for
page faults, just as it has a handler for any other kind of interrupt. It is the job of this handler to get the
requested page into memory.
In somewhat more detail, when a page fault is generated for page p1, the interrupt handler does the
following:
• Find out where the contents of page p1 are stored on disk. The OS keeps this information in a
table. It is possible that this page isn't anywhere at all, in which case the memory reference is
simply a bug. In this case, the OS takes some corrective action such as killing the process that
made the reference (this is the source of the notorious message ``memory fault -- core dumped'').
Assuming the page is on disk:
• Find another page p2 mapped to some frame f of physical memory that is not used much.
• Copy the contents of frame f out to disk.
• Clear page p2's valid bit so that any subsequent references to page p2 will cause a page fault.
• Copy page p1's data from disk to frame f.
• Update the MMU's tables so that page p1 is mapped to frame f.
• Return from the interrupt, allowing the CPU to retry the instruction that caused the interrupt.

Page Tables
[ Silberschatz, Galvin, and Gagne, Sections 9.4.1-9.4.4 ]
Conceptually, the MMU contains a page table which is simply an array of entries indexed by page
number. Each entry contains some flags (such as the valid bit mentioned earlier) and a frame number.
The physical address is formed by concatenating the frame number with the offset, which is the low-
order bits of the virtual address.
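The conceptual translation takes only a few lines. The sketch below assumes a hypothetical machine with 32-bit virtual addresses and 4K pages (so a 20-bit page number and a 12-bit offset); returning false models the MMU raising a page-fault trap instead of producing an address.

#include <stdint.h>
#include <stdbool.h>

#define PAGE_SHIFT 12
#define PAGE_SIZE  (1u << PAGE_SHIFT)
#define NPAGES     (1u << (32 - PAGE_SHIFT))

typedef struct {
    uint32_t frame : 20;   /* page frame number                    */
    uint32_t valid : 1;    /* is the page in memory?               */
    uint32_t dirty : 1;    /* set by the hardware on a store       */
    uint32_t ref   : 1;    /* set by the hardware on any reference */
} pte_t;

pte_t page_table[NPAGES];  /* conceptually, one entry per virtual page */

bool translate(uint32_t virt, uint32_t *phys) {
    uint32_t page   = virt >> PAGE_SHIFT;       /* high-order bits */
    uint32_t offset = virt & (PAGE_SIZE - 1);   /* low-order bits  */
    if (!page_table[page].valid)
        return false;                           /* page fault      */
    *phys = ((uint32_t)page_table[page].frame << PAGE_SHIFT) | offset;
    return true;
}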
There are two problems with this conceptual view. First, the lookup in the page table has to be fast,
since it is done on every single memory reference--at least once per instruction executed (to fetch the
instruction itself) and often two or more times per instruction. Thus the lookup is always done by
special-purpose hardware. Even with special hardware, if the page table is stored in memory, the table
lookup makes each memory reference generated by the CPU cause two references to memory. Since in
modern computers, the speed of memory is often the bottleneck (processors are getting so fast that they
spend much of their time waiting for memory), virtual memory could make programs run twice as
slowly as they would without it. We will look at ways of avoiding this problem in a minute, but first we
will consider the other problem: The page tables can get large.
Suppose the page size is 4K bytes and a virtual address is 32 bits long (these are typical values for
current machines). Then the virtual address would be divided into a 20-bit page number and a 12-bit
offset (because 2^12 = 4096 = 4K), so the page table would have to have 2^20 = 1,048,576 entries. If each
entry is 4 bytes long, that would use up 4 megabytes of memory. And each process has its own page
table. Newer machines being introduced now generate 64-bit addresses. Such a machine would need a
page table with 4,503,599,627,370,496 entries!
Fortunately, the vast majority of the page table entries are normally marked ``invalid.'' Although the
virtual address may be 32 bits long and thus capable of addressing a virtual address space of 4
gigabytes, a typical process is at most a few megabytes in size, and each megabyte of virtual memory
uses only 256 page-table entries (for 4K pages).
There are several different page table organizations used in actual computers. One approach is to put the
page table entries in special registers. This was the approach used by the PDP-11 minicomputer
introduced in the 1970's. The virtual address was 16 bits and the page size was 8K bytes. Thus the
virtual address consisted of 3 bits of page number and 13 bits of offset, for a total of 8 pages per
process. The eight page-table entries were stored in special registers. [As an aside, 16-bit virtual
addresses means that any one process could access only 64K bytes of memory. Even in those days that
was considered too small, so later versions of the PDP-11 used a trick called ``split I/D space.'' Each
memory reference generated by the CPU had an extra bit indicating whether it was an instruction fetch
(I) or a data reference (D), thus allowing 64K bytes for the program and 64K bytes for the data.]
Putting page table entries in registers helps make the MMU run faster (the registers were much faster
than main memory), but this approach has a downside as well. The registers are expensive, so it works
only when the page table is very small. Also, each time the OS wants to switch processes, it has to reload the
registers with the page-table entries of the new process.
A second approach is to put the page table in main memory. The (physical) address of the page table is
held in a register. The page field of the virtual address is added to this register to find the page table
entry in physical memory. This approach has the advantage that switching processes is easy (all you
have to do is change the contents of one register) but it means that every memory reference generated
by the CPU requires two trips to memory. It also can use too much memory, as we saw above.
A third approach is to put the page table itself in virtual memory. The page number extracted from the
virtual address is used as a virtual address to find the page table entry. To prevent an infinite recursion,
this virtual address is looked up using a page table stored in physical memory. As a concrete example,
consider the VAX computer, introduced in the late 70's. The virtual address of the VAX is 30 bits long,
with 512-byte pages (probably too small even at that time!). Thus the virtual address a consists of a 21-
bit page number p and a nine-bit offset o. The page number is multiplied by 4 (the size of a page-table
entry) and added to the contents of the MMU register containing the address of the page table. This
gives a virtual address that is resolved, using a page table stored in physical memory, to find the desired
frame number f. In more detail, the high order bits of p index into a table to find a physical frame number, which, when
concatenated with the low bits of p, gives the physical address of a word containing f. The concatenation
of f with o is the desired physical address.

As you can see, another way of looking at this algorithm is that the virtual address is split into fields
that are used to walk through a tree of page tables. The SPARC processor (which you are using for this
course) uses a similar technique, but with one more level: The 32-bit virtual address is divided into
three index fields of 8, 6, and 6 bits and a 12-bit offset. The root of the tree is pointed to by an entry in a
context table, which has one entry for each process. The advantage of these schemes is that they save
on memory. For example, consider a VAX process that only uses the first megabyte of its address space
(2048 512-byte pages). Since each second level page table has 128 entries, there will be 16 of them
used. Adding to this the 64K bytes needed for the first-level page table, the total space used for page
tables is only 72K bytes, rather than the 8 megabytes that would be needed for a one-level page table.
The downside is that each level of page table adds one more memory lookup on each reference
generated by the CPU.
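A two-level walk of this general kind might be rendered as follows. The field widths (10 + 10 + 12 bits for a 32-bit address with 4K pages) and the structure layouts are illustrative assumptions, not the VAX's or the SPARC's actual formats.

#include <stdint.h>
#include <stdbool.h>

typedef struct {
    uint32_t frame : 20;
    uint32_t valid : 1;
} pte_t;

typedef struct {
    pte_t *table;   /* address of a second-level table, if valid */
    int    valid;
} pde_t;

bool walk(pde_t *dir, uint32_t virt, uint32_t *phys) {
    uint32_t top    = (virt >> 22) & 0x3FF;   /* index into the top-level table    */
    uint32_t second = (virt >> 12) & 0x3FF;   /* index into the second-level table */
    uint32_t offset =  virt        & 0xFFF;

    if (!dir[top].valid)
        return false;                         /* no second-level table: fault */
    pte_t pte = dir[top].table[second];
    if (!pte.valid)
        return false;                         /* page not in memory: fault    */
    *phys = ((uint32_t)pte.frame << 12) | offset;
    return true;
}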
A fourth approach is to use what is called an inverted page table. (Actually, the very first computer to
have virtual memory, the Atlas computer built in England in the late 50's used this approach, so in some
sense all the page tables described above are ``inverted.'') An ordinary page table has an entry for each
page, containing the address of the corresponding page frame (if any). An inverted page table has an
entry for each page frame, containing the corresponding page number. To resolve a virtual address, the
table is searched to find an entry that contains the page number. The good news is that an inverted page
table only uses a fixed fraction of memory. For example, if a page is 4K bytes and a page-table entry is
4 bytes, there will be exactly 4 bytes of page table for each 4096 bytes of physical memory. In other
words, less than 0.1% of memory will be used for page tables. The bad news is that this is by far the
slowest of the methods, since it requires a search of the page table for each reference. The original Atlas
machine had special hardware to search the table in parallel, which was reasonable since the table had
only 2048 entries.
All of the methods considered thus far can be sped up by using a trick called caching. We will be seeing
many many more examples of caching used to speed things up throughout the course. In fact, it has
been said that caching is the only technique in computer science used to improve performance. In this
case, the specific device is called a translation lookaside buffer (TLB). The TLB contains a set of
entries, each of which contains a page number, the corresponding page frame number, and the
protection bits. There is special hardware to search the TLB for an entry matching a given page number.
If the TLB contains a matching entry, it is found very quickly and nothing more needs to be done.
Otherwise we have a TLB miss and have to fall back on one of the other techniques to find the
translation. However, we can take that translation we found the hard way and put it into the TLB so that
we find it much more quickly the next time. The TLB has a limited size, so to add a new entry, we
usually have to throw out an old entry. The usual technique is to throw out the entry that hasn't been
used the longest. This strategy, called LRU (least-recently used) replacement is also implemented in
hardware. The reason this approach works so well is that most programs spend most of their time
accessing a small set of pages over and over again. For example, a program often spends a lot of time in
an ``inner loop'' in one procedure. Even if that procedure, the procedures it calls, and so on are spread
over 40K bytes, 10 TLB entries will be sufficient to describe all these pages, and there will be no TLB
misses provided the TLB has at least 10 entries. This phenomenon is called locality. In practice, the
TLB hit rate for instruction references is extremely high. The hit rate for data references is also good,
but can vary widely for different programs.
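The logic of a TLB lookup with a fallback on a miss can be sketched as follows. A real TLB compares all of its entries in parallel in hardware; the linear loop, the entry format, the 4K page size, the choice of which entry to replace, and walk_page_table are all assumptions made for the illustration.

#include <stdint.h>
#include <stdbool.h>

#define TLB_ENTRIES 64

typedef struct {
    bool     valid;
    uint32_t page;    /* virtual page number             */
    uint32_t frame;   /* corresponding page frame number */
    uint32_t asid;    /* process (address space) id      */
} tlb_entry;

tlb_entry tlb[TLB_ENTRIES];

/* Assumed slow path: a hardware table walk, or on machines like the
   MIPS, a trap to an OS handler.  Returns false on a page fault. */
bool walk_page_table(uint32_t page, uint32_t asid, uint32_t *frame);

bool lookup(uint32_t virt, uint32_t asid, uint32_t *phys) {
    uint32_t page = virt >> 12, offset = virt & 0xFFF;
    for (int i = 0; i < TLB_ENTRIES; i++) {
        if (tlb[i].valid && tlb[i].page == page && tlb[i].asid == asid) {
            *phys = (tlb[i].frame << 12) | offset;   /* TLB hit */
            return true;
        }
    }
    uint32_t frame;                                  /* TLB miss */
    if (!walk_page_table(page, asid, &frame))
        return false;                                /* page fault */
    tlb[0] = (tlb_entry){ true, page, frame, asid }; /* cache it; a real TLB
                                                        would evict the LRU entry */
    *phys = (frame << 12) | offset;
    return true;
}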
If the TLB performs well enough, it almost doesn't matter how TLB misses are resolved. The IBM
Power PC and the HP Spectrum use inverted page tables organized as hash tables in conjunction with a
TLB. The MIPS computers (MIPS is now a division of Silicon Graphics) get rid of hardware page
tables altogether. A TLB miss causes an interrupt, and it is up to the OS to search the page table and
load the appropriate entry into the TLB. The OS typically uses an inverted page table implemented as a
software hash table.
Two processes may map the same page number to different page frames. Since the TLB hardware
searches for an entry by page number, there would be an ambiguity if entries corresponding to two
processes were in the TLB at the same time. There are two ways around this problem. Some systems
simply flush the TLB (set a bit in all entries marking them as unused) whenever they switch processes.
This is very expensive, not because of the cost of flushing the TLB, but because of all the TLB misses
that will happen when the new process starts running. An alternative approach is to add a process
identifier to each entry. The hardware then searches for the concatenation of the page number and
the process id of the current process.
We mentioned earlier that each page-table entry contains a ``valid'' bit as well as some other bits. These
other bits include
Protection
At a minimum one bit to flag the page as read-only or read/write. Sometimes more
bits to indicate whether the page may be executed as instructions, etc.
Modified
This bit, usually called the dirty bit, is set whenever the page is referenced by a write
(store) operation.
Referenced
This bit is set whenever the page is referenced for any reason, whether load or store.
We will see in the next section how these bits are used.
Page Replacement
[ Silberschatz, Galvin, and Gagne, Section 10.2-10.3 ] All of these hardware methods for implementing
paging have one thing in common: When the CPU generates a virtual address for which the
corresponding page table entry is marked invalid, the MMU generates a page fault interrupt and the OS
must handle the fault as explained above. The OS checks its tables to see why it marked the page as
invalid. There are (at least) three possible reasons:
• There is a bug in the program being run. In this case the OS simply kills the program (``memory
fault -- core dumped'').
• Unix treats a reference just beyond the end of a process' stack as a request to grow the stack. In
this case, the OS allocates a page frame, clears it to zeros, and updates the MMU's page tables
so that the requested page number points to the allocated frame.
• The requested page is on disk but not in memory. In this case, the OS allocates a page frame,
copies the page from disk into the frame, and updates the MMU's page tables so that the
requested page number points to the allocated frame.
In all but the first case, the OS is faced with the problem of choosing a frame. If there are any unused
frames, the choice is easy, but that will seldom be the case. When memory is heavily used, the choice of
frame is crucial for decent performance.
We will first consider page-replacement algorithms for a single process, and then consider algorithms to
use when there are multiple processes, all competing for the same set of frames.
Frame Allocation for a Single Process

FIFO
(First-in, first-out) Keep the page frames in an ordinary queue, moving a frame to the tail of the
queue when it is loaded with a new page, and always choose the frame at the head of the queue for
replacement. In other words, use the frame whose page has been in memory the longest. While this
algorithm may seem at first glance to be reasonable, it is actually about as bad as you can get. The
problem is that a page that has been in memory for a long time could equally likely be ``hot''
(frequently used) or ``cold'' (unused), but FIFO treats them the same way. In fact FIFO is no better
than, and may indeed be worse than
RAND
(Random) Simply pick a random frame. This algorithm is also pretty bad.
OPT
(Optimum) Pick the frame whose page will not be used for the longest time in the future. If there is
a page in memory that will never be used again, its frame is obviously the best choice for
replacement. Otherwise, if (for example) page A will be next referenced 8 million instructions in the
future and page B will be referenced 6 million instructions in the future, choose page A. This
algorithm is sometimes called Belady's MIN algorithm after its inventor. It can be shown that OPT
is the best possible algorithm, in the sense that for any reference string (sequence of page numbers
touched by a process), OPT gives the smallest number of page faults. Unfortunately, OPT, like SJF
processor scheduling, is unimplementable because it requires knowledge of the future. Its only use
is as a theoretical limit. If you have an algorithm you think looks promising, see how it compares to
OPT on some sample reference strings.
LRU
(Least Recently Used) Pick the frame whose page has not been referenced for the longest time. The
idea behind this algorithm is that page references are not random. Processes tend to have a few hot
pages that they reference over and over again. A page that has been recently referenced is likely to
be referenced again in the near future. Thus LRU is likely to approximate OPT. LRU is actually
quite a good algorithm. There are two ways of finding the least recently used page frame. One is to
maintain a list. Every time a page is referenced, it is moved to the head of the list. When a page
fault occurs, the least-recently used frame is the one at the tail of the list. Unfortunately, this
approach requires a list operation on every single memory reference, and even though it is a pretty
simple list operation, doing it on every reference is completely out of the question, even if it were
done in hardware. An alternative approach is to maintain a counter or timer, and on every reference
store the counter into a table entry associated with the referenced frame. On a page fault, search
through the table for the smallest entry. This approach requires a search through the whole table on
each page fault, but since page faults are expected to be tens of thousands of times less frequent than
memory references, that's ok. A clever variant on this scheme is to maintain an n by n array of bits,
initialized to 0, where n is the number of page frames. On a reference to page k, first set all the bits
in row k to 1 and then set all bits in column k to zero. It turns out that if row k has the smallest value
(when treated as a binary number), then frame k is the least recently used.
Unfortunately, all of these techniques require hardware support and nobody makes hardware that
supports them. Thus LRU, in its pure form, is just about as impractical as OPT. Fortunately, it is
possible to get a good enough approximation to LRU (which is probably why nobody makes
hardware to support true LRU).
NRU
(Not Recently Used) There is a form of support that is almost universally provided by the hardware:
Each page table entry has a referenced bit that is set to 1 by the hardware whenever the entry is used
in a translation. The hardware never clears this bit to zero, but the OS software can clear it
whenever it wants. With NRU, the OS arranges for periodic timer interrupts (say once every
millisecond) and on each ``tick,'' it goes through the page table and clears all the referenced bits. On
a page fault, the OS prefers frames whose referenced bits are still clear, since they contain pages
that have not been referenced since the last timer interrupt. The problem with this technique is that
the granularity is too coarse. If the last timer interrupt was recent, all the bits will be clear and there
will be no information to distinguish frames from each other.
SLRU
(Sampled LRU) This algorithm is similar to NRU, but before the referenced bit for a frame is
cleared it is saved in a counter associated with the frame and maintained in software by the OS. One
approach is to add the bit to the counter. The frame with the lowest counter value will be the one
that was referenced in the smallest number of recent ``ticks''. This variant is called NFU (Not
Frequently Used). A better approach is to shift the bit into the counter (from the left). The frame that
hasn't been referenced for the largest number of ``ticks'' will be associated with the counter that has
the largest number of leading zeros. Thus we can approximate the least-recently used frame by
selecting the frame corresponding to the smallest value (in binary). (That will select the frame
unreferenced for the largest number of ticks, and break ties in favor of the frame longest
unreferenced before that). This only approximates LRU for two reasons: It only records whether a
page was referenced during a tick, not when in the tick it was referenced, and it only remembers the
most recent n ticks, where n is the number of bits in the counter. We can get as close an
approximation to true LRU as we like, at the cost of increasing the overhead, by making the ticks
short and the counters very long.
Second Chance
When a page fault occurs, look at the page frames one at a time, in order of their physical addresses.
If the referenced bit is clear, choose the frame for replacement, and return. If the referenced bit is
set, give the frame a ``second chance'' by clearing its referenced bit and going on to the next frame
(wrapping around to frame zero at the end of memory). Eventually, a frame with a zero referenced
bit must be found, since at worst, the search will return to where it started. Each time this algorithm
is called, it starts searching where it last left off. This algorithm is usually called CLOCK because
the frames can be visualized as being around the rim of an (analogue) clock, with the current
location indicated by the second hand.
We have glossed over some details here. First, we said that when a frame is selected for replacement,
we have to copy its contents out to disk. Obviously, we can skip this step if the page frame is unused.
We can also skip the step if the page is ``clean,'' meaning that it has not been modified since it was read
into memory. Most MMU's have a dirty bit associated with each page. When the MMU is setting the
referenced bit for a page, it also sets the dirty bit if the reference is a write (store) reference. Most of the
algorithms above can be modified in an obvious way to prefer clean pages over dirty ones. For
example, one version of NRU always prefers an unreferenced page over a referenced one, but within each
category, it prefers clean pages over dirty ones. The CLOCK algorithm skips frames with either the
referenced or the dirty bit set. However, when it encounters a dirty frame, it starts a disk-write operation
to clean the frame. With this modification, we have to be careful not to get into an infinite loop. If the
hand makes a complete circuit finding nothing but dirty pages, the OS simply has to wait until one of
the page-cleaning requests finishes. Hopefully, this rarely if ever happens.
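Here is a sketch of the CLOCK (second chance) scan just described, including the preference for clean frames. The per-frame bits, the number of frames, and schedule_cleaning are assumptions for the illustration; returning -1 models the case where the hand finds nothing but dirty frames and the OS must wait for a cleaning request to finish.

#include <stdbool.h>

#define NFRAMES 1024

struct frame {
    bool referenced;   /* copied from / cleared in the MMU tables */
    bool dirty;
} frames[NFRAMES];

void schedule_cleaning(int f);   /* assumed: start writing frame f to disk */

static int hand = 0;             /* where the last search left off */

int choose_frame(void) {
    /* Two full circuits are enough: any frame whose referenced bit is
       cleared on the first pass can be chosen on the second. */
    for (int scanned = 0; scanned < 2 * NFRAMES; scanned++) {
        int f = hand;
        hand = (hand + 1) % NFRAMES;
        if (frames[f].referenced)
            frames[f].referenced = false;   /* give it a second chance */
        else if (frames[f].dirty)
            schedule_cleaning(f);           /* skip it, but start cleaning it */
        else
            return f;                       /* clean and unreferenced: take it */
    }
    return -1;   /* nothing but dirty frames: wait for a cleaning to finish */
}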
There is a curious phenomenon called Belady's Anomaly that comes up in some algorithms but not
others. Consider the reference string (sequence of page numbers) 0 1 2 3 0 1 4 0 1 2 3 4. If we use FIFO
with three page frames, we get 9 page faults, including the three faults to bring in the first three pages,
but with more memory (four frames), we actually get more faults (10).
Frame Allocation for Multiple Processes
[ Silberschatz, Galvin, and Gagne, Section 10.4-10.5 ]
Up to this point, we have been assuming that there is only one active process. When there are multiple
processes, things get more complicated. Algorithms that work well for one process can give terrible
results if they are extended to multiple processes in a naive way.
LRU would give excellent results for a single process, and all of the good practical algorithms can be
seen as ways of approximating LRU. A straightforward extension of LRU to multiple processes still
chooses the page frame that has not been referenced for the longest time. However, that is a lousy idea.
Consider a workload consisting of two processes. Process A is copying data from one file to another,
while process B is doing a CPU-intensive calculation on a large matrix. Whenever process A blocks for
I/O, it stops referencing its pages. After a while process B steals all the page frames away from A.
When A finally finishes with an I/O operation, it suffers a series of page faults until it gets back the
pages it needs, then computes for a very short time and blocks again on another I/O operation.
There are two problems here. First, we are calculating the time since the last reference to a page
incorrectly. The idea behind LRU is ``use it or lose it.'' If a process hasn't referenced a page for a long
time, we take that as evidence that it doesn't want the page any more and re-use the frame for another
purpose. But in a multiprogrammed system, there may be two different reasons why a process isn't
touching a page: because it is using other pages, or because it is blocked. Clearly, a process should only
be penalized for not using a page when it is actually running. To capture this idea, we introduce the
notion of virtual time. The virtual time of a process is the amount of CPU time it has used thus far. We
can think of each process as having its own clock, which runs only while the process is using the CPU.
It is easy for the CPU scheduler to keep track of virtual time. Whenever it starts a burst running on the
CPU, it records the current real time. When an interrupt occurs, it calculates the length of the burst that
just completed and adds that value to the virtual time of the process that was running. An
implementation of LRU should record which process owns each page, and record the virtual time its
owner last touched it. Then, when choosing a page to replace, we should consider the difference
between the timestamp on a page and the current virtual time of the page's owner. Algorithms that
attempt to approximate LRU should do something similar.
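The bookkeeping itself is only a few lines; in the sketch below, the PCB fields and now() (which returns the current real time) are assumptions for the illustration. A page-replacement algorithm would then compare the timestamp on a page against the current virtual_time of the page's owner.

struct pcb {
    long virtual_time;    /* total CPU time this process has used so far */
    long burst_started;   /* real time at which the current burst began  */
};

long now(void);   /* assumed: current real time, in some convenient unit */

void dispatch(struct pcb *p) {            /* called when a burst starts   */
    p->burst_started = now();
}

void undispatch(struct pcb *p) {          /* called on interrupt or block */
    p->virtual_time += now() - p->burst_started;
}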
There is another problem with our naive multi-process LRU. The CPU-bound process B has an
unlimited appetite for pages, whereas the I/O-bound process A only uses a few pages. Even if we
calculate LRU using virtual time, process B might occasionally steal pages from A. Giving more pages
to B doesn't really help it run any faster, but taking from A a page it really needs has a severe effect on
A. A moment's thought shows that an ideal page-replacement algorithm for this particular load would
divide memory into two pools. Process A would get as many pages as it needs and B would get the rest. Each
pool would be managed separately using LRU. That is, whenever B page faults, it would replace the page in
its pool that hadn't been referenced for the longest time.
In general, each process has a set of pages that it is actively using. This set is called the working set of
the process. If a process is not allocated enough memory to hold its working set, it will cause an
excessive number of page faults. But once a process has enough frames to hold its working set, giving
it more memory will have little or no effect.

More formally, given a number tau, the working set with parameter tau of a process, denoted W(tau), is the set of
pages touched by the process during its most recent tau references to memory. Because most processes
have a very high degree of locality, the exact value of tau is not very important provided it's large enough. A
common choice of tau is the number of instructions executed in 1/2 second. In other words, we will
consider the working set of a process to be the set of pages it has touched during the previous 1/2
second of virtual time. The Working Set Model of program behavior says that the system will only run
efficiently if each process is given enough page frames to hold its working set. What if there aren't
enough frames to hold the working sets of all processes? In this case, memory is over-committed and it
is hopeless to run all the processes efficiently. It would be better to simply stop one of the processes and
give its pages to others.
Another way of looking at this phenomenon is to consider CPU utilization as a function of the level of
multiprogramming (number of processes). With too few processes, we can't keep the CPU busy. Thus
as we increase the number of processes, we would like to see the CPU utilization steadily improve,
eventually getting close to 100%. Realistically, we cannot expect to quite that well, but we would still
expect increasing performance when we add more processes.
Unfortunately, if we allow memory to become over-committed, something very different may happen:

After a point, adding more processes doesn't help because the new processes do not have enough
memory to run efficiently. They end up spending all their time page-faulting instead of doing useful
work. In fact, the extra page-fault load on the disk ends up slowing down other processes until we reach
a point where nothing is happening but disk traffic. This phenomenon is called thrashing.
The moral of the story is that there is no point in trying to run more processes than will fit in memory.
When we say a process ``fits in memory,'' we mean that enough page frames have been allocated to it to
hold all of its working set. What should we do when we have more processes than will fit? In a batch
system (one where users drop off their jobs and expect them to be run some time in the future), we can
just delay starting a new job until there is enough memory to hold its working set. In an interactive
system, we may not have that option. Users can start processes whenever they want. We still have the
option of modifying the scheduler however. If we decide there are too many processes, we can stop one
or more processes (tell the scheduler not to run them). The page frames assigned to those processes can
then be taken away and given to other processes. It is common to say the stopped processes have been
``swapped out'' by analogy with a swapping system, since all of the pages of the stopped processes have
been moved from main memory to disk. When more memory becomes available (because a process has
terminated or because its working set has become smaller) we can ``swap in'' one of the stopped
processes. We could explicitly bring its working set back into memory, but it is sufficient (and usually a
better idea) just to make the process runnable. It will quickly bring its working set back into memory
simply by causing page faults. This control of the number of active processes is called load control. It is
also sometimes called medium-term scheduling as contrasted with long-term scheduling, which is
concerned with deciding when to start a new job, and short-term scheduling, which determines how to
allocate the CPU resource among the currently active jobs.
It cannot be stressed too strongly that load control is an essential component of any good page-
replacement algorithm. When a page fault occurs, we want to make a good decision on which page to
replace. But sometimes no decision is good, because there simply are not enough page frames. At that
point, we must decide to run some of the processes well rather than run all of them very poorly.
This is a very good model, but it doesn't immediately translate into an algorithm. Various specific
algorithms have been proposed. As in the single process case, some are theoretically good but
unimplementable, while others are easy to implement but bad. The trick is to find a reasonable
compromise.
Fixed Allocation
Give each process a fixed number of page frames. When a page fault occurs use LRU
or some approximation to it, but only consider frames that belong to the faulting
process. The trouble with this approach is that it is not at all obvious how to decide
how many frames to allocate to each process. If you give a process too few frames, it
will thrash. If you give it too many, the extra frames are wasted; you would be better
off giving those frames to another process, or starting another job (in a batch system).
In some environments, it may be possible to statically estimate the memory
requirements of each job. For example, a real-time control system tends to run a fixed
collection of processes for a very long time. The characteristics of each process can
be carefully measured and the system can be tuned to give each process exactly the
amount of memory it needs. Fixed allocation has also been tried with batch systems:
Each user is required to declare the memory allocation of a job when it is submitted.
The customer is charged both for memory allocated and for I/O traffic, including
traffic caused by page faults. The idea is that the customer has the incentive to
declare the optimum size for his job. Unfortunately, even assuming good will on the
part of the user, it can be very hard to estimate the memory demands of a job.
Besides, the working-set size can change over the life of the job.
Page-Fault Frequency (PFF)
This approach is similar to fixed allocation, but the allocations are dynamically
adjusted. The OS continuously monitors the fault rate of each process, in page faults
per second of virtual time. If the fault rate of a process gets too high, either give it
more pages or swap it out. If the fault rate gets too low, take some pages away. When
you get back enough pages this way, either start another job (in a batch system) or
restart some job that was swapped out. This technique is actually used in some
existing systems. The problem is choosing the right values of ``too high'' and ``too
low.'' You also have to be careful to avoid an unstable system, where you are
continually stealing pages from a process until it thrashes and then giving them back.
Working Set
The Working Set (WS) algorithm (as contrasted with the working set model) is as
follows: Constantly monitor the working set (as defined above) of each process.
Whenever a page leaves the working set, immediately take it away from the process
and add its frame to a pool of free frames. When a process page faults, allocate it a
frame from the pool of free frames. If the pool becomes empty, we have an overload
situation--the sum of the working set sizes of the active processes exceeds the size of
physical memory--so one of the processes is stopped. The problem is that WS, like
SJF or true LRU, is not implementable. A page may leave a process' working set at
any time, so the WS algorithm would require the working set to be monitored on
every single memory reference. That's not something that can be done by software,
and it would be totally impractical to build special hardware to do it. Thus all good
multi-process paging algorithms are essentially approximations to WS.
Clock
Some systems use a global CLOCK algorithm, with all frames, regardless of current
owner, included in a single clock. As we said above, CLOCK approximates LRU, so
global CLOCK approximates global LRU, which, as we said, is not a good algorithm.
However, by being a little careful, we can fix the worst failing of global clock. If the
clock ``hand'' is moving too ``fast'' (i.e., if we have to examine too many frames
before finding one to replace on an average call), we can take that as evidence that
memory is over-committed and swap out some process.
WSClock
An interesting algorithm has been proposed (but not, to the best of my knowledge,
widely implemented) that combines some of the best features of WS and CLOCK.
Assume that we keep track of the current virtual time VT(p) of each process p. Also
assume that in addition to the reference and dirty bits maintained by the hardware for
each page frame i, we also keep track of process[i] (the identity of process that owns
the page currently occupying the frame) and LR[i] (an approximation to the time of
the last reference to the frame). The time stamp LR[i] is expressed as the last
reference time according to the virtual time of the process that owns the frame.
In this flow chart, the WS parameter (the size of the window in virtual time used to
determine whether a page is in the working set) is denoted by the Greek letter tau.
The parameter F is the number of frames--i.e., the size of physical memory divided
by the page size. Like CLOCK, WSClock walks through the frames in order, looking
for a good candidate for replacement, cleaning the reference bits as it goes. If the
frame has been referenced since it was last inspected, it is given a ``second chance''.
(The counter LR[i] is also updated to indicate that the page has been referenced recently
in terms of the virtual time of its owner.) If not, the page is given a ``third chance'' by
seeing whether it appears to be in the working set of its owner. The time since its last
reference is approximately calculated by subtracting LR[i] from the current (virtual)
time. If the result is less than the parameter tau, the frame is passed over. If the page
fails this test, it is either used immediately or scheduled for cleaning (writing its
contents out to disk and clearing the dirty bit) depending on whether it is clean or
dirty. There is one final complication: If a frame is about to be passed over because it
was referenced recently, the algorithm checks whether the owning process is active,
and takes the frame anyhow if not. This extra check allows the algorithm to grab the
pages of processes that have been stopped by the load-control algorithm. Without it,
pages of stopped processes would never get any ``older'' because the virtual time of a
stopped process stops advancing.
Like CLOCK, WSClock has to be careful to avoid an infinite loop. As in the CLOCK
algorithm, it may make a complete circuit of the clock finding only dirty candidate
pages. In that case, it has to wait for one of the cleaning requests to finish. It may also
find that all pages are unreferenced but "new" (the reference bit is clear but the
comparison to tau shows the page has been referenced recently). In either case,
memory is overcommitted and some process needs to be stopped.
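Putting the pieces together, one plausible rendering of the WSClock scan described above is sketched below. The data structures, the helper functions, and the particular value of tau are all assumptions made for the illustration, not a transcription of the published algorithm.

#include <stdbool.h>

#define NFRAMES 1024

extern bool referenced[NFRAMES], dirty[NFRAMES];
extern int  owner[NFRAMES];           /* process that owns each frame          */
extern long LR[NFRAMES];              /* owner's virtual time of last use      */
extern long virtual_time(int p);      /* VT(p)                                 */
extern bool is_active(int p);         /* false if p has been stopped/swapped   */
extern void schedule_cleaning(int f); /* start writing frame f out to disk     */

static int hand = 0;
static const long tau = 500000;       /* working-set window, in virtual-time units */

int wsclock_choose_frame(void) {
    for (int scanned = 0; scanned < NFRAMES; scanned++) {
        int f = hand;
        hand = (hand + 1) % NFRAMES;
        int p = owner[f];

        bool in_working_set;
        if (referenced[f]) {                      /* second chance          */
            referenced[f] = false;
            LR[f] = virtual_time(p);
            in_working_set = true;
        } else {                                  /* third chance           */
            in_working_set = (virtual_time(p) - LR[f] < tau);
        }
        if (in_working_set && is_active(p))
            continue;                             /* pass the frame over    */
        if (dirty[f]) {
            schedule_cleaning(f);                 /* usable once it's clean */
            continue;
        }
        return f;                                 /* clean: use it now      */
    }
    return -1;   /* memory is over-committed: stop a process or wait */
}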



CS 537
Lecture Notes Part 7a
More About Paging

Paging Details
Real-world hardware CPUs have all sorts of ``features'' that make life hard for people trying to write
page-fault handlers in operating systems. Among the practical issues are the following.
Page Size
How big should a page be? This is really a hardware design question, but since it depends on OS
considerations, we will discuss it here. If pages are too large, lots of space will be wasted by internal
fragmentation: A process only needs a few bytes, but must take a full page. As a rough estimate, about
half of the last page of a process will be wasted on the average. Actually, the average waste will be
somewhat larger, if the typical process is small compared to the size of a page. For example, if a page is
8K bytes and the typical process is only 1K, 7/8 of the space will be wasted. Also, the relative amount
of waste as a percentage of the space used depends on the size of a typical process. All these
considerations imply that as typical processes get bigger and bigger, internal fragmentation becomes
less and less of a problem.
On the other hand, with smaller pages it takes more page table entries to describe a given process,
leading to space overhead for the page tables, but more importantly time overhead for any operation
that manipulates them. In particular, it adds to the time needed to switch from one process to another.
The details depend on how page tables are organized. For example, if the page tables are in registers,
those registers have to be reloaded. A TLB will need more entries to cover the same size ``working set,''
making it more expensive and requiring more time to re-load the TLB when changing processes. In short,
all current trends point to larger and larger pages in the future.
If space overhead is the only consideration, it can be shown that the optimal size of a page is sqrt(2se),
where s is the size of an average process and e is the size of a page-table entry. This calculation is based
on balancing the space wasted by internal fragmentation against the space used for page tables. This
formula should be taken with a big grain of salt however, because it overlooks the time overhead
incurred by smaller pages.
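For example, plugging round, hypothetical numbers into the formula: with an average process size of s = 1 megabyte and page-table entries of e = 4 bytes, sqrt(2se) = sqrt(2 x 1,048,576 x 4), or roughly 2.9K bytes, which would point to a page size of 2K or 4K by this criterion alone.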
Restarting the instruction
After the OS has brought in the missing page and fixed up the page table, it should restart the process in
such a way as to cause it to re-try the offending instruction. Unfortunately, that may not be easy to do,
for a variety of reasons.
Variable-length instructions
Some CPU architectures have instructions with varying numbers of arguments. For
example the Motorola 68000 has a move instruction with two arguments (source and
target of the move). It can cause faults for three different reasons: the instruction
itself or either of the two operands. The fault handler has to determine which
reference faulted. On some computers, the OS has to figure that out by interpreting
the instruction and in effect simulating the hardware. The 68000 made it easier for the
OS by updating the PC as it goes, so the PC will be pointing at the word immediately
following the part of the instruction that caused the fault. On the other hand, this
makes it harder to restart the instruction: How can the OS figure out where the
instruction started, so that it can back the PC up to retry?
Side effects
Some computers have addressing modes that automatically increment or decrement
index registers as a side effect, making it easy to simulate in one step the effect of the
C statement *p++ = *q++;. Unfortunately, if an instruction faults part-way
through, it may be difficult to figure out which registers have been modified so that
they can be restored to their original state. Some computers also have instructions
such as ``move characters,'' which work on variable-length data fields, updating a
pointer or count register. If an operand crosses a page boundary, the instruction may
fault part-way through, leaving a pointer or counter register modified.
Fortunately, most CPU designers know enough about operating systems to understand these problems
and add hardware features to allow the OS to recover. Either they undo the effects of the instruction
before faulting, or they dump enough information into registers somewhere that the OS can undo them.
The original 68000 did neither of these and so paging was not possible on the 68000. It wasn't that the
designers were ignorant of OS issues, it was just that there was not enough room on the chip to add the
features. However, one clever manufacturer built a box with two 68000 CPUs and an MMU chip. The
first CPU ran ``user'' code. When the MMU detected a page fault, instead of interrupting the first CPU,
it delayed responding to it and interrupted the second CPU. The second CPU would run all the OS code
necessary to respond to the fault and then cause the MMU to retry the storage access. This time, the
access would succeed and return the desired result to the first CPU, which never realized there was a
problem.
Locking Pages
There are a variety of cases in which the OS must prevent certain page frames from being chosen by the
page-replacement algorithm. For example, suppose the OS has chosen a particular frame to service a
page fault and sent a request to the disk scheduler to read in the page. The request may take a long time
to service, so the OS will allow other processes to run in the meantime. It must be careful, however,
that a fault by another process does not choose the same page frame! A similar problem involves I/O.
When a process requests an I/O operation it gives the virtual address of the buffer the data is supposed
to be read into or written out of. Since DMA devices generally do not know anything about virtual
memory, the OS translates the buffer address into a physical memory location (a frame number and
offset) before starting the I/O device. It would be very embarrassing if the frame were chosen by the
page-replacement algorithm before the I/O operation completes. Both of these problems can be avoided
by marking the frame as ineligible for replacement. We usually say that the page in that frame is
``pinned'' in memory. An alternative way of avoiding the I/O problem is to do the I/O operation into or out
of pages that belong to the OS kernel (and are not subject to replacement) and to copy between these
pages and user pages.
Missing Reference Bits
At least one popular computer, the Digital Equipment Corp. VAX computer, did not have any REF bits
in its MMU. Some people at the University of California at Berkeley came up with a clever way of
simulating the REF bits in software. Whenever the OS cleared the simulated REF bit for a page, it marked
the hardware page-table entry for the page as invalid. When the process first referenced the page, it
would cause a page fault. The OS would note that the page really was in memory, so the fault handler
could return without doing any I/O operations, but the fault would give the OS the chance to turn the
simulated REF bit on and mark the page as valid, so subsequent references to the page would not cause
page faults. Although the software simulated hardware with a real REF bit, the net result was that
there was a rather high cost to clearing the simulated REF bit. The people at Berkeley therefore
developed a version of the CLOCK algorithm that allowed them to clear the REF bit infrequently.
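One way the Berkeley trick might be sketched (in Java for consistency with the rest of these notes; the
real BSD code is of course C inside the kernel, and the field names below are invented):

// Page-table entry as seen by the OS.  The MMU honors only hardwareValid;
// simulatedRef lives purely in OS data structures.
class PageTableEntry {
    boolean present;          // the page really is in some frame
    boolean hardwareValid;    // what the MMU checks on every reference
    boolean simulatedRef;     // software-maintained reference bit
}

class RefBitSimulator {
    // Called when the CLOCK algorithm wants to clear the REF bit.
    void clearRef(PageTableEntry pte) {
        pte.simulatedRef = false;
        pte.hardwareValid = false;     // force a fault on the next reference
    }

    // Called from the page-fault handler.  Returns true if the fault was only
    // a ``soft'' fault used to simulate the REF bit (no I/O needed).
    boolean handleFault(PageTableEntry pte) {
        if (pte.present) {
            pte.simulatedRef = true;   // the page was just referenced
            pte.hardwareValid = true;  // stop faulting until the next clearRef
            return true;
        }
        return false;                  // a genuine fault; bring the page in
    }
}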
Fault Handling
Overall, the core of the OS kernel looks something like this:

// This is the procedure that gets called when an interrupt occurs.
// On some computers, there is a different handler for each "kind"
// of interrupt.
void handler() {
    save_process_state(current_PCB);
    // Some state (such as the PC) is automatically saved by the HW.
    // This code copies that info to the PCB and possibly saves some
    // more state.
    switch (what_caused_the_trap) {
    case PAGE_FAULT:
        f = choose_frame();
        if (is_dirty(f))
            schedule_write_request(f);  // to clean the frame first
        else
            schedule_read_request(f);   // to read in the requested page
        record_state(current_PCB);
            // to indicate what this process is up to
        make_unrunnable(current_PCB);
        current_PCB = select_some_other_ready_process();
        break;
    case IO_COMPLETION:
        p = process_that_requested_the_IO();
        switch (reason_for_the_IO) {
        case PAGE_CLEANING:
            // the frame is now clean; read in the requested page
            schedule_read_request(f);
            break;
        case BRING_IN_NEW_PAGE:
        case EXPLICIT_IO_REQUEST:
            make_runnable(p);
            break;
        }
        break;
    case IO_REQUEST:
        schedule_io_request();
        record_state(current_PCB);
            // to indicate what this process is up to
        make_unrunnable(current_PCB);
        current_PCB = select_some_other_ready_process();
        break;
    case OTHER_OS_REQUEST:
        perform_request();
        break;
    }
    // At this point, the current_PCB is pointing to a process that
    // is ready to run. It may or may not be the process that was
    // running when the interrupt occurred.
    restore_state(current_PCB);
    return_from_interrupt(current_PCB);
    // This hardware instruction restores the PC (and possibly other
    // hardware state) and allows the indicated process to continue.
}


CS 537 Lecture Notes, Part 8
Segmentation
• Segmentation
• Multics
• Intel x86

Segmentation
[ Silberschatz, Galvin, and Gagne, Section 9.5 ]
In accord with the beautification principle, paging makes the main memory of the computer look more
``beautiful'' in several ways.
• It gives each process its own virtual memory, which looks like a private version of the main
memory of the computer. In this sense, paging does for memory what the process abstraction
does for the CPU. Even though the computer hardware may have only one CPU (or perhaps a
few CPUs), each ``user'' can have his own private virtual CPU (process). Similarly, paging gives
each process its own virtual memory, which is separate from the memories of other processes
and protected from them.
• Each virtual memory looks like a linear array of bytes, with addresses starting at zero. This
feature simplifies relocation: Every program can be compiled under the assumption that it will
start at address zero.
• It makes the memory look bigger, by keeping infrequently used portions of the virtual memory
space of a process on disk rather than in main memory. This feature both promotes more
efficient sharing of the scarce memory resource among processes and allows each process to
treat its memory as essentially unbounded in size. Just as a process doesn't have to worry about
doing some operation that may block because it knows that the OS will run some other process
while it is waiting, it doesn't have to worry about allocating lots of space to a rarely (or sparsely)
used data structure because the OS will only allocate real memory to the part that's actually
being used.
Segmentation carries this feature one step further by allowing each process to have multiple ``simulated
memories.'' Each of these memories (called a segment) starts at address zero, is independently
protected, and can be separately paged. In a segmented system, a memory address has two parts: a
segment number and a segment offset. Most systems have some sort of segmentation, but often it is
quite limited. Unix has exactly three segments per process. One segment (called the text segment) holds
the executable code of the process. It is generally1 read-only, fixed in size when the process starts, and
shared among all processes running the same program. Sometimes read-only data (such as constants)
are also placed in this segment. Another segment (the data segment) holds the memory used for global
variables. Its protection is read/write (but usually not executable), and is normally not shared between
processes.2 There is a special system call to extend the size of the data segment of a process. The third
segment is the stack segment. As the name implies, it is used for the process' stack, which is used to
hold information used in procedure calls and returns (return address, saved contents of registers, etc.) as
well as local variables of procedures. Like the data segment, the stack is read/write but usually not
executable. The stack is automatically extended by the OS whenever the process causes a fault by
referencing an address beyond the current size of the stack (usually in the course of a procedure call). It
is not shared between processes. Some variants of Unix have a fourth segment, which contains part of
the OS data structures. It is read-only and shared by all processes.
Many application programs would be easier to write if they could have as many segments as they liked.
As an example of an application program that might want multiple segments, consider a compiler. In
addition to the usual text, data, and stack segments, it could use one segment for the source of the
program being compiled, one for the symbol table, etc. (see Fig 9.18 on page 287). Breaking the
address space up into segments also helps sharing (see Fig. 9.19 on page 288). For example, most
programs in Unix include the library program printf. If the executable code of printf were in a
separate segment, that segment could easily be shared by multiple processes, allowing (slightly) more
efficient sharing of physical memory.3
If you think of the virtual address as being the concatenation of the segment number and the segment
offset, segmentation looks superficially like paging. The main difference is that the application
programmer is aware of the segment boundaries, but can ignore the fact that the address space is
divided up into pages.
The implementation of segmentation is also superficially similar to the implementation of paging (see
Fig 9.17 on page 286). The segment number is used to index into a table of ``segment descriptors,'' each
of which contains the length and starting address of a segment as well as protection information. If the
segment offset is not less than the segment length, the MMU traps with a segmentation violation.
Otherwise, the segment offset is added to the starting address in the descriptor to get the resulting
physical address. There are several differences between the implementation of segments and pages, all
derived from the fact that the size of a segment is variable, while the size of a page is ``built-in.''
• The size of the segment is stored in the segment descriptor and compared with the segment
offset. The size of a page need not be stored anywhere because it is always the same. It is
always a power of two and the page offset has just enough bits to represent any legal offset, so it
is impossible for the page offset to be out of bounds. For example, if the page size is 4k (4096)
bytes, the page offset is a 12-bit field, which can only contain numbers in the range 0...4095.
• The segment descriptor contains the physical address of the start of the segment. Since all page
frames are required to start at an address that is a multiple of the page size, which is a power of
two, the low-order bits of the physical address of a frame are always zero. For example, if pages
are 4k bytes, the physical address of each page frame ends with 12 zeros. Thus a page table
entry contains a frame number, which is just the higher-order bits of the physical address of the
frame, and the MMU concatenates the frame number with the page offset, as contrasted with
adding the physical address of a segment with the segment offset.
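Here is a minimal sketch of the segment-table lookup just described. The descriptor fields and the use
of exceptions to model hardware traps are illustrative only:

// Illustrative segment descriptor: base address, length, and a write-permission bit.
class SegmentDescriptor {
    long base;         // physical address where the segment starts
    long length;       // segment size in bytes
    boolean writable;
}

class SegmentationMMU {
    SegmentDescriptor[] segmentTable;

    SegmentationMMU(SegmentDescriptor[] table) { segmentTable = table; }

    // Translate (segment number, offset) into a physical address, ``trapping''
    // on bad segment numbers, out-of-bounds offsets, or protection violations.
    long translate(int segment, long offset, boolean isWrite) {
        if (segment < 0 || segment >= segmentTable.length)
            throw new RuntimeException("segmentation violation: bad segment number");
        SegmentDescriptor d = segmentTable[segment];
        if (offset >= d.length)
            throw new RuntimeException("segmentation violation: offset too large");
        if (isWrite && !d.writable)
            throw new RuntimeException("protection violation");
        return d.base + offset;   // add (not concatenate), because a segment
                                  // can start at an arbitrary address
    }
}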

Multics
One of the advantages of segmentation is that each segment can be large and can grow dynamically. To
get this effect, we have to page each segment. One way to do this is to have each segment descriptor
contain the (physical) address of a page table for the segment rather than the address of the segment
itself. This is the way segmentation works in Multics, the granddaddy of all modern operating systems
and a pioneer of the idea of segmentation. Multics ran on the General Electric (later Honeywell) 635
computer, which was a 36-bit word-addressable machine, which means that memory is divided into
36-bit words, with consecutive words having addresses that differ by 1 (there were no bytes). A virtual
address was 36 bits long, with the high 18 bits interpreted as the segment number and the low 18 bits as
segment offset. Although 18 bits allows a maximum size of 2^18 = 262,144 words, the software enforced
a maximum segment size of 2^16 = 65,536 words. Thus the segment offset is effectively 16 bits long.
Associated with each process is a table called the descriptor segment. There is a register called the
Descriptor Segment Base Register (DSBR) that points to it and a register called the Descriptor Segment
Length Register (DSLR) that indicates the number of entries in the descriptor segment.
First the segment number in the virtual address is used to index into the descriptor segment to find the
appropriate descriptor. (If the segment number is too large, a fault occurs). The descriptor contains
permission information, which is checked to see if the current process has rights to access the segment
as requested. If that check succeeds, the memory address of a page table for the segment is found in the
descriptor. Since each page is 1024 words long, the 16-bit segment offset is interpreted as a 6-bit page
number and a 10-bit offset within the page. The page number is used to index into the page table to get
an entry containing a valid bit and frame number. If the valid bit is set, the physical address of the
desired word is found by concatenating the frame number with the 10-bit page offset from the virtual
address.
Actually, I've left out one important detail to simplify the description. The ``descriptor segment'' really
is a segment, which means it really is paged, just like any other segment. Thus there is another page
table that is the page table for the descriptor segment. The 18-bit segment number from the virtual
address is split into an 8-bit page number and a 10-bit offset. The page number is used to select an entry
from the decriptor segment's page table. That entry contains the (physical) address of a page of the
descriptor segment, and the page-offset field of the segment number is used to index into that page to
get the descriptor itself. The rest of the translation occurs as described in the preceding paragraph. In
total, each memory reference turns into four accesses to memory.
1. one to retrieve an entry from the descriptor segment's page table,
2. one to retrieve the descriptor itself,
3. one to retrieve an entry from the page table for the desired segment, and
4. one to load or store the desired data.
Multics used a TLB mapping the segment number and page number within the segment to a page frame
to avoid three of these accesses in most cases.
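The four-step translation can be sketched as follows. This is a deliberate simplification: a real Multics
descriptor holds more than a page-table address, and the memory model here is just an array of words,
but the shape of the lookup matches the description above:

// Simplified model of the Multics translation path.  "memory" is physical
// memory, one element per 36-bit word; pages and frames are 1024 words.
class MulticsMMU {
    long[] memory;
    long dsbr;   // address of the page table for the descriptor segment

    MulticsMMU(long[] memory, long dsbr) { this.memory = memory; this.dsbr = dsbr; }

    long translate(long virtualAddress) {
        int segNo  = (int) ((virtualAddress >>> 18) & 0x3FFFF); // high 18 bits
        int segOff = (int) (virtualAddress & 0xFFFF);           // effectively 16 bits

        // Which page of the descriptor segment, and where in that page?
        int dsPage   = (segNo >>> 10) & 0xFF;    // 8-bit page number
        int dsOffset = segNo & 0x3FF;            // 10-bit offset

        long dsFrame    = memory[(int) (dsbr + dsPage)];              // access 1
        long descriptor = memory[(int) (dsFrame * 1024 + dsOffset)];  // access 2

        // Pretend the descriptor is simply the address of the segment's page table.
        long segPageTable = descriptor;
        int page    = (segOff >>> 10) & 0x3F;    // 6-bit page number
        int pageOff = segOff & 0x3FF;            // 10-bit offset

        long frame = memory[(int) (segPageTable + page)];             // access 3
        return frame * 1024 + pageOff;           // access 4 loads or stores here
    }
}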

Intel x86
[ Silberschatz, Galvin, and Gagne, Section 9.6 ]
The Intel 386 (and subsequent members of the x86 family used in personal computers) uses a different
approach to combining paging with segmentation. A virtual address consists of a 16-bit segment
selector and a 16 or 32-bit segment offset. The selector is used to fetch a segment descriptor from a
table (actually, there are two tables and one of the bits of the selector is used to choose which table).
The 64-bit descriptor contains the 32-bit address of the segment (called the segment base), 21 bits
indicating its length, and miscellaneous bits indicating protections and other options. The segment
length is indicated by a 20-bit limit and one bit to indicate whether the limit should be interpreted as
bytes or pages. (The segment base and limit ``fields'' are actually scattered around the descriptor to
provide compatibility with earlier versions of the hardware.) If the offset from the original virtual
address does not exceed the segment length, it is added to the base to get a ``physical'' address called
the linear address (see Fig 9.20 on page 292). If paging is turned off, the linear address really is the
physical address. Otherwise, it is translated by a two-level page table as described previously, with the
32-bit address divided into two 10-bit page numbers and a 12-bit offset (a page is 4K on this machine).
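The last step, carving the 32-bit linear address into two 10-bit table indices and a 12-bit offset, is just
bit manipulation (the sample address below is arbitrary):

// Splitting a 32-bit x86 linear address: top 10 bits index the page directory,
// the next 10 bits index a page table, and the low 12 bits are the page offset.
public class LinearAddress {
    static int dirIndex(int linear)   { return (linear >>> 22) & 0x3FF; }
    static int tableIndex(int linear) { return (linear >>> 12) & 0x3FF; }
    static int pageOffset(int linear) { return linear & 0xFFF; }

    public static void main(String[] args) {
        int linear = 0x12345678;
        System.out.printf("dir=%d table=%d offset=0x%x%n",
                dirIndex(linear), tableIndex(linear), pageOffset(linear));
        // prints: dir=72 table=837 offset=0x678
    }
}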

1. I have to say ``generally'' here and elsewhere when I talk about Unix because there are many variants
of Unix in existence. Sometimes I will use the term ``classic Unix'' to describe the features that were in
Unix before it spread to many distinct dialects. Features in classic Unix are generally found in all of its
dialects. Sometimes features introduced in one variant became so popular that they were widely
imitated and are now available in most dialects.
2. This is a good example of one of those ``popular'' features not in classic Unix but in most modern
variants: System V (an AT&T variant of Unix) introduced the ability to map a chunk of virtual memory
into the address spaces of multiple processes at some offset in the data segment (perhaps a different
offset in each process). This chunk is called a ``shared memory segment,'' but is not a segment in the
sense we are using the term here. So-called ``System V shared memory'' is available in most current
versions of Unix.
3. Many variants of Unix get a similar effect with so-called ``shared libraries,'' which are implemented
with shared memory but without general-purpose segmentation support.



CS 537
Lecture Notes, Part 9
Disk Scheduling

Contents
• Disk Hardware
• Disk Scheduling

Disk Hardware
[ Silberschatz, Galvin, and Gagne Section 13.1 ]
A (hard) disk drive records data on the surfaces of metal plates called platters that are coated with a
substance containing ground-up iron, or other substances that allow zeros and ones to be recorded as
tiny spots of magnetization. Floppy disks (also called ``diskettes'' by those who think the term ``floppy''
is undignified) are similar, but use a sheet of plastic rather than metal, and permanently enclose it in a
paper or plastic envelope. I won't say anything more about floppy disks, but most of the facts about hard
disks are also true of floppies, only slower. It is customary to use the simple term ``disk'' to mean ``hard
disk drive'' and say ``platter'' when you mean the disk itself.
When in use, the disk spins rapidly and a read/write head slides along the surface. Usually, both sides
of a platter are used for recording, so there is a head for each surface. In some more expensive disk
drives, there are several platters, all on a common axle spinning together. The heads are fixed to an arm
that can move radially in towards the axle or out towards the edges of the platters. All of the heads are
attached to the same arm, so they are all at the same distance from the centers of their platters at any
given time.
To read or write a bit of data on the disk, a head has to be right over the spot where the data is stored.
This may require three operations, giving rise to four kinds of delay.
• The correct head (i.e., the correct surface) must be selected. This is done electronically, so it is
very fast (at most a few microseconds).
• The head has to be moved to the correct distance from the center of the disk. This movement is
called seeking and involves physically moving the arm in or out. Because the arm has mass
(inertia), it must be accelerated and decelerated. When it finally gets where it's going, the disk
has to wait a bit for the vibrations caused by the jerky movement to die out. All in all, seeking
can take several milliseconds, depending on how far the head has to move.
• The disk has to rotate until the correct spot is under the selected head. Since the disk is
constantly spinning, all the drive has to do is wait for the correct spot to come around.
• Finally, the actual data has to be transferred. On a read operation, the data is usually transferred
to a RAM buffer in the device and then copied, by DMA, to the computer's main memory.
Similarly, on write, the data is transferred by DMA to a buffer in the disk, and then copied onto
the surface of a platter.
The total time spent getting to the right place on the disk is called latency and is divided into rotational
latency and seek time (although sometimes people use the term ``seek time'' to cover both kinds of
latency).
The data on a disk is divided up into fixed-sized disk blocks. The hardware only supports reading or
writing a whole block at a time. If a program wants to change one bit (or one byte) on the disk, it has to
read in an entire disk block, change the part of it it wants to change, and then write it back out. Each
block has a location, sometimes called a disk address that consists of three numbers: surface, track, and
sector. The part of the disk swept out by a head while it is not moving is a ring-shaped region on the
surface called a track. The track number indicates how far the data is from the center of the disk (the
axle). Each track is divided up into some number of sectors. On some disks, the outer tracks have more
sectors than the inner ones because the outer tracks are longer, but all sectors are the same size. The
set of tracks swept out by all the heads while the arm is not moving is called a cylinder. Thus a seek
operation moves to a new cylinder, positioning each head on one track of the cylinder.
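Under the simplifying assumption that every track holds the same number of sectors (as just noted, real
disks often give outer tracks more), converting a (surface, track, sector) address into a single linear
block number is simple arithmetic; the constants below are invented for illustration:

// Number blocks cylinder by cylinder, so that consecutive block numbers can
// be read without moving the arm.
public class DiskAddress {
    static final int SURFACES = 8;
    static final int SECTORS_PER_TRACK = 63;   // made-up figure

    static int toBlockNumber(int surface, int track, int sector) {
        return (track * SURFACES + surface) * SECTORS_PER_TRACK + sector;
    }

    public static void main(String[] args) {
        System.out.println(toBlockNumber(0, 0, 0));   // 0
        System.out.println(toBlockNumber(1, 0, 0));   // 63: next surface, same cylinder
        System.out.println(toBlockNumber(0, 1, 0));   // 504: next cylinder
    }
}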
This basic picture of disk organization hasn't changed much in forty years. What has changed is that
disks keep getting smaller and cheaper and the data on the surfaces gets denser (the spots used to record
bits are getting smaller and closer together). The first disks were several feet in diameter, cost tens of
thousands of dollars, and held tens of thousands of bytes. Currently (1998) a typical disk is 3-1/2 inches
in diameter, costs a few hundred dollars and holds several gigabytes (billions of bytes) of data. What
hasn't changed much is physical limitations. Early disks spun at 3600 revolutions per minute (RPM),
and only in the last couple of years have faster rotation speeds become common (7200 RPM is currently a
common speed for mid-range workstation disks). At 7200 RPM, the rotational latency is at worst
1/7200 minute (8.33 milliseconds) and on the average it is half that (4.17 ms). The heads and the arm
that moves them have gotten much smaller and lighter, allowing them to be moved more quickly, but
the improvement has been modest. Current disks take anywhere from a millisecond to 10s of
milliseconds to seek to a particular cylinder.
Just for reference, here are the specs for a popular disk used in PC's and currently selling at the
DoIT tech store for $287.86.
Capacity 27.3 Gbyte
Platters 4
Heads 8
Cylinders 17,494
Sector size 512 bytes
Sectors per track ???
Max recording density 282K bits/inch
Min seek (1 track) 2.2 ms
Max seek 15.5 ms
Average seek 9.0 ms
Rotational speed 7200 RPM
Average rotational latency 4.17 ms
Media transfer rate 248 to 284 Mbits/sec
Data buffer 2MB
Minimum sustained transfer rate 13.8 to 22.9 MB/sec
Price About $287.86
Disk manufacturers such as Seagate, Quantum, and IBM have details of many more disks available
online.
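Using the figures in the table above, a back-of-the-envelope estimate of the time for one random 8 KB
read might look like this (it ignores queueing delay and controller overhead, and the 20 MB/sec transfer
rate is just a round number inside the quoted range):

// Expected time for one random 8 KB read: average seek, plus half a rotation,
// plus the transfer itself.
public class DiskTime {
    public static void main(String[] args) {
        double seekMs       = 9.0;                      // average seek from the table
        double rotationalMs = 0.5 * 60000.0 / 7200;     // half a revolution = 4.17 ms
        double transferMs   = 8192 / (20e6 / 1000.0);   // 8 KB at 20 MB/sec = 0.41 ms

        double total = seekMs + rotationalMs + transferMs;
        System.out.printf("seek %.2f + rotation %.2f + transfer %.2f = %.2f ms%n",
                seekMs, rotationalMs, transferMs, total);
        // Roughly 13.6 ms -- dominated by seek and rotation, not by the transfer.
    }
}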
Disk Scheduling
[ Silberschatz, Galvin and Gagne Section 13.2 ]
When a process wants to do disk I/O, it makes a call to the operating system. Since the operation may
take some time, the process is put into a blocked state, and the I/O request is sent to a part of the OS
called a device driver. If the disk is idle, the operation can be started right away, but if the disk is busy
servicing another request, it must be added to a queue of requests and wait its turn. Thus the total delay
seen by the process has several components:
• The overhead of getting into and out of the OS, and the time the OS spends fiddling with
queues, etc.
• The queuing time spent waiting for the disk to become available.
• The latency spent waiting for the disk to get to the right track and sector.
• The transfer time spent actually reading or writing the data.
Although I mentioned a ``queue'' of requests, there is no reason why the requests have to be satisfied
first-come first-served. In fact, that is a very bad way to schedule disk requests. Since requests from
different processes may be scattered all over the disk, satisfying them in the order they arrive would
entail an awful lot of jumping around on the disk, resulting in excessive rotational latency and seek
time -- both for individual requests and for the system as a whole. Fortunately, better algorithms are not
hard to devise.
Shortest Seek Time First (SSTF)
When a disk operation finishes, choose the request that is closest to the current head
position (the one that minimizes rotational latency and seek time). This algorithm
minimizes latency and thus gives the best overall performance, but suffers from poor
fairness. Requests will get widely varying response depending on how lucky they are
in being close to the current location of the heads. In the worst case, requests can be
starved (be delayed arbitrarily long).
The Elevator Algorithm
The disk head progresses in a single direction (from the center of the disk to the edge,
or vice versa) serving the closest request in that direction. When it runs out of
requests in the direction it is currently moving, it switches to the opposite direction.
This algorithm usually gives more equitable service to all requests, but in the worst
case, it can still lead to starvation. While it is satisfying requests on one cylinder,
other requests for the same cylinder could arrive. If enough requests for the same
cylinder keep coming, the heads would stay at that cylinder forever, starving all other
requests. This problem is easily avoided by limiting how long the heads will stay at
any one cylinder. One simple scheme is only to serve the requests for the cylinder
that are already there when the heads get there. New requests for that cylinder that
arrive while existing requests are being served will have to wait for the next pass.
One-way Elevator Algorithm
The simple (two-way) elevator algorithm gives poorer service to requests near the
center and edges of the disk than to requests in between. Suppose it takes time T for a
pass (from the center to the edge or vice versa). A request at either end of a pass (near
the hub or the edge of the disk) may have to wait up to time 2T for the heads to travel
to the other end and back, and on average the delay will be T. A request near the
``middle'' (half way between the hub and the edge) will get twice as good service:
The worst-case delay is T and the average is T/2. If this bias is a problem, it can be
solved by making the elevator run in one direction only (say from hub to edge).
When it finishes the request closest to the edge, it seeks all the way back to the first
request (the one closest to the hub) and starts another pass from hub to edge. In
general, this approach will increase the total amount of seek time because of the long
seek from the edge back to the hub, but on a heavily loaded disk, that seek will be so
infrequent as not to make much difference.
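A minimal sketch of the two-way elevator, tracking nothing but cylinder numbers in a sorted set; this is
the variant that reverses as soon as there is nothing left ahead of the heads (sometimes called LOOK),
and a real driver would of course keep per-request state as well:

import java.util.TreeSet;

// Two-way elevator over cylinder numbers only.
class ElevatorScheduler {
    private final TreeSet<Integer> pending = new TreeSet<>();
    private int headPosition = 0;
    private boolean movingUp = true;

    void addRequest(int cylinder) { pending.add(cylinder); }

    // Pick the next cylinder to service, reversing direction when there is
    // nothing left in the current direction.  Returns -1 if no requests.
    int next() {
        if (pending.isEmpty()) return -1;
        Integer choice = movingUp ? pending.ceiling(headPosition)
                                  : pending.floor(headPosition);
        if (choice == null) {              // nothing further in this direction
            movingUp = !movingUp;
            choice = movingUp ? pending.ceiling(headPosition)
                              : pending.floor(headPosition);
        }
        pending.remove(choice);
        headPosition = choice;
        return choice;
    }
}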




CS 537
Lecture Notes, Part 10
File Systems

Contents
• The User Interface to Files
• Naming
• File Structure
• File Types
• Access Modes
• File Attributes
• Operations
• The User Interface to Directories
• Implementing File Systems
• Files
• Directories
• Symbolic Links
• Mounting
• Special Files

First we look at files from the point of view of a person or program using the file system, and then we
consider how this user interface is implemented.

The User Interface to Files


Just as the process abstraction beautifies the hardware by making a single CPU (or a small number of
CPUs) appear to be many CPUs, one per ``user,'' the file system beautifies the hardware disk, making it
appear to be a large number of disk-like objects called files. Like a disk, a file is capable of storing a
large amount of data cheaply, reliably, and persistently. The fact that there are lots of files is one form
of beautification: Each file is individually protected, so each user can have his own files, without the
expense of requiring each user to buy his own disk. Each user can have lots of files, which makes it
easier to organize persistent data. The filesystem also makes each individual file more beautiful than a
real disk. At the very least, it erases block boundaries, so a file can be any length (not just a multiple of
the block size) and programs can read and write arbitrary regions of the file without worrying about
whether they cross block boundaries. Some systems (not Unix) also provide assistance in organizing
the contents of a file.
Systems use the same sort of device (a disk drive) to support both virtual memory and files. The
question arises why these have to be distinct facilities, with vastly different user interfaces. The answer
is that they don't. In Multics, there was no difference whatsoever. Everything in Multics was a segment.
The address space of each running process consisted of a set of segments (each with its own segment
number), and the ``file system'' was simply a set of named segments. To access a segment from the file
system, a process would pass its name to a system call that assigned a segment number to it. From then
on, the process could read and write the segment simply by executing ordinary loads and stores. For
example, if the segment was an array of integers, the program could access the ith number with a
notation like a[i] rather than having to seek to the appropriate offset and then execute a read system
call. If the block of the file containing this value wasn't in memory, the array access would cause a page
fault, which was serviced as explained in the previous chapter.
This user-interface idea, sometimes called ``single-level store,'' is a great idea. So why is it not common
in current operating systems? In other words, why are virtual memory and files presented as very
different kinds of objects? There are several possible explanations one might propose:
The address space of a process is small compared to the size of a file system.
There is no reason why this has to be so. In Multics, a process could have up to 256K
segments, but each segment was limited to 64K words. Multics allowed for lots of
segments because every ``file'' in the file system was a segment. The upper bound of
64K words per segment was considered large by the standards of the time; The
hardware actually allowed segments of up to 256K words (over one megabyte). Most
new processors introduced in the last few years allow 64-bit virtual addresses. In a
few years, such processors will dominate. So there is no reason why the virtual
address space of a process cannot be large enough to include the entire file system.
The virtual memory of a process is transient--it goes away when the process
terminates--while files must be persistent.
Multics showed that this doesn't have to be true. A segment can be designated as
``permanent,'' meaning that it should be preserved after the process that created it
terminates. Permanent segments do raise a need for one ``file-system-like'' facility: the
ability to give names to segments so that new processes can find them.
Files are shared by multiple processes, while the virtual address space of a process
is associated with only that process.
Most modern operating systems (including most variants of Unix) provide some way
for processes to share portions of their address spaces anyhow, so this is a particularly
weak argument for a distinction between files and segments.
The real reason single-level store is not ubiquitous is probably a concern for efficiency. The usual file-
system interface encourages a particular style of access: Open a file, go through it sequentially, copying
big chunks of it to or from main memory, and then close it. While it is possible to access a file like an
array of bytes, jumping around and accessing the data in tiny pieces, it is awkward. Operating system
designers have found ways to implement files that make the common ``file like'' style of access very
efficient. While there appears to be no reason in principle why memory-mapped files cannot be made to
give similar performance when they are accessed in this way, in practice, the added functionality of
mapped files always seems to pay a price in performance. Besides, if it is easy to jump around in a file,
applications programmers will take advantage of it, overall performance will suffer, and the file system
will be blamed.
Naming
Every file system provides some way to give a name to each file. We will consider only
names for individual files here, and talk about directories later. The name of a file is (at
least sometimes) meant to be used by human beings, so it should be easy for humans to use.
Different operating systems put different restrictions on names:
Size.
Some systems put severe restrictions on the length of names. For example DOS
restricts names to 11 characters, while early versions of Unix (and some still in use
today) restrict names to 14 characters. The Macintosh operating system, Windows 95,
and most modern versions of Unix allow names to be essentially arbitrarily long. I say
``essentially'' since names are meant to be used by humans, so they don't really need to
be all that long. A name that is 100 characters long is just as difficult to use as one
that is forced to be under 11 characters long (but for different reasons). Most modern
versions of Unix, for example, restrict names to a limit of 255 characters.1
Case.
Are upper and lower case letters considered different? The Unix tradition is to
consider the names Foo and foo to be completely different and unrelated names. In
DOS and its descendants, however, they are considered the same. Some systems
translate names to one case (usually upper case) for storage. Others retain the original
case, but consider it simply a matter of decoration. For example, if you create a file
named ``Foo,'' you could open it as ``foo'' or ``FOO,'' but if you list the directory, you
would still see the file listed as ``Foo''.
Character Set.
Different systems put different restrictions on what characters can appear in file
names. The Unix directory structure supports names containing any character other
than NUL (the byte consisting of all zero bits), but many utility programs (such as the
shell) would have troubles with names that have spaces, control characters or certain
punctuation characters (particularly `/'). MacOS allows all of these (e.g., it is not
uncommon to see a file name with the Copyright symbol © in it). With the world-
wide spread of computer technology, it is becoming increasingly important to support
languages other than English, and in fact alphabets other than Latin. There is a move
to support character strings (and in particular file names) in the Unicode character set,
which devotes 16 bits to each character rather than 8 and can represent the alphabets
of all major modern languages from Arabic to Devanagari to Telugu to Khmer.
Format.
It is common to divide a file name into a base name and an extension that indicates
the type of the file. DOS requires that each name be composed of a base name of eight
or less characters and an extension of three or less characters. When the name is
displayed, it is represented as base.extension. Unix internally makes no such
distinction, but it is a common convention to include exactly one period in a file
name (e.g. foo.c for a C source file).

File Structure
Unix hides the ``chunkiness'' of tracks, sectors, etc. and presents each file as a ``smooth'' array of bytes
with no internal structure. Application programs can, if they wish, use the bytes in the file to represent
structures. For example, a wide-spread convention in Unix is to use the newline character (the character
with bit pattern 00001010) to break text files into lines. Some other systems provide a variety of other
types of files. The most common are files that consist of an array of fixed or variable size records and
files that form an index mapping keys to values. Indexed files are usually implemented as B-trees.
File Types
Most systems divide files into various ``types.'' The concept of ``type'' is a confusing one, partially
because the term ``type'' can mean different things in different contexts. Unix initially supported only
four types of files: directories, two kinds of special files (discussed later), and ``regular'' files. Just
about any type of file is considered a ``regular'' file by Unix. Within this category, however, it is useful
to distinguish text files from binary files; within binary files there are executable files (which contain
machine-language code) and data files; text files might be source files in a particular programming
language (e.g. C or Java) or they may be human-readable text in some mark-up language such as html
(hypertext markup language). Data files may be classified according to the program that created them
or is able to interpret them, e.g., a file may be a Microsoft Word document or Excel spreadsheet or the
output of TeX. The possibilities are endless.
In general (not just in Unix) there are three ways of indicating the type of a file:
1. The operating system may record the type of a file in meta-data stored separately from the file,
but associated with it. Unix only provides enough meta-data to distinguish a regular file from a
directory (or special file), but other systems support more types.
2. The type of a file may be indicated by part of its contents, such as a header made up of the first
few bytes of the file. In Unix, files that store executable programs start with a two-byte magic
number that identifies them as executable and selects one of a variety of executable formats. In
the original Unix executable format, called the a.out format, the magic number is the octal
number 0407, which happens to be the machine code for a branch instruction on the PDP-11
computer, one of the first computers to implement Unix. The operating system could run a file
by loading it into memory and jumping to the beginning of it. The 0407 code, interpreted as an
instruction, jumps to the word following the 16-byte header, which is the beginning of the
executable code in this format. The PDP-11 computer is extinct by now, but it lives on through
the 0407 code! (A sketch of such a magic-number check appears at the end of this section.)
3. The type of a file may be indicated by its name. Sometimes this is just a convention, and
sometimes it's enforced by the OS or by certain programs. For example, the Unix Java compiler
refuses to believe that a file contains Java source unless its name ends with .java.
Some systems enforce the types of files more vigorously than others. File types may be enforced
• Not at all,
• Only by convention,
• By certain programs (e.g. the Java compiler), or
• By the operating system itself.
Unix tends to be very lax in enforcing types.
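To make the ``header'' method above concrete, here is a hedged sketch of checking a two-byte magic
number at the front of a file. The big-endian byte order and the class name are assumptions made for
illustration, not a description of how any particular kernel does it:

import java.io.FileInputStream;
import java.io.IOException;

// Read the first two bytes of a file and compare them against a magic number,
// the way early Unix recognized a.out executables.
public class MagicCheck {
    static final int A_OUT_MAGIC = 0407;   // octal, as in the original a.out format

    static boolean looksLikeAOut(String fileName) throws IOException {
        try (FileInputStream in = new FileInputStream(fileName)) {
            int b0 = in.read();
            int b1 = in.read();
            if (b0 < 0 || b1 < 0) return false;      // file shorter than two bytes
            return ((b0 << 8) | b1) == A_OUT_MAGIC;  // byte order assumed here
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(looksLikeAOut(args[0]));
    }
}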
Access Modes
[ Silberschatz, Galvin, and Gagne, Section 11.2 ]
Systems support various access modes for operations on a file.
• Sequential. Read or write the next record or next n bytes of the file. Usually, sequential access
also allows a rewind operation.
• Random. Read or write the nth record or bytes i through j. Unix provides an equivalent facility
by adding a seek operation to the sequential operations listed above. This packaging of
operations allows random access but encourages sequential access.
• Indexed. Read or write the record with a given key. In some cases, the ``key'' need not be
unique--there can be more than one record with the same key. In this case, programs use a
combination of indexed and sequential operations: Get the first record with a given key, then get
other records with the same key by doing sequential reads.
Note that access modes are distinct from file structure--e.g., a record-structured file can be
accessed either sequentially or randomly--but the two concepts are not entirely unrelated. For example,
indexed access mode only makes sense for indexed files.
File Attributes
This is the area where there is the most variation among file systems. Attributes can also be grouped by
general category.
Name.
Ownership and Protection.
Owner, owner's ``group,'' creator, access-control list (information about who can do
what to this file, for example, perhaps the owner can read or modify it, other
members of his group can only read it, and others have no access).
Time stamps.
Time created, time last modified, time last accessed, time the attributes were last
changed, etc. Unix maintains the last three of these. Some systems record not only
when the file was last modified, but by whom.
Sizes.
Current size, size limit, ``high-water mark'', space consumed (which may be larger
than size because of internal fragmentation or smaller because of various
compression techniques).
Type Information.
As described above: File is ASCII, is executable, is a ``system'' file, is an Excel
spread sheet, etc.
Misc.
Some systems have attributes describing how the file should be displayed when a
directory is listed. For example MacOS records an icon to represent the file and the
screen coordinates where it was last displayed. DOS has a ``hidden'' attribute
meaning that the file is not normally shown. Unix achieves a similar effect by
convention: The ls program that is usually used to list files does not show files with
names that start with a period unless you explicitly request it to (with the -a option).
Unix records a fixed set of attributes in the meta-data associated with a file. If you want to record some
fact about the file that is not included among the supported attributes, you have to use one of the tricks
listed above for recording type information: encode it in the name of the file, put it into the body of the
file itself, or store it in a file with a related name (e.g. ``foo.attributes''). Other systems (notably MacOS
and Windows NT) allow new attributes to be invented on the fly. In MacOS, each file has a resource
fork, which is a list of (attribute-name, attribute-value) pairs. The attribute name can be any four-
character string, and the attribute value can be anything at all. Indeed, some kinds of files put the entire
``contents'' of the file in an attribute and leave the ``body'' of the file (called the data fork) empty.
Operations
[ Silberschatz, Galvin, and Gagne, Section 11.1.2 ]
POSIX, a standard API (application programming interface) based on Unix, provides the following
operations (among others) for manipulating files:

fd = open(name, operation)
fd = creat(name, mode)
status = close(fd)
byte_count = read(fd, buffer, byte_count)
byte_count = write(fd, buffer, byte_count)
offset = lseek(fd, offset, whence)
status = link(oldname, newname)
status = unlink(name)
status = stat(name, buffer)
status = fstat(fd, buffer)
status = utimes(name, times)
status = chown(name, owner, group) or fchown(fd, owner, group)
status = chmod(name, mode) or fchmod(fd, mode)
status = truncate(name, size) or ftruncate(fd, size)
Some types of arguments and results need explanation.
status
Many functions return a ``status'' which is either 0 for success or -1 for errors (there
is another mechanism to get more information about what went wrong). Other functions
also use -1 as a return value to indicate an error.
name
A character-string name for a file.
fd
A ``file descriptor'', which is a small non-negative integer used as a short, temporary
name for a file during the lifetime of a process.
buffer
The memory address of the start of a buffer for supplying or receiving data.
whence
One of three codes, signifying from start, from end, or from current location.
mode
A bit-mask specifying protection information.
operation
An integer code, one of read, write, read and write, and perhaps a few other
possibilities such as append only.
The open call finds a file and assigns a descriptor to it. It also indicates how the file will be used by this
process (read only, read/write, etc). The creat call is similar, but creates a new (empty) file. The
mode argument specifies protection attributes (such as ``writable by owner but read-only by others'')
for the new file. (Most modern versions of Unix have merged creat into open by adding an optional
mode argument and allowing the operation argument to specify that the file is automatically
created if it doesn't already exist.) The close call simply announces that fd is no longer in use and
can be reused for another open or creat.
The read and write operations transfer data between a file and memory. The starting location in
memory is indicated by the buffer parameter; the starting location in the file (called the seek pointer)
is wherever the last read or write left off. The result is the number of bytes transferred. For write
it is normally the same as the byte_count parameter unless there is an error. For read it may be
smaller if the seek pointer starts out near the end of the file. The lseek operation adjusts the seek
pointer (it is also automatically updated by read and write). The specified offset is added to zero,
the current seek pointer, or the current size of the file, depending on the value of whence.
The function link adds a new name (alias) to a file, while unlink removes a name. There is no
function to delete a file; the system automatically deletes it when there are no remaining names for it.
The stat function retrieves meta-data about the file and puts it into a buffer (in a fixed, documented
format), while the remaining functions can be used to update the meta-data: utimes updates time
stamps, chown updates ownership, chmod updates protection information, and truncate changes
the size (files can be made bigger by write, but only truncate can make them smaller). Most come
in two flavors: one that takes a file name and one that takes a descriptor for an open file.
To learn more details about any of these functions, type something like
man 2 lseek
to any Unix system. The `2' means to look in section 2 of the manual, where system calls are explained.
Other systems have similar operations, and perhaps a few more. For example, indexed or indexed
sequential files would require a version of seek to specify a key rather than an offset. It is also
common to have a separate append operation for writing to the end of a file.
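Java programs do not call the POSIX functions directly, but java.io.RandomAccessFile exposes the
same seek-pointer model, so a rough Java analogue of the open/lseek/read pattern looks like this (the
file name and contents are made up):

import java.io.IOException;
import java.io.RandomAccessFile;

// An open file with a seek pointer that write, seek, and read all operate on.
public class SeekPointerDemo {
    public static void main(String[] args) throws IOException {
        try (RandomAccessFile f = new RandomAccessFile("example.dat", "rw")) {
            f.write("hello, world".getBytes());    // like write(fd, buf, n)
            f.seek(7);                             // like lseek(fd, 7, SEEK_SET)
            byte[] buf = new byte[5];
            int n = f.read(buf);                   // like read(fd, buf, 5)
            System.out.println(new String(buf, 0, n));   // prints "world"
        }
    }
}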

The User Interface to Directories


[ Silberschatz, Galvin, and Gagne, Section 11.3 ]
We already talked about file names. One important feature that a file name should have is that it be
unambiguous: There should be at most one file with any given name. The symmetrical condition, that
there be at most one name for any given file, is not necessarily a good thing. Sometimes it is handy to
be able to give multiple names to a file. When we consider implementation, we will describe two
different ways to implement multiple names for a file, each with slightly different semantics. If there
are a lot of files in a system, it may be difficult to avoid giving two files the same name, particularly if
there are multiple users independently making up names. One technique to assure uniqueness is to prefix
each file name with the name (or user id) of the owner. In some early operating systems, that was the
only assistance the system gave in preventing conflicts.
A better idea is the hierarchical directory structure, first introduced by Multics, then popularized by
Unix, and now found in virtually every operating system. You probably already know about
hierarchical directories, but I would like to describe them from an unusual point of view, and then
explain how this point of view is equivalent to the more familiar version.
Each file is named by a sequence of names. Although all modern operating systems use this technique,
each uses a different character to separate the components of the sequence when displaying it as a
character string. Multics uses `>', Unix uses `/', DOS and its descendants use `\', and MacOS uses ':'.
Sequences make it easy to avoid naming conflicts. First, assign a sequence to each user and only let
him create files with names that start with that sequence. For example, I might be assigned the sequence
(``usr'', ``solomon''), written in Unix as /usr/solomon. So far, this is the same as just appending the
user name to each file name. But it allows me to further classify my own files to prevent conflicts.
When I start a new project, I can create a new sequence by appending the name of the project to the end
of the sequence assigned to me, and then use this prefix for all files in the project. For example, I might
choose /usr/solomon/cs537 for files associated with this course, and name them
/usr/solomon/cs537/foo, /usr/solomon/cs537/bar, etc. As an extra aid, the system
allows me to specify a ``default prefix'' and a short-hand for writing names that start with that prefix. In
Unix, I use the system call chdir to specify a prefix, and whenever I use a name that does not start
with `/', the system automatically adds that prefix.
It is customary to think of the directory system as a directed graph, with names on the edges. Each path
in the graph is associated with a sequence of names, the names on the edges that make up the path. For
that reason, the sequence of names is usually called a path name. One node is designated as the root
node, and the rule is enforced that there cannot be two edges with the same name coming out of one
node. With this rule, we can use path names to name nodes. Start at the root node and treat the path
name as a sequence of directions, telling us which edge to follow at each step. It may be impossible to
follow the directions (because they tell us to use an edge that does not exist), but if it is possible to follow
the directions, they will lead us unambiguously to one node. Thus path names can be used as
unambiguous names for nodes. In fact, as we will see, this is how the directory system is actually
implemented. However, I think it is useful to think of ``path names'' simply as long names to avoid
naming conflicts, since it clearly separates the interface from the implementation.
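Viewed this way, turning a path-name string into its sequence of names is plain string manipulation. A
minimal sketch in Java (the parse helper here is hypothetical, though it plays the same role as the
parse assumed by the pseudocode later in these notes):

// Split a Unix-style path name into the sequence of names it denotes; empty
// components from repeated or leading slashes are discarded.
public class PathNames {
    static String[] parse(String path) {
        return path.replaceAll("^/+", "").split("/+");
    }

    public static void main(String[] args) {
        for (String component : parse("/usr/solomon/cs537/foo"))
            System.out.println(component);
        // prints: usr, solomon, cs537, foo (one per line)
    }
}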
Implementing File Systems
Files
[ Silberschatz, Galvin, and Gagne, Section 11.6 ]
We will assume that all the blocks of the disk are given block numbers starting at zero and running
through consecutive integers up to some maximum. We will further assume that blocks with numbers
that are near each other are located physically near each other on the disk (e.g., same cylinder) so that
the arithmetic difference between the numbers of two blocks gives a good estimate of how long it takes to
get from one to the other. First let's consider how to represent an individual file. There are (at least!)
four possibilities:
Contiguous [Section 11.6.1]
The blocks of a file are the blocks numbered n, n+1, n+2, ..., m. We can represent any
file with a pair of numbers: the block number of the first block and the length of the
file (in blocks). (See Figure 11.15 on page 378). The advantages of this approach are
• It's simple

• The blocks of the file are all physically near each other on the disk and in order
so that a sequential scan through the file will be fast.
The problem with this organization is that you can only grow a file if the block
following the last block in the file happens to be free. Otherwise, you would have to
find a long enough run of free blocks to accommodate the new length of the file and
copy it. As a practical matter, operating systems that use this organization require the
maximum size of the file to be declared when it is created and pre-allocate space for
the whole file. Even then, storage allocation has all the problems we considered when
studying main-memory allocation including external fragmentation.
Linked List (Section 11.6.2).
A file is represented by the block number of its first block, and each block contains
the block number of the next block of the file. This representation avoids the
problems of the contiguous representation: We can grow a file by linking any disk
block onto the end of the list, and there is no external fragmentation. However, it
introduces a new problem: Random access is effectively impossible. To find the
100th block of a file, we have to read the first 99 blocks just to follow the list. We
also lose the advantage of very fast sequential access to the file since its blocks may
be scattered all over the disk. However, if we are careful when choosing blocks to
add to a file, we can retain pretty good sequential access performance.
Both the space overhead (the percentage of the space taken up by pointers) and the
time overhead (the percentage of the time seeking from one place to another) can be
decreased by using larger blocks. The hardware designer fixes the block size (which
is usually quite small) but the software can get around this problem by using
``virtual'' blocks, sometimes called clusters. The OS simply treats each group of (say)
four contiguous physical disk sectors as one cluster. Large clusters, particularly if
they can be variable in size, are sometimes called extents. Extents can be thought of as a
compromise between linked and contiguous allocation.
Disk Index
The idea here is to keep the linked-list representation, but take the link fields out of
the blocks and gather them together all in one place. This approach is used in the
``FAT'' file system of DOS, OS/2 and older versions of Windows. At some fixed
place on disk, allocate an array I with one element for each block on the disk, and
move the link field from block n to I[n] (see Figure 11.17 on page 382). The whole
array of links, called a file allocation table (FAT), is now small enough that it can be read
into main memory when the systems starts up. Accessing the 100th block of a file
still requires walking through 99 links of a linked list, but now the entire list is in
memory, so time to traverse it is negligible (recall that a single disk access takes as
long as 10's or even 100's of thousands of instructions). This representation has the
added advantage of getting the ``operating system'' stuff (the links) out of the pages
of ``user data''. The pages of user data are now full-size disk blocks, and lots of
algorithms work better with chunks that are a power of two bytes long. Also, it means
that the OS can prevent users (who are notorious for screwing things up) from getting
their grubby hands on the system data.
The main problem with this approach is that the index array I can get quite large
with modern disks. For example, consider a 2 GB disk with 2K blocks. There are a
million blocks, so a block number must be at least 20 bits. Rounded up to an even
number of bytes, that's 3 bytes--4 bytes if we round up to a word boundary--so the
array I is three or four megabytes. While that's not an excessive amount of memory
given today's RAM prices, if we can get along with less, there are better uses for the
memory.
File Index [Section 11.6.3]
Although a typical disk may contain tens of thousands of files, only a few of them are
open at any one time, and it is only necessary to keep index information about open
files in memory to get good performance. Unfortunately the whole-disk index
described in the previous paragraph mixes index information about all files for the
whole disk together, making it difficult to cache only information about open files.
The inode structure introduced by Unix groups together index information about each
file individually. The basic idea is to represent each file as a tree of blocks, with the
data blocks as leaves. Each internal block (called an indirect block in Unix jargon) is
an array of block numbers, listing its children in order. If a disk block is 2K bytes and
a block number is four bytes, 512 block numbers fit in a block, so a one-level tree (a
single root node pointing directly to the leaves) can accommodate files up to 512
blocks, or one megabyte in size. If the root node is cached in memory, the ``address''
(block number) of any block of the file can be found without any disk accesses. A
two-level tree, with 513 total indirect blocks, can handle files 512 times as large (up
to one-half gigabyte).
The only problem with this idea is that it wastes space for small files. Any file with
more than one block needs at least one indirect block to store its block numbers. A
4K file would require three 2K blocks, wasting up to one third of its space. Since
many files are quite small, this is a serious problem. The Unix solution is to use a
different kind of ``block'' for the root of the tree.
An index node (or inode for short) contains almost all the meta-data about a file listed
above: ownership, permissions, time stamps, etc. (but not the file name). Inodes are
small enough that several of them can be packed into one disk block. In addition to
the meta-data, an inode contains the block numbers of the first few blocks of the file.
What if the file is too big to fit all its block numbers into the inode? The earliest
version of Unix had a bit in the meta-data to indicate whether the file was ``small'' or
``big.'' For a big file, the inode contained the block numbers of indirect blocks rather
than data blocks. More recent versions of Unix contain pointers to indirect blocks in
addition to the pointers to the first few data blocks. The inode contains pointers to
(i.e., block numbers of) the first few blocks of the file, a pointer to an indirect block
containing pointers to the next several blocks of the file, a pointer to a doubly indirect
block, which is the root of a two-level tree whose leaves are the next blocks of the
file, and a pointer to a triply indirect block. A large file is thus a lop-sided tree. (See
Figure 11.19 on page 384).
A real-life example is given by the Solaris 2.5 version of Unix. Block numbers are
four bytes and the size of a block is a parameter stored in the file system itself,
typically 8K (8192 bytes), so 2048 pointers fit in one block. An inode has direct
pointers to the first 12 blocks of the file, as well as pointers to singly, doubly, and
triply indirect blocks. A file of up to 12+2048+2048*2048 = 4,196,364 blocks or
34,376,613,888 bytes (about 32 GB) can be represented without using triply indirect
blocks, and with the triply indirect block, the maximum file size is
(12+2048+2048*2048+2048*2048*2048)*8192 = 70,403,120,791,552 bytes (slightly
more than 2^46 bytes, or about 64 terabytes). Of course, for such huge files, the size of
the file cannot be represented as a 32-bit integer. Modern versions of Unix store the
file length as a 64-bit integer, called a ``long'' integer in Java. An inode is 128 bytes
long, allowing room for the 15 block pointers plus lots of meta-data. 64 inodes fit in
one disk block. Since the inode for a file is kept in memory while the file is open,
locating an arbitrary block of any file requires at most three I/O operations,
not counting the operation to read or write the data block itself. (A small sketch of
this block-lookup arithmetic appears below.)
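As a concrete illustration of that last claim, here is a minimal sketch of the block-lookup arithmetic,
assuming the Solaris-style parameters quoted above (12 direct pointers, 2048 pointers per indirect
block); the class and method names are made up:

// Given a block index within a file, report how many levels of indirect block
// must be read to find it: 0 = direct, 1 = singly, 2 = doubly, 3 = triply.
public class InodeLookup {
    static final long DIRECT = 12;
    static final long PER_BLOCK = 2048;

    static int indirectionLevel(long blockIndex) {
        if (blockIndex < DIRECT) return 0;
        blockIndex -= DIRECT;
        if (blockIndex < PER_BLOCK) return 1;
        blockIndex -= PER_BLOCK;
        if (blockIndex < PER_BLOCK * PER_BLOCK) return 2;
        blockIndex -= PER_BLOCK * PER_BLOCK;
        if (blockIndex < PER_BLOCK * PER_BLOCK * PER_BLOCK) return 3;
        throw new IllegalArgumentException("file too big");
    }

    public static void main(String[] args) {
        System.out.println(indirectionLevel(5));         // 0: direct
        System.out.println(indirectionLevel(100));       // 1: singly indirect
        System.out.println(indirectionLevel(5000));      // 2: doubly indirect
        System.out.println(indirectionLevel(5000000L));  // 3: triply indirect
    }
}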

Directories
[ Silberschatz, Galvin, and Gagne, Section 11.3 ]
A directory is simply a table mapping character-string human-readable names to information about
files. The early PC operating system CP/M shows how simple a directory can be. Each entry contains
the name of one file, its owner, size (in blocks) and the block numbers of 16 blocks of the file. To
represent files with more than 16 blocks, CP/M used multiple directory entries with the same name and
different values in a field called the extent number. CP/M had only one directory for the entire system.
DOS uses a similar directory entry format, but stores only the first block number of the file in the
directory entry. The entire file is represented as a linked list of blocks using the disk index scheme
described above. All but the earliest version of DOS provide hierarchical directories using a scheme
similar to the one used in Unix.
Unix has an even simpler directory format. A directory entry contains only two fields: a character-string
name (up to 14 characters) and a two-byte integer called an inumber, which is interpreted as an index
into an array of inodes in a fixed, known location on disk. All the remaining information about the file
(size, ownership, time stamps, permissions, and an index to the blocks of the file) are stored in the
inode rather than the directory entry. A directory is represented like any other file (there's a bit in the
inode to indicate that the file is a directory). Thus the inumber in a directory entry may designate a
``regular'' file or another directory, allowing arbitrary graphs of nodes. However, Unix carefully limits
the set of operating system calls to ensure that the set of directories is always a tree. The root of the tree
is the file with inumber 1 (some versions of Unix use other conventions for designating the root
directory). The entries in each directory point to its children in the tree. For convenience, each directory
also has two special entries: an entry with name ``..'', which points to the parent of the directory in the tree,
and an entry with name ``.'', which points to the directory itself. Inumber 0 is not used, so an entry is
marked ``unused'' by setting its inumber field to 0.
The algorithm to convert from a path name to an inumber might be written in Java as

int namei(int current, String[] path) throws Exception {
    // 'current' is the inumber of the directory to start from;
    // 'path' holds the components of the path name, in order.
    for (int i = 0; i < path.length; i++) {
        if (inode[current].type != DIRECTORY)
            throw new Exception("not a directory");
        current = nameToInumber(inode[current], path[i]);
        if (current == 0)
            throw new Exception("no such file or directory");
    }
    return current;
}
The procedure nameToInumber(Inode node, String name) (not shown) reads through the
directory file represented by the inode node, looks for an entry matching the given name and returns
the inumber contained in that entry. The procedure namei walks the directory tree, starting at a given
inode and following a path described by a sequence of strings. There is a procedure with this name in
the Unix kernel. Files are always specified in Unix system calls by a character-string path name. You
can learn the inumber of a file if you like, but you can't use the inumber when talking to the Unix
kernel. Each system call that has a path name as an argument uses namei to translate it to an inumber.
If the argument is an absolute path name (it starts with `/'), namei is called with current == 1.
Otherwise, current is the current working directory.
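Here is a sketch of what nameToInumber might look like, assuming the 16-byte directory entry format described above (a two-byte inumber followed by a name of up to 14 characters padded with null bytes) and assuming the raw bytes of the directory file have already been read into an array. The byte order chosen for the inumber is arbitrary, and the real kernel works a block at a time; this is only meant to show the search.

// Search a directory (given as its raw contents) for 'name' and return the
// inumber stored in the matching entry, or 0 if there is no such entry.
static int nameToInumber(byte[] dirContents, String name) {
    for (int off = 0; off + 16 <= dirContents.length; off += 16) {
        int inumber = (dirContents[off] & 0xff) | ((dirContents[off + 1] & 0xff) << 8);
        if (inumber == 0)
            continue;                        // unused entry
        int len = 0;                         // find the end of the null-padded name
        while (len < 14 && dirContents[off + 2 + len] != 0)
            len++;
        String entryName = new String(dirContents, off + 2, len);
        if (entryName.equals(name))
            return inumber;
    }
    return 0;                                // not found
}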
Since all the information about a file except its name is stored in the inode, there can be more than one
directory entry designating the same file. This allows multiple aliases (called links) for a file. Unix
provides a system call link(old-name, new-name) to create new names for existing files. The
call link("/a/b/c", "/d/e/f") works something like this:

    // Here namei is assumed to return 0, rather than throw an exception,
    // when the name cannot be found.
    if (namei(1, parse("/d/e/f")) != 0)
        throw new Exception("file already exists");
    int dir = namei(1, parse("/d/e"));
    if (dir == 0 || inode[dir].type != DIRECTORY)
        throw new Exception("not a directory");
    int target = namei(1, parse("/a/b/c"));
    if (target == 0)
        throw new Exception("no such file or directory");
    if (inode[target].type == DIRECTORY)
        throw new Exception("cannot link to a directory");
    addDirectoryEntry(inode[dir], target, "f");
The procedure parse (not shown here) is assumed to break up a path name into its components. If, for
example, /a/b/c resolves to inumber 123, the entry (123, "f") is added to the directory file
designated by "/d/e". The result is that both "/a/b/c" and "/d/e/f" resolve to the same file
(the one with inumber 123).
We have seen that a file can have more than one name. What happens if it has no names (does not
appear in any directory)? Since the only way to name a file in a system call is by a path name, such a
file would be useless. It would consume resources (the inode and probably some data and indirect
blocks) but there would be no way to read it, write to it, or even delete it. Unix protects against this
``garbage collection'' problem by using reference counts. Each inode contains a count of the number of
directory entries that point to it. ``User'' programs are not allowed to update directories directly. System
calls that add or remove directory entries (creat, link, mkdir, rmdir, etc) update these reference
counts appropriately. There is no system call to delete a file, only the system call unlink(name)
which removes the directory entry corresponding to name. If the reference count of an inode drops to
zero, the system automatically deletes the file and returns all of its blocks to the free list.
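Here is a sketch, in the same spirit as the link example above, of how unlink might maintain these reference counts. For simplicity, the path of the containing directory and the final component are passed separately, and the helpers removeDirectoryEntry, freeBlocks, and freeInode, along with the linkCount field, are invented for illustration; this is not the actual kernel code.

void unlink(String[] dirPath, String fname) throws Exception {
    int dir = namei(1, dirPath);                 // the directory containing the entry
    int target = nameToInumber(inode[dir], fname);
    if (target == 0)
        throw new Exception("no such file or directory");
    removeDirectoryEntry(inode[dir], fname);     // the name disappears immediately
    inode[target].linkCount--;                   // one fewer directory entry points here
    if (inode[target].linkCount == 0) {
        freeBlocks(inode[target]);               // return data and indirect blocks to the free list
        freeInode(target);                       // mark the inode itself as unused
    }
}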
We saw before that the reference counting algorithm for garbage collection has a fatal flaw: If there are
cycles, reference counting will fail to collect some garbage. Unix avoids this problem by making sure
cycles cannot happen. The system calls are designed so that the set of directories will always be a single
tree rooted at inode 1: mkdir creates a new directory, empty (except for the . and .. entries), as a leaf of the
tree, rmdir is only allowed to delete a directory that is empty (except for the . and .. entries), and
link is not allowed to link to a directory. Because links to directories are not allowed, the only place
the file system is not a tree is at the leaves (regular files) and that cannot introduce cycles.
Although this algorithm provides the ability to create aliases for files in a simple and secure manner, it
has several flaws:
• It's hard to figure out how to charge users for disk space. Ownership is associated with the file
not the directory entry (the owner's id is stored in the inode). A file cannot be deleted without
finding all the links to it and deleting them. If I create a file and you make a link to it, I will
continue to be charged for it even if I try to remove it through my original name for it. Worse
still, your link may be in a directory I don't have access to, so I may be unable to delete the file,
even though I'm being charged for its space. Indeed, you could make it much bigger after I have
no access to it.
• There is no way to make an alias for a directory.
• As we will see later, links cannot cross boundaries of physical disks.
• Since all aliases are equal, there's no one ``true name'' for a file. You can find out whether two
path names designate the same file by comparing inumbers. There is a system call to get the
meta-data about a file, and the inumber is included in that information. But there is no way of
going in the other direction: to get a path name for a file given its inumber, or to find a path
name of an open file. Even if you remember the path name used to get to the file, that is not a
reliable ``handle'' to the file (for example to link two files together by storing the name of one in
the other). One of the components of the path name could be removed, thus invalidating the
name even though the file still exists under a different name.
While it's not possible to find the name (or any name) of an arbitrary file, it is possible to figure out the
name of a directory. Directories do have unique names because the directories form a tree, and one of
the properties of a tree is that there is a unique path from the root to any node. The ``..'' and ``.''
entries in each directory make this possible. Here, for example, is code to find the name of the current
working directory.

class DirectoryEntry {
    int inumber;
    String name;
}

String cwd() throws Exception {
    FileInputStream thisDir = new FileInputStream(".");
    int thisInumber = nameToInumber(thisDir, ".");
    return getPath(".", thisInumber);
}

String getPath(String currentName, int currentInumber) throws Exception {
    String parentName = currentName + "/..";
    FileInputStream parent = new FileInputStream(parentName);
    int parentInumber = nameToInumber(parent, ".");
    String fname = inumberToName(parent, currentInumber);
    if (parentInumber == 1)
        return "/" + fname;
    else
        return getPath(parentName, parentInumber) + "/" + fname;
}
The procedure nameToInumber is similar to the procedure with the same name described above, but
takes an InputStream as an argument rather than an inode. Many versions of Unix allow a program
to open a directory for reading and read its contents just like any other file. In such systems, it would be
easy to write nameToInumber as a user-level procedure if you know the format of a directory. 2 The
procedure inumberToName is similar, but searches for an entry containing a particular inumber and
returns the name field of the entry.
Symbolic Links
To get around the limitations with the original Unix notion of links, more recent versions of Unix
introduced the notion of a symbolic link (to avoid confusion, the original kind of link, described in the
previous section, is sometimes called a hard link). A symbolic link is a new type of file, distinguished
by a code in the inode from directories, regular files, etc. When the namei procedure that translates
path names to inumbers encounters a symlink, it treats the contents of the file as a pathname and uses it
to continue the translation. If the contents of the file is a relative path name (it does not start with a
slash), it is interpreted relative to the directory containing the link itself, not the current working
directory of the process doing the lookup.

int namei(int current, String[] path) throws Exception {
    for (int i = 0; i < path.length; i++) {
        if (inode[current].type != DIRECTORY)
            throw new Exception("not a directory");
        int dir = current;    // the directory containing this path component
        current = nameToInumber(inode[dir], path[i]);
        if (current == 0)
            throw new Exception("no such file or directory");
        while (inode[current].type == SYMLINK) {
            String link = getContents(inode[current]);
            String[] linkPath = parse(link);
            if (link.charAt(0) == '/')
                current = namei(1, linkPath);
            else
                // a relative link is interpreted relative to the directory
                // containing the link, not the caller's working directory
                current = namei(dir, linkPath);
            if (current == 0)
                throw new Exception("no such file or directory");
        }
    }
    return current;
}
The main change from the previous version of this procedure is the addition of the while loop
(along with remembering, in dir, the directory that contained the link, so that relative links can be
resolved correctly). Any time the procedure encounters a node of type SYMLINK, it recursively calls
itself to translate the contents of the file, interpreted as a path name, into an inumber.
Although the implementation looks complicated, it does just what you would expect in normal
situations. For example, suppose there is an existing file named /a/b/c and an existing directory /d.
Then the command

ln -s /a/b /d/e
makes the path name /d/e a synonym for /a/b, and also makes /d/e/c a synonym for /a/b/c.
From the user's point of view, the picture looks like this:

In implementation terms, the picture looks like this


where the hexagon denotes a node of type symlink.
Here's a more elaborate example that illustrates symlinks with relative path names. Suppose I have an
existing directory /usr/solomon/cs537/s90 with various sub-directories and I am setting up
project 5 for this semester. I might do something like this:

cd /usr/solomon/cs537
mkdir f96
cd f96
ln -s ../s90/proj5 proj5.old
cat proj5.old/foo.c
cd /usr/solomon/cs537
cat f96/proj5.old/foo.c
cat s90/proj5/foo.c

Logically, the situation looks like this:

and physically, it looks like this:

All three of the cat commands refer to the same file.


The added flexibility of symlinks over hard links comes at the cost of some safety. Symlinks are
neither required nor guaranteed to point to valid files. You can remove a file out from under a symlink,
and in fact, you can create a symlink to a non-existent file. Symlinks can also have cycles. For example,
this works fine:

cd /usr/solomon
mkdir bar
ln -s /usr/solomon foo
ls /usr/solomon/foo/foo/foo/foo/bar

However, in some cases, symlinks can cause infinite loops or infinite recursion in the namei
procedure. The real version in Unix puts a limit on how many times it will iterate and returns an error
code of ``too many links'' if the limit is exceeded. Symlinks to directories can also cause the ``change
directory'' command cd to behave in strange ways. Most people expect that the two commands

cd foo
cd ..
to cancel each other out. But in the last example, the commands

cd /usr/solomon
cd foo
cd ..
would leave you in the directory /usr. Some shell programs treat cd specially and remember what
alias you used to get to the current directory. After cd /usr/solomon; cd foo; cd foo, the
current directory is /usr/solomon/foo/foo, which is an alias for /usr/solomon, but the
command cd .. is treated as if you had typed cd /usr/solomon/foo.
Mounting
[ Silberschatz, Galvin, and Gagne, Sections 11.5.2, 17.6, and 20.7.5 ]
What if your computer has more than one disk? In many operating systems (including DOS and its
descendants) a pathname starts with a device name, as in C:\usr\solomon (by convention, C is the
name of the default hard disk). If you leave the device prefix off a path name, the system supplies a
default current device similar to the current directory. Unix allows you to glue together the directory
trees of multiple disks to create a single unified tree. There is a system call

mount(device, mount_point)
where device names a particular disk drive and mount_point is the path name of an existing node
in the current directory tree (normally an empty directory). The result is similar to a hard link: The
mount point becomes an alias for the root directory of the indicated disk. Here's how it works: The
kernel maintains a table of existing mounts represented as (device1, inumber, device2)
triples. During namei, whenever the current (device, inumber) pair matches the first two fields in one
of the entries, the current device and inumber become device2 and 1, respectively. Here's the
expanded code:

int namei(int curi, int curdev, String[] path) throws Exception {
    for (int i = 0; i < path.length; i++) {
        if (disk[curdev].inode[curi].type != DIRECTORY)
            throw new Exception("not a directory");
        int dir = curi;    // the directory containing this path component
        curi = nameToInumber(disk[curdev].inode[dir], path[i]);
        if (curi == 0)
            throw new Exception("no such file or directory");
        while (disk[curdev].inode[curi].type == SYMLINK) {
            String link = getContents(disk[curdev].inode[curi]);
            String[] linkPath = parse(link);
            // For simplicity this sketch keeps the current device while
            // following a link; a real implementation would also track the
            // device of the result and would restart absolute links at the
            // root device.
            if (link.charAt(0) == '/')
                curi = namei(1, curdev, linkPath);
            else
                curi = namei(dir, curdev, linkPath);
            if (curi == 0)
                throw new Exception("no such file or directory");
        }
        int newdev = mountLookup(curdev, curi);
        if (newdev != -1) {
            curdev = newdev;
            curi = 1;
        }
    }
    return curi;
}
In this code, we assume that mountLookup searches the mount table for a matching entry, returning -1
if no matching entry is found. There is also a special case (not shown here) for ``..'' so that the ``..''
entry in the root directory of a mounted disk behaves like a pointer to the parent directory of the mount
point.
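Here is a sketch of the mount table this code assumes. The representation is invented for illustration; the point is only that mountLookup maps a (device, inumber) pair that happens to be a mount point to the device mounted there, and returns -1 otherwise.

class MountEntry {
    int device1;    // device containing the mount-point directory
    int inumber;    // inumber of the mount point on device1
    int device2;    // device whose root directory is mounted there
}

class MountTable {
    MountEntry[] entries = new MountEntry[0];   // filled in by the mount system call

    int mountLookup(int device, int inumber) {
        for (MountEntry e : entries)
            if (e.device1 == device && e.inumber == inumber)
                return e.device2;
        return -1;                              // not a mount point
    }
}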
The Network File System (NFS) from Sun Microsystems extends this idea to allow you to mount a disk
from a remote computer. The device argument to the mount system call names the remote computer
as well as the disk drive and both pieces of information are put into the mount table. Now there are
three pieces of information to define the ``current directory'': the inumber, the device, and the computer.
If the current computer is remote, all operations (read, write, creat, delete, mkdir, rmdir, etc.) are sent as
messages to the remote computer. Information about remote open files, including a seek pointer and the
identity of the remote machine, is kept locally. Each read or write operation is converted locally to one
or more requests to read or write blocks of the remote file. NFS caches blocks of remote files locally to
improve performance.
Special Files
I said that the Unix mount system call has the name of a disk device as an argument. How do you
name a device? The answer is that devices appear in the directory tree as special files. An inode whose
type is ``special'' (as opposed to ``directory,'' ``symlink,'' or ``regular'') represents some sort of I/O
device. It is customary to put special files in the directory /dev, but since it is the inode that is marked
``special,'' they can be anywhere. Instead of containing pointers to disk blocks, the inode of a special
file contains information (in a machine-dependent format) about the device. The operating system tries
to make the device look as much like a file as possible, so that ordinary programs can open, close, read,
or write the device just like a file.
Some devices look more like real files than others. A disk device looks exactly like a file. Reads return
whatever is on the disk and writes can scribble anywhere on the disk. For obvious security reasons, the
permissions for the raw disk devices are highly restrictive. A tape drive looks sort of like a disk, but a
read will return only the next physical block of data on the device, even if more is requested.
The special file /dev/tty represents the terminal. Writes to /dev/tty display characters on the
screen. Reads from /dev/tty return characters typed on the keyboard. The seek operation on a
device like /dev/tty updates the seek pointer, but the seek pointer has no effect on reads or writes.
Reads of /dev/tty are also different from reads of a file in that they may return fewer bytes than
requested: Normally, a read will return characters only up through the next end-of-line. If the number of
bytes requested is less than the length of the line, the next read will get the remaining bytes. A read call
will block the caller until at least one character can be returned. On machines with more than one
terminal, there are multiple terminal devices with names like /dev/tty0, /dev/tty1, etc.
Some devices, such as a mouse, are read-only. Write operations on such devices have no effect. Other
devices, such as printers, are write-only. Attempts to read from them give an end-of-file indication (a
return value of zero). There is a special file called /dev/null that does nothing at all: reads return end-
of-file and writes send their data to the garbage bin. (New EPA rules require that this data be recycled.
It is now used to generate federal regulations and other meaningless documents.) One particularly
interesting device is /dev/mem, which is an image of the memory space of the current process. In a
sense, this device is the exact opposite of memory-mapped files. Instead of making a file look like part
of virtual memory, it makes virtual memory look like a device.
This idea of making all sorts of things look like files can be very powerful. Some versions of Unix
make network connections look like files. Some versions have a directory with one special file for each
active process. You can read these files to get information about the states of processes. If you delete
one of these files, the corresponding process is killed. Another idea is to have a directory with one
special file for each print job waiting to be printed. Although this idea was pioneered by Unix, it is
starting to show up more and more in other operating systems.


1
Note that we are referring here to a single pathname component.
2
The Solaris version of Unix on our workstations has a special system call for reading directories, so
this code couldn't be written in Java without resorting to native methods.



CS 537
Lecture Notes, Part 11
More About File Systems
Contents
• Long File Names
• Space Management
• Block Size and Extents
• Free Space
• Reliability
• Bad-block Forwarding
• Back-up Dumps
• Consistency Checking
• Transactions
• Performance
This web page extends the previous page with more information about the implementation of file
systems.
Long File Names
The Unix implementation described previously allows arbitrarily long path names for files, but each
component is limited in length. In the original Unix implementation, each directory entry is 16 bytes
long: two bytes for the inumber and 14 bytes for a path name component. 1

class Dirent {
    public short inumber;
    public byte[] name = new byte[14];   // name, padded with null bytes
}
If the name is less than 14 characters long, trailing bytes are filled with nulls (bytes with all bits set to
zero--not to be confused with `0' characters). An inumber of zero is used to mark an entry as unused
(inumbers for files start at 1).
• To look up a name, search the whole directory, starting at the beginning.
• To ``remove'' an entry, set its inumber field to zero.
• To add an entry, search for an entry with a zero inumber field and re-use it. If there aren't any,
add an entry to the end (making the file 16 bytes bigger).
This representation has one advantage.
• It is very simple. In particular, space allocation is easy because all entries are the same length.
However, it has several disadvantages.
• Since an inumber is only 16 bits, there can be at most 65,535 files on any one disk.
• A file name can be at most 14 characters long.
• Directories grow, but they never shrink.
• Searching a very large directory can be slow.
The people at Berkeley, while they were rewriting the file system code to make it faster, also changed
the format of directories to get rid of the first two problems (they left the remaining problems unfixed).
This new organization has been adopted by many (but not all) versions of Unix introduced since then.
The new format of a directory entry looks like this:2

class DirentLong {
    int inumber;
    short reclen;
    short namelen;
    byte[] name;
}
The inumber field is now a 4-byte (32-bit) integer, so that a disk can have up to 4,294,967,296 files.
The reclen field indicates the entire length of the DirentLong entry, including the 8-byte header.
The actual length of the name array is thus reclen - 8 bytes. The namelen field indicates the
length of the name. The remaining space in the name array is unused. This extra padding at the end of
the entry serves three purposes.
• It allows the length of the entry to be padded up to a multiple of 4 bytes so that the integer fields
are properly aligned (some computer architectures require integers to be stored at addresses that
are multiples of 4).
• The last entry in a disk block can be padded to make it extend to the end of the block. With this
trick, Unix avoids entries that cross block boundaries, simplifying the code.
• It supports a cute trick for coalescing free space. To delete an entry, simply increase the size of
the previous entry by the size of the entry being deleted. The deleted entry looks like part of the
padding on the end of the previous entry. Since all searches of the directory are done
sequentially, starting at the beginning, the deleted entry will effectively ``disappear.'' There's
only one problem with this trick: It can't be used to delete the first entry in the directory.
Fortunately, the first entry is the `.' entry, which is never deleted.
To create a new entry, search the directory for an entry that has enough padding (according to its
reclen and namelen fields) to hold the new entry and split it into two entries by decreasing
its reclen field. If no entry with enough padding is found, extend the directory file by one
block, make the whole block into one entry, and try again.
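Here is a sketch of the deletion trick, treating a directory block as a raw byte array in which each entry begins with a 4-byte inumber, a 2-byte reclen, and a 2-byte namelen. The byte order and helper methods are invented for illustration; the essential point is that deleting an entry is nothing more than adding its reclen to the reclen of the entry in front of it.

// Read or write a two-byte integer at a given offset in a directory block.
static int readShort(byte[] block, int off) {
    return (block[off] & 0xff) | ((block[off + 1] & 0xff) << 8);
}

static void writeShort(byte[] block, int off, int value) {
    block[off] = (byte) value;
    block[off + 1] = (byte) (value >> 8);
}

// Delete the entry at offset 'victim' by folding its space into the entry at
// offset 'prev', the entry immediately before it in the same block.  A
// sequential search now skips the dead entry as if it were padding.
static void deleteEntry(byte[] block, int prev, int victim) {
    int prevReclen = readShort(block, prev + 4);      // reclen lives at offset 4
    int victimReclen = readShort(block, victim + 4);
    writeShort(block, prev + 4, prevReclen + victimReclen);
}

Creation is the mirror image: find an entry whose reclen exceeds what its own header and name actually need (rounded up to a multiple of 4) by at least the size of the new entry, shrink its reclen to that amount, and give the leftover bytes to the new entry.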
This approach has two very minor additional benefits over the old scheme. In the old scheme, every
entry is 16 bytes, even if the name is only one byte long. In the new scheme, a name uses only as
much space as it needs (although this doesn't save much, since the minimum size of an entry in the new
scheme is 9 bytes--12 if padding is used to align entries to integer boundaries). The new approach also
allows nulls to appear in file names, but other parts of the system make that impractical, and besides,
who cares?
Space Management
Block Size and Extents
All of the file organizations I've mentioned store the contents of a file in a set of disk blocks. How big
should a block be? The problem with small blocks is I/O overhead. There is a certain overhead to read
or write a block beyond the time to actually transfer the bytes. If we double the block size, a typical file
will have half as many blocks. Reading or writing the whole file will transfer the same amount of data,
but it will involve half as many disk I/O operations. The overhead for an I/O operation includes a
variable amount of latency (seek time and rotational delay) that depends on how close the blocks are to
each other, as well as a fixed overhead to start each operation and respond to the interrupt when it
completes.
Many years ago, researchers at the University of California at Berkeley studied the original Unix file
system. They found that when they tried reading or writing a single very large file sequentially, they
were getting only about 2% of the potential speed of the disk. In other words, it took about 50 times as
long to read the whole file as it would if they simply read that many sequential blocks directly from the
raw disk (with no file system software). They tried doubling the block size (from 512 bytes to 1K) and
the performance more than doubled! The reason the speed more than doubled was that it took less than
half as many I/O operations to read the file. Because the blocks were twice as large, twice as much of
the file's data was in blocks pointed to directly by the inode. Indirect blocks were twice as large as well,
so they could hold twice as many pointers. Thus four times as much data could be accessed through the
singly indirect block without resorting to the doubly indirect block.
If doubling the block size more than doubled performance, why stop there? Why didn't the Berkeley
folks make the blocks even bigger? The problem with big blocks is internal fragmentation. A file can
only grow in increments of whole blocks. If the sizes of files are random, we would expect on the
average that half of the last block of a file is wasted. If most files are many blocks long, the relative
amount of waste is small, but if the block size is large compared to the size of a typical file, half a block
per file is significant. In fact, if files are very small (compared to the block size), the problem is even
worse. If, for example, we choose a block size of 8k and the average file is only 1K bytes long, we
would be wasting about 7/8 of the disk.
Most files in a typical Unix system are very small. The Berkeley researchers made a list of the sizes of
all files on a typical disk and did some calculations of how much space would be wasted by various
block sizes. Simply rounding the size of each file up to a multiple of 512 bytes resulted in wasting 4.2%
of the space. Including overhead for inodes and indirect blocks, the original 512-byte file system had a
total space overhead of 6.9%. Changing to 1K blocks raised the overhead to 11.8%. With 2k blocks, the
overhead would be 22.4% and with 4k blocks it would be 45.6%. Would 4k blocks be worthwhile? The
answer depends on economics. In those days disks were very expensive, and wasting half the disk
seemed extreme. These days, disks are cheap, and for many applications people would be happy to pay
twice as much per byte of disk space to get a disk that was twice as fast.
But there's more to the story. The Berkeley researchers came up with the idea of breaking up the disk
into blocks and fragments. For example, they might use a block size of 2k and a fragment size of 512
bytes. Each file is stored in some number of whole blocks plus 0 to 3 fragments at the end. The
fragments at the end of one file can share a block with fragments of other files. The problem is that
when we want to append to a file, there may not be any space left in the block that holds its last
fragment. In that case, the Berkeley file system copies the fragments to a new (empty) block. A file that
grows a little at a time may require each of its fragments to be copied many times. They got around this
problem by modifying application programs to buffer their data internally and add it to a file a whole
block's worth at a time. In fact, most programs already used library routines to buffer their output (to
cut down on the number of system calls), so all they had to do was to modify those library routines to
use a larger buffer size. This approach has been adopted by many modern variants of Unix. The Solaris
system you are using for this course uses 8k blocks and 1K fragments.
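As a concrete sketch of the space computation this scheme implies, using the 8K-block, 1K-fragment parameters just mentioned (the method is invented for illustration):

// Space actually consumed by a file: whole blocks for all but the tail of
// the file, plus just enough fragments to cover the tail.
static long spaceUsed(long fileSize, int blockSize, int fragSize) {
    long wholeBlocks = fileSize / blockSize;
    long tail = fileSize % blockSize;
    long tailFrags = (tail + fragSize - 1) / fragSize;   // round the tail up to fragments
    return wholeBlocks * blockSize + tailFrags * fragSize;
}

For example, spaceUsed(10000, 8192, 1024) is 8192 + 2*1024 = 10240 bytes, rather than the 16384 bytes that two whole 8K blocks would occupy.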
As disks get cheaper and CPU's get faster, wasted space is less of a problem and the speed mismatch
between the CPU and the disk gets worse. Thus the trend is towards larger and larger disk blocks.
At first glance it would appear that the OS designer has no say in how big a block is. Any particular
disk drive has a sector size, usually 512 bytes, wired in. But it is possible to use larger ``blocks''. For
example, if we think it would be a good idea to use 2K blocks, we can group together each run of four
consecutive sectors and call it a block. In fact, it would even be possible to use variable-sized ``blocks,''
so long as each one is a multiple of the sector size. A variable-sized ``block'' is called an extent. When
extents are used, they are usually used in addition to multi-sector blocks. For example, a system may
use 2k blocks, each consisting of 4 consecutive sectors, and then group them into extents of 1 to 10
blocks. When a file is opened for writing, it grows by adding an extent at a time. When it is closed, the
unused blocks at the end of the last extent are returned to the system. The problem with extents is that
they introduce all the problems of external fragmentation that we saw in the context of main memory
allocation. Extents are generally only used in systems such as databases, where high-speed access to
very large files is important.
Free Space
[ Silberschatz, Galvin, and Gagne, Section 11.7 ]
We have seen how to keep track of the blocks in each file. How do we keep track of the free blocks--
blocks that are not in any file? There are two basic approaches.
• Use a bit vector. That is simply an array of bits with one bit for each block on the disk. A 1 bit
indicates that the corresponding block is allocated (in some file) and a 0 bit says that it is free.
To allocate a block, search the bit vector for a zero bit, and set it to one.
• Use a free list. The simplest approach is simply to link together the free blocks by storing the
block number of each free block in the previous free block. The problem with this approach is
that when a block on the free list is allocated, you have to read it into memory to get the block
number of the next block in the list. This problem can be solved by storing the block numbers of
additional free blocks in each block on the list. In other words, the free blocks are stored in a
sort of lopsided tree on disk. If, for example, 128 block numbers fit in a block, 1/128 of the free
blocks would be linked into a list. Each block on the list would contain a pointer to the next
block on the list, as well as pointers to 127 additional free blocks. When the first block of the
list is allocated to a file, it has to be read into memory to get the block numbers stored in it, but
then we can allocate 127 more blocks without reading any of them from disk (see the sketch after this list). Freeing blocks is
done by running this algorithm in reverse: Keep a cache of 127 block numbers in memory.
When a block is freed, add its block number to this cache. If the cache is full when a block is
freed, use the block being freed to hold all the block numbers in the cache and link it to the head
of the free list by adding to it the block number of the previous head of the list.
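Here is a sketch of the grouped free list just described. Disk and Block are the same classes assumed by the caching example later in these notes (with Disk.write assumed to be the counterpart of Disk.read), and nextPointer, extraBlockNumbers, and packBlockNumbers are invented helpers that convert between a raw block and the 128 block numbers stored in it: one link to the next list block plus 127 free block numbers.

class FreeList {
    int head;                   // block number of the first block of the on-disk list
    int[] cache = new int[127]; // free block numbers available without any disk I/O
    int count = 0;              // how many entries of cache are valid

    // Allocate one block and return its block number.
    int allocate() {
        if (count > 0)
            return cache[--count];
        // Cache empty: the head block of the list is itself allocated, but its
        // contents are read first.  (A real implementation would also handle
        // the case of a completely empty list, i.e., a full disk.)
        Block b = new Block();
        Disk.read(head, b);
        int allocated = head;
        head = nextPointer(b);             // the first entry links to the rest of the list
        cache = extraBlockNumbers(b);      // the other 127 entries refill the cache
        count = cache.length;
        return allocated;
    }

    // Return one block to the free list.
    void free(int blockNumber) {
        if (count < cache.length) {
            cache[count++] = blockNumber;
            return;
        }
        // Cache full: the freed block becomes the new head of the on-disk
        // list, holding the cached numbers and a link to the old head.
        Block b = packBlockNumbers(head, cache);
        Disk.write(blockNumber, b);
        head = blockNumber;
        count = 0;
    }
}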
How do these methods compare? Neither requires significant space overhead on disk. The bitmap
approach needs one bit for each block. Even for a tiny block size of 512 bytes, each bit of the bitmap
describes 512*8 = 4096 bits of free space, so the overhead is less than 1/40 of 1%. The free list is even
better. All the pointers are stored in blocks that are free anyhow, so there is no space overhead (except
for one pointer to the head of the list). Another way of looking at this is that when the disk is full
(which is the only time we should be worried about space overhead!) the free list is empty, so it takes
up no space. The real advantage of bitmaps over free lists is that they give the space allocator more
control over which block is allocated to which file. Since the blocks of a file are generally accessed
together, we would like them to be near each other on disk. To ensure this clustering, when we add a
block to a file we would like to choose a free block that is near the other blocks of a file. With a bitmap,
we can search the bitmap for an appropriate block. With a free list, we would have to search the free list
on disk, which is clearly impractical. Of course, to search the bitmap, we have to have it all in memory,
but since the bitmap is so tiny relative to the size of the disk, it is not unreasonable to keep the entire
bitmap in memory all the time. To do the comparable operation with a free list, we would need to keep
the block numbers of all free blocks in memory. If a block number is four bytes (32 bits), that means
that 32 times as much memory would be needed for the free list as for a bitmap. For a concrete
example, consider a 2 gigabyte disk with 8K blocks and 4-byte block numbers. The disk contains
2^31/2^13 = 2^18 = 262,144 blocks. If they are all free, the free list has 262,144 entries, so it would take one
megabyte of memory to keep them all in memory at once. By contrast, a bitmap requires 2^18 bits, or
2^15 = 32K bytes (just four blocks). (On the other hand, the bit map takes the same amount of memory
regardless of the number of blocks that are free).
Reliability
Disks fail, disks sectors get corrupted, and systems crash, losing the contents of volatile memory. There
are several techniques that can be used to mitigate the effects of these failures. We only have room for a
brief survey.
Bad-block Forwarding
When the disk drive writes a block of data, it also writes a checksum, a small number of additional bits
whose value is some function of the ``user data'' in the block. When the block is read back in, the
checksum is also read and compared with the data. If either the data or checksum were corrupted, it is
extremely unlikely that the checksum comparison will succeed. Thus the disk drive itself has a way of
discovering bad blocks with extremely high probability.
The hardware is also responsible for recovering from bad blocks. Modern disk drives do automatic
bad-block forwarding. The disk drive or controller is responsible for mapping block numbers to
absolute locations on the disk (cylinder, track, and sector). It holds a little bit of space in reserve, not
mapping any block numbers to this space. When a bad block is discovered, the disk allocates one of
these reserved blocks and maps the block number of the bad block to the replacement block. All
references to this block number access the replacement block instead of the bad block. There are two
problems with this scheme. First, when a block goes bad, the data in it is lost. In practice, blocks tend to
be bad from the beginning, because of small defects in the surface coating of the disk platters. There is
usually a stand-alone formatting program that tests all the blocks on the disk and sets up forwarding
entries for those that fail. Thus the bad blocks never get used in the first place. The main reason for the
forwarding is that it is just too hard (expensive) to create a disk with no defects. It is much more
economical to manufacture a ``pretty good'' disk and then use bad-block forwarding to work around the
few bad blocks. The other problem is that forwarding interferes with the OS's attempts to lay out files
optimally. The OS may think it is doing a good job by assigning consecutive blocks of a file to
consecutive block numbers, but if one of those blocks is forwarded, it may be very far away from the
others. In practice, this is not much of a problem since a disk typically has only a handful of forwarded
sectors out of millions.
The software can also help avoid bad blocks by simply leaving them out of the free list (or marking
them as allocated in the allocation bitmap).
Back-up Dumps
[ Silberschatz, Galvin, and Gagne, Section 11.10.2 ]
There are a variety of storage media that are much cheaper than (hard) disks but are also much slower.
An example is 8 millimeter video tape. A ``two-hour'' tape costs just a few dollars and can hold two
gigabytes of data. By contrast, a 2GB hard drive currently costs several hundred dollars. On the other
hand, while worst-case access time to a hard drive is a few tens of milliseconds, rewinding or fast-
forwarding a tape to the desired location can take several minutes. One way to use tapes is to make periodic
back up dumps. Dumps are really used for two different purposes:
• To recover lost files. Files can be lost or damaged by hardware failures, but far more often they
are lost through software bugs or human error (accidentally deleting the wrong file). If the file is
saved on tape, it can be restored.
• To recover from catastrophic failures. An entire disk drive can fail, or the whole computer can
be stolen, or the building can burn down. If the contents of the disk have been saved to tape, the
data can be restored (to a repaired or replacement disk). All that is lost is the work that was done
since the information was dumped.
Corresponding to these two ways of using dumps, there are two ways of doing dumps. A physical dump
simply copies all of the blocks of the disk, in order, to tape. It's very fast, both for doing the dump and
for recovering a whole disk, but it makes it extremely slow to recover any one file. The blocks of the
file are likely to be scattered all over the tape, and while seeks on disk can take tens of milliseconds,
seeks on tape can take tens or hundreds of seconds. The other approach is a logical dump, which copies
each file sequentially. A logical dump makes it easy to restore individual files. It is even easier to
restore files if the directories are dumped separately at the beginning of the tape, or if the name(s) of
each file are written to the tape along with the file.
The problem with logical dumping is that it is very slow. Dumps are usually done much more
frequently than restores. For example, you might dump your disk every night for three years before
something goes wrong and you need to do a restore. An important trick that can be used with logical
dumps is to only dump files that have changed recently. An incremental dump saves only those files
that have been modified since a particular date and time. Fortunately, most file systems record the time
each file was last modified. If you do a backup each night, you can save only those files that have
changed since the last backup. Every once in a while (say once a month), you can do a full backup of
all files. In Unix jargon, a full backup is called an epoch (pronounced ``eepock'') dump, because it
dumps everything that has changed since ``the epoch''--January 1, 1970, which is the earliest
possible date in Unix.3
The Computer Sciences department currently does backup dumps on about 260 GB of disk space.
Epoch dumps are done once every 14 days, with the timing on different file systems staggered so that
about 1/14 of the data is dumped each night. Daily incremental dumps save about 6-10% of the data on
each file system.
Incremental dumps go fast because they dump only a small fraction of the files, and they don't take up a
lot of tape. However, they introduce new problems:
• If you want to restore a particular file, you need to know when it was last modified so that you
know which dump tape to look at.
• If you want to restore the whole disk (to recover from a catastrophic failure), you have to restore
from the last epoch dump, and then from every incremental dump since then, in order. A file that
is modified every day will appear on every tape. Each restore will overwrite the file with a
newer version. When you're done, everything will be up-to-date as of the last dump, but the
whole process can be extremely slow (and labor-intensive).
• You have to keep around all the incremental tapes since the last epoch. Tapes are cheap, but
they're not free, and storing them can be a hassle.
The first problem can be solved by keeping a directory of what was dumped when. A bunch of UW
alumni (the same guys that invented NFS) have made themselves millionaires by marketing software to
do this. The other problems can be solved by a clever trick. Each dump is assigned a positive integer
level. A level n dump is an incremental dump that dumps all files that have changed since the most
recent previous dump with a level greater than or equal to n. An epoch dump is considered to have
infinitely high level. Levels are assigned to dumps as follows: dump number n gets level one greater than
the number of times n can be divided evenly by two, giving the sequence 1, 2, 1, 3, 1, 2, 1, 4, 1, 2, 1, 3, 1, 2, 1, 5, ...
This scheme is sometimes called a ruler schedule for obvious reasons. Level-1 dumps only save files
that have changed in the previous day. Level-2 dumps save files that have changed in the last two days,
level-3 dumps cover four days, level-4 dumps cover 8 days, etc. Higher-level dumps will thus include
more files (so they will take longer to do), but they are done infrequently. The nice thing about this
scheme is that you only need to save one tape from each level, and the number of levels is the logarithm
of the interval between epoch dumps. Thus even if did a dump each night and you only did an epoch
dump only once a year, you would need only nine levels (hence nine tapes). That also means that a full
restore needs at worst one restore from each of nine tapes (rather than 365 tapes!). To figure out what
tapes you need to restore from if your disk is destroyed after dump number n, express n in binary, and
number the bits from right to left, starting with 1. The 1 bits tell you which dump tapes to use. Restore
them in order of decreasing level. For example, 20 in binary is 10100, so if the disk is destroyed after
the 20th dump, you only need to restore from the epoch dump and from the most recent dumps at levels
5 and 3.
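Here is the ruler schedule in code. The level assignment is the one implied by the description above and by the binary restore rule; the method names are invented.

// Level of dump number n (n >= 1): one more than the number of times n can
// be halved evenly.  This gives the sequence 1, 2, 1, 3, 1, 2, 1, 4, ...
static int dumpLevel(int n) {
    int level = 1;
    while (n % 2 == 0) {
        level++;
        n /= 2;
    }
    return level;
}

// The dump levels needed to restore after dump number n, in the order they
// should be restored (decreasing level), after the most recent epoch dump.
static java.util.List<Integer> restoreLevels(int n) {
    java.util.List<Integer> levels = new java.util.ArrayList<Integer>();
    for (int bit = 31; bit >= 0; bit--)
        if ((n & (1 << bit)) != 0)
            levels.add(bit + 1);       // bit positions are numbered from 1 at the right
    return levels;
}

For the example above, dumpLevel(20) is 3 and restoreLevels(20) is [5, 3]: restore the epoch dump, then the most recent level-5 dump, then the most recent level-3 dump.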
Consistency Checking
[ Silberschatz, Galvin, and Gagne, Section 11.10.1 ]
Some of the information in a file system is redundant. For example, the free list could be reconstructed
by checking which blocks are not in any file. Redundancy arises because the same information is
represented in different forms to make different operations faster. If you want to know which blocks are
in a given file, look at the inode. If you you want to know which blocks are not in any inode, use the
free list. Unfortunately, various hardware and software errors can cause the data to become inconsistent.
File systems often include a utility that checks for consistency and optionally attempts to repair
inconsistencies. These programs are particularly handy for cleaning up the disks after a crash.
Unix has a utility called fsck (short for ``file system check''). It has two principal tasks. First, it checks that blocks are properly
allocated. Each inode is supposed to be the root of a tree of blocks, the free list is supposed to be a tree
of blocks, and each block is supposed to appear in exactly one of these trees. Fsck runs through
all the inodes, checking each allocated inode for reasonable values, and walking through the tree of
blocks rooted at the inode. It maintains a bit vector to record which blocks have been encountered. If
a block is encountered that has already been seen, there is a problem: Either it occurred twice in the same
file (in which case it isn't a tree), or it occurred in two different files. A reasonable recovery would be to
allocate a new block, copy the contents of the problem block into it, and substitute the copy for the
problem block in one of the two places where it occurs. It would also be a good idea to log an error
message so that a human being can check up later to see what's wrong. After all the files are scanned,
any block that hasn't been found should be on the free list. It would be possible to scan the free list in a
similar manner, but it's probably easier just to rebuild the free list from the set of blocks that were not
found in any file. If a bitmap instead of a free list is used, this step is even easier: Simply overwrite the
file system's bitmap with the bitmap constructed during the scan.
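Here is a sketch of that first pass. The inode array, its allocated flag, and the helper blocksOf (which walks an inode's direct and indirect blocks) are invented for illustration; the real program reads these structures from the raw disk device.

boolean[] seen = new boolean[numberOfBlocks];    // one entry per disk block

void checkBlocks() {
    for (int i = 1; i < inode.length; i++) {
        if (!inode[i].allocated)
            continue;
        for (int b : blocksOf(inode[i])) {       // data blocks and indirect blocks
            if (seen[b])
                System.err.println("block " + b + " appears in more than one place");
            seen[b] = true;
        }
    }
    // Any block still unseen should be free; a free bitmap can simply be
    // rebuilt as the complement of 'seen'.
}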
The other main consistency requirement concerns the directory structure. The set of directories is
supposed to be a tree, and each inode is supposed to have a link count that indicates how many times it
appears in directories. The tree structure could be checked by a recursive walk through the
directories, but it is more efficient to combine this check with the walk through the inodes that checks
for disk blocks, recording, for each directory inode encountered, the inumber of its parent. The set
of directories is a tree if and only if every directory other than the root has a unique parent.
This pass can also rebuild the link count for each inode by maintaining in memory an array with one
slot for each inumber. Each time the inumber is found in a directory, increment the corresponding
element of the array. The resulting counts should match the link counts in the inodes. If not, correct the
counts in the inodes.
This illustrates a very important principle that pops up throughout operating system implementation
(indeed, throughout any large software system): the doctrine of hints and absolutes. Whenever the same
fact is recorded in two different ways, one of them should be considered the absolute truth, and the
other should be considered a hint. Hints are handy because they allow some operations to be done much
more quickly than they could if only the absolute information were available. But if the hint and the
absolute do not agree, the hint can be rebuilt from the absolutes. In a well-engineered system, there
should be some way to verify a hint whenever it is used. Unix is a bit lax about this. The link count is a
hint (the absolute information is a count of the number of times the inumber appears in directories), but
Unix treats it like an absolute during normal operation. As a result, a small error can snowball into
completely trashing the file system.
For another example of hints, each allocated block could have a header containing the inumber of the
file containing it and its offset in the file. There are systems that do this (Unix isn't one of them). The
tree of blocks rooted at an inode then becomes a hint, providing an efficient way of finding a block, but
when the block is found, its header could be checked. Any inconsistency would then be caught
immediately, and the inode structures could be rebuilt from the information in the block headers.
By the way, if the link count calculated by the scan is zero (i.e., the inode, although marked as
allocated, does not appear in any directory), it would not be prudent to delete the file. A better recovery
is to add an entry to a special lost+found directory pointing to the orphan inode, in case it contains
something really valuable.
Transactions
The previous section talks about how to recover from situations that ``can't happen.'' How do these
problems arise in the first place? Wouldn't it be better to prevent these problems rather than recover
from them after the fact? Many of these problems arise, particularly after a crash, because some
operation was ``half-completed.'' For example, suppose the system was in the middle of executing an
unlink system call when the lights went out. An unlink operation involves several distinct steps:
• remove an entry from a directory,
• decrement a link count, and if the count goes to zero,
• move all the blocks of the file to the free list, and
• free the inode.
If the crash occurs between the first and second steps, the link count will be wrong. If it occurs during
the third step, a block may be linked both into the file and the free list, or neither, depending on the
details of how the code is written. And so on...
To deal with this kind of problem in a general way, transactions were invented. Transactions were first
developed in the context of database management systems, and are used heavily there, so there is a
tradition of thinking of them as ``database stuff'' and teaching about them only in database courses and
text books. But they really are an operating system concept. Here's a two-bit introduction.
We have already seen a mechanism for making complex operations appear atomic. It is called a critical
section. Critical sections have a property that is sometimes called synchronization atomicity. It is also
called serializability because if two processes try to execute their critical sections at about the same
time, the net effect will be as if they occurred in some serial order.4 If systems can crash (and they
can!), synchronization atomicity isn't enough. We need another property, called failure atomicity, which
means an ``all or nothing'' property: Either all of the modifications of nonvolatile storage complete or
none of them do.
There are basically two ways to implement failure atomicity. They both depend on the fact that
writing a single block to disk is an atomic operation. The first approach is called logging. An append-
only file called a log is maintained on disk. Each time a transaction does something to file-system data,
it creates a log record describing the operation and appends it to the log. The log record contains
enough information to undo the operation. For example, if the operation made a change to a disk block,
the log record might contain the block number, the length and offset of the modified part of the block,
and the original content of that region. The transaction also writes a begin record when it starts, and
a commit record when it is done. After a crash, a recovery process scans the log looking for transactions
that started (wrote a begin record) but never finished (wrote a commit record). If such a transaction is
found, its partially completed operations are undone (in reverse order) using the undo information in
the log records.
Sometimes, for efficiency, disk data is cached in memory. Modifications are made to the cached copy
and only written back out to disk from time to time. If the system crashes before the changes are written
to disk, the data structures on disk may be inconsistent. Logging can also be used to avoid this problem
by putting into each log record redo information as well as undo information. For example, the log
record for a modification of a disk block should contain both the old and new value. After a crash, if the
recovery process discovers a transaction that has completed, it uses the redo information to make sure
the effects of all of its operations are reflected on disk. Full recovery is always possible provided
• The log records are written to disk in order,
• The commit record is written to disk when the transaction completes, and
• The log record describing a modification is written to disk before any of the changes made by
that operation are written to disk.
This algorithm is called write-ahead logging.
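Here is a sketch of write-ahead logging for a single block modification. The Log class with its append and flush methods, the commitRecord helper, the data field of Block, and the readBlock routine are stand-ins invented for illustration; the essential point is the order of the writes.

class UpdateRecord {
    int transactionId;
    int blockNumber;
    int offset;
    byte[] oldBytes;   // undo information
    byte[] newBytes;   // redo information
}

void writeRegion(int tid, int blockNumber, int offset, byte[] newBytes) {
    Block b = readBlock(blockNumber);            // possibly from the disk cache
    UpdateRecord r = new UpdateRecord();
    r.transactionId = tid;
    r.blockNumber = blockNumber;
    r.offset = offset;
    r.oldBytes = java.util.Arrays.copyOfRange(b.data, offset, offset + newBytes.length);
    r.newBytes = newBytes;
    log.append(r);                               // the log record is written first...
    log.flush();                                 // ...and forced to disk...
    System.arraycopy(newBytes, 0, b.data, offset, newBytes.length);  // ...before the block changes
}

void commit(int tid) {
    log.append(commitRecord(tid));
    log.flush();    // once this returns the transaction is durable; the modified
                    // blocks themselves can be written back to disk lazily
}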
The other way of implementing transactions is called shadow blocks.5 Suppose the data structure on
disk is a tree. The basic idea is never to change any block (disk block) of the data structure in place.
Whenever you want to modify a block, make a copy of it (called a shadow of it) instead, and modify
the parent to point to the shadow. Of course, to make the parent point to the shadow you have to modify
it, so instead you make a shadow of the parent and modify it instead. In this way, you shadow not only
each block you really wanted to modify, but also all the blocks on the path from it to the root. You keep
the shadow of the root block in memory. At the end of the transaction, you make sure the shadow
blocks are all safely written to disk and then write the shadow of the root directly onto the root block. If
the system crashes before you overwrite the root block, there will be no permanent change to the tree
on disk. Overwriting the root block has the effect of linking all the modified (shadow blocks) into the
tree and removing all the old blocks. Crash recovery is simply a matter of garbage collection. If the
crash occurs before the root was overwritten, all the shadow blocks are garbage. If it occurs after, the
blocks they replaced are garbage. In either case, the tree itself is consistent, and it is easy to find the
garbage blocks (they are blocks that aren't in the tree).
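Here is a sketch of a shadow update for a tree whose nodes each occupy one disk block. The Node class and the helpers readNode, writeNode, allocateBlock, and ROOT_BLOCK are invented for illustration, and (unlike the description above, which keeps the shadow of the root in memory) this version writes the shadow root to a scratch block, which simplifies the code a little.

class Node {
    int[] children;    // block numbers of child nodes (empty for a leaf)
    byte[] data;       // contents of a leaf
}

// Replace the leaf reached from 'block' by following childIndex[depth...],
// shadowing every node along the path; returns the new node's block number.
int shadow(int block, int[] childIndex, int depth, byte[] newData) {
    Node copy = readNode(block);             // start from the current contents
    if (depth == childIndex.length) {
        copy.data = newData;                 // this is the leaf being changed
    } else {
        int child = copy.children[childIndex[depth]];
        copy.children[childIndex[depth]] = shadow(child, childIndex, depth + 1, newData);
    }
    int shadowBlock = allocateBlock();       // never overwrite the original block
    writeNode(shadowBlock, copy);
    return shadowBlock;
}

// Commit: the shadow blocks are already safely on disk, so a single write of
// the root block atomically switches from the old tree to the new one.
void commit(int shadowRoot) {
    writeNode(ROOT_BLOCK, readNode(shadowRoot));
}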
Database systems almost universally use logging, and shadowing is mentioned only in passing in
database texts. But the shadowing technique is used in a variant of the Unix file system called
(somewhat misleadingly) the Log-structured File System (LFS). The entire file system is made into a
tree by replacing the array of inodes with a tree of inodes. LFS has the added advantage (beyond
reliability) that all blocks are written sequentially, so write operations are very fast. It has the
disadvantage that files that are modified here and there by random access tend to have their blocks
scattered about, but that pattern of access is comparatively rare, and there are techniques to cope with it
when it occurs. The main source of complexity in LFS is figuring out when and how to do the
``garbage collection.''
Performance
[ Silberschatz, Galvin, and Gagne, Section 11.9 ]
The main trick to improve file system performance (like anything else in computer science) is caching.
The system keeps a disk cache (sometimes also called a buffer pool) of recently used disk blocks. In
contrast with the page frames of virtual memory, where there were all sorts of algorithms proposed for
managing the cache, management of the disk cache is pretty simple. On the whole, it is simply
managed LRU (least recently used). Why is it that for paging we went to great lengths trying to come
up with an algorithm that is ``almost as good as LRU'' while here we can simply use true LRU? The
problem with implementing LRU is that some information has to be updated on every single reference.
In the case of paging, references can be as frequent as every instruction, so we have to make do with
whatever information hardware is willing to give us. The best we can hope for is that the paging
hardware will set a bit in a page-table entry. In the case of file system disk blocks, however, each
reference is the result of a system call, and adding a few extra instructions to a system call for
cache maintenance is not unreasonable.
Adding page caching to the file system implementation is actually quite simple. Somewhere in the
implementation, there is probably a procedure that gets called when the system wants to access a disk
block. Let's suppose the procedure simply allocates some memory space to hold the block and reads it
into memory.

Block readBlock(int blockNumber) {
    Block result = new Block();
    Disk.read(blockNumber, result);
    return result;
}
To add caching, all we have to do is modify this code to search the disk cache first.

class CacheEntry {
    int blockNumber;
    Block buffer;
    CacheEntry next, previous;
}

class DiskCache {
    CacheEntry head, tail;

    CacheEntry find(int blockNumber) {
        // Search the list for an entry with a matching block number.
        // If not found, return null.
    }

    void moveToFront(CacheEntry entry) {
        // Move entry to the head of the list.
    }

    CacheEntry oldest() {
        return tail;
    }

    Block readBlock(int blockNumber) {
        CacheEntry entry = find(blockNumber);
        if (entry == null) {
            entry = oldest();
            Disk.read(blockNumber, entry.buffer);
            entry.blockNumber = blockNumber;
        }
        moveToFront(entry);
        return entry.buffer;
    }
}
This code is not quite right, because it ignores writes. If the oldest buffer is dirty (it has been modified
since it was read from disk), it first has to be written back to the disk before it can be used to hold the
new block. Most systems actually write dirty buffers back to the disk sooner than necessary to
minimize the damage caused by a crash. The original version of Unix had a background process that
would write all dirty buffers to disk every 30 seconds. Some information is more critical than others.
Some versions of Unix, for example, write back directory blocks (the data blocks of files of
type directory) each time they are modified. This technique--keeping the block in the cache but
writing its contents back to disk after any modification--is called write-through caching. (Some modern
versions of Unix use techniques inspired by database transactions to minimize the effects of crashes).
LRU management automatically does the ``right thing'' for most disk blocks. If someone is actively
manipulating the files in a directory, all of the directory's blocks will probably be in the cache. If a
process is scanning a large file, all of its indirect blocks will probably be in memory most of the time.
But there is one important case where LRU is not the right policy. Consider a process that is traversing
(reading or writing) a file sequentially from beginning to end. Once that process has read or written the
last byte of a block, it will not touch that block again. The system might as well immediately move the
block to the tail of the list as soon as the read or write request completes. Tanenbaum calls this
technique free behind. It is also sometimes called most recently used (MRU) to contrast it with LRU.
How does the system know to handle certain blocks MRU? There are several possibilities.
• If the operating system interface distinguishes between random-access files and sequential files,
it is easy. Data blocks of sequential files should be managed MRU.
• In some systems, all files are alike, but there is a different kind of open call, or a flag passed to
open, that indicates whether the file will be accessed randomly or sequentially.
• Even if the OS gets no explicit information from the application program, it can watch the
pattern of reads and writes. If recent history indicates that all (or most) reads or writes of the file
have been sequential, the data blocks should be managed MRU.
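As a minimal sketch of the free-behind idea (my addition, not from the original notes), the replacement order could be kept in a list where an ordinary reference moves a block to the front, while a sequential-release hint moves it to the back, making it the next victim.

import java.util.LinkedList;

// A toy replacement-order tracker: ordinary references are handled LRU, while
// blocks released with the sequential ("free behind") hint become the next
// victims, as if they had not been used in a long time.
class ReplacementOrder {
    private final LinkedList<Integer> order = new LinkedList<>(); // front = most recently used

    // Normal reference: move the block to the front (ordinary LRU behavior).
    void touch(int blockNumber) {
        order.remove(Integer.valueOf(blockNumber));
        order.addFirst(blockNumber);
    }

    // Free behind: the caller has consumed the last byte of this block and
    // will not touch it again, so move it to the back.
    void releaseSequential(int blockNumber) {
        order.remove(Integer.valueOf(blockNumber));
        order.addLast(blockNumber);
    }

    // The block that should be replaced next.
    int victim() {
        return order.getLast();   // assumes the cache is not empty
    }
}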
A similar trick is called read-ahead. If a file is being read sequentially, it is a good idea to read a few
blocks at a time. This cuts down on the latency for the application (most of the time the data the
application wants is in memory before it even asks for it). If the disk hardware allows multiple blocks
to be read at a time, it can cut the number of disk read requests, cutting down on overhead such as the
time to service an I/O completion interrupt. If the system has done a good job of clustering together the
blocks of the file, read-ahead also takes better advantage of the clustering. If the system reads one block
at a time, another process, accessing a different file, could make the disk head move away from the area
containing the blocks of this file between accesses.
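A sketch of read-ahead on top of the readBlock routine above might look like the following (my addition). The isSequential flag, supplied by the open call or by watching the access pattern, is an assumption for illustration, and a real system would issue the prefetch asynchronously rather than waiting for it.

class ReadAhead {
    private final DiskCache cache;

    ReadAhead(DiskCache cache) {
        this.cache = cache;
    }

    Block readWithReadAhead(int blockNumber, boolean isSequential) {
        Block current = cache.readBlock(blockNumber);
        if (isSequential) {
            // Pull the next block into the cache now, so it is already in
            // memory by the time the application asks for it. This assumes
            // good clustering: the next block of the file is the next block
            // on the disk.
            cache.readBlock(blockNumber + 1);
        }
        return current;
    }
}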
The Berkeley file system introduced another trick to improve file system performance. They divided
the disk into chunks, which they called cylinder groups (CGs) because each one is comprised of some
number of adjacent cylinders. Each CG is like a miniature disk. It has its own super block and array of
inodes. The system attempts to put all the blocks of a file in the same CG as its inode. It also tries to
keep all the inodes in one directory together in the same CG so that operations like

ls -l *
will be fast. It uses a variety of techniques to assign inodes and blocks to CGs in such a way as to
distribute the free space fairly evenly between them, so there will be enough room to do this clustering.
In particular,
• When a new file is created, its inode is placed in the same CG as its parent directory (if
possible). But when a new directory is created, its inode is placed in the CG with the largest amount
of free space (so that the files in the directory will be able to be near each other).
• When blocks are added to a file, they are allocated (if possible) from the same CG that contains
its inode. But when the size of the file crosses certain thresholds (say every megabyte or so), the
system switches to a different CG, one that is relatively empty. The idea is to prevent a big file
from hogging all the space in one CG and preventing other files in the CG from being well
clustered.
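The placement rules above might be sketched as follows (my addition). The CylinderGroup interface, its freeSpace method, and the one-megabyte threshold are illustrative assumptions, not the actual BSD code.

interface CylinderGroup {
    long freeSpace();
}

class CGPlacementPolicy {
    static final long SWITCH_THRESHOLD = 1 << 20;   // roughly one megabyte

    // A new directory's inode goes in the CG with the most free space.
    static CylinderGroup placeNewDirectory(CylinderGroup[] groups) {
        CylinderGroup best = groups[0];
        for (CylinderGroup cg : groups) {
            if (cg.freeSpace() > best.freeSpace())
                best = cg;
        }
        return best;
    }

    // A new file's inode goes in the same CG as its parent directory.
    static CylinderGroup placeNewFile(CylinderGroup parentDirectoryCG) {
        return parentDirectoryCG;
    }

    // Data blocks stay in the inode's CG until the file crosses the threshold,
    // then the file moves on to a relatively empty CG.
    static CylinderGroup placeDataBlock(CylinderGroup inodeCG, long fileSize,
                                        CylinderGroup[] groups) {
        if (fileSize > 0 && fileSize % SWITCH_THRESHOLD == 0)
            return placeNewDirectory(groups);    // reuse the "emptiest CG" search
        return inodeCG;
    }
}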
1
This Java declaration is actually a bit of a lie. In Java, an instance of class Dirent would include
some header information indicating that it was a Dirent object, a two-byte short integer, and a
pointer to an array object (which contains information about its type and length, in addition to the 14
bytes of data). The actual representation is given by the C (or C++) declaration

struct direct {
    unsigned short int inumber;
    char name[14];
};
Unfortunately, there's no way to represent this in Java.
2
This is also a lie, for the reasons cited in the previous footnote, as well as the fact that the field byte
name[] is intended to indicate an array of indeterminate length, rather than a pointer to an
array. The actual C declaration is

struct dirent {
    unsigned long int inumber;
    unsigned short int reclen;
    char name[256];
};
The array size 256 is a lie. The code depends on the fact that the C language does not do any array
bounds checking.
3
The dictionary defines epoch as
1 : an instant of time or a date selected as a point of reference in astronomy
2 a : an event or a time marked by an event that begins a new period or development
b : a memorable event or date
4
Critical sections are usually implemented so that they actually occur one after the other, but all that is
required is that they behave as if they were serialized. For example, if neither transaction modifies
anything, or if they don't touch any overlapping data, they can be run concurrently without any harm.
Database implementations of transactions go to a great deal of trouble to allow as much concurrency as
possible.
5
Actually, the technique is usually called ``shadow paging'' because in the context of databases, disk
blocks are often called ``pages.'' We reserve the term ``pages'' for virtual memory.
CS 537
Lecture Notes, Part 12
Protection and Security
Contents
• Security
• Threats
• The Trojan Horse
• Design Principles
• Authentication
• Protection Mechanisms
• Access Control Lists
• Capabilities
The terms protection and security are often used together, and the distinction between them is a bit
blurred, but security is generally used in a broad sense to refer to all concerns about controlled access to
facilities, while protection describes specific technological mechanisms that support security.

Security
As in any other area of software design, it is important to distinguish between policies and mechanisms.
Before you can start building machinery to enforce policies, you need to establish what policies you are
trying to enforce. Many years ago, I heard a story about a software firm that was hired by a small
savings and loan corporation to build a financial accounting system. The chief financial officer used the
system to embezzle millions of dollars and fled the country. The losses were so great the S&L went
bankrupt, and the loss of the contract was so bad the software company also went belly-up. Did the
accounting system have a good or bad security design? The problem wasn't unauthorized access to
information, but rather that the wrong person was authorized. The situation is analogous to the old saw that
every program is correct according to some specification. Unfortunately, we don't have the space
to go into the whole question of security policies here. We will just assume that terms like ``authorized
access'' have some well-defined meaning in a particular context.
Threats
Any discussion of security must begin with a discussion of threats. After all, if you don't know what
you're afraid of, how are you going to defend against it? Threats are generally divided into three main
categories.
• Unauthorized disclosure. A ``bad guy'' gets to see information he has no right to see (according
to some policy that defines ``bad guy'' and ``right to see'').
• Unauthorized updates. The bad guy makes changes he has no right to make.
• Denial of service. The bad guy interferes with legitimate access by other users.
There is a wide spectrum of denial-of-service threats. At one end, it overlaps with the previous
category. A bad guy deleting a good guy's file could be considered an unauthorized update. At the other
end of the spectrum, blowing up a computer with a hand grenade is not usually considered an
unauthorized update. As this second example illustrates, some denial-of-service threats can only be
countered by physical security. No matter how well your OS is designed, it can't protect my files from
his hand grenade. Another form of denial-of-service threat comes from unauthorized consumption of
resources, such as filling up the disk, tying up the CPU with an infinite loop, or crashing the system by
triggering some bug in the OS. While there are software defenses against these threats, they are
generally considered in the context of other parts of the OS rather than security and protection. In short,
discussions of software mechanisms for computer security generally focus on the first two threats.
In response to these threats, countermeasures also fall into various categories. As programmers, we
tend to think of technological tricks, but it is also important to realize that a complete security design
must involve physical components (such as locking the computer in a secure building with armed
guards outside) and human components (such as a background check to make sure your CFO isn't a
crook, or checking to make sure those armed guards aren't taking bribes).
The Trojan Horse
Break-in techniques come in numerous forms. One general category of attack that comes in a great
variety of disguises is the Trojan Horse scam. The name comes from Greek mythology. The ancient
Greeks were attacking the city of Troy, which was surrounded by an impenetrable wall. Unable to get
in, they left a huge wooden horse outside the gates as a ``gift'' and pretended to sail away. The Trojans
brought the horse into the city, where they discovered that the horse was filled with Greek soldiers who
defeated the Trojans to win the Rose Bowl (oops, wrong story). In software, a Trojan Horse is a
program that does something useful--or at least appears to do something useful--but also subverts
security somehow. In the personal computer world, Trojan horses are often computer games infected
with ``viruses.''
Here's the simplest Trojan Horse program I know of. Log onto a public terminal and start a program
that does something like this:

print("login:");
name = readALine();
turnOffEchoing();
print("password:");
passwd = readALine();
sendMail("badguy",name,passwd);
print("login incorrect");
exit();
A user walking up to the terminal will think it is idle. He will attempt to log in, typing his login name
and password. The Trojan Horse program sends this information to the bad guy, prints the message
login incorrect and exits. After the program exits, the system will generate a legitimate login:
message and the user, thinking he mistyped his password (a common occurrence because the password
is not echoed) will try again, log in successfully, and have no suspicion that anything was wrong. Note
that the Trojan Horse program doesn't actually have to do anything useful; it just has to appear to.
Design Principles
1. Public Design. A common mistake is to try to keep a system secure by keeping its algorithms
secret. That's a bad idea for many reasons. First, it gives a kind of all-or-nothing security. As
soon as anybody learns about the algorithm, security is all gone. In the words of Benjamin
Franklin, ``Two people can keep a secret if one of them is dead.'' Second, it is usually not that
hard to figure out the algorithm, by seeing how the system responds to various inputs,
decompiling the code, etc. Third, publishing the algorithm can have beneficial effects. The bad
guys probably have already figured out your algorithm and found its weak points. If you publish
it, perhaps some good guys will notice bugs or loopholes and tell you about them so you can fix
them.
2. Default = No Access. Start out by granting as little access as possible and adding privileges only
as needed. If you forget to grant access where it is legitimately needed, you'll soon find out
about it. Users seldom complain about having too much access.
3. Timely Checks. Checks tend to ``wear out.'' For example, the longer you use the same
password, the higher the likelihood it will be stolen or deciphered. Be careful: This principle can
be overdone. Systems that force users to change passwords frequently encourage them to use
particularly bad ones. A system that forced users to supply a password every time they wanted
to open a file would inspire all sorts of ingenious ways to avoid the protection mechanism
altogether.
4. Minimum Privilege. This is an extension of point 2. A person (or program or process) should
be given just enough powers to get the job done. In other contexts, this principle is called ``need
to know.'' It implies that the protection mechanism has to support fine-grained control.
5. Simple, Uniform Mechanisms. Any piece of software should be as simple as possible (but no
simpler!) to maximize the chances that it is correctly and efficiently implemented. This is
particularly important for protection software, since bugs are likely to be usable as security
loopholes. It is also important that the interface to the protection mechanisms be simple, easy to
understand, and easy to use. It is remarkably hard to design good, foolproof security policies;
policy designers need all the help they can get.
6. Appropriate Levels of Security. You don't store your best silverware in a box on the front
lawn, but you also don't keep it in a vault at the bank. The US Strategic Air Defense calls for a
different level of security than my records of the grades for this course. Not only do excessive
security mechanisms add unnecessary cost and performance degradation, they can actually lead to a
less secure system. If the protection mechanisms are too hard to use, users will go out of their
way to avoid using them.
Authentication
Authentication is a process by which one party convinces another of its identity. A familiar instance is
the login process, through which a human user convinces the computer system that he has the right to
use a particular account. If the login is successful, the system creates a process and associates with it
the internal identifier that identifies the account. Authentication occurs in other contexts, and it isn't
always a human being that is being authenticated. Sometimes a process needs to authenticate itself to
another process. In a networking environment, a computer may need to authenticate itself to another
computer. In general, let's call the party that wants to be authenticated the client and the other party the
server.
One common technique for authentication is the use of a password. This is the technique used most
often for login. There is a value, called the password, that is known to both the server and to legitimate
clients. The client tells the server who he claims to be and supplies the password as proof. The server
compares the supplied password with what he knows to be the true password for that user.
Although this is a common technique, it is not a very good one. There are lots of things wrong with it.
Direct attacks on the password.
The most obvious way of breaking in is a frontal assault on the password. Simply try all possible
passwords until one works. The main defense against this attack is the time it takes to try lots of
possibilities. If the client is a computer program (perhaps masquerading as a human being), it can try
lots of combinations very quickly, but if the password is long enough, even the fastest computer
cannot succeed in a reasonable amount of time. If the password is a string of 8 letters and digits,
there are 2,821,109,907,456 possibilities. A program that tried one combination every millisecond
would take 89 years to get through them all. If users are allowed to pick their own passwords, they are
likely to choose ``cute doggie names'', common words, names of family members, etc. That cuts down
the search space considerably. A password cracker can go through dictionaries, lists of common names,
etc. It can also use biographical information about the user to narrow the search space. There are
several defenses against this sort of attack.
• The system chooses the password. The problem with this is that the password will not be easy to
remember, so the user will be tempted to write it down or store it in a file, making it easy to
steal. This is not a problem if the client is not a human being.
• The system rejects passwords that are too ``easy to guess''. In effect, it runs a password cracker
when the user tries to set his password and rejects the password if the cracker succeeds. This has
many of the disadvantages of the previous point. Besides, it leads to a sort of arms race between
crackers and checkers.
• The password check is artificially slowed down, so that it takes longer to go through lots of
possibilities. One variant of this idea is to hang up a dial-in connection after three unsuccessful
login attempts, forcing the bad guy to take the time to redial.
Eavesdropping.
This is a far bigger problem for passwords than brute-force attacks. It comes in many disguises.
• Looking over someone's shoulder while he's typing his password. Most systems turn off
echoing, or echo each character as an asterisk to mitigate this problem.
• Reading the password file. In order to verify that the password is correct, the server has to have
it stored somewhere. If the bad guy can somehow get access to this file, he can pose as anybody.
While this isn't a threat on its own (after all, why should the bad guy have access to the
password file in the first place?), it can magnify the effects of an existing security lapse.
Unix introduced a clever fix to this problem, that has since been almost universally copied. Use
some hash function f and instead of storing password, store f(password). The hash
function should have two properties: Like any hash function it should generate all possible
result values with roughly equal probability, and in addition, it should be very hard to invert--
that is, given f(password), it should be hard to recover password. It is quite easy to
devise functions with these properties. When a client sends his password, the server applies f to
it and compares the result with the value stored in the password file. Since only f(password)
is stored in the password file, nobody can find out the password for a given user, even with full
access to the password file, and logging in requires knowing password, not f(password).
In fact, this technique is so secure, it has become customary to make the password file publicly
readable! (A minimal sketch of this scheme appears after this list.)
• Wire tapping. If the bad guy can somehow intercept the information sent from the client to the
server, password-based authentication breaks down altogether. It is increasingly the case that
authentication occurs over an insecure channel such as a dial-up line or a local-area network.
Note that the Unix scheme of storing f(password) is of no help here, since the password is
sent in its original form (``plaintext'' in the jargon of encryption) from the client to the server.
We will consider this problem in more detail below.
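Here is a minimal sketch of the hashed password file described in the list above (my addition). SHA-256 stands in for the hard-to-invert function f; the original Unix scheme used a different, DES-based function.

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.HashMap;
import java.util.Map;

class PasswordFile {
    private final Map<String, byte[]> hashedPasswords = new HashMap<>();

    // f(password): easy to compute, very hard to invert.
    private static byte[] f(String password) throws Exception {
        return MessageDigest.getInstance("SHA-256")
                .digest(password.getBytes(StandardCharsets.UTF_8));
    }

    // Only f(password) is stored; the plaintext password is thrown away.
    void setPassword(String user, String password) throws Exception {
        hashedPasswords.put(user, f(password));
    }

    // Login: hash the supplied password and compare with the stored value.
    boolean checkPassword(String user, String password) throws Exception {
        byte[] stored = hashedPasswords.get(user);
        return stored != null && MessageDigest.isEqual(stored, f(password));
    }
}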
Spoofing.
This is the worst threat of all. How does the client know that the server is who it appears to be? If the
bad guy can pose as the server, he can trick the client into divulging his password. We saw a form of
this attack above. It would seem that the server needs to authenticate itself to the client before the client
can authenticate itself to the server. Clearly, there's a chicken-and-egg problem here. Fortunately, there's
a very clever and general solution to this problem.
Challenge-response.
There is a wide variety of authentication protocols, but they are all based on a simple idea. As before,
we assume that there is a password known to both the (true) client and the (true) server. Authentication
is a four-step process.
• The client sends a message to the server saying who he claims to be and requesting
authentication.
• The server sends a challenge to the client consisting of some random value x.
• The client computes g(password,x) and sends it back as the response. Here g is a hash
function similar to the function f above, except that it has two arguments. It should have the
property that it is essentially impossible to figure out password even if you know both x and
g(password,x).
• The server also computes g(password,x) and compares it with the response it got from the
client.
Clearly this algorithm works if both the client and server are legitimate. An eavesdropper could learn
the user's name, x and g(password,x), but that wouldn't help him pose as the user. If he tried to
authenticate himself to the server he would get a different challenge x', and would have no way to
respond. Even a bogus server is no threat. The exchange provides him with no useful information.
Similarly, a bogus client does no harm to a legitimate server except for tying him up in a useless
exchange (a denial-of-service problem!).
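The four steps above can be sketched as follows (my addition, not part of the original notes). Here g(password,x) is modeled as SHA-256 of the password concatenated with the challenge; a production protocol would use a keyed construction such as HMAC, but the shape of the exchange is the same.

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.SecureRandom;

class ChallengeResponse {
    // g(password, x): hard to invert even knowing both x and the result.
    private static byte[] g(String password, long challenge) throws Exception {
        String combined = password + ":" + challenge;
        return MessageDigest.getInstance("SHA-256")
                .digest(combined.getBytes(StandardCharsets.UTF_8));
    }

    // Step 2: the server makes up a random challenge x.
    static long makeChallenge() {
        return new SecureRandom().nextLong();
    }

    // Step 3: the client computes the response from its password and x.
    static byte[] clientResponse(String password, long challenge) throws Exception {
        return g(password, challenge);
    }

    // Step 4: the server recomputes g(password, x) and compares it
    // with the response it received.
    static boolean serverCheck(String truePassword, long challenge, byte[] response)
            throws Exception {
        return MessageDigest.isEqual(g(truePassword, challenge), response);
    }
}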

Protection Mechanisms
First, some terminology:
objects
The things to which we wish to control access. They include physical (hardware)
objects as well as software objects such as files, databases, semaphores, or processes.
As in object-oriented programming, each object has a type and supports certain
operations as defined by its type. In simple protection systems, the set of operations
is quite limited: read, write, and perhaps execute, append, and a few others.
Fancier protection systems support a wider variety of types and operations, perhaps
allowing new types and operations to be dynamically defined.
principals
Intuitively, ``users''--the ones who do things to objects. Principals might be individual
persons, groups or projects, or roles, such as ``administrator.'' Often each process is
associated with a particular principal, the owner of the process.
rights
Permissions to invoke operations. Each right is the permission for a particular
principal to perform a particular operation on a particular object. For example,
principal solomon might have read rights for a particular file object.
domains
Sets of rights. Domains may overlap. Domains are a form of indirection, making it
easier to make wholesale changes to the access environment of a process. There may
be three levels of indirection: A principal owns a particular process, which is in a
particular domain, which contains a set of rights, such as the right to modify a
particular file.
Conceptually, the protection state of a system is defined by an access matrix. The rows correspond to
principals (or domains), the columns correspond to objects, and each cell is a set of rights. For example,
if

access[solomon]["/tmp/foo"] = { read, write }


Then I have read and write access to file "/tmp/foo". I say ``conceptually'' because the access matrix is
never actually stored anywhere. It is very large and has a great deal of redundancy (for example, my
rights to a vast number of objects are exactly the same: none!), so there are much more compact ways
to represent it. The access information is represented in one of two ways, by columns, which are called
access control lists (ACLs), and by rows, called capability lists.
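As an illustration (my addition, not from the original notes), the conceptual access matrix can be pictured as a sparse two-level map: indexing by principal first gives capability lists, while indexing by object first gives ACLs. The Right enum and the method names are assumptions made for this sketch.

import java.util.EnumSet;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

class AccessMatrix {
    enum Right { READ, WRITE, EXECUTE }

    // access.get(principal).get(object) is the set of rights in one cell.
    private final Map<String, Map<String, Set<Right>>> access = new HashMap<>();

    void grant(String principal, String object, Right right) {
        access.computeIfAbsent(principal, p -> new HashMap<>())
              .computeIfAbsent(object, o -> EnumSet.noneOf(Right.class))
              .add(right);
    }

    boolean allows(String principal, String object, Right right) {
        return access.getOrDefault(principal, Map.of())
                     .getOrDefault(object, EnumSet.noneOf(Right.class))
                     .contains(right);
    }
}

For the example in the text, grant("solomon", "/tmp/foo", Right.READ) followed by grant("solomon", "/tmp/foo", Right.WRITE) fills in the one non-empty cell; every cell not present in the map is implicitly empty, which captures the redundancy mentioned above.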
Access Control Lists
An ACL (pronounced ``ackle'') is a list of rights associated with an object. A good example of the use of
ACLs is the Andrew File System (AFS) originally created at Carnegie-Mellon University and now
marketed by Transarc Corporation as an add-on to Unix. This file system is widely used in the
Computer Sciences Department. Your home directory is in AFS. AFS associates an ACL with each
directory, but the ACL also defines the rights for all the files in the directory (in effect, they all share the
same ACL). You can list the ACL of a directory with the fs listacl command:

% fs listacl /u/c/s/cs537-1/public
Access list for /u/c/s/cs537-1/public is
Normal rights:
system:administrators rlidwka
system:anyuser rl
solomon rlidwka
The entry system:anyuser rl means that the principal system:anyuser (which represents the
role ``anybody at all'') has rights r (read files in the directory) and l (list the files in the directory and
read their attributes). The entry solomon rlidwka means that I have all seven rights supported by
AFS. In addition to r and l, they include the rights to insert new files in the directory (i.e., create
files), delete files, write files, lock files, and administer the ACL list itself. This last right is very
powerful: It allows me to add, delete, or modify ACL entries. I thus have the power to grant or deny
any rights to this directory to anybody. The remaining entry in the list shows that the principal
system:administrators has the same rights I do (namely, all rights). This principal is the name
of a group of other principals. The command pts membership system:administrators
lists the members of the group.
Ordinary Unix also uses an ACL scheme to control access to files, but in a much stripped-down form.
Each process is associated with a user identifier (uid) and a group identifier (gid), each of which is a
16-bit unsigned integer. The inode of each file also contains a uid and a gid, as well as a nine-bit
protection mask, called the mode of the file. The mask is composed of three groups of three bits. The
first group indicates the rights of the owner: one bit each for read access, write access, and
execute access (the right to run the file as a program). The second group similarly lists the rights of
the file's group, and the remaining three bits indicate the rights of everybody else. For example,
the mode 111 101 101 (0755 in octal) means that the owner can read, write, and execute the file,
while members of the owning group and others can read and execute, but not write the file. Programs
that print the mode usually use the characters rwx- rather than 0 and 1. Each zero in the binary value
is represented by a dash, and each 1 is represented by r, w, or x, depending on its position. For
example, the mode 111101101 is printed as rwxr-xr-x.
In somewhat more detail, the access-checking algorithm is as follows: The first three bits are checked
to determine whether an operation is allowed if the uid of the file matches the uid of the process trying
to access it. Otherwise, if the gid of the file matches the gid of the process, the second three bits are
checked. If neither of the id's match, the last three bits are used. The code might look something like
this.

boolean accessOK(Process p, Inode i, int operation) {
    int mode;
    if (p.uid == i.uid)
        mode = i.mode >> 6;
    else if (p.gid == i.gid)
        mode = i.mode >> 3;
    else
        mode = i.mode;
    switch (operation) {
        case READ: mode &= 4; break;
        case WRITE: mode &= 2; break;
        case EXECUTE: mode &= 1; break;
    }
    return (mode != 0);
}
(The expression i.mode >> 3 denotes the value i.mode shifted right by three bits positions and
the operation mode &= 4 clears all but the third bit from the right of mode.) Note that this scheme
can actually give a random user more powers over the file than its owner. For example, the mode ---
r--rw- (000 100 110 in binary) means that the owner cannot access the file at all, while members
of the group can only read the file, and others can both read and write. On the other hand, the owner of
the file (and only the owner) can execute the chmod system call, which changes the mode bits to any
desired value. When a new file is created, it gets the uid and gid of the process that created it, and a
mode supplied as an argument to the creat system call.
Most modern versions of Unix actually implement a slightly more flexible scheme for groups. A
process has a set of gid's, and the check to see whether the file is in the process' group checks to see
whether any of the process' gid's match the file's gid.

boolean accessOK(Process p, Inode i, int operation) {
    int mode;
    if (p.uid == i.uid)
        mode = i.mode >> 6;
    else if (p.gidSet.contains(i.gid))
        mode = i.mode >> 3;
    else
        mode = i.mode;
    switch (operation) {
        case READ: mode &= 4; break;
        case WRITE: mode &= 2; break;
        case EXECUTE: mode &= 1; break;
    }
    return (mode != 0);
}
When a new file is created, it gets the uid of the process that created it and the gid of the containing
directory. There are system calls to change the uid or gid of a file. For obvious security reasons, these
operations are highly restricted. Some versions of Unix only allow the owner of the file to change its
gid, only allow him to change it to one of his gid's, and don't allow him to change the uid at all.
For directories, ``execute'' permission is interpreted as the right to get the attributes of files in the
directory. Write permission is required to create or delete files in the directory. This rule leads to the
surprising result that you might not have permission to modify a file, yet be able to delete it and replace
it with another file of the same name but with different contents!
Unix has another very clever feature--so clever that it is patented! The file mode actually has a few
more bits that I have not mentioned. One of them is the so-called setuid bit. If a process executes a
program stored in a file with the setuid bit set, the uid of the process is set equal to the uid of the file.
This rather curious rule turns out to be a very powerful feature, allowing the simple rwx permissions
directly supported by Unix to be used to define arbitrarily complicated protection policies.
As an example, suppose you wanted to implement a mail system that works by putting all mail
messages into one big file, say /usr/spool/mbox. I should be able to read only those messages that
mention me in the To: or Cc: fields of the header. Here's how to use the setuid feature to implement this
policy. Define a new uid mail, make it the owner of /usr/spool/mbox, and set the mode of the
file to rw------- (i.e., the owner mail can read and write the file, but nobody else has any access to
it). Write a program for reading mail, say /usr/bin/readmail. This file is also owned by mail
and has mode srwxr-xr-x. The `s' means that the setuid bit is set. My process can execute this
program (because the ``execute by anybody'' bit is on), and when it does, it suddenly changes its uid to
mail so that it has complete access to /usr/spool/mbox. At first glance, it would seem that letting
my process pretend to be owned by another user would be a big security hole, but it isn't, because
processes don't have free will. They can only do what the program tells them to do. While my process is
running readmail, it is following instructions written by the designer of the mail system, so it is safe
to let it have access appropriate to the mail system. There's one more feature that helps readmail do
its job. A process really has two uid's, called the effective uid and the real uid. When a process executes
a setuid program, its effective uid changes to the uid of the program, but its real uid remains unchanged.
It is the effective uid that is used to determine what rights it has to what files, but there is a system call
to find out the real uid of the current process. Readmail can use this system call to find out what user
called it, and then only show the appropriate messages.
Capabilities
An alternative to ACLs is capabilities. A capability is a ``protected pointer'' to an object. It designates
an object and also contains a set of permitted operations on the object. For example, one capability may
permit reading from a particular file, while another allows both reading and writing. To perform an
operation on an object, a process makes a system call, presenting a capability that points to the object
and permits the desired operation. For capabilities to work as a protection mechanism, the system has to
ensure that processes cannot mess with their contents. There are three distinct ways to ensure the
integrity of a capability.
Tagged architecture.
Some computers associate a tag bit with each word of memory, marking the word as
a capability word or a data word. The hardware checks that capability words are only
assigned from other capability words. To create or modify a capability, a process has
to make a kernel call.
Separate capability segments.
If the hardware does not support tagging individual words, the OS can protect
capabilities by putting them in a separate segment and using the protection features
that control access to segments.
Encryption.
Each capability can be extended with a cryptographic checksum that is computed
from the rest of the content of the capability and a secret key. If a process modifies a
capability it cannot modify the checksum to match without access to the key. Only
the kernel knows the key. Each time a process presents a capability to the kernel to
invoke an operation, the kernel checks the checksum to make sure the capability
hasn't been tampered with.
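A sketch of the third approach (my addition; HMAC-SHA256 serves as the keyed checksum, and the object/rights encoding is an illustrative assumption):

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

class SealedCapability {
    final String objectName;
    final String rights;        // e.g. "rw"
    final byte[] checksum;      // keyed checksum over objectName and rights

    private SealedCapability(String objectName, String rights, byte[] checksum) {
        this.objectName = objectName;
        this.rights = rights;
        this.checksum = checksum;
    }

    private static byte[] seal(String objectName, String rights, byte[] kernelKey)
            throws Exception {
        Mac mac = Mac.getInstance("HmacSHA256");
        mac.init(new SecretKeySpec(kernelKey, "HmacSHA256"));
        return mac.doFinal((objectName + "|" + rights).getBytes(StandardCharsets.UTF_8));
    }

    // Only the kernel, which knows kernelKey, can mint a valid capability.
    static SealedCapability mint(String objectName, String rights, byte[] kernelKey)
            throws Exception {
        return new SealedCapability(objectName, rights, seal(objectName, rights, kernelKey));
    }

    // The kernel re-derives the checksum whenever the capability is presented;
    // a tampered capability will not match.
    boolean verify(byte[] kernelKey) throws Exception {
        return MessageDigest.isEqual(checksum, seal(objectName, rights, kernelKey));
    }
}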
Capabilities, like segments, are a ``good idea'' that somehow seldom seems to be implemented in real
systems in full generality. Like segments, capabilities show up in an abbreviated form in many systems.
For example, the file descriptor for an open file in Unix is a kind of capability. When a process tries to
open a file for writing, the system checks the file's ACL to see whether the access is permitted. If it is,
the process gets a file descriptor for the open file, which is a sort of capability to the file that permits
write operations. Unix uses the separate segment approach to protect the capability. The capability
itself is stored in a table in the kernel and the process has only an indirect reference to it (the index of
the slot in the table). File descriptors are not full-fledged capabilities, however. For example, they
cannot be stored in files, because they go away when the process terminates.



CS 537
Lecture Notes, Part 13
Cryptographic Protocols
Contents
• Encryption
• Key Distribution
• Public Key Encryption

Encryption
In distributed systems, data is usually sent over insecure channels. A prudent user should assume that it
is easy for a "bad guy" to see all the data that goes over the wire. In fact, the bad guy may be assumed
to have the power to modify the data as it goes by, delete messages, inject new messages into the
stream, or any combination of these operations, such as stealing a message and playing it back at a later
time. In such environments, security is based on cryptographic techniques.
Messages are scrambled, or encrypted before they are sent, and decrypted on receipt.

Here M is the original plaintext message, E is the encrypted message, f1 and f2 are the encryption and
decryption functions, and K is the key. In mathematical notation,
E = f1(M,K)
f2(E,K) = f2(f1(M,K), K) = M
According to the principle of public design, the encryption and decryption functions are well-known
publicly available algorithms. It is the key K, known only to the sender and receiver, that provides
security.
The most important feature of the encryption algorithm f1 is that it be infeasible to invert the function.
That is, it should be impossible, or at least very hard, to recover M from E without knowing K. In fact,
it is quite easy to come up with such an algorithm: exclusive or. If the length of K (in bits) is the same
as the length of M, let each bit of E be zero if corresponding bits of M and K are the same, and one if
they are different. Another way of looking at this function is that it flips bits of M that correspond to
one bits in K and passes through unchanged bits of M in the same position as zero bits of K. In this
case, f1 and f2 are the same function. Where there is a zero bit in K the corresponding bit of M passes
through both boxes unchanged; where there is a one bit, the input bit gets flipped by the first box and
flipped back to its original value by the second box. This algorithm is perfect, from the point of view of
invertibility. If the bits of K are all chosen at random, knowing E tells you absolutely nothing about M.
However, it has one fatal flaw: The key has to be the same length as the message, and you can only use
it once (in the jargon of encryption, this is a one-time pad cipher). Encryption algorithms have been
devised with fixed-length keys of 100 or so bits (regardless of the length of M) with the property that M
is provably hard (computationally infeasible) to recover from E even if the bad guy
• Has seen lots of messages encrypted with the same key,
• Has seen lots of (M,E) pairs encrypted with the same key (a "known plaintext" attack),
• Can trick the sender into encrypting sample messages chosen by the bad guy (a "chosen
plaintext" attack).
The algorithms (and the proofs of their properties) depend on high-powered mathematics that is beyond the
scope of this course.
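A minimal sketch of the exclusive-or cipher described above (my addition). Note that the same function serves as both f1 and f2, and that the key must be as long as the message and used only once.

class XorCipher {
    // E = f1(M, K) and M = f2(E, K) are the same operation.
    static byte[] xor(byte[] message, byte[] key) {
        if (key.length < message.length)
            throw new IllegalArgumentException("a one-time pad key must be as long as the message");
        byte[] result = new byte[message.length];
        for (int i = 0; i < message.length; i++)
            result[i] = (byte) (message[i] ^ key[i]);   // flip the bits where the key has a one
        return result;
    }
}

For any message M and random key K of the same length, xor(xor(M, K), K) == M, and without K the ciphertext reveals nothing about M.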

Key Distribution
Even with such an algorithm in hand, there's still the problem of how the two parties who wish to
communicate get the same key in the first place -- the key distribution problem. If the key is sent over
the network without encryption, a bad guy could see it and it would become useless. But if the key is to
be sent encrypted, the two sides have to somehow agree on a key to encrypt the key, which leaves us
back where we started. One could always send the key through some other means, such as a trusted
courier (think of James Bond with a briefcase handcuffed to his wrist). This is called "out-of-band"
transmission. It tends to be expensive and introduces risks of its own (see any James Bond movie for
examples). Ultimately, some sort of out-of-band transmission is required to get things going, but we
would like to minimize it.
A clever partial solution to the key distribution problem was devised by Needham and Schroeder. The
algorithm is a bit complicated, and would be totally unreadable without some helpful abbreviations.
Instead of denoting the result of encrypting message M with key K with the expression f1(M,K), we will
write it as [M]K. Think of this as a box with M inside secured with a lock that can only be opened with
key K. We will assume that there is a trusted Key Distribution Center (KDC) that helps processes
exchange keys with each other securely. At the beginning of time, each process A has a key KA that is
known only to A and the KDC. Perhaps these keys were distributed by some out-of-band technique. For
example, the software for process A may have been installed from a (trusted!) floppy disk that also
contained a key for A to use. The algorithm uses five messages.
Message 1 is very simple. A sends the KDC a message saying that it wants to establish a secure channel
with B. It includes in the message a random number, denoted id, which it makes up on the spot.
1: request + id
(In these examples, "+" represents concatenation of messages.)
The KDC makes up a brand new key Kc which it sends back to A in a rather complicated message.
2: [Kc + id + request + [Kc + A]KB]KA
First note that the entire message is encrypted with A's key KA. The encryption serves two purposes.
First, it prevents any eavesdropper from opening the message and getting at the contents. Only A can
open it. Second, it acts as a sort of signature. When A successfully decrypts the message, it knows it
must have come from the KDC and not an imposter, since only the KDC (besides A itself) knows K A
and could use it to create a message that properly decrypts.1
A saves the key Kc from the body of the message for later use in communicating with B. The original
request is included in the response so that A can see that nobody modified the request on its way to
KDC. The inclusion of id proves that this is a response to the request just sent, not an earlier response
intercepted by the bad guy and retransmitted now. The last component of the response is itself
encrypted with B's key. A does not know B's key, so it cannot decrypt this component, but it doesn't
have to. It just sends it to B as message 3.
3: [Kc + A]KB
As with message 2, the encryption by KB serves both to hide Kc from eavesdroppers and to certify to B
that the message is legitimate. Since only the KDC and B know KB, when B successfully decrypts this
message, it knows that the message was prepared by the KDC. A and B now know the new key Kc, and
can use it to communicate securely. However, there are two more messages in the protocol.
Messages 4 and 5 are used by B to verify that the message 3 was not a replay. B chooses another
random number id' and sends it to A encrypted with the new key Kc. A decrypts the message, modifies
the random number in some well-defined way (for example, it adds one to it), re-encrypts it and sends it
back.
4: [ id' ]Kc
5: [ f(id') ] Kc
This is an example of a challenge/response protocol.

Public Key Encryption


In the 1970's, Diffie and Hellman invented a revolutionary new way of doing encryption, called public-
key (or asymmetric) cryptography. At first glance, the change appears minor. Instead of using the same
key to encrypt and decrypt, this method uses two different keys, one for encryption and one for
decryption. Diffie and Hellman invented an algorithm for generating a pair of keys (P,S) and an
encryption algorithm such that messages encrypted with key P can be decrypted only with key S.
Since then, several other similar algorithms have been devised. The most commonly used one is called
RSA (after its inventors, Rivest, Shamir, and Adelman) and is patented (although the 17-year lifetime of
the patent is about to run out). By contrast, the older cryptographic technique described above is called
conventional, private key, or symmetric cryptography. In most public-key algorithms, the functions f1
and f2 are the same, and either key can be used to decrypt messages encrypted with the other. That is,
f(f(M,P), S) = f(f(M,S), P) = M.
The beauty of public key cryptography is that if I want you to send me a secret message, all I have to do
is generate a key pair (P,S) and send you the key P. You encrypt the message with P and I decrypt it
with S. I don't have to worry about sending P across the network without encrypting it. If a bad guy
intercepts it, there's nothing he can do with it that can harm me (there's no way to compute S from P or
vice versa). S is called the secret key and the P is the public key.
However, there's a catch. A bad guy could pretend to be me and send you his own public key Pbg,
claiming it was my public key. If you encrypt the message using Pbg, the bad guy could intercept it and
decrypt it, since he knows the corresponding Sbg. Thus my problem is not how to send my public key to
you securely, it is how to convince you that it really is mine. We'll see in a minute a (partial) solution to
this problem.
Public key encryption is particularly handy for digital signatures. Suppose I want to send you a
message M in such a way as to assure you it really came from me. First I compute a hash value h(M)
from M using a cryptographic hash function h. Then I encrypt h(M) using my secret key S. I send you
both M and the signature [h(M)]S. When you get the message, you compute the hash code h(M) and use
my public key P to decrypt the signature. If the two values are the same, you can conclude that the
message really came from me. Only I know my secret key S, so only I could encrypt h(M) so that it
would correctly decrypt with P. As before, for this to work, you must already know and believe my
public key.
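A sketch of this sign-and-verify procedure using Java's standard security API (my addition; the "SHA256withRSA" algorithm hashes the message and transforms the hash with the private key in one step, so the explicit h(M) does not appear separately in the code):

import java.nio.charset.StandardCharsets;
import java.security.KeyPair;
import java.security.KeyPairGenerator;
import java.security.Signature;

class DigitalSignatureDemo {
    public static void main(String[] args) throws Exception {
        KeyPairGenerator gen = KeyPairGenerator.getInstance("RSA");
        gen.initialize(2048);
        KeyPair pair = gen.generateKeyPair();      // (P, S): public and secret keys

        byte[] message = "a message M from me to you".getBytes(StandardCharsets.UTF_8);

        // I sign with my secret key S; this plays the role of [h(M)]S.
        Signature signer = Signature.getInstance("SHA256withRSA");
        signer.initSign(pair.getPrivate());
        signer.update(message);
        byte[] signature = signer.sign();

        // You verify with my public key P.
        Signature verifier = Signature.getInstance("SHA256withRSA");
        verifier.initVerify(pair.getPublic());
        verifier.update(message);
        System.out.println("signature valid? " + verifier.verify(signature));
    }
}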
An important application of digital signatures is a certificate, which is a principal's name and public
key signed by another principal. Suppose Alice wants to send her public key to Bob in such a way that
Bob can be reassured that it really is Alice's key. Suppose, further, that Alice and Bob have a common
friend Charlie, Bob knows and trusts Charlie's public key, and Charlie knows and trusts Alice's public
key. Alice can get a certificate from Charlie, which contains Alice's name and public key, and which is
signed by Charlie:
[Alice + PAlice]SCharlie
Alice sends this certificate to Bob. Bob verifies Charlie's signature on the certificate, and since he trusts
Charlie, he believes that PAlice really is Alice's public key. He can use it to send secret messages to Alice
and to verify Alice's signature on messages she sends to him. Of course, this scenario starts by
assuming Bob has Charlie's public key and Charlie has Alice's public key. It doesn't explain how they
got them. Perhaps they got them by exchanging other certificates, just as Bob got Alice's key. Or
perhaps the keys were exchanged by some out-of-band medium such as snail mail, a telephone call, or a
face-to-face meeting.
A certificate authority (CA) is a service that exists expressly for the purpose of issuing certificates.
When you install a web browser such as Netscape, it has built into it a set of public keys for a variety of
CAs. In Netscape, click the "Security" button or select "Security info" item from the "Communicator"
menu. In the window that appears, click on "Signers". You will get a list of these certificate authorities.
When you visit a "secure" web page, the web server sends your browser a certificate containing its
public key. If the certificate is signed by one of the CAs it recognizes, the browser generates a
conventional key and uses the server's public key to transmit it securely to the server. The browser and
the server can now communicate securely by encrypting all their communications with the new key.
The little lock-shaped icon in the lower left corner of the browser changes shape to show the lock
securely closed to indicate the secure connection. Note that both public-key and conventional (private
key) techniques are used. The public-key techniques are more flexible, but conventional encryption is
much faster, so it is used whenever large amounts of data need to be transmitted. You can learn more
about how Netscape handles security from Netscape's web site.

1
A message encrypted with some other key could be "decrypted" with KA, but the results would be
gibberish. The inclusion of the request and id in the message ensures that A can tell the difference
between a valid message and gibberish.

solomon@cs.wisc.edu
Mon Jan 24 13:28:57 CST 2000

Copyright © 1996-1999 by Marvin Solomon. All rights reserved.
