
What Is Garbage Collection?

Garbage Collection (GC) is a mechanism that provides automatic memory
reclamation for unused memory blocks. Programmers dynamically allocate
memory, but when a block is no longer needed, they do not have to return it to
the system explicitly with a free() call. The GC engine takes care of recognizing
that a particular block of allocated (heap) memory is no longer used and
puts it back into the free memory area. GC was introduced by John McCarthy
in 1958 as the memory management mechanism of the LISP language. Since
then, GC algorithms have evolved and now can compete with explicit memory
management. Several languages are natively based on GC. Java is probably the
most popular one; others include LISP, Scheme, Smalltalk, Perl and
Python. C and C++, in the tradition of a respectable, low-level approach to
system resource management, are the most notable exceptions to this list.

Many different approaches to garbage collection exist, resulting in several
families of algorithms that include reference counting, mark-and-sweep and
copying GCs. Hybrid algorithms, as well as generational and conservative
variants, complete the picture. Choosing a particular GC algorithm usually is
not a programmer's task, as the memory management system is imposed by
the adopted programming language. An exception to this rule is the Boehm-
Demers-Weiser (BDW) GC library, a popular package that allows C and C++
programmers to include automatic memory management in their programs.
The question is: why would they want to do a thing like this?
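To make the first of those algorithm families concrete, here is a minimal sketch of reference counting in C. The names (rc_alloc, rc_retain, rc_release) are illustrative, not from any library: a count travels in a header in front of each block, and the block is freed when the last reference is dropped.

```c
#include <assert.h>
#include <stdlib.h>

/* Minimal reference counting sketch: a count lives in a header just before
 * the payload handed to the caller. Illustrative names, not a real API. */
typedef struct {
    long refcount;
} rc_header;

void *rc_alloc(size_t size) {
    rc_header *h = malloc(sizeof(rc_header) + size);
    if (h == NULL) return NULL;
    h->refcount = 1;
    return h + 1;                      /* hand the caller the payload */
}

static rc_header *rc_of(void *p) { return (rc_header *)p - 1; }

void rc_retain(void *p) { rc_of(p)->refcount++; }

/* Returns 1 if the block was actually freed, 0 otherwise. */
int rc_release(void *p) {
    rc_header *h = rc_of(p);
    if (--h->refcount == 0) { free(h); return 1; }
    return 0;
}
```

The appeal of this family is that reclamation happens immediately and incrementally; its best-known weakness, circular references, is discussed later in this document.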
Memory Allocation and Garbage Collection

True confession time: my web pages serve several purposes. One of these is to act as
a notebook that I will have access to wherever I have a computer and an Internet
connection. This page is under construction. In the short term, the purpose of this
page is to provide a draft of an article on memory and to provide links to garbage
collection sources. In the longer term, I hope to provide a polished article that will be
of interest to other software engineers. This page was prompted by J.D. Marrow, who
kindly sent me a reference to Hans-J. Boehm's Web page on garbage collection.

Dynamically Allocated Data Structures

Dynamically created data structures like trees, linked lists and hash tables (which can be
implemented as arrays of linked lists) are key to the construction of many large
software systems. For example, a compiler for a programming language will maintain
symbol tables and type information that are dynamically constructed by reading the
source program. Many modern compilers also read the source program (e.g., parse it)
and translate it into an internal tree form (commonly called an abstract syntax tree)
that is also dynamically created. Graphics programs, like 3D rendering packages, also
make extensive use of dynamic data structures. In fact it is rare to find any program
that is larger than a couple of thousand lines that does not make use of dynamically
allocated data structures.
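As a concrete illustration, the hash-table-as-array-of-linked-lists idea mentioned above can be sketched in a few dozen lines of C. The fixed table size, string keys and names (ht_put, ht_get) are simplifying assumptions for the sketch:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* A tiny hash table: an array of buckets, each bucket a linked list
 * (chaining). Collisions land in the same list and are searched linearly. */
#define NBUCKETS 64

typedef struct ht_node {
    char *key;
    int value;
    struct ht_node *next;
} ht_node;

typedef struct {
    ht_node *bucket[NBUCKETS];
} hashtable;

static unsigned ht_hash(const char *s) {
    unsigned h = 5381;                 /* djb2-style string hash */
    while (*s) h = h * 33 + (unsigned char)*s++;
    return h % NBUCKETS;
}

void ht_put(hashtable *t, const char *key, int value) {
    unsigned i = ht_hash(key);
    for (ht_node *n = t->bucket[i]; n; n = n->next)
        if (strcmp(n->key, key) == 0) { n->value = value; return; }
    ht_node *n = malloc(sizeof *n);    /* dynamically allocated node */
    n->key = malloc(strlen(key) + 1);
    strcpy(n->key, key);
    n->value = value;
    n->next = t->bucket[i];            /* push onto the bucket's list */
    t->bucket[i] = n;
}

int ht_get(const hashtable *t, const char *key, int *out) {
    for (const ht_node *n = t->bucket[ht_hash(key)]; n; n = n->next)
        if (strcmp(n->key, key) == 0) { *out = n->value; return 1; }
    return 0;
}
```

Note that every ht_put call allocates nodes dynamically and nothing here ever frees them, which is exactly where the deallocation problems discussed in the next section come from.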

Problems with Dynamic Allocation

To avoid consuming vast amounts of virtual memory and destroying performance
through page swapping, many programs that use dynamic allocation also dynamically
deallocate memory when the programmer believes it is no longer needed. This leads
to a number of problems, including:

 Poor performance. The OpenGL graphics library supports "display lists", which
describe the 3D polygons that make up a shape. A complex scene may have a
large number of display lists. Deallocation of a display list, one element at a
time, can be very time consuming and can have a large impact on program
performance. This can be avoided using a block-based memory allocator, where
the display list elements are allocated from large memory blocks. When a
display list is no longer needed, the pool of blocks can be deallocated. While this
technique is effective, designing, implementing and debugging a good block
allocator is time consuming.
 Memory leaks. A memory leak takes place when memory is allocated but never
deallocated. A major computer manufacturer had a subtle memory leak in one
of the window types for their window system. One group of users would run
large test programs overnight and then expect to scroll back through the
window's buffer. The buffer was not infinite and only held a certain number
of lines. The software displaying this scrolling buffer had a memory leak: it
would recover most, but not all, of the memory allocated as the window
scrolled. As the test ran overnight, the window software used up more and
more memory, until the entire virtual memory was consumed. The window
would then crash, destroying the test run information the user wanted. Tools
like Purify were not available at the time, and this bug took a huge amount of
effort to track down and fix.
 Pointers to deallocated memory. When memory is deallocated, there may still
be pointers in use to the deallocated memory. The deallocated memory will be
recovered by the memory allocation system and reused. Since the memory may
then have two pointers to it (the stale pointer to the old, deallocated data structure and
the pointer to the newly allocated data structure), the program behavior may be
bizarre, running correctly sometimes and getting the wrong result at others.
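The last bullet can be made concrete with a toy allocator. Real allocators behave similarly in spirit: a freed block may be handed right back to the next caller, so a stale pointer ends up aliasing a brand-new object. Everything here (toy_alloc, toy_free, the fixed-size pool) is illustrative:

```c
#include <assert.h>
#include <stddef.h>

/* A toy allocator handing out fixed-size blocks from a static pool through a
 * free list. Freed blocks are reused first, so the pointer returned after a
 * free is the very block that was just given up. */
#define NBLOCKS 8

typedef union toy_block {
    union toy_block *next;    /* link while the block is on the free list */
    unsigned char bytes[32];  /* payload while the block is in use */
} toy_block;

static toy_block arena[NBLOCKS];
static toy_block *freelist = NULL;
static int next_fresh = 0;

void *toy_alloc(void) {
    if (freelist) {                      /* reuse a freed block first */
        toy_block *b = freelist;
        freelist = b->next;
        return b;
    }
    if (next_fresh < NBLOCKS) return &arena[next_fresh++];
    return NULL;                         /* pool exhausted */
}

void toy_free(void *p) {
    toy_block *b = p;
    b->next = freelist;                  /* push onto the free list */
    freelist = b;
}
```

Allocate a block, free it, allocate again: the second allocation returns the same address, so any stale copy of the first pointer now silently refers to the new object.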

Even carefully crafted code written by experienced software engineers tends to suffer
from problems with memory leaks and references to deallocated memory. Products
like Purify (for UNIX) and BoundsChecker (for Windows NT) are literally worth
their weight in gold when trying to track down these errors.
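The core idea behind such tools can be sketched crudely: wrap allocation and deallocation, keep a count of live blocks, and treat a nonzero count at a checkpoint (say, program exit) as evidence of a leak. The names below (counted_malloc, counted_free) are illustrative; real tools also track call sites, bounds and much more:

```c
#include <assert.h>
#include <stdlib.h>

/* Crude leak accounting: every successful allocation increments a live
 * count, every free decrements it. A nonzero count at a checkpoint means
 * some allocation was never returned. */
static long live_allocations = 0;

void *counted_malloc(size_t size) {
    void *p = malloc(size);
    if (p) live_allocations++;
    return p;
}

void counted_free(void *p) {
    if (p) live_allocations--;   /* free(NULL) is a legal no-op */
    free(p);
}

long outstanding(void) { return live_allocations; }
```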

Java and Garbage Collection

There are a variety of reasons that the Java programming language seems to be
popular. Some of these reasons actually involve software engineering issues, rather
than hype and religion. Java does not support explicit pointers (which are a source of a
lot of complexity in C and C++) and it supports garbage collection. This can greatly
reduce the amount of programming effort needed to manage dynamic data structures.
Although the programmer still allocates data structures, they are never explicitly
deallocated. Instead, they are "garbage collected" when no live references to them are
detected. This avoids the problem of having a live pointer to a dead object. If there is
a live pointer to an object, the garbage collection system will not deallocate it. When
an object is no longer referenced, its memory is automatically recovered. So, in theory,
Java avoids many of the problems listed above. The cost, however, is in performance.
Automatic garbage collection is usually not as efficient as programmer managed
allocation and deallocation. Also, garbage collectors tend to deallocate objects at a
low level, which can hurt performance (e.g., deallocating an OpenGL display list at the
element level). There is a rich body of work on garbage collector architecture and
garbage collection algorithms (stemming largely from LISP runtime support, which
makes heavy use of garbage collection). A lot of this work is aimed at avoiding the
more severe pitfalls of garbage collection.

C/C++ and Garbage Collection

Many garbage collection systems rely on compiler support to help the system
determine when a pointer goes out of scope, so the object the pointer points to can be
deallocated. However, most C and C++ compilers don't support garbage collection.
There are several packages that support garbage collection for C and C++. See, for
example, Hans-J. Boehm's Web page on the Boehm-Demers-Weiser conservative
garbage collector. Another version of the Web page on the Boehm-Demers-Weiser
garbage collector is also available at a site hosted by Xerox PARC. This garbage
collector is designed to be a replacement for malloc in C and new in C++. The Web
pages also provide a reference list. So far I have not looked into the pros and cons of
this software (as I mentioned above, this Web page is still under construction).

Pool Allocation

In pool allocation, objects are allocated from a pool of large blocks. When a point in
the program is reached where this memory is no longer needed, the entire pool can be
deallocated at once. For a discussion of pool allocation of C++ objects
see Overloading New in C++.
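The technique can be sketched directly in C. The block size and the names (pool_alloc, pool_destroy) are illustrative: objects are carved out of large blocks with a bump pointer and are never freed individually; destroying the pool releases every block at once.

```c
#include <assert.h>
#include <stdlib.h>

/* Pool allocation sketch: each large block is filled by bumping a cursor;
 * pool_destroy frees all blocks in one pass. */
#define POOL_BLOCK 4096

typedef struct pool_block {
    struct pool_block *next;
    size_t used;
    unsigned char data[POOL_BLOCK];
} pool_block;

typedef struct { pool_block *head; } pool;

void *pool_alloc(pool *p, size_t size) {
    size = (size + 7) & ~(size_t)7;       /* round up for 8-byte alignment */
    if (size > POOL_BLOCK) return NULL;   /* oversized requests not handled */
    if (!p->head || p->head->used + size > POOL_BLOCK) {
        pool_block *b = malloc(sizeof *b);
        if (!b) return NULL;
        b->next = p->head;                /* chain a fresh block */
        b->used = 0;
        p->head = b;
    }
    void *mem = p->head->data + p->head->used;
    p->head->used += size;
    return mem;
}

void pool_destroy(pool *p) {              /* one pass frees everything */
    while (p->head) {
        pool_block *next = p->head->next;
        free(p->head);
        p->head = next;
    }
}
```

This is exactly the shape of the display-list fix mentioned earlier: many small allocations cost one malloc per block, and teardown costs one free per block instead of one per object.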

Other Web pages

 Jamie Zawinski's argument for garbage collection vs. ad hoc memory
management (or no memory management and memory leaks).
 The Hoard Memory allocator

Hoard is not a garbage collecting allocator. However, it does very fast
multiprocessor memory allocation for multithreaded software. Across multiple
processors with a threaded application, the authors of Hoard claim that they
achieve very close to linear allocator performance. Hoard would be an excellent
platform on which a garbage collecting allocator could be built.

garbage collection.
by Jamie Zawinski <jwz@jwz.org>
1-Feb-1998.

``Any sufficiently complicated C or Fortran program contains an ad hoc informally-
specified bug-ridden slow implementation of half of Common Lisp.''
-- Philip Greenspun

In a conversation at work about the architecture of a new product, someone made an
assertion along the lines of, ``then we'd need to use a garbage collector, and that
would hurt performance.''

This is wrong.

It's a common belief that garbage collection means inferior performance, because
everyone who has gotten into programming in the last decade regards manual storage
management as a fact of life, and totally discounts the effort and performance impact
of doing everything by hand.

In a large application, a good garbage collector is more efficient than malloc/free.

Now, you can argue over whether there are good C++ garbage collectors available
-- but that's an implementation detail, which is a whole different level. In a large
application, a good implementation of GC will be more efficient than an equivalently-
good implementation of malloc/free. This is because large applications have nontrivial
object life cycles, and so you end up spending all of your time trying to figure out
what exactly those life cycles are. With GC, you'll get your program written faster,
and it'll be more efficient to boot.

The absolute speed of your CPU has nothing to do with it; for a large, complex
application a good GC will be more efficient than a zillion pieces of hand-tuned,
randomly micro-optimized storage management. It's a relative measure; on a slow
CPU, good GC will still be more efficient.

Note that I said a good garbage collector. Don't blame the concept of GC just because
you've never seen a good GC that interfaces well with your favorite language.
(Especially if your favorite language is C or C++, which are really just overgrown
PDP-11 assemblers, despite attempts to drag them kicking and screaming into the
1980s.)

If you're like most people, you've never seen a good garbage collector, because, while
all of the commercial Lisp and Smalltalk systems had high quality GC, just about all
of the open source garbage collectors are junk (e.g., Perl, Python, and Emacs.) Perl's
GC is an especially bad joke: if you have circular references, the objects won't get
collected until the program exits! Give me a break! (Update: I'm told that Python's so-
called GC is just as bad, since it is also merely reference-counting instead of a real
GC.)
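The complaint about reference counting and circular references can be shown concretely. In the C sketch below (illustrative names, not Perl's or Python's actual machinery), two objects that reference each other keep each other's counts above zero, so neither is ever freed even after the program drops its own handles:

```c
#include <assert.h>
#include <stdlib.h>

/* Why pure reference counting leaks cycles: in a two-object cycle, each
 * object holds a reference to the other, so releasing the program's own
 * handles still leaves both counts at 1 and neither is ever collected. */
typedef struct obj {
    int refcount;
    struct obj *partner;           /* strong reference to another object */
} obj;

obj *obj_new(void) {
    obj *o = malloc(sizeof *o);
    o->refcount = 1;
    o->partner = NULL;
    return o;
}

void obj_release(obj *o) {
    if (o && --o->refcount == 0) {
        obj_release(o->partner);   /* drop our reference before dying */
        free(o);
    }
}

/* Returns the total reference count still held after the caller has
 * released both objects; with a cycle this stays above zero (a leak). */
int cycle_demo(void) {
    obj *a = obj_new();
    obj *b = obj_new();
    a->partner = b; b->refcount++;     /* a references b */
    b->partner = a; a->refcount++;     /* b references a: a cycle */
    obj_release(a);                    /* caller drops both handles... */
    obj_release(b);
    return a->refcount + b->refcount;  /* ...but both counts are still 1 */
}
```

A tracing collector (mark-and-sweep or copying) has no such problem: it asks what is reachable from the program's roots, and an isolated cycle is simply unreachable garbage.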

Another point that often gets overlooked is that existence of a GC doesn't mean that
the programmer has to play a totally hands-off role in the allocation of objects; as with
any coding, there are usually a few bottlenecks that deserve special care. A system
with a good GC will provide opportunities to tune; for example, to make hints to the
GC about lifetime and locality and so on.

Java makes this really hard because (largely as a result of its security and type-safety
requirements) it doesn't expose any low-level places where one can get down and
dirty with the allocator and play all the usual carve-out-a-chunk-of-memory tricks that
one can play in other languages like C (or even in the historical production-quality
GC-oriented environments, the Lisp Machines.)

Java and Emacs are really bad examples of GC-oriented environments. They're
primitive, badly tuned, and give the user very few opportunities for optimization.
One way to work around this is by throwing faster and faster CPUs at the problem,
but that's not the only way, and the fact that people seem to like to solve the problem
that way reflects inappropriately badly on the notion of GC itself.

Based on my experience using both kinds of languages, for years at a stretch, I claim
that a good GC always beats doing explicit malloc/free in both computational
efficiency and programmer time.

However, I also claim that, because of the amount of programmer time that is saved
by using GC rather than explicit malloc/free, as well as the dramatic reduction in
hard-to-debug storage-management problems, even using a mediocre garbage
collector will still result in your ending up with better software faster.

Most of the time, throwing memory and CPU at the problem is still cheaper than
throwing programmer time at the problem, even when you multiply the
CPUs/memory by the number of users. This isn't true all the time, but it's probably
true more often than you think, because Worse is Better.
