Review: Garbage Collection
Garbage Collection
author | Richard Jones and Rafael Lins
pages |
publisher | Wiley & Sons
rating | 6
reviewer | A.M. Kuchling
ISBN |
summary | A highly detailed survey of garbage collection algorithms, and the research literature surrounding them. Interesting, if at times very dry.
The first time you encounter garbage collection (GC, for short) is often by using Lisp or Scheme in an introductory computer science course, where it can seem an almost magical technology. How does the interpreter determine that a variable has become garbage and can safely be deallocated? This book explains how GC works, in great detail, and will interest people implementing garbage collectors, or anyone who's wondered how it's done. There are three approaches to GC, all of which are covered in this book:
- Reference counting: Every object keeps a count of the references pointing to it. When a new reference is created, the counter must be incremented; when a reference is destroyed, the counter is decremented, and if the count reaches zero, the object has become garbage and can be reclaimed. Unfortunately, reference counting exacts constant overhead from incrementing and decrementing the counters, and in its simplest form it can't collect cyclic data structures, because their counts will never reach zero.
- Mark-sweep: Traverse the graph of object references, starting from the roots of the computation; roots might be the program's global variables, the contents of the C stack, static class variables, or a main namespace of some sort. Each object that's traversed is marked as reachable; when the traversal is complete, any unmarked objects can't be reached by the program any longer, and can be reclaimed as garbage.
- Copying: Divide memory into two halves, called "Fromspace" and "Tospace". Memory for new objects is always allocated out of Fromspace. When Fromspace fills up, a traversal similar to the one used by mark-sweep is done, but instead of just marking objects as reachable, they're copied into Tospace. Once the traversal is done, all the live objects are in Tospace, and Fromspace contains only garbage. Tospace is then declared the new Fromspace, and new objects are allocated from it until it fills up; the algorithm then repeats.
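The contrast between the first two approaches can be sketched with a toy heap. Everything here (the `Obj` class, the hand-maintained counters, the `heap` list) is an invented illustration, not anything from the book: a cycle whose counts never reach zero, which a mark phase starting from the roots correctly identifies as garbage.

```python
# Toy illustration of why reference counting leaks cycles and how
# mark-sweep reclaims them.  Obj, heap, add_ref are invented names.

class Obj:
    def __init__(self, name):
        self.name = name
        self.refcount = 0      # maintained by hand below
        self.refs = []         # outgoing references
        self.marked = False

def add_ref(src, dst):
    src.refs.append(dst)
    dst.refcount += 1

heap = [Obj("a"), Obj("b"), Obj("root")]
a, b, root = heap
add_ref(a, b)
add_ref(b, a)          # a <-> b form a cycle
add_ref(root, a)

# Drop the root's reference to a: the cycle is now garbage, but
# reference counting never sees either count reach zero.
root.refs.remove(a)
a.refcount -= 1
print([o.refcount for o in (a, b)])    # -> [1, 1]: never reclaimed

# Mark-sweep: trace from the roots, then sweep unmarked objects.
def mark(obj):
    if not obj.marked:
        obj.marked = True
        for r in obj.refs:
            mark(r)

mark(root)
live = [o for o in heap if o.marked]
print([o.name for o in live])          # -> ['root']: cycle swept
```

A copying collector would perform the same traversal, but relocate each marked object into Tospace instead of setting a bit.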
This book covers the above three algorithms for GC in great detail; there's a chapter for each algorithm, examining its limitations, common implementation techniques, and historical notes on its development. Later chapters cover more advanced collectors that extend the basic algorithms in various ways. Generational collectors are based on the observation that many objects are temporary and have short lifetimes, while a few objects may last for the entire lifetime of the program. A generational garbage collector concentrates on young objects, because they will yield the most recyclable space, and spends less time examining old objects that probably won't be reclaimed. Incremental and concurrent collectors don't bring the whole program to a halt while they scavenge through memory doing a collection, making them better suited for interactive or even real-time applications.
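The generational idea can be sketched in a few lines: allocate into a small nursery, trace only the nursery during a "minor" collection, and promote survivors to an old generation that is rarely examined. All names here (`nursery`, `minor_collect`, and so on) are invented for this sketch.

```python
# Minimal two-generation collector sketch.  New objects go in a
# nursery; a minor collection traces only the nursery and promotes
# survivors, so long-lived objects are rarely re-examined.

class Obj:
    def __init__(self):
        self.refs = []

nursery, old_gen = [], []

def allocate():
    o = Obj()
    nursery.append(o)
    return o

def trace(roots):
    # Find every object reachable from the roots.
    seen, stack = set(), list(roots)
    while stack:
        o = stack.pop()
        if id(o) not in seen:
            seen.add(id(o))
            stack.extend(o.refs)
    return seen

def minor_collect(roots):
    live = trace(roots)
    survivors = [o for o in nursery if id(o) in live]
    old_gen.extend(survivors)    # promote survivors to the old generation
    nursery.clear()              # everything else is reclaimed at once

root = allocate()
temp = allocate()                # becomes garbage immediately
kept = allocate()
root.refs.append(kept)

minor_collect([root])
print(len(old_gen), len(nursery))   # -> 2 0: temp was reclaimed
```

A real generational collector also needs a "remembered set" of old-to-young pointers so that a minor collection doesn't miss roots hiding in the old generation; that bookkeeping is omitted here.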
Chapter 9 is an interesting examination of the implementation of GC in an unfriendly environment, namely C programs. The Boehm-Demers-Weiser collector implements a mark-sweep garbage collector that can replace C's malloc() function, scanning stack frames and registers for potential pointers, and managing to do it fairly efficiently. One study found the Boehm-Demers-Weiser collector added 20% more execution time overhead than malloc() did -- that is, the time spent in memory allocation was 20% longer. (That doesn't mean it made programs 20% slower, because most programs spend more time computing results than they spend allocating memory.) Writing a garbage collector for C is a tricky job, complicated by the lack of type information, data values that may mimic pointers, and compiler optimizations that may render objects invisible to the collector. Bartlett's Mostly Copying Garbage Collector, also described in this chapter, is more liberal and requires some assistance from the C programmer in exchange for greater accuracy.
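The conservative scanning just described can be mimicked in toy form: every word found on the stack is compared against the set of live heap addresses, and any match is treated as a pointer, even if it's really just an integer that happens to look like one. The simulated heap and stack below are invented stand-ins for real stack-frame and register scanning.

```python
# Toy conservative scan: any word on the "stack" that matches a live
# heap address is treated as a potential pointer, so the block it
# names survives collection.  Addresses and the stack are simulated.

heap = {0x1000: "object A", 0x2000: "object B", 0x3000: "object C"}

# Simulated stack contents: one real pointer (0x1000), an integer
# that happens to equal a heap address (0x3000), and ordinary data.
stack_words = [0x1000, 0x2A, 0x3000, 0xDEAD]

# The collector cannot tell whether 0x3000 is a pointer or data, so
# it conservatively retains "object C" as well; only "object B",
# which nothing on the stack names, is reclaimable.
retained = {addr: obj for addr, obj in heap.items() if addr in stack_words}
print(sorted(retained.values()))   # -> ['object A', 'object C']
```

This errs only toward keeping garbage alive, never toward freeing live data, which is why a conservative collector must mark in place rather than copy: if 0x3000 were really an integer, "fixing it up" to point at a relocated object would corrupt the program.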
The remaining three chapters consider suggestions for adding garbage collection to C++, the interaction of GC with cache memory, and distributed garbage collection. The topics of these chapters are still subjects of active research, so the final chapters are lighter on implementation details and heavier on references to research papers, which makes them a good deal less interesting.
The authors are computer scientists specializing in GC (Richard Jones maintains a garbage collection page), and their coverage is quite complete, which at times makes for exceedingly dry reading as the authors enumerate what seems like every minor implementation variation and every relevant paper. On the other hand, if you're actually faced with implementing a garbage collector, the pointers into the research literature will be very useful. A casual reader (like me) can simply skim, or even skip, pages until the next interesting section arrives.
Excerpt
However, the matter is more subtle than this. The collectors described in this chapter are very simple and inefficient. The behaviour of these collectors cannot be automatically ascribed to their more sophisticated variants. The cost of copying an object is likely to be more expensive than simply testing and setting a mark-bit, particularly if the object is large. Although mark-sweep must sweep the entire heap, in practice its real cost is dominated by the mark phase. Linearly scanning the heap will generally be less expensive than tracing data structures even using the simple technique shown above. ... Furthermore, more sophisticated methods can substantially reduce the cost of the sweep. Although copying collectors have predominated in the past, recent studies suggest that the choice between mark-sweep and copying garbage collection may well depend as much on the behaviour of the client program as on the inherent properties of the garbage collection algorithm.
-- From section 2.4 of the book.
To purchase this book, head over to Fatbrain.
Table of Contents
- Introduction
- The Classical Algorithms
- Reference Counting
- Mark-Sweep Garbage Collection
- Mark-Compact Garbage Collection
- Copying Garbage Collection
- Generational Garbage Collection
- Incremental and Concurrent Garbage Collection
- Garbage Collection for C
- Garbage Collection for C++
- Cache-Conscious Garbage Collection
- Distributed Garbage Collection
But it needs a new euphemism (Score:1)
true no leaks with GC, but it has opposite problem (Score:1)
Almost as a rule, Java implementations don't have modern GC. You are still exaggerating.
The memory requirements of a garbage collected program are generally about 2x that of a manually deallocated one.
Garbage collection leads to better programs. (Score:2)
GC was a revelation to me when I switched from C/C++ to Perl. Of course Perl's memory management goes beyond GC, but that's another story.
Rapid Application Development is nearly impossible without GC -- you can't develop rapidly when you are hunting down memory leaks and illegal pointers.
Which is GC "oddly named"? (Score:1)
The GC term is used in literature, code, conferences, and everyday conversation about memory management.
What's odd about that (besides the obvious geekiness of it all)?
-----------------------------
Computers are useless. They can only give answers.
Garbage collection leads to better programs. (Score:1)
It's still a lot easier then not having a GC at all, though.
-Peter
But it needs a new euphemism (Score:1)
Sanitary deportation?
Forceful removal of unwanted elements?
-Peter
Grrr... (Score:1)
-Peter
Memory Management Reference web site worth reading (Score:1)
http://www.harlequin.com/mm/reference/
malloc/free can be slower than GC (Score:1)
Grrr... (Score:2)
Obviously you have never written a real program spanning several million lines of code with subroutines written by someone other than you.
Combine the above with a data structure malloced in a library not written by you (possible in OS/2, though it isn't malloc) that needs to be freed in a later subroutine. Maybe we should toss threads in too, so you may have a different thread looking at your memory after your subroutine is done, and your subroutine has to return a pointer to malloced memory. Suddenly you don't know who is done with the memory you started with. You can't call free() too soon, lest you prevent access by someone else who needs it, but you do have to call it. GC solves this problem, and if it were fast enough it would be more universal.
The creator of C++ has listed allowing himself to be convinced that GC was too slow as his greatest regret.
Now, for simple cases you are right, but the simple cases are rarely gotten wrong (for long); when the memory needs to hang around for a long time, the case is no longer simple.
very good / one complaint (Score:1)
Of course, "nobody seems to be using the object recently" is useful for objects which can be shut down and restarted...
GC is the way (Score:1)
GC and simplified pointers help you to pay more attention to the problem *at hand*, rather than having to worry about differentiating between (in C++):
- stack allocated objects that don't have to be deleted but can't be safely passed around because they go out of scope
- heap allocated objects that have to be deleted
- pointers vs references (i.e. dot-notation vs. '->' notation)
Granted, these are not big problems by themselves. Put together, however, they wind up being an unnecessary headache, especially when dealing with application-level programming. And on the bright side of things, Java catches your null pointers with a catchable exception, not a segfault.
When dealing with systems-level stuff, putting up with C++'s manual memory management is a "grin & bear it" scenario, until Java's performance increases.
Though, of course, C++ and C will always be used where bit-twiddling is needed.
SmallTalk (Score:2)
Smalltalk did catch on in many industries, but it never quite penetrated into mainstream use. Does anyone really know why? Perhaps it was primarily because it was so advanced for its time (in the business community, I'd say late 80s to early 90s). Perhaps it was because C++ offered the "path of least resistance". I don't really know.
Java is just an attempt (imho, and from what I understand, in some veteran Smalltalkers' opinions) at doing Smalltalk over again for the C++ crowd (albeit with some limitations, since the programming community at large seems to be set against weakly-typed languages).
I love Smalltalk [and am very interested in Dylan and Self], and wish it would resurge with industry support. Until it does (go Squeak go!), however, Java is an acceptable alternative, as it also provides the "path of least resistance" to some of Smalltalk's better technologies: GC, dynamic typing (though restricted through Java interfaces), and a Packages/Classes/Methods IDE (through VisualAge for Java).
Regarding COM: I felt that Microsoft actually did some cool stuff with their VM to support COM and make it MUCH easier to use than the C++ way of using COM. COM is inherently based upon the idea of C/C++ pointers, so it's not going to map naturally to any language, Java or otherwise (witness the atrocity called "Automation" that allows COM to talk to Visual Basic).
Furthermore, as for your "double lines of code" assertion, I think you're overstating it a bit. Java doesn't have blocks, and it requires explicit type down-casting, but I don't see this as "doubling" code requirements. Certainly, the code requirement is larger than in Smalltalk, but not significantly so.
Dry my foot... (Score:1)
"A Good Collection of Garbage"
But it needs a new euphemism (Score:1)
Memory cleaning?
Data cleansing?
Memory downsizing? (or "rightsizing")
RIM? ("Reduction in (allocated) memory" - analogous to corporate-speak RIF)
Reference Counting vs. Mark & Sweep (Score:1)
In the short run, this doesn't matter so much with Perl because it's more useful as a tool than as an actual language. Try to run mod_perl, where the interpreter is persistent, and issues with reference counting become immediately apparent.
Many people don't consider reference counting to be "true" GC at all -- Java has some interesting arguments about their GC implementation, along with the little known fact that up to 40% of CPU time can be used just in GC: they do it right but it's not cheap...
I believe there's even a koan regarding GC in the Hacker Dictionary (which is appropriately self-referential...)
Grrr... (Score:1)
No one ever sets out to be careless. Your cautious phrase "if you were careful" points out that there are times when you were possibly not careful. malloc()/free() pairs get more complicated with distributed code.
GC has a place (Score:1)
GC has gotten a bad name. Usually the same people who misuse GC also misuse pointers.
The main form of GC misuse in OO-land is to avoid properly designing object destructors. As things get more complicated and multiple people start working on things, this is often the site of the first design failure.
Selecting the right time to destroy persistent objects is the biggest problem for real-time/multi-threaded/reentrant code. Subtle bugs (like the TCP listen-timeout bugs) usually live here. Deferring these problems to a generic GC is probably the wrong thing to do in most cases.
Using GC to free nodes on a recursive traversal is great, but using GC to solve buffer deadlock is probably a horrible thing to do.
Remember that most GC algorithms are only faster in that they essentially defer free() calls and process them in bulk, reducing overhead. Ref-count and mark-sweep type algorithms require that more work be done than simply marking a block unused on a call to free().
As they say: pick the right tool for the job.
Chip designers, WAKE UP! (Score:1)
And just a little bit of silicon helps GC in a big way!
I hope that the trend resumed by PicoJava will propagate to other embedded and general purpose processors.
And dynamic type checking is mainstream too now,
but that is another story..
Too many _Hackers_ (in the bad sense of the word) (Score:1)
a) Don't know anything besides C and C++
b) Are forced to use C/C++ because that's the company orientation
so that they later force that trash on developers in the form of libraries, debuggers that crash, broken compilers, etc.
Every time I upgrade systems/tools, the release notes make me sick.
No. Open Source is not doing better. You can't ignore C/C++ on Linux either.
For simple things, malloc/free can be enough, but
there's a certain point where the complex relationships between components/subsystems/clients crave for GC.
If you want to help Dylan.. (Score:1)
But at least one book.. (Score:1)
Try alloca() in C/C++ (Score:1)
But seriously, how bad is alloca for portability? Man pages generally claim that it is system dependent...
Grrr... (Score:1)
But reference counting is a form of garbage collection. So in this case, you suggest writing your own garbage collection instead of having the language provide it.
GC and Timing (Score:1)
very good / one complaint (Score:1)
My one complaint, as is often the case with academic texts, is that it does not even mention important deployed industry technologies.
For example, if I recall correctly, in its discussion of distributed GC, there is nary a mention of DCOM or CORBA. (The book was published in 1996, when DCOM and some CORBA ORBs were shipping.)
Garbage collection leads to better programs. (Score:1)
Actually the other way around... I've worked as a C++ programmer under Windows (ugh!). A segfault is an obvious indication that something is wrong and will tell you where it occurred, while a memory leak can go unnoticed for days or weeks of development and is a pain to track down. Give me segfaults over memory leaks every time.
I can growl louder than you can (Score:2)
Garbage collection: for those who are too lazy to call free(), or find it too difficult...
I am afraid you have just pressed one of my buttons. There is nothing that alerts me to danger more than one programmer accusing another of being 'lazy' because he does not want to use some ridiculously complicated tool of which the first programmer is fond.
I have NEVER seen a place where, if you were careful (as you should be) in designing your code, you couldn't have nice matching malloc()/free() pairs.
If this were usenet, at this point I would endeavour to call you out on this, and we would end up in a pissing contest over who had the most experience and whose experience was most real.
Since this is not usenet, and I am in a good mood, I will not. I will just say that I have seen many examples of supposed OO code hacked to death or made overly complicated by the need to keep track of memory allocation. The stability, memory usage and locality of a great many C programs (including XFree86) can be improved by linking them with a garbage collector. In many cases this can actually make them faster.
I am, at this very minute, attempting to track down a bug in a C++ library linked against my Java program, which is caused by a failure of the evil reference counting scheme the library's author used to keep track of his memory usage. Now I know, from looking at the code, that he could not have used malloc and free in the traditional way without breaking the clean interface of his code, and using a real garbage collector with C++ is hard, so what choice did he have? But the need to reimplement garbage collection badly for every project, rather than doing it properly once, is a total waste of time.
Are you serious? (Score:1)
Garbage collection leads to better programs. (Score:1)
I think I'd rather have the segfaults... sometimes memory leaks can be tricky to track down. At any rate, gc is the way to go.
GC was a revelation to me when I switched from C/C++ to Perl. Of course Perl's memory management goes beyond GC, but that's another story.
Rapid Application Development is nearly impossible without GC -- you can't develop rapidly when you are hunting down memory leaks and illegal pointers.
This, IMHO, is one of the really nice things about java, and perhaps a feature that really should have made it into C++ but didn't. It seems to me that properly handling exceptions w/o gc can make life real hairy for the C++ compiler.
I think GC is an interesting topic, and I am glad to read about it on
Garbage collection leads to better programs. (Score:1)
FWIW, there is a variant of C++ called EC++ (embedded C++), which strips out many of the less well thought-out "features" of C++, like exception throwing (other than in constructors, where it is required to deal with error conditions, because you can't return a value), templates, etc.
I have heard that it helps reduce the C++ code bloat a lot...
Lack of Real Hackers (Score:1)
It is sad that the world does not have very many people that can handle powerful tools like manual memory management.
I don't see what is so hard about it. If you allocate something, you should free it. If you don't allocate something, you had damn well better not try to use or free it.
For the less clueful, try static arrays. They are friendly and fast.
If I take that argument to the extreme, then I would arrive at the conclusion that real hackers should program in assembly. (or even machine language!)
Yes GC makes it harder for idiots to make mistakes, but I don't think that limits its usefulness to real hackers.
GC is of more use than simply memory management. In an OO language that supports destructors, GC can handle resources other than memory. For example, if you had a class that mixed audio streams together and wrote them to the audio device, your class's destructor could detach from the audio device, freeing it for use by other programs. There would not be an easy way to do this without GC besides reference counting... and what hacker wants to waste time re-inventing the wheel!
BTW, I can't see any good excuse for using static arrays, other than perhaps real-time embedded systems where you can't afford the overhead of GC.
Grrr... (Score:2)
Well, you could probably do without GC, but the question is whether you can do it better without GC.
Without GC, there are situations where you either need to (a) implement reference counting (which is a form of GC) in order to know when to free some data, or (b) make duplicate copies of objects so each object can be freed by the code that receives a reference to the object when it is done with it. I am really not convinced that either approach is better than just using GC in the first place. (FWIW, the example I am thinking of is a multithreaded application that communicates between threads via message passing, where you would like to be able to have multicast messages. The objects passed are treated as read-only, so side effects are not an issue. I am sure that given a couple minutes I could come up with a better example.)
Grrr... (Score:1)
This is garbage collection! It's just GC done on a per-program level, rather than a systemwide level. I think you just defeated your own argument...
But reference counting is a poor form of garbage collection; it's slow and can't reclaim cyclical structures. Most modern garbage collectors are based on mark and sweep somewhere along the line. (I'll go so far as to claim that copying is a modification of mark and sweep, with the two operations combined into one copy.)
Actually, if a GC algorithm isn't reliable, it's useless. A hell of a lot of care and attention is paid to proving GC algorithms correct. It's permissible for GC to err on the side of caution (we won't reclaim that block because it might still be in use), and a fair few GC methods do this (usually incremental and conservative algorithms) but reclaiming blocks in use is a complete no-no. And whilst I appreciate your points about copying, for one thing it's usually unnecessary in modern systems - tracing works just as well, if not better, and even Baker's copying method can be done without copying - and for another thing, in conservative systems you must trace, as copying would have undesirable side effects on those things which were not pointers...
In any case, you say `lazy' as if it were an insult. I would have thought that the good programmers were the lazy ones, the ones who aren't going to expend any unnecessary effort or have their programs do the same. Why build storage management more than once? If the OS or the language runtime has a totally reliable (and remember, it *must* be, or it's useless) GC system, why storage manage at all?
In any case, there's no harm in having a `free' call that you can call to put something on the collectable list straightaway - but when it becomes difficult to free something in an orderly fashion, why not let the infrastructure take the strain? It's what it's there for.
Garbage Collection considered harmful? (Score:1)
Garbage collection isn't necessary, but it makes the programmer's life much easier and simplifies your code by moving the memory management responsibilities out of the program and into the language runtime library (or VM, or whatever).
peter
Commercial garbage collection for C and C++ (Score:1)
Grrr... (Score:1)
I have NEVER seen a place where, if you were careful (as you should be) in designing your code, you couldn't have nice matching malloc()/free() pairs.
(Yes, I know in LISP and such GC is a must, but those languages don't even have the same concept of variables as a normal imperative language such as C)
-The Code Nazi
(now you see why some friends of mine
gave me that nickname...)
Grrr... (Score:1)
That is rather assuming... and self contradictory.
[snip - some example of program design]
My point is exactly that. If you have bad program design, GC is necessary, but well-thought-out design makes it unnecessary. In your example, all you need is some sort of semaphore between the threads, or to maintain your own reference counts internally, or some such.
If the library blindly mallocs such structures and gives the user a pointer to them, I have a few things to say about the library as well. I guess that could be OK, if the library never touched the structure again and there was an equivalent destructor function in the library (it becomes another malloc/free pair). If the library returns a pointer and continues to modify it, you not only have a GC problem, you will be clobbering data!
I guess the point is that while GC can be a time-saver (in not having to re-design sections of badly written code), it is also a crutch that bad programmers use far too often to cover their own laziness, and for really complex programs it doesn't solve the problem reliably! (I really hate it when structures and code change behind my back (hence my hate for C++))
-The Code Nazi
"No Overhead Rule" (Score:1)
Memory allocation and deallocation in C++ imposes runtime costs qualitatively no different from costs that you incur if you allocate memory in the presence of a GC. Quantitatively, GC is actually often faster.
The main reason C++ doesn't have GC is because C++ has C pointer semantics and because many C++ programmers habitually make a lot of unportable assumptions about the runtime and memory layout. That makes adding a GC to C++ tricky and would break lots of programs, programs that are not legal C++ programs but that still work.
The best compromise for GC in C++ is to use a conservative GC (like Boehm's GC). Boehm's GC may break some valid programs, but on the whole it lives fairly well within the framework of C and C++ and does not change C++ semantics too much. But in most languages other than C++, one can do considerably better with GC.
true no leaks with GC, but it has opposite problem (Score:1)
Java's Hello World does not take 80M of memory. In fact, Java runs in pretty small footprints and on the whole is probably smaller than current C/C++ runtimes. The only reason you don't see the C/C++ runtimes is because they are usually loaded at boot time these days and you consider them part of the OS.
comparing apples and oranges (Score:1)
Because Java isolates different software components from one another, you can run different applications and services within the same runtime, and that's really the right comparison to make (at least in the long run when more and more applications run in Java).
That may seem like a distant hypothetical future, but it actually already has a lot of practical significance. For example, when writing servlets in Java, the footprint of each additional servlet is tiny, yet servlets are isolated from one another as well as if they were running in separate processes. In contrast, CGI scripts and even shared library implementations of web services have enormous overhead.
Garbage collection leads to better programs. (Score:1)
Grrr... (Score:1)
I once worked on a compiler which did garbage collection this way. When you compile a procedure, the signature is persistent, but the body isn't. That is to say, the data structure used to represent the signature will remain forever, but the data structures used to represent the bodies are not useful after you have finished compiling the body. So, you allocate space for the persistent stuff in permanent storage, and space for the body in some other, ephemeral blocks of storage. When you are finished with the procedure, you can just collect all the ephemeral stuff and throw it out, without thinking about it. This is much faster than using reference counts, since you don't pay a penalty for each pointer assignment. Since the language was Ada, what was persistent in one context was ephemeral in another. C or C++ wouldn't have the same problem. I thought it was a neat idea, and wish that it was *my* idea.
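The scheme described above is essentially arena (or region) allocation: ephemeral structures go into a per-task pool that is discarded wholesale, with no per-object bookkeeping. A minimal sketch, with all names (`Arena`, `compile_procedure`, etc.) invented for illustration:

```python
# Sketch of the arena idea described above: scratch data for a
# compilation unit lives in an arena that is released in one step,
# while persistent data (the signature) goes in permanent storage.

class Arena:
    def __init__(self):
        self.objects = []
    def alloc(self, obj):
        self.objects.append(obj)
        return obj
    def release(self):
        # Drop every ephemeral object at once -- the "collection".
        self.objects.clear()

permanent = []                      # signatures live forever

def compile_procedure(name):
    body_arena = Arena()
    permanent.append(("signature", name))
    for i in range(3):              # scratch structures for the body
        body_arena.alloc(("body-node", name, i))
    body_arena.release()            # whole body freed in one step
    return len(body_arena.objects)

leftover = compile_procedure("foo")
print(leftover, len(permanent))     # -> 0 1
```

As the comment notes, there is no per-pointer cost at all: nothing is counted or traced, because the lifetime of the whole region is known in advance.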
A wonderful book (Score:1)
Leaks and segfaults are both bugs that need fixing (Score:1)
Try alloca() in C/C++ (Score:1)
Good book! (Score:1)
This is a really good book. It explains garbage collection in lots of detail and covers various implementation tradeoffs.
If you're not willing to read several hundred academic papers on this subject--and sort them all out--this book is definitely the best starting place.
Other languages (Score:1)
Wow. Somebody's posted an advertisement for our project. Since high praise always makes me nervous, let me try to deflect some of that enthusiasm in various directions. :-)
You'll also want to check out GNU Common Lisp, GNU Guile, Eiffel, ML, Prolog, Haskell and the various R4RS Scheme interpreters.
There's a lot of free (and GPL'd) programming languages out there, and some of them are pretty fast and impressively powerful.
Gwydion Dylan (Hans Boehm garbage collector) (Score:1)
Sure... but if you learned Dylan well enough to implement a Python compiler in Dylan, then you might find yourself losing interest in Python in favor of Dylan. :-)
That's not true! I help maintain a Dylan compiler, and I still like Python!
Gwydion Dylan (Hans Boehm garbage collector) (Score:1)
Well, my favorite language is Python, but I just wondered, is there any documentation about Dylan-to-C compiler ? The internals manuals aren't very verbose :-).
Oops. We're still filling those out as we find our way around the code.
Grrr... (Score:1)
GC and Timing (Score:1)
Reference Counting vs. Mark & Sweep (Score:1)
Hmm... That sounds odd... I imagine it's dependent on the type of program that's running... One with lots of idle time should be better off with GC; one with less should be better with manual deallocation.
Missing distributed GC treatment (Score:1)
A good paper on distributed GC came from one of the INRIA project OS's called COOL, but there have been several research papers on the topic.
Missing distributed GC treatment (Score:1)
Then they sure picked a funny title for section 12...