ICFP 2001 Programming Contest Notes

by Rob Mayoff

I entered the ICFP 2001 Programming Contest. My team name was "dqd", and I was the only team member. The program name was "press", which is short for "compress". Originally I called it smlngz, because z is the standard letter meaning "compression", but got tired of typing that. (Filename completion didn't work because smlngz collided with sml.h, one of my header files.)

I am the guy who asked FAQ #30, regarding the output speed. I specifically mentioned a disk file and a 9600 bps connection in my e-mail.

I wrote press in a combination of flex and C. The flex code parsed the input and compressed whitespace. The C code did everything else. The C code used a number of functions marked as "public domain" in the daemontools 0.76 distribution.

The submitted code, in a .tar.gz file. Actually, the tar file I submitted contained an executable program, but this tar file does not. You'll have to build it yourself.

The "fixed" code, which includes a bug fix and some additional enhancements that didn't make the deadline.

Safety

To ensure that press always produces some valid output no longer than the input, I took several steps:

Thus, press will virtually always produce some valid output before the time expires, and the output will never be longer than the original input.

Strategy

Aside from the basic whitespace optimizations (which I did in the flex parser), I know of two basic strategies for the task at hand. One, the "rebuild" strategy, is to parse the document into a representation of its meaning - an array of decorated characters - and then build a new output document from scratch by analyzing the array and deciding what tags to emit. The other, the "transform" strategy, is to perform transformations on the existing tag structure of the document. I use the rebuild strategy.

Parsing

For the rebuild strategy, the document can be viewed as a list of runs of characters. All of the characters in each run have the same decorations; adjacent runs may have completely different decorations. So my flex parser builds two arrays: one contains the document characters (with whitespace compressed) and the other contains the runs; each run consists of a context and the number of characters in the run. The context specifies how the characters in the run should be decorated; it consists of one byte for the color, one byte for the size, two bits for the U level, and one bit each for B, I, S, EM, and TT.

Note that there is no tag for the document's initial size or color. This means (due to the rebuild strategy) that I must consider the initial size to be distinct from all the sizes I can describe with tags; likewise for the initial color. In my internal representation, I simply store the ASCII character representing the size or color of the run, or \000 for the initial size or color. That's why I allocate a whole byte for each of size and color.
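As a rough illustration (not the actual declarations from press; the names here are made up), the run representation can be written as two small C structs:

    /* One decoration context.  A full byte each for size and color, where
       '\0' stands for the document's initial (undescribable) size or color;
       two bits for the U nesting level; one bit per boolean decoration. */
    struct context {
        unsigned char size;   /* '0'..'9', or '\0' for the initial size   */
        unsigned char color;  /* 'r', 'g', 'b', ..., or '\0' initially    */
        unsigned int u  : 2;  /* underline level 0..3                     */
        unsigned int b  : 1;
        unsigned int i  : 1;
        unsigned int s  : 1;
        unsigned int em : 1;
        unsigned int tt : 1;
    };

    /* One run: a count of consecutive document characters that all share
       the same context.  The characters themselves live in the separate
       character array, so a run is just a context plus a length. */
    struct run {
        struct context ctx;
        unsigned long nchars;
    };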

Optimizing

My basic optimization strategy is simple. The first run always has no decoration flags turned on, so I simply emit the characters in the first run. (The first run may be empty depending on the input document.) Then, for each run, I consider its context (the "current" context) and the next run's context. I also know what start tags I have emitted for which I have not yet emitted end tags, and what context will result from emitting each of those end tags; this constitutes the "context stack". My goal is to get into the next context as cheaply as possible. So I consider all the ways I can get to that context by emitting start and end tags, emit the cheapest combination, and update the context stack to reflect those tags.

In deciding what combination of tags will be cheapest, I have a lookahead limit; initially the limit is one and I double it on each optimization pass until I run out of time or memory. To evaluate each tag combination, I consider its combined cost. The cost of a tag combination is the length of the start tags to emit, plus the length of the end tags I will eventually have to emit to match those start tags. Examples:

Tags       Cost
<I>        7    3 for "<I>" and 4 for the "</I>" that will eventually match it.
<PL><I>    16   4 + 5 for the PL pair, plus 3 + 4 for the I pair.
</2></r>   0    End tags cost nothing to emit; I considered their cost when I emitted their corresponding start tags.
</2><r>    7    0 for "</2>", plus 3 + 4 for "<r>" and its eventual "</r>".

If my lookahead limit is 1, then the combined cost of a tag combination is just its own cost, as computed above. If my lookahead limit is 2, then I also consider the tag combinations I will have to emit to get from the next context to the context after that, and add the cheapest of those to the cost of the tag combination under consideration. If my lookahead limit is 3, then I also account for the tag combination after that. In other words, the lookahead limit means exactly what you think it means.
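As a concrete illustration of the cost rule (not code from press), the following self-contained C program reproduces the table above: a start tag is charged its own length plus the length of the end tag that will eventually match it, and end tags themselves are charged nothing.

    #include <stdio.h>
    #include <string.h>

    /* Cost charged when start tag <name> is emitted: the length of "<name>"
       plus the length of the "</name>" that will eventually match it.
       End tags are charged nothing when they are actually emitted. */
    static int start_tag_cost(const char *name)
    {
        return (int)strlen(name) + 2    /* "<" name ">"  */
             + (int)strlen(name) + 3;   /* "</" name ">" */
    }

    int main(void)
    {
        printf("<I>       %d\n", start_tag_cost("I"));                        /* 7  */
        printf("<PL><I>   %d\n", start_tag_cost("PL") + start_tag_cost("I")); /* 16 */
        printf("</2></r>  %d\n", 0);                                          /* 0  */
        printf("</2><r>   %d\n", 0 + start_tag_cost("r"));                    /* 7  */
        return 0;
    }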

Tag Combinations

Of course, I need an algorithm for enumerating the tag combinations that I wish to evaluate for emission. Here is the algorithm:

If the current context does not have the initial size or color, but the next context does, then all tag combinations must begin with sufficient end tags (as dictated by the context stack) to reach the initial size and/or color as necessary.

Then, if the current context has any decoration flags set that are not set in the next context, I must emit either a PL start tag or sufficient end tags (as dictated by the context stack) to turn all of those flags off. I consider both possibilities.

Then I can consider emitting any number of additional end tags, up to the number on the context stack. I call these "optional pops". I try each of those possibilities, unless I detect one of two situations:

  1. If I just emitted a PL start tag, I don't allow any optional pops.
  2. If n optional pops would leave me in a context where additional pops (or a PL tag) are required to reach the next context, then I skip n optional pops and go on to n+1. For example, if the context stack says "<B><I><PL><U>", and I want to be in a context of B, then I do not consider emitting "</U></PL>" because more pops (or a PL tag) would be necessary to remove the I state.

Then I determine what start tags I must emit to reach the next context, given all of the above pushes and pops. If the lookahead limit is 1, I just count the cost of emitting them in a hardcoded order. If the lookahead limit is larger than 1, I consider each order in which I can emit the start tags, determine the minimum cost of then reaching the context after the next context, and so on until the lookahead limit is reached.
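As a concrete illustration of rule 2 above (again not code from press, and with contexts simplified to just the boolean decorations), the following program replays the remaining stack to find the context left after n optional pops, and skips any n that leaves a decoration set which the next context does not want:

    #include <stdio.h>
    #include <string.h>

    /* Contexts simplified to a set of decoration flags.  The context stack
       is a string of tag letters, bottom to top; 'P' stands for PL, which
       clears every flag, and any other letter sets its own flag. */
    static unsigned context_after(const char *stack, size_t depth)
    {
        unsigned flags = 0;
        for (size_t j = 0; j < depth; j++) {
            if (stack[j] == 'P')
                flags = 0;
            else
                flags |= 1u << (stack[j] - 'A');
        }
        return flags;
    }

    int main(void)
    {
        const char *stack = "BIPU";          /* <B><I><PL><U>, bottom to top */
        unsigned want = 1u << ('B' - 'A');   /* next context: just B         */
        size_t depth = strlen(stack);

        for (size_t n = 0; n <= depth; n++) {
            unsigned left = context_after(stack, depth - n);
            if (left & ~want)
                printf("%zu pops: skipped, unwanted decorations remain\n", n);
            else
                printf("%zu pops: considered\n", n);
        }
        return 0;
    }

For the example from rule 2, this reports 2 pops ("</U></PL>") as skipped, because the I state would still be set.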

Once I have considered all of the tag combinations and determined the cheapest, I emit it and move on to the next state. When I start on the next state, I don't reuse any of the lookahead work I already did.

Memory Management

I need to allocate memory dynamically to store the tag combinations I'm considering and the resulting context stacks. For complex documents or large lookaheads, this will be a large amount of memory.

I store the tag combinations in dynamically-sized strings which I allocate and free as I go.

For the context stacks, I use a special-purpose allocator/garbage collector. The context stack is stored as a linked list, and context stacks for tag combinations under consideration share tails. The allocator allocates a large number of stack elements in advance and doles them out very rapidly (essentially by incrementing a pointer). I call the garbage collector after I finish computing the tag combination to emit for the next run. I give the garbage collector a pointer to the top of the new context stack. The GC relocates all of the elements in the stack to the beginning of the arena, marks the rest of the arena as available, and returns the new pointer to the relocated top of the context stack. This makes allocation O(1) and garbage collection O(N), where N is the depth of the context stack.
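As a sketch of that scheme (a reconstruction of the idea, not the code from press), stack elements come out of a preallocated arena by bumping a counter, and the collector copies the single surviving stack to the front of the arena:

    #include <stdio.h>
    #include <stdlib.h>

    struct elem {
        char tag;             /* which start tag this stack entry represents */
        struct elem *next;    /* shared tail: the rest of the context stack  */
    };

    #define ARENA_SIZE 4096
    static struct elem arena[ARENA_SIZE];
    static size_t arena_used;

    /* O(1) allocation: put a new element on top of an existing stack by
       bumping a counter into the preallocated arena. */
    static struct elem *push(char tag, struct elem *tail)
    {
        if (arena_used == ARENA_SIZE) {
            fprintf(stderr, "arena exhausted\n");
            exit(1);
        }
        struct elem *e = &arena[arena_used++];
        e->tag = tag;
        e->next = tail;
        return e;
    }

    /* O(N) collection, N = depth of the surviving stack: copy the live
       elements to the front of the arena, relink them, and mark the rest
       of the arena as available again. */
    static struct elem *collect(const struct elem *live)
    {
        char tags[ARENA_SIZE];
        size_t n = 0;
        for (const struct elem *e = live; e != NULL; e = e->next)
            tags[n++] = e->tag;
        for (size_t j = 0; j < n; j++) {
            arena[j].tag = tags[j];
            arena[j].next = (j + 1 < n) ? &arena[j + 1] : NULL;
        }
        arena_used = n;
        return n ? &arena[0] : NULL;
    }

    int main(void)
    {
        struct elem *stack = NULL, *candidate;

        stack = push('B', stack);
        stack = push('I', stack);
        candidate = push('U', stack);  /* considered during lookahead...      */
        (void)candidate;               /* ...but not chosen, so it is garbage */
        stack = push('r', stack);      /* the combination actually emitted    */

        printf("allocated: %zu elements\n", arena_used);  /* 4 */
        stack = collect(stack);
        printf("live after GC: %zu elements, top tag %c\n",
               arena_used, stack->tag);                   /* 3, r */
        return 0;
    }

Allocation is just a bounds check and an increment, and collection touches only the surviving elements, which is where the O(1)/O(N) behavior comes from.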

Bugs

The version of press that I submitted has a specific algorithmic error and various other flaws.

The algorithmic error is that it abandons the lookahead process if it needs to emit only a single tag (other than possibly a PL) to get to the next state, and in those cases that lookahead path is considered to have zero cost. This means that press will produce suboptimal output in these cases. The fix is simple: remove "&& ntags > 1" from line 100 of optimize.c. The "fixed" code above includes this fix.

One flaw is that press uses a very large amount of memory because it garbage collects so rarely. It can be made to garbage collect much more often.

press also formats tag combinations into strings unnecessarily. When press is looking ahead past the tag combination needed to get to the next state, it does not actually need to generate the string representation of the lookahead tag combinations, but it does anyway. This wastes additional memory.

In its submitted state, it quickly grows to its 250 MB limit in only a few levels of lookahead and then exits. With smarter garbage collection and suppression of tag generation during lookahead, it stays under 200 KB even for very complex inputs; it becomes entirely CPU-bound. The "fixed" code above includes these changes.

Another flaw is in its permutation generator. If press needs to push <B>, <I>, and two <U> tags, it considers 4! = 24 orderings for those tags, but there are only 4!/2! = 12 distinguishable orderings because the two U tags are identical. If press needs to generate three U tags, the waste is even greater (a factor of six instead of two).
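One way to avoid that waste (illustrative only, not what the submitted code does) is to generate only the distinct permutations, for example by never choosing the same tag letter twice at the same position:

    #include <stdio.h>
    #include <string.h>

    /* Print every distinct ordering of the tag letters in set[].  At each
       position, each distinct letter is tried at most once, so the two U
       tags in "BIUU" count as one choice rather than two. */
    static void permute(char *set, int n, char *out, int pos)
    {
        if (pos == n) {
            printf("%.*s\n", n, out);
            return;
        }
        char tried[256] = {0};
        for (int j = 0; j < n; j++) {
            unsigned char c = (unsigned char)set[j];
            if (c == 0 || tried[c])
                continue;          /* slot already used, or letter already tried */
            tried[c] = 1;
            out[pos] = set[j];
            set[j] = 0;            /* mark this slot as used */
            permute(set, n, out, pos + 1);
            set[j] = (char)c;      /* restore it */
        }
    }

    int main(void)
    {
        char tags[] = "BIUU";      /* push <B>, <I>, and two <U> tags */
        char out[sizeof tags];
        permute(tags, (int)strlen(tags), out, 0);
        return 0;
    }

For "BIUU" this prints the 12 distinguishable orderings rather than 24.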

I have thought of various ways to reduce the number of lookahead tag combinations to consider that I did not have time to implement. The "fixed" code above contains some of these.

Other Stuff

The dumpdecorated program uses the same flex parser as press but simply dumps the decorated characters to standard output. The try shell script runs press on a file and then uses dumpdecorated and cmp to validate the press output.

The testpermute program calls the permutation generator and writes the results to standard output. The runtestpermute shell script runs testpermute and validates the output.

The maketest program takes two arguments, size and depth. It generates a random but valid SML/NG output stream of approximately size bytes with a maximum tag nesting depth of depth.

The first line of conf-cc contains the name of the compiler and the compiler flags to use to generate a .o file from a .c file. The remaining lines are not used.

The first line of conf-link contains the name of the linker and the linker flags to use to generate an executable file from .o files. The remaining lines are not used.

The first line of conf-flex contains the name of the flex program and the flex options to use to generate a .c file from a .l file. The remaining lines are not used.

The make-makefile script generates Makefile from the *=x files (each of which contains a list of .o filenames) and Makefile.tail. You must re-run make-makefile if you add or remove a *=x file or change the #include files used by any .c or .l file. You do not need to re-run make-makefile after modifying conf-cc, conf-link, conf-flex, an existing *=x file, or Makefile.tail.

mayoff@dqd.com