
A small note on the Elias gamma code

This is a small one, for all compression geeks out there.

The Elias gamma code (closely related to Exponential-Golomb codes, or Exp-Golomb codes for short) is very useful for practical compression applications; it certainly crops up in a lot of modern video coding standards, for example. The mapping goes like this:

  1 -> 1
  2 -> 010
  3 -> 011
  4 -> 00100
  5 -> 00101
  6 -> 00110
  7 -> 00111
  8 -> 0001000

and so on (Exp-Golomb with k=0 is the same, except the first index encoded is 0 not 1, i.e. there’s an implicit added 1 during encoding and a subtracted 1 during decoding).

Anyway, if you want to decode this quickly, and if your bit buffer reads bits from the MSB downwards, shifting after every read so the first unconsumed bit is indeed the MSB (there’s very good reasons for doing it this way; I might blog about this at some point), then there’s a very nifty way to decode this quickly:

  // at least one of the top 12 bits set?
  // (this means less than 12 leading zeros)
  if (bitbuffer & (~0u << (32 - 12))) {
    // this is the FAST path    
    U32 num_zeros = CountLeadingZeros(bitbuffer);
    return GetBits(num_zeros * 2 + 1);
  } else {
    // coded value is 25 bits or longer, need to
    // use slow path (read however you would read
    // gamma codes regularly)
  }

Note that this is one CountLeadingZeros and one GetBits operation. The GetBits reads both the zeros and the actual value in one go; it just so happens that the binary value of that is exactly the value we need to decode. Not earth-shaking, but nice to know. In practice, it means you can read such codes very quickly indeed (it’s just a handful of opcodes on both x86 with bsr and PowerPC with cntlzw or cntlzd if you’re using a 64-bit buffer).

This is assuming a 32-bit shift register that contains your unread bits, refilled by reading another byte whenever there's 24 or less unconsumed bits available (a fairly typical choice). If you have 64-bit registers, it's usually better to keep a 64-bit buffer and refill whenever the available number of bits is 32 or less. In that case you can read gamma codes of up to 33 bits in length (i.e. 16 leading 0 bits) directly.
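For reference, the 64-bit variant might look like this. This is just a sketch under the assumptions from the previous paragraph (64-bit MSB-first buffer, refill at 32 or fewer remaining bits); CountLeadingZeros64 and DecodeGammaSlow are hypothetical helpers, not part of the original code:

  // at least one of the top 17 bits set?
  // (this means at most 16 leading zeros, i.e. the full
  // code is at most 33 bits and already in the buffer)
  if (bitbuffer & (~0ull << (64 - 17))) {
    // FAST path
    U32 num_zeros = CountLeadingZeros64(bitbuffer);
    return GetBits(num_zeros * 2 + 1);
  } else {
    // coded value is 35 bits or longer; fall back to a
    // regular bit-by-bit gamma decoder
    return DecodeGammaSlow();
  }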



x86 code compression in kkrunchy

This is about the “secret ingredient” in my EXE packer kkrunchy, which was used in our (Farbrausch) 64k intros starting from “fr-030: Candytron”, and also in a lot of other 64k intros or similarly size-limited productions by other groups – including several well-known 64ks from other groups such as Conspiracy, Equinox and Fairlight, I’m happy to add.

There are two primary versions of kkrunchy. The first one was developed between 2003 and 2005 and uses a fairly simple LZ77 coder with arithmetic coding backend, designed to have good compression and a small depacker (about 300 bytes when using hand-optimized x86 code). If you’re interested in the details, a simplified C++ version of the depacker that I then used to write the ASM code from is still available here (of the various programs I’ve written, this is still one of my favorites: it’s just nice to have something so fundamentally simple that’s actually useful).

The second was developed roughly between 2006 and 2008 and uses PAQ-style context mixing. To be precise, it’s a spin-off borrowing a lot of ideas from PAQ7 including its neural network mixer, but with a custom model design optimized to use a lot less memory (around 5.5mb instead of hundreds of megabytes) and run a lot faster. It’s still dog slow, but it managed to process >120kb/s on my then-current P4 2.4GHz instead of the roughly 5kb/s that PAQ7 managed on the same machine when targeting roughly the same compression ratio. The depacker was optimized for both size and speed (with this much computation per bit, making it 4x slower for the sake of saving a few bytes is a bad deal!), and took up a comparatively hefty 1.5k, but normally more than made up for it. For example, our 96k shooter “kkrieger” shrunk by over 10k when recompressed with the new engine (if only we’d had that before, it would’ve saved us a lot of work in the last week before release and prevented some very embarrassing bugs).

Anyway, both packing engines were competitive with the state of the art in EXE compression for the given target sizes when they were first released, but neither were in any way spectacular, and both were biased towards simplicity and tweakability, not absolutely maximum compression. That’s because unlike most other EXE packers, kkrunchy doesn’t rely solely on its backend compression engine to bring in the space savings.

Preprocessing code for fun and profit

Especially 64k intros (the original target for kkrunchy) usually have a fairly large amount of the overall executable size devoted to code, and a much smaller amount spent on content (since 64ks are usually one-file packages, this content comes in the form of either statically initialized data or resource sections). A typical ratio for us was between 48k and 56k of the final compressed executable spent on code and overhead (you still need OS-parseable file headers and some way of importing DLLs), with the rest being data. Clearly, if you want to really make a dent in the packed size, you should concentrate on code first. This is especially true because code always comes in the same format (executable x86 opcodes) while the composition and structure of data in the executable varies a lot.

If you have a context modeling based compressor, one way to inject “domain knowledge” about x86 code into the compressor is to include models specifically designed to analyze x86 opcodes and deal with them. This is the approach taken by recent PAQ versions, and also what you’ll find in the PPMexe paper (I’ll come back to this in a bit). In context coders, this is probably the most direct way to inject domain knowledge into the (de)compressor, but it has two problems: first, that domain knowledge translates to a lot of extra code you need to carry around in your decompressor, and second, you can’t do it if you’re not using a context coder! (Remember the first version of kkrunchy used a plain LZ+Ari backend).

So kkrunchy instead uses a filter on x86 code. The compressor first takes the original x86 code, passes it through the filter (this had better be a reversible process!), and then hands it to the main compression engine. The decompressor first undoes the actual compression, then reverse-applies the filter in a second step. This is something done by most EXE compressors, but all others (that I know of) use very simple filters, mainly intended to make jump and call offsets more compressible. kkrunchy looks deeper than that; it actually disassembles the instruction stream, albeit only very roughly. This is based on the “split-stream” algorithm in the PPMexe paper (even though the paper only appeared in a journal in 2007, an earlier version was available as a MS Research Technical Report in 2005, when I first read it).

Split-stream encoding

Unlike most modern, RISC-influenced instruction encodings, x86 doesn’t have fixed-length instruction words, but instead uses a variable-length encoding that leads to higher code density but is full of exceptions and special cases, making instruction decoding (or even the simpler problem of just determining instruction lengths) fairly complicated. An x86 instruction can have these components:

  • Zero or more prefixes. A prefix is a single byte that can influence the way an operation is performed (e.g. REP, LOCK), the size of an operand (the operand-size prefix 0x66 turns a 16-bit operation into a 32-bit operation and vice versa), the size of an addressing computation (the address-size prefix 0x67 which does the same for address calculations) or the source/target of a read/write operation (via segment prefixes). There’s also a set of extra prefixes in 64-bit mode (the so-called REX byte), and finally some of the prefixes are also used to extend the opcode byte for newer instructions (e.g. SSE/SSE2).
  • The actual opcode, which encodes the operation to be performed (duh). “Classic” x86 has one-byte and two-byte opcodes. Two-byte opcodes have 0x0f as the first byte; everything else is a one-byte opcode. (More recent instructions also have three-byte opcodes that start with either 0x0f 0x38 or 0x0f 0x3a.)
  • A ModR/M byte, which encodes either the addressing mode or registers used for an instruction.
  • A SIB byte (scale-index-base), used to extend the ModR/M byte for more advanced 386+ addressing modes like [esi+eax*4+0x1234].
  • A displacement (usually 8 or 32 bits, signed, in 32-bit code) which is just an immediate offset added to the result of an address calculation (like the “0x1234” in the example above). There’s also absolute address mode (e.g. mov [0x01234567], 0x89abcdef) which is not relative to any base register.
  • An immediate operand. Again, usually 8 or 32 bits in 32-bit code, but there are some exceptions. Some ops (like add eax, 42 or mov [edi], 23) have an immediate operand. This field is separate from the displacement; both can occur in the same instruction.

What the split-stream filter does is this: It splits x86 instructions (which are an interleaved sequence of these bytes with different meanings) into separate streams based on their function. The idea is that there’s lots of similarity between values in the same stream, but very little relation between values in different streams: Especially compiler-generated code won’t use all opcodes (some are a lot more frequent than others), but 8-bit jump offsets are almost completely random. Small 32-bit immediates are a lot more common than large ones; most 32-bit displacements are positive, but stack-based references (based off either ESP or EBP, e.g. used to access function parameters or local variables) are often negative. SIB bytes look different from ModRM bytes (or opcode bytes), and some are a lot more common than others. And so on. kkrunchy doesn’t split all of these into distinct groups (for example, the prefix, opcode and ModR/M bytes tend to be heavily correlated, so they’re kept all in one stream) and splits some of them even further: jump offsets are special 32-bit displacements, as are call offsets, and for 8-bit displacements, it pays to have separate streams based on the “base register” field of the ModR/M byte, since offsets used off the same register – usually containing a pointer – tend to be correlated. But the basic idea is just to split instructions into their components and encode the streams separately. This is done by just keeping them in distinct regions, so we need to add a header at the beginning that tells us how long each stream is (a bit of overhead there).
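To make the output layout concrete, here is a rough sketch; the stream names and the exact set of streams are made up for illustration and don't match kkrunchy's actual format. The filtered output is just a small header of stream lengths followed by the stream contents back to back:

#include <stdint.h>
#include <stdio.h>

// Illustrative split-stream output layout (not kkrunchy's actual format).
enum Stream {
    STRM_OPS,      // prefixes + opcodes + ModR/M bytes
    STRM_SIB,      // SIB bytes
    STRM_DISP8,    // 8-bit displacements (one stream per base reg in practice)
    STRM_DISP32,   // 32-bit displacements
    STRM_JUMP32,   // 32-bit jump targets
    STRM_CALL32,   // 32-bit call targets
    STRM_IMM,      // immediate operands
    STRM_COUNT
};

static void write_filtered(FILE *out, uint8_t *const streams[STRM_COUNT],
                           const uint32_t lens[STRM_COUNT])
{
    // header: length of each stream...
    fwrite(lens, sizeof(uint32_t), STRM_COUNT, out);
    // ...then the streams themselves, stored back to back
    for (int i = 0; i < STRM_COUNT; i++)
        fwrite(streams[i], 1, lens[i], out);
}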

One thing worth noting is that the regular (non-far, non-indirect) JMP (plus conditional variants) and CALL instructions use displacements, not absolute target addresses; i.e. they encode the distance between the target instruction and the end of the current (call/jump) instruction, not the actual target address. This is intentionally done to greatly reduce the number of relocations that need to be made if some piece of code needs to be moved to an address other than its preferred load address in memory. However, it also means that code that repeatedly jumps to or calls the same target address (e.g. a frequently-called function) will have a different encoding of the target address every time, since the displacement depends on the address of the call instruction! For compression purposes, we’d like calls to the same target instruction to result in the exact same encoding every time, i.e. convert the offsets into absolute addresses. This is fairly easy to do – just add the address of a call instruction to the stored displacement (of course, you need to undo this after decompression or the code will jump somewhere wrong). This is the one thing that most executable compressors (or general-purpose compressors that have special modes for x86 code, for that matter) will do on their own: Search for byte sequences that look like a CALL or JMP instruction in the byte stream, and add the current offset to the 4 bytes following it. Then do the same thing in reverse on decompression. Very simple trick, but huge effect on the compression ratio: doing this one transform before compression usually reduces the final compressed size by about 10%, both for LZ-based and context-based coders.
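A minimal version of that byte-level trick (the naive scan used by general-purpose compressors, not kkrunchy's disassembler-based filter) might look like this; it simply pattern-matches the CALL rel32 opcode 0xE8 and treats the 4 bytes after it as the displacement:

#include <stdint.h>
#include <string.h>
#include <stddef.h>

// Naive CALL rel32 transform: convert relative displacements to absolute
// targets before compression (encode=1) and back afterwards (encode=0).
// Sketch only; real filters also handle JMP/Jcc and worry about false
// positives. Assumes a little-endian host for brevity (we're dealing
// with x86 code anyway).
static void call_transform(uint8_t *code, size_t size, int encode)
{
    for (size_t i = 0; i + 5 <= size; i++) {
        if (code[i] == 0xE8) { // CALL rel32
            uint32_t disp;
            memcpy(&disp, &code[i + 1], 4);       // little-endian displacement
            uint32_t end = (uint32_t)(i + 5);     // offset of next instruction
            disp = encode ? disp + end : disp - end;
            memcpy(&code[i + 1], &disp, 4);
            i += 4;                               // skip the displacement bytes
        }
    }
}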

Fun with calls

Function calls are special: A lot of functions will get called very often indeed. You can just rely on the relative-to-absolute offset transform and your backend encoder to handle this; but when developing kkrunchy I found that it actually pays to special-case this one a bit. So instead I keep a 255-entry LRU cache (or a FIFO for simplicity in earlier kkrunchy versions) of recent call sites. If a call is repeated, I only need to store the index; otherwise I write a “255” byte to denote a cache miss, then code the full 32-bit target address and add it to the cache. The cache indices aren’t correlated with the verbatim call offsets, so the “cache index” and “verbatim call offset” fields are written into separate streams.
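In (simplified, hypothetical) code, the encoder-side bookkeeping might look roughly like this – a FIFO variant for brevity, not the actual kkrunchy implementation:

#include <stdint.h>

// 255-entry FIFO of recent call targets (illustrative sketch only; the
// real kkrunchy code uses an LRU and its own stream writer).
typedef struct {
    uint32_t targets[255];
    int      next;          // next slot to replace
} CallCache;

// Returns the byte to emit into the "cache index" stream; if it's 255
// (a miss), the caller also writes the full 32-bit target into the
// "verbatim call offset" stream. The decoder maintains an identical
// cache so the indices stay in sync.
static int cache_lookup_or_insert(CallCache *c, uint32_t target)
{
    for (int i = 0; i < 255; i++)
        if (c->targets[i] == target)
            return i;                       // hit: 1 byte instead of 4
    c->targets[c->next] = target;           // miss: remember it for next time
    c->next = (c->next + 1) % 255;
    return 255;                             // escape code for "miss"
}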

There’s two more tricks to the function call cache: First, code that follows a RET instruction has a good likelihood of being the start of a new function, and second, VC++ pads the unused space between functions (for alignment reasons) with 0xcc bytes which correspond to the INT3 opcode (software breakpoint). Since INT3 opcodes don’t appear in most functions, any instruction following one or more INT3 ops is also likely to start a new function. All such potential function addresses are also speculatively added to the function address cache. A false positive will throw an address that hasn’t been used for at least 255 subsequent CALL instructions out of the cache – not a big deal. OTOH, if our guess turns out to be right and we’ve found the start of a function that gets called soon, that’s 32 bits we never need to encode now.

Typical hit rates for the function address cache are between 70 and 80 percent, and every hit saves us 3 bytes. Trust me, there’s a lot of calls in most pieces of x86 code, and this is a big deal. This is the only transform in the code preprocessor that results in a lower number of bytes to be encoded; all other steps of the transform are either 1:1 in terms of bytes or add some overhead (headers, escape codes). Still, running the filter usually reduces the size of the file by a few percent even before any actual compression is done.

Dealing with non-code bytes

The original PPMexe algorithm assumes that it’s only actually running on code, and that it knows about all basic blocks in the program (i.e. all potential jump targets are known). It will then reorder instructions to get better compression. This is not an option when dealing with arbitrary programs; for one, we don’t know where the basic blocks are (the compiler doesn’t tell us!). We also don’t want to change the program in any way (even ignoring potential performance penalties, just think about the kind of problems arbitrary reordering of instructions could introduce in lock-free code…), and finally, it’s not generally a safe assumption that the code section only contains code, or that the compressor knows about all x86 instructions (after all, there’s new ones being added every so often, and we can’t generally parse them correctly without knowing about them). Luckily, x86 disassembling is mostly self-synchronizing (even if you get a few instructions wrong, you will generally be in sync again after a few instructions unless the code was explicitly designed to be “ambiguous”), so we don’t need to sweat this too much. We do, however, need to be able to encode instruction bytes that we can’t disassemble properly.

Luckily, there’s a few one-byte opcodes that basically never occur in any real program; we can use any of them as an “escape code”. We then flag that instruction as invalid in our disassembler table; if the program actually contains that instruction, it will need to use an escape code to do so, but that’s fine. kkrunchy originally used INTO (0xce) as an escape opcode that just means “copy the next byte verbatim”; this is supposed to trigger an interrupt if arithmetic overflow occurred in the previous flag-setting instruction. I’ve never seen this in any real program; certainly not a compiled one. Should kkrunchy ever run across it, it can still encode that instruction byte using an escape: INTO turns into 0xce 0xce. No problem. More recent versions use the ICEBP opcode (0xf1) as the escape, which is even more obscure. It only exists on a few x86 models, and again, I’ve never seen it actually used anywhere.

One piece of non-code that tends to occur in code sections is jump tables for switch statements. Luckily they’re fairly easy to detect with good accuracy: If we have at least 12 consecutive bytes, starting at an address divisible by 4, that look like addresses in the code (i.e. are between the “code start” and “code end” addresses which are assumed to be known), we use another escape opcode to denote “jump table with N entries follows”, where N-1 is encoded as a single-byte value. (If a jump table has more than 256 entries, we just chain several such escapes.)
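A rough sketch of that detection heuristic (illustrative only; thresholds and details differ from the real implementation):

#include <stdint.h>
#include <stddef.h>

// Count how many consecutive dwords starting at "pos" (which the caller
// ensures is 4-aligned) look like code addresses, i.e. lie within
// [code_start, code_end). 3 or more such entries (12 bytes) are treated
// as a jump table; anything shorter is ignored.
static size_t jump_table_entries(const uint8_t *code, size_t size, size_t pos,
                                 uint32_t code_start, uint32_t code_end)
{
    size_t count = 0;
    while (pos + 4 <= size) {
        uint32_t val = (uint32_t)code[pos]            |
                       ((uint32_t)code[pos + 1] << 8) |
                       ((uint32_t)code[pos + 2] << 16)|
                       ((uint32_t)code[pos + 3] << 24);
        if (val < code_start || val >= code_end)
            break;
        count++;
        pos += 4;
    }
    return (count >= 3) ? count : 0; // need at least 12 bytes' worth
}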

Between the two tricks, unknown instructions or junk in the code section doesn’t usually cause a problem.

A box inside a box inside a box…

Because the kkrunchy code preprocessing filter does more than most such filters, it’s also bigger (around 900 bytes of code+data combined in the last version I used). But since it runs after the main decompression phase, there’s no need to pay that cost in full: The filter code can be compressed along with the filtered data! This shaves off another few hundred bytes at the cost of one extra jump at the end of the main decompression code. All other code that runs after decompression (e.g. import processing) can also be stored in compressed format. This is a bit like the “bootstrapping” process used for some C compilers: The core part of a C compiler (syntactic analysis and a simple code generator for the target machine) is written in a very portable subset of C. This core C compiler is compiled using whatever other C compiler is currently present on the system, with optimizations disabled. The result is the stage-1 compiler. The stage-1 compiler is then used to compile a fully-featured version of our C compiler (including extensions and a fully-featured code generator with optimization), creating the stage-2 compiler. This stage-2 compiler is “feature complete”, but may be slow because it was compiled by the stage-1 compiler which doesn’t have all optimizations, so the stage-2 compiler compiles itself again (with full optimization this time!) to produce the stage-3 compiler, which is the final result (some compilers only use a stage-2 bootstrap process, and some cross compilers may use more stages, but you get the general idea). You can chain compression and filtering stages the same way; indeed, at some point I was bothered by the 1.5k spent on the context modeling depacker and considered adding a very simple pure LZ at the start of the chain (to unpack the unpacker…), but the best I managed was an overall gain of less than 100 bytes (after subtracting the “stage-0” depacker size and overhead), so I didn’t pursue that any further.

So how much does it help?

All these techniques combined typically reduce final compressed size by about 20% compared to no preprocessing at all, with both the LZ-based and the context model coder (this is true not just with the kkrunchy backends, but also with other compressors like ZIP, RAR, LZMA or PAQ). Compare to the simpler “call offsets only” preprocessing, which typically reduces size by about 10%. That’s clearly the low-hanging fruit right there, but an extra 10% isn’t to be sneezed at, particularly since both the encoding and decoding are really quite easy and quick.

If you’re interested in the details, I made a (cleaned-up) version of the preprocessor available a while ago here (that code is placed into the public domain); there’s extra documentation with some more details in the code. The biggest lacking feature is probably x86-64 support. It shouldn’t be too hard to add, but I never got around to it since I stopped doing size-optimized stuff a while ago, and besides, even now, 64-bit x86 apps are relatively rare.

Anyway, if you have any questions or are interested in extending/using that code somewhere, just drop me a mail (or reply here). I’d certainly like to see it being useful in other contexts!


rANS with static probability distributions

In the previous post, I wrote about rANS in general. The ANS family is, in essence, just a different design approach for arithmetic coders, with somewhat different trade-offs, strengths and weaknesses than existing coders. In this post, I am going to talk specifically about using rANS as a drop-in replacement for (static) Huffman coding: that is, we are encoding data with a known, static probability distribution for symbols. I am also going to assume a compress-once decode-often scenario: slowing down the encoder (within reason) is acceptable if doing so makes the decoder faster. It turns out that rANS is very useful in this kind of setting.

Review

Last time, we defined the rANS encoding and decoding functions, assuming a finite alphabet \mathcal{A} = \{ 0, \dots, n - 1 \} of n symbols numbered 0 to n-1.

C(s,x) := M \lfloor x/F_s \rfloor + B_s + (x \bmod F_s)
D(x) := (s, F_s \lfloor x/M \rfloor + (x \bmod M) - B_s) where s = s(x \bmod M).

where F_s is the frequency of symbol s, B_s = \sum_{i=0}^{s-1} F_i is the sum of the frequencies of all symbols before s, and M = \sum_{i=0}^{n-1} F_i is the sum of all symbol frequencies. Then a given symbol s has (assumed) probability p_s = F_s / M.

Furthermore, as noted in the previous post, M can’t be chosen arbitrarily; it must divide L (the lower bound of our normalized interval) for the encoding/decoding algorithms we saw to work.

Given these constraints and the form of C and D, it’s very convenient to have M be a power of 2; this replaces the divisions and modulo operations in the decoder with bit masking and bit shifts. We also choose L as a power of 2 (which needs to be at least as large as M, since otherwise M can’t divide L).

This means that, starting from a reference probability distribution, we need to approximate the probabilities as fractions with common denominator M. My colleague Charles Bloom just wrote a blog post on that very topic, so I’m going to refer you there for details on how to do this optimally.

Getting rid of per-symbol divisions in the encoder

Making M a power of two removes the division/modulo operations in the decoder, but the encoder still has to perform them. However, note that we’re only ever dividing by the symbol frequencies F_s, which are known at the start of the encoding operation (in our “static probability distribution” setting). The question is, does that help?

You bet it does. A little known fact (amongst most programmers who aren’t compiler writers or bit hacking aficionados anyway) is that division of a p-bit unsigned integer by a constant can always be performed as fixed-point multiplication with a reciprocal, using 2p+1 bits (or less) of intermediate precision. This is exact – no round-off error involved. Compilers like to use this technique on integer divisions by constants, since multiplication (even long multiplication) is typically much faster than division.

There are several papers on how to choose the “magic constants” (with proofs); however, most of them are designed to be used in the code generator of a compiler. As such, they generally have several possible code sequences for division by constants, and try to use the cheapest one that works for the given divisor. This makes sense in a compiler, but not in our case, where the exact frequencies are not known at compile time and doing run-time branching between different possible instruction sequences would cost more than it saves. Therefore, I would suggest sticking with Alverson’s original paper “Integer division using reciprocals”.

The example code I linked to implements this approach, replacing the division/modulo pair with a pair of integer multiplications; when using this approach, it makes sense to limit the state variable to 31 bits (or 63 bits on 64-bit architectures): as said before, the reciprocal method requires 2p+1 bits working precision for worst-case divisors, and reducing the range by 1 bit enables a faster (and simpler) implementation than would be required for a full-range variant (especially in C/C++ code, where multi-precision arithmetic is not easy to express). Note that handling the case F_s=1 requires some extra work; details are explained in the code.
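To illustrate that the reciprocal trick really is exact, here is a simpler 64-bit variant of the same idea – this is not the 32×32-multiply scheme used in the linked example code, just a sketch, and it relies on the GCC/Clang __int128 extension:

#include <stdint.h>

// Exact division of a 32-bit x by a constant 32-bit freq via a
// precomputed "rounded-up" 64-bit reciprocal.
typedef struct {
    uint64_t rcp;   // ceil(2^64 / freq)
    uint32_t freq;
} Reciprocal;

static Reciprocal make_reciprocal(uint32_t freq) // freq >= 1
{
    Reciprocal r = { ~0ull / freq + 1, freq };
    return r;
}

static uint32_t div_by_freq(uint32_t x, const Reciprocal *r)
{
    if (r->freq == 1)
        return x; // rcp wraps to 0 for freq=1; special-case it (see text)
    // high 64 bits of the 64x64-bit product (GCC/Clang extension)
    return (uint32_t)(((unsigned __int128)r->rcp * x) >> 64);
}

static uint32_t mod_by_freq(uint32_t x, const Reciprocal *r)
{
    return x - div_by_freq(x, r) * r->freq;
}

The encoder would then use div_by_freq/mod_by_freq in place of the per-symbol x / F_s and x mod F_s.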

Symbol lookup

There’s one important step in the decoder that I haven’t talked about yet: mapping from the “slot index” x \bmod M to the corresponding symbol index. In normal rANS, each symbol covers a contiguous range of the “slot index” space (by contrast to say tANS, where the slots for any given symbol are spread relatively uniformly across the slot index space). That means that, if all else fails, we can figure out the symbol ID using a binary search in \lceil\log_2 n\rceil steps (remember that n is the size of our alphabet) from the cumulative frequency table (the B_s, which take O(n) space) – independent of the size of M. That’s comforting to know, but doing a binary search per symbol is, in practice, quite expensive compared to the rest of the decoding work we do.

At the other extreme, we can just prepare a look-up table mapping from the cumulative frequency to the corresponding symbol ID. This is very simple (and the technique used in the example code) and theoretically constant-time per symbol, but it requires a table with M entries – and if the table ends up being too large to fit comfortably in a core’s L1 cache, real-world performance (although still technically bounded by a constant per symbol) can get quite bad. Moving in the other direction, if M is small enough, it can make sense to store the per-symbol information in the M-indexed table too, and avoid the extra indirection; I would not recommend this far beyond M=2^12 though.
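For reference, building that M-entry table from the frequencies is trivial – a sketch, assuming the F_s already sum to M:

#include <stdint.h>

// slot_to_sym[0..M-1]: map the "slot index" x mod M to its symbol ID.
// O(M) space and setup time, O(1) per decoded symbol.
static void build_slot_table(uint8_t *slot_to_sym, const uint32_t *freq, int nsyms)
{
    uint32_t slot = 0;
    for (int s = 0; s < nsyms; s++)
        for (uint32_t i = 0; i < freq[s]; i++)
            slot_to_sym[slot++] = (uint8_t)s;
}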

Anyway, that gives us two design points: we can use O(n) space, at a cost of O(\log n) per symbol lookup; or we can use O(M) space, with O(1) symbol lookup cost. Now what we’d really like is to get O(1) symbol lookup in O(n) space, but sadly that option’s not on the menu.

Or is it?

The alias method

To make a long story short, I’m not aware of any way to meet our performance goals with the original unmodified rANS algorithm; however, we can do much better if we’re willing to relax our requirements a bit. Notably, there’s no deep reason for us to require that the slots assigned to a given symbol s be contiguous; we already know that e.g. tANS works in a much more relaxed setting. So let’s assume, for the time being, that we can rearrange our slot to symbol mapping arbitrarily (we’ll have to check if this is actually true later, and also work through what it means for our encoder). What does that buy us?

It buys us all we need to meet our performance goals, it turns out (props to my colleague Sean Barrett, who was the first one to figure this out, in our internal email exchanges anyway). As the section title says, the key turns out to be a stochastic sampling technique called the “alias method”. I’m not gonna explain the details here and instead refer you to this short introduction (written by a computational evolutionary geneticist, on randomly picking base pairs) and “Darts, Dice and Coins”, a much longer article that covers multiple ways to sample from a nonuniform distribution (by the way, note that the warnings about numerical instability that often accompany descriptions of the alias method need not worry us; we’re dealing with integer frequencies here so there’s no round-off error).

At this point, you might be wondering what the alias method, a technique for sampling from a non-uniform discrete probability distribution, has to do with entropy (de)coding. The answer is that the symbol look-up problem is essentially the same thing: we have a “random” value x \bmod M from the interval [0,M-1], and a matching non-uniform probability distribution (our symbol frequencies). Drawing symbols according to that distribution defines a map from [0,M-1] to our symbol alphabet, which is precisely what we need for our decoding function.

So what does the alias method do? Well, if you followed the link to the article I mentioned earlier, you get the general picture: it partitions the probabilities for our n-symbol alphabet into n “buckets”, such that each bucket i references at most 2 symbols (one of which is symbol i), and the probabilities within each bucket sum to the same value (namely, 1/n). This is always possible, and there is an algorithm (due to Vose) which determines such a partition in O(n) time. More generally, we can do so for any N≥n, by just adding some dummy symbols with frequency 0 at the end. In practice, it’s convenient to have N be a power of two, so for arbitrary n we would just pick N to be the smallest power of 2 that is ≥n.

Translating the sampling terminology to our rANS setting: we can subdivide our interval [0,M-1] into N sub-intervals (“buckets”) of equal size k=M/N, such that each bucket i references at most 2 distinct symbols, one of which is symbol i. We also need to know what the other symbol referenced in this bucket is – alias[i], the “alias” that gives the method its name – and the position divider[i] of the “dividing line” between the two symbols.
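Here's roughly what that construction looks like with integer frequencies – a sketch only, assuming the frequencies have already been padded with zero-frequency dummies to N entries and sum to M (the full version, including the extra tables discussed below, is in the sample code on Github mentioned at the end):

#include <stdint.h>

// Build divider[] and alias[] for N buckets of k = M/N slots each.
// Sketch: assumes F[0..N-1] sums to M, and N <= 256 for the
// fixed-size worklists.
static void build_alias_buckets(const uint32_t *F, uint32_t N, uint32_t k,
                                uint32_t *divider, uint32_t *alias)
{
    uint32_t small[256], large[256], remaining[256];
    uint32_t nsmall = 0, nlarge = 0;

    for (uint32_t i = 0; i < N; i++) {
        remaining[i] = F[i];                  // slots of symbol i still unplaced
        if (F[i] < k) small[nsmall++] = i;
        else          large[nlarge++] = i;
    }

    while (nsmall > 0) {
        uint32_t s = small[--nsmall];         // bucket s is under-full;
        uint32_t l = large[--nlarge];         // fill the rest with symbol l
        divider[s] = s * k + remaining[s];    // slots below this: symbol s
        alias[s]   = l;                       // slots at/above it: symbol l
        remaining[l] -= k - remaining[s];     // l just donated that many slots
        if (remaining[l] < k) small[nsmall++] = l;
        else                  large[nlarge++] = l;
    }

    while (nlarge > 0) {                      // leftovers are exactly full
        uint32_t l = large[--nlarge];
        divider[l] = l * k + k;               // whole bucket is the primary symbol
        alias[l]   = l;                       // (alias unused)
    }
}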

With these two tables, determining the symbol ID from x is quick and easy:

  uint xM = x % M; // bit masking (power-of-2 M)
  uint bucket_id = xM / K; // shift (power-of-2 M/N!)
  uint symbol = bucket_id;
  if (xM >= divider[bucket_id]) // primary symbol or alias?
    symbol = alias[bucket_id];

This is O(1) time and O(N) = O(n) space (for the “divider” and “alias” arrays), as promised. However, this isn’t quite enough for rANS: remember that for our decoding function D, we need to know not just the symbol ID, but also which of the (potentially many) slots assigned to that symbol we ended up in; with regular rANS, this was simple since all slots assigned to a symbol are sequential, starting at slot B_s:
D(x) = (s, F_s \lfloor x/M \rfloor + (x \bmod M) - B_s) where s = s(x \bmod M).
Here, the (x \bmod M) - B_s part is the number we need. Now with the alias method, the slot IDs assigned to a symbol aren’t necessarily contiguous anymore. However, within each bucket, the slot IDs assigned to a symbol are sequential – which means that instead of the cumulative frequencies B_s, we now have two separate start values per bucket (one for the bucket’s primary symbol and one for its alias). This allows us to define the complete “alias rANS” decoding function:

  // s, x = D(x) with "alias rANS"
  uint xM = x % M;
  uint bucket_id = xM / K;
  uint symbol, bias;
  if (xM < divider[bucket_id]) { // primary symbol or alias?
    symbol = bucket_id;
    bias = primary_start[bucket_id];
  } else {
    symbol = alias[bucket_id];
    bias = alias_start[bucket_id];
  }
  x = (x / M) * freq[symbol] + xM - bias;

And although this code is written with branches for clarity, it is in fact fairly easy to do branch-free. We gained another two tables indexed with the bucket ID; generating them is another straightforward linear pass over the buckets: we just need to keep track of how many slots we’ve assigned to each symbol so far. And that’s it – this is all we need for a complete “alias rANS” decoder.
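For concreteness, that linear pass might look like this – again just a sketch, using the divider/alias arrays from the construction above and the bias convention from the decoder, and assuming N <= 256:

#include <stdint.h>

// Compute the per-bucket "bias" values so that xM - bias enumerates the
// occurrences 0..F_s-1 of each symbol.
static void build_start_tables(const uint32_t *divider, const uint32_t *alias,
                               uint32_t N, uint32_t k,
                               uint32_t *primary_start, uint32_t *alias_start)
{
    uint32_t assigned[256] = { 0 };  // slots assigned to each symbol so far

    for (uint32_t i = 0; i < N; i++) {
        uint32_t nprim  = divider[i] - i * k;   // primary slots in bucket i
        uint32_t nalias = k - nprim;            // alias slots in bucket i

        primary_start[i] = i * k - assigned[i];
        assigned[i] += nprim;

        alias_start[i] = divider[i] - assigned[alias[i]];
        assigned[alias[i]] += nalias;
    }
}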

However, there’s one more minor tweak we can make: note that the only part of the computation that actually depends on symbol is the evaluation of freq[symbol]; if we store the frequencies for both symbols in each alias table bucket, we can get rid of the dependent look-ups. This can be a performance win in practice; on the other hand, it does waste a bit of extra memory on the alias table, so you might want to skip it.

Either way, this alias method allows us to perform quite fast (though not as fast as a fully-unrolled table for small M) symbol look-ups, for large M, with memory overhead (and preparation time) proportional to n. That’s quite cool, and might be particularly interesting in cases where you either have relatively small alphabets (say on the order of 15-25 symbols), need lots of different tables, or frequently switch between tables.

Encoding

However, we haven’t covered encoding yet. With regular rANS, encoding is easy, since – again – the slot ranges for each symbol are contiguous; the encoder just does
C(s,x) = M \lfloor x/F_s \rfloor + B_s + (x \bmod F_s)
where B_s + (x \bmod F_s) is the slot id corresponding to the (x \bmod F_s)-th appearance of symbol s.

With alias rANS, each symbol may have its slots distributed across multiple, disjoint intervals – up to N of them. And the encoder now needs to map (x \bmod F_s) to a corresponding slot index that will decode correctly. One way to do this is to just keep track of the mapping as we build the alias table; this takes O(M) space and is O(1) cost per symbol. Another is to keep a sorted list of subintervals (and their cumulative sizes) assigned to each symbol; this takes only O(N) space, but adds a O(\log_2 N) (worst-case) lookup per symbol in the encoder. Sound familiar?

In short, using the alias method doesn’t really solve the symbol lookup problem for large M; or, more precisely, it solves the lookup problem on the decoder side, but at the cost of adding an equivalent problem on the encoder side. What this means is that we have to pick our poison: faster encoding (at some extra cost in the decoder), or faster decoding (at some extra cost in the encoder). This is fine, though; it means we get to make a trade-off, depending on which of the two steps is more important to us overall. And as long as we are in a compress-once decompress-often scenario (which is fairly typical), making the decoder faster at some reasonable extra cost in the encoder is definitely useful.

Conclusion

We can exploit static, known probabilities in several ways for rANS and related coders: for encoding, we can precompute the right “magic values” to avoid divisions in the hot encoding loop; and if we want to support large M, the alias method enables fast decoding without generating a giant table with M entries – with an O(n) preprocessing step (where n is the number of symbols), we can still support O(1) symbol decoding, albeit with a (slightly) higher constant factor.

I’m aware that this post is somewhat hand-wavey; the problem is that while Vose’s algorithm and the associated post-processing are actually quite straightforward, there’s a lot of index manipulation, and I find the relevant steps to be quite hard to express in prose without the “explanation” ending up harder to read than the actual code. So instead, my intent was to convey the “big picture”; a sample implementation of alias table rANS, with all the details, can be found – as usual – on Github.

And that’s it for today – take care!


FSE/ANS history correction

I’ve been meaning to write another proper post on this for a while, but the last few months have been very busy and I didn’t feel like writing in my spare time.

This is not that post, sadly. This is about something else: namely, me only citing Jarek Duda’s work on ANS and not Yann Collet‘s work on FSE. Apparently, there were multiple versions of Jarek’s ANS paper, and the second version (which contains rANS, the topic I’ve been writing about) was significantly influenced by Yann’s experiences with integrating ideas from the first version into FSE.

Anyway, I did not mention Yann’s work in my rANS posts at all. I just want to make clear that this was because I was writing about rANS not tANS (the family that FSE is a member of), and I simply wasn’t aware that Yann’s work and input significantly influenced the second version of the ANS paper. My apologies; this was a simple oversight, not a deliberate attempt to talk down Yann’s contribution!

On a separate but related note: Mid-February, I wrote a short paper on how entropy coders (with a focus on ANS) can be interleaved on the encode side to allow the decoder side to exploit instruction-level parallelism and/or SIMD instructions. I originally meant to do a separate post about it here, but on trying to write it I discovered that I didn’t have much to say on the topic that wasn’t in the paper. Hence, no separate blog post. But I figured I should at least link to it once from here.

Anyway, more regular blog updates should start again soon. Until then!


Binary alias coding

Applying the rANS-with-alias-table construction from “rANS with static probability distributions” to Huffman codes has some interesting results. In a sense, there’s nothing new here once you have these two ingredients. I remember mentioning this idea in a mail when I wrote ryg_rans, but it didn’t seem worth writing an article about. I’ve changed my mind on that: while the restriction to Huffman-like code lengths is strictly weaker than “proper” arithmetic coding, we do get a pretty interesting variant on table/state machine-style “Huffman” decoders out of the deal. So let’s start with a description of how they usually operate and work our way to the alias rANS variant.

Table-based Huffman decoders

Conceptually, a Huffman decoder starts from the root, then reads one bit at a time, descending into the sub-tree denoted by that bit. If that sub-tree is a leaf node, return the corresponding symbol. Otherwise, keep reading more bits and descending into smaller and smaller sub-trees until you do hit a leaf node. That’s all there is to it.

Except, of course, no serious implementation of Huffman decoding works that way. Processing the input one bit at a time is just a lot of overhead for very little useful work done. Instead, normal implementations effectively look ahead by a bunch of bits and table-drive the whole thing. Peek ahead by k bits, say k=10. You also prepare a table with 2^k entries that encodes what the one-bit-at-a-time Huffman decoder would do when faced with those k input bits:

struct TableEntry {
    int num_bits; // Number of bits consumed
    int symbol;   // Index of decoded symbol
};

If the bit-by-bit decoder reaches a leaf node within those k bits, you record the ID of the symbol it arrived at, and how many input bits were actually consumed to get there (which can be less than k). If not, the next symbol takes more than k bits, and you need a back-up plan. Set num_bits to 0 (or some other value that’s not a valid code length) and use a different strategy to decode the next symbol: typically, you either chain to another (secondary) table or fall back to a slower (one-bit-at-a-time or similar) Huffman decoder with no length limit. Since Huffman coding only assigns long codes to rare symbols – that is, after all, the whole point – it doesn’t tend to matter much; with well-chosen k (typically, slightly larger than the log2 of the size of your symbol alphabet), the “long symbol” case is pretty rare.
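For example, with MSB-first bit packing and canonical codes, filling that table from a list of (code, code length) pairs is a short loop. This is just a sketch using the TableEntry struct from above; codes longer than k are left at num_bits = 0 so the decoder takes its fall-back path for them:

void build_table(TableEntry *table, int k,
                 const int *codes, const int *lens, int nsyms)
{
    for (int i = 0; i < (1 << k); i++)
        table[i].num_bits = 0; // "code longer than k" marker

    for (int s = 0; s < nsyms; s++) {
        if (lens[s] == 0 || lens[s] > k)
            continue;
        // every k-bit lookahead whose top lens[s] bits equal codes[s]
        int lo = codes[s] << (k - lens[s]);
        int hi = (codes[s] + 1) << (k - lens[s]);
        for (int i = lo; i < hi; i++) {
            table[i].num_bits = lens[s];
            table[i].symbol   = s;
        }
    }
}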

So you get an overall decoder that looks like this:

while (!done) {
    // Read next k bits without advancing the cursor
    int bits = peekBits(k);

    // Decode using our table
    int nbits = table[bits].num_bits;
    if (nbits != 0) { // Symbol
        *out++ = table[bits].symbol;
        consumeBits(nbits);
    } else {
        // Fall-back path for long symbols here!
    }
}

This ends up particularly nice combined with canonical Huffman codes, and some variant of it is used in most widely deployed Huffman decoders. All of this is classic and much has been written about it elsewhere. If any of this is news to you, I recommend Moffat and Turpin’s 1997 paper “On the implementation of minimum redundancy prefix codes”. I’m gonna assume it’s not and move on.

State machines

For the next step, suppose we fix k to be the length of our longest codeword. Anything smaller and we need to deal with the special cases just discussed; anything larger is pointless. A table like the one above then tells us what to do for every possible combination of k input bits, and when we turn the k-bit lookahead into explicit state, we get a finite-state machine that decodes Huffman codes:

state = getBits(k); // read initial k bits
while (!done) {
    // Current state determines output symbol
    *out++ = table[state].symbol;

    // Update state (assuming MSB-first bit packing)
    int nbits = table[state].num_bits;
    state = (state << nbits) & ((1 << k) - 1);
    state |= getBits(nbits);
}

state is essentially a k-bit shift register that contains our lookahead bits, and we need to update it in a way that matches our bit packing rule. Note that this is precisely the type of Huffman decoder Charles talks about here while explaining ANS. Alternatively, with LSB-first bit packing:

state = getBits(k);
while (!done) {
    // Current state determines output symbol
    *out++ = table[state].symbol;

    // Update state (assuming LSB-first bit packing)
    int nbits = table[state].num_bits;
    state >>= nbits;
    state |= getBits(nbits) << (k - nbits);
}

This is still the exact same table as before, but because we’ve sized the table so that each symbol is decoded in one step, we don’t need a fallback path. But so far this is completely equivalent to what we did before; we’re just explicitly keeping track of our lookahead bits in state.

But this process still involves, essentially, two separate state machines: one explicit for our Huffman decoder, and one implicit in the implementation of our bitwise IO functions, which ultimately read data from the input stream at least one byte at a time.

A bit buffer state machine

For our next trick, let’s look at the bitwise IO we need and turn that into an explicit state machine as well. I’m assuming you’ve implemented bitwise IO before; if not, I suggest you stop here and try to figure out how to do it before reading on.

Anyway, how exactly the bit IO works depends on the bit packing convention used, the little/big endian of the compression world. Both have their advantages and their disadvantages; in this post, my primary version is going to be LSB-first, since it has a clearer correspondence to rANS which we’ll get to later. Anyway, whether LSB-first or MSB-first, a typical bit IO implementation uses two variables, one for the “bit buffer” and one that counts how many bits are currently in it. A typical implementation looks like this:

uint32_t buffer;   // The bits themselves
uint32_t num_bits; // Number of bits in the buffer right now

uint32_t getBits(uint32_t count)
{
    // Return low "count" bits from buffer
    uint32_t ret = buffer & ((1 << count) - 1);

    // Consume them
    buffer >>= count;
    num_bits -= count;

    // Refill the bit buffer by reading more bytes
    // (kMinBits is a constant here)
    while (num_bits < kMinBits) {
        buffer |= *in++ << num_bits;
        num_bits += 8;
    }

    return ret;
}

Okay. That’s fine, but we’d like for there to be only one state variable in our state machine, and preferably not just on a technicality such as declaring our one state variable to be a pair of two values. Luckily, there’s a nice trick to encode both the data and the number of bits in the bit buffer in a single value: we just keep an extra 1 bit in the state, always just past the last “real” data bit. Say we have an 8-bit state, then we assign the following codes (in binary):

in_binary(state)   num_bits
0 0 0 0 0 0 0 1       0
0 0 0 0 0 0 1 *       1
0 0 0 0 0 1 * *       2
0 0 0 0 1 * * *       3
0 0 0 1 * * * *       4
0 0 1 * * * * *       5
0 1 * * * * * *       6
1 * * * * * * *       7

The locations denoted * store the actual data bits. Note that we’re fitting 1 + 2 + … + 128 = 255 different states into an 8-bit byte, as we should. The only value we’re not using is “0”. Also note that we have num_bits = floor(log2(state)) precisely, and that we can determine num_bits using bit scanning instructions when we need to. Let’s look at how the code comes out:

uint32_t state; // As described above

uint32_t getBits(uint32_t count)
{
    // Return low "count" bits from state
    uint32_t ret = state & ((1 << count) - 1);

    // Consume them
    state >>= count;

    // Refill the bit buffer by reading more bytes
    // (kMinBits is a constant here)
    // Note num_bits is a local variable!
    uint32_t num_bits = find_highest_set_bit(state);
    while (num_bits < kMinBits) {
        // Need to clear 1-bit at position "num_bits"
        // and add a 1-bit at bit "num_bits + 8", hence the
        // "+ (256 - 1)".
        state += (*in++ + (256 - 1)) << num_bits;
        num_bits += 8;
    }

    return ret;
}

Okay. This is written to be as similar as possible to the implementation we had before. You can phrase the while condition in terms of state and only compute num_bits inside the refill loop, which makes the non-refill case slightly faster, but I wrote it the way I did to emphasize the similarities.

Consuming bits is slightly cheaper than the regular bit buffer, refilling is a bit more expensive, but we’re down to one state variable instead of two. Let’s call that a win for our purposes (and it certainly can be when low on registers). Note I only covered LSB-first bit packing here, but we can do a similar trick for MSB bit buffers by using the least-significant set bit as a sentinel instead. It works out very similar.

So what happens when we plug this into the finite-state Huffman decoder from before?

State machine Huffman decoder with built-in bit IO

Note that our state machine decoder above still just kept the k lookahead bits in state, and that they’re not exactly hard to recover from our bit buffer state. In fact, they’re pretty much the same. So we can just fuse them together to get a state machine-based Huffman decoder that only uses byte-wise IO:

state = 1; // No bits in buffer
refill();  // Run "refill" step from the loop once

while (!done) {
    // Current state determines output symbol
    index = state & ((1 << k) - 1);
    *out++ = table[index].symbol;

    // Update state (consume bits)
    state >>= table[index].num_bits;

    // Refill bit buffer (make sure at least k bits in it)
    // This reads bytes at a time, but could just as well
    // read 16 or 32 bits if "state" is large enough.
    num_bits = find_highest_set_bit(state);
    while (num_bits < k) {
        state += (*in++ + (256 - 1)) << num_bits;
        num_bits += 8;
    }
}

The slightly weird refill() call at the start is just to keep the structure as similar as possible to what we had before. And there we have it, a simple Huffman decoder with one state variable and a table. Of course you can combine this type of bit IO with other Huffman approaches, such as multi-table decoding, too. You could also go even further and bake most of the bit IO into tables like Charles describes here, effectively using a table on the actual state and not just its low bits, but that leads to enormous tables and is really not a good idea in practice; not only are the tables too large to fit in the cache, general-purpose compressors will also usually spend more time building these tables than they ever spend using them (since it’s rare to use a single Huffman table for more than a few dozen kilobytes at a time).

Okay. So far, there’s nothing in here that’s not at least 20 years old.

Let’s get weird, stage 1

The decoder above still reads the exact same bit stream as the original LSB-first decoder. But if we’re willing to prescribe the exact form of the decoder, we can use a different refilling strategy that’s more convenient (or cheaper). In particular, we can do this:

state = read_3_bytes() | (1 << 24); // might as well!

while (!done) {
    // Current state determines output symbol
    index = state & ((1 << k) - 1);
    *out++ = table[index].symbol;

    // Update state (consume bits)
    state >>= table[index].num_bits;

    // Refill
    while (state < (1 << k))
        state = (state << 8) | *in++;
}

This is still a workable Huffman decoder, and it’s cheaper than the one we saw before, because refilling got cheaper. But it also got a bit, well, strange. Note we’re reading 8 bits and putting them into the low bits of state; since we’re processing bits LSB-first, that means we added them at the “front” of our bit queue, rather than appending them as we used to! In principle, this is fine. Bits are bits. But processing bits out-of-sequence in that way is certainly atypical, and means extra work for the encoder, which now needs to figure out exactly how to permute the bits so the decoder reads them in the right order. In fact, it’s not exactly obvious that you can encode this format efficiently to begin with.

But you definitely can, by encoding backwards. Because, drum roll: this isn’t a regular table-driven Huffman decoder anymore. What this actually is is a rANS decoder for symbols with power-of-2 probabilities. The state >>= table[index].num_bits; is what the decoding state transition function for rANS reduces to in that case.

In other words, this is where we start to see new stuff. It might be possible that someone did a decoder like this before last year, but if they did, I certainly never encountered it before. And trust me, it is weird; the byte stream the corresponding encoder emits is uniquely decodable and has the same length as the bit stream generated for the corresponding Huffman or canonical Huffman code, but the bit-shuffling means it’s not even a regular prefix code stream.

Let’s get weird, stage 2: binary alias coding

But there’s one more, which is a direct corollary of the existence of alias rANS: we can use the alias method to build a fast decoding table with size proportional to the number of symbols in the alphabet, completely independent of the code lengths!

Note the alias method allows you to construct a table with an arbitrary number of entries, as long as it’s larger than the number of symbols. For efficiency, you’ll typically want to round up to the next power of 2. I’m not going to describe the exact encoder details here, simply because it’s just rANS with power-of-2 probabilities, and the ryg_rans encoder/decoder can handle that part just fine. So you already have example code. But that means you can build a fast “Huffman” decoder like this:

kMaxCodeLen = 24; // max code len in bits
kCodeMask = (1 << kMaxCodeLen) - 1;
kBucketShift = kMaxCodeLen - SymbolStats::LOG2NSYMS;

state = read_3_bytes() | (1 << 24); // might as well!

while (!done) {
    // Figure out bucket in alias table; same data structures as in
    // ryg_rans, except syms->slot_nbits (number of bits in Huffman
    // code for symbol) instead of syms->slot_nfreqs is given.
    uint32_t index = state & kCodeMask;
    uint32_t bucket_id = index >> kBucketShift;
    uint32_t bucket2 = bucket_id * 2;
    if (index < syms->divider[bucket_id])
        ++bucket2;

    // bucket determines output symbol
    *out++ = syms->sym_id[bucket2];

    // Update state (just D(x) for pow2 probabilities)
    state = (state & ~kCodeMask) >> syms->slot_nbits[bucket2];
    state += index - syms->slot_adjust[bucket2];

    // Refill (make sure at least kMaxCodeLen bits in buffer)
    while (state <= kCodeMask)
        state = (state << 8) | *in++;
}

I find this remarkable because essentially all other fast (~constant time per symbol) Huffman decoding tricks have some dependence on the distribution of code lengths. This one does not; the alias table size is determined strictly by the number of symbols. The only fundamental data-dependency is how often the “refill” code is run (it runs, necessarily, once per input byte, so it will run less often – relatively speaking – on highly compressible data than it will on high-entropy data). (I’m not counting the computation of bucket2 here because it’s just a conditional add, and is in fact written the way it is precisely so that it can be mapped to a compare-then-add-with-carry sequence.)

Note that this one really is a lot weirder still than the previous variant, which at least kept the “space” assigned to individual codes connected. This one will, through the alias table construction, end up allocating small parts of the code range for large symbols all over the place. It’s still exactly equivalent to a Huffman coder in terms of compression ratio and code “lengths”, but the underlying construction really doesn’t have much to do with Huffman at all at this point, and we’re not even emitting particular bit strings for code words anymore.

All that said, I don’t think this final variant is actually interesting in practice; if I did, I would have written about it earlier. If you’re bothering to implement rANS and build an alias table, it really doesn’t make sense to skimp out on the one extra multiply that turns this algorithm into a full arithmetic decoder (as opposed to quasi-Huffman), unless your multiplier is really slow that is.

But I do find it to be an interesting construction from a theoretical standpoint, if nothing else. And if you don’t agree, well, maybe you at least learned something about certain types of Huffman decoders and their relation to table-based ANS decoders. :)


From the archives: “Alias Huffman coding.”

This is precisely what the last post was about. So nothing new. This is just my original mail on the topic with some more details that might be interesting and/or amusing to a few people. :)

Date: Wed, 05 Feb 2014 16:43:36 -0800
From: Fabian Giesen
Subject: Alias Huffman coding.

Huffman <= ANS (strict subset)
(namely, power-of-2 frequencies)

We can take any discrete probability distribution of N events and use the Alias method to construct a O(N)-entry table that allows us to sample from that distribution in O(1) time.

We can apply that same technique to e.g. rANS coding to map from (x mod M) to “what symbol is x”. We already have that.

Ergo, we can construct a Huffman-esque coder that can decode symbols using a single table lookup, where the table size only depends on N_sym and not the code lengths. (And the time to build said table given the code lengths is linear in N_sym too).

Unlike regular/canonical Huffman codes, these can have multiple unconnected ranges for the same symbol, so you still need to deal with the range remapping (the “slot_adjust” thing) you have in Alias table ANS; basically, the only difference ends up being that you have a shift instead of a multiply by the frequency.

But there’s still some advantages in that a few things simplify; for example, there’s no need (or advantage) to using a L that’s larger than M. An obvious candidate is choosing L=M=B so that your Huffman codes are length-limited to half your word size and you never do IO in smaller chunks than that.

Okay. So where does that get us? Well, something like the MSB alias rANS decoder, with a shift instead of a multiply, really:

   // decoder state
   // suppose max_code_len = 16
   U32 x;
   U16 const * input_ptr;

   U32 const m = (1 << max_code_len) - 1;
   U32 const bucket_shift = max_code_len - log2_nbuckets;

   // decode:
   U32 xm = x & m;
   U32 xm_shifted = xm >> bucket_shift;
   U32 bucket = xm_shifted * 2;
   if (xm < hufftab_divider[xm_shifted])
     bucket++;

   x = (x & ~m) >> hufftab_shift[bucket];
   x += xm - hufftab_adjust[bucket];

   if (x < (1<<16))
     x = (x << 16) | *input_ptr++;

   return hufftab_symbol[bucket];

So with a hypothetical compiler that can figure out the adc-for-bucket
thing, we’d get something like

   ; x in eax, input_ptr in esi
   movzx    edx, ax     ; x & m (for bucket id)
   shr      edx, 8      ; edx = xm_shifted
   movzx    ebx, ax     ; ebx = xm
   cmp      ax, [hufftab_divider + edx*2]
   adc      edx, edx    ; edx = bucket
   xor      eax, ebx    ; eax = x & ~m
   mov      cl, [hufftab_shift + edx]
   shr      eax, cl
   movzx    ecx, word [hufftab_adjust + edx*2]
   add      eax, ebx    ; x += xm
   movzx    edx, byte [hufftab_symbol + edx] ; symbol
   sub      eax, ecx    ; x -= adjust[bucket]
   cmp      eax, (1<<16)
   jae      done
   shl      eax, 16
   movzx    ecx, word [esi]
   add      esi, 2
   or       eax, ecx
done:

   ; new x in eax, new input_ptr in esi
   ; symbol in edx

which is actually pretty damn nice considering that’s both Huffman decode and bit buffer rolled into one. Especially so since it handles all cases – there’s no extra conditions and no cases (rare though they might be) where you have to grab more bits and look into another table. Bonus points because it has an obvious variant that’s completely branch-free:

   ; same as before up until...
   sub      eax, ecx    ; x -= adjust[bucket]
   movzx    ecx, word [esi]
   mov      ebx, eax
   shl      ebx, 16
   or       ebx, ecx
   lea      edi, [esi+2]
   cmp      eax, (1<<16)
   cmovb    eax, ebx
   cmovb    esi, edi

Okay, all that’s nice and everything, but for x86 it’s nothing we haven’t seen before. I have a punch line though: the same thing works on PPC – the adc thing and “sbb reg, reg” both have equivalents, so you can do branch-free computation based on some carry flag easily.

BUT, couple subtle points:

  1. this thing has a bunch of (x & foo) >> bar (left-shift or right-shift) kind of things, which map really really well to PPC because there’s rlwinm / rlwimi.
  2. The in-order PPCs hate variable shifts (something like 12+ cycles microcoded). Well, guess what, everything we multiply with is a small per-symbol constant, so we can just store (1 << len) per symbol and use mullw. That’s 9 cycles non-pipelined (and causes a stall after issue), but still, better than the microcode. But… wait a second.

    If this ends up faster than your usual Huffman, and there’s a decent chance that it might (branch-free and all), the fastest “Huffman” decoder on in-order PPC would, in fact, be a full-blown arithmetic decoder. Which amuses me no end.

   # NOTE: LSB of "bucket" complemented compared to x86

   # r3 = x, r4 = input ptr
   # r20 = &tab_divider[0]
   # r21 = &tab_symbol[0]
   # r22 = &tab_mult[0]
   # r23 = &tab_adjust[0]

   rlwinm    r5, r3, 24, 23, 30   # r5 = (xm >> bucket_shift) * 2
   rlwinm    r6, r3, 0, 16, 31    # r6 = xm
   lhzx      r7, r20, r5          # r7 = tab_divider[xm_shifted]
   srwi      r8, r3, 16           # r8 = x >> log2(m)
   subfc     r9, r7, r6           # (r9 ignored but sets carry)
   lhz       r10, 0(r4)           # *input_ptr
   addze     r5, r5               # r5 = bucket
   lbzx      r9, r21, r5          # r9 = symbol
   add       r5, r5, r5           # r5 = bucket word offs
   lhzx      r7, r22, r5          # r7 = mult
   li        r6, 0x10000          # r6 = op for sub later
   lhzx      r5, r23, r5          # r5 = adjust
   mullw     r7, r7, r8           # r7 = mult * (x >> m)
   subf      r5, r5, r6           # r5 = xm - tab_adjust[bucket]
   add       r5, r5, r7           # r5 = new x
   subfc     r6, r6, r5           # sets carry iff (x >= (1<<16))
   rlwimi    r10, r5, 16, 0, 16   # r10 = (x << 16) | *input_ptr
   subfe     r6, r6, r6           # ~0 if (x < (1<<16)), 0 otherwise
   slwi      r7, r6, 1            # -2 if (x < (1<<16)), 0 otherwise
   and       r10, r10, r6
   andc      r5, r5, r6
   subf      r4, r7, r4           # input_ptr++ if (x < (1<<16))
   or        r5, r5, r10          # new x

That should be a complete alias rANS decoder assuming M = L = b = 2^16.

-Fabian


Mixing discrete probability distributions


Let’s talk a bit about probability distributions in data compression; more specifically, about a problem in dealing with multi-symbol alphabets during adaptation.

Distributions over a binary alphabet are easy to mix (in the “context mixing” sense): they can be represented by a single scalar (I’ll use the probability of the symbol being a ‘1’, probability of ‘0’ works too), which is easily combined with other scalars. In practice, these probabilities are represented as fixed-size integers in a specific range. A typical choice is p=k/4096, 0 \le k \le 4096. Often, p=0 (“it’s certainly a 0”) and p=1 (certain 1) are excluded, because they only allow us to “encode” (using zero bits!) one of the two symbols.

Multi-symbol alphabets are trickier. A binary alphabet can be described using one probability p; the probability of the other symbol must be 1-p. Larger alphabets involve multiple probabilities, and that makes things trickier. Say we have an n-symbol alphabet. A probability distribution for such an alphabet is, conventionally, a vector p \in \mathbb{R}^n such that p_k \ge 0 for all 1 \le k \le n and furthermore \sum_{k=1}^n p_k = 1. In practice, we will again represent them using integers. Rather than dealing with the hassle of having fractions everywhere, let’s just define our “finite-precision distributions” as integer vectors p \in \mathbb{Z}^n, again p_k \ge 0 for all k, and \sum_{k=1}^n p_k = T where T is the total. Same as with the binary case, we would like T to be a constant power of 2. This lets us use cheaper variants of conventional arithmetic coders, as well as rANS. Unlike the binary case, maintaining this invariant takes explicit work, and it’s not cheap.

In practice, the common choice is to either drop the requirement that T be constant (most adaptive multi-symbol models for arithmetic coding take that route), or switch to a semi-adaptive model such as deferred summation that performs model updates in batches, which can either pick batches such that the constant sum is automatically maintained, or at least amortize the work spent enforcing it. A last option is using a “sliding window” style model that just uses a history of character counts over the last N input symbols. This makes it easy to maintain the constant sum, but it’s pretty limited.

A second problem is mixing – both as an explicit operation for things like context mixing, and as a building block for model updates more sophisticated than simply incrementing a counter. With binary models, this is easy – just a weighted sum of probabilities. Multi-symbol models with a constant sum are harder: for example, say we want to average the two 3-symbol models p=\begin{pmatrix}3 & 3 & 2\end{pmatrix} and q=\begin{pmatrix}1 & 6 & 1\end{pmatrix} while maintaining a total of T=8. Computing \frac{1}{2}(p+q) = \begin{pmatrix}2 & 4.5 & 1.5\end{pmatrix} illustrates the problem: the two non-integer counts have to get rounded back to integers somehow. But if we round both up, we end up at a total of 9; and if we round both down, we end up with a total of 7. To achieve our desired total of 8, we need to round one up, and one down – but it’s not exactly obvious how we should choose on any given symbol! In short, maintaining a constant total while mixing in this form proves to be tricky.

However, I recently realized that these problems all disappear when we work with the cumulative distribution function (CDF) instead of the raw symbol counts.

Working with cumulative counts

For a discrete probability distribution p with total T, define the corresponding cumulative probabilities:

P_j := \sum_{k=1}^j p_k, \quad 0 \le j \le n

P_0 is an empty sum, so P_0 = 0. On the other end, P_n = T, since that’s just the sum over all values in p, and we know the total is T. For the elements in between, p_k ≥ 0 implies that P_k ≥ P_{k-1}, i.e. P is monotonically non-decreasing. Conversely, given any P with these three properties, we can compute the differences between adjacent elements and determine the underlying distribution p.

And it turns out that while mixing quantized symbol probabilities is problematic, mixing cumulative distribution functions pretty much just works. To wit: suppose we are given two CDFs P and Q with the same total T, and a blending factor \lambda \in [0,1], then define:

\tilde{R}_j := (1-\lambda) P_j + \lambda Q_j, \quad 0 \le j \le n

Note that because summation is linear, we have

\tilde{R}_j = (1-\lambda) \left(\sum_{k=1}^j p_k\right) + \lambda \left(\sum_{k=1}^j q_k\right)
= \sum_{k=1}^j ((1-\lambda) p_k + \lambda q_k)

so this is indeed the CDF corresponding to a blended model between p and q. However, we’re still dealing with real values here; we need integers, and the easiest way to get there is to just truncate, possibly with some rounding bias b \in [0,1):

R_j := \lfloor \tilde{R}_j + b \rfloor

It turns out that this just works, but this requires proof, and to get there it helps to prove a little lemma first.

Spacing lemma: Suppose that for some j, P_j - P_{j-1} \ge m and Q_j - Q_{j-1} \ge m where m is an arbitrary integer. Then we also have R_j - R_{j-1} \ge m.

Proof: We start by noting that

\tilde{R}_j - \tilde{R}_{j-1} = (1-\lambda) (P_j - P_{j-1}) + \lambda (Q_j - Q_{j-1}) \ge (1-\lambda) m + \lambda m = m

and since m is an integer, we have

R_j - R_{j-1} = \lfloor \tilde{R}_j + b \rfloor - \lfloor \tilde{R}_{j-1} + b \rfloor
= \lfloor \tilde{R}_{j-1} + (\tilde{R}_j - \tilde{R}_{j-1}) + b \rfloor - \lfloor \tilde{R}_{j-1} + b \rfloor
\ge \lfloor \tilde{R}_{j-1} + m + b \rfloor - \lfloor \tilde{R}_{j-1} + b \rfloor
= m + \lfloor \tilde{R}_{j-1} + b \rfloor - \lfloor \tilde{R}_{j-1} + b \rfloor = m

which was our claim.

Using the lemma, it’s now easy to show that R is a valid CDF with total T: we need to establish that R_0 = 0, R_n = T, and show that R is monotonic. The first two are easy, since the P and Q we’re mixing both agree at these points and 0 ≤ b < 1.

R_0 = \lfloor \tilde{R}_0 + b \rfloor = \lfloor b \rfloor = 0
R_n = \lfloor \tilde{R}_n + b \rfloor = \lfloor T + b \rfloor = T

As for monotonicity, note that R_j \ge R_{j-1} is the same as R_j - R_{j-1} \ge 0 (and the same for P and Q). Therefore, we can apply the spacing lemma with m = 0 for 1 ≤ j ≤ n: monotonicity of P and Q implies that R will be monotonic too. And that’s it: this proves that R is a valid CDF for a distribution with total T. If we want the individual symbol frequencies, we can recover them as r_k = R_k - R_{k-1}.
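To make the construction concrete, here is a minimal sketch of the mixing step in integer arithmetic. The 15-bit fixed-point blend weight and rounding bias are arbitrary choices for this sketch (not anything prescribed above), and the total T is assumed to be at most 2^16 so all intermediates fit in 32 bits:

// Blend two CDFs P and Q (same total T, T <= 1 << 16) into R.
// lambda: blend weight in .15 fixed point (0 = pure P, 1 << 15 = pure Q).
// bias: rounding bias b, also in .15 fixed point.
void mix_cdfs(uint32_t *R, const uint32_t *P, const uint32_t *Q,
              int nsyms, uint32_t lambda, uint32_t bias)
{
    // R[0] = 0 and R[nsyms] = T fall out automatically.
    for (int j = 0; j <= nsyms; ++j)
        R[j] = (P[j] * ((1u << 15) - lambda) + Q[j] * lambda + bias) >> 15;
}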

Note that the spacing lemma is a fair bit stronger than just establishing monotonicity. Most importantly, if p_k ≥ m and q_k ≥ m, the spacing lemma tells us that r_k ≥ m – these properties carry over to the blended distribution, as one would expect! I make note of this because this kind of invariant is often violated by approximate techniques that first round the probabilities, then fudge them to make the total come out right.

Conclusion

This gives us a way to blend between two probability distributions while maintaining a constant total, without having to deal with dodgy ad-hoc rounding decisions. This requires working in CDF form, but that’s pretty common for arithmetic coding models anyway. As long as the mixing computation is done exactly (which is straightforward when working in integers), the total will remain constant.

I only described linear blending, but the construction generalizes in the obvious way to arbitrary convex combinations. It is thus directly applicable to mixing more than two distributions while only rounding once. Furthermore, given the CDFs of the input models, the corresponding interval for a single symbol can be found using just two mixing operations to find the two interval end points; there’s no need to compute the entire CDF for the mixed model. This is in contrast to direct mixing of the symbol probabilities, which in general needs to look at all symbols to either determine the total (if a non-constant-T approach is used) or perform the adjustments to make the total come out to T.
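For illustration, here is a sketch of that shortcut, using the same hypothetical fixed-point conventions as the mixing sketch above, with 0-based symbol indices:

// Interval of symbol "sym" in the blended model: only the two enclosing
// CDF entries need to be mixed.
static uint32_t mix_one(uint32_t p, uint32_t q, uint32_t lambda, uint32_t bias)
{
    return (p * ((1u << 15) - lambda) + q * lambda + bias) >> 15;
}

void blended_interval(const uint32_t *P, const uint32_t *Q, int sym,
                      uint32_t lambda, uint32_t bias,
                      uint32_t *start, uint32_t *freq)
{
    uint32_t lo = mix_one(P[sym],     Q[sym],     lambda, bias);
    uint32_t hi = mix_one(P[sym + 1], Q[sym + 1], lambda, bias);
    *start = lo;
    *freq  = hi - lo;
}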

Furthermore, the construction shows that probability distributions with sum T are closed under “rounded convex combinations” (as used in the proof). The spacing lemma implies that the same is true for a multitude of more restricted distributions: for example, the set of probability distributions where each symbol has a nonzero probability (corresponding to distributions with monotonically increasing, instead of merely non-decreasing, CDFs) is also closed under convex combinations in this sense. This is a non-obvious result, to me anyway.

One application for this (as frequently noted) is context mixing of multi-symbol distributions. Another is as a building block in adaptive model updates that’s a good deal more versatile than the obvious “steal the count from one symbol, add it to another” update step.

I have no idea whether this is new or not (probably not); I certainly hadn’t seen it before, and neither had anyone else at RAD. Nor do I know whether this will be useful to anyone else, but it seemed worth writing up!


Models for adaptive arithmetic coding


My colleague Charles Bloom recently announced Oodle LZNA (a new compressor for RAD’s Oodle compression library), which usually beats LZMA on compression while being much faster to decode. The major innovation in LZNA is switching the coding back-end from bit-wise adaptive modeling to models with larger alphabets. With traditional arithmetic coders, you take a pretty big speed hit going from binary to larger alphabets, and you end up leaning heavily on integer divides (one per symbol usually), which is the neglected bastard stepchild of every integer instruction set. But as of last year, we have the ANS family of coders, which alters the cost landscape considerably. Most interesting for use with adaptive models is rANS, which is significantly cheaper than conventional large-alphabet arithmetic (and without divisions in the decoder), but can still code directly from a cumulative probability distribution, just like other arithmetic coders. In short, rANS speeds up non-binary alphabets enough that it makes sense to revisit adaptive modeling for larger alphabets in practical coder design, something that’s been pretty much on hold since the late 90s.

In this post, I will try to explain the high-level design behind the adaptive models that underlie LZNA, and why they are interesting. But let’s start with the models they are derived from.

Adaptive binary models

The canonical adaptive binary model simply keeps count of how many 0s and 1s have been seen:

int counts[2] = { 1, 1 };  // or different init
double prob_for_1 = 0.5;   // used for coding/decoding

void adapt(int bit)
{
    counts[bit]++;
    prob_for_1 = (double)counts[1] / (counts[0] + counts[1]);
}

This is the obvious construction. The problem with this is that probabilities move quickly after initialization, but very soon start to calcify; after a few thousand symbols, any individual observation barely changes the probabilities at all. This sucks for heterogeneous data, or just data where the statistics change over time: we spend a long time coding with a model that’s a poor fit, trying to “unlearn” the previous stats.

A better approach for data with statistics that change over time is to gradually “forget” old stats. The resulting models are much quicker to respond to changed input characteristics, which is usually a decent win in practice. The canonical “leaky” adaptive binary model is, essentially, an exponential moving average:

double prob_for_1 = 0.5;
double f = 1.0 - 1.0 / 32.0;   // adaptation rate, usually 1/pow2

void adapt(int bit)
{
    if (bit == 0)
        prob_for_1 *= f;
    else
        prob_for_1 = prob_for_1 * f + (1.0 - f);
}

This type of model goes back to Howard and Vitter in “Practical implementations of arithmetic coding” (section 3.4). It is absolutely everywhere, albeit usually implemented in fixed point arithmetic and using shifts instead of multiplications. That means you will normally see the logic above implemented like this:

const int scale_factor = 4096; // .12 fixed point
int prob_for_1_scaled = scale_factor / 2;

void adapt(int bit)
{
    if (bit == 0)
        prob_for_1_scaled -= prob_for_1_scaled >> 5;
    else
        prob_for_1_scaled += (scale_factor - prob_for_1_scaled) >> 5;
}

This looks like it’s a straightforward fixed-point version of the code above, but there’s a subtle but important difference: the top version, when evaluated in real-number arithmetic, can have prob_for_1 get arbitrarily close to 0 or 1. The bottom version cannot; when prob_for_1_scaled ≤ 31, it cannot shrink further, and likewise, it cannot grow past scale_factor – 31. So the version below (with a fixed-point scale factor of 4096 and using a right-shift of 5) will keep scaled probabilities in the range [31,4065], corresponding to about [0.0076, 0.9924].

Note that with a shift of 5, we always stay 31 steps away from the top and bottom end of the interval – independent of what our scale factor is. That means, somewhat counter-intuitively, that both the scale factor and the adaptation rate determine what the minimum and maximum representable probability are. In compression terms, the clamped minimum and maximum probabilities are equivalent to mixing a uniform model into a “real” (infinite-precision) Howard/Vitter-style adaptive model, with low weight. Accounting for this fact is important when trying to analyze (and understand) the behavior of the fixed-point models.

A symmetric formulation

Implementations of binary adaptive models usually only track one probability; since there are two symbols and the probabilities must sum to 1, this is sufficient. But since we’re interested in models for larger than binary alphabets, let’s aim for a more symmetric formulation in the hopes that it will generalize nicely. To do this, we take the probabilities p_0 and p_1 for 0 and 1, respectively, and stack them into a column vector p:

\mathbf{p} = \begin{pmatrix} p_0 \\ p_1 \end{pmatrix}

Now, let’s look at what the Howard/Vitter update rule produces for the updated vector p′ when we see a 0 bit (this is just plugging in the update formula from the first program above and using p_1 = 1 - p_0):

\mathbf{p}' = \begin{pmatrix} 1 - f p_1 \\ f p_1 \end{pmatrix}  = \begin{pmatrix} 1 - f (1 - p_0) \\ f p_1 \end{pmatrix}  = f \mathbf{p} + \begin{pmatrix} 1 - f \\ 0 \end{pmatrix}

And for the “bit is 1” case, we get:

\mathbf{p}' = \begin{pmatrix} 1 - (f p_1 + (1 - f)) \\ f p_1 + (1 - f) \end{pmatrix}  = \begin{pmatrix} f p_0 \\ f p_1 + (1 - f) \end{pmatrix}  = f \mathbf{p} + \begin{pmatrix} 0 \\ 1 - f \end{pmatrix}

And just like that, we have a nice symmetric formulation of our update rule: when the input bit is i, the updated probability vector is

\mathbf{p}' = f \mathbf{p} + (1 - f) \mathbf{e}_i

where the e_i are the canonical basis vectors. Once it’s written in this vectorial form, it’s obvious how to generalize this adaptation rule to a larger alphabet: just use a larger probability vector! I’ll go over the implications of this in a second, but first, let me do a brief interlude for those who recognize the shape of that update expression.

Aside: the DSP connection

Interpreted as a discrete-time system

\mathbf{p}_{k+1} = f \mathbf{p}_k + (1-f) \mathbf{e}_{s_k}

where the s_k denote the input symbol stream, note that we’re effectively just running a multi-channel IIR filter here (one channel per symbol in the alphabet). Each “channel” simply linearly interpolates between its previous state and the new input value – it’s a discrete-time leaky integrator, a 1st-order low-pass filter.

Is it possible to use other low-pass filters? You bet! In particular, one popular variant on the fixed-point update rule has two accumulators with different update rates and averages them together. The result is equivalent to a 2nd-order (two-pole) low-pass filter. Its impulse response, corresponding to the weight of symbols over time, initially decays faster but has a longer tail than the 1st-order filter.
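For illustration, a minimal sketch of that two-accumulator idea in the binary case; the two rates (4 and 7) and the .12 fixed-point scale are arbitrary choices for this sketch, not taken from any particular codec:

int p_fast = 2048, p_slow = 2048; // .12 fixed point, both start at 0.5

void adapt(int bit)
{
    if (bit == 0) {
        p_fast -= p_fast >> 4;   // fast-adapting pole
        p_slow -= p_slow >> 7;   // slow-adapting pole
    } else {
        p_fast += (4096 - p_fast) >> 4;
        p_slow += (4096 - p_slow) >> 7;
    }
}

int prob_for_1() { return (p_fast + p_slow) >> 1; } // average of the two poles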

And of course you can use FIR low-pass filters too: for example, the incremental box filters described in my post Fast blurs 1 correspond to a “sliding window” model. This readily generalizes to piecewise-polynomial weighting kernels using the approach described in Heckbert’s “Filtering by repeated integration”. I doubt that these are actually useful for compression, but it’s always nice to know you have the option.

Can we use arbitrary low-pass filters? Alas, we can’t: we absolutely need linear filters with unit gain (so that the sum of all probabilities stays 1 exactly), and furthermore our filters need to have non-negative impulse responses, since we’re dealing with probabilities here, and probabilities need to stay non-negative. Still, we have several degrees of freedom here, and at least in compression, this space is definitely under-explored right now. I’m sure some of it will come in handy.

Practical non-binary adaptive models

As noted before, the update expression we just derived

\mathbf{p}' = f \mathbf{p} + (1 - f) \mathbf{e}_i

quite naturally generalizes to non-binary alphabets: just stack more than two probabilities into the vector p. But it also generalizes in a different direction: note that we’re just linearly interpolating between the old model p and the “model” for the newly observed symbol, as given by e_i. We can change the model we’re mixing in if we want different update characteristics. For example, instead of using a single spike at symbol i (as represented by e_i), we might decide to slightly boost the probabilities for adjacent values as well.
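In floating point, this generalized update can be sketched as follows; mixin here is a stand-in for whatever distribution you decide to blend towards (a pure spike e_i, a smeared-out spike, and so on), not anything defined above:

// One adaptation step: p' = f*p + (1-f)*mixin.
// p and mixin are probability vectors of length nsyms, each summing to 1.
void adapt(double *p, const double *mixin, int nsyms, double f)
{
    for (int i = 0; i < nsyms; ++i)
        p[i] = f * p[i] + (1.0 - f) * mixin[i];
}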

Another option is to blend between the “spike” and a uniform distribution with a given weight u:

\mathbf{p}' = f \mathbf{p} + (1 - f) ((1-u) \mathbf{e}_i + u\mathbf{1})

where 1 is the vector consisting of all-1s. This is a way to give us the clamping of probabilities at some minimum level, like we get in the fixed-point binary adaptive models, but in a much more controlled fashion; no implicit dependency on the scaling factor in this formulation! (Although of course an integer implementation will still have some rounding effects.) Note that, for a fixed value of u, the entire right-hand side of that expression only depends on i and does not care about the current value of p at all.

Okay, so we can design a bunch of models using this on paper, but is this actually practical? Note that, in the end, we will still want to get fixed-point integer probabilities out of this, and we would greatly prefer them to all sum to a power of 2, because then we can use rANS. This is where my previous post “Mixing discrete probability distributions” comes in. And this is where I’ll stop torturing you and skip straight to the punchline: a working adaptation step, with pre-computed mix-in models (depending on i only), can be written like this:

int rate = 5;       // use whatever you want!
int CDF[nsyms + 1]; // CDF[nsyms] is pow2
int mixin_CDFs[nsyms][nsyms + 1];

void adapt(int sym)
{
    // no need to touch CDF[0] and CDF[nsyms]; they always stay
    // the same.
    int *mixin = mixin_CDFs[sym];
    for (int i = 1; i < nsyms; ++i)
        CDF[i] += (mixin[i] - CDF[i]) >> rate;
}

which is a pretty sweet generalization of the binary model update rule. As per the “Mixing discrete probability distributions” post, this has several non-obvious nice properties. In particular, if the initial CDF has nonzero probability for every symbol, and all the mixin_CDFs do as well, then symbol probabilities will never drop down to zero as a result of this mixing, and we always maintain a constant total throughout. What this variant doesn’t handle very well is round-off; there are better approaches (although you do want to make sure that something like the spacing lemma from the mixing article still holds), but this variant is decent, and it’s certainly the most satisfying because it’s so breathtakingly simple.

Note that the computations for each i are completely independent, so this is data-parallel and can be written using SIMD instructions, which is pretty important to make larger models practical. This makes the update step fast; you do still need to be careful in implementing the other half of an arithmetic coding model, the symbol lookup step.
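For completeness, here is one hypothetical way the mixin_CDFs above might be precomputed, in the spirit of the spike-plus-uniform blend: every symbol gets a small constant floor frequency, and the spiked symbol gets all the remaining probability mass. The floor value is an arbitrary choice for this sketch:

void init_mixin_cdfs(int T) // T = CDF[nsyms], a power of 2
{
    const int floor_freq = 1; // uniform floor so no symbol ever hits zero
                              // (assumes nsyms * floor_freq < T)
    for (int s = 0; s < nsyms; ++s) {
        int *m = mixin_CDFs[s];
        m[0] = 0;
        for (int i = 0; i < nsyms; ++i) {
            int freq = floor_freq;
            if (i == s)
                freq += T - nsyms * floor_freq; // the spike gets the rest
            m[i + 1] = m[i] + freq;
        }
        // Total is exactly T, so CDF[nsyms] stays put during adaptation.
    }
}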

Building from here

This is a pretty neat idea, and a very cool new building block to play with, but it’s not a working compressor yet. So what can we do with this, and where is it interesting?

Well. For a long time, in arithmetic coding, there were basically two choices. You could use larger models with very high per-symbol overhead: one or more divisions per symbol decoded, a binary search to locate the right symbol… pretty expensive stuff. Expensive enough that you want to make sure you code very few symbols this way. Adaptive models, using something like Fenwick trees, were even worse, and limited in the set of update rules they could support. In practice, truly adaptive large-alphabet models were exceedingly rare; if you needed some adaptation, you were more likely to use a deferred summation model (keep histogramming data and periodically rebuild your CDF from the last histogram) than true per-symbol adaptation, because it was so damn expensive.

At the other extreme, you had binary models. Binary arithmetic coders have much lower per-symbol cost than their non-binary cousins, and good binary models (like the fixed-point Howard/Vitter model we looked at) are likewise quite cheap to update. But the problem is that with a binary coder, you end up sending many more symbols. For example, in LZMA, a single 256-symbol alphabet model gets replaced with 8 binary coding steps. While the individual binary coding steps are much faster, they’re typically not so much faster that you can do 8× as many and still come out ahead. There are ways to reduce the number of symbols processed. For example, instead of a balanced binary tree binarization, you can build a Huffman tree and use that instead… yes, using a Huffman tree to do arithmetic coding, I know. This absolutely works, and it is faster, but it’s also a bit of a mess and fundamentally very unsatisfying.

But now we have options in the middle: with rANS giving us larger-alphabet arithmetic decoders that are merely slightly slower than binary arithmetic coding, we can do something better. The models described above are not directly useful on a 256-symbol alphabet. No amount of SIMD will make updating 256 probability estimates per input symbol fast. But they are useful on medium-sized alphabets. The LZNA in “Oodle LZNA” stands for “LZ-nibbled-ANS”, because the literals are split into nibbles: instead of having a single adaptive 256-symbol model, or a binary tree of adaptive binary models, LZNA has much flatter trees of medium-sized models. Instead of having one glacially slow decode step per symbol, or eight faster steps per symbol, we can decode an 8-bit symbol in two still-quite-fast steps. The right decomposition depends on the data, and of course on the relative speeds of the decoding steps. But hey, now we have a whole continuum to play with, rather than just two equally unsavory points at the extreme ends!
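To give a feel for what such a decomposition looks like in the decoder, here is a sketch of nibble-wise literal decoding. This is not LZNA’s actual model layout; RansDecoder, NibbleModel and a model-taking decode_symbol are stand-in names for whatever rANS decoder and adaptive models you use:

// Decode an 8-bit literal as two 16-symbol (nibble) coding steps,
// with the low-nibble model selected by the high nibble.
int decode_literal(RansDecoder &dec, NibbleModel &hi_model,
                   NibbleModel lo_models[16])
{
    int hi = dec.decode_symbol(hi_model);       // adaptive 16-symbol model
    int lo = dec.decode_symbol(lo_models[hi]);  // context: high nibble
    return (hi << 4) | lo;
}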

This is not a “just paste it into your code” solution; there’s still quite a bit of art in constructing useful models out of this, several fun tricks in actually writing a fast decoder for this, some subtle variations on how to do the updates. And we haven’t even begun to properly play with different mix-in models and “integrator” types yet.

Charles’ LZNA is the first compressor we’ve shipped that uses these ideas. It’s not gonna be the last.

UPDATE: As pointed out by commenter “derf_” on Hacker News, the non-binary context modeling in the current Daala draft apparently describes the exact same type of model. Details here (section 2.2, “Non-binary context modeling”, variant 3). This apparently was proposed in 2012. We (RAD) weren’t aware of this earlier (pity, that would’ve saved some time); very cool, and thanks derf for the link!



rANS in practice


This year, we (RAD) shipped two new lossless codecs, both using rANS. One of the two is Oodle LZNA (released in May), which Charles has already written about. The other is called BitKnit, which first shipped in July as part of Granny, and is slated for inclusion into more RAD products.

So, with two production-quality versions written and successfully shipped, this seems like a good time to write up some of the things we’ve learned, especially in terms of implementation concerns. Let’s get cracking! (I’m assuming you’re familiar with ANS. If not, I wrote a paper that has a brief explanation, and various older blog posts that give more details on the proofs. I’m also assuming you’re familiar with “conventional” arithmetic coding; if not, you’re not gonna get much out of this.)

One small note before we start…

I’ll be referring to the ANS family as a class of arithmetic coders, because that’s what they are (and so are “range coders”, by the way). So here’s a small historical note before we get cracking: the “bottom-up” construction of ANS and the LIFO encoding seem quite foreign once you’re used to most “modern” arithmetic coders, but what’s interesting is that some of the earliest arithmetic coders actually looked very similar.

In particular, I’m talking about Rissanen’s 1976 paper “Generalized Kraft Inequality and Arithmetic Coding” (which coined the term!). Note the encoding and decoding functions C and D on the second page, very reminiscent of the “bottom-up” construction of ANS (with the code being represented by a number that keeps growing), and the decoder returning symbols in the opposite order they were encoded!

Rissanen’s coder (with its rather cumbersome manual truncated floating point arithmetic subject to careful rounding considerations) never saw any widespread application, as far as I can tell. The coders that actually got traction use the now familiar top-down interval subdivision approach, and a different strategy to adapt to fixed-precision operation. But reading Rissanen’s paper from today’s perspective is really interesting; it feels like a very natural precursor to ANS, and much closer in spirit to ANS than to most of the other algorithms that descended from it.

Why rANS (and not FSE/tANS)?

On my blog especially, I’ve been talking almost exclusively about rANS, and not so much FSE/tANS, the members of the ANS family that have probably been getting the most attention. Why is that?

Briefly, because they’re good at different things. FSE/tANS are (nearly) drop-in replacements for Huffman coding, and have similar strengths and weaknesses. They have very low (and similar) per-symbol decode overhead, but the fast decoders are table-driven, where the table depends on the symbol probabilities. Building Huffman decoding tables is somewhat faster; FSE/tANS offer better compression. Both Huffman and FSE/tANS can in theory support adaptive probabilities, but there’s little point in doing anything but periodic rebuilds; true incremental updates are too slow to be worthwhile. At that point you might as well use a coder which is more suited to incremental adaptation.

Which brings us to rANS. rANS is (again, nearly) a drop-in replacement for multi-symbol Arithmetic coders (such as range coders). It uses fewer registers than most arithmetic coders, has good precision, and the decoder is division-free without introducing any approximations that hurt coding efficiency. Especially with the various tweaks I’ll describe throughout this post, rANS has what are easily the fastest practical multi-symbol alphabet arithmetic decoders I know. rANS coders are also quite simple in implementation, with none of the tricky overflow and underflow concerns that plague most arithmetic coders.

So rANS is a pretty sweet deal if you want a fast arithmetic coder that deals well with relatively fast-changing probabilities. Great! How do we make it work?

Reverse encoding

As mentioned above, ANS coders are LIFO: whatever order you encode symbols in, the decoder will produce them in the opposite order. All my ANS coders (including the public ryg_rans) use the convention that the encoder processes the data in reverse (working from the end towards the beginning), whereas the decoder works forwards (beginning towards end).

With a static model, this is odd, but not especially problematic. With an adaptive model, decoder and model have to process data in the same direction, since the decoder updates the model as it goes along and needs current model probabilities to even know what to read next. So the decoder and the model want to process data in the same direction (forward being the natural choice), and the rANS encoder needs to be processing symbols in the opposite order.

This is where it comes in handy that rANS has an interface very much like a regular arithmetic coder. For example, in my sample implementation, the symbol is described by two values, start and freq, which are equivalent to the symbol interval lower bound and size in a conventional arithmetic coder, respectively.

Most arithmetic coders perform the encoding operation right there and then. In rANS, we need to do the actual encoding backwards, which means we need to buffer the symbols first (the first description I’ve seen of this idea was in Matt Mahoney’s fpaqa):

// Say our probabilities use 16-bit fixed point.
struct RansSymbol {
  uint16_t start; // start of symbol interval
  uint16_t range; // size of symbol interval
};

class BufferedRansEncoder {
  std::vector<RansSymbol> syms; // or similar

public:
  void encode(uint16_t start, uint16_t range)
  {
    assert(range >= 1);
    assert(start + range <= 0x10000); // no wrap-around

    RansSymbol sym = { start, range };
    syms.push_back(sym);
  }

  void flush_to(RansEncoder &coder);
};

With this, we can use rANS exactly like we would any other arithmetic coder. However, it will not be generating the bitstream incrementally during calls to encode; instead, it simply buffers up operations to be performed later. Once we’re done we can then pop off the symbols one by one, in reverse order, and generate the output bitstream. Easy:

void BufferedRansEncoder::flush_to(RansEncoder &coder)
{
  // Replays the buffered symbols in reverse order to
  // the actual encoder.
  while (!syms.empty()) {
    RansSymbol sym = syms.back();
    coder.encode(sym.start, sym.range);
    syms.pop_back();
  }
}

Once you have this small piece of scaffolding, you really can use rANS as a drop-in replacement for a conventional arithmetic coder. There are two problems with this, though: if you use this to encode an entire large file, your symbol buffer can get pretty huge, and you won’t get a single bit of output until you’ve processed the entire input stream.

The solution is simple (and can also be found in the aforementioned fpaqa): instead of accumulating all symbols emitted over the entire input stream and doing one big flush at the end, you just process the input data in chunks and flush the coder periodically, resetting the rANS state every time. That reduces compression slightly but means the encoder memory usage remains bounded, which is an important practical consideration. It also guarantees that output is not delayed until the end of stream; finally, lots of compressors are already chunk-based anyway. (For the opportunity to send incompressible data uncompressed rather than wasting time on the decoder end, among other things.)
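A sketch of such chunked flushing, building on the BufferedRansEncoder above; the chunk size, the SymbolSource/Sym types and the coder’s finish step are all hypothetical stand-ins:

static const size_t kChunkSyms = 64 * 1024; // arbitrary chunk size

void encode_chunked(SymbolSource &src)
{
    while (!src.done()) {
        BufferedRansEncoder buf;
        for (size_t i = 0; i < kChunkSyms && !src.done(); ++i) {
            Sym s = src.next();            // yields (start, range) as before
            buf.encode(s.start, s.range);
        }

        RansEncoder coder;                 // fresh rANS state for this chunk
        buf.flush_to(coder);               // replays buffered symbols in reverse
        coder.finish();                    // write final state, emit the chunk
    }
}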

Basic interleaving

One thing that rANS makes easy is interleaving the output from several encoders into a single bitstream without needing any extra signaling. I’m not going into detail why that works here; I wrote a paper on the subject if you’re interested in details. But the upshot is that you can use multiple rANS encoders and decoders simultaneously, writing to the same output bitstream, rather than having a single one.

Why do we care? Because this is what decoding a single symbol via rANS looks like (adapted from my public ryg_rans code):

static const uint32_t kProbBits = 16;
static const uint32_t kProbMask = (1 << kProbBits) - 1;

class RansDecoder {
  uint32_t state; // current rANS state
  // (IO stuff omitted)

  uint32_t renormalize_state(uint32_t x)
  {
    // Byte-wise for simplicity; can use other ways.
    while (x < RANS_L)
      x = (x << 8) | read_byte();

    return x;
  }

public:
  uint32_t decode_symbol()
  {
    uint32_t x = state; // Current state value

    uint32_t xm = x & kProbMask; // low bits determine symbol
    Symbol sym = lookup_symbol(xm); // (various ways to do this)

    // rANS state advance
    x = sym.range * (x >> kProbBits) + xm - sym.start;
    x = renormalize_state(x);

    // Save updated state and return symbol
    state = x;
    return sym.id;
  }
};

Note how literally every single line depends on the results of the previous one. This translates to machine code that has a single, very long, dependency chain with relatively low potential for instruction-level parallelism (ILP). This is fairly typical for all entropy coder inner loops, by the way, not just rANS. And because superscalar processors depend on ILP to deliver high performance, this is bad news; we’re not making good use of the machine.

Hence interleaving. The idea is that we have two RansDecoder instances, each with their own state, but implicitly sharing the same bitstream read pointer (referenced by read_byte). Now, when we have code like this:

  RansDecoder dec0, dec1;
  // ...
  uint32_t sym0 = dec0.decode_symbol();
  uint32_t sym1 = dec1.decode_symbol();

the processor’s out-of-order execution logic can overlap execution of both decodes, effectively running them at the same time. The renormalize step for dec0 needs to happen before the renormalize of dec1, but other than that, there’s no dependencies between the two. For what it’s worth, this does not actually require out-of-order execution; a compiler for an in-order architecture can also work with this, provided it has enough dataflow information to know that dec0 calling read_byte() does not influence anything that dec1 does before its renormalize step. So what interleaving does is convert a very serial task into one that’s much more amenable to superscalar execution.

What it boils down to is this: a regular rANS decoder is a fast, precise, divide-less arithmetic decoder (which is nice to begin with). Interleave two streams using what is otherwise the exact same code, and you get a pretty good boost in performance; (very roughly) around 1.4× faster, on both the decoder and encoder. But this is by now an old hat; this was all in the initial release of ryg_rans.

Some of the early experiments leading up to what later became BitKnit used this directly, pretty much the same as in the ryg_rans example code, but it turned out to be a bit of a pain in the neck to work with: because the two streams need to interleave properly, the BufferedRansEncoder needs to keep track of which symbol goes to which stream, and both the encoder and decoder code need to (somewhat arbitrarily) assign symbols to either stream 0 or stream 1. You’d prefer the streams to keep alternating along any given control-flow path, but that’s not always possible, since sometimes you have conditionals where there’s an even number of symbols sent on one path, and an odd number sent on the other! So having two explicit streams: not so great. But we found a better way.

Implicit interleaving to the rescue

What we actually ended up doing was interleaving with a twist – literally. We give the underlying rANS encoders (and decoders) two separate state values, and simply swap the two stream states after every encoding and decoding operation (that’s where the “BitKnit” name comes from – it keeps two active states on the same “needle” and alternates between them). The modifications from the decoder shown above are pretty small:

class RansDecoder {
  uint32_t state1; // state for "thread 1"
  uint32_t state2; // state for "thread 2"

  // ...

public:
  uint32_t decode_symbol()
  {
    uint32_t x = state1; // Pick up thread 1

    // ---- BEGIN of code that's identical to the above

    uint32_t xm = x & kProbMask; // low bits determine symbol
    Symbol sym = lookup_symbol(xm); // (various ways to do this)

    // rANS state advance
    x = sym.range * (x >> kProbBits) + xm - sym.start;
    x = renormalize_state(x);

    // ---- END of code that's identical to the above

    // Save updated state, switch the threads, and return symbol
    state1 = state2; // state2 becomes new state1
    state2 = x;      // updated state goes to state2

    return sym.id;
  }
};

The changes to the encoder are analogous and just as simple. It turns out that this really is enough to get all the performance benefits of 2× interleaving, with none of the extra interface complexity. It just looks like a regular arithmetic decoder (or encoder). And assuming you write your implementation carefully, compilers are able to eliminate the one extra register-register move instruction we get from swapping the threads on most paths. It’s all win, basically.

Bypass coding

Borrowing a term from CABAC here; the “bypass coding mode” refers to a mode in the arithmetic coder that just sends raw bits, which you use for data that’s known a priori to be essentially random/incompressible, or at least not worth modeling further. With conventional arithmetic coders, you really need special support for this, since interleaving an arithmetic code stream with a raw bitstream is not trivial.

With rANS, that’s much less of a problem: you can just use a separate bitbuffer and mix it into the target bitstream with no trouble. However, you may not want to: rANS has essentially all of the machinery you need to act as a bit buffer. Can you do it?

Well, first off, you can just use the arithmetic coder with a uniform distribution to send a set number of bits (up to the probability resolution). This works with any arithmetic coder, rANS included, and is fairly trivial:

  // write value "bits" using "numbits"
  coder.encode(bits << (kProbBits - numbits),
                  1 << (kProbBits - numbits));

and the equivalent on the decoder side. However, this is not particularly fast. Fortunately, it’s actually really easy to throw raw bit IO into a rANS coder: we just add the bits at the bottom of our state variable (or remove them from there in the decoder). That’s it! The only thing we need to do is work out the renormalization condition in the encoder. Using the conventions from the bytewise ryg_rans, an example implementation of the encoder is:

static inline void RansEncPutBits(RansState* r, uint8_t** pptr,
  uint32_t val, uint32_t nbits)
{
  assert(nbits <= 16);
  assert(val < (1u << nbits));

  // nbits <= 16!
  RansState x = RansEncRenorm(*r, pptr, 1 << (16 - nbits), 16);

  // x = C(s,x)
  *r = (x << nbits) | val;
}

and the corresponding getbits in our ongoing example decoder looks like this:

class RansDecoder {
  // ...

  uint32_t get_bits(uint32_t nbits)
  {
    uint32_t x = state1; // Pick up thread 1

    // Get value from low bits then shift them out and
    // renormalize
    uint32_t val = x & ((1u << nbits) - 1);
    x = renormalize_state(x >> nbits);

    // Save updated state, switch the threads, and return value
    state1 = state2; // state2 becomes new state1
    state2 = x;      // updated state goes to state2

    return val;
  }
};

note that except for the funky state swap (which we carry through for consistency), this is essentially just a regular bit IO implementation. So our dual-state rANS admits a “bypass mode” that is quite cheap; usually cheaper than having a separate bit buffer would be (which would occupy yet another CPU register in the decoder), at least in my tests.

Note that if you combine this with the buffering encoder described above, you need a way to flag whether you want to emit a regular symbol or a burst of raw bits, so our RansSymbol structure (and the code doing the actual encoding) gets slightly more complicated since we now have two separate types of “opcodes”.

The implementation above has a limit of 16 bits you can write in a single call to RansEncPutBits. How many bits you can send at once depends on the details of your renormalization logic, and how many bits of rANS state you keep. If you need to send more than 16, you need to split it into multiple operations.
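A simple (hypothetical) wrapper for that split might look like the following; the only subtlety is that, because rANS is LIFO, the decoder reads the pieces back in the reverse of the order they were encoded:

static inline void RansEncPutBitsLarge(RansState* r, uint8_t** pptr,
  uint32_t val, uint32_t nbits)
{
  assert(nbits <= 32);
  if (nbits > 16) {
    // Encoded last means decoded first: the decoder should read the
    // high part first, then the low 16 bits.
    RansEncPutBits(r, pptr, val & 0xffff, 16);
    RansEncPutBits(r, pptr, val >> 16, nbits - 16);
  } else {
    RansEncPutBits(r, pptr, val, nbits);
  }
}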

Tying the knot

I got one more: a rANS encoder needs to write its final state to the bitstream, so that the decoder knows where to start. You can just send this state raw; it works just fine. That’s what the ryg_rans example code does.

However, rANS states are not equally likely. In fact, state x occurs with a probability proportional to 1/x. That means that an ideal code should spend approximately \log_2(x) bits to encode a final state of x. Charles has already written about this. Fortunately, the ideal coder for this distribution is easy: we simply send the index of the highest set bit in the state (using a uniform code), followed by the remaining bits.
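Sketched out against a generic bit writer (put_bits and clz32 below are stand-ins for whatever bit IO and leading-zero-count routine you have available), and assuming a 32-bit state:

// Flush a final 32-bit rANS state x using roughly log2(x) bits:
// send the position of the highest set bit, then the bits below it.
void flush_final_state(uint32_t x)
{
    assert(x != 0);                        // rANS states are always >= L
    uint32_t top = 31 - clz32(x);          // index of the highest set bit
    put_bits(top, 5);                      // uniform code for the position
    put_bits(x & ((1u << top) - 1), top);  // remaining bits; the leading 1 is implied
}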

One option is to do this using regular bit I/O. But now you still need a separate bit IO implementation!

Fortunately, we just covered how to send raw bits through a rANS encoder. So one thing we can do is encode the final state value of stream 2 using the “stream 1” rANS as the output bit buffer, using the putbits functionality just described (albeit without the thread-switching this time). Then we send the final state of the “stream 1” rANS raw (or using a byte-aligned encoding).

This approach is interesting because it takes a pair of two rANS encoder threads and “ties them together” – making a knot, so to speak. In the decoder, undoing the knot is serial (and uses a single rANS decoder), but immediately after initialization, you have a working dual-stream coder. This saves a few bytes compared to the sloppier flushing and is just plain cute.

This technique really comes into its own for the wide-interleave SIMD rANS coders described in my paper, because it can be done in parallel on lots of simultaneous rANS coders in a reduction tree: group lanes into pairs, have each odd-indexed lane flush into its even-indexed neighbor. Now look at groups of 4 lanes; two have already been flushed, and we can flush the rightmost “live” lane into the leftmost lane coder. And so forth. This allows flushing a N× interleaved SIMD rANS coder in O(\log(N)) coding operations, and still has some parallelism while doing so. This is not very exciting for a 2× or 4× interleaved coder, but for GPU applications N is typically on the order of 32 or 64, and at that level it’s definitely interesting.

Conclusion and final notes

Using the techniques described in this post, you can write rANS encoders and decoders that have about the same amount of code as a conventional arithmetic coder with comparable feature set, have a similar interface (aside from the requirement to flush the encoder regularly), are significantly faster to decode (due to the combination of the already-fast rANS decoder with implicit interleaving), and have very cheap “bypass coding” modes.

This is a really sweet package, and we’re quite happy with it. Anyone interested in (de)compression using adaptive models should at least take a look. (For static models, FSE/tANS are stronger contenders.)

What example code there is in this article uses byte-wise renormalization. That’s probably the simplest way, but not the fastest. Oodle LZNA uses a 63-bit rANS state with 32-bits-at-a-time renormalization, just like rans64.h in ryg_rans. That’s a good choice if you’re primarily targeting 64-bit platforms and can afford a 64-bit divide in the encoder (which is quite a bit more expensive than a 32-bit divide on common CPUs). BitKnit uses a 32-bit rANS state with 16-bits-at-a-time renormalization, similar to the coder in ryg_rans rans_word_sse41.h. This is friendlier to 32-bit targets and admits a branch-free renormalize option, but also means the coder has somewhat lower precision. Using a probability range of 16 bits would not be wise in this case; BitKnit uses 14 bits.


End-of-buffer checks in decompressors


This post is about general techniques for handling end-of-buffer checks in code that processes an input stream a byte at a time, or a few bytes at a time at the most. Concretely, I’ll be talking about decompression code, but many of these ideas are also applicable to related sequential input processing tasks like lexical analysis.

A basic decoder

To show how the problem crops up, let’s look at a simple decompressor and at what happens when we try to make an efficient implementation. Here’s our simple decoder for a toy LZ77 variant:

while (!done) { // main loop
    if (get_bits(1) != 0) { // match
        int offset = 1 + get_bits(13);
        int len = 3 + get_bits(5);

        copy_match(dest, dest - offset, len);
    } else { // uncompressed 8-bit literal
       *dest++ = get_bits(8);
    }
}

This particular coding scheme is just arbitrarily chosen to have a simple example, by the way. It’s not one I would actually use.

What does get_bits look like? The design space of bit IO is a big topic on its own, and I won’t be spending any time on the trade-offs here; let’s just use a basic variant with MSB-first (big endian-like) bit packing, reading the input stream from a memory buffer, one byte at a time:

const uint8_t *input_cursor;    // current input cursor
const uint8_t *input_end;       // end of input buffer

uint8_t read_byte()
{
    // If we reached the end of the input buffer, return 0!
    if (input_cursor >= input_end)
        return 0;

    return *input_cursor++;
}

uint32_t bitcount; // number of bits in bitbuf
uint32_t bitbuf;   // values of bits (bitcount bits from MSB down)

uint32_t get_bits(uint32_t nbits)
{
    assert(0 < nbits && nbits <= 24);

    // Refill: read extra bytes until we have enough bits
    // in buffer. Insert new bits below the ones we already
    // have.
    while (bitcount < nbits) {
        bitbuf |= read_byte() << (24 - bitcount);
        bitcount += 8;
    }

    // The requested bits are the top nbits bits of bitbuf.
    uint32_t ret = bitbuf >> (32 - nbits);

    // Shift them out.
    bitbuf <<= nbits;
    bitcount -= nbits;
    return ret;
}

Note we do an explicit end-of-buffer check in read_byte and return a defined value (0 in this case) past the end of the input stream. This kind of check is generally required to avoid crashes (or buffer overrun bugs!) if there is any chance the input stream might be invalid or corrupted – be it as the result of a deliberate attack, or just a transmission error. Returning 0 past the end of buffer is an arbitrary choice, but a convention I tend to stick with in my code.

Reducing overhead

As for get_bits, the implementation is a fairly typical one. However, as should be obvious, reading a few bits like this is still a relatively involved process, because every call to get_bits involves the refill check and an update of the bit buffer state. A key trick in many decompressors is to reduce this overhead by separating looking at bits from consuming them, which allows us to grab lots of bits at once (speculatively), and then later decide how far to move the input cursor. This basically boils down to splitting get_bits into two parts:

uint32_t peek_bits(uint32_t nbits)
{
    assert(0 < nbits && nbits <= 24);

    // Refill: read extra bytes until we have enough bits
    // in buffer. Insert new bits below the ones we already
    // have.
    while (bitcount < nbits) {
        bitbuf |= read_byte() << (24 - bitcount);
        bitcount += 8;
    }

    // Return requested bits, starting from the MSB in bitbuf.
    return bitbuf >> (32 - nbits);
}

void consume_bits(uint32_t nbits)
{
    assert(nbits <= bitcount);
    bitbuf <<= nbits; // shift them out
    bitcount -= nbits;
}

Using this new interface, we can modify our decoder to reduce bit IO overhead, by doing a single peek_bits call early and then manually extracting the different sub-fields from it:

while (!done) { // main loop
    // We read up to 19 bits; grab them all at once!
    uint32_t bits = peek_bits(19);
    if (bits & (1u << 18)) { // match bit set?
        int offset = 1 + ((bits >> 5) & 0x1fff);
        int len = 3 + (bits & 0x1f);

        consume_bits(19); // 1b flag + 13b offs + 5b len
        copy_match(dest, dest - offset, len);
    } else { // uncompressed 8-bit literal
        *dest++ = (uint8_t) (bits >> 10);
        consume_bits(9); // 1b flag + 8b value
    }
}

This trick of peeking ahead and deciding later how many bits were actually consumed is very important in practice. The example given here is a simple one; a very important use case is decoding Huffman codes (or other variable-length codes) aided by a look-up table.
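For illustration, here is roughly what that looks like for a table-driven decoder of Huffman (or other prefix) codes; MAX_CODE_LEN (at most 24 here, to satisfy peek_bits) and the table construction are assumed to exist elsewhere:

struct HuffEntry {
    uint8_t symbol; // decoded symbol
    uint8_t len;    // length of its code in bits
};

HuffEntry dec_table[1 << MAX_CODE_LEN]; // filled in from the code lengths

uint32_t decode_huff_symbol()
{
    uint32_t bits = peek_bits(MAX_CODE_LEN); // speculatively grab max length
    HuffEntry e = dec_table[bits];           // top bits select the entry
    consume_bits(e.len);                     // only consume the actual code length
    return e.symbol;
}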

Note, however, that we changed the input behavior: before, we really only called read_byte when we knew it was necessary to complete reading the current code. Now, we peek ahead more aggressively, and will actually peek past the end of the input bitstream whenever the last token is a literal. It’s possible to avoid this type of problem by being more restrained in the usage of peek_bits: only ever peek ahead by the minimum amount of bits that we know is going to get consumed no matter what. However, doing so forces us to do a bit more work at runtime than the code fragment shown above entails.

That said, the variant shown above is still completely correct: our implementation of read_byte checks for the end of the input stream, and returns zeroes once we’ve passed it. But this is no longer an exceptional condition: rather than being a “contingency plan” in case of corrupted input data, we can now expect to hit this path when decoding many valid bit streams.

In short, we’re taking a check we need for correctness (the end-of-buffer check) and making it serve double duty to simplify the rest of our decoder. So far, all the code we’ve seen is very standard and not remarkable at all. The resulting bit-IO implementation is fairly typical, more so once we stop trying to only call read_byte when strictly necessary and simplify the buffer refill logic slightly by always refilling to have >24 bits in the buffer no matter what the peek amount is.

Even beyond such details, though, this underlying idea is actually quite interesting: the end-of-buffer check is not one we can easily get rid of without losing correctness (or at least robustness in the face of invalid data). But we can leverage it to simplify other parts of the decoder, reducing the “sting”.

How far can we push this? If we take as granted that reading past the end of the buffer is never acceptable, what is the least amount of work we can do to enforce that invariant?

Relaxed requirements

In fact, let’s first go one further and just allow reading past the end-of-buffer too. You only live once, right? Let’s pull out all the stops and worry about correctness later!

It turns out that if we’re allowed to read a few bytes past the end of the buffer, we can use a nifty branch-free refill technique. At this point, I’m going to manually inline the bit IO so we can see more clearly what’s going on:

while (!done) { // main loop
    // how many bytes to read into bit buffer?
    uint32_t refill_bytes = (32 - bitcount) / 8;

    // refill!
    bitbuf |= read_be32_unaligned(input_cursor) >> bitcount;
    bitcount += refill_bytes * 8;
    input_cursor += refill_bytes;

    assert(bitcount > 24);

    // peek at next 19 bits
    uint32_t bits = bitbuf >> (32 - 19);

    if (bits & (1u << 18)) { // match bit set?
        int offset = 1 + ((bits >> 5) & 0x1fff);
        int len = 3 + (bits & 0x1f);

        // consume_bits(19);
        bitbuf <<= 19;
        bitcount -= 19;
        copy_match(dest, dest - offset, len);
    } else { // uncompressed 8-bit literal
        *dest++ = (uint8_t) (bits >> 10);
        // consume_bits(9);
        bitbuf <<= 9;
        bitcount -= 9;
    }
}

This style of branchless bit IO is used in e.g. Yann Collet’s FSE and works great when the target machine supports reading unaligned 32-bit big endian values quickly — the read_be32_unaligned function referenced above. This is the case on x86 (MOV and BSWAP or just MOVBE where supported), ARMv6 and later (LDR provided unaligned accesses are allowed, plus REV when in little-endian mode) and POWER/PPC; not sure about other architectures. And for what it’s worth, I’m only showing 32-bit IO here, but this technique really comes into its own on 64-bit architectures, since having at least 56 bits in the buffer means we can usually go for a long while without further refill checks.

That’s a pretty nice decoder! The only problem being that we have no insurance against corrupted bit streams at all, and even valid streams will read past the end of the buffer as part of regular operation. This is, ahem, hardly ideal.

But all is not lost. We know exactly how this code behaves: every iteration, it will try reading 4 bytes starting at input_cursor. We just need to make sure that we don’t execute this load if we know it’s going to be trouble.

Let’s say we work out the location of the spot where we need to start being careful:

// Before the decoder runs:
const uint8_t *input_mark;

if (input_end - input_cursor >= 4)
    input_mark = input_end - 4;
else
    input_mark = input_cursor;

The simplest thing we can do with that information is to just switch over to a slower (but safe) decoder once we’re past that spot:

while (!done && input_cursor <= input_mark) {
    // fast decoder here: we know that reading 4 bytes
    // starting at input_cursor is safe, so we can use
    // branchless bit IO
}

while (!done) {
    // finish using safe decoder that refills one byte at
    // a time with careful checks!
}

This works just fine, and is the technique chosen in e.g. the zlib inflate implementation: one fast decoder that runs when the buffer pointers are well away from the boundaries, and a slower decoder that does precise checking.

Note that the input_cursor <= input_mark check is the only addition to our fast decoder that was necessary to make the overall process safe. There is some more prep work, and it turns out we ended up with an entire extra copy of the decoder for the cold “near the end of the buffer” path, but the path we expect to be much more common — decoding while still being safely away from the end of the input stream — really only does that one extra compare (and branch) more than the “fast but unshippable” decoder does!

And now that I’ve done my due diligence and told you about the boring way that involves code duplication, let’s do something much more fun instead!

One decoder should be enough for anyone!

The problem we’re running into is that our buffer is running out of bytes to read. The “safe decoder” solution just tries to be really careful in that scenario. But if we’re not feeling very careful today, well, there’s always the ham-fisted alternative: just switch to a different input buffer that’s not as close to being exhausted yet!

Our input buffers are just arrays of bytes. If we start getting too close to the end of our “real” input buffer, we can just copy the remaining bytes over to a small temp buffer that ends with a few padding bytes:

uint8_t temp_buf[16]; // any size >=4 bytes will do.

while (!done) {
    if (input_cursor >= input_mark) {
        assert(input_cursor < input_end);

        // copy remaining bytes to temp_buf
        size_t bytes_left = (size_t) (input_end - input_cursor);
        assert(bytes_left < sizeof(temp_buf));
        memmove(temp_buf, input_cursor, bytes_left);

        // fill rest of temp_buf with zeros
        memset(temp_buf + bytes_left, 0, sizeof(temp_buf) - bytes_left);

        // and update our buffer pointers!
        input_cursor = temp_buf;
        input_end = temp_buf + sizeof(temp_buf);
        input_mark = input_end - 4;
    }

    assert(input_cursor <= input_mark);
    // rest of fast decoder using branchless bit IO
}

And with that little bit of extra logic, we can use our fast decoder for everything: note that we never read past the bounds of the original buffer. Also note that the logic given above can generate an arbitrary amount of trailing zero bytes: if after swapping buffers around, our input cursor hits the mark again, we just hit the refill path again to generate more zeroes. (This is why the copying code uses memmove).

This is nifty already, but we can push this idea much further still.

Switching input buffers

So far, we’re effectively switching from our regular input buffer to the conceptual equivalent of /dev/zero. But there’s no need for that restriction: we can use the same technique to switch over to a different input buffer altogether.

We again use a temporary transition buffer that we switch to when we reach the end of the current input buffer, but this time, we copy over the first few bytes from the next input buffer after the end of the current buffer, instead of filling the rest with zeroes. We still do this using our small temp buffer.

We place our input mark at the position in the temp buffer where data from the new input buffer starts. Once our input cursor is past that mark, we can change pointers again to resume reading from the new input buffer directly, instead of copying data to the temp buffer.
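
To make that hand-off concrete, here’s a rough sketch of the copy step; the next_input/next_input_end pointers describing the following buffer are names I’m making up for illustration, not part of any particular implementation:

// Hedged sketch: fill the transition buffer with the tail of the old
// buffer followed by the head of the next one.
if (input_cursor >= input_mark) {
    size_t tail = (size_t) (input_end - input_cursor); // unread old bytes
    assert(tail < sizeof(temp_buf));
    memmove(temp_buf, input_cursor, tail);

    // then as many bytes from the start of the next buffer as will fit
    size_t head = (size_t) (next_input_end - next_input);
    if (head > sizeof(temp_buf) - tail)
        head = sizeof(temp_buf) - tail;
    memcpy(temp_buf + tail, next_input, head);

    input_cursor = temp_buf;
    input_end = temp_buf + tail + head;
    // data from the new buffer starts right after the old tail; once the
    // cursor passes this mark, switch to reading the new buffer directly.
    input_mark = temp_buf + tail;
}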

Note that handling cases like really short input buffers (shorter than our 4-byte “lookahead window”) requires some care here, whereas it’s not a big deal when we do the bounds checking on every consumed input byte. We’re not getting something for nothing here: our “sloppy” end-of-input window simplifies the core loop at the expense of adding some complexity in the boundary case handling.

Once we reach the actual end of the input stream, we start zero-filling, just as before. This all dovetails nicely into my old post “Buffer-centric IO” which combines very well with this technique. Together, we get almost-zero-copy IO, except for the copies into the transition buffer near buffer boundaries, which only touch a small fraction of all bytes and are there to make our lives easier.

A final few generalizations

The example I’ve been using was based on a single get_bits (or later peek_bits) call, but that detail is not essential. The crucial property we’re exploiting in the decoder above is that we have a known bound for the number of bytes that can be consumed by a single iteration of the loop. As long as we can establish such a bound, we can do a single check per iteration, and in general, we need to check our input cursor at least once inside every loop that consumes a potentially unbounded (or at least large) number of input bytes — which in this example is only the main loop.

For the final generalization, note that a lot of compressors use a stream interface similar to zlib. In essence, this is a buffer interface similar to the one described in “Buffer-centric IO” for both the input and output buffers; the decompressor then gets called and processes data until the input or output buffers are exhausted, the end of stream is reached, or an error occurs, whichever happens first. This type of interface makes the (de)compressor somewhat harder to write but is much more convenient for the client code, which can just hand in whatever.

A typical way to implement this type of interface is described in Simon Tatham’s old article “Coroutines in C” — the key property being that the called function needs to be able to save its state at any point where I/O happens, in case it runs out of buffer space; and furthermore it needs to be able to later resume at exactly that point.

The solution is to effectively turn the (de)compressor into a state machine, and Tatham’s article describes a way to do so using a variant of Duff’s Device, quite probably the most infamous coding trick in the C language. Most (de)compressors with a zlib-like interface end up using this technique (or an equivalent) so they can jump into the middle of the decoder and resume where they left off.

So why do I mention all this? Well, the technique I’ve outlined in this article is applicable here as well: Tatham’s description assumes byte-level granularity IO, which means there’s generally lots of points inside the decoder main loop where we might need to save our state and resume later. If the decoder instead ensures there’s enough bytes left in the buffer to make it through one full iteration of the main loop no matter what, that means we have many fewer points where we need to save our state and later resume, often only in a single location.

What’s particularly interesting about combining the relaxed-refill technique with a coroutine-style decoder is that all of the refill and transition buffer logic can be pulled outside of the decoder proper. In library code, that means it can be shared between multiple decoders; so the logic that deals with the transition buffers and short input buffers only needs to be implemented and debugged once.

Discussion

The key simplification in this scheme is relaxing the strict “check for end of buffer on every byte consumed” check. Instead, we establish an upper bound N on the number of input bytes that can be consumed in a single iteration through our decoder main loop, and make sure that our current input buffer always has at least N bytes left — by switching to a different temporary input buffer if necessary.

This allows us to reduce the number of end-of-buffer checks we need to execute substantially. More importantly, it greatly increases the applicability of branch-less refill techniques in bit IO and arithmetic coding, without having to keep a separate “safe” decoder around.

The net effect is one of concentrating a little complexity from several places in hot code paths (end-of-buffer checks on every byte consumed) into somewhat increased complexity in a single cold code path (buffer switching). This is often desirable.

The biggest single caveat with this technique is that as a result of the decoder requiring N bytes in the input buffer at all times, the decoder effectively “lags behind” by that many bytes – or, depending on your point of view, it “looks ahead” by N bytes, reading from the input stream sooner than strictly necessary.

This can be a problem when, for example, several compressed streams are concatenated into a single file: the decoder may only get to decoding the “end of stream” symbol for stream A after N bytes from stream B have already been submitted to the decoder. The decoder would then need to “un-read” (in the sense of ungetc) the last few bytes or seek backwards. No matter how you dice it, this is annoying and awkward.

As a result, this technique is not all that useful when this is a required feature (e.g. as part of a DEFLATE decoder obeying the zlib interface).
However, there are ways to sidestep this problem: if the bitstream specifies the compressed size for either the entire stream or individual blocks, or if the framing format ends in N or more trailing “footer” bytes (a checksum or something similar), we can use this approach just fine.

UPDATE: As commenter derf_ notes on Hacker News, there’s a nice trick to produce implicit trailing zero bits in a bit reader like the one described above by just setting bitcount to a high value once the last byte’s been read into bitbuf. However, this only works with a decoder exactly like the one shown above. The nice part about switching to an explicit zero-padding buffer is that it works not just with all bit IO implementations I’m aware of, but also with byte-normalized (or larger) arithmetic coders like typical range coders or rANS.


Repeated match offsets in BitKnit


I spent the majority of last year working on LZ77-style codecs. I’ve written about some results before. But there were also several smaller (in scope) but still quite neat discoveries along the way.

One of them has to do with repeated match offsets. BitKnit was originally designed for Granny files, which usually contain 3D meshes, animations, sometimes textures, and can also store other user-defined data. As far as a compressor is concerned, Granny files are highly structured, mostly consisting of a few large, homogeneous arrays of fixed-size records.

Repeated match offsets

Often, there is significant correlation between adjacent records, for various reasons. What this means in a LZ77-style dictionary compressor is that there will usually be a lot of matches with a match distance (or match offset) that is a small integer multiple of the record size, and matches with the same offset tend to clump together.

The way LZ77 compressors typically exploit this fact is by reserving special codes for “reuse a recent match distance”. To my knowledge, this technique first appeared in LZX, which keeps a 3-element cache of recent match offsets with a LRU eviction policy. The basic idea seems to have spread from there. Many compressors (too many to list) reserve at least a single special, cheaper code to send another match with the same offset as the previous one (corresponding to a 1-element “cache”). This, among other things, gives a cheaper way to code “gap matches” (a match that resumes after being interrupted by a few mismatching bytes) and appears to be beneficial on most types of data.

On text and data that skews towards variable-size records, having extra codes for more repeated match offsets doesn’t help much, if at all (at least, they don’t seem to hurt, either). However, on data heavy on fixed-size records, it is often a big win. LZX, as mentioned before, has 3 “repeat offset slots”. LZMA uses 4. Several experiments early in the design of BitKnit indicated that at least for the highly structured Granny files it was designed for, there was a good case to be made for having even more repeat offset slots. We re-evaluated this several times, but a repeat offset count of 8 made it into the final codec; essentially, having a larger number of offset slots allows us to “get away” with an overall less sophisticated offset coder (reducing compression, but improving decoder speed), and is a very solid win on the highly structured data that was the target use case.

Experiments with lots of repeat match offsets

Two interesting problems arise from making the “repeat offset cache” this large. First, 8 entries is large enough that it’s worth thinking about different algorithmic variants. Second, at that size, it makes sense to investigate different eviction policies as well as other strategies such as maybe “pinning” a few match distances that we expect to be useful (for example, multiples of the record size in front of homogeneous sections).

Second part first: the effect of either “preloading” or pinning useful match distances was either “in the noise” (almost any change to a LZ encoder using adaptive models will make some files larger and others smaller simply due to getting a slightly different parse) or strictly worse in all our tests. Considering how much interface complication this implies (the pre-loaded offsets for different sections need to get to the compressor somehow, and they either need to be known in the decoder from other sources or stored in the compression stream, reducing the gains even further), the idea ends up being fairly uninteresting. Empirically, at least in our tests, LZ compressors find useful match distances quickly, and once they’re in the cache, they tend to stick around. Since such a cache is naturally adaptive to the data (whereas static pinned offsets are not), keeping them fully dynamic seems like a good idea.

The next test was to decouple the eviction policy from offset modeling. LZX, LZMA etc. always keep their list of recent match offsets in MRU order: slot 0 is the most recently used offset, slot 1 is the second-most recent, and so forth. One experiment we tried with a very early version of BitKnit was a variant I dubbed “stable index MRU”: offsets are still evicted on a LRU basis, but instead of shuffling the indices around on every match so that the new most recent match gets offset 0, new offsets would get inserted into the least recently referenced slot without moving the slot IDs around.

This affects the modeling; before, you have a very skewed distribution: slot 0 (most recent) is much more important than slot 1, which is more important than slot 2, and so forth. After, they are more spread out; but the idea was that in highly structured files where the same few offsets stick around for a fairly long time, you might capture more useful correlations by keeping these offsets in a single spot (which the entropy coder then tried to capitalize on).

Here were the results on a few granny files, listing the compressed sizes in bytes, with “anchor” being a x86 executable file that doesn’t have a significant amount of record-structured data in it, as a baseline. “MTF” refers to move-to-front index update policy, “stable” is the stable-index variant just described.

Configuration        granny1    granny2    granny3    anchor
4 offsets, MTF       18561994   22127519   15156085   1300427
4 offsets, stable    18533728   22261691   15177584   1300980
8 offsets, MTF       17825495   21746140   14800616   1300534
8 offsets, stable    17580201   21777192   14819439   1304406
12 offsets, MTF      17619775   21640749   14677706   1301007
12 offsets, stable   17415560   21448142   14681142   1306434
16 offsets, MTF      17554769   21523058   14600474   1300984
16 offsets, stable   17341154   21462241   14669793   1308508

First off, as you can see from these experiments, going from 4 to 8 repeat match offsets really does help significantly on these files; an extra 1.5%-4% reduction in file size may not sound like much, but it’s a fairly big deal in compression terms. The experiments with even more repeat offsets were mainly to get a feel for when we start to hit diminishing returns; also, as you can see, the compressed size for the “anchor” file (which is not record-structured) doesn’t seem to care much about the difference between 4 and 8 repeat offset slots, and gets worse after.

As for stable-index coding, well, it’s a mixed bag. It does help on some files, and on the files that do seem to improve from it, it’s a bigger win when using more offset slots, but on e.g. “granny3” and “anchor” it was a net negative. Interesting experiment, but it didn’t go in.

Insertion policy

Another experiment we ran was on insertion policy. Specifically, our hypothesis was that at least on highly structured data (where the repeat offsets really help), we really want to make sure the important offsets stay “in front”. But occasionally, you will still get other matches that “don’t fit the pattern”. The problem is that this puts some other random offset in front that will then slowly “slide down” and meanwhile cause our actually important offsets to be more expensive to code.

This is more of a problem with greedy LZ parsers (which make their decisions locally); optimizing parsers (which usually try to look ahead by a few kilobytes or so) are better at correctly estimating the cost of “disrupting” the set of offsets. Either way, it’s annoying.

We tried a couple different approaches with this; the best overall approach we found in our tests was to stick with a basic MTF coding scheme and LRU eviction policy (bog-standard in other words), but distinguish between updates caused by repeat matches and those caused by inserting a new offset not currently in the repeat offset set. The former (repeat matches) just do a full move-to-front step, as usual. The latter don’t; instead of inserting a new offset all the way in front, we insert it further back. If it then gets reused a second time, it really will be moved all the way to the front of the list; but if it doesn’t get referenced again, it will drop out more quickly and with less disruption of the remaining repeat offsets.

Here’s the batch of test results, from the same compressor version. “kSlotNew” is the slot where new offsets are inserted; 0 corresponds to inserting in front (regular MTF), 1 is the second position, and so forth.

Configuration             granny1    granny2    granny3    anchor
4 offsets, kSlotNew=0     18561994   22127519   15156085   1300427
4 offsets, kSlotNew=1     18450961   22153774   15154609   1304707
4 offsets, kSlotNew=2     18118014   22000266   15181128   1305755
4 offsets, kSlotNew=3     17736663   22002942   15209073   1307550
8 offsets, kSlotNew=0     17825495   21746140   14800616   1300534
8 offsets, kSlotNew=4     17327247   21546289   14771634   1305128
8 offsets, kSlotNew=6     17197347   21425116   14713121   1305588
16 offsets, kSlotNew=0    17554769   21523058   14600474   1300984
16 offsets, kSlotNew=14   17122510   21177337   14578492   1305432

We can see that the anchor prefers pure MTF, but the Granny files definitely see a win from not moving new offsets all the way to the front the first time they’re seen. There were a few more tests than the one shown, but in general, inserting new offsets in the second-to-last slot seemed like a good rule of thumb for the Granny files.

This one is definitely more contextual. As you can see, different types of files really prefer different settings here. BitKnit went with 8 offsets and insertion at the second-to-last slot (corresponding to the “8 offsets, kSlotNew=6” row above), because it produced the overall best results on the data it was designed for. (As evaluated on a larger test set not shown here.)

So, this is fairly neat, and a comparatively major win over the baseline 4 offsets and insert-in-front variant (a la LZMA) for the data in question. Now how to implement this efficiently?

Implementation notes

The basic implementation of the offset maintenance logic in the decoder is dead simple. You just keep an array of recent offsets and shuffle it around with something like this:

if (is_repeat_match) {
    // move slot "rep_idx" to front.
    // this involves grabbing the offset at the corresponding
    // location and then sliding everything before that position
    // down by one slot.
    tmp = offsets[rep_idx];
    for (uint i = rep_idx; i > 0; --i)
        offsets[i] = offsets[i - 1];
    offsets[0] = tmp;
} else {
    // implement the "insert in second-to-last position"
    // rule, which touches exactly two elements.
    offsets[kNumReps - 1] = offsets[kNumReps - 2];
    offsets[kNumReps - 2] = newOffset;
}

This works just fine, but it has a lot of data-dependent branches in the repeat match case, which is a performance trap in decompressors; generally speaking, you want to avoid branching on data you just read out of a bitstream, because it tends to be relatively high entropy and thus cause a lot of branch mispredictions, which are expensive.

One way to fix this is to add several entries worth of padding in front of the actual used part of offsets, and always copy the same number of entries in the “sliding down” phase. This gets rid of the data-dependent branches and makes it easy to unroll the loop fully (since the trip count is now constant) or express it using a few unaligned SIMD loads/stores (where supported).
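
For illustration, here’s a hedged sketch of that padded, constant-trip-count slide (the layout and names are mine, not BitKnit’s):

// kNumReps-1 padding entries sit in front of the live slots; sliding
// always copies kNumReps-1 entries, so the loop has a constant trip count.
enum { kNumReps = 8, kPad = kNumReps - 1 };
uint32_t offsets_storage[kPad + kNumReps];
uint32_t *offsets = offsets_storage + kPad; // offsets[0..kNumReps-1] are live

if (is_repeat_match) {
    uint32_t match_offs = offsets[rep_idx];
    // Copies below index 0 land in (and read from) the padding area;
    // whatever garbage ends up in slot 0 is overwritten right after.
    for (int i = 0; i < kNumReps - 1; ++i)
        offsets[rep_idx - i] = offsets[rep_idx - i - 1];
    offsets[0] = match_offs;
}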

However, BitKnit uses a different approach derived from our earlier experiments with “stable index MRU” that doesn’t need anything beyond regular integer arithmetic. The basic idea is to leave the offsets array alone; instead, we keep a secondary “data structure” that tells us which logical “repeat offset” list position corresponds to which index in the offsets array.

I write “data structure” in quotes because that information is actually stored in (drum roll)… a single 32-bit unsigned integer! Here’s the idea: we have a uint32_t mtf_state that represents the current offset permutation. It does this by storing the offset array index for the i’th logical repeat offset slot in the i’th nibble (numbered starting from the LSB upwards). At initialization time, we set mtf_state = 0x76543210, the identity mapping: the logical and actual offset indices coincide.

Why does this help? Because the fundamental operation for move-to-front processing is moving a bunch of offsets “one slot down” in their array position. If they’re separate integers, that means either a lot of copying, or less copying but using much wider (e.g. SIMD) instructions. Our array of 4-bit indices is compact enough that 8 indices fit inside a single 32-bit uint; we can slide them all “down” or “up” using nothing but a single bit shift. Now, our code above doesn’t actually move all elements, just the ones at positions below rep_idx; but that turns out to be easily remedied with some bit masking operations.

So the alternative variant is this:

if (is_repeat_match) {
    // move slot "rep_idx" to front by permuting mtf_state. first,
    // determine the offset slot ID at that position in the list
    uint32_t rep_idx4 = rep_idx*4;
    uint32_t slot_id = (mtf_state >> rep_idx4) & 0xf;
    match_offs = offsets[slot_id]; // decoder needs this later!

    // moved_mtf: slide down everything by one slot, then put
    // "slot_id" in front.
    uint32_t moved_mtf = (mtf_state << 4) + slot_id;
    uint32_t keep_mask = ~0xf << rep_idx4; // bits that don't move
    mtf_state = (mtf_state & keep_mask) | (moved_mtf & ~keep_mask);
} else {
    // implement the "insert in second-to-last position"
    // rule, which touches exactly two elements. this is easier
    // to do by just modifying the offsets directly.
    uint32_t last = (mtf_state >> ((kNumReps - 1)*4)) & 0xf;
    uint32_t before_last = (mtf_state >> ((kNumReps - 2)*4)) & 0xf;

    offsets[last] = offsets[before_last];
    offsets[before_last] = newOffset;
}

It’s a bit of integer arithmetic, but not a lot, and there’s no dependence on vector instructions, fast unaligned memory access, or in fact anything outside of standard C/C++. BitKnit uses a 32-bit mtf_state to implement an 8-entry LRU cache. Using 64-bit values (and still using nibbles to store array indices), the exact same approach (with essentially no modifications to the source save for type names) can manage a 16-entry LRU.

An 8-entry LRU actually only needs 24 bits (when storing array indices in groups of three bits instead of nibbles), but that’s not a very useful size. A 4-entry LRU state fits in 4*log2(4) = 8 bits, which is nice and compact, although for 4 entries, this way is generally not a win (at least in our tests).

And now it’s time to come clean: I kinda like this approach, and it’s the real reason I wrote this whole thing up. I probably would’ve still written it up even if it hadn’t turned out to be useful in practice, but it did, which is always a nice bonus.

Finally, over the years, I’ve found a few instances like this where packing a small “data structure” (using the term loosely) inside a single register-width integer produces interesting results. There’s a good chance I’ll write about more in the future! Until then.


Reading bits in far too many ways (part 1)


Resurrecting an old tradition of this blog, let’s take a simple problem, go over a way too long list of alternative solutions, and evaluate their merits.

Our simple problem is this: we want to read data encoded using some variable-bit-length code from a byte stream. Reading individual bytes, machine words etc. is directly supported by most CPUs and many programming languages, but for bit-granularity IO, you generally need to implement it yourself.

This sounds simple enough, and in some sense it is. The first source of problems is that this operation tends to be a hot spot in codecs — and yes, compute bound, not memory or I/O bound. So we’d like not just an implementation that works; we’d like it to be efficient as well. And along the way we’ll run into many other complications: interactions with IO buffering, end-of-buffer handling, corner cases in the way bit shifts are specified both in C/C++ and in various processor architectures, as well as other bit shift peculiarities.

I’ll mainly focus on many different ways to handle the reader side in this post; essentially all techniques covered in here apply equally to the writer side, but I don’t want to double the number of algorithm variations I’m presenting. There will be plenty as it is.

Degrees of freedom

“Read a variable number of bits” is not a sufficient problem specification. There are lots of plausible ways to pack bits into bytes, and all have their strengths and weaknesses that I’ll go into later. For now, let’s just cover the differences.

The first major decision to make is whether fields are packed MSB-first or LSB-first (“most significant bit” and “least significant bit”, respectively). That is, if we call our to-be-implemented function getbits and run the code sequence

a = getbits(4);
b = getbits(3);

on some bitstream that we just opened, we expect both values to come from the same byte, but how are they arranged in that byte? If bits are packed MSB-first, then “a” occupies 4 bits starting at the MSB, and “b” is below “a”, leading to this arrangement:

MSB-first bit packing

I’m numbering bits with the LSB being bit 0 and increasing towards the MSB, which is the conventional ordering in most contexts. LSB-first packing is the opposite, where the first field occupies bit 0 and later fields proceed upwards:

LSB-first bit packing

Both conventions are used in everyday file formats. For example, JPEG uses MSB-first bit packing for its bitstream, and DEFLATE (zip) uses LSB-first.

The next question we have to settle is what’s supposed to happen when a value ends up spanning multiple bytes. Say we have another value, “c”, that we want to encode in 5 bits. What do we want to end up with? We can postpone the issue slightly by declaring that we’re packing values into 32-bit words or 64-bit words and not bytes, but ultimately we’ll have to settle on something. This is where we suddenly get a whole lot of different variants, and I’m only going to cover the main contenders.

We can think of MSB-first bit packing as iterating over our bitfield “c” from its MSB towards the LSB, inserting one bit at a time. And once we fill up one byte, we start with the next byte. Following these rules for our new bitfield c, this is where the bits end up in our stream:

MSB-first bit packing across multiple bytes

Note that by following these rules, we end up with the same two bytes we would have gotten from MSB-first bit packing into a big integer and then storing it in big-endian byte order. If we had instead decided to split c such that its LSB goes into the first byte and the four higher-order bits into the second byte, this wouldn’t have worked. I’m going to call bit-packing rules that are self-consistent like that “natural” rules.

LSB-first bit-packing of course also has its corresponding natural rule, which is to insert the new value bit by bit starting from the LSB upwards, and if we do that here, we end up with this bit stream:

LSB-first bit packing across multiple bytes

LSB-first natural packing gives us the same bytes as LSB-first packing into a big integer then storing it in little-endian byte order. Also, we’re clearly running into some awkwardness with the drawing here: the logically contiguous packing of c into multiple bytes looks discontinuous when drawing it like this, whereas the drawing for the MSB-first packing looked the way you’d expect. But there’s trouble brewing there too: in the MSB-first drawing, we’re numbering the bits in increasing order from right to left (as is common) and the bytes in increasing order from left to right (as is also common).

Here’s what happens with the LSB-first bit diagram if we draw bit 0 (the LSB) in each byte at the left and bit 7 (the MSB) at the right:

LSB-first bitpacking across multiple bytes, LSB->MSB from left to right

If you draw it this way, it looks the way you’d expect. Putting the MSB on the right feels weird when thinking of a byte as a number and much less weird when you just think about it as an array of 8 bits (which is, effectively, what we’re treating it as when we’re doing bit-wise IO).

Incidentally, some big-endian architectures (e.g. IBM POWER) do number bits this way – bit 0 is the MSB, bit 31 (or 63) is the LSB. Diagramming MSB-first bit packing on such a machine with bit 0=MSB, and also numbering our own bit fields so that their bit 0 corresponds to the MSB, we’d get the exact same diagram (but it would mean something slightly different). This convention makes bit and byte ordering consistent (nice) but breaks the handy convention of bit k corresponding to a value of 2^k (less nice).

And if the idea of renumbering bits so bit 0 is the MSB breaks your head, you can also leave the bit numbering as it is but be a rebel and draw byte addresses increasing to the left. Or alternatively keep drawing increasing addresses to the right but be a slightly different kind of rebel and write your byte stream backwards. If you do either of those, you end up with this layout:

LSB-first bitpacking across multiple bytes, reverse byte order

I realize this is getting confusing, and I’ll stop now, but I’m trying to make an actual point here: you should not get too attached to the way this stuff is drawn. It’s easy to fool yourself into thinking that one variant is better than another because it looks prettier, but the conventions of drawing bytes left to right and bits within them right to left are completely arbitrary, and no matter which one you pick, they always end up having that One Weird Thing about them.

It turns out that MSB-first and LSB-first packing conventions both have advantages and disadvantages, and it’s much more useful to think of them as tools with different areas of application than it is to designate one as the “right way” and the other as the “wrong way”. As for byte order, and whether to pack values into bytes, words, or something larger, I highly recommend that you use whatever the natural order for your bit-packing convention is: MSB-first corresponds naturally to big-endian style byte ordering, and LSB-first corresponds naturally to little-endian byte ordering. Unless you’re writing byte streams in reverse—believe it or not, there’s good reasons to do that too—in which case we have reverse-order MSB-first corresponding to little-endian and reverse-order LSB-first corresponding to big-endian.

The reason to prefer “natural” orders is that they tend to give you more freedom for different implementations. A natural-order stream admits a variety of different decoders (and encoders), all with different trade-offs (and good on different targets). “Unnatural” orders are usually designed with exactly one implementation in mind and tend to be very awkward to decode any other way.

Our first getbits (bit extract)

Now that we’ve specified the problem sufficiently, we can implement a solution. A particularly simple version is possible if we assume the entire bit stream is sequential in memory (as a byte array) and also completely ignore, for the time being, such pesky issues as running into the end of the array. Let’s just pretend it’s infinite! (Or large enough and zero-padded, anyway.)

In that case, we can base everything on a purely functional “bit extract” function, which I am then going to use to illustrate a number of issues that come up in all kinds of bit-reading functions. Let’s start with LSB-first bit packing:

// Treating buf[] as a giant little-endian integer, grab "width"
// bits starting at bit number "pos" (LSB=bit 0).
uint64_t bit_extract_lsb(const uint8_t *buf, size_t pos, int width) {
    assert(width >= 0 && width <= 64 - 7);

    // Read a 64-bit little-endian number starting from the byte
    // containing bit number "pos" (relative to "buf").
    uint64_t bits = read64LE(&buf[pos / 8]);

    // Shift out the bits inside the first byte that we've
    // already consumed.
    // After this, the LSB of our bit field is in the LSB of bits.
    bits >>= pos % 8;

    // Return the low "width" bits, zeroing the rest via bit mask.
    return bits & ((1ull << width) - 1);
}

// State variables, assumed to be local variables or factored
// into some object; hopefully not actual globals.
const uint8_t *bitstream; // The input bitstream
size_t bit_pos; // Current position in the stream, in bits.

uint64_t getbits_extract_lsb(int width) {
    // Read the bits
    uint64_t result = bit_extract_lsb(bitstream, bit_pos, width);
    // Advance the cursor
    bit_pos += width;
    return result;
}

We’re just using the fact I noted earlier that a LSB-first bit stream is just a big little-endian number. We first grab 64 contiguous byte-aligned bits starting from the first byte containing any bits we care about, do a right shift to get rid of the remaining 0-7 extra bits below the first bit we care about, and then return the result masked to the desired width.

Depending on the value of pos, that right shift can cost us up to 7 extra bits. So even though we read a full 64 bits, the maximum number of bits we can read in one go with the code above is 64-7=57 bits.

With bitextract in hand, getbits is straightforward; we just keep track of the current position in the bitstream (in bits), and increment it after reading. Easy.
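
In case you’re wondering about read64LE (and the read64BE the next variant will use): the post assumes a suitably fast implementation exists for your target. A portable byte-by-byte sketch would be:

// Read 8 bytes as a little-endian / big-endian 64-bit integer. No
// alignment requirements; compilers commonly turn these into a single
// (possibly byte-swapped) load where the target allows it.
static uint64_t read64LE(const uint8_t *p) {
    uint64_t x = 0;
    for (int i = 0; i < 8; ++i)
        x |= (uint64_t)p[i] << (8 * i); // byte i covers bits 8i..8i+7
    return x;
}

static uint64_t read64BE(const uint8_t *p) {
    uint64_t x = 0;
    for (int i = 0; i < 8; ++i)
        x = (x << 8) | p[i]; // earlier bytes end up more significant
    return x;
}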

The corresponding MSB-first variant works out quite similarly, except for one annoying issue I’ll explain after showing the code:

// Treating buf[] as a giant big-endian integer, grab "width"
// bits starting at bit number "pos" (MSB=bit 0).
uint64_t bit_extract_msb(const uint8_t *buf, size_t pos, int width) {
    assert(width >= 1 && width <= 64 - 7);

    // Read a 64-bit big-endian number starting from the byte
    // containing bit number "pos" (relative to "buf").
    uint64_t bits = read64BE(&buf[pos / 8]);

    // Shift out the bits we've already consumed.
    // After this, the MSB of our bit field is in the MSB of bits.
    bits <<= pos % 8;

    // Return the top "width" bits.
    return bits >> (64 - width);
}

uint64_t getbits_extract_msb(int width) {
    // Read the bits
    uint64_t result = bit_extract_msb(bitstream, bit_pos, width);
    // Advance the cursor
    bit_pos += width;
    return result;
}

This works very similar to the one before: read 64 contiguous byte-aligned bits (big-endian this time), do a left shift to align the top of the bit field we want with the MSB of bits (where before, we did a right shift to align the bottom of our bit field with the LSB of bits), and then do a right shift to place the top width bits at the bottom to return them, because when somebody calls getbits(3), they generally expect to see a value between 0 and 7 inclusive.

Shift boundary cases

So what’s the problem? Well, this version doesn’t allow width to be zero. The issue is that if we allow width == 0, then the final shift will try to right-shift a 64-bit value by 64 bits, and that’s undefined behavior in C/C++! In this case, we may only shift by 0 through 63.

Now, in some cases, C/C++ leave details unspecified for backwards-compatibility with machines essentially nobody cares about at this point. Not requiring that the representation of signed numbers use two’s complement being a famous example; while non-two’s complement architectures exist, they’re all museum pieces at this point.

Sadly, this is not one of these cases. Here’s what happens on different widespread CPU architectures when the shift amount is out of range:

  • For 32-bit x86 and x86-64, shift amounts are interpreted mod 32 for operand widths of 32 bits and lower, and mod 64 for 64-bit operands. So right-shifting a 64-bit value by 64 will yield the same as shifting by 0, i.e. a no-op.
  • In 32-bit ARM (A32/T32 instruction sets), shift amounts are taken mod 256. Right-shifting a 32-bit value by 32 (or 64) will hence yield 0, as will right-shifting it by 255, but right-shifting it by 256 will leave the value untouched.
  • In 64-bit ARM (A64 ISA), shift amounts are taken mod 32 for 32-bit shifts and mod 64 for 64-bit shifts (essentially the same as x86-64).
  • RISC-V also follows the same rule: 32-bit shifts distances are mod 32, 64-bit shift distances are mod 64.
  • For POWER/PowerPC, 32-bit shifts take the shift amount mod 64 and 64-bit shifts take the shift amount mod 128.

To make matters even more confusing, in most of these instruction sets with SIMD extensions, the SIMD integer instructions have different out-of-range shift behavior than the non-SIMD instructions. In short, this is one of those cases where there’s actual architectural differences between mainstream platforms; POWER and RISC-V might be a bit obscure for most of you, but say 32-bit vs. 64-bit ARM both have hundreds of millions of devices sold at this point.

Therefore, even if all the compiler does with that right-shift is to emit the corresponding right-shift instruction for the target architecture (which is generally what happens), you will still see different behavior on different architectures, and on both ARM A64 and x86-64, the result of a shift by 64 will be effectively a no-op, and getbits(0) will therefore (generally) return a non-0 value, where one would expect that it’s always zero.

Why does it matter in the first place? Having a hard-coded getbits(0) is indeed not an interesting use case; but sometimes you might want to perform a getbits(x) for some variable x, where x can be zero in some cases, and it’s nice for that case to just work and not require some special-case testing.

If you really need that case to work, one option is to explicitly test for width == 0 and handle that specially; another is to use a branch-less expression that works for zero widths, for example this variant used in Yann Collet’s FSE:

    // Return the top "width" bits, avoiding width==0 edge cases.
    return (bits >> 1) >> (63 - width);

This particular case is easier to handle with LSB-first bit streams. And while I’m mentioning them, let’s talk about the masking operation I used to isolate the lowest width bits:

    // Return the low "width" bits, zeroing the rest via bit mask.
    return bits & ((1ull << width) - 1);

There’s an equivalent form that is slightly cheaper on architectures with a three-operand and-with-complement (AND-NOT) instruction. This includes many RISC CPUs as well as x86s with BMI1. Namely, we can take a mask of all-1 bits, left-shift it to introduce width zeroes at the bottom, then complement the whole thing:

    return bits & ~(~0ull << width);

If you’re on an x86 with not just BMI1, but also BMI2 support, you can also use the BZHI instruction that is tailor-made for this use case, if you can figure out how to make your compiler emit it (or use assembly). Yet another option that is advantageous in some cases is to simply prepare a small look-up table of bit masks, which then simplifies the code into

    return bits & width_to_mask_table[width];

It may feel ridiculous to prepare a lookup table storing the result of two integer operations, especially since computing the address of the table element to load usually involves both a shift and an addition—the exact two operations we’d be performing if we didn’t have the table!—but there’s a method to this madness: the required address calculation can be done as part of the memory access in a single load instruction on e.g. x86 and ARM machines, and so this shift and add get executed on an Address Generation Unit (AGU) as part of the load pipeline of the CPU, not with integer arithmetic and shift instructions. So counter-intuitive as it may sound, replacing two integer ALU instructions with one integer load instruction like this can result in a noticeable speed-up in bit-IO-intensive code, because it balances the workload better between available execution units.
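
For reference, here’s one way such a table might be set up (a sketch; the name width_to_mask_table is from the text, the construction is mine). Sizing it for widths 0 through 64 also leaves room for full-width reads later:

// Entry i holds a mask with the low i bits set; entry 64 is all ones.
static uint64_t width_to_mask_table[65];

static void init_mask_table(void) {
    width_to_mask_table[0] = 0;
    for (int i = 1; i <= 64; ++i)
        width_to_mask_table[i] = (width_to_mask_table[i - 1] << 1) | 1;
}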

Another interesting property is that the LSB-first version using the mask table lookup only performs one shift by a variable amount (to shift out the already-consumed bits). This matters because, for a variety of (often banal) reasons, integer shifts by a variable amount are more expensive than shifts by a compile-time constant amount on many micro-architectures. For example, the Cell PPU/Xbox 360 Xenon CPU were infamous for having variable-distance shifts that stalled the CPU core for a whopping 12 cycles, whereas regular shifts were pipelined and would complete in two cycles. Less drastically, on many Intel x86 microarchitectures, “traditional” variable x86 shifts (SHR reg, cl and friends) are about three times as expensive as shifts by a compile-time constant.

This seems like another way in which the MSB-first variant is behind: the variants we’ve seen so far perform either two or three shifts per extract operation, two of which are variable-distance. But there’s a trick to get down to a single shift-type operation, namely by using a bit rotate (cyclical shift) instead of a regular shift:

// Treating buf[] as a giant big-endian integer, grab "width"
// bits starting at bit number "pos" (MSB=bit 0).
uint64_t bit_extract_rot(const uint8_t *buf, size_t pos, int width) {
    assert(width >= 0 && width <= 64 - 7);

    // Read a 64-bit big-endian number starting from the byte
    // containing bit number "pos" (relative to "buf").
    uint64_t bits = read64BE(&buf[pos >> 3]);

    // Rotate left to align the bottom of our bit field with the LSB.
    bits = rotate_left64(bits, (pos & 7) + width);

    // Return the bottom "width" bits.
    // (Using a table here, but the other ways of masking discussed
    // for LSB-first bit IO also work.)
    return bits & width_to_mask_table[width];
}

Here, the rotate-left (which you need to figure out how to best access in your C compiler yourself; I want to avoid further digressions) first takes up the work of the original left-shift, and then rotates by an extra “width” bits to make our bit field wrap around from the most-significant bits of the value down to the least-significant, where we can then mask it the same way as in the LSB-first variant.

Oh, and I’ve also started writing the division by 8 as shift-right by 3, and the mod-8 of our unsigned position as a binary AND with 7. These are equivalent substitutions, and you’ll see both forms in real-world code, so I figured I’d mix it up a little.

This rotate-style masking operation was how we (being RAD Game Tools) used to perform bit reading on the Cell PPU and Xbox 360, purely because the variable-distance shifts were so dismal on that particular core. And I’ll note that this version also has no problems whatsoever with width == 0; the only problem is its dependence on rotate instructions, which are available (and fast) in most architectures, but tend to be somewhat inconvenient to access from C code.
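
Speaking of which, a common portable way to express that rotate in plain C is the sketch below (an assumption on my part, not code from any particular library); masking the rotate amount keeps both shift counts in range, and current compilers recognize the pattern and emit a single rotate instruction:

static inline uint64_t rotate_left64(uint64_t x, unsigned n) {
    n &= 63; // rotate amount taken mod 64; also makes n == 0 (and 64) safe
    return (x << n) | (x >> ((64 - n) & 63));
}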

At this point, I’ve talked a lot about many, many different ways to do bit shifting and masking, and I’ll be referring back to these later. What I haven’t shown you yet is any actual practical bit IO logic—the bit extraction form is a useful starting point, and convenient to explain these concepts in without having to worry about state, but you’re unlikely to ever use it as presented in any production code, not least because it assumes that the entire bit stream is in memory and will gleefully read past the end of the buffer during normal operation!

However, this post is already long enough that WordPress editing is starting to get sluggish on me. So what I’ll do is split this up into a multi-part series; you get to take a breather, and we’ll meet back in the next part. Until then!

Reading bits in far too many ways (part 2)


(Continued from part 1.)

Last time, I established the basic problem and went through various ways of doing shifting and masking, and the surprising difficulties inherent therein. The “bit extract” style I picked is based on a stateless primitive, which made it convenient to start with because there’s no loop invariants involved.

This time, we’re going to switch to the stateful style employed by most bit readers you’re likely to encounter (because it ends up cheaper). We’re also going to switch from a monolithic getbits function to something a bit more fine-grained. But let’s start at the beginning.

Variant 1: reading the input one byte at a time

Our “extract”-style reader assumed the entire bitstream was available in memory ahead of time. This is not always possible or desirable; so let’s investigate the other extreme, a bit reader that requests additional bytes one at a time, and only when they’re needed.

In general, our stateful variants will dribble in input a few bytes at a time, and have partially processed bytes lying around. We need to store that data in a variable that I will call the “bit buffer”:

// Again, this is understood to be per-bit-reader state or local
// variables, not globals.
uint64_t bitbuf = 0;   // value of bits in buffer
int      bitcount = 0; // number of bits in buffer

While processing input, we will always be seesawing between putting more bits into that buffer when we’re running low, and then consuming bits from that buffer while we can.

Without further ado, let’s do our first stateful getbits implementation, reading one byte at a time, and starting with MSB-first this time:

// Invariant: there are "bitcount" bits in "bitbuf", stored from the
// MSB downwards; the remaining bits in "bitbuf" are 0.

uint64_t getbits1_msb(int count) {
    assert(count >= 1 && count <= 57);

    // Keep reading extra bytes until we have enough bits buffered
    // Big endian; our current bits are at the top of the word,
    // new bits get added at the bottom.
    while (bitcount < count) {
        bitbuf |= (uint64_t)getbyte() << (56 - bitcount);
        bitcount += 8;
    }

    // We now have enough bits present; the most significant
    // "count" bits in "bitbuf" are our result.
    uint64_t result = bitbuf >> (64 - count);

    // Now remove these bits from the buffer
    bitbuf <<= count;
    bitcount -= count;

    return result;
}

As before, we can get rid of the count≥1 requirement by changing the way we grab the result bits, as explained in the last part. This is the last time I’ll mention this; just keep in mind that whenever I show any algorithm variant here, the variations from last time automatically apply.

The idea here is quite simple: first, we check whether there’s enough bits in our buffer to satisfy the request immediately. If not, we dribble in extra bytes one at a time until we have enough. getbyte() here is understood to ideally use some buffered IO mechanism that just boils down to dereferencing and incrementing a pointer on the hot path; it should not be a function call or anything expensive if you can avoid it. Because we insert 8 bits at a time, the maximum number of bits we can read in a single call is 57 bits; that’s the largest number of bits we can refill the buffer to without risking anything dropping out.
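
(As a quick sketch of the kind of getbyte() meant here, with illustrative names rather than anything from the post:)

static const uint8_t *in_ptr, *in_end; // current IO buffer window
void refill_io_buffer(void);           // out of line: fetches the next chunk
                                       // and updates in_ptr/in_end

static inline uint8_t getbyte(void) {
    if (in_ptr == in_end)   // rare
        refill_io_buffer();
    return *in_ptr++;       // hot path: one load, one increment
}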

After that, we grab the top count bits from our buffer, then shift them out. The invariant we maintain here is that the first unconsumed bit is kept at the MSB of the buffer.

The other thing I want you to notice is that this process breaks down neatly into three separate smaller operations, which I’m going to call “refill”, “peek” and “consume”, respectively. The “refill” phase ensures that a certain given minimum number of bits is present in the buffer; “peek” returns the next few bits in the buffer, without discarding them; and “consume” removes bits without looking at them. These all turn out to be individually useful operations; to show how things shake out, here’s the equivalent LSB-first algorithm, factored into smaller parts:

// Invariant: there are "bitcount" bits in "bitbuf", stored from the
// LSB upwards; the remaining bits in "bitbuf" are 0.
void refill1_lsb(int count) {
    assert(count >= 0 && count <= 57);
    // New bytes get inserted at the top end.
    while (bitcount < count) {
        bitbuf |= (uint64_t)getbyte() << bitcount;
        bitcount += 8;
    }
}

uint64_t peekbits1_lsb(int count) {
    assert(bitcount >= count);
    return bitbuf & ((1ull << count) - 1);
}

void consume1_lsb(int count) {
    assert(bitcount >= count);
    bitbuf >>= count;
    bitcount -= count;
}

uint64_t getbits1_lsb(int count) {
    refill1_lsb(count);
    uint64_t result = peekbits1_lsb(count);
    consume1_lsb(count);
    return result;
}

Writing getbits as the composition of these three smaller primitives is not always optimal. For example, if you use the rotate method for MSB-first bit buffers, you really want to do the rotate only once and share its result between the peekbits and consume phases, rather than doing it once in each. However, breaking it down into these individual steps is still a useful thing to do, because once you conceptualize these three phases as distinct things, you can start putting them together differently.
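
To make that concrete, here’s a sketch of an MSB-first getbits built on the rotate-and-mask approach from part 1, doing the rotate exactly once and reusing its result in both halves (refill_msb is assumed to be the byte-at-a-time refill loop from getbits1_msb, factored out):

uint64_t getbits_rot_msb(int count) {
    assert(count >= 0 && count <= 57);
    refill_msb(count);

    uint64_t rotated = rotate_left64(bitbuf, count);
    uint64_t mask = width_to_mask_table[count];

    uint64_t result = rotated & mask; // "peek": top count bits, now at the bottom
    bitbuf = rotated & ~mask;         // "consume": same effect as bitbuf <<= count
    bitcount -= count;
    return result;
}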

Lookahead

The most important such transform is amortizing refills over multiple decode operations. Let’s start with a simple toy example: say we want to read our three example bit fields from the last part:

    a = getbits(4);
    b = getbits(3);
    c = getbits(5);

With getbits implemented as above, this will do the refill check (and potentially some actual refilling) up to 3 times. But this is silly; we know in advance that we’re going to be reading 4+3+5=12 bits in one go, so we might as well grab them all at once:

    refill(4+3+5);
    a = getbits_no_refill(4);
    b = getbits_no_refill(3);
    c = getbits_no_refill(5);

where getbits_no_refill is yet another getbits variant that does peekbits and consume, but, as the name suggests, no refilling. And once you get rid of the refill loop between the individual getbits invocations, you’re just left with straight-line integer code, which compilers are good at optimizing further. That said, the all-fixed-length case is a bit of a cheap shot; it gets far more interesting when we’re not sure exactly how many bits we’re actually going to consume, like in this example:

    temp = getbits(5);
    if (temp < 28)
        result = temp;
    else
        result = 28 + (temp - 28)*16 + getbits(4);

This is a simple variable-length code where values from 0 through 27 are sent in 5 bits, and values from 28 through 91 are sent in 9 bits. The point being, in this case, we don’t know in advance how many bits we’re eventually going to consume. We do know that it’s going to be no more than 9 bits, though, so we can still make sure we only refill once:

    refill(9);
    temp = getbits_no_refill(5);
    if (temp < 28)
        result = temp;
    else
        result = 28 + (temp - 28)*16 + getbits_no_refill(4);

In fact, if you want to, you can go wild and split operations even more, so that both execution paths only consume bits once. For example, assuming a MSB-first bit buffer, we could write this small decoder as follows:

    refill(9);
    temp = peekbits(5);
    if (temp < 28) {
        result = temp;
        consume(5);
    } else {
        // The "upper" and "lower" code are back-to-back,
        // and combine exactly the way we want! Synergy!
        result = getbits_no_refill(9) - 28*16 + 28;
    }

This kind of micro-tweaking is really not recommended outside very hot loops, but as I mentioned in the previous part, some of these decoder loops are quite hot indeed, and in that case saving a few instructions here and a few instructions there adds up. One particularly important technique for decoding variable-length codes (e.g. Huffman codes) is to peek ahead by some fixed number of bits and then do a table lookup based on the result. The table entry then lists what the decoded symbol should be, and how many bits to consume (i.e. how many of the bits we just peeked at really belonged to the symbol). This is significantly faster than reading the code a bit or two at a time and consulting a Huffman tree at every step (the method sadly taught in many textbooks and other introductory texts.)
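
A minimal sketch of that table-driven approach, in terms of the refill/peek/consume primitives from above (the table layout and constants here are illustrative, not taken from any particular codec):

enum { HUFF_PEEK_BITS = 11 }; // longest supported code length
typedef struct { uint8_t symbol; uint8_t length; } HuffEntry;
static HuffEntry huff_table[1 << HUFF_PEEK_BITS]; // filled during table setup

int decode_symbol(void) {
    refill(HUFF_PEEK_BITS);                   // might grab bytes we won't use!
    uint64_t bits = peekbits(HUFF_PEEK_BITS); // peek, don't consume yet
    HuffEntry e = huff_table[bits];           // gives symbol + actual code length
    consume(e.length);                        // consume only what the code used
    return e.symbol;
}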

There’s a problem, though. The technique of peeking ahead a bit (no pun intended) and then later deciding how many bits you actually want to consume is quite powerful, but we’ve just changed the rules: the getbits implementation above is careful to only read extra bytes when it’s strictly necessary. But our modified variable-length code reader example above always refills so the buffer contains at least 9 bits, even when we’re only going to consume 5 bits in the end. Depending on where that refill happens, it might even cause us to read past the end of the actual data stream.

In short, we’ve introduced lookahead. The modified code reader starts grabbing extra input bytes before it’s sure whether it needs them. This has many advantages, but the trade-off is that it may cause us to read past the logical end of a bit stream, which means we have to make sure that case is handled correctly. It certainly should never crash or read out of bounds; but beyond that, it implies certain things about the way input buffering or the framing layer has to work.

Namely, if we’re going to do any lookahead, we need to figure out a way to handle this. The primary options are as follows:

  • We can punt and make it someone else’s problem by just requiring that everyone hand us valid data with some extra padding bytes after the end. This makes our lives easier but is an inconvenience for everyone else.
  • We can arrange things so the outer framing layer that feeds bytes to our getbits routine knows when the real data stream is over (either due to some escape mechanism or because the size is sent explicitly); then we can either stop doing any lookahead and switch to a more careful decoder when we’re getting close to the end, or pad the stream with some dummy value after its end (zeroes being the most popular choice).
  • We can make sure that whatever bytes we might grab during lookahead while decoding a valid stream are still part of our overall byte stream that’s being processed by our code. For example, if you use a 64-bit bit buffer, we can finagle our way around the problem by requiring that some 8-byte footer (say a checksum or something) be present right after a valid bit stream. So while our bit buffer might overshoot, it’s still data that’s ultimately going to be consumed by the decoder, which simplifies the logistics considerably.
  • Barring that, whatever I/O buffering layer we’re using needs to allow us to return some extra bytes we didn’t actually consume into the buffer. Namely, whatever lookahead bytes we have left in our bit buffer after we’re done decoding need to be returned to the buffer for whoever is going to read it next. This is essentially what the C standard library function ungetc does, except you’re not allowed to call ungetc more than once, and we might need to. So going along this route essentially dooms you to taking charge of IO buffer management.

I won’t sugarcoat it, all of these options are a pain in the neck, some more so than others: hand-waving it away by putting something else at the end is easiest, handling it in some outer framing layer isn’t too bad, and taking over all IO buffering so you can un-read multiple bytes is positively hideous, but you don’t have great options when you don’t control your framing. In the past, I’ve written posts about handy techniques that might help you in this context; and in some implementations you can work around it, for example by setting bitcount to a huge value just after inserting the final byte from the stream. But in general, if you want lookahead, you’re going to have to put some amount of work into it. That said, the winnings tend to be fairly good, so it’s not all for nothing.
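
(For completeness, the “huge bitcount” workaround mentioned above amounts to something like this in terms of the getbits1-style readers; it’s a sketch, and it only works because their invariant keeps all unoccupied buffer bits at zero:)

// Call once the framing layer has fed the final byte of the stream into
// bitbuf; the refill loop's "bitcount < count" test can then never pass,
// so getbyte() is never called again, and reads past the end yield zeroes.
void mark_end_of_stream(void) {
    bitcount = 0x10000; // anything >= the maximum supported read width works
}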

Variant 2: when you really want to read 64 bits at once

The methods I’ve discussed so far both have some “slop” from working in byte granularity. The extract-style bit reader started with a full 64-bit read but then had to shift by up to 7 positions to discard the part of the current byte that’s already consumed, and the getbits1 above inserts one byte at a time into the bit buffer; if there’s 57 bits already in the buffer, there’s no space for another byte (because that would make 65 bits, more than the width of our buffer), so that’s the maximum width the getbits1 method supports. Now 57 bits is a useful amount; but if you’re doing this on a 32-bit platform, the equivalent magic number is 25 bits (32-7), and that’s definitely on the low side, enough so to be inconvenient sometimes.

Luckily, if you want the full width, there’s a way to do it (like the rotate-and-mask technique for MSB-first bit buffers, I learned this at RAD). At this point, I think you get the correspondence between the MSB-first and LSB-first methods, so I’m only going to show one per variant. Let’s do LSB-first for this one:

// Invariant: "bitbuf" contains "bitcount" bits, starting from the
// LSB up; 1 <= bitcount <= 64
uint64_t bitbuf = 0;     // value of bits in buffer
int      bitcount = 0;   // number of bits in buffer
uint64_t lookahead = 0;  // next 64 bits
bool     have_lookahead = false;

// Must do this to prime the pump!
void initialize() {
    bitbuf = get_uint64LE();
    bitcount = 64;
    have_lookahead = false;
}

void ensure_lookahead() {
    // grabs the lookahead word, if we don't have
    // it already.
    if (!have_lookahead) {
        lookahead = get_uint64LE();
        have_lookahead = true;
    }
}

uint64_t peekbits2_lsb(int count) {
    assert(bitcount >= 1);
    assert(count >= 0 && count <= 64);

    if (count <= bitcount) { // enough bits in buf
        return bitbuf & width_to_mask_table[count];
    } else {
        ensure_lookahead();

        // combine current bitbuf with lookahead
        // (lookahead bits go above end of current buf)
        uint64_t next_bits = bitbuf;
        next_bits |= lookahead << bitcount;
        return next_bits & width_to_mask_table[count];
    }
}

void consume2_lsb(int count) {
    assert(bitcount >= 1);
    assert(count >= 0 && count <= 64);

    if (count < bitcount) { // still in current buf
        // just shift the bits out
        bitbuf >>= count;
        bitcount -= count;
    } else { // all of current buf consumed
        ensure_lookahead();
         
        // we advanced fully into the lookahead word
        int lookahead_consumed = count - bitcount;
        bitbuf = lookahead >> lookahead_consumed;
        bitcount = 64 - lookahead_consumed;
        have_lookahead = false;
    }

    assert(bitcount >= 1);
}

uint64_t getbits2_lsb(int count) {
    uint64_t result = peekbits2_lsb(count);
    consume2_lsb(count);
    return result;
}

This one is a bit more complicated than the ones we’ve seen before, and needs an explicit initialization step to make the invariants work out just right. It also involves several extra branches compared to the earlier variants, which makes it less than ideal for deeply pipelined machines (a category that includes desktop PCs). Also note that I’m using width_to_mask_table again, and not just for show: none of the arithmetic expressions we saw last time for computing the mask of a given width work for the full 0-64 range of allowed widths on any common 64-bit architecture that isn’t POWER, and even that only if we ignore that they invoke undefined behavior, which we really shouldn’t.

The underlying idea is fairly simple: instead of just one bit buffer, we keep track of two values. We have however many bits are left of the last 64-bit value we read, and when that’s not enough for a peekbits, we grab the next 64-bit value from the input stream (via some externally-implemented get_uint64LE()) to give us the bits we’re missing. Likewise, consume checks whether there will still be any bits left in the current input buffer after consuming width bits. If not, we switch over to the bits from the lookahead value (shifting out however many of them we consumed) and clear the have_lookahead flag to indicate that what used to be our lookahead value is now just the contents of our bit buffer.

There are some contortions in this code to ensure we don’t do out-of-range (undefined-behavior-inducing) shifts. For example, note how peekbits tests whether count <= bitcount to detect the bits-present-in-buffer case, whereas consume uses count < bitcount. This is not an accident: in peekbits, the next_bits calculation involves a left shift by bitcount. Since it only happens in the path where bitcount < count ≤ 64, that means that bitcount < 64, and we’re safe. In consume, the situation is reversed: we shift by lookahead_consumed = count - bitcount. The condition around the block guarantees that lookahead_consumed ≥ 0; in the other direction, because count is at most 64 and bitcount is at least 1, we have lookahead_consumed ≤ 64 - 1 = 63. That said, to paraphrase Knuth: beware of bugs in the above code; I’ve only proved it correct, not tried it.

This technique has another thing going for it besides supporting bigger bit field widths: note how it always reads full 64-bit uints at a time. Variant 1 above only reads bytes at a time, but requires a refill loop; the various branchless variants we’ll see later implicitly rely on the target CPU supporting fast unaligned reads. This version alone has the distinction of doing reads of a single size and with consistent alignment, which makes it more attractive on targets that don’t support fast unaligned reads, such as many old-school RISC CPUs.

Finally, as usual, there’s several more variations here that I’m not showing. For example, if you happen to have the data you’re decoding fully in memory, there’s no reason to bother with the boolean have_lookahead flag; just keep a pointer to the current lookahead word, and bump that pointer up whenever the current lookahead is consumed.
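If the whole (suitably padded) input is in memory, that pointer-based variation might look something like the sketch below; read64LE is an unaligned little-endian 64-bit load (the same helper the later variants use), and initialization plus end-of-buffer handling are omitted for brevity.

const uint8_t *lookahead_ptr; // points at the next unread 8 input bytes

uint64_t peekbits2b_lsb(int count) {
    assert(bitcount >= 1);
    assert(count >= 0 && count <= 64);

    uint64_t next_bits = bitbuf;
    if (count > bitcount) // need bits from the lookahead word
        next_bits |= read64LE(lookahead_ptr) << bitcount;
    return next_bits & width_to_mask_table[count];
}

void consume2b_lsb(int count) {
    if (count < bitcount) { // still within the current buffer
        bitbuf >>= count;
        bitcount -= count;
    } else {                // switched over to the lookahead word
        int lookahead_consumed = count - bitcount;
        bitbuf = read64LE(lookahead_ptr) >> lookahead_consumed;
        bitcount = 64 - lookahead_consumed;
        lookahead_ptr += 8; // that word is now the current buffer
    }
}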

Variant 3: bit extraction redux

The original bit extraction-based bit reader from the previous part was a bit on the expensive side. But as long as we’re OK with the requirement that the entire input stream be in memory at once, we can wrangle it into the refill/peek/consume pattern to get something useful. It still gives us a bit reader that looks ahead (and hence has the resulting difficulties), but such is life. For this one, let’s do MSB again:

const uint8_t *bitptr; // Pointer to current byte
uint64_t       bitbuf = 0; // last 64 bits we read
int            bitpos = 0; // how many of those bits we've consumed

void refill3_msb() {
    assert(bitpos <= 64);

    // Advance the pointer by however many full bytes we consumed
    bitptr += bitpos >> 3;

    // Refill
    bitbuf = read64BE(bitptr);

    // Number of bits in the current byte we've already consumed
    // (we took care of the full bytes; these are the leftover
    // bits that didn't make a full byte.)
    bitpos &= 7;
}

uint64_t peekbits3_msb(int count) {
    assert(bitpos + count <= 64);
    assert(count >= 1 && count <= 64 - 7);

    // Shift out the bits we've already consumed
    uint64_t remaining = bitbuf << bitpos;

    // Return the top "count" bits
    return remaining >> (64 - count);
}

void consume3_msb(int count) {
    bitpos += count;
}

This time, I’ve also left out the getbits built from refill / peek / consume calls, because that’s yet another pattern that should be pretty clear by now.

It’s a pretty sweet variant. Once we break the bit extraction logic into separate “refill” and “peek”/”consume” pieces, it becomes clear how all of the individual pieces are fairly small and clean. It’s also completely branchless! It does expect unaligned 64-bit big-endian reads to exist and be reasonably cheap (not a problem on mainstream x86s or ARMs), and of course a realistic implementation needs to include handling of the end-of-buffer cases; see the discussion in the “lookahead” section.
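For concreteness, a typical way these pieces get used is to refill once per loop iteration and then do a few peek/consume rounds against the buffered bits. Here’s a minimal sketch of such a loop; determine_count and done_decoding are hypothetical stand-ins for whatever the actual format requires.

for (;;) {
    refill3_msb(); // tops the buffer back up to at least 57 usable bits

    // Peek at a fixed-size window, decide from it how many bits the
    // current code actually uses, then consume just those.
    uint64_t window = peekbits3_msb(19);
    int used = determine_count(window); // hypothetical decode step
    consume3_msb(used);

    // More peek/consume rounds can go here, as long as the total
    // number of bits consumed per iteration stays <= 57.
    if (done_decoding())                // hypothetical termination test
        break;
}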

Variant 4: a different kind of lookahead

And now that we’re here, let’s do another branchless lookahead variant. This exact variant is, to the best of my knowledge, another RAD special – discovered by my colleague Charles Bloom and me while working on Kraken (UPDATE: as Yann points out in the comments, this basic idea was apparently used in Eric Biggers’ “Xpack” long before Kraken was launched; I wasn’t aware of this and I don’t think Charles was either, but that means we’re definitely not the first ones to come up with the idea. Our variant has an interesting wrinkle though – details in my reply). All branchless bit readers (well, branchless if you ignore end-of-buffer checking in the refill etc.) look very much alike, but this particular variant has a few interesting properties (some of which I’ll only discuss later, because we’re lacking some necessary background right now) that I haven’t seen anywhere else in this combination; if someone else did it first, feel free to tell me in the comments, and I’ll add the proper attribution! Here goes; back to LSB-first again, because I’m committed to hammering home just how similar and interchangeable LSB-first/MSB-first are at this level, holy wars notwithstanding.

const uint8_t *bitptr;   // Pointer to next byte to insert into buf
uint64_t bitbuf = 0;     // value of bits in buffer
int      bitcount = 0;   // number of bits in buffer

void refill4_lsb() {
    // Grab the next few bytes and insert them right above
    // the current top.
    bitbuf |= read64LE(bitptr) << bitcount;

    // Advance the read pointer for next iteration
    bitptr += (63 - bitcount) >> 3;

    // Update the available bit count
    bitcount |= 56; // now bitcount is in [56,63]
}

uint64_t peekbits4_lsb(int count) {
    assert(count >= 0 && count <= 56);
    assert(count <= bitcount);
    
    return bitbuf & ((1ull << count) - 1);
}

void consume4_lsb(int count) {
    assert(count <= bitcount);

    bitbuf >>= count;
    bitcount -= count;
}

The peek and consume phases are nothing we haven’t seen before, although this time the maximum permissible bit width seems to have shrunk by one more bit down to 56 bits for some reason.

That reason is in the refill phase, which works slightly differently from the ones we’ve seen so far. Reading 64 little-endian bits and shifting them up to align with the top of our current bit buffer should be straightforward at this point. But the bitptr / bitcount manipulation needs some explanation.

It’s actually easier to start with the bitcount part. The variants we’ve seen so far generally have between 57 and 64 bits in the buffer after refill. This version instead targets having between 56 and 63 bits in the buffer (which is also why the limit on count went down by one). But why? Well, inserting some integer number of bytes means bitcount is going to be incremented by some multiple of 8 during the refill; that means that bitcount & 7 (the low 3 bits) won’t change. And if we refill to a target of [56,63] bits in the buffer, we can compute the updated bit count with a single binary OR operation.

Which brings me to the question of how many bytes we should advance the pointer by. Well, let’s just look at the values of the original bitcount:

  • If 56 ≤ bitcount ≤ 63, we were already in our target range and don’t want to advance by another byte.
  • If 48 ≤ bitcount ≤ 55, we’re adding exactly 1 byte (and so want to advance bitptr by that much).
  • If 40 ≤ bitcount ≤ 47, we’re adding exactly 2 bytes.

and so forth. This works out to the (63 - bitcount) >> 3 bytes we’re adding to bitptr.

Now, the bits in bitbuf above bitcount can end up getting ORed over multiple times. However, when that happens, we OR over the same value every time, so it doesn’t change the result. By the time those bits later travel downwards (due to the right shift in the consume function), they hold the correct values; there’s no garbage to worry about.

Okay, so that’s all straightforward enough, but what’s so special about this particular variant? When would you choose it over, say, variant 3 above?

One simple reason: in this variant, the address the refill is loading from does not have a dependency on the current value of bitcount. In fact, the next load address is known as soon as the previous refill is complete. This is a subtle distinction that turns out to be a fairly major advantage on an out-of-order CPU. Among integer operations, even when hitting the L1 cache, loads are on the high latency side (typically somewhere between 3 and 5 cycles, whereas most integer operations take a single cycle), and the exact value of bitcount at the end of some loop iteration is often only known late (consider the simple variable-length code example I gave above).

Having the load address not depend on bitcount means the load can potentially issue as soon as the previous refill is complete; then we have plenty of time to complete the load, potentially byte-swap the value if the endianness of our load doesn’t match the target ISA (say because we’re using a MSB-first bit buffer on a little-endian CPU), and then the only thing that depends on the previous value of bitcount is the shift, which is a regular ALU operation and generally takes a single cycle.

In short, this somewhat obscure form of refill looks weird, but provides a tangible increase in available instruction-level parallelism. It was good for about a 10% throughput improvement on desktop PCs (vs. the earlier branchless refill it replaced) in the then-current version of the Kraken Huffman decoder when I tested it in early 2016.

Consider this a teaser for the next (and hopefully last) part of this series, in which I won’t introduce many more variants (maybe one more), and will instead talk a lot more about the performance of bit stream decoders and what kinds of things to watch out for.

Until then!

Reading bits in far too many ways (part 3)


(Continued from part 2. “A whirlwind introduction to dataflow graphs” is required reading.)

Last time, we saw a whole bunch of different bit reader implementations. This time, I’ll continue with a few more variants, implementation considerations on pipelined superscalar CPUs, and some ways to use the various degrees of freedom to our advantage.

Dependency structure of bit readers

To get a better feel for where the bottlenecks in bit decoding are, let me restate some of the bit reading approaches we’ve covered in the previous parts again in our pseudo-assembly language, and then we can have a look at the corresponding dependency graphs.

Let’s start with variant 3 from last time, but I’ll do a LSB-first version this time:

refill3_lsb:
    rBytesConsumed = lsr(rBitPos, 3);
    rBitPtr = rBitPtr + rBytesConsumed;
    rBitBuf = load64LE(rBitPtr);
    rBitPos = rBitPos & 7;

peekbits3_lsb(count):
    rBits = lsr(rBitBuf, rBitPos);
    rBitMask = lsl(1, count);
    rBitMask = rBitMask - 1;
    rBits = rBits & rBitMask; // result

consume3_lsb(count):
    rBitPos = rBitPos + count;

Note that if count is a compile-time constant, the computation for rBitMask can be entirely constant-folded. Peeking ahead by a constant, fixed number of bits then working out from the result how many bits to actually consume is quite common in practice, so that’s what we’ll do. If we do a refill followed by two peek/consume cycles with the consume count being determined from the read bits “somehow”, followed by another refill (for the next loop iteration), the resulting pseudo-asm is like this:

    // Initial refill
    rBytesConsumed = lsr(rBitPos, 3);   // Consumed 0
    rBitPtr = rBitPtr + rBytesConsumed; // Advance 0
    rBitBuf = load64LE(rBitPtr);        // Load 0
    rBitPos = rBitPos & 7;              // LeftoverBits 0

    // First decode (peek count==19)
    rBits = lsr(rBitBuf, rBitPos);      // BitsRemaining 0
    rBits = rBits & 0x7ffff;            // BitsMasked 0
    rCount = determineCount(rBits);     // DetermineCount 0
    rBitPos = rBitPos + rCount;         // PosInc 0
    
    // Second decode
    rBits = lsr(rBitBuf, rBitPos);      // BitsRemaining 1
    rBits = rBits & 0x7ffff;            // BitsMasked 1
    rCount = determineCount(rBits);     // DetermineCount 1
    rBitPos = rBitPos + rCount;         // PosInc 1

    // Second refill
    rBytesConsumed = lsr(rBitPos, 3);   // Consumed 1
    rBitPtr = rBitPtr + rBytesConsumed; // Advance 1
    rBitBuf = load64LE(rBitPtr);        // Load 1
    rBitPos = rBitPos & 7;              // LeftoverBits 1

And the dependency graph looks dishearteningly long and skinny:

Dependency graph for bit reading, variant 3

Ouch. That’s averaging less than one instruction per cycle, and it’s all in one big, serial dependency chain. Not depicted in this graph but also worth noting is that the 4-cycle latency edge from “Load” to “BitsRemaining” is a recurring delay that will occur on every refill, because the computation of the updated rBitPtr depends on the decode prior to the refill having been completed. Now this is not a full decoder, since I’m showing only the parts to do with the bitstream IO (presumably a real decoder also contains code to, you know, actually decode the bits and store the result somewhere), but it’s still somewhat disappointing. Note that the DetermineCount step is a placeholder: if the count is known in advance, for example because we’re reading a fixed-length field, you can ignore it completely. The single cycle depicted in the graph is OK for very simple cases; more complicated cases will often need multiple cycles here, for example because they perform a table lookup to figure out the count. Either way, even with our optimistic single-cycle single-operation DetermineCount step, the critical path through this graph is pretty long, and there’s very little latent parallelism in it.

Does variant 4 fare any better? The primitives look like this in pseudo-ASM:

refill4_lsb:
    rNext = load64LE(rBitPtr);
    rNextSh = lsl(rNext, rBitCount);
    rBitBuf = rBitBuf | rNextSh;

    // Most instruction sets don't have a subtract-from-immediate
    // but do have xor-by-immediate, so this is an advantageous
    // way to write 63 - rBitCount. (This works since we know that
    // rBitCount is in [0,63]).
    rBitsAdvance = rBitCount ^ 63;
    rBytesAdvance = lsr(rBitsAdvance, 3);
    rBitPtr = rBitPtr + rBytesAdvance;

    rBitCount = rBitCount | 56;

peekbits4_lsb(count):
    rBitMask = lsl(1, count);
    rBitMask = rBitMask - 1;
    rBits = rBitBuf & rBitMask; // result

consume4_lsb(count):
    rBitBuf = lsr(rBitBuf, count);
    rBitCount = rBitCount - count;

the pseudo-code for our “refill, do two peek/consume cycles, then refill again” scenario looks like this:

    // Initial refill
    rNext = load64LE(rBitPtr);          // LoadNext 0
    rNextSh = lsl(rNext, rBitCount);    // NextShift 0
    rBitBuf = rBitBuf | rNextSh;        // BitInsert 0
    rBitsAdv = rBitCount ^ 63;          // AdvanceBits 0
    rBytesAdv = lsr(rBitsAdv, 3);       // AdvanceBytes 0
    rBitPtr = rBitPtr + rBytesAdv;      // AdvancePtr 0
    rBitCount = rBitCount | 56;         // RefillCount 0

    // First decode (peek count==19)
    rBits = rBitBuf & 0x7ffff;          // BitsMasked 0
    rCount = determineCount(rBits);     // DetermineCount 0
    rBitBuf = lsr(rBitBuf, rCount);     // ConsumeShift 0
    rBitCount = rBitCount - rCount;     // ConsumeSub 0

    // Second decode
    rBits = rBitBuf & 0x7ffff;          // BitsMasked 1
    rCount = determineCount(rBits);     // DetermineCount 1
    rBitBuf = lsr(rBitBuf, rCount);     // ConsumeShift 1
    rBitCount = rBitCount - rCount;     // ConsumeSub 1

    // Second refill
    rNext = load64LE(rBitPtr);          // LoadNext 1
    rNextSh = lsl(rNext, rBitCount);    // NextShift 1
    rBitBuf = rBitBuf | rNextSh;        // BitInsert 1
    rBitsAdv = rBitCount ^ 63;          // AdvanceBits 1
    rBytesAdv = lsr(rBitsAdv, 3);       // AdvanceBytes 1
    rBitPtr = rBitPtr + rBytesAdv;      // AdvancePtr 1
    rBitCount = rBitCount | 56;         // RefillCount 1

with this dependency graph:

Dependency graph for bit reading, variant 4

That’s a bunch of differences, and you might want to look at variants 3 and 4 in different windows side-by-side. The variant 4 refill does take 3 extra instructions, but we can immediately see that we get more latent instruction-level parallelism (ILP) in return:

  1. The variant 4 refill splits into three dependency chains, not two.
  2. The LoadNext for the second refill can start immediately after the AdvancePtr for the first refill, moving the load off the critical path for the second and subsequent iterations. Variant 3 has a 6-cycle latency from the determination of the final rBitPos in the first iteration to a refilled rBitBuf; in variant 4, that latency shrinks to 2 cycles (one shift and an OR). In other words, while the refill takes more instructions, most of them are off the critical path.
  3. The consume step in variant 4 has two parallel computations; in variant 3, the rBitPos update is critical and feeds into the shift in the next “peek” operation. Variant 4 has a single shift (to consume bits) on the critical path to the next peek; as a result, the latency between two subsequent decodes is one cycle less in variant 4: 3 cycles instead of 4.

In short, this version trades a slight increase in refill complexity for a noticeable latency reduction of several key steps, provided it’s running on a superscalar CPU. That’s definitely nice. On the other hand, the key decode steps are still very linear. We’re limited by the latency of a long chain of serial computations, which is a bad place to be: if possible, it’s generally preferable to be limited by throughput (how many instructions we can execute), not latency (how quickly they complete). Over the past 30 years, the number of execution units and instructions per cycle in mainstream CPU parts has steadily, if slowly, increased. But if we want to see any benefit from this, we need to write code that has a use for these extra execution resources.

Multiple streams

As is often the case, the best solution to this problem is the straightforward one: if decoding from a single bitstream is too serial, then why not decode from multiple bitstreams at once? And indeed, this is much better; there’s not much point to showing a graph here, since it’s literally just two copies of a single-stream graph next to each other. Even with a very serial decoder like variant 3 above, you can come a lot closer to filling up a wide out-of-order machine as long as you use enough streams. To a first-order approximation, using N streams will also give you N times the latent ILP—and given how serial a lot of the direct decoders are, this will translate into a substantial (usually not quite N-times, but still very noticeable) speed-up in the decoder on wide-enough processors. So what’s the catch? There are several:

  1. Using multiple streams is a change to the bitstream format, not just an implementation detail. In particular, in any long-term storage format, any change in the number of bitstreams is effectively a change in the protocol or file format.
  2. You need to define how to turn the multiple streams into a single output bytestream. This can be simple concatenation along with a header, it can be some form of interleaving or a sophisticated framing format, but no matter what it ends up being, it’s an increase in complexity (and usually also in storage overhead) relative to producing a single bitstream that contains everything in the order it’s read.
  3. For anything with short packets and low latency requirements (e.g. game packets or voice chat), you either have to interleave streams fairly finely-grained (increasing size overhead), or suffer latency increases.
  4. Decoding from N streams in parallel increases the amount of internal state in the decoder. In the decoder variants shown above, a N-wide variant needs N copies of rBitBuf, rBitPos/rBitCount and rBitPtr, at the very least, plus several temporary registers. For N=2 this is usually not a big deal, but for large counts you will start to run out of registers at least on some targets. There’s relatively little work being done on any given individual data item; if values get spilled from registers, the resulting loads and stores tend to have a very noticeable cost and will easily negate the benefit from using more streams.

In short, it’s not a panacea, but one of the usual engineering trade-offs. So how many streams should you use? It depends. At this point, for anything that is even remotely performance-sensitive, I would recommend trying at least N=2 streams. Even if your decoder has a lot of other stuff going on (computations with the decoded values etc.), bitstream decoding tends to be serial enough that there’s many wasted cycles otherwise, even on something relatively narrow like a dual-issue in-order machine. Having two streams adds a relatively small amount of overhead to the bitstream format (to signal the start of the data for stream 2 in every coding unit, or something equivalent), needs a modest amount of extra state for the second bit decoder, and tends to result in sizeable wins on pretty much any current CPU.

Using more than 2 streams can be a significant win in tight loops that do nothing but bitstream decoding, but is overkill in most other cases. Before you commit to a specific (high) number, you ideally want to try implementations on at least a few different target devices; a good number on one device may be past a big performance cliff on another, and having that kind of thing enshrined in a protocol or file format is unfortunate.
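To make the two-stream case concrete, here’s a schematic sketch. The bit reader state is bundled into a struct so we can keep two independent copies; refill and decode_one_symbol are stand-ins for any of the refill and peek/consume combinations shown earlier plus the actual symbol decoding, and error handling is elided.

typedef struct {
    const uint8_t *bitptr;
    uint64_t       bitbuf;
    int            bitpos;
} BitReader;

void decode_two_streams(BitReader *s0, BitReader *s1, uint8_t *out, size_t num_syms) {
    for (size_t i = 0; i + 1 < num_syms; i += 2) {
        refill(s0);
        refill(s1);
        // These two decodes don't depend on each other, so their
        // serial dependency chains can overlap on a wide core.
        out[i + 0] = decode_one_symbol(s0);
        out[i + 1] = decode_one_symbol(s1);
    }
    // (Trailing odd symbol and bounds checks omitted.)
}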

Aside: SIMD? GPU?

If you use many streams, can you use SIMD instructions, or offload work to a GPU? Yes, you can, but the trade-offs get a bit icky here.

Vectorizing the simple decoders outlined above directly is, generally speaking, not great. There’s not a lot of computation going on per iteration, and operations such as refills end up using gathers, which tend to have a high associated overhead. To hide this overhead, and the associated latencies, you generally still need to be running multiple instances of your SIMD decoder in parallel, so your total number of streams ends up being the number of SIMD lanes times two (or more, if you need more instances). Having a high number of streams may be OK if all your targets have good wide SIMD support, but can be a real pain if you need to decode on at least one that doesn’t.

The same thing goes for GPUs, but even more so. With single warps/wavefronts of usually 16-64 invocations, we’re talking many streams just to not be running a kernel at quarter utilization, and we generally need to dispatch multiple warps worth of work to hide memory access latency. Between these two factors, it’s easy to end up needing well over 100 parallel streams just to not be stalled most of the time. At that scale, the extra overhead for signaling individual stream boundaries is definitely not negligible anymore, and the magic numbers are different between different GPU vendors; striking a useful compromise between the needs of different GPUs while also retaining the ability to decode on a CPU if no suitable GPU is available starts to get quite tricky.

There are techniques to at least make the memory access patterns and interleaving overhead somewhat more palatable (I wrote about this elsewhere), but this is an area of ongoing research, and so far there’s no silver bullet I could point at and just say “do this”. This is definitely a challenge going forward.

Tricks with multiple streams

If you’re using multiple streams, you need to decide how these multiple streams get assembled into the final output bitstream. If you don’t have any particular reason to go with a fine-grained interleaving, the easiest and most straightforward option is to concatenate the sub-streams, with a header telling you how long the individual pieces are, here pictured for 3 streams:

Three sub-streams, linear layout

Also pictured are the initial stream bit pointers before reading anything (pointers in a C-like or assembly-like setting; if you’re using something higher-level, probably indices into a byte slice). The beginning of stream 0 is implicit—right after the end of the header—and the end of the final stream is often supplied by an outer framing layer, but the initial positions of bitptr1 and bitptr2 need to be signaled in the bytestream somehow, usually by encoding the length of streams 0 and 1 in the header.
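In code, setting up the initial read pointers for this three-stream layout might look like the following, assuming a hypothetical header that stores the byte lengths of streams 0 and 1, and a total length supplied by the outer framing layer:

const uint8_t *bitptr0 = data;                  // implicit: right after the header
const uint8_t *bitptr1 = bitptr0 + len_stream0; // length of stream 0 from the header
const uint8_t *bitptr2 = bitptr1 + len_stream1; // length of stream 1 from the header
const uint8_t *bitend2 = data + total_len;      // end of the whole byte slice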

One thing I haven’t mentioned so far are bounds checks. Compressed data is normally untrusted since all the channels you might get that data from tend to be prone to either accidental (error during storage or in transit) or intentional (malicious attacker trying to craft harmful data) corruption, so careful input validation is not optional. What this usually boils down to in practice is that every load from the bitstream needs to be guarded by a range check that guarantees it’s in bounds. The overhead of this can be reduced in various ways. For example, one popular method is to unroll loops a few times and check at the top that there are enough bytes left for worst-case number of bytes consumed in the number of unrolled iterations, then only dropping to a careful loop that checks every single byte access at the very end of the stream. I’ve written about another useful technique before.

But why am I mentioning this here? Because it turns out that with multiple streams laid out sequentially, the overhead of bounds checking can be reduced. A direct range check for 3 streams that checks whether there are at least K bytes left would look like this:

// This needs to happen before we do any loads:
// If any of the streams are close to exhausted
// (fewer than K bytes left), drop to careful loop
if (bitend0 - bitptr0 < K ||
    bitend1 - bitptr1 < K ||
    bitend2 - bitptr2 < K)
    break;

But when the three streams are sequential, we can use a simpler expression. First, we don’t actually need to worry about reading past the end of stream 0 or stream 1 as long as we still stay within the overall containing byte slice. And second, we can relax the check in the inner loop to use a much weaker test:

// Only check the last stream against the end; for
// other streams, simply test whether the read
// pointer for an earlier stream is overtaking the
// read pointer for a later stream (which is never
// valid)
if (bitptr0 > bitptr1 ||
    bitptr1 > bitptr2 ||
    bitend2 - bitptr2 < K)
    break;

The idea is that bitptr1 starts out pointing at bitend0, and only keeps increasing from there. Therefore, if we ever have bitptr0 > bitptr1, we know for sure that something went wrong and we read past the end of stream 0. That will give us garbage data (which we need to handle anyway), but not read out of bounds, since the checks maintain the invariant that bitptr0 ≤ bitptr1 ≤ bitptr2 ≤ bitend2 - K. A later careful loop should use more precise checking, but this variant of the test is simpler and doesn’t require most of the bitend values to be reloaded in every iteration of our decoding loop.

Another interesting option is to reverse the order of some of the streams (which flips endianness as a side effect), and then glue pairs of forward and backward streams together, like shown here for streams 1 and 2:

Three sub-streams with forward/backward pair

I admit this sounds odd, but this has a few interesting properties. One of them is that it shrinks the amount of header space somewhat: in the image, the initial stream pointer for stream 2 is the same as the end of the buffer, and if there were 4 streams, the initial read pointers for stream 2 and 3 would start out in the same location (but going opposite directions). In general, we only need to denote the boundaries between stream pairs instead of individual streams. Then we let the decoder run as before, checking that the read cursors for the forward/backward pair don’t cross. If everything went right, once we’ve consumed the entire input stream, the final read cursors in a forward/backward pair should end up right next to each other. It’s a bit strange in that we don’t know the size of either stream in advance, just their sum, but it works fine.

Another consequence is that there’s no need to keep track of an explicit end pointer in the inner decoder loop if the final stream is a backwards stream; the pointer-crossing check takes care of it. In our running example, we’re now down to

// Check for pointer crossing; if done right, we get end-of-buffer
// checks for free.
if (bitptr0 > bitptr1 ||
    bitptr1 > bitptr2)
    break;

In this version, bitptr0 and bitptr1 point at the next byte to be read in the forwards stream, whereas bitptr2 is offset by -K to ensure we don’t overrun the buffer; this is just a constant offset, however, which folds into the memory access on regular load instructions. It’s all a bit tricky, but it saves a couple of instructions, makes the bitstream slightly smaller and reduces the number of live variables in a hot loop, with the savings usually being larger than the cost of a single extra endian swap. With a two-stream layout, generating the second bitstream in reverse also happens to be convenient on the encoder side, because we can reserve memory for the expected (or budgeted) size of the combined bitstream without having to guess how many bytes end up in either half; it’s just a regular double-ended stack. Once encoding is done, the two parts can be compacted in-place by moving the second half downwards.
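A minimal sketch of that double-ended encoder arrangement, with byte-granularity output and made-up names (a real encoder emits bits, handles overflow, and so on):

uint8_t *buf_begin, *buf_end;   // one block sized for the combined budget
uint8_t *fwd_cur;               // forward stream writes here, growing up
uint8_t *bwd_cur;               // backward stream writes here, growing down

void init_streams(uint8_t *begin, uint8_t *end) {
    buf_begin = begin; buf_end = end;
    fwd_cur = begin;            // forward stream starts at the front
    bwd_cur = end;              // backward stream starts at the back
}

void put_fwd_byte(uint8_t b) { assert(fwd_cur < bwd_cur); *fwd_cur++ = b; }
void put_bwd_byte(uint8_t b) { assert(fwd_cur < bwd_cur); *--bwd_cur = b; }

// Once encoding is done, compact in place: move the backward half down
// so the two halves are contiguous. memmove because the regions may overlap.
size_t finish_streams() {
    size_t bwd_len = (size_t)(buf_end - bwd_cur);
    memmove(fwd_cur, bwd_cur, bwd_len);
    return (size_t)(fwd_cur - buf_begin) + bwd_len; // total combined size
}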

None of these properties are a big deal in and of themselves, but they make for a nice package, and a two-stream setup with a forwards/backwards pair is now our default layout for most parts of the Oodle bitstream (Oodle is a lossless data compression library I work on).

Between the various tricks outlined so far, the size overhead and the extra CPU cost for wrangling multiple streams can be squeezed down quite far. But we still have to deal with the increased number of live variables that multiple streams imply. It turns out that if we’re willing to tolerate a moderate increase in critical path latency, we can reduce the amount of state variables per bit reader, in some cases while simultaneously (slightly) reducing the number of instructions executed. The advantage here is that we can fit more streams into a given number of working registers than we could otherwise; if we can use enough streams that we’re primarily limited by execution throughput and not critical path latency, increasing said latency is OK, and reducing the overall number of instructions helps us increase the throughput even more. So how does that work?

Bit reader variant 5: minimal state, throughput-optimized

The bit reader variants I’ve shown so far generally split the bit buffer state across two variables: one containing the actual bits and another keeping track of how many bits are left in the buffer (or, equivalently, keeping track of the current read position within the buffer). But there’s a simple trick that allows us to reduce this to a single state variable: the bit shifts we use always shift in zeros. If we turn the MSB (for a LSB-first bit buffer) or the LSB (for a MSB-first bit buffer) into a marker bit that’s always set, we can use that marker to track how many bits we’ve consumed in total come the next refill. That allows us to get rid of the bit count and the instructions that manipulate it. That means one less variable in need of a register, and depending on which variant we’re comparing to, also fewer instructions executed per “consume”.

I’ll present this variant in the LSB-first version, and this time there’s an actual reason to (slightly) prefer LSB-first over MSB-first.

const uint8_t *bitptr; // Pointer to current byte
uint64_t bitbuf = 1ull << 63; // Init to marker in MSB

void refill5_lsb() {
    assert(bitbuf != 0);

    // Count how many bits we consumed using a "leading zero
    // count" instruction. See notes below.
    int bits_consumed = CountLeadingZeros64(bitbuf);

    // Advance the pointer
    bitptr += bits_consumed >> 3;

    // Refill and put the marker in the MSB
    bitbuf = read64LE(bitptr) | (1ull << 63);

    // Consume the bits in this byte that we've already used.
    bitbuf >>= bits_consumed & 7;
}

uint64_t peekbits5_lsb(int count) {
    assert(count >= 1 && count <= 56);
    // Just need to mask the low bits.
    return bitbuf & ((1ull << count) - 1);
}

void consume5_lsb(int count) {
    bitbuf >>= count;
}

This “count leading zeros” operation might seem strange and weird if you haven’t seen it before, but it happens to be something that’s useful in other contexts as well, and most current CPU architectures have fast instructions that do this! Other than the strangeness going on in the refill, where we first have to figure out the number of bits consumed from the old marker bit, then insert a new marker bit and do a final shift to consume the partial bits from the first byte, this is like a hybrid between variants 3 and 4 from last time.
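In case the operation is unfamiliar: one way to get at it from C is via compiler intrinsics. This is toolchain-specific, so treat it as an assumption; __builtin_clzll is the GCC/Clang spelling, and MSVC has _lzcnt_u64 or _BitScanReverse64 instead.

// GCC/Clang version; note __builtin_clzll is undefined for an input
// of 0, which is fine here because the marker bit guarantees that
// bitbuf is never 0.
static inline int CountLeadingZeros64(uint64_t x) {
    return __builtin_clzll(x);
}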

The pseudo-assembly for our running “refill, two decodes, then another refill” scenario goes like this: (not writing out the marker constant explicitly here)

    // Initial refill
    rBitsConsumed = clz64(rBitBuf);     // CountLZ 0
    rBytesAdv = lsr(rBitsConsumed, 3);  // AdvanceBytes 0
    rBitPtr = rBitPtr + rBytesAdv;      // AdvancePtr 0
    rNext = load64LE(rBitPtr);          // LoadNext 0
    rMarked = rNext | MARKER;           // OrMarker 0
    rLeftover = rBitsConsumed & 7;      // LeftoverBits 0
    rBitBuf = lsr(rMarked, rLeftover);  // ConsumeLeftover 0

    // First decode (peek count==19)
    rBits = rBitBuf & 0x7ffff;          // BitsMasked 0
    rCount = determineCount(rBits);     // DetermineCount 0
    rBitBuf = lsr(rBitBuf, rCount);     // Consume 0

    // Second decode
    rBits = rBitBuf & 0x7ffff;          // BitsMasked 1
    rCount = determineCount(rBits);     // DetermineCount 1
    rBitBuf = lsr(rBitBuf, rCount);     // Consume 1

    // Second refill
    rBitsConsumed = clz64(rBitBuf);     // CountLZ 1
    rBytesAdv = lsr(rBitsConsumed, 3);  // AdvanceBytes 1
    rBitPtr = rBitPtr + rBytesAdv;      // AdvancePtr 1
    rNext = load64LE(rBitPtr);          // LoadNext 1
    rMarked = rNext | MARKER;           // OrMarker 1
    rLeftover = rBitsConsumed & 7;      // LeftoverBits 1
    rBitBuf = lsr(rMarked, rLeftover);  // ConsumeLeftover 1

The refill has 7 integer operations, the same as variant 4 (“lookahead”) above, and 3 more than variant 3 (“bit extract”), while the decode step takes 3 operations (including the determineCount step), one fewer than variants 3 (“bit extract”) and 4 (“lookahead”). The latter means that we equalize with the regular bit extract form in terms of instruction count when we perform at least 3 decodes per refill, and start to pull ahead if we manage more than 3. For completeness, here’s the dependency graph:

Dependency graph for bit reading, variant 5

Easily the longest critical path of the variants we’ve seen so far, and very serial indeed. It doesn’t help that not only do we not know the load address early, we also have several more steps in the refill compared to the basic variant 3. But having the entire “hot” bit buffer state concentrated in a single register (rBitBuf) during the decodes means that we can afford many streams at once, and with enough streams that extra latency can be hidden.

This one definitely needs to be deployed carefully, but it’s a powerful tool when used in the right place. Several of the fastest (and hottest) decoder loops in Oodle use it.

Note that with this variation, there’s a reason to stick with the LSB-first version: the equivalent MSB-first version needs a way to count the number of trailing zero bits, which is a much less common instruction, although it can be synthesized from a leading zero count and standard arithmetic/logical operations at acceptable extra cost. Which brings me to my final topic for this post.

MSB-first vs. LSB-first: the final showdown

Throughout this three-part series, I’ve been continually emphasizing that there’s no major reason to prefer MSB-first or LSB-first for bit IO. Both are broadly equivalent and have efficient algorithms. But having now belabored that point sufficiently, if we can make both of them work, which one should we choose?

There are definitely differences that push you into one direction or another, depending on your intended use case. Here are some you might want to consider, in no particular order:

  • As we saw in part 2, the natural MSB-first peekbits and getbits implementations run into trouble (of the undefined-behavior and hardware-actually-behaving-in-surprising-ways kind) when count == 0, whereas with the natural LSB-first implementation, this case is unproblematic. If you need to support counts of 0 (useful for e.g. variable-length codes), LSB-first tends to be slightly more convenient. Alternatives for MSB-first are a rotate-based implementation (which has no problems with 0 count) or using an extra shift, turning x >> (64 - count) into (x >> 1) >> (63 - count); see the short sketch after this list.
  • MSB-first coding tends to have a big edge for universal variable-length codes. Unary codes can be decoded quickly via the aforementioned “count leading zero” instructions; gamma codes and the closely related Exp-Golomb codes also admit direct decoding in a fairly slick way; and the same goes for Golomb-Rice codes and a few others. If you’re considering universal codes, MSB-first is definitely handier.
  • At the other extreme, LSB-first coding often ends up slightly cheaper for the table-based decoders commonly used when a code isn’t fixed as part of the format; Huffman decoders for example.
  • MSB-first meshes somewhat more naturally with big-endian byte order, and LSB-first with little-endian. If you’re deeply committed to either side in this particular holy war, this might drive you one way or the other.
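To make the extra-shift workaround mentioned above concrete, here’s a minimal sketch of a count-0-safe MSB-first peek, written against the variant-3 style state from part 2 (bitbuf holding the last 64 bits read, bitpos the number of bits already consumed); consider it illustrative only.

uint64_t peekbits_msb_safe(int count) {
    assert(count >= 0 && count <= 64 - 7);
    assert(bitpos + count <= 64);

    uint64_t remaining = bitbuf << bitpos;  // shift out consumed bits
    // Two right shifts, each by at most 63, so count == 0 is fine:
    // (remaining >> 1) >> 63 == 0, as desired for a 0-bit peek.
    return (remaining >> 1) >> (63 - count);
}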

Charles and I both tend to default to MSB-first but will switch to LSB-first where it’s a win on multiple target architectures (or on a single important target).

Conclusion

That’s it for both this post and this mini-series; apologies for the long delay, caused by first a surprise deadline that got dropped in my lap right as I was writing the series originally, and then exacerbated by a combination of technical difficulties (alas, still ongoing) and me having gotten “out of the groove” in the intervening time.

This post ended up longer than my usual, and skips around topics a bit more than I’d like, but I really didn’t want to make this series a four-parter; I still have a few notes here and there, but I don’t want to drag this topic out much longer, not on this general level anyway. Instead, my plan is to write about some more down-to-earth case studies soon, so I can be less hand-wavy, and maybe even do some actual assembly-level analysis for an actual real-world CPU instead of an abstract idealized machine. We’ll see.

Until then!

Rate-distortion optimization


“Rate-distortion optimization” is a term in lossy compression that sounds way more complicated and advanced than it actually is. There’s an excellent chance that by the end of this post you’ll go “wait, that’s it?”. If so, great! My goal here is just to explain the basic idea, show how it plays out in practice, and maybe get you thinking about other applications. So let’s get started!

What does “rate-distortion optimization” actually mean?

The mission statement for lossless data compression is pretty clear. A lossless compressor transforms one blob of data into another one, ideally (but not always) smaller, in a way that is reversible: there’s another transform (the decoder) that, when applied to the second blob, will return an exact copy of the original data.

Lossy compression is another matter. The output of the decoder is usually at least slightly different from the original data, and sometimes very different; and it’s almost always possible to take an existing lossily-compressed piece of data, say a short video or MP3 file, make a slightly smaller copy by literally deleting a hundred bytes from the middle of the file with a hex editor, and get another version of the original video (or audio) file that is still clearly recognizable yet degraded. Seriously, if you’ve never done this before, try it, especially with MP3s: it’s quite hard to mangle an MP3 file in a way that will make it not play back anymore, because MP3s have no required file-level headers, tolerate arbitrary junk in the middle of the bitstream, and are self-synchronizing.

With lossless compression, it makes sense to ask “what is the smallest I can make this file?”. With lossy compression, less so; you can generally get files far smaller than you would ever want to use, because the data is degraded so much it’s barely recognizable (if that). Minimizing file size alone isn’t interesting; we want to minimize size while simultaneously maximizing the quality of the result. The way we do this is by coming up with some error metric that measures the distance between the original image and the result the decoder will actually produce given a candidate bitstream. Now our bitstream has two associated values: its length in bits, the (bit) rate, usually called R in formulas, and a measure of how much error was introduced as a result of the lossy coding process, the distortion, or D in formulas. R is almost always measured in bits or bytes; D can be in one of many units, depending on what type of error metric is used.

Rate-distortion optimization (RDO for short) then means that an encoder considers several possible candidate bit streams, evaluates their rate and distortion, and tries to make an optimal choice; if possible, we’d like it to be globally optimal (i.e. returning a true best-possible solution), but at the very least optimal among the candidates that were considered. Sounds great, but what does “better” mean here? Given a pair (R_1,D_1) and another pair (R_2,D_2), how do we tell which one is better?

The pareto frontier

Suppose we have 8 possible candidate solutions under consideration, each with their own rate and distortion scores. If we prepare a scatter plot of rate on the x-axis versus distortion on the y-axis, we might get something like this:

Rate vs. Distortion scatterplot

The thin, light-gray candidates aren’t very interesting, because what they have in common is that there is at least one other candidate solution that is strictly better than them in every way. That is, some other candidate point has the same (or lower) rate, and also the same (or lower) distortion as the grayed-out points. There’s just no reason to ever pick any of those points, based on the criteria we’re considering, anyway. In the scatterplot, this means that there is at least one other point that is both to the left (lower rate) and below (lower distortion).

For any of the points in the diagram, imagine a horizontal and a vertical line going through it, partitioning the plane into four quadrants. Any point that ends up in the lower-left quadrant (lower rate and lower distortion) is clearly superior. Likewise, any point in the upper-right quadrant (higher rate and higher distortion) is clearly inferior. The situation with the other two quadrants isn’t as clear.

Which brings us to the other four points: the three fat black points, and the red point. These are all points that have no other points to the left and below them, meaning they are pareto efficient. This means that, unlike the situation with the light gray points, we can’t pick another candidate that improves one of the metrics without making the other worse. The set of points that are pareto efficient constitutes the pareto frontier.

These points are not all the same, though. The three fat black points are not just pareto efficient, but are also on the convex hull of the point set (the lower left contour of the convex hull here is drawn using blue lines), whereas the red point is not. The points that are both pareto and on the convex hull of the pareto frontier are particularly important, and can be characterized in a different way.

Namely, imagine taking a straightedge, angling it so that it’s either horizontal, “descending” (with reference to our graph) from left to right, or vertical, and then sliding it slowly “upwards” from the origin without changing its orientation until it hits one of the points in our set. It will look something like this:

Rate vs. Distortion scatterplot with lines

The shading here is meant to suggest the motion of the green line; we keep sliding it up until it eventually catches on the middle of our three fat black points. If we change the angle of our line to something else, we can manage to hit the other two black points, but not the red one. This has nothing to do with this particular problem and is a general property of convex sets: every vertex of the set is extremal in some direction.

Getting a bit more precise here, the various copies of the green line I’ve drawn correspond to lines

w_1 R + w_2 D = \mathrm{const.}

and the constraints I gave on the orientation of the line boil down to w_1, w_2 \ge 0 (with at least one being nonzero). Sliding our straightedge until we hit a point corresponds to the minimization problem

\min_i w_1 R_i + w_2 D_i

for a given choice of w1 and w2, and the three black points are the three possible solutions we might get for our set of candidate points. So we’ve switched from purely minimizing rate or purely minimizing distortion (both of which, in general, tend to give somewhat pathological results) towards minimizing some linear combination of the two with non-negative weights; and doing so with various choices of the weights w1 and w2 will allow us to sweep out the lower left convex hull of the pareto frontier, which is often the “interesting” part (although, as the red point in our example illustrates, this process excludes some of the points on the pareto frontier).

This does not seem particularly impressive so far: we don’t want to purely minimize one quantity or the other, so instead we’re minimizing a linear combination of the two. That seems like it would’ve been the obvious next thing to try. But looking at the characterization above does at least give us some idea of what looking at these linear combinations ends up doing, and exactly what we end up giving up (namely, the pareto points not on the convex hull). And there’s another massively important aspect to consider here.

The Lagrangian connection

If we take our linear combination above and divide through by w1 or w2 (assuming they are non-zero; dividing our objective by a scalar constant does not change the results of the optimization problem), respectively, we get:

L_1 = R + \frac{w_2}{w_1} D =: R + \lambda_1 D
L_2 = D + \frac{w_1}{w_2} R =: D + \lambda_2 R

which are essentially the Lagrangians we would get for continuous optimization problems of the form “minimize R subject to D=const.” (L1) and “minimize D subject to R=const.” (L2); that is, if we were in a continuously differentiable setting (for data compression we usually aren’t), trying to solve the problems of either minimizing bit rate while hitting a set distortion target or minimizing distortion within a given bit rate limit would lead us to study the same type of expression. Generalizing one step further, allowing not just equality but also inequality constraints (i.e. rate or distortion at most a certain value, rather than requiring an exact match) still leads to essentially the same functions, this time by way of the KKT conditions.

In this post, I intentionally went “backwards” by starting with the Lagrangian-esque expressions and then mentioning the connection to continuous optimization problems because I want to emphasize that this type of linear combination of the different metrics arises naturally, even in a fully discrete setting; starting out with Lagrange or KKT multipliers would get us into technical difficulties with discrete decisions that do not necessarily admit continuously differentiable objectives or constraint functions. Since the whole process makes sense even without explicitly mentioning Lagrange multipliers at all, that seemed like the most natural way to handle it.

What this means in practice

I hopefully now have you convinced that looking at the minima of the linear combination

w_1 R + w_2 D

is a sensible thing to do, and both our direct derivation and the two Lagrange multiplier formulations for continuous problems I mentioned lead us towards it. Neither our direct derivation nor the Lagrange multiplier version tells us what to set our mystery weight parameters to, however. In fact, the Lagrange multiplier method flat-out tells you that for every instance of your optimization problem, there exist values of the Lagrange multipliers that correspond to an optimum of the problem you’re interested in, but it doesn’t tell you how to get them.

However, what’s nice is that the same thing also works in reverse, as we saw earlier with our line-sweeping procedure: picking the angle of the line we’re sliding around corresponds to picking a Lagrange multiplier. No matter which one we pick, we are going to end up finding an optimal point for some trade-off, namely one that is pareto; it just might not be the exact trade-off we wanted.

For example, suppose we decide that a single-variable parameterization like in the Lagrange multiplier scenario is convenient, and we pick something like L1, namely w1 fixed at 1, w2 = λ. We were assuming from the beginning that we have some method of generating candidate solutions; all that’s left to do is to rank them and pick a winner. Let’s start with λ=1, which leads to a solution with some (R,D) pair that minimizes R + D. Note it’s often useful to think of these quantities with units attached; we typically measure R in bits and [D] is whatever it is, so the unit of λ must be [λ] = bits/[D], i.e. λ is effectively an exchange rate that tells us how many units of distortion are worth as much as a single bit. We can then look at the R and D values of the solution we got back. If say we’re trying to hit a certain bit rate, then if R is close to that target, we might be happy and just stop. If R is way above the target bit rate, we overshot, and clearly need to penalize high distortions less; we might try another round with λ=0.1 next. Conversely, if say R is a few percent below the target rate, we might try another round with slightly higher lambda, say λ=1.02, trying to penalize distortion a bit more so we spend more bits to reduce it, and see if that gets us even closer.

With this kind of process, even knowing nothing else about the problem, you can systematically explore the options along the pareto frontier and try to find a better fit. What’s nice is that finding the minimum for a given choice of parameters (λ in our case) doesn’t require us to store all candidate options considered and rank them later; storing all candidates is not a big deal in our original example, where we were only trying to decide between a handful of options, but in practice you often end up with huge search spaces (exponential in the problem size is not unheard of), and being able to bake it down to a single linear function is convenient in other ways; for example, it tends to work well with efficient dynamic programming solutions to problems that would be infeasible to handle with brute force.
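One way to mechanize that kind of search is a simple bracketing loop over λ: pick the best candidate under R + λD, compare its rate against the target, and tighten the bracket accordingly. This is a toy sketch with arbitrary bracket bounds and iteration count (and it assumes the standard C math library for sqrt), not how any particular encoder does it.

// Rank candidates by R + lambda*D and return the index of the winner.
int best_candidate(const double *R, const double *D, int n, double lambda) {
    int best = 0;
    for (int i = 1; i < n; i++)
        if (R[i] + lambda * D[i] < R[best] + lambda * D[best])
            best = i;
    return best;
}

// Search for a lambda whose winning candidate lands near target_rate.
// Higher lambda penalizes distortion more, which spends more bits.
double search_lambda(const double *R, const double *D, int n, double target_rate) {
    double lo = 1e-6, hi = 1e6;             // arbitrary bracket (assumption)
    for (int iter = 0; iter < 50; iter++) {
        double lambda = sqrt(lo * hi);      // geometric midpoint
        int best = best_candidate(R, D, n, lambda);
        if (R[best] > target_rate)
            hi = lambda;                    // overshot the rate: penalize distortion less
        else
            lo = lambda;                    // room to spare: can afford to spend more bits
    }
    return sqrt(lo * hi);
}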

Wait, that’s it?

Yes, pretty much. Instead of trying to purely minimize bit rate or distortion, you use a combination R + \lambda D and vary λ to taste to hit your targets. Often, you know enough about your problem space to have a pretty good idea of what values λ should have; for example, in video compression, it’s pretty common to tie the λ used when coding residuals to quantizer step size, based on the (known) behavior of the quantizer and the (expected) characteristics of real-world signals. But even when you don’t know anything about your data, you can always use a search process for λ as outlined above (which is, of course, slower).

Now the one thing to note in practice is that you rarely use just a single distortion metric; for example, in video coding, it’s pretty common to use one distortion metric when quantizing residuals, a different one for motion search, and a third for overall block mode decision. In general, the more frequently something is done (or the bigger the search space is), the more willing codecs are to make compromises with their distortion measure to enable more efficient searches. In general, good results require both decent metrics and doing a good job exploring the search space, and accepting some defects in the metrics in exchange for a significant increase in search space covered in the same amount of time is often a good deal.

But the basic process is just this: measure bit rate and distortion (in your unit of choice) for every option you’re considering, and then rank your options based on their combined R + λD (or D + λ′R, which is a different but equivalent parameterization) scores. This gives you points on the lower-left convex hull of the Pareto frontier, which is what you want.
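
As a tiny illustration (again mine, not from the text), ranking a list of measured (R, D) candidates for a given λ is just a linear scan for the minimum score:

  #include <vector>

  struct Candidate { double R, D; };

  // Index of the candidate with the smallest R + lambda*D.
  size_t pick_best(const std::vector<Candidate>& c, double lambda)
  {
      size_t best = 0;
      for (size_t i = 1; i < c.size(); ++i)
          if (c[i].R + lambda * c[i].D < c[best].R + lambda * c[best].D)
              best = i;
      return best;
  }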

This applies in other settings as well. For example, the various lossless compressors in Oodle are, well, lossless, but they still have a bunch of decisions to make, some of which take more time in the decoder than others. For a lossless codec, measuring “distortion” doesn’t make any sense (it’s always 0), but measuring decode time does; so the Oodle encoders optimize for a trade-off between compressed size and decode time.

Of course, you can have more parameters too; for example, you might want to combine these two ideas and do joint optimization over bit rate, distortion, and decode time, leading to an expression of the type R + λD + μT with two Lagrange multipliers, with λ as before, and a second multiplier μ that encodes the exchange rate from time units into bits.

Either way, the details of this quickly get complicated, but the basic idea is really quite simple. I hope this post de-mystifies it a bit.


Entropy coding in Oodle Data: the big picture


April 26, 2016 was the release date of Oodle 2.1.5, which introduced Kraken, so Kraken recently celebrated its 5-year anniversary. A few months later we launched Selkie and Mermaid, which were already deep in development at the time of the Kraken release but not quite finished yet, and in early 2018 we added Leviathan, adding a higher-compression option to our current suite of codecs, the “sea beastiary”.

In those 5 years, these codecs have seen wide adoption in their intended market (which is video games).

There are a few interesting things to talk about, and the 5-year anniversary seems as good a reason as any to get started; this is the first post in what will be a series of yet-to-be-determined length, in which I’ll do a deep dive on one interesting aspect of these codecs, namely the way they handle entropy coding.

Before I can get into any details, first some general notes on how things fit together. Kraken, Mermaid, Selkie, and Leviathan are lossless data compression algorithms, all variations on the basic LZ77 + entropy coding formula that has been the de facto standard in general-purpose compressors since the late 80s, because such codecs can achieve a good balance of compression ratio and compression/decompression speed for practical applications. Other codecs belonging to this family include Deflate (ZIP/gzip), LZX (Amiga LZX/CAB), LZMA (7zip/xz), Zstd, Brotli, LZHAM, and many others, including most of the older Oodle Data codecs (Oodle LZH/LZHLW, LZA, LZNA, and BitKnit).

The “LZ77” portion here refers to the LZ77 algorithm, which, broadly speaking, compresses data by replacing repeated byte sequences in a stream with references to prior occurrences; as long as the back-reference is smaller than the bytes themselves, this will result in compression. Nobody actually uses the original LZ77 algorithm per se (which has a very inefficient encoding by today’s standards), but the general idea remains the same. Some codecs, especially ones designed for faster decoding or encoding, use just this byte/string matching approach without any entropy coding; this includes many well-known faster codecs such as LZ4, Snappy, LZO, and again many others, including the remaining older Oodle Data codecs (Oodle LZB/LZBLW and LZNib).

I won’t talk about the LZ77 portion here (that’s its whole own thing), and instead focus on the “entropy coding” portion of things. In essence, what all these codecs do is identify repeated strings, resulting in some mixture of back-references (“matches”) and bytes that couldn’t be profitably matched to anything earlier in the data stream (“literals”). All the codecs mentioned differ in how exactly these things are encoded, but all of them eventually boil it down to one or more data streams that each contain symbols from a finite and not too large alphabet (typically a few dozen to a few hundred distinct symbols), plus maybe a few extra streams that contain raw bits for data that is essentially random.

Entropy coding, then, processes these streams of symbols from a finite alphabet and tries to turn them into an equivalent form that is smaller. There are numerous ways to do this; one well-known method, and indeed a mainstay in data compression to this day (with some caveats), is Huffman coding, which counts how often individual symbols occur and assigns short bit codes to more likely symbols and longer codes to less likely symbols, in a way that remains unambiguous and uniquely decodable.1

The second major class of techniques is arithmetic coding, which compresses slightly better than Huffman for typical data and can give significant improvements on either small alphabets or highly skewed distributions, and is also more suitable for applications where there is no a priori known fixed symbol distribution, but rather an on-line model that is being updated as new data is seen (“adaptive coding”).

Finally, the new kid on the block is the ANS (“Asymmetric Numeral System”) family of methods, which is conceptually related to arithmetic coding but works differently, and has different trade-offs. Kraken and Mermaid use ANS sometimes, Leviathan more frequently; I’m sure I’ll cover its usage in those codecs at some point, and for now there are plenty of my older writings on the topic from around 2014-2016.

Anyway, it is common for LZ77-derived algorithms with a Huffman or arithmetic coding backend to use odd alphabet sizes; when you’re assigning variable-length codes to everything anyway, it doesn’t really matter whether your alphabet has just 256 distinct byte values in it, or if the limit is 290 or 400 or something else non-power-of-2. It’s likewise common to interleave all kinds of data into a single contiguous bitstream. The sea beastiary family does things a bit differently than most codecs.

Namely, the input (uncompressed) data is processed in 128KiB chunks, which all have self-contained bytestreams with an explicitly signaled size.2 These chunk bytestreams are themselves made up out of multiple independent bitstreams for different types of data, and most of these streams (except for a few “misc”/leftover data streams) use a single shared entropy coding layer to handle what we now call “primary streams”.3

This layer works on streams that always use a byte alphabet, i.e. unsigned values between 0 and 255. That means that everything that goes through this layer needs to be presented in a byte-packed format. This kind of restriction is common in fast byte-aligned LZ77 coders without a final entropy coding stage (like LZ4 or Snappy), but atypical outside of that space.

Working in bytes means some care needs to be taken in other parts of the codec design, but it comes with a massive advantage: it allows us to have an uncompressed mode that just stores streams as raw bytestreams and doesn’t need any decoding at all. That means that streams that are nearly random, or just short, don’t have to waste CPU time on setting up a more complicated decoder and then decoding bytes individually through it, when just grabbing uncompressed bytes straight from the bytestream is good enough.

In short, for most codecs, the choice of entropy coding scheme is a design-time decision. Algorithms like Deflate, which uses LZ+Huffman, always perform Huffman coding even when doing so isn’t actually worthwhile, because they use extended alphabets with more than 256 symbols, which means pass-through coding would still need to use larger-than-8-bit units, both introducing extra waste and making it more expensive in terms of CPU time. Sticking with a byte alphabet means that the Oodle sea beastiary codecs can either use a conventional entropy coding stage or choose to send data uncompressed and act like a faster byte-aligned LZ codec. In fact, that is exactly how Selkie works: under the hood, Selkie is the same codec as Mermaid, but with all streams in uncompressed mode at all times (and some other features disabled).

The final big departure from most codecs is that the sea beastiary codecs run all entropy decoding as separate passes that consume an input bytestream with a common header and output an array of bytes. Uncompressed streams can skip the decoding step and consume data straight from the input bytestream. These passes just handle whatever entropy coding scheme is used and nothing else. This means that instead of the choice of entropy coders being fundamentally built into the algorithm, it’s a collection of small loops that can easily be interchanged or added to, and indeed that’s what we’ve been doing; the original version of Kraken shipped in Oodle 2.1.5 with a choice between just uncompressed streams and one particular version of Huffman coding. We’ve added several more options since then, including an alternative Huffman coding variant, an optional RLE pre-pass, TANS, and the ability to mix-and-match all of these within a single stream, all without it being a big redesign, and we’re quite happy with the results.

There’s many interesting things in here that I plan to cover in small installments, starting with the way we do Huffman (de)coding, which is worth several posts on its own. I’ll try to keep updating this at a reasonable cadence, but no promises since my blogging time is pretty limited these days!

Footnotes

[1] Huffman coding technically refers to such variable-length coding (meaning not all symbols need to be coded with the same number of bits) only when the code in question is actually generated via Huffman’s algorithm (which minimizes the length of the coded stream given a certain number of assumptions), but in practice most implementations don’t actually use Huffman’s original algorithm, which can assign awkwardly long code lengths to rare symbols. Instead the maximum code length is usually capped, which is a different problem and yields codes that are not actually Huffman codes (and are very slightly less efficient), but just general variable-length codes. Having said that, approximately everyone calls it Huffman coding even when the codes are length-limited anyway, and I won’t sweat it either in the following.

[2] The chunks being independent is how we implement what is called “thread-phased” decoding. As is explained further down in the article, Oodle decodes the bitstreams to temporary memory buffers in a first phase, and the chunks can be processed completely independently for this part of the process, making it a prime candidate to move off to a separate thread (or threads). After phase 1 completes, the actual LZ decoding takes place in “phase 2”; this portion can make references to any bytes earlier in the stream, so it must be run in order and can’t as easily be threaded. Customers that want wider parallelism (or easy seeking) usually chop data into fixed-size blocks, typically between 64KiB and 512KiB, that are encoded independently with no back-references permitted across block boundaries. This decreases compression ratio but makes it trivial to have many workers decoding different blocks at the same time.

[3] In the original code, these streams of bytes are, very non-distinctly, just called “arrays”. When specifying the bitstream more formally, a lot of things suddenly needed proper names to avoid confusion, and “arrays” was just too indistinct and overloaded, so “primary streams” was what we settled on. Sometimes additional data streams exist within a chunk for information not contained in the primary streams, and these are therefore called “secondary streams”.

Entropy coding in Oodle Data: Huffman coding


Last time I covered the big picture, so we know the ground rules for the modular entropy coding layer in Oodle Data: bytestream consisting of several independent streams, pluggable algorithms, bytes in and bytes out, and entropy decoding is done as a separate pass, not inlined into the code that consumes the decoded data.

There are good and bad news here: the good news is that the encoders and decoders need to do just one thing, they have homogeneous data to work with, and they ultimately boil down to very simple operations in extremely predictable loops so they can pull out all the stops when it comes to optimization. The bad news is that since they are running on their own, if they’re inefficient, there’s no other useful work being accomplished when the decoders are stalled. Careful design is a necessity if we want to ensure that things go smoothly come decode time.

In this post, I’ll start by looking at Huffman coding, or rather, Huffman decoding; encoding, we won’t have to worry about much, since it’s an inherently easier problem to begin with and the design decisions we’ll make to enable fast decoding will also make encoding faster. Required reading for everything that follows will be my earlier posts “A whirlwind introduction to dataflow graphs” as well as my series “Reading bits in far too many ways” (part 2, part 3). I will assume the contents of these posts as known, so if something is unclear and has to do with bit IO, it’s in there.

Huffman (de)coding basics

Huffman coding is a form of variable-length coding that in its usual presentation is given a finite symbol alphabet Σ along with a positive count f_s (the frequency) of how often each symbol s occurs in the source. Huffman’s classic algorithm then assigns variable-length binary codes (meaning strings of 0s and 1s) of length c_s to each symbol that guarantee unique decodability and minimize the number of bits in the coded version of the source; we want to send the symbols from the source in fewer bits, and the way to do so is to assign shorter codes to more frequent symbols. To be precise, it solves the optimization problem

\displaystyle \min \sum_{s \in \Sigma} f_s c_s

(i.e. minimize the number of bits sent for the payload, because each c_s-bit code word occurs f_s times in the source) for integer c_s subject to the constraints

\displaystyle c_s \in \mathbb{Z}, 1 \le c_s \forall s \in \Sigma, \sum_{s \in \Sigma} 2^{-c_s} = 1

The first of these constraints just requires that code lengths be positive integers and is fairly self-explanatory, the second requires equality in the Kraft-McMillan inequality which guarantees that the resulting code is both complete and uniquely decodable; in fact, let’s just call the quantity K = (\sum_{s \in \Sigma} 2^{-c_s}) - 1 the “Kraft defect”. When K<0, the code has redundancy (i.e. there is free code space left for assignment), for K=0 it is complete, and for K>0 it is over-complete (not uniquely decodable or using more code space than is available).

Huffman’s algorithm to perform this construction is a computer science classic, intuitive, a literal textbook example for greedy algorithms and matroids, and it even gives us not just the sequence of code lengths, but an actual code assignment that achieves those code lengths. Nevertheless, actual Huffman codes are of limited use in applications, primarily due to one big problem: the above problem places no upper bound on the code lengths, and indeed an n-symbol alphabet can reach a maximum code length of up to n-1 given the right (adversarial) frequency distribution that produces a fully left- or right-leaning tree.1 This is bad for both encoders and decoders, which generally have to adhere at the very least to limits derived from the choice of target platform (mostly register width) and bit I/O refill strategy (see my long series on the topic mentioned earlier), although it is often desirable to enforce much tighter limits (more on that below). One way to accomplish this is by taking the results of the standard Huffman construction and using heuristics and local rewrite rules on either the tree or the code lengths to make the longest codes shorter at the expense of making some other (more frequent) codes longer than they would otherwise have to be. Another easy heuristic option is to just use the regular Huffman construction and, if codes end up over-long, divide all symbol frequencies by 2 while rounding up (so that non-zero symbol frequencies never drop to zero) and keep trying again until success. This works because it keeps the frequencies of more popular symbols approximately correct and in the right relation to each other, while it flattens out the distribution of the less popular symbols, which all end up eventually having symbol frequencies of 1, resulting in a uniform symbol distribution at the bottom end. A non-heuristic, actually optimal option is to directly solve a constrained version of the above optimization problem that adds a maximum codeword length constraint. The standard solution to the resulting combinatorial optimization problem is called the package-merge algorithm.
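
For illustration, the halve-and-retry heuristic might look roughly like this (a sketch of my own, not anyone’s production code; build_code_lengths stands in for any standard Huffman construction returning per-symbol code lengths):

  #include <algorithm>
  #include <cstdint>
  #include <vector>

  // Hypothetical helper: unconstrained Huffman code length construction.
  std::vector<uint8_t> build_code_lengths(const std::vector<uint32_t>& freq);

  std::vector<uint8_t> length_limited_code_lengths(std::vector<uint32_t> freq, int max_len)
  {
      for (;;) {
          std::vector<uint8_t> lens = build_code_lengths(freq);
          if (*std::max_element(lens.begin(), lens.end()) <= max_len)
              return lens; // all codes within the limit, done
          // Over-long codes: halve all nonzero frequencies, rounding up so
          // they never drop to zero; this flattens the tail of the distribution.
          for (uint32_t& f : freq)
              if (f) f = (f + 1) / 2;
      }
  }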

A related observation is that there is really no particular reason to prefer the code assignment produced by Huffman’s algorithm. Even for cases where no length-limiting is required to fall within usual limits, there are usually many possible code assignments that all result in the exact same bit count for the payload portion; and given an assignment of code lengths to symbols, it is easy to come up with a codeword assignment that matches those code lengths. In short, the per-symbol code lengths make a difference to how efficient a given code is, but we can freely choose which codes we assign subject to the length constraint, and can do so in a way that is convenient for us.

This has the effect of deciding on a single canonical Huffman tree for each valid sequence of code lengths (valid here meaning K=0), even in cases where Huffman’s algorithm generates distinct trees and code assignments that happen to have the same code lengths. The actual coded bits are just part of the puzzle; an actual codec needs to transmit not just the coded bits, but also enough information for the decoder to work out what codes were assigned to which symbols, and specifying a canonical Huffman code requires less information than its non-canonical counterpart would.2 There are several options as to how exactly we assign codes from code lengths, the most common of which is to assign code words sequentially to symbols in lexicographic order for the (c_s, s) pairs—that is, code lengths paired up with symbol values, ordered by ascending code length and then symbol value.3
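
To make the assignment rule concrete, here is a small sketch (mine, not Oodle’s) that hands out MSB-first canonical codewords in (c_s, s) order; starting from 0 and left-shifting whenever the code length increases is exactly what makes the codes of each length form contiguous blocks:

  #include <algorithm>
  #include <cstdint>
  #include <vector>

  // Returns MSB-first canonical codewords; code[s] is meaningful where len[s] != 0.
  std::vector<uint32_t> assign_canonical_codes(const std::vector<uint8_t>& len)
  {
      std::vector<int> syms;
      for (int s = 0; s < (int)len.size(); ++s)
          if (len[s]) syms.push_back(s);
      // Lexicographic order on (code length, symbol value).
      std::sort(syms.begin(), syms.end(), [&](int a, int b) {
          return (len[a] != len[b]) ? (len[a] < len[b]) : (a < b);
      });

      std::vector<uint32_t> code(len.size(), 0);
      uint32_t next = 0;
      uint8_t prev_len = 0;
      for (int s : syms) {
          next <<= (len[s] - prev_len); // append zero bits when moving to longer codes
          code[s] = next++;
          prev_len = len[s];
      }
      return code;
  }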

Using canonical, length-limited codes is a common choice. For example, Deflate (the algorithm used in zlib/ZIP) uses canonical codes limited to a maximum length of 15 bits, and JPEG limits the maximum code length to 16 bits.

In principle, such a data stream can be decoded by reading a bit at a time and walking the Huffman tree accordingly, starting from the root. Whenever a leaf is reached, the corresponding symbol is emitted and the tree walk resets back to the root. However, processing data one bit at a time in this fashion is very slow, and implementations generally avoid it. Instead, decoders try to decode many bits at once. At the extreme, when all code lengths are at most some upper bound N, a decoder can just “peek ahead” N bits into the bitstream, and prepare a table that contains the results of the tree walks for all 2^N possible bit patterns; table entries store which symbol was eventually reached and how many bits were actually consumed along the way. The decoder emits the symbol, advances its read cursor by the given number of bits, and reads the next symbol.

This method can be much faster and is easy to implement; however, when N is large, the resulting tables will be too, and there are steep penalties whenever table sizes start exceeding the sizes of successive cache levels, since the table lookups themselves are essentially random.4 Furthermore, most sources being compressed aren’t static, so Huffman tables aren’t used forever. A typical Huffman table will last somewhere between 5 and 100 kilobytes of data. For larger N, it’s easy for table initialization to take as long as, or longer than, actual bitstream decoding with that table, certainly for shorter-lived Huffman tables. The compromise most commonly used in practical implementations is to use a multi-level table scheme, where individual tables cover something like 9 to 12 bits worth of input. All the shorter symbols are decoded in a single table access, but the entries for longer symbols don’t contain a symbol value; instead, they reference a secondary table that resolves the remaining bits (there can be many such secondary tables). The point of Huffman coding is to assign shorter codes to more common symbols, so having to take a secondary table walk is relatively uncommon. And of course there are many possible variants of this technique: table walks can go deeper than 2 levels, and in the other direction, if we have many shorter codes (say at most 4 bits long), then a single access to a 512-entry, 9-bit table can resolve two symbols (or even more) at once. This can accelerate decoding of very frequent symbols, at the expense of a more complicated decoder inner loop and a slower table build step.5

Huffman coding in Oodle

Our older codecs, namely Oodle LZH and LZHLW, used a 15-bit code length limit and multi-level tables, both common choices. The problem with this design is that every symbol decode needs to check whether a second-level table lookup is necessary; a related problem is that efficient bit IO implementations maintain a small buffer that is periodically refilled to contain somewhere between 23 and 57 bits of data (details depend on the choice of bit IO implementation and whether we’re on a 32-bit or 64-bit platform, see the bit-IO series I referenced earlier). When we know that codes are guaranteed to be short enough, we can decode multiple symbols per refill. This is significant because refilling the bit buffer is about as expensive as decoding a symbol itself; being able to always decode two symbols per refill, as opposed to one, has the potential to reduce refill overhead from around 50% of decode time to about 33% on 32-bit targets!

When Charles first prototyped what later turned into Kraken, he was using our standard bit IO implementation, which uses byte-wise refill; post-refill, the 32-bit version guarantees at least 25 usable bits, while the 64-bit version guarantees 57 bits, all typical limits.

Therefore, if codes are limited to at most 12 bits, the 32-bit decoder can sustain a guaranteed two symbols per refill; for 64-bit, we can manage 4 decodes per refill (using 48 out of 57 bits). As a side effect, the tables with a 12-bit limit are small enough that even without a multi-level table scheme, everything fits in the L1D cache. And if codes are limited to 11 bits, the 32-bit decoders still manage 2 symbols per refill, but the 64-bit decoders now go up to 5 decodes per refill, which is a nice win. Therefore, the newer Oodle codecs use Huffman codes length-limited to only 11 bits, which allows branch-free decoding of 2 symbols per refill on 32-bit platforms, and 5 symbols per refill on 64-bit targets.6

The other question is, of course, how much limiting the code lengths aggressively in this way costs in terms of compression ratio, to which the answer is: not much, although I don’t have good numbers for the 11-bit limit we eventually ended up with. For a 12-bit limit, the numbers in our tests were below a 0.1% hit vs. the original 15-bit limit (albeit with a note that this was when using proper package-merge, and that the hit was appreciably worse with some of the hackier heuristics). We ended up getting a fairly significant speed-up from this choice, so that slight hit is well worth it.

With these simplifications applied, our core Huffman decoder uses pre-baked table entries like this:

struct HuffTableEntry {
    uint8_t len; // length of code in bits
    uint8_t sym; // symbol value
};

and the decoder loop looks something like this (32-bit version):

while (!done) {
    bitbuf.refill();
    // can decode up to 25 bits without refills!

    // Decode first symbol
    {
        intptr_t index = bitbuf.peek(11);
        HuffTableEntry e = table[index];
        bitbuf.consume(e.len);
        output.append(e.sym);
    }

    // Decode second symbol
    {
        intptr_t index = bitbuf.peek(11);
        HuffTableEntry e = table[index];
        bitbuf.consume(e.len);
        output.append(e.sym);
    }
}

Note that with a typical bit buffer implementation, the peek operation uses a single bit-wise AND or shift instruction; the table access is a 16-bit load (plus the load address calculation, which depending on the target CPU might or might not be an addressing mode). The bit buffer consume operation usually takes an integer shift and an add, and finally the “append” step boils down to an integer store, possibly an integer shift to get the symbol value in the right place prior to the store (the details of this very much depend on the target ISA, I’ll get into such details in later posts in this series), and some pointer updating that is easily amortized if the loop is unrolled more than just twice.

That means that in terms of the work we’re doing, we’re already starting out with something around 5-7 machine instructions per symbol, and if we want good performance, it matters a lot what exactly those instructions are and how they relate. But before we get there, there’s a few other details about the resulting dataflow to understand first.

The first and most important thing to note here is what the critical path for the entire decoder operation is: everything is serialized through the bit buffer access. The peek operation depends on the bit buffer state after the previous consume, the table access depends on the result of the peek operation, and the consume depends on the result of the table access (namely the len field), which completes our cycle. This is the critical dependency chain for the entire decoder, and the actual symbol decoding or output isn’t even part of it. Nothing in this loop depends on or cares about e.sym at all; its load and eventual store to the output buffer need to happen at some point, but there’s no rush. The code length determination, on the other hand, is our primary bottleneck. If the critical computation for peek and consume works out to a single 1-cycle latency integer instruction each (which it will turn out it does on most of the targets we’ll see later in this series), then everything hinges on how fast we can do that table access.

On current out-of-order CPUs (no matter the ISA), the answer usually turns out to be 4 or 5 clock cycles, making the critical path from one symbol decoded to the next either 6 or 7 clock cycles. This is very bad news, because as just pointed out, we’re executing around 5-7 instructions per symbol decoded, total. This means that, purely from the dependency structure, we’re limited to around 1 instruction per cycle average on CPUs that can easily sustain more than 3 instructions per cycle given the right code. Viewed another way, in those 7 cycles per symbol decoded, we could easily run 21 independent instructions (or even more) if we had them, but all we do is something like 6 or 7.7

The obvious solution here is to not have a single bit buffer (or indeed bitstream), but several at once. Once again we were not the first or only ones to realize this, and e.g. Yann Collet’s Huff0 uses the same idea. If the symbols we want to decode are evenly distributed between multiple bitstreams, we can have multiple one-symbol-at-a-time dependency chains in flight at once, and hopefully get to a state where we’re mostly (or even completely) limited by instruction throughput (i.e. total instruction count) and not latency.

Our high-level plan of attack will be exactly that. Huff0 uses 4 parallel bitstreams. For reasons that will become clearer in subsequent parts of this series, Huffman coding in the newer Oodle codecs has two different modes, a 3-bitstream mode and a 6-bitstream mode. Using more bitstreams enables more parallelism but the flipside is that it also requires more registers to store the per-bitstream state and is more sensitive to other micro-architectural issues; the actual counts were chosen to be a good compromise between the desires of several different targets.8
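
In the same pseudo-C++ style as the loop above, the idea looks something like this (illustrative only, not the actual Oodle decoder; bitbuf0..2 and out0..2 are per-stream copies of the state already shown):

  while (!done) {
      bitbuf0.refill();
      bitbuf1.refill();
      bitbuf2.refill();

      // The three decodes below don't depend on each other, so their
      // table-load latencies can overlap instead of serializing.
      { HuffTableEntry e = table[bitbuf0.peek(11)]; bitbuf0.consume(e.len); out0.append(e.sym); }
      { HuffTableEntry e = table[bitbuf1.peek(11)]; bitbuf1.consume(e.len); out1.append(e.sym); }
      { HuffTableEntry e = table[bitbuf2.peek(11)]; bitbuf2.consume(e.len); out2.append(e.sym); }

      // ... second decode per stream (and more on 64-bit targets) goes here.
  }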

For the rest of this post, I’ll go over other considerations; in the next part of this series, I plan to start looking at some actual decoder inner loops.

Aside: tANS vs. Huffman

The ANS family of algorithms is the new kid on the block among entropy coders. Originally we meant to use tANS in the original Kraken release back in 2016, but the shipping version of Kraken doesn’t contain it; it was later added as a new entropy coding variant in 2018 along with Leviathan, but is rarely used.

However, one myth that was repeatedly claimed around the appearance of tANS was that it is “faster than Huffman”. While it is true that some tANS decoders are faster than some Huffman decoders, this is mostly an artifact of other implementation choices and the resulting dependency structure of the decoder, so I want to spend a few paragraphs explaining when and why this effect appears, and why despite it, it’s generally not a good idea to replace Huffman decoders with tANS decoders.

I’m not going to explain any of the theory or construction of tANS here; all I’ll note is that the individual per-symbol decode step uses tables similar to a Huffman decode table, and paraphrasing old code from Charles here, the key decode step ends up being:

HuffTableEntry e = table[state];
state = next_state[state] | bitbuffer.get(e.len); 
output.append(e.sym);

In essence, this is a “lookahead Huffman” decoder, where each stream keeps N bits of lookahead at all times. Instead of looking at the next few bits in the bit buffer to figure out the next code length, we already have enough bits stored in our state to figure out what the next code is, and after decoding a symbol we shift out a number of bits corresponding to the length of the code and replenish the bottom of our state from the bitstream.9

The bits come from a different place, and we now have the extra state, a de-facto shift register, in the mix, so why would this run faster, at least sometimes? Again we need to look at the data dependencies.

In the regular Huffman decode, we had a single critical dependency chain with the peek/table lookup/consume loop, as well as the bit-buffer refill every few symbols. In this version of the loop, we get not one but two separate dependency chains:

  1. The state update has a table lookup (two in fact, but they’re independent and can overlap), the bit buffer get operation to get the next set of low bits for the next state (usually something like an integer add and a shift), and an OR to combine them. This is not any better than the regular Huffman decoder (in fact it’s somewhat worse in multiple ways), but note that the high-latency memory loads happen before any interactions with the bit buffer.
  2. The bit buffer update has the periodic refill, and then the get operations after the state table lookups (which are a combined peek and consume).

The crucial difference here is that in the regular Huffman decoder, most of the “refill” operation (which needs to know how many bits were consumed, load extra bytes and shift them into place, then do some accounting) is on the same dependency chain as the symbol decoding, and decoding the next symbol after a refill can make no meaningful progress until that refill is complete. In the tANS-style decoder, the refill can overlap with the state-dependent table lookups; and then once those table lookups complete, the updates to the bit buffer state involve only ALU operations. In throughput terms the amount of work per symbol in the tANS-style decoder is not better (in fact it’s appreciably worse), but the critical-path latency is improved, and when not using enough streams to keep the machine busy, that latency is all that matters.

A related technique that muddles this further is that ANS-based coders can use interleaving, where (in essence) we just keep multiple state values around, all of which use the same underlying bitstream. This gives us twice the parallelism: two tANS decoders with their own state are independent, and even making them share the same underlying bitstream keeps much of the advantage, because the high-latency portion (the table lookup) is fully overlapped, and individual reads from the bit buffer (once all the lengths are known) can usually be staggered just 1 cycle apart from each other.
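
In the same paraphrased style as the snippet above, one 2× interleaved step looks roughly like this (illustrative only):

  // Two states, one shared bitstream. The two table loads are independent;
  // only the two bit-buffer reads at the end are serialized against each other.
  HuffTableEntry e0 = table[state0];
  HuffTableEntry e1 = table[state1];
  state0 = next_state[state0] | bitbuffer.get(e0.len);
  state1 = next_state[state1] | bitbuffer.get(e1.len);
  output.append(e0.sym);
  output.append(e1.sym);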

A 2× interleaved tANS decoder is substantially faster than a single-stream Huffman decoder, and has lower incremental cost (a single extra integer state is easier to add and track than the registers required for a second bitstream state). However it also does more work per symbol than a 2-stream Huffman decoder does, so asymptotically it is slower—if you ever get close to the asymptote, that is.

A major disadvantage of tANS-style decoding is that interleaving multiple states into the same bitstream means that there is essentially only a single viable decoder implementation; bits need to be read in the same way and at the same time by all decoders to produce the same results, fast implementations need larger and more expensive-to-set-up tables than a straight Huffman decoder, plus there are other overheads in the bitstream. In contrast, as we’ll see later, the 6-stream Huffman coding variant is decoded on some platforms as two 3-stream blocks, which would not easily map to a tANS-style implementation. In the end, the lower asymptotic performance and decreased flexibility is what made us stick with a regular Huffman decoder in Oodle Data, because as it turns out we can get decent wins from decoding the bitstreams somewhat differently on different machines.

Huffman table initialization

As mentioned several times throughout this post, it’s easy for Huffman table initialization to become a major time sink, especially when tables are large, used for just a short time, or the table builder is unoptimized. Luckily, optimizing the table builder isn’t too hard, and it also gives a nice and intuitive (to me anyway) view of how canonical Huffman codes work in practice.

For now, suppose that decoding uses a MSB-first bit packing convention, i.e. a bit-at-a-time Huffman decoder would read the MSB of a codeword first and keep reading additional less significant bits until the symbol is determined. Although our 11-bit code length limit is quite short, that’s still a large number of bits to work through in an example, so let’s instead consider a toy scenario with a code length limit of 4 bits and four symbols a, b, c, d with code lengths 1, 3, 2, and 3 bits, respectively.

The code length limit of 4 bits means we have a total space of 16 possible 4-bit patterns to cover; I’ll refer to each such bit pattern as a “slot”. Our 1-bit symbol “a” is the first and shortest, which means it gets assigned first, and as such gets a 1-bit code of “0”, with the next 3 bits irrelevant. In short, all slots with a leading 0 bit correspond to the symbol ‘a’; there are 2^(4-1) = 8 of them. The next-shortest symbol is ‘c’, which gets the next available 2-bit code. All codes starting with 0 are taken; the next available slot starts with a 2-bit pattern of “10”, so that’s what we use. There are 2^(4-2) = 4 slots corresponding to a bit pattern starting “10”, so that’s what we assign to c. The two remaining symbols ‘b’ and ‘d’ get 3-bit codes each. Anything starting with “0” or “10” is taken, so they need to start with “11”, which means our choice is forced: ‘b’, being earlier in the alphabet (since we break ties in code length by which symbol is lower-valued), gets the codes starting “110”, ‘d’ gets “111”, and both get 2^(4-3) = 2 table slots each. If we write up the results, we get:

Code Sym Len
0000 a 1
0001 a 1
0010 a 1
0011 a 1
0100 a 1
0101 a 1
0110 a 1
0111 a 1
1000 c 2
1001 c 2
1010 c 2
1011 c 2
1100 b 3
1101 b 3
1110 d 3
1111 d 3

Note that we simply assigned all of the code space in linear order to symbols sorted by their code length and breaking ties by symbol value, where each symbol gets 2^(N-k) slots, where N is the overall code length limit and k is the code length for the symbol in question. At any step, the next code we assign simply maps to the first few bits of the next available table entry. It’s trivial to discover malformed codes in this form: we need to assign exactly 2^N slots total. If any symbol has a code length larger than N or we end up assigning more or less than that number of slots total, the code is, respectively, over- or under-complete; the current running total of slots assigned is a scaled version of the number in the Kraft inequality.

Moreover, it should also be obvious how the canonical table initialization for a MSB-first code works out in practice: the table is simply written top to bottom in linear order, with each entry repeated 2^(N-k) times; in essence, a slightly more complicated memset. Eventually, each entry is only duplicated a small number of times, like 4, 2 or just 1, so as code lengths get closer to the limit it’s better to use specialized loops and not treat it as a big memset anymore, but the basic idea remains the same.
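
In code, that “slightly more complicated memset” looks roughly like this (a sketch, not the Oodle implementation; it assumes the symbol/length lists are already sorted by (code length, symbol value), and reuses the HuffTableEntry struct from above):

  // table must have room for (1 << N) entries, where N is the code length limit.
  void build_msb_first_table(HuffTableEntry* table, int N,
                             const uint8_t* sym_sorted, const uint8_t* len_sorted,
                             int num_syms)
  {
      HuffTableEntry* dest = table;
      for (int i = 0; i < num_syms; ++i) {
          HuffTableEntry e = { len_sorted[i], sym_sorted[i] };
          int run = 1 << (N - e.len);    // 2^(N-k) slots for a k-bit code
          for (int j = 0; j < run; ++j)  // in practice, a memset-style fill
              *dest++ = e;
      }
      // For a complete code (Kraft defect 0), dest == table + (1 << N) here.
  }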

This approach requires a list of symbols sorted by code length; for Oodle, the code that decodes Huffman table descriptions already produces the symbol list in this form. Other than that, it is straightforward.

The only wrinkle is that, for reasons to do with the details of the decoder implementation (which will be covered in future posts), the fast Huffman decoders in Oodle do not actually work MSB-first; they work LSB-first. In effect, we produce the table as above, but then reverse all the bit strings. While it is possible to directly build a canonical Huffman decode table in a LSB-first fashion, all the nice (and SIMD-friendly) sequential memory writes we had before now turn into scatters, which is a lot messier. It turns out to be easier and faster to first build a MSB-first table as described above and then apply a separate bit-reverse permutation to the elements to obtain the corresponding LSB-first decoding table. This is extra work, but bit-reverse permutations have a very regular structure and also admit a fast SIMD implementation that ends up being essentially a variant of a matrix transpose operation. In our experiments, the SIMD-friendly (and memset-able) MSB-first table initialization followed by a MSB→LSB bit-reverse was much faster than a direct LSB-first canonical table initialization, and LSB-first decoding saved enough work in the core decode loops to be worth the extra step during table building for the vast majority of our Huffman tables, so that’s what we settled on.
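
The scalar version of that final permutation is tiny (again just a sketch; the shipping version is the SIMD, transpose-style variant described above):

  // The LSB-first table is the MSB-first table with its indices bit-reversed,
  // because an LSB-first peek sees the code bits in the opposite order.
  static uint32_t bit_reverse(uint32_t x, int N)
  {
      uint32_t r = 0;
      for (int i = 0; i < N; ++i)
          r |= ((x >> i) & 1u) << (N - 1 - i);
      return r;
  }

  void msb_to_lsb_table(const HuffTableEntry* msb_table, HuffTableEntry* lsb_table, int N)
  {
      for (uint32_t i = 0; i < (1u << N); ++i)
          lsb_table[bit_reverse(i, N)] = msb_table[i];
  }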

And that’s it for the high-level overview of Huffman decoding in Oodle Data. For the next few upcoming posts, I plan to get into detail about a bunch of the decoder loops we use and the CPUs they are meant for, which should keep us busy for a while!

Footnotes

[1] An easy way to construct a pathological frequency distribution is to pick the symbol frequencies as Fibonacci numbers. They grow exponentially with n, but slowly enough that the total symbol counts required to exceed common practical code length limits can actually occur for typical block sizes. The existence of such cases for at least certain input blocks means that implementations need to handle them.

[2] There are strictly more Huffman trees than there are canonical Huffman trees. For example, if symbol ‘a’ has a 1-bit code and symbols ‘b’ through ‘e’ all get 3-bit codes, then our corresponding canonical tree for that length distribution might be (a ((b c) (d e))), but it is equally possible (with the right symbol frequencies) to get say the tree (a ((b d) (c e))) or one of many other possible trees that all result in the same code lengths. The encoded size depends only on the chosen code lengths, and spending extra bits on distinguishing between these equivalent trees is wasteful unless the tree shape carries other useful information, which it can in some applications, but not in usual Huffman coding. Along the same lines, for any given Huffman tree there are in general many possible sets of symbol frequencies that produce it, so sending an encoding of the original symbol frequencies rather than the tree shape they result in is even more wasteful.

[3] When assigning codewords this way and using a MSB-first bit packing convention, the blocks of codes assigned to a given code length form intervals. When codes are limited to N bits, that means the length of a variable-length code word in the bitstream can be determined by figuring out which interval a code falls into, which takes at most N-1 comparisons against constants of width ≤N-1 bits. This allows for different, and often more efficient, implementation options for the latency-critical length determination step.

[4] This is inherent: the entire point of entropy coding is to make each bit in the output bitstream carry an actual information content close to 1 bit. In the process of Huffman coding, we are purposefully making the output bitstream “more random”, so we can’t rely on locality of reference to give us better cache hit rates here.

[5] Decoding multiple symbols at once may sound like an easy win, but it does add extra work and complications to decoding symbols that are long enough not to participate in pair-at-a-time decoding, and once again the table build time needs to be carefully watched; generally speaking multi-symbol decoding tends to help on extremely redundant input but slows down decoding of data that achieves maybe 7 bits per byte from Huffman coding, and the latter tends to be more common than the former. The actual evaluation ends up far from straightforward and ends up highly dependent on other implementation details. Multi-symbol decoding tends to look good when plugged into otherwise inefficient decoders, but as these inefficiencies are eliminated the trade-offs get more complicated.

[6] No doubt we were neither the first nor the last to arrive at that particular design point; with an 8-bit alphabet, the set of viable options for at least two decodes per refill if 32b targets matter essentially boils down to 9b to 12b, with 11b being the largest where 5 decodes/refill on 64b targets are still possible.

[7] This gap is one of the things that makes multi-symbol decoding look great when plugged into a simple toy decoder. Decoding multiple symbols at once requires extra bookkeeping and instructions, and also introduces extra dependencies between the stores since the number of symbols decoded in each step is no longer known at compile time and folded into an addressing mode, but when we’re only using about a third of the capacity of the machine, inserting extra instructions into the symbol decode is very cheap.

[8] Oodle has the same bitstream format for everyone; while we can (and do) make choices to avoid pathologies on all of the targets we care about, any choice we make that ends up in the bitstream needs to work reasonably well for all targets.

[9] next_state here essentially just tabulates a shift and mask operation; in a “real” tANS decoder this can be more complicated, but when we’re looking at a case corresponding to a Huffman table and symbol distribution, that’s all it does.

Entropy decoding in Oodle Data: Huffman decoding on the Jaguar


In the last part we went over the general ideas of Huffman coding as implemented in the newer Oodle Data coders; this time we’ll be looking at one particular implementation that is both interesting and “historically relevant”: Oodle was designed with games in mind, an important class of hardware to consider for game middleware is game consoles, and versions of the AMD Jaguar CPU were in both the PS4 and Xbox One (mostly unmodified except for a bump in clock rate in the “mid-lifecycle upgrade” models of both). We wanted Kraken to perform well on those machines, so we spent some time optimizing Oodle for it. Before I go into the details, let’s do a bit of background on the machine itself, but be advised that this will be in-depth and that you may need to re-read the previous part first. Furthermore, this post contains plenty of x86 assembly; if you’re uncomfortable or unfamiliar with that, you probably won’t get much out of it, sorry.

Meet the AMD Jaguar

The Jaguar, or less prosaically “Family 16h”, is a small, low-power, out-of-order 64-bit x86 CPU core designed for small systems and embedded applications such as, well, game consoles. In the game console variants, the Jaguar CPUs appear on the main SoC along with the GPU (and most other components). It’s designed for multi-core operation and cores usually appear in clusters of four that share a common L2 cache, typically 512KiB of L2 per core. In the Xbox/PS4 versions, there are two such clusters and thus two L2 cache slices. This post is only concerned with tasks that run on a single thread, so I won’t be spending time on this part of the architecture.

Each core has 32KiB of L1 instruction and 32KiB of L1 data cache. The frontend decode/dispatch/retire logic is 2 instructions wide, and the relevant unit for most of it is what is variously (depending on the source) called “macro-ops” or “cops” (complex ops). I’ll stick with macro-ops. Macro-ops are typically a data-processing instruction along with a memory reference, so something like the x86 instruction add rax, [rsi] would be a single macro-op.1 Macro-ops get broken into either one or two micro-ops (I’ll write uops in the following) for execution, but instruction decoding, tracking and retirement all work on macro-ops. The backend has six execution units, each of which can accept one micro-op per cycle: two integer (which I’ll refer to as I0 and I1), one load (L), one store (S), and two SIMD/floating point (which I’ll refer to as F0 and F1). The pipelines are very symmetric: almost all integer instructions can execute in either I0 or I1 (the biggest exceptions being multiplies and divides, which are I1 only), and most FP/SIMD instructions can execute in either F0 or F1 (FP addition and SIMD integer multiplication are only supported in F0, and FP multiplies and store/convert are F1 only). Consequently, most pairings of two independent instructions can execute in the same cycle, if they’re not both contending for the same resource. Of these backend limitations, in my experience the one you’re most likely to hit is the one load per cycle limit.

That said, whenever I’ve looked, the two instructions per cycle decode/dispatch limit is usually the more relevant one. On the Jaguars, using the load-operate and even read-modify-write instructions where possible is a good idea (because it gives you two uops per macro-op), and generally preferable to splitting loads out.

Speaking of loads, the L1 data cache is 8-way associative, write-back, and internally splits 64-byte cache lines into 16-byte sectors. Unaligned loads and stores that stay within a single sector are free, crossing a sector boundary occupies the load/store pipes for an extra cycle (potentially more if it also crosses a page boundary etc.). The load-to-use latency for the L1 data cache is 3 cycles to the integer pipes, 5 cycles to the FP/SIMD pipes, both of which are quite low numbers compared to most of its contemporaries.2 The theme of low latencies continues for other parts of the backend: FP32 multiplies and SIMD integer multiplies complete in 2 clock cycles, FP32 and FP64 adds in 3, and most SIMD ALU operations take a single cycle. L2 misses take relatively long though, at a minimum load-to-use latency of 25 cycles.

Lots of console developers found these cores underwhelming, mostly due to the narrow design and fairly low clock rates (around 1.6 and 1.7GHz in the original PS4 and Xbox One, respectively). On the other hand, these cores are quite small, power-efficient, and the PS4/Xb1 console generation came with 8 of them, at a time when more than 4 cores was a rarity in the consumer space. Personally, I quite like them: they’re not the fastest but what they are is extremely even-tempered and predictable. They have a relatively low ceiling on the instructions per cycle and peak performance they can achieve, but getting there is generally a fairly straightforward process, and there’s not much in the way of gotchas or nasty surprises. They’re a pleasant core to optimize for3, and AMD helped by providing good documentation for it.

The Plan

Because of the aforementioned decode/dispatch/retire limits and low instruction latencies, optimizing code with reasonably nice memory access patterns for the Jaguars is, more often than not, an exercise in minimizing the number of instructions executed for a given task. (As I said, they’re fairly straightforward to optimize for!) Therefore, if we want a fast Huffman decoder on these machines, it’s a good idea to see if we can do it with as few instructions as possible.

While reviewing the above-quoted docs, one thing I noticed was that BEXTR, an instruction from BMI1, turns into one uop, is supported on both integer pipes, and has 1-cycle latency. BEXTR is an odd duck: it extracts a given number of bits from a given starting point in the first source operand, and as such is essentially a counterpart to PowerPC’s rlwinm or ARM’s UBFM, but while these latter two instructions have the bitfield position and width given as an immediate operand, BEXTR takes a register operand for the bitfield specification.4 Code that wants to do repeated bitfield extraction with the same operands can burn a register on a constant (itself a fairly steep cost on the relatively register-starved x86) and then use BEXTR, which replaces a move, a shift, and a bitwise AND instruction.5 The second source register operand to BEXTR itself contains bit-packed values: the lower 8 bits give the index of the LSB of the bitfield to extract, the next 8 bits give the width in bits.
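
For reference, ignoring out-of-range edge cases, BEXTR computes the following (my C paraphrase, not an official definition):

  // BEXTR(src, ctrl): start index in ctrl[7:0], field width in ctrl[15:8].
  uint64_t bextr64(uint64_t src, uint64_t ctrl)
  {
      unsigned start = ctrl & 0xff;
      unsigned len   = (ctrl >> 8) & 0xff;
      return (src >> start) & ((1ull << len) - 1);
  }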

This is usable for the bitstream decoding part of our Huffman decoder. Using a “bit extraction” style decoder (variant 3 in this post) means we repeatedly do operations of the form (bit_buf >> bit_pos) & ((1 << 11) - 1) to peek at our next 11 bits, and that is just BEXTR(bit_buf, bit_pos + (11 << 8)). It doesn’t cause any problems to have a constant bias added to our bit position as long as it shows up only in the high bytes, so we can just declare our bit positions to have that offset added at all times while in registers, and that lets us do our bit buffer peek in a single 1-uop instruction on Jaguar cores. Because of another x86 quirk, namely that byte-sized instructions exist and preserve the remaining bits of the register, we can do updates of bit_pos using byte-sized additions or subtractions that leave the high bits alone, if we want to.6

Finally, we don’t want to do a store for every byte we decode, because that’s an extra instruction and we’re easily limited by instructions (or rather, macro-ops) executed. Fortunately we can use the SSE4.1 instruction PINSRB (packed insert byte), which inserts a byte value from an integer register or memory into a given lane of a vector register. Vector registers hold 128 bits (16 bytes), which means we can amortize the number of stores and do one every 16 or so bytes instead of one for every codeword. Also, because Jaguar cores treat memory references inside an instruction as separate uops but not separate macro-ops, and macro-ops are one of our main limiters, we want to use memory references liberally if doing so lets us reduce the number of macro-ops we need.

Putting this all together, note that the pseudocode for the per-symbol processing in a LSB-first Huffman decoder, as outlined in the previous part, looks something like this:

  // peek
  uint32_t bits = (bit_buf >> bit_pos) & 2047;
  // consume bits
  bit_pos += table[bits].len;  
  // decode symbol
  emit(table[bits].sym);

and using the various techniques outlined above, we can turn this into a mere 3 x86 instructions:

  ; peek. rBitPos[15:8] = 11
  bextr   rBits, rBitBuf, rBitPos
  ; advance bit offset (update low byte only)
  add     rBitPosb, [rTableBase + rBits*2]
  ; put table[bits].sym at position N into xmm0
  vpinsrb xmm0, xmm0, [rTableBase + rBits*2 + 1], N

On the Jaguar, this decomposes into 3 macro-ops and 5 uops: 2 loads, 2 integer ops, 1 SIMD. The bit extract to grab rBits from rBitBuf takes a single cycle; the bit position update takes 3 cycles to load the value from the table and an extra cycle to complete the addition. We don’t actually care about the top bytes being preserved here, since we don’t expect overflows, but we do care about our load operand being byte-sized. Either way, that’s 5 cycles critical path latency from one decoded symbol to the next. Finally, the vector byte inserts to collect the output bytes are not on the critical path. They need to be fast enough to keep up with the decoded bytes once the table loads finish (and they are, since they can complete at a rate of 1 per cycle), but that’s about it. With the Jaguar frontend supporting at most 2 macro-ops per cycle, this code takes at least 1.5 cycles per symbol decoded in the frontend, and 2 cycles per symbol decoded in the load pipeline. Meaning that as given, this code is limited more by the load pipeline than the frontend. However, this is not the only work that needs to happen in this loop, and the Jaguar is out-of-order, so we can build up a backlog of load pipeline work; if we later need to do more integer work in the loop that does not take many loads (spoiler: we will), the load pipeline will get to catch up.

Finally, as mentioned above, our critical path between back-to-back loads from the same stream is 5 cycles on the Jaguar. If we use 3 streams and interleave their processing during decode, then the frontend will get around to the first instruction for the second byte of stream 0 about 4.5 cycles in (although the load pipeline will take about 6 cycles to work through its backlog before then). In other words, the timing here can roughly work out, but it’s not perfectly matched; we will build up a bit of backlog in the load pipeline and the reorder buffer before this is done, but as long as we choose our instructions carefully and don’t go too lopsided, we can make this work while keeping the core nice and busy the whole time through.

I was pretty excited when I first realized this 3-instruction sequence was a viable candidate for the core of our Huffman decoder on Jaguar, but to get a real decoder we also need to deal with bit buffer refills, pointer advancing, and end-of-buffer checks.

The actual decoder

As mentioned above (and in the previous part), we use 3 separate bitstreams for parallelism. Of these, two bitstreams are regular “forward” bitstreams in increasing address order, and one is written backwards. The numbering of these is a bit odd: in the physical Oodle format, the layout is strm0-> | strm2-> | <-strm1, i.e. stream 0 is forward and comes first (as you would expect), stream 1 is backward and comes last, and stream 2 is also forward and somewhat awkwardly sandwiched in the middle, for “historical reasons”. Namely, Kraken uses forward-backward stream pairs in many places.7 The Huffman decoder used to be the same way; when we noticed (while working on this Jaguar decoder, in fact) that three streams would be advantageous, we had to put the third stream somewhere. Putting stream 2 in the middle turned out to be slightly easier.8 The advantage of the odd-looking backward stream is that it saves us a bit of signaling in the container format (not a trivial concern for a compression format) and also gives us a nice way to do end-of-buffer checks. Namely, the three are contiguous, and all three read pointers (called in0, in1 and in2 in the following) are in that single contiguous region. Furthermore, in0 keeps increasing, in1 keeps decreasing, and at any point in a well-formed stream, we have buffer_begin ≤ in0 ≤ in2 ≤ in1 ≤ buffer_end. During decoding, we do the two interior checks of the read pointers against each other; the end-of-buffer checks on either end are implied by transitivity, and we don’t need to actually do them, or keep those extra pointers around. The only pointers we need to check are the ones we already keep around anyway. Neat!

Now, loads aren’t zero-sized; we use the (common in C) convention that “end” pointers point one past the last element of arrays. So we don’t want to start loading from in1, and with 64-bit (8-byte) loads, the largest address we can ever safely load from is at buffer_end - 8, assuming the buffer is at least 8 bytes to begin with (which we check beforehand). Decrementing in1 by 8 before the loop takes care of both issues: now in1 points to the last address we can do a valid 64-bit load from, and as a side effect in0 ≤ in2 ≤ in1 ends up guaranteeing that in0 and in2 are also good to safely load 8 bytes from without overrunning the buffer. Finally, the optimized decoder loop described here decodes 5 bytes each from 3 streams and writes the results using a 16-byte SIMD store, so it can only safely run until 16 bytes before the intended end of the buffer. All the remaining special cases (less than 8 bytes left in some of the streams, very short input streams, or close to the end of the output buffer) are left to a dedicated safe loop that generally handles the last few bytes, needs to do more careful checking, and is certainly not using hand-tweaked assembly. There would be no point for speed since it only ever handles very few bytes, and besides that’s the exact loop where you very much want a higher-level language for better debugging facilities and good integrations with sanitizers, fuzzers etc.

With a plan for all those details, all we need to take care of now is the refill logic and sort out the remaining plumbing. Looking at the decoder sketch above, we see that we need at least 2 registers worth of state per bitstream: one register to contain bit_buf (rBitBuf in the pseudo-ASM), and one for bit_pos. Once we consider refilling, we also need the corresponding read pointer (the inN I was just talking about). For 3 streams, 3 registers of state per stream works out to 9 registers, a bit more than half of our general-purpose register name pool, which is workable.

As for refill, that is luckily straightforward in a “bit extract” style scheme. At the top of every iteration, we want to load the next 8 bytes from the current input pointer:

  mov  rBitBuf0, [in0]

For the reverse byte order in1 stream, we use a big-endian load (MOVBE) instead, which is the same cost as the regular load on the Jaguars.9

Then we decode 5 values from each of the 3 streams. With our 11-bit code length limit, that means we end up consuming at most 55 bits from each stream. Most relevant bit reading techniques support at most either 56 or 57 bits in a row without a refill when using 64-bit registers, so this fits well.10 Decoding 3×5 = 15 symbols also works out very nicely with our 128-bit vector registers, so we do a single unaligned vector store every 15 bytes.11

Finally, after each stream has decoded 5 symbols, we need to check how many bytes to advance the read pointer by, and what the new start position within the byte is. The number of bytes we need to advance the read pointer by is (bit_pos >> 3) & 7, which on the Jaguar, we can compute using a single BEXTR if we can afford a register just to store the constant 0x303, which we can.12 We then either add or subtract this from the corresponding in pointer. Finally, we need to clear the bits corresponding to the byte position (that we just took care of) in bit_pos, which is an AND with ~0x38. This keeps the high bits, which contain the field width of 11 that BEXTR needs, intact. The actual code below does this masking at the start of the next iteration instead of at the end of the current iteration, but conceptually this belongs with the pointer advance.
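
In C, the per-stream advance step boils down to something like this (a sketch with made-up names, mirroring what the assembly below does for one of the forward streams):

  #include <stdint.h>

  // bit_pos doubles as the BEXTR control word: field width 11 (0xb00) in
  // bits [15:8], current bit offset in the low bits.
  static void advance_stream(const uint8_t **in, uint32_t *bit_pos)
  {
      uint32_t bytes_step = (*bit_pos >> 3) & 7; // whole bytes consumed; one BEXTR by 0x303 in the asm
      *in += bytes_step;                         // the backward stream subtracts instead
      *bit_pos &= ~0x38u;                        // clear the whole-byte count, keep the bit-within-byte
                                                 // offset and the 0xb00 width field
  }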

And that’s pretty much it. Here’s the full decoder loop, written in NASM. We originally tried to write this in C++ with intrinsics, but that got nasty, so we eventually switched to a real assembler. The original version has the comments laid out differently but I need to fit this into an annoyingly narrow blog CMS theme, so this will look a bit clunky:

        ; main decode loop
        ; rax = scratch
        ; rbx = bitextr0
        ; rcx = bitextr1
        ; rdx = bitextr2
        ; rbp = bextr const
        ; rsi = table ptr
        ; rdi = -bytes_left_to_decode
        ; r8  = in0
        ; r9  = in1
        ; r10 = in2
        ; r11 = bits0 (only live in inner loop)
        ; r12 = bits1 (only live in inner loop)
        ; r13 = bits2 (only live in inner loop)
        ; r14 = decodeend
        ; r15 = (unused)

        sub             r9, 8 ; in1 -= 8
        mov             ebx, 0xb00 ; 11 field width
        mov             ecx, 0xb00
        mov             edx, 0xb00
        mov             ebp, 0x303 ; for byte step

        align           16
.bulk_inner:
        ; non-crossing invariant: in0 <= in2 && in2 <= in1
        cmp             r8, r10
        ja              .bulk_done
        cmp             r10, r9
        ja              .bulk_done

        ; refill stream 0
        ; read next bits0, keep bit offset within byte
        mov             r11, [r8]
        and             ebx, ~0x38

        ; refill stream 1
        movbe           r12, [r9]
        and             rcx, ~0x38

        ; refill stream 2
        mov             r13, [r10]
        and             rdx, ~0x38

        %assign i 0
%rep N_DECS_PER_REFILL
        ; stream 0
        ; peek
        bextr           rax, r11, rbx
        ; consume
        add             bl, [rsi+rax*2]
        ; grab sym
        vpinsrb         xmm0, xmm0, [rsi+rax*2+1], i+0

        ; stream 1
        bextr           rax, r12, rcx
        add             cl, [rsi+rax*2]
        vpinsrb         xmm0, xmm0, [rsi+rax*2+1], i+1

        ; stream 2
        bextr           rax, r13, rdx
        add             dl, [rsi+rax*2]
        vpinsrb         xmm0, xmm0, [rsi+rax*2+1], i+2

        %assign i i+3
%endrep
        %undef i

        ; final advances
        ; num_bytes_step0
        bextr           rax, rbx, rbp
        ; in0 += num_bytes_step0
        add             r8, rax
        bextr           rax, rcx, rbp
        sub             r9, rax
        bextr           rax, rdx, rbp
        add             r10, rax

        vmovdqu         [rdi+r14], xmm0
        add             rdi, 15
        ; loop while bytes_to_decode > 0
        js              .bulk_inner 

That’s the core 3-stream Huffman decoder loop. Time to quit it with the hand-waving and do an actual analysis (if only back of the envelope) to make sure we’re on the right track here.

Analysis

We already looked at the core decode step earlier and noted that it has 3 macro-ops (I’ll write 3M in the following), and for the backend: 2 integer ALU ops (just 2I for short), 2 load unit cycles for aligned loads (2L for short), and 1 FP/SIMD op (1F for short). We do this 5 times per stream. Also per stream is the refill/advance logic, which we now know the instructions for: 1 load for the refill, and 3 integer ALU ops for the byte advance and bitpos update. The load in the refill is almost always unaligned, though. It’s a 64-bit load, and as noted in the introduction, unaligned loads are free if they stay within an aligned 16-byte sector, and cost at least 1 cycle extra when they don’t. Out of the possible load offsets mod 16, 9 (0 through 8 inclusive) stay within a 16-byte sector, the other 7 do not. That’s 7/16=0.4375 odds of at least one cycle extra, and some of those cases (such as crossing cache lines and pages) get more expensive than just adding a cycle. For sanity in the following, let’s just say that we bake this all down to somewhat simpler numbers and expect around 1.44 cycles average case (1 + 7/16 ≈ 1.44, but probably closer to 1.5 in realistic conditions) for those unaligned refill loads, 2 cycles for a much more pessimistic estimate. In other words, we want to bill the unaligned refill loads as costing more than a single aligned load, since the expected number of cycles the load pipelines are occupied with them is larger.

Taking that into account, the four instructions involved in refill and advance for a single stream boil down to 4M, 1.44-2L, and 3I.

Then, we have some cross-stream shared instructions: the two compare/jump pairs for our pointer-crossing check at the beginning account for 4M 4I, the final store accounts for 1M 1.44-2S (since it’s also unaligned), and the final ADD/JS pair contribute another 2M 2I to the tally. That’s all instructions in the loop accounted for.

For an overall throughput estimate, we get:

  • 15 × 3M (decodes) + 3 × 4M (stream refill/advance) + 7M (shared rest) = 64M total, so 64 macro-ops, enough to occupy the front-end for at least 32 cycles.
  • 15 × 2L (decodes) + 3 × 1.44-2L (stream refills) = 34.3-36L total, so the load unit is busy for 34.3-36 cycles.
  • 15 × 2I (decodes) + 3 × 3I (stream refills) + 6I (shared) = 45I total, evenly distributes over both integer ALU pipes for 22.5 cycles worth of pressure.
  • 15 × 1F (decodes) = 15F total, distributed over both FP/SIMD pipes for 7.5 cycles worth of pressure, so they’re loafing.
  • 1.44-2S for 1.44-2 cycles worth of pressure on the store pipe which I assume is sitting on the sidelines munching popcorn.

Purely in terms of pressure on the execution resources, we’re mainly limited by the load pipes which are busy for around 34.3-36 cycles every iteration, closely followed by the frontend which is occupied for at least 32 cycles if everything goes perfectly. 34.3-36 cycles to decode 15 bytes works out to 2.287-2.4 cycles per byte decoded. This is assuming we can ever get throughput-bound to begin with, and is budgeting absolutely no time for L1 cache misses and such.

How does the critical path look? By my reckoning, the most likely candidate takes a freshly updated in pointer from the end of a previous iteration, does an unaligned load to refill which takes 4 cycles for the data to show up, then does 5 back-to-back decodes from that stream which we know have a critical path latency of 5 cycles each, and then finally needs to do a BEXTR on the resulting bitpos followed by an integer add/subtract to produce the next load address. That’s 4 + 5 × 5 + 2 = 31 cycles of critical path latency through the stream decodes, worse if anything bad happens, like extra delays due to page crossings on a load or similar. 31 cycles is close enough to our other 2 limiters for it to be considered a 3-way near-tie. A hitch in the front-end or load pipes or any extra delay along the critical path is likely to end up delaying any given iteration if it occurs. Note I’m purely looking at idealized estimates here; there is no modeling or simulation of machine details going on, all we’re doing is tallying up some figures based on known machine characteristics.

In short, from this rough estimate, we would expect somewhere around 2.3-2.4 cycles per byte for this decoder under very idealized circumstances where there’s not a single cache miss or hitch anywhere along the way, and “a bit worse” (to be decided what that means) when less idealized.

The rubber hits the road

So what happens when we actually run it?

One of the nice things about coding for game consoles is that the hardware is known and tends to have very predictable, repeatable performance. With that said, here’s stats for the exact loop quoted above running on a PS4 on a synthetic test set (decoding a random stream with a very boring “every symbol is 8 bits” Huffman table, which of course you’d never do, but makes for a test run that’s very easy to validate the results of) 1000 times, and reporting 1st, 50th (median) and 95th percentile cycles per byte:

    huff3_jaguar_asm: med 2.28/b, 1st% 2.25/b, 95th% 2.38/b

I swear I did not fudge this in any way; those are the actual figures I got on a real test run just now. So that much is, ahem, very encouraging, to say the least. But what happens when you time it in the middle of an actual Kraken decode of ~250MB of real non-synthetic test data?

SimpleProf              :seconds  calls     count :     clk/call    clk/count
get_array_huff          : 0.3879  17689 178617138 :      34946.1         3.46
huff_x64jaguar_loop     : 0.3035  31350 178617138 :      15427.6         2.71

Here get_array_huff includes everything the Huffman decoder has to do, including reading the headers, the Huffman table descriptions, validating the code lengths, setting up the tables, the fast decode loops, and the slower near-end-of-data tail decoders, and huff_x64jaguar_loop is just the core optimized decode loop that handles most of the bulk data. “Calls” is the number of calls to either function and “count” is the number of bytes decoded. In this case, 250MB of data “only” decode about 178MB through the Huffman decoders; less than 250MB because we also do LZ-style dictionary compression (not covered in this series). So in this particular real-world use case (which very much does not have all the buffers already nicely in the L1/L2 caches as it’s running), we’re about 17% slower than our ideal average-case throughput estimate for this loop, which is still respectable. Also visible from these two lines is that the core decode loop is where around 75-80% of the overall Huffman decoding time is spent, with the rest being in setup or tail handling. That is fairly typical for our decoder implementations on various platforms. And you can infer from the figures given that our average array of bytes that use a single Huffman table is about 10k long. For this part, looking at averages is misleading: the distribution is quite wide. Many arrays are 60k+, but many others are well below 3k. The former spend more time in the core decoder (always nice since that part is flat), the latter spend a lot more time proportionately in header parsing and table initialization, which is why we can’t neglect it.

This last part is also why the Jaguar decoder (or, for that matter, all other decoders in Oodle) doesn’t bother with trying to set up tables to decode multiple symbols at once. This sounds enticing but it makes table setup more complicated and slower, and also adds many complications to the decoders because everything emits a variable number of bytes now. For example, using PINSRB to group output bytes would not work if each decode step produced either 1 or 2 bytes; we would need to do individual stores for every symbol, and also increment the destination pointer after every decode. This would add at least 2 instructions per byte decoded (a store and a destination pointer add), probably more. When your single-byte-at-a-time decode kernel runs 3 instructions per byte to begin with and instruction count is a major limiting factor, adding at least 2 extra instructions to maybe decode 2 bytes at a time isn’t all that tempting. With the single-byte-at-a-time decoder, we can decode another byte in three extra instructions, guaranteed, and we don’t need to do any expensive extra work during table setup to do so. We also don’t build up a debt of 2 instructions every time we don’t manage to decode 2 symbols at once that we later have to make up just to break even. Multi-symbol decoding is an old standby when decoding from a single stream, because you can hide a lot of extra work in the shadow of that nasty long critical path, but decoding from multiple streams simultaneously gives you more productive ways to spend those CPU cycles and maybe even get to the holy grail of being primarily throughput bound.

And that’s all I got for this post! I’m not sure which of the other decoder variants I’ll tackle next. Apologies for the long delay, but writing these up takes more effort than my usual blog post, and I need to be in the right headspace to even try doing it.

Footnotes

[1] If you’re familiar with Intel microarchitectures but not AMD, macro-ops are roughly comparable to what Intel calls “fused-domain micro-ops”, except they’re “even CISCier”, in the sense that even read-modify-write instructions like add [rdx], rax count as a single macro-op, where Intel would split them into an add-from-memory operation and a store internally. I’ll also add that historically speaking describing AMD macro-ops as similar to Intel fused-domain uops is backwards; AMD has been using “fat” macro-ops as part of their x86 instruction decomposition for a long time, since at least the K7 (original Athlon) architectures. Intel added fused micro-ops (which are more restricted) years later to their microarchitectures when they realized that having the frontend deal with these “chunkier” units was beneficial.

[2] Typical L1D load-to-use times for contemporary designs were 4 or 5 cycles. For that matter, they still are at time of writing, 9 years after Jaguar-based HW hit the shelves. That said, the Jaguars target much lower clock rates than those other designs—the fastest Jaguar descendants ran a bit above 2GHz, other OoO x86 cores from the same timeframe have similar L1D sizes and were typically designed to hit 4GHz or above, so presumably the Jaguar cores can fit a lot more logic into a pipeline stage.

[3] To editorialize even more, the Jaguar’s nearly complete lack of sharp edges and huge performance cliffs was a welcome change after the PS3/Xbox 360 generation, where the main CPU cores seemed at times like they had nothing but.

[4] Presumably due to a problem with the way immediate operand encoding in x86 works. Specifying a bitfield position and width needs at least 6 + 6 = 12 bits to be generally useful on a 64-bit machine. But x86 immediate operands for instructions with 32- or 64-bit operands only come in two sizes: 8 bits and 32 bits. The former is not enough, the latter is very wasteful. 16-bit immediates only exist for 16-bit register instructions, and this part of the encoding would be quite expensive to add exceptions to. Interestingly AMD added a short-lived immediate-operand version of BEXTR that indeed spends a full 32 bits, but this version of the instruction decodes to 2 macro-ops on Jaguar and was removed from Zen 3. It never shipped on any Intel CPU.

[5] All “big core” Intel CPUs that support BEXTR (that I’m aware of, anyway) decode it into 2 uops. The same CPUs can usually eliminate register-register moves during register renaming and would also take 2 uops for a shift-and-mask combination, so BEXTR has never been particularly interesting on them, since it offers at best some minor advantages in the frontend. It’s more interesting on the Intel Atom-derived cores such as the Alder Lake E-cores (which like Jaguar have a 1 uop, 1 cycle version) and AMD Zen cores, though.

[6] Yet another AMD/Intel difference: Intel has been renaming the 8-bit parts of registers such as AL and AH of RAX or R9B of R9 separately for a long time, meaning AL and AH can reside in separate locations in the physical register file. When referencing the merged halves as a single register (such as AX, EAX or RAX) later, Intel CPUs used to either stall (the “partial register stall” of long ago) or, later, started inserting merge operations that combined the results into the instruction stream. AMD has never done this, and instead seems to do partial updates on every operation. That means that on Intel CPUs, code that alternately updates AL and AH can execute as two independent dependency streams, whereas on AMD CPUs all these updates run in series. However, AMD never needs to insert any merge ops either, and has no special penalty for referring to the full register after a partial update, which is handy in our use case: on the Jaguar we’re always concerned with macro-ops through the front-end, so injecting merge operations would suck for us. Good that we don’t get any!

[7] Two streams so that we can alternate decoding from them, because (as seen in this post already and also mentioned in previous parts) sequential decoding from a single bitstream tends to result in very long dependency chains and is a bottleneck. Having pairs of forward and backwards streams with the two meeting in the middle allows us to store a single size in the bitstream to signal the start position of the reverse stream and the combined size as a single value. They also can, in some contexts, act as padding for each other. During decoding, ensuring the pointers don’t cross corresponds to an end-of-buffer check; we don’t know the exact size of either stream up front, but we know that the read pointers may never cross, and once decoding is done they should point to the same location.

[8] Arguably, we should have at least renumbered the streams and swapped labels of streams 1 and 2; but as is often the case, this was quickly prototyped, found to be working, then for a while we were concerned with other things such as buffer overflow safety and such, and by the time we realized it was pretty odd for stream 2 to appear in the bitstream before stream 1, it had long shipped to customers and was very much not worth a format-breaking change to rectify.

[9] Very convenient how the bitstream layout chosen works out so nicely for the most constrained of the important target platforms for Oodle.

[10] Yet another very-much-not-a-coincidence.

[11] This one actually is a coincidence, but I’ll take it.

[12] Yes, the decoder loop keeps 0x303 pinned in a register the entire time it’s running. Long-time friends will realize why this delights me; sometimes the universe just smiles at you like that. This one’s for you, Felix.

Entropy decoding in Oodle Data: x86-64 3-stream Huffman decoders

The adventure continues! Last time I went over the core decoder loop we use on AMD Jaguar-based consoles, i.e. Xbox One and PS4. I felt that one was a good example to start with since the target hardware is fixed, behaves quite predictably, and the various features and limitations of the machine conspire to make the solution essentially canonical: the narrow CPU front-end makes it so that minimizing macro-ops issued is a significant concern, and the BEXTR-based decode is a very natural fit.

This time, our hardware target is “any 64-bit x86”, used on less specialized targets like Linux, Windows and Mac 64-bit x86. We don’t have a single CPU type we’re optimizing for; it needs to run on everything, and we’d like to have good performance on anything released within the last 10 years (at the time this code was written in 2016) or so. We do get to have variants for newer ISA extensions when they help but we’d like a good baseline version for “any 64-bit x86 CPU”.

As with the previous two parts, you should read the introduction to the Huffman decoders first, continue with the “Reading bits in far too many ways” series on this blog (part 2, part 3) which is also a prerequisite, and if you can’t read and understand x86-64 assembly language this will be hard to follow.

You know how this goes by now

I’m not going into as much detail in this post as I did in the Jaguar one, both because our target this time is somewhat fuzzier and because we’re exploring variations on a theme here. It’s interesting to cover the key ideas, not so interesting to re-do variants of the same analysis many times in a row.

Our setting, then, is the same as it was last time: we have a Huffman code stream where the code lengths are constrained to be relatively short so we can do 5 in a row between refills, we use a 4KiB table with 2048 entries covering all possible 11-bit codewords, and decoding proceeds without having to do any significant data-dependent branches.1 We know that decoding back-to-back values from the same bitstream is limited by critical path latency, so we use multiple bitstreams.

On the AMD Jaguar, 3 independent bitstreams was sufficient to get us out of the miserable latency-bound hole and into the more desirable throughput-bound regime where we’re doing a decent job keeping the execution units busy. On the wider out-of-order cores that can sustain 4 or more instructions per cycle, 3 streams is not going to be enough for that, which is why we also have 6-stream variants that will be covered in a later part.2

Once again, I’ll go over this from the inside out: first figure out how to decode individual bytes, then sort out the refill strategy, then deal with the remaining plumbing.

The basic decode step

Our game plan hasn’t changed: peek ahead at the next 11 bits, do our table lookup, consume the right number of bits, emit the resulting byte. The most basic version of this, assuming we keep an explicit “remaining bits” count per stream, works out to something like this:

  ; peek at low 11 bits
  mov rcx, rBits
  and rcx, 2047
  ; table fetch
  movzx ecx, word [rTablePtr + rcx*2]
  ; shift out consumed bits, update bit count
  shr rBits, cl
  sub rBitCount, cl
  ; emit decoded byte
  shr ecx, 8
  mov [rDest + <index>], cl

So far, there should be no surprises here.3 As given, this works out to 7 instructions per byte decoded, quite a lot more than the 3 we had in the Jaguar version, but this time we don’t have guaranteed access to BMI instructions – and even if we did, they would not be as good as they are on Jaguar.
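
For readers who prefer C, here is a rough equivalent of the decode step above (my own sketch, not the shipping code; see footnote 3 below for why the bit-count handling looks slightly different in C, and note that compilers tend to insert extra moves on the critical path here, which is part of why the real loop is hand-written):

  #include <stdint.h>

  // One decode step: the 16-bit table entry has the code length in its low
  // byte and the decoded symbol in its high byte.
  static inline void decode_one(uint64_t *bits, int *bit_count,
                                const uint16_t *table, uint8_t *dest)
  {
      uint32_t entry = table[*bits & 2047]; // peek at low 11 bits, fetch entry
      uint32_t len   = entry & 0xff;        // code length
      *bits >>= len;                        // consume those bits
      *bit_count -= (int)len;
      *dest = (uint8_t)(entry >> 8);        // emit decoded byte
  }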

Two more small tweaks

Part of x86’s historical baggage is that bits 8 through 15 inclusive of the a, b, c and d registers can be accessed as their own register, and this can be used on the emit sequence to save an instruction, getting us down to 6:

  mov [rDest + <index>], ch

Due to the way the x86-64 instruction encoding works out, rDest in this code can’t be just anything. It has to be from a limited set of registers4 but other than that, we’re still not doing anything special here.

Note that we want to be using ECX as the register we load the table entry into because the variable-distance shifts in basic x86-64 all require the shift amount to live in CL.5 We don’t particularly care what register name we use for our scratch temporary; RCX/ECX is as good as any other. Nevertheless, using RCX (and only RCX) as the temporary to load table entries into is clearly the right choice here, but compilers would frequently insert extra moves (sometimes more than one extra on the critical path) here, which is how we ended up hand-writing these loops in assembly to begin with.6

The second fun wrinkle is that even though we can easily arrange for our shift count to always be in CL to begin with, on Intel CPUs with BMI2 available we always want to use the new BMI2 shift instructions instead (SHRX for the right shift in the decode step above, SHLX for left shifts like the one in the refill below), even when we’re shifting by the low bits of RCX, because they take 1 uop where the classic shift-by-CL forms take 2 or 3 on newer Intel uArchs.7

The final non-obvious change we do for the “real” version of the decode is that instead of the peek logic presented above, we actually do this instead:

  mov ecx, 2047
  and rcx, rBits

Why load the immediate first using a separate instruction and then AND with rBits? Well, the important realization here is that moving an immediate value into ecx/rcx has no dependencies and can execute at any time, whereas mov rcx, rBits does depend on rBits and thus ends up becoming a part of the critical path. Now this register-register move is generally “free” on microarchitectures that can implement reg-reg moves via renaming at an effective 0-cycle latency, which on the Intel side is Ivy Bridge and later, but there are two important caveats: first, not every machine we’re targeting (or at least were targeting at the time this code was written, 6 years ago) is Ivy Bridge and later, and second, some later microarchitectures had subtle bugs in their implementation of MOV elimination and ended up disabling it later. Hence, as a general rule, the load-immediate-first version is preferable since it takes 1 cycle off the critical path for every decode when MOV elimination is not performed, for whatever reason.8

Refill

As mentioned above, for this version we prefer the methods with explicit “remaining bits” counts, specifically what I described as variant 4 in my write-up on bit reading.

The reason to prefer this approach here is that it keeps the load for the bit buffer refill independent of the final bit count, meaning that refill load’s latency does not necessarily end up on the critical path. Since the latency along the critical path is the primary concern for this loop, this is definitely what we want.

In the “production” version of the loop, we keep the remaining bit count for stream 0 in al, the refill pointer in r8, and the bit buffer for stream 0 in r11. The version we use is this (with the instructions scheduled slightly differently, but that’s extremely minor):

  ; refill streams
  mov    rbp, [r8] ; next0 = load(in0)
  movzx  rcx, al   ; bitc0
  shl    rbp, cl   ; next0 << bitc0; use shlx with BMI2+!
  or     r11, rbp  ; bits0 |= next0 << bitc0
  shr    ecx, 3    ; 7 - bytes_consumed0
  sub    r8, rcx   ; in0 += bytes_consumed0 - 7
  or     al, 56    ; bitc0 |= 56
  add    r8, 7     ; in0 += 7 

That’s a fair number of instructions, but the dependency structure is quite favorable:

  • The load of the next bits to refill only depends on r8, which is known as soon as the previous iteration’s refill is complete (all the instructions that update r8 are in the above code snippet!) and in particular is completely independent of the 5 table loads per stream.
  • Insertion of the new bits depends on the final bit count being known, but once the 3 instructions (MOVZX, SHL, OR, all pure int ALU) for that part of the dependency chain are done, we can start decoding the next byte.
  • Updating the bit count and the read pointers takes several more instructions, but neither of these are as critical because the bit count doesn’t determine any load offsets and the read pointer is only used for the next iteration’s refill.

Once again, using SHLX for the shift when available is preferred. In this case, in addition to the 2 uops saved, it also lets us remove the 1 cycle for the MOVZX from the critical path.
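
For reference, here is a rough C equivalent of that refill sequence (again my names and a sketch rather than the shipping code; the backward stream would additionally byte-swap the loaded value and step its pointer the other way):

  #include <stdint.h>
  #include <string.h>

  static void refill(uint64_t *bits, uint32_t *bit_count, const uint8_t **in)
  {
      uint64_t next;
      memcpy(&next, *in, 8);          // unaligned 64-bit load (little-endian assumed)
      *bits |= next << *bit_count;    // insert fresh bits above the ones still valid
      *in += 7 - (*bit_count >> 3);   // advance by the number of whole bytes consumed
      *bit_count |= 56;               // at least 56 valid bits again
  }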

Beyond that, there’s a few more instructions in the loop to increment the destination pointer, maintain/test the “bytes to decode” count and check whether the input pointers have crossed, but all of this works the same as in the last part and is not on the critical path so I won’t spend time on it here.

Analysis

With all that said, what’s the expected cost of this decoder? Unlike the Jaguar example which targets a single known microarchitecture, this version of the loop targets a general PC target and thus should be good for numerous uArch generations from multiple vendors over something like a 10-year window.

We have in fact done just that kind of testing, and there are some minor variants (mainly depending on whether BMI2 and thus SHLX is available, as noted above), but the details here get very repetitive and tedious. I don’t particularly feel like writing all of this down, nor would it make for interesting reading.

Instead, I’m going to take a single uArch as our proxy. I’ll be using Intel’s Skylake (SKL), which Intel essentially (in progressively tweaked versions) kept re-releasing in between 2015 and 2021, meaning there’s a nearly 7-year window of Intel consumer CPUs that are all essentially SKL spin-offs. In the notebook space we saw Ice Lake in late 2019 but for desktop, 2021’s Rocket Lake (the first “real” SKL successor for desktop) was neither well-received nor sold well, and was soon replaced by Alder Lake in late 2021.

In practice, this means that at the time of this writing, the mean, median and mode Intel desktop x86 that is at least a year or so old is some SKL variant; during that time AMD made major inroads into the desktop space with their Zen cores (which, fortunately for our analysis, have quite similar performance characteristics). Notebooks are a bit more varied, but if I’m going to look at only one x86 uArch, SKL seems like a solid choice.

As mentioned multiple times throughout this post already, we’re mainly concerned about latency here. As such, I won’t go into many uArch details here and just point out that the relevant cores are a lot wider than the Jaguars we looked at last time (namely, they support 4+ instructions decoded/dispatched/retired per cycle), have 4 integer ALUs, can (under the right conditions) support 2 loads and 1 store per cycle, up to 2 branches per cycle, and in general have just more of essentially everything compared to the small Jaguar cores we looked at last time. (They also typically clock at twice the frequency, or more.)

Ironically, precisely because these cores are so much wider, the analysis gets a lot simpler. Here’s the gist of it: the all important critical path between iterations for a single stream starts with the final bit count for the last iteration and then goes through the refill and 5 subsequent symbols decodes until the bit count for the current iteration is known.

In the code as presented above, that means we start with the MOVZX, SHL and OR from the refill (all basic ALU ops that are 1-cycle latency on pretty much every x86 core you can buy these days) until we know the new value of rBits, for 3 cycles refill latency (can be 2 if we use SHLX).

Then we have 5 subsequent decodes where we care about the latency until we have the updated (shifted) rBits. In the version with the hoisted immediate that uses and rcx, rBits, the 3 relevant instructions on that dependency chain are this AND, the table load, and then the shift by CL. The AND and shift are also 1-cycle latency. In the final iteration, we also need to know the final bit count; this is computed by another independent instruction (the SUB) that depends on the table load and is likely to issue at the same time as our shift, so it doesn’t change the latency estimate. Finally, the load is the longest-latency instruction. Now SKL has 4-cycle-latency loads for certain very restricted cases, if the address itself originated on the load unit (i.e. for pointer chasing), but for our table lookup, we have a complicated addressing mode and the index register originates on the ALU not the load unit, so 5-cycle load latency is what we get.

Everything else is not critical, so for now, we’re going to ignore it. What does that leave us with? For one of our 3 streams, the latency along that critical path ends up being 3 + 5×(1 + 5 + 1) = 38 cycles.

Skylake can sustain 4+ fused-domain9 uOps per cycle during those 38 cycles, giving us 38 × 4 = 152 issue slots. One refill takes around 8 fused-domain uOps (with SHLX), each decode step takes 6. That means one stream’s worth of decodes takes around 8 + 5×6 = 38 uops, so for 3 streams that’s 114 uops out of our budget, leaving another 38 issue slots remaining.

That leaves the work we haven’t accounted for so far: one of our streams needs to do a BSWAP (another 3 uops), we have the two pointer-crossing checks (2-4 uops depending on what macro-fusion happens), the destination pointer increment (1 uop), and the final loop condition/test (1-2 uops, again depending on macro-fusion). And that’s it. At worst, that takes another 10 uops out of our budget.

We’re left with about 124 issue slots used over 38 cycles, about 81% of what the machine can sustain. That leaves enough air in the schedule that we don’t need to be too worried about exactly what the schedulers and port load-balancing logic do; we really do expect to be mostly latency-limited still, but at least latency-limited with a decent uArch utilization. Finally, our instruction mix is balanced enough, and the machine has enough resources, that we needn’t worry too much about any particular type of resource (like say load units or shifters) getting overcrowded.

In short, the best we can expect out of the loop described above is to decode 15 bytes (5 bytes each from 3 streams) every 38 cycles, which works out to 2.53 cycles per byte. In practice, on my Skylake test machine, benchmarks hover around 41.8 cycles per 15 bytes (i.e. around 2.79 cycles/byte), both in the idealized test rig and on real-world data.

That’s approximately 10% above our rough lower-bound estimate, purely from operation latencies, assuming scheduling is perfect, there are no cache misses, and so forth. At this point, we could dig deeper and try to figure out where we’re losing these 3.8 cycles relative to our critical path estimate, and if there’s something we can do about it, but the writing’s already on the wall: with one stream, we would be utterly latency-limited with a mostly-empty machine. The 3 streams we’re using now get us to a more respectable level of uArch utilization, but while they were enough to get us into our target throughput-bound regime on Jaguar, they’re not enough, not even in a lower-bound estimate where there are never any delays due to cache misses or suboptimal uop scheduling.

Discussion

Really, this should not be a surprise. On a machine twice as wide, with many more execution resources, instructions per cycle, and 2 cycle longer load latency, of course we need more streams to hide those latencies.

However, we’re starting to bump into other problems. Our bit-IO method uses 3 registers worth of state per stream: the bits, the bit count, and the refill pointer. That’s 9 registers used. We also definitely need a stack pointer, a destination buffer pointer, at least one scratch register (rcx), and also a base pointer for our Huffman table and preferably also an “end-of-buffer” pointer (or, equivalently, a counter) in a register for the final loop test. That’s 14 registers spoken for, and x86-64 only has 16 GPRs.

Now, some of those registers don’t need to be live for most of the loop. The “end-of-buffer” pointer never changes and is only used once at the end so it’s not a big problem to leave in memory, and our refill pointers are only used during refill so we can spill some of them without wreaking havoc if we don’t mind the extra instructions to spill and reload them, but we’re getting on thin ice here.

Using more streams would be interesting, but we need to be careful not to introduce a lot of extra instructions. Potentially troublesome spills/refills introducing latency in the wrong places would suck as well. In short, we’d like a few more streams (just one extra is cutting it too close), but that in turn means 3 GPRs per stream is not going to work. We need something more compact, which leads us to a quite different design for the 6-stream decoders we added in Oodle 2.3.0. But that will have to wait for another post. Until then!

Footnotes

[1] The “how many bytes are left” loop condition is just a counted loop and extremely predictable. The “did the pointers cross” check is technically data-dependent, but it is also extremely predictable (in the fast decoder loops anyway), because the decoder loops only run until our already-advanced pointers first cross, at which point we exit the fast loops and switch to the more careful variants.

[2] The trickiest part to negotiate when doing this experimentally is the transition from latency- to throughput-bound. As long as you’re latency-bound, your best move is to use strategies that minimize latency, even when doing so increases dynamic instruction count or utilization of limited execution resources. Once you have enough streams to cover the critical path latency with some slack left over, you need to worry about those bottlenecks instead, even when doing so adds latency. Options that were clearly inferior in the latency-bound regime will end up preferable in the throughput-bound regime.

[3] When writing this code in C/C++, you would probably keep the remaining bit count for the bit buffer in a 32- or 64-bit integer, not a byte, but done naively that would result in an extra instruction to extract and zero-extend the “length” field from the low 8 bits of the table entry. Using a byte register avoids the extra instruction, or alternatively, you could subtract the full 16-bit table entry from the bit count and mask the final bit count down to 8 bits (effectively reducing modulo 256) before using it. The two approaches are equivalent; for this presentation I went with the byte register variant because it’s more straightforward.

[4] x86-64 was designed to share as much of the instruction encoding (and hence the decoder logic) as possible between the 32-bit and 64-bit versions. 32-bit x86 has 8 general-purpose registers and 3-bit register numbers; x86-64 increases this to 16 GPRs (which take 4 bits), and the high bits of the up to 3 register numbers in an instruction are grouped together in what’s called a REX prefix byte. Only some instructions have a REX prefix: if all 3 register number high bits are zero and the instruction is not 64-bit either, it’s not required. However, the register numbering for 8-bit registers worked differently than it did for the other registers: for “classic” x86 code, the 8 8-bit registers are AL, CL, DL, BL, AH, CH, DH, and BH (in that order), where the last 4 refer to bits [15:8] of the corresponding 32-bit register. With the REX prefix present, every 64-bit register has its low 8 bits addressable as a corresponding “L” register, but the “H” register names are inaccessible. Thus, if you want to store CH to [rDest + immed], rDest needs to be one of the low 8 registers.

[5] Other (non-x86) architectures mostly don’t care about which architectural register the shift amount lives in, but almost everyone looks only at the low 5-8 bits (depending on architecture and context) of the shift count. This makes it so that arranging our (len,sym) pairs so that len is in the low byte (which for LE byte order means first byte) is almost universally the best choice.

[6] Hand-writing in ASM also made it practical to use the “store from CH” trick, which is not a huge deal in this particular version since there are other problems that we’ll get to in a minute, but this particular quirk of x86 is one that compilers generally won’t even try to use, even though it’s advantageous in cases like this one.

[7] This is one of my “favorite” x86 warts. The reason SHL by CL is awkward, and often more uOps than you would expect, is because the architectural definition of this (and other variable-shift instructions) is quite strange for backwards-compatibility reasons. Namely, all the shift-by-CL instructions set different flags (and set them in different ways) depending on whether the effective shift count ends up being 0, 1, or something else. This goes back to a problem with the original 8086. The 8086 didn’t have a barrel shifter or any other way to do fast distant shifts; “SHL AX, CL” existed but effectively just repeated SHL AX, 1 CL times. This definition was presumably convenient given the extremely limited microcode ROM space the 8086 worked with but has the two unfortunate side effects that a) you could make shifts take a very long time on the 8086 by setting CL to 255 and b) when CL is 0, no shifting takes place at all, and in particular the flags stay unchanged from before. Moreover, there’s another wrinkle in that SHL AX, 1 sets the overflow flag “correctly” (probably mostly by accident, because for left-shifts specifically the natural implementation of it on an 8086-like ALU data path is ADD AX, AX, which does set it correctly, and right-shifts can’t overflow), but when doing an iterated sequence of such shifts, the final overflow flag will indicate whether the last shift-by-1 overflowed, which is useless when the count is not 1, because for a “shift by 3” operation what you really want to know is whether there was an overflow at any point during the process, not whether the last instruction in a lengthy expansion overflowed (which is neither here nor there). The tightened definition that appears in the original 80286 manuals specifies that only the last 5 bits of the shift count matter (no doubt anticipating the coming expansion to 32-bit with the 80386) and further says that 1-bit-shifts set the overflow flag correctly but multi-bit shifts leave the overflow flags in an undefined state (not mentioned in these early manuals is that an effective shift count of 0 leaves the flags unchanged, but all x86 CPUs I’m aware of nevertheless behave this way). Long story short, the actual data portion of x86 SHL reg, CL is exactly what you would expect, but which flags get updated and how is a tangled mess that, on most Intel CPUs, takes another 1 or 2 uops to resolve. Interestingly, no AMD uArchs I’m aware of, present or historical, have a problem with it.

[8] Another option available on machines with BMI1 or later is to use ANDN to perform our AND operation. The idea is to put the ones’ complement of our desired mask (so ~2047) into a scratch register and then use andn rcx, rInvertedMask, rBits to initialize rcx. In short, we don’t actually care about the “not” part of the ANDN, but it conveniently comes as a non-destructive 3-operand instruction that we can use to avoid the move. This does the job in one instruction, which is great, but requires us to sacrifice another register, which is not so great. In this particular code it’s not really worthwhile since we’re entirely latency-limited on sufficiently wide cores, so having a MOV/AND pair is not a problem as long as we don’t extend the critical path unnecessarily.

[9] I won’t go into what exactly fused-domain uOps are; it’s there for precision, but if you don’t know what that means, just ignore it, it’s way beyond the scope of what we need for this post.

What’s that magic computation in stb__RefineBlock?

Back in 2007 I wrote my DXT1/5 (aka BC1/3) encoder rygdxt, originally for “fr-041: debris” (so it was size-constrained). A bit later I put up the source and Sean Barrett adapted it into “stb_dxt”, which is probably the form that most know it in today.

It’s a simple BC1 encoder that gives decent quality, the underlying algorithm is reasonably fast (fast enough to, say, bake textures you produce once per session in a game from a character creator, which is one of the cases I’ve ended up using it in in a professional context), and it is easy to integrate.

The basic algorithm uses the same primitives most BC1 encoders use (I’ll assume in the following you know how BC1 works): compute the average and covariance matrix of the block of pixels, compute the principal component of the covariance to get an initial guess for what direction the vector between the two endpoints should point in. Then we project all pixel values onto that vector to find the min/max support points in that direction as the initial seed endpoints (which determines the initial palette), assign each pixel the palette entry closest to it, and do some iterative refinement of the whole thing.

rygdxt/stb_dxt only uses the 4-color mode. It did not bother with the 3-color + 1-bit transparency mode. It’s much more niche, increases the search space appreciably and complicates the solver, and is rarely useful. The original code was written for 64k intros and the like and this was an easy small potential win to give up on to save on code size.

Nothing in there was invented by me, but at the time I wrote it anyway, DXT1/BC1 encoders (at least the ones I was looking at) were doing some of these steps in a way that was more complicated than necessary. For example, one thing I vividly recall is that one encoder I was looking at at the time (I believe it was Squish?) was determining the principal component of the covariance matrix by forming the characteristic polynomial (which for a 3×3 matrix is cubic), using Cardano’s formula to find the roots of the polynomial to determine eigenvalues, and finally using Gaussian Elimination (I think it was) to find vectors spanning the nullspace of the covariance matrix minus the eigenvalue times the identity. That’s a very “undergrad Linear Algebra homework” way of attacking the problem; it works, but it’s complicated with a fair amount of tricky code and numerical issues to wrestle with.

rygdxt, with its focus on size, goes for a much simpler approach: use power iteration. Which is to say, pick a start vector (the only tricky bit), and then iterate multiplying the covariance matrix by that vector and normalizing the result a few times. This gives you a PCA vector directly. In practice, 3-6 iterations are usually sufficient to get as good a direction vector as makes sense for BC1 encoding (stb_dxt uses 4), and the whole thing fits in a handful of lines of code with absolutely nothing subtle or numerically tricky going on. I don’t claim to have invented this, but I will say that before rygdxt I saw a bunch of encoders futzing around with more complicated and messier approaches like Squish, or even bothering with computing a general 3×3 (or even NxN) eigendecomposition, and these days power iteration seems to be the go-to, so if nothing else I think rygdxt helped popularize a much simpler technique in that space.
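
To make that concrete, here is a minimal power-iteration sketch (my own rendering, not a copy of the stb_dxt code; in particular, I’m just using (1,1,1) as the start vector here, which is the part the real code is a bit smarter about):

  #include <math.h>

  // cov is the 3x3 covariance matrix of the block; v receives the estimated
  // principal direction.
  static void principal_axis(float cov[3][3], float v[3])
  {
      v[0] = v[1] = v[2] = 1.0f; // start vector
      for (int iter = 0; iter < 4; ++iter) {
          float r = cov[0][0]*v[0] + cov[0][1]*v[1] + cov[0][2]*v[2];
          float g = cov[1][0]*v[0] + cov[1][1]*v[1] + cov[1][2]*v[2];
          float b = cov[2][0]*v[0] + cov[2][1]*v[1] + cov[2][2]*v[2];
          float m = fmaxf(fabsf(r), fmaxf(fabsf(g), fabsf(b)));
          if (m > 0.0f) { v[0] = r / m; v[1] = g / m; v[2] = b / m; }
      }
  }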

Another small linear algebra trick in there is with the color matching. We have a 4-entry palette with colors that lie (approximately, because everything is quantized to an 8-bit integer grid) on a line segment through RGB space. The brute-force way to find the best match is to compute the 4 squared distances between the target pixel value and each of the 4 palette entries. This is not necessarily a bad way to do it (especially if you use narrow data types and SIMD instructions, because the dataflow is very simple and regular), but it is essentially computing four 4-element dot products per pixel. What rygdxt/stb_dxt uses instead is the fact that if it were a perfect line segment in a continuous space, we could just use an orthogonal projection to find the nearest point on the line, which with appropriate normalization boils down to a single dot product per pixel. In that continuous simplification, the two interpolated colors sit exactly 1/3rd and 2/3rds along the way between the two endpoints. However working on the aforementioned 8-bit integer grid means that the interpolated colors can sometimes be noticeably off from their ideal placement, especially when the two endpoints are close together. What rygdxt therefore does is compute where the actual interpolated 1/3rd-of-the-way and 2/3rds-of-the-way colors land on the line (two more dot products), and then we can do our single dot product with the line direction and use the values we computed earlier to figure out which of these 4 colors is closest in the 1D space along the line, which is just a few comparisons and can be done branchlessly if desired.
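
Here’s a small sketch of that 1D matching step (names and layout are mine; the key assumptions are that proj[] holds the projections of the 4 actual palette colors sorted along the line, and remap[] maps that sorted position back to the BC1 index, which for 4-color mode is the 0, 2, 3, 1 order going from the color0 end):

  // Per block: turn the sorted palette projections into 3 halfway thresholds.
  static void make_thresholds(const float proj[4], float thresh[3])
  {
      thresh[0] = 0.5f * (proj[0] + proj[1]);
      thresh[1] = 0.5f * (proj[1] + proj[2]);
      thresh[2] = 0.5f * (proj[2] + proj[3]);
  }

  // Per pixel: one dot product (dp) with the line direction, then a branchless
  // nearest-in-1D pick among the 4 palette positions.
  static int pick_index(float dp, const float thresh[3], const int remap[4])
  {
      int pos = (dp > thresh[0]) + (dp > thresh[1]) + (dp > thresh[2]);
      return remap[pos];
  }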

The result doesn’t always match what the brute-force distance computation would pick, but it’s very close in practice, and reducing the computations by a factor of nearly four sped up the BC1 encoding process nicely (old 2008 evaluation by my now-colleague Charles here).

That leaves us with the actual subject of this blog post, the iterative refinement logic! I just answered an email by someone asking for an explanation of what that code does and why, so here goes.

The refinement function

The code in question is here.

Ultimately, what rygdxt/stb_dxt does to refine the results is linear least-squares minimization. (Another idea that’s not mine, this one I definitely got from Squish). We’re holding the DXT indices (interpolation weights) constant and solving for optimal endpoints in a least-squares sense. In each of the RGB color channels, the i’th target pixel is approximated by a linear interpolation

\displaystyle (1-w_i) x_0 + w_i x_1

where x0, x1 are the endpoints we’re solving for and wi is one of {0, 1/3, 2/3, 3/3} depending on which of the 4 indices is used for that pixel. Writing that out for the whole block in say the red channel turns into an overdetermined system of 16 linear equations

\displaystyle \begin{pmatrix} 1-w_0 & w_0 \\ 1-w_1 & w_1 \\ \vdots & \vdots \\ 1-w_{15} & w_{15} \end{pmatrix} \begin{pmatrix} x_{0r} \\ x_{1r} \end{pmatrix} = \begin{pmatrix} r_0 \\ r_1 \\ \vdots \\ r_{15} \end{pmatrix}

to be solved for x0r, x1r (the first/second endpoint’s R value).

Let’s call that tall and skinny matrix on the left A; x = (x_{0r}, x_{1r})^T is the 2D column vector we’re solving for, and the RHS vector of the pixel r values we can just call “r”.

That means our problem is to minimize ||Ax - r||_2 (2-norm), or equivalently ||Ax - r||^2.

||Ax-r||^2 is quadratic; the way you find its minimum is by computing the derivative and equating it to 0, which leads us to what’s called the Normal Equations, which for this problem are

\displaystyle A^T A x - A^T r = 0 \\ \Leftrightarrow A^T A x = A^T r

A is a 16×2 matrix, so A^T A is a tiny symmetric matrix (2×2) and contains the dot products of the columns of A with each other.
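
Spelled out, those entries are just sums over the 16 pixels:

\displaystyle A^T A = \begin{pmatrix} \sum_i (1-w_i)^2 & \sum_i (1-w_i) w_i \\ \sum_i (1-w_i) w_i & \sum_i w_i^2 \end{pmatrix}, \qquad A^T r = \begin{pmatrix} \sum_i (1-w_i) r_i \\ \sum_i w_i r_i \end{pmatrix}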

We have 3 color channels, not just r but g and b as well. That means we have 3 copies of the same linear system with 3 different right-hand sides, or equivalently we can view the whole thing as a matrix equation with a 3-wide right-hand side. Either way, all 3 systems have the same A matrix, the only thing that differs is the right-hand sides.

We accumulate A^T r (and g and b as well) in the first pass, and also compute the entries of A^T A. To solve the system above, because we just have a 2×2 matrix, we can use Cramer’s rule to solve it directly, no need to mess around with factorizations or Gaussian Elimination or similar.
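
For the record, writing that symmetric 2×2 system as a x_0 + b x_1 = p, b x_0 + c x_1 = q (with a, b, c the three distinct entries of A^T A and p, q the two entries of A^T r for the channel in question), Cramer’s rule gives

\displaystyle x_0 = \frac{cp - bq}{ac - b^2}, \qquad x_1 = \frac{aq - bp}{ac - b^2}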

That’s the basic idea. The RefineBlock function uses two more tricks:

  1. instead of the weights being {0, 1/3, 2/3, 3/3}, we multiply them by 3 and also scale the RHS by 3 (the latter, we never actually do explicitly). Getting the extra scaling is essentially free during  the linear system solve, especially since we already need to do some scaling per-channel anyway, because the R/B endpoint values we solve for are in [0,31] (instead of [0,255] for the input pixel values), and the G values are in [0,63]. Scaling everything by 3 means there are no fractions involved in the computation, it’s all small integers, which will be useful for the second trick. It also means that when we compute the determinant of the 2×2 system for Cramer’s rule, it’s an integer computation, so we don’t have to worry about near-zero values and the like. (Although in this case, it’s not too hard to prove that A is singular, i.e. has determinant 0, if and only if all the wi are the same, which is easy enough to detect up front.)
  2. now our weights wi (matrix entries in A) are all in {0,1,2,3}. The three distinct entries of A^T A are, respectively, sums of (note we scaled by 3, so 1-w_i turns into 3-w_i): (3-w_i)^2, (3-w_i) w_i, and w_i^2. Note that all three values we’re summing only depend on wi, and wi is one of 4 possible values (depending on the index), so we can just precompute all of them. Also note they’re all quite small: (3-w_i)^2 and w_i^2 are at most 9, and (3-w_i) w_i is either 0 or 2, so they all comfortably fit in 4 bits. We’re summing 16 of these values (one per pixel in the block), and they’re all positive. That means the final sums fit into 8 bits no problem. Therefore the actual table we have packs the 3 values into one integer:
    \displaystyle ((3-w_i)^2 \ll 16) + ((3-w_i) w_i \ll 8) + (w_i)^2.
    We sum that one integer per pixel. All of the individual sums are guaranteed to be <256 so we can extract the corresponding bits after the accumulation loop. That means the cost of computing the entries of A^T A becomes quite cheap (a single 4-entry table lookup and 32-bit integer add per pixel), and the bulk of the work in that first loop is just computing the right-hand sides. (A small code sketch of this accumulation follows right after this list.)
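
Here’s a hedged sketch of that accumulation (names are mine, not the actual stb__RefineBlock code; w[i] is the scaled weight in {0,1,2,3} for pixel i):

  #include <stdint.h>

  // Packed per-weight contributions: ((3-w)^2 << 16) | ((3-w)*w << 8) | w^2.
  static const uint32_t w_tab[4] = {
      (9u << 16) | (0u << 8) | 0u, // w = 0
      (4u << 16) | (2u << 8) | 1u, // w = 1
      (1u << 16) | (2u << 8) | 4u, // w = 2
      (0u << 16) | (0u << 8) | 9u, // w = 3
  };

  // Accumulate the three distinct entries of A^T A for one block.
  static void accumulate_ata(const int w[16], int *xx, int *xy, int *yy)
  {
      uint32_t akku = 0;
      for (int i = 0; i < 16; ++i)
          akku += w_tab[w[i]];         // one table lookup + add per pixel
      *xx = (int)(akku >> 16);         // sum of (3-w_i)^2,   at most 16*9 = 144
      *xy = (int)((akku >> 8) & 0xff); // sum of (3-w_i)*w_i, at most 16*2 = 32
      *yy = (int)(akku & 0xff);        // sum of w_i^2,       at most 16*9 = 144
      // The per-channel 2x2 solve (Cramer's rule, det = xx*yy - xy*xy) and the
      // requantization to 5/6-bit endpoint values happen after this.
  }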

And that’s about it. I know that approach was later also used by NVidia Texture Tools; beyond that I have no idea of its reach, but if it’s handy for someone, cool!
