ArithmeticCoding: Arithmetic Coding

Example

    void encode()
    { unsigned char *input, *output=0;      // input and output buffer
      size_t countof_input, countof_output; // number of symbols in input and output buffer
      float *cdf=0;
      size_t nsym;                          // number of symbols in the input alphabet
      // somehow load the data into input array
      cdf_build(&cdf,&nsym,input,countof_input);
      encode_u1_u8(                         // encode unsigned chars to a string of bits (1 bit per output symbol)
        (void**)&out,&countof_output,
               input, countof_input,
                 cdf, nsym);
      // do something with the output
      free(out);
      free(cdf);
    }

Features

Encoding/decoding to/from variable symbol alphabets.

Implicit use of a STOP symbol means that you don't need to know the number of symbols in the decoded message in order to decode something.

Can encoded messages stored as signed or unsigned chars, shorts, longs or long longs. Keep in mind, however, that the number of encodable symbols may be limiting. You can't encode 2^64 different integers, sorry.

Can encode to streams of variable symbol width; either 1,4,8, or 16 bits. There is are two tradeoffs here.

Smaller bit-width (e.g. 1) give better compression than larger bit-width (e.g. 16), but compression is slower (I think?).

The implimentation puts a limit on the smallest probability of an encoded symbol. Smaller bit-width (e.g. 1) can accomidate a larger range of probabilities than large bit-width (e.g. 16).

Todo:: Markov chain adaptive encoding/decoding.

History and Caveats

I wrote this in order to learn about arithmetic coding. The end goal was to get to the point where I had an adaptive encoder/decoder that could compress markov chains. I got a little distracted along the way by trying to encode to a variable symbol alphabet. Variable symbol encoding alphabets are fun because you can encode to non-whitespace ASCII characters (94 symbols) and that means the encoded message can be embeded in a text file format.

Arithmetic coding is a computationally expensive coding method. I've done nothing to address this problem; I've probably made it worse.

To learn more about arithmetic coding see this excelent reference:

    Said, A. “Introduction to Arithmetic Coding - Theory and Practice.” 
         Hewlett Packard Laboratories Report: 2004–2076.
         www.hpl.hp.com/techreports/2004/HPL-2004-76.pdf

Cumulative Distribution Functions (CDFs)

All the encoding and decoding functions require knowledge of the probability of observing input symbols. This probability is specified as a CDF of a particular form. Namely:

1. cdf[0] must be zero (0.0).

2. The probability density for symbol i should be cdf[i+1]-cdf[i].

This implies, for an alphabet with M symbols:

cdf has M+1 elements.

cdf[M] is 1.0 (within floating point precision)

See also:: cdf_build() for an example of how to build a CDF from a given input message.

Encoding Functions

Encoding functions all have the same form:

    encode_<TDst>_<TSrc>(void **out, size_t *nout, uint8_t *in, size_t nin, float *cdf, size_t nsym);

where TDst and TSrc are the output and input types, respectively. The output buffer, *out, can be an heap-allocated buffer; it's capacity in bytes should be passed as *nout. If *out is not large enough (or NULL), it will be reallocated. The new capacity will be returned in *nout.

Variable symbol encodings are a little different. These functions take the number of output symbols, noutsym, as well. They always encode to a 1-byte stream. Their form is:

    vencode_<TSrc>(void **out, size_t *nout, size_t noutsym, uint8_t *in, size_t nin, size_t ninsym, float *cdf);

Decoding functions

Decoding functions all have the same form:

    decode_<TDst>_<TSrc>(void **out, size_t *nout, uint8_t *in, size_t nin, float *cdf, size_t nsym);

This is basically identical to the encode functions, but here "output" refers to the decoded message and "input" refers to an encoded message.

Variable symbol decoding has the form:

    vdecode_<TDst>(void **out, size_t *nout, size_t noutsym, uint8_t *in, size_t nin, size_t ninsym, float *cdf);

Author:: Nathan Clack <https://github.com/nclack>