ArithmeticCoding  0.0.0a
Arithmetic coding in C
Arithmetic Coding

Contents

Example

    void encode()
    { unsigned char *input, *output=0;      // input and output buffer
      size_t countof_input, countof_output; // number of symbols in input and output buffer
      float *cdf=0;
      size_t nsym;                          // number of symbols in the input alphabet
      // somehow load the data into input array
      cdf_build(&cdf,&nsym,input,countof_input);
      encode_u1_u8(                         // encode unsigned chars to a string of bits (1 bit per output symbol)
        (void**)&out,&countof_output,
               input, countof_input,
                 cdf, nsym);
      // do something with the output
      free(out);
      free(cdf);
    }

Features

Todo:
Markov chain adaptive encoding/decoding.

History and Caveats

I wrote this in order to learn about arithmetic coding. The end goal was to get to the point where I had an adaptive encoder/decoder that could compress markov chains. I got a little distracted along the way by trying to encode to a variable symbol alphabet. Variable symbol encoding alphabets are fun because you can encode to non-whitespace ASCII characters (94 symbols) and that means the encoded message can be embeded in a text file format.

Arithmetic coding is a computationally expensive coding method. I've done nothing to address this problem; I've probably made it worse.

To learn more about arithmetic coding see this excelent reference:

    Said, A. “Introduction to Arithmetic Coding - Theory and Practice.” 
         Hewlett Packard Laboratories Report: 2004–2076.
         www.hpl.hp.com/techreports/2004/HPL-2004-76.pdf
   

Cumulative Distribution Functions (CDFs)

All the encoding and decoding functions require knowledge of the probability of observing input symbols. This probability is specified as a CDF of a particular form. Namely:

1. cdf[0] must be zero (0.0).

2. The probability density for symbol i should be cdf[i+1]-cdf[i].

This implies, for an alphabet with M symbols:

See also:
cdf_build() for an example of how to build a CDF from a given input message.

Encoding Functions

Encoding functions all have the same form:

    encode_<TDst>_<TSrc>(void **out, size_t *nout, uint8_t *in, size_t nin, float *cdf, size_t nsym);

where TDst and TSrc are the output and input types, respectively. The output buffer, *out, can be an heap-allocated buffer; it's capacity in bytes should be passed as *nout. If *out is not large enough (or NULL), it will be reallocated. The new capacity will be returned in *nout.

Variable symbol encodings are a little different. These functions take the number of output symbols, noutsym, as well. They always encode to a 1-byte stream. Their form is:

    vencode_<TSrc>(void **out, size_t *nout, size_t noutsym, uint8_t *in, size_t nin, size_t ninsym, float *cdf);

Decoding functions

Decoding functions all have the same form:

    decode_<TDst>_<TSrc>(void **out, size_t *nout, uint8_t *in, size_t nin, float *cdf, size_t nsym);

where TDst and TSrc are the output and input types, respectively. The output buffer, *out, can be an heap-allocated buffer; it's capacity in bytes should be passed as *nout. If *out is not large enough (or NULL), it will be reallocated. The new capacity will be returned in *nout.

This is basically identical to the encode functions, but here "output" refers to the decoded message and "input" refers to an encoded message.

Variable symbol decoding has the form:

    vdecode_<TDst>(void **out, size_t *nout, size_t noutsym, uint8_t *in, size_t nin, size_t ninsym, float *cdf);
Author:
Nathan Clack <https://github.com/nclack>
 All Classes Files Functions Variables Typedefs Defines