ArithmeticCoding
0.0.0a
Arithmetic coding in C
|
void encode() { unsigned char *input, *output=0; // input and output buffer size_t countof_input, countof_output; // number of symbols in input and output buffer float *cdf=0; size_t nsym; // number of symbols in the input alphabet // somehow load the data into input array cdf_build(&cdf,&nsym,input,countof_input); encode_u1_u8( // encode unsigned chars to a string of bits (1 bit per output symbol) (void**)&out,&countof_output, input, countof_input, cdf, nsym); // do something with the output free(out); free(cdf); }
I wrote this in order to learn about arithmetic coding. The end goal was to get to the point where I had an adaptive encoder/decoder that could compress markov chains. I got a little distracted along the way by trying to encode to a variable symbol alphabet. Variable symbol encoding alphabets are fun because you can encode to non-whitespace ASCII characters (94 symbols) and that means the encoded message can be embeded in a text file format.
Arithmetic coding is a computationally expensive coding method. I've done nothing to address this problem; I've probably made it worse.
To learn more about arithmetic coding see this excelent reference:
Said, A. “Introduction to Arithmetic Coding - Theory and Practice.” Hewlett Packard Laboratories Report: 2004–2076. www.hpl.hp.com/techreports/2004/HPL-2004-76.pdf
All the encoding and decoding functions require knowledge of the probability of observing input symbols. This probability is specified as a CDF of a particular form. Namely:
1. cdf
[0] must be zero (0.0).
2. The probability density for symbol i should be cdf
[i+1]-cdf[i].
This implies, for an alphabet with M symbols:
cdf
[M] is 1.0 (within floating point precision)Encoding functions all have the same form:
encode_<TDst>_<TSrc>(void **out, size_t *nout, uint8_t *in, size_t nin, float *cdf, size_t nsym);
where TDst and TSrc are the output and input types, respectively. The output buffer, *out, can be an heap-allocated buffer; it's capacity in bytes should be passed as *nout. If *out is not large enough (or NULL), it will be reallocated. The new capacity will be returned in *nout.
Variable symbol encodings are a little different. These functions take the number of output symbols, noutsym, as well. They always encode to a 1-byte stream. Their form is:
vencode_<TSrc>(void **out, size_t *nout, size_t noutsym, uint8_t *in, size_t nin, size_t ninsym, float *cdf);
Decoding functions all have the same form:
decode_<TDst>_<TSrc>(void **out, size_t *nout, uint8_t *in, size_t nin, float *cdf, size_t nsym);
where TDst and TSrc are the output and input types, respectively. The output buffer, *out, can be an heap-allocated buffer; it's capacity in bytes should be passed as *nout. If *out is not large enough (or NULL), it will be reallocated. The new capacity will be returned in *nout.
This is basically identical to the encode functions, but here "output" refers to the decoded message and "input" refers to an encoded message.
Variable symbol decoding has the form:
vdecode_<TDst>(void **out, size_t *nout, size_t noutsym, uint8_t *in, size_t nin, size_t ninsym, float *cdf);