A New Algorithm for Data Compression

Many compression algorithms replace frequently occurring bit patterns with
shorter representations. The simple approach I present is to replace common
pairs of bytes by single bytes.
The algorithm compresses data by finding the most frequently occurring
pairs of adjacent bytes in the data and replacing all instances of the pair with a
byte that was not in the original data. The algorithm repeats this process until
no further compression is possible, either because there are no more frequently
occurring pairs or there are no more unused bytes to represent pairs. The
algorithm writes out the table of pair substitutions before the packed data.
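One pass of this pair-replacement idea can be sketched as follows. This is a minimal illustration, not the article's actual Listing 3: a brute-force 256×256 count array stands in for the hash table described later, and the function name is invented for the example.

```c
#include <string.h>

/* One pass of pair replacement: find the most frequent adjacent byte pair
   in buf[0..len-1] and replace every occurrence with the unused byte 'sub'.
   Returns the new length (== len if nothing was worth replacing). */
int replace_best_pair(unsigned char *buf, int len, unsigned char sub)
{
    static int count[256][256];      /* brute-force pair counts, for clarity */
    int i, j, best = 1;
    unsigned char a = 0, b = 0;

    memset(count, 0, sizeof count);
    for (i = 0; i + 1 < len; i++)
        if (++count[buf[i]][buf[i + 1]] > best) {
            best = count[buf[i]][buf[i + 1]];
            a = buf[i];
            b = buf[i + 1];
        }
    if (best < 2)                    /* no pair occurs twice; stop */
        return len;

    for (i = j = 0; i < len; ) {
        if (i + 1 < len && buf[i] == a && buf[i + 1] == b) {
            buf[j++] = sub;          /* pair -> single unused byte */
            i += 2;
        } else {
            buf[j++] = buf[i++];
        }
    }
    return j;
}
```

Calling this repeatedly, with a fresh unused byte each time, until the length stops shrinking or the unused bytes run out, reproduces the iterative scheme described above.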

This algorithm is fundamentally multi-pass and requires that all the data
be stored in memory. This requirement causes two potential problems: the algorithm cannot handle streams, and some files may be too large to fit in
memory. Also, large binary files might contain no unused characters to represent
pair substitutions.
Buffering small blocks of data and compressing each block separately solves
these problems. The algorithm reads and stores data until the buffer is full or
only a minimum number of unused characters remain. The algorithm then compresses the buffer and outputs the compressed buffer along with its pair table.
Using a different pair table for each data block also provides local adaptation
to varying data and improves overall compression.
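The buffered scheme amounts to a read-compress-write loop. The sketch below shows only the block framing: the compression step is omitted (the buffer is copied through unchanged), and the 2-byte length header is an assumption for illustration; the real program's output for each block also carries its pair table.

```c
#include <stdio.h>

#define BLOCKSIZE 4096      /* illustrative; the article caps it at 32767 */

/* Read the input a block at a time and emit each block behind a 2-byte
   length header so an expander could re-synchronise.  The per-block
   compression itself is left out to keep the framing visible. */
long write_blocks(FILE *in, FILE *out)
{
    unsigned char buf[BLOCKSIZE];
    size_t n;
    long blocks = 0;

    while ((n = fread(buf, 1, BLOCKSIZE, in)) > 0) {
        fputc((int)(n >> 8), out);      /* high byte of block length */
        fputc((int)(n & 0xFF), out);    /* low byte of block length */
        fwrite(buf, 1, n, out);
        blocks++;
    }
    return blocks;
}
```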
Listing 1 and Listing 2 show pseudocode for the compression and expansion
algorithms.

2) Implementation
Listing 3 and Listing 4 provide complete C programs for compression and expansion of files. The code is not machine dependent and should work with any
ANSI C compiler. For simplicity, error handling is minimal. You may want
to add checks for hash table overflow, expand stack overflow, and input/output
errors. The expansion program is much simpler and faster than the compression
program.

The compression algorithm spends most of its time finding the most frequent
pair of adjacent characters in the data. The program maintains a hash table
consisting of arrays left[], right[], and count[] to count pair frequencies. The
hash table size HASHSIZE must be a power of two, and should not be too
much smaller than the buffer size BLOCKSIZE or overflow may occur. Programmers can adjust the value of BLOCKSIZE for optimum performance, up
to a maximum of 32767 bytes. The parameter THRESHOLD, which specifies
the minimum occurrence count of pairs to be compressed, can also be adjusted.
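The hash table can be pictured roughly as below. The mixing expression is an assumption for illustration (Listing 3's actual hash function may differ), but it shows why HASHSIZE must be a power of two: the slot index is computed by masking with HASHSIZE - 1.

```c
#define HASHSIZE 1024                /* must be a power of two */

static unsigned char left[HASHSIZE], right[HASHSIZE];
static int count[HASHSIZE];

/* Locate (or claim) the slot for the pair (a,b) by hashing and linear
   probing.  Callers tally a pair with: count[lookup(a, b)]++ */
int lookup(unsigned char a, unsigned char b)
{
    int i = (a ^ (b << 5)) & (HASHSIZE - 1);   /* assumed mixing function */
    while (count[i] != 0 && (left[i] != a || right[i] != b))
        i = (i + 1) & (HASHSIZE - 1);          /* probe next slot */
    left[i] = a;                               /* claim or refresh the slot */
    right[i] = b;
    return i;
}
```

If too many distinct pairs chase too few slots, probing degenerates and the table can fill entirely, which is why HASHSIZE should not be much smaller than BLOCKSIZE.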
Once the algorithm finds the most frequently occurring pair, it must replace
the pair throughout the data buffer with an unused character. The algorithm
performs this replacement in place within a single buffer. As it replaces each
pair, the algorithm updates the hash table’s pair counts. This method of updating the hash table is faster than rebuilding the entire hash table after each
pair substitution.
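The incremental update works because each substitution changes only the pairs touching the match: around a match ... x a b y ..., the pairs (x,a) and (b,y) vanish while (x,s) and (s,y) appear, and (a,b) itself loses one count. A sketch, again with a flat 256×256 array standing in for the hash table and assuming the caller has already tallied all pair counts for the buffer:

```c
static int cnt[256][256];   /* pair counts; assumed pre-filled by a scan */

/* Replace every (a,b) in buf with the unused byte s, adjusting the pair
   counts incrementally instead of rescanning.  Returns the new length. */
int replace_and_update(unsigned char *buf, int len,
                       unsigned char a, unsigned char b, unsigned char s)
{
    int r = 0, w = 0;
    while (r < len) {
        if (r + 1 < len && buf[r] == a && buf[r + 1] == b) {
            cnt[a][b]--;                   /* the replaced pair itself */
            if (w > 0) {                   /* left neighbour, already written */
                cnt[buf[w - 1]][a]--;
                cnt[buf[w - 1]][s]++;
            }
            if (r + 2 < len) {             /* right neighbour */
                cnt[b][buf[r + 2]]--;
                cnt[s][buf[r + 2]]++;
            }
            buf[w++] = s;
            r += 2;
        } else {
            buf[w++] = buf[r++];
        }
    }
    return w;
}
```

Note that the left-neighbour adjustment reads the already-rewritten part of the buffer, so back-to-back matches (for example "ABAB") still leave the counts consistent.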

3) Pair Table Compression

After the program has compressed a buffer, the pair table contains entries of
those pairs of bytes that were replaced by single bytes within the buffer. Figure 1
shows a sample pair table resulting from compression of a string of 9 characters,
with a hypothetical character set limited to 8 characters. The pair table does not
store the replacement bytes; rather, a pair’s position in the table indicates the
value of the replacement byte. For example, in Figure 1, pair ’A’:’B’ is found in
the pair table’s 8th entry, which indicates that this particular pair was replaced
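Given such a table, expansion reduces to table lookups. The sketch below follows the stack-based approach of Listing 4, with array and function names invented for the example: pair_left[c] and pair_right[c] hold the pair that byte value c replaced, and an entry with pair_left[c] == c marks a literal byte.

```c
#define STACKSIZE 64    /* small stack for nested substitutions */

/* Expand in[0..inlen-1] into out using the pair table.  Each input byte is
   pushed and repeatedly popped: literals are copied through, while
   substituted bytes are replaced by their pair (left half expanded first). */
int expand(const unsigned char *in, int inlen, unsigned char *out,
           const unsigned char *pair_left, const unsigned char *pair_right)
{
    unsigned char stack[STACKSIZE];
    int i, sp, outlen = 0;

    for (i = 0; i < inlen; i++) {
        sp = 0;
        stack[sp++] = in[i];
        while (sp > 0) {
            unsigned char c = stack[--sp];
            if (pair_left[c] == c) {
                out[outlen++] = c;            /* literal: copy through */
            } else {
                stack[sp++] = pair_right[c];  /* push right first so the */
                stack[sp++] = pair_left[c];   /* left half expands first */
            }
        }
    }
    return outlen;
}
```

Because a replacement byte can itself appear inside another table entry, the stack unwinds nested substitutions without recursion, which is part of why expansion is so much simpler and faster than compression.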


