LZSS

From XentaxWiki
Jump to: navigation, search

General Description

The LZSS (Lempel-Ziv-Storer-Szymanski) algorithm is a compression scheme which tries to use its knowledge of past ("already seen") data to compress the current data.

To achieve this, the compressor holds a fixed-length buffer in memory which contains the last read bytes. The goal is to find a match for the current data in the buffer in order to store just a buffer reference instead of the actual data. During decompression, the algorithm reconstructs this buffer from the compressed data.

This particular algorithm was often used to compress data in older games since decompression is quite easy to achieve and fast to do in memory.


How to Recognize an LZSS-compressed File

Frequently, files compressed by LZSS variants can be recognized by certain "give-aways", especially when the format of the desired output file is known:

  • The data starts fairly normal, but seems to become "garbled" later on.
  • There are intermittent bytes (especially at the beginning) at 8 or 16 bytes distance which have the values 0xff or 0x00 respectively.
  • Distinct patterns of two bytes intermitted by 0x00 at 16 bytes distance in the compressed file which match large uniform areas in the output file.


Compression and Parameter Description

The compression technique of LZSS can be roughly separated into these phases:

  • Set up the algorithm parameters. These include:
    • N: Size of history buffer. This will determine the length of a circular buffer that stores the last N read bytes from the input file.
    • F: Size of length window. This will directly influence the maximum length of matches that can be used in the buffer. Usually, N and F are chosen as powers of 2 such that the number of bits needed to describe N - 1 plus the number of bits used to describe F - 1 equal 16. This way, buffer references can later on be stored as exactly two bytes.
    • Threshold: Minimum compression size. To avoid negative compression results due to very short matches, a minimum size for a match is introduced. As buffer references usually take up two bytes (see above), this value is often set to 3. A side effect of this is the modified range of a match length: Suppose F has been chosen as 16, which would allow a value range of 0..15. As the values 0..2 will never be used to describe a match length, we can instead shift the range by the Threshold value and interpret it as 3..18.
    • BufPos: Initial buffer writing position. This value has to be set to determine where data writing will start in the history buffer. Often chosen as N - F.
    • BufVoid: Initial buffer fill byte. The buffer has to be brought into a defined state before beginning compression and is thus filled with a defined byte. This can actually affect the compression, as e. g. a whitespace (0x20) might be used when compressing text files.
  • Read the input file. During this:
    • Try to find the longest possible sequence of upcoming bytes that can also be found in the buffer.
      • If this sequence has a length of at least Threshold, save the buffer offset and the sequence length into the output file. Record the literal bytes into the buffer.
      • If such a match cannot be found, copy over one literal byte into the output and the buffer.
    • To record whether a compression or a literal copy was made, set a bit in a special flag byte. This flag byte will be saved before the affected data, thus there will be exactly 8 actions (copy or compression) taken until a new flag byte has to be recorded.

Decompression

To decompress data, the above process has to be reversed. Obviously, the parameters mentioned above have to be known to successfully restore the original data. Often enough, a good guess by looking at the compressed data (especially areas with patterns) will work here. As soon as the parameters are known, the decompression process is fairly simple. Assuming that everything is set up correctly, the pseudo code below will illustrate a rough example of how such a routine could work:

while InputPos < InputSize do
begin
  FlagByte := Input[InputPos++]
  for i := 1 to 8 do
  begin
    if (FlagByte and 1) = 0 then
    begin
      OfsLen := Input[InputPos] + Input[InputPos + 1] shl 8
      InputPos += 2
      if OfsLen = 0 then
        Exit;
      Length := OfsLen and LengthMask + Threshold
      Offset := (BufPos - (OfsLen shr LengthBits)) and (N - 1)
      { copy Length bytes from Buffer[Offset] onwards while increasing BufPos }
    end
    else
    begin
      { copy 1 byte from Input[InputPos] }
      InputPos += 1
    end
    FlagByte := FlagByte shr 1
  end
end


Note the following:

  • LengthBits describes the number of bits needed to describe the F parameter. The above code assumes that the length of the buffer reference is saved in the lower bits of the word. This does not have to be the case. LengthMask is derived from this via
    (1 shl LengthBits) - 1.
  • It is further assumed that the buffer offset is stored relatively to the current buffer write position. It might as well be an absolute offset, making the subtraction from BufPos unnecessary.
  • Always think of the fact that the buffer is circular. So do a wrap-around of the offset value via mod or and when you access the buffer.
  • Always mirror any writes to the output into the history buffer.
  • In most cases copying the data out of the history buffer will be implemented byte-wise. While you could also copy the whole sequence at once, some implementations you might come across "abuse" cross-over situations between the buffer read and write positions during lookup and actually expect a byte-wise copy! (This situation might arise if you are supposed to read a certain number of bytes out of the buffer, but will cross the initial write position while doing so, thus effectively repeating the pattern you have already read.)

Further Information

More extensive information on this algorithm can easily be found on the web. Starting points might be this page featuring extensive implementation thoughts or this page showing a graphical explanation of the algorithm.


Games