Difference between revisions of "LZSS"

From XentaxWiki
Jump to: navigation, search
(Added a list that might be handy for the next man)
 
(8 intermediate revisions by 4 users not shown)
Line 1: Line 1:
 
== General Description ==
 
== General Description ==
  
The LZSS (Lempel-Ziv-Storer-Szymanski) algorithm is a compression scheme which tries to use its knowledge of past ("already seen") data to compress the current data.<br>
+
The LZSS (Lempel-Ziv-Storer-Szymanski) algorithm is a compression scheme which tries to use its knowledge of past ("already seen") data to compress the current data.
To achieve this, the compressor holds a fixed-length buffer in memory which contains the last read bytes. The goal is to find a match for the current data in the buffer in order to store just a buffer reference instead of the actual data. During decompression, the algorithm reconstructs this buffer from the compressed data.<br><br>
+
 
 +
To achieve this, the compressor holds a fixed-length buffer in memory which contains the last read bytes. The goal is to find a match for the current data in the buffer in order to store just a buffer reference instead of the actual data. During decompression, the algorithm reconstructs this buffer from the compressed data.
 +
 
 
This particular algorithm was often used to compress data in older games since decompression is quite easy to achieve and fast to do in memory.
 
This particular algorithm was often used to compress data in older games since decompression is quite easy to achieve and fast to do in memory.
  
Line 11: Line 13:
  
 
* The data starts fairly normal, but seems to become "garbled" later on.
 
* The data starts fairly normal, but seems to become "garbled" later on.
* There are intermittent bytes (especially at the beginning) at 8 or 16 bytes distance which have the values 0xff or 0x00 respectively.
+
* There are intermittent bytes (especially at the beginning) at 8 or 16 bytes distance which have the values<tt> 0xff </tt>or<tt> 0x00 </tt>respectively.
* Large uniform areas in the output file which match to distinct pattern of two bytes intermitted by 0x00 at 16 bytes distance in the compressed file.
+
* Distinct patterns of two bytes intermitted by<tt> 0x00 </tt>at 16 bytes distance in the compressed file which match large uniform areas in the output file.
  
  
 
== Compression and Parameter Description ==
 
== Compression and Parameter Description ==
  
The compression phase of LZSS can be roughly separated into these phases:
+
The compression technique of LZSS can be roughly separated into these phases:
  
 
* Set up the algorithm parameters. These include:
 
* Set up the algorithm parameters. These include:
 
** <tt>N: </tt>Size of history buffer. This will determine the length of a circular buffer that stores the last N read bytes from the input file.
 
** <tt>N: </tt>Size of history buffer. This will determine the length of a circular buffer that stores the last N read bytes from the input file.
** <tt>F: </tt>Size of length window. This will directly influence the maximum length of matches that can be used in the buffer. Usually,<tt> N </tt>and<tt> F </tt>are chosen such that the number of bits needed to describe<tt> N </tt>plus the number of bits used to describe<tt> F </tt>equal<tt> 16</tt>. This way, buffer references can later on be stored as exactly two bytes.
+
** <tt>F: </tt>Size of length window. This will directly influence the maximum length of matches that can be used in the buffer. Usually,<tt> N </tt>and<tt> F </tt>are chosen as powers of<tt> 2 </tt>such that the number of bits needed to describe<tt> N - 1 </tt>plus the number of bits used to describe<tt> F - 1 </tt>equal<tt> 16</tt>. This way, buffer references can later on be stored as exactly two bytes.
** <tt>Threshold: </tt>Minimum compression size. To avoid negative compression results due to very short matches, a minimum size for a match is introduced. As buffer references usually take up two bytes (see above), this value is often set to<tt> 3</tt>. A side effect of this is the modified range of a match length: Suppose<tt> F </tt>has been chosen as<tt> 16</tt>. This would allow a value range of <tt>0..15</tt>. As the values <tt>0..2</tt> will never be used to describe a length, we can instead interpret the range as <tt>3..18</tt>.
+
** <tt>Threshold: </tt>Minimum compression size. To avoid negative compression results due to very short matches, a minimum size for a match is introduced. As buffer references usually take up two bytes (see above), this value is often set to<tt> 3</tt>. A side effect of this is the modified range of a match length: Suppose<tt> F </tt>has been chosen as<tt> 16</tt>, which would allow a value range of <tt>0..15</tt>. As the values <tt>0..2</tt> will never be used to describe a match length, we can instead shift the range by the<tt> Threshold </tt> value and interpret it as <tt>3..18</tt>.
 
** <tt>BufPos: </tt>Initial buffer writing position. This value has to be set to determine where data writing will start in the history buffer. Often chosen as<tt> N - F</tt>.
 
** <tt>BufPos: </tt>Initial buffer writing position. This value has to be set to determine where data writing will start in the history buffer. Often chosen as<tt> N - F</tt>.
 
** <tt>BufVoid: </tt>Initial buffer fill byte. The buffer has to be brought into a defined state before beginning compression and is thus filled with a defined byte. This can actually affect the compression, as e. g. a whitespace (<tt>0x20</tt>) might be used when compressing text files.
 
** <tt>BufVoid: </tt>Initial buffer fill byte. The buffer has to be brought into a defined state before beginning compression and is thus filled with a defined byte. This can actually affect the compression, as e. g. a whitespace (<tt>0x20</tt>) might be used when compressing text files.
Line 31: Line 33:
 
** To record whether a compression or a literal copy was made, set a bit in a special flag byte. This flag byte will be saved before the affected data, thus there will be exactly<tt> 8 </tt>actions (copy or compression) taken until a new flag byte has to be recorded.
 
** To record whether a compression or a literal copy was made, set a bit in a special flag byte. This flag byte will be saved before the affected data, thus there will be exactly<tt> 8 </tt>actions (copy or compression) taken until a new flag byte has to be recorded.
  
 +
== Decompression ==
 +
 +
To decompress data, the above process has to be reversed. Obviously, the parameters mentioned above have to be known to successfully restore the original data. Often enough, a good guess by looking at the compressed data (especially areas with patterns) will work here. As soon as the parameters are known, the decompression process is fairly simple. Assuming that everything is set up correctly, the pseudo code below will illustrate a rough example of how such a routine could work:
  
== Decompression ==
+
<syntaxhighlight lang="pascal">
 +
while InputPos < InputSize do
 +
begin
 +
  FlagByte := Input[InputPos++]
 +
  for i := 1 to 8 do
 +
  begin
 +
    if (FlagByte and 1) = 0 then
 +
    begin
 +
      OfsLen := Input[InputPos] + Input[InputPos + 1] shl 8
 +
      InputPos += 2
 +
      if OfsLen = 0 then
 +
        Exit;
 +
      Length := OfsLen and LengthMask + Threshold
 +
      Offset := (BufPos - (OfsLen shr LengthBits)) and (N - 1)
 +
      { copy Length bytes from Buffer[Offset] onwards while increasing BufPos }
 +
    end
 +
    else
 +
    begin
 +
      { copy 1 byte from Input[InputPos] }
 +
      InputPos += 1
 +
    end
 +
    FlagByte := FlagByte shr 1
 +
  end
 +
end
 +
</syntaxhighlight>
  
To decompress data, the above process has to be reversed. Obviously, the parameters mentioned above have to be known to successfully restore the original data. Often enough, a good guess by looking at the compressed data (especially areas with patterns) will work. As soon as they are known, the decompression process is fairly simple. Assuming that everything is set up correctly, the pseudo code below will illustrate a rough example of how such a routine could work:
 
  
<br>
 
<tt>
 
<b>while</b> InputPos < InputSize <b>do</b><br>
 
<b>begin</b><br>
 
: FlagByte := Input[InputPos++]<br>
 
: <b>for</b> i := 1 <b>to</b> 8 <b>do</b><br>
 
: <b>begin</b><br>
 
::  <b>if</b> FlagByte <b>and</b> 1 <b>then</b><br>
 
::  <b>begin</b><br>
 
:::  OfsLen := Input[InputPos] + Input[InputPos + 1] <b>shl</b> 8<br>
 
:::  InputPos += 2<br>
 
:::  <b>if</b> OfsLen = 0 <b>then</b><br>
 
::::    Exit;<br>
 
:::  Length := OfsLen <b>and</b> LengthMask + Threshold<br>
 
:::  Offset := (BufPos - (OfsLen <b>shr</b> LengthBits)) <b>and</b> (N - 1)<br>
 
:::  ''copy Length bytes from Buffer[Offset] onwards''<br>
 
:::  ''while increasing BufPos''
 
::  <b>end</b><br>
 
::  <b>else</b><br>
 
::  <b>begin</b><br>
 
:::  ''copy 1 byte from Input[InputPos]''<br>
 
:::  InputPos += 1<br>
 
::  <b>end</b><br>
 
::  FlagByte := FlagByte <b>shr</b> 1<br>
 
: <b>end</b><br>
 
<b>end</b><br>
 
</tt>
 
<br>
 
 
Note the following:
 
Note the following:
 
* <tt>LengthBits </tt>describes the number of bits needed to describe the<tt> F </tt>parameter. The above code assumes that the length of the buffer reference is saved in the lower bits of the word. This does not have to be the case.<tt> LengthMask </tt> is derived from this via<br><tt>(1 <b>shl</b> LengthBits) - 1</tt>.
 
* <tt>LengthBits </tt>describes the number of bits needed to describe the<tt> F </tt>parameter. The above code assumes that the length of the buffer reference is saved in the lower bits of the word. This does not have to be the case.<tt> LengthMask </tt> is derived from this via<br><tt>(1 <b>shl</b> LengthBits) - 1</tt>.
 
* It is further assumed that the buffer offset is stored relatively to the current buffer write position. It might as well be an absolute offset, making the subtraction from<tt> BufPos </tt> unnecessary.
 
* It is further assumed that the buffer offset is stored relatively to the current buffer write position. It might as well be an absolute offset, making the subtraction from<tt> BufPos </tt> unnecessary.
* Always think of the fact that the buffer is circular. So do a wrap-around of the offset value via <tt><b> mod </b></tt> or<tt><b> and </b></tt>when you access the buffer.
+
* Always think of the fact that the buffer is circular. So do a wrap-around of the offset value via<tt><b> mod </b></tt> or<tt><b> and </b></tt>when you access the buffer.
 
* Always mirror any writes to the output into the history buffer.
 
* Always mirror any writes to the output into the history buffer.
* In most cases copying the data out of the history buffer will be implemented byte-wise. While you could also copy the whole sequence at once, some implementations you might come across "abuse" cross-over situations between the buffer read and write during lookup and actually expect a byte-wise copy! (This situation might arise if you are supposed to read a certain number of bytes out of the buffer, but will cross the initial write position while doing so.)
+
* In most cases copying the data out of the history buffer will be implemented byte-wise. While you could also copy the whole sequence at once, some implementations you might come across "abuse" cross-over situations between the buffer read and write positions during lookup and actually expect a byte-wise copy! (This situation might arise if you are supposed to read a certain number of bytes out of the buffer, but will cross the initial write position while doing so, thus effectively repeating the pattern you have already read.)
 
 
  
 
== Further Information ==
 
== Further Information ==
Line 76: Line 75:
 
More extensive information on this algorithm can easily be found on the web. Starting points might be  [http://michael.dipperstein.com/lzss this page] featuring extensive implementation thoughts or [http://oldwww.rasip.fer.hr/research/compress/algorithms/fund/lz/lzss.html this page] showing a graphical explanation of the algorithm.
 
More extensive information on this algorithm can easily be found on the web. Starting points might be  [http://michael.dipperstein.com/lzss this page] featuring extensive implementation thoughts or [http://oldwww.rasip.fer.hr/research/compress/algorithms/fund/lz/lzss.html this page] showing a graphical explanation of the algorithm.
  
==Games that use this==
+
 
 +
== Games ==
 +
 
 +
* Star Trek: 25th Anniversary
 +
* The 7th Guest
  
  
* [[The 7th Guest]]
+
<br><br>
 +
[[Category:Articles]]

Latest revision as of 00:14, 25 July 2021

General Description

The LZSS (Lempel-Ziv-Storer-Szymanski) algorithm is a compression scheme which tries to use its knowledge of past ("already seen") data to compress the current data.

To achieve this, the compressor holds a fixed-length buffer in memory which contains the last read bytes. The goal is to find a match for the current data in the buffer in order to store just a buffer reference instead of the actual data. During decompression, the algorithm reconstructs this buffer from the compressed data.

This particular algorithm was often used to compress data in older games since decompression is quite easy to achieve and fast to do in memory.


How to Recognize an LZSS-compressed File

Frequently, files compressed by LZSS variants can be recognized by certain "give-aways", especially when the format of the desired output file is known:

  • The data starts fairly normal, but seems to become "garbled" later on.
  • There are intermittent bytes (especially at the beginning) at 8 or 16 bytes distance which have the values 0xff or 0x00 respectively.
  • Distinct patterns of two bytes intermitted by 0x00 at 16 bytes distance in the compressed file which match large uniform areas in the output file.


Compression and Parameter Description

The compression technique of LZSS can be roughly separated into these phases:

  • Set up the algorithm parameters. These include:
    • N: Size of history buffer. This will determine the length of a circular buffer that stores the last N read bytes from the input file.
    • F: Size of length window. This will directly influence the maximum length of matches that can be used in the buffer. Usually, N and F are chosen as powers of 2 such that the number of bits needed to describe N - 1 plus the number of bits used to describe F - 1 equal 16. This way, buffer references can later on be stored as exactly two bytes.
    • Threshold: Minimum compression size. To avoid negative compression results due to very short matches, a minimum size for a match is introduced. As buffer references usually take up two bytes (see above), this value is often set to 3. A side effect of this is the modified range of a match length: Suppose F has been chosen as 16, which would allow a value range of 0..15. As the values 0..2 will never be used to describe a match length, we can instead shift the range by the Threshold value and interpret it as 3..18.
    • BufPos: Initial buffer writing position. This value has to be set to determine where data writing will start in the history buffer. Often chosen as N - F.
    • BufVoid: Initial buffer fill byte. The buffer has to be brought into a defined state before beginning compression and is thus filled with a defined byte. This can actually affect the compression, as e. g. a whitespace (0x20) might be used when compressing text files.
  • Read the input file. During this:
    • Try to find the longest possible sequence of upcoming bytes that can also be found in the buffer.
      • If this sequence has a length of at least Threshold, save the buffer offset and the sequence length into the output file. Record the literal bytes into the buffer.
      • If such a match cannot be found, copy over one literal byte into the output and the buffer.
    • To record whether a compression or a literal copy was made, set a bit in a special flag byte. This flag byte will be saved before the affected data, thus there will be exactly 8 actions (copy or compression) taken until a new flag byte has to be recorded.

Decompression

To decompress data, the above process has to be reversed. Obviously, the parameters mentioned above have to be known to successfully restore the original data. Often enough, a good guess by looking at the compressed data (especially areas with patterns) will work here. As soon as the parameters are known, the decompression process is fairly simple. Assuming that everything is set up correctly, the pseudo code below will illustrate a rough example of how such a routine could work:

while InputPos < InputSize do
begin
  FlagByte := Input[InputPos++]
  for i := 1 to 8 do
  begin
    if (FlagByte and 1) = 0 then
    begin
      OfsLen := Input[InputPos] + Input[InputPos + 1] shl 8
      InputPos += 2
      if OfsLen = 0 then
        Exit;
      Length := OfsLen and LengthMask + Threshold
      Offset := (BufPos - (OfsLen shr LengthBits)) and (N - 1)
      { copy Length bytes from Buffer[Offset] onwards while increasing BufPos }
    end
    else
    begin
      { copy 1 byte from Input[InputPos] }
      InputPos += 1
    end
    FlagByte := FlagByte shr 1
  end
end


Note the following:

  • LengthBits describes the number of bits needed to describe the F parameter. The above code assumes that the length of the buffer reference is saved in the lower bits of the word. This does not have to be the case. LengthMask is derived from this via
    (1 shl LengthBits) - 1.
  • It is further assumed that the buffer offset is stored relatively to the current buffer write position. It might as well be an absolute offset, making the subtraction from BufPos unnecessary.
  • Always think of the fact that the buffer is circular. So do a wrap-around of the offset value via mod or and when you access the buffer.
  • Always mirror any writes to the output into the history buffer.
  • In most cases copying the data out of the history buffer will be implemented byte-wise. While you could also copy the whole sequence at once, some implementations you might come across "abuse" cross-over situations between the buffer read and write positions during lookup and actually expect a byte-wise copy! (This situation might arise if you are supposed to read a certain number of bytes out of the buffer, but will cross the initial write position while doing so, thus effectively repeating the pattern you have already read.)

Further Information

More extensive information on this algorithm can easily be found on the web. Starting points might be this page featuring extensive implementation thoughts or this page showing a graphical explanation of the algorithm.


Games

  • Star Trek: 25th Anniversary
  • The 7th Guest