B. Compressed File Format

This chapter describes the compressed file format used by ebzip.

B.1 Overview about Compression File Format  
B.2 Data Part  
B.3 Index Part  
B.4 Header Part  


B.1 Overview about Compression File Format

The compressed file format has the following features.

A compressed file consists of a header part, an index part, and a data part, placed in that order.

 
+--------+-------------+-----------------------------+
| header |    index    |            data             |
+--------+-------------+-----------------------------+
                                                     EOF


B.2 Data Part

An original file is compressed by the following steps.

First, ebzip splits the original file into slices. Every slice except the last one (slice N in the following picture) has the same size.

 
+---------------+---------------+--   --+----------+
|    slice 1    |    slice 2    |  ...  | slice N  |
+---------------+---------------+--   --+----------+
                                                   EOF

The slice size is determined by the compression level (see section 4.3 Compression Level):

compression level    slice size
        0             2048 bytes
        1             4096 bytes
        2             8192 bytes
        3            16384 bytes
        4            32768 bytes
        5            65536 bytes
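
The table doubles the slice size at each level, starting from 2048 bytes. As a minimal sketch (the function name slice_size is hypothetical and not part of ebzip), the mapping can be written as:

```python
def slice_size(level: int) -> int:
    """Return the slice size in bytes for a compression level (0 to 5)."""
    if not 0 <= level <= 5:
        raise ValueError("compression level must be between 0 and 5")
    # 2048 bytes at level 0, doubled for each higher level.
    return 2048 << level
```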

Second, if the last slice is shorter than the slice size, ebzip extends it to the slice size by padding it with 0x00 bytes.

 
                                                    pad 
+---------------+---------------+--   --+---------+-----+
|    slice 1    |    slice 2    |  ...  |    slice N    |
+---------------+---------------+--   --+---------+-----+
                                                        EOF
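
The slicing and padding steps can be sketched as follows. This is an illustration only, not the ebzip source; the function name split_into_slices is hypothetical, and the whole file is assumed to fit in memory.

```python
def split_into_slices(data: bytes, size: int) -> list:
    """Split data into slices of `size` bytes, padding the last one with 0x00."""
    if size <= 0:
        raise ValueError("slice size must be positive")
    slices = [data[i:i + size] for i in range(0, len(data), size)]
    # Only the last slice can be short; extend it to the full slice size.
    if slices and len(slices[-1]) < size:
        slices[-1] = slices[-1].ljust(size, b"\x00")
    return slices
```

An empty input yields an empty list, which matches the case of an empty original file having no slices at all.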

Finally, ebzip compresses each slice into the DEFLATE compressed data format described in RFC 1951. Each slice is compressed independently of the others, and compressed slices usually differ in size. If the number of bits in a compressed slice is not a multiple of 8, 1 to 7 padding bits are appended at the tail of the compressed slice so that its length becomes a multiple of 8 bits. Thus, each compressed slice starts at a byte boundary. The contents of the padding bits are undefined, and the padding bits are never used.

 
+------------+----------+--   --+--------------+
| compressed |compressed|  ...  |  compressed  |
|   slice 1  | slice 2  |  ...  |   slice N    |
+------------+----------+--   --+--------------+

These compressed slices form the data part of the compressed file format.

The padding in the last slice is compressed as a part of the slice. When ebunzip recovers the last slice, it uncompresses the slice and then removes the padding.

When a compressed slice is larger than or equal to the slice size, ebzip discards the compressed data of the slice. In this case, ebzip records the original data as the compressed data for that slice instead.

If an original file is empty, the data part does not appear in the compressed file.
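
The fallback rule above can be sketched with Python's zlib module. This is a sketch under assumptions, not the ebzip implementation: it assumes a raw DEFLATE stream with no wrapper (requested with wbits=-15), and the zlib level 9 and the function names are arbitrary choices for illustration.

```python
import zlib


def compress_slice(slice_data: bytes) -> bytes:
    """Compress one padded slice, or return it unchanged if that is not smaller."""
    # Assumption: raw DEFLATE (RFC 1951) without a zlib or gzip wrapper.
    compressor = zlib.compressobj(9, zlib.DEFLATED, -15)
    packed = compressor.compress(slice_data) + compressor.flush()
    # A result that is not strictly smaller is discarded; the original is stored.
    if len(packed) >= len(slice_data):
        return slice_data
    return packed


def restore_slice(stored: bytes, size: int) -> bytes:
    """Recover a slice; data whose length equals the slice size was stored as is."""
    if len(stored) == size:
        return stored
    return zlib.decompress(stored, -15)
```

Because a stored slice always has exactly the slice size and a compressed slice is always strictly smaller, the length alone tells a reader which case applies.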


B.3 Index Part

During compression, ebzip records an index for each compressed slice. An index is the distance, in bytes, from the beginning of the compressed file to the beginning of a compressed slice.

 
+---------+---------+--           --+---------+---------+
| index 1 | index 2 |  ...........  | index N |index END|
+---------+---------+--           --+---------+---------+
     |         |                        |         |
 +---+         +----+                   +------+  +-----------+
 V                  V                          V              V
+------------------+------------------+--   --+--------------+
|    compressed    |    compressed    |  ...  |  compressed  |
|      slice 1     |     slice 2      |  ...  |   slice N    |
+------------------+------------------+--   --+--------------+

Each index takes from 2 to 4 bytes, depending on the size of the original file:

original file size               index size
       0 ...      65535 bytes    2 bytes
   65536 ...   16777215 bytes    3 bytes
16777216 ... 4294967295 bytes    4 bytes
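
A minimal sketch of this table follows. The function name index_width is hypothetical, and a file of exactly 65535 bytes is treated as belonging to the 2-byte row.

```python
def index_width(original_size: int) -> int:
    """Return the number of bytes per index for a given original file size."""
    if not 0 <= original_size <= 0xFFFFFFFF:
        raise ValueError("original file size is outside the range of the table")
    if original_size <= 0xFFFF:
        return 2
    if original_size <= 0xFFFFFF:
        return 3
    return 4
```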

All multi-byte numbers in the indexes are stored with the most significant byte first. For example, 0x1234 is stored as follows: the first byte holds 0x12, and the second byte holds 0x34.

 
+---------+---------+
|0001 0010|0011 0100|
+---------+---------+
  (0x12)    (0x34)
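
In Python, this byte order corresponds to the "big" argument of int.to_bytes and int.from_bytes. The helper names below are hypothetical and serve only as an illustration.

```python
def encode_index(value: int, width: int) -> bytes:
    """Encode an index as `width` bytes, most significant byte first."""
    return value.to_bytes(width, "big")


def decode_index(raw: bytes) -> int:
    """Decode an index stored most significant byte first."""
    return int.from_bytes(raw, "big")
```

With these helpers, encode_index(0x1234, 2) produces the two bytes 0x12 and 0x34 shown in the picture.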

The index part begins with the index for compressed slice 1, followed by the index for compressed slice 2, and so on. The index for compressed slice N is followed by the END index, which points to the byte immediately after the end of compressed slice N. The END index therefore also represents the size of the compressed file.

 
+---------+---------+--       --+---------+---------+
| index 1 | index 2 |  .......  | index N |index END|
+---------+---------+--       --+---------+---------+

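The size of a compressed slice is the difference between its index and the next index. If this size is equal to the slice size, the data of the slice is stored as is and is not actually compressed.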

If an original file is empty, the index part has only one index. The index represents the size of the compressed file.
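
The following sketch reads an index part and derives, for each slice, its offset, its stored length, and whether it was stored without compression. The function names are hypothetical, and the index part is assumed to have been read into memory already.

```python
def read_indexes(index_part: bytes, width: int) -> list:
    """Decode the index part into a list of byte offsets (big-endian)."""
    if width not in (2, 3, 4) or len(index_part) % width != 0:
        raise ValueError("index part does not match the index size")
    return [int.from_bytes(index_part[i:i + width], "big")
            for i in range(0, len(index_part), width)]


def slice_spans(indexes: list, size: int) -> list:
    """Return (offset, length, stored_uncompressed) for each compressed slice."""
    spans = []
    # Each slice runs from its own index up to the next one; the last
    # slice ends at the END index.
    for start, end in zip(indexes, indexes[1:]):
        length = end - start
        spans.append((start, length, length == size))
    return spans
```

For an empty original file the list holds only the END index, so slice_spans returns an empty list.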


B.4 Header Part

The header part occupies 22 bytes. It consists of the following fields.

 
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
|   magic ID   |*1| *2  |    file size    | Adler-32  |   mtime   |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
 0  1  2  3  4  5  6  7  8  9  10 11 12 13 14 15 16 17 18 19 20 21

*1: zip mode and compression level
*2: reserved area

magic ID (5 bytes)
It has the fixed value 0x45, 0x42, 0x5a, 0x69, 0x70 (`EBZip' in ASCII).

zip mode (the 4 most significant bits)
This identifies the compression mode. In EB Library version 4.0, 1 (0001 in binary) is the only value that can be specified.

compression level (the 4 least significant bits)
This identifies the compression level.

reserved area (2 bytes)
Reserved but not used. It is filled with 0x0000.

file size (6 bytes)
This contains the size of the original (uncompressed) file.

Adler-32 (4 bytes)
This is a checksum of the uncompressed data, computed according to the Adler-32 algorithm described in RFC 1950.

mtime (4 bytes)
This gives the most recent modification time of the original file. This value is the time in seconds since 00:00:00 GMT, Jan. 1, 1970.

Both zip mode and compression level are packed into the byte at offset 5 of the header (the sixth byte). Zip mode occupies the four most significant bits, and compression level occupies the four least significant bits. If zip mode is 1 and compression level is 2, this byte is 0x12.

 
 MSB                         LSB
+---+---+---+---+---+---+---+---+
| 0   0   0   1   0   0   1   0 | = 0x12
+---+---+---+---+---+---+---+---+
   (zip mode)   | (compression level)
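
The packing shown in the picture amounts to a shift and a mask. The helper names below are hypothetical.

```python
def pack_mode_level(zip_mode: int, level: int) -> int:
    """Pack zip mode (high 4 bits) and compression level (low 4 bits) into one byte."""
    if not 0 <= zip_mode <= 15 or not 0 <= level <= 15:
        raise ValueError("each value must fit in 4 bits")
    return (zip_mode << 4) | level


def unpack_mode_level(byte: int) -> tuple:
    """Split the packed byte back into (zip_mode, level)."""
    return byte >> 4, byte & 0x0F
```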

All multi-byte numbers in the header are stored with the most significant byte first.
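
As an illustration of the layout above, the following sketch decodes the 22 header bytes by their offsets. It is not taken from the EB Library source; the names Header and parse_header are hypothetical, and only the magic ID is validated.

```python
from typing import NamedTuple

HEADER_SIZE = 22
MAGIC_ID = b"EBZip"


class Header(NamedTuple):
    zip_mode: int
    level: int
    file_size: int
    adler32: int
    mtime: int


def parse_header(raw: bytes) -> Header:
    """Decode the header part; all multi-byte fields are big-endian."""
    if len(raw) < HEADER_SIZE:
        raise ValueError("header part is shorter than 22 bytes")
    if raw[0:5] != MAGIC_ID:
        raise ValueError("magic ID does not match")
    mode_level = raw[5]  # offsets 6-7 are the reserved area and are skipped
    return Header(
        zip_mode=mode_level >> 4,
        level=mode_level & 0x0F,
        file_size=int.from_bytes(raw[8:14], "big"),
        adler32=int.from_bytes(raw[14:18], "big"),
        mtime=int.from_bytes(raw[18:22], "big"),
    )
```

A reader would typically call parse_header on the first 22 bytes of the file and then use file_size and level to work out the index size and the slice size.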


This document was generated by Motoyuki Kasahara on December, 28 2003 using texi2html