Compressing – Decompressing web GZIP stream


Compressing and decompressing data is one of the essential tasks in web-oriented programming. Qt can handle compressed data with the qCompress/qUncompress functions, but they don’t help much here: they work on zlib-format data prefixed with a 4-byte length, not on gzip streams. In my experience it’s practically impossible to decompress a gzip stream with them (although I may be wrong on this one, and maybe there is a way; never say never).
The alternative is to use the zlib library directly. There is also the QuaZIP wrapper, but I personally prefer to keep things as simple as possible, and in my opinion zlib is easier. This article describes the GZIP structure and how to decompress a GZIP web response using zlib.

A brief introduction:

GZIP compresses a stream of data and is widely used on the web to compress page content and save bandwidth. Nearly all web servers today can serve content with gzip encoding. The underlying DEFLATE algorithm is used for other tasks too, for example in PNG images and in ordinary file compression. GZIP is lossless: after decompression the stream is identical to the stream before compression (no data is lost).

I want to note here that GZIP compresses a single stream only and is not an archive format. The output is a compressed GZIP stream with NO information about the file structure of the compressed content. A web reply needs no file structure, because only the payload is compressed.

Content structure:

As said before, there is no particular file structure, but the stream itself carries some information about the compressed data. A header and a trailer describe the basic compression parameters and the content.

Header:
Here is a brief description of the header fields; some details are omitted:

First byte : ID1 = 0x1F
Second byte: ID2 = 0x8B
(These are “magic numbers” that describe / identify GZIP compression)

CM – compression method:

0x00-0x07 reserved
0x08 DEFLATE (gzip always uses DEFLATE as its compression method; this is independent of the HTTP request header “Accept-Encoding: gzip,deflate”, where “deflate” names a separate content coding)

FLAGS 1 byte (no bit is mandatory; a set bit means the matching optional field is present):

bit 0: FTEXT – if set, the content is probably ASCII text
bit 1: FHCRC – CRC16 of the header is present
bit 2: FEXTRA – extra field present
bit 3: FNAME – original file name present, terminated by a zero byte
bit 4: FCOMMENT – comment present, terminated by a zero byte
bit 5: reserved – must be zero
bit 6: reserved – must be zero
bit 7: reserved – must be zero

MTIME (4 bytes): most recent modification time of the compressed file, as Unix time – 0 means no MTIME present

XFL: extra flags:

2 – max compression
4 – fast compression

OS – operating system:

0 – FAT filesystem (MS-DOS, OS/2, NT/Win32)
1 – Amiga
2 – VMS (or OpenVMS)
3 – Unix
4 – VM/CMS
5 – Atari TOS
6 – HPFS filesystem (OS/2, NT)
7 – Macintosh
8 – Z-System
9 – CP/M
10 – TOPS-20
11 – NTFS filesystem (NT)
12 – QDOS
13 – Acorn RISCOS
255 – unknown

XLEN (2 bytes): if FEXTRA is set, gives the length of the extra field
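
The header layout above can be sketched as a small parser. This is only an illustration, not part of the original class: the names GzipHeader and parseGzipHeader are hypothetical, and the order in which the optional fields are skipped (FEXTRA, FNAME, FCOMMENT, FHCRC) follows RFC 1952.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper type holding the fixed gzip header fields.
struct GzipHeader {
    bool          valid;       // magic bytes and CM matched, header fits
    std::uint8_t  flags;       // FLAGS byte
    std::uint32_t mtime;       // MTIME, stored little-endian in the stream
    std::uint8_t  xfl;         // XFL
    std::uint8_t  os;          // OS
    std::size_t   dataOffset;  // index of the first compressed byte
};

// Parses the 10 fixed bytes, then skips the optional fields selected by FLAGS.
GzipHeader parseGzipHeader(const std::vector<std::uint8_t> &d)
{
    GzipHeader h = {false, 0, 0, 0, 0, 0};
    if (d.size() < 10 || d[0] != 0x1F || d[1] != 0x8B || d[2] != 0x08)
        return h;

    h.flags = d[3];
    h.mtime = std::uint32_t(d[4]) | (std::uint32_t(d[5]) << 8) |
              (std::uint32_t(d[6]) << 16) | (std::uint32_t(d[7]) << 24);
    h.xfl = d[8];
    h.os  = d[9];

    std::size_t pos = 10;
    if (h.flags & 0x04) {                        // FEXTRA: 2-byte XLEN + payload
        if (pos + 2 > d.size()) return h;
        std::size_t xlen = d[pos] | (std::size_t(d[pos + 1]) << 8);
        pos += 2 + xlen;
    }
    if (h.flags & 0x08) {                        // FNAME: zero-terminated string
        while (pos < d.size() && d[pos] != 0) ++pos;
        ++pos;
    }
    if (h.flags & 0x10) {                        // FCOMMENT: zero-terminated string
        while (pos < d.size() && d[pos] != 0) ++pos;
        ++pos;
    }
    if (h.flags & 0x02)                          // FHCRC: 2-byte header CRC
        pos += 2;

    if (pos > d.size()) return h;                // header runs past the input
    h.valid = true;
    h.dataOffset = pos;
    return h;
}
```

When no flag is set, dataOffset is simply 10, which is the case a typical web response falls into.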

Trailer:

CRC32: Cyclic Redundancy Check of the uncompressed data – 4 bytes
ISIZE: size of the original (uncompressed) data, modulo 2^32 – 4 bytes. For inputs of 4 GB or more the field wraps around and no longer holds the real size, although zlib itself handles such streams
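
To show what the CRC32 field protects, here is a minimal sketch of the CRC-32 variant gzip uses (reflected polynomial 0xEDB88320), computed bit by bit over the uncompressed data. The function name crc32Bitwise is my own; in real code zlib's crc32() function does the same job much faster.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Bitwise CRC-32 with the reflected polynomial 0xEDB88320.
// Slow but short; meant only to illustrate the check value in the trailer.
std::uint32_t crc32Bitwise(const unsigned char *data, std::size_t len)
{
    std::uint32_t crc = 0xFFFFFFFFu;                 // initial value
    for (std::size_t i = 0; i < len; ++i) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; ++bit) {
            // If the lowest bit is set, shift and XOR with the polynomial.
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
        }
    }
    return crc ^ 0xFFFFFFFFu;                        // final inversion
}
```

After decompression, the value computed over the output should match the CRC32 stored in the trailer; a mismatch means the data was corrupted.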

Here is an example gzip-compressed response to a GET request:

0000   1f 8b 08 00 00 00 00 00 00 03 c5 5a db 52 db 48 …. rest of the compressed data
0ae0   57 f0 a4 ae e9 d5 88 ca f1 9d cd ff 02 1e 6d 98
0af0   d7 a8 23 00 00

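Legend:

Magic number: 1f 8b
Compression method: 08
FLAGS: 00 (only the first 5 bits are defined, and none of them is set here)
MTIME: 00 00 00 00
XFL: 00
Operating system: 03
Compressed data: c5 5a db 52 db 48 …. f1 9d cd ff 02
CRC32: 1e 6d 98 d7
ISIZE: a8 23 00 00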

Note that in this reply the uncompressed size was 9128 bytes = hex 23A8, but as you can see in the example above, ISIZE is stored as A8 23! That’s because multi-byte fields in GZIP are stored in little-endian order (least significant byte first).
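
The byte order of the trailer can be checked with a few lines of code. This is a sketch under my own naming (readLE32, GzipTrailer and readGzipTrailer are hypothetical helpers); it assumes the buffer holds a complete gzip stream, so the last 8 bytes are CRC32 followed by ISIZE.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Reads a 32-bit little-endian value starting at p.
std::uint32_t readLE32(const unsigned char *p)
{
    return std::uint32_t(p[0]) | (std::uint32_t(p[1]) << 8) |
           (std::uint32_t(p[2]) << 16) | (std::uint32_t(p[3]) << 24);
}

// Hypothetical helper type for the two trailer fields.
struct GzipTrailer {
    std::uint32_t crc32;  // CRC-32 of the uncompressed data
    std::uint32_t isize;  // uncompressed size modulo 2^32
};

// Extracts the trailer from the last 8 bytes of a complete gzip stream.
GzipTrailer readGzipTrailer(const unsigned char *data, std::size_t len)
{
    assert(len >= 8);
    GzipTrailer t;
    t.crc32 = readLE32(data + len - 8);
    t.isize = readLE32(data + len - 4);
    return t;
}
```

Fed with the tail of the example response, the bytes a8 23 00 00 decode to 0x000023A8, which is the 9128 bytes mentioned above.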

Implementation:

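#include <QByteArray>
#include <QDebug>
#include <zlib.h>

QByteArray gzipHttpDec::gzipDecompress( QByteArray compressData )
{
    // Decompress GZIP data by stripping the header and trailer and
    // inflating the raw DEFLATE stream in between.
    // Assumes the plain 10-byte header (FLAGS == 0, no optional fields).
    if( compressData.size() < 18 ) {
        qDebug() << "input too short for a GZIP stream";
        return QByteArray();
    }
    compressData.remove(0, 10);   // ID1, ID2, CM, FLAGS, MTIME (4), XFL, OS
    compressData.chop(8);         // CRC32 (4 bytes) + ISIZE (4 bytes)

    const int buffersize = 16384;
    quint8 buffer[buffersize];

    z_stream cmpr_stream;
    // The allocator fields must be set before inflateInit2 is called
    cmpr_stream.zalloc = Z_NULL;
    cmpr_stream.zfree = Z_NULL;
    cmpr_stream.opaque = Z_NULL;
    cmpr_stream.next_in = (unsigned char *)compressData.data();
    cmpr_stream.avail_in = compressData.size();

    // Negative windowBits selects raw DEFLATE (no zlib/gzip wrapper);
    // MAX_WBITS covers the largest window a compressor may have used
    if( inflateInit2(&cmpr_stream, -MAX_WBITS ) != Z_OK) {
        qDebug() << "cmpr_stream error!";
        return QByteArray();
    }

    QByteArray uncompressed;
    int status;
    do {
        // Hand zlib a fresh output buffer on every pass
        cmpr_stream.next_out = buffer;
        cmpr_stream.avail_out = buffersize;

        status = inflate( &cmpr_stream, Z_SYNC_FLUSH );

        if(status != Z_OK && status != Z_STREAM_END) {
            // Corrupted or truncated input: clean up once and give up
            qDebug() << "inflate error:" << status;
            inflateEnd(&cmpr_stream);
            return QByteArray();
        }

        uncompressed.append((const char *)buffer,
                            int(buffersize - cmpr_stream.avail_out));
    } while(status != Z_STREAM_END);

    inflateEnd(&cmpr_stream);
    return uncompressed;
}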

Usage:

gzipHttpDec decompressor;
QByteArray uncompressedData = decompressor.gzipDecompress( compressedByteArray );

Of course this is a part of my decompression class; you can simply copy the content of the gzipDecompress function and do whatever you want with it.

Reference:
[1] RFC 1952
[2] ZLIB – http://www.zlib.net/
[3] QUAZIP – http://quazip.sourceforge.net/
