As I continue to work on writing a simple HTTP server, one thing I often end up thinking about is how to write a server which uses minimal resources. Keeping the code simple and reducing memory allocations are ways to lower the CPU and memory usage, but there is also network bandwith to consider as well. Network bandwith can also be managed in a few ways, with compressing the content before sending it over the network being one of the primary methods. From some research, as well as general observation of web servers I work with, the GZIP file format seems to be the most common way of accomplishing this.
Usually I work on a Windows system, where compressed files typically use the ZIP file format, so I'm actually not too familiar with how GZIP works, or how exactly it differs from the ZIP files that I'm more accustomed to. Therefore, in order to both satisfy my own curiousity, as well as to better understand the systems I use daily, I figured that I'd try implementing the GZIP file format in order to learn how it works.
Thankfully GZIP is very well documented in two Request for Comments (RFC) documents. RFC 1952 - GZIP File Format Specification specifies the header and trailer format used by GZIP files, as well as some sample code for calculating a Cyclic Redundancy Check (CRC) which can be used to verify that the data hasn't been corrupted. GZIP files themselves wrap DEFLATE files, which perform the actual compression, and are instead documented in RFC 1951 - DEFLATE Compressed Data Format Specification version 1.3.
Implementing GZIP is quite simple, and only requires writing a few header/trailer bytes:
// Writes a GZIP header, as defined by pages 4-7 of RFC 1952.
void gzip_header(FILE *out) {
putc(0x1f, out); // ID1
putc(0x8b, out); // ID2
putc(0x08, out); // CM = deflate
putc(0x00, out); // FLG = no flags set
putc(0x00, out); // MTIME (0) = not set
putc(0x00, out); // MTIME (1) = not set
putc(0x00, out); // MTIME (2) = not set
putc(0x00, out); // MTIME (3) = not set
putc(0x00, out); // XFL = no extra flags set
putc(0xff, out); // OS = unknown
}
// Writes a GZIP trailer, as defined by pages 4-7 of RFC 1952.
void gzip_trailer(FILE *out, size_t isize, uint32_t crc32) {
// CRC32 = 32 bit cyclic redundancy check defined by CRC-32-IEEE
putc((uint8_t)(crc32 >> 0), out);
putc((uint8_t)(crc32 >> 8), out);
putc((uint8_t)(crc32 >> 16), out);
putc((uint8_t)(crc32 >> 24), out);
// ISIZE = size of the original input, modulo 2^32
putc((uint8_t)(isize >> 0), out);
putc((uint8_t)(isize >> 8), out);
putc((uint8_t)(isize >> 16), out);
putc((uint8_t)(isize >> 24), out);
}
As well as calculating the Cyclic Redundancy Check. This is more complex, but the code for it is provided in RFC 1502. I've taken that code, and just adapted it to use different variable names and the fixed-width integer types introduced in C99.
static uint32_t crc_table[256];
// Initialize a lookup table for CRC-32-IEEE calculations.
//
// Code taken from pages 10-11 of RFC 1952.
void make_crc_table(void) {
for (int i = 0; i < 256; i++) {
uint32_t crc = i;
for (int j = 0; j < 8; j++) {
if (crc & 1) {
crc = UINT32_C(0xedb88320) ^ (crc >> 1);
} else {
crc = crc >> 1;
}
}
crc_table[i] = crc;
}
}
// Update a CRC to check additional data. Pass a crc value of 0 if this is the
// first block of data being checked.
//
// Code taken from pages 10-11 of RFC 1952.
uint32_t update_crc(uint32_t crc, const uint8_t *buffer, size_t len) {
crc = ~crc;
for (size_t i = 0; i < len; i++) {
crc = crc_table[((uint8_t)crc) ^ buffer[i]] ^ (crc >> 8);
}
return ~crc;
}
DEFLATE, on the other hand, is more complicated and itself involves two more algorithms. The first is LZ77, which identifies repeated sections of data and de-duplicates these. The second is Huffman coding, which involves performing a frequency analysis of how often each byte occurs, and representing the most common bytes using shorter sequences of bits. DEFLATE does also allow for uncompressed blocks of data however, and so in the interest of creating a minimal implementation I've only implemented that part of the algorithm for now. Implementing DEFLATE using only uncompressed blocks of data requires splitting the data into blocks of up to 65,535 (aka 216 − 1), and prefixing these blocks with a few bytes that indicate that the data is uncompressed and how much data there is.
// Reads input, converts the input to DEFLATE format as defined by pages 8-10 of
// RFC 1951, and writes the output. Only implements uncompressed blocks (those
// with a BTYPE value of 00).
//
// Returns ISIZE and CRC32 for the corresponding GZIP fields.
void deflate(FILE *in, FILE *out, size_t *isize, uint32_t *crc32) {
uint8_t buffer[UINT16_MAX]; // An uncompressed block can be at most 2^16 - 1
*isize = 0;
*crc32 = 0;
while (1) {
size_t len = fread(buffer, 1, sizeof(buffer), in);
if (ferror(in)) {
fprintf(stderr, "Error reading from stream: %d\n", ferror(in));
exit(EXIT_FAILURE);
} else if (feof(in)) {
putc(0x01, out); // BFINAL = true, BTYPE = no compression
} else {
putc(0x00, out); // BFINAL = false, BTYPE = no compression
}
// LEN = number of bytes read
putc((uint8_t)(len >> 0), out);
putc((uint8_t)(len >> 8), out);
// NLEN = one's complement of LEN
putc((uint8_t)(~len >> 0), out);
putc((uint8_t)(~len >> 8), out);
fwrite(buffer, 1, len, out);
if (ferror(out)) {
fprintf(stderr, "Error writing to stream: %d\n", ferror(out));
exit(EXIT_FAILURE);
}
*isize += len;
*crc32 = update_crc(*crc32, buffer, len);
if (feof(in)) {
break;
}
}
}
Finally, you just put all of these pieces together and you have a minimal implementation of the GZIP algorithm:
// Reads input, converts the input to GZIP format as defined by pages 4-7 of RFC
// 1952, and writes the output.
void gzip(FILE *in, FILE *out) {
size_t isize;
uint32_t crc32;
if (!crc_table_computed) {
make_crc_table();
crc_table_computed = 1;
}
gzip_header(out);
deflate(in, out, &isize, &crc32);
gzip_trailer(out, isize, crc32);
}
In order to test that the implementation works, I created a random stream of bytes, passed it into the custom gzip function I wrote, passed the output from that into the official gunzip program, and then compared the result to the original random stream of bytes to ensure they were equivalent.
int main(void) {
// Generate original random byte stream
uint8_t input_buffer[UINT16_MAX + 256] = {0};
for (size_t i = 0; i < sizeof(input_buffer); i++) {
input_buffer[i] = rand();
}
int input_pipe[2];
int gzipped_pipe[2];
int output_pipe[2];
assert(_pipe(input_pipe, sizeof(input_buffer) + 256, _O_BINARY) != -1);
assert(_pipe(gzipped_pipe, sizeof(input_buffer) + 256, _O_BINARY) != -1);
assert(_pipe(output_pipe, sizeof(input_buffer) + 256, _O_BINARY) != -1);
assert(write(input_pipe[WRITE_PIPE], input_buffer, sizeof(input_buffer)) ==
sizeof(input_buffer));
assert(close(input_pipe[WRITE_PIPE]) != -1);
FILE *input_stream = fdopen(input_pipe[READ_PIPE], "rb");
assert(input_stream != NULL);
FILE *gzipped_stream = fdopen(gzipped_pipe[WRITE_PIPE], "wb");
assert(gzipped_stream != NULL);
// Apply GZIP to original randomly-generated byte stream
gzip(input_stream, gzipped_stream);
assert(fclose(gzipped_stream) != EOF);
// Run external gunzip program to decompress the gzipped byte stream
char command_buffer[256];
int len = snprintf(command_buffer, sizeof(command_buffer),
"gzip --decompress --stdout <&%d >&%d",
gzipped_pipe[READ_PIPE], output_pipe[WRITE_PIPE]);
assert(len > 0 && (size_t)len < sizeof(command_buffer));
assert(system(command_buffer) == EXIT_SUCCESS);
uint8_t output_buffer[sizeof(input_buffer) + 256] = {0};
len = read(output_pipe[READ_PIPE], output_buffer, sizeof(output_buffer));
// Assert that the decompressed byte stream is identical to the original
// randomly-generated byte stream
int got = len, want = sizeof(input_buffer);
if (got != want) {
fprintf(stderr, "Output length = %d, want %d", got, want);
exit(EXIT_FAILURE);
}
for (int i = 0; i < len; i++) {
uint8_t got = input_buffer[i], want = output_buffer[i];
if (got != want) {
fprintf(stderr, "Output[%d] = %x, want %x", i, got, want);
exit(EXIT_FAILURE);
}
}
return EXIT_SUCCESS;
}
In the end, I feel like I learned a decent bit just from reading the RFCs.
They're a little dense, but provide all the technical details necessary.
Having an official reference implementation to check the work is also very
helpful, and did uncover a few issues. In particular, stdin/stdout on
Windows systems will default to being text streams, which means that they
convert \n
(the line feed character) into \r\n
(the carriage return & line feed characters), which resulted in the
gzipped output being corrupted
(i.e. not matching the CRC check).
After a bit of research I found the solution and was able to set the streams
to binary mode when the code runs on Windows.
Next I'd like to implement the actual compression part of the DEFLATE algorithm, e.g. LZ77 and Huffman coding, but that will have to be done in another post. For now, full code for the current GZIP implementation can be found below: