How to use compression algorithms in Rust

Pure Rust implementations of many compression algorithms exist and are waiting for you to try them out. This article describes what’s available, when to choose one algorithm over another, and how to use them.

Creating a uniform interface

Each library gets to decide how it works. That makes things relatively complicated when you try to compare one against another.

If you don’t have special requirements and you aren’t squeezing every last nanosecond of performance out of your application, then what you probably value is API simplicity. You want to put some bytes into a function and get some bytes out of a function.

To create some consistency, we’ll define a Compression trait that provides the API we want, then see how easy it is to mould Rust’s compression libraries to that simple interface.

The Compression trait

pub trait Compression {
    fn compress(&mut self, data: &[u8]) -> std::io::Result<Vec<u8>>;

    fn decompress(&mut self, data: &[u8]) -> std::io::Result<Vec<u8>>;
}

The Compression trait has two methods with identical type signatures. They both take a data stream (as a borrowed “byte slice”, &[u8]) as an argument and return a data stream (as an owned vector of bytes, Vec<u8>) wrapped in a std::io::Result.

The difference is in the body of those functions: they perform inverse operations. That relationship isn’t encoded in the type system, and as far as Rust is concerned each method could do whatever it wants, but the intent is that decompress() undoes compress(), so that round-tripping data returns the original bytes.

One of the quirks of this interface is that it doesn’t allow callers to control the output buffer. We use a return value to provide the (un)compressed data, whereas traditional systems programming would accept another argument, a mutable pointer to some buffer that’s been created by the caller.

We could enable this, but it muddies up the API. Readers who want that sort of control should be able to take the code below and create in-place variants, as sketched next. You’re welcome to ping me to request assistance.
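For a flavour of what that could look like, here is a sketch of an in-place variant. The trait and method names (CompressionInto, compress_into, decompress_into) are invented for illustration; the rest of this article sticks with the simpler design.

pub trait CompressionInto {
    /// Appends compressed bytes to `out`, returning how many were written.
    fn compress_into(&mut self, data: &[u8], out: &mut Vec<u8>) -> std::io::Result<usize>;

    /// Appends decompressed bytes to `out`, returning how many were written.
    fn decompress_into(&mut self, data: &[u8], out: &mut Vec<u8>) -> std::io::Result<usize>;
}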

Both the compress() and decompress() methods take a mutable reference (also known as a unique reference) to self, e.g. &mut self. This lets implementers modify their internal state between calls, at the cost that a single instance can only compress or decompress one data stream at a time.
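As an illustration of why that mutable state is useful, here’s a sketch of a wrapper that counts the bytes it has compressed across calls. It’s invented for this article and nothing later depends on it.

struct Counting<C: Compression> {
    inner: C,
    bytes_in: u64,
}

impl<C: Compression> Compression for Counting<C> {
    fn compress(&mut self, data: &[u8]) -> std::io::Result<Vec<u8>> {
        // &mut self lets us update state that persists between calls.
        self.bytes_in += data.len() as u64;
        self.inner.compress(data)
    }

    fn decompress(&mut self, data: &[u8]) -> std::io::Result<Vec<u8>> {
        self.inner.decompress(data)
    }
}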

An example implementation

We can see the trait in use with this Dummy struct, which doesn’t actually do any compression. It uses quite a lot of code for not a lot of compression, because it demonstrates the technique that our more complicated implementations will follow.

use std::io;

struct Dummy;

impl Compression for Dummy {
    fn compress(&mut self, data: &[u8]) -> std::io::Result<Vec<u8>> {
        // 1. Wrap the input in a type that performs the (non-)encoding.
        let mut reader = io::BufReader::new(data);
        // 2. Create a buffer to collect the output.
        let mut buffer = Vec::with_capacity(data.len());
        // 3. Move the data through the "encoder" into the buffer.
        io::copy(&mut reader, &mut buffer)?;

        Ok(buffer)
    }

    fn decompress(&mut self, data: &[u8]) -> std::io::Result<Vec<u8>> {
        // Nothing was compressed, so decompression is the same copy.
        self.compress(data)
    }
}

Here is the general pattern that most of our implementations will use.

  1. Wrap the input data in some other type that performs the encoding (for compression) or decoding (for decompression).
    In compress() above, we use std::io::BufReader for that purpose.
  2. Define a buffer to collect the output data before it’s returned.
    We do this with Vec::with_capacity(data.len()).
  3. Do the processing and get the data into the output buffer.
    Our code uses std::io::copy to do this very quickly. This adds a constraint, however: the wrapper type defined in step 1 must implement the std::io::Read trait.

Answers to some questions that you may have thought of:

In step 1, you require that the internal data processor (encoder or decoder) implements the `std::io::Read` trait. What should I do if my data processor doesn’t implement that trait?

You’ll need to replace std::io::copy with something else. If your processor implements std::io::Write instead, you can write the input into it, as the flate2 example later in this article does. If it implements neither trait, you’ll have to drive it by hand, as in the sketch below.
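For example, given a hypothetical block-oriented codec that implements neither std::io::Read nor std::io::Write (the BlockCodec trait here is invented for illustration), you could feed it chunks yourself:

use std::io;

// An assumed interface, not from any real crate.
trait BlockCodec {
    fn process_block(&mut self, block: &[u8]) -> Vec<u8>;
}

fn compress_blocks<C: BlockCodec>(codec: &mut C, data: &[u8]) -> io::Result<Vec<u8>> {
    let mut out = Vec::with_capacity(data.len());
    // Feed fixed-size chunks and gather whatever the codec emits.
    for block in data.chunks(4096) {
        out.extend_from_slice(&codec.process_block(block));
    }
    Ok(out)
}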

Why use BufReader rather than passing the data argument directly to std::io::copy?

Because copy requires a mutable reference to its reader, and the data parameter isn’t a mutable binding. We need something we can borrow mutably, and BufReader is a reasonable first option. (Rebinding the slice itself mutably also works; there’s an example after the next answer.)

Why use BufReader rather than something else that implements std::io::Read?

BufReader is worth remembering because it shines when reads are backed by the operating system, such as files or sockets, where buffering smooths out bursty input and reduces the number of syscalls. In our case, though, the data variable is a byte slice (&[u8]) that’s already in memory and implements std::io::Read itself, so the slice alone would be sufficient.
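To make that concrete, here is Dummy’s compress() rewritten without BufReader. A &[u8] implements std::io::Read directly, and reading from it simply advances the slice:

use std::io;

fn copy_without_bufreader(data: &[u8]) -> io::Result<Vec<u8>> {
    // Shadow the parameter with a mutable binding so we can borrow it mutably.
    let mut reader = data;
    let mut buffer = Vec::with_capacity(data.len());
    io::copy(&mut reader, &mut buffer)?;
    Ok(buffer)
}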

DEFLATE, gzip, zlib

The two (seemingly) competing crates offering DEFLATE and its peers are miniz_oxide and flate2. Despite appearances, there’s no conflict: recent versions of flate2 defer to miniz_oxide for their pure Rust implementation. flate2 offers a slightly nicer streaming interface, which we’ll make use of here.

Defining supporting data structures

There’s no state to maintain, so we’ll use a zero-sized type to satisfy the trait.

struct Deflate;

In later examples, these structures will hold parameters for encoding.

Compressing with the flate2 crate

use std::io::{Read, Write};
use flate2::{read, write}; // Cargo.toml: flate2 = "1"

impl Compression for Deflate {
    fn compress(&mut self, data: &[u8]) -> std::io::Result<Vec<u8>> {
        // Guess that the output will be roughly an eighth of the input's size.
        let buffer = Vec::with_capacity(data.len() >> 3);
        let mut enc = write::DeflateEncoder::new(buffer, flate2::Compression::best());
        enc.write_all(data)?;

        // finish() flushes the encoder and returns the buffer it was writing into.
        enc.finish()
    }

    fn decompress(&mut self, data: &[u8]) -> std::io::Result<Vec<u8>> {
        let mut buffer = Vec::with_capacity(data.len());
        let mut dec = read::DeflateDecoder::new(data);
        dec.read_to_end(&mut buffer)?;

        Ok(buffer)
    }
}

They’re not exposed in our code, but these encoders do have a number of parameters. We’ll see how to make our own types configurable in the next section; the sketch below gives a preview of what flate2 offers.
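flate2’s main dial is the compression level, and the crate ships gzip and zlib encoders alongside the raw DEFLATE one. A minimal sketch, using only constructors from flate2’s public API:

use flate2::write::{DeflateEncoder, GzEncoder, ZlibEncoder};

// The same dial works for all three container formats. Compression::new(n)
// accepts levels 0-9; fast() and best() are shorthands for 1 and 9.
fn encoders() -> (DeflateEncoder<Vec<u8>>, GzEncoder<Vec<u8>>, ZlibEncoder<Vec<u8>>) {
    let level = flate2::Compression::new(6);
    (
        DeflateEncoder::new(Vec::new(), level),
        GzEncoder::new(Vec::new(), level),
        ZlibEncoder::new(Vec::new(), level),
    )
}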

Brötli

Brötli (pronounced brute-lee) is the Swiss German word for a small bread roll. In the world of compression, it’s normally seen without its umlaut. The algorithm has a few dials you can adjust to tune performance for your use case, and three of them are exposed here. They’re discussed below.

struct Brotli {
    quality: i32,
    lgwin: i32,
    buffer_size: usize,
}

impl Brotli {
    fn new(quality: i32, lgwin: i32, buffer_size: usize) -> Self {
        // Ranges accepted by the underlying library: quality 0-11 (we insist
        // on at least 1 so that some compression occurs) and lgwin 10-24.
        assert!(quality >= 1);
        assert!(quality <= 11);
        assert!(lgwin >= 10);
        assert!(lgwin <= 24);

        Brotli { quality, lgwin, buffer_size }
    }
}

The three fields mirror what rust-brotli exposes in its API.

quality affects how much CPU time Brotli invests in compressing the data. Higher values produce better compression, but are slower. It corresponds to the BROTLI_PARAM_QUALITY parameter of the original library.

lgwin sets the size of the encoding window as a power of two: a value of 22 means a window of about 2^22 bytes (4 MiB), so it significantly affects memory use. The name is short for “log window”, where log is itself a shortened form of logarithm. It corresponds to the BROTLI_PARAM_LGWIN parameter of the original library.

buffer_size sets the size of the internal buffer used to shuttle data through the encoder and decoder. Larger values mean fewer, bigger reads, at the expense of memory use and latency.
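Constructing one looks like this. The values are illustrative rather than recommendations: maximum quality, a 2^22-byte (4 MiB) window, and a 4 KiB buffer.

let brotli = Brotli::new(11, 22, 4096);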

Compressing with Brotli

The full code that follows looks intimidating, but the most important parts are the ends of compress() and decompress(). That’s where the pattern introduced with the Dummy struct appears: wrap the input, create an output buffer, and copy the data across.

use std::io;
use brotli::enc::backward_references::BrotliEncoderMode;
use brotli::enc::BrotliEncoderParams;
use brotli::{CompressorReader, Decompressor};

impl Compression for Brotli {
    fn compress(&mut self, data: &[u8]) -> io::Result<Vec<u8>> {
        // Brotli has a dedicated mode for text; select it when the input is valid UTF-8.
        let mode = match std::str::from_utf8(data) {
            Ok(_) => BrotliEncoderMode::BROTLI_MODE_TEXT,
            Err(_) => BrotliEncoderMode::BROTLI_MODE_GENERIC,
        };

        let params = BrotliEncoderParams {
            lgwin: self.lgwin,
            quality: self.quality,
            mode,
            ..Default::default()
        };

        // The familiar pattern: wrap the input, create a buffer, copy across.
        let mut comp = CompressorReader::with_params(data, self.buffer_size, &params);
        let mut compressed_data = Vec::new();
        io::copy(&mut comp, &mut compressed_data)?;
        Ok(compressed_data)
    }

    fn decompress(&mut self, data: &[u8]) -> io::Result<Vec<u8>> {
        let mut decomp = Decompressor::new(data, self.buffer_size);
        let mut decompressed_data = Vec::new();
        io::copy(&mut decomp, &mut decompressed_data)?;
        Ok(decompressed_data)
    }
}

I’ve added a few parts to this code to give you the opportunity to reflect on whether you would do the same. For example, Brotli has a mode that can perform better with UTF-8 data. Is it worthwhile to check whether the data is UTF-8, given that std::str::from_utf8 scans the entire input before compression even begins?
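If that full scan bothers you, one alternative (my own sketch, not something rust-brotli prescribes) is to sniff only a prefix of the input inside compress():

// Check at most the first 1 KiB instead of the whole input.
let sample = &data[..data.len().min(1024)];
let mode = match std::str::from_utf8(sample) {
    Ok(_) => BrotliEncoderMode::BROTLI_MODE_TEXT,
    // A multi-byte character may be cut off at the sample boundary; if
    // everything up to the last few bytes was valid, still call it text.
    Err(e) if e.valid_up_to() + 3 >= sample.len() => BrotliEncoderMode::BROTLI_MODE_TEXT,
    Err(_) => BrotliEncoderMode::BROTLI_MODE_GENERIC,
};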

Snappy

Snappy is a different type of compression algorithm: rather than chasing the best ratio, it’s designed for speed, targeting high-throughput, low-latency applications. The leading Rust implementation is provided by the snap crate. snap offers few options, meaning that we don’t need to store any state to satisfy our interface.

struct Snappy;

Compressing with Snappy

Implementing our Compression trait nearly matches our Dummy example, but there is some extra work to handle errors. Calling into_inner() on snap’s FrameEncoder (“frame” refers to a compression frame) returns snap’s own error type rather than a std::io::Error. As that error is actually a wrapper over a std::io::Error, snap provides an into_error() method specifically to unwrap it, which you’ll see used below.

use std::io::{self, BufReader};

impl Compression for Snappy {
    fn compress(&mut self, data: &[u8]) -> std::io::Result<Vec<u8>> {
        let buffer = Vec::with_capacity(data.len() >> 3);
        let mut data = BufReader::new(data);
        let mut encoder = snap::write::FrameEncoder::new(buffer);

        io::copy(&mut data, &mut encoder)?;

        // into_inner() flushes the encoder and hands back the buffer; on
        // failure, into_error() unwraps the underlying std::io::Error.
        encoder.into_inner()
            .map_err(|err| err.into_error())
    }

    fn decompress(&mut self, data: &[u8]) -> std::io::Result<Vec<u8>> {
        let mut buffer = Vec::with_capacity(data.len());
        let mut decoder = snap::read::FrameDecoder::new(data);

        io::copy(&mut decoder, &mut buffer)?;

        Ok(buffer)
    }
}
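With the trait implemented, a quick round trip shows it in use. This fragment belongs inside a function, and the input string is arbitrary:

let mut codec = Snappy;
let input = b"a compressible string, a compressible string".to_vec();
let compressed = codec.compress(&input).unwrap();
let restored = codec.decompress(&compressed).unwrap();
assert_eq!(input, restored);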

Why choose a Rust-based implementation?

By using a pure-Rust implementation, you’re likely to reduce the number of issues at build time. If you build for multiple platforms, you won’t need to find a C compiler for each compile target.

If the implementation avoids the unsafe keyword, then you know that Rust’s memory safety guarantees protect your code. Decompressors in particular are often exposed to hostile data, and they’re an area where you really want to be memory safe.
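If you want the same guarantee in your own code, Rust can enforce it at compile time:

// At the top of lib.rs or main.rs: the crate fails to compile if any
// unsafe block appears in it (dependencies are unaffected).
#![forbid(unsafe_code)]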

Learn more

Review the Code

Full source code for the examples in this article is available on GitHub. Give the repo a star!
