Creating a uniform interface
Each library gets to decide how it works. That makes things relatively complicated when you try to compare one against another.
If you don’t have special requirements and you are not squeezing every last nanosecond of performance out of your application, then you value API simplicity. You probably want to put some bytes into a function and get some bytes out of a function.
To create some consistency, we’ll define a `Compression` trait that provides the API that we want, then see how easy it is to mould Rust’s compression libraries to that simple interface.
The Compression trait
pub trait Compression {
    fn compress(&mut self, data: &[u8]) -> std::io::Result<Vec<u8>>;
    fn decompress(&mut self, data: &[u8]) -> std::io::Result<Vec<u8>>;
}
The `Compression` trait has two methods with identical type signatures. They both take a data stream (as a borrowed “byte slice”, `&[u8]`) as an argument and return a data stream (as an owned vector of bytes, `Vec<u8>`) wrapped in a `std::io::Result`.
The difference is in the bodies of those functions: they perform inverse operations. While it’s not encoded in the type system, the intent is that `decompress()` reverses whatever `compress()` did.
One of the quirks of this interface is that it doesn’t allow callers to control the output buffer. We use a return value to provide the (un)compressed data, whereas traditional systems programming would accept another argument, a mutable pointer to some buffer that’s been created by the caller.
We could enable this, but it muddies up the API. For readers who want that sort of control, you should be able to take the code below and create in-place variants. You’re welcome to ping me to request assistance.
Both the `compress()` and `decompress()` methods take a mutable reference (also known as a unique reference) to `self`, i.e. `&mut self`. This enables implementers to modify their internal state between calls, at the expense of only allowing a single instance to compress or decompress a single data stream at a time.
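To illustrate why `&mut self` matters, here is a sketch of a hypothetical implementer that tracks how many bytes it has processed across calls. The `Tally` type and its `bytes_in` field are invented for this example; it performs no real compression.

```rust
pub trait Compression {
    fn compress(&mut self, data: &[u8]) -> std::io::Result<Vec<u8>>;
    fn decompress(&mut self, data: &[u8]) -> std::io::Result<Vec<u8>>;
}

/// Hypothetical implementer that records how many bytes it has seen.
/// Updating the counter is only possible because of `&mut self`.
struct Tally {
    bytes_in: usize,
}

impl Compression for Tally {
    fn compress(&mut self, data: &[u8]) -> std::io::Result<Vec<u8>> {
        self.bytes_in += data.len(); // state mutated between calls
        Ok(data.to_vec()) // no real compression in this sketch
    }

    fn decompress(&mut self, data: &[u8]) -> std::io::Result<Vec<u8>> {
        Ok(data.to_vec())
    }
}

fn main() -> std::io::Result<()> {
    let mut t = Tally { bytes_in: 0 };
    t.compress(b"hello")?;
    t.compress(b"world")?;
    assert_eq!(t.bytes_in, 10);
    Ok(())
}
```

With `&self` instead, the counter could not be updated without interior mutability, which is why the trait opts for the unique reference.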
An example implementation
We can see the trait in use with this Dummy
struct, which doesn’t actually do any compression. Unfortunately, it does use quite a lot of code for not a lot of compression. That’s because we’re attempting to demonstrate the technique that our more complicated implementations use.
use std::io;

struct Dummy;

impl Compression for Dummy {
    fn compress(&mut self, data: &[u8]) -> std::io::Result<Vec<u8>> {
        let mut reader = io::BufReader::new(data);
        let mut buffer = Vec::with_capacity(data.len());
        io::copy(&mut reader, &mut buffer)?;
        Ok(buffer)
    }

    fn decompress(&mut self, data: &[u8]) -> std::io::Result<Vec<u8>> {
        todo!()
    }
}
Here is the general pattern that most of our implementations will use.

- Wrap the input data in some other type that performs the encoding (for compression) or decoding (for decompression). In the `compress()` method above, `std::io::BufReader` serves that purpose.
- Define a buffer to collect the output data before it’s returned. The code above does this with `Vec::with_capacity(data.len())`.
- Do the processing and get the data into the output buffer. Our code uses `std::io::copy` to do this very quickly. This adds a constraint, however: the wrapper type defined in step 1 must implement the `std::io::Read` trait.
Answers to some questions that you may have thought of:
In step 1, you require that the internal data processor (encoder or decoder) implements the `std::io::Read` trait. What should I do if my data processor doesn’t implement that trait?
You’ll need to replace `std::io::copy` with something else.
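One common alternative, sketched here with an invented passthrough encoder, is to drive the data from the `Write` side instead: push the input into the encoder with `write_all()`, then take the output buffer back. The `PassthroughEncoder` type is hypothetical and stands in for a real encoder that only implements `Write`.

```rust
use std::io::{self, Write};

/// Invented stand-in for a data processor that implements `Write` but not `Read`.
struct PassthroughEncoder {
    out: Vec<u8>,
}

impl Write for PassthroughEncoder {
    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
        self.out.extend_from_slice(buf); // a real encoder would transform the bytes here
        Ok(buf.len())
    }

    fn flush(&mut self) -> io::Result<()> {
        Ok(())
    }
}

fn compress(data: &[u8]) -> io::Result<Vec<u8>> {
    let mut enc = PassthroughEncoder { out: Vec::with_capacity(data.len()) };
    enc.write_all(data)?; // push the data in, rather than pulling it out with io::copy
    Ok(enc.out)
}

fn main() -> io::Result<()> {
    assert_eq!(compress(b"abc")?, b"abc");
    Ok(())
}
```

We’ll see this write-side pattern for real in the `flate2` implementation below.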
Why use `BufReader` rather than passing the `data` argument directly to `std::io::copy`?
Because `copy` requires mutable access and we don’t have mutable access to give. So we need to wrap `data` in something, and `BufReader` is a reasonable first option.
Why use `BufReader` rather than something else that implements `std::io::Read`?
`BufReader` is useful to remember because there are cases where the input data stream comes in bursts and the delays are caused by waiting on syscalls from the operating system. But arguably a plain reader would be sufficient here, because the `data` variable’s type is a byte slice (`&[u8]`), so it won’t trigger those syscalls.
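In fact, because `&[u8]` itself implements `std::io::Read`, the wrapper can be dropped entirely. What `io::copy` needs is a mutable binding to a reader, not mutable access to the underlying bytes, so a local copy of the slice works. A minimal sketch:

```rust
use std::io;

fn copy_without_bufreader(data: &[u8]) -> io::Result<Vec<u8>> {
    let mut buffer = Vec::with_capacity(data.len());
    // `&[u8]` implements `Read`. We only need a mutable *binding* to the
    // slice (the reader advances through it), not mutable bytes.
    let mut reader = data;
    io::copy(&mut reader, &mut buffer)?;
    Ok(buffer)
}

fn main() -> io::Result<()> {
    assert_eq!(copy_without_bufreader(b"hello")?, b"hello");
    Ok(())
}
```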
DEFLATE, gzip, zlib
The two (seemingly) competing crates offering DEFLATE and its peers are `miniz_oxide` and `flate2`. Although they’re apparently in conflict, recent versions of `flate2` defer to `miniz_oxide` for the pure-Rust implementation. `flate2` offers a slightly nicer streaming interface, which we’ll make use of here.
Defining supporting data structures
There’s no state to maintain, so we’ll use a zero-sized type to satisfy the trait.
struct Deflate;
In later examples, these structures will be used to contain parameters for encoding.
Compressing with the flate2 crate
use std::io::{Read, Write};
use flate2::{read, write}; // Cargo.toml: flate2 = "1"

impl Compression for Deflate {
    fn compress(&mut self, data: &[u8]) -> std::io::Result<Vec<u8>> {
        let buffer = Vec::with_capacity(data.len() >> 3);
        let mut enc = write::DeflateEncoder::new(buffer, flate2::Compression::best());
        enc.write_all(data)?;
        enc.finish()
    }

    fn decompress(&mut self, data: &[u8]) -> std::io::Result<Vec<u8>> {
        let mut buffer = Vec::with_capacity(data.len());
        let mut dec = read::DeflateDecoder::new(data);
        dec.read_to_end(&mut buffer)?;
        Ok(buffer)
    }
}
They’re not exposed in our code, but these encoders do have a number of parameters. We’ll see how to adjust our code to enable that in the next section.
Brötli
Brötli (pronounced brute-lee) is a Swiss German word for a small bread roll. In the case of compression, the name is normally seen without its umlaut. The algorithm has a few dials to adjust for the optimal performance for your use case, and three of them are exposed here. They’re discussed below.
struct Brotli {
    quality: i32,
    lgwin: i32,
    buffer_size: usize,
}

impl Brotli {
    fn new(quality: i32, lgwin: i32, buffer_size: usize) -> Self {
        assert!((1..=11).contains(&quality));
        assert!((10..=24).contains(&lgwin));
        Brotli { quality, lgwin, buffer_size }
    }
}
The three fields are the same as those exposed by rust-brotli’s API.

- `quality` affects how much CPU Brotli invests in compressing the data. Higher values produce better compression, but are slower. It corresponds to the `BROTLI_PARAM_QUALITY` parameter of the original library.
- `lgwin` specifies the width in bits of the encoding window, which significantly affects memory use. The parameter name `lgwin` is short for “log of bit window”, where log is itself a shortened form of logarithm. It corresponds to the `BROTLI_PARAM_LGWIN` parameter of the original library.
- `buffer_size` relates to the size of the internal buffer. Larger values produce better compression at the expense of higher latency.
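Because `lgwin` is a logarithm, the window it describes is roughly 2 to the power of `lgwin` bytes. A quick sketch of the arithmetic, using the permitted bounds from `Brotli::new()`:

```rust
/// Approximate window size implied by an `lgwin` value: 2^lgwin bytes.
/// (The exact window used by the Brotli format may differ slightly.)
fn window_size_bytes(lgwin: i32) -> usize {
    1usize << lgwin
}

fn main() {
    assert_eq!(window_size_bytes(10), 1024);              // lower bound: 1 KiB
    assert_eq!(window_size_bytes(24), 16 * 1024 * 1024);  // upper bound: 16 MiB
}
```

This makes the memory trade-off concrete: each increment of `lgwin` doubles the window.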
Compressing with Brotli
The full code that follows looks intimidating, but the most important parts are the two `io::copy()` calls and the buffers they fill. That’s where the pattern introduced in the `Dummy` struct is implemented.
use std::io;

use brotli::enc::backward_references::BrotliEncoderMode;
use brotli::enc::BrotliEncoderParams;
use brotli::{CompressorReader, Decompressor}; // Cargo.toml: brotli = "3"

impl Compression for Brotli {
    fn compress(&mut self, data: &[u8]) -> io::Result<Vec<u8>> {
        // Brotli has a dedicated mode for textual data
        let mode = match std::str::from_utf8(data) {
            Ok(_) => BrotliEncoderMode::BROTLI_MODE_TEXT,
            Err(_) => BrotliEncoderMode::BROTLI_MODE_GENERIC,
        };
        let params = BrotliEncoderParams {
            lgwin: self.lgwin,
            quality: self.quality,
            mode,
            ..Default::default()
        };
        let mut comp = CompressorReader::with_params(data, self.buffer_size, &params);
        let mut compressed_data = Vec::new();
        io::copy(&mut comp, &mut compressed_data)?;
        Ok(compressed_data)
    }

    fn decompress(&mut self, data: &[u8]) -> io::Result<Vec<u8>> {
        let mut decomp = Decompressor::new(data, self.buffer_size);
        let mut decompressed_data = Vec::new();
        io::copy(&mut decomp, &mut decompressed_data)?;
        Ok(decompressed_data)
    }
}
I’ve added a few parts to this code to give you the opportunity to reflect and consider whether you would do the same. For example, Brotli has a mode that can perform better with UTF-8 data. Is it worthwhile to check whether the data is UTF-8?
Snappy
Snappy is a different type of compression algorithm, designed for optimal speed in high-throughput, low-latency applications. The leading Rust implementation is provided by the `snap` crate. `snap` offers few options, meaning that we don’t need to store any data to work with our interface.
struct Snappy;
Compressing with Snappy
Implementing our `Compression` trait nearly matches our `Dummy` example, but there is some extra work to handle errors. We need to convert the error provided by `snap`’s frame encoder (“frame” refers to a compression frame) to a `std::io::Error`. As that error type is actually a wrapper over `std::io::Error`, you’ll see that we end up using a method that’s provided specifically to enable this.
use std::io::{self, BufReader};

impl Compression for Snappy {
    fn compress(&mut self, data: &[u8]) -> std::io::Result<Vec<u8>> {
        let buffer = Vec::with_capacity(data.len() >> 3);
        let mut data = BufReader::new(data);
        let mut encoder = snap::write::FrameEncoder::new(buffer);
        io::copy(&mut data, &mut encoder)?;
        encoder.into_inner()
            .map_err(|err| err.into_error())
    }

    fn decompress(&mut self, data: &[u8]) -> std::io::Result<Vec<u8>> {
        let mut buffer = Vec::with_capacity(data.len());
        let mut decoder = snap::read::FrameDecoder::new(data);
        io::copy(&mut decoder, &mut buffer)?;
        Ok(buffer)
    }
}
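Whichever backend you choose, the shared trait means every implementor can be exercised by the same generic code. This sketch defines a roundtrip check that works with any `Compression` implementor; it uses an invented `Identity` codec so it runs without external crates, but `Deflate`, `Brotli`, or `Snappy` could be substituted directly.

```rust
pub trait Compression {
    fn compress(&mut self, data: &[u8]) -> std::io::Result<Vec<u8>>;
    fn decompress(&mut self, data: &[u8]) -> std::io::Result<Vec<u8>>;
}

/// Works with any implementor: compress, decompress, then compare.
fn roundtrips<C: Compression>(codec: &mut C, data: &[u8]) -> std::io::Result<bool> {
    let compressed = codec.compress(data)?;
    let restored = codec.decompress(&compressed)?;
    Ok(restored == data)
}

/// Invented identity codec, standing in for a real backend.
struct Identity;

impl Compression for Identity {
    fn compress(&mut self, data: &[u8]) -> std::io::Result<Vec<u8>> {
        Ok(data.to_vec())
    }
    fn decompress(&mut self, data: &[u8]) -> std::io::Result<Vec<u8>> {
        Ok(data.to_vec())
    }
}

fn main() -> std::io::Result<()> {
    let mut codec = Identity;
    assert!(roundtrips(&mut codec, b"some example data")?);
    Ok(())
}
```

A check like this makes a useful smoke test when you swap one compression backend for another.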
Why choose a Rust-based implementation?
By using a pure-Rust implementation, you’re likely to reduce the number of issues at build time. If you build for multiple platforms, you won’t need to find a C compiler for your compile target.
If the implementation avoids the `unsafe` keyword, then you know that you have Rust’s memory safety guarantees to protect your code. Encoders are often exposed to hostile data, and they’re an area where you really want to be memory safe.
Learn more
Review the Code
Full source code for the examples in this article is available on GitHub. If you find it useful, give the repo a star!