Goal

Today I'm a bit tired and don't want to tackle anything complicated. I'll ponder on ways on improving how to represent chunk IDs on the command line, in URLs, and in a database. This has recently been a problem so I'd like to find a solution that will suffice for a long time.

Plan

  • Consider needs and wants.
  • Compare possible ways.
  • Pick one.

Notes

Needs and wants

  • Chunks have a unique identifier. In a backup repository, every chunk MUST have a unique identifier. They SHOULD be unique across repositories, if that can be achieved without difficulty, but this is not actually important.

  • Chunk IDs are visible on the command line, in filenames, in URLs, and in text and JSON output. They are also stored in databases, without being visible to humans, except in extreme troubleshooting scenarios. I'll treat chunk IDs stores in chunks to refer to other chunks as the database case.

  • There will be many chunks. The chunk ID, when stored in a database, SHOULD be compact and take little space.

  • An ID MUST be possible to express unambiguously on the command line and in text and JSON output. The representation MUST be suitable for humans to type and compare. It SHOULD be printable ASCII and SHOULD be easy to copy-paste.

Babbling

  • An obvious way to generate a chunk ID is to pick one randomly. The UUIDv4 format would be easy. It's 16 bytes and extremely unlikely to have an accidental collision, even across backup repositories. However, the backup repository will guard against collisions, whether local or remote. Thus, it's not actually necessary to spend 16 bytes per ID to make collisions extremely rare. We can save space by using a shorter random ID.

  • However, if the ID space is small, and starts to fill up, the likelihood of collisions go up. A collision has a cost: the client needs to pick a new ID and try again.

  • I'll drop the idea that a chunk ID should be unique across repositories. There's no way to trust it's the same chunk just because the ID is the same.

  • Shorter IDs would also be more human-friendly. The canonical representation of a UUID as text is 36 bytes, but of course we can have our own shorter representation.

  • We don't even need to use a fixed size ID. If we only care about uniqueness in one backup repository, we can start small and increase size as the repository grows.

  • A backup repository will need store at least a petabyte to be useful for me. I'm unlikely to be ever have that much data, or storage, but it's the next unit up from a terabyte, and a terabyte is woefully inadequate for this.

  • My gut feeling is that an average chunk size will vary a lot. For large files, a megabyte is probably doable, but most files are smaller. In the home directory on my main computer, I have about 2.5 million files. Median length is 3448 bytes, average is about 115 kilobytes. About 27 thousand files are longer than a megabyte.

  • Today, I'll assume the average chunk size 100 kilobytes. This means a chunk ID format needs to handle at least 1024^5 / (100*1024) = 10995116277.76 chunks. Round that 1010 chunks, or 233 or 234.

Scenario: Fixed size IDs

  • Let's assume a chunk ID has a fixed size of I octets. We generate the octets randomly.

  • For 34 bites, we'd need 5 octets. We could store 8 octets, to fit into CPU word sizes better. That'd still be half of a UUID in size and allow spreading randomly picked ID across a large space and get fewer collisions.

Scenario: Variable size IDs

  • In a variable sized ID, there are at least 1 octet, but there can be any number. In principle.

  • It'd be silly to have 1-octet IDs. There's so few of them that adding a byte to represent the length seems wasteful.

  • It'd be silly to allow IDs that are very long. Since an ID only represents a chunk, and is unique, there's no need for, say, a 128-byte ID. Shorter ones use less space, and are easier to communicate.

  • We could start with IDs that are 32 bits, but allow up to 128 bits. Or we could always use 128-bit (8-octet) IDs, but pick from narrower ranges until that narrow range fills up. Something simple like "after a collision, double the range" might work well enough.

Fixes vs variable size IDs

  • From the above discussion, I don't actually see a need for true variable size IDs. It'll be easier to used a fixed size, but vary the range from which IDs are randomly picked. A smaller ID can be represented as a shorter string (or byte string).

Possible representations

  • In a database, an ID can be stored as a short binary string, to save space compared to a human-readable textual representation. For human-readable representation, there are a number of options. Without doing a through search of the literature, I list some options.

  • Hexadecimal.

    • pro: simple and fast
    • con: human-meaningless, hard to type correctly
  • Base 64.

    • pro: simple and fast, slightly smaller than hex
    • con: human-meaningless, hard to type correctly
  • Other base-N representations.

    • pro: simple and fast, slightly smaller than hex
    • con: human-meaningless, hard to type correctly
  • URL encoding.

    • pro: allows test and humans to pick a textual label that is its own encoding
    • con: anything that's not an ASCII printable word is likely to look like a mess
  • HTML escaping.

    • pro: allows test and humans to pick a textual label that is its own encoding
    • con: anything that's not an ASCII printable word is likely to look like a mess
  • Convert bytes or nybbles to words.

    • pro: results in quite human-friendly textual IDs
    • con: IDs can be quite long; conversion to/from is likely slow
  • JSON / RFC 8259.

    • pro: allows test and humans to pick a textual label that is its own encoding
    • con: anything that's not an ASCII printable word is likely to look like a mess
  • There's probably more, but this will do for a start.

  • It's clear that the want to be human-readable is what makes things tricky. Being friendly to humans is good for debugging, and more readable test, but it might not be sufficiently important a want to satisfy.

  • It'd be possible to allow a mix of representations, with some way to uniquely distinguish between them, such as by requiring a prefix. For example, a prefix of "b" could mean a specific variant of base 64, and "h" any human oriented ASCII printable text. Or base 64 could be upper case and digits, all-lowercase would be printable ASCII. Except that base 64 uses both upper and lower case. The case differentiation would work for hex, though.

Discussion

  • Option 1: URL-safe base 64 encoding without padding, a.k.a. base64url.

  • Option 2: prefix . for ASCII printable text, and anything else as base64url.

  • I'm not sure the prefix version is worth it. I'll pick base64url only, at least for now. It's entirely possible that I'll be told of something ingenious I'll want to switch later. That's OK. I only need to nail this down by the time Obnam starts being able to make backups and people start using it. Even then, the export/import feature I want to add will make this possible to change later.

Implementation

  • I'll implement Id as a wrapper around a 128-bit integer.

  • In memory, this will be represented as the Rust u128 type.

  • The binary representation will be a Postcard variable size integer, to avoid storing all 128 bits unless necessary. If we need so many chunk IDs that we need 128 bites, the overhead of a 128-bit varint it not worth worrying about.

  • The textual representation is base64url.

  • When generating new ID, I'll use a per-run range that starts from 16-bit, and doubles in size when there's a collision, until the maximum size. I'll need an IdFactory type to keep track of the range.

  • I'll use the rand crate to generate random numbers. It seems the best available. I'm aware that generating random numbers well is tricky.

  • Implemented a straightforward new module for a new Id type. It's just enough code that I want to move it out of the chunk module. The new module does the encoding I've chosen, with tests for round trips.

  • Converting the exiting code to use the type is tedious, as the API has changed. Maybe the biggest problem is that since Id no longer can generate itself, I'll need to add an IdFactory in any place that needs that.

  • I ran out of time implementing this, but I can continue next time.

Summary

I've come up with a way to solve the long-standing problem of chunk IDs and their representation and did some of the work for implementing that.

Comments?

If you have feedback on this, please use the following fediverse thread: https://toot.liw.fi/@liw/116351164622880623.

If you'd like to fund Obnam development, see my funding page.