I'm starting a new "Obnam 3" project. See also thoughts I've written up. This is the first "development note" where I document what I do, with the intention that others can follow and see what I do and why.

Goal
Plan
Notes
Summary
Comments?

Goal

The goal for today is to get started.

I will create a new Git repository, with enough scaffolding to make it easy for me to add new sub-commands. I expect to make a lot of sub-commands to provide functionality that can be tested. There will also be support for adding automated tests. I will set up CI for this too.
I will start implementing chunks. I will create the command obnam chunk encrypt that reads cleartext data and encrypts it, and the corresponding obnam chunk decrypt that decrypts it.

The acceptance criterion for today is that all of the above works.

Plan

Create a new Git repository, with a README, and put that on Radicle.
Create the structure of a program.
Set up automated acceptance testing with Subplot.
Set up CI on my personal CI setup.
Implement chunk encryption and decryption.

Notes

Create a new Git repository, put it on Radicle

cd ~/pers/obnam
git init git
create README.md and commit it
rad init
The repository id is rad:zbWNQYkQ4QKgdSQcd1tjaemv6d6x
Add it to my public seed node https://app.radicle.xyz/nodes/radicle.liw.fi/rad:zbWNQYkQ4QKgdSQcd1tjaemv6d6x

Create the structure of a Rust program, with sub-commands

New Git branch.
cargo init --name obnam
Depend on clap with the derive feature. That's my preferred way to add command line parsing to a Rust program.
Also on thiserror for nicer error messages for users.
Add a Makefile for my convenience. I know there is xtask, and probably others, but I've not learned them, so make it is. I could also write a shell script, but make is fine.
Started setting up the scaffolding for a sub-command structure. I like to define a trait Leaf for the lowest level sub-commands, which don't have sub-commands of their own, and put those leaf-commands in their own modules, grouped so that all the leaves of a parent sub-command are in the same module. Then the main program defines the parent sub-command structure all in one place. I find this makes for a clear, tidy structure that's reasonably easy to change and maintain.
Ran into a problem. I have

use clap::Parser;

use obnam::cmd;

fn main() {
    println!("Hello, world!");
}

#[derive(Parser)]
struct Args {
    #[clap(subcommand)]
    cmd: Cmd,
}

#[derive(Parser)]
enum Cmd {
    Chunk(ChunkCmd),
}

#[derive(Parser)]
enum ChunkCmd {
    Hello(cmd::chunk::Hello),
}

The compiler complains that ChunkCmd doesn't implement the clap::Args trait, yet it should, given the derive.
Oh! ChunkCmd needs to be a struct, not an enum. D'oh.
This works:

use clap::Parser;

use obnam::cmd;

fn main() {
    println!("Hello, world!");
}

#[derive(Parser)]
struct Args {
    #[clap(subcommand)]
    cmd: Cmd,
}

#[derive(Parser)]
enum Cmd {
    Chunk(ChunkCmd),
}

#[derive(Parser)]
struct ChunkCmd {
    #[clap(subcommand)]
    cmd: ChunkSubCmd,
}

#[derive(Parser)]
enum ChunkSubCmd {
    Hello(cmd::chunk::Hello),
}

Add a main program that actually parses the command line and executes sub-commands. I like to have a main that calls a fallible_main, to make error reporting easier.

use std::process::exit;

fn main() {
    if let Err(e) = fallible_main() {
        eprintln!("ERROR: {}", e);
        let mut e = e.source();
        while let Some(source) = e {
            eprintln!("caused by: {}", source);
            e = source.source();
        }
        exit(1);
    }
}

fn fallible_main() -> Result<(), Box<dyn std::error::Error>> {
    let args = Args::parse();
    args.cmd.run(&args)?;

    Ok(())
}

I also like the Box<dyn std::error::Error> construct over the anyhow crate. I prefer each module to have its own error type, but this construct avoids having to convert or wrap errors from called sub-commands.
Merge the branch to main.
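
The Leaf trait and the per-module error types described above could look roughly like the following. This is a minimal sketch using only the standard library, not the actual Obnam code: the `run` signature, the `name` field, and the `ChunkError` type are assumptions made for illustration, and the error type is written by hand where the real program would use thiserror and clap.

```rust
use std::error::Error;
use std::fmt;

/// Hypothetical trait for the lowest-level sub-commands.
trait Leaf {
    fn run(&self) -> Result<(), Box<dyn Error>>;
}

/// A module-specific error type (hand-written stand-in for thiserror).
#[derive(Debug)]
enum ChunkError {
    EmptyName,
}

impl fmt::Display for ChunkError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            ChunkError::EmptyName => write!(f, "name must not be empty"),
        }
    }
}

impl Error for ChunkError {}

/// A leaf command; the `name` field is invented for this sketch.
struct Hello {
    name: String,
}

impl Hello {
    fn greeting(&self) -> Result<String, ChunkError> {
        if self.name.is_empty() {
            return Err(ChunkError::EmptyName);
        }
        Ok(format!("hello, {}", self.name))
    }
}

impl Leaf for Hello {
    fn run(&self) -> Result<(), Box<dyn Error>> {
        // `?` turns ChunkError into Box<dyn Error> with no explicit wrapping.
        println!("{}", self.greeting()?);
        Ok(())
    }
}

/// The parent sub-command only dispatches to its leaves.
enum ChunkSubCmd {
    Hello(Hello),
}

impl ChunkSubCmd {
    fn run(&self) -> Result<(), Box<dyn Error>> {
        match self {
            ChunkSubCmd::Hello(leaf) => leaf.run(),
        }
    }
}
```

Because every leaf returns Box<dyn std::error::Error>, the parent levels can pass any module's error upwards unchanged, and the error-reporting loop in main can walk its source chain.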

Acceptance testing with Subplot

I co-authored the Subplot program for documenting acceptance criteria and how they are verified. Obviously I will want to use it here. I like to set up the structure for this early on, even if "acceptance criteria" is a bit silly at this stage.

Create a new branch.
Add the Subplot files.
I choose to use Python, as implementing Subplot scenarios in Rust is not as easy as I'd like it to be.
Added scaffolding and a placeholder scenario that can be dropped once obnam does something useful.
Merge into main.

Set up CI

I use Radicle CI with the Ambient adapter. My software, again. I am the biggest not-invented-here buffoon in the world.
At this stage, I'll just have CI run make.
Added .radicle/ambient.yaml:

~~~yaml
pre_plan:
- action: cargo_fetch
plan:
- action: cargo_fmt
- action: shell
  shell: |
    make
~~~

Added the obnam repository to my Radicle CI node.
First CI run fails: cargo fmt --check is not happy.
Run the rad-ci command I also wrote, which emulates CI locally, for easier testing of CI failures.
cargo fmt --check fails in that, too, which is good.
Also when I run it locally.
For some reason, my editor auto-formats this wrongly.
I'm not in the mood for fighting Emacs and rustfmt today, so I'll drop that check from CI.
CI still fails, because of a problem in Ambient, which doesn't set all the environment variables for all the actions that it runs, and so the Rust toolchain can't be found in the shell action. Easily fixed with export PATH=/root/.cargo/bin:/bin:/sbin.
Next failure: cargo wants to download its index, which doesn't work under Ambient, on purpose.
I'll split the Makefile so that the cargo steps are in their own targets, and the Subplot running in its own. Then I'll change .radicle/ambient.yaml to be:

~~~yaml
pre_plan:
- action: cargo_fetch
plan:
- action: cargo_clippy
- action: cargo_build
- action: shell
  shell: |
    export PATH=/root/.cargo/bin:/bin:/sbin
    export CARGO_TARGET_DIR=/workspace/cache
    make subplot
~~~

This lets Ambient do the Rust building successfully, and runs the built obnam binary in the subplot.
CI is happy, merge.
This is where I start wishing for code review, but as I'm the only person on this project, only self-review is available.

Chunk representation

This is where the real fun starts.

Before I get to encrypting chunks, I'll set up chunk representation in general. Chunks in a backup program represent some data from the files being backed up. Chunk sizes vary. Chunks do not have a context, to enable de-duplication: if the same data is in two files, but preceded or followed by different data, the same chunk can be used. This means the data in the chunk needs to be stored only once in the backup repository.

I'm not going to worry now how file data is divided into chunks. That's an interesting problem for another time.

In some systems, chunks are identified using a cryptographically strong checksum (or "hash") of the data. The likelihood of two chunks with different content having the same hash, accidentally or maliciously, is so low that it can be ignored. I don't want to rely on that, because those who research hashes and hash collisions also need to have backups. So I'm going to assign a unique ID for each chunk, and refer to chunks by that ID.

I've previously implemented this so that the ID is assigned by the backup repository ("backup server"), but that means the ID can't also be stored in the chunk itself. This means that if you have the chunk file, but lose the repository mapping of ID to chunk, you're lost. That's not necessarily a likely scenario, but I'd like to guard against it.

Thus, I'll have the backup client choose an ID, embed that in the chunk, offer the chunk+ID to the repository, and if the repository rejects the ID, I'll have the client deal with the rejection in some way that I don't want to decide right now. There are possibilities: a very random ID that is highly unlikely to collide, or a per-client namespace for IDs so that clients will only collide with themselves and can talk to themselves if they do.
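
To make the "client chooses, repository may reject" idea concrete, here is a minimal in-memory sketch. Everything in it is hypothetical: the `Repository` type, the `IdRejected` error, and the retry policy in `store_with_retry` are just one way the interaction could go, and the post deliberately leaves the real policy undecided.

```rust
use std::collections::HashMap;

/// Hypothetical chunk identifier for this sketch.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
struct ChunkId(u128);

/// Returned when the repository already has a chunk with this ID.
#[derive(Debug, PartialEq, Eq)]
struct IdRejected(ChunkId);

/// In-memory stand-in for a backup repository.
#[derive(Default)]
struct Repository {
    chunks: HashMap<ChunkId, Vec<u8>>,
}

impl Repository {
    /// Store a chunk under the ID the client chose, refusing IDs already in use.
    fn put(&mut self, id: ChunkId, chunk: Vec<u8>) -> Result<(), IdRejected> {
        if self.chunks.contains_key(&id) {
            return Err(IdRejected(id));
        }
        self.chunks.insert(id, chunk);
        Ok(())
    }
}

/// One possible client policy: try candidate IDs until one is accepted.
fn store_with_retry(
    repo: &mut Repository,
    candidates: impl IntoIterator<Item = ChunkId>,
    chunk: &[u8],
) -> Option<ChunkId> {
    for id in candidates {
        if repo.put(id, chunk.to_vec()).is_ok() {
            return Some(id);
        }
    }
    None
}
```

With sufficiently random IDs, or a per-client namespace, the retry path should almost never be taken, but the repository still gets the final say.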

I will, however, want to support finding chunks based on the content, which in practice means based on the checksum of the contents. I'll add a checksum to the chunk as well, and the repository (when I get to implementing that) will have a mechanism to look up by the checksum.

However, I don't want to expose the checksum as such. That would leak information: it's possible to sometimes guess the content of a chunk from its checksum.

Instead, I'll have the client encrypt the checksum. I'll call that the "chunk label". This has the benefit of being useful in other ways. Besides holding an encrypted checksum, a label can name a type of chunk. I expect to have a per-client chunk, with encryption keys for that client. If the same backup repository is shared by multiple clients, which I want to do, they can each find their own chunk with their label.

The label does not need to be unique. Many chunks can have the same label.
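
Since labels are not unique, a lookup by label has to return a set of chunk IDs rather than a single chunk. A minimal sketch of such an index follows, assuming labels are opaque bytes and IDs are plain integers. The `LabelIndex` type and its methods are invented for illustration; the repository's real lookup mechanism is not designed yet.

```rust
use std::collections::HashMap;

/// Hypothetical chunk identifier for this sketch.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
struct ChunkId(u128);

/// Hypothetical label: opaque bytes, such as an encrypted checksum.
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
struct Label(Vec<u8>);

/// Maps each label to every chunk that carries it.
#[derive(Default)]
struct LabelIndex {
    by_label: HashMap<Label, Vec<ChunkId>>,
}

impl LabelIndex {
    /// Record that the chunk with this ID carries this label.
    fn insert(&mut self, label: Label, id: ChunkId) {
        self.by_label.entry(label).or_default().push(id);
    }

    /// Return all chunk IDs with the label, in insertion order (empty if none).
    fn find(&self, label: &Label) -> &[ChunkId] {
        self.by_label.get(label).map(Vec::as_slice).unwrap_or(&[])
    }
}
```

The index never looks inside a label, so it works the same whether the bytes are an encrypted checksum or a marker for a per-client chunk.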

Thus, my initial thinking is that a cleartext chunk can be represented in Rust as the following structure:

struct CleartextChunk {
    id: ChunkId,
    label: Label,
    data: Vec<u8>,
}

When this is encrypted and stored on disk, it will look like this:

struct EncryptedChunk {
    associated: Associated,
    ciphertext: Vec<u8>,
}

I'm going to be using AEAD (authenticated encryption with associated data) for encryption. This means the encryption encrypts not just the data, but also authenticates some associated data, and the decryption verifies that the encrypted data has not been modified, and that the associated data is also correct. In other words, this both signs and encrypts the data, and signs the associated data. The associated data is in clear text, but signed. It needs to be in the clear, because it's needed for decryption.

In addition, both encryption and decryption will need the secret key. That won't be stored in the chunk, but is handled otherwise.

I will use the combination of ID and label as associated data:

struct Associated {
    id: ChunkId,
    label: Label,
}

This way, the backup repository can pick apart the associated data, and allow looking up by either ID or label, and the client needs both to decrypt successfully.
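
The point that the repository can pick apart the associated data without the key can be illustrated with a simple framing. The sketch below is not postcard and does no encryption: it uses a hand-rolled layout (16-byte ID, 4-byte label length, label, then ciphertext) purely as a stand-in, with `u128` and `Vec<u8>` assumed in place of the real ChunkId and Label types.

```rust
use std::convert::TryInto;

/// Cleartext part of a stored chunk (stand-in types for ChunkId and Label).
#[derive(Debug, Clone, PartialEq, Eq)]
struct Associated {
    id: u128,
    label: Vec<u8>,
}

#[derive(Debug, Clone, PartialEq, Eq)]
struct EncryptedChunk {
    associated: Associated,
    ciphertext: Vec<u8>,
}

impl EncryptedChunk {
    /// Hypothetical layout: id (16 bytes) | label length (4 bytes) | label | ciphertext.
    fn to_bytes(&self) -> Vec<u8> {
        let mut out = Vec::new();
        out.extend_from_slice(&self.associated.id.to_be_bytes());
        out.extend_from_slice(&(self.associated.label.len() as u32).to_be_bytes());
        out.extend_from_slice(&self.associated.label);
        out.extend_from_slice(&self.ciphertext);
        out
    }

    /// Parse the framing without any key; returns None on truncated input.
    fn from_bytes(bytes: &[u8]) -> Option<Self> {
        let id = u128::from_be_bytes(bytes.get(..16)?.try_into().ok()?);
        let len = u32::from_be_bytes(bytes.get(16..20)?.try_into().ok()?) as usize;
        let end = 20usize.checked_add(len)?;
        let label = bytes.get(20..end)?.to_vec();
        let ciphertext = bytes.get(end..)?.to_vec();
        Some(EncryptedChunk {
            associated: Associated { id, label },
            ciphertext,
        })
    }
}
```

In a real implementation the serialized associated data would also be handed to the AEAD cipher, so that changing the ID or label in the file makes decryption fail.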

I will use the postcard crate for serializing the chunks.

Meta interval

It's been over two hours in this session already, and I've imposed a limit of three hours, so I won't get to actually encrypting today. No worries.

Chunks, continued

I'll create subcommands for encoding and decoding a chunk using postcard:

obnam chunk encode FILENAME --label LABEL
obnam chunk decode FILENAME

I'll implement the actual encoding and decoding in the library part of the obnam crate. That will be my general principle: the actual sub-commands provide the command line interface, but call the library to do the real work.

I'll initially use UUID4 for chunk ids. I'll wrap that in its own type, for type safety.
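
A newtype wrapper for a version 4 UUID might look like the sketch below. It is standard-library only, so it takes 16 random bytes from the caller instead of generating them. A real implementation would more likely wrap the uuid crate, and the method name `from_random_bytes` is made up for this example.

```rust
use std::fmt;

/// Chunk identifier: a newtype, so it can't be mixed up with other byte arrays.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
struct ChunkId([u8; 16]);

impl ChunkId {
    /// Build a version 4 UUID from 16 bytes the caller got from a random source.
    fn from_random_bytes(mut bytes: [u8; 16]) -> Self {
        bytes[6] = (bytes[6] & 0x0f) | 0x40; // version field: 4
        bytes[8] = (bytes[8] & 0x3f) | 0x80; // variant field: binary 10
        ChunkId(bytes)
    }
}

impl fmt::Display for ChunkId {
    /// Format in the usual 8-4-4-4-12 hexadecimal form.
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        for (i, byte) in self.0.iter().enumerate() {
            if matches!(i, 4 | 6 | 8 | 10) {
                write!(f, "-")?;
            }
            write!(f, "{:02x}", byte)?;
        }
        Ok(())
    }
}
```

Keeping the UUID behind its own type means the ID scheme can change later without touching code that only passes ChunkId values around.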

Should labels be arbitrary binary data or text? Some of them will be encrypted checksums and it'll save a little space if those don't need to be encoded as text.

Got the sub-commands to encode and decode chunks, but didn't add acceptance criteria yet. Will continue from here next time. Pushed what I did to a new patch, 3d24321.

Summary

Got started. Didn't quite get everything done that I had hoped, but got somewhere useful.

Comments?

If you have feedback on this development session, please use the following fediverse thread: https://toot.liw.fi/@liw/114171104856444046