I'm starting a new "Obnam 3" project. See also thoughts I've written up. This is the first "development note" where I document what I do, with the intention that others can follow and see what I do and why.
Goal
The goal for today is to get started.
- I will create a new Git repository, with enough scaffolding to make it easy for me to add new sub-commands. I expect to make a lot of sub-commands to provide functionality that can be tested. There will also be support for adding automated tests. I will set up CI for this too.
- I will start implementing chunks. I will create the command `obnam chunk encrypt` that reads cleartext data and encrypts it, and the corresponding `obnam chunk decrypt` that decrypts it.
The acceptance criterion for today is that all of the above works.
Plan
- Create a new Git repository, with a README, and put that on Radicle.
- Create the structure of a program.
- Set up automated acceptance testing with Subplot.
- Set up CI on my personal CI setup.
- Implement chunk encryption and decryption.
Notes
Create a new Git repository, put it on Radicle
- `cd ~/pers/obnam`
- `git init git`
- create `README.md` and commit it
- `rad init`
- The repository id is `rad:zbWNQYkQ4QKgdSQcd1tjaemv6d6x`
- Add it to my public seed node https://app.radicle.xyz/nodes/radicle.liw.fi/rad:zbWNQYkQ4QKgdSQcd1tjaemv6d6x
Create the structure of a Rust program, with sub-commands
- New Git branch.
- `cargo init --name obnam`
- Depend on `clap` with the `derive` feature. That's my preferred way to add command line parsing to a Rust program.
- Also on `thiserror` for nicer error messages for users.
- Add a `Makefile` for my convenience. I know there is `xtask`, and probably others, but I've not learned them so `make` it is. I could also write a shell script, but `make` is fine.
- Started setting up the scaffolding for a sub-command structure. I like to define a trait `Leaf` for the lowest level sub-commands, which don't have sub-commands of their own, and put those leaf-commands in their own modules, grouped so that all the leaves of a parent sub-command are in the same module. Then the main program defines the parent sub-command structure all in one place. I find this makes for a clear, tidy structure that's reasonably easy to change and maintain.
- Ran into a problem. I have
use clap::Parser;
use obnam::cmd;

fn main() {
    println!("Hello, world!");
}

#[derive(Parser)]
struct Args {
    #[clap(subcommand)]
    cmd: Cmd,
}

#[derive(Parser)]
enum Cmd {
    Chunk(ChunkCmd),
}

#[derive(Parser)]
enum ChunkCmd {
    Hello(cmd::chunk::Hello),
}
- The compiler complains that `ChunkCmd` doesn't implement the `clap::Args` trait, yet it should, given the derive.
- Oh! `ChunkCmd` needs to be a `struct`, not an `enum`. D'oh.
- This works:
use clap::Parser;
use obnam::cmd;

fn main() {
    println!("Hello, world!");
}

#[derive(Parser)]
struct Args {
    #[clap(subcommand)]
    cmd: Cmd,
}

#[derive(Parser)]
enum Cmd {
    Chunk(ChunkCmd),
}

#[derive(Parser)]
struct ChunkCmd {
    #[clap(subcommand)]
    cmd: ChunkSubCmd,
}

#[derive(Parser)]
enum ChunkSubCmd {
    Hello(cmd::chunk::Hello),
}
- Add a main program that actually parses the command line and executes sub-commands. I like to have a `main` that calls a `fallible_main`, to make error reporting easier.
use std::process::exit;

fn main() {
    if let Err(e) = fallible_main() {
        eprintln!("ERROR: {}", e);
        // Walk the chain of underlying causes, if any.
        let mut e = e.source();
        while let Some(source) = e {
            eprintln!("caused by: {}", source);
            e = source.source();
        }
        exit(1);
    }
}

fn fallible_main() -> Result<(), Box<dyn std::error::Error>> {
    let args = Args::parse();
    args.cmd.run(&args)?;
    Ok(())
}
I also like the `Box<dyn std::error::Error>` construct over the `anyhow` crate. I prefer each module to have its own error type, but this construct avoids having to convert or wrap errors from called sub-commands.
- Merge the branch to `main`.
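As a rough illustration of the `Leaf` arrangement and the `Box<dyn std::error::Error>` construct, here is a minimal sketch that uses only the standard library. The `run` signature, the `ChunkError` type, and the `read_chunk` and `report` helpers are assumptions made for this example, not the actual Obnam code, and `clap` and `thiserror` are left out to keep it self-contained.

```rust
use std::error::Error;
use std::fmt;

/// Hypothetical per-module error type (hand-written here instead of using `thiserror`).
#[derive(Debug)]
enum ChunkError {
    Read(String, std::io::Error),
}

impl fmt::Display for ChunkError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            ChunkError::Read(name, _) => write!(f, "failed to read {}", name),
        }
    }
}

impl Error for ChunkError {
    // Expose the underlying I/O error so the caller can print the whole chain.
    fn source(&self) -> Option<&(dyn Error + 'static)> {
        match self {
            ChunkError::Read(_, e) => Some(e),
        }
    }
}

/// Assumed shape of the trait for lowest-level sub-commands.
trait Leaf {
    fn run(&self) -> Result<(), Box<dyn Error>>;
}

/// A leaf command that reads a file and reports its size.
struct Hello {
    filename: String,
}

/// Library-style helper that returns the module's own error type.
fn read_chunk(filename: &str) -> Result<Vec<u8>, ChunkError> {
    std::fs::read(filename).map_err(|e| ChunkError::Read(filename.to_string(), e))
}

impl Leaf for Hello {
    fn run(&self) -> Result<(), Box<dyn Error>> {
        // `?` boxes the ChunkError; no manual conversion or wrapping is needed.
        let data = read_chunk(&self.filename)?;
        println!("read {} bytes", data.len());
        Ok(())
    }
}

/// Format an error and its chain of sources, one line each.
fn report(e: &dyn Error) -> Vec<String> {
    let mut lines = vec![format!("ERROR: {}", e)];
    let mut source = e.source();
    while let Some(s) = source {
        lines.push(format!("caused by: {}", s));
        source = s.source();
    }
    lines
}

fn main() {
    let cmd = Hello { filename: "/nonexistent/chunk.dat".to_string() };
    if let Err(e) = cmd.run() {
        for line in report(&*e) {
            eprintln!("{}", line);
        }
    }
}
```

With a file that does not exist, the first line names the file and the second line carries the operating system's own message, which is the kind of output the `main` and `fallible_main` split is meant to produce.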
Acceptance testing with Subplot
I co-authored the Subplot program for documenting acceptance criteria and how they are verified. Obviously I will want to use it here. I like to set up the structure for this early on, even if "acceptance criteria" is a bit silly at this stage.
- Create a new branch.
- Add the Subplot files.
- I choose to use Python, as implementing Subplot scenarios in Rust is not as easy as I'd like it to be.
- Added scaffolding and a placeholder scenario that can be dropped once `obnam` does something useful.
- Merge into `main`.
Set up CI
- I use Radicle CI with the Ambient adapter. My software, again. I am the biggest not-invented-here buffoon in the world.
- At this stage, I'll just have CI run `make`.
- Added `.radicle/ambient.yaml`:

  ~~~yaml
  pre_plan:
  - action: cargo_fetch
  plan:
  - action: cargo_fmt
  - action: shell
    shell: |
      make
  ~~~
- Added the `obnam` repository to my Radicle CI node.
- First CI run fails: `cargo fmt --check` is not happy.
- Run the `rad-ci` command I also wrote, which emulates CI locally, for easier testing of CI failures.
- `cargo fmt --check` fails in that, too, which is good.
- Also when I run it locally.
- For some reason, my editor auto-formats this wrongly.
- I'm not in the mood for fighting Emacs and `rustfmt` today, so I'll drop that check from CI.
- CI still fails, because of a problem in Ambient, which doesn't set all the environment variables for all the actions that it runs, and so the Rust toolchain can't be found in the `shell` action. Easily fixed with `export PATH=/root/.cargo/bin:/bin:/sbin`.
- Next failure: `cargo` wants to download its index, which doesn't work under Ambient, on purpose.
- I'll split the `Makefile` so that the `cargo` steps are in their own targets, and the Subplot running in its own. Then I'll change `.radicle/ambient.yaml` to be:

  ~~~yaml
  pre_plan:
  - action: cargo_fetch
  plan:
  - action: cargo_clippy
  - action: cargo_build
  - action: shell
    shell: |
      export PATH=/root/.cargo/bin:/bin:/sbin
      export CARGO_TARGET_DIR=/workspace/cache
      make subplot
  ~~~
- This lets Ambient do the Rust building successfully, and runs the built `obnam` binary in the subplot.
- CI is happy, merge.
- This is where I start wishing for code review, but as I'm the only person on this project, only self-review is available.
Chunk representation
This is where the real fun starts.
Before I get to encrypting chunks, I'll set up chunk representation in general. Chunks in a backup program represent some data from the files being backed up. Chunk sizes vary. Chunks do not have a context, to enable de-duplication: if the same data is in two files, but preceded or followed by different data, the same chunk can be used. This means the data in the chunk needs only be stored once in the backup repository.
I'm not going to worry now how file data is divided into chunks. That's an interesting problem for another time.
In some systems, chunks are identified using a cryptographically strong check sum (or "hash") of the data. The likelihood of two chunks with different content having the same hash, accidentally or maliciously, is so low that it can be ignored. I don't want to rely on that, because those who research hashes and hash collisions also need to have backups. So I'm going to assign a unique ID for each chunk, and refer to chunks by that ID.
I've previously implemented this so that the ID is assigned by the backup repository ("backup server"), but that means the ID can't also be stored in the chunk itself. This means that if you have the chunk file, but lose the repository mapping of ID to chunk, you're lost. That's not necessarily a likely scenario, but I'd like to guard against it.
Thus, I'll have the backup client choose an ID, embed that in the chunk, offer the chunk+ID to the repository, and if the repository rejects the ID, I'll have the client deal with the rejection in some way that I don't want to decide right now. There are possibilities: a very random ID that is highly unlikely to collide, or a per-client namespace for IDs so that clients will only collide with themselves and can talk to themselves if they do.
I will, however, want to support finding chunks based on the content, which in practice means based on the checksum of the contents. I'll add a checksum to the chunk as well, and the repository (when I get to implementing that) will have a mechanism to look up by the checksum.
However, I don't want to expose the checksum as such. That would leak information: it's possible to sometimes guess the content of a chunk from its checksum.
Instead, I'll have the client encrypt the checksum. I'll call that the "chunk label". This has the benefit of being useful in other ways. In addition to being an encrypted checksum, a label can indicate the type of a chunk. I expect to have a per-client chunk, with encryption keys for that client. If the same backup repository is shared by multiple clients, which I want to do, they can each find their own chunk with their label.
The label does not need to be unique. Many chunks can have the same label.
Thus, my initial thinking is that a cleartext chunk can be represented in Rust as the following structure:
struct CleartextChunk {
id: ChunkId,
label: Label,
data: Vec<u8>,
}
When this is encrypted and stored on disk, it will look like this:
struct EncryptedChunk {
associated: Associated,
ciphertext: Vec<u8>,
}
I'm going to be using AEAD (authenticated encryption with associated data) for encryption. This means the encryption covers not just the data, but also some associated data: the data is encrypted, and the decryption verifies that the encrypted data has not been modified, and that the associated data is also correct. In other words, this encrypts the data and signs both the data and the associated data. The associated data is in clear text, but signed. It needs to be in the clear, because it's needed for decryption.
In addition, both encryption and decryption will need the secret key. That won't be stored in the chunk, but is handled otherwise.
I will use the combination of ID and label as associated data:
struct Associated {
id: ChunkId,
label: Label,
}
This way, the backup repository can pick apart the associated data, and allow looking up by either ID or label, and the client needs both to decrypt successfully.
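To make the lookup idea concrete, here is a small in-memory sketch of a repository that indexes encrypted chunks by both ID and label, with labels allowed to repeat and duplicate IDs rejected. Everything beyond the `Associated` and `EncryptedChunk` shapes is an assumption for illustration: the `Repository` type, its `put`, `get` and `find` methods, the `RepoError` type, and the use of a plain string as a stand-in for the chunk ID.

```rust
use std::collections::HashMap;

// Placeholder ID type for this sketch only.
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
struct ChunkId(String);

#[derive(Clone, Debug, PartialEq, Eq, Hash)]
struct Label(Vec<u8>);

#[derive(Clone, Debug, PartialEq, Eq)]
struct Associated {
    id: ChunkId,
    label: Label,
}

#[derive(Clone, Debug)]
struct EncryptedChunk {
    associated: Associated,
    ciphertext: Vec<u8>,
}

/// Hypothetical error for when the client-chosen ID is already in use.
#[derive(Debug, PartialEq, Eq)]
enum RepoError {
    IdTaken(ChunkId),
}

/// Hypothetical in-memory repository; it only ever sees cleartext associated data.
#[derive(Default)]
struct Repository {
    chunks: HashMap<ChunkId, EncryptedChunk>,
    by_label: HashMap<Label, Vec<ChunkId>>,
}

impl Repository {
    /// Store a chunk, rejecting it if its ID is already taken.
    fn put(&mut self, chunk: EncryptedChunk) -> Result<(), RepoError> {
        let Associated { id, label } = chunk.associated.clone();
        if self.chunks.contains_key(&id) {
            return Err(RepoError::IdTaken(id));
        }
        self.by_label.entry(label).or_default().push(id.clone());
        self.chunks.insert(id, chunk);
        Ok(())
    }

    /// Look up the single chunk with a given ID.
    fn get(&self, id: &ChunkId) -> Option<&EncryptedChunk> {
        self.chunks.get(id)
    }

    /// Look up all chunks that share a label; there may be none or many.
    fn find(&self, label: &Label) -> Vec<&EncryptedChunk> {
        self.by_label
            .get(label)
            .into_iter()
            .flatten()
            .filter_map(|id| self.chunks.get(id))
            .collect()
    }
}

fn main() {
    let mut repo = Repository::default();
    let label = Label(b"same-label".to_vec());
    for id in ["chunk-1", "chunk-2"] {
        let chunk = EncryptedChunk {
            associated: Associated { id: ChunkId(id.to_string()), label: label.clone() },
            ciphertext: vec![0; 4], // stand-in bytes, not real ciphertext
        };
        repo.put(chunk).expect("fresh ID");
    }
    for chunk in repo.find(&label) {
        println!("{:?}: {} bytes", chunk.associated.id, chunk.ciphertext.len());
    }
}
```

What the client does after an `IdTaken` rejection is left open here, just as it is in the text above.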
I will use the `postcard` crate for serializing the chunks.
Meta interval
It's been over two hours in this session already, and I've imposed a limit of three hours, so I won't get to actually encrypting today. No worries.
Chunks, continued
I'll create subcommands for encoding and decoding a chunk using postcard:
obnam chunk encode FILENAME --label LABEL
obnam chunk decode FILENAME
I'll implement the actual encoding and decoding in the library part of the `obnam` crate. That will be my general principle: the actual sub-commands provide the command line interface, but call the library to do the real work.
I'll initially use UUID4 for chunk ids. I'll wrap that in its own type, for type safety.
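A wrapper type along these lines could look like the sketch below. It is written against the standard library only, so instead of a UUID crate it takes 16 caller-supplied random bytes and sets the version and variant bits that RFC 4122 defines for version 4. The name `from_random_bytes` and the internal byte-array representation are assumptions for this example.

```rust
use std::fmt;

/// Hypothetical wrapper type for chunk ids; holds the 16 bytes of a UUID.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
struct ChunkId([u8; 16]);

impl ChunkId {
    /// Build a version 4 UUID from 16 random bytes supplied by the caller.
    fn from_random_bytes(mut bytes: [u8; 16]) -> Self {
        bytes[6] = (bytes[6] & 0x0f) | 0x40; // version 4 in the high nibble
        bytes[8] = (bytes[8] & 0x3f) | 0x80; // RFC 4122 variant in the top two bits
        ChunkId(bytes)
    }
}

impl fmt::Display for ChunkId {
    /// Format in the usual 8-4-4-4-12 hexadecimal layout.
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        for (i, byte) in self.0.iter().enumerate() {
            if matches!(i, 4 | 6 | 8 | 10) {
                write!(f, "-")?;
            }
            write!(f, "{:02x}", byte)?;
        }
        Ok(())
    }
}

fn main() {
    // Fixed bytes keep the example reproducible; real code needs a secure random source.
    let id = ChunkId::from_random_bytes([0xff; 16]);
    println!("{}", id); // prints ffffffff-ffff-4fff-bfff-ffffffffffff
}
```

Because `ChunkId` is its own type, a function that expects an ID cannot be handed a label or some other 16-byte value by mistake.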
Should labels be arbitrary binary data or text? Some of them will be encrypted checksums and it'll save a little space if those don't need to be encoded as text.
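One way to weigh the two options is to store labels as bytes and only convert to text at the command line boundary. The sketch below assumes a hypothetical `Label` wrapper with `to_hex` and `from_hex` helpers, and uses hexadecimal as the text form purely as an example; it shows that a 32-byte label takes 64 characters when written as hex.

```rust
/// Hypothetical label type holding arbitrary bytes.
#[derive(Debug, PartialEq, Eq)]
struct Label(Vec<u8>);

impl Label {
    /// Text form for use on the command line: two hex digits per byte.
    fn to_hex(&self) -> String {
        self.0.iter().map(|b| format!("{:02x}", b)).collect()
    }

    /// Parse the text form back into bytes; returns None for malformed input.
    fn from_hex(s: &str) -> Option<Self> {
        if s.len() % 2 != 0 || !s.bytes().all(|b| b.is_ascii_hexdigit()) {
            return None;
        }
        (0..s.len())
            .step_by(2)
            .map(|i| u8::from_str_radix(&s[i..i + 2], 16).ok())
            .collect::<Option<Vec<u8>>>()
            .map(Label)
    }
}

fn main() {
    // Stand-in for a 32-byte encrypted checksum.
    let label = Label(vec![0xab; 32]);
    let text = label.to_hex();
    println!("binary: {} bytes, hex text: {} bytes", label.0.len(), text.len());
    assert_eq!(Label::from_hex(&text), Some(label));
}
```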
Got the sub-commands to encode and decode chunks, but didn't add acceptance criteria yet. Will continue from here next time. Pushed what I did to a new patch, `3d24321`.
Summary
Got started. Didn't quite get everything done that I had hoped, but got somewhere useful.
Comments?
If you have feedback on this development session, please use the following fediverse thread: https://toot.liw.fi/@liw/114171104856444046