I'm starting a new "Obnam 3" project. See also thoughts I've written up. This is the first "development note" where I document what I do, with the intention that others can follow and see what I do and why.
Goal
The goal for today is to get started.
- I will create a new Git repository, with enough scaffolding to make it easy for me to add new sub-commands. I expect to make a lot of sub-commands to provide functionality that can be tested. There will also be support for adding automated tests. I will set up CI for this too.
- I will start implementing chunks. I will create the command `obnam chunk encrypt` that reads cleartext data and encrypts it, and the corresponding `obnam chunk decrypt` that decrypts it.
Acceptance criteria for today is that all of the above works.
Plan
- Create a new Git repository, with a README, and put that on Radicle.
- Create the structure of a program.
- Set up automated acceptance testing with Subplot.
- Set up CI on my personal CI setup.
- Implement chunk encryption and decryption.
Notes
Create a new Git repository, put it on Radicle
- `cd ~/pers/obnam`
- `git init`
- create `README.md` and commit it
- `rad init`
- The repository id is `rad:zbWNQYkQ4QKgdSQcd1tjaemv6d6x`
- Add it to my public seed node https://app.radicle.xyz/nodes/radicle.liw.fi/rad:zbWNQYkQ4QKgdSQcd1tjaemv6d6x
Create the structure of a Rust program, with sub-commands
- New Git branch.
- `cargo init --name obnam`
- Depend on `clap` with the `derive` feature. That's my preferred way to add command line parsing to a Rust program.
- Also on `thiserror` for nicer error messages for users.
- Add a `Makefile` for my convenience. I know there is `xtask`, and probably others, but I've not learned them so `make` it is. I could also write a shell script, but `make` is fine.
- Started setting up the scaffolding for a sub-command structure. I like to define a trait `Leaf` for the lowest level sub-commands, which don't have sub-commands of their own, and put those leaf-commands in their own modules, grouped so that all the leaves of a parent sub-command are in the same module. Then the main program defines the parent sub-command structure all in one place. I find this makes for a clear, tidy structure that's reasonably easy to change and maintain.
- Ran into a problem. I have
use clap::Parser;
use obnam::cmd;

fn main() {
    println!("Hello, world!");
}

#[derive(Parser)]
struct Args {
    #[clap(subcommand)]
    cmd: Cmd,
}

#[derive(Parser)]
enum Cmd {
    Chunk(ChunkCmd),
}

#[derive(Parser)]
enum ChunkCmd {
    Hello(cmd::chunk::Hello),
}
- The compiler complains that `ChunkCmd` doesn't implement the `clap::Args` trait, yet it should, given the derive.
- Oh! `ChunkCmd` needs to be a `struct`, not an `enum`. D'oh.
- This works:
use clap::Parser;
use obnam::cmd;

fn main() {
    println!("Hello, world!");
}

#[derive(Parser)]
struct Args {
    #[clap(subcommand)]
    cmd: Cmd,
}

#[derive(Parser)]
enum Cmd {
    Chunk(ChunkCmd),
}

#[derive(Parser)]
struct ChunkCmd {
    #[clap(subcommand)]
    cmd: ChunkSubCmd,
}

#[derive(Parser)]
enum ChunkSubCmd {
    Hello(cmd::chunk::Hello),
}
- Add a main program that actually parses the command line and executes sub-commands. I like to have a `main` that calls a `fallible_main`, to make error reporting easier.
use std::process::exit;

fn main() {
    if let Err(e) = fallible_main() {
        // Report the error itself, then walk its chain of causes.
        eprintln!("ERROR: {}", e);
        let mut e = e.source();
        while let Some(source) = e {
            eprintln!("caused by: {}", source);
            e = source.source();
        }
        exit(1);
    }
}

fn fallible_main() -> Result<(), Box<dyn std::error::Error>> {
    let args = Args::parse();
    // Dispatch to whichever sub-command the user picked.
    args.cmd.run(&args)?;
    Ok(())
}
I also like the `Box<dyn std::error::Error>` construct over the `anyhow` crate. I prefer each module to have its own error type, but this construct avoids having to convert or wrap errors from called sub-commands.
- Merge the branch to `main`.
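The leaf-command structure described above could be sketched, without `clap`, roughly as follows. Only the `Leaf` name and the nesting of `Cmd` and `ChunkSubCmd` come from the notes; the `Hello` fields, the `verbose` option and the `ChunkError` type are assumptions made up for illustration.

```rust
use std::error::Error;
use std::fmt;

/// Hypothetical error type for the `chunk` group of sub-commands.
#[derive(Debug)]
enum ChunkError {
    EmptyName,
}

impl fmt::Display for ChunkError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            ChunkError::EmptyName => write!(f, "name must not be empty"),
        }
    }
}

impl Error for ChunkError {}

/// Parsed command line: global options plus the chosen sub-command.
struct Args {
    verbose: bool, // assumed global option, not from the notes
    cmd: Cmd,
}

/// A leaf sub-command: one that has no sub-commands of its own.
trait Leaf {
    fn run(&self, args: &Args) -> Result<(), Box<dyn Error>>;
}

/// Example leaf, standing in for `cmd::chunk::Hello`.
struct Hello {
    name: String,
}

impl Leaf for Hello {
    fn run(&self, args: &Args) -> Result<(), Box<dyn Error>> {
        if self.name.is_empty() {
            // The module's own error type is boxed without any wrapping.
            return Err(Box::new(ChunkError::EmptyName));
        }
        if args.verbose {
            eprintln!("running chunk hello");
        }
        println!("hello, {}", self.name);
        Ok(())
    }
}

/// The parent structure is defined in one place.
enum Cmd {
    Chunk(ChunkSubCmd),
}

enum ChunkSubCmd {
    Hello(Hello),
}

impl Cmd {
    /// Find the leaf the user picked and run it.
    fn run(&self, args: &Args) -> Result<(), Box<dyn Error>> {
        match self {
            Cmd::Chunk(ChunkSubCmd::Hello(leaf)) => leaf.run(args),
        }
    }
}
```

With this shape, `args.cmd.run(&args)` in `fallible_main` only needs to know about the top-level enum; adding a leaf means adding one variant and one match arm.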
Acceptance testing with Subplot
I co-authored the Subplot program for documenting acceptance criteria and how they are verified. Obviously I will want to use it here. I like to set up the structure for this early on, even if "acceptance criteria" is a bit silly at this stage.
- Create a new branch.
- Add the Subplot files.
- I choose to use Python, as implementing Subplot scenarios in Rust is not as easy as I'd like it to be.
- Added scaffolding and a placeholder scenario that can be dropped once `obnam` does something useful.
- Merge into `main`.
Set up CI
- I use Radicle CI with the Ambient adapter. My software, again. I am the biggest not-invented-here buffoon in the world.
- At this stage, I'll just have CI run `make`.
- Added `.radicle/ambient.yaml`:

~~~yaml
pre_plan:
- action: cargo_fetch
plan:
- action: cargo_fmt
- action: shell
  shell: |
    make
~~~

- Added the `obnam` repository to my Radicle CI node.
- First CI run fails: `cargo fmt --check` is not happy.
- Run the `rad-ci` command I also wrote, which emulates CI locally, for easier testing of CI failures.
- `cargo fmt --check` fails in that, too, which is good.
- Also when I run it locally.
- For some reason, my editor auto-formats this wrongly.
- I'm not in the mood for fighting Emacs and `rustfmt` today, so I'll drop that check from CI.
- CI still fails, because of a problem in Ambient, which doesn't set all the environment variables for all the actions that it runs, and so the Rust toolchain can't be found in the `shell` action. Easily fixed with `export PATH=/root/.cargo/bin:/bin:/sbin`.
- Next failure: `cargo` wants to download its index, which doesn't work under Ambient, on purpose.
- I'll split the `Makefile` so that the `cargo` steps are in their own targets, and the Subplot running in its own. Then I'll change `.radicle/ambient.yaml` to be:

~~~yaml
pre_plan:
- action: cargo_fetch
plan:
- action: cargo_clippy
- action: cargo_build
- action: shell
  shell: |
    export PATH=/root/.cargo/bin:/bin:/sbin
    export CARGO_TARGET_DIR=/workspace/cache
    make subplot
~~~

- This lets Ambient do the Rust building successfully, and runs the built `obnam` binary in the subplot.
- CI is happy, merge.
- This is where I start wishing for code review, but as I'm the only person on this project, only self-review is available.
Chunk representation
This is where the real fun starts.
Before I get to encrypting chunks, I'll set up chunk representation in general. Chunks in a backup program represent some data from the files being backed up. Chunk sizes vary. Chunks do not have a context, to enable de-duplication: if the same data is in two files, but preceded or followed by different data, the same chunk can be used. This means the data in the chunk needs only be stored once in the backup repository.
I'm not going to worry now how file data is divided into chunks. That's an interesting problem for another time.
In some systems, chunks are identified using a cryptographically strong checksum (or "hash") of the data. The likelihood of two chunks with different content having the same hash, accidentally or maliciously, is so low that it can be ignored. I don't want to rely on that, because those who research hashes and hash collisions also need to have backups. So I'm going to assign a unique ID for each chunk, and refer to chunks by that ID.
I've previously implemented this so that the ID is assigned by the backup repository ("backup server"), but that means the ID can't also be stored in the chunk itself. This means that if you have the chunk file, but lose the repository mapping of ID to chunk, you're lost. That's not necessarily a likely scenario, but I'd like to guard against it.
Thus, I'll have the backup client choose an ID, embed that in the chunk, offer the chunk+ID to the repository, and if the repository rejects the ID, I'll have the client deal with the rejection in some way that I don't want to decide right now. There are possibilities: a very random ID that is highly unlikely to collide, or a per-client namespace for IDs so that clients will only collide with themselves and can talk to themselves if they do.
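To make the client-chooses-ID idea concrete, here is a minimal sketch of the per-client namespace variant. It assumes an in-memory `Repository`, an ID made of a client name plus a counter, and a "pick the next ID and retry" response to rejection. All of these are illustrative choices, not decisions the notes have made.

```rust
use std::collections::HashMap;

/// Hypothetical chunk ID: a client name plus a per-client serial number.
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
struct ChunkId {
    client: String,
    serial: u64,
}

/// Returned by the repository when an offered ID is already taken.
#[derive(Debug, PartialEq, Eq)]
struct IdRejected;

/// In-memory stand-in for a backup repository.
#[derive(Default)]
struct Repository {
    chunks: HashMap<ChunkId, Vec<u8>>,
}

impl Repository {
    /// Store a chunk under the ID the client chose, unless that ID is taken.
    fn offer(&mut self, id: ChunkId, data: Vec<u8>) -> Result<(), IdRejected> {
        if self.chunks.contains_key(&id) {
            return Err(IdRejected);
        }
        self.chunks.insert(id, data);
        Ok(())
    }
}

struct Client {
    name: String,
    next_serial: u64,
}

impl Client {
    fn new(name: &str) -> Self {
        Client { name: name.to_string(), next_serial: 0 }
    }

    /// Choose the next ID in this client's own namespace.
    fn next_id(&mut self) -> ChunkId {
        let id = ChunkId { client: self.name.clone(), serial: self.next_serial };
        self.next_serial += 1;
        id
    }

    /// Offer a chunk; on rejection, pick a fresh ID and try again.
    /// Retrying is only one possible way to deal with a rejection.
    fn store(&mut self, repo: &mut Repository, data: &[u8]) -> ChunkId {
        loop {
            let id = self.next_id();
            if repo.offer(id.clone(), data.to_vec()).is_ok() {
                return id;
            }
        }
    }
}
```

Because the client name is part of the ID, two different clients can never collide; a client that has lost its counter only collides with its own earlier chunks and recovers by retrying.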
I will, however, want to support finding chunks based on the content, which in practice means based on the checksum of the contents. I'll add a checksum to the chunk as well, and the repository (when I get to implementing that) will have a mechanism to look up by the checksum.
However, I don't want to expose the checksum as such. That would leak information: it's possible to sometimes guess the content of a chunk from its checksum.
Instead, I'll have the client encrypt the checksum. I'll call that the "chunk label". This has the benefit of being useful in other ways. In addition to being an encrypted checksum, a label can indicate the type of a chunk. I expect to have a per-client chunk, with encryption keys for that client. If the same backup repository is shared by multiple clients, which I want to do, they can each find their own chunk with their label.
The label does not need to be unique. Many chunks can have the same label.
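As an illustration of looking up chunks by a non-unique label, the following sketch keeps a label index next to the ID map and uses it for de-duplication. Two loud assumptions: `label_for` uses the standard library's non-cryptographic `DefaultHasher` with a key mixed in, purely as a stand-in for an encrypted cryptographic checksum, and the repository holds cleartext so that candidates can be compared byte for byte. Neither is what a real backup program would do.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Simplified chunk ID for this sketch.
type ChunkId = u64;

#[derive(Clone, Debug, PartialEq, Eq, Hash)]
struct Label(Vec<u8>);

/// Stand-in for "encrypted checksum": a keyed, NON-cryptographic hash.
fn label_for(key: &[u8], data: &[u8]) -> Label {
    let mut hasher = DefaultHasher::new();
    key.hash(&mut hasher);
    data.hash(&mut hasher);
    Label(hasher.finish().to_be_bytes().to_vec())
}

#[derive(Default)]
struct Repository {
    chunks: HashMap<ChunkId, Vec<u8>>,
    by_label: HashMap<Label, Vec<ChunkId>>,
}

impl Repository {
    /// Store a chunk and index it under its label. Labels may repeat.
    fn put(&mut self, id: ChunkId, label: Label, data: Vec<u8>) {
        self.by_label.entry(label).or_default().push(id);
        self.chunks.insert(id, data);
    }

    /// All chunk IDs that carry the given label, possibly none.
    fn find_by_label(&self, label: &Label) -> &[ChunkId] {
        self.by_label.get(label).map(Vec::as_slice).unwrap_or(&[])
    }

    fn get(&self, id: ChunkId) -> Option<&[u8]> {
        self.chunks.get(&id).map(Vec::as_slice)
    }
}

/// Reuse an existing chunk with identical data, or store a new one under
/// `new_id`. The label only narrows the search; the data comparison
/// decides, so a checksum collision cannot cause a wrong match.
fn store_deduplicated(
    repo: &mut Repository,
    key: &[u8],
    new_id: ChunkId,
    data: &[u8],
) -> ChunkId {
    let label = label_for(key, data);
    let existing = repo
        .find_by_label(&label)
        .iter()
        .copied()
        .find(|id| repo.get(*id) == Some(data));
    match existing {
        Some(id) => id,
        None => {
            repo.put(new_id, label, data.to_vec());
            new_id
        }
    }
}
```

Storing the same bytes twice returns the ID of the first copy, so the data is kept only once, while chunks that merely share a label stay distinct.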
Thus, my initial thinking is that a cleartext chunk can be represented in Rust as the following structure:
struct CleartextChunk {
    id: ChunkId,
    label: Label,
    data: Vec<u8>,
}
When this is encrypted and stored on disk, it will look like this:
struct EncryptedChunk {
    associated: Associated,
    ciphertext: Vec<u8>,
}
I'm going to be using AEAD (authenticated encryption with associated data) for encryption. This means the encryption protects not just the data, but also some associated data: the decryption verifies that the encrypted data has not been modified, and that the associated data is also correct. In other words, the data is both encrypted and authenticated, and the associated data is authenticated. The associated data is in clear text, but "signed" in that sense. It needs to be in the clear, because it's needed for decryption.
In addition, both encryption and decryption will need the secret key. That won't be stored in the chunk, but is handled otherwise.
I will use the combination of ID and label as associated data:
struct Associated {
    id: ChunkId,
    label: Label,
}
This way, the backup repository can pick apart the associated data, and allow looking up by either ID or label, and the client needs both to decrypt successfully.
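The way associated data is bound to the ciphertext can be shown with a toy. The sketch below is deliberately NOT secure: it XORs the data with a repeating key and uses the standard library's non-cryptographic `DefaultHasher` as a tag, only so that the example runs without external crates. It is a stand-in for a real AEAD cipher, and `seal`, `open` and the simplified `id` type are made-up names for illustration.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Cleartext metadata stored next to the ciphertext (simplified types).
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
struct Associated {
    id: u128,
    label: Vec<u8>,
}

struct EncryptedChunk {
    associated: Associated,
    ciphertext: Vec<u8>, // toy layout: XORed body followed by an 8-byte tag
}

/// TOY tag over key, associated data and body. Not a real MAC.
fn tag(key: &[u8], associated: &Associated, body: &[u8]) -> [u8; 8] {
    let mut hasher = DefaultHasher::new();
    key.hash(&mut hasher);
    associated.hash(&mut hasher);
    body.hash(&mut hasher);
    hasher.finish().to_be_bytes()
}

/// TOY cipher: XOR with a repeating key. Not real encryption.
fn xor(key: &[u8], data: &[u8]) -> Vec<u8> {
    assert!(!key.is_empty(), "key must not be empty");
    data.iter().zip(key.iter().cycle()).map(|(d, k)| d ^ k).collect()
}

/// "Encrypt" the data and bind the associated data to it via the tag.
fn seal(key: &[u8], associated: Associated, cleartext: &[u8]) -> EncryptedChunk {
    let mut ciphertext = xor(key, cleartext);
    let t = tag(key, &associated, &ciphertext);
    ciphertext.extend_from_slice(&t);
    EncryptedChunk { associated, ciphertext }
}

/// Return the cleartext only if the key, the associated data and the
/// ciphertext all match what was sealed.
fn open(key: &[u8], chunk: &EncryptedChunk) -> Option<Vec<u8>> {
    let split = chunk.ciphertext.len().checked_sub(8)?;
    let (body, stored_tag) = chunk.ciphertext.split_at(split);
    if tag(key, &chunk.associated, body).as_slice() != stored_tag {
        return None;
    }
    Some(xor(key, body))
}
```

The repository can read `associated.id` and `associated.label` without any key, while changing either of them (or the ciphertext) makes `open` fail for the client.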
I will use the postcard crate for serializing the chunks.
Meta interval
It's been over two hours in this session already, and I've imposed a limit of three hours, so I won't get to actually encrypting today. No worries.
Chunks, continued
I'll create subcommands for encoding and decoding a chunk using postcard:
obnam chunk encode FILENAME --label LABEL
obnam chunk decode FILENAME
I'll implement the actual encoding and decoding in the library part of
the obnam crate. That will be my general principle: the actual
sub-commands provide the command line interface, but call the library
to do the real work.
I'll initially use UUID4 for chunk ids. I'll wrap that in its own type, for type safety.
Should labels be arbitrary binary data or text? Some of them will be encrypted checksums and it'll save a little space if those don't need to be encoded as text.
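A sketch of what the two wrapper types might look like, using only the standard library. It assumes the caller supplies 16 random bytes (a real program would more likely take them from a UUID crate or the operating system), and the method names `from_random_bytes`, `as_bytes` and `to_hex` are invented for this example.

```rust
use std::fmt;

/// Chunk ID wrapping the 16 bytes of a UUID. The newtype keeps IDs from
/// being mixed up with other byte arrays or strings.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
struct ChunkId([u8; 16]);

impl ChunkId {
    /// Turn 16 random bytes into a version 4 UUID by fixing the
    /// version and variant bits.
    fn from_random_bytes(mut bytes: [u8; 16]) -> Self {
        bytes[6] = (bytes[6] & 0x0f) | 0x40; // version 4
        bytes[8] = (bytes[8] & 0x3f) | 0x80; // variant 10xx
        ChunkId(bytes)
    }
}

impl fmt::Display for ChunkId {
    /// Format as the usual 8-4-4-4-12 hexadecimal groups.
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        for (i, byte) in self.0.iter().enumerate() {
            if matches!(i, 4 | 6 | 8 | 10) {
                write!(f, "-")?;
            }
            write!(f, "{:02x}", byte)?;
        }
        Ok(())
    }
}

/// Label stored as arbitrary bytes; text labels are just UTF-8 bytes.
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
struct Label(Vec<u8>);

impl Label {
    fn as_bytes(&self) -> &[u8] {
        &self.0
    }

    /// Hexadecimal text form, for display or for text-only contexts.
    fn to_hex(&self) -> String {
        self.0.iter().map(|b| format!("{:02x}", b)).collect()
    }
}
```

The space argument shows up directly: a 32-byte checksum kept as raw bytes stays 32 bytes, while its hexadecimal text form is 64 characters.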
Got the sub-commands to encode and decode chunks, but didn't add
acceptance criteria yet. Will continue from here next time. Pushed
what I did to a new patch, 3d24321.
Summary
Got started. Didn't quite get everything done that I had hoped, but got somewhere useful.
Comments?
If you have feedback on this development session, please use the following fediverse thread: https://toot.liw.fi/@liw/114171104856444046