Goal

The next important thing to build for Obnam 3 is support for both local and remote backup storage. There is initial support for local storage, which needs to be reviewed. Needs and wants for backup storage need to be identified and written down. An initial implementation of remote support needs to be built. After that, further improvements can be made iteratively.

The goal for today is to write down my current thinking about backup storage. As part of this writing, I'll review the current local storage implementation to see what Obnam currently does. That will inform my thinking about what is needed.

Plan

  • Review current src/store.rs and take notes.

  • Write up my S3 research.

  • Write down an initial set of needs and wants.

  • Write down a first proposal for a storage implementation architecture for Obnam 3.

  • Identify clear next actions to allow implementation work to start.

Notes

Review of current local storage implementation

  • Here is the current storage implementation interface, distilled to its essentials.
pub struct ChunkStore {...}

impl ChunkStore {
    pub fn is_init(dir: &Path) -> bool { ... }
    pub fn init(dir: &Path) -> Result<Self, StoreError> { ... }
    pub fn open(dir: &Path) -> Result<Self, StoreError> { ... }
    pub fn all_chunks(&self) -> Result<Vec<Metadata>, StoreError> { ... }
    pub fn add_chunk(&self, chunk: &Chunk) -> Result<(), StoreError> { ... }
    pub fn get_data_chunk(&self, id: &Id) -> Result<Chunk, StoreError> { ... }
    pub fn get_client_chunk(&self, id: &Id) -> Result<Chunk, StoreError> { ... }
    pub fn open_client(&self, engine: &Engine, wanted: &str) -> Result<(Id, ClientChunk), StoreError> { ... }
    pub fn get_credential_chunk(&self, id: &Id) -> Result<Credential, StoreError> { ... }
    pub fn remove_chunk(&self, id: &Id) -> Result<(), StoreError> { ... }
    pub fn find_chunks(&self, label: &Label) -> Result<Vec<Metadata>, StoreError> { ... }
    pub fn find_client_chunks(&self) -> Result<Vec<Metadata>, StoreError> { ... }
    pub fn find_our_client_chunks(&self, engine: &Engine) -> Result<Vec<ClientChunk>, StoreError> { ... }
    pub fn find_credential_chunks(&self) -> Result<Vec<Metadata>, StoreError> { ... }
    pub fn get_credential_chunks(&self) -> Result<Vec<Credential>, StoreError> { ... }
    pub fn chunk_filename(&self, id: &Id) -> PathBuf { ... }
}
  • The operations to initialize a directory for storage and to open previously initialized storage are intentionally separate. I find this important, as it reduces the risk of mistakes. For example, if the directory is on removable media (such as a USB drive), the separation allows Obnam to notice if the drive isn't mounted.

    If the storage were always created implicitly when missing, there would be a risk of creating a new backup repository in the empty directory where the removable media should have been mounted. A usage sketch of the separation follows below.

    (I know this from experience: Obnam 1 did this.)
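
    As a minimal sketch of how calling code can use this separation, assuming the interface above: the NotInitialized error variant is hypothetical, not current Obnam code.

use std::path::Path;

// Minimal usage sketch, assuming the interface above. The
// NotInitialized error variant is hypothetical.
fn open_repository(dir: &Path) -> Result<ChunkStore, StoreError> {
    if !ChunkStore::is_init(dir) {
        // Refuse to create a repository implicitly: the directory
        // may be an unmounted mount point for removable media.
        return Err(StoreError::NotInitialized(dir.to_path_buf()));
    }
    ChunkStore::open(dir)
}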

  • When the backup repository is remote, the user may not have permission to create it. In fact, it may even be better to separate the actions of creating the location (such as an S3 bucket) and preparing it to be used for backup storage.

  • The operation to list all chunks is useful for users to troubleshoot. It is needed.

  • The interface above has operations for finding and retrieving each kind of chunk. For the rest of Obnam that makes sense: it catches bugs where Obnam thinks it's retrieving one type of chunk but actually gets another kind. However, it may be useful to separate this into layers: a lower-level one that acts on any kind of chunk, and a higher-level one that distinguishes chunks by type, as sketched below. This would avoid duplicating code across different kinds of storage implementations.
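
    A minimal sketch of what that layering could look like. This is hypothetical, not current Obnam code; the trait and type names are placeholders.

// Hypothetical sketch of the layering; not current Obnam code.
// The lower layer stores and retrieves opaque chunks, whatever
// the backend (local directory, S3, Obnam server).
pub trait LowLevelStore {
    fn put(&self, id: &Id, chunk: &Chunk) -> Result<(), StoreError>;
    fn get(&self, id: &Id) -> Result<Chunk, StoreError>;
    fn remove(&self, id: &Id) -> Result<(), StoreError>;
    fn find(&self, label: &Label) -> Result<Vec<Metadata>, StoreError>;
}

// The higher layer is written once, against the trait, and adds
// the per-kind operations that verify chunk types.
pub struct TypedStore<S: LowLevelStore> {
    low: S,
}

impl<S: LowLevelStore> TypedStore<S> {
    pub fn get_data_chunk(&self, id: &Id) -> Result<Chunk, StoreError> {
        let chunk = self.low.get(id)?;
        // Verify here that the chunk is a data chunk, as the
        // current implementation does, and error out otherwise.
        Ok(chunk)
    }
}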

  • The operation to add a chunk should maybe also be typed. At the moment it's not, as the act of creating a chunk is typed.

  • In fact, the operations for retrieving chunks are not typed at the Rust level, but they verify that they get the expected kind of chunk from the store. That is probably enough. The Chunk type is an enum that covers all kinds of chunks. This implementation approach turned out to result in less entangled code that is easier for me to think about.

    (Smarter people than I could do this better, and they are welcome to do so. I need to develop within the limitations of my older-model brain. I've learned that I need to understand the code I develop and maintain, and thus I sometimes build things that are simpler than what someone else would build.)

  • At the level of backup storage, there are only chunks. The backup repository does not distinguish between kinds of chunk. This is good for future changes: if the storage doesn't care, we can add new kinds of chunks more easily. Only the client cares about chunk kinds, and the current approach uses the Rust type system to make sure all the kinds are handled the way they should be, without getting confused with each other.

  • The various operations to find chunks all operate only on chunk metadata. The discussion on type safety for finding specific kinds of chunks applies here too.

  • The chunk_filename method should probably be a more generic chunk_location method. It's only used in the obnam store chunk-path command and may only be useful for operating on local storage.

Summary:

  • I probably want to separate the interface into lower and higher level layers.

  • Otherwise, the current internal interface seems pretty much OK.

S3

  • I've played around with the Amazon AWS S3 interface, using an instance provided by UpCloud, which I pay for (because I don't want to deal with Amazon directly more than I have to), and the aws-sdk-s3 crate. For now, that crate is acceptable. I may switch to something else later, or implement directly the small parts of the S3 API that Obnam needs. Not now, though: I first want to be sure what I actually need from S3.

  • There are a number of implementations of the S3 API other than the original by Amazon. This is good: I would not want to build on the API if there were only the one implementation. I've given Garage a quick try, and it seems to work OK. This opens up a path to self-hosting the S3 instance for one's own backups, which I'm sure someone will want to do.

  • Conceptually, from the Obnam point of view, S3 has buckets of objects; objects are just blobs with arbitrary tags for metadata. A holder of credentials may create short-lived presigned URLs and give them to others, who can then perform the corresponding operation without needing any credentials themselves.

  • Obnam could use S3 in at least two use cases, allowing the Obnam user to choose the one that suits their needs better.

    • The Obnam client runs on the machine with the data getting backed up, and a separate Obnam server runs on a suitable host. The server has S3 credentials and provides the client with presigned URLs to access an S3 instance (see the presigning sketch after this list). This means the client uploads data directly to S3, or downloads it directly from there, and does not route it via the Obnam server. At the same time, the client does not need S3 credentials itself. The Obnam server can be shared by any number of Obnam clients. This should be good for a group or organization scenario.

    • There is only an Obnam client and it has credentials for the S3 instance in use. In effect, the server becomes part of the client. The way this is achieved is an implementation detail: the client may spawn the server as a child process, if that's simpler for the implementation.

  • The AWS S3 SDK for Rust uses async, which is fine. However, I may not want to use async otherwise, so separating the server into its own program may turn out to be convenient.
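
    To make the presigning step concrete, here is roughly what it looks like with the aws-sdk-s3 crate, based on my playing around. This is a sketch, not Obnam code; the bucket and key arguments are placeholders.

use std::time::Duration;
use aws_sdk_s3::presigning::PresigningConfig;

// Sketch: the holder of credentials (the Obnam server) presigns
// a GET for one object. The resulting URL can be handed to the
// Obnam client, which fetches it with any HTTP client, without
// having S3 credentials of its own.
async fn presigned_get(
    client: &aws_sdk_s3::Client,
    bucket: &str,
    key: &str,
) -> Result<String, Box<dyn std::error::Error>> {
    let config = PresigningConfig::expires_in(Duration::from_secs(300))?;
    let request = client
        .get_object()
        .bucket(bucket)
        .key(key)
        .presigned(config)
        .await?;
    Ok(request.uri().to_string())
}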

Summary:

  • S3 looks like a sufficient, generic storage API for Obnam.

  • I will implement a separate server for controlling S3 access. The client can optionally embed the server logic to avoid having to run a server.

Needs and wants for backup storage

  • The storage should store blobs of data (encrypted chunks), with metadata consisting of IDs and labels.

  • The Obnam user needs to be able to choose whether storage is local (in a directory on the local filesystem) or remote (accessed over the network).

  • For remote storage, the Obnam user must be able to trust it doesn't remove blobs. Otherwise, it is too difficult for the Obnam client to keep track of what backups are stored in the repository. The client would need to have a trusted place to store the ID of the latest backup, in addition to the actual backup repository.

  • The client uses cryptography to ensure the remote storage can't see what is stored in the blobs there, and to detect if any blobs or their metadata have been modified.

  • For remote storage, it needs to be possible to restrict a client so that it can only create new backups, not remove existing backups. This protects against an attacker getting control of the client and deleting some or all backups.

I am sure the above list will change, but it's a start.

Proposed storage architecture

  • For local storage, Obnam will access it directly, as it does now. No server component is used or needed. If nothing else, the user, or an attacker, has direct filesystem access and can delete files if they want to. They can physically destroy the USB drive. The only real way to protect against these types of attacks is to have a trusted remote server that stops the attacks.

  • For remote storage, three hosts are involved: the client, the Obnam server, and the S3 instance. The Obnam user may, of course, run all of these on the same hardware, but Obnam needs to assume they're separate. The Obnam server has credentials for the S3 instance. The Obnam client and server communicate over a custom HTTP API. The Obnam server gives the Obnam client presigned URLs so the client can access the S3 instance directly.

  • If the Obnam user wants to do so, the Obnam client can have S3 credentials and can thus have direct S3 access, avoiding the need to run an Obnam server separately.

  • I like this tentative architecture, because it separates holding credentials from accessing the backup storage location. At the same time, the architecture does not prevent a client-only use case. A small sketch of how the client might model the options follows below.
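
    A hypothetical sketch of how the client might model the deployment options described above; the type names are placeholders, not current Obnam code.

// Hypothetical sketch: the client selects a storage backend at
// run time; the type names are placeholders.
enum Storage {
    // Direct filesystem access, as now.
    Local(ChunkStore),
    // Talks to an Obnam server, which presigns S3 URLs.
    Server(ObnamServerClient),
    // The client holds S3 credentials itself; in effect the
    // server logic is embedded in the client.
    DirectS3(EmbeddedServer),
}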

The Obnam server API

  • The API can probably be a simple REST-like CRUD one. Those are fairly easy to implement and understand, so that's good.

  • The client will have an API token as part of the credential chunk. The API token controls the type of access (read-only access; write access that only creates new chunks; access that allows removing chunks); a sketch of this follows below. By embedding the API token in the credential, the client can be configured so that it normally has only one type of access, and special action by the user is required for other types of access. For example, the client may normally have read-only access using a credential tied to a TPM chip. New backups would require an OpenPGP key on an OpenPGP card. Chunk removal access would be in a credential tied to the OpenPGP card and a passphrase.

    (Obnam doesn't yet support use cases where more than one credential is needed. That needs to be added.)
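
    A hypothetical sketch of an API token embedded in a credential chunk; none of this exists yet, and the names are placeholders.

// Hypothetical sketch; none of this exists yet.
pub enum Access {
    ReadOnly,
    // May add new chunks, but not overwrite or remove any.
    CreateOnly,
    Remove,
}

pub struct ApiToken {
    // Opaque secret presented to the Obnam server.
    secret: String,
    // The kinds of access this token grants.
    access: Vec<Access>,
}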

  • GET /chunk would return a list of all chunk IDs.

  • GET /chunk/ID would return a specific chunk, given its ID.

  • PUT /chunk?id=ID&label=LABEL would upload a chunk with a given ID and label.

  • GET /chunk?label=LABEL would return a list of IDs of chunks with a given label.

  • DELETE /chunk/ID would remove a specific chunk.

  • In all cases the response contains a presigned URL that the client uses with the S3 instance to execute the actual operation. A sketch of a possible response shape follows below.
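
    A hypothetical shape for such a response, as the client might deserialize it; the field names are placeholders, not a committed API.

use serde::Deserialize;

// Hypothetical response shape; field names are placeholders.
#[derive(Deserialize)]
struct ChunkResponse {
    chunk_id: String,
    // Presigned URL the client uses against the S3 instance.
    presigned_url: String,
    // Seconds until the presigned URL expires.
    expires_in: u64,
}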

  • The source of truth for what chunks exist, and what metadata they have, is the S3 instance. The Obnam server does not keep track of that. This avoids the two getting out of sync, which could result in confusion and even data loss.

  • In addition, the API needs to provide for API token management, which probably needs user (client) management in general. This needs a separate design session and probably research into identity provider software. There may be a need for an interactive web interface.

Next actions

  • Verify that the S3 API can be used to allow creation of a new object without also allowing changing an existing object. Obnam chunks must never change. I forgot to test this while playing around with S3. A hedged sketch of what the check might look like follows below.
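
    If I understand correctly, recent versions of the S3 API support conditional writes: a PutObject with an If-None-Match: * header fails if the object already exists. Whether that holds, and whether implementations such as Garage support it, is exactly what needs verifying. A sketch, assuming a recent aws-sdk-s3 that exposes the header:

use aws_sdk_s3::primitives::ByteStream;

// Sketch: create an object only if it does not already exist.
// Whether every S3 implementation honors If-None-Match: * is
// the thing to verify.
async fn create_only(
    client: &aws_sdk_s3::Client,
    bucket: &str,
    key: &str,
    data: Vec<u8>,
) -> Result<(), Box<dyn std::error::Error>> {
    client
        .put_object()
        .bucket(bucket)
        .key(key)
        .if_none_match("*") // fail if the key already exists
        .body(ByteStream::from(data))
        .send()
        .await?;
    Ok(())
}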

  • Split the current storage API to lower and higher layers.

  • Sketch an initial version of the Obnam server. It should authenticate every API request. There should be a way to create API tokens, and to revoke them. Token management can be done on the server command line, for now. This avoids being blocked on having an identity provider before work on the server can start.

  • Implement client code for the Obnam server API. Update the Obnam program (client) to allow using a server. Allow the user to provide the API token via the command line.

  • Add optional Obnam server API token to credential chunks.

  • Design an implementation of being able to open multiple credential chunks to gain authorization.

Summary

Today was only thinking and planning. No code changes.

Comments?

If you have feedback on this development session, please use the following Fediverse thread: https://toot.liw.fi/@liw/115915312972021855.