Obnam will need support for online storage in some way. By that I mean that the backup repository can be on another computer and accessed over the network. The other computer might be in the same room or it might be somewhere distant. It might be owned by the same person running Obnam, or might be rented. It might even just be rented storage rather than a computer as such.

I've looked at options, now and for earlier iterations of Obnam. A condensed version of my thinking:

  • Obnam could only support local storage, and online storage would require synchronizing local backup storage with a tool such as rsync. This requires having sufficient local storage for your entire backup repository and makes sharing a backup repository much harder.

  • Obnam could access the other computer via SSH, probably with its SFTP subsystem, or maybe rsync. Obnam version 1 did that. On the plus side, this approach allows sharing the backup repository, and if the other computer is your own, it is probably the easiest to set up. I'd like to avoid this approach, however, because using the usual file system directly on the remote end makes it hard to share the backup repository safely, unless all participants are fully trusted and fully secure. If any participant goes rogue, or their computer is compromised so that an attacker can use their credentials to access the backup repository, everyone loses.

    For example, this approach makes it difficult to implement a system where a backup client can create a new backup, but not delete one. It's possible to implement that, and I've experimented with doing so, but it gets tricky.

    When it comes to security, tricky is bad.

  • Obnam could implement its own network protocol for accessing backup storage, probably on top of HTTP. This would have many benefits. It would make the backup system more flexible, and would certainly make it possible to support the scenario above, where clients may only create new backups. However, it would require running an Obnam server on any computer providing backup storage, and that makes deployment and operation more complicated in some use cases.

    Worse, it would make it harder to choose a commercial online storage provider. You can't run an Obnam server on their server. That would mean that if I go this route, everyone wanting to use Obnam for their backups would have to provide their own storage. I am philosophically inclined to do that for myself, but I know that it can be expensive in money and time. Many people would prefer to pay to use hosted storage.

  • Obnam could implement a server with its own protocol based on HTTP, but in a way that supports the S3 protocol commonly used by online storage providers. S3 was originally developed by Amazon for their cloud platform. It is now the de facto standard protocol (or API) for online storage, and there are several implementations of it, including open source ones.

    This means that if you do want to provide online storage for your backups yourself, you can. You do not have to use Amazon or one of the many other providers. You can host your S3 instance where you want, including in your own home.

    The S3 protocol could even be used as the only protocol. However, it's not really well suited for all the use cases I have in mind for Obnam, so a custom protocol seems like the better approach.

It seems to me at this stage of Obnam development that S3 is the way to go. At least as the first option. It may be feasible to later add support for other options.

In any case, Obnam will always support backing up to local storage, without having to run an S3 implementation: Obnam will access the file system directly. The features available will involve compromises, depending on how a backup repository is accessed. If Obnam uses files directly, there's no reasonable way to prevent the Obnam client from deleting them.

Compromises are an inevitable part of life. What I will do is provide a sensible set of options to choose from.

Using S3 in Obnam, initial thoughts

I've recently been experimenting with the S3 protocol. It seems overall sensible, but full of details, and I do not like how Amazon documents it. It's also complicated enough that it requires using a library specific to it, not just a generic HTTP library. The gist of it, from an Obnam point of view:

  • Storage is provided for an account owned by a customer. The customer decides what they're willing to pay for.
  • Storage is split into "buckets", which contain "objects". There are limits to the number of buckets and objects per bucket. The limits can be adjusted, but that may cost extra.
  • An object is a blob. S3 doesn't care what it contains.
  • Objects may be assigned "tags", key/value pairs, both being strings.
  • There are protocol operations for uploading, downloading, listing, and deleting objects in a bucket.
  • Access control is defined as "profiles", with different clients having a profile that allows them to do what they need and no more.

The crucial protocol operation that seems to be missing is searching for objects based on tags. I can work around that in Obnam. Other than that, S3 maps well into what Obnam needs for chunk storage.
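One possible shape for that workaround, sketched here in Python rather than in Obnam's own code, with all names invented for illustration: keep a small index object in the bucket that maps each tag key/value pair to the object keys carrying it, and re-upload the index whenever objects change.

```python
import json


class TagIndex:
    """A client-side index for finding objects by tag, since the S3
    protocol has no operation for searching objects by their tags.
    The serialized form would be stored as its own object in the bucket."""

    def __init__(self, raw: str = "{}"):
        # Maps "tag=value" strings to lists of object keys.
        self.index = json.loads(raw)

    def add(self, tag: str, value: str, object_key: str) -> None:
        self.index.setdefault(f"{tag}={value}", []).append(object_key)

    def find(self, tag: str, value: str) -> list:
        return self.index.get(f"{tag}={value}", [])

    def serialize(self) -> str:
        # Upload the result as an object in the bucket after each change.
        return json.dumps(self.index)
```

The obvious cost is that the index has to be kept consistent with the objects it describes, which needs care when several clients write to the same bucket.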

The naive approach of using S3 for Obnam would be for the Obnam client to access S3 storage directly. The client could have a profile that doesn't allow it to delete objects. Another profile would allow deletion, which is also needed, to remove backups that are no longer wanted. This would mean an Obnam server component would probably be unnecessary. This approach is certainly a possibility. It would require the Obnam client to have S3 credentials, but that can be arranged: they could be part of the Obnam credential chunks, maybe. One credential would allow making and restoring backups, another would be required to delete backups.
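As a sketch of what the non-deleting profile might look like, expressed in AWS IAM policy syntax (the bucket name is invented, and details vary between S3 providers): grant writing, reading, and listing, and simply omit any delete permission.

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "MakeAndRestoreBackupsOnly",
      "Effect": "Allow",
      "Action": ["s3:PutObject", "s3:GetObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::obnam-backups",
        "arn:aws:s3:::obnam-backups/*"
      ]
    }
  ]
}
```

Because s3:DeleteObject is not granted, a client using this profile can make and restore backups but not remove them; a separate profile for deletion would add that action.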

Another approach would be for there to be an Obnam server, with its own protocol. The server would use S3 to access storage. The Obnam client would upload a chunk to the server via the Obnam protocol, the server would upload it to storage via S3. The reverse for downloading chunks. This would mean transferring chunks twice. For large amounts of data that seems unacceptably wasteful.

The S3 protocol has a feature called the "presigned URL". This means that an S3 client can create a special URL that it can give to another program. That other program can use the URL without any S3 credentials of its own.

This opens up a third approach: the Obnam server has S3 credentials, it creates presigned URLs for the Obnam client. The client uploads or downloads chunks directly. This neatly avoids having to transfer chunks twice. It also avoids having to provide the client with S3 credentials.
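To make the mechanism concrete, here is a rough sketch of how query-string presigning works under AWS Signature Version 4, using only Python's standard library. A real implementation would use an S3 library instead; the bucket, object key, and credentials below are made up.

```python
import hashlib
import hmac
from datetime import datetime, timezone
from urllib.parse import quote


def _hmac(key: bytes, msg: str) -> bytes:
    return hmac.new(key, msg.encode(), hashlib.sha256).digest()


def presign_get(bucket, key, region, access_key, secret_key,
                expires=3600, now=None):
    """Create a presigned GET URL for an S3 object (AWS Signature V4)."""
    now = now or datetime.now(timezone.utc)
    amz_date = now.strftime("%Y%m%dT%H%M%SZ")
    datestamp = now.strftime("%Y%m%d")
    host = f"{bucket}.s3.{region}.amazonaws.com"
    scope = f"{datestamp}/{region}/s3/aws4_request"
    params = {
        "X-Amz-Algorithm": "AWS4-HMAC-SHA256",
        "X-Amz-Credential": f"{access_key}/{scope}",
        "X-Amz-Date": amz_date,
        "X-Amz-Expires": str(expires),
        "X-Amz-SignedHeaders": "host",
    }
    # Canonical query string: keys sorted, values URI-encoded.
    qs = "&".join(f"{k}={quote(v, safe='')}" for k, v in sorted(params.items()))
    canonical_request = "\n".join([
        "GET",
        "/" + quote(key, safe="/"),
        qs,
        f"host:{host}\n",      # canonical headers, each ending in a newline
        "host",                # signed headers
        "UNSIGNED-PAYLOAD",    # presigned URLs don't sign the request body
    ])
    string_to_sign = "\n".join([
        "AWS4-HMAC-SHA256",
        amz_date,
        scope,
        hashlib.sha256(canonical_request.encode()).hexdigest(),
    ])
    # Derive the signing key by chaining HMACs over date, region, service.
    signing_key = _hmac(
        _hmac(_hmac(_hmac(b"AWS4" + secret_key.encode(), datestamp),
                    region), "s3"), "aws4_request")
    signature = hmac.new(signing_key, string_to_sign.encode(),
                         hashlib.sha256).hexdigest()
    return f"https://{host}/{quote(key, safe='/')}?{qs}&X-Amz-Signature={signature}"
```

The server would hand such a URL to the client, which can then fetch the object with a plain HTTP GET, no S3 credentials of its own; a PUT variant for uploading works the same way with the method changed. The URL stops working after the expiry time.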

This also means the Obnam server can run anywhere and should not require much in terms of resources. It will need to be kept secure, as it has the S3 credentials.

I could implement both the naive and the presigned approaches. The naive approach would be good for those who do not want to run a server, but do want to use online storage. The presigned approach would be better for group use, such as in organizations and companies.

S3 plans

Adding S3 support will probably require some further experimentation and research. It's going to take a fair bit of time. There are too many unknowns for me to estimate how long it will take.

My plan is to start adding backup repository support in the coming months. I will implement both local filesystem access and some form of online access via S3. Details to be determined as I work.

Comments?

If you have feedback on this, please use the following fediverse thread: https://toot.liw.fi/@liw/115824414980663747.