Date: Fri, 8 Jun 2012 07:09:33 +0000
tl;dr: Version 1.0 of Obnam, my snapshotting, de-duplicating, encrypting backup program, is released. See the end of this announcement for the details.
Where we see the hero in his formative years; parental influence
From the very beginning, my computing life has involved backups.
In 1984, when I was 14, my father was an independent telecommunications consultant, which meant he needed a personal computer for writing reports. He bought a Luxor ABC-802, a Swedish computer with a Z80 microprocessor and two floppy drives.
My father also taught me how to use it. When I needed to save files, he gave me not one, but two floppies, and explained that I should store my files on one, and then copy them to the other one every now and then.
Later on, over the years, I've made backups from a hard disk (30 megabytes!) to a stack of floppies, to a tape drive installed into a floppy interface (400 megabytes!), to a DAT drive, and various other media. It was always a bit tedious.
The start of the quest; lengthy justification for NIH
In 2004, I decided to do a full backup, by burning a copy of all my files onto CD-R disks. It took me most of the day. Afterwards, I sat admiring the large stack of disks, and realized that I would not ever do that again. I'm too lazy for that. That I had done it once was an aberration in the space-time continuum.
Switching to DVD-Rs instead of CD-Rs would reduce the number of disks to burn, but not enough: it would still take a stack of them. I needed something much better.
I had a little experience with tape drives, and that was enough to convince me that I didn't want them. Tape drives are expensive hardware, and the tapes also cost money. If the drive goes bad, you have to get a compatible one, or all your backups are toast. The price per gigabyte was coming down fast for hard drives, and it was clear that they were about to be very competitive with tapes for price.
I looked for backup programs that I could use for disk based backups. rsync, of course, was the obvious choice, but there were others. I ended up doing what many geeks do: I wrote my own wrapper around rsync. There are hundreds, possibly thousands, of such wrappers around the Internet.
I also got the idea that doing a startup to provide online backup space would be a really cool thing. However, I didn't really do anything about that until 2007. More on that later.
The rsync wrapper script I wrote used hardlinked directory trees to provide a backup history, though not in the smart way that backuppc does it. The hardlinks were wonderful, because they were cheap, and provided de-duplication. They were also quite cumbersome when I needed to move my backups to a new disk the first time. It turned out that a lot of tools deal very badly with directory trees with large numbers of hardlinks.
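To make the hardlink idea concrete, here is a minimal sketch in Python. It is not the wrapper script described above (that one was built around rsync); the function name `snapshot` and the directory layout are assumptions made purely for illustration. Each generation is a complete directory tree, and a file that is unchanged since the previous generation is hardlinked rather than copied.

```python
import filecmp
import os
import shutil


def snapshot(source, previous, target):
    """Copy `source` into `target`, hardlinking files unchanged since `previous`.

    `previous` is the directory of the prior generation, or None for the
    first backup. Hypothetical helper, for illustration only.
    """
    for dirpath, _dirnames, filenames in os.walk(source):
        rel = os.path.relpath(dirpath, source)
        os.makedirs(os.path.normpath(os.path.join(target, rel)), exist_ok=True)
        for name in filenames:
            src = os.path.join(dirpath, name)
            new = os.path.normpath(os.path.join(target, rel, name))
            old = None
            if previous is not None:
                old = os.path.normpath(os.path.join(previous, rel, name))
            if old and os.path.isfile(old) and filecmp.cmp(src, old, shallow=False):
                # Unchanged: share the same inode, so no extra data is stored.
                os.link(old, new)
            else:
                # New or changed: store a fresh copy.
                shutil.copy2(src, new)
```

Every generation directory can be browsed and restored on its own, which is what makes this approach pleasant. The cost is the one mentioned above: the result is a forest of hardlinks that many tools handle badly.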
I also decided I wanted encrypted backups. This led me to find duplicity, which is a nice program that does encrypted backups, but I had issues with some of its limitations. To fix those limitations, I would have had to re-design and possibly re-implement the entire program. The biggest limitation was that it treated backups as a full backup, plus a sequence of incremental backups, which were deltas against the previous backup.
Delta based incrementals make sense for tape drives. You run a full backup once, then incremental deltas for every day. When enough time has passed since the full backup, you do a new full backup, and then future incrementals are based on that. Repeat forever.
I decided that this makes no sense for disk based backups. If I already have backed up a file, there's no point in making me back it up again, since it's already there on the same hard disk. It makes even less sense for online backups, since doing a new full backup would require me to transmit all the data all over again, even though it's already on the server.
The first battle
I could not find a program that did what I wanted to do, and like every good NIHolic, I started writing my own.
After various aborted attempts, I started for real in 2006. Here is the first commit message:
revno: 1
committer: Lars Wirzenius <liw@iki.fi>
branch nick: wibbr
timestamp: Wed 2006-09-06 18:35:52 +0300
message:
Initial commit.
wibbr was the placeholder name for Obnam until we came up with something better. We was myself and Richard Braakman, who was going to be doing the backup startup with me. We eventually founded the company near the end of 2006, and started doing business in 2007.
However, we did not do very much business, and ran out of money in September 2007. We ended the backup startup experiment. That's when I took a job with Canonical, and Obnam became a hobby project of mine: I still wanted a good backup tool.
In September 2007, Obnam was working, but it was not very good. For example, it was quite slow and wasteful of backup space.
That version of Obnam used deltas, based on the rsync algorithm, to back up only changes. It did not require the user to do full and incremental backups manually, but essentially created an endless sequence of incrementals. It was possible to remove any generation, and Obnam would manage the deltas as necessary, keeping the ones needed for the remaining generations, and removing the rest.
Obnam made it look as if each generation was independent of the others.
The wasteful part was the way in which metadata about files was stored: each generation stored the full list of filenames and their permissions and other inode fields. This turned out to be bigger than my daily delta.
The lost years; getting lost in the forest
For the next two years, I did a little work on Obnam, but I did not make progress very fast. I changed the way metadata was stored, for example, but I picked another bad way of doing it: the new way was essentially building a tree of directory and file nodes, and any unchanged subtrees were shared between generations. This reduced the space overhead per generation, but made it quite slow to look up the metadata for any one file.
The final battle; finding cows in the forest
In 2009 I decided to leave Canonical and after that, my Obnam hobby picked up speed again. Below is a table of the number of commits per year, from the very first commit (bzr log -n0 | awk '/timestamp:/ { print $3}' | sed 's/-.*//' | uniq -c | awk '{ print $2, $1 }' | tac):
2006 466
2007 353
2008 402
2009 467
2010 616
2011 790
2012 282
During most of 2010 and 2011 I was unemployed, and happily hacking Obnam, while moving to another country twice. I don't recommend that as a way to hack on hobby projects, but it worked for me.
After Canonical, I decided to tackle the way Obnam stores data from a new angle. Richard told me about the copy-on-write (or COW) B-trees that btrfs uses, originally designed by Ohad Rodeh (see his paper for details), and I started reading about that. It turned out that they're pretty ideal for backups: each B-tree stores data about one generation. To start a new generation, you clone the previous generation's B-tree, and make any modifications you need.
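The cloning idea can be sketched in a few lines of Python. This is deliberately not a B-tree and not Obnam's actual library: it uses a plain binary search tree with path copying, which is the simplest structure that shows the same copy-on-write behaviour. The names `Node`, `insert`, and `lookup`, and the sample paths, are hypothetical. Nodes are immutable, so "cloning" a generation is just keeping a reference to its root, and a modification copies only the nodes on the path to the changed key.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Node:
    key: str
    value: str
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def insert(root, key, value):
    """Return a new root with `key` set to `value`; `root` is never modified."""
    if root is None:
        return Node(key, value)
    if key < root.key:
        return Node(root.key, root.value, insert(root.left, key, value), root.right)
    if key > root.key:
        return Node(root.key, root.value, root.left, insert(root.right, key, value))
    return Node(key, value, root.left, root.right)


def lookup(root, key):
    """Return the value stored for `key`, or None if it is absent."""
    while root is not None:
        if key == root.key:
            return root.value
        root = root.left if key < root.key else root.right
    return None


# Generation 1: metadata for three paths.
gen1 = None
for path, meta in [("/home", "dir"), ("/etc", "dir"), ("/var", "dir")]:
    gen1 = insert(gen1, path, meta)

# Generation 2: start from generation 1's root and change one entry.
gen2 = insert(gen1, "/var", "dir, mtime changed")

# The untouched subtree is shared between generations, not copied.
shared = gen2.left is gen1.left
```

Here `gen1` still answers lookups exactly as before, while `gen2` sees the change, and the two trees share every node that was not on the path to `/var`. A real copy-on-write B-tree applies the same principle to wide nodes stored on disk.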
I implemented the B-tree library myself, in Python. I wanted something that was flexible about how and where I stored data, which the btrfs implementation did not seem to give me. (Also, I worship at the altar of NIH.)
With the B-trees, doing file deltas from the previous generation no longer made any sense. I realized that it was, in any case, a better idea to store file data in chunks, and re-use chunks in different generations as needed. This makes it much easier to manage changes to files: with deltas, you need to keep a long chain of deltas and apply many deltas to reconstruct a particular version. With lists of chunks, you just get the chunks you need.
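As a rough illustration of chunk lists, here is a sketch in Python. The fixed four-byte chunk size, the use of SHA-256 as the chunk identifier, and the `ChunkStore` class are all assumptions for the example; the post does not describe Obnam's real chunk size, checksum, or repository format.

```python
import hashlib


class ChunkStore:
    """Toy in-memory chunk repository (hypothetical, for illustration)."""

    def __init__(self, chunk_size=4):
        self.chunk_size = chunk_size
        self.chunks = {}  # chunk id -> chunk bytes

    def put_file(self, data):
        """Store `data` as chunks and return the list of chunk ids."""
        ids = []
        for i in range(0, len(data), self.chunk_size):
            chunk = data[i:i + self.chunk_size]
            cid = hashlib.sha256(chunk).hexdigest()
            # A chunk that is already present is re-used, not stored again.
            self.chunks.setdefault(cid, chunk)
            ids.append(cid)
        return ids

    def get_file(self, ids):
        """Rebuild a file version directly from its chunk list."""
        return b"".join(self.chunks[cid] for cid in ids)


store = ChunkStore(chunk_size=4)
gen1 = store.put_file(b"aaaabbbbcccc")  # first generation of a file
gen2 = store.put_file(b"aaaaXXXXcccc")  # second generation, middle changed
```

Each generation of the file is just a list of chunk ids. Restoring either version means fetching the chunks on its list, with no chain of deltas to replay, and the two versions together occupy four chunks in the store rather than six.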
The spin-off franchise; lost in a maze of dependencies, all alike
In the process of developing Obnam, I have split off a number of helper programs and libraries:
- genbackupdata generates reproducible test data for backups
- seivot runs benchmarks on backup software (although only Obnam for now)
- cliapp is a Python framework for command line applications
- cmdtest runs black box tests for Unix command line applications
- summain makes diff-able file manifests (md5sum on steroids), useful for verifying that files are restored correctly
- tracing provides run-time selectable debug log messages, and is really fast during normal production runs when messages are not printed
I have found it convenient to keep these split off, since I've been able to use them in other projects as well. However, it turns out that those installing Obnam don't like this: it would probably make sense to have a fat release with Obnam and all dependencies, but I haven't bothered to do that yet.
The blurb; readers advised about blatant marketing
The strong points of Obnam are, I think:
- Snapshot backups, similar to btrfs snapshot subvolumes. Every generation looks like a complete snapshot, so you don't need to care about full versus incremental backups, or rotate real or virtual tapes. The generations share data as much as possible, so only changes are backed up each time.
- Data de-duplication, across files, and backup generations. If the backup repository already contains a particular chunk of data, it will be re-used, even if it was in another file in an older backup generation. This way, you don't need to worry about moving around large files, or modifying them.
- Encrypted backups, using GnuPG.
Backups may be stored on local hard disks (e.g., USB drives), any locally mounted network file shares (NFS, SMB, almost anything with remotely POSIX-like semantics), or on any SFTP server you have access to.
What's not so strong is backing up online over SFTP, particularly with long round trip times to the server, or many small files to back up. That performance is Obnam's weakest part. I hope to fix that in the future, but I don't want to delay 1.0 for it.
The big news; readers sighing in relief
I am now ready to release version 1.0 of Obnam. Finally. It's been a long project, much longer than I expected, and much longer than was really sensible. However, it's ready now. It's not bug free, and it's not as fast as I would like, but it's time to declare it ready for general use. If nothing else, this will get more people to use it, and they'll find the remaining problems faster than I can do on my own.
I have packaged Obnam for Debian, and it is in unstable, and will hopefully get into wheezy before the Debian freeze. I provide packages built for squeeze on my own repository; see the download page.
The changes in the 1.0 release compared to the previous one:
- Fixed bug in finding duplicate files during a backup generation. Thanks to Saint Germain for reporting the problem.
- Changed version number to 1.0.
The future; not including winning lottery numbers
I expect to get a flurry of bug reports in the near future as new people try Obnam. It will take a bit of effort dealing with that. Help is, of course, welcome!
After that, I expect to be mainly working on Obnam performance for the foreseeable future. There may also be a FUSE filesystem interface for restoring from backups, and a continuous backup version of Obnam. Plus other features, too.
I make no promises about how fast new features and optimizations will happen: Obnam is a hobby project for me, and I work on it only in my free time. Also, I have a bunch of things that are on hold until I get Obnam into shape, and I may decide to do one of those things before the next big Obnam push.
Where; the trail of an errant hacker
I've developed Obnam in a number of physical locations, and I thought it might be interesting to list them: Espoo, Helsinki, Vantaa, Kotka, Raahe, Oulu, Tampere, Cambridge, Boston, Plymouth, London, Los Angeles, Auckland, Wellington, Christchurch, Portland, New York, Edinburgh, Manchester, San Giorgio di Piano. I've also hacked on Obnam in trains, on planes, and once on a ship, but only for a few minutes on the ship before I got seasick.
Thank you; sincerely
- Richard Braakman, for helping me with ideas, feedback, and some code optimizations, and for doing the startup with me. Even though he has provided little code, he's Obnam's most significant contributor so far.
- Chris Cormack, for helping to build Obnam for Ubuntu. I no longer use Ubuntu at all, so it's a big help to not have to worry about building and testing packages for it.
- Daniel Silverstone, for spending a Saturday with me hacking Obnam, and rewriting the way repository file filters work (compression, encryption), thus making them not suck.
- Tapani Tarvainen for running Obnam for serious amounts of real data, and for being patient while I fixed things.
- Soile Mottisenkangas for believing in me, and helping me overcome periods of despair.
- Everyone else who has tried Obnam and reported bugs or provided any other feedback. I apologize for not listing everyone.