On some kinds of large files, Obnam's de-duplication does not work very well, even though it should. For example, MySQL dump files from successive days are mostly the same data, but Obnam does badly with them. Below is an explanation of how the Obnam de-duplication works, and why it works badly in some cases.

Obnam does de-duplication by splitting up file data into chunks, and storing those individually. If two files have the same data, Obnam re-uses the already backed-up chunk. So far, so good. However, for performance reasons, Obnam currently only notices a duplicate chunk when it starts at an offset that is an integer multiple of the chunk size.

For example, assume a chunk size of 4 bytes, and the following two files:


In this case, Obnam will easily notice that there are three chunks ("AAAA", "BBBB", and "CCCC"), and will store them only once in the backup repository. However, consider the following file:


File 3 is identical to file 1, except that one new byte ("x") has been inserted at the very beginning of the file. This shifts every chunk boundary by one byte, so Obnam sees file 3 as four chunks: "xAAA", "ABBB", "BCCC", and "C". None of these match the chunks already in the backup repository, so Obnam treats them all as new and stores them again.
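The effect can be reproduced with a small sketch. This is a simplified model, not Obnam's actual code: the repository is just a Python set of chunk contents, the names `split_chunks` and `backup` are made up for illustration, and the file contents are inferred from the chunks named above.

```python
def split_chunks(data, chunk_size):
    """Split data into chunks that start at multiples of chunk_size."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def backup(data, repo, chunk_size):
    """Add unseen chunks to repo (a set) and return the chunks that were new."""
    new_chunks = []
    for chunk in split_chunks(data, chunk_size):
        if chunk not in repo:
            repo.add(chunk)
            new_chunks.append(chunk)
    return new_chunks


repo = set()
file1 = b"AAAABBBBCCCC"
file3 = b"xAAAABBBBCCCC"  # file 1 with one byte inserted at the start

print(backup(file1, repo, 4))  # [b'AAAA', b'BBBB', b'CCCC']
print(backup(file1, repo, 4))  # [] (everything is already stored)
print(backup(file3, repo, 4))  # [b'xAAA', b'ABBB', b'BCCC', b'C']
```

The second backup of file 1 stores nothing, but file 3 stores four new chunks even though only one byte differs.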

There is no technical reason why Obnam could not notice that file 3 only has one inserted byte. However, doing so would require a very large number of lookups in the repository, and thus would be quite slow. There may be better ways of noticing the minute difference, and perhaps someday one of them will be implemented in Obnam.
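To get a feel for the cost, here is a rough sketch that counts lookups when checking only chunk-aligned offsets versus checking every byte offset. It is a toy model, not how Obnam works: `find_known_chunks` is a hypothetical helper, the repository is a Python set, and the file and chunk sizes at the end are arbitrary assumptions chosen only to show the scale.

```python
def find_known_chunks(data, known_chunks, chunk_size, step):
    """Look up the chunk-sized window at every step-th offset.

    Returns (hits, lookups): hits is a list of (offset, chunk) pairs found
    in known_chunks, lookups is how many windows were checked.
    """
    hits = []
    lookups = 0
    for offset in range(0, len(data) - chunk_size + 1, step):
        lookups += 1
        window = data[offset:offset + chunk_size]
        if window in known_chunks:
            hits.append((offset, window))
    return hits, lookups


known = {b"AAAA", b"BBBB", b"CCCC"}
file3 = b"xAAAABBBBCCCC"

# Aligned offsets only: cheap, but nothing matches.
print(find_known_chunks(file3, known, 4, step=4))
# ([], 3)

# Every byte offset: finds all three chunks, at the price of more lookups.
print(find_known_chunks(file3, known, 4, step=1))
# ([(1, b'AAAA'), (5, b'BBBB'), (9, b'CCCC')], 10)

# The same arithmetic for an assumed 1 GiB file and 1 MiB chunks.
file_size, chunk_size = 2**30, 2**20
print(file_size // chunk_size)      # 1024 aligned lookups
print(file_size - chunk_size + 1)   # 1072693249 lookups, one per offset
```

In this model the aligned scan needs about a thousand lookups and the per-offset scan about a billion, and each lookup in a real repository is far more expensive than a set membership test.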

Note that Obnam does not do a "diff" (or "xdelta" or other such approach) to notice differences between successive versions of files. Doing so would make backup generations depend on each other, and would re-introduce "full" versus "incremental" backups in a way that is not acceptable.

With SQL dumps of databases, there are often small changes at the beginning of the file, or in the middle of the file, which make Obnam's de-duplication work very badly, even if the data as such has only changed a tiny bit.

Unfortunately, I don't know of a trick that would make the SQL dumps work better with Obnam. In any case, you should not have to munge your live data to suit Obnam: Obnam needs to be able to deal with whatever data you have. Until Obnam's de-duplication becomes better, though, perhaps someone would have a workaround?

The best idea I have, and it is untested, is to keep the first SQL dump in the live data. Before each backup, make a new dump, diff it against the first dump, keep the diff, delete the new dump, and then run the backup. This way, each successive Obnam backup generation will have two files (the original SQL dump, and the diff), and you'll need to apply the diff to the original to get the real dump you need to restore your database. Does that make sense to anyone?
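As a starting point, the diff step and the matching restore step might look something like the sketch below. It is just as untested against real database dumps as the idea itself. It assumes the standard `diff` and `patch` tools are installed; the function names and file paths are made up for illustration, and producing the dump and running the backup are left to whatever commands you already use.

```python
import os
import subprocess


def replace_dump_with_diff(base_dump, new_dump, diff_path):
    """Write the diff from base_dump to new_dump, then delete new_dump.

    Returns True if the two dumps differed.
    """
    with open(diff_path, "wb") as out:
        result = subprocess.run(["diff", base_dump, new_dump], stdout=out)
    # diff exits with 0 (identical) or 1 (different); anything else is an
    # error, in which case the new dump is kept.
    if result.returncode not in (0, 1):
        raise RuntimeError("diff failed with exit code %d" % result.returncode)
    os.remove(new_dump)
    return result.returncode == 1


def restore_dump(base_dump, diff_path, restored_path):
    """Rebuild the newer dump by applying the diff to a copy of base_dump."""
    subprocess.run(
        ["patch", "-o", restored_path, "-i", diff_path, base_dump],
        check=True,
    )
```

Before each backup you would call something like `replace_dump_with_diff("first.sql", "today.sql", "today.diff")` and then run the backup as usual. After restoring both files from a generation, `restore_dump("first.sql", "today.diff", "today.sql")` should give back the dump to load into the database.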

See also the mailing list thread: