A Design For a Public-key Dedup Backup Tool

6 minute read

This is a design proposal for a tool that allows to store deduplicated backups using public-key encryption.

The Problem

The idea emerged from a project, whose backup restoration key was physically stored in a vault. Public-key encryption is particularly good for this use case.

However, very few backup tools support public-key encryption and those that support it, either rely on GnuPG and require access to the secret / private key, or don’t support deduplication and incremental archives.

Scope of the Project

Requirements

  • Public-key encryption of the repository.
  • The secret key is not required, except when you want to restore a backup.
  • Backup deduplication through chunking.
  • Compressing the file chunks.
  • Full backups. No diffing. Incrementally adds new / changed content.
  • No need for an agent / server counterpart on the storage side.
  • Suitable for immutable storage - e.g. S3 WORM.
  • Limited number of configuration settings.
  • Repository can be transferred on a medium with FAT32 filesystem.

Nice to have

  • Support for multiple providers: local FS, S3, SSH, FTP, SMB, WebDAV.
  • Configurable rolling hash settings
  • Adding new public keys

Non-goals

  • Conseal all metadata

Repository Format

The filesystem structure of a backup repository is inspired by git.

repository/
├── archives/
└── objects/
    ├── 00/
    ├── ...
    └── ff/
└── config

There is an immutable config file that is created with the registry and it stores the format version, the public key for encryption and a shared salt for the object hashes.

The archives/ is an optional, but convenient directory that exposes unencrypted metadata about the archives to the user. If you remove the directory, you can still recover your data, but you need to traverse through all the objects to find those that are archives.

The objects/ directory stores everything, just like git1. The objects are placed in subdirectories matching their first byte. All objects are encrypted and their content is not available and generally not required. Only if you want to initiate a backup restoration, then you need provide a secret / private key to access archive’s content. In all other cases, the tool can make decisions based on the object names that are hash of their content. There are four types of objects that can be stored: archive, tree node / directory, file, chunk.

  • archive - Archive metadata and a hash reference to a starting tree node.
  • tree node - A directory. A list of files and tree nodes and their hash references.
  • file - A list of hash references to chunks.
  • chunk - A chunk of content from a file.

Backup Algorithm

The algorithm use depth-first traversal the directories, since tree nodes depend on the hash reference of its subtree nodes. We begin by creating a tree node for the root directory of the archive.

Tree node contains a list of tuples (object hash, file type, file name, metadata). The metadata may include filesystem dates, attributes, permissions and ownership. To populate the list we iterate over the entries in the directory.

If the current entry is a file, we start by calculating its hash. We can now add the file to the tree node list. If an object with the same hash exists, we skip to the next entry. If it doesn’t then we proceed to splitting the file into chunks using a rolling hash algorithm, that would yield the same chunks, based on the file content. All the chunks are hashed and stored as objects, except for those that are already present in the repository. After that, if the number of chunks is more than one, a file object is created that lists all chunk hashes needed to produce the file.

If the current entry is a directory, then we traverse recursively, since we need to calculate the hash of the tree. Once the list of the tree node is complete, we hash the serialised content of the tree node, check if an object already exists and write it to the repository, if it doesn’t.

When the root tree node is finished, we proceed to create an archive object. This object type contains a reference to the root tree node and metadata, e.g. name, time of creation, per-archive settings. In the archives/ directory in the repository we store the same archive object, but unencrypted. The name of the unencrypted archive object is the Unix timestamp in milliseconds. It prevents collisions and skips the need to read the whole archives/ directory.

Details and Rationale

Object Directory Structure

We store objects in objects/<prefix>/<rest-of-the-hash>, where prefix is the first byte of the object hash and rest-of-the-hash are the following bytes. Both are encoded in hexadecimal format. This is done so we can support uploading the repository to a FAT32 thumb drive.

The filesystem FAT32 has a limit of 65536 files per directory. The assumption is that hashes have a uniform distribution and we would hit the limit roughly at 16M objects - unique files, chunks and directories. If we go for a default of 1MB chunks, 16M x 1MB is about 16TB of unique data. FAT32 has a limit of 2TB.

Object File and Records

Object file contains a record, e.g. archive record, file record. An object that contains an archive record is simply an archive object. We introduce two container record types - compression and encryption. Those records can nest other records. This way we store a file chunk as a chunk record nested in a compression record, nested in a encryption record.

C h u n k E O n b c r j r e e y c c p o t t r i d F o i n l e C o m r p e r c e o s r s d i o n C r h e u c n o k r d c d h a u t n a k

We can also discriminate between file record and chunk record, thus we don’t need a separate file record for single-chunk files.

Hashing and Salt

BLAKE32 is a very-fast secure hash algorithm and since it also has a builtin MAC, we can use a repository’s shared salt as a key, instead of adding the salt to the data. This way we would generate unique hashes for that specific repository and it would be harder for an attacker to guess the repository content just by looking at the hashes.

Encryption Scheme

We favour NaCl as an encryption library. A sealed box3 with an ephemeral key is a good choice. However, since chunks can be configured per archive, having large chunks encrypted and decrypted in memory could put a lot of strain on small machines. To mitigate that we split the encapsulated data in the encryption record in to smaller messages, e.g. 256KiB in size.

The encryption scheme is similar to a sealed box, but we add multiple messages. We start by generating an ephemeral key-pair (esk, epk) and we append the epk to the record. Then we calculate the shared key sk using esk and repository’s public key rpk. After that we split the data into messages with predefined size and start making NaCl secretbox for every message using sk as a key and message’s number as a nonce.

Rolling Hash and Chunking Algorithm

There are multiple algorithms that can be used to chunk the content. However, there is already a proof-of-concept that works with the Fast Content-Defined Chunking (FastCDC).

Discussion

I would like to hear your thoughts.
Text me on social media or send me an email.

Acknowledgements

Thank you (you know who) for listening through all my blabbering on the topic, while we were hiking.


  1. https://matthew-brett.github.io/curious-git/git_object_types.html ↩︎

  2. https://github.com/BLAKE3-team/BLAKE3-specs/raw/ea51a3ac997288bf690ee82ac9cfc8b3e0e60f2a/blake3.pdf ↩︎

  3. https://doc.libsodium.org/public-key_cryptography/sealed_boxes ↩︎