Bitrot - When Ones and Zeros Become Fuzzy

Without giving it much thought, I had always assumed that safeguarding my data was mostly a matter of keeping multiple copies, so that files could be recovered after an accidental deletion or a disk crash. Regular backups took care of the first kind of event, and partitions mirrored across multiple physical disks took care of the second.

What I did not consider was the risk of data degrading because of the way it is physically stored. According to Wikipedia, no medium is immune to this and, even worse, the most commonly deployed file systems do not detect such changes. An affected picture is sometimes corrupted to such a degree that it cannot be displayed at all, while at other times it merely exhibits random artifacts. Other types of files are affected in similar ways.

Real world examples

I use Adobe Lightroom to manage my pictures on an NTFS partition mirrored with Intel RST (previously mirrored using the RAID driver built into Windows 7 and 8). Four pictures can no longer be edited: entering the Develop module results in a "There was an error working with the photo" message. They are flagged as having Lightroom edits, so they were editable in the past.

Other people have had a similar experience:

Detection

My first thought was that I needed a way to detect when data degrades, so that affected files can be restored while unaffected copies still exist. A simple yet effective solution is to store a hash of each file and, whenever files are checked for corruption, compare it with a newly computed hash. Following that line of thought led to these requirements:

  • Re-compute hash when intended modifications are performed (when "last modified" date changes)
  • Windows compatible
  • Use a non-proprietary file format for the hash database
  • Use a non-proprietary hashing algorithm
  • Fast both at building hashes and verifying files
  • Detection of bit rot in the database itself
  • Command line interface
  • Use hash database to verify backups
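To make the idea concrete, here is a minimal sketch in Python. It is not the PowerShell script mentioned in this post, only an illustration under my own assumptions: SHA-256 as the non-proprietary hash, a JSON file as the database, and the hypothetical function names `scan`, `save` and `load`. A file whose "last modified" time has changed is treated as an intended edit and re-hashed; a file whose timestamp is unchanged but whose hash differs is reported as a suspect. The database carries a checksum over its own entries so that rot in the database itself is noticed.

```python
import hashlib
import json
import os
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files never have to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def scan(root, database):
    """Update `database` in place and return the files that look rotten.

    `database` maps a relative path to {"mtime_ns": int, "sha256": str}.
    """
    root = Path(root)
    suspects = []
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        key = path.relative_to(root).as_posix()
        mtime_ns = path.stat().st_mtime_ns
        current = sha256_of(path)
        known = database.get(key)
        if known is None or known["mtime_ns"] != mtime_ns:
            # New file or intended modification: record a fresh hash.
            database[key] = {"mtime_ns": mtime_ns, "sha256": current}
        elif known["sha256"] != current:
            # Same timestamp but different content: possible bit rot.
            suspects.append(key)
    return suspects


def save(database, db_path):
    """Write the database as JSON with a checksum over its own entries."""
    body = json.dumps(database, sort_keys=True)
    checksum = hashlib.sha256(body.encode("utf-8")).hexdigest()
    wrapper = {"checksum": checksum, "entries": database}
    Path(db_path).write_text(json.dumps(wrapper, sort_keys=True), encoding="utf-8")


def load(db_path):
    """Read the database and refuse to use it if its checksum is wrong."""
    wrapper = json.loads(Path(db_path).read_text(encoding="utf-8"))
    body = json.dumps(wrapper["entries"], sort_keys=True)
    if hashlib.sha256(body.encode("utf-8")).hexdigest() != wrapper["checksum"]:
        raise ValueError("hash database failed its own integrity check")
    return wrapper["entries"]
```

The database file should live outside the scanned folder, otherwise it would be hashed along with everything else. Note also that this sketch cannot tell bit rot apart from a program that changes a file while preserving its timestamp, so a suspect is a reason to look closer rather than proof of corruption.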

I have a preliminary PowerShell script up and running that satisfies most of these requirements. I plan to release it when it has been thoroughly tested.

Mitigation

My backup routine is based on encrypted disk images written to an external USB hard disk on site. That disk is then taken to an off-site location, where the oldest backup file is deleted and the new backup is copied to one of a set of USB hard disks kept there. Being a bit paranoid, I also occasionally burn my most precious data to Blu-ray discs, which are stored off site as well. Having access to at least one unaffected copy is critical once bit rot has been detected, so I hope my backup strategy is sufficient for this. In particular, I hope one of my Blu-ray backups holds an intact copy of the pictures Lightroom is unable to edit.
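Searching several backups for an unaffected copy is easy to automate once a known-good hash is available. The following Python sketch assumes that each backup has been mounted or extracted as a folder mirroring the original layout, and that the expected SHA-256 hash was recorded before the file degraded. The function name `find_intact_copy` and its parameters are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files never have to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_intact_copy(relative_path, expected_sha256, backup_roots):
    """Return the first backup copy whose hash matches, or None."""
    for root in backup_roots:
        candidate = Path(root) / relative_path
        # Skip backups that lack the file or hold a damaged copy.
        if candidate.is_file() and sha256_of(candidate) == expected_sha256:
            return candidate
    return None
```

The backups are checked in the order given, so listing the most recent or most accessible backup first keeps the search short.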

Most contemporary operating systems provide a means to continually back up versions of files as their contents change. Windows has File History, which may come in handy as a last resort for recovering at least some of the corrupted data, should no backup contain an undamaged copy.

Avoiding the effects of bit rot in the first place is my long-term goal. As a Windows user, the natural option for me would be ReFS, introduced with Windows Server 2012, the server counterpart of Windows 8. When ReFS is used on a mirrored volume, integrity streams are enabled by default, and they can be used to detect and heal corrupt data (see "Battling 'bit rot'" in the "Building the next generation file system for Windows: ReFS" blog post). Other platforms have file systems with similar properties, such as ZFS and Btrfs (see "Per-block checksumming" in "Bitrot and atomic COWs: Inside 'next-gen' filesystems"). They are of less interest to me at the moment, but may come in handy when my NAS is due for an upgrade.