A neat hack

A colleague of mine discovered that a scripting error had caused a few months of his Apache access logs (compressed with gzip) to get transferred in FTP ASCII mode before being archived to DVD. He asked whether there was any hope for recovery.

Those FTP transfers corrupted about 0.4% of the input bytes: ASCII mode translates line-ending bytes in transit, which is harmless for text but mangles a binary stream. Because every bit counts in a compressed file, these errors send the gzip/inflate decompressor “into the woods” pretty quickly, and every error disrupts the expansion of everything afterwards. The output turns to unrecognizable gibberish almost immediately. The decompressor itself doesn’t know it’s lost until the final crc check. (There are few illegal states on the way; if there were many, that would mean that there is redundancy in the data, and a compressor’s job is to find redundancy and squeeze it out.)
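
To make the failure mode concrete, here is a minimal Python sketch (my illustration, not anything from the original incident) that simulates one common ASCII-mode translation, inserting a CR before every bare LF byte, and then watches the decompressor give up. A well-compressed stream looks nearly random, so roughly 1 byte in 256 is 0x0A, which is where a figure like 0.4% comes from; the sample log line and the direction of the translation are assumptions.

    import gzip
    import zlib

    # Something log-like to compress (a stand-in for months of Apache access logs).
    original = b"\n".join(
        b'192.0.2.%d - - [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326'
        % (i % 250) for i in range(10000)
    )
    compressed = gzip.compress(original)

    # Simulate an ASCII-mode transfer in one direction: prepend a CR to every
    # bare LF byte of the *binary* stream.
    corrupted = compressed.replace(b"\n", b"\r\n")
    print("bytes touched: %.2f%%"
          % (100.0 * (len(corrupted) - len(compressed)) / len(compressed)))

    try:
        gzip.decompress(corrupted)
    except (OSError, EOFError, zlib.error) as exc:
        # inflate staggers along for a while; there is no "this can't be right"
        # state until it hits an impossible code or the final crc/length check.
        print("decompression failed:", exc)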

The state of the art among the numerous “zip file repair programs” out there seems to concentrate on only two easy fixes (please correct me if I’m wrong):

  • Fix incorrect crc/checksums so that users won’t get an error message any more.
    This doesn’t repair any data, but it does recover from trivial truncations or extensions of the file that must happen to somebody occasionally (else why would the feature exist?). A rough sketch of this trailer patching appears after this list.
  • Skip over archive members with corrupt data and find other members that are not corrupt. This is useful if the cause of corruption is a bad block on the storage medium.
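
To make the first “fix” concrete: a gzip member ends with an 8-byte trailer holding the crc-32 and the length of the uncompressed data, and repairing the checksum just means rewriting that trailer to match whatever inflate actually produces. Here is a hedged sketch of the idea (my own illustration, not any particular repair tool); the single-member layout, the plain 10-byte header, and the name patch_gzip_trailer are assumptions.

    import struct
    import zlib

    def patch_gzip_trailer(path):
        """Recompute the crc-32/length trailer so gunzip stops complaining.

        Assumes a single member, a plain 10-byte header with no extra fields,
        and a deflate stream that still inflates cleanly; only the trailer is
        rewritten, so the data itself is exactly as good (or bad) as before.
        """
        with open(path, "rb") as f:
            data = f.read()

        # Inflate the raw deflate stream that follows the header.
        d = zlib.decompressobj(wbits=-zlib.MAX_WBITS)
        payload = d.decompress(data[10:])
        deflate_end = len(data) - len(d.unused_data)  # where the deflate data stops

        # Write back header + deflate stream + a freshly computed trailer:
        # crc-32 and uncompressed length modulo 2**32, both little-endian.
        trailer = struct.pack("<II", zlib.crc32(payload) & 0xFFFFFFFF,
                              len(payload) & 0xFFFFFFFF)
        with open(path, "wb") as f:
            f.write(data[:deflate_end] + trailer)
        return payload

After this, gunzip reports success, but nothing about the content has actually been repaired.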

Neither of these does anything to improve corrupted data.

In the general case, solving this problem by brute-force search through all possible repairs is not feasible; unless the file is small, the search will still be running when the lights go out on the universe. It turns out, though, that if the data has some structure, that structure is enough to prune most of the search tree and to prioritize the rest, so that the highest-probability repairs are tried first.
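
To illustrate what such a pruned, prioritized search can look like, here is a hedged Python sketch; it is my own reconstruction under the corruption model of this incident (a CR inserted before some LF bytes), not the code that was actually used. At every CR-LF pair in the damaged stream the search branches on “the CR was inserted” versus “the CR was already there”, ranks candidates by how much of the partial inflate output still parses as Apache access-log lines, and abandons branches that stop looking like a log. The regular expression, the threshold, and the function names are assumptions for the example.

    import heapq
    import re
    import zlib

    # Roughly what a healthy Apache access-log line looks like (an assumption
    # for this example; the real pattern depends on the LogFormat in use).
    LOG_LINE = re.compile(rb'^\S+ \S+ \S+ \[[^\]]*\] "[^"]*" \d{3} \S+')

    def plausibility(text):
        """Fraction of the complete lines decoded so far that parse as log lines."""
        lines = text.split(b"\n")[:-1]          # drop the trailing partial line
        if not lines:
            return 1.0
        return sum(1 for line in lines if LOG_LINE.match(line)) / len(lines)

    def partial_inflate(deflate_bytes, chunk=4096):
        """Inflate as far as possible, keeping whatever came out before any error."""
        d = zlib.decompressobj(wbits=-zlib.MAX_WBITS)
        out = []
        try:
            for i in range(0, len(deflate_bytes), chunk):
                out.append(d.decompress(deflate_bytes[i:i + chunk]))
        except zlib.error:
            pass                                # keep the clean prefix only
        return b"".join(out)

    def strip_crs(body, positions):
        """Drop the CR byte at each chosen offset."""
        drop = set(positions)
        return bytes(b for i, b in enumerate(body) if i not in drop)

    def repair(corrupted, threshold=0.9):
        """Best-first search over which CRs to strip from the deflate stream."""
        body = corrupted[10:]                   # assume an intact 10-byte gzip header
        spots = [m.start() for m in re.finditer(b"\r\n", body)]
        heap = [(-1.0, 0, ())]                  # (-score, spots decided, CRs removed)
        while heap:
            _, k, removed = heapq.heappop(heap)
            if k == len(spots):
                return strip_crs(body, removed) # every spot decided: candidate repair
            for delete_cr in (True, False):
                choice = removed + ((spots[k],) if delete_cr else ())
                score = plausibility(partial_inflate(strip_crs(body, choice)))
                if score >= threshold:          # prune branches that turn to gibberish
                    heapq.heappush(heap, (-score, k + 1, choice))
        return None                             # nothing survived the pruning

A surviving candidate still has to be re-wrapped with its gzip header and checked against the original crc. A real implementation would also checkpoint the inflate state at each decision point instead of re-inflating from the start, and would score only the newly produced output, but the shape of the search (branch, score, prune, expand the most plausible candidate next) is the same.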

Apache access logs have plenty of structure, so my colleague got back a close match to his original data. I’ve documented the process (look here for slightly more detail) to offer hope to others with difficult cases of critical data in otherwise hopelessly damaged files. Unfortunately it’s not a turn-key process; each case requires a certain amount of tuning based on the cause of the corruption and the structure of the data.
