escapewindow | hg-git part vi: hg corruption

(This is one of several shorter blog posts about features in the new vcs sync process.)

We've seen instances of hg corruption before, like

abort: data/dom/network/interfaces/nsIDOMNetworkStats.idl.i@50a5a9362723: unknown parent!
abort: data/mobile/android/base/DoorHangerPopup.java.i@62e6137d125c: no match found!
abort: data/b2g/config/gaia.json.i@327bccdb70e8: no node!
abort: unknown revision '177bf37a49f5'!

People have theorized that at least one of these may be caused by a race condition while pulling from an http head during an ssh push (edit: bug 737865); others seem a bit more random. We deal with this in our TBPL build/test CI by detecting these types of failures, and nuking/recloning the repo.

We also see these in the legacy vcs-sync process. With a single-, non-cvs-prepended-, mozilla-central-based- repo, recovering from hg corruption in the working conversion directory is a manual process that can take multiple hours. I saw this as a non-starter for a repo like beagle or gecko.git, where rebuilding the entire thing from scratch can take over a week.

As I mentioned here, the new process has an intermediate stage_source clone:

hg.mozilla.org -> stage_source clone -> conversion_dir

When we detect corruption in the stage_source clone, we don't have to worry very much; just clobber and reclone. The time to recreate a fresh clone of a single mercurial repo is a matter of a few minutes, not multiple hours. This approach uses more disk, but helps prevent long downtimes.

Previously I had been running hg verify in each stage_source clone before proceeding, but that slowed each pull/convert/push cycle down by ~5 minutes per source mercurial repo (and doesn't always fix the problem), making it a non-viable option.