escapewindow | Pooling, part 2: the tinderbox model on multiple branches

(Continuing the blogging blitz: here is pooling, part 2.
This illustrates how the Tinderbox model can quickly become a headache to maintain on multiple branches, and what problems the pooling model is trying to solve.)

Since each column is its own build machine, if trunk has 12 columns (and you want to have the same coverage on the new branch), you need to spin up 12 new tinderbox machines with similar configurations for the new branch.

Let's reexamine the benefits and drawbacks of the Tinderbox model, with multiple branches in mind:

[i] Anyone can spin up a new builder.

If anyone wants to start working on a new project, platform, or branch, they can run their own tinderbox and send the results to the tinderbox server on their own schedule. This meant that developers could have the coverage they wished, and community members could add ports for the platforms they cared about.

After these ran for a while, they were often "donated" to the Release team to maintain.

This worked fairly well, but donated tinderboxen often came undocumented, resulting in maintenance headaches down the road. Many, many machines were labeled "Don't Touch!" because no one knew if any changes would break anything, and no one knew how to rebuild them if anything catastrophic happened.

[ii] It's relatively simple to make changes to a single build.

If a particular branch needs a different toolchain or setting, it's not difficult to set up that branch's build machine with that. In fact, when we wanted to, say, change compilers on a single branch, we usually spun up a new build machine with that new compiler, and ran it in parallel with the old one until it was reliably green.

These inconsistencies also made it difficult to determine why changes worked on one branch but not another. Was it the new compiler? Or a hidden environment variable? Were the patch/service pack levels the same? Did it matter that one tinderbox was running win2k when the other ran NT?

[iii] Consistent wait times [for a single build].

No matter how many checkins happen on any (or all) branches, wait times stay consistent.

On the other hand, if a flurry of checkins happen on trunk, and the branches lie idle, all of those changes are picked up by the trunk builders. The branch builders continue building the latest revision on those idle branches or lie idle.

The drawbacks stay the same, although amplified with each additional machine and build type to administer and maintain.

I wasn't at Mozilla at the time, but as I understand it, a little more than two years ago, the tree would regularly be held closed whenever a single build machine went down -- unscheduled downtimes on a fairly consistent basis. In addition to the tree closures required to figure out who broke the build.

These were among the reasons for the move to Buildbot pooling, which I'll cover in part 3.