Oct. 27th, 2009

escapewindow: escape window (Default)

(Continuing the blogging blitz: here is pooling, part 1.
This illustrates how builds were set up at one point in Mozilla and Netscape's past, mainly to contrast with how they're set up currently.)

(from here)

There were many variations* of the old Tinderbox model of continuous integration. The basic concept involved a single machine running a single type of build (e.g., Win32 Opt Depend) on a 24/7 basis; when the previous build finished, the next build would start.

When we needed more build types (e.g. adding MacOSX coverage), we added more machines, one for each new build type. They would each be represented by their own column on the Tinderbox page, color coded green for success, red for failure, orange for test failed.

There are inherent benefits to such a model:

[i] Anyone can spin up a new builder.

This is partially due to the delivery of logs via mail (and later, in Tinderbox 2, via ftp), but also because each machine and tinderbox client is standalone. Anyone with a spare machine can spin up a Tinderbox builder.

[ii] It's relatively simple to make changes to a single build.

Need a new compiler? A different SDK? A whole new toolchain? Track down the machine running that build and make those changes, and you're done.

(You documented that, right?)

[iii] Consistent wait times [for a single build].

The maximum wait time for one build type to pick up your change is a little less than one full build cycle (if you happen to check in immediately after a build cycle starts, you need to wait for the next cycle). If a full build takes one hour, the longest end-to-end time is a little less than 2 hours. This is true whether one person checked in or five hundred people checked in.

(Later, people started running two of the same build and staggering them so that the longest wait time was a little less than 1/2 a full build cycle.)

Any drawbacks?

[i] The tree has many single points of failure.

Most of these build machines are unique. If something happens to one machine, that column goes perma-red or drops from the waterfall. If it's measuring something critical (and most of them are), that means tree closure.

[ii] It's easy to lose track of build [script|machine] changes.

It is simple to make changes to the build toolchain, scripts, or environment on individual tinderboxen. Unfortunately, it's also simple to make those changes without properly documenting or checking in those changes. It's only a matter of time before this becomes a problem.

Missing or faulty documentation might only be discovered after massive hardware failure, long after the people responsible for those changes have moved on. If you're unfortunate enough to not have a recent clone or full backup of that machine, you may be looking at a possible multi-day tree closure.

This also affects spinning up a new build machine or making changes to an existing one. If there are settings you're unaware of, troubleshooting the problem can eat up valuable time.

[iii] It's hard to track down who broke the tree.

Since each build cycle can pick up multiple checkins, it can be difficult to tell which checkin broke a particular build or test. This can become a protracted session of finger pointing, involving multiple developers and the reliability of the build machine(s) in question.

This was exacerbated by the old CVS problem of figuring out which build actually picked up your checkin. Also, the fact that each build machine (tinderbox column) has different-length cycles means that builds start at different times, and each build picks up different combinations of new checkins. Those can each break in new and exciting ways, for different reasons.

Don't get me wrong; I have a fondness for Tinderbox that it seems few people share. But I can be objective about its strengths and weaknesses, and one of its weaknesses is that it doesn't scale very well. At least not Scale with a capital Scale. (And scaling is a major factor in our decisions today.)

I'll illustrate that a bit more in the next segment: the tinderbox model on multiple branches.

* (We did have depend tinderboxen that spit out clobber or release builds at certain times of day or when a certain file was touched. We also had machines that cycled through several different build types -- these exceptions tended to occur on side projects that had fewer developers or less hardware. But for the most part, it was a single machine for a single build type.)

escapewindow: escape window (Default)

(Continuing the blogging blitz: here is pooling, part 2.
This illustrates how the Tinderbox model can quickly become a headache to maintain on multiple branches, and what problems the pooling model is trying to solve.)

Since each column is its own build machine, if trunk has 12 columns (and you want to have the same coverage on the new branch), you need to spin up 12 new tinderbox machines with similar configurations for the new branch.

Let's reexamine the benefits and drawbacks of the Tinderbox model, with multiple branches in mind:

[i] Anyone can spin up a new builder.

If anyone wants to start working on a new project, platform, or branch, they can run their own tinderbox and send the results to the tinderbox server on their own schedule. This meant that developers could have the coverage they wished, and community members could add ports for the platforms they cared about.

After these ran for a while, they were often "donated" to the Release team to maintain.

This worked fairly well, but donated tinderboxen often came undocumented, resulting in maintenance headaches down the road. Many, many machines were labeled "Don't Touch!" because no one knew if any changes would break anything, and no one knew how to rebuild them if anything catastrophic happened.

[ii] It's relatively simple to make changes to a single build.

If a particular branch needs a different toolchain or setting, it's not difficult to set up that branch's build machine with that. In fact, when we wanted to, say, change compilers on a single branch, we usually spun up a new build machine with that new compiler, and ran it in parallel with the old one until it was reliably green.

These inconsistencies also made it difficult to determine why changes worked on one branch but not another. Was it the new compiler? Or a hidden environment variable? Were the patch/service pack levels the same? Did it matter that one tinderbox was running win2k when the other ran NT?

[iii] Consistent wait times [for a single build].

No matter how many checkins happen on any (or all) branches, wait times stay consistent.

On the other hand, if a flurry of checkins happen on trunk, and the branches lie idle, all of those changes are picked up by the trunk builders. The branch builders continue building the latest revision on those idle branches or lie idle.

The drawbacks stay the same, although amplified with each additional machine and build type to administer and maintain.

I wasn't at Mozilla at the time, but as I understand it, a little more than two years ago, the tree would regularly be held closed whenever a single build machine went down -- unscheduled downtimes on a fairly consistent basis. In addition to the tree closures required to figure out who broke the build.

These were among the reasons for the move to Buildbot pooling, which I'll cover in part 3.

February 2017


Most Popular Tags

Style Credit

Expand Cut Tags

No cut tags
Page generated Sep. 22nd, 2017 01:23 pm
Powered by Dreamwidth Studios