(Continuing the blogging blitz: here is pooling, part 3.)
The build pool consists of a number of identical build machines that can handle all builds of a certain category, across branches.
**Builds on checkin**

Pooling lends itself to building on checkin: each checkin triggers a full set of builds. This gives much more granular information about each checkin: does it build on every platform? Does it pass all tests? This saves many hours of "who broke the build" digging. As a result, the tree should stay open longer. The tradeoff is wait times. During peak traffic, checkins can trigger more build requests than there are available build slaves. As builds begin to queue, new build requests sit idle for longer and longer before build slaves are available to handle those requests. You can combat wait times via queue collapsing: once builds queue, the master can combine multiple checkins in the next build. However, this negatively affects granular per-checkin information. Another solution to wait times is adding more build slaves to the pool.
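To make the on-checkin scheduling and queue collapsing concrete, here's a minimal sketch in Buildbot-style Python config. It assumes a Buildbot 0.8.x-era API; the builder names and branch are made up, newer Buildbot versions call request merging "collapsing", and none of this is Mozilla's actual configuration.

```python
# Hypothetical master.cfg fragment (illustrative only).
from buildbot.schedulers.basic import SingleBranchScheduler
from buildbot.changes.filter import ChangeFilter

c = BuildmasterConfig = {}

# Build on checkin: with treeStableTimer=None, every change gets its own
# set of build requests, one per builder.
c['schedulers'] = [
    SingleBranchScheduler(
        name='mozilla-central-on-checkin',
        change_filter=ChangeFilter(branch='mozilla-central'),
        treeStableTimer=None,
        builderNames=['linux-opt', 'linux-debug', 'macosx-opt', 'win32-opt'],
    ),
]

# Queue collapsing: when requests pile up faster than slaves free up, let the
# master merge pending requests for the same builder into a single build --
# trading away some per-checkin granularity.
c['mergeRequests'] = True
```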
**Dynamic allocation**

As long as there are available build slaves, the pool is dynamically allocated to where it's needed. If one branch is especially busy, more build slaves can be temporarily allocated to that branch. Or if the debug build takes twice as long, more build slaves can be allocated to keep it from falling behind. (At Mozilla, this happens within Buildbot and requires no manual intervention beyond the initial configuration.) This is in direct contrast to the tinderbox model, where busier branches or longer builds would always mean more changes per build. Dynamic allocation adds a certain amount of fault tolerance. In the tinderbox model, a single machine going down could cause tree closure. In the pooling model, a number of build machines in the pool could fall over, and the builds would continue at a slower rate. The main drawback to dynamic allocation is that an extremely long build or an overly busy branch can starve the other builds/branches of available build machines.
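In Buildbot terms, the dynamic allocation mostly falls out of every builder listing the same set of slaves: an idle slave from the shared pool picks up whichever build request is pending, regardless of branch or build type. A hedged sketch, continuing the hypothetical master.cfg fragment above (slave names, password, and build steps are made up):

```python
from buildbot.buildslave import BuildSlave
from buildbot.config import BuilderConfig
from buildbot.process.factory import BuildFactory
from buildbot.steps.shell import Compile

# One pool of identical machines...
POOL = ['moz2-linux-slave%02d' % i for i in range(1, 31)]
c['slaves'] = [BuildSlave(name, 'password') for name in POOL]

factory = BuildFactory()
factory.addStep(Compile(command=['make', '-f', 'client.mk', 'build']))

# ...shared by every builder on every branch, so whichever branch or build
# type is busiest soaks up the idle slaves automatically.
c['builders'] = [
    BuilderConfig(name='%s-%s' % (branch, buildtype),
                  slavenames=POOL,
                  factory=factory)
    for branch in ('mozilla-central', 'mozilla-1.9.1')
    for buildtype in ('opt', 'debug')
]
```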
**Self-testing process**

In the tinderbox model, one of the weaknesses was machine setup documentation. This can be mitigated with strict change management and VM cloning, but there's no real ongoing test to verify that the documentation is up to date. Since pooled slaves jump from build to build and from branch to branch, it's easier to detect whether breakage is build-slave- or code/branch-specific. This isn't perfect, especially with heisenbugs, but it's definitely an improvement. In addition, every time you set up a new build slave, that tests the documentation and process. This happens much, much more often than spinning up new tinderboxes in the tinderbox model.
**Spinning up a new branch or build**

Since the pool of slaves can handle any existing branch or build, it's relatively easy to spin up a new, compatible branch or build type. It's even possible to do so by merely updating the master config files, with none of the "spin up N new tinderbox machines" work. However, new branches and build types do add more load to the pool; it's important to keep capacity and wait times in mind. As the full set of builds shows, it's easy to lose track of just how much your build pool is responsible for. Still, I think it's clear that this is a big win for pooling, as the number of active branches and builds at Mozilla is as high as I've seen anywhere.
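As a rough illustration of "merely updating the master config files": if the builder list is generated from branch data, spinning up a compatible new branch is one new entry, and the existing pool absorbs the extra load. Again a hypothetical sketch (repository URLs, helper names, and build steps are illustrative, not Mozilla's actual buildbot-configs), continuing the fragment above:

```python
from buildbot.config import BuilderConfig
from buildbot.process.factory import BuildFactory
from buildbot.steps.source import Mercurial
from buildbot.steps.shell import Compile

BRANCHES = {
    'mozilla-central': 'https://hg.mozilla.org/mozilla-central',
    'mozilla-1.9.1':   'https://hg.mozilla.org/releases/mozilla-1.9.1',
    # Spinning up a new, compatible branch is one line here -- no new
    # machines, just more load on the shared pool (watch your wait times).
    'tracemonkey':     'https://hg.mozilla.org/tracemonkey',
}
BUILD_TYPES = ('opt', 'debug')

def make_factory(hgurl, buildtype):
    # Details elided; a real factory would vary mozconfig etc. per build type.
    f = BuildFactory()
    f.addStep(Mercurial(repourl=hgurl, mode='clobber'))
    f.addStep(Compile(command=['make', '-f', 'client.mk', 'build']))
    return f

# Regenerate the builder list from data; POOL and c are the shared slave pool
# and BuildmasterConfig dict from the previous sketches.
c['builders'] = [
    BuilderConfig(name='%s-%s' % (branch, buildtype),
                  slavenames=POOL,
                  factory=make_factory(hgurl, buildtype))
    for branch, hgurl in sorted(BRANCHES.items())
    for buildtype in BUILD_TYPES
]
```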
**The tyranny of the single config**

It's very, very powerful to have a single configuration that works for all builds across all branches. However, this is also a very strict limitation. In the tinderbox model, a change could be made to a single machine without affecting the stability of any other builds or branches. Once that one build goes green, you're golden. In the pooling model, the change needs to propagate across the entire pool, and it affects all builds across all branches. As the number of branches and build types grows, the testing matrix for config changes grows as well. And at some point, new, incompatible requirements rear their ugly heads -- maybe an incompatible new toolchain that can't coexist with the previous one, or a whole new platform. At that point, you need to create a new pool. And ramping that up from zero can be a time-consuming process.
I hope the above helps illustrate the pooling model and some of its benefits and drawbacks.
We don't just have a single build pool here, however; we have multiple, and the number is growing. This was partially by design, and partially to deal with growing pains as we scale larger and larger.
I'll illustrate where we are today in the next segment: split pools.