escapewindow: escape window (Default)

(Continuing the blogging blitz: here is pooling, part 3.)

The build pool consists of a number of identical build machines that can handle all builds of a certain category, across branches.

Builds on checkin

Pooling lends itself to building on checkin: each checkin triggers a full set of builds.

This gives much more granular information about each checkin: does it build on every platform? Does it pass all tests? This saves many hours of "who broke the build" digging. As a result, the tree should stay open longer.

The tradeoff is wait times. During peak traffic, checkins can trigger more build requests than there are available build slaves. As builds begin to queue, new build requests sit idle for longer and longer before build slaves are available to handle those requests.

You can combat wait times via queue collapsing: once builds queue, the master can combine multiple checkins into the next build. However, this sacrifices some of the granular per-checkin information.
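As a rough illustration of queue collapsing, here is a minimal Python sketch. It is not Buildbot's actual merge logic; the `collapse_queue` function, the builder names and the revision labels are all made up for the example. It assumes that pending requests for the same builder can be folded into one build of the newest revision.

```python
from collections import OrderedDict

def collapse_queue(pending):
    """Merge queued requests that target the same builder.

    pending: list of (builder, revision) tuples in checkin order.
    Returns a list of (builder, [revisions]) with one entry per builder,
    ordered by each builder's oldest pending request.
    """
    merged = OrderedDict()
    for builder, revision in pending:
        merged.setdefault(builder, []).append(revision)
    return list(merged.items())

# Hypothetical queue: three checkins are waiting on the same builder.
pending = [
    ("linux-opt", "rev1"),
    ("linux-debug", "rev1"),
    ("linux-opt", "rev2"),
    ("linux-opt", "rev3"),
]
for builder, revisions in collapse_queue(pending):
    # One build runs against the newest revision and covers all of them.
    print(builder, "builds", revisions[-1], "covering", revisions)
```

If the collapsed linux-opt build fails, all you know is that one of rev1 through rev3 broke it, which is exactly the loss of granularity described above.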

Another solution to wait times is adding more build slaves to the pool.
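To get a feel for how wait times grow at peak and shrink with a bigger pool, here is a small simulation sketch. It assumes strict first-come, first-served scheduling and a fixed build duration, which is simpler than a real pool; `simulate_wait_times` and the numbers are illustrative only.

```python
import heapq

def simulate_wait_times(arrivals, build_minutes, num_slaves):
    """Return how long (in minutes) each build request sits in the queue.

    arrivals: sorted list of request arrival times, in minutes.
    build_minutes: how long each build occupies a slave.
    num_slaves: size of the pool.
    """
    # Min-heap of the times at which each slave next becomes free.
    free_at = [0] * num_slaves
    heapq.heapify(free_at)
    waits = []
    for arrival in arrivals:
        slave_free = heapq.heappop(free_at)
        start = max(arrival, slave_free)
        waits.append(start - arrival)
        heapq.heappush(free_at, start + build_minutes)
    return waits

# Six checkins in quick succession, 60-minute builds.
peak = [0, 5, 10, 15, 20, 25]
print(simulate_wait_times(peak, 60, 2))  # [0, 0, 50, 50, 100, 100]
print(simulate_wait_times(peak, 60, 6))  # [0, 0, 0, 0, 0, 0]
```

With two slaves, each later pair of requests waits longer than the one before; with six, nothing queues at all.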

Dynamic allocation

As long as there are available build slaves, the pool is dynamically allocated to where it's needed. If one branch is especially busy, more build slaves can be temporarily allocated to that branch. Or if the debug build takes twice as long, more build slaves can be allocated to keep it from falling behind.

(At Mozilla, this happens within Buildbot and requires no manual intervention beyond the initial configuration.)
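The idea can be sketched in a few lines of Python. This is not how Buildbot implements it; `allocate`, the slave names and the branch names are assumptions for the example. The point is only that idle slaves go to whatever is pending, so a busy branch naturally ends up with more of them.

```python
from collections import Counter, deque

def allocate(idle_slaves, pending):
    """Hand idle slaves to the oldest pending requests, whatever their branch.

    idle_slaves: list of slave names.
    pending: deque of (branch, build_type) requests, oldest first.
    Returns a list of (slave, branch, build_type) assignments.
    """
    assignments = []
    for slave in idle_slaves:
        if not pending:
            break
        branch, build_type = pending.popleft()
        assignments.append((slave, branch, build_type))
    return assignments

# Hypothetical snapshot: one branch has three requests pending, another has one.
pending = deque([
    ("busy-branch", "opt"),
    ("busy-branch", "debug"),
    ("busy-branch", "opt"),
    ("quiet-branch", "opt"),
])
assigned = allocate(["slave01", "slave02", "slave03", "slave04"], pending)
print(Counter(branch for _, branch, _ in assigned))
```

No slave is tied to a branch here; the split between branches falls out of the queue contents alone.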

This is in direct contrast to the tinderbox model, where busier branches or longer builds would always mean more changes per build.

Dynamic allocation adds a certain amount of fault tolerance. In the tinderbox model, a single machine going down could cause tree closure. In the pooling model, a number of build machines in the pool could fall over, and the builds would continue at a slower rate.

The main drawback to dynamic allocation is that an extremely long build or an overly busy branch can starve the other builds/branches of available build machines.
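Starvation is easy to reproduce in a toy model. The sketch below assumes a plain first-come, first-served pool with no per-branch limits; `waits_by_branch` and the request data are invented for the illustration.

```python
import heapq

def waits_by_branch(requests, num_slaves):
    """FIFO pool simulation; returns {branch: [wait_minutes, ...]}.

    requests: list of (arrival_minute, branch, build_minutes),
    sorted by arrival time.
    """
    # Min-heap of the times at which each slave next becomes free.
    free_at = [0] * num_slaves
    heapq.heapify(free_at)
    waits = {}
    for arrival, branch, build_minutes in requests:
        start = max(arrival, heapq.heappop(free_at))
        waits.setdefault(branch, []).append(start - arrival)
        heapq.heappush(free_at, start + build_minutes)
    return waits

# Four long builds on one branch land just before a short build on another.
requests = [
    (0, "busy", 240), (1, "busy", 240), (2, "busy", 240), (3, "busy", 240),
    (5, "quiet", 30),
]
print(waits_by_branch(requests, 4))
# {'busy': [0, 0, 0, 0], 'quiet': [235]}
```

The 30-minute build on the quiet branch waits almost four hours because every slave is tied up with the other branch's long builds.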

Self-testing process

In the tinderbox model, one of the weaknesses was machine setup documentation. This can be mitigated with strict change management and VM cloning, but there's no real ongoing test to verify that the documentation is up to date.

Since pooled slaves jump from build to build and from branch to branch, it's easier to detect whether breakage is specific to a build slave or to the code/branch. This isn't perfect, especially with heisenbugs, but it's definitely an improvement.

In addition, every new build slave you set up is a test of the documentation and process. This happens much, much more often than spinning up new tinderboxes in the tinderbox model.

Spinning up a new branch or build

Since the pool of slaves can handle any existing branch or build, it's relatively easy to spin up a new, compatible branch or build type. It's even possible to do so by merely updating the master config files, with none of the "spin up N new tinderbox machines" work.
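Here is a sketch of what "merely updating the master config" can look like. The `config` layout and the `builder_names` helper are hypothetical and are not Buildbot's real configuration schema; the branch, platform and slave names are placeholders.

```python
from itertools import product

def builder_names(branches, platforms, build_types):
    """Expand the config into one builder name per combination."""
    return ["%s-%s-%s" % combo
            for combo in product(branches, platforms, build_types)]

# Hypothetical master config: every slave can run every builder.
config = {
    "branches": ["trunk", "release"],
    "platforms": ["linux", "mac", "win32"],
    "build_types": ["opt", "debug"],
    "slaves": ["slave%02d" % n for n in range(1, 9)],
}

before = builder_names(config["branches"], config["platforms"],
                       config["build_types"])

# Spinning up a new branch is a one-line config change.
config["branches"].append("project-branch")

after = builder_names(config["branches"], config["platforms"],
                      config["build_types"])
print(len(before), "->", len(after), "builders on",
      len(config["slaves"]), "slaves")
# 12 -> 18 builders on 8 slaves
```

Note that the slave list did not change: the new branch's six builders are served by the same eight machines.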

However, new branches and build types do add more load to the pool; it's important to keep capacity and wait times in mind. As the full set of builds shows, it's easy to lose track of just how much your build pool is responsible for.

Still, I think it's clear that this is a big win for pooling, as the number of active branches and builds at Mozilla is as high as I've seen anywhere.

The tyranny of the single config

It's very, very powerful to have a single configuration that works for all builds across all branches. However, this is also a very strict limitation.

In the tinderbox model, a change could be made to a single machine without affecting the stability of any other builds or branches. Once that one build goes green, you're golden.

In the pooling model, the change needs to propagate across the entire pool, and it affects all builds across all branches. As the number of branches and build types grows, the testing matrix for config changes grows as well.

And at some point, new, incompatible requirements rear their ugly heads -- maybe an incompatible new toolchain that can't coexist with the previous one, or a whole new platform. At that point, you need to create a new pool. And ramping that up from zero can be a time-consuming process.

I hope the above helps illustrate the pooling model and some of its benefits and drawbacks.

We don't just have a single build pool here, however; we have multiple, and the number is growing. This was partially by design, and partially to deal with growing pains as we scale larger and larger.

I'll illustrate where we are today in the next segment: split pools.

As part of RelEng's Blogging Blitz, I'm going to write a bit about [build slave] pooling concepts, differentiating between the old Tinderbox model and the Buildbot pool-of-slaves model.

The topics covered will be:

  • The tinderbox model on a single branch.
  • The tinderbox model on multiple branches.
  • The pooling model on multiple branches.
  • Split pools.
  • Some new approaches.

I keep running into the same questions, project after project, company after company. How do I see who broke the build? How do I know if this bug has been fixed in this codeline? How do I see the difference between these two builds? And how can we make this all happen faster? Smoother? Easier? And you revisit the same questions as new tangles and complexities of scale are added to the equation.

I joined Netscape in 2001, partially to play with a bunch of weird unices, partially to see How Build Engineering is Done Properly. I had already tackled LXR/ViewCVS + Bonsai + Tinderbox elsewhere, and that toolchain has loomed large at every place I've been, at least in concept. Source browsing + repository querying + continuous integration, no matter which specific products you use.

Here I am, back at Mozilla after a number of years away, and I'm amused (and pleased) to see that the current state of things is surprisingly analogous to how I designed our build system elsewhere. We had the db with history and archived configurations; buildbot has the daemon with real-time log views; but otherwise things are fairly close. Both systems are in similar stages of development, ready to take that next step.

Here are some thoughts about the direction that next step could go. Please keep in mind that I'm trying to ignore how much work this will all be to implement for the moment, so I can write down my ideas without cringing.

  • Is hgweb a strong bonsai replacement?
  • Tinderbox: waterfall vs dashboard
  • Buildbot: strengths and weaknesses
  • Pull, not push
  • Still a ways out

Page generated Jan. 20th, 2017 07:42 am