  1. tl;dr
  2. internal benefits to generically packaged tools
  3. learning from history: tinderbox
  4. mozharness and scriptharness
  5. treeherder and taskcluster
  6. other tools

tl;dr

One of the reasons I'm back at Mozilla is to work in-depth with some exciting new tools and infrastructure, at scale. However, I wish these tools could be used equally well by employees and non-employees. I see some efforts to improve this. But if this is really important to us, we need to make it a point of emphasis.

If we do this, we can benefit from a healthier, extended community. There are also internal benefits to making our tools packaged in a generic way. I'll go into these in the next section.

I've started contacting some tool maintainers, and so far the response has been good. I'll continue doing so; hopefully I can write a followup blog post about efforts under way to make generically packaged tools a reality.


internal benefits to generically packaged tools

Besides the strengthened community, there are other internal benefits.

  • upgrades

    Once installation is packaged and automated, an upgrade to a service might be:

    • spin up a new service
    • test it
    • send over some traffic (applicable if the service is load balanced)
    • go/no-go
    • cut over to the appropriate service and turn off the other one.

    This entire process can be fully automated. Once it is smooth enough, upgrading a service can be seamless and relatively worry free. (A sketch of this flow appears after this list.)

  • disaster recovery

    If a service is only installable manually, a disaster recovery scenario might involve people working around the clock to reinstall a service.

    Once the installation is automated and configurable, this changes. A cold backup solution might be similar to the above upgrade scenario. If disaster strikes, have someone install a new one from the automation, or have a backup instance already installed, ready for someone to switch over.

    A hot backup solution might involve having multiple load balanced services running across regions, with automatic failovers. The automated install helps guarantee that each node in the cluster is configured correctly, without human error.

  • good first bugs

    (or intern projects, or GSOC projects, or...)

    The more special-snowflake and Mozilla-specific our tools are, the more likely they are to be tied closely to other Mozilla-specific services, so a seemingly simple change might require touching many different codebases. These tools are also more likely to require VPN or special LDAP access, which presents a barrier to new contributors.

    If a new contributor is able to install tools locally, that guarantees that they can work on standalone bugs/projects involving those tools. And successful good first bugs and intern/GSOC type projects directly lead to a stronger contributor base.

  • special projects

    At various team work weeks in years past, we brainstormed being able to launch entire chunks of infrastructure as self-contained units. These could handle project-branch type work. When the code was merged back into trunk, we could archive the data and shut down the instances used.

    This concept also works for special projects: things that don't fit within the standard workflow.

    If we can spin up services in a separate, network isolated area, riskier or special-requirement work (whether in terms of access control, user permissions, partner secrets, etc) could happen there without affecting production.

  • self-testing

    Installing the package from scratch is the test for the generic packaging feature. The more we install it, the smaller the window of changes we need to inspect for installation bustage. This is the same as any other software feature.

    Having an install test for each tool gives us reassurances that the next time we need to install the service (upgrade, disaster recovery, etc.) it'll work.
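
To make a couple of these concrete: below is a minimal sketch of the automated upgrade flow from the first bullet, written in Python with every step passed in as a callable. None of these names exist anywhere; in real life each step would wrap whatever provisioning, health-check, and load-balancer APIs are actually in use.

    def upgrade(provision, smoke_test, shift_traffic, go_no_go, cutover):
        """Hypothetical upgrade orchestration; each argument is a callable
        wrapping a real provisioning or load-balancer API."""
        new_node = provision()               # spin up a new service
        if not smoke_test(new_node):         # test it
            raise RuntimeError("smoke test failed; aborting upgrade")
        shift_traffic(new_node, percent=10)  # send over some traffic (if load balanced)
        if not go_no_go(new_node):           # go/no-go decision
            raise RuntimeError("no-go; leaving the old service in place")
        old_node = cutover(new_node)         # cut over to the new service...
        return old_node                      # ...so the caller can turn the old one off

And the install self-test could be as small as a scheduled job that builds the package into a clean environment and runs it. This sketch assumes a hypothetical pip-installable package with a --version entry point; substitute the real install steps for whichever tool is being tested.

    #!/usr/bin/env python
    """Hypothetical install smoke test: build a clean virtualenv, install the
    package from scratch, and verify the entry point runs. The package name
    and entry point are placeholders."""
    import os
    import subprocess
    import sys
    import tempfile

    def install_test(package="mytool", entry_point="mytool"):
        venv = tempfile.mkdtemp(prefix="install-test-")
        subprocess.check_call([sys.executable, "-m", "virtualenv", venv])
        pip = os.path.join(venv, "bin", "pip")
        tool = os.path.join(venv, "bin", entry_point)
        subprocess.check_call([pip, "install", package])   # install from scratch
        subprocess.check_call([tool, "--version"])         # does it even run?

    if __name__ == "__main__":
        install_test()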


learning from history: tinderbox

In 2000, a developer asked me to install tinderbox, a continuous integration tool written and used at Netscape. It would allow us to see the state of the tree and minimize bustage.

One of the first things I saw was this disclaimer:

This is not very well packaged code.  It's not packaged at all.  Don't
come here expecting something you plop in a directory, twiddle a few
things, and you're off and using it.  Much work has to be done to get
there.  We'd like to get there, but it wasn't clear when that would be,
and so we decided to let people see it first.

Don't believe for a minute that you can use this stuff without first
understanding most of the code.

I managed to slog through the steps and get a working tinderbox/bonsai/mxr install up and running. However, since then, I've met a number of other people who had tried and given up.

I ended up joining Netscape in 2001. (My history with tinderbox helped me in my interview.) An external contributor visited and presented tinderbox2 to the engineering team. It was configurable. It was modular. It removed Netscape-centric hardcodes.

However, it didn't fully support all tinderbox1 features, and certain default behaviors were changed by design. Beyond that, Netscape employees already had fully functional, well maintained instances that worked well for us. Rather than sinking time into extending tinderbox2 to cover our needs, we ended up staying with the disclaimered, unpackaged tinderbox1. And that was the version running at tinderbox.mozilla.org, until its death in May 2014.

For a company focused primarily on shipping a browser, shipping the tools used to build that browser isn't necessarily a priority. However, there were some opportunity costs:

  • Tinderbox1 continued to suffer from the same large barrier to entry, stunting its growth.
  • I don't know how widely tinderbox2 was used, but I imagine adoption at Netscape would have been a plus for the project. (I did end up installing tinderbox2 post-Netscape.)
  • A larger, healthier community could have resulted in upstreamed patches, and a stronger overall project in the long run.
  • People who use the same toolset may become external contributors or, eventually, employees (like me). People who have poor impressions of a toolset may be less interested in joining as contributors or employees.

mozharness and scriptharness

In my previous stint at Mozilla, I wrote mozharness, a Python framework for writing automation scripts.

I intentionally kept mozilla-specific code under mozharness.mozilla and generic mozharness code under mozharness.base. The idea was to make it easier for external users to grab a copy of mozharness and write their own scripts and modules for a non-Mozilla project. (By "non-Mozilla" and "external user", I mean anyone who wants to automate software anywhere.)
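
As an illustration, an external user's script would import only from mozharness.base and never touch mozharness.mozilla. This is a rough sketch from memory: the project is made up, and the class and method names (BaseScript, run_command, run_and_exit) are what I recall of the API and may differ between mozharness versions.

    # hypothetical_build.py -- sketch of a non-Mozilla mozharness script.
    # Only mozharness.base is imported; exact APIs may vary by version.
    from mozharness.base.script import BaseScript

    class HypotheticalBuild(BaseScript):
        def __init__(self):
            BaseScript.__init__(
                self,
                all_actions=["clobber", "checkout", "build"],
                default_actions=["checkout", "build"],
                config={"repo": "https://example.com/project.git"},
            )

        def clobber(self):
            self.rmtree("build")

        def checkout(self):
            self.run_command(["git", "clone", self.config["repo"], "build"])

        def build(self):
            self.run_command(["make"], cwd="build")

    if __name__ == "__main__":
        HypotheticalBuild().run_and_exit()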

However, after I left Mozilla, I didn't use mozharness for anything. Why not?

  • There's a non-trivial learning curve for people new to the project, and the benefits of adopting mozharness are most apparent when there's a certain level of adoption. I was working at time scales that didn't necessarily lend themselves to this.
  • I intentionally kept mozharness clone-and-run. I think this was the right model at the time, to lower the barrier to using mozharness until it had reached a certain level of adoption. Clone-and-run made it easier to use mozharness in buildbot, but it makes it harder to install or use just the mozharness.base module (see the packaging sketch after this list).
  • We did our best to keep Mozilla-isms out of mozharness.base via review. However, this would have been more successful with either an external contributor speaking up before we broke their usage model, or automated tests, or both.
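
For what it's worth, making just the generic half pip-installable wouldn't require much packaging glue; something like this hypothetical setup.py would be a starting point (mozharness didn't ship one at the time):

    # setup.py -- hypothetical packaging for the generic half of mozharness.
    from setuptools import setup

    setup(
        name="mozharness-base",
        version="0.1.0",
        description="Generic script harness modules (mozharness.base)",
        packages=["mozharness", "mozharness.base"],
    )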

So even with the best of intentions, I had ended up putting roadblocks in the way of external users. I didn't realize their scope until I was an external user myself.

I wrote scriptharness to try to address these problems (among others):

  • I revisited certain decisions that made mozharness less approachable: the mixins and monolithic Script object, the required locked config, the missing docstrings and tests.
  • I made scriptharness resources available at standard locations (generic packages on pypi, source at github, full docs at readthedocs).
  • Since it's a self-contained package, it's usable here or elsewhere. Since it's written to solve a generic problem rather than a Mozilla-specific problem, it's unencumbered by Mozilla-specific solutions.

I'd like to backport some of the better ideas from scriptharness to mozharness, to address some of these issues.

treeherder and taskcluster

After I left Mozilla, on several occasions we wanted to use other Mozilla tools in a non-Mozilla environment. As a general rule, this didn't work.

  • Continuous Integration (CI) Dashboard

    We had multiple Jenkins servers, each with a partial picture of our set of build+test jobs. Figuring out the state of the code base was complex and a specialized skill. Why couldn't we have one dashboard showing a complete view?

    I took a look at Treeherder. It has improved upon the original TBPL, but is designed to work specifically with Mozilla's services and workflows. I found it difficult to set up outside of a Mozilla environment.

  • CI Infrastructure

    We were investigating other open source CI solutions. There are many solutions for server-side apps, or linux-only solutions, or cross-platform solutions at small-to-medium scale. TaskCluster is the only one I know of that's cross-platform at massive scale.

    When we looked, all the tutorials and docs had to do with using the existing Mozilla production instance, which required a mozilla.com email address at the time. There were no docs for setting up TaskCluster itself.

    (Spoiler: I hear it may be a 2H project :D :D :D )

  • Single Sign-On

    An open source, trusted SSO solution sounded like a good thing to implement.

    However, we found out Persona had been EOL'd. We didn't look too closely at the implementation details after that.

(These are just the tools I tried to use in my 1 1/2 years away from Mozilla. There are more tools that might be useful to the outside world. I'll go into those in the next section.)

There are reasons behind each of these, and they may make a lot of sense internally. I'm not trying to place any blame or point fingers. I'm only raising the question of whether our future plans include packaging our tools for outside use, and if not, why not?

We're facing a similar decision to Netscape's. What's important? We can say we're a company that ships a browser, and the tools only exist for that purpose. Or we can put efforts towards making these tools useful in many environments. Generically packaged, with documentation that doesn't start with a disclaimer about how difficult they are to set up.

It's easy to say we'd like to, but we're too busy with ______. That's the gist of the tinderbox disclaimer. There are costs to designing and maintaining tools for use outside of one's subset of needs. But as long as packaging tools for outside use is not a point of emphasis, we'll maintain the status quo.


other tools

The above were just the tools that we tried to install. I asked around and built a list of Mozilla tools that might be useful to the outside world. I'm not sure if I have all the details correct; please correct me if I'm wrong!

  • mach - if all the mozilla-central-specific functions were moved to libraries, could this be useful for others?
  • bughunter - I don't know enough to say. This looks like a crash/assertion finder, tying into Socorro and bugzilla.
  • balrog - this now has docker support, which is promising for potential outside use.
  • marionette (already used by others)
  • reftest (already used by others)
  • pulse - this is a taskcluster dep.
  • Bugzilla - I've seen lots of instances successfully used at many other companies, and it has published installation docs.
  • I also hear that Socorro is successfully used at a number of other companies.

So we already have some success here. I'd love to see it extended -- more tools, and more use cases, e.g. supporting bugzilla or jira as the bug db backend when applicable.


I don't know how much demand there will be, if we do end up packaging these tools in a way that others can use them. But if we don't package them, we may never know. And I do know that there are entire companies built around shipping tools like these. We don't have to drop any existing goals on the floor to chase this dream, but I think it's worth pursuing in the future.


(Continuing the blogging blitz: here is pooling, part 3.)

The build pool consists of a number of identical build machines that can handle all builds of a certain category, across branches.

Builds on checkin

Pooling lends itself to building on checkin: each checkin triggers a full set of builds.

This gives much more granular information about each checkin: does it build on every platform? Does it pass all tests? This saves many hours of "who broke the build" digging. As a result, the tree should stay open longer.

The tradeoff is wait times. During peak traffic, checkins can trigger more build requests than there are available build slaves. As builds begin to queue, new build requests sit idle for longer and longer before build slaves are available to handle those requests.

You can combat wait times via queue collapsing: once builds queue, the master can combine multiple checkins into the next build. However, this negatively affects the granular per-checkin information.

Another solution to wait times is adding more build slaves to the pool.
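
To make the tradeoff concrete, here's a back-of-the-envelope sketch (all numbers made up) that models checkins as random arrivals against a fixed-size pool and reports how long build requests sit waiting for a free slave:

    """Toy wait-time model: Poisson checkin arrivals, each triggering a fixed
    number of fixed-length builds, served by a pool of identical slaves."""
    import heapq
    import random

    def simulate(pool_size, checkins_per_hour, builds_per_checkin,
                 build_minutes, hours=8):
        random.seed(0)
        slaves = [0.0] * pool_size           # minute at which each slave frees up
        heapq.heapify(slaves)
        waits = []
        t = 0.0
        while t < hours * 60:
            t += random.expovariate(checkins_per_hour / 60.0)   # next checkin
            for _ in range(builds_per_checkin):
                free_at = heapq.heappop(slaves)
                start = max(t, free_at)      # wait until a slave is free
                waits.append(start - t)
                heapq.heappush(slaves, start + build_minutes)
        return sum(waits) / len(waits), max(waits)

    if __name__ == "__main__":
        for pool in (40, 60, 80):
            avg, worst = simulate(pool, checkins_per_hour=6,
                                  builds_per_checkin=10, build_minutes=60)
            print("pool=%3d  avg wait=%6.1f min  worst wait=%6.1f min"
                  % (pool, avg, worst))

Queue collapsing would change the inner loop to merge whatever has queued into a single build request; adding slaves simply grows the pool.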


Dynamic allocation

As long as there are available build slaves, the pool is dynamically allocated to where it's needed. If one branch is especially busy, more build slaves can be temporarily allocated to that branch. Or if the debug build takes twice as long, more build slaves can be allocated to keep it from falling behind.

(At Mozilla, this happens within Buildbot and requires no manual intervention beyond the initial configuration.)

This is in direct contrast to the tinderbox model, where busier branches or longer builds would always mean more changes per build.

Dynamic allocation adds a certain amount of fault tolerance. In the tinderbox model, a single machine going down could cause tree closure. In the pooling model, a number of build machines in the pool could fall over, and the builds would continue at a slower rate.

The main drawback to dynamic allocation is that an extremely long build or an overly busy branch can starve the other builds/branches of available build machines.


Self-testing process

In the tinderbox model, one of the weaknesses was machine setup documentation. This can be assuaged with strict change management and VM cloning, but there's no real ongoing test to verify that the documentation is up to date.

Since pooled slaves jump from build to build and from branch to branch, it's easier to detect whether breakage is build-slave-specific or code/branch-specific. This isn't perfect, especially with heisenbugs, but it's definitely an improvement.

In addition, every time you set up a new build slave, that tests the documentation and process. This happens much, much more often than spinning up new tinderboxes in the tinderbox model.


Spinning up a new branch or build

Since the pool of slaves can handle any existing branch or build, it's relatively easy to spin up a new, compatible branch or build type. It's even possible to do so by merely updating the master config files, with none of the "spin up N new tinderbox machines" work.
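
In Buildbot terms (roughly the 0.8-era master.cfg conventions in use at the time), that looks something like the fragment below: one shared list of slave names is attached to every builder, so adding a branch or build type is just another entry in the config. The branch names and factory steps here are placeholders.

    # master.cfg fragment (sketch): one shared pool of slaves, many builders.
    from buildbot.config import BuilderConfig
    from buildbot.process.factory import BuildFactory
    from buildbot.steps.shell import ShellCommand

    c = BuildmasterConfig = {}   # in a real master.cfg this dict already exists

    # The whole pool of identical build slaves, shared across branches and build types.
    pool = ["bld-linux64-%03d" % i for i in range(1, 101)]

    def make_factory(branch, build_type):
        # Placeholder steps; a real factory would do checkout/configure/build/test.
        f = BuildFactory()
        f.addStep(ShellCommand(command=["echo", "build", branch, build_type]))
        return f

    c["builders"] = []
    for branch in ["mozilla-central", "mozilla-aurora", "new-project-branch"]:
        for build_type in ["opt", "debug"]:
            c["builders"].append(BuilderConfig(
                name="%s-linux64-%s" % (branch, build_type),
                slavenames=pool,         # any slave in the pool can take any build
                factory=make_factory(branch, build_type),
            ))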

However, new branches and build types do add more load to the pool; it's important to keep capacity and wait times in mind. As the full set of builds shows, it's easy to lose track of just how much your build pool is responsible for.

Still, I think it's clear that this is a big Win for pooling, as the number of active branches and builds at Mozilla is as high as I've seen anywhere.


The tyranny of the single config

It's very, very powerful to have a single configuration that works for all builds across all branches. However, this is also a very strict limitation.

In the tinderbox model, a change could be made to a single machine without affecting the stability of any other builds or branches. Once that one build goes green, you're golden.

In the pooling model, the change needs to propagate across the entire pool, and it affects all builds across all branches. As the number of branches and build types grow, the testing matrix for config changes grows as well.

And at some point, new, incompatible requirements rear their ugly heads -- maybe an incompatible new toolchain that can't coexist with the previous one, or a whole new platform. At that point, you need to create a new pool. And ramping that up from zero can be a time-consuming process.


I hope the above helps illustrate the pooling model and some of its benefits and drawbacks.

We don't just have a single build pool here, however; we have multiple, and the number is growing. This was partially by design, and partially to deal with growing pains as we scale larger and larger.

I'll illustrate where we are today in the next segment: split pools.


(Continuing the blogging blitz: here is pooling, part 2.
This illustrates how the Tinderbox model can quickly become a headache to maintain on multiple branches, and what problems the pooling model is trying to solve.)


Since each column is its own build machine, if trunk has 12 columns (and you want to have the same coverage on the new branch), you need to spin up 12 new tinderbox machines with similar configurations for the new branch.

Let's reexamine the benefits and drawbacks of the Tinderbox model, with multiple branches in mind:

[i] Anyone can spin up a new builder.

If anyone wanted to start working on a new project, platform, or branch, they could run their own tinderbox and send the results to the tinderbox server on their own schedule. This meant that developers could have the coverage they wanted, and community members could add ports for the platforms they cared about.

After these ran for a while, they were often "donated" to the Release team to maintain.

This worked fairly well, but donated tinderboxen often came undocumented, resulting in maintenance headaches down the road. Many, many machines were labeled "Don't Touch!" because no one knew if any changes would break anything, and no one knew how to rebuild them if anything catastrophic happened.

[ii] It's relatively simple to make changes to a single build.

If a particular branch needs a different toolchain or setting, it's not difficult to set up that branch's build machine accordingly. In fact, when we wanted to, say, change compilers on a single branch, we usually spun up a new build machine with the new compiler and ran it in parallel with the old one until it was reliably green.

However, these per-machine inconsistencies also made it difficult to determine why changes worked on one branch but not another. Was it the new compiler? Or a hidden environment variable? Were the patch/service pack levels the same? Did it matter that one tinderbox was running win2k when the other ran NT?

[iii] Consistent wait times [for a single build].

No matter how many checkins happen on any (or all) branches, wait times stay consistent.

On the other hand, if a flurry of checkins happens on trunk while the branches lie idle, all of those changes are picked up by the trunk builders. Meanwhile, the branch builders keep rebuilding the latest revision of their idle branches, or sit idle altogether.

The drawbacks stay the same, although amplified with each additional machine and build type to administer and maintain.

I wasn't at Mozilla at the time, but as I understand it, a little more than two years ago the tree would regularly be held closed whenever a single build machine went down -- unscheduled downtimes on a fairly consistent basis, in addition to the tree closures required to figure out who broke the build.

These were among the reasons for the move to Buildbot pooling, which I'll cover in part 3.


(Continuing the blogging blitz: here is pooling, part 1.
This illustrates how builds were set up at one point in Mozilla and Netscape's past, mainly to contrast with how they're set up currently.)



There were many variations* of the old Tinderbox model of continuous integration. The basic concept involved a single machine running a single type of build (e.g., Win32 Opt Depend) on a 24/7 basis; when the previous build finished, the next build would start.

When we needed more build types (e.g. adding MacOSX coverage), we added more machines, one for each new build type. They would each be represented by their own column on the Tinderbox page, color coded green for success, red for failure, and orange for test failures.

There are inherent benefits to such a model:

[i] Anyone can spin up a new builder.

This is partially due to the delivery of logs via mail (and later, in Tinderbox 2, via ftp), but also because each machine and tinderbox client is standalone. Anyone with a spare machine can spin up a Tinderbox builder.

[ii] It's relatively simple to make changes to a single build.

Need a new compiler? A different SDK? A whole new toolchain? Track down the machine running that build and make those changes, and you're done.

(You documented that, right?)

[iii] Consistent wait times [for a single build].

The maximum wait time for one build type to pick up your change is a little less than one full build cycle (if you happen to check in immediately after a build cycle starts, you need to wait for the next cycle). If a full build takes one hour, the longest end-to-end time is a little less than 2 hours. This is true whether one person checked in or five hundred people checked in.

(Later, people started running two of the same build and staggering them so that the longest wait time was a little less than 1/2 a full build cycle.)
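
As a quick worked example (times in minutes; the cycle length is hypothetical):

    # Worst-case end-to-end time in the single-builder tinderbox model.
    build_cycle = 60                                  # one full build cycle, in minutes

    # Check in just after a cycle starts: wait almost a full cycle for the next
    # cycle to begin, then a full cycle for it to finish.
    worst_single = (build_cycle - 1) + build_cycle    # ~119 min: "a little less than 2 hours"

    # Two identical builders staggered by half a cycle: the wait for a cycle to
    # pick up the change drops to a little less than half a cycle.
    worst_staggered = (build_cycle // 2 - 1) + build_cycle   # ~89 minutes end to end

    print(worst_single, worst_staggered)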


Any drawbacks?

[i] The tree has many single points of failure.

Most of these build machines are unique. If something happens to one machine, that column goes perma-red or drops from the waterfall. If it's measuring something critical (and most of them are), that means tree closure.

[ii] It's easy to lose track of build [script|machine] changes.

It is simple to make changes to the build toolchain, scripts, or environment on individual tinderboxen. Unfortunately, it's also simple to make those changes without properly documenting them or checking them in. It's only a matter of time before this becomes a problem.

Missing or faulty documentation might only be discovered after massive hardware failure, long after the people responsible for those changes have moved on. If you're unfortunate enough to not have a recent clone or full backup of that machine, you may be looking at a possible multi-day tree closure.

This also affects spinning up a new build machine or making changes to an existing one. If there are settings you're unaware of, troubleshooting the problem can eat up valuable time.

[iii] It's hard to track down who broke the tree.

Since each build cycle can pick up multiple checkins, it can be difficult to tell which checkin broke a particular build or test. This can become a protracted session of finger pointing, involving multiple developers and the reliability of the build machine(s) in question.

This was exacerbated by the old CVS problem of figuring out which build actually picked up your checkin. Also, the fact that each build machine (tinderbox column) has different-length cycles means that builds start at different times, and each build picks up different combinations of new checkins. Those can each break in new and exciting ways, for different reasons.


Don't get me wrong; I have a fondness for Tinderbox that it seems few people share. But I can be objective about its strengths and weaknesses, and one of its weaknesses is that it doesn't scale very well. At least not scale with a capital S. (And scaling is a major factor in our decisions today.)

I'll illustrate that a bit more in the next segment: the tinderbox model on multiple branches.


* (We did have depend tinderboxen that spit out clobber or release builds at certain times of day or when a certain file was touched. We also had machines that cycled through several different build types -- these exceptions tended to occur on side projects that had fewer developers or less hardware. But for the most part, it was a single machine for a single build type.)


As part of RelEng's Blogging Blitz, I'm going to write a bit about [build slave] pooling concepts, differentiating between the old Tinderbox model and the Buildbot pool-of-slaves model.

The topics covered will be:

  • The tinderbox model on a single branch.
  • The tinderbox model on multiple branches.
  • The pooling model on multiple branches.
  • Split pools.
  • Some new approaches.

[brainstorm]:

I keep running into the same questions, project after project, company after company. How do I see who broke the build? How do I know if this bug has been fixed in this codeline? How do I see the difference between these two builds? And how can we make this all happen faster? Smoother? Easier? And you revisit the same questions as new tangles and complexities of scale are added to the equation.

I joined Netscape in 2001, partially to play with a bunch of weird unices, partially to see How Build Engineering is Done Properly. I had already tackled LXR/ViewCVS + Bonsai + Tinderbox elsewhere, and that toolchain has loomed large at every place I've been, at least in concept. Source browsing + repository querying + continuous integration, no matter which specific products you use.

Here I am, back at Mozilla after a number of years away, and I'm amused (and pleased) to see that the current state of things is surprisingly analogous to how I designed our build system elsewhere. We had the db with history and archived configurations; buildbot has the daemon with real-time log views; but otherwise things are fairly close. Both systems are in similar stages of development, ready to take that next step.

Here are some thoughts about the direction that next step could go. Please keep in mind that I'm trying to ignore how much work this will all be to implement for the moment, so I can write down my ideas without cringing.


The rest of this post is behind cut tags; the collapsed sections were:

  • Is hgweb a strong bonsai replacement?
  • Tinderbox: waterfall vs dashboard
  • Buildbot: strengths and weaknesses
  • Pull, not push
  • Still a ways out
