escapewindow: escape window (Default)

My previous blog post, what is mozharness?, appears to have caused some dismay.
I'm writing this to answer Axel's invitation to respond, and to clear up some misconceptions about mozharness.


[assertions]

  1. Respect for Mozilla l10n.

    This conversation is larger than just localization. However, the above thread brings up l10n specifically, and the first two mozharness scripts involve l10n, so here are my high-level thoughts:

    I remain highly impressed with the state of l10n at Mozilla: the sheer number of locales, the contributions of volunteer localizers, the sim-shipping of localized releases with en-US releases. Never have I seen localization done anywhere near so efficiently and well, anywhere in software. I believe strongly that this is key to Firefox's success. And Axel is a significant part of that.

    I also believe that however good our localization story may be, there's definitely room for improvement.
    We seem to disagree as to how to improve that story for the moment.


  2. Mozharness is imperfect software with time-tested concepts.

    I will be the first to admit that mozharness is imperfect. I'm self-taught. Python is a relatively new language to me. I wrote mozharness to solve complex problems with tight deadlines. Mozharness, as it exists today, is essentially non-feature-complete beta software.

    (With that in mind, I will be speaking about what mozharness could be, in forward-thinking statements. Saying "Nothing prevents us from doing ____" does not mean it'll take zero development/testing/roll out time. I am speaking about technical feasibility only.)

    However, the concepts behind mozharness are lessons I've learned over the years. Usually the hard way.
    I'm open to changing mozharness' specific implementation details, but I strongly believe the concepts themselves are right.
    It falls on my shoulders to communicate those clearly.


  3. Buildbot was not written to micromanage slaves.

    The above statement sums up my first conversation with Brian Warner, where we vehemently agreed that too much complex logic had been relegated to Mozilla's buildbot masters.

    Even if one ignores his original intent, I assert that moving complex logic from an overloaded master to its relatively unloaded slaves will be

    1. more efficient
    2. more scalable
    3. more portable

    I'll revisit this, and the other assertions, in the next three sections.


[fallacy #1: mozharness will make {builds,tests,repacks} less granular]

This can be split into two concerns: granularity of status, and granularity of easy-to-replicate steps. This is going to be a long section; I still haven't fully convinced everyone on my own team about these points.

  • Mozharness can sum up status at the end of jobs.

    I don't want to spend this entire post talking about abstractions and what mozharness could be if this nebulous vision in my head somehow sees the light of day. So here's a concrete example that exists today.

    signdebs.py outputs a summary at the end of every log:

    19:29:25 INFO - #####
    19:29:25 INFO - ##### MaemoDebSigner summary:
    19:29:25 INFO - #####
    19:29:25 ERROR - Can't download http://stage.mozilla.org/pub/mozilla.org/mobile/nightly/latest-mozilla-central-maemo5-gtk-l10n/he/fennec_4.0~b3~20101117005726_armel.deb; skipping he on fremantle
    19:29:25 ERROR - Can't download http://stage.mozilla.org/pub/mozilla.org/mobile/nightly/latest-mozilla-central-maemo5-gtk-l10n/ja/fennec_4.0~b3~20101117005726_armel.deb; skipping ja on fremantle
    19:29:25 ERROR - Can't download http://stage.mozilla.org/pub/mozilla.org/mobile/nightly/latest-mozilla-central-maemo5-gtk-l10n/ja-JP-mac/fennec_4.0~b3~20101117005726_armel.deb; skipping ja-JP-mac on fremantle
    19:29:25 ERROR - Can't download http://stage.mozilla.org/pub/mozilla.org/mobile/nightly/latest-mozilla-central-maemo5-qt-l10n/he/fennec_4.0~b3~20101117010937_armel.deb; skipping he on fremantle-qt
    19:29:25 ERROR - Can't download http://stage.mozilla.org/pub/mozilla.org/mobile/nightly/latest-mozilla-central-maemo5-qt-l10n/ja/fennec_4.0~b3~20101117010937_armel.deb; skipping ja on fremantle-qt
    19:29:25 ERROR - Can't download http://stage.mozilla.org/pub/mozilla.org/mobile/nightly/latest-mozilla-central-maemo5-qt-l10n/ja-JP-mac/fennec_4.0~b3~20101117010937_armel.deb; skipping ja-JP-mac on fremantle-qt
    19:29:25 INFO - Uploaded multi on fremantle successfully.
    19:29:25 INFO - Uploaded en-US on fremantle successfully.
    19:29:25 INFO - Uploaded ar on fremantle successfully.
    19:29:25 INFO - Uploaded be on fremantle successfully.
    19:29:25 INFO - Uploaded ca on fremantle successfully.
    19:29:25 INFO - Uploaded cs on fremantle successfully.

    ... snip 76 more lines of "Uploaded ___ on ___ successfully."

    I'd love for this to be prettier and take fewer lines, like "82/88 deb repos uploaded successfully!" with more verbose information about the ones that failed. But again, deadlines, and pretty statuses were not a hard requirement.

    • What's keeping us from emailing this summary somewhere? Nothing.
    • What's keeping us from updating a database or hitting a cgi with this status? Nothing.
    • What's keeping us from sending out a Pulse message with this info? Probably the fact that I know very very little about how to send a Pulse message. Other than that, probably nothing.

    We'll want to make those updates conditional, if "notify-pulse" in self.actions:, for example, so staging or development runs don't attempt to send production status messages. But if it can be done in a script, it can be done at the end of any mozharness script.
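Here's a minimal sketch of what an end-of-job summary with conditional notifications could look like. Every name in it (summarize, notify, the notifiers mapping, the "notify-email" action) is hypothetical and not taken from mozharness; only the "notify-pulse" action name and the idea of checking it against the action list come from the text above.

```python
# Hypothetical sketch: none of these functions exist in mozharness.

def summarize(results):
    """Collapse per-locale results into one headline plus failure details.

    `results` maps (locale, platform) -> error string, or None on success.
    """
    failures = {key: err for key, err in results.items() if err is not None}
    succeeded = len(results) - len(failures)
    lines = ["%d/%d deb repos uploaded successfully" % (succeeded, len(results))]
    for (locale, platform), err in sorted(failures.items()):
        lines.append("FAILED %s on %s: %s" % (locale, platform, err))
    return lines


def notify(summary_lines, actions, notifiers):
    """Call only the notifiers whose action name was requested for this run."""
    sent = []
    for action, send in notifiers.items():
        if action in actions:
            send(summary_lines)
            sent.append(action)
    return sent


if __name__ == "__main__":
    results = {
        ("en-US", "fremantle"): None,
        ("ar", "fremantle"): None,
        ("he", "fremantle"): "Can't download deb",
    }
    # A staging run simply leaves "notify-pulse" out of its action list.
    notify(summarize(results),
           actions=["upload", "notify-email"],
           notifiers={"notify-email": print, "notify-pulse": print})
```

Because each notifier is just a callable keyed by an action name, adding email, a cgi hit, or Pulse is a matter of registering another function; development runs stay quiet by not requesting that action.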


  • Mozharness can update status during jobs.

    What's keeping us from updating status in the middle of a for locale in locales: for loop, for example? NOTHING.

    Though it is more expensive. If you want to hit a cgi, for example, mid-for-loop, then that cgi needs to be up and available the entirety of every script run (or the script needs to fail gracefully, or have retry logic, or queueing logic, or whatever.) You need to decide when the cost of faster status updates is worth the effort, since post-processing tends to be less expensive.

    There is nothing preventing us from doing so, however, if the need outweighs the cost of updating status throughout the script.
    If it can be done in a script, it can be done inside of any mozharness for loop.
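To make the cost trade-off concrete, here's a rough sketch of a mid-loop status update that retries a few times and then fails gracefully instead of killing the job. The function names, the payload shape, and the retry count are all assumptions for illustration; `send` stands in for whatever would actually hit the cgi.

```python
import time


def post_status(send, payload, attempts=3, delay=0.0):
    """Try to send one status update; give up quietly after `attempts` tries.

    `send` is any callable that raises on failure (an HTTP POST, say).
    Returns True if the update went out, False otherwise.
    """
    for attempt in range(attempts):
        try:
            send(payload)
            return True
        except Exception:
            if attempt < attempts - 1:
                time.sleep(delay)
    return False


def repack_all(locales, repack, send):
    """Repack every locale, reporting status after each one.

    A failed status update never aborts the job; the missed updates are
    returned so they can be re-sent in post-processing.
    """
    unsent = []
    for locale in locales:
        ok = repack(locale)
        payload = {"locale": locale, "ok": ok}
        if not post_status(send, payload):
            unsent.append(payload)
    return unsent
```

Returning the unsent updates keeps the cheaper post-processing path available as a fallback when the status service is down mid-run.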


  • Mozharness parses its own logs during runCommand() calls already.

    The runCommand() method uses pre-defined error lists to determine whether specific lines in the log are errors.

    These are far from complete, and need to be fleshed out further, but the framework is there.

    Recently, I thought we should add a summary (or somesuch) key to that list of dictionaries. We could, say, add a substr of "Assertion failure: !(addr & GC_CELL_MASK)" with a custom level of intermittent_orange (somewhere between info and warning?) and a summary of "Intermittent orange: This looks like bug 583554!"

    As long as we're in conjecture-land, we can combine this with a post- or mid-job status update that can populate an intermittent orange database with the specific details of this job.

    Awesome or not awesome? You vote.
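A sketch of how an error list with a summary key might be applied to log lines. The keys "substr", "level" and "summary" follow the description above, but the exact structure mozharness uses may differ, and classify and scan are hypothetical helpers, not mozharness methods. The second entry is invented purely to show a plain error.

```python
# Hypothetical error list, modelled on the description in the text.
ERROR_LIST = [
    {"substr": "Assertion failure: !(addr & GC_CELL_MASK)",
     "level": "intermittent_orange",
     "summary": "Intermittent orange: This looks like bug 583554!"},
    {"substr": "command not found", "level": "error"},
]


def classify(line, error_list=ERROR_LIST, default_level="info"):
    """Return (level, summary) for the first entry whose substr is in line."""
    for entry in error_list:
        if entry["substr"] in line:
            return entry["level"], entry.get("summary")
    return default_level, None


def scan(lines, error_list=ERROR_LIST):
    """Tag every line with a level and collect de-duplicated summaries."""
    tagged, summaries = [], []
    for line in lines:
        level, summary = classify(line, error_list)
        tagged.append((level, line))
        if summary and summary not in summaries:
            summaries.append(summary)
    return tagged, summaries
```

The collected summaries are what a post- or mid-job status update could push into an intermittent orange database.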


  • Mozharness can create buildbot-parseable status.

    This point's reverse could be a top level fallacy in itself.

    There are two approaches here.

    If it's a hard requirement that we lose none of the "granularity" of the existing buildbot steps, nothing prevents us from creating an action list that encapsulates each buildbot action.

    Then you can script.py --only-set-props-builddir, for example. A buildbot addStep(['python', 'script.py', '--only-%s' % stepName]) for every single thrilling step in the one-hundred-one steps here.

    Can you? Yes. Do you want to? I would argue no.

    Do developers really care if buildbot step #73 dies with python exception ____? Or do they only really care if compilation fails on file X at line Y (link to hg annotate with appropriate finger pointing here)?

    (What if we tied in a ping in #developers or an email message for the suspected culprit in the notify action? Not free, in terms of mozharness development time, but is it doable in scripts? Seems like it, to me.)

    Do localizers really care if buildbot step #45 dies with compare-locales exception ___? Or do they just want a description of which strings are missing, or which XML files need updating, with a link to a wiki page on how to fix those things?

    The second approach could involve writing a buildbot-property parseable file during or at the end of the mozharness script, and adding a buildbot addStep(SetProperty("cat filename")) afterwards to set buildbot-statusdb-parseable buildbot properties.
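The second approach might look something like this on the script side. The "name: value" file format and both function names are assumptions made for this sketch. The (rc, stdout, stderr) signature of extract_properties is shaped like the extraction callback that buildbot's SetProperty step can take, but check that against the buildbot version in use before relying on it.

```python
import json


def format_properties(props):
    """Render one 'name: json-value' line per property, sorted by name."""
    return "".join("%s: %s\n" % (name, json.dumps(props[name]))
                   for name in sorted(props))


def extract_properties(rc, stdout, stderr):
    """Turn the output of `cat filename` back into a dict of properties.

    A non-zero return code yields no properties at all.
    """
    props = {}
    if rc != 0:
        return props
    for line in stdout.splitlines():
        name, sep, value = line.partition(": ")
        if sep:
            props[name] = json.loads(value)
    return props
```

The script would write format_properties(...) to a file at the end of the run (or rewrite it as it goes), and the master only has to cat and parse it.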


  • Mozharness could create a step-summary log.

    As I explained in my previous blog post, I lean very heavily towards more verbosity than less, though you could --log-level error (or w/e) to ignore most of this verbosity.

    In a --multi-log run, I could add a step-level log. Basically, any calls to the BaseScript wrapper methods could also write their equivalent to a log.

    (Huh??? English, do you speak it?)

    If we cared enough about this, we could create yet another log file. The BaseScript.chdir() method could output cd DIRNAME to this log. And the BaseScript.runCommand() method could write its command line to this log file, and so on. This log file would approach being an executable shell script.

    I say approach since there aren't always going to be universal scriptable equivalents. But this would be an improvement over the status quo.
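Roughly, the step-level log could work like the sketch below. StepLog and Script are hypothetical stand-ins, not the real BaseScript; the method names chdir and run_command, and the dry_run flag, are choices made for this example only.

```python
import os
import shlex
import subprocess


class StepLog(object):
    """Collects shell-equivalent lines for each wrapped call."""

    def __init__(self):
        self.lines = ["#!/bin/sh"]

    def add(self, line):
        self.lines.append(line)

    def text(self):
        return "\n".join(self.lines) + "\n"


class Script(object):
    """Hypothetical stand-in for the BaseScript wrapper methods."""

    def __init__(self, step_log, dry_run=False):
        self.step_log = step_log
        self.dry_run = dry_run

    def chdir(self, dirname):
        # Record the shell equivalent first, then do the real work.
        self.step_log.add("cd %s" % shlex.quote(dirname))
        if not self.dry_run:
            os.chdir(dirname)

    def run_command(self, command):
        # `command` is an argument list; quote each piece for the shell.
        self.step_log.add(" ".join(shlex.quote(arg) for arg in command))
        if self.dry_run:
            return 0
        return subprocess.call(command)
```

Anything without a clean shell equivalent (in-process file parsing, say) would have to be logged as a comment, which is why the result only approaches a runnable script.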


  • High-level granularity is not always desirable.

    So you're a developer or a localizer. You've painstakingly set up scratchbox to the point where it works (congratulations!!). You now want to attempt a Maemo multilocale build without checking into the tree and either requesting or waiting for a nightly. What steps do you follow?

    You could drill down into the 101 buildbot steps, each with python-list-format command lines, and full env dumps per step. Granular, no? Run each one, with appropriate env settings, and then it should work! Maybe! Have fun!

    Or you could take a [yet to be written, but next in line] mozharness script, a config file that's oriented towards standalone developers (you may have to edit paths), and an example command line, and run it. It should hopefully either give you a descriptive error message (and ultra-verbose logs), or a usable multilocale deb file.

    Which would you prefer? (Don't let me influence your decision.)


[fallacy #2: a script means complex undecipherable command lines]

(Phew! That last section was a doozy, wasn't it? This section will be shorter, I promise!!)

This section also has a couple of components, though they're much shorter.

  • First, Ben's work in bug 608004 is not mozharness.

    Is it a step in that direction? Sure, it's moving logic out of buildbot towards slave-side python scripts. Does that mean it's mozharness? No. Are we considering merging tools/lib/python code into mozharness? Yes. But it's not reality yet.

    I haven't looked at his patch; I've been bogged down trying to fix the Android Tegras. And writing this lengthy blog post. But it is not mozharness.

    Could it be ported to mozharness? In my eyes, yes, easily, without even looking at the code. I've thought about writing this script in mozharness myself in my "spare time". But again, I haven't looked at the patch.

    Could it be reliant on long command line arguments? Sure, that's probably the case outside of mozharness. But that's not what I'm discussing in this blog post.

    The only two scripts in mozharness, as of this writing, are for Maemo deb signing and Android multilocale.

  • Second, in mozharness, practically every option that's specifiable via commandline is specifiable in a config file.

    What does this mean? You certainly could specify a massive command line that challenges ARG_MAX every single time. Or, if you find yourself doing this often, you could save all of that in a config file (JSON only, currently, but we can add .py or other filetype support relatively easily) and just run path/to/script.py --config-file path/to/config/file.

    In fact, right now any mozharness script will look, by default, for ./localconfig.json and use that if no --config-file is specified. I'm debating whether this might actually be harmful in production systems, but can you really get much simpler than path/to/script.py? Or python path/to/script.py if your #! support is broken.
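As a sketch of that lookup: read --config-file if given, otherwise fall back to ./localconfig.json if it exists, then layer command-line options on top. The --locale and --work-dir options are made up for illustration, and the precedence shown (command line beats config file) is an assumption of this sketch rather than documented mozharness behaviour.

```python
import argparse
import json
import os

DEFAULT_CONFIG = "localconfig.json"


def parse_args(argv):
    # --locale and --work-dir are made-up options for illustration.
    parser = argparse.ArgumentParser()
    parser.add_argument("--config-file")
    parser.add_argument("--locale", action="append", dest="locales")
    parser.add_argument("--work-dir")
    return vars(parser.parse_args(argv))


def merge(file_config, cli_options):
    """Start from the file's values; let explicit command-line values win."""
    config = dict(file_config)
    for key, value in cli_options.items():
        if key != "config_file" and value is not None:
            config[key] = value
    return config


def load_config(argv, cwd="."):
    """Build the effective config from an optional file plus the command line."""
    options = parse_args(argv)
    path = options["config_file"]
    if path is None:
        candidate = os.path.join(cwd, DEFAULT_CONFIG)
        path = candidate if os.path.exists(candidate) else None
    file_config = {}
    if path is not None:
        with open(path) as fh:
            file_config = json.load(fh)
    return merge(file_config, options)
```

With a suitable localconfig.json in place, load_config([]) is all a bare path/to/script.py would need.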


[fallacy #3: mozharness will replace all of buildbot]

Is mozharness powerful enough to replace all of the complex buildbotcustom.process.factory logic in buildbot? Absolutely.

Is that my short term goal? No. I have real bugs to fix. Blockers for shipping real product. Replacing working code with a rewrite-without-urgent-need is at the bottom of my todo list.

The reasons I wrote however much of mozharness I did include:

  • RelEng has been considering ways to move build logic to slave-side scripts, and this is my proof-of-concept
  • I've been trying to solve real problems with real deadlines. Like MultiNightlyL10n being the only real blocker to moving the mobile build infrastructure from the crufty buildbot-0.7 branch to the supported and shiny default 0.8.x branch.
  • I do secretly hope that the community will buy into this, to the point where I can afford to spend the time to do this. Because, if I haven't made it clear by this point, I believe in this. But if there's no immediate goal or community buy-in, that's a huge task to tackle.

I mean, the barrier to entry for buildbot is... high. First, install buildbot! Then, navigate our buildbot-configs and buildbotcustom repos (easy!), set up your master, then set up a slave that points to the master, then somehow use the appropriate one of the six buildbot methods to trigger a build/test/repack that you want, and debug from there.

Or, check out this repo, potentially modify this config file that's tailored to your use case, and run this script. You'll either get a [hopefully] useful error message, or your <select ... > <option ... >(while i'm promising the moon, select one)</option> <option ... >multilocale build</option> <option ... >l10n repack</option> <option ... >standalone talos performance results</option> <option ... >orange intermittent test results</option> <option ... >pgo build</option> <option ... >WHATEVER A SCRIPT CAN DO</option> <option ... >THAT WE FEEL IS WORTH THE TIME TO WRITE</option> <option ... ></option> </select> with verbose logs and python source to tweak if you want to delve into this shit.

But am I going to volunteer to port all of that stuff if people aren't into it? Fuck no. I will argue this to the ground, evidently, because I've thought about this stuff for years and years. This blog post may end up being my own personal ten fucking days rubicon, with its forward-thinking year's worth of "we could do this!" statements. But I still feel like I haven't touched on everything I've thought about over the years. And I'm defending my not-yet-fully-formed mozharness against allegations that it's going to be harmful for some reason.

Even if we did port all of factory.py to mozharness, buildbot's ability to queue and manage multiple build slave pools is a level above and beyond mozharness.

......... If this section seems less coherent and well-thought out compared to the previous sections,

  • tired
  • late
  • bourbon

Ok. It's late. I'm getting punchy. My profanity-to-signal ratio is rising sharply. I've written a diatribe that is probably the longest post on a script harness EVAR. And I have a nine-fucking-o-clock meeting I have to be up for. And coherent.

I'm stopping this post right meow.

[EDIT]: (let's just pretend dreamwidth/eljay didn't munge my select/option tags, shall we?)


(Continuing the blogging blitz: here is pooling, part 3.)

The build pool consists of a number of identical build machines that can handle all builds of a certain category, across branches.

Builds on checkin

Pooling lends itself to building on checkin: each checkin triggers a full set of builds.

This gives much more granular information about each checkin: does it build on every platform? Does it pass all tests? This saves many hours of "who broke the build" digging. As a result, the tree should stay open longer.

The tradeoff is wait times. During peak traffic, checkins can trigger more build requests than there are available build slaves. As builds begin to queue, new build requests sit idle for longer and longer before build slaves are available to handle those requests.

You can combat wait times via queue collapsing: Once builds queue, the master can combine multiple checkins in the next build. However, this negatively affects granular per-checkin information.
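A toy model of queue collapsing, to show where the per-checkin information goes. This is not how Buildbot implements it; next_build and the pending queue are hypothetical names for this sketch.

```python
from collections import deque


def next_build(pending, collapse=True):
    """Decide which checkins the next free build slave will cover.

    `pending` is a deque of checkin ids, oldest first. With collapsing on,
    one build absorbs everything that is waiting; with it off, the build
    covers only the oldest checkin and the rest keep waiting.
    """
    if not pending:
        return []
    if collapse:
        checkins = list(pending)
        pending.clear()
        return checkins
    return [pending.popleft()]
```

If the collapsed build of ["a", "b", "c"] goes red, you only know that one of the three checkins broke it, which is exactly the granularity you gave up to drain the queue.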

Another solution to wait times is adding more build slaves to the pool.


Dynamic allocation

As long as there are available build slaves, the pool is dynamically allocated to where it's needed. If one branch is especially busy, more build slaves can be temporarily allocated to that branch. Or if the debug build takes twice as long, more build slaves can be allocated to keep it from falling behind.

(At Mozilla, this happens within Buildbot and requires no manual intervention beyond the initial configuration.)

This is in direct contrast to the tinderbox model, where busier branches or longer builds would always mean more changes per build.

Dynamic allocation adds a certain amount of fault tolerance. In the tinderbox model, a single machine going down could cause tree closure. In the pooling model, a number of build machines in the pool could fall over, and the builds would continue at a slower rate.

The main drawback to dynamic allocation is that an extremely long build or an overly busy branch can starve the other builds/branches of available build machines.


Self-testing process

In the tinderbox model, one of the weaknesses was machine setup documentation. This can be assuaged with strict change management and VM cloning, but there's no real ongoing test to verify that the documentation is up to date.

Since pooled slaves jump from build to build and from branch to branch, it's easier to detect whether breakage is build slave- or code/branch- specific. This isn't perfect, especially with heisenbugs, but it's definitely an improvement.

In addition, every time you set up a new build slave, that tests the documentation and process. This happens much, much more often than spinning up new tinderboxes in the tinderbox model.


Spinning up a new branch or build

Since the pool of slaves can handle any existing branch or build, it's relatively easy to spin up a new, compatible branch or build type. It's even possible to do so by merely updating the master config files, with none of the "spin up N new tinderbox machines" work.

However, new branches and build types do add more load to the pool; it's important to keep capacity and wait times in mind. As the full set of builds shows, it's easy to lose track of just how much your build pool is responsible for.

Still, I think it's clear that this is a big Win for pooling, as the number of active branches and builds at Mozilla is as high as I've seen anywhere.


The tyranny of the single config

It's very, very powerful to have a single configuration that works for all builds across all branches. However, this is also a very strict limitation.

In the tinderbox model, a change could be made to a single machine without affecting the stability of any other builds or branches. Once that one build goes green, you're golden.

In the pooling model, the change needs to propagate across the entire pool, and it affects all builds across all branches. As the number of branches and build types grows, the testing matrix for config changes grows as well.

And at some point, new, incompatible requirements rear their ugly heads -- maybe an incompatible new toolchain that can't coexist with the previous one, or a whole new platform. At that point, you need to create a new pool. And ramping that up from zero can be a time-consuming process.


I hope the above helps illustrate the pooling model and some of its benefits and drawbacks.

We don't just have a single build pool here, however; we have multiple, and the number is growing. This was partially by design, and partially to deal with growing pains as we scale larger and larger.

I'll illustrate where we are today in the next segment: split pools.


As part of RelEng's Blogging Blitz, I'm going to write a bit about [build slave] pooling concepts, differentiating between the old Tinderbox model and the Buildbot pool-of-slaves model.

The topics covered will be:

  • The tinderbox model on a single branch.
  • The tinderbox model on multiple branches.
  • The pooling model on multiple branches.
  • Split pools.
  • Some new approaches.

[brainstorm]:

I keep running into the same questions, project after project, company after company. How do I see who broke the build? How do I know if this bug has been fixed in this codeline? How do I see the difference between these two builds? And how can we make this all happen faster? Smoother? Easier? And you revisit the same questions as new tangles and complexities of scale are added to the equation.

I joined Netscape in 2001, partially to play with a bunch of weird unices, partially to see How Build Engineering is Done Properly. I had already tackled LXR/ViewCVS + Bonsai + Tinderbox elsewhere, and that toolchain has loomed large at every place I've been, at least in concept. Source browsing + repository querying + continuous integration, no matter which specific products you use.

Here I am, back at Mozilla after a number of years away, and I'm amused (and pleased) to see that the current state of things is surprisingly analogous to how I designed our build system elsewhere. We had the db with history and archived configurations; buildbot has the daemon with real-time log views; but otherwise things are fairly close. Both systems are in similar stages of development, ready to take that next step.

Here are some thoughts about the direction that next step could go. Please keep in mind that I'm trying to ignore how much work this will all be to implement for the moment, so I can write down my ideas without cringing.


  • Is hgweb a strong bonsai replacement?
  • Tinderbox: waterfall vs dashboard
  • Buildbot: strengths and weaknesses
  • Pull, not push
  • Still a ways out
