This can be split into two concerns: granularity of status, and granularity of easy-to-replicate steps. This is going to be a long section; I still haven't fully convinced everyone on my own team about these points.
Mozharness can sum up status at the end of jobs.
I don't want to spend this entire post talking about abstractions and what mozharness could be if this nebulous vision in my head somehow sees the light of day. So here's a concrete example that exists today.
signdebs.py outputs a summary at the end of every log:
19:29:25 INFO - #####
19:29:25 INFO - ##### MaemoDebSigner summary:
19:29:25 INFO - #####
19:29:25 ERROR - Can't download http://stage.mozilla.org/pub/mozilla.org/mobile/nightly/latest-mozilla-central-maemo5-gtk-l10n/he/fennec_4.0~b3~20101117005726_armel.deb; skipping he on fremantle
19:29:25 ERROR - Can't download http://stage.mozilla.org/pub/mozilla.org/mobile/nightly/latest-mozilla-central-maemo5-gtk-l10n/ja/fennec_4.0~b3~20101117005726_armel.deb; skipping ja on fremantle
19:29:25 ERROR - Can't download http://stage.mozilla.org/pub/mozilla.org/mobile/nightly/latest-mozilla-central-maemo5-gtk-l10n/ja-JP-mac/fennec_4.0~b3~20101117005726_armel.deb; skipping ja-JP-mac on fremantle
19:29:25 ERROR - Can't download http://stage.mozilla.org/pub/mozilla.org/mobile/nightly/latest-mozilla-central-maemo5-qt-l10n/he/fennec_4.0~b3~20101117010937_armel.deb; skipping he on fremantle-qt
19:29:25 ERROR - Can't download http://stage.mozilla.org/pub/mozilla.org/mobile/nightly/latest-mozilla-central-maemo5-qt-l10n/ja/fennec_4.0~b3~20101117010937_armel.deb; skipping ja on fremantle-qt
19:29:25 ERROR - Can't download http://stage.mozilla.org/pub/mozilla.org/mobile/nightly/latest-mozilla-central-maemo5-qt-l10n/ja-JP-mac/fennec_4.0~b3~20101117010937_armel.deb; skipping ja-JP-mac on fremantle-qt
19:29:25 INFO - Uploaded multi on fremantle successfully.
19:29:25 INFO - Uploaded en-US on fremantle successfully.
19:29:25 INFO - Uploaded ar on fremantle successfully.
19:29:25 INFO - Uploaded be on fremantle successfully.
19:29:25 INFO - Uploaded ca on fremantle successfully.
19:29:25 INFO - Uploaded cs on fremantle successfully.
... snip 76 more lines of "Uploaded ___ on ___ successfully."
I'd love for this to be prettier and take fewer lines, like "82/88 deb repos uploaded successfully!" with more verbose information about the ones that failed. But again, deadlines, and pretty statuses were not a hard requirement.
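A minimal sketch of what that prettier summary could look like. This is not what signdebs.py does today; UploadResult and summarize_uploads are hypothetical names made up for illustration.

```python
from collections import namedtuple

# Hypothetical record of one upload attempt; not a mozharness type.
UploadResult = namedtuple("UploadResult", ["locale", "platform", "ok", "detail"])


def summarize_uploads(results):
    """Return summary lines: one count line, then one line per failure."""
    failures = [r for r in results if not r.ok]
    succeeded = len(results) - len(failures)
    lines = ["%d/%d deb repos uploaded successfully" % (succeeded, len(results))]
    for r in failures:
        # Only the failures get verbose detail.
        lines.append("FAILED %s on %s: %s" % (r.locale, r.platform, r.detail))
    return lines
```

The successes collapse into a single count, so the lines that need a human's attention are the only ones left.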
- What's keeping us from emailing this summary somewhere? Nothing.
- What's keeping us from updating a database or hitting a cgi with this status? Nothing.
- What's keeping us from sending out a Pulse message with this info? Probably the fact that I know very very little about how to send a Pulse message. Other than that, probably nothing.
We'll want to make those updates conditional (if "notify-pulse" in self.actions:, for example) so staging or development runs don't attempt to send production status messages. But if it can be done in a script, it can be done at the end of any mozharness script.
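Here's a rough sketch of that conditional, assuming a script object that keeps its enabled actions in self.actions. SummaryNotifier and the senders mapping are hypothetical stand-ins, not mozharness classes, and "notify-email" is an invented action name alongside the "notify-pulse" one above.

```python
class SummaryNotifier(object):
    """Toy stand-in for a script object; not the mozharness BaseScript."""

    def __init__(self, actions, senders):
        self.actions = actions  # actions enabled for this run
        self.senders = senders  # maps action name -> callable(summary)

    def notify(self, summary):
        """Call only the senders whose action is enabled; return their names."""
        sent = []
        for action in sorted(self.senders):
            if action in self.actions:
                self.senders[action](summary)
                sent.append(action)
        return sent
```

A staging run that leaves "notify-pulse" out of its action list never touches the production sender.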
Mozharness can update status during jobs.
What's keeping us from updating status in the middle of a for locale in locales: for loop, for example? NOTHING.
Though it is more expensive. If you want to hit a cgi, for example, mid-for-loop, then that cgi needs to be up and available the entirety of every script run (or the script needs to fail gracefully, or have retry logic, or queueing logic, or whatever.) You need to decide when the cost of faster status updates is worth the effort, since post-processing tends to be less expensive.
There is nothing preventing us from doing so, however, if the need outweighs the cost of updating status throughout the script.
If it can be done in a script, it can be done inside of any mozharness for loop.
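To make the cost concrete, here's a sketch of a mid-loop update that retries and then fails gracefully instead of killing the job. post_status, process_locales, and the send and work callables are all hypothetical; a real script would plug in whatever actually hits the cgi.

```python
import time


def post_status(send, message, attempts=3, delay=0.0):
    """Try send(message) up to `attempts` times. Never raises.

    Returns True if the message was delivered, False otherwise.
    """
    for attempt in range(attempts):
        try:
            send(message)
            return True
        except Exception:
            # Broad on purpose: a status hiccup must not abort the job.
            if attempt + 1 < attempts:
                time.sleep(delay)
    return False


def process_locales(locales, work, send):
    """Run work(locale) for each locale, reporting status as we go.

    Returns the messages that could not be delivered, so they can be
    sent during post-processing instead.
    """
    undelivered = []
    for locale in locales:
        status = "ok" if work(locale) else "failed"
        message = "%s: %s" % (locale, status)
        if not post_status(send, message):
            undelivered.append(message)
    return undelivered
```

The undelivered list is the cheap fallback: anything that couldn't be reported mid-loop still gets reported at the end.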
Mozharness parses its own logs during runCommand() calls already.
The runCommand() method uses pre-defined error lists to determine whether specific lines in the log are errors.
These are far from complete, and need to be fleshed out further, but the framework is there.
Recently, I thought we should add a summary (or somesuch) key to that list of dictionaries. We could, say, add an entry for "Assertion failure: !(addr & GC_CELL_MASK)" with a custom intermittent_orange level (somewhere near warning?) and a summary of "Intermittent orange: This looks like bug 583554!"
As long as we're in conjecture-land, we can combine this with a post- or mid-job status update that can populate an intermittent orange database with the specific details of this job.
Awesome or not awesome? You vote.
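For the voters, a sketch of how a summary key might work. The "substr" and "level" key names and the scan_output function are my assumptions for illustration, not the actual mozharness error list format or parser; the "command not found" entry is likewise invented.

```python
# Hypothetical error list: a list of dictionaries, one per known pattern.
ERROR_LIST = [
    {
        "substr": "Assertion failure: !(addr & GC_CELL_MASK)",
        "level": "intermittent_orange",
        "summary": "Intermittent orange: This looks like bug 583554!",
    },
    {"substr": "command not found", "level": "error"},
]


def scan_output(lines, error_list):
    """Return a (level, line, summary) tuple for each line that matches."""
    hits = []
    for line in lines:
        for entry in error_list:
            if entry["substr"] in line:
                # summary is optional, so entries without one yield None.
                hits.append((entry["level"], line, entry.get("summary")))
                break  # first matching entry wins
    return hits
```

Every hit with a summary could then feed straight into the end-of-job status.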
Mozharness can create buildbot-parseable status.
This point's reverse could be a top level fallacy in itself.
There are two approaches here.
If it's a hard requirement that we lose none of the "granularity" of the existing buildbot steps, nothing prevents us from creating an action list that encapsulates each buildbot action.
Then you can run script.py --only-set-props-builddir, for example. On the buildbot side, that means an addStep(['python', 'script.py', '--only-%s' % stepName]) for every single thrilling step in the one-hundred-one steps here.
Can you? Yes. Do you want to? I would argue no.
Do developers really care if buildbot step #73 dies with python exception ____? Or do they only really care if compilation fails on file X at line Y (link to hg annotate with appropriate finger pointing here)?
(What if we tied in a ping in #developers or an email message for the suspected culprit in the notify action? Not free, in terms of mozharness development time, but is it doable in scripts? Seems like it, to me.)
Do localizers really care if buildbot step #45 dies with compare-locales exception ___? Or do they just want a description of which strings are missing, or which XML files need updating, with a link to a wiki page on how to fix those things?
The second approach could involve writing a buildbot-property parseable file during or at the end of the mozharness script, and adding a buildbot addStep(SetProperty("cat filename")) afterwards to set buildbot-statusdb-parseable buildbot properties.
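A sketch of the script side of that second approach. The "name: value" line format is an assumption I picked for illustration; whatever buildbot step reads the file would have to agree on it. format_properties, parse_properties, and the property names in the comments are hypothetical.

```python
def format_properties(props):
    """Render a dict of properties as sorted 'name: value' lines."""
    return "".join("%s: %s\n" % (name, props[name]) for name in sorted(props))


def parse_properties(text):
    """Parse 'name: value' lines back into a dict, skipping malformed lines."""
    props = {}
    for line in text.splitlines():
        name, sep, value = line.partition(": ")
        if sep:
            props[name] = value
    return props


# The script would write format_properties(...) to a file at the end of
# the run, e.g. {"uploaded": "82", "failed": "6"} (made-up names), and
# the buildbot side would cat and parse that file.
```

One flat file keeps the mozharness script ignorant of buildbot internals, and the same file is readable by a human or any other status consumer.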
Mozharness could create a step-summary log.
As I explained in my previous blog post, I lean very heavily towards more verbosity than less, though you could pass --log-level error (or w/e) to ignore most of this verbosity. In a --multi-log run, I could add a step-level log. Basically, any calls to the BaseScript wrapper methods could also write their equivalent to a log.
(Huh??? English, do you speak it?)
If we cared enough about this, we could create yet another log file. The BaseScript.chdir() method could output cd DIRNAME to this log. And the BaseScript.runCommand() method could write its command line to this log file, and so on. This log file would approach being an executable shell script.
I say approach since there aren't always going to be universal scriptable equivalents. But this would be an improvement over the status quo.
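Here's roughly what I mean, as a sketch. StepLog is a hypothetical recorder, not part of BaseScript: it only records the shell equivalent of each call and doesn't actually change directories or run anything.

```python
import shlex  # shlex.quote requires Python 3.3+


class StepLog(object):
    """Collects shell-equivalent lines for wrapper calls (hypothetical)."""

    def __init__(self):
        self.lines = []

    def chdir(self, dirname):
        # Record what a real chdir wrapper would have done.
        self.lines.append("cd %s" % shlex.quote(dirname))

    def run_command(self, command):
        # command is an argument list; quote each piece for the shell.
        self.lines.append(" ".join(shlex.quote(arg) for arg in command))

    def as_script(self):
        """Return the recorded steps as the text of a shell script."""
        return "\n".join(["#!/bin/sh"] + self.lines) + "\n"
```

Quoting each argument matters here: an unquoted path with a space in it is exactly the kind of thing that would keep the log from being replayable.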
High-level granularity is not always desirable.
So you're a developer or a localizer. You've painstakingly set up scratchbox to the point where it works (congratulations!!). You now want to attempt a Maemo multilocale build without checking into the tree and either requesting or waiting for a nightly. What steps do you follow?
You could drill down into the 101 buildbot steps, each with python-list-format command lines, and full env dumps per step. Granular, no? Run each one, with appropriate env settings, and then it should work! Maybe! Have fun!
Or you could take a [yet to be written, but next in line] mozharness script, a config file that's oriented towards standalone developers (you may have to edit paths), and an example command line, and run it. It should hopefully either give you a descriptive error message (and ultra-verbose logs), or a usable multilocale deb file.
Which would you prefer? (Don't let me influence your decision.)