aki ([personal profile] escapewindow) wrote 2013-11-27 06:37 pm

LWR (job scheduling) part ii: a high level overview

compute farm

Of all the ideas we've brainstormed, the one I'm most drawn to is this: our automation infrastructure shouldn't just be a build farm feeding into a test farm. It should be a compute farm, capable of running a superset of tasks including, but not restricted to, builds and tests.

Once we made that leap, it wasn't too hard to imagine the compute farm running its own maintenance tasks, or doing its own dependency scheduling. Or running any scriptable task we need it to.

This perspective also guides the schematics: generic scheduling, generic job running. This job only happens to be a Firefox desktop build, a Firefox mobile l10n repack, or a Firefox OS emulator test. This graph only happens to be the set of builds and tests that we want to spawn per checkin. But it's not limited to that.


dependency graph

Currently, when we detect a new checkin, we kick off new builds. When those successfully upload, they create new dependent jobs (tests), in a cascading waterfall scheduling method. This works, but it's hard to predict, and it doesn't lend itself to backfilling unscheduled jobs or to knowing when the entire set of builds and tests has finished.

Instead, if we create a graph of all builds and tests at the beginning, with dependencies marked, we get these nice properties (a rough code sketch follows the list):

  • Scheduling changes can be made, debugged, and verified without actually needing to hook them into a full running system; the changes will be visible in the new graph.
  • It becomes much easier to answer the question of what we expect to run, when, and where.
  • If we initially mark certain jobs in the graph as inactive, we can backfill those jobs very easily, by later marking them as active.
  • We are able to create jobs that run at the end of full sets of builds and tests, to run analyses or cleanup tasks. Or "smoketest" jobs that run before any other tests are run, to make sure what we're testing is worth testing further. Or "breakpoint" jobs that pause the graph before proceeding, until someone or something marks that job as finished.
  • If the graph is viewable and editable, it becomes much easier to toggle specific jobs on or off, or requeue a job with or without changes. Perhaps in a web app.
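
To make this concrete, here's a minimal sketch of what such an upfront graph might look like, in Python. The job names, states, and helper functions are hypothetical, for illustration only; this is not LWR's actual schema.

    # A minimal sketch of an upfront dependency graph. Job names and states
    # are hypothetical, not LWR's actual schema.

    GRAPH = {
        # Builds have no dependencies.
        "build-linux64-opt": {"depends_on": [], "state": "pending"},
        # A "smoketest" job gates the rest of the test jobs.
        "smoketest-linux64-opt": {
            "depends_on": ["build-linux64-opt"], "state": "pending"},
        "mochitest-1-linux64-opt": {
            "depends_on": ["smoketest-linux64-opt"], "state": "pending"},
        # Created inactive; flip to "pending" later to backfill.
        "mochitest-2-linux64-opt": {
            "depends_on": ["smoketest-linux64-opt"], "state": "inactive"},
        # A "breakpoint" job: the graph pauses here until someone or
        # something marks it finished.
        "signoff-breakpoint": {
            "depends_on": ["mochitest-1-linux64-opt"], "state": "breakpoint"},
        # An end-of-graph analysis/cleanup job.
        "graph-complete-analysis": {
            "depends_on": ["signoff-breakpoint"], "state": "pending"},
    }

    def runnable_jobs(graph, finished):
        """Active jobs whose dependencies have all finished."""
        return [
            name for name, job in graph.items()
            if job["state"] == "pending"
            and all(dep in finished for dep in job["depends_on"])
        ]

    def backfill(graph, name):
        """Activate a job that was created inactive (e.g. a coalesced test)."""
        if graph[name]["state"] == "inactive":
            graph[name]["state"] = "pending"

Because the whole graph exists before anything runs, "what do we expect to run, and has it all finished?" becomes a simple query over this structure rather than something reconstructed from a cascade of triggers.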

web app

The dependency graph could potentially be edited, either before it's submitted or as runtime changes to pending or re-queued jobs. Given a user-friendly web app that allows you to visualize the graph, and drill down into each job to modify it, we can make scheduling even more flexible. (A rough sketch of such a submission follows the list below.)

  • TryChooser could go from a checkin-comment-based set of flags to something viewable and editable before you submit the graph. Per-job toggles, certainly (just mochitest-3 on windows64 debug, please, but mochitest-2 through 4 on the other platforms).
  • If the repository + revision were settable fields in the web app, we could potentially get rid of the multi-headed Try repository altogether (point to a user repo and revision, and build from there).
  • Some project branches might not need per-checkin or nightly jobs at all, given a convenient way to trigger builds and tests against any revision at will.
  • Given the ability to specify where the job logic comes from (e.g., mozharness repo and revision), people working on the automation itself can test their changes before rolling them out, especially if there are ways to send the output of jobs (job status, artifact uploads, etc.) to an alternate location. This vastly reduces the need for a completely separate "staging" area that quickly falls out of date. Faster iteration on automation, faster turnaround.
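
As a rough illustration of the knobs such a web app could expose, here's a hypothetical graph-submission payload. The field names and URLs are invented; the point is that repo and revision, job-logic source, per-job toggles, and output targets are all just editable data.

    # A hypothetical graph-submission payload. Field names and URLs are
    # invented for illustration, not an actual LWR API.
    try_submission = {
        # Build from any visible repo + revision, not just the try repo.
        "source": {
            "repo": "https://hg.mozilla.org/users/example_mozilla.com/patches",
            "revision": "abcdef123456",
        },
        # Where the job logic comes from, so automation changes can be
        # tested before rollout.
        "job_logic": {
            "mozharness_repo": "https://hg.mozilla.org/build/mozharness",
            "mozharness_revision": "default",
        },
        # Per-job toggles, editable before submission: just mochitest-3 on
        # windows64 debug, mochitest-2 through 4 elsewhere.
        "jobs": {
            "mochitest-3-win64-debug": {"active": True},
            "mochitest-2-linux64-opt": {"active": True},
            "mochitest-3-linux64-opt": {"active": True},
            "mochitest-4-linux64-opt": {"active": True},
        },
        # Send status and artifacts somewhere other than production while
        # testing automation changes.
        "outputs": {
            "status_api": "https://status.example.com/api/v1/jobs",
            "upload_url": "https://uploads.example.com/staging/",
        },
    }

If the web app round-trips a structure like this, "edit the graph" is just "edit this data" before (or after) submission.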

community job status

One feature we lost with the Tinderbox EOL was the ability for any community member to contribute job status. We'd like to get it back. It's useful for people to be able to set up their own processes and have them show up in TBPL, or in other status queries and dashboards.

Given the scale we're targeting, it's not immediately clear that a community member's machine(s) would be able to make a dent in the pool. However, other configurations not supported by the compute farm would potentially have immense value: alternate toolchains. Alternate OSes. Alternate hardware, especially since the bulk of the compute farm will be virtual. Run your own build or test (or other job) and send your status to the appropriate API.
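
As a sketch of what "send your status to the appropriate API" could look like from a community member's machine, assuming a hypothetical REST status endpoint: the URL, fields, and token handling below are all invented.

    # A sketch of a community-run job reporting its result to a hypothetical
    # status API. Endpoint, fields, and auth are invented for illustration.
    import json
    import requests  # third-party HTTP library; urllib would also work

    STATUS_API = "https://status.example.com/api/v1/jobs"

    def post_status(job_name, revision, result, log_url, api_token):
        """Report a community-run job's result to the status API."""
        payload = {
            "job_name": job_name,   # e.g. "build-linux64-alt-toolchain"
            "revision": revision,   # the checkin this job ran against
            "result": result,       # "success", "testfailed", "busted", ...
            "log_url": log_url,     # wherever the contributor hosts their logs
        }
        response = requests.post(
            STATUS_API,
            data=json.dumps(payload),
            headers={
                "Content-Type": "application/json",
                "Authorization": "Bearer %s" % api_token,
            },
            timeout=30,
        )
        response.raise_for_status()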

As for LWR dependency graphs potentially triggering community-run machines: if we had jobs that are useful in aggregate, like a SETI@home-style communal job, or intermittent test runner/crasher jobs, those could be candidates. Or if we wanted to be able to trigger a community alternate-configuration job from the web app. Either a pull-not-push model, or a messaging model where community members can set up listeners, could work here.
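
If the messaging model won out, a community listener might look roughly like the sketch below. Everything here is hypothetical: the broker hostname, queue name, and message shape are invented, and pika is just one possible AMQP client.

    # A sketch of the "messaging model": a community member runs a listener
    # that picks up jobs flagged for an alternate configuration.
    import json
    import pika

    def run_local_job(job):
        # The community member's own build/test logic goes here; report the
        # result via a status API like the one sketched above.
        pass

    def on_message(channel, method, properties, body):
        job = json.loads(body)
        if job.get("config") == "linux64-alternate-toolchain":
            run_local_job(job)
        channel.basic_ack(delivery_tag=method.delivery_tag)

    connection = pika.BlockingConnection(
        pika.ConnectionParameters("pulse.example.com"))
    channel = connection.channel()
    # Assumes the queue already exists and is bound to whatever exchange the
    # scheduler publishes alternate-configuration jobs to.
    channel.queue_declare(queue="community.alternate-config", durable=True)
    channel.basic_consume(queue="community.alternate-config",
                          on_message_callback=on_message)
    channel.start_consuming()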

Since we're talking massive scale, if the jobs in question are runnable on the compute farm, perhaps the best route would be contributing scripts to run. Releng-as-a-Service.


Releng-as-a-Service

Release Engineering is a bottleneck. I think Ted once said that everyone needs something from RelEng; that's quite possibly true. We've been trying to reverse this trend by empowering others to write or modify their own mozharness scripts: the A-team, :sfink, :gaye, :graydon have all been doing so. More bandwidth. Less bottleneck.

We've already established that concentrating compute load on a small subset of servers doesn't work as well as moving it to the massively scalable compute farm. This video on leadership says the same thing, in terms of people: empowering the team makes for more brain power than bottlenecking the decisions and logic on one person. Similarly, empowering other teams to update their automation at their own pace will scale much better than funneling all of those tasks into a single team.

We could potentially move towards a BYOS (bring your own script) model, since other teams know their workflow, their builds, their tests, their processes better than RelEng ever could. :catlee's been using the term Releng-as-a-Service for a while now. I think it would scale.

I would want to allow for any arbitrary script to run on our compute farm (within the realm of operational, security, and fiscal sanity, of course). Comparing talos performance numbers, looking for regressions? Parsing logs for metrics? Trying to find patterns in support feedback? Have a whole new class of thing to automate? Run it on the compute farm. We'll help you get started. But first, we have to make it less expensive and complex to schedule arbitrary jobs.
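
As a sketch of what a "bring your own script" job definition might look like once scheduling arbitrary jobs is cheap: every field below is invented for illustration, and the entry-point script is not a real mozharness script.

    # A hypothetical "bring your own script" job definition. All field names
    # are invented; the entry-point script does not actually exist.
    byos_job = {
        "name": "talos-regression-analysis",
        "owner": "ateam@example.com",
        # The job logic lives in the owning team's repo, not RelEng's.
        "script": {
            "repo": "https://hg.mozilla.org/build/mozharness",
            "revision": "default",
            "entry_point": "scripts/talos_regression_analysis.py",  # made up
        },
        # Operational/security/fiscal sanity: declared limits the compute
        # farm can enforce.
        "requirements": {
            "platform": "linux64",
            "max_runtime_seconds": 3600,
            "network_access": ["ftp.mozilla.org"],
        },
        # Run after the nightly graph finishes, or on demand from the web app.
        "trigger": {"after_graph": "nightly-mozilla-central"},
        "outputs": {"status_api": "https://status.example.com/api/v1/jobs"},
    }

In that model, RelEng's role shifts toward reviewing and enforcing the requirements block, rather than writing and maintaining every script.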


This is largely what we talked about, on a high level, both during our team week and over the years. A lot of this seems very blue sky. But we're so much closer to this as a reality than we were when I was first brainstorming about replacing buildbot, 4-5 years ago. We need to work towards this, in phases, while also keeping on top of the rest of our tasks.

In part 1, I covered where we are currently, and what needs to change to scale up.
In part 3, I'm going to go into some hand-wavy LWR specifics, including what we can roll out in phase 1.
In part 4, I'm going to drill down into the dependency graph.
Then I'm going to start writing some code.

(Anonymous) 2013-11-28 07:41 pm (UTC)
Interesting challenge. Wondering if you have considered using existing job scheduling software, or maybe taking some concepts / implementation ideas from there? For example, there is an open source one here: http://en.wikipedia.org/wiki/JobScheduler

multiheads

(Anonymous) 2013-12-03 05:34 am (UTC)
I'm not sure if you can so easily eliminate the multiheaded try repo. Users want to be able to view the revisions that had jobs run against them. The current common case is pushing from a repo on a user's laptop or desktop, and deleting the revision immediately after pushing (via hg trychooser or mq popping.)

Are there uses for dynamic dependencies? As in, dependencies that are only known after a job is partly or completely finished? Pruning dependencies due to failures doesn't count (eg, failed builds shouldn't spawn useless test jobs, but that's relatively easy.)

I definitely like having the dependencies explicit. That would also enable filling in extra builds and their dependent test jobs at a later time, without rerunning everything. I think you said that already. The "backfilled" jobs you refer to could be used for smart coalescing. (Initially only spawn off the most useful tests, then fill in the intervening jobs via binary-ish search when a failure is detected and they are determined to be necessary.)

I do your last point in the "web app" section all the time. I'll juggle around a buildbot-configs patch that makes my slave pull from my local mozharness repo. (And I'll often screw up the juggling, resulting in some tedious debugging followed by literal headbanging.)

Having community builds feed into the system is good, but there are times when I also want output from the system. For example, with the hazard builds, the mozharness script uploads them to the ftp server, then my pulsebuildmonitor script sees the job has completed and pulls down the uploaded information for further processing. I don't know how common that is, but it might be useful to explicitly consider.

Re: multiheads

[identity profile] sfink.livejournal.com 2013-12-03 06:55 pm (UTC)
Yes, I see that user repos set up the way you're describing would work. I'm just saying that that's not how people do things right now, and it would require significant workflow changes to switch to. Specifically, I'm usually pushing from a repo that the builds can't see. Which means I need to push to a repo that builds *can* see, and right now that's either the try repo, or a user repo at somewhere like people.mozilla.org. It's extra work to set up the latter, and few people currently do it. So if you set up a generic place that people can push to just so it hangs around long enough for the builds, you're reinventing the try repo. (Well, since it doesn't need to be canonical anymore, you could have rolling monthly repos or team repos or whatever; there's no longer any need to have *everything* in one pile. But they'd still be provided to the developers, not maintained by devs as user repos would be.)

I won't argue for dynamic dependencies, I just wanted to check whether you knew of anywhere they're still needed or wanted.

For the community stuff, I was talking about outputs from "official" jobs to community subscribers. So no need for flagging. It's more about defining the schema with which job results are published to eg a pulse server. The current schema is a little ad hoc, and would probably need to be modified if jobs were defined more generally.

(And yes, this is sfink; my home DSL is alternating between being dead and having serious DNS issues, so I wasn't able to log in before.)