
Today's my last day at Mozilla. It wasn't an easy decision to move on; this is the best team I've been a part of in my career. And working at a company with such idealistic principles and the capacity to make a difference has been a privilege.

Looking back at the past five-and-three-quarter years:

  • I wrote mozharness, a versatile scripting harness. I strongly believe in its three core concepts: versatile locking config; full logging; modularity.



  • I helped FirefoxOS (b2g) ship, and it's making waves in the industry. Internally, the release processes are well on the path to maturing and stabilizing, and b2g is now riding the trains.

    • Merge day: Releng took over ownership of merge day just as b2g increased its complexity exponentially (I don't think it's quite that bad :) ). I whittled it down from requiring someone's full mental capacity for three out of every six weeks to several days of precisely following directions.

    • I rewrote vcs-sync to be more maintainable and robust, and to support gecko-dev and gecko-projects. Being able to support both mercurial and git across many hundreds of repos has become a core part of our development and automation, primarily because of b2g. The best thing you can say about a mission critical piece of infrastructure like this is that you can sleep through the night or take off for the weekend without worrying if it'll break. Or go on vacation for 3 1/2 weeks, far from civilization, without feeling guilty or worried.


  • I helped ship three mobile 1.0's. I learned a ton, and I don't think I could have gotten through it by myself; John and the team helped me through this immensely.

    • On mobile, we went from one or two builds on a branch to full tier-1 support: builds and tests on checkin across all of our integration, release, and project branches. And mobile is riding the trains.

    • We sim-shipped Firefox 5.0 on desktop and mobile off the same changeset. Firefox 6.0b2, and every release since then, has been built with the same automation for desktop and mobile. Those were total team efforts.

    • I will be remembered for the mobile pedalboard. When we talked to other people in the industry, this was more on-device mobile test automation than they had ever seen or heard of; their solutions all revolved around manual QA.


      (full set)


    • And they multiplied like effin' bunnies; we later moved on to shoe rack bunnies, rackmounted bunnies, and now more and more emulator-driven bunnies in the cloud, each numbering in the hundreds or more. I've been hands-off here for quite a while; the team has really improved things by leaps and bounds over my crude initial attempts.


  • I brainstormed next-gen build infrastructure. I started blogging about this back in January 2009, based largely around my previous webapp+db design elsewhere, but I think my LWR posts in Dec 2013 had more of an impact. A lot of those ideas ended up in TaskCluster; mozharness scripts will contain the bulk of the client-side logic. We'll see how it all works when TaskCluster starts taking on a significant percentage of the current buildbot load :)

I will stay a Mozillian, and I'm looking forward to seeing where we can go from here!


[stating the problem]

Mozharness currently handles a lot of complexity. (It was designed to be able to, but the ideal is still elegantly simple scripts and configs.)

Our production-oriented scripts take (and sometimes expect) config inputs from multiple locations, some of them dynamic; and they contain infrastructure-oriented behavior, like clobberer, mock, and tooltool, that doesn't apply to standalone users.

We want mozharness to be able to handle the complexity of our infrastructure, but make it elegantly simple for the standalone user. These are currently conflicting goals, and automating jobs in infrastructure often wins out over making the scripts user friendly. We've brainstormed some ideas on how to fix this, but first, some more details:

[complex configs]

A lot of the current complexity involves config inputs from many places:

We want to lock the running config at the beginning of the script run, but we also don't want to have to clone a repo or make external calls to web resources during __init__(). Our current solution has been to populate runtime configs during one of our script actions, but then to support those runtime configs we have to check multiple config locations for our script logic. (self.buildbot_config, self.test_config, self.config, ...)

We're able to handle this complexity in mozharness, and we end up with a single config dict that we then dump to the log + to a json file on disk, which can then be reused to replicate that job's config. However, this has a negative effect on humans who need to change something in the running configs, or who want to simplify the config to work locally.
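
To make that concrete, here's a minimal sketch of the pattern, not mozharness' actual API: the function names are made up, and real mozharness keeps the config sources separate rather than merging them.

```python
import json
import logging

log = logging.getLogger(__name__)

def consolidate_config(base_config, buildbot_config=None, test_config=None):
    # Illustrative only: fold the runtime config sources into one dict,
    # later sources winning, so script logic only has one place to look.
    merged = dict(base_config)
    for source in (buildbot_config, test_config):
        if source:
            merged.update(source)
    return merged

def dump_config(config, path="localconfig.json"):
    # Dump the running config to disk so a human can rerun the job locally
    # with the exact same settings.
    with open(path, "w") as fh:
        json.dump(config, fh, indent=2, sort_keys=True)
    log.info("running config written to %s", path)

if __name__ == "__main__":
    config = consolidate_config(
        {"branch": "mozilla-central", "work_dir": "build"},
        buildbot_config={"buildername": "hypothetical-builder"},
    )
    dump_config(config)
```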

[in-tree vs out-of-tree]

We also want some of mozharness' config and logic to ride the trains, but other portions need to be able to handle outside-of-tree processes and config, for various reasons:

  • some processes are volatile enough that they need to change across the board across all trees on a frequent basis;
  • some processes act across multiple trees and revisions, like the bumper scripts and vcs-sync;
  • some infrastructure-oriented code needs to be able to change across all processes, including historical-revision-based processes; and
  • some processes have nothing to do with the gecko tree at all.

[brainstorming solutions]

Part of the solution is to move logic out of mozharness. Desktop Firefox builds and repacks moving to mach makes sense, since they're

  1. configurable by separate mozconfigs,
  2. tasks completely shared by developers, and
  3. completely dependent on the tree, so tying them to the tree has no additional downside.

However, Andrew Halberstadt wanted to write the in-tree test harnesses in mozharness, and have mach call the mozharness scripts. This broke some of the above assumptions, until we started thinking along the lines of splitting mozharness: a portion in-tree running the test harnesses, and a portion out-of-tree doing the pre-test-run machine setup.

(I'm leaning towards both splitting mozharness and using helper objects, but am open to other brainstorms at this point...)

[splitting mozharness]

In effect, the wrapper, out-of-tree portion of mozharness would be taking all of the complex inputs, simplifying them for the in-tree portion, and setting up the environment (mock, tooltool, downloads+installs, etc.); the in-tree portion would take a relatively simple config and run the tests.

We could do this by having one mozharness script call another. We'd have to fix the logging bug that causes us to double-log lines when we instantiate a second BaseScript, but that's not an insurmountable problem. We could also try execing the second script, though I'd want to verify how that works on Windows. We could also modify our buildbot ScriptFactory to be able to call two scripts consecutively, after the first script dynamically generates the simplified config for the second script.
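
A rough sketch of the "one script calls another" flavor: the filenames and config keys are hypothetical, and the --config-file option mirrors mozharness' existing config flag, but everything else here is illustrative.

```python
import json
import subprocess
import sys

def run_inner_script(inner_script, simplified_config):
    # Write the simplified config the outer script generated...
    config_path = "inner_config.json"
    with open(config_path, "w") as fh:
        json.dump(simplified_config, fh, indent=2, sort_keys=True)
    # ...then run the in-tree script against it. An exec-based variant
    # (os.execv) would replace this process entirely; that behavior would
    # need verification on Windows.
    return subprocess.call(
        [sys.executable, inner_script, "--config-file", config_path])

if __name__ == "__main__":
    status = run_inner_script(
        "scripts/run_tests.py",  # hypothetical in-tree portion
        {"test_suite": "mochitest-1",
         "installer_url": "http://example.com/firefox.installer"},
    )
    sys.exit(status)
```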

We could land the portions of mozharness needed to run test harnesses in-tree, and leave the others out-of-tree. There will be some duplication, especially in the mozharness.base code, but that's changing less than the scripts and mozharness.mozilla modules.

We would be able to present a user-friendly "inner" script with limited inputs that rides the trains, while also allowing for complex inputs and automation-oriented setup beforehand in the "outer" script. We'd most likely still have to allow for automation support in the inner script, if there's some reporting or error checking or other automation task that's needed after the handoff, but we'd still be able to limit the complexity of that inner script. And we could wrap that inner script in a mach command for easy developer use.

[helper objects]

Currently, most of mozharness' logic is encapsulated in self. We do have helper objects: the BaseConfig and the ReadOnlyDict self.config for config; the MultiFileLogger self.log_obj that handles all logging; MercurialVCS for cloning; ADBDeviceHandler and SUTDeviceHandler for mobile device wrangling. But a lot of what we do is handled by mixins inherited by self.

A while back I filed a bug to create a LocalLogger and BaseHelper to enable parallelization in mozharness scripts. Instead of cloning 90 locale repos serially, we could create 10 helper objects that each clone a repo in parallel, and launch new ones as the previous ones finish. This would have simplified Armen's parallel emulator testing code. But even if we're not planning on running parallel processes, creating a helper object allows us to simplify the config and logic in that object, much like the "inner" script if we split mozharness into in-tree and out-of-tree portions. These helpers could potentially also be instantiated by other, non-mozharness scripts.

Essentially, as long as the object has a self.log_obj, it will use that for logging. The LocalLogger would log to memory or disk, outside of the main script log, to avoid parallel log interleaving; we would use this if we were going to run the helper objects in parallel. If we wanted the helper object to stream to the main log, we could set its log_obj to our self.log_obj. Similarly with its config. We could set its config to our self.config, or limit what config we pass to simplify.
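
A minimal sketch of what such a helper could look like, using the parallel-clone example; the class and function names are made up, and a real helper would use mozharness' run_command and MercurialVCS rather than the stub below.

```python
import logging
from multiprocessing.pool import ThreadPool

class CloneHelper(object):
    # Hypothetical helper object: it owns its own (limited) config and log_obj,
    # so it can run in parallel without interleaving the main script log.
    def __init__(self, repo_url, dest, config=None, log_obj=None):
        self.repo_url = repo_url
        self.dest = dest
        self.config = config or {}
        self.log_obj = log_obj or logging.getLogger("clone.%s" % dest)

    def clone(self):
        self.log_obj.info("cloning %s to %s", self.repo_url, self.dest)
        # a real helper would shell out to hg here (MercurialVCS, run_command, ...)
        return self.dest

def clone_locale_repos(repo_urls, parallelism=10):
    # Clone many locale repos with a bounded pool of helper objects,
    # launching new ones as previous ones finish.
    helpers = [CloneHelper(url, "l10n/%s" % url.rstrip("/").rsplit("/", 1)[-1])
               for url in repo_urls]
    pool = ThreadPool(parallelism)
    try:
        return pool.map(lambda helper: helper.clone(), helpers)
    finally:
        pool.close()
        pool.join()
```

If we wanted a helper to stream into the main log instead, we'd just hand it the script's self.log_obj.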

(Mozharness' config locking is a feature that promotes easier debugging and predictability, but in practice we often find ourselves trying to get around it somehow. Other config dicts, self.variables, editing self.config in _pre_config_lock() ... Creating helper objects lets us create dynamic config at runtime without violating this central principle, as long as it's logged properly.)

Because this "helper object" solution overlaps considerably with the "splitting mozharness" solution, we could use a combination of the two to great effect.

[functions and globals]

This idea completely alters our implementation of mozharness by moving self.config to a global config and directly calling logging methods (or wrapped logging methods). By making each method a standalone function that's only slightly different from a standard python function, it lowers the bar for contribution or re-use of mozharness code. It does away with both the downsides and benefits of objects.

The first, large downside I see is that this solution appears incompatible with the "helper objects" solution. By relying on a global config and global logging in our functions, it's difficult to create standalone helpers that use minimized configs or alternate logging configurations. I also think the global logging may make the double-logging bug more prevalent.

It's quite possible I'm downplaying the benefit of importing individual functions like a standard python script. There are decorators to transform functions into class methods and vice versa, which might allow for both standalone functions and object-based methods with the same code.
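
Here's a tiny sketch of what that hybrid might look like: a standalone function using module-level config and logging, plus a decorator that exposes it as a method. All names here are invented for illustration.

```python
import functools
import logging

# the "globals" approach: module-level config and logger
CONFIG = {"work_dir": "build"}
log = logging.getLogger(__name__)

def clobber(work_dir=None):
    # a standalone function: trivial to import and reuse like any python function
    work_dir = work_dir or CONFIG["work_dir"]
    log.info("clobbering %s", work_dir)

def as_method(func):
    # expose a standalone function as an instance method on a script object
    @functools.wraps(func)
    def wrapper(self, *args, **kwargs):
        return func(*args, **kwargs)
    return wrapper

class Script(object):
    # the same code, reachable both ways; whether this buys us anything
    # over plain helper objects is the open question
    clobber = as_method(clobber)
```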

[related links]

  • Jordan Lund has some ideas + wip patches linked from bug 753547 comment 6.
  • Andrew Halberstadt's Sharing code not always a good thing and How to deal with IFFY requirements
  • My mozharness core principles example scripts+configs and video
  • Lars Lohn's Crouching Argparse Hidden Configman. Afaict configman appears to solve similar problems to mozharness' BaseConfig, but Argparse requires python 2.7 and mozharness locks the config.
  • Gaia Try


    We're now running Gaia tests on TBPL against Gaia pull requests.

    John Ford wrote a commit hook in bug 989131 to push an update to the gaia-try repo that looks like this. The buildbot test scheduler master is polling this repo; when it sees a change, it triggers tests against the latest b2g desktop builds. The test jobs download the pre-built desktop binaries, and clone the appropriate pull-request Gaia repo and revision on top, then run the tests. The buildbot work was completed in bug 986209.

    This should allow us to verify our code doesn't break anything before it's merged to Gaia proper, saving human time and reducing code churn.

    Armen pointed out how gaia code changes currently show up in the push count via bumper processes, but that only reflects merged pull requests, not these un-reviewed pull requests. Now that we've turned on gaia-try, this is equivalent to another mozilla-inbound (an additional 10% of our push load, iirc). Our May pushes should see a significant bump.

    Fur seal rescue in Hercules Bay, South Georgia

    We took a zodiac ride around Hercules Bay on what was to be our final morning in South Georgia. Eric, our zodiac driver, heard chatter on the radio about a fur seal caught in a packing strap, so we went over to check it out.

    Fur seal rescue 1/6
    We found him sitting on the rocks, a loop around his neck.

    Fur seal rescue 2/6
    He was cute.

    We took some photos, then wandered off to watch the macaroni penguins for a bit. When we heard they were about to free him, we hurried over, cameras in hand.

    Fur seal rescue 3/6
    Our expedition leader, Lisa Kelley, caught the fur seal in a net, so they could remove the packaging material.

    Fur seal rescue 4/6
    Letting him go.

    Afterwards, I noticed he still had a crease around his neck, but he was free of the loop. Later that evening we saw pictures of animals that grew to full size after swimming through a loop of some sort: some are visible here.

    Fur seal rescue 5/6
    Walking away

    Fur seal rescue 6/6
    The loop


    This is a long overdue blog post to say that gecko-dev and gecko-projects are fully live and cut over (and renamed, for those who have been looking at the now-defunct links in my previous blog posts).

    Gecko-dev is on git.mozilla.org and on github; gecko-projects is on github only until it's clear that we won't need regular branch and repo resets. We have retired the following github repos, which we will probably remove after a period of time:

    The mapfiles are still in http://people.mozilla.org/~asasaki/vcs2vcs/, which is not the final location. There's a combined mapfile in here for convenience, but the gecko-mapfile is the canonical mapfile.

    I've been working on a db-based mapper app which, given a git or hg sha, would allow us to query for the corresponding sha. It also allows for downloading the full mapfile for a repo (gecko.git, gecko-dev, gecko-projects), the combined full mapfile of two or more of those repos, or just the shas inserted since a particular datetime. Screenshots of my development instance are here.
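
    As a sketch of the sort of client call I have in mind (the endpoint path, port, and response format below are made up for illustration, not mapper's actual API):

    ```python
    import requests

    MAPPER_URL = "http://localhost:8888"  # hypothetical development instance

    def git_to_hg(git_sha, repo="gecko-dev"):
        # hypothetical query: given a git sha, return the corresponding hg sha
        response = requests.get("%s/%s/git/%s" % (MAPPER_URL, repo, git_sha))
        response.raise_for_status()
        return response.json()["hg_sha"]
    ```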

    Bear with me; I've got lots on my plate and not enough time. Mapper is in the works, however, and I think this should smooth over some of the rough edges when dealing with both hg and git. Additional eyeballs and/or patience would be appreciated!

    My db-based mapper repo is here.
    The db-based mapper bug is here.
    The permanent upload location bug is here.

    The most long-term success tends to come from reducing, to the greatest extent possible, the need for agreement and consensus.

    I think we could rephrase "reducing the need for agreement and consensus" as, "increasing our ability to agree-to-disagree and still work well together". I'm going to talk a bit about modularity, forks, and shared APIs.


    modularity

    Catlee was pushing for LWR modularity, to distance us from buildbot's monolithic app architecture. I was as well, to a degree, with standalone clusters and delegating pooling decisions to a SlaveAPI/Mozpool analog. If you take that a step further, you reach Catlee's view of it (as I understand it):

    • minion1 allocation module: this could simply iterate down a list of minions for smaller clusters, or ask SlaveAPI/Mozpool for an appropriate minion in the gecko cluster.
    • the job launching module could use sshd on the minion pool if the scripts have their own logging and bootstrapping; another version could talk to a custom minion-side daemon.
    • The graph-generation and graph-processing pieces could be modular, allowing for different graph- and job- definition formats between clusters.
    • the pending jobs piece, the graphs db + api piece, the event api, and configs pieces as well

    By making these modular, we reduce the monolithic app to compatible LEGO® pieces that can be mixed and matched and forked if necessary. Plus, it allows for easier development in parallel.
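
    As a sketch of what one of those LEGO® pieces might look like in code (the interface and class names are invented for illustration):

    ```python
    class MinionAllocator(object):
        # one pluggable piece: decide which minion gets the next job
        def allocate(self, requirements):
            raise NotImplementedError

    class ListAllocator(MinionAllocator):
        # small-cluster version: just iterate down a static list of minions
        def __init__(self, minions):
            self.minions = list(minions)

        def allocate(self, requirements):
            return self.minions.pop(0)

    class SlaveAPIAllocator(MinionAllocator):
        # gecko-cluster version: delegate the decision to a SlaveAPI/Mozpool analog
        def __init__(self, api_client):
            self.api_client = api_client

        def allocate(self, requirements):
            return self.api_client.request_minion(requirements)  # hypothetical client call
    ```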

    Similarly, we had been talking about versioning each API and the job- and graph- definitions, so we wouldn't always be locked in to being backwards-compatible. That would allow us to make faster decisions early on that wouldn't necessarily have major long-term implications. It's easier to fail early + fail often if decisions are easily reversible.

    This versioning would also allow us to potentially have different workflows or job+graph definitions between LWR clusters, should we need it. The only requirement here would be that each LWR cluster be internally consistent. This could allow for simpler workflows to operate without the overhead of being fully compatible with a complex workflow; similarly, we wouldn't have to shoehorn a complex workflow onto a simpler one.


    1 Earlier I mentioned that I (and others) wanted to move away from the buildbot master/slave terminology. I think we've settled on minions for the compute farm machines. We're a bit mixed on the server side: Overlord? Professor? Mastermind? I'd prefer to think of the server cluster as more of a traffic cop than anything overarchingly powerful and omniscient, but I'm not too picky here.


    cross-cluster communication

    As a bit of [additional] background: Catlee sees LWR as event-based. We would define types of events, and what happens when an event occurs. And I mentioned wanting to support graphs-of-graphs, and hinted at potentially supporting cross-project or cross-corporation architectures here and here 2.

    Why not allow LWR clusters to talk to each other? They certainly would be able to, with events. If events involve hitting a specific server on a specific port, that may be tricky if there's a firewall in the way, but that might be solvable with listeners on the DMZ.

    Why not allow graphs-of-graphs to reference graphs in another LWR cluster? (If that's feasible.) With this workflow, once the dependencies had all finished, we could trigger a subgraph on another LWR cluster, either via an event or directly in the graph API.

    If we're able to read the other cluster's status and/or request a callback event at the end of that subgraph, this could work, and allow for loose coupling between related LWR clusters without requiring their internals be identical.


    2 While re-reading this post, I noticed that I was writing about next-gen, post-buildbot build infrastructure at Mozilla back in January 2009. My picture back then was a lot more RDBMS-based, but a lot of the concepts still hold.

    We put that project on the back burner so we could help Mozilla launch a 1.0. Four 1.0's (or equivalent: Maemo Fennec, Android Fennec, Android native Fennec, B2G) and nearly five years later, it looks like we're finally poised to tackle it, together.


    moving faster

    I tend to want to tackle the hardest piece of a problem first. It's easier to keep the simpler problems in mind when designing solutions for the hardest problem, than vice versa. In LWR's case, the hardest workflow is the existing Gecko builds and tests.

    The initial LWR phase 1 may now involve a smaller-scoped, Gaia-oriented pull request autoland system. This worried me: if I'm concerned about forward-compatibility with Gecko phase 1, will I be forced into a stop-energy role during the Gaia phase 1 brainstorming? That sounds counterproductive for everyone involved. That was until I realized that modularity and cross-cluster communication could help us avoid requiring future-proof decisions at the outset. If we don't have to worry that decisions made during phase 1 are irreversible, we'll be able to take more chances and move faster.

    If we wanted the Gaia workflow to someday be a part of the Gecko dependency graphs, that can happen by either moving those jobs into the Gecko LWR cluster, or we could have the Gecko cluster drive the Gaia cluster.

    Forked LWR clusters aren't ideal. But blocking Gaia LWR progress on hand-wavy ideas of what Gecko might need in the future isn't helpful. The option of avoiding the tyranny of a single shared workflow or manifest definition frees us to brainstorm the best solution for each, and converge later if it makes sense.



    In part 1, I covered where we are currently, and what needs to change to scale up.
    In part 2, I covered a high level overview of LWR.
    In part 3, I covered some hand-wavy LWR specifics, including what we can roll out in phase 1.
    In part 4, I drilled down into the dependency graph.
    In part 5, I covered some thoughts on the jobs and graphs db.
    We met with the A-team about this, a couple times, and are planning on working on this together!
    Now I'm going to take care of some vcs-sync tasks, prep for our next meeting, and start writing some code.

    [10:57] <catlee>    so one thing - I think it may be premature to look at db models before looking at APIs
    

    I think I agree with that, especially since it looks like LWR may end up looking very different than my initial views of it.

    However, as I noted in part 3's preface, I'm going to go ahead and get my thoughts down, with the caveat that this will probably change significantly. Hopefully this will be useful in the high-level sense, at least.


    jobs and graphs db

    At the bare minimum, this will need graphs and jobs. But right now I'm thinking we may want the following tables (or collections, depending on what db solution we end up choosing):

    graph sets

    Graph sets would tie a set of graphs together. If we wanted to, say, launch b2g builds as a separate graph from the firefox mobile and firefox desktop graphs, we could still tie them together as a graph set. Alternately, we could create a big graph that represents everything. The benefit of keeping the separate graphs is it's easier to reference and retrigger jobs of a type: retrigger all the b2g jobs, retrigger all the Firefox desktop win32 PGO talos runs.

    graph templates

    I'm still in the brainstorm mode for the templates. Having templates for graphs and jobs would allow us to generate graphs from the web app, allowing for a faster UI without having to farm out a job to the graph generation pool to clone a repo and generate a graph. These would have templatized bits like branch name that we can fill out to create a graph for a specific branch. It would also be one way we could keep track of changes in customized graphs or jobs (by diffing the template against the actual job or graph).

    graphs

    This would be the actual dependency graphs we use for scheduling, built from the templates or submitted from external requests.

    job templates

    This would work similarly to the graph templates: how jobs would look if you filled in the branch name and such, so that we can easily work with them in the webapp to create jobs.

    jobs

    These would have the definitions of jobs for specific graphs, but not the actual job run information -- that would go in the "job runs" table.

    job runs

    I thought of "job runs" as separate from "jobs" because I was trying to resolve how we deal with retriggers and retries. Do we embed more and more information inside the job dictionary? What happens if we want to run a new job Y just like completed job X, but as its own entity? Do we know how to scrub the previous history and status of job X, while still keeping the definition the same? (This is what I meant by "volatile" information in jobs). The "job run" table solves this by keeping the "jobs" table all about definitions, and the "job runs" table has the actual runtime history and status. I'm not sure if I'm creating too many tables or just enough here, right now.

    If you're wondering what you might run, you might care about the graph sets, and {graph,job} templates tables. If you're wondering what has run, you would look at the graphs and job runs tables. If you're wondering if a specific job or graph were customized, you would compare the graph or job against the appropriate template. And if you're looking at retriggering stuff, you would be cloning bits of the graphs and jobs tables.
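
    To make the brainstorm a bit more concrete, here's a throwaway sketch of those tables; the columns are illustrative only, since (as I say below) the schema itself isn't the main point.

    ```python
    import sqlite3

    SCHEMA = """
    CREATE TABLE graph_sets      (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE graph_templates (id INTEGER PRIMARY KEY, name TEXT, definition TEXT);
    CREATE TABLE graphs          (id INTEGER PRIMARY KEY, graph_set_id INTEGER,
                                  template_id INTEGER, revision TEXT, status TEXT);
    CREATE TABLE job_templates   (id INTEGER PRIMARY KEY, name TEXT, definition TEXT);
    CREATE TABLE jobs            (id INTEGER PRIMARY KEY, graph_id INTEGER,
                                  template_id INTEGER, definition TEXT);
    CREATE TABLE job_runs        (id INTEGER PRIMARY KEY, job_id INTEGER, attempt INTEGER,
                                  status TEXT, requested_by TEXT, started TEXT, finished TEXT);
    """

    def create_db(path=":memory:"):
        # throwaway db for poking at the model: definitions live in jobs/graphs,
        # volatile runtime history lives in job_runs
        conn = sqlite3.connect(path)
        conn.executescript(SCHEMA)
        return conn
    ```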

    (I think Catlee has the job runs in a separate db, as the job queue, to differentiate pending jobs from jobs that are blocked by dependencies. I think they're roughly equivalent in concept, and his model would allow for acting on those pending jobs faster.)

    I think the main thing here is not the schema, but my concerns about retries and retriggers, keeping track of customization of jobs+graphs via the webapp, and reducing the turnaround time between creating a graph in LWR and showing it on the webapp. Not all of these things will be as important for all projects, but since we plan on supporting a superset of gecko's needs with LWR, we'll need to support gecko's workflow at some point.


    dependency graph repo

    This repo isn't a mandatory requirement; I see it as a piece that could speed up and [hopefully] streamline the workflow of LWR. It could allow us to:

    • refer to a set of job+graph definitions by a single SHA. That would make it easier to tie graphs- and jobs- templates to the source.
    • easily diff, say, mozilla-inbound jobs against mozilla-central. You can do that if the jobs and graphs definitions live in-tree for each, but it's easier to tell if one revision differs from another revision than if one directory tree in a revision differs from that directory tree in another revision. Even in the same repo: it's hard to tell if m-c revision 2's job and graphs have changed from m-c revision 1, without diffing. The job-and-graph repo would [generally] only have a new revision if something has changed.
    • pre-populate sets of jobs and graph templates in the db that people can use without re-generating them.

    There's no requirement that the jobs+graph definitions live in this repo, but having it would make the webapp graph+job creation a lot easier.

    We could create a branch definitions file in-tree (though it could live elsewhere; its location would be defined in LWR's config). The branch definitions could have trychooser-like options to turn parts of the graphs on or off: PGO, nightlies, l10n, etc. These would be referenced by the job and graph definitions: "enable this job if nightlies are enabled in the branch definitions", etc. So in the branch definitions, we could either point to this dependency graph repo+revision, or at a local directory, for the job+graph definitions. In the gecko model, if someone adds a new job type, that would result in a new jobs+graphs repo revision. A patch to point the branch-definitions file at the new revision would land in Try, first, then one of the inbound branches. Then it would get merged to m-c and then out to the other inbound and project branches; then it would ride the trains.

    (Of course, in the above model, there's the issue of inbound having different branch definitions than mozilla-central, which is why I was suggesting we have overrides by branch name. I'm not sure of a better solution at the moment.)
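
    A sketch of what a branch definitions file might contain; all of the key names and the repo URL below are invented for illustration.

    ```python
    BRANCH_DEFAULTS = {
        "pgo": False,
        "nightlies": False,
        "l10n": False,
        # where the job+graph definitions live: a repo+revision, or a local directory
        "graph_definitions": {"repo": "https://example.com/dependency-graphs",
                              "revision": "abcdef123456"},
    }

    BRANCH_OVERRIDES = {
        "mozilla-central": {"pgo": True, "nightlies": True, "l10n": True},
        "mozilla-inbound": {"pgo": True},
        "try": {},
    }

    def branch_config(branch):
        # merge the generic defaults with any branch-specific overrides
        config = dict(BRANCH_DEFAULTS)
        config.update(BRANCH_OVERRIDES.get(branch, {}))
        return config
    ```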

    The other side of this model: when someone pushes a new jobs+graphs revision, that triggers a "generate new jobs+graphs templates" job. That would enter new jobs+graphs for the new SHA. Then, if you wanted to create a graph or job based on that SHA, the webapp doesn't have to re-build those from scratch; it has them in the db, pre-populated.


    chunks and customizations

    • For chunked jobs (most test jobs, l10n), we were brainstorming about having dynamic numbers of chunks. If only a single machine is free, it would grab the first chunk, finish, then ask LWR if there are more chunks to run. If only a handful of machines are available, LWR could lean towards starting min_chunks jobs. If there are many machines idle, we can trigger max_chunks jobs in parallel. I'm not sure how plausible this is, but this could help if it doesn't add too much complexity.
    • I think that while users should be able to adjust graph- or job-priority, we should have max priorities set per-branch or per-user, so we can keep chemspill release builds at the highest priority.

      Similarly, I think we should allow customization of jobs via the webapp on certain branches, but

      1. we should mark them as customized (by, say, a flag, plus the diff between the template and the job), and
      2. we need to prevent customizing, say, nightly builds that get sent to users, or release builds.

      This causes interesting problems when we want to clone a job or graph: do we clone a template, or the customized job or graph that contains volatile information? (I worked around this in my head by creating the schema above.)

      Signed graphs and jobs, per-branch restrictions, or separate LWR clusters with different ACLs, could all factor into limiting what people can customize.

    retries and retriggers

    For retries, we need to track max [auto] retries, as well as job statuses per run. I played with this in my head: do we keep track of runs separately from job definitions? Or clone the jobs, in which case volatile status and customizing jobs become more of an issue? Keeping the job runs separate, and specifying whether they were an auto-retry or a user-generated retrigger, could help in preventing going beyond max-auto-retries.

    Retriggers themselves are somewhat problematic if we mark jobs as skipped due to dependencies: if a job that was unsuccessful is retriggered and exits successfully, do we revisit previously skipped jobs? Or do we clone the graph when we retrigger, and keep all downstream jobs pending-blocked-by-dependencies? Does this graph mark the previous graph as a parent graph, or do we mark it as part of the same graph set?

    (This is less of a problem currently, since build jobs run sendchanges in that cascading-waterfall type scheduling; any time you retrigger a build, it will retrigger all downstream tests, which is useful if that's what you want. If you don't want the tests, you either waste test machine time or human time in cancelling the tests. Explicitly stating what we want in the graph is a better model imo, but forces us to be more explicit when we retrigger a job and want the downstream jobs.)

    Another possibility: I thought that instead of marking downstream jobs as skipped-due-to-dependencies, we could leave them pending-blocked-by-dependencies until they either see a successful run from their upstream dependencies, or hit their TTL (request timeout). This would remove some complexity in retriggering, but would leave a lot of pending jobs hanging around that could slow down the graph processing pool and skew our 15 minute wait time SLA metrics.

    I don't think I have the perfect answers to these questions; a lot of my conclusions are based on the scope of the problem that I'm holding in my head. I'm certain that some solutions are better for some workflows and others are better for others. I think, for the moment, I came to the soft conclusion of a hand-wavy retriggering-portions-of-the-graph-via-webapp (or web api call, which does the equivalent).

    A random semi-tangential thought: whichever method of pending- or skipped- solution we generate will probably significantly affect our 15 minute wait time SLA metrics, anyway; they may also provide more useful metrics like end-to-end-times. After we move to this model, we may want to revisit our metric of record.


    lwr_runner.py

    (This doesn't really have anything to do with the db, but I had this thought recently, and didn't want to save it for a part 6.)

    While discussing this with the Gaia, A-Team, and perf teams, it became clear that we may need a solution for other projects that want to run simple jobs that aren't ported to mozharness. Catlee was thinking of an equivalent to the buildbot buildslave process: this would handle logging, uploads, status notifications, etc., without requiring a constant connection to the master. Previously I had worked with daemons like this that spawned jobs on the build farm, but then moved to launching scripts via sshd as a lower-maintenance solution.

    The downsides of no daemon include having to solve the mach context problem on macs, having to deal with sshd on windows, and needing a remote logging solution in the scripts. The upsides include being able to add machines to the pool without requiring the daemon, avoiding platform-specific issues with writing and maintaining the daemon(s), and being able to roll out script changes faster (and more granularly) than upgrading a daemon across an entire pool.

    If we create a mozharness/scripts/lwr_runner.py that takes a set of commands to run against a repo+revision (or set of repos+revisions), with pre-defined metrics for success and failure, then simpler processes won't need their own mozharness script; we could wrap the commands in lwr_runner.py and get all the logging, error parsing, and retry logic already defined in mozharness.
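
    Something along these lines, where the repo URL and field names are placeholders; the real script would get its logging, error parsing, and retry behavior from the existing mozharness modules rather than this bare loop.

    ```python
    import subprocess

    # hypothetical job definition a generic runner could consume
    JOB = {
        "repo": "https://example.com/some-project",
        "revision": "default",
        "commands": [
            ["./configure"],
            ["make", "check"],
        ],
        "success_codes": [0],
    }

    def run_job(job):
        # clone/update of the repo elided; run each command and fail fast
        # on an unexpected exit code
        for command in job["commands"]:
            status = subprocess.call(command)
            if status not in job["success_codes"]:
                return "failure"
        return "success"
    ```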

    I don't think we should rule out either approach just yet. With the lwr_runner.py idea, both approaches seem viable at the moment.



    In part 1, I covered where we are currently, and what needs to change to scale up.
    In part 2, I covered a high level overview of LWR.
    In part 3, I covered some hand-wavy LWR specifics, including what we can roll out in phase 1.
    In part 4, I drilled down into the dependency graph.
    We met with the A-team about this, a couple times, and are planning on working on LWR together!
    Now I'm going to take care of some vcs-sync tasks, prep for this next meeting, and start writing some code.


    I already wrote a bit about the dependency graph here, and :catlee wrote about it here. While I was writing part 2, it became clear that

    1. I had a lot more ideas about the dependency graph, enough for its own blog post, and
    2. since I want to tackle writing the dependency graph first, solidifying my ideas about it beforehand would be beneficial to writing it.

    I've been futzing around with graphviz with :hwine's help. Not half as much fun as drawings on napkins, but hopefully they make sense. I'm still thinking things through.


    jobs and graphs

    A quick look at TBPL was enough to convince me that the dependency graph would be complex enough just describing the relationships between jobs. The job details should be separate. Per-checkin, nightly, and periodic-PGO dependency graphs trigger overlapping sets of jobs, so avoiding duplicate job definitions is a plus.

    We'll need to submit both the dependency graph and the associated job definitions to LWR together. More on how I think jobs and graphs could work in the db in part 5.

    For phase 1, I think job definitions will only cover enough to feed into buildbot and have them work.


    dummy jobs

    • In my initial dependency graph thoughts, I mentioned breakpoint jobs as a throwaway idea, but it's stuck with me.

      We could use these at the beginning of graphs that we want to view or edit in the web app before proceeding. Or if we submit an experimental patch to Try and want to verify the results after a specific job or set of jobs before proceeding further. Or if we want to represent QA signoff in a release graph, and allow them to continue the release via the web app.

      I imagine we would want a request timeout on this breakpoint, after which it's marked as timed out, and all child jobs are skipped. I imagine we'd also want to set an ACL on at least a subset of these, to limit who can sign off on releases.

      Also in releases, we have simple notification jobs that send email when the release has passed certain milestones. We could later potentially support IRC pings and bug comments.

      A highly simplified representation of part of a release:



      We currently continue the release via manual Release Engineering intervention, after we see an email "go". It would be great to represent it in the dependency graph and give the correct group of people access. Less RelEng bottleneck.
    • We could also have timer jobs that pause the graph until either cancelled or the timeout is hit. So if you want this graph to run at 7pm PST, you could schedule the graph with an initial timer job that marks itself successful at 7, triggering the next steps in the graph.
    • In buildbot, we currently have a dummy factory that sleeps 5 and exits successfully. We used this back in the dark ages to skip certain jobs in a release, since we could only restart the release from the beginning; by replacing long-running jobs with dummy jobs, we could start at the beginning and still skip the previously successful portions of the release.

      We could use dummy jobs to:

      1. simplify the relationships between jobs. In the above graph, we avoided a many-to-many relationship by inserting a notification job in between the linux jobs and the updates.
      2. trigger when certain groups of jobs finish (e.g. all linux64 mochitests), so downstream jobs can watch for the dummy job in Pulse rather than having to know how many chunks of mochitests we expect to run, and keep track as each one finishes.
      3. quickly test dependency graph processing: instead of waiting for a full build or test, replace it with a dummy job. For instance, we could set all the jobs of a type to "success" except one "timed out; retry" to test max retry limits quickly. This assumes we can set custom exit statuses for each dummy job, as well as potentially pointing at pre-existing artifact manifest URLs for downstream jobs to reference.

    Looking at this list, it appears to me that timer and breakpoint jobs are pretty close in functionality, as are notification and dummy (status?) jobs. We might be able to define these in one or two job types. And these jobs seem simple enough that they may be runnable on the graph processing pool, rather than calling out to SlaveAPI/MozPool for a new node to spawn a script on.
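
    For example, those one or two job types might boil down to definitions like these (the field names are invented for illustration):

    ```python
    # a breakpoint/timer job: waits for a time, a signoff, or a timeout
    BREAKPOINT_JOB = {
        "type": "breakpoint",
        "resume_at": "2014-01-01T19:00:00-08:00",  # or None, for a manual signoff
        "acl": ["release-drivers"],                # who may mark it finished
        "request_timeout": 86400,                  # give up and skip children after a day
    }

    # a status/notification job: emits a status (and optionally email) and exits
    STATUS_JOB = {
        "type": "status",
        "exit_status": "success",                  # settable, handy for graph testing
        "notify": ["release-signoff@example.com"],
    }
    ```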


    statuses

    At first glance, it's probably easiest to reuse the set of TBPL statuses: success, warning, failure, exception, retry. But there are also the grey statuses 'pending' and 'running'; the pink status 'cancelled'; and the statuses 'timed out' and 'interrupted', which are subsets of the first five.

    Some statuses I've brainstormed:

    • inactive (skipped during scheduling)
    • request cancelled
    • pending blocked by dependencies
    • pending blocked by infrastructure limits
    • skipped due to coalescing
    • skipped due to dependencies
    • request timed out
    • running
    • interrupted due to user request
    • interrupted due to network/infrastructure/spot instance interrupt
    • interrupted due to max runtime timeout
    • interrupted due to idle time timeout (no output for x seconds)
    • completed successful
    • completed warnings
    • completed failure
    • retried (auto)
    • retried (user request)

    The "completed warnings" and "completed failure" statuses could be split further into "with crash", "with memory leak", "with compilation error", etc., which could be useful to specify, but are job-type-specific.

    If we continue as we have been, some of these statuses are only detectable by log parsing. Differentiating these statuses allows us to act on them in a programmatic fashion. We do have to strike a balance, however. Adding more statuses to the list later might force us to revisit all of our job dependencies to ensure the right behavior with each new status. Specifying non-useful statuses at the outset can lead to unneeded complexity and cruft. Perhaps 'state' could be separated from 'status', where 'state' is in the set ('inactive', 'pending', 'running', 'interrupted', 'completed'); we could also separate 'reasons' and 'comments' from 'status'.
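
    A sketch of that state/status split, with my brainstormed statuses grouped under states; the grouping is just one possible carve-up.

    ```python
    STATES = ("inactive", "pending", "running", "interrupted", "completed")

    STATUSES = {
        "inactive": ("skipped_during_scheduling", "skipped_by_coalescing",
                     "skipped_by_dependencies", "request_cancelled", "request_timed_out"),
        "pending": ("blocked_by_dependencies", "blocked_by_infrastructure_limits"),
        "interrupted": ("user_request", "network_or_infrastructure_or_spot_interrupt",
                        "max_runtime_timeout", "idle_time_timeout"),
        "completed": ("success", "warnings", "failure", "retried_auto", "retried_user"),
    }

    def validate(state, status=None):
        # reject state/status combinations we haven't defined
        if state not in STATES:
            raise ValueError("unknown state %s" % state)
        if status is not None and status not in STATUSES.get(state, ()):
            raise ValueError("status %s not valid for state %s" % (status, state))
    ```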

    Timeouts are split into request timeouts or runtime timeouts (idle timeouts, max job runtime timeouts). If we hit a request timeout, I imagine the job would be marked as 'skipped'. I also imagine we could mark it as 'skipped successful' or 'skipped failure' depending on configuration: the former would work for timer jobs, especially if the request timeout could be specified by absolute clock time in addition to relative seconds elapsed. I also think both graphs and jobs could have request timeouts.

    I'm not entirely sure how to coalesce jobs in LWR, or if we want to. Maybe we leave that to graph and job prioritization, combined with request timeouts. If we did coalesce jobs, that would probably happen in the graph processing pool.

    For retries, we need to track max [auto] retries, as well as job statuses per run. I'm going to go deeper into this below in part 5.


    relationships

    For the most part, I think relationships between jobs can be shown by the following flowchart:


    If we mark job 2 as skipped-due-to-dependencies, we need to deal with that somehow if we retrigger job 1. I'm not sure if that means we mark job 2 as "pending-blocked-by-dependencies" if we retrigger job 1, or if the graph processing pool revisits skipped-due-to-dependencies jobs after retriggered jobs finish. I'm going to explore this more in part 5, though I'm not sure I'll have a definitive answer there either.

    It should be possible, at some point, to block the next job until we see a specific job status:

    • don't run until this dependency is finished/cancelled/timed out
    • don't run unless the dependency is finished and marked as failure
    • don't run unless the dependency is finished and there's a memory leak or crash

    For the most part, we should be able to define all of our dependencies with this type of relationship: block this job on (job X1 status Y1, job X2 status Y2, ...). A request timeout with a predefined behavior-on-expiration would be the final piece.
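
    In other words, something like this; the dependency format is a guess at what a graph definition might contain.

    ```python
    def dependencies_met(job, graph_status):
        # graph_status maps job names to their current status;
        # each dependency is (upstream job name, acceptable statuses)
        for upstream, wanted_statuses in job.get("depends_on", []):
            if graph_status.get(upstream) not in wanted_statuses:
                return False
        return True

    # example: only run the updates job once both linux builds have completed acceptably
    updates_job = {"depends_on": [("linux32-build", ["success"]),
                                  ("linux64-build", ["success", "warnings"])]}
    print(dependencies_met(updates_job, {"linux32-build": "success",
                                         "linux64-build": "warnings"}))  # True
    ```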

    I could potentially see more powerful commands, like "cancel the rest of the [downstream?] jobs in this graph", or "retrigger this other job in the graph", or "increase the request timeout for this other job", being potentially useful. Perhaps we could add those to dummy status jobs. I could also see them significantly increasing the complexity of graphs, including the potential for infinite recursion in some constructs.

    I think I should mark any ideas that potentially introduce too much complexity as out of scope for phase 1.


    branch specific definitions

    Since job and graph definitions will be in-tree, riding the trains, we need some branch-specific definitions. Is this a PGO branch? Are nightlies enabled on this branch? Are all products and platforms enabled on this branch?

    This branch definition config file could also point at a revision in a separate, standalone repo for its dependency graph + job definitions, so we can easily refer to different sets of graph and job definitions by SHA. I'm going to explore that further in part 5.

    I worry about branch merges overwriting branch-specific configs. The inbound and project branches have different branch configs than mozilla-central, so it's definitely possible. I think the solution here is a generic branch-level config, and an optional branch-named file. If that branch-named file doesn't exist, use the generic default. (e.g. generic.json, mozilla-inbound.json) I know others disagree with me here, but I feel pretty strongly that human decisions need to be reduced or removed at merge time.
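
    The lookup itself is simple; a sketch, using the generic.json / mozilla-inbound.json naming from above (the directory name is made up):

    ```python
    import json
    import os

    def load_branch_definitions(branch, config_dir="branch_configs"):
        # prefer an optional branch-named file, fall back to the generic default,
        # so merges don't require humans to touch branch-specific configs
        for name in ("%s.json" % branch, "generic.json"):
            path = os.path.join(config_dir, name)
            if os.path.exists(path):
                with open(path) as fh:
                    return json.load(fh)
        raise ValueError("no branch definitions found for %s" % branch)
    ```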


    graphs of graphs

    I think we need to support graphs-of-graphs. B2G jobs are completely separate from Firefox desktop or Fennec jobs; they only start with a common trigger. Similarly, win32 desktop jobs have no real dependencies on macosx desktop jobs. However, it's useful to refer to them as a single set of jobs, so if graphs can include other graphs, we could create a superset graph that includes the appropriate product- and platform- specific graphs, and trigger that.

    If we have PGO jobs defined in their own graph, we could potentially include it in the per-checkin graph with a branch config check. On a per-checkin-PGO branch, the PGO graph would be included and enabled in the per-checkin graph. Otherwise, the PGO graph would be included, but marked as inactive; we could then trigger those jobs as needed via the web app. (On a periodic-PGO branch, a periodic scheduler could submit an enabled PGO graph, separate from the per-checkin graph.)

    It's not immediately clear to me if we'll be able to depend on a specific job in a subgraph, or if we'll only be able to depend on the entire subgraph finishing. (For example: can an external graph depend on the linux32 mochitest-2 job finishing, or would it need to wait until all linux32 jobs finish?) Maybe named dummy status jobs will help here: graph1.start, graph1.end, graph1.builds_finished, etc. Maybe I'm overthinking things again.

    We need a balancing act between ease of reading and ease of writing; ease of use and ease of maintenance. We've seen the mess a strong imbalance can cause, in our own buildbot configs. The fact that we're planning on making the final graph easily viewable and testable without any infrastructure dependencies helps, in this regard.


    graphbuilder.py

    I think graphbuilder.py, our [to be written] dependency graph generator, may need to cover several use cases:

    • Create a graph in an api-submittable format. This may be all we do in phase 1, but the others are tempting...
    • Combine graphs as needed, with branch-specific definitions, and user customizations (think TryChooser and per-product builds).
    • Verify that this is a well-formed graph.
    • Run other graph unit tests, as needed.
    • Potentially output graphviz files for user-friendly local graph visualization?
    • It's unclear if we want it to also do the graph+job submitting to the api.

    I think the per-checkin graph would be good to build first; the nightly and PGO graphs, as well as the branch-specific defines, might also be nice to have in phase 1.


    I have 4 more sections I wrote skeletons for. Since those sections are more db-oriented, I'm going to move those into a part 5.


    In part 1, I covered where we are currently, and what needs to change to scale up.
    In part 2, I covered a high level overview of LWR.
    In part 3, I covered some hand-wavy LWR specifics, including what we can roll out in phase 1.
    In part 5, I'm going to cover some dependency graph db specifics.
    Now I'm going to meet with the A-team about this, take care of some vcs-sync tasks, and start writing some code.


    preface

    Since I wrote my previous blog post on LWR, I've found/been sent/been reminded of a few links:

    • Automating away operations in production deployments:
    • Other job scheduling systems, which I plan on digging into later. These may or may not work for us:
      • From comments in my previous blog post, JobScheduler, which may or may not scale to the degree we would need it to.
      • And this article that talks about Google Borg and Apache Mesos based solutions, which definitely can scale. It's not immediately clear to me if the compute cluster model lends itself to a heterogeneous compute farm where it's mandatory we be able to target specific nodes for specific tasks (as opposed to finding free resources of any type). It's also unclear at first blush whether they would lend themselves to Windows, OSX, ARM device, or other hardware-based nodes, or are explicitly linux/cloud specific.

    I'm going to keep writing this third LWR blog post as planned, since I think we failed to explain things clearly to our newer team members during the team week. Also, recording our recent brainstorming may help us decide if these other scheduling systems would work for us, and help guide any customizations we may want to make, should we choose to use them.


    whiteboard schematics


    the drawing is either really important, or just a brainstorm first draft. i'm not sure which yet.

    event api
    This would allow specific events to trigger pre-defined behavior. New push, new graph, job start, job finish, timer.
    lwr config
    This would specify the above pre-defined behavior, as well as some timers for nightly/periodic/scheduled tasks.
    graph api
    This would allow for reading/inserting/updating dependency graphs into the system.
    event processor
    This would trigger graph generation and graph processing jobs during phase 1.
    web app
    As described here; most likely only a subset of those features would exist in phase 1.
    graphs db
    This would hold the graphs and jobs. (We might want a different name than 'graphs' for this db, to avoid confusion with graphserver.)
    graph generation pool
    This would generate the graphs using pre-defined defaults. We'd like the defaults for the build/test graphs to live in-tree. We would potentially need TryChooser, per-product (only kick off certain builds depending on which files have been changed), and other customization logic here.
    graph processing pool
    This would read the graphs and determine the next steps. Dependency status checking, triggering next jobs when appropriate, marking the rest of the graph as skipped/timed out/etc if needed.

    We'd like to keep the server-side as dumb as possible. The fewer changes that need to be made there, the more stable it will be. By moving logic and configs off the server, we can make more complex and granular changes without touching the servers. We've already seen the effects of keeping all the logic, all the configs, all the scheduling on the buildbot masters, and we want the opposite.

    We'd like a small server to be runnable on a single machine, so we can test and debug changes on our laptops. Versioned- or backwards-compatible APIs may allow us to upgrade half a production cluster, bring that live, then upgrade the second half. If we're easily able to spin up a new standalone cluster, we can easily support different workflows/audiences like staging-vs-production, standalone project branch "pods", or experimental small project support.


    slaveapi, mozpool, and network logging

    Currently, buildbot has a list of buildslaves per builder, and keeps track of which ones are currently connected to the buildbot master. :catlee then tweaks the nextSlave logic to prefer faster buildslaves over slower, or spot instances over reserved instances (or to use reserved instances if the same job was run on an interrupted spot instance), or the last buildslave that successfully ran that particular job (to improve depend build times). Buildbot doesn't have any concept of how healthy the buildslave is, or how to maintain the buildslaves, and requires that we make any pooling or nextSlave decisions ahead of time and load them into the running buildbot masters via a reconfig. Plus, the master<->buildslave communication requires an uninterrupted network connection, which gives us streaming logs, but adds network fragility.

    SlaveAPI is designed to handle some of the above issues: determine the health of a slave, reboot a slave, or mark it as disabled. In the future it may allow us to spin up and spin down AWS instances, and reimage hardware slaves.

    MozPool is currently limited to Android Pandas, and allows for health checks on Pandas, rebooting Pandas, and IIRC reimaging Pandas. A job that requires a Panda would be able to request [a healthy] one from the pool, run the job, and return it to the pool or mark it as bad.

    With a combination of the two, LWR could request a node with certain properties (with tags, maybe?): any machine; any linux/osx machine; the fastest linux machine available with build tools; or specifically by hostname. If LWR also passed the history and details of the job along with the request, the SlaveAPI/MozPool analog could make decisions like spot instance or reserved instance; fast or slow; most recently successfully built that tree so a depend build would be fast. And we might be able to keep that complexity out of LWR.
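
    The request might look something like this; the field names are invented to illustrate "give the allocator enough context to choose well", not a real SlaveAPI or MozPool payload.

    ```python
    node_request = {
        "tags": ["linux64", "build-tools"],   # or "hostname": "bld-linux64-123"
        "prefer": ["fast", "spot"],
        "job": {
            "name": "linux64-build",
            "tree": "mozilla-inbound",
            "last_successful_node": "bld-linux64-spot-042",  # depend-build locality
            "interrupted_on_spot": False,   # if True, prefer a reserved instance
        },
    }
    ```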

    We'd like to be able to spawn the job and detach, to remove the need for an uninterrupted network connection. For status, it might be nice to be able to tail the log on demand, and/or add network logging support to mozharness (via a MultiNetworkLogger class, perhaps). This would probably require a network log cluster of some sort. Someone else suggested that we be able to toggle the network logging on at will, but otherwise default to off, to reduce network traffic. I'm not entirely sure how to do this, but given a trigger we could replay the log from disk over the network, and continue to stream the log as it came in, once the network logging had caught up.

    We could also take this opportunity to move away from the buildbot master/slave terminology, to... perhaps server/node? farmer/cow? :) Technically this wouldn't matter at all. Semantically it does.


    artifact manifests

    Currently, we upload various things from build jobs: installers, crashsymbols, test zips. The build jobs then guess which binary is the installer and sendchange the installer and test zip urls, triggering tests. The buildbot master then uploads the build logs, sometimes to a completely different directory than the installer, which can cause issues with TBPL and other downstream consumers. The test jobs take the installer and test zip urls from the sendchange, and use those to download and install the binary, extract the tests, and run them. Sometimes they need other files: crashsymbols, the robocop apk, so we apply a regex to the installer url to guess the other urls, causing all sorts of fun when this doesn't work. In a similar vein, we download previous MARs to generate partial updates. However, the mar files contain version numbers, causing more fun when we try to guess the filename after a version bump.

    Call me a buzzkill, but I'd like to eliminate all this fun guesswork in favor of a boring and predictable solution. I'd like an artifact manifest. A structured artifact manifest, with versioned manifest formats, so we know how to read them. And while I think it's ok to have a section of the manifest for dumping random blobs of information, if portions of those become generally useful, we should probably put those in the structured area in the next version of the manifest format.

    The manifest would definitely contain naming information for the various artifacts uploaded, as well as what they are. If mozharness jobs uploaded their own logs, they would more predictably live with the other artifacts, and be specified in the manifest. We could also include job status and uid and other such information. Dependent jobs could then act on all of that information without guessing, given only the location of the manifest. This also reduces the amount of information that LWR has to transfer between jobs... and may satisfy :sfink's request for more structured schema for downstream jobs.
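
    A sketch of such a manifest, shown as a Python dict (it would presumably be JSON on disk); the field names and layout are illustrative, not a settled format.

    ```python
    manifest = {
        "manifest_version": 1,
        "job": {"name": "linux64-build", "uid": "abc123", "status": "success"},
        "artifacts": {
            "installer":    {"path": "firefox-29.0a1.en-US.linux-x86_64.tar.bz2"},
            "tests":        {"path": "firefox-29.0a1.en-US.linux-x86_64.tests.zip"},
            "crashsymbols": {"path": "firefox-29.0a1.en-US.linux-x86_64.crashreporter-symbols.zip"},
            "log":          {"path": "live_backing.log"},
        },
        "extra": {},  # free-form blob area; promote useful bits into the structured
                      # sections in the next manifest_version
    }
    ```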


    phase one

    We can't write and roll all of this out at once. Besides the magnitude of work represented by this handful of blog posts, we also have existing dependencies on buildbot that we don't yet have replacements for. For phase one, we're picturing the graph processing pool sending jobs into buildbot, probably via the db.

    First we should build the dependency graph for our existing build and test jobs. If I were to tackle one piece first, this would be it, because it's a single script with no infrastructure dependencies. It's easy to verify afterwards, by comparing the output to our existing TBPL runs. Normalizing the builder names would help here.

    Then we could feed that graph into self serve, potentially allowing us to more easily trigger individual builds (we currently use regexes, iirc). Tests and repacks may be trickier, since they expect additional information to come via sendchange and buildbot properties, but that's a start.

    Next we could start writing the server pieces -- trigger polling, graph generation, iterate through the graph. Any web app work could start here. This isn't strictly blocked by the self-serve implementation, so if more people chipped in we could work on those in parallel.

    We could then start feeding the jobs from the graph into buildbot, and disable the respective buildbot polling and scheduling.

    Once we got this far, we could also look into moving certain jobs that are already ported to mozharness out of buildbot and into a pure LWR implementation. That may depend on a streaming log solution or artifact manifest solution. This might belong in phase 2.


I've been both excited and nervous writing about LWR. Excited, since I'm bursting with ideas for the project. Nervous, because so much of it extends outside of my domain of expertise; because it's a huge project; because portions of it are still nebulous concepts in my head. I think we have the team(s) to build it, though. And since I think best about projects when I write [about] them, these blog posts have helped me focus my ideas and get a first draft down that we can revise later.


    In part 1, I covered where we are currently, and what needs to change to scale up.
    In part 2, I covered a high level overview of LWR.
    In part 4, I'm going to drill down into the dependency graph.
    Then I'm going to meet with the A-team about this, and start writing some code.

    escapewindow: escape window (Default)

    compute farm

Of all the ideas we've brainstormed, I think the one I'm most drawn to is this: our automation infrastructure shouldn't just be a build farm feeding into a test farm. It should be a compute farm, capable of running a superset of tasks including, but not restricted to, builds and tests.

    Once we made that leap, it wasn't too hard to imagine the compute farm running its own maintenance tasks, or doing its own dependency scheduling. Or running any scriptable task we need it to.

    This perspective also guides the schematics; generic scheduling, generic job running. This job only happens to be a Firefox desktop build, a Firefox mobile l10n repack, or a Firefox OS emulator test. This graph only happens to be the set of builds and tests that we want to spawn per-checkin. But it's not limited to that.


    dependency graph (cf.)

    Currently, when we detect a new checkin, we kick off new builds. When they successfully upload, they create new dependent jobs (tests), in a cascading waterfall scheduling method. This works, but is hard to predict, and it doesn't lend itself to backfilling of unscheduled jobs, or knowing when the entire set of builds and tests have finished.

Instead, if we create a graph of all builds and tests at the beginning, with dependencies marked, we get these nice properties (see the sketch after this list):

    • Scheduling changes can be made, debugged, and verified without actually needing to hook it up into a full system; the changes will be visible in the new graph.
    • It becomes much easier to answer the question of what we expect to run, when, and where.
    • If we initially mark certain jobs in the graph as inactive, we can backfill those jobs very easily, by later marking them as active.
    • We are able to create jobs that run at the end of full sets of builds and tests, to run analyses or cleanup tasks. Or "smoketest" jobs that run before any other tests are run, to make sure what we're testing is worth testing further. Or "breakpoint" jobs that pause the graph before proceeding, until someone or something marks that job as finished.
    • If the graph is viewable and editable, it becomes much easier to toggle specific jobs on or off, or requeue a job with or without changes. Perhaps in a web app.
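Here's a minimal sketch of what such a graph and its job states might look like; the schema is my own illustration, not LWR's actual format. Backfilling is just flipping an inactive job to pending, and a breakpoint job blocks everything downstream of it until something marks it finished.

graph = {
    "revision": "deadbeef1234",
    "jobs": {
        "linux64-build":       {"state": "pending",    "requires": []},
        "smoketest":           {"state": "pending",    "requires": ["linux64-build"]},       # gates the other tests
        "linux64-mochitest-1": {"state": "pending",    "requires": ["smoketest"]},
        "linux64-mochitest-2": {"state": "inactive",   "requires": ["smoketest"]},           # backfill: flip to "pending" later
        "human-signoff":       {"state": "breakpoint", "requires": ["linux64-mochitest-1"]}, # pauses the graph
        "post-graph-cleanup":  {"state": "pending",    "requires": ["human-signoff"]},
    },
}

def runnable_jobs(graph):
    """Jobs whose dependencies are all finished, skipping inactive/paused ones."""
    jobs = graph["jobs"]
    finished = set(name for name, job in jobs.items() if job.get("state") == "finished")
    return [name for name, job in jobs.items()
            if job["state"] == "pending" and set(job["requires"]) <= finished]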

    web app

    The dependency graph could potentially be edited, either before it's submitted, or as runtime changes to pending or re-queued jobs. Given a user-friendly web app that allows you to visualize the graph, and drill down into each job to modify it, we can make scheduling even more flexible.

• TryChooser could go from a checkin-comment-based set of flags to something viewable and editable before you submit the graph. Per-job toggles, certainly (just mochitest-3 on windows64 debug, please, but mochitest-2 through 4 on the other platforms).
    • If the repository + revision were settable fields in the web app, we could potentially get rid of the multi-headed Try repository altogether (point to a user repo and revision, and build from there).
    • Some project branches might not need per-checkin or nightly jobs at all, given a convenient way to trigger builds and tests against any revision at will.
    • Given the ability to specify where the job logic comes from (e.g., mozharness repo and revision), people working on the automation itself can test their changes before rolling them out, especially if there are ways to send the output of jobs (job status, artifact uploads, etc.) to an alternate location. This vastly reduces the need for a completely separate "staging" area that quickly falls out of date. Faster iteration on automation, faster turnaround.

    community job status

    One feature we lost with the Tinderbox EOL was the ability for any community member to contribute job status. We'd like to get it back. It's useful for people to be able to set up their own processes and have them show up in TBPL, or other status queries and dashboards.

    Given the scale we're targeting, it's not immediately clear that a community member's machine(s) would be able to make a dent in the pool. However, other configurations not supported by the compute farm would potentially have immense value: alternate toolchains. Alternate OSes. Alternate hardware, especially since the bulk of the compute farm will be virtual. Run your own build or test (or other job) and send your status to the appropriate API.

As for LWR dependency graphs potentially triggering community-run machines: if we had jobs that are useful in aggregate, like a SETI@home-style communal job, or intermittent-test-runner/crasher jobs, those could be candidates. Or if we wanted to be able to trigger a community alternate-configuration job from the web app. Either a pull-not-push model, or a messaging model where community members can set up listeners, could work here.

    Since we're talking massive scale, if the jobs in question are runnable on the compute farm, perhaps the best route would be contributing scripts to run. Releng-as-a-Service.


    Releng-as-a-Service

    Release Engineering is a bottleneck. I think Ted once said that everyone needs something from RelEng; that's quite possibly true. We've been trying to reverse this trend by empowering others to write or modify their own mozharness scripts: the A-team, :sfink, :gaye, :graydon have all been doing so. More bandwidth. Less bottleneck.

    We've already established that compute load on a small subset of servers doesn't work as well as moving it to the massively scalable compute farm. This video on leadership says the same thing, in terms of people: empowering the team makes for more brain power than bottlenecking the decisions and logic on one person. Similarly, empowering other teams to update their automation at their own pace will scale much better than funneling all of those tasks into a single team.

    We could potentially move towards a BYOS (bring your own script) model, since other teams know their workflow, their builds, their tests, their processes better than RelEng ever could. :catlee's been using the term Releng-as-a-Service for a while now. I think it would scale.

    I would want to allow for any arbitrary script to run on our compute farm (within the realms of operational-, security-, and fiscal- sanity, of course). Comparing talos performance numbers looking for regressions? Parsing logs for metrics? Trying to find patterns in support feedback? Have a whole new class of thing to automate? Run it on the compute farm. We'll help you get started. But first, we have to make it less expensive and complex to schedule arbitrary jobs.


    This is largely what we talked about, on a high level, both during our team week and over the years. A lot of this seems very blue sky. But we're so much closer to this as a reality than we were when I was first brainstorming about replacing buildbot, 4-5 years ago. We need to work towards this, in phases, while also keeping on top of the rest of our tasks.

    In part 1, I covered where we are currently, and what needs to change to scale up.
    In part 3, I'm going to go into some hand-wavy LWR specifics, including what we can roll out in phase 1.
    In part 4, I'm going to drill down into the dependency graph.
    Then I'm going to start writing some code.

    escapewindow: escape window (Default)

In my entire career, I have never seen Release Engineering scale anywhere near Mozilla's current numbers1. The number of machines is over an order of magnitude larger than the next largest system I've seen. Our compute time for a full set of builds and tests is an order of magnitude larger2; two orders of magnitude larger in terms of compute-hours-per-day3. No other company I've worked for has even attempted per-checkin builds and tests, due to the scale required; we just lived with developer finger-pointing and shouting matches after every broken build.

    It's clear to us that our infrastructure is a force multiplier; it's also clear that we need to improve the current state of things to scale an additional order of magnitude.


    Our current implementation of buildbot runs our automation, with issues like:

    • scaling issues:

      • hg polling runs on the masters, causing hangs when the set of changes to be parsed is extremely large (e.g., new pushlog, or when we re-enable an old scheduler);
      • log uploads, job status updates, etc. also happen on the masters, adding more load where we can least afford it;
      • as :catlee points out, we're holding huge dictionaries in-memory, which results in massive duplication of data like this;
      • buildbot needs a persistent connection with slaves. This is great for streaming logs, and poor for load and network robustness;
    • we trigger dependent jobs via sendchange after a build finishes, which prevents us from querying/acting on a set of builds+tests as a single entity.

    Our current configs describe our automation jobs, with issues like:

    • They are a RelEng timesink, which in turn makes us more of a bottleneck to the rest of the project:

      • It's more difficult than it should be to predict what the effect of changes will be without actually running them, which is time consuming;
      • it's near impossible to deal with oddball requests without a large amount of overhead.
    • The scheduling is inflexible, which increases costs in terms of human time, infrastructure time, and money:

      • our current scheduling doesn't allow for things like backfilling test or build jobs on previous commits, forcing us to run a larger set of jobs per-checkin than we would need to otherwise. This is inefficient in terms of compute time and money;
      • trying to find ways to jerry-rig alternate scheduling methods for jobs is time consuming. I see our current efforts as stop-gap solutions until we can roll out something more flexible by design.
    • Since it's difficult to get a full staging environment for RelEng4, we're faced with either landing risky patches with minimal testing (risking closing all trees for an indeterminate amount of time), or spending an inordinate amount of time setting up a rough staging environment that may or may not be giving accurate results, depending on whether you typoed something or missed a step.

    Work is already well under way to move logic out of buildbotcustom, into mozharness. This is the "how do we run a job" to the scheduling's "when" and "where". As the build and test logic becomes more independent of the scheduling, we gain flexibility as to how we schedule jobs.

    Our current implementation of buildbot cannot scale to the degree we need it to. An increase of an order of magnitude would mean tens of thousands of build+test slaves. One million jobs a day. That scale will help the project to develop faster, test faster and more thoroughly, and release better products that are simultaneously more stable and feature-filled. If our infrastructure is a force multiplier, applying a multiplier to the multiplier should result in massive change for good.

If we also make our configs cleaner, we can be smarter about what we schedule, and when. A 10x increase in capacity would then buy us even more headroom than it would otherwise. Discussions about what we run, and how often, then become more about business value weighed against infrastructure- and human- time costs, rather than about infrastructure limits.

    We've been talking about this for years, now, but product 1.0's and other external time pressures have kept it on the back burner. With no 1.0's on the horizon and the ability to measure the cost of things, hopefully we will finally be able to prioritize work on scheduling.

    In part 2, I'm going to discuss a high-level overview of our plans and ideas for LWR, our next-gen scheduling system.
    In part 3, I'm going to drill down into some hand-wavy LWR specifics, including what we can roll out in phase 1, which is what we were discussing at length last Tuesday. I didn't think I could dive into those specifics without giving some background context first.



    1 :joduinn has seen scale like this, but I think Mozilla has surpassed those numbers.

    2 We have a much larger matrix of (num_build_platforms · num_build_types) + (num_test_platforms · num_test_types) than any other project I've been a part of.

    3 The workflows I've seen elsewhere have included

    1. on-demand (only build when someone pushes a button),
    2. nightly- or periodic- only,
    3. the tinderbox model where each build restarts after finishing, or
    4. a combination of the above.

    With a smaller number of builds and tests per set, and a much less frequent rate of running those sets of builds and tests, the total number of compute hours per-day is significantly lower.

    4 By "full staging environment", here, I'm not just including a single standalone buildbot master and a single buildbot slave. Depending on what we need to test, this can include a staging instance of self-serve, buildapi, statusdb, clobberer, slavealloc, tbpl, ftp.m.o, graphserver, hg repos (sometimes read-only, but sometimes read-write, e.g. staging releases which tag the repos), sometimes git repos, downstream test master + test slaves, etc. etc., and whatever staging systems we set up in this environment need to communicate with each other and not pollute production systems.

    escapewindow: escape window (Default)
    Mozilla RelEng

    Massimo's video:


    full set, with aquarium photos.

    escapewindow: escape window (Default)
    New Orleans

    water from the balcony gallery



    New Orleans

    by the dawn's early light


    set 2
    escapewindow: escape window (Default)
    New Orleans

Caught sight of this as the sun was peeking over the horizon... the first rays of light, plus the fact that the yellow-tinted lights were still on, turned the (white) fleur de lis yellow.

    set 1
    escapewindow: escape window (Default)

    (you missed these, a tiny bit, didn't you.)

    As I've mentioned here and here, non-fastforward commits are problematic for downstream partners, and are to be strictly avoided. And as I mentioned here, we had identified that our original merge day mechanics introduced non-fastforward commits to our Repositories of Record on a periodic basis.

    I was under the impression that this was a Solved Problem, but I noticed a non-fastforward commit in mozilla-release after the mozilla-beta -> mozilla-release merge for Firefox 25.0 (thanks to the vcs-sync emails I'm getting now). I hadn't updated the instructions on the Merge Documentation wiki page, and the merge scripts followed that document.

    The scripts are now available in this github repo. One pull request later, plus :lsblakk's fixes from the long Firefox 26 merge day, and we now use hg debugsetparents to "merge" in old heads in a fast-forwardable way. One more step towards our goal.
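For the curious, the trick looks roughly like the sketch below: keep the content of the new tip, but record the old head as a second parent, so the subsequent push is fast-forward. The paths and revisions are placeholders; the real logic lives in the merge scripts linked above.

import subprocess

def ff_merge_old_head(repo_path, new_tip, old_head):
    """Create a merge commit whose content is identical to new_tip, with
    old_head as a second parent, so the next push is fast-forward."""
    def hg(*args):
        subprocess.check_call(["hg", "-R", repo_path] + list(args))

    hg("update", "--clean", "--rev", new_tip)   # working dir == the content we want to keep
    hg("debugsetparents", new_tip, old_head)    # record the old head as parent #2
    hg("commit", "-m", "Merge old head via |hg debugsetparents %s %s|" % (new_tip, old_head))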

    escapewindow: escape window (Default)

    (Resuming the shorter blog posts about features in the new vcs sync process...)

    After I enabled beagle conversions in cron, I started getting some intermittent non-fastforward emails. When I poked around, I noticed that some changes had landed on GECKO90_2011121217_RELBRANCH on mozilla-release, but not mozilla-beta. When syncing mozilla-release, the tip of this branch stayed in place; when syncing mozilla-beta, the tip of this branch was behind several commits, resulting in a non-fastforward push. We'll continue to see problems like this, if we sync branches with shared names, across mercurial repos, which contain different histories, but we're good for now.

    I was able to solve this by transplanting some changesets, and I recorded my actions for this and one other branch here.

    (This is more a blog post about manual investigation+actions taken after automated emails highlighted the issue, rather than a feature of the new vcs sync process, but the emails themselves were key...)

:bhearsum filed bug 924024 (kill relbranches), which should help reduce the frequency of this happening.

    escapewindow: escape window (Default)

    (Interrupting the shorter blog posts about features in the new vcs sync process, to talk a bit about discussions I've been having...)

    We have three official git repos now:

    • releases/gecko.git, which is partner-oriented, and our highest priority to keep sane (currently lives on git.m.o only),
• integration/gecko-dev, which is developer-oriented and which we want to offer a strong SLA for. It contains all release and inbound branches (currently lives on both github and git.m.o), and
• integration/gecko-projects, which contains the mercurial project repos that lack strict pre-commit hooks and are periodically reset; RelEng reserves the right to reset this git repo if those resets cause vcs-sync issues. (Currently on github only, due to the ease of resetting branches (or the entire repo).)

    All three share the same SHAs for shared commits.


    As for this recent, valid question: "Wow, that's a lot of complexity. Not worth just making a clean once-off break of renaming branches?" (In regards to how we name branches differently in gecko-dev and gecko.git.)

    I originally wanted to name the branches after the trains, so gecko.git would be largely hands-off. If they want 1.2 code now, it's in aurora; if they want it after merge day, they'll need to pull from b2g26_v1_2. Easy enough. However, merge day would require careful communication and process timing across languages, companies, and timezones. All work is easier when it's someone else's, but on an objective level, communication and logistics can be more difficult than a technical solution. So now we point the v1.2 branch at trunk, then aurora, then b2g26_v1_2, and our partners can keep pointing at v1.2. And my vcs-sync configs are a little complex as a result of those branch names; a decent trade-off.

    (So that's the why of the naming of things. I'm impressed you've read your way through nine of these (I assume). Insomnia?)

    escapewindow: escape window (Default)

    (Interrupting the shorter blog posts about features in the new vcs sync process, to talk a bit about discussions I've been having...)

    In Firefox desktop browser and Firefox Mobile, the shipped product is a binary that Release Engineering produces. The process of creating these binaries, as well as the venues for distributing them — the Application Update Service (AUS), the ftp server+website, and various Android marketplaces — loom large in our world. While it is also important to support our developers and make it easier for them to write and maintain the products we ship, if we're forced to prioritize between the two, we know the answer.

    Shipping product has been, is, and will continue to be Release Engineering's highest priority.

    In B2G land, we don't ship a binary; partners build the binaries and ship them on devices. However, their binaries are based off of code that we deliver to them. And the vehicle for that code delivery is gecko.git; gecko.git is effectively our shipping product.

    So if we're in a conversation or heated debate, and someone says the equivalent of "I don't care about partners" or "I don't care about gecko.git", I often hear that as "I don't care about Mozilla's capability to ship Firefox OS". It's not conscious, and I'm trying to get better about that. Most likely that isn't what the other party is intending to say. But if a conversation starts that way, most likely the conversation will end up running face-first into The First Law of Release Engineering.

    escapewindow: escape window (Default)

    (This is one of several shorter blog posts about features in the new vcs sync process.)

    At first blush, pushing all branches and tags from a converted mercurial repo to its git mirror seems like a logical choice; no renaming, no filtering. Except, of course, default -> master. Easy enough: a blanket conversion, with one exception.

    However, as noted earlier, we move tags in hg land. Mercurial makes it easy to do, and moving tags is currently part of our official release process. Git, on the other hand, makes it impossible to move tags invisibly; to pick up a moved tag, every downstream user would need to explicitly delete and re-fetch the tag. Otherwise, their tag will continue to point at the old revision. Given the choice between no tags and tags pointing at the wrong revision, I much prefer no tags. We do have a small subset of tags in our partner-oriented gecko.git, though, so in my mind we needed a tag whitelist. With wildcards/regexes, so we wouldn't have to keep updating a static list.

Branches could have remained a bit simpler, but I had several types of git repo to support. For example, mozilla-b2g18: in gecko.git the default branch is gecko-18. In github/mozilla/mozilla-central, it's b2g18. Neither of those repositories has other, non-default branches from mozilla-b2g18. However, with the lack of {FIREFOX,FENNEC}.*{BUILDn,RELEASE} tags on the git side, I wanted to at least support the GECKO_.*RELBRANCH branches from mozilla-beta and mozilla-release. So not only do we need a whitelist here, but also a map, to say that mozilla-aurora is aurora in gecko-dev, but v1.2 in gecko.git. (We were even considering supporting the standalone git repos; mozilla-aurora:default would have been master in releases-mozilla-aurora. We're going to nuke those in bug 847643, however.)

    In addition to the above, we want to strictly avoid noise (e.g., unwanted branches+tags, or branches+tags with the wrong names) in the partner-oriented gecko.git. I can control this by strictly limiting on push. However, a single layer of safety like that feels a bit dicey; a loose regex or a bug can push the wrong thing. Also, I'm not currently keeping track of which branches+tags I convert for each mercurial repo, so a loose regex for one mercurial repo's conversion whitelist, coupled with a loose regex for another mercurial repo's push whitelist, can result in pushing the former mercurial repo's unwanted branches+tags during the latter mercurial repo's push. It seems easier to not convert unwanted branches+tags in the first place. Because of this, I'm restricting what we convert at all via strict whitelists, and then again restricting at the push-per-target level, for an additional level of safety.

    (Does that make sense? I almost feel like drawing a picture.)
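In lieu of a picture, here's a heavily simplified sketch of the shape of the thing: a strict conversion whitelist per source repo (branch map plus tag regexes), restricted again per push target. The structure and names below are illustrative, not the real config files.

# Per-source conversion whitelists, restricted again per push target.
conversion_config = {
    "mozilla-aurora": {
        "branch_map": {
            "default": {
                "gecko-dev": "aurora",   # developer-facing name
                "gecko.git": "v1.2",     # partner-facing train name
            },
        },
        "tag_regexes": [r"^B2G_1_2_.*"],     # only whitelisted tags get converted at all
        "push_targets": {
            "gecko-dev": {"branch_regexes": [r"^aurora$"]},
            "gecko.git": {"branch_regexes": [r"^v1\.2$"],
                          "tag_regexes": [r"^B2G_1_2_.*"]},   # second layer of safety
        },
    },
    "mozilla-beta": {
        "branch_map": {"default": {"gecko-dev": "beta"}},
        "branch_regexes": [r"^GECKO_.*RELBRANCH$"],   # keep relbranches, skip the rest
        "push_targets": {
            "gecko-dev": {"branch_regexes": [r"^beta$", r"^GECKO_.*RELBRANCH$"]},
        },
    },
}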

    When I look at the config files, they don't seem very elegant, but given the complexity we're trying to encapsulate, I think they're a pretty decent first draft.

    escapewindow: escape window (Default)

    (This is one of several shorter blog posts about features in the new vcs sync process.)

    We've seen instances of hg corruption before, like

    abort: data/dom/network/interfaces/nsIDOMNetworkStats.idl.i@50a5a9362723: unknown parent!
    abort: data/mobile/android/base/DoorHangerPopup.java.i@62e6137d125c: no match found!
    abort: data/b2g/config/gaia.json.i@327bccdb70e8: no node!
    abort: unknown revision '177bf37a49f5'!
    

    People have theorized that at least one of these may be caused by a race condition while pulling from an http head during an ssh push (edit: bug 737865); others seem a bit more random. We deal with this in our TBPL build/test CI by detecting these types of failures, and nuking/recloning the repo.

We also see these in the legacy vcs-sync process. With a single, non-cvs-prepended, mozilla-central-based repo, recovering from hg corruption in the working conversion directory is a manual process that can take multiple hours. I saw this as a non-starter for a repo like beagle or gecko.git, where rebuilding the entire thing from scratch can take over a week.

    As I mentioned here, the new process has an intermediate stage_source clone:

    hg.mozilla.org -> stage_source clone -> conversion_dir

    When we detect corruption in the stage_source clone, we don't have to worry very much; just clobber and reclone. The time to recreate a fresh clone of a single mercurial repo is a matter of a few minutes, not multiple hours. This approach uses more disk, but helps prevent long downtimes.

    Previously I had been running hg verify in each stage_source clone before proceeding, but that slowed each pull/convert/push cycle down by ~5 minutes per source mercurial repo (and doesn't always fix the problem), making it a non-viable option.
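A sketch of the detect-and-reclone approach, assuming we scan hg's output for the known corruption signatures rather than paying the hg verify tax on every cycle; the paths and the signature list are illustrative.

import shutil
import subprocess

# Known hg corruption signatures (from failures like the ones quoted above).
CORRUPTION_SIGNATURES = (
    "unknown parent!",
    "no match found!",
    "no node!",
    "abort: unknown revision",
)

def pull_stage_source(source_url, stage_dir):
    """Pull into the stage_source clone; on known corruption, clobber and reclone."""
    proc = subprocess.Popen(["hg", "-R", stage_dir, "pull", source_url],
                            stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
    output = proc.communicate()[0].decode("utf-8", "replace")
    if proc.returncode:
        if any(sig in output for sig in CORRUPTION_SIGNATURES):
            shutil.rmtree(stage_dir)                              # cheap: minutes, not hours
            subprocess.check_call(["hg", "clone", source_url, stage_dir])
        else:
            raise RuntimeError("hg pull failed:\n%s" % output)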
