escapewindow: escape window (Default)

[stating the problem]

Mozharness currently handles a lot of complexity. (It was designed to be able to, but the ideal is still elegantly simple scripts and configs.)

Our production-oriented scripts take (and sometimes expect) config inputs from multiple locations, some of them dynamic; and they contain infrastructure-oriented behavior like clobberer, mock, and tooltool, which don't apply to standalone users.

We want mozharness to be able to handle the complexity of our infrastructure, but make it elegantly simple for the standalone user. These are currently conflicting goals, and automating jobs in infrastructure often wins out over making the scripts user friendly. We've brainstormed some ideas on how to fix this, but first, some more details:

[complex configs]

A lot of the current complexity involves config inputs from many places:

We want to lock the running config at the beginning of the script run, but we also don't want to have to clone a repo or make external calls to web resources during __init__(). Our current solution has been to populate runtime configs during one of our script actions, but then to support those runtime configs we have to check multiple config locations for our script logic. (self.buildbot_config, self.test_config, self.config, ...)

We're able to handle this complexity in mozharness, and we end up with a single config dict that we then dump to the log + to a json file on disk, which can then be reused to replicate that job's config. However, this has a negative effect on humans who need to either change something in the running configs, or who want to simplify the config to work locally.

[in-tree vs out-of-tree]

We also want some of mozharness' config and logic to ride the trains, but other portions need to be able to handle outside-of-tree processes and config, for various reasons:

  • some processes are volatile enough that they need to change across the board across all trees on a frequent basis;
  • some processes act across multiple trees and revisions, like the bumper scripts and vcs-sync;
  • some infrastructure-oriented code needs to be able to change across all processes, including historical-revision-based processes; and
  • some processes have nothing to do with the gecko tree at all.

[brainstorming solutions]

Part of the solution is to move logic out of mozharness. Desktop Firefox builds and repacks moving to mach makes sense, since they're

  1. configurable by separate mozconfigs,
  2. tasks completely shared by developers, and
  3. completely dependent on the tree, so tying them to the tree has no additional downside.

However, Andrew Halberstadt wanted to write the in-tree test harnesses in mozharness, and have mach call the mozharness scripts. This broke some of the above assumptions, until we started thinking along the lines of splitting mozharness: a portion in-tree running the test harnesses, and a portion out-of-tree doing the pre-test-run machine setup.

(I'm leaning towards both splitting mozharness and using helper objects, but am open to other brainstorms at this point...)

[splitting mozharness]

In effect, the wrapper, out-of-tree portion of mozharness would be taking all of the complex inputs, simplifying them for the in-tree portion, and setting up the environment (mock, tooltool, downloads+installs, etc.); the in-tree portion would take a relatively simple config and run the tests.

We could do this by having one mozharness script call another. We'd have to fix the logging bug that causes us to double-log lines when we instantiate a second BaseScript, but that's not an insurmountable problem. We could also try execing the second script, though I'd want to verify how that works on Windows. We could also modify our buildbot ScriptFactory to be able to call two scripts consecutively, after the first script dynamically generates the simplified config for the second script.

We could land the portions of mozharness needed to run test harnesses in-tree, and leave the others out-of-tree. There will be some duplication, especially in the mozharness.base code, but that's changing less than the scripts and mozharness.mozilla modules.

We would be able to present a user-friendly "inner" script with limited inputs that rides the trains, while also allowing for complex inputs and automation-oriented setup beforehand in the "outer" script. We'd most likely still have to allow for automation support in the inner script, if there's some reporting or error checking or other automation task that's needed after the handoff, but we'd still be able to limit the complexity of that inner script. And we could wrap that inner script in a mach command for easy developer use.

[helper objects]

Currently, most of mozharness' logic is encapsulated in self. We do have helper objects: the BaseConfig and the ReadOnlyDict self.config for config; the MultiFileLogger self.log_obj that handles all logging; MercurialVCS for cloning, ADBDeviceHandler and SUTDeviceHandler for mobile device wrangling. But a lot of what we do is handled by mixins inherited by self.

A while back I filed a bug to create a LocalLogger and BaseHelper to enable parallelization in mozharness scripts. Instead of cloning 90 locale repos serially, we could create 10 helper objects that each clone a repo in parallel, and launch new ones as the previous ones finish. This would have simplified Armen's parallel emulator testing code. But even if we're not planning on running parallel processes, creating a helper object allows us to simplify the config and logic in that object, similar to the "inner" script if we split mozharness into in-tree and out-of-tree instances, which could potentially also be instantiated by other non-mozharness scripts.

Essentially, as long as the object has a self.log_obj, it will use that for logging. The LocalLogger would log to memory or disk, outside of the main script log, to avoid parallel log interleaving; we would use this if we were going to run the helper objects in parallel. If we wanted the helper object to stream to the main log, we could set its log_obj to our self.log_obj. Similarly with its config. We could set its config to our self.config, or limit what config we pass to simplify.

(Mozharness' config locking is a feature that promotes easier debugging and predictability, but in practice we often find ourselves trying to get around it somehow. Other config dicts, self.variables, editing self.config in _pre_config_lock() ... Creating helper objects lets us create dynamic config at runtime without violating this central principle, as long as it's logged properly.)

Because this "helper object" solution overlaps considerably with the "splitting mozharness" solution, we could use a combination of the two to great efficacy.

[functions and globals]

This idea completely alters our implementation of mozharness, by moving self.config to a global config, directly calling logging methods (or wrapped logging methods). By making each method a standalone function that's only slightly different from a standard python function, it lowers the bar for contribution or re-use of mozharness code. It does away with both the downsides and benefits of objects.

The first, large downside I see is this solution appears incompatible with the "helper objects" solution. By relying on a global config and logging in our functions, it's difficult to create standalone helpers that use minimized configs or alternate logging configurations. I also think the global logging may make the double-logging bug more prevalent.

It's quite possible I'm downplaying the benefit of importing individual functions like a standard python script. There are decorators to transform functions into class methods and vice versa, which might allow for both standalone functions and object-based methods with the same code.

[related links]

  • Jordan Lund has some ideas + wip patches linked from bug 753547 comment 6.
  • Andrew Halberstadt's Sharing code not always a good thing and How to deal with IFFY requirements
  • My mozharness core principles example scripts+configs and video
  • Lars Lohn's Crouching Argparse Hidden Configman. Afaict configman appears to solve similar problems to mozharness' BaseConfig, but Argparse requires python 2.7 and mozharness locks the config.
  • escapewindow: escape window (Default)
    [10:57] <catlee>    so one thing - I think it may be premature to look at db models before looking at APIs
    

    I think I agree with that, especially since it looks like LWR may end up looking very different than my initial views of it.

    However, as I noted in part 3's preface, I'm going to go ahead and get my thoughts down, with the caveat that this will probably change significantly. Hopefully this will be useful in the high-level sense, at least.


    jobs and graphs db

    At the bare minimum, this will need graphs and jobs. But right now I'm thinking we may want the following tables (or collections, depending what db solution we end up choosing):

    graph sets

    Graph sets would tie a set of graphs together. If we wanted to, say, launch b2g builds as a separate graph from the firefox mobile and firefox desktop graphs, we could still tie them together as a graph set. Alternately, we could create a big graph that represents everything. The benefit of keeping the separate graphs is it's easier to reference and retrigger jobs of a type: retrigger all the b2g jobs, retrigger all the Firefox desktop win32 PGO talos runs.

    graph templates

    I'm still in the brainstorm mode for the templates. Having templates for graphs and jobs would allow us to generate graphs from the web app, allowing for a faster UI without having to farm out a job to the graph generation pool to clone a repo and generate a graph. These would have templatized bits like branch name that we can fill out to create a graph for a specific branch. It would also be one way we could keep track of changes in customized graphs or jobs (by diffing the template against the actual job or graph).

    graphs

    This would be the actual dependency graphs we use for scheduling, built from the templates or submitted from external requests.

    job templates

    This would work similar to the graph templates: how jobs would look, if you filled in the branch name and such, that we can easily work with in the webapp to create jobs.

    jobs

    These would have the definitions of jobs for specific graphs, but not the actual job run information -- that would go in the "job runs" table.

    job runs

    I thought of "job runs" as separate from "jobs" because I was trying to resolve how we deal with retriggers and retries. Do we embed more and more information inside the job dictionary? What happens if we want to run a new job Y just like completed job X, but as its own entity? Do we know how to scrub the previous history and status of job X, while still keeping the definition the same? (This is what I meant by "volatile" information in jobs). The "job run" table solves this by keeping the "jobs" table all about definitions, and the "job runs" table has the actual runtime history and status. I'm not sure if I'm creating too many tables or just enough here, right now.

    If you're wondering what you might run, you might care about the graph sets, and {graph,job} templates tables. If you're wondering what has run, you would look at the graphs and job runs tables. If you're wondering if a specific job or graph were customized, you would compare the graph or job against the appropriate template. And if you're looking at retriggering stuff, you would be cloning bits of the graphs and jobs tables.

    (I think Catlee has the job runs in a separate db, as the job queue, to differentiate pending jobs from jobs that are blocked by dependencies. I think they're roughly equivalent in concept, and his model would allow for acting on those pending jobs faster.)

    I think the main thing here is not the schema, but my concerns about retries and retriggers, keeping track of customization of jobs+graphs via the webapp, and reducing the turnaround time between creating a graph in LWR and showing it on the webapp. Not all of these things will be as important for all projects, but since we plan on supporting a superset of gecko's needs with LWR, we'll need to support gecko's workflow at some point.


    dependency graph repo

    This repo isn't a mandatory requirement; I see it as a piece that could speed up and [hopefully] streamline the workflow of LWR. It could allow us to:

    • refer to a set of job+graph definitions by a single SHA. That would make it easier to tie graphs- and jobs- templates to the source.
    • easily diff, say, mozilla-inbound jobs against mozilla-central. You can do that if the jobs and graphs definitions live in-tree for each, but it's easier to tell if one revision differs from another revision than if one directory tree in a revision differs from that directory tree in another revision. Even in the same repo: it's hard to tell if m-c revision 2's job and graphs have changed from m-c revision 1, without diffing. The job-and-graph repo would [generally] only have a new revision if something has changed.
    • pre-populate sets of jobs and graph templates in the db that people can use without re-generating them.

    There's no requirement that the jobs+graph definitions live in this repo, but having it would make the webapp graph+job creation a lot easier.

    We could create a branch definitions file in-tree (though it could live elsewhere; its location would be defined in LWR's config). The branch definitions could have trychooser-like options to turn parts of the graphs on or off: PGO, nightlies, l10n, etc. These would be referenced by the job and graph definitions: "enable this job if nightlies are enabled in the branch definitions", etc. So in the branch definitions, we could either point to this dependency graph repo+revision, or at a local directory, for the job+graph definitions. In the gecko model, if someone adds a new job type, that would result in a new jobs+graphs repo revision. A patch to point the branch-definitions file at the new revision would land in Try, first, then one of the inbound branches. Then it would get merged to m-c and then out to the other inbound and project branches; then it would ride the trains.

    (Of course, in the above model, there's the issue of inbound having different branch definitions than mozilla-central, which is why I was suggesting we have overrides by branch name. I'm not sure of a better solution at the moment.)

    The other side of this model: when someone pushes a new jobs+graphs revision, that triggers a "generate new jobs+graphs templates" job. That would enter new jobs+graphs for the new SHA. Then, if you wanted to create a graph or job based on that SHA, the webapp doesn't have to re-build those from scratch; it has them in the db, pre-populated.


    chunks and customizations

    • For chunked jobs (most test jobs, l10n), we were brainstorming about having dynamic numbers of chunks. If only a single machine is free, it would grab the first chunk, finish, then ask LWR if there are more chunks to run. If only a handful of machines are available, LWR could lean towards starting min_chunks jobs. If there are many machines idle, we can trigger max_chunks jobs in parallel. I'm not sure how plausible this is, but this could help if it doesn't add too much complexity.
    • I think that while users should be able to adjust graph- or job-priority, we should have max priorities set per-branch or per-user, so we can keep chemspill release builds at the highest priority.

      Similarly, I think we should allow customization of jobs via the webapp on certain branches, but

      1. we should mark them as customized (by, say, a flag, plus the diff between the template and the job), and
      2. we need to prevent customizing, say, nightly builds that get sent to users, or release builds.

      This causes interesting problems when we want to clone a job or graph: do we clone a template, or the customized job or graph that contains volatile information? (I worked around this in my head by creating the schema above.)

      Signed graphs and jobs, per-branch restrictions, or separate LWR clusters with different ACLs, could all factor into limiting what people can customize.

    retries and retriggers

    For retries, we need to track max [auto] retries, as well as job statuses per run. I played with this in my head: do we keep track of runs separately from job definitions? Or clone the jobs, in which case volatile status and customizing jobs become more of an issue? Keeping the job runs separate, and specifying whether they were an auto-retry or a user-generated retrigger, could help in preventing going beyond max-auto-retries.

    Retriggers themselves are somewhat problematic if we mark jobs as skipped due to dependencies: if a job that was unsuccessful is retriggered and exits successfully, do we revisit previously skipped jobs? Or do we clone the graph when we retrigger, and keep all downstream jobs pending-blocked-by-dependencies? Does this graph mark the previous graph as a parent graph, or do we mark it as part of the same graph set?

    (This is less of a problem currently, since build jobs run sendchanges in that cascading-waterfall type scheduling; any time you retrigger a build, it will retrigger all downstream tests, which is useful if that's what you want. If you don't want the tests, you either waste test machine time or human time in cancelling the tests. Explicitly stating what we want in the graph is a better model imo, but forces us to be more explicit when we retrigger a job and want the downstream jobs.)

    Another possibility: I thought that instead of marking downstream jobs as skipped-due-to-dependencies, we could leave them pending-blocked-by-dependencies until they either see a successful run from their upstream dependencies, or hit their TTL (request timeout). This would remove some complexity in retriggering, but would leave a lot of pending jobs hanging around that could slow down the graph processing pool and skew our 15 minute wait time SLA metrics.

    I don't think I have the perfect answers to these questions; a lot of my conclusions are based on the scope of the problem that I'm holding in my head. I'm certain that some solutions are better for some workflows and others are better for others. I think, for the moment, I came to the soft conclusion of a hand-wavy retriggering-portions-of-the-graph-via-webapp (or web api call, which does the equivalent).

    A random semi-tangential thought: whichever method of pending- or skipped- solution we generate will probably significantly affect our 15 minute wait time SLA metrics, anyway; they may also provide more useful metrics like end-to-end-times. After we move to this model, we may want to revisit our metric of record.


    lwr_runner.py

    (This doesn't really have anything to do with the db, but I had this thought recently, and didn't want to save it for a part 6.)

    While discussing this with the Gaia, A-Team, and perf teams, it became clear that we may need a solution for other projects that want to run simple jobs that aren't ported to mozharness. Catlee was thinking an equivalent process to the buildbot buildslave process: this would handle logging, uploads, status notifications, etc., without requiring a constant connection to the master. Previously I had worked with daemons like this that spawned jobs on the build farm, but then moved to launching scripts via sshd as a lower-maintenance solution.

    The downsides of no daemon include having to solve the mach context problem on macs, having to deal with sshd on windows, and needing a remote logging solution in the scripts. The upsides include being able to add machines to the pool without requiring the daemon, avoiding platform-specific issues with writing and maintaining the daemon(s), and script changes are faster to roll out (and more granular) than upgrading a daemon across an entire pool.

    if we create a mozharness/scripts/lwr_runner.py that takes a set of commands to run against a repo+revision (or set of repos+revisions), with pre-defined metrics for success and failure, then simpler processes don't need their own mozharness script; we could wrap the commands in lwr_runner.py. And we'd get all the logging, error parsing, and retry logic already defined in mozharness.

    I don't think we should rule out either approach just yet. With the lwr_runner.py idea, both approaches seem viable at the moment.



    In part 1, I covered where we are currently, and what needs to change to scale up.
    In part 2, I covered a high level overview of LWR.
    In part 3, I covered some hand-wavy LWR specifics, including what we can roll out in phase 1.
    In part 4, I drilled down into the dependency graph.
    We met with the A-team about this, a couple times, and are planning on working on LWR together!
    Now I'm going to take care of some vcs-sync tasks, prep for this next meeting, and start writing some code.

    escapewindow: escape window (Default)

    preface

    Since I wrote my previous blog post on LWR, I've found/been sent/been reminded of a few links:

    • Automating away operations in production deployments:
    • Other job scheduling systems, which I plan on digging into later. These may or may not work for us:
      • From comments in my previous blog post, JobScheduler, which may or may not scale to the degree we would need it to.
      • And this article that talks about Google Borg and Apache Mesos based solutions, which definitely can scale. It's not immediately clear to me if the compute cluster model lends itself to a heterogeneous compute farm where it's mandatory we be able to target specific nodes for specific tasks (as opposed to finding free resources of any type). It's also unclear at first blush whether they would lend themselves to Windows, OSX, ARM device, or other hardware-based nodes, or are explicitly linux/cloud specific.

    I'm going to keep writing this third LWR blog post as planned, since I think we failed to explain things clearly to our newer team members during the team week. Also, recording our recent brainstorming may help us decide if these other scheduling systems would work for us, and help guide any customizations we may want to make, should we choose to use them.


    whiteboard schematics


    the drawing is either really important, or just a brainstorm first draft. i'm not sure which yet.

    event api
    This would allow specific events to trigger pre-defined behavior. New push, new graph, job start, job finish, timer.
    lwr config
    This would specify the above pre-defined behavior, as well as some timers for nightly/periodic/scheduled tasks.
    graph api
    This would allow for reading/inserting/updating dependency graphs into the system.
    event processor
    This would trigger graph generation and graph processing jobs during phase 1.
    web app
    As described here; most likely only a subset of those features would exist in phase 1.
    graphs db
    This would hold the graphs and jobs. (We might want a different name than 'graphs' for this db, to avoid confusion with graphserver.)
    graph generation pool
    This would generate the graphs using pre-defined defaults. We'd like the defaults for the build/test graphs to live in-tree. We would potentially need TryChooser, per-product (only kick off certain builds depending on which files have been changed), and other customization logic here.
    graph processing pool
    This would read the graphs and determine the next steps. Dependency status checking, triggering next jobs when appropriate, marking the rest of the graph as skipped/timed out/etc if needed.

    We'd like to keep the server-side as dumb as possible. The fewer changes that need to be made there, the more stable it will be. By moving logic and configs off the server, we can make more complex and granular changes without touching the servers. We've already seen the effects of keeping all the logic, all the configs, all the scheduling on the buildbot masters, and we want the opposite.

    We'd like a small server to be runnable on a single machine, so we can test and debug changes on our laptops. Versioned- or backwards-compatible APIs may allow us to upgrade half a production cluster, bring that live, then upgrade the second half. If we're easily able to spin up a new standalone cluster, we can easily support different workflows/audiences like staging-vs-production, standalone project branch "pods", or experimental small project support.


    slaveapi, mozpool, and network logging

    Currently, buildbot has a list of buildslaves per builder, and keeps track of which ones are currently connected to the buildbot master. :catlee then tweaks the nextSlave logic to prefer faster buildslaves over slower, or spot instances over reserved instances (or to use reserved instances if the same job was run on an interrupted spot instance), or the last buildslave that successfully ran that particular job (to improve depend build times). Buildbot doesn't have any concept of how healthy the buildslave is, or how to maintain the buildslaves, and requires that we make any pooling or nextSlave decisions ahead of time and load them into the running buildbot masters via a reconfig. Plus, the master<->buildslave communication requires an uninterrupted network connection, which gives us streaming logs, but adds network fragility.

    SlaveAPI is designed to handle some of the above issues: determine the health of a slave, reboot a slave, or mark it as disabled. In the future it may allow us to spin up and spin down AWS instances, and reimage hardware slaves.

    MozPool is currently limited to Android Pandas, and allows for health checks on Pandas, rebooting Pandas, and IIRC reimaging Pandas. A job that requires a Panda would be able to request [a healthy] one from the pool, run the job, and return it to the pool or mark it as bad.

    With a combination of the two, LWR could request a node with certain properties (with tags, maybe?): any machine; any linux/osx machine; the fastest linux machine available with build tools; or specifically by hostname. If LWR also passed the history and details of the job along with the request, the SlaveAPI/MozPool analog could make decisions like spot instance or reserved instance; fast or slow; most recently successfully built that tree so a depend build would be fast. And we might be able to keep that complexity out of LWR.

    We'd like to be able to spawn the job and detach, to remove the need for an uninterrupted network connection. For status, it might be nice to be able to tail the log on demand, and/or add network logging support to mozharness (via a MultiNetworkLogger class, perhaps). This would probably require a network log cluster of some sort. Someone else suggested that we be able to toggle the network logging on at will, but otherwise default to off, to reduce network traffic. I'm not entirely sure how to do this, but given a trigger we could replay the log from disk over the network, and continue to stream the log as it came in, once the network logging had caught up.

    We could also take this opportunity to move away from the buildbot master/slave terminology, to... perhaps server/node? farmer/cow? :) Technically this wouldn't matter at all. Semantically it does.


    artifact manifests

    Currently, we upload various things from build jobs: installers, crashsymbols, test zips. The build jobs then guess which binary is the installer and sendchange the installer and test zip urls, triggering tests. The buildbot master then uploads the build logs, sometimes to a completely different directory than the installer, which can cause issues with TBPL and other downstream consumers. The test jobs take the installer and test zip urls from the sendchange, and use those to download and install the binary, extract the tests, and run them. Sometimes they need other files: crashsymbols, the robocop apk, so we apply a regex to the installer url to guess the other urls, causing all sorts of fun when this doesn't work. In a similar vein, we download previous MARs to generate partial updates. However, the mar files contain version numbers, causing more fun when we try to guess the filename after a version bump.

    Call me a buzzkill, but I'd like to eliminate all this fun guesswork in favor of a boring and predictable solution. I'd like an artifact manifest. A structured artifact manifest, with versioned manifest formats, so we know how to read them. And while I think it's ok to have a section of the manifest for dumping random blobs of information, if portions of those become generally useful, we should probably put those in the structured area in the next version of the manifest format.

    The manifest would definitely contain naming information for the various artifacts uploaded, as well as what they are. If mozharness jobs uploaded their own logs, they would more predictably live with the other artifacts, and be specified in the manifest. We could also include job status and uid and other such information. Dependent jobs could then act on all of that information without guessing, given only the location of the manifest. This also reduces the amount of information that LWR has to transfer between jobs... and may satisfy :sfink's request for more structured schema for downstream jobs.


    phase one

    We can't write and roll all of this out at once. Besides the magnitude of work represented by this handful of blog posts, we also have existing dependencies on buildbot that we don't yet have replacements for. For phase one, we're picturing the graph processing pool sending jobs into buildbot, probably via the db.

    First we should build the dependency graph for our existing build and test jobs. If I were to tackle one piece first, this would be it, because it's a single script with no infrastructure dependencies. It's easy to verify afterwards, by comparing the output to our existing TBPL runs. Normalizing the builder names would help here.

    Then we could feed that graph into self serve, potentially allowing us to more easily trigger individual builds (we currently use regexes, iirc). Tests and repacks may be trickier, since they expect additional information to come via sendchange and buildbot properties, but that's a start.

    Next we could start writing the server pieces -- trigger polling, graph generation, iterate through the graph. Any web app work could start here. This isn't strictly blocked by the self-serve implementation, so if more people chipped in we could work on those in parallel.

    We could then start feeding the jobs from the graph into buildbot, and disable the respective buildbot polling and scheduling.

    Once we got this far, we could also look into moving certain jobs that are already ported to mozharness out of buildbot and into a pure LWR implementation. That may depend on a streaming log solution or artifact manifest solution. This might belong in phase 2.


    I've been both excited and nervous writing about LWR. Excited, since I'm bursting with ideas for the project. Nervous, because so much of it extends outside of my domain of expertise; because it's a huge project; because portions of it are still nebulous concepts in my head. I think we have the team(s) to build it, though. And since I think best about projects when I write [about] them, these blog posts have helped focus my ideas. To get a first draft down that we can revise later.


    In part 1, I covered where we are currently, and what needs to change to scale up.
    In part 2, I covered a high level overview of LWR.
    In part 4, I'm going to drill down into the dependency graph.
    Then I'm going to meet with the A-team about this, and start writing some code.

    escapewindow: escape window (Default)

    compute farm

    I think of all the ideas we've brainstormed, the one I'm most drawn to is the idea that our automation infrastructure shouldn't just be a build farm feeding into a test farm. It should be a compute farm, capable of running a superset of tasks including, but not restricted to, builds and tests.

    Once we made that leap, it wasn't too hard to imagine the compute farm running its own maintenance tasks, or doing its own dependency scheduling. Or running any scriptable task we need it to.

    This perspective also guides the schematics; generic scheduling, generic job running. This job only happens to be a Firefox desktop build, a Firefox mobile l10n repack, or a Firefox OS emulator test. This graph only happens to be the set of builds and tests that we want to spawn per-checkin. But it's not limited to that.


    dependency graph (cf.)

    Currently, when we detect a new checkin, we kick off new builds. When they successfully upload, they create new dependent jobs (tests), in a cascading waterfall scheduling method. This works, but is hard to predict, and it doesn't lend itself to backfilling of unscheduled jobs, or knowing when the entire set of builds and tests have finished.

    Instead, if we create a graph of all builds and tests at the beginning, with dependencies marked, we get these nice properties:

    • Scheduling changes can be made, debugged, and verified without actually needing to hook it up into a full system; the changes will be visible in the new graph.
    • It becomes much easier to answer the question of what we expect to run, when, and where.
    • If we initially mark certain jobs in the graph as inactive, we can backfill those jobs very easily, by later marking them as active.
    • We are able to create jobs that run at the end of full sets of builds and tests, to run analyses or cleanup tasks. Or "smoketest" jobs that run before any other tests are run, to make sure what we're testing is worth testing further. Or "breakpoint" jobs that pause the graph before proceeding, until someone or something marks that job as finished.
    • If the graph is viewable and editable, it becomes much easier to toggle specific jobs on or off, or requeue a job with or without changes. Perhaps in a web app.

    web app

    The dependency graph could potentially be edited, either before it's submitted, or as runtime changes to pending or re-queued jobs. Given a user-friendly web app that allows you to visualize the graph, and drill down into each job to modify it, we can make scheduling even more flexible.

    • TryChooser could go from a checkin-comment-based set of flags to a something viewable and editable before you submit the graph. Per-job toggles, certainly (just mochitest-3 on windows64 debug, please, but mochitest-2 through 4 on the other platforms).
    • If the repository + revision were settable fields in the web app, we could potentially get rid of the multi-headed Try repository altogether (point to a user repo and revision, and build from there).
    • Some project branches might not need per-checkin or nightly jobs at all, given a convenient way to trigger builds and tests against any revision at will.
    • Given the ability to specify where the job logic comes from (e.g., mozharness repo and revision), people working on the automation itself can test their changes before rolling them out, especially if there are ways to send the output of jobs (job status, artifact uploads, etc.) to an alternate location. This vastly reduces the need for a completely separate "staging" area that quickly falls out of date. Faster iteration on automation, faster turnaround.

    community job status

    One feature we lost with the Tinderbox EOL was the ability for any community member to contribute job status. We'd like to get it back. It's useful for people to be able to set up their own processes and have them show up in TBPL, or other status queries and dashboards.

    Given the scale we're targeting, it's not immediately clear that a community member's machine(s) would be able to make a dent in the pool. However, other configurations not supported by the compute farm would potentially have immense value: alternate toolchains. Alternate OSes. Alternate hardware, especially since the bulk of the compute farm will be virtual. Run your own build or test (or other job) and send your status to the appropriate API.

    As for LWR dependency graphs potentially triggering community-run machines: if we had jobs that are useful in aggregate, like a SETI at home communal type job, or intermittent test runner/crasher type jobs, those could be candidates. Or if we wanted to be able to trigger a community alternate-configuration job from the web app. Either a pull-not-push model, or a messaging model where community members can set up listeners, could work here.

    Since we're talking massive scale, if the jobs in question are runnable on the compute farm, perhaps the best route would be contributing scripts to run. Releng-as-a-Service.


    Releng-as-a-Service

    Release Engineering is a bottleneck. I think Ted once said that everyone needs something from RelEng; that's quite possibly true. We've been trying to reverse this trend by empowering others to write or modify their own mozharness scripts: the A-team, :sfink, :gaye, :graydon have all been doing so. More bandwidth. Less bottleneck.

    We've already established that compute load on a small subset of servers doesn't work as well as moving it to the massively scalable compute farm. This video on leadership says the same thing, in terms of people: empowering the team makes for more brain power than bottlenecking the decisions and logic on one person. Similarly, empowering other teams to update their automation at their own pace will scale much better than funneling all of those tasks into a single team.

    We could potentially move towards a BYOS (bring your own script) model, since other teams know their workflow, their builds, their tests, their processes better than RelEng ever could. :catlee's been using the term Releng-as-a-Service for a while now. I think it would scale.

    I would want to allow for any arbitrary script to run on our compute farm (within the realms of operational-, security-, and fiscal- sanity, of course). Comparing talos performance numbers looking for regressions? Parsing logs for metrics? Trying to find patterns in support feedback? Have a whole new class of thing to automate? Run it on the compute farm. We'll help you get started. But first, we have to make it less expensive and complex to schedule arbitrary jobs.


    This is largely what we talked about, on a high level, both during our team week and over the years. A lot of this seems very blue sky. But we're so much closer to this as a reality than we were when I was first brainstorming about replacing buildbot, 4-5 years ago. We need to work towards this, in phases, while also keeping on top of the rest of our tasks.

    In part 1, I covered where we are currently, and what needs to change to scale up.
    In part 3, I'm going to go into some hand-wavy LWR specifics, including what we can roll out in phase 1.
    In part 4, I'm going to drill down into the dependency graph.
    Then I'm going to start writing some code.

    escapewindow: escape window (Default)

    (you missed these, a tiny bit, didn't you.)

    As I've mentioned here and here, non-fastforward commits are problematic for downstream partners, and are to be strictly avoided. And as I mentioned here, we had identified that our original merge day mechanics introduced non-fastforward commits to our Repositories of Record on a periodic basis.

    I was under the impression that this was a Solved Problem, but I noticed a non-fastforward commit in mozilla-release after the mozilla-beta -> mozilla-release merge for Firefox 25.0 (thanks to the vcs-sync emails I'm getting now). I hadn't updated the instructions on the Merge Documentation wiki page, and the merge scripts followed that document.

    The scripts are now available in this github repo. One pull request later, plus :lsblakk's fixes from the long Firefox 26 merge day, and we now use hg debugsetparents to "merge" in old heads in a fast-forwardable way. One more step towards our goal.

    escapewindow: escape window (Default)

    (Resuming the shorter blog posts about features in the new vcs sync process...)

    After I enabled beagle conversions in cron, I started getting some intermittent non-fastforward emails. When I poked around, I noticed that some changes had landed on GECKO90_2011121217_RELBRANCH on mozilla-release, but not mozilla-beta. When syncing mozilla-release, the tip of this branch stayed in place; when syncing mozilla-beta, the tip of this branch was behind several commits, resulting in a non-fastforward push. We'll continue to see problems like this, if we sync branches with shared names, across mercurial repos, which contain different histories, but we're good for now.

    I was able to solve this by transplanting some changesets, and I recorded my actions for this and one other branch here.

    (This is more a blog post about manual investigation+actions taken after automated emails highlighted the issue, rather than a feature of the new vcs sync process, but the emails themselves were key...)

    :bhearsum filed filed bug 924024 (kill relbranches), which should help reduce the frequency of this happening.

    escapewindow: escape window (Default)

    (This is one of several shorter blog posts about features in the new vcs sync process.)

    At first blush, pushing all branches and tags from a converted mercurial repo to its git mirror seems like a logical choice; no renaming, no filtering. Except, of course, default -> master. Easy enough: a blanket conversion, with one exception.

    However, as noted earlier, we move tags in hg land. Mercurial makes it easy to do, and moving tags is currently part of our official release process. Git, on the other hand, makes it impossible to move tags invisibly; to pick up a moved tag, every downstream user would need to explicitly delete and re-fetch the tag. Otherwise, their tag will continue to point at the old revision. Given the choice between no tags and tags pointing at the wrong revision, I much prefer no tags. We do have a small subset of tags in our partner-oriented gecko.git, though, so in my mind we needed a tag whitelist. With wildcards/regexes, so we wouldn't have to keep updating a static list.

    Branches could have remained a bit simpler, but I had several types of git repo to support. For example, mozilla-b2g18: in gecko.git the default branch is gecko-18. In github/mozilla/mozilla-central, it's b2g18. Neither of those repositories have other, non-default branches from mozilla-b2g18. However, with the lack of {FIREFOX,FENNEC}.*{BUILDn,RELEASE} tags on the git side, I wanted to at least support the GECKO_.*RELBRANCH branches from mozilla-beta and mozilla-release. So not only do we need a whitelist here, but also a map, to say that mozilla-aurora is aurora in gecko-dev, but v1.2 in gecko.git. (We were even considering supporting the standalone git repos; mozilla-aurora:default would have been master in releases-mozilla-aurora. We're going to nuke those in bug 847643, however.)

    In addition to the above, we want to strictly avoid noise (e.g., unwanted branches+tags, or branches+tags with the wrong names) in the partner-oriented gecko.git. I can control this by strictly limiting on push. However, a single layer of safety like that feels a bit dicey; a loose regex or a bug can push the wrong thing. Also, I'm not currently keeping track of which branches+tags I convert for each mercurial repo, so a loose regex for one mercurial repo's conversion whitelist, coupled with a loose regex for another mercurial repo's push whitelist, can result in pushing the former mercurial repo's unwanted branches+tags during the latter mercurial repo's push. It seems easier to not convert unwanted branches+tags in the first place. Because of this, I'm restricting what we convert at all via strict whitelists, and then again restricting at the push-per-target level, for an additional level of safety.

    (Does that make sense? I almost feel like drawing a picture.)

    When I look at the config files, they don't seem very elegant, but given the complexity we're trying to encapsulate, I think they're a pretty decent first draft.

    escapewindow: escape window (Default)

    (This is one of several shorter blog posts about features in the new vcs sync process.)

    We've seen instances of hg corruption before, like

    abort: data/dom/network/interfaces/nsIDOMNetworkStats.idl.i@50a5a9362723: unknown parent!
    abort: data/mobile/android/base/DoorHangerPopup.java.i@62e6137d125c: no match found!
    abort: data/b2g/config/gaia.json.i@327bccdb70e8: no node!
    abort: unknown revision '177bf37a49f5'!
    

    People have theorized that at least one of these may be caused by a race condition while pulling from an http head during an ssh push (edit: bug 737865); others seem a bit more random. We deal with this in our TBPL build/test CI by detecting these types of failures, and nuking/recloning the repo.

    We also see these in the legacy vcs-sync process. With a single-, non-cvs-prepended-, mozilla-central-based- repo, recovering from hg corruption in the working conversion directory is a manual process that can take multiple hours. I saw this as a non-starter for a repo like beagle or gecko.git, where rebuilding the entire thing from scratch can take over a week.

    As I mentioned here, the new process has an intermediate stage_source clone:

    hg.mozilla.org -> stage_source clone -> conversion_dir

    When we detect corruption in the stage_source clone, we don't have to worry very much; just clobber and reclone. The time to recreate a fresh clone of a single mercurial repo is a matter of a few minutes, not multiple hours. This approach uses more disk, but helps prevent long downtimes.

    Previously I had been running hg verify in each stage_source clone before proceeding, but that slowed each pull/convert/push cycle down by ~5 minutes per source mercurial repo (and doesn't always fix the problem), making it a non-viable option.

    escapewindow: escape window (Default)

    (This is one of several shorter blog posts about features in the new vcs sync process.)

    Focusing on beagle first turned out to be the right call (thanks :joduinn) -- I severely underestimated the time it would take to solve the initial mozilla-central cvs-prepend step in an automated, repeatable fashion, as noted here and here. However, this meant that after all my beagle-specific testing, I had to refactor to support the other vcs-sync processes, and re-test.

    One consequence: my ~6 minute estimate ballooned to ~9+ minutes for each conversion job when I changed from a single conversion_dir push to a push-per-source-repo. With each job cron'ed every 5 minutes, a commit could take up to 20-some minutes to show up in git (if it happened right after the previous conversion started).

    :nbp had given me a shell script to look at, and :hwine had suggested we use hg incoming to check for any changes before proceeding with the pull/convert/push loop. The latter seemed simple to add, so I did.

    The average no-op conversion time dropped from ~400-550 seconds to ~12. This includes rsyncing the updated status json and ~600 log files to an upload server. (That number of logfiles will go down exponentially when we have dated upload dirs, so we don't have to keep so many backups in the same directory.) This is dependent on hg.mozilla.org load, and can spike up to ~40 seconds.

    The average conversion time dropped from ~400-550 seconds to a little over a minute, and additional repos don't add much more time.. sometimes ~2 minutes for 4 repos' worth of conversion. I also bumped the cron job frequency up to once per minute, so on average mercurial commits should show up in git within a minute or three. Plus, it's harder to find multiple repos' worth of changes within a single minute, so it keeps the incoming changes down. The longest delays tend to come when we hit hg corruption (uncommon), and even then we're auto-fixing within 8-9 minutes (see a later blog post about this). I still want to get more built-in parallelization support into mozharness, but with these numbers it's a lot less urgent.

    One side effect of this: sometimes we skip over repos that need to be synced. For example, if we add a new target to a repo (e.g., a git.m.o repo, when we had previously only been pushing to github), or a new branch or tag regex (there will be a later blog post on these). If the repo in question has a lot of activity, this would populate on the next push. If it's a closed or low-activity repo, that might not happen for days, or weeks, or ever.

    I added a --no-check-incoming commandline option, with a corresponding global check_incoming config setting to skip this behavior, and convert/sync everything. Also, I added per-repo check_incoming flags (defaulting to True) for more granularity. This helped in debugging the relbranch issue on mozilla-beta (see later blog post).

    escapewindow: escape window (Default)

    (This is the first of several shorter blog posts about features in the new vcs sync process.)

    Back when the vcs sync project was first dropped in my lap, I quickly decided the initial implementation was going to push to a local repo. On disk, not on a network server. This has the nice properties of

    1. ruling out any server-permission- or network- related issues,
    2. allowing for development without an active network connection,
    3. speeding up the testing process to a small degree,
    4. and allowing for immediate inspection of the pushed repo's contents.

    I named this a test_push, though I'm waffling on that name.

    When it became clearer that non-fastforward pushes and deletions would be an issue for downstream partners, we were looking for ways to prevent that.

    Ideally we prevent this at the RoR (repository of record), with pre-commit hooks. However, not all of our upstream repos have pre-commit hooks (see github), so this can't be a blanket solution. (I think our single-head hook on our release branches catches a lot of this, but there might be more we can do on hg.m.o).

    Next, it would be good to have these denied at the partner repository level (we've done this on gecko.git and gaia.git). This is less ideal than the pre-commit hook, because a deletion or non-fast-forward commit can land upstream, and then the sync process has nowhere to go. Also, this is the last place we can catch this issue -- if this is set incorrectly, or missed, or if others have administrative rights to the repo and unset this, then we're in a bad position. (hg debugsetparents can fix non-fastforward commits. I don't know how to recover from deletions, other than track that revision down somewhere and re-push; luckily deletions look to be difficult to do in mercurial.)

    I finally decided to add receive.denyNonFastForwards to the local test_push repo, via the --shared=true option in git init. If the test_push happens before any network pushes, and test_push failure prevents any network pushes for that branch, we get another layer of safety. It's still not as good a solution as preventing that change in the first place, but it's something we can control locally.

    It looks like I'm missing receive.denyDeletes from the test_push... I added that to my development branch so we get that check in each test_push soon as well.

    escapewindow: escape window (Default)

    live

    As of about 7pm PDT last Friday (October 11), gecko-dev (née beagle) and gecko-projects are live. Here's the newsgroup post. Here's gecko-dev on github and git.mozilla.org. Here's gecko-projects (with a README.md on how to use it) on github. The logs, repo_update.json files, and mapfiles are temporarily living here. The bug is here.

    Both of these repositories are RelEng-supported, and have SHAs that match gecko.git.

    If you use git for your gecko development, please start using these repos and make sure they work for you. AreWeFastYet and some developers have already switched over just fine. If you hit any problems, please let us know.

    coming shortly

    what's next

    Beagle was definitely the largest piece in RelEng's vcs-sync puzzle, but it's not the final piece.

    • gecko.git (hg->git): I already have configs and a test repo. Once we're confident that gecko-dev and gecko-projects are solid and solve our developers' needs, we can switch our partner-oriented repo over from the legacy system to being converted by this new production system. This switchover does not require changing any SHAs, and should hopefully be an invisible, seamless cutover.
    • l10n (hg->git) I already have configs and have tested this locally. The workflow I went with here involves reading the b2g/locales/all-locales and languages_dev.json files on various branches, and building the list of repos-to-sync dynamically that way. Along with the ability to create new repos on git.mozilla.org, this allows us to sync new locale repos on demand, rather than requiring manual Release Engineering + IT intervention every time.
    • git->git sync support. We have a large number of b2g github repos that need to be populated in git.mozilla.org. If I follow the l10n model, we would dynamically create this list from b2g-manifests, rather than require manual Release Engineering + IT intervention.
    • git->hg sync support. This is needed for gaia currently.
    • hg->hg sync support. We've had some, but limited, use of the repos mirrored on bitbucket.com

    As we add support for these, we can cut over legacy processes over to the new process; stay tuned for announcements for each of these switchovers.

    Also, I plan to write a few shorter blog posts about some of what went into this project.

    escapewindow: escape window (Default)

    There's been another arbitrary version bump!

    What's changed?

    • Android Panda unittests with mozpool,
    • Gaia unittests,
    • Gaia pushbot,
    • Servo builds,
    • Android x86 emulator unittest support,
    • C++ unittest support,
    • desktop talos (rolled out!),
    • optional config files,
    • pulling config files from a url,
    • system resource monitoring,
    • action/script hooks,
    • additional VirtualenvMixin enhancements,
    • fixed windows rmtree,
    • made ReadOnlyDict fully lockable,
    • TBPL exit_status levels,
    • fixed TBPL worst_level() usage,
    • added _post_fatal() callback,
    • consolidated OSMixin and ShellMixin to ScriptMixin,
    • output timeouts,
    • run_command() env printing,
    • load_json_from_url(),
    • max log size,
    • additional Boot2Gecko emulator tests,
    • b2gbuild.py: removed snapshots, gaia packaging, debug build support, ...

    ... and we have a lot of work in progress that will be ready to land shortly.

    I think the best sign is we're getting more and more contributors to the project. Stay tuned for more goodness.

    escapewindow: escape window (Default)
  • What is gecko.git?
  • What is beagle?
  • Problems encountered / lessons learned: master-only conversion
  • Problems encountered / lessons learned: full branchlist conversion
  • Still to do


  • What is gecko.git?

    gecko.git is a git repo on git.mozilla.org that Mozilla's B2G partners view as the Repository of Record (RoR) for Mozilla B2G gecko code. It's a read-only synced version of a handful of Mozilla's mercurial gecko repos.

    Hal Wine set this repository up, using his vcs-sync repos; his vcs-sync docs are well written and comprehensive for the single-repo variety of job.

    As Hal noted in his RelEngCon talk, this repository has a number of requirements we don't have elsewhere (see the "Challenge Areas (Con't)" section):

    • all changes must be fast-forwardable; no deletes
    • the conversion is not foolproof
    • git email validation is stricter than hg
    • commits already live on hg.mozilla.org, so we have to change hg-git to allow for special cases

    We have identified some upstream processes and behaviors that would cause problems, should they show up in gecko.git. For example, historically our merge day process has involved closing the tip of the downstream branch (e.g. Aurora), and landing the upstream branch (e.g. Central -> Aurora) in, creating a new default head. This process violates the "only fast-forwardable changes" rule above. We were avoiding this issue previously by not converting the aurora/beta/release repos, and only pushing the b2g18* repos, which don't ride trains.

    Since then, we identified hg debugsetparents as a potential solution to make the merge day landings artificially fast-forward; it's unclear if we currently have that implemented. We would need to before we could successfully push aurora/beta/release to gecko.git without violating this fast-forward rule.

    Also, our current release process for Firefox desktop and Firefox Mobile (aka Fennec) involves creating a BUILDn tag and a RELEASE tag. The former generally doesn't change; the latter changes on any respin. Because of this, we use a -f in our hg tag command. Here's a changeset where we move a RELEASE tag for a build 2.

    However, Git doesn't allow for re-tagging like mercurial does. If you push a new tag with the same name as an old tag, any downstream repository owners will need to take action to move their tags. This is problematic on sensitive release repositories that multiple downstream partners rely on.

    I can imagine two longer-term solutions here: one is to continue to severely limit which tags can be pushed to gecko.git; the other is to change how Mozilla's release automation determines which revision to pull. Our current strategy of not pushing aurora/beta/release has delayed this decision.

    As noted in a recent thread (Whoops, I'm bad with git...), it's safest to not make changes to git history if there are enough users downstream to form a lynch mob.


    Top


    What is beagle?

    I'm not entirely sure about the code name, but the purpose is an official replacement of Ehsan's github/mozilla/mozilla-central repo with shas that match gecko.git. (Due to differences in the conversion processes and toolchains, the current two repos' shas do not match.)

    As I understand it, this is a developer-oriented repo, with important mercurial repos pulled in as git branches. This allows for easier cross-branch diffing for git-based developers, though landings still happen in hg.

    There are a number of branches tracked by the existing github/mozilla/mozilla-central repo that don't necessarily follow the above rules for gecko.git, which complicates having a single git repo that serves both purposes. Notably, larch, birch, alder, and cypress are project branches, which, if they follow the standard project branch life cycle, will get hard-reset at some point. This violates both the expectation of fast-forwardability as well as no deletes. And, as noted above, it appears as though the movable *_RELEASE tags from the beta+release repos are getting populated here, which would be problematic for gecko.git.

    Also, if the "Whoops..." thread was any indication (as well as the occasional "please purge this revision from hg.m.o" bug that pops up): in this world of many downstream users, we have to become better at not needing to purge revisions or rewrite history. But given a split between a partner-oriented gecko.git and a developer-oriented beagle, we're at least allowed some additional leeway in the latter.

    I tackled the beagle project with an eye to being able to support both gecko.git and beagle.


    Top


    Problems encountered / lessons learned: master-only conversion

    I came in with the assumption that this was a fully solved problem, and I would merely be making existing processes more robust, more scalable, and maintainable. RelEng is currently converting and supporting over 300? repos on git.mozilla.org, using early prototype code currently live in production, so this needs to be improved. And certainly, many of the issues were already fixed and upstreamed, and many of the processes were well documented, just not all. I definitely underestimated how much I would have to learn about the process.

    However, I knew that I would be changing the workflow and process (as Hal is fond of saying, his scripts are a proof of concept running in production). I wasn't going to just tweak existing scripts; I wanted to make the entire process automated and config-driven, rather than human-intensive.

    How best to test these changes? The most data we have to test with is in m-c history. So converting m-c from scratch, and verifying that everything looks the way we want afterwards, was the best test for my new process.

    The project was down-scoped to just hg->git conversions. (Previously I had been aiming for a generic, config-driven vcs sync project that could convert hg -> git or git -> hg, or sync hg -> hg or git -> git (e.g., github -> git.m.o) to cover all of our vcs-sync needs.) Once I had a machine to test on, I started my first m-c conversion to verify my script + configs. About a week in, gps blogged about faster hg-git.

    As I noted earlier, I switched over. But that wasn't seamless; Hal was running on a forked hg-git 0.3.2; gps' changes were landed as a part of 0.4.0. Rather than try to backport gps' changes, we thought I should use the latest hg-git (0.4.0), since as far as we knew all of the forked changes had been upstreamed.

    However, as noted in bug 835202 comment 9, my conversion failed using the latest upstream hg-git; our timezone fix had never been upstreamed because of missing unit tests. When I forked hg-git with this patch, I was able to continue and convert everything as expected... except the shas diverged. Which I consider to be a major no-no, a violation of a central goal of this project.

    It turns out that hg-git 0.4.0 deals with this revision, which I fondly refer to as <h<surkov, differently than Hal's 0.3.2-moz2 fork. These two revisions change the second angle bracket '<' into a question mark '?', turning the initial portion of the email address into <h?surkov. The previous behavior was to drop the second angle bracket '<' entirely, as seen in gecko.git's <hsurkov. Both are valid ways to change <h<surkov into a well-formed git email address, but only one of these resulted in the same shas as gecko.git. I backed those commits out of my hg-git fork, and started verifying that my latest 0.4.0-moz2 fork gave us expected behavior. And it did, for a master-branch-only conversion, like we currently have in gecko.git. (Phew!)

    (While it may seem like I was spending an inordinate amount of time converting and re-converting mozilla-central, I was, in fact, testing against the largest data set we had in terms of unexpected, real-world commit data. Shaking out bugs in the conversion process before pushing live avoids massive headaches in the future for developers, partners, and maintainers. If the conversion scripts could robustly handle anything we'd seen so far, we'd have greater confidence about these systems in live use.)


    Top


    Problems encountered / lessons learned: full branchlist conversion

    A master-branch-only conversion was good for gecko.git, but I was looking to support beagle as well. Ehsan's repo currently has all of mozilla-central's tags and branches, and not only did he gexport each of those branches, he also ran git filter-branch on them after the cvs prepend, and translated each of the tags. Had I proceeded without doing this work, I could foresee a future where we would be asked to shoehorn these changes onto a live production repo, well after the downstream audience had crossed lynch-mob-levels. Worse still, I wasn't able to push to github without running a git gc aggressive, which would make future conversion of these branches and tags very difficult. I couldn't use gecko.git as the base if I wanted these branches and tags; I couldn't use Ehsan's repo as the base if I wanted the shas to match gecko.git. It wasn't difficult for me to add this added workflow to the script, just compute-time-intensive. So I added it, and started the full conversion at that point.

    This took a bit longer to complete than I predicted, because I had to either use the -f option in git filter-branch and potentially overwrite revisions already created in the map directory, or not halt on failure should one of the git filter-branch runs exit non-zero. Neither of these seemed like comfortable sensitive-production-service decisions to make. (In the end, the -f turned out to be the lesser evil.)

    As a side note: certainly at this point (if not before), a second machine or a fast local disk with enough inodes+space to handle a second parallel conversion would have helped speed up the process immensely, as I could have run both conversion workflows in parallel, rather than serially as I have been (with lots of tar backups and restores to try to save time). I did finally get that second machine last week (yay!), though that was well after most of my time-consuming compute churn was over. (I estimate that had I had this second machine at the start of the project, I would have saved at least a month of calendar time; very likely more.) Still, I'm making good use of it now.

    Unbeknownst to me, this seemingly innocuous commit landed between my successful master-only conversion and the first full-branch conversion attempt that made it through without halting-on-failure. During the cvs prepending step, for some unknown reason, git filter-branch changes "Carlos G." into "Carlos G" sans-period; gecko.git preserves the original name. My theory was that hg-git was dealing with this user's name appropriately, and git filter-branch was munging it. Enter hg strip: I theorized that if I stripped the problem commit before cvs prepending, and then updated and converted via hg-git normally afterwards, we'd be golden. And that's almost true. But not quite.

    The main problem, as I found out after a second full conversion (and having my conversion machine reboot from under me, which only cost me 5 compute days) pass later, is that the hg strip didn't leave default as the tip of its own branch; due to unnamed branches+merges as DVCS allows, I had to hg strip three times to get to a single tip-of-default head. Without that, git filter-branch would move the master branch to cvs-based shas, but the additional forks stayed on the old, non-cvs-based shas. Once I landed this change, (and once I restarted this conversion pass after the machine reboot), we now have a full conversion of mozilla-central, with all mozilla-central branches and tags, with cvs history, with shas that match up with gecko.git. The toolchain, config, and script have been tested against a huge number of historical revisions and held up; this is a good sign for future robustness. A bonus side effect is we can predictably recreate our git history now, which is great, but was never the primary goal.

    I pushed the resulting repo to this test repository on github; please feel free to poke at it (but be forewarned that I reserve the right to reset it in the future). I started populating the repo branch list to include branches that we think make the most sense given the use cases for this repo and all the gotchas described above. During this process, I came across some things that need fixing in the code+configs (for instance, you'll note that the mozilla-central branches+tags aren't there, and the mozilla-beta+mozilla-release release tags. I have the former in my conversion repo, and have a partial plan for the latter). But we're definitely past the long compute time hump.

    (As another side note -- I fixed some of these cvs prepend issues via specific revision hardcodes rather than generic solutions. This isn't ideal; I'd love to write this tool in a way that could apply to any project, anywhere. However, given some of the unique constraints (cvs prepending, matching previous shas that were built with a different process and toolchain), and limited time, I decided hardcodes in this section of code, which in theory should only need to be run once barring some sort of emergency, was the lesser of evils.)


    Top


    Still to do

    • From the start, even before we down-scoped the project, I was planning on being able to push a subset of repos, branches, and tags to a number of different locations. My configs for doing that are a bit ugly currently, but they work! So if I'm pulling mozilla-b2g18 into my conversion repo, I can push the master branch of that repo to gecko.git:gecko-18, beagle:b2g18, and create a standalone mozilla-b2g18 repo on github or git.m.o where it's b2g18:master. Since these would be pushed from the same conversion directory, they would be guaranteed to have the same shas.

      And since I've made the effort to make my code modular, config-driven, and well tested, I would have no qualms about running two instances of the script in parallel: one for gecko.git, one for beagle. After painstaking and time-consuming effort to try to guarantee matching shas, and with the same toolchains and code paths, I wouldn't be worried about the two repos diverging. The split would mainly serve to avoid pushing any bad changes to the more sensitive, partner-oriented gecko.git. It's easier to guarantee a change won't be pushed if the change doesn't exist in the conversion repo at all, rather than only relying on push-time checks.

      Currently I'm leaning towards the split (for more stable gecko.git), with the multiple-target logic in the former living in the beagle process.

    • As I was pushing the various repos to my test-beagle github repo, I noticed a few things about tags and branches. The branch and tag configs are not creating the {FIREFOX,FENNEC}_*_{BUILDn,RELEASE} tags, among others, even when I changed the tag_regexes. This is probably because the release tags are all created on relbranches, and outside of mozilla-central, I'm only converting the branches listed in an explicit branch whitelist.

      I imagine I need branch and tag configs for incoming branches+tags (branches and tags designated for conversion); these would need to be a superset of all branches and tags that we would want to push. Limiting this set would help prevent pushing any unwanted branches and tags, which could prove problematic if they either violate the no-deletes or fast-forward-only rules, or if the tags in question move on the mercurial side. A whitelist would work, but by itself would require a lot of manual intervention. I need to add regex or wildcard support. The outgoing branch and tag configs would be similar, but could be a subset of available branches and tags, and continue to involve different configs for different push targets (e.g., the above gecko.git:gecko-18, beagle:b2g18, b2g18:master example).

      Since we'll be adding to this list and modifying it on the fly when this script is in production, a config file validator and a more sane format become less optional and more production best practice.

    • For bug 848908 - prevent repo corruption due to bad pulls for hg-git conversion, I already have three types of repos on local disk: clean staging clones, the conversion repo, and test target repos.

      The conversion repo is the same as gecko.git and Ehsan's repo. The staging clones are clean clones designed to catch any hg.mozilla.org repo corruption from a pull, as noted in in the bug. Basically, it's a matter of minutes to blow away and reclone a clean clone. It's a matter of hours or days if we corrupt the conversion repo.

      I haven't caught any repo corruption yet, but I do know we've seen at least three or more instances of repo corruption in Hal's production conversion repos, each time a multi-hour restoration process, each delaying further changes from making their way downstream. I knew that beagle, with its massively larger matrix of repos, branches, and tags, would be even harder and likely more time consuming to un-corrupt.

      The local disk test target repos are there to give debuggable local repos to push to. I imagine we could set them up in such a way that they would need to pass certain tests before allowing the script to push to a public repo.

      To do here: catch staging clone repo corruption, and make sure the automated response deals with it appropriately; potentially add test target repo validation to avoid pushing bad commits to public repos. I'm not sure these are blockers, but they would go a long way to making the process more robust.

    • For bug 799845 - For git mirrored repositories, please provide status on what the last successfully sync time was, I create a json file with what I think is adequate information about the latest pull/push times, and the configuration involved. It currently looks like this.

      I need to make sure the format works for downstream users, and find a place for this file (and the logs, etc.) to get uploaded to.

    • When bug 892563 - Add a timeout parameter to mozharness' run_command landed, I added output_timeout settings to various commands I felt might hang at some point in the future; they also run through the retry() method, and each successful and unsuccessful run sends email (configurable, of course). I have yet to see timeout/retry code successfully exercised; I'd love some verification before production.

    • Split the patch up for review!


    Top


    I hope that's been useful. Feel free to send me feedback / ideas / questions, or to take a look at my work in progress.

    The bug is here.
    The script is here; I went through and added comments and docstrings, to hopefully make it more readable.
    The config is here.
    My test github push repo is here. (This repo may be reset in the future!)
    My previous blog post (hg-git part i) is here.

    escapewindow: escape window (Default)

    I was about a week into my m-c hg-git conversion* when gps blogged about faster hg-git.

    I was about two weeks into the initial conversion when I tried the latest upstream hg-git, with gps' patches. That conversion finished in less than a day, while the initial conversion was still chugging along on a different core. However, the git shas differed from our existing pre-cvs conversion repositories. This was due to different behavior when faced with a bogus email address.

    Once I backed out the corresponding email munging commits, and landed Ehsan's timezone fix for a separate issue that halted the conversion process, we're now back to a converted m-c with the same shas.

    At this point I killed the still-running initial conversion.

    I'm still working on automating the cvs prepending and pulling in other repos/branches, but this is a good sanity-check point.



    * Note, I had pulled both mozilla-central and mozilla-b2g18 into the same working directory as a proof-of-concept workflow test before converting. In hindsight this was a terrible decision, as it more than doubled the estimated conversion time.

    escapewindow: escape window (Default)

    We are now running Firefox desktop unittests through mozharness for Firefox-22-based branches (mozilla-central, mozilla-inbound, try, and project branches).

    Jordan Lund started this work during his internship last year, and had the mozharness scripts and configs code-complete by the end of his internship. I took over afterwards, but my efforts were sporadic, between high-priority mobile and b2g work.

    This was a multi-phase operation:

    • We enabled them on Cedar
      • This required platform-specific fixes, since we were working against a subset of platforms in staging.
      • Cedar unittests were essentially the same as, say, mozilla-central, other than having the logic in mozharness instead of hardcoded in buildbot.
    • At this point, the majority of the tests were green, but a number were perma-orange or red or missing or had intermittent issues not present in buildbot-based unittests. So we went through and identified+fixed the various problems per-suite or per-platform, whether that was missing debug tests or barfing on leaks.
    • We also cleaned up issues that made these jobs hard to sheriff, e.g. missing automated retries or ugly log summaries.
    • Whenever P0 interrupts hit (regularly), we had to put this work on the back burner, and determine what changes had made the buildbot-based unittests diverge from the mozharness-based unittests when we revisited. This back-and-forth added months to the schedule, but allowed us to respond to higher priority tasks more quickly.
    • Once all the blocker bugs were resolved, I split the rollout into three parts:
      1. Get mozilla-tests/config.py to not warn in python mode (pep8). This wasn't critical, but helped avoid spurious conflicts later. Easy peasy. (comment 8 through comment 10)
      2. Split out the mobile tests into their own config file. This helped me limit the scope of my config.py rewrites to desktop only. (comment 11 through comment 28)
      3. The final set of patches made mozharness-based unittests default for Firefox desktop, with exceptions for aurora, beta, release, esr, and b2g18*. (comment 29 through comment 60)
    • I ran through a number of spot tests in staging to limit risk for rollout; I also waited until after merge day so the unittests could ride the trains. So far the only fallout is some bustage in debug b2g emulator tests on b2g18 and b2g18_v1_0_1 due to missing symbols (now fixed; we also have bugs to enable symbols filed).

    Making large sweeping changes in infrastructure while

    1. responding to constant priority interrupts (often for weeks or months at a time), and
    2. avoiding bustage
    is a long game. But we're playing for keeps.

    Thanks to Jordan for the initial work around this, and A-Team, sheriffs, and RelEng for their help + patience.

    escapewindow: escape window (Default)
    " I certainly don't want to wait another 5 months before the next arbitrary version bump."

    -- me, 11 months ago.

    What's happened in the past 11 months? In no particular order,

    • completed Android Native support,
    • Fennec armv6+noion support,
    • broadened VirtualenvMixin functionality (setup.py support, pip freeze, PyWin32 support, --find-links, ...),
    • further mozbase support / integration,
    • developer-runnable Android multilocale build support,
    • Marionette tests on Firefox desktop,
    • mock (mock_mozilla) support,
    • tooltool support,
    • retry support,
    • Boot2Gecko (FirefoxOS) device builds (panda, unagi, unagi_eng, unagi_stable, otoro) (all multilocale),
    • Boot2Gecko desktop builds (all multilocale),
    • Boot2Gecko emulator tests: marionette-webapi, mochitests, reftests, crashtests, xpcshell,
    • mozpool support for Boot2Gecko + Android Pandas,
    • Boot2Gecko Panda Gaia UI tests,
    • Android signing on demand,
    • TBPL parser compatibility improvements,
    • per-locale status in mobile_l10n.py,
    • mozharness Firefox desktop unittests (ready to roll out),
    • mozharness Firefox desktop talos support (baking on Cedar),
    • proof-of-concept jetperf,
    • proof-of-concept remote (mobile device) talos support (needs work),
    • further Peptest support (which has since been disabled),
    • MPL 2.0,
    • and a mozharness production branch, which will let us run CI tests against default before merging, to avoid production bustage.

    So, yeah. Some minor version bumps are larger than others. We've gotten all this done thanks to great efforts by the A-Team and RelEng.

    B2G automation is largely mozharness-based; mobile is maybe half-mozharness and Firefox desktop is still largely buildbot-based. But, without actually crunching numbers, I would guess that rolling out mozharness Firefox desktop unittests would push us somewhere near the halfway point. Soon.

    I'm glad to see the project continuing to gain traction, and we've been seeing concrete evidence of the theoretical benefits I've been touting since the beginning.

    This time I'm not going to make any predictions about when I'll bump the arbitrary version next, but I'll schedule something in Things.

    escapewindow: escape window (Default)

    Tl;dr: it's now easier to run multilocale builds locally. I'd love feedback in terms of whether this is useful for you, and if there are changes that would make this script better for your workflow.

    I updated the wiki as well.


    what & why

    There are only three ways to get a multilocale build currently:

    • triggering a nightly build off of a multilocale-enabled branch,
    • pinging me directly and asking for a multilocale build, or
    • building multilocale locally. Up until now, that involved a combination of digging through logs, asking for help, and reverse-engineering the multilocale script with a lot of trial and error.

    The latter was so difficult because I hadn't spent the time to make it friendly to run locally. Now that bug 753545 has landed, it should now be relatively easy.

    The config file is written to [hopefully] be easily editable and usable.


    running the script

    • Create a directory to run in. It's simplest to create a new directory, although you can use an existing directory with mozilla-central (or other tree) already checked out.
      mkdir multilocale
      cd multilocale
      
    • Clone mozharness and copy the standalone config file for easy editing+use.
      hg clone http://hg.mozilla.org/build/mozharness
      cp mozharness/configs/multi_locale/standalone_mozilla-central.py myconfig.py
      
    • Customize your config file.

      By default it's going to clone mozilla-central into a directory called mozilla-central.

      It also assumes your objdir will be named objdir-droid and that will be specified in your mozconfig, which will be in a file named mozconfig in this directory.

      This should hopefully be easy to edit. Holler if it's confusing.

      Afterwards, you can verify it's still valid python by typing

      python myconfig.py
      

      which should exit silently.

    • If your config file still says that your mozconfig is in os.path.join(os.getcwd(), "mozconfig"), copy your mozconfig into this directory and make sure it's named mozconfig.
    • Make sure mozilla-central (or whatever repo you specified) is in this directory.

      You can either clone it manually, or use the script's pull-build-source action:

      mozharness/scripts/multil10n.py --cfg myconfig.py --pull-build-source
      
    • Run the script.

      By default, this will:

      • pull the locale source,
      • copy your mozconfig into the tree and build mozilla-central,
      • package the en-US apk,
      • backup your objdir to a -bak directory (e.g. objdir-droid-bak) via rsync,
      • restore your objdir via rsync (no changes),
      • add the locales to your objdir,
      • and package the multilocale build.

      (See the actions section below for more advanced usage.)

      mozharness/scripts/multil10n.py --cfg myconfig.py
      

    developing + debugging

    The above instructions are sufficient for when you want a single multilocale build.

    If you're changing m-c code or otherwise want multiple builds, you probably want to follow this workflow:

    • After running the multilocale steps, you want to restore your objdir to en-US. At least until someone figures out how to clobber the multilocale-specific bits, rsyncing from a clean backup directory seemed like the most expedient solution.
      mozharness/scripts/multil10n.py --cfg myconfig.py --restore-objdir
      
    • Then make your code changes.
    • Then run the default actions again. If this seems like it's doing more than it needs to, you're correct. See the actions section below for more advanced usage.
      mozharness/scripts/multil10n.py --cfg myconfig.py
      

    actions

    (I covered actions to some degree here as well, but the below specifically covers this script + config file.)

    what actions are available?

    You can see the default actions, defined in the config file, by:

    mozharness/scripts/multil10n.py --cfg myconfig.py --list-actions

    This will give you this output:

    Actions available: clobber, pull-build-source, pull-locale-source, build, package-en-US, upload-en-US, backup-objdir, restore-objdir, add-locales, package-multi, upload-multi
    Default actions: pull-locale-source, build, package-en-US, backup-objdir, restore-objdir, add-locales, package-multi
    what do these actions do?

    I'd like to have more information about each action built into each script. Here's an overview of multil10n.py's actions:

    • clobber - This is effectively no-op as long as work_dir is set to '.'. Otherwise it nukes the entire work_dir.
    • pull-build-source - This clones the repos defined in repos.
    • pull-locale-source - This clones the repos defined in l10n_repos, as well as the l10n repositories defined in the locales file (specified here)
    • build - This copies the mozconfig path specified here into the source tree, and runs make -f client.mk build.
    • package-en-US - This runs make package.
    • upload-en-US - To be written. It will call make upload.
    • backup-objdir - This rsyncs your objdir to a backup directory, since there's no current solution for clobbering the multilocale bits in your objdir.
    • restore-objdir - This rsyncs the backup directory into your objdir. If you want to re-run the build or re-add the locales without a lot of old cruft lying around, I recommend doing this first.
    • add-locales - This runs compare-locales and adds the locale strings + manifests into your objdir.
    • package-multi - This runs make package for your multilocale apk.
    • upload-multi - To be written. It will call make upload.
    how do I run a single action?

    You can run any single action by itself by prepending -- in front of the action name (making it a command line option). For instance, to only run the restore-objdir action,

    mozharness/scripts/multil10n.py --cfg myconfig.py --restore-objdir
    
    how do I skip an action?

    You can run the default set of actions, minus an action, by prepending --no- to the action(s) you want to skip. To run the default actions without pulling the locale source,

    mozharness/scripts/multil10n.py --cfg myconfig.py --no-pull-locale-source
    
    how do I run several actions at once?

    Running the script with no action-specific command line options will iterate over the list of default actions.

    You can also have a combination of --ACTION or --no-ACTION to add an action to or remove an action from the default list. For example,

    mozharness/scripts/multil10n.py --cfg myconfig.py --no-pull-locale-source --no-build --no-package-en-US --no-backup-objdir
    

    and

    mozharness/scripts/multil10n.py --cfg myconfig.py --restore-objdir --add-locales --package-multi
    

    are equivalent.

    You also have the option of editing your default_actions in myconfig.py if that makes things easier.

    escapewindow: escape window (Default)
    For those interested, I gave an overview of mozharness for the Automation Team today.

    I mentioned that we're not yet at that point of critical mass where the accumulated knowledge and shared code help make everything go faster. But we're getting there.

    The presentation "slides" are available here.
    escapewindow: escape window (Default)

    A lot's happened since mozharness 0.4 landed in late September. We:

    • added a bunch of Android native support.
    • rolled peptest out to production on Try, across all desktop test platforms.
    • improved mozharness virtualenv support, with real-life mozbase usage in peptest.
    • fixed actions-in-config-files.
    • separated the output parser from ShellMixin.run_command(), so we can
      • parse output from subprocess or get_output_from_command(),
      • eventually add context lines to output parsing, and
      • potentially split serial tasks into multiple parallel jobs with their own log parsing.
    • added chunking support to split up jobs across machines.
    • added query_exe() and which() support to specify or find executables.
    • added add_failure(), query_failure(), and summarize_success_count() support to track granular status across a list of tasks.
    • added a BuildbotMixin and ReleaseMixin to tie into our existing buildbot infrastructure/configs.
    • precompiled the error_list regexes.
    • made various OSMixin and ShellMixin method improvements.
    • added a setup.py.
    • added a pyflakes call and Debian/Ubuntu support to unit.sh.
    • moved mozilla-specific modules into mozharness.mozilla.*.
    • retired Maemo scripts (Maemo tier 3).

    This feels like an as-good-as-any time to arbitrarily increment the arbitrary version number: mozharness 0.5.

    I think the best part of this release is how more people got involved; I can feel the momentum building. I certainly don't want to wait another 5 months before the next arbitrary version bump.

    escapewindow: escape window (Default)

    [what is it?]

    device_talosrunner.py is a work-in-progress mozharness script that sets up talos and runs it against a device using either the sut or adb protocols.

    It's designed to:

    • allow anyone with a [supported and rooted] Android device to run talos without building the tree;
    • allow someone with access to a staging/production tegra to run a production-like talos run without setting up buildbot;
    • at some point, become the official way we run talos on Android in production.

    This is similar to, and parallel to, the work on make talos-remote. However, aiui, that is developer-oriented, whereas device_talosrunner.py is targeted towards someone who wants to test pre-built apks (as well as replacing our current production code).

    I have used this script to run talos on both a tegra over sut, and an Asus Eee Pad Transformer using adb-over-ip, successfully.

    [how do i use it?]

    You need:

    • python (2.5 - 2.7.x)
    • virtualenv
    • hg, to clone the various repos
    • a copy of my github mozharness repo, on the branch 'talosrunner';
    • a rooted, supported Android device that is either running the sutagent, or is attach{ed,able} via adb. The adb calls require busybox.

    Once you have those things,

    1. Create a config file. I've got a wip tablet config that I use for my Asus Eee Pad Transformer, and a wip tegra config that I use for a staging tegra.

    2. If you're using adb-over-usb, the device needs to be connected to your desktop/laptop. If you're using adb-over-ip or sut, specifying the device_ip should be sufficient.

    3. If this is the first time you're running device_talosrunner.py, you need a virtualenv. Run
      mozharness/scripts/device_talosrunner.py --cfg users/aki/tablet1.py --create-virtualenv
      (replacing the config file path with the path to your config file). This will create a python virtualenv in ./venv and install PyYAML, a Talos dependency, into it.

    4. Run
      mozharness/scripts/device_talosrunner.py --cfg users/aki/tablet1.py
      If the start_python_webserver option is set, the script will start a webserver and point the device at it. Otherwise it'll point the device at whatever talos_webserver is set.

      When this script is polished and mature, I definitely want people to be comfortable using it on their personal phones and tablets, but it's still rough around the edges.

      This will uninstall Fennec and possibly reboot your device, so please be careful if you're trying this on a device/profile that you haven't backed up.

    [what's left to do?]

    • I've only tested a handful of talos suites against two devices; for this to be a production replacement, it needs to be able to run reliably against a large pool of devices.
    • This relies on a talos patch in bug 688604 for the python webserver; I most likely need to use mozhttpd.
    • I've hit issues with both m-c and birch builds, and will most likely need to make changes for the native UI.
    • I have an idea which would allow us to find the next available and working tegra, which may reduce red and purple Tegra results; this needs writing and testing. More on that later.

    May 2014

    S M T W T F S
        123
    45 6 78910
    11121314151617
    18192021222324
    25262728293031

    Syndicate

    RSS Atom

    Most Popular Tags

    Style Credit

    Expand Cut Tags

    No cut tags
    Page generated Jul. 25th, 2014 03:13 pm
    Powered by Dreamwidth Studios