escapewindow: escape window (Default)

Premise (Tl;dr)

A federated protocol for automation platforms could usher in a new era of collaboration between open-source projects, corporations, NGOs, and governments. This collaboration could happen, not at the latency of human handoffs, but at the speed of automation.

(I had decided to revive this idea in a blog post before the renewed interest in Mastodon, but the timing is good. I had also debated whether to post this on US election day, but it may be a welcome distraction?)

Background

Once upon a time, an excited computer lab assistant showed my class the world wide web. Left-aligned black text with blue, underlined hypertext on a grey background, interspersed with low-resolution GIFs. Sites, hosted on other people's computers across the country, transferred across analog phone lines at over a thousand baud. "This," he said. "This will change everything."

Some two decades later, I blogged about blue sky, next-gen Release Engineering infrastructure without knowing how we'd get there. Stars aligned, and many teams put in hard work. Today, most of our best ideas made it into taskcluster, the massively scalable, cloud-agnostic automation platform that runs Mozilla's CI and Release Pipelines.

The still-unimplemented idea that's stuck with me the longest is something I referred to as cross-cluster communication.

Simple cases

In the simplest case, what if you could spin up two Taskcluster instances, and one cluster's tasks could depend on the other cluster's tasks and artifacts? Task 2 on Cluster B could remain unscheduled until Task 1 on Cluster A finished. At that point, Task 2 could move to `pending`, then `running`, and use Task 1's artifacts as input.
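The scheduling half of that idea could be sketched as a small shim on Cluster B that watches Task 1's state on Cluster A. The state names below follow Taskcluster's task states, but the shim itself (and its polling approach) is a hypothetical sketch, not an existing API:

```python
import time

def downstream_action(upstream_state):
    """Decide what Cluster B should do with Task 2 given Task 1's
    state on Cluster A. State names follow Taskcluster's task states."""
    if upstream_state == "completed":
        return "schedule"   # Task 2 moves from unscheduled to pending
    if upstream_state in ("failed", "exception"):
        return "cancel"     # no point running Task 2 without its inputs
    return "wait"           # stay unscheduled until Task 1 resolves

def wait_for_upstream(get_state, poll_seconds=30):
    """Poll Cluster A until Task 1 resolves, then return the action.

    `get_state` would wrap a call to the upstream queue's status
    endpoint; a real federated protocol would push events between
    clusters rather than polling.
    """
    while True:
        action = downstream_action(get_state())
        if action != "wait":
            return action
        time.sleep(poll_seconds)
```

The interesting part is that each cluster only exposes task states and artifacts; security, billing, and configuration stay local to each cluster's owner.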

We might have an upstream dependency that tends to break everything in our pipeline whenever a new release of that dependency ships. Our current solution might involve pinning this dependency and periodically bumping the pin, debugging and bisecting any bustage at that time. But what if we could fire a set of unit and integration tests, not just when the dependency ships a new release, but whenever their CI runs a build off of trunk? We could detect the breaking change much more easily and quickly. Cross-cluster communication would allow for this cross-cluster dependency while leaving the security, billing, and configuration decisions in the hands of the individual cluster owners.

Or we could split up the FirefoxCI cluster. The release pipeline could move to a locked-down cluster for enhanced security monitoring. In contrast, the CI pipeline could remain accessible for easier debugging. We wouldn't want to split the hardware test pools. We have a limited number of those machines. Instead, we could create a third cluster with the hardware test pools, triggering tests against builds generated by both upstream clusters.

Of course, we wouldn't need this layer if we only wanted to connect to one or two upstreams. This concept starts to shine when we scale up. Many-to-many.

Many-to-many

Open source projects could collaborate in ways we haven't yet seen. But this isn't limited to just open source.

If the US were using Taskcluster, municipal offices could collaborate, each owning its own cluster but federating with the others. The state could aggregate each municipality's data and generate state-wide numbers and reports. The federal government could aggregate the states' data and federate with other nations. NGOs and corporations could also access public data. Traffic. Carbon. The disappearance and migration of wildlife. The spread of disease outbreaks.

A graph of cluster interdependencies. A matrix. A web, if you will. But instead of connecting machines or individuals, we're connecting automation platforms. Instead of cell-to-cell communication, this is organ-to-organ communication.

(As I mentioned in the Tl;dr): A federated protocol for automation platforms could usher in a new era of collaboration between open-source projects, corporations, NGOs, and governments. This collaboration could happen, not at the latency of human handoffs, but at the speed of automation.


Since my last blog post, we've released seven more 1.0.0bX betas and a 2.0.0 final.

Since then, we've added beetmover, balrog, and pushapk scriptworker instance types, with chain of trust support, and upgraded them off of the now-retired 0.7.x branch. We now have a live_backing.log for easier treeherder log viewing. Our configs are now recursively frozen for more immutable goodness, and we have an unfreeze function as well. We're now running scriptworker instances against tier1 linux and android Firefox nightlies (and developer edition). And we have more contributors and contributions, including two releases pushed by jlund and jlorenzo.

Why 2.0.0? First, we introduced some backwards-incompatible changes, and decided that the spirit of semver rule 5 included 1.0.0 betas. Why not 2.0.0b1? We're in production, tier 1, so let's stop futzing with betas and call it 2.0.0. The major version should increment fairly rapidly, since we have a number of changes in the pipeline that may be backwards incompatible. Skipping 1.0.0 and getting used to larger major version numbers seems like a good first step.

Thanks to Johan and Jordan, to Mihai for the beetmover and balrog work, to Pankaj Ahuja for the recursive freeze/unfreeze functions, and to a bunch of other people on the Releng and Taskcluster teams for all their help!

Tl;dr: I just shipped scriptworker 1.0.0b1 (changelog) (github) (pypi).
This enables chain of trust verification for signing scriptworkers.

chain of trust

As I mentioned before, scriptworkers allow for more control and auditability around sensitive release-oriented tasks in Taskcluster. The Chain of Trust allows us to trace requests back to the tree and verify each previous task in the chain.

We have been generating Chain of Trust artifacts for a while now. These are gpg-signed json blobs with the task definition, artifact shas, and other information needed to verify the task and follow the chain back to the tree. However, nothing has been verifying these artifacts until now.
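The shape of such an artifact can be sketched roughly as below. The field names are illustrative rather than scriptworker's exact schema, and the real blob is additionally gpg-signed before upload:

```python
import hashlib

def artifact_sha256(data: bytes) -> str:
    """Hash an artifact's bytes so downstream tasks can verify that
    what they downloaded is what the upstream task produced."""
    return hashlib.sha256(data).hexdigest()

def build_cot_blob(task_id, run_id, task_def, artifacts):
    """Assemble an (unsigned) chain-of-trust json blob.

    `artifacts` maps artifact names to their raw bytes; the blob
    records the task definition plus each artifact's sha so the
    next task in the chain can verify both.
    """
    return {
        "taskId": task_id,
        "runId": run_id,
        "task": task_def,
        "artifacts": {
            name: {"sha256": artifact_sha256(data)}
            for name, data in artifacts.items()
        },
    }
```

A downstream verifier would recompute each artifact's sha, compare it against the blob, then repeat the process for the blob's own upstream tasks until it reaches the tree.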

With the latest scriptworker changes, scriptworker follows and verifies the chain of trust before proceeding with its task. If it finds any discrepancy during verification, it marks the task invalid and stops before running any task logic. This is effectively a second factor verifying task request authenticity.

scriptworker 1.0.0b1

1.0.0b1 is largely two pull requests: scriptworker.yaml, which allows for more complex, commented config, and chain of trust verification, which grew a little large (275k patch!).

This is running on signing scriptworkers which sign nightlies on date-branch. We still need to support and update the other scriptworker instance types to enable end-to-end chain of trust verification.


I was planning for the 0.9.0 release to revolve around Chain of Trust verification. However, some pexpect async issues reared their ugly head. The fix is in scriptworker 0.9.0 (changelog) (github) (pypi); Chain of Trust verification will land in the next release, likely 1.0.0.

scriptworker 0.9.0

While working on the chain of trust verification code, I noticed that more than half the time I'd hit async pexpect errors during testing (we used async pexpect to sign gpg keys in the background).

This was just a personal annoyance until bug 1311111 (please start landing docker-worker pubkeys in gpg repo) landed, and production signing scriptworker instances barfed on async pexpect errors.

The solution either called for fixing the upstream bug, or pulling the gpg homedir creation/rebuild out into its own process. We opted for the latter solution; so far this seems to be working much more smoothly.
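The "separate process" approach can be sketched with a plain synchronous `subprocess.run`: the gpg work happens in its own process, so no async pexpect is involved at all. The command being run here is a stand-in, not scriptworker's actual entry point:

```python
import subprocess
import sys

def run_in_separate_process(cmd):
    """Run a blocking job (e.g. a gpg homedir creation/rebuild) in
    its own process, synchronously, instead of driving it with async
    pexpect inside the worker's event loop.

    Raises on failure so the caller can log and retry.
    """
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        raise RuntimeError(f"subprocess failed: {result.stderr}")
    return result.stdout

# Illustrative usage; a real invocation would run a gpg-homedir
# rebuild script rather than an inline print.
output = run_in_separate_process(
    [sys.executable, "-c", "print('homedir rebuilt')"]
)
```

The tradeoff is losing in-event-loop concurrency for that step, but for a periodic rebuild job, robustness wins.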


Tl;dr: I shipped scriptworker 0.8.2 (changelog) (github) (pypi) and scriptworker 0.7.2 (changelog) (github) (pypi) last Monday (Oct 24), and am only now getting around to blogging about them.

These are patch releases, and fix the polling loop.

scriptworker 0.8.2

The fix for bug 1310120 (puppet reinstalls scriptworker on every run) exposed some scriptworker loop bugs: because puppet had been restarting scriptworker regularly, we had never had a long-running instance before.

:kmoir and :Callek noticed that signing scriptworker wasn't signing (nagios alerts on stuck queues are pending =\ ). It was clear that while git polling was continuing on its merry way, the task polling was dying fairly quickly. We also needed more logging around fatal exceptions.
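The failure mode — one loop quietly dying while its sibling keeps running — suggests a supervisor pattern: wrap each loop so a fatal exception is logged loudly instead of vanishing. This is a minimal sketch of that idea, not scriptworker's actual loop code:

```python
import asyncio
import logging
import traceback

log = logging.getLogger(__name__)

async def supervised(name, coro_fn):
    """Run one worker loop; if it dies, log the full traceback and
    return the exception instead of letting the loop vanish silently
    (the 0.8.2 symptom: git polling kept going while task polling
    had quietly crashed)."""
    try:
        await coro_fn()
        return None
    except Exception as exc:
        log.error("loop %r died:\n%s", name, traceback.format_exc())
        return exc

async def flaky_task_poll():
    # Stand-in for a task-polling loop that hits a fatal error.
    raise RuntimeError("queue poll failed")
```

A supervisor could then decide per-loop whether to restart, back off, or take the whole worker down.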

I addressed these issues in scriptworker 0.8.2. We also have our third scriptworker contributor, :jlorenzo! (:jlund was #2)

scriptworker 0.7.2

Since we already had the 0.7.1 release to spare other scriptworker instance types the churn from pre-1.0 changes, I backported the 0.8.2 fixes to the 0.7.x branch and released 0.7.2 off of it. This involved enough merge conflicts that I'm hoping to avoid too many more 0.7.x releases.


Tl;dr: I just shipped scriptworker 0.8.1 (changelog) (github) (pypi) and scriptworker 0.7.1 (changelog) (github) (pypi)
These are patch releases, and are currently the only versions of scriptworker that work.

scriptworker 0.8.1

The json, embedded in the Azure XML, now contains a new property, hintId. Ideally this wouldn't have broken anything, but I was using that json dict as kwargs, rather than explicitly passing taskId and runId. This means that older versions of scriptworker no longer successfully poll for tasks.

This is now fixed in scriptworker 0.8.1.
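The failure is easy to reproduce in miniature. The handler name below is illustrative, not scriptworker's real API, but the mechanism is exactly this: splatting a message dict as kwargs fails the moment the service adds a property the signature doesn't know about.

```python
def claim_work(taskId, runId):
    """Handler that only expects the two fields it actually uses."""
    return (taskId, runId)

# An Azure queue message after the service added a new property.
message = {"taskId": "abc123", "runId": 0, "hintId": "some-hint"}

# Old pattern: splatting the whole dict as kwargs breaks on the
# unknown hintId key with a TypeError.
broke = False
try:
    claim_work(**message)
except TypeError:
    broke = True

# 0.8.1-style fix: pass only the needed fields explicitly, so new
# properties are ignored rather than fatal.
result = claim_work(message["taskId"], message["runId"])
```

Explicitly selecting fields makes the worker forward-compatible with whatever properties the queue grows next.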

scriptworker 0.7.1

Scriptworker 0.8.0 made some non-backwards-compatible changes to its config format, and there may be more such changes in the near future. To simplify things for other people working on scriptworker, I suggested they stay on 0.7.0 for the time being if they wanted to avoid the churn.

To allow for this, I created a 0.7.x branch and released 0.7.1 off of it. Currently, 0.8.1 and 0.7.1 are the only two versions of scriptworker that will successfully poll Azure for tasks.


Tl;dr: I just shipped scriptworker 0.8.0 (changelog) (RTD) (github) (pypi).
This is a non-backwards-compatible release.

background

By design, taskcluster workers are very flexible and user-input-driven. This allows us to put CI task logic in-tree, which means developers can modify that logic as part of a try push or a code commit. This allows for a smoother, self-serve CI workflow that can ride the trains like any other change.

However, a secure release workflow requires certain tasks to be less permissive and more auditable. If the logic behind code signing or pushing updates to our users is purely in-tree, and the related checks and balances are also in-tree, the possibility of a malicious or accidental change being pushed live increases.

Enter scriptworker. Scriptworker is a limited-purpose taskcluster worker type: each instance can only perform one type of task, and validates its restricted inputs before launching any task logic. The scriptworker instances are maintained by Release Engineering, rather than the Taskcluster team. This separates roles between teams, which limits damage should any one user's credentials become compromised.
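"Validates its restricted inputs" amounts to rejecting any task outside the narrow shape a given worker type supports. A toy version of that check might look like this; the scope strings and payload keys are illustrative, not scriptworker's real schema:

```python
# Each scriptworker instance type would carry its own allowlists.
ALLOWED_SCOPES = {"project:releng:signing:format:gpg"}
ALLOWED_PAYLOAD_KEYS = {"upstreamArtifacts", "signingFormats"}

def validate_task(task):
    """Return a list of validation errors for a task definition;
    an empty list means the task is acceptable for this worker type.

    Anything unexpected -- an unknown scope, a stray payload key --
    is an error, rather than being silently passed through to the
    task logic.
    """
    errors = []
    for scope in task.get("scopes", []):
        if scope not in ALLOWED_SCOPES:
            errors.append(f"disallowed scope: {scope}")
    for key in task.get("payload", {}):
        if key not in ALLOWED_PAYLOAD_KEYS:
            errors.append(f"unexpected payload key: {key}")
    return errors
```

Failing closed like this is what keeps an in-tree change from smuggling arbitrary behavior into a signing or update-push task.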

scriptworker 0.8.0

The past several releases have included changes involving the chain of trust. Scriptworker 0.8.0 is the first release that enables gpg key management and chain of trust signing.

An upcoming scriptworker release will enable upstream chain of trust validation. Once enabled, scriptworker will fail fast on any task or graph that doesn't pass the validation tests.
