I keep running into the same questions, project after project, company after company. How do I see who broke the build? How do I know if this bug has been fixed in this codeline? How do I see the difference between these two builds? And how can we make this all happen faster? Smoother? Easier? And you revisit the same questions as new tangles and complexities of scale are added to the equation.
I joined Netscape in 2001, partially to play with a bunch of weird unices, partially to see How Build Engineering is Done Properly. I had already tackled LXR/ViewCVS + Bonsai + Tinderbox elsewhere, and that toolchain has loomed large at every place I've been, at least in concept. Source browsing + repository querying + continuous integration, no matter which specific products you use.
Here I am, back at Mozilla after a number of years away, and I'm amused (and pleased) to see that the current state of things is surprisingly analogous to how I designed our build system elsewhere. We had the db with history and archived configurations; buildbot has the daemon with real-time log views; but otherwise things are fairly close. Both systems are in similar stages of development, ready to take that next step.
Here are some thoughts about where that next step could go. Please keep in mind that, for the moment, I'm trying to ignore how much work this will all be to implement, so I can write down my ideas without cringing.
[Is hgweb a strong bonsai replacement?]:
Installing Bonsai or its equivalent has been my immediate priority for each new project I've joined. Why? It gives you immediate insight into the code base. What's changed. Who's changing it. The full history, across all branches.
I honestly don't see how you manage a large-scale CVS project without it. You could manage in Subversion, possibly, but Kamikaze and ViewVC's query database still give you so much more than svn log: cross-repository queries and GUI-user friendliness come to mind.
As for hgweb: it seems to be working for people currently. I can't help but feel like it's missing a lot. I don't claim to be a Mercurial expert by any means. But at first blush:
1. How do you differentiate which checkins are in linear history? (Covered here)
2. Where's the guilty column? Not sure if this is a missing feature in hgweb or if it's just a matter of needing tinderbox/hg code. But tinderbox used to query bonsai for this info.
3. Where are the complex queries? I see you can query by date range in the pushlog, but I haven't found a way to also query by author, bug number, whether the checkin was in linear history or not, and combinations using AND or OR.
4. How do I query which repositories bug 12345 has been fixed in? (cross-repo queries)
5. Given a query, how do I tell which build first contained a checkin, which build preceded that one, where to get them, and what the delta is between the two builds in terms of code, performance, and unit test results? (cross-system links)
6. For that matter, given a query for any two arbitrary checkins in any supported repositor(y|ies), how do we find the above deltas? (cross-repo cross-system links)
Point 4 came up in conversation. We were discussing updating Bugzilla with checkin information, which I've done previously. Adding comments or keywords to a bug is certainly one approach, at the risk of potentially cluttering already-cluttered bugs further. Linking the bug to a Bonsai2/hgweb query that gives you the same information is another. We're still thinking about this one.
Points 5 and 6 depend on a build db, which I'll discuss in more detail shortly.
Whether these features are wanted/needed by the community, and whether desired features are best written into hgweb or into an external database like Bonsai's, remain to be seen.
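To make the query questions above a bit more concrete, here is a minimal sketch of what an external checkin database could look like. Everything in it is an assumption for illustration: the `checkins` table, its columns, the repository names, and the sample rows are all made up, and sqlite3 simply stands in for whatever database would actually be used. It only shows the cross-repo bug query and AND-combined filters; OR combinations and linear-history flags are left out.

```python
import sqlite3

# Hypothetical schema: one row per checkin, across all repositories.
SCHEMA = """
CREATE TABLE checkins (
    repo     TEXT NOT NULL,
    revision TEXT NOT NULL,
    author   TEXT NOT NULL,
    bug      INTEGER,
    pushed   TEXT NOT NULL   -- ISO 8601 date, so string comparison sorts correctly
);
"""

# Made-up sample data, for illustration only.
SAMPLE_ROWS = [
    ("main-repo", "a1b2c3", "alice", 12345, "2008-06-01"),
    ("release-branch", "d4e5f6", "alice", 12345, "2008-06-03"),
    ("main-repo", "0789ab", "bob", 67890, "2008-06-02"),
]


def open_db(rows):
    """Create an in-memory database and load the given checkin rows."""
    db = sqlite3.connect(":memory:")
    db.executescript(SCHEMA)
    db.executemany("INSERT INTO checkins VALUES (?, ?, ?, ?, ?)", rows)
    return db


def repos_with_fix(db, bug):
    """Cross-repo query: which repositories have a checkin for this bug?"""
    cur = db.execute(
        "SELECT DISTINCT repo FROM checkins WHERE bug = ? ORDER BY repo", (bug,)
    )
    return [row[0] for row in cur]


def query_checkins(db, author=None, bug=None, since=None, until=None):
    """AND together whichever filters were supplied."""
    filters = (
        ("author", "=", author),
        ("bug", "=", bug),
        ("pushed", ">=", since),
        ("pushed", "<=", until),
    )
    clauses, params = [], []
    for column, op, value in filters:
        if value is not None:
            clauses.append(f"{column} {op} ?")
            params.append(value)
    where = " AND ".join(clauses) or "1"  # no filters: match everything
    sql = f"SELECT repo, revision FROM checkins WHERE {where} ORDER BY pushed, repo"
    return db.execute(sql, params).fetchall()


if __name__ == "__main__":
    db = open_db(SAMPLE_ROWS)
    print(repos_with_fix(db, 12345))
    print(query_checkins(db, author="alice", since="2008-06-02"))
```

A bug could then link to a saved query like `repos_with_fix` instead of collecting a comment for every checkin.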
[Tinderbox: waterfall vs dashboard]:
Ken Estes had a hard time convincing Netscape to switch to Tinderbox 2, but I felt its modularity and ease of extensibility and customization made it ideal for projects unburdened by the history and existing ingrained processes Moz has. It's great for what it is. As is Tinderbox 1, in its own special way.
Having said that, there are plenty of Tinderbox detractors, and it's hard to defend certain Tinderbox behaviors. Its reliance on procmail, for instance. And it's fairly easy to see how a single waterfall can get overwhelmed with information.
rhelmer covered the current tinderbox/buildbot split, and is among the voices I've heard/read calling for a move away from the waterfall view, a call I don't completely understand. I do understand that the waterfall is far from ideal as a solitary view. But it does represent the activity of builds and build machines over a brief amount of time quite well. Even better when you have a guilty column ;-)
So, why not have both? Or multiple? Not to clutter, but to present different ways of accessing the data. Each with their own strengths.
We already have different views. The waterfall is one. Another is the sidebar panel or summary view. But we could add something similar to the CruiseControl and Bamboo dashboard pages. Or create something somewhere in between: collapsible "groups" of builds with ETAs and links to drill down. Customizable per-user views. Graphs of tree-openness over time. Each of these brings something to the table that the others aren't as strong at.
(joduinn points out these CruiseControl and Agitar screenshots)
And how do you add a view? Not by putting more load on procmail, certainly. These would be built from build database queries. If we build an anonymous read-only db mirror, or generate the necessary feeds, community members could create their own views.
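As a rough illustration of a view built from a database query, here is a sketch of a dashboard-style summary: the latest finished build for each builder. The `builds` table, its columns, and the sample rows are hypothetical, since no build database exists yet; sqlite3 is only a stand-in.

```python
import sqlite3

# Made-up sample data: (builder, revision, status, finished timestamp).
SAMPLE_BUILDS = [
    ("linux", "r1", "success", "2008-06-01T10:00"),
    ("linux", "r2", "failure", "2008-06-01T11:00"),
    ("mac", "r1", "success", "2008-06-01T10:30"),
]


def make_build_db(rows):
    """Create an in-memory stand-in for the (not yet existing) build db."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE builds "
        "(builder TEXT, revision TEXT, status TEXT, finished TEXT)"
    )
    db.executemany("INSERT INTO builds VALUES (?, ?, ?, ?)", rows)
    return db


def dashboard(db):
    """Return one (builder, status, revision) row per builder: its latest build."""
    return db.execute(
        "SELECT b.builder, b.status, b.revision "
        "FROM builds b "
        "JOIN (SELECT builder, MAX(finished) AS finished "
        "      FROM builds GROUP BY builder) latest "
        "  ON b.builder = latest.builder AND b.finished = latest.finished "
        "ORDER BY b.builder"
    ).fetchall()


if __name__ == "__main__":
    for builder, status, revision in dashboard(make_build_db(SAMPLE_BUILDS)):
        print(f"{builder:10} {status:8} {revision}")
```

A waterfall, a per-user view, or a tree-openness graph would be different queries against the same tables, which is why a read-only mirror would let community members build their own.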
I saw the hard sell that Ken Estes faced, trying to convince people that something as vital as the Tinderbox waterfall could be changed without altering the Mozilla Way. My solution would be to add views over time, rather than replace them; we can trim if and when old views are no longer wanted/needed.
However, there is that pesky little fact that the build database doesn't exist yet.
[Buildbot: strengths and weaknesses]:
I've only recently started using buildbot. I don't claim to be an expert. Here are my impressions in my brief amount of time using it:
Single point of failure. If the buildbot master goes down, that's it. Everything's dead in the water. We've discussed clustering buildbot, but I think the real answer is a db.
But this is more than just emergency cases. This is also true of routine maintenance or configuration updates. In many cases you can just reconfigure buildbot while it's running. But many times this requires system-wide downtime, whether planned or accidental.
High load on the buildbot master. There is definitely an upper limit to the number of builds that can be run from a single server, and there is no way to tie multiple masters together.
No build slave independence. We've been rebooting our talos slaves as a way to stabilize performance test numbers. We ignore the return value so it doesn't make the build fail, but the slave daemon disconnect keeps showing up red or purple in the buildbot waterfall.
I envision occasional maintenance steps: after X number of builds, do some disk cleanup, defragmentation, reboot, without false positive errors. I also envision setting multiple masters in the build slave config, for both load balancing and failover. The build slaves, for the most part, are smart enough to run Python. They should be allowed to run some steps without a full-time server/client leash, as long as the appropriate server can query status. And it shouldn't matter whether the server that initiated the build request is the same that takes the final results, as long as the appropriate failover occurred.
Limited views and history querying. Database.
(joduinn points out: Scrolling back in a waterfall doesn't work for this, especially whenever timestamps don't line up, or when items start to fall off the bottom of the page. Waterfall also does not work for "when did this intermittent test last fail" - a common question. Letting people roll their own SQL query / plugin / easy-to-use web UI / firefox addon, seems better for both of these usage cases.)
Loss of state. Build queues are kept in memory. Downtime not only means killing running builds, but also clearing any queued builds.
Splitting out the queuing step to a specifiable object would work; drop in an object that saves the queue to disk, or send it to the database.
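Here is a minimal sketch of that "specifiable object" idea. It is not buildbot code: `MemoryQueue` and `DiskQueue` are hypothetical names, requests are plain dicts, and JSON is just a convenient on-disk format for the example. The point is only that the two classes share one interface, so the disk-backed one can be dropped in without touching the caller.

```python
import json
import os
from collections import deque


class MemoryQueue:
    """Keeps requests in memory only; everything is lost on restart."""

    def __init__(self):
        self._items = deque()

    def put(self, request):
        self._items.append(request)

    def get(self):
        """Return the oldest request, or None if the queue is empty."""
        return self._items.popleft() if self._items else None

    def __len__(self):
        return len(self._items)


class DiskQueue(MemoryQueue):
    """Same interface, but rewrites a JSON file after every change."""

    def __init__(self, path):
        super().__init__()
        self._path = path
        if os.path.exists(path):
            # Pick up whatever was queued before the last shutdown.
            with open(path) as f:
                self._items = deque(json.load(f))

    def _save(self):
        # Write to a temp file, then rename, so a crash mid-write
        # never leaves a half-written queue file behind.
        tmp = self._path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(list(self._items), f)
        os.replace(tmp, self._path)

    def put(self, request):
        super().put(request)
        self._save()

    def get(self):
        request = super().get()
        self._save()
        return request
```

A database-backed queue would be a third class with the same `put`/`get` methods.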
Single config. This could be seen as a feature or a bug. I tend to favor picking up build config/script changes automatically from source, which helps preserve history of which config was used in which build.
I haven't really fully fleshed this one out. Spawning multiple buildbot/twistd processes, much like httpd, to allow old-config builds to complete. Or having multiple brief .cfg files checked in to help make changes more granular. This will take a bit more exploration. But updates from source per build would be nice.
[Pull, not push]:
During one of my interviews for Mozilla, I covered build database architecture. I pretty much covered what I had already planned and [mostly] implemented elsewhere, and the designs matched perfectly, except for one thing.
Buildbot pushes queued builds to its slaves via twisted. I took a similar approach independently, using sshd; when a single server coordinates everything, it's fairly easy to take the push model. But during the interview, I was asked specifically about pulls.
Overall, same thing, right? Pushing from a central server, or having multiple servers pull from a central queue.
Except if you pull from a central queue, rather than push from that queue, you remove your dependence on a single build server knowing everything. The database becomes central, but there are already established ways of mirroring those.
After thinking about this for a while, I thought why not? Two or five servers queuing builds, each able to take over the others' responsibilities if needed. Three or ten servers pulling builds from the queue, managing distributed build farms, each able to take over others' builds and build clients if one or more servers were to go down. The database is the central thing, not any instance of buildbot. And all the configs are in source.
The queuers could also do routine maintenance: act on stalled builds, determining status and requeuing or failing out as necessary. Inserting maintenance tasks as needed into host pools, but without taking too many build hosts out of any one pool at any time.
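A rough sketch of the pull model, under stated assumptions: the `queue` table, its columns, and the function names are all hypothetical, sqlite3 stands in for the central database, and timestamps are plain numbers. `claim_next` is what a puller would call; the conditional UPDATE is what keeps two pullers from taking the same build. `requeue_stalled` is the kind of maintenance a queuer could run.

```python
import sqlite3


def make_queue(revisions):
    """Create an in-memory stand-in for the central queue, one row per request."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE queue ("
        " id INTEGER PRIMARY KEY,"
        " revision TEXT NOT NULL,"
        " state TEXT NOT NULL DEFAULT 'pending',"
        " owner TEXT,"
        " claimed_at REAL)"
    )
    with db:
        db.executemany(
            "INSERT INTO queue (revision) VALUES (?)", [(r,) for r in revisions]
        )
    return db


def claim_next(db, worker, now):
    """Claim the oldest pending build for `worker`; return its id or None."""
    with db:  # commit on success, roll back on error
        row = db.execute(
            "SELECT id FROM queue WHERE state = 'pending' ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        # Only succeeds if nobody else claimed the row in the meantime.
        cur = db.execute(
            "UPDATE queue SET state = 'running', owner = ?, claimed_at = ? "
            "WHERE id = ? AND state = 'pending'",
            (worker, now, row[0]),
        )
        return row[0] if cur.rowcount == 1 else None


def requeue_stalled(db, now, timeout):
    """Put builds that have been running longer than `timeout` back in the queue."""
    with db:
        cur = db.execute(
            "UPDATE queue SET state = 'pending', owner = NULL, claimed_at = NULL "
            "WHERE state = 'running' AND claimed_at < ?",
            (now - timeout,),
        )
        return cur.rowcount
```

Because all state lives in the table, it doesn't matter which puller claims a build or which queuer notices that it stalled.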
With this model, I would guess -- not promise or guarantee, but strongly guess -- we could scale up by an order of magnitude. Add quote-unquote buildbot masters as queuers or build coordinators as needed. Different groups could have their own sets if needed or desired. And if the db became the bottleneck, I imagine we could find ways to logically split it up.
I've never seen a build system anywhere near that scalable. Except in my head. But it's exciting to think about.
[Still a ways out]:
I've been expounding on some of my ideas about what we could potentially do with a build db... which doesn't exist yet. It may take a while.
I think the first step is to be able to dump queues to disk, possibly in yaml, possibly in python, and be able to read those queues back into buildbot at start. Then add a module that lets us queue to disk by default. Hopefully during the process we can reduce the amount of information in the request to the bare bones: revision, requester, build type, time of request, etc., and separate it from the actual build logic.
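For illustration, a bare-bones request might look something like the sketch below. The `BuildRequest` class and its field names are my assumptions, taken from the list above (revision, requester, build type, time of request), and JSON stands in for yaml only because it ships with Python. It holds no build logic at all, just enough to dump a queue and read it back.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class BuildRequest:
    """Hypothetical bare-bones request: what to build, not how to build it."""

    revision: str
    requester: str
    build_type: str
    requested_at: str  # ISO 8601 timestamp


def dump_requests(requests):
    """Serialize a list of requests to text that can be written to disk."""
    return json.dumps([asdict(r) for r in requests], indent=2)


def load_requests(text):
    """Rebuild the list of requests from previously dumped text."""
    return [BuildRequest(**item) for item in json.loads(text)]


if __name__ == "__main__":
    # Made-up sample request.
    queue = [BuildRequest("a1b2c3", "alice", "nightly", "2008-06-01T10:00:00")]
    text = dump_requests(queue)
    print(text)
    print(load_requests(text) == queue)
```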
I'm not quite used to working by announcing plans or ideas in public before they're implemented; even at NSCP we tended to keep things in-house unless they were Mozilla-specific. Hopefully some of these ideas resonate with other people as well; let me know. And hopefully people can wait a bit if they really want some of these ideas implemented.
... Back to work on the Nokias and try server.