19:19 “It’s only updates”


Nightly updates are broken again.
And not in a “Why can’t I get my Firefox 1.0.3 Amiga build to auto-update to Firefox 2.0 alpha3” sort of way. In more of a “Every product, from every version, to any version, it’s all broken!!!” sort of way.
We know, and morgamic and I have been working on it basically straight for the past couple of days.
Here’s where we are, and what happened…

The first thing to know is that the release update system is somewhat different from the nightly update system. The nuts and bolts and building blocks are the same, but the way nightly updates, and the information about those updates, get generated differs from the process we use for Firefox and Thunderbird releases.
Two things that are, overall, very good for the Mozilla Project worked together to start the process of update brokenness:

  • Changes to the mirror modules, which made the mirror module smaller, and thus all our mirrors around the world happier, caused certain build deliverables, namely nightlies, to no longer appear in locations… well… around the world. Some build machines had been configured to rely on specific mirrors and specific locations, which were no longer hosting those deliverables, so the downloads failed, and nightly patches could not be created.
  • A lot of the heavy lifting for AUS was being done on a machine that was also doing a bunch of other work. This machine was old and creaky, and we’ve been wanting to replace it with a machine that has all the spiffy RAIDed disks, and which is not completely overworked. The replacement consisted of two subtasks:
    • Moving the AUS data from the old machine to the new machine
    • Using the version of the AUS server that is now in public CVS. Mike led the charge here, and released the server-side AUS components, beating me to the punch on releasing the patch-generation tools. We wanted to switch to this version of the code, which includes the extremely useful “always jump nightlies to the latest build” patch.

All in all, this is a very positive and useful set of changes, but… they all conspired together to break nightly updates in weird and wonderful ways.
After solving the FTP module change problem, we found that a lot of the nightly patch generation infrastructure had hardcoded path- and host-names, causing it to break after the move.
After fixing that, Mike and I found that the AUS server wasn’t returning the right results; this turned out to be an older instance of AUS running that hiccuped on some of the new data we were stuffing into it.
After fixing this, we spent some time fixing some assumptions throughout the patch generation and AUS server-side code regarding how nightlies get updated, namely the “jump everyone to the latest build” functionality, which was entirely new and had some implications we weren’t expecting. We’re still working through those issues, mainly in bugs 341752 and 342549.
Throughout debugging this, we’ve figured out a couple of areas to focus on for next time:

  • We need to investigate how to bring up (and use) an AUS staging server that nightlies point to, that is entirely different from the AUS production server. This will allow us to make and test changes to server-side AUS code without it affecting the production AUS system. This may seem like a “Well, duh” idea, and such testing is possible to some extent today, since the aus2-staging server is different from the aus2 server, but there are some client changes that may need to be made so that we can produce nightlies that point to a different server.
  • We need to focus on some automation tools around testing nightly updates. Mike and Nick Thomas (cf on IRC) have both done some excellent work in this area: Mike’s tests are available in CVS as part of the AUS package, and Nick’s tests query a different, more consumer-focused set of data. We need to incorporate that work into the nightly nagios checks we do for builds, because nightly updates for all active branches are priority build deliverables, and not having automated checks for them is a problem.
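
To make the second item a bit more concrete, here is a rough sketch of the kind of check that could sit behind a nagios probe. It is not Mike’s or Nick’s test code. The function name `check_update_response`, the sample responses, and the element and attribute names (`updates`, `update`, `patch`, `buildID`, `URL`, `size`) are all assumptions made for illustration, and the sketch only inspects a response that has already been fetched.

```python
import xml.etree.ElementTree as ET

# Hypothetical update responses. The XML layout here is an assumption
# for illustration, not a description of what the AUS server returns.
SAMPLE_RESPONSE = """<updates>
  <update type="minor" version="2.0a3" buildID="2006062504">
    <patch type="complete" URL="http://example.invalid/complete.mar" size="8000000"/>
    <patch type="partial" URL="http://example.invalid/partial.mar" size="400000"/>
  </update>
</updates>"""

EMPTY_RESPONSE = "<updates></updates>"


def check_update_response(xml_text, expected_build_id=None):
    """Return a list of problems found in an update response (empty if OK)."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as err:
        return ["response is not well-formed XML: %s" % err]

    updates = root.findall("update")
    if not updates:
        return ["no update offered"]

    problems = []
    for update in updates:
        build_id = update.get("buildID")
        # Optionally confirm the server is offering the build we expect.
        if expected_build_id is not None and build_id != expected_build_id:
            problems.append(
                "offered build %s, expected %s" % (build_id, expected_build_id)
            )
        patches = update.findall("patch")
        if not patches:
            problems.append("update has no patches")
        for patch in patches:
            # Every patch needs a download location and a plausible size.
            if not patch.get("URL"):
                problems.append("patch is missing a URL")
            size = patch.get("size", "")
            if not size.isdigit() or int(size) == 0:
                problems.append("patch has a bad size: %r" % size)
    return problems


if __name__ == "__main__":
    for name, body in (("sample", SAMPLE_RESPONSE), ("empty", EMPTY_RESPONSE)):
        print(name, check_update_response(body, "2006062504"))
```

A probe built on this would fetch the update response for each active branch, run the check, and report a failure whenever the returned list is non-empty, so that missing nightly updates show up in monitoring rather than in #build.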

A lot of people have wandered into #build, asking when updates are going to be back. I’ve had people wander by my desk and ask about them. We know they’re broken, and we’re working on fixing them.
But really, this isn’t good enough. So next month I’ll be adding to the Build Team radar a focus on getting some verification automation up, so that we know sooner when nightly updates are missing.
We don’t think missing nightly updates are an acceptable state of affairs either, and we aim to provide better service than that.
The first step to fixing the problem, though, will be better visibility into when it occurs.