A case of the Mondays… on a Wednesday

10/12/2006

Last time I did 2.0 release candidate localization builds, I ran into a problem with the tagging and checkout.
For various reasons, we only tag the locales that we ship for a particular release with the _RELEASE tag. During the build, client.mk checks out every locale specified in all-locales. This has worked beautifully for the 1.5.0.x maintenance series, since the locales we ship aren’t shifting a lot, there aren’t many, if any, new locales on that branch, and if you check out a directory that isn’t tagged, you get… nothing. Exactly what we want, right?
Well, we repeated this process for the 2.0 RCs, and suddenly, the builds started failing with “cvs [checkout aborted]: no such tag”. This was certainly a surprise; “I had just created the tag,” I kept thinking to myself.
“Am I going insane?”
I started debugging it, and was only able to reproduce it once originally. Once I got a checkout going, it seemed to work repeatably, and since I was busy with 2.0 RC 2, I didn’t investigate more.
Well, it happened again with today’s l10n builds. Originally, rhelmer and I thought it might be a problem of using the wrong CVSROOT, since cvs-mirror.m.o only gets updated every few minutes, and I had just created the tag. This didn’t make a huge amount of sense, as the command is run with -d to specify the CVSROOT directly. A peek at the source confirms that -d takes precedence over $CVSROOT and CVS/Root. Then I thought maybe it was a compatibility problem. It turns out that we use CVS 1.11.2 on the client side to create release tags; maybe this is so old (it was released in 2003) that it was hiccuping on something server-side?
After trying to reproduce this problem for rhelmer, I was only able to reproduce it once before it worked. Again. Something must be modifying the state server-side.
After some experimentation and more source reading, it turns out that an “optimization” introduced in the CVS 1.11 line, so-called “val-tags”, is responsible for the bug.
In a nutshell, when using val-tags, CVS searches for the existence of a tag by 1) looking into val-tags, and then 2) looking at the RCS files themselves. This normally isn’t a problem, except in the case where an untagged directory is requested before a tagged directory. In the case of RC2, l10n/af was not part of the release, and therefore untagged, but l10n/ar was tagged. They were checked out in that [alphabetical] order.
Running “cvs co l10n/af l10n/ar” will repeatably produce the (incorrect) “invalid tag” error until you run “cvs co l10n/ar” (or some other checkout for which the tag does exist) first. This adds the tag to the “val-tags” file, and after that point, CVS will check the val-tags file first, see that the tag does indeed exist, and then traverse all of the directories you’ve listed, instead of only the first one.
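To make the ordering problem concrete, here is a toy model of the behavior as I've described it above. It is a sketch built from my reading, not a port of the CVS code: ToyRepo, NoSuchTag, and the tag name EXAMPLE_RELEASE are all made up for illustration, and the "only look at the first directory on a cache miss" rule is my summary of the symptom rather than anything lifted from tag.c.

```python
class NoSuchTag(Exception):
    """Stands in for the 'no such tag' abort (hypothetical name)."""


class ToyRepo:
    """Toy model of the val-tags lookup; not real CVS logic."""

    def __init__(self, tags_by_dir):
        # directory name -> set of tags present on its files
        self.tags_by_dir = tags_by_dir
        # stand-in for the server-side val-tags cache
        self.val_tags = set()

    def tag_is_valid(self, tag, dirs):
        # Step 1: a tag already recorded in val-tags is accepted outright.
        if tag in self.val_tags:
            return True
        # Step 2 (assumed from the symptom): on a cache miss, only the
        # first requested directory is searched for the tag.
        if tag in self.tags_by_dir.get(dirs[0], set()):
            self.val_tags.add(tag)  # remember it for next time
            return True
        return False

    def checkout(self, tag, dirs):
        if not self.tag_is_valid(tag, dirs):
            raise NoSuchTag(tag)
        # Untagged directories simply yield nothing.
        return [d for d in dirs if tag in self.tags_by_dir.get(d, set())]


# l10n/af is untagged, l10n/ar carries the (hypothetical) release tag.
repo = ToyRepo({"l10n/af": set(), "l10n/ar": {"EXAMPLE_RELEASE"}})

try:
    repo.checkout("EXAMPLE_RELEASE", ["l10n/af", "l10n/ar"])
except NoSuchTag:
    print("first attempt: no such tag")

# Checking out a tagged directory on its own populates the cache...
print(repo.checkout("EXAMPLE_RELEASE", ["l10n/ar"]))
# ...after which the original command works.
print(repo.checkout("EXAMPLE_RELEASE", ["l10n/af", "l10n/ar"]))
```

In the model, as in real life, the failing command starts working the moment anything puts the tag into the cache, which would explain why I could only ever reproduce the failure once.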
All the gory, buggy details are in tag.c’s tag_check_valid(), which, based upon my very cursory reading of the source code, still exists in CVS 1.11.22.
While reading through the source, I was surprised at the number of comments that… didn’t inspire confidence:

/* FIXME: This routine doesn't seem to do any locking whatsoever
(and it is called from places which don't have locks in place).
If two processes try to write val-tags at the same time, it would
seem like we are in trouble.  */
/* FIXME: should check errors somehow (add dbm_error to myndbm.c?).*/

But, unlike so many other open source projects, at least we have…

/* warm fuzzies */
if (!really_quiet)

I think the moral of today’s story is: open source is cool because you can look at the source… but if you ever do, there’s a very real chance you could become very depressed.
Or scared.
Or both1,2.

***

In mostly unrelated4 news, today is National Coming Out Day.
I only remembered this because I was walking around Google’s campus and saw a sign noting it.
It always seems to sneak up on me every year5.
___________________________
1 Which makes me wonder how much of our source has “gems” like that…
2 I’m sure at least someone out there is asking “So, Bigshot, where’s the patch then?” Well, I spent some time looking at this… and decided that I couldn’t spend any more time wrapping my brain around a function (start_recursion) that calls yet another long function (do_recursion), especially when it seems that most of the CVS devs don’t either3.
3 update.c says /* FIXME-twp: the arguments to start_recursion make me dizzy. This function call was copied from the update_fileproc call that follows it; someone should make sure that I did it right. */
4 Read “completely”
5 It’s much like MMLRD, but on a yearly scale…