Sometimes, I don’t understand VMware at all

09/29/2006

pacifica-vm, which most of you probably know as the Firefox 2 Windows nightly build machine, has had an interesting week.
Last week, I took the machine down to back up its virtual disk image, increase the amount of RAM available to it, and, in an attempt to decrease the cycle time, add a second VMware virtual CPU to the VM.
This didn’t really have the intended effect. Cycle times for both nightly and depend builds went up by about 15%.
Thinking that maybe the build system wasn’t making as efficient use of its shiny new virtual CPU as it could, I upped make’s -j value from 3 to 4. This reduced cycle times… to what they were before the memory/CPU “upgrade,” but also had the useful side-effect that make would hang every few builds, most notably on nightly builds. (That is, incidentally, why nightly builds on Tuesday, Wednesday, and Thursday were all late; make kept hanging overnight with -j4.)
Finally, last night, I removed the second VCPU, but kept the extra memory and higher -j values.
That change not only made the machine start reliably producing nightlies again (or, at least, make stopped hanging), but it took the cycle time down to 40 minutes for a depend cycle, and 2ish hours for a nightly build. (Interestingly enough, that full build value seems to fluctuate anywhere from about 90 minutes to just over a couple of hours; I think this is because the trunk build machine and the 1.8 build machine are on the same VM box, and they’re both starting their nightlies at the same time, which slows both of them down a bit.)
So, to recap here:


                              Nightly Build               Depend Build
Before changes                ~2 hours                    ~1 hour
After memory/CPU “upgrade”    ~2 hours, 20 minutes        ~1 hour, 15 minutes
After adding -j4              ~2 hours; hung often        1 hour; hung often
Remove one VCPU               ~2 hours; jury still out    40 minutes
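
For anyone who wants the recap as percentages, here is a minimal sketch in Python. The minute values are just the rough figures from the recap converted by hand, and the names (TIMES, percent_change, changes_vs_baseline) are made up for illustration, so treat the output as indicative only. With these rounded numbers the memory/CPU "upgrade" works out to somewhere around 17% slower for nightlies and 25% slower for depend builds, and the final configuration to roughly a third faster on depend builds.

```python
# Approximate cycle times in minutes: (nightly build, depend build).
# These are the rough recap figures, not precise measurements.
TIMES = {
    "Before changes": (120, 60),
    "After memory/CPU upgrade": (140, 75),
    "After adding -j4": (120, 60),
    "Remove one VCPU": (120, 40),  # nightly figure is still uncertain
}


def percent_change(before, after):
    """Relative change from `before` to `after`, as a percentage."""
    return (after - before) / before * 100.0


def changes_vs_baseline(times, baseline="Before changes"):
    """Map each configuration to (nightly %, depend %) versus the baseline."""
    base_nightly, base_depend = times[baseline]
    return {
        label: (
            percent_change(base_nightly, nightly),
            percent_change(base_depend, depend),
        )
        for label, (nightly, depend) in times.items()
    }


if __name__ == "__main__":
    for label, (nightly, depend) in changes_vs_baseline(TIMES).items():
        print(f"{label:26s} nightly {nightly:+6.1f}%   depend {depend:+6.1f}%")
```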


Things I’ve learned from this experience:

  • Linking Firefox, especially on Win32, takes memory. A lot of memory. In the couple of trials I paid attention to, it took around 700 megs. Seeing as the VM had only 700 megs before the upgrade, a large part of the problem seemed to be the machine descending into swapping thrash-hell when trying to do the final link.
  • Win32 SMP = Teh suck. I had actually learned this from previous experiences in previous lives, but… a reminder is always good.
  • When you actually pay attention to VMs, and spend some time “tuning” them (which, in this case, amounted to creating a better match between the memory profile for the machine’s task and the virtual hardware), VMs don’t perform all that badly relative to physical hardware. gaius-vm, for instance, has horrible cycle times compared to gaius, but it’s not “just because it’s a -vm.” It’s because no one has paid enough attention to it after migrating it to tune it. (No, I haven’t taken what I’ve learned here and applied it to gaius-vm.)
  • Once again, sometimes… VMware['s performance] continues to confuse the hell out of me.
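
The memory lesson above is really just a headroom check, which can be sketched in a few lines of Python. The 700 MB linker peak and the 700 MB of VM RAM come from the trials mentioned above; the function names, the 128 MB allowance for the OS and other processes, and the 1024 MB comparison value are hypothetical numbers picked purely for illustration, not measurements from pacifica-vm.

```python
def headroom_mb(ram_mb, linker_peak_mb, other_usage_mb=0):
    """Memory left over (in MB) while the final link is at its peak.

    A negative result means the link cannot fit in RAM and the machine
    will have to swap.
    """
    return ram_mb - other_usage_mb - linker_peak_mb


def link_fits_in_ram(ram_mb, linker_peak_mb, other_usage_mb=0):
    """True if the link's peak working set fits without swapping."""
    return headroom_mb(ram_mb, linker_peak_mb, other_usage_mb) >= 0


if __name__ == "__main__":
    LINKER_PEAK_MB = 700   # observed roughly in a couple of trials
    OS_OVERHEAD_MB = 128   # hypothetical allowance for everything else

    for ram_mb in (700, 1024):  # 1024 is a hypothetical upgraded size
        fits = link_fits_in_ram(ram_mb, LINKER_PEAK_MB, OS_OVERHEAD_MB)
        spare = headroom_mb(ram_mb, LINKER_PEAK_MB, OS_OVERHEAD_MB)
        print(f"{ram_mb} MB RAM: fits={fits}, headroom={spare} MB")
```

Even with zero allowance for anything else, a 700 MB link on a 700 MB VM leaves no spare memory at all, which fits with the thrashing seen during the final link.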