Try not. Do, or do not.

2015-05-06 by Mike Shal, tagged as clobber, make, mozilla, try

Despite the existence of hg.mozilla.org/try, I sometimes feel that "there is no try", or at least not a try that I would like. My concerns in this post cover three main areas: how the input to try is specified, how the output (failure or success) is determined, and how long it takes to run. I'd like to look at each of these in turn and compare our current setup with an "ideal" try server.

Try Server - Input

So you have a patch, and it seems to compile and run locally. Maybe it's even reviewed. But you have no idea if it runs correctly on anything other than your laptop. Enter the try server! Ideally you'd just send your patch off to try:

[Image: Send a patch to try - idealized]

But unfortunately the patch isn't the only input to the try server. There's also the trychooser syntax, which determines the set of platforms, builds, and tests to run:

[Image: Send a patch to try - actual]
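
For reference, the trychooser line rides along in the commit message of the push. A typical one looks something like this (an illustrative example; the particular platforms and suites are my own picks):

    try: -b do -p linux64,macosx64,win32 -u mochitest-1,xpcshell -t none

Roughly: -b selects build types (debug and/or opt), -p the platforms, -u the unit test suites, and -t the talos suites.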

Choose too few options, and you may end up pushing broken code around. Choose too many, and you'll waste precious machine time. If you naively do a "-p all -u all" push because you don't really know enough about the system to foresee all the side effects of your patch, someone will yell at you. We even have a dedicated try highscore page so we can easily chastise the ignorant.

If you are convinced that you can select the optimal set of trychooser options, I wish you luck. With some 187 on/off checkboxes, there are 2^187, or about 10^56, possible combinations to select from. This is roughly the number of atoms in the sun.
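
That estimate is easy to check:

    >>> # 187 independent on/off checkboxes
    >>> "{:.2e}".format(2 ** 187)
    '1.96e+56'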

The only reasonable solution is to remove trychooser from the equation entirely, so that the sole input to try is the patch itself. Our scheduling, build, and test logic would then be responsible for verifying the patch using a minimal amount of resources.
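
To make that concrete, here's a minimal sketch in Python of what patch-driven scheduling could look like. Everything here is hypothetical; a real implementation would derive the path-to-suite mapping from the build system's dependency information rather than from a hand-written table:

    # Sketch only: map changed paths to the test suites they can affect.
    # A real system would compute this mapping from the build DAG.
    SUITES_BY_PREFIX = {
        "toolkit/devtools/": {"mochitest-devtools", "xpcshell"},
        "layout/":           {"reftest", "mochitest-plain"},
    }

    def suites_for_patch(changed_files):
        """Return the minimal set of suites a patch can affect."""
        suites = set()
        for path in changed_files:
            for prefix, affected in SUITES_BY_PREFIX.items():
                if path.startswith(prefix):
                    suites |= affected
        return sorted(suites)

    print(suites_for_patch(["toolkit/devtools/gcli/commands/screenshot.js"]))
    # prints ['mochitest-devtools', 'xpcshell']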

Try Server - Output

After your patch is built and the tests execute, you'll see some results on treeherder. Decoding the output can be a bit tricky, so I'll show some examples. First up is an all-green try push (truly an elusive beast!). This is safe to land, assuming you picked the right atom in the sun:

[Image: Good try]

More common is a push with a few oranges & reds among the tests. When it's just a couple spattered around, it's hard to know for sure whether your patch broke something or you just hit an intermittent test:

[Image: Maybe try]

But if you get a sea of orange & red, best to fix it before landing:

[Image: Bad try]

This means you've done too many try pushes today:

[Image: Skittles]

While knowing exactly which build or test failed is useful, having a blurred line between a "good" and "bad" build is not. The output of try needs to look something like this:

[Image: Try output]

In other words, a patch either works or it doesn't. But this isn't a new idea — long ago, our ancestors bravely declared War on Orange. Sadly, the war continues to this day. Will we ever attain peace?
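
If the output really has to be binary, the harness itself has to separate real failures from intermittents. One crude way is retrying, sketched below; the retry policy is my own assumption, not something try does today, and note that it papers over flaky tests rather than fixing them, which is exactly what the War on Orange is about:

    # Sketch: collapse noisy per-test results into one pass/fail verdict.
    # run_test is any callable returning True when a test passes.
    def overall_verdict(run_test, failed_tests, retries=3):
        # A failure that never reproduces across the retries is treated
        # as intermittent and dropped from the verdict.
        real_failures = [
            test for test in failed_tests
            if not any(run_test(test) for _ in range(retries))
        ]
        return ("FAIL", real_failures) if real_failures else ("PASS", [])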

Try Server - Runtime

Let's look at a recent try push. Here's the diffstat:

toolkit/devtools/gcli/commands/screenshot.js | 77 ++++++++++++++---------------------
1 file changed, 30 insertions(+), 47 deletions(-)

So, just some changes to one js file. Here's all we need to do in this situation on the build machine, for each platform:

  1. Patch is applied to the source tree (~1 second)
    • Obviously, the source tree and a perfectly usable prebuilt object directory already exist on the build machine. The machine was sitting there like a hungry piranha waiting for a yummy patch to gobble up, so it was scheduled immediately.
  2. The build system runs, and constructs a DAG from the changed js file (~3ms)
    • This assumes a sane build system that works correctly and is efficient (a toy version of this step is sketched after the list).
  3. The build system runs packager.py to create a new omni.ja file (~5 seconds)
    • I guess we could skip this if we used a flat hierarchy on our testers, but testing what we ship seems reasonable and the runtime is not too horrible.
    • Anything else is a waste of time in this case (compiling any .cpp files, linking libxul, reading any Makefiles or moz.build files, running buildsymbols or package-tests or l10n-check, etc.)
  4. The new omni.ja file is uploaded (~500ms)
    • All other packaged files and tests from the previous revision are unchanged, so of course they don't need to be uploaded again.
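
Here's the toy version of step 2 promised above. The graph is hand-written and only one level deep; a real build system (make, tup, and friends) derives a transitive DAG from its build rules:

    # Toy DAG: each output mapped to the inputs it is built from.
    DAG = {
        "omni.ja":   {"screenshot.js", "browser.js"},
        "libxul.so": {"nsGlobalWindow.cpp"},
    }

    def stale_targets(changed):
        """Find outputs whose inputs changed; only these need rebuilding."""
        return {out for out, inputs in DAG.items() if inputs & changed}

    print(stale_targets({"screenshot.js"}))
    # prints {'omni.ja'}; libxul.so is untouched, so no compiling or linking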

In total, this try push should run for maybe 7 seconds before tests get scheduled. Here are the actual runtimes:

  • Linux x86 opt: 30 minutes
  • Linux x86 debug: 30 minutes
  • OS X 10.8 opt: 70 minutes
  • OS X 10.8 debug: 49 minutes
  • Windows XP opt: 90 minutes
  • Windows XP debug: 75 minutes
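
Adding those up:

    >>> sum([30, 30, 70, 49, 90, 75])   # minutes of builder machine time
    344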

Over 5 hours of total builder machine time was used where less than a minute was needed. Such is the price of not being able to trust our build system to do the right thing. No, a solid build system doesn't immediately help schedule tests more efficiently, and test runtime does impact the total try turnaround as well. Yes, some changesets still require essentially re-compiling the world. That doesn't mean we shouldn't be faster, automatically, for those cases where it is possible.

What do you think? Can we get rid of trychooser? Can we win the War on Orange? How fast do you think we could turn around a try push, including tests?
