Oh Noes! A Bug in Tup
I thought it might be interesting to see an example of what I consider to be a bug in tup, especially since most other build systems would not consider this scenario a bug. While working on building mozilla-central, I had 25 new commits since the day started. After working on each commit, I would run 'tup upd' to bring the system up-to-date. Some commits only needed a single update, others required me to iterate many times until I was satisfied with the new commit. After I was done for the day, I pulled the repo onto another machine and ran a single 'tup upd' there, and that's when I noticed a problem...
Disaster Strikes!
I was greeted with the following error message on my laptop (the second machine):
$ tup upd [ tup ] [0.000s] No filesystem scan - monitor is running. [ tup ] [0.000s] Reading in new environment variables... [ tup ] [0.001s] No Tupfiles to parse. [ tup ] [0.001s] No files to delete. [ tup ] [0.028s] Executing Commands... fatal tup error: Graph is not empty after execution. This likely indicates a circular dependency. See the dumped dependency graph for the dependency structure. tup: saving graph '.tup/tmp/graph-full-22139.dot'
Now, the circular dependency error itself does not necessarily indicate a bug. It is quite easy to specify a circular dependency in tup, or really any build system. Tup usually catches these while parsing Tupfiles, not while running commands like in this case. But even so, that is not immediately cause for alarm.
The reason this is a bug is because:
- In one build directory, I built commit mybranch~1, and then mybranch. This appeared to build successfully.
- In another build directory (on my laptop), I built mybranch~25, and then mybranch. This resulted in the circular dependency error message.
- Since both build directories were at the same commit (mybranch), running 'tup upd' must produce the same result, but in this case they did not.
There are some obvious caveats to item 3). For example, I could generate a file based on /dev/random, and then the results would be different in both machines. Or, I could build them with different versions of gcc, which may produce different object files. That's not what I'm talking about here. The issue is that I should be able to check out any version of the software (or, make any edit to any file in the tree), and run 'tup upd'. The result must be independent of any prior builds I completed, or partial builds that may have failed, or mistakes I made in Tupfiles or C files while developing, and so on.
Tracking it Down
So now that we know it's a bug, how did it happen? To figure that out, I should be able to look at the graphviz graph that tup saved and look for the circular dependency. There, I should be able to figure out what I was typing in the Tupfiles, and eventually get a simple test case together to reproduce the issue. Let's take a look:
$ cat .tup/tmp/graph-full-22139.dot | dot -Tpng | miv - dot: graph is too large for cairo-renderer bitmaps. Scaling by 0.0723303 to fit Error loading image '-' Out of memory $ wc -l .tup/tmp/graph-full-22139.dot 4787 .tup/tmp/graph-full-22139.dot
Hmm, ok, so trying to display a 4k node graph isn't going to work. Moreover, I'm not going to want to sift through all those nodes and look for the circular dependency. It seems tup's behavior of just dumping it's partial DAG here isn't all that useful. Fortunately, trimming the graph down to only the nodes in the cycle is easy enough. Consider the circular dependency in this graph to the right. We can look at each node and remove the ones that have either 1) no incoming links; or 2) no outgoing links. Repeat the process until there are no more nodes that can be removed. All nodes that remain have both incoming and outgoing links and are part of the cycle (here, nodes A, B, and C will remain).
After adding this feature to tup, I now get a much more readable graph (assuming you are familiar with tup's graphs). You can see this graph below.
Now the problem is exposed, and it appears it involves a fairly new feature in tup, which are "groups". The groups are nodes represented inside angle brackets, such as <installed-headers> shown here. These were introduced to handle cases such as mozilla-central which have hundreds of generated headers, and we want to generate them all before trying to compile things. It doesn't make sense to add a dependency from every generated header to every compilation command, since we would need NxM links to do so. Instead, we can group up the headers in a group node, which will have N incoming links and M outgoing links, for a total of N+M links. (Note in either case, these links are what tup calls "sticky links", and are only used for ordering nodes that need to be updated. Adding a link from a header file to every compilation command does not mean that everything is recompiled if the header changes. Only those commands that actually read from the header will be executed when the header changes.)
In this particular case, I accidentally wrote a Tupfile that said the compilation commands for host programs should be able to use headers in the <installed-headers> group, which is the link on the bottom right pointing from <installed-headers> to 'C++ [host] ./jsoplengen.cpp'. This command creates an object file, which creates an executable, which generates a header. This header can then be used by other compilation commands, which is why it is placed into the <installed-headers> group. As a result, I mistakenly put a circular dependency into the build tree. Essentially, I have a "bug" in the Tupfile. (There is another cycle involving jskwgen.cpp, but it is the same issue).
The Point
Note a few key words in the last paragraph: "accidentally", and "mistakenly". It is perfectly reasonable to expect that a developer will make a mistake when writing some code, whether it is C++ code or a build file. The important question here is: What happens *when* the developer makes a mistake?. As I was writing the Tupfile, tup did not give me an error message when I introduced the circular dependency. This caused the problem to go unnoticed until I ran it on another machine. Even though my Tupfile was technically invalid, it still means there is a bug in tup itself since it didn't tell me about the error when it was introduced.
Now, the exact reason for why tup didn't detect the circular dependency when it was introduced is a little more complicated, and unfortunately it is a problem I don't have entirely resolved yet. More on that in the future...
comments powered by Disqus