Alex Groce, Associate Professor, School of Informatics, Computing and Cyber Systems, Northern Arizona University
Introducing one bug by hand (as we did in Part 1) is fine, and we could try it again, but “the plural of anecdote is not data.” However, this is not strictly true. If we have enough anecdotes, we can probably call it data (the field of “big multiple anecdotes” is due to take off any day now). In software testing, creating multiple “fake bugs” has a name, mutation testing (or mutation analysis). Mutation testing works by automatically generating lots of small changes to a program, in the expectation that most such changes will make the program incorrect. A test suite or fuzzer is better if it detects more of these changes. In the lingo of mutation testing, a detected mutant is “killed.” The phrasing is a bit harsh on mutants, but in testing a certain hard-heartedness towards bugs is in order. Mutation testing was once an academic niche topic, but is now in use at major companies, in real-world situations.
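To make the idea concrete, here is a toy sketch (purely illustrative; real mutation tools, universalmutator included, use far richer rule sets) of how regex rewrites can turn one program into many slightly wrong variants:

```python
import re

# A few illustrative rewrite rules in the spirit of regex-based mutation.
# These are examples, not universalmutator's actual rule set.
RULES = [
    (r"==", "!="),   # flip equality comparisons
    (r"\+", "-"),    # swap arithmetic operators
    (r"<", "<="),    # weaken relational operators
]

def mutants(source_lines):
    """Yield (line_number, mutated_program) pairs, one mutant per rule match."""
    for i, line in enumerate(source_lines):
        for pattern, replacement in RULES:
            if re.search(pattern, line):
                mutated = re.sub(pattern, replacement, line, count=1)
                yield i, source_lines[:i] + [mutated] + source_lines[i + 1:]

code = ["if (x == y)", "  z = x + y;"]
for lineno, variant in mutants(code):
    print(lineno, variant[lineno])
```

Each mutant is a complete copy of the program with one small change; most such changes should break a correct program, which is exactly what makes them useful fake bugs.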
There are many tools for mutation testing available, especially for Java. The tools for C code are less robust, or more difficult to use, in general. I (along with colleagues at NAU and other universities) recently released a tool, the universalmutator, that uses regular expressions to allow mutation for many languages, including C and C++ (not to mention Swift, Solidity, Rust, and numerous other languages previously without mutation-testing tools). We’ll use the universalmutator to see how well our fuzzers do at detecting artificial red-black tree bugs. Besides generality, one advantage of universalmutator is that it produces lots of mutants, including ones that are often equivalent but can sometimes produce subtle distinctions in behavior — that is, hard-to-detect bugs — that are not supported in most mutation systems. For high-stakes software, this can be worth the additional effort of analyzing and examining the mutants.
Installing universalmutator and generating some mutants is easy:
pip install universalmutator
mkdir mutants
mutate red_black_tree.c --mutantDir mutants
This will generate a large number of mutants, most of which won’t compile (the universalmutator doesn’t parse, or “know” C, so it’s no surprise many of its mutants are not valid C). We can discover the compiling mutants by running “mutation analysis” on the mutants, with “does it compile?” as our “test”:
analyze_mutants red_black_tree.c "make clean; make" --mutantDir mutants
This will produce two files:
- killed.txt, containing mutants that don’t compile, and
- notkilled.txt, containing the 1,120 mutants that actually compile.

To see if a mutant is killed, the analysis tool just determines whether the command in quotes returns a non-zero exit code, or times out (the default timeout is 30 seconds; unless you have a very slow machine, this is plenty of time to compile our code here).
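The kill check itself is simple; a minimal sketch of that logic (a hypothetical helper, not analyze_mutants’ actual code):

```python
import subprocess

def is_killed(command, timeout=30):
    """Return True if the test command fails (non-zero exit) or times out."""
    try:
        result = subprocess.run(command, shell=True, timeout=timeout,
                                capture_output=True)
    except subprocess.TimeoutExpired:
        return True  # a hang counts as a kill
    return result.returncode != 0

print(is_killed("exit 1"))  # a failing "test" kills the mutant -> True
print(is_killed("exit 0"))  # a passing one does not -> False
```

For the compile-only pass above, "make clean; make" plays the role of the test command: mutants that fail to build are "killed" by the compiler itself.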
If we copy the notkilled.txt file containing valid (compiling) mutants to another file, we can then do some real mutation testing:
cp notkilled.txt compile.txt
analyze_mutants red_black_tree.c "make clean; make fuzz_rb; ./fuzz_rb" --mutantDir mutants --verbose --timeout 120 --fromFile compile.txt
Output will look something like:
ANALYZING red_black_tree.c
COMMAND: ** ['make clean; make fuzz_rb; ./fuzz_rb'] **
#1: [0.0s 0.0% DONE] mutants/red_black_tree.mutant.2132.c NOT KILLED
  RUNNING SCORE: 0.0
...
Assertion failed: (left_black_cnt == right_black_cnt), function checkRepHelper, file red_black_tree.c, line 702.
/bin/sh: line 1: 30015 Abort trap: 6 ./fuzz_rb
#2: [62.23s 0.09% DONE] mutants/red_black_tree.mutant.1628.c KILLED IN 1.78541398048
  RUNNING SCORE: 0.5
...
Similar commands will run mutation testing on the DeepState fuzzer and libFuzzer. Just change make fuzz_rb; ./fuzz_rb to make ds_rb; ./ds_rb --fuzz --timeout 60 --exit_on_fail to use the built-in DeepState fuzzer. For libFuzzer, to speed things up, we’ll want to set the environment variable LIBFUZZER_EXIT_ON_FAIL to TRUE, and pipe output to /dev/null, since libFuzzer’s verbosity will hide our actual mutation results:
export LIBFUZZER_EXIT_ON_FAIL=TRUE
analyze_mutants red_black_tree.c "make clean; make ds_rb_lf; ./ds_rb_lf -use_value_profile=1 -detect_leaks=0 -max_total_time=60 >& /dev/null" --mutantDir mutants --verbose --timeout 120 --fromFile compile.txt
The tool generates 2,602 mutants, but only 1,120 of these actually compile. Analyzing those mutants with a test budget of 60 seconds, we can get a better idea of the quality of our fuzzing efforts. The DeepState brute-force fuzzer kills 797 of these mutants (71.16%). John’s original fuzzer kills 822 (73.39%). Fuzzing the mutants not killed by these fuzzers another 60 seconds doesn’t kill any additional mutants. The performance of libFuzzer is strikingly similar: 60 seconds of libFuzzer (starting from an empty corpus) kills 797 mutants, exactly the same as DeepState’s brute force fuzzer – the same mutants, in fact.
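The mutation scores above are simply kills divided by the number of compiling mutants; for instance:

```python
def mutation_score(killed, compiling):
    """Mutation score: percentage of valid (compiling) mutants killed."""
    return 100.0 * killed / compiling

# Numbers from the runs above: 1,120 compiling mutants.
print(f"{mutation_score(797, 1120):.2f}%")  # DeepState brute-force fuzzer: 71.16%
print(f"{mutation_score(822, 1120):.2f}%")  # John's original fuzzer: 73.39%
```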
“There ain’t no such thing as a free lunch” (or is there?)
DeepState’s native fuzzer appears, for a given amount of time, to be less effective than John’s “raw” fuzzer. This shouldn’t be a surprise: in fuzzing, speed is king. Because DeepState is parsing a byte-stream, forking in order to save crashes, and producing extensive, user-controlled logging (among other things), it is impossible for it to generate and execute tests as quickly as John’s bare-bones fuzzer.
libFuzzer is even slower; in addition to all the services (except forking for crashes, which is handled by libFuzzer itself) provided by the DeepState fuzzer, libFuzzer determines the code coverage and computes value profiles for every test, and performs computations needed to base future testing on those evaluations of input quality.
Is this why John’s fuzzer kills 25 mutants that DeepState does not? Well, not quite. If we examine the 25 additional mutants, we discover that every one involves changing an equality comparison on a pointer into an inequality. For example:
< if ( (y == tree->root) ||
---
> if ( (y <= tree->root) ||
The DeepState fuzzer is not finding these because it runs each test in a fork. The code doesn’t allocate enough times to use enough of the address space to cause a problem for these particular checks, since most allocations are in a fork! In theory, this shouldn’t be the case for libFuzzer, which runs without forking. And, sure enough, if we give the slow-and-steady libFuzzer five minutes instead of 60 seconds, it catches all of these mutants, too. No amount of additional fuzzing will help the DeepState fuzzer. In this case, the bug is strange enough and unlikely enough we can perhaps ignore it. The issue is not the speed of our fuzzer, or the quality (exactly), but the fact that different fuzzing environments create subtle differences in what tests we are actually running.
After we saw this problem, we added an option to DeepState to make the brute-force fuzzer (or test replay) run in a non-forking mode: --no_fork. Unfortunately, this is not a complete solution. While we can now detect these bugs, we can’t produce a good saved test case for them, since the failure depends on all the mallocs that have been issued, and the exact addresses of certain pointers. However, it turns out that --no_fork has a more important benefit: it dramatically speeds up fuzzing and test replay on macOS – often by orders of magnitude. While we omit it in our examples because it complicates analyzing failure causes, you should probably use it for most fuzzing and test replay on macOS.
We can safely say that, for most intents and purposes, DeepState is as powerful as John’s “raw” fuzzer, as easy to implement, and considerably more convenient for debugging and regression testing.
Examining the Mutants
This takes care of the differences in our fuzzers’ performances. But how about the remaining mutants? None of them are killed by five minutes of fuzzing using any of our fuzzers. Do they show holes in our testing? There are various ways to detect equivalent mutants (mutants that don’t actually change the program semantics, and so can’t possibly be killed), such as comparing the binaries generated by an optimizing compiler. For our purposes, we will just examine a random sample of the 298 unkilled mutants, to confirm that at least most of the unkilled mutants are genuinely uninteresting.
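The binary-comparison check can be sketched in a few lines (a sketch assuming each mutant has already been compiled to a binary; real compilers sometimes embed timestamps, so builds from identical source may still differ byte-for-byte):

```python
import hashlib

def file_digest(path):
    """SHA-256 of a compiled artifact."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

def classify(original, mutant_binaries):
    """Split mutants into trivially-equivalent-to-original and the rest."""
    orig = file_digest(original)
    equivalent = [m for m in mutant_binaries if file_digest(m) == orig]
    others = [m for m in mutant_binaries if file_digest(m) != orig]
    return equivalent, others
```

A mutant whose compiled binary is byte-identical to the original’s cannot possibly be killed; mutants identical to each other are redundant, and only one representative needs testing.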
- The first mutant changes a <= in a comment. There’s no way we can kill this. Comparing compiled binaries would have proven it.
- The second mutant modifies code in the InorderTreePrint function, which John’s fuzzer (and thus ours) explicitly chooses not to test. This would not be detectable by comparing binaries, but it is common sense: if our fuzzer intentionally never covers a piece of code, it cannot very well detect bugs in that code.
- The third mutant changes the assignment to temp->key on line 44, in the RBTreeCreate function, so it assigns a 1 rather than a 0. This is more interesting. It will take some thought to convince ourselves this does not matter. If we follow the code’s advice and look at the comments on nil in the header file, we can see these are used as sentinels. Perhaps the exact data values in nil don’t matter, since we’ll only detect them by pointer comparisons? Sure enough, this is the case.
- The fourth mutant removes the assignment newTree->PrintKey = PrintFunc; on line 35. Again, since we never print trees, this can’t be detected.
- The fifth mutant is inside a comment.
- The sixth mutant changes a pointer comparison in an assert:

686c686
< assert (node->right->parent == node);
---
> assert (node->right->parent >= node);
If we assume the assert always held for the original code, then changing == to the more permissive >= obviously cannot fail.
- The seventh mutant lurks in a comment.
- The eighth mutant removes an assert. Again, removing an assert can never cause previously passing tests to fail, unless something is wrong with your assert!
- The ninth mutant changes a red assignment:
243c243
< x->parent->parent->red=1;
---
> x->parent->parent->red=-1;
Since we don’t check the exact value of the red field, but use it to branch (so all non-zero values are the same), this is fine.
- The tenth mutant is again inside the InorderTreePrint function.
At this point if we were really testing this red-black tree as a critical piece of code, we would probably:
- Make a tool (like a 10-line Python script, not anything heavyweight!) to throw out all mutants inside comments, inside the InorderTreePrint function, or that remove an assertion.
- Compile all the mutants and compare binaries with each other and the original file, to throw out obvious equivalent mutants and redundant mutants. This step can be a little annoying. Compilers don’t always produce equivalent binaries, due to timestamps generated at compile time, which is why we skipped over it in the discussion above.
- Examine the remaining mutants (maybe 200 or so) carefully, to make sure we’re not missing anything. Finding categories of “that’s fine” mutants often makes this process much easier than it sounds offhand (things like “assertion removals are always ok”).
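The first step might look something like the sketch below, in the spirit of the 10-line Python script mentioned above (the heuristics, keyed on the original and mutated line of each mutant, are assumptions about what counts as boring, not anything universalmutator provides):

```python
def boring(original_line, mutated_line):
    """Heuristics for mutants we need not examine by hand.

    A changed line is 'boring' if it is inside a comment, touches the
    untested InorderTreePrint code, or simply deletes an assertion.
    """
    stripped = original_line.strip()
    if stripped.startswith(("/*", "*", "//")):
        return True                      # mutation inside a comment
    if "InorderTreePrint" in original_line:
        return True                      # code we deliberately don't test
    if stripped.startswith("assert") and "assert" not in mutated_line:
        return True                      # removing an assert can't fail tests
    return False

print(boring("/* a comment */", "/* a commented */"))  # True
print(boring("assert(x == y);", ";"))                  # True
print(boring("x->red = 1;", "x->red = -1;"))           # False: worth a look
```

Running every surviving mutant through a filter like this quickly shrinks the pile that actually needs human eyes.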
The process of (1) making a test generator then (2) applying mutation testing and (3) actually looking at the surviving mutants and using them to improve our testing can be thought of as a falsification-driven testing process. For highly critical, small pieces of code, this can be a very effective way to build an effective fuzzing regimen. It helped Paul E. McKenney discover real bugs in the Linux kernel’s RCU module.
Just Fuzz it More
Alternatively, before turning to mutant investigation, you can just fuzz the code more aggressively. Our mutant sample suggests there won’t be many outstanding bugs, but perhaps there are a few. Five minutes is not that extreme a fuzzing regimen. People expect to run AFL for days. If we were really testing the red-black tree as a critical piece of code, we probably wouldn’t give up after five minutes.
Which fuzzer would be best for this? It’s hard to know for sure, but one reasonable approach would be first to use libFuzzer to generate a large corpus of tests to seed fuzzing, that achieve high coverage on the un-mutated red-black tree. Then, we could try a longer fuzzing run on each mutant, using the seeds to make sure we’re not spending most of the time just “learning” the red-black tree API.
After generating a corpus on the original code for an hour, we ran libFuzzer, starting from that corpus, for ten minutes. The tests we generated this way can be found here. How many additional mutants does this kill? We can already guess it will be fewer than 30, based on our 3% sample. A simple script, as described above, brings the number of interesting, unkilled, mutants to analyze down to 174 by removing comment mutations, print function mutations, and assertion removals. In fact, this more aggressive (and time-consuming) fuzzing kills zero additional mutants over the ones already killed by John’s fuzzer in one minute and libFuzzer in five minutes. Even an hour-long libFuzzer run with the hour-long corpus kills only three additional mutants, and those are not very interesting. One new kill removes a free call, and the memory leak eventually kills libFuzzer; the other two kills are just more pointer comparisons. Is this solid evidence that our remaining mutants (assuming we haven’t examined them all yet) are harmless? We’ll see.
What About Symbolic Execution?
[Note: this part doesn’t work on Mac systems right now, unless you know enough to do a cross compile, and can get the binary analysis tools working with that. I ran it on Linux inside docker.]
DeepState also supports symbolic execution, which, according to some definitions, is just another kind of fuzzing (white box fuzzing). Unfortunately, at this time, neither Manticore nor angr (the two binary analysis engines we support) can scale to the full red-black tree or file system examples with a search depth anything like 100. This isn’t really surprising, given the tools are trying to generate all possible paths through the code! However, simply lowering the depth to a more reasonable number is also insufficient. You’re likely to get solver timeout errors even at depth three. Instead, we use symex.cpp, which does a much simpler insert/delete pattern, with comparisons to the reference, three times in a row.
clang -c red_black_tree.c container.c stack.c misc.c
clang++ -o symex symex.cpp -ldeepstate red_black_tree.o stack.o misc.o container.o -static -Wl,--allow-multiple-definition,--no-export-dynamic
deepstate-manticore ./symex --min_log_level 1
The result will be tests covering all paths through the code, saved in the out directory. This may take quite some time to run, since each path can take a minute or two to generate, and there are quite a few paths. If deepstate-manticore is too slow, try deepstate-angr (or vice versa). Different code is best suited for different symbolic execution engines. (This is one of the purposes of DeepState – to make shopping around for a good back-end easy.)
INFO:deepstate.mcore:Running 1 tests across 1 workers
TRACE:deepstate:Running RBTree_TinySymex from symex.cpp(65)
TRACE:deepstate:symex.cpp(80): 0: INSERT:0 0x0000000000000000
TRACE:deepstate:symex.cpp(85): 0: DELETE:0
TRACE:deepstate:symex.cpp(80): 1: INSERT:0 0x0000000000000000
TRACE:deepstate:symex.cpp(85): 1: DELETE:0
TRACE:deepstate:symex.cpp(80): 2: INSERT:0 0x0000000000000000
TRACE:deepstate:symex.cpp(85): 2: DELETE:-2147483648
TRACE:deepstate:Passed: RBTree_TinySymex
TRACE:deepstate:Input: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ...
TRACE:deepstate:Saved test case in file out/symex.cpp/RBTree_TinySymex/89b9a0aba0287935fa5055d8cb402b37.pass
TRACE:deepstate:Running RBTree_TinySymex from symex.cpp(65)
TRACE:deepstate:symex.cpp(80): 0: INSERT:0 0x0000000000000000
TRACE:deepstate:symex.cpp(85): 0: DELETE:0
TRACE:deepstate:symex.cpp(80): 1: INSERT:0 0x0000000000000000
TRACE:deepstate:symex.cpp(85): 1: DELETE:0
TRACE:deepstate:symex.cpp(80): 2: INSERT:0 0x0000000000000000
TRACE:deepstate:symex.cpp(85): 2: DELETE:0
TRACE:deepstate:Passed: RBTree_TinySymex
...
We can see how well the 583 generated tests perform using mutation analysis as before. Because we are just replaying the tests, not performing symbolic execution, we can now add back in the RBTreeVerify checks that were removed in order to speed up symbolic execution, by compiling with -DREPLAY, and compile everything with all of our sanitizers. The generated tests, which can be run (on a correct red_black_tree.c) in less than a second, kill 428 mutants (38.21%). This is considerably lower than for fuzzing, and worse than the 797 (71.16%) killed by the libFuzzer one-hour corpus, which has a similar < 1s runtime. However, this summary hides something more interesting: five of the killed mutants are ones not killed by any of our fuzzers, even in the well-seeded ten-minute libFuzzer runs:
703c703
< return left_black_cnt + (node->red ? 0 : 1);
---
> return left_black_cnt / (node->red ? 0 : 1);

703c703
< return left_black_cnt + (node->red ? 0 : 1);
---
> return left_black_cnt % (node->red ? 0 : 1);

703c703
< return left_black_cnt + (node->red ? 0 : 1);
---
> /*return left_black_cnt + (node->red ? 0 : 1);*/

701c701
< right_black_cnt = checkRepHelper (node->right, t);
---
> /*right_black_cnt = checkRepHelper (node->right, t);*/

700c700
< left_black_cnt = checkRepHelper (node->left, t);
---
> /*left_black_cnt = checkRepHelper (node->left, t);*/
These bugs are all in the checkRep code itself, which was not even targeted by symbolic execution. While these bugs do not involve actual faulty red-black tree behavior, they show that our fuzzers could allow subtle flaws to be introduced into the red-black tree’s tools for checking its own validity. In the right context, these could be serious faults, and certainly show a gap in the fuzzer-based testing. In order to see how hard it is to detect these faults, we tried using libFuzzer on each of these mutants, with our one-hour corpus as seed, for one additional hour of fuzzing on each mutant. It was still unable to detect any of these mutants.
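Comparisons like this are easy to make mechanically: each analyze_mutants run writes a killed.txt, so finding mutants one technique kills and another misses is just a set difference over those files (the file names in the sketch are hypothetical):

```python
def unique_kills(killed_a_path, killed_b_path):
    """Mutants listed in A's killed.txt but not in B's."""
    with open(killed_a_path) as f:
        a = set(line.strip() for line in f if line.strip())
    with open(killed_b_path) as f:
        b = set(line.strip() for line in f if line.strip())
    return sorted(a - b)

# e.g. unique_kills("symex_killed.txt", "libfuzzer_killed.txt")
# would list the five checkRep mutants discussed above.
```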
While generating tests using symbolic execution takes more computational power, and, perhaps, more human effort, the very thorough (if limited in scope) tests that result can detect bugs that even aggressive fuzzing may miss. Such tests are certainly a powerful addition to a regression test suite for an API. Learning to use DeepState makes mixing fuzzing and symbolic execution in your testing easy. Even if you need a new harness for symbolic execution work, it looks like, and can share code with, most of your fuzzing-based testing. A major long-term goal for DeepState is to increase the scalability of symbolic execution for API sequence testing, using high-level strategies not dependent on the underlying engine, so you can use the same harness more often.
See the DeepState repo for more information on how to use symbolic execution.
What About Code Coverage?
We didn’t even look at code coverage in our fuzzing. The reason is simple: if we’re willing to go to the effort of applying mutation testing, and examining all surviving mutants, there’s not much additional benefit in looking at code coverage. Under the hood, libFuzzer and the symbolic execution engines aim to maximize coverage, but for our purposes mutants work even better. After all, if we don’t cover mutated code, we can hardly kill it. Coverage can be very useful, of course, in early stages of fuzzer harness development, where mutation testing is expensive, and you really just want to know if you are even hitting most of the code. But for intensive testing, when you have the time to do it, mutation testing is much more thorough. Not only do you have to cover the code, you actually have to test what it does. In fact, at present, most scientific evidence for the usefulness of code coverage relies on the greater usefulness of mutation testing.
For a more involved example using DeepState to test an API, see the TestFs example, which tests a user-level, ext3-like file system, or the differential tester that compares behavior of Google’s leveldb and Facebook’s rocksdb. For more details on DeepState in general, see our NDSS 2018 Binary Analysis Research Workshop paper.