Symbolic Path Merging in Manticore

Each year, Trail of Bits runs a month-long winter internship “winternship” program. This year we were happy to host 4 winterns who contributed to 3 projects. This is the first in a series of blog posts covering the 2019 Wintern class.

Our first report is from Vaibhav Sharma (@vbsharma), a PhD student at the University of Minnesota. Vaibhav’s research focuses on improving symbolic executors and he took a crack at introducing a new optimization to Manticore:

Symbolic Path Merging in Manticore

My project investigated path-merging techniques in Manticore, a symbolic execution engine that supports symbolic exploration of x86 and x86-64 binaries as well as Ethereum smart contracts. A significant barrier to symbolic exploration of many practical programs is path explosion. As a symbolic executor explores a program, it encounters branch instructions where both sides are feasible, and it needs to explore both sides. Manticore explores such branch instructions by forking the path that reached the branch into two paths, each of which explores one feasible side. A linear increase in the number of branch instructions with both sides feasible causes an exponential increase in the number of paths Manticore needs to explore through the program. If we hit enough of these branch conditions, Manticore may never finish exploring all the states.

Path merging reduces the number of paths to be explored. The central idea is to merge paths that are similar and at the same program location. Manticore uses a “state” object to capture the processor, memory, and file system information in a single data structure at every point of symbolic exploration through a program. Hence, in Manticore, path merging can be specialized to “state merging”: merging similar states at the same program location can lead to an exponential reduction in the number of paths to explore. With a simple program, I observed that Manticore could cut its number of explored execution paths by 33% if it merged similar states at the same program location.

State merging can be implemented statically or dynamically. Static state merging explores the control-flow graph of the subject program in topological order and merges states at the same program location when possible. Veritesting is a path-merging technique similar to static state merging: it requires paths to be at the same program location before merging them. Dynamic state merging does not require two states to be at the same program location for them to be considered for merging. Given two states a1, a2 at different program locations l1, l2 respectively, if a transitive successor a1′ of a1 has a high and beneficial similarity to a2, dynamic state merging fast-forwards a1 to a1′ and merges it with a2. The fast-forwarding involves overriding the symbolic executor’s search heuristic to reach l2. Dynamic state merging uses the intuition that if two states are similar, their successors within a few steps are also likely to be similar.

While it is possible to implement either state-merging technique in Manticore, I chose dynamic state merging, as described by Kuznetsov et al., because it is a better fit for Manticore’s state-based (rather than path-based) symbolic execution. Also, static state merging is less suited to symbolic exploration guided towards a goal and more suited to exhaustive exploration of a subject program. Since static state merging can only merge states at the same program location, when directed towards a goal it tends to cover less code than dynamic state merging in the same time budget. This was also a conclusion of Kuznetsov et al. (see Figure 8 from their paper below). Since we often use symbolic execution to reach an exploration goal, static state merging is less suited to our needs.

DSM and SSM vs KLEE coverage

Dynamic State Merging (DSM) provided more statement coverage than Static State Merging (SSM). Figure from “Efficient state merging in symbolic execution.” Kuznetsov et al. 11 Jun. 2012.

Engineering Challenges

Both static and dynamic state merging require an external static analysis tool like Binary Ninja to find a topological ordering of program locations. Given the short duration of my winternship, I chose to implement opportunistic state merging, which only merges states that happen to be at the same program location. While this approach does not give the full benefit of dynamic state merging, it is easier to implement because it does not rely on integration with an external static analysis tool to obtain a topological ordering. It is also easily extensible to dynamic state merging, since it uses many of the same primitive operations, such as state comparison and state merging.

Implementation

I implemented opportunistic state merging for Manticore. The implementation checks whether two states at the same program location have semantically equivalent input and output socket buffers, memory, and system call traces, using an “isMergeable” predicate. If this predicate is satisfied, the implementation merges the CPU register values that are semantically inequivalent.
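To make the shape of this check concrete, here is a rough Python sketch of such a predicate. This is not Manticore’s actual code: the state attribute names and the semantically_equal helper are hypothetical stand-ins for the solver-backed equivalence checks the real implementation performs.

def semantically_equal(a, b):
    # Placeholder: a real implementation would ask the SMT solver whether the
    # two symbolic values/arrays must be equal under the current constraints.
    return a == b

def is_mergeable(state_a, state_b):
    # Merge candidates must be at the same program location and agree on
    # everything except (possibly) their CPU register values.
    if state_a.cpu.PC != state_b.cpu.PC:
        return False
    return (semantically_equal(state_a.input_buffer, state_b.input_buffer) and
            semantically_equal(state_a.output_sockets, state_b.output_sockets) and
            semantically_equal(state_a.memory, state_b.memory) and
            state_a.syscall_trace == state_b.syscall_trace)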

Results

I used a simple example where I could see two states saved in Manticore’s queue that are at the same program location, making them good candidates for merging. The partial CFG of this example program is shown below.

Merged CFG Annotated

The two basic blocks highlighted in red cause control flow to merge at the basic block highlighted in green. The first red block causes control flow to jump directly to the green block. The second red block moves a constant (0x4a12dd) into the edi register and then jumps to the green block. To explore this example, Manticore creates two states: one that explores the first red block and jumps to the green block, and another that explores the second red block and jumps to the green block. Since the only difference between these two states, which are at the same program location (the green block), is the value in their edi register, Manticore can merge them into a single state whose edi value is an if-then-else expression. This if-then-else expression uses the condition that decides which side of the branch (jbe 0x40060d) gets taken. If the condition is satisfied, the expression evaluates to the value present in edi in the first red block; if not, it evaluates to 0x4a12dd (the constant set in the second red block). Thus, Manticore opportunistically merges two control-flow paths into one, which eventually cuts its number of execution paths by 33% when the binary is compiled with gcc -Os, and by 20% when it is compiled with -O3.
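To see what the merged register looks like, here is a small, self-contained illustration using Z3’s Python API rather than Manticore’s own expression classes; the variable names and the stand-in branch condition are made up for the example.

from z3 import BitVec, BitVecVal, If, ULE, simplify

x = BitVec("x", 32)                    # symbolic value the branch depends on (illustrative)
cond = ULE(x, BitVecVal(5, 32))        # stand-in for the `jbe 0x40060d` condition

edi_first_red = BitVec("edi_first_red", 32)   # edi value coming from the first red block
edi_second_red = BitVecVal(0x4A12DD, 32)      # constant moved into edi by the second red block

# The merged state keeps a single edi that folds both paths into one expression.
merged_edi = If(cond, edi_first_red, edi_second_red)
print(simplify(merged_edi))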

Directions for future improvement:

  1. This implementation can be extended to get the full benefits of dynamic state merging by integrating Manticore with a tool that can provide a topological ordering of program locations.
  2. State merging always creates new symbolic data, since it converts the concrete writes in a merged region of code into symbolic writes.
  3. Check whether the new symbolic data introduced by state merging causes more branching later during exploration. We need to implement heuristics, such as the query count estimation of Kuznetsov et al., so that we apply state merging only when it is most useful.

Path merging is a technique that needs to be re-interpreted to fit the needs of a particular symbolic executor. This winternship allowed me to understand the inner workings of Manticore, a state-based symbolic executor, and re-interpret path merging to better fit the use case of binary symbolic execution with Manticore. My implementation of opportunistic state merging merges similar states if they are at the same program location. It can be used from a Python script by registering a plugin called Merger with Manticore; basic_statemerging.py under examples/script is an example of such use.
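For reference, a minimal sketch of what using the plugin from a script might look like, modeled on basic_statemerging.py. The exact import paths vary between Manticore versions, so treat them as assumptions rather than a definitive API reference.

# Sketch only: import paths are assumptions based on examples/script/basic_statemerging.py
# and may differ in your Manticore version.
from manticore.native import Manticore
from manticore.core.plugin import Merger

m = Manticore("./target_binary")   # hypothetical binary under analysis
m.register_plugin(Merger())        # opportunistically merge similar states at the same PC
m.run()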

Fuzzing an API with DeepState (Part 2)

Alex Groce, Associate Professor, School of Informatics, Computing and Cyber Systems, Northern Arizona University

Mutation Testing

Introducing one bug by hand (as we did in Part 1) is fine, and we could try it again, but “the plural of anecdote is not data.” However, this is not strictly true. If we have enough anecdotes, we can probably call it data (the field of “big multiple anecdotes” is due to take off any day now). In software testing, creating multiple “fake bugs” has a name, mutation testing (or mutation analysis). Mutation testing works by automatically generating lots of small changes to a program, in the expectation that most such changes will make the program incorrect. A test suite or fuzzer is better if it detects more of these changes. In the lingo of mutation testing, a detected mutant is “killed.” The phrasing is a bit harsh on mutants, but in testing a certain hard-heartedness towards bugs is in order. Mutation testing was once an academic niche topic, but is now in use at major companies, in real-world situations.
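As a toy illustration of the terminology (in Python, not output from any mutation tool), a mutant is a small syntactic change to the program, and a test kills the mutant if it fails on the mutant while passing on the original:

def clamp(x, lo, hi):
    return max(lo, min(x, hi))

def clamp_mutant(x, lo, hi):
    # A typical mutant: a single operator swapped (min -> max).
    return max(lo, max(x, hi))

def test_clamp(f):
    assert f(5, 0, 10) == 5
    assert f(15, 0, 10) == 10

test_clamp(clamp)         # passes on the original
test_clamp(clamp_mutant)  # raises AssertionError: this test suite kills the mutant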

There are many tools for mutation testing available, especially for Java. The tools for C code are less robust, or more difficult to use, in general. I (along with colleagues at NAU and other universities) recently released a tool, the universalmutator, that uses regular expressions to allow mutation for many languages, including C and C++ (not to mention Swift, Solidity, Rust, and numerous other languages previously without mutation-testing tools). We’ll use the universalmutator to see how well our fuzzers do at detecting artificial red-black tree bugs. Besides generality, one advantage of universalmutator is that it produces lots of mutants, including ones that are often equivalent but can sometimes produce subtle distinctions in behavior — that is, hard to detect bugs — that are not supported in most mutation systems. For high-stakes software, this can be worth the additional effort of analyzing and examining the mutants.

Installing universalmutator and generating some mutants is easy:

pip install universalmutator
mkdir mutants
mutate red_black_tree.c --mutantDir mutants

This will generate a large number of mutants, most of which won’t compile (the universalmutator doesn’t parse, or “know” C, so it’s no surprise many of its mutants are not valid C). We can discover the compiling mutants by running “mutation analysis” on the mutants, with “does it compile?” as our “test”:

analyze_mutants red_black_tree.c "make clean; make" --mutantDir mutants

This will produce two files: killed.txt, containing mutants that don’t compile, and notkilled.txt, containing the 1120 mutants that actually compile. To see if a mutant is killed, the analysis tool just determines whether the command in quotes returns a non-zero exit code, or times out (the default timeout is 30 seconds; unless you have a very slow machine, this is plenty of time to compile our code here).
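In other words, a mutant counts as killed exactly when the quoted command fails or hangs. A rough Python equivalent of the per-mutant check (a sketch, not universalmutator’s actual implementation) looks like this:

import shutil, subprocess

def mutant_killed(original, mutant, command, timeout=30):
    # Swap the mutant in for the original source, run the command, and call the
    # mutant "killed" if the command exits non-zero or times out.
    shutil.copy(original, original + ".bak")
    try:
        shutil.copy(mutant, original)
        subprocess.run(command, shell=True, timeout=timeout, check=True)
        return False   # command succeeded: mutant survived
    except (subprocess.CalledProcessError, subprocess.TimeoutExpired):
        return True    # non-zero exit or timeout: mutant killed
    finally:
        shutil.copy(original + ".bak", original)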

If we copy the notkilled.txt file containing valid (compiling) mutants to another file, we can then do some real mutation testing:

cp notkilled.txt compile.txt
analyze_mutants red_black_tree.c "make clean; make fuzz_rb; ./fuzz_rb" --mutantDir mutants --verbose --timeout 120 --fromFile compile.txt

Output will look something like:

ANALYZING red_black_tree.c
COMMAND: ** ['make clean; make fuzz_rb; ./fuzz_rb'] **
#1: [0.0s 0.0% DONE]
  mutants/red_black_tree.mutant.2132.c NOT KILLED
  RUNNING SCORE: 0.0
...
Assertion failed: (left_black_cnt == right_black_cnt), function checkRepHelper, file red_black_tree.c, line 702.
/bin/sh: line 1: 30015 Abort trap: 6           ./fuzz_rb
#2: [62.23s 0.09% DONE]
  mutants/red_black_tree.mutant.1628.c KILLED IN 1.78541398048
  RUNNING SCORE: 0.5
...

Similar commands will run mutation testing on the DeepState fuzzer and libFuzzer. Just change make fuzz_rb; ./fuzz_rb to make ds_rb; ./ds_rb --fuzz --timeout 60 --exit_on_fail to use the built-in DeepState fuzzer. For libFuzzer, to speed things up, we’ll want to set the environment variable LIBFUZZER_EXIT_ON_FAIL to TRUE, and pipe output to /dev/null since libFuzzer’s verbosity will hide our actual mutation results:

export LIBFUZZER_EXIT_ON_FAIL=TRUE
analyze_mutants red_black_tree.c "make clean; make ds_rb_lf; ./ds_rb_lf -use_value_profile=1 -detect_leaks=0 -max_total_time=60 >& /dev/null" --mutantDir mutants --verbose --timeout 120 --fromFile compile.txt

The tool generates 2,602 mutants, but only 1,120 of these actually compile. Analyzing those mutants with a test budget of 60 seconds, we can get a better idea of the quality of our fuzzing efforts. The DeepState brute-force fuzzer kills 797 of these mutants (71.16%). John’s original fuzzer kills 822 (73.39%). Fuzzing the mutants not killed by these fuzzers another 60 seconds doesn’t kill any additional mutants. The performance of libFuzzer is strikingly similar: 60 seconds of libFuzzer (starting from an empty corpus) kills 797 mutants, exactly the same as DeepState’s brute force fuzzer – the same mutants, in fact.

“There ain’t no such thing as a free lunch” (or is there?)

DeepState’s native fuzzer appears, for a given amount of time, to be less effective than John’s “raw” fuzzer. This shouldn’t be a surprise: in fuzzing, speed is king. Because DeepState is parsing a byte-stream, forking in order to save crashes, and producing extensive, user-controlled logging (among other things), it is impossible for it to generate and execute tests as quickly as John’s bare-bones fuzzer.

libFuzzer is even slower; in addition to all the services (except forking for crashes, which is handled by libFuzzer itself) provided by the DeepState fuzzer, libFuzzer determines the code coverage and computes value profiles for every test, and performs computations needed to base future testing on those evaluations of input quality.

Is this why John’s fuzzer kills 25 mutants that DeepState does not? Well, not quite. If we examine the 25 additional mutants, we discover that every one involves changing an equality comparison on a pointer into an inequality. For example:

<   if ( (y == tree->root) ||
---
>   if ( (y <= tree->root) ||

The DeepState fuzzer is not finding these because it runs each test in a fork. The code doesn’t allocate enough times to use enough of the address space to cause a problem for these particular checks, since most allocations are in a fork! In theory, this shouldn’t be the case for libFuzzer, which runs without forking. And, sure enough, if we give the slow-and-steady libFuzzer five minutes instead of 60 seconds, it catches all of these mutants, too. No amount of additional fuzzing will help the DeepState fuzzer. In this case, the bug is strange enough and unlikely enough we can perhaps ignore it. The issue is not the speed of our fuzzer, or the quality (exactly), but the fact that different fuzzing environments create subtle differences in what tests we are actually running.

After we saw this problem, we added an option to DeepState to make the brute-force fuzzer (or test replay) run in a non-forking mode: --no_fork. Unfortunately, this is not a complete solution. While we can now detect these bugs, we can’t produce a good saved test case for them, since the failure depends on all the mallocs that have been issued, and the exact addresses of certain pointers. However, it turns out that --no_fork has a more important benefit: it dramatically speeds up fuzzing and test replay on macOS – often by orders of magnitude. While we omit it in our examples because it complicates analyzing failure causes, you should probably use it for most fuzzing and test replay on macOS.

We can safely say that, for most intents and purposes, DeepState is as powerful as John’s “raw” fuzzer, as easy to implement, and considerably more convenient for debugging and regression testing.

Examining the Mutants

This takes care of the differences in our fuzzers’ performances. But how about the remaining mutants? None of them are killed by five minutes of fuzzing using any of our fuzzers. Do they show holes in our testing? There are various ways to detect equivalent mutants (mutants that don’t actually change the program semantics, and so can’t possibly be killed), such as comparing the binaries generated by an optimizing compiler. For our purposes, we will just examine a random sample of the 298 unkilled mutants, to confirm that at least most of the unkilled mutants are genuinely uninteresting.
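For instance, one quick way to flag some equivalent mutants is to compile each survivor and compare the resulting binary with the original’s: identical binaries imply an equivalent mutant, while differing binaries prove nothing (and, as noted below, compile-time timestamps can spoil the comparison). Here is a sketch, assuming each mutant can be swapped in for red_black_tree.c and built with the repo’s Makefile:

import glob, hashlib, shutil, subprocess

def build_and_hash():
    subprocess.run("make clean; make fuzz_rb", shell=True, check=True,
                   stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    with open("fuzz_rb", "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

shutil.copy("red_black_tree.c", "red_black_tree.c.orig")
original_hash = build_and_hash()
for mutant in sorted(glob.glob("mutants/red_black_tree.mutant.*.c")):
    shutil.copy(mutant, "red_black_tree.c")
    try:
        if build_and_hash() == original_hash:
            print("likely equivalent:", mutant)
    except subprocess.CalledProcessError:
        pass   # non-compiling mutants were already counted as killed
shutil.copy("red_black_tree.c.orig", "red_black_tree.c")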

  • The first mutant changes a <= in a comment. There’s no way we can kill this. Comparing compiled binaries would have proven it.
  • The second mutant modifies code in the InorderTreePrint function, which John’s fuzzer (and thus ours) explicitly chooses not to test. This would not be detectable by comparing binaries, but it is common sense. If our fuzzer never covers a piece of code, intentionally, it cannot very well detect bugs in that code.
  • The third mutant changes the assignment to temp->key on line 44, in the RBTreeCreate function, so it assigns a 1 rather than a 0. This is more interesting. It will take some thought to convince ourselves this does not matter. If we follow the code’s advice and look at the comments on root and nil in the header file, we can see these are used as sentinels. Perhaps the exact data values in root and nil don’t matter, since we’ll only detect them by pointer comparisons? Sure enough, this is the case.
  • The fourth mutant removes the assignment newTree->PrintKey= PrintFunc; on line 35. Again, since we never print trees, this can’t be detected.
  • The fifth mutant is inside a comment.
  • The sixth mutant changes a pointer comparison in an assert.
    686c686
    <     assert (node->right->parent == node);
    ---
    >     assert (node->right->parent >= node);

    If we assume the assert always held for the original code, then changing == to the more permissive >= obviously cannot fail.

  • The seventh mutant lurks in a comment.
  • The eighth mutant removes an assert. Again, removing an assert can never cause previously passing tests to fail, unless something is wrong with your assert!
  • The ninth mutant changes a red assignment:
    243c243
    <       x->parent->parent->red=1;
    ---
    >       x->parent->parent->red=-1;

    Since we don’t check the exact value of the red field, but use it to branch (so all non-zero values are the same) this is fine.

  • The tenth mutant is again inside the InorderTreePrint function.

At this point if we were really testing this red-black tree as a critical piece of code, we would probably:

  • Make a tool (like a 10-line Python script, not anything heavyweight!) to throw out all mutants inside comments, inside the InorderTreePrint function, or that remove an assertion (a sketch of such a script follows this list).
  • Compile all the mutants and compare binaries with each other and the original file, to throw out obvious equivalent mutants and redundant mutants. This step can be a little annoying. Compilers don’t always produce equivalent binaries, due to timestamps generated at compile time, which is why we skipped over it in the discussion above.
  • Examine the remaining mutants (maybe 200 or so) carefully, to make sure we’re not missing anything. Finding categories of “that’s fine” mutants often makes this process much easier than it sounds off hand (things like “assertion removals are always ok”).
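Here is the kind of throwaway filter the first step describes. It is a sketch only: the heuristics for comments and assertion changes are deliberately crude, and filtering by the InorderTreePrint function’s line range is omitted for brevity.

import difflib, sys

def changed_lines(original_lines, mutant_path):
    with open(mutant_path) as f:
        mutant_lines = f.readlines()
    diff = difflib.unified_diff(original_lines, mutant_lines, n=0)
    return [d[1:].strip() for d in diff
            if d.startswith(("-", "+")) and not d.startswith(("---", "+++"))]

def boring(line):
    # Crude heuristics: changes on comment lines, or changes touching an assert.
    return line.startswith(("/*", "//", "*")) or "assert" in line

original = open("red_black_tree.c").readlines()
for mutant in sys.argv[1:]:             # pass the surviving mutant files as arguments
    changes = changed_lines(original, mutant)
    if changes and all(boring(line) for line in changes):
        continue                        # obviously uninteresting: drop it
    print(mutant)                       # still worth a human look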

The process of (1) making a test generator then (2) applying mutation testing and (3) actually looking at the surviving mutants and using them to improve our testing can be thought of as a falsification-driven testing process. For highly critical, small pieces of code, this can be a very effective way to build an effective fuzzing regimen. It helped Paul E. McKenney discover real bugs in the Linux kernel’s RCU module.

Just Fuzz it More

Alternatively, before turning to mutant investigation, you can just fuzz the code more aggressively. Our mutant sample suggests there won’t be many outstanding bugs, but perhaps there are a few. Five minutes is not that extreme a fuzzing regimen. People expect to run AFL for days. If we were really testing the red-black tree as a critical piece of code, we probably wouldn’t give up after five minutes.

Which fuzzer would be best for this? It’s hard to know for sure, but one reasonable approach would be first to use libFuzzer to generate a large corpus of tests to seed fuzzing, that achieve high coverage on the un-mutated red-black tree. Then, we could try a longer fuzzing run on each mutant, using the seeds to make sure we’re not spending most of the time just “learning” the red-black tree API.

After generating a corpus on the original code for an hour, we ran libFuzzer, starting from that corpus, for ten minutes. The tests we generated this way can be found here. How many additional mutants does this kill? We can already guess it will be fewer than 30, based on our 3% sample. A simple script, as described above, brings the number of interesting, unkilled, mutants to analyze down to 174 by removing comment mutations, print function mutations, and assertion removals. In fact, this more aggressive (and time-consuming) fuzzing kills zero additional mutants over the ones already killed by John’s fuzzer in one minute and libFuzzer in five minutes. Even an hour-long libFuzzer run with the hour-long corpus kills only three additional mutants, and those are not very interesting. One new kill removes a free call, and the memory leak eventually kills libFuzzer; the other two kills are just more pointer comparisons. Is this solid evidence that our remaining mutants (assuming we haven’t examined them all yet) are harmless? We’ll see.

What About Symbolic Execution?

[Note: this part doesn’t work on Mac systems right now, unless you know enough to do a cross compile, and can get the binary analysis tools working with that. I ran it on Linux inside docker.]

DeepState also supports symbolic execution, which, according to some definitions, is just another kind of fuzzing (white box fuzzing). Unfortunately, at this time, neither Manticore nor angr (the two binary analysis engines we support) can scale to the full red-black tree or file system examples with a search depth anything like 100. This isn’t really surprising, given the tools are trying to generate all possible paths through the code! However, simply lowering the depth to a more reasonable number is also insufficient. You’re likely to get solver timeout errors even at depth three. Instead, we use symex.cpp, which does a much simpler insert/delete pattern, with comparisons to the reference, three times in a row.

clang -c red_black_tree.c container.c stack.c misc.c
clang++ -o symex symex.cpp -ldeepstate red_black_tree.o stack.o misc.o container.o -static -Wl,--allow-multiple-definition,--no-export-dynamic
deepstate-manticore ./symex --log_level 1

The result will be tests covering all paths through the code, saved in the out directory. This may take quite some time to run, since each path can take a minute or two to generate, and there are quite a few paths. If deepstate-manticore is too slow, try deepstate-angr (or vice versa). Different code is best suited for different symbolic execution engines. (This is one of the purposes of DeepState – to make shopping around for a good back-end easy.)

INFO:deepstate.mcore:Running 1 tests across 1 workers
TRACE:deepstate:Running RBTree_TinySymex from symex.cpp(65)
TRACE:deepstate:symex.cpp(80): 0: INSERT:0 0x0000000000000000
TRACE:deepstate:symex.cpp(85): 0: DELETE:0
TRACE:deepstate:symex.cpp(80): 1: INSERT:0 0x0000000000000000
TRACE:deepstate:symex.cpp(85): 1: DELETE:0
TRACE:deepstate:symex.cpp(80): 2: INSERT:0 0x0000000000000000
TRACE:deepstate:symex.cpp(85): 2: DELETE:-2147483648
TRACE:deepstate:Passed: RBTree_TinySymex
TRACE:deepstate:Input: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ...
TRACE:deepstate:Saved test case in file out/symex.cpp/RBTree_TinySymex/89b9a0aba0287935fa5055d8cb402b37.pass
TRACE:deepstate:Running RBTree_TinySymex from symex.cpp(65)
TRACE:deepstate:symex.cpp(80): 0: INSERT:0 0x0000000000000000
TRACE:deepstate:symex.cpp(85): 0: DELETE:0
TRACE:deepstate:symex.cpp(80): 1: INSERT:0 0x0000000000000000
TRACE:deepstate:symex.cpp(85): 1: DELETE:0
TRACE:deepstate:symex.cpp(80): 2: INSERT:0 0x0000000000000000
TRACE:deepstate:symex.cpp(85): 2: DELETE:0
TRACE:deepstate:Passed: RBTree_TinySymex
...

We can see how well the 583 generated tests perform using mutation analysis as before. Because we are just replaying the tests, not performing symbolic execution, we can now add back in the checkRep and RBTreeVerify checks that were removed in order to speed symbolic execution, by compiling symex.cpp with -DREPLAY, and compile everything with all of our sanitizers. The generated tests, which can be run (on a correct red_black_tree.c) in less than a second, kill 428 mutants (38.21%). This is considerably lower than for fuzzing, and worse than the 797 (71.16%) killed by the libFuzzer one hour corpus, which has a similar < 1s runtime. However, this summary hides something more interesting: five of the killed mutants are ones not killed by any of our fuzzers, even in the well-seeded ten minute libFuzzer runs:

703c703
<   return left_black_cnt + (node->red ? 0 : 1);
---
>   return left_black_cnt / (node->red ? 0 : 1);
703c703
<   return left_black_cnt + (node->red ? 0 : 1);
---
>   return left_black_cnt % (node->red ? 0 : 1);
703c703
<   return left_black_cnt + (node->red ? 0 : 1);
---
>   /*return left_black_cnt + (node->red ? 0 : 1);*/
701c701
<   right_black_cnt = checkRepHelper (node->right, t);
---
>   /*right_black_cnt = checkRepHelper (node->right, t);*/
700c700
<   left_black_cnt = checkRepHelper (node->left, t);
---
>   /*left_black_cnt = checkRepHelper (node->left, t);*/

These bugs are all in the checkRep code itself, which was not even targeted by symbolic execution. While these bugs do not involve actual faulty red-black tree behavior, they show that our fuzzers could allow subtle flaws to be introduced into the red-black tree’s tools for checking its own validity. In the right context, these could be serious faults, and certainly show a gap in the fuzzer-based testing. In order to see how hard it is to detect these faults, we tried using libFuzzer on each of these mutants, with our one hour corpus as seed, for one additional hour of fuzzing on each mutant. It was still unable to detect any of these mutants.

While generating tests using symbolic execution takes more computational power, and, perhaps, more human effort, the very thorough (if limited in scope) tests that result can detect bugs that even aggressive fuzzing may miss. Such tests are certainly a powerful addition to a regression test suite for an API. Learning to use DeepState makes mixing fuzzing and symbolic execution in your testing easy. Even if you need a new harness for symbolic execution work, it looks like, and can share code with, most of your fuzzing-based testing. A major long-term goal for DeepState is to increase the scalability of symbolic execution for API sequence testing, using high-level strategies not dependent on the underlying engine, so you can use the same harness more often.

See the DeepState repo for more information on how to use symbolic execution.

What About Code Coverage?

We didn’t even look at code coverage in our fuzzing. The reason is simple: if we’re willing to go to the effort of applying mutation testing, and examining all surviving mutants, there’s not much additional benefit in looking at code coverage. Under the hood, libFuzzer and the symbolic execution engines aim to maximize coverage, but for our purposes mutants work even better. After all, if we don’t cover mutated code, we can hardly kill it. Coverage can be very useful, of course, in early stages of fuzzer harness development, where mutation testing is expensive, and you really just want to know if you are even hitting most of the code. But for intensive testing, when you have the time to do it, mutation testing is much more thorough. Not only do you have to cover the code, you actually have to test what it does. In fact, at present, most scientific evidence for the usefulness of code coverage relies on the greater usefulness of mutation testing.

Further Reading

For a more involved example using DeepState to test an API, see the TestFs example, which tests a user-level, ext3-like file system, or the differential tester that compares behavior of Google’s leveldb and Facebook’s rocksdb. For more details on DeepState in general, see our NDSS 2018 Binary Analysis Research Workshop paper.

Fuzzing an API with DeepState (Part 1)

Alex Groce, Associate Professor, School of Informatics, Computing and Cyber Systems, Northern Arizona University

Using DeepState, we took a handwritten red-black tree fuzzer and, with minimal effort, turned it into a much more fully featured test generator. The DeepState fuzzer, despite requiring no more coding effort, supports replay of regression tests, reduction of the size of test cases for debugging, and multiple data-generation back-ends, including Manticore, angr, libFuzzer, and AFL. Using symbolic execution, we even discovered artificially introduced bugs that the original fuzzer missed. After reading this article, you should be ready to start applying high-powered automated test generation to your own APIs.

Background

In 2013, John Regehr wrote a blog post on “How to Fuzz an ADT Implementation.” John wrote at some length about general issues in gaining confidence that a data-type implementation is reliable, discussing code coverage, test oracles, and differential testing. If you have not yet read John’s article, then I recommend reading it now. It gives a good overview of how to construct a simple custom fuzzer for an ADT, or, for that matter, any fairly self-contained API where there are good ways to check for correctness.

The general problem is simple. Suppose we have a piece of software that provides a set of functions or methods on objects. Our running example in this post is a red-black tree; however, an AVL tree, a file-system, an in-memory store, or even a crypto library could easily be swapped in. We have some expectations about what will happen when we call the available functions. Our goal is to thoroughly test the software, and the traditional unit-testing approach to the problem is to write a series of small functions that look like:

result1 = foo(3, "hello");
result2 = bar(result1, "goodbye");
assert(result2 == DONE);

That is, each test has the form: “do something, then check that it did the right thing.” This approach has two problems. First, it’s a lot of work. Second, the return on investment for that work is not as good as you would hope; each test does one specific thing, and if the author of the tests doesn’t happen to think of a potential problem, then the tests are very unlikely to catch that problem. These unit tests are insufficient for the same reasons that AFL and other fuzzers have been so successful at finding security vulnerabilities in widely used programs: humans are too slow at writing many tests, and are limited in their ability to imagine insane, harmful inputs. The randomness of fuzzing makes it possible to produce many tests very quickly and results in tests that go far outside the “expected uses.”

Fuzzing is often thought of as generating files or packets, but it can also generate sequences of API calls to test software libraries. Such fuzzing is often referred to as random or randomized testing, but fuzzing is fuzzing. Instead of a series of unit tests doing one specific thing, a fuzzer test (also known as a property-based test or a parameterized unit test) looks more like:

foo_result = NULL;
bar_result = NULL;
repeat LENGTH times:
   switch (choice):
      choose_foo:
         foo_result = foo(randomInt(), randomString());
         break;
      choose_bar:
         bar_result = bar(foo_result, randomString());
         break;
      choose_baz:
         baz_result = baz(foo_result, bar_result);
         break;
   checkInvariants();

That is, the fuzzer repeatedly chooses a random function to call, and then calls the chosen function, perhaps storing the results for use in later function calls.

A well-constructed test of this form will include lots of generalized assertions about how the system should behave, so that the fuzzer is more likely to shake out unusual interactions between the function calls. The most obvious such checks are any assertions in the code, but there are numerous other possibilities. For a data structure, this will come in the form of a repOK function that makes sure that the ADT’s internal representation is in a consistent state. For red-black trees, that involves checking node coloring and balance. For a file system, you may expect that chkdsk will never find any errors after a series of valid file system operations. In a crypto library (or a JSON parser, for that matter, with some restrictions on the content of message) you may want to check round-trip properties: message == decode(encode(message, key), key). In many cases, such as with ADTs and file systems, you can use another implementation of the same or similar functionality, and compare results. Such differential testing is extremely powerful, because it lets you write a very complete specification of correctness with relatively little work.

John’s post doesn’t just give general advice, it also includes links to a working fuzzer for a red-black tree. The fuzzer is effective and serves as a great example of how to really hammer an API using a solid test harness based on random value generation. However, it’s also not a completely practical testing tool. It generates inputs, and tests the red-black tree, but when the fuzzer finds a bug, it simply prints an error message and crashes. You don’t learn anything except “Your code has a bug. Here is the symptom.” Modifying the code to print out the test steps as they happen slightly improves the situation, but there are likely to be hundreds or thousands of steps before the failure.

Ideally, the fuzzer would automatically store failing test sequences in a file, minimize the sequences to make debugging easy, and make it possible to replay old failing tests in a regression suite. Writing the code to support all this infrastructure is no fun (especially in C/C++) and dramatically increases the amount of work required for your testing effort. Handling the more subtle aspects, such as trapping assertion violations and hard crashes so that you write the test to the file system before terminating, is also hard to get right.

AFL and other general-purpose fuzzers usually provide this kind of functionality, which makes fuzzing a much more practical tool in debugging. Unfortunately, such fuzzers are not convenient for testing APIs. They typically generate a file or byte buffer, and expect that the program being tested will take that file as input. Turning a series of bytes into a red-black tree test is probably easier and more fun than writing all the machinery for saving, replaying, and reducing tests, but it still seems like a lot of work that isn’t directly relevant to your real task: figuring out how to describe valid sequences of API calls, and how to check for correct behavior. What you really want is a unit testing framework like GoogleTest, but one that is capable of varying the input values used in tests. There are lots of good tools for random testing, including my own TSTL, but few sophisticated ones target C/C++, and none that we are aware of let you use any test generation method other than the tools’ built-in random tester. That’s what we want: GoogleTest, but with the ability to use libFuzzer, AFL, HonggFuzz, or what you will to generate data.

Enter DeepState

DeepState fills that need, and more. (We’ll get to the ‘more’ when we discuss symbolic execution).

Translating John’s fuzzer into a DeepState test harness is relatively easy. Here is a DeepState version of “the same fuzzer.” The primary changes for DeepState, which can be found in the file deepstate_harness.cpp, are:

  • Remove main and replace it with a named test (TEST(RBTree, GeneralFuzzer))
    • A DeepState file can contain more than one named test, though it is fine to only have one test.
  • Just create one tree in each test, rather than having an outer loop that iterates over calls that affect a single tree at a time.
    • Instead of a fuzzing loop, our tests are closer to very generalized unit tests: each test does one sequence of interesting API calls.
    • DeepState will handle running multiple tests; the fuzzer or symbolic execution engine will provide the “outer loop.”
  • Fix the length of each API call sequence to a fixed value, rather than a random one.
    • The #define LENGTH 100 at the top of the file controls how many functions we call in each test.
    • Having bytes be in somewhat the same positions in every test is helpful for mutation-based fuzzers. Extremely long tests will go beyond libFuzzer’s default byte length.
    • So long as they don’t consume so many bytes that fuzzers or DeepState reach their limits, or have trouble finding the right bytes to mutate, longer tests are usually better than shorter tests. There may be a length five sequence that exposes your bug, but DeepState’s brute-force fuzzer and even libFuzzer and AFL will likely have trouble finding it, and more easily produce a length 45 version of the same problem. Symbolic execution, on the other hand, will find such rare sequences for any length it can handle.
    • For simplicity, we use a #define in our harness, but it is possible to define such testing parameters as optional command-line arguments with a default value, for even greater flexibility in testing.  Just use the same tools as DeepState uses to define its own command-line options (see DeepState.c and DeepState.h).
  • Replace various rand() % NNN calls with DeepState_Int(), DeepState_Char() and DeepState_IntInRange(...) calls.
    • DeepState provides calls to generate most of the basic data types you want, optionally over restricted ranges.
    • You can actually just use rand() instead of making DeepState calls. If you include DeepState and have defined DEEPSTATE_TAKEOVER_RAND, all rand calls will be translated to appropriate DeepState functions. The file easy_deepstate_fuzzer.cpp shows how this works, and is the simplest translation of John’s fuzzer. It isn’t ideal, since it doesn’t provide any logging to show what happens during tests. This is often the easiest way to convert an existing fuzzer to use DeepState; the changes from John’s fuzzer are minimal: 90% of the work is just changing a few includes and removing main.
  • Replace the switch statement choosing the API call to make with DeepState’s OneOf construct.
    • OneOf takes a list of C++ lambdas, and chooses one to execute.
    • This change is not strictly required, but using OneOf simplifies the code and allows optimization of choices and smart test reduction.
    • Another version of OneOf takes a fixed-size array as input, and returns some value in it; e.g., OneOf("abcd") will produce a character, either a, b, c, or d.

There are a number of other cosmetic (e.g. formatting, variable naming) changes, but the essence of the fuzzer is clearly preserved here. With these changes, the fuzzer works almost as before, except that instead of running the fuzz_rb executable, we’ll use DeepState to run the test we’ve defined and generate input values that choose which function calls to make, what values to insert in the red-black tree, and all the other decisions represented by DeepState_Int, OneOf, and other calls:

int GetValue() {
  if (!restrictValues) {
    return DeepState_Int();
  } else {
    return DeepState_IntInRange(0, valueRange);
  }
}
...
  for (int n = 0; n < LENGTH; n++) {
    OneOf(
      [&] {
        int key = GetValue();
        int* ip = (int*)malloc(sizeof(int));
        *ip = key;
        if (!noDuplicates || !containerFind(*ip)) {
          void* vp = voidP();
          LOG(TRACE) << n << ": INSERT:" << *ip << " " << vp;
          RBTreeInsert(tree, ip, vp);
          containerInsert(*ip, vp);
        } else {
          LOG(TRACE) << n << ": AVOIDING DUPLICATE INSERT:" << *ip;
          free(ip);
        }
      },
      [&] {
        int key = GetValue();
        LOG(TRACE) << n << ": FIND:" << key;
        if ((node = RBExactQuery(tree, &key))) {
          ASSERT(containerFind(key)) << "Expected to find " << key;
        } else {
          ASSERT(!containerFind(key)) << "Expected not to find " << key;
        }
      },
...

Installing DeepState

The DeepState GitHub repository provides more details and dependencies, but on my MacBook Pro, installation is simple:

git clone https://github.com/trailofbits/deepstate
cd deepstate
mkdir build
cd build
cmake ..
sudo make install

Building a version with libFuzzer enabled is slightly more involved:

brew install llvm@7
git clone https://github.com/trailofbits/deepstate
cd deepstate
mkdir build
cd build
CC=/usr/local/opt/llvm\@7/bin/clang CXX=/usr/local/opt/llvm\@7/bin/clang++ BUILD_LIBFUZZER=TRUE cmake ..
sudo make install

AFL can also be used to generate inputs for DeepState, but most of the time, raw speed (due to not needing to fork), decomposition of compares, and value profiles seem to give libFuzzer an edge for this kind of API testing, in our (limited experimentally!) experience. For more on using AFL and other file-based fuzzers with DeepState, see the DeepState README.

Using the DeepState Red-Black Tree Fuzzer

Once you have installed DeepState, building the red-black tree fuzzer(s) is also simple:

git clone https://github.com/agroce/rb_tree_demo
cd rb_tree_demo
make

The make command compiles everything with all the sanitizers we could think of (address, undefined, and integer) in order to catch more bugs in fuzzing. This has a performance penalty, but is usually worth it.

If you are on macOS and using a non-Apple clang in order to get libFuzzer support, you’ll want to do something like

CC=/usr/local/opt/llvm\@7/bin/clang CXX=/usr/local/opt/llvm\@7/bin/clang++ make

in order to use the right (e.g., homebrew-installed) version of the compiler.

This will give you a few different executables of interest. One, fuzz_rb, is simply John’s fuzzer, modified to use a 60-second timeout instead of a fixed number of “meta-iterations.” The ds_rb executable is the DeepState executable. You can fuzz the red-black tree using a simple brute-force fuzzer (that behaves very much like John’s original fuzzer):

mkdir tests
./ds_rb --fuzz --timeout 60 --output_test_dir tests

If you want to see more about what the fuzzer is doing, you can specify a log level using --log_level to indicate the minimum importance of messages you want to see. A log_level of 0 corresponds to including all messages, even debug messages; 1 is TRACE messages from the system under test (e.g., those produced by the LOG(TRACE) code shown above); 2 is INFO, non-critical messages from DeepState itself (this is the default, and usually appropriate); 3 is warnings, and so forth up the hierarchy. The tests directory should be empty at the termination of fuzzing, since the red-black tree code in the repo (to my knowledge) has no bugs. If you add --fuzz_save_passing to the options, you will end up with a large number of files for passing tests in the directory.

Finally, we can use libFuzzer to generate tests:

mkdir corpus
./ds_rb_lf corpus -use_value_profile=1 -detect_leaks=0 -max_total_time=60

The ds_rb_lf executable is a normal libFuzzer executable, with the same command line options. This will run libFuzzer for 60 seconds, and place any interesting inputs (including test failures) in the corpus directory. If there is a crash, it will leave a crash- file in the current directory. You can tune it to perform a little better in some cases by determining the maximum input size your tests use, but this is a non-trivial exercise. In our case at length 100 the gap between our max size and 4096 bytes is not extremely large.

For more complex code, a coverage-driven, instrumentation-based fuzzer like libFuzzer or AFL will be much more effective than the brute force randomness of John’s fuzzer or the simple DeepState fuzzer.  For an example like the red-black-tree, this may not matter as much, since few states may be very hard to reach for a  fast “dumb” fuzzer.  Even here, however, smarter fuzzers have the advantage of producing a corpus of tests that produce interesting code coverage.  DeepState lets you use a faster fuzzer for quick runs, and smarter tools for more in-depth testing, with almost no effort.

We can replay any DeepState-generated tests (from libFuzzer or DeepState’s fuzzer) easily:

./ds_rb --input_test_file file

Or replay an entire directory of tests:

./ds_rb --input_test_files_dir dir

Adding an --exit_on_fail flag when replaying an entire directory lets you stop the testing as soon as you hit a failing or crashing test. This approach can easily be used to add failures found with DeepState (or interesting passing tests, or perhaps corpus tests from libFuzzer) to automatic regression tests for a project, including in CI.

Adding a Bug

This is all fine, but it doesn’t (or at least shouldn’t) give us much confidence in John’s fuzzer or in DeepState. Even if we changed the Makefile to let us see code coverage, it would be easy to write a fuzzer that doesn’t actually check for correct behavior – it covers everything, but doesn’t find any bugs other than crashes. To see the fuzzers in action (and see more of what DeepState gives us), we can add a moderately subtle bug. Go to line 267 of red_black_tree.c and change the 1 to a 0. The diff of the new file and the original should look like:

267c267
<   x->parent->parent->red=0;
---
>   x->parent->parent->red=1;

Do a make to rebuild all the fuzzers with the new, broken red_black_tree.c.

Running John’s fuzzer will fail almost immediately:

time ./fuzz_rb
Assertion failed: (left_black_cnt == right_black_cnt), function checkRepHelper, file red_black_tree.c, line 702.
Abort trap: 6

real 0m0.100s
user 0m0.008s
sys 0m0.070s

Using the DeepState fuzzer will produce results almost as quickly. (We’ll let it show us the testing using the --log_level option, and tell it to stop as soon as it finds a failing test):

time ./ds_rb --fuzz --log_level 1 --exit_on_fail --output_test_dir tests
INFO: Starting fuzzing
WARNING: No seed provided; using 1546625762
WARNING: No test specified, defaulting to last test defined (RBTree_GeneralFuzzer)
TRACE: Running: RBTree_GeneralFuzzer from deepstate_harness.cpp(78)
TRACE: deepstate_harness.cpp(122): 0: DELETE:-747598508
TRACE: deepstate_harness.cpp(190): checkRep...
TRACE: deepstate_harness.cpp(192): RBTreeVerify...
TRACE: deepstate_harness.cpp(122): 1: DELETE:831257296
TRACE: deepstate_harness.cpp(190): checkRep...
TRACE: deepstate_harness.cpp(192): RBTreeVerify...
TRACE: deepstate_harness.cpp(134): 2: PRED:1291220586
TRACE: deepstate_harness.cpp(190): checkRep...
TRACE: deepstate_harness.cpp(192): RBTreeVerify...
TRACE: deepstate_harness.cpp(190): checkRep...
TRACE: deepstate_harness.cpp(192): RBTreeVerify...
TRACE: deepstate_harness.cpp(154): 4: SUCC:-1845067087
TRACE: deepstate_harness.cpp(190): checkRep...
TRACE: deepstate_harness.cpp(192): RBTreeVerify...
TRACE: deepstate_harness.cpp(190): checkRep...
TRACE: deepstate_harness.cpp(192): RBTreeVerify...
TRACE: deepstate_harness.cpp(113): 6: FIND:-427918646
TRACE: deepstate_harness.cpp(190): checkRep...
...
TRACE: deepstate_harness.cpp(192): RBTreeVerify...
TRACE: deepstate_harness.cpp(103): 44: INSERT:-1835066397 0x00000000ffffff9c
TRACE: deepstate_harness.cpp(190): checkRep...
TRACE: deepstate_harness.cpp(192): RBTreeVerify...
TRACE: deepstate_harness.cpp(190): checkRep...
TRACE: deepstate_harness.cpp(192): RBTreeVerify...
TRACE: deepstate_harness.cpp(154): 46: SUCC:-244966140
TRACE: deepstate_harness.cpp(190): checkRep...
TRACE: deepstate_harness.cpp(192): RBTreeVerify...
TRACE: deepstate_harness.cpp(190): checkRep...
TRACE: deepstate_harness.cpp(192): RBTreeVerify...
TRACE: deepstate_harness.cpp(103): 48: INSERT:1679127713 0x00000000ffffffa4
TRACE: deepstate_harness.cpp(190): checkRep...
Assertion failed: (left_black_cnt == right_black_cnt), function checkRepHelper, file red_black_tree.c, line 702.
ERROR: Crashed: RBTree_GeneralFuzzer
INFO: Saved test case to file `tests/6de8b2ffd42af6878875833c0cbfa9ea09617285.crash`
...
real 0m0.148s
user 0m0.011s
sys 0m0.131s

I’ve omitted much of the output above, since showing all 49 steps before the detection of the problem is a bit much, and the details of your output will certainly vary. The big difference from John’s fuzzer, besides the verbose output, is the fact that DeepState saved a test case. The name of your saved test case will, of course, be different, since the names are uniquely generated for each saved test. To replay the test, I would do this:

./ds_rb --input_test_file tests/6de8b2ffd42af6878875833c0cbfa9ea09617285.crash

and I would get to see the whole disaster again, in gory detail. As we said above, this lengthy sequence of seemingly arbitrary operations isn’t the most helpful test for seeing what’s going on. DeepState can help us here:

deepstate-reduce ./ds_rb tests/6de8b2ffd42af6878875833c0cbfa9ea09617285.crash minimized.crash
ORIGINAL TEST HAS 8192 BYTES
LAST BYTE READ IS 509
SHRINKING TO IGNORE UNREAD BYTES
ONEOF REMOVAL REDUCED TEST TO 502 BYTES
ONEOF REMOVAL REDUCED TEST TO 494 BYTES
...
ONEOF REMOVAL REDUCED TEST TO 18 BYTES
ONEOF REMOVAL REDUCED TEST TO 2 BYTES
BYTE RANGE REMOVAL REDUCED TEST TO 1 BYTES
BYTE REDUCTION: BYTE 0 FROM 168 TO 0
NO (MORE) REDUCTIONS FOUND
PADDING TEST WITH 49 ZEROS

WRITING REDUCED TEST WITH 50 BYTES TO minimized.crash

Again, we omit some of the lengthy process of reducing the test. The new test is (much!) easier to understand:

./ds_rb --input_test_file minimized.crash
WARNING: No test specified, defaulting to last test defined (RBTree_GeneralFuzzer)
TRACE: Initialized test input buffer with data from `minimized.crash`
TRACE: Running: RBTree_GeneralFuzzer from deepstate_harness.cpp(78)
TRACE: deepstate_harness.cpp(103): 0: INSERT:0 0x0000000000000000
TRACE: deepstate_harness.cpp(190): checkRep...
TRACE: deepstate_harness.cpp(192): RBTreeVerify...
TRACE: deepstate_harness.cpp(103): 1: INSERT:0 0x0000000000000000
TRACE: deepstate_harness.cpp(190): checkRep...
TRACE: deepstate_harness.cpp(192): RBTreeVerify...
TRACE: deepstate_harness.cpp(103): 2: INSERT:0 0x0000000000000000
TRACE: deepstate_harness.cpp(190): checkRep...
Assertion failed: (left_black_cnt == right_black_cnt), function checkRepHelper, file red_black_tree.c, line 702.
ERROR: Crashed: RBTree_GeneralFuzzer

We just need to insert three identical values into the tree to expose the problem. Remember to fix your red_black_tree.c before proceeding!

In Part 2, we’ll look at how to assess the quality of our testing: is our DeepState testing as effective as John’s fuzzer? Are both approaches unable to find certain subtle bugs? And what about symbolic execution?

How McSema Handles C++ Exceptions

C++ programs using exceptions are problematic for binary lifters. The non-local control-flow “throw” and “catch” operations that appear in C++ source code do not map neatly to straightforward binary representations. One could allege that the compiler, runtime, and stack unwinding library collude to make exceptions work. We recently completed our investigation into exceptions and can claim beyond a reasonable doubt that McSema is the only binary lifter that correctly lifts programs with exception-based control flow.

Our work on McSema had to bridge the semantic gap between a program’s high-level language semantics and its binary representation, which required a complete understanding of how exceptions work under the hood. This post is organized into three sections: first, we explain how C++ exceptions are handled on Linux for x86-64 and introduce core exception handling concepts. Second, we show how we used this knowledge to recover exception information at the binary level. And third, we explain how to emit exception information for the LLVM ecosystem.

A Short Primer in C++ Exception Handling

In this section, we will use a small motivating example to demonstrate how C++ exceptions work at the binary level and discuss exception semantics for Linux programs running on x86-64 processors. While exceptions work differently on different operating systems, processors, and languages, many of the core concepts are identical.

Exceptions are a programming language construct that provide a standardized way to handle abnormal or erroneous situations. They work by automatically redirecting execution flow to a special handler called the exception handler when such an event occurs. Using exceptions, it is possible to be explicit about ways in which operations can fail and how those failures should be handled. For example, some operations like object instantiation and file processing can fail in multiple ways. Exception handling allows the programmer to handle these failures in a generic way for large blocks of code, instead of manually verifying each individual operation.

Exceptions are a core part of C++, although their use is optional. Code that may fail is surrounded in a try {…} block, and the exceptions that may be raised are caught via a catch {…} block. Signalling of exceptional conditions is triggered via the throw keyword, which raises an exception of a specific type. Figure 1 shows a simple program that uses C++ exception semantics. Try building the program yourself (clang++ -o exception exception.cpp) or look at the code in the Compiler Explorer.

#include <iostream>
#include <vector>
#include <stdexcept>
#include <cstdlib>

int main(int argc, const char *argv[]) {
  std::vector<int> myVector(10);
  int var = std::atoi(argv[1]);

  try {
    if( var == 0 ) {
      throw std::runtime_error("Runtime error: argv[1] cannot be zero.");
    }

    if(argc != 2) {
      throw std::out_of_range("Supply one argument.");
    }

    myVector.at( var ) = 100;
    while(true) {
      new int [100000000ul];
    }

  } catch (const std::out_of_range& e) {
    std::cerr << "Out of range error: " << e.what() << '\n';
    return 1;
  } catch (const std::bad_alloc& e) {
    std::cerr << "Allocation failed: " << e.what() << '\n';
    return 1;
  } catch (...) {
    std::cerr << "Unknown error.\n";
    return 1;
  }
  return 0;
}

Figure 1: Example C++ program that throws and catches exceptions, including a catch-all clause.

This simple program can explicitly throw std::runtime_error and std::out_of_range exceptions based on input arguments. It also implicitly throws the std::bad_alloc exception when it runs out of memory. The program installs three exception handlers: one for std::out_of_range, one for std::bad_alloc, and a catch-all handler for generic unknown exceptions. Run the following sample inputs to trigger the three exceptional conditions:

Scenario 1: ./exception 0
Unknown error.
The program checks the input argument; it does not accept `0` as input and throws the std::runtime_error exception.
Scenario 2: ./exception 0 1
Out of range error: vector::_M_range_check
The program expects exactly one input argument and checks for it. If more than one argument is supplied, it throws the std::out_of_range exception.
Scenario 3: ./exception 1
Allocation failed: std::bad_alloc
For any input other than `0`, the program performs a very large memory allocation, which can fail. Such a failure happens at runtime and may otherwise go unnoticed; the memory allocator throws the std::bad_alloc exception so the program can terminate safely.

Let’s look at the same program at the binary level. Compiler Explorer shows the binary code generated by the compiler for this program. The compiler translates the throw statements into a pair of calls to libstdc++ functions (__cxa_allocate_exception and __cxa_throw) that allocate the exception structure and start the process of stack unwinding, which cleans up local objects in the scopes leading up to the exception (see lines 40-48 in Compiler Explorer).

Stack unwinding: Removes the stack frames of exited functions from the process stack.

The catch statements are translated into code that handles the exception and performs clean-up operations, called the landingpad. The compiler also generates an exception table that ties together everything the language runtime needs to dispatch exceptions, including the exception types, the associated landing pads, and various utility functions.

landingpad: User code intended to catch an exception. It gains control from the exception runtime via the personality function, and either merges into the normal user code or returns to the runtime by resuming or raising a new exception.

When an exception occurs, the stack unwinder cleans up previously allocated variables and calls the catch block. The unwinder:

  1. Calls the libstdc++ personality function. First, the stack unwinder calls a special function provided by libstdc++ called the personality function. The personality function will determine whether the raised exception is handled by a function somewhere on the call stack. In high-level terms, the personality function determines whether there is a catch block that should be called for this exception. If no handler can be located (i.e. the exception is unhandled), the personality function terminates the program by calling std::terminate.
  2. Cleans up allocated objects. To cleanly call the catch block, the unwinder must first clean up (i.e. call destructors for each allocated object) after every function called inside the try block. The unwinder will iterate through the call stack, using the personality function to identify a cleanup method for each stack frame. If there are any cleanup actions, the unwinder calls the associated cleanup code.
  3. Executes the catch block. Eventually the unwinder will reach the stack frame of the function containing the exception handler, and execute the catch block.
  4. Releases memory. Once the catch block completes, a cleanup function will be called again to release memory allocated for the exception structure.

For the curious, more information is available in the comments and source code for libgcc’s stack unwinder.

personality function: A libstdc++ function called by the stack unwinder. It determines whether there is a catch block for a raised exception. If none is found, the program is terminated with std::terminate.

Recovering Exception Information

Recovering exception-based control flow is a challenging proposition for binary analysis tools like McSema. The fundamental data is difficult to assemble, because exception information is spread throughout the binary and tied together via multiple tables. Utilizing exception data to recover control flow is hard, because operations that affect flow, like stack unwinding, calls to personality functions, and exception table decoding happen outside the purview of the compiled program.

Here’s a quick summary of the end goal. McSema must identify every basic block that may raise an exception (i.e. the contents of a try block) and associate it with the appropriate exception handler and cleanup code (i.e. the catch block or landing pad). This association will then be used to re-generate exception handlers at the LLVM level. To associate blocks with landing pads, McSema parses the exception table to provide these mappings.

We’re going to go into some detail about the exception table. It’s important to understand, because this is the main data structure that allows McSema to recover exception-based control flow.

The Exception Table

The exception table provides language runtimes with the information needed to support exceptions. It has two levels: a language-independent level and a language-specific level. Locating and restoring stack frames is language agnostic, so that information is stored in the language-independent level. Identifying the frame that handles the exception and transferring control to it is language dependent, so that information is stored in the language-specific level.

Language-Independent Level

The table is stored in special sections in the binary called .eh_frame and .eh_frame_hdr. The .eh_frame section contains one or more call frame information records encoded in the DWARF debug information format. Each frame information record contains a Common Information Entry (CIE) record, followed by one or more Frame Descriptor Entry (FDE) records. Together they describe how to unwind the caller based on the current instruction pointer. More details are described in the Linux Standards Base documentation.

Language-Specific Level

The language-specific data area (LSDA) contains pointers to related data, a list of call sites, and a list of action records. Each function has its own LSDA, which is provided as the augmentation data of the Frame Descriptor Entry (FDE). Information from the LSDA is essential to recovering C++ exception information, and in translating it to LLVM semantics.


Figure 2: Exception handling information from the current instruction pointer (IP). Graphic adapted from the linked original.

The LSDA header describes how exception information applies to language-specific procedure fragments. Figure 2 shows the LSDA in more detail. There are two fields defined in the LSDA header that McSema needs to recover exception information:

  • The landing pad start pointer: A relative offset to the start of the landing pad code.
  • The types table pointer: A relative offset to the types table, which describes exception types handled by the catch clauses for this procedure fragment.

Following the LSDA header, the call site table lists all call sites that may throw an exception. Each entry in the call site table indicates the position of the call site, the position of the landing pad, and the first action record for that call site. A missing entry from the call site table indicates that a call should not throw an exception. Information from this table will be used by McSema during the translation stage to emit proper LLVM semantics for call sites that may throw exceptions.
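
Conceptually, each call site record carries four values. The sketch below shows them as a fixed-width struct for clarity; in the real LSDA the fields are variable-length, ULEB128-encoded values, so treat this as an illustration rather than the on-disk layout:

#include <cstdint>

// Conceptual view of one LSDA call site record (fields are ULEB128-encoded in
// the actual table, not fixed-width).
struct CallSiteRecord {
  uint64_t start;        // start of the instruction range, relative to the landing pad base
  uint64_t length;       // length of that instruction range
  uint64_t landing_pad;  // offset of the landing pad; 0 means there is no landing pad
  uint64_t action;       // 1-based offset into the action table; 0 means no action record
};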

The action table follows the call site table in the LSDA and specifies both catch clauses and exception specifications. By exception specifications we mean the much-maligned C++ feature of the same name, which enumerates the exceptions a function may throw. The two record types have the same format and are distinguished solely by the first field of each entry: positive values specify types used in catch clauses, and negative values specify exception specifications. Figure 3 shows action table entries for catch clauses, a catch-all clause, and an exception specification. (The feature was deprecated in C++11 and removed in C++17.) Because exception specifications are deprecated and rarely used, McSema does not currently handle them.

.gcc_except_table:4022CF db 7Fh; ar_filter[1]: -1(exception spec index = 4022EC)
.gcc_except_table:4022D0 db 0  ; ar_next[1]: 0 (end)
.gcc_except_table:4022D1 db 0  ; ar_filter[2]: 0 (cleanup)
.gcc_except_table:4022D2 db 7Dh; ar_next[2]: -3 (next: 1 => 004022CF)
.gcc_except_table:4022D3 db 4  ; ar_filter[3]: 4 (catch typeinfo = 000000)
.gcc_except_table:4022D4 db 0  ; ar_next[3]: 0 (end)
.gcc_except_table:4022D5 db 1  ; ar_filter[4]: 1 (catch typeinfo = 00603280)
.gcc_except_table:4022D6 db 7Dh; ar_next[4]: -3 (next: 3 => 004022D3)
.gcc_except_table:4022D7 db 3  ; ar_filter[5]: 3 (catch typeinfo = 603230)
.gcc_except_table:4022D8 db 7Dh; ar_next[5]: -3 (next: 4 => 004022D5)

Figure 3: Action table entries in the LSDA section

Lifting Exception Information

So far we have looked at how exceptions in C++ work at a low level, how exception information is stored, and how McSema recovers exception based control flow. Now we will look at how McSema lifts this control flow to LLVM.

To lift exception information, the exception and language semantics described in the last section have to be recovered from the binary and translated into LLVM. The recovery and translation is a three-phase process that required updating control flow graph (CFG) recovery, lifting, and runtime components of McSema.

McSema’s translation stage uses the information gleaned from CFG recovery to generate LLVM IR that handles exception semantics. To ensure the final binary will execute like the original, the following steps must happen:

  • McSema must associate exception handlers and cleanup methods with blocks that raise exceptions. Functions that throw exceptions must be called via LLVM’s invoke instruction versus the call instruction.
  • Stack unwinding has to be enabled for function fragments that raise exceptions. This is complicated by the fact that translated code may have two stacks: a native stack (used for calling external APIs) and a lifted stack.
  • McSema must ensure there is a smooth transition between lifted code and the language runtime. Handlers called directly by the language runtime must serialize processor state into a structure expected by lifted code.

Associating Blocks and Handlers

The initial association between blocks that may throw exceptions and the handlers for those exceptions is performed during CFG recovery, via information extracted from the exception table. This association is required because the translator must ensure that functions which may throw exceptions are called via LLVM’s invoke semantics and not the typical call instruction. The invoke instruction has two continuation points: normal flow when the call succeeds, and exception flow (i.e., the exception handler) if the function raises an exception (Figure 4). The replacement of call with invoke must cover every invocation of such a function: a single remaining call convinces the optimizer that the function doesn’t throw and needs no exception table.

%1403 = call i64 @__mcsema_get_stack_pointer()
store i64 %1403, i64* %stack_ptr_var
%1404 = call i64 @__mcsema_get_frame_pointer()
store i64 %1404, i64* %frame_ptr_var
%1405 = load %struct.Memory*, %struct.Memory** %MEMORY
%1406 = load i64, i64* %PC
%1407 = invoke %struct.Memory* @ext_6032a0__Znam(%struct.State* %0,
i64 %1406, %struct.Memory* %1405)
to label %block_40119f unwind label %landingpad_4012615

Figure 4: An invoke instruction replaces a call instruction to a function that may throw an exception

Unwinding of the Stack

When an exception occurs, control transfers from the throw statement to the first catch statement that can handle the exception. Before the transfer, variables defined in function scope must be properly destroyed. This is called stack unwinding.

McSema uses two different stacks: one for lifted code, and one for native code (i.e. external functions). The split stack puts limitations on stack unwinding, since the native execution (i.e. libstdc++ API) doesn’t have a full view of the stack. To support stack unwinding, we added a new flag, --abi-libraries, which enables the usage of the same stack for lifted and native code execution.

The --abi-libraries flag enables use of the same stack for native and lifted code by removing the need for lifted-code-to-native transitions. Without it, McSema needs to transition stacks so that an external function that knows nothing about McSema can see CPU state as it was in the original program. Application binary interface (ABI) libraries, which provide external function signatures (including the return value, argument types, and argument count), allow lifted code to directly call native functions on the same stack. Figure 5 shows a snapshot of function signatures defined via ABI libraries.

declare i8* @__cxa_allocate_exception(i64) #0
declare void @__cxa_free_exception(i8*) #0
declare i8* @__cxa_allocate_dependent_exception() #0
declare void @__cxa_free_dependent_exception(i8*) #0
declare void @__cxa_throw(i8*, %"class.std::type_info"*, void (i8*)*) #0
declare i8* @__cxa_get_exception_ptr(i8*) #0
declare i8* @__cxa_begin_catch(i8*) #0
declare void @__cxa_end_catch() #0

Figure 5: An ABI library defining the external functions relating to exceptions.

Exception handling at runtime

Exception handlers and cleanup methods are called by the language runtime, and are expected to follow a strict calling convention. Lifted code does not follow standard calling convention semantics, because it expresses the original instructions as operations on CPU state. To support these callbacks, we implemented a special adaptor that converts a native state into a machine context usable by lifted code. Special care has been taken to preserve the RDX register, which stores the type index of the exception.

There is one more trick to emitting functional exception handlers: proper ordering of type indices. Recall that our motivating example (Figure 1) has three exception handlers: std::out_of_range, std::bad_alloc, and the catch-all handler. Each of these handlers is assigned a type index, say 1, 2, and 3 respectively (Figure 6a), meaning that the original program expects type index 1 to correspond to std::out_of_range.

.gcc_except_table:402254 db 3      ; ar_filter[1]: 3 (catch typeinfo = 000000)
.gcc_except_table:402255 db 0      ; ar_next[1]: 0 (end) 
.gcc_except_table:402256 db 2      ; ar_filter[2]: 2 (catch typeinfo = 603280)
.gcc_except_table:402257 db 7Dh    ; ar_next[2]: -3 (next: 1 => 402254)
.gcc_except_table:402258 db 1      ; ar_filter[3]: 1 (catch typeinfo = 603230)
.gcc_except_table:402259 db 7Dh    ; ar_next[3]: -3 (next: 2 => 402256)
.gcc_except_table:40225A db 0 
.gcc_except_table:40225B db 0 
.gcc_except_table:40225C dd 0      ; Type index 3 
.gcc_except_table:402260 dd 603280h; Type index 2 
.gcc_except_table:402264 dd 603230h; Type index 1

.gcc_except_table:41A78E db 1      ; ar_filter[1]: 1 (catch typeinfo = 000000)
.gcc_except_table:41A78F db 0      ; ar_next[1]: 0 (end) 
.gcc_except_table:41A790 db 2      ; ar_filter[2]: 2 (catch typeinfo = 61B450)
.gcc_except_table:41A791 db 7Dh    ; ar_next[2]: -3 (next: 1 => 0041A78E)
.gcc_except_table:41A792 db 3      ; ar_filter[3]: 3 (catch typeinfo = 61B4A0)
.gcc_except_table:41A793 db 7Dh    ; ar_next[3]: -3 (next: 2 => 0041A790)
.gcc_except_table:41A794 dd 61B4A0h; Type index 3 
.gcc_except_table:41A798 dd 61B450h; Type index 2 
.gcc_except_table:41A79C dd 0      ; Type index 1

Figure 6 (a & b): Type indices assignment in the exception table of the original & new binary (std::out_of_range, std::bad_alloc, and catch-all exception types are assigned type index 1, 2, and 3 respectively)

During the lifting process McSema recreates exception handlers used in the program. The type index assigned to each handler is generated at compile time. When lifted bitcode is compiled into a new binary, the type indices could be, and often are, reassigned. For example, std::out_of_range could get type index 3 in a new binary (Figure 6b). This would cause the lifted binary to run the catch-all handler when std::out_of_range is thrown!

To ensure the right exception handler is called, McSema generates a static map (see gvar_landingpad_401133 in Figure 7) of original type indices to new type indices, and fixes the type index during the landingpad passthrough. The landingpad passthrough is a function that McSema generates automatically. Not only does it ensure the type index is correct, it also transitions between lifted and native state.

Upon being called, the passthrough saves native execution state, loads lifted state, and calls any exception handlers (which have been lifted and expect lifted state). When the passthrough returns (in case the exception wasn’t handled), it must do the reverse and transition from lifted to native state to return into runtime library code. Figure 7 shows the landingpad passthrough generated for our motivating example. The generated passthrough code gets the type index from the RDX register using the function __mcsema_get_type_index. It fixes the index and restores the machine context of the lifted execution using the function __mcsema_exception_ret. Wrapper code around the invoke statement saves the stack and frame pointers in the function context.

%landingpad_4011336 = landingpad { i8*, i32 }
catch i8* @"_ZTISt13runtime_error@@GLIBCXX_3.4"
catch i8* @"_ZTISt12out_of_range@@GLIBCXX_3.4"
catch i8* null
%4021 = call i64 @__mcsema_get_type_index()
%4022 = getelementptr [4 x i32], [4 x i32]* @gvar_landingpad_401133, i32 0, i64 %4021
%4023 = load i64, i64* %stack_ptr_var
%4024 = load i64, i64* %frame_ptr_var
%4025 = load i32, i32* %4022
call void @__mcsema_exception_ret(i64 %4023, i64 %4024, i32 %4025)
br label %block_401133

Figure 7: landing pad pass through for the exception handlers

With all of these pieces in place, McSema can finally translate C++ programs that use exceptions into LLVM.

Conclusion

To our knowledge, McSema is the only binary lifter to handle C++ exceptions, which are common throughout C++ software of any complexity. As we have shown, exception-based control flow recovery and accurate translation are extremely complex and difficult to implement correctly. Implementing exception handling touched all parts of McSema, presenting new challenges for both control flow recovery and translation. The fine details, such as type index re-ordering and ensuring every call is replaced with an invoke, had to be discovered the hard way by debugging subtle and frustrating failures.

We are continuing to develop and enhance McSema, and have more to share about exciting new features. If you are interested in McSema, try it out, contribute (we love open source contributions!), and talk to us in the binary-lifting channel on the Empire Hacking Slack.

Empire Hacking: Ethereum Edition 2

On December 12, over 150 attendees joined a special, half-day Empire Hacking to learn about pitfalls in smart contract security and how to avoid them. Thank you to everyone who came, to our superb speakers, and to BuzzFeed for hosting this meetup at their office.

Watch the presentations again

It’s hard to find such rich collections of practical knowledge about securing Ethereum on the internet. For many of you, it was even harder to attend the event. That’s why we’re posting these recordings. We hope you find them useful.

Anatomy of an unsafe smart contract programming language

Our own Evan Sultanik gave an introduction to blockchains and smart contracts, followed by a dissection of Solidity: the most popular smart contract programming language (slides).

Takeaways

  • Solidity harbors many unsafe features that allow even experienced, competent programmers to easily shoot themselves in the foot.
  • Solidity is changing quickly, which is both bad and good. Developers must keep pace with new compiler releases and beware the implications of contract upgradability.
  • There is an effort to introduce an intermediate representation to the compiler. Early indications suggest that it suffers from many of the same design decisions that have plagued Solidity.

Evaluating digital asset security fundamentals

Shamiq Islam of Coinbase discussed problems in the security of digital assets and their ecosystems. These problems pose unique and interesting challenges to cryptocurrency exchanges like Coinbase.

Takeaways

  • The supply chain is untrustworthy. How do you validate that an asset issuer’s node, smart contract, or wallet code is authentic?
  • Security communications channels are immature. If you find a bug, how do you know where to report it? Conversely, how can exchanges like Coinbase become aware if a bug is found? Monitoring a Telegram chat for each asset does not scale.
    [For the time being, use our directory of Blockchain Security Contacts.]
  • How do we know when a smart contract owner is malicious? The code itself may be secure, but the owner’s key is a single point of failure. If compromised, it can arbitrarily modify the accounting of an asset in most cases.

Contract upgrade risks and recommendations

Our own Josselin Feist compared several different strategies for upgrading smart contracts. This talk covers everything you need to know to decide how and if to implement upgradability in your contracts (slides).

Takeaways

  • Upgradability is useful for developers as it allows for features to be added and bugs to be fixed after the fact. However, it also adds complexity and increases the likelihood of deployment mistakes.
  • Use the simplest upgrade system that suits your needs. Compared to data separation, the delegatecall proxy pattern is very fragile and adds even more complexity.
  • Instead of these upgradability patterns, consider contract migration. Migration is more involved, but it allows for recovery from many more scenarios.

How to buidl an enterprise-grade mainnet Ethereum client

S. Matthew English of PegaSys highlighted the trials and tribulations of implementing a new Ethereum client. Many of the insights in the talk apply just as well to any large-scale software engineering projects.

Takeaways

  • Building an Ethereum client is hard. The protocol itself is poorly documented, uses non-standard computer science concepts, and has continued to evolve at the same time as existing clients evolve.
  • Team communication, architectural design, and incremental progress validation were important factors in the successful development of PegaSys.
  • Pantheon is now syncing with Ethereum mainnet, has been open-sourced, and is available for download.

Failures in on-chain privacy

Ian Miers of Cornell Tech and The Zcash Foundation provided an overview of privacy issues in cryptocurrencies. Cryptocurrencies might not be as private as you thought.

Takeaways

  • Privacy in cryptocurrency has been misunderstood since the very beginning. It’s important that we figure it out now before it’s too late to fix.
  • Decoy-based approaches may seem successful in isolation, but the privacy claims they make break down in real-world scenarios.
  • Stronger approaches are deployed and efficient, but they still need important work to improve usability for end users.

Secure micropayment protocols

Yondon Fu of Livepeer highlighted some security requirements unique to micropayment methods. He shared how Livepeer is making micropayments securely scale.

Takeaways

  • Micropayments are useful for a variety of applications, in particular those with the potential for ongoing or a high volume of transactions. However, high deployment and transaction costs have stymied widespread adoption.
  • Security considerations for clients common to most micropayment methods include security of the hot signing key and timely transaction confirmation of additional necessary transactions, even when gas prices fluctuate.
  • Important considerations for probabilistic micropayments include secure random number generation and protection from replay attacks and double spends.

Designing the Gemini dollar: a regulated, upgradeable, transparent stablecoin

Brandon Arvanaghi of Gemini Trust explained the design decisions that went into the regulated, upgradable, and transparent Gemini dollar, comparing and contrasting it with other implementations.

Takeaways

  • Upgradability in smart contracts provides a means for response to illicit activities and bugs, but can reduce transparency and expand the attack surface.
  • Contract modularity, ownership distribution, and “time-locked” upgrades help mitigate these issues.
  • Take every opportunity to provide multi-level mitigations. Gemini ensures that even if an attacker were to compromise a contract with all of its underlying logic (Impl), its custodian/owner-contract would need to be compromised too, as it is the sole entity to confirm printed tokens.

Property testing with Echidna and Manticore for secure smart contracts

Our own JP Smith introduced the concept of property-based testing and its application to smart contracts. This includes strategies for picking good properties and testing them thoroughly (slides).

Takeaways

  • Unit testing is not always sufficient: it tests one individual case at a time, and typically focuses on known cases and failure modes. Property testing aims to cover unknown cases by specifying generic code invariants.
  • Echidna is a tool for property testing smart contracts, which is extremely fast and can discover new transaction sequences that violate code properties.
  • When property-based testing with such tools, you’re likely to hit conditions that a developer would have missed in their individual unit tests.

Simple is hard: Making your awesome security thing usable

Patrick Nielsen and Amber Baldet of Clovyr went down infosec memory lane of great-ideas-that-didn’t-quite-catch-on to help attendees think about how to get people to use what they build (slides).

Takeaways

  • A great idea or tool can often be derailed by its (lack of) usability, undermining its potential to deliver immense real-world value. Sweat the “boring stuff” if you want your worthwhile work to be worth it.
  • Most end users don’t change settings or look for anything beyond the default, most devs don’t want to mess with complex configs. Practice simplicity at every opportunity and do as much as you can in the background for both.
  • Regular people care about simplicity, stability and cost. Power users care about implementation details. Developers care about approachability, utility, and ops overhead. Businesses care about technical risk and the bottom line. Put yourself in the shoes of each; don’t expect them to change priorities just because you made something innovative or pure.
  • Practice what you preach; if we (in the security and “crypto” communities) use tools that are fundamentally insecure or data hungry, how can we expect others to act differently?

Attend the next Empire Hacking on February 12. Join the meetup to RSVP.

How to write a rootkit without really trying

We open-sourced a fault injection tool, KRF, that uses kernel-space syscall interception. You can use it today to find faulty assumptions (and resultant bugs) in your programs. Check it out!


This post covers intercepting system calls from within the Linux kernel, via a plain old kernel module.

We’ll go through a quick refresher on syscalls and why we might want to intercept them and then demonstrate a bare-bones module that intercepts the read(2) syscall.

But first, you might be wondering:

What makes this any different from $other_fault_injection_strategy?

Other fault injection tools rely on a few different techniques:

  • There’s the well-known LD_PRELOAD trick, which really intercepts the syscall wrapper exposed by libc (or your language runtime of choice); a minimal example follows after this list. This often works (and can be extremely useful for e.g. spoofing the system time within a program or using SOCKS proxies transparently), but comes with some major downsides:
    • LD_PRELOAD only works when libc (or the target library of choice) has been dynamically linked, but newer languages (read: Go) and deployment trends (read: fully static builds and non-glibc Linux containers) have made dynamic linkage less popular.
    • Syscall wrappers frequently deviate significantly from their underlying syscalls: depending on your versions of Linux and glibc open() may call openat(2), fork() may call clone(2), and other calls may modify their flags or default behavior for POSIX compliance. As a result, it can be difficult to reliably predict whether a given syscall wrapper invokes its syscall namesake.
  • Dynamic instrumentation frameworks like DynamoRIO or Intel PIN can be used to identify system calls at either the function or machine-code level and instrument their calls and/or returns. While this grants us fine-grained access to individual calls, it usually comes with substantial runtime overhead.
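
For comparison, a minimal LD_PRELOAD interceptor looks something like the sketch below (our own illustration, not part of KRF; the file name and build line are ours). It wraps libc’s read() wrapper rather than the syscall itself, which is exactly why static linking and direct syscalls escape it:

// shim.cpp -- hypothetical LD_PRELOAD interceptor for the libc read() wrapper.
// Build: g++ -shared -fPIC -o shim.so shim.cpp -ldl
// Use:   LD_PRELOAD=./shim.so cat /etc/hostname
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  /* for RTLD_NEXT */
#endif
#include <dlfcn.h>
#include <unistd.h>
#include <cstdio>

extern "C" ssize_t read(int fd, void *buf, size_t count) {
  // Look up the real libc wrapper the first time we are called.
  static auto real_read = reinterpret_cast<ssize_t (*)(int, void *, size_t)>(
      dlsym(RTLD_NEXT, "read"));
  std::fprintf(stderr, "read(fd=%d, count=%zu)\n", fd, count);
  return real_read(fd, buf, count);
}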

Injecting faults within kernelspace sidesteps the downsides of both of these approaches: it rewrites the actual syscalls directly instead of relying on the dynamic loader, and it adds virtually no runtime overhead (beyond checking to see whether a given syscall is one we’d like to fault).

What makes this any different from $other_blog_post_on_syscall_interception?

Other blog posts address the interception of syscalls, but many:

  • Grab the syscall table by parsing their kernel’s System.map, which can be unreliable (and is slower than the approach we give below).
  • Assume that the kernel exports sys_call_table and that extern void *sys_call_table will work (not true on Linux 2.6+).
  • Involve prodding large ranges of kernel memory, which is slow and probably dangerous.

Basically, we couldn’t find a recent (>2015) blog post that described a syscall interception process that we liked. So we developed our own.

Why not just use eBPF or kprobes?

eBPF can’t intercept syscalls. It can only record their parameters and return values.

The kprobes API might be able to perform interception from within a kernel module, although I haven’t come across a really good source of information about it online. In any case, the point here is to do it ourselves!

Will this work on $architecture?

For the most part, yes. You’ll need to make some adjustments to the write-unlocking macro for non-x86 platforms.

What’s a syscall?

A syscall, or system call, is a function [1] that exposes some kernel-managed resource (I/O, process control, networking, peripherals) to user-space processes. Any program that takes user input, communicates with other programs, changes files on disk, uses the system time, or contacts another device over a network (usually) does so via syscalls [2].

The core UNIX-y syscalls are fairly primitive: open(2), close(2), read(2), and write(2) for the vast majority of I/O; fork(2), kill(2), signal(2), exit(2), and wait(2) for process management; and so forth.

The socket management syscalls are mostly bolted on to the UNIX model: send(2) and recv(2) behave much like read(2) and write(2), but with additional transmission flags. ioctl(2) is the kernel’s garbage dump, overloaded to perform every conceivable operation on a file descriptor where no simpler means exists. Despite these additional complexities in usage, the underlying principle behind their usage (and interception) remains the same. If you’d like to dive all the way in, Filippo Valsorda maintains an excellent Linux syscall reference for x86 and x86_64.

Unlike regular function calls in user-space, syscalls are extraordinarily expensive: on x86 architectures, int 80h (or the more modern sysenter/syscall instructions) causes both the CPU and the kernel to execute slow interrupt-handling code paths as well as perform a privilege-context switch [3].

Why intercept syscalls?

For a few different reasons:

  • We’re interested in gathering statistics about a given syscall’s usage, beyond
    what eBPF or another instrumentation API could (easily) provide.
  • We’re interested in fault injection that can’t be avoided by static linking or manual syscall(3) invocations (our use case); see the short example after this list.
  • We’re feeling malicious, and we want to write a rootkit that’s hard to remove from user-space (and possibly even kernel-space, with a few tricks) [4].
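
As a quick illustration of that second point (the program below is our own example, not from KRF), a direct syscall(3) invocation never touches the libc wrapper, so an LD_PRELOAD shim never sees it, while a kernel-side interceptor still does:

// direct.cpp -- invoke read(2) through syscall(3), bypassing the libc wrapper.
// Try: echo hi | ./direct
#include <sys/syscall.h>
#include <unistd.h>
#include <cstdio>

int main() {
  char buf[16];
  long n = syscall(SYS_read, 0, buf, sizeof(buf));  // read from stdin (fd 0)
  std::printf("read %ld bytes\n", n);
  return 0;
}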

Why do I need fault injection?

Fault injection finds bugs in places that fuzzing and conventional unit testing often won’t:

  • NULL dereferences caused by assuming that particular functions never fail (are you sure you always check whether getcwd(2) succeeds?) Are you sure that you’re doing better than systemd?
  • Memory corruption caused by unexpectedly small buffers, or disclosure caused by unexpectedly large buffers
  • Integer over/underflow caused by invalid or unexpected values (are you sure you’re not making incorrect assumptions about stat(2)‘s atime/mtime/ctime fields?)

Getting started: Finding the syscall table

Internally, the Linux kernel stores syscalls within the syscall table, an array
of __NR_syscalls pointers. This table is defined as sys_call_table, but has not been directly exposed as a symbol (to kernel modules) since Linux 2.5.

First thing, we need to get the syscall table’s address, ideally without using the System.map file or scanning kernel memory for well-known addresses. Luckily for us, Linux provides a better interface than either of these: kallsyms_lookup_name.

This makes retrieving the syscall table as easy as:

static unsigned long *sys_call_table;

int init_module(void) {
  sys_call_table = (void *)kallsyms_lookup_name("sys_call_table");

  if (sys_call_table == NULL) {
    printk(KERN_ERR "Couldn't look up sys_call_table\n");
    return -1;
  }

  return 0;
}

Of course, this only works if your Linux kernel was compiled with CONFIG_KALLSYMS=1. Debian and Ubuntu provide this, but you may need to test in other distros. If your distro doesn’t enable kallsyms by default, consider using a VM for one that does (you weren’t going to test this code on your host, were you?).

Injecting our replacement syscalls

Now that we have the kernel’s syscall table, injecting our replacement should be as easy as:

#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/kallsyms.h>
#include <linux/syscalls.h>

MODULE_LICENSE("GPL"); /* kallsyms_lookup_name is exported to GPL modules only */

static unsigned long *sys_call_table;
static typeof(sys_read) *orig_read;

/* asmlinkage is important here -- the kernel expects syscall parameters to be
 * on the stack at this point, not inside registers.
 */
asmlinkage long phony_read(int fd, char __user *buf, size_t count) {
  printk(KERN_INFO "Intercepted read of fd=%d, %lu bytes\n", fd, count);

  return orig_read(fd, buf, count);
}

int init_module(void) {
  sys_call_table = (void *)kallsyms_lookup_name("sys_call_table");

  if (sys_call_table == NULL) {
    printk(KERN_ERR "Couldn't look up sys_call_table\n");
    return -1;
  }

  orig_read = (typeof(sys_read) *)sys_call_table[__NR_read];
  sys_call_table[__NR_read] = (void *)&phony_read;

  return 0;
}

void cleanup_module(void) {
  /* Don't forget to fix the syscall table on module unload, or you'll be in
   * for a nasty surprise!
   */
  sys_call_table[__NR_read] = (void *)orig_read;
}

…but it isn’t that easy, at least not on x86: sys_call_table is write-protected by the CPU itself. Attempting to modify it will cause a page fault (#PF) exception [5]. To get around this, we twiddle the 16th bit of the cr0 register, which controls the write-protect state:

#define CR0_WRITE_UNLOCK(x) \
  do { \
    write_cr0(read_cr0() & (~X86_CR0_WP)); \
    x; \
    write_cr0(read_cr0() | X86_CR0_WP); \
  } while (0)

Then, our insertions become a matter of:

CR0_WRITE_UNLOCK({
  sys_call_table[__NR_read] = (void *)&phony_read;
});

and:

CR0_WRITE_UNLOCK({
  sys_call_table[__NR_read] = (void *)orig_read;
});

and everything works as expected…almost.

We’ve assumed a single processor; there’s an SMP-related race condition bug in the way we twiddle cr0. If our kernel task were preempted immediately after disabling write-protect and placed onto another core with WP still enabled, we’d get a page fault instead of a successful memory write. The chances of this happening are pretty slim, but it doesn’t hurt to be careful by implementing a guard around the critical section:

#define CR0_WRITE_UNLOCK(x) \
  do { \
    unsigned long __cr0; \
    preempt_disable(); \
    __cr0 = read_cr0() & (~X86_CR0_WP); \
    BUG_ON(unlikely((__cr0 & X86_CR0_WP))); \
    write_cr0(__cr0); \
    x; \
    __cr0 = read_cr0() | X86_CR0_WP; \
    BUG_ON(unlikely(!(__cr0 & X86_CR0_WP))); \
    write_cr0(__cr0); \
    preempt_enable(); \
  } while (0)

(The astute will notice that this is almost identical to the “rare write” mechanism from PaX/grsecurity. This is not a coincidence: it’s based on it!)

What’s next?

The phony_read above just wraps the real sys_read and adds a printk, but we could just as easily have it inject a fault:

asmlinkage long phony_read(int fd, char __user *buf, size_t count) {
  return -ENOSYS;
}

…or a fault for a particular user:

asmlinkage long phony_read(int fd, char __user *buf, size_t count) {
  if (current_uid().val == 1005) {
    return -ENOSYS;
  } else {
    return orig_read(fd, buf, count);
  }
}

…or return bogus data:

asmlinkage long phony_read(int fd, char __user *buf, size_t count) {
  unsigned char kbuf[1024];

  memset(kbuf, 'A', sizeof(kbuf));
  copy_to_user(buf, kbuf, sizeof(kbuf));

  return sizeof(kbuf);
}

Syscalls happen under task context within the kernel, meaning that the
current task_struct is valid. Opportunities for poking through kernel structures abound!

Wrap up

This post covers the very basics of kernel-space syscall interception. To do anything really interesting (like precise fault injection or statistics beyond those provided by official introspection APIs), you’ll need to read a good kernel module programming guide [6] and do the legwork yourself.

Our new tool, KRF, does everything mentioned above and more: it can intercept and fault syscalls with per-executable precision, operate on an entire syscall “profile” (e.g., all syscalls that touch the filesystem or perform process scheduling), and can fault in real-time without breaking a sweat. Oh, and static linkage doesn’t bother it one bit: if your program makes any syscalls, KRF will happily fault them.

Other work

Outside of kprobes for kernel-space interception and LD_PRELOAD for user-space interception of wrappers, there are a few other clever tricks out there:

  • syscall_intercept is loaded through LD_PRELOAD like a normal wrapper interceptor, but actually uses capstone internally to disassemble (g)libc and instrument the syscalls that it makes. This only works on syscalls made by the libc wrappers, but it’s still pretty cool.
  • ptrace(2) can be used to instrument syscalls made by a child process, all within user-space. It comes with two considerable downsides, though: it can’t be used in conjunction with a debugger, and it returns (PTRACE_GETREGS) architecture-specific state on each syscall entry and exit. It’s also slow. Chris Wellons’s awesome blog post covers ptrace(2)‘s many abilities.

  1. More of a “service request” than a “function” in the ABI sense, but thinking about syscalls as a special class of functions is a serviceable-enough fabrication.
  2. The number of exceptions to this continues to grow, including user-space networking stacks and the Linux kernel’s vDSO for many frequently called syscalls, like time(2).
  3. No process context switch is necessary. Linux executes syscalls within the same underlying kernel task that the process belongs to. But a processor context switch does occur.
  4. I won’t detail this because it’s outside the scope of this post, but consider that init_module(2) and delete_module(2) are just normal syscalls.
  5. Sidenote: this is actually how CoW works on Linux. fork(2) write-protects the pre-duplicated process space, and the kernel waits for the corresponding page fault to tell it to copy a page to the child.
  6. This one’s over a decade old, but it covers the basics well. If you run into missing symbols or changed signatures, you should find the current equivalents with a quick search.

On Bounties and Boffins

Trying to make a living as a programmer participating in bug bounties is the same as convincing yourself that you’re good enough at Texas Hold ‘Em to quit your job. There’s data to back this up in Fixing a Hole: The Labor Market for Bugs, a chapter in New Solutions for Cybersecurity by MIT Press. Bug bounties follow a Pareto distribution, exhibiting the same characteristics as the distribution of wealth and other sociological phenomena. A select few, the boffins, report the largest number and highest quality bug reports and earn the lion’s share of bounties. The rest of the population fights over the boffins’ table scraps.

Fixing a Hole does not offer much to encourage software companies looking to improve their security through bug bounty programs. HackerOne, a company that hosts bug bounty programs, boasts that over 300,000 people “have signed up” to help organizations improve their security. It’s nice to think that you have 300,000 sets of eyes scrutinizing your code, but this number includes zombie accounts and people who never find a single bug. In reality, only an elite few are getting the work done and cashing in.

So why not just hire the boffins as consultants instead of gamifying your company’s security? The authors of Fixing a Hole argue that bug bounties should be designed to incentivize the elite. They say that making bounties invite-only lowers the operational cost of managing a tsunami of trivial, non-issue, and duplicate bugs. (Only 4-5% of bugs from Google, Facebook, and GitHub’s public-facing bounty programs were eligible for payment.) According to the authors, a small number of bounty hunters are indispensable and hold significant power to shape the market for bug bounty programs. Based on this, hiring security consultants under terms and conditions that can be controlled seems more practical.

The data undermining bug bounties

In Fixing a Hole, independent researchers funded by Facebook studied data from 61 HackerOne bounty programs over 23 months and one Facebook program over 45 months. The HackerOne data set includes bounty programs from Twitter, Square, Slack, Coinbase, Flash, and others. Usernames could be tracked across different programs in the HackerOne data set, but not to the Facebook data set.


Top: Participants, sales, and payments for Facebook (45 months) and HackerOne (23 months). Bottom: Population according to bug sales per person.

The prolific group at the top doesn’t limit itself to one program. It sweeps across multiple programs selling bugs across different technologies. What’s more, they also report the most critical bugs that are worth the most money. On average, those in the top 1% submitted bugs to nearly five different programs.

The authors included averages for sales per seller, earnings per seller, and price per transaction, but these values skew the analysis of uneven distributions, so they are disregarded in this review. (e.g., If 90 people end up earning a $10/hour wage, and 10 people earn $1,000/hour, the average wage is $109/hour, which isn’t characteristic of either population.)

Surprisingly, the variance of the data was not reported in Fixing a Hole. When populations are stratified, as the authors discover, the variance of the individual groups reveals some surprising insights. Consequently, a lot of information about the most interesting population, the top-performing 5%, is omitted.

We reproduced and overlaid some of the plots here to show the main theme in the report: there is a small group of prolific participants in bug bounty programs. The larger the data set, the more pronounced this trend is. For the entirety of the HackerOne and Facebook data sets, the 7% of participants with 10 or more bugs were paid for 1,622 bounties, while the other 93% of the population earned 2,523.


The most interesting group was arbitrarily lumped into the 10-or-more-bugs category. (See how the line trends upward at the end of the plot? You don’t want to do that. Plot all your data.) The top 1% (6 participants) in the HackerOne data landed 161 bounties and the top 1% (7 participants) in the Facebook data accounted for 274 bugs. That’s an average of 27 and 39 bugs per person, respectively! There may be stratification even among the top earners, but without knowledge of this distribution at the top (i.e., the variance), it remains a mystery.

As productive as the top 1% are, their earnings are equally depressing. The top seven participants in the Facebook data set averaged 0.87 bugs per month, earning an average yearly salary of $34,255; slightly less than what a pest control worker makes in Mississippi.

It gets worse for the top six earners from the HackerOne data set. Averaging 1.17 bugs per month, they earn a yearly average of $16,544. (Two outlying data points appear in the notes of Fixing a Hole mentioning that Google’s Chromium Rewards Program paid $60,000 for a single submission and one Facebook participant earned $183,000 in 21 months or a $104,000/year average.)

If your heart is breaking and you’re wondering whether this top 1% is wondering where their next meal is coming from, it’s more likely that these are security professionals working a side hustle. You can get really good at finding a few classes of critical bugs, then set scanners and alerts for when relevant bounty programs come online. You find your bugs, submit your proof, grab your dollars, then move on.

What’s better than bug bounties

Who are these top bug bounty performers, and what’s their background? What separates them from the rest? There’s no way to tell from the data, but the authors suggest three possibilities: improved skills over time, differences in raw talent, and professionals vs. hobbyists. (I believe that some top performers may work as teams under one account or are individuals who specialize in a few types of critical bugs and keep a watchful eye for low-hanging fruit when new programs launch.) Whoever they are, they’re indispensable and need to be incentivized to join bug bounty programs. For this, the authors offer three solutions:

  1. Keep the talent pool exclusive through invite-only programs that are closed to the public. This ensures that the most talented will not lose any bounties to lesser talent—even the low-hanging fruit.
  2. Escalate prices with successive valid submissions to deter people from straying to other programs.
  3. Offer grants to talented researchers, and pay them even if no bugs are found.

There’s not much difference between this advice and simply reaching out to a consulting firm for a code audit. Plus, an exclusive bug bounty program faces the chicken-or-egg paradox for the participants: How do you get an invite when you aren’t given the opportunity to establish a reputation? Also, there’s a lot less control and a lot more risk to holding bug bounties than most people are aware of.

The economics of bug bounty programs are turbulent, because there’s a competing offensive market in play. Exploitable zero-days can fetch up to millions of dollars from the right buyer. Anyone who discovers a critical bug can choose not to disclose it to the vendor and try to sell it elsewhere for much more. Fixing a Hole recommends that work should be directed toward incentivizing disclosure to vendors, but offers no practical details beyond that. There’s no evidence to suggest that researchers compare defensive and offensive bounty programs searching for the highest sale. Our opinion is that the decision to disclose or not disclose to vendors is mostly a moral one.

So who is currently incentivized to participate in bug bounty programs? Two groups: Citizens of economically disadvantaged countries, who can take advantage of US dollar exchange rates; and students who want to improve their security skills and learn the tools of the trade. After reading Fixing a Hole, I wasn’t convinced that the elite boffins are incentivized to participate in bug bounty programs. Perhaps they should wield some of their indispensable power to demand more from the market.

What do La Croix, octonions, and Second Life have in common?

This year for CSAW CTF, Trail of Bits contributed two cryptography problems. In the first problem, you could combine two bugs to break DSA much like the Playstation 3 firmware hackers. The other challenge, weirder and mathier, was split into two parts: one for the qualifiers, one in finals. This challenge, “Holywater,” was some of the most fun I’ve ever had making a CTF problem.

The qualifier challenge was a pretty textbook CTF cryptography challenge. Contestants began with a script and a text file of past outputs (preserved on Github), and had to recover a secret passphrase. Spoilers follow below the (extremely relevant) image, if you’d like to try it yourself.

Before diving into my own solution, I first want to commend Galhacktic Trendsetters for their excellent writeup (if any of you Trendsetters are reading this, get in touch, I’d love to mail you some swag). They covered the mathematical foundations of the attack with eloquence, a topic which I won’t get into in quite as much depth here. It’s also an excellent walkthrough of the thought process that lets a team start with nothing but a python file and a few hex strings and develop a working attack in less than 48 hours.

The challenge’s python file didn’t make that easy. It was called “lattice.py,” which might immediately suggest it has something to do with lattice cryptography. The method names included, among other things “isogeny,” “gaussian,” and “wobble.” Even the above writeup acknowledges some confusion about the terms’ meanings.

In reality, more or less every name in that file is a red herring. It implements HK17 key exchange, a proposed new post-quantum key exchange mechanism that was proven totally unworkable by Li, Liu, Pan, and Xie. The mathematical construction underlying HK17 is not lattices or isogenies, but octonions! Octonions are eight-dimensional hypercomplex numbers used in theoretical physics with a number of counterintuitive properties.

Perhaps the easiest way to understand octonions is by constructing them from scratch. Most readers will already be familiar with complex numbers, a two-dimensional superset of real numbers that is algebraically closed, a property that makes many kinds of math much easier. We construct the complex numbers using the Cayley-Dickson construction. Effectively, we double the number of dimensions and define multiplication much as we would in a direct sum (though not in exactly the same way).

We can repeat this process on complex numbers to yield a four-dimensional set of numbers known as the quaternions. Readers with graphics programming experience may be familiar, as quaternions allow for efficient computation of rotations in three-dimensional space, and are thus used by many graphics libraries. One more application of the Cayley-Dickson process takes us to eight dimensions; the octonions we use for our cryptosystem.

However, the Cayley-Dickson process cannot preserve every property of a number system we might want. Complex numbers, unlike their real counterparts, are not orderable (they can’t just be laid out end to end on a line). Quaternions are also unorderable, but unlike reals or complex numbers, have noncommutative multiplication! If a and b are quaternions, a * b and b * a can yield different numbers. This gradual loss of invariants continues with octonions, which aren’t even associative; if d, e, and f are octonions, (d * e) * f may well not equal d * (e * f).
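
To make the noncommutativity concrete, here is a tiny standalone sketch (ours, not part of the challenge code) that multiplies the quaternion basis elements i and j both ways:

// With basis 1, i, j, k: i*j = k, but j*i = -k.
#include <cstdio>

struct Quat { double w, x, y, z; };  // w + x*i + y*j + z*k

Quat mul(Quat a, Quat b) {           // Hamilton product
  return {
    a.w*b.w - a.x*b.x - a.y*b.y - a.z*b.z,
    a.w*b.x + a.x*b.w + a.y*b.z - a.z*b.y,
    a.w*b.y - a.x*b.z + a.y*b.w + a.z*b.x,
    a.w*b.z + a.x*b.y - a.y*b.x + a.z*b.w,
  };
}

int main() {
  Quat i{0, 1, 0, 0}, j{0, 0, 1, 0};
  Quat ij = mul(i, j), ji = mul(j, i);
  std::printf("i*j = (%g, %g, %g, %g)\n", ij.w, ij.x, ij.y, ij.z);  // (0, 0, 0,  1) = k
  std::printf("j*i = (%g, %g, %g, %g)\n", ji.w, ji.x, ji.y, ji.z);  // (0, 0, 0, -1) = -k
  return 0;
}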

“The real numbers are the dependable breadwinner of the family, the complete ordered field we all rely on. The complex numbers are a slightly flashier but still respectable younger brother: not ordered, but algebraically complete. The quaternions, being noncommutative, are the eccentric cousin who is shunned at important family gatherings. But the octonions are the crazy old uncle nobody lets out of the attic: they are nonassociative.” – John Baez

This is fairly gnarly, by the standards of numbers we choose to use, and explains to a degree why octonions aren’t used frequently (keep the attic door shut!). However, it also appears to allow for exactly the kind of hard problem we want when building a key exchange system! By working with polynomials over the octonions, the HK17 authors create a Diffie-Hellman style key exchange system they claim is quantum-hard.

However, in real life this system can be reliably broken by college students over the course of a weekend (nine teams solved it). Octonions’ odd multiplication rules end up making factoring far easier! With a few octonion identities and a high schooler’s knowledge of linear algebra, the cryptosystem reduces to four variables in four linear equations, and can be solved in O(1) by a python script that runs almost instantaneously.

An astute reader may pause here, with complete knowledge of the problem, and wonder “why was this challenge called Holywater?” The answer has nothing to do with octonion key exchange, and everything to do with my plans for the second half of the problem. The HK17 draft defined systems not just on octonions, but on unit quaternions (quaternions of magnitude one) as well! And, since quaternions are used by so many more programmers (as mentioned above, for graphics) that opens some interesting doors.

Specifically, it means we can now define our system in Linden Scripting Language, the official scripting language of Second Life. I’ve always been a bit of a programming language snob. For a while, I thought PHP was the absolute bottom of the barrel. Nothing could possibly be worse than that fractal of bad design, created largely by accident. Later in life I began working on blockchain security, and learned about the language Solidity. Suffice to say, my mind has since changed. Neither language, however, compares to the absolute tire fire that is Linden Scripting Language. Seriously, just read how you parse JSON.

LSL has a built-in quaternion type, and, while the “Differences Between Math’s Quaternions and LSL’s [Quaternions]” might seem foreboding, they are completely workable for our purposes. And, writing the whole challenge in LSL meant the competitors could have even more fun reverse engineering. However, I needed help to develop the Second Life scripts, design objects for them to be attached to, lease space in Second Life, and generally do the non-mathy parts of the whole project.

This is where the name comes in. The final part was called “Holywater 2: La Croix” specifically to entice Dan “Pamplemousse” Hlavenka, a friend of mine who loves both LSL and La Croix more than any other person I know of. He was willing to help with every part of the Second Life portion, but only if we made the challenge La Croix themed in every way we could, to spread the gospel to the next generation.

Competitors were greeted by the below acrostic, which, when the blanks are filled in, describes both a location in Second Life and a half-dozen La Croix flavors.

yeah i SMOKE WEED
                                P    P
                                E  M A
                               PA  AOM
                               UC BNRP
maps.Secondlife.com/secondlife/_______/23/233/1
     M                         E PRONE
     O                          PRR GM
     K                          EIY EO
     E                          AC   U
                                RO   S
     W                           T   S
     E                               E
     E
     D

Once teams arrive, they find themselves inside a giant can of La Croix, underwater (and with particle effects for carbonation). The location in Second Life was renamed “Mud City” after La Crosse, Wisconsin, home of the beverage. They are then presented with two glowing orbs, reading “Click here for everything you need” and “Click here to die instantly.”

These labels are accurate. That did not stop many people from repeatedly clicking the “die instantly” orb however, perhaps in an attempt at some sort of reincarnation-based cryptanalysis. The “everything you need” orb in contrast, gives the player an IBM Selectric typeball. Since unit quaternions describe rotations, we elected to encode the message by physically rotating one such typeball (as in normal Selectric operation), agreeing on rotations via HK17 key exchange in Second Life’s chat box. Users could see a script attached to the type ball that outlined the whole process, though again, some attempted other strategies (see below).

Nonetheless, the math was much the same, if harder to apply. This time only two teams (MIT and CMU) found the final flag (another clever La Croix reference), with the first blood winning a case of La Croix for each team member as a bonus on top of the (unusually high) 600 points (typically, challenges are 100 points if extremely easy, 500 points if extremely hard). By reversing the script and scraping the chat, the same process that worked for quals can work here. All that’s left is rotating your typeball and watching which letter is on the bottom.

Dan’s lease on the land in Second Life is now up, so the challenge is unfortunately closed to the public. Dan’s La Croix contributions ended up far more popular than I expected though, so perhaps this challenge won’t be the last to feature the beverage. This challenge is perhaps less applicable than the qualifier, but its lesson remains valid: if you’re securing a remote-control typewriter sending La Croix secrets in Second Life, don’t use HK17.

P.S.: While the last minute removal of csaw.tv meant this never saw the light of competition, you can enjoy this La Croix themed playlist Dan and I made for a special csaw.tv only accessible from Second Life.

Fuzzing Like It’s 1989

With 2019 a day away, let’s reflect on the past to see how we can improve. Yes, let’s take a long look back 30 years and reflect on the original fuzzing paper, An Empirical Study of the Reliability of UNIX Utilities, and its 1995 follow-up, Fuzz Revisited, by Barton P. Miller.

In this blog post, we are going to find bugs in modern versions of Ubuntu Linux using the exact same tools as described in the original fuzzing papers. You should read the original papers not only for context, but for their insight. They proved to be very prescient about the vulnerabilities and exploits that would plague code over the decade following their publication. Astute readers may notice the publication date for the original paper is 1990. Even more perceptive readers will observe the copyright date of the source code comments: 1989.

A Quick Review

For those of you who didn’t read the papers (you really should), this section provides a quick summary and some choice quotes.

The fuzz program works by generating random character streams, with the option to generate only printable, control, or non-printable characters. The program uses a seed to generate reproducible results, which is a useful feature modern fuzzers often lack. A set of scripts execute target programs and check for core dumps. Program hangs are detected manually. Adapters provide random input to interactive programs (1990 paper), network services (1995 paper), and graphical X programs (1995 paper).
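As a rough illustration of the approach (this is not Dr. Miller’s fuzz source; the names, flags, and sizes below are mine), a seeded generator that emits a random character stream for piping into a target utility might look like this:

/* Sketch of the core idea: emit a reproducible stream of random characters,
   optionally restricted to printable ones, suitable for piping into a target:
       ./mini-fuzz 12345 100000 p | some-utility ; test -e core && echo crashed
   (Names, flags, and sizes are illustrative, not from the original tool.) */
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
  if (argc < 3) {
    fprintf(stderr, "usage: %s seed nbytes [p]\n", argv[0]);
    return 1;
  }
  unsigned seed = (unsigned)strtoul(argv[1], NULL, 10);
  long n = strtol(argv[2], NULL, 10);
  int printable_only = (argc > 3 && argv[3][0] == 'p');

  srand(seed);                     /* same seed => same stream, so results reproduce */
  for (long i = 0; i < n; i++) {
    int c = rand() % 256;
    if (printable_only)
      c = ' ' + (c % 95);          /* map into the printable ASCII range */
    putchar(c);
  }
  return 0;
}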

The 1990 paper tests four different processor architectures (i386, CVAX, Sparc, 68020) and five operating systems (4.3BSD, SunOS, AIX, Xenix, Dynix). The 1995 paper has similar platform diversity. In the first paper, 25% to 33% of utilities fail, depending on the platform. In the 1995 follow-on, the numbers range from 9% to 33%, with GNU (on SunOS) and Linux being by far the least likely to crash.

The 1990 paper concludes that (1) programmers do not check array bounds or error codes, (2) macros make code hard to read and debug, and (3) C is very unsafe. The extremely unsafe gets function and C’s type system receive special mention. During testing, the authors discover format string vulnerabilities years before their widespread exploitation (see page 15). The paper concludes with a user survey asking about how often users fix or report bugs. Turns out reporting bugs was hard and there was little interest in fixing them.
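To make those language criticisms concrete, here is a deliberately unsafe C fragment of my own showing the two patterns the paper singles out; do not copy it into real code:

#include <stdio.h>

char *gets(char *s);   /* removed from C11; declared here so the example still compiles */

char name[16];

int main(void) {
  /* No bounds check: gets() happily writes past the 16-byte buffer. */
  gets(name);

  /* Format string bug: attacker-controlled input used as the format string
     lets %x/%s/%n read or write memory the programmer never intended. */
  printf(name);
  return 0;
}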

The 1995 paper mentions open source software and includes a discussion of why it may have fewer bugs. It also contains this choice quote:

When we examined the bugs that caused the failures, a distressing phenomenon emerged: many of the bugs discovered (approximately 40%) and reported in 1990 are still present in their exact form in 1995. …

The techniques used in this study are simple and mostly automatic. It is difficult to understand why a vendor would not partake of a free and easy source of reliability improvements.

It would take another 15-20 years for fuzz testing to become standard practice at large software development shops.

I also found this statement, written in 1990 to be prescient of things to come:

Often the terseness of the C programming style is carried to extremes; form is emphasized over correct function. The ability to overflow an input buffer is also a potential security hole, as shown by the recent Internet worm.

Testing Methodology

Thankfully, after 30 years, Dr. Miller still provides the full source code, scripts, and data to reproduce his results, a commendable practice that more researchers should emulate. The scripts and fuzzing code have aged surprisingly well. The scripts work as-is, and the fuzz tool required only minor changes to compile and run.

For these tests, we used the scripts and data found in the fuzz-1995-basic repository, because it includes the most modern list of applications to test. As per the top-level README, these are the same random inputs used for the original fuzzing tests. The results presented below for modern Linux used the exact same code and data as the original papers. The only thing changed is the master command list to reflect modern Linux utilities.

Updates for 30 Years of New Software

Obviously there have been some changes in Linux software packages in the past 30 years, although quite a few tested utilities still trace their lineage back several decades. Modern versions of the same software audited in the 1995 paper were tested, where possible. Some software was no longer available and had to be replaced. The justification for each replacement is as follows:

  • cfe → cc1: This is a C preprocessor, equivalent to the one used in the 1995 paper.
  • dbx → gdb: This is a debugger, equivalent to the one used in the 1995 paper.
  • ditroff → groff: ditroff is no longer available.
  • dtbl → gtbl: A GNU Troff equivalent of the old dtbl utility.
  • lisp → clisp: A Common Lisp implementation.
  • more → less: Less is more!
  • prolog → swipl: There were two choices for Prolog: SWI-Prolog and GNU Prolog. SWI-Prolog won out because it is an older and more comprehensive implementation.
  • awk → gawk: The GNU version of awk.
  • cc → gcc: The default C compiler.
  • compress → gzip: gzip is the spiritual successor of the old Unix compress.
  • lint → splint: A GPL-licensed rewrite of lint.
  • /bin/mail → /usr/bin/mail: This should be an equivalent utility at a different path.
  • f77 → fort77: There were two possible choices for a Fortran 77 compiler: GNU Fortran and Fort77. GNU Fortran is recommended for Fortran 90, while Fort77 is recommended for Fortran 77 support. The f2c program is actively maintained, and its changelog records entries dating back to 1989.

Results

The fuzzing methods of 1989 still find bugs in 2018. There has, however, been progress.

Measuring progress requires a baseline, and fortunately, there is a baseline for Linux utilities. While the original fuzzing paper from 1990 predates Linux, the 1995 re-test uses the same code to fuzz Linux utilities on the 1995 Slackware 2.1.0 distribution. The relevant results appear in Table 3 of the 1995 paper (pages 7-9). GNU/Linux held up very well against commercial competitors:

The failure rate of the utilities on the freely-distributed Linux version of UNIX was second-lowest at 9%.

Let’s examine how the Linux utilities of 2018 compare to the Linux utilities of 1995 using the fuzzing tools of 1989:

              Ubuntu 18.10  Ubuntu 18.04  Ubuntu 16.04  Ubuntu 14.04      Slackware 2.1.0
              (2018)        (2018)        (2016)        (2014)            (1995)
Crashes       1 (f77)       1 (f77)       2 (f77, ul)   2 (swipl, f77)    4 (ul, flex, indent, gdb)
Hangs         1 (spell)     1 (spell)     1 (spell)     2 (spell, units)  1 (ctags)
Total Tested  81            81            81            81                55
Crash/Hang %  2%            2%            4%            5%                9%

Amazingly, the Linux crash and hang count is still not zero, even for the latest Ubuntu release. The f2c program called by f77 triggers a segmentation fault, and the spell program hangs on two of the test inputs.

What Are The Bugs?

There are few enough bugs that I could manually investigate the root cause of several of them. Some results, like a bug in glibc, were surprising, while others, like an sprintf into a fixed-size buffer, were predictable.

The ul crash

The bug in ul is actually a bug in glibc. Specifically, it is an issue reported here and here (another person triggered it in ul) in 2016. According to the bug tracker it is still unfixed. Since the issue cannot be triggered on Ubuntu 18.04 and newer, the bug has been fixed at the distribution level. From the bug tracker comments, the core issue could be very serious.

f77 crash

The f77 program is provided by the fort77 package, which itself is a wrapper script around f2c, a Fortran 77-to-C source translator. Debugging f2c reveals the crash is in the errstr function when printing an overly long error message. The f2c source reveals that it uses sprintf to write a variable-length string into a fixed-size buffer:

errstr(const char *s, const char *t)
{
  char buff[100];
  sprintf(buff, s, t);  /* unbounded write of a variable-length message into a 100-byte buffer */
  err(buff);
}

This issue looks like it has been a part of f2c since its inception. The f2c program has existed since at least 1989, per the changelog. A Fortran 77 compiler was not tested on Linux in the 1995 fuzzing re-test, but had it been, this issue would have been found earlier.
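The obvious remedy is to bound the write. The following is only a sketch of that fix under my own naming, not a patch taken from f2c (whose errstr calls an internal err routine rather than fprintf):

#include <stdio.h>

/* Sketch of a bounded alternative: snprintf truncates an oversized message
   instead of overflowing the 100-byte stack buffer. Illustrative only. */
static void errstr_bounded(const char *s, const char *t)
{
  char buff[100];
  snprintf(buff, sizeof buff, s, t);  /* never writes more than 100 bytes */
  fprintf(stderr, "%s\n", buff);      /* f2c would call err(buff) here */
}

int main(void)
{
  errstr_bounded("bad label %s",
                 "an-absurdly-long-token-that-would-have-overflowed-the-old-buffer");
  return 0;
}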

The spell Hang

This is a great example of a classic deadlock. The spell program delegates spell checking to the ispell program via a pipe. The spell program reads text line by line and issues a blocking write of the entire line to ispell. The ispell program, however, will read at most BUFSIZ/2 bytes at a time (4096 bytes on my system) and then issue a blocking write to ensure the client received the spelling data processed thus far. Two different test inputs cause spell to write a line of more than 4096 characters to ispell, causing a deadlock: spell waits for ispell to consume the whole line, while ispell waits for spell to read the corrections it has produced so far.
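The pattern is easy to reproduce in miniature. This sketch is not the spell or ispell source; the buffer sizes are mine, chosen only to exceed the 64 KiB Linux pipe buffer:

/* Minimal sketch of the deadlock pattern described above: each side of a pipe
   pair issues a blocking write that the other side is not reading, so both
   fill their pipe buffers and wait on each other forever. */
#include <string.h>
#include <unistd.h>

#define LONG_LINE (256 * 1024)   /* larger than a default Linux pipe buffer (64 KiB) */

int main(void) {
  int to_child[2], to_parent[2];
  if (pipe(to_child) < 0 || pipe(to_parent) < 0)
    return 1;

  if (fork() == 0) {
    /* "ispell": read only a small chunk, then insist on writing its
       corrections back before reading the rest of the line. */
    char chunk[4096];
    static char reply[LONG_LINE];
    memset(reply, 'x', sizeof reply);
    read(to_child[0], chunk, sizeof chunk);
    write(to_parent[1], reply, sizeof reply);   /* blocks: parent isn't reading yet */
    _exit(0);
  }

  /* "spell": write one very long line in a single blocking call, and only
     plan to read the reply once the whole line has been sent. */
  static char line[LONG_LINE];
  memset(line, 'a', sizeof line);
  write(to_child[1], line, sizeof line);        /* blocks: child isn't reading the rest */

  char reply[4096];
  read(to_parent[0], reply, sizeof reply);      /* never reached: deadlock */
  return 0;
}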

The units Hang

Upon initial examination this appears to be an infinite loop condition. The hang looks to be in libreadline and not units, although newer versions of units do not suffer from the bug. The changelog indicates some input filtering was added, which may have inadvertently fixed this issue. While a thorough investigation of the cause and correction was out of scope for this blog post, there may still be a way to supply hanging input to libreadline.

The swipl Crash

For completeness, I wanted to include the swipl crash. However, I did not investigate it thoroughly, as the crash has long since been fixed and looks fairly benign. The crash is actually an assertion failure (i.e., a condition that should never occur has happened), triggered during character conversion:

[Thread 1] pl-fli.c:2495: codeToAtom: Assertion failed: chrcode >= 0
C-stack trace labeled "crash":
  [0] __assert_fail+0x41
  [1] PL_put_term+0x18e
  [2] PL_unify_text+0x1c4
…

It is never good when an application crashes, but at least in this case the program can tell something is amiss, and it fails early and loudly.

Conclusion

Fuzzing has been a simple and reliable way to find bugs in programs for the last 30 years. While fuzzing research is advancing rapidly, even the simplest attempts that reuse 30-year-old code are successful at identifying bugs in modern Linux utilities.

The original fuzzing papers do a great job at foretelling the dangers of C and the security issues it would cause for decades. They argue convincingly that C makes it too easy to write unsafe code and should be avoided if possible. More directly, the papers show that even naive fuzz testing still exposes bugs, and such testing should be incorporated as a standard software development practice. Sadly, this advice was not followed for decades.

I hope you have enjoyed this 30-year retrospective. Be on the lookout for the next installment of this series: Fuzzing In The Year 2000, which will investigate how Windows 10 applications compare against their Windows NT/2000 equivalents when faced with a Windows message fuzzer. I think that you can already guess the answer.

$10,000 research fellowships for underrepresented talent

The Trail of Bits SummerCon Fellowship program is now accepting applications from emerging security researchers with excellent project ideas. Fellows will explore their research topics with our guidance and then present their findings at SummerCon 2019. We will be reserving at least 50% of our funding for marginalized, female-identifying, transgender, and non-binary candidates. If you’re interested in applying, read on!

Why we’re doing this

Inclusion is a serious and persistent issue for the infosec industry. According to the 2017 (ISC)2 report on Women in Cybersecurity, only 11% of the cybersecurity workforce identify as women, a deficient proportion that hasn’t changed since 2013. Based on a 2018 (ISC)2 study, the issue is worse for women of color, who report facing pervasive discrimination, unexplained denial or delay in career advancement, exaggerated highlights of mistakes and errors, and tokenism.

Not only is this ethically objectionable, it makes no business sense. In 2012, McKinsey & Company found, with ‘startling consistency,’ that “for companies ranking in the top quartile of executive-board diversity, Returns on Equity (ROE) were 53 percent higher, on average, than they were for those in the bottom quartile. At the same time, Earnings Before Tax and Interest (EBTI) margins at the most diverse companies were 14 percent higher, on average, than those of the least diverse companies.”

The problem is particularly conspicuous at infosec conferences: a dearth of non-white non-male speakers, few female attendees, and pervasive reports of sexual discrimination. That’s why Trail of Bits and one of the longest-running hacker conferences, SummerCon, decided to collaborate to combat the issue. Through this fellowship, we’re sponsoring and mentoring emerging talent that might not otherwise get enough funding, mentorship, and exposure, and then shining a spotlight on their research.

Funding and mentorship to elevate your security research

The Trail of Bits SummerCon Fellowship provides awarded fellows with:

  • $10,000 grant to fund a six-month security research project
  • Dedicated research mentorship from a security engineer at Trail of Bits
  • An invitation to present findings at SummerCon 2019

50% of the program spots are reserved for marginalized people, including people of color, female-identifying, transgender, and non-binary candidates. Applicants of all genders, races, ethnicities, sexual orientations, ages, and abilities are encouraged to apply.

The research topics we’ll support

Applicants should bring a low-level programming or security research project that they’ve been wanting to tackle but have lacked the time or resources to pursue. They’ll have strong skills in low-level or systems programming, reverse engineering, program analysis (including dynamic binary instrumentation, symbolic execution, and abstract interpretation), or vulnerability analysis.

We’re especially interested in research ideas that align with our areas of expertise. That way, we can better support applicants. Think along the lines of:

How do I apply?

Apply here!

We’re accepting applications until January 15th. We’ll announce fellowship recipients in February.

Interested in applying? Go for it!

Submissions will be judged by a panel of experts from the SummerCon foundation, including Trail of Bits. Good luck!