A fuzzer and a symbolic executor walk into a cloud

Finding bugs in programs is hard. Automating the process is even harder. We tackled the harder problem and produced two production-quality bug-finding systems: GRR, a high-throughput fuzzer, and PySymEmu (PSE), a binary symbolic executor with support for concrete inputs.

From afar, fuzzing is a dumb, brute-force method that works surprisingly well, and symbolic execution is a sophisticated approach, involving theorem provers that decide whether or not a program is “correct.” Through this lens, GRR is the brawn while PSE is the brains. There isn’t a dichotomy though — these tools are complementary, and we use PSE to seed GRR and vice versa.

Let’s dive in and see the challenges we faced when designing and building GRR and PSE.

GRR, the fastest fuzzer around

GRR is a high speed, full-system emulator that we use to fuzz program binaries. A fuzzing “campaign” involves executing a program thousands or millions of times, each time with a different input. The hope is that spamming a program with an overwhelming number of inputs will result in triggering a bug that crashes the program.

Note: GRR is pronounced with two fists held in the air

During DARPA’s Cyber Grand Challenge, we went web-scale and performed tens of billions of input mutations and program executions — in only 24 hours! Below are the challenges we faced when making this fuzzer, and how we solved those problems.

  1. Throughput. Typically, program fuzzing is split into discrete steps. A sample input is given to an input “mutator” which produces input variants. In turn, each variant is separately tested against the program in the hopes that the program will crash or execute new code. GRR internalizes these steps, and while doing so, completely eliminates disk I/O and program analysis ramp-up times, which represent a significant portion of where time is spent during a fuzzing campaign with other common tools.
  2. Transparency. Transparency requires that the program being fuzzed cannot observe or interfere with GRR. GRR achieves transparency via perfect isolation. GRR can “host” multiple 32-bit x86 processes in memory within its 64-bit address space. The instructions of each hosted process are dynamically rewritten as they execute, guaranteeing safety while maintaining operational and behavioral transparency.
  3. Reproducibility. GRR emulates both the CPU architecture and the operating system, thereby eliminating sources of non-determinism. GRR records program executions, enabling any execution to be faithfully replayed. GRR’s strong determinism and isolation guarantees let us combine the strengths of GRR with the sophistication of PSE. GRR can snapshot a running program, enabling PSE to jump-start symbolic execution from deep within a given program execution.

PySymEmu, the PhD of binary symbolic execution

Symbolic execution as a subject is hard to penetrate. Symbolic executors “reason about” every path through a program, there’s a theorem prover in there somewhere, and something something… bugs fall out the other end.

At a high level, PySymEmu (PSE) is a special kind of CPU emulator: it has a software implementation for almost every hardware instruction. When PSE symbolically executes a binary, what it really does is perform all the ins-and-outs that the hardware would do if the CPU itself was executing the code.

frankenpse

PSE explores the relationship between the life and death of programs in an unorthodox scientific experiment

CPU instructions operate on registers and memory. Registers are names for super-fast but small data storage units. Typically, registers hold four to eight bytes of data. Memory on the other hand can be huge; for a 32-bit program, up to 4 GiB of memory can be addressed. PSE’s instruction simulators operate on registers and memory too, but they can do more than just store “raw” bytes — they can store expressions.

A program that consumes some input will generally do the same thing every time it executes. This happens because that “concrete” input will trigger the same conditions in the code, and cause the same loops to merry-go-round. PSE operates on symbolic input bytes: free variables that can initially take on any value. A fully symbolic input can be any input and therefore represents all inputs. As PSE emulates the CPU, if-then-else conditions impose constraints on the originally unconstrained input symbols. An if-then-else condition that asks “is input byte B less than 10” will constrain the symbol for B to be in the range [0, 10) along the true path, and to be in the range [10, 256) along the false path.

If-then-elses are like forks in the road when executing a program. At each such fork, PSE will ask its theorem prover: “if I follow the path down one of the prongs of the fork, then are there still inputs that satisfy the additional constraints imposed by that path?” PSE will follow each yay path separately, and ignore the nays.

So, what challenges did we face when creating and extending PSE?

  1. Comprehensiveness. Arbitrary program binaries can exercise any one of thousands of the instructions available to x86 CPUs. PSE implements simulation functions for hundreds of x86 instructions. PSE falls back onto a custom, single-instruction “micro-executor” in those cases where an instruction emulation is not or cannot be provided. In practice, this setup enables PSE to comprehensively emulate the entire CPU.
  2. Scale. Symbolic executors try to follow all feasible paths through a program by forking at every if-then-else condition, and constraining the symbols one way or another along each path. In practice, there are an exponential number of possible paths through a program. PSE handles the scalability problem by selecting the best path to execute for the given execution goal, and by distributing the program state space exploration process across multiple machines.
  3. Memory. Symbolic execution produces expressions representing simple operations like adding two symbolic numbers together, or constraining the possible values of a symbol down one path of an if-then-else code block. PSE gracefully handles the case where addresses pointing into memory are symbolic. Memory accessed via a symbolic address can potentially point anywhere — even point to “good” and “bad” (i.e. unmapped) memory.
  4. Extensibility. PSE is written using the Python programming language, which makes it easy to hack on. However, modifying a symbolic executor can be challenging — it can be hard to know where to make a change, and how to get the right visibility into the data that will make the change a success. PSE includes smart extension points that we’ve successfully used for supporting concolic execution and exploit generation.

Measuring excellence

So how do GRR and PSE compare to the best publicly available tools?

GRR

GRR is both a dynamic binary translator and fuzzer, and so it’s apt to compare it to AFLPIN, a hybrid of the AFL fuzzer and Intel’s PIN dynamic binary translator. During the Cyber Grand Challenge, DARPA helpfully provided a tutorial on how to use PIN with DECREE binaries. At the time, we benchmarked PIN and found that, before we even started optimizing GRR, it was already twice as fast as PIN!

The more important comparison metric is in terms of bug-finding. AFL’s mutation engine is smart and effective, especially in terms of how it chooses the next input to mutate. GRR internalizes Radamsa, another too-smart mutation engine, as one of its many input mutators. Eventually we may also integrate AFL’s mutators. During the qualifying event, GRR went face-to-face with AFL, which was integrated into the Driller bug-finding system. Our combination of GRR+PSE found more bugs. Beyond this one data point, a head-to-head comparison would be challenging and time-consuming.

PySymEmu

PSE can be most readily compared with KLEE, a symbolic executor of LLVM bitcode, or the angr binary analysis platform. LLVM bitcode is a far cry from x86 instructions, so it’s an apples-to-oranges comparison. Luckily we have McSema, our open-source and actively maintained x86-to-LLVM bitcode translator. Our experiences with KLEE have been mostly negative; it’s hard to use, hard to hack on, and it only works well on bitcode produced by the Clang compiler.

Angr uses a customized version of the Valgrind VEX intermediate representation. Using VEX enables angr to work on many different platforms and architectures. Many of the angr examples involve reverse engineering CTF challenges instead of exploitation challenges. These RE problems often require manual intervention or state knowledge to proceed. PSE is designed to try to crash the program at every possible emulated instruction. For example PSE will use its knowledge of symbolic memory to access any possible invalid array-like memory accesses instead of just trying to solve for reaching unconstrained paths. During the qualifying event, angr went face-to-face with GRR+PSE and we found more bugs. Since then, we have improved PSE to support user interaction, concrete and concolic execution, and taint tracking.

I’ll be back!

Automating the discovery of bugs in real programs is hard. We tackled this challenge by developing two production-quality bug-finding tools: GRR and PySymEmu.

GRR and PySymEmu have been a topic of discussion in recent presentations about our CRS, and we suspect that these tools may be seen again in the near future.

26 thoughts on “A fuzzer and a symbolic executor walk into a cloud

  1. Shellphish here!

    Interesting article! I’d like to note that angr does fully support the type of analysis that you describe for PySymEmu (i.e., checking for invalid array/memory accesses, symbolic or otherwise); we’re just lacking examples on it.

    Regarding going head to head in the CQE, keep in mind that you were actually running something closer to Driller (symbolic execution *with* fuzzing), rather than pure PySymEmu, and we were running some pretty horrible mess that we clobbered together in 3 weeks. When we finished Driller, it did beat your CQE numbers, though to be fair, that was a few months after the CQE itself😉.

    As a defense of AFL (which we obviously take no credit for — that’s lcamtuf’s genius), we were using it pretty horribly in the CQE. Once we started using it properly, it actually found more crashes on its own than anyone’s full systems except for ForAllSecure’s [1]. But again, to be fair, that was a few months after the CQE.

    At DEFCON, we’ll release a ton of stuff that we’ve been pushing for the CFE, including some significant angr speedups (such as support for *fast* concrete and concolic execution), Driller itself, and exploitation tools built using angr. I’d be really interested in hearing your thoughts on it once it’s out, and how you feel it measures up against what you’re familiar with.

    Finally, are you planning on open-sourcing PySymEmu (I’m assuming it’s not this project: https://github.com/feliam/pysymemu), or use it more as a reference point in talks and discussions? I think more open source tools will always be useful to the community!

    [1] http://www.ieee-security.org/TC/SP2016/papers/0824a138.pdf

    • Hey Yan,

      We don’t have plans to open source PSE right now but we might release it in the future. You correctly note the origins of PSE, Felipe works with us and our team has privately developed his open source code for the last two years into what it is today (FrankenPSE). We have other forks of it for concolic execution, even an IDA integration, that we use on our client-driven software audits. It’s come a long way as compared to the open source code.

      I agree the performance comparison is messy, but as you mentioned about your own tools, our performance has also improved significantly since the qualifying event. We’ll definitely take a look at the new features of angr and Driller when they come out. Several people on our team have played with angr for CTF challenges and are familiar with its capabilities. Good luck in the final competition!

      -Dan

  2. This was a nice read! Very interesting although I would like to see more inner workings😉.

    I have one question though, you were trying LLVM at first with McSema as translator and KLEE as symbolic execution engine. I would like to know if you are actually using LLVM’s IR more internally?
    From your post, it seems that LLVM-IR would be totally unsuitable for these kind of doings, can you go more deeper on this? (I wasn’t sure if LLVM was the actual issue here or just KLEE.)
    I am asking this because I am currently also working on somekind of deobfucator based on the LLVM-IR which might use symbolic execution. Do you think this would be a good approach according your experiences?

    • Hi Nick,

      Yes we were using KLEE on the bitcode produced by lifting x86 machine code to to LLVM bitcode using McSema. We do other things internally with LLVM. We have a use-after-free detector called PointsTo; I have another blog post here describing it.

      Regarding de-obfuscation, I am not sure. I have seen things go both ways. We have some blog posts about a product, MAST, where we obfuscate LLVM bitcode as a way of defending mobile apps against reverse engineering. Perhaps a more detailed explanation of your current approach would help me make an objective suggestion. Drop by our Empire Hacking slack channel (https://empireslacking.herokuapp.com/) and ping me, my username is pag.

  3. pysymemu on git doesn’t even works ! the example in the read me always uses a zero stdin and doesn’t generate the needed paths .. 😦
    are you going to upload the modified AFL with the new mutators ? and when you will fix that github ?!

    • Hi LaBBa,

      You’ve correctly identified that the public PySymEmu repository is not what this blog post describes. The version described in this blog post is based on the public repository.

      We don’t have a modified AFL. GRR uses Radamsa and also includes some other simple mutators (bit flips, etc.). In the future we hope to integrate AFL; it’s an excellent piece of software.

    • Hi Sanjay,

      We do not have any stats at this time. An apples-to-apples comparison would be challenging because of differences between how GRR and AFLPIN produce and consume a program’s input.

  4. Pingback: Angr management: first steps and limitations | CyberSmashup

  5. Peter,

    More technical details on either tool would be great, but I’ll ask something maybe a little higher level. I knew you guys were cross-seeding from fuzzing and symbolic execution — did you find any “secret sweet spot” or was it a more a matter of both tools getting to see all interesting (for some value of interesting) products of the other tool, or even just sharing an common seed pool?

    • Hi Alex, This is a great question. I think I found a few sweet spots, and I’ll describe them below in various levels of technical depth. You are correct about both points: both tools getting to see more interesting inputs, and cross seeding via sharing the same pool. In the latter case, our shared pool was the “minset”, a small set of program inputs that maximised coverage. Getting to it then…

      1) Uniformity — make all tools operate on the same program format (in this case, GRR snapshots). This meant that PSE didn’t really know that it was jumping into the middle of a program vs. starting at the beginning.

      2) Don’t be too clever. I mentioned how PSE could jump into the middle of a program, thereby starting symbolic execution “deep” within a program. The big benefit was that I didn’t need to think hard about all of them cooperating on exploring some shared program state space. Keeping the lens of a bug-fining being a trivially parallelisable problem helped me move onto more important problems.

      3) Diversity. You noted that we use a shared input pool. Our mechanism for deciding entry into this pool is code coverage. From this pool I’d select inputs and places within those inputs for snapshotting and sending to PSE. GRR would only fuzz what was in this minset. The result is a lot of cross-pollination, but in the long run not a huge amount of diversity.

      I achieved diversity by making the minset a moving target: compressing it further over time, and switching the coverage metric over time. A hand-wavy theory for this would be that you can have two coverage metrics that measure different execution features, and compose them (i.e. identify inputs with both features) by using the first metric for a bit, then compressing the minset with the second metric, and finally switching to the second metric. I have a big spreadsheet where I evaluate pairs of metrics within a short time frame. For paid engagements that can run for days or weeks, I usually set up a big chain of metrics, where each runs for a while, then a switch happens. In general, I start with a “bad” metric (high collision rate, not much precision), which keeps the initial minset small, but shakes out low-hanging fruit (bugs) because you get more GRR fuzzers pounding away at this small set.

      4) Don’t go distributed until you can utilise a single machine. The design of GRR let me easily saturate cores, but doing so with a symbolic executor can be more challenging. Sure, you can make one multi-threaded, but then you’re stuck thinking about concurrency control, and you miss out on the diversity you get from jumping into the middle of programs. Having a moving minset and using its entries for snapshot candidates let me run more PSE instances, all of which could do “different” thing. Once the CRS ramped up, it was easy to saturate remaining cores with symbolic executors, and be confident that they were all doing different but useful things.

      In summary, a whole bunch of little decisions let me dumbly share progress in a “smart” way, and never let a cycle go to waste😉

  6. Pingback: Work For Us: Fall and Winter Internship Opportunities – Trail of Bits Blog

  7. Peter,

    Do your minsets just hold a set of inputs that achieved coverage (for a moving target set, I have to think about that) raw, or are the inputs themselves minimized with respect to the (moving) targets? With the seeded version of KLEE, we got decent improvements shrinking the inputs themselves, and I think AFL does some input-shrinking. Input shrinking with moving targets sounds like something to explore.

    • Hi Alex,

      Inputs in our minset are raw; I do not try to minimize the size of an input whilst maintaining coverage.

      Notably, not all inputs in our minset are complete. 99% of what PSE spits out are incomplete inputs: they don’t cause the program to crash or exit/terminate. If you were to run some of these inputs directly, then the program may hang waiting for more input — this is not the case when run through GRR (which is used for all the things). In general, I don’t want to wait for crashes or terminating inputs. If there’s something significant at the end of an input that will help trigger a bug then okay, eventually PSE will get there. In the meantime, GRR might get there first based on partial program inputs.

      Another important thing is the mechanism by which the minset is compressed. In this case, I mean reducing the number of inputs in minset while maintaining the coverage. A smaller minset means more fuzzing resources brought to bear for each entry in the minset. The idea here is to re-evaluate the inputs in the minset in reverse. Pretend that the last entry in the minset is the first input you’ve ever seen. It sets your baseline of coverage, then move on from there. In the end, you can usually “kick out” redundant entries.

      One thing not mentioned in the blog post is that another simple mechanism for combining the results of PSE and GRR are to take two inputs in the minset and splice them together. This behaves like “random restart” seen through the lens of hill-climbing.

      • I wonder how much you could get from taking either complete or partial inputs and shrinking them with respect to whatever they are currently contributing (why they are in the current minset). On the one hand, it’s possibly slow, on the other hand, running reduction is itself just another fuzzing algorithm. In http://www.cs.cmu.edu/~agroce/issta14.pdf KLEE (the zesti version) could get quite a bit more traction on novel branches if we took part of the symbolic execution time and instead spent it reducing the fuzzer-produced input with respect to its coverage. That was with respect to total coverage, not a contribution to a minset, but we’re looking at the latter idea too, as a follow on to an ASE paper this year.

      • (the whole two-stage aspect is a bit off, in that paper: no reason not to just set up general feedback in all directions, really, though optimal use of your compute power presumably weights things at different times, I’d guess)

      • For the most part, we are always running fuzzers and symbolic executors concurrently. For fuzzers this makes sense. I chose a simplistic but effective approach to deciding where to snapshot within a program. It’s a variant of choosing common prefixes that minimizes the prefix size while maximizing the number of inputs that share that prefix. Chosen prefixes would themselves share prefixes, but have unique suffixes. Using delta debugging for finding the end of an information-dense prefix could be an interesting jump-in point for unconstrained symbolic execution.

        Time for me to go, but feel free to email me at @ , or ping me (pag) on our empire hacking slack. I like talking about this stuff.

    • One point to clarify: changing the coverage metric happens at a compression point. I compress the minset at regular intervals, and specify the amount of time a particular coverage metric should be used. When one coverage metric expires, the next metric is used when the minset is next compressed, and then things move on to the new coverage metric.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s