Your tool works better than mine? Prove it.

No doubt, DARPA’s Cyber Grand Challenge (CGC) will go down in history for advancing the state of the art in a variety of fields: symbolic execution, binary translation, and dynamic instrumentation, to name a few. But there is one contribution that we believe has been overlooked so far, and that may prove to be the most useful of them all: the dataset of challenge binaries.

Until now, if you wanted to ‘play along at home,’ you would have had to install DECREE, a custom Linux-derived operating system that has no signals, no shared memory, no threads, and only seven system calls. Sound like a hassle? We thought so.
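
For reference, here is roughly what that seven-call interface looks like, paraphrased from DECREE's libcgc.h. Treat the exact type and parameter names below as an approximation of the real header rather than a verbatim copy:

```c
#include <stddef.h>

/* DECREE's libcgc.h defines its own descriptor-set and timeout types;
 * they are left opaque in this sketch. */
typedef struct cgc_fd_set  cgc_fd_set;
typedef struct cgc_timeval cgc_timeval;

void _terminate(unsigned int status);                      /* exit the process        */
int  transmit(int fd, const void *buf, size_t count,
              size_t *tx_bytes);                           /* write to a descriptor   */
int  receive(int fd, void *buf, size_t count,
             size_t *rx_bytes);                            /* read from a descriptor  */
int  fdwait(int nfds, cgc_fd_set *readfds, cgc_fd_set *writefds,
            const cgc_timeval *timeout, int *readyfds);    /* select()-style wait     */
int  allocate(size_t length, int is_executable, void **addr);  /* map memory          */
int  deallocate(void *addr, size_t length);                    /* unmap memory        */
int  random(void *buf, size_t count, size_t *rnd_bytes);   /* get random bytes        */
```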

One metric for all tools

Competitors in the Cyber Grand Challenge identify vulnerabilities in challenge binaries (CBs) written for DECREE on the 32-bit Intel x86 architecture. Since 2014, DARPA has released the source code for over 100 of these vulnerable programs. These programs were specifically designed with vulnerabilities that represent a wide variety of software flaws. They are more than simple test cases; they approximate real software with enough complexity to stress both manual and automated vulnerability discovery.

If the CBs become widely adopted as benchmarks, they could change the way we solve security problems, much as standardized benchmarks and regular competitions spurred rapid progress in the SAT-solving and machine-learning communities. The challenge binaries, valid test inputs, and sample vulnerabilities constitute an industry-standard benchmark suite for evaluating:

  • Bug-finding tools
  • Program-analysis tools (e.g. automated test coverage generation, value range analysis)
  • Patching strategies
  • Exploit mitigations

The CBs are a more robust set of tests than previous approaches to measuring the quality of software analysis tools (e.g. the SAMATE tests, the NSA Juliet tests, or the STONESOUP test cases). First, the CBs are complex programs like games, content management systems, and image processors, rather than mere snippets of vulnerable code. To be effective, analysis tools must process real software with a fairly low bug density, not contrived fragments built around a single flaw. Second, unlike open source projects with added bugs, we have very high confidence that all the bugs in the CBs have been found, so analysis tools can be compared against an objective standard. Finally, the CBs come with extensive functionality tests, triggers for the introduced bugs, patches, and performance monitoring tools, enabling benchmarking of patching tools and bug mitigation strategies.

Creating an industry-standard benchmark suite will solve several problems that hamper the development of future program analysis tools:

First, the absence of standardized benchmarks prevents an objective determination of which tools are “best.” Real applications don’t come with triggers for complex bugs, nor an exhaustive list of those bugs. The CBs provide metrics for comparison, such as:

  • Number of bugs found
  • Number of bugs found per unit of time or memory
  • Categories of bugs found and missed
  • Variance in performance across configuration options

Next, which mitigations are most effective? CBs come with inputs that stress original program functionality, inputs that check for the presence of known bugs, and performance measuring tools. These allow us to explore questions like:

  • What is the potential effectiveness and performance impact of various bug mitigation strategies (e.g. Control Flow Integrity, Code Pointer Integrity, stack cookies)?
  • How much slower does the resulting program run?
  • How good is a mitigation compared to a real patch?
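
To make that last question concrete, here is a hypothetical example (not taken from any actual CB) of the difference between a mitigation and a patch: a stack cookie turns an overflow into a clean abort, while a real patch removes the memory corruption so the program keeps serving its functionality tests.

```c
/* Hypothetical illustration, not code from an actual CB. */
#include <stdio.h>
#include <string.h>

/* Buggy version: no bounds check. Built with -fstack-protector-all,
 * an over-long name corrupts the stack cookie and the process aborts:
 * the overflow can't be exploited, but the service still goes down. */
void greet_buggy(const char *name) {
    char buf[16];
    strcpy(buf, name);                 /* overflows if name >= 16 bytes */
    printf("Hello, %s\n", buf);
}

/* Patched version: the bounds check removes the memory corruption, so
 * the program keeps running (here, by truncating over-long input). */
void greet_patched(const char *name) {
    char buf[16];
    strncpy(buf, name, sizeof(buf) - 1);
    buf[sizeof(buf) - 1] = '\0';
    printf("Hello, %s\n", buf);
}

int main(void) {
    greet_patched("a name long enough to overflow the buggy version");
    return 0;
}
```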

Play Along At Home

The teams competing in the CGC have had years to hone and adapt their bug-finding tools to the peculiarities of DECREE. But the real world doesn’t run on DECREE; it runs on Windows, Mac OS X, and Linux. We believe that research should be guided by real-world challenges and parameters. So, we decided to port* the challenge binaries to run in those environments.

It took us several attempts to find a porting approach that minimized code changes while preserving as much of the original code as possible across platforms. The eventual solution was fairly straightforward: build each compilation unit without standard include files (as all CBs are statically linked), implement the CGC system calls using their native equivalents, and make various minor fixes so the code is compatible with more compilers and standard libraries.
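
As a rough illustration of the second step, here is a minimal sketch of how two of the CGC system calls, transmit and receive, might be emulated with POSIX write and read. The cgc_-prefixed names are illustrative; the real shims also cover the remaining calls and map error codes more carefully.

```c
/* Minimal sketch of a POSIX shim for two CGC system calls. */
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/* transmit(): write `count` bytes from `buf` to `fd`, reporting the
 * number of bytes actually sent via `tx_bytes`. */
int cgc_transmit(int fd, const void *buf, size_t count, size_t *tx_bytes) {
    ssize_t n = write(fd, buf, count);
    if (n < 0)
        return errno;          /* DECREE returns a positive error code */
    if (tx_bytes)
        *tx_bytes = (size_t)n;
    return 0;
}

/* receive(): read up to `count` bytes from `fd` into `buf`, reporting
 * the number of bytes actually received via `rx_bytes`. */
int cgc_receive(int fd, void *buf, size_t count, size_t *rx_bytes) {
    ssize_t n = read(fd, buf, count);
    if (n < 0)
        return errno;
    if (rx_bytes)
        *rx_bytes = (size_t)n;
    return 0;
}
```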

We’re excited about the potential of multi-platform CBs on several fronts:

  • Since there’s no need to set up a virtual machine just for DECREE, you can run the CBs on the machine you already have.
  • With that hurdle out of the way, we all now have an industry-standard benchmark for evaluating program analysis tools. We can make comparisons such as:
    • How good are the CGC tools compared to existing program analysis and bug-finding tools?
    • When a new tool is released, how does it stack up against the current best?
    • Do static analysis tools that work with source code find more bugs than dynamic analysis tools that work with binaries?
    • Are tools written for Mac OS X better than tools written for Linux, and are they better than tools written for Windows?
  • When researchers open source their code, we can evaluate how well their findings work for a particular OS or compiler.

Before you watch the competitors’ Cyber Reasoning Systems (CRSs) duke it out, explore the challenges that the robots will attempt to solve in an environment you’re familiar with.

Get the CGC’s Challenge Binaries for the most common operating systems.

* Big thanks to our interns, Kareem El-Faramawi and Loren Maggiore, for doing the porting, and to Artem, Peter, and Ryan for their support.
