Of the nearly 200 papers on software fuzzing that have been published in the last three years, most of them—even some from high-impact conferences—are academic clamor. Fuzzing research suffers from inconsistent and subjective benchmarks, which keeps this potent field in a state of arrested development. We’d like to help explain why this has happened and offer some guidance for how to consume fuzzing publications.
Researchers play a high-stakes game in their pursuit of building the next generation of fuzzing tools. A major breakthrough can render obsolete what was once the state of the art. Nobody is eager to use the world’s second-greatest fuzzer. As a result, researchers must somehow demonstrate how their work surpasses the state of the art in finding bugs.
The problem is trying to objectively test the efficacy of fuzzers. There isn’t a set of universally accepted benchmarks that is statistically rigorous, reliable, and reproducible. Inconsistent fuzzing measurements persist throughout the literature and prevent meaningful meta-analysis. That was the motivation behind the paper “Evaluating Fuzz Testing,” to be presented by Andrew Ruef at the 2018 SIGSAC Conference on Computer and Communications Security in Toronto.
“Evaluating Fuzz Testing” offers a comprehensive set of best practices for constructing a dependable frame of reference for comparing fuzzing tools. Whether you’re submitting your fuzzing research for publication, peer-reviewing others’ submissions, or trying to decide which tool to use in practice, the recommendations from Ruef and his colleagues establish an objective lens for evaluating fuzzers. In case you don’t have time to read the whole paper, we’re summarizing the criteria we recommend you use when evaluating the performance claims in fuzzing research.
Quick Checklist for Benchmarking Fuzzers
- Compare new research against popular baseline tools like american fuzzy lop (AFL), Basic Fuzzing Framework (BFF), libfuzzer, Radamsa, and Zzuf. In lieu of a common benchmark, reviewing research about these well-accepted tools will prepare you to ascertain the quality of other fuzzing research. The authors note that “there is a real need for a solid, independently defined benchmark suite, e.g., a DaCapo or SPEC10 for fuzz testing.” We agree.
- Outputs should be easy to read and compare. Ultimately, this is about finding the fuzzer that delivers the best results. “Best” is subjective (at least until that common benchmark comes along), but evaluators’ work will be easier if they can interpret fuzzers’ results easily. As Ruef and his colleagues put it: “Clear knowledge of ground truth avoids overcounting inputs that correspond to the same bug, and allows for assessing a tool’s false positives and false negatives.”
- Account for differences in heuristics. Heuristics influence how fuzzers start and pursue their searches through code paths. If two fuzzers’ heuristics lead them to different targets, then the fuzzers will produce different results. Evaluators have to account for that influence in order to compare one fuzzer’s results against another’s.
- Targets representative datasets with distinguishable bugs like the Cyber Grand Challenge binaries, LAVA-M, and Google’s fuzzer test suite and native programs like nm, objdump, cxxfilt, gif2png, and FFmpeg. Again for lack of a common benchmark suite, fuzzer evaluators should look for research that used one of the above datasets (or, better yet, one native and one synthetic). Doing so can encourage researchers to ‘fuzz to the test,’ which doesn’t do anyone any good. Nevertheless, these datasets provide some basis for comparison. Related: As we put more effort into fuzzers, we should invest in refreshing datasets for their evaluation, too.
- Fuzzers are configured to begin in similar and comparable initial states. If two fuzzers’ configuration parameters reflect different priorities, then the fuzzers will yield different results. It isn’t realistic to expect that all researchers will use the same configuration parameters, but it’s quite reasonable to expect that those parameters are specified in their research.
- Timeout values are at least 24 hours. Of the 32 papers the authors reviewed, 11 capped timeouts at “less than 5 or 6 hours.” The results of their own tests of AFL and AFLFast varied by the length of the timeout: “When using a non-empty seed set on nm, AFL outperformed AFLFast at 6 hours, with statistical significance, but after 24 hours the trend reversed.” If all fuzzing researchers allotted the same time period—24 hours—for their fuzz runs, then evaluators would have one less variable to account for.
- Consistent definitions of distinct crashes throughout the experiment. Since there’s some disagreement in the profession about how to categorize unique crashes and bugs (by the input or by the bug triggered), evaluators need to seek the researchers’ definition in order to make a comparison. That said, beware the authors’ conclusion: “experiments we carried out showed that the heuristics [for de-duplicating or triaging crashes] can dramatically over-count the number of bugs, and indeed may suppress bugs by wrongly grouping crashing inputs.”
- Consistent input seed files. The authors found that fuzzers’ “performance can vary substantially depending on what seeds are used. In particular, two different non-empty inputs need not produce similar performance, and the empty seed can work better than one might expect.” Somewhat surprisingly, many of the 32 papers evaluated did not carefully consider the impact of seed choices on algorithmic improvements.
- At least 30 runs per configuration with variance measured. With that many runs, anomalies can be ignored. Don’t compare the results of single runs (“as nearly ⅔ of the examined papers seem to,” the authors report!). Instead, look for research that not only performed multiple runs, but also used statistical tests to account for variances in those tests’ performance.
- Prefer bugs discovered over code coverage metrics. We at Trail of Bits believe that our work should have a practical impact. Though code coverage is an important criterion in choosing a fuzzer, this is about finding and fixing bugs. Evaluators of fuzz research should measure performance in terms of known bugs, first and foremost.
Despite how obvious or simple these recommendations may seem, the authors reviewed 32 high-quality publications on fuzzing and did not find a single paper that aligned with all 10. Then they demonstrated how conclusive the results from rigorous and objective experiments can be by using AFLFast and AFL as a case study. They determined that, “Ultimately, while AFLFast found many more ‘unique’ crashing inputs than AFL, it only had a slightly higher likelihood of finding more unique bugs in a given run.”
The authors’ results and conclusions showed decisively that in order to advance the science of software fuzzing, researchers must strive for disciplined statistical measurements and better empirical measurements. We believe this paper will begin a new chapter in fuzzing research by providing computer scientists with an excellent set of standards for designing, evaluating, and reporting software fuzzing experiments in the future.
In the meantime, if you’re evaluating a fuzzer for your work, approach with caution and this checklist.
Pingback: Fuzzing Like It’s 1989 | Trail of Bits Blog
Pingback: DeepState Now Supports Ensemble Fuzzing | Trail of Bits Blog