Solar: Context-free, interactive analysis for Solidity

By Aaron Yoo, University of California, Los Angeles

As an intern at Trail of Bits, I worked on Solar, a proof-of-concept static analysis framework. Solar is unique because it enables context-free interactive analysis of Solidity smart contracts. A user can direct Solar to explore program paths (e.g., to expand function calls or follow if statements) and assign constraints or values to variables, all without reference to a concrete execution. As a complement to Solar, I created an integrated development environment (IDE)-like tool that illustrates the types of interactivity and analysis that Solar can provide.

Solar user interface.

The Solar UI has two main panels, “source” and “IR” (intermediate representation). The source panel displays the source code of the functions being analyzed and the IR panel translates those functions into an enhanced variant of SlithIR, our open-source IR. The green highlights show the lines of the function that are being analyzed. The two panes display similar information intended for different uses: The source pane serves as a map, aiding in navigation and providing context. The IR pane enables interactivity as well as visualization of the information deduced by Solar.

Analyses built on top of Solar are called “strategies.” Strategies are modular, and each defines a different mode of operation for Solar. Three strategies are shown in the above screenshot: concrete, symbolic, and constraint. Although each analysis has its own workflow, Solar needs all three to support the simplify, downcall, and upcall primitives, which are explained below.

Solar primitives

Simplify

The simplify primitive instructs the framework to make as many inferences as possible based on the current information. Solar can receive new information either through user input or by deriving it as an axiom from the program. In the screencast below, Solar axiomatically deduces the value of the constant, 2. If the user tells Solar to assume that the value of x is 5 and then clicks the “simplify” button, Solar will deduce the return value.

Downcall

The downcall primitive instructs the framework to inline a function call. Function calls are inlined so that the user can see the entire path in one pane. In a traditional IDE, the user would have to navigate to the function in question. In our framework, the downcalled function is inlined directly into the IR pane, and its source is brought into the source pane.

Upcall

The upcall primitive inlines the current context into a calling function. In other words, upcalling implies traversal up the call stack and is generally the opposite of downcalling. By upcalling, Solar can traverse program paths up through function calls, as demonstrated below. (Pay special attention to the green-highlighted lines.)

Together, these three primitives give Solar its defining properties: context insensitivity and interactivity. Solar is context-free (context-insensitive) because the user can start analysis from any function. It is interactive because the exact program path is determined by the upcalls and downcalls—which are chosen by the user.

Demonstration for reasoning about integer overflow

Solar can help a user reason about nontrivial program properties such as integer overflow. Consider the following program:

contract Overflow {
    function z(uint32 x, uint32 y) public returns (uint32) {
        uint32 ret = 0;
        if (x < 1000 && y < 1000) {
            will_it_overflow(x, y);
        }
        return ret;
    }

    function will_it_overflow(uint32 x, uint32 y) public returns (uint32) {
        return x * y;
    }
}

Here, we want to find out whether will_it_overflow will ever cause an integer overflow. Integer overflow occurs when the mathematical result of an arithmetic operation cannot fit into the physical space allocated for its storage.
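
To make that concrete, here is a quick arithmetic illustration (in Python, purely for the numbers) of how a 32-bit result wraps when the true product needs more than 32 bits:

# The true product of two uint32 values can exceed 2**32 - 1; a uint32 keeps
# only the result modulo 2**32, so the stored value "wraps around."
x, y = 70_000, 70_000
true_product = x * y            # 4,900,000,000
stored = true_product % 2**32   # what a uint32 actually holds
print(true_product, stored)     # 4900000000 605032704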

Looking at the will_it_overflow function, it’s clear that integer overflow may be possible, as two 32-bit numbers are multiplied and placed into a 32-bit result. However, based on the call sites of will_it_overflow, if z calls will_it_overflow, there can never be an integer overflow; this is because z verifies that arguments to will_it_overflow are small. Let’s see how Solar would reach this conclusion.

Performing this analysis with Solar requires use of the constraint strategy, which works by attempting to find a single valid execution of the program. The user can constrain the execution with arbitrary polynomial constraints. To start the analysis, we select the will_it_overflow function from the left pane to indicate it as the desired starting point. Here is the initial analysis view:

Solar provides one possible execution that evaluates all values to zero. The next step is constraining the values of x and y. We can provide the following constraints (in terms of IR variables, not source variables) to the constraint strategy:

  1.   x_1 < (2 ** 32)
  2.   y_1 < (2 ** 32)
  3.   x_1 * y_1 >= (2 ** 32)

The first two constraints bind x_1 and y_1 to 32-bit integers. The third causes the solver to try to find an execution in which x_1 * y_1 overflows. It is common practice to prove properties with a SAT/SMT solver by showing that the negation of the property is unsatisfiable. With that in mind, we are going to use Solar to show that there are no executions in which x_1 * y_1 overflows, thereby implying that will_it_overflow does not overflow. After simplifying with these constraints in place, we get the following:

Again, Solar provides a single possible execution based on the constraints. Upcalling once puts us at line 5:

Because the green-highlighted lines do not form a logical contradiction, Solar can still return a valid program execution. However, upcalling again returns a different result:

The question marks on the right indicate that the solver cannot find any possible execution given the program path. This is because line 4 contradicts the inequality we wrote earlier: x_1 * y_1 >= (2 ** 32). The above steps constitute informal proof that overflow is not possible if will_it_overflow is called from z.
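
For readers who want to reproduce the argument outside Solar, here is a minimal sketch using the z3-solver Python package (using z3 here is our own illustration; Solar’s constraint strategy is its own implementation). It encodes the same three constraints plus the guard from line 4 of z and confirms that no overflowing execution exists:

from z3 import BitVec, ZeroExt, ULT, UGE, Solver

x, y = BitVec("x_1", 32), BitVec("y_1", 32)
wide = ZeroExt(32, x) * ZeroExt(32, y)   # 64-bit product, so it cannot wrap

s = Solver()
s.add(UGE(wide, 2**32))                  # look for an overflowing execution
s.add(ULT(x, 1000), ULT(y, 1000))        # the guard on line 4 of z
print(s.check())                         # unsat: no such execution exists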

Conclusion

I am proud of what Solar became. Although it is a prototype, Solar represents a novel type of analysis platform that prioritizes interactivity. Given the potential applications for code auditing and IDE-style semantic checking, I am excited to see what the future holds for Solar and its core ideas. I would like to give a big thank-you to my mentor, Peter Goodman, for making this internship fun and fulfilling. Peter accomplished perhaps the most challenging task for a mentor: striking the delicate balance between providing me guidance and freedom of thought. I would also like to extend thanks to Trail of Bits for hosting the internship. I look forward to seeing the exciting projects that future interns create!

A Year in the Life of a Compiler Fuzzing Campaign

By Alex Groce, Northern Arizona University

In the summer of 2020, we described our work fuzzing the Solidity compiler, solc. Now we’d like to revisit this project, since fuzzing campaigns tend to “saturate,” finding fewer new results over time. Did Solidity fuzzing run out of gas? Is it worthwhile to keep fuzzing a high-stakes project, especially one that has its own active and effective fuzzing effort?

The first bugs from that fuzzing campaign were submitted in February of 2020 using an afl variant. Since then, we’ve submitted 74 reports. Sixty-seven have been confirmed as bugs, and 66 of those have been fixed. Seven were duplicates or not considered to be true bugs.


Given that 21 of these bugs were submitted since last December, it’s fair to say that our fuzzing campaign makes a strong case for why independent fuzzing is still important and can find different bugs, even when OSSFuzz-based testing is also involved.

Why is it useful to keep fuzzing such a project perhaps indefinitely? The answer has three parts. First, more fuzzing covers more code execution paths and long fuzzing runs are especially helpful. It would be hard to get anywhere near path-coverage saturation with any fuzzer we know of. Even when running afl for 30 or more days, our tests still find new paths around every 1-2 hours and sometimes find new edges. Some of our reported bugs were discovered only after a month of fuzzing.

A production compiler is fantastically complex, so fuzzing in depth is useful. It takes time to generate inputs like the one that triggered the most recent bug we found fuzzing the Solidity level of solc:

pragma experimental SMTChecker;
contract A {
    function f() internal virtual {
        v();
    }
    function v() internal virtual {
    }
}

contract B is A {
    function f() internal virtual override {
        super.f();
    }
}

contract C is B {
    function v() internal override {
        if (0==1)
            f();
    }
}

This code should compile without error, but it didn’t. Until the bug was fixed, it caused the SMT Checker to crash, throwing a fatal “Super contract not available” error, due to incorrect contract context used for variable access in virtual calls inside branches.

Compilers should undergo fuzz testing for long stretches of time because of their complexity and the number of possible execution paths. A rule of thumb is that afl hasn’t really started in earnest on any non-trivial target until it hits one million executions, and compilers likely require much more than this. In our experience, fuzzing a compiler runs at anywhere from less than one execution per second to as many as 40 executions per second. Just getting to one million executions can take a few days!

The second reason we want to fuzz independently alongside OSSFuzz is to approach the target from a different angle. Instead of strictly using a dictionary- or grammar-based approach with traditional fuzzer mutation operators, we use ideas from any-language mutation testing to add “code-specific” mutation operators to afl and rely mostly (but not exclusively) on those, as opposed to afl’s even more generic mutations, which tend to be focused on binary format data. Doing something different is likely to be a good answer to fuzzer saturation.
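
As a rough illustration of what such a “code-specific” mutation operator looks like (this is our own Python sketch, not the actual afl-compiler-fuzzer implementation), consider a mutator that swaps operators or exchanges lines the way a mutation testing tool would:

import random

OPERATOR_SWAPS = {"+": "-", "-": "+", "*": "/", "&&": "||", "||": "&&",
                  "<": ">=", ">": "<=", "==": "!="}

def mutate(source: bytes) -> bytes:
    lines = source.decode("utf8", errors="ignore").splitlines()
    if not lines:
        return source
    if random.random() < 0.5:
        # Swap one arithmetic/logical operator on a randomly chosen line.
        i = random.randrange(len(lines))
        for old, new in OPERATOR_SWAPS.items():
            if old in lines[i]:
                lines[i] = lines[i].replace(old, new, 1)
                break
    else:
        # Exchange two lines of code, the way a mutation testing tool might.
        i, j = random.randrange(len(lines)), random.randrange(len(lines))
        lines[i], lines[j] = lines[j], lines[i]
    return "\n".join(lines).encode("utf8")

Operators like these know nothing about the grammar of Solidity, Vyper, or Fe; they just make plausible-looking edits to program text.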

Finally, we keep grabbing the latest code and starting new fuzzing runs on new versions of solc. Since the OSSFuzz continuous integration doesn’t include our techniques, bugs that are hard for other fuzzers but easy for our code-mutation approach will sometimes appear, and our fuzzer will find them almost immediately.

But we don’t grab every new release and start over, because we don’t want to lose the ground gained during our long fuzzing campaigns. We also don’t continually take up where we left off, since the corpus of tens of thousands of tests that afl generates is likely full of uninteresting paths that would make it harder to find bugs in the new code. We sometimes resume from an existing run, but only infrequently.

Finding bugs in a heavily fuzzed program like solc is not easy. The next best independent fuzzing effort to ours, that of Charalambos Mitropoulos, also mentioned by the solc team in their post on OSSFuzz fuzzing, has discovered only 8 bugs, even though it’s been ongoing since October 2019.

Other Languages, Other Compilers

Our success with solc inspired us to fuzz other compilers. First, we tried fuzzing the compiler for Vyper—a language intended to provide a safer, Python-like alternative to Solidity for writing Ethereum blockchain smart contracts. Our previous Vyper fuzzing uncovered some interesting bugs using an essentially grammar-based approach with the TSTL (Template Scripting Testing Language) Python library via python-afl. We found a few bugs during this campaign but chose not to go to extremes, because of the poor throughput of the instrumented Python testing.

In contrast, my collaborator, Rijnard van Tonder at Sourcegraph, had much greater success fuzzing the Diem project’s Move language—the language for the blockchain formerly known as Facebook’s Libra. Here, the compiler is fast and the instrumentation is cheap. Rijnard has reported 14 bugs in the compiler, so far, all of which have been confirmed and assigned, and 11 of which have been fixed. Given that the fuzzing began just two months ago, this is an impressive bug haul!

Using Rijnard’s notes on fuzzing Rust code using afl.rs, I tried our tools on Fe, a new smart contract language supported by the Ethereum foundation. Fe is, in a sense, a successor to Vyper, but with more inspiration from Rust and a much faster compiler. I began fuzzing Fe on the date of its first alpha release and submitted my first issue nine days later.

To support my fuzzing campaign, the Fe team changed failures in the Yul backend, which uses solc to compile Yul, to produce Rust panics visible to afl, and we were off to the races. So far, this effort has produced 31 issues, slightly over 18% of all GitHub issues for Fe, including feature requests. Of these, 14 have been confirmed as bugs, and ten of those have been fixed; the remaining bugs are still under review.

We didn’t just fuzz smart contract languages. Rijnard fuzzed the Zig compiler—a new systems programming language that aims at simplicity and transparency—and found two bugs (confirmed, but not fixed).

The Future of Our Fuzzing Campaigns

We uncovered 88 bugs that were fixed during our afl compiler fuzzing campaign, plus an additional 14 confirmed but not yet fixed bugs.

What’s interesting is that the fuzzers aren’t using dictionaries or grammars. They know nothing about any of these languages beyond what is expressed by a modest corpus of example programs from the test cases. So how can we fuzz compilers this effectively?

The fuzzers operate at a regular-expression level; they don’t even use context-free language information. Most of the fuzzing has used fast C string-based heuristics to make “code-like” changes, such as removing code between brackets, changing arithmetic or logical operators, swapping lines of code, changing if statements to while, and removing function arguments. In other words, they apply the kind of changes a mutation testing tool would. This approach works well even though Vyper and Fe aren’t very C-like and the heuristics account for Python-style syntax only in their handling of whitespace, commas, and parentheses.

Custom dictionaries and language-aware mutation rules may be more effective, but the goal is to provide compiler projects with effective fuzzing without requiring many resources. We also want to see the impact that a good fuzzing strategy can have on a project during the early stages of development, as with the Fe language. Some of the bugs we’ve reported highlighted tricky corner cases for developers much earlier than might have otherwise been the case. We hope that discussions such as this one will help produce a more robust language and compiler with fewer hacks made to accommodate design flaws detected too late to easily change.

We plan to keep fuzzing most of these compilers since the solc effort has shown that a fuzzing campaign can remain viable for a long time, even if there are other fuzzing efforts targeting the same compiler.

Compilers are complex and most are also rapidly changing. For example, Fe is a brand-new language that isn’t really fully designed, and Solidity is well-known for making dramatic changes to both user-facing syntax and compiler internals.

We’re also talking to Bhargava Shastry, who leads the internal fuzzing effort of Solidity, and applying some of the semantic checks they apply in their protobuf-fuzzing of the Yul optimization level ourselves. We started directly fuzzing Yul via solc’s strict-assembly option, and we already found one amusing bug that was quickly fixed and incited quite a bit of discussion! We hope that the ability to find more than just inputs that crash solc will take this fuzzing to the next level.

The larger issue is whether fuzzing is limited in the bugs it can find, because a compiler cannot, by itself, detect many wrong-code errors. Differential comparison of two compilers, or of a compiler with its own output when optimizations are turned off, usually requires a much more restricted form of program, since programs must be compiled and executed to compare results, and that restriction limits the bugs you can find.

One way to get around this problem is to make the compiler crash more often. We imagine a world where compilers include something like a testing option that enables aggressive and expensive checks that wouldn’t be practical in normal runs, such as sanity checks on register allocation. These checks could be turned on both for fuzzing runs, since the programs compiled are usually small, and, perhaps even more importantly, for final production compilation of extremely critical code (Mars Rover code, nuclear-reactor control code, or high-value smart contracts) to make sure no wrong-code bugs creep into such systems.

Finally, we want to show compiler developers, and developers of other tools that take source code as input, that effective fuzzing doesn’t have to be a high-cost effort requiring significant developer time. Finding crashing inputs for a compiler is often easy, using nothing more than some spare CPU cycles, a decent set of source code examples in the language, and the afl-compiler-fuzzer tool!

We hope you enjoyed learning about our long-term compiler fuzzing project, and we’d love to hear about your own fuzzing experiences on Twitter @trailofbits.

Un-bee-lievable Performance: Fast Coverage-guided Fuzzing with Honeybee and Intel Processor Trace

By Allison Husain, UC Berkeley

Today, we are releasing an experimental coverage-guided fuzzer called Honeybee that records program control flow using Intel Processor Trace (IPT) technology. Previously, IPT has been scrutinized for severe underperformance due to issues with capture systems and inefficient trace analyses. My winter internship focused on working through these challenges to make IPT-based fuzzing practical and efficient.

IPT is a hardware feature that asynchronously records program control flow, costing a mere 8-15% overhead at record time. However, using IPT as a fuzzing coverage mechanism hasn’t been practical outside of highly experimental binary-only coverage solutions, since source code instrumentation typically provides far better performance. Honeybee addresses this limitation and makes IPT significantly faster to capture and hundreds of times faster to analyze. So now we have coverage-guided fuzzing—even if source code is unavailable—at performance competitive with, and sometimes faster than, source-level coverage instrumentation. Here, I will describe the development process behind Honeybee and give a general overview of its design.

How it started…

IPT is an Intel-specific processor feature that can be used to record the full control flow history of any process for minutes at a time with a minimal performance penalty. IPT drastically improves on Intel’s older hardware tracing systems such as Branch Trace Store, which can have a performance penalty exceeding 100%. Even better, IPT supports the granular selection of the trace target by a specific process or range of virtual addresses.

A hardware mechanism like IPT is especially alluring for security researchers because it can provide code coverage information for closed-source, unmodified target binaries. A less appreciated fact is that not only can IPT be much faster than any existing black-box approaches (like QEMU, interrupts, or binary lifting), IPT should also be faster than inserting source-level instrumentation.

Coverage instrumentation hurts run-time performance by thrashing various CPU caches via frequent, effectively random writes into large coverage buffers. Source-level instrumentation also inhibits many important compile-time optimizations, like automatic vectorization. Further, because most programs are multi-threaded, instrumentation code needs to update coverage bitmaps atomically, which significantly limits pipeline throughput.
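
For context, here is a sketch of the classic afl-style edge-coverage update that instrumentation adds to every basic block (written in Python purely for illustration; real instrumentation emits the equivalent inline in the target’s own language). The XOR of block IDs makes consecutive writes land at effectively random bitmap offsets, which is exactly what thrashes the data cache:

MAP_SIZE = 1 << 16
coverage_bitmap = bytearray(MAP_SIZE)
prev_location = 0

def on_basic_block(cur_location: int) -> None:
    global prev_location
    edge = (cur_location ^ prev_location) % MAP_SIZE    # pseudo-random index
    coverage_bitmap[edge] = (coverage_bitmap[edge] + 1) & 0xFF
    prev_location = (cur_location >> 1) % MAP_SIZE      # shift so A->B != B->A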

IPT should work on both closed source and open source software with no change in coverage generation strategy and incur only a minimal 8-15% performance penalty. Jaw-dropping, I tell you!

How it’s going…

Unfortunately, the 8-15% performance penalty doesn’t tell the whole story. While IPT has low capture overhead, it does not have a low analysis overhead. To capture long traces on commodity hardware, IPT uses various techniques to minimize the amount of data stored per trace. One technique is to only record control flow information not readily available from static analysis of the underlying binary, such as taken/not-taken branches and indirect branch targets. While this optimization assumes the IPT decoder has access to the underlying program binary, this assumption is often correct. (See Figure 1 for example.)

IPT is a very dense binary format. To showcase what information is stored, I’ve converted it to a more readable format in Figure 1. The packet type is in the left column and the packet payload is on the right.

Example IPT trace showing how IPT minimizes data stored during program execution.

    1. Tracing starts while the program executes at 0x7ffff7f427ef.
    2. The program hits a conditional branch and accepts it. (The first ! in line 2.)
    3. The program hits two conditional branches and does not accept them. (The . . in line 2.)
    4. The program hits a conditional branch and does not accept it. (The last ! in line 2.)
    5. The program hits a conditional branch and accepts it. (Line 4.)
    6. The program hits an indirect branch, at which point it jumps to the last instruction pointer with the lower two bytes replaced with 0x3301.
    7. The program hits a conditional branch and accepts it.
    8. The program continued with no other conditional/indirect branches until the last four bytes of the instruction pointer were 0xf7c189a0 at which point tracing stopped because the program either exited or another piece of code that did not match the filters began executing.

Despite all the trace data provided, a surprising amount of information is omitted. The trace provides its starting virtual address, the sequence of conditional branches, and whether each was taken. However, unconditional, statically determinable control flow transfers (i.e., direct call and jmp instructions) and conditional branch destinations are not provided. This reduces the trace size, because 1) non-branching code is never recorded, 2) conditional branches are represented as a single bit, and 3) indirect branches are represented only by changes to the instruction pointer.

So how is the real control flow reconstructed from this partial data? An IPT decoder can pair trace data with the underlying binary and “walk” through the binary from the trace start address. When the decoder encounters a control flow transfer that can’t be trivially determined, like a conditional branch, it consults the trace. Data in the trace indicates which branches were taken/not taken and the result of indirect control flow transfers. By walking the binary and trace until the trace ends, a decoder can reconstruct the full flow.
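
A toy, runnable sketch of that idea follows (this is our own illustration, not libipt’s API). The “binary” is a hand-written control-flow graph, and the “trace” supplies only what static analysis cannot decide: one taken/not-taken bit per conditional branch and a target address per indirect branch.

BINARY = {
    # address: (kind, successor info)
    0x1000: ("cond", {"taken": 0x2000, "fallthrough": 0x3000}),
    0x2000: ("direct", 0x4000),    # unconditional jump: not in the trace
    0x3000: ("indirect", None),    # target must come from the trace
    0x4000: ("stop", None),
}

def decode(trace_bits, trace_targets, start=0x1000):
    ip, path = start, []
    bits, targets = iter(trace_bits), iter(trace_targets)
    while True:
        path.append(ip)
        kind, succ = BINARY[ip]
        if kind == "stop":
            return path
        if kind == "cond":
            ip = succ["taken"] if next(bits) else succ["fallthrough"]
        elif kind == "indirect":
            ip = next(targets)
        else:   # direct: reconstructed purely from the binary
            ip = succ

# Trace: one "not taken" bit, then an indirect branch that landed at 0x4000.
print([hex(a) for a in decode(trace_bits=[0], trace_targets=[0x4000])])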

But herein lies the gotcha of IPT: although capture is fast, walking through the code is decidedly not, because the decoder must disassemble and analyze multiple x86-64 instructions at every decoding step. While the overhead of disassembly and analysis isn’t a problem for debugging and profiling scenarios, it severely hampers fuzzing throughput. Unfortunately, such expensive analysis is fundamentally unavoidable, as traces cannot be decoded without analyzing the original program.

But…is this the end? Was it the beautiful promise of IPT fuzzing just…a dream? A mere illusion? Say it ain’t so!

Making IPT faster!

While profiling Intel’s reference IPT decoder, libipt, I noticed that over 85% of the CPU time was spent decoding instructions during a trace analysis. This is not surprising given that IPT data must be decoded by walking through a binary looking for control flow transfers. An enormous amount of time spent during instruction decoding, however, is actually good news.

Why? A fuzzer needs to decode a multitude of traces against the same binary. It may be reasonable to continuously analyze instructions for a single trace of one binary, but re-analyzing the same instructions millions of times for the same binary is extraordinarily wasteful. If your “hey we should probably use a cache” sense is tingling, you’re totally right! Of course, the importance of instruction decode caches is not a novel realization.

libxdc, an open-source decoder that claims to be the fastest IPT decoder available (more on that later), tries to solve this issue using a fast runtime cache. Using this cache and other performance programming techniques, libxdc operates 10 to 40 times faster than Intel’s reference decoder, which demonstrates that caching is very important.

I thought I could do better though. My critique of libxdc was that its dynamic instruction cache introduced unnecessary and expensive overhead in two ways. First, a dynamic cache typically has expensive lookups, because it needs to calculate the target’s hash and ensure that the cache is actually a hit. This introduces more overhead and complexity to one of the hottest parts of the entire algorithm and cannot be overlooked.

Second, and frankly much worse, is that a dynamic cache is typically expected to fill and evict older results. Even the best cache eviction strategy will cause future work: any evicted instructions will eventually need to be re-decoded because a fuzzer never targets code just once. This creates duplicated effort and decoder performance penalties with every single cache eviction.

My idea was to introduce a static, ahead-of-time generated cache that holds all the data that IPT decoding could conceivably require. The cache would be shared between multiple threads without penalty and could be accessed without expensive hashing or locking. By entirely eliminating runtime binary analysis and cache-access overhead, I could decode traces significantly faster than libxdc, because my decoder would simply be doing less work.

The Honeybee architecture.

Keeping with the theme of Honeybee, I named these static, ahead-of-time generated caches “hives” since they require work to create but only need to be made once. To make the hives, I created hive_generator, which consumes an ELF executable and captures information that may be needed to decode a trace. The hive_generator searches for all control flow instructions and generates an extremely compact encoding of all basic blocks and where code execution could continue. There are two important features of this new design worth discussing. (Full details are available on Github.)

First, this encoding is data cache-friendly: not only are blocks the size of cache lines, but encoded blocks are also stored in the same order as in the original binary, which is a small but important detail. It means that Honeybee’s decoder can take full advantage of the original binary’s cache locality optimizations, since compilers generally put related basic blocks close to each other. This is generally not possible with dynamic caches like libxdc’s, since the cache’s hash function will, by design, send neighboring blocks to random locations. This hurts performance because it evicts meaningful data from the CPU’s data cache.

The other important feature is that blocks are encoded in a bitwise-friendly format, so Honeybee can process the compacted blocks using exclusively high-throughput ALU operations. This design makes several critical operations completely branchless — like determining whether a block ends in a direct, indirect, or conditional branch. Combining this with high-throughput ALU operations avoids many costly branch mispredictions and pipeline purges.
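
To illustrate the flavor of that encoding (a toy of our own invention, not Honeybee’s actual hive format), imagine packing each block’s length and terminator kind into a single word so the decoder classifies blocks with plain ALU operations instead of data-dependent branches:

KIND_DIRECT, KIND_COND, KIND_INDIRECT = 0, 1, 2

def encode_block(length: int, kind: int) -> int:
    return (length << 2) | kind     # low two bits hold the terminator kind

def block_kind(word: int) -> int:
    return word & 0b11              # branchless classification

def block_length(word: int) -> int:
    return word >> 2

packed = encode_block(length=17, kind=KIND_COND)
assert (block_kind(packed), block_length(packed)) == (KIND_COND, 17)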

These changes seemed relatively trivial, but I hoped that they would combine to a respectable performance boost over the current state of the art, libxdc.

Honeybee decoder benchmarks

To compare the performance of Honeybee’s decoder with Intel’s reference decoder, I ran traces ranging from tens of kilobytes up to 1.25 GB against binaries ranging from 100 KB to 45 MB. Tests were performed 25 times, and I verified that both decoders traveled identical control flow paths for the same trace files.

Comparing Honeybee’s decoding speed to libipt.

These tests showed promising results (Figure 3). On large programs like clang, Honeybee outperformed Intel’s reference decoder and libxdc by an order of magnitude (and two orders of magnitude in one case).

For context, the largest trace in this test suite, “honey_mirror_1/clang_huge.pt,” is 1.25GB and originates from a trace of a complicated static analysis program that disassembled the entire 35MB clang binary.

Honeybee takes only 3.5 seconds to do what Intel’s reference decoder does in two-and-a-half minutes, which is a 44x improvement! This is the difference between stepping away while the trace decodes and being able to take a sip of water while you wait.

This difference is even more pronounced on small traces, which are more similar to fuzzing loads, like “html_fast_parse/6_txt.pt.” In this case, Honeybee needed only 6.6 microseconds to finish what Intel’s reference decoder took 451 microseconds to do. An order-of-magnitude improvement!

Integrating Honeybee with honggfuzz.

Now to actually integrate this new coverage mechanism into a fuzzer. I chose Google’s honggfuzz since it’s modular and, notably, because it actually already has another slow and partially broken version of IPT-based coverage that uses Intel’s reference decoder. My plan was to simply rip out Intel’s decoder, bolt Honeybee in place, and get a wonderful speedup. However, this was more complicated than I expected.

The challenge lies in how Linux typically collects IPT data. This is meant to be fairly simple, since the mainline kernel has support for IPT built right into perf. But I discovered that the complex and aggressive filtering mechanisms that Honeybee needs to clean up IPT data expose stability and performance issues in perf.

This was problematic. Not only was perf not terribly fast to begin with, but it was highly unstable. Complex configurations used by Honeybee triggered serious bugs in perf which could cause CPUs to be misconfigured and require a full system reboot to recover from lockup. Understandably, both of these issues ultimately made perf unusable for capturing IPT data for any Honeybee-related fuzzing tasks.

But, as my mother says, if you want something done right, sometimes you just need to do it yourself. Following in her footsteps, I wrote a small kernel module for IPT named “honey_driver” that was specifically optimized for fuzzing. While this new kernel module is certainly less featureful than perf and a likely security hazard, honey_driver is extremely fast and enables a user-space client to rapidly reconfigure tracing and analyze the results with little overhead.

And so, with this small constellation of custom code, honggfuzz was ready to roll with IPT data from Honeybee!

Fuzzer benchmarks

Fuzzer performance measurement is complex, and there are far more reliable and definitive ways to measure performance than what I did here. As a rough benchmark, I persistently fuzzed a small HTML parser using four different coverage strategies. For each strategy, I let honggfuzz fuzz the binary for the same test period and recorded the average execution rate.

Comparing Honeybee to source-level instrumentation.

The first contender in the experiment was no coverage whatsoever. I considered this to be the baseline since it’s essentially as fast as honggfuzz can run on this binary by feeding random input into the test program. In this configuration, honggfuzz achieved an average of 239K executions per second. In the context of my system, this is decently fast but is still certainly limited by the fuzzing target’s CPU performance.

Next, I tested honggfuzz’s source-level software instrumentation by compiling my target with the instrumenting compiler and no other features enabled. This led to an average of 98K executions per second, roughly 41% of the no-coverage baseline. A penalty like this is generally accepted and expected when fuzzing, due to missed compiler optimizations, many extra function calls, expensive locking, and cache thrashing from essentially random writes into coverage bitmaps.

After software instrumentation, we get into the more interesting coverage techniques. As mentioned earlier, honggfuzz has support for processor trace using libipt for analysis and unfiltered perf data for IPT capture. However, honggfuzz’s existing IPT support does not generate full or even possibly correct coverage information, because honggfuzz only extracts indirect branch IPs from the trace and completely ignores any conditional branches. Additionally, since no filtering is employed using perf, honggfuzz generates coverage for every piece of code in the process, including uninteresting libraries like libc. And this leads to bitmap pollution.

Even with these shortcuts, honggfuzz’s existing IPT can only achieve an average of 1.1K executions per second (roughly half of a percent of the theoretical max). Due to the inaccurate coverage data, a true comparison cannot be made to software instrumentation, because it is possible that it found a more difficult path sooner. Realistically, however, the gap is so enormous that such an issue is unlikely to account for most of the overhead given the previously established performance issues of both perf and libipt.

Lastly, we have Honeybee with its custom decoder and capture system. Unlike the existing honggfuzz implementation, Honeybee decodes the entire trace, so it is able to generate full, correct basic block and edge coverage information. Honeybee achieved an average of 171K executions per second, which is only a 28% performance dip compared to the baseline.

This shouldn’t come as a shock, since IPT only has an 8-15% record-time overhead, which leaves roughly 13-20% of the baseline’s total execution time for processing IPT data and generating coverage. Given the performance of Honeybee’s decoder, it is entirely reasonable that decoding and coverage generation account for the remainder of the observed 28% penalty.

I analyzed Honeybee’s coverage and confirmed that it was operating normally and processing all the data correctly. As such, I’m happy to say that Honeybee is (at least in this case) able to fuzz both closed and open-source software faster and more efficiently than even conventional software instrumentation methods!

What’s next?

While it is very exciting to claim to have dethroned an industry-standard fuzzing method, these results have not been rigorously tested or verified at a large scale or across a large array of fuzzing targets. I can attest, however, that Honeybee has been either faster than, or at least able to trade blows with, software instrumentation while fuzzing many different large open-source projects like libpcap, libpng, and libjpeg-turbo.

If these patterns apply more generally, this could mean a great speedup for those who need to perform source-level fuzzing. More excitingly, however, this is an absolutely wonderful speedup for those who need to perform black-box fuzzing and have been relying on slow and unreliable tools like QEMU instrumentation, Branch Trace Store (BTS), or binary lifting, since it means they can fuzz at equal or greater speeds than if they had source without making any serious compromises. Even outside fuzzing, however, Honeybee is still a proven and extraordinarily fast IPT decoder. This high-performance decoding is useful outside of fuzzing because it enables many novel applications ranging from live control flow integrity enforcement to more advanced debuggers and performance analysis tools.

Honeybee is a very young project and is still under active development in its home on GitHub. If you’re interested in IPT, fuzzing, or any combination thereof please feel free to reach out to me over on Twitter where I’m @ezhes_ or over email at allison.husain on the berkeley.edu domain!

Acknowledgments

As mentioned earlier, IPT has caught many researchers’ eyes and so, naturally, many have tried to use it for fuzzing. Before starting my own project, I studied others’ research to learn from their work and see where I might be able to offer improvements. I’d first like to acknowledge the authors behind PTrix (PTrix: Efficient Hardware-Assisted Fuzzing for COTS Binary) and PTfuzz (PTfuzz: Guided Fuzzing With Processor Trace Feedback) as they provided substantial insight into how a fuzzer could be structured around IPT data. Additionally, I’d like to thank the team behind libxdc as their fast IPT packet decoder forms the basis for the packet decoder in Honeybee. Finally, I’d like to give a big thank you to the team at Trail of Bits, and especially my mentors Artem Dinaburg and Peter Goodman, for their support through this project and for having me on as an intern this winter!

Never a dill moment: Exploiting machine learning pickle files

By Evan Sultanik

Many machine learning (ML) models are Python pickle files under the hood, and it makes sense. The use of pickling conserves memory, enables start-and-stop model training, and makes trained models portable (and, thereby, shareable). Pickling is easy to implement, is built into Python without requiring additional dependencies, and supports serialization of custom objects. There’s little doubt about why choosing pickling for persistence is a popular practice among Python programmers and ML practitioners.

Pre-trained models are typically treated as “free” byproducts of ML, since sharing them allows the valuable intellectual property, such as the algorithms and corpora that produced the model, to remain private. This gives many people the confidence to share their models over the internet, particularly for reusable computer vision and natural language processing classifiers. Websites like PyTorch Hub facilitate model sharing, and some libraries even provide APIs to download models from GitHub repositories automatically.

Here, we discuss the underhanded antics that can occur simply from loading an untrusted pickle file or ML model. In the process, we introduce a new tool, Fickling, that can help you reverse engineer, test, and even create malicious pickle files. If you are an ML practitioner, you’ll learn about the security risks inherent in standard ML practices. If you are a security engineer, you’ll learn about a new tool that can help you construct and forensically examine pickle files. Either way, by the end of this article, pickling will hopefully leave a sour taste in your mouth.

Do you know how pickles are stored? It’s jarring!

Python pickles are compiled programs run in a unique virtual machine called a Pickle Machine (PM). The PM interprets the pickle file’s sequence of opcodes to construct an arbitrarily complex Python object. Python pickle is also a streaming format, allowing the PM to incrementally build the resulting object as portions of the pickle are downloaded over the network or read from a file.

The PM uses a Harvard architecture, segregating the program opcodes from writable data memory, thus preventing self-modifying code and memory corruption attacks. It also lacks support for conditionals, looping, or even arithmetic. During unpickling, the PM reads in a pickle program and performs a sequence of instructions. It stops as soon as it reaches the STOP opcode and whatever object is on top of the stack at that point is the final result of unpickling.

From this description, one might reasonably conclude that the PM is not Turing-complete. How could this format possibly be unsafe? To corrode the words of Mishima’s famous aphorism:

Computer programs are a medium that reduces reality to abstraction for transmission to our reason, and in their power to corrode reality inevitably lurks the danger of the weird machines.

The PM contains two opcodes that can execute arbitrary Python code outside of the PM, pushing the result onto the PM’s stack: GLOBAL and REDUCE. GLOBAL is used to import a Python module or class, and REDUCE is used to apply a set of arguments to a callable, typically previously imported through GLOBAL. Even if a pickle file does not use the REDUCE opcode, the act of importing a module alone can and will execute arbitrary code in that module, so GLOBAL alone is dangerous.

For example, one can use a GLOBAL to import the exec function from __builtins__ and then REDUCE to call exec with an arbitrary string containing Python code to run. Likewise for other sensitive functions like os.system and subprocess.call. Python programs can optionally limit this behavior by defining a custom unpickler; however, none of the ML libraries we inspected do so. Even if they did, these protections can almost always be circumvented; there is no guaranteed way to safely load untrusted pickle files, as is highlighted in this admonition from the official Python 3.9 Pickle documentation:

Warning The pickle module is not secure. Only unpickle data you trust.
It is possible to construct malicious pickle data which will execute arbitrary code during unpickling. Never unpickle data that could have come from an untrusted source, or that could have been tampered with.
Consider signing data with hmac if you need to ensure that it has not been tampered with.
Safer serialization formats such as JSON may be more appropriate if you are processing untrusted data.

We are not aware of any ML file formats that include a checksum of the model, either.†

The dangers of Python pickling have been known to the computer security community for quite some time.
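
As a reminder of just how little it takes, here is the classic, widely documented demonstration: any object whose __reduce__ method returns a callable and its arguments will have that callable invoked during unpickling. (The command below is harmless, but it could be anything.)

import os
import pickle

class Exploit:
    def __reduce__(self):
        # The resulting pickle contains GLOBAL (os.system) and REDUCE (call it).
        return (os.system, ("echo code ran during unpickling",))

payload = pickle.dumps(Exploit())
pickle.loads(payload)   # runs the command just by loading the pickle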

Introducing Fickling: A decompiler, static analyzer, and bytecode rewriter for pickle files

Fickling has its own implementation of a Pickle Virtual Machine (PM), and it is safe to run on potentially malicious files, because it symbolically executes code rather than overtly executing it.

Let’s see how Fickling can be used to reverse engineer a pickle file by creating an innocuous pickle containing a serialized list of basic Python types:

$ python3 -c "import sys, pickle; \
  sys.stdout.buffer.write(pickle.dumps([1, ‘2’, {3: 4}]))" \
  > simple_list.pickle
$ python3 -m pickle simple_list.pickle
[1, ‘2’, {3: 4}]

Running fickling on the pickle file will decompile it and produce a human-readable Python program equivalent to what code would be run by the real PM during deserialization:

$ fickling simple_list.pickle
result = [1, '2', {3: 4}]

In this case, since it’s a simple serialized list, the code is neither surprising nor very interesting. By passing the --trace option to Fickling, we can trace the execution of the PM:

$ fickling --trace simple_list.pickle
PROTO
FRAME
EMPTY_LIST
    Pushed []
MEMOIZE
    Memoized 0 -> []
MARK
    Pushed MARK
BININT1
    Pushed 1
SHORT_BINUNICODE
    Pushed '2'
MEMOIZE
    Memoized 1 -> '2'
EMPTY_DICT
    Pushed {}
MEMOIZE
    Memoized 2 -> {}
BININT1
    Pushed 3
BININT1
    Pushed 4
SETITEM
    Popped 4
    Popped 3
    Popped {}
    Pushed {3: 4}
APPENDS
    Popped {3: 4}
    Popped '2'
    Popped 1
    Popped MARK
STOP
    result = [1, '2', {3: 4}]
    Popped [1, '2', {3: 4}]

You can run Fickling’s static analyses to detect certain classes of malicious pickles by passing the --check-safety option:

$ fickling --check-safety simple_list.pickle
Warning: Fickling failed to detect any overtly unsafe code,
but the pickle file may still be unsafe.
Do not unpickle this file if it is from an untrusted source!

What would it look like if the pickle file were malicious? Well, why not make one! We can do that by injecting arbitrary Python code into the pickle file:

$ fickling --inject 'print("Hello World!")' simple_list.pickle > simple_list.pwn3d
$ python3 -m pickle simple_list.pwn3d
Hello World!
[1, '2', {3: 4}]

It works! So let’s see Fickling’s decompilation:

$ fickling simple_list.pwn3d
_var0 = eval('print("Hello World!")')
result = [1, '2', {3: 4}]

and its analysis:

$ fickling --check-safety simple_list.pwn3d
Call to `eval('print("Hello World!")')` is almost certainly
evidence of a malicious pickle file

Fickling can also be used as a Python library, and has a programmatic interface to decompile, analyze, modify, and synthesize Pickle files. It is open source, and you can install it by running:

pip3 install fickling

Making Malicious ML Models

Since the majority of ML models use pickling extensively, there is a potential attack surface for weight/neuron perturbations on models, including fault injections, live trojans, and weight poisoning attacks among others. For example, during deserialization, code injected into the pickle could programmatically make changes to the model depending on the local environment, such as time of day, timezone, hostname, system locale/language, or IP address. These changes could be subtle, like a bitflip attack, or more overt, like injecting arbitrary delays in the deserialization to deny service.

Fickling has a proof-of-concept based on the official PyTorch tutorial that injects arbitrary code into an existing PyTorch model. This example shows how loading the generated model into PyTorch will automatically list all of the files in the current directory (presumably containing proprietary models and code) and exfiltrate them to a remote server.

This is concerning for services like Microsoft’s Azure ML, which supports running user-supplied models in their cloud instances. A malicious, “Fickled” model could cause a denial of service, and/or achieve remote code execution in an environment that Microsoft likely assumed would be proprietary. If multiple users’ jobs are not adequately compartmentalized, there is also the potential of exfiltrating other users’ proprietary models.

How do we dill with it?

The ideal solution is to avoid pickling altogether. There are several different encodings—JSON, CBOR, ProtoBuf—that are much safer than pickling and are sufficient for encoding these models. In fact, PyTorch already includes state_dict and load_state_dict functions that save and load model weights into a dictionary, which can be easily serialized into a JSON format. In order to fully load the model, the model structure (how many layers, layer types, etc.) is also required. If PyTorch implements serialization/deserialization methods for the model structure, the entire model can be much more safely encoded into JSON files.
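
A minimal sketch of that idea follows (our own example, not an official PyTorch serialization path): dump the state_dict to JSON by converting tensors to plain lists, and rebuild the architecture in code before loading.

import json
import torch

model = torch.nn.Linear(4, 2)

# Save: tensors are not JSON-serializable, so convert them to nested lists.
weights = {name: tensor.tolist() for name, tensor in model.state_dict().items()}
with open("weights.json", "w") as f:
    json.dump(weights, f)

# Load: rebuild the architecture, then restore the weights from plain data.
restored = torch.nn.Linear(4, 2)
with open("weights.json") as f:
    state = {name: torch.tensor(values) for name, values in json.load(f).items()}
restored.load_state_dict(state)

Note that this only protects the weights; the model structure still has to come from trusted code.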

Outside of PyTorch, there are other frameworks that avoid using pickle for serialization. For example, the Open Neural Network Exchange (ONNX) aims to provide a universal standard for encoding AI models to improve interoperability. The ONNX specification uses ProtoBuf to encode their model representations.

ReSpOnSiBlE DiScLoSuRe

We reported our concerns about sharing ML models to the PyTorch and PyTorch Hub maintainers on January 25th and received a reply two days later. The maintainers said that they will consider adding additional warnings to PyTorch and PyTorch Hub. They also explained that models submitted to PyTorch Hub are vetted for quality and utility, but the maintainers do not perform any background checks on the people publishing the model or carefully audit the code for security before adding a link to the GitHub repository on the PyTorch Hub indexing page. The maintainers do not appear to be following our recommendation to switch to a safer form of serialization; they say that the onus is on the user to ensure the provenance and trustworthiness of third party models.

We do not believe this is sufficient, particularly in the face of increasingly prevalent typosquatting attacks (see those of pip and npm). Moreover, a supply chain attack could very easily inject malicious code into a legitimate model, even though the associated source code appears benign. The only way to detect such an attack would be to manually inspect the model using a tool like Fickling.

Conclusions

As ML continues to grow in popularity and the majority of practitioners rely on generalized frameworks, we must ensure the frameworks are secure. Many users do not have a background in computer science, let alone computer security, and may not understand the dangers of trusting model files of unknown provenance. Moving away from pickling as a form of data serialization is relatively straightforward for most frameworks and is an easy win for security. We relish the thought of a day when pickling will no longer be used to deserialize untrusted files. In the meantime, try out Fickling and let us know how you use it!

Acknowledgements

Many thanks go out to our team for their hard work on this effort: Sonya Schriner, Sina Pilehchiha, Jim Miller, Suha S. Hussain, Carson Harmon, Josselin Feist, and Trent Brunson.


† Some libraries, like TensorFlow, do have the capability to verify download checksums; however, verification is disabled by default and is based on a checksum embedded in the filename, which can be easily forged.

The Tao of Continuous Integration

By Paul Kehrer

It is a truism in modern software development that a robust continuous integration (CI) system is necessary. But many projects suffer from CI that feels brittle, frustrates developers, and actively impedes development velocity. Why is this? What can you do to avoid the common CI pitfalls?

Continuous Integration Needs a Purpose

CI is supposed to provide additional assurance that a project’s code is correct. However, the tests a developer writes to verify the expected functionality are at their least useful when they are initially written. This is perhaps counterintuitive because a developer’s greatest familiarity comes when they initially write the code. They’ve thought about it from numerous angles and considered many of the possible edge cases and have implemented something that works pretty well!

Unfortunately, writing code is the easiest part of programming. The real challenge is building code that others can read so your project can thrive for many years. Software entropy increases over time. Developers—especially ones not familiar with large, long-term codebases—can’t anticipate how their code may be integrated, refactored, and repurposed to accommodate needs beyond those originally considered.

When these sorts of refactors and expansions occur, tests are the only way changes can be made confidently. So why do developers end up with systems that lack quality testing?

Trivial Testing

When writing tests, especially for high code-coverage metrics, the most common complaint is that some tests are trivial and exercise nothing interesting or error-prone in the codebase. These complaints are valid when thinking about the code as it exists today, but now consider that the software could be repurposed from its original intention. What once was trivial might now be subtle. Failing to test trivial cases may lead your work into a labyrinth of hidden traps rooted in unobservable behavior.
Remember these three things:

  1. No test is trivial in the long run.
  2. Tests are documentation of expected behavior.
  3. Untested code is subject to incidental behavioral change.

Unreliable CI

Unreliable CI is poison for developers. For internal projects, it saps productivity and makes people hate working on it. And for open-source projects, it drives away contributors faster than they can arrive.

Find what’s causing your tests to be unreliable and fix it. Unreliable CI commonly manifests as flaky tests, and tools exist to mark tests as flaky until you can find the root cause. This will allow immediate improvement in your CI without crippling the team.
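
For example, assuming the pytest-rerunfailures plugin (one of several tools for this), a known-flaky test can be quarantined so it reruns a few times before failing the build while the root cause is tracked:

import random
import pytest

@pytest.mark.flaky(reruns=3)   # quarantined: remove once the root cause is fixed
def test_worker_eventually_reports_done():
    # Stand-in for a racy check against a real system.
    assert random.random() < 0.9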

Slow CI

You may find yourself with an excessively long CI cycle time. This is problematic because a quality development process requires that all CI jobs pass. If the cycle is so long and complex that it’s impractical to run locally, developers will create workarounds. These workarounds take many forms, but it’s most common to see PR sizes balloon: no one wants to put in a 2-line PR, wait an hour for it to merge, and then rebase their 300-line PR on top of it when they can just make a few unrelated changes in a single PR. This causes problems for code reviewers and lowers the quality of the project.

Developers aren’t wrong to do this; CI has failed them. When building CI systems, it’s important to keep a latency budget in mind that goes something like, “CI should never be slower than time t, where t is chosen a priori.” If CI becomes slower than that, effort is spent to improve it, even if that encroaches on the development of new features.

Coverage is difficult

Part of responsible testing is knowing which lines of code your tests are exercising—a nice, simple number that tells you everything. So why is coverage so commonly ignored?

First, the technical challenge. Modern software runs against many disparate targets. To be useful, CI systems should run against numerous targets that submit data to a hosted system that can combine coverage. (The frustration of tools like this failing, and how to maintain development velocity despite all software being hot garbage, is another discussion.) Absent this, developers often fail to notice missed coverage, as it becomes lost in the noise of “expected” missed lines.

Now let’s talk about the social challenges. Software is typically written in a way that makes it difficult to test small pieces of functionality. This issue gave rise to the test-driven development (TDD) trend, where tests are written first to help developers factor their code in a testable manner. This is generally a net win in readability and testability but requires discipline and a different approach to development that doesn’t come naturally to many people.

The perceived drudgery of making more of a codebase testable causes complaints that coverage is an imperfect metric. After all, not all code branches are created equal, and depending on your language, some code paths should never be exercised. These are not good reasons to dismiss coverage as a valuable metric, but on specific occasions there may be a compelling reason not to spend the effort covering something with tests. Be aware, however, that if you fail to cover a certain piece of code with tests, its behavior is no longer part of the contract future developers will uphold during refactoring.

What do we do?

So how do we get to CI nirvana given all these obstacles? Incrementally. An existing project is a valuable asset, and we want to preserve what we have while increasing our ability to improve it in the future. (Rewrites are almost universally a bad idea.) This necessitates a graduated approach that, while specifically customized to a given project, has a broad recipe:

    1. Make CI reliable
    2. Speed up CI
    3. Improve test quality
    4. Improve coverage

We should all spend time investing in the longevity of our projects. This sort of foundational effort pays rapid dividends and ensures that your software projects can be world-class.

Serving up zero-knowledge proofs

By Jim Miller, Senior Cryptography Analyst

Zero-knowledge (ZK) proofs are gaining popularity, and exciting new applications for this technology are emerging, particularly in the blockchain space. So we’d like to shine a spotlight on an interesting source of implementation bugs that we’ve seen: the Fiat-Shamir transformation.

A ZK proof can be either interactive, where the prover and verifier communicate via challenges in a multi-step process, or non-interactive, where a prover computes a proof once and sends it to the verifier. The non-interactive ZK proof is preferred over the multi-step interactive process, but most ZK schemes are interactive by default.

Enter the Fiat-Shamir transformation, which turns interactive ZK proofs into non-interactive ones. That is easier said than done: the transformation is tricky to implement correctly and has led to several bugs, including one discovered in a Swiss voting system.

Here, we will use a tennis analogy to walk you through what this transformation is, and then we’ll show some examples of how it goes wrong in practice.

ZK Crash Course

Zero-knowledge proofs allow you (the prover) to prove to another party (the verifier) that you know some secret value without having to reveal the value itself. This concept can be applied in a variety of ways. For example, you give the verifier some value y, and you can prove that you possess an input value x such that y = sha256(x), without having to reveal x.

Or, as we’ve seen in the blockchain space, you want to spend some money without revealing how much you are spending, and ZK proofs can prove to the verifier that you possess enough money to spend without revealing how much you have or spent.

Now let’s talk about what security looks like for ZK proofs. They have a security notion known as soundness, which has a formal and well-defined technical definition. Simply put, a ZK proof is sound if a malicious prover can’t create fake proofs that appear to be valid. In other words, the verifier can’t be tricked with a bad proof.

Tennis analogy

Let’s visualize ZK proofs and the Fiat-Shamir transformation with a tennis analogy. The prover and verifier are tennis players standing on opposite sides of the net. The prover is trying to prove to the verifier that they are good at tennis. (“Good” is used here in a very basic sense and specific to this example.)

(1) The prover hits the ball to the verifier, and (2) the verifier sends a random return back to the prover—random speed, location, amount of spin, etc.—and (3) the prover must return the ball to a targeted spot on the court. If the prover hits the target, the verifier classifies them as a good tennis player. If they do not hit the mark, they are classified as a bad tennis player.

In ZK terms, being classified as “good at tennis” corresponds to the verifier accepting the ZK proof as valid. Being classified as “bad at tennis” corresponds to the verifier rejecting the proof as invalid.

Now assume that this tennis game is a sound ZK scheme. Recalling our definition earlier, this means that a bad tennis player cannot convince the verifier they are good. If the prover is actually bad, they are unable to hit the tennis ball into the target. However, and this is important, this game is sound only if the verifier sends the tennis ball back in a truly random way.

Imagine the prover realizes that the verifier returns the tennis ball to the same spot, with the same speed and spin, every time. Maybe the prover is bad at tennis overall but has practiced hitting the ball from that exact location.

Even though they cannot hit the target from any location, they can hit the target from that exact spot and trick the verifier. Therefore, bad randomness can break the soundness of this scheme as well as real ZK schemes.

Back to ZK

This tennis analogy is a common challenge/response protocol seen in various areas of cryptography. In the ZK space, these are known as sigma protocols, which have three steps:

  1. The prover first sends an initial message, the commitment message. (The prover first hits the ball to the verifier in our tennis analogy.)
  2. The verifier replies with a truly random message, the challenge message. (The verifier returns the tennis ball to a random spot with random speed and spin.)
  3. The prover sends a message that directly depends on the challenge message, the “response” message, and the verifier then accepts this as valid or invalid. (The prover moves to hit the return of the verifier and attempts to hit the target.)

It turns out that many ZK schemes take the form of one of these sigma protocols, which unfortunately makes them interactive. Luckily, the Fiat-Shamir transformation can transform these sigma protocols into non-interactive protocols.

First, I’ll explain the transformation using our tennis example. Sometimes players practice by hitting the tennis ball off of a wall to simulate another player’s returns. This is like the Fiat-Shamir transformation. Instead of relying on the verifier sending a return, we replace the verifier with a wall that returns the ball for us. And for this scheme to be sound, this return has to be completely random. Therefore (stretch your imagination here), we need a special wall that returns the ball with a completely random ball speed, placement, and spin. Then, the prover sends the ball back as before, and if they hit the target, they are a good player.

So how do we find such a wall for ZK proofs in practice? We use a hash function. Instead of sending their initial message to the verifier and waiting for a random response, the prover inputs their message into a hash function and uses the output as the challenge message. Using this challenge, the prover executes step 3 and sends the resulting response message to the verifier as its proof. The verifier either accepts or rejects it in a non-interactive process.

Security of Fiat-Shamir implementations

Is this secure? Yes, mostly. First of all, we have to assume that our hash function returns completely random outputs. This is known as modeling the hash function as a random oracle and is somewhat of a controversial topic, but it has been used in various applications.

The other crucial and more subtle aspect is determining what input the hash function should receive. Using our tennis example, let’s say we’re using a wall to return the tennis ball, but for some reason, the wall’s internal random generation is only dependent on the speed and spin, but not on the location. This means that the prover will get the same challenge (i.e., the ball will be returned to the same spot on the court) from different locations, as long as the speed and spin are identical. These challenges are no longer truly random, allowing the prover to break the scheme’s soundness.

In the theoretical world, this is referred to as the weak and strong Fiat-Shamir transformations. Essentially, a small difference in inputs into the hash function can have very subtle but severe consequences to the protocol’s security. The theoretical details are a bit out of scope for this blog post, but if you are curious, you can read more about them here.

Going wrong in practice

Let’s look at a very simple sigma protocol example—the Schnorr protocol. In the Schnorr protocol, the prover proves to the verifier that they know the value X such that Y = g^X, for some group generator g. For those of you familiar with cryptography, you’ll know this is the common private/public key structure for things like ECC, El Gamal, etc. Therefore, this allows the prover to prove that they hold the secret key, X, corresponding to their public key, Y. The protocol works as follows:

  • The prover generates a random value A, and sends B = g^A to the verifier
  • The verifier replies with a random challenge value C
  • The prover sends Z = A + CX to the verifier as its proof
  • The verifier verifies the proof by checking that g^Z = B·Y^C

After doing some arithmetic, you can see that the verification check works, and this protocol is secure because the discrete log problem is assumed to be hard.
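To make that arithmetic concrete, here is a minimal Python sketch of the interactive protocol over a toy group (the parameters p, q, and g below are illustrative placeholders; a real deployment would use a large, standardized group):

import secrets

# Toy parameters for illustration only; real deployments use large groups.
p, q, g = 23, 11, 4   # g generates a subgroup of prime order q modulo p

def keygen():
    x = secrets.randbelow(q)            # secret key X
    return x, pow(g, x, p)              # public key Y = g^X mod p

def prover_commit():
    a = secrets.randbelow(q)            # random nonce A
    return a, pow(g, a, p)              # commitment B = g^A mod p

def prover_respond(a, x, c):
    return (a + c * x) % q              # response Z = A + C*X mod q

def verify(y, b, c, z):
    return pow(g, z, p) == (b * pow(y, c, p)) % p   # check g^Z == B * Y^C mod p

x, y = keygen()
a, b = prover_commit()                  # step 1: prover sends B
c = secrets.randbelow(q)                # step 2: verifier sends random challenge C
z = prover_respond(a, x, c)             # step 3: prover sends Z
assert verify(y, b, c, z)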

Since this is a sigma protocol, we can make this non-interactive with our new favorite transformation, Fiat-Shamir:

  • The prover generates a random value A, computes B = g^A
  • The prover generates C randomly with a hash function
  • The prover sends Z = A + CX to the verifier as its proof
  • The verifier verifies the proof by checking that g^Z = B·Y^C

But how exactly should the prover generate this random value, C? In the interactive protocol, the prover sends B to the verifier, so it might make sense to compute C = Hash(B). This is actually what theorists define to be the weak Fiat-Shamir transformation. As you might suspect, this has some subtle consequences. In particular, by using this computation, it’s actually possible for an adversary to provide a valid proof for some public key even if they don’t know the secret key:

  • Let PK’ be some public key that we don’t know the secret key for
  • Set B = PK’, C = Hash(B)
  • Pick a random Z
  • Set PK = (g^Z / PK’)^(1/C)

After some more arithmetic, you can see that Z will verify as a valid proof for PK. But since we don’t know the secret key for PK’, we don’t know the secret key for PK either! This is a forged proof. Depending on the exact application, this could be very problematic because it allows you to create fake proofs for keys related to another party’s public key, PK’.

This problem is avoided by adjusting how you compute C. Setting C = Hash(B, PK) or C = Hash(B, PK, g) avoids these issues entirely. These variants are defined to be the strong Fiat-Shamir transformation.
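To see the difference concretely, here is a hedged Python sketch over the same toy group as before (the H helper is our stand-in for the hash function, and the sketch needs Python 3.8+ for modular inverses via pow(x, -1, m)). It shows the weak transform accepting a forged proof derived from someone else’s public key, while the strong transform rejects it:

import hashlib, secrets

p, q, g = 23, 11, 4                          # toy group, for illustration only

def H(*values):
    # Hash arbitrary values to a nonzero challenge modulo q (hypothetical helper).
    data = b"|".join(str(v).encode() for v in values)
    return int.from_bytes(hashlib.sha256(data).digest(), "big") % q or 1

def verify_weak(pk, b, z):                   # weak Fiat-Shamir: C = Hash(B)
    c = H(b)
    return pow(g, z, p) == (b * pow(pk, c, p)) % p

def verify_strong(pk, b, z):                 # strong Fiat-Shamir: C = Hash(B, PK, g)
    c = H(b, pk, g)
    return pow(g, z, p) == (b * pow(pk, c, p)) % p

# Forgery against the weak transform, without knowing any secret key:
pk_other = pow(g, secrets.randbelow(q), p)   # PK': someone else's public key
b = pk_other                                 # set B = PK'
z = secrets.randbelow(q)                     # pick a random Z
c = H(b)                                     # challenge depends only on B
pk_forged = pow(pow(g, z, p) * pow(pk_other, -1, p) % p,
                pow(c, -1, q), p)            # PK = (g^Z / PK')^(1/C)

print(verify_weak(pk_forged, b, z))          # True: the forged proof is accepted
print(verify_strong(pk_forged, b, z))        # almost certainly False in this toy group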

A very similar issue was discovered in a Swiss voting system. In this system, one of the design goals is to allow for verifiable election results. To do this, a ZK scheme produces a decryption proof, which proves that an encrypted vote decrypts to the correct result. However, when implementing this scheme, only part of the prover’s input was given to the hash function (i.e., the strong Fiat-Shamir transformation was not used), and the prover was able to create false proofs. In this case, valid votes could be altered in such a way that they would not be counted, which has obvious implications for an election using such a scheme.

Conclusion

In summary, the Fiat-Shamir transformation converts an interactive ZK proof into a non-interactive one by computing a hash function on the prover’s inputs. In practice, this can lead to very subtle but dangerous bugs. When in doubt, use the strong Fiat-Shamir transformation and make sure every piece of the prover’s input is placed inside of the hash function!

Confessions of a smart contract paper reviewer

If you’re thinking of writing a paper describing an exciting novel approach to smart contract analysis and want to know what reviewers will be looking for, you’ve come to the right place. Deadlines for many big conferences (ISSTA tool papers, ASE, FSE, etc.) are approaching, as is our own Workshop on Smart Contract Analysis, so we’d like to share a few pro tips. Even if you’re not writing a smart contract paper, this post can also help you identify research that’s worth reading and understand its impact.

I’ve been reviewing smart contract analysis papers for a few years now—over 25 papers in seven different venues in the last year—and I’ve taken away six requirements for a good paper. I’d also like to share a little about extra points that take a paper from “good enough” to “great.”

  1. Explain what the analysis method actually does! Some authors fail to describe how a proposed analysis algorithm works, perhaps due to page limits. It’s not essential to describe every painstaking detail of an algorithm or implementation, but a good paper does more than describe a method with high-level buzzwords. “We combine symbolic execution with swarm testing” is a great sentence for a paper’s abstract, but this level of granularity cannot be sustained throughout the paper. Reviewers see this as hand-wavy. Provide details for your audience to understand what you actually are proposing and evaluate it. For a tool paper—a short paper that basically advertises a particular tool exists—a generic explanation is sometimes fine. Still, such uninformative descriptions appear surprisingly often in full conference submissions, which are supposed to allow readers to understand and even duplicate interesting new approaches.
  2. Understand the basics of blockchain and smart contracts. Too many papers are rejected for making obvious mistakes about how a contract or the blockchain works. This kind of problem is often foreshadowed by an introduction that includes boilerplate text describing blockchains and smart contracts. Smart contract and blockchain analyses are, in some ways, pure instances of core code analysis and test generation problems. However, if you’re still in the early stages of researching this topic, a minimum amount of homework is required before you can produce credible results. We recommend going through the Ethernaut CTF exercises to understand some basics of contract exploitation and then reading our Building Secure Contracts tutorials, including experiments with real tools used in paid audits. Blockchains and smart contracts are fast-moving targets, and many early papers concentrated on handling ether. However, much of the modern contracts’ financial value is in ERC-20 or other recently developed token types. If you’re only looking at ether-related exploits, you’re not addressing much of what’s going on now.
  3. Base experimental results on meaningful sets of contracts. Out of all the contracts ever created on the Ethereum blockchain, only a small fraction accounts for almost all transactions and ether/token activity. Most Etherscan contracts have little practical use and are toys deployed by programmers learning Solidity. If your experiments are based on randomly selected contracts from Etherscan, your results will not reflect contracts of interest, and many are likely to be near-duplicates. A random sampling of contracts is a red flag for reviewers because the data set is noisy and may fail to include any contracts anyone actually cares about. Instead, base your experiments on active contracts that have participated in transactions, held ether or token value, or satisfied other criteria demonstrating that they’re meaningful. It also shows good judgment to include some diversity in the contract set and demonstrate that you aren’t, say, basing your results on 30 basic ERC-20 tokens with nearly identical implementations. Moreover, the fast-moving state of the art in Ethereum matters here, too. These days a lot of the action is not in single contracts but in multi-contract systems, where analysis based on those contracts’ composition is necessary to explore meaningful behavior. The same guidance goes for demonstrating a method for finding vulnerabilities. Finding meaningless vulnerabilities in contracts that hold no ether or token value and never participate in transactions isn’t compelling. On the other hand, there are real vulnerabilities in real contracts that participate in numerous transactions. Find those, and you have a good demonstration of your ideas! Google BigQuery is one way to get started, since this data can be difficult to extract from the blockchain directly.
  4. More generally, respect the rules of (fuzzing) research. Our post on performing good fuzzing research mostly applies to fuzzing smart contracts, too. Smart contract fuzzers may not be expected to run for 24 hours, but it’s certainly essential to run tools “long enough.” You need statistics-based evidence, not a string of anecdotes. If your contract fuzzer performed better, did it do so by a statistically significant margin? What’s the estimated effect size, and how confident can we be in that estimate’s quality? Other points of good fuzzing experimental practice are just common sense but easily overlooked: for example, if you don’t use a consistent version of the Solidity compiler in your experiments or fail to report what version you used, reproducing (and understanding) your results will be complicated. Two particular aspects of these general guidelines are essential for smart contract fuzzing papers, so we’ll separate those out.
  5. Compare against a meaningful tool baseline. You need to compare your work with widely accepted concepts and tools. Pit your tool against real competition, which may require more effort in smart contract work. People are always releasing new tools and updating old ones, and some older tools no longer work. By the time your paper is reviewed, the cutting edge may have moved. Still, it must be obvious that you went to the trouble of selecting a reasonable set of comparable tools and comparing your work against the state of the art as it stood when you wrote the paper.
  6. Draw clear boundaries explaining what your tool does and does not detect. Smart contract fuzzers report a variety of bugs, but there are no Solidity “crashes.” So tools have to look for something, whether it be an integer overflow, reentrancy, locked funds, or runaway gas usage indicating a possible denial-of-service vulnerability. Ultimately, this means that tools may excel in some areas and lag behind in others. One fuzzer may use an aggressive set of oracles, which can lead to false positives, while another may report only a narrower set of bugs, lowering its false-positive rate. Comparing apples to apples can be hard in this context, but you must show that your method finds meaningful bugs. One way to do this in fuzzing is to compare code coverage results between your tool and others.

We hope this advice helps strengthen your approach to publishing your smart contract research. In summary, you can almost guarantee that if I review your paper, and I can’t figure out what your method even is, I’d reject your paper. If you clearly didn’t do any homework on smart contracts beyond reading the Wikipedia page on Ethereum, I’d reject your paper. If you based your experiments on 50 random contracts on the blockchain that have received 10 transactions after deployment, hold a total of $0.05 worth of Ether, and are mostly duplicates of each other, I’d reject your paper. If you don’t understand the basic rules of fuzzing research, if you only compare to one outdated academic research tool, and ignore five popular open-source tools, if you claim your approach is better simply because you have a tendency to produce more false positives based on a very generous notion of “bug”… well, you can guess!

The good news is, doing all of the things this post suggests is not just part of satisfying reviewers. It’s part of satisfying yourself and your future readers (and potential tool users), and it’s essential to building a better world for smart contract developers.

Finally, to take a paper from “good” to “great,” tell me something about how you came up with the core idea of the paper and what the larger implications of the idea working might be. That is, there’s some reason, other than sheer luck, why this approach is better at finding bugs in smart contracts. What does the method’s success tell us about the nature of smart contracts or the larger problem of generating tests that aren’t just a sequence of bytes or input values but a structured sequence of function calls? How can I improve my understanding of smart contracts or testing, in general, based on your work? I look forward to reading about your research. We’d love to see your smart contract analysis papers at this year’s edition of WoSCA to be co-located with ISSTA in July!

PDF is Broken: a justCTF Challenge

Trail of Bits sponsored the recent justCTF competition, and our engineers helped craft several of the challenges, including D0cker, Go-fs, Pinata, Oracles, and 25519. In this post we’re going to cover another of our challenges, titled PDF is broken, and so is this file. It demonstrates some of the PDF file format’s idiosyncrasies in a bit of an unusual steganographic puzzle. CTF challenges that amount to finding a steganographic needle in a haystack are rarely enlightening, let alone enjoyable. LiveOverflow recently posted an excellent video on file format tricks that concludes with a similar sentiment. Therefore, we designed this challenge to teach justCTF participants some PDF tricks and how Trail of Bits’ open source tools can make easy work of these forensic challenges.

In which a PDF is a webserver, serving copies of itself

The PDF file in the challenge is in fact broken, but most PDF viewers will simply render it as a blank page with no complaints. The file command reports the challenge as just being “data.” Opening the file in a hex editor, we see that it looks like a Ruby script:

require 'json'
require 'cgi'
require 'socket'
=begin
%PDF-1.5
%ÐÔÅØ
% `file` sometimes lies
% and `readelf -p .note` might be useful later

The PDF header on line 5 is embedded within a Ruby multi-line comment that begins on line 4, but that’s not the part that’s broken! Almost all PDF viewers will ignore everything before the %PDF-1.5 header. Lines 7 and 8 are PDF comments affirming what we saw from the file command, as well as a readelf hint that we’ll get to later.

The remainder of the Ruby script is embedded within a PDF object stream—the “9999 0 obj” line—which can contain arbitrary data that PDF viewers ignore. But what of the remainder of the PDF? How does that not affect the Ruby script?

9999 0 obj
<< /Length 1680 >>^Fstream
=end
port = 8080
if ARGV.length > 0 then
  port = ARGV[0].to_i
end
html=DATA.read().encode('UTF-8', 'binary', :invalid => :replace, :undef => :replace).split(/<\/html>/)[0]+"\n"
v=TCPServer.new('',port)
print "Server running at http://localhost:#{port}/\nTo listen on a different port, re-run with the desired port as a command-line argument.\n\n"
⋮
__END__

Ruby has a feature where the lexer will halt on the __END__ keyword and effectively ignore everything thereafter. Sure enough, this curious PDF has such a symbol, followed by the end of the encapsulating PDF object stream and the remainder of the PDF.

This is a Ruby/PDF polyglot, and you can turn any PDF into such a polyglot using a similar method. If your script is short enough, you don’t even need to embed it in a PDF stream object: you can just prepend all of it before the %PDF-1.5 header, although some PDF parsers will complain if the header is not found within the first 1024 bytes of the file.
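As a rough illustration of the prepend approach (not the exact recipe used for this challenge), here is a Python sketch, assuming a hypothetical script.rb that ends with Ruby’s __END__ marker so the interpreter ignores the PDF bytes that follow, and that is short enough to keep the %PDF header within the first 1024 bytes:

# Sketch only: build a Ruby/PDF polyglot by prepending a short Ruby script.
with open("script.rb", "rb") as f:
    script = f.read()          # must end with __END__
with open("original.pdf", "rb") as f:
    pdf = f.read()             # any valid PDF

with open("polyglot.pdf", "wb") as out:
    # Ruby stops parsing at __END__; most PDF viewers skip everything before %PDF-1.5.
    out.write(script + b"\n" + pdf)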

You didn’t think it would be that easy, did you?

So let’s be brave and try running the PDF as if it were a Ruby script. Sure enough, it runs a webserver that serves a webpage with a download link for “flag.zip.” Wow, that was easy, right? Inspect the Ruby script further and you’ll see that the download is the PDF file itself renamed as a .zip. Yes, in addition to being a Ruby script, this PDF is also a valid ZIP file. PoC||GTFO has used this trick for years, which can also be observed by running binwalk -e on the challenge PDF.

Unzipping the PDF produces two files: a MuPDF mutool binary and false_flag.md, the latter suggesting the player run the broken PDF through the mutool binary.

Clearly, this version of mutool was modified to render the broken PDF properly, despite whatever is “broken” about it. Is the CTF player supposed to reverse engineer the binary to figure out what was modified? If someone tried, or if they tried the readelf clue embedded as a PDF comment above, they might notice this:

The first thing you should do is: Open the PDF in a hex editor. You’ll probably need to “fix” the PDF so it can be parsed by a vanilla PDF reader. You could reverse this binary to figure out how to do that, but it’s probably easier to use it to render the PDF, follow the clues, and compare the raw PDF objects to those of a “regular” PDF. You might just be able to repair it with `bbe`!

The Binary Block Editor (bbe) is a sed-like utility for editing binary sequences. This implies that whatever is causing the PDF to render as a blank page can easily be fixed with a binary regex.

Deeper Down the Hole

When we use the modified version of mutool to render the PDF, it results in this ostensibly meaningless memetic montage:

The rendered PDF using the modified version of mutool

The “broken” challenge PDF rendered using the modified version of mutool.

Searching Google for the LMGTFY string will take you to Didier Stevens’ excellent article describing the PDF stream format in detail, including how PDF objects are numbered and versioned. One important factor is that two PDF objects can have the same number but different versions.

The first hint on the page identifies PDF object 1337, so that is probably important. The figures in Stevens’ article alone, juxtaposed to a hexdump of the broken PDF’s stream objects, provide a clear depiction of what was changed.

Annotated diagram of the parts of a PDF stream object

Didier Stevens’ annotated diagram of a PDF stream object.

5 0 obj
<< /Length 100 >>^Fstream
⋮
endstream
endobj

A PDF stream object in the challenge PDF.

As the hints suggest, the PDF specification only allows for six whitespace characters: \0, \t, \n, \f, \r, and space. The version of mutool in the ZIP was modified to also allow ACK (0x06) to be used as a seventh whitespace character! Sure enough, on the twelfth line of the file we see:

>>^Fstream

That “^F” is an ACK character, where the PDF specification says there should be whitespace! All of the PDF object streams are similarly broken. This can be fixed with:

bbe -e "s/\x06stream\n/\nstream\n/" -o challenge_fixed.pdf challenge.pdf

Solving the Puzzle

Is fixing the file strictly necessary to solve the challenge? No: the flag can be found in PDF object 0x1337 (4919 in decimal) using a hex editor

4919 0 obj
<< /Length 100 /Filter /FlateDecode >>^Fstream
x<9c>^MËA^N@0^PFá}OñëÆÊÊ        <88>X;^Ba<9a>N<8c>N£#áöº~ßs<99>s^ONÅ6^Qd<95>/°<90>^[¤(öHû       }^L^V   k×E»d<85>fcM<8d>^[køôië<97><88>^N<98> ^G~}Õ\°L3^BßÅ^Z÷^CÛ<85>!Û
endstream
endobj
4919 1 obj
<< /Length 89827 /Filter [/FlateDecode /ASCIIHexDecode /DCTDecode] >>^Fstream
…
endstream
endobj

and manually decoding the stream contents. Binwalk will even automatically decode the first stream because it can decode the Flate compression. That contains:

pip3 install polyfile
Also check out the `--html` option!
But you’ll need to “fix” this PDF first!

Binwalk doesn’t automatically expand the second stream because it’s also encoded with the ASCIIHex and DCT PDF filters. A casual observer who had not followed all of the clues and wasn’t yet familiar with the PDF specification might not realize that the second version of PDF stream object 0x1337 even exists! And that’s the one with the flag. Sure, it’s possible to comb through the dozens of files extracted by binwalk and manually decode the flag, or even decode it directly from the stream content in a hex editor with a quick implementation of PDF’s filters. But why do that when PolyFile can do it for you?

polyfile challenge_fixed.pdf --html challenge_fixed.html

PolyFile’s HTML output for the challenge PDF.

Oh, hey, that’s a hierarchical representation of the PDF objects, with an interactive hex viewer! How about we go to object 0x1337’s stream?

We immediately see PDF object 0x1337.

PolyFile can automatically decode the objects.

And finally, let’s look at the second version of object 0x1337, containing the multi-encoded flag:

PolyFile automatically decodes the layered PDF filters to produce the flag.

The flag!

Conclusions

PDF is a very … flexible file format. Just because a PDF looks broken, it doesn’t mean it is. And just because a PDF is broken, it doesn’t mean PDF viewers will tell you it is. PDF is at its core a container format that lets you encode arbitrary binary blobs that don’t even have to contribute to the document’s rendering. And those blobs can be stacked with an arbitrary number of encodings, some of which are bespoke features of PDF. If this is interesting to you, check out our talk on The Treachery of Files, as well as our tools for taming them, such as Polyfile and PolyTracker.

Breaking Aave Upgradeability

On December 3rd, Aave deployed version 2 of their codebase. While we were not hired to look at the code, we briefly reviewed it the following day. We quickly discovered a vulnerability that affected versions 1 and 2 of the live contracts and reported the issue. Within an hour of sending our analysis to Aave, their team mitigated the vulnerability in the deployed contracts. If exploited, the issue would have broken Aave, and impacted funds in external DeFi contracts.

Five different security firms reviewed the Aave codebase, including some that used formal verification; however, this bug went unnoticed. This post describes the issue, how the bug escaped detection, and other lessons learned. We are also open-sourcing a new Slither detector that identifies this vulnerability to increase security in the larger Ethereum community.

The vulnerability

Aave uses the delegatecall proxy pattern that we have thoroughly discussed in past writeups on this blog. At a high-level, each component is split into two contracts: 1) A logic contract that holds the implementation and 2) a proxy that contains the data and uses delegatecall to interact with the logic contract. Users interact with the proxy while the code is executed on the logic contract. Here’s a simplified representation of the delegatecall proxy pattern:

In Aave, LendingPool (LendingPool.sol) is an upgradeable component that uses a delegatecall proxy.

The vulnerability we discovered relies on two features in these contracts:

  • Functions on the logic contract can be called directly, including initialization functions
  • The lending pool has its own delegatecall capabilities

Initializing upgradeable contracts

A limitation of this upgradeability pattern is that the proxy cannot rely on the logic contract’s constructor for initialization. Therefore, the initialization of state variables and other setup must be performed in public initialization functions, which do not benefit from the constructor’s safeguards.

In LendingPool, the initialize function sets the provider address (_addressesProvider):


function initialize(ILendingPoolAddressesProvider provider) public initializer {
  _addressesProvider = provider;
}

LendingPool.sol#L90-L92

The initializer modifier prevents initialize from being called multiple times. It requires that the following condition is true:

   require(
      initializing || isConstructor() || revision > lastInitializedRevision,
      'Contract instance has already been initialized'
    );

VersionedInitializable.sol#L32-L50

Here:

  1. initializing allows multiple calls to the modifier within the same transaction (hence multiple initialize functions)
  2. isConstructor() is what allows the proxy to execute the initialization code during its own construction
  3. revision > lastInitializedRevision allows calling the initialization functions again when the contract is upgraded

While this works as expected through the proxy, condition 3 also allows anyone to call initialize directly on the logic contract itself once the logic contract is deployed:

The bug: Anyone can set _addressesProvider on the LendingPool logic contract.

Arbitrary delegatecall

LendingPool.liquidationCall delegatecalls directly to the addresses returned by _addressesProvider:

   address collateralManager = _addressesProvider.getLendingPoolCollateralManager();

    //solium-disable-next-line
    (bool success, bytes memory result) =
      collateralManager.delegatecall(
        abi.encodeWithSignature(
          'liquidationCall(address,address,address,uint256,bool)',
          collateralAsset,
          debtAsset,
          user,
          debtToCover,
          receiveAToken
        )
      );

LendingPool.sol#L424-L450

This lets anyone initialize the LendingPool logic contract, set a controlled addresses provider, and execute arbitrary code, including selfdestruct.

The exploit scenario: Anyone can destruct the lending pool logic contract. Here’s a simplified visual representation:

Lack-of-existence check

By itself, this issue is already severe since anyone can destruct the logic contract and prevent the proxy from executing the lending pool code (pour one out for Parity).

However, the severity of this issue is amplified by the use of OpenZeppelin for the proxy contract. Our 2018 blogpost highlighted that a delegatecall to a contract without code would return success without executing any code. Despite our initial warning, OpenZeppelin did not fix the fallback function in their proxy contract:

   function _delegate(address implementation) internal {
        //solium-disable-next-line
        assembly {
            // Copy msg.data. We take full control of memory in this inline assembly
            // block because it will not return to Solidity code. We overwrite the
            // Solidity scratch pad at memory position 0.
            calldatacopy(0, 0, calldatasize)

            // Call the implementation.
            // out and outsize are 0 because we don't know the size yet.
            let result := delegatecall(gas, implementation, 0, calldatasize, 0, 0)

Proxy.sol#L30-L54

If the proxy delegatecalls to a destroyed lending pool logic contract, the proxy will return success, while no code was executed.

This exploit would not be persistent since Aave can update the proxy to point to another logic contract. But in the timeframe where the issue could be exploited, any third-party contracts calling the lending pool would act as if some code was executed when it was not. This would break the underlying logic of many external contracts.

Affected contracts

  • All ATokens (Aave tokens): AToken.redeem calls pool.redeemUnderlying (AToken.sol#L255-L260). As the call does nothing, the users would burn their ATokens without receiving back their underlying tokens.
  • WETHGateway (WETHGateway.sol#L103-L111): Deposits would be stored in the gateway, allowing anyone to steal the deposited assets.
  • Any codebase based on Aave’s Credit Delegation v2 (MyV2CreditDelegation.sol)

If the issue we discovered were exploited, many contracts outside of Aave would have been affected in various ways; the contracts listed above are only a few that we found. Determining a complete list is difficult, and we did not attempt to do so. This incident highlights the underlying risks of DeFi composability.

Fixes and recommendations

Luckily, no one abused this issue before we reported it. Aave called the initialize functions on both versions of the lending pool and thus secured the contracts:

Long term, contract owners should:

  • Add a constructor in all logic contracts to disable the initialize function
  • Check for the existence of a contract in the delegatecall proxy fallback function
  • Carefully review delegatecall pitfalls and use slither-check-upgradeability

Formally verified contracts are not bulletproof

Aave’s codebase is “formally verified.” A trend in the blockchain space is to think that safety properties are the holy grail of security. Users might try to rank the safety of various contracts based on the presence or absence of such properties. We believe this is dangerous and can lead to a false sense of security.

The Aave formal verification report lists properties on the LendingPool view functions (e.g., they don’t have side effects) and the pool operations (e.g., the operations return true when successful and do not revert). For example, one of the verified properties is:

Yet this property can be broken if the logic contract is destroyed. So how could this have been verified? While we don’t have access to the theorem prover or the setup used, the proofs likely don’t account for upgradeability, or the prover does not support complex contract interactions.

This is common for code verification. You can prove behaviors in a targeted component with assumptions about the overall behavior. But proving properties in a multi-contract setup is challenging and time-consuming, so a tradeoff must be made.

Formal techniques are great, but users must be aware that they cover small areas and might miss attack vectors. On the other hand, automated tools and human review can help the developers reach higher overall confidence in the codebase with fewer resources. Understanding the benefits and limits of each solution is crucial for developers and users. The current issue is a good example: Slither can find this issue in a few seconds, and a trained expert may quickly point it out, but it would take substantial effort to detect it with safety properties.

Conclusion

Aave reacted positively and quickly fixed the bug once they were aware of the issue. Crisis averted. But other victims of recent hacks were not as fortunate. Before deploying code and exposing it to an adversarial environment, we recommend that developers follow the recommendations above.

We hope to prevent similar mistakes by sharing this post and the associated Slither detector for this issue. However, security is a never-ending process, and developers should contact us for security reviews before launching their projects.

Reverie: An optimized zero-knowledge proof system

Zero-knowledge proofs, once a theoretical curiosity, have recently seen widespread deployment in blockchain systems such as Zcash and Monero. However, most blockchain applications of ZK proofs make proof size and performance tradeoffs that are a poor fit for other use-cases. In particular, these protocols often require an elaborate trusted setup phase and optimize for proof size at the expense of prover time. Many would-be applications of ZK proofs are so complex that the ZK-SNARK protocol used by Zcash would take days (if not months) to run. With this in mind, I believe it is important to provide engineers who want to tinker with ZK proofs an alternative set of tradeoffs.

During my internship at Trail of Bits, I implemented Reverie, a ZK proof system that optimizes for prover efficiency and does not require any trusted setup. These optimizations come at the expense of proof size, but Reverie allows proofs to be streamed across a network and verified incrementally so as to avoid loading large proofs into memory all at once. Reverie is also available as reverie-zk on crates.io.

The rest of this blog post will explain the underlying ZK proof system behind Reverie and provide performance benchmarks. Since the proof system uses techniques from secure multiparty computation (MPC), I’ll start by going over some MPC basics and then showing how to build ZK proofs using MPC.

MPC Primer

The players have secret inputs X, Y, Z, respectively, and compute f(X, Y, Z) without revealing X, Y or Z.

The goal of a multi-party computation protocol is to allow the distributed computation of a function on inputs from different parties, without the parties revealing their inputs to one other.

For instance:

  1. Tallying a vote without revealing the values of individual votes.
  2. Computing the number of shared customers without revealing your customer databases.
  3. Training/evaluating a machine-learning model on private datasets without sharing confidential data.

Zero-knowledge proofs and multi-party computation have conceptual similarities, but there are crucial differences:

  1. MPC allows computation on private data, whereas zero-knowledge proofs only enable parties to prove properties about private data.
  2. MPC protocols are interactive (their non-interactive cousin is functional encryption), requiring the parties to be online. Zero-knowledge proofs can be either interactive or non-interactive.

In this post, we’ll show that these two primitives are indeed related by describing how to “compile” a multi-party computation protocol into a zero-knowledge proof system. This is a beautiful theoretical result: essentially, checking the output of a function inside a zero-knowledge proof will always be more efficient than computing the function using multi-party computation.

Additionally, it can also lead to very simple, concretely efficient proof systems in practice, with applications including practical post-quantum signature schemes. As we’ll see, it’s a very general and versatile framework for constructing zero-knowledge proof systems.

Cryptographic Compilers

We will now explore how to apply a sequence of “cryptographic compilers” to any MPC protocol to obtain a non-interactive zero-knowledge proof system. A cryptographic compiler uses cryptographic tools to produce a new protocol from an existing one. The advantage of “compilers” in protocol design is that they are general: any input protocol satisfying a set of properties will yield a new protocol where a new set of properties are guaranteed to hold.  In other words, you can compose cryptographic primitives in a black-box fashion to derive new protocols in a “mechanical way.”

IKOS Compiler

Given an MPC protocol capable of computing a function “g(x),” the IKOS compiler yields a zero-knowledge proof system for input/output pairs (w, o) of the function “g”: It enables a prover to convince a verifier that it knows a secret input “w” such that “g(w) = o” for some output “o,” e.g., if “g(x) = SHA-256(x)” to prove that it knows an input “w” to SHA-256 resulting in a given hash “o” without revealing the pre-image (“w”).

At a high level, this is done by having the prover locally run the MPC protocol for “g(x)” with the input “w” shared among the players. For example, in the case of three players, this can be done by picking random X, Y, Z subject to X + Y + Z = w. In this way, knowing just two of the inputs X, Y, Z reveals nothing about w. Then the function f(X, Y, Z) = g(X + Y + Z) is computed using the multi-party computation protocol. To prevent cheating, the verifier asks the prover to reveal the inputs and communication for a subset of the players and checks that the protocol was followed correctly.
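The input-sharing step can be as simple as additive secret sharing. Here is a minimal Python sketch (the modulus q is an arbitrary illustrative choice; any ring the MPC protocol supports would do):

import secrets

q = 2**61 - 1          # illustrative modulus

def share(w, n=3):
    # Split w into n additive shares that sum to w modulo q.
    shares = [secrets.randbelow(q) for _ in range(n - 1)]
    shares.append((w - sum(shares)) % q)
    return shares

def reconstruct(shares):
    return sum(shares) % q

X, Y, Z = share(42)
assert reconstruct([X, Y, Z]) == 42
# Any two of X, Y, Z on their own are uniformly distributed and reveal nothing about w.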

The IKOS cryptographic compiler

More precisely, the IKOS compiler takes:

  1. An MPC protocol.
  2. A cryptographic commitment scheme (covered below).

And yields an interactive (public coin) zero-knowledge proof protocol. A cryptographic commitment scheme consists of a function “Commit,” which takes a key and a message, then returns a “commitment” Commit(key, msg) -> commitment, with the property that:

  1. Finding different messages with the same commitment is intractable (like a collision-resistant hash function)
  2. Determining whether a commitment was created from a given message is intractable (with a random key, the commitment hides msg, like encryption)

Commitments can be constructed from cryptographic hash functions, e.g., by using the HMAC construction. For a physical analogy to commitments, one can think of an envelope: you can put a letter inside the envelope and seal it (compute “Commit(key, msg)”), the envelope will hide the contents of the letter (“hiding”), but will not allow you to replace the letter by another (“binding”). At a later point, one can then open the envelope to reveal the contents (by providing someone with “key” and “msg” and having them recompute the commitment).
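As an illustration, here is a minimal HMAC-based commitment sketch in Python (the function names are ours, chosen for clarity):

import hashlib
import hmac
import secrets

def commit(msg: bytes):
    # Commit to msg; publish the commitment, keep (key, msg) to open it later.
    key = secrets.token_bytes(32)
    commitment = hmac.new(key, msg, hashlib.sha256).digest()
    return commitment, key

def open_commitment(commitment: bytes, key: bytes, msg: bytes) -> bool:
    # Recompute the commitment from (key, msg) and compare.
    expected = hmac.new(key, msg, hashlib.sha256).digest()
    return hmac.compare_digest(commitment, expected)

com, key = commit(b"view of player 2")
assert open_commitment(com, key, b"view of player 2")        # opens correctly
assert not open_commitment(com, key, b"a different view")    # binding in practice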

Using cryptographic commitments, the MPC protocol is “compiled” as follows:

  1. The prover executes the MPC protocol for f(X, Y, Z) “in the head”, by simulating the players running the protocol: running every player locally as if they were distributed. You can think of this either as the prover executing the code for each player inside a virtual machine and recording the network traffic between them, or, as the prover playing a form of MPC sock-puppet theater wherein it plays the role of every player in the protocol.
  2. The prover records everything the different players send and receive during the execution of the protocol. We call this “the view” of each player: everything the player sees during the protocol. The prover then commits to the views of every player (“putting it inside an envelope”) and sends the commitments (“envelopes”) to the verifier.
  3. The verifier asks the prover to provide it with a subset of the views (2 in our example).

The verifier then checks that the views provided for the opened players are consistent with each other: the players send and receive the messages that the protocol specifies they should.

This is illustrated below.

The prover executes the MPC protocol between “virtual/imagined” parties and records their views. The verifier checks the consistency between a subset of these purported views.

Note: The “public coin” part just means that the verifier simply responds to the prover by picking random numbers: in this case the indexes to open, i.e. the verifier has no “secrets” that must be hidden from the prover to prevent him from cheating. This will be relevant later.

Why does this work? We need to convince ourselves of two things:

Soundness: If the prover cheats, it is caught

Suppose the prover does not have inputs X, Y, Z such that f(X, Y, Z) = o. If it executed the MPC protocol correctly, then the computation would not return the expected output “o” (since the MPC protocol correctly computes f). Hence the prover must cheat while executing the MPC protocol “in its head.” In order to cheat, the prover must either:

  1. Cheat within a player, e.g., by having the player do a local computation incorrectly.
  2. Cheat in the communication between pairs of players, e.g., by pretending that player A received a message from player B that player B never sent.

These two ways encompass the only ways to cheat in the MPC protocol: either by having a player violate the protocol, or, by subverting the simulated communication medium.

Consider now the case of 3 players with pairwise channels (as illustrated between Alice, the Hare, and the Hatter). Then the cheating prover must cheat between at least one pair of players. Since it commits to the views of all three players, at least one pair of these commitments must be inconsistent: if the verifier opens this pair, it observes that either one of the players is cheating or the messages between the two players do not match. Since there are three pairs of commitments, the cheating prover is caught with a probability of at least ⅓.

To “improve soundness” (to catch the cheating prover with higher probability), the zero-knowledge proof is repeated in parallel: the prover runs the MPC protocol multiple times and is challenged to open two parties from every repetition. This way, the probability of a cheating prover succeeding after N repetitions drops exponentially in N.
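As a rough back-of-the-envelope calculation (the soundness target below is an arbitrary example), the number of repetitions needed grows only logarithmically in the desired security level:

from math import ceil, log

# Each repetition catches a cheating prover with probability at least 1/3,
# so after n repetitions the prover escapes with probability at most (2/3)^n.
target = 2 ** -40                      # illustrative soundness target
n = ceil(log(target) / log(2 / 3))     # smallest n with (2/3)^n <= target
print(n)                               # 69 repetitions for this example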

Zero-knowledge: The verifier learns nothing

When two views are opened, the verifier learns the messages and inputs of two of the players. The commitment scheme hides the view of the unopened player, and the MPC protocol ensures the privacy of each player’s input, so the verifier does not learn the input of the unopened player (Y in the example). Because the input is shared as described above, providing two of the inputs (X and Z in the example) leaks no information about “w.”

Fiat-Shamir Compiler

The Fiat-Shamir transform removes the interactive verifier (professor).

The Fiat-Shamir transform/compiler takes:

  1. A public coin zero-knowledge proof system.
  2. A (particularly strong) hash function.

And yields a non-interactive zero-knowledge proof.

The observation underlying the Fiat-Shamir transform is essentially as follows: since all the verifier does is pick random numbers, we could just have the prover roll the dice instead and open the indexes that the dice show. Then everyone can verify that the opened views are consistent.

There is just one obvious problem: a cheating prover can of course just pretend that the dice came up however it wanted them to (e.g., not opening the cheating pair of players). The Fiat-Shamir transformation resolves this by essentially using a hash function to “roll the dice in a verifiably random way.”

This is done by having the prover hash its messages (the commitments to each player’s view); the hash digest is then used as the random challenge (e.g., interpreted as a sequence of integers denoting which players to open).
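A sketch of that derivation in Python, for the three-player, open-two setting described above (the helper name is ours, and the sketch ignores details such as modulo bias):

import hashlib

def unopened_player_per_repetition(commitments, n_players=3, n_reps=8):
    # Derive which player stays unopened in each repetition from a hash
    # of all committed views (illustrative sketch).
    digest = hashlib.sha256(b"".join(commitments)).digest()
    return [digest[i] % n_players for i in range(n_reps)]

# Example: these byte strings stand in for the prover's committed views ("envelopes").
views = [b"commitment-0", b"commitment-1", b"commitment-2"]
print(unopened_player_per_repetition(views))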

A cheating prover can, of course, try changing its messages and recomputing the hash until it gets a challenge that allows it to cheat. However, this merely corresponds to running the interactive protocol multiple times in a setting where the verifier allows any number of attempts. Hence the probability of a cheating prover succeeding in the interactive protocol is amplified by the amount of computation the prover has available; e.g., if the interactive protocol allows the prover to cheat with probability 2^{-128}, then the adversary would have to compute the hash an expected 2^{127} times to cheat, which is clearly not feasible.

Note: In reality, the Fiat-Shamir transform requires a “Random Oracle” [see: https://blog.cryptographyengineering.com/2011/09/29/what-is-random-oracle-model-and-why-3/ for more details], a proof artifact that cannot really be instantiated. However, in practice, replacing the “Random Oracle” with a “good hash function” whose output “looks random” is sufficient.

Reverie

This finally brings us to Reverie. Reverie is an efficient Rust implementation of the MPC-in-the-head proof system described in KKW 2018, derived by applying the two compilers above to a particularly simple MPC protocol. Since the proof system works for any ring (e.g., finite fields, the integers modulo a composite, or vector spaces), Reverie is designed to be both fast and flexible enough to support computations in any ring.

Since the MPC-in-the-head paradigm covered above is quite general and the techniques in KKW generalize very easily, Reverie is also designed to be general and easily extendable. This makes it easier to add features such as moving between rings inside the proof system and adding “random gates” to the circuits.

Optimizations

To improve performance, Reverie uses a number of implementation tricks:

Heavy use of bitslicing. Since the parties in the underlying MPC protocol all execute the same simple operations in parallel, we can significantly improve performance by packing the players’ state contiguously in memory, then operating on all players in parallel with a single operation (either in traditional 64-bit registers or with SIMD instructions). Reverie never operates on the state of an individual player.

Parallelism. Since the proof system requires a number of repetitions for “soundness amplification,” each of these can be trivially delegated to a separate thread for a linear increase in performance.

Fast cryptographic primitives. Reverie makes use of Blake3 and ChaCha12, since these were found to be the fastest cryptographic primitives (cryptographic hash function and pseudo random function) while still offering a very comfortable security margin.

Streaming prover and streaming verifier. Reverie allows the proof to be produced and verified in a streaming fashion: this enables the verifier to validate the proof while the proof is still being generated and transmitted by the prover. This means that Reverie’s proof and verification components are fully asynchronous.

Benchmarks

To benchmark Reverie, we measured the speed of proving evaluations of the SHA-256 compression function on this circuit. Reverie averages around 20 SHA-256 compressions/second on a (32-core) AMD EPYC 7601:

Num. of SHA-256 compression applications | Proof generation time | AND Gates | XOR Gates | Proof size
100 | 4.76 s | 2,227,200 | 9,397,400 | 22 MB
1,000 | 50.39 s | 22,272,000 | 93,974,000 | 220 MB
10,000 | 482.97 s | 222,720,000 | 939,740,000 | 2.2 GB

Conclusion

Reverie is the first step in creating high-performance zero-knowledge proofs outside of the dominant SNARK paradigm, and it handles hundreds of millions of AND/XOR gates with relative ease. The project comes with a simple “companion” program (in the “companion” subdirectory) which makes playing with Reverie relatively easy. Try it now on Github and crates.io!