Breaking Aave Upgradeability

On December 3rd, Aave deployed version 2 of their codebase. While we were not hired to look at the code, we briefly reviewed it the following day. We quickly discovered a vulnerability that affected versions 1 and 2 of the live contracts and reported the issue. Within an hour of sending our analysis to Aave, their team mitigated the vulnerability in the deployed contracts. If exploited, the issue would have broken Aave, and impacted funds in external DeFi contracts.

Five different security firms reviewed the Aave codebase, including some that used formal verification; however, this bug went unnoticed. This post describes the issue, how the bug escaped detection, and other lessons learned. We are also open-sourcing a new Slither detector that identifies this vulnerability to increase security in the larger Ethereum community.

The vulnerability

Aave uses the delegatecall proxy pattern that we have thoroughly discussed in past writeups on this blog. At a high-level, each component is split into two contracts: 1) A logic contract that holds the implementation and 2) a proxy that contains the data and uses delegatecall to interact with the logic contract. Users interact with the proxy while the code is executed on the logic contract. Here’s a simplified representation of the delegatecall proxy pattern:

In Aave, LendingPool (LendingPool.sol) is an upgradeable component that uses a delegatecall proxy.

The vulnerability we discovered relies on two features in these contracts:

  • Functions on the logic contract can be called directly, including initialization functions
  • The lending pool has its own delegatecall capabilities

Initializing upgradeable contracts

A limitation of this upgradeability pattern is that the proxy cannot rely on the logic contract’s constructor for initialization. Therefore, state variables and initial setup must be performed in public initialization functions that do not benefit from the constructor safeguards.

In LendingPool, the initialize function sets the provider address (_addressesProvider):

function initialize(ILendingPoolAddressesProvider provider) public initializer {
_addressesProvider = provider;


The initializer modifier prevents initialize from being called multiple times. It requires that the following condition is true:

      initializing || isConstructor() || revision > lastInitializedRevision,
      'Contract instance has already been initialized'



  1. initializing allows multiple calls to the modifier within the same transaction (hence multiple initialize functions)
  2. isConstructor() is what the proxy needs to execute the code
  3. revision > lastInitializedRevision allows calling the initialization functions again when the contract is upgraded

While this works as expected through the proxy, 3 also allows anyone to call initialize directly on the logic contract itself. Once the logic contract is deployed:

The bug: Anyone can set _addressesProvider on the LendingPool logic contract.

Arbitrary delegatecall

LendingPool.liquidationCall delegatecalls directly to the addresses returned by _addressProvider:

   address collateralManager = _addressesProvider.getLendingPoolCollateralManager();

    (bool success, bytes memory result) =


This lets anyone initiate the LendingPool logic contract, set a controlled addresses provider, and execute arbitrary code, including selfdestruct.

The exploit scenario: Anyone can destruct the lending pool logic contract. Here’s a simplified visual representation:

Lack-of-existence check

By itself, this issue is already severe since anyone can destruct the logic contract and prevent the proxy from executing the lending pool code (pour one out for Parity).

However, the severity of this issue is amplified by the use of OpenZeppelin for the proxy contract. Our 2018 blogpost highlighted that a delegatecall to a contract without code would return success without executing any code. Despite our initial warning, OpenZeppelin did not fix the fallback function in their proxy contract:

   function _delegate(address implementation) internal {
        assembly {
            // Copy We take full control of memory in this inline assembly
            // block because it will not return to Solidity code. We overwrite the
            // Solidity scratch pad at memory position 0.
            calldatacopy(0, 0, calldatasize)

            // Call the implementation.
            // out and outsize are 0 because we don't know the size yet.
            let result := delegatecall(gas, implementation, 0, calldatasize, 0, 0)


If the proxy delegatecalls to a destroyed lending pool logic contract, the proxy will return success, while no code was executed.

This exploit would not be persistent since Aave can update the proxy to point to another logic contract. But in the timeframe where the issue could be exploited, any third-party contracts calling the lending pool would act as if some code was executed when it was not. This would break the underlying logic of many external contracts.

Affected contracts

  • All ATokens (Aave tokens): AToken.redeem calls pool.redeemUnderlying (AToken.sol#L255-L260). As the call does nothing, the users would burn their ATokens without receiving back their underlying tokens.
  • WETHGateway (WETHGateway.sol#L103-L111): Deposits would be stored in the gateway, allowing anyone to steal the deposited assets.
  • Any codebase based on Aave’s Credit Delegation v2 (MyV2CreditDelegation.sol)

If the issue we discovered were exploited, many contracts outside of Aave would have been affected in various ways. Determining a complete list is difficult, and we did not attempt to do so. This incident highlights the underlying risks of DeFi composability. Here are a few that we found:

Fixes and recommendations

Luckily, no one abused this issue before we reported it. Aave called the initialize functions on both versions of the lending pool and thus secured the contracts:

Long term, contract owners should:

  • Add a constructor in all logic contracts to disable the initialize function
  • Check for the existence of a contract in the delegatecall proxy fallback function
  • Carefully review delegatecall pitfalls and use slither-check-upgradeability

Formally verified contracts are not bulletproof

Aave’s codebase is “formally verified.” A trend in the blockchain space is to think that safety properties are the holy grail of security. Users might try to rank the safety of various contracts based on the presence or absence of such properties. We believe this is dangerous and can lead to a false sense of security.

The Aave formal verification report lists properties on the LendingPool view functions (e.g., they don’t have side effects) and the pool operations (e.g., the operations return true when successful and do not revert). For example, one of the verified properties is:

Yet this property can be broken if the logic contract is destroyed. So how could this have been verified? While we don’t have access to the theorem prover nor the setup used, likely, the proofs don’t account for upgradeability, or the prover does not support complex contract interactions.

This is common for code verification. You can prove behaviors in a targeted component with assumptions about the overall behavior. But proving properties in a multi-contract setup is challenging and time-consuming, so a tradeoff must be made.

Formal techniques are great, but users must be aware that they cover small areas and might miss attack vectors. On the other hand, automated tools and human review can help the developers reach higher overall confidence in the codebase with fewer resources. Understanding the benefits and limits of each solution is crucial for developers and users. The current issue is a good example. Slither can find this issue in a few seconds, an expert with training may quickly point it out, but it would take substantial effort to detect with safety properties.


Aave reacted positively and quickly fixed the bug once they were aware of the issue. Crisis averted. But other victims of recent hacks were not as fortunate. Before deploying code and exposing it to an adversarial environment, we recommend that developers:

We hope to prevent similar mistakes by sharing this post and the associated Slither detector for this issue. However, security is a never-ending process, and developers should contact us for security reviews before launching their projects.

Reverie: An optimized zero-knowledge proof system

Zero-knowledge proofs, once a theoretical curiosity, have recently seen widespread deployment in blockchain systems such as Zcash and Monero. However, most blockchain applications of ZK proofs make proof size and performance tradeoffs that are a poor fit for other use-cases. In particular, these protocols often require an elaborate trusted setup phase and optimize for proof size at the expense of prover time. Many would-be applications of ZK proofs are so complex that the ZK-SNARK protocol used by Zcash would take days (if not months) to run. With this in mind, I believe it is important to provide engineers who want to tinker with ZK proofs an alternative set of tradeoffs.

During my internship at Trail of Bits, I implemented a ZK proof system Reverie that optimizes for prover efficiency and does not require any trusted setup. These optimizations come at the expense of proof size, but Reverie allows proofs to be streamed across a network and verified incrementally so as to avoid loading large proofs into memory all at once. Reverie is also available as reverie-zk on

The rest of this blog post will explain the underlying ZK proof system behind Reverie and provide performance benchmarks. Since the proof system uses techniques from secure multiparty computation (MPC), I’ll start by going over some MPC basics and then showing how to build ZK proofs using MPC.

MPC Primer

The players have secret inputs X, Y, Z, respectively, and compute f(X, Y, Z) without revealing X, Y or Z.

The goal of a multi-party computation protocol is to allow the distributed computation of a function on inputs from different parties, without the parties revealing their inputs to one other.

For instance:

  1. Tallying a vote without revealing the values of individual votes.
  2. Computing the number of shared customers without revealing your customer databases.
  3. Training/evaluating a machine-learning model on private datasets without sharing confidential data.

Zero-knowledge proofs and multi-party computation have conceptual similarities, but there are crucial differences:

  1. MPC allows the computation upon private data, where zero-knowledge proofs only enable parties to prove properties about private data.
  2. MPC protocols are interactive (their non-interactive cousin is functional encryption), requiring the parties to be online. Zero-knowledge proofs can be either interactive or non-interactive.

In this post we see that these two primitives are indeed related by describing how to “compile” a multi-party computation protocol into a zero-knowledge proof system. This is a beautiful theoretic result: essentially, checking the output of a function inside a zero-knowledge proof will always be more efficient than computing the function using multi-party computation.

Additionally, it can also lead to very simple, concretely efficient proof systems in practice, with applications including practical post-quantum signature schemes. As we’ll see, it’s a very general and versatile framework for constructing zero-knowledge proof systems.

Cryptographic Compilers

We will now explore how to apply a sequence of “cryptographic compilers” to any MPC protocol to obtain a non-interactive zero-knowledge proof system. A cryptographic compiler uses cryptographic tools to produce a new protocol from an existing one. The advantage of “compilers” in protocol design is that they are general: any input protocol satisfying a set of properties will yield a new protocol where a new set of properties are guaranteed to hold.  In other words, you can compose cryptographic primitives in a black-box fashion to derive new protocols in a “mechanical way.”

IKOS Compiler

Given an MPC protocol capable of computing a function “g(x),” the IKOS compiler yields a zero-knowledge proof system for input/output pairs (w, o) of the function “g”: It enables a prover to convince a verifier that it knows a secret input “w” such that “g(w) = o” for some output “o,” e.g., if “g(x) = SHA-256(x)” to prove that it knows an input “w” to SHA-256 resulting in a given hash “o” without revealing the pre-image (“w”).

At a high level, this is done when the prover locally runs the MPC protocol for “g(x)” where the input “w” is shared among the players. For example, in the case of three players, this can be done by picking random X, Y, Z subject to X + Y + Z = w. In this way, knowing just two of the inputs, X, Y, Z reveals nothing about w. Then the function f(X, Y, Z) = g(X + Y + Z) is computed using the multi-party computation protocol. To prevent cheating, the verifier asks the prover to reveal the inputs and communication for a subset of the players and checks that they’re run correctly.

The IKOS cryptographic compiler

More precisely, the IKOS compiler takes:

  1. An MPC protocol.
  2. A cryptographic commitment scheme (covered below).

And yields an interactive (public coin) zero-knowledge proof protocol. A cryptographic commitment scheme consists of a function”Commit,” which takes a key and a message, then returns a “commitment” Commit(key, msg) -> commitment, with the property that:

  1. Finding different messages with the same commitment is intractable (like a collision-resistant hash function)
  2. Discerning whether a commitment is then created from a given message is intractable (with a random key, the commitment hides msg, like encryption)

Commitments can be constructed from cryptographic hash functions, e.g., by using the HMAC construction. For a physical analogy to commitments, one can think of an envelope: you can put a letter inside the envelope and seal it (compute “Commit(key, msg)”), the envelope will hide the contents of the letter (“hiding”), but will not allow you to replace the letter by another (“binding”). At a later point, one can then open the envelope to reveal the contents (by providing someone with “key” and “msg” and having them recompute the commitment).

Uses cryptographic commitments, the MPC protocol is “compiled” as follows:

  1. The prover executes the MPC protocol for f(X, Y, Z) “in the head”, by simulating the players running the protocol: running every player locally as if they were distributed. You can think of this either as the prover executing the code for each player inside a virtual machine and recording the network traffic between them, or, as the prover playing a form of MPC sock-puppet theater wherein it plays the role of every player in the protocol.
  2. The prover records everything the different players send and receive during the execution of the protocol. We call this “the view” of each player: everything the player sees during the protocol. The prover then commits to the views of every player (“putting it inside an envelope”) and sends the commitments (“envelopes”) to the verifier.
  3. The verifier asks the prover to provide it with a subset of the views (2 in our example).

The verifier then checks that the correspondence between the two parties for which the view has been provided are compatible: the players are sending and receiving the messages that the protocol specifies that they should.

This is illustrated below.

The prover executes the MPC protocol between “virtual/imagined” parties and records their views. The verifier checks the consistency between a subset of these purported views.

Note: The “public coin” part just means that the verifier simply responds to the prover by picking random numbers: in this case the indexes to open, i.e. the verifier has no “secrets” that must be hidden from the prover to prevent him from cheating. This will be relevant later.

Why does this work? We need to convince ourselves of two things:

Soundness: If the prover cheats, it is caught

Suppose the prover does not have inputs X, Y, Z such that f(X, Y, Z) = 1. If it executed the MPC protocol correctly, then the computation would not return the expected output “o” (since the MPC protocol correctly computes f). Hence the prover must cheat while executing the MPC protocol “in his head”. In order to cheat, the prover must either:

  1. Cheat in a player, by e.g. having the player do a local computation wrong.
  2. Cheat in the communication between pairs of players, by e.g. pretending like player A received a message from player B, which player B never sent.

These two ways encompass the only ways to cheat in the MPC protocol: either by having a player violate the protocol, or, by subverting the simulated communication medium.

Consider now the case of 3 players with pairwise channels (as illustrated between Alice, the Hare, and the Hatter). Then the cheating prover must cheat between at least one pair of players. Since it commits to the views of all three players, at least one pair of these commitments must be inconsistent: if the verifier opens this pair, it observes that either one of the players is cheating or the messages between the two players do not match. Since there are three pairs of commitments, the cheating prover is caught with a probability of at least ⅓.

To “improve soundness” (to catch the cheating prover with higher probability), the zero-knowledge proof is repeated in parallel: having the prover run the MPC protocol multiple times, then challenge it to open two parties from every repetition. This way, the probability of a cheating prover succeeding after N repetitions drops exponentially in N.

Zero-knowledge: The verifier learns nothing

When two views are opened, the verifier learns the messages and inputs of two of the players. Since the commitment scheme hides the view of the unopened player and the MPC protocol ensures privacy of the input for the players, the verifier does not learn the input of the unopened player (Y in the example), since the input is shared as described above providing two of the inputs (X, Z in the example) leaks no information about “w”.

Fiat-Shamir Compiler

The Fiat-Shamir transform removes the interactive verifier (professor).

The Fiat-Shamir transform/compiler takes:

  1. A public coin zero-knowledge proof system.
  2. A (particularly strong) hash function.

And yields a non-interactive zero-knowledge proof.

The observation underlying the Fiat-Shamir transform is essentially as follows: Since all the verifier does is pick random numbers, we could just have the prover roll a dice instead and open the index that the dice shows. Then everyone can verify that the opened views are consistent.

There is just one obvious problem: a cheating prover can of course just pretend like the dice throw came out however it wanted them to (e.g. not opening the cheating pair of players). The Fiat-Shamir transformation resolves this by essentially using a hash function to “roll the dice in a verifiably random way”.

This is done by having the prover hash his messages (the commitments to each player’s view), the hash digest is then used as the random challenge (e.g. interpreted as a sequence of integers denoting which players to open).

A cheating prover can of course try changing its messages and recomputing the hash until it gets a challenge that allows him to cheat, however, this merely corresponds to running the interactive protocol multiple times in a setting where the verifier allows him to attempt any number of times. Hence the probability of a cheating prover succeeding in the interactive protocol is amplified by the amount of computation the cheating prover has available, e.g. if the interactive protocol allows the prover to cheat with probability 2^{-128}, then the adversary would have to compute the hash an expected 2^{127} times to cheat; which is clearly not feasible.

Note: In reality, the Fiat-Shamir transform requires a “Random Oracle” [see: for more details], a proof artifact that cannot really be instantiated. However in practice replacing the “Random Oracle” with a “good hash function,” where the output of the hash function “looks random” is sufficient.


This finally brings us to Reverie. Reverie is an efficient Rust implementation of the MPC-in-the-head proof system described in KKW 2018 derived by applying the previous two compilers to a particularly simple MPC protocol. Since the proof system works for any ring (e.g., finite fields, the integers modulo a composite or vector spaces), Reverie is designed to be both fast and flexible enough to support computations in any ring.

Since the MPC-in-the-head paradigm covered above is quite general and the techniques in KKW generalize very easily, Reverie is also designed to be general and easily extendable. This makes it easier to add features such as moving between rings inside the proof system and adding “random gates” to the circuits.


To improve performance, Reverie uses a number of implementation tricks:

Heavy use of bitslicing. Since the parties in the underlying MPC protocol all execute the same simple operations in parallel, we can significantly improve performance by packing the players’ state continuously in memory, then operating on all players in parallel with a single operation (either in traditional 64-bit registers or with SIMD instructions). Reverie never operates on the state of individual players.

Parallelism. Since the proof system requires a number of repetitions for “soundness amplification,” each of these can be trivially  delegated to a separate thread for a linear increase in performance.

Fast cryptographic primitives. Reverie makes use of Blake3 and ChaCha12, since these were found to be the fastest cryptographic primitives (cryptographic hash function and pseudo random function) while still offering a very comfortable security margin.

Streaming prover and streaming verifier. Reverie allows the proof to be produced and verified in a steaming fashion: this enables the verifier to validate the proof while the proof is still being generated and transmitted by the prover. This means that Reverie’s proof and verification components are fully asynchronous.


To benchmark Reverie, we measured the speed of verifying the SHA-256 compression function on this circuit. Reverie averages around 20 SHA-256 compression/second on a (32-core) AMD EPYC 7601:

Num. of SHA-256 compression applications Proof generation time AND Gates XOR Gates Proof size
100 4.76 s 2,227,200 9,397,400 22 MB
1000 50.39 s 22,272,000 93,974,000 220 MB
10000 482.97s 222,720,000 939,740,000 2.2 GB


Reverie is the first step in creating high-performance zero-knowledge proofs outside of the dominant SNARK paradigm, and it handles hundreds of millions of AND/XOR gates with relative ease. The project comes with a simple “companion” program (in the “companion” subdirectory) which makes playing with Reverie relatively easy. Try it now on Github and!

High-fidelity build instrumentation with blight

TL;DR: We’re open-sourcing a new framework, blight, for painlessly wrapping and instrumenting C and C++ build tools. We’re already using it on our research projects, and have included a set of useful actions. You can use it today for your own measurement and instrumentation needs:

Why would you ever want to wrap a build tool?

As engineers, we tend to treat our build tools as monoliths: gcc, clang++, et al. are black boxes that we shove inputs into and get outputs out of.

We then encapsulate our build tools in even higher-level build systems (Make, Ninja) or build system generators (CMake) to spare ourselves the normal hassles of C and C++ compilation: linking individual outputs together, maintaining consistent flags between invocations, supplying the correct linker and including directories, etc.

This all works well enough for everyday development, but falls short for a few specific use cases. We’ll cover some of them below.


Most build systems have their own caches and intermediate management mechanisms (with varying degrees of soundness). What normal build systems can’t do is cache intermediates between separate projects: Each project’s build is hermetic, even when two projects build nearly identical source dependencies.

That’s where tools like ccache and sccache come in: both supply a global intermediate cache by wrapping individual build tools and diverting from a normal tool invocation if a particular input’s output already exists1.

Static analysis

Modern C and C++ compilers support a variety of language standards, as well as customization flags for allowing or disallowing language features. Static analysis tools like clang-tidy need to be at least partially aware of these parameters to provide accurate results; for instance, they shouldn’t be recommending snprintf for versions of C prior to C99. Consequently, these tools need an accurate record of how the program was compiled.

clang-tidy and a few other tools support a Compilation Database format, which is essentially just a JSON-formatted array of “command objects,” each of contains the command-line, working directory, and other metadata associated with a build tool invocation. CMake knows how to generate these databases when using its Make generator; there’s also a tool called Bear that does the same through some LD_PRELOAD trickery.

Similarly: LLVM is a popular static analysis target, but it can be difficult to generically instrument pre-existing build systems to emit a single LLVM IR module (or individual bitcode units) in lieu of a single executable or object intermediates. Build tool wrappers like WLLVM and GLLVM exist for precisely this purpose.

Performance profiling

C and C++ builds get complicated quickly. They also tend to accumulate compilation performance issues over time:

  • Expensive standard headers (like <regex>) get introduced and included in common headers, bloating compilation times for individual translation units.
  • Programmers get clever and write complex templates and/or recursive macros, both of which are traditional sore points in compiler performance.
  • Performance-aiding patterns (like forward declarations) erode as abstractions break.

To fix these performance problems, we’d like to time each individual tool invocation and look for sore points. Even better, we’d like to inject additional profiling flags, like -ftime-report, into each invocation without having to fuss with the build system too much. Some build systems allow the former by setting CC=time cc or similar, but this gets hairy with multiple build systems tied together. The latter is easy enough to do by modifying CFLAGS in Make or add_compile_options /target_compile_options in CMake, but becomes similarly complicated when build systems are chained together or invoke each other.

Build and release assurance

C and C++ are complex languages that are hard, if not impossible, to write safe code in.

To protect us, we have our compilers add mitigations (ASLR, W^X, Control Flow Integrity) and additional instrumentation (ASan, MSan, UBSan).

Unfortunately, build complexity gets in our way once again: when stitching multiple builds together, it’s easy to accidentally drop (or incorrectly add, for release configurations) our hardening flags2. So, we’d like a way to predicate a particular build’s success or failure on the presence (or absence) of our desired flags, no matter how many nested build systems we invoke. That means injecting and/or removing flags just like with performance profiling, so wrapping is once again an appealing solution.

Build tools are a mess

We’ve come up with some potential use cases for a build tool wrapper. We’ve also seen that a useful build tool wrapper is one that knows a decent bit about the command-line syntax of the tool that it’s wrapping: how to reliably extract its inputs and outputs, as well as correctly model a number of flags and options that change the tool’s behavior.

Unfortunately, this is easier said than done:

  • GCC, historically the dominant open-source C and C++ compiler, has thousands of command-line options, including hundreds of OS- and CPU-specific options that can affect code generation and linkage in subtle ways. Also, because nothing in life should be easy, the GCC frontends have no less than four different syntaxes for an option that takes a value:
    • -oVALUE (examples: -O{,0,1,2,3}, -ooutput.o, -Ipath)
    • -flag VALUE (examples: -o output.o, -x c++)
    • -flag=VALUE (examples: -fuse-ld=gold, -Wno-error=switch, -std=c99)
    • -Wx,VALUE (examples: -Wl,--start-group, -Wl,-ereal_main)

    Some of these overlap consistently, while others only overlap in a few select cases. It’s up to the tool wrapper to handle each, at least to the extent required by the wrapper’s expected functionality.

  • Clang, the (relative) newcomer, makes a strong effort to be compatible with the gcc and g++ compiler frontends. To that end, most of the options that need to be modeled for correctly wrapping GCC frontends are the same for the Clang frontends. That being said, clang and clang++ add their own options, some of which overlap in functionality with the common GCC ones. By way of example: The clang and clang++ frontends support -Oz for aggressive code size optimization beyond the (GCC-supported) -Os.
  • Finally, the weird ones: There’s Intel’s ICC, which apparently likewise makes an effort to be GCC-compatible. And then there’s Microsoft’s cl.exe frontend for MSVC which, to the best of my understanding, is functionally incompatible3.

Closer inspection also reveals a few falsehoods that programmers frequently believe about their C and C++ compilers:

“Compilers only take one input at a time!”

This is admittedly less believed: Most C and C++ programmers realize early on that these invocations…

cc -c -o foo.o foo.c
cc -c -o bar.o bar.c
cc -c -o baz.o baz.c
cc -o quux foo.o bar.o baz.o

…can be replaced with:

cc -o quux foo.c bar.c baz.c

This is nice for quickly building things on the command-line, but is less fruitful to cache (we no longer have individual intermediate objects) and is harder for a build tool wrapper to model (we have to suss out the inputs, even when interspersed with other compiler flags).

Compilers only produce one output at a time!

Similar to the above: C and C++ compilers will happily produce a single intermediate output for every input, as long as you don’t explicitly ask them for a single executable output via -o:

cc -c foo.c bar.c baz.c

…produces foo.o, bar.o, and baz.o. This is once again nice for caching, but with a small bit of extra work. To correctly cache each output, we need to transform each input’s filename into the appropriate implicit output name. This ought to be as simple as replacing the source extension with .o, but it isn’t guaranteed:

  • Windows hosts (among others) use .obj rather than .o. Annoying.
  • As we’ll see, not all source inputs are required to have an extension.

Yet more work for our tool wrapper.

“cc only compiles C sources and c++ only compiles C++ sources!”

This is a popular misconception: that cc and c++ (or gcc and g++, or …) are completely different programs that happen to share a great deal of command-line functionality.

In reality, even if they’re separate binaries, they’re usually just thin shims over a common compiler frontend. In particular, c++ corresponds to cc -x c++ and cc corresponds to c++ -x c:

# compile a C++ program using cc
cc -x c++ -c -o foo.o foo.cpp

The -x <language> option enables one particularly annoying useful feature: being able to compile files as a particular language even if their suffix doesn’t match. This comes in handy when doing, say, code generation:

# promise the compiler that junk.gen is actually a C source file
cc -x c junk.gen

Compilers only compile one language at a time!”

Even with the above, programmers assume that each input to the compiler frontend has to be of the same language, i.e., that you can’t mix C and C++ sources in the same invocation. But this just isn’t true:

# compile a C source and a C++ source into their respective object files
cc -c -x c foo.c -x c++ bar.cpp

You don’t even need the -x <language> modifiers when the frontend understands the file suffixes, as it does for .c and .cpp:

# identical to the above
cc -c foo.c bar.cpp

Not every build tool is a compiler frontend

We’ve omitted a critical fact above: Not every build tool shares the general syntax of the C and C++ compiler frontends. Indeed, there are five different groups of tools that we’re interested in:

  • “Compiler tools” like cc and c++, normally overriden with CC and CXX. We’re focusing our efforts on C and C++ for the time being; it’s common to see similar variables for Go (GO), Rust (RUSTC), and so forth.
  • The C preprocessor (cpp), normally overriden with CPP. Most builds invoke the C preprocessor through the compiler frontend, but some invoke it directly.
  • The system linker (ld), normally overriden with LD. Like the preprocessor, the linker is normally interacted with through the frontend, but occasionally makes an appearance of its own when dealing with custom toolchains and linker scripts.
  • The system assembler (as), normally overriden with AS. Can be used via the frontend like the preprocessor and linker, but is also seen independently.
  • The system archiver (ar), normally overriden with AR. Unlike the last three, the archiver is not integrated into the compiler frontend for, e.g., static library construction; users are expected to invoke it directly.

blight’s architecture

blight’s tool hierarchyWe’ve seen some of the complexities that arise when wrapping and accurately modeling the behavior of a build tool. So now let’s take a look at how blight mitigates those complexities.

Like most (all?) build tool wrappers, blight’s wrappers take the place of their wrapped counterparts. For example, to run a build with blight’s C compiler wrapper:

# -e tells make to always give CC from the environment precedence
CC=blight-cc make -e

Setting each CC, CXX, etc. manually is tedious and error-prone, so the blight CLI provides a bit of shell code generation magic to automate the process:

# --guess-wrapped searches the $PATH to find a suitable tool to wrap
eval "$(blight-env --guess-wrapped)"
make -e

Under the hood, each blight wrapper corresponds to a concrete subclass of Tool (e.g., blight.tool.AS for the assembler), each of which has at least the following:

  • The argument vector (args) that the wrapped tool will be run with.
  • The working directory (cwd) that the wrapped tool will be run from.
  • A list of Actions, each of which can register two separate events on each tool run:
    • before_run(tool)—Run before each tool invocation.
    • after_run(tool)—Run after each successful tool invocation.

Individual subclasses of Tool are specialized using a mixin pattern. For example, blight.tool.CC

A render of the CC class’s source

…specializes CompilerTool, which is a doozy of mixins:

A render of the CompilerTool class’s source

Each mixin, in turn, provides some modeled functionality common between one or more tools.

ResponseFileMixin, for example, specializes the behavior of Tool.args for tools that support the @file syntax for supplying additional arguments via an on-disk file (in particular, CC, CXX, and LD):

A render of the ResponseFileMixin class’s source

Other mixins make heavy use of Python 3’s Enum class to strictly model the expected behavior of common tool flags, like -std=STANDARD

A render of the StdMixin class’s source

…where Std:

A render of the Std enum

Taking action

By default, blight does absolutely nothing: it’s just a framework for wrapping build tools. The magic happens when we begin to inject actions before and after each tool invocation.

Internally, the actions API mirrors the tool API: Each tool has a corresponding class under blight.action (e.g., CCCCAction):

To add an action, you just specify its name in the BLIGHT_ACTIONS environment variable. Multiple actions can be specified with : as the delimiter, and will be executed in a left-to-right order. Only actions that “match” a particular tool are run, meaning that an action with CCAction as its parent will never (incorrectly) run on a CXX invocation.

To bring the concepts home, here’s blight running with two stock actions: IgnoreWerror and InjectFlags:

An example blight run

In this case, IgnoreWerror strips any instances of –Werror that it sees in compiler tools (i.e., CC and CXX), while InjectFlags injects a configurable set of arguments via a set of nested variables. We’ll see how that configuration works in a bit.

For example, here’s InjectFlags:

A render of the InjectFlags class

In particular, note that InjectFlags is a CompilerAction, meaning that its events (in this case, just before_run) are only executed if the underlying tool is CC or CXX.

Writing and configuring your own actions

Writing a new action is straightforward: They live in the blight.actions module, and all inherit from one of the specializations of blight.action.Action.

For example, here’s an action that prints a friendly message before and after every invocation of tool.AS (i.e., the standard assembler)…

…and blight will take care of the rest—all you need to do is specify SayHello in BLIGHT_ACTIONS!


What about actions that need to be configured, such as a configurable output file?

Remember the InjectFlags action above: Every loaded action can opt to receive configuration settings via the self._config dictionary, which the action API parses behind the scenes from BLIGHT_ACTION_ACTIONNAME where ACTIONNAME is the uppercase form of the actual action name.

How is that environment variable parsed? Very simply:

A render of the config parsing code for actions

Spelled out, the configuration string is split according to shell lexing rules, and then once again split from KEY=VALUE pairs.

This should be a suitable base for most configuration needs. However, because blight’s actions are just plain old Python classes, they can implement their own configuration approaches as desired.

A new paradigm for build instrumentation

blight is the substrate for a new generation of build wrapping and instrumentation tools. Instead of modeling the vagaries of a variety of tool CLIs themselves, new wrapping tools can rely on blight to get the modeling right and get directly to their actual tasks.

Some ideas that we’ve had during blight’s development:

  • An action that provides real-time build statistics via WebSockets, allowing developers to track the progress of arbitrary builds.
  • A rewrite of tools like WLLVM and GLLVM, enabling higher-fidelity tracking of troublesome edge cases (e.g., in-tree assembly files and builds that generate explicit assembly intermediates).
  • A feedback mechanism for trying performance options. Choosing between optimization flags can be fraught, so a blight action that parametrizes builds across a matrix of potential options could help engineers select the appropriate flags for their projects.

We’ve already written some actions for our own use cases, but we believe blight can be useful to the wider build tooling and instrumentation communities. If you’re interested in working with us on it or porting your existing tools to blight’s framework, contact us!

  1. This, it turns out, is nontrivial: Compilation flags can affect the ABI, so a sound compilation cache must correctly model the various flags passed to the compiler and only hit the intermediate cache if two (likely different) sets of flags from separate invocations are not incompatible.↩︎
  2. Another use case: Companies usually want to strip their application binaries of all symbol and debug information before release, to hamper reverse engineering. Doing so is usually part of a release checklist and should be enforced by the build system, and yet companies repeatedly manage to leak debug and symbol information. A build tool wrapper could provide another layer of assurance.↩︎
  3. In a rare divergence for Microsoft, cl.exe does allow you to use -opt instead of /opt for all command-line options. Unfortunately, most of cl.exe’s options bear no resemblance to GCC or Clang’s, so it doesn’t make much of a difference.↩︎

Smart (and simple) ways to prevent symlink attacks in Go

After writing Go for years, many of us have learned the error-checking pattern down to our bones: “Does this function return an error? Ope, better make sure it’s nil before moving on.”

And that’s great! This should be our default behavior when writing Go.

However, rote error checking can sometimes prevent critical thinking about what that error actually means to the code: When does that function return an error? And does it encompass everything you think it does?

For instance, in os.Create, the nil error value can trick you into thinking you’re safe with file creation. Reading the linked documentation, os.Create actually truncates the file when it already exists instead of throwing any indication that it’s not a new file.

This leaves us vulnerable to a symlink attack.

Does it exist?

Say my program needs to create and use a file. Almost every example of idiomatic Go code guides us through an error check, but no validation of whether the file existed before Create was called. If a symbolic link had already been set up for that file, no error would occur, but the file and its contents would not behave as intended due to the truncation behavior. The risk is that we can remove information using the program to overwrite it for us.

At Trail of Bits, this issue comes up frequently in audits. Thankfully, the fix for it is incredibly easy. We just need to check if a file exists before attempting to create it. A slight tweak in our approach to idiomatic Go can make file creation safer long term and take us one step closer to prioritizing security in Go programs.

The situation

Let’s say there’s a file, my_logs, that I need to create and write to. However, in another part of the codebase, someone previously set up a symlink with ln -s other/logs my_logs.

- logs
- notes
- things I care about
- very important information we can't lose

Contents of other/logs.

package main

import (

func main() {
	file, err := os.Create("my_logs")
	if err != nil {
		fmt.Printf("Error creating file: %s", err)

	_, err = file.Write([]byte("My logs for this process"))
	if err != nil {



$ ln -s other/logs my_logs
$ go build symlink_attack.go
$ ./symlink_attack
$ cat other/logs
- My logs for this process

As you can see, the content of other/logs is wiped even though our program only interacted with my_logs.

Even in this accidental scenario, os.Create removes important data through its truncation behavior. In malicious scenarios, an attacker could leverage the truncation behavior against the user to remove specific data—perhaps audit logs that would have revealed their presence on the box at one point.

Simple fix, two approaches

To remedy this, we have to insert an os.IsNotExist check before calling Create. If you run the edited symlink_attack.go below, the data in other/logs remains and is not overwritten.

package main

import (

func main() {

	if fileExists("my_logs") {
		log.Fatalf("symlink attack failure")

	file, err := os.Create("my_logs")
	if err != nil {
		fmt.Printf("Error creating file: %s", err)

	_, err = file.Write([]byte("My logs for this process"))
	if err != nil {
		fmt.Printf("Failure to write: %s", err)

func fileExists(filename string) bool {
	info, err := os.Stat(filename)
	if os.IsNotExist(err) {
		return false
	return !info.IsDir()

symlink_attack.go with IsNotExist check.

The stipulation here is that by checking os.IsNotExist before creating, we put ourselves in a position where we can’t verify whether a symlink was created between the existence check and the file creation (a time-of-check vs. time-of-use bug). To account for that, we can take a few different approaches.

The first approach is to recreate the implementation of os.Create with your own OpenFile command, thus eliminating the truncation.

func Create(name string) (*File, error) {
    return OpenFile(name, O_RDWR\|O_CREATE\|O_TRUNC, 0666)

os pkg definition for Create.

package main

import (

func main() {
	file, err := os.OpenFile("my_logs", os.O_RDWR|os.O_CREATE|syscall.O_NOFOLLOW, 0666)
	if err != nil {

	_, err = file.Write([]byte("Is this the only thing in the file\n"))
	if err != nil {
		fmt.Printf("Failure to write: %s", err)
	err = file.Close()
	if err != nil {
		fmt.Printf("Couldn't close file: %s", err)
	buf, err := ioutil.ReadFile("./my_logs")
	if err != nil {
		fmt.Printf("Failed to read file: %s", err)

	fmt.Printf("%s", buf)

symlink_attack.go OpenFile with O_NOFOLLOW to avoid following symlinks.

By opening the file with O_NOFOLLOW, you will not follow a symlink. So when a new file is created, this will work the same as os.Create. However, it will fail to open if a symlink is set up in that location.

The alternative is to create a TempFile and use os.Rename to move it to your preferred location.

package main

import (

func main() {
	tmpfile, err := ioutil.TempFile(".", "")
	if err != nil {

	os.Rename(tmpfile.Name(), "my_logs")
	if _, err := tmpfile.Write([]byte("Is this the only thing in the file")); err != nil {

	buf, err := ioutil.ReadFile("./my_logs")
	if err != nil {
		fmt.Printf("Failed to read file: %s", err)

	fmt.Printf("%s", buf)


symlink_attack.go with TempFile creation and os.Rename.

This pattern broke the symlink between my_logs and other/logs. other/logs still had its contents and my_logs had only the contents “Is this the only thing in the file,” as intended.

Protect future you, now

No matter how careful you are about checking errors in Go, they’re not always behaving the way you might think (tl;dr: rtfm). But updating your practices within Go file creation is really simple, and can save you from unintended consequences.

Good idea, bad design: How the Diamond standard falls short

TL;DR: We audited an implementation of the Diamond standard proposal for contract upgradeability and can’t recommend it in its current form—but see our recommendations and upgrade strategy guidance.

We recently audited an implementation of the Diamond standard code, a new upgradeability pattern. It’s a laudable undertaking, but the Diamond proposal and implementation raise many concerns. The code is over-engineered, with lots of unnecessary complexities, and we can’t recommend it at this time.

Of course, the proposal is still a draft, with room to grow and improve. A working upgradeability standard should include:

  • A clear, simple implementation. Standards should be easy to read to simplify integration with third-party applications.
  • A thorough checklist of upgrade procedures. Upgrading is a risky process that must be thoroughly explained.
  • On-chain mitigations against the most common upgradeability mistakes, including function shadowing and collisions. Several mistakes, though easy to detect, can lead to severe issues. See slither-check-upgradeability for many pitfalls that can be mitigated.
  • A list of associated risks. Upgradeability is difficult; it can conceal security considerations or imply that risks are trivial. EIPs are proposals to improve Ethereum, not commercial advertisements.
  • Tests integrated with the most common testing platforms. The tests should highlight how to deploy the system, how to upgrade a new implementation, and how an upgrade can fail.

Unfortunately, the Diamond proposal fails to address these points. It’s too bad, because we’d love to see an upgradeable standard that solves or at least mitigates the main security pitfalls of upgradeable contracts. Essentially, standard writers must assume that developers will make mistakes, and aim to build a standard that alleviates them.

Still, there’s plenty to learn from the Diamond proposal. Read on to see:

  • How the Diamond proposal works
  • What our review revealed
  • Our recommendations
  • Upgradeability standard best practices

The Diamond proposal paradigm

The Diamond proposal is a work-in-progress defined in EIP 2535. The draft claims to propose a new paradigm for contract upgradeability based on delegatecall. (FYI, here’s an overview of how upgradeability works.) EIP 2535 proposes the use of:

  1. A lookup table for the implementation
  2. An arbitrary storage pointer

Lookup table

The delegatecall-based upgradeability mainly works with two components—a proxy and an implementation:

Figure 1: delegatecall-based upgradeability with a single implementation

The user interacts with the proxy and the proxy delegatecall to the implementation. The implementation code is executed, while the storage is kept in the proxy.

Using a lookup table allows delegatecalls to multiple contract implementations, where the proper implementation is selected according to the function to be executed:

Figure 2: delegatecall-based upgradeability with multiple implementations.

This schema is not new; other projects have used such lookup tables for upgradeability in the past. See ColonyNetwork for an example.

Arbitrary storage pointer

The proposal also suggests using a feature recently introduced into Solidity: the arbitrary storage pointer, which (like the name says) allows assignment of a storage pointer to an arbitrary location.

Because the storage is kept on the proxy, the implementation’s storage layout must follow the proxy’s storage layout. It can be difficult to keep track of this layout when doing an upgrade (see examples here).

The EIP proposes that every implementation have an associated structure to hold the implementation variables, and a pointer to an arbitrary storage location where the structure will be stored. This is similar to the unstructured storage pattern, where the new Solidity feature allows use of a structure instead of a single variable.

It is assumed that two structures from two different implementations cannot collide as long as their respective base pointers are different.

   bytes32 constant POSITION = keccak256(

    struct MyStruct {
        uint var1;
        uint var2;

    function get_struct() internal pure returns(MyStruct storage ds) {
        bytes32 position = POSITION;
        assembly { ds_slot := position }

Figure 3: Storage pointer example.

Figure 4: Storage pointer representation.

BTW, what’s a “diamond”?

EIP 2535 introduces “diamond terminology,” wherein the word “diamond” means a proxy contract, “facet” means an implementation, and so on. It’s unclear why this terminology was introduced, especially since the standard terminology for upgradeability is well known and defined. Here’s a key to help you translate the proposal if you go through it:

Diamond vocabulary Common name
Diamond Proxy
Facet Implementation
Cut Upgrade
Loupe List of delegated functions
Finished diamond Non-upgradeable
Single cut diamond Remove upgradeability functions

Figure 5: The Diamond proposal uses new terms to refer to existing ideas.

Audit findings and recommendations

Our review of the diamond implementation found that:

  • The code is over-engineered and includes several misplaced optimizations
  • Using storage pointers has risks
  • The codebase had function shadowing
  • The contract lacks an existence check
  • The diamond vocabulary adds unnecessary complexity

Over-engineered code

While the pattern proposed in the EIP is straightforward, its actual implementation is difficult to read and review, increasing the likelihood of issues.

For example, a lot of the data kept on-chain is cumbersome. While the proposal only needs a lookup table, from the function signature to the implementation’s address, the EIP defines many interfaces that require storage of additional data:

interface IDiamondLoupe {
    /// These functions are expected to be called frequently
    /// by tools.

    struct Facet {
        address facetAddress;
        bytes4[] functionSelectors;

    /// @notice Gets all facet addresses and their four byte function selectors.
    /// @return facets_ Facet
    function facets() external view returns (Facet[] memory facets_);

    /// @notice Gets all the function selectors supported by a specific facet.
    /// @param _facet The facet address.
    /// @return facetFunctionSelectors_
    function facetFunctionSelectors(address _facet) external view returns (bytes4[] memory facetFunctionSelectors_);

    /// @notice Get all the facet addresses used by a diamond.
    /// @return facetAddresses_
    function facetAddresses() external view returns (address[] memory facetAddresses_);

    /// @notice Gets the facet that supports the given selector.
    /// @dev If facet is not found return address(0).
    /// @param _functionSelector The function selector.
    /// @return facetAddress_ The facet address.
    function facetAddress(bytes4 _functionSelector) external view returns (address facetAddress_);

Figure 6: Diamond interfaces.

Here, facetFunctionSelectors returns all the function selectors of an implementation. This information will only be useful for off-chain components, which can already extract the information from the contract’s events. There’s no need for such a feature on-chain, especially since it significantly increases code complexity.

Additionally, much of the code complexity is due to optimization in locations that don’t need it. For example, the function used to upgrade an implementation should be straightforward. Taking a new address and a signature, it should update the corresponding entry in the lookup table. Well, part of the function doing so is the following:

           // adding or replacing functions
            if (newFacet != 0) {
                // add and replace selectors
                for (uint selectorIndex; selectorIndex &amp;amp;amp;amp;amp;lt; numSelectors; selectorIndex++) {
                    bytes4 selector;
                    assembly {
                        selector := mload(add(facetCut,position))
                    position += 4;
                    bytes32 oldFacet = ds.facets[selector];
                    // add
                    if(oldFacet == 0) {
                        // update the last slot at then end of the function
                        slot.updateLastSlot = true;
                        ds.facets[selector] = newFacet | bytes32(selectorSlotLength) &amp;amp;amp;amp;amp;lt;&amp;amp;amp;amp;amp;lt; 64 | bytes32(selectorSlotsLength);
                        // clear selector position in slot and add selector
                        slot.selectorSlot = slot.selectorSlot &amp;amp;amp;amp;amp;amp; ~(CLEAR_SELECTOR_MASK &amp;amp;amp;amp;amp;gt;&amp;amp;amp;amp;amp;gt; selectorSlotLength * 32) | bytes32(selector) &amp;amp;amp;amp;amp;gt;&amp;amp;amp;amp;amp;gt; selectorSlotLength * 32;
                        // if slot is full then write it to storage
                        if(selectorSlotLength == 8) {
                            ds.selectorSlots[selectorSlotsLength] = slot.selectorSlot;
                            slot.selectorSlot = 0;
                            selectorSlotLength = 0;
                    // replace
                    else {
                        require(bytes20(oldFacet) != bytes20(newFacet), "Function cut to same facet.");
                        // replace old facet address
                        ds.facets[selector] = oldFacet &amp;amp;amp;amp;amp;amp; CLEAR_ADDRESS_MASK | newFacet;

Figure 7: Upgrade function.

A lot of effort was made to optimize this function’s gas efficiency. But upgrading a contract is rarely done, so it would never be an expensive operation anyway, no matter what its gas cost.

In another example of unnecessary complexity, bitwise operations are used instead of a structure:

uint selectorSlotsLength = uint128(slot.originalSelectorSlotsLength);
uint selectorSlotLength = uint128(slot.originalSelectorSlotsLength &amp;amp;amp;amp;amp;gt;&amp;amp;amp;amp;amp;gt; 128);
        // uint32 selectorSlotLength, uint32 selectorSlotsLength
        // selectorSlotsLength is the number of 32-byte slots in selectorSlots.
        // selectorSlotLength is the number of selectors in the last slot of
        // selectorSlots.
        uint selectorSlotsLength;

Figure 8: Use of bitwise operations instead of a structure.

Update November 5th:
Since our audit, the reference implementation has changed, but its underlying complexity remains. There are now three reference implementations, which makes everything even more confusing for users, and further review of the proposal is more difficult.

Our recommendations:

  • Always strive for simplicity, and keep as much code as you can off-chain.
  • When writing a new standard, keep the code readable and easy to understand.
  • Analyze the needs before implementing optimizations.

Storage pointer risks

Despite the claim that collisions are impossible if the base pointers are different, a malicious contract can collide with a variable from another implementation. Basically, it’s possible because of the way Solidity stores variables and affects mapping or arrays. For example:

contract TestCollision{
    // The contract represents two implementations, A and B
    // A has a nested structure 
    // A and B have different bases storage pointer 
    // Yet writing in B, will lead to write in A variable
    // This is because the base storage pointer of B 
    // collides with A.ds.my_items[0].elems
    bytes32 constant public A_STORAGE = keccak256(
    struct InnerStructure{
        uint[] elems;
    struct St_A {
        InnerStructure[] my_items;

    function pointer_to_A() internal pure returns(St_A storage s) {
        bytes32 position = A_STORAGE;
        assembly { s_slot := position }
    bytes32 constant public B_STORAGE = keccak256(
    struct St_B {
        uint val;

    function pointer_to_B() internal pure returns(St_B storage s) {
        bytes32 position = B_STORAGE;
        assembly { s_slot := position }
    constructor() public{
        St_A storage ds = pointer_to_A();
    function get_balance() view public returns(uint){
        St_A storage ds = pointer_to_A();
        return ds.my_items[0].elems[0];
    function exploit(uint new_val) public{
        St_B storage ds = pointer_to_B();
        ds.val = new_val;

Figure 9: Storage pointer collision.

In exploit, the write to the B_STORAGE base pointer will actually write to the my_items[0].elems[0], which is read from the A_STORAGE base pointer. A malicious owner could push an upgrade that looks benign, but contains a backdoor.

The EIP has no guidelines for preventing these malicious collisions. Additionally, if a pointer is reused after being deleted, the re-use will lead to data compromise.

Our recommendations

  • Low-level storage manipulations are risky, so be extra careful when designing a system that relies on them.
  • Using unstructured storage with structures for upgradeability is an interesting idea, but it requires thorough documentation and guidelines on what to check for in a base pointer.

Function shadowing

Upgradeable contracts often have functions in the proxy that shadow the functions that should be delegated. Calls to these functions will never be delegated, as they will be executed in the proxy. Additionally, the associated code will not be upgradeable.

contract Proxy {

    constructor(...) public{
          // add my_targeted_function() 
          // as a delegated function
    function my_targeted_function() public{
    fallback () external payable{
          // delegate to implementations

Figure 10: Simplification of a shadowing issue.

Although this issue is well known, and the code was reviewed by the EIP author, we found two instances of function-shadowing in the contracts.

Our recommendations

  • When writing an upgradeable contract, use or slither-check-upgradeability to catch instances of shadowing.
  • This issue highlights an important point: Developers make mistakes. Any new standard should include mitigations for common mistakes if it’s to work better than custom solutions.

No contract existence check

Another common mistake is the absence of an existence check for the contract’s code. If the proxy delegates to an incorrect address, or implementation that has been destructed, the call to the implementation will return success even though no code was executed (see the Solidity documentation). As a result, the caller will not notice the issue, and such behavior is likely to break third-party contract integration.

   fallback() external payable {
        DiamondStorage storage ds;
        bytes32 position = DiamondStorageContract.DIAMOND_STORAGE_POSITION;
        assembly { ds_slot := position }
        address facet = address(bytes20(ds.facets[msg.sig]));
        require(facet != address(0), "Function does not exist.");
        assembly {
            calldatacopy(0, 0, calldatasize())
            let result := delegatecall(gas(), facet, 0, calldatasize(), 0, 0)
            let size := returndatasize()
            returndatacopy(0, 0, size)
            switch result
            case 0 {revert(0, size)}
            default {return (0, size)}

Figure 11: Fallback function without contract’s existence check.

Our recommendations

  • Always check for contract existence when calling an arbitrary contract.
  • If gas cost is a concern, only perform this check if the call returns no data, since the opposite result means that some code was executed.

Unnecessary Diamond vocabulary

As noted, the Diamond proposal relies heavily on its newly created vocabulary. This is error-prone, makes review more difficult, and does not benefit developers.

  1. A diamond is a contract that uses functions from its facets to execute function calls. A diamond can have one or more facets.
  2. The word facet comes from the diamond industry. It is a side, or flat surface of a diamond. A diamond can have many facets. In this standard a facet is a contract with one or more functions that executes functionality of a diamond.
  3. A loupe is a magnifying glass that is used to look at diamonds. In this standard a loupe is a facet that provides functions to look at a diamond and its facets.

Figure 12: The EIP redefines standard terms to ones that are unrelated to software engineering.

Our recommendation

  • Use the common, well-known vocabulary, and do not invent terminology when it’s not needed.

Is the Diamond proposal a dead end?

As noted, we still believe the community would benefit from a standardized upgradeability schema. But the current Diamond proposal does not meet the expected security requirements or bring enough benefits over a custom implementation.

However, the proposal is still a draft, and could evolve into something simpler and better. And even if it doesn’t, some of the existing techniques used, such as the lookup table and arbitrary storage pointers, are worth continuing to explore.

So…is upgradeability feasible or not?

Over the years, we’ve reviewed many upgradeable contracts and published several analyses on this topic. Upgradeability is difficult, error-prone, and increases risk, and we still generally don’t recommend it as a solution. But developers who need upgradeability in their contracts should:

And please contact us if you have any questions about your upgrade strategy. We’re ready to help!

Efficient audits with machine learning and Slither-simil

by Sina Pilehchiha, Concordia University

Trail of Bits has manually curated a wealth of data—years of security assessment reports—and now we’re exploring how to use this data to make the smart contract auditing process more efficient with Slither-simil.

Based on accumulated knowledge embedded in previous audits, we set out to detect similar vulnerable code snippets in new clients’ codebases. Specifically, we explored machine learning (ML) approaches to automatically improve on the performance of Slither, our static analyzer for Solidity, and make life a bit easier for both auditors and clients.

Currently, human auditors with expert knowledge of Solidity and its security nuances scan and assess Solidity source code to discover vulnerabilities and potential threats at different granularity levels. In our experiment, we explored how much we could automate security assessments to:

  1. Minimize the risk of recurring human error, i.e., the chance of overlooking known, recorded vulnerabilities.
  2. Help auditors sift through potential vulnerabilities faster and more easily while decreasing the rate of false positives.


Slither-simil, the statistical addition to Slither, is a code similarity measurement tool that uses state-of-the-art machine learning to detect similar Solidity functions. When it began as an experiment last year under the codename crytic-pred, it was used to vectorize Solidity source code snippets and measure the similarity between them. This year, we’re taking it to the next level and applying it directly to vulnerable code.

Slither-simil currently uses its own representation of Solidity code, SlithIR (Slither Intermediate Representation), to encode Solidity snippets at the granularity level of functions. We thought function-level analysis was a good place to start our research since it’s not too coarse (like the file level) and not too detailed (like the statement or line level.)

Figure 1: A high-level view of the process workflow of Slither-simil.

In the process workflow of Slither-simil, we first manually collected vulnerabilities from the previous archived security assessments and transferred them to a vulnerability database. Note that these are the vulnerabilities auditors had to find with no automation.

After that, we compiled previous clients’ codebases and matched the functions they contained with our vulnerability database via an automated function extraction and normalization script. By the end of this process, our vulnerabilities were normalized SlithIR tokens as input to our ML system.

Here’s how we used Slither to transform a Solidity function to the intermediate representation SlithIR, then further tokenized and normalized it to be an input to Slither-simil:

function transferFrom(address _from, address _to, uint256 _value) public returns (bool success) {
        require(_value <= allowance[_from][msg.sender]);     // Check allowance
        allowance[_from][msg.sender] -= _value;
        _transfer(_from, _to, _value);
        return true;

Figure 2: A complete Solidity function from the contract TurtleToken.sol.

Function TurtleToken.transferFrom(address,address,uint256) (*)

Solidity Expression: require(bool)(_value <= allowance[_from][msg.sender])
         REF_10(mapping(address => uint256)) ->    allowance[_from]
         REF_11(uint256) -> REF_10[msg.sender]
         TMP_16(bool) = _value <= REF_11
         TMP_17 = SOLIDITY_CALL require(bool)(TMP_16)

Solidity Expression: allowance[_from][msg.sender] -= _value
         REF_12(mapping(address => uint256)) -> allowance[_from]
         REF_13(uint256) -> REF_12[msg.sender]
         REF_13(-> allowance) = REF_13 - _value

Solidity Expression: _transfer(_from,_to,_value)
         INTERNAL_CALL,      TurtleToken._transfer(address,address,uint256)(_from,_to,_value)

Solidity Expression: true
         RETURN True

Figure 3: The same function with its SlithIR expressions printed out.

First, we converted every statement or expression into its SlithIR correspondent, then tokenized the SlithIR sub-expressions and further normalized them so more similar matches would occur despite superficial differences between the tokens of this function and the vulnerability database.







(state_solc_variable(string)):=(local_solc_variable(memory, string))

(state_solc_variable(string)):=(local_solc_variable(memory, string))


Figure 4: Normalized SlithIR tokens of the previous expressions.

After obtaining the final form of token representations for this function, we compared its structure to that of the vulnerable functions in our vulnerability database. Due to the modularity of Slither-simil, we used various ML architectures to measure the similarity between any number of functions.

$ slither-simil test etherscan_verified_contracts.bin --filename TurtleToken.sol --fname TurtleToken.transferFrom --input cache.npz --ntop 5

Reviewed 825062 functions, listing the 5 most similar ones:

filename            contract      function     score
TokenERC20.sol      TokenERC20    freeze       0.991
ETQuality.sol       StandardToken transferFrom 0.936
NHST.sol            NHST          approve      0.889

Figure 5: Using Slither-simil to test a function from a smart contract with an array of other Solidity contracts.

Let’s take a look at the function transferFrom from the ETQuality.sol smart contract to see how its structure resembled our query function:

function transferFrom(address _from, address _to, uint256 _value) returns (bool success) {
    if (balances[_from] >= _value && allowed[_from][msg.sender] >= _value && _value > 0) {
        balances[_to] += _value;
        balances[_from] -= _value;
        allowed[_from][msg.sender] -= _value;
        Transfer(_from, _to, _value);
        return true;
    } else { return false; }

Figure 6: Function transferFrom from the ETQuality.sol smart contract.

Comparing the statements in the two functions, we can easily see that they both contain, in the same order, a binary comparison operation (>= and <=), the same type of operand comparison, and another similar assignment operation with an internal call statement and an instance of returning a “true” value.

As the similarity score goes lower towards 0, these sorts of structural similarities are observed less often and in the other direction; the two functions become more identical, so the two functions with a similarity score of 1.0 are identical to each other.

Related Research

Research on automatic vulnerability discovery in Solidity has taken off in the past two years, and tools like Vulcan and SmartEmbed, which use ML approaches to discovering vulnerabilities in smart contracts, are showing promising results.

However, all the current related approaches focus on vulnerabilities already detectable by static analyzers like Slither and Mythril, while our experiment focused on the vulnerabilities these tools were not able to identify—specifically, those undetected by Slither.

Much of the academic research of the past five years has focused on taking ML concepts (usually from the field of natural language processing) and using them in a development or code analysis context, typically referred to as code intelligence. Based on previous, related work in this research area, we aim to bridge the semantic gap between the performance of a human auditor and an ML detection system to discover vulnerabilities, thus complementing the work of Trail of Bits human auditors with automated approaches (i.e., Machine Programming, or MP).


We still face the challenge of data scarcity concerning the scale of smart contracts available for analysis and the frequency of interesting vulnerabilities appearing in them. We can focus on the ML model because it’s sexy but it doesn’t do much good for us in the case of Solidity where even the language itself is very young and we need to tread carefully in how we treat the amount of data we have at our disposal.

Archiving previous client data was a job in itself since we had to deal with the different solc versions to compile each project separately. For someone with limited experience in that area this was a challenge, and I learned a lot along the way. (The most important takeaway of my summer internship is that if you’re doing machine learning, you will not realize how major a bottleneck the data collection and cleaning phases are unless you have to do them.)

Figure 7: Distribution of 89 vulnerabilities found among 10 security assessments.

The pie chart shows how 89 vulnerabilities were distributed among the 10 client security assessments we surveyed. We documented both the notable vulnerabilities and those that were not discoverable by Slither.

The Road Ahead for Slither-simil

This past summer we resumed the development of Slither-simil and SlithIR with two goals in mind:

  • Research purposes, i.e., the development of end-to-end similarity systems lacking feature engineering.
  • Practical purposes, i.e., adding specificity to increase precision and recall.

We implemented the baseline text-based model with FastText to be compared with an improved model with a tangibly significant difference in results; e.g., one not working on software complexity metrics, but focusing solely on graph-based models, as they are the most promising ones right now.

For this, we have proposed a slew of techniques to try out with the Solidity language at the highest abstraction level, namely, source code.

To develop ML models, we considered both supervised and unsupervised learning methods. First, we developed a baseline unsupervised model based on tokenizing source code functions and embedding them in a Euclidean space (Figure 8) to measure and quantify the distance (i.e., dissimilarity) between different tokens. Since functions are constituted from tokens, we just added up the differences to get the (dis)similarity between any two different snippets of any size.

The diagram below shows the SlithIR tokens from a set of training Solidity data spherized in a three-dimensional Euclidean space, with similar tokens closer to each other in vector distance. Each purple dot shows one token.

Figure 8: Embedding space containing SlithIR tokens from a set of training Solidity data

We are currently developing a proprietary database consisting of our previous clients and their publicly available vulnerable smart contracts, and references in papers and other audits. Together they’ll form one unified comprehensive database of Solidity vulnerabilities for queries, later training, and testing newer models.

We’re also working on other unsupervised and supervised models, using data labeled by static analyzers like Slither and Mythril. We’re examining deep learning models that have much more expressivity we can model source code with—specifically, graph-based models, utilizing abstract syntax trees and control flow graphs.

And we’re looking forward to checking out Slither-simil’s performance on new audit tasks to see how it improves our assurance team’s productivity (e.g., in triaging and finding the low-hanging fruit more quickly). We’re also going to test it on Mainnet when it gets a bit more mature and automatically scalable.

You can try Slither-simil now on this Github PR. For end users, it’s the simplest CLI tool available:

  1. Input one or multiple smart contract files (either directory, .zip file, or a single .sol).
  2. Identify a pre-trained model, or separately train a model on a reasonable amount of smart contracts.
  3. Let the magic happen, and check out the similarity results.
$ slither-simil test etherscan_verified_contracts.bin --filename MetaCoin.sol --fname MetaCoin.sendCoin --input cache.npz


Slither-simil is a powerful tool with potential to measure the similarity between function snippets of any size written in Solidity. We are continuing to develop it, and based on current results and recent related research, we hope to see impactful real-world results before the end of the year.

Finally, I’d like to thank my supervisors Gustavo, Michael, Josselin, Stefan, Dan, and everyone else at Trail of Bits, who made this the most extraordinary internship experience I’ve ever had.

Let’s build a high-performance fuzzer with GPUs!

by Ryan Eberhardt, Stanford University

TL;DR: Can we use GPUs to get 10x performance/dollar when fuzzing embedded software in the cloud? Based on our preliminary work, we think the answer is yes!

Fuzzing is a software testing technique that supplies programs with many randomized inputs in an attempt to cause unexpected behavior. It’s an important, industry-standard technique responsible for the discovery of many security vulnerabilities and the prevention of many more. However, fuzzing well takes time, and fuzzing embedded software presents additional challenges.

Embedded platforms are not designed for the instrumentation and high-throughput computation required to find bugs via fuzzing. Without access to source code, practical fuzzing of such platforms requires an emulator, which is slow, or many physical devices, which are often impractical.

Most fuzzing approaches have used conventional CPU architectures or emulators, but we decided to use other commodity hardware to tackle this problem—in particular, GPUs. The recent boom in machine learning has driven down the off-peak price of GPUs and made massive GPU computing capacity readily available from all major cloud providers. GPUs are very good at executing tasks in parallel, and fuzzing is an easily parallelizable problem.

In this blog post, I’ll walk you through the design and implementation of this massively parallel GPU-based fuzzer. So far, we’ve implemented an execution engine that can achieve 5x more executions/second/dollar than libFuzzer—and there’s much more room for optimization.

The Pitch: Fuzzing with GPUs

Fuzzing aims to generate unexpected inputs to a program and cause undesired behaviors (e.g., crashes or memory errors). The most commonly used fuzzers are coverage-guided fuzzers, which focus on finding inputs that cause new code coverage (such as executing a function that wasn’t executed before) to explore edge cases that may crash the program.


To do this, fuzzers run many different randomized inputs through the target program. This task is easily parallelizable, as each input can be executed independently of the others.

GPUs are fairly cheap; it costs $0.11/hour for a pre-emptible Tesla T4 on Google Cloud. Also, GPUs are really good at executing many things in parallel—a Tesla T4 can context-switch between over 40,000 threads and can simultaneously execute 2,560 of them in parallel—and, as mentioned, fuzzing is a naturally parallelizable problem. Using thousands of threads, we should theoretically be able to test thousands of different inputs at the same time.

Why Hasn’t Anyone Done This Before?

In short, running code on GPUs is very different from running code on CPUs in a few critical ways.

First, a GPU cannot directly execute x86/aarch64/etc. instructions, as GPUs have their own instruction set. Our goal is to fuzz embedded software for which no source code is available. With only a binary in hand, we have no easy way of generating GPU assembly to run.

Second, a GPU has no operating system. Traditional parallel fuzzers  launch multiple processes that can execute inputs separately without interfering with other processes. If an input causes one process to crash, other processes will be unaffected. GPUs have no notion of processes or address space isolation, and any memory violation will cause the entire fuzzer to crash, so we need to find some way to isolate the parallel instances of the program being fuzzed.

Additionally, without an operating system, there’s no one home to answer system calls, which enable a program to open files, use the network, and so on. System calls must be emulated or relayed back to the CPU to be executed by the host operating system.

Finally, GPU memory is hard to manage well. GPUs have complicated memory hierarchies with several different types of memory, and each has different ease-of-use and performance characteristics. Performance is highly dependent on memory access patterns, and controlling when and how threads access memory can make or break an application. Additionally, there isn’t much memory to go around, making it even more difficult to properly manage memory layout and access patterns. Having 16GB of device memory might sound impressive, but splitting it between 40,000 threads of execution leaves each thread with a paltry 419 KiB.

Can We Build It? Yes We Can!

There are many obstacles to building a working GPU fuzzer, but none of them are insurmountable.

Executing code with binary translation

First, let’s see if we can get aarch64 binaries running on the GPU.

As mentioned, we want to fuzz embedded binaries (e.g., ARMv7, aarch64, etc.) on the GPU. NVIDIA GPUs use a different instruction set architecture called PTX (“Parallel Thread eXecution”), so we cannot directly execute the binaries we want to fuzz. A common solution to this problem is to emulate an embedded CPU, but developing a CPU emulator for GPUs would likely be an expensive investment that would perform poorly. Another alternative is to translate binaries to PTX so they can execute directly on the GPU without emulation.

Trail of Bits has developed a binary translation tool called Remill that we can use to do this. Remill “lifts” binaries to LLVM IR (intermediate representation), which can then be retargeted and compiled to any architecture supported by the LLVM project. It just so happens that LLVM supports emitting LLVM IR as PTX code, which is perfect for our purposes.

Say we have this simple example function, which sets w19 to 0, adds 5, and returns the result:

    mov w19, #0       // Store the number 0 in register w19
    add w19, w19, #5  // Add 5
    mov w0, w19       // Return the result

We can pass the bytes for these instructions to Remill, which produces LLVM IR that models the original program executing on an ARM processor:

// Simplified for brevity :)
define dso_local %struct.Memory* @_Z5sliceP6MemorymPm(%struct.Memory* readnone returned %0, i64 %1, i64* nocapture %2) local_unnamed_addr #0 {
  %4 = add i64 %1, 1
  store i64 %4, i64* %2, align 8, !tbaa !2
  ret %struct.Memory* %0

Then, with some optimizations, we can have LLVM compile the above LLVM IR to PTX assembly:

ld.param.u64  %rd1, [sub_0_param_0];
ld.param.u64  %rd2, [sub_0_param_1];
mov.u64       %rd4, 5;
st.u64        [%rd1+848], %rd4;
add.s64       %rd5, %rd2, 12;
st.u64        [%rd1+1056], %rd5;

Finally, we can load this PTX into a GPU and execute it as if we’d had access to the source code in the first place.

Managing Memory

As mentioned earlier, GPUs have no operating system to provide isolation between processes. We need to implement address space isolation so multiple instances of the fuzzed program can access the same set of memory addresses without interfering with each other, and we need to detect memory safety errors in the target program.

Remill replaces all memory access in the original program with calls to the special functions read_memory and write_memory. By providing these functions, we can implement a software memory management unit that  fills in for the missing OS functionality and mediates memory accesses.

For example, consider this function that takes a pointer and increments the integer it points to:

ldr w8, [x0]   // Load the value pointed to by the pointer
add w8, w8, #1 // Increment the value
str w8, [x0]   // Write the new value back

Remill translates this assembly into the following IR containing a read_memory call, an add instruction, and a write_memory call:

define %struct.Memory* @slice(%struct.Memory*, i64 %X8, i64* nocapture %X0_output) local_unnamed_addr #2 {
%2 = tail call i32 @__remill_read_memory_32(%struct.Memory* %0, i64 undef) #3
%3 = add i32 %2, 1
%4 = tail call %struct.Memory* @__remill_write_memory_32(%struct.Memory* %0, i64 undef, i32 %3) #3
%5 = tail call %struct.Memory* @__remill_function_return(%struct.State* nonnull undef, i64 16, %struct.Memory* %4) #2, !noalias !0
ret %struct.Memory* %5

By providing __remill_read_memory_32 and __remill_write_memory_32 functions, we can give each thread its own virtual address space. In addition, we can validate memory access and intercept invalid access before it crashes the entire fuzzer.

Remember, though, that 16GB of device memory is actually not much when shared across 40,000 threads. To conserve memory, we can use copy-on-write strategies in our MMU; threads share the same memory until one of the threads writes to memory, at which point that memory is copied. Conserving memory this way has been a surprisingly effective strategy.

Initial performance

Wonderful—we have something that works! We can take a binary program, translate it to LLVM, convert it to PTX, mix in an MMU, and run the result on thousands of GPU threads in parallel.

But how well are we meeting our goal of building a fuzzer that achieves 10x performance per dollar when compared to other fuzzers?

Evaluating fuzzers is a very tricky business, and there have been many papers published about how to effectively compare them. Our fuzzer is still too young to properly evaluate, since we are still missing critical fuzzer components such as a mutator to generate new inputs to the program. To measure executor performance only, we can look at how quickly our fuzzer runs inputs through the target program (in executions/second). By normalizing the cost of the compute hardware (GPUs are more expensive than the CPUs on which other fuzzers run), we can compare executions/second/$.

What should we fuzz in our benchmarking tests? The BPF packet filtering code from libpcap seems like a good candidate for a few reasons:

  • It implements a complicated state machine that is too difficult for humans to reason about, making it a good candidate for fuzzing.
  • BPF components have had bugs in the past, so this is a realistic target that we might want to fuzz.
  • It has no system calls (our minimal fuzzer does not yet support system calls).

Let’s write a test application that takes a packet from the fuzzer and runs a complicated BPF filter program on it:

dst host or tcp or udp or ip or ip6 or arp or rarp
or atalk or aarp or decnet or iso or stp or ipx

This test program doesn’t do a whole lot, but it does exercise complicated logic and requires a good amount of memory access.

To evaluate our fuzzer, we can compare it to libFuzzer, a fast and widely used fuzzer. This isn’t an entirely fair comparison. On one hand, libFuzzer solves an easier problem: It fuzzes using the test program’s source code, whereas our fuzzer translates and instruments a binary compiled for a different architecture. Source code is often unavailable in security research. On the other hand, libFuzzer is performing mutation to generate new inputs, which we are not doing. While obviously imperfect, this comparison will work well enough to provide order-of-magnitude estimates.

I ran this comparison using Google Compute Engine 8-core N1 instances ($0.379998/hour for non-preemptible instances at time of writing) and a Tesla T4 GPU ($0.35/hour at time of writing).

Unfortunately, our fuzzer doesn’t compare so well against libFuzzer. libFuzzer achieves 5.2M execs/s/$, and our fuzzer only achieves 361K execs/s/$.

Let’s see if we can do better…

Interleaving Memory

Before we start optimizing performance, we should profile the fuzzer to get a better sense of how it’s performing. Nvidia’s Nsight Compute profiler helps explain hardware utilization and performance bottlenecks.

From the profile, we can see that the GPU is only using 3% of its compute capacity. Most of the time, the GPU compute hardware is sitting idle, not doing anything.

This generally happens because of high memory latency: The GPU is waiting for memory reads/writes to complete. However, this isn’t happening because our fuzzer needs to access too much memory; the profile shows that the GPU is only using 45% of its available memory bandwidth. Rather, we must be accessing memory very inefficiently. Each memory access is taking a long time and not providing enough data for computation.

To fix this problem, we need a better understanding of a GPU’s execution model.

GPU threads execute in groups of 32 called a warp. All threads in a warp execute together in a parallel multiprocessor, and they run in lockstep, i.e., they run the same instruction at the same time.

When threads read or write memory, the memory accesses happen in 128-byte blocks. If the 32 threads in the warp try to read memory that lies in the same 128-byte block, then the hardware will only need to request one block (one “transaction”) from the memory bus.

However, if the threads each read memory addresses from different blocks, the hardware may need to make 32 separate memory transactions, which are often serialized. This leads to the behavior we found in the profile: The compute hardware is almost always idle because it has to wait for so many memory transactions to complete. The memory bandwidth utilization doesn’t appear quite as bad because many 128-byte chunks are being read, but only four or eight bytes out of each chunk are actually used, so much of the used bandwidth is wasted.

Currently, we allocate separate memory for each thread, so when a thread accesses memory, it very rarely falls into the same 128-byte chunk as a different thread. We can change that by allocating a slab of memory for a warp (32 threads), and interleaving the threads’ memory within that warp. This way, when threads need to access a value from memory, their values are all next to each other, and the GPU can fulfill these memory reads with a single memory transaction.

Trying this out, we find that performance improves by an order of magnitude! Clearly, it’s extremely important to be aware of memory access patterns when programming for GPUs.

Reducing Data Transfers and Kernel Launches

Re-running the profiler, we can see that we are getting much better compute utilization (33%, up from 3%), but we are still nowhere near full utilization. Can we do better?

Continuing our examination of memory usage patterns, let’s look at the type of memory used. Nvidia GPUs have several kinds of memory located in different physical places, but the easiest type to use is called “unified memory,” which automatically transfers data between different physical locations on our behalf. We have been using this because it doesn’t require us to think much about where bytes are being physically stored, but it can lead to performance bottlenecks if mismanaged, since data will be transferred between physical memory locations inefficiently.

Since we are still seeing very high memory latency, let’s take a closer look at these transfers.

Our simple fuzzer is working in “rounds”: if the GPU can run 40,000 threads, we pass 40,000 inputs to the GPU, and each thread fuzzes an input before we launch the next round. In between rounds, we reset the memory used (e.g., coverage tracking data structures and memory used by the program being fuzzed). However, this results in significant data transfers between the GPU and the CPU in between each round, as memory is paged back to the CPU, reset, and then paged back to the GPU. While these transfers are happening, the GPU is doing nothing. Additional latency is incurred as the GPU waits for the CPU to launch the next round.

We can improve this setup by doing a single launch of the GPU code and avoiding synchronicity between the CPU and GPU. Much of the data doesn’t need to be in unified memory; we can allocate global memory on the GPU instead, then asynchronously transfer data to the CPU when we need to send information about fuzzing progress (e.g., which inputs are causing crashes). In this way, when a thread finishes fuzzing an input, it can reset the memory and proceed to the next input without data transfer overhead and without waiting on the CPU.

This achieves a speedup of almost another order of magnitude! Now, the fuzzer is about five times faster per dollar than libFuzzer.

It’s extremely promising—although our fuzzer lacks a mutation engine and can’t handle system calls, exceeding libFuzzer’s performance to this degree suggests that fuzzing using GPUs may be extremely useful for certain applications.

What’s Next for GPU-Based Fuzzing?

Although we are close to our performance goal for this test program, the project still has a long way to go. Hardware utilization remains low, so there’s room for more optimization.

In addition, we need to build support for handling system calls, which may have a significant performance impact when fuzzing I/O-heavy applications. We also need to build the mutation engine before this fuzzer can be useful, although this problem is much better understood than building the execution engine.

Still, we’re very excited to be getting such promising results in such early stages of development. We look forward to an order of magnitude improvement in fuzzing embedded binaries.

We would love to hear your thoughts on this work! Contact us at or

Finally, a big “thank you” goes to Artem Dinaburg for the initial design of this system and for mentoring me throughout this project. Also, thank you to Peter Goodman for giving design feedback and debugging suggestions.

Osquery: Using D-Bus to query systemd data

by Rachel Cipkins, Stevens Institute of Technology

During my summer internship at Trail of Bits I worked on osquery, the massively popular open-source endpoint monitoring agent used for intrusion detection, threat hunting, operational monitoring, and many other functions. Available for Windows, macOS, Linux, and FreeBSD, osquery exposes an operating system as a high-performance relational database, which allows you to write SQL-based queries to explore operating system data.

My initial task was to port osquery’s startup_items table to Linux. Since the startup_items table is only available on macOS and Windows, we wanted to port it to Linux while keeping the current schema. Porting to Linux is complicated, though; like macOS and Windows, Linux has an indefinite number of locations for startup items, so I needed to parse the data in each location and insert it into the table. This would have been fairly simple, but we couldn’t directly parse the data for the systemd location. Ultimately, we added systemd support to the table through the D-Bus API and created a brand-new table for systemd units.

A note on the startup_items table…

Startup items are applications and binaries that run when your system is booted up, but “startup items” is also an abstract concept indicating some set of locations and subsystems that you want to enumerate. There are two primary types of startup item locations on Linux: user-specific locations and system-specific locations. The user-specific locations include ~/.config/autostart and ~/.config/autostart-scripts, while the system-specific locations include XDG and SysV. These all contain user-specific desktop entries and scripts, respectively.

You can see an example of the startup_items table for Linux at

Systemd and D-Bus

Systemd is not as simple to implement as the other locations; it’s an init system and service manager for Linux that uses units to represent resources. Units are resources that the system knows how to manage and operate, and these are defined in unit files. We are not able to parse these unit files directly, as we did for files relating to the other locations, and we could only get the information we needed by using an API.

So we turned to D-Bus, an interprocess communication system that allows us to interact directly with systemd to extract information about startup items. Now, systemd does have its own bus library, sd-bus, but we still prefer D-Bus because:

  • osquery uses CMake as its build system; and systemd does not. Meanwhile, D-Bus does use CMake, so it was simpler to integrate with osquery.
  • D-Bus can be used to query things other than systemd.

Here are some of the API calls we used from D-Bus to extract the information we needed:

// Connection
conn = dbus_bus_get(DBUS_BUS_SYSTEM, &error);


// Message
message =

// Reply
reply =
      dbus_connection_send_with_reply_and_block(conn, message, -1, &error);

And here’s an example of the startup_items table with systemd:

The systemd units table rises again

A few years ago the osquery community determined there was a need for some sort of table related to systemd. We restarted that conversation after successfully implementing D-Bus, and it was agreed that our unit-based table was the right direction. There are many different types of systemd units—some of the most common ones are service, socket, and device—and the systemd unit table has a column that differentiates them.

This example shows the many distinct types of units currently on my computer and narrows the results by type. Here we have executed a query for all of the services:

There are three states associated with each unit:

  • load_state indicates whether or not the unit has been properly loaded onto the system.
  • active_state indicates whether or not the unit is activated.
  • sub_state is an additional state specific to each type of unit.

Here you can see all the active units on the system:

What’s next for D-Bus and osquery?

D-Bus will allow us to query a lot of other things on osquery, including desktop environment configurations for GNOME and KDE, and network devices. Be sure to check out the new startup_items and systemd_units tables once they are merged and keep an eye out for new D-Bus features on osquery.

Want to learn more about osquery or contribute to the project? Check it out here!

Detecting Iterator Invalidation with CodeQL

by Kevin Higgs, Montgomery Blair High School

Iterator invalidation is a common and subtle class of C++ bugs that often leads to exploitable vulnerabilities. During my Trail of Bits internship this summer, I developed Itergator, a set of CodeQL classes and queries for analyzing and discovering iterator invalidation.

Results are easily interpretable by an auditor, providing information such as where an iterator is acquired, where it is invalidated, and a significance level that indicates the likelihood of a false positive. Itergator has been used to find bugs in real-world code, and the queries are easily extensible for further analysis.

Iterators Defined

Iterators are the standard way to traverse the contents of a container in C++. An iterator object supports at least two operations: dereferencing, to get the underlying object in the container; and incrementation, to get an iterator for the next element.

For example, the following code will output 1 2 3 4 5:

std::vector<int> vec{1, 2, 3, 4, 5};

for (std::vector<int>::iterator it = vec.begin(), end = vec.end(); it != end; ++it) {
    std::cout << *it << " ";

This is such a common code pattern that C++11 introduced a simplified syntax:

for (auto i : vec) {
    std::cout << i << " ";

While equivalent to the previous code, all iterator operations are now generated by the compiler. The details about the iteration are hidden from the developer.

Iterator Invalidation

Iterators are invalidated after certain modifications to their container, such as adding or erasing an element. Use of invalidated iterators is, per the standard, undefined behavior. In other words, what happens is implementation-specific and probably not good. For example, in the following code (discovered by Itergator in Cataclysm: Dark Days Ahead) the call to zones.erase invalidates the iterator it:

void zone_manager::deserialize( JsonIn &jsin )
{ zones );
    for( auto it = zones.begin(); it != zones.end(); ++it ) {
        const zone_type_id zone_type = it->get_type();
        if( !has_type( zone_type ) ) {
            zones.erase( it );
            debugmsg( "Invalid zone type: %s", zone_type.c_str() );

The iterators of a vector in libstdc++ are pointers to the vector’s backing buffer. The erase method shifts all pointers past the erased iterator to the left by one, overwriting the erased object, and decrements the end of the vector.

If the vector contains only one element, vec.end() becomes the same as vec.begin(). In the example invalidation, at the end of the first loop iteration the iterator is incremented to be the address after vec.begin(). This means the continuation condition it != zones.end() holds, so we enter the loop with the iterator referencing whatever memory exists after the backing buffer on the heap! Because of the complexity of Cataclysm, the heap layout and the crash are not deterministic, but a properly modified game save frequently results in a segmentation fault from dereferencing an invalid address.

While this is a relatively benign example, the threat presented by this class of issues is not theoretical; iterator invalidation bugs in high-value targets have been weaponized before.


CodeQL is a static analysis framework developed by GitHub that allows you to query codebases with an SQL-like syntax. It has an object-oriented class system with predicates that define logical properties and relationships. The standard library provides a comprehensive set of classes which allow querying for a wide array of code properties and patterns.

A CodeQL database can be built for almost any code that compiles. GitHub maintains an index of databases they’ve built from public repositories at, which can be queried on their site or locally with the CodeQL CLI. There is also a Visual Studio Code extension for inspection of query results.

Itergator consists of both queries and libraries, allowing auditors to use Itergator’s classes in their own queries.

Detecting Iterator Invalidation

Using static analysis to detect iterator invalidation presents several challenges. The example above is simple, but invalidations can be nested many function calls deep, with complex logic surrounding them. Some iterators are declared and invalidated outside of a loop, resulting in flows that are expensive to detect without an enormous number of false positives. It is also important for the query to be extensible: Codebases often have their own iterable types with invalidation constraints that need to be detected.

CodeQL’s global data flow library and support for recursion make complex control flow analyses easy to write. Itergator is able to construct a graph of all potentially invalidating function calls (those that may result in a call to an invalidating function, like std::vector::push_back) and define classes to be used in queries:

  • Iterator: a variable that stores an iterator
  • Iterated: where a collection is iterated, e.g. vec in vec.begin()
  • Invalidator: a potentially invalidating function call in the scope of an iterator
  • Invalidation: a function call that directly invalidates an iterator

The InvalidationFlows query relates these classes with data flow to locate likely invalidations. To query non-standard iterated types, you simply extend the PotentialInvalidation class which, as an abstract class, is defined as the union of its subclasses. For example, here is an invalidation definition for destructors:

class PotentialInvalidationDestructor extends PotentialInvalidation {
    PotentialInvalidationDestructor() {
        this instanceof MemberFunction
        and this.getName().matches("~%")

    override predicate invalidates(Iterated i) {

These subclasses can be defined anywhere in your query or an imported library; definitions for STL classes are already included in Itergator. A utility query in Itergator, IteratedTypes, identifies what types to specify invalidation constraints for.

A large part of Itergator’s development required finding fixed iterator invalidation bugs on GitHub and attempting to reproduce them. One especially tricky bug in a regular expression library by Google exemplifies the challenges of this project:

struct Frame {
  Frame(Regexp** sub, int nsub)
      : sub(sub),
        round(0) {}

  Regexp** sub;
  int nsub;
  int round;
  std::vector<Splice> splices;
  int spliceidx;

int Regexp::FactorAlternation(Regexp** sub, int nsub, ParseFlags flags) {
  std::vector<Frame> stk;
  stk.emplace_back(sub, nsub);

  for (;;) {
    auto& splices = stk.back().splices;
    auto& spliceiter = stk.back().spliceiter;

    if (splices.empty()) {
    } else if (spliceiter != splices.end()) {
      stk.emplace_back(spliceiter->sub, spliceiter->nsub);
  } else { ... }

    switch (round) { ... }

    if (splices.empty() || round == 3) {
      spliceiter = splices.end();
    } else {
      spliceiter = splices.begin();

This function declares stk, a vector of frames, each of which has a splices vector and a spliceiter iterator. The iterator begins uninitialized, and is only assigned a value at the end of the first iteration of the loop (lines 32-36). It’s not obvious where the invalidation occurs; it’s not an operation on splices directly, but an element added to stk on line 26. If the backing buffer of stk is at capacity, it is reallocated and the Frame objects are copied, resulting in re-allocation of each splices vector. Because of the continue statement, spliceiter is never re-initialized, and an invalidated iterator is used on the next loop iteration.

This invalidation happens over three iterations of the loop: first initialization of the iterator, then invalidation, and finally, usage. The invalidating function call is performed on a member of an object stored inside a vector; confirming that this is the same vector the iterator refers to would be extremely complicated. Tracking control flow across all three executions is possible but expensive, and the query becomes impractical to run on large codebases.

My solution to these problems was to search for conditions necessary, but not sufficient, for invalidation. For example, I verified that the same variable—not value—can flow to both locations of iteration and invalidation. While this introduces a significant number of false positives, automatic filtering based on recurring patterns and the addition of a “significance” value makes searching through the results very manageable, while still identifying complex invalidations like the one above.

CodeQL’s caching and optimization also mean Itergator can query massive codebases, like Apple’s fork of LLVM for Swift, and find deep invalidations. Itergator identified the following bug, which was unintentionally fixed upstream a couple months ago, where the invalidation is 12 function calls deep. InvalidationFlows gives us the iteration and invalidation locations; then, after further investigation, including a customized path query, we can identify the necessary control flow:

And then we can construct a reproduction:

Step 1: Run the LLVM linker with undefined symbols and lazily loaded bitcode that references a linker script.

./ld.lld --no-threads --undefined asdf --undefined fawef --start-lib ~/lib.bc --end-lib ~/a.o

Steps 2 through 13:

lld::elf::Symbol::fetch() const
doParseFile<llvm::object::ELFType<1, true> >
void lld::elf::BitcodeFile::parse<llvm::object::ELFType<1, true> >()
lld::elf::LinkerDriver::addFile(llvm::StringRef, bool)

Step 14: Profit.

==332005==ERROR: AddressSanitizer: heap-use-after-free on address 0x603000004160 at pc 0x556f81e36288 bp 0x7ffd14c663f0 sp 0x7ffd14c663e0
READ of size 16 at 0x603000004160 thread T0
    #0 0x556f81e36287 in void 
lld::elf::LinkerDriver::link<llvm::object::ELFType<(llvm::support::endianness)1, true> >(llvm::opt::InputArgList&) (/home/kmh/llvm-project/build/bin/lld+0x1d53287)
    #1 0x556f81ddaa10 in lld::elf::LinkerDriver::main(llvm::ArrayRef<char const*>) /home/kmh/llvm-project/lld/ELF/Driver.cpp:514
    #2 0x556f81ddc3b6 in lld::elf::link(llvm::ArrayRef<char const*>, bool, llvm::raw_ostream&, llvm::raw_ostream&) /home/kmh/llvm-project/lld/ELF/Driver.cpp:111
    #3 0x556f8186cda8 in main /home/kmh/llvm-project/lld/tools/lld/lld.cpp:154
    #4 0x7f1f70d1b151 in __libc_start_main (/usr/lib/
    #5 0x556f8186861d in _start (/home/kmh/llvm-project/build/bin/lld+0x178561d)


Itergator is a powerful tool for detecting complex iterator invalidations in codebases of any size. Working with CodeQL’s declarative query language was awesome, despite the occasional engine bug, as it incorporated concepts I was already familiar with to make static analysis easy to pick up. There will always be improvements to make and more bugs to hunt, but I’m very happy with my results.

Finally, I’d like to thank my mentor Josh and everyone else at Trail of Bits who made this summer great. I can definitively say that Trail of Bits is the best place I’ve ever worked! If you have any questions, or just want to talk, shoot me a message on Twitter @themalwareman.

PrivacyRaven Has Left the Nest

By Suha S. Hussain, Georgia Tech

If you work on deep learning systems, check out our new tool, PrivacyRaven—it’s a Python library that equips engineers and researchers with a comprehensive testing suite for simulating privacy attacks on deep learning systems.

PrivacyRaven is a comprehensive testing suite for simulating privacy attacks on deep learning systems

Because deep learning enables software to perform tasks without explicit programming, it’s become ubiquitous in sensitive use cases such as:

  • Fraud detection,
  • Medical diagnosis,
  • Autonomous vehicles,
  • Facial recognition,
  • … and more.

Unfortunately, deep learning systems are also vulnerable to privacy attacks that compromise the confidentiality of the training data set and the intellectual property of the model. And unlike other forms of software, deep learning systems lack extensive assurance testing and analysis tools such as fuzzing and static analysis.

The CATastrophic Consequences of Privacy Attacks

But wait—are such privacy attacks likely? After all, medical applications using deep learning are subject to strict patient privacy regulations.

Unfortunately, yes. Imagine you’re securing a medical diagnosis system for detecting brain bleeds using CAT scan images:

Now, suppose the deep learning model in this image predicts whether or not a patient has a brain bleed and responds with a terse “Yes” or “No” answer. This setting provides users with as little access to the model as possible, so you might think there’s not much an adversary could learn. However, even when strongly restricted, an adversary modeled by PrivacyRaven can:

Evidently, an adversary can critically compromise the confidentiality of this system, so it has to be defended against privacy attacks to be considered secure. Otherwise, any major vulnerability has the potential to undermine trust and participation in all such systems.

PrivacyRaven Design Goals

Many other deep learning security techniques are onerous to use, which discourages their adoption. PrivacyRaven is meant for a broad audience, so we designed it to be:

  • Usable: Multiple levels of abstraction allow users to either automate much of the internal mechanics or directly control them, depending on their use case and familiarity with the domain.
  • Flexible: A modular design makes the attack configurations customizable and interoperable. It also allows new privacy metrics and attacks to be incorporated straightforwardly.
  • Efficient: PrivacyRaven reduces the boilerplate, affording quick prototyping and fast experimentation. Each attack can be launched in fewer than 15 lines of code.

As a result, PrivacyRaven is appropriate for a range of users, e.g., a security engineer analyzing bot detection software, an ML researcher pioneering a novel privacy attack, an ML engineer choosing between differential privacy techniques, and a privacy researcher auditing data provenance in text-generation models.

Threat Model

Optimized for usability, efficiency, and flexibility, PrivacyRaven allows users to simulate privacy attacks. Presently, the attacks provided by PrivacyRaven operate under the most restrictive threat model, i.e., they produce worst-case scenario analyses. (This may change as PrivacyRaven develops.) The modeled adversary only receives labels from an API that queries the deep learning model, so the adversary directly interacts only with the API, not the model:

Many other known machine learning attacks exploit the auxiliary information released under weaker threat models. For instance, under the white-box threat model, many systems allow users to access model parameters or loss gradients. Some black-box attacks even assume that the adversary receives full confidence predictions or model explanations.

Despite the possible benefits of these features, if you are deploying a deep learning system, we recommend reducing user access and adhering to PrivacyRaven’s threat model. The extra information provided under the aforementioned weaker threat models substantially increases the effectiveness and accessibility of attacks.

PrivacyRaven Features

PrivacyRaven provides three types of attacks: model extraction, membership inference,  and model inversion. Most of the library is dedicated to wrappers and interfaces for launching these attacks, so users don’t need an extensive background in machine learning or security.

1. Model Extraction

Model extraction attacks directly violate the intellectual property of a system. The primary objective is to extract a substitute model, or grant an adversary a copycat version of the target.

These attacks fall into two categories: optimized for high accuracy or optimized for high fidelity. A high-accuracy substitute model attempts to perform the task to the best of its ability. If the target model incorrectly classifies a data point, the substitute model will prioritize the correct classification. In contrast, a high-fidelity substitute model will duplicate the errors of the target model.

High-accuracy attacks are typically financially motivated. Models are often embedded in a Machine-Learning-as-a-Service distribution scheme, where users are billed according to the number of queries they send. With a substitute model, an adversary can avoid paying for the target and profit from their own version.

High-fidelity attacks are used for reconnaissance to learn more about the target. The substitute model extracted using this attack allows the adversary to launch other classes of attacks, including membership inference and model inversion.

Because the existing methods of model extraction often adopt disparate approaches, most security tools and implementations treat each extraction attack distinctly. PrivacyRaven instead partitions model extraction into multiple phases that encompass most attacks found in the literature (notably excluding cryptanalytic extraction):

  1. Synthesis: First, synthetic data is generated with techniques such as leveraging public data, exploiting population statistics, and collecting adversarial examples.
  2. Training: A preliminary substitute model is trained on the synthetic dataset. Depending on the attack objectives and configuration, this model doesn’t need to have the same architecture as the target model.
  3. Retraining: The substitute model is retrained using a subset sampling strategy to optimize the synthetic data quality and the overall attack performance. This phase is optional.

With this modular approach, users can quickly switch between different synthesizers, sampling strategies, and other features without being limited to configurations that have already been tested and presented. For example, a user may combine a synthesizer found in one paper on extraction attacks with a subset sampling strategy found in another one.

2. Membership Inference

Membership inference attacks are, at their core, re-identification attacks that undermine trust in the systems they target. For example, patients have to trust medical diagnosis system developers with their private medical data. But if a patient’s participation, images, and diagnosis are recovered by an adversary, it will diminish the trustworthiness of the whole system.

PrivacyRaven separates membership inference into different phases:

During a membership inference attack, an attack network is trained to detect whether a data point is included in the training dataset. To train the attack network, a model extraction attack is launched. The outputs are combined with adversarial robustness calculations to generate the dataset.

Unlike similar tools, PrivacyRaven integrates the model extraction API, which makes it easier to optimize the first phase, improve attack performance, and achieve stronger privacy guarantees. Additionally, PrivacyRaven is one of the first implementations of label-only membership inference attacks.

3. Model Inversion

Model inversion attacks look for data that the model has already memorized. Launching an inversion attack on the medical diagnosis system, for instance, would yield the CAT scan’s training dataset. In PrivacyRaven, this attack will be implemented by training a neural network to act as the inverse of the target model. Currently, this feature is in incubation and will be integrated into future PrivacyRaven releases.

Upcoming Flight Plans

We are rapidly adding more methods for model extraction, membership inference, and model inversion. Likewise, we’ll improve and extend the capabilities of PrivacyRaven to address the priorities of the larger deep learning and security communities. Right now, we’re considering:

  1. An enhanced interface for metrics visualizations: We intend PrivacyRaven to generate a high-quality output that balances comprehensiveness and clarity, so it lucidly demonstrates the attack’s impact to non-experts while still providing a measure of control for more specialized use cases.
  2. Automated hyperparameter optimization: Hyperparameter choices are both difficult to reason about and critical to the success of privacy attacks. We plan to incorporate hyperparameter optimization libraries like Optuna to help users avoid major pitfalls and reach their objectives faster.
  3. Verification of differential privacy or machine unlearning: Multiple mechanisms for auditing the implementations of differential privacy and machine unlearning exist, including using minimax rates to construct property estimators or manipulating data poisoning attacks. Consolidating these techniques would bolster the evaluation of privacy-preserving machine learning techniques.
  4. Privacy thresholds and metric calculations: Coupling metrics for privacy grounded in information theory and other fields of mathematics with practical privacy attacks is a nascent endeavor that would greatly benefit the field in its current state.
  5. More classes of attacks: We would like to incorporate attacks that specifically target federated learning and generative models as well as side channel and property inference attacks.

PrivacyRaven in Practice

To attack any deep learning model, PrivacyRaven requires only a query function from a classifier, regardless of the original programming framework or current distribution method. Here’s a model extraction attack executed with PrivacyRaven.

Inside the blue box, a query function is created for a PyTorch Lightning model included with the library (executed after the requisite components are imported). To accelerate prototyping, PrivacyRaven includes a number of victim models. The target model in this example is a fully connected neural network trained on the MNIST dataset. The single line inside of the red box downloads the EMNIST dataset to seed the attack. The bulk of the attack is the attack configuration, located in the green box. Here, the copycat synthesizer helps train the ImageNetTransferLearning classifier.

The output of this example is quite detailed, incorporating statistics about the target and substitute models in addition to metrics regarding the synthetic dataset and overall attack performance. For instance, the output may include statements like:

  • The accuracy of the substitute model is 80.00%.
  • Out of 1,000 data points, the target model and substitute model agreed on 900 data points.

This example demonstrates the core attack interface where attack parameters are defined individually. PrivacyRaven alternatively offers a run-all-attacks and a literature-based interface. The former runs a complete test on a single model, and the latter provides specific attack configurations from the literature.

The Future of Defense

Until now, in the arms race between privacy attacks and defense, engineers and researchers have not had the privacy analysis tools they need to protect deep learning systems. Differential privacy and stateful detection have emerged as two potential solutions to explore, among others. We hope PrivacyRaven will lead to the discovery and refinement of more effective defenses or mitigations. Check out this GitHub repository for a curated collection of research on privacy attacks and defenses.

Contribute to PrivacyRaven!

We’re excited to continue developing PrivacyRaven, and eagerly anticipate more applications. Try it out and contribute to PrivacyRaven now on GitHub: Incorporate a new synthesis technique, make an attack function more readable, etc.!

On a personal note, building PrivacyRaven was the primary objective of my internship this summer at Trail of Bits. It was a rewarding experience: I learned more about cutting-edge areas of security, developed my software engineering skills, and presented my PrivacyRaven work at Empire Hacking and the OpenMined Privacy Conference.

I’m continuing my internship through this winter, and look forward to applying what I’ve already learned to new problems. Feel free to contact me about PrivacyRaven or anything related to trustworthy machine learning at or @suhackerr.