High-fidelity build instrumentation with blight

TL;DR: We’re open-sourcing a new framework, blight, for painlessly wrapping and instrumenting C and C++ build tools. We’re already using it on our research projects, and have included a set of useful actions. You can use it today for your own measurement and instrumentation needs.

Why would you ever want to wrap a build tool?

As engineers, we tend to treat our build tools as monoliths: gcc, clang++, et al. are black boxes that we shove inputs into and get outputs out of.

We then encapsulate our build tools in even higher-level build systems (Make, Ninja) or build system generators (CMake) to spare ourselves the normal hassles of C and C++ compilation: linking individual outputs together, maintaining consistent flags between invocations, supplying the correct linker and include directories, etc.

This all works well enough for everyday development, but falls short for a few specific use cases. We’ll cover some of them below.

Caching

Most build systems have their own caches and intermediate management mechanisms (with varying degrees of soundness). What normal build systems can’t do is cache intermediates between separate projects: Each project’s build is hermetic, even when two projects build nearly identical source dependencies.

That’s where tools like ccache and sccache come in: both supply a global intermediate cache by wrapping individual build tools and diverting from a normal tool invocation if a particular input’s output already exists¹.

Static analysis

Modern C and C++ compilers support a variety of language standards, as well as customization flags for allowing or disallowing language features. Static analysis tools like clang-tidy need to be at least partially aware of these parameters to provide accurate results; for instance, they shouldn’t be recommending snprintf for versions of C prior to C99. Consequently, these tools need an accurate record of how the program was compiled.

clang-tidy and a few other tools support a Compilation Database format, which is essentially just a JSON-formatted array of “command objects,” each of which contains the command-line, working directory, and other metadata associated with a build tool invocation. CMake knows how to generate these databases when using its Make generator; there’s also a tool called Bear that does the same through some LD_PRELOAD trickery.
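
For illustration, a single command object might look like the following (shown here as a Python literal; the paths and flags are made up):

command_object = {
    "directory": "/home/dev/project/build",   # working directory of the invocation
    "command": "cc -c -o foo.o -Iinclude ../src/foo.c",
    "file": "../src/foo.c",
}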

Similarly: LLVM is a popular static analysis target, but it can be difficult to generically instrument pre-existing build systems to emit a single LLVM IR module (or individual bitcode units) in lieu of a single executable or object intermediates. Build tool wrappers like WLLVM and GLLVM exist for precisely this purpose.

Performance profiling

C and C++ builds get complicated quickly. They also tend to accumulate compilation performance issues over time:

  • Expensive standard headers (like <regex>) get introduced and included in common headers, bloating compilation times for individual translation units.
  • Programmers get clever and write complex templates and/or recursive macros, both of which are traditional sore points in compiler performance.
  • Performance-aiding patterns (like forward declarations) erode as abstractions break.

To fix these performance problems, we’d like to time each individual tool invocation and look for sore points. Even better, we’d like to inject additional profiling flags, like -ftime-report, into each invocation without having to fuss with the build system too much. Some build systems allow the former by setting CC=time cc or similar, but this gets hairy with multiple build systems tied together. The latter is easy enough to do by modifying CFLAGS in Make or add_compile_options/target_compile_options in CMake, but becomes similarly complicated when build systems are chained together or invoke each other.

Build and release assurance

C and C++ are complex languages that are hard, if not impossible, to write safe code in.

To protect us, we have our compilers add mitigations (ASLR, W^X, Control Flow Integrity) and additional instrumentation (ASan, MSan, UBSan).

Unfortunately, build complexity gets in our way once again: when stitching multiple builds together, it’s easy to accidentally drop (or incorrectly add, for release configurations) our hardening flags². So, we’d like a way to predicate a particular build’s success or failure on the presence (or absence) of our desired flags, no matter how many nested build systems we invoke. That means injecting and/or removing flags just like with performance profiling, so wrapping is once again an appealing solution.

Build tools are a mess

We’ve come up with some potential use cases for a build tool wrapper. We’ve also seen that a useful build tool wrapper is one that knows a decent bit about the command-line syntax of the tool that it’s wrapping: how to reliably extract its inputs and outputs, as well as correctly model a number of flags and options that change the tool’s behavior.

Unfortunately, this is easier said than done:

  • GCC, historically the dominant open-source C and C++ compiler, has thousands of command-line options, including hundreds of OS- and CPU-specific options that can affect code generation and linkage in subtle ways. Also, because nothing in life should be easy, the GCC frontends have no less than four different syntaxes for an option that takes a value:
    • -oVALUE (examples: -O{,0,1,2,3}, -ooutput.o, -Ipath)
    • -flag VALUE (examples: -o output.o, -x c++)
    • -flag=VALUE (examples: -fuse-ld=gold, -Wno-error=switch, -std=c99)
    • -Wx,VALUE (examples: -Wl,--start-group, -Wl,-ereal_main)

    Some of these overlap consistently, while others only overlap in a few select cases. It’s up to the tool wrapper to handle each, at least to the extent required by the wrapper’s expected functionality (a short sketch of this parsing burden follows this list).

  • Clang, the (relative) newcomer, makes a strong effort to be compatible with the gcc and g++ compiler frontends. To that end, most of the options that need to be modeled for correctly wrapping GCC frontends are the same for the Clang frontends. That being said, clang and clang++ add their own options, some of which overlap in functionality with the common GCC ones. By way of example: The clang and clang++ frontends support -Oz for aggressive code size optimization beyond the (GCC-supported) -Os.
  • Finally, the weird ones: There’s Intel’s ICC, which apparently likewise makes an effort to be GCC-compatible. And then there’s Microsoft’s cl.exe frontend for MSVC which, to the best of my understanding, is functionally incompatible³.
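
To make that parsing burden concrete: even extracting just the output file from a GCC-style command line means handling more than one spelling of the same option. Here’s a hedged sketch of the idea in Python, deliberately much simpler than what a real wrapper has to do:

def output_of(args):
    it = iter(args)
    for arg in it:
        if arg == "-o":
            return next(it, None)   # the "-flag VALUE" form: -o output.o
        if arg.startswith("-o") and len(arg) > 2:
            return arg[2:]          # the "-oVALUE" form: -ooutput.o
    return None                     # no explicit output requested

print(output_of(["cc", "-O2", "-c", "-o", "foo.o", "foo.c"]))  # foo.o
print(output_of(["cc", "-ofoo.o", "foo.c"]))                   # foo.o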

Closer inspection also reveals a few falsehoods that programmers frequently believe about their C and C++ compilers:

“Compilers only take one input at a time!”

This one is admittedly less widely believed: most C and C++ programmers realize early on that these invocations…

cc -c -o foo.o foo.c
cc -c -o bar.o bar.c
cc -c -o baz.o baz.c
cc -o quux foo.o bar.o baz.o

…can be replaced with:

cc -o quux foo.c bar.c baz.c

This is nice for quickly building things on the command-line, but is less fruitful to cache (we no longer have individual intermediate objects) and is harder for a build tool wrapper to model (we have to suss out the inputs, even when interspersed with other compiler flags).

“Compilers only produce one output at a time!”

Similar to the above: C and C++ compilers will happily produce a single intermediate output for every input, as long as you don’t explicitly ask them for a single executable output via -o:

cc -c foo.c bar.c baz.c

…produces foo.o, bar.o, and baz.o. This is once again nice for caching, though it requires a small bit of extra work. To correctly cache each output, we need to transform each input’s filename into the appropriate implicit output name. This ought to be as simple as replacing the source extension with .o, but it isn’t guaranteed:

  • Windows hosts (among others) use .obj rather than .o. Annoying.
  • As we’ll see, not all source inputs are required to have an extension.

Yet more work for our tool wrapper.
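
As a hedged sketch (not blight’s actual logic), the mapping a wrapper has to perform looks roughly like this; the Windows “.obj” handling and the behavior for extensionless inputs are assumptions here:

import os

def implicit_object_name(source, windows=False):
    stem, _ext = os.path.splitext(os.path.basename(source))
    return stem + (".obj" if windows else ".o")

print(implicit_object_name("foo.c"))     # foo.o
print(implicit_object_name("junk.gen"))  # junk.o (e.g., when compiled with -x c)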

“cc only compiles C sources and c++ only compiles C++ sources!”

This is a popular misconception: that cc and c++ (or gcc and g++, or …) are completely different programs that happen to share a great deal of command-line functionality.

In reality, even if they’re separate binaries, they’re usually just thin shims over a common compiler frontend. In particular, c++ corresponds to cc -x c++ and cc corresponds to c++ -x c:

# compile a C++ program using cc
cc -x c++ -c -o foo.o foo.cpp

The -x <language> option enables one particularly useful feature: being able to compile files as a particular language even if their suffix doesn’t match. This comes in handy when doing, say, code generation:

# promise the compiler that junk.gen is actually a C source file
cc -x c junk.gen

“Compilers only compile one language at a time!”

Even with the above, programmers assume that each input to the compiler frontend has to be of the same language, i.e., that you can’t mix C and C++ sources in the same invocation. But this just isn’t true:

# compile a C source and a C++ source into their respective object files
cc -c -x c foo.c -x c++ bar.cpp

You don’t even need the -x <language> modifiers when the frontend understands the file suffixes, as it does for .c and .cpp:

# identical to the above
cc -c foo.c bar.cpp

Not every build tool is a compiler frontend

We’ve omitted a critical fact above: Not every build tool shares the general syntax of the C and C++ compiler frontends. Indeed, there are five different groups of tools that we’re interested in:

  • “Compiler tools” like cc and c++, normally overridden with CC and CXX. We’re focusing our efforts on C and C++ for the time being; it’s common to see similar variables for Go (GO), Rust (RUSTC), and so forth.
  • The C preprocessor (cpp), normally overridden with CPP. Most builds invoke the C preprocessor through the compiler frontend, but some invoke it directly.
  • The system linker (ld), normally overridden with LD. Like the preprocessor, the linker is normally interacted with through the frontend, but occasionally makes an appearance of its own when dealing with custom toolchains and linker scripts.
  • The system assembler (as), normally overridden with AS. Can be used via the frontend like the preprocessor and linker, but is also seen independently.
  • The system archiver (ar), normally overridden with AR. Unlike the last three, the archiver is not integrated into the compiler frontend for, e.g., static library construction; users are expected to invoke it directly.

blight’s architecture

blight’s tool hierarchy

We’ve seen some of the complexities that arise when wrapping and accurately modeling the behavior of a build tool. So now let’s take a look at how blight mitigates those complexities.

Like most (all?) build tool wrappers, blight’s wrappers take the place of their wrapped counterparts. For example, to run a build with blight’s C compiler wrapper:

# -e tells make to always give CC from the environment precedence
CC=blight-cc make -e

Setting each CC, CXX, etc. manually is tedious and error-prone, so the blight CLI provides a bit of shell code generation magic to automate the process:

# --guess-wrapped searches the $PATH to find a suitable tool to wrap
eval "$(blight-env --guess-wrapped)"
make -e

Under the hood, each blight wrapper corresponds to a concrete subclass of Tool (e.g., blight.tool.AS for the assembler), each of which has at least the following:

  • The argument vector (args) that the wrapped tool will be run with.
  • The working directory (cwd) that the wrapped tool will be run from.
  • A list of Actions, each of which can register two separate events on each tool run:
    • before_run(tool)—Run before each tool invocation.
    • after_run(tool)—Run after each successful tool invocation.

Individual subclasses of Tool are specialized using a mixin pattern. For example, blight.tool.CC…

A render of the CC class’s source

…specializes CompilerTool, which is a doozy of mixins:

A render of the CompilerTool class’s source
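
As a rough structural sketch (illustrative only; the specific mixins named here are assumptions, not blight’s verbatim source), the relationship looks something like this:

class Tool:
    def __init__(self, args, cwd):
        self.args = args  # the argument vector the wrapped tool will run with
        self.cwd = cwd    # the working directory it will run from

class ResponseFileMixin:  # expands @file arguments (sketched below)
    pass

class StdMixin:  # models -std=... flags (sketched below)
    pass

class CompilerTool(ResponseFileMixin, StdMixin, Tool):
    """Behavior shared by the CC and CXX wrappers; blight mixes in many more."""

class CC(CompilerTool):
    """The C compiler wrapper, installed as blight-cc."""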

Each mixin, in turn, provides some modeled functionality common between one or more tools.

ResponseFileMixin, for example, specializes the behavior of Tool.args for tools that support the @file syntax for supplying additional arguments via an on-disk file (in particular, CC, CXX, and LD):

A render of the ResponseFileMixin class’s source
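
Here’s a minimal sketch of what such a mixin could look like. The BaseArgs stand-in and the property-based override are illustrative assumptions; blight’s real implementation handles more edge cases:

import shlex

class BaseArgs:
    """Stand-in for Tool: exposes the raw argument vector as a property."""
    def __init__(self, args):
        self._args = args

    @property
    def args(self):
        return self._args

class ResponseFileMixin(BaseArgs):
    """Expands any @file argument into the arguments stored in that file."""
    @property
    def args(self):
        expanded = []
        for arg in super().args:
            if arg.startswith("@"):
                with open(arg[1:]) as f:
                    expanded.extend(shlex.split(f.read()))
            else:
                expanded.append(arg)
        return expanded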

Other mixins make heavy use of Python 3’s Enum class to strictly model the expected behavior of common tool flags, like -std=STANDARD…

A render of the StdMixin class’s source

…where Std:

A render of the Std enum
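
An illustrative sketch of that pattern (the listed standards and method names are assumptions, and blight’s real enum is considerably larger):

import enum

class Std(enum.Enum):
    C89 = "c89"
    C99 = "c99"
    C11 = "c11"
    Cxx11 = "c++11"
    Cxx17 = "c++17"
    Unknown = "unknown"

    def is_cxxstd(self):
        return self.value.startswith("c++")

class StdMixin:
    """Models the -std=<standard> flag for tools that accept it."""
    @property
    def std(self):
        for arg in self.args:
            if arg.startswith("-std="):
                try:
                    return Std(arg[len("-std="):])
                except ValueError:
                    return Std.Unknown
        return Std.Unknown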

Taking action

By default, blight does absolutely nothing: it’s just a framework for wrapping build tools. The magic happens when we begin to inject actions before and after each tool invocation.

Internally, the actions API mirrors the tool API: each tool has a corresponding action class under blight.action (e.g., CC has CCAction).

To add an action, you just specify its name in the BLIGHT_ACTIONS environment variable. Multiple actions can be specified with : as the delimiter, and will be executed in a left-to-right order. Only actions that “match” a particular tool are run, meaning that an action with CCAction as its parent will never (incorrectly) run on a CXX invocation.

To bring the concepts home, here’s blight running with two stock actions: IgnoreWerror and InjectFlags:

An example blight run

In this case, IgnoreWerror strips any instances of -Werror that it sees in compiler tools (i.e., CC and CXX), while InjectFlags injects a configurable set of arguments via a set of nested variables. We’ll see how that configuration works in a bit.
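
An IgnoreWerror-style action can be very small. The sketch below is illustrative, not blight’s actual source; in particular, the import path and the assumption that tool.args can be reassigned directly are mine:

from blight.action import CompilerAction  # assumed import path

class IgnoreWerror(CompilerAction):
    def before_run(self, tool):
        # Drop -Werror so stray warnings can't fail the wrapped build.
        tool.args = [arg for arg in tool.args if arg != "-Werror"]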

For example, here’s InjectFlags:

A render of the InjectFlags class
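
A simplified sketch of what it could look like; the "CFLAGS" configuration key and the import path are illustrative assumptions:

import shlex
from blight.action import CompilerAction  # assumed import path

class InjectFlags(CompilerAction):
    def before_run(self, tool):
        # Append user-supplied flags from this action's configuration.
        tool.args += shlex.split(self._config.get("CFLAGS", ""))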

In particular, note that InjectFlags is a CompilerAction, meaning that its events (in this case, just before_run) are only executed if the underlying tool is CC or CXX.

Writing and configuring your own actions

Writing a new action is straightforward: They live in the blight.actions module, and all inherit from one of the specializations of blight.action.Action.

For example, here’s an action that prints a friendly message before and after every invocation of tool.AS (i.e., the standard assembler)…

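A minimal sketch under the conventions described in this post (the ASAction base class name follows the tool/action pairing described above and is assumed here, not copied from blight’s source):

from blight.action import ASAction  # assumed name, mirroring CC/CCAction

class SayHello(ASAction):
    def before_run(self, tool):
        print(f"hello! about to run the assembler with: {tool.args}")

    def after_run(self, tool):
        print("goodbye! the assembler ran successfully")
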
…and blight will take care of the rest—all you need to do is specify SayHello in BLIGHT_ACTIONS!

Configuration

What about actions that need to be configured, such as a configurable output file?

Remember the InjectFlags action above: every loaded action can opt to receive configuration settings via the self._config dictionary, which the action API parses behind the scenes from the BLIGHT_ACTION_ACTIONNAME environment variable, where ACTIONNAME is the uppercase form of the actual action name.

How is that environment variable parsed? Very simply:

A render of the config parsing code for actions

Spelled out, the configuration string is split according to shell lexing rules, and then once again split into KEY=VALUE pairs.
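
A hedged sketch of that parsing (not blight’s exact code):

import os
import shlex

def action_config(action_name):
    raw = os.environ.get(f"BLIGHT_ACTION_{action_name.upper()}", "")
    return dict(pair.split("=", 1) for pair in shlex.split(raw) if "=" in pair)

# For example, BLIGHT_ACTION_INJECTFLAGS='CFLAGS="-g -O0"' would parse
# to {"CFLAGS": "-g -O0"}.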

This should be a suitable base for most configuration needs. However, because blight’s actions are just plain old Python classes, they can implement their own configuration approaches as desired.

A new paradigm for build instrumentation

blight is the substrate for a new generation of build wrapping and instrumentation tools. Instead of modeling the vagaries of a variety of tool CLIs themselves, new wrapping tools can rely on blight to get the modeling right and get directly to their actual tasks.

Some ideas that we’ve had during blight’s development:

  • An action that provides real-time build statistics via WebSockets, allowing developers to track the progress of arbitrary builds.
  • A rewrite of tools like WLLVM and GLLVM, enabling higher-fidelity tracking of troublesome edge cases (e.g., in-tree assembly files and builds that generate explicit assembly intermediates).
  • A feedback mechanism for trying performance options. Choosing between optimization flags can be fraught, so a blight action that parametrizes builds across a matrix of potential options could help engineers select the appropriate flags for their projects.

We’ve already written some actions for our own use cases, but we believe blight can be useful to the wider build tooling and instrumentation communities. If you’re interested in working with us on it or porting your existing tools to blight’s framework, contact us!



  1. This, it turns out, is nontrivial: Compilation flags can affect the ABI, so a sound compilation cache must correctly model the various flags passed to the compiler and only hit the intermediate cache if two (likely different) sets of flags from separate invocations are not incompatible.
  2. Another use case: Companies usually want to strip their application binaries of all symbol and debug information before release, to hamper reverse engineering. Doing so is usually part of a release checklist and should be enforced by the build system, and yet companies repeatedly manage to leak debug and symbol information. A build tool wrapper could provide another layer of assurance.
  3. In a rare divergence for Microsoft, cl.exe does allow you to use -opt instead of /opt for all command-line options. Unfortunately, most of cl.exe’s options bear no resemblance to GCC or Clang’s, so it doesn’t make much of a difference.

Smart (and simple) ways to prevent symlink attacks in Go

After writing Go for years, many of us have learned the error-checking pattern down to our bones: “Does this function return an error? Ope, better make sure it’s nil before moving on.”

And that’s great! This should be our default behavior when writing Go.

However, rote error checking can sometimes prevent critical thinking about what that error actually means to the code: When does that function return an error? And does it encompass everything you think it does?

For instance, in os.Create, a nil error value can trick you into thinking you’re safe with file creation. As the linked documentation explains, os.Create actually truncates the file when it already exists, without giving any indication that it wasn’t a new file.

This leaves us vulnerable to a symlink attack.

Does it exist?

Say my program needs to create and use a file. Almost every example of idiomatic Go code guides us through an error check, but not through any validation of whether the file existed before Create was called. If a symbolic link had already been set up for that file, no error would occur, but the file and its contents would not behave as intended due to the truncation behavior. The risk is that data can be destroyed simply by getting the program to overwrite it for us.

At Trail of Bits, this issue comes up frequently in audits. Thankfully, the fix for it is incredibly easy. We just need to check if a file exists before attempting to create it. A slight tweak in our approach to idiomatic Go can make file creation safer long term and take us one step closer to prioritizing security in Go programs.

The situation

Let’s say there’s a file, my_logs, that I need to create and write to. However, in another part of the codebase, someone previously set up a symlink with ln -s other/logs my_logs.

- logs
- notes
- things I care about
- very important information we can't lose

Contents of other/logs.

package main

import (
	"fmt"
	"os"
)

func main() {
	file, err := os.Create("my_logs")
	if err != nil {
		fmt.Printf("Error creating file: %s", err)
	}

	_, err = file.Write([]byte("My logs for this process"))
	if err != nil {
		fmt.Println(err)
	}

}

symlink_attack.go.

$ ln -s other/logs my_logs
$ go build symlink_attack.go
$ ./symlink_attack
$ cat other/logs
- My logs for this process
$

As you can see, the content of other/logs is wiped even though our program only interacted with my_logs.

Even in this accidental scenario, os.Create removes important data through its truncation behavior. In malicious scenarios, an attacker could leverage the truncation behavior against the user to remove specific data—perhaps audit logs that would have revealed their presence on the box at one point.

Simple fix, two approaches

To remedy this, we have to insert an os.IsNotExist check before calling Create. If you run the edited symlink_attack.go below, the data in other/logs remains and is not overwritten.

package main

import (
	"fmt"
	"log"
	"os"
)

func main() {

	if fileExists("my_logs") {
		log.Fatalf("symlink attack failure")
	}

	file, err := os.Create("my_logs")
	if err != nil {
		fmt.Printf("Error creating file: %s", err)
	}

	_, err = file.Write([]byte("My logs for this process"))
	if err != nil {
		fmt.Printf("Failure to write: %s", err)
	}
}

func fileExists(filename string) bool {
	info, err := os.Stat(filename)
	if os.IsNotExist(err) {
		return false
	}
	return !info.IsDir()
}

symlink_attack.go with IsNotExist check.

The stipulation here is that by checking os.IsNotExist before creating, we put ourselves in a position where we can’t verify whether a symlink was created between the existence check and the file creation (a time-of-check vs. time-of-use bug). To account for that, we can take a few different approaches.

The first approach is to recreate the implementation of os.Create with your own OpenFile command, thus eliminating the truncation.

func Create(name string) (*File, error) {
    return OpenFile(name, O_RDWR|O_CREATE|O_TRUNC, 0666)
}

os pkg definition for Create.

package main

import (
	"fmt"
	"io/ioutil"
	"log"
	"os"
	"syscall"
)

func main() {
	file, err := os.OpenFile("my_logs", os.O_RDWR|os.O_CREATE|syscall.O_NOFOLLOW, 0666)
	if err != nil {
		log.Fatal(err)
	}

	_, err = file.Write([]byte("Is this the only thing in the file\n"))
	if err != nil {
		fmt.Printf("Failure to write: %s", err)
	}
	err = file.Close()
	if err != nil {
		fmt.Printf("Couldn't close file: %s", err)
	}
	buf, err := ioutil.ReadFile("./my_logs")
	if err != nil {
		fmt.Printf("Failed to read file: %s", err)
	}

	fmt.Printf("%s", buf)
}

symlink_attack.go OpenFile with O_NOFOLLOW to avoid following symlinks.

By opening the file with O_NOFOLLOW, you will not follow a symlink: when a new file is created, this works the same as os.Create, but the open will fail if a symlink is already set up at that location.

The alternative is to create a TempFile and use os.Rename to move it to your preferred location.

package main

import (
	"fmt"
	"io/ioutil"
	"log"
	"os"
)

func main() {
	tmpfile, err := ioutil.TempFile(".", "")
	if err != nil {
		log.Fatal(err)
	}

	if err := os.Rename(tmpfile.Name(), "my_logs"); err != nil {
		log.Fatal(err)
	}
	if _, err := tmpfile.Write([]byte("Is this the only thing in the file")); err != nil {
		log.Fatal(err)
	}

	buf, err := ioutil.ReadFile("./my_logs")
	if err != nil {
		fmt.Printf("Failed to read file: %s", err)
	}

	fmt.Printf("%s", buf)

}

symlink_attack.go with TempFile creation and os.Rename.

This pattern broke the symlink between my_logs and other/logs. other/logs still had its contents and my_logs had only the contents “Is this the only thing in the file,” as intended.

Protect future you, now

No matter how careful you are about checking errors in Go, the functions that return them don’t always behave the way you might think (tl;dr: rtfm). But updating your approach to file creation in Go is really simple, and can save you from unintended consequences.

Good idea, bad design: How the Diamond standard falls short

TL;DR: We audited an implementation of the Diamond standard proposal for contract upgradeability and can’t recommend it in its current form—but see our recommendations and upgrade strategy guidance.

We recently audited an implementation of the Diamond standard, a new upgradeability pattern. It’s a laudable undertaking, but the Diamond proposal and implementation raise many concerns. The code is over-engineered, with lots of unnecessary complexities, and we can’t recommend it at this time.

Of course, the proposal is still a draft, with room to grow and improve. A working upgradeability standard should include:

  • A clear, simple implementation. Standards should be easy to read to simplify integration with third-party applications.
  • A thorough checklist of upgrade procedures. Upgrading is a risky process that must be thoroughly explained.
  • On-chain mitigations against the most common upgradeability mistakes, including function shadowing and collisions. Several mistakes, though easy to detect, can lead to severe issues. See slither-check-upgradeability for many pitfalls that can be mitigated.
  • A list of associated risks. Upgradeability is difficult; it can conceal security considerations or imply that risks are trivial. EIPs are proposals to improve Ethereum, not commercial advertisements.
  • Tests integrated with the most common testing platforms. The tests should highlight how to deploy the system, how to upgrade a new implementation, and how an upgrade can fail.

Unfortunately, the Diamond proposal fails to address these points. It’s too bad, because we’d love to see an upgradeable standard that solves or at least mitigates the main security pitfalls of upgradeable contracts. Essentially, standard writers must assume that developers will make mistakes, and aim to build a standard that alleviates them.

Still, there’s plenty to learn from the Diamond proposal. Read on to see:

  • How the Diamond proposal works
  • What our review revealed
  • Our recommendations
  • Upgradeability standard best practices

The Diamond proposal paradigm

The Diamond proposal is a work-in-progress defined in EIP 2535. The draft claims to propose a new paradigm for contract upgradeability based on delegatecall. (FYI, here’s an overview of how upgradeability works.) EIP 2535 proposes the use of:

  1. A lookup table for the implementation
  2. An arbitrary storage pointer

Lookup table

The delegatecall-based upgradeability mainly works with two components—a proxy and an implementation:

Figure 1: delegatecall-based upgradeability with a single implementation

The user interacts with the proxy, and the proxy delegatecalls to the implementation. The implementation code is executed, while the storage is kept in the proxy.

Using a lookup table allows delegatecalls to multiple contract implementations, where the proper implementation is selected according to the function to be executed:

Figure 2: delegatecall-based upgradeability with multiple implementations.

This schema is not new; other projects have used such lookup tables for upgradeability in the past. See ColonyNetwork for an example.

Arbitrary storage pointer

The proposal also suggests using a feature recently introduced into Solidity: the arbitrary storage pointer, which (like the name says) allows assignment of a storage pointer to an arbitrary location.

Because the storage is kept on the proxy, the implementation’s storage layout must follow the proxy’s storage layout. It can be difficult to keep track of this layout when doing an upgrade (see examples here).

The EIP proposes that every implementation have an associated structure to hold the implementation variables, and a pointer to an arbitrary storage location where the structure will be stored. This is similar to the unstructured storage pattern, where the new Solidity feature allows use of a structure instead of a single variable.

It is assumed that two structures from two different implementations cannot collide as long as their respective base pointers are different.

   bytes32 constant POSITION = keccak256(
        "some_string"
    );

    struct MyStruct {
        uint var1;
        uint var2;
    }

    function get_struct() internal pure returns(MyStruct storage ds) {
        bytes32 position = POSITION;
        assembly { ds_slot := position }
    }  

Figure 3: Storage pointer example.

Figure 4: Storage pointer representation.

BTW, what’s a “diamond”?

EIP 2535 introduces “diamond terminology,” wherein the word “diamond” means a proxy contract, “facet” means an implementation, and so on. It’s unclear why this terminology was introduced, especially since the standard terminology for upgradeability is well known and defined. Here’s a key to help you translate the proposal if you go through it:

Diamond vocabulary    Common name
Diamond               Proxy
Facet                 Implementation
Cut                   Upgrade
Loupe                 List of delegated functions
Finished diamond      Non-upgradeable
Single cut diamond    Remove upgradeability functions

Figure 5: The Diamond proposal uses new terms to refer to existing ideas.

Audit findings and recommendations

Our review of the diamond implementation found that:

  • The code is over-engineered and includes several misplaced optimizations
  • Using storage pointers has risks
  • The codebase had function shadowing
  • The contract lacks an existence check
  • The diamond vocabulary adds unnecessary complexity

Over-engineered code

While the pattern proposed in the EIP is straightforward, its actual implementation is difficult to read and review, increasing the likelihood of issues.

For example, a lot of the data kept on-chain is cumbersome. While the proposal only needs a lookup table from the function signature to the implementation’s address, the EIP defines many interfaces that require storage of additional data:

interface IDiamondLoupe {
    /// These functions are expected to be called frequently
    /// by tools.

    struct Facet {
        address facetAddress;
        bytes4[] functionSelectors;
    }

    /// @notice Gets all facet addresses and their four byte function selectors.
    /// @return facets_ Facet
    function facets() external view returns (Facet[] memory facets_);

    /// @notice Gets all the function selectors supported by a specific facet.
    /// @param _facet The facet address.
    /// @return facetFunctionSelectors_
    function facetFunctionSelectors(address _facet) external view returns (bytes4[] memory facetFunctionSelectors_);

    /// @notice Get all the facet addresses used by a diamond.
    /// @return facetAddresses_
    function facetAddresses() external view returns (address[] memory facetAddresses_);

    /// @notice Gets the facet that supports the given selector.
    /// @dev If facet is not found return address(0).
    /// @param _functionSelector The function selector.
    /// @return facetAddress_ The facet address.
    function facetAddress(bytes4 _functionSelector) external view returns (address facetAddress_);
}

Figure 6: Diamond interfaces.

Here, facetFunctionSelectors returns all the function selectors of an implementation. This information will only be useful for off-chain components, which can already extract the information from the contract’s events. There’s no need for such a feature on-chain, especially since it significantly increases code complexity.

Additionally, much of the code complexity is due to optimization in locations that don’t need it. For example, the function used to upgrade an implementation should be straightforward. Taking a new address and a signature, it should update the corresponding entry in the lookup table. Well, part of the function doing so is the following:

           // adding or replacing functions
            if (newFacet != 0) {
                // add and replace selectors
                for (uint selectorIndex; selectorIndex < numSelectors; selectorIndex++) {
                    bytes4 selector;
                    assembly {
                        selector := mload(add(facetCut,position))
                    }
                    position += 4;
                    bytes32 oldFacet = ds.facets[selector];
                    // add
                    if(oldFacet == 0) {
                        // update the last slot at the end of the function
                        slot.updateLastSlot = true;
                        ds.facets[selector] = newFacet | bytes32(selectorSlotLength) << 64 | bytes32(selectorSlotsLength);
                        // clear selector position in slot and add selector
                        slot.selectorSlot = slot.selectorSlot & ~(CLEAR_SELECTOR_MASK >> selectorSlotLength * 32) | bytes32(selector) >> selectorSlotLength * 32;
                        selectorSlotLength++;
                        // if slot is full then write it to storage
                        if(selectorSlotLength == 8) {
                            ds.selectorSlots[selectorSlotsLength] = slot.selectorSlot;
                            slot.selectorSlot = 0;
                            selectorSlotLength = 0;
                            selectorSlotsLength++;
                        }
                    }
                    // replace
                    else {
                        require(bytes20(oldFacet) != bytes20(newFacet), "Function cut to same facet.");
                        // replace old facet address
                        ds.facets[selector] = oldFacet & CLEAR_ADDRESS_MASK | newFacet;
                    }
                }
            }

Figure 7: Upgrade function.

A lot of effort was made to optimize this function’s gas efficiency. But upgrading a contract is rarely done, so it would never be an expensive operation anyway, no matter what its gas cost.

In another example of unnecessary complexity, bitwise operations are used instead of a structure:

uint selectorSlotsLength = uint128(slot.originalSelectorSlotsLength);
uint selectorSlotLength = uint128(slot.originalSelectorSlotsLength >> 128);
        // uint32 selectorSlotLength, uint32 selectorSlotsLength
        // selectorSlotsLength is the number of 32-byte slots in selectorSlots.
        // selectorSlotLength is the number of selectors in the last slot of
        // selectorSlots.
        uint selectorSlotsLength;

Figure 8: Use of bitwise operations instead of a structure.

Update November 5th:
Since our audit, the reference implementation has changed, but its underlying complexity remains. There are now three reference implementations, which makes everything even more confusing for users, and further review of the proposal is more difficult.

Our recommendations:

  • Always strive for simplicity, and keep as much code as you can off-chain.
  • When writing a new standard, keep the code readable and easy to understand.
  • Analyze the needs before implementing optimizations.

Storage pointer risks

Despite the claim that collisions are impossible if the base pointers are different, a malicious contract can collide with a variable from another implementation. Basically, this is possible because of the way Solidity computes the storage locations of mappings and dynamic arrays. For example:

contract TestCollision{
    
    // The contract represents two implementations, A and B
    // A has a nested structure 
    // A and B have different base storage pointers
    // Yet writing to B will write to a variable of A
    // This is because the base storage pointer of B 
    // collides with A.ds.my_items[0].elems
    
    bytes32 constant public A_STORAGE = keccak256(
        "A"
    );
    
    struct InnerStructure{
        uint[] elems;
    }
    
    struct St_A {
        InnerStructure[] my_items;
    }

    function pointer_to_A() internal pure returns(St_A storage s) {
        bytes32 position = A_STORAGE;
        assembly { s_slot := position }
    }
    
    
    bytes32 constant public B_STORAGE = keccak256(
        hex"78c8663007d5434a0acd246a3c741b54aecf2fefff4284f2d3604b72f2649114"
    );
    
    struct St_B {
        uint val;
    }

    function pointer_to_B() internal pure returns(St_B storage s) {
        bytes32 position = B_STORAGE;
        assembly { s_slot := position }
    }
    
    
    constructor() public{
        St_A storage ds = pointer_to_A();
        ds.my_items.push();
        ds.my_items[0].elems.push(100);
    }
    
    function get_balance() view public returns(uint){
        St_A storage ds = pointer_to_A();
        return ds.my_items[0].elems[0];
    }
    
    function exploit(uint new_val) public{
        St_B storage ds = pointer_to_B();
        ds.val = new_val;
    }
    
}

Figure 9: Storage pointer collision.

In exploit, the write through the B_STORAGE base pointer actually writes to my_items[0].elems[0], which is read through the A_STORAGE base pointer. A malicious owner could push an upgrade that looks benign, but contains a backdoor.

The EIP has no guidelines for preventing these malicious collisions. Additionally, if a pointer is reused after being deleted, the re-use will lead to data compromise.

Our recommendations

  • Low-level storage manipulations are risky, so be extra careful when designing a system that relies on them.
  • Using unstructured storage with structures for upgradeability is an interesting idea, but it requires thorough documentation and guidelines on what to check for in a base pointer.

Function shadowing

Upgradeable contracts often have functions in the proxy that shadow the functions that should be delegated. Calls to these functions will never be delegated, as they will be executed in the proxy. Additionally, the associated code will not be upgradeable.

contract Proxy {

    constructor(...) public{
          // add my_targeted_function() 
          // as a delegated function
    }
    
    function my_targeted_function() public{
    }
 
    fallback () external payable{
          // delegate to implementations
    }
}

Figure 10: Simplification of a shadowing issue.

Although this issue is well known, and the code was reviewed by the EIP author, we found two instances of function-shadowing in the contracts.

Our recommendations

  • When writing an upgradeable contract, use crytic.io or slither-check-upgradeability to catch instances of shadowing.
  • This issue highlights an important point: Developers make mistakes. Any new standard should include mitigations for common mistakes if it’s to work better than custom solutions.

No contract existence check

Another common mistake is the absence of an existence check for the contract’s code. If the proxy delegates to an incorrect address, or to an implementation that has been destructed, the call to the implementation will return success even though no code was executed (see the Solidity documentation). As a result, the caller will not notice the issue, and such behavior is likely to break third-party contract integrations.

   fallback() external payable {
        DiamondStorage storage ds;
        bytes32 position = DiamondStorageContract.DIAMOND_STORAGE_POSITION;
        assembly { ds_slot := position }
        address facet = address(bytes20(ds.facets[msg.sig]));
        require(facet != address(0), "Function does not exist.");
        assembly {
            calldatacopy(0, 0, calldatasize())
            let result := delegatecall(gas(), facet, 0, calldatasize(), 0, 0)
            let size := returndatasize()
            returndatacopy(0, 0, size)
            switch result
            case 0 {revert(0, size)}
            default {return (0, size)}
        }
    }

Figure 11: Fallback function without contract’s existence check.

Our recommendations

  • Always check for contract existence when calling an arbitrary contract.
  • If gas cost is a concern, only perform this check if the call returns no data, since the opposite result means that some code was executed.

Unnecessary Diamond vocabulary

As noted, the Diamond proposal relies heavily on its newly created vocabulary. This is error-prone, makes review more difficult, and does not benefit developers.

  1. A diamond is a contract that uses functions from its facets to execute function calls. A diamond can have one or more facets.
  2. The word facet comes from the diamond industry. It is a side, or flat surface of a diamond. A diamond can have many facets. In this standard a facet is a contract with one or more functions that executes functionality of a diamond.
  3. A loupe is a magnifying glass that is used to look at diamonds. In this standard a loupe is a facet that provides functions to look at a diamond and its facets.

Figure 12: The EIP redefines standard terms to ones that are unrelated to software engineering.

Our recommendation

  • Use the common, well-known vocabulary, and do not invent terminology when it’s not needed.

Is the Diamond proposal a dead end?

As noted, we still believe the community would benefit from a standardized upgradeability schema. But the current Diamond proposal does not meet the expected security requirements or bring enough benefits over a custom implementation.

However, the proposal is still a draft, and could evolve into something simpler and better. And even if it doesn’t, some of the existing techniques used, such as the lookup table and arbitrary storage pointers, are worth continuing to explore.

So…is upgradeability feasible or not?

Over the years, we’ve reviewed many upgradeable contracts and published several analyses on this topic. Upgradeability is difficult, error-prone, and increases risk, and we still generally don’t recommend it as a solution. But developers who need upgradeability in their contracts should, at a minimum, follow the recommendations outlined above.

And please contact us if you have any questions about your upgrade strategy. We’re ready to help!

Efficient audits with machine learning and Slither-simil

by Sina Pilehchiha, Concordia University

Trail of Bits has manually curated a wealth of data—years of security assessment reports—and now we’re exploring how to use this data to make the smart contract auditing process more efficient with Slither-simil.

Based on accumulated knowledge embedded in previous audits, we set out to detect similar vulnerable code snippets in new clients’ codebases. Specifically, we explored machine learning (ML) approaches to automatically improve on the performance of Slither, our static analyzer for Solidity, and make life a bit easier for both auditors and clients.

Currently, human auditors with expert knowledge of Solidity and its security nuances scan and assess Solidity source code to discover vulnerabilities and potential threats at different granularity levels. In our experiment, we explored how much we could automate security assessments to:

  1. Minimize the risk of recurring human error, i.e., the chance of overlooking known, recorded vulnerabilities.
  2. Help auditors sift through potential vulnerabilities faster and more easily while decreasing the rate of false positives.

Slither-simil

Slither-simil, the statistical addition to Slither, is a code similarity measurement tool that uses state-of-the-art machine learning to detect similar Solidity functions. When it began as an experiment last year under the codename crytic-pred, it was used to vectorize Solidity source code snippets and measure the similarity between them. This year, we’re taking it to the next level and applying it directly to vulnerable code.

Slither-simil currently uses its own representation of Solidity code, SlithIR (Slither Intermediate Representation), to encode Solidity snippets at the granularity level of functions. We thought function-level analysis was a good place to start our research since it’s not too coarse (like the file level) and not too detailed (like the statement or line level).

Figure 1: A high-level view of the process workflow of Slither-simil.

In the process workflow of Slither-simil, we first manually collected vulnerabilities from the previous archived security assessments and transferred them to a vulnerability database. Note that these are the vulnerabilities auditors had to find with no automation.

After that, we compiled previous clients’ codebases and matched the functions they contained with our vulnerability database via an automated function extraction and normalization script. By the end of this process, our vulnerabilities were represented as normalized SlithIR tokens, ready to be fed to our ML system.

Here’s how we used Slither to transform a Solidity function to the intermediate representation SlithIR, then further tokenized and normalized it to be an input to Slither-simil:

function transferFrom(address _from, address _to, uint256 _value) public returns (bool success) {
        require(_value <= allowance[_from][msg.sender]);     // Check allowance
        allowance[_from][msg.sender] -= _value;
        _transfer(_from, _to, _value);
        return true;
    }

Figure 2: A complete Solidity function from the contract TurtleToken.sol.

Function TurtleToken.transferFrom(address,address,uint256) (*)


Solidity Expression: require(bool)(_value <= allowance[_from][msg.sender])
SlithIR: 
         REF_10(mapping(address => uint256)) ->    allowance[_from]
         REF_11(uint256) -> REF_10[msg.sender]
         TMP_16(bool) = _value <= REF_11
         TMP_17 = SOLIDITY_CALL require(bool)(TMP_16)


Solidity Expression: allowance[_from][msg.sender] -= _value
SlithIR: 
         REF_12(mapping(address => uint256)) -> allowance[_from]
         REF_13(uint256) -> REF_12[msg.sender]
         REF_13(-> allowance) = REF_13 - _value


Solidity Expression: _transfer(_from,_to,_value)
SlithIR: 
         INTERNAL_CALL,      TurtleToken._transfer(address,address,uint256)(_from,_to,_value)


Solidity Expression: true
SlithIR: 
         RETURN True

Figure 3: The same function with its SlithIR expressions printed out.

First, we converted every statement or expression into its SlithIR correspondent, then tokenized the SlithIR sub-expressions and further normalized them so more similar matches would occur despite superficial differences between the tokens of this function and the vulnerability database.

type_conversion(uint256)

binary(**)

binary(*)

(state_solc_variable(uint256)):=(temporary_variable(uint256))

index(uint256)

(reference(uint256)):=(state_solc_variable(uint256))

(state_solc_variable(string)):=(local_solc_variable(memory, string))

(state_solc_variable(string)):=(local_solc_variable(memory, string))

...

Figure 4: Normalized SlithIR tokens of the previous expressions.

After obtaining the final form of token representations for this function, we compared its structure to that of the vulnerable functions in our vulnerability database. Due to the modularity of Slither-simil, we used various ML architectures to measure the similarity between any number of functions.

$ slither-simil test etherscan_verified_contracts.bin --filename TurtleToken.sol --fname TurtleToken.transferFrom --input cache.npz --ntop 5

Output:
Reviewed 825062 functions, listing the 5 most similar ones:

filename            contract      function     score
...
TokenERC20.sol      TokenERC20    freeze       0.991
...
ETQuality.sol       StandardToken transferFrom 0.936
...
NHST.sol            NHST          approve      0.889

Figure 5: Using Slither-simil to test a function from a smart contract with an array of other Solidity contracts.

Let’s take a look at the function transferFrom from the ETQuality.sol smart contract to see how its structure resembled our query function:

function transferFrom(address _from, address _to, uint256 _value) returns (bool success) {
    if (balances[_from] >= _value && allowed[_from][msg.sender] >= _value && _value > 0) {
        balances[_to] += _value;
        balances[_from] -= _value;
        allowed[_from][msg.sender] -= _value;
        Transfer(_from, _to, _value);
        return true;
    } else { return false; }
}

Figure 6: Function transferFrom from the ETQuality.sol smart contract.

Comparing the statements in the two functions, we can easily see that they both contain, in the same order, a binary comparison operation (>= and <=), the same type of operand comparison, and another similar assignment operation with an internal call statement and an instance of returning a “true” value.

As the similarity score goes down toward 0, these sorts of structural similarities are observed less often; in the other direction, the closer the score gets to 1.0, the more alike the two functions are, and two functions with a similarity score of 1.0 are identical to each other.

Related Research

Research on automatic vulnerability discovery in Solidity has taken off in the past two years, and tools like Vulcan and SmartEmbed, which use ML approaches to discovering vulnerabilities in smart contracts, are showing promising results.

However, all the current related approaches focus on vulnerabilities already detectable by static analyzers like Slither and Mythril, while our experiment focused on the vulnerabilities these tools were not able to identify—specifically, those undetected by Slither.

Much of the academic research of the past five years has focused on taking ML concepts (usually from the field of natural language processing) and using them in a development or code analysis context, typically referred to as code intelligence. Based on previous, related work in this research area, we aim to bridge the semantic gap between the performance of a human auditor and an ML detection system to discover vulnerabilities, thus complementing the work of Trail of Bits human auditors with automated approaches (i.e., Machine Programming, or MP).

Challenges

We still face the challenge of data scarcity, both in the number of smart contracts available for analysis and in how often interesting vulnerabilities appear in them. It’s tempting to focus on the ML model because it’s the exciting part, but that doesn’t do much good in the case of Solidity, where the language itself is very young and we need to tread carefully with the limited amount of data at our disposal.

Archiving previous client data was a job in itself since we had to deal with the different solc versions to compile each project separately. For someone with limited experience in that area this was a challenge, and I learned a lot along the way. (The most important takeaway of my summer internship is that if you’re doing machine learning, you will not realize how major a bottleneck the data collection and cleaning phases are unless you have to do them.)

Figure 7: Distribution of 89 vulnerabilities found among 10 security assessments.

The pie chart shows how 89 vulnerabilities were distributed among the 10 client security assessments we surveyed. We documented both the notable vulnerabilities and those that were not discoverable by Slither.

The Road Ahead for Slither-simil

This past summer we resumed the development of Slither-simil and SlithIR with two goals in mind:

  • Research purposes, i.e., the development of end-to-end similarity systems lacking feature engineering.
  • Practical purposes, i.e., adding specificity to increase precision and recall.

We implemented a baseline text-based model with FastText, to be compared against an improved model with tangibly better results: one that doesn’t rely on software complexity metrics, but focuses solely on graph-based models, as they are the most promising ones right now.

For this, we have proposed a slew of techniques to try out with the Solidity language at the highest abstraction level, namely, source code.

To develop ML models, we considered both supervised and unsupervised learning methods. First, we developed a baseline unsupervised model based on tokenizing source code functions and embedding them in a Euclidean space (Figure 8) to measure and quantify the distance (i.e., dissimilarity) between different tokens. Since functions are constituted from tokens, we just added up the differences to get the (dis)similarity between any two different snippets of any size.
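
As a toy sketch of that idea (not Slither-simil’s actual code, and glossing over how the embeddings are trained), the per-function vectors and their similarity could be computed like this:

import numpy as np

def function_vector(tokens, embeddings):
    # tokens: normalized SlithIR tokens for one function
    # embeddings: dict mapping a token string to its learned vector
    return np.sum([embeddings[t] for t in tokens], axis=0)

def similarity(tokens_a, tokens_b, embeddings):
    va = function_vector(tokens_a, embeddings)
    vb = function_vector(tokens_b, embeddings)
    # Cosine similarity: the closer to 1.0, the more alike the two functions look.
    return float(np.dot(va, vb) / (np.linalg.norm(va) * np.linalg.norm(vb)))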

The diagram below shows the SlithIR tokens from a set of training Solidity data spherized in a three-dimensional Euclidean space, with similar tokens closer to each other in vector distance. Each purple dot shows one token.

Figure 8: Embedding space containing SlithIR tokens from a set of training Solidity data

We are currently developing a proprietary database consisting of our previous clients’ publicly available vulnerable smart contracts, along with contracts referenced in papers and other audits. Together they’ll form one unified, comprehensive database of Solidity vulnerabilities for queries, and for later training and testing of newer models.

We’re also working on other unsupervised and supervised models, using data labeled by static analyzers like Slither and Mythril. We’re examining deep learning models with much more expressivity for modeling source code, specifically graph-based models that utilize abstract syntax trees and control flow graphs.

And we’re looking forward to checking out Slither-simil’s performance on new audit tasks to see how it improves our assurance team’s productivity (e.g., in triaging and finding the low-hanging fruit more quickly). We’re also going to test it on Mainnet when it gets a bit more mature and automatically scalable.

You can try Slither-simil now on this Github PR. For end users, it’s the simplest CLI tool available:

  1. Input one or multiple smart contract files (either directory, .zip file, or a single .sol).
  2. Identify a pre-trained model, or separately train a model on a reasonable amount of smart contracts.
  3. Let the magic happen, and check out the similarity results.

$ slither-simil test etherscan_verified_contracts.bin --filename MetaCoin.sol --fname MetaCoin.sendCoin --input cache.npz

Conclusion

Slither-simil is a powerful tool with potential to measure the similarity between function snippets of any size written in Solidity. We are continuing to develop it, and based on current results and recent related research, we hope to see impactful real-world results before the end of the year.

Finally, I’d like to thank my supervisors Gustavo, Michael, Josselin, Stefan, Dan, and everyone else at Trail of Bits, who made this the most extraordinary internship experience I’ve ever had.

Let’s build a high-performance fuzzer with GPUs!

by Ryan Eberhardt, Stanford University

TL;DR: Can we use GPUs to get 10x performance/dollar when fuzzing embedded software in the cloud? Based on our preliminary work, we think the answer is yes!

Fuzzing is a software testing technique that supplies programs with many randomized inputs in an attempt to cause unexpected behavior. It’s an important, industry-standard technique responsible for the discovery of many security vulnerabilities and the prevention of many more. However, fuzzing well takes time, and fuzzing embedded software presents additional challenges.

Embedded platforms are not designed for the instrumentation and high-throughput computation required to find bugs via fuzzing. Without access to source code, practical fuzzing of such platforms requires an emulator, which is slow, or many physical devices, which are often impractical.

Most fuzzing approaches have used conventional CPU architectures or emulators, but we decided to use other commodity hardware to tackle this problem—in particular, GPUs. The recent boom in machine learning has driven down the off-peak price of GPUs and made massive GPU computing capacity readily available from all major cloud providers. GPUs are very good at executing tasks in parallel, and fuzzing is an easily parallelizable problem.

In this blog post, I’ll walk you through the design and implementation of this massively parallel GPU-based fuzzer. So far, we’ve implemented an execution engine that can achieve 5x more executions/second/dollar than libFuzzer—and there’s much more room for optimization.

The Pitch: Fuzzing with GPUs

Fuzzing aims to generate unexpected inputs to a program and cause undesired behaviors (e.g., crashes or memory errors). The most commonly used fuzzers are coverage-guided fuzzers, which focus on finding inputs that cause new code coverage (such as executing a function that wasn’t executed before) to explore edge cases that may crash the program.


To do this, fuzzers run many different randomized inputs through the target program. This task is easily parallelizable, as each input can be executed independently of the others.

GPUs are fairly cheap; it costs $0.11/hour for a pre-emptible Tesla T4 on Google Cloud. Also, GPUs are really good at executing many things in parallel—a Tesla T4 can context-switch between over 40,000 threads and can simultaneously execute 2,560 of them in parallel—and, as mentioned, fuzzing is a naturally parallelizable problem. Using thousands of threads, we should theoretically be able to test thousands of different inputs at the same time.

Why Hasn’t Anyone Done This Before?

In short, running code on GPUs is very different from running code on CPUs in a few critical ways.

First, a GPU cannot directly execute x86/aarch64/etc. instructions, as GPUs have their own instruction set. Our goal is to fuzz embedded software for which no source code is available. With only a binary in hand, we have no easy way of generating GPU assembly to run.

Second, a GPU has no operating system. Traditional parallel fuzzers launch multiple processes that can execute inputs separately without interfering with other processes. If an input causes one process to crash, other processes will be unaffected. GPUs have no notion of processes or address space isolation, and any memory violation will cause the entire fuzzer to crash, so we need to find some way to isolate the parallel instances of the program being fuzzed.

Additionally, without an operating system, there’s no one home to answer system calls, which enable a program to open files, use the network, and so on. System calls must be emulated or relayed back to the CPU to be executed by the host operating system.

Finally, GPU memory is hard to manage well. GPUs have complicated memory hierarchies with several different types of memory, and each has different ease-of-use and performance characteristics. Performance is highly dependent on memory access patterns, and controlling when and how threads access memory can make or break an application. Additionally, there isn’t much memory to go around, making it even more difficult to properly manage memory layout and access patterns. Having 16GB of device memory might sound impressive, but splitting it between 40,000 threads of execution leaves each thread with a paltry 419 KiB.
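
To put that figure in perspective, the arithmetic is a quick back-of-the-envelope check (Python; the 40,000-thread count is the context-switching capacity mentioned above):

# Split 16 GiB of device memory evenly across 40,000 threads of execution.
total_bytes = 16 * 1024 ** 3
threads = 40_000
print(total_bytes / threads / 1024)  # ~419.4 KiB per thread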

Can We Build It? Yes We Can!

There are many obstacles to building a working GPU fuzzer, but none of them are insurmountable.

Executing code with binary translation

First, let’s see if we can get aarch64 binaries running on the GPU.

As mentioned, we want to fuzz embedded binaries (e.g., ARMv7, aarch64, etc.) on the GPU. NVIDIA GPUs use a different instruction set architecture called PTX (“Parallel Thread eXecution”), so we cannot directly execute the binaries we want to fuzz. A common solution to this problem is to emulate an embedded CPU, but developing a CPU emulator for GPUs would likely be an expensive investment that would perform poorly. Another alternative is to translate binaries to PTX so they can execute directly on the GPU without emulation.

Trail of Bits has developed a binary translation tool called Remill that we can use to do this. Remill “lifts” binaries to LLVM IR (intermediate representation), which can then be retargeted and compiled to any architecture supported by the LLVM project. It just so happens that LLVM supports emitting LLVM IR as PTX code, which is perfect for our purposes.

Say we have this simple example function, which sets w19 to 0, adds 5, and returns the result:

main:
    mov w19, #0       // Store the number 0 in register w19
    add w19, w19, #5  // Add 5
    mov w0, w19       // Return the result
    ret

We can pass the bytes for these instructions to Remill, which produces LLVM IR that models the original program executing on an ARM processor:

// Simplified for brevity :)
define dso_local %struct.Memory* @_Z5sliceP6MemorymPm(%struct.Memory* readnone returned %0, i64 %1, i64* nocapture %2) local_unnamed_addr #0 {
  %4 = add i64 %1, 1
  store i64 %4, i64* %2, align 8, !tbaa !2
  ret %struct.Memory* %0
}

Then, with some optimizations, we can have LLVM compile the above LLVM IR to PTX assembly:

ld.param.u64  %rd1, [sub_0_param_0];
ld.param.u64  %rd2, [sub_0_param_1];
mov.u64       %rd4, 5;
st.u64        [%rd1+848], %rd4;
add.s64       %rd5, %rd2, 12;
st.u64        [%rd1+1056], %rd5;
ret;

Finally, we can load this PTX into a GPU and execute it as if we’d had access to the source code in the first place.

Managing Memory

As mentioned earlier, GPUs have no operating system to provide isolation between processes. We need to implement address space isolation so multiple instances of the fuzzed program can access the same set of memory addresses without interfering with each other, and we need to detect memory safety errors in the target program.

Remill replaces all memory accesses in the original program with calls to the special functions read_memory and write_memory. By providing these functions, we can implement a software memory management unit that fills in for the missing OS functionality and mediates memory accesses.

For example, consider this function that takes a pointer and increments the integer it points to:

add_one:
    ldr w8, [x0]   // Load the value pointed to by the pointer
    add w8, w8, #1 // Increment the value
    str w8, [x0]   // Write the new value back
    ret

Remill translates this assembly into the following IR containing a read_memory call, an add instruction, and a write_memory call:

define %struct.Memory* @slice(%struct.Memory*, i64 %X8, i64* nocapture %X0_output) local_unnamed_addr #2 {
%2 = tail call i32 @__remill_read_memory_32(%struct.Memory* %0, i64 undef) #3
%3 = add i32 %2, 1
%4 = tail call %struct.Memory* @__remill_write_memory_32(%struct.Memory* %0, i64 undef, i32 %3) #3
%5 = tail call %struct.Memory* @__remill_function_return(%struct.State* nonnull undef, i64 16, %struct.Memory* %4) #2, !noalias !0
ret %struct.Memory* %5
}

By providing __remill_read_memory_32 and __remill_write_memory_32 functions, we can give each thread its own virtual address space. In addition, we can validate memory access and intercept invalid access before it crashes the entire fuzzer.

Remember, though, that 16GB of device memory is actually not much when shared across 40,000 threads. To conserve memory, we can use copy-on-write strategies in our MMU; threads share the same memory until one of the threads writes to memory, at which point that memory is copied. Conserving memory this way has been a surprisingly effective strategy.
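
To illustrate the copy-on-write idea, here is a conceptual model in Python (not the actual GPU-side code; the CowAddressSpace class and the 4 KiB page size are ours): each thread keeps a page table that points at shared pages until its first write to a page forces a private copy.

# Conceptual copy-on-write address space: every thread starts with a page
# table pointing at the same shared pages; a page is only copied when the
# thread first writes to it. (Illustrative model only; the real MMU also
# validates accesses so invalid ones can be intercepted.)
PAGE_SIZE = 4096

class CowAddressSpace:
    def __init__(self, shared_pages):
        # Map page number -> (page bytes, shared?) for this thread.
        self.pages = {n: (page, True) for n, page in shared_pages.items()}

    def read(self, addr):
        page, _ = self.pages[addr // PAGE_SIZE]   # KeyError = invalid access
        return page[addr % PAGE_SIZE]

    def write(self, addr, value):
        number = addr // PAGE_SIZE
        page, shared = self.pages[number]
        if shared:                                # first write: copy the page
            page = bytearray(page)
            self.pages[number] = (page, False)
        page[addr % PAGE_SIZE] = value

shared = {0: bytearray(PAGE_SIZE)}
thread_a, thread_b = CowAddressSpace(shared), CowAddressSpace(shared)
thread_a.write(42, 7)
print(thread_a.read(42), thread_b.read(42))  # prints "7 0": thread_b still sees the shared page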

Initial performance

Wonderful—we have something that works! We can take a binary program, translate it to LLVM, convert it to PTX, mix in an MMU, and run the result on thousands of GPU threads in parallel.

But how well are we meeting our goal of building a fuzzer that achieves 10x performance per dollar when compared to other fuzzers?

Evaluating fuzzers is a very tricky business, and there have been many papers published about how to effectively compare them. Our fuzzer is still too young to properly evaluate, since we are still missing critical fuzzer components such as a mutator to generate new inputs to the program. To measure executor performance only, we can look at how quickly our fuzzer runs inputs through the target program (in executions/second). By normalizing the cost of the compute hardware (GPUs are more expensive than the CPUs on which other fuzzers run), we can compare executions/second/$.

What should we fuzz in our benchmarking tests? The BPF packet filtering code from libpcap seems like a good candidate for a few reasons:

  • It implements a complicated state machine that is too difficult for humans to reason about, making it a good candidate for fuzzing.
  • BPF components have had bugs in the past, so this is a realistic target that we might want to fuzz.
  • It has no system calls (our minimal fuzzer does not yet support system calls).

Let’s write a test application that takes a packet from the fuzzer and runs a complicated BPF filter program on it:

dst host 1.2.3.4 or tcp or udp or ip or ip6 or arp or rarp
or atalk or aarp or decnet or iso or stp or ipx

This test program doesn’t do a whole lot, but it does exercise complicated logic and requires a good amount of memory access.

To evaluate our fuzzer, we can compare it to libFuzzer, a fast and widely used fuzzer. This isn’t an entirely fair comparison. On one hand, libFuzzer solves an easier problem: It fuzzes using the test program’s source code, whereas our fuzzer translates and instruments a binary compiled for a different architecture. Source code is often unavailable in security research. On the other hand, libFuzzer is performing mutation to generate new inputs, which we are not doing. While obviously imperfect, this comparison will work well enough to provide order-of-magnitude estimates.

I ran this comparison using Google Compute Engine 8-core N1 instances ($0.379998/hour for non-preemptible instances at time of writing) and a Tesla T4 GPU ($0.35/hour at time of writing).

Unfortunately, our fuzzer doesn’t compare so well against libFuzzer. libFuzzer achieves 5.2M execs/s/$, and our fuzzer only achieves 361K execs/s/$.

Let’s see if we can do better…

Interleaving Memory

Before we start optimizing performance, we should profile the fuzzer to get a better sense of how it’s performing. Nvidia’s Nsight Compute profiler helps explain hardware utilization and performance bottlenecks.

From the profile, we can see that the GPU is only using 3% of its compute capacity. Most of the time, the GPU compute hardware is sitting idle, not doing anything.

This generally happens because of high memory latency: The GPU is waiting for memory reads/writes to complete. However, this isn’t happening because our fuzzer needs to access too much memory; the profile shows that the GPU is only using 45% of its available memory bandwidth. Rather, we must be accessing memory very inefficiently. Each memory access is taking a long time and not providing enough data for computation.

To fix this problem, we need a better understanding of a GPU’s execution model.

GPU threads execute in groups of 32 called warps. All threads in a warp execute together on a single multiprocessor, and they run in lockstep, i.e., they execute the same instruction at the same time.

When threads read or write memory, the memory accesses happen in 128-byte blocks. If the 32 threads in the warp try to read memory that lies in the same 128-byte block, then the hardware will only need to request one block (one “transaction”) from the memory bus.

However, if the threads each read memory addresses from different blocks, the hardware may need to make 32 separate memory transactions, which are often serialized. This leads to the behavior we found in the profile: The compute hardware is almost always idle because it has to wait for so many memory transactions to complete. The memory bandwidth utilization doesn’t appear quite as bad because many 128-byte chunks are being read, but only four or eight bytes out of each chunk are actually used, so much of the used bandwidth is wasted.

Currently, we allocate separate memory for each thread, so when a thread accesses memory, it very rarely falls into the same 128-byte chunk as a different thread. We can change that by allocating a slab of memory for a warp (32 threads), and interleaving the threads’ memory within that warp. This way, when threads need to access a value from memory, their values are all next to each other, and the GPU can fulfill these memory reads with a single memory transaction.
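
Here is a sketch of the address arithmetic (our illustration; the 64 KiB per-thread region and 4-byte word size are made up for the example): with one slab per 32-thread warp and values interleaved by lane, the 32 loads of "word 0" land in a single 128-byte block rather than 32 separate ones.

# Compare the two layouts: separate per-thread regions vs. a per-warp slab
# with 4-byte words interleaved across the 32 lanes.
WARP = 32
WORD = 4
REGION = 64 * 1024  # per-thread region size in the non-interleaved layout

def separate_layout(lane, word_index):
    return lane * REGION + word_index * WORD

def interleaved_layout(lane, word_index):
    return word_index * WARP * WORD + lane * WORD

def blocks_touched(addresses):
    # Memory transactions happen in 128-byte blocks.
    return len({address // 128 for address in addresses})

separate = [separate_layout(lane, 0) for lane in range(WARP)]
interleaved = [interleaved_layout(lane, 0) for lane in range(WARP)]
print(blocks_touched(separate))     # 32 blocks -> 32 memory transactions
print(blocks_touched(interleaved))  # 1 block  -> a single coalesced transaction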

Trying this out, we find that performance improves by an order of magnitude! Clearly, it’s extremely important to be aware of memory access patterns when programming for GPUs.

Reducing Data Transfers and Kernel Launches

Re-running the profiler, we can see that we are getting much better compute utilization (33%, up from 3%), but we are still nowhere near full utilization. Can we do better?

Continuing our examination of memory usage patterns, let’s look at the type of memory used. Nvidia GPUs have several kinds of memory located in different physical places, but the easiest type to use is called “unified memory,” which automatically transfers data between different physical locations on our behalf. We have been using this because it doesn’t require us to think much about where bytes are being physically stored, but it can lead to performance bottlenecks if mismanaged, since data will be transferred between physical memory locations inefficiently.

Since we are still seeing very high memory latency, let’s take a closer look at these transfers.

Our simple fuzzer is working in “rounds”: if the GPU can run 40,000 threads, we pass 40,000 inputs to the GPU, and each thread fuzzes an input before we launch the next round. In between rounds, we reset the memory used (e.g., coverage tracking data structures and memory used by the program being fuzzed). However, this results in significant data transfers between the GPU and the CPU in between each round, as memory is paged back to the CPU, reset, and then paged back to the GPU. While these transfers are happening, the GPU is doing nothing. Additional latency is incurred as the GPU waits for the CPU to launch the next round.

We can improve this setup by launching the GPU code once and avoiding synchronization between the CPU and GPU. Much of the data doesn’t need to be in unified memory; we can allocate global memory on the GPU instead, then asynchronously transfer data to the CPU when we need to send information about fuzzing progress (e.g., which inputs are causing crashes). In this way, when a thread finishes fuzzing an input, it can reset the memory and proceed to the next input without data transfer overhead and without waiting on the CPU.

This achieves a speedup of almost another order of magnitude! Now, the fuzzer is about five times faster per dollar than libFuzzer.

These results are extremely promising: Although our fuzzer lacks a mutation engine and can’t handle system calls, beating libFuzzer’s performance per dollar to this degree suggests that fuzzing with GPUs may be extremely useful for certain applications.

What’s Next for GPU-Based Fuzzing?

Although we are close to our performance goal for this test program, the project still has a long way to go. Hardware utilization remains low, so there’s room for more optimization.

In addition, we need to build support for handling system calls, which may have a significant performance impact when fuzzing I/O-heavy applications. We also need to build the mutation engine before this fuzzer can be useful, although this problem is much better understood than building the execution engine.

Still, we’re very excited to be getting such promising results in such early stages of development. We look forward to an order of magnitude improvement in fuzzing embedded binaries.

We would love to hear your thoughts on this work! Contact us at ryan@reberhardt.com or artem@trailofbits.com.

Finally, a big “thank you” goes to Artem Dinaburg for the initial design of this system and for mentoring me throughout this project. Also, thank you to Peter Goodman for giving design feedback and debugging suggestions.

Osquery: Using D-Bus to query systemd data

by Rachel Cipkins, Stevens Institute of Technology

During my summer internship at Trail of Bits I worked on osquery, the massively popular open-source endpoint monitoring agent used for intrusion detection, threat hunting, operational monitoring, and many other functions. Available for Windows, macOS, Linux, and FreeBSD, osquery exposes an operating system as a high-performance relational database, which allows you to write SQL-based queries to explore operating system data.

My initial task was to port osquery’s startup_items table to Linux. Since the startup_items table is only available on macOS and Windows, we wanted to port it to Linux while keeping the current schema. Porting to Linux is complicated, though; like macOS and Windows, Linux has an indefinite number of locations for startup items, so I needed to parse the data in each location and insert it into the table. This would have been fairly simple, but we couldn’t directly parse the data for the systemd location. Ultimately, we added systemd support to the table through the D-Bus API and created a brand-new table for systemd units.

A note on the startup_items table…

Startup items are applications and binaries that run when your system is booted up, but “startup items” is also an abstract concept indicating some set of locations and subsystems that you want to enumerate. There are two primary types of startup item locations on Linux: user-specific and system-specific. The user-specific locations include ~/.config/autostart and ~/.config/autostart-scripts, which contain desktop entries and scripts, respectively; the system-specific locations include XDG and SysV.

You can see an example of the startup_items table for Linux at https://asciinema.org/a/354125.

Systemd and D-Bus

Systemd is not as simple to handle as the other locations; it’s an init system and service manager for Linux that uses units to represent resources. Units are resources the system knows how to manage and operate, and they are defined in unit files. We couldn’t parse these unit files directly, as we did for the files in the other locations; the only way to get the information we needed was through an API.

So we turned to D-Bus, an interprocess communication system that allows us to interact directly with systemd to extract information about startup items. Now, systemd does have its own bus library, sd-bus, but we still prefer D-Bus because:

  • osquery uses CMake as its build system, and systemd does not; D-Bus does, so it was simpler to integrate with osquery.
  • D-Bus can be used to query things other than systemd.

Here are some of the API calls we used from D-Bus to extract the information we needed:

// Connection
dbus_error_init(&error);
conn = dbus_bus_get(DBUS_BUS_SYSTEM, &error);

...

// Message
message =
      dbus_message_new_method_call("org.freedesktop.systemd1",
                                   "/org/freedesktop/systemd1",
                                   "org.freedesktop.systemd1.Manager",
                                   "ListUnits");
...

// Reply
reply =
      dbus_connection_send_with_reply_and_block(conn, message, -1, &error);

And here’s an example of the startup_items table with systemd:

https://asciinema.org/a/354126

The systemd units table rises again

A few years ago the osquery community determined there was a need for some sort of table related to systemd. We restarted that conversation after successfully implementing D-Bus, and it was agreed that our unit-based table was the right direction. There are many different types of systemd units—some of the most common ones are service, socket, and device—and the systemd unit table has a column that differentiates them.

This example shows the many distinct types of units currently on my computer and narrows the results by type. Here we have executed a query for all of the services: https://asciinema.org/a/354127.

There are three states associated with each unit:

  • load_state indicates whether or not the unit has been properly loaded onto the system.
  • active_state indicates whether or not the unit is activated.
  • sub_state is an additional state specific to each type of unit.

Here you can see all the active units on the system: https://asciinema.org/a/354130.

What’s next for D-Bus and osquery?

D-Bus will allow us to query a lot of other things on osquery, including desktop environment configurations for GNOME and KDE, and network devices. Be sure to check out the new startup_items and systemd_units tables once they are merged and keep an eye out for new D-Bus features on osquery.

Want to learn more about osquery or contribute to the project? Check it out here!

Detecting Iterator Invalidation with CodeQL

by Kevin Higgs, Montgomery Blair High School

Iterator invalidation is a common and subtle class of C++ bugs that often leads to exploitable vulnerabilities. During my Trail of Bits internship this summer, I developed Itergator, a set of CodeQL classes and queries for analyzing and discovering iterator invalidation.

Results are easily interpretable by an auditor, providing information such as where an iterator is acquired, where it is invalidated, and a significance level that indicates the likelihood of a false positive. Itergator has been used to find bugs in real-world code, and the queries are easily extensible for further analysis.

Iterators Defined

Iterators are the standard way to traverse the contents of a container in C++. An iterator object supports at least two operations: dereferencing, to get the underlying object in the container; and incrementation, to get an iterator for the next element.

For example, the following code will output 1 2 3 4 5:

std::vector<int> vec{1, 2, 3, 4, 5};

for (std::vector<int>::iterator it = vec.begin(), end = vec.end(); it != end; ++it) {
    std::cout << *it << " ";
}

This is such a common code pattern that C++11 introduced a simplified syntax:

for (auto i : vec) {
    std::cout << i << " ";
}

While this is equivalent to the previous code, all iterator operations are now generated by the compiler, and the details of the iteration are hidden from the developer.

Iterator Invalidation

Iterators are invalidated after certain modifications to their container, such as adding or erasing an element. Use of invalidated iterators is, per the standard, undefined behavior. In other words, what happens is implementation-specific and probably not good. For example, in the following code (discovered by Itergator in Cataclysm: Dark Days Ahead) the call to zones.erase invalidates the iterator it:

void zone_manager::deserialize( JsonIn &jsin )
{
    jsin.read( zones );
    for( auto it = zones.begin(); it != zones.end(); ++it ) {
        const zone_type_id zone_type = it->get_type();
        if( !has_type( zone_type ) ) {
            zones.erase( it );
            debugmsg( "Invalid zone type: %s", zone_type.c_str() );
        }
    }
}

The iterators of a vector in libstdc++ are pointers into the vector’s backing buffer. The erase method shifts all elements past the erased position one slot to the left, overwriting the erased object, and decrements the vector’s end.

If the vector contains only one element, vec.end() becomes the same as vec.begin(). In the example invalidation, at the end of the first loop iteration the iterator is incremented to be the address after vec.begin(). This means the continuation condition it != zones.end() holds, so we enter the loop with the iterator referencing whatever memory exists after the backing buffer on the heap! Because of the complexity of Cataclysm, the heap layout and the crash are not deterministic, but a properly modified game save frequently results in a segmentation fault from dereferencing an invalid address.
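
To make that arithmetic concrete, here is a toy model of the one-element case (Python, with integer indices standing in for libstdc++’s pointer-based iterators; it only tracks positions, not the underlying data):

size = 1     # zones holds a single, invalid zone
it = 0       # it = zones.begin()
end = size   # zones.end()

# Loop body: the zone type is invalid, so zones.erase(it) runs.
size -= 1    # erase shifts later elements left and decrements the end
end = size   # end() now equals begin()

it += 1      # ++it at the end of the loop iteration

print(it != end)  # True -> the loop continues with an iterator past the buffer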

While this is a relatively benign example, the threat presented by this class of issues is not theoretical; iterator invalidation bugs in high-value targets have been weaponized before.

CodeQL

CodeQL is a static analysis framework developed by GitHub that allows you to query codebases with an SQL-like syntax. It has an object-oriented class system with predicates that define logical properties and relationships. The standard library provides a comprehensive set of classes which allow querying for a wide array of code properties and patterns.

A CodeQL database can be built for almost any code that compiles. GitHub maintains an index of databases they’ve built from public repositories at lgtm.com, which can be queried on their site or locally with the CodeQL CLI. There is also a Visual Studio Code extension for inspection of query results.

Itergator consists of both queries and libraries, allowing auditors to use Itergator’s classes in their own queries.

Detecting Iterator Invalidation

Using static analysis to detect iterator invalidation presents several challenges. The example above is simple, but invalidations can be nested many function calls deep, with complex logic surrounding them. Some iterators are declared and invalidated outside of a loop, resulting in flows that are expensive to detect without an enormous number of false positives. It is also important for the query to be extensible: Codebases often have their own iterable types with invalidation constraints that need to be detected.

CodeQL’s global data flow library and support for recursion make complex control flow analyses easy to write. Itergator is able to construct a graph of all potentially invalidating function calls (those that may result in a call to an invalidating function, like std::vector::push_back) and define classes to be used in queries:

  • Iterator: a variable that stores an iterator
  • Iterated: where a collection is iterated, e.g. vec in vec.begin()
  • Invalidator: a potentially invalidating function call in the scope of an iterator
  • Invalidation: a function call that directly invalidates an iterator

The InvalidationFlows query relates these classes with data flow to locate likely invalidations. To query non-standard iterated types, you simply extend the PotentialInvalidation class which, as an abstract class, is defined as the union of its subclasses. For example, here is an invalidation definition for destructors:

class PotentialInvalidationDestructor extends PotentialInvalidation {
    PotentialInvalidationDestructor() {
        this instanceof MemberFunction
        and this.getName().matches("~%")
    }

    override predicate invalidates(Iterated i) {
        i.getType().refersTo(this.getParentScope())
    }
}

These subclasses can be defined anywhere in your query or an imported library; definitions for STL classes are already included in Itergator. A utility query in Itergator, IteratedTypes, identifies what types to specify invalidation constraints for.

A large part of Itergator’s development required finding fixed iterator invalidation bugs on GitHub and attempting to reproduce them. One especially tricky bug in a regular expression library by Google exemplifies the challenges of this project:

struct Frame {
  Frame(Regexp** sub, int nsub)
      : sub(sub),
        nsub(nsub),
        round(0) {}

  Regexp** sub;
  int nsub;
  int round;
  std::vector<Splice> splices;
  std::vector<Splice>::iterator spliceiter;
};

int Regexp::FactorAlternation(Regexp** sub, int nsub, ParseFlags flags) {
  std::vector<Frame> stk;
  stk.emplace_back(sub, nsub);

  for (;;) {
    ...
    auto& splices = stk.back().splices;
    auto& spliceiter = stk.back().spliceiter;

    if (splices.empty()) {
      round++;
    } else if (spliceiter != splices.end()) {
      stk.emplace_back(spliceiter->sub, spliceiter->nsub);
      continue;
    } else { ... }

    switch (round) { ... }

    if (splices.empty() || round == 3) {
      spliceiter = splices.end();
    } else {
      spliceiter = splices.begin();
    }
  }
}

This function declares stk, a vector of frames, each of which has a splices vector and a spliceiter iterator. The iterator begins uninitialized and is only assigned a value by the if/else at the bottom of the loop, i.e., at the end of the first loop iteration. It’s not obvious where the invalidation occurs: It isn’t an operation on splices directly, but the element added to stk by the emplace_back in the else-if branch. If the backing buffer of stk is at capacity, it is reallocated and the Frame objects are copied, resulting in re-allocation of each splices vector. Because of the continue statement, spliceiter is never re-initialized, and an invalidated iterator is used on the next loop iteration.

This invalidation happens over three iterations of the loop: first initialization of the iterator, then invalidation, and finally, usage. The invalidating function call is performed on a member of an object stored inside a vector; confirming that this is the same vector the iterator refers to would be extremely complicated. Tracking control flow across all three executions is possible but expensive, and the query becomes impractical to run on large codebases.

My solution to these problems was to search for conditions necessary, but not sufficient, for invalidation. For example, I verified that the same variable—not value—can flow to both locations of iteration and invalidation. While this introduces a significant number of false positives, automatic filtering based on recurring patterns and the addition of a “significance” value makes searching through the results very manageable, while still identifying complex invalidations like the one above.

CodeQL’s caching and optimization also mean Itergator can query massive codebases, like Apple’s fork of LLVM for Swift, and find deep invalidations. Itergator identified the following bug, which was unintentionally fixed upstream a couple months ago, where the invalidation is 12 function calls deep. InvalidationFlows gives us the iteration and invalidation locations; then, after further investigation, including a customized path query, we can identify the necessary control flow:

And then we can construct a reproduction:

Step 1: Run the LLVM linker with undefined symbols and lazily loaded bitcode that references a linker script.

./ld.lld --no-threads --undefined asdf --undefined fawef --start-lib ~/lib.bc --end-lib ~/a.o

Steps 2 through 13:

handleUndefined
lld::elf::Symbol::fetch() const
lld::elf::LazyObjFile::fetch()
lld::elf::parseFile(lld::elf::InputFile*)
doParseFile<llvm::object::ELFType<1, true> >
void lld::elf::BitcodeFile::parse<llvm::object::ELFType<1, true> >()
addDependentLibrary
lld::elf::LinkerDriver::addFile(llvm::StringRef, bool)
lld::elf::readLinkerScript(llvm::MemoryBufferRef)
readLinkerScript
readExtern
std::vector<llvm::StringRef>::push_back(llvm::StringRef&&)

Step 14: Profit.

==332005==ERROR: AddressSanitizer: heap-use-after-free on address 0x603000004160 at pc 0x556f81e36288 bp 0x7ffd14c663f0 sp 0x7ffd14c663e0
READ of size 16 at 0x603000004160 thread T0
    #0 0x556f81e36287 in void 
lld::elf::LinkerDriver::link<llvm::object::ELFType<(llvm::support::endianness)1, true> >(llvm::opt::InputArgList&) (/home/kmh/llvm-project/build/bin/lld+0x1d53287)
    #1 0x556f81ddaa10 in lld::elf::LinkerDriver::main(llvm::ArrayRef<char const*>) /home/kmh/llvm-project/lld/ELF/Driver.cpp:514
    #2 0x556f81ddc3b6 in lld::elf::link(llvm::ArrayRef<char const*>, bool, llvm::raw_ostream&, llvm::raw_ostream&) /home/kmh/llvm-project/lld/ELF/Driver.cpp:111
    #3 0x556f8186cda8 in main /home/kmh/llvm-project/lld/tools/lld/lld.cpp:154
    #4 0x7f1f70d1b151 in __libc_start_main (/usr/lib/libc.so.6+0x28151)
    #5 0x556f8186861d in _start (/home/kmh/llvm-project/build/bin/lld+0x178561d)

Conclusion

Itergator is a powerful tool for detecting complex iterator invalidations in codebases of any size. Working with CodeQL’s declarative query language was awesome, despite the occasional engine bug, as it incorporated concepts I was already familiar with to make static analysis easy to pick up. There will always be improvements to make and more bugs to hunt, but I’m very happy with my results.

Finally, I’d like to thank my mentor Josh and everyone else at Trail of Bits who made this summer great. I can definitively say that Trail of Bits is the best place I’ve ever worked! If you have any questions, or just want to talk, shoot me a message on Twitter @themalwareman.

PrivacyRaven Has Left the Nest

By Suha S. Hussain, Georgia Tech

If you work on deep learning systems, check out our new tool, PrivacyRaven—it’s a Python library that equips engineers and researchers with a comprehensive testing suite for simulating privacy attacks on deep learning systems.

PrivacyRaven is a comprehensive testing suite for simulating privacy attacks on deep learning systems

Because deep learning enables software to perform tasks without explicit programming, it’s become ubiquitous in sensitive use cases such as:

  • Fraud detection,
  • Medical diagnosis,
  • Autonomous vehicles,
  • Facial recognition,
  • … and more.

Unfortunately, deep learning systems are also vulnerable to privacy attacks that compromise the confidentiality of the training data set and the intellectual property of the model. And unlike other forms of software, deep learning systems lack extensive assurance testing and analysis tools such as fuzzing and static analysis.

The CATastrophic Consequences of Privacy Attacks

But wait—are such privacy attacks likely? After all, medical applications using deep learning are subject to strict patient privacy regulations.

Unfortunately, yes. Imagine you’re securing a medical diagnosis system for detecting brain bleeds using CAT scan images:

Now, suppose the deep learning model in this image predicts whether or not a patient has a brain bleed and responds with a terse “Yes” or “No” answer. This setting provides users with as little access to the model as possible, so you might think there’s not much an adversary could learn. However, even when strongly restricted, an adversary modeled by PrivacyRaven can:

Evidently, an adversary can critically compromise the confidentiality of this system, so it has to be defended against privacy attacks to be considered secure. Otherwise, any major vulnerability has the potential to undermine trust and participation in all such systems.

PrivacyRaven Design Goals

Many other deep learning security techniques are onerous to use, which discourages their adoption. PrivacyRaven is meant for a broad audience, so we designed it to be:

  • Usable: Multiple levels of abstraction allow users to either automate much of the internal mechanics or directly control them, depending on their use case and familiarity with the domain.
  • Flexible: A modular design makes the attack configurations customizable and interoperable. It also allows new privacy metrics and attacks to be incorporated straightforwardly.
  • Efficient: PrivacyRaven reduces the boilerplate, affording quick prototyping and fast experimentation. Each attack can be launched in fewer than 15 lines of code.

As a result, PrivacyRaven is appropriate for a range of users, e.g., a security engineer analyzing bot detection software, an ML researcher pioneering a novel privacy attack, an ML engineer choosing between differential privacy techniques, and a privacy researcher auditing data provenance in text-generation models.

Threat Model

Optimized for usability, efficiency, and flexibility, PrivacyRaven allows users to simulate privacy attacks. Presently, the attacks provided by PrivacyRaven operate under the most restrictive threat model, i.e., they produce worst-case scenario analyses. (This may change as PrivacyRaven develops.) The modeled adversary only receives labels from an API that queries the deep learning model, so the adversary directly interacts only with the API, not the model:

Many other known machine learning attacks exploit the auxiliary information released under weaker threat models. For instance, under the white-box threat model, many systems allow users to access model parameters or loss gradients. Some black-box attacks even assume that the adversary receives full confidence predictions or model explanations.

Despite the possible benefits of these features, if you are deploying a deep learning system, we recommend reducing user access and adhering to PrivacyRaven’s threat model. The extra information provided under the aforementioned weaker threat models substantially increases the effectiveness and accessibility of attacks.

PrivacyRaven Features

PrivacyRaven provides three types of attacks: model extraction, membership inference, and model inversion. Most of the library is dedicated to wrappers and interfaces for launching these attacks, so users don’t need an extensive background in machine learning or security.

1. Model Extraction

Model extraction attacks directly violate the intellectual property of a system. The primary objective is to extract a substitute model, effectively granting the adversary a copycat version of the target.

These attacks fall into two categories: optimized for high accuracy or optimized for high fidelity. A high-accuracy substitute model attempts to perform the task to the best of its ability. If the target model incorrectly classifies a data point, the substitute model will prioritize the correct classification. In contrast, a high-fidelity substitute model will duplicate the errors of the target model.

High-accuracy attacks are typically financially motivated. Models are often embedded in a Machine-Learning-as-a-Service distribution scheme, where users are billed according to the number of queries they send. With a substitute model, an adversary can avoid paying for the target and profit from their own version.

High-fidelity attacks are used for reconnaissance to learn more about the target. The substitute model extracted using this attack allows the adversary to launch other classes of attacks, including membership inference and model inversion.

Because the existing methods of model extraction often adopt disparate approaches, most security tools and implementations treat each extraction attack distinctly. PrivacyRaven instead partitions model extraction into multiple phases that encompass most attacks found in the literature (notably excluding cryptanalytic extraction):

  1. Synthesis: First, synthetic data is generated with techniques such as leveraging public data, exploiting population statistics, and collecting adversarial examples.
  2. Training: A preliminary substitute model is trained on the synthetic dataset. Depending on the attack objectives and configuration, this model doesn’t need to have the same architecture as the target model.
  3. Retraining: The substitute model is retrained using a subset sampling strategy to optimize the synthetic data quality and the overall attack performance. This phase is optional.

With this modular approach, users can quickly switch between different synthesizers, sampling strategies, and other features without being limited to configurations that have already been tested and presented. For example, a user may combine a synthesizer found in one paper on extraction attacks with a subset sampling strategy found in another one.
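
To make the phases concrete, here is a minimal, self-contained sketch of the synthesis/training/retraining loop written with scikit-learn. This is not PrivacyRaven’s API: the label-only target, the random-sampling synthesizer, and the uncertainty-based subset strategy below are stand-ins chosen purely for illustration.

import numpy as np
from sklearn.neural_network import MLPClassifier

def target_api(x):
    # The "victim": a label-only query interface, as in the threat model above.
    return (x.sum(axis=1) > 0).astype(int)

# Phase 1 (synthesis): sample random query points; a real attack might seed
# from public data or craft adversarial examples instead.
queries = np.random.randn(5000, 10)
labels = target_api(queries)

# Phase 2 (training): fit a preliminary substitute model on the synthetic dataset.
substitute = MLPClassifier(hidden_layer_sizes=(32,), max_iter=300)
substitute.fit(queries, labels)

# Phase 3 (optional retraining): refit on a subset chosen by a sampling
# strategy; here, the points the substitute is least confident about.
probabilities = substitute.predict_proba(queries)
uncertain = np.argsort(np.abs(probabilities[:, 1] - 0.5))[:1000]
substitute.fit(queries[uncertain], labels[uncertain])

print("agreement with target:", (substitute.predict(queries) == labels).mean())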

2. Membership Inference

Membership inference attacks are, at their core, re-identification attacks that undermine trust in the systems they target. For example, patients have to trust medical diagnosis system developers with their private medical data. But if a patient’s participation, images, and diagnosis are recovered by an adversary, it will diminish the trustworthiness of the whole system.

PrivacyRaven separates membership inference into different phases:

During a membership inference attack, an attack network is trained to detect whether a data point is included in the training dataset. To train the attack network, a model extraction attack is launched; its outputs are combined with adversarial robustness calculations to generate the attack network’s training dataset.

Unlike similar tools, PrivacyRaven integrates the model extraction API, which makes it easier to optimize the first phase, improve attack performance, and achieve stronger privacy guarantees. Additionally, PrivacyRaven is one of the first implementations of label-only membership inference attacks.

3. Model Inversion

Model inversion attacks look for data that the model has already memorized. Launching an inversion attack on the medical diagnosis system, for instance, would yield CAT scan images from its training dataset. In PrivacyRaven, this attack will be implemented by training a neural network to act as the inverse of the target model. Currently, this feature is in incubation and will be integrated into future PrivacyRaven releases.

Upcoming Flight Plans

We are rapidly adding more methods for model extraction, membership inference, and model inversion. Likewise, we’ll improve and extend the capabilities of PrivacyRaven to address the priorities of the larger deep learning and security communities. Right now, we’re considering:

  1. An enhanced interface for metrics visualizations: We intend PrivacyRaven to generate a high-quality output that balances comprehensiveness and clarity, so it lucidly demonstrates the attack’s impact to non-experts while still providing a measure of control for more specialized use cases.
  2. Automated hyperparameter optimization: Hyperparameter choices are both difficult to reason about and critical to the success of privacy attacks. We plan to incorporate hyperparameter optimization libraries like Optuna to help users avoid major pitfalls and reach their objectives faster.
  3. Verification of differential privacy or machine unlearning: Multiple mechanisms for auditing the implementations of differential privacy and machine unlearning exist, including using minimax rates to construct property estimators or manipulating data poisoning attacks. Consolidating these techniques would bolster the evaluation of privacy-preserving machine learning techniques.
  4. Privacy thresholds and metric calculations: Coupling metrics for privacy grounded in information theory and other fields of mathematics with practical privacy attacks is a nascent endeavor that would greatly benefit the field in its current state.
  5. More classes of attacks: We would like to incorporate attacks that specifically target federated learning and generative models as well as side channel and property inference attacks.

PrivacyRaven in Practice

To attack any deep learning model, PrivacyRaven requires only a query function from a classifier, regardless of the original programming framework or current distribution method. Here’s a model extraction attack executed with PrivacyRaven.

Inside the blue box, a query function is created for a PyTorch Lightning model included with the library (executed after the requisite components are imported). To accelerate prototyping, PrivacyRaven includes a number of victim models. The target model in this example is a fully connected neural network trained on the MNIST dataset. The single line inside of the red box downloads the EMNIST dataset to seed the attack. The bulk of the attack is the attack configuration, located in the green box. Here, the copycat synthesizer helps train the ImageNetTransferLearning classifier.

The output of this example is quite detailed, incorporating statistics about the target and substitute models in addition to metrics regarding the synthetic dataset and overall attack performance. For instance, the output may include statements like:

  • The accuracy of the substitute model is 80.00%.
  • Out of 1,000 data points, the target model and substitute model agreed on 900 data points.

This example demonstrates the core attack interface where attack parameters are defined individually. PrivacyRaven alternatively offers a run-all-attacks and a literature-based interface. The former runs a complete test on a single model, and the latter provides specific attack configurations from the literature.

The Future of Defense

Until now, in the arms race between privacy attacks and defense, engineers and researchers have not had the privacy analysis tools they need to protect deep learning systems. Differential privacy and stateful detection have emerged as two potential solutions to explore, among others. We hope PrivacyRaven will lead to the discovery and refinement of more effective defenses or mitigations. Check out this GitHub repository for a curated collection of research on privacy attacks and defenses.

Contribute to PrivacyRaven!

We’re excited to continue developing PrivacyRaven, and eagerly anticipate more applications. Try it out and contribute to PrivacyRaven now on GitHub: Incorporate a new synthesis technique, make an attack function more readable, etc.!

On a personal note, building PrivacyRaven was the primary objective of my internship this summer at Trail of Bits. It was a rewarding experience: I learned more about cutting-edge areas of security, developed my software engineering skills, and presented my PrivacyRaven work at Empire Hacking and the OpenMined Privacy Conference.

I’m continuing my internship through this winter, and look forward to applying what I’ve already learned to new problems. Feel free to contact me about PrivacyRaven or anything related to trustworthy machine learning at suha.hussain@trailofbits.com or @suhackerr.

Graphtage: A New Semantic Diffing Tool

Graphtage is a command line utility and underlying library for semantically comparing and merging tree-like structures such as JSON, JSON5, XML, HTML, YAML, and TOML files. Its name is a portmanteau of “graph” and “graftage” (i.e., the horticultural practice of joining two trees together so they grow as one). Read on for:

  • What Graphtage does differently and better
  • Why we developed it
  • How it works
  • Directions for using it as a library

All sorts of good

Graphtage lets you see what’s different between two files quickly and easily, but it isn’t a standard line-oriented comparison tool like diff. Graphtage is semantically aware, which allows it to map differences across unordered structures like JSON dicts and XML element tags. You can even compare files that are in two different formats! And when paired with our PolyFile tool, you can semantically diff arbitrary file formats.

Tree-like file formats are becoming increasingly common as a means for transmitting and storing data. If you’ve ever wrangled a gnarly REST API, disentangled the output of a template-generated webpage, or confabulated a config file (and subsequently needed to figure out which specific change was the one that made things work), you’ve probably fought with—and been disappointed by—the current state of open-source semantic diffing tools.

Graphtage solves these problems. It’s available today. To install the utility, run:

pip3 install graphtage

Grab the source code here.

How are existing diff tools insufficient?

Ordered nodes in the tree (e.g., JSON lists) and, in particular, unordered mappings (e.g., JSON dicts) are challenging. Most extant diffing algorithms and utilities assume that the structures are ordered. Take this JSON as an example:

# original.json
{
    "foo": [1, 2, 3, 4],
    "bar": "testing"
}
# modified.json
{
    "foo": [2, 3, 4, 5],
    "zab": "testing",
    "woo": ["foobar"]
}

Existing tools effectively canonicalize the JSON (e.g., sort dictionary elements by key and format lists with one item per line), and then perform a traditional diff. We don’t need no fancy tools for that! Here’s effectively what they do:

$ cat original.json | jq -M --sort-keys > original.canonical.json
$ cat modified.json | jq -M --sort-keys > modified.canonical.json
$ diff -u original.canonical.json modified.canonical.json
    {
    -  "bar": "testing",
       "foo": [
    -    1,
         2,
         3,
    -    4
    -  ]
    +    4,
    +    5
    +  ],
    +  "woo": [
    +    "foobar"
    +  ],
    +  "zab": "testing"
    }

That result is not very useful, particularly if the input files are large. The problem is that changing dict keys breaks the diff: Since “bar” was changed to “zab,” the canonical representation changed, and the traditional diff algorithm considered them separate edits (lines 2 and 15 of the diff).

In contrast, here is Graphtage’s output for the same pair of files:
An example of Graphtage's diff output

Why hasn’t this been done before?

In general, optimally mapping one graph to another cannot be done in polynomial time, and is therefore not tractable for graphs of any useful size (unless P=NP). This is true even for restricted classes of graphs like DAGs. However, trees and forests are special cases that can be mapped in polynomial time, with reasonable constraints on the types of edits possible. Graphtage exploits this.

How do it know?

Graphtage’s diffing algorithms operate on an intermediate representation rather than on the data structures of the original file format. This allows Graphtage to have generic comparison algorithms that can work on any input file type. Therefore, to add support for a new file type, all one needs to do is “lift” it to the intermediate representation. Likewise, one only needs to implement support for a new type of edit once, and it will immediately be available to apply against all supported filetypes. Using an intermediate representation has the added benefit of allowing cross-format comparisons and formatting translations: Graphtage will happily diff a JSON file against a YAML file, formatting the diff output in TOML syntax.
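
The lifting step is easy to picture with a toy example (ours, not Graphtage’s actual classes): every format-specific loader produces the same generic node types, so the comparison logic never has to know which format a tree came from.

import json

def lift(value):
    # Lift a parsed document into a tiny generic tree of mappings, sequences,
    # and leaves. Any format that parses into dicts/lists/scalars can be
    # lifted the same way, so a single diff algorithm serves them all.
    if isinstance(value, dict):
        return ("mapping", {key: lift(child) for key, child in value.items()})
    if isinstance(value, list):
        return ("sequence", [lift(child) for child in value])
    return ("leaf", value)

print(lift(json.loads('{"foo": [1, 2, 3, 4], "bar": "testing"}')))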

Graphtage matches ordered sequences like lists using an “online” “constructive” implementation of the Levenshtein distance metric, similar to the Wagner–Fischer algorithm. The algorithm starts with an unbounded mapping and iteratively improves it until the bounds converge, at which point the optimal edit sequence is discovered.
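
For intuition, here is the textbook Wagner–Fischer dynamic program that computes the same Levenshtein distance (our sketch, not Graphtage’s code; Graphtage’s version is an online variant that tightens bounds iteratively, but the edit costs being minimized are the same):

def levenshtein(a, b):
    # dist[i][j] is the minimum number of insertions, deletions, and
    # substitutions needed to turn a[:i] into b[:j].
    dist = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        dist[i][0] = i
    for j in range(len(b) + 1):
        dist[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,        # delete a[i-1]
                             dist[i][j - 1] + 1,        # insert b[j-1]
                             dist[i - 1][j - 1] + cost) # match or substitute
    return dist[len(a)][len(b)]

print(levenshtein([1, 2, 3, 4], [2, 3, 4, 5]))  # 2: delete the 1, append a 5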

Dicts are matched by solving the minimum weight matching problem on the complete bipartite graph from key/value pairs in the source dict to key/value pairs in the destination dict.
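
The same idea in miniature, using SciPy’s assignment solver on the example files from earlier (our illustration; the toy pair_cost below stands in for the recursive sub-tree edit cost a real implementation would compute):

import numpy as np
from scipy.optimize import linear_sum_assignment

src = list({"foo": [1, 2, 3, 4], "bar": "testing"}.items())
dst = list({"foo": [2, 3, 4, 5], "zab": "testing", "woo": ["foobar"]}.items())

def pair_cost(a, b):
    # Toy cost: character differences between the keys plus a flat penalty
    # if the values differ at all.
    key_cost = sum(x != y for x, y in zip(a[0], b[0])) + abs(len(a[0]) - len(b[0]))
    return key_cost + (0 if a[1] == b[1] else 1)

cost = np.array([[pair_cost(s, d) for d in dst] for s in src])
rows, cols = linear_sum_assignment(cost)  # minimum-weight matching
for r, c in zip(rows, cols):
    print(src[r][0], "->", dst[c][0], "cost", cost[r][c])
# "foo" maps to "foo" and "bar" maps to "zab"; "woo" is left over as an insertion.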

Graphtage is a command line utility, but it can just as easily be used as a library. One can interact with Graphtage directly from Python, and extend it to support new file formats and edit types.

Next up for Graphtage

We think Graphtage is pretty nifty. You can also use Graphtage in conjunction with our PolyFile tool to semantically diff arbitrary file formats, even if they aren’t naturally tree-based. Try it, and let us know how you use it.

We also plan to extend Graphtage to work on abstract syntax trees, which will allow your source code diffs to tell you things like which variables were changed and whether code blocks were reordered. If you have a similarly nifty idea for a new feature, please share it with us!

Note: This tool was partially developed with funding from the Defense Advanced Research Projects Agency (DARPA) on the SafeDocs project. The views, opinions, and/or findings expressed are those of the authors and should not be interpreted as representing the official views or policies of the Department of Defense or the U.S. Government.

Using Echidna to test a smart contract library

In this post, we’ll show you how to test your smart contracts with the Echidna fuzzer. In particular, you’ll see how to:

  • Find a bug we discovered during the Set Protocol audit using a variation of differential fuzzing, and
  • Specify and check useful properties for your own smart contract libraries.

And we’ll demonstrate how to do all of this using crytic.io, which provides a GitHub integration and additional security checks.

Libraries may import risk

Finding bugs in individual smart contracts is critically important: A contract may manage significant economic resources, whether in the form of tokens or Ether, and damages from vulnerabilities may be measured in millions of dollars. Arguably, though, there is code on the Ethereum blockchain that’s even more important than any individual contract: library code.

Libraries are potentially shared by many high-value contracts, so a subtle unknown bug in, say, SafeMath, could allow an attacker to exploit not just one, but many critical contracts. The criticality of such infrastructure code is well understood outside of blockchain contexts—bugs in widely used libraries like TLS or sqlite are contagious, infecting potentially all code that relies on the vulnerable library.

Library testing often focuses on detecting memory safety vulnerabilities. On the blockchain, however, we’re not so worried about avoiding stack smashes or a memcpy from a region containing private keys; we’re worried most about the semantic correctness of the library code. Smart contracts operate in a financial world where “code is law,” and if a library computes incorrect results under some circumstances, that “legal loophole” may propagate to a calling contract, and allow an attacker to make the contract behave badly.

Such loopholes may have consequences other than incorrect results: If an attacker can force library code to unexpectedly revert, they hold the key to a potential denial-of-service attack. And if the attacker can make a library function enter a runaway loop, they can combine denial of service with costly gas consumption.

That’s the essence of a bug Trail of Bits discovered in an old version of a library for managing arrays of addresses, as described in this audit of the Set Protocol code.

The faulty code looks like this:

/**
* Returns whether or not there's a duplicate. Runs in O(n^2).
* @param A Array to search
* @return Returns true if duplicate, false otherwise
*/
function hasDuplicate(address[] memory A) returns (bool) {
    for (uint256 i = 0; i < A.length - 1; i++) {
        for (uint256 j = i + 1; j < A.length; j++) {
            if (A[i] == A[j]) {
                return true;
            }
        }
    }
    return false;
}

The problem is that if A.length is 0 (A is empty), then A.length - 1 underflows, and the outer (i) loop iterates over the entire set of uint256 values. The inner (j) loop, in this case, doesn’t execute, so we have a tight loop doing nothing for (basically) forever. Of course this process will always run out of gas, and the transaction that makes the hasDuplicate call will fail. If an attacker can produce an empty array in the right place, then a contract that (for example) enforces some invariant over an address array using hasDuplicate can be disabled—possibly permanently.
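
You can check the underflow for yourself by modeling the unchecked uint256 arithmetic of the Solidity versions this library targets (pre-0.8 arithmetic wraps rather than reverting):

# A.length - 1 when A.length == 0, with wrapping 256-bit arithmetic: the loop
# bound becomes the largest possible uint256.
length = 0
bound = (length - 1) % 2**256
print(bound)  # 115792089237316195423570985008687907853269984665640564039457584007913129639935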

The library

For specifics, see the code for our example, and check out this tutorial on using Echidna.

At a high level, the library provides convenient functions for managing an array of addresses. A typical use case involves access control using a whitelist of addresses. AddressArrayUtils.sol has 19 functions to test:

function indexOf(address[] memory A, address a)
function contains(address[] memory A, address a)
function indexOfFromEnd(address[] A, address a)
function extend(address[] memory A, address[] memory B)
function append(address[] memory A, address a)
function sExtend(address[] storage A, address[] storage B)
function intersect(address[] memory A, address[] memory B)
function union(address[] memory A, address[] memory B)
function unionB(address[] memory A, address[] memory B)
function difference(address[] memory A, address[] memory B)
function sReverse(address[] storage A)
function pop(address[] memory A, uint256 index)
function remove(address[] memory A, address a)
function sPop(address[] storage A, uint256 index)
function sPopCheap(address[] storage A, uint256 index)
function sRemoveCheap(address[] storage A, address a)
function hasDuplicate(address[] memory A)
function isEqual(address[] memory A, address[] memory B)
function argGet(address[] memory A, uint256[] memory indexArray)

It seems like a lot, but many of the functions are similar in effect, since AddressArrayUtils provides both functional versions (operating on memory array parameters) and mutating versions (requiring storage arrays) of extend, reverse, pop, and remove. You can see how once we’ve written a test for pop, writing a test for sPop probably won’t be too difficult.

Property-based fuzzing 101

Our job is to take the functions we’re interested in—here, all of them—and:

  • Figure out what each function does, then
  • Write a test that makes sure the function does it!

One way to do this is to write a lot of unit tests, of course, but this is problematic. If we want to thoroughly test the library, it’s going to be a lot of work, and, frankly, we’re probably going to do a bad job. Are we sure we can think of every corner case? Even if we try to cover all the source code, bugs that involve missing source code, like the hasDuplicate bug, can easily be missed.

We want to use property-based testing to specify the general behavior over all possible inputs, and then generate lots of inputs. Writing a general description of behavior is harder than writing any individual concrete “given inputs X, the function should do/return Y” test. But the work to write all the concrete tests needed would be exorbitant. Most importantly, even admirably well-done manual unit tests don’t find the kind of weird edge-case bugs attackers are looking for.
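
If property-based testing is new to you, here is the shape of it outside of Solidity: a short Python sketch using Hypothesis, which plays the role Echidna plays below by generating many inputs and shrinking any failure to a minimal example (the has_duplicate function here is just a toy stand-in).

from hypothesis import given, strategies as st

def has_duplicate(xs):
    return len(set(xs)) != len(xs)

@given(st.lists(st.integers()))
def test_has_duplicate_agrees_with_counting(xs):
    # Differential check: compare against an independent way of answering
    # the same question.
    expected = any(xs.count(x) > 1 for x in xs)
    assert has_duplicate(xs) == expected

test_has_duplicate_agrees_with_counting()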

The Echidna test harness: hasDuplicate

The most obvious thing about the code to test the library is that it’s bigger than the library itself! That’s not uncommon in a case like this. Don’t let that daunt you; unlike a library, a test harness can be approached as a work in progress, slowly improved and expanded, and still do its job. Test development is inherently incremental, and even small efforts provide considerable benefit if you have a tool like Echidna to amplify your investment.

For a concrete example, let’s look at the hasDuplicate bug. We want to check that:

  • If there is a duplicate, hasDuplicate reports it, and
  • If there isn’t a duplicate, hasDuplicate reports that there isn’t one.

We could just re-implement hasDuplicate itself, but this doesn’t help much in general (here, it might let us find the bug). If we had another, independently developed, high-quality address array utility library, we could compare it, an approach called differential testing. Unfortunately, we don’t often have such a reference library.

Our approach here is to apply a weaker version of differential testing by looking for another function in the library that can detect duplicates without calling hasDuplicate. For this, we’ll use indexOf and indexOfFromEnd to check if the index of an item (starting from 0) is the same as that when a search is performed from the end of the array:

  // Property: hasDuplicate must agree with a duplicate check built from
  // indexOf/indexOfFromEnd. The surrounding declarations are reconstructed
  // here; see the full code in the addressarrayutils demo.
  function crytic_hasDuplicate() public view returns (bool) {
    bool hasDup = false;
    uint256 i1;
    uint256 i2;
    bool b;
    for (uint i = 0; i < addrs1.length; i++) {
      (i1, b) = AddressArrayUtils.indexOf(addrs1, addrs1[i]);
      (i2, b) = AddressArrayUtils.indexOfFromEnd(addrs1, addrs1[i]);
      if (i1 != (i2 - 1)) { // -1 because the fromEnd return is off by one
        hasDup = true;
      }
    }
    return hasDup == AddressArrayUtils.hasDuplicate(addrs1);
  }

See the full example code in our addressarrayutils demo

This code iterates through addrs1 and finds the index of the first appearance of each element.  If there are no duplicates, of course, this will always just be i itself. The code then finds the index of the last appearance of the element (i.e., from the end). If those two indices are different, there is a duplicate. In Echidna, properties are just Boolean Solidity functions that usually return true if the property is satisfied (we’ll see the exception below), and fail if they either revert or return false. Now our hasDuplicate test is testing both hasDuplicate and the two indexOf functions. If they don’t agree, Echidna will tell us.

Now we just need to add a couple of functions, to be fuzzed by Echidna, that set up addrs1.
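
The exact shape of these entry points isn’t important. Here is a hypothetical sketch (the variable addr_to_add and the function add_addr are illustrative names; the real harness in the addressarrayutils demo may differ), in which set_addr records a fuzzer-chosen address and a second helper pushes it onto addrs1:

  // Hypothetical sketch of the fuzzing entry points; the real harness may differ.
  address addr_to_add;   // illustrative: a candidate address chosen by the fuzzer
  address[] addrs1;      // the array the properties operate on

  function set_addr(address a) public {
    addr_to_add = a;     // record the candidate; addrs1 itself is unchanged
  }

  function add_addr() public {
    addrs1.push(addr_to_add); // grow addrs1 with the recorded candidate
  }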

Let’s run this property on Crytic:

The property test for hasDuplicate fails in Crytic

First, crytic_hasDuplicate fails:

crytic_hasDuplicate: failed!
  Call sequence:
    set_addr(0x0)

The triggering transaction sequence is extremely simple: don’t add anything to addrs1, then call hasDuplicate on it. That’s it: the resulting runaway loop will exhaust your gas budget, and Crytic/Echidna will tell you the property failed. The 0x0 address argument appears because Echidna minimizes each failure to the simplest call sequence possible.

Our other properties (crytic_revert_remove and crytic_remove) pass, so that’s good. If we fix the bug in hasDuplicate then our tests will all pass:

All three property tests now pass in Crytic

The crytic_hasDuplicate: fuzzing (2928/10000) line tells us that because the hasDuplicate property is expensive to check (and no longer fails quickly), only about 3,000 of our maximum of 10,000 tests for each property were performed before we hit our five-minute timeout.
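
The fix itself can be small. Here is a minimal sketch, assuming the runaway loop comes from the unchecked A.length - 1 underflowing when the array is empty (the loop body below is illustrative, not the library’s exact source): handle the empty case before computing the loop bound.

  // Sketch of one possible fix: bail out before A.length - 1 is computed.
  // In pre-0.8 Solidity, A.length - 1 wraps around when A is empty,
  // producing a huge loop bound and a runaway loop.
  function hasDuplicate(address[] memory A) internal pure returns (bool) {
    if (A.length == 0) {
      return false; // an empty array trivially has no duplicates
    }
    for (uint256 i = 0; i < A.length - 1; i++) {
      for (uint256 j = i + 1; j < A.length; j++) {
        if (A[i] == A[j]) {
          return true;
        }
      }
    }
    return false;
  }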

The Echidna test harness: The rest of the library

Now we’ve seen one example of a test, here are some basic suggestions for building the rest of the tests (as we’ve done for the addressarrayutils_demo repository):

  • Try different ways of computing the same thing. The more “differential” versions of a function you have, the more likely you are to find out if one of them is wrong. For example, look at all the ways we cross-check indexOf, contains, and indexOfFromEnd.
  • Test for revert. If you add the prefix _revert_ before your property name, as we do here, the property only passes if all calls to it revert. This ensures code fails when it is supposed to fail (a sketch of this style, along with a simple invariant check, follows this list).
  • Don’t forget to check obvious simple invariants, e.g., that the diff of an array with itself is always empty (ourEqual(AddressArrayUtils.difference(addrs1, addrs1), empty)).
  • Invariant checks and preconditions in other testing can also serve as a cross-check on tested functions. Note that hasDuplicate is called in many tests that aren’t meant to check hasDuplicate at all; it’s just that knowing an array is duplicate-free can establish additional invariants of many other behaviors, e.g., after removing address X at any position, the array will no longer contain X.
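
To make the revert-style and invariant-style properties concrete, here is a hypothetical sketch. The property names are illustrative, ourEqual is the harness’s array-comparison helper mentioned above, and we assume remove reverts when the address isn’t present; the real harness in addressarrayutils_demo may differ.

  // Revert-style property: with the revert_ prefix, the property passes
  // only if every call to it reverts. Here we assume remove() reverts
  // when asked to remove an address that isn't in the array.
  function crytic_revert_remove_missing() public view returns (bool) {
    address[] memory empty = new address[](0);
    AddressArrayUtils.remove(empty, address(0x1));
    return true; // unreachable if remove() reverts as expected
  }

  // Invariant-style property: the difference of an array with itself
  // should always be empty, no matter what the fuzzer put in addrs1.
  function crytic_self_difference_is_empty() public view returns (bool) {
    address[] memory empty = new address[](0);
    return ourEqual(AddressArrayUtils.difference(addrs1, addrs1), empty);
  }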

Getting up and running with Crytic

You can run Echidna tests on your own by downloading and installing the tool or using our Docker build. But the Crytic platform integrates Echidna property-based testing, Slither static analysis (including new analyzers not available in the public version of Slither), upgradability checks, and your own unit tests in a seamless environment tied to your version control. The addressarrayutils_demo repository also shows everything you need for property-based testing: it can be as simple as creating a minimal Truffle setup, adding a crytic.sol file with the Echidna properties, and turning on property-based tests in your repository configuration in Crytic.

Sign up for Crytic today, and if you have questions, join our Slack channel (#crytic) or follow @CryticCI on Twitter.