Protecting Software Against Exploitation with DARPA’s CFAR

Today, we’re going to talk about a hard problem that we are working on as part of DARPA’s Cyber Fault-Tolerant Attack Recovery (CFAR) program: automatically protecting software from 0-day exploits, memory corruption, and many currently undiscovered bugs. You might be thinking: “Why bother? Can’t I just compile my code with exploit mitigations like stack guard, CFG, or CFI?” These mitigations are wonderful, but require source code and modifications to the build process. In many situations it is impossible or impractical to change the build process or alter program source code. That’s why our solution for CFAR protects binary installations for which source isn’t available or editable.

CFAR is very intuitive and deceptively simple. The system runs multiple versions, or ‘variants,’ of the software in parallel, and uses comparisons between these variants to identify when one or more have diverged from the others in behavior. The idea is akin to an intrusion detection system that compares program behavior against variants of itself running on identical input, instead of against a model of past behavior. When the system detects behavioral divergence, it can infer that something unusual, and possibly malicious, has happened.

Like all DARPA programs, CFAR is a large and difficult research problem. We are only working on a small piece of it. We have coordinated this blog post with our teammates – Galois, Immunant, and UCI – each of whom has more details about their respective contributions to the CFAR project.

We are excited to talk about CFAR not just because it’s a hard and relevant problem, but because one of our tools, McSema, is a part of our team’s versatile LLVM-based solution. As a part of this post, we get to show examples of lesser-known McSema features, and explain why they were developed. Perhaps most exciting of all, we’re going to show how to use McSema and the UCI multicompiler to harden off-the-shelf binaries against exploitation.

Our CFAR Team

The overall goal of CFAR is to detect and recover from faults in existing software without impacting core functionality. Our team’s responsibility was to produce an optimal set of variants to mitigate and detect fault-inducing inputs. The other teams were responsible for the specialized execution environment, for red-teaming, and so on. Galois’s blog post on CFAR describes the program in greater detail.

The variants must behave identically to each other and to the original application, and we must present compelling evidence that this behavior will remain identical for all valid inputs. Our teammates have developed transformations and provided equivalence guarantees for programs with available source code. The team has devised a multicompiler-based solution for variant generation using the Clang/LLVM toolchain.

McSema’s Role

We have been working on generating program variants of binary-only software, because source code may be unavailable for proprietary or older applications. Our team’s source code based toolchain works at the LLVM intermediate representation (IR) level. Transforming and hardening programs at the IR level allows us to manipulate program structure without altering the program’s source code. Using McSema, we could translate binary-only programs to LLVM IR, and re-use the same components for both source-level and binary-only variant generation.

Accurately translating programs for CFAR required us to bridge the gap between machine-level semantics and program-level semantics. Machine-level semantics are the changes to processor and memory state caused by individual instructions. Program-level semantics (e.g., functions, variables, exceptions, and try/catch blocks) are more abstract concepts that represent program behavior. McSema was designed to be a translator for machine level semantics (the name “McSema” derives from “machine code semantics”). However, to accurately transform the variants required for CFAR, McSema would have to recover program semantics as well.

We are actively working to recover more and more program semantics, and many common use-cases are already supported. In the following section we’ll discuss how we handle two particularly important semantics: stack variables and global variables.

Stack Variables

The compiler can place the data backing function variables in one of several locations. The most common location for program variables is the stack, a region of memory specifically made for storing temporary information and easily accessible to the currently executing function. Variables that the compiler stores on the stack are called… stack variables!

int sum_of_squares(int a, int b) {
  int a2 = a * a;
  int b2 = b * b;
  return a2+b2;
}
Binary view of the sum_of_squares function
Figure 1: Stack variables for a simple function shown both at the source code level, and at the binary level. At the binary level, there is no concept of individual variables, just bytes in a large block of memory.

When attackers turn bugs into exploits, they often rely on stack variables being in a specific order. The multicompiler can mitigate this class of exploits by generating program variants, where no two variants have stack variables in the same order. We wanted to enable this stack variable shuffling for binaries, but there was a problem: there is no concept of stack variables at the machine code level (Figure 1). Instead, the stack is just a large contiguous block of memory. McSema faithfully models this behavior and treats the program stack as an indivisible blob. This, of course, makes it impossible to shuffle stack variables.

Stack Variable Recovery

The process of converting a block of memory that represents the stack into individual variables is called stack variable recovery. McSema implements stack variable recovery as a three-step process.

First, McSema identifies stack variable bounds during disassembly, via the disassembler’s (e.g., IDA Pro’s) heuristics and, where present, DWARF-based debugging information. There is prior research on identifying stack variable bounds without such hints, which we plan to utilize in the future. Second, McSema attempts to identify which instructions in the program reference which stack variable. Every reference must be accurately identified, or the resulting program will not function. Finally, McSema creates an LLVM-level variable for each recovered stack variable and rewrites instructions to reference these LLVM-level variables instead of the prior monolithic stack block.

Stack variable recovery works for many functions, but it isn’t perfect. McSema will default to the classic behavior of treating the stack as a monolithic block when it encounters functions with the following characteristics:

  • Varargs functions. Functions that use a variable number of arguments (like the common printf family of functions) have a variable sized stack frame. This variance makes it difficult to determine which instruction references which stack variable.
  • Indirect stack references. Compilers also rely on a predetermined layout of stack variables, and will generate code that accesses a variable via the address of an unrelated variable.
  • No stack-frame pointer. As an optimization, the stack-frame pointer can serve as a general purpose register. This optimization makes it difficult for us to detect possible indirect stack references.

Stack variable recovery is a part of the CFG recovery process, and is currently implemented in the IDAPython CFG recovery code (in collect_variable.py). It can be invoked via the --recover-stack-vars argument to mcsema-disass. For an example, see the code accompanying this blog post, which is described more in the Lifting and Diversifying a Binary section.

Global Variables

Global variables can be accessed by all functions in a program. Since these variables are not tied to a specific function, they are typically placed in a special section of the program binary (Figure 2). As with stack variables, the specific ordering of global variables can be exploited by attackers.

bool is_admin = false;
int set_admin(int uid) {
  is_admin = 0 == uid;
}
The set_admin function
The is_admin global variable
Figure 2: Global variables as seen at source code level and at the machine code level. Global variables are typically placed into a special section in the program (in this case, into .bss).

Like the stack, McSema treats each data section as a large block of memory. One major difference between stack and global variables is that McSema knows where global variables start, because they are referenced directly from multiple locations. Unfortunately that is not enough information to shuffle around the global variable layout. McSema also needs to know where every variable ends, which is harder. Currently we rely on DWARF debug information to identify global variable sizes, but look forward to implementing approaches that would work on binaries without DWARF information.
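
To make the DWARF dependency concrete, here is a rough sketch of how debug information exposes the name and size of each global variable. It uses the third-party pyelftools package rather than McSema's actual var_recovery.py, the binary path is hypothetical, and exact attribute handling varies by compiler:

from elftools.elf.elffile import ELFFile

def global_variable_sizes(path):
    """Collect (name, size_in_bytes) pairs for globals described by DWARF."""
    results = []
    with open(path, 'rb') as f:
        elf = ELFFile(f)
        if not elf.has_dwarf_info():
            return results
        for cu in elf.get_dwarf_info().iter_CUs():
            # Only the direct children of a compilation unit describe globals.
            for die in cu.get_top_DIE().iter_children():
                if die.tag != 'DW_TAG_variable' or 'DW_AT_name' not in die.attributes:
                    continue
                if 'DW_AT_type' not in die.attributes:
                    continue
                name = die.attributes['DW_AT_name'].value.decode()
                # Walk through typedef/const wrappers until a sized type appears.
                type_die = die.get_DIE_from_attribute('DW_AT_type')
                while type_die is not None and 'DW_AT_byte_size' not in type_die.attributes:
                    if 'DW_AT_type' in type_die.attributes:
                        type_die = type_die.get_DIE_from_attribute('DW_AT_type')
                    else:
                        type_die = None
                if type_die is not None:
                    results.append((name, type_die.attributes['DW_AT_byte_size'].value))
    return results

print(global_variable_sizes('./example'))  # e.g. [('is_admin', 1)]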

Currently, global variable recovery is implemented separately from normal CFG recovery (in var_recovery.py). That script creates an “empty” CFG, filled with only global variable definitions. The normal CFG recovery process will further populate the file with the real control flow graph, referencing the pre-populated global variables. We will show an example of using global variable recovery later.

Lifting and Diversifying A Binary

In the remainder of this blog post, we’ll refer to the process of generating new program variants via the multicompiler as ‘diversification.’ For this specific example, we will lift and diversify a simple C++ application that uses exception handling (including a catch-all clause) and global variables. While this is just a simple example, program semantics recovery is meant to work on large, real applications: our standard test program is the Apache2 web server.

First, let’s familiarize ourselves with the standard McSema workflow (i.e. without any diversification), which is to lift the example binary to LLVM IR, then compile that IR back down into a runnable program. To get started, please build and install McSema. We provide detailed instructions in the official McSema README.

Next, build and lift the program using the provided script (lift.sh). The script will need to be edited to match your McSema installation.

After running lift.sh, you should have two programs: example and example-lifted, along with some intermediate files.

The example program squares two numbers and passes the result to the set_admin function. If both numbers are 5, then the program throws the std::runtime_error exception. If both numbers are 0, then the global variable is_admin is set to true. Finally, if two numbers are not supplied to the program, then it throws std::out_of_range.

The four different cases can be demonstrated via the following program invocations:

$ ./example
Starting example program
Index out of range: Supply two arguments, please
$ ./example 0 0
Starting example program
You are now admin.
$ ./example 1 2
Starting example program
You are not admin.
$ ./example 5 5
Starting example program
Runtime error: Lucky number 5

We can see that example-lifted, the same program after being lifted and re-created by McSema, behaves identically:

$ ./example-lifted
Starting example program
Index out of range: Supply two arguments, please
$ ./example-lifted 0 0
Starting example program
You are now admin.
$ ./example-lifted 1 2
Starting example program
You are not admin.
$ ./example-lifted 5 5
Starting example program
Runtime error: Lucky number 5

Now, let's diversify the lifted example program. To start, install the multicompiler. Next, edit the lift.sh script to specify a path to your multicompiler installation.

It’s time to build the diversified version. Run the script with the diversify argument (./lift.sh diversify) to generate a diversified binary. The diversified example looks different at the binary level than the original (Figure 3), but has the same functionality:

$ ./example-diverse
Starting example program
Index out of range: Supply two arguments, please
$ ./example-diverse 0 0
Starting example program
You are now admin.
$ ./example-diverse 1 2
Starting example program
You are not admin.
$ ./example-diverse 5 5
Starting example program
Runtime error: Lucky number 5
Differences between the normal and diversified binaries
Figure 3: The normal lifted binary (left) and its diversified equivalent (right). Both binaries are functionally identical, but look different at the binary level. Binary diversification protects software by preventing certain classes of bugs from turning into exploits.

Open example-lifted and example-diverse in your favorite disassembler. Your binaries may not be identical to the ones in the screenshot, but they should be different from each other.

Let’s review what we did. It’s really quite amazing. We started by building a simple C++ program that used exceptions and global variables. Then we translated the program into LLVM bitcode, identified stack and global variables, and preserved exception-based control flow. We then transformed it using the multicompiler, and created a new, diversified binary with the same functionality as the original program.

While this was just a small example, this approach scales to much larger applications, and provides a means to rapidly create diversified programs, whether starting with source code or with a previous program binary.

Conclusion

We would first like to thank DARPA, without whom this work would not be possible, for providing ongoing funding for CFAR and other great research programs. We would also like to thank our teammates — Galois, Immunant, and UCI — for their hard work creating the multicompiler and its transformations, providing equivalence guarantees for variants, and making everything work together.

We are actively working to improve stack and global variable recovery in McSema. Not only will these higher-level semantics create more diversification and transformation opportunities, but they will also allow for smaller, leaner bitcode, faster re-compiled binaries, and more thorough analyses.

We believe there is a bright future for CFAR and similar technologies: the number of available cores per machine continues to increase, as does the need for secure computing. Many software packages can’t utilize these cores for performance, so it is only natural to use the spare cores for security. McSema, the multicompiler, and other CFAR technologies show how we can put these extra cores in service to stronger security guarantees.

If you think some of these technologies can be applied to your software, please contact us. We’d love to hear from you. To learn more about CFAR, the multicompiler, and other technologies developed under this program, please read our teammates’ blog posts at the Galois blog and the Immunant blog.

Disclaimer

The views, opinions and/or findings expressed are those of the author and should not be interpreted as representing the official views or policies of the Department of Defense or the U.S. Government.

Rattle – an Ethereum EVM binary analysis framework

Most smart contracts have no verified source code, but people still trust them to protect their cryptocurrency. What’s more, several large custodial smart contracts have had security incidents. The security of contracts that exist on the blockchain should be independently ascertainable.

Ethereum VM (EVM) Bytecode

Ethereum contracts are compiled to EVM bytecode – the instruction set of the Ethereum Virtual Machine. As blocks are mined, EVM bytecode is executed and its resulting state is encoded into the blockchain forever. Everyone has access to this compiled EVM code for every smart contract on the blockchain — but reviewing EVM bytecode directly isn't easy.

EVM is a RISC Harvard-architecture stack machine, which is fairly distinct in the world of computer architectures. EVM has around 200 instructions which push and pop values from a stack, occasionally performing specific actions on them (e.g. ADD takes two arguments off the stack, adds them together, and pushes the result back to the stack). If you’re familiar with reverse polish notation (RPN) calculators, then stack machines will appear similar. Stack machines are easy to implement but difficult to reverse-engineer. As a reverse-engineer, I have no registers, local variables, or arguments that I can label and track when looking at a stack machine.
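
To see why that's painful, it helps to look at how a stack machine executes even trivial arithmetic. The following interpreter is a deliberately simplified sketch that borrows EVM's opcode names; it ignores the real instruction set's word size, gas accounting, and the rest of its opcodes:

def run(program):
    """Execute a tiny EVM-like stack program and return the final stack."""
    stack = []
    for op, *args in program:
        if op == 'PUSH':
            stack.append(args[0])
        elif op == 'POP':
            stack.pop()
        elif op == 'DUP':        # duplicate the top of the stack
            stack.append(stack[-1])
        elif op == 'SWAP':       # swap the top two entries
            stack[-1], stack[-2] = stack[-2], stack[-1]
        elif op == 'ADD':        # pop two values, push their sum
            stack.append(stack.pop() + stack.pop())
        elif op == 'MUL':        # pop two values, push their product
            stack.append(stack.pop() * stack.pop())
    return stack

# (2 + 3) * 4, written the way a compiler might emit it
print(run([('PUSH', 2), ('PUSH', 3), ('ADD',), ('PUSH', 4), ('MUL',)]))  # [20]

Every intermediate value lives at an anonymous stack position, which is exactly what makes tracking data flow by hand so tedious.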

For these reasons, I created Rattle, a framework which turns the stack machine into an infinite-register SSA form.

Rattle

Rattle is an EVM binary static analysis framework designed to work on deployed smart contracts. Rattle takes EVM byte strings, uses a flow-sensitive analysis to recover the original control flow graph, lifts the control flow graph into an SSA/infinite register form, and optimizes the SSA – removing DUPs, SWAPs, PUSHs, and POPs. Converting the stack machine to SSA form removes 60%+ of EVM instructions and presents a much friendlier interface to those who wish to read the smart contracts they’re interacting with.
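
The intuition behind that reduction fits in a few lines: if you track a symbolic stack of value names instead of concrete values, the stack-shuffling opcodes contribute nothing and every remaining operation gets its own named result. This is only a conceptual sketch (reusing the tuple encoding from the interpreter above), not Rattle's implementation:

def to_register_form(program):
    """Rewrite a stack program into three-address statements with named values."""
    stack, statements, counter = [], [], 0
    def fresh():
        nonlocal counter
        counter += 1
        return '%%%d' % counter
    for op, *args in program:
        if op == 'PUSH':
            reg = fresh()
            statements.append('%s = %d' % (reg, args[0]))
            stack.append(reg)
        elif op == 'POP':
            stack.pop()                      # pure stack shuffling: no statement emitted
        elif op == 'DUP':
            stack.append(stack[-1])
        elif op == 'SWAP':
            stack[-1], stack[-2] = stack[-2], stack[-1]
        else:                                # binary operations such as ADD and MUL
            a, b = stack.pop(), stack.pop()
            reg = fresh()
            statements.append('%s = %s %s, %s' % (reg, op, a, b))
            stack.append(reg)
    return statements

print(to_register_form([('PUSH', 2), ('PUSH', 3), ('ADD',),
                        ('PUSH', 4), ('SWAP',), ('MUL',)]))
# ['%1 = 2', '%2 = 3', '%3 = ADD %2, %1', '%4 = 4', '%5 = MUL %3, %4']

The PUSH/DUP/SWAP/POP bookkeeping disappears entirely, which is why the lifted form is so much smaller and easier to read than the raw bytecode.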

Demo

As an example, we will analyze the infamous King of Ether contract.

First in Ethersplay, our Binary Ninja plug-in for analyzing Ethereum Smart Contracts:

Figure 1: The King of Ether contract as disassembled by Ethersplay

In Ethersplay, we can see there are 43 instructions and 5 basic blocks. The majority of the instructions are pure stack manipulation instructions (e.g. PUSH, DUP, SWAP, POP). Interspersed in the blocks are the interesting instructions (e.g. CALLVALUE, SLOAD, etc.).

Now, analyze the contract with Rattle and observe the output for the same function. We run Rattle with optimizations, so constants are folded and unneeded blocks are removed.

$ python3 rattle-cli.py --input inputs/kingofether/KingOfTheEtherThrone.bin -O

The Rattle CLI interface generates graphviz files for each function that it can identify and extract.

Figure 2: The King of Ether contract as recovered by Rattle

As you can see, Rattle optimized the numberOfMonarchs() function to only 12 instructions. Rattle eliminated 72% of the instructions, assigned registers you can track visually, and removed an entire basic block. What’s more, Rattle recovered the used storage location and the ABI of the function.

Rattle will help organizations and individuals study the contracts they’re interacting with and establish an informed degree of trust in the contracts’ security. If your contracts’ source code isn’t available or can’t be verified, then you should run Rattle.

Get Rattle on our GitHub and try it out for yourself.

Contract upgrade anti-patterns

A popular trend in smart contract design is to promote the development of upgradable contracts. At Trail of Bits, we have reviewed many upgradable contracts and believe that this trend is going in the wrong direction. Existing techniques to upgrade contracts have flaws, increase the complexity of the contract significantly, and ultimately introduce bugs. To highlight this point, we are releasing a previously unknown flaw in the Zeppelin contract upgrade strategy, one of the most common upgrade approaches.

In this article, we are going to detail our analysis of existing smart contract upgrade strategies, describe the weaknesses we have observed in practice, and provide recommendations for contracts that require upgrades. In a follow-up blog post, we will detail a method, contract migration, that achieves the same benefits with few of the downsides.

An overview of upgradable contracts

Two ‘families’ of patterns have emerged for upgradable smart contracts:

  • Data separation, where logic and data are kept in separate contracts. The logic contract owns and calls the data contract.
  • Delegatecall-based proxies, where logic and data are kept in separate contracts, also, but the data contract (the proxy) calls the logic contract through delegatecall.

The data separation pattern has the advantage of simplicity. It does not require the same low-level expertise as the delegatecall pattern. The delegatecall pattern has received lots of attention recently. Developers may be inclined to choose this solution because documentation and examples are easier to find.

Using either of these patterns comes at considerable risk, an aspect of this trend that has gone unacknowledged thus far.

Data separation pattern

The data separation pattern keeps logic and data in separate contracts. The logic contract, which owns the data contract, can be upgraded if required. The data contract is not meant to be upgraded. Only the owner can alter its content.

Figure 1: High-level overview of the data separation upgrade pattern

When considering this pattern, pay special attention to these two aspects: how to store data, and how to perform the upgrade.

Data storage strategy

If the variables needed across an upgrade will remain the same, you can use a simple design where the data contract holds these variables, with their getters and setters. Only the contract owner should be able to call the setters:

contract DataContract is Owner {
  uint public myVar;

  function setMyVar(uint new_value) onlyOwner public {
    myVar = new_value;
  }
}

Figure 2: Data storage example (using onlyOwner modifier)

You have to clearly identify the state variables required. This approach is suitable for ERC20 token-based contracts since they only require the storage of their balances.

If a future upgrade requires new persistent variables, they could be stored in a second data contract. You can split the data across separate contracts, but at the cost of additional logic contract calls and authorization. If you don’t intend to upgrade the contract frequently, the additional cost may be acceptable.

Nothing prevents the addition of state variables to the logic contract. These variables will not be kept during an upgrade, but can be useful for implementing the logic. If you want to keep them, you can migrate them to the new logic contract, too.

Key-value pair

A key-value pair system is an alternative to the simple data storage solution described above. It is more amenable to evolution but also more complex. For example, you can declare a mapping from a bytes32 key value to each base variable type:

contract DataContract is Owner {
  mapping(bytes32 => uint) uintStorage;

  function getUint(bytes32 key) view public returns(uint) {
    return uintStorage[key];
  }

  function setUint(bytes32 key, uint new_val) onlyOwner public {
    uintStorage[key] = new_val;
  }
}

Figure 3: Key-Value Storage Example (using onlyOwner modifier)

This solution is often called the Eternal Storage pattern.

How to perform the upgrade

This pattern offers several different strategies, depending on how the data are stored.

One of the simplest approaches is to transfer the ownership of the data contract to a new logic contract and then disable the original logic contract. To disable the previous logic contract, implement a pausable mechanism or set its pointer to 0x0 in the data contract.

Figure 4: Upgrade by deploying a new logic contract and disabling the old one

Another solution involves forwarding the calls from the original logic contract to the new version:

Figure 5: Upgrade by deploying a new logic contract and forwarding calls to it from the old one

This solution is useful if you want to allow users to call the first contract. However, it adds complexity; you have to maintain more contracts.

Finally, a more complex approach uses a third contract as an entry point, with a changeable pointer to the logic contract:

Figure 6: Upgrade by deploying a proxy contract that calls a new logic contract

A proxy contract provides the user with a constant entry point and a distinction of responsibilities that is clearer than the forwarding solution. However, it comes with additional gas costs.

Cardstack and Rocket-pool have detailed implementations of the data separation pattern.

Risks of the data separation pattern

The simplicity of the data separation pattern is more perceived than real. This pattern adds complexity to your code, and necessitates a more complex authorization schema. We have repeatedly seen clients deploy this pattern incorrectly. For example, one client’s implementation achieved the opposite effect, where a feature was impossible to upgrade because some of its logic was located in the data contract.

In our experience, developers also find the EternalStorage pattern challenging to apply consistently. We have seen developers storing their values as bytes32, then applying type conversion to retrieve the original values. This increased the complexity of the data model, and the likelihood of subtle flaws. Developers unfamiliar with complex data structures will make mistakes with this pattern.

Delegatecall-based proxy pattern

Like the data separation method, the proxy pattern splits a contract in two: one contract holding the logic and a proxy contract holding the data. What’s different? In this pattern, the proxy contract calls the logic contract through delegatecall, the reverse of the call direction used in data separation.

Figure 7: Visual representation of the proxy pattern

In this pattern, the user interacts with the proxy. The contract holding the logic can be updated. This solution requires mastering delegatecall to allow one contract to use code from another.

Let’s review how delegatecall works.

Background on delegatecall

delegatecall allows one contract to execute code from another contract while keeping the context of the caller, including its storage. A typical use-case of the delegatecall opcode is to implement libraries. For example:

pragma solidity ^0.4.24;

library Lib {

  struct Data { uint val; }

  function set(Data storage self, uint new_val) public {
    self.val = new_val;
  }
}

contract C {
  Lib.Data public myVal;

  function set(uint new_val) public {
    Lib.set(myVal, new_val);
  }
}

Figure 8: Library example based on delegatecall opcode

Here, two contracts will be deployed: Lib and C. A call to Lib in C will be done through delegatecall:

Figure 9: EVM opcodes of a call to Lib.set (Ethersplay output)

As a result, when Lib.set changes self.val, it changes the value stored in C’s myVal variable.

Solidity looks like Java or JavaScript, which are object-oriented languages. It’s familiar, but comes with the baggage of misconceptions and assumptions. In the following example, a programmer might assume that as long as two contract variables share the same name, then they will share the same storage, but this is not the case with Solidity.

pragma solidity ^0.4.24;

contract LogicContract {
  uint public a;

  function set(uint val) public {
    a = val;
  }
}

contract ProxyContract {
  address public contract_pointer;
  uint public a;

  constructor() public {
    contract_pointer = address(new LogicContract());
  }

  function set(uint val) public {
    // Note: the return value of delegatecall should be checked
    contract_pointer.delegatecall(bytes4(keccak256("set(uint256)")), val);
  }
}

Figure 10: Dangerous delegatecall usage

Figure 11 represents the code and the storage variables of both of the contracts at deployment:

Figure 11: Memory illustration of Figure 10

What happens when the delegatecall is executed? LogicContract.set will write to ProxyContract.contract_pointer instead of ProxyContract.a. This storage corruption happens because:

  • LogicContract.set is executed within the context of ProxyContract.
  • LogicContract knows about only one state variable, a. Any store to this variable will be written to the first storage slot (see the Layout of State Variables in Storage documentation).
  • The first storage slot of ProxyContract holds contract_pointer. As a result, LogicContract.set will overwrite the ProxyContract.contract_pointer variable instead of ProxyContract.a (see Figure 12).
  • At this point, the storage of ProxyContract has been corrupted.

If a had been the first variable declared in ProxyContract, the delegatecall would not have corrupted its storage.

Figure 12: LogicContract.set will write the first element in storage: ProxyContract.contract_pointer

Use delegatecall with caution, especially if the called contract has state variables declared.

Let’s review the different data-storage strategies based on delegatecall.

Data storage strategies

There are three approaches to separating data and logic when using the proxy pattern:

  • Inherited storage, which uses Solidity inheritance to ensure that the caller and the callee have the same memory layout.
  • Eternal storage, which is the key-value storage version of the logic separation that we saw above.
  • Unstructured storage, which is the only strategy that does not suffer from potential memory corruption due to an incorrect memory layout. It relies on inline assembly code and custom memory management on storage variables.

See ZeppelinOS for a more thorough review of these approaches.

How to perform an upgrade

To upgrade the code, the proxy contract needs to point to a new logic contract. The previous logic contract is then discarded.

Risks of delegatecall

In our experience with clients, we have found that it is difficult to apply the delegatecall-based proxy pattern correctly. The proxy pattern requires that memory layouts stay consistent between contract and compiler upgrades. A developer unfamiliar with EVM internals can easily introduce critical bugs during an upgrade.

Only one approach, unstructured storage, overcomes the memory layout requirement but it requires low-level memory handling, which is difficult to implement and review. Due to its high complexity, unstructured storage is only meant to store state variables that are critical for the upgradability of the contract, such as the pointer to the logic contract. Further, this approach hinders static analysis of Solidity (for example, by Slither), costing the contract the guarantees provided by these tools.

Preventing memory layout corruption with automated tools is an ongoing area of research. No existing tool can verify that an upgrade is safe against a compromise. Upgrades with delegatecall will lack automated safety guarantees.

Breaking the proxy pattern

To wit, we have discovered and are now disclosing a previously unknown security issue in the Zeppelin proxy pattern, rooted in the complex semantics of delegatecall. It affects all the Zeppelin implementations that we have investigated. This issue highlights the complexity of using a low-level Solidity mechanism and illustrates the likelihood that an implementation of this pattern will have flaws.

What is the bug?

The Zeppelin Proxy contract does not check for the contract’s existence prior to returning. As a result, the proxy contract may return success to a failed call and result in incorrect behavior should the result of the call be required for application logic.

Low-level calls, including assembly, lack the protections offered by high-level Solidity calls. In particular, low-level calls will not check that the called account has code. The Solidity documentation warns:

The low-level call, delegatecall and callcode will return success if the called account is non-existent, as part of the design of EVM. Existence must be checked prior to calling if desired.

If the destination of delegatecall has no code, then the call will successfully return. If the proxy is set incorrectly, or if the destination was destroyed, any call to the proxy will succeed but will not send back data.

A contract calling the proxy may change its own state under the assumption that its interactions are successful, even though they are not.

If the caller does not check the size of the data returned, which is the case for any contract compiled with Solidity 0.4.22 or earlier, then any call will succeed. The situation is slightly better for recently compiled contracts (Solidity 0.4.24 and up) thanks to the check on returndatasize. However, that check won’t protect calls that do not expect data in return.

ERC20 tokens are at considerable risk

Many ERC20 tokens have a known flaw that prevents the transfer functions from returning data. As a result, these contracts support a call to transfer which may return no data. In such a case, the lack of an existence check, as detailed above, may lead a third party to believe that a token transfer was successful when it was not, and may lead to the theft of money.

Exploit scenario

Bob’s ERC20 smart contract is a proxy contract based on delegatecall. The proxy is incorrectly set due to human error, a flaw in the code, or a malicious actor. Any call to the token will act as a successful call with no data returned.

Alice’s exchange handles ERC20 tokens that do not return data on transfer. Eve has no tokens. Eve calls the deposit function of Alice’s exchange for 10,000 tokens, which calls transferFrom of Bob’s token. The call is a success. Alice’s exchange credits Eve with 10,000 tokens. Eve sells the tokens and receives ether for free.

How to avoid this flaw

During an upgrade, check that the new logic contract has code. One solution is to use the extcodesize opcode. Alternatively, you can check for the existence of the target each time delegatecall is used.

There are tools that can help. For instance, Manticore is capable of reviewing your smart contract code to verify that a contract’s existence is checked before any calls are made to it. This check was designed to help mitigate risky proxy contract upgrades.

Recommendations

If you must design a smart contract upgrade solution, use the simplest solution possible for your situation.

In all cases, avoid the use of inline assembly and low-level calls. The proper use of this functionality requires extreme familiarity with the semantics of delegatecall, and the internals of Solidity and EVM. Few teams whose code we’ve reviewed get this right.

Data separation recommendations

If you need to store data, opt for the simple data storage strategy over key-value pairs (aka Eternal Storage). This method requires writing less code and depends on fewer moving parts. There is simply less that can go wrong.

Use the contract-discard solution to perform upgrades. Avoid the forwarding solution, since it requires building forwarding logic that may be too complex to implement correctly. Only use the proxy solution if you need a fixed address.

Proxy pattern recommendations

Check for the destination contract’s existence prior to calling delegatecall. Solidity will not perform this check on your behalf. Neglecting the check may lead to unintended behavior and security issues. You are responsible for these checks if relying upon low-level functionality.

If you are using the proxy pattern, you must:

  • Have a detailed understanding of Ethereum internals, including the precise mechanics of delegatecall and detailed knowledge of Solidity and EVM internals.
  • Carefully consider the order of inheritance, as it impacts the memory layout.
  • Carefully consider the order in which variables are declared. For example, variable shadowing, or even type changes (as noted below) can impact the programmer’s intent when interacting with delegatecall.
  • Be aware that the compiler may use padding and/or pack variables together. For example, if two consecutive uint256 are changed to two uint8, the compiler can store the two variables in one slot instead of two.
  • Confirm that the variables’ memory layout is respected if a different version of solc is used or if different optimizations are enabled. Different versions of solc compute storage offsets in different ways. The storage order of variables may impact gas costs, memory layout, and thus the result of delegatecall.
  • Carefully consider the contract’s initialization. Depending on the proxy variant, state variables may not be initializable during construction. As a result, there is a potential race condition during initialization that needs to be mitigated.
  • Carefully consider the names of functions in the proxy to avoid function-selector collisions. A proxy function whose selector (derived from the Keccak hash of its signature) collides with that of an intended logic-contract function will be called instead, which could lead to unpredictable or malicious behavior.

Concluding remarks

We strongly advise against the use of these patterns for upgradable smart contracts. Both strategies have the potential for flaws, significantly increase complexity, and introduce bugs, ultimately decreasing trust in your smart contract. Strive for simple, immutable, and secure contracts rather than importing a significant amount of code to postpone feature and security issues.

Further, security engineers that review smart contracts should not recommend complex, poorly understood, and potentially insecure upgrade mechanisms. Ethereum security community, consider the risk prior to endorsing these techniques.

In a follow-up blog post, we will describe contract migration, our recommended approach to achieve the benefits of upgradable smart contracts without their downsides. A contract migration strategy is essential in case of private key compromise, and helpful in avoiding the need for other upgrades.

In the meantime, you should contact us if you’re concerned that your upgrade strategy may be insecure.

Introducing windows-acl: working with ACLs in Rust

Access Control Lists (ACLs) are an integral part of the Microsoft Windows security model. In addition to controlling access to secured resources, they are also used in sandboxing, event auditing, and specifying mandatory integrity levels. They are also exceedingly painful to programmatically manipulate, especially in Rust. Today, help has arrived — we released windows-acl, a Rust crate that simplifies the manipulation of access control lists on Windows.

A short background on Windows ACLs

Windows has two types of ACLs: discretionary (DACL) and system (SACL). Every securable object in Windows (such as files, registry keys, events, etc.) has an associated security descriptor with a DACL and SACL.

Security Descriptor Diagram

The difference between DACLs and SACLs lies in the types of entries they hold. DACLs are used to control an entity’s access to a resource. For instance, to deny a user read access to a registry key, the registry key’s DACL needs to contain an access-denied entry for that user. SACLs are used for managing the types of actions required to generate an audit event and for setting mandatory integrity labels on resources. As an example, to audit access failures by a user group on a specific file, that file’s SACL must contain an audit entry for the user group.

Working with ACLs today

Adding and removing ACL entries from an existing ACL requires the creation of a new ACL. The removal process is fairly simple — copy all of the existing ACL entries over except the target to be deleted. The insertion process is a bit more difficult for discretionary ACLs. The access rights that a DACL grants a user depend on the ordering of the ACEs in the DACL. Correctly inserting a new access control entry (ACE) at the preferred location is the responsibility of the developer. Furthermore, the new DACL must ensure that the to-be-inserted ACE does not conflict with existing entries. For instance, if the existing DACL has an access-allowed entry for a user with read/write privileges and the to-be-added ACE is an access-allowed entry for the same user with only read privileges, the existing entry must not be copied into the new DACL. No one wants to deal with this complexity, especially in Rust.

How windows-acl simplifies the task

Let’s take a look at appjaillauncher-rs for a demonstration of windows-acl’s simplicity. Last year, I ported the original C++ AppJailLauncher to Rust. appjaillauncher-rs sandboxes Windows applications using AppContainers. For sandboxing to work correctly, I needed to modify ACLs on specific resources — it was an arduous task without the help of Active Template Library classes (see CDacl and CSacl), the .NET framework, or PowerShell. I ended up with a solution that was both imperfect and complicated. After implementing windows-acl, I went back and updated appjaillauncher-rs to use windows-acl. With windows-acl, I have a modular library that handles the difficulties of Windows ACL programming. It provides an interface that simplifies the act of adding and removing DACL and SACL entries.

For example, adding a DACL access allow entry requires the following code:

match ACL::from_file_path(string_path, false) {
    Ok(mut acl) => {
        let sid = string_to_sid(string_sid).unwrap_or(Vec::new());
        if sid.capacity() == 0 {
            return false;
        }
        acl.remove(
            sid.as_ptr() as PSID, 
            Some(AceType::AccessAllow), None
        ).unwrap_or(0);
        if !acl.allow(sid.as_ptr() as PSID, true, mask).unwrap_or_else(|code| {
               false
           }) {
            return false;
        }
    },
    ...
}

Similarly for removing a DACL access allow entry:

match ACL::from_file_path(string_path, false) {
    Ok(mut acl) => {
        let sid = string_to_sid(string_sid).unwrap_or(Vec::new());
        if sid.capacity() == 0 {
            return false;
        }
        let result = acl.remove(
                         sid.as_ptr() as PSID, 
                         Some(AceType::AccessAllow), 
                         None);
        if result.is_err() {
            return false;
        }
    },
    ...
}

Potential Applications and Future Work with windows-acl

The windows-acl crate opens up new possibilities for writing Windows security tools in Rust. The ability to manipulate SACLs allows us to harness the full power of the Windows Event auditing engine. The Windows Event auditing engine is instrumental in providing information to detect endpoint compromise. We hope this work contributes to the Windows Rust developers community and inspires the creation of more Rust-based security tools!

Get an open-source security multiplier

An increasing number of organizations and companies (including the federal government) rely on open-source projects in their security operations architecture, secure development tools, and beyond.

Open-source solutions offer numerous advantages to development-savvy teams ready to take ownership of their security challenges. Teams can implement them to provide foundational capabilities, like “process logs” or “access machine state,” swiftly; no need to wait for purchasing approval. They can build custom components on top of open-source code to fit their company’s needs perfectly. Furthermore, open-source solutions are transparent, ‘return’ great value for dollars spent (since investment makes the tool better rather than paying for a license), and receive maintenance from a community of fellow users.

What’s the catch?

Open source can potentially mean faster, easier, and better solutions, but there’s one thing missing: Expert engineering support.

How do you pick the right tool for your needs? How do you know this open-source solution works? That it’s safe? What do you do when no one fixes reported bugs? What do you do when you need a new killer feature but no one in your company can build it? What happens when breaking changes are introduced and you need to upgrade?

How we can help

We’re security researchers who specialize in evaluating the security of code and building security solutions. If you want to leverage open-source security technology, we can help you pick the right tool, clean up technical debt, build new necessary features, and maintain it for you. Our developers fix bugs, update out-of-date components or practices, harden vulnerable code, and, when necessary, can re-engineer things completely. We provide the way forward through all your show-stopping issues.

How can I ensure this open-source project is right for my needs? We can find open-source solutions that best suit the needs of your organization. We can polish up whatever needs fixing, build new essential features, and then maintain your final solution.

What if I need a new feature? Whether it’s porting software to a new OS, integrating the tool with others in your stack, or leveraging an existing tool for a new use case, our team members are experts at building security solutions. Once built, we work with the open source project teams to merge features into public repos so the features are continuously maintained through updates.

How can I fix known bugs? We can fix them and work through technical debt to prevent more bugs in the future.

What if breaking changes are introduced in open-source dependencies? We can re-engineer solutions to maintain functionality.

Is the open-source software safe? We can review it for security best practices, ensure there’s no malicious code, and harden the code to minimize risk to your company.

How do I know this open-source solution works? We can review its code, understand the mechanics of how it works, and test edge cases to confirm consistent intended behavior. If we find anything broken or poorly-engineered, we can fix it. If the project is truly beyond repair, we can completely re-engineer the system to work for your organization.

How we do it

We’ve begun hosting support groups for companies leveraging open-source technology in three areas:

  • Security Operations: this team supports technology that keeps company fleets and users safe from network-level attacks.
  • Secure Development: this team automates software testing, hardens software, and builds security into modern development practices.
  • Core Infrastructure: this team improves essential core infrastructure that requires maintenance or re-engineering to mitigate newly discovered attacks.

For each group, we help clients pick the best open-source project to suit their needs, update and fix needed functionality, build custom features to perfectly fit client requirements, and maintain the technology so that it’s effective and safe.

Take a look at some examples of our Security Operations group’s success stories:

Facebook’s osquery

Endpoint monitoring tool that transforms your fleets’ system data into a queryable database.

Successes so far:

AirBnB’s StreamAlert

A serverless, real-time data analysis framework which empowers users to ingest, analyze, and set up alerts on data from any environment, using customized data sources and alerting logic.

Successes so far:

Google Santa

A binary whitelisting/blacklisting system for macOS that helps administrators track naughty or nice binaries.

Successes so far:

Google Omaha + CrystalNix Omaha Server

The open-source version of Google Update. Developers can use it to install requested software and keep it up to date. (This set of enhancements is being publicly released soon)

Successes so far:

  • Created scripts to simplify the build process.
  • Simplified the process of rebranding Omaha to work with a custom Omaha server.
  • Updated the CrystalNix Omaha server to support the latest Google Omaha client with SHA256.

Where we’d like to help next

There’s still so much room for improvement both in the security posture of companies that leverage open-source tooling and the technologies’ capabilities. In security operations, we can leverage promising open-source technology for memory forensics, user account security, binary analysis, and secret management. We can enhance software testing capabilities in fuzzing, symbolic execution, and binary lifting. We can wipe out industry-wide risks by re-engineering, improving, and maintaining widely-adopted tools for forensics, package management, and document parsing.

How can we help you?

How can we help make open-source security solutions work better for you? Let us know!

Fault Analysis on RSA Signing

Aditi Gupta

This spring and summer, as an intern at Trail of Bits, I researched modeling fault attacks on RSA signatures. I looked at an optimization of RSA signing that uses the Chinese Remainder Theorem (CRT) and induced calculation faults that reveal private keys. I analyzed fault attacks at a low level rather than in a mathematical context. After analyzing both a toy program and the mbed TLS implementation of RSA, I identified bits in memory that leak private keys when flipped.

The Signature Process with RSA-CRT

Normally, an RSA signing operation would use this algorithm: s = m^d (mod n). Here, s represents the signature, m the message, d the private exponent, and n the public modulus. This algorithm is effective, but when the numbers involved increase to the necessary size for security, the computation begins to take an extremely long time. For this reason, many cryptography libraries use the Chinese Remainder Theorem (CRT) to speed up decryption and signing. The CRT splits up the single large calculation into two smaller ones before stitching their results together.

Given the private exponent d, we calculate two values, dp and dq, as: dp=d (mod (p-1)) and dq=d (mod (q-1)).

We then compute two partial signatures, each using one of these two numbers; the first partial signature, s1, equals m^dp (mod p), while the second, s2, equals m^dq (mod q). The inverses of p (mod q) and q (mod p), called pInv and qInv respectively, are calculated, and finally, the two partial signatures are combined to form the final signature, s, with s = (s1*q*qInv) + (s2*p*pInv) (mod n).

The Fault Attack

The problem arises when one of the two partial signatures (let’s assume it’s s2, calculated using q) is incorrect. It happens: a random hardware error or a deliberately induced fault can corrupt the computation.

Combining the two partial signatures will give us a faulty final signature. If the signature were correct, we would be able to verify it by comparing the original message to s^e (mod n), where e is the public exponent. However, with the faulted signature, s^e (mod p) will still equal m, but s^e (mod q) will not.

From here, we can say that p, but not q, is a factor of s^e - m. Because p is also a factor of n itself, the attacker can take the greatest common divisor of n and s^e - m to extract p. n divided by p is simply q, and the attacker now has both secret primes, which is everything needed to reconstruct the private key.
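
The entire attack fits in a few lines of Python. This sketch uses tiny textbook parameters (p = 61, q = 53, e = 17, nothing close to real key sizes), corrupts the s2 half of the CRT computation, and recovers p with a single gcd; it needs Python 3.8+ for the modular-inverse form of pow:

from math import gcd

p, q, e = 61, 53, 17                     # toy parameters, far too small for real use
n = p * q
d = pow(e, -1, (p - 1) * (q - 1))        # private exponent
dp, dq = d % (p - 1), d % (q - 1)
qinv, pinv = pow(q, -1, p), pow(p, -1, q)

def crt_sign(m, fault=False):
    s1 = pow(m, dp, p)
    s2 = pow(m, dq, q)
    if fault:
        s2 ^= 1                          # a single flipped bit in the s2 computation
    return (s1 * q * qinv + s2 * p * pinv) % n

m = 42
good, bad = crt_sign(m), crt_sign(m, fault=True)
assert pow(good, e, n) == m              # the correct signature verifies
assert pow(bad, e, n) != m               # the faulted one does not

# p (but not q) divides bad^e - m, so a gcd with n extracts it.
recovered_p = gcd(n, (pow(bad, e, n) - m) % n)
recovered_q = n // recovered_p
print(recovered_p, recovered_q)          # 61 53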

Faulting a Toy Program

I began by writing a simple toy program in C to conduct RSA signing using the Chinese Remainder Theorem. This program included no padding and no checks, using textbook RSA to sign fairly small numbers. I used a debugger to modify one of the partial signatures manually and produce a faulted final signature. I wrote a program in Python to use this faulted signature to calculate the private keys and successfully decrypt another encrypted message. I tried altering data at various different stages of the signing process to see whether I could still extract the private keys. When I felt comfortable carrying out these fault attacks by hand, I began to automate the process.

Flipping Bits with Manticore

I used Binary Ninja to view the disassembly of my program and identify the memory locations of the data that I was interested in, so that when I tried to solve for the private keys, I would know where to look. Then, I installed and learned how to use Manticore, the binary analysis tool developed by Trail of Bits with which I was going to conduct the fault attacks.

I wrote a Manticore script that would iterate through each consecutive byte of memory, alter an instruction by flipping a bit in that byte, and execute the RSA signing program. For each execution that did not result in a crash or a timeout, I used the output to try to extract the private keys. I checked them against the correct keys by attempting to successfully decrypt another message. With all of this data, I generated a CSV file of the intermediate and final results from each bit flip, including the partial signatures, the private keys, and whether the private keys were accurate.

Fig. 1: Excerpt from code to find faultable addresses in toy program
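
Conceptually, the search loop looks like the sketch below. To stay self-contained it patches a copy of the binary on disk and re-runs it with subprocess instead of driving Manticore, and the file-offset range, program arguments, and recover_private_key helper are hypothetical placeholders:

import csv
import os
import subprocess

def run_patched(original, offset, bit):
    """Flip one bit at a file offset in a copy of the binary and run it."""
    data = bytearray(open(original, 'rb').read())
    data[offset] ^= 1 << bit
    patched = original + '.patched'
    with open(patched, 'wb') as f:
        f.write(data)
    os.chmod(patched, 0o755)
    try:
        proc = subprocess.run([patched, '42'], capture_output=True, timeout=5)
    except subprocess.TimeoutExpired:
        return None                      # this flip hung the program
    if proc.returncode != 0:
        return None                      # this flip crashed the program
    return proc.stdout

def recover_private_key(output):
    """Placeholder: parse the (possibly faulted) signature and run the gcd attack."""
    ...

rows = []
for offset in range(0x700, 0x900):       # hypothetical range covering the CRT code
    for bit in range(8):
        output = run_patched('./toy_rsa', offset, bit)
        key = recover_private_key(output) if output else None
        rows.append((hex(offset), bit, key is not None))

with open('bitflips.csv', 'w', newline='') as f:
    csv.writer(f).writerows(rows)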

Results

I tested a total of 938 bit flips, and I found that 45 of them, or 4.8%, successfully produced the correct private keys. Nearly 55% resulted in either a crash or a timeout, meaning that the program failed to create a signature. Approximately 31% did not alter the partial signatures.

Fig. 2: Output of analysis code

Fig. 3. Bit flip results for toy program

This kind of automation offers a massive speedup in developing exploits for vulnerabilities like this, as once you simply describe the vulnerability to Manticore, you get back a comprehensive list of ways to exploit it. This is particularly useful if you’re able to introduce some imprecise fault (e.g. using Rowhammer) as you can find clusters of bits which, when flipped, leak a private key.

Faulting mbed TLS

Once I had the file of bit flip results for my toy program, I looked for a real cryptographic library to attack. I settled on mbed TLS, an implementation that is primarily used on embedded systems. Because it was much more complex than the program I had written, I spent some time looking at the mbed TLS source code to try to understand the RSA signing process before compiling it and looking at the disassembled binary using Binary Ninja.

One key difference between mbed TLS and my toy program was that signatures using mbed TLS were padded. The fault attacks I was trying to model are applicable only to deterministic padding, in which a given message will always result in the same padded value, and not to probabilistic schemes. Although mbed TLS can implement a variety of different padding schemes, I looked at RSA signing using PKCS#1 v1.5, a deterministic alternative to the more complex, randomized PSS padding scheme. Again, I used a debugger to locate the target data. When I knew what memory locations I would be reading from, I began to fault one of the partial signatures and produce an incorrect signature.

I soon realized, however, that there were some runtime checks in place to prevent a fault attack of the type I was trying to conduct. In particular, two of the checks, if failed, would stop execution and output an error message without creating the signature. I used the debugger to skip over the checks and produce the faulted signature I was looking for.

With the faulted signature and all of the public key data, I was able to replicate the process I had used on my toy program to extract the private keys successfully.

Automating the Attacks

Just as I had with the toy program, I started to try to automate the fault attacks and identify the bit flips that would leak the private keys. In order to speed up the process, I wrote a GDB script instead of using Manticore. I found bit flips that would allow me to bypass both of the checks that would normally prevent the creation of a faulted signature. I used GDB to alter both of those instructions in memory. In a process identical to the toy program, I also flipped one bit in a given memory address. I then used Python to loop through each byte of memory, call this script, and try to extract the private keys, again checking whether they were correct by attempting to decrypt a known message. I collected the solved private keys and wrote the results to a CSV file of all the bit flips.

Fig. 4: Excerpt from code to find faultable locations in mbed TLS

Fig. 5: Excerpt from GDB script called from Python code to induce faults in mbed TLS
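
In outline, the Python driver can shell out to gdb in batch mode. Everything below is a hypothetical placeholder rather than the actual script: the program name, the breakpoint address, the bytes poked to neutralize the two checks, and the address range under test would all come from the real disassembly:

import subprocess

def run_with_fault(addr, bit):
    """Run the signing program under gdb and flip one bit once it is loaded."""
    cmds = [
        'break *0x400a10',               # placeholder: start of the CRT signing routine
        'run message.bin out.sig',
        # Stand-ins for the two bit flips that defeat mbed TLS's runtime checks.
        'set {unsigned char}0x400b44 = {unsigned char}0x400b44 ^ 0x01',
        'set {unsigned char}0x400c88 = {unsigned char}0x400c88 ^ 0x01',
        # The bit flip under test.
        'set {unsigned char}%#x = {unsigned char}%#x ^ %#x' % (addr, addr, 1 << bit),
        'continue',
    ]
    argv = ['gdb', '-batch', '-nx', './rsa_sign_demo']
    for c in cmds:
        argv += ['-ex', c]
    return subprocess.run(argv, capture_output=True, timeout=30).stdout

for addr in range(0x400c00, 0x400d00):   # placeholder range covering the signing code
    for bit in range(8):
        try:
            output = run_with_fault(addr, bit)
        except subprocess.TimeoutExpired:
            continue
        # ...feed the resulting signature into the same gcd-based key recovery as before...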

Results

I tested 566 bit flips, all within the portion of the mbed TLS code that carried out the signing operation. Combined with the two bit flips that ensured that the checks would pass, I found that 28 of them – nearly 5% – leaked the private keys. About 55% failed to produce a signature.

Fig. 6. Bit flip results for mbed TLS

The fact that this kind of analysis works on real programs is exciting, but unfortunately, I ran out of time in the summer before I got a chance to test it in the “real world.” Nonetheless, the ability to input real TLS code and get a comprehensive description of fault attacks against it is exciting, and yields fascinating possibilities for future research.

Conclusion

I loved working at Trail of Bits. I gained a better understanding of cryptography, and became familiar with some of the tools used by security engineers. It was a wonderful experience, and I’m excited to apply everything I learned to my classes and projects at Carnegie Mellon University when I start next year.


You could have invented that Bluetooth attack

A serious Bluetooth bug has received quite a bit of attention lately. It’s a great find by Biham and Neumann. Given BLE’s popularity in the patch-averse IoT world, the bug has serious implications. And yet, it’s remarkably clean and simple. Unlike many elliptic curve bugs, an average human can totally understand the bug and how it can be exploited. It’s a cool application of a conceptually approachable attack.

This post describes the bug, how to exploit it, and how that specifically happened with the Bluetooth protocol. But first, let’s take a crash course in elliptic curves and invalid curve point attacks.

What is an elliptic curve?

It's quite a misnomer. In cryptography, an “elliptic curve” is neither elliptical nor a continuous curve. Instead, it's a collection of (x, y) coordinates, where x and y are between 0 and n, such that y² = x³ + ax + b mod n, plus a bonus “point at infinity,” which counterintuitively behaves much like zero: if you add it to anything, you get the same point back. Fig. 1 shows such a curve, where y² = x³ - x, and x and y range from 0 to 71. (Technically, x and y are defined over a finite field, but all finite fields of the same order are isomorphic, so our definition is equivalent.)

Fig. 1: An elliptic curve (as used in cryptography)

Elliptic curves have a few useful properties:

  • You can add elliptic curve points to one another with rules that look a lot like regular addition: P + Q = Q + P, P + (Q + R) = (P + Q) + R, etc. This is because points on the curve form an Abelian group. You can see a visualization of this addition in fig. 2, below.
  • A point can be multiplied by some natural number (call it n) by adding it to itself n times.
  • For each point (x, y), there's an inverse, (x, -y), such that a point plus its inverse is the point at infinity.
  • Most notably, if you pick a, b, and n in the equation above well, multiplying by a regular number is a trapdoor function: given a point P and a number n, computing a point Q such that Q = n * P is very easy, but given a point P and a point Q, finding n such that Q = n * P is extremely hard. This simple hardness assumption lets us construct a host of cryptographic algorithms.

Fig. 2: Addition, and the point at infinity (note: these illustrations use a continuous curve for visualization, but that’s still not what we’re working with)
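To make the addition and multiplication rules concrete, here is a minimal Python sketch over the toy curve from fig. 1 (y² = x³ - x mod 71). It is an illustration, not a constant-time or otherwise hardened implementation:

# Toy curve from fig. 1: y^2 = x^3 - x over the integers mod 71.
P_MOD, A = 71, -1
INF = None  # the point at infinity

def ec_add(p, q):
    """Add two points using the standard chord-and-tangent rules."""
    if p is INF: return q
    if q is INF: return p
    (x1, y1), (x2, y2) = p, q
    if x1 == x2 and (y1 + y2) % P_MOD == 0:
        return INF                                        # p + (-p) = point at infinity
    if p == q:
        s = (3 * x1 * x1 + A) * pow(2 * y1, -1, P_MOD) % P_MOD   # tangent slope
    else:
        s = (y2 - y1) * pow(x2 - x1, -1, P_MOD) % P_MOD          # chord slope
    x3 = (s * s - x1 - x2) % P_MOD
    return (x3, (s * (x1 - x3) - y1) % P_MOD)

def ec_mul(n, p):
    """Multiply a point by a natural number via repeated doubling."""
    result, addend = INF, p
    while n:
        if n & 1:
            result = ec_add(result, addend)
        addend = ec_add(addend, addend)
        n >>= 1
    return result

Note that ec_mul(2, (0, 0)) returns the point at infinity: (0, 0) lies on this curve and is its own inverse, which is exactly the kind of point the attack below abuses.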

Usually, elliptic curve algorithms are written around a specific curve. Parties exchange points on the curve and scalars, and do computations like we’ve defined above using them. However, problems arise when these algorithms are exposed to (x, y) pairs that don’t satisfy the curve equation, and these can be exploited to perform what’s called an “invalid curve point attack.”

What is an invalid curve point attack?

Remember, a point is a pair (x, y) such that y² = x³ + ax + b mod n for some a, b, and n. However, if that equation doesn't hold, (x, y) is an invalid curve point. Quite a few cryptographic algorithms ask a user to provide a curve point and, by design, assume the point is valid and the equation holds. Failing to verify that received curve points are on the curve before doing math with them isn't too far from violating the cryptographic doom principle and has similar consequences.

In elliptic curve schemes, the secret is usually a regular number (remember, finding n such that Q = n * P is the hard problem). When an attacker can send an unvalidated point that’s multiplied by n and see the results, they can exploit that lack of validation to learn the secret key. Frequently, attackers can pick points that belong to a different curve than the algorithm specifies, one with very few points on it.

For instance, an attacker might pick a point not on the curve with a y coordinate of zero. Any point like this must, when added to itself, be equal to the point at infinity. We know this because we calculate inverses by multiplying the y coordinate by negative one, but zero times negative one is zero, and so any point with a y coordinate of zero is its own inverse. Thus when we add it to itself, the result must be the point at infinity.

If the attacker picks a point like this, then by viewing the result of multiplying it by the secret, they learn whether the secret is even or odd! The chosen point and the point at infinity form a two-element subgroup, so the result must be one of the two. If the result is the point the attacker sent, the secret is odd; if it's the point at infinity, the secret is even. We can submit similar points that become the point at infinity when multiplied by three to learn the secret mod 3, points that become the point at infinity when multiplied by five to learn the secret mod 5, and so on until we can just use the Chinese Remainder Theorem to compute the actual secret.
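As a sketch of that last step, assume a hypothetical secret_mod oracle that stands in for the small-subgroup queries described above; the Chinese Remainder Theorem then stitches the residues back into the full secret:

def secret_mod(r):
    """Stand-in for the real query: submit a point of order r, observe the result,
    and deduce the victim's secret modulo r."""
    return SECRET % r

def recover_secret(small_primes):
    """Combine the residues with the Chinese Remainder Theorem."""
    x, modulus = 0, 1
    for r in small_primes:
        a = secret_mod(r)
        # Solve for x' with x' = x (mod modulus) and x' = a (mod r).
        t = (a - x) * pow(modulus, -1, r) % r
        x += modulus * t
        modulus *= r
    return x

SECRET = 477_362_483   # unknown to the attacker in reality
primes = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31]
assert recover_secret(primes) == SECRET   # works once the primes' product exceeds the secret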

This isn’t the only way to use invalid curve points to find out a secret key, but it’s perhaps the easiest to understand, and has been used in real life to break TLS libraries from Oracle and Google.

While the recent attack on Bluetooth wasn’t quite as simple as the above example, it was conceptually very similar, and used the same technique of small subgroup confinement to achieve key disclosure.

How did the Bluetooth attack work?

The Bluetooth protocol uses elliptic curve Diffie-Hellman to agree on a shared secret key for encryption. Using Diffie-Hellman algorithms correctly is super hard; subtle mistakes can compromise the security of your entire system. In this case, the protocol did in fact require curve point validation, but only for the x coordinate. That's an innocent-enough-looking mistake to go unnoticed for a decade, and it lets a smart attacker break the whole system.

Elliptic curve Diffie-Hellman is a simple algorithm. Suppose Alice and Bob want to agree upon some secret. Both parties have a secret number, and there’s a commonly agreed-upon “base point” on some elliptic curve.

Alice sends Bob the base point multiplied by her secret number, and Bob sends Alice the base point multiplied by his. Then, Alice multiplies Bob's message by her secret number, and Bob multiplies Alice's message by his secret number. If we call the point P and the secrets a and b, Alice sends Bob P * a, and Bob sends Alice P * b. Alice then calculates (P * b) * a (the shared secret) and Bob calculates (P * a) * b (the same point).
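Reusing the toy ec_mul helper from the earlier sketch, with a hypothetical base point (5, 7) on the same tiny curve, the whole exchange fits in a few lines:

BASE = (5, 7)   # lies on y^2 = x^3 - x mod 71; a stand-in for a real curve's base point

alice_secret, bob_secret = 13, 29

alice_public = ec_mul(alice_secret, BASE)   # Alice -> Bob
bob_public = ec_mul(bob_secret, BASE)       # Bob -> Alice

shared_alice = ec_mul(alice_secret, bob_public)
shared_bob = ec_mul(bob_secret, alice_public)
assert shared_alice == shared_bob           # both sides derive the same point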

Even if an attacker, Chuck, can see either message, he can't deduce a or b, and he can't combine P * a and P * b to get the secret, so he's totally out of luck! However, problems arise when Chuck can modify messages instead of just viewing them. Since the y coordinate isn't validated, Chuck can modify it to always be zero.

Remember, if a point has a y coordinate of zero and you multiply it by a random number, the result has a 50% chance of being the point at infinity, a 50% chance of being the original point, and no chance of being anything else.

Putting this all together: if Chuck replaces one of the points Alice and Bob send each other, (x, y), with (x, 0), then when Alice or Bob multiplies it by their secret key, there's a 50% chance they get the point at infinity and a 50% chance they get (x, 0). If Chuck replaces both intermediate messages, there's a 25% chance that Alice and Bob agree the secret key is the point at infinity, and no chance they agree on any other key. This means either ECDH fails and Alice and Bob have to retry (giving Chuck another chance), or Chuck knows the secret key and can read and insert messages.

How can you avoid bugs like this?

The most important takeaway from all of this isn’t anything about this particular attack; it’s that this whole class of attacks is totally preventable. A few more concrete pieces of advice:

  • Diffie-Hellman is an unusually dangerous algorithm to implement. You almost certainly don’t want to be using it in the first place (ref. Latacora’s excellent cryptographic right answers). If you find someone using it, you have a good chance of finding bugs. Seriously, almost no one needs this, and it’s extraordinarily hard to do right.
  • Use X25519 for ECDH and Ed25519 for signatures. On those curves any 32-byte string is a valid curve point; invalid curve points are thus impossible.
  • If your input can be invalid and you’re performing cryptographic operations with it, do the validation before anything else.

I've deliberately left out instructions for validating points yourself, since that's far too subtle a topic for the conclusion of a blog post. If you're interested in that sort of thing, you could do worse than this paper. If you'd like to practice exploiting these kinds of bugs (and some way cooler ones), you should check out Cryptopals set 8. If this is the kind of thing you do for fun already, get in touch. We'd love to work with you.

Optimizing Lifted Bitcode with Dead Store Elimination

Tim Alberdingk Thijm

As part of my Springternship at Trail of Bits, I created a series of data-flow-based optimizations that eliminate most “dead” stores that emulate writes to machine code registers in McSema-lifted programs. For example, applying my dead-store-elimination (DSE) passes to Apache httpd eliminated 117,059 stores, or 50% of the store operations to Remill’s register State structure. If you’re a regular McSema user, then pull the latest code to reap the benefits. DSE is now enabled by default.

Now, you might be thinking, “Back it up, Tim, isn't DSE a fundamental optimization that's already part of LLVM?” You would be right to ask this (and the answer is yes), because if you've used LLVM then you know that it has an excellent optimizer. However, despite LLVM's excellence, like any optimizer it can only cut instructions it knows to be unnecessary. The Remill dead code eliminator has the advantage of possessing higher-level information about the nature of lifted bitcode, which lets it be more aggressive than LLVM in performing its optimizations.

But every question answered just raises more questions! You might now be thinking, “LLVM only does safe optimizations. This DSE is more aggressive… How do we know it didn't break the lifted httpd program?” Fear not! The dead store elimination tool is specifically designed to perform a whole-program analysis on lifted bitcode that has already been optimized. This ensures that it can find dead instructions with the maximum possible context, avoiding mistaken assumptions that some code won't be used. The output is a fully-functioning httpd executable, minus a mountain of useless computation.

What Happens When We Lift

The backbone of Remill/McSema’s lifted bitcode is the State structure, which models the machine’s register state. Remill emulates reads and writes to registers by using LLVM load and store instructions that operate on pointers into the State structure. Here’s what Remill’s State structure might look like for a toy x86-like architecture with two registers: eax and ebx.

struct State {
  uint32_t eax;
  uint32_t ebx;
};

This would be represented in LLVM as follows:

%struct.State = type { i32, i32 }

Let’s say we’re looking at a few lines of machine code in this architecture:

mov eax, ebx
add eax, 10

A heavily-simplified version of the LLVM IR for this code might look like the following sketch (the exact names and types are illustrative):
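%eax_ptr = getelementptr inbounds %struct.State, %struct.State* %state, i32 0, i32 0
%ebx_ptr = getelementptr inbounds %struct.State, %struct.State* %state, i32 0, i32 1
%ebx_0 = load i32, i32* %ebx_ptr            ; mov eax, ebx
store i32 %ebx_0, i32* %eax_ptr
%eax_0 = load i32, i32* %eax_ptr            ; add eax, 10
%eax_1 = add i32 %eax_0, 10
store i32 %eax_1, i32* %eax_ptr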

The first two lines derive pointers to the memory backing the emulated eax and ebx registers (%eax_ptr and %ebx_ptr, respectively) from a pointer to the state (%state). This derivation is performed using the getelementptr instruction, and is equivalent to the C code &(state->eax) and &(state->ebx). The next two lines represent the mov instruction, where the emulated ebx register is read (load), and the value read is then written to (store) the emulated eax register. Finally, the last three lines represent the add instruction.

We can see that %ebx_0 is stored to %eax_ptr and then %eax_0 is loaded from %eax_ptr without any intervening stores to that pointer. This means that the load into %eax_0 is redundant. We can simply use %ebx_0 anywhere that %eax_0 is used, i.e. forward the store to the load.

Next, we might also notice that the store %ebx_0, %eax_ptr instruction isn’t particularly useful either, since store %eax_1, %eax_ptr happens before %eax_ptr is read from again. In fact, this is a dead store. Eliminating these kinds of dead stores is what my optimization focuses on!

This process will go on in real bitcode until nothing more can be forwarded or killed.
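Applied to the sketch above, forwarding plus elimination would leave something like this, with the redundant load gone and the first store killed:

%eax_ptr = getelementptr inbounds %struct.State, %struct.State* %state, i32 0, i32 0
%ebx_ptr = getelementptr inbounds %struct.State, %struct.State* %state, i32 0, i32 1
%ebx_0 = load i32, i32* %ebx_ptr
%eax_1 = add i32 %ebx_0, 10
store i32 %eax_1, i32* %eax_ptr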

So now that you have an understanding of how dead store elimination works, let’s explore how we could teach this technique to a computer.

As it turns out, each of the above steps is related to a data-flow analysis. To build our eliminator, we're going to want to figure out how to represent these decisions using data-flow techniques.

Building the Eliminator

With introductions out of the way, let’s get into how this dead code elimination is supposed to work.

Playing the Slots

The DSE pass needs to recognize loads/stores through %eax_ptr and %ebx_ptr as being different. The DSE pass does this by chopping up the State structure into “slots”, which roughly represent registers, with some small distinctions for cases where we bundle sequence types like arrays and vectors as one logical object. The slots for our simplified State structure are:
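Slot 0: bytes 0–3 of the State (the emulated eax)
Slot 1: bytes 4–7 of the State (the emulated ebx)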

After chopping up the State structure, the DSE pass tries to label instructions with the slot to which that instruction might refer. But how do we even do this labelling? I mentioned earlier that we have deeper knowledge about the nature of lifted bitcode, and here's where we get to use it. In lifted bitcode, the State structure is passed into every lifted function as an argument. Every load or store to an emulated register is therefore derived from this State pointer (e.g. via getelementptr, bitcast, etc.). Each such derivation results in a new pointer that is possibly offset from its base. Therefore, to determine the slot referenced by any given pointer, we need to calculate that pointer's offset and map the offset back to the slot. If it's a derived pointer, then we need to calculate the base pointer's offset. And if the base pointer is derived then… really, it's just offsets all the way down.

And They Were Slot-mates!

The case that interests us most is when two instructions get friendly and alias to the same slot. That’s all it takes for one instruction to kill another: in Remill, it’s the law of the jungle.

To identify instructions which alias, we use a ForwardAliasVisitor (FAV). The FAV keeps track of all the pointers into the state structure (and their offsets) and all the instructions that access the state structure, in two respective maps. As the name implies, it iterates forward through the instructions it's given, keeping a tally whenever it notices that one of the addresses it's tracking has been modified or used.

Here’s how this information is built up from our instructions:
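For the toy IR above, the information the FAV accumulates would end up looking roughly like this (a sketch; the real structures map LLVM values and instructions to byte offsets into the State):

offsets map:   %state → 0,   %eax_ptr → 0,   %ebx_ptr → 4
accesses map:  load  %ebx_0 from %ebx_ptr → offset 4 (slot 1)
               store %ebx_0 to   %eax_ptr → offset 0 (slot 0)
               load  %eax_0 from %eax_ptr → offset 0 (slot 0)
               store %eax_1 to   %eax_ptr → offset 0 (slot 0)
exclude set:   { %ebx_0, %eax_0, %eax_1 }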

Each time the FAV visits an instruction, it checks if updates need to be made to its maps.

The accesses map stores the instructions which access state offsets. We’ll use this map later to determine which load and store instructions could potentially alias. You can already see here that the offsets of three instructions are all the same: a clear sign that we can eliminate instructions later!

The offsets map ensures the accesses map can get the right information. Starting with the base %state pointer, the offsets map accumulates any pointers that may be referenced as the program runs. You can think of it as the address book which the loads and stores use to make calls to different parts of the state structure.

The third data structure shown here is the exclude set. This keeps track of all the other values instructions might refer to that we know shouldn’t contact the state structure. These would be the values read by load instructions, or pointers to alloca’d memory. In this example, you can also see that if a value is already in the offsets map or exclude set, any value produced from one such value will remain in the same set (e.g. %eax_1 is excluded since %eax_0 already was). You can think of the exclude set as the Do-Not-Call list to the offset map’s address book.

The FAV picks through the code and ensures that it’s able to visit every instruction of every function. Once it’s done, we can associate the relevant state slot to each load and store as LLVM metadata, and move on to the violent crescendo of the dead code eliminator: eliminating the dead instructions!

You’ll Be Stone Dead In a Moment

Now it’s time for us to pick through the aliasing instructions and see if any of them can be eliminated. We have a few techniques available to us, following a similar pattern as before. We’ll look through the instructions and determine their viability for elimination as a data-flow.

Sequentially, we run the ForwardingBlockVisitor to forward unnecessary loads and stores and then use the LiveSetBlockVisitor to choose which ones to eliminate. For the purpose of this post, however, we’ll cover these steps in reverse order to get a better sense of why they’re useful.

Live and Set Live

The LiveSetBlockVisitor (LSBV) has the illustrious job of inspecting each basic block of a module’s functions to determine the overall liveness of slots in the State. Briefly, live variable analysis allows the DSE to check if a store will be overwritten (“killed”) before a load accesses (“revives”) the slot. The LiveSet of LSBV is a bitset representing the liveness of each slot in the State structure: if a slot is live, the bit in the LiveSet corresponding to the slot’s index is set to 1.

The LSBV proceeds from the terminating blocks (blocks ending with ret instructions) of the function back to the entry block, keeping track of a live set for each block. This allows it to determine the live set of preceding blocks based on the liveness of their successors.

Here’s an example of how an LSBV pass proceeds. Starting from the terminating blocks, we iterate through the block’s instructions backwards and update its live set as we do. Once we’re finished, we add the block’s predecessors to our worklist and continue with them. After analyzing the entry block, we finish the pass. Any stores visited while a slot was already dead can be declared dead stores, which we can then remove.
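As a simplified sketch (not Remill's actual implementation), the worklist algorithm looks something like this, with each instruction reduced to a ("load" or "store", slot) pair:

from collections import deque

def find_dead_stores(blocks, preds, succs):
    """Backward live-slot analysis: a simplified sketch, not Remill's actual code.
    `blocks` maps a block name to its instructions, each a ("load"|"store", slot) pair;
    `preds` and `succs` map a block name to its predecessor/successor block names."""
    live_in = {b: set() for b in blocks}

    def scan(block, collect=None):
        # Live slots at the end of a block are whatever its successors need on entry.
        live = set().union(*(live_in[s] for s in succs[block]))
        for idx in range(len(blocks[block]) - 1, -1, -1):   # walk the instructions backwards
            kind, slot = blocks[block][idx]
            if kind == "store":
                if collect is not None and slot not in live:
                    collect.add((block, idx))   # slot already dead here: this store can go
                live.discard(slot)              # a store kills the slot
            else:
                live.add(slot)                  # a load revives the slot
        return live

    worklist = deque(blocks)                    # iterate to a fixpoint
    while worklist:
        block = worklist.popleft()
        live = scan(block)
        if live != live_in[block]:
            live_in[block] = live
            worklist.extend(preds[block])       # predecessors must be re-examined

    dead = set()
    for block in blocks:
        scan(block, collect=dead)               # second pass: record the dead stores
    return dead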

In order to avoid any undefined behaviour, the LSBV has a few generalizations in place. Some instructions, like resume or indirectbr, could cause uncertain changes to the block's live set; when they appear, the LSBV conservatively marks all slots as live. This provides a simple way of avoiding dangerous eliminations and an opportunity for future improvements.

Not To Be Forward, But…

Our work could end here with the LSBV, but there are still potential improvements we can make to the DSE. As mentioned earlier, we can “forward” some instructions by replacing unnecessary sequences of storing a value, loading that value, and using that value with a direct use of the value prior to the store. This is handled by the ForwardingBlockVisitor, another backward block visitor. Using the aliases gathered by the FAV, it iterates through the instructions of the block from back to front, keeping track of the upcoming loads to each slot of the State. If an earlier operation accesses the same slot, we can forward the value to cut down on the number of operations, as shown in the earlier elimination example.

Doing this step before the LSBV pass allows the LSBV to identify more dead instructions than before. Looking again at our example, we’ve now set up another store to be killed by the LSBV pass. This type of procedure allows us to remove more instructions than before by better exploiting our knowledge of when slots will be used next. Cascading eliminations this way is part of what allows DSE to remove so many instructions: if a store is removed, there may be more instructions rendered useless that can also be eliminated.

A DSE Diet Testimonial

Thanks to the slimming power of dead store elimination, we can make some impressive cuts to the number of instructions in our lifted code.

For an amd64 Apache httpd, we were able to generate the following report:

Candidate stores: 210,855
Dead stores: 117,059
Instructions removed from DSE: 273,322
Forwarded loads: 840
Forwarded stores: 2,222
Perfectly forwarded: 2,836
Forwarded by truncation: 215
Forwarded by casting: 11
Forwarded by reordering: 61
Could not forward: 1,558
Unanalyzed functions: 0

An additional feature of the DSE is the ability to generate DOT diagrams of the instructions removed. Currently, the DSE will produce three diagrams for each function visited, showing the offsets identified, the stores marked for removal, and the post-removal instructions.

DOT diagrams are produced that show eliminated instructions

Still Hungry for Optimizations?

While this may be the end of Tim’s work on the DSE for the time being, future improvements are already in the pipeline to make Remill/McSema’s lifted bitcode even leaner. Work will continue to handle cases that the DSE is currently not brave enough to take on, like sinking store instructions when a slot is only live down one branch, handling calls to other functions more precisely, and lifting live regions to allocas to benefit from LLVM’s mem2reg pass.

Think what Tim did was cool? Check out the “intern project” GitHub issue tags on McSema and Remill to get involved, talk to us in the #binary-lifting channel of the Empire Hacking Slack, or reach out to us via our careers page.

Tim is starting a PhD in programming language theory this September at Princeton University, where he will try his hand at following instructions, instead of eliminating them.

Trail of Bits donates $100,000 to support young researchers through SummerCon

We have a soft spot in our hearts for SummerCon. This event, the longest-running hacker conference in the US, is a great chance to host hacker friends from around the world in NYC, catch up in person, and learn about delightfully weird security topics. It draws a great crowd, ranging from “hackers to feds to convicted felons to concerned parents.”

The folks running SummerCon have pulled together an excellent line-up of high-quality talks time and again. However, this year there’s a bigtime issue: all the speakers are men.

We recognize the thanklessness of the job of hosting SummerCon and assume the best of intentions. Nonetheless, we were disappointed. This lineup isn't an exception among security conferences – it's close to the norm. Exclusion of women and minorities is endemic in the security industry, and we need to address it. The hacker conference that started them all should be at the forefront of the solution.

This year we’ll be working together to change that.

A grant for inclusion in security research

We are partnering with the SummerCon Foundation to create the Trail of Bits SummerCon Fellowship. This grant will provide $100,000 in funding for budding security researchers. At least 50% of the program spots will be reserved for minority and female-identifying candidates. The organization will reach out directly to women- and minority-serving groups at universities to encourage them to apply (shout out to @MaddieStone for that awesome idea!). Participants will receive grant funding, mentorship from Trail of Bits and the SummerCon Foundation, and an invitation to present their findings at SummerCon after their fellowship.

In addition to this program, SummerCon has committed to a greater level of transparency and representation in its future selection of speakers. They’ll publish well-defined criteria for their CFP. They will identify the SummerCon alumni who comprise their speaker-selection committee. Finally, they will expand the selection team to include 50% minorities and women.

Next, SummerCon has committed to making the conference a safe space of inclusion. They’ve announced and will enforce a clear anti-harassment policy with multiple points of contact for reporting disrespectful behavior. Violators will be kicked out.

Finally, in a small effort to bring more awareness to the change, we have a sweet bonus in store: Keep your eyes peeled for the Trail-of-Bits-sponsored ice cream flavor in a Van Leeuwen ice cream truck outside LittleField. For every scoop sold, we’ll be matching the sales with a donation to Girls Who Code.

Truck-1

Serving up tasty treats for a cause!

Does it fix the problem?

No. This is a small step. The issue of inclusion within security is much bigger than one small annual hacker meetup. Fortunately, everyone in the industry can help, including us. Even today, our growing team of 37 people has only four women, only two of whom are engineers. We must do better.

We’ve already taken some steps to improve:

Here’s what we’ll do this year:

  • Actively work with diversity- and inclusion-recruiting groups to get out of the cycle of predisposing our recruiting toward homogeneity
  • Continue to search for opportunities to volunteer and mentor with groups that support inclusion in tech and infosec
  • Reimburse employees for any tax expenses incurred for insurance of domestic partners

Get involved!

Want to participate as a SummerCon research fellow? Keep an eye on @trailofbits. We’ll be making a joint announcement with SummerCon soon.

Have other ideas about how to foster a more inclusive security environment? Contact us!

Announcing the Trail of Bits osquery support group

As great as it is, osquery could be a whole lot better. (Think write access for extensions, triggered responses upon detection, and even better performance, reliability and ease of use.)

Facebook’s small osquery team can’t respond to every request for enhancement. That’s understandable. They have their hands full with managing the osquery community, reviewing PRs, and ensuring the security of the world’s largest social network. It’s up to the community to move the osquery platform forward.

Good news: none of these feature requests are infeasible. The custom engineering is just uneconomical for individual organizations to bankroll.

We propose a strategy for osquery users to share the cost of development. Participating companies could pool resources and collectively target specific features. This would accelerate the deprecation of other full-suite tools that are more expensive, less flexible, and less transparent.

It’s the only way to make real progress quickly. Otherwise, projects rely solely on the charity and coordination of their contributors.

Can an open-source tool replace commercial solutions?

We think that open-source security solutions are inherently better. They're transparent. They're more flexible. Their costs are tied closely to the value you get, not just access. Finally, each time there's an investment in the tool, it increases the advantages for current users and increases the number of users who can access those advantages.

However, in order to compete with their commercial counterparts, open source projects need implementation support and development support. The former is basically the ability to “set it and forget it.” The latter ensures the absence of show-stopping bugs and the regular addition of new required features.

Companies like Kolide and Uptycs provide user-friendly support for deployment.

For development support, you can now hire us.

Announcing the Trail of Bits osquery support group

We're offering two 'flavors' of support plans: one for year-round assurance, the other for custom development.

12-month assurance plan

Think of this like an all-you-can-eat buffet for critical features and fixes. Any time you need a bug fixed or a feature added, just file a ticket with us. This option's great for root-causing and fixing issues, developing new tables and extensions, or redesigning parts of osquery's core. Basically, the stuff that is holding you back from cancelling those expensive monthly contracts with the proprietary vendors.

Bespoke development

This plan’s for you if you need one-off help with a big-time osquery change. Perhaps: ports to new platforms, non-core features, or forks.

Regardless of the plan you choose, you’ll get:

  • Access to a private Trail of Bits Slack channel for direct access to our engineers
  • The opportunity to participate in a bi-weekly iteration planning meeting for collaborative feature ideation, problem-solving, and feature prioritization
  • A private GitHub repository with issue tracker for visibility and influence over what features are worked on
  • Special access and support to our osquery extensions
  • Early access to all software increments

Whether you’re a long-time osquery user with a list of feature requests, or part of a team that has been holding out for osquery’s feature-parity with commercial tools, this may be the opportunity you’ve been waiting for. As a member, you’ll gain multiple benefits: confidence that there aren’t any show-stopping bugs; direct access to our team of world-class engineers, many of whom have been doing this exact work since we ported osquery to Windows; peace of mind that your internal engineers won’t spend any more time on issues with osquery; and the chance to drive osquery’s product direction while leaving the heavy lifting to us.

Want in? Let us know.