Microsoft didn’t sandbox Windows Defender, so I did

Microsoft exposed their users to a lot of risks when they released Windows Defender without a sandbox. This surprised me. Sandboxing is one of the most effective security-hardening techniques. Why did Microsoft sandbox other high-value attack surfaces such as the JIT code in Microsoft Edge, but leave Windows Defender undefended?

As a proof of concept, I sandboxed Windows Defender for them and, am now open sourcing my code as the Flying Sandbox Monster. The core of Flying Sandbox Monster is AppJailLauncher-rs, a Rust-based framework to contain untrustworthy apps in AppContainers. It also allows you to wrap the I/O of an application behind a TCP server, allowing the sandboxed application to run on a completely different machine, for an additional layer of isolation.

In this blog post, I describe the process and results of creating this tool, as well as thoughts about Rust on Windows.

Flying Sandbox Monster running Defender in a sandbox to scan a WannaCry binary.

The Plan

Windows Defender’s unencumbered access to its host machine and wide-scale acceptance of hazardous file formats make it an ideal target for malicious hackers. The core Windows Defender process, MsMpEng, runs as a service with SYSTEM privileges. The scanning component, MpEngine, supports parsing an astronomical number of file formats. It also bundles full-system emulators for various architectures and interpreters for various languages. All of this, performed with the highest level of privilege on a Windows system. Yikes.

This got me thinking. How difficult would it be to sandbox MpEngine with the same set of tools that I had used to sandbox challenges for the CTF community two years ago?

The first step towards a sandboxed Windows Defender is the ability to launch AppContainers. I wanted to re-use AppJailLauncher, but there was a problem. The original AppJailLauncher was written as a proof-of-concept example. If I had any sense back then, I would’ve written it in C++ Core rather than deal with the pains of memory management. Over the past two years, I’ve attempted rewriting it in C++ but ended up with false starts (why are dependencies always such a pain?).

But then inspiration struck. Why not rewrite the AppContainer launching code in Rust?

Building The Sandbox

A few months later, after crash coursing through Rust tutorials and writing a novel of example Rust code, I had the three pillars of support for launching AppContainers in Rust: SimpleDacl, Profile, and WinFFI.

  • SimpleDacl is a generalized class that handles adding and removing simple discretionary access control entries (ACE) on Windows. While SimpleDacl can target both files and directories, it has a few setbacks. First, it completely overwrites the existing ACL with a new ACL and converts inherited ACEs to “normal” ACEs. Also, it disregards any ACEs that it cannot parse (i.e. anything other than AccessAllowedAce and AccessDeniedAce. Note: we don’t support mandatory and audit access control entries.).
  • Profile implements creation of AppContainer profiles and processes. From the profile, we can obtain a SID that can be used to create ACE on resources the AppContainer needs to access.
  • WinFFI contains the brunt of the functions and structures winapi-rs didn’t implement as well as useful utility classes/functions. I made a strong effort to wrap every raw HANDLE and pointer in Rust objects to manage their lifetimes.

Next, I needed to understand how to interface with the scanning component of Windows Defender. Tavis Ormandy’s loadlibrary repository already offered an example C implementation and instructions for starting an MsMpEng scan. Porting the structures and function prototypes to Rust was a simple affair to automate, though I initially forgot about array fields and function pointers, which caused all sorts of issues; however, with Rust’s built-in testing functionality, I quickly resolved all my porting errors and had a minimum test case that would scan an EICAR test file.

The basic architecture of Flying Sandbox Monster.

Our proof-of-concept, Flying Sandbox Monster, consists of a sandbox wrapper and the Malware Protection Engine (MpEngine). The single executable has two modes: parent process and child process. The mode is determined by the presence of an environment variable that contains the HANDLEs for the file to be scanned and child/parent communication. The parent process populates these two HANDLE values prior to creating an AppContainer’d child process. The now-sandboxed child process loads the malware protection engine library and scans the input file for malicious software.

This was not enough to get the proof-of-concept working. The Malware Protection Engine refused to initialize inside an AppContainer. Initially, I thought this was an access control issue. After extensive differential debugging in ProcMon (comparing AppContainer vs non-AppContainer execution), I realized the issue might actually be with the detected Windows version. Tavis’s code always self-reported the Windows version as Windows XP. My code was reporting the real underlying operating system; Windows 10 in my case. Verification via WinDbg proved that this was indeed the one and only issue causing the initialization failures. I needed to lie to MpEngine about the underlying Windows version. When using C/C++, I would whip up a bit of function hooking code with Detours. Unfortunately, there was no equivalent function hooking library for Rust on Windows (the few hooking libraries available seemed a lot more “heavyweight” than what I needed). Naturally, I implemented a simple IAT hooking library in Rust (32-bit Windows PE only).

Introducing AppJailLauncher-rs

Since I had already implemented the core components of AppJailLauncher in Rust, why not just finish the job and wrap it all in a Rust TCP server? I did, and now I’m happy to announce “version 2” of AppJailLauncher, AppJailLauncher-rs.

AppJailLauncher was a TCP server that listened on a specified port and launched an AppContainer process for every accepted TCP connection. I tried not to reinvent the wheel, but mio, the lightweight IO library for Rust, just didn’t work out. First, mio’s TcpClient did not provide access to raw “socket HANDLEs” on Windows. Second, these raw “socket HANDLEs” were not inheritable by the child AppContainer process. Because of these issues, I had to introduce another “pillar” to support appjaillauncher-rs: TcpServer.

TcpServer is responsible for instantiating an asynchronous TCP server with a client socket that is compatible with STDIN/STDOUT/STDERR redirection. Sockets created by the socket call cannot redirect a process’s standard input/output streams. Properly working standard input/output redirection requires “native” sockets (as constructed via WSASocket). To allow the redirection, TcpServer creates these “native” sockets and does not explicitly disable inheritance on them.

My Experience with Rust

My overall experience with Rust was very positive, despite the minor setbacks. Let me describe some key features that really stood out during AppJailLauncher’s development.

Cargo. Dependency management with C++ on Windows is tedious and complex, especially when linking against third-party libraries. Rust neatly solves dependency management with the cargo package management system. Cargo has a wide breadth of packages that solve many common-place problems such as argument parsing (clap-rs), Windows FFI (winapi-rs et. al.), and handling wide strings (widestring).

Built-in Testing. Unit tests for C++ applications require a third-party library and laborious, manual effort. That’s why unit test are rarely written for smaller projects, like the original AppJailLauncher. In Rust, unit test capability is built into the cargo system and unit tests co-exist with core functionality.

The Macro System. Rust’s macro system works at the abstract syntax tree (AST) level, unlike the simple text substitution engine in C/C++. While there is a bit of a learning curve, Rust macros completely eliminate annoyances of C/C++ macros like naming and scope collisions.

Debugging. Debugging Rust on Windows just works. Rust generates WinDbg compatible debugging symbols (PDB files) that provide seamless source-level debugging.

Foreign Function Interface. The Windows API is written in, and meant to be called from, C/C++ code. Other languages, like Rust, must use a foreign function interface (FFI) to invoke Windows APIs. Rust’s FFI to Windows (the winapi-rs crate) is mostly complete. It has the core APIs, but it is missing some lesser used subsystems like access control list modification APIs.

Attributes. Setting attributes is very cumbersome because they only apply to the next line. Squashing specific code format warnings necessitates a sprinkling of attributes throughout the program code.

The Borrow Checker. The concept of ownership is how Rust achieves memory safety. Understanding how the borrow checker works was fraught with cryptic, unique errors and took hours of reading documentation and tutorials. In the end it was worth it: once it “clicked,” my Rust programming dramatically improved.

Vectors. In C++, std::vector can expose its backing buffer to other code. The original vector is still valid, even if the backing buffer is modified. This is not the case for Rust’s Vec. Rust’s Vec requires the formation of a new Vec object from the “raw parts” of the old Vec.

Option and Result types. Native option and result types should make error checking easier, but instead error checking just seems more verbose. It’s possible to pretend errors will never exist and just call unwrap, but that will lead to runtime failure when an Error (or None) is inevitably returned.

Owned Types and Slices. Owned types and their complementary slices (e.g. String/str, PathBuf/Path) took a bit of getting used to. They come in pairs, have similar names, but behave differently. In Rust, an owned type represents a growable, mutable object (typically a string). A slice is a view of an immutable character buffer (also typically a string).

The Future

The Rust ecosystem for Windows is still maturing. There is plenty of room for new Rust libraries to simplify development of secure software on Windows. I’ve implemented initial versions of a few Rust libraries for Windows sandboxing, PE parsing, and IAT hooking. It is my hope that these are useful to the nascent Rust on Windows community.

I used Rust and AppJailLauncher to sandbox Windows Defender, Microsoft’s flagship anti-virus product. My accomplishment is both great and a bit shameful: it’s great that Windows’ robust sandboxing mechanism is exposed to third-party software. It’s shameful that Microsoft hasn’t sandboxed Defender on its own accord. Microsoft bought what eventually became Windows Defender in 2004. Back in 2004 these bugs and design decisions would be unacceptable, but understandable. During the past 13 years Microsoft has developed a great security engineering organization, advanced fuzzing and program testing, and sandboxed critical parts of Internet Explorer. Somehow Windows Defender got stuck back in 2004. Rather than taking Project Zero’s approach to the problem by continually pointing out the symptoms of this inherent flaw, let’s bring Windows Defender back to the future.

An extra bit of analysis for Clemency

This year’s DEF CON CTF used a unique hardware architecture, cLEMENCy, and only released a specification and reference tooling for it 24 hours before the final event began. cLEMENCy was purposefully designed to break existing tools and make writing new ones harder. This presented a formidable challenge given the timeboxed competition occurs over a single weekend.

Ryan, Sophia, and I wrote and used a Binary Ninja processor module for cLEMENCy during the event. This helped our team analyze challenges with Binary Ninja’s graph view and dataflow analyses faster than if we’d relied on the limited disassembler and debugger provided by the organizers. We are releasing this processor module today in the interest of helping others who want to try out the challenges on their own.

Binary Ninja in action during the competition

cLEMENCy creates a more equitable playing field in CTFs by degrading the ability to use advanced tools, like Manticore or a Cyber Reasoning System. It accomplishes this with architectural features such as:

  • 9-bit bytes instead of 8-bits. This makes parsing the binary difficult. The byte length of the architecture of the system parsing a challenge does not match that in cLEMENCy. The start of a byte on both systems would only match every 9th byte.
  • It’s Middle Endian. Every other architecture stores values in memory in one of two ways: from most significant byte to least significant (Big Endian), or least significant to most significant (Little Endian). Rather than storing a value like 0x123456 as 12 34 56 or 56 34 12, Middle Endian stores it as 34 56 12.
  • Instructions have variable length opcodes. Instructions were anywhere from 18 to 54 bits, with opcodes being anywhere from 4 bits to 18 bits.

This required creativity in a short timespan. With only 24 hours’ head start, we needed to work fast if we wanted something usable before the end of the four-day competition. This would have been hard to do even with an amenable architecture. Here’s how we solved these problems to write and use a disassembler during the CTF:

  • We expanded each 9-bit byte to a 16-bit short. Originally, I wrote some fancy bit masking and shifting to accomplish this, but then Ryan dropped a very simple script that did the same thing using the bitstream module. This had the side effect of doubling all memory offsets but that was trivial to correct.
  • We made liberal use of slicing in Python. Our disassembler first converted the bytes to a string of bits, then rearranged them to match the representation in the reference document. After that, we took the path of speed of implementation rather than brevity to compare the exact number of bits per opcode to identify and parse them.
  • We made instructions more verbose. The Load and Store instructions iterated over a specified number of registers from a starting point, copying each from or into a memory location. Rather than displaying the starting register and count alone, we expanded the entire list, making it much easier to understand the effects of the instruction in the disassembly at a glance.

With an implemented processor module, we could view and interact with the challenges, define functions with automated analyses, and control how assembly instructions were represented.

We also tried to write an LLIL lifter. This was not possible. You could either have consistent register math or consistent memory addresses, but not both. The weird three-byte register widths and the doubled memory addresses were incompatible. All was not lost, since enough instructions were liftable to locate strings with the dataflow analysis.

Binary Ninja’s graph view allowed us to rapidly analyze control flow structures

If you’d like to get started with our Binja module, you can find our Architecture and BinaryView plugins, as well as a script to pack and unpack the challenges, on our Github.

LegitBS has open-sourced their cLEMENCy tools. The challenges will be available shortly. We look forward to seeing how other teams dealt with cLEMENCy!

UPDATE: The challenges are now available. PPPChris Eagle, and Lab RATS released their processor modules for cLEMENCy.

Magic with Manticore

Manticore is a next-generation binary analysis tool with a simple yet powerful API for symbolic execution, taint analysis, and instrumentation. Using Manticore one can identify ‘interesting’ code locations and deduce inputs that reach them. This can generate inputs for improved test coverage, or quickly lead execution to a vulnerability.

I used Manticore’s power to solve Magic, a challenge from this year’s DEFCON CTF qualifying round that consists of 200 unique binaries, each with a separate key. When the correct key is entered into each binary, it prints out a sum:

enter code:
==== The meds helped 
sum is 12

Reverse engineering 200 executables in order to extract strings one at a time takes a significant amount of time. This challenge necessitates automation. As CTFs feature more of these challenges, modern tools will be required to remain competitive.

We’ll be combining the powers of two such tools –Binary Ninja and Manticore– in three different solutions to showcase how you can apply them in your own work.

Challenge structure

The Magic binaries have a simple structure. There is a main function that prompts for the key, reads from stdin, runs the checker function, and then prints out the sum. The checker function loads bytes of the input string one at a time and calls a function to check each character. The character-checking functions do a comparison against a fixed character value. If it matches, the function returns a value to be summed, if it does not, the program exits.

Main, the checker function, and a single character checking function

Manticore’s API is very straight forward. We will use hooks to call functions when instructions are reached, the CPU class to access registers, and the solver. The workflow involves loading a binary by providing the path and adding analysis hooks on instructions in that binary. After that, you run Manticore. As the addresses are reached, your hooks are executed, and you can reason about the state of the program.

Functions defined as hooks take a single parameter: state. The state contains functionality to create symbolic values or buffers, solve for symbolic values, and abandon paths. It also contains a member, cpu, which holds the state of the registers, and allows the reading and writing of memory and registers.

Strategies

There are many ways to solve Magic. We’ll present three methods to demonstrate the flexibility of Manticore.

  1. A symbolic solution that hooks every instruction in order to discover where the character-checking functions are. When Manticore is at a character-checking function, it sets hooks to solve for the necessary value.
  2. A concrete solution that hooks the address of each character-checking function and simply reads the value from the opcodes.
  3. A symbolic solution that hooks the address of each character-checking function and solves for the value.

This is not an exhaustive list of the approaches you could take with Manticore. There is a saying, ‘there are many ways to skin a cat;’ Manticore is a cat-skinning machine.

Function addresses will be extracted using Binary Ninja. All strategies require an address for the terminating hook that prints out the solution. The latter two strategies need the addresses of the character-checking functions.

Address extraction with the Binary Ninja API

In order to extract the character-checking functions’ addresses, as well as the end_hook() address, we will be using Binary Ninja. Binary Ninja is a reverse engineering platform made for the fast-paced CTF environment. It’s user friendly and has powerful analysis features. We will use its API to locate the addresses we want. Loading the file in the Binary Ninja API is very straight forward.

bv = binja.BinaryViewType.get_view_of_file(path)

To reach the checker function, we first need the executable’s main function. We start by retrieving the entry block of the program’s entry function. We know the address of main is loaded in the 11th instruction of the LLIL. From that instruction we do a sanity check that it is a constant being loaded into RDI, then extract the constant (main’s address). Calling get_function_at() with main’s address gives the main function to be returned.


def get_main(bv):
    entry_fn = bv.entry_function
    entry_block = entry_fn.low_level_il.basic_blocks[0]
    assign_rdi_main = entry_block[11]
    rdi, main_const = assign_rdi_main.operands

    if rdi != 'rdi' or main_const.operation != LLIL_CONST:
        raise Exception('Instruction `rdi = main` not found.')

    main_addr = main_const.operands[0]
    main_fn = bv.get_function_at(main_addr)
    return main_fn

The get_checker() function is similar to get_main(). It locates the address of the checker function which is called from main. Then it loads the function at that address and returns it.

1. Symbolic solution via opcode identification

Each character-checking function has identical instructions. This means we can examine the opcodes and use them as an indication of when we’ve reached a target function. We like this solution for situations in which we might not necessarily know where we need to set hooks but can identify when we’ve arrived.

  • Set a hook on every instruction.
    • Check if the opcodes match the first few instructions of the check functions.
      • Set a hook on the positive branch to solve for the register value RDI and store the value.
      • Set a hook on the negative branch to abandon that state.
      • Set a hook at the pre-branch (current instruction) to check if we know the value that was solved for. If we know the value, set RDI so we do not need to solve for it again.
  • Set a hook at a terminating instruction.

The state.abandon() call on the negative branch is crucial. This stops Manticore from reasoning over that branch, which can take a while in more complex code. Without abandonment, you’re looking at a 3 hour solve; with it, 1 minute.

def symbolic(m, end_pc):
    # hook every instruction using None as the address
    @m.hook(None)
    def hook_all(state):
        # read an integer at the program counter
        cpu = state.cpu
        pc = cpu.PC
        instruction = cpu.read_int(pc)

        # check the instructions match
        # cmp   rdi, ??
        # je    +0xe
        if (instruction & 0xFFFFFF == 0xff8348) and (instruction >> 32 & 0xFFFF == 0x0e74):
            # the positive branch is 0x14 bytes from the beginning of the function
            target = pc + 0x14

            # if the target address is not seen yet
            #   add to list and declare solver hook
            if target not in m.context['values']:
                set_hooks(m, pc)

    # set the end hook to terminate execution
    end_hook(m, end_pc)

We’re using Manticore’s context here to store values. The context dictionary is actually the dictionary of a multiprocessing manager. When you start using multiple workers, you will need to use the context to share data between them.

The function set_hooks() will be reused in strategy 3: Symbolic solution via address hooking. It sets the pre-branch, positive-branch, and negative-branch hooks.

def set_hooks(m, pc):
    # pre branch
    @m.hook(pc)
    def write(state):
        _pc = state.cpu.PC
        _target = _pc + 0x14

        if _target in m.context['values']:
            if debug:
                print 'Writing %s at %s...' % (chr(m.context['values'][_target]), hex(_pc))

            state.cpu.write_register('RDI', m.context['values'][_target])
            # print state.cpu

    # negative branch
    neg = pc + 0x6

    @m.hook(neg)
    def bail(state):
        if debug:
            print 'Abandoning state at %s...' % hex(neg)

        state.abandon()

    # target branch
    target = pc + 0x14

    @m.hook(target)
    def solve(state):
        _cpu = state.cpu
        _target = _cpu.PC
        _pc = _target - 0x14

        # skip solver step if known
        if _target in m.context['values']:
            return

        val = _cpu.read_register('RDI')
        solution = state.solve_one(val)

        values = m.context['values']
        values[_target] = solution
        m.context['values'] = values

        target_order = m.context['target_order']
        target_order.append(_target)
        m.context['target_order'] = target_order

        if debug:
            print 'Reached target %s. Current key: ' % (hex(_target))
            print "'%s'" % ''.join([chr(m.context['values'][ea]) for ea in m.context['target_order']])

Note that there is a strange update pattern with the values dictionary and target_order array. They need to be reassigned to the context dictionary in order to notify the multiprocessing manager that they have changed.

The end_hook() function is used to declare a terminating point in all three strategies. It declares a hook after all the check-character functions. The hook prints out the characters discovered, then terminates Manticore.

def end_hook(m, end_pc):
    @m.hook(end_pc)
    def hook_end(state):
        print 'GOAL:'
        print "'%s'" % ''.join([chr(m.context['values'][ea]) for ea in m.context['target_order']])
        m.terminate()

2. Concrete solution via address hooking

Since this challenge performs a simple equality check on each character, it is easy to extract the value. It would be more efficient to solve this statically. In fact, it can be solved with one hideous line of bash.

$ ls -d -1 /path/to/magic_dist/* | while read file; do echo -n "'"; grep -ao $'\x48\x83\xff.\x74\x0e' $file | while read line; do echo $line | head -c 4 | tail -c 1; done; echo "'"; done

However, in situations like this, we can take advantage of concretizing. When a value is written to a register, it is no longer symbolic. This causes the branch to be explicit and skips solving. This also means that the abandon hook on the negative branch is no longer necessary, since it will always take the positive branch due to the concrete value.

  • Set a hook on each character-checking function.
    • Extract the target value from the opcodes.
    • Write that target value to the register RDI.
  • Set a hook at a terminating instruction.
def concrete_pcs(m, pcs, end_pc):
    # for each character checking function address
    for pc in pcs:
        @m.hook(pc)
        def write(state):
            # retrieve instruction bytes
            _pc = state.cpu.PC
            instruction = state.cpu.read_int(_pc)

            # extract value from instruction
            val = instruction >> 24 & 0xFF

            # concretize RDI
            state.cpu.write_register('RDI', val)

            # store value for display at end_hook()
            _target = _pc + 0x14

            values = m.context['values']
            values[_target] = val
            m.context['values'] = values

            target_order = m.context['target_order']
            target_order.append(_target)
            m.context['target_order'] = target_order

            if debug:
                print 'Reached target %s. Current key: ' % (hex(_pc))
                print "'%s'" % ''.join([chr(m.context['values'][ea]) for ea in m.context['target_order']])

    end_hook(m, end_pc)

3. Symbolic solution via address hooking

It is easy to extract the value from each function statically. However, if each character-checking function did some arbitrary bit math before comparing the result, we would not want to reimplement all of those instructions for a static extraction. This is where a hybrid approach would be useful. We identify target functions statically, and then solve for the value in each function.

  • Set a hook on each character-checking function.
    • Set a hook on the positive branch to solve for the register value RDI and store the value.
    • Set a hook on the negative branch to abandon that state.
    • Set a hook at the pre-branch (current instruction) to check if we know the value that was solved for.
      • If we know the value, write it to RDI so we do not need to solve for it again.
  • Set a hook at a terminating instruction.
def symbolic_pcs(m, pcs, end_pc):
    for pc in pcs:
        set_hooks(m, pc)

    end_hook(m, end_pc)

Bringing everything together

With those three functions we have the target addresses we need. Putting everything together in main() we have a dynamic solver for the challenge Magic. You can find the full code listing here.

def main():
    path = sys.argv[1]
    m = Manticore(path)
    m.context['values'] = {}
    m.context['target_order'] = []

    pcs, end_pc = get_pcs(path)

    # symbolic(m, end_pc)
    # concrete_pcs(m, pcs, end_pc)
    symbolic_pcs(m, pcs, end_pc)

    m.run()

A run with our debug print statements enabled will help show the execution of this script. The first time the positive branch is hit we see a Reached target [addr]. Current Key: statement and the key up to this point. Sometimes the negative branch will be taken and the state will be abandoned. We see Writing [chr] at [addr]… when we use our previously solved values to concretize the branch. Finally, when the end_hook() is hit we see GOAL: with our final key.

Start working smarter with Manticore

Manticore delivers symbolic execution over smaller portions of compiled code. It can very quickly discover the inputs required to reach a specific path. Combine the mechanical efficiency of symbolic execution with human intuition and enhance your capabilities. With a straightforward API and powerful features, Manticore is a must-have for anyone working in binary analysis.

Take the Manticore challenge

How about you give this a shot? We created a challenge very similar to Magic, but designed it so you can’t simply grep for the solution. Install Manticore, compile the challenge, and take a step into the future of binary analysis. Try it today! The first solution to the challenge that executes in under 5 minutes will receive a bounty from the Manticore team. (Hint: Use multiple workers and optimize.)

Thanks to @saelo for contributing the functionality required to run Magic with Manticore.

Manticore: Symbolic execution for humans

Earlier this week, we open-sourced a tool we rely on for dynamic binary analysis: Manticore! Manticore helps us quickly take advantage of symbolic execution, taint analysis, and instrumentation to analyze binaries. Parts of Manticore underpinned our symbolic execution capabilities in the Cyber Grand Challenge. As an open-source tool, we hope that others can take advantage of these capabilities in their own projects.

We prioritized simplicity and usability while building Manticore. We used minimal external dependencies and our API should look familiar to anyone with an exploitation or reversing background. If you have never used such a tool before, give Manticore a try.

Two interfaces. Multiple use cases.

Manticore comes with an easy-to-use command line tool that quickly generates new program “test cases” (or sample inputs) with symbolic execution. Each test case results in a unique outcome when running the program, like a normal process exit or crash (e.g., invalid program counter, invalid memory read/write).

The command line tool satisfies some use cases, but practical use requires more flexibility. That’s why we created a Python API for custom analyses and application-specific optimizations. Manticore’s expressive and scriptable Python API can help you answer questions like:

  • At point X in execution, is it possible for variable Y to be a specified value?
  • Can the program reach this code at runtime?
  • What is a program input that will cause execution of this code?
  • Is user input ever used as a parameter to this libc function?
  • How many times does the program execute this function?
  • How many instructions does the program execute if given this input?

In our first release, the API provides functionality to extend the core analysis engine. In addition to test case generation, the Manticore API can:

  • Abandon irrelevant states
  • Run custom analysis functions at arbitrary execution points
  • Concretize symbolic memory
  • Introspect and modify emulated machine state

Early applications

Manticore is one of the primary tools we use for binary analysis research. We used an earlier version as the foundation of our symbolic execution vulnerability hunting in the Cyber Grand Challenge. We’re using it to build a custom program analyzer for DARPA LADS.

In the month leading up to our release, we solicited ideas from the community on simple use cases to demonstrate Manticore’s features. Here are a few of our favorites:

  • Eric Hennenfent solved a simple reversing challenge. He presented two solutions: one using binary instrumentation and one using symbolic execution.
  • Yan and Mark replaced a variable with a tainted symbolic value to determine which specific comparisons user input could influence.
  • Josselin Feist generated an exploit using only the Manticore API. He instrumented a binary to find a crash and then determined constraints to call an arbitrary function with symbolic execution.
  • Cory Duplantis solved a reversing challenge from Google CTF 2016. His script is a great example of how straightforward it is to solve many CTF challenges with Manticore.

Finally, a shoutout to Murmus who made a video review of Manticore only 4 hours after we open sourced it!

It’s easy to get started

With other tools, you’d have to spend time researching their internals. With Manticore, you have a well-written interface and an approachable codebase. So, jump right in and get something useful done sooner.

Grab an Ubuntu 16.04 VM and:

# Install the system dependencies
sudo apt-get update && sudo apt-get install z3 python-pip -y
python -m pip install -U pip

# Install manticore and its dependencies
git clone https://github.com/trailofbits/manticore.git && cd manticore
sudo pip install --no-binary capstone .

You have installed the Manticore CLI and API. We included a few examples in our source repository. Let’s try the CLI first:

# Build the examples
cd examples/linux
make

# Use the Manticore CLI to discover unique test cases
manticore basic
cat mcore_*/*1.stdin | ./basic
cat mcore_*/*2.stdin | ./basic

Basic” is a toy example that reads user input and prints one of two statements. Manticore used symbolic execution to explore `basic` and discovered the two unique inputs. It puts sample inputs it discovers into “stdin” files that you can pipe to the binary. Next, we’ll use the API:

# Use the Manticore API to count executed instructions
cd ../script
python count_instructions.py ../linux/helloworld

The count_instructions.py script uses the Manticore API to instrument the `helloworld` binary and count the number of instructions it executes.

Let us know what you think!

If you’re interested in reverse engineering, binary exploitation, or just want to want to learn about CPU emulators and symbolic execution, we encourage you to play around with it and join #manticore on our Slack for discussion and feedback. See you there!

A walk down memory lane

Admit it. Every now and then someone does something, and you think: “I also had that idea!” You feel validated — a kindred spirit has had the same intuitions, the same insights, and even drawn the same conclusions. I was reminded of this feeling recently when I came across a paper describing how to use Intel’s hardware transactional memory to enforce controlflow integrity.

Applied accounting: Enforcing control-flow integrity with checks and balances

A while back I had the same idea, wrote some code, and never published it. When I saw the paper, I secretly, and perhaps predictably, had that negative thought: “I got there first.” That’s not productive. Let’s go back in time to when I was disabused of the notion that there are any “firsts” with ideas. Don’t worry, hardware transactional memory and control-flow integrity will show up later. For now, I will tell you the story of Granary, my first binary translator.

Granary

I am a self-described binary translator. In fact, I have monograms in my suits saying as much. I dove head-first into binary translation and instrumentation as part of my Master’s degree. My colleague Akshay and I were working with two professors at the Systems Lab at the University of Toronto (UofT). We identified a problem: most bugs in the Linux kernel were actually coming from kernel modules (extensions). Opportunity struck. A former UofT student ported DynamoRIO to work inside of the Linux kernel, and that tool could help catch kernel module bugs in the act.

The path to finding actual bugs was long and twisted. Slowly finding bugs wasn’t as cool as doing it fast, and instrumenting the whole kernel to catch bugs in modules wasn’t fast. Our solution was to instrument modules only, and let the kernel run at full speed. This was challenging and ran up against core design decisions in DynamoRIO; thus, Granary was born. Granary’s claim to fame was that it could selectively instrument only parts of the kernel, leaving the rest of the kernel to run natively.

Second place, or first to lose?

With Granary came Address Watchpoints. This was a cool technique for doing fine-grained memory access instrumentation. Address watchpoints made it possible to instrument accesses to specific allocations in memory and detect errors like buffer overflows, use-before-initialization, and use-after-frees.

Address Watchpoints worked by intercepting calls to memory allocation functions (e.g. kmalloc) and embedding a taint tracking ID into the high 15 bits of returned pointers. Granary made it possible to interpose on memory access instructions that used those pointers. It was comprehensive because tainted pointers spread like radioactive dye — pointer copies and arithmetic transparently preserved any embedded IDs.

Yet it turned out that Address Watchpoints was not novel (this is an important metric in academia). SoftBound+CETS had applied a similar technique years before. Stay positive!

Not again

Despite the lack of novelty, Address Watchpoints were practical and attacked the real problem of memory access bugs in Linux kernel modules. Granary stepped forward as the novelty, and Address Watchpoints were an application showing that Granary was useful.

I presented Address Watchpoints at HotDep in 2013, which was co-located with the SOSP conference. At the same conference, btkernel, a fast, Linux kernel-space dynamic binary translator was released. It applied many of the same techniques that made Granary novel, but beat us to a full paper publication. Darn.

Hardware transactional memory

Time to feel good again. Trail of Bits got me a sweet laptop when I joined in 2015. It had a Broadwell chip, and supported the latest x86 features like hardware transactional memory.

The concurrency tax

The stated purpose of hardware transactional memory (HTM) is to enable lock-free and highly concurrent algorithms. For example, let’s say that I want to find use-after free bugs. A solid approach to this problem is to represent a program using a points-to graph. Use-after-free bugs exist if there is a path through the program’s points-to graph that goes from a free to a use.

Scaling this approach is challenging, but my laptop has many cores and so my imaginary doppelganger can throw some concurrency at the problem and hope for the best. Consider two threads that are working together to update the points-to graph. They propagate information from node to node, figuring out what pointers point where. If they both want to update the same node at the same time, then they need to synchronize so that one thread’s updates don’t clobber the other’s.

How do we know when synchronization is actually needed? We don’t! Instead, we need to conservatively assume that every access to a particular node requires synchronization, just in case “that other thread” rudely shows up. But points-to graphs are huge; we shouldn’t need to handle the worst case every time. That’s where HTM comes in. HTM lets us take an opportunistic approach, where threads don’t bother trying to synchronize. Instead, they try to make their own changes within a transaction, and if they both do so at the same time, then their transactions abort and they fall back to doing things the old fashioned way. This works because transactions provide failure atomicity: either the transaction succeeds and the thread’s changes are committed, or the transaction aborts, and it’s as if none of the changes ever happened.

That’s not what HTM is for

Hooray, we’re concurrency experts now. But didn’t this article start off saying something about control-flow integrity (CFI)? What the heck does concurrency have to do with that? Nothing! But HTM’s failure atomicity property has to do with CFI and more.

It turns out that HTM can be applied to unorthodox problems. For example, Intel’s HTM implementation enables a side-channel attack that can be used to defeat address space layout randomization. For a time I was also looking into similar misuses of HTM and surprise, surprise, I applied it to CFI.

Parallel discovery

Two years ago I had an idea about using HTM to enforce CFI. I had a little proof-of-concept script to go along with it, and a colleague helped me whip up an LLVM instrumentation pass that did something similar. Much to my surprise, researchers from Eurecom and UCSB recently produced a similar, more fleshed out implementation of the same idea. Here’s the gist of things.

Suppose an attacker takes control of the program counter, e.g. via ROP. Before their attack can proceed, they need to pivot and make the program change direction and go down the path to evil. The path to evil is paved with good, albeit unintended instructions. What if we live in a world without consequences? What if we let the attacker go wherever they want?

In ordinary circumstances that would be an awful idea. But attackers taking control is extraordinary, kind of like two threads simultaneously operating on the same node in a massive graph. What HTM gives us is the opportunity to do the wrong thing and be forgiven. We can begin a transaction just before a function return instruction, and end the transaction at its intended destination. Think of it like cattle herding. The only valid destinations are those that end transactions. If an attacker takes control, but doesn’t end the transaction, then the hardware will eventually run out of space and abort the transaction, and the program will seize hold of its destiny.

I believed and still think that this is a cool idea. Why didn’t I follow through? The approach that I envisioned lacked practicality. First, it wasn’t good enough as described. There are perhaps better ways to herd cattle. Aligning them within fences, for example. Protecting CFI using only failure atomicity doesn’t change the game, instead it just adds an extra requirement to ROP chains. Second, hardware transactions aren’t actually that fast — they incur full memory barriers. Executing transactions all the time kills ILP and concurrency. That makes this technique out of the question for real programs like Google Chrome. Finally, Intel’s new CET instruction set for CFI makes this approach dead on arrival. CET provides special instructions that bound control flow targets in a similar way.

If a tree falls in a forest

If I had the idea, and the Eurecom/UCSB group had the idea, then I bet some unknown third party also had the idea. Maybe Martin Abadi dreamt this up one day and didn’t tell the world. Maybe this is like operating systems, or distributed systems, or really just… systems, where similar problems and solutions seem to reappear every twenty or thirty years.

Seeing that someone else had the same idea that I had was really refreshing. It made me feel good and bad all at the same time, and reminded me of a fun time a few years ago where I felt like I was doing something clever. I’m not a special snowflake, though. There are no firsts with ideas. The Eurecom/UCSB group had an idea, then they followed it through, produced an implementation, and evaluated it. That’s what counts.

April means Infiltrate

Break out your guayabera, it’s time for Infiltrate. Trail of Bits has attended every Infiltrate and has been a sponsor since 2015. The majority of the company will be in attendance this year (18 people!) and we’ll be swapping shirts and swag again. We’re looking forward to catching up with the latest research presented there and showcasing our own contributions.

Last year we spoke on Making a Scaleable Automated Hacking System and Swift Reversing. We’re speaking again this year: Sophia d’Antoine has teamed up with a pair of Binary Ninja developers to present Be a Binary Rockstar: Next-level static analyses for vulnerability research, which expands on her previous research bringing abstract interpretation to Binary Ninja.

This year we’re bringing Manticore, the iron heart of our CGC robot, and giving attendees early access to it as we prepare to open source it. Manticore is a binary symbolic and concolic execution engine with support for x86, x86-64, and ARM. Use it to solve a simple challenge and earn yourself a Trail of Bits mug.

We don’t just attend Infiltrate to boast about our work; Infiltrate is truly a top notch conference. Infiltrate’s talks are a sneak peek at the best talks presented at other conferences — all in one place. The lobbycon is strong, giving everyone a chance to interact with the speakers and top researchers. The conference is all-inclusive and the included food, drinks, and events are fantastic — so don’t expect to show up without a ticket and try to steal some people away for dinner.

Last year also saw the return of the NOP certification. Windows 2000 and ImmunityDbg caused much frustration among our team but resulted in an exciting competition.

79F163E7-0355-4BB4-807C-856A2D337915

2016 NOP Certification: 30 minutes fighting ImmunityDbg, 7 minutes 33 seconds to pop calc

We’re particularly excited for several talks:

Of course, we’ve seen Be a Binary Rockstar and it’s great. Infiltrate tickets are still on sale — you can see it for yourself.

Vegas is over. The real show is in Miami. See you there!

McSema: I’m liftin’ it

McSema, our x86 machine code to LLVM bitcode binary translator, just got a fresh coat of paint. Last week we held a successful hackathon that produced substantial improvements to McSema’s usability, documentation, and code quality. It’s now easier than ever to use McSema to analyze and reverse-engineer binaries.

Growth stage

We use McSema on a daily basis. It lets us find and retroactively harden binary programs against security bugs, independently validate vendor source code, and generate application tests with high code coverage. It is part of ongoing research, both in academia and in DARPA programs. We (and others) are constantly extending it to analyze increasingly complex programs.

You could say that McSema has been on a growth spurt since we open-sourced it in 2014. Back then, LLVM 3.5 was new and shiny and that’s what McSema used. And that’s what it used in 2015. And in 2016. McSema stretched and grew, but some things stagnated. Over time an ache developed — a desire to modernize and to polish things off. Last week we massaged those growing pains away during our McSema usability hackathon.

Paying dividends

We made broad improvements to McSema. The code is cleaner than ever. It’s easier to install and is more portable. It runs faster and the code it produces is better.

Performance

McSema builds much faster than before. We simplified the build system by removing dead code and unneeded libraries, and by reorganizing the directory layout to be more descriptive.

McSema is faster at producing bitcode. We improved how McSema traverses the control flow graph, removed dependencies on Boost, and simplified bitcode generation.

McSema generates leaner and faster bitcode. McSema no longer stores and spills register context on entry and exit to functions. Flag operations use faster natural bitwidth operations instead of bit fields. McSema can now optimize the lazily generated bitcode to eliminate unused computations. The optimized bitcode is easier to analyze and truer to the intent of the original program.

Modernization

McSema now uses a stock distribution of LLVM 3.8. Previously, McSema used a custom modified version of LLVM 3.5. This upgrade brings in faster build times and more modern LLVM features. We have also eliminated McSema’s dependency on Boost, opting to use modern C++11 features instead.

Simplifications

The new command-line interface is more consistent and easier to use: mcsema-disass disassembles binaries, and mcsema-lift converts the disassembly into LLVM bitcode.

We removed bin_descend, our custom binary disassembler. There is now only one supported decoder that uses IDA Pro as the disassembly engine.

The new code layout is simpler and more intuitive. The CMake scripts to build McSema are now smaller and simpler.

The old testing framework has been removed in favor of an integration testing based approach with no external dependencies.

New Features

McSema supports more instructions. We are always looking for help adding new instruction semantics, and we have updated our instruction addition guide.

Mcsema will now tell you which instructions are supported and which are not, via the mcsema-lift --list-supported command.

The new integration testing framework allows for easy addition of comprehensive translation tests, and there is a new guide about adding tests to McSema.

Documentation

Our new documentation describes in detail how to install, use, test, extend, and debug McSema’s codebase. We have also documented common errors and how to resolve them. These improvements will make it easier for third-parties to hack on McSema.

Runtime

McSema isn’t just for static analysis. The lifted bitcode can be compiled back into a runnable program. We improved McSema’s runtime footprint, making it faster, greatly reducing its memory usage, and making it able to seamlessly interact with native Windows and Linux code in complex ways.

Investing in the future

We will continue to invest in improving McSema. We are always expanding support for larger and more complex software. We hope to move to Binary Ninja for control flow recovery instead of IDA Pro. And we plan to add support for lifting ARM binaries to LLVM bitcode. We want to broaden McSema’s applicability to include analyzing mobile apps and embedded firmware.

We are looking for interns that are excited about the possibilities of McSema. Looking to get started? Try out the walkthrough of translating a real Linux binary. After that, see how McSema can enable tools like libFuzzer to work on binaries. Finally, contact us and tell us where you’d like to take McSema. If we like it and you have a plan then we will pay you to make it happen.

The Challenges of Deploying Security Mitigations

This blog has promoted control flow integrity (CFI) as a game changing security mitigation and encouraged its use. We wanted to take our own security advice and start securing software we use. To that end, we decided to apply CFI to facebook’s osquery, a cross-platform codebase with which we are deeply familiar. Using osquery, we could compare clang’s implementation of CFI (ClangCFI) against Visual Studio’s Control Flow Guard (CFGuard).

That comparison never happened.

Instead, this blog post is going to be about a very important but underappreciated aspect of security mitigations: development costs and ease of use. We will describe our adventures in applying control flow integrity protections to osquery, and how seemingly small tradeoffs in security mitigations have serious implications for usability.

The Plan

The plan was simple: we would enable CFGuard for the Windows build of osquery, and ClangCFI for the Linux build of osquery. The difference between the protected and unprotected builds on osquery’s test suite would be the quantitative measurement. We’d contribute our patches back to the osquery code, resulting in a great blog post and a more secure osquery.

We got the Windows build of osquery running with CFGuard in about 15 minutes. Here is the pull request to enable CFGuard on osquery for Windows. The changes are two lines in one CMake script.

Even after weeks of effort, we still haven’t managed to enable ClangCFI on the Linux build. The discrepancy is a direct result of well meaning security choices with surprisingly far reaching consequences. The effort wasn’t for naught; we reported two clang bugs (one and two), hit a recently resolved issue, and had very insightful conversations with clang developers. They very patiently explained details of ClangCFI, identified the issues we were seeing, and graciously offered debugging assistance.

Let’s take a step-by-step walk through each security choice and the resulting consequences.

ClangCFI is stricter than CFGuard

For every protected indirect call, ClangCFI permits fewer valid destinations than CFGuard. This is good: fewer destinations means less ways to turn a bug into an exploit. ClangCFI also detects more potential errors than CFGuard (e.g. cast checks, ensuring virtual call destinations fall in the object hierarchy, etc.).

Valid destinations for indirect calls in ClangCFI and CFGuard

Figure 1: Example differences in the valid call targets for the indirect calls, using the icall examples (ClangCFI, CFGuard). The true valid destination are highlighted in green, and everything else is in red.

The specifics of what each CFI scheme permits has critical usability implications. For ClangCFI, an indirect call destination must match the type signature at the call site. The ClangCFI virtual method call checks are even stricter. For example, ClangCFI checks that the destination method belongs to the same object hierarchy. For CFGuard, an indirect call destination can be a valid function entry point [1].

An idealized view of valid indirect call targets

Figure 2: An idealized view of the valid indirect call targets for ClangCFI, CFGuard, and how they compare to the (idealized) set of valid indirect call targets.

ClangCFI’s type signature validation and virtual method call checks require whole-program analysis. The whole program analysis requirement results in two additional requirements:

  1. In general, every linked object and static library that comprise the final program must be built with CFI enabled [2].
  2. Link-time optimization (LTO) is required when using ClangCFI, because whole-program analysis is not possible until link time.

The new requirements are sensible: requiring CFI on everything ensures no part of the program is unprotected. LTO not only allows for whole-program analysis but also whole-program optimization, potentially offsetting CFI-related performance losses.

The looser validation standard used by CFGuard is less secure, but does not require whole-program analysis. Objects built with CFGuard validate indirect calls; objects built without CFGuard do not. Both objects can coexist in the same program. The linker, however, must be aware of CFGuard in order to emit a binary with appropriate tables and flags in the PE header.

ClangCFI is all or nothing. CFGuard is incremental.

In general, ClangCFI must be enabled for every object file and static library in a program: it is an error to link CFI-enabled code with non-CFI code [2]. The error is easy to make but difficult to identify because the linker does not inspect objects for ClangCFI protections. The linker will not report errors, but the resulting executable will fail runtime CFI checks.

Valid linkages when using ClangCFI

Table 1: Valid linkages when using ClangCFI. These linkages are what is valid in general, assuming there are indirect calls between the linked items. Calls across dynamic shared objects (DSOs) calls are valid assuming the use of the experimental -f[no-]sanitize-cfi-cross-dso flag.

Osquery, by design, statically links every dependency, including libc++. Those dependencies statically link other dependencies, and so on. To enable ClangCFI for osquery, we would have to enable ClangCFI for the entire osquery dependency tree. As we’ll see in the next section, that is a difficult task. We could not justify that kind of time commitment for this blog post, although we would love to do this in the future.

CFGuard can be applied on a per-compilation unit level. The documentation for CFGuard explicitly mentions that it is permissible to mix and match CFG-enabled and non-CFG objects and libraries [3]. Calls across DSOs (i.e DLLs, in Windows terminology) are fully supported. This flexibility was critical for enabling CFGuard for osquery; we enabled CFGuard for osquery itself and linked against existing unprotected dependencies. Fortunately, Windows ships with CFGuard-protected system libraries that are utilized when the main program image supports CFGuard. The unprotected code is limited to static libraries used while building osquery.

ClangCFI is too strict for some codebases

ClangCFI is too strict for some codebases. This is not clangs’ fault: some code uses shortcuts and conveniences that may not be strictly standards compliant. We ran into this issue when trying to enable ClangCFI for strongSwan. Our goal was to attempt a smaller example than osquery, and to create a security-enhanced version of strongSwan for Algo, our VPN solution.

There are valid, programmer intended targets that fall outside the domains defined by ClangCFI and CFGuard

Figure 3: How real existing code relates to the indirect call targets for ClangCFI and CFGuard. There are valid, programmer intended targets that fall outside the domains defined by ClangCFI and CFGuard.

We were not able to create a CFI-enabled version of strongSwan because libstrongswan, the core component of strongSwan, uses an OOP-like system for C. This system wraps most indirect calls with an interface that fails ClangCFI’s strict checks. ClangCFI is technically correct: the type signatures of caller and callee should match. In practice, there is shipping code where they do not.

Thankfully ClangCFI has a feature to relax strictness: the CFI blacklist. The blacklist will disable CFI checks for source files, functions, or types matching a regular expression. Unfortunately, in this case, almost every indirect call site would have to be blacklisted, making CFI effectively useless.

CFGuard is unlikely to cause the same issue: there is (probably) some code that indirect calls to the middle of a function, but such code is orders of magnitude more rare than mismatched type signatures.

Conclusion

From a security perspective, ClangCFI is “better” than CFGuard. It is stricter, it requires the whole program to be protected, and it tests for more runtime errors. It is possible to utilize ClangCFI to protect large and complex codebases: the excellent Google Chrome team does it. However, the enhanced security comes with a steep cost. Enabling ClangCFI can turn into a complex undertaking that requires considerable developer time and rigorous testing.

Conversely, CFGuard is considerably more flexible. A program can mix guarded and unguarded code, and CFGuard is much less likely to break existing code. These compromises make CFGuard much easier to enable for existing codebases.

Our experience using ClangCFI and CFGuard reflects these tradeoffs. A ClangCFI-enabled osquery would be more secure than the CFGuard-enabled osquery. However, the CFGuard-enabled osquery for Windows exists right now. The ClangCFI-enabled osquery for Linux is still a work-in-progress after weeks of trial and error.

—–

[1] This is not strictly true. For example, suppressed functions are function entry points but invalid indirect call destinations.

[2] Again, this not strictly true; there are specific exceptions to the mixing rule. For example, the CFI support library is not built with CFI. Linking CFI and non-CFI objects is fine if every function in the non-CFI object is only called directly. See this comment by Evgeniy Stepanov.

[3] from this page: “… a mixture of CFG-enabled and non-CFG enabled code will execute fine.”

The Smart Fuzzer Revolution

I recently had the privilege of giving a keynote at BSidesLisbon. I had a great time at the conference, and I’d like to thank Bruno Morisson for inviting me. If you’re into port, this is the conference for you! I recommend that anyone in the area consider attending next year.

I felt there was a need to put the recent advances in automated bug finding into context. The new developments of the Cyber Grand Challenge, AFL, and libFuzzer were easy to miss if you weren’t paying attention. However, the potential impact they have on our industry is dramatic.

After giving this talk a second time at IT Defense yesterday, I would now like to share it with the Internet. You can watch it below to get my take on where this research area has come from, where we are now, and where I expect we will go. Enjoy!

You should go to BSidesLisbon

You should go to BSidesLisbon

——–

The last 2 years have seen greater advances in automated security testing than the 10 before it. AFL engineered known best practices into an easy-to-use tool, the DARPA Cyber Grand Challenge provided a reliable competitive benchmark and funding for new research, and Project Springfield (aka SAGE) is now available to the public. The common availability of these new technologies has the potential for massive impact on our industry.

How do these tools work and what sets them apart from past approaches? Where do they excel and what are their limitations? How can I use these tools today? How will these technologies advance and what further developed is needed? And finally, how much longer do humans have as part of the secure development lifecycle?

See the slides in full here.

References

Original fuzzing project assignment from UW-Madison (1988)
http://pages.cs.wisc.edu/~bart/fuzz/CS736-Projects-f1988.pdf

PROTOS – systematic approach to eliminate software vulnerabilities (2002)
https://www.ee.oulu.fi/roles/ouspg/PROTOS_MSR2002-protos

The Advantages of Block-Based Protocol Analysis for Security Testing (2002)
http://www.immunitysec.com/downloads/advantages_of_block_based_analysis.html

DART: Directed Automated Random Testing (2005)
https://wkr.io/public/ref/godefroid2005dart.pdf

EXE: Automatically Generating Inputs of Death (2006)
https://web.stanford.edu/~engler/exe-ccs-06.pdf

EXE: 10 years later (2016)
https://ccadar.blogspot.com/2016/11/exe-10-years-later.html

Automated Whitebox Fuzz Testing (2008)
https://patricegodefroid.github.io/public_psfiles/ndss2008.pdf

American Fuzzy Lop (AFL)
http://lcamtuf.coredump.cx/afl/

DARPA Cyber Grand Challenge Competitor Portal (2013)
http://archive.darpa.mil/CyberGrandChallenge_CompetitorSite/

Exploitation and state machines (2011)
http://archives.scovetta.com/pub/conferences/infiltrate_2011/Fundamentals_of_exploitation_revisited.pdf

Your tool works better than mine? Prove it. (2016)
https://blog.trailofbits.com/2016/08/01/your-tool-works-better-than-mine-prove-it/

Microsoft Springfield (2016)
https://www.microsoft.com/en-us/springfield/

Google OSS-Fuzz (2016)
https://github.com/google/oss-fuzz

LLVM libFuzzer
http://llvm.org/docs/LibFuzzer.html

GRR – High-throughput fuzzer and emulator of DECREE binaries
https://github.com/trailofbits/grr

Manticore – A Python symbolic execution platform
https://github.com/trailofbits/manticore

McSema – x86 to machine code translation framework
https://github.com/trailofbits/mcsema

DARPA Challenge Sets for Linux, macOS, and Windows
https://github.com/trailofbits/cb-multios

Trail of Bits publications about the Cyber Grand Challenge
https://blog.trailofbits.com/category/cyber-grand-challenge/

Errata

  • The University of Oulu is in Finland.
  • The University of Wisconsin assigned homework in fuzzing in 1988.
  • SV-Comp is for software verification. ML competitions exist too.

Devirtualizing C++ with Binary Ninja

In my first blog post, I introduced the general structure of Binary Ninja’s Low Level IL (LLIL), as well as how to traverse and manipulate it with the Python API. Now, we’ll do something a little more interesting.

Reverse engineering binaries compiled from object-oriented languages can be challenging, particularly when it comes to virtual functions. In C++, invoking a virtual function involves looking up a function pointer in a virtual table (vtable) and then making an indirect call. In the disassembly, all you see is something like mov rax, [rcx+0x18]; call rax. If you want to know what function it will call for a given class object, you have to find the virtual table and then figure out which function pointer is at that offset.

Or you could use this plugin!

An Example Plugin: Navigating to a Virtual Function

vtable-navigator.py is an example plugin that will navigate to a given class’s virtual function from a call instruction. First, the plugin uses the LLIL to identify a specified class’s vtable when it is referenced in a constructor. Next, it will preprocess the call instruction’s basic block to track register assignments and their corresponding LLIL expressions. Finally, the plugin will process the LowLevelILInstruction object of the call and calculate the offset of the function to be called by recursively visiting register assignment expressions.

Discovering the vtable Pointer

vtable-diagram

Figure 1: Two classes inherit from a base virtual class. Each class’s virtual table points to its respective implementation of the virtual functions.

In the simplest form, a class constructor stores a pointer to its vtable in the object’s structure in memory. The two most common ways that a vtable pointer can be stored are either directly referencing a hardcoded value for the vtable pointer or storing the vtable pointer in a register, then copying that register’s value into memory. Thus, if we look for a write to a memory address from a register with no offset, then it’s probably the vtable.

constructor-example

An example constructor. The highlighted instruction stores a vtable in the object’s structure.

We can detect the first kind of vtable pointer assignment by looking for an LLIL instruction in the constructor’s LowLevelILFunction object (as described in Part 1) that stores a constant value to a memory address contained in a register.

According to the API, an LLIL_STORE instruction has two operands: dest and src. Both are LLIL expressions. For this case, we are looking for a destination value provided by a register, so dest should be an LLIL_REG expression. The value to be stored is a constant, so src should be an LLIL_CONST expression. If we match this pattern, then we assume that the constant is a vtable pointer, read the value pointed to by the constant (i.e. il.src.value), and double check that there’s a function pointer there, just to make sure it’s actually a vtable.

# If it's not a memory store, then it's not a vtable.
if il.operation != LowLevelILOperation.LLIL_STORE:
    continue

# vtable is referenced directly
if (il.dest.operation == LowLevelILOperation.LLIL_REG and
        il.src.operation == LowLevelILOperation.LLIL_CONST):
    fp = read_value(bv, il.src.value, bv.address_size)

    if not bv.is_offset_executable(fp):
        continue

    return il.src.value

Pretty straight forward, but let’s look at the second case, where the value is first stored in a register.

For this case, we search for instructions where the dest and src operands of an LLIL_STORE are both LLIL_REG expressions. Now we need to determine the location of the virtual table based only on the register.

This is where things get cool. This situation not only demonstrates usage of the LLIL, but how powerful the dataflow analysis performed on the LLIL can be. Without dataflow analysis, we would have to parse this LLIL_STORE instruction, figure out which register is being referenced, and then step backwards to find the last value assigned to that register. With the dataflow analysis, the register’s current value is readily available with a single call to get_reg_value_at_low_level_il_instruction.

# vtable is first loaded into a register, then stored
if (il.dest.operation == LowLevelILOperation.LLIL_REG and
        il.src.operation == LowLevelILOperation.LLIL_REG):
    reg_value = src_func.get_reg_value_at_low_level_il_instruction(
        il.instr_index, il.src.src
    )

    if reg_value.type == RegisterValueType.ConstantValue:
        fp = read_value(bv, reg_value.value, bv.address_size)

        if not bv.is_offset_executable(fp):
            continue

    return reg_value.value

Propagation of Register Assignments

Now that we know the location of the vtable, let’s figure out which offset is called. To determine this value, we need to trace back through the state of the program from the call instruction to the moment the vtable pointer is retrieved from memory, calculate the offset into the virtual table, and discover which function is being called. We accomplish this tracing by implementing a rudimentary dataflow analysis that preprocesses the basic block containing the call instruction. This preprocessing step will let us query the state of a register at any point in the basic block.

def preprocess_basic_block(bb):
    defs = {}
    current_defs = {}

    for instr in bb:
        defs[instr.instr_index] = copy(current_defs)

        if instr.operation == LowLevelILOperation.LLIL_SET_REG:
            current_defs[instr.dest] = instr

        elif instr.operation == LowLevelILOperation.LLIL_CALL:
            # wipe out previous definitions since we can't
            # guarantee the call didn't modify registers.
            current_defs.clear()

    return defs

At each instruction of the basic block, we keep a table of register states. As we iterate over each LowLevelILInstruction, this table is updated when an LLIL_SET_REG operation is encountered. For each register tracked, we store the LowLevelILInstruction responsible for changing its value. Later, we can query this register’s state and retrieve the LowLevelILInstruction and recursively query the value of the src operand, which is the expression the register currently represents.

Additionally, if an LLIL_CALL operation is encountered, then we clear the register state from that point on. The called function might modify the registers, and so it is safest to assume that all registers after the call have unknown values.

Now we have all the data that we need to model the vtable pointer dereference and calculate the virtual function offset.

Calculating the Virtual Function Offset

Before diving into the task of calculating the offset, let’s consider how we can model the behavior. Looking back at Figure 1, dispatching a virtual function can be generalized into four steps:

  1. Read a pointer to the vtable from the object’s structure in memory (LLIL_LOAD).
  2. Add an offset to the pointer value, if the function to be dispatched is not the first function (LLIL_ADD).
  3. Read the function pointer at the calculated offset (LLIL_LOAD).
  4. Call the function (LLIL_CALL).

Dispatching a virtual function can therefore be modeled by evaluating the src operand expression of the LLIL_CALL instruction, recursively visiting each expression. The base case of the recursion is reached when the LLIL_LOAD instruction of step 1 is encountered. The value of that LLIL_LOAD is the specified vtable pointer. The vtable pointer value is returned and propagates back through the previous iterations of the recursion to be used in those iterations’ evaluations.

Let’s step through the evaluation of an example, to see how the model works and how it is implemented in Python. Take the following virtual function dispatch in x86.

mov eax, [ecx] ; retrieve vtable pointer
call [eax+4]   ; call the function pointer at offset 4 of the vtable

This assembly would be translated into the following LLIL.

0: eax = [ecx].d
1: call ([eax + 4].d)

Building out the trees for these two LLIL instructions yields the following structures.

vtable-dispatch-llil

Figure 2: LLIL tree structures for the example vtable dispatch assembly.

The src operand of the LLIL_CALL is an LLIL_LOAD expression. We evaluate the src operand of the LLIL_CALL with a handler based on its operation.

# This lets us handle expressions in a more generic way.
# operation handlers take the following parameters:
#   vtable (int): the address of the class's vtable in memory
#   bv (BinaryView): the BinaryView passed into the plugin callback
#   expr (LowLevelILInstruction): the expression to handle
#   current_defs (dict): The current state of register definitions
#   defs (dict): The register state table for all instructions
#   load_count (int): The number of LLIL_LOAD operations encountered
operation_handler = defaultdict(lambda: (lambda *args: None))
operation_handler[LowLevelILOperation.LLIL_ADD] = handle_add
operation_handler[LowLevelILOperation.LLIL_REG] = handle_reg
operation_handler[LowLevelILOperation.LLIL_LOAD] = handle_load
operation_handler[LowLevelILOperation.LLIL_CONST] = handle_const

Therefore, the first iteration of our recursive evaluation of this virtual function dispatch is a call to handle_load.

def handle_load(vtable, bv, expr, current_defs, defs, load_count):
    load_count += 1

    if load_count == 2:
        return vtable

    addr = operation_handler[expr.src.operation](
        vtable, bv, expr.src, current_defs, defs, load_count
    )
    if addr is None:
        return

    # Read the value at the specified address.
    return read_value(bv, addr, expr.size)

handle_load first increments a count of LLIL_LOAD expressions encountered. Recall that our model for dereferencing a vtable expects two LLIL_LOAD instructions: the vtable pointer, then the function pointer we want. Tracing backwards through the program state means that we will encounter the load of the function pointer first and the load for the vtable pointer second. The count is 1 at the moment, so the recursion should not yet be terminated. Instead, the src operand of the LLIL_LOAD is recursively evaluated by a handler function for the src expression. When this call to a handler completes, addr should contain the address that points to the function pointer to be dispatched. In this case, src is an LLIL_ADD, so handle_add is called.

def handle_add(vtable, bv, expr, current_defs, defs, load_count):
    left = expr.left
    right = expr.right

    left_value = operation_handler[left.operation](
        vtable, bv, left, current_defs, defs, load_count
    )

    right_value = operation_handler[right.operation](
        vtable, bv, right, current_defs, defs, load_count
    )

    if None in (left_value, right_value):
        return None

    return left_value + right_value

handle_add recursively evaluates both the left and right sides of the LLIL_ADD expression and returns the sum of these values back to its caller. In our example, the left operand is an LLIL_REG expression, so handle_reg is called.

def handle_reg(vtable, bv, expr, current_defs, defs, load_count):
    # Retrieve the LLIL expression that this register currently
    # represents.
    set_reg = current_defs.get(expr.src, None)
    if set_reg is None:
        return None

    new_defs = defs.get(set_reg.instr_index, {})

    return operation_handler[set_reg.src.operation](
        vtable, bv, set_reg.src, new_defs, defs, load_count
    )

This is where our dataflow analysis comes into play. Using the current register state, as described by current_defs, we identify the LLIL expression that represents the current value of this LLIL_REG expression. Based on the example above, current_defs[‘eax’] would be the expression [ecx].d. This is another LLIL_LOAD expression, so handle_load is called again. This time, load_count is incremented to 2, which meets the base case. If we assume that in our example the user chose a class whose constructor is located at 0x1000, then handle_load will return the value 0x1000.

With the left operand evaluated, it is now time for handle_add to evaluate the right operand. This expression is an LLIL_CONST, which is very easy to evaluate; we just return the value operand of the expression. With both left and right operands evaluated, handle_add returns the sum of the expression, which is 0x1004. handle_load receives the return value from handle_add, then reads the function pointer located at that address from the BinaryView. We can then change the currently displayed function by calling bv.file.navigate(bv.file.view, function_pointer) in the BinaryView object.

Returning to the LLIL tree structures earlier, we can annotate the structures to visualize how the recursion and concrete data propagation happens.

annotated-vtable-dispatch-llil

Figure 3: LLIL tree structures for the example vtable dispatch assembly, annotated with handler calls and propagation of concrete values back up the call chain.

Example: Rectangles and Triangles

For a real world example, I used a slightly modified version of this C++ tutorial, which you can find here. Here’s a demonstration of the plugin in action:

If you compile virtual-test.cpp for both x86-64 and ARM and open the binaries in Binary Ninja with the plugin installed, you will find that it will work on both architectures without needing any architecture-specific code. That’s the beauty of an intermediate representation!

Go Forth and Analyze

Binary Ninja’s LLIL is a powerful feature that enables cross-platform program analysis to be easily developed. As we’ve seen, its structure is simple, yet allows for representation of even the most complex instructions. The Python API is a high quality interface that we can use effectively to traverse instructions and process operations and operands with ease. More importantly, we’ve seen simple examples of how dataflow analysis, enabled by the LLIL, can allow us to develop cross-platform plugins to perform program analysis without having to implement complicated heuristics to calculate program values. What are you waiting for? Pick up a copy of Binary Ninja and start writing your own binary analysis with the LLIL, and don’t forget to attend Sophia’s presentation of “Next-level Static Analysis for Vulnerability Research” using the Binary Ninja LLIL at INFILTRATE 2017.