Manticore: Symbolic execution for humans

Earlier this week, we open-sourced a tool we rely on for dynamic binary analysis: Manticore! Manticore helps us quickly take advantage of symbolic execution, taint analysis, and instrumentation to analyze binaries. Parts of Manticore underpinned our symbolic execution capabilities in the Cyber Grand Challenge. As an open-source tool, we hope that others can take advantage of these capabilities in their own projects.

We prioritized simplicity and usability while building Manticore. We used minimal external dependencies and our API should look familiar to anyone with an exploitation or reversing background. If you have never used such a tool before, give Manticore a try.

Two interfaces. Multiple use cases.

Manticore comes with an easy-to-use command line tool that quickly generates new program “test cases” (or sample inputs) with symbolic execution. Each test case results in a unique outcome when running the program, like a normal process exit or crash (e.g., invalid program counter, invalid memory read/write).

The command line tool satisfies some use cases, but practical use requires more flexibility. That’s why we created a Python API for custom analyses and application-specific optimizations. Manticore’s expressive and scriptable Python API can help you answer questions like:

  • At point X in execution, is it possible for variable Y to be a specified value?
  • Can the program reach this code at runtime?
  • What is a program input that will cause execution of this code?
  • Is user input ever used as a parameter to this libc function?
  • How many times does the program execute this function?
  • How many instructions does the program execute if given this input?

In our first release, the API provides functionality to extend the core analysis engine. In addition to test case generation, the Manticore API can:

  • Abandon irrelevant states
  • Run custom analysis functions at arbitrary execution points
  • Concretize symbolic memory
  • Introspect and modify emulated machine state

Early applications

Manticore is one of the primary tools we use for binary analysis research. We used an earlier version as the foundation of our symbolic execution vulnerability hunting in the Cyber Grand Challenge. We’re using it to build a custom program analyzer for DARPA LADS.

In the month leading up to our release, we solicited ideas from the community on simple use cases to demonstrate Manticore’s features. Here are a few of our favorites:

  • Eric Hennenfent solved a simple reversing challenge. He presented two solutions: one using binary instrumentation and one using symbolic execution.
  • Yan and Mark replaced a variable with a tainted symbolic value to determine which specific comparisons user input could influence.
  • Josselin Feist generated an exploit using only the Manticore API. He instrumented a binary to find a crash and then determined constraints to call an arbitrary function with symbolic execution.
  • Cory Duplantis solved a reversing challenge from Google CTF 2016. His script is a great example of how straightforward it is to solve many CTF challenges with Manticore.

Finally, a shoutout to Murmus who made a video review of Manticore only 4 hours after we open sourced it!

It’s easy to get started

With other tools, you’d have to spend time researching their internals. With Manticore, you have a well-written interface and an approachable codebase. So, jump right in and get something useful done sooner.

Grab an Ubuntu 16.04 VM and:

# Install the system dependencies
sudo apt-get update && sudo apt-get install z3 python-pip -y
python -m pip install -U pip

# Install manticore and its dependencies
git clone https://github.com/trailofbits/manticore.git && cd manticore
sudo pip install --no-binary capstone .

You have installed the Manticore CLI and API. We included a few examples in our source repository. Let’s try the CLI first:

# Build the examples
cd examples/linux
make

# Use the Manticore CLI to discover unique test cases
manticore basic
cat mcore_*/*1.stdin | ./basic
cat mcore_*/*2.stdin | ./basic

Basic” is a toy example that reads user input and prints one of two statements. Manticore used symbolic execution to explore `basic` and discovered the two unique inputs. It puts sample inputs it discovers into “stdin” files that you can pipe to the binary. Next, we’ll use the API:

# Use the Manticore API to count executed instructions
cd ../script
python count_instructions.py ../linux/helloworld

The count_instructions.py script uses the Manticore API to instrument the `helloworld` binary and count the number of instructions it executes.

Let us know what you think!

If you’re interested in reverse engineering, binary exploitation, or just want to want to learn about CPU emulators and symbolic execution, we encourage you to play around with it and join #manticore on our Slack for discussion and feedback. See you there!

A walk down memory lane

Admit it. Every now and then someone does something, and you think: “I also had that idea!” You feel validated — a kindred spirit has had the same intuitions, the same insights, and even drawn the same conclusions. I was reminded of this feeling recently when I came across a paper describing how to use Intel’s hardware transactional memory to enforce controlflow integrity.

Applied accounting: Enforcing control-flow integrity with checks and balances

A while back I had the same idea, wrote some code, and never published it. When I saw the paper, I secretly, and perhaps predictably, had that negative thought: “I got there first.” That’s not productive. Let’s go back in time to when I was disabused of the notion that there are any “firsts” with ideas. Don’t worry, hardware transactional memory and control-flow integrity will show up later. For now, I will tell you the story of Granary, my first binary translator.

Granary

I am a self-described binary translator. In fact, I have monograms in my suits saying as much. I dove head-first into binary translation and instrumentation as part of my Master’s degree. My colleague Akshay and I were working with two professors at the Systems Lab at the University of Toronto (UofT). We identified a problem: most bugs in the Linux kernel were actually coming from kernel modules (extensions). Opportunity struck. A former UofT student ported DynamoRIO to work inside of the Linux kernel, and that tool could help catch kernel module bugs in the act.

The path to finding actual bugs was long and twisted. Slowly finding bugs wasn’t as cool as doing it fast, and instrumenting the whole kernel to catch bugs in modules wasn’t fast. Our solution was to instrument modules only, and let the kernel run at full speed. This was challenging and ran up against core design decisions in DynamoRIO; thus, Granary was born. Granary’s claim to fame was that it could selectively instrument only parts of the kernel, leaving the rest of the kernel to run natively.

Second place, or first to lose?

With Granary came Address Watchpoints. This was a cool technique for doing fine-grained memory access instrumentation. Address watchpoints made it possible to instrument accesses to specific allocations in memory and detect errors like buffer overflows, use-before-initialization, and use-after-frees.

Address Watchpoints worked by intercepting calls to memory allocation functions (e.g. kmalloc) and embedding a taint tracking ID into the high 15 bits of returned pointers. Granary made it possible to interpose on memory access instructions that used those pointers. It was comprehensive because tainted pointers spread like radioactive dye — pointer copies and arithmetic transparently preserved any embedded IDs.

Yet it turned out that Address Watchpoints was not novel (this is an important metric in academia). SoftBound+CETS had applied a similar technique years before. Stay positive!

Not again

Despite the lack of novelty, Address Watchpoints were practical and attacked the real problem of memory access bugs in Linux kernel modules. Granary stepped forward as the novelty, and Address Watchpoints were an application showing that Granary was useful.

I presented Address Watchpoints at HotDep in 2013, which was co-located with the SOSP conference. At the same conference, btkernel, a fast, Linux kernel-space dynamic binary translator was released. It applied many of the same techniques that made Granary novel, but beat us to a full paper publication. Darn.

Hardware transactional memory

Time to feel good again. Trail of Bits got me a sweet laptop when I joined in 2015. It had a Broadwell chip, and supported the latest x86 features like hardware transactional memory.

The concurrency tax

The stated purpose of hardware transactional memory (HTM) is to enable lock-free and highly concurrent algorithms. For example, let’s say that I want to find use-after free bugs. A solid approach to this problem is to represent a program using a points-to graph. Use-after-free bugs exist if there is a path through the program’s points-to graph that goes from a free to a use.

Scaling this approach is challenging, but my laptop has many cores and so my imaginary doppelganger can throw some concurrency at the problem and hope for the best. Consider two threads that are working together to update the points-to graph. They propagate information from node to node, figuring out what pointers point where. If they both want to update the same node at the same time, then they need to synchronize so that one thread’s updates don’t clobber the other’s.

How do we know when synchronization is actually needed? We don’t! Instead, we need to conservatively assume that every access to a particular node requires synchronization, just in case “that other thread” rudely shows up. But points-to graphs are huge; we shouldn’t need to handle the worst case every time. That’s where HTM comes in. HTM lets us take an opportunistic approach, where threads don’t bother trying to synchronize. Instead, they try to make their own changes within a transaction, and if they both do so at the same time, then their transactions abort and they fall back to doing things the old fashioned way. This works because transactions provide failure atomicity: either the transaction succeeds and the thread’s changes are committed, or the transaction aborts, and it’s as if none of the changes ever happened.

That’s not what HTM is for

Hooray, we’re concurrency experts now. But didn’t this article start off saying something about control-flow integrity (CFI)? What the heck does concurrency have to do with that? Nothing! But HTM’s failure atomicity property has to do with CFI and more.

It turns out that HTM can be applied to unorthodox problems. For example, Intel’s HTM implementation enables a side-channel attack that can be used to defeat address space layout randomization. For a time I was also looking into similar misuses of HTM and surprise, surprise, I applied it to CFI.

Parallel discovery

Two years ago I had an idea about using HTM to enforce CFI. I had a little proof-of-concept script to go along with it, and a colleague helped me whip up an LLVM instrumentation pass that did something similar. Much to my surprise, researchers from Eurecom and UCSB recently produced a similar, more fleshed out implementation of the same idea. Here’s the gist of things.

Suppose an attacker takes control of the program counter, e.g. via ROP. Before their attack can proceed, they need to pivot and make the program change direction and go down the path to evil. The path to evil is paved with good, albeit unintended instructions. What if we live in a world without consequences? What if we let the attacker go wherever they want?

In ordinary circumstances that would be an awful idea. But attackers taking control is extraordinary, kind of like two threads simultaneously operating on the same node in a massive graph. What HTM gives us is the opportunity to do the wrong thing and be forgiven. We can begin a transaction just before a function return instruction, and end the transaction at its intended destination. Think of it like cattle herding. The only valid destinations are those that end transactions. If an attacker takes control, but doesn’t end the transaction, then the hardware will eventually run out of space and abort the transaction, and the program will seize hold of its destiny.

I believed and still think that this is a cool idea. Why didn’t I follow through? The approach that I envisioned lacked practicality. First, it wasn’t good enough as described. There are perhaps better ways to herd cattle. Aligning them within fences, for example. Protecting CFI using only failure atomicity doesn’t change the game, instead it just adds an extra requirement to ROP chains. Second, hardware transactions aren’t actually that fast — they incur full memory barriers. Executing transactions all the time kills ILP and concurrency. That makes this technique out of the question for real programs like Google Chrome. Finally, Intel’s new CET instruction set for CFI makes this approach dead on arrival. CET provides special instructions that bound control flow targets in a similar way.

If a tree falls in a forest

If I had the idea, and the Eurecom/UCSB group had the idea, then I bet some unknown third party also had the idea. Maybe Martin Abadi dreamt this up one day and didn’t tell the world. Maybe this is like operating systems, or distributed systems, or really just… systems, where similar problems and solutions seem to reappear every twenty or thirty years.

Seeing that someone else had the same idea that I had was really refreshing. It made me feel good and bad all at the same time, and reminded me of a fun time a few years ago where I felt like I was doing something clever. I’m not a special snowflake, though. There are no firsts with ideas. The Eurecom/UCSB group had an idea, then they followed it through, produced an implementation, and evaluated it. That’s what counts.

April means Infiltrate

Break out your guayabera, it’s time for Infiltrate. Trail of Bits has attended every Infiltrate and has been a sponsor since 2015. The majority of the company will be in attendance this year (18 people!) and we’ll be swapping shirts and swag again. We’re looking forward to catching up with the latest research presented there and showcasing our own contributions.

Last year we spoke on Making a Scaleable Automated Hacking System and Swift Reversing. We’re speaking again this year: Sophia d’Antoine has teamed up with a pair of Binary Ninja developers to present Be a Binary Rockstar: Next-level static analyses for vulnerability research, which expands on her previous research bringing abstract interpretation to Binary Ninja.

This year we’re bringing Manticore, the iron heart of our CGC robot, and giving attendees early access to it as we prepare to open source it. Manticore is a binary symbolic and concolic execution engine with support for x86, x86-64, and ARM. Use it to solve a simple challenge and earn yourself a Trail of Bits mug.

We don’t just attend Infiltrate to boast about our work; Infiltrate is truly a top notch conference. Infiltrate’s talks are a sneak peek at the best talks presented at other conferences — all in one place. The lobbycon is strong, giving everyone a chance to interact with the speakers and top researchers. The conference is all-inclusive and the included food, drinks, and events are fantastic — so don’t expect to show up without a ticket and try to steal some people away for dinner.

Last year also saw the return of the NOP certification. Windows 2000 and ImmunityDbg caused much frustration among our team but resulted in an exciting competition.

79F163E7-0355-4BB4-807C-856A2D337915

2016 NOP Certification: 30 minutes fighting ImmunityDbg, 7 minutes 33 seconds to pop calc

We’re particularly excited for several talks:

Of course, we’ve seen Be a Binary Rockstar and it’s great. Infiltrate tickets are still on sale — you can see it for yourself.

Vegas is over. The real show is in Miami. See you there!

McSema: I’m liftin’ it

McSema, our x86 machine code to LLVM bitcode binary translator, just got a fresh coat of paint. Last week we held a successful hackathon that produced substantial improvements to McSema’s usability, documentation, and code quality. It’s now easier than ever to use McSema to analyze and reverse-engineer binaries.

Growth stage

We use McSema on a daily basis. It lets us find and retroactively harden binary programs against security bugs, independently validate vendor source code, and generate application tests with high code coverage. It is part of ongoing research, both in academia and in DARPA programs. We (and others) are constantly extending it to analyze increasingly complex programs.

You could say that McSema has been on a growth spurt since we open-sourced it in 2014. Back then, LLVM 3.5 was new and shiny and that’s what McSema used. And that’s what it used in 2015. And in 2016. McSema stretched and grew, but some things stagnated. Over time an ache developed — a desire to modernize and to polish things off. Last week we massaged those growing pains away during our McSema usability hackathon.

Paying dividends

We made broad improvements to McSema. The code is cleaner than ever. It’s easier to install and is more portable. It runs faster and the code it produces is better.

Performance

McSema builds much faster than before. We simplified the build system by removing dead code and unneeded libraries, and by reorganizing the directory layout to be more descriptive.

McSema is faster at producing bitcode. We improved how McSema traverses the control flow graph, removed dependencies on Boost, and simplified bitcode generation.

McSema generates leaner and faster bitcode. McSema no longer stores and spills register context on entry and exit to functions. Flag operations use faster natural bitwidth operations instead of bit fields. McSema can now optimize the lazily generated bitcode to eliminate unused computations. The optimized bitcode is easier to analyze and truer to the intent of the original program.

Modernization

McSema now uses a stock distribution of LLVM 3.8. Previously, McSema used a custom modified version of LLVM 3.5. This upgrade brings in faster build times and more modern LLVM features. We have also eliminated McSema’s dependency on Boost, opting to use modern C++11 features instead.

Simplifications

The new command-line interface is more consistent and easier to use: mcsema-disass disassembles binaries, and mcsema-lift converts the disassembly into LLVM bitcode.

We removed bin_descend, our custom binary disassembler. There is now only one supported decoder that uses IDA Pro as the disassembly engine.

The new code layout is simpler and more intuitive. The CMake scripts to build McSema are now smaller and simpler.

The old testing framework has been removed in favor of an integration testing based approach with no external dependencies.

New Features

McSema supports more instructions. We are always looking for help adding new instruction semantics, and we have updated our instruction addition guide.

Mcsema will now tell you which instructions are supported and which are not, via the mcsema-lift --list-supported command.

The new integration testing framework allows for easy addition of comprehensive translation tests, and there is a new guide about adding tests to McSema.

Documentation

Our new documentation describes in detail how to install, use, test, extend, and debug McSema’s codebase. We have also documented common errors and how to resolve them. These improvements will make it easier for third-parties to hack on McSema.

Runtime

McSema isn’t just for static analysis. The lifted bitcode can be compiled back into a runnable program. We improved McSema’s runtime footprint, making it faster, greatly reducing its memory usage, and making it able to seamlessly interact with native Windows and Linux code in complex ways.

Investing in the future

We will continue to invest in improving McSema. We are always expanding support for larger and more complex software. We hope to move to Binary Ninja for control flow recovery instead of IDA Pro. And we plan to add support for lifting ARM binaries to LLVM bitcode. We want to broaden McSema’s applicability to include analyzing mobile apps and embedded firmware.

We are looking for interns that are excited about the possibilities of McSema. Looking to get started? Try out the walkthrough of translating a real Linux binary. After that, see how McSema can enable tools like libFuzzer to work on binaries. Finally, contact us and tell us where you’d like to take McSema. If we like it and you have a plan then we will pay you to make it happen.

The Challenges of Deploying Security Mitigations

This blog has promoted control flow integrity (CFI) as a game changing security mitigation and encouraged its use. We wanted to take our own security advice and start securing software we use. To that end, we decided to apply CFI to facebook’s osquery, a cross-platform codebase with which we are deeply familiar. Using osquery, we could compare clang’s implementation of CFI (ClangCFI) against Visual Studio’s Control Flow Guard (CFGuard).

That comparison never happened.

Instead, this blog post is going to be about a very important but underappreciated aspect of security mitigations: development costs and ease of use. We will describe our adventures in applying control flow integrity protections to osquery, and how seemingly small tradeoffs in security mitigations have serious implications for usability.

The Plan

The plan was simple: we would enable CFGuard for the Windows build of osquery, and ClangCFI for the Linux build of osquery. The difference between the protected and unprotected builds on osquery’s test suite would be the quantitative measurement. We’d contribute our patches back to the osquery code, resulting in a great blog post and a more secure osquery.

We got the Windows build of osquery running with CFGuard in about 15 minutes. Here is the pull request to enable CFGuard on osquery for Windows. The changes are two lines in one CMake script.

Even after weeks of effort, we still haven’t managed to enable ClangCFI on the Linux build. The discrepancy is a direct result of well meaning security choices with surprisingly far reaching consequences. The effort wasn’t for naught; we reported two clang bugs (one and two), hit a recently resolved issue, and had very insightful conversations with clang developers. They very patiently explained details of ClangCFI, identified the issues we were seeing, and graciously offered debugging assistance.

Let’s take a step-by-step walk through each security choice and the resulting consequences.

ClangCFI is stricter than CFGuard

For every protected indirect call, ClangCFI permits fewer valid destinations than CFGuard. This is good: fewer destinations means less ways to turn a bug into an exploit. ClangCFI also detects more potential errors than CFGuard (e.g. cast checks, ensuring virtual call destinations fall in the object hierarchy, etc.).

Valid destinations for indirect calls in ClangCFI and CFGuard

Figure 1: Example differences in the valid call targets for the indirect calls, using the icall examples (ClangCFI, CFGuard). The true valid destination are highlighted in green, and everything else is in red.

The specifics of what each CFI scheme permits has critical usability implications. For ClangCFI, an indirect call destination must match the type signature at the call site. The ClangCFI virtual method call checks are even stricter. For example, ClangCFI checks that the destination method belongs to the same object hierarchy. For CFGuard, an indirect call destination can be a valid function entry point [1].

An idealized view of valid indirect call targets

Figure 2: An idealized view of the valid indirect call targets for ClangCFI, CFGuard, and how they compare to the (idealized) set of valid indirect call targets.

ClangCFI’s type signature validation and virtual method call checks require whole-program analysis. The whole program analysis requirement results in two additional requirements:

  1. In general, every linked object and static library that comprise the final program must be built with CFI enabled [2].
  2. Link-time optimization (LTO) is required when using ClangCFI, because whole-program analysis is not possible until link time.

The new requirements are sensible: requiring CFI on everything ensures no part of the program is unprotected. LTO not only allows for whole-program analysis but also whole-program optimization, potentially offsetting CFI-related performance losses.

The looser validation standard used by CFGuard is less secure, but does not require whole-program analysis. Objects built with CFGuard validate indirect calls; objects built without CFGuard do not. Both objects can coexist in the same program. The linker, however, must be aware of CFGuard in order to emit a binary with appropriate tables and flags in the PE header.

ClangCFI is all or nothing. CFGuard is incremental.

In general, ClangCFI must be enabled for every object file and static library in a program: it is an error to link CFI-enabled code with non-CFI code [2]. The error is easy to make but difficult to identify because the linker does not inspect objects for ClangCFI protections. The linker will not report errors, but the resulting executable will fail runtime CFI checks.

Valid linkages when using ClangCFI

Table 1: Valid linkages when using ClangCFI. These linkages are what is valid in general, assuming there are indirect calls between the linked items. Calls across dynamic shared objects (DSOs) calls are valid assuming the use of the experimental -f[no-]sanitize-cfi-cross-dso flag.

Osquery, by design, statically links every dependency, including libc++. Those dependencies statically link other dependencies, and so on. To enable ClangCFI for osquery, we would have to enable ClangCFI for the entire osquery dependency tree. As we’ll see in the next section, that is a difficult task. We could not justify that kind of time commitment for this blog post, although we would love to do this in the future.

CFGuard can be applied on a per-compilation unit level. The documentation for CFGuard explicitly mentions that it is permissible to mix and match CFG-enabled and non-CFG objects and libraries [3]. Calls across DSOs (i.e DLLs, in Windows terminology) are fully supported. This flexibility was critical for enabling CFGuard for osquery; we enabled CFGuard for osquery itself and linked against existing unprotected dependencies. Fortunately, Windows ships with CFGuard-protected system libraries that are utilized when the main program image supports CFGuard. The unprotected code is limited to static libraries used while building osquery.

ClangCFI is too strict for some codebases

ClangCFI is too strict for some codebases. This is not clangs’ fault: some code uses shortcuts and conveniences that may not be strictly standards compliant. We ran into this issue when trying to enable ClangCFI for strongSwan. Our goal was to attempt a smaller example than osquery, and to create a security-enhanced version of strongSwan for Algo, our VPN solution.

There are valid, programmer intended targets that fall outside the domains defined by ClangCFI and CFGuard

Figure 3: How real existing code relates to the indirect call targets for ClangCFI and CFGuard. There are valid, programmer intended targets that fall outside the domains defined by ClangCFI and CFGuard.

We were not able to create a CFI-enabled version of strongSwan because libstrongswan, the core component of strongSwan, uses an OOP-like system for C. This system wraps most indirect calls with an interface that fails ClangCFI’s strict checks. ClangCFI is technically correct: the type signatures of caller and callee should match. In practice, there is shipping code where they do not.

Thankfully ClangCFI has a feature to relax strictness: the CFI blacklist. The blacklist will disable CFI checks for source files, functions, or types matching a regular expression. Unfortunately, in this case, almost every indirect call site would have to be blacklisted, making CFI effectively useless.

CFGuard is unlikely to cause the same issue: there is (probably) some code that indirect calls to the middle of a function, but such code is orders of magnitude more rare than mismatched type signatures.

Conclusion

From a security perspective, ClangCFI is “better” than CFGuard. It is stricter, it requires the whole program to be protected, and it tests for more runtime errors. It is possible to utilize ClangCFI to protect large and complex codebases: the excellent Google Chrome team does it. However, the enhanced security comes with a steep cost. Enabling ClangCFI can turn into a complex undertaking that requires considerable developer time and rigorous testing.

Conversely, CFGuard is considerably more flexible. A program can mix guarded and unguarded code, and CFGuard is much less likely to break existing code. These compromises make CFGuard much easier to enable for existing codebases.

Our experience using ClangCFI and CFGuard reflects these tradeoffs. A ClangCFI-enabled osquery would be more secure than the CFGuard-enabled osquery. However, the CFGuard-enabled osquery for Windows exists right now. The ClangCFI-enabled osquery for Linux is still a work-in-progress after weeks of trial and error.

—–

[1] This is not strictly true. For example, suppressed functions are function entry points but invalid indirect call destinations.

[2] Again, this not strictly true; there are specific exceptions to the mixing rule. For example, the CFI support library is not built with CFI. Linking CFI and non-CFI objects is fine if every function in the non-CFI object is only called directly. See this comment by Evgeniy Stepanov.

[3] from this page: “… a mixture of CFG-enabled and non-CFG enabled code will execute fine.”

The Smart Fuzzer Revolution

I recently had the privilege of giving a keynote at BSidesLisbon. I had a great time at the conference, and I’d like to thank Bruno Morisson for inviting me. If you’re into port, this is the conference for you! I recommend that anyone in the area consider attending next year.

I felt there was a need to put the recent advances in automated bug finding into context. The new developments of the Cyber Grand Challenge, AFL, and libFuzzer were easy to miss if you weren’t paying attention. However, the potential impact they have on our industry is dramatic.

After giving this talk a second time at IT Defense yesterday, I would now like to share it with the Internet. You can watch it below to get my take on where this research area has come from, where we are now, and where I expect we will go. Enjoy!

You should go to BSidesLisbon

You should go to BSidesLisbon

——–

The last 2 years have seen greater advances in automated security testing than the 10 before it. AFL engineered known best practices into an easy-to-use tool, the DARPA Cyber Grand Challenge provided a reliable competitive benchmark and funding for new research, and Project Springfield (aka SAGE) is now available to the public. The common availability of these new technologies has the potential for massive impact on our industry.

How do these tools work and what sets them apart from past approaches? Where do they excel and what are their limitations? How can I use these tools today? How will these technologies advance and what further developed is needed? And finally, how much longer do humans have as part of the secure development lifecycle?

See the slides in full here.

References

Original fuzzing project assignment from UW-Madison (1988)
http://pages.cs.wisc.edu/~bart/fuzz/CS736-Projects-f1988.pdf

PROTOS – systematic approach to eliminate software vulnerabilities (2002)
https://www.ee.oulu.fi/roles/ouspg/PROTOS_MSR2002-protos

The Advantages of Block-Based Protocol Analysis for Security Testing (2002)
http://www.immunitysec.com/downloads/advantages_of_block_based_analysis.html

DART: Directed Automated Random Testing (2005)
https://wkr.io/public/ref/godefroid2005dart.pdf

EXE: Automatically Generating Inputs of Death (2006)
https://web.stanford.edu/~engler/exe-ccs-06.pdf

EXE: 10 years later (2016)
https://ccadar.blogspot.com/2016/11/exe-10-years-later.html

Automated Whitebox Fuzz Testing (2008)
https://patricegodefroid.github.io/public_psfiles/ndss2008.pdf

American Fuzzy Lop (AFL)
http://lcamtuf.coredump.cx/afl/

DARPA Cyber Grand Challenge Competitor Portal (2013)
http://archive.darpa.mil/CyberGrandChallenge_CompetitorSite/

Exploitation and state machines (2011)
http://archives.scovetta.com/pub/conferences/infiltrate_2011/Fundamentals_of_exploitation_revisited.pdf

Your tool works better than mine? Prove it. (2016)
https://blog.trailofbits.com/2016/08/01/your-tool-works-better-than-mine-prove-it/

Microsoft Springfield (2016)
https://www.microsoft.com/en-us/springfield/

Google OSS-Fuzz (2016)
https://github.com/google/oss-fuzz

LLVM libFuzzer
http://llvm.org/docs/LibFuzzer.html

GRR – High-throughput fuzzer and emulator of DECREE binaries
https://github.com/trailofbits/grr

Manticore – A Python symbolic execution platform
https://github.com/trailofbits/manticore

McSema – x86 to machine code translation framework
https://github.com/trailofbits/mcsema

DARPA Challenge Sets for Linux, macOS, and Windows
https://github.com/trailofbits/cb-multios

Trail of Bits publications about the Cyber Grand Challenge
https://blog.trailofbits.com/category/cyber-grand-challenge/

Errata

  • The University of Oulu is in Finland.
  • The University of Wisconsin assigned homework in fuzzing in 1988.
  • SV-Comp is for software verification. ML competitions exist too.

Devirtualizing C++ with Binary Ninja

In my first blog post, I introduced the general structure of Binary Ninja’s Low Level IL (LLIL), as well as how to traverse and manipulate it with the Python API. Now, we’ll do something a little more interesting.

Reverse engineering binaries compiled from object-oriented languages can be challenging, particularly when it comes to virtual functions. In C++, invoking a virtual function involves looking up a function pointer in a virtual table (vtable) and then making an indirect call. In the disassembly, all you see is something like mov rax, [rcx+0x18]; call rax. If you want to know what function it will call for a given class object, you have to find the virtual table and then figure out which function pointer is at that offset.

Or you could use this plugin!

An Example Plugin: Navigating to a Virtual Function

vtable-navigator.py is an example plugin that will navigate to a given class’s virtual function from a call instruction. First, the plugin uses the LLIL to identify a specified class’s vtable when it is referenced in a constructor. Next, it will preprocess the call instruction’s basic block to track register assignments and their corresponding LLIL expressions. Finally, the plugin will process the LowLevelILInstruction object of the call and calculate the offset of the function to be called by recursively visiting register assignment expressions.

Discovering the vtable Pointer

vtable-diagram

Figure 1: Two classes inherit from a base virtual class. Each class’s virtual table points to its respective implementation of the virtual functions.

In the simplest form, a class constructor stores a pointer to its vtable in the object’s structure in memory. The two most common ways that a vtable pointer can be stored are either directly referencing a hardcoded value for the vtable pointer or storing the vtable pointer in a register, then copying that register’s value into memory. Thus, if we look for a write to a memory address from a register with no offset, then it’s probably the vtable.

constructor-example

An example constructor. The highlighted instruction stores a vtable in the object’s structure.

We can detect the first kind of vtable pointer assignment by looking for an LLIL instruction in the constructor’s LowLevelILFunction object (as described in Part 1) that stores a constant value to a memory address contained in a register.

According to the API, an LLIL_STORE instruction has two operands: dest and src. Both are LLIL expressions. For this case, we are looking for a destination value provided by a register, so dest should be an LLIL_REG expression. The value to be stored is a constant, so src should be an LLIL_CONST expression. If we match this pattern, then we assume that the constant is a vtable pointer, read the value pointed to by the constant (i.e. il.src.value), and double check that there’s a function pointer there, just to make sure it’s actually a vtable.

# If it's not a memory store, then it's not a vtable.
if il.operation != LowLevelILOperation.LLIL_STORE:
    continue

# vtable is referenced directly
if (il.dest.operation == LowLevelILOperation.LLIL_REG and
        il.src.operation == LowLevelILOperation.LLIL_CONST):
    fp = read_value(bv, il.src.value, bv.address_size)

    if not bv.is_offset_executable(fp):
        continue

    return il.src.value

Pretty straight forward, but let’s look at the second case, where the value is first stored in a register.

For this case, we search for instructions where the dest and src operands of an LLIL_STORE are both LLIL_REG expressions. Now we need to determine the location of the virtual table based only on the register.

This is where things get cool. This situation not only demonstrates usage of the LLIL, but how powerful the dataflow analysis performed on the LLIL can be. Without dataflow analysis, we would have to parse this LLIL_STORE instruction, figure out which register is being referenced, and then step backwards to find the last value assigned to that register. With the dataflow analysis, the register’s current value is readily available with a single call to get_reg_value_at_low_level_il_instruction.

# vtable is first loaded into a register, then stored
if (il.dest.operation == LowLevelILOperation.LLIL_REG and
        il.src.operation == LowLevelILOperation.LLIL_REG):
    reg_value = src_func.get_reg_value_at_low_level_il_instruction(
        il.instr_index, il.src.src
    )

    if reg_value.type == RegisterValueType.ConstantValue:
        fp = read_value(bv, reg_value.value, bv.address_size)

        if not bv.is_offset_executable(fp):
            continue

    return reg_value.value

Propagation of Register Assignments

Now that we know the location of the vtable, let’s figure out which offset is called. To determine this value, we need to trace back through the state of the program from the call instruction to the moment the vtable pointer is retrieved from memory, calculate the offset into the virtual table, and discover which function is being called. We accomplish this tracing by implementing a rudimentary dataflow analysis that preprocesses the basic block containing the call instruction. This preprocessing step will let us query the state of a register at any point in the basic block.

def preprocess_basic_block(bb):
    defs = {}
    current_defs = {}

    for instr in bb:
        defs[instr.instr_index] = copy(current_defs)

        if instr.operation == LowLevelILOperation.LLIL_SET_REG:
            current_defs[instr.dest] = instr

        elif instr.operation == LowLevelILOperation.LLIL_CALL:
            # wipe out previous definitions since we can't
            # guarantee the call didn't modify registers.
            current_defs.clear()

    return defs

At each instruction of the basic block, we keep a table of register states. As we iterate over each LowLevelILInstruction, this table is updated when an LLIL_SET_REG operation is encountered. For each register tracked, we store the LowLevelILInstruction responsible for changing its value. Later, we can query this register’s state and retrieve the LowLevelILInstruction and recursively query the value of the src operand, which is the expression the register currently represents.

Additionally, if an LLIL_CALL operation is encountered, then we clear the register state from that point on. The called function might modify the registers, and so it is safest to assume that all registers after the call have unknown values.

Now we have all the data that we need to model the vtable pointer dereference and calculate the virtual function offset.

Calculating the Virtual Function Offset

Before diving into the task of calculating the offset, let’s consider how we can model the behavior. Looking back at Figure 1, dispatching a virtual function can be generalized into four steps:

  1. Read a pointer to the vtable from the object’s structure in memory (LLIL_LOAD).
  2. Add an offset to the pointer value, if the function to be dispatched is not the first function (LLIL_ADD).
  3. Read the function pointer at the calculated offset (LLIL_LOAD).
  4. Call the function (LLIL_CALL).

Dispatching a virtual function can therefore be modeled by evaluating the src operand expression of the LLIL_CALL instruction, recursively visiting each expression. The base case of the recursion is reached when the LLIL_LOAD instruction of step 1 is encountered. The value of that LLIL_LOAD is the specified vtable pointer. The vtable pointer value is returned and propagates back through the previous iterations of the recursion to be used in those iterations’ evaluations.

Let’s step through the evaluation of an example, to see how the model works and how it is implemented in Python. Take the following virtual function dispatch in x86.

mov eax, [ecx] ; retrieve vtable pointer
call [eax+4]   ; call the function pointer at offset 4 of the vtable

This assembly would be translated into the following LLIL.

0: eax = [ecx].d
1: call ([eax + 4].d)

Building out the trees for these two LLIL instructions yields the following structures.

vtable-dispatch-llil

Figure 2: LLIL tree structures for the example vtable dispatch assembly.

The src operand of the LLIL_CALL is an LLIL_LOAD expression. We evaluate the src operand of the LLIL_CALL with a handler based on its operation.

# This lets us handle expressions in a more generic way.
# operation handlers take the following parameters:
#   vtable (int): the address of the class's vtable in memory
#   bv (BinaryView): the BinaryView passed into the plugin callback
#   expr (LowLevelILInstruction): the expression to handle
#   current_defs (dict): The current state of register definitions
#   defs (dict): The register state table for all instructions
#   load_count (int): The number of LLIL_LOAD operations encountered
operation_handler = defaultdict(lambda: (lambda *args: None))
operation_handler[LowLevelILOperation.LLIL_ADD] = handle_add
operation_handler[LowLevelILOperation.LLIL_REG] = handle_reg
operation_handler[LowLevelILOperation.LLIL_LOAD] = handle_load
operation_handler[LowLevelILOperation.LLIL_CONST] = handle_const

Therefore, the first iteration of our recursive evaluation of this virtual function dispatch is a call to handle_load.

def handle_load(vtable, bv, expr, current_defs, defs, load_count):
    load_count += 1

    if load_count == 2:
        return vtable

    addr = operation_handler[expr.src.operation](
        vtable, bv, expr.src, current_defs, defs, load_count
    )
    if addr is None:
        return

    # Read the value at the specified address.
    return read_value(bv, addr, expr.size)

handle_load first increments a count of LLIL_LOAD expressions encountered. Recall that our model for dereferencing a vtable expects two LLIL_LOAD instructions: the vtable pointer, then the function pointer we want. Tracing backwards through the program state means that we will encounter the load of the function pointer first and the load for the vtable pointer second. The count is 1 at the moment, so the recursion should not yet be terminated. Instead, the src operand of the LLIL_LOAD is recursively evaluated by a handler function for the src expression. When this call to a handler completes, addr should contain the address that points to the function pointer to be dispatched. In this case, src is an LLIL_ADD, so handle_add is called.

def handle_add(vtable, bv, expr, current_defs, defs, load_count):
    left = expr.left
    right = expr.right

    left_value = operation_handler[left.operation](
        vtable, bv, left, current_defs, defs, load_count
    )

    right_value = operation_handler[right.operation](
        vtable, bv, right, current_defs, defs, load_count
    )

    if None in (left_value, right_value):
        return None

    return left_value + right_value

handle_add recursively evaluates both the left and right sides of the LLIL_ADD expression and returns the sum of these values back to its caller. In our example, the left operand is an LLIL_REG expression, so handle_reg is called.

def handle_reg(vtable, bv, expr, current_defs, defs, load_count):
    # Retrieve the LLIL expression that this register currently
    # represents.
    set_reg = current_defs.get(expr.src, None)
    if set_reg is None:
        return None

    new_defs = defs.get(set_reg.instr_index, {})

    return operation_handler[set_reg.src.operation](
        vtable, bv, set_reg.src, new_defs, defs, load_count
    )

This is where our dataflow analysis comes into play. Using the current register state, as described by current_defs, we identify the LLIL expression that represents the current value of this LLIL_REG expression. Based on the example above, current_defs[‘eax’] would be the expression [ecx].d. This is another LLIL_LOAD expression, so handle_load is called again. This time, load_count is incremented to 2, which meets the base case. If we assume that in our example the user chose a class whose constructor is located at 0x1000, then handle_load will return the value 0x1000.

With the left operand evaluated, it is now time for handle_add to evaluate the right operand. This expression is an LLIL_CONST, which is very easy to evaluate; we just return the value operand of the expression. With both left and right operands evaluated, handle_add returns the sum of the expression, which is 0x1004. handle_load receives the return value from handle_add, then reads the function pointer located at that address from the BinaryView. We can then change the currently displayed function by calling bv.file.navigate(bv.file.view, function_pointer) in the BinaryView object.

Returning to the LLIL tree structures earlier, we can annotate the structures to visualize how the recursion and concrete data propagation happens.

annotated-vtable-dispatch-llil

Figure 3: LLIL tree structures for the example vtable dispatch assembly, annotated with handler calls and propagation of concrete values back up the call chain.

Example: Rectangles and Triangles

For a real world example, I used a slightly modified version of this C++ tutorial, which you can find here. Here’s a demonstration of the plugin in action:

If you compile virtual-test.cpp for both x86-64 and ARM and open the binaries in Binary Ninja with the plugin installed, you will find that it will work on both architectures without needing any architecture-specific code. That’s the beauty of an intermediate representation!

Go Forth and Analyze

Binary Ninja’s LLIL is a powerful feature that enables cross-platform program analysis to be easily developed. As we’ve seen, its structure is simple, yet allows for representation of even the most complex instructions. The Python API is a high quality interface that we can use effectively to traverse instructions and process operations and operands with ease. More importantly, we’ve seen simple examples of how dataflow analysis, enabled by the LLIL, can allow us to develop cross-platform plugins to perform program analysis without having to implement complicated heuristics to calculate program values. What are you waiting for? Pick up a copy of Binary Ninja and start writing your own binary analysis with the LLIL, and don’t forget to attend Sophia’s presentation of “Next-level Static Analysis for Vulnerability Research” using the Binary Ninja LLIL at INFILTRATE 2017.

Breaking Down Binary Ninja’s Low Level IL

Hi, I’m Josh. I recently joined the team at Trail of Bits, and I’ve been an evangelist and plugin writer for the Binary Ninja reversing platform for a while now. I’ve developed plugins that make reversing easier and extended Binary Ninja’s architecture support to assist in playing the microcorruption CTF. One of my favorite features of Binary Ninja is the Low Level IL (LLIL), which enables development of powerful program analysis tools. At Trail of Bits, we have used the LLIL to automate processing of a large number of CTF binaries, as well as automate identifying memory corruptions.

I often get asked how the LLIL works. In this blog post, I answer common questions about the basics of LLIL and demonstrate how to use the Python API to write a simple function that operates on the LLIL. In a future post, I will demonstrate how to use the API to write plugins that use both the LLIL and Binary Ninja’s own dataflow analysis.

What is the Low Level IL?

Compilers use an intermediate representation (IR) to analyze and optimize the code being compiled. This IR is generated by translating the source language to a single standard language understood by the components of the toolchain. The toolchain components can then perform generic tasks on a variety of architectures without having to implement those tasks individually.

Similarly, Binary Ninja not only disassembles binary code, but also leverages the power of its own IR, called Low Level IL, in order to perform dataflow analysis. The dataflow analysis makes it possible for users to query register values and stack contents at arbitrary instructions. This analysis is architecture-agnostic because it is performed on the LLIL, not the assembly. In fact, I automatically got this dataflow analysis for free when I wrote the lifter for the MSP430 architecture.

Let’s jump right in and see how the Low Level IL works.

Viewing the Low Level IL

Within the UI, the Low Level IL is viewable only in Graph View. It can be accessed either through the “Options” menu in the bottom right corner, or via the i hotkey. The difference between IL View and Graph View is noticeable; the IL View looks much closer to a high level language with its use of infix notation. This, combined with the fact that the IL is a standardized set of instructions that all architectures are translated to, makes working with an unfamiliar language easy.

arm-x64-il-comparison.png

Graph View versus IL View; on the left, Graph View of ARM (top) and x86-64 (bottom) assembly of the same function. On the right, the IL View of their respective Graph Views.

If you aren’t familiar with this particular architecture, then you might not easily understand the semantics of the assembly code. However, the meaning of the LLIL is clear. You might also notice that there are often more LLIL instructions than there are assembly instructions. The translation of assembly to LLIL is actually a one-to-many rather than one-to-one translation because the LLIL is a simplified representation of an instruction set. For example, the x86 repne cmpsb instruction will even generate branches and loops in the LLIL:

repne cmpsb in LLIL

Low Level IL representation of the x86 instruction repne cmpsb

How is analysis performed on the LLIL? To figure that out, we’ll first dive into how the LLIL is structured.

Low Level IL Structure

According to the API documentation, LLIL instructions have a tree-based structure. The root of an LLIL instruction tree is an expression consisting of an operation and zero to four operands as child nodes. The child nodes may be integers, strings, arrays of integers, or another expression. As each child expression can have its own child expressions, an instruction tree of arbitrary order and complexity can be built. Below are some example expressions and their operands:

Operation Operand 1 Operand 2 Operand 3 Operand 4
LLIL_NOP
LLIL_SET_REG dest: string or integer src: expression
LLIL_LOAD src: expression
LLIL_CONST value: integer
LLIL_IF condition: expression true: integer false: integer
LLIL_JUMP_TO dest: expression targets: array of integers

Let’s look at a couple examples of lifted x86, to get a better understanding of how these trees are generated when lifting an instruction: first, a simple mov instruction, and then a more complex lea instruction.

Example: mov eax, 2

IL tree for mov eax, 2

LLIL tree for mov eax, 2

This instruction has a single operation, mov, which is translated to the LLIL expression LLIL_SET_REG. The LLIL_SET_REG instruction has two child nodes: dest and src. dest is a reg node, which is just a string representing the register that will be set. src is another expression representing how the dest register will be set.

In our x86 instruction, the destination register is eax, so the dest child is just eax; easy enough. What is the source expression? Well, 2 is a constant value, so it will be translated into an LLIL_CONST expression. An LLIL_CONST expression has a single child node, value, which is an integer. No other nodes in the tree have children, so the instruction is complete. Putting it all together, we get the tree above.

Example: lea eax, [edx+ecx*4]

IL tree for lea eax, [edx+ecx*4]

LLIL tree for lea eax, [edx+ecx*4]

The end result of this instruction is also to set the value of a register. The root of this tree will also be an LLIL_SET_REG, and its dest will be eax. The src expression is a mathematical expression consisting of an addition and multiplication…or is it?

If we add parenthesis to explicitly define the order of operations, we get (edx + (ecx * 4)); thus, the root of the src sub-tree will be an LLIL_ADD expression, which has two child nodes: left and right, both of which are expressions. The left side of the addition is a register, so the left expression in our tree will be an LLIL_REG expression. This expression only has a single child. The right side of the addition is our multiplication, but the multiplier in an lea instruction has to be a power of 2, which can be translated to a left-shift operation, and that’s exactly what the lifter does: ecx * 4 becomes ecx << 2. So, the right expression in the tree is actually an LLIL_LSL expression (Logical Shift Left).

The LLIL_LSL expression also has left and right child expression nodes. For our left-shift operation, the left side is the ecx register, and the right side is the constant 2. We already know that both LLIL_REG and LLIL_CONST terminate with a string and integer, respectively. With the tree complete, we arrive at the tree presented above.

Now that we have an understanding of the structure of the LLIL, we are ready to dive into using the Python API. After reviewing features of the API, I will demonstrate a simple Python function to traverse an LLIL instruction and examine its tree structure.

Using the Python API

There are a few important classes related to the LLIL in the Python API: LowLevelILFunction, LowLevelILBasicBlock, and LowLevelILInstruction. There are a few others, like LowLevelILExpr and LowLevelILLabel, but those are more for writing a lifter rather than consuming IL.

Accessing Instructions

To begin playing with the IL, the first step is to get a reference to a function’s LLIL. This is accomplished through the low_level_il property of a Function object. If you’re in the GUI, you can get the LowLevelILFunction object for the currently displayed function using current_function.low_level_il.

The LowLevelILFunction class has a lot of methods, but they’re basically all for implementing a lifter, not performing analysis. In fact, this class is really only useful for retrieving or enumerating basic blocks and instructions. The __iter__ method is implemented and iterates over the basic blocks of the LLIL function, and the __getitem__ method is implemented and retrieves an LLIL instruction based on its index. The LowLevelILBasicBlock class also implements __iter__, which iterates over the individual LowLevelILInstruction objects belonging to that basic block. Therefore, it is possible to iterate over the instructions of a LowLevelILFunction two different ways, depending on your needs:

il = current_function.low_level_il

# iterate over instructions using basic blocks
for bb in il.basic_blocks:
  for instruction in bb:
    print instruction

# iterate over instructions directly
for index in range(len(il)):
  instruction = il[index]
  print instruction

Directly accessing an instruction is currently cumbersome. In Python, this is accomplished with function.low_level_il[function.get_low_level_il_at(function.arch, address)]. That’s a pretty verbose line of code. This is because the Function.get_low_level_il_at() method doesn’t actually return a LowLevelILInstruction object; it returns the integer index of the LLIL instruction. Hopefully it will be more concise in an upcoming refactor of the API.

Parsing Instructions

The real meat of the LLIL is exposed in LowLevelILInstruction objects. The common members shared by all instructions allow you to determine:

  • The containing function of the LLIL instruction
  • The address of the assembly instruction lifted to LLIL
  • The operation of the LLIL instruction
  • The size of the operation (i.e. is this instruction manipulating a byte/short/long/long long)

As we saw in the table above, the operands vary by instruction. These can be accessed sequentially, via the operands member, or directly accessed by operand name (e.g. dest, left, etc). When accessing operands of an instruction that has a destination operand, the dest operand will always be the first element of the list.

Example: A Simple Recursive Traversal Function

A very simple example of consuming information from the LLIL is a recursive traversal of a LowLevelILInstruction. In the example below, the operation of the expression of an LLIL instruction is output to the console, as well as its operands. If an operand is also an expression, then the function traverses that expression as well, outputting its operation and operands in turn.

def traverse_IL(il, indent):
  if isinstance(il, LowLevelILInstruction):
    print '\t'*indent + il.operation.name

    for o in il.operands:
      traverse_IL(o, indent+1)

  else:
    print '\t'*indent + str(il)

After copy-pasting this into the Binary Ninja console, select any instruction you wish to output the tree for. You can then use bv, current_function, and here to access the current BinaryView, the currently displayed function’s Function object, and the currently selected address, respectively. In the following example, I selected the ARM instruction ldr r3, [r11, #-0x8]:

traverse-il

Lifted IL vs Low Level IL

While reviewing the API, you might notice that there are function calls such as Function.get_lifted_il_at versus Function.get_low_level_il_at. This might make you unsure of which you should be processing for your analysis. The answer is fairly straight-forward: with almost no exceptions, you will always want to work with Low Level IL.

Lifted IL is what the lifter first generates when parsing the executable code; an optimized version is what is exposed as the Low Level IL to the user in the UI. To demonstrate this, try creating a new binary file, and fill it with a bunch of nop instructions, followed by a ret. After disassembling the function, and switching to IL view (by pressing i in Graph View), you will see that there is only a single IL instruction present: jump(pop). This is due to the nop instructions being optimized away.

It is possible to view the Lifted IL in the UI: check the box in Preferences for “Enable plugin development debugging mode.” Once checked, the “Options” tab at the bottom of the window will now present two options for viewing the IL. With the previous example, switching to Lifted IL view will now display a long list of nop instructions, in addition to the jump(pop).

In general, Lifted IL is not something you will need unless you’re developing an Architecture plugin.

Start Using the LLIL

In this blog post, I described the fundamentals of Binary Ninja’s Low Level IL, and how the Python API can be used to interact with it. Around the office, Ryan has used the LLIL and its data flow analysis to solve 2000 CTF challenge binaries by identifying a buffer to overflow and a canary value that had to remain intact in each. Sophia will present “Next-level Static Analysis for Vulnerability Research” using the Binary Ninja LLIL at INFILTRATE 2017, which everyone should definitely attend. I hope this guide makes it easier to write your own plugins with Binary Ninja!

In Part 2 of this blog post, I will demonstrate the power of the Low Level IL and its dataflow analysis with another simple example. We will develop a simple, platform-agnostic plugin to navigate to virtual functions by parsing the LLIL for an object’s virtual method table and calculating the offset of the called function pointer. This makes reversing the behavior of C++ binaries easier because instructions such as call [eax+0x10] can be resolved to a known function like object->isValid(). In the meantime, get yourself a copy of Binary Ninja and start using the LLIL.

Update (11 February 2017): A new version of Binary Ninja was released on 10 February 2017; this blog post has been updated to reflect changes to the API.

2016 Year in Review

John Oliver may have written off 2016, but we’re darn proud of all that we accomplished and contributed this year.

We released a slew of the security tools that help us -and you- work smarter, and promoted a few more that deserved recognition. We helped the New York City InfoSec community build a foundation for future growth. Perhaps most importantly, we weighed in when we believed the record needed to be set straight.

Here are 14 reasons we’re counting 2016 as a success and feeling good about 2017.

1. Brought automatic bug discovery to market

2016 will go down in history as the year that software began finding and patching vulnerabilities automatically. Our Cyber Reasoning System (CRS), built to compete in DARPA’s Cyber Grand Challenge, made history on its own when it audited zlib. As far as we know, our CRS was the first ever to audit a much larger amount of code in less time, in greater detail, and at a lower cost than a human could. Read the audit report and Mozilla’s announcement.

In January, we used our CRS to help settle a $1,000 bet about libotr, a popular library used in secure messaging software. Discover our insights about the challenges encrypted communications systems present for automated testing, how we solved them, and our testing methodology. And find out who won the bet.

Our CRS is available for commercial engagements, but we’ve open sourced one of its companion tools: GRR, a high-throughput fuzzer built specifically for the CRS. Read about the challenges we overcame while designing and building GRR.

Released And Reviewed Foundational Tools

2. Created a standardized benchmark suite for security tools

DARPA released the source code for over 100 challenge programs used in the Cyber Grand Challenge (CGC). The CGC challenge programs are realistic programs that contain documented vulnerabilities. Unfortunately, the programs don’t run on Windows, Linux, or macOS. We fixed this. We ported and packaged the challenge programs into a cross-platform benchmark suite. Now researchers in academia and industry can reliably evaluate and compare their program analysis tools and bug mitigations.

3. Ported Facebook’s osquery to Windows

Facebook’s osquery allows you to easily ask security and performance questions about your endpoint infrastructure. Until earlier this year, it was only available for macOS and Linux. We ported osquery to Windows, overcoming numerous technical challenges: completely different process and security models, incompatible APIs, and divergent compiler behavior. The port was worth the effort. Before, similar functionality would require cobbling together manual solutions or expensive and proprietary commercial products. Now, there’s a better option.

4. Released Algo, a secure, user-friendly, and free VPN

We built Algo, a self-hosted VPN server designed for ease of deployment and security. Algo servers are not shared with other users, only use modern protocols and ciphers, and only include only the minimal software you need. Since Algo deploys to nearly all popular cloud computing platforms, it provides an effectively unlimited set of egress locations. For anyone who is privacy conscious, travels for work frequently, or can’t afford a dedicated IT department, this one’s for you.

5. Showed how to automatically generate exploits with Binary Ninja

Ryan showed how to use Vector35’s Binary Ninja, a promising new interactive static analysis and reverse engineering platform, to generate exploits for 2,000 unique binaries in this year’s DEFCON CTF qualifying round. Its feature-rich and accessible APIs beat all competing products. Had he used IDA or radare2, Ryan would have had to divert his time and attention to implementing a stack model or using fragile heuristics instead of focusing on the true goal of the CTF: exploiting binaries. Read his full review.

6. Released Protofuzz, a protobuf fuzzer

Protofuzz helps find bugs in applications utilizing Google Protocol Buffers (protobuf). Applications use protobufs to specify the structure of messages, which are passed between processes or across networks. Protobufs automate the error-prone task of generating message serialization and deserialization code. Typical fuzzers that rely on random permutation cannot explore past the correct, auto-generated code. Protofuzz creates valid protobuf-encoded structures composed of malicious values — ensuring that permuted data passes through the hard auto-generated shell into the soft underbelly of the target program.

7. Made iOS’s Secure Enclave usable with Tidas

Apple’s Secure Enclave Crypto API promised to liberate us from the annoyance of passwords and using your Google account to log into Pokemon Go. But it was unusable in its original state. So, we filled the gap with Tidas, a simple SDK drop-in for iOS apps that provides cryptographically proven — and passwordless — authentication. The introduction of the T1 chip and TouchID on new MacBook Pros opens exciting new potential for Tidas on macOS. If you dream of a passwordless future, check out our DIY toolkit in Swift or Objective-C.

Shared With The Community

8. Explained Control Flow Integrity, and how to use it to mitigate exploits

Control Flow Integrity (CFI) prevents bugs from becoming exploits. Before indirect control flow transfers, CFI validates that the target of the flow belongs to a pre-determined set of valid targets. Unfortunately, instructions for using CFI are hard to find and very confusing. We fixed the problem. We published two blog posts that describe how CFI prevents exploitation, and how to use it with clang and Visual Studio. We also provided working examples showing how CFI protects Linux, MacOS, and Windows applications.

9. Pulled back the veil on PointsTo, our whole-program static analysis tool

Dynamic program analysis tools like AddressSanitizer and Valgrind can tell developers when running code accesses uninitialized memory, leaks memory, or uses memory after it’s been freed. Despite this, memory bugs are still shipped and exploited in the wild. That’s because bugs have a nasty habit of hiding behind rarely executed code paths, and dynamic analysis tools can’t check every possible program path. PointsTo can. It’s a whole-program static analysis tool that we developed and use to find and report on potential use-after-free bugs in large codebases. Read about PointsTo.

10. Continued to enrich NYC’s InfoSec community

Our bi-monthly meetup –Empire Hacking– really picked up steam in 2016. Membership passed 500 people; 100 of whom regularly attend our meetings. We heard 14 superb presentations on pragmatic security research and new discoveries in attack and defense. Many thanks to our hosts: Spotify, Two Sigma, DigitalOcean, and WeWork!

Everyone is welcome at Empire Hacking, but it may not be for everyone. So we put together nyc-infosec.com, a directory of all of the gatherings, companies, and university programs in NYC that we could find. Our goals with nyc-infosec.com are to promote collaboration, to elevate NYC’s community to its rightful place on the InfoSec ‘stage,’ and to help researchers and practitioners benefit from each other’s work.

One local event even earned our sponsorship: O’Reilly’s Security Conference.

11. Hired three interns for meaningful work

This winter we are once again giving college students paid internships to work on interesting security problems. We work with each student to create a project that is interesting both for them and beneficial for us, and provide them with resources and mentorship to make progress. This year, our interns are working on applying machine learning to fuzzing, porting challenge binaries to Windows, and improving our program analysis tools. We’ll have a post summarizing their work when it’s done; meanwhile, you can read about our past interns’ experience. We’ll also be hiring interns for the summer: contact us if you are interested.

12. Delivered 9 new talks at 13 separate conferences

When we can, we share the science that fuels our work – the successes and the failures. This year we spoke with an exceptional number of people at conferences all over the world.

Spoke The Truth

13. Called out Verizon for publishing bad data

Verizon’s Data Breach Investigations Report (DBIR) represents a collaborative effort involving 60+ organizations’ proprietary data. It’s the single best source of information for enterprise defenders, which is why it was a travesty that the report’s section on vulnerabilities used in data breaches contained misleading data, analysis, and recommendations. We chided Verizon and Kenna -the section’s contributor- and offered suggestions that would improve the quality of future DBIR. In response to criticism, Verizon and Kenna posted mea culpas on their blogs.

14. Clarified the technical details of the Apple-FBI standoff

In February, a federal judge ordered Apple to help the FBI recover encrypted data from the San Bernardino gunman’s iPhone 5C. Many argued the FBI’s request was technically infeasible given the support for strong encryption on iOS devices. Based on our reading of the request and knowledge of iOS, we explained why Apple could comply with the FBI in this instance. (If the iPhone had had a Secure Enclave, it would have been much harder.) For more detail, listen to Dan’s interview on the Risky Business podcast.

2017 won’t know what hit it

This year, we are looking forward to publicizing more of our research, continuing our commitment to our open source projects, and releasing more of our internal tools. We will:

  • Release Manticore, a ruthlessly effective hybrid symbolic-concrete (“concolic”) execution system that scales to large programs with numerous dependencies, complex interaction, and manual setup.
  • Add ARM support to McSema so we can lift binaries from all kinds of embedded systems such as hard drive firmware, phones, and IoT devices.
  • Publicly release a tool that combines a set of LLVM passes to detect side-channel vulnerabilities in sensitive codebases.
  • Sponsor Infiltrate 2017 and attend en masse. We really appreciate the forum they provide for a focused, quality review of the techniques attackers use to break systems. It’s a service to the community. We’re happy to support it.
  • Deliver a project inspired by Mr. Robot. Yeah, the TV show. More on that soon.

Let’s talk about CFI: Microsoft Edition

We’re back with our promised second installment discussing control flow integrity. This time, we will talk about Microsoft’s implementation of control flow integrity. As a reminder, control flow integrity, or CFI, is an exploit mitigation technique that prevents bugs from turning into exploits. For a more detailed explanation, please read the first post in this series.

Security researchers should study products that people use, and Microsoft has an overwhelming share of the desktop computing market. New anti-exploitation measures in Windows and Visual Studio are a big deal. These can and do directly impact a very large number of people.

For the impatient who want to know about control flow guard right now: add /guard:cf to both your compiler and linker flags, and take a look at our examples showing what CFG does and does not do.

Microsoft’s CFI

Microsoft’s implementation of CFI is called Control Flow Guard (CFG), and it requires both operating system and compiler support. The minimum supported operating system is Windows 8.1 Update 3 and the minimum compiler version is Visual Studio 2015 (VS2015 Update 3 is recommended). All the examples in this blog post use Visual Studio 2015 and Windows 10 on x86-64.

Control Flow Guard is very well documented — there is an official documentation page, the documentation for the compiler option, and even a blog post from when the feature was in development. CFG is a very straightforward implementation of CFI:

  • First, the compiler identifies all indirect branches in a program
  • Next, it determines which branches must be protected. For instance, indirect branches that have a statically identifiable target don’t need CFI checks.
  • Finally, the compiler inserts lightweight checks at potentially vulnerable branches to ensure the branch target is a valid destination.

As in the previous blog post, we will not explore the technical implementation of CFG. There is already plenty of excellent literature on the subject. Instead this blog post will focus on how to use CFG in your programs, and show what CFG does and does not protect. However, we will mention some important differences between CFG and Clang’s CFI implementation.

Comparing CFG with Clang’s CFI

This comparison is meant to show the differences between how each implementation translates theoretical ideas behind control flow integrity into shipping application protection mechanisms. Neither implementation is better or worse than the other; they target different software ecosystems. Each works within real-world constraints (e.g. source availability, performance, ease of use, API/ABI stability, backwards compatibility, etc.) to achieve meaningful software protection.

What’s protected?

Programs protected with Microsoft’s CFG or Clang’s CFI execute lightweight checks before indirect control flow transfers. The check validates that the target of the flow belongs to a pre-determined set of valid targets.

Windows programs have many indirect calls that cannot be hijacked. For instance, API calls are performed via an indirect call through the IAT, which is set to read-only after program load. The Visual Studio compiler safely omits CFG checks for these calls.

Clang’s CFI also includes checks that are not exactly CFI related, such as runtime validation of pointer casts. See the previous blog post for more details and examples.

What is a valid target?

Control Flow Guard has a single per-process mapping of all valid control flow targets. Anything in the mapping is considered a valid target (Figure 1b). CFG provides a way to adjust the valid target map at runtime, via the the aptly named SetProcessValidCallTargets API. This is especially helpful when dealing with JITted code or manually loading dynamic libraries.

CFG also provides three compiler directives that control CFG behavior in a specified method. These directives are defined in ntdef.h in the Windows SDK, but not well documented. We would like to thank Matt Miller from Microsoft for explaining what they do:

  • __declspec(guard(ignore)) will disable CFG checks for all indirect calls inside a method, and ignore any function pointers referenced in the method.
  • __declspec(guard(nocf)) will disable CFG checks for all indirect calls inside a method, but track any function pointers referenced in the method and add those functions to the valid destination map.
  • __declspec(guard(suppress)) will prevent an exported function from being a valid CFG destination. This is used to prevent security sensitive functions from being called indirectly (for instance, SetProcessValidCallTargets is protected in this way).

Clang’s CFI is more fine grained in its protection. The target of each indirect control flow transfer must match an expected type signature (Figure 1a). Depending on the options enabled, calls to class member functions are also verified to be within the proper class hierarchy. Effectively, there is a valid target mapping per type signature and per class hierarchy. The target sets are fixed at compile time and cannot be changed.

Figure 1: Differences in the valid call targets for the cfg_icall example. The true valid destination is in green, and everything else is in red.
clang_valid_dests vs_valid_dests
(a) Valid destinations at the indirect call for Clang’s CFI. Only functions matching the expected function signature are in the list. (b) Valid destinations at the indirect call for CFG using Visual Studio 2015. Every legal function entry point is in the list.

How is protection enforced?

Control Flow Guard splits enforcement duties between the compiler and the operating system. The compiler inserts the checks and provides an initial valid target set, and the operating system maintains the target set and verifies destinations.

Clang’s CFI does all enforcement at the compiler level; the operating system is not aware of CFI.

What about dynamic libraries, JITed code, and other edge cases?

Control Flow Guard supports cross-library calls, but enforcement only occurs if the library is also compiled with Control Flow Guard. Dynamically generated code pages can be added to or excluded from the valid target map. External functions retrieved via GetProcAddress are always valid call targets*.

Clang’s CFI supports cross-library calls via the -fsanitize-cfi-cross-dso flag. Both the library and the application must be compiled with this flag. As far as we can tell, dynamically generated code does not receive CFI protection. External functions retrieved via dlsym are automatically added as a valid target when -fsanitize-cfi-cross-dso is used, otherwise these calls trigger a CFI violation.

* The exception to this rule are functions protected with __declspec(guard(suppress)). These functions must be linked via the import table or they will not be callable.

Using CFI with Visual Studio 2015

Using control flow guard with Visual Studio is extremely simple. There is a fantastic documentation page on the MSDN website that describes how to enable CFG, both via the GUI and via the command line. The quick and summary: add /guard:cf to you compiler and linker flags. That’s it.

There are a few caveats, which are only applicable if you are going to dynamically adjust valid indirect call targets via SetProcessValidCallTargets. First, you will need a new-ish version of the Windows SDK. The version that came by default with our Visual Studio 2015 install didn’t have the proper definitions, we had to install the latest (as of this writing) version 10.0.14393.0. Second, you must set the SDK to target Windows 10 (#define _WIN32_WINNT 0x0A00). Third, you must link with mincore.lib, as it includes the necessary import definitions.

Control Flow Guard Examples

We have created samples with specially crafted bugs to show how to use CFG, and some errors CFG protects against. The bugs that these examples have are not statically identified by the compiler, but are detected at runtime by CFG. Where possible, we simulate potential malicious behavior that CFG would prevent, and which malicious behavior CFG would not prevent.

These CFG examples are modified from the Clang CFI examples to show the different meaning of a valid call destination between the two implementations. Each example builds two binaries, one with CFG (e.g. cfg_icall.exe) and one without CFG (e.g. no_cfg_icall.exe). These binaries are built from the same source, and used to illustrate CFG features and protections.

We have provided the following examples:

cfg_icall

This example is an analogue of the cfi_icall example from the Clang CFI blog post, but modified slightly to work with Visual Studio 2015 and Control Flow Guard. The example binary accepts a single command line argument, with the valid values being 0-3. Each value demonstrates different aspects of indirect call protection.

  • Option 0 is a normal, valid indirect call that should always work. This should run properly under any CFI scheme.
  • Option 1 is an invalid indirect call (the destination is read from outside array bounds), but the destination is a function with the same function signature as a valid call. This works under both Clang’s CFI and CFG, but it could fail under some future scheme.
  • Option 2 is an invalid indirect call, and the destination is a valid function entry but with a signature different than the caller expects. This call fails under Clang’s CFI but works under CFG.
  • Option 3 is an invalid indirect call to a destination that is an invalid function entry point. This should fail under any CFI scheme, and this call fails under Clang’s CFI and CFG.
  • All other options should point to uninitialized memory, and correctly fail for both tested CFI implementations.

cfg_vcall

The cfg_vcall example (derived from the cfi_icall example from the previous post) shows that virtual calls are protected by CFG, when the destination is not a valid entry point. The example shows two simulated bugs: the first bug is an invalid cast to simulate something like a type confusion vulnerability. This will fail under Clang’s CFI, but succeed under CFG. The second bug simulates a use-after-free or similar memory corruption, where the object pointer is replaced by an attacker-created object, with a function pointer that points to the middle of a function. The bad call is blocked by both Clang’s CFI and CFG.

Figure 2: A Control Flow Guard violation as seen in WinDbg.
cfg_violation

cfg_valid_targets

This example is cfg_icall but modified to show how to use SetProcessValidCallTargets. The CFG bitmap is manually updated to remove bad_int_arg and float_arg from the valid call target list. Only option 0 will work; every other option will return a CFG error.

cfg_guard_ignore

Control flow guard can be disabled for certain methods; this example shows how to use the __declspec(guard(ignore)) compiler directive to completely disable CFG inside the specified method.

cfg_guard_nocf

Control flow guard can be partially disabled for certain methods; this example shows how to use the __declspec(guard(nocf)) compiler directive to disable CFG for indirect calls in a specified method, but still enable CFG for any referenced function pointers. The example compares the effects of __declspec(guard(nocf)) to __declspec(guard(ignore)).

cfg_guard_suppress and cfg_suppressed_export

Sometimes a library has security sensitive methods that should never be called indirectly. The __declspec(guard(suppress)) directive will prevent exported functions from being called via function pointer. These two examples work together to show how suppressed exports work. Cfg_suppressed_export is a DLL with a suppressed export and a normal export. Cfg_guard_suppress tries to call both exports via a pointer retrieved via GetProcAddress.

All flows must end

Now that you know what Control Flow Guard is and how it can protect your applications, go turn it on for your software! Enabling CFG is very simple, just add /guard:cf to your compiler and linker flags. To see real examples of how CFG can protect your software, take look at our CFG examples showcase. We hope that Microsoft continues to improve CFG with future Visual Studio releases.