Close Encounters with Symbolic Execution (Part 2)

This is part two of a two-part blog post that shows how to use KLEE with mcsema to symbolically execute Linux binaries (see the first post!). This part covers how to build KLEE and mcsema, and provides a detailed example of using them to symbolically execute an existing binary. The binary we’ll be symbolically executing is an oracle for a maze with hidden walls, as promised in Part 1.

As a visual example, we’ll show how to get from an empty maze to a solved maze:

Maze (before) and maze (after): the empty maze and the solved maze.

Building KLEE with LLVM 3.2 on Ubuntu 14.04

One of the hardest parts about using KLEE is building it. The official build instructions cover KLEE on LLVM 2.9 and LLVM 3.4 on amd64. To analyze mcsema generated bitcode, we will need to build KLEE for LLVM 3.2 on i386. This is an unsupported configuration for KLEE, but it still works very well.

We will be using the i386 version of Ubuntu 14.04. The 32-bit version of Ubuntu is required to build a 32-bit KLEE. Do not try adding -m32 to CFLAGS on a 64-bit version. It will take away hours of your time that you will never get back. Get the 32-bit Ubuntu. The exact instructions are described in great detail below. Be warned: building everything will take some time.

# These are instructions for how to build KLEE and mcsema. 
# These are a part of a blog post explaining how to use KLEE
# to symbolically execute closed source binaries.
 
# install the prerequisites
sudo apt-get install vim build-essential g++ curl python-minimal \
  git bison flex bc libcap-dev cmake libboost-dev \
  libboost-program-options-dev libboost-system-dev ncurses-dev nasm
 
# we assume everything KLEE related will live in ~/klee.
cd ~
mkdir klee
cd klee
 
# Get the LLVM and Clang source, extract both
wget http://llvm.org/releases/3.2/llvm-3.2.src.tar.gz
wget http://llvm.org/releases/3.2/clang-3.2.src.tar.gz
tar xzf llvm-3.2.src.tar.gz
tar xzf clang-3.2.src.tar.gz
 
# Move clang into the LLVM source tree:
mv clang-3.2.src llvm-3.2.src/tools/clang
 
# normally you would use cmake here, but today you HAVE to use autotools.
cd llvm-3.2.src
 
# For this example, we are only going to enable the x86 target.
# Building will take a while. Go make some coffee, take a nap, etc.
./configure --enable-optimized --enable-assertions --enable-targets=x86
make
 
# add the resulting binaries to your $PATH (needed for later building steps)
export PATH=`pwd`/Release+Asserts/bin:$PATH
 
# Make sure you are using the correct clang when you execute clang; you may
# have accidentally installed another clang that has priority in $PATH. Let's
# verify the version, for sanity. Your output should match what's below.
# 
#$ clang --version
#clang version 3.2 (tags/RELEASE_32/final)
#Target: i386-pc-linux-gnu
#Thread model: posix
 
# Once clang is built, it's time to build STP and uClibc for KLEE.
cd ~/klee
git clone https://github.com/stp/stp.git
 
# Use CMake to build STP. Compared to LLVM and clang,
# the build time of STP will feel like an instant.
cd stp
mkdir build && cd build
cmake -G 'Unix Makefiles' -DCMAKE_BUILD_TYPE=Release ..
make
 
# After STP builds, let's raise the stack size limit for STP and KLEE:
ulimit -s unlimited
 
# Build uclibc for KLEE
cd ../..
git clone --depth 1 --branch klee_0_9_29 https://github.com/klee/klee-uclibc.git
cd klee-uclibc
./configure -l --enable-release
make
cd ..
 
# It’s time for KLEE itself. KLEE is updated fairly often and we are 
# building on an unsupported configuration. These instructions may not 
# work for future versions of KLEE. These examples were tested with 
# commit 10b800db2c0639399ca2bdc041959519c54f89e5.
git clone https://github.com/klee/klee.git
 
# Proper configuration of KLEE with LLVM 3.2 requires this long voodoo command
cd klee
./configure --with-stp=`pwd`/../stp/build \
  --with-uclibc=`pwd`/../klee-uclibc \
  --with-llvm=`pwd`/../llvm-3.2.src \
  --with-llvmcc=`pwd`/../llvm-3.2.src/Release+Asserts/bin/clang \
  --with-llvmcxx=`pwd`/../llvm-3.2.src/Release+Asserts/bin/clang++ \
  --enable-posix-runtime
make
 
# KLEE comes with a set of tests to ensure the build works. 
# Before running the tests, libstp must be in the library path.
# Change $LD_LIBRARY_PATH to ensure linking against libstp works. 
# A lot of text will scroll by with a test summary at the end.
# Note that your results may be slightly different since the KLEE 
# project may have added or modified tests. The vast majority of 
# tests should pass. A few tests fail, but we’re building KLEE on 
# an unsupported configuration so some failure is expected.
export LD_LIBRARY_PATH=`pwd`/../stp/build/lib
make check
 
#These are the expected results:
#Expected Passes : 141
#Expected Failures : 1
#Unsupported Tests : 1
#Unexpected Failures: 11
 
# KLEE also has a set of unit tests so run those too, just to be sure. 
# All of the unit tests should pass!
make unittests
 
# Now we are ready for the second part: 
# using mcsema with KLEE to symbolically execute existing binaries.
 
# First, we need to clone and build the latest version of mcsema, which
# includes support for linked ELF binaries and comes with the necessary
# samples to get started.
cd ~/klee
git clone https://github.com/trailofbits/mcsema.git
cd mcsema
git checkout v0.1.0
mkdir build && cd build
cmake -G "Unix Makefiles" -DCMAKE_BUILD_TYPE=Release ..
make
 
# Finally, make sure our environment is correct for future steps
export PATH=$PATH:~/klee/llvm-3.2.src/Release+Asserts/bin/
export PATH=$PATH:~/klee/klee/Release+Asserts/bin/
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:~/klee/stp/build/lib/

Translating the Maze Binary

The latest version of mcsema includes the maze program from Felipe’s blog in the examples as demo_maze. In the instructions below, we’ll compile the maze oracle to a 32-bit ELF binary and then convert the binary to LLVM bitcode via mcsema.

# Note: tests/demo_maze.sh completes these steps automatically
cd ~/klee/mcsema/mc-sema/tests
# Load our environment variables
source env.sh
# Compile the demo to a 32-bit ELF executable
${CC} -ggdb -m32 -o demo_maze demo_maze.c
# Recover the CFG using mcsema's bin_descend
${BIN_DESCEND_PATH}/bin_descend -d -func-map=maze_map.txt -i=demo_maze -entry-symbol=main
# Convert the CFG into LLVM bitcode via mcsema's cfg_to_bc
${CFG_TO_BC_PATH}/cfg_to_bc -i demo_maze.cfg -driver=mcsema_main,main,raw,return,C -o demo_maze.bc
# Optimize the bitcode
${LLVM_PATH}/opt -O3 -o demo_maze_opt.bc demo_maze.bc

We will use the optimized bitcode (demo_maze_opt.bc) generated by this step as input to KLEE. Now that everything is set up, let’s get to the fun part — finding all maze solutions with KLEE.

# create a working directory next to the other KLEE examples.
cd ~/klee/klee/examples
mkdir maze
cd maze
# copy the bitcode generated by mcsema into the working directory
cp ~/klee/mcsema/mc-sema/tests/demo_maze_opt.bc ./
# copy the register context (needed to build a driver to run the bitcode)
cp ~/klee/mcsema/mc-sema/common/RegisterState.h ./

Now that we have the maze oracle binary in LLVM bitcode, we need to tell KLEE which inputs are symbolic and when a maze is solved. To do this we will create a small driver that will intercept the read() and exit() system calls, mark input to read() as symbolic, and assert on exit(1), a successful maze solution.

To make the driver, create a file named maze_driver.c with the contents of this gist and use clang to compile the maze driver into bitcode. Every function in the driver is commented to help explain how it works.
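
For orientation, the sketch below shows roughly what the two interception points look like. It is an illustrative simplification rather than the gist’s exact code: the input length is a made-up constant, and the mcsema plumbing (filling in the register context from RegisterState.h and calling the translated entry point) is omitted.

// Illustrative sketch only; see the gist for the real maze_driver.c.
// Assumes a hypothetical fixed input length. The real driver also builds an
// mcsema register context (RegisterState.h) and invokes the translated code.
#include <assert.h>
#include <string.h>
#include <klee/klee.h>

#define MAZE_INPUT_LEN 28  /* hypothetical: enough moves to finish the maze */

// Intercept read(): hand the program symbolic bytes instead of reading stdin,
// so KLEE explores every possible sequence of moves.
int read(int fd, void *buf, unsigned int count) {
    static char moves[MAZE_INPUT_LEN];
    klee_make_symbolic(moves, sizeof(moves), "maze_moves");
    if (count > sizeof(moves))
        count = sizeof(moves);
    memcpy(buf, moves, count);
    return count;
}

// Intercept exit(): the oracle calls exit(1) only when the maze is solved,
// so turn that state into an assertion failure that KLEE records.
void exit(int status) {
    assert(status != 1 && "maze solved");
    klee_silent_exit(0);  /* quietly end states that did not solve the maze */
}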

clang -I../../include/ -emit-llvm -c -o maze_driver.bc maze_driver.c

We now have two bitcode files: the translation of the maze program and a driver to start the program and mark inputs as symbolic. The two need to be combined into one bitcode file for use with KLEE, which we can do with llvm-link. There will be a compatibility warning, which is safe to ignore in this case.

llvm-link demo_maze_opt.bc maze_driver.bc > maze_klee.bc

Running KLEE

Once we have the combined bitcode, let’s do some symbolic execution. Lots of output will scroll by, but we can see KLEE solving the maze and trying every state of the program. If you recall from the driver, we can recognize successful states because they trigger an assert in KLEE. There are four solutions to the original maze, so let’s see how many KLEE finds. Four results would be a good sign (note: your test numbers may be different):

klee --emit-all-errors -libc=uclibc maze_klee.bc
# Lots of things will scroll by
ls klee-last/*assert*
# For me, the output is:
# klee-last/test000178.assert.err  klee-last/test000315.assert.err
# klee-last/test000270.assert.err  klee-last/test000376.assert.err

Now let’s use a quick bash script to look at the outputs and see if they match the original results (a sketch of the script follows the list below). The solutions identified by KLEE from the mcsema bitcode are:

  • sddwddddsddw
  • ssssddddwwaawwddddsddw
  • sddwddddssssddwwww
  • ssssddddwwaawwddddssssddwwww

… and they match the results from Felipe’s original blog post!
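
The extraction boils down to a short loop over the .ktest files that correspond to the assertion failures. Here is a minimal sketch, assuming the ktest-tool script that ships with KLEE is on your $PATH; its output format differs between KLEE versions, so the exact post-processing of the dumped bytes may vary:

# for each assert (a solved maze), dump the symbolic input KLEE chose
for err in klee-last/*.assert.err; do
  ktest-tool "${err%.assert.err}.ktest"
done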

Conclusion

Symbolic execution is a powerful tool that can execute programs on all inputs at once. Using mcsema and KLEE, we can symbolically execute existing closed source binary programs. In this example, we found all solutions to a maze with hidden walls — starting from an opaque binary. KLEE and mcsema could do this while knowing nothing about mazes and without being tuned for string inputs.

This example is simple, but it shows what is possible: using mcsema we can apply the power of KLEE to closed source binaries. We could generate high code coverage tests for closed source binaries, or find security vulnerabilities in arbitrary binary applications.

Note: We’re looking for talented systems engineers to work on mcsema and related projects (contract and full-time). If you’re interested in being paid to work on or with mcsema, send us an email!

Close Encounters with Symbolic Execution

At THREADS 2014, I demonstrated a new capability of mcsema that enables the use of KLEE, a symbolic execution framework, on software available only in binary form. In the talk, I described how to use mcsema and KLEE to learn an unknown protocol defined in a binary that has never been seen before. In the example, we learned the series of steps required to navigate through a maze. Our competition in the DARPA Cyber Grand Challenge requires this capability — our “reasoning system” will have no prior knowledge and no human guidance, yet must learn to speak with dozens, hundreds, or thousands of binaries, each with unique inputs.

Symbolic Execution

In the first part of this two-part blog post, I’ll explain what symbolic execution is and how symbolic execution allows our “reasoning system” to learn inputs for arbitrary binaries. In the second part of the blog post, I will guide you through the maze solving example presented at THREADS. To describe the power of symbolic execution, we are going to look at three increasingly difficult iterations of a classic computer science problem: maze solving. Once I discuss the power of symbolic execution, I’ll talk about KLEE, an LLVM-based symbolic execution framework, and how mcsema enables KLEE to run on binary-only applications.

Maze Solving

One of the classic problems in first year computer science classes is maze solving. Plainly, the problem is this: you are given a map of a maze. Your task is to find a path from the start to the finish. The more formal definition is: a maze is defined by a matrix where each cell can be a step or a wall. One can move into a step cell, but not into a wall cell. The only valid move directions are up, down, left, or right. A sequence of moves from cell to cell is called a path. Some cell is marked as START and another cell is marked as END. Given this maze, find a path from START to END, or show that no such path exists.

An example maze. The step spaces are blank, the walls are +-|, the END marker is the # sign, and the current path is the X’s.

The typical solution to the maze problem is to enumerate all possible paths from START, and search for a path that terminates at END. The algorithm is neatly summarized in this stack overflow post. The algorithm works because it has a complete map of the maze. The map is used to create a finite set of valid paths. This set can be quickly searched to find a valid path.

Maze Solving sans Map

In an artificial intelligence class, one may encounter a more difficult problem: solving a maze without the map. In this problem, the solver has to discover the map prior to finding a path from the start to the end. More formally, the problem is: you are given an oracle that answers questions about maze paths. When given a path, the oracle will tell you if the path solves the maze, hits a wall, or moves to a step position. Given this oracle, find a path from the start to the end, or show there is no path.

The solution to this problem is backtracking. The solver will build the path one move at a time, asking the oracle about the path at every move. If an attempted move hits a wall, the solver will try another direction. If no direction works, the solver returns to the previous position and tries a new direction. Eventually, the solver will either find the end or visit every possible position. Backtracking works because with every answer from the oracle, the solver learns more of the map. Eventually, the solver will learn enough of the map to find the end.
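
For concreteness, here is a rough sketch of that backtracking loop in C. The oracle here is a hypothetical stand-in that secretly knows a tiny hard-coded maze so the example can run, and a depth bound takes the place of the visited-position bookkeeping (the map being learned) that a real solver would keep:

#include <stdio.h>

enum reply { WALL, STEP, SOLVED };

/* A 5x5 toy maze: '#' walls, ' ' steps, 'E' the end. The start is row 1, column 1. */
static const char *MAZE[5] = {
    "#####",
    "#   #",
    "# # #",
    "# #E#",
    "#####",
};

/* The oracle replays the whole path from the start and judges it. */
static enum reply oracle(const char *path, int len) {
    int r = 1, c = 1;
    for (int i = 0; i < len; i++) {
        if (path[i] == 'u') r--;
        if (path[i] == 'd') r++;
        if (path[i] == 'l') c--;
        if (path[i] == 'r') c++;
        if (MAZE[r][c] == '#') return WALL;
        if (MAZE[r][c] == 'E') return SOLVED;
    }
    return STEP;
}

static const char MOVES[4] = { 'u', 'd', 'l', 'r' };

/* Build the path one move at a time, asking the oracle after every move;
 * back up whenever a direction hits a wall or leads to a dead end. */
static int solve(char *path, int len, int max_len) {
    if (len >= max_len) return 0;
    for (int i = 0; i < 4; i++) {
        path[len] = MOVES[i];
        switch (oracle(path, len + 1)) {
        case SOLVED: path[len + 1] = '\0'; return 1;  /* this path reaches the end */
        case WALL:   break;                           /* undo the move, try the next */
        case STEP:
            if (solve(path, len + 1, max_len)) return 1;
            break;                                    /* dead end: backtrack */
        }
    }
    return 0;
}

int main(void) {
    char path[16] = {0};
    if (solve(path, 0, 12))
        printf("solution: %s\n", path);
    return 0;
}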

Maze Solving with Fake Walls

Let’s posit an even more difficult problem: a maze with fake walls. That is, there are some walls that are really steps. Since some walls are fake, the solver learns nothing from the oracle until it asks about a complete solution. If this isn’t very clear, imagine a maze made entirely of fake walls: for any path, except one that solves the maze, the oracle will always answer “wall.” More formally, the problem now is: given an oracle that will verify only a complete path from the start to the end, solve the maze.

This is vastly more difficult than before: the solver can’t learn the map. The only generic solution is to ask the oracle about every possible path. The solver will eventually guess a valid path, since it must be in the set of all paths (assuming the maze is finite). This “brute force” solver is even more powerful than the previous: it will solve all mazes, map or no map.

Despite its power, the brute force solver has a huge problem: it’s slow and impractical.

Cheat To Win

The last problem is equivalent to the following more general problem: given an oracle that verifies solutions, find a valid solution. Ideally, we want something that finds a valid solution faster than brute-force guessing, especially for generic problems, where we don’t even know what the inputs look like!

So let’s make a “generic problem solver”. Brute force is slow and impractical because it tries every single concrete input, in sequence. What if a solver could try all inputs at once? Humans do this all the time without even thinking. For instance, when we solve equations, we don’t try every number until we find the solution. We use a variable that can stand in for any number, and algorithmically identify the answer.

So how will our solver try every input at once? It will cheat to win! Our solver has an ace up its sleeve: the oracle is a real program. The solver can look at the oracle, analyze it, and find a solution without guessing. Sadly, this is impossible to do for every oracle (because you run into the halting problem). But for many real oracles, this approach works.

For instance, consider the following oracle that declares a winner or a loser:

x = input();
if(x > 5 && x < 9 && x % 4 == 0) {
  winner();
} else {
  loser();
}

The solver could determine that the winner input must be a number greater than 5, less than 9, and evenly divisible by 4. These constraints can be turned into a set of equations and inequalities and solved, showing the only winner value is 8.

A hypothetical problem solver could work like this: it will treat input into the oracle as a symbol. That is, instead of picking a specific value as the input, the value will be treated as a variable. The solver will then apply constraints to the symbol that correspond to different branches in the oracle program. When the solver finds a “valid solution” state in the oracle, the constraints on the input are solved. If the constraints can be solved, the result will be a concrete input that reaches the valid solution state. The problem solver tries every possible input at once by converting the oracle into a system of constraints.

This hypothetical problem solver is real: the part that discovers the constraints is called a symbolic execution framework, and the part that solves equations is called an SMT solver.
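
As a concrete sketch of that idea, here is how the toy oracle above could be handed to KLEE directly (this is an illustration, not code from the post): the input is marked symbolic with KLEE’s C API, and an assertion flags the winning state so the solver has to produce the concrete value that reaches it.

#include <assert.h>
#include <klee/klee.h>

int main(void) {
    int x;
    klee_make_symbolic(&x, sizeof(x), "x");  /* treat x as a variable, not a concrete value */
    if (x > 5 && x < 9 && x % 4 == 0) {
        assert(0 && "winner");  /* KLEE records the concrete x that reaches this state */
    }
    return 0;
}

Compiled to bitcode and run under klee, this should produce a single assertion-failure test case whose input is x = 8.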

The Future Is Now

There are several software packages that combine symbolic execution with SMT solvers to analyze programs. We will be looking at KLEE because it works with LLVM bitcode. We can use KLEE as a generic problem solver to find all valid inputs given an oracle that verifies those inputs. KLEE can solve a maze with hidden walls: Felipe Manzano has an excellent blog post showing how to use KLEE to solve exactly such a maze.

So what does mcsema have to do with this? Well, KLEE works on programs written in LLVM bitcode. Before mcsema, KLEE could only analyze programs that come with source code. Using mcsema, KLEE can be a problem solver for arbitrary binary applications! For instance, given a compiled binary that checks solutions to mazes with hidden walls, KLEE could find all the valid paths through the maze. Or it could do something more useful, like automatically generate application tests with high code coverage, or maybe even find security bugs in binary programs.

But back to maze solving. In Part 2 of this blog post, we’ll take a binary that solves mazes, use mcsema to translate it to LLVM, and then use KLEE to find all valid paths through the maze. More specifically, we will take Felipe’s maze oracle and compile it to a Linux binary. Then, we will use mcsema and KLEE to find all possible maze solutions. Everything will be done without modifying the original binary. The only thing KLEE will know is how to provide input and how to check solutions. In essence, we are going to show how to use mcsema and KLEE to identify all valid inputs to a binary application.

Speaker Lineup for THREADS ’14: Scaling Security

For every security engineer you train, there are 20 or more developers writing code with potential vulnerabilities. There’s no human way to keep up. We need to be more effective with fewer resources. It’s time to make security a fully integrated part of modern software development and operations.

It’s time to automate.

This year’s THREADS will focus exclusively on automating security. In this single forum, a selection of the industry’s best experts will present previously unseen in-house innovations deployed at major technology firms, and share leading research advances that will become available in the future.

Buy tickets for THREADS now to get the early-bird special (expires 10/13).

DARPA Returns – Exclusive

If you attended THREADS’13, you know that our showcase of DARPA’s Cyber Fast Track was not-to-be-missed. Good news, folks. DARPA’s coming back with a brief of another exciting project, the Integrated Cyber Analysis System (ICAS). ICAS enables streamlined detection of targeted attacks on large and diverse corporate networks. (Think Target, Home Depot, and JPMorgan Chase.)

We’ll hear from the three players DARPA invited to tackle the problem: Invincea Labs, Raytheon BBN, and Digital Operatives. Each group attempted to meet the project goals in a unique way, and will share their experiences and insights.

Learn about it at THREADS’14 first.

World-Class Speakers at THREADS’14

KEYNOTES

Robert Joyce, Chief, Tailored Access Operations (TAO), NSA

As the Chief of TAO, Rob leads an organization that provides unique, highly valued capabilities to the Intelligence Community and the Nation’s leadership.  His organization is the NSA mission element charged with providing tools and expertise in computer network exploitation to deliver foreign intelligence. Prior to becoming the Chief of TAO, Rob served as the Deputy Director of the Information Assurance Directorate (IAD) at NSA, where he led efforts to harden, protect and defend the Nation’s most critical National Security systems and improve cybersecurity for the nation.

Michael Tiffany, CEO, White Ops

Michael Tiffany is the co-founder and CEO of White Ops, a security company founded in 2013 to break the profit models of cybercriminals. By making botnet schemes like ad fraud unprofitable, White Ops disrupts the criminal incentive to break into millions of computers. Previously, Tiffany was the co-founder of Mission Assurance Corporation, a pioneer in space-based computing that is now a part of Recursion Ventures. He is a Technical Fellow of Critical Assets Labs, a DARPA-funded cyber-security research lab. He is a Subject Matter Advisor for the Signal Media Project, a nonprofit promoting the accurate portrayal of science, technology and history in popular media. He is also a Ninja.

LEADING RESEARCH

Smten and the Art of Satisfiability-based Search
Nirav Dave, SRI

Reverse All the Things with PANDA
Brendan Dolan-Gavitt, Columbia University

Code-Pointer Integrity
Laszlo Szekeres, Stony Brook University

Static Translation of X86 Instruction Semantics to LLVM with McSema
Artem Dinaburg & Andrew Ruef, Trail of Bits

Transparent ROP Detection using CPU Performance Counters
Xiaoning Li, Intel & Michael Crouse, Harvard University

Improving Scalable, Automated Baremetal Malware Analysis
Adam Allred & Paul Royal, Georgia Tech Information Security Center (GTISC)

Integrated Cyber Attribution System (ICAS) Program Brief
Richard Guidorizzi, DARPA

TAPIO: Targeted Attack Premonition using Integrated Operational Data Sources
Invincea Labs

Gestalt: Integrated Cyber Analysis System
Raytheon BBN

Federated Understanding of Security Information Over Networks (FUSION)
Digital Operatives

IN-HOUSE INNOVATIONS

Building Your Own DFIR Sidekick
Scott J Roberts, Github

Operating system analytics and host intrusion detection at scale
Mike Arpaia, Facebook

Reasoning about Optimal Solutions to Automation Problems
Jared Carlson & Andrew Reiter, Veracode

Augmenting Binary Analysis with Python and Pin
Omar Ahmed, Etsy & Tyler Bohan, NYU-Poly

Are attackers using automation more efficiently than defenders?
Marc-Etienne M.Léveillé, ESET

Making Sense of Content Security Policy (CSP) Reports @ Scale
Ivan Leichtling, Yelp

Automatic Application Security @twitter
Neil Matatall, Twitter

Cleaning Up the Internet with Scumblr and Sketchy
Andy Hoernecke, Netflix

CRITs: Collaborative Research Into Threats
Michael Goffin, Wesley Shields, MITRE

GitHub AppSec: Keeping up with 111 prolific engineers
Ben Toews, GitHub

Don’t miss out. Buy tickets for THREADS now to get the early-bird special (expires 10/13). You won’t find a more comprehensive treatment of scaling security anywhere else.

 

We’re Sponsoring the NYU-Poly Women’s Cybersecurity Symposium

Cyber security is an increasingly complex and vibrant field that requires brilliant and driven people to work on diverse teams. Unfortunately, women are severely underrepresented and we want to change that. Career Discovery in Cyber Security is an NYU-Poly event, created in collaboration with influential men and women in the industry. This annual symposium helps guide talented and interested women into careers in cyber security. We know that there are challenges for female professionals in male-dominated fields, which is why we want to create an environment where women have the resources they need to excel.

The goal of this symposium is to showcase the variety of industries and career paths in which cyber security professionals can make their mark. Keynote talks, interactive learning sessions, and technical workshops will prepare participants to identify security challenges and acquire the skills to meet them. A mentoring roundtable, female executive panel Q&A session, and networking opportunities allow participants to interact with accomplished women in the field in meaningful ways. These activities will give an extensive, well-rounded look into possible career paths.

Trail of Bits is a strong advocate for women in the cyber security world at all stages of their careers. In the past, we were participants in the CSAW Summer Program for Women, which introduced high school women to the world of cyber security. We are proud of our involvement in this women’s symposium from its earliest planning stages, continue to offer financial support via named scholarships for attendees, and will take part in the post-event mentoring program.

This year’s symposium is Friday and Saturday, October 17-18 in Brooklyn, New York. For more details and registration, visit the website. Follow the symposium on Twitter or Facebook for news and updates.

Enabling Two-Factor Authentication (2FA) for Apple ID and DropBox

In light of the recent compromises, you’re probably wondering what could have been done to prevent such attacks. According to some unverified articles it would appear that flaws in Apple’s services allowed an attacker to brute force passwords without any rate limiting or account lockout. While it’s not publicly known if the attacks were accomplished via brute force password guessing, there has been a lot of talk about enabling Two-Factor Authentication (2FA) across services that offer it. The two most popular services being discussed are iCloud and DropBox. While setting up 2FA on these services is not as easy as it should be, this guide will step you through enabling 2FA on Google, Apple ID and DropBox accounts. It’s a free way of adding an extra layer of security on top of these services which handle potentially sensitive information.

What is Two-Factor Authentication?

Username and password authentication uses a single factor to verify identity: something the user knows. Two-Factor authentication adds an extra layer of security on top of a username and password. Normally, the second factor is something only the real user has. This is typically a temporary passcode generated by a piece of hardware such as an RSA token, a passcode sent as an SMS to the user’s cell phone, or a mobile application that accomplishes the same function.

With two-factor authentication, stealing a username and password won’t be enough to log in — the second factor is also required. This multi-factor authentication means an attacker will be required to compromise a user above and beyond password guessing or stealing a credentials database. An attacker would have to gain access to the source of the extra, unique and usually temporary information that makes up the 2FA.

ReMASTering Applications by Obfuscating during Compilation

In this post, we discuss the creation of a novel software obfuscation toolkit, MAST, implemented in the LLVM compiler and suitable for denying program understanding to even the most well-resourced adversary. Our implementation is inspired by effective obfuscation techniques used by nation-state malware and techniques discussed in academic literature. MAST enables software developers to protect applications with technology developed for offense.

MAST is a product of Cyber Fast Track, and we would like to thank Mudge and DARPA for funding our work. This project would not have been possible without their support. MAST is now a commercial product offering of Trail of Bits and companies interested in licensing it for their own use should contact info@trailofbits.com.

Background

There are a lot of risks in releasing software these days. Once upon a time, reverse engineering software presented a challenge best solved by experienced and skilled reverse engineers at great expense. It was worthwhile for reasonably well-funded groups to reverse engineer and recreate proprietary technology or for clever but bored people to generate party tricks. Despite the latter type of people causing all kinds of mild internet havoc, reverse engineering wasn’t widely considered a serious threat until relatively recently.

Over time, however, the stakes have risen; criminal entities, corporations, even nation-states have become extremely interested in software vulnerabilities. These entities seek either to defend their own networks, applications, and users, or to attack someone else’s. Historically, software obfuscation was a concern of the “good guys”, who were interested in protecting their intellectual property. It wasn’t long before malicious entities began obfuscating their own tools to protect them from analysis when captured.

A recent example of successful obfuscation is that used by the authors of the Gauss malware; several days after discovering the malware, Kaspersky Lab, a respected malware analysis lab and antivirus company, posted a public plea for assistance in decrypting a portion of the code. That even a company of professionals had trouble enough to ask for outside help is telling: obfuscation can be very effective. Professional researchers have been unable to deobfuscate Gauss to this day.

Motivation

With all of this in mind, we were inspired by Gauss to create a software protection system that leapfrogs available analysis technology. Could we repurpose techniques from software exploitation and malware obfuscation into a state-of-the-art software protection system? Our team is quite familiar with publicly available tools for assisting in reverse engineering tasks and considered how to significantly reduce their efficacy, if not deny it altogether.

Software developers seek to protect varying classes of information within a program. Our system must account for each with equal levels of protection to satisfy these potential use cases:

  • Algorithms: adversary knowledge of proprietary technology
  • Data: knowledge of proprietary data (the company’s or the user’s)
  • Vulnerabilities: knowledge of vulnerabilities within the program

In order for the software protection system to be useful to developers, it must be:

  • Easy to use: the obfuscation should be transparent to our development process, not alter or interfere with it. No annotations should be necessary, though we may want them in certain cases.
  • Cross-platform: the obfuscation should apply uniformly to all applications and frameworks that we use, including mobile or embedded devices that may run on different processor architectures.
  • Protect against state-of-the-art analysis: our obfuscation should leapfrog available static analysis tools and techniques and require novel research advances to see through.

Finally, we assume an attacker will have access to the static program image; many software applications are going to be directly accessible to a dedicated attacker. For example, an attacker interested in a mobile application, anti-virus signatures, or software patches will have the static program image to study.

Our Approach

We decided to focus primarily on preventing static analysis; these days there are many tools that can be run statically over application binaries to extract information with relatively little work and time, and many attackers are proficient at building their own situation-specific tools. Static tools can often be run very quickly over large amounts of code, without requiring the attacker to have an environment in which to execute the target binary.

We decided on a group of techniques that compose together, comprising opaque predicate insertion, code diffusion, and – because our original scope was iOS applications – mangling of Objective-C symbols. These make the protected application impossible to understand without environmental data and impossible to analyze with current static analysis tools due to alias analysis limitations, and they deny the effectiveness of breakpoints, method name retrieval scripts, and other common reversing techniques. In combination, these techniques attack a reverse engineer’s workflow and tools from all sides.

Further, we did all of our obfuscation work inside of a compiler (LLVM) because we wanted our technology to be thoroughly baked into the entire program. LLVM can use knowledge of the program to generate realistic opaque predicates or hide diffused code inside of false paths not taken, forcing a reverse engineer to consult the program’s environment (which might not be available) to resolve which instruction sequences are the correct ones. Obfuscating at the compiler level is more reliable than operating on an existing binary: there is no confusion about code vs. data or missing critical application behavior. Additionally, compiler-level obfuscation is transparent to current and future development tools based on LLVM. For instance, MAST could obfuscate Swift on the day of release — directly from the Xcode IDE.

Symbol Mangling

The first and simplest technique was to hinder quick Objective-C method name retrieval scripts; this is certainly the least interesting of the transforms, but would remove a large amount of human-readable information from an iOS application. Without method or other symbol names present for the proprietary code, it’s more difficult to make sense of the program at a glance.

Opaque Predicate Insertion

The second technique we applied, opaque predicate insertion, is not a new technique. It’s been done before in numerous ways, and capable analysts have developed ways around many of the common implementations. We created a stronger version of predicate insertion by inserting predicates with opaque conditions and alternate branches that look realistic to a script or person skimming the code. Realistic predicates significantly slow down a human analyst, and will also slow down tools that operate on program control flow graphs (CFGs) by ballooning the graph to be much larger than the original. Increased CFG size impacts the size of the program and its execution speed, but our testing indicates the impact is smaller than, or consistent with, that of similar tools.
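
To illustrate the general idea (this is a hand-written toy, not one of MAST’s generated predicates, which draw on program-specific knowledge), here is what an opaque predicate looks like in C. The condition is always true, but a tool that treats x as an arbitrary value cannot discard the decoy branch, so the control flow graph grows:

#include <stdio.h>

/* Toy opaque predicate: x*x + x is the product of two consecutive integers,
 * so it is always even and the condition below is always true. The else
 * branch is never executed, but a static tool that cannot prove this must
 * still analyze it. */
void process(unsigned int x) {
    if ((x * x + x) % 2 == 0) {
        printf("real work: %u\n", x);       /* the original code path */
    } else {
        printf("decoy work: %u\n", x * 7);  /* realistic-looking dead code */
    }
}

int main(void) {
    process(42);
    return 0;
}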

Code Diffusion

The third technique, code diffusion, is by far the most interesting. We took the ideas of Return-Oriented Programming (ROP) and applied them in a defensive manner.

In a straightforward situation, an attacker exploits a vulnerability in an application and supplies their own code for the target to execute (shellcode). However, since the introduction of non-executable data mitigations like DEP and NX, attackers have had to find ways to execute malicious code without the introduction of anything new. ROP is a technique that makes use of code that is already present in the application. Usually, an attacker would compile a set of short “gadgets” in the existing program text that each perform a simple task, and then link those together, jumping from one to the other, to build up the functionality they require for their exploit — effectively creating a new program by jumping around in the existing program.

We transform application code such that it jumps around in a ROP-like way, scrambling the program’s control flow graph into disparate units. However, unlike ROP, where attackers are limited by the gadgets they can find and their ability to predict their location at runtime, we precisely control the placement of gadgets during compilation. For example, we can store gadgets in the bogus programs inserted during the opaque predicate obfuscation. After applying this technique, reverse engineers will immediately notice that the handy graph is gone from tools like IDA. Further, this transformation will make it impossible to use state-of-the-art static analysis tools, like BAP, and impedes dynamic analysis techniques that rely on concrete execution with a debugger. Code diffusion destroys the semantic value of breakpoints, because a single code snippet may be re-used by many different functions and not used by other instances of the same function.

Native code before obfuscation with MAST

Native code after obfuscation with MAST

The figures above demonstrate a very simple function before and after the code diffusion transform, using screenshots from IDA. In the first figure, there is a complete control flow graph; in the second, however, the first basic block no longer jumps directly to either of the following blocks; instead, it must refer at runtime to a data section elsewhere in the application before it knows where to jump in either case. Running this code diffusion transform over an entire application reduces the entire program from a set of connected-graph functions to a much larger set of single-basic-block “functions.”

Code diffusion has a noticeable performance impact on whole-program obfuscation. In our testing, we compared the speed of bzip2 before and after our return-oriented transformation, and the slowdown was approximately 55% (on x86).

Environmental Keying

MAST does one more thing to make reverse engineering even more difficult — it ties the execution of the code to a specific device, such as a user’s mobile phone. While using device-specific characteristics to bind a binary to a device is not new (it is extensively used in DRM and some malware, such as Gauss), MAST is able to integrate device-checking into each obfuscation layer as it is woven through the application. The intertwining of environmental keying and obfuscation renders the program far more resistant to reverse-engineering than some of the more common approaches to device-binding.

An attacker can no longer analyze just any copy of the application; they must also acquire and analyze the execution environment of the target computer. The whole environment is typically far more challenging to get hold of, and has a much larger quantity of code to analyze. Even if the environment is captured and time is taken to reverse engineer application details, the results will not be useful against the same application running on other hosts, because every host runs its own keyed version of the binary.

Conclusions

In summary, MAST is a suite of compile-time transformations that provide easy-to-use, cross-platform, state-of-the-art software obfuscation. It can be used for a number of purposes, such as preventing attackers from reverse engineering security-related software patches; protecting your proprietary technology; protecting data within an application; and protecting your application from vulnerability hunters. While originally scoped for iOS applications, the technologies are applicable to any software that can be compiled with LLVM.

McSema is Officially Open Source!

We are proud to announce that McSema is now open source! McSema is a framework for analyzing and transforming machine-code programs to LLVM bitcode. It supports translation of x86 machine code, including integer, floating point, and SSE instructions. We previously covered some features of McSema in an earlier blog post and in our talk at ReCON 2014.

Our talk at ReCON where we first described McSema

Build instructions and demos are available in the repository and we encourage you to try them on your own. We have created a mailing list, mcsema-dev@googlegroups.com, dedicated to McSema development and usage. Questions about licensing or integrating McSema into your commercial project may be directed to opensource@trailofbits.com.

McSema is permissively licensed under a three-clause BSD license. Some code and utilities we incorporate (e.g. Intel PIN for semantics testing) have their own licenses and need to be downloaded separately.

Finally, we would like to thank DARPA for their sponsorship of McSema development and their continued support. This project would not have been possible without them.

Education Initiative Spotlight: THREADS Call for Papers

A 2-day conference exploring state-of-the-art advances in security automation.

We would like to share the call for papers for THREADS 2014, a research and development conference that is part of NYU-Poly’s Cyber Security Awareness Week (CSAW). Trail of Bits is a founding sponsor of THREADS. The final deadline for submissions is October 6th, but you are encouraged to apply earlier.

This year’s theme for THREADS is scaling security — ensuring that security is an integral and automated part of software development and deployment models. In previous years THREADS served as a discussion forum for major developments in the security industry, like mobile security and Cyber Fast Track.

We need to scale security because every day we trust more about our lives and our society to internetworked information systems. The development of these systems is accelerating in pace and increasing in complexity — it is now common to deploy code multiple times per minute to worldwide production systems. To cope with this increasing complexity, more development and deployment tasks are becoming fully automated.

Security must be a core part of our new technology and fully integrated into an increasingly automated development and deployment model. THREADS will present new research and developments about integrating security into modern software development and operations.

Highlights from CSAW 2012 where the first THREADS conference was held

THREADS 2014: Security Automation

Companies such as Amazon and Netflix deploy code to worldwide production systems several times per minute. Tesla automobiles download software updates over the Internet to provide new functionality. An Internet-connected thermostat is a best-selling home automation gadget.

Traditional models of security are increasingly irrelevant in a rapidly updated world of Internet-connected devices. Gating deployments on manual security assessments would defeat the purpose of agile development and continuous deployment. Endpoint security products can’t target rapidly updated customized embedded platforms like cars and thermostats. The new model of security has to focus on automation, integration, detection and response time.

This year’s THREADS conference will focus on how to automate security. The goal of automating security is to ensure that security is never a roadblock, but a core part of development and operations. The success of automated security is essential to our ever more internetworked society and devices.

DAY 1: RESEARCH

The research portion of THREADS will discuss the latest academic and industrial advances in security automation for the identification of errors in programs and intrusions in networks. This will include dynamic and static analysis, symbolic execution and constraint solving, data flow tracking and fuzz testing, host and network monitoring, and related technologies. This research advances the state of the art in reasoning about applications and systems to discover security vulnerabilities, identify flaws in applications, and formulate effective defenses.

DAY 2: DEVELOPMENT

The development portion of THREADS will discuss strategies to integrate security into your development pipeline: what automated analysis tools are available, how to integrate them with developers, and how to provide feedback to developers that encourages reporting instead of assigning blame. Other sessions will show you how to add security monitoring triggers to existing monitoring infrastructure, and how to tune these triggers to information attackers want to steal. Our focus is on practical examples and lessons learned when automating security.

Join Us!

Join us on November 13-14, 2014 at NYU-Poly (5 Metrotech Center, Brooklyn, NY 11201) for the 3rd annual THREADS Conference. We’re incredibly excited about the lineup this year and will be announcing keynote and first round speakers soon.

About THREADS

THREADS is a conference that focuses on pragmatic security research and new discoveries in network attack and defense. It is part of NYU-Poly’s Cyber Security Awareness Week (CSAW) in Brooklyn, NY. THREADS is chaired by NYU-Poly Hacker in Residence and Trail of Bits CEO Dan Guido and the program committee includes Sergey Bratus (Dartmouth), Julien Vanegue (Bloomberg), John Viega (Silver Sky), Max Caceres (the D. E. Shaw group), Rajendra Umadas (Etsy), Phyllis Frankl (NYU-Poly), Tyler Bohan (NYU-Poly), Justin Cappos (NYU-Poly), and Travis Goodspeed (Bloomberg).

THREADS aims to present and discuss cutting edge, peer reviewed, industrial and academic research in computer and network security and focuses on advances in attack techniques and methodologies. We want to discuss what vulnerabilities exist and how attackers of today and tomorrow exploit those vulnerabilities.

Education Initiative Spotlight: Build it Break it

Build it Break it is a first-of-its-kind security challenge run by UMD

We’re proud to be a sponsor of the first Build it Break it programming contest, run by the University of Maryland (UMD) and supported by one of our own employees, Andrew Ruef, a PhD student at the university. Build it Break it is a “flipped CTF” where contestants both implement secure software and identify vulnerabilities in the creations of others. Points are awarded for the secure construction of software and for identifying security flaws.

The build-it, break-it, fix-it contest was conceived as a way to acquire useful scientific evidence, while at the same time engaging the student population and the wider community in a mentality of building security in rather than adding it after the fact. – Michael Hicks

At Trail of Bits, we think Build it Break it is a necessary addition to the suite of available competitions in the security community. There is a wealth of opportunities for students to learn to break software (many of which we support); however, there are relatively few that challenge them to build it right. In this unique contest, there is something for both builders and breakers, since it rewards both activities.

It also presents an opportunity for language evangelists to demonstrate the merits of their approach – if their language is “more secure” than others, it should come out on top in the contest and more implementations built with it will remain standing. Contestants can use any programming language or framework to write their software, so by observing the contest, the community gathers empirical evidence about the security or insecurity of available tools.

Any undergraduate or graduate student at a US-based university is eligible to compete for cash prizes in Build it Break it. Be warned, though, that Trail of Bits engineers will be on hand to assist as a “break it” team. For more information about the motivations behind this contest, see this blog post and slide deck from Michael Hicks.

Education Initiative Spotlight: CSAW Summer Program for Women

At Trail of Bits we are proud of our roots in academia and research, and we believe it is important to promote cyber security education for students of every academic level. We recently sponsored a High School Capture the Flag (CTF) event, we released a CTF Field Guide, and we are a regular part of Cyber Security Awareness Week (CSAW) at NYU-Poly.

CSAW Summer Program for High School Women

Recent graduates of the CSAW Summer Program for High School Women

As part of our ongoing collaboration with NYU-Poly, Trail of Bits is proud to be a part of the CSAW Program for High School Women. This program is a unique opportunity for teenage women from the New York City area to learn the fundamentals of computer science and cyber security. Qualifying young women from local high schools participate in two weeks of lecture and hands-on training breaking codes, hacking databases, and unraveling a digital forensics mystery.

“The Introduction to Computer Science and Cyber Security Summer Program aims to introduce young women to computer science in a safe and encouraging learning environment. Exposing young women to female role models and mentors in computer science allows the women to view the field as a viable career option.” – Summer of STEM

The CSAW Program for High School Women is led by Linda Sellie, an Industry Professor of Computer Science and Engineering at NYU-Poly. The complete program covers a variety of topics in computer science, including cryptography, databases, basics of computation, networking, and more. Trail of Bits’ Dan Guido, along with NYU-Poly Computer Science and Engineering Department Head, Nasir Memon, taught an overview of computers and career opportunities in the field of cyber security. Following this course module, students engaged the first level of Matasano’s MicroCorruption challenge to further explore low-level software security issues. Graduates of the program are prepared to compete in other challenges, such as CSAW’s High School Forensics and Capture the Flag competitions this fall.

The first summer program ran from July 7-18 and a second session begins today and will run through August 8th. For more information about the CSAW Summer Program for High School Women, see the press release from the NYU Engineering department.
