Rewriting Functions in Compiled Binaries

Aditi Gupta (Carnegie Mellon University)

September 02, 2019

cryptography, internship-projects, mcsema


As a summer intern at Trail of Bits, I built Fennec, a tool that automatically replaces function calls in compiled binaries. Fennec is built on top of McSema, a binary lifter developed by Trail of Bits.

The Problem

Let’s say you have a compiled binary, but you don’t have access to the original source code. Now, imagine you find something wrong with your program, or something you’d like to change. You could try to fix it directly in the binary—for example, by patching the file in a hex editor—but that becomes tedious very quickly. Instead, being able to write a C function and swap it in would massively speed up the process.

I spent my summer developing a tool that allows you to do so easily. Knowing the name of the function you want to replace, you can write another C function that you want to use instead, compile it, and feed it into Fennec, which will automatically create a new and improved binary.

A Cryptographic Example

To demonstrate what Fennec can do, let’s look at a fairly common cryptographic vulnerability that has shown up in the real world: the use of a static initialization vector in the CBC mode of AES encryption. In the very first step of CBC encryption, the first plaintext block is XOR’ed with an initialization vector (IV) and the result is encrypted. An IV is a block of 128 bits (the same as the AES block size) that should be used for only one message, so that identical plaintexts do not produce identical ciphertexts. Each resulting ciphertext block then plays the role of the IV for the next plaintext block: it is XOR’ed with that block before encryption.

Diagram of Cipher Block Chaining (CBC) mode encryption, showing the initialization vector XORed with the first plaintext block before each block cipher encryption step

This process becomes insecure when the same initialization vector is reused across messages. Under a fixed IV and key, messages that begin with the same block of plaintext will all begin with the same block of ciphertext. In other words, a static IV can allow an attacker to analyze multiple ciphertexts as a group rather than as individual messages. Below is an example of an IV being generated statically.

unsigned char *generate_iv() {
  return (unsigned char *)"0123456789012345";
}
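To make the effect of a constant IV concrete, here is a minimal sketch of CBC chaining. The block "cipher" is a toy byte-wise transform standing in for AES (an assumption made so the example needs no library); it is not secure, and the names `toy_block_encrypt` and `cbc_encrypt` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLOCK 16

/* Toy stand-in for AES: NOT secure, only here to make the chaining visible. */
static void toy_block_encrypt(const uint8_t key[BLOCK], uint8_t block[BLOCK]) {
  for (int i = 0; i < BLOCK; i++) {
    block[i] = (uint8_t)((block[i] ^ key[i]) + 0x3d);
  }
}

/* CBC: XOR each plaintext block with the previous ciphertext block
   (the IV for the first block), then run the block cipher on the result. */
void cbc_encrypt(const uint8_t key[BLOCK], const uint8_t iv[BLOCK],
                 const uint8_t *plain, uint8_t *cipher, size_t len) {
  assert(len % BLOCK == 0); /* padding is out of scope for this sketch */
  const uint8_t *prev = iv;
  for (size_t off = 0; off < len; off += BLOCK) {
    for (int i = 0; i < BLOCK; i++) {
      cipher[off + i] = plain[off + i] ^ prev[i];
    }
    toy_block_encrypt(key, cipher + off);
    prev = cipher + off; /* this ciphertext block chains into the next one */
  }
}
```

Encrypting two messages that share their first 16 bytes under the same key and IV yields identical first ciphertext blocks. Changing only the IV changes the first block, and through chaining every block after it.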

Sometimes, developers will use cryptography libraries like OpenSSL to do the actual encryption, but write their own functions to generate IVs. This can be dangerous, since non-random IVs can make AES insecure. I built Fennec to help fix issues like this one: it checks whether the IV generation function is random or static and, if necessary, replaces it with a function that generates a random IV.

The Process

The end goal was to lift executable binaries to LLVM bitcode with McSema and combine it with some LLVM manipulation to replace any function automatically. I started by understanding my cryptographic example and exploring different ways of patching binaries as a bit of background before getting started.

My first step was to work through a couple of the Matasano Cryptopals Challenges to learn about AES and how it can be used and broken. This stage of the project gave me working encryption and decryption programs in C, both of which called OpenSSL, as well as a few Python scripts to attack my implementations. My encryption program used a static IV generation function, which I was hoping to replace automatically later.

I kept using these C binaries throughout the summer. Then, I started looking at binary patching. I spent some time looking into both LD_PRELOAD and the Witchcraft Compiler Collection, which would work if my IV generation function was dynamically linked into the program. The goal of my project, however, was to replace function calls within binaries, not just dynamically loaded functions.
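For comparison, this is roughly what the LD_PRELOAD approach looks like. It is a sketch under two assumptions: the target resolves `generate_iv` through the dynamic linker, and `/dev/urandom` is available. The file and library names in the note below are hypothetical.

```c
#include <stdio.h>
#include <stdlib.h>

#define IV_LEN 16

/* Same name and signature as the function being overridden. A shared library
   listed in LD_PRELOAD is searched before the others, so this definition is
   the one that a dynamically resolved call to generate_iv reaches. */
unsigned char *generate_iv(void) {
  unsigned char *iv = malloc(IV_LEN);
  if (iv == NULL) {
    return NULL;
  }
  FILE *urandom = fopen("/dev/urandom", "rb");
  if (urandom == NULL) {
    free(iv);
    return NULL;
  }
  size_t got = fread(iv, 1, IV_LEN, urandom);
  fclose(urandom);
  if (got != IV_LEN) {
    free(iv);
    return NULL;
  }
  return iv; /* caller owns the buffer */
}
```

Built with something like `gcc -shared -fPIC -o libiv.so preload_iv.c` and run as `LD_PRELOAD=./libiv.so ./encrypt`, this only takes effect when the call goes through the dynamic linker. A `generate_iv` compiled directly into the executable is called without any lookup, which is the case this project targets.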

I didn’t want to complicate everything with lifted bitcode yet, so I started by using clean bitcode that was generated directly from source code. I wanted to run an LLVM pass on this bitcode to change the functionality of part of my program: namely, the part that generated an IV.

I started by trying to change the function’s bitcode directly in my pass, but soon moved to writing a new function in C and making my original program call that function instead. Every call to the old function would be replaced with a call to my new function.

After some experiments, I created an LLVM pass that would replace all calls to my old function with calls to a new one. Before moving to lifted bitcode, I added code to make sure I would still be able to call the original function if I wanted to. In my cryptographic example, this meant being able to check whether the original function was generating a static IV and, if so, replace it with the code below, as opposed to assuming it was insecure and replacing it no matter what.

#include <stdlib.h>
#include <openssl/rand.h>

// a stub function that represents the function in the original binary
unsigned char *generate_iv_original() {
  unsigned char *result = (unsigned char *)"";
  // the contents of this function do not matter
  return result;
}

unsigned char *random_iv() {
  unsigned char *iv = malloc(16);  // one 128-bit block
  RAND_bytes(iv, 16);  // an OpenSSL call
  return iv;
}

unsigned char *replacement() {
  unsigned char *original = generate_iv_original();
  for (int i = 0; i < 10; i++) {
    unsigned char *iv = generate_iv_original();
    if (iv == original) {  // same pointer returned again: the IV is static
      return random_iv();
    }
  }
  return original;
}
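The check in `replacement` compares pointers, which is enough when the original function returns the same string literal every time. A variant that compares the returned bytes would also catch a function that copies a constant into a different buffer on each call. The sketch below is one way to write that; `looks_static` and the two demo generators are hypothetical names, not part of Fennec.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define IV_LEN 16

typedef unsigned char *(*iv_generator)(void);

/* Returns 1 if `samples` consecutive calls all produce the same 16 bytes,
   0 otherwise. Compares contents, not addresses. Ownership of the returned
   buffers is ignored in this sketch. */
int looks_static(iv_generator generate, int samples) {
  assert(generate != NULL && samples >= 2);
  unsigned char first[IV_LEN];
  memcpy(first, generate(), IV_LEN);
  for (int i = 1; i < samples; i++) {
    if (memcmp(first, generate(), IV_LEN) != 0) {
      return 0;
    }
  }
  return 1;
}

/* Demo generator: alternates between two buffers, always the same bytes. */
unsigned char *fixed_bytes_iv(void) {
  static unsigned char bufs[2][IV_LEN];
  static int turn = 0;
  turn ^= 1;
  memcpy(bufs[turn], "0123456789012345", IV_LEN);
  return bufs[turn];
}

/* Demo generator: bytes change on every call (a counter, not randomness). */
unsigned char *counter_iv(void) {
  static unsigned char buf[IV_LEN];
  buf[0]++;
  return buf;
}
```

Here `looks_static(fixed_bytes_iv, 10)` reports a static IV even though consecutive calls return different addresses, while `looks_static(counter_iv, 10)` does not. A changing IV is not necessarily an unpredictable one, so this check only rules out the constant case.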

With my tool working on clean bitcode, it was time to start looking at lifted bitcode. I familiarized myself with how McSema worked by lifting and recompiling binaries and looking through the intermediate representation. Because McSema changes the way functions are called, it took some extra effort to make my tool work on lifted bitcode in the same way that it had on clean bitcode. I had to lift both the original binary and the replacement with McSema. Additional effort was required because the replacement function in a non-lifted binary doesn’t follow McSema’s calling conventions, so it couldn’t be swapped in trivially.

Disassembly comparison showing the original generate_iv function (left) and its McSema-lifted equivalent sub_400df0_generate_iv (right)

Function names and types are more complex after lifting with McSema, but I eventually arrived at a working procedure. As with the tool for clean bitcode, the original function can be kept and called from the replacement.

The last step was to generalize my process and wrap everything into a command line tool that others could use. I ran it on a variety of targets (including stripped binaries and dynamically-loaded functions), added automated tests, and verified the installation process.

The Function Replacement Pass

The complete process consists of three primary steps: 1) lifting the binaries to bitcode with McSema, 2) using an LLVM pass to carry out the function replacement within the bitcode, and 3) recompiling a new binary. The LLVM pass is the core of this tool, as it actually replaces the functions. The pass iterates through each instruction in the program and checks whether it is a call to the function we want to replace:

for (auto &B : F) {
  for (auto &I : B) {
    // check if instruction is call to function to be replaced
    if (auto *op = dyn_cast<CallInst>(&I)) {
      // getCalledFunction() returns null for indirect calls
      auto function = op->getCalledFunction();
      if (function != nullptr) {
        auto name = function->getName();
        if (name == OriginalFunction) {
...

Then, we find the replacement function by looking for a new function with the specified name and same type as the original.

Type *retType = function->getReturnType();
FunctionType *newFunctionType =
  FunctionType::get(retType, function->getFunctionType()->params(), false);

// create new function
newFunction = (Function *)(F.getParent()->getOrInsertFunction(ReplacementFunction, newFunctionType));

The next step is to pass the original function’s arguments to the new call.

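CallSite CS(&I);

// get args to original function to be passed to replacement
std::vector<Value *> arguments;

for (unsigned int i = 0; i < CS.arg_size(); i++) {
  arguments.push_back(CS.getArgument(i));
}

// reconstructed from context: create the call to the replacement,
// inserted just before the original call
newCall = CallInst::Create(newFunction, arguments, "", &I);

// point every user of the original call's result at the new call
for (auto &U : op->uses()) {
  User *user = U.getUser();
  user->setOperand(U.getOperandNo(), newCall);
}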

The Complete Tool

Although the LLVM pass does the work of replacing a given function, it is wrapped with the other steps in a bash script that implements the full process. First, we disassemble and lift both input binaries using McSema.

Bash script that generates a .cfg file and lifts it to bitcode for both the original and replacement binaries with mcsema-disass and mcsema-lift — Lifts binaries with McSema

Next, we analyze and tweak the bitcode to find the names of the functions as McSema represents them. This section of code includes support for both dynamically-loaded functions and stripped binaries, which affect the names of functions. We need to know these names so that we can pass them as arguments to the LLVM pass when we actually do the replacement. If we were to look for the names from the original binary, the LLVM pass wouldn’t be able to find any matching functions, since we’re using lifted bitcode.

Bash script that finds the McSema-generated names of the functions to be replaced, with handling for library functions and stripped binaries — Finds the names of functions to be replaced
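As an illustration of the name matching involved, the sketch below checks whether a lifted name has the form `sub_<hex address>_<original name>`. That pattern is assumed from the single example shown earlier, `sub_400df0_generate_iv`; how Fennec's script handles library functions and stripped binaries is not reproduced here, and `lifted_name_matches` is a hypothetical helper.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Returns 1 if `lifted` has the form sub_<hex digits>_<original>, else 0. */
int lifted_name_matches(const char *lifted, const char *original) {
  assert(lifted != NULL && original != NULL);
  if (strncmp(lifted, "sub_", 4) != 0) {
    return 0;
  }
  const char *p = lifted + 4;
  if (!isxdigit((unsigned char)*p)) {
    return 0; /* need at least one address digit */
  }
  while (isxdigit((unsigned char)*p)) {
    p++;
  }
  if (*p != '_') {
    return 0; /* no name suffix after the address */
  }
  return strcmp(p + 1, original) == 0;
}
```

The comparison is exact on the suffix, so `sub_400df0_generate_iv_original` does not match `generate_iv`. That distinction matters when the stub and the original function have similar names.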

Finally, we run the pass. If we don’t need access to the original function, we only need to run the pass on the original binary. If, however, we want to call the original function from the replacement, we run the pass on both the original binary and the replacement binary. In this second case, we are replacing the original function with the replacement function, and the stub function with the original function. Lastly, we recompile everything to a new working binary.

Bash script that runs the LLVMReplaceFunction opt pass on the original and replacement bitcode, then recompiles the result into a new binary — Runs the pass and compiles a new binary from updated bitcode
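The two replacements in the second case can be pictured with a toy model in which each module is just a list of call sites. This is only a sketch of the bookkeeping, not of the LLVM pass; `struct call_site` and `replace_calls` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One call site in a toy "module": who calls, and which name is called. */
struct call_site {
  const char *caller;
  const char *callee;
};

/* Redirect every call to `from` so that it calls `to` instead.
   Returns the number of call sites changed. */
size_t replace_calls(struct call_site *sites, size_t n,
                     const char *from, const char *to) {
  assert(sites != NULL || n == 0);
  size_t changed = 0;
  for (size_t i = 0; i < n; i++) {
    if (strcmp(sites[i].callee, from) == 0) {
      sites[i].callee = to;
      changed++;
    }
  }
  return changed;
}
```

Applied to the original module with (`generate_iv`, `replacement`) and to the replacement module with (`generate_iv_original`, `generate_iv`), the program ends up calling `replacement`, which in turn calls the real `generate_iv` where it used to call the stub. In this toy model, running the two replacements on separate lists is what keeps the second rewrite from being redirected by the first.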

Results

Fennec uses binary lifting and recompilation to make a difficult problem relatively manageable. It’s especially useful for fixing security bugs in legacy software, where you might not have access to source code.

Using this tool, it becomes possible to automatically fix a cryptographic IV vulnerability. As seen below, the original binary encrypts a message identically each time using a static IV. After running Fennec, however, the newly created binary uses a different IV on every run, thereby producing a unique ciphertext each time, even for the same plaintext.

# Original binary
aditi@nessie:~/ToB-Summer19$ ./encrypt ""
MDEyMzQ1Njc4OTAxMjM0NQ==/reJh+5rktBatDpyuJNQEBo++0pyIRGZiNsmZkN09HTPIOBVqQ9ov6CrxPXO7dC4cUJGYzBEsejHuTQyjVQh+XsLCHyDkURmfCuJ+a97raPY+o8pKKt8yf/xTmYMtyq2zf7EQxqPxv2bXKdP+6K+h9KyuO3q4+3JbuJFTesNLy8Np1m9ShJ9UAHvAdO6LCZvQ
N91kz0ytIH+s7LgajIWyises+yz26UBQwOzZLeLcQp4=
176

aditi@nessie:~/ToB-Summer19$ ./encrypt ""
MDEyMzQ1Njc4OTAxMjM0NQ==/reJh+5rktBatDpyuJNQEBo++0pyIRGZiNsmZkN09HTPIOBVqQ9ov6CrxPXO7dC4cUJGYzBEsejHuTQyjVQh+XsLCHyDkURmfCuJ+a97raPY+o8pKKt8yf/xTmYMtyq2zf7EQxqPxv2bXKdP+6K+h9KyuO3q4+3JbuJFTesNLy8Np1m9ShJ9UAHvAdO6LCZvQ
N91kz0ytIH+s7LgajIWyises+yz26UBQwOzZLeLcQp4=
176

aditi@nessie:~/ToB-Summer19$ ./encrypt ""
MDEyMzQ1Njc4OTAxMjM0NQ==/reJh+5rktBatDpyuJNQEBo++0pyIRGZiNsmZkN09HTPIOBVqQ9ov6CrxPXO7dC4cUJGYzBEsejHuTQyjVQh+XsLCHyDkURmfCuJ+a97raPY+o8pKKt8yf/xTmYMtyq2zf7EQxqPxv2bXKdP+6K+h9KyuO3q4+3JbuJFTesNLy8Np1m9ShJ9UAHvAdO6LCZvQ
N91kz0ytIH+s7LgajIWyises+yz26UBQwOzZLeLcQp4=
176

aditi@nessie:~/ToB-Summer19$ bash run.sh 2 ../mcsema-2.0.0-ve/remill-2.0.0/remill-build-2/ /home/aditi/ToB-Summer19/ida-6.9/idal64 encrypt replaceIV generate_iv replacement generate_iv_original -lcrypto

# Fennec's modified binary
aditi@nessie:~/ToB-Summer19$ ./encrypt.new ""
L+PYRFiOKMcu18hSqdGQEw==/aK2hYm/GXHwA2tqZxPmoNccQwW+Zhj7E0PQUSRF+lOLJiEMwOc7yv+/Z2AA0pEJjP7Jq4lHMpq2eIVl73lvav0pJiVlOcmfnFwQ9cu0MW0EWqUdgl2FCsWKtO/TAfGhcQPopJyvP8KD/LHlru4QIfZiym7//tt0V9vvabFCLNiSTRG350XKO/zoydeuRFfSu
0HmNNQbAcLSQkcUETH424RyQ4SxmcreW3krOw30kfJY=
176

aditi@nessie:~/ToB-Summer19$ ./encrypt.new ""
hYnowxN2Z3QyPIzwNaFzJw==/pzCq+V1q5ipHoqJXZ9MaeDr+nMdV5E1RbeI+YrcQqXjFHcVmDSq4yZboEuIJJjkbNbdO5DG6n3CQnZ1C7CumGdaZsddaYJueORROk7X+PnQZUq5bKqvdN7ZJEhK7qaerjogOF4TAotDV3ryLC6l/EWY01DkhGrf0hlXAkjQnOz28lXF40GNMd6pIjcoIbZze
V72v5s5q67fVdKdCzVE3BH76qX8qYS9YnN5JkGLERYA=
176

aditi@nessie:~/ToB-Summer19$ ./encrypt.new ""
r3/wMu5nD3rEFn7N88fCjQ==/MisK9RcK8RLsqjV2nrAfprghBYrBmeJS3FbJ4YG6zHBk+uA0CcZ+R4CSDolAaAPlCmkupfxy6bFHNEqyMVv7moPaiJEAkHDDU/FKen8eAJjMvz9+RK+xmQja238jk7xmaS6JbJOdh8teQ2XiMzlHsBYBVpw89UBFrTqOSN8qtlgU3aR4xUVlwZAA1+Pg2GHy
2CIWQI6ioHGDhN3P3po7MaOldJAgHGZO5d2GluroI70=
176
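One way to read these outputs: the first 24 characters of each first line look like the Base64 encoding of a 16-byte IV, followed by a `/` separator. That layout is an inference from the sample output rather than something documented here. The small decoder below (hypothetical helper names, standard Base64 alphabet) can be used to check it.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Map one Base64 character to its 6-bit value, or -1 if it is not one. */
static int b64_value(char c) {
  if (c >= 'A' && c <= 'Z') return c - 'A';
  if (c >= 'a' && c <= 'z') return c - 'a' + 26;
  if (c >= '0' && c <= '9') return c - '0' + 52;
  if (c == '+') return 62;
  if (c == '/') return 63;
  return -1;
}

/* Decode the first `in_len` characters of `in` (standard Base64 with '='
   padding) into `out`. Returns the number of bytes written, or -1 on
   malformed input. `out` must hold at least in_len / 4 * 3 bytes. */
int b64_decode(const char *in, size_t in_len, unsigned char *out) {
  assert(in != NULL && out != NULL);
  if (in_len % 4 != 0) return -1;
  int written = 0;
  for (size_t i = 0; i < in_len; i += 4) {
    int v[4];
    int pad = 0;
    for (int j = 0; j < 4; j++) {
      if (in[i + j] == '=') {
        /* padding is only valid in the last two slots of the final group */
        if (i + 4 != in_len || j < 2) return -1;
        v[j] = 0;
        pad++;
      } else {
        if (pad > 0) return -1; /* data after padding */
        v[j] = b64_value(in[i + j]);
        if (v[j] < 0) return -1;
      }
    }
    unsigned int triple = (unsigned int)v[0] << 18 | (unsigned int)v[1] << 12 |
                          (unsigned int)v[2] << 6 | (unsigned int)v[3];
    out[written++] = (unsigned char)(triple >> 16);
    if (pad < 2) out[written++] = (unsigned char)(triple >> 8);
    if (pad < 1) out[written++] = (unsigned char)triple;
  }
  return written;
}
```

Decoding `MDEyMzQ1Njc4OTAxMjM0NQ==` from the original binary's output gives the 16 bytes `0123456789012345`, the same constant returned by the static `generate_iv` shown earlier. The corresponding prefix in each `encrypt.new` run decodes to a different 16-byte value.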

You can download Fennec and find instructions for its use in the Fennec repository on GitHub.

If you have questions or comments about the tool, you can find Aditi on Twitter at @aditi_gupta0!