McSema: I’m liftin’ it

McSema, our x86 machine code to LLVM bitcode binary translator, just got a fresh coat of paint. Last week we held a successful hackathon that produced substantial improvements to McSema’s usability, documentation, and code quality. It’s now easier than ever to use McSema to analyze and reverse-engineer binaries.

Growth stage

We use McSema on a daily basis. It lets us find and retroactively harden binary programs against security bugs, independently validate vendor source code, and generate application tests with high code coverage. It is part of ongoing research, both in academia and in DARPA programs. We (and others) are constantly extending it to analyze increasingly complex programs.

You could say that McSema has been on a growth spurt since we open-sourced it in 2014. Back then, LLVM 3.5 was new and shiny and that’s what McSema used. And that’s what it used in 2015. And in 2016. McSema stretched and grew, but some things stagnated. Over time an ache developed — a desire to modernize and to polish things off. Last week we massaged those growing pains away during our McSema usability hackathon.

Paying dividends

We made broad improvements to McSema. The code is cleaner than ever. It’s easier to install and is more portable. It runs faster and the code it produces is better.

Performance

McSema builds much faster than before. We simplified the build system by removing dead code and unneeded libraries, and by reorganizing the directory layout to be more descriptive.

McSema is faster at producing bitcode. We improved how McSema traverses the control flow graph, removed dependencies on Boost, and simplified bitcode generation.

McSema generates leaner and faster bitcode. McSema no longer stores and spills register context on entry and exit to functions. Flag operations use faster natural bitwidth operations instead of bit fields. McSema can now optimize the lazily generated bitcode to eliminate unused computations. The optimized bitcode is easier to analyze and truer to the intent of the original program.

Modernization

McSema now uses a stock distribution of LLVM 3.8. Previously, McSema used a custom modified version of LLVM 3.5. This upgrade brings in faster build times and more modern LLVM features. We have also eliminated McSema’s dependency on Boost, opting to use modern C++11 features instead.

Simplifications

The new command-line interface is more consistent and easier to use: mcsema-disass disassembles binaries, and mcsema-lift converts the disassembly into LLVM bitcode.

We removed bin_descend, our custom binary disassembler. There is now only one supported decoder that uses IDA Pro as the disassembly engine.

The new code layout is simpler and more intuitive. The CMake scripts to build McSema are now smaller and simpler.

The old testing framework has been removed in favor of an integration testing based approach with no external dependencies.

New Features

McSema supports more instructions. We are always looking for help adding new instruction semantics, and we have updated our instruction addition guide.

Mcsema will now tell you which instructions are supported and which are not, via the mcsema-lift --list-supported command.

The new integration testing framework allows for easy addition of comprehensive translation tests, and there is a new guide about adding tests to McSema.

Documentation

Our new documentation describes in detail how to install, use, test, extend, and debug McSema’s codebase. We have also documented common errors and how to resolve them. These improvements will make it easier for third-parties to hack on McSema.

Runtime

McSema isn’t just for static analysis. The lifted bitcode can be compiled back into a runnable program. We improved McSema’s runtime footprint, making it faster, greatly reducing its memory usage, and making it able to seamlessly interact with native Windows and Linux code in complex ways.

Investing in the future

We will continue to invest in improving McSema. We are always expanding support for larger and more complex software. We hope to move to Binary Ninja for control flow recovery instead of IDA Pro. And we plan to add support for lifting ARM binaries to LLVM bitcode. We want to broaden McSema’s applicability to include analyzing mobile apps and embedded firmware.

We are looking for interns that are excited about the possibilities of McSema. Looking to get started? Try out the walkthrough of translating a real Linux binary. After that, see how McSema can enable tools like libFuzzer to work on binaries. Finally, contact us and tell us where you’d like to take McSema. If we like it and you have a plan then we will pay you to make it happen.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s