Announcing the community-oriented osquery fork, osql

For months, Facebook has been heavily refactoring the entire osquery codebase, migrating osquery away from standard development tools like CMake and integrating it with Facebook’s internal tooling. Their intention was to improve code quality, implement additional tests, and move the project to a more modular architecture. In practice, the changes sacrificed support for a number of architectures, operating systems, and a variety of useful developer tools that integrate well only with the standard build system preferred by the open-source C++ community.

Worse still, the project’s new inward focus has greatly delayed the review of community contributions — effectively stalling development of features or fixes for the needs of the community — without a clear end in sight. Lacking a roadmap or predictable release cycle, user confidence in the project has fallen. Enterprises are postponing their planned osquery deployments and searching for alternative solutions.

Many of the most secure organizations in the world have already invested in making osquery the absolute best endpoint management solution for their needs. Being forced to look elsewhere would waste that investment and leave them relying on less effective alternatives. That is why we are announcing the community-oriented osquery fork: osql.

What are the goals of osql?

With osql, we are committed to restoring the community’s confidence in the osquery project, to making the development process more open and predictable, and to reviewing and accepting community contributions more quickly. Our goal is to restore direct community participation.

An open and transparent development process

In the immediate term, osql will be maintained as a “soft-fork.” We will closely track Facebook’s upstream updates without diverging from the codebase. Plenty of completed work is simply waiting upstream, in Pull Requests. We prepared a workflow through which the osql project can accept Pull Requests that the community deems stable enough to be shipped, but which have been ignored by the upstream maintainers. The community can pick and choose its priorities from those contributions, and incorporate them into the next release of osql.


The osql organization on GitHub will be a hub for community projects

Continuous Integration, Continuous Delivery

We’ve also integrated a much-needed public CI using Azure Pipelines, which will build and run tests at each commit. Find the results here. The CI will help us build, test, and release faster and more frequently. We are committing to release a new osql binary (package installer) on a regular monthly cadence, and we will communicate the changes users can expect in each release. Users will know when to expect a release, and that the version they download has passed all tests.


Determine if the latest code is building for all platforms, at a glance

Restoring standard tool support for developers

We rewrote the build system from scratch to return it to CMake, the C++ community’s de facto standard for building projects. This effort was non-trivial, but we believe it was central to preserving the project’s compatibility with open-source toolchains. The libraries and tools that represent the foundation of modern C++ development, such as Boost or the LLVM/Clang compiler toolchain, all support CMake natively. The most-used third-party libraries use CMake as well, making it quite easy to include them in a CMake-based project.

Developers benefit from built-in CMake support in their IDEs. Visual Studio, VS Code, CLion and QtCreator can all easily open a project from its CMakeLists file, enabling a precise view of the project’s structure and the outputs of its build process. They’ll also regain the convenience of CMake-supporting static analyzer frameworks, like Clang’s scan-build, which helps discover critical bugs across an entire project.

By re-centering everything around a CMake build process, we made osql a more developer-friendly project than upstream osquery. If you would like to see for yourself and begin contributing to osql, check out the build guide.


Work conveniently in the Visual Studio Code IDE, with CMake integration

What’s next for osql

Our work is just beginning! We plan to continue improving the automation of osql releases. Initially, osql releases will be unsigned binaries/packages. The next priority for the project is to add a secure code-signing step to the CI procedure, so that every release is a binary signed by the “osql” organization.

The osquery project’s build process used to allow you to choose whether to download or to build third-party dependencies, thanks to easily modifiable Homebrew formulas. Not only that, you could also choose from where these dependencies were downloaded. That is no longer true of upstream osquery, but we will restore that ability in osql (a task made easier thanks to CMake).

We also plan to extend the public CI for osql to enable it to test PRs opened against upstream osquery. This will help the community review those PRs, and provide a kind of quality assurance for their inclusion in a future release of osql.

In the longer term, thanks to CMake’s support for building on various platforms, it will be possible for osql to be built for whatever new systems the community demands.

Want More? Let’s Talk

When we originally ported osquery to Windows, we didn’t imagine it would become so big, or that it would outgrow what Facebook alone could maintain. A whole community of organizations now deploy and depend on osquery. That’s why we’ve launched osql, the community-oriented osquery fork. If you are part of this community and are interested in porting to other platforms, need special features from the project, or want some customization done to the core, join our osquery/osql support group or contact us!

Announcing QueryCon 2019

Exciting news: We’re hosting the second annual QueryCon on June 20th-21st in New York City, co-sponsored by Kolide and Carbon Black!

Register here

QueryCon has become the foremost event for the osquery and osql open-source community. QueryCon brings together core maintainers, developers, and end-users to teach, discuss, and collaborate on Facebook’s award-winning open-source endpoint detection tool.

Last year’s inaugural conference (hosted by Kolide in San Francisco) boasted 120 attendees, 16 speakers, and talk topics ranging from ‘super features’ to ‘the extensions skunkworks’ to ‘catching everything with osquery events.’ This year, we’re switching coasts and growing the event in honor of the growing community. Join us for what is sure to be a great event!

Event details

Conference room at the venue

  • When: June 20th – 21st
  • Where: Convene at 32 Old Slip in downtown Manhattan, just steps from Wall Street and the New York Stock Exchange.
  • What to expect:
    • Two days of talks by osquery and osql engineers, users, and fans — no sales talks
    • Structured time to discuss and collaborate on fixing issues and improving the project
    • Networking with users and experts
    • Sponsored afterparty in downtown Manhattan

Learn more and register

Make sure to buy your tickets ASAP — last year’s event sold out!

Call for Papers

Would you like to be a featured speaker at this year’s QueryCon? You’re in luck: Speaker slots are still open.

Apply here!

About Trail of Bits

It’s no secret that we are huge fans of osquery. From porting osquery to Windows in 2016 to launching our osquery extension repo last year, we’ve been one of the leading contributors to the tool’s development.

Trail of Bits helps secure the world’s most targeted organizations and products. We combine high-end security research with a real-world attacker mentality to reduce risk and fortify code.

We’re a security research and engineering firm headquartered in New York City. Our engineering services team works closely with business customers in tech, defense, and finance on quick-response feature development, bug fixes, and integration of the tools they depend on for endpoint detection and response, event log aggregation, secure software updates, and security testing.

We leverage the best of open-source software for our work, and regularly contribute enhancements to these projects as a result. In this way, we plan to bring projects like osquery, Santa, Omaha and StreamAlert to parity with the leading proprietary alternatives.

User-Friendly Fuzzing with Sienna Locomotive

Fuzzing is a great way to find bugs in software, but many developers don’t use it. We hope to change that today with the release of Sienna Locomotive, a new open-source fuzzer for Windows that emphasizes usability. Sienna Locomotive aims to make fuzzing accessible to developers with limited security expertise. Its user-oriented features make it easy to configure, easy to run, and easy to interpret the results.

Fuzzing is Underused

At Trail of Bits, we use state-of-the-art program analysis tools every day, but even those can’t replace the bug-finding potential of random fuzz testing.

Explicitly testing software for bugs is incredibly important. Many open-source developers once believed in the “many eyes” hypothesis: that open-source projects would be less susceptible to security issues because anyone could search the code for bugs. The past few years have shown that this only applies to high-value targets, and even then, only in a limited capacity. For example, OpenSSL was running on at least two-thirds of web servers in 2014, but it still took over two years for researchers to discover the Heartbleed flaw. Open-source software makes up so much of the internet that we can’t afford to count on that kind of luck again. As Google aptly put it when they announced OSS-Fuzz: “It is important that the open source foundation be stable, secure, and reliable, as cracks and weaknesses impact all who build on it.”

We asked ourselves: why don’t developers fuzz their own software, and what can we do to change that?

Barrier to Entry

One likely reason for fuzzers’ disuse is that they can be relatively difficult to use, especially on Windows. WinAFL (the de facto standard), in particular, places fairly strict requirements on the functions it can target. If your code doesn’t meet these constraints, WinAFL won’t work correctly. It doesn’t help that most fuzzers are designed for Unix platforms, despite Windows’ 75+% market share. It’s easy to understand why developers often do not bother setting up a fuzzer to test their Windows software.

What to Do?

We believe that security tools only succeed if they’re actually used. Fuzzing techniques that improve code coverage or increase executions per second are the subject of new research at almost every security conference, but these improvements are moot if the code gathers dust on a shelf. We saw this as an opportunity to improve the state of security in a different vein: instead of building a smarter or faster fuzzer, we would build a fuzzer with a lower barrier to entry.

Introducing Sienna Locomotive

To address these problems, we engineered Sienna Locomotive with three features that make fuzzing as painless as possible.

Easy Configuration

We want Sienna Locomotive to be usable for testing a wide variety of software, so we made it easy for developers to tailor the fuzzer to their specific application. New targets can be configured with just the path to the executable and a command line string. Other than setting timeouts for applications that don’t exit on their own, there are very few settings the user needs to configure.

Configuring the Fuzzgoat application shown in the demo video

Powerful Function Targeting

Developers aren’t limited in which functions they can target, nor do they have to modify the binary to target a function for fuzzing. Sienna Locomotive runs the target application once, scans for functions that take user input (like ReadFile or fread), and allows the developer to select individual function calls to target. This is especially useful when fuzzing a program that makes incremental reads, because it allows the user to fuzz only one specific portion of the file.

A sample function targeting window for a built-in Windows utility
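
To make the incremental-read case concrete, here is a small, hypothetical C target (not code from Sienna Locomotive). After a dry run, Sienna Locomotive would list both fread() call sites, and a developer could choose to fuzz only the second one:

#include <stdio.h>

/* Hypothetical parser that makes incremental reads. Targeting only the
 * second fread() call mutates the body while leaving the header intact. */
int parse_file(const char *path) {
    FILE *f = fopen(path, "rb");
    if (!f) return -1;

    unsigned char header[8];
    if (fread(header, 1, sizeof(header), f) != sizeof(header)) { /* read #1 */
        fclose(f);
        return -1;
    }

    unsigned char body[256];
    size_t n = fread(body, 1, sizeof(body), f);                  /* read #2 */
    fclose(f);

    /* ... validate header and parse the first n bytes of body ... */
    return (int)n;
}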

Helpful Crash Triage

Fuzzers produce a myriad of crashes with varying severity. You want to debug the most critical crashes first, but how can you tell where to start? Tools like Breakpad and !exploitable analyze a crashing program to estimate the severity of the crash, which helps developers decide which crashes to debug first. In September, we open-sourced Winchecksec, a component of Sienna Locomotive that helps power our triaging system. Sienna Locomotive combines a reimplementation of the heuristics used by Breakpad and !exploitable, augmented with information from Winchecksec and a custom taint tracer, to estimate the severity of each crash and help developers prioritize which crashes to debug and fix first.

An export directory containing triaged crash information

Will Sienna Locomotive Work for You?

If you describe yourself as more of a Windows developer than a security expert, Sienna Locomotive may be the right tool for you. With relatively minimal effort, it can help you test your code against a much larger space of mutated inputs than you could ever write unit tests for. Depending on how you’ve structured your program, you may also be able to make testing during development more efficient by only fuzzing newly implemented features.

Sienna Locomotive makes some performance tradeoffs for the sake of usability. If you’re more interested in test case throughput than usability, or you’re looking for bugs in Chrome and need to perform thousands of iterations per second, Sienna Locomotive isn’t for you.

Try it Out!

We think that Sienna Locomotive will improve the state of Windows software security by making it easier for developers to test their code via fuzzing. To try out Sienna Locomotive for yourself, download a prebuilt binary from the releases page on our GitHub Repo, or follow the instructions in the readme to build it yourself. If you’d like to help make Sienna Locomotive better, visit our issues page.

Performing Concolic Execution on Cryptographic Primitives

Alan Cao

For my winternship and springternship at Trail of Bits, I researched novel techniques for symbolic execution on cryptographic protocols. I analyzed various implementation-level bugs in cryptographic libraries, and built a prototype Manticore-based concolic unit testing tool, Sandshrew, that analyzes C cryptographic primitives in a combined symbolic and concrete environment.

Sandshrew is a first step toward letting crypto developers easily create powerful unit test cases for their implementations, backed by advancements in symbolic execution. While it can be used as a security tool to discover bugs, it can also be used as a framework for cryptographic verification.

Playing with Cryptographic Verification

When choosing and implementing crypto, we should place our trust in implementations that have been formally verified. This is crucial, since crypto implementations often introduce new classes of bugs, like bignum vulnerabilities, which can appear probabilistically. By ensuring verification, we are also ensuring the functional correctness of our implementation.

There are a few ways we could check our crypto for verification:

  • Traditional fuzzing. We can use fuzz testing tools like AFL and libFuzzer. This is not optimal for coverage, as finding deeper classes of bugs requires time. In addition, since they are random tools, they aren’t exactly “formal verification,” so much as a stochastic approximation thereof.
  • Extracting model abstractions. We can lift source code into cryptographic models that can be verified with proof languages. This requires learning purely academic tools and languages, and having a sound translation.
  • Just use a verified implementation! Instead of trying to prove our code, let’s just use something that is already formally verified, like Project Everest’s HACL* library. This strips away configurability when designing protocols and applications, as we are limited to what the library offers (e.g., HACL* doesn’t implement Bitcoin’s secp256k1 curve).

What about symbolic execution?

Because it can exhaustively explore all paths in a program, symbolic execution can be very beneficial for analyzing cryptographic libraries. It can efficiently discover bugs, guarantee coverage, and ensure verification. However, this is still an immense area of research that has yielded only a handful of working implementations.

Why? Because cryptographic primitives often rely on properties that a symbolic execution engine may not be able to emulate. This can include the use of pseudorandom sources and platform-specific optimized assembly instructions. These contribute to complex SMT queries passed to the engine, resulting in path explosion and a significant slowdown during runtime.

One way to address this is by using concolic execution. Concolic execution mixes symbolic and concrete execution, where portions of code execution can be “concretized,” or run without the presence of a symbolic executor. We harness this ability of concretization in order to maximize coverage on code paths without SMT timeouts, making this a viable strategy for approaching crypto verification.

Introducing sandshrew

After realizing the shortcomings in cryptographic symbolic execution, I decided to write a prototype concolic unit testing tool, sandshrew. sandshrew verifies crypto by checking equivalence between a target unverified implementation and a benchmark verified implementation through small C test cases. These are then analyzed with concolic execution, using Manticore and Unicorn to execute instructions both symbolically and concretely.

Fig 1. Sample OpenSSL test case with a SANDSHREW_* wrapper over the MD5() function.

Writing Test Cases

We first write and compile a test case that tests an individual cryptographic primitive or function for equivalence against another implementation. The example shown in Figure 1 tests for a hash collision for a plaintext input, by implementing a libFuzzer-style wrapper over the MD5() function from OpenSSL. Wrappers signify to sandshrew that the primitive they wrap should be concretized during analysis.
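
Figure 1 is not reproduced here, so the following is only a rough sketch of what such a test case might look like; the SANDSHREW_MD5 wrapper name and the harness details are illustrative assumptions rather than sandshrew’s exact conventions:

#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <openssl/md5.h>

/* Hypothetical SANDSHREW_* wrapper: tells sandshrew to concretize MD5(). */
unsigned char *SANDSHREW_MD5(const unsigned char *d, size_t n, unsigned char *md) {
    return MD5(d, n, md);
}

/* libFuzzer-style entry point: assert that a different plaintext never
 * produces the same digest as the reference input (a hash collision). */
int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size) {
    const unsigned char reference[] = "some known plaintext";
    unsigned char d1[MD5_DIGEST_LENGTH], d2[MD5_DIGEST_LENGTH];

    SANDSHREW_MD5(data, size, d1);
    SANDSHREW_MD5(reference, sizeof(reference) - 1, d2);

    if (size != sizeof(reference) - 1 || memcmp(data, reference, size) != 0)
        assert(memcmp(d1, d2, MD5_DIGEST_LENGTH) != 0);
    return 0;
}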

Performing Concretization

Sandshrew leverages a symbolic environment through the robust Manticore binary API. I implemented the manticore.resolve() feature for ELF symbol resolution and used it to determine memory locations for user-written SANDSHREW_* functions from the GOT/PLT of the test case binary.

Fig 2. Using Manticore’s UnicornEmulator feature in order to concretize a call instruction to the target crypto primitive.

Once Manticore resolves the wrapper functions, hooks are attached to the target crypto primitives in the binary for concretization. As seen in Figure 2, we then harness Manticore’s Unicorn fallback instruction emulator, UnicornEmulator, to emulate the call instruction made to the crypto primitive. UnicornEmulator concretizes symbolic inputs in the current state, executes the instruction under Unicorn, and stores modified registers back to the Manticore state.

All seems well, except for one question: if all the symbolic inputs are concretized, what is left for the solver to work on after the call instruction has been concretized?

Restoring Symbolic State

Before our program tests implementations for equivalence, we introduce an unconstrained symbolic variable as the returned output from our concretized function. This variable guarantees a new symbolic input that continues to drive execution, but does not contain previously collected constraints.

Mathy Vanhoef (2018) takes this approach in his symbolic analysis of the WPA2 protocol. We do this to avoid the problem of timeouts due to complex SMT queries.

Fig 3. Writing a new unconstrained symbolic value into memory after concretization.

As seen in Figure 3, this is implemented through the concrete_checker hook at the SANDSHREW_* symbol, which performs the unconstrained re-symbolication if the hook detects the presence of symbolic input being passed to the wrapper.

Once symbolic state is restored, sandshrew is then able to continue to execute symbolically with Manticore, forking once it has reached the equivalence checking portion of the program, and generating solver solutions.

Results

Here is Sandshrew performing analysis on the example MD5 hash collision program from earlier:

The prototype implementation of Sandshrew currently exists here. With it comes a suite of test cases that check equivalence between a few real-world implementation libraries and the primitives that they implement.

Limitations

Sandshrew has a sizable test suite for critical cryptographic primitives, but analysis still gets stuck on many of the test cases. This may be due to the large state space that must be explored for symbolic inputs. Arriving at a solution is probabilistic, as the Manticore z3 interface often times out.

With this, we can identify several areas of improvement for the future:

  • Add support for allowing users to supply concrete input sets to check before symbolic execution. With a proper input generator (e.g., radamsa), this potentially hybridizes Sandshrew into a fuzzer as well.
  • Implement Manticore function models for common cryptographic operations. This can increase performance during analysis and allow us to properly simulate execution under the Dolev-Yao verification model.
  • Reduce unnecessary code branching using opportunistic state merging.

Conclusion

Sandshrew is an interesting approach to the problem of cryptographic verification, and demonstrates the awesome features of the Manticore API for efficiently creating security testing tools. While it is still an experimental prototype, we invite you to contribute to its development, whether through optimizations or new example test cases.

Thank you

Working at Trail of Bits was an awesome experience, and offered me a lot of incentive to explore and learn new and exciting areas of security research. Working in an industry environment pushed me to understand difficult concepts and ideas, which I will take to my first year of college.

Fuzzing In The Year 2000

It is time for the second installment of our efforts to reproduce original fuzzing research on modern systems. If you haven’t yet, please read the first part. This time we tackle fuzzing on Windows by reproducing the results of “An Empirical Study of the Robustness of Windows NT Applications Using Random Testing” (aka ‘the NT Fuzz Report’) by Justin E. Forrester and Barton P. Miller, published in 2000.

The NT Fuzz Report tested 33 applications on Windows NT and an early release copy of Windows 2000 for susceptibility to malformed window messages and randomly generated mouse and keyboard events. Since Dr. Miller published the fuzzer code, we used the exact same tools as the original authors to find bugs in modern Windows applications.

The results were nearly identical: 19 years ago, 100% of tested applications crashed or froze when fuzzed with malformed window messages. Today, 93% of tested applications crash or freeze when confronted with the same fuzzer. Among the applications that did not crash was our old friend, Calculator (Figure 1). We also found a bug (but not a security issue) in Windows.

Figure 1: Bruised but not beaten. The recently open-sourced Windows Calculator was one of two tested applications that didn’t freeze or crash after facing off against the window message fuzzer from the year 2000. Calculator was resized after fuzzing to showcase artifacts of the fuzzing session.

A Quick Introduction to Windows

So what are window messages and why do they crash programs?

Windows applications that display a GUI are driven by events: a mouse move, a button click, a key press, etc. An event-driven application doesn’t do anything until it is notified of an event. Once an event is received, the application takes action based on the event, and then waits for more events. If this sounds familiar, it’s because the architecture is making a comeback in platforms like node.js.

Window messages are the event notification method in Windows. Each window message has a numeric code associated with a particular event. Each message has one or more parameters, by convention called lParam and wParam, that specify more details about the event. Examples of such details include the coordinates of mouse movement, what key was pressed, or what text to draw in a window. These messages can be sent by the program itself, by the operating system, or by other programs. They can arrive at any time and in any order, and must be handled by the receiving application.
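
As a minimal illustration (a generic sketch, not code from any application discussed here), a window procedure dispatches on the message code and interprets wParam and lParam according to that code; anything it does not recognize is handed to DefWindowProc:

#include <windows.h>
#include <windowsx.h>

/* Minimal window procedure: the message code identifies the event, and
 * wParam/lParam carry event-specific details. */
LRESULT CALLBACK WndProc(HWND hwnd, UINT msg, WPARAM wParam, LPARAM lParam) {
    switch (msg) {
    case WM_KEYDOWN:            /* wParam holds the virtual-key code */
        if (wParam == VK_ESCAPE)
            PostQuitMessage(0);
        return 0;
    case WM_LBUTTONDOWN: {      /* lParam packs the cursor coordinates */
        int x = GET_X_LPARAM(lParam);
        int y = GET_Y_LPARAM(lParam);
        (void)x; (void)y;       /* react to the click here */
        return 0;
    }
    default:                    /* any other message, from any sender, at any time */
        return DefWindowProc(hwnd, msg, wParam, lParam);
    }
}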

Security Implications

Prior to Windows Vista it was possible for a low-privilege process to send messages to a high-privilege process. Using the right combination of messages, it was possible to gain code execution in the high-privilege process. These “shatter attacks” have been largely mitigated since Vista with UIPI and by isolating system services in a separate session.

Mishandling of window messages is unlikely to have a security impact on modern Windows systems for two reasons. First, window messages can’t be sent over the network. Second, crashing or gaining code execution at the same privilege level as you already have is not useful. This was likely apparent to the authors of the NT Fuzz Report. They do not make security claims, but correctly point out that crashes during window message handling imply a lack of rigorous testing.

There are some domains where same-privilege code execution may violate a real security boundary. Some applications combine various security primitives to create an artificial privilege level not natively present in the operating system. The prime example is a browser’s renderer sandbox. Browser vendors are well aware of these issues and take steps to mitigate them. Another example is antivirus products. Their control panel runs with normal user privileges but is protected against inspection and tampering by other parts of the product.

Testing Methodology

We used the same core fuzzing code and methodology described in the original NT Fuzz Report to fuzz all applications in our test set. Specifically, in both SendMessage and PostMessage modes, the fuzzer used three iterations of 500,000 messages with the seed 42 and three iterations of 500,000 messages using the seed 1,337. We saw results after executing just one iteration of each method.
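
Conceptually, each fuzzing iteration looks something like the following simplified sketch (not the original Fuzz.cpp): pick a random message code and random parameters, then deliver them to the target window synchronously (SendMessage) or asynchronously (PostMessage).

#include <windows.h>
#include <stdlib.h>

/* Simplified sketch of a window message fuzzing loop. */
void fuzz_window(HWND hwnd, unsigned iterations, unsigned seed, BOOL post) {
    srand(seed);
    for (unsigned i = 0; i < iterations; i++) {
        UINT msg = (UINT)(rand() % 0x0400);   /* 0x0000-0x03FF: system message range */
        WPARAM wParam = (WPARAM)rand();
        LPARAM lParam = (LPARAM)rand();
        if (post)
            PostMessage(hwnd, msg, wParam, lParam);
        else
            SendMessage(hwnd, msg, wParam, lParam);
    }
}

With the parameters described above, a loop like this would run for 500,000 iterations per round, using seeds 42 and 1,337, in both modes.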

Fuzzing using the “random mouse and keyboard input” method was omitted due to time constraints and the desire to focus purely on window messages. We encourage you to replicate those results as well.

Caveats

Two minor changes were necessary to use the fuzzer on Windows 10. First was a tiny change to build the fuzzer on 64-bit Windows. The second change was enabling the fuzzer to target a specific window handle via a command line argument. Fuzzing a specific handle was a quick solution to the problem of fuzzing Universal Windows Platform (UWP) applications. The window message fuzzer is oriented to fuzzing windows belonging to a specific process, but UWP applications all display their UI via the same process (Figure 2). This meant that the fuzzer could not target the main window of UWP applications.

Figure 2: UWP application windows all belong to the same process (ApplicationFrameHost.exe). To fuzz these applications, the original NT fuzzer was modified to allow fuzzing of a user-specified window handle.

While modifying the fuzzer, a serious flaw was identified: the values selected for the two primary sources of randomized input, the lParam and wParam arguments to SendMessage and PostMessage, are limited to 16-bit integers. Both of the arguments are 32-bit on 32-bit Windows, and 64-bit on 64-bit Windows. The problem occurs in Fuzz.cpp, where the lParam and wParam values are set:

     wParam = (UINT) rand();
     lParam = (LONG) rand();

The rand() function returns a number in the range [0, 32767], greatly limiting the set of tested values. This bug was purposely preserved during evaluation, to ensure results were accurately comparable against the original work.
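
For reference, a fix could look something like this sketch (hypothetical, not used in our runs): accumulate several rand() calls so the full width of wParam and lParam is exercised.

#include <limits.h>
#include <stdlib.h>
#include <windows.h>

/* Hypothetical fix: MSVC's rand() yields at most 15 bits per call, so
 * combine several calls to fill a full pointer-sized value. */
static UINT_PTR wide_rand(void) {
    UINT_PTR v = 0;
    for (unsigned bits = 0; bits < sizeof(v) * CHAR_BIT; bits += 15)
        v = (v << 15) | (UINT_PTR)(rand() & 0x7FFF);
    return v;
}

/* wParam = (WPARAM) wide_rand();
   lParam = (LPARAM) wide_rand(); */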

Tested Applications

The NT Fuzz Report tested 33 programs. This reproduction tests just 28 because only one version of each program is used for testing. The Windows software ecosystem has changed substantially since 2000, but there is also a surprising amount of conservation. The Microsoft Office suite features the same programs as the original tests. Netscape Communicator evolved into what is now Firefox. Adobe Acrobat was renamed to Adobe Reader, but is still going strong. Even Winamp made a new release in 2018, allowing for a fair comparison with the original NT Fuzz Report. However, some legacy software has gone the way of the last millennium. Below is the list of changes, and the reasons for them:

  • CD Player ⇨ Windows Media Player: The Windows Media Player has subsumed CD Player functionality.
  • Eudora ⇨ Windows Mail: Qualcomm now makes basebands, not email clients. Because Eudora is no longer around, the default Windows email client was tested instead.
  • Command AntiVirus ⇨ Avast Free Edition: The Command product is no longer available. It was replaced with Avast, the most popular third-party antivirus vendor.
  • GSView ⇨ Photos: The GSView application is no longer maintained. It was replaced with Photos, the default Windows photo viewer.
  • JavaWorkshop ⇨ NetBeans IDE: The JavaWorkshop IDE is no longer maintained. NetBeans seemed like a good free alternative that fits the spirit of what should be tested.
  • Secure CRT ⇨ BitVise SSH: Secure CRT is still around, but required a very long web form to download a trial version. BitVise SSH offered a quick download.
  • Telnet ⇨ Putty: The telnet application still exists on Windows, but now it is a console application. To fuzz a GUI application, we replaced telnet with Putty, a popular open-source terminal emulator for Windows.
  • Freecell & Solitaire were run from the Microsoft Solitaire Collection application in the Windows App Store.

The specific application version appears in the results table. All fuzzing was done on a 64-bit installation of Windows 10 Pro, version 1809 (OS Build 17763.253).

Results

As mentioned in the NT Fuzz Report, the results should not be treated as security vulnerabilities, but instead a measure of software robustness and quality.

“Finally, our results form a quantitative starting point from which to judge the relative improvement in software robustness.”

From “An Empirical Study of the Robustness of Windows NT Applications Using Random Testing” by Justin E. Forrester and Barton P. Miller

The numbers are not particularly encouraging, although the situation is improving. In the original NT Fuzz Report, every application either crashed or froze when fuzzed. Now, two programs, Calculator and Avast Antivirus, survive the window message fuzzer with no ill effects. Our praise goes to the Avast and Windows Calculator teams for thinking about erroneous window messages. The Calculator team gets additional kudos for open sourcing Calculator and showing everyone how a high-quality UWP application is built. See Table 1 for all of our fuzzing results, along with the specific version of the software used.

Program | Version | SendMessage | PostMessage
Microsoft Access | 1901 | crash | crash
Adobe Reader DC | 2019.010.20098 | crash | ok
Calculator | 10.1812.10048.0 | ok | ok
Windows Media Player | 12.0.17763.292 | crash | crash
Visual Studio Code | 1.30.2 | crash | ok
Avast Free | 19.2.2364 | ok | ok
Windows Mail | 16005.11231.20182.0 | crash | crash
Excel | 1901 | crash | ok
Adobe FrameMaker | 15.0.2.503 | crash | crash
Freecell | 4.3.2112.0 | crash | crash
GhostScript | 9.26 | crash | ok
Photos | 2019.18114.17710.0 | crash | crash
GNU Emacs | 26.1 | crash | crash
IE Edge | 44.17763.1.0 | crash | crash
NetBeans | 10 | crash | crash
Firefox | 64.0.2 | crash | crash
Notepad | 1809 | crash | ok
Paint | 1809 | crash | crash
Paint Shop Pro 2019 | 21.1 | crash | crash
Powerpoint | 1901 | crash | ok
Bitvise SSH | 8.23 | crash | crash
Solitaire | 4.3.2112.0 | crash | crash
Putty | 0.70 | freeze | freeze
VS Community 2017 | 15.9.5 | crash | crash
WinAmp 5.8 | 5.8 Build 3660 | crash | ok
Word | 1901 | crash | ok
Wordpad | 1809 | crash | crash
WS_FTP | 12.7.0.1903 | crash | crash

Table 1: The results of replicating the original NT Fuzz Report on Windows 10. After 19 years, very few applications properly handle malformed window messages.

A Bug in Windows?

One common problem seemed to plague multiple unrelated applications, and our curiosity got the better of us, so we made an exception and investigated. Some debugging showed the responsible message was WM_DEVICECHANGE. When the fuzzer sent that message, it would even crash the simplest application possible — the official Windows API HelloWorld Sample (Figure 3).

Figure 3: A 32-bit HelloWorld.exe crashes when faced with the window message fuzzer. This shouldn’t happen since the program is so simple. The implication is that the issue is somewhere in Windows.

Using the HelloWorld sample we quickly realized that the problem only affects 32-bit applications, not 64-bit applications. Some rapid debugging revealed that the crash is in wow64win.dll, the 32-to-64-bit compatibility layer. My quick (and possibly wrong) analysis of the problem shows that the wow64win.dll!whcbfnINDEVICECHANGE function will treat wParam as a pointer to a DEV_BROADCAST_HANDLE64 structure in the target program. The function converts that structure to a DEV_BROADCAST_HANDLE32 structure for compatibility with 32-bit applications. The crash happens because the wParam value generated by the fuzzer points to invalid memory.

Treating wParam as a local pointer is a bad idea, although it was probably an intentional design decision to make sure removable device notifications work with legacy 32-bit Windows applications. Regardless, it certainly feels wrong that it is possible to crash another application without explicitly debugging it. We reported the issue to MSRC, even though no security boundary was being crossed. They confirmed the bug is not a security issue. We hope to see a fix for this admittedly obscure problem in a future version of Windows.

Conclusion

Window messages are an under-appreciated and often ignored source of untrusted input to Windows programs. Even 19 years after the first open-source window message fuzzer was deployed, 93% of tested applications still freeze or crash when run against the very same fuzzer. The fact that some applications gracefully handle these malformed inputs is an encouraging sign: it means frameworks and institutional knowledge to avoid these errors exist in some organizations.

There is also much room for improvement in window message fuzzing — the simplest method possible crashes 93% of applications. There may even be examples where window messages travel across a real security boundary. If you explore this area further, we hope you’ll share what you find.

What Application Developers Need To Know About TLS Early Data (0RTT)

TLS 1.3 represents the culmination of over two decades of experience in deploying large-scale transport security. For the most part it simplifies and improves the security of TLS and can act as a drop-in replacement for TLS 1.2. However, one new feature in the protocol represents a significant security risk to some existing applications: TLS 0-RTT (also known as early data). This performance optimization can allow replay attacks in applications that don’t implement their own anti-replay defenses. In some cases, just upgrading your TLS dependencies can introduce application-level vulnerabilities.

Let’s look at an example of a vulnerable application, demonstrate why it’s vulnerable, and discuss what can be done at the various application layers to resolve this issue.

A Vulnerable Application

Your company runs a platform with a buy-and-sell API. For a variety of legacy reasons, the company implemented these operations as GET /api/sell/(item)/(qty) and GET /api/buy/(item)/(qty).

At some later date your operations team upgrades your TLS infrastructure to support TLS 1.3. This could take the form of enabling it on the CDN, upgrading the TLS hardware offload box, or updating the software on your load balancers. They view this update just like any other standard patching process. After all, this is a transparent upgrade just like TLS 1.0 to 1.1, or 1.1 to 1.2…

…except if 0-RTT is enabled. If it is, then the API just described is now vulnerable to arbitrary replay. The design tradeoffs implicit in 0-RTT make possible the following attack:

  • A user previously logged into your system walks into a coffee shop, gets on the WiFi, and initiates a buy or sell. Thanks to TLS 1.3 0-RTT this transaction occurs without a round trip for an initial handshake, saving 300ms! And, of course, all communication is still encrypted via TLS.
  • An adversary capturing traffic off the WiFi captures that request and sends the same request again to the server. Unlike TLS 1.2, this request is not rejected by the TLS layer, and the buy/sell occurs again.

This is a surprising result! The API in question is now vulnerable to a new attack on the transport layer without any code changes to the actual application. Why? The answer lies in 0-RTT’s design.

What is 0-RTT?

TLS 0-RTT (an abbreviation for “zero round trip time,” officially known as “TLS early data”) is a method of lowering the time to first byte on a TLS connection. TLS 1.3 only requires 1-RTT (a single round trip) of the protocol, whereas TLS 1.2 and below required two, but the designers wanted more! For connections to a server where the client and server possess a pre-shared key (PSK), the client may choose to encrypt early data under this key and send it along with the ClientHello. This allows the server to respond immediately with the requested data after its own ServerHello/EncryptedExtensions/Finished messages. This cuts an entire round trip off the communication: zero round trip time. In a mobile environment this may save a significant amount of time (hundreds or even thousands of milliseconds). The PSK may be obtained out-of-band, but typically it is retained from an earlier handshake. Early data communication is generally limited to servers that the client has previously spoken to.

How Can It Break You?

Of course, shaving off a round trip comes with certain tradeoffs. In a typical TLS 1.3 connection model, every session has a property known as forward secrecy. Forward secrecy guarantees that past sessions are secure even if the private key for the current session is somehow disclosed. TLS 1.3 (and 1.2 with ephemeral key exchange modes) provides this by generating a new set of keys for every handshake. Unfortunately, since the PSK can’t be refreshed without a round trip, an initial request sent via 0-RTT is not forward secure. It is encrypted under the previous session’s key.

A much more significant concern, however, is that a 0-RTT request cannot prevent a replay attack. To counter this, the TLS implementation needs to tell the application layer whether a given request arrived as 0-RTT data. With that information, the application can refuse 0-RTT requests for non-idempotent operations, or apply direct anti-replay defenses such as nonces, which can be checked against shared global state to confirm that a given request has never been seen before. RFC 8470 attempts to document mitigations like this as they apply to 0-RTT.

Unfortunately, implementing a robust defense for this is easier said than done. Until now web applications have generally not needed to be aware of the vagaries of their transport security. As of this writing there has been only limited effort in trying to tackle this. Hypothetically, applications should already be capable of handling replay. Reality is never so convenient. Go’s new TLS 1.3 support does not include 0-RTT partially due to concerns around how to expose it safely. Cloudflare has chosen to disallow 0-RTT for all but GET requests with no query string specifically to try to mitigate this issue while also proxying additional information about any given request via added headers. Unfortunately, our example application from earlier is still vulnerable since it uses GET requests!

What Can You Do?

Most importantly, upgrade to TLS 1.3! It is a better and more secure protocol than its predecessors. However, as part of the upgrade, disable 0-RTT until you can audit your application for this class of vulnerability. If you’re using a CDN with TLS termination, read the documentation to determine what information they forward for you to interpret at the application layer. Otherwise, if you don’t have access to the specific connection details you’ll need to ensure you have very robust anti-replay defenses in place for sensitive operations.

If you’re a web framework developer you should think seriously about what APIs you can provide to your consumers to help them manage this risk while providing the performance benefits. This will likely require engaging with the various servers your framework runs on to come up with a common API to proxy the information you need. For example, if the web framework used in the vulnerable application above had an annotation for idempotency then routes annotated in that fashion could be automatically enabled for 0-RTT while all the others would reject 0-RTT requests (thus falling back automatically to a standard handshake).

If you are directly consuming a TLS API like OpenSSL’s in your application, you’ll need to implement the various callbacks like SSL_CTX_set_allow_early_data_cb and carefully consider the implications for your session management with regard to replay protection. 0-RTT support is not enabled unless you consume these new APIs, so you can opt in to it over time.
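
As a rough sketch (assuming OpenSSL 1.1.1 or later, with error handling omitted), a server opts in by setting a nonzero early-data limit and can gate acceptance per connection with the callback:

#include <openssl/ssl.h>

/* Called when a client offers early data on a resumed session.
 * Return 1 to accept 0-RTT for this connection, 0 to force a full handshake. */
static int allow_early_data_cb(SSL *s, void *arg) {
    (void)s;
    (void)arg;
    /* Accept only if the application has replay tracking in place, or
     * restricts 0-RTT data to idempotent operations. */
    return 0; /* conservative default: reject early data */
}

SSL_CTX *make_server_ctx(void) {
    SSL_CTX *ctx = SSL_CTX_new(TLS_server_method());
    SSL_CTX_set_max_early_data(ctx, 16384);  /* 0-RTT stays off until this is nonzero */
    SSL_CTX_set_allow_early_data_cb(ctx, allow_early_data_cb, NULL);
    return ctx;
}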

Cryptographers have been looking at how to obtain usable forward secrecy in the context of a 0-RTT request as well. Some recently published research (Session Resumption Protocols and Efficient Forward Security for TLS 1.3 0-RTT) proposes the use of puncturable pseudorandom functions to significantly reduce the size of a session database, but with trade-offs in computational complexity and post-compromise security. As of publication this is an area of active research with no solution truly suitable for deployment.

If you want to take advantage of TLS 1.3’s performance while ensuring your application and users are secure in a 0-RTT world, contact the Trail of Bits engineering and cryptography teams. We would love to help you engineer your application securely.

Frequently Asked Questions

What do I do if I’m behind a CDN?

For CDN-based termination you’ll need to check their documentation to see what capabilities they provide. Cloudflare (which is one of the only CDN companies that provides public documentation for this) uses a header named CF-0RTT-Unique and the application needs to track values received from that and reject duplicates on non-idempotent endpoints.

What if I terminate with HAProxy?

By default enabling TLS 1.3 will not enable 0-RTT support.

You can enable 0-RTT by adding allow-0rtt to the bind or server lines in the configuration. Once enabled a 0-RTT request will be proxied to the application layer with the header Early-Data: 1 and a request may be rejected by returning a 425 status code. This method of proxying information is codified in RFC 8470.

What if I terminate with nginx?

By default enabling TLS 1.3 will not enable 0-RTT support.

You can enable 0-RTT with ssl_early_data on; in the configuration. You’ll also need to add proxy_set_header Early-Data $ssl_early_data; to your proxy directives to ensure that the Early-Data header is passed to your application.

What if I terminate with Apache httpd?

httpd 2.4.37 and above supports TLS 1.3, but has no 0-RTT support at present (March 2019).

In-application termination: Go, Python, Ruby, C

Go supports TLS 1.3 as of 1.12, but has no support for 0-RTT.

Python has no early data support at present (March 2019).

Ruby has no early data support at present (March 2019).

C applications can utilize whatever TLS stack they desire. Each individual library is different, but the most common, OpenSSL, only enables 0-RTT when calling SSL_CTX_set_max_early_data (or SSL_set_max_early_data) with a value greater than zero. Developers can also use SSL_CTX_set_allow_early_data_cb to set a callback function that determines whether a given 0-RTT request should be accepted.

Symbolic Path Merging in Manticore

Each year, Trail of Bits runs a month-long winter internship “winternship” program. This year we were happy to host 4 winterns who contributed to 3 projects. This is the first in a series of blog posts covering the 2019 Wintern class.

Our first report is from Vaibhav Sharma (@vbsharma), a PhD student at the University of Minnesota. Vaibhav’s research focuses on improving symbolic executors and he took a crack at introducing a new optimization to Manticore:

Symbolic Path Merging in Manticore

My project was about investigating the use of path-merging techniques in Manticore, a symbolic execution engine that supports symbolic exploration of binaries compiled for X86, X64, and Ethereum platforms. A significant barrier for symbolic exploration of many practical programs is path explosion. As a symbolic executor explores a program, it encounters branch instructions with two feasible sides. The symbolic executor needs to explore both sides of the branch instruction. Manticore explores such branch instructions by forking the path that reached the branch instruction into two paths each of which explores a feasible side. A linear increase in the number of branch instructions with both sides feasible causes an exponential increase in the number of paths Manticore needs to explore through the program. If we hit enough of these branch conditions, Manticore may never finish exploring all the states.

Path merging reduces the number of paths to be explored. The central idea is to merge paths at the same program location that are similar. Manticore uses the notion of a “state” object to capture the processor, memory, and file system information into a single data structure at every point of symbolic exploration through a program. Hence, path merging can be specialized to “state merging” in Manticore where merging similar states that are at the same program location leads to an exponential reduction in the number of paths to explore. With a simple program, I observed Manticore could cut its number of explored execution paths by 33% if it merged similar states at the same program location.

State merging can be implemented statically or dynamically. Static state merging explores the control-flow graph of the subject program in topological order and merges states at the same program location when possible. Veritesting is a path-merging technique similar to static state merging: it requires paths to be at the same program location in order to merge them. Dynamic state merging does not require two states to be at the same program location for them to be considered for merging. Given two states a1, a2 at different program locations l1, l2 respectively, if a transitive successor a1′ of a1 has a high and beneficial similarity to a2, dynamic state merging fast-forwards a1 to a1′ and merges it with a2. The fast-forwarding involves overriding the symbolic executor’s search heuristic to reach l2. Dynamic state merging uses the intuition that if two states are similar, their successors within a few steps are also likely to be similar.

While it is possible to implement either state merging technique in Manticore, I chose dynamic state merging as described by Kuznetsov et al. as a better fit for Manticore’s use of state-based instead of path-based symbolic executors. Also, static state merging is less suited for symbolic exploration guided towards a goal and more suited for exhaustive exploration of a subject program. Since static state merging can only merge states at the same program location, when directed towards a goal it tends to cover less code than dynamic state merging in the same time budget. This was also a conclusion of Kuznetsov et al. (see Figure 8 from their paper below). Since we often tend to use symbolic execution to reach an exploration goal, static state merging is less suited to our needs.


Dynamic State Merging (DSM) provided more statement coverage than Static State Merging (SSM). Figure from “Efficient state merging in symbolic execution.” Kuznetsov et al. 11 Jun. 2012.

Engineering Challenges

Both static and dynamic state merging require the use of an external static analysis tool like Binary Ninja to find the topological ordering of program locations. Given the short duration of my winternship, I chose to implement opportunistic state merging which only merges states that happen to be at the same program location. While this approach does not give the full benefit of dynamic state merging, it is easier to implement because it does not rely on integration with an external static analysis tool to obtain topological ordering. This approach is also easily extensible to dynamic state merging since it uses many of the same primitive operations like state comparison and state merging.

Implementation

I implemented opportunistic state merging for Manticore. The implementation checks if two states at the same program location have semantically equivalent input, output socket buffers, memory, and system call traces in an “isMergeable” predicate. If this predicate is satisfied, the implementation merges CPU register values that are semantically inequivalent.

Results

I used a simple example where I could see two states saved in Manticore’s queue that are at the same program location making them good candidates to be merged. I present the partial CFG of this example program below.

Merged CFG Annotated


The two basic blocks highlighted in red cause control flow to merge at the basic block highlighted in green. The first highlighted red block causes control flow to jump directly to the green block. The second highlighted red block moves a constant (0x4a12dd) to the edi register and then jumps to the green block. To explore this example, Manticore creates two states, one which explores the first red block and jumps to the green block, and another state that explores the second red block and jumps to the green block. Since the only difference between these two states which are at the same program location (the green block) is the value present in their edi register, Manticore can merge these two states into a single state with the value for edi set to be an if-then-else expression. This if-then-else expression will use the condition that chooses which side of the branch (jbe 0x40060d) gets taken. If the condition is satisfied, the if-then-else expression will evaluate to the value that is present in edi in the first red block. If the condition is not satisfied, it will evaluate to 0x4a12dd (the constant set in the second red block). Thus, Manticore merges two control flow paths into one path opportunistically which eventually leads to Manticore cutting its number of execution paths by 33% on the binary compiled with the -Os optimization option with gcc and by 20% if the binary is compiled with the -O3 optimization option.
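
In source terms, the shape of that merge point is roughly the following (a hypothetical reconstruction, not the actual example program); after merging, edi’s value in the green block becomes the if-then-else expression ite(cond, x, 0x4a12dd), where cond is the jbe condition:

#define LIMIT 100u  /* hypothetical bound checked by the jbe instruction */

/* Both sides of the branch fall through to the same block (the green block)
 * and differ only in the value left in one register. */
unsigned int pick(unsigned int x) {
    unsigned int v;
    if (x <= LIMIT)       /* the jbe 0x40060d branch */
        v = x;            /* first red block: edi keeps its incoming value */
    else
        v = 0x4a12dd;     /* second red block: mov $0x4a12dd, %edi */
    return v;             /* green block: the merge point */
}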

Directions for future improvement:

  1. This implementation can be extended to get the full benefits of dynamic state merging by integrating Manticore with a tool that can provide a topological ordering of program locations.
  2. State merging always creates new symbolic data, since it converts all concrete writes in a merged region of code to symbolic writes.
  3. Check whether the new symbolic data introduced by state merging causes more branching later during exploration. We need to implement heuristics such as the query count estimation of Kuznetsov et al. so that we use dynamic state merging only when it is most useful.

Path merging is a technique that needs to be re-interpreted to fit the needs of a symbolic executor. This winternship allowed me to understand the inner workings of Manticore, a state-based symbolic executor, and re-interpret path merging to better fit the use-case of binary symbolic execution with Manticore. My implementation of opportunistic state merging merges similar states if they are at the same program location. The implementation can be used in a Python script by registering a plugin called Merger with Manticore. basic_statemerging.py under examples/script is an example of such use of state merging.

Fuzzing an API with DeepState (Part 2)

Alex Groce, Associate Professor, School of Informatics, Computing and Cyber Systems, Northern Arizona University

Mutation Testing

Introducing one bug by hand (as we did in Part 1) is fine, and we could try it again, but “the plural of anecdote is not data.” However, this is not strictly true. If we have enough anecdotes, we can probably call it data (the field of “big multiple anecdotes” is due to take off any day now). In software testing, creating multiple “fake bugs” has a name, mutation testing (or mutation analysis). Mutation testing works by automatically generating lots of small changes to a program, in the expectation that most such changes will make the program incorrect. A test suite or fuzzer is better if it detects more of these changes. In the lingo of mutation testing, a detected mutant is “killed.” The phrasing is a bit harsh on mutants, but in testing a certain hard-heartedness towards bugs is in order. Mutation testing was once an academic niche topic, but is now in use at major companies, in real-world situations.

There are many tools for mutation testing available, especially for Java. The tools for C code are less robust, or more difficult to use, in general. I (along with colleagues at NAU and other universities) recently released a tool, the universalmutator, that uses regular expressions to allow mutation for many languages, including C and C++ (not to mention Swift, Solidity, Rust, and numerous other languages previously without mutation-testing tools). We’ll use the universalmutator to see how well our fuzzers do at detecting artificial red-black tree bugs. Besides generality, one advantage of universalmutator is that it produces lots of mutants, including ones that are often equivalent but can sometimes produce subtle differences in behavior — that is, hard-to-detect bugs — of a kind not generated by most mutation systems. For high-stakes software, this can be worth the additional effort of analyzing and examining the mutants.

Installing universalmutator and generating some mutants is easy:

pip install universalmutator
mkdir mutants
mutate red_black_tree.c --mutantDir mutants

This will generate a large number of mutants, most of which won’t compile (the universalmutator doesn’t parse, or “know” C, so it’s no surprise many of its mutants are not valid C). We can discover the compiling mutants by running “mutation analysis” on the mutants, with “does it compile?” as our “test”:

analyze_mutants red_black_tree.c "make clean; make" --mutantDir mutants

This will produce two files: killed.txt, containing mutants that don’t compile, and notkilled.txt, containing the 1120 mutants that actually compile. To see if a mutant is killed, the analysis tool just determines whether the command in quotes returns a non-zero exit code, or times out (the default timeout is 30 seconds; unless you have a very slow machine, this is plenty of time to compile our code here).

If we copy the notkilled.txt file containing valid (compiling) mutants to another file, we can then do some real mutation testing:

cp notkilled.txt compile.txt
analyze_mutants red_black_tree.c "make clean; make fuzz_rb; ./fuzz_rb" --mutantDir mutants --verbose --timeout 120 --fromFile compile.txt

Output will look something like:

ANALYZING red_black_tree.c
COMMAND: ** ['make clean; make fuzz_rb; ./fuzz_rb'] **
#1: [0.0s 0.0% DONE]
  mutants/red_black_tree.mutant.2132.c NOT KILLED
  RUNNING SCORE: 0.0
...
Assertion failed: (left_black_cnt == right_black_cnt), function checkRepHelper, file red_black_tree.c, line 702.
/bin/sh: line 1: 30015 Abort trap: 6           ./fuzz_rb
#2: [62.23s 0.09% DONE]
  mutants/red_black_tree.mutant.1628.c KILLED IN 1.78541398048
  RUNNING SCORE: 0.5
...

Similar commands will run mutation testing on the DeepState fuzzer and libFuzzer. Just change make fuzz_rb; ./fuzz_rb to make ds_rb; ./ds_rb --fuzz --timeout 60 --exit_on_fail to use the built-in DeepState fuzzer. For libFuzzer, to speed things up, we’ll want to set the environment variable LIBFUZZER_EXIT_ON_FAIL to TRUE, and pipe output to /dev/null since libFuzzer’s verbosity will hide our actual mutation results:

export LIBFUZZER_EXIT_ON_FAIL=TRUE
analyze_mutants red_black_tree.c "make clean; make ds_rb_lf; ./ds_rb_lf -use_value_profile=1 -detect_leaks=0 -max_total_time=60 >& /dev/null" --mutantDir mutants --verbose --timeout 120 --fromFile compile.txt
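
For reference, the full analysis command for the built-in DeepState fuzzer, with the substitution described above spelled out, looks like this:

analyze_mutants red_black_tree.c "make clean; make ds_rb; ./ds_rb --fuzz --timeout 60 --exit_on_fail" --mutantDir mutants --verbose --timeout 120 --fromFile compile.txt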

The tool generates 2,602 mutants, but only 1,120 of these actually compile. Analyzing those mutants with a test budget of 60 seconds, we can get a better idea of the quality of our fuzzing efforts. The DeepState brute-force fuzzer kills 797 of these mutants (71.16%). John’s original fuzzer kills 822 (73.39%). Fuzzing the mutants not killed by these fuzzers another 60 seconds doesn’t kill any additional mutants. The performance of libFuzzer is strikingly similar: 60 seconds of libFuzzer (starting from an empty corpus) kills 797 mutants, exactly the same as DeepState’s brute force fuzzer – the same mutants, in fact.

“There ain’t no such thing as a free lunch” (or is there?)

DeepState’s native fuzzer appears, for a given amount of time, to be less effective than John’s “raw” fuzzer. This shouldn’t be a surprise: in fuzzing, speed is king. Because DeepState is parsing a byte-stream, forking in order to save crashes, and producing extensive, user-controlled logging (among other things), it is impossible for it to generate and execute tests as quickly as John’s bare-bones fuzzer.

libFuzzer is even slower; in addition to all the services (except forking for crashes, which is handled by libFuzzer itself) provided by the DeepState fuzzer, libFuzzer determines the code coverage and computes value profiles for every test, and performs computations needed to base future testing on those evaluations of input quality.

Is this why John’s fuzzer kills 25 mutants that DeepState does not? Well, not quite. If we examine the 25 additional mutants, we discover that every one involves changing an equality comparison on a pointer into an inequality. For example:

<   if ( (y == tree->root) ||
---
>   if ( (y <= tree->root) ||

The DeepState fuzzer is not finding these because it runs each test in a fork. The code doesn't allocate enough times to use enough of the address space to cause a problem for these particular checks, since most allocations are in a fork! In theory, this shouldn't be the case for libFuzzer, which runs without forking. And, sure enough, if we give the slow-and-steady libFuzzer five minutes instead of 60 seconds, it catches all of these mutants, too. No amount of additional fuzzing will help the DeepState fuzzer. In this case, the bug is strange enough, and unlikely enough, that we can perhaps ignore it. The issue is not the speed of our fuzzer, or (exactly) its quality, but the fact that different fuzzing environments create subtle differences in what tests we are actually running.

After we saw this problem, we added an option to DeepState to make the brute force fuzzer (or test replay) run in a non-forking mode: --no_fork. Unfortunately, this is not a complete solution. While we can now detect these bugs, we can't produce a good saved test case for them, since the failure depends on all the mallocs that have been issued, and the exact addresses of certain pointers. However, it turns out that --no_fork has a more important benefit: it dramatically speeds up fuzzing and test replay on macOS – often by orders of magnitude. While we omit it in our examples because it complicates analyzing failure causes, you should probably use it for most fuzzing and test replay on macOS.
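
For example, on macOS you might run the brute-force fuzzer in non-forking mode like this:

./ds_rb --fuzz --timeout 60 --exit_on_fail --no_fork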

We can safely say that, for most intents and purposes, DeepState is as powerful as John’s “raw” fuzzer, as easy to implement, and considerably more convenient for debugging and regression testing.

Examining the Mutants

This takes care of the differences in our fuzzers’ performances. But how about the remaining mutants? None of them are killed by five minutes of fuzzing using any of our fuzzers. Do they show holes in our testing? There are various ways to detect equivalent mutants (mutants that don’t actually change the program semantics, and so can’t possibly be killed), such as comparing the binaries generated by an optimizing compiler. For our purposes, we will just examine a random sample of the 298 unkilled mutants, to confirm that at least most of the unkilled mutants are genuinely uninteresting.
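
For instance, a rough way to check a single mutant (the mutant file name below is just a hypothetical example) is to compile it and the original with the same flags and compare the object files; identical binaries prove equivalence, although, as noted later, differing binaries prove nothing:

clang -O2 -c red_black_tree.c -o original.o
clang -O2 -I. -c mutants/red_black_tree.mutant.1000.c -o mutant.o
cmp -s original.o mutant.o && echo "equivalent: identical binaries"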

  • The first mutant changes a <= in a comment. There’s no way we can kill this. Comparing compiled binaries would have proven it.
  • The second mutant modifies code in the InorderTreePrint function, which John’s fuzzer (and thus ours) explicitly chooses not to test. This would not be detectable by comparing binaries, but it is common sense. If our fuzzer never covers a piece of code, intentionally, it cannot very well detect bugs in that code.
  • The third mutant changes the assignment to temp->key on line 44, in the RBTreeCreate function, so it assigns a 1 rather than a 0. This is more interesting. It will take some thought to convince ourselves this does not matter. If we follow the code’s advice and look at the comments on root and nil in the header file, we can see these are used as sentinels. Perhaps the exact data values in root and nil don’t matter, since we’ll only detect them by pointer comparisons? Sure enough, this is the case.
  • The fourth mutant removes the assignment newTree->PrintKey= PrintFunc; on line 35. Again, since we never print trees, this can’t be detected.
  • The fifth mutant is inside a comment.
  • The sixth mutant changes a pointer comparison in an assert.
    686c686
    <     assert (node->right->parent == node);
    ---
    >     assert (node->right->parent >= node);

    If we assume the assert always held for the original code, then changing == to the more permissive >= obviously cannot fail.

  • The seventh mutant lurks in a comment.
  • The eighth mutant removes an assert. Again, removing an assert can never cause previously passing tests to fail, unless something is wrong with your assert!
  • The ninth mutant changes a red assignment:
    243c243
    <       x->parent->parent->red=1;
    ---
    >       x->parent->parent->red=-1;

    Since we don’t check the exact value of the red field, but use it to branch (so all non-zero values are the same) this is fine.

  • The tenth mutant is again inside the InorderTreePrint function.

At this point if we were really testing this red-black tree as a critical piece of code, we would probably:

  • Make a tool (like a 10-line Python script, not anything heavyweight!) to throw out all mutants inside comments, inside the InorderTreePrint function, or that remove an assertion (a rough sketch of such a script follows this list).
  • Compile all the mutants and compare binaries with each other and the original file, to throw out obvious equivalent mutants and redundant mutants. This step can be a little annoying. Compilers don’t always produce equivalent binaries, due to timestamps generated at compile time, which is why we skipped over it in the discussion above.
  • Examine the remaining mutants (maybe 200 or so) carefully, to make sure we're not missing anything. Finding categories of "that's fine" mutants often makes this process much easier than it sounds offhand (things like "assertion removals are always ok").
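
Here is a rough sketch of the kind of throwaway filter script mentioned in the first item above. It assumes that the mutants are full, line-for-line copies of red_black_tree.c in the mutants directory, that the surviving mutants are listed one file name per line in notkilled.txt, and that you fill in the (hypothetical) line range of InorderTreePrint by hand; it is a starting point, not a polished tool:

original = open("red_black_tree.c").read().splitlines()

# Hypothetical line range of InorderTreePrint; check the real source and
# adjust these numbers before trusting the results.
PRINT_FUNC_LINES = range(100, 125)

def boring(mutant_path):
    """Return True if the mutant's change is obviously uninteresting."""
    mutant = open(mutant_path).read().splitlines()
    # universalmutator mutants change a single line of the original, so we
    # just look for the first line that differs.
    for lineno, (orig, mut) in enumerate(zip(original, mutant), start=1):
        if orig == mut:
            continue
        stripped = orig.strip()
        # Rough heuristic for "the change is inside a comment."
        if stripped.startswith(("/*", "//", "* ", "**")):
            return True
        if lineno in PRINT_FUNC_LINES:
            return True   # change is in the never-tested print function
        if stripped.startswith("assert") and (not mut.strip() or mut.strip().startswith("/*")):
            return True   # the assertion was deleted or commented out
        return False      # the first real difference looks interesting
    return False

survivors = open("notkilled.txt").read().split()
for mutant in survivors:
    if not boring(mutant):
        print(mutant)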

The process of (1) making a test generator then (2) applying mutation testing and (3) actually looking at the surviving mutants and using them to improve our testing can be thought of as a falsification-driven testing process. For highly critical, small pieces of code, this can be a very effective way to build an effective fuzzing regimen. It helped Paul E. McKenney discover real bugs in the Linux kernel’s RCU module.

Just Fuzz it More

Alternatively, before turning to mutant investigation, you can just fuzz the code more aggressively. Our mutant sample suggests there won’t be many outstanding bugs, but perhaps there are a few. Five minutes is not that extreme a fuzzing regimen. People expect to run AFL for days. If we were really testing the red-black tree as a critical piece of code, we probably wouldn’t give up after five minutes.

Which fuzzer would be best for this? It's hard to know for sure, but one reasonable approach would be to first use libFuzzer to generate a large corpus of tests that achieve high coverage on the un-mutated red-black tree, and use it to seed further fuzzing. Then, we could try a longer fuzzing run on each mutant, using the seeds to make sure we're not spending most of the time just "learning" the red-black tree API.
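
Concretely, with made-up time budgets matching what we describe below, that might look something like this:

make clean; make ds_rb_lf
mkdir -p corpus
./ds_rb_lf corpus -use_value_profile=1 -detect_leaks=0 -max_total_time=3600
export LIBFUZZER_EXIT_ON_FAIL=TRUE
analyze_mutants red_black_tree.c "make clean; make ds_rb_lf; ./ds_rb_lf corpus -use_value_profile=1 -detect_leaks=0 -max_total_time=600 >& /dev/null" --mutantDir mutants --verbose --timeout 1200 --fromFile compile.txt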

After generating a corpus on the original code for an hour, we ran libFuzzer, starting from that corpus, for ten minutes. The tests we generated this way can be found here. How many additional mutants does this kill? We can already guess it will be fewer than 30, based on our 3% sample. A simple script, as described above, brings the number of interesting, unkilled, mutants to analyze down to 174 by removing comment mutations, print function mutations, and assertion removals. In fact, this more aggressive (and time-consuming) fuzzing kills zero additional mutants over the ones already killed by John’s fuzzer in one minute and libFuzzer in five minutes. Even an hour-long libFuzzer run with the hour-long corpus kills only three additional mutants, and those are not very interesting. One new kill removes a free call, and the memory leak eventually kills libFuzzer; the other two kills are just more pointer comparisons. Is this solid evidence that our remaining mutants (assuming we haven’t examined them all yet) are harmless? We’ll see.

What About Symbolic Execution?

[Note: this part doesn't work on Mac systems right now, unless you know enough to do a cross compile, and can get the binary analysis tools working with that. I ran it on Linux inside Docker.]

DeepState also supports symbolic execution, which, according to some definitions, is just another kind of fuzzing (white box fuzzing). Unfortunately, at this time, neither Manticore nor angr (the two binary analysis engines we support) can scale to the full red-black tree or file system examples with a search depth anything like 100. This isn’t really surprising, given the tools are trying to generate all possible paths through the code! However, simply lowering the depth to a more reasonable number is also insufficient. You’re likely to get solver timeout errors even at depth three. Instead, we use symex.cpp, which does a much simpler insert/delete pattern, with comparisons to the reference, three times in a row.

clang -c red_black_tree.c container.c stack.c misc.c
clang++ -o symex symex.cpp -ldeepstate red_black_tree.o stack.o misc.o container.o -static -Wl,--allow-multiple-definition,--no-export-dynamic
deepstate-manticore ./symex --log_level 1

The result will be tests covering all paths through the code, saved in the out directory. This may take quite some time to run, since each path can take a minute or two to generate, and there are quite a few paths. If deepstate-manticore is too slow, try deepstate-angr (or vice versa). Different code is best suited for different symbolic execution engines. (This is one of the purposes of DeepState – to make shopping around for a good back-end easy.)

INFO:deepstate.mcore:Running 1 tests across 1 workers
TRACE:deepstate:Running RBTree_TinySymex from symex.cpp(65)
TRACE:deepstate:symex.cpp(80): 0: INSERT:0 0x0000000000000000
TRACE:deepstate:symex.cpp(85): 0: DELETE:0
TRACE:deepstate:symex.cpp(80): 1: INSERT:0 0x0000000000000000
TRACE:deepstate:symex.cpp(85): 1: DELETE:0
TRACE:deepstate:symex.cpp(80): 2: INSERT:0 0x0000000000000000
TRACE:deepstate:symex.cpp(85): 2: DELETE:-2147483648
TRACE:deepstate:Passed: RBTree_TinySymex
TRACE:deepstate:Input: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ...
TRACE:deepstate:Saved test case in file out/symex.cpp/RBTree_TinySymex/89b9a0aba0287935fa5055d8cb402b37.pass
TRACE:deepstate:Running RBTree_TinySymex from symex.cpp(65)
TRACE:deepstate:symex.cpp(80): 0: INSERT:0 0x0000000000000000
TRACE:deepstate:symex.cpp(85): 0: DELETE:0
TRACE:deepstate:symex.cpp(80): 1: INSERT:0 0x0000000000000000
TRACE:deepstate:symex.cpp(85): 1: DELETE:0
TRACE:deepstate:symex.cpp(80): 2: INSERT:0 0x0000000000000000
TRACE:deepstate:symex.cpp(85): 2: DELETE:0
TRACE:deepstate:Passed: RBTree_TinySymex
...

We can see how well the 583 generated tests perform using mutation analysis as before. Because we are just replaying the tests, not performing symbolic execution, we can now add back in the checkRep and RBTreeVerify checks that were removed in order to speed up symbolic execution, by compiling symex.cpp with -DREPLAY, and compiling everything with all of our sanitizers. The generated tests, which can be run (on a correct red_black_tree.c) in less than a second, kill 428 mutants (38.21%). This is considerably lower than for fuzzing, and worse than the 797 (71.16%) killed by the libFuzzer one-hour corpus, which has a similar < 1s runtime. However, this summary hides something more interesting: five of the killed mutants are ones not killed by any of our fuzzers, even in the well-seeded ten-minute libFuzzer runs:

703c703
<   return left_black_cnt + (node->red ? 0 : 1);
---
>   return left_black_cnt / (node->red ? 0 : 1);
703c703
<   return left_black_cnt + (node->red ? 0 : 1);
---
>   return left_black_cnt % (node->red ? 0 : 1);
703c703
<   return left_black_cnt + (node->red ? 0 : 1);
---
>   /*return left_black_cnt + (node->red ? 0 : 1);*/
701c701
<   right_black_cnt = checkRepHelper (node->right, t);
---
>   /*right_black_cnt = checkRepHelper (node->right, t);*/
700c700
<   left_black_cnt = checkRepHelper (node->left, t);
---
>   /*left_black_cnt = checkRepHelper (node->left, t);*/

These bugs are all in the checkRep code itself, which was not even targeted by symbolic execution. While these bugs do not involve actual faulty red-black tree behavior, they show that our fuzzers could allow subtle flaws to be introduced into the red-black tree’s tools for checking its own validity. In the right context, these could be serious faults, and certainly show a gap in the fuzzer-based testing. In order to see how hard it is to detect these faults, we tried using libFuzzer on each of these mutants, with our one hour corpus as seed, for one additional hour of fuzzing on each mutant. It was still unable to detect any of these mutants.

While generating tests using symbolic execution takes more computational power, and, perhaps, more human effort, the very thorough (if limited in scope) tests that result can detect bugs that even aggressive fuzzing may miss. Such tests are certainly a powerful addition to a regression test suite for an API. Learning to use DeepState makes mixing fuzzing and symbolic execution in your testing easy. Even if you need a new harness for symbolic execution work, it looks like, and can share code with, most of your fuzzing-based testing. A major long-term goal for DeepState is to increase the scalability of symbolic execution for API sequence testing, using high-level strategies not dependent on the underlying engine, so you can use the same harness more often.

See the DeepState repo for more information on how to use symbolic execution.

What About Code Coverage?

We didn’t even look at code coverage in our fuzzing. The reason is simple: if we’re willing to go to the effort of applying mutation testing, and examining all surviving mutants, there’s not much additional benefit in looking at code coverage. Under the hood, libFuzzer and the symbolic execution engines aim to maximize coverage, but for our purposes mutants work even better. After all, if we don’t cover mutated code, we can hardly kill it. Coverage can be very useful, of course, in early stages of fuzzer harness development, where mutation testing is expensive, and you really just want to know if you are even hitting most of the code. But for intensive testing, when you have the time to do it, mutation testing is much more thorough. Not only do you have to cover the code, you actually have to test what it does. In fact, at present, most scientific evidence for the usefulness of code coverage relies on the greater usefulness of mutation testing.

Further Reading

For a more involved example using DeepState to test an API, see the TestFs example, which tests a user-level, ext3-like file system, or the differential tester that compares behavior of Google’s leveldb and Facebook’s rocksdb. For more details on DeepState in general, see our NDSS 2018 Binary Analysis Research Workshop paper.

Fuzzing an API with DeepState (Part 1)

Alex Groce, Associate Professor, School of Informatics, Computing and Cyber Systems, Northern Arizona University

Using DeepState, we took a handwritten red-black tree fuzzer and, with minimal effort, turned it into a much more fully featured test generator. The DeepState fuzzer, despite requiring no more coding effort, supports replay of regression tests, reduction of the size of test cases for debugging, and multiple data-generation back-ends, including Manticore, angr, libFuzzer, and AFL. Using symbolic execution, we even discovered artificially introduced bugs that the original fuzzer missed. After reading this article, you should be ready to start applying high-powered automated test generation to your own APIs.

Background

In 2013, John Regehr wrote a blog post on “How to Fuzz an ADT Implementation.” John wrote at some length about general issues in gaining confidence that a data-type implementation is reliable, discussing code coverage, test oracles, and differential testing. If you have not yet read John’s article, then I recommend reading it now. It gives a good overview of how to construct a simple custom fuzzer for an ADT, or, for that matter, any fairly self-contained API where there are good ways to check for correctness.

The general problem is simple. Suppose we have a piece of software that provides a set of functions or methods on objects. Our running example in this post is a red-black tree; however, an AVL tree, a file-system, an in-memory store, or even a crypto library could easily be swapped in. We have some expectations about what will happen when we call the available functions. Our goal is to thoroughly test the software, and the traditional unit-testing approach to the problem is to write a series of small functions that look like:

result1 = foo(3, "hello");
result2 = bar(result1, "goodbye");
assert(result2 == DONE);

That is, each test has the form: “do something, then check that it did the right thing.” This approach has two problems. First, it’s a lot of work. Second, the return on investment for that work is not as good as you would hope; each test does one specific thing, and if the author of the tests doesn’t happen to think of a potential problem, then the tests are very unlikely to catch that problem. These unit tests are insufficient for the same reasons that AFL and other fuzzers have been so successful at finding security vulnerabilities in widely used programs: humans are too slow at writing many tests, and are limited in their ability to imagine insane, harmful inputs. The randomness of fuzzing makes it possible to produce many tests very quickly and results in tests that go far outside the “expected uses.”

Fuzzing is often thought of as generating files or packets, but it can also generate sequences of API calls to test software libraries. Such fuzzing is often referred to as random or randomized testing, but fuzzing is fuzzing. Instead of a series of unit tests doing one specific thing, a fuzzer test (also known as a property-based test or a parameterized unit test) looks more like:

foo_result = NULL;
bar_result = NULL;
repeat LENGTH times:
   switch (choice):
      choose_foo:
         foo_result = foo(randomInt(), randomString());
         break;
      choose_bar:
         bar_result = bar(foo_result, randomString());
         break;
      choose_baz:
         baz_result = baz(foo_result, bar_result);
         break;
   checkInvariants();

That is, the fuzzer repeatedly chooses a random function to call, and then calls the chosen function, perhaps storing the results for use in later function calls.

A well-constructed test of this form will include lots of generalized assertions about how the system should behave, so that the fuzzer is more likely to shake out unusual interactions between the function calls. The most obvious such checks are any assertions in the code, but there are numerous other possibilities. For a data structure, this will come in the form of a repOK function that makes sure that the ADT’s internal representation is in a consistent state. For red-black trees, that involves checking node coloring and balance. For a file system, you may expect that chkdsk will never find any errors after a series of valid file system operations. In a crypto library (or a JSON parser, for that matter, with some restrictions on the content of message) you may want to check round-trip properties: message == decode(encode(message, key), key). In many cases, such as with ADTs and file systems, you can use another implementation of the same or similar functionality, and compare results. Such differential testing is extremely powerful, because it lets you write a very complete specification of correctness with relatively little work.

John’s post doesn’t just give general advice, it also includes links to a working fuzzer for a red-black tree. The fuzzer is effective and serves as a great example of how to really hammer an API using a solid test harness based on random value generation. However, it’s also not a completely practical testing tool. It generates inputs, and tests the red-black tree, but when the fuzzer finds a bug, it simply prints an error message and crashes. You don’t learn anything except “Your code has a bug. Here is the symptom.” Modifying the code to print out the test steps as they happen slightly improves the situation, but there are likely to be hundreds or thousands of steps before the failure.

Ideally, the fuzzer would automatically store failing test sequences in a file, minimize the sequences to make debugging easy, and make it possible to replay old failing tests in a regression suite. Writing the code to support all this infrastructure is no fun (especially in C/C++) and dramatically increases the amount of work required for your testing effort. Handling the more subtle aspects, such as trapping assertion violations and hard crashes so that you can write the test to the file system before terminating, is also hard to get right.

AFL and other general-purpose fuzzers usually provide this kind of functionality, which makes fuzzing a much more practical tool in debugging. Unfortunately, such fuzzers are not convenient for testing APIs. They typically generate a file or byte buffer, and expect that the program being tested will take that file as input. Turning a series of bytes into a red-black tree test is probably easier and more fun than writing all the machinery for saving, replaying, and reducing tests, but it still seems like a lot of work that isn’t directly relevant to your real task: figuring out how to describe valid sequences of API calls, and how to check for correct behavior. What you really want is a unit testing framework like GoogleTest, but one that is capable of varying the input values used in tests. There are lots of good tools for random testing, including my own TSTL, but few sophisticated ones target C/C++, and none that we are aware of let you use any test generation method other than the tools’ built-in random tester. That’s what we want: GoogleTest, but with the ability to use libFuzzer, AFL, HonggFuzz, or what you will to generate data.

Enter DeepState

DeepState fills that need, and more. (We’ll get to the ‘more’ when we discuss symbolic execution).

Translating John’s fuzzer into a DeepState test harness is relatively easy. Here is a DeepState version of “the same fuzzer.” The primary changes for DeepState, which can be found in the file deepstate_harness.cpp, are:

  • Remove main and replace it with a named test (TEST(RBTree, GeneralFuzzer))
    • A DeepState file can contain more than one named test, though it is fine to only have one test.
  • Just create one tree in each test, rather than having an outer loop that iterates over calls that affect a single tree at a time.
    • Instead of a fuzzing loop, our tests are closer to very generalized unit tests: each test does one sequence of interesting API calls.
    • DeepState will handle running multiple tests; the fuzzer or symbolic execution engine will provide the “outer loop.”
  • Fix the length of each API call sequence to a fixed value, rather than a random one.
    • The #define LENGTH 100 at the top of the file controls how many functions we call in each test.
    • Having bytes be in somewhat the same positions in every test is helpful for mutation-based fuzzers. Extremely long tests will go beyond libFuzzer’s default byte length.
    • So long as they don’t consume so many bytes that fuzzers or DeepState reach their limits, or have trouble finding the right bytes to mutate, longer tests are usually better than shorter tests. There may be a length five sequence that exposes your bug, but DeepState’s brute-force fuzzer and even libFuzzer and AFL will likely have trouble finding it, and more easily produce a length 45 version of the same problem. Symbolic execution, on the other hand, will find such rare sequences for any length it can handle.
    • For simplicity, we use a #define in our harness, but it is possible to define such testing parameters as optional command-line arguments with a default value, for even greater flexibility in testing.  Just use the same tools as DeepState uses to define its own command-line options (see DeepState.c and DeepState.h).
  • Replace various rand() % NNN calls with DeepState_Int(), DeepState_Char() and DeepState_IntInRange(...) calls.
    • DeepState provides calls to generate most of the basic data types you want, optionally over restricted ranges.
    • You can actually just use rand() instead of making DeepState calls. If you include DeepState and have defined DEEPSTATE_TAKEOVER_RAND, all rand calls will be translated to appropriate DeepState functions. The file easy_deepstate_fuzzer.cpp shows how this works, and is the simplest translation of John’s fuzzer. It isn’t ideal, since it doesn’t provide any logging to show what happens during tests. This is often the easiest way to convert an existing fuzzer to use DeepState; the changes from John’s fuzzer are minimal: 90% of the work is just changing a few includes and removing main.
  • Replace the switch statement choosing the API call to make with DeepState’s OneOf construct.
    • OneOf takes a list of C++ lambdas, and chooses one to execute.
    • This change is not strictly required, but using OneOf simplifies the code and allows optimization of choices and smart test reduction.
    • Another version of OneOf takes a fixed-size array as input, and returns some value in it; e.g., OneOf("abcd") will produce a character, either a, b, c, or d.

There are a number of other cosmetic (e.g. formatting, variable naming) changes, but the essence of the fuzzer is clearly preserved here. With these changes, the fuzzer works almost as before, except that instead of running the fuzz_rb executable, we’ll use DeepState to run the test we’ve defined and generate input values that choose which function calls to make, what values to insert in the red-black tree, and all the other decisions represented by DeepState_Int, OneOf, and other calls:

int GetValue() {
  if (!restrictValues) {
    return DeepState_Int();
  } else {
    return DeepState_IntInRange(0, valueRange);
  }
}
...
  for (int n = 0; n < LENGTH; n++) {
    OneOf(
      [&] {
        int key = GetValue();
        int* ip = (int*)malloc(sizeof(int));
        *ip = key;
        if (!noDuplicates || !containerFind(*ip)) {
          void* vp = voidP();
          LOG(TRACE) << n << ": INSERT:" << *ip << " " << vp;
          RBTreeInsert(tree, ip, vp);
          containerInsert(*ip, vp);
        } else {
          LOG(TRACE) << n << ": AVOIDING DUPLICATE INSERT:" << *ip;
          free(ip);
        }
      },
      [&] {
        int key = GetValue();
        LOG(TRACE) << n << ": FIND:" << key;
        if ((node = RBExactQuery(tree, &key))) {
          ASSERT(containerFind(key)) << "Expected to find " << key;
        } else {
          ASSERT(!containerFind(key)) << "Expected not to find " << key;
        }
      },
...

Installing DeepState

The DeepState GitHub repository provides more details and dependencies, but on my MacBook Pro, installation is simple:

git clone https://github.com/trailofbits/deepstate
cd deepstate
mkdir build
cd build
cmake ..
sudo make install

Building a version with libFuzzer enabled is slightly more involved:

brew install llvm@7
git clone https://github.com/trailofbits/deepstate
cd deepstate
mkdir build
cd build
CC=/usr/local/opt/llvm\@7/bin/clang CXX=/usr/local/opt/llvm\@7/bin/clang++ BUILD_LIBFUZZER=TRUE cmake ..
sudo make install

AFL can also be used to generate inputs for DeepState, but most of the time, raw speed (due to not needing to fork), decomposition of compares, and value profiles seem to give libFuzzer an edge for this kind of API testing, in our (limited experimentally!) experience. For more on using AFL and other file-based fuzzers with DeepState, see the DeepState README.

Using the DeepState Red-Black Tree Fuzzer

Once you have installed DeepState, building the red-black tree fuzzer(s) is also simple:

git clone https://github.com/agroce/rb_tree_demo
cd rb_tree_demo
make

The make command compiles everything with all the sanitizers we could think of (address, undefined, and integer) in order to catch more bugs in fuzzing. This has a performance penalty, but is usually worth it.

If you are on macOS and using a non-Apple clang in order to get libFuzzer support, you’ll want to do something like

CC=/usr/local/opt/llvm\@7/bin/clang CXX=/usr/local/opt/llvm\@7/bin/clang++ make

in order to use the right (e.g., homebrew-installed) version of the compiler.

This will give you a few different executables of interest. One, fuzz_rb, is simply John’s fuzzer, modified to use a 60-second timeout instead of a fixed number of “meta-iterations.” The ds_rb executable is the DeepState executable. You can fuzz the red-black tree using a simple brute-force fuzzer (that behaves very much like John’s original fuzzer):

mkdir tests
./ds_rb --fuzz --timeout 60 --output_test_dir tests

If you want to see more about what the fuzzer is doing, you can specify a log level using --log_level to indicate the minimum importance of messages you want to see. A log_level of 0 corresponds to including all messages, even debug messages; 1 is TRACE messages from the system under test (e.g., those produced by the LOG(TRACE) code shown above); 2 is INFO, non-critical messages from DeepState itself (this is the default, and usually appropriate); 3 is warnings, and so forth up the hierarchy. The tests directory should be empty at the termination of fuzzing, since the red-black tree code in the repo (to my knowledge) has no bugs. If you add --fuzz_save_passing to the options, you will end up with a large number of files for passing tests in the directory.
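
For example, to watch the harness's TRACE output and also keep the passing tests around, you could run something like:

./ds_rb --fuzz --timeout 60 --output_test_dir tests --log_level 1 --fuzz_save_passing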

Finally, we can use libFuzzer to generate tests:

mkdir corpus
./ds_rb_lf corpus -use_value_profile=1 -detect_leaks=0 -max_total_time=60

The ds_rb_lf executable is a normal libFuzzer executable, with the same command line options. This will run libFuzzer for 60 seconds, and place any interesting inputs (including test failures) in the corpus directory. If there is a crash, it will leave a crash- file in the current directory. You can tune it to perform a little better in some cases by determining the maximum input size your tests use, but this is a non-trivial exercise. In our case at length 100 the gap between our max size and 4096 bytes is not extremely large.
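
For example, if you determine that your tests never need more than a few hundred bytes, you might add libFuzzer's -max_len option (the value here is an arbitrary guess):

./ds_rb_lf corpus -use_value_profile=1 -detect_leaks=0 -max_total_time=60 -max_len=512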

For more complex code, a coverage-driven, instrumentation-based fuzzer like libFuzzer or AFL will be much more effective than the brute force randomness of John's fuzzer or the simple DeepState fuzzer. For an example like the red-black tree, this may not matter as much, since few states may be very hard to reach for a fast "dumb" fuzzer. Even here, however, smarter fuzzers have the advantage of producing a corpus of tests that produce interesting code coverage. DeepState lets you use a faster fuzzer for quick runs, and smarter tools for more in-depth testing, with almost no effort.

We can replay any DeepState-generated tests (from libFuzzer or DeepState’s fuzzer) easily:

./ds_rb --input_test_file file

Or replay an entire directory of tests:

./ds_rb --input_test_files_dir dir

Adding an --exit_on_fail flag when replaying an entire directory lets you stop the testing as soon as you hit a failing or crashing test. This approach can easily be used to add failures found with DeepState (or interesting passing tests, or perhaps corpus tests from libFuzzer) to automatic regression tests for a project, including in CI.
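
For example, a regression step in CI might look something like this, assuming failing and otherwise interesting tests have been checked into a (hypothetical) regressions directory:

./ds_rb --input_test_files_dir regressions --exit_on_fail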

Adding a Bug

This is all fine, but it doesn’t (or at least shouldn’t) give us much confidence in John’s fuzzer or in DeepState. Even if we changed the Makefile to let us see code coverage, it would be easy to write a fuzzer that doesn’t actually check for correct behavior – it covers everything, but doesn’t find any bugs other than crashes. To see the fuzzers in action (and see more of what DeepState gives us), we can add a moderately subtle bug. Go to line 267 of red_black_tree.c and change the 1 to a 0. The diff of the new file and the original should look like:

267c267
<   x->parent->parent->red=0;
---
>   x->parent->parent->red=1;

Do a make to rebuild all the fuzzers with the new, broken red_black_tree.c.

Running John’s fuzzer will fail almost immediately:

time ./fuzz_rb
Assertion failed: (left_black_cnt == right_black_cnt), function checkRepHelper, file red_black_tree.c, line 702.
Abort trap: 6

real 0m0.100s
user 0m0.008s
sys 0m0.070s

Using the DeepState fuzzer will produce results almost as quickly. (We’ll let it show us the testing using the --log_level option, and tell it to stop as soon as it finds a failing test.):

time ./ds_rb --fuzz --log_level 1 --exit_on_fail --output_test_dir tests
INFO: Starting fuzzing
WARNING: No seed provided; using 1546625762
WARNING: No test specified, defaulting to last test defined (RBTree_GeneralFuzzer)
TRACE: Running: RBTree_GeneralFuzzer from deepstate_harness.cpp(78)
TRACE: deepstate_harness.cpp(122): 0: DELETE:-747598508
TRACE: deepstate_harness.cpp(190): checkRep...
TRACE: deepstate_harness.cpp(192): RBTreeVerify...
TRACE: deepstate_harness.cpp(122): 1: DELETE:831257296
TRACE: deepstate_harness.cpp(190): checkRep...
TRACE: deepstate_harness.cpp(192): RBTreeVerify...
TRACE: deepstate_harness.cpp(134): 2: PRED:1291220586
TRACE: deepstate_harness.cpp(190): checkRep...
TRACE: deepstate_harness.cpp(192): RBTreeVerify...
TRACE: deepstate_harness.cpp(190): checkRep...
TRACE: deepstate_harness.cpp(192): RBTreeVerify...
TRACE: deepstate_harness.cpp(154): 4: SUCC:-1845067087
TRACE: deepstate_harness.cpp(190): checkRep...
TRACE: deepstate_harness.cpp(192): RBTreeVerify...
TRACE: deepstate_harness.cpp(190): checkRep...
TRACE: deepstate_harness.cpp(192): RBTreeVerify...
TRACE: deepstate_harness.cpp(113): 6: FIND:-427918646
TRACE: deepstate_harness.cpp(190): checkRep...
...
TRACE: deepstate_harness.cpp(192): RBTreeVerify...
TRACE: deepstate_harness.cpp(103): 44: INSERT:-1835066397 0x00000000ffffff9c
TRACE: deepstate_harness.cpp(190): checkRep...
TRACE: deepstate_harness.cpp(192): RBTreeVerify...
TRACE: deepstate_harness.cpp(190): checkRep...
TRACE: deepstate_harness.cpp(192): RBTreeVerify...
TRACE: deepstate_harness.cpp(154): 46: SUCC:-244966140
TRACE: deepstate_harness.cpp(190): checkRep...
TRACE: deepstate_harness.cpp(192): RBTreeVerify...
TRACE: deepstate_harness.cpp(190): checkRep...
TRACE: deepstate_harness.cpp(192): RBTreeVerify...
TRACE: deepstate_harness.cpp(103): 48: INSERT:1679127713 0x00000000ffffffa4
TRACE: deepstate_harness.cpp(190): checkRep...
Assertion failed: (left_black_cnt == right_black_cnt), function checkRepHelper, file red_black_tree.c, line 702.
ERROR: Crashed: RBTree_GeneralFuzzer
INFO: Saved test case to file `tests/6de8b2ffd42af6878875833c0cbfa9ea09617285.crash`
...
real 0m0.148s
user 0m0.011s
sys 0m0.131s

I’ve omitted much of the output above, since showing all 49 steps before the detection of the problem is a bit much, and the details of your output will certainly vary. The big difference from John’s fuzzer, besides the verbose output, is the fact that DeepState saved a test case. The name of your saved test case will, of course, be different, since the names are uniquely generated for each saved test. To replay the test, I would do this:

./ds_rb --input_test_file tests/6de8b2ffd42af6878875833c0cbfa9ea09617285.crash

and I would get to see the whole disaster again, in gory detail. As we said above, this lengthy sequence of seemingly arbitrary operations isn’t the most helpful test for seeing what’s going on. DeepState can help us here:

deepstate-reduce ./ds_rb tests/6de8b2ffd42af6878875833c0cbfa9ea09617285.crash minimized.crash
ORIGINAL TEST HAS 8192 BYTES
LAST BYTE READ IS 509
SHRINKING TO IGNORE UNREAD BYTES
ONEOF REMOVAL REDUCED TEST TO 502 BYTES
ONEOF REMOVAL REDUCED TEST TO 494 BYTES
...
ONEOF REMOVAL REDUCED TEST TO 18 BYTES
ONEOF REMOVAL REDUCED TEST TO 2 BYTES
BYTE RANGE REMOVAL REDUCED TEST TO 1 BYTES
BYTE REDUCTION: BYTE 0 FROM 168 TO 0
NO (MORE) REDUCTIONS FOUND
PADDING TEST WITH 49 ZEROS

WRITING REDUCED TEST WITH 50 BYTES TO minimized.crash

Again, we omit some of the lengthy process of reducing the test. The new test is (much!) easier to understand:

./ds_rb --input_test_file minimized.crash
WARNING: No test specified, defaulting to last test defined (RBTree_GeneralFuzzer)
TRACE: Initialized test input buffer with data from `minimized.crash`
TRACE: Running: RBTree_GeneralFuzzer from deepstate_harness.cpp(78)
TRACE: deepstate_harness.cpp(103): 0: INSERT:0 0x0000000000000000
TRACE: deepstate_harness.cpp(190): checkRep...
TRACE: deepstate_harness.cpp(192): RBTreeVerify...
TRACE: deepstate_harness.cpp(103): 1: INSERT:0 0x0000000000000000
TRACE: deepstate_harness.cpp(190): checkRep...
TRACE: deepstate_harness.cpp(192): RBTreeVerify...
TRACE: deepstate_harness.cpp(103): 2: INSERT:0 0x0000000000000000
TRACE: deepstate_harness.cpp(190): checkRep...
Assertion failed: (left_black_cnt == right_black_cnt), function checkRepHelper, file red_black_tree.c, line 702.
ERROR: Crashed: RBTree_GeneralFuzzer

We just need to insert three identical values into the tree to expose the problem. Remember to fix your red_black_tree.c before proceeding!
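
Since rb_tree_demo is a git clone, the easiest way to undo the planted bug is to restore the file:

git checkout -- red_black_tree.c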

You can watch the whole process in action:

In Part 2, we’ll look at how to assess the quality of our testing: is our DeepState testing as effective as John’s fuzzer? Are both approaches unable to find certain subtle bugs? And what about symbolic execution?

How McSema Handles C++ Exceptions

C++ programs using exceptions are problematic for binary lifters. The non-local control-flow “throw” and “catch” operations that appear in C++ source code do not map neatly to straightforward binary representations. One could allege that the compiler, runtime, and stack unwinding library collude to make exceptions work. We recently completed our investigation into exceptions and can claim beyond a reasonable doubt that McSema is the only binary lifter that correctly lifts programs with exception-based control flow.

Our work on McSema had to bridge the semantic gap between a program’s high-level language semantics and its binary representation, which required a complete understanding of how exceptions work under the hood. This post is organized into three sections: first, we are going to explain how C++ exceptions are handled in Linux for x86-64 architectures and explain core exception handling concepts. Second, we will show how we used this knowledge to recover exception information at the binary level. And third, we will explain how to emit exception information for the LLVM ecosystem.

A Short Primer in C++ Exception Handling

In this section, we will use a small motivating example to demonstrate how C++ exceptions work at the binary level and discuss exception semantics for Linux programs running on x86-64 processors. While exceptions work differently on different operating systems, processors, and languages, many of the core concepts are identical.

Exceptions are a programming language construct that provide a standardized way to handle abnormal or erroneous situations. They work by automatically redirecting execution flow to a special handler called the exception handler when such an event occurs. Using exceptions, it is possible to be explicit about ways in which operations can fail and how those failures should be handled. For example, some operations like object instantiation and file processing can fail in multiple ways. Exception handling allows the programmer to handle these failures in a generic way for large blocks of code, instead of manually verifying each individual operation.

Exceptions are a core part of C++, although their use is optional. Code that may fail is surrounded in a try {…} block, and the exceptions that may be raised are caught via a catch {…} block. Signalling of exceptional conditions is triggered via the throw keyword, which raises an exception of a specific type. Figure 1 shows a simple program that uses C++ exception semantics. Try building the program yourself (clang++ -o exception exception.cpp) or look at the code in the Compiler Explorer.

#include <iostream>
#include <vector>
#include <stdexcept>

int main(int argc, const char *argv[]) {
  std::vector<int> myVector(10);
  int var = std::atoi(argv[1]);

  try {
    if( var == 0 ) {
      throw std::runtime_error("Runtime error: argv[1] cannot be zero.");
    }

    if(argc != 2) {
      throw std::out_of_range("Supply one argument.");
    }

    myVector.at( var ) = 100;
    while(true) {
      new int [100000000ul];
    }

  } catch (const std::out_of_range& e) {
    std::cerr << "Out of range error: " << e.what() << '\n';
    return 1;
  } catch (const std::bad_alloc& e) {
    std::cerr << "Allocation failed: " << e.what() << '\n';
    return 1;
  } catch (...) {
    std::cerr << "Unknown error.\n";
    return 1;
  }
  return 0;
}

Figure 1: Example C++ program that throws and catches exceptions, including a catch-all clause.

This simple program can explicitly throw std::runtime_error and std::out_of_range exceptions based on input arguments. It also implicitly throws the std::bad_alloc exception when it runs out of memory. The program installs three exception handlers: one for std::out_of_range, one for std::bad_alloc, and a catch-all handler for generic unknown exceptions. Run the following sample inputs to trigger the three exceptional conditions:

Scenario 1: ./exception 0
Unknown error.
The program does not accept `0` as its input argument and throws the std::runtime_error exception, which is caught by the catch-all handler.
Scenario 2: ./exception 0 1
Out of range error: vector::_M_range_check
The program expects exactly one input argument. If more than one argument is supplied, it throws the std::out_of_range exception.
Scenario 3: ./exception 1
Allocation failed: std::bad_alloc
For any input other than `0`, the program performs a huge memory allocation, which eventually fails. Such failures happen at runtime and can easily go unnoticed; here, the memory allocator throws the std::bad_alloc exception so the program can terminate safely.

Let’s look at the same program at the binary level. Compiler Explorer shows the binary code generated by the compiler for this program. The compiler translates the throw statements into a pair of calls to libstdc++ functions (__cxa_allocate_exception and __cxa_throw) that allocate the exception structure and start stack unwinding, the process that cleans up local objects in the scopes leading up to the exception (see lines 40-48 in Compiler Explorer).

Stack unwinding: Removes the stack frame of the exited functions from the process stack.

The catch statements are translated into functions, called landingpads, that handle the exception and perform clean-up operations. The compiler generates an exception table that ties together everything the runtime needs to dispatch exceptions, including exception types, associated landing pads, and various utility functions.

landingpad: User code intended to catch an exception. It gains control from the exception runtime via the personality function, and either merges into the normal user code or returns to the runtime by resuming or raising a new exception.

When an exception occurs, the stack unwinder cleans up previously allocated variables and calls the catch block. The unwinder:

  1. Calls the libstdc++ personality function. First, the stack unwinder calls a special function provided by libstdc++ called the personality function. The personality function will determine whether the raised exception is handled by a function somewhere on the call stack. In high-level terms, the personality function determines whether there is a catch block that should be called for this exception. If no handler can be located (i.e. the exception is unhandled), the personality function terminates the program by calling std::terminate.
  2. Cleans up allocated objects. To cleanly call the catch block, the unwinder must first clean up (i.e. call destructors for each allocated object) after every function called inside the try block. The unwinder will iterate through the call stack, using the personality function to identify a cleanup method for each stack frame. If there are any cleanup actions, the unwinder calls the associated cleanup code.
  3. Executes the catch block. Eventually the unwinder will reach the stack frame of the function containing the exception handler, and execute the catch block.
  4. Releases memory. Once the catch block completes, a cleanup function will be called again to release memory allocated for the exception structure.

For the curious, more information is available in the comments and source code for libgcc’s stack unwinder.

personality function: A libstdc++ function called by the stack unwinder. It determines whether there is a catch block for a raised exception. If none is found, the program is terminated with std::terminate.

Recovering Exception Information

Recovering exception-based control flow is a challenging proposition for binary analysis tools like McSema. The fundamental data is difficult to assemble, because exception information is spread throughout the binary and tied together via multiple tables. Utilizing exception data to recover control flow is hard, because operations that affect flow, like stack unwinding, calls to personality functions, and exception table decoding happen outside the purview of the compiled program.

Here’s a quick summary of the end goal. McSema must identify every basic block that may raise an exception (i.e. the contents of a try block) and associate it with the appropriate exception handler and cleanup code (i.e. the catch block or landing pad). This association will then be used to re-generate exception handlers at the LLVM level. To associate blocks with landing pads, McSema parses the exception table to provide these mappings.

We’re going to go into some detail about the exception table. It’s important to understand, because this is the main data structure that allows McSema to recover exception-based control flow.

The Exception Table

The exception table provides language runtimes the information to support exceptions. It has two levels: the language-independent level and the language-specific level. Locating stack frames and restoring them is language agnostic, and is therefore stored in the independent level. Identifying the frame that handles the exceptions and transferring control to it is language dependent, so this is stored in the language-specific level.

Language-Independent Level

The table is stored in special sections in the binary called .eh_frame and .eh_frame_hdr. The .eh_frame section contains one or more call frame information records encoded in the DWARF debug information format. Each frame information record contains a Common Information Entry (CIE) record, followed by one or more Frame Descriptor Entry (FDE) records. Together they describe how to unwind the caller based on the current instruction pointer. More details are described in the Linux Standard Base documentation.

Language-Specific Level

The language-specific data area (LSDA) contains pointers to related data, a list of call sites, and a list of action records. Each function has its own LSDA, which is provided as the augmentation data of the Frame Descriptor Entry (FDE). Information from the LSDA is essential to recovering C++ exception information, and in translating it to LLVM semantics.


Figure 2: Exception handling information from the current instruction pointer (IP). Graphic adapted from the linked original.

The LSDA header describes how exception information applies to language-specific procedure fragments. Figure 2 shows how the LSDA is located starting from the current instruction pointer. There are two fields defined in the LSDA header that McSema needs to recover exception information:

  • The landing pad start pointer: A relative offset to the start of the landing pad code.
  • The types table pointer: A relative offset to the types table, which describes exception types handled by the catch clauses for this procedure fragment.

Following the LSDA header, the call site table lists all call sites that may throw an exception. Each entry in the call site table indicates the position of the call site, the position of the landing pad, and the first action record for that call site. A missing entry from the call site table indicates that a call should not throw an exception. Information from this table will be used by McSema during the translation stage to emit proper LLVM semantics for call sites that may throw exceptions.

The action table follows the call site table in the LSDA and specifies both catch clauses and exception specifications. By exception specifications here we mean the much-maligned C++ feature called “exception specifications,” which enumerates the exceptions a function may throw. The two record types have the same format and are distinguished solely by the first field of each entry. Positive values for this field specify types used in catch clauses. Negative values specify exception specifications. Figure 3 shows the action table with catch clauses (red), a catch-all clause (orange), and an exception specification (blue). (The exception specification feature was deprecated in C++11 and removed in C++17.) Because this feature is deprecated and rarely used, McSema does not currently handle exception specifications.

.gcc_except_table:4022CF db 7Fh; ar_filter[1]: -1(exception spec index = 4022EC)
.gcc_except_table:4022D0 db 0  ; ar_next[1]: 0 (end)
.gcc_except_table:4022D1 db 0  ; ar_filter[2]: 0 (cleanup)
.gcc_except_table:4022D2 db 7Dh; ar_next[2]: -3 (next: 1 => 004022CF)
.gcc_except_table:4022D3 db 4  ; ar_filter[3]: 4 (catch typeinfo = 000000)
.gcc_except_table:4022D4 db 0  ; ar_next[3]: 0 (end)
.gcc_except_table:4022D5 db 1  ; ar_filter[4]: 1 (catch typeinfo = 00603280)
.gcc_except_table:4022D6 db 7Dh; ar_next[4]: -3 (next: 3 => 004022D3)
.gcc_except_table:4022D7 db 3  ; ar_filter[5]: 3 (catch typeinfo = 603230)
.gcc_except_table:4022D8 db 7Dh; ar_next[5]: -3 (next: 4 => 004022D5)

Figure 3: Action table entries in the LSDA section

Lifting Exception Information

So far we have looked at how exceptions in C++ work at a low level, how exception information is stored, and how McSema recovers exception based control flow. Now we will look at how McSema lifts this control flow to LLVM.

To lift exception information, the exception and language semantics described in the last section have to be recovered from the binary and translated into LLVM. The recovery and translation is a three-phase process that required updating control flow graph (CFG) recovery, lifting, and runtime components of McSema.

McSema’s translation stage uses the information gleaned from CFG recovery to generate LLVM IR that handles exception semantics. To ensure the final binary will execute like the original, the following steps must happen:

  • McSema must associate exception handlers and cleanup methods with blocks that raise exceptions. Functions that throw exceptions must be called via LLVM’s invoke instruction versus the call instruction.
  • Stack unwinding has to be enabled for function fragments that raise exceptions. This is complicated by the fact that translated code may have two stacks: a native stack (used for calling external APIs) and a lifted stack.
  • McSema must ensure there is a smooth transition between lifted code and the language runtime. Handlers called directly by the language runtime must serialize processor state into a structure expected by lifted code.

Associating Blocks and Handlers

The initial association between blocks that may throw exceptions and the handlers for those exceptions is performed during CFG recovery, via information extracted from the exception table. This association is required because the translator must ensure functions that may throw exceptions are called via LLVM’s invoke semantics and not the typical call instruction. The invoke instruction has two continuation points: normal flow when the call succeeds and exception flow (i.e., the exception handler) if the function raises an exception (Figure 4). The replacement of call with invoke must cover every invocation of that function: a single remaining call can convince the optimizer that the function doesn’t throw and therefore needs no exception table.

%1403 = call i64 @__mcsema_get_stack_pointer()
store i64 %1403, i64* %stack_ptr_var
%1404 = call i64 @__mcsema_get_frame_pointer()
store i64 %1404, i64* %frame_ptr_var
%1405 = load %struct.Memory*, %struct.Memory** %MEMORY
%1406 = load i64, i64* %PC
%1407 = invoke %struct.Memory* @ext_6032a0__Znam(%struct.State* %0,
i64 %1406, %struct.Memory* %1405)
to label %block_40119f unwind label %landingpad_4012615

Figure 4: An invoke instruction replaces a call instruction to a function that may throw an exception

Unwinding of the Stack

When an exception occurs, control transfers from the throw statement to the first catch statement that can handle the exception. Before the transfer, variables defined in function scope must be properly destroyed. This is called stack unwinding.

McSema uses two different stacks: one for lifted code, and one for native code (i.e. external functions). The split stack puts limitations on stack unwinding, since the native execution (i.e. libstdc++ API) doesn’t have a full view of the stack. To support stack unwinding, we added a new flag, --abi-libraries, which enables the usage of the same stack for lifted and native code execution.

The --abi-libraries flag enables the use of the same stack for native and lifted code by removing the need for lifted-to-native stack transitions. Normally, McSema needs to transition stacks so that an external function that does not know about McSema can see the CPU state as it was in the original program. Application binary interface (ABI) libraries, which provide external function signatures, including the return value, argument types, and argument count, allow lifted code to call native functions directly on the same stack. Figure 5 shows a snapshot of function signatures defined via ABI libraries.

declare i8* @__cxa_allocate_exception(i64) #0
declare void @__cxa_free_exception(i8*) #0
declare i8* @__cxa_allocate_dependent_exception() #0
declare void @__cxa_free_dependent_exception(i8*) #0
declare void @__cxa_throw(i8*, %"class.std::type_info"*, void (i8*)*) #0
declare i8* @__cxa_get_exception_ptr(i8*) #0
declare i8* @__cxa_begin_catch(i8*) #0
declare void @__cxa_end_catch() #0

Figure 5: An ABI library defining the external functions relating to exceptions.

Exception handling at runtime

Exception handlers and cleanup methods are called by the language runtime, and are expected to follow a strict calling convention. Lifted code does not follow standard calling convention semantics, because it expresses the original instructions as operations on CPU state. To support these callbacks, we implemented a special adaptor that converts a native state into a machine context usable by lifted code. Special care has been taken to preserve the RDX register, which stores the type index of the exception.

There is one more trick to emitting functional exception handlers: proper ordering of type indices. Recall that our motivating example (Figure 1) has three exception handlers: std::out_of_range, std::bad_alloc, and the catch-all handler. Each of these handlers is assigned a type index, say 1, 2, and 3 respectively (Figure 6a), meaning that the original program expects type index 1 to correspond to std::out_of_range.

.gcc_except_table:402254 db 3      ; ar_filter[1]: 3 (catch typeinfo = 000000)
.gcc_except_table:402255 db 0      ; ar_next[1]: 0 (end) 
.gcc_except_table:402256 db 2      ; ar_filter[2]: 2 (catch typeinfo = 603280)
.gcc_except_table:402257 db 7Dh    ; ar_next[2]: -3 (next: 1 => 402254)
.gcc_except_table:402258 db 1      ; ar_filter[3]: 1 (catch typeinfo = 603230)
.gcc_except_table:402259 db 7Dh    ; ar_next[3]: -3 (next: 2 => 402256)
.gcc_except_table:40225A db 0 
.gcc_except_table:40225B db 0 
.gcc_except_table:40225C dd 0      ; Type index 3 
.gcc_except_table:402260 dd 603280h; Type index 2 
.gcc_except_table:402264 dd 603230h; Type index 1

.gcc_except_table:41A78E db 1      ; ar_filter[1]: 1 (catch typeinfo = 000000)
.gcc_except_table:41A78F db 0      ; ar_next[1]: 0 (end) 
.gcc_except_table:41A790 db 2      ; ar_filter[2]: 2 (catch typeinfo = 61B450)
.gcc_except_table:41A791 db 7Dh    ; ar_next[2]: -3 (next: 1 => 0041A78E)
.gcc_except_table:41A792 db 3      ; ar_filter[3]: 3 (catch typeinfo = 61B4A0)
.gcc_except_table:41A793 db 7Dh    ; ar_next[3]: -3 (next: 2 => 0041A790)
.gcc_except_table:41A794 dd 61B4A0h; Type index 3 
.gcc_except_table:41A798 dd 61B450h; Type index 2 
.gcc_except_table:41A79C dd 0      ; Type index 1

Figure 6 (a & b): Type index assignment in the exception tables of the original binary (a) and the new binary (b). In the original binary, the std::out_of_range, std::bad_alloc, and catch-all exception types are assigned type indices 1, 2, and 3 respectively.

During the lifting process, McSema recreates the exception handlers used in the program. The type index assigned to each handler is generated at compile time. When the lifted bitcode is compiled into a new binary, the type indices can be, and often are, reassigned. For example, std::out_of_range could get type index 3 in the new binary (Figure 6b). Because the lifted handler dispatch still follows the original numbering, in which index 3 belonged to the catch-all handler, the lifted binary would then run the catch-all handler when std::out_of_range is thrown!

To ensure the right exception handler is called, McSema generates a static map (see gvar_landingpad_401133 in Figure 7) of original type indices to new type indices, and fixes the type index during the landingpad passthrough. The landingpad passthrough is a function that is automatically generated by McSema. Not only does it ensure the type index is correct, it also transitions between lifted and native state.
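Conceptually, the map is a small per-landing-pad lookup table. The sketch below is our illustration of the idea, not McSema’s generated code; the values follow the Figure 6 example, where indices 1 and 3 swap and index 2 is unchanged:

// Conceptual sketch of a type index remapping table for one landing pad.
// Slot 0 is unused. For the Figure 6 example the table is its own inverse,
// since indices 1 and 3 simply swap places.
static const int type_index_map[4] = {0, 3, 2, 1};

// The selector obtained at the landing pad indexes the table, mirroring the
// getelementptr + load sequence in Figure 7.
int fix_type_index(long selector) {
  return type_index_map[selector];
}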

Upon being called, the passthrough saves the native execution state, loads the lifted state, and calls the exception handlers (which have been lifted and expect lifted state). When the passthrough returns (if the exception was not handled), it must do the reverse and transition from lifted to native state in order to return into runtime library code. Figure 7 shows the landingpad passthrough generated for our motivating example. The generated passthrough code gets the type index from the RDX register using the function __mcsema_get_type_index. It fixes the type index and restores the machine context of the lifted execution using the function __mcsema_exception_ret. The wrapper code around the invoke statement saves the stack and frame pointers in the function context.

%landingpad_4011336 = landingpad { i8*, i32 }
catch i8* @"_ZTISt13runtime_error@@GLIBCXX_3.4"
catch i8* @"_ZTISt12out_of_range@@GLIBCXX_3.4"
catch i8* null
%4021 = call i64 @__mcsema_get_type_index()
%4022 = getelementptr [4 x i32], [4 x i32]* @gvar_landingpad_401133, i32 0, i64 %4021
%4023 = load i64, i64* %stack_ptr_var
%4024 = load i64, i64* %frame_ptr_var
%4025 = load i32, i32* %4022
call void @__mcsema_exception_ret(i64 %4023, i64 %4024, i32 %4025)
br label %block_401133

Figure 7: Landing pad passthrough for the exception handlers

With all of these pieces in place, McSema can finally translate C++ programs that use exceptions into LLVM.

Conclusion

To our knowledge, McSema is the only binary lifter to handle C++ exceptions, which are common throughout C++ software of any complexity. As we have shown, recovering exception-based control flow and translating it accurately is complex and difficult to implement correctly. Implementing exception handling touched all parts of McSema and posed new challenges for both control flow recovery and translation. The fine details, such as type index re-ordering and ensuring that every call is replaced with an invoke, all had to be discovered the hard way, by debugging subtle and frustrating failures.

We are continuing to develop and enhance McSema, and have more to share about exciting new features. If you are interested in McSema, try it out, contribute (we love open source contributions!), and talk to us in the binary-lifting channel on the Empire Hacking Slack.