Multi-Party Computation on Machine Learning

During my internship this summer, I built a multi-party computation (MPC) tool that implements a 3-party computation protocol for perceptron and support vector machine (SVM) algorithms.

MPC enables multiple parties to perform analyses on private datasets without sharing them with each other. I developed a technique that lets three parties obtain the results of machine learning across their non-public datasets, enabling data analytics that were previously impossible due to privacy constraints.

MPC Primer

In an MPC protocol, a group of parties, each holding its own secret input xi, agree on a function f, and each obtains the output f(x1, …, xn) without learning the private data of the other parties.

One of the first demonstrations of MPC was Yao’s Millionaire Problem, where two people want to determine who has more money without revealing the specific amounts they each have. Each person has a private bank account balance, xi, and they want to compute f(x1,x2) = x1 > x2. Yao demonstrated how to securely compute this, along with any other function, using garbled circuits.

Each person may infer some information about the other’s data via the MPC function’s output. The leaked information is limited to whether the other party’s amount is greater or less than their own; the exact amount is never disclosed.

MPC protocols use Boolean and arithmetic circuits to compute arbitrary functions (see Figure 1). Any computable function can be represented as a combination of Boolean (AND/OR) and/or arithmetic (ADD/MULT) gates; for example, f(x1, x2, x3) = x1·x2 + x3 becomes one MULT gate feeding one ADD gate. A secure MPC protocol for an arbitrary function is therefore built from a series of secure addition and multiplication protocols. In practice, we want to convert a function into a circuit and then run MPC protocols on that circuit.

MPC protocols must operate in a finite field, so many MPC tools are designed to work over the integers modulo a prime p. This makes machine learning applications difficult, and at times incompatible, because they require fixed-point arithmetic. Almost all of these tools also require a synthesis tool to convert a program (typically a C program) into its corresponding arithmetic circuit representation.

Several challenges make machine learning over MPC hard:

  • Fixed-point arithmetic and truncation protocols are not well-suited for the finite fields that MPC protocols operate in.
  • The circuit size can be as large as the input data. When functions are evaluated as circuits, all input values must be loaded into the circuit. The circuit will have input gates for every input value, which makes the input layer of the circuit as large as the dataset.
  • Storing the circuit and intermediate values requires a significant amount of memory.
  • Machine learning introduces loops whose exit conditions depend on intermediate values; however, each party must share its circuit with the other parties at the start of the protocol, so the circuit's structure cannot depend on intermediate values.

The goal of my project was to create a novel MPC tool that can train machine learning models from secret data held by multiple parties. We wanted to implement a more-than-two-party computation protocol that included fixed-point arithmetic with arbitrary precision for machine learning. We also wanted to provide an alternative to circuit synthesis as a way of performing MPC with complex functions.

Figure 1: (left) A simple arithmetic circuit. (middle) A simple boolean circuit. (right) The roadmap for performing multi-party computation.

Secure MPC Protocols

For three parties, each party will split their data into “secret shares” and send them to the others. Secret shares must reveal information about the secret value only when they are all combined; individually or in partial combinations, they must keep the secret hidden. This property is formally defined as privacy, and is illustrated by the following:

Let n=3 parties. The parties agree to use the integers mod p, Zp, with p=43, a random small prime. Party 1 has x1=11 and wants to create secret shares for this value. Party 1 will then generate two random values in Zp, say r2=21 and r3 = 36, and send r2 to Party 2 and r3 to Party 3. Each party then stores the following values as their secret shares:

share1 = (x1 − r2 − r3) mod 43 = (11 − 21 − 36) mod 43 = 40

share2 = r2 = 21 mod 43  

share3 = r3 = 36 mod 43

Notice that no single secret share, nor any combination of two, reveals any information about x1; only all three together do. This is where the strength of secret sharing lies.
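
To make this concrete, here is a minimal sketch of the sharing step in C, reproducing the example above (the prime and values are hard-coded from the text for illustration; a real implementation would use a cryptographically secure RNG and a much larger field):

#include <stdio.h>

#define P 43  /* toy prime from the example; real deployments use a large prime */

/* Reduce into [0, P), handling negative intermediates like 11 - 21 - 36. */
long mod_p(long x) { return ((x % P) + P) % P; }

int main(void) {
    long x1 = 11;           /* Party 1's secret */
    long r2 = 21, r3 = 36;  /* "random" values sent to Parties 2 and 3 */

    long share1 = mod_p(x1 - r2 - r3);  /* Party 1 keeps this: 40 */
    long share2 = r2;                   /* Party 2's share: 21 */
    long share3 = r3;                   /* Party 3's share: 36 */

    /* Only the sum of all three shares reveals the secret. */
    printf("reconstructed x1 = %ld\n", mod_p(share1 + share2 + share3));  /* 11 */
    return 0;
}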

After creating secret shares for all their private data, the parties can begin to perform secure addition or multiplication on their inputs. The protocols for both operations depend on the method of secret sharing used. I will demonstrate how we can add and multiply for the specific secret sharing scheme used above.

Let’s build the secure addition and multiplication primitives for MPC.

Addition

Consider the case of n=3, using field Zp and p=43. Here, we will take the three input values to be x1 = 11, x2 = 12, x3 = 13. Each party generates and distributes random values to obtain the following secret shares. Again, notice that:

(share1,i + share2,i + share3,i) mod 43 = xi mod 43:

Party 1 shares:  share1,1 = 40   share1,2 = 13   share1,3 = 8

Party 2 shares:  share2,1 = 21   share2,2 = 2 share2,3 = 17

Party 3 shares:  share3,1 = 36   share3,2 = 40   share3,3 = 31

To securely add the values x1, x2, and x3, the three parties simply add their shares of each of those values.

Party 1: share+,1 = share1,1 + share1,2 + share1,3 = (40 + 13 + 8) mod 43 = 18

Party 2: share+,2 = share2,1 + share2,2 + share2,3 = (21 + 2 + 17) mod 43 = 40

Party 3: share+,3 = share3,1 + share3,2 + share3,3 = (36 + 40 + 31) mod 43 = 21

That’s it! Now, each party has a secret share for the value:

x1 + x2 + x3 = 11 + 12 + 13 = (18 + 40 + 21) mod 43 = 36.
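
Here is the same computation sketched in C, with each party's row of shares taken straight from the tables above:

#include <stdio.h>

#define P 43
long mod_p(long x) { return ((x % P) + P) % P; }

int main(void) {
    /* shares[j][i] = Party j+1's share of x_(i+1), from the tables above */
    long shares[3][3] = { {40, 13, 8}, {21, 2, 17}, {36, 40, 31} };
    long sum_share[3];

    /* Each party adds its own shares locally; no communication is needed. */
    for (int j = 0; j < 3; j++)
        sum_share[j] = mod_p(shares[j][0] + shares[j][1] + shares[j][2]);

    printf("shares of the sum: %ld %ld %ld\n",
           sum_share[0], sum_share[1], sum_share[2]);          /* 18 40 21 */
    printf("x1 + x2 + x3 = %ld\n",
           mod_p(sum_share[0] + sum_share[1] + sum_share[2])); /* 36 */
    return 0;
}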

Scalar Addition and Multiplication

Scalar addition and multiplication refer to adding/multiplying a secret value by a public value. Suppose each party has secret shares of some value z and they want to obtain shares of say 5 + z or 5z. This actually turns out to be easy as well. To perform addition, Party 1 adds 5 to their share and the rest of the parties do nothing, and now they all have secret shares of 5 + z. To perform multiplication, all of the parties simply multiply their shares of z by 5, and they have all obtained secret shares of 5z. You can easily check that these hold in general for any public integer.
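
A quick sketch in C, using made-up shares of a value z (here z = 7 in the toy field):

#include <stdio.h>

#define P 43
long mod_p(long x) { return ((x % P) + P) % P; }

int main(void) {
    long z_shares[3] = {10, 25, 15};  /* hypothetical shares of z = 50 mod 43 = 7 */

    /* 5 + z: only Party 1 adds the public constant. */
    long add_shares[3] = { mod_p(z_shares[0] + 5), z_shares[1], z_shares[2] };

    /* 5z: every party multiplies its own share by the public constant. */
    long mul_shares[3];
    for (int j = 0; j < 3; j++)
        mul_shares[j] = mod_p(5 * z_shares[j]);

    printf("5 + z = %ld\n", mod_p(add_shares[0] + add_shares[1] + add_shares[2])); /* 12 */
    printf("5z    = %ld\n", mod_p(mul_shares[0] + mul_shares[1] + mul_shares[2])); /* 35 */
    return 0;
}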

Multiplication

Secure multiplication is more complicated than addition, because all of the parties must interact. This slows down performance when computing non-trivial functions that require several multiplications, especially if the parties are operating on a high-latency network. Therefore, the goal of most of these schemes is to minimize the amount of communication required to securely multiply.

Using Beaver Triples is a well-known method for multiplication that breaks the operation into a series of secure additions, scalar additions, and scalar multiplications. In more concrete terms, this method takes advantage of the following property: for random values a and b with c = ab, and public values y = x1 − a and z = x2 − b, we have x1x2 = yz + yb + az + c.

Secret shares of Beaver Triples are created along with secret shares of the input values. A Beaver Triple consists of two random values, a and b, along with their product, c = ab. Secret shares for each of these values are sent to the other parties, so each party has secret shares for x1, x2, a, b, and c.

To compute x1x2, every party first computes its shares of y = (x1 − a) and z = (x2 − b) and broadcasts them. The shares of (x1 − a) are obtained by scalar-multiplying the shares of a by −1 and securely adding the result to the shares of x1; the same is done for (x2 − b). Once those shares are broadcast, everyone can reconstruct y and z and compute the public value yz. In the final step, the parties obtain shares of yb and az through scalar multiplication, leaving everyone with the value yz and secret shares of yb, az, and c.

Secure addition is then used on the shares of yb, az, and c to obtain secret shares of (yb + az + c). Finally, the parties perform a scalar addition of yz with their shares of (yb + az + c). Each party now holds a share of (yz + yb + az + c) = x1 × x2, and the protocol is complete.
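
Here is the whole multiplication sketched end to end in C. The toy randomness and the in-process "dealer" generating the triple are assumptions for illustration; in a real protocol, each value exists only as shares spread across three machines:

#include <stdio.h>
#include <stdlib.h>

#define P 43
long mod_p(long x) { return ((x % P) + P) % P; }

/* Split a secret into three additive shares (toy randomness). */
void split(long secret, long out[3]) {
    out[1] = rand() % P;
    out[2] = rand() % P;
    out[0] = mod_p(secret - out[1] - out[2]);
}

long reconstruct(const long s[3]) { return mod_p(s[0] + s[1] + s[2]); }

int main(void) {
    long x1 = 11, x2 = 12;               /* secrets to multiply */
    long a = 5, b = 7, c = mod_p(a * b); /* Beaver Triple: c = ab */

    long sx1[3], sx2[3], sa[3], sb[3], sc[3];
    split(x1, sx1); split(x2, sx2);
    split(a, sa);   split(b, sb);   split(c, sc);

    /* Each party locally computes its shares of y = x1 - a and z = x2 - b... */
    long sy[3], sz[3];
    for (int j = 0; j < 3; j++) {
        sy[j] = mod_p(sx1[j] - sa[j]);
        sz[j] = mod_p(sx2[j] - sb[j]);
    }
    /* ...which are broadcast, so y and z become public values. */
    long y = reconstruct(sy), z = reconstruct(sz);

    /* Shares of x1*x2 = yz + yb + az + c, built from scalar operations and
       secure addition. Only Party 1 (j == 0) adds the public constant yz. */
    long sprod[3];
    for (int j = 0; j < 3; j++)
        sprod[j] = mod_p((j == 0 ? y * z : 0) + y * sb[j] + z * sa[j] + sc[j]);

    printf("x1*x2 = %ld (expected %ld)\n",
           reconstruct(sprod), mod_p(x1 * x2));  /* both print 3 */
    return 0;
}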

Optimization

A major bottleneck in MPC is the communication between parties due to multiplication, which cannot typically be done in parallel. To optimize, the choices are to either reduce the communication exchanges and/or reduce the number of multiplications needed for a function. My work focused on the latter.

Integer comparison is difficult for arithmetic circuits, and if the transformation from function to circuit isn't clever enough, comparisons can be very inefficient. In fact, every arithmetic circuit can be represented as a polynomial in its input values.

Let’s consider the polynomial representing f(x1, x2) = x1 < x2, where the number of zeros is half the input space and thus proportional to the field size. Since this function is not constant zero, the degree of its polynomial is proportional to the number of zeros, so the arithmetic circuit computing it has as many multiplication gates as the size of the field. Incredibly inefficient!

But we can actually obtain an exponentially smaller arithmetic circuit by breaking everything down into building blocks, using the methodologies from Octavian Catrina and Sebastiaan de Hoogh’s paper. The major building block needed is an efficient protocol for truncation, which they develop via a modular remainder protocol built from other sub-protocols. What’s important is that instead of circuits proportional to the size of the field, these protocols let us use circuits proportional to the bit-size of the field, making them exponentially smaller.

Results

I chose to implement the MPC protocol from Araki et al., including their methods for secret sharing, addition, and multiplication. I also developed the specialized secure comparison and truncation protocols, as well as a dot product protocol. Combining complex protocols with addition and multiplication in this way is known as the hybrid model of secure computation. The hybrid model gives us really compact, intuitive circuits that can even be written by hand (see Figures 2 and 3)!

I applied this MPC protocol to both perceptron and SVM algorithms, which required converting them into the building block operations. Each algorithm can be manually synthesized into a compact, hybrid circuit using the complex building blocks of comparison and dot product operations. Specifically, one iteration of each algorithm requires only around 15 gates. For comparison, synthesizing AES with a synthesis tool results in thousands of addition and multiplication gates.

Figure 2: (left) Pseudocode for the perceptron algorithm. (right) Hybrid circuit representing one iteration of the perceptron algorithm.
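
For reference, here is the cleartext version of one perceptron iteration, sketched in C with a toy two-dimensional input (in the hybrid circuit, the dot product, the comparison, and the update all run as secure sub-protocols over shares):

#include <stdio.h>

#define DIM 2  /* toy feature dimension */

/* One perceptron update in the clear: if the example is misclassified,
   nudge the weights toward it. */
void perceptron_step(double w[DIM], double *bias,
                     const double x[DIM], double y /* label: +1 or -1 */) {
    double dot = *bias;                 /* dot product building block */
    for (int i = 0; i < DIM; i++)
        dot += w[i] * x[i];

    if (y * dot <= 0) {                 /* comparison building block */
        for (int i = 0; i < DIM; i++)
            w[i] += y * x[i];           /* scalar multiply-and-add */
        *bias += y;
    }
}

int main(void) {
    double w[DIM] = {0, 0}, bias = 0;
    double x[DIM] = {1.0, 2.0};
    perceptron_step(w, &bias, x, 1.0);
    printf("w = (%g, %g), bias = %g\n", w[0], w[1], bias);
    return 0;
}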

To avoid massive circuits proportional to the size of the dataset, I only synthesized circuits for one iteration of each algorithm. This avoided having to deal with loops and yielded a very compact circuit that can be reused for any dataset. Lastly, using one iteration gives flexibility in how to handle the convergence of our protocol. This approach provides three options:

  • the parties agree on a fixed number of iterations beforehand,
  • the parties reveal an epsilon value publicly, or
  • the parties compute a convergence check as another MPC.

Figure 3: (left) Pseudocode for the support vector machine (using partial gradient descent). (right) Hybrid circuit representing one iteration of a support vector machine.

Once the hybrid circuits were manually synthesized, I ran the algorithms on test datasets and achieved arbitrary fixed-point precision. The classification accuracy was the same as that of the raw algorithms. This tool can therefore be used to train either perceptrons or SVMs on secret data with arbitrary precision.

Concluding Thoughts

I spent the summer tackling several difficulties related to MPC with machine learning. By the end of the summer, I had an efficient solution to each of these problems and was able to run two different machine learning algorithms securely across three parties. I enjoyed working on this project. Before interning at Trail of Bits, I had just completed the second year of my Ph.D. I initially thought the transition from school to industry would be drastic. But I quickly noticed that in this internship, my project would be very similar to Ph.D. research, which is one of the many things that makes Trail of Bits such a great place to work.

TSC Frequency For All: Better Profiling and Benchmarking

Have you ever tried using LLVM’s X-Ray profiling tools to make some flame graphs, but gotten obscure errors like:

==65892==Unable to determine CPU frequency for TSC accounting.

==65892==Unable to determine CPU frequency.

Or worse, have you profiled every function in an application, only to find the sum of all function runtimes accounted for ~15 minutes of a 20-minute run? Where did those five minutes go!?

Well, we’ve run into both situations, so we built a solution in the form of a Linux kernel module called tsc_freq_khz. We’ll take a quick but deep dive into x86 timers to explain the problem, but first, let’s get to the goods. The tsc_freq_khz module enhances the performance of profiling and benchmarking tools like X-Ray in virtualized environments. No more “Unable to determine CPU frequency” errors!

The tsc_freq_khz module also makes these tools more accurate on newer Intel processors. X-Ray uses the x86 Time Stamp Counter (TSC), a low-latency monotonically increasing clock, to measure event duration, and assumes that TSC frequency is equivalent to maximum clock speed. This assumption is wrong on newer CPUs, and leads to improper time measurement. The tsc_freq_khz module provides a way to read the actual TSC frequency. No more missing minutes in your profiling data!

The tsc_freq_khz module works by exporting the Linux kernel’s internal value of the x86 Time Stamp Counter (TSC) frequency, tsc_khz, to userspace via sysfs. Programs can query this value by reading /sys/devices/system/cpu/cpu0/tsc_freq_khz.

Several open-source projects, like X-Ray and Abseil, already check for the presence of tsc_freq_khz, but until now, that file was only present on Google’s production kernels. The tsc_freq_khz module enables TSC frequency export to userspace for everyone else.

The trouble with timestamps

Before we explain what we did and why this works, let’s do a quick introduction to timestamps on the x86 architecture.

An x86 machine has at least six different ways to measure time: the TSC, the HPET, the APIC timer, the PIT, the RTC, and the ACPI PM timer.

Each method has unique and subtle flaws that make it completely unworkable for certain applications. This cornucopia of timers is what happens when you maintain 30 years of backwards compatibility and never remove a feature because someone, somewhere, is depending on it.

Despite its many flaws, the TSC is useful for benchmarking and profiling. It has extremely low latency, because the circuitry is directly on the CPU, and it is accessible directly from user-mode applications. Very useful profiling tools, like X-Ray, rely on the TSC for accurate measurements.

However, TSC measures time in ticks, which are not comparable across different processors and are therefore largely meaningless. For profiling and benchmarking, we want to measure time in comparable units, like nanoseconds.

From ticks to nanoseconds

So how many nanoseconds are in a tick? The basic formula for converting some duration of ticks to nanoseconds is simple:

time_in_nanoseconds = (tsc_count_end - tsc_count_start) * 1e9 / tsc_frequency_hz

Unfortunately, determining TSC frequency is difficult, and in some cases, e.g., cloud-based or other virtualized environments, impossible. Until now.

The core issue is that the Linux kernel does not provide a way for applications to know the TSC frequency, although the Linux perf utility does attempt to calculate TSC frequency via Intel PT. There are reasonable arguments for not exposing the value directly, because TSC frequency is fairly obscure, and there are cases where the value is completely meaningless. Until recently, the processor’s maximum clock speed was also an accurate approximation of TSC frequency.

However, this is no longer true. Using maximum clock speed as the TSC frequency will give the wrong results on newer Intel CPUs. This is the cause of the missing minutes when profiling.

Additionally, the maximum clock speed, accessible in Linux via /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_max_freq, is not available in cloud-based or other virtualized environments. This is the cause of “Unable to determine CPU Frequency” errors.

cpuinfo_max_freq is populated by the cpufreq driver used for frequency scaling—that is, making the CPU run faster or slower depending on power-saving settings. Naturally, frequency scaling is not typically permitted in virtualized environments, because each physical CPU is shared with multiple virtual tenants. Hence, this value is not present.

A hint from Google

Timestamp measurement code in X-Ray refers to a mystery sysfs entry named /sys/devices/system/cpu/cpu0/tsc_freq_khz, which sounds like it provides exactly what we need: TSC frequency in kilohertz. Unfortunately, there are absolutely no references to that file in the Linux kernel source code. What’s going on?

More searching reveals the following hint in the Abseil source code:

  // Google's production kernel has a patch to export the TSC
  // frequency through sysfs. If the kernel is exporting the TSC
  // frequency use that. There are issues where cpuinfo_max_freq
  // cannot be relied on because the BIOS may be exporting an invalid
  // p-state (on x86) or p-states may be used to put the processor in
  // a new mode (turbo mode). Essentially, those frequencies cannot
  // always be relied upon. The same reasons apply to /proc/cpuinfo as
  // well.
  if (ReadLongFromFile("/sys/devices/system/cpu/cpu0/tsc_freq_khz", &freq)) {
    return freq * 1e3;  // Value is kHz.
  }

Jackpot! The comment tells us that:

  • Google runs a custom Linux kernel in production.
  • Google is aware that cpuinfo_max_freq should not be used for benchmarking.
  • The kernel’s calculation of TSC frequency is sane.
  • Google’s internal kernels have a patch to export TSC frequency via sysfs, but the patch has not been upstreamed to the main kernel tree.

TSC frequency for all!

Why should only Google kernels have access to the TSC frequency? Since the Linux kernel is open source, we can write our own kernel module that does the same thing. Fortunately, the Linux kernel already computes the TSC frequency during boot and stores it in the tsc_khz variable. All we had to do was export the variable via sysfs.

Our kernel module, tsc_freq_khz, does exactly this. It creates a sysfs entry that reads the tsc_khz variable defined by the kernel and exports it via /sys/devices/system/cpu/cpu0/tsc_freq_khz. The module is extremely simple but also quite useful.
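
For a sense of the mechanics, here is a minimal sketch of the idea (not the actual module; see the GitHub repository for the real, tested code). It assumes the kernel exports the tsc_khz symbol, which mainline x86 kernels do:

#include <linux/module.h>
#include <linux/cpu.h>
#include <linux/device.h>
#include <asm/tsc.h>  /* tsc_khz, computed by the kernel at boot */

/* sysfs "show" callback: print the kernel's TSC frequency in kHz. */
static ssize_t tsc_freq_khz_show(struct device *dev,
                                 struct device_attribute *attr, char *buf)
{
        return sprintf(buf, "%u\n", tsc_khz);
}
static DEVICE_ATTR_RO(tsc_freq_khz);

static int __init tsc_freq_khz_init(void)
{
        /* Attach the attribute to cpu0, so it appears at
           /sys/devices/system/cpu/cpu0/tsc_freq_khz */
        return device_create_file(get_cpu_device(0), &dev_attr_tsc_freq_khz);
}

static void __exit tsc_freq_khz_exit(void)
{
        device_remove_file(get_cpu_device(0), &dev_attr_tsc_freq_khz);
}

module_init(tsc_freq_khz_init);
module_exit(tsc_freq_khz_exit);
MODULE_LICENSE("GPL");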

Follow the build instructions on GitHub and test the module by inserting it into your kernel:

$ sudo insmod ./tsc_freq_khz.ko
$ dmesg | grep tsc_freq_khz
[14045.345025] tsc_freq_khz: starting driver
[14045.345026] tsc_freq_khz: registering with sysfs
[14045.345028] tsc_freq_khz: successfully registered

The file at /sys/devices/system/cpu/cpu0/tsc_freq_khz should now be populated. (The values on your system will be different.)

$ cat /sys/devices/system/cpu/cpu0/tsc_freq_khz
2712020
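
Programs can then consume the value directly. Here is a hedged sketch in C of the tick-to-nanosecond conversion from earlier, assuming an x86-64 compiler that provides __rdtsc() and keeping error handling minimal:

#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>
#include <x86intrin.h>  /* __rdtsc() */

/* Read the kernel's TSC frequency (in kHz) from sysfs; returns 0 on failure. */
static uint64_t read_tsc_khz(void) {
    uint64_t khz = 0;
    FILE *f = fopen("/sys/devices/system/cpu/cpu0/tsc_freq_khz", "r");
    if (f) {
        if (fscanf(f, "%" SCNu64, &khz) != 1)
            khz = 0;
        fclose(f);
    }
    return khz;
}

int main(void) {
    uint64_t khz = read_tsc_khz();
    if (khz == 0) {
        fprintf(stderr, "tsc_freq_khz not available; is the module loaded?\n");
        return 1;
    }

    uint64_t start = __rdtsc();
    /* ... the code being measured ... */
    uint64_t end = __rdtsc();

    /* khz * 1000 ticks per second, so ns = ticks * 1e6 / khz. */
    uint64_t ns = (end - start) * 1000000ULL / khz;
    printf("elapsed: %" PRIu64 " ns\n", ns);
    return 0;
}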

Warning: Please do not use this code in a real production system. While it shouldn’t cause problems, it does make some assumptions, e.g., that CPU0 exists. There are also no sanity checks to warn if the TSC value is nonsensical or unreliable.

Conclusion

Seemingly simple problems like this one can lead us down a fascinating rabbit hole. In time, Google’s internal patches may make it into the Linux kernel, or kernel developers may create their own official patch to export TSC frequency to userspace. Meanwhile, tsc_freq_khz can be a stopgap for those who want accurate measurements with LLVM’s excellent X-Ray profiling tools.

Enjoyed reading this? Well, we thrive on interesting issues and challenging problems. We’d love to work with you to solve your security or software engineering challenges — please contact us.

Tethered jailbreaks are back

Earlier today, a new iPhone Boot ROM exploit, checkm8 (or Apollo or Moonshine), was published on GitHub by axi0mX, affecting the iPhone 4S through the iPhone X. The vulnerability was patched in devices with A12 and A13 CPUs. As of this writing, the iPhone XS, XS Max, XR, 11, 11 Pro and 11 Pro Max are all safe from this exploit.

We strongly urge all journalists, activists, and politicians to upgrade to an iPhone that was released in the past two years with an A12 or higher CPU. All other devices, including models that are still sold, like the iPhone 8, are vulnerable to this exploit. Regardless of your device, we also recommend an alphanumeric passcode rather than a 6-digit numeric passcode. A strong alphanumeric passcode will protect the data on your phone from this and similar attacks.

It’s been a long time since the release of a public Boot ROM exploit. The last one was discovered in 2010 by George Hotz (geohot), and it affected the iPhone 3GS and iPhone 4.

What was released

checkm8 exploits the Boot ROM to allow anyone with physical control of a phone to run arbitrary code. The Boot ROM, also called the Secure ROM, is the first code that executes when an iPhone is powered on and cannot be changed, because it’s “burned in” to the iPhone’s hardware. The Boot ROM initializes the system and eventually passes control to the kernel. It’s the root of trust for the trusted boot chain of iOS and verifies the integrity of the next stage of the boot process before passing execution control.


The Boot ROM is the root of trust of the secure boot chain (source: Apple iOS Security Guide)

checkm8 does not include the code required to boot a jailbroken kernel, but it does include the first steps. Likely, the community will soon release a full tethered jailbreak that will slowly evolve to support all affected devices and versions.

The exploit also includes the ability to enable debugging features like JTAG on the iPhone CPU—a big win for security researchers and jailbreakers. Apple refers to this as “demoting” the phone in their 2016 BlackHat presentation. This is probably not what Apple intended when they said they were releasing “research devices.”

Impact and use cases

The arbitrary (i.e., not signed by Apple) code that can be executed during the boot process can be used to boot already jailbroken kernels, as was done in ~2011 with redsn0w. This early access also provides access to the AES engine, enabling decryption with the GID key of Apple-encrypted firmware like iBoot.

In 2013, we used the previous Boot ROM exploit to get access to user data by brute-forcing passcodes. In the years since, Apple introduced the Secure Enclave (SEP), a separate processor that manages encryption keys for user data. The SEP has been hardened against passcode brute-forcing using replay counters and backoff timers. It’s currently unclear how much access this new exploit provides to the SEP, so we can’t accurately judge the impact on device privacy. In Apple’s 2016 presentation, they point out that demoting the device forces the SEP to change the UID key, which is entangled with the passcode. Changing the UID key permanently protects the user data already on the device by making it impossible to decrypt.

When designing the SEP, Apple’s threat model included “adversarial” situations such as another Boot ROM exploit.


It would not be surprising if the vulnerability exploited by checkm8 has been used by Cellebrite for forensic analysis, but they would need another trick or exploit to undermine the SEP’s protections.

Future

The vulnerability has been fixed on new hardware (iPhone XS, iPhone XR, iPhone 11), but Apple still sells hardware vulnerable to this exploit (iPhone 8 and also some iPads and iPods). They would need to create new CPU masks and release updated versions of these devices to secure them. This is unlikely, since Apple did not release a patched iPhone 4 when the previous Boot ROM exploit was released.

It is likely we will see one large community project that will apply common patches and install Cydia and other alternate App Stores. This is a boon to researchers but also to pirates.

Advice for at-risk users

Upgrade to an iPhone that was released in the past two years with an A12 or higher CPU. All other devices are vulnerable to this exploit. After upgrading, wipe your previous phone by selecting “Erase All Content and Settings” from Settings > General > Reset. Devices patched for this issue include:

  • iPhone 11, 11 Pro, 11 Pro Max
  • iPhone XS, XS Max
  • iPhone XR

Set an alphanumeric passcode, rather than a 6-digit numeric passcode. Even with this exploit, an attacker must brute-force your password to gain access to your data. A strong alphanumeric passcode will protect your data against these attacks. To configure an alphanumeric passcode:

  1. On iPhone X and later, go to Settings > Face ID & Passcode.
    On earlier iPhone models, go to Touch ID & Passcode. On devices without Touch ID, go to Settings > Passcode.
  2. Tap Change passcode.
  3. Enter your current passcode.
  4. When prompted to enter a new passcode, tap Passcode Options and select “Custom Alphanumeric Code.”

Detecting these jailbreaks

It will still be possible to detect jailbroken phones in the common case, but it is not possible to detect if an iPhone has been exploited with checkm8. Jailbreak detection will have to continue to rely on identifying side-effects of the exploitation. iVerify Core continues to offer a comprehensive jailbreak detection suite, and we will continue to update it as new jailbreaks are released. Contact us to learn more about iVerify Core.

QueryCon 2019: A Turning Point for osquery

Has it really been 3 months since Trail of Bits hosted QueryCon? We’ve had such a busy and productive summer that we nearly forgot to go back and reflect on the success of this event!

On June 20-21, Trail of Bits partnered with Kolide and Carbon Black to host the 2nd annual QueryCon at the Convene Old Slip Convention Center in downtown New York. We beat last year’s attendance with 150 attendees from around the globe. The 14 speakers presented talks on osquery ranging from technical presentations on Linux security event monitoring to discussions of end-user research. We saw familiar faces from last year’s event in San Francisco, and we met many new teams interested in osquery.


Tania McCormack of Carbon Black presented her user research on introducing osquery to new audiences.

Last year’s inaugural QueryCon brought us all together in person for the first time. QueryCon 2019 strengthened our sense of community and proved a catalyst for positive change: Our productive collaboration generated community-based and technical changes that have put this project back on track.

A new foundation

On June 18th, the day before QueryCon, the Linux Foundation officially announced that they would be taking over ownership of osquery from Facebook. Under the Linux Foundation, the new osquery Foundation will be directed by a Technical Steering Committee (TSC) consisting of engineers and developers from Facebook, Trail of Bits, Google, and Kolide—companies that are using osquery and have committed to supporting the project. The TSC members are:

  • Teddy Reed (Facebook)
  • Alessandro Gario (Trail of Bits)
  • Zachary Wasserman (independent consultant)
  • Victor Vrantchan (from Google, but working independently)
  • Joseph Sokol-Margolis (Kolide)

This change was exciting news to a growing list of companies who rely on osquery for endpoint protection. As we reported in April, osquery outgrew its original home as a Facebook project, and its community’s expectations and needs now exceed what Facebook can be expected to manage on its own. A new community-based governance model was needed, and conference attendees were eager to discuss the change. We hosted a panel discussion with Facebook’s lead osquery maintainer, Teddy Reed, and representatives from the new osquery TSC.

How the foundation will work

The Linux Foundation functions as a steward for osquery, providing various funding and management platforms. (Learn more about their stewardship model here.) The new osquery TSC will guide and maintain the project with the help of contributions from the greater community, and Trail of Bits will commit to biweekly office hours for public comment and transparent project governance.

Meanwhile, Facebook will turn over credentials and control of funding, infrastructure, hosting, and engineer review to a new committee of maintainers (of which Facebook will remain a member). The organizations on the TSC are contributing significant engineering time to establish build and release processes, and a forthcoming funding platform on CommunityBridge will allow sponsorship.

Technical decisions

The TSC has a significant backlog of contributions to work through, but we’re already seeing a massive acceleration of activity on the project.

First, osquery core will be updated to feature parity with osql, the community-oriented osquery fork by Trail of Bits. The initial goal is a monthly release, with alternating “developer” and “stable” releases. Another big priority is to merge all major independent efforts and private forks into a single canonical osquery that everyone can benefit from.

Once Trail of Bits resolves the technical debt that has accrued on the project—build toolchains, dependency management, CI systems—it will maintain these components and focus on client-driven engineering requests for osquery. Other stakeholders are also contributing a backlog of Pull Requests, which will be prioritized and merged as soon as possible.

A proliferation of committed PRs

One way to track the health and activity of a project on GitHub is by Pull Requests. Over nine months, from September 2018 to the day before QueryCon, there were roughly 35 PRs merged to the osquery project, with only a few from the community outside Facebook. In just the 12 weeks since QueryCon, nearly 90 PRs were successfully merged (representing about 113 commits). More importantly, the majority of those contributions were from outside Facebook.

Trail of Bits alone is responsible for approximately 44 of the PRs merged this summer.

Some highlights from our recent contributions:

  • #5604 and #5610: The new osquery Foundation version of osquery was kicked off by merging the Facebook and Trail of Bits versions of osquery. This meant we restored CMake build support and set up the public CI, which were key improvements brought over from the osql fork.
  • #5706: We refactored the build scripts so that all of osquery’s third-party library dependencies will build from source on Linux. This eliminates the need for the project to host and distribute the correct versions of pre-built binaries for these libraries (a job that previously relied on Facebook); improves compatibility across different varieties of Linux; is a prerequisite for our goals of reproducible builds and offline builds; and, finally, avoids incompatibilities arising from system-installed versions of these dependencies.
  • #5696, #5697, and #5640: We fixed and vastly improved the table for querying the certificates stores on Windows. It is now possible to use osquery to list all per-user certificates, whether they are in a registry hive or on the filesystem, and whether or not those users are even logged in at the time of the query. Searching your fleet for anomalous certificates is an important security monitoring and incident response capability, as it can be an indicator of compromise.
  • #5665: We fixed several bugs that we found with the use of dynamic analysis (fuzzing and Clang’s Address Sanitizer). Soon, we plan to incorporate both static and dynamic analysis passes into the CI system, for proactive detection of code vulnerabilities in osquery. This is a best practice that Trail of Bits recommends for all of our clients, and we’re happy to contribute it to the security of osquery.

A new stable release


During a community workshop at the end of the conference, osquery users and TSC members discussed the best path to the next stable release.

Prior to QueryCon 2019, the most recent major cross-platform release was from August 2018. Seven days after the conference, Trail of Bits’ Alessandro Gario provided a pre-release of the new version of osquery. For the previous nine months, Facebook had been refactoring osquery around Buck, a build system created and used by Facebook that had long been problematic for the greater community. Our pre-release restored CMake support, CI, and packaging, along with a few fixes not related to the build system.

Now the first full stable release of osquery is out! It represents a significant effort to improve the build system for the future of osquery, ensuring that:

  • Building osquery from source will no longer rely on Facebook to maintain and host the downloads for pre-built dependencies
  • The osquery project once again has a public-facing Continuous Integration server, automatically testing all proposed contributions to detect any developer mistakes or regressions
  • All contributors can use their preferred build tools: developers inside Facebook can use their build tool, Buck, and developers in the greater community can use the standard build tool, CMake
  • An all-new custom build toolchain for Linux will enable broader Linux support, and eventually reproducible builds

New features for osquery users:

  • The process_events table detects more kinds of events, on Linux
  • More powerful query syntax: osquery now supports a regex_match function to allow searches over a particular column of a given table
  • Initial support for eBPF-based event tracing, on Linux
  • New macOS tables for detecting T1/T2 chips, querying the list of running apps, and listing the installed Atom packages on macOS or Linux
  • The certificates, logged_in_users, and logical_drives tables on Windows are all greatly improved
  • Initial implementation of a new high-performance eventing framework that will enable more types of event-based monitoring
  • Improved ability to profile and benchmark tables’ performance
  • New detections added to the macOS query pack

But wait, there’s more! Dozens of bugs have been squashed, additional security hardening mitigations have been turned on, certain performance cases have been improved and resource leaks plugged, the documentation has been updated…we could go on and on. For a full list of changes in this release, refer to the comprehensive change notes.

QueryCon and beyond


The hosts of QueryCon 2019 posed for a group shot!

We had so much fun hosting QueryCon this year and we want to thank everyone who attended. This event was a catalyst for positive change in our community thanks to the thoughts, discussions, and passion of this year’s attendees. We can’t wait to see how osquery improves now that its development has been unlocked.

What’s next for osquery? We want you to tell us! If you’re using osquery in your organization, let’s talk about what features and fixes should be next. Thanks to a revolutionary meeting of the minds, we now have the power to make it happen.

Crypto 2019 Takeaways

This year’s IACR Crypto conference was an excellent blend of far-out theory and down-to-earth pragmatism. A major theme throughout the conference was the huge importance of getting basic cryptographic primitives right. Systems ranging from TLS servers and bitcoin wallets to state-of-the-art secure multiparty computation protocols were broken when one small sub-component was either chosen poorly or misused. People need to stop using RSA, drop AES-CBC, and make sure they’re generating randomness in a cryptographically secure way.

It wasn’t all attacks and bad news, though. The ascendance of cryptographic tools for privacy-preserving computation continues apace. Zero-knowledge proofs, secure multiparty computation, and secure messaging systems all had major breakthroughs this year. These areas of cryptography have become efficient enough to transcend purely theoretical interest: Companies large and small, from niche blockchain startups to tech giants like Facebook, are deploying cutting-edge cryptographic protocols. Even the city of Boston has used secure multiparty computation in a study on pay equity.

The papers presented at Crypto this year were remarkable. From this impressive pool, we’ve highlighted those we believe will have a substantial impact outside of academia in the near future.

Attacks

Breaking OCB2

The Best Paper award went to Cryptanalysis of OCB2: Attacks on Authenticity and Confidentiality. This amalgamation of three papers released last fall demonstrates attacks against OCB2, an ISO-standard authenticated encryption scheme, that forge authentication tags on arbitrary messages and achieve full plaintext recovery. This result is especially shocking because all three versions of OCB have been standardized and were thought to have airtight security proofs. Fortunately, OCB1 and OCB3 are not vulnerable to this attack, because it relies on details specific to how OCB2 applies the XEX* mode of operation.

ECDSA Nonce Bias and Reuse

A well-known weakness in the ECDSA signature scheme is that nonces must be generated uniformly at random; otherwise, an attacker can recover the signer’s private key. In Biased Nonce Sense, Breitner and Heninger demonstrate the real-world practicality of these attacks by scraping the Bitcoin and Ethereum blockchains in search of either duplicate or slightly biased nonces. These attacks allowed them to recover the private keys for over 300 Bitcoin accounts and several dozen Ethereum accounts. In most instances, these accounts had a nonzero balance, which indicates that active users of these blockchains are using libraries with bad nonce generation. This result confirms the need to move towards deterministic nonce generation in ECDSA, as specified in RFC 6979—or, even better, to stop using ECDSA entirely and use Ed25519 instead.
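
To see why nonce reuse is fatal, recall that an ECDSA signature satisfies s = k^-1 (h + r·d) mod n, where k is the nonce, h is the message hash, and d is the private key. If the same k is used to sign two hashes h1 and h2, both signatures share the same r, so an attacker can solve

k = (h1 − h2) · (s1 − s2)^-1 mod n
d = (s1 · k − h1) · r^-1 mod n

and recover the private key from just two signatures. (That is the textbook derivation; the paper's contribution is extending the attack to merely biased, rather than repeated, nonces using lattice techniques.)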

Automating TLS Padding Oracle Attack Discovery and Implementation

Developing tools for vulnerability detection proved to be a popular theme this year. One group of researchers developed a tool that automatically scans websites for CBC padding oracle vulnerabilities in their TLS protocol. They found that roughly 1.83% of the Alexa Top Million websites were vulnerable. Matthew Green’s team at Johns Hopkins used SAT and SMT solvers to automate the development of new padding oracle attacks (they actually consider a broader class of attacks called format oracles), which eliminates the laborious task of discovering and writing such attacks by hand. Their tool was able to rediscover common examples of such attacks, including Bleichenbacher’s attack on PKCS#1 v1.5.

Secure Computation

Efficient Zero-Knowledge Proofs

This year was huge for pushing zero-knowledge proofs further into the realm of practicality. Among the most impressive results was the development of the Libra scheme (no relation to the Facebook cryptocurrency). Libra is notable for several reasons. First, it has a pre-processing phase that is only linear in the witness size, not linear in the statement size like SNARKs. Second, it has a prover that runs in linear time with respect to the computation being run in zero-knowledge.

[Table: comparison of zero-knowledge proof systems, from the Libra paper]

Only Bulletproofs achieve the same asymptotic efficiency, although they run much slower in practice because they require the prover to perform many expensive cryptographic operations. On the other hand, Bulletproofs have no trusted setup phase and make somewhat more standard cryptographic assumptions. The table above, taken from the Libra paper, shows a comprehensive overview of the most performant zero-knowledge proofs.

Breaking Secure Multiparty Computation

Over in the secure multiparty computation world, Jonathan Katz delivered a keynote talk about a devastating class of vulnerabilities that affects nearly all MPC implementations. The fundamental issue is that these protocols are extremely complex and often leave low-level details up to the implementer. Since MPC protocols are extremely resource intensive, practical implementations often apply optimizations in a haphazard way.

Here’s some background: MPC protocols require the computation of many hash functions, and even relatively simple functions like RSA encryption require the computation of tens of millions of hashes in this context. While we often think of SHA3 as rather fast, in extreme settings like this it’s actually quite slow and is one of the main bottlenecks of the protocol. This led to researchers using fixed-key AES instead, since it’s roughly 50 times faster to compute than SHA3. Originally, this optimization was placed on firm cryptographic ground in the JustGarble system. However, the security proof does not extend to many modern systems. In fact, Katz et al. showed that using fixed-key AES in most widespread protocols completely undermines the privacy guarantees of MPC. However, they also showed that a simple modification to fixed-key AES was secure and equally performant.

This attack highlights the recklessness of rushing to deploy cutting-edge cryptography. These protocols are often extremely slow and complex, and few people understand the subtle details of the security proof. More work must be done to quantify the concrete security of these protocols as they are actually instantiated, not just asymptotically using idealized functionalities.

Content Moderation and Signatures

Metadata-Private Message Franking

A fundamental problem in end-to-end messaging systems is how content moderation should be handled. Facebook has been particularly concerned with this issue due to the widespread use of WhatsApp and Messenger, so they developed something called a message franking system. This system allows users to report abusive content without allowing Facebook to see the content of all users’ messages. Message franking provides the following security guarantees:

  • Message privacy: The platform should only learn messages that are reported
  • Accountability: The moderator should always be able to verify that the alleged sender actually sent the reported message
  • Deniability: Only the moderator should be able to verify the reported message

Unfortunately, prior work on message franking is not metadata private—the moderator is able to see the sender and recipient of every message, even those that aren’t reported. This has been remedied in a new scheme that extends the functionality of message franking to private end-to-end encryption protocols using zero-knowledge proofs. Unlike the zero-knowledge proofs discussed above, the ones used in this franking scheme are extremely efficient and produce signatures that are only around 500 bytes.

Ring Signatures With Deniability

Another popular signature-type primitive that has been especially useful in blockchain systems is the ring signature. Ring signatures allow one member of a ring (i.e., a designated group of people) to anonymously sign messages on behalf of the entire group. For example, a ring could be a workers’ union, and a ring signature would allow someone from the union to anonymously make a complaint while verifying that they’re actually part of the union. It would be beneficial in some ring signature use cases if individuals could claim responsibility for signing the message. Conversely, it may also be useful for members of a ring to prove that they did not sign a given message. Park and Sealfon formalized these security properties and developed new ring signature constructions that satisfy them.

Post-Quantum Cryptography

As the NIST post-quantum cryptography standardization effort enters its second phase, it’s essential to understand the concrete security of proposed cryptosystems. Quantum cryptanalysis in the RAM model: Claw-finding attacks on SIKE, one of the two papers chosen for the Best Young Researcher award, develops a new model for thinking about post-quantum security and applies that model to the SIKE cryptosystem. SIKE is a key encapsulation mechanism that uses supersingular isogenies to achieve post-quantum security.

One big takeaway of the paper is that SIKE is more secure than previously thought. But the paper’s formulation of a new way to quantify post-quantum security may have a more enduring impact. This method involves thinking about quantum computers as a classical RAM machine controlling arrays of bits and qubits, then using this model to reason about time/memory tradeoffs in different attack strategies. As a result, the researchers determined that attacks previously thought to be especially potent against SIKE would require so much classical RAM that it would be more efficient to simply run the best known classical attack.

Looking Forward

Navigating the complex landscape of cryptographic protocols and parameter choices continues to be a key point of difficulty for developers. We need to agree as a community that developers should not be responsible for security-critical configuration choices, and move towards building libraries that are misuse resistant, such as Libsodium and Tink. The use of these libraries alone would have prevented many of the attacks on deployed systems we saw this year.

However, we realize that not all systems can realistically support a complete cryptography overhaul, and some are often stuck using a specific library or primitive for legacy reasons. While we saw lots of activity this year around automating vulnerability detection and exploit writing, we’d like to see the community emphasize bug-fixing tools as well. For example, we recently released a tool called Fennec that can rewrite functions in compiled binaries, and we used this tool to detect and repair the use of a static IV in AES-CBC without access to source.

On the theory side, we’d like to see a more stringent examination of the concrete security of cutting-edge protocols like zero-knowledge proofs and MPC. As we saw at Crypto this year, implementation details matter, and many of these complex systems are implemented in ways that either render them insecure or dramatically reduce their security level. This is made worse by the fact that many of these new protocols aren’t based on standard security assumptions like factoring or the discrete log problem. For example, many blockchain companies are rushing to roll out verifiable delay functions, which rely on a very new and poorly understood property of imaginary quadratic number fields. We need more thorough analyses of assumptions like this and how they impact security.

Finally, NIST will select the third round of candidates for their post-quantum cryptography standardization program in 2020. To succeed, NIST will need to rigorously assess the security of the round two candidates. We need more work like the paper on SIKE, which helps us compare classical and quantum security in a more precise way.

DeepState Now Supports Ensemble Fuzzing

by Alan Cao, Francis Lewis High School, Queens, NY

We are proud to announce the integration of ensemble fuzzing into DeepState, our unit-testing framework powered by fuzzing and symbolic execution. Ensemble fuzzing allows testers to execute multiple fuzzers with varying heuristics in a single campaign, while maintaining an architecture for synchronizing generated input seeds across fuzzer queues.

This new ensemble fuzzer includes a new deepstate-ensembler tool and several notable fuzzer engines powered by a new and open-sourced DeepState front-end API/SDK. This SDK enables developers to integrate front-end executors for DeepState and provides seed synchronization support for our ensembler tool.

The Fallacy of Smart Fuzzers

Fuzzers are one of the most effective tools in a security researcher’s toolbox. In recent years, they have been widely studied to build better heuristics, or strategies that fuzzers use to explore programs. However, one thing remains clear: fuzzer heuristics rarely live up to their hype. When scaling up so-called smart fuzzers for real-world programs, their performance often falters, and we end up defaulting back to our “dumb” standard tooling, like AFL and LibFuzzer.

Since we have explored the topic of evaluating fuzzing research in the past, let’s take a tangent and instead explore the possibility of combining various fuzzer heuristics to maximize fuzzer performance, without giving up time in our testing workflow. This led to my summer internship project of integrating ensemble fuzzing into DeepState.

What is Ensemble Fuzzing?

The insight behind ensemble fuzzing is that, while certain heuristics work well in certain contexts, combining them should produce better results than any single fuzzer with a single strategy. This idea was first introduced in Chen et al.’s EnFuzz paper; however, there are no available open-source or commercial implementations.

Our implementation of an ensemble fuzzer follows the architecture implemented in EnFuzz, as seen below:

[Figure: the EnFuzz-style ensemble fuzzing architecture, with base fuzzers, local seed queues, and a shared global queue]

Given a set of pre-determined diverse base fuzzers with respective local seed queues, we can integrate a global-asynchronous and local-synchronous (GALS) seed synchronization mechanism that pulls interesting seeds from local queues to a shared global queue during an execution cycle. Therefore, as a base fuzzer’s heuristics fail to improve coverage or discover interesting input seeds, it can now pull other fuzzers’ seeds from this global queue in the next execution cycle. Furthermore, once the campaign is terminated, we can receive any fuzzing feedback from the ensembler regarding base fuzzer performance, crash triaging/deduplication, or any other post-processing statistics.

Using ensemble fuzzing and our already powerful unit-testing framework for fuzzing/symbolic execution, DeepState, we are able to approach the following problems during testing:

  • Fuzzer performance diversity – do different fuzzer heuristics contribute varying useful seeds, maximizing the potential for improving coverage and crashes discovered?
  • Fuzzer workflow – how can we do exhaustive fuzz testing and/or symbolic execution while simplifying our workflow?
  • Invariant consistency – do different fuzzers return different results, indicating that there might be a source of nondeterminism in our test?

Spinning up Front-ends

Since DeepState already supports Eclipser as a backend, we chose to first build a front-end API, where a developer can write a front-end wrapper for a fuzzer backend. This orchestrates the running fuzzer process, and performs compile-time instrumentation, pre- and post-processing, and seed synchronization. It also simplifies the fuzzing environment setup by unifying how we can construct tools while implementing functionality.

The snippet below shows an example of a front-end wrapper for AFL. It inherits from a base DeepStateFrontend class and includes methods that define fuzzer-related functionality.

[Figure: code listing of the AFL front-end wrapper]

Example front-end wrapper for AFL. One inherited method, pre_exec, allows the user to perform sanity checks before execution. Both environment properties (i.e. core dump pattern) and argument parsing are checked.

In order to build a front-end wrapper, we should have the following methods in our fuzzer object:

[Figure: the methods required in a fuzzer front-end object]

Each fuzzer also has its own ensemble method, which provides a specialized rsync call to push and pull seeds from a global and local queue directory:

[Figure: code listing of the ensemble method]

Ensemble method for seed synchronization. Each __sync_seeds() call invokes a specialized rsync command to transfer seeds between a local and global queue directory.

Once built, we can use a front-end wrapper like so:

# compile a DeepState harness with AFL instrumentation
$ deepstate-afl --compile_test MyDeepStateHarness.cpp --compiler_args="-Isomelib/include -lsomelib -lotherlib"

# execute the AFL fuzzer through DeepState
$ deepstate-afl -i seeds -o out ./out.afl

For a more detailed look into the fuzzer front-end API and how you can implement your own frontends, see this tutorial. DeepState has existing front-end executors for AFL, libFuzzer, Angora, Eclipser, and Honggfuzz.

https://asciinema.org/a/262023

Building the Ensembler

Using the unified API, we can now build an ensemble fuzzer that provisions front-end objects and executes fuzzers concurrently while maintaining seed synchronization.

To start, the ensembler takes a DeepState test harness as input and “ensemble compiles” multiple instrumented binaries to a workspace directory, using the compile() call exposed by each front-end object.


Provisioning a test workspace with binaries with the “ensemble compilation” strategy.

Once complete, each parallel fuzzer process is instantiated through run(). Since each front-end wrapper invokes rsync-style synchronization through ensemble(), the ensembler simply calls it from each front-end after a specified sync cycle (in seconds) to synchronize seeds.

This implementation is surprisingly simple, and was built with around 300 lines of code in Python. Here’s a quick demo running the ensembler on one of our test examples, Crash.cpp.

Fuzz ‘Em All!

Inspired by Google’s fuzzer-test-suite and the aforementioned work in fuzzer-performance testing, we decided that a DeepState-powered test suite, deepstate-test-suite, could support both A/B testing of fuzzer performance and actual bug-hunting. With our easy-to-use fuzzers and ensembler, let’s evaluate how they stand up against real-world test-case benchmarks!

Bignum vulnerabilities are an especially interesting class of bugs because their edge cases are much more difficult to discover, and are often only hit probabilistically. This makes them ideal targets for DeepState property tests.

We benchmarked our base and ensembler fuzzers for their performance reproducing an existing test case and a real-world bignum vulnerability—a carry mis-propagation bug in TweetNaCl. Following the evaluation methods in this fuzzing evaluation paper, the average time was measured for each test case using 10 crash instances from well-formed initial inputs:

[Table: average time to reproduce each crash, per base fuzzer and for the ensembler]

These results provide some interesting insights about fuzzer diversity. Smarter fuzzers like Angora and Eclipser work well on smaller, contained test cases like the Runlen example. However, their performance falters when scaled up to actual bug discovery in real-world software, like the TweetNaCl bug. The ensemble fuzzer’s performance shows it can scale well to both of these test cases.

What’s in the future for Ensemble Fuzzing?

Ensemble fuzzing is a powerful technique that scales to real-world software libraries and programs. Integrating an ensemble fuzzer into DeepState gives power to unit-testing with a simplified workflow, and it opens up the possibility for many other research and engineering efforts.

Based on our current benchmark results, we can’t definitively say that ensemble fuzzing is the best fuzzing strategy, but it’s worth noting that there are always elements of randomness and probabilistic behavior when evaluating fuzzers. Effective ensemble fuzzing may depend on base fuzzer selection: determining which fuzzers to invoke, and when, based on the type of target or bug class being analyzed.

Maybe our current ensemble of fuzzers works effectively on reproducing bignum vulnerabilities, but would they work just as well on other classes of bugs? Would it be even more effective if we invoke fuzzers in a specific order? These are the questions we can answer more accurately with more benchmark tests on diverse targets.

Thanks!

Being a security engineering intern at Trail of Bits has been a wonderful experience, as always. Working with awesome employees and interns has really advanced my understanding of security research and of how insightful academic research can be turned into working software, just like my previous work analyzing cryptographic primitives with Manticore. I’m especially excited to continue doing this at NYU, where I’ll be starting in the spring!

Rewriting Functions in Compiled Binaries

by Aditi Gupta, Carnegie Mellon University

As a summer intern at Trail of Bits, I’ve been working on building Fennec, a tool to automatically replace function calls in compiled binaries that’s built on top of McSema, a binary lifter developed by Trail of Bits.

The Problem

Let’s say you have a compiled binary, but you don’t have access to the original source code. Now, imagine you find something wrong with your program, or something you’d like to change. You could try to fix it directly in the binary—for example, by patching the file in a hex editor—but that becomes tedious very quickly. Instead, being able to write a C function and swap it in would massively speed up the process.

I spent my summer developing a tool that allows you to do so easily. Knowing the name of the function you want to replace, you can write another C function that you want to use instead, compile it, and feed it into Fennec, which will automatically create a new and improved binary.

A Cryptographic Example

To demonstrate what Fennec can do, let’s look at a fairly common cryptographic vulnerability that has shown up in the real world: the use of a static initialization vector (IV) in the CBC mode of AES encryption. In the very first step of CBC encryption, the plaintext block is XOR’ed with an IV, a 128-bit block (the same size as a cipher block) that is used a single time in any encryption to prevent repetition in ciphertexts. The XOR result is then encrypted, and the resulting ciphertext plays the role of the IV for the next plaintext block, which is XOR’ed with it before being encrypted in turn.

[Figure: CBC-mode encryption, chaining each plaintext block with the previous ciphertext block (or the IV)]

This process becomes insecure when the initialization vector is constant across plaintexts. Under a fixed IV, if every message begins with the same block of plaintext, every ciphertext will begin with the same block as well. In other words, a static IV can allow an attacker to analyze multiple ciphertexts as a group rather than as individual messages. Below is an example of an IV being generated statically.

unsigned char *generate_iv() {
  return (unsigned char *)"0123456789012345";
}
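
To see the repetition concretely, here’s a small sketch using the pyca/cryptography package (the key and message are made up for illustration):

from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

KEY = b"\x00" * 16            # illustrative key
IV = b"0123456789012345"      # the static IV from generate_iv() above

def encrypt_block(msg):
    enc = Cipher(algorithms.AES(KEY), modes.CBC(IV)).encryptor()
    return enc.update(msg) + enc.finalize()

# same key + same IV + same first plaintext block means an identical
# first ciphertext block, so repeated messages are visible to an attacker
assert encrypt_block(b"ATTACK AT DAWN!!") == encrypt_block(b"ATTACK AT DAWN!!")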

Sometimes, developers will use cryptography libraries like OpenSSL to do the actual encryption, but write their own functions to generate IVs. This can be dangerous, since non-random IVs can make AES insecure. I built Fennec to help fix issues like this one: it checks whether IV generation is random or static and, if necessary, replaces the function with one that generates a secure, random IV.

The Process

The end goal was to lift executable binaries to LLVM bitcode with McSema and combine that with some LLVM manipulation to replace any function automatically. As background, I began by studying my cryptographic example and exploring different ways of patching binaries.

My first step was to work through a couple of the Matasano Cryptopals Challenges to learn about AES and how it can be used and broken. This stage of the project gave me working encryption and decryption programs in C, both of which called OpenSSL, as well as a few Python scripts to attack my implementations. My encryption program used a static IV generation function, which I was hoping to replace automatically later.

I kept using these C binaries throughout the summer. Then, I started looking at binary patching. I spent some time looking into both LD_PRELOAD and the Witchcraft Compiler Collection, which would work if my IV generation function was dynamically linked into the program. The goal of my project, however, was to replace function calls within binaries, not just dynamically loaded functions.

I didn’t want to complicate everything with lifted bitcode yet, so I started with clean bitcode generated directly from source code. I wanted to run an LLVM pass on this bitcode to change the functionality of part of my program, namely the part that generated an IV.

I started by trying to change the function’s bitcode directly in my pass, but soon moved to writing a new function in C and making my original program call that function instead. Every call to the old function would be replaced with a call to my new function.

After some experiments, I created an LLVM pass that would replace all calls to my old function with calls to a new one. Before moving to lifted bitcode, I added code to make sure I would still be able to call the original function if I wanted to. In my cryptographic example, this meant being able to check whether the original function was generating a static IV and, if so, replace it with the code below, as opposed to assuming it was insecure and replacing it no matter what.

#include <stdlib.h>
#include <string.h>
#include <openssl/rand.h>

// a stub function that represents the function in the original binary;
// its contents do not matter, since it is swapped for the real function
unsigned char *generate_iv_original() {
  unsigned char *result = (unsigned char *)"";
  return result;
}

unsigned char *random_iv() {
  unsigned char *iv = malloc(16);  // 16 bytes: one AES block
  RAND_bytes(iv, 16);              // an OpenSSL call
  return iv;
}

unsigned char *replacement() {
  unsigned char *original = generate_iv_original();
  for (int i = 0; i < 10; i++) {
    unsigned char *iv = generate_iv_original();
    if (memcmp(iv, original, 16) == 0) {  // the IV is static
      return random_iv();
    }
  }
  return original;
}

With my tool working on clean bitcode, it was time to start looking at lifted bitcode. I familiarized myself with how McSema worked by lifting and recompiling binaries and looking through the intermediate representation. Because McSema changes the way functions are called, it took some extra work to make my tool behave on lifted bitcode as it had on clean bitcode: I had to lift both the original binary and the replacement with McSema, since a replacement function in a non-lifted binary doesn’t follow McSema’s calling conventions and can’t be swapped in trivially.

[Figure: The function replacement process on McSema-lifted bitcode]

Function names and types are more complex through McSema, but I eventually arrived at a working procedure. As with the clean-bitcode tool, the original function can be kept for use in the replacement.

The last step was to generalize my process and wrap everything into a command-line tool that others could use. I tested it on a variety of targets (including stripped binaries and dynamically loaded functions), added tests, and verified my installation process.

The Function Replacement Pass

The complete process consists of three primary steps: 1) lifting the binaries to bitcode with McSema, 2) using an LLVM pass to carry out the function replacement within the bitcode, and 3) recompiling a new binary. The LLVM pass is the core of this tool, as it actually replaces the functions. The pass works by iterating through each instruction in the program and checking whether it is a call to the function we want to replace. In the following code, each instruction is checked for calls to the function to replace.

for (auto &B : F) {
  for (auto &I : B) {
    // check if instruction is a call to the function to be replaced
    if (auto *op = dyn_cast<CallInst>(&I)) {
      auto function = op->getCalledFunction();
      if (function != NULL) {
        auto name = function->getName();
        if (name == OriginalFunction) {
...

Then, we find the replacement function by looking for a new function with the specified name and same type as the original.

Type *retType = function->getReturnType();
FunctionType *newFunctionType =
  FunctionType::get(retType, function->getFunctionType()->params(), false);

// create new function
newFunction = (Function *)(F.getParent()->getOrInsertFunction(ReplacementFunction, newFunctionType));

The next step is to pass the original function’s arguments to the new call.

CallSite CS(&I);

// get args to original function to be passed to replacement
std::vector<Value *> arguments;
for (unsigned int i = 0; i < CS.getNumArgOperands(); i++) {
  arguments.push_back(CS.getArgOperand(i));
}

// create a call to the replacement function, then redirect every use
// of the original call's result to the new call
auto *newCall = CallInst::Create(newFunction, arguments, "", &I);
for (auto &U : I.uses()) {
  User *user = U.getUser();
  user->setOperand(U.getOperandNo(), newCall);
}

The Complete Tool

Although the LLVM pass does the work of replacing a given function, it is wrapped with the other steps in a bash script that implements the full process. First, we disassemble and lift both input binaries using McSema.

[Figure: Lifts binaries with McSema]

Next, we analyze and tweak the bitcode to find the names of the functions as McSema represents them. This section of code includes support for both dynamically-loaded functions and stripped binaries, which affect the names of functions. We need to know these names so that we can pass them as arguments to the LLVM pass when we actually do the replacement. If we were to look for the names from the original binary, the LLVM pass wouldn’t be able to find any matching functions, since we’re using lifted bitcode.

[Figure: Finds the names of functions to be replaced]

Finally, we run the pass. If we don’t need access to the original function, we only need to run the pass on the original binary. If, however, we want to call the original function from the replacement, we run the pass on both the original binary and the replacement binary. In this second case, we are replacing the original function with the replacement function, and the stub function with the original function. Lastly, we recompile everything to a new working binary.

[Figure: Runs the pass and compiles a new binary from updated bitcode]

Results

Fennec uses binary lifting and recompilation to make a difficult problem relatively manageable. It’s especially useful for fixing security bugs in legacy software, where you might not have access to source code.

Using this tool, it becomes possible to automatically fix a cryptographic IV vulnerability. As seen below, the original binary encrypts a message identically each time because of its static IV. After running Fennec, however, the newly created binary uses a different IV on each run, thereby producing a unique ciphertext each time, even for the same plaintext.

# Original binary
aditi@nessie:~/ToB-Summer19$ ./encrypt ""
MDEyMzQ1Njc4OTAxMjM0NQ==/reJh+5rktBatDpyuJNQEBo++0pyIRGZiNsmZkN09HTPIOBVqQ9ov6CrxPXO7dC4cUJGYzBEsejHuTQyjVQh+XsLCHyDkURmfCuJ+a97raPY+o8pKKt8yf/xTmYMtyq2zf7EQxqPxv2bXKdP+6K+h9KyuO3q4+3JbuJFTesNLy8Np1m9ShJ9UAHvAdO6LCZvQ
N91kz0ytIH+s7LgajIWyises+yz26UBQwOzZLeLcQp4=
176

aditi@nessie:~/ToB-Summer19$ ./encrypt ""
MDEyMzQ1Njc4OTAxMjM0NQ==/reJh+5rktBatDpyuJNQEBo++0pyIRGZiNsmZkN09HTPIOBVqQ9ov6CrxPXO7dC4cUJGYzBEsejHuTQyjVQh+XsLCHyDkURmfCuJ+a97raPY+o8pKKt8yf/xTmYMtyq2zf7EQxqPxv2bXKdP+6K+h9KyuO3q4+3JbuJFTesNLy8Np1m9ShJ9UAHvAdO6LCZvQ
N91kz0ytIH+s7LgajIWyises+yz26UBQwOzZLeLcQp4=
176

aditi@nessie:~/ToB-Summer19$ ./encrypt ""
MDEyMzQ1Njc4OTAxMjM0NQ==/reJh+5rktBatDpyuJNQEBo++0pyIRGZiNsmZkN09HTPIOBVqQ9ov6CrxPXO7dC4cUJGYzBEsejHuTQyjVQh+XsLCHyDkURmfCuJ+a97raPY+o8pKKt8yf/xTmYMtyq2zf7EQxqPxv2bXKdP+6K+h9KyuO3q4+3JbuJFTesNLy8Np1m9ShJ9UAHvAdO6LCZvQ
N91kz0ytIH+s7LgajIWyises+yz26UBQwOzZLeLcQp4=
176

aditi@nessie:~/ToB-Summer19$ bash run.sh 2 ../mcsema-2.0.0-ve/remill-2.0.0/remill-build-2/ /home/aditi/ToB-Summer19/ida-6.9/idal64 encrypt replaceIV generate_iv replacement generate_iv_original -lcrypto

# Fennec's modified binary
aditi@nessie:~/ToB-Summer19$ ./encrypt.new ""
L+PYRFiOKMcu18hSqdGQEw==/aK2hYm/GXHwA2tqZxPmoNccQwW+Zhj7E0PQUSRF+lOLJiEMwOc7yv+/Z2AA0pEJjP7Jq4lHMpq2eIVl73lvav0pJiVlOcmfnFwQ9cu0MW0EWqUdgl2FCsWKtO/TAfGhcQPopJyvP8KD/LHlru4QIfZiym7//tt0V9vvabFCLNiSTRG350XKO/zoydeuRFfSu
0HmNNQbAcLSQkcUETH424RyQ4SxmcreW3krOw30kfJY=
176

aditi@nessie:~/ToB-Summer19$ ./encrypt.new ""
hYnowxN2Z3QyPIzwNaFzJw==/pzCq+V1q5ipHoqJXZ9MaeDr+nMdV5E1RbeI+YrcQqXjFHcVmDSq4yZboEuIJJjkbNbdO5DG6n3CQnZ1C7CumGdaZsddaYJueORROk7X+PnQZUq5bKqvdN7ZJEhK7qaerjogOF4TAotDV3ryLC6l/EWY01DkhGrf0hlXAkjQnOz28lXF40GNMd6pIjcoIbZze
V72v5s5q67fVdKdCzVE3BH76qX8qYS9YnN5JkGLERYA=
176

aditi@nessie:~/ToB-Summer19$ ./encrypt.new ""
r3/wMu5nD3rEFn7N88fCjQ==/MisK9RcK8RLsqjV2nrAfprghBYrBmeJS3FbJ4YG6zHBk+uA0CcZ+R4CSDolAaAPlCmkupfxy6bFHNEqyMVv7moPaiJEAkHDDU/FKen8eAJjMvz9+RK+xmQja238jk7xmaS6JbJOdh8teQ2XiMzlHsBYBVpw89UBFrTqOSN8qtlgU3aR4xUVlwZAA1+Pg2GHy
2CIWQI6ioHGDhN3P3po7MaOldJAgHGZO5d2GluroI70=
176

You can download Fennec and find instructions for its use here.

If you have questions or comments about the tool, you can find Aditi on Twitter at @aditi_gupta0!

Binary symbolic execution with KLEE-Native

by Sai Vegasena, New York University, and Peter Goodman, Senior Security Engineer

KLEE is a symbolic execution tool that intelligently produces high-coverage test cases by emulating LLVM bitcode in a custom runtime environment. Yet, unlike simpler fuzzers, it’s not a go-to tool for automated bug discovery. Despite constant improvements by the academic community, KLEE remains difficult for bug hunters to adopt. We’re working to bridge this gap!

My internship project focused on KLEE-Native, a fork of KLEE that operates on binary program snapshots by lifting machine code to LLVM bitcode.

What doesn’t kill you makes you stronger

KLEE’s greatest strength is also its biggest weakness: It operates on LLVM bitcode. The most apparent strength of operating on bitcode is that KLEE can run on anything that the Clang compiler toolchain can compile: C, C++, Swift, Rust, etc. However, there is a more subtle benefit to KLEE’s approach that is often overlooked. Operating on bitcode means the “supporting code” or “runtime” can be implemented in C or C++, compiled to bitcode, linked against the system under test (SUT), and then subjected to the same symbolic path exploration as the SUT itself.

This provides flexibility and power. For example, a KLEE runtime can implement I/O-related system calls (read, write​, etc.) as plain old C functions. These functions are subject to symbolic exploration, just like the SUT, and contribute to code coverage. This allows KLEE to “see into” the OS kernel, and explore paths that might lead to tricky corner cases.

Now for the downside of operating on bitcode. Typically, KLEE is used on programs where source code is available, but getting bitcode from source code is not easy because of the difficulties created by build systems, configurations, and dependencies. Even when bitcode is available, a vulnerability researcher may have to manually inject KLEE API calls into the source code, link in the KLEE runtime, and possibly stub out or manually model external library dependencies. These tasks become daunting when dealing with large code bases and complicated build systems. McSema is a black-box option when source code is not available, but the limitations of binary control-flow graph recovery and occasional inaccuracies may not produce acceptable results.

KLEE-Native runs on binary program snapshots

First, we focused on getting bitcode for any program; the approach we took was to operate on bitcode lifted from machine code, as opposed to bitcode compiled from source. Using a dynamic, snapshot-based approach like GRR’s, we can start KLEE-Native deep into a program’s execution, which isn’t possible with mainline KLEE.

Snapshotting

By default, the KLEE-Native snapshotter captures the program’s memory and register state right before the first instruction is executed. This means KLEE-Native needs to emulate pre-main code (e.g., loading shared libraries), which isn’t ideal. To avoid emulating that type of deterministic setup, we implemented a feature that works by injecting an INT3 breakpoint instruction at a user-specified virtual address, via ptrace.

In this mode, the target process executes natively until the breakpoint instruction is hit. Once it’s hit, the snapshotter reclaims control of the target and subsequently dumps the target process memory and a register state structure compatible with Remill into a “workspace” directory. Memory mappings of the original process can then be recreated from this workspace.

$ klee-snapshot-7.0 --workspace_dir ws --breakpoint 0x555555555555 --arch amd64 -- ./a.out

For binaries subject to address-space layout randomization (ASLR), the --dynamic flag instructs the snapshotter to interpret the breakpoint address as an offset within the main program binary. To do this, we use a neat trick: parsing the /proc/[pid]/maps file for the target process to discover the base virtual address of the loaded program. Do some arithmetic and voila, we have our breakpoint location!

$ klee-snapshot-7.0 --workspace_dir ws --dynamic --breakpoint 0x1337 --arch amd64_avx -- ./a.out
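
The maps-parsing trick is only a few lines. Here’s a rough Python equivalent of what the snapshotter does, with the matching logic simplified and the helper name invented for illustration:

def base_address(pid, binary_path):
    # scan /proc/<pid>/maps for the first mapping backed by the target
    # binary; its start address is the program's load base under ASLR
    with open("/proc/%d/maps" % pid) as maps:
        for line in maps:
            if line.rstrip().endswith(binary_path):
                return int(line.split("-", 1)[0], 16)
    raise RuntimeError("binary is not mapped")

# breakpoint address = load base + the offset passed via --breakpoint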

Side note: One interesting side effect involves CPU feature testing. Libc checks for available CPU features using the CPUID instruction and uses the result to determine whether to use specialized versions of some functions (e.g., a hand-coded, SSE4-optimized memset). If a snapshot is taken after CPUID has executed natively, then you must specify an AVX variant for the snapshot’s architecture (as in the amd64_avx example above); otherwise, those kinds of instructions might not lift.

Dynamically lifting machine code to bitcode

Now that we have our snapshot, we can ask KLEE-Native to dynamically lift and execute the program. We can do this with the following command:

$ klee-exec-7.0 --workspace_dir ws

The following diagram shows the control flow of klee-exec.

[Figure: Control flow of klee-exec]

At a high level, KLEE-Native just-in-time decodes and lifts traces, which are simply LLVM function objects that contain a logical segment of the lifted machine code. Traces contain LLVM instructions emulating machine code and calls to Remill intrinsics and other lifted traces.

Intrinsic calls may be handled by KLEE’s “special function handler” capability, which allows for runtime bitcode to have direct access to KLEE’s executor and state. For example, Remill’s memory read-and-write intrinsics use the special function handler to interact with the snapshotted address space.

When emulating a function like malloc, traces are created from libc’s implementation, and execution is able to continue smoothly. All is good in the world. We see the light and everything makes sense…

… Just kidding!

brk and mmap come along, and now we have to execute a system call. What do we do?

The runtime is the kernel and the machine

KLEE-Native lifts all machine code down to the raw system call instructions. Using Remill, system call instructions are handled by a call to the intrinsic function __remill_async_hyper_call in the runtime bitcode. Remill doesn’t specify the semantics of this function, but the intent is that it should implement any hardware- or OS-specific functionality needed to carry out a low-level action.

In our case, KLEE-Native implements the __remill_async_hyper_call function, so it passes execution from lifted bitcode back into the runtime bitcode, where each Linux system call wrapper is implemented.

System call wrappers are parameterized by an Application Binary Interface (ABI) type, which locates the system call number and arguments and stores the return value. Here is an example: the SysOpen function, which wraps the POSIX open system call implemented by KLEE’s runtime.

[Figure: Snippet of the emulated open system call]

This wrapper performs some of the error checking that the actual OS kernel would do (e.g., making sure the file path name can be read from the snapshotted address space, checking the path length, etc.). Here is where we play to KLEE’s strengths: all of these error-checking paths are subject to symbolic exploration if any of the data being tested is symbolic.

We have now dynamically lifted machine code to bitcode, and we’ve emulated system calls. That means we can run the lifted machine code and have it “talk” to KLEE’s own runtime the same way that bitcode compiled from source would.

Our call to malloc continues executing and allocating a chunk of memory. Life is good again, but things are starting to get slow. And boy, do I mean sloooooooow.

Recovering source-level abstractions with paravirtualization

Operating on machine code means that we must handle everything. This is problematic when translating seemingly benign libc functions. For example, strcpy, strcmp, and memset require significant lifting effort, because their assembly is composed of SIMD instructions. Emulating the complex instructions that form these functions ends up being more time-consuming than emulating simplistic versions would be. That doesn’t even account for the sheer amount of state forking that can occur if these functions operate on symbolic data.

We paravirtualized libc to resolve this issue. This means we introduced an LD_PRELOAD-based library into snapshotted programs that lets us interpose on common libc functions and define our own variants of them. In the interceptor library, our paravirtualized functions are thin wrappers around the POSIX originals, so before the snapshot is taken, executing them simply calls the original POSIX functions.

Their purpose is to be a single point of entry that we find and patch during the snapshot phase. In the following example, the snapshotter will patch over a JMP with a NOP instruction so that malloc ends up invoking INT 0x81 when it is emulated by KLEE-Native.

[Figure: LD_PRELOAD malloc interceptor]

When an interrupt is hit during emulation, we check the interrupt vector number and hook to a corresponding paravirtualized libc runtime function. Sometimes our paravirtualized versions of the libc functions cannot handle all cases, so we support a fallback mechanism, where we give control to the original libc function. To do this, we increment the emulated program counter by one and jump over the RETN instruction, which leads to the original function being executed.

Here are the libc functions that our LD_PRELOAD library currently intercepts. Each of the numbers (e.g., malloc’s 0x81) is the interrupt vector number that corresponds with the paravirtualized version of that function.

[Figure: libc functions with paravirtualized equivalents]

Wonderful! Our call to malloc has, instead, hit an interrupt, and we are able to hook to its paravirtualized version in KLEE. What exactly are we doing with it, and why do we care?

Modeling heap memory with libc interceptors

A call to mmap or brk during an emulated malloc will lay out memory for allocations. While this is an accurate representation of the lifted machine code, it is not a productive model for finding bugs. The problem: mmaps are very coarse-grained.

Every access to memory can be seen, but it is unclear where a given allocation begins or ends. As a result, it is difficult to perform bounds checks to detect overflows and underflows. Furthermore, when allocations are opaque blobs, there is no visibility into bug classes like double frees and use-after-frees.

That’s why we have interposed on malloc and other allocation routines to formulate a crystalline memory model that demarcates memory allocations. This approach makes it trivial for KLEE-Native to classify and report heap vulnerabilities. What’s unique about this approach is that we have invented a totally new address format for allocations to make our lives easier. It contains metadata about each allocation, which makes it simple to locate in our memory model.

[Figure: Union showing the components of our custom address encoding]

Allocations backed by our paravirtualized malloc don’t truly “exist” in a traditional address space. Instead, they exist within allocation lists. These are structures that allow individual allocations to coexist so that they’re easy to access, track, and manipulate, which makes bounds checks, double-free detection, and overflow/underflow detection extremely transparent. Furthermore, allocation lists give us the flexibility to recover from issues like heap-based buffer overflows by expanding the backing “allocation” in place.
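
As a purely illustrative sketch of this kind of encoding (the real field names and widths are the ones in the union above, not these):

# field widths below are illustrative only; the real layout is the union above
def encode(alloc_list, entry, offset):
    # pack an allocation-list index, an entry index, and an offset
    return (alloc_list << 48) | (entry << 32) | offset

def decode(addr):
    return addr >> 48, (addr >> 32) & 0xFFFF, addr & 0xFFFFFFFF

addr = encode(2, 7, 0x10)
assert decode(addr) == (2, 7, 0x10)
# a bounds check is now a comparison of the offset field against the
# recorded size of allocation (2, 7), with no address-space scan needed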

Huzzah! We’ve achieved a clear-cut representation of the target’s allocated memory using allocation lists. But wait. Isn’t this supposed to be a symbolic execution engine? All of this is really only concrete execution. What’s going on?

Eager concretization for an improved forking model

Our approach to symbolic execution is a departure from the typical scheduling and forking model used by KLEE. Where KLEE is a “dynamic symbolic executor,” KLEE-Native is closer to SAGE, a static symbolic executor.

KLEE-Native’s approach to forking favors eager concretization and depth-first exploration. The kicker is that we can do it without sacrificing comprehensiveness: it is always possible to return to a prior point in execution and request the next possible concretization, if that is our policy. This is different from tools like KLEE or Manticore, which eagerly fork (i.e., generate a multitude of feasible states, and then defer to a scheduler to choose among them over time).

The mechanism we created to enable eager concretization without sacrificing forking potential is implemented using state continuations. State continuations are like Python generators: they package up and hold on to a copy of the execution state prior to any forking, as well as any metadata needed to produce the next possible concretization (thus giving us comprehensiveness). The executor can then request the next possible forked state from a given continuation. Each request gives us back a new execution state in which some condition has been concretized (hence the term “eager concretization”).

For now, we store state continuations on a stack. The result is that KLEE-Native “goes deep” before it “goes wide”: at each point where the state could fork, we create a continuation, get the first state from that continuation, and push it onto the stack. When a state is done executing (e.g., it exits or hits an unrecoverable error), we look at the last continuation on the stack and ask for its next state. This process continues until a continuation is exhausted, at which point it is popped off and we move to the next continuation on the stack. In the future, we will explore alternative strategies.
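
Because state continuations behave like Python generators, the whole depth-first scheduling loop fits in a few lines. The State class, feasible_choices, and run below are toy stand-ins, not KLEE-Native’s actual API:

class State:
    # toy stand-in for an execution state: records the concretization trail
    def __init__(self, trail=()):
        self.trail = trail

    def fork_with(self, choice):
        # copy-on-fork: the new state extends the trail with one choice
        return State(self.trail + (choice,))

def continuation(state, choices):
    # like a Python generator: holds a copy of the pre-fork state and
    # lazily yields one forked state per feasible concretization
    for choice in choices:
        yield state.fork_with(choice)

def explore(initial, feasible_choices, run):
    # depth-first: always pull the next concretization from the most
    # recent continuation; pop it once it is exhausted
    stack = [continuation(initial, feasible_choices(initial))]
    while stack:
        try:
            state = next(stack[-1])
        except StopIteration:
            stack.pop()
            continue
        stack.extend(run(state))  # run() returns any new continuations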

[Figure: How to fork when eagerly concretizing]

Our approach was motivated by a need to handle symbolic memory addresses. We started adding symbols, but couldn’t always get KLEE to explore all paths. KLEE was concretizing memory addresses, but not in a comprehensive way. This was honestly expected, because symbolic memory is a hard problem.

Implementing concrete memory is easy: it is essentially a map from addresses to byte values. But what does it mean when an address could take on many values? We decided the best policy was to create a generic mechanism that concretizes the address immediately without throwing away all the other possibilities, and then leave it up to policy handlers to implement more substantive approaches, such as sampling or min/max.
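
A policy handler in this scheme only decides which concrete values a continuation should yield, and in what order. Here’s a hypothetical sketch of the min/max and sampling policies mentioned above; the solver interface is invented for illustration:

class MinMaxPolicy:
    # yield only the extremes of the feasible range for a symbolic address
    def concretizations(self, solver, addr_expr):
        lo = solver.minimum(addr_expr)
        hi = solver.maximum(addr_expr)
        yield lo
        if hi != lo:
            yield hi

class SamplingPolicy:
    # yield up to n feasible addresses chosen by the solver
    def __init__(self, n):
        self.n = n

    def concretizations(self, solver, addr_expr):
        yield from solver.sample(addr_expr, self.n)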

Wonderful! We can now explore the program’s state space. Let’s go hunting.

Applying KLEE-Native in the real world

Because we have control over emulated address space and memory allocations, classifying different types of memory corruption vulnerabilities becomes easy with KLEE-Native, and vulnerability triaging is a fantastic use case for this. Furthermore, our eager concretization strategy ensures we will stick to the code path of interest.

Here is CVE-2016-5180 from the Google fuzzer-test-suite. It is a one-byte-write heap buffer overflow in c-ares that was used in a ChromeOS exploit chain.

We first snapshot the program at main with a dynamic breakpoint:

$ klee-snapshot-7.0 --workspace_dir ws_CVE --dynamic --breakpoint 0xb33 --arch amd64_avx -- ./c_ares

And simply run the klee-exec command:

$ klee-exec-7.0 --workspace_dir ws_CVE

Here we get KLEE-Native detecting a one-byte heap overflow.

[Figure: KLEE-Native detecting the one-byte heap overflow]

So what makes KLEE-Native special compared to AddressSanitizer or Valgrind? This is where our policy handler comes in. One policy to handle memory access violations like this one is replacing overflow bytes with symbolic ones. As execution continues, we could potentially diagnose the severity of the bug by reporting the range of symbolic overflow bytes at the end. This could let a vulnerability researcher distinguish states that allow limited possibility for an overflow from ones that could potentially allow a write-what-where primitive.

In KLEE-Native, undefined behavior can be a new source of symbolic execution. This enables vulnerability triaging without prior knowledge of the threat model and without tedious reverse engineering.

Au revoir!

My internship produced KLEE-Native: a version of KLEE that can concretely and symbolically execute binaries, model heap memory, reproduce CVEs, and accurately classify different heap bugs. The project is now positioned to explore applications made possible by KLEE-Native’s unique approach to symbolic execution. We will also look into potential execution-time speed-ups from different lifting strategies. As with all articles on symbolic execution, KLEE is both the problem and the solution.

Reverse Taint Analysis Using Binary Ninja

by Henry Wildermuth, Horace Mann High School

We open-sourced a set of static analysis tools, KRFAnalysis, that analyze and triage output from our system call (syscall) fault injection tool, KRF. Now you can easily figure out where, and why, KRF crashes your programs.

During my summer internship at Trail of Bits, I worked on KRF, a fuzzer that directly faults syscalls to cause crashes. KRF works extremely well and pumps out core dumps like nobody’s business. However, it is difficult to determine which faulted syscall caused a particular crash, since there could be hundreds of faulted syscalls in a single run. Manually tracing the cause of a crash through the source or disassembled binary is tedious, tiring, and error-prone.

I set out to solve this problem using a technique I call Reverse Taint Analysis, and implemented my solution using the Binary Ninja API. The script gives a short list of possible causes of the crash, drastically limiting the amount of manual work required. Here, I describe the process I went through to create the algorithm and script, and give a brief overview of the additional tools built to ease its use.

Human in the Loop

How can we reliably determine the source of a crash? Well, how would a human determine the cause of a crash? First, we would look at the stack trace and figure out where the crash occurred. Let’s take a look at this vulnerable example program:

#include <stdio.h>
#include <stdlib.h>

void fillBuffer(char *string, unsigned len) {
  for (unsigned i = 0; i < len; ++i) {
      string[i] = 'A'; // if string = NULL, segfaults on invalid write
  }
}

int main() {
  char *str;

  // Memory allocated
  str = (char *) malloc(16); // if malloc fails, str = NULL 
  fillBuffer(str, 16); // str not checked for malloc errors!
  free(str);
  return 0;
}

Running KRF against this program caused a fault. In this case, we can easily guess why the crash occurred—a faulted brk or mmap caused malloc to return NULL, which produced a segfault when fillBuffer tried to write to NULL. But let’s figure out the cause of the crash for certain, acting as if we didn’t have access to the source code.

First, let’s open up the core dump’s stack trace with gdb and see what caused the crash:

(gdb) bt
#0 0x00005555555546a8 in fillBuffer ()
#1 0x00005555555546e1 in main ()

Next, let’s take a look at our memory maps for the process so we can find the instruction in the binaries:

(gdb) info proc mappings
Mapped address spaces:
Start Addr End Addr Size Offset objfile
0x555555554000 0x555555555000 0x1000 0x0 /vagrant/shouldve_gone_for_the_head
[output truncated]

Now, cross-referencing the address from the stack trace, we see that the instruction in the top stack frame lies in the mapping that starts at 0x555555554000 and is backed by the binary /vagrant/shouldve_gone_for_the_head. We can calculate the instruction pointer’s offset in the binary by subtracting the start of the mapped address space from the instruction’s address and adding the file offset:

0x00005555555546a8 - 0x555555554000 + 0x0 = 0x6a8.
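
Since we’ll repeat this translation for every stack frame, it’s worth a trivial helper:

def binary_offset(addr, map_start, file_offset):
    # translate a runtime address into an offset within the mapped objfile
    return addr - map_start + file_offset

assert binary_offset(0x00005555555546a8, 0x555555554000, 0x0) == 0x6a8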

Great! Now we can examine the binary itself in our disassembler of choice (Binary Ninja) and see what went wrong.

[Figure: Disassembly of fillBuffer(), with the crashing instruction highlighted]

Here, we can see the disassembly of the fillBuffer() function, with the instruction that causes the segfault highlighted in red. This instruction sets the byte pointed to by rax to the character code for A, so the issue must be an invalid value in rax. Looking back, we see that rax = rax + rdx, where rax and rdx were previously set to the local variables string and i, respectively. The instruction at 0x68e shows that string was originally stored in rdi, which holds the function’s first argument. i is initialized to zero and only ever incremented, so we can ignore it: it could not have been tainted by a function call or by the function’s arguments.

Knowing that the first argument to fillBuffer() is tainted, we can go to the next frame in the stack trace and see what happened. We perform the same subtraction on the address in the stack trace, 0x00005555555546e1, and get the actual address:

0x00005555555546e1 - 0x555555554000 + 0x0 = 0x6e1.​

This address is going to be one instruction after the function call to fillBuffer() since it is the return address. So, we want to examine the instruction directly before the one at 0x6e1. Let’s open it up in Binary Ninja!

[Figure: Disassembly of main around the call to fillBuffer()]

Here, we have the instruction at 0x6e1 highlighted in blue, and the previous instruction highlighted in red. We know from our manual analysis of fillBuffer that the first parameter is stored in rdi, so we should track the data being stored in rdi. In the instruction before, we see that rdi is set to rax, and above that, there is a call to malloc, which stores its return value in rax.

Great! Now we know that the output of malloc gets passed into fillBuffer, where it causes the segfault. We’ve figured it out! But that was really annoying. If only there were a better way…

Enter MLIL Static Single Assignment

Well, it turns out there is a better way! Binary Ninja can decompile code into something called Medium Level IL (MLIL), which is a more human-readable form of assembly. It can then convert that MLIL into a form called Static Single Assignment (SSA), where every variable is assigned exactly once. This becomes really useful, because we don’t need to worry about things changing a variable other than its definition. As an example of SSA, consider this pseudocode function:

def f(a):
  if a < 5:
    a = a * 2
  else:
    a = a - 5
  return a

The same function in SSA form is:

def f(a0):
  if a0 < 5:
    a1 = a0 * 2
  else:
    a2 = a0 - 5
  a3 = Φ(a1, a2) # meaning “a3 is either a1 or a2”
  return a3

So, let’s look at our same example again through the lens of SSA MLIL. Here’s fillBuffer in SSA MLIL:

[Figure: fillBuffer in SSA MLIL]

Here, we can easily trace rax_2#4 to rax_1#3 + rdx_1#2, then trace rax_1#3 to string#1, which we see is arg1. We can also easily trace back i and see that it is set to 0. We have once again discovered that the first argument to fillBuffer is the source of the crash. So now, let’s look at main.

[Figure: main in SSA MLIL]

This is where we really see the benefits of SSA MLIL over regular disassembly. It lets us see what arguments are passed into fillBuffer, and what values are returned by malloc, making the analysis much easier. By tracing the sources of rdi#1 backwards, we again see that malloc is tainting the first argument of fillBuffer and, therefore, causing the crash.

We’re in the endgame now

So now that we’ve realized (for the second time) that malloc is the cause of our issues, let’s write out the process we’ve been applying, so we can easily convert it to code:

1. Make an empty stack. 
2. Push the crashing instruction to the stack. 
3. While the stack is not empty: 
4.   Pop an instruction off the stack. 
5.   If it is a MLIL function call instruction: 
6.     The return value of that function call may be the cause of the crash 
7.   Otherwise: 
8.     For each SSA variable used in the MLIL instruction: 
9.        If it’s not assigned in this function: 
10.         # It’s a function argument 
11.         We will have to go another frame up our stack trace. 
12.         # The same as going to main after finding arg1 was tainted
13.       Otherwise: 
14.         Add the instruction assigning SSA variable to the stack.

This is going to be easy! We just have to write it out in Python using the Binary Ninja API. We need to write a function that takes our instruction’s address and a BinaryView (a class holding information on the binary), and prints out the taint sources of the instruction.

def checkFunction(self, inst_addr, bv): 
  # Get MLILFunction obj for the function containing the instruction 
  func = bv.get_functions_containing(inst_addr)[0].medium_level_il   
  
  # Get the MLILInstruction obj for instruction at inst_addr
  inst = func[func.get_instruction_start(inst_addr)].ssa_form
  
  # Convert MLILFunction to SSA form
  func = func.ssa_form 
  
  # Keep track of what is seen
  visited_instructions = set()
  
  # Variables we are interested in
  var_stack = []  
  
  # Add the vars used by first instruction to stack
  for v in inst.vars_read: 
    var_stack.append(v)
 
  # Continuously run analysis while elements are in the stack
  while len(var_stack) > 0:
    var = var_stack.pop()
    # Skip variables we have already analyzed, then mark this one as seen
    if var in visited_instructions:
      continue
    visited_instructions.add(var)

    # Get variable declaration
    decl = func.get_ssa_var_definition(var)
    
    # Check if it's an argument
    if decl is None: 
      print("Argument " + var.var.name + " tainted from function call")
      continue

    # Check if it's a function call
    if decl.operation == MediumLevelILOperation.MLIL_CALL_SSA:
      # If direct call
      if decl.dest.value.is_constant:
        # Get MLILFunction object of callee from address 
        func_called = bv.get_function_at(decl.dest.value.value) 
        print("Tainted by call to", func_called.name, "(" + hex(decl.dest.value.value) + ")")
      else: 
        # Indirect calls
        print("Tainted by indirect call at instruction", hex(decl.address))
      continue

    # If not an argument or call, add variables used in instruction to the stack. Constants are filtered out
    for v in decl.vars_read:
      var_stack.append(v)

The power of SSA comes through in the vars_read and get_ssa_var_definition methods, and MLIL makes detecting calls easy with decl.operation == MediumLevelILOperation.MLIL_CALL_SSA.
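
Wiring the script up takes only a few lines. Here, TaintAnalyzer is a hypothetical class holding the checkFunction method above, and get_view_of_file is the Binary Ninja loading API of the time:

from binaryninja import BinaryViewType

# load and analyze the binary; checkFunction itself also assumes
# "from binaryninja import MediumLevelILOperation"
bv = BinaryViewType.get_view_of_file("shouldve_gone_for_the_head")

TaintAnalyzer().checkFunction(0x6a8, bv)  # TaintAnalyzer is hypothetical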

Extending the script

There is a lot we could expand on here: error handling, edge cases, automatically analyzing the frame above in the stack trace, automatically extracting information from the stack trace, etc. Thankfully, I’ve already done some of that with a set of Python scripts.

python3 main.py binary coredump1 [coredump2] …

Automatically extracts the needed information from the core dumps, then inserts that information and binaries into a tarball to be copied to another computer, including libraries that are called in the stack trace.

gdb.py

Uses GDB Python API to extract data from each core dump. It’s called by main.py, so they must be in the same directory.

python3 analyze.py tarball.tar.gz

Takes a tarball output by main.py and automatically runs reverse taint analysis on each core dump in it, automatically cascading tainted arguments to the next frame. It uses krf.py to run the analysis, so they must be in the same directory.

krf.py contains the analysis code, which is a more featured version of the script written in this blog post. (Requires the Binary Ninja API.)

Let’s try them on our test binary:

$ # Linux VM with KRF
$ python3 main.py shouldve_gone_for_the_head core
Produced tar archive krfanalysis-shouldve_gone_for_the_head.tar.gz in /vagrant

$ # Machine with Binary Ninja
$ python3 analyze.py krfanalysis-shouldve_gone_for_the_head.tar.gz
Analyzing binary shouldve_gone_for_the_head
Done
Analyzing crash krfanalysis-shouldve_gone_for_the_head/cores/core.json
Tainted by call to malloc (0x560)
All paths checked

Conclusion

Writing this analysis script has shown me that the Binary Ninja API is amazing. The versatility and automatic analysis it allows are incredible, especially considering it acts directly on binaries, and its intermediate languages are easy to use and understand.

I’d also like to mention LLVM, another framework for static analysis, which has an API very similar to Binary Ninja’s. It has several advantages over Binary Ninja, including better access to debug and type information, being free, having a more mature codebase, and handling calling conventions perfectly. Its downside is that it needs the source code or LLVM IR of whatever you are analyzing.

Three LLVM passes are available in the KRFAnalysis repository for static analysis: one detects race conditions caused by checking the state of a system before using it (i.e., time-of-check/time-of-use, or TOCTOU), another detects unchecked errors from standard library calls, and a third reimplements reverse taint analysis.

My summer: A small price to pay for salvation

I am incredibly grateful to everyone at Trail of Bits for my internship. I gained some amazing technical experience and got the chance to work with the Linux Kernel, FreeBSD Kernel, and LLVM—codebases I had previously considered to be mystical.

Some of my highlights:

  • Ported KRF to FreeBSD
  • Added the ability for KRF to target processes by PID, GID, UID, or by whether they have a specific file open
  • Wrote LLVM passes for static analysis
  • Upstreamed LLVM changes
  • Learned how to use Binary Ninja and its API
  • Picked up good coding practices
  • Gained a sense of the security industry

I also met some incredible people. I would like to give special thanks to my mentor, Will Woodruff (@8x5clPW2), who was always willing to talk through an implementation or idea and to review my pull requests. I can’t wait to apply what I’ve learned at Trail of Bits as I move forward in my career.

Wrapper’s Delight

by Patrick Palka, University of Illinois at Chicago

During my summer at Trail of Bits, I took full advantage of the latest C++ language features to build a new SQLite wrapper from scratch that is easy to use, lightweight, high performance, and concurrency friendly—all in under 750 lines of code. The wrapper is available at https://github.com/trailofbits/sqlite_wrapper under the Apache 2.0 license. Comments and pull requests are welcome.

The motivation for this new SQLite wrapper came from working on an internal code-auditing tool built on top of Clang that uses a database to store and perform queries on semantic information about source code. Originally, RocksDB was chosen as the backing database, but I quickly found myself wrestling with the rigidity of a key-value database as queries became more complex. Wishing that we had a more expressive relational database, I began to explore switching to SQLite.

Initial experiments suggested that switching would not degrade performance if the database was properly tuned (see below), so I began looking at existing C++ SQLite wrappers to see which, if any, could suit our needs. We wanted something that would let us perform raw SQL queries while remaining lightweight and able to handle concurrency. Unfortunately, none of the existing wrappers satisfied all of these criteria, so I set out to write one from scratch.

After migrating the backend over to SQLite, we were impressed by its scalability and feature-richness. It has a command-line interface that makes debugging and prototyping easy, handles databases on the order of 100 GB without breaking a sweat, and even has a built-in full-text search (FTS) extension. The wrapper makes interfacing with SQLite in C++ about as easy as possible, too.

Details

  • An example of its usage can be found at https://gist.github.com/patrick-palka/ffd836d0294f71d183f4199d0e842186. For simplicity, we chose to model parameter and column bindings as variadic function calls, so that all binds are specified at once.
  • Some of the modern C++ language features you’ll notice are inline and template variables, constexpr if and auto template parameters, generic lambdas, fold expressions, and thread_local variables.
  • This wrapper supports user-defined serialization and deserialization hooks. (https://gist.github.com/patrick-palka/d22df0eceb9b73ed405e6dfec10068c7)
  • This wrapper also supports automatic marshaling of C++ functions into SQL user functions. (https://gist.github.com/patrick-palka/00f002c76ad35ec55957716879c87ebe)
  • There is no database or connection object to explicitly initialize. Because the wrapper utilizes thread_local objects to manage connections to the database, a connection is made implicitly before the first use of the connection and is disconnected once the thread exits.
  • The database name and query strings are passed as template arguments instead of function arguments. This creates compile-time separation of the thread_local connection objects, which are per-database, and the thread_local prepared-statement caches, which are per-database, per-query-string. This design decision discourages the use of dynamically-generated query strings, since non-type template arguments must be compile-time constant expressions. In cases where a database name or a query string must be dynamically generated, the wrapper does support passing a lambda, which builds and returns the query string at runtime.
  • Every single prepared statement made by this wrapper is cached and reused, so the bare minimum number of calls to sqlite3_prepare is made throughout the lifetime of the program.
  • Downside: This wrapper cannot be used to manually manage connections to the database. It handles connections using a thread_local object, so a connection is created before the first query is performed on a given thread and is destroyed during thread exit. If you find yourself needing fine-grained control over when to connect or disconnect from your SQLite database, this wrapper may not work well for you, but this limitation may be addressed in the future.

Fine-tuning SQLite

Here are some SQLite tuning tips to maximize performance. Our wrapper does the first three for you automatically.

  1. Prefer “external” FTS tables when using SQLite’s FTS extension. Build the table after the data is inserted using the ‘rebuild’ command. (https://www.sqlite.org/fts5.html#the_rebuild_command)
  2. Reuse prepared statements. Create them using the sqlite3_prepare_v3() routine and pass in the SQLITE_PREPARE_PERSISTENT option. (https://www.sqlite.org/c3ref/prepare.html)
  3. Use the SQLITE_STATIC option when binding text and blob values via the sqlite3_bind_*() routines. Ensure that the underlying memory is valid until the first call to sqlite3_step(). This avoids a redundant copy of the text or blob data. (https://www.sqlite.org/c3ref/bind_blob.html)
  4. Perform your insertions and updates in bulk transactions when possible; a sketch follows this list. The speedup relative to using unit-size transactions grows nearly linearly with the size of the transaction, so inserting 10,000 rows per transaction is thousands of times faster than inserting 1 row per transaction.
  5. Create your indexes after most or all of your data has been inserted, and choose your indices wisely. Creating indices once is going to be faster overall than continuously building and rebuilding them as more data gets inserted. Use the SQLite command-line interface to double-check each query’s plan, and install a log callback to have SQLite inform you whenever it decides to create a temporary index.
  6. Don’t use the same database connection or prepared statement object concurrently. SQLite serializes access to these objects. (https://sqlite.org/threadsafe.html) Also, the isolation component of ACID is guaranteed only between separate connections to the same database. (https://www.sqlite.org/isolation.html)
  7. Consider pragma temp_store = memory when storing temporary tables and data structures in memory. (https://www.sqlite.org/pragma.html#pragma_temp_store)
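
Tip 4 is easy to demonstrate with Python’s built-in sqlite3 module (the table and row count are illustrative; the same principle applies through any wrapper):

import sqlite3

con = sqlite3.connect("example.db")
con.execute("CREATE TABLE IF NOT EXISTS t (x INTEGER)")

# one transaction and one commit for all 10,000 rows, instead of paying
# for a synchronous commit after every single-row insert
with con:
    con.executemany("INSERT INTO t VALUES (?)", ((i,) for i in range(10000)))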

SQLite C API Tips

Finally, here are some miscellaneous tips to simplify working with the SQLite C API, where the first two are done for you by our wrapper.

  1. Install a concurrency-friendly sqlite3_busy_handler to avoid having to check for `SQLITE_BUSY` after every API call. (https://www.sqlite.org/c3ref/busy_handler.html)
  2. Set up a log callback to have errors and other notices, like hints on where to add indices, printed automatically. (https://www.sqlite.org/errlog.html)
  3. Bundle a copy of SQLite into your project. This is the recommended way to use SQLite in your application. (https://www.sqlite.org/custombuild.html) Doing so also lets you enable SQLite’s full-text-search extension and its other useful disabled-by-default extensions.
  4. Use C++11 raw string literals to format query strings.

Final Thoughts

This summer left me with a few takeaways. When locally storing a moderate amount of structured data without large concurrency demands, sooner or later you will want to perform complex queries on that data. Unless you have a clear vision of your data-access patterns from the outset, a key-value database will quickly back you into a corner whenever those patterns change; relational databases, on the other hand, make it easy to adapt to continuously changing access patterns. Finally, modern C++ can make interfacing with SQLite and other C APIs concise and easy, and, when configured properly, SQLite is quite scalable.