Themes from Real World Crypto 2020

Over 642 brilliant cryptographic minds gathered for Real World Crypto 2020, an annual conference that brings together cryptographic researchers with developers implementing cryptography in the wild. Overall, RWC 2020 was an impressive conference that demonstrated some amazing work. Here we explore three major themes that emerged:

  1. Crypto bugs are everywhere… Whether it’s a somewhat unsurprising Bleichenbacher attack on TLS, or cryptographic side-channel attacks on (supposedly) secure hardware, there are a lot of cryptographic vulnerabilities out there. This became abundantly clear this past week.
  2. …so we need more cryptographers on projects… When designing, implementing, and reviewing cryptographic systems, the more cryptographers involved, the better. RWC 2020 featured big examples of how well collaboration can work, and how badly important systems can fail without it.
  3. …but cryptographic capabilities are growing fast! Advanced cryptography is becoming more practical, as shown by new multi-party computation frameworks and improvements to ZK-proofs. Plus we saw exciting new applications in Apple’s Find My protocol for finding offline devices, message authentication for satellites to prevent spoofing, and more.

Let’s dig in! 

1. Crypto bugs are everywhere

Traditional attacks

Yet another Bleichenbacher attack was presented: The 9 Lives of Bleichenbacher’s CAT: New Cache ATtacks on TLS Implementations. (Which brings us to a fourth theme: Cryptographers still love using tortured puns and silly acronyms.) The attack leverages Bleichenbacher’s attack on PKCS#1 v1.5 padding for RSA key exchanges. Specifically, the attack takes advantage of the fact that many companies reuse certificates across multiple servers, so the Bleichenbacher attack can be parallelized and thus completed before the 30-second session timeout occurs.

Unfortunately, this insecure padding scheme is still supported by ~6% of the internet; further, a man-in-the-middle downgrade attack can be performed, so any server that supports a vulnerable implementation can be broken 100% of the time (and this works even if the client does not support RSA key exchange).

Another talk, SHA-1 is a Shambles, discussed a chosen-prefix collision on SHA-1, and showed that SHA-1 can now be attacked in practice with affordable hardware.  The authors used this vulnerability to perform an impersonation attack on PGP. This project was the culmination of several years of work, with theoretical attacks discovered in the early 2000s, and the first practical attack found in the 2017 paper, SHAttered. In other words, SHA-1 shall never be used again (ok, coming up with puns is harder than it looks). 

Other attacks

Two different attacks on secure hardware were presented at RWC: one on a hardware security module (HSM) and another on a trusted platform module (TPM). The first attack targeted a specific HSM model and was able to (among other things) perform arbitrary code execution and decrypt all secrets. Although the attack itself was not heavily cryptographic, the talk demonstrated (yet again) that we cannot necessarily trust that our cryptographic secrets will be safe on HSMs. The second talk combined a timing side-channel attack with a lattice attack on ECDSA to recover the private signing key, demonstrating that TPMs are unfortunately not side-channel resistant.

Meanwhile, “Pseudorandom Black Swans: Cache Attacks on CTR DRBG” demonstrated that random number generators are also vulnerable to side-channel attacks. The cache attack leverages two problems with CTR_DRBG: Keys are not rotated fast enough, and adding more entropy is optional (and chosen by the API caller). This means keys can be compromised, and if inadequate entropy is used, an attacker can then obtain all future states. These attacks were not a part of the previous standard’s threat model; fortunately, FIPS 140-3 updates this threat model.

2. The case for more cryptographers

From all of these attacks, the lesson is to involve more cryptographers and think about a variety of threat scenarios when designing your system (and in the case of the last talk, use Hash_DRBG). Several RWC 2020 presentations confirmed this. For instance, we saw how CRLite, a scalable system for TLS revocations, was achieved through academic and industrial collaboration. On the other hand, two different cryptographic reviews of e-voting systems and an analysis of the handshake protocol in WPA3 showed the dangers of too few cryptographic cooks.

The good

CRLite, the system for TLS revocations, started as an academic design and Firefox extension proof of concept; from there industry improved on the scheme, taking into account infrastructure that exceeded the means of academia alone. Now there is a working prototype and development is progressing while academia continues to refine the protocol.

More promising news came from model-checking 5G security: Our tools are sufficiently advanced that standardization now can and should be accompanied by formal models and analysis. This idea was pioneered by the symbolic analysis of TLS 1.3, and it’s great to see the trend continuing. These types of analysis are very powerful for protocols and standards, as they ensure that security goals are clearly stated and achieved by the protocol. 

In the case of 5G, the security goals were not clearly stated in the initial conception of the protocol. The RWC 2020 presentation, “A Formal Analysis of 5G Authentication,” specified the security goals more clearly, which led to the discovery that 5G does not achieve untraceability (perhaps this is bad after all!). Nevertheless, this work serves as an important demonstration and should be replicated for future standardization efforts.

The bad

“Dragonblood: Analyzing the Dragonfly Handshake of WPA3 and EAP-pwd” makes a pretty compelling case for involving cryptographers in protocol design. WPA2 is vulnerable to offline dictionary attacks, and WPA3 was proposed as the improvement. However, Dragonblood found that WPA3 is vulnerable to side-channels, and, according to the authors of the paper, “WPA3 does not meet the standards of a modern security protocol.” To make matters worse, the countermeasures are costly and may not be adopted. Worst of all, as the authors state, these issues could have been avoided if the protocol design process had been open to more cryptographers.

The ugly

There’s plenty of ugliness in the world of e-voting, as the talks at RWC 2020 confirmed. In one analysis of the Moscow internet voting system, two significant breaks to the encryption scheme were found within a somewhat constrained time frame. For example, the first break resulted from an insecure variant of ElGamal dubbed “Triple ElGamal,” which attempted to achieve 768-bit security, but actually achieved three separate instances of 256-bit security, which can be broken in under 10 minutes using CADO-NFS.

Both breaks cited were fixed; however, the fixes to the second break were published only two days before the election, and the technology was still deployed. The general impression of the presenter was that the voting scheme achieved no privacy, very partial verifiability, no coercion resistance, and no protection against vote-buying. Although the Russian government should be commended for opening their source code, it is clear that more cryptographers should have been involved in this entire process.

Similar work on the Switzerland internet voting system led to the discovery of some significant cryptographic bugs. The protocol uses a zero-knowledge proof system to achieve both privacy and verifiability; however, due to a flaw in their Fiat-Shamir transformation, none of the zero-knowledge proofs were sound. Further, parameters were generated incorrectly in a way that could allow for votes to be modified. Even worse, statements were malformed for their zero-knowledge proofs, which broke their security proofs. This result is not ideal. However, to be fair, it is great to see cryptographers involved, as critical issues were spotted before deployment in Switzerland (and the review revealed similar issues in non-public systems used in other countries).

3. New growth and cryptography applications

It’s not all bad; our cryptographic capabilities are growing quickly! And RWC 2020 displayed some fascinating efforts to apply cryptography to real world problems.

“Find My” cryptography

Earlier this year, Apple released a new “Find My” feature in iOS 13 that allows offline devices to be located while protecting privacy of both the owner and the finder of the device. Previously, similar features like “Find My iPhone” required the device to be online, a serious limitation, particularly for devices like MacBooks, which are often offline. The cryptography behind this feature was presented at RWC 2020. Apple sought a protocol that achieved the following goals: 

  • Only the owner of the device can track the device and access location reports remotely
  • Previous locations of the device are protected if the device is compromised 
  • Owners only receive anonymous information from the finder  
  • The finder’s location is never revealed to others (including the server) 

To achieve this, the protocol calls for offline devices to broadcast public keys via Bluetooth. Active devices become “finders,” and when other offline devices are discovered via Bluetooth, the finder encrypts its location using the offline device’s public key and sends it to the cloud. This way, even the server does not know the location—however, IP-based information does leak to the server, and Apple’s only promise is that they do not store logs of this information.

The owner can then access the time and location of their offline device whenever there is an active device in its vicinity. (There are more subtleties to the protocol to achieve the remaining security goals, such as key rotation). In summary, Apple specified rigorous security and privacy goals, and constructed a novel design in their attempt to achieve them.

Private detection of compromised credentials

“Protocols for Checking Compromised Credentials” presented a formal security analysis of two protocols for checking compromised credentials: HaveIBeenPwned (HIBP) and Google Password Checkup (GPC). These protocols aim to alert users if their credentials have been breached and shared across the web. GPC maintains an active database of username and password pairs for users to query. HIBP, on the other hand, only maintains passwords. 

Since these databases contain hundreds of millions of records, both protocols implement a bucketization strategy, where hash values corresponding to records are sorted into buckets based on their hash prefix. This allows users to query the database with a hash prefix, receive a bucket of hash values, and check if their credentials have been compromised, without revealing the full hash of their secret to the server. The study presented at RWC 2020 demonstrated that each protocol leaks noticeable information about user secrets due to their bucketization strategies—both protocols leak information for different, subtle reasons. Luckily, the study also produced mitigation strategies for both protocols.
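
To make the bucketization concrete, here is a minimal C sketch of the client side of such a query, in the style of HIBP’s range API (which buckets SHA-1 digests by their first five hex characters). The password and output formatting are illustrative only; compile against OpenSSL with -lcrypto.

#include <stdio.h>
#include <string.h>
#include <openssl/sha.h>

int main(void) {
    const char *password = "hunter2";               /* illustrative secret */
    unsigned char digest[SHA_DIGEST_LENGTH];
    SHA1((const unsigned char *)password, strlen(password), digest);

    char hex[2 * SHA_DIGEST_LENGTH + 1];
    for (int i = 0; i < SHA_DIGEST_LENGTH; i++)
        sprintf(&hex[2 * i], "%02X", digest[i]);

    /* Only the five-character prefix leaves the client; the server returns
       every suffix in that bucket, and the membership check happens locally. */
    printf("bucket queried: %.5s\n", hex);
    printf("suffix checked locally: %s\n", hex + 5);
    return 0;
}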

Out of this world cryptography

RWC even included some cryptographic applications that are out of this world. Galileo is a global navigation satellite system (like GPS) used by the European Union. As discussed at RWC, these navigation systems are a critical part of our infrastructure, and spoofing location is actually fairly easy. Luckily, so far, this spoofing is mostly used for playing Pokemon Go; however, spoofing attacks on these satellite systems are real. To protect against potential future attacks, Galileo will offer a public navigation message authentication service.

Banking on collaboration

The final talk at RWC discussed using multi-party computation to detect money laundering. Financial regulators impose large fines on banks if they allow money laundering activities, so these banks are incentivized to detect illegal activities. However, collaboration between banks is difficult because transaction data is private. Fortunately, multi-party computation can facilitate this collaboration without violating privacy. Overall, this effort achieved promising results by applying a graph-based approach for modeling transactions and algorithms specialized for multi-party computation for efficient, collaborative analysis between various banks.

Conclusion

RWC 2020 made it clear that involving cryptographers in the design and implementation of your novel protocols will save you both time and money, as well as keep everyone safer. If you’re involved in this type of work, encourage everyone involved to open-source your code, publish your protocols for review, and hey, talk to the Trail of Bits cryptography team!

Exploiting the Windows CryptoAPI Vulnerability

On Tuesday, the NSA announced they had found a critical vulnerability in the certificate validation functionality on Windows 10 and Windows Server 2016/2019. This bug allows attackers to break the validation of trust in a wide variety of contexts, such as HTTPS and code signing. Concerned? Get the important details and see if you’re vulnerable at https://whosecurve.com/. Then come back to this tab and keep reading to see exactly what this bug is and how it works.

Why even patch if it doesn’t have a logo? https://whosecurve.com

At a high level, this vulnerability takes advantage of the fact that Crypt32.dll fails to properly check that the elliptic curve parameters specified in a provided root certificate match those known to Microsoft. Interestingly, the vulnerability doesn’t exploit any mathematical properties unique to elliptic curves – the same exact bug could have manifested itself in a normal DSA signature verification library. So let’s first review how this bug would have worked if Crypt32.dll used normal DSA.

A toy version of the attack

The security of DSA relies on the fact that the discrete log problem is hard when dealing with the group of integers mod a prime. Consider the following equation:

b = g^x mod p

It’s hard to find x if all you know is p, g, and b. To set up DSA, users need to specify a prime p and a generator g. Using these two parameters, they can then create a private key x and public key pk = g^x mod p. These keys allow for signatures that can only be created by the private key, but can be verified with the public key. Signature forgery is then as hard as the discrete log problem (very hard).

Digital signature algorithms such as DSA aren’t very useful on their own, though, since they don’t provide a mechanism by which users can trust a given public key associated with a specific entity. This is where X.509 certificates come into play. An X.509 cert is a file that explicitly says “this public key belongs to this person,” signed by someone else (possibly the owner of the public key in question). These certificates can be chained together, starting at a “root” certificate, attesting to the identity of a root certificate authority (CA). The root CA signs intermediate certificates attesting to the identity of intermediate CAs, those intermediate CAs sign other intermediate certificates, and so on, down to the “leaf” certificate at the end.

Each certificate contains information about the signature algorithm and parameters used. A certificate issued by Microsoft might look like the following (heavily simplified) example:

  • Certificate authority: Microsoft
  • Name: Trail of Bits
  • Public key info
    • Algorithm: DSA
    • Generator: g
    • Prime: p
    • Public key: pk

When Windows users receive an X.509 certificate chain, they check to make sure the CA at its root is one that Microsoft trusts. But what happens if Windows only checks to make sure the public key of the certificate in question matches a trusted entity, not the associated system parameters? In other words, what happens when an attacker can change the value of p or g associated with a given public key pk without Windows noticing? In fact, omitting this check completely breaks the security of DSA.

One potential way to exploit this vulnerability is to simply set the generator g equal to the target’s public key pk and make the private key x = 1. This allows the attacker to sign any message as if they were the legitimate owner of pk, since they now know the private key (it’s 1). Things can get even more interesting, though: Instead of simply setting the new generator to the target’s public key, we can choose a new private key y and set the malicious generator to g’ = pk^(y⁻¹), with the inverse taken mod the group order, so that (g’)^y = pk. This means the certificate still effectively has a secret key, but it is known only to the attacker, not the original issuer.

Importantly, this attack works without anyone solving the discrete log problem. Essentially, if the authenticity of parameters associated with a given public key is not established, the attacker can choose any private key they want. This exploit scenario was originally outlined in 2004 by Vaudenay and referred to as a domain parameter shifting attack, but wasn’t seen in the wild until now.
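
To see how little work this takes, here is a toy C sketch (deliberately tiny, insecure numbers; our own illustration) of the simpler variant: the attacker never learns the victim’s private key, yet under the shifted parameters the pair (g’ = pk, x’ = 1) is a perfectly valid key pair for pk.

#include <stdio.h>
#include <stdint.h>

/* Toy square-and-multiply modular exponentiation; small numbers only. */
static uint64_t modexp(uint64_t base, uint64_t exp, uint64_t mod) {
    uint64_t result = 1;
    base %= mod;
    while (exp > 0) {
        if (exp & 1)
            result = (result * base) % mod;
        base = (base * base) % mod;
        exp >>= 1;
    }
    return result;
}

int main(void) {
    uint64_t p = 101, g = 2;            /* honest (toy) domain parameters */
    uint64_t x = 37;                    /* victim's private key, never revealed */
    uint64_t pk = modexp(g, x, p);      /* victim's public key */

    /* Attacker keeps pk but shifts the parameters: generator g' = pk, key x' = 1.
       Since g'^x' mod p == pk, the attacker now "owns" a private key for pk. */
    uint64_t g_prime = pk, x_prime = 1;
    printf("victim pk     = %llu\n", (unsigned long long)pk);
    printf("g'^x' mod p   = %llu\n", (unsigned long long)modexp(g_prime, x_prime, p));
    return 0;
}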

The actual vulnerability

Exploiting the vulnerability in Crypt32.dll involves adapting the previous attack so that instead of using DSA, the signer is using the elliptic curve variant ECDSA. In fact, you don’t really need to know much about elliptic curves at all to understand how this works. You need only know that elliptic curves are more or less mathematically equivalent to the integers mod p, except instead of multiplying numbers, you geometrically manipulate points lying on a curve. In this post, curve points are bold upper case letters, and adding a point P to itself n times is written as n * P.

Fig 1. The diagram you’ve seen a million times showing elliptic curve addition

Elliptic curves, along with point addition, create another structure in which the discrete log problem is hard. Also, like the normal DSA case, ECDSA requires choosing a set of public parameters before generating a private/public key-pair, including a generator. Usually these parameters are specified by naming a curve, such as Elliptic Curve secp256r1 (1.2.840.10045.3.1.7), but users can manually specify them instead. In that case, users must supply constants that define the elliptic curve (A,B), the prime p over which arithmetic is done, the generator of the group G, and information about that group’s size (order, cofactor). For the purposes of this attack, we only care about G.

Now that we have some background on elliptic curves, it’s not hard to see that the attack works basically the same as with DSA: Change the parameters specifying ECDSA to have a generator corresponding to a private key you know, but with the same public key as the certificate authority you’re trying to spoof. By editing the parameters, we can control the effective secret key for the certificate, and use it to attest to whatever identities we’d like.

In real life, the parameter validation bypass is also slightly more involved. Microsoft does check that the parameters used in most certificates are valid, but when it is presented a root certificate it has cached, it will skip parameter validation if the certificate uses elliptic curve cryptography and the public key matches what’s cached. This means that for common root CAs, which most users will have seen, our attack is viable. In practice, this means we can generate valid TLS certificates for almost any website, bypass code signing restrictions, or forge signatures for files and emails. For explanatory purposes, let’s look at how we might man-in-the-middle https traffic to some website.

Building a fake certificate

First we need to pick a trusted root certificate. Microsoft maintains a list here. For our purposes, we picked the Microsoft EV ECC Root Certificate Authority 2017. This is a secp384r1 cert, so the public key is a point on the curve defined by the secp384r1 parameters.

Fig 2. A trusted certificate with public key

Next we need to generate a new private key for our malicious certificate, defined over a different curve, using explicit parameters. This object has a specific ASN.1 key encoding, which we generate with OpenSSL. Remember from the previous section: we want to keep the public key identical to the trusted certificate’s in order to bypass validation, while knowing the corresponding private key ourselves. Once we have chosen our new private key, we can use it to calculate a generator such that the two correspond. More precisely, we need to calculate G’ = x⁻¹ * P, where x is our private scalar and P is the public key point from the MS certificate (this corresponds to the second attack scenario in the previous section).
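
The generator computation itself is short once the scalar and point are loaded into OpenSSL’s EC API. Below is a hedged sketch (our own helper, error handling omitted) of computing G’ = x⁻¹ * P so that x * G’ = P:

#include <openssl/bn.h>
#include <openssl/ec.h>

/* Given the trusted root's public key point P and an attacker-chosen private
   scalar x, compute the rogue generator G' = x^(-1) * P, so that x * G' = P. */
EC_POINT *rogue_generator(const EC_GROUP *group, const EC_POINT *P, const BIGNUM *x) {
    BN_CTX *ctx = BN_CTX_new();
    BIGNUM *order = BN_new();
    BIGNUM *x_inv = BN_new();
    EC_POINT *Gprime = EC_POINT_new(group);

    EC_GROUP_get_order(group, order, ctx);              /* n, the group order */
    BN_mod_inverse(x_inv, x, order, ctx);                /* x^(-1) mod n */
    EC_POINT_mul(group, Gprime, NULL, P, x_inv, ctx);    /* G' = x^(-1) * P */

    BN_free(x_inv);
    BN_free(order);
    BN_CTX_free(ctx);
    return Gprime;
}

From there, G’ replaces the named-curve generator in the explicit parameters of the rogue root certificate.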

Now that we have a new mutated key, we can use this to generate a CA certificate:

Fig 3: A parsed view of our bad root

Once we’ve generated that certificate we can use it to sign a leaf certificate for whatever we want. This key/cert is just signed by the “bad” root – it doesn’t need custom parameters or any magic.

Fig 4: A cert for whosecurve.com

Finally, we ensure that we send the “full chain” as part of a TLS connection. In normal TLS you send the leaf certificate along with any intermediates, but you don’t send the root itself. In this case we need to send that root to trigger the cache bug. Voila!

Fig 5. Not a real TLS certificate

Fixing the vulnerability and lessons learned

Fortunately for Microsoft, fixing this bug simply required adding a couple of checks during signature verification to make sure ECDSA parameters are authentic. Unfortunately for everyone else, this vulnerability is absolutely devastating and requires all affected systems running Windows 10 and Windows Server 2016/2019 to be patched immediately. Until they are, attackers can forge signatures for TLS, code, files, or email.

Cryptographically, this bug is a great example of how increasing the number of parameter choices increases system fragility. We’ve known for years that explicitly specified curves (as opposed to named curves) are a bad idea, and this is yet another piece of evidence to that point. It also is a great reminder that all cryptographic information needs to be verified when handling digital signatures.

While we have not provided a PoC for this, we strongly encourage people to patch, since public exploit code has become available today! And we have put up a website demonstrating the flaw for anyone interested in checking whether they have an unpatched system: Go find out Whose Curve Is It Anyway?

Mainnet360: joint economic and security reviews with Prysm Group

On Monday, October 28th at the Crypto Economics Security Conference, Trail of Bits announced a new joint offering with Prysm Group: Mainnet360. Carefully designed to produce a comprehensive assessment of the security and economic elements of blockchain software, Mainnet360 gives teams a broader perspective that will allow them to build safer and more resilient systems.

The short story: Mainnet360 makes sure a system’s actual deployed code is both correct and economically efficient. These systems are secure only through a complex interaction of economics and computer science; implementation errors in either allow value to be stolen or destroyed. This kind of multidimensional problem is exactly the kind of work we specialize in.

How it works

Since the original Bitcoin whitepaper, decentralized systems have built on a notion of economic security. To avoid having a single privileged administrator, the incentives must be aligned for network participants to maintain the system cooperatively. Realizing this vision requires both a sound incentive model and code that faithfully implements it; errors in either the model or the code can lead to total system collapse.

Mainnet360 clients will receive a comprehensive review of both the economic framework that drives their system and the code with which it is implemented. We will work closely with teams to identify and remove risks, architect future work, and find the ideal technical solutions for tricky economic constraints. Building stable decentralized systems requires a broad set of experts cooperating closely, and we’re proud to offer that in a convenient package.

Our offering extends beyond just design review. Trail of Bits specializes in delivering clients new testing and verification tools, too. Now with Prysm Group’s input, we can extend this tooling further to verify economic properties. Our comprehensive understanding of the risks present in the systems we review means that we can deliver more architectural guidance. Lists of bugs are useful, but strategic guidance to eliminate bug classes puts more power in your hands.

The partnership

Mainnet360 has been in the works for months. After being introduced by DARPA at the Applications and Barriers to Consensus Protocols workshop in February, where Prysm Group presented their research on “Designing the market for blockchain nodes,” our teams were struck by the similarity in our assessment processes. Despite our wildly different expertise, we found that we deliver similar advice in similar formats to some of the same clients. We also quickly realized that our skillsets were highly complementary.

After shadowing each other on a few trial projects, we found many of the mechanisms that we were assessing required a perspective that took both code correctness and mechanism design into account. From there, we worked together closely to understand each other’s processes, strategies, deliverables, and limitations. We collected feedback from past mutual clients, reviewed each other’s reports, sat in on each other’s calls, and built a collaborative process.

Now, we are excited to unveil what we have built to the public and work with a first batch of companies to prepare the systems they’re building for real-world usage. If you’re building something that could use this kind of review and guidance, get in touch. We’d love to work together with you.

64 Bits ought to be enough for anybody!

How quickly can we use brute force to guess a 64-bit number? The short answer is, it all depends on what resources are available. So we’re going to examine this problem starting with the most naive approach and then expand to other techniques involving parallelization.

We’ll discuss parallelization at the CPU level with SIMD instructions, then via multiple cores, GPUs, and cloud computing. Along the way we’ll touch on a variety of topics about microprocessors and some interesting discoveries, e.g., adding more cores isn’t always an improvement, and not all cloud vCPUs are equivalent.

Sixty-four is (more than) a magic number

Why try to guess a 64-bit number? Modern processors operate on 64-bit quantities, so 64 bits is a natural size for magic numbers, headers, and other markers. When fuzzing, it’s common to run into comparisons against such “magic” 64-bit values, but guessing these values is seen as a canonical impossible problem. Fortunately, no one has to use brute force in such situations, because there are better approaches like removing the comparison, using pre-made input seeds, dictionaries, symbolic execution, and compile-time transformation.

But the problem of a brute force guess is easy to understand and parallelize, demonstrating just how effective parallelization can be against a herculean task. Looking at this problem also shows that as hardware gets faster, new sets of computational problems become possible. Imagine what we can achieve using the full arsenal of modern computing power!

Still, we must consider just how intractable it is to guess a 64-bit number by simply trying all possibilities. How long would it take? And how much would it cost?

How big is a 64-bit number?

A 64-bit number can hold 2^64 (that is, 18,446,744,073,709,551,616) distinct values—more than the grains of sand on Earth and cells in the human body (Figure 1).

Smallest → Largest
Cells in the human body: 3.72 × 10^13
Sand grains on Earth: 7.5 × 10^18
2^64: 1.84 × 10^19
Avogadro’s number: 6.02 × 10^23
Stars in the Universe: 10^24
Figure 1: Getting an intuitive feel for the size of 2^64 by comparing it to other large numbers.

A modern CPU can execute about 270 billion instructions per second, so exhausting a 2^64 search space would take 776 days—a little more than two years. Thankfully, brute force comparison is an embarrassingly parallel problem, where the work can be evenly distributed among many processors. So what if we can coax each processor into doing more than one comparison at a time? Maybe there’s some kind of service where we can get a lot of processors on short notice.

All of a sudden this is starting to look tractable!

Crunching the numbers

Disclaimer: Before we get to the numbers, I want to emphatically state that this is a fun experiment and not a benchmark. No attempt was made to ensure a fair apples-to-apples comparison of different processors or machines. Plus:

  • The code used to generate the performance numbers is written in C and lacks tests.
  • It may not be the fastest possible version.
  • It is also certain to have multiple bugs.
  • In fact, a serious time measurement bug found during review delayed this post by a few weeks.

Fixes and suggestions are welcome on GitHub.

All measurements reflect an average of 10 trials. Two of the machines tested are cloud instances in Digital Ocean and Google Cloud, respectively. The Digital Ocean (DO) High CPU instance reports itself as an “Intel(R) Xeon(R) Platinum 8168 CPU” running at 2.70 GHz. The Google Cloud (GCP) High CPU instance reports itself as an “Intel(R) Xeon(R) CPU” running at 3.10 GHz. Neither of these self-reported identifiers can be trusted: Virtualization platforms can and usually do lie about the underlying CPU. Cloud machines are also shared, and what other tenants do on your machine may affect your CPU throughput.

Of the physical machines tested, my 2018 MacBook Pro has an Intel Core i7 8559U processor. The Xeon E5-2640 v4 is a 2.40 GHz 40-core shared server. Both of these machines had other software running at the same time as these measurements. The Core i3 530 running at 2.93 GHz is an old machine that was literally serving as a footrest and cat scratching post before it was revived for this project thanks to its GPU.

Pull Requests Wanted: You will notice that these are all x86 CPUs. ARM system measurements are very much wanted. I did not include these since this project was already taking too long, and I would have to start from scratch to learn ARM’s SIMD instructions.

The naive for loop

Some say premature optimization is the root of all evil. So let’s measure how long a generic for loop takes to compare all 2^64 values. Figure 2 below lists the operations per millisecond (an average of 10 runs) performed on a subset of the full 64-bit range, and an estimate of how long it would take to try all 2^64 values.
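
For reference, the loop being timed is essentially the following (a simplified sketch; the measured code lives in the GitHub repository mentioned above):

#include <stdint.h>

/* Scan [start, start + count) for `needle`, one value at a time. */
static int64_t find_naive(uint64_t start, uint64_t count, uint64_t needle) {
    for (uint64_t i = 0; i < count; i++) {
        if (start + i == needle)
            return (int64_t)i;    /* found: return the offset into this chunk */
    }
    return -1;                    /* not found in this chunk */
}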

Device Operations Per Millisecond (ms⁻¹) Years To Completion (Yrs)
2018 MacBook Pro 4.41E+06 132.78
Core i3 530 1.45E+06 402.46
DO High CPU 3.50E+06 167.30
GCP High CPU 3.77E+06 155.21
Xeon E5-2640 v4 3.37E+06 173.38
Figure 2: A generic for loop comparison of all 2^64 values.

The most naive approach would take 133 years. Clearly, this is much too long to wait and some optimization is in order.

Vectorized for loop

Modern processors can operate on multiple 64-bit quantities at a time via SIMD or vector instructions. Currently, Intel leads the pack with AVX-512, which, like the name implies, operates on 512-bit vectors. This lets us compare eight 64-bit quantities per iteration. For those who want to know more about vector instructions, Cornell has a good set of introductory material on the topic.

Vectorization is the process of transforming code that runs on one quantity at a time (a scalar) into code that operates on multiple quantities simultaneously (a vector). Vectorization was my first optimization, because I thought clang would automatically vectorize code for me. Unfortunately, I was wrong—clang’s auto vectorization is meant to vectorize code without dependence on the loop variable, like matrix multiplication. Instead of relying on the compiler, I used artisanally hand-crafted vectorized comparisons (sse, avx2, and avx512) and hand-unrolled loops to make maximum use of multiple vector execution units.
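
To give a flavor of what the hand-written versions look like, here is a simplified AVX2 sketch (not the repository’s exact code; requires a CPU with AVX2 and compilation with -mavx2) that checks four candidates per iteration:

#include <immintrin.h>
#include <stdint.h>

/* Compare four 64-bit candidates per iteration against `needle`. */
static int64_t find_avx2(uint64_t start, uint64_t count, uint64_t needle) {
    const __m256i target = _mm256_set1_epi64x((long long)needle);
    for (uint64_t i = 0; i + 4 <= count; i += 4) {
        __m256i candidates = _mm256_set_epi64x(
            (long long)(start + i + 3), (long long)(start + i + 2),
            (long long)(start + i + 1), (long long)(start + i));
        __m256i eq = _mm256_cmpeq_epi64(candidates, target);     /* lane-wise == */
        int mask = _mm256_movemask_pd(_mm256_castsi256_pd(eq));  /* one bit per lane */
        if (mask)
            return (int64_t)(i + (unsigned)__builtin_ctz((unsigned)mask));
    }
    return -1;
}

A faster version keeps the candidate vector in a register and adds a constant step each iteration instead of rebuilding it, and unrolls the loop across multiple vector execution units, which is what the measured code does.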

Vector instructions are continually improving, but not every x86 CPU supports AVX-512. Some only support AVX2 (256-bit vectors), while others just do SSE4.1 (128-bit vectors), and some don’t even support that. In Figure 3 (table) and Figure 4 (graph) below, we can compare different vectorization approaches available on our hardware collection.

Device Method Operations Per Millisecond (ms⁻¹) Years To Completion (Yrs)
2018 MacBook Pro Naive 4.41E+06 132.78
2018 MacBook Pro SSE 6.73E+06 86.94
2018 MacBook Pro AVX2 1.51E+07 38.85
Core i3 530 Naive 1.45E+06 402.46
Core i3 530 SSE 3.08E+06 190.12
DO High CPU Naive 3.50E+06 167.30
DO High CPU SSE 4.86E+06 120.44
DO High CPU AVX2 1.09E+07 53.57
DO High CPU AVX512 1.41E+07 41.44
GCP High CPU Naive 3.77E+06 155.21
GCP High CPU SSE 5.02E+06 116.41
GCP High CPU AVX2 1.17E+07 49.82
GCP High CPU AVX512 1.35E+07 43.37
Xeon E5-2640 v4 Naive 3.37E+06 173.38
Xeon E5-2640 v4 SSE 4.50E+06 129.99
Xeon E5-2640 v4 AVX2 1.02E+07 57.49
Figure 3: Performance of vectorized and naive versions of 64-bit comparison operations. Not all hardware platforms support the same vectorization methods.

The table above is included for completeness; the graph of the same data below provides an easier visual comparison.

Figure 4: A graphical representation of Figure 3; performance of different methods to compare 64-bit numbers on a single core.

Several things stand out in this data:

  • Vectorization always helps. Comparing more values per iteration is always faster, even when accounting for setup time to move values in and out of vector registers.
  • AVX2 (four comparisons at a time) is a big improvement over SSE4.1 (two comparisons at a time). As expected, AVX2 is about twice as fast as SSE4.1.
  • Conversely, AVX-512 (eight comparisons at a time) is a small improvement over AVX2 (four comparisons at a time). How can this be? I suspect that this is due to a little-known side effect of using AVX-512 instructions: They slow down the processor by as much as 40%. The processor’s power budget doesn’t permit it to both run at full speed and make heavy use of AVX-512.

Even with these improvements, it would still take 39 years to check all 2^64 values on a single core. How much faster can we go using multiple cores?

Parallelized and vectorized

The problem of finding a needle in a 64-bit haystack happens to be ridiculously parallelizable, so multiple cores should deliver a linear increase in throughput. The data below (Figure 5) shows that this is true—except when it’s not! Sometimes performance plateaus even as more cores are added. How can this be?

This effect is due to hyperthreading. A hyperthreaded processor presents each physical core as two or more virtual cores to the operating system. While each hyperthread appears independent, the underlying execution units are shared. For computationally intensive applications, hyperthreading can have significant and surprising effects on performance. In a hyperthreaded environment and on CPU-intensive workloads, which cores are used can matter almost as much as the number of cores. This is especially important for cloud-based workloads, where each vCPU is a hyperthread on a physical core.

Figure 5: Operations per hour versus number of cores for each tested machine, separated by different methods of allocating cores. MacBook Pro and Core i3 530 results are omitted.

Figure 5 shows multicore performance using two methods to allocate workers to hyperthreads. Split cores allocates to separate physical cores, while share cores allocates to the same physical core.
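
The allocation itself boils down to thread affinity. A minimal, Linux-specific sketch (assuming pthreads; the mapping from logical CPU ids to physical cores depends on the machine’s topology) looks like this:

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

/* Pin the calling worker thread to a single logical CPU. "Split cores" hands out
   one logical CPU per physical core first; "share cores" packs workers onto
   sibling hyperthreads of the same physical core. */
static int pin_to_cpu(int cpu_id) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu_id, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}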

The 40-hyperthreaded-core Xeon E5-2640 machine demonstrates this difference: Using split core allocation, performance peaks at 20 cores—the number of physical cores—and then levels off. Using shared core allocation, throughput follows a step function, increasing with each new physical core. We may also make inferences about cloud hardware using this data: The 16 vCPUs of a high-CPU GCP instance probably represent 8 physical cores and 16 hyperthreads dedicated to our workload.

The DO high-CPU machine presents a puzzle: The same effect isn’t observed. Assuming the vCPUs come from a real and single Xeon Platinum 8168, there should be differences after 24 cores are utilized. This doesn’t happen, and there are a few possible explanations. First, the vCPUs are not from the same physical processor: The Xeon 8168 can operate in an 8-way multiprocessing configuration. Second, the processor is not a Xeon 8168, but another chip altogether. And finally, there may be something wrong with my thread affinity or timing measurement code.

Regardless, the scaling results show a valuable lesson: Using more cores isn’t always better, and it matters which cores (or vCPUs) you use. There is little to no gain, for this workload, when allocating more workers than physical cores present in the machine. Always measure when performance counts.

For completeness, Figure 6 lists operations per millisecond and estimated years to completion when utilizing multiple cores, with each core using the fastest supported single-core method. For the cloud machines, each core is a “vCPU,” which is roughly equivalent to one hyperthreaded core. For physical machines, each core is one hyperthreaded core.

Device Cores (Hyperthreaded) Operations Per Millisecond (ms⁻¹) Years To Completion (Yrs)
Xeon E5-2640 v4 39 / 40 1.57E+08 3.73
GCP High CPU 16 / 16 1.24E+08 4.71
DO High CPU 32 / 32 2.55E+08 2.30
Core i3 530 2 / 4 6.24E+06 93.76
2018 MacBook Pro 8 / 8 5.14E+07 11.38
Figure 6: The best multiprocessing times observed, and at how many cores.

The only outliers in Figure 6 are the Xeon E5-2640, which did best at 39/40 hyperthreaded cores, and the Core i3 530, which did best at 2/4 hyperthreaded cores. Why 39 cores? The machine is shared and handles other workloads; at 39 cores, those other workloads can be consolidated on the one remaining core. At 40 cores, the other workloads compete with the CPU-intensive integer comparisons and take scheduling time away from them.

As expected, using multiple CPUs has drastically reduced compute time. However, it is still too slow. We can’t expect to wait more than two years to guess one number.

Enter the graphics

GPUs take a different approach to computation than CPUs. CPUs are good at multiple concurrent tasks; GPUs are great at doing simple operations over a huge volume of data. This manifests in the main architectural differences: A high-end CPU may come with 24 very complex cores, while a high-end GPU comes with 5,120 simple cores (Figure 7).

Figure 7: This is a Tesla V100 GPU from NVIDIA. It comes with 5,120 CUDA cores and was the most powerful GPU tested. Graphic taken from the NVIDIA Volta architecture whitepaper.

Inadvertently, I picked a problem that is tailor-made for GPU optimization. Brute force comparison is easy to parallelize, involves no complex decisions, and is entirely compute-bound. Thanks to the role of GPUs in machine learning, cheap off-peak GPU capacity is available from every large cloud provider.

Figuring out how to use GPU computation took some work, but it was absolutely worth it. Just look at the throughput in Figure 8! There is an order of magnitude performance gain even though my knowledge of GPU programming and CUDA was about zero when I started, and my CUDA code has approximately my-first-CUDA-tutorial-level of optimization.

Device GPUs Operations Per Millisecond (ms⁻¹) Years To Completion (Yrs)
GeForce GT 1030 1 1.60E+08 3.66
NVIDIA Tesla K80 8 2.92E+09 0.20
NVIDIA Tesla P100 4 4.19E+09 0.14
NVIDIA Tesla V100 4 9.08E+09 0.06
Figure 8: 64-bit comparison throughput on GPUs, and estimated years to search all 2^64 values.

As Figure 8 shows, using GPUs instead of CPUs shortens the required time from years to days. What a difference the right platform makes! An old $85 video card (the GeForce GT 1030) performs on par with a 40-core Xeon machine. Using 4x NVIDIA Tesla V100 GPUs, we can brute force a 64-bit comparison in 0.06 years, or about 22 days.

The difference between CPU and GPU computation for this problem is so dramatic (a lone V100 is about 18.5x as fast as 16 high-CPU cores) that it doesn’t make sense to use CPUs. Adding another GPU will always be a better use of resources than relying on CPU computation. To show how bad CPUs are at this problem, I made this handy graph:

Figure 9: Throughput when comparing 64-bit numbers on selected CPUs and GPUs. This specific problem can be done so much faster on GPUs that it doesn’t make sense to use CPUs at all.

Now that we’ve gone from 130 years (with the naive for loop) to 22 days (on multiple GPUs), can we get to hours or minutes? How much hardware would it take…and how much would it cost?

Show me the money

We established that GPUs are much faster than CPUs, but GPU time is also much more expensive. And which GPU is best? Different GPU families command rather different prices.

Using current pricing (as of November 2019), we can see that the price/performance of GPUs is about 5x as good as CPUs. While more expensive, the V100 delivers better performance per dollar than its GPU predecessors (Figure 10).

Configuration Compute-Years Needed Per-Hour Cost (Preemptible) Total Cost
Pre-Emptible CPU GCP [16 vCPU] 4.71 $0.20 $8,311.70
Pre-Emptible GPU [4x P100] 0.14 $1.76 $2,147.34
Pre-Emptible GPU [4x V100] 0.06 $3.00 $1,689.45
Pre-Emptible K80 [8x K80] 0.20 $1.12 $1,959.64
Figure 10: Price of CPU and GPU time needed to compare a value against 2^64 numbers, measured in compute-years. Costs are for preemptible time on Google Cloud.

For something that seemed so insurmountable at the start, the final cost is well within reach: Comparing a number against 2^64 values would take about $1,700 of preemptible GPU compute time.

Because the problem can be parallelized so efficiently, total compute time can be traded almost directly for hardware cost and availability. For example, 4x V100 GPUs will look through all 2^64 numbers in about 22 days, but 2,259x V100 GPUs would look through 2^64 numbers in about an hour, for the same overall cost.

Conclusion

This silly experiment shows the importance of understanding your problem, your hardware, and the kinds of computing resources that are available to the average developer.

What seemed like a completely crazy idea turned out to be only mildly insane. Trying 2^64 comparisons would cost about $1,700 in compute time, and can be accomplished in hours or days.

We also learned a few more things along the way:

  • Parallelization is not always easy, but it can be remarkably effective. Learning some CUDA made this originally insurmountable problem go from impractical to “tolerable but expensive.”
  • Using more cores isn’t always better, and it matters which cores you use.
  • The underlying hardware can leak through the abstraction of the cloud.
  • Cloud computing provides access to nearly limitless on-demand hardware; it just has to be put to use.

There are certainly many problems that seem crazy and completely out of reach, but can be solved via a combination of parallelizable algorithms, cloud computing, and a corporate credit card.

We’re always developing ways to work faster and smarter. Need help with your next project? Contact us!

Introducing iVerify, the security toolkit for iPhone users

“If privacy matters, it should matter to the phone your life is on.” So says Apple in their recent ads about Privacy on the iPhone and controlling the data you share—but many of the security features they highlight are opt-in, and users often don’t know when or how to activate them.

But hey… we got your back!

Today, Trail of Bits launched iVerify, a user-friendly iPhone security toolkit, now available for download in the iOS App Store. iVerify makes it easy to manage the security of your accounts and online presence with simple instructional guides. Crucially, it also detects security anomalies on your iPhone or iPad.

Read more about iVerify in VICE Motherboard.

Yes, you do need to secure your iPhone

Although iOS malware attacks have been relatively rare, 2019 saw the largest attack against iPhone users to date. And in September, a new iPhone Boot ROM exploit was released that allows anyone with physical control of a phone to run arbitrary code. (See details in our post “Tethered Jailbreaks Are Back.”)

Fortunately, there are plenty of ways to protect your data, but locking down your iPhone and iCloud account is not straightforward. There are dozens of settings that need to change, and the trade-offs between enabling and disabling a feature aren’t always clear. iVerify makes these trade-offs straightforward with guides that show you how to adjust settings based on privacy and security needs.

iVerify alerts you to security anomalies

Not only does iVerify help you keep your data confidential and limit data sharing, it helps protect the integrity of your device. It’s normally almost impossible to tell if your iPhone has been hacked, but our app gives you a heads-up. iVerify periodically scans your device for anomalies that might indicate it’s been compromised, gives you a detailed report on what was detected, and provides actionable advice on how to proceed.

iVerify does all of this using iverify-core, our suite of iOS integrity and security checks originally created for professional developers. These checks were developed by Trail of Bits based on our extensive experience and expertise in iOS internals and iOS security. iverify-core is updated as new versions of the iPhone and new security checks are released, so new iVerify app users will get the benefit of years of improvement.

Get yours now on the iOS App Store

For iPhone users: To find iVerify on the iOS App Store, simply search for “Trail of Bits” or click here to get the app now.

For developers: Want to license iverify-core? Contact us today and let’s get you started.

Announcing the Crytic $10k Research Prize

At Trail of Bits, we make a significant effort to stay up to date with the academic world. We frequently evaluate our work through peer-reviewed conferences, and we love to attend academic events (see our recent ICSE and Crypto recaps).

However, we consistently see one recurring issue at these academic events: a lack of reliable tools and experiments. Researchers have no incentive to maintain tools, and, most of the time, don’t have the necessary engineering resources. That’s where Trail of Bits comes in: We spend considerable effort to maintain our research-oriented, open-source tools because we want researchers to benefit from our work.

And now, to encourage more activity in this area, we’ve created the Crytic Research Prize to reward published academic papers built on our blockchain tools.

Topics of interest

Any work that is built on top of our tools is eligible.

Do you want to make an industrial-scale impact? Our team is currently interested in these topics:

  • Improving the core capabilities of the tools:
    • How to make Echidna exploration smarter
    • How to make Manticore faster on smart contracts
    • How to reduce false positives from Slither
  • Extending the areas of application for the tools:
    • How to model and detect race conditions precisely
    • How to automatically repair bugs
    • How to guide Echidna with machine learning or grammar-based approaches
    • Any specific instrumentation techniques for smart contracts
    • How to combine the tools
  • We would especially love to see creative uses for our tools outside their original purpose. For example, see our code similarity tool based on Slither.

Rules

  • The paper must be based on one of our open-source tools (such as Echidna, Manticore, or Slither).
  • The paper must have been accepted in a peer-reviewed conference. We recommend:
    • ACSAC, ASE, CAV, CCS, Crypto, FC, FSE, NDSS, ICSE, ISSTA, OSDI, POPL, PLDI, S&P, S&P Europe, TACAS, and Usenix.
    • Any workshops from the conferences listed above.
  • All material must be open-source, including the benchmark. We will re-run the experiments and evaluate the tools.
  • Applications must comply with the applicable US and international laws, regulations, and policies.

Trail of Bits will evaluate all entries and award cash prizes to the best papers:

  • First place: 6,000 USD
  • Second place: 2,000 USD
  • Third place: 2,000 USD

We will close the submissions on November 1st, 2020.

We are happy to provide support for our tools. Join our Slack channel (#ethereum) if you want direct assistance from our developers.

Remember, if you want to participate in the competition, send us your paper!

Everything You Ever Wanted To Know About Test-Case Reduction, But Didn’t Know to Ask

Imagine reducing the amount of code and time needed to test software, while at the same time increasing the efficacy of your tests and making your debugging tasks easier—all with minimal human effort. It seems too good to be true, but we’re going to explain how test-case reduction can do all this (and maybe more).

Understanding how reduction works can help with troubleshooting, and makes it easier to figure out an efficient workflow and the best tools to optimize your tests. We’ll explain why test-case reduction is an especially important topic for security engineers to understand, and take a look at DeepState’s state-of-the-art reducer.

Test-case reduction for humans

The most common purpose for test-case reduction is to take a complicated failing test case for a confirmed bug, and turn it into one that is much easier to understand in order to make debugging easier. But you may also want to dismiss low-priority bugs! Having a simple version of such bugs helps quickly identify future duplicate submissions of the same, unimportant problem. For unconfirmed bugs, reducing test cases can be even more critical, because you often can’t tell if a bug is a duplicate (or even a bug at all) until you’ve simplified it. You may discover that the problem is with your specification, test generation tool, or operating system.

Without a reduced test case that can be easily understood and succinctly described, it’s hard to even say that you have a bug. You may have evidence that a software system is behaving unexpectedly, but when the test case is complex enough “a bug” can be a dubious concept. A complex test case is almost analogous to a mathematical proof that only shows that an integer with a certain property must exist; your proof needs to be constructive if you want to do something that requires an actual number.

Suppose you have a very large, randomly-generated HTML file that crashes Chrome. You’ve possibly found an important bug; or maybe you just did something that caused Chrome to run out of memory in an expected way. If your only conclusion is, “When I load this huge file in Chrome, it stops running,” you don’t really know much at all. In this case, you may want to apply a test-case reducer instead of spending time looking at core files and attaching debuggers. If it reduces the file down to a single page of HTML or, better yet, reduces to something as small as a single SELECT tag, you have a shortcut to understanding what is going on.

Test-case reduction has direct applications in cybersecurity, especially when fuzzers are used to craft exploits from crashes generated by random inputs. The simpler the input, the easier it will be to construct an exploit or realize that a bug cannot be exploited. AFL and libFuzzer provide built-in limited test-case reduction, but sometimes you need more than that. Modern test-case reduction tools can simplify this analysis, and are probably essential if you want to produce complex sequences of API calls to find vulnerabilities in a library like TLS, SQLite, or LevelDB. This concept also extends to fuzzing smart contracts, which is why Echidna, Trail of Bits’ smart contract fuzzer, includes a test-case reducer.

Test-case reduction for machines

Test-case reduction does more than just make tests easier for humans to read; it’s useful for core cybersecurity tasks like fuzzing, symbolic execution, and bug triage.

If you’d like to search the literature on test-case reduction, it’s also referred to as minimization, shrinking, and, originally, delta-debugging. All of these names refer to the same process of taking a program input A that does X, and producing a smaller input, B, that also does X.
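
At its core, a reducer is a greedy loop: try deleting part of the input, keep the deletion if the interesting behavior survives, and repeat with smaller and smaller chunks. A minimal C sketch of that idea (our own illustration, far simpler than DeepState’s actual reducer) looks like this:

#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Oracle: returns nonzero if the candidate input still triggers the behavior
   we care about (e.g., the same assertion failure). */
typedef int (*oracle_t)(const uint8_t *data, size_t len);

static size_t reduce(uint8_t *data, size_t len, oracle_t still_interesting) {
    uint8_t *trial = malloc(len ? len : 1);
    if (!trial)
        return len;
    for (size_t chunk = len; chunk >= 1; chunk /= 2) {
        size_t pos = 0;
        while (chunk <= len && pos + chunk <= len) {
            /* Candidate: the current input with data[pos, pos + chunk) deleted. */
            memcpy(trial, data, pos);
            memcpy(trial + pos, data + pos + chunk, len - pos - chunk);
            if (still_interesting(trial, len - chunk)) {
                memcpy(data, trial, len - chunk);   /* keep the smaller input */
                len -= chunk;                       /* retry the same position */
            } else {
                pos += chunk;                       /* deletion broke it; move on */
            }
        }
    }
    free(trial);
    return len;                                     /* new (reduced) length */
}

Real reducers layer many more passes on top of this loop: the chunk removals, byte-range removals, swaps, and byte reductions you can see in the DeepState output below.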

Using a test-case reducer

Let’s see test-case reduction in action. Bring up the DeepState docker image, and install the testfs repository, which is a DeepState harness for a user-mode ext3-like file system.

> git clone https://github.com/agroce/testfs.git
> cd testfs
> cmake .
> make

Our DeepState harness lets us simulate device resets in the middle of operations, and check for problems caused by interrupting a file system call. It checks that after a reset, the file system can still be mounted. To generate a test case showing that this isn’t always true, we can just use DeepState’s built-in fuzzer:

> mkdir failure
> ./Tests --fuzz --output_test_dir failure --exit_on_fail --seed 10

DeepState will report a problem, and save the resulting test case in a file with a unique file name ID and .fail extension. Let’s look at the sequence producing the file system corruption. For brevity, we show only the actual test steps below.

> ./Tests --input_test_file failure/dbb393e55c77bac878ab06a02a022370e33761cb.fail

TRACE: Tests.cpp(115): STEP 0: tfs_lsr(sb);
TRACE: Tests.cpp(140): STEP 1: tfs_stat(sb, ".a.BBA");
TRACE: Tests.cpp(146): STEP 2: tfs_cat(sb, "/aAb.BaAb");
TRACE: Tests.cpp(140): STEP 3: tfs_stat(sb, "");
TRACE: Tests.cpp(115): STEP 4: tfs_lsr(sb);
TRACE: Tests.cpp(103): STEP 5: tfs_rmdir(sb, "BB");
TRACE: Tests.cpp(146): STEP 6: tfs_cat(sb, "b");
TRACE: Tests.cpp(110): STEP 7: tfs_ls(sb);
TRACE: Tests.cpp(95): STEP 8: tfs_mkdir(sb, "A");
TRACE: Tests.cpp(146): STEP 9: tfs_cat(sb, "a./b");
TRACE: Tests.cpp(103): STEP 10: tfs_rmdir(sb, "AaBbBB.A.");
TRACE: Tests.cpp(130): STEP 11: tfs_write(sb, "BA/BB/", "yx");
TRACE: Tests.cpp(140): STEP 12: tfs_stat(sb, "bba");
TRACE: Tests.cpp(155): STEP 13: set_reset_countdown(4);
TRACE: Tests.cpp(140): STEP 14: tfs_stat(sb, "/A");
TRACE: Tests.cpp(121): STEP 15: tfs_create(sb, "bA");

This output shows the 16 steps taken to reach an assertion violation on line 252 of super.c, when we try to remount the file system after the reset. But are all of these steps necessary? Do we really need to cat the file "a./b" for this to happen? We’ll use DeepState’s reducer to find out.

> deepstate-reduce ./Tests failure/dbb393e55c77bac878ab06a02a022370e33761cb.fail failure/shrink.fail

You’ll see output like:

Original test has 8192 bytes
Applied 75 range conversions
Last byte read: 307
Shrinking to ignore unread bytes
Writing reduced test with 308 bytes to failure/shrink.fail
================================================================================
Iteration #1 0.18 secs / 2 execs / 0.0% reduction
Structured deletion reduced test to 304 bytes
Writing reduced test with 304 bytes to failure/shrink.fail
0.36 secs / 3 execs / 1.3% reduction
================================================================================
Structured deletion reduced test to 272 bytes
Writing reduced test with 272 bytes to failure/shrink.fail
0.5 secs / 4 execs / 11.69% reduction
================================================================================
Structured deletion reduced test to 228 bytes
Writing reduced test with 228 bytes to failure/shrink.fail
0.6 secs / 5 execs / 25.97% reduction
…
1-byte chunk removal: PASS FINISHED IN 0.24 SECONDS, RUN: 1.45 secs / 57 execs / 95.78% reduction
4-byte chunk removal: PASS FINISHED IN 0.08 SECONDS, RUN: 1.53 secs / 70 execs / 95.78% reduction
8-byte chunk removal: PASS FINISHED IN 0.08 SECONDS, RUN: 1.61 secs / 83 execs / 95.78% reduction
1-byte reduce and delete: PASS FINISHED IN 0.02 SECONDS, RUN: 1.62 secs / 86 execs / 95.78% reduction
4-byte reduce and delete: PASS FINISHED IN 0.01 SECONDS, RUN: 1.64 secs / 88 execs / 95.78% reduction
8-byte reduce and delete: PASS FINISHED IN 0.01 SECONDS, RUN: 1.64 secs / 89 execs / 95.78% reduction
Byte range removal: PASS FINISHED IN 0.31 SECONDS, RUN: 1.96 secs / 141 execs / 95.78% reduction
Structured swap: PASS FINISHED IN 0.01 SECONDS, RUN: 1.96 secs / 142 execs / 95.78% reduction
Byte reduce: PASS FINISHED IN 0.1 SECONDS, RUN: 2.06 secs / 159 execs / 95.78% reduction
================================================================================
Iteration #2 2.06 secs / 159 execs / 95.78% reduction
Structured deletion: PASS FINISHED IN 0.01 SECONDS, RUN: 2.08 secs / 161 execs / 95.78% reduction
Structured edge deletion: PASS FINISHED IN 0.01 SECONDS, RUN: 2.09 secs / 163 execs / 95.78% reduction
================================================================================
Completed 2 iterations: 2.09 secs / 163 execs / 95.78% reduction
Padding test with 23 zeroes
Writing reduced test with 36 bytes to failure/shrink.fail

After a few seconds, we can run the new, smaller test case:

> ./Tests --input_test_file failure/shrink.fail

TRACE: Tests.cpp(155): STEP 0: set_reset_countdown(4);
TRACE: Tests.cpp(121): STEP 1: tfs_create(sb, "aaaaa");
CRITICAL: /home/user/testfs/super.c(252): Assertion (testfs_inode_get_type(in) == I_FILE) || (testfs_inode_get_type(in) == I_DIR) failed in function int testfs_checkfs(struct super_block *, struct bitmap *, struct bitmap *, int)
ERROR: Failed: TestFs_FilesDirs
ERROR: Test case failure/shrink.fail failed

To “break” testfs under reset interruption, we need only cause a reset at the fourth write to the block device while the file "aaaaa" is being created. Using a debugger or logging statements to understand this behavior will obviously be a much more pleasant experience than with the original test case.

For this bug, we only needed to give deepstate-reduce:

  • the DeepState harness executable (./Tests),
  • the test case to reduce (dbb393e55c77bac878ab06a02a022370e33761cb.fail), and
  • the new, reduced test case to generate (failure/shrink.fail).

But sometimes we’ll need to provide more information. For example, a test case may “change bugs” during reduction if all we require is that the reduced test fails. The reducer can take an additional requirement, in the form of a string or regular expression that must appear in the output, or a required exit code.

Reducing with respect to code coverage

Reducers can also be used in the absence of failure. For example, we might want to take a complicated test case that covers a hard-to-reach line of code—perhaps generated by a fuzzer or symbolic execution tool—and make it easier to follow. A good reducer can do that too, by using a revised definition of what we consider an interesting test case.

First, we’ll generate some passing tests:

> mkdir coverage
> ./Tests --fuzz --output_test_dir coverage --timeout 10 --fuzz_save_passing --seed 1

This creates a file, coverage/659042175e31c125dfb1182404526b7c10d53ec8.pass, that produces a file system that mounts, but has inconsistent inode and block freemaps. We could just use --criterion to reduce the test case with respect to that interesting fact alone, but let’s produce a test case that retains all the code coverage of our test.

First, we compile a version of the harness and testfs with code coverage instrumentation enabled:

> clang -c -fprofile-instr-generate -fcoverage-mapping *.c
> clang++ -o TestsCov Tests.cpp -fprofile-instr-generate -fcoverage-mapping -ldeepstate *.o

Next we run our test with coverage instrumentation and collect the results:

> rm default.profraw
> ./TestsCov --input_test_file coverage/659042175e31c125dfb1182404526b7c10d53ec8.pass
> llvm-profdata-6.0 merge -o testscov.profdata default.profraw
> llvm-cov-6.0 show ./TestsCov -instr-profile=testscov.profdata *.c Tests.cpp  >& covout

Then we can use the reducer to produce a new test with the same coverage:

> deepstate-reduce python coverage/659042175e31c125dfb1182404526b7c10d53ec8.pass coverage/smallcov.pass --cmdArgs "checkCov.py @@" --exitCriterion 0

Now the reducer runs a Python script to determine how interesting the results are. It will take a lot longer to finish, since preserving coverage is usually harder than preserving a specific failure. Comparing the file system calls made in the two test cases (which, remember, have identical coverage), we can see what this additional overhead has brought us. The original test case does all this…

TRACE: Tests.cpp(95): STEP 0: tfs_mkdir(sb, "B");
TRACE: Tests.cpp(95): STEP 1: tfs_mkdir(sb, "B/Aa/aabA.");
TRACE: Tests.cpp(115): STEP 2: tfs_lsr(sb);
TRACE: Tests.cpp(121): STEP 3: tfs_create(sb, "AbB/BAb");
TRACE: Tests.cpp(155): STEP 4: set_reset_countdown(4);
TRACE: Tests.cpp(158): STEP 5: set_reset_countdown() ignored; already set
TRACE: Tests.cpp(95): STEP 6: tfs_mkdir(sb, ".BB");
TRACE: Tests.cpp(146): STEP 7: tfs_cat(sb, "b./aBa..");
TRACE: Tests.cpp(103): STEP 8: tfs_rmdir(sb, "");
TRACE: Tests.cpp(140): STEP 9: tfs_stat(sb, "bbbA/.abA");
TRACE: Tests.cpp(95): STEP 10: tfs_mkdir(sb, "..a");
TRACE: Tests.cpp(110): STEP 11: tfs_ls(sb);
TRACE: Tests.cpp(110): STEP 12: tfs_ls(sb);
TRACE: Tests.cpp(95): STEP 13: tfs_mkdir(sb, "A.");
TRACE: Tests.cpp(155): STEP 14: set_reset_countdown(5);
TRACE: Tests.cpp(158): STEP 15: set_reset_countdown() ignored; already set
TRACE: Tests.cpp(140): STEP 16: tfs_stat(sb, "B..Aaab./b");
TRACE: Tests.cpp(158): STEP 17: set_reset_countdown() ignored; already set
TRACE: Tests.cpp(140): STEP 18: tfs_stat(sb, "/AaA");
TRACE: Tests.cpp(146): STEP 19: tfs_cat(sb, ".");

…whereas the reduced test case makes API calls only when required to cover the code, and uses a far less bizarre set of paths:

TRACE: Tests.cpp(95): STEP 0: tfs_mkdir(sb, "B");
TRACE: Tests.cpp(95): STEP 1: tfs_mkdir(sb, "aa");
TRACE: Tests.cpp(95): STEP 2: tfs_mkdir(sb, "aaa");
TRACE: Tests.cpp(103): STEP 3: tfs_rmdir(sb, "");
TRACE: Tests.cpp(155): STEP 4: set_reset_countdown(3);
TRACE: Tests.cpp(95): STEP 5: tfs_mkdir(sb, "B/aa/a");
TRACE: Tests.cpp(110): STEP 6: tfs_ls(sb);
TRACE: Tests.cpp(115): STEP 7: tfs_lsr(sb);
TRACE: Tests.cpp(121): STEP 8: tfs_create(sb, "aaaa");
TRACE: Tests.cpp(140): STEP 9: tfs_stat(sb, "a");
TRACE: Tests.cpp(146): STEP 10: tfs_cat(sb, "a");
TRACE: Tests.cpp(155): STEP 11: set_reset_countdown(1);
TRACE: Tests.cpp(158): STEP 12: set_reset_countdown() ignored; already set
TRACE: Tests.cpp(146): STEP 13: tfs_cat(sb, "aaa");
TRACE: Tests.cpp(95): STEP 14: tfs_mkdir(sb, "");
TRACE: Tests.cpp(95): STEP 15: tfs_mkdir(sb, "");
TRACE: Tests.cpp(95): STEP 16: tfs_mkdir(sb, "");
TRACE: Tests.cpp(95): STEP 17: tfs_mkdir(sb, "");
TRACE: Tests.cpp(95): STEP 18: tfs_mkdir(sb, "");
TRACE: Tests.cpp(95): STEP 19: tfs_mkdir(sb, "");

Steps 15-19 can be ignored; they simply reflect that our file system tests always perform 20 steps, unless there is a failure before that point. It’s clear now that the multiple calls to cat, ls, and so forth in the original version were redundant. Taking this test, adding assertions about expected return values, and maintaining the resulting high-coverage unit test would be a much more reasonable task than dealing with the chaos produced by the fuzzer alone. It’s also obvious that steps 11 and 12 are not important to include in a unit test, because the reset never happens. You only really need to understand and write asserts for this trace:

TRACE: Tests.cpp(95): STEP 0: tfs_mkdir(sb, "B");
TRACE: Tests.cpp(95): STEP 1: tfs_mkdir(sb, "aa");
TRACE: Tests.cpp(95): STEP 2: tfs_mkdir(sb, "aaa");
TRACE: Tests.cpp(103): STEP 3: tfs_rmdir(sb, "");
TRACE: Tests.cpp(155): STEP 4: set_reset_countdown(3);
TRACE: Tests.cpp(95): STEP 5: tfs_mkdir(sb, "B/aa/a");
TRACE: Tests.cpp(110): STEP 6: tfs_ls(sb);
TRACE: Tests.cpp(115): STEP 7: tfs_lsr(sb);
TRACE: Tests.cpp(121): STEP 8: tfs_create(sb, "aaaa");
TRACE: Tests.cpp(140): STEP 9: tfs_stat(sb, "a");
TRACE: Tests.cpp(146): STEP 10: tfs_cat(sb, "a");
TRACE: Tests.cpp(146): STEP 13: tfs_cat(sb, "aaa");
TRACE: Tests.cpp(95): STEP 14: tfs_mkdir(sb, "");

Coverage isn’t the only property other than failure that we can preserve. Recent work on automatic resource-adaptation for software reduces actual programs, not just their test cases. Reduced programs do less, but use fewer resources and still pass all important tests. We are investigating producing versions of a program that avoid calling insecure APIs only required for optional functionality, thus reducing the attack surface of the system.

How does it do that?

To a user, test-case reduction can appear to be magic. You give a reducer a huge, extremely complex test case, and, after some time, it produces a much smaller test case that achieves the same goal. In many cases, the process doesn’t take very long, and the resulting test case is, as far as you can see, not even a simple chunk of the original test case. How does this happen?

At a high level, almost all approaches to test-case reduction use some variant of an extremely simple algorithm:

Given a test case, T_INIT, to reduce, and a function PRED(test) that checks if a test case is a valid reduction (e.g., if it fails in the same way as T_INIT or covers the same source code), we can reduce T_INIT as follows:

1. Set T_CURR = T_INIT.
2. Let T_NEXT = a new variation of T_CURR (one that has not yet been tried). If there are no new variations of T_CURR, stop and report T_CURR as the reduced test case.
3. Add T_NEXT to the set of variations that have been tried.
4. If PRED(T_NEXT), set T_CURR = T_NEXT.
5. Go to step 2.
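
As a concrete (and deliberately simplified) illustration, here is roughly what that loop looks like in Go; pred and variations stand in for PRED and the variation-generation strategy, and none of these names come from DeepState itself:

package reducer

// reduce is a sketch of the generic loop above. All of the intelligence lives
// in pred (the reduction criterion) and in variations (the strategy for
// proposing candidate tests); the loop itself only does bookkeeping.
func reduce(tInit []byte, pred func([]byte) bool, variations func([]byte) [][]byte) []byte {
	tCurr := tInit
	tried := map[string]bool{string(tInit): true}
	for {
		progressed := false
		for _, tNext := range variations(tCurr) {
			if tried[string(tNext)] {
				continue // this variation was already attempted
			}
			tried[string(tNext)] = true
			if pred(tNext) { // still fails the same way, or keeps the same coverage
				tCurr = tNext
				progressed = true
				break
			}
		}
		if !progressed {
			return tCurr // no untried variation of tCurr helps: report tCurr
		}
	}
}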

The devil is in the details, particularly those of step two. A completely naive approach to test-case reduction would consider all possible subsets or subsequences of T_CURR as the set of variations. This would be good and thorough except for the minor problem that it would take approximately forever to reduce even fairly small test cases.

There are also some subtle points about the design of the reduction criteria function, PRED, for systems that have more than one bug (which, in reality, includes most complex software). If the definition of “fails in the same way as T_INIT” is too restrictive, we’re unlikely to get much reduction. For example, forcing the exact same failure output might be a bad idea when an assertion includes some input value, or references a position in T_INIT or an address in memory.

On the other hand, if our criteria are too weak, we may run into slippage. Slippage is the annoying phenomenon where a test case for a complex, subtle new bug reduces to a test case for a simple, and likely already known, bug. Avoiding slippage using reduction criteria requires some intuition about bugs and the system under test. More sophisticated approaches to avoiding slippage require modifying the reduction algorithm, and are related to the idea that “test-case reduction is fuzzing,” a topic beyond the scope of this blog post. Here, we’re going to assume you have PRED, and it does what you want it to do.

A good first step: Modified binary search

Rather than considering all possible subsets of a test case as potential shorter versions, we can divide and conquer through a modified binary search, which is the heart of an approach called delta-debugging. First, we try to see if just the first half of our test case satisfies our reduction criterion. If not, we try the second half of the test case. If neither satisfies PRED, what do we do? Well, we can just increase the granularity, and try removing quarters of the original test case, then eighths, and so on, until we’re trying to remove whatever our “atomic” parts are, and we can’t do anything else.
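
A stripped-down sketch of that search, again in Go and purely illustrative rather than Zeller's actual algorithm, might look like this:

package reducer

// ddmin is a simplified sketch of delta-debugging-style reduction: try to
// delete ever-smaller chunks of the test, keeping any deletion that still
// satisfies the reduction criterion.
func ddmin(test []byte, pred func([]byte) bool) []byte {
	for chunk := len(test) / 2; chunk >= 1; {
		removedSomething := false
		for start := 0; start+chunk <= len(test); start += chunk {
			candidate := append(append([]byte{}, test[:start]...), test[start+chunk:]...)
			if pred(candidate) { // the smaller test still satisfies the criterion
				test = candidate
				removedSomething = true
				start -= chunk // re-examine the same offset in the shrunken test
			}
		}
		if !removedSomething {
			chunk /= 2 // nothing helped at this granularity, so refine it
		}
	}
	return test
}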

The DeepState reducer

DeepState’s reducer doesn’t use Zeller’s delta-debugging binary search strategy, which is often less effective than simpler greedy approaches. It instead produces variants of a test case by applying a series of passes similar to compiler passes:

  • Structured deletions: DeepState inputs are not always treated as uninterpreted byte buffers. DeepState knows when the bytes read within the context of a single OneOf call begin and end, and tries to delete such blocks accordingly. While the files themselves do not contain any hints of this structure, DeepState actually runs each test case T_CURR and dynamically extracts the structure. Deletions according to OneOf boundaries often correspond to removing a single function call, e.g., reducing foo(); bar(); foo(); to foo(); foo();. Using as much structure as possible to make likely meaningful reductions is a key optimization in reduction methods. In addition to DeepState structure, our reducer also knows common delimiters, such as whitespace, quotes, commas, and brackets, used in source code, XML, JSON, and so forth.
  • Structure edge deletions: This pass tries to remove the boundaries between structured parts of the test, e.g., merging two adjacent code blocks into one.
  • One-, four-, and eight-byte removals: This pass is the first that doesn’t use any structure from the test. It just tries deleting single bytes, then four-byte sequences, then eight-byte sequences (e.g., string components, 32-bit values, and 64-bit values).
  • One-, four-, and eight-byte reduce and deletes: This pass reduces the value of a single byte by one, then tries to delete a sequence of bytes after the reduced byte. This covers the common case where a value specifies the length of some field, or number of times to run a loop, and then the following bytes are the actual values.
  • Byte-range removals: This pass tries removing increasingly larger byte-ranges, skipping over the one, four, and eight special cases that were already tried. DeepState has a user-configurable upper limit (16 by default) on the size of the range. This is the brute-force core of the reducer.
  • Structure swaps: This is the first pass that cannot reduce test length. It tries to swap adjacent structures whenever this makes the test’s bytes more sorted.
  • Byte reductions: This pass takes each byte of a test case and tries to make it as close to zero as possible.
  • Byte pattern search: By default, DeepState doesn’t apply this expensive pass, but when the reducer is asked to aim for maximum reduction, it will search for all two-byte patterns that appear more than once, and try to remove part of the pattern in multiple places at once. The idea is that if two function calls need the same input bytes (e.g., a filename), this can replace both with a smaller matching input. The original byte patterns may be larger than two bytes, but repeated application of the pass can handle the largest patterns.

DeepState repeatedly interleaves these passes, using later passes only when the earlier passes in the list are no longer helping with reduction, until no reduction is possible or the time allowed for reduction has been exceeded. While DeepState’s reducer is best at reducing DeepState tests, the ability to run an arbitrary program to check for interestingness means you can also use it to reduce other kinds of test cases.
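
To make one of these passes concrete, here is an illustrative Go sketch (not DeepState's actual code) of how the one-byte "reduce and delete" pass might generate candidate tests; each candidate would then be checked against the reduction criterion as usual:

package reducer

// reduceAndDeleteCandidates sketches the one-byte "reduce and delete" pass:
// decrement a byte that may encode a length or count, then delete a chunk of
// the bytes that the smaller value no longer accounts for.
func reduceAndDeleteCandidates(test []byte, i int, chunk int) [][]byte {
	var out [][]byte
	if i < 0 || i >= len(test) || test[i] == 0 {
		return out // nothing to decrement here
	}
	for start := i + 1; start+chunk <= len(test); start += chunk {
		candidate := append([]byte{}, test...) // work on a fresh copy
		candidate[i]--                         // e.g., a field length, one smaller
		candidate = append(candidate[:start], candidate[start+chunk:]...)
		out = append(out, candidate)
	}
	return out
}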

Six test-case reducers you should know about

If you’re interested in test-case reduction, you should at least know about these reducers in addition to DeepState:

  • Delta-debugging in Python. I personally used this, Andreas Zeller’s original implementation, to reduce thousands of test cases for the file system on the Curiosity Mars Rover, and it made understanding and debugging even the nastiest bugs discovered by random-testing a flash file system fairly easy.
  • CReduce. This is widely used in compiler testing and was a major influence on the DeepState reducer. It can reduce more than just C, and is a recent entry in a long line of more structure-aware reducers, starting with the HDD (Hierarchical Delta-Debugging) algorithm. The CReduce paper and John Regehr’s blog posts are good resources to learn more about CReduce.
  • Hypothesis. Hypothesis is a popular Python test generation tool informed by David MacIver’s ideas about test reduction. A key insight of MacIver’s is that reducers should order tests in shortlex order, not only considering shorter tests better, but preferring equal-length tests that are “simpler.” DeepState also focuses on shortlex ordering. I haven’t yet tried David’s brand-new tool, ShrinkRay, but given what I know of his shrinker savvy, it’s likely to be very good.
  • QuickCheck. QuickCheck’s success in finding bugs in Haskell programs arguably started the modern renaissance in random testing, which, with DeepState, is beginning to merge with the renaissance in fuzzing. QuickCheck “shrunk” generated inputs to produce small counterexamples, and its successors continue or expand on this tradition.
  • Google’s halfempty. While it wasn’t the first test-case reducer to try to use parallelism, halfempty is the most focused on getting the most out of multiple cores when reducing huge inputs.
  • TSTL. This Python property-based testing system introduced the idea of normalization, which works to make test cases simpler as well as shorter. For instance, normalization will replace large integers with smaller values, and group API calls together to improve readability. Normalization is a formalization of one approach to MacIver’s emphasis on shortlex ordering.

As it turns out, normalization also helps with reduction. Test-case reducers can easily get stuck in local minima, where no simple changes can reduce test size. Normalization can “unstick” things: The simplification often removes obstacles to reduction. Once anything resembling normalization is on the table, including many passes in CReduce, Hypothesis, or DeepState, a “reduced” test case may not overlap with the original test case at all.

TSTL and SmartCheck also use normalization to assist in generalization, an approach that takes a failing test case and distinguishes the essential aspects of the test case (“this function call must have zero for this parameter”) from the accidental aspects (“any number will do here, zero is just a really simple one”) in order to make debugging and understanding even easier. A generalization tool for DeepState is on our to-do list.

While the above reducers are powerful and important, you may want to primarily use DeepState’s own reducer when reducing DeepState tests, because it’s the only one that uses dynamic feedback about structure from DeepState test execution. For extremely large inputs, it may be a good idea to apply halfempty or ShrinkRay first, to take advantage of multiple cores; then you can use DeepState to go the last mile.

Conclusion

Test-case reduction is a powerful tool to have in your testing utility belt, and is constantly operating behind the scenes in fuzzers like AFL and libFuzzer. Knowing all the things that test-case reduction can do for you can improve the effectiveness of your tests and make debugging a much more pleasant task. DeepState includes a state-of-the-art reduction tool, and we encourage you to play with it, using reduction criteria of your own invention.

We’re always developing new tools to make finding and fixing issues easier. Need help with your next project? Contact us!

Security assessment techniques for Go projects

The Trail of Bits Assurance practice has received an influx of Go projects, following the success of our Kubernetes assessment this summer. As a result, we’ve been adapting for Go projects some of the security assessment techniques and tactics we’ve used with other compiled languages.

We started by understanding the design of the language and identifying areas where developers may not fully understand the semantics of a language feature. Many of these misunderstood semantics surfaced in findings we reported to our clients and in independent research into the language itself. While not exhaustive, some of these problem areas include scoping, goroutines, error handling, and dependency management. Notably, many of these are not directly related to the runtime. The Go runtime itself is designed to be safe by default, preventing many C-like vulnerabilities.

With a better understanding of the root causes, we searched for existing tooling to help us quickly and effectively instrument client codebases. The result was a sample of static and dynamic open-source tools, including several that were Go-agnostic. To complement these tools, we also identified several compiler configurations that help with instrumentation.

Static analysis

Because Go is a compiled language, the compiler detects and prevents many potentially erroneous patterns before the binary executable is even produced. While this is a major annoyance for newer Go developers, these warnings are extremely important in preventing unexpected behavior and keeping code clean and readable.

Static analysis tends to catch a lot of very low hanging fruit not included in compiler errors and warnings. Within the Go ecosystem, there are many disparate tools such as go-vet, staticcheck, and those within the analysis package. These tools typically identify problems like variable shadowing, unsafe pointer use, and unused function return values. Investigating the areas of a project where these tools display warnings typically leads to exploitable functionality.

These tools are by no means perfect. For example, go-vet can miss very common accidents like the one below, where the A function’s err return value is unused, and immediately reassigned during the assignment of bSuccess on the left-hand side of the expression. The compiler will not provide a warning, and go-vet does not detect this; nor does errcheck. In fact, the tools that successfully identify this case (non-exhaustive) are the aforementioned staticcheck and ineffassign, which identify the err return value of A as unused or ineffectual.

package main

import "fmt"

func A() (bool, error) { return false, fmt.Errorf("I get overridden!") }

func B() (bool, error) { return true, nil }

func main() {
	aSuccess, err := A()
	bSuccess, err := B()
	if err != nil {
		fmt.Println(err)
	}
	fmt.Println(aSuccess, ":", bSuccess)
}

Figure 1: An example program showing an ineffectual assignment of err tricking go-vet and errcheck into considering err as checked.

$ go run .
false : true
$ errcheck .
$ go vet .
$ staticcheck .
main.go:5:50: error strings should not be capitalized (ST1005)
main.go:5:50: error strings should not end with punctuation or a newline (ST1005)
main.go:10:12: this value of err is never used (SA4006)
$ ineffassign .
<snip>/main.go:10:12: ineffectual assignment to err

Figure 2: The output of the example program, along with errcheck, go-vet, staticcheck, and ineffassign.

When you look deeper into this example, you may wonder why the compiler does not warn on this problem. The Go compiler will error when variables are not used within a program, but this example successfully compiles. This is caused by the semantics of the “short variable declaration.”

ShortVarDecl = IdentifierList ":=" ExpressionList .

Figure 3: The grammar specification of the “short variable declaration.”

According to the specification, the short variable declaration has the special ability to redeclare variables as long as:

  • The redeclaration is in a multi-variable short declaration.
  • The redeclared variable is declared earlier in the same block or function’s parameter list.
  • The redeclared variable is of the same type as the previous declaration.
  • At least one non-blank variable in the declaration is new.

All of these constraints hold in the previous example, preventing the compiler from producing errors for this problem.
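
To see the rules in action, here is a small illustrative program: the second short declaration is legal because b is new, while the commented-out line would violate the "at least one new variable" rule (the exact compiler message may vary by version):

package main

import "fmt"

func redeclare() {
	a, err := 1, fmt.Errorf("first")  // declares both a and err
	b, err := 2, fmt.Errorf("second") // legal: err is redeclared, b is new
	// err := fmt.Errorf("third")     // does not compile: "no new variables on left side of :="
	fmt.Println(a, b, err)
}

func main() { redeclare() }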

Many tools have edge cases like this where they are unsuccessful in identifying related issues, or identify an issue but describe it differently. Compounding the problem, these tools often require building the Go source code before analysis can be performed. This makes third-party security assessments complicated if the analysts cannot easily build the codebase or its dependencies.

Despite these pitfalls, when put together, the available tools can provide good hints as to where to look for problems within a given project, with just a little bit of effort. We recommend using gosec, go-vet, and staticcheck, at a minimum. These have the best documentation and ergonomics for most codebases. They also provide a wide variety of checks (such as ineffassign or errcheck) for common issues, without getting too specific. For more in-depth analysis of a particular type of issue, however, one might have to use the more specific analyzers, develop custom tooling directly against the SSA, or use Semmle.

Dynamic analysis

Once static analysis has been performed and the results have been reviewed, dynamic analysis techniques are typically the next step for deeper results. Due to Go’s memory safety, the problems normally found with dynamic analysis are those that result in a hard crash or an invalidation of program state. Various tools and approaches have been built to help identify these types of issues within the Go ecosystem. Additionally, it’s possible to retrofit existing language-agnostic tooling for the dynamic testing of Go software, which we show next.

Fuzzing

The best-known dynamic testing tool in the Go space is likely Dmitry Vyukov’s dvyukov/go-fuzz. This tool allows you to quickly and effectively implement mutational fuzzing. It even has an extensive wall of trophies. More advanced users may also find the distributed fuzzing and libFuzzer support useful when hunting for bugs.

Google also produced a more primitive fuzzer with a confusingly similar name, google/gofuzz, that assists users by initializing structures with random values. Unlike Dmitry’s go-fuzz, Google’s gofuzz does not generate a harness or assist with storing crash output, fuzzed input, or any other type of information. While this can be a downside for testing some targets, it makes for a lightweight and extensible framework.

For the sake of brevity, we refer you to examples of both tools in their respective READMEs.

Property testing

Diverging from more traditional fuzzing approaches, Go’s testing package (typically used for unit and integration testing) provides the testing/quick sub-package for “black box testing” of Go functions. In other words, it is a basic primitive for property testing. Given a function and a generator, the package can be used to build a harness to test for potential property violations across the range of the input generator. The following example is pulled directly from the documentation.

func TestOddMultipleOfThree(t *testing.T) {
	f := func(x int) bool {
		y := OddMultipleOfThree(x)
		return y%2 == 1 && y%3 == 0
	}
	if err := quick.Check(f, nil); err != nil {
		t.Error(err)
	}
}

Figure 4: The OddMultipleOfThree function is being tested, where its return value should always be an odd multiple of three. If it’s not, the f function will return false and the property will be violated. This is detected by the quick.Check function.

While the functionality provided by this package is acceptable for simple applications of property testing, important properties do not often fit well into such a basic interface. To address these shortcomings, the leanovate/gopter framework was born. Gopter provides a wide variety of generators for the common Go types, and has helpers to assist you in creating your own generators compatible with Gopter. Stateful tests are also supported through the gopter/commands sub-package, which is useful for testing that properties hold across sequences of actions. Better still, when a property is violated, Gopter shrinks the generated inputs. See a brief example of property tests with input shrinking in the output below.

package main_test
import (
  "github.com/leanovate/gopter"
  "github.com/leanovate/gopter/gen"
  "github.com/leanovate/gopter/prop"
  "math"
  "testing"
)

type Compute struct {
  A uint32
  B uint32
}

func (c *Compute) CoerceInt () { c.A = c.A % 10; c.B = c.B % 10; }
func (c Compute) Add () uint32 { return c.A + c.B }
func (c Compute) Subtract () uint32 { return c.A - c.B }
func (c Compute) Divide () uint32 { return c.A / c.B }
func (c Compute) Multiply () uint32 { return c.A * c.B }

func TestCompute(t *testing.T) {
  parameters := gopter.DefaultTestParameters()
  parameters.Rng.Seed(1234) // Just for this example to generate reproducible results


  properties := gopter.NewProperties(parameters)

  properties.Property("Add should never fail.", prop.ForAll(
    func(a uint32, b uint32) bool {
      inpCompute := Compute{A: a, B: b}
      inpCompute.CoerceInt()
      inpCompute.Add()
      return true
    },
    gen.UInt32Range(0, math.MaxUint32),
    gen.UInt32Range(0, math.MaxUint32),
  ))

  properties.Property("Subtract should never fail.", prop.ForAll(
    func(a uint32, b uint32) bool {
      inpCompute := Compute{A: a, B: b}
      inpCompute.CoerceInt()
      inpCompute.Subtract()
      return true
    },
    gen.UInt32Range(0, math.MaxUint32),
    gen.UInt32Range(0, math.MaxUint32),
  ))
  
  properties.Property("Multiply should never fail.", prop.ForAll(
    func(a uint32, b uint32) bool {
      inpCompute := Compute{A: a, B: b}
      inpCompute.CoerceInt()
      inpCompute.Multiply()
      return true
    },
    gen.UInt32Range(0, math.MaxUint32),
    gen.UInt32Range(0, math.MaxUint32),
  ))

  properties.Property("Divide should never fail.", prop.ForAll(
    func(a uint32, b uint32) bool {
      inpCompute := Compute{A: a, B: b}
      inpCompute.CoerceInt()
      inpCompute.Divide()
      return true
    },
    gen.UInt32Range(0, math.MaxUint32),
    gen.UInt32Range(0, math.MaxUint32),
  ))

  properties.TestingRun(t)
}

Figure 5: The testing harness for the Compute structure.

user@host:~/Desktop/gopter_math$ go test
+ Add should never fail.: OK, passed 100 tests.
Elapsed time: 253.291µs
+ Subtract should never fail.: OK, passed 100 tests.
Elapsed time: 203.55µs
+ Multiply should never fail.: OK, passed 100 tests.
Elapsed time: 203.464µs
! Divide should never fail.: Error on property evaluation after 1 passed
   tests: Check paniced: runtime error: integer divide by zero
goroutine 5 [running]:
runtime/debug.Stack(0x5583a0, 0xc0000ccd80, 0xc00009d580)
	/usr/lib/go-1.12/src/runtime/debug/stack.go:24 +0x9d
github.com/leanovate/gopter/prop.checkConditionFunc.func2.1(0xc00009d9c0)
	/home/user/go/src/github.com/leanovate/gopter/prop/check_condition_func.g
  o:43 +0xeb
panic(0x554480, 0x6aa440)
	/usr/lib/go-1.12/src/runtime/panic.go:522 +0x1b5
_/home/user/Desktop/gopter_math_test.Compute.Divide(...)
	/home/user/Desktop/gopter_math/main_test.go:18
_/home/user/Desktop/gopter_math_test.TestCompute.func4(0x0, 0x0)
	/home/user/Desktop/gopter_math/main_test.go:63 +0x3d
# <snip for brevity>

ARG_0: 0
ARG_0_ORIGINAL (1 shrinks): 117380812
ARG_1: 0
ARG_1_ORIGINAL (1 shrinks): 3287875120
Elapsed time: 183.113µs
--- FAIL: TestCompute (0.00s)
    properties.go:57: failed with initial seed: 1568637945819043624
FAIL
exit status 1
FAIL	_/home/user/Desktop/gopter_math	0.004s

Figure 6: Executing the test harness and observing the output of the property tests, where Divide fails.

Fault injection

Fault injection has been surprisingly effective when attacking Go systems. The most common mistakes we found using this method involve the handling of the error type. Since error is only a type in Go, when it is returned it does not change a program’s execution flow on its own the way a panic statement would. We identify such bugs by forcing errors at the lowest level: the kernel. Because Go produces static binaries, faults must be injected without LD_PRELOAD. One of our tools, KRF, allows us to do exactly this.

During our recent assessment of the Kubernetes codebase, the use of KRF provided a finding deep inside a vendored dependency, simply by randomly faulting the read and write system calls made by a process and its children. This technique was effective against the Kubelet, which commonly interfaces with the underlying system. The bug was triggered when the ionice command was faulted, producing no output to STDOUT and sending an error to STDERR. After the error was logged, execution continued instead of returning the error from STDERR to the caller. This results in the empty STDOUT output later being indexed, causing an index out of range runtime panic.

E0320 19:31:54.493854    6450 fs.go:591] Failed to read from stdout for cmd [ionice -c3 nice -n 19 du -s /var/lib/docker/overlay2/bbfc9596c0b12fb31c70db5ffdb78f47af303247bea7b93eee2cbf9062e307d8/diff] - read |0: bad file descriptor
panic: runtime error: index out of range

goroutine 289 [running]:
k8s.io/kubernetes/vendor/github.com/google/cadvisor/fs.GetDirDiskUsage(0xc001192c60, 0x5e, 0x1bf08eb000, 0x1, 0x0, 0xc0011a7188)
    /workspace/anago-v1.13.4-beta.0.55+c27b913fddd1a6/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/github.com/google/cadvisor/fs/fs.go:600 +0xa86
k8s.io/kubernetes/vendor/github.com/google/cadvisor/fs.(*RealFsInfo).GetDirDiskUsage(0xc000bdbb60, 0xc001192c60, 0x5e, 0x1bf08eb000, 0x0, 0x0, 0x0)
    /workspace/anago-v1.13.4-beta.0.55+c27b913fddd1a6/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/github.com/google/cadvisor/fs/fs.go:565 +0x89
k8s.io/kubernetes/vendor/github.com/google/cadvisor/container/common.(*realFsHandler).update(0xc000ee7560, 0x0, 0x0)
    /workspace/anago-v1.13.4-beta.0.55+c27b913fddd1a6/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/github.com/google/cadvisor/container/common/fsHandler.go:82 +0x36a
k8s.io/kubernetes/vendor/github.com/google/cadvisor/container/common.(*realFsHandler).trackUsage(0xc000ee7560)
    /workspace/anago-v1.13.4-beta.0.55+c27b913fddd1a6/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/github.com/google/cadvisor/container/common/fsHandler.go:120 +0x13b
created by
k8s.io/kubernetes/vendor/github.com/google/cadvisor/container/common.(*realFsHandler).Start
    /workspace/anago-v1.13.4-beta.0.55+c27b913fddd1a6/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/github.com/google/cadvisor/container/common/fsHandler.go:142 +0x3f

Figure 7: The shortened callstack of the resulting Kubelet panic.

stdoutb, souterr := ioutil.ReadAll(stdoutp)
if souterr != nil {
	klog.Errorf("Failed to read from stdout for cmd %v - %v", cmd.Args, souterr)
}

Figure 8: The logging of STDERR without returning the error to the caller.

usageInKb, err := strconv.ParseUint(strings.Fields(stdout)[0], 10, 64)

Figure 9: The attempted indexing of STDOUT, even though it is empty. This is the cause of the runtime panic.

For a more complete walkthrough containing reproduction steps, our Kubernetes Final Report details the use of KRF against the Kubelet in Appendix G (pg. 109).

Go’s compiler also allows instrumentation to be included in a binary, which permits detection of race conditions at runtime. This is extremely useful for identifying potentially exploitable races as an attacker, but it can also be leveraged to identify incorrect handling of defer, panic, and recover. We built trailofbits/on-edge to do exactly this: Identify global state changes between a function entrypoint and the point at which a function panics, and exfiltrate this information through the Go race detector. More in-depth use of OnEdge can be found in our previous blog post, “Panicking the Right Way in Go.”

In practice, we recommend using:

  • dvyukov/go-fuzz for mutational fuzzing of anything that parses or transforms input,
  • gopter (or testing/quick for simple properties) for property testing,
  • KRF for fault injection at the system-call level, and
  • trailofbits/on-edge for catching incorrect defer, panic, and recover handling.

All of these tools, with the exception of KRF, require a bit of effort to use in practice.

Using the compiler to our advantage

The Go compiler has many built-in features and directives that aid in finding bugs. These features are scattered across various compiler switches and require a bit of configuration for our purposes.

Subverting the type system

Sometimes when attempting to test the functionality of a system, the exported functions aren’t what we want to test. Getting testable access to the desired functions may require renaming a lot of them so they can be exported, which can be burdensome. To help address this problem, the compiler’s build directives can be used to perform name linking, bypassing the access controls provided by the export system. As an example of this functionality, the program below (graciously extracted from a Stack Overflow answer) accesses the unexported reflect.typelinks function and subsequently iterates the type link table to identify types present in the compiled program.

package main

import (
	"fmt"
	"reflect"
	"unsafe"
)

func Typelinks() (sections []unsafe.Pointer, offset [][]int32) {
	return typelinks()
}

//go:linkname typelinks reflect.typelinks
func typelinks() (sections []unsafe.Pointer, offset [][]int32)

func Add(p unsafe.Pointer, x uintptr, whySafe string) unsafe.Pointer {
	return add(p, x, whySafe)
}

//go:linkname add reflect.add
func add(p unsafe.Pointer, x uintptr, whySafe string) unsafe.Pointer

func main() {
	sections, offsets := Typelinks()
	for i, base := range sections {
		for _, offset := range offsets[i] {
			typeAddr := Add(base, uintptr(offset), "")
			typ := reflect.TypeOf(*(*interface{})(unsafe.Pointer(&typeAddr)))
			fmt.Println(typ)
		}
	}
}

Figure 10: A generalized version of the Stack Overflow answer, using the linkname build directive.

$ go run main.go 
**reflect.rtype
**runtime._defer
**runtime._type
**runtime.funcval
**runtime.g
**runtime.hchan
**runtime.heapArena
**runtime.itab
**runtime.mcache
**runtime.moduledata
**runtime.mspan
**runtime.notInHeap
**runtime.p
**runtime.special
**runtime.sudog
**runtime.treapNode
**sync.entry
**sync.poolChainElt
**syscall.Dirent
**uint8
...

Figure 11: The output of the typelinks table.

In situations where you need even more granular control at runtime (i.e., more than just the linkname directive), you can write in Go’s intermediate assembly and include it during compilation. While it may be incomplete and slightly out of date in some places, the teh-cmc/go-internals repository provides a great introduction to how Go assembles functions.

Compiler-generated coverage maps

To help with testing, the Go compiler can perform preprocessing to generate coverage information. This is intended for identifying unit and integration testing coverage information, but we can also use it to identify coverage generated by our fuzzing and property testing. Filippo Valsorda provides a simple example of this in a blog post.
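
For instance, the standard coverage tooling can record which statements a fuzzing or property-testing harness actually exercises, and render the results as an annotated HTML report:

$ go test -coverprofile=cover.out ./...
$ go tool cover -html=cover.out -o cover.html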

Type-width safety

Go has support for automatically determining the size of integers and floating point numbers based on the target platform. However, it also allows for fixed-width definitions, such as int32 and int64. When mixing both automatic and fixed-width sizes, there are opportunities for incorrect assumptions about behavior across multiple target platforms.

Testing against both 32-bit and 64-bit platform builds of a target will help identify platform-specific problems. These problems tend to be found in areas performing validation, decoding, or type conversion, where improper assumptions about source and destination type properties are made. Examples of this were identified in the Kubernetes security assessment, specifically TOB-K8S-015: Overflows when using strconv.Atoi and downcasting the result (pg. 42 in the Kubernetes Final Report), with an example inlined below.

// updatePodContainers updates PodSpec.Containers.Ports with passed parameters.
func updatePodPorts(params map[string]string, podSpec *v1.PodSpec) (err error) {
    port := -1
    hostPort := -1
    if len(params["port"]) > 0 {
        port, err = strconv.Atoi(params["port"]) // <-- this should parse port as strconv.ParseUint(params["port"], 10, 16)
        if err != nil {
            return err
        }
    }
    // (...)
    // Don't include the port if it was not specified.
    if len(params["port"]) > 0 {
        podSpec.Containers[0].Ports = []v1.ContainerPort{
            {
                ContainerPort: int32(port), // <-- this should later just be uint16(port)
            },
        }
...

Figure 12: An example of downcasting to a fixed-width integer from an automatic-width integer (returned by Atoi).

root@k8s-1:/home/vagrant# kubectl expose deployment nginx-deployment --port 4294967377 --target-port 4294967376
E0402 09:25:31.888983    3625 intstr.go:61] value: 4294967376 overflows int32
goroutine 1 [running]:
runtime/debug.Stack(0xc000e54eb8, 0xc4f1e9b8, 0xa3ce32e2a3d43b34)
	/usr/local/go/src/runtime/debug/stack.go:24 +0xa7
k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/intstr.FromInt(0x100000050, 0xa, 0x100000050, 0x0, 0x0)
...
service/nginx-deployment exposed

Figure 13: The resulting overflow from incorrect type-width assumptions.

In practice, the type system subversion is rarely necessary. Most interesting targets for testing are already exported, available through traditional imports. We recommend using this only when helpers and similar unexported functions are required for testing. As for testing type-width safety, we recommend compiling against all targets when possible, even if it is not directly supported, since problems may be more apparent on different targets. Finally, we recommend generating coverage reports on projects with unit and integration tests, at a minimum. It helps identify areas that are not directly tested, which can be prioritized for review.

A note about dependencies

In languages such as JavaScript and Rust, dependency managers have built-in support for dependency auditing—scanning project dependencies for versions known to have vulnerabilities. In Go, no such tool exists, at least not in a publicly available and non-experimental state.

This lack likely stems from the fact that there are many different methods of dependency management: go-mod, go-get, vendored, etc. These various methods use radically different approaches, resulting in no straightforward way to universally identify dependencies and their versions. Furthermore, developers sometimes modify the source code of their vendored dependencies, which complicates identification even more.

Dependency management has matured over the years of Go’s development, and most developers are moving toward go mod. This allows dependencies to be tracked and versioned within a project through the go.mod file, opening the door for future dependency-scanning efforts. An example of such an effort can be seen in the OWASP DependencyCheck tool, which has an experimental go mod plugin.

Conclusion

Ultimately, there are quite a few tools available for use within the Go ecosystem. Although mostly disparate, the various static analysis tools help identify “low hanging fruit” within a given project. When looking for deeper concerns, fuzzing, property testing, and fault injection tools are readily available. Compiler configuration subsequently augments the dynamic techniques, making it easier to build harnesses and evaluate their effectiveness.

Interested in seeing these techniques shake out bugs in your Go systems? Trail of Bits can make that happen. Do you want custom analysis built specifically for your organization? We do that too. Contact us!

Two New Tools that Tame the Treachery of Files

Parsing is hard, even when a file format is well specified. But when the specification is ambiguous, it leads to unintended and strange parser and interpreter behaviors that make file formats susceptible to security vulnerabilities. What if we could automatically generate a “safe” subset of any file format, along with an associated, verified parser? That’s our collective goal in Dr. Sergey Bratus’s DARPA SafeDocs program.

But wait—why is parsing hard in the first place? Design decisions like embedded scripting languages, complex context-sensitive grammars, and object models that allow arbitrary dependencies between objects may have looked like good ways to enrich a format, but they increase the attack surface of a parser, leading to forgotten or bypassed security checks, denial of service, privacy leakage, information hiding, and even hidden malicious payloads.

Two examples of this problem are polyglots and schizophrenic files. Polyglots are files that can be validly interpreted as two different formats. Have you ever read a PDF file and then been astonished to discover that it is also a valid ZIP file? Or edited an HTML file only to discover that it is also a Ruby script? Congratulations, you discovered a polyglot. This is not to be confused with schizophrenic files: That’s when two parsers interpret the same file in different ways, e.g., your PDF displays different content depending on whether you opened it in Adobe Acrobat or Foxit Reader, or your HTML page renders differently between Chrome and Internet Explorer.

We’ve developed two new tools that take the pain out of parsing and make file formats safer:

  1. PolyFile: A polyglot-aware file identification utility with manually instrumented parsers that can semantically label the bytes of a file hierarchically; and
  2. PolyTracker: An automated instrumentation framework that efficiently tracks input file taint through the execution of a program.

Collectively, the tools enable Automated Lexical Annotation and Navigation of Parsers, a backronym devised solely for the purpose of referring to them as The ALAN Parsers Project.

Before we get into their details, let’s first talk about why these tools are necessary.

Ceci N’est Pas Un PDF

Please rise and open your hymnals to page 541 for the recitation of chapter 7 verse 6:

…a file has no intrinsic meaning. The meaning of a file—its type, its validity, its contents—can be different for each parser or interpreter.

Ange Albertini

You may be seated.

This talk by Trail of Bits researcher Evan Sultanik gives a number of examples of how polyglots and induced schizophrenia are more than just nifty parlor tricks. For example:

A PDF can even be a valid git repository that, when cloned, contains the LaTeX source code to generate the PDF and a copy of itself. Ange Albertini also has an excellent series of videos introducing funky file tricks.

What does it take to understand a popular file format that has been accreting features (and misfeatures) over 20 years? PDF provides just such a challenge.

  • An embedded Turing complete programming language? 👌
  • Arbitrary chaining of stream decoders that allow for both memory and computational denial of service? 👌
  • Multiple, redundant, and potentially conflicting ways to specify the length of a stream object? 👌
  • Arbitrary data allowed both before and after the file? 👌
  • Numerous ways to steganographically embed data, including arbitrary length binary blobs? 👌
  • A graph-based document object model that allows, and in some cases requires, cycles? 👌
  • A multi-decade history with ambiguous or incomplete specifications resulting in dozens of conflicting implementations, some of which emit non-compliant, malformed documents? 👌
  • The necessity for parser implementations to be resilient to malformations but also free to handle them differently? 👌
  • A specification that has gotten in the way of creating a formally verified parser in Coq, because Coq could not prove that a parser trying to do its best at checking indirect references would, in fact, terminate on maliciously crafted files? 👌

Challenge accepted! To be fair, PDF is leading the way on defining simpler, reduced subsets of the format. PDF/A, designed to make sure PDF documents remain parseable for long-term preservation, has removed some of these problematic features. Moreover, they are by no means specific to PDF: they are endemic to document formats in general. For example, Microsoft Office’s OOXML has not done better, with severe external entity attacks that have been employed in the wild, not to mention XML-based vulnerabilities like the billion laughs attack. Even parsing JSON is harder than one might think, as is plain old UTF-8.

But surely in the land of Programming Languages, at least, all must be well, since their parsers are automatically generated from unambiguous specifications by classic algorithms and proven tools. Not so much: This Empire Hacking talk gives examples of how a poorly designed language can cause parsing problems even when no malice is involved. One does not simply walk into the shunting yard!

But back to data formats. In view of the challenges above, instead of focusing on the specification, we examine the de facto interpretations of the specification: Parser implementations. Our underlying hypothesis is that the “unsafe” portions of a file format will exist in the symmetric difference of the parsers’ accepted grammars. The portions of the file format to keep are the ones accepted and interpreted equivalently across all implementations.

PolyFile: Ground Truth Labeling of File Semantics


File identification utilities are, by and large, dumb in the sense that they simply compare the file against magic byte signatures of various formats. Moreover, these tools terminate once they find the first match, and do not recursively identify embedded file types or files that do not start at byte offset zero. Once a file is classified, there is typically little to no information about the contents of the file. It’s a PDF, but how many objects does it contain? It’s a ZIP, but what are its file contents?

Our new PolyFile project resolves these issues and provides:

  • Identification of any and all files embedded within the input, not necessarily starting at byte offset zero;
  • Full parsing of file formats for which an instrumented parser is available, emitting a hierarchical semantic mapping of the input’s contents;
  • An interactive file explorer for a human to examine its contents and structure; and
  • Computer-readable output that can be used to assign semantic meaning to each byte of the input file (e.g., byte x corresponds to the first byte in a PDF stream object, and the start of a JPEG/JFIF header).

A fairly ideal file identification utility, n’est-ce pas?

Ange Albertini’s SBuD project comes close in spirit, but currently only supports a couple image formats. Even the popular Unix file command only has support for several hundred file signatures. In contrast, PolyFile has support for over 10,000 file formats, and can recursively identify them in a file, emitting a hierarchical mapping as an extension of the SBuD JSON format. It also has support for semantically labeling files based on Kaitai Struct declarative file format specifications.

Additionally, PolyFile can optionally emit a self-contained HTML file with an interactive hex viewer and semantic labeling explorer. Here is an example of the HTML output from the résumé Evan Sultanik submitted to Trail of Bits. In addition to being a PDF that displays its own MD5 hash, it is a valid Nintendo Entertainment System ROM that, when emulated, is a playable game that displays the MD5 hash of the PDF. It is also a valid ZIP file containing, among other things, a PDF that is a git repository containing its LaTeX source code and a copy of itself.

PolyFile is free and open-source. You can download a copy at: https://github.com/trailofbits/polyfile.

Parser Instrumentation

Now that we have PolyFile to provide ground truth, we need a way to propagate the semantic labels through a parser; the programmatic equivalent of using contrast dye to track blood flow in the brain during a CT scan. We therefore need an automated way to instrument a parser to track those labels, with the goal of associating functions with the byte offsets of the input files on which they operate. Since PolyFile can tell us the semantic meaning behind those offsets, this will let us infer the purpose of the parser’s functions. For example, if a function in a PDF parser always operates on bytes associated with JFIF stream objects, we can assume it is responsible for processing embedded JPEGs.

There are several existing projects to do this sort of taint tracking, using various methods. The best maintained and easiest to use are AUTOGRAM and TaintGrind. However, the former is limited to analysis on the JVM, and the latter Valgrind plugin suffers from unacceptable runtime overhead when tracking as few as several bytes at a time. For example, we ran mutool, a utility in the muPDF project, using TaintGrind over a corpus of medium sized PDFs, and in every case the tool had to be halted after over 24 hours of execution for operations that would normally complete in milliseconds without instrumentation.

At first glance, our goals might seem to be satisfied by AFL-analyze, a tool bundled with the AFL fuzzer. In a sense, our goal is in fact to create its counterpart. AFL-analyze uses fuzzing to reverse engineer a file format from a parser. In our case, we have ground truth about the file format and want to reverse-engineer the parser.

Although intended for fuzzing, Angora’s taint analysis engine has many of the features necessary to track byte indexes during execution. In fact, as is described in the following sections, we build on many of the algorithmic advances of Angora while improving both computational and memory efficiency. Angora is built atop the LLVM Data Flow Sanitizer (dfsan), which we also leverage for PolyTracker. The following section describes dfsan’s operation, limitations, and how we improved upon both dfsan and Angora.

LLVM and the Data Flow Sanitizer

We chose a static instrumentation approach built on LLVM, since this allows us to instrument any parser capable of being compiled with LLVM and eventually instrument closed-source parsers (e.g., by lifting their binaries to LLVM using Remill or McSema).

LLVM has an instrumentation tool for propagating taint called the Data Flow Sanitizer (dfsan), which is also used by Angora. However, dfsan imposes severe limitations on the total number of taints tracked during program execution, which means that, in practice, we could only track a handful of input bytes from a file at once. To see why, consider a parser that performs the following:

fd = fopen("foo.pdf", "rb");
a = fgetc(fd);
b = fgetc(fd);
c = a + b;

In this case, dfsan will taint the variable a by byte offset 0 and the variable b by byte offset 1. The variable c will be tainted by both byte 0 and byte 1. The combinatorial challenge here is that there are 2^n possible taints, where n is the number of bytes in the input file. Therefore, the naïve approach of storing taints using a bitset will be infeasible, even for small inputs, and even when using a compressed bitset.

The representational problem is addressed in dfsan by storing taint provenance in a data structure it calls the “union table,” which is a computationally efficient way to store a binary forest of taint unions. Each taint gets a unique 16-bit label. Then, in the example above, where the taint of a is unioned with b to create c, dfsan would record union_table[a’s label][b’s label] = c’s label.

Thereafter, if a and b are ever unioned again, c’s taint label can be reused. This allows constant-time union table checks; however, the table itself requires O(n²) storage. This is very wasteful, since the table will almost always be very sparse. It’s also what necessitates the 16-bit taint labels, since using larger labels would exponentially increase the size of the union table. This means that dfsan can only track, at most, 65,536 taints throughout execution, including all new taints that are created from unions. This is insufficient to track more than a handful of input bytes at a time.
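
As a rough mental model (a toy sketch, not dfsan's implementation, and using a sparse Go map where dfsan uses its dense table), the bookkeeping looks something like this:

package taint

// label mirrors dfsan's 16-bit taint labels: counting the "no taint" label 0,
// only 65,536 distinct labels can ever exist, unions included.
type label uint16

// unioner is a toy stand-in for dfsan's union table.
type unioner struct {
	next  label
	table map[[2]label]label
}

func newUnioner() *unioner {
	return &unioner{table: make(map[[2]label]label)}
}

// union returns the label meaning "tainted by both a and b", reusing the
// label created the last time this same pair was unioned.
func (u *unioner) union(a, b label) label {
	if a == b {
		return a
	}
	if a > b {
		a, b = b, a // canonicalize the pair
	}
	if l, ok := u.table[[2]label{a, b}]; ok {
		return l
	}
	u.next++ // silently wraps around once the 16-bit label space is exhausted
	u.table[[2]label{a, b}] = u.next
	return u.next
}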

Introducing PolyTracker: Efficient Binary Instrumentation for Universal Taint Tracking


Our novel taint tracking data structures and algorithms—as well as numerous heuristics for reducing computational overhead—are manifested in our new PolyTracker tool. It is a wrapper around clang and clang++ that allows you to instrument any executable. Simply replace your compiler with PolyTracker during your normal build process. The resulting executable will create a JSON file containing a mapping of functions to the input file byte offsets on which each function operates. Moreover, since PolyTracker is built as an LLVM pass, it can be used on any black-box binary that has been lifted to LLVM/IR, even when source code is unavailable.

We maintained dfsan’s concepts of shadow memory and its instrumentation framework for tracking taints. However, we switched away from the union table and implemented a scalable data structure capable of exploiting the inherent sparsity in taint unions. This is augmented by a binary forest of taint unions, supplanting dfsan’s label array and allowing us to increase the size of the taint labels past 16-bits. PolyTracker’s binary forest data structure uses a memory layout algorithm that eliminates the need for an Angora-style taint label to bitvector lookup table, while also providing constant time insertion. This reduces the memory requirements from exponential to linear in the size of the input file plus the number of instructions executed by the parser, at the expense of an O(n log n) graph traversal in post-processing to resolve the taints. In practice, this results in negligible execution overhead for the majority of PDFs.

PolyTracker is free and open-source. You can download a copy today at: https://github.com/trailofbits/polytracker.

It’s Easy to Get Started

PolyFile can be installed with this one quick command:

pip3 install polyfile
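
Once installed, a typical invocation looks something like the following; the --html flag reflects the HTML explorer feature described above, and the exact interface is best confirmed with polyfile --help:

polyfile input.pdf --html input.html > input.json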

PolyTracker requires a working build of LLVM. However, we have made this easy by providing a Docker container on DockerHub that already has everything built. Simply download a copy of the container to start instrumenting your parsers!

docker pull trailofbits/polytracker:latest
docker run -it --rm trailofbits/polytracker:latest

We have lots of features in active development, including intelligent file mutation for fuzzing and differential testing, temporal analysis of taint propagation, and automated identification of error handling routines.

We’re also excited to hear what other clever uses the community devises for the tools. Are you using PolyTracker to discover input bytes that are ignored by the parser? Do you use special taint labels to track the results of functions like strlen that are likely to correspond to field lengths? Let us know on the Empire Hacking Slack! Have an idea that you’d like to see turned into a new feature? Feel free to add a GitHub issue!

Acknowledgements

This research was developed with funding from the Defense Advanced Research Projects Agency (DARPA). The views, opinions, and/or findings expressed are those of the authors and should not be interpreted as representing the official views or policies of the Department of Defense or the U.S. Government.

Destroying x86_64 instruction decoders with differential fuzzing

TL;DR: x86_64 decoding is hard, and the number and variety of implementations available for it makes it uniquely suited to differential fuzzing. We’re open sourcing mishegos, a differential fuzzer for instruction decoders. You can use it to discover discrepancies in your own decoders and analysis tools!

Figure 1: Some of Mishegos’s output, visualized.

In the beginning, there was instruction decoding

Decompilation and reverse engineering tools are massive, complicated beasts that deal with some of the hardest problems in binary analysis: variable type and layout recovery, control flow graph inference, and sound lifting to higher-order representations for both manual and automated inspection.

At the heart of each of these tasks is accurate instruction decoding. Automated tools require faithful extraction of instruction semantics to automate their analyses, and reverse engineers expect accurate disassembly listings (or well-defined failure modes) when attempting manual comprehension.

Instruction decoding is implicitly treated as a solved problem. Analysis platforms give analysts a false sense of confidence by encouraging them to treat disassembled output as ground truth, without regarding potential errors in the decoder or adversarial instruction sequences in the input.

Mishegos challenges this assumption.

(x86_64) Instruction decoding is hard

Like, really hard:

  • Unlike RISC ISAs such as ARM and MIPS, x86_64 has variable-length instructions, meaning that decoder implementations must incrementally parse the input to know how many bytes to fetch. An instruction can be anywhere between 1 byte (e.g., 0x90, nop) and 15 bytes long. Longer instructions may be semantically valid (i.e., they may describe valid combinations of prefixes, operations, and literals), but actual silicon implementations will only fetch and decode 15 bytes at most (see the Intel x64 Developer’s Manual, §2.3.11).
  • x86_64 is the 64-bit extension of a 32-bit extension of a 40-year-old 16-bit ISA designed to be source-compatible with a 50-year-old 8-bit ISA. In short, it’s a mess, with each generation adding and removing functionality, reusing or overloading instructions and instruction prefixes, and introducing increasingly complicated switching mechanisms between supported modes and privilege boundaries.
  • Many instruction sequences have overloaded interpretations or plausible disassemblies, depending on the active processor’s state or compatibility mode. Disassemblers are required to make educated guesses, even when given relatively precise information about the compilation target or the expected execution mode.

The complexity of the x86_64 instruction format is especially apparent when visualized:

Figure 2: Visualizing some of x86_64’s complexity.

Even the graphic above doesn’t fully capture x86_64’s nuances—it ignores the internal complexity of the ModR/M and scale-index-base (SIB) bytes, as well as the opcode extension bit and various escaping formats for extended opcodes (legacy escape prefixes, VEX escape, and XOP escape).

All told, these complexities make x86_64 decoder implementations uniquely amenable to testing via differential fuzzing—by hooking a mutation engine up to several different implementations at once and comparing each collection of outputs, we can quickly suss out bugs and missing functionality.

Building a “sliding” mutation engine for x86_64 instructions

Given this layout and our knowledge about minimum and maximum instruction lengths on x86_64, we can construct a mutation engine that probes large parts of the decoding pipeline with a “sliding” strategy:

  • Generate an initial instruction candidate of up to 26 bytes, including structurally valid prefixes and groomed ModR/M and SIB fields.
  • Extract each “window” of the candidate, where each window is up to 15 bytes beginning at index 0 and moving to the right.
  • Once all windows are exhausted, generate a new instruction candidate and repeat.

Why up to 26 bytes? See above! x86_64 decoders will only accept up to 15 bytes, but generating long, (potentially) semantically valid x86_64 instruction candidates that we “slide” through means we can test likely edge cases in decoding:

  • Failing to handle multiple, duplicate instruction prefixes.
  • Emitting nonsense prefixes or disassembly attributes (e.g., accepting and emitting a repeat prefix on a non-string operation, or the lock prefix on something that isn’t atomizable).
  • Failing to parse the ModR/M or SIB bytes correctly, causing incorrect opcode decoding or bad displacement/immediate scaling/indexing.

So, a maximal instruction candidate (with dummy displacement and immediate values) like…

f0 f2 2e 67 46 0f 3a 7a 22 8e 00 01 02 03 04 05 06 07 08 09 0a 0b 0c 0d 0e 0f
Figure 3: A maximal instruction candidate.

… yields 12 “window” candidates for actual fuzzing.

f0 f2 2e 67 46 0f 3a 7a 22 8e 00 01 02 03 04
f2 2e 67 46 0f 3a 7a 22 8e 00 01 02 03 04 05
2e 67 46 0f 3a 7a 22 8e 00 01 02 03 04 05 06
67 46 0f 3a 7a 22 8e 00 01 02 03 04 05 06 07
46 0f 3a 7a 22 8e 00 01 02 03 04 05 06 07 08
0f 3a 7a 22 8e 00 01 02 03 04 05 06 07 08 09
3a 7a 22 8e 00 01 02 03 04 05 06 07 08 09 0a
7a 22 8e 00 01 02 03 04 05 06 07 08 09 0a 0b
22 8e 00 01 02 03 04 05 06 07 08 09 0a 0b 0c
8e 00 01 02 03 04 05 06 07 08 09 0a 0b 0c 0d
00 01 02 03 04 05 06 07 08 09 0a 0b 0c 0d 0e
01 02 03 04 05 06 07 08 09 0a 0b 0c 0d 0e 0f

Figure 4: Extracted instruction candidates.
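As a rough illustration (not Mishegos's actual mutation engine), the window extraction boils down to a simple loop; a 26-byte candidate yields 26 - 15 + 1 = 12 windows of 15 bytes each:

#include <cstddef>
#include <cstdint>
#include <cstdio>

// Illustrative sketch of the "sliding" window extraction (not Mishegos's code).
constexpr std::size_t MAX_CANDIDATE_LEN = 26;
constexpr std::size_t MAX_INSN_LEN = 15;  // silicon fetches and decodes at most 15 bytes

void emit_windows(const std::uint8_t *candidate, std::size_t len) {
  for (std::size_t start = 0; start + MAX_INSN_LEN <= len; ++start) {
    // Each window would be handed to every decoder worker for a verdict.
    for (std::size_t i = 0; i < MAX_INSN_LEN; ++i)
      std::printf("%02x ", candidate[start + i]);
    std::printf("\n");
  }
}

int main() {
  const std::uint8_t candidate[MAX_CANDIDATE_LEN] = {
      0xf0, 0xf2, 0x2e, 0x67, 0x46, 0x0f, 0x3a, 0x7a, 0x22, 0x8e, 0x00, 0x01, 0x02,
      0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f};
  emit_windows(candidate, MAX_CANDIDATE_LEN);  // prints the 12 windows from Figure 4
  return 0;
}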

Consequently, our mutation engine spends a lot of time trying out different sequences of prefixes and flags, and relatively little time interacting with the (mostly irrelevant) displacement and immediate fields.

Mishegos: Differential fuzzing of x86_64 decoders

Mishegos takes the “sliding” approach above and integrates it into a pretty typical differential fuzzing scheme. Each fuzzing target is wrapped into a “worker” process with a well-defined ABI:

  • worker_ctor and worker_dtor: Worker setup and teardown functions, respectively.
  • try_decode: Called for each input sample, returns the decoder’s results along with some metadata (e.g., how many bytes of input were consumed, the status of the decoder).
  • worker_name: A constant string used to uniquely identify the type of worker.

The codebase currently implements five workers:

  • Capstone—A popular disassembly framework originally based on the LLVM project’s disassemblers.
  • libbfd/libopcodes—The backing libraries used by the popular GNU binutils.
  • udis86—An older, potentially unmaintained decoder (last commit 2014).
  • XED—Intel’s reference decoder.
  • Zydis—Another popular open source disassembly library, with an emphasis on speed and feature-completeness.

Because of the barebones ABI, Mishegos workers tend to be extremely simple. The worker for Capstone, for example, is just 32 lines:

#include <capstone/capstone.h>

#include "../worker.h"

static csh cs_hnd;

char *worker_name = "capstone";

void worker_ctor() {
  if (cs_open(CS_ARCH_X86, CS_MODE_64, &cs_hnd) != CS_ERR_OK) {
    errx(1, "cs_open");
  }
}

void worker_dtor() {
  cs_close(&cs_hnd);
}

void try_decode(decode_result *result, uint8_t *raw_insn, uint8_t length) {
  cs_insn *insn;
  size_t count = cs_disasm(cs_hnd, raw_insn, length, 0, 1, &insn);

  if (count > 0) {
    result->status = S_SUCCESS;
    result->len =
      snprintf(result->result, MISHEGOS_DEC_MAXLEN, "%s %s\n", insn[0].mnemonic, insn[0].op_str);
    result->ndecoded = insn[0].size;
    cs_free(insn, count);
  } else {
    result->status = S_FAILURE;
  }
}

Figure 5: Source for the Capstone worker.

Behind the scenes, workers receive inputs and send outputs in parallel via slots, which are accessed through a shared memory region managed by the fuzzing engine. Input slots are polled via semaphores to ensure that each worker has retrieved a candidate for decoding; output slots are tagged with the worker’s name and instruction candidate to allow for later collection into cohorts. The result is a relatively fast differential engine that doesn’t require each worker to complete a particular sample before continuing: Each worker can consume inputs at its own rate, with only the number of output slots and cohort collection limiting overall performance.
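As a rough sketch of the slot idea (the layout and field names here are ours, not Mishegos's), each input slot in the shared region carries a semaphore counting the workers that still need to read it, and each output slot is tagged with the worker's name and the candidate it decoded:

#include <cstdio>
#include <cstring>
#include <semaphore.h>

// Illustrative sketch only; Mishegos's real slot layout and management differ.
constexpr int NUM_WORKERS = 4;
constexpr int MAX_INSN = 15;
constexpr int MAX_RESULT = 64;

struct input_slot {
  sem_t readers;                 // engine initializes this to NUM_WORKERS
  unsigned char len;
  unsigned char insn[MAX_INSN];
};

struct output_slot {
  char worker_name[16];          // tag used to group outputs into cohorts
  unsigned char insn[MAX_INSN];  // echo of the candidate this result belongs to
  unsigned char ndecoded;
  char result[MAX_RESULT];
};

// Worker side: take one read "ticket" if any remain; otherwise the engine
// will eventually refill the slot with a fresh candidate.
bool consume(input_slot *in, output_slot *out, const char *name) {
  if (sem_trywait(&in->readers) != 0)
    return false;
  std::snprintf(out->worker_name, sizeof(out->worker_name), "%s", name);
  std::memcpy(out->insn, in->insn, in->len);
  // ... call the worker's try_decode() here and fill ndecoded/result ...
  return true;
}

int main() {
  input_slot in{};
  sem_init(&in.readers, /*pshared=*/1, NUM_WORKERS);  // engine side
  in.len = 1;
  in.insn[0] = 0x90;                                  // candidate: nop
  output_slot out{};
  consume(&in, &out, "example");                      // worker side
  sem_destroy(&in.readers);
  return 0;
}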

The bird’s-eye view:

Figure 6: Mishegos’s architecture.

Making sense of the noise

Mishegos produces a lot of output: A single 60-second run on a not particularly fast Linux server (inside of Docker!) produces about 1 million cohorts, or 4 million bundled outputs (1 output per input per fuzzing worker with 4 workers configured):

Figure 7: An example Mishegos run.

Each output cohort is structured as a JSON blob, and looks something like this:

{
  "input": "3626f3f3fc0f587c22",
  "outputs": [
    {
      "ndecoded": 5,
      "len": 21,
      "result": "ss es repz repz cld \n",
      "workerno": 0,
      "status": 1,
      "status_name": "success",
      "worker_so": "./src/worker/bfd/bfd.so"
    },
    {
      "ndecoded": 5,
      "len": 5,
      "result": "cld \n",
      "workerno": 1,
      "status": 1,
      "status_name": "success",
      "worker_so": "./src/worker/capstone/capstone.so"
    },
    {
      "ndecoded": 5,
      "len": 4,
      "result": "cld ",
      "workerno": 2,
      "status": 1,
      "status_name": "success",
      "worker_so": "./src/worker/xed/xed.so"
    },
    {
      "ndecoded": 5,
      "len": 3,
      "result": "cld",
      "workerno": 3,
      "status": 1,
      "status_name": "success",
      "worker_so": "./src/worker/zydis/zydis.so"
    }
  ]
}

Figure 8: An example output cohort from Mishegos.

In this case, all of the decoders agree: The first five bytes of the input decode to a valid cld instruction. libbfd is extra eager and reports the (nonsense) prefixes, while the others silently drop them as irrelevant.

But consistent successes aren’t what we’re interested in—we want discrepancies, dammit!

Discrepancies can occur along a few dimensions:

  • One or more decoders disagree about how many bytes to consume during decoding, despite all reporting success.
  • One or more decoders report failure (or success), in contrast to others.
  • All decoders report success and consume the same number of input bytes, but one or more disagree about a significant component of the decoding (e.g., the actual opcode or immediate/displacement values).

Each of these has adversarial applications:

  • Decoding length discrepancies can cause a cascade of incorrect disassemblies, preventing an automated tool from continuing or leaving a manual analyst responsible for realigning the disassembler.
  • Outright decoding failures can be used to prevent usage of a susceptible tool or platform entirely, or to smuggle malicious code past an analyst.
  • Component discrepancies can be used to mislead an analysis or human analyst into incorrectly interpreting the program’s behavior. Severe enough discrepancies could even be used to mask the recovered control flow graph!

Mishegos discovers each of these discrepancy classes via its analysis tool and presents them with mishmat, a hacky HTML visualization. The analysis tool collects language-agnostic “filters” into “passes” (think LLVM), which can then order their internal filters either via a dependency graph or based on perceived performance requirements (i.e., largest filters first). Passes are defined in ./src/analysis/passes.yml, e.g.:

# Find inputs that all workers agree are one size, but one or more
# decodes differently.
same-size-different-decodings:
- filter-any-failure
- filter-ndecoded-different
- filter-same-effects
- minimize-input
- normalize

Figure 9: An example of a Mishegos analysis pass composed of several filters.

Individual filters are written as small scripts that take cohorts on stdin and conditionally emit them on stdout. For example, filter-ndecoded-different:

require "json"

STDERR.puts "[+] pass: filter-ndecoded-different"

count = 0
STDIN.each_line do |line|
  result = JSON.parse line, symbolize_names: true

  outputs_ndecoded = result[:outputs].map { |o| o[:ndecoded] }

  if outputs_ndecoded.uniq.size > 1
    count += 1
    next
  end

  STDOUT.puts result.to_json
end

STDERR.puts "[+] pass: filter-ndecoded-different done: #{count} filtered"

Figure 10: An example of a Mishegos analysis filter

Filters can also modify individual results or entire cohorts. The minimize-input filter chops the instruction candidate down to the longest indicated ndecoded field, and the normalize filter removes extra whitespace in preparation for additional analysis of individual assemblies.

Finally, passes can be run as a whole via the analysis command-line:

Figure 11: An example analysis run.

The analysis output can be visualized with mishmat, with an optional cap on the size of the HTML table:

mishmat -l 10000 < /tmp/mishegos.sd > /tmp/mishegos.sd.html

Ultimately, this yields fun results like the ones below (slightly reformatted for readability). Instruction candidates are on the left; individual decoder results are labeled by column. (bad) in libbfd's column indicates a decoding failure. The (N/M) syntax represents the number of bytes decoded (N) and the total length of the assembled string (M):

input | libbfd | capstone | zydis | xed
f3f326264e0f3806cc | repz repz es es rex.WRX (bad) (8 / 29) | phsubd mm1, mm4 (9 / 15) | (0 / 0) | (0 / 0)
26f366364f0f38c94035 | es data16 ss rex.WRXB (bad) (8 / 27) | sha1msg1 xmm8, xmmword ptr ss:[r8 + 0x35] (10 / 41) | (0 / 0) | (0 / 0)
f366364f0f38c94035 | data16 ss rex.WRXB (bad) (7 / 24) | sha1msg1 xmm8, xmmword ptr ss:[r8 + 0x35] (9 / 41) | (0 / 0) | (0 / 0)
66364f0f38c94035 | ss rex.WRXB (bad) (6 / 17) | sha1msg1 xmm8, xmmword ptr ss:[r8 + 0x35] (8 / 41) | (0 / 0) | (0 / 0)

Figure 12: Capstone thinking that nonsense decodes to valid SSE instructions.

input | libbfd | capstone | zydis | xed
f36766360f921d32fa9c83 | repz data16 setb BYTE PTR ss:[eip+0xffffffff839cfa32] # 0xffffffff839cfa3d (11 / 74) | (0 / 0) | setb byte ptr [eip-0x7c6305ce] (11 / 30) | setb byte ptr ss:[0x00000000839CFA3D] (11 / 37)

Figure 13: Capstone missing an instruction entirely.

input | libbfd | capstone | zydis | xed
3665f0f241687aa82c8d | ss gs lock repnz rex.B push 0xffffffff8d2ca87a (10 / 46) | push -0x72d35786 (10 / 16) | (0 / 0) | (0 / 0)
65f0f241687aa82c8d | gs lock repnz rex.B push 0xffffffff8d2ca87a (9 / 43) | push -0x72d35786 (9 / 16) | (0 / 0) | (0 / 0)
f0f241687aa82c8d | lock repnz rex.B push 0xffffffff8d2ca87a (8 / 40) | push -0x72d35786 (8 / 16) | (0 / 0) | (0 / 0)

Figure 14: Amusing signed representations.

input | libbfd | capstone | zydis | xed
3e26f0f2f1 | ds es lock repnz icebp (5 / 22) | int1 (5 / 4) | (0 / 0) | (0 / 0)

Figure 15: Undocumented opcode discrepancies!

input | libbfd | capstone | zydis | xed
f3430f38f890d20aec2c | repz rex.XB (bad) (5 / 17) | (0 / 0) | enqcmds rdx, zmmword ptr [r8+0x2cec0ad2] (10 / 40) | enqcmds rdx, zmmword ptr [r8+0x2CEC0AD2] (10 / 40)
2e363e65440f0dd8 | cs ss ds gs rex.R prefetch (bad) (6 / 32) | (0 / 0) | nop eax, r11d (8 / 13) | nop eax, r11d (8 / 13)
f2f266260fbdee | repnz data16 es (bad) (6 / 21) | (0 / 0) | bsr bp, si (7 / 10) | bsr bp, si (7 / 10)

Figure 16: XED and Zydis only.

input | libbfd | capstone | zydis | xed
64f064675c | fs lock fs addr32 pop rsp (5 / 25) | (0 / 0) | (0 / 0) | (0 / 0)
2e | cs (1 / 2) | (0 / 0) | (0 / 0) | (0 / 0)
f06636f00f3802c7 | lock ss lock phaddd xmm0,xmm7 (8 / 29) | (0 / 0) | (0 / 0) | (0 / 0)
f03e4efd | lock ds rex.WRX std (4 / 19) | (0 / 0) | (0 / 0) | (0 / 0)
36f03e4efd | ss lock ds rex.WRX std (5 / 22) | (0 / 0) | (0 / 0) | (0 / 0)

Figure 17: And, of course, libbfd being utterly and repeatedly wrong.

The results above were captured with revision 88878dc on the repository. You can reproduce them by running the fuzzer in manual mode:

M=1 mishegos ./workers.spec <<< "36f03e4efd" > /tmp/mishegos

The big takeaways

For reverse engineers and program analysts: x86_64 instruction decoding is hard. The tools that you rely on to do it are not, in fact, reliable. It is possible (and even trivial), given Mishegos's output, to construct adversarial binaries that confuse your tools and waste your time. We've reported some of these issues upstream, but make no mistake: Trusting your decoder to perfectly report the machine behavior of a byte sequence will burn you.

Not everything is doom and gloom. If you need accurate instruction decoding (and you do!), you should use XED or Zydis. libopcodes is frequently close to Zydis and XED in terms of instruction support but consistently records false positives and decodes just prefixes as valid instructions. Capstone reports both false positives and false negatives with some regularity. udis86 (not shown above) behaves similarly to libopcodes and, given its spotty maintenance, should not be used.

This post is one of many from our team on the vagaries of parsing. Watch this space for a post by Evan Sultanik on polyglots and schizophrenic parsing.