64 Bits ought to be enough for anybody!

How quickly can we use brute force to guess a 64-bit number? The short answer is, it all depends on what resources are available. So we’re going to examine this problem starting with the most naive approach and then expand to other techniques involving parallelization.

We’ll discuss parallelization at the CPU level with SIMD instructions, then via multiple cores, GPUs, and cloud computing. Along the way we’ll touch on a variety of topics about microprocessors and some interesting discoveries, e.g., adding more cores isn’t always an improvement, and not all cloud vCPUs are equivalent.

Sixty-four is (more than) a magic number

Why try to guess a 64-bit number? Modern processors operate on 64-bit quantities, so 64 bits is a natural size for magic numbers, headers, and other markers. When fuzzing, it’s common to run into comparisons against such “magic” 64-bit values, but guessing these values is seen as a canonical impossible problem. Fortunately, no one has to use brute force in such situations, because there are better approaches like removing the comparison, using pre-made input seeds, dictionaries, symbolic execution, and compile-time transformation.

But the problem of a brute force guess is easy to understand and parallelize, demonstrating just how effective parallelization can be against a herculean task. Looking at this problem also shows that as hardware gets faster, new sets of computational problems become possible. Imagine what we can achieve using the full arsenal of modern computing power!

Still, we must consider just how intractable it is to guess a 64-bit number by simply trying all possibilities. How long would it take? And how much would it cost?

How big is a 64-bit number?

A 64-bit number can hold 2^64 (that is, 18,446,744,073,709,551,616) distinct values—more than the grains of sand on Earth and cells in the human body (Figure 1).

Smallest → Largest:
Cells in the human body: 3.72 × 10^13
Grains of sand on Earth: 7.5 × 10^18
2^64: 1.84 × 10^19
Avogadro's number: 6.02 × 10^23
Stars in the Universe: 10^24
Figure 1: Getting an intuitive feel for the size of 2^64 by comparing it to other large numbers.

A modern CPU can execute about 270 billion instructions per second, so exhausting a 2^64 search space would take 776 days—a little more than two years. Thankfully, brute force comparison is an embarrassingly parallel problem, where the work can be evenly distributed among many processors. So what if we can coax each processor into doing more than one comparison at a time? Maybe there’s some kind of service where we can get a lot of processors on short notice.

All of a sudden this is starting to look tractable!

Crunching the numbers

Disclaimer: Before we get to the numbers, I want to emphatically state that this is a fun experiment and not a benchmark. No attempt was made to ensure a fair apples-to-apples comparison of different processors or machines. Plus:

  • The code used to generate the performance numbers is written in C and lacks tests.
  • It may not be the fastest possible version.
  • It is also certain to have multiple bugs.
  • In fact, a serious time measurement bug found during review delayed this post by a few weeks.

Fixes and suggestions are welcome on GitHub.

All measurements reflect an average of 10 trials. Two of the machines tested are cloud instances in Digital Ocean and Google Cloud, respectively. The Digital Ocean (DO) High CPU instance reports itself as an “Intel(R) Xeon(R) Platinum 8168 CPU” running at 2.70 GHz. The Google Cloud (GCP) High CPU instance reports itself as an “Intel(R) Xeon(R) CPU” running at 3.10 GHz. Neither of these self-reported identifiers can be trusted: Virtualization platforms can and usually do lie about the underlying CPU. Cloud machines are also shared, and what other tenants do on your machine may affect your CPU throughput.

Of the physical machines tested, my 2018 MacBook Pro has an Intel Core i7 8559U processor. The Xeon E5-2640 v4 is a 2.40 GHz 40-core shared server. Both of these machines had other software running at the same time as these measurements. The Core i3 530 running at 2.93 GHz is an old machine that was literally serving as a footrest and cat scratching post before it was revived for this project thanks to its GPU.

Pull Requests Wanted: You will notice that these are all x86 CPUs. ARM system measurements are very much wanted. I did not include these since this project was already taking too long, and I would have to start from scratch to learn ARM’s SIMD instructions.

The naive for loop

Some say premature optimization is the root of all evil. So let’s measure how long a generic for loop takes to compare all 2^64 values. Figure 2 below lists the operations per millisecond (an average of 10 runs) performed on a subset of the full 64-bit range, and an estimate of how long it would take to try all 2^64 values.

Device Operations Per Millisecond (ms⁻¹) Years To Completion (Yrs)
2018 MacBook Pro 4.41E+06 132.78
Core i3 530 1.45E+06 402.46
DO High CPU 3.50E+06 167.30
GCP High CPU 3.77E+06 155.21
Xeon E5-2640 v4 3.37E+06 173.38
Figure 2: A generic for loop comparison of all 2^64 values.

The most naive approach would take 133 years. Clearly, this is much too long to wait and some optimization is in order.

Vectorized for loop

Modern processors can operate on multiple 64-bit quantities at a time via SIMD or vector instructions. Currently, Intel leads the pack with AVX-512, which, as the name implies, operates on 512-bit vectors. This lets us compare eight 64-bit quantities per iteration. For those who want to know more about vector instructions, Cornell has a good set of introductory material on the topic.

Vectorization is the process of transforming code that runs on one quantity at a time (a scalar) into code that operates on multiple quantities simultaneously (a vector). Vectorization was my first optimization, because I thought clang would automatically vectorize code for me. Unfortunately, I was wrong—clang’s auto vectorization is meant to vectorize code without dependence on the loop variable, like matrix multiplication. Instead of relying on the compiler, I used artisanally hand-crafted vectorized comparisons (sse, avx2, and avx512) and hand-unrolled loops to make maximum use of multiple vector execution units.

Vector instructions are continually improving, but not every x86 CPU supports AVX-512. Some only support AVX2 (256-bit vectors), while others just do SSE4.1 (128-bit vectors), and some don’t even support that. In Figure 3 (table) and Figure 4 (graph) below, we can compare different vectorization approaches available on our hardware collection.

Device Method Operations Per Millisecond (ms⁻¹) Years To Completion (Yrs)
2018 MacBook Pro Naive 4.41E+06 132.78
2018 MacBook Pro SSE 6.73E+06 86.94
2018 MacBook Pro AVX2 1.51E+07 38.85
Core i3 530 Naive 1.45E+06 402.46
Core i3 530 SSE 3.08E+06 190.12
DO High CPU Naive 3.50E+06 167.30
DO High CPU SSE 4.86E+06 120.44
DO High CPU AVX2 1.09E+07 53.57
DO High CPU AVX512 1.41E+07 41.44
GCP High CPU Naive 3.77E+06 155.21
GCP High CPU SSE 5.02E+06 116.41
GCP High CPU AVX2 1.17E+07 49.82
GCP High CPU AVX512 1.35E+07 43.37
Xeon E5-2640 v4 Naive 3.37E+06 173.38
Xeon E5-2640 v4 SSE 4.50E+06 129.99
Xeon E5-2640 v4 AVX2 1.02E+07 57.49
Figure 3: Performance of vectorized and naive versions of 64-bit comparison operations. Not all hardware platforms support the same vectorization methods.

The table above is included for completeness; the graph of the same data below provides an easier visual comparison.


Figure 4: A graphical representation of Figure 3; performance of different methods to compare 64-bit numbers on a single core.

Several things stand out in this data:

  • Vectorization always helps. Comparing more values per iteration is always faster, even when accounting for setup time to move values in and out of vector registers.
  • AVX2 (four comparisons at a time) is a big improvement over SSE4.1 (two comparisons at a time). As expected, AVX2 is about twice as fast as SSE4.1.
  • Conversely, AVX-512 (eight comparisons at a time) is a small improvement over AVX2 (four comparisons at a time). How can this be? I suspect that this is due to a little-known side effect of using AVX-512 instructions: They slow down the processor by as much as 40%. The processor’s power budget doesn’t permit it to both run at full speed and make heavy use of AVX-512.

Even with these improvements, it would still take 39 years to check all 64 bits on a single core. How much faster can we go using multiple cores?

Parallelized and vectorized

The problem of finding a needle in a 64-bit haystack happens to be ridiculously parallelizable, so multiple cores should deliver a linear increase in throughput. The data below (Figure 5) shows that this is true—except when it’s not! Sometimes performance plateaus even as more cores are added. How can this be?

This effect is due to hyperthreading. A hyperthreaded processor presents each physical core as two or more virtual cores to the operating system. While each hyperthread appears independent, the underlying execution units are shared. For computationally intensive applications, hyperthreading can have significant and surprising effects on performance. In a hyperthreaded environment and on CPU-intensive workloads, which cores are used can matter almost as much as the number of cores. This is especially important for cloud-based workloads, where each vCPU is a hyperthread on a physical core.


Figure 5: Operations per hour versus number of cores for each tested machine, separated by different methods of allocating cores. MacBook Pro and Core i3 530 results are omitted.

Figure 5 shows multicore performance using two methods to allocate workers to hyperthreads. Split cores allocates to separate physical cores, while share cores allocates to the same physical core.

The 40-hyperthreaded-core Xeon E5-2640 machine illustrates this difference: Using split core allocation, performance peaks at 20 cores—the number of physical cores—and then levels off. Using shared core allocation, throughput follows a step function, increasing with each new physical core. We may also make inferences about cloud hardware using this data: The 16 vCPUs of a high-CPU GCP instance probably represent 8 physical cores exposed as 16 hyperthreads dedicated to our workload.

The DO high-CPU machine presents a puzzle: The same effect isn’t observed. Assuming the vCPUs come from a real and single Xeon Platinum 8168, there should be differences after 24 cores are utilized. This doesn’t happen, and there are a few possible explanations. First, the vCPUs are not from the same physical processor: The Xeon 8168 can operate in an 8-way multiprocessing configuration. Second, the processor is not a Xeon 8168, but another chip altogether. And finally, there may be something wrong with my thread affinity or timing measurement code.

Regardless, the scaling results show a valuable lesson: Using more cores isn’t always better, and it matters which cores (or vCPUs) you use. There is little to no gain, for this workload, when allocating more workers than physical cores present in the machine. Always measure when performance counts.

For completeness, Figure 6 lists operations per millisecond and estimated years to completion when utilizing multiple cores, with each core using the fastest supported single-core method. For the cloud machines, each core is a “vCPU,” which is roughly equivalent to one hyperthreaded core. For physical machines, each core is one hyperthreaded core.

Device Cores (Hyperthreaded) Operations Per Millisecond (ms⁻¹) Years To Completion (Yrs)
Xeon E5-2640 v4 39 / 40 1.57E+08 3.73
GCP High CPU 16 / 16 1.24E+08 4.71
DO High CPU 32 / 32 2.55E+08 2.30
Core i3 530 2 / 4 6.24E+06 93.76
2018 MacBook Pro 8 / 8 5.14E+07 11.38
Figure 6: The best multiprocessing times observed, and at how many cores.

The only outliers in Figure 6 are the Xeon E5-2640, which did best at 39/40 hyperthreaded cores, and the Core i3 530, which did best at 2/4 hyperthreaded cores. Why 39 cores? The machine is shared and handles other workloads; at 39 cores, those other workloads can be confined to the one remaining core. At 40 cores utilized, the other workloads spread across every core and take scheduling time away from the CPU-intensive integer comparisons.

As expected, using multiple CPUs has drastically reduced compute time. However, it is still too slow. We can’t expect to wait more than two years to guess one number.

Enter the graphics

GPUs take a different approach to computation than CPUs. CPUs are good at multiple concurrent tasks; GPUs are great at doing simple operations over a huge volume of data. This manifests in the main architectural differences: A high-end CPU may come with 24 very complex cores, while a high-end GPU comes with 5,120 simple cores (Figure 7).


Figure 7: This is a Tesla V100 GPU from NVIDIA. It comes with 5,120 CUDA cores and was the most powerful GPU tested. Graphic taken from the NVIDIA Volta architecture whitepaper.

Inadvertently, I picked a problem that is tailor-made for GPU optimization. Brute force comparison is easy to parallelize, involves no complex decisions, and is entirely compute-bound. Thanks to the role of GPUs in machine learning, cheap off-peak GPU capacity is available from every large cloud provider.

Figuring out how to use GPU computation took some work, but it was absolutely worth it. Just look at the throughput in Figure 8! There is an order of magnitude performance gain even though my knowledge of GPU programming and CUDA was about zero when I started, and my CUDA code has approximately my-first-CUDA-tutorial-level of optimization.

Device GPUs Operations Per Millisecond (ms⁻¹) Years To Completion (Yrs)
GeForce GT 1030 1 1.60E+08 3.66
NVIDIA Tesla K80 8 2.92E+09 0.20
NVIDIA Tesla P100 4 4.19E+09 0.14
NVIDIA Tesla V100 4 9.08E+09 0.06
Figure 8: 64-bit comparison throughput on GPUs, and estimated years to search all 2^64 values.

As Figure 8 shows, using GPUs instead of CPUs shortens the required time from years to days. What a difference the right platform makes! An old $85 video card (the GeForce GT 1030) performs on par with a 40-core Xeon machine. Using 4x NVIDIA Tesla V100 GPUs, we can brute force a 64-bit comparison in 0.06 years, or about 22 days.

The difference between CPU and GPU computation for this problem is so dramatic (a lone V100 is about 18.5x as fast as 16 high-CPU cores) that it doesn’t make sense to use CPUs. Adding another GPU will always be a better use of resources than relying on CPU computation. To show how bad CPUs are at this problem, I made this handy graph:


Figure 9: Throughput when comparing 64-bit numbers on selected CPUs and GPUs. This specific problem can be done so much faster on GPUs that it doesn’t make sense to use CPUs at all.

Now that we’ve gone from 130 years (with the naive for loop) to 22 days (on multiple GPUs), can we get to hours or minutes? How much hardware would it take…and how much would it cost?

Show me the money

We established that GPUs are much faster than CPUs, but GPU time is also much more expensive. And which GPU is best? Different GPU families command rather different prices.

Using current pricing (as of November 2019), we can see that the price/performance of GPUs is about 5x as good as that of CPUs. While more expensive per hour, the V100 delivers better performance per dollar than its GPU predecessors (Figure 10).

Device Compute-Years Needed Per-Hour Cost (Preemptible) Total Cost
Preemptible CPU GCP [16 vCPU] 4.71 $0.20 $8,311.70
Preemptible GPU [4x P100] 0.14 $1.76 $2,147.34
Preemptible GPU [4x V100] 0.06 $3.00 $1,689.45
Preemptible GPU [8x K80] 0.20 $1.12 $1,959.64
Figure 10: Price of CPU and GPU time needed to compare a value against 2^64 numbers, measured in compute-years. Costs are for preemptible time on Google Cloud.

For something that seemed so insurmountable at the start, the final cost is well within reach: Comparing a number against 2^64 values would take about $1,700 of preemptible GPU compute time.

Because the problem can be parallelized so efficiently, total compute time is almost directly interchangeable with hardware cost and availability. For example, 4x V100 GPUs will look through all 2^64 numbers in about 22 days, but 2,259x V100 GPUs would look through 2^64 numbers in about an hour, for the same overall cost.


This silly experiment shows the importance of understanding your problem, your hardware, and the kinds of computing resources that are available to the average developer.

What seemed like a completely crazy idea turned out to be only mildly insane. Trying 2^64 comparisons would cost about $1,700 in compute time, and can be accomplished in hours or days.

We also learned a few more things along the way:

  • Parallelization is not always easy, but it can be remarkably effective. Learning some CUDA made this originally insurmountable problem go from impractical to “tolerable but expensive.”
  • Using more cores isn’t always better, and it matters which cores you use.
  • The underlying hardware can leak under the abstraction of the cloud.
  • Cloud computing provides access to nearly limitless on-demand hardware; it just has to be put to use.

There are certainly many problems that seem crazy and completely out of reach, but can be solved via a combination of parallelizable algorithms, cloud computing, and a corporate credit card.

We’re always developing ways to work faster and smarter. Need help with your next project? Contact us!

Introducing iVerify, the security toolkit for iPhone users

“If privacy matters, it should matter to the phone your life is on.” So says Apple in their recent ads about Privacy on the iPhone and controlling the data you share—but many of the security features they highlight are opt-in, and users often don’t know when or how to activate them.

But hey… we got your back!

Today, Trail of Bits launched iVerify, a user-friendly iPhone security toolkit, now available for download in the iOS App Store. iVerify makes it easy to manage the security of your accounts and online presence with simple instructional guides. Crucially, it also detects security anomalies on your iPhone or iPad.

Read more about iVerify in VICE Motherboard.

Yes, you do need to secure your iPhone

Although iOS malware attacks have been relatively rare, 2019 saw the largest attack against iPhone users to date. And in September, a new iPhone Boot ROM exploit was released that allows anyone with physical control of a phone to run arbitrary code. (See details in our post “Tethered Jailbreaks Are Back.”)

Fortunately, there are plenty of ways to protect your data, but locking down your iPhone and iCloud account is not straightforward. There are dozens of settings that need to change, and the trade-offs between enabling and disabling a feature aren’t always clear. iVerify makes these trade-offs straightforward with guides that show you how to adjust settings based on privacy and security needs.

iVerify alerts you to security anomalies

Not only does iVerify help you keep your data confidential and limit data sharing, it helps protect the integrity of your device. It’s normally almost impossible to tell if your iPhone has been hacked, but our app gives you a heads-up. iVerify periodically scans your device for anomalies that might indicate it’s been compromised, gives you a detailed report on what was detected, and provides actionable advice on how to proceed.

iVerify does all of this using iverify-core, our suite of iOS integrity and security checks originally created for professional developers. These checks were developed by Trail of Bits based on our extensive experience and expertise in iOS internals and iOS security. iverify-core is updated as new versions of the iPhone and new security checks are released, so new iVerify app users will get the benefit of years of improvement.

Get yours now on the iOS App Store

For iPhone users: To find iVerify on the iOS App Store, simply search for “Trail of Bits” or click here to get the app now.

For developers: Want to license iverify-core? Contact us today and let’s get you started.

Announcing the Crytic $10k Research Prize

At Trail of Bits, we make a significant effort to stay up to date with the academic world. We frequently evaluate our work through peer-reviewed conferences, and we love to attend academic events (see our recent ICSE and Crypto recaps).

However, we consistently see one recurring issue at these academic events: a lack of reliable tools and experiments. Researchers have no incentive to maintain tools, and, most of the time, don’t have the necessary engineering resources. That’s where Trail of Bits comes in: We spend considerable effort to maintain our research-oriented, open-source tools because we want researchers to benefit from our work.

And now, to encourage more activity in this area, we’ve created the Crytic Research Prize to reward published academic papers built on our blockchain tools.

Topics of interest

Any work that is built on top of our tools is eligible.

Do you want to make an industrial-scale impact? Our team is currently interested in these topics:

  • Improving the core capabilities of the tools:
    • How to make Echidna exploration smarter
    • How to make Manticore faster on smart contracts
    • How to reduce false positives from Slither
  • Extending the areas of application for the tools:
    • How to model and detect race conditions precisely
    • How to automatically repair bugs
    • How to guide Echidna with machine learning or grammar-based approaches
    • Any specific instrumentation techniques for smart contracts
    • How to combine the tools
  • We would especially love to see creative uses for our tools outside their original purpose. For example, see our code similarity tool based on Slither.


Rules

  • The paper must be based on one of our open-source tools (e.g., Slither, Echidna, or Manticore).
  • The paper must have been accepted in a peer-reviewed conference. We recommend:
    • ACSAC, ASE, CAV, CCS, Crypto, FC, FSE, NDSS, ICSE, ISSTA, OSDI, POPL, PLDI, S&P, S&P Europe, TACAS, and USENIX.
    • Any workshops from the conferences listed above.
  • All material must be open-source, including the benchmark. We will re-run the experimentation and evaluate the tools.
  • Applications must comply with the applicable US and international laws, regulations, and policies.

Trail of Bits will evaluate all entries and award cash prizes to the best papers:

  • First place: 6,000 USD
  • Second place: 2,000 USD
  • Third place: 2,000 USD

We will close the submissions on November 1st, 2020.

We are happy to provide support for our tools. Join our Slack channel (#ethereum) if you want direct assistance from our developers.

Remember, if you want to participate in the competition, send us your paper!

Everything You Ever Wanted To Know About Test-Case Reduction, But Didn’t Know to Ask

Imagine reducing the amount of code and time needed to test software, while at the same time increasing the efficacy of your tests and making your debugging tasks easier—all with minimal human effort. It seems too good to be true, but we’re going to explain how test-case reduction can do all this (and maybe more).

Understanding how reduction works can help with troubleshooting, and makes it easier to figure out an efficient workflow and the best tools to optimize your tests. We’ll explain why test-case reduction is an especially important topic for security engineers to understand, and take a look at DeepState’s state-of-the-art reducer.

Test-case reduction for humans

The most common purpose for test-case reduction is to take a complicated failing test case for a confirmed bug, and turn it into one that is much easier to understand in order to make debugging easier. But you may also want to dismiss low-priority bugs! Having a simple version of such bugs helps quickly identify future duplicate submissions of the same, unimportant problem. For unconfirmed bugs, reducing test cases can be even more critical, because you often can’t tell if a bug is a duplicate (or even a bug at all) until you’ve simplified it. You may discover that the problem is with your specification, test generation tool, or operating system.

Without a reduced test case that can be easily understood and succinctly described, it’s hard to even say that you have a bug. You may have evidence that a software system is behaving unexpectedly, but when the test case is complex enough “a bug” can be a dubious concept. A complex test case is almost analogous to a mathematical proof that only shows that an integer with a certain property must exist; your proof needs to be constructive if you want to do something that requires an actual number.

Suppose you have a very large, randomly-generated HTML file that crashes Chrome. You’ve possibly found an important bug; or maybe you just did something that caused Chrome to run out of memory in an expected way. If your only conclusion is, “When I load this huge file in Chrome, it stops running,” you don’t really know much at all. In this case, you may want to apply a test-case reducer instead of spending time looking at core files and attaching debuggers. If it reduces the file down to a single page of HTML or, better yet, reduces to something as small as a single SELECT tag, you have a shortcut to understanding what is going on.

Test-case reduction has direct applications in cybersecurity, especially when fuzzers are used to craft exploits from crashes generated by random inputs. The simpler the input, the easier it will be to construct an exploit or realize that a bug cannot be exploited. AFL and libFuzzer provide built-in limited test-case reduction, but sometimes you need more than that. Modern test-case reduction tools can simplify this analysis, and are probably essential if you want to produce complex sequences of API calls to find vulnerabilities in a library like TLS, SQLite, or LevelDB. This concept also extends to fuzzing smart contracts, which is why Echidna, Trail of Bits’ smart contract fuzzer, includes a test-case reducer.

Test-case reduction for machines

Test-case reduction does more than just make tests easier for humans to read; it’s also useful for core cybersecurity tasks like fuzzing, symbolic execution, and bug triage.

If you’d like to search the literature on test-case reduction, it’s also referred to as minimization, shrinking, and, originally, delta-debugging. All of these names refer to the same process of taking a program input A that does X, and producing a smaller input, B, that also does X.

Using a test-case reducer

Let’s see test-case reduction in action. Bring up the DeepState docker image, and install the testfs repository, which is a DeepState harness for a user-mode ext3-like file system.

> git clone https://github.com/agroce/testfs.git
> cd testfs
> cmake .
> make

Our DeepState harness lets us simulate device resets in the middle of operations, and check for problems caused by interrupting a file system call. It checks that after a reset, the file system can still be mounted. To generate a test case showing that this isn’t always true, we can just use DeepState’s built-in fuzzer:

> mkdir failure
> ./Tests --fuzz --output_test_dir failure --exit_on_fail --seed 10

DeepState will report a problem, and save the resulting test case in a file with a unique file name ID and .fail extension. Let’s look at the sequence producing the file system corruption. For brevity, we show only the actual test steps below.

> ./Tests --input_test_file failure/dbb393e55c77bac878ab06a02a022370e33761cb.fail

TRACE: Tests.cpp(115): STEP 0: tfs_lsr(sb);
TRACE: Tests.cpp(140): STEP 1: tfs_stat(sb, ".a.BBA");
TRACE: Tests.cpp(146): STEP 2: tfs_cat(sb, "/aAb.BaAb");
TRACE: Tests.cpp(140): STEP 3: tfs_stat(sb, "");
TRACE: Tests.cpp(115): STEP 4: tfs_lsr(sb);
TRACE: Tests.cpp(103): STEP 5: tfs_rmdir(sb, "BB");
TRACE: Tests.cpp(146): STEP 6: tfs_cat(sb, "b");
TRACE: Tests.cpp(110): STEP 7: tfs_ls(sb);
TRACE: Tests.cpp(95): STEP 8: tfs_mkdir(sb, "A");
TRACE: Tests.cpp(146): STEP 9: tfs_cat(sb, "a./b");
TRACE: Tests.cpp(103): STEP 10: tfs_rmdir(sb, "AaBbBB.A.");
TRACE: Tests.cpp(130): STEP 11: tfs_write(sb, "BA/BB/", "yx");
TRACE: Tests.cpp(140): STEP 12: tfs_stat(sb, "bba");
TRACE: Tests.cpp(155): STEP 13: set_reset_countdown(4);
TRACE: Tests.cpp(140): STEP 14: tfs_stat(sb, "/A");
TRACE: Tests.cpp(121): STEP 15: tfs_create(sb, "bA");

This output shows the 16 steps taken to reach an assertion violation on line 252 of super.c, when we try to remount the file system after the reset. But are all of these steps necessary? Do we really need to cat the file "a./b" for this to happen? We’ll use DeepState’s reducer to find out.

> deepstate-reduce ./Tests failure/dbb393e55c77bac878ab06a02a022370e33761cb.fail failure/shrink.fail

You’ll see output like:

Original test has 8192 bytes
Applied 75 range conversions
Last byte read: 307
Shrinking to ignore unread bytes
Writing reduced test with 308 bytes to failure/shrink.fail
Iteration #1 0.18 secs / 2 execs / 0.0% reduction
Structured deletion reduced test to 304 bytes
Writing reduced test with 304 bytes to failure/shrink.fail
0.36 secs / 3 execs / 1.3% reduction
Structured deletion reduced test to 272 bytes
Writing reduced test with 272 bytes to failure/shrink.fail
0.5 secs / 4 execs / 11.69% reduction
Structured deletion reduced test to 228 bytes
Writing reduced test with 228 bytes to failure/shrink.fail
0.6 secs / 5 execs / 25.97% reduction
1-byte chunk removal: PASS FINISHED IN 0.24 SECONDS, RUN: 1.45 secs / 57 execs / 95.78% reduction
4-byte chunk removal: PASS FINISHED IN 0.08 SECONDS, RUN: 1.53 secs / 70 execs / 95.78% reduction
8-byte chunk removal: PASS FINISHED IN 0.08 SECONDS, RUN: 1.61 secs / 83 execs / 95.78% reduction
1-byte reduce and delete: PASS FINISHED IN 0.02 SECONDS, RUN: 1.62 secs / 86 execs / 95.78% reduction
4-byte reduce and delete: PASS FINISHED IN 0.01 SECONDS, RUN: 1.64 secs / 88 execs / 95.78% reduction
8-byte reduce and delete: PASS FINISHED IN 0.01 SECONDS, RUN: 1.64 secs / 89 execs / 95.78% reduction
Byte range removal: PASS FINISHED IN 0.31 SECONDS, RUN: 1.96 secs / 141 execs / 95.78% reduction
Structured swap: PASS FINISHED IN 0.01 SECONDS, RUN: 1.96 secs / 142 execs / 95.78% reduction
Byte reduce: PASS FINISHED IN 0.1 SECONDS, RUN: 2.06 secs / 159 execs / 95.78% reduction
Iteration #2 2.06 secs / 159 execs / 95.78% reduction
Structured deletion: PASS FINISHED IN 0.01 SECONDS, RUN: 2.08 secs / 161 execs / 95.78% reduction
Structured edge deletion: PASS FINISHED IN 0.01 SECONDS, RUN: 2.09 secs / 163 execs / 95.78% reduction
Completed 2 iterations: 2.09 secs / 163 execs / 95.78% reduction
Padding test with 23 zeroes
Writing reduced test with 36 bytes to failure/shrink.fail

After a few seconds, we can run the new, smaller test case:

> ./Tests --input_test_file failure/shrink.fail

TRACE: Tests.cpp(155): STEP 0: set_reset_countdown(4);
TRACE: Tests.cpp(121): STEP 1: tfs_create(sb, "aaaaa");
CRITICAL: /home/user/testfs/super.c(252): Assertion (testfs_inode_get_type(in) == I_FILE) || (testfs_inode_get_type(in) == I_DIR) failed in function int testfs_checkfs(struct super_block *, struct bitmap *, struct bitmap *, int)
ERROR: Failed: TestFs_FilesDirs
ERROR: Test case failure/shrink.fail failed

To “break” testfs under reset interruption, we need only cause a reset at the fourth write to the block when the file "aaaaa" is created. Using a debugger or logging statements to understand this behavior will obviously be a much more pleasant experience than with the original test case.

For this bug, we only needed to give deepstate-reduce:

  • the DeepState harness executable (./Tests),
  • the test case to reduce (dbb393e55c77bac878ab06a02a022370e33761cb.fail), and
  • the new, reduced test case to generate (failure/shrink.fail).

But sometimes we’ll need to provide more information. For example, a test case may “change bugs” during reduction if all we require is that the reduced test fails. The reducer can take an additional requirement, in the form of a string or regular expression that must appear in the output, or a required exit code.

Reducing with respect to code coverage

Reducers can also be used in the absence of failure. For example, we might want to take a complicated test case that covers a hard-to-reach line of code—perhaps generated by a fuzzer or symbolic execution tool—and make it easier to follow. A good reducer can do that too, by using a revised definition of what we consider an interesting test case.

First, we’ll generate some passing tests:

> mkdir coverage
> ./Tests --fuzz --output_test_dir coverage --timeout 10 --fuzz_save_passing --seed 1

This creates a file, coverage/659042175e31c125dfb1182404526b7c10d53ec8.pass, that produces a file system that mounts, but has inconsistent inode and block freemaps. We could just use --criterion to reduce the test case with respect to that interesting fact alone, but let’s produce a test case that retains all the code coverage of our test.

First, we compile a version of the harness and testfs with code coverage instrumentation enabled:

> clang -c -fprofile-instr-generate -fcoverage-mapping *.c
> clang++ -o TestsCov Tests.cpp -fprofile-instr-generate -fcoverage-mapping -ldeepstate *.o

Next we run our test with coverage instrumentation and collect the results:

> rm default.profraw
> ./TestsCov --input_test_file coverage/659042175e31c125dfb1182404526b7c10d53ec8.pass
> llvm-profdata-6.0 merge -o testscov.profdata default.profraw
> llvm-cov-6.0 show ./TestsCov -instr-profile=testscov.profdata *.c Tests.cpp  >& covout

Then we can use the reducer to produce a new test with the same coverage:

> deepstate-reduce python coverage/659042175e31c125dfb1182404526b7c10d53ec8.pass coverage/smallcov.pass --cmdArgs "checkCov.py @@" --exitCriterion 0
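The interestingness check itself can be very simple. The sketch below is a hypothetical Go equivalent of the core of such a check (checkCov.py is the real script used above; coversAll and the file arguments here are our own invention): exiting with code 0, matching --exitCriterion 0, means the candidate still covers everything the original did.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// coversAll reports whether every non-empty line of the golden coverage
// report also appears in the candidate report. A reduced test case is
// "interesting" only if it preserves all of the original coverage.
func coversAll(golden, candidate string) bool {
	seen := make(map[string]bool)
	for _, line := range strings.Split(candidate, "\n") {
		seen[strings.TrimSpace(line)] = true
	}
	for _, line := range strings.Split(golden, "\n") {
		line = strings.TrimSpace(line)
		if line != "" && !seen[line] {
			return false
		}
	}
	return true
}

func main() {
	golden, err1 := os.ReadFile(os.Args[1])
	candidate, err2 := os.ReadFile(os.Args[2])
	if err1 != nil || err2 != nil {
		fmt.Fprintln(os.Stderr, "could not read coverage reports")
		os.Exit(2)
	}
	if !coversAll(string(golden), string(candidate)) {
		os.Exit(1) // coverage lost: not an acceptable reduction
	}
	os.Exit(0) // coverage preserved: satisfies --exitCriterion 0
}
```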

Now the reducer runs a Python script to determine how interesting the results are. It will take a lot longer to finish, since preserving coverage is usually harder than preserving a specific failure. Comparing the file system calls made in the two test cases (which, remember, have identical coverage), we can see what this additional overhead has brought us. The original test case does all this…

TRACE: Tests.cpp(95): STEP 0: tfs_mkdir(sb, "B");
TRACE: Tests.cpp(95): STEP 1: tfs_mkdir(sb, "B/Aa/aabA.");
TRACE: Tests.cpp(115): STEP 2: tfs_lsr(sb);
TRACE: Tests.cpp(121): STEP 3: tfs_create(sb, "AbB/BAb");
TRACE: Tests.cpp(155): STEP 4: set_reset_countdown(4);
TRACE: Tests.cpp(158): STEP 5: set_reset_countdown() ignored; already set
TRACE: Tests.cpp(95): STEP 6: tfs_mkdir(sb, ".BB");
TRACE: Tests.cpp(146): STEP 7: tfs_cat(sb, "b./aBa..");
TRACE: Tests.cpp(103): STEP 8: tfs_rmdir(sb, "");
TRACE: Tests.cpp(140): STEP 9: tfs_stat(sb, "bbbA/.abA");
TRACE: Tests.cpp(95): STEP 10: tfs_mkdir(sb, "..a");
TRACE: Tests.cpp(110): STEP 11: tfs_ls(sb);
TRACE: Tests.cpp(110): STEP 12: tfs_ls(sb);
TRACE: Tests.cpp(95): STEP 13: tfs_mkdir(sb, "A.");
TRACE: Tests.cpp(155): STEP 14: set_reset_countdown(5);
TRACE: Tests.cpp(158): STEP 15: set_reset_countdown() ignored; already set
TRACE: Tests.cpp(140): STEP 16: tfs_stat(sb, "B..Aaab./b");
TRACE: Tests.cpp(158): STEP 17: set_reset_countdown() ignored; already set
TRACE: Tests.cpp(140): STEP 18: tfs_stat(sb, "/AaA");
TRACE: Tests.cpp(146): STEP 19: tfs_cat(sb, ".");

…whereas the reduced test case makes API calls only when required to cover the code, and uses a far less bizarre set of paths:

TRACE: Tests.cpp(95): STEP 0: tfs_mkdir(sb, "B");
TRACE: Tests.cpp(95): STEP 1: tfs_mkdir(sb, "aa");
TRACE: Tests.cpp(95): STEP 2: tfs_mkdir(sb, "aaa");
TRACE: Tests.cpp(103): STEP 3: tfs_rmdir(sb, "");
TRACE: Tests.cpp(155): STEP 4: set_reset_countdown(3);
TRACE: Tests.cpp(95): STEP 5: tfs_mkdir(sb, "B/aa/a");
TRACE: Tests.cpp(110): STEP 6: tfs_ls(sb);
TRACE: Tests.cpp(115): STEP 7: tfs_lsr(sb);
TRACE: Tests.cpp(121): STEP 8: tfs_create(sb, "aaaa");
TRACE: Tests.cpp(140): STEP 9: tfs_stat(sb, "a");
TRACE: Tests.cpp(146): STEP 10: tfs_cat(sb, "a");
TRACE: Tests.cpp(155): STEP 11: set_reset_countdown(1);
TRACE: Tests.cpp(158): STEP 12: set_reset_countdown() ignored; already set
TRACE: Tests.cpp(146): STEP 13: tfs_cat(sb, "aaa");
TRACE: Tests.cpp(95): STEP 14: tfs_mkdir(sb, "");
TRACE: Tests.cpp(95): STEP 15: tfs_mkdir(sb, "");
TRACE: Tests.cpp(95): STEP 16: tfs_mkdir(sb, "");
TRACE: Tests.cpp(95): STEP 17: tfs_mkdir(sb, "");
TRACE: Tests.cpp(95): STEP 18: tfs_mkdir(sb, "");
TRACE: Tests.cpp(95): STEP 19: tfs_mkdir(sb, "");

Steps 15-19 can be ignored; they simply reflect that our file system tests always perform 20 steps, unless there is a failure before that point. It’s clear now that the multiple calls to cat, ls, and so forth in the original version were redundant. Taking this test, adding assertions about expected return values, and maintaining the resulting high-coverage unit test would be a much more reasonable task than dealing with the chaos produced by the fuzzer alone. It’s also obvious that steps 11 and 12 are not important to include in a unit test, because the reset never happens. You only really need to understand and write asserts for this trace:

TRACE: Tests.cpp(95): STEP 0: tfs_mkdir(sb, "B");
TRACE: Tests.cpp(95): STEP 1: tfs_mkdir(sb, "aa");
TRACE: Tests.cpp(95): STEP 2: tfs_mkdir(sb, "aaa");
TRACE: Tests.cpp(103): STEP 3: tfs_rmdir(sb, "");
TRACE: Tests.cpp(155): STEP 4: set_reset_countdown(3);
TRACE: Tests.cpp(95): STEP 5: tfs_mkdir(sb, "B/aa/a");
TRACE: Tests.cpp(110): STEP 6: tfs_ls(sb);
TRACE: Tests.cpp(115): STEP 7: tfs_lsr(sb);
TRACE: Tests.cpp(121): STEP 8: tfs_create(sb, "aaaa");
TRACE: Tests.cpp(140): STEP 9: tfs_stat(sb, "a");
TRACE: Tests.cpp(146): STEP 10: tfs_cat(sb, "a");
TRACE: Tests.cpp(146): STEP 13: tfs_cat(sb, "aaa");
TRACE: Tests.cpp(95): STEP 14: tfs_mkdir(sb, "");

Coverage isn’t the only property other than failure that we can preserve. Recent work on automatic resource-adaptation for software reduces actual programs, not just their test cases. Reduced programs do less, but use fewer resources and still pass all important tests. We are investigating producing versions of a program that avoid calling insecure APIs only required for optional functionality, thus reducing the attack surface of the system.

How does it do that?

To a user, test-case reduction can appear to be magic. You give a reducer a huge, extremely complex test case, and, after some time, it produces a much smaller test case that achieves the same goal. In many cases, the process doesn’t take very long, and the resulting test case is, as far as you can see, not even a simple chunk of the original test case. How does this happen?

At a high level, almost all approaches to test-case reduction use some variant of an extremely simple algorithm:

Given a test case, T_INIT, to reduce, and a function PRED(test) that checks if a test case is a valid reduction (e.g., if it fails in the same way as T_INIT or covers the same source code), we can reduce T_INIT as follows:

1. Set T_CURR = T_INIT.
2. Let T_NEXT = a new variation of T_CURR (one that has not yet been tried).
If there are no new variations of T_CURR, stop and report T_CURR as the reduced test case.
3. Add T_NEXT to the set of variations that have been tried.
4. If PRED(T_NEXT) holds, set T_CURR = T_NEXT.
5. Go to 2.

The devil is in the details, particularly those of step two. A completely naive approach to test-case reduction would consider all possible subsets or subsequences of T_CURR as the set of variations. This would be good and thorough except for the minor problem that it would take approximately forever to reduce even fairly small test cases.

There are also some subtle points about the design of the reduction criteria function, PRED, for systems that have more than one bug (which, in reality, includes most complex software). If the definition of “fails in the same way as T_INIT” is too restrictive, we’re unlikely to get much reduction. For example, forcing the exact same failure output might be a bad idea when an assertion includes some input value, or references a position in T_INIT or an address in memory.

On the other hand, if our criteria are too weak, we may run into slippage. Slippage is the annoying phenomenon where a test case for a complex, subtle new bug reduces to a test case for a simple, and likely already known, bug. Avoiding slippage using reduction criteria requires some intuition about bugs and the system under test. More sophisticated approaches to avoiding slippage require modifying the reduction algorithm, and are related to the idea that “test-case reduction is fuzzing,” a topic beyond the scope of this blog post. Here, we’re going to assume you have PRED, and it does what you want it to do.

A good first step: Modified binary search

Rather than considering all possible subsets of a test case as potential shorter versions, we can divide and conquer through a modified binary search, which is the heart of an approach called delta-debugging. First, we try to see if just the first half of our test case satisfies our reduction criterion. If not, we try the second half of the test case. If neither satisfies PRED, what do we do? Well, we can just increase the granularity, and try removing quarters of the original test case, then eighths, and so on, until we’re trying to remove whatever our “atomic” parts are, and we can’t do anything else.

The DeepState reducer

DeepState’s reducer doesn’t use Zeller’s delta-debugging binary search strategy, which is often less effective than simpler greedy approaches. It instead produces variants of a test case by applying a series of passes similar to compiler passes:

  • Structured deletions: DeepState inputs are not always treated as uninterpreted byte buffers. DeepState knows when the bytes read within the context of a single OneOf call begin and end, and tries to delete such blocks accordingly. While the files themselves do not contain any hints of this structure, DeepState actually runs each test case T_CURR and dynamically extracts the structure. Deletions according to OneOf boundaries often correspond to removing a single function call, e.g., reducing foo(); bar(); foo(); to foo(); foo();. Using as much structure as possible to make likely meaningful reductions is a key optimization in reduction methods. In addition to DeepState structure, our reducer also knows common delimiters, such as whitespace, quotes, commas, and brackets, used in source code, XML, JSON, and so forth.
  • Structure edge deletions: This pass tries to remove the boundaries between structured parts of the test, e.g., merging two adjacent code blocks into one.
  • One-, four-, and eight-byte removals: This pass is the first that doesn’t use any structure from the test. It just tries deleting single bytes, then four-byte sequences, then eight-byte sequences (e.g., string components, 32-bit values, and 64-bit values).
  • One-, four-, and eight-byte reduce and deletes: This pass reduces the value of a single byte by one, then tries to delete a sequence of bytes after the reduced byte. This covers the common case where a value specifies the length of some field, or number of times to run a loop, and then the following bytes are the actual values.
  • Byte-range removals: This pass tries removing increasingly larger byte-ranges, skipping over the one, four, and eight special cases that were already tried. DeepState has a user-configurable upper limit (16 by default) on the size of the range. This is the brute-force core of the reducer.
  • Structure swaps: This is the first pass that cannot reduce test length. It tries to swap adjacent structures whenever this makes the test’s bytes more sorted.
  • Byte reductions: This pass takes each byte of a test case and tries to make it as close to zero as possible.
  • Byte pattern search: By default, DeepState doesn’t apply this expensive pass, but when the reducer is asked to aim for maximum reduction, it will search for all two-byte patterns that appear more than once, and try to remove part of the pattern in multiple places at once. The idea is that if two function calls need the same input bytes (e.g., a filename), this can replace both with a smaller matching input. The original byte patterns may be larger than two bytes, but repeated application of the pass can handle the largest patterns.

DeepState repeatedly interleaves these passes, using later passes only when the earlier passes in the list are no longer helping with reduction, until no reduction is possible or the time allowed for reduction has been exceeded. While DeepState’s reducer is best at reducing DeepState tests, the ability to run an arbitrary program to check for interestingness means you can also use it to reduce other kinds of test cases.
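To make the pass structure concrete, here is a minimal sketch in the spirit of the byte-reduction pass described above (simplified, and not DeepState's actual implementation): it never changes the length of the test, only lowers byte values toward zero, improving the test in shortlex terms.

```go
package main

import "fmt"

// byteReducePass tries, for each position, to set the byte to zero
// outright, and otherwise to decrement it as far as the interestingness
// predicate allows.
func byteReducePass(t []byte, pred func([]byte) bool) []byte {
	out := append([]byte{}, t...)
	for i := range out {
		for out[i] > 0 {
			candidate := append([]byte{}, out...)
			candidate[i] = 0 // try the biggest reduction first
			if pred(candidate) {
				out = candidate
				break
			}
			candidate[i] = out[i] - 1 // fall back to decrementing by one
			if !pred(candidate) {
				break // this byte is as small as it can get
			}
			out = candidate
		}
	}
	return out
}

func main() {
	// Toy criterion: the test stays interesting as long as the middle byte is >= 3.
	pred := func(t []byte) bool { return len(t) == 3 && t[1] >= 3 }
	fmt.Println(byteReducePass([]byte{9, 7, 5}, pred)) // prints [0 3 0]
}
```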

Six test-case reducers you should know about

If you’re interested in test-case reduction, you should at least know about these reducers in addition to DeepState:

  • Delta-debugging in Python. I personally used this, Andreas Zeller’s original implementation, to reduce thousands of test cases for the file system on the Curiosity Mars Rover, and it made understanding and debugging even the nastiest bugs discovered by random-testing a flash file system fairly easy.
  • CReduce. This is widely used in compiler testing and was a major influence on the DeepState reducer. It can reduce more than just C, and is a recent entry in a long line of more structure-aware reducers, starting with the HDD (Hierarchical Delta-Debugging) algorithm. The CReduce paper and John Regehr’s blog posts are good resources to learn more about CReduce.
  • Hypothesis. Hypothesis is a popular Python test generation tool informed by David MacIver’s ideas about test reduction. A key insight of MacIver’s is that reducers should order tests in shortlex order, not only considering shorter tests better, but preferring equal-length tests that are “simpler.” DeepState also focuses on shortlex ordering. I haven’t yet tried David’s brand-new tool, ShrinkRay, but given what I know of his shrinker savvy, it’s likely to be very good.
  • QuickCheck. QuickCheck’s success in finding bugs in Haskell programs arguably started the modern renaissance in random testing, which, with DeepState, is beginning to merge with the renaissance in fuzzing. QuickCheck “shrunk” generated inputs to produce small counterexamples, and its successors continue or expand on this tradition.
  • Google’s halfempty. While it wasn’t the first test-case reducer to try to use parallelism, halfempty is the most focused on getting the most out of multiple cores when reducing huge inputs.
  • TSTL. This Python property-based testing system introduced the idea of normalization, which works to make test cases simpler as well as shorter. For instance, normalization will replace large integers with smaller values, and group API calls together to improve readability. Normalization is a formalization of one approach to MacIver’s emphasis on shortlex ordering.

As it turns out, normalization also helps with reduction. Test-case reducers can easily get stuck in local minima, where no simple changes can reduce test size. Normalization can “unstick” things: The simplification often removes obstacles to reduction. Once anything resembling normalization is on the table, including many passes in CReduce, Hypothesis, or DeepState, a “reduced” test case may not overlap with the original test case at all.

TSTL and SmartCheck also use normalization to assist in generalization, an approach that takes a failing test case and distinguishes the essential aspects of the test case (“this function call must have zero for this parameter”) from the accidental aspects (“any number will do here, zero is just a really simple one”) in order to make debugging and understanding even easier. A generalization tool for DeepState is on our to-do list.

While the above reducers are powerful and important, you may want to primarily use DeepState’s own reducer when reducing DeepState tests, because it’s the only one that uses dynamic feedback about structure from DeepState test execution. For extremely large inputs, it may be a good idea to apply halfempty or ShrinkRay first, to take advantage of multiple cores; then you can use DeepState to go the last mile.


Test-case reduction is a powerful tool to have in your testing utility belt, and is constantly operating behind the scenes in fuzzers like AFL and libFuzzer. Knowing all the things that test-case reduction can do for you can improve the effectiveness of your tests and make debugging a much more pleasant task. DeepState includes a state-of-the-art reduction tool, and we encourage you to play with it, using reduction criteria of your own invention.

We’re always developing new tools to make finding and fixing issues easier. Need help with your next project? Contact us!

Security assessment techniques for Go projects

The Trail of Bits Assurance practice has received an influx of Go projects, following the success of our Kubernetes assessment this summer. As a result, we’ve been adapting for Go projects some of the security assessment techniques and tactics we’ve used with other compiled languages.

We started by understanding the design of the language, identifying areas where developers may not fully understand the functionality of a language semantic. Many of these misunderstood semantics originated from findings we reported to our clients and independent research into the language itself. While not exhaustive, some of these problem areas include scoping, coroutines, error handling, and dependency management. Notably, many of these are not directly related to the runtime. The Go runtime itself is designed to be safe by default, preventing many C-like vulnerabilities.

With a better understanding of the root causes, we searched for existing tooling to help us quickly and effectively instrument client codebases. The result was a sample of static and dynamic open-source tools, including several that were Go-agnostic. To complement these tools, we also identified several compiler configurations that help with instrumentation.

Static analysis

Because Go is a compiled language, the compiler detects and prevents many potentially erroneous patterns before the binary executable is even produced. While this is a major annoyance for newer Go developers, these warnings are extremely important in preventing unexpected behavior and keeping code clean and readable.

Static analysis tends to catch a lot of very low hanging fruit not included in compiler errors and warnings. Within the Go ecosystem, there are many disparate tools such as go-vet, staticcheck, and those within the analysis package. These tools typically identify problems like variable shadowing, unsafe pointer use, and unused function return values. Investigating the areas of a project where these tools display warnings typically leads to exploitable functionality.

These tools are by no means perfect. For example, go-vet can miss very common accidents like the one below, where the A function’s err return value is unused, and immediately reassigned during the assignment of bSuccess on the left-hand side of the expression. The compiler will not provide a warning, and go-vet does not detect this; nor does errcheck. In fact, the tools that successfully identify this case (non-exhaustive) are the aforementioned staticcheck and ineffassign, which identify the err return value of A as unused or ineffectual.

package main

import "fmt"

func A() (bool, error) { return false, fmt.Errorf("I get overridden!") }

func B() (bool, error) { return true, nil }

func main() {
	aSuccess, err := A()
	bSuccess, err := B()
	if err != nil {
		fmt.Println("Error:", err)
		return
	}
	fmt.Println(aSuccess, ":", bSuccess)
}

Figure 1: An example program showing an ineffectual assignment of err tricking go-vet and errcheck into considering err as checked.

$ go run .
false : true
$ errcheck .
$ go vet .
$ staticcheck .
main.go:5:50: error strings should not be capitalized (ST1005)
main.go:5:50: error strings should not end with punctuation or a newline (ST1005)
main.go:10:12: this value of err is never used (SA4006)
$ ineffassign .
<snip>/main.go:10:12: ineffectual assignment to err

Figure 2: The output of the example program, along with errcheck, go-vet, staticcheck, and ineffassign.

When you look deeper into this example, you may wonder why the compiler does not warn on this problem. The Go compiler will error when variables are not used within a program, but this example successfully compiles. This is caused by the semantics of the “short variable declaration.”

ShortVarDecl = IdentifierList ":=" ExpressionList .

Figure 3: The grammar specification of the “short variable declaration.”

According to the specification, the short variable declaration has the special ability to redeclare variables as long as:

  • The redeclaration is in a multi-variable short declaration.
  • The redeclared variable is declared earlier in the same block or function’s parameter list.
  • The redeclared variable is of the same type as the previous declaration.
  • At least one non-blank variable in the declaration is new.

All of these constraints hold in the previous example, preventing the compiler from producing errors for this problem.
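A minimal illustration of these rules (pair is a hypothetical function used only for this example):

```go
package main

import "fmt"

func pair() (int, error) { return 42, nil }

func main() {
	x, err := pair() // declares both x and err
	y, err := pair() // legal: err is redeclared, because y is new, and err
	                 // has the same type and was declared in the same block
	// err := fmt.Errorf("again") // would NOT compile: no new variables
	//                            // on the left side of :=
	if err != nil {
		fmt.Println("unexpected error:", err)
	}
	fmt.Println(x, y) // prints 42 42
}
```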

Many tools have edge cases like this where they are unsuccessful in identifying related issues, or identify an issue but describe it differently. Compounding the problem, these tools often require building the Go source code before analysis can be performed. This makes third-party security assessments complicated if the analysts cannot easily build the codebase or its dependencies.

Despite these pitfalls, when put together, the available tools can provide good hints as to where to look for problems within a given project, with just a little bit of effort. We recommend using gosec, go-vet, and staticcheck, at a minimum. These have the best documentation and ergonomics for most codebases. They also provide a wide variety of checks (such as ineffassign or errcheck) for common issues, without getting too specific. For more in-depth analysis of a particular type of issue, however, one might have to use the more specific analyzers, develop custom tooling directly against the SSA, or use Semmle.

Dynamic analysis

Once static analysis has been performed and the results have been reviewed, dynamic analysis techniques are typically the next step for deeper results. Due to Go’s memory safety, the problems normally found with dynamic analysis are those that result in a hard crash or an invalidation of program state. Various tools and approaches have been built to help identify these types of issues within the Go ecosystem. Additionally, it’s possible to retrofit existing language-agnostic tooling for the dynamic testing of Go software, which we show next.


The best-known dynamic testing tool in the Go space is likely Dmitry Vyukov's implementation of dvyukov/go-fuzz. This tool allows you to quickly and effectively implement mutational fuzzing. It even has an extensive wall of trophies. More advanced users may also find the distributed fuzzing and libFuzzer support useful when hunting for bugs.

Google also produced a more primitive fuzzer with a confusingly similar name, google/gofuzz, that assists users by initializing structures with random values. Unlike Dmitry's go-fuzz, Google's gofuzz does not generate a harness or assist with storing crash output, fuzzed input, or any other type of information. While this can be a downside for testing some targets, it makes for a lightweight and extensible framework.

For the sake of brevity, we refer you to examples of both tools in their respective READMEs.

Property testing

Diverging from more traditional fuzzing approaches, Go’s testing package (typically used for unit and integration testing) provides the testing/quick sub-package for “black box testing” of Go functions. In other terms, it is a basic primitive for property testing. Given a function and generator, the package can be used to build a harness to test for potential property violations given the range of the input generator. The following example is pulled directly from the documentation.

func TestOddMultipleOfThree(t *testing.T) {
	f := func(x int) bool {
		y := OddMultipleOfThree(x)
		return y%2 == 1 && y%3 == 0
	}
	if err := quick.Check(f, nil); err != nil {
		t.Error(err)
	}
}

Figure 4: The OddMultipleOfThree function is being tested, where its return value should always be an odd multiple of three. If it’s not, the f function will return false and the property will be violated. This is detected by the quick.Check function.

While the functionality provided by this package is acceptable for simple applications of property testing, important properties do not often fit well into such a basic interface. To address these shortcomings, the leanovate/gopter framework was born. Gopter provides a wide variety of generators for the common Go types, and has helpers to assist you in creating your own generators compatible with Gopter. Stateful tests are also supported through the gopter/commands sub-package, which is useful for testing that properties hold across sequences of actions. Compounding this, when a property is violated, Gopter shrinks the generated inputs. See a brief example of property tests with input shrinking in the output below.

package main_test
import (

type Compute struct {
  A uint32
  B uint32

func (c *Compute) CoerceInt () { c.A = c.A % 10; c.B = c.B % 10; }
func (c Compute) Add () uint32 { return c.A + c.B }
func (c Compute) Subtract () uint32 { return c.A - c.B }
func (c Compute) Divide () uint32 { return c.A / c.B }
func (c Compute) Multiply () uint32 { return c.A * c.B }

func TestCompute(t *testing.T) {
  parameters := gopter.DefaultTestParameters()
  parameters.Rng.Seed(1234) // Just for this example to generate reproducible results

  properties := gopter.NewProperties(parameters)

  properties.Property("Add should never fail.", prop.ForAll(
    func(a uint32, b uint32) bool {
      inpCompute := Compute{A: a, B: b}
      return true
    gen.UInt32Range(0, math.MaxUint32),
    gen.UInt32Range(0, math.MaxUint32),

  properties.Property("Subtract should never fail.", prop.ForAll(
    func(a uint32, b uint32) bool {
      inpCompute := Compute{A: a, B: b}
      return true
    gen.UInt32Range(0, math.MaxUint32),
    gen.UInt32Range(0, math.MaxUint32),
  properties.Property("Multiply should never fail.", prop.ForAll(
    func(a uint32, b uint32) bool {
      inpCompute := Compute{A: a, B: b}
      return true
    gen.UInt32Range(0, math.MaxUint32),
    gen.UInt32Range(0, math.MaxUint32),

  properties.Property("Divide should never fail.", prop.ForAll(
    func(a uint32, b uint32) bool {
      inpCompute := Compute{A: a, B: b}
      return true
    gen.UInt32Range(0, math.MaxUint32),
    gen.UInt32Range(0, math.MaxUint32),


Figure 5: The testing harness for the Compute structure.

user@host:~/Desktop/gopter_math$ go test
+ Add should never fail.: OK, passed 100 tests.
Elapsed time: 253.291µs
+ Subtract should never fail.: OK, passed 100 tests.
Elapsed time: 203.55µs
+ Multiply should never fail.: OK, passed 100 tests.
Elapsed time: 203.464µs
! Divide should never fail.: Error on property evaluation after 1 passed
   tests: Check paniced: runtime error: integer divide by zero
goroutine 5 [running]:
runtime/debug.Stack(0x5583a0, 0xc0000ccd80, 0xc00009d580)
	/usr/lib/go-1.12/src/runtime/debug/stack.go:24 +0x9d
  o:43 +0xeb
panic(0x554480, 0x6aa440)
	/usr/lib/go-1.12/src/runtime/panic.go:522 +0x1b5
_/home/user/Desktop/gopter_math_test.TestCompute.func4(0x0, 0x0)
	/home/user/Desktop/gopter_math/main_test.go:63 +0x3d
# <snip for brevity>

ARG_0: 0
ARG_0_ORIGINAL (1 shrinks): 117380812
ARG_1: 0
ARG_1_ORIGINAL (1 shrinks): 3287875120
Elapsed time: 183.113µs
--- FAIL: TestCompute (0.00s)
    properties.go:57: failed with initial seed: 1568637945819043624
exit status 1
FAIL	_/home/user/Desktop/gopter_math	0.004s

Figure 6: Executing the test harness and observing the output of the property tests, where Divide fails.

Fault injection

Fault injection has been surprisingly effective when attacking Go systems. The most common mistakes we found using this method involve the handling of the error type. Since error is only a type in Go, when it is returned it does not change a program's execution flow on its own like a panic statement would. We identify such bugs by enforcing errors from the lowest level: the kernel. Because Go produces static binaries, faults must be injected without LD_PRELOAD. One of our tools, KRF, allows us to do exactly this.

During our recent assessment of the Kubernetes codebase, the use of KRF provided a finding deep inside a vendored dependency, simply by randomly faulting read and write system calls spawned by a process and its children. This technique was effective against the Kubelet, which commonly interfaces with the underlying system. The bug was triggered when the ionice command was faulted, producing no output to STDOUT and sending an error to STDERR. After the error was logged, execution continued instead of returning the error in STDERR to the caller. This results in STDOUT later being indexed, causing an index out of range runtime panic.

E0320 19:31:54.493854    6450 fs.go:591] Failed to read from stdout for cmd [ionice -c3 nice -n 19 du -s /var/lib/docker/overlay2/bbfc9596c0b12fb31c70db5ffdb78f47af303247bea7b93eee2cbf9062e307d8/diff] - read |0: bad file descriptor
panic: runtime error: index out of range

goroutine 289 [running]:
k8s.io/kubernetes/vendor/github.com/google/cadvisor/fs.GetDirDiskUsage(0xc001192c60, 0x5e, 0x1bf08eb000, 0x1, 0x0, 0xc0011a7188)
    /workspace/anago-v1.13.4-beta.0.55+c27b913fddd1a6/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/github.com/google/cadvisor/fs/fs.go:600 +0xa86
k8s.io/kubernetes/vendor/github.com/google/cadvisor/fs.(*RealFsInfo).GetDirDiskUsage(0xc000bdbb60, 0xc001192c60, 0x5e, 0x1bf08eb000, 0x0, 0x0, 0x0)
    /workspace/anago-v1.13.4-beta.0.55+c27b913fddd1a6/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/github.com/google/cadvisor/fs/fs.go:565 +0x89
k8s.io/kubernetes/vendor/github.com/google/cadvisor/container/common.(*realFsHandler).update(0xc000ee7560, 0x0, 0x0)
    /workspace/anago-v1.13.4-beta.0.55+c27b913fddd1a6/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/github.com/google/cadvisor/container/common/fsHandler.go:82 +0x36a
    /workspace/anago-v1.13.4-beta.0.55+c27b913fddd1a6/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/github.com/google/cadvisor/container/common/fsHandler.go:120 +0x13b
created by
    /workspace/anago-v1.13.4-beta.0.55+c27b913fddd1a6/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/github.com/google/cadvisor/container/common/fsHandler.go:142 +0x3f

Figure 7: The shortened callstack of the resulting Kubelet panic.

stdoutb, souterr := ioutil.ReadAll(stdoutp)
if souterr != nil {
	klog.Errorf("Failed to read from stdout for cmd %v - %v", cmd.Args, souterr)
}

Figure 8: The logging of STDERR without returning the error to the caller.

usageInKb, err := strconv.ParseUint(strings.Fields(stdout)[0], 10, 64)

Figure 9: The attempted indexing of STDOUT, even though it is empty. This is the cause of the runtime panic.
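Stripped of the cadvisor specifics, the bug pattern is easy to reproduce in isolation. In this sketch, runIonice is a hypothetical stand-in for the faulted subprocess (empty stdout, error returned), and diskUsageKb mirrors the log-but-don't-return mistake:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// runIonice simulates the KRF-faulted subprocess: the read syscall failed,
// so stdout is empty and an error comes back.
func runIonice() (stdout string, err error) {
	return "", fmt.Errorf("read |0: bad file descriptor")
}

// diskUsageKb reproduces the bug: the error is logged but never returned,
// so execution falls through and indexes the empty stdout.
func diskUsageKb() (uint64, error) {
	out, serr := runIonice()
	if serr != nil {
		fmt.Println("Failed to read from stdout:", serr) // logged, not returned
	}
	fields := strings.Fields(out)               // empty slice: no output
	return strconv.ParseUint(fields[0], 10, 64) // index out of range panic
}

func main() {
	defer func() {
		if r := recover(); r != nil {
			fmt.Println("panic:", r) // in the Kubelet, nothing recovered this
		}
	}()
	diskUsageKb()
}
```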

For a more complete walkthrough containing reproduction steps, our Kubernetes Final Report details the use of KRF against the Kubelet in Appendix G (pg. 109).

Go’s compiler also allows instrumentation to be included in a binary, which permits detection of race conditions at runtime. This is extremely useful for identifying potentially exploitable races as an attacker, but it can also be leveraged to identify incorrect handling of defer, panic, and recover. We built trailofbits/on-edge to do exactly this: Identify global state changes between a function entrypoint and the point at which a function panics, and exfiltrate this information through the Go race detector. More in-depth use of OnEdge can be found in our previous blog post, “Panicking the Right Way in Go.”

In practice, we recommend using all of these tools. With the exception of KRF, each requires a bit of effort to apply.

Using the compiler to our advantage

The Go compiler has many built-in features and directives that aid in finding bugs. These features are hidden in and throughout various switches, and require a bit of configuration for our purposes.

Subverting the type system

Sometimes when attempting to test the functionality of a system, the exported functions aren’t what we want to test. Getting testable access to the desired functions may require renaming many of them so they can be exported, which can be burdensome. To help address this problem, the compiler’s build directives can be used to perform name linking, bypassing the access controls provided by the export system. As an example of this functionality, the program below (graciously extracted from a Stack Overflow answer) accesses the unexported reflect.typelinks function and subsequently iterates the type link table to identify types present in the compiled program.

package main

import (
	"fmt"
	"reflect"
	"unsafe"
)

func Typelinks() (sections []unsafe.Pointer, offset [][]int32) {
	return typelinks()
}

//go:linkname typelinks reflect.typelinks
func typelinks() (sections []unsafe.Pointer, offset [][]int32)

func Add(p unsafe.Pointer, x uintptr, whySafe string) unsafe.Pointer {
	return add(p, x, whySafe)
}

//go:linkname add reflect.add
func add(p unsafe.Pointer, x uintptr, whySafe string) unsafe.Pointer

func main() {
	sections, offsets := Typelinks()
	for i, base := range sections {
		for _, offset := range offsets[i] {
			typeAddr := Add(base, uintptr(offset), "")
			typ := reflect.TypeOf(*(*interface{})(unsafe.Pointer(&typeAddr)))
			fmt.Println(typ) // print each type discovered in the type link table
		}
	}
}

Figure 10: A generalized version of the Stack Overflow answer, using the linkname build directive.

$ go run main.go 

Figure 11: The output of the typelinks table.

In situations where you need even more granular control at runtime (i.e., more than just the linkname directive), you can write in Go’s intermediate assembly and include it during compilation. While it may be incomplete and slightly out of date in some places, the teh-cmc/go-internals repository provides a great introduction to how Go assembles functions.

Compiler-generated coverage maps

To help with testing, the Go compiler can perform preprocessing to generate coverage information. This is intended for identifying unit and integration testing coverage information, but we can also use it to identify coverage generated by our fuzzing and property testing. Filippo Valsorda provides a simple example of this in a blog post.

Type-width safety

Go has support for automatically determining the size of integers and floating point numbers based on the target platform. However, it also allows for fixed-width definitions, such as int32 and int64. When mixing both automatic and fixed-width sizes, there are opportunities for incorrect assumptions about behaviour across multiple target platforms.

Testing against both 32-bit and 64-bit platform builds of a target will help identify platform-specific problems. These problems tend to be found in areas performing validation, decoding, or type conversion, where improper assumption about source and destination type properties are made. Examples of this were identified in the Kubernetes security assessment, specifically TOB-K8S-015: Overflows when using strconv.Atoi and downcasting the result (pg. 42 in the Kubernetes Final Report), with an example inlined below.

// updatePodContainers updates PodSpec.Containers.Ports with passed parameters.
func updatePodPorts(params map[string]string, podSpec *v1.PodSpec) (err error) {
    port := -1
    hostPort := -1
    if len(params["port"]) > 0 {
        port, err = strconv.Atoi(params["port"]) // <-- this should parse port as strconv.ParseUint(params["port"], 10, 16)
        if err != nil {
            return err
        }
    }
    // (...)
    // Don't include the port if it was not specified.
    if len(params["port"]) > 0 {
        podSpec.Containers[0].Ports = []v1.ContainerPort{
            {
                ContainerPort: int32(port), // <-- this should later just be uint16(port)
            },
        }
    }

Figure 12: An example of downcasting to a fixed-width integer from an automatic-width integer (returned by Atoi).

root@k8s-1:/home/vagrant# kubectl expose deployment nginx-deployment --port 4294967377 --target-port 4294967376
E0402 09:25:31.888983    3625 intstr.go:61] value: 4294967376 overflows int32
goroutine 1 [running]:
runtime/debug.Stack(0xc000e54eb8, 0xc4f1e9b8, 0xa3ce32e2a3d43b34)
	/usr/local/go/src/runtime/debug/stack.go:24 +0xa7
k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/intstr.FromInt(0x100000050, 0xa, 0x100000050, 0x0, 0x0)
service/nginx-deployment exposed

Figure 13: The resulting overflow from incorrect type-width assumptions.
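Following the inline recommendations in Figure 12, a width-checked parse avoids the overflow entirely. The sketch below is illustrative only (parsePort is a hypothetical helper, not the actual Kubernetes patch): parsing directly into the destination width makes out-of-range values an error rather than a silent truncation.

```go
package main

import (
	"fmt"
	"strconv"
)

// parsePort rejects values outside the uint16 range during parsing,
// avoiding the silent truncation that occurs when an Atoi result is
// downcast directly to a fixed-width type.
func parsePort(s string) (uint16, error) {
	v, err := strconv.ParseUint(s, 10, 16) // bitSize 16 bounds the result
	if err != nil {
		return 0, err
	}
	return uint16(v), nil
}

func main() {
	if _, err := parsePort("4294967376"); err != nil {
		fmt.Println("rejected:", err) // the value from Figure 13 is now an error
	}
}
```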

In practice, the type system subversion is rarely necessary. Most interesting targets for testing are already exported, available through traditional imports. We recommend using this only when helpers and similar unexported functions are required for testing. As for testing type-width safety, we recommend compiling against all targets when possible, even if it is not directly supported, since problems may be more apparent on different targets. Finally, we recommend generating coverage reports on projects with unit and integration tests, at a minimum. It helps identify areas that are not directly tested, which can be prioritized for review.

A note about dependencies

In languages such as JavaScript and Rust, dependency managers have built-in support for dependency auditing—scanning project dependencies for versions known to have vulnerabilities. In Go, no such tool exists, at least not in a publicly available and non-experimental state.

This lack likely stems from the fact that there are many different methods of dependency management: go-mod, go-get, vendored, etc. These various methods use radically different approaches, resulting in no straightforward way to universally identify dependencies and their versions. Furthermore, in some cases it is common for developers to subsequently modify their vendored dependency source code.

The problem of dependency management has progressed over the years of Go’s development, and most developers are moving towards the use of go mod. This allows dependencies to be tracked and versioned within a project through the go.mod file, opening the door for future dependency scanning efforts. An example of such an effort can be seen within the OWASP DependencyCheck tool, which has an experimental go mod plugin.


Ultimately, there are quite a few tools available for use within the Go ecosystem. Although mostly disparate, the various static analysis tools help identify “low hanging fruit” within a given project. When looking for deeper concerns, fuzzing, property testing, and fault injection tools are readily available. Compiler configuration subsequently augments the dynamic techniques, making it easier to build harnesses and evaluate their effectiveness.

Interested in seeing these techniques shake out bugs in your Go systems? Trail of Bits can make that happen. Do you want custom analysis built specifically for your organization? We do that too. Contact us!

Two New Tools that Tame the Treachery of Files

Parsing is hard, even when a file format is well specified. But when the specification is ambiguous, it leads to unintended and strange parser and interpreter behaviors that make file formats susceptible to security vulnerabilities. What if we could automatically generate a “safe” subset of any file format, along with an associated, verified parser? That’s our collective goal in Dr. Sergey Bratus’s DARPA SafeDocs program.

But wait—why is parsing hard in the first place? Design decisions like embedded scripting languages, complex context-sensitive grammars, and object models that allow arbitrary dependencies between objects may have looked like good ways to enrich a format, but they increase the attack surface of a parser, leading to forgotten or bypassed security checks, denial of service, privacy leakage, information hiding, and even hidden malicious payloads.

Two examples of this problem are polyglots and schizophrenic files. Polyglots are files that can be validly interpreted as two different formats. Have you ever read a PDF file and then been astonished to discover that it is also a valid ZIP file? Or edited an HTML file only to discover that it is also a Ruby script? Congratulations, you discovered a polyglot. This is not to be confused with schizophrenic files: That’s when two parsers interpret the same file in different ways, e.g., your PDF displays different content depending on whether you opened it in Adobe Acrobat or Foxit Reader, or your HTML page renders differently between Chrome and Internet Explorer.

We’ve developed two new tools that take the pain out of parsing and make file formats safer:

  1. PolyFile: A polyglot-aware file identification utility with manually instrumented parsers that can semantically label the bytes of a file hierarchically; and
  2. PolyTracker: An automated instrumentation framework that efficiently tracks input file taint through the execution of a program.

Collectively, the tools enable Automated Lexical Annotation and Navigation of Parsers, a backronym devised solely for the purpose of referring to them as The ALAN Parsers Project.

Before we get into their details, let’s first talk about why these tools are necessary.

Ceci N’est Pas Un PDF

Please rise and open your hymnals to page 541 for the recitation of chapter 7 verse 6:

…a file has no intrinsic meaning. The meaning of a file—its type, its validity, its contents—can be different for each parser or interpreter.

Ange Albertini

You may be seated.

This talk by Trail of Bits researcher Evan Sultanik gives a number of examples of how polyglots and induced schizophrenia are more than just nifty parlor tricks. For example:

A PDF can even be a valid git repository that, when cloned, contains the LaTeX source code to generate the PDF and a copy of itself. Ange Albertini also has an excellent series of videos introducing funky file tricks.

What does it take to understand a popular file format that has been accreting features (and misfeatures) over 20 years? PDF provides just such a challenge.

  • An embedded Turing complete programming language? 👌
  • Arbitrary chaining of stream decoders that allow for both memory and computational denial of service? 👌
  • Multiple, redundant, and potentially conflicting ways to specify the length of a stream object? 👌
  • Arbitrary data allowed both before and after the file? 👌
  • Numerous ways to steganographically embed data, including arbitrary length binary blobs? 👌
  • A graph-based document object model that allows, and in some cases requires, cycles? 👌
  • A multi-decade history with ambiguous or incomplete specifications resulting in dozens of conflicting implementations, some of which emit non-compliant, malformed documents? 👌
  • The necessity for parser implementations to be resilient to malformations but also free to handle them differently? 👌
  • A specification that has got in the way of creating a formally verified parser in Coq, because Coq could not prove that a parser trying to do its best on checking indirect references would, in fact, terminate on maliciously crafted files? 👌

Challenge accepted! To be fair, PDF is leading the way on defining simpler, reduced subsets of the format. PDF/A, designed to make sure PDF documents remain parseable for long-term preservation, has removed some of these problematic features. Moreover, they are by no means specific to PDF: they are endemic to document formats in general. For example, Microsoft Office’s OOXML has not done better, with severe external entity attacks that have been employed in the wild, not to mention XML-based vulnerabilities like the billion laughs attack. Even parsing JSON is harder than one might think, as is plain old UTF-8.

But surely in the land of Programming Languages, at least, all must be well, since their parsers are automatically generated from unambiguous specifications by classic algorithms and proven tools. Not so much: This Empire Hacking talk gives examples of how a poorly designed language can cause parsing problems even when no malice is involved. One does not simply walk into the shunting yard!

But back to data formats. In view of the challenges above, instead of focusing on the specification, we examine the de facto interpretations of the specification: Parser implementations. Our underlying hypothesis is that the “unsafe” portions of a file format will exist in the symmetric difference of the parsers’ accepted grammars. The portions of the file format to keep are the ones accepted and interpreted equivalently across all implementations.

PolyFile: Ground Truth Labeling of File Semantics

File identification utilities are, by and large, dumb in the sense that they simply compare the file against magic byte signatures of various formats. Moreover, these tools terminate once they find the first match, and do not recursively identify embedded file types or files that do not start at byte offset zero. Once a file is classified, there is typically little to no information about the contents of the file. It’s a PDF, but how many objects does it contain? It’s a ZIP, but what are its file contents?

Our new PolyFile project resolves these issues and provides:

  • Identification of any and all files embedded within the input, not necessarily starting at byte offset zero;
  • Full parsing of file formats for which an instrumented parser is available, emitting a hierarchical semantic mapping of the input’s contents;
  • An interactive file explorer for a human to examine its contents and structure; and
  • Computer-readable output that can be used to assign semantic meaning to each byte of the input file (e.g., byte x corresponds to the first byte in a PDF stream object, and the start of a JPEG/JFIF header).

A fairly ideal file identification utility, n’est-ce pas?

Ange Albertini’s SBuD project comes close in spirit, but currently only supports a couple image formats. Even the popular Unix file command only has support for several hundred file signatures. In contrast, PolyFile has support for over 10,000 file formats, and can recursively identify them in a file, emitting a hierarchical mapping as an extension of the SBuD JSON format. It also has support for semantically labeling files based on Kaitai Struct declarative file format specifications.

Additionally, PolyFile can optionally emit a self-contained HTML file with an interactive hex viewer and semantic labeling explorer. Here is an example of the HTML output from the résumé Evan Sultanik submitted to Trail of Bits. In addition to being a PDF that displays its own MD5 hash, it is a valid Nintendo Entertainment System ROM that, when emulated, is a playable game that displays the MD5 hash of the PDF. It is also a valid ZIP file containing, among other things, a PDF that is a git repository containing its LaTeX source code and a copy of itself.

PolyFile is free and open-source. You can download a copy at: https://github.com/trailofbits/polyfile.

Parser Instrumentation

Now that we have PolyFile to provide ground truth, we need a way to propagate the semantic labels through a parser; the programmatic equivalent of using contrast dye to track blood flow in the brain during a CT scan. We therefore need an automated way to instrument a parser to track those labels, with the goal of associating functions with the byte offsets of the input files on which they operate. Since PolyFile can tell us the semantic meaning behind those offsets, this will let us infer the purpose of the parser’s functions. For example, if a function in a PDF parser always operates on bytes associated with JFIF stream objects, we can assume it is responsible for processing embedded JPEGs.

There are several existing projects to do this sort of taint tracking, using various methods. The best maintained and easiest to use are AUTOGRAM and TaintGrind. However, the former is limited to analysis on the JVM, and the latter Valgrind plugin suffers from unacceptable runtime overhead when tracking as few as several bytes at a time. For example, we ran mutool, a utility in the muPDF project, using TaintGrind over a corpus of medium sized PDFs, and in every case the tool had to be halted after over 24 hours of execution for operations that would normally complete in milliseconds without instrumentation.

At first glance, our goals might seem to be satisfied by AFL-analyze, a tool bundled with the AFL fuzzer. In a sense, our goal is in fact to create its counterpart. AFL-analyze uses fuzzing to reverse engineer a file format from a parser. In our case, we have ground truth about the file format and want to reverse-engineer the parser.

Although intended for fuzzing, Angora’s taint analysis engine has many of the features necessary to track byte indexes during execution. In fact, as is described in the following sections, we build on many of the algorithmic advances of Angora while improving both computational and memory efficiency. Angora is built atop the LLVM Data Flow Sanitizer (dfsan), which we also leverage for PolyTracker. The following section describes dfsan’s operation, limitations, and how we improved upon both dfsan and Angora.

LLVM and the Data Flow Sanitizer

We chose a static instrumentation approach built on LLVM, since this allows us to instrument any parser capable of being compiled with LLVM and eventually instrument closed-source parsers (e.g., by lifting their binaries to LLVM using Remill or McSema).

LLVM has an instrumentation tool for propagating taint called the Data Flow Sanitizer (dfsan), which is also used by Angora. However, dfsan imposes severe limitations on the total number of taints tracked during program execution, which means that, in practice, we could only track a handful of input bytes from a file at once. To see why, consider a parser that performs the following:

fd = fopen("foo.pdf", "rb");
a = fgetc(fd);
b = fgetc(fd);
c = a + b;

In this case, dfsan will taint the variable a with byte offset 0 and the variable b with byte offset 1. The variable c will be tainted by both byte 0 and byte 1. The combinatorial challenge here is that there are 2^n possible taints, where n is the number of bytes in the input file. Therefore, the naïve approach of storing taints in a bitset is infeasible, even for small inputs, and even when using a compressed bitset.

The representational problem is addressed in dfsan by storing taint provenance in a data structure it calls the “union table,” which is a computationally efficient way to store a binary forest of taint unions. Each taint gets a unique 16-bit label. Then, in the example above, where the taint of a is unioned with b to create c, dfsan would record union_table[a’s label][b’s label] = c’s label.

Thereafter, if a and b are ever unioned again, c’s taint label can be reused. This allows constant-time union table checks; however, the table itself requires O(n²) storage. This is very wasteful, since the table will almost always be very sparse. It’s also what necessitates the 16-bit taint labels, since using larger labels would exponentially increase the size of the union table. This means that dfsan can only track, at most, 65,536 taints throughout execution, including all new taints that are created from unions. This is insufficient to track more than a handful of input bytes at a time.

Introducing PolyTracker: Efficient Binary Instrumentation for Universal Taint Tracking

Our novel taint tracking data structures and algorithms—as well as numerous heuristics for reducing computational overhead—are manifested in our new PolyTracker tool. It is a wrapper around clang and clang++ that allows you to instrument any executable. Simply replace your compiler with PolyTracker during your normal build process. The resulting executable will create a JSON file containing a mapping of functions to the input file byte offsets on which each function operates. Moreover, since PolyTracker is built as an LLVM pass, it can be used on any black-box binary that has been lifted to LLVM/IR, even when source code is unavailable.

We maintained dfsan’s concepts of shadow memory and its instrumentation framework for tracking taints. However, we switched away from the union table and implemented a scalable data structure capable of exploiting the inherent sparsity in taint unions. This is augmented by a binary forest of taint unions, supplanting dfsan’s label array and allowing us to increase the size of the taint labels past 16 bits. PolyTracker’s binary forest data structure uses a memory layout algorithm that eliminates the need for an Angora-style taint label to bitvector lookup table, while also providing constant time insertion. This reduces the memory requirements from exponential to linear in the size of the input file plus the number of instructions executed by the parser, at the expense of an O(n log n) graph traversal in post-processing to resolve the taints. In practice, this results in negligible execution overhead for the majority of PDFs.
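The post-processing resolution step can be sketched with a toy model (this is illustrative, not PolyTracker’s actual memory layout): leaf labels name input byte offsets, every later label records the two labels it was unioned from, and resolving a label walks the forest back to the input bytes that influenced it.

```go
package main

import "fmt"

// node records the two labels an internal taint label was unioned from.
type node struct{ left, right uint32 }

const inputLen = 4 // labels 0..3 are leaves tainting input bytes 0..3

// resolve walks from a label back to the set of input byte offsets that
// influenced it; this is the post-processing traversal described above.
func resolve(forest []node, l uint32, out map[uint32]bool) {
	if l < inputLen {
		out[l] = true // reached a leaf: an input byte offset
		return
	}
	n := forest[l-inputLen]
	resolve(forest, n.left, out)
	resolve(forest, n.right, out)
}

func main() {
	// label 4 = union(byte 0, byte 1); label 5 = union(label 4, byte 3)
	forest := []node{{0, 1}, {4, 3}}
	taints := map[uint32]bool{}
	resolve(forest, 5, taints)
	fmt.Println(len(taints)) // label 5 derives from bytes 0, 1, and 3
}
```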

PolyTracker is free and open-source. You can download a copy today at: https://github.com/trailofbits/polytracker.

It’s Easy to Get Started

PolyFile can be installed with this one quick command:

pip3 install polyfile

PolyTracker requires a working build of LLVM. However, we have made this easy by providing a Docker container on DockerHub that already has everything built. Simply download a copy of the container to start instrumenting your parsers!

docker pull trailofbits/polytracker:latest
docker run -it --rm trailofbits/polytracker:latest

We have lots of features in active development, including intelligent file mutation for fuzzing and differential testing, temporal analysis of taint propagation, and automated identification of error handling routines.

We’re also excited to hear what other clever uses the community devises for the tools. Are you using PolyTracker to discover input bytes that are ignored by the parser? Do you use special taint labels to track the results of functions like strlen that are likely to correspond to field lengths? Let us know on the Empire Hacking Slack! Have an idea that you’d like to see turned into a new feature? Feel free to add a GitHub issue!


This research was developed with funding from the Defense Advanced Research Projects Agency (DARPA). The views, opinions, and/or findings expressed are those of the authors and should not be interpreted as representing the official views or policies of the Department of Defense or the U.S. Government.

Destroying x86_64 instruction decoders with differential fuzzing

TL;DR: x86_64 decoding is hard, and the number and variety of implementations available for it makes it uniquely suited to differential fuzzing. We’re open sourcing mishegos, a differential fuzzer for instruction decoders. You can use it to discover discrepancies in your own decoders and analysis tools!

Figure 1: Some of Mishegos’s output, visualized.

In the beginning, there was instruction decoding

Decompilation and reverse engineering tools are massive, complicated beasts that deal with some of the hardest problems in binary analysis: variable type and layout recovery, control flow graph inference, and sound lifting to higher-order representations for both manual and automated inspection.

At the heart of each of these tasks is accurate instruction decoding. Automated tools require faithful extraction of instruction semantics to automate their analyses, and reverse engineers expect accurate disassembly listings (or well-defined failure modes) when attempting manual comprehension.

Instruction decoding is implicitly treated as a solved problem. Analysis platforms give analysts a false sense of confidence by encouraging them to treat disassembled output as ground truth, without regarding potential errors in the decoder or adversarial instruction sequences in the input.

Mishegos challenges this assumption.

(x86_64) Instruction decoding is hard

Like, really hard:

  • Unlike RISC ISAs such as ARM and MIPS, x86_64 has variable-length instructions, meaning that decoder implementations must incrementally parse the input to know how many bytes to fetch. An instruction can be anywhere between 1 byte (e.g., 0x90, nop) and 15 bytes long. Longer instructions may be semantically valid (i.e., they may describe valid combinations of prefixes, operations, and literals), but actual silicon implementations will only fetch and decode 15 bytes at most (see the Intel x64 Developer’s Manual, §2.3.11).
  • x86_64 is the 64-bit extension of a 32-bit extension of a 40-year-old 16-bit ISA designed to be source-compatible with a 50-year-old 8-bit ISA. In short, it’s a mess, with each generation adding and removing functionality, reusing or overloading instructions and instruction prefixes, and introducing increasingly complicated switching mechanisms between supported modes and privilege boundaries.
  • Many instruction sequences have overloaded interpretations or plausible disassemblies, depending on the active processor’s state or compatibility mode. Disassemblers are required to make educated guesses, even when given relatively precise information about the compilation target or the expected execution mode.

The complexity of the x86_64 instruction format is especially apparent when visualized:

Figure 2: Visualizing some of x86_64’s complexity.


Even the graphic above doesn’t fully capture x86_64’s nuances—it ignores the internal complexity of the ModR/M and scale-index-base (SIB) bytes, as well as the opcode extension bit and various escaping formats for extended opcodes (legacy escape prefixes, VEX escape, and XOP escape).

All told, these complexities make x86_64 decoder implementations uniquely amenable to testing via differential fuzzing—by hooking a mutation engine up to several different implementations at once and comparing each collection of outputs, we can quickly suss out bugs and missing functionality.

Building a “sliding” mutation engine for x86_64 instructions

Given this layout and our knowledge about minimum and maximum instruction lengths on x86_64, we can construct a mutation engine that probes large parts of the decoding pipeline with a “sliding” strategy:

  • Generate an initial instruction candidate of up to 26 bytes, including structurally valid prefixes and groomed ModR/M and SIB fields.
  • Extract each “window” of the candidate, where each window is up to 15 bytes beginning at index 0 and moving to the right.
  • Once all windows are exhausted, generate a new instruction candidate and repeat.
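The window extraction step can be sketched in a few lines of Go (illustrative only; mishegos itself is implemented in C):

```go
package main

import "fmt"

// windows returns every max-byte view of candidate, sliding one byte
// to the right each time. A 26-byte candidate with max = 15 yields the
// 12 windows shown in Figure 4.
func windows(candidate []byte, max int) [][]byte {
	var out [][]byte
	for i := 0; i+max <= len(candidate); i++ {
		out = append(out, candidate[i:i+max])
	}
	return out
}

func main() {
	candidate := make([]byte, 26) // a maximal instruction candidate
	fmt.Println(len(windows(candidate, 15)))
}
```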

Why up to 26 bytes? See above! x86_64 decoders will only accept up to 15 bytes, but generating long, (potentially) semantically valid x86_64 instruction candidates that we “slide” through means we can test likely edge cases in decoding:

  • Failing to handle multiple, duplicate instruction prefixes.
  • Emitting nonsense prefixes or disassembly attributes (e.g., accepting and emitting a repeat prefix on a non-string operation, or the lock prefix on something that isn’t atomizable).
  • Failing to parse the ModR/M or SIB bytes correctly, causing incorrect opcode decoding or bad displacement/immediate scaling/indexing.

So a maximal instruction candidate, with dummy displacement and immediate values trailing the prefixes and opcode, like…

f0 f2 2e 67 46 0f 3a 7a 22 8e 00 01 02 03 04 05 06 07 08 09 0a 0b 0c 0d 0e 0f
Figure 3: A maximal instruction candidate.

… yields 12 “window” candidates for actual fuzzing.

f0 f2 2e 67 46 0f 3a 7a 22 8e 00 01 02 03 04
f2 2e 67 46 0f 3a 7a 22 8e 00 01 02 03 04 05
2e 67 46 0f 3a 7a 22 8e 00 01 02 03 04 05 06
67 46 0f 3a 7a 22 8e 00 01 02 03 04 05 06 07
46 0f 3a 7a 22 8e 00 01 02 03 04 05 06 07 08
0f 3a 7a 22 8e 00 01 02 03 04 05 06 07 08 09
3a 7a 22 8e 00 01 02 03 04 05 06 07 08 09 0a
7a 22 8e 00 01 02 03 04 05 06 07 08 09 0a 0b
22 8e 00 01 02 03 04 05 06 07 08 09 0a 0b 0c
8e 00 01 02 03 04 05 06 07 08 09 0a 0b 0c 0d
00 01 02 03 04 05 06 07 08 09 0a 0b 0c 0d 0e
01 02 03 04 05 06 07 08 09 0a 0b 0c 0d 0e 0f

Figure 4: Extracted instruction candidates.

Consequently, our mutation engine spends a lot of time trying out different sequences of prefixes and flags, and relatively little time interacting with the (mostly irrelevant) displacement and immediate fields.

Mishegos: Differential fuzzing of x86_64 decoders

Mishegos takes the “sliding” approach above and integrates it into a pretty typical differential fuzzing scheme. Each fuzzing target is wrapped into a “worker” process with a well-defined ABI:

  • worker_ctor and worker_dtor: Worker setup and teardown functions, respectively.
  • try_decode: Called for each input sample, returns the decoder’s results along with some metadata (e.g., how many bytes of input were consumed, the status of the decoder).
  • worker_name: A constant string used to uniquely identify the type of worker.

The codebase currently implements five workers:

  • Capstone—A popular disassembly framework originally based on the LLVM project’s disassemblers.
  • libbfd/libopcodes—The backing libraries used by the popular GNU binutils.
  • udis86—An older, potentially unmaintained decoder (last commit 2014).
  • XED—Intel’s reference decoder.
  • Zydis—Another popular open source disassembly library, with an emphasis on speed and feature-completeness.

Because of the barebones ABI, Mishegos workers tend to be extremely simple. The worker for Capstone, for example, is just 32 lines:

#include <capstone/capstone.h>

#include "../worker.h"

static csh cs_hnd;

char *worker_name = "capstone";

void worker_ctor() {
  if (cs_open(CS_ARCH_X86, CS_MODE_64, &cs_hnd) != CS_ERR_OK) {
    errx(1, "cs_open");
  }
}

void worker_dtor() {
  cs_close(&cs_hnd);
}

void try_decode(decode_result *result, uint8_t *raw_insn, uint8_t length) {
  cs_insn *insn;
  size_t count = cs_disasm(cs_hnd, raw_insn, length, 0, 1, &insn);

  if (count > 0) {
    result->status = S_SUCCESS;
    result->len =
      snprintf(result->result, MISHEGOS_DEC_MAXLEN, "%s %s\n", insn[0].mnemonic, insn[0].op_str);
    result->ndecoded = insn[0].size;
    cs_free(insn, count);
  } else {
    result->status = S_FAILURE;
  }
}
Figure 5: Source for the Capstone worker.

Behind the scenes, workers receive inputs and send outputs in parallel via slots, which are accessed through a shared memory region managed by the fuzzing engine. Input slots are polled via semaphores to ensure that each worker has retrieved a candidate for decoding; output slots are tagged with the worker’s name and instruction candidate to allow for later collection into cohorts. The result is a relatively fast differential engine that doesn’t require each worker to complete a particular sample before continuing: Each worker can consume inputs at its own rate, with only the number of output slots and cohort collection limiting overall performance.
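The fan-out can be modeled in miniature (a conceptual sketch using Go channels, not mishegos’s shared-memory slots and semaphores): each worker drains the inputs at its own pace, and tagged outputs are grouped into cohorts afterward.

```go
package main

import (
	"fmt"
	"sync"
)

// output tags a decode result with the worker that produced it, so
// results can later be collected into cohorts keyed by input.
type output struct {
	worker string
	input  string
	result string
}

// cohortize models the fan-out: each worker runs in its own goroutine
// and consumes inputs at its own rate; outputs are grouped by the
// candidate that produced them.
func cohortize(inputs, workers []string) map[string][]output {
	outs := make(chan output, len(inputs)*len(workers))
	var wg sync.WaitGroup
	for _, w := range workers {
		w := w
		wg.Add(1)
		go func() {
			defer wg.Done()
			for _, in := range inputs {
				// a real worker would invoke its decoder here
				outs <- output{worker: w, input: in, result: "decoded " + in}
			}
		}()
	}
	wg.Wait()
	close(outs)

	cohorts := map[string][]output{}
	for o := range outs {
		cohorts[o.input] = append(cohorts[o.input], o)
	}
	return cohorts
}

func main() {
	cohorts := cohortize([]string{"90", "f4", "cc"}, []string{"capstone", "xed", "zydis"})
	fmt.Println(len(cohorts)) // one cohort per input candidate
}
```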

The bird’s-eye view:

Figure 6: Mishegos’s architecture.


Making sense of the noise

Mishegos produces a lot of output: A single 60-second run on a not particularly fast Linux server (inside of Docker!) produces about 1 million cohorts, or 4 million bundled outputs (1 output per input per fuzzing worker with 4 workers configured):

Figure 7: An example Mishegos run.


Each output cohort is structured as a JSON blob, and looks something like this:

{
  "input": "3626f3f3fc0f587c22",
  "outputs": [
    {
      "ndecoded": 5,
      "len": 21,
      "result": "ss es repz repz cld \n",
      "workerno": 0,
      "status": 1,
      "status_name": "success",
      "worker_so": "./src/worker/bfd/bfd.so"
    },
    {
      "ndecoded": 5,
      "len": 5,
      "result": "cld \n",
      "workerno": 1,
      "status": 1,
      "status_name": "success",
      "worker_so": "./src/worker/capstone/capstone.so"
    },
    {
      "ndecoded": 5,
      "len": 4,
      "result": "cld ",
      "workerno": 2,
      "status": 1,
      "status_name": "success",
      "worker_so": "./src/worker/xed/xed.so"
    },
    {
      "ndecoded": 5,
      "len": 3,
      "result": "cld",
      "workerno": 3,
      "status": 1,
      "status_name": "success",
      "worker_so": "./src/worker/zydis/zydis.so"
    }
  ]
}

Figure 8: An example output cohort from Mishegos.

In this case, all of the decoders agree: The first five bytes of the input decode to a valid cld instruction. libbfd is extra eager and reports the (nonsense) prefixes, while the others silently drop them as irrelevant.

But consistent successes aren’t what we’re interested in—we want discrepancies, dammit!

Discrepancies can occur along a few dimensions:

  • One or more decoders disagree about how many bytes to consume during decoding, despite all reporting success.
  • One or more decoders report failure (or success), in contrast to others.
  • All decoders report success and consume the same number of input bytes, but one or more disagree about a significant component of the decoding (e.g., the actual opcode or immediate/displacement values).

Each of these has adversarial applications:

  • Decoding length discrepancies can cause a cascade of incorrect disassemblies, preventing an automated tool from continuing or leaving a manual analyst responsible for realigning the disassembler.
  • Outright decoding failures can be used to prevent usage of a susceptible tool or platform entirely, or to smuggle malicious code past an analyst.
  • Component discrepancies can be used to mislead an analysis or human analyst into incorrectly interpreting the program’s behavior. Severe enough discrepancies could even be used to mask the recovered control flow graph!

Mishegos discovers each of these discrepancy classes via its analysis tool and presents them with mishmat, a hacky HTML visualization. The analysis tool collects language-agnostic “filters” into “passes” (think LLVM), which can then order their internal filters either via a dependency graph or based on perceived performance requirements (i.e., largest filters first). Passes are defined in ./src/analysis/passes.yml, e.g.:

# Find inputs that all workers agree are one size, but one or more
# decodes differently.
- filter-any-failure
- filter-ndecoded-different
- filter-same-effects
- minimize-input
- normalize

Figure 9: An example of a Mishegos analysis pass composed of several filters.

Individual filters are written as small scripts that take cohorts on stdin and conditionally emit them on stdout. For example, filter-ndecoded-different:

require "json"

STDERR.puts "[+] pass: filter-ndecoded-different"

count = 0
STDIN.each_line do |line|
  result = JSON.parse line, symbolize_names: true

  outputs_ndecoded = result[:outputs].map { |o| o[:ndecoded] }

  if outputs_ndecoded.uniq.size > 1
    count += 1
    STDOUT.puts result.to_json
  end
end

STDERR.puts "[+] pass: filter-ndecoded-different done: #{count} filtered"

Figure 10: An example of a Mishegos analysis filter

Filters can also modify individual results or entire cohorts. The minimize-input filter chops the instruction candidate down to the longest indicated ndecoded field, and the normalize filter removes extra whitespace in preparation for additional analysis of individual assemblies.
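As a rough sketch of what these two filters do (a hypothetical Python rendering for illustration; the actual filters are small Ruby scripts like the one above):

```python
def minimize_input(cohort):
    """Chop the candidate down to the longest ndecoded any worker reported."""
    max_ndecoded = max(o["ndecoded"] for o in cohort["outputs"])
    # the input is hex-encoded, so each byte occupies two characters
    cohort["input"] = cohort["input"][: max_ndecoded * 2]
    return cohort

def normalize(cohort):
    """Collapse extra whitespace in each decoded result string."""
    for o in cohort["outputs"]:
        o["result"] = " ".join(o["result"].split())
    return cohort

cohort = {
    "input": "3626f3f3fc0f587c22",
    "outputs": [{"ndecoded": 5, "result": "cld \n"}],
}
minimize_input(cohort)
normalize(cohort)
print(cohort["input"])  # first 5 bytes: 3626f3f3fc
```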

Finally, passes can be run as a whole via the analysis command-line:

Figure 11: An example analysis run.

The analysis output can be visualized with mishmat, with an optional cap on the size of the HTML table:

mishmat -l 10000 < /tmp/mishegos.sd > /tmp/mishegos.sd.html

Ultimately, this yields fun results like the ones below (slightly reformatted for readability). Instruction candidates are on the left; individual decoder results are labeled by column. (bad) in libbfd’s column indicates a decoding failure. The (N/M) syntax represents the number of bytes decoded (N) and the total length of the assembled string (M):

libbfd capstone zydis xed
f3f326264e0f3806cc repz repz es es rex.WRX (bad) (8 / 29) phsubd mm1, mm4 (9 / 15) (0 / 0) (0 / 0)
26f366364f0f38c94035 es data16 ss rex.WRXB (bad) (8 / 27) sha1msg1 xmm8, xmmword ptr ss:[r8 + 0x35] (10 / 41) (0 / 0) (0 / 0)
f366364f0f38c94035 data16 ss rex.WRXB (bad) (7 / 24) sha1msg1 xmm8, xmmword ptr ss:[r8 + 0x35] (9 / 41) (0 / 0) (0 / 0)
66364f0f38c94035 ss rex.WRXB (bad) (6 / 17) sha1msg1 xmm8, xmmword ptr ss:[r8 + 0x35] (8 / 41) (0 / 0) (0 / 0)

Figure 12: Capstone thinking that nonsense decodes to valid SSE instructions.

libbfd capstone zydis xed
f36766360f921d32fa9c83 repz data16 setb BYTE PTR ss:[eip+0xffffffff839cfa32] # 0xffffffff839cfa3d (11 / 74) (0 / 0) setb byte ptr [eip-0x7c6305ce] (11 / 30) setb byte ptr ss:[0x00000000839CFA3D] (11 / 37)

Figure 13: Capstone missing an instruction entirely.

libbfd capstone zydis xed
3665f0f241687aa82c8d ss gs lock repnz rex.B push 0xffffffff8d2ca87a (10 / 46) push -0x72d35786 (10 / 16) (0 / 0) (0 / 0)
65f0f241687aa82c8d gs lock repnz rex.B push 0xffffffff8d2ca87a (9 / 43) push -0x72d35786 (9 / 16) (0 / 0) (0 / 0)
f0f241687aa82c8d lock repnz rex.B push 0xffffffff8d2ca87a (8 / 40) push -0x72d35786 (8 / 16) (0 / 0) (0 / 0)

Figure 14: Amusing signed representations.

libbfd capstone zydis xed
3e26f0f2f1 ds es lock repnz icebp (5 / 22) int1 (5 / 4) (0 / 0) (0 / 0)

Figure 15: Undocumented opcode discrepancies!

libbfd capstone zydis xed
f3430f38f890d20aec2c repz rex.XB (bad) (5 / 17) (0 / 0) enqcmds rdx, zmmword ptr [r8+0x2cec0ad2] (10 / 40) enqcmds rdx, zmmword ptr [r8+0x2CEC0AD2] (10 / 40)
2e363e65440f0dd8 cs ss ds gs rex.R prefetch (bad) (6 / 32) (0 / 0) nop eax, r11d (8 / 13) nop eax, r11d (8 / 13)
f2f266260fbdee repnz data16 es (bad) (6 / 21) (0 / 0) bsr bp, si (7 / 10) bsr bp, si (7 / 10)

Figure 16: XED and Zydis only.

libbfd capstone zydis xed
64f064675c fs lock fs addr32 pop rsp (5 / 25) (0 / 0) (0 / 0) (0 / 0)
2e cs (1 / 2) (0 / 0) (0 / 0) (0 / 0)
f06636f00f3802c7 lock ss lock phaddd xmm0,xmm7 (8 / 29) (0 / 0) (0 / 0) (0 / 0)
f03e4efd lock ds rex.WRX std (4 / 19) (0 / 0) (0 / 0) (0 / 0)
36f03e4efd ss lock ds rex.WRX std (5 / 22) (0 / 0) (0 / 0) (0 / 0)

Figure 17: And, of course, libbfd being utterly and repeatedly wrong.

The results above were captured with revision 88878dc on the repository. You can reproduce them by running the fuzzer in manual mode:

M=1 mishegos ./workers.spec <<< "36f03e4efd" > /tmp/mishegos

The big takeaways

For reverse engineers and program analysts: x86_64 instruction decoding is hard. The tools you rely on to do it are not, in fact, reliable. It is possible (and even trivial), given Mishegos’s output, to construct adversarial binaries that confuse your tools and waste your time. We’ve reported some of these issues upstream, but make no mistake: Trusting your decoder to perfectly report the machine behavior of a byte sequence will burn you.

Not everything is doom and gloom. If you need accurate instruction decoding (and you do!), you should use XED or Zydis. libopcodes is frequently close to Zydis and XED in terms of instruction support but consistently records false positives and decodes just prefixes as valid instructions. Capstone reports both false positives and false negatives with some regularity. udis86 (not shown above) behaves similarly to libopcodes and, given its spotty maintenance, should not be used.

This post is one of many from our team on the vagaries of parsing. Watch this space for a post by Evan Sultanik on polyglots and schizophrenic parsing.

How safe browsing fails to protect user privacy

Recently, security researchers discovered that Apple was sending safe browsing data to Tencent for all Chinese users. This revelation has brought the underlying security and privacy guarantees of the safe browsing protocol under increased scrutiny. In particular, safe browsing claims to protect users by providing them with something called k-anonymity. In this post we’ll show that this definition of privacy is somewhat meaningless in the context of safe browsing. Before jumping into why k-anonymity is insufficient, let’s take a look at how the safe browsing protocol works.

How does safe browsing work?

A while back, Google thought it would be useful to leverage its knowledge of the web to provide a database of malicious sites to web clients. Initially, users would submit their IP address and the URL in question to Google, which would check the URL against a malware database. This scheme was called the Lookup API and is still available today. However, people quickly became uneasy about surrendering so much of their privacy. This reasonable concern led to the development of the current safe browsing scheme, called the Update API, which is used by both Google and Tencent.

Safe browsing Update API flowchart, from Gerbet et al.

At a high level, Google maintains a list of malicious URLs and their 256-bit hashes. To save on bandwidth when distributing this list to browsers, it only sends out a 32-bit prefix of each hash. This means that when the user’s browser checks whether a site is malicious, it might get a false positive, since many 256-bit URL hashes begin with the same 32-bit prefix. To remedy this, if a match occurs, the browser sends the 32-bit prefix in question to Google and gets back the full list of blacklisted URLs whose 256-bit hashes begin with that prefix. To recap, the safe browsing Update API goes through the following steps every time a user tries to visit a new URL:

  1. Browser hashes the URL and checks it against the (local) list of 32-bit prefixes.
  2. If there is a match, the browser sends Google the 32-bit prefix.
  3. Google then sends back all blacklisted URLs whose 256-bit hash begins with that prefix.
  4. If there is a match in the updated list, the browser issues a warning to the user.
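A minimal sketch of the client-side half of this flow, assuming SHA-256 as the hash function and simple truncation for the prefix (the `fetch_full_hashes` callback is hypothetical and stands in for the round trip to Google in steps 2 and 3):

```python
import hashlib

def check_url(url, local_prefixes, fetch_full_hashes):
    """Steps 1-4: local 32-bit prefix match, then full-hash confirmation."""
    full = hashlib.sha256(url.encode()).digest()
    prefix = full[:4]                    # 32-bit prefix of the 256-bit hash
    if prefix not in local_prefixes:
        return False                     # step 1: no local match, treat as safe
    # steps 2-3: send the prefix, receive all full hashes sharing it
    full_hashes = fetch_full_hashes(prefix)
    return full in full_hashes           # step 4: warn only on a full match

# toy server side: one blacklisted URL
bad = hashlib.sha256(b"http://evil.example/").digest()
local = {bad[:4]}
server = lambda p: {h for h in [bad] if h.startswith(p)}

print(check_url("http://evil.example/", local, server))
print(check_url("http://good.example/", local, server))
```

Note that the server learns a 32-bit prefix only when the local list matches, which is exactly the leak the rest of this post examines.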

Intuitively, this safe browsing scheme is more private than the original, since Google only learns the 32-bit prefix of each potentially malicious site the user visits. Indeed, Google has argued that it provides users with something called k-anonymity—a metric used by privacy analysts to determine how unique a piece of identifying information is. Let’s take a look at what exactly k-anonymity is, and to what extent safe browsing satisfies this definition.

What is k-anonymity?

Traditionally, k-anonymity has been used to remove personal identifying information from a database. At a high level, it involves generalizing or removing pieces of sensitive data until everyone in the dataset “looks like” at least k−1 other people with respect to certain traits. For example, given the table of medical records in Figure 1, we could modify the Name and Age fields to make patients 2-anonymous with respect to Name, Age, Gender, and State, as shown in Figure 2.

Name Age Gender State Disease
Candace 26 Female NY Flu
Richard 23 Male CA Flu
Robin 15 Nonbinary NY None
Alyssa 52 Female CA Cancer
Omar 29 Male CA None
Kristine 17 Nonbinary NY Cancer
Emily 58 Female CA Heart-disease
Jasmine 20 Female NY None

Figure 1

Anyone trying to use this data will still get the information they need to perform statistical analysis (ostensibly your name won’t really affect your likelihood of getting the flu), but everyone represented in the database will “look like” at least one other person. That way, an attacker trying to de-anonymize people will fail, because they won’t be able to distinguish between the entries that look alike. Obviously, the bigger the k, the better; if the attacker is an insurance provider trying to use medical data as a way to justify hiking up your premiums, a database providing 2-anonymity might not be enough. In Figure 2, if the insurance company knows you are represented in the database and that you are a 52-year-old woman from California, it can deduce that you have either cancer or heart disease and start charging you more money.

Name Age Gender State Disease
* 20-30 Female NY Flu
* 20-30 Male CA Flu
* 10-20 Nonbinary NY None
* 50-60 Female CA Cancer
* 20-30 Male CA None
* 10-20 Nonbinary NY Cancer
* 50-60 Female CA Heart-disease
* 20-30 Female NY None

Figure 2
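The generalization in Figure 2 can be checked mechanically: group the rows by the quasi-identifier columns and confirm that every group contains at least k rows. A quick sketch:

```python
from collections import Counter

def is_k_anonymous(rows, quasi_identifiers, k):
    """True if every combination of quasi-identifier values appears >= k times."""
    groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return all(count >= k for count in groups.values())

# the generalized table from Figure 2 (Name suppressed, Age bucketed)
table = [
    {"Age": "20-30", "Gender": "Female",    "State": "NY", "Disease": "Flu"},
    {"Age": "20-30", "Gender": "Male",      "State": "CA", "Disease": "Flu"},
    {"Age": "10-20", "Gender": "Nonbinary", "State": "NY", "Disease": "None"},
    {"Age": "50-60", "Gender": "Female",    "State": "CA", "Disease": "Cancer"},
    {"Age": "20-30", "Gender": "Male",      "State": "CA", "Disease": "None"},
    {"Age": "10-20", "Gender": "Nonbinary", "State": "NY", "Disease": "Cancer"},
    {"Age": "50-60", "Gender": "Female",    "State": "CA", "Disease": "Heart-disease"},
    {"Age": "20-30", "Gender": "Female",    "State": "NY", "Disease": "None"},
]
print(is_k_anonymous(table, ["Age", "Gender", "State"], 2))  # True
print(is_k_anonymous(table, ["Age", "Gender", "State"], 3))  # False
```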

Back to safe browsing: We can see how restricting the URLs viewable by Google or Tencent to a 32-bit hash prefix renders both providers unable to distinguish your request from that of any other URL with the same hash prefix. The question is, how many such collisions can we expect? In 2015, Gerbet et al. concluded that each prefix occurred roughly 14,757 times across the web, implying that users of safe browsing can expect their browsing data to be roughly 14,757-anonymous. In other words, Google/Tencent only knows that the website you attempted to visit is contained in a set of approximately 14,757 URLs, which is likely big enough to contain plenty of generic websites that would not be politically (or commercially) very revealing.

Why k-anonymity fails to protect users

Despite the fact that safe browsing satisfies the definition of k-anonymity, it actually isn’t very hard for Google to recover your browsing data from these queries. This insecurity is due to the fact that the privacy guarantees of k-anonymity don’t account for Google’s ability to cross-reference multiple safe browsing queries and narrow down which specific website corresponds with a given 32-bit prefix.

As a first example of such an attack, recall that Google uses cookies for safe browsing and can therefore see when multiple queries come from the same user. Now, suppose both www.amazon.com and https://www.amazon.com/gp/cart/view.html?ref_=nav_cart share a 32-bit hash prefix with two different malicious websites. If a user visits both Amazon and their shopping cart in rapid succession, Google will receive both 32-bit hash prefixes. Since it is unlikely the user visited two unrelated malicious websites back to back, Google can be reasonably sure that they were shopping on Amazon. This attack only works when two related websites both share a 32-bit hash prefix with malicious websites and the user visits them within a small window of time. However, this example already shows that k-anonymity isn’t so useful when faced with an adversary capable of correlating multiple queries.

The situation is actually much worse, though, because the safe browsing protocol often forces users to submit several highly correlated URLs at the same time. These multiple queries occur because many URLs are in some sense “too specific” for Google to keep track of, since malicious websites can create new URLs faster than Google can report each specific one. To account for this, users submit a set of “URL decompositions” for each query, which is constructed by progressively stripping pieces of the URL off. For example, when visiting the URL http://a.b.c/1/2 the browser would simultaneously check the following URLs against the safe browsing database:

  1. a.b.c/1/2
  2. a.b.c/1/
  3. a.b.c/
  4. b.c/
  5. b.c/1/

Using the full URL decomposition allows Google to provide users with a high degree of confidence that the website they are visiting isn’t malicious. However, submitting many highly correlated 32-bit hash prefixes all at once ruins much of the privacy originally provided by the safe browsing protocol. If Google receives the 32-bit hash prefix corresponding to both a.b.c/ and a.b.c/1 in the same query, it can easily de-anonymize the user’s browsing data. Even in circumstances where submitting multiple prefixes doesn’t lead to full URL recovery, there may be sufficient information to learn sensitive details about the user’s browsing history.
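The decomposition can be sketched as follows. This is a simplified illustration assuming plain host/path splitting; the real algorithm also canonicalizes the URL, caps the number of variants, and may produce a slightly different set than the five decompositions listed above:

```python
def decompositions(host, path):
    """Host-suffix / path-prefix combinations checked for http://host/path."""
    labels = host.split(".")
    # host suffixes: the full host, then progressively shorter suffixes,
    # stopping at two labels (e.g. "b.c")
    hosts = [".".join(labels[i:]) for i in range(len(labels) - 1)]
    # path prefixes: "/", then each additional path component
    parts = [p for p in path.split("/") if p]
    paths = ["/"]
    for i in range(len(parts)):
        prefix = "/" + "/".join(parts[: i + 1])
        if i < len(parts) - 1:
            prefix += "/"   # intermediate components keep a trailing slash
        paths.append(prefix)
    return [h + p for h in hosts for p in paths]

for d in decompositions("a.b.c", "/1/2"):
    print(d)
```

Each of these strings is hashed and its 32-bit prefix checked, so a single page visit can surface several correlated prefixes at once.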

To bring things down to earth, Gerbet et al. showed that this URL decomposition attack can be used to identify a user’s taste in pornography—something an oppressive government would certainly be interested in monitoring. Even worse, since these malware databases aren’t made public, it’s difficult to verify that hash prefixes haven’t been adversarially included to track users. While we may trust Google not to rig the database so it can determine when users visit pro-Hong Kong websites, it’s easy to imagine Tencent taking advantage of this vulnerability.

Looking forward

While safe browsing undoubtedly provides real security benefits to users, it fails to protect them from companies or governments that want to monitor their browsing habits. Unfortunately, this lack of privacy is obscured by the fact that the protocol provides users with a weak, but technically precise, notion of anonymity. As both the technology and legal communities rally around tools like k-anonymity and differential privacy, we need to keep in mind that they are not one-size-fits-all techniques, and systems that theoretically satisfy such definitions might provide no real meaningful privacy when actually deployed.

If you’re considering using tools like differential privacy or k-anonymity in your application, our cryptographic services team can help you navigate the inherent subtleties of these systems. Whether you need help with protocol design or auditing an existing codebase, our team can help you build something your users will be able to trust.

Grace Hopper Celebration (GHC) 2019 Recap

by Rachel Cipkins, Stevens Institute of Technology, Hoboken, NJ

A few weeks ago I had the inspiring experience of attending the annual Grace Hopper Celebration (GHC), the world’s largest gathering of women in technology. Over four days in Orlando, Florida, GHC hosted a slew of workshops and presentations, plus a massive career fair with over 450 vendors (by comparison, Black Hat USA had about 300 vendors this year). The conference attracted over 25,000 attendees from 83+ countries—a mix of women in technology as well as a significant population of male allies. And at a time when still only 25% of computing jobs are held by women, it was encouraging to see GHC garner vast coverage from the technology industry.

As an aspiring security professional, I was also pleased that GHC 2019 had extensive coverage of cybersecurity (at least 15 talks!), even though the conference was not dedicated to it. It was uplifting to represent women in computing from Stevens along with 20 of my peers (half engineering majors and half computer science majors). We took full advantage of the jam-packed conference program, which included highlights like:

  • Keynote speakers focused on the importance of giving women the courage to explore careers in technology.
  • The PitchHER competition, where female entrepreneurs leading early-stage startups compete for prize money.
  • Open Source day, where participants can contribute to open source projects with the goal of making a positive impact on the world.
  • Diversity group meetups for organizations such as Lesbians Who Tech and Black Girls Who Code.
  • 20 workshop tracks scheduled in between events such as Artificial Intelligence, Emerging Technology, and Open Source.

That awesome, unexpected cybersecurity focus

Naturally, I took in as many of the cybersecurity events as I could. I especially appreciated the featured academic presentations from three ACM Award Winning Research in Cybersecurity winners—I highly recommend digging into their research papers:

  • Aisha Ali-Gombe, Towson University, Leveraging Software Instrumentation for Android Security Assessment. Ali-Gombe outlined ways Android security flaws can be exploited, such as spying on users, privilege escalation, data exploitation, and botnet recruitment. She also described her process for combating these threats, which included developing DroidScraper and AspectDroid to eliminate malicious Android use without low-level modification of the OS/framework.
  • Alexandra Dmitrienko, Institute of Information Security, ETH Zurich, Secure Free-Floating Car Sharing For Offline Cars. Free-floating car sharing is efficient, cost effective, and reduces traffic and air pollution. However, it does not support offline cars, and modifications have to be made to the car in order for the service to work. Dmitrienko created a solution that uses RFID chips in off-the-shelf cars to overcome these issues. She also implemented two-factor authentication on mobile platforms for enhanced security.
  • Shagufta Mehnaz, Purdue University, A Fine-Grained Approach for Anomaly Detection in File System Accesses. Mehnaz’ research explored how access control mechanisms cannot always prevent authorized users from misusing or stealing sensitive data. She created fine-grained profiles of file system users, then used these profiles to detect anomalous access to file systems.

I attended most of the Security/Privacy track workshops, which covered all knowledge levels. One of my peers, a third-year doctoral student in cybersecurity research, agreed the workshops covered a wide range of skill levels and presented the problems well, but thought the time constraints (~1 hour) didn’t allow for enough background information to help everyone understand solutions. Still, it was a generous attempt to allow non-technical attendees to benefit as well.

THE place to find female tech talent

GHC is a well-known recruiting ground for women in technology; the conference is scheduled to coincide with the start of the recruiting season for new graduates and summer internships. The career fair spans the duration of the conference and includes companies as well as universities.

One of the most useful things GHC provides is a resume database companies and universities can use to recruit potential candidates before the conference starts. Recruiters can also rent space in the interview hall to host interviews at the conference, and larger companies tend to host private networking events at offsite locations. Almost everyone in my group received an offer from a company they interviewed with, or was contacted to schedule an interview with a company not holding interviews at the conference.

Nearly every company I talked to at the career fair perked up at the mention of cybersecurity. Despite the high demand for security expertise, though, there were few security sponsors. Notable security companies that were recruiting at the conference included MITRE, Red Balloon, and Crowdstrike, but only one of them conducted interviews on-site. Hopefully, we’ll see more security companies at GHC next year.

You want to be there

Attending the Grace Hopper Celebration was an empowering experience for me as a woman in technology and especially as a woman in security. The coverage of cybersecurity-related workshops and talks at GHC is definitely growing, and it has proven to be a great place to recruit female security talent.

It was also just an incredible and unique experience to spend the week surrounded by amazing women. Everyone I spoke to described the energy at the conference as “electric.” My peers and I were able to make new professional connections, learn about new technology trends, and bring back new ideas to the various women in STEM groups at Stevens. We left the conference with a sense of courage to continue to grow our careers in technology and to inspire other women to pursue the same path.

GHC 2020 will be held in Orlando, Florida, Sept. 29–Oct. 2, and it is definitely an event worth considering for both general attendees and sponsors.

Formal Analysis of the CBC Casper Consensus Algorithm with TLA+

by Anne Ouyang, Piedmont Hills High School, San Jose, CA

As a summer intern at Trail of Bits, I used the PlusCal and TLA+ formal specification languages to explore Ethereum’s CBC Casper consensus protocol and its Byzantine fault tolerance. This work was motivated by the Medium.com article Peer Review: CBC Casper by Muneeb Ali, Jude Nelson, and Aaron Blankstein, which indicated that CBC Casper’s liveness properties impose stricter Byzantine fault thresholds than those suggested by the safety proof. To explore this, I specified the Casper the Friendly Binary consensus protocol in TLA+.

As expected, finality could not be determined unless the Byzantine fault threshold was less than one-third of the total weight of all validators, which is consistent with Lamport et al.’s original paper on the Byzantine Generals Problem. As long as that condition was satisfied, however, CBC Casper appeared to function as intended.

First, a Little Background

The CBC Casper (Correct By Construction) protocol is a PoS consensus protocol designed to one day be used for Ethereum. However, the current state of CBC Casper hasn’t undergone much formal analysis, and its under-specification poses a challenge to the introduction of Ethereum 2.0.

In a distributed network, individual nodes, called validators, use the Minimal CBC Casper family of consensus protocols to make consistent decisions. The protocol’s five parameters are the names and weights of validators, fault tolerant threshold, consensus values, and estimator. Validators make individual decisions based on their current state, which is defined in terms of the messages received, which have three components:

  • Sender: the name of a validator sending the message
  • Estimate: a subset of values in the consensus values set
  • Justification: a set of messages (state) received to arrive at the estimate

As a result, the sending and receiving of messages can be defined as state transitions.

Equivocation occurs when a validator sends a pair of messages that do not have each other in their justifications. The protocol’s future states are all reachable states in which the weight of equivocation faults remains below the fault-tolerant threshold. Finality is defined by safety oracles, which detect when a certain property holds for all future states.
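The equivocation check translates almost directly into code. Here is a hypothetical Python sketch (not part of the TLA+ specification), modeling messages as immutable (sender, estimate, justification) records:

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Message:
    sender: str
    estimate: int
    justification: frozenset = field(default_factory=frozenset)

def equivocates(m1: Message, m2: Message) -> bool:
    """A pair of messages from one sender, neither in the other's justification."""
    return (
        m1.sender == m2.sender
        and m1 != m2
        and m1 not in m2.justification
        and m2 not in m1.justification
    )

a1 = Message("alice", 0)
a2 = Message("alice", 1)                   # sent without justifying a1
a3 = Message("alice", 1, frozenset({a1}))  # honestly built on top of a1
print(equivocates(a1, a2))  # True: no mutual justification
print(equivocates(a1, a3))  # False: a3's justification includes a1
```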

TLA+ and PlusCal

TLA+ is a formal specification language describing behaviors with states and state transitions. The specifications and state machine models can be checked with the TLC model checker. TLC performs a breadth-first search over the defined state transitions and checks for the properties that need to be satisfied.

The Byzantine Generals Problem

For context, the Byzantine Generals Problem is an analogy for decision-making in distributed systems in the presence of malicious actors. The problem states that a commanding general must send an order to n−1 lieutenant generals such that 1) all loyal lieutenants obey the same order, and 2) if the commanding general is loyal, then every loyal lieutenant obeys the order he sends. A solution to the problem must ensure that all the loyal generals agree on the same plan and that a small number of traitors cannot lead them to a bad plan.

Now Let’s Dive into the Process

To start, I specified the definitions in the TLA+ language and defined the states and state transitions in terms of sets and set operations. Figure 1 shows a snippet of the specification.


Figure 1: Snippet of specification in TLA+

I specified the message relay in PlusCal, which is more pseudocode-like and can be automatically transpiled to TLA+.


Figure 2: The CBC Casper message relay specified in PlusCal.

The assumption is that all the messages are sent and received successfully without time delay. The specification does not include message authentication, because it is assumed that the validators can verify the authenticity and source of messages. In this implementation, equivocating validators behave such that they take different subsets of all received messages, use these subsets to obtain the estimates, and send different messages to different validators.

The TLC model checker checks that eventually a clique will be found where all the non-Byzantine nodes can mutually see each other agreeing with a certain estimate in messages, and cannot see each other disagreeing. When this condition is met, finality is achieved, and the model checker should terminate without giving errors, as shown in Figure 3.


Figure 3: TLC model checking results.

When finality cannot be reached, the model checker will detect that a temporal property has been violated, as shown in Figure 4.


Figure 4: TLC errors.

The temporal property checks for the existence of a clique of validators with a total weight greater than

(W + t) / 2

where W is the total weight of all validators and t is the fault-tolerant threshold.

All the validators in the clique satisfy the following:

  • None of the validators are Byzantine-faulty. They do not send equivocating messages.
  • All the validators can mutually see each other agreeing. Each validator has exactly one latest message, and they have the same estimate.
  • None of the validators can see each other disagreeing. (Disagreement means a validator has a latest message in the other’s justification but a newer latest message whose estimate differs from the other validator’s latest message.)

In practice, the failure to achieve finality would mean the blockchain stops producing new blocks, since there is no guarantee that a transaction (i.e. an estimate) will not be reversed in the future.


We find that when the fault-tolerant threshold is set to greater than one-third of the total weight of all the validators, e-cliques cannot be formed. This means finality can be reached when the fault-tolerant threshold is less than one-third of the total weight. This is consistent with “The Byzantine Generals Problem,” which states that the problem “is solvable if and only if more than two-thirds of the generals are loyal.” Intuitively, for the protocol using the clique oracle, a higher fault-tolerant threshold would cause there to be too few validators to form a clique.

While the CBC Casper specification provides a proof of safety, it does not address liveness properties. For example, the CBC Casper blockchain protocol may encounter the problem in which no more blocks can be finalized. Further work is needed in specifying liveness, because finality is a liveness problem and is necessary for a switch to a PoS method. I found no liveness faults, but only tested binary consensus with a very small number of validators. Liveness faults may exist in more sophisticated instantiations of CBC Casper.

Some Thoughts on Formal Verification and TLA+

Developing an abstract mathematical model and systematically exploring the possible states is an interesting and important way to check the correctness of algorithms. However, I encountered the following challenges:

  • There are few examples of TLA+ implementation details and good practices for formal verification, so writing specifications can involve a lot of trial and error.
  • Unhelpful error messages generated by the TLC model checker when something fails to compile. The error messages are vague and do not pinpoint a specific location where the error occurs. In addition, the TLA+ toolbox is a Java application, so the error messages are often Java exceptions. Figuring out what’s wrong with the TLA+ specification, given the Java exceptions, is difficult.
  • Limited documentation of formal verification methods. Googling a question specific to TLA+ yields very few results. As of [date], there were only 39 questions on Stack Overflow with “TLA+” as a tag.


Working at Trail of Bits as an intern this summer was an amazing experience for me. I learned a lot about distributed systems and formal verification and greatly enjoyed the topics. I am glad to have experienced working in security research, and I am motivated to explore more when I go to college.