Security assessment techniques for Go projects

The Trail of Bits Assurance practice has received an influx of Go projects, following the success of our Kubernetes assessment this summer. As a result, we’ve been adapting for Go projects some of the security assessment techniques and tactics we’ve used with other compiled languages.

We started by understanding the design of the language, identifying areas where developers may not fully understand the functionality of a language semantic. Many of these misunderstood semantics originated from findings we reported to our clients and independent research into the language itself. While not exhaustive, some of these problem areas include scoping, coroutines, error handling, and dependency management. Notably, many of theses are not directly related to the runtime. The Go runtime itself is designed to be safe by default, preventing many C-like vulnerabilities.

With a better understanding of the root causes, we searched for existing tooling to help us quickly and effectively instrument client codebases. The result was a sample of static and dynamic open-source tools, including several that were Go-agnostic. To complement these tools, we also identified several compiler configurations that help with instrumentation.

Static analysis

Because Go is a compiled language, the compiler detects and prevents many potentially erroneous patterns before the binary executable is even produced. While this is a major annoyance for newer Go developers, these warnings are extremely important in preventing unexpected behavior and keeping code clean and readable.

Static analysis tends to catch a lot of very low hanging fruit not included in compiler errors and warnings. Within the Go ecosystem, there are many disparate tools such as go-vet, staticcheck, and those within the analysis package. These tools typically identify problems like variable shadowing, unsafe pointer use, and unused function return values. Investigating the areas of a project where these tools display warnings typically leads to exploitable functionality.

These tools are by no means perfect. For example, go-vet can miss very common accidents like the one below, where the A function’s err return value is unused, and immediately reassigned during the assignment of bSuccess on the left-hand side of the expression. The compiler will not provide a warning, and go-vet does not detect this; nor does errcheck. In fact, the tools that successfully identify this case (non-exhaustive) are the aforementioned staticcheck and ineffassign, which identify the err return value of A as unused or ineffectual.

package main

import "fmt"

func A() (bool, error) { return false, fmt.Errorf("I get overridden!") }

func B() (bool, error) { return true, nil }

func main() {
	aSuccess, err := A()
	bSuccess, err := B()
	if err != nil {
	fmt.Println(aSuccess, ":", bSuccess)

Figure 1: An example program showing an ineffectual assignment of err tricking go-vet and errcheck into considering err as checked.

$ go run .
false : true
$ errcheck .
$ go vet .
$ staticcheck .
main.go:5:50: error strings should not be capitalized (ST1005)
main.go:5:50: error strings should not end with punctuation or a newline (ST1005)
main.go:10:12: this value of err is never used (SA4006)
$ ineffassign .
<snip>/main.go:10:12: ineffectual assignment to err

Figure 2: The output of the example program, along with errcheck, go-vet, staticcheck, and ineffassign.

When you look deeper into this example, you may wonder why the compiler does not warn on this problem. The Go compiler will error when variables are not used within a program, but this example successfully compiles. This is caused by the semantics of the “short variable declaration.”

ShortVarDecl = IdentifierList ":=" ExpressionList .

Figure 3: The grammar specification of the “short variable declaration.”

According to the specification, the short variable declaration has the special ability to redeclare variables as long as:

  • The redeclaration is in a multi-variable short declaration.
  • The redeclared variable is declared earlier in the same block or function’s parameter list.
  • The redeclared variable is of the same type as the previous declaration.
  • At least one non-blank variable in the declaration is new.

All of these constraints hold in the previous example, preventing the compiler from producing errors for this problem.

Many tools have edge cases like this where they are unsuccessful in identifying related issues, or identify an issue but describe it differently. Compounding the problem, these tools often require building the Go source code before analysis can be performed. This makes third-party security assessments complicated if the analysts cannot easily build the codebase or its dependencies.

Despite these pitfalls, when put together, the available tools can provide good hints as to where to look for problems within a given project, with just a little bit of effort. We recommend using gosec, go-vet, and staticcheck, at a minimum. These have the best documentation and ergonomics for most codebases. They also provide a wide variety of checks (such as ineffassign or errcheck) for common issues, without getting too specific. For more in-depth analysis of a particular type of issue, however, one might have to use the more specific analyzers, develop custom tooling directly against the SSA, or use $emmle.

Dynamic analysis

Once static analysis has been performed and the results have been reviewed, dynamic analysis techniques are typically the next step for deeper results. Due to Go’s memory safety, the problems normally found with dynamic analysis are those that result in a hard crash or an invalidation of program state. Various tools and approaches have been built to help identify these types of issues within the Go ecosystem. Additionally, it’s possible to retrofit existing language-agnostic tooling for the dynamic testing of Go software, which we show next.


The best-known dynamic testing tool in the Go space is likely Dimitry Vyukov’s implementation of dvyukov/go-fuzz. This tool allows you to quickly and effectively implement mutational fuzzing. It even has an extensive wall of trophies. More advanced users may also find the distributed fuzzing and libFuzzer support useful when hunting for bugs.

Google also produced a more primitive fuzzer with a confusingly similar name, google/gofuzz, that assists users by initializing structures with random values. Unlike Dimitry’s go-fuzz, Google’s gofuzz does not generate a harness or assist with storing crash output, fuzzed input, or any other type of information. While this can be a downside for testing some targets, it makes for a lightweight and extensible framework.

For the sake of brevity, we refer you to examples of both tools in their respective READMEs.

Property testing

Diverging from more traditional fuzzing approaches, Go’s testing package (typically used for unit and integration testing) provides the testing/quick sub-package for “black box testing” of Go functions. In other terms, it is a basic primitive for property testing. Given a function and generator, the package can be used to build a harness to test for potential property violations given the range of the input generator. The following example is pulled directly from the documentation.

func TestOddMultipleOfThree(t *testing.T) {
	f := func(x int) bool {
		y := OddMultipleOfThree(x)
		return y%2 == 1 && y%3 == 0
	if err := quick.Check(f, nil); err != nil {

Figure 4: The OddMultipleOfThree function is being tested, where its return value should always be an odd multiple of three. If it’s not, the f function will return false and the property will be violated. This is detected by the quick.Check function.

While the functionality provided by this package is acceptable for simple applications of property testing, important properties do not often fit well into such a basic interface. To address these shortcomings, the leanovate/gopter framework was born. Gopter provides a wide variety of generators for the common Go types, and has helpers to assist you in creating your own generators compatible with Gopter. Stateful tests are also supported through the gopter/commands sub-package, which is useful for testing that properties hold across sequences of actions. Compounding this, when a property is violated, Gopter shrinks the generated inputs. See a brief example of property tests with input shrinking in the output below.

package main_test
import (

type Compute struct {
  A uint32
  B uint32

func (c *Compute) CoerceInt () { c.A = c.A % 10; c.B = c.B % 10; }
func (c Compute) Add () uint32 { return c.A + c.B }
func (c Compute) Subtract () uint32 { return c.A - c.B }
func (c Compute) Divide () uint32 { return c.A / c.B }
func (c Compute) Multiply () uint32 { return c.A * c.B }

func TestCompute(t *testing.T) {
  parameters := gopter.DefaultTestParameters()
  parameters.Rng.Seed(1234) // Just for this example to generate reproducible results

  properties := gopter.NewProperties(parameters)

  properties.Property("Add should never fail.", prop.ForAll(
    func(a uint32, b uint32) bool {
      inpCompute := Compute{A: a, B: b}
      return true
    gen.UInt32Range(0, math.MaxUint32),
    gen.UInt32Range(0, math.MaxUint32),

  properties.Property("Subtract should never fail.", prop.ForAll(
    func(a uint32, b uint32) bool {
      inpCompute := Compute{A: a, B: b}
      return true
    gen.UInt32Range(0, math.MaxUint32),
    gen.UInt32Range(0, math.MaxUint32),
  properties.Property("Multiply should never fail.", prop.ForAll(
    func(a uint32, b uint32) bool {
      inpCompute := Compute{A: a, B: b}
      return true
    gen.UInt32Range(0, math.MaxUint32),
    gen.UInt32Range(0, math.MaxUint32),

  properties.Property("Divide should never fail.", prop.ForAll(
    func(a uint32, b uint32) bool {
      inpCompute := Compute{A: a, B: b}
      return true
    gen.UInt32Range(0, math.MaxUint32),
    gen.UInt32Range(0, math.MaxUint32),


Figure 5: The testing harness for the Compute structure.

user@host:~/Desktop/gopter_math$ go test
+ Add should never fail.: OK, passed 100 tests.
Elapsed time: 253.291µs
+ Subtract should never fail.: OK, passed 100 tests.
Elapsed time: 203.55µs
+ Multiply should never fail.: OK, passed 100 tests.
Elapsed time: 203.464µs
! Divide should never fail.: Error on property evaluation after 1 passed
   tests: Check paniced: runtime error: integer divide by zero
goroutine 5 [running]:
runtime/debug.Stack(0x5583a0, 0xc0000ccd80, 0xc00009d580)
	/usr/lib/go-1.12/src/runtime/debug/stack.go:24 +0x9d
  o:43 +0xeb
panic(0x554480, 0x6aa440)
	/usr/lib/go-1.12/src/runtime/panic.go:522 +0x1b5
_/home/user/Desktop/gopter_math_test.TestCompute.func4(0x0, 0x0)
	/home/user/Desktop/gopter_math/main_test.go:63 +0x3d
# <snip for brevity>

ARG_0: 0
ARG_0_ORIGINAL (1 shrinks): 117380812
ARG_1: 0
ARG_1_ORIGINAL (1 shrinks): 3287875120
Elapsed time: 183.113µs
--- FAIL: TestCompute (0.00s)
    properties.go:57: failed with initial seed: 1568637945819043624
exit status 1
FAIL	_/home/user/Desktop/gopter_math	0.004s

Figure 6: Executing the test harness and observing the output of the property tests, where Divide fails.

Fault injection

Fault injection has been surprisingly effective when attacking Go systems. The most common mistakes we found using this method involve the handling of the error type. Since error is only a type in Go, when it is returned it does not change a program’s execution flow on it’s own like a panic statement would. We identify such bugs by enforcing errors from the lowest level: the kernel. Because Go produces static binaries, faults must be injected without LD_PRELOAD. One of our tools, KRF, allows us to do exactly this.

During our recent assessment of the Kubernetes codebase, the use of KRF provided a finding deep inside a vendored dependency, simply by randomly faulting read and write system calls spawned by a process and its children. This technique was effective against the Kubelet, which commonly interfaces with the underlying system. The bug was triggered when the ionice command was faulted, producing no output to STDOUT and sending an error to STDERR. After the error was logged, execution continued instead of returning the error in STDERR to the caller. This results in STDOUT later being indexed, causing an index out of range runtime panic.

E0320 19:31:54.493854    6450 fs.go:591] Failed to read from stdout for cmd [ionice -c3 nice -n 19 du -s /var/lib/docker/overlay2/bbfc9596c0b12fb31c70db5ffdb78f47af303247bea7b93eee2cbf9062e307d8/diff] - read |0: bad file descriptor
panic: runtime error: index out of range

goroutine 289 [running]:, 0x5e, 0x1bf08eb000, 0x1, 0x0, 0xc0011a7188)
    /workspace/anago-v1.13.4-beta.0.55+c27b913fddd1a6/src/ +0xa86*RealFsInfo).GetDirDiskUsage(0xc000bdbb60, 0xc001192c60, 0x5e, 0x1bf08eb000, 0x0, 0x0, 0x0)
    /workspace/anago-v1.13.4-beta.0.55+c27b913fddd1a6/src/ +0x89*realFsHandler).update(0xc000ee7560, 0x0, 0x0)
    /workspace/anago-v1.13.4-beta.0.55+c27b913fddd1a6/src/ +0x36a*realFsHandler).trackUsage(0xc000ee7560)
    /workspace/anago-v1.13.4-beta.0.55+c27b913fddd1a6/src/ +0x13b
created by*realFsHandler).Start
    /workspace/anago-v1.13.4-beta.0.55+c27b913fddd1a6/src/ +0x3f

Figure 7: The shortened callstack of the resulting Kubelet panic.

stdoutb, souterr := ioutil.ReadAll(stdoutp)
if souterr != nil {
	klog.Errorf("Failed to read from stdout for cmd %v - %v", cmd.Args, souterr)

Figure 8: The logging of STDERR without returning the error to the caller.

usageInKb, err := strconv.ParseUint(strings.Fields(stdout)[0], 10, 64)

Figure 9: The attempted indexing of STDOUT, even though it is empty. This is the cause of the runtime panic.

For a more complete walkthrough containing reproduction steps, our Kubernetes Final Report details the use of KRF against the Kubelet in Appendix G (pg. 109).

Go’s compiler also allows instrumentation to be included in a binary, which permits detection of race conditions at runtime. This is extremely useful for identifying potentially exploitable races as an attacker, but it can also be leveraged to identify incorrect handling of defer, panic, and recover. We built trailofbits/on-edge to do exactly this: Identify global state changes between a function entrypoint and the point at which a function panics, and exfiltrate this information through the Go race detector. More in-depth use of OnEdge can be found in our previous blog post, “Panicking the Right Way in Go.”

In practice, we recommend using:

All of these tools, with the exception of KRF, require a bit of effort to use in practice.

Using the compiler to our advantage

The Go compiler has many built-in features and directives that aid in finding bugs. These features are hidden in and throughout various switches, and require a bit of configuration for our purposes.

Subverting the type system

Sometimes when attempting to test the functionality of a system, the exported functions aren’t what we want to test. Getting testable access to the desired functions may require renaming a lot of them so they can be exported, which can be burdensome. To help address this problem, the build directives of the compiler can be used to perform name linking, accessing controls provided by the export system. As an example of this functionality, the program below (graciously extracted from a Stack Overflow answer) accesses the unexported reflect.typelinks function and subsequently iterates the type link table to identify types present in the compiled program.

package main

import (

func Typelinks() (sections []unsafe.Pointer, offset [][]int32) {
	return typelinks()

//go:linkname typelinks reflect.typelinks
func typelinks() (sections []unsafe.Pointer, offset [][]int32)

func Add(p unsafe.Pointer, x uintptr, whySafe string) unsafe.Pointer {
	return add(p, x, whySafe)

//go:linkname add reflect.add
func add(p unsafe.Pointer, x uintptr, whySafe string) unsafe.Pointer

func main() {
	sections, offsets := Typelinks()
	for i, base := range sections {
		for _, offset := range offsets[i] {
			typeAddr := Add(base, uintptr(offset), "")
			typ := reflect.TypeOf(*(*interface{})(unsafe.Pointer(&typeAddr)))

Figure 10: A generalized version of the Stack Overflow answer, using the linkname build directive.

$ go run main.go 

Figure 11: The output of the typelinks table.

In situations where you need even more granular control at runtime (i.e., more than just the linkname directive), you can write in Go’s intermediate assembly and include it during compilation. While it may be incomplete and slightly out of date in some places, the teh-cmc/go-internals repository provides a great introduction to how Go assembles functions.

Compiler-generated coverage maps

To help with testing, the Go compiler can perform preprocessing to generate coverage information. This is intended for identifying unit and integration testing coverage information, but we can also use it to identify coverage generated by our fuzzing and property testing. Filippo Valsorda provides a simple example of this in a blog post.

Type-width safety

Go has support for automatically determining the size of integers and floating point numbers based on the target platform. However, it also allows for fixed-width definitions, such as int32 and int64. When mixing both automatic and fixed-width sizes, there are opportunities for incorrect assumptions about behaviour across multiple target platforms.

Testing against both 32-bit and 64-bit platform builds of a target will help identify platform-specific problems. These problems tend to be found in areas performing validation, decoding, or type conversion, where improper assumption about source and destination type properties are made. Examples of this were identified in the Kubernetes security assessment, specifically TOB-K8S-015: Overflows when using strconv.Atoi and downcasting the result (pg. 42 in the Kubernetes Final Report), with an example inlined below.

// updatePodContainers updates PodSpec.Containers.Ports with passed parameters.
func updatePodPorts(params map[string]string, podSpec *v1.PodSpec) (err error) {
    port := -1
    hostPort := -1
    if len(params["port"]) > 0 {
        port, err = strconv.Atoi(params["port"]) // <-- this should parse port as strconv.ParseUint(params["port"], 10, 16) if err != nil { return err } } // (...) // Don't include the port if it was not specified. if len(params["port"]) > 0 {
        podSpec.Containers[0].Ports = []v1.ContainerPort{
                ContainerPort: int32(port), // <-- this should later just be uint16(port)

Figure 12: An example of downcasting to a fixed-width integer from an automatic-width integer (returned by Atoi).

root@k8s-1:/home/vagrant# kubectl expose deployment nginx-deployment --port 4294967377 --target-port 4294967376
E0402 09:25:31.888983    3625 intstr.go:61] value: 4294967376 overflows int32
goroutine 1 [running]:
runtime/debug.Stack(0xc000e54eb8, 0xc4f1e9b8, 0xa3ce32e2a3d43b34)
	/usr/local/go/src/runtime/debug/stack.go:24 +0xa7, 0xa, 0x100000050, 0x0, 0x0)
service/nginx-deployment exposed

Figure 13: The resulting overflow from incorrect type-width assumptions.

In practice, the type system subversion is rarely necessary. Most interesting targets for testing are already exported, available through traditional imports. We recommend using this only when helpers and similar unexported functions are required for testing. As for testing type-width safety, we recommend compiling against all targets when possible, even if it is not directly supported, since problems may be more apparent on different targets. Finally, we recommend generating coverage reports on projects with unit and integration tests, at a minimum. It helps identify areas that are not directly tested, which can be prioritized for review.

A note about dependencies

In languages such as JavaScript and Rust, dependency managers have built-in support for dependency auditing—scanning project dependencies for versions known to have vulnerabilities. In Go, no such tool exists, at least not in a publicly available and non-experimental state.

This lack likely stems from the fact that there are many different methods of dependency management: go-mod, go-get, vendored, etc. These various methods use radically different approaches, resulting in no straightforward way to universally identify dependencies and their versions. Furthermore, in some cases it is common for developers to subsequently modify their vendored dependency source code.

The problem of dependency management has progressed over the years of Go’s development, and most developers are moving towards the use of go mod. This allows dependencies to be tracked and versioned within a project through the go.mod file, opening the door for future dependency scanning efforts. An example of such an effort can be seen within the OWASP DependencyCheck tool, which has an experimental go mod plugin.


Ultimately, there are quite a few tools available for use within the Go ecosystem. Although mostly disparate, the various static analysis tools help identify “low hanging fruit” within a given project. When looking for deeper concerns, fuzzing, property testing, and fault injection tools are readily available. Compiler configuration subsequently augments the dynamic techniques, making it easier to build harnesses and evaluate their effectiveness.

Interested in seeing these techniques shake out bugs in your Go systems? Trail of Bits can make that happen. Do you want custom analysis built specifically for your organization? We do that too. Contact us!

Two New Tools that Tame the Treachery of Files

Parsing is hard, even when a file format is well specified. But when the specification is ambiguous, it leads to unintended and strange parser and interpreter behaviors that make file formats susceptible to security vulnerabilities. What if we could automatically generate a “safe” subset of any file format, along with an associated, verified parser? That’s our collective goal in Dr. Sergey Bratus’s DARPA SafeDocs program.

But wait—why is parsing hard in the first place? Design decisions like embedded scripting languages, complex context-sensitive grammars, and object models that allow arbitrary dependencies between objects may have looked like good ways to enrich a format, but they increase the attack surface of a parser, leading to forgotten or bypassed security checks, denial of service, privacy leakage, information hiding, and even hidden malicious payloads.

Two examples of this problem are polyglots and schizophrenic files. Polyglots are files that can be validly interpreted as two different formats. Have you ever read a PDF file and then been astonished to discover that it is also a valid ZIP file? Or edited an HTML file only to discover that it is also a Ruby script? Congratulations, you discovered a polyglot. This is not to be confused with schizophrenic files: That’s when two parsers interpret the same file in different ways, e.g., your PDF displays different content depending on whether you opened it in Adobe Acrobat or Foxit Reader, or your HTML page renders differently between Chrome and Internet Explorer.

We’ve developed two new tools that take the pain out of parsing and make file formats safer:

  1. PolyFile: A polyglot-aware file identification utility with manually instrumented parsers that can semantically label the bytes of a file hierarchically; and
  2. PolyTracker: An automated instrumentation framework that efficiently tracks input file taint through the execution of a program.

Collectively, the tools enable Automated Lexical Annotation and Navigation of Parsers, a backronym devised solely for the purpose of referring to them as The ALAN Parsers Project.

Before we get into their details, let’s first talk about why these tools are necessary.

Ceci N’est Pas Un PDF

Please rise and open your hymnals to page 541 for the recitation of chapter 7 verse 6:

…a file has no intrinsic meaning. The meaning of a file—its type, its validity, its contents—can be different for each parser or interpreter.

Ange Albertini

You may be seated.

This talk by Trail of Bits researcher Evan Sultanik gives a number of examples of how polyglots and induced schizophrenia are more than just nifty parlor tricks. For example:

A PDF can even be a valid git repository that, when cloned, contains the LaTeX source code to generate the PDF and a copy of itself. Ange Albertini also has an excellent series of videos introducing funky file tricks.

What does it take to understand a popular file format that has
been accreting features (and misfeatures) over 20 years? PDF
provides just such a challenge.

  • An embedded Turing complete programming language? 👌
  • Arbitrary chaining of stream decoders that allow for both memory and computational denial of service? 👌
  • Multiple, redundant, and potentially conflicting ways to specify the length of a stream object? 👌
  • Arbitrary data allowed both before and after the file? 👌
  • Numerous ways to steganographically embed data, including arbitrary length binary blobs? 👌
  • A graph-based document object model that allows, and in some cases requires, cycles? 👌
  • A multi-decade history with ambiguous or incomplete specifications resulting in dozens of conflicting implementations, some of which emit non-compliant, malformed documents? 👌
  • The necessity for parser implementations to be resilient to malformations but also free to handle them differently? 👌
  • A specification that has ​got in the way of creating of a formally verified parser in Coq because Coq could not prove that a parser trying to do its best on checking indirect references would, in fact, terminate on maliciously crafted files​? 👌

Challenge accepted! To be fair, PDF is leading the way on defining simpler, reduced subsets of the format. PDF/A, designed to make sure PDF documents remain parseable for long-term preservation, has removed some of these problematic features. Moreover, they are by no means specific to PDF: they are endemic to document formats in general. For example, Microsoft Office’s OOXML has not done better, with severe external entity attacks that have been employed in the wild, not to mention XML-based vulnerabilities like the billion laughs attack. Even parsing JSON is harder than one might think, as is plain old UTF-8.

But surely in the land of Programming Languages, at least, all must be well, since their parsers are automatically generated from unambiguous specifications by classic algorithms and proven tools. Not so much: This Empire Hacking talk gives examples of how a poorly designed language can cause parsing problems even when no malice is involved. One does not simply walk into the shunting yard!

But back to data formats. In view of the challenges above, instead of focusing on the specification, we examine the de facto interpretations of the specification: Parser implementations. Our underlying hypothesis is that the “unsafe” portions of a file format will exist in the symmetric difference of the parsers’ accepted grammars. The portions of the file format to keep are the ones accepted and interpreted equivalently across all implementations.

PolyFile: Ground Truth Labeling of File Semantics

File identification utilities are, by and large, dumb in the sense that they simply compare the file against magic byte signatures of various formats. Moreover, these tools terminate once they find the first match, and do not recursively identify embedded file types or files that do not start at byte offset zero. Once a file is classified, there is typically little to no information about the contents of the file. It’s a PDF, but how many objects does it contain? It’s a ZIP, but what are its file contents?

Our new PolyFile project resolves these issues and provides:

  • Identification of any and all files embedded within the input, not necessarily starting at byte offset zero;
  • File formats for which an instrumented parser is available should be fully parsed, emitting a hierarchical semantic mapping of the input’s contents;
  • An interactive file explorer for a human to examine its contents and structure; and
  • Computer-readable output that can be used to assign semantic meaning to each byte of the input file (e.g., byte x corresponds to the first byte in a PDF stream object, and the start of a JPEG/JFIF header).

A fairly ideal file identification utility, n’est ce pas?

Ange Albertini’s SBuD project comes close in spirit, but currently only supports a couple image formats. Even the popular Unix file command only has support for several hundred file signatures. In contrast, PolyFile has support for over 10,000 file formats, and can recursively identify them in a file, emitting a hierarchical mapping as an extension of the SBuD JSON format. It also has support for semantically labeling files based on Kaitai Struct declarative file format specifications.

Additionally, PolyFile can optionally emit a self-contained HTML file with an interactive hex viewer and semantic labeling explorer. Here is an example of the HTML output from the résumé Evan Sultanik submitted to Trail of Bits. In addition to being a PDF that displays its own MD5 hash, it is a valid Nintendo Entertainment System ROM that, when emulated, is a playable game that displays the MD5 hash of the PDF. It is also a valid ZIP file containing, among other things, a PDF that is a git repository containing its LaTeX source code and a copy of itself.

PolyFile is free and open-source. You can download a copy at:

Parser Instrumentation

Now that we have PolyFile to provide ground truth, we need a way to propagate the semantic labels through a parser; the programmatic equivalent of using contrast dye to track blood flow in the brain during a CT scan. We therefore need an automated way to instrument a parser to track those labels, with the goal of associating functions with the byte offsets of the input files on which they operate. Since PolyFile can tell us the semantic meaning behind those offsets, this will let us infer the purpose of the parser’s functions. For example, if a function in a PDF parser always operates on bytes associated with JFIF stream objects, we can assume it is responsible for processing embedded JPEGs.

There are several existing projects to do this sort of taint tracking, using various methods. The best maintained and easiest to use are AUTOGRAM and TaintGrind. However, the former is limited to analysis on the JVM, and the latter Valgrind plugin suffers from unacceptable runtime overhead when tracking as few as several bytes at a time. For example, we ran mutool, a utility in the muPDF project, using TaintGrind over a corpus of medium sized PDFs, and in every case the tool had to be halted after over 24 hours of execution for operations that would normally complete in milliseconds without instrumentation.

At first glance, our goals might seem to be satisfied by AFL-analyze, a tool bundled with the AFL fuzzer. In a sense, our goal is in fact to create its counterpart. AFL-analyze uses fuzzing to reverse engineer a file format from a parser. In our case, we have ground truth about the file format and want to reverse-engineer the parser.

Although intended for fuzzing, Angora’s taint analysis engine has many of the features necessary to track byte indexes during execution. In fact, as is described in the following sections, we build on many of the algorithmic advances of Angora while improving both computational and memory efficiency. Angora is built atop the LLVM Data Flow Sanitizer (dfsan), which we also leverage for PolyTracker. The following section describes dfsan’s operation, limitations, and how we improved upon both dfsan and Angora.

LLVM and the Data Flow Sanitizer

We chose a static instrumentation approach built on LLVM, since this allows us to instrument any parser capable of being compiled with LLVM and eventually instrument closed-source parsers (e.g., by lifting their binaries to LLVM using Remill or McSema).

LLVM has an instrumentation tool for propagating taint called the Data Flow Sanitizer (dfsan), which is also used by Angora. However, dfsan imposes severe limitations on the total number of taints tracked during program execution, which means that, in practice, we could only track a handful of input bytes from a file at once. To see why, consider a parser that performs the following:

fd = fopen(“foo.pdf”, “rb”);
a = fgetc(fd);
b = fgetc(fd);
c = a + b;

In this case, dfsan will taint the variable a by byte offset 0 and variable b by taint offset 1. Byte c will be tainted by both byte 0 and byte 1. The combinatorial challenge here is that there are 2n possible taints, where n is the number of bytes in the input file. Therefore, the naïve approach of storing taints using a bitset will be infeasible, even for small numbers of input, and even when using a compressed bitset.

The representational problem is addressed in dfsan by storing taint provenance in a data structure it calls the “union table,” which is a computationally efficient way to store a binary forest of taint unions. Each taint gets a unique 16-bit label. Then, in the example above, where the taint of a is unioned with b to create c, dfsan would record union_table[a’s label][b’s label] = c’s label.

Thereafter, if a and b are ever unioned again, c’s taint label can be reused. This allows constant time union table checks; however, the table itself requires O(n2) storage. This is very wasteful, since the table will almost always be very sparse. It’s also what necessitates the 16-bit taint labels, since using larger labels would exponentially increase the size of the union table. This means that dfsan can only track, at most, 65,536 taints throughout execution, including all new taints that are created from unions. This is insufficient to track more than a handful of input bytes at a time.

Introducing PolyTracker: Efficient Binary Instrumentation for Universal Taint Tracking

Our novel taint tracking data structures and algorithms—as well as numerous heuristics for reducing computational overhead—are manifested in our new PolyTracker tool. It is a wrapper around clang and clang++ that allows you to instrument any executable. Simply replace your compiler with PolyTracker during your normal build process. The resulting executable will create a JSON file containing a mapping of functions to the input file byte offsets on which each function operates. Moreover, since PolyTracker is built as an LLVM pass, it can be used on any black-box binary that has been lifted to LLVM/IR, even when source code is unavailable.

We maintained dfsan’s concepts of shadow memory and its instrumentation framework for tracking taints. However, we switched away from the union table and implemented a scalable data structure capable of exploiting the inherent sparsity in taint unions. This is augmented by a binary forest of taint unions, supplanting dfsan’s label array and allowing us to increase the size of the taint labels past 16-bits. PolyTracker’s binary forest data structure uses a memory layout algorithm that eliminates the need for an Angora-style taint label to bitvector lookup table, while also providing constant time insertion. This reduces the memory requirements from exponential to linear in the size of the input file plus the number of instructions executed by the parser, at the expense of an O(n log n) graph traversal in post-processing to resolve the taints. In practice, this results in negligible execution overhead for the majority of PDFs.

PolyTracker is free and open-source. You can download a copy today at:

It’s Easy to Get Started

PolyFile can be installed with this one quick command:

pip3 install polyfile

PolyTracker requires a working build of LLVM. However, we have made this easy by providing a Docker container on DockerHub that already has everything built. Simply download a copy of the container to start instrumenting your parsers!

docker pull trailofbits/polytracker:latest
docker run -it --rm trailofbits/polytracker:latest

We have lots of features in active development, including intelligent file mutation for fuzzing and differential testing, temporal analysis of taint propagation, and automated identification of error handling routines.

We’re also excited to hear what other clever uses the community devises for the tools. Are you using PolyTracker to discover input bytes that are ignored by the parser? Do you use special taint labels to track the results of functions like strlen that are likely to correspond to field lengths? Let us know on the Empire Hacking Slack! Have an idea that you’d like to see turned into a new feature? Feel free to add a GitHub issue!


This research was developed with funding from the Defense Advanced Research Projects Agency (DARPA). The views, opinions, and/or findings expressed are those of the authors and should not be interpreted as representing the official views or policies of the Department of Defense or the U.S. Government.

Destroying x86_64 instruction decoders with differential fuzzing

TL;DR: x86_64 decoding is hard, and the number and variety of implementations available for it makes it uniquely suited to differential fuzzing. We’re open sourcing mishegos, a differential fuzzer for instruction decoders. You can use it to discover discrepancies in your own decoders and analysis tools!

Some of mishegos's output, visualized.

Figure 1: Some of Mishegos’s output, visualized.

In the beginning, there was instruction decoding

Decompilation and reverse engineering tools are massive, complicated beasts that deal with some of the hardest problems in binary analysis: variable type and layout recovery, control flow graph inference, and sound lifting to higher-order representations for both manual and automated inspection.

At the heart of each of these tasks is accurate instruction decoding. Automated tools require faithful extraction of instruction semantics to automate their analyses, and reverse engineers expect accurate disassembly listings (or well-defined failure modes) when attempting manual comprehension.

Instruction decoding is implicitly treated as a solved problem. Analysis platforms give analysts a false sense of confidence by encouraging them to treat disassembled output as ground truth, without regarding potential errors in the decoder or adversarial instruction sequences in the input.

Mishegos challenges this assumption.

(x86_64) Instruction decoding is hard

Like, really hard:

  • Unlike RISC ISAs such as ARM and MIPS, x86_64 has variable-length instructions, meaning that decoder implementations must incrementally parse the input to know how many bytes to fetch. An instruction can be anywhere between 1 byte (e.g., 0x90, nop) and 15 bytes long. Longer instructions may be semantically valid (i.e., they may describe valid combinations of prefixes, operations, and literals), but actual silicon implementations will only fetch and decode 15 bytes at most (see the Intel x64 Developer’s Manual, §2.3.11).
  • x86_64 is the 64-bit extension of a 32-bit extension of a 40-year-old 16-bit ISA designed to be source-compatible with a 50-year-old 8-bit ISA. In short, it’s a mess, with each generation adding and removing functionality, reusing or overloading instructions and instruction prefixes, and introducing increasingly complicated switching mechanisms between supported modes and privilege boundaries.
  • Many instruction sequences have overloaded interpretations or plausible disassemblies, depending on the active processor’s state or compatibility mode. Disassemblers are required to make educated guesses, even when given relatively precise information about the compilation target or the expected execution mode.

The complexity of the x86_64 instruction format is especially apparent when visualized:

Figure 2: Visualizing some of x86_64’s complexity.

Figure 2: Visualizing some of x86_64’s complexity.

Even the graphic above doesn’t fully capture x86_64’s nuances—it ignores the internal complexity of the ModR/M and scale-index-base (SIB) bytes, as well as the opcode extension bit and various escaping formats for extended opcodes (legacy escape prefixes, VEX escape, and XOP escape).

All told, these complexities make x86_64 decoder implementations uniquely amenable to testing via differential fuzzing—by hooking a mutation engine up to several different implementations at once and comparing each collection of outputs, we can quickly suss out bugs and missing functionality.

Building a “sliding” mutation engine for x86_64 instructions

Given this layout and our knowledge about minimum and maximum instruction lengths on x86_64, we can construct a mutation engine that probes large parts of the decoding pipeline with a “sliding” strategy:

  • Generate an initial instruction candidate of up to 26 bytes, including structurally valid prefixes and groomed ModR/M and SIB fields.
  • Extract each “window” of the candidate, where each window is up to 15 bytes beginning at index 0 and moving to the right.
  • Once all windows are exhausted, generate a new instruction candidate and repeat.

Why up to 26 bytes? See above! x86_64 decoders will only accept up to 15 bytes, but generating long, (potentially) semantically valid x86_64 instruction candidates that we “slide” through means we can test likely edge cases in decoding:

  • Failing to handle multiple, duplicate instruction prefixes.
  • Emitting nonsense prefixes or disassembly attributes (e.g., accepting and emitting a repeat prefix on a non-string operation, or the lock prefix on something that isn’t atomizable).
  • Failing to parse the ModR/M or SIB bytes correctly, causing incorrect opcode decoding or bad displacement/immediate scaling/indexing.

so, a maximal instruction candidate, shown in purple (with dummy displacement and immediate values, shown in grey) like…

f0 f2 2e 67 46 0f 3a 7a 22 8e 00 01 02 03 04 05 06 07 08 09 0a 0b 0c 0d 0e 0f
Figure 3: A maximal instruction candidate.

… yields 12 “window” candidates for actual fuzzing.

f0 f2 2e 67 46 0f 3a 7a 22 8e 00 01 02 03 04
f2 2e 67 46 0f 3a 7a 22 8e 00 01 02 03 04 05
2e 67 46 0f 3a 7a 22 8e 00 01 02 03 04 05 06
67 46 0f 3a 7a 22 8e 00 01 02 03 04 05 06 07
46 0f 3a 7a 22 8e 00 01 02 03 04 05 06 07 08
0f 3a 7a 22 8e 00 01 02 03 04 05 06 07 08 09
3a 7a 22 8e 00 01 02 03 04 05 06 07 08 09 0a
7a 22 8e 00 01 02 03 04 05 06 07 08 09 0a 0b
22 8e 00 01 02 03 04 05 06 07 08 09 0a 0b 0c
8e 00 01 02 03 04 05 06 07 08 09 0a 0b 0c 0d
00 01 02 03 04 05 06 07 08 09 0a 0b 0c 0d 0e
01 02 03 04 05 06 07 08 09 0a 0b 0c 0d 0e 0f

Figure 4: Extracted instruction candidates.

Consequently, our mutation engine spends a lot of time trying out different sequences of prefixes and flags, and relatively little time interacting with the (mostly irrelevant) displacement and immediate fields.

Mishegos: Differential fuzzing of x86_64 decoders

Mishegos takes the “sliding” approach above and integrates it into a pretty typical differential fuzzing scheme. Each fuzzing target is wrapped into a “worker” process with a well-defined ABI:

  • worker_ctor and worker_dtor: Worker setup and teardown functions, respectively.
  • try_decode: Called for each input sample, returns the decoder’s results along with some metadata (e.g., how many bytes of input were consumed, the status of the decoder).
  • worker_name: A constant string used to uniquely identify the type of worker.

The codebase currently implements five workers:

  • Capstone—A popular disassembly framework originally based on the LLVM project’s disassemblers.
  • libbfd/libopcodes—The backing libraries used by the popular GNU binutils.
  • udis86—An older, potentially unmaintained decoder (last commit 2014).
  • XED—Intel’s reference decoder.
  • Zydis—Another popular open source disassembly library, with an emphasis on speed and feature-completeness.

Because of the barebones ABI, Mishegos workers tend to be extremely simple. The worker for Capstone, for example, is just 32 lines:

#include <capstone/capstone.h>

#include "../worker.h"

static csh cs_hnd;

char *worker_name = "capstone";

void worker_ctor() {
  if (cs_open(CS_ARCH_X86, CS_MODE_64, &cs_hnd) != CS_ERR_OK) {
    errx(1, "cs_open");

void worker_dtor() {

void try_decode(decode_result *result, uint8_t *raw_insn, uint8_t length) {
  cs_insn *insn;
  size_t count = cs_disasm(cs_hnd, raw_insn, length, 0, 1, &insn);

  if (count > 0) {
    result->status = S_SUCCESS;
    result->len =
      snprintf(result->result, MISHEGOS_DEC_MAXLEN, "%s %s\n", insn[0].mnemonic, insn[0].op_str);
    result->ndecoded = insn[0].size;
    cs_free(insn, count);
  } else {
    result->status = S_FAILURE;

Figure 5: Source for the Capstone worker.

Behind the scenes, workers receive inputs and send outputs in parallel via slots, which are accessed through a shared memory region managed by the fuzzing engine. Input slots are polled via semaphores to ensure that each worker has retrieved a candidate for decoding; output slots are tagged with the worker’s name and instruction candidate to allow for later collection into cohorts. The result is a relatively fast differential engine that doesn’t require each worker to complete a particular sample before continuing: Each worker can consume inputs at its own rate, with only the number of output slots and cohort collection limiting overall performance.

The bird’s-eye view:

Figure 6: Mishegos’s architecture.

Figure 6: Mishegos’s architecture.

Making sense of the noise

Mishegos produces a lot of output: A single 60-second run on a not particularly fast Linux server (inside of Docker!) produces about 1 million cohorts, or 4 million bundled outputs (1 output per input per fuzzing worker with 4 workers configured):

Figure 7: An example Mishegos run.

Figure 7: An example Mishegos run.

Each output cohort is structured as a JSON blob, and looks something like this:

"input": "3626f3f3fc0f587c22",
"outputs": [
    "ndecoded": 5,
    "len": 21,
    "result": "ss es repz repz cld \n",
    "workerno": 0,
    "status": 1,
    "status_name": "success",
    "worker_so": "./src/worker/bfd/"
    "ndecoded": 5,
    "len": 5,
    "result": "cld \n",
    "workerno": 1,
    "status": 1,
    "status_name": "success",
    "worker_so": "./src/worker/capstone/"
    "ndecoded": 5,
    "len": 4,
    "result": "cld ",
    "workerno": 2,
    "status": 1,
    "status_name": "success",
    "worker_so": "./src/worker/xed/"
    "ndecoded": 5,
    "len": 3,
    "result": "cld",
    "workerno": 3,
    "status": 1,
    "status_name": "success",
    "worker_so": "./src/worker/zydis/"

Figure 8: An example output cohort from Mishegos.

In this case, all of the decoders agree: The first five bytes of the input decode to a valid cld instruction. libbfd is extra eager and reports the (nonsense) prefixes, while the others silently drop them as irrelevant.

But consistent successes aren’t what we’re interested in—we want discrepancies, dammit!

Discrepancies can occur along a few dimensions:

  • One or more decoders disagree about how many bytes to consume during decoding, despite all reporting success.
  • One or more decoders report failure (or success), in contrast to others.
  • All decoders report success and consume the same number of input bytes, but one or more disagree about a significant component of the decoding (e.g., the actual opcode or immediate/displacement values).

Each of these has adversarial applications:

  • Decoding length discrepancies can cause a cascade of incorrect disassemblies, preventing an automated tool from continuing or leaving a manual analyst responsible for realigning the disassembler.
  • Outright decoding failures can be used to prevent usage of a susceptible tool or platform entirely, or to smuggle malicious code past an analyst.
  • Component discrepancies can be used to mislead an analysis or human analyst into incorrectly interpreting the program’s behavior. Severe enough discrepancies could even be used to mask the recovered control flow graph!

Mishegos discovers each of these discrepancy classes via its analysis tool and presents them with mishmat, a hacky HTML visualization. The analysis tool collects language-agnostic “filters” into “passes” (think LLVM), which can then order their internal filters either via a dependency graph or based on perceived performance requirements (i.e., largest filters first). Passes are defined in ./src/analysis/passes.yml, e.g.:

# Find inputs that all workers agree are one size, but one or more
# decodes differently.
- filter-any-failure
- filter-ndecoded-different
- filter-same-effects
- minimize-input
- normalize

Figure 9: An example of a Mishegos analysis pass comprised of several filters

Individual filters are written as small scripts that take cohorts on stdin and conditionally emit them on stdout. For example, filter-ndecoded-different:

require "json"

STDERR.puts "[+] pass: filter-ndecoded-different"

count = 0
STDIN.each_line do |line|
  result = JSON.parse line, symbolize_names: true

  outputs_ndecoded = result[:outputs].map { |o| o[:ndecoded] }

  if outputs_ndecoded.uniq.size > 1
    count += 1

  STDOUT.puts result.to_json

STDERR.puts "[+] pass: filter-ndecoded-different done: #{count} filtered"

Figure 10: An example of a Mishegos analysis filter

Filters can also modify individual results or entire cohorts. The minimize-input filter chops the instruction candidate down to the longest indicated ndecoded field, and the normalize filter removes extra whitespace in preparation for additional analysis of individual assemblies.

Finally, passes can be run as a whole via the analysis command-line:

Figure 11: An example analysis run.

Figure 11: An example analysis run.

The analysis output can be visualized with mishmat, with an optional cap on the size of the HTML table:

mishmat -l 10000 < /tmp/ > /tmp/

Ultimately, this yields fun results like the ones below (slightly reformatted for readability). Instruction candidates are on the left, individual decoder results are labeled by column. (bad) in libbfd‘s column indicates a decoding failure. The (N/M) syntax represents the number of bytes decoded (N) and the total length of the assembled string (M):

libbfd capstone zydis xed
f3f326264e0f3806cc repz repz es es rex.WRX (bad) (8 / 29) phsubd mm1, mm4 (9 / 15) (0 / 0) (0 / 0)
26f366364f0f38c94035 es data16 ss rex.WRXB (bad) (8 / 27) sha1msg1 xmm8, xmmword ptr ss:[r8 + 0x35] (10 / 41) (0 / 0) (0 / 0)
f366364f0f38c94035 data16 ss rex.WRXB (bad) (7 / 24) sha1msg1 xmm8, xmmword ptr ss:[r8 + 0x35] (9 / 41) (0 / 0) (0 / 0)
66364f0f38c94035 ss rex.WRXB (bad) (6 / 17) sha1msg1 xmm8, xmmword ptr ss:[r8 + 0x35] (8 / 41) (0 / 0) (0 / 0)

Figure 12: Capstone thinking that nonsense decodes to valid SSE instructions.

libbfd capstone zydis xed
f36766360f921d32fa9c83 repz data16 setb BYTE PTR ss:[eip+0xffffffff839cfa32] # 0xffffffff839cfa3d (11 / 74) (0 / 0) setb byte ptr [eip-0x7c6305ce] (11 / 30) setb byte ptr ss:[0x00000000839CFA3D] (11 / 37)

Figure 13: Capstone missing an instruction entirely.

libbfd capstone zydis xed
3665f0f241687aa82c8d ss gs lock repnz rex.B push 0xffffffff8d2ca87a (10 / 46) push -0x72d35786 (10 / 16) (0 / 0) (0 / 0)
65f0f241687aa82c8d gs lock repnz rex.B push 0xffffffff8d2ca87a (9 / 43) push -0x72d35786 (9 / 16) (0 / 0) (0 / 0)
f0f241687aa82c8d lock repnz rex.B push 0xffffffff8d2ca87a (8 / 40) push -0x72d35786 (8 / 16) (0 / 0) (0 / 0)

Figure 14: Amusing signed representations.

libbfd capstone zydis xed
3e26f0f2f1 ds es lock repnz icebp (5 / 22) int1 (5 / 4) (0 / 0) (0 / 0)

Figure 15: Undocumented opcode discrepancies!

libbfd capstone zydis xed
f3430f38f890d20aec2c repz rex.XB (bad) (5 / 17) (0 / 0) enqcmds rdx, zmmword ptr [r8+0x2cec0ad2] (10 / 40) enqcmds rdx, zmmword ptr [r8+0x2CEC0AD2] (10 / 40)
2e363e65440f0dd8 cs ss ds gs rex.R prefetch (bad) (6 / 32) (0 / 0) nop eax, r11d (8 / 13) nop eax, r11d (8 / 13)
f2f266260fbdee repnz data16 es (bad) (6 / 21) (0 / 0) bsr bp, si (7 / 10) bsr bp, si (7 / 10)

Figure 16: XED and Zydis only.

libbfd capstone zydis xed
64f064675c fs lock fs addr32 pop rsp (5 / 25) (0 / 0) (0 / 0) (0 / 0)
2e cs (1 / 2) (0 / 0) (0 / 0) (0 / 0)
f06636f00f3802c7 lock ss lock phaddd xmm0,xmm7 (8 / 29) (0 / 0) (0 / 0) (0 / 0)
f03e4efd lock ds rex.WRX std (4 / 19) (0 / 0) (0 / 0) (0 / 0)
36f03e4efd ss lock ds rex.WRX std (5 / 22) (0 / 0) (0 / 0) (0 / 0)

Figure 17: And, of course, libbfd being utterly and repeatedly wrong.

The results above were captured with revision 88878dc on the repository. You can reproduce them by running the fuzzer in manual mode:

M=1 mishegos ./workers.spec <<< “36f03e4efd” > /tmp/mishegos

The big takeaways

For reverse engineers and program analysts: x86_64 instruction decoding is hard. The collection of tools that you rely on to do it are not, in fact, reliable. It is possible (and even trivial), given Mishegos’s output, to construct adversarial binaries that confuse your tools and waste your time. We’ve reported some of these issues upstream, but make no mistake: Trusting your decoder to perfectly report the machine behavior of a byte sequence will burn you.

Not everything is doom and gloom. If you need accurate instruction decoding (and you do!), you should use XED or Zydis. libopcodes is frequently close to Zydis and XED in terms of instruction support but consistently records false positives and decodes just prefixes as valid instructions. Capstone reports both false positives and false negatives with some regularity. udis86 (not shown above) behaves similarly to libopcodes and, given its spotty maintenance, should not be used.

This post is one of many from our team on the vagaries of parsing. Watch this space for a post by Evan Sultanik on polyglots and schizophrenic parsing.

How safe browsing fails to protect user privacy

Recently, security researchers discovered that Apple was sending safe browsing data to Tencent for all Chinese users. This revelation has brought the underlying security and privacy guarantees of the safe browsing protocol under increased scrutiny. In particular, safe browsing claims to protect users by providing them with something called k-anonymity. In this post we’ll show that this definition of privacy is somewhat meaningless in the context of safe browsing. Before jumping into why k-anonymity is insufficient, let’s take a look at how the safe browsing protocol works.

How does safe browsing work?

A while back, Google thought it would be useful to leverage their knowledge of the web to provide a database of malicious sites to web clients. Initially, users would submit their IP address and the URL in question to Google, which would subsequently be checked against a malware database. This scheme was called the Lookup API and is still available today. However, people quickly became uneasy about surrendering so much of their privacy. This reasonable concern led to the development of the current safe browsing scheme, called the Update API, which is used by both Google and Tencent.

Screen Shot 2019-10-29 at 6.19.23 PM

Safe browsing Update API flowchart from Gerbet et al

At a high level, Google maintains a list of malicious URLs and their 256-bit hashes. To save on bandwidth when distributing this list to browsers, they only send out a 32-bit prefix of each hash. This means that when the user’s browser checks whether or not a site is malicious, they might get a false positive, since many 256-bit URL hashes will contain the same 32-bit prefix. To remedy this, if a match occurs, the browser will send the 32-bit prefix in question to Google and get a full list of URLs whose 256-bit hash contains that prefix. To recap, the safe browsing Update API goes through the following steps every time a user tries to visit a new URL:

  1. Browser hashes the URL and checks it against the (local) list of 32-bit prefixes.
  2. If there is a match, the browser sends Google the 32-bit prefix.
  3. Google then sends back all blacklisted URLs whose 256-bit hash contains the prefix.
  4. If there is a match in the updated list, the browser issues a warning to the user.

Intuitively, this safe browsing scheme is more private than the original, since Google only learns the 32-bit prefix of each potentially malicious site the user visits. Indeed, Google has argued that it provides users with something called k-anonymity—a metric used by privacy analysts to determine how unique a piece of identifying information is. Let’s take a look at what exactly k-anonymity is, and to what extent safe browsing satisfies this definition.

What is k-anonymity

Traditionally, k-anonymity has been used to remove personal identifying information from a database. At a high level, it involves removing pieces of sensitive data until everyone in the dataset “looks like” at least k other people with respect to certain traits. For example, if we had the table of medical records in Figure 1, we could modify the Name and Age fields to make patients 2-anonymous with respect to Name, Age, Gender, and State, as shown in Figure 2.

Name Age Gender State Disease
Candace 26 Female NY Flu
Richard 23 Male CA Flu
Robin 15 Nonbinary NY None
Alyssa 52 Female CA Cancer
Omar 29 Male CA None
Kristine 17 Nonbinary NY Cancer
Emily 58 Female CA Heart-disease
Jasmine 20 Female NY None

Figure 1

Anyone trying to use this data will get all the info they need to perform some kind of statistical analysis (ostensibly your name won’t really affect your likelihood of getting TB), but anyone represented in the database will “look like” at least two other people. That way, an attacker trying to de-anonymize people will fail because they won’t be able to distinguish between the three entries that look alike. Obviously the bigger the k, the better; if the attacker is an insurance provider trying to use medical data as a way to justify hiking up your premiums, a database providing 2-anonymity might not be enough. In Figure 2, if the insurance company knows you are represented in the database and a 52 year old woman from California, they will be able to deduce that you have either cancer or heart disease and start charging you more money.

Name Age Gender State Disease
* 20-30 Female NY Flu
* 20-30 Male CA Flu
* 10-20 Nonbinary NY None
* 50-60 Female CA Cancer
* 20-30 Male CA None
* 10-20 Nonbinary NY Cancer
* 50-60 Female CA Heart-disease
* 20-30 Female NY None

Figure 2

Back to safe browsing: We can see how restricting the URLs viewable by Google or Tencent to a 32-bit hash prefix renders both providers unable to distinguish your request from any other URL with that same hash prefix. The question is, how many such collisions can we expect to occur? In 2015 Gerbet et al concluded that each prefix occurred roughly 14757 times across the web, implying that users of safe browsing can expect their browsing data to be roughly 14757-anonymous. In other words, Google/Tencent only knows that the website you attempted to go to is contained in a set of size approximately 14757, which is likely big enough to contain plenty of generic websites that would not be politically (or commercially) very revealing.

Why k-anonymity fails to protect users

Despite the fact that safe browsing satisfies the definition of k-anonymity, it actually isn’t very hard for Google to recover your browsing data from these queries. This insecurity is due to the fact that the privacy guarantees of k-anonymity don’t account for Google’s ability to cross-reference multiple safe browsing queries and narrow down which specific website corresponds with a given 32-bit prefix.

As a first example of such an attack, recall that Google uses cookies for safe browsing and can therefore see when multiple queries come from the same IP address. Now, suppose both and share a 32-bit hash prefix with two different malicious websites. If a user visits both Amazon and their shopping cart in rapid succession, Google will receive both 32-bit hash prefixes. Since it is unlikely the user visited two unrelated malicious websites back to back, Google can be reasonably sure that they were shopping on Amazon. This attack only works when two related websites both share a 32-bit hash prefix with malicious websites and the user visits them within a small window of time. However, this example already shows that k-anonymity isn’t so useful when faced with an adversary capable of correlating multiple queries.

The situation is actually much worse, though, because the safe browsing protocol often forces users to submit several highly correlated URLs at the same time. These multiple queries occur because many URLs are in some sense “too specific” for Google to keep track of, since malicious websites can create new URLs faster than Google can report each specific one. To account for this, users submit a set of “URL decompositions” for each query, which is constructed by progressively stripping pieces of the URL off. For example, when visiting the URL http://a.b.c/1/2 the browser would simultaneously check the following URLs against the safe browsing database:

  1. a.b.c/1/2
  2. a.b.c/1/
  3. a.b.c/
  4. b.c/
  5. b.c/1/

Using the full URL decomposition allows Google to provide users with a high degree of confidence that the website they are visiting isn’t malicious. However, submitting many highly correlated 32-bit hash prefixes all at once ruins much of the privacy originally provided by the safe browsing protocol. If Google receives the 32-bit hash prefix corresponding to both a.b.c/ and a.b.c/1 in the same query, it can easily de-anonymize the user’s browsing data. Even in circumstances where submitting multiple prefixes doesn’t lead to full URL recovery, there may be sufficient information to learn sensitive details about the user’s browsing history.

To bring things down to earth, Gerbet et al. showed that this URL decomposition attack can be used to identify a user’s taste in pornography—something an oppressive government would certainly be interested in monitoring. Even worse, since these malware databases aren’t made public, it’s difficult to determine if hash prefixes haven’t been adversarially included to track users. While we may trust Google not to rig the database so they can determine when users visit pro-Hong Kong websites, it’s easy to imagine Tencent taking advantage of this vulnerability.

Looking forward

While safe browsing undoubtedly provides real security benefits to users, it fails to protect them from companies or governments that want to monitor their browsing habits. Unfortunately, this lack of privacy is obscured by the fact that the protocol provides users with a weak, but technically precise, notion of anonymity. As both the technology and legal communities rally around tools like k-anonymity and differential privacy, we need to keep in mind that they are not one-size-fits-all techniques, and systems that theoretically satisfy such definitions might provide no real meaningful privacy when actually deployed.

If you’re considering using tools like differential privacy or k-anonymity in your application, our cryptographic services team can help you navigate the inherent subtleties of these systems. Whether you need help with protocol design or auditing an existing codebase, our team can help you build something your users will be able to trust.

Grace Hopper Celebration (GHC) 2019 Recap

by Rachel Cipkins, Stevens Institute of Technology, Hoboken, NJ

A few weeks ago I had the inspiring experience of attending the annual Grace Hopper Celebration (GHC), the world’s largest gathering of women in technology. Over four days in Orlando, Florida, GHC hosted a slew of workshops and presentations, plus a massive career fair with over 450 vendors (by comparison, Black Hat USA had about 300 vendors this year). The conference attracted over 25,000 attendees from 83+ countries—a mix of women in technology as well as a significant population of male allies. And at a time when still only 25% of computing jobs are held by women, it was encouraging to see GHC garner vast coverage from the technology industry.

As an aspiring security professional, I was also pleased that GHC 2019 had extensive coverage of cybersecurity (at least 15 talks!), even though the conference was not dedicated to it. It was uplifting to represent women in computing from Stevens along with 20 of my peers (half engineering majors and half computer science majors). We took full advantage of the jam-packed conference program, which included highlights like:

  • Keynote speakers focused on the importance of giving women the courage to explore careers in technology.
  • The PitchHER competition, where female entrepreneurs leading early-stage startups compete for prize money.
  • Open Source day, where participants can contribute to open source projects with the goal of making a positive impact on the world.
  • Diversity group meetups for organizations such as Lesbians Who Tech and Black Girls Who Code.
  • 20 workshop tracks scheduled in between events such as Artificial Intelligence, Emerging Technology, and Open Source.

That awesome, unexpected cybersecurity focus

Naturally, I took in as many of the cybersecurity events as I could. I especially appreciated the featured academic presentations from three ACM Award Winning Research in Cybersecurity winners—I highly recommend digging into their research papers:

  • Aisha Ali-Gombe, Towson University, Leveraging Software Instrumentation for Android Security Assessment. Ali-Gombe outlined ways to maliciously use Android security flaws, such as methods for spying on users, privilege escalation, data exploitation, and botnet. She also described her process for combating these threats, which included developing DroidScraper and AspectDroid to eliminate malicious Android use without low-level modification of the OS/framework.
  • Alexandra Dmitrienko, Institute of Information Security, ETH Zurich, Secure Free-Floating Car Sharing For Offline Cars. Free-floating car sharing is efficient, cost effective, and reduces traffic and air pollution. However, it does not support offline cars, and modifications have to be made to the car in order for the service to work. Dmitrienko created a solution that uses RFID chips in off-the-shelf cars to overcome these issues. She also implemented two-factor authentication on mobile platforms for enhanced security.
  • Shagufta Mehnaz, Purdue University, A Fine-Grained Approach for Anomaly Detection in File System Accesses. Mehnaz’ research explored how access control mechanisms cannot always prevent authorized users from misusing or stealing sensitive data. She created fine-grained profiles of file system users, then used these profiles to detect anomalous access to file systems.

I attended most of the Security/Privacy track workshops, which covered all knowledge levels. One of my peers, a third-year doctoral student in cybersecurity research, agreed the workshops covered a wide range of skill levels and presented the problems well, but thought the time constraints (~1 hour) didn’t allow for enough background information to help everyone understand solutions. Still, it was a generous attempt to allow non-technical attendees to benefit as well.

THE place to find female tech talent

GHC is a well-known recruiting ground for women in technology; the conference is scheduled to coincide with the start of the recruiting season for new graduates and summer internships. The career fair spans the duration of the conference and includes companies as well as universities.

One of the most useful things GHC provides is a resume database companies and universities can use to recruit potential candidates before the conference starts. Recruiters can also rent space in the interview hall to host interviews at the conference, and larger companies tend to host private networking events at offsite locations. Almost everyone in my group received an offer from a company they interviewed with, or was contacted to schedule an interview with a company not holding interviews at the conference.

Nearly every company I talked to at the career fair perked up at the mention of cybersecurity. Despite the high demand for security expertise, though, there were few security sponsors. Notable security companies that were recruiting at the conference included MITRE, Red Balloon, and Crowdstrike, but only one of them conducted interviews on-site. Hopefully, we’ll see more security companies at GHC next year.

You want to be there

Attending the Grace Hopper Celebration was an empowering experience for me as a woman in technology and especially as a woman in security. The coverage of cybersecurity-related workshops and talks at GHC is definitely growing, and it has proven to be a great place to recruit female security talent.

It was also just an incredible and unique experience to spend the week surrounded by amazing women. Everyone I spoke to described the energy at the conference as “electric.” My peers and I were able to make new professional connections, learn about new technology trends, and bring back new ideas to the various women in STEM groups at Stevens. We left the conference with a sense of courage to continue to grow our careers in technology and to inspire other women to pursue the same path.

GHC 2020 will be held in Orlando, Florida, Sept. 29–Oct. 2, and it is definitely an event worth considering for both general attendees and sponsors.

Formal Analysis of the CBC Casper Consensus Algorithm with TLA+

by Anne Ouyang, Piedmont Hills High School, San Jose, CA

As a summer intern at Trail of Bits, I used the PlusCal and TLA+ formal specification languages to explore Ethereum’s CBC Casper consensus protocol and its Byzantine fault tolerance. This work was motivated by the article Peer Review: CBC Casper by Muneeb Ali, Jude Nelson, and Aaron Blankstein, which indicated that CBC Casper’s liveness properties impose stricter Byzantine fault thresholds than those suggested by the safety proof. To explore this, I specified the Casper the Friendly Binary consensus protocol in TLA+.

As expected, it was impossible to determine finality without a Byzantine fault threshold of less than one-third of the total weight of all validators, which is consistent with Lamport et al.’s original paper on the Byzantine Generals Problem. However, as long as that condition was satisfied, CBC Casper appeared to function as intended.

First, a Little Background

The CBC Casper (Correct By Construction) protocol is a PoS consensus protocol designed to one day be used for Ethereum. However, the current state of CBC Casper hasn’t undergone much formal analysis, and its under-specification poses a challenge to the introduction of Ethereum 2.0.

In a distributed network, individual nodes, called validators, use the Minimal CBC Casper family of consensus protocols to make consistent decisions. The protocol’s five parameters are the names and weights of validators, fault tolerant threshold, consensus values, and estimator. Validators make individual decisions based on their current state, which is defined in terms of the messages received, which have three components:

  • Sender: the name of a validator sending the message
  • Estimate: a subset of values in the consensus values set
  • Justification: a set of messages (state) received to arrive at the estimate

As a result, the sending and receiving of messages can be defined as state transitions.

Equivocation occurs when a validator sends a pair of messages that do not have each other in their justifications. The future states are all reachable states, where the equivocation fault is less than the fault tolerant threshold. Finality is defined by safety oracles, which detect when a certain property holds for all future states.

TLA+ and PlusCal

TLA+ is a formal specification language describing behaviors with states and state transitions. The specifications and state machine models can be checked with the TLC model checker. TLC performs a breadth-first search over the defined state transitions and checks for the properties that need to be satisfied.

The Byzantine Generals Problem

For context, The Byzantine Generals Problem is an analogy for decision-making in distributed systems in the presence of malicious individuals. The problem states that a commanding general must send an order to n-1 lieutenant generals such that 1) all of them obey the same order and 2) if the commanding general is loyal, then every loyal lieutenant obeys the order. A solution to the problem must ensure that all the loyal generals agree on the same plan, and a small number of traitors cannot lead the generals to a bad plan.

Now Let’s Dive into the Process

To start, I specified the definitions in the TLA+ language and defined the states and state transitions in terms of sets and set operations. Figure 1 shows a snippet of the specification.


Figure 1: Snippet of specification in TLA+

I specified the message relay in PlusCal, which is more pseudocode-like and can be automatically transpiled to TLA+.


Figure 2: The CBC Casper message relay specified in PlusCal.

The assumption is that all the messages are sent and received successfully without time delay. The specification does not include message authentication, because it is assumed that the validators can verify the authenticity and source of messages. In this implementation, equivocating validators behave such that they take different subsets of all received messages, use these subsets to obtain the estimates, and send different messages to different validators.

The TLC model checker checks that eventually a clique will be found where all the non-Byzantine nodes can mutually see each other agreeing with a certain estimate in messages, and cannot see each other disagreeing. When this condition is met, finality is achieved, and the model checker should terminate without giving errors, as shown in Figure 3.


Figure 3: TLC model checking results.

When finality cannot be reached, the model checker will detect that a temporal property has been violated, as shown in Figure 4.


Figure 4: TLC errors.

The temporal property checks for the existence of a clique of validators with a total weight greater than:

CodeCogsEqn (2)

All the validators in the clique satisfy the following:

  • None of the validators are Byzantine-faulty. They do not send equivocating messages.
  • All the validators can mutually see each other agreeing. Each validator has exactly one latest message, and they have the same estimate.
  • None of the validators can mutually see each other disagreeing. A validator has a latest message in the other’s justification, but has a new latest message that doesn’t have the same estimate as the latest message of the other validator.

In practice, the failure to achieve finality would mean the blockchain stops producing new blocks, since there is no guarantee that a transaction (i.e. an estimate) will not be reversed in the future.


We find that when the fault-tolerant threshold is set to greater than one-third of the total weight of all the validators, e-cliques cannot be formed. This means finality can be reached when the fault-tolerant threshold is less than one-third of the total weight. This is consistent with “The Byzantine Generals Problem,” which states that the problem “is solvable if and only if more than two-thirds of the generals are loyal.” Intuitively, for the protocol using the clique oracle, a higher fault-tolerant threshold would cause there to be too few validators to form a clique.

While the CBC Casper specification provides a proof of safety, it does not address liveness properties. For example, the CBC Casper blockchain protocol may encounter the problem in which no more blocks can be finalized. Further work is needed in specifying liveness, because finality is a liveness problem and is necessary for a switch to a PoS method. I found no liveness faults, but only tested binary consensus with a very small number of validators. Liveness faults may exist in more sophisticated instantiations of CBC Casper.

Some Thoughts on Formal Verification and TLA+

Developing an abstract mathematical model and systematically exploring the possible states is an interesting and important way to check the correctness of algorithms. However, I encountered the following challenges:

  • Few examples of implementation details of the TLA+ language and good practices for formal verification. Therefore, writing specifications can involve a lot of trial and error.
  • Unhelpful error messages generated by the TLC model checker when something fails to compile. The error messages are vague and do not pinpoint a specific location where the error occurs. In addition, the TLA+ toolbox is a Java application, so the error messages are often Java exceptions. Figuring out what’s wrong with the TLA+ specification, given the Java exceptions, is difficult.
  • Limited documentation of formal verification methods. Googling a question specific to TLA+ yields very few results. As of [date], there were only 39 questions on Stack Overflow with “TLA+” as a tag.


Working at Trail of Bits as an intern this summer was an amazing experience for me. I learned a lot about distributed systems and formal verification and greatly enjoyed the topics. I am glad to have experienced working in security research, and I am motivated to explore more when I go to college.

Watch Your Language: Our First Vyper Audit

A lot of companies are working on Ethereum smart contracts, yet writing secure contracts remains a difficult task. You still have to avoid common pitfalls, compiler issues, and constantly check your code for recently discovered risks. A recurrent source of vulnerabilities comes from the early state of the programming languages available. Most developers are using Solidity, which is infamous for its numerous unsafe behaviors. Now Vyper, a Python-like language, aims to provide a safer language. And since community interest in Vyper is growing, we had to review Vyper contracts on a recent audit with Computable.

Overall, Vyper is a promising language that:

  • Includes built-in security checks,
  • Increases code readability, and
  • Makes code review simpler.

However, Vyper’s age is showing; our review confirmed that this young language will benefit from more testing and tools. For instance, we found a bug in the compiler, which indicates a lack of in-depth testing. Also, Vyper does not yet benefit from the third-party tool integrations that Solidity does, but we’re on the case: We recently added Vyper support to crytic-compile, allowing Manticore and Echidna to work on the Vyper contracts, and the Slither integration is in progress. For now, you can check out the details of our Vyper audit and our recommendations below.

The Good

Integer checks are built-in

Vyper comes with built-in integer overflow checks, and will revert if one is detected. Since integer overflows are frequently at the root of vulnerabilities, overflow protection by default is definitely a good step towards safer contracts. And with this protection, you don’t need to use libraries like SafeMath anymore.

The main caveat here, though, is the higher gas cost. For example, the compiler will add two SLOAD for the following code:

Screen Shot 2019-10-23 at 5.07.03 PM

Figure 1: Integer overflow check.


Figure 2: evm_cfg_builder result for the Figure 1 example.

Nevertheless, overflow protection by default is still the best strategy. In the future, Vyper could reduce the gas cost through optimizations (e.g., removing two SLOADs from the example above), or by adding unsafe types in the language for developers with specific needs.

Unsafe functionality is restricted

Vyper comes with a lot of restrictions compared to Solidity, including:

  • No inheritance
  • No recursive code
  • No infinite length loop
  • No dynamically sized array
  • No assembly code
  • Inability to import logic from another file
  • Inability to create one contract from another

Although these restrictions might seem excessive, most contracts can be implemented while still following these rules.

Solidity allows multiple inheritance, which is frequently overused by developers. We saw many codebases with an overly complex inheritance graph, which made the code review much harder than it should be. In fact, contracts are so difficult to track with multiple inheritance, we had to build a dedicated printer to output the inheritance graph in Slither. Preventing multiple inheritance will force developers to create better designs.

Solidity also allows assembly code, which is frequently used to compensate for inadequate compiler optimizations. When it’s impossible to write these optimizations at the developer level, there’s more pressure on the Vyper compiler team to write good compiler optimizations. This is not a bad thing—optimization should rely on the compiler, not the developers.

Overall, one-third of Slither’s detectors are not needed when using Vyper, thanks to Vyper’s many language restrictions. Vyper-specific detectors can be written, but the simplicity of the language tends to make it safer than Solidity by design.

The Not-So-Good

Vyper has not been tested or reviewed enough

Vyper’s Readme warns its users:


As a result, the compiler is likely to have bugs, and the language’s syntax and semantics might change. Vyper’s users must be careful, follow its development closely, and review the generated EVM bytecode.

For example, until 0.1.0b12, public functions were callable from the contract itself, which created a security risk due to the way Vyper handles msg.sender and msg.value. Since 0.1.0b12, all public functions are the equivalent of external functions in Solidity, removing the risk of this vulnerability.

The compiler bug we found shows that the compiler would benefit from more testing (see details below). It would not be a surprise to see previous solc bugs present in Vyper. For example, the following bugs were either recently fixed or are still present:

Some restrictions are cumbersome

While many of Vyper’s restrictions are good steps toward safer code, some may create problems.

For instance, the total absence of inheritance makes it more difficult to test the code. The creation of mock contracts, or the addition of properties for testing with Echidna, require copying and pasting the code—an error-prone process. Although multiple inheritance is frequently abused by developers, it won’t hurt to allow simple inheritance to facilitate testing.

Like the lack of inheritance, the absence of contract creation is also inconvenient— it increases the complexity of mock contracts, unit tests, and automated testing.

Finally, each contract has to be written in a separate file and import has a partial support. If contract A calls contract B, A needs to know B’s interface. It is then the developer’s responsibility to copy and paste the latest interface version. If B is updated, but its interface in A is not, A will be buggy and error-prone in handling the contract’s dependencies. To prevent these types of vulnerabilities, we built slither-dependencies, a tool that will check the correct interfaces in the codebase.

Our Solutions

Compiler bug: Function collision

Vyper follows the function dispatcher standard used by Solidity: To call a function, the first four bytes of the keccack of the function signature will be used as an identifier. A so-called dispatcher takes care to match the identifier with the correct code to execute. In Figure 3, the dispatcher checks for two different function id:

  • 0x0e8927fbc (pushed at 0x94): increase()
  • 0x61bc221a (pushed at 0xcb): counter()


Figure 3: Dispatcher of the example in Figure 1.

This strategy has a shortcoming: Four bytes is small, and collisions are possible. For example, both gsf() and tgeo() will lead to an id of 0x67e43e43. Figure 4 shows the dispatcher generated with vyper 0.1.0b10:


Figure 4: Function id collision.

As a result, calling tgeo() will execute gsf() code, and tgeo() will never be executable. This issue creates the perfect conditions for backdoored contracts. We reported this bug to the Vyper team and it was fixed in July. Their initial fix did not consider the corner case of a collision with the fallback function, but this is also properly fixed now.

Finally, we implemented a detector in Slither that will catch this bug. Use Slither if you are concerned about interacting with Vyper contracts.

Crytic tools integration

Vyper is now natively supported by most of our tools (including Manticore, Echidna and evm-cfg-builder) as of crytic-compile 0.1.3.


Manticore is a symbolic execution framework that lets you prove assertions in your code. It works at the EVM level, which is necessary to avoid potential compiler bugs. For example, the following token has a bug that will give free tokens to anyone requesting fewer than 10 tokens:

Screen Shot 2019-10-23 at 6.24.48 PM

Figure 5: Vyper buggy token.

The following Manticore script will detect this issue:

Screen Shot 2019-10-23 at 6.26.41 PM

Figure 6: Manticore example working with Vyper.

The script will generate a transaction showing inputs leading to the bug:

Function call:
buy(1) -> STOP (*)

Developers can then integrate the script to their CI to detect the bug, and prove its absence once it has been fixed.

Check out Computable Manticore scripts for more examples.


Echidna is a property-testing fuzzer: It tries different combinations of inputs until it succeeds in breaking a given property. Like Manticore, it works at the EVM level. In the following example, Echidna tries a combination of calls until the echidna_test function returns false:


Figure 7: Echidna example.

When running Echidna on the example in Figure 7, the result is:


Figure 8: Echidna running on Vyper code.

Similar to Manticore, Echidna can be integrated with CI to detect bugs during development. Keep an eye on for an easy solution for using Echidna.


We are working to support Vyper in our static analyzer. Slither is already capable of:

  1. Detecting if code is vulnerable to the collision id compiler bug we discovered.
  2. Detecting if there is an incorrect external contract definition (via slither-dependencies).

Once the Vyper support is complete, Vyper contracts will benefit from our intermediate representation (SlithIR), and have access to all the vulnerability detectors and code analyses already present in our framework.


Vyper is a good step towards a better smart contract language. We loved its simplicity and its focus on security. However, the language is a bit too young to recommend for production. If you want to use Vyper, we highly recommend using Manticore and Echidna to check the EVM code, and to follow along Slither’s development.

Already loving Vyper and want to secure your code? Contact us!

Multi-Party Computation on Machine Learning

During my internship this summer, I built a multi-party computation (MPC) tool that implements a 3-party computation protocol for perceptron and support vector machine (SVM) algorithms.

MPC enables multiple parties to perform analyses on private datasets without sharing them with each other. I developed a technique that lets three parties obtain the results of machine learning across non-public datasets. It is now possible to perform data analytics on private datasets that was previously impossible due to data privacy constraints.

MPC Primer

For MPC protocols, a group of parties, each with their own set of secret data, xi, share an input function, f, and each is able to obtain the output of f(x1,…,xn) without learning the private data of other parties.

One of the first demonstrations of MPC was Yao’s Millionaire Problem, where two people want to determine who has more money without revealing the specific amounts they each have. Each person has a private bank account balance, xi, and they want to compute f(x1,x2) = x1 > x2. Yao demonstrated how to securely compute this, along with any other function, using garbled circuits.

Each person may infer some information about the other’s data via the MPC function’s output. The leaked information is limited to whether the other party’s amount is greater or less than their own; the exact amount is never disclosed.

Boolean and arithmetic circuits are used by MPC protocols to compute arbitrary functions (see Figure 1). Any computable function can be represented as a combination of Boolean (AND/OR) and/or arithmetic (ADD/MULT) gates. Thus, a secure MPC protocol is established through a series of secure MPC addition and multiplication protocols. In practice, we want to convert a function into a circuit and then run MPC protocols on that circuit.

MPC protocols must also operate in a finite field, so many MPC tools are designed to operate over the integers modulo a prime p (a finite field). This makes machine learning applications difficult and, at times, incompatible, because they require fixed-point arithmetic. Almost all of these tools require a synthesis tool to convert a program (typically a C program) into its corresponding arithmetic circuit representation.

Here are several challenges that make MPC machine learning hard.

  • Fixed-point arithmetic and truncation protocols are not well-suited for the finite fields that MPC protocols operate in.
  • The circuit size can be as large as the input data. When functions are evaluated as circuits, all input values must be loaded into the circuit. The circuit will have input gates for every input value, which makes the input layer of the circuit as large as the dataset.
  • Storing the circuit and intermediate values requires a significant amount of memory.
  • Machine learning introduces loops whose exit conditions depend on intermediate values; however, each party must share its circuit with the other parties at the start of the protocol. They can’t depend on intermediate values.

The goal of my project was to create a novel MPC tool that can train machine learning models from secret data held by multiple parties. We wanted to implement a more-than-two-party computation protocol that included fixed-point arithmetic for arbitrary precision for machine learning. We also wanted to provide an alternative to circuit synthesis as a way of performing MPC with complex functions.

Figure 1: (left) A simple arithmetic circuit. (middle) A simple boolean circuit. (right) The roadmap for performing multi-party computation.

Secure MPC Protocols

For three parties, each party will split their data into “secret shares” and send them to the others. Secret shares must only reveal secret value information when they are all combined, otherwise all secret values must remain hidden. This property is formally defined as privacy, and is illustrated by the following:

Let n=3 parties. The parties agree to use the integers mod p, Zp, with p=43, a random small prime. Party 1 has x1=11 and wants to create secret shares for this value. Party 1 will then generate two random values in Zp, say r2=21 and r3 = 36, and send r2 to Party 2 and r3 to Party 3. Each party then stores the following values as their secret shares:

share1 = (x1r2r3) mod 43 = 40 mod 43

share2 = r2 = 21 mod 43  

share3 = r3 = 36 mod 43

Notice that any one or combination of two secret shares reveal information about x1. This is where the strength of secret sharing lies.

After creating secret shares for all their private data, the parties can begin to perform secure addition or multiplication on their inputs. The protocols for both operations depend on the method of secret sharing used. I will demonstrate how we can add and multiply for the specific secret sharing scheme used above.

Let’s build the secure addition and multiplication primitives for MPC.


Consider the case of n=3, using field Zp and p=43. Here, we will take the three input values to be x1 = 11, x2 = 12, x3 = 13. Each party generates and distributes random values to obtain the following secret shares. Again, notice that:

(share1,i + share2,i + share3,i) mod 43 = xi mod 43:

Party 1 shares:  share1,1 = 40   share1,2 = 13   share1,3 = 8

Party 2 shares:  share2,1 = 21   share2,2 = 2 share2,3 = 17

Party 3 shares:  share3,1 = 36   share3,2 = 40   share3,3 = 31

To securely add the values x1, x2, and x3, the three parties simply add their shares of each of those values.

Party 1: share+,1 = share1,1 + share1,2 + share1,3 = 40 + 13 + 8 = 18 (mod 43)

Party 2: share+,2 = share2,1 + share2,2 + share2,3 = 21 + 2 + 17 = 40 (mod 43)

Party 3: share+,3 = share3,1 + share3,2 + share3,3 = 36 + 40 + 31 = 21 (mod 43)

That’s it! Now, each party has a secret share for the value:

x1 + x2 + x3 = 11 + 12 + 13 = (18 + 40 + 21) mod 43 = 36.

Scalar Addition and Multiplication

Scalar addition and multiplication refer to adding/multiplying a secret value by a public value. Suppose each party has secret shares of some value z and they want to obtain shares of say 5 + z or 5z. This actually turns out to be easy as well. To perform addition, Party 1 adds 5 to their share and the rest of the parties do nothing, and now they all have secret shares of 5 + z. To perform multiplication, all of the parties simply multiply their shares of z by 5, and they have all obtained secret shares of 5z. You can easily check that these hold in general for any public integer.


Secure multiplication is more complicated than addition, because all of the parties must interact. This slows down performance when computing non-trivial functions that require several multiplications, especially if the parties are operating on a high-latency network. Therefore, the goal of most of these schemes is to minimize the amount of communication required to securely multiply.

Using Beaver Triples is a well-known method for multiplication that breaks the operation into a series of secure addition, scalar addition, and scalar multiplication. In more concrete terms, this method takes advantage of the following property:

Secret shares of Beaver Triples are created along with secret shares of the input values. Beaver Triples are two random values, a and b, along with their product, ab = c. Secret shares for each of these values are sent to the other parties, so each party has secret shares for x1, x2, a, b, and c.

To compute x1x2, every party first computes shares of (x1 – a) and (x2 – b) to be broadcast to the other parties. The share of (x1 – a) are obtained by scalar multiplying by -1 and adding the result to x1. The same is done for (x2 – b). Once those shares are computed and broadcast,  everyone can now compute yz. In the final step, parties obtain shares for yb and az through scalar multiplication, leaving everyone with the value yz and the secret shares for yb, az, and c.

Secure addition is used on the shares for yb, az, and c, in order to obtain secret shares of (yb + az + c). Then, they perform a scalar addition with yz and their shares of (yb + az + c). Now, the shares for (yz + yb + az + c) = x1 ✕ x2 are obtained, and the protocol is completed.


A major bottleneck in MPC is the communication between parties due to multiplication, which cannot typically be done in parallel. To optimize, the choices are to either reduce the communication exchanges and/or reduce the number of multiplications needed for a function. My work focused on the latter.

Integer comparison is difficult for arithmetic circuits, and if the transformation from function to circuit isn’t clever enough, comparisons may be very inefficient. But in fact, all arithmetic circuits can be represented by a polynomial in the input values.

Let’s consider the polynomial representing f(x1,x2) = x1 < x2, where the number of zeros is half the input space and proportional to the field size. Since this function is not constant zero, the polynomial degree is proportional to the number of zeroes. So the arithmetic circuit that is computing this has as many multiplication gates as the size of the field. Incredibly inefficient!But we can actually obtain an exponentially smaller arithmetic circuit by breaking everything down into building blocks using the methodologies from Octavian Catrina and Sebastiaan de Hoogh’s paper. The major building block needed is an efficient protocol for truncation, and they develop this via a modular remainder protocol that is built from other sub-protocols. What’s important is that instead of circuits needing to be proportional to the field itself, they allow us to use circuits proportional to the bit-size of the field, making them exponentially smaller.


I chose to implement the MPC protocol from Araki, et. al., including their methods for secret sharing, addition, and multiplication. I also developed the specialized secure comparison and truncation protocols, as well as a dot product protocol. Combining complex protocols with addition and multiplication is known as the hybrid model of secure computation. This hybrid model gives us really compact, intuitive circuits that can even be written by hand! (see figures 2 and 3)

I applied this MPC protocol to both perceptron and SVM algorithms, which requires converting them into the building block operations. The SVM algorithm can be manually synthesized into a compact, hybrid circuit using complex building blocks of comparison and dot product operations. Specifically, one iteration of each algorithm only requires around 15 gates. For comparison, synthesizing AES using a synthesis tool will result in thousands of addition and multiplication gates.

Figure 2: (left) Pseudocode for the perceptron algorithm. (right) Hybrid circuit representing one iteration of the perceptron algorithm.

To avoid massive circuits, proportional to the dataset, I only synthesized circuits for one iteration of each protocol. This avoided having to deal with loops and yielded a very compact circuit that can be reused for any dataset. Lastly, using one iteration gives flexibility on how to handle the convergence of our protocol. This approach provides three options:

  • the parties agree on a fixed number of iterations beforehand,
  • the parties reveal an epsilon value publicly, or
  • the parties compute a convergence check according to another MPC.

Figure 3: (left) Pseudocode for the support vector machine (using partial gradient descent). (right) Hybrid circuit representing one iteration of a support vector machine.

Once the hybrid circuits were manually synthesized, I performed the algorithms on test datasets and achieved arbitrary fixed-point precision. The classification accuracy was also the same as the raw algorithms. Therefore, this tool can be used to train secret data using either perceptron or SVMs with arbitrary precision.

Concluding Thoughts

I spent the summer tackling several difficulties related to MPC with machine learning. By the end of the summer, I had an efficient solution to each of these problems and was able to run two different machine learning algorithms securely across three parties. I enjoyed working on this project. Before interning at Trail of Bits, I had just completed the second year of my Ph.D. I initially thought the transition from school to industry would be drastic. But I quickly noticed that in this internship, my project would be very similar to Ph.D. research, which is one of the many things that makes Trail of Bits such a great place to work.

TSC Frequency For All: Better Profiling and Benchmarking

Have you ever tried using LLVM’s X-Ray profiling tools to make some flame graphs, but gotten obscure errors like:

==65892==Unable to determine CPU frequency for TSC accounting.

==65892==Unable to determine CPU frequency.

Or worse, have you profiled every function in an application, only to find the sum of all function runtimes accounted for ~15 minutes of a 20-minute run? Where did those five minutes go!?

Well, we’ve run into both situations, so we built a solution in the form of a Linux kernel module called tsc_freq_khz. We’ll take a quick but deep dive into x86 timers to explain the problem, but first, let’s get to the goods. The tsc_freq_khz module enhances the performance of profiling and benchmarking tools like X-Ray in virtualized environments. No more “Unable to determine CPU frequency” errors!

The tsc_freq_khz module also makes these tools more accurate on newer Intel processors. X-Ray uses the x86 Time Stamp Counter (TSC), a low-latency monotonically increasing clock, to measure event duration, and assumes that TSC frequency is equivalent to maximum clock speed. This assumption is wrong on newer CPUs, and leads to improper time measurement. The tsc_freq_khz module provides a way to read the actual TSC frequency. No more missing minutes in your profiling data!

The tsc_freq_khz module works by exporting the Linux kernel’s internal value of the x86 Time Stamp Counter (TSC) frequency, tsc_khz, to userspace via sysfs. Programs can query this value by reading /sys/devices/system/cpu/cpu0/tsc_freq_khz.

Several open-source projects, like X-Ray and Abseil, already check for the presence of tsc_freq_khz, but until now, that file was only present on Google’s production kernels. The tsc_freq_khz module enables TSC frequency export to userspace for everyone else.

The trouble with timestamps

Before we explain what we did and why this works, let’s do a quick introduction to timestamps on the x86 architecture.

An x86 machine has at least six different ways to measure time:

Each method has unique and subtle flaws that make it completely unworkable for certain applications. This cornucopia of timers is what happens when you maintain 30 years of backwards compatibility and never remove a feature because someone, somewhere, is depending on it.

Despite its many flaws, the TSC is useful for benchmarking and profiling. It has extremely low latency, because the circuitry is directly on the CPU, and it is accessible directly from user-mode applications. Very useful profiling tools, like X-Ray, rely on the TSC for accurate measurements.

However, TSC measures time in ticks, which are not comparable across different processors and are therefore largely meaningless. For profiling and benchmarking, we want to measure time in comparable units, like nanoseconds.

From ticks to nanoseconds

So how many nanoseconds are in a tick? The basic formula for converting some duration of ticks to nanoseconds is simple:

time_in_nanoseconds = (tsc_count_end - tsc_count_start) * tsc_frequency

Unfortunately, determining TSC frequency is difficult, and in some cases, e.g., cloud-based or other virtualized environments, impossible. Until now.

The core issue is that the Linux kernel does not provide a way for applications to know the TSC frequency, although the Linux perf utility does attempt to calculate TSC frequency via Intel PT. There are reasonable arguments for not exposing the value directly, because TSC frequency is fairly obscure, and there are cases where the value is completely meaningless. Until recently, the processor’s maximum clock speed was also an accurate approximation of TSC frequency.

However, this is no longer true. Using maximum clock speed as the TSC frequency will give the wrong results on newer Intel CPUs. This is the cause of the missing minutes when profiling.

Additionally, the maximum clock speed, accessible in Linux via /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_max_freq, is not available in cloud-based or other virtualized environments. This is the cause of “Unable to determine CPU Frequency” errors.

cpuinfo_max_freq is populated by the cpufreq driver used for frequency scaling—that is, making the CPU run faster or slower depending on power-saving settings. Naturally, frequency scaling is not typically permitted in virtualized environments, because each physical CPU is shared with multiple virtual tenants. Hence, this value is not present.

A hint from Google

Timestamp measurement code in X-Ray refers to a mystery sysfs entry named /sys/devices/system/cpu/cpu0/tsc_freq_khz, which sounds like it provides exactly what we need: TSC frequency in kilohertz. Unfortunately, there are absolutely no references to that file in the Linux kernel source code. What’s going on?

More searching reveals the following hint in the Abseil source code:

  // Google's production kernel has a patch to export the TSC
  // frequency through sysfs. If the kernel is exporting the TSC
  // frequency use that. There are issues where cpuinfo_max_freq
  // cannot be relied on because the BIOS may be exporting an invalid
  // p-state (on x86) or p-states may be used to put the processor in
  // a new mode (turbo mode). Essentially, those frequencies cannot
  // always be relied upon. The same reasons apply to /proc/cpuinfo as
  // well.
  if (ReadLongFromFile("/sys/devices/system/cpu/cpu0/tsc_freq_khz", &freq)) {
    return freq * 1e3;  // Value is kHz.

Jackpot! The comment tells us that:

  • Google runs a custom Linux kernel in production.
  • Google is aware that cpuinfo_max_freq should not be used for benchmarking.
  • The kernel’s calculation of TSC frequency is sane, and
  • Google’s internal kernels have a patch to export TSC frequency via sysfs, but the patch has not been upstreamed to the main kernel tree.

TSC frequency For all!

Why should only Google kernels have access to the TSC frequency? Since the Linux kernel is open source, we can write our own kernel module that does the same thing. Fortunately, the Linux kernel already computes the TSC frequency during boot and stores it in the tsc_khz variable. All we had to do was export the variable via sysfs.

Our kernel module, tsc_freq_khz, does exactly this. It creates a sysfs entry that reads the tsc_khz variable defined by the kernel and exports it via /sys/devices/system/cpu/cpu0/tsc_freq_khz. The module is extremely simple but also quite useful.

Follow the build instructions on Github and test the module by inserting it into your kernel:

$ sudo insmod ./tsc_freq_khz.ko
$ dmesg | grep tsc_freq_khz
[14045.345025] tsc_freq_khz: starting driver
[14045.345026] tsc_freq_khz: registering with sysfs
[14045.345028] tsc_freq_khz: successfully registered

The file at /sys/devices/system/cpu/cpu0/tsc_freq_khz should now be populated. (The values on your system will be different.)

$ cat /sys/devices/system/cpu/cpu0/tsc_freq_khz

Warning: Please do not use this code in a real production system. While it shouldn’t cause problems, it does make some assumptions, e.g., that CPU0 exists. There are also no sanity checks to warn if the TSC value is nonsensical or unreliable.


Seemingly simple problems like this one can lead us down a fascinating rabbit hole. In time, Google’s internal patches may make it into the Linux kernel, or kernel developers may create their own official patch to export TSC frequency to userspace. Meanwhile, tsc_freq_khz can be a stopgap for those who want accurate measurements with LLVM’s excellent X-Ray profiling tools.

Enjoyed reading this? Well, we thrive on interesting issues and challenging problems. We’d love to work with you to solve your security or software engineering challenges — please contact us.

Tethered jailbreaks are back

Earlier today, a new iPhone Boot ROM exploit, checkm8 (or Apollo or Moonshine), was published on GitHub by axi0mX, affecting the iPhone 4S through the iPhone X. The vulnerability was patched in devices with A12 and A13 CPUs. As of this writing, the iPhone XS, XS Max, XR, 11, 11 Pro and 11 Pro Max are all safe from this exploit.

We strongly urge all journalists, activists, and politicians to upgrade to an iPhone that was released in the past two years with an A12 or higher CPU. All other devices, including models that are still sold — like the iPhone 8, are vulnerable to this exploit. Regardless of your device, we also recommend an alphanumeric passcode, rather than a 6-digit numeric passcode. A strong alphanumeric passcode will protect the data on your phone from this and similar attacks.

It’s been a long time since the release of a public Boot ROM exploit. The last one was discovered in 2010 by George Hotz (geohot), and it affected the iPhone 3GS and iPhone 4.

What was released

checkm8 exploits the Boot ROM to allow anyone with physical control of a phone to run arbitrary code. The Boot ROM, also called the Secure ROM, is the first code that executes when an iPhone is powered on and cannot be changed, because it’s “burned in” to the iPhone’s hardware. The Boot ROM initializes the system and eventually passes control to the kernel. It’s the root of trust for the trusted boot chain of iOS and verifies the integrity of the next stage of the boot process before passing execution control.

Screen Shot 2019-09-27 at 12.26.05 PM

The Boot ROM is the powerhouse of the cell root of trust of the secure boot chain (source: Apple iOS Security Guide)

checkm8 does not include code required to boot a jailbroken kernel, but it does include the first steps. Likely, the community will soon release a full-tethered jailbreak that will slowly evolve to support all devices and versions.

The exploit also includes the ability to enable debugging features like JTAG on the iPhone CPU—a big win for security researchers and jailbreakers. Apple refers to this as “demoting” the phone in their 2016 BlackHat presentation. This is probably not what Apple intended when they said they were releasing “research devices.”

Impact and use cases

The arbitrary (i.e., not signed by Apple) code that can be executed during the boot process can be used to boot already jailbroken kernels, as was done in ~2011 with redsn0w. This early access also provides access to the AES engine, enabling decryption with the GID key of Apple-encrypted firmware like iBoot.

In 2013, we used the previous Boot ROM exploit to get access to user data by brute-forcing passcodes. In the years since, Apple introduced the Secure Enclave (SEP), a separate processor that manages encryption keys for user data. The SEP has been hardened against passcode brute-forcing using replay counters and backoff timers. It’s currently unclear how much access this new exploit provides to the SEP, so we can’t accurately judge the impact to device privacy. In Apple’s 2016 presentation, they point out that demoting the device forces the SEP to change the UID key, which is entangled with the passcode. Changing the UID key has the effect of permanently protecting user data on the device by making it impossible to decrypt.

When designing the SEP, Apple’s threat model included “adversarial” situations such as another Boot ROM exploit.

When designing the SEP, Apple’s threat model included “adversarial” situations such as another Boot ROM exploit.

It would not be surprising if the vulnerability exploited by checkm8 has been used by Cellebrite for forensic analysis, but they would need another trick or exploit to undermine the SEP’s protections.


The vulnerability has been fixed on new hardware (iPhone XS, iPhone XR, iPhone 11), but Apple still sells hardware vulnerable to this exploit (iPhone 8 and also some iPads and iPods). They would need to create new CPU masks and release updated versions of these devices to secure them. This is unlikely, since Apple did not release a patched iPhone 4 when the previous Boot ROM exploit was released.

It is likely we will see one large community project that will apply common patches and install Cydia and other alternate App Stores. This is a boon to researchers but also to pirates.

Advice for at-risk users

Upgrade to an iPhone that was released in the past two years with an A12 or higher CPU. All other devices are vulnerable to this exploit. After upgrading, wipe your previous phone by selecting “Erase All Content and Settings” from Settings > General > Reset. Devices patched for this issue include:

  • iPhone 11, 11 Pro, 11 Pro Max
  • iPhone XS, XS Max
  • iPhone XR

Set an alphanumeric passcode, rather than a 6-digit numeric passcode. Even with this exploit, an attacker must brute-force your password to gain access to your data. A strong alphanumeric passcode will protect your data against these attacks. To configure an alphanumeric passcode:

  1. On iPhone X and later, go to Settings > Face ID & Passcode.
    On earlier iPhone models, go to Touch ID & Passcode. On devices without Touch ID, go to Settings > Passcode.
  2. Tap Change passcode.
  3. Enter your current passcode.
  4. When prompted to enter a new passcode, tap Passcode Options and select “Custom Alphanumeric Code”

Detecting these jailbreaks

It will still be possible to detect jailbroken phones in the common case, but it is not possible to detect if an iPhone has been exploited with checkm8. Jailbreak detection will have to continue to rely on identifying side-effects of the exploitation. iVerify Core continues to offer a comprehensive jailbreak detection suite, and we will continue to update it as new jailbreaks are released. Contact us to learn more about iVerify Core.