Two New Tools that Tame the Treachery of Files

Parsing is hard, even when a file format is well specified. But when the specification is ambiguous, it leads to unintended and strange parser and interpreter behaviors that make file formats susceptible to security vulnerabilities. What if we could automatically generate a “safe” subset of any file format, along with an associated, verified parser? That’s our collective goal in Dr. Sergey Bratus’s DARPA SafeDocs program.

But wait—why is parsing hard in the first place? Design decisions like embedded scripting languages, complex context-sensitive grammars, and object models that allow arbitrary dependencies between objects may have looked like good ways to enrich a format, but they increase the attack surface of a parser, leading to forgotten or bypassed security checks, denial of service, privacy leakage, information hiding, and even hidden malicious payloads.

Two examples of this problem are polyglots and schizophrenic files. Polyglots are files that can be validly interpreted as two different formats. Have you ever read a PDF file and then been astonished to discover that it is also a valid ZIP file? Or edited an HTML file only to discover that it is also a Ruby script? Congratulations, you discovered a polyglot. This is not to be confused with schizophrenic files: That’s when two parsers interpret the same file in different ways, e.g., your PDF displays different content depending on whether you opened it in Adobe Acrobat or Foxit Reader, or your HTML page renders differently between Chrome and Internet Explorer.

We’ve developed two new tools that take the pain out of parsing and make file formats safer:

  1. PolyFile: A polyglot-aware file identification utility with manually instrumented parsers that can semantically label the bytes of a file hierarchically; and
  2. PolyTracker: An automated instrumentation framework that efficiently tracks input file taint through the execution of a program.

Collectively, the tools enable Automated Lexical Annotation and Navigation of Parsers, a backronym devised solely for the purpose of referring to them as The ALAN Parsers Project.

Before we get into their details, let’s first talk about why these tools are necessary.

Ceci N’est Pas Un PDF

Please rise and open your hymnals to page 541 for the recitation of chapter 7 verse 6:

…a file has no intrinsic meaning. The meaning of a file—its type, its validity, its contents—can be different for each parser or interpreter.

Ange Albertini

You may be seated.

This talk by Trail of Bits researcher Evan Sultanik gives a number of examples of how polyglots and induced schizophrenia are more than just nifty parlor tricks. For example:

A PDF can even be a valid git repository that, when cloned, contains the LaTeX source code to generate the PDF and a copy of itself. Ange Albertini also has an excellent series of videos introducing funky file tricks.

What does it take to understand a popular file format that has
been accreting features (and misfeatures) over 20 years? PDF
provides just such a challenge.

  • An embedded Turing complete programming language? 👌
  • Arbitrary chaining of stream decoders that allow for both memory and computational denial of service? 👌
  • Multiple, redundant, and potentially conflicting ways to specify the length of a stream object? 👌
  • Arbitrary data allowed both before and after the file? 👌
  • Numerous ways to steganographically embed data, including arbitrary length binary blobs? 👌
  • A graph-based document object model that allows, and in some cases requires, cycles? 👌
  • A multi-decade history with ambiguous or incomplete specifications resulting in dozens of conflicting implementations, some of which emit non-compliant, malformed documents? 👌
  • The necessity for parser implementations to be resilient to malformations but also free to handle them differently? 👌
  • A specification that has ​got in the way of creating of a formally verified parser in Coq because Coq could not prove that a parser trying to do its best on checking indirect references would, in fact, terminate on maliciously crafted files​? 👌

Challenge accepted! To be fair, PDF is leading the way on defining simpler, reduced subsets of the format. PDF/A, designed to make sure PDF documents remain parseable for long-term preservation, has removed some of these problematic features. Moreover, they are by no means specific to PDF: they are endemic to document formats in general. For example, Microsoft Office’s OOXML has not done better, with severe external entity attacks that have been employed in the wild, not to mention XML-based vulnerabilities like the billion laughs attack. Even parsing JSON is harder than one might think, as is plain old UTF-8.

But surely in the land of Programming Languages, at least, all must be well, since their parsers are automatically generated from unambiguous specifications by classic algorithms and proven tools. Not so much: This Empire Hacking talk gives examples of how a poorly designed language can cause parsing problems even when no malice is involved. One does not simply walk into the shunting yard!

But back to data formats. In view of the challenges above, instead of focusing on the specification, we examine the de facto interpretations of the specification: Parser implementations. Our underlying hypothesis is that the “unsafe” portions of a file format will exist in the symmetric difference of the parsers’ accepted grammars. The portions of the file format to keep are the ones accepted and interpreted equivalently across all implementations.

PolyFile: Ground Truth Labeling of File Semantics

File identification utilities are, by and large, dumb in the sense that they simply compare the file against magic byte signatures of various formats. Moreover, these tools terminate once they find the first match, and do not recursively identify embedded file types or files that do not start at byte offset zero. Once a file is classified, there is typically little to no information about the contents of the file. It’s a PDF, but how many objects does it contain? It’s a ZIP, but what are its file contents?

Our new PolyFile project resolves these issues and provides:

  • Identification of any and all files embedded within the input, not necessarily starting at byte offset zero;
  • File formats for which an instrumented parser is available should be fully parsed, emitting a hierarchical semantic mapping of the input’s contents;
  • An interactive file explorer for a human to examine its contents and structure; and
  • Computer-readable output that can be used to assign semantic meaning to each byte of the input file (e.g., byte x corresponds to the first byte in a PDF stream object, and the start of a JPEG/JFIF header).

A fairly ideal file identification utility, n’est ce pas?

Ange Albertini’s SBuD project comes close in spirit, but currently only supports a couple image formats. Even the popular Unix file command only has support for several hundred file signatures. In contrast, PolyFile has support for over 10,000 file formats, and can recursively identify them in a file, emitting a hierarchical mapping as an extension of the SBuD JSON format. It also has support for semantically labeling files based on Kaitai Struct declarative file format specifications.

Additionally, PolyFile can optionally emit a self-contained HTML file with an interactive hex viewer and semantic labeling explorer. Here is an example of the HTML output from the résumé Evan Sultanik submitted to Trail of Bits. In addition to being a PDF that displays its own MD5 hash, it is a valid Nintendo Entertainment System ROM that, when emulated, is a playable game that displays the MD5 hash of the PDF. It is also a valid ZIP file containing, among other things, a PDF that is a git repository containing its LaTeX source code and a copy of itself.

PolyFile is free and open-source. You can download a copy at:

Parser Instrumentation

Now that we have PolyFile to provide ground truth, we need a way to propagate the semantic labels through a parser; the programmatic equivalent of using contrast dye to track blood flow in the brain during a CT scan. We therefore need an automated way to instrument a parser to track those labels, with the goal of associating functions with the byte offsets of the input files on which they operate. Since PolyFile can tell us the semantic meaning behind those offsets, this will let us infer the purpose of the parser’s functions. For example, if a function in a PDF parser always operates on bytes associated with JFIF stream objects, we can assume it is responsible for processing embedded JPEGs.

There are several existing projects to do this sort of taint tracking, using various methods. The best maintained and easiest to use are AUTOGRAM and TaintGrind. However, the former is limited to analysis on the JVM, and the latter Valgrind plugin suffers from unacceptable runtime overhead when tracking as few as several bytes at a time. For example, we ran mutool, a utility in the muPDF project, using TaintGrind over a corpus of medium sized PDFs, and in every case the tool had to be halted after over 24 hours of execution for operations that would normally complete in milliseconds without instrumentation.

At first glance, our goals might seem to be satisfied by AFL-analyze, a tool bundled with the AFL fuzzer. In a sense, our goal is in fact to create its counterpart. AFL-analyze uses fuzzing to reverse engineer a file format from a parser. In our case, we have ground truth about the file format and want to reverse-engineer the parser.

Although intended for fuzzing, Angora’s taint analysis engine has many of the features necessary to track byte indexes during execution. In fact, as is described in the following sections, we build on many of the algorithmic advances of Angora while improving both computational and memory efficiency. Angora is built atop the LLVM Data Flow Sanitizer (dfsan), which we also leverage for PolyTracker. The following section describes dfsan’s operation, limitations, and how we improved upon both dfsan and Angora.

LLVM and the Data Flow Sanitizer

We chose a static instrumentation approach built on LLVM, since this allows us to instrument any parser capable of being compiled with LLVM and eventually instrument closed-source parsers (e.g., by lifting their binaries to LLVM using Remill or McSema).

LLVM has an instrumentation tool for propagating taint called the Data Flow Sanitizer (dfsan), which is also used by Angora. However, dfsan imposes severe limitations on the total number of taints tracked during program execution, which means that, in practice, we could only track a handful of input bytes from a file at once. To see why, consider a parser that performs the following:

fd = fopen(“foo.pdf”, “rb”);
a = fgetc(fd);
b = fgetc(fd);
c = a + b;

In this case, dfsan will taint the variable a by byte offset 0 and variable b by taint offset 1. Byte c will be tainted by both byte 0 and byte 1. The combinatorial challenge here is that there are 2n possible taints, where n is the number of bytes in the input file. Therefore, the naïve approach of storing taints using a bitset will be infeasible, even for small numbers of input, and even when using a compressed bitset.

The representational problem is addressed in dfsan by storing taint provenance in a data structure it calls the “union table,” which is a computationally efficient way to store a binary forest of taint unions. Each taint gets a unique 16-bit label. Then, in the example above, where the taint of a is unioned with b to create c, dfsan would record union_table[a’s label][b’s label] = c’s label.

Thereafter, if a and b are ever unioned again, c’s taint label can be reused. This allows constant time union table checks; however, the table itself requires O(n2) storage. This is very wasteful, since the table will almost always be very sparse. It’s also what necessitates the 16-bit taint labels, since using larger labels would exponentially increase the size of the union table. This means that dfsan can only track, at most, 65,536 taints throughout execution, including all new taints that are created from unions. This is insufficient to track more than a handful of input bytes at a time.

Introducing PolyTracker: Efficient Binary Instrumentation for Universal Taint Tracking

Our novel taint tracking data structures and algorithms—as well as numerous heuristics for reducing computational overhead—are manifested in our new PolyTracker tool. It is a wrapper around clang and clang++ that allows you to instrument any executable. Simply replace your compiler with PolyTracker during your normal build process. The resulting executable will create a JSON file containing a mapping of functions to the input file byte offsets on which each function operates. Moreover, since PolyTracker is built as an LLVM pass, it can be used on any black-box binary that has been lifted to LLVM/IR, even when source code is unavailable.

We maintained dfsan’s concepts of shadow memory and its instrumentation framework for tracking taints. However, we switched away from the union table and implemented a scalable data structure capable of exploiting the inherent sparsity in taint unions. This is augmented by a binary forest of taint unions, supplanting dfsan’s label array and allowing us to increase the size of the taint labels past 16-bits. PolyTracker’s binary forest data structure uses a memory layout algorithm that eliminates the need for an Angora-style taint label to bitvector lookup table, while also providing constant time insertion. This reduces the memory requirements from exponential to linear in the size of the input file plus the number of instructions executed by the parser, at the expense of an O(n log n) graph traversal in post-processing to resolve the taints. In practice, this results in negligible execution overhead for the majority of PDFs.

PolyTracker is free and open-source. You can download a copy today at:

It’s Easy to Get Started

PolyFile can be installed with this one quick command:

pip3 install polyfile

PolyTracker requires a working build of LLVM. However, we have made this easy by providing a Docker container on DockerHub that already has everything built. Simply download a copy of the container to start instrumenting your parsers!

docker pull trailofbits/polytracker:latest
docker run -it --rm trailofbits/polytracker:latest

We have lots of features in active development, including intelligent file mutation for fuzzing and differential testing, temporal analysis of taint propagation, and automated identification of error handling routines.

We’re also excited to hear what other clever uses the community devises for the tools. Are you using PolyTracker to discover input bytes that are ignored by the parser? Do you use special taint labels to track the results of functions like strlen that are likely to correspond to field lengths? Let us know on the Empire Hacking Slack! Have an idea that you’d like to see turned into a new feature? Feel free to add a GitHub issue!


This research was developed with funding from the Defense Advanced Research Projects Agency (DARPA). The views, opinions, and/or findings expressed are those of the authors and should not be interpreted as representing the official views or policies of the Department of Defense or the U.S. Government.

One thought on “Two New Tools that Tame the Treachery of Files

  1. Pingback: A SafeDocs update | PDF Association

Leave a Reply