Detecting Iterator Invalidation with CodeQL

Kevin Higgs (Montgomery Blair High School)

October 09, 2020

codeql, internship-projects, static-analysis

Iterator invalidation is a common and subtle class of C++ bugs that often leads to exploitable vulnerabilities. During my Trail of Bits internship this summer, I developed Itergator, a set of CodeQL classes and queries for analyzing and discovering iterator invalidation.

Results are easily interpretable by an auditor, providing information such as where an iterator is acquired, where it is invalidated, and a significance level that indicates the likelihood of a false positive. Itergator has been used to find bugs in real-world code, and the queries are easily extensible for further analysis.

Iterators Defined

Iterators are the standard way to traverse the contents of a container in C++. An iterator object supports at least two operations: dereferencing, to get the underlying object in the container; and incrementation, to get an iterator for the next element.

For example, the following code will output 1 2 3 4 5:

std::vector<int> vec{1, 2, 3, 4, 5};

for (std::vector<int>::iterator it = vec.begin(), end = vec.end(); it != end; ++it) {
    std::cout << *it << " ";
}

This is such a common code pattern that C++11 introduced a simplified syntax:

for (auto i : vec) {
    std::cout << i << " ";
}

While equivalent to the previous code, all iterator operations are now generated by the compiler. The details about the iteration are hidden from the developer.
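As a rough sketch of what the compiler generates for the range-based loop, the expansion looks something like the following. The helper print_all and the variable names range, first, and last are made up for illustration; the real hidden variables have no names the developer can refer to.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: writes each element followed by a space, using
// roughly the loop the compiler produces for `for (auto i : vec)`.
std::string print_all(std::vector<int>& vec) {
    std::ostringstream out;
    auto&& range = vec;          // the container is bound once
    auto first = range.begin();  // begin iterator, advanced on each step
    auto last = range.end();     // end iterator, computed only once
    for (; first != last; ++first) {
        auto i = *first;         // the loop variable is a copy of the element
        out << i << " ";
    }
    return out.str();
}
```

Because the end iterator is computed once and both iterators are hidden, a loop body that modifies vec can invalidate iterators the developer never sees.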

Iterator Invalidation

Iterators are invalidated after certain modifications to their container, such as adding or erasing an element. Use of invalidated iterators is, per the standard, undefined behavior. In other words, what happens is implementation-specific and probably not good. For example, in the following code (discovered by Itergator in Cataclysm: Dark Days Ahead) the call to zones.erase invalidates the iterator it:

void zone_manager::deserialize( JsonIn &jsin )
{
    jsin.read( zones );
    for( auto it = zones.begin(); it != zones.end(); ++it ) {
        const zone_type_id zone_type = it->get_type();
        if( !has_type( zone_type ) ) {
            zones.erase( it );
            debugmsg( "Invalid zone type: %s", zone_type.c_str() );
        }
    }
}

In libstdc++, a vector’s iterators are pointers into the vector’s backing buffer. The erase method shifts every element after the erased position left by one, overwriting the erased object, and decrements the end of the vector.
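The erase behavior described above can be modeled with a toy container. This is a simplified illustration under my own assumptions (the ToyVector type, its fixed-size buffer, and its member names are hypothetical), not libstdc++'s actual implementation.

```cpp
#include <cassert>
#include <cstddef>

// Toy stand-in for a vector of ints: a fixed buffer plus an element count.
// Illustrative model only, not the real std::vector.
struct ToyVector {
    int buffer[8];
    std::size_t size;

    int* begin() { return buffer; }
    int* end() { return buffer + size; }

    // Erase the element at `pos` (which must point at a live element):
    // shift later elements left by one, overwriting the erased one,
    // then shrink the logical end.
    void erase(int* pos) {
        for (int* p = pos; p + 1 != end(); ++p) {
            *p = *(p + 1);
        }
        --size;
    }
};
```

A pointer obtained before the call still points into the buffer afterwards, but it now refers to a different element or to the new end.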

If the vector contains only one element, erasing it makes zones.end() equal to zones.begin(). In the example above, the iterator is then incremented at the end of the first loop iteration to the address just past zones.begin(). The continuation condition it != zones.end() therefore still holds, so we enter the loop again with the iterator referencing whatever memory exists after the backing buffer on the heap! Because of the complexity of Cataclysm, the heap layout and the crash are not deterministic, but a properly modified game save frequently results in a segmentation fault from dereferencing an invalid address.
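One common way to avoid this pattern is to continue from the iterator that std::vector::erase returns, which refers to the element after the erased one. The sketch below uses a stand-in Zone type and a hypothetical remove_invalid function rather than the game's real classes, and it is not meant to describe how the project fixed the bug.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for the game's zone type.
struct Zone {
    std::string type;
    bool valid;
};

// Removes invalid zones and returns how many were dropped.
std::size_t remove_invalid(std::vector<Zone>& zones) {
    std::size_t removed = 0;
    for (auto it = zones.begin(); it != zones.end();) {
        if (!it->valid) {
            // erase returns an iterator to the element after the erased one,
            // so `it` is never used after being invalidated.
            it = zones.erase(it);
            ++removed;
        } else {
            ++it;
        }
    }
    return removed;
}
```

Note that the increment moves out of the for statement: the iterator advances either through erase or through ++it, never both.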

While this is a relatively benign example, the threat presented by this class of issues is not theoretical; iterator invalidation bugs in high-value targets have been weaponized before.

CodeQL

CodeQL is a static analysis framework developed by GitHub that allows you to query codebases with an SQL-like syntax. It has an object-oriented class system with predicates that define logical properties and relationships. The standard library provides a comprehensive set of classes which allow querying for a wide array of code properties and patterns.

A CodeQL database can be built for almost any code that compiles. GitHub maintains an index of databases they’ve built from public repositories at lgtm.com, which can be queried on their site or locally with the CodeQL CLI. There is also a Visual Studio Code extension for inspection of query results.

Itergator consists of both queries and libraries, allowing auditors to use Itergator’s classes in their own queries.

Detecting Iterator Invalidation

Using static analysis to detect iterator invalidation presents several challenges. The example above is simple, but invalidations can be nested many function calls deep, with complex logic surrounding them. Some iterators are declared and invalidated outside of a loop, resulting in flows that are expensive to detect without an enormous number of false positives. It is also important for the query to be extensible: codebases often have their own iterable types with invalidation constraints that need to be detected.
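To make the "nested" case concrete, here is a minimal sketch with hypothetical helpers record and process. The loop never mentions push_back, yet a call two levels down can grow the vector being traversed. The sketch deliberately uses indices and a capacity check so that it runs safely; an iterator-based loop over the same calls would be left holding an invalidated iterator.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helpers: the push_back is two calls away from the loop.
void record(std::vector<int>& items, int value) { items.push_back(value); }

void process(std::vector<int>& items, int value) {
    if (value % 2 == 0) {
        record(items, value);
    }
}

// Returns true if the calls made while traversing `items` reallocated it.
// A capacity change implies reallocation, which invalidates all iterators.
bool traversal_reallocates(std::vector<int>& items) {
    const std::size_t capacity_before = items.capacity();
    const std::size_t count = items.size();
    for (std::size_t i = 0; i < count; ++i) {
        const int value = items[i];  // copy before the call may reallocate
        process(items, value);
    }
    return items.capacity() != capacity_before;
}
```

Whether the invalidation actually happens depends on runtime state (here, the values and the spare capacity), which is part of why a static query has to reason about what may happen.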

CodeQL’s global data flow library and support for recursion make complex control flow analyses easy to write. Itergator is able to construct a graph of all potentially invalidating function calls (those that may result in a call to an invalidating function, like std::vector::push_back) and define classes to be used in queries:

Iterator: a variable that stores an iterator
Iterated: where a collection is iterated, e.g. vec in vec.begin()
Invalidator: a potentially invalidating function call in the scope of an iterator
Invalidation: a function call that directly invalidates an iterator

The InvalidationFlows query relates these classes with data flow to locate likely invalidations. To query non-standard iterated types, you simply extend the PotentialInvalidation class which, as an abstract class, is defined as the union of its subclasses. For example, here is an invalidation definition for destructors:

class PotentialInvalidationDestructor extends PotentialInvalidation {
    PotentialInvalidationDestructor() {
        this instanceof MemberFunction
        and this.getName().matches("~%")
    }

    override predicate invalidates(Iterated i) {
        i.getType().refersTo(this.getParentScope())
    }
}

These subclasses can be defined anywhere in your query or an imported library; definitions for STL classes are already included in Itergator. A utility query in Itergator, IteratedTypes, identifies what types to specify invalidation constraints for.

A large part of Itergator’s development required finding fixed iterator invalidation bugs on GitHub and attempting to reproduce them. One especially tricky bug in a regular expression library by Google exemplifies the challenges of this project:

struct Frame {
  Frame(Regexp** sub, int nsub)
      : sub(sub),
        nsub(nsub),
        round(0) {}

  Regexp** sub;
  int nsub;
  int round;
  std::vector<Splice> splices;
  std::vector<Splice>::iterator spliceiter;
};

int Regexp::FactorAlternation(Regexp** sub, int nsub, ParseFlags flags) {
  std::vector<Frame> stk;
  stk.emplace_back(sub, nsub);

  for (;;) {
    ...
    auto& splices = stk.back().splices;
    auto& spliceiter = stk.back().spliceiter;

    if (splices.empty()) {
      round++;
    } else if (spliceiter != splices.end()) {
      stk.emplace_back(spliceiter->sub, spliceiter->nsub);
      continue;
    } else { ... }

    switch (round) { ... }

    if (splices.empty() || round == 3) {
      spliceiter = splices.end();
    } else {
      spliceiter = splices.begin();
    }
  }
}

This function declares stk, a vector of frames, each of which has a splices vector and a spliceiter iterator. The iterator begins uninitialized, and is only assigned a value at the end of the first iteration of the loop (the final if/else, which sets spliceiter to splices.end() or splices.begin()). It’s not obvious where the invalidation occurs; it’s not an operation on splices directly, but the element added to stk by the emplace_back call inside the loop. If the backing buffer of stk is at capacity, it is reallocated and the Frame objects are copied, resulting in re-allocation of each splices vector. Because of the continue statement, spliceiter is never re-initialized, and an invalidated iterator is used on the next loop iteration.
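One general way to make a stored position survive this kind of reallocation is to remember an index instead of an iterator, since an index is resolved against whichever vector the frame currently owns. The sketch below uses a hypothetical, simplified IndexedFrame type with invented member names; it illustrates the idea and is not presented as the library's actual fix.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical simplified frame: remembers a position by index, not iterator.
struct IndexedFrame {
    std::vector<int> splices;
    std::size_t splice_index;

    // Looks the element up through the vector this particular frame owns.
    int current() const { return splices[splice_index]; }
};

// Forces the outer vector to reallocate, then reads through the index.
int read_after_growth() {
    std::vector<IndexedFrame> stk;
    stk.push_back(IndexedFrame{{10, 20, 30}, 1});
    const std::size_t capacity_before = stk.capacity();
    while (stk.capacity() == capacity_before) {
        // Growing past capacity relocates the first frame to a new buffer.
        stk.push_back(IndexedFrame{{}, 0});
    }
    return stk.front().current();
}
```

The trade-off is that an index does not protect against insertions or erasures in splices itself; it only removes the dependence on where the vector's storage lives.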

This invalidation happens over three iterations of the loop: first initialization of the iterator, then invalidation, and finally, usage. The invalidating function call is performed on a member of an object stored inside a vector; confirming that this is the same vector the iterator refers to would be extremely complicated. Tracking control flow across all three executions is possible but expensive, and the query becomes impractical to run on large codebases.

My solution to these problems was to search for conditions necessary, but not sufficient, for invalidation. For example, I verified that the same variable—not value—can flow to both locations of iteration and invalidation. While this introduces a significant number of false positives, automatic filtering based on recurring patterns and the addition of a “significance” value makes searching through the results very manageable, while still identifying complex invalidations like the one above.

CodeQL’s caching and optimization also mean Itergator can query massive codebases, like Apple’s fork of LLVM for Swift, and find deep invalidations. Itergator identified the following bug, which was unintentionally fixed upstream a couple months ago, where the invalidation is 12 function calls deep. InvalidationFlows gives us the iteration and invalidation locations; then, after further investigation, including a customized path query, we can identify the necessary control flow:

And then we can construct a reproduction:

Step 1: Run the LLVM linker with undefined symbols and lazily loaded bitcode that references a linker script.

./ld.lld --no-threads --undefined asdf --undefined fawef --start-lib ~/lib.bc --end-lib ~/a.o

Steps 2 through 13:

handleUndefined
lld::elf::Symbol::fetch() const
lld::elf::LazyObjFile::fetch()
lld::elf::parseFile(lld::elf::InputFile*)
doParseFile<llvm::object::ELFType<1, true> >
void lld::elf::BitcodeFile::parse<llvm::object::ELFType<1, true> >()
addDependentLibrary
lld::elf::LinkerDriver::addFile(llvm::StringRef, bool)
lld::elf::readLinkerScript(llvm::MemoryBufferRef)
readLinkerScript
readExtern
std::vector<llvm::StringRef>::push_back(llvm::StringRef&&)

Step 14: Profit.

==332005==ERROR: AddressSanitizer: heap-use-after-free on address 0x603000004160 at pc 0x556f81e36288 bp 0x7ffd14c663f0 sp 0x7ffd14c663e0
READ of size 16 at 0x603000004160 thread T0
    #0 0x556f81e36287 in void lld::elf::LinkerDriver::link<llvm::object::ELFType<(llvm::support::endianness)1, true> >(llvm::opt::InputArgList&) (/home/kmh/llvm-project/build/bin/lld+0x1d53287)
    #1 0x556f81ddaa10 in lld::elf::LinkerDriver::main(llvm::ArrayRef<char const*>) /home/kmh/llvm-project/lld/ELF/Driver.cpp:514
    #2 0x556f81ddc3b6 in lld::elf::link(llvm::ArrayRef<char const*>, bool, llvm::raw_ostream&, llvm::raw_ostream&) /home/kmh/llvm-project/lld/ELF/Driver.cpp:111
    #3 0x556f8186cda8 in main /home/kmh/llvm-project/lld/tools/lld/lld.cpp:154
    #4 0x7f1f70d1b151 in __libc_start_main (/usr/lib/libc.so.6+0x28151)
    #5 0x556f8186861d in _start (/home/kmh/llvm-project/build/bin/lld+0x178561d)

Conclusion

Itergator is a powerful tool for detecting complex iterator invalidations in codebases of any size. Working with CodeQL’s declarative query language was awesome, despite the occasional engine bug, as it incorporated concepts I was already familiar with to make static analysis easy to pick up. There will always be improvements to make and more bugs to hunt, but I’m very happy with my results.

Finally, I’d like to thank my mentor Josh and everyone else at Trail of Bits who made this summer great. I can definitively say that Trail of Bits is the best place I’ve ever worked! If you have any questions, or just want to talk, shoot me a message on Twitter @themalwareman.