A walk down memory lane

Admit it. Every now and then someone does something, and you think: “I also had that idea!” You feel validated — a kindred spirit has had the same intuitions, the same insights, and even drawn the same conclusions. I was reminded of this feeling recently when I came across a paper describing how to use Intel’s hardware transactional memory to enforce controlflow integrity.

Applied accounting: Enforcing control-flow integrity with checks and balances

A while back I had the same idea, wrote some code, and never published it. When I saw the paper, I secretly, and perhaps predictably, had that negative thought: “I got there first.” That’s not productive. Let’s go back in time to when I was disabused of the notion that there are any “firsts” with ideas. Don’t worry, hardware transactional memory and control-flow integrity will show up later. For now, I will tell you the story of Granary, my first binary translator.

Granary

I am a self-described binary translator. In fact, I have monograms in my suits saying as much. I dove head-first into binary translation and instrumentation as part of my Master’s degree. My colleague Akshay and I were working with two professors at the Systems Lab at the University of Toronto (UofT). We identified a problem: most bugs in the Linux kernel were actually coming from kernel modules (extensions). Opportunity struck. A former UofT student ported DynamoRIO to work inside of the Linux kernel, and that tool could help catch kernel module bugs in the act.

The path to finding actual bugs was long and twisted. Slowly finding bugs wasn’t as cool as doing it fast, and instrumenting the whole kernel to catch bugs in modules wasn’t fast. Our solution was to instrument modules only, and let the kernel run at full speed. This was challenging and ran up against core design decisions in DynamoRIO; thus, Granary was born. Granary’s claim to fame was that it could selectively instrument only parts of the kernel, leaving the rest of the kernel to run natively.

Second place, or first to lose?

With Granary came Address Watchpoints. This was a cool technique for doing fine-grained memory access instrumentation. Address watchpoints made it possible to instrument accesses to specific allocations in memory and detect errors like buffer overflows, use-before-initialization, and use-after-frees.

Address Watchpoints worked by intercepting calls to memory allocation functions (e.g. kmalloc) and embedding a taint tracking ID into the high 15 bits of returned pointers. Granary made it possible to interpose on memory access instructions that used those pointers. It was comprehensive because tainted pointers spread like radioactive dye — pointer copies and arithmetic transparently preserved any embedded IDs.

Yet it turned out that Address Watchpoints was not novel (this is an important metric in academia). SoftBound+CETS had applied a similar technique years before. Stay positive!

Not again

Despite the lack of novelty, Address Watchpoints were practical and attacked the real problem of memory access bugs in Linux kernel modules. Granary stepped forward as the novelty, and Address Watchpoints were an application showing that Granary was useful.

I presented Address Watchpoints at HotDep in 2013, which was co-located with the SOSP conference. At the same conference, btkernel, a fast, Linux kernel-space dynamic binary translator was released. It applied many of the same techniques that made Granary novel, but beat us to a full paper publication. Darn.

Hardware transactional memory

Time to feel good again. Trail of Bits got me a sweet laptop when I joined in 2015. It had a Broadwell chip, and supported the latest x86 features like hardware transactional memory.

The concurrency tax

The stated purpose of hardware transactional memory (HTM) is to enable lock-free and highly concurrent algorithms. For example, let’s say that I want to find use-after free bugs. A solid approach to this problem is to represent a program using a points-to graph. Use-after-free bugs exist if there is a path through the program’s points-to graph that goes from a free to a use.

Scaling this approach is challenging, but my laptop has many cores and so my imaginary doppelganger can throw some concurrency at the problem and hope for the best. Consider two threads that are working together to update the points-to graph. They propagate information from node to node, figuring out what pointers point where. If they both want to update the same node at the same time, then they need to synchronize so that one thread’s updates don’t clobber the other’s.

How do we know when synchronization is actually needed? We don’t! Instead, we need to conservatively assume that every access to a particular node requires synchronization, just in case “that other thread” rudely shows up. But points-to graphs are huge; we shouldn’t need to handle the worst case every time. That’s where HTM comes in. HTM lets us take an opportunistic approach, where threads don’t bother trying to synchronize. Instead, they try to make their own changes within a transaction, and if they both do so at the same time, then their transactions abort and they fall back to doing things the old fashioned way. This works because transactions provide failure atomicity: either the transaction succeeds and the thread’s changes are committed, or the transaction aborts, and it’s as if none of the changes ever happened.

That’s not what HTM is for

Hooray, we’re concurrency experts now. But didn’t this article start off saying something about control-flow integrity (CFI)? What the heck does concurrency have to do with that? Nothing! But HTM’s failure atomicity property has to do with CFI and more.

It turns out that HTM can be applied to unorthodox problems. For example, Intel’s HTM implementation enables a side-channel attack that can be used to defeat address space layout randomization. For a time I was also looking into similar misuses of HTM and surprise, surprise, I applied it to CFI.

Parallel discovery

Two years ago I had an idea about using HTM to enforce CFI. I had a little proof-of-concept script to go along with it, and a colleague helped me whip up an LLVM instrumentation pass that did something similar. Much to my surprise, researchers from Eurecom and UCSB recently produced a similar, more fleshed out implementation of the same idea. Here’s the gist of things.

Suppose an attacker takes control of the program counter, e.g. via ROP. Before their attack can proceed, they need to pivot and make the program change direction and go down the path to evil. The path to evil is paved with good, albeit unintended instructions. What if we live in a world without consequences? What if we let the attacker go wherever they want?

In ordinary circumstances that would be an awful idea. But attackers taking control is extraordinary, kind of like two threads simultaneously operating on the same node in a massive graph. What HTM gives us is the opportunity to do the wrong thing and be forgiven. We can begin a transaction just before a function return instruction, and end the transaction at its intended destination. Think of it like cattle herding. The only valid destinations are those that end transactions. If an attacker takes control, but doesn’t end the transaction, then the hardware will eventually run out of space and abort the transaction, and the program will seize hold of its destiny.

I believed and still think that this is a cool idea. Why didn’t I follow through? The approach that I envisioned lacked practicality. First, it wasn’t good enough as described. There are perhaps better ways to herd cattle. Aligning them within fences, for example. Protecting CFI using only failure atomicity doesn’t change the game, instead it just adds an extra requirement to ROP chains. Second, hardware transactions aren’t actually that fast — they incur full memory barriers. Executing transactions all the time kills ILP and concurrency. That makes this technique out of the question for real programs like Google Chrome. Finally, Intel’s new CET instruction set for CFI makes this approach dead on arrival. CET provides special instructions that bound control flow targets in a similar way.

If a tree falls in a forest

If I had the idea, and the Eurecom/UCSB group had the idea, then I bet some unknown third party also had the idea. Maybe Martin Abadi dreamt this up one day and didn’t tell the world. Maybe this is like operating systems, or distributed systems, or really just… systems, where similar problems and solutions seem to reappear every twenty or thirty years.

Seeing that someone else had the same idea that I had was really refreshing. It made me feel good and bad all at the same time, and reminded me of a fun time a few years ago where I felt like I was doing something clever. I’m not a special snowflake, though. There are no firsts with ideas. The Eurecom/UCSB group had an idea, then they followed it through, produced an implementation, and evaluated it. That’s what counts.

One thought on “A walk down memory lane

  1. What happens when attackers or whatever parties act like “two threads simultaneously operating on the same node in a massive graph.” Is there a mutex-like lock or are they free to overwrite? To let one thread run away only to be declined at the end seems like it would still affect data that the others shared on its way.

Leave a Reply